
## Exploring Hacker News Posts
We compare two types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where technology-related stories (or 'posts') are voted and commented on. The two types we'll explore are posts whose titles begin with either Ask HN or Show HN.

My focus in this project is on:

- Lists and for loops
- The datetime module
- Data cleaning

[Dataset source](https://www.kaggle.com/hacker-news/hacker-news-posts/home)

# Introduction

In [1]:
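from csv import reader

# Read the whole CSV into a list of rows, closing the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # first row holds the column names
hn = hn[1:]      # remaining rows are the posts
hn[:5]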

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN posts

In [5]:
ask_posts, show_posts, other_posts = [], [], []

In [6]:
for post in hn:
    title = post[1].lower()  # the title is the second column (index 1)

    if title.startswith('ask hn'):
        ask_posts.append(post)  # append the whole row
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)


In [8]:
print('There are {:,d} Ask HN posts'.format(len(ask_posts)))
print('There are {:,d} Show HN posts'.format(len(show_posts)))
print('There are {:,d} other posts'.format(len(other_posts)))

There are 1,744 Ask HN posts
There are 1,162 Show HN posts
There are 17,194 other posts


## Calculating the average number of comments for Ask HN and Show HN posts

In [10]:
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])  # index 4 holds the number of comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average comments on Ask HN posts: {:,.2f} comments per post'.format(avg_ask_comments))

Average comments on Ask HN posts: 14.04 comments per post


In [12]:
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])  # index 4 holds the number of comments

avg_show_comments = total_show_comments / len(show_posts)

print('Average comments on Show HN posts: {:,.2f} comments per post'.format(avg_show_comments))

Average comments on Show HN posts: 10.32 comments per post


On average, ask posts in our sample receive approximately 14 comments, whereas show posts receive approximately 10. Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
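
The two averaging cells above repeat the same loop. A minimal sketch of a shared helper, assuming the comment count sits at index 4 of each row as in those cells. The function name `average_comments` and the sample rows are hypothetical and only serve as illustration:

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column layout as the dataset (not real data).
sample_posts = [
    ['1', 'Ask HN: Example question one?', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Example question two?', '', '3', '10', 'user_b', '11/22/2015 13:43'],
]

print('{:,.2f} comments per post'.format(average_comments(sample_posts)))
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same figures as the loops above, and the empty-list guard avoids a division by zero if one category has no posts.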

## Finding the number of posts and comments by hour for Ask HN posts

In [13]:
import datetime as dt
result_list = []
for post in ask_posts:
    # keep only the fields we need: creation time (index 6) and comment count (index 4)
    result_list.append([post[6], int(post[4])])

result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [18]:
count_by_hour, comments_by_hour = dict(), dict()
for created_at, comments in result_list:
    date = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')

    if hour not in count_by_hour:
        count_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        count_by_hour[hour] += 1
        comments_by_hour[hour] += comments

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [21]:
# calculate the average number of comments per post for each hour

avg_by_hour=[]

for i in comments_by_hour:
    avg_by_hour.append([i,comments_by_hour[i]/count_by_hour[i]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
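
The per-hour counting and averaging above could also be written without the explicit `if`/`else` branch. A minimal sketch using `collections.defaultdict`, assuming input in the same `[created_at, comments]` shape as `result_list`. The sample data and the names `counts`, `totals`, and `sample_avg_by_hour` are made up for illustration:

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, comments] pairs in the same shape as result_list (not real data).
sample_results = [
    ['8/16/2016 9:55', 6],
    ['8/17/2016 9:10', 4],
    ['11/22/2015 13:43', 29],
]

counts = defaultdict(int)  # number of posts created in each hour
totals = defaultdict(int)  # total comments on posts created in each hour

for created_at, comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1      # missing keys start at 0, so no membership check is needed
    totals[hour] += comments

# Average comments per post for each hour, in first-seen order.
sample_avg_by_hour = [[hour, totals[hour] / counts[hour]] for hour in totals]
print(sample_avg_by_hour)
```

The two posts created in the 09 hour average to 5.0 comments, and the single post in the 13 hour keeps its own count of 29.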

## Making the output prettier

In [22]:
swap_avg_by_hour =[]

for i in avg_by_hour:
    swap_avg_by_hour.append([i[1],i[0]])
                             
print(swap_avg_by_hour)


[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [23]:
sorted_swap = sorted(swap_avg_by_hour,reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
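
Swapping the columns is one way to sort by the average. As an alternative, `sorted` accepts a `key` function, so the original `[hour, average]` pairs can be ranked directly. A minimal sketch on made-up values (the names `sample_avg` and `ranked` are hypothetical):

```python
# Made-up [hour, average] pairs in the same shape as avg_by_hour (not real data).
sample_avg = [['09', 5.0], ['15', 30.0], ['13', 12.5]]

# Sort by the average (index 1), highest first, without building a swapped list.
ranked = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

One difference to keep in mind: with a `key`, ties on the average keep their original order, whereas sorting the swapped lists breaks ties by the hour string.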

In [24]:
print("Top 5 Hours for Ask HN Post Comments")

for avg, hr in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))

Top 5 Hours for Ask HN Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Based on our analysis, to maximize the number of comments a post receives, we'd recommend making it an Ask HN post and creating it between 15:00 and 16:00 (3:00 pm to 4:00 pm EST).
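
Readers in other time zones may want to translate that window. A minimal sketch, assuming the dataset timestamps are in US Eastern Standard Time as the recommendation above states. It uses a fixed UTC-5 offset, so daylight saving time is ignored, and the date is arbitrary since only the hour matters:

```python
import datetime as dt

# Assumption: dataset timestamps are US Eastern Standard Time (fixed UTC-5, no DST handling).
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

# An arbitrary example date; only the hour is of interest here.
best_start = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)

# Convert the start of the recommended window to UTC.
best_start_utc = best_start.astimezone(dt.timezone.utc)
print(best_start_utc.strftime('%H:%M UTC'))
```

Under that assumption, the 15:00 EST start of the window corresponds to 20:00 UTC. Passing a different `dt.timezone` to `astimezone` converts to another fixed offset.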