# Exploring Hacker News Posts

The goal here is to explore a dataset of Hacker News posts and answer two questions:

Do Ask HN or Show HN posts receive more comments on average?

Do posts created at a certain time of day receive more comments on average?


In [10]:
from csv import reader

In [11]:
with open("hacker_news.csv", newline="") as f:  # the file is closed automatically
    hn = list(reader(f))  # read every row into a list of lists

In [12]:
print(hn[:5]) #explore the dataset

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [14]:
header = hn[0] # assign first row as the header


In [15]:
print(header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [6]:
hn = hn[1:] # remove the header row from the data

In [7]:
print(hn[:4])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
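
The cells that follow read columns by position (`row[1]` for the title, `row[4]` for the comment count). As a minimal sketch, assuming the header row shown above, the positions can instead be looked up by name. The names `col` and `sample_row` are hypothetical and are not part of the notebook.

```python
# Header and one row copied from the output above.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position in a row (hypothetical helper).
col = {name: index for index, name in enumerate(header)}

title = sample_row[col['title']]
num_comments = int(sample_row[col['num_comments']])  # values are read as strings
print(title, num_comments)
```

This keeps the analysis readable if the column order ever changes, at the cost of one extra dictionary.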


In [8]:
# We now want to extract the Ask HN and Show HN posts. We first convert
# each title to lowercase and then check whether it starts with
# 'ask hn' or 'show hn'.
# For example, a title such as 'Ask HN: I am...' becomes 'ask hn: i am...',
# so the check works regardless of capitalization.

ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts)) 
print(len(other_posts))        

1744
1162
17194
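
The prefix check above can be sketched as a small standalone function, which makes the case-insensitive behavior easy to try on individual titles. The function name `classify_title` and all example titles except 'Interactive Dynamic Video' are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles used only to exercise each branch.
examples = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',  # 'ask hn' is not at the start, so this is 'other'
]
for example in examples:
    print(classify_title(example), '<-', example)
```

Only the beginning of the title matters, so a title that mentions "ask hn" later on still counts as an other post.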


Next, let's determine whether ask posts or show posts receive more comments on average.

In [16]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)


total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)



14.038417431192661
10.31669535283993
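
The two loops above differ only in the list they read, so the calculation could be factored into one function. This is a sketch, assuming the comment count sits at index 4 as in this dataset; `average_comments` is a hypothetical name, and the sample rows are trimmed copies of the first four rows printed earlier (they are not ask or show posts).

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments across rows, or 0.0 for an empty list."""
    if not rows:
        return 0.0  # avoid dividing by zero
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# id, title, url (blanked), num_points, num_comments
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52'],
    ['10975351', 'How to Use Open Source', '', '39', '10'],
    ['11964716', 'Florida DJs May Face Felony', '', '2', '1'],
    ['11919867', 'Technology ventures', '', '3', '1'],
]
print(average_comments(sample_rows))
```

With the notebook's lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.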


It seems that, on average, ask posts receive more comments (about 14.04 per post versus 10.32 for show posts), although the difference is not dramatic.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments
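
Grouping posts by time depends on turning a `created_at` string into an hour. Here is a minimal sketch of that one step, using a timestamp from the rows printed earlier. The names `date_format`, `hour_text`, and `hour_number` are chosen for illustration.

```python
import datetime as dt

# Format of the created_at column, e.g. '8/4/2016 11:52'.
date_format = '%m/%d/%Y %H:%M'

created_at = '8/4/2016 11:52'
parsed = dt.datetime.strptime(created_at, date_format)

hour_text = parsed.strftime('%H')  # zero-padded string, useful as a dictionary key
hour_number = parsed.hour          # the same hour as an integer

print(hour_text, hour_number)
```

The string form keeps a leading zero for early hours, which is why hours such as '01' and '09' appear as dictionary keys later on.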

In [25]:
# For this we create dictionaries keyed by the hour each post was created,
# then compute the average number of comments for each hour.

# First, build a list of [created_at, num_comments] pairs.

import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# Now we build two dictionaries:
# 1) hour created -> number of posts created during that hour
# 2) hour created -> total number of comments on posts created during that hour
counts_by_hour = {}
comments_by_hour = {}

# Convert each created_at string into a datetime and extract only the hour.
# The dates in the dataset look like '8/4/2016 11:52'.
datetemplate = '%m/%d/%Y %H:%M'
for row in result_list:
    date_time = dt.datetime.strptime(row[0], datetemplate)
    hour = date_time.strftime('%H')

    # Now that we have the hour, update both dictionaries together.
    # If the hour is not yet a key, add it with a count of 1 in
    # counts_by_hour and this row's comment count in comments_by_hour.
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
    
    




Next, we calculate the average number of comments per post for posts created during each hour of the day.

The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post.

In [28]:
avg_by_hour = []
for hour in comments_by_hour: # iterate over each key in the dictionary
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    

In [30]:
print(avg_by_hour) # hours are in 24-hour format

[['14', 13.233644859813085], ['11', 11.051724137931034], ['12', 9.41095890410959], ['01', 11.383333333333333], ['06', 9.022727272727273], ['08', 10.25], ['18', 13.20183486238532], ['16', 16.796296296296298], ['00', 8.127272727272727], ['05', 10.08695652173913], ['15', 38.5948275862069], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059], ['13', 14.741176470588234], ['19', 10.8], ['04', 7.170212765957447], ['07', 7.852941176470588], ['03', 7.796296296296297], ['17', 11.46], ['20', 21.525], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['02', 23.810344827586206]]
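
The bookkeeping with two dictionaries and an `if`/`else` can also be written with `collections.defaultdict`, which supplies a starting value of 0 for any hour not seen yet. This is an alternative sketch, not the notebook's method, and the `sample` pairs are illustrative values rather than real ask posts.

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# Illustrative (created_at, num_comments) pairs.
sample = [
    ('8/4/2016 11:52', 52),
    ('8/5/2016 11:10', 8),
    ('1/26/2016 19:30', 10),
]

counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# Average comments per post for each hour.
avg_by_hour = [[hour, comments_by_hour[hour] / counts_by_hour[hour]]
               for hour in comments_by_hour]
print(avg_by_hour)
```

Because missing keys default to 0, both dictionaries can be updated with a plain `+=` and no membership test.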


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [42]:
swap_avg_by_hour = []
for row in avg_by_hour:
    new_row = [row[1],row[0]]
    swap_avg_by_hour.append(new_row)
    
# Now that the average is the first value in each row, sort the list of lists.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
# Sorted in descending order.


print("Top 5 Hours for Ask Posts Comments")
print('\n')
for row in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(dt.datetime.strptime(row[1],"%H").strftime("%H:%M"),row[0]))


Top 5 Hours for Ask Posts Comments


15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
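
The swap step exists only so that `sorted` compares the averages first. As a sketch of an alternative, the `key` argument of `sorted` can point at the average directly, so the original `[hour, average]` order is kept. The list below is a four-row excerpt of the `avg_by_hour` output shown earlier.

```python
# Excerpt of avg_by_hour from the output above: [hour, average comments].
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```

Since the hour strings are already zero-padded, appending ':00' gives the same label as the `strptime`/`strftime` round trip used above.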


So we now know that Ask HN posts created at 15:00 (3 pm) average the most comments, at about 38.59 per post.