# Hacker News Project

In this project I am going to look at posts on the Hacker News website from 2016.  I'm interested in if there is a relation between the time a post is created and the number of comments the post received, and I'm also interested in if the Ask HN or Show HN posts receive more comments.

I'm going to first change the spreadsheet data into a list of lists and to also remove the data about any posts that did not recieve comments.

In [1]:
from csv import reader
open_file = open('HN_posts_year_to_Sep_26_2016.csv')
read_file = reader(open_file)
hn = list(read_file)

print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]


In [3]:
hn_1 = []
for row in hn:
    comments = row[4]
    if comments != '0':
        hn_1.append(row)
        
print("Length of Original Dataset", len(hn))
print("Length of New Dataset", len(hn_1))
print(hn_1[:4])

Length of Original Dataset 293120
Length of New Dataset 80402
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']]


I am now going to separate the header row from all the data rows:

In [4]:
header = hn_1[0]
hn_1 = hn_1[1:]
print(header)
print(hn_1[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'], ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']]


In this next step I am going to look at how many posts started with Ask HN vs Show HN:

In [26]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_1:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('All other posts:', len(other_posts))

Number of Ask HN posts: 6911
Number of Show HN posts: 5059
All other posts: 68431


Now I will both types of posts and see which on average received the most comments:

In [33]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average # of comments for Ask HN posts', avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print('Average # of comments for Show HN posts', avg_show_comments)

Average # of comments for Ask HN posts 13.744175951381855
Average # of comments for Show HN posts 9.810832180272781


From the output of my code above I see that in general Ask HN posts receive more comments per post.  I will now focus the rest of my analysis on those posts.

# Looking at Time Data
Now I will look at if Ask HN posts created at a certain time were more likely to attract comments

In [46]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    date_time = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

Above I created two dictionaries with the hours as keys and the values as either the number of posts for that hour or the number of comments for that hour.
Now I am going to use those two dictionaries to calculate for each hour the average number of comments a post receives:

In [60]:
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
print(avg_by_hour)

[['02', 13.171052631578947], ['01', 9.339285714285714], ['22', 11.71875], ['21', 11.034313725490197], ['19', 9.394299287410927], ['17', 13.703703703703704], ['15', 39.611111111111114], ['14', 13.12664907651715], ['13', 22.162079510703364], ['11', 11.107142857142858], ['10', 13.722727272727273], ['09', 8.892655367231638], ['07', 10.056962025316455], ['03', 10.11737089201878], ['16', 10.747596153846153], ['08', 12.403141361256544], ['00', 9.823275862068966], ['23', 8.31407942238267], ['20', 11.358778625954198], ['18', 10.792494481236202], ['12', 15.418181818181818], ['04', 12.631016042780749], ['06', 8.971751412429379], ['05', 11.090361445783133]]


I would like to reformat the list printed aobve for better readability, and I am only going to look at the top 5 hours for Ask Posts Comments.

In [58]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours (EST) For Ask Posts Comments")
for row in sorted_swap[:5]:
    time = dt.datetime.strptime(row[1], "%H")
    time_string = time.strftime("%H:%M")
    comment = row[0]
    print(time_string+" {:.2f} average comments per post".format(comment))

Top 5 Hours (EST) For Ask Posts Comments
15:00 39.61 average comments per post
13:00 22.16 average comments per post
12:00 15.42 average comments per post
10:00 13.72 average comments per post
17:00 13.70 average comments per post
