# Analysing Hacker News Posts
We will compare two kinds of posts: Ask HN posts, which ask the community a question, and Show HN posts, which show the community something such as a project or product.
We'll determine the following:
* Does Ask HN or Show HN receive more comments on average?
* Do posts created at a certain time receive more comments on average?

We'll start by importing the libraries we need and reading the data set as a list of lists. Then we'll partition the Ask HN and Show HN posts into separate lists and compute the average number of comments per post in each partition:

In [1]:
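from csv import reader

# Read the data set into a list of lists; the with block closes the file for us.
with open('datasets/hacker_news.csv', encoding='utf-8') as open_file:
    hn = list(reader(open_file))

# Separate the header row from the data rows.
header = hn[0]
hn = hn[1:]
print(header)
print(hn[:5])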

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [2]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    # Lowercase the title so the prefix check is case-insensitive.
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Sanity check: the three partitions should add up to the full data set.
print(len(ask_posts) + len(show_posts) + len(other_posts))
print(len(hn))

293119
293119
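
The prefix check used for the partition can also be sketched as a small standalone function, which makes it easy to try on a few titles. The function name `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical sample titles, not taken from the data set.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'algorithmic music',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that `startswith` only matches titles that begin with the prefix, so a title that mentions "Ask HN" later in the text falls into the other category.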


In [3]:
total_ask_comments = 0 
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0 
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)

print(len(ask_posts), len(show_posts))
print(avg_ask_comments)
print(avg_show_comments)

9139 10158
10.393478498741656
4.886099625910612


The average number of comments on ask posts (10.39) is about 5.5 higher than the average on show posts (4.89). The difference in size between the two lists is small (9,139 ask posts versus 10,158 show posts), and we assume that outliers in both lists are averaged out by a large enough dataset.
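
One way to test the assumption about outliers is to compare the mean with the median, since the median is far less sensitive to a handful of very popular posts. The following is a minimal sketch on made-up numbers; `sample_comments` is hypothetical and does not come from the data set.

```python
from statistics import mean, median

# Hypothetical comment counts: mostly small values plus one very popular post.
sample_comments = [0, 1, 2, 2, 3, 4, 250]

sample_mean = mean(sample_comments)
sample_median = median(sample_comments)

# A mean far above the median suggests a few large values dominate the average.
print(sample_mean, sample_median)
```

On the real data, the same comparison could be run with `median([int(row[4]) for row in ask_posts])` and the equivalent expression for `show_posts`.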

# Counting Ask HN Posts and Comments by Hour Created and Calculating the Average Comments per Post by Hour
We'll first count the ask posts created during each hour of the day and store the counts in a dictionary, and do the same for the number of comments those posts received. Then we'll calculate the average number of comments per ask post for each hour:

In [4]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# Tally posts and comments for each hour of the day ('00' to '23').
counts_by_hour = {}
comments_by_hour = {}
for created_at, comment in result_list:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

print(counts_by_hour)
print('\n')
print(comments_by_hour)

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [5]:
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])

print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
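
The counting and averaging steps above can also be sketched with `collections.defaultdict`, which removes the need to check whether an hour has already been seen. This is an alternative sketch, not the notebook's method, and `sample_posts` is a hypothetical stand-in for the real rows.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the data set's timestamp format.
sample_posts = [
    ('9/26/2016 3:26', 4),
    ('9/26/2016 3:05', 2),
    ('8/16/2016 15:10', 30),
]

# Missing keys start at 0, so no membership check is needed.
counts = defaultdict(int)
comments = defaultdict(int)
for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for every hour that appears in the sample.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```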


# Sorting the List of Lists and Printing the Five Highest Values

In [6]:
swap_avg_by_hour = []
for hr in avg_by_hour:
    swap = [hr[1],hr[0]]
    swap_avg_by_hour.append(swap)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post'.format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))
    

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
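
Swapping the columns works because `sorted` compares lists element by element. An alternative is to pass a `key` function and sort on the average directly, as in this small sketch; `sample_avg_by_hour` is hypothetical data in the same shape as `avg_by_hour`.

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour.
sample_avg_by_hour = [['02', 11.1], ['15', 28.7], ['13', 16.3], ['09', 6.7]]

# Sort by the average (index 1), highest first, without swapping columns.
top_two = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:2]
for hr, avg in top_two:
    print('{}:00: {:.2f} average comments per post'.format(hr, avg))
```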


So the highest average is at 15:00, with 28.68 comments per post. Note that this is in the Eastern time zone, which corresponds to 20:00 GMT (19:00 GMT while daylight saving time is in effect). The second-highest hour, 13:00, averages 16.32 comments per post, only about 57% of the 15:00 figure.
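
The time zone conversion and the comparison of the top two hours can be checked with a few lines of standard-library code. This sketch assumes, as stated above, that the timestamps are in Eastern time, and it uses fixed offsets (UTC-5 for standard time, UTC-4 for daylight time) rather than a time zone database. The dates are arbitrary and chosen only for illustration.

```python
import datetime as dt

# Fixed UTC offsets for Eastern time: UTC-5 (standard) and UTC-4 (daylight).
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')

# Arbitrary illustrative dates; only the 15:00 wall-clock time matters.
winter = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)
summer = dt.datetime(2016, 7, 15, 15, 0, tzinfo=EDT)

winter_utc = winter.astimezone(dt.timezone.utc).strftime('%H:%M')
summer_utc = summer.astimezone(dt.timezone.utc).strftime('%H:%M')
print(winter_utc, summer_utc)

# Ratio of the second-highest average to the highest, using the printed values.
ratio = 16.32 / 28.68
print(round(ratio, 2))
```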

# Conclusion
We've analysed ask posts and show posts to determine which type of post, and which posting time, receives the most comments on average. Ask HN posts receive more comments on average than Show HN posts, and the best time to post an Ask HN question to maximise the number of comments is between 15:00 and 16:00 Eastern Time.