# Hacker News Project 

The goal of this project is to analyze a dataset of Hacker News submissions, focusing on "Ask HN" and "Show HN" posts, to answer two questions:

1. Do "Ask HN" or "Show HN" posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [1]:
# Import dataset and separate header row from the data.
from csv import reader

# Use a context manager so the file is closed after reading.
with open('hacker_news.csv') as opened:
    hn = list(reader(opened))

headers = hn[0]
hn = hn[1:]

# Examine dataset
print(headers, '\n')
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
# Filter Ask HN, Show HN, and Other posts into separate lists
ask_posts, show_posts, other_posts = [], [], []

for row in hn:
    title = row[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Ask posts: {}'.format(len(ask_posts)))
print('Show posts: {}'.format(len(show_posts)))
print('Other posts: {}'.format(len(other_posts)))

Ask posts: 1744
Show posts: 1162
Other posts: 17194


In [3]:
# Compute the total and average number of comments on Ask HN posts

total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments per Ask HN post: {}'.format(avg_ask_comments))

Average number of comments per Ask HN post: 14.038417431192661


In [4]:
# Compute the total and average number of comments on Show HN posts

total_show_comments = 0
    
for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)
    
print('Average number of comments per Show HN post: {}'.format(avg_show_comments))

Average number of comments per Show HN post: 10.31669535283993


Ask HN posts receive more comments on average: about 14.04 per post, compared with about 10.32 for Show HN posts. This suggests that Ask HN posts attract more discussion, so the rest of the analysis focuses on them.
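
The two averaging cells above repeat the same loop. As a minimal sketch, the calculation could be factored into one helper. The name `average_comments` and the sample rows are hypothetical, introduced here only for illustration, and the function assumes the comment count sits at index 4, as in the header row printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows.

    Assumes each row stores the comment count as a string at
    `comments_index` (index 4 in this dataset's header order).
    """
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical sample rows in the dataset's column order, for illustration only.
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))  # prints 7.0
```

With the real data, the same helper could be called once for `ask_posts` and once for `show_posts`.
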
***

Next, I check whether Ask HN posts created at certain hours of the day receive more comments than others. To do that, I calculate:

1. The number of Ask HN posts created during each hour of the day.
2. The total number of comments those posts received.
3. The average number of comments per post for each hour.
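
All three steps depend on pulling the hour out of the `created_at` string. As a small sketch of that parsing step, the block below uses two timestamps from the sample rows printed earlier; `hour_of` and `DATE_FORMAT` are hypothetical names used only for this illustration.

```python
import datetime as dt

DATE_FORMAT = '%m/%d/%Y %H:%M'  # matches values such as '8/4/2016 11:52'


def hour_of(created_at):
    """Return the two-digit hour string ('00' to '23') of a created_at value."""
    return dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')


print(hour_of('8/4/2016 11:52'))  # prints 11
print(hour_of('6/17/2016 0:01'))  # prints 00
```

Formatting the hour with `'%H'` zero-pads it, which keeps the dictionary keys a consistent width.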

In [28]:
# Create a list of lists containing the time each post was created and its number of comments.

import datetime as dt
result_list = []

for post in ask_posts:
    post_created = post[6]
    comment_count = int(post[4])
    result_list.append([post_created, comment_count])

counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

for each_row in result_list:
    comment_count = each_row[1]
    hour = dt.datetime.strptime(each_row[0], date_format).strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_count
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_count
        
print('Ask posts created per hour: {}'.format(counts_by_hour), '\n')
print('Total Ask post comments per hour: {}'.format(comments_by_hour))

Ask posts created per hour: {'13': 85, '07': 34, '17': 100, '03': 54, '09': 45, '20': 80, '01': 60, '15': 116, '19': 110, '10': 59, '18': 109, '02': 58, '04': 47, '05': 46, '12': 73, '16': 108, '23': 68, '11': 58, '14': 107, '21': 109, '06': 44, '22': 71, '00': 55, '08': 48} 

Total Ask post comments per hour: {'13': 1253, '07': 267, '17': 1146, '03': 421, '09': 251, '20': 1722, '01': 683, '15': 4477, '19': 1188, '10': 793, '18': 1439, '02': 1381, '04': 337, '05': 464, '12': 687, '16': 1814, '23': 543, '11': 641, '14': 1416, '21': 1745, '06': 397, '22': 479, '00': 447, '08': 492}


***
With the post counts and comment totals per hour in hand, the next step is to calculate the average number of comments per post for each hour.

In [32]:
# Calculate the average number of comments per post for each hour.
avg_comments_per_hour = []

for hour in comments_by_hour:
    avg_comments_per_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_comments_per_hour

[['13', 14.741176470588234],
 ['07', 7.852941176470588],
 ['17', 11.46],
 ['03', 7.796296296296297],
 ['09', 5.5777777777777775],
 ['20', 21.525],
 ['01', 11.383333333333333],
 ['15', 38.5948275862069],
 ['19', 10.8],
 ['10', 13.440677966101696],
 ['18', 13.20183486238532],
 ['02', 23.810344827586206],
 ['04', 7.170212765957447],
 ['05', 10.08695652173913],
 ['12', 9.41095890410959],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['11', 11.051724137931034],
 ['14', 13.233644859813085],
 ['21', 16.009174311926607],
 ['06', 9.022727272727273],
 ['22', 6.746478873239437],
 ['00', 8.127272727272727],
 ['08', 10.25]]
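
As a quick spot check of the arithmetic, the averages for two of the hours can be recomputed directly from the totals and counts printed earlier. This is only a sketch: the `*_sample` dictionaries are hypothetical names holding values copied from that output.

```python
# Totals and counts copied from the earlier output for two of the hours.
comments_by_hour_sample = {'15': 4477, '02': 1381}
counts_by_hour_sample = {'15': 116, '02': 58}

# Average comments per post = total comments / number of posts, per hour.
avg_sample = {
    hour: comments_by_hour_sample[hour] / counts_by_hour_sample[hour]
    for hour in comments_by_hour_sample
}

for hour, avg in avg_sample.items():
    print('{}: {:.2f}'.format(hour, avg))
```

The results agree with the `'15'` and `'02'` entries in the list above.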

In [43]:
# Swap the columns so the list can be sorted by average.
swap_avg_per_hour = []

for row in avg_comments_per_hour:
    swap_avg_per_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_per_hour, reverse=True)
print('Top 5 hours for Ask Post Comments:','\n')

for avg, hour in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post.".format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg))

Top 5 hours for Ask Post Comments: 

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
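
The column swap is one way to sort by average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs could be ordered without building a second list. The sample list below is a hypothetical subset of the averages computed earlier.

```python
# A few [hour, average] pairs taken from the averages computed earlier.
avg_comments_sample = [
    ['13', 14.741176470588234],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (second element) without building a swapped list.
top_hours = sorted(avg_comments_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post.'.format(hour, avg))
```

Sorting on the average alone also avoids comparing hour strings when two averages tie, which the swapped-list approach would do.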
