# Hacker News Analysis

Goal: Determine the time of day at which Hacker News posts receive the most comments.

I'll be exploring Hacker News data. You can find the data set here: https://www.kaggle.com/hacker-news/hacker-news-posts.

I'm specifically interested in two kinds of posts: those that ask the Hacker News community a question, whose titles begin with "Ask HN", and those that show the community a project, whose titles begin with "Show HN". Let's begin by importing the data and exploring the first few rows.


In [1]:
# importing modules
from csv import reader
import datetime as dt

In [2]:
# read the CSV file and display the first five rows

# the with-block closes the file once all rows have been read
with open("hacker_news.csv", newline="") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# assign headers list
headers = hn[0]

# delete headers row
del hn[0]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [5]:
# separating out posts which begin with "Ask HN" or "Show HN"

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# QC: print the first five rows of each list
for row in ask_posts[0:5]:
    print(row)
print('\n')
for row in show_posts[0:5]:
    print(row)
print('\n')
for row in other_posts[0:5]:
    print(row)
print('\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']

In [6]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
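
As a quick sanity check on the prefix matching, here is a minimal sketch of the same rule applied to single titles. The helper name `classify_title` is hypothetical, and the "Show HN" title is made up for illustration; the other two titles come from the rows printed above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",  # from the data set
    "Show HN: My weekend project",                  # made-up example title
    "Interactive Dynamic Video",                    # from the data set
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means variants such as "ASK HN" or "ask hn" land in the same bucket, which matches the loop above.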


Next, let's determine whether ask posts or show posts receive more comments on average.

In [11]:
# sum the num_comments column (index 4) for each group of posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Ask posts receive 14.0 comments on average, compared to 10.3 comments on average for show posts. Since ask posts receive more comments, I'll focus the rest of the analysis on just these posts.
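
The averaging step can also be wrapped in a small reusable function. This is only a sketch: the name `average_comments` and its `comments_index` parameter are hypothetical, and the two sample rows are taken from the ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows, or 0.0 if the list is empty."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

# (6 + 29) / 2 = 17.5
print(average_comments(sample_posts))
```

Guarding against an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.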

Next, let's determine whether ask posts created at a certain time of day are more likely to attract comments. I'll do this in two steps:

1. calculate the number of ask posts created and the number of comments received in each hour of the day (so 24 bins)
2. calculate the average number of comments ask posts receive for each hour of creation

In [15]:
# initialize dictionaries
counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts:
    # parse the "created_at" column and extract the hour as a zero-padded 24-hour string
    created_at_datetime = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    created_at_hour = created_at_datetime.strftime("%H")
    number_of_comments = int(row[4])

    # increment the post count and the comment total for this hour
    if created_at_hour in counts_by_hour:
        counts_by_hour[created_at_hour] += 1
    else:
        counts_by_hour[created_at_hour] = 1

    if created_at_hour in comments_by_hour:
        comments_by_hour[created_at_hour] += number_of_comments
    else:
        comments_by_hour[created_at_hour] = number_of_comments

print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


In [19]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
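
The two steps above (binning by hour, then averaging) can be sketched more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` checks. The function name `average_comments_by_hour` is hypothetical, and the sample rows are made up for illustration; only the comment count (index 4) and timestamp (index 6) are filled in.

```python
from collections import defaultdict
import datetime as dt


def average_comments_by_hour(posts):
    """Map each zero-padded hour string to the average comment count of posts created in that hour."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for row in posts:
        created_at = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
        hour = created_at.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[4])
    return {hour: comments[hour] / counts[hour] for hour in counts}


# made-up rows: two posts in the 09:00 hour, one in the 13:00 hour
sample_rows = [
    ['', 'Ask HN: example one', '', '0', '6', '', '8/16/2016 9:55'],
    ['', 'Ask HN: example two', '', '0', '4', '', '8/17/2016 9:10'],
    ['', 'Ask HN: example three', '', '0', '29', '', '11/22/2015 13:43'],
]

hourly_averages = average_comments_by_hour(sample_rows)
print(hourly_averages)
```

Returning a dictionary instead of a list of lists is a design choice for this sketch; the list-of-lists form used in the notebook works just as well for sorting.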

In [20]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [32]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    print("{0}:00: {1:.2f} average comments per post".format(row[1], row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
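
As an alternative to building a swapped list, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ranked directly. A minimal sketch, using a handful of the averages computed above (the variable names `avg_by_hour_sample` and `top_hours` are hypothetical):

```python
# a subset of the [hour, average] pairs computed earlier
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['16', 16.796296296296298],
]

# sort by the average (second element), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print(f"{hour}:00: {average:.2f} average comments per post")
```

This keeps each pair in its original `[hour, average]` order, so no second list is needed.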


The timestamps in the data set are in US Eastern Time (EDT, which is GMT-4). I live in Pacific Time (PDT, which is GMT-7), three hours behind. So, to have the greatest chance of receiving comments, I should create posts at 12 PM (15:00 minus three hours), 11 PM (02:00 minus three hours), or 5 PM (20:00 minus three hours).
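
The three-hour shift from Eastern to Pacific time can be sketched as modular arithmetic, applied here to the top three hours from the ranking above. The helper `eastern_to_pacific_hour` is hypothetical and assumes a fixed three-hour offset, which holds as long as both zones are on the same daylight saving schedule.

```python
def eastern_to_pacific_hour(hour_str, offset_hours=3):
    """Shift a zero-padded Eastern hour string back by offset_hours, wrapping past midnight."""
    return (int(hour_str) - offset_hours) % 24


top_eastern_hours = ["15", "02", "20"]
pacific_hours = [eastern_to_pacific_hour(hour) for hour in top_eastern_hours]

for eastern, pacific in zip(top_eastern_hours, pacific_hours):
    print(f"{eastern}:00 Eastern -> {pacific:02d}:00 Pacific")
```

The modulo keeps the result in the 0 to 23 range, so 02:00 Eastern wraps to 23:00 Pacific on the previous day.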