# Exploring Hacker News Posts

This project explores a dataset of Hacker News posts; the sample rows shown below are dated 2015 and 2016. The data set source is: https://www.kaggle.com/hacker-news/hacker-news-posts. Each row has seven columns: id (the post's identifier), title (the post's title), url (the link the post points to), num_points (the points the post received), num_comments (the number of comments on the post), author (the username of the submitter), and created_at (the date and time the post was created). Note that the dataset has been reduced from approximately 300,000 observations to about 20,000 observations by removing posts which did not receive any comments and then taking a random sample of the remaining posts.

In this analysis, we will focus specifically on Ask HN (Ask Hacker News) and Show HN (Show Hacker News) posts. The goal of this analysis is to determine whether Ask HN or Show HN posts get more comments on average. We will also try to determine whether posts created at a certain time receive more comments on average.

In [1]:
from csv import reader
# The context manager closes the file once the rows are read.
with open("hacker_news.csv") as f:
    hn = list(reader(f))

In [2]:
for row in hn[0:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)
for row in hn[0:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [5]:
assert (len(ask_posts) + len(show_posts) + len(other_posts)) == len(hn)

In [6]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

14.038417431192661
10.31669535283993


It appears that ask posts receive more comments on average than do show posts. Show posts averaged approximately 10 comments per post, while ask posts averaged about 14 comments per post. Because ask posts receive more comments on average, we will focus the rest of our analysis just on ask posts. Our next task is to determine if ask posts created at a certain time are more likely to attract comments.
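Before grouping by hour, it helps to see how a single created_at string becomes an hour value. The sketch below parses one timestamp copied from the sample rows above; the variable names are illustrative only and are not used elsewhere in the notebook.

```python
import datetime as dt

# Timestamp string copied from the first sample row of the dataset.
sample_created_at = "8/4/2016 11:52"

# The format is month/day/year followed by a 24-hour clock time.
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

# Zero-padded hour string, suitable for use as a dictionary key.
hour_key = parsed.strftime("%H")
print(hour_key)
```

Using the zero-padded string as the key keeps hours such as "02" and "15" in a uniform two-character form.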

In [7]:
import datetime as dt

In [9]:
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

In [17]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    post_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    post_hour = dt.datetime.strftime(post_time, "%H")
    
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = row[1]
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += row[1]

In [21]:
avg_by_hour = []

for key in counts_by_hour:
    avg_by_hour.append([key, (comments_by_hour[key] / counts_by_hour[key])])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In [23]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [25]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
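The swap step exists only so that sorted() orders the pairs by average rather than by hour. As a possible alternative, a key function can sort on the second element directly. The sketch below uses a few pairs copied from the avg_by_hour output above; sample_avg_by_hour and sorted_by_avg are hypothetical names introduced for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (second element), highest first; no swap is needed.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

With this approach the pairs stay in [hour, average] order, so any later code that indexes them would read the hour from position 0 and the average from position 1.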

In [27]:
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strftime(dt.datetime.strptime(row[1], "%H"), "%H:%M"), row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


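As shown, ask posts created during the 3:00 PM hour (15:00) EST received the most comments on average, about 38.59 per post, so this is the best hour of the day to create an ask post. Since I am located in the Mountain time zone, which is two hours behind Eastern time, this means I should create a post at 1:00 PM MST.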
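The conversion from Eastern to Mountain time can be checked with a small sketch. It assumes fixed standard-time offsets (UTC-5 for EST, UTC-7 for MST) and ignores daylight saving time; the date used is arbitrary, and the names EST, MST, best_hour_est, and best_hour_mst are introduced here for illustration.

```python
import datetime as dt

# Fixed UTC offsets for standard time; daylight saving time is ignored here.
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
MST = dt.timezone(dt.timedelta(hours=-7), "MST")

# The date is arbitrary; only the hour matters for this illustration.
best_hour_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

# Express the same instant in Mountain time.
best_hour_mst = best_hour_est.astimezone(MST)

print(best_hour_mst.strftime("%H:%M %Z"))
```

Because both time zones normally shift by one hour for daylight saving time, the two-hour difference would generally hold in summer as well, although this sketch does not model that.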
