# Hacker News User Content Analysis

We will look at Hacker News posts, specifically those whose titles mark them as "Ask Hacker News" (Ask HN) or "Show Hacker News" (Show HN), and compare how many comments each kind receives.

In [7]:
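from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Show the header row and the first four posts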
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




In [8]:
# Remove the headers from the dataset and display them separately
headers = hn[0]
hn = hn[1:]
print(headers)
for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




In [9]:
# Split posts into ask, show, and other, then count

ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [11]:
# Find out whether ask or show posts get more comments on average

total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Avg. ask comments: ', avg_ask_comments)

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)
print('Avg. show comments: ', avg_show_comments)

Avg. ask comments:  14.038417431192661
Avg. show comments:  10.31669535283993


On average, ask posts receive more comments than show posts (about 14.0 versus 10.3 per post). This is not too surprising, since ask posts explicitly invite the community to respond in the comments.
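The two averaging loops above follow the same pattern, so they could be factored into one helper. The following is only a sketch: the function name `average_comments` and its `comments_index` parameter are hypothetical, and the two sample rows are copied from the dataset preview earlier in the notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of post rows.

    Comment counts are stored as strings in the CSV, so each one is
    converted with int(). An empty list yields 0.0 instead of raising
    ZeroDivisionError.
    """
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Two rows copied from the preview of the dataset (52 and 1 comments)
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(average_comments(sample_posts))  # (52 + 1) / 2 = 26.5
```

In the notebook, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.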

The remaining analysis focuses on ask posts. Next, we explore whether posts created at a certain hour of the day receive more comments.

In [13]:
# Count posts and comments for each hour of the day

import datetime as dt

result_list = []
for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    result_list.append([created_at, comments])
    
counts_by_hour = {}
comments_by_hour = {}

for line in result_list:
    post_time = dt.datetime.strptime(line[0],'%m/%d/%Y %H:%M')
    hour = post_time.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += line[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = line[1]
        
print(counts_by_hour)
print(comments_by_hour)

{'15': 116, '20': 80, '19': 110, '10': 59, '06': 44, '12': 73, '09': 45, '17': 100, '02': 58, '11': 58, '03': 54, '22': 71, '07': 34, '13': 85, '23': 68, '14': 107, '01': 60, '16': 108, '21': 109, '18': 109, '08': 48, '04': 47, '00': 55, '05': 46}
{'15': 4477, '20': 1722, '19': 1188, '10': 793, '06': 397, '12': 687, '09': 251, '17': 1146, '02': 1381, '11': 641, '03': 421, '22': 479, '07': 267, '13': 1253, '23': 543, '14': 1416, '01': 683, '16': 1814, '21': 1745, '18': 1439, '08': 492, '04': 337, '00': 447, '05': 464}


In [14]:
# Find the average comments per post in each hour

avg_by_hour = []
for hour in counts_by_hour:
    posts = counts_by_hour[hour]
    comments = comments_by_hour[hour]
    avg_by_hour.append([hour, comments/posts])
    
print(avg_by_hour)

[['15', 38.5948275862069], ['20', 21.525], ['19', 10.8], ['10', 13.440677966101696], ['06', 9.022727272727273], ['12', 9.41095890410959], ['09', 5.5777777777777775], ['17', 11.46], ['02', 23.810344827586206], ['11', 11.051724137931034], ['03', 7.796296296296297], ['22', 6.746478873239437], ['07', 7.852941176470588], ['13', 14.741176470588234], ['23', 7.985294117647059], ['14', 13.233644859813085], ['01', 11.383333333333333], ['16', 16.796296296296298], ['21', 16.009174311926607], ['18', 13.20183486238532], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['05', 10.08695652173913]]


In [15]:
# Create a swapped list for sorting

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

[[38.5948275862069, '15'], [21.525, '20'], [10.8, '19'], [13.440677966101696, '10'], [9.022727272727273, '06'], [9.41095890410959, '12'], [5.5777777777777775, '09'], [11.46, '17'], [23.810344827586206, '02'], [11.051724137931034, '11'], [7.796296296296297, '03'], [6.746478873239437, '22'], [7.852941176470588, '07'], [14.741176470588234, '13'], [7.985294117647059, '23'], [13.233644859813085, '14'], [11.383333333333333, '01'], [16.796296296296298, '16'], [16.009174311926607, '21'], [13.20183486238532, '18'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [10.08695652173913, '05']]


In [16]:
# Sort in descending order

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
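The swap-then-sort steps above work, but `sorted` can also order the original `[hour, average]` pairs directly through its `key` argument. The sketch below uses four of the averages printed above as sample data, and the name `sorted_avg` is hypothetical. Because the hour stays in position 0, any later code would read the hour from `row[0]` and the average from `row[1]`.

```python
# A small subset of the [hour, average] pairs computed above
avg_by_hour = [
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['19', 10.8],
    ['02', 23.810344827586206],
]

# Sort by the average (index 1), largest first, without building a swapped list
sorted_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_avg:
    print("{}:00 {:.2f}".format(hour, avg))
```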


In [20]:
# Display the top results

print('Top 5 Hours for Ask Post Comments')
for row in sorted_swap[:5]:
    output = "{hour}:00- {avg:.2f} average comments per post".format(hour=row[1],avg=row[0])
    print(output)

Top 5 Hours for Ask Post Comments
15:00- 38.59 average comments per post
02:00- 23.81 average comments per post
20:00- 21.52 average comments per post
16:00- 16.80 average comments per post
21:00- 16.01 average comments per post


## Results

Ask posts created between 15:00 and 15:59 receive the most comments on average. The [documentation for the data set](https://www.kaggle.com/hacker-news/hacker-news-posts) shows that the timestamps are in US Eastern Time. That means that where I am, on Pacific Time, the best time to post for maximum comments would be 12:00 - 12:59.
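The time zone conversion above amounts to subtracting three hours and wrapping around midnight. A minimal sketch follows, assuming Pacific Time stays exactly three hours behind Eastern Time all year (both zones change to and from daylight saving time on the same dates). The helper name `eastern_hour_to_pacific` is hypothetical, and the hour labels are the top five from the results above.

```python
# Assumption: Pacific Time is three hours behind US Eastern Time year-round
EASTERN_TO_PACIFIC_OFFSET = -3


def eastern_hour_to_pacific(hour_str):
    """Convert a two-digit hour label such as '15' from Eastern to Pacific."""
    pacific_hour = (int(hour_str) + EASTERN_TO_PACIFIC_OFFSET) % 24
    return '{:02d}'.format(pacific_hour)


# Top five hours for ask post comments, as reported above (Eastern Time)
top_hours = ['15', '02', '20', '16', '21']

for hour in top_hours:
    print('{}:00 Eastern is {}:00 Pacific'.format(hour, eastern_hour_to_pacific(hour)))
```

Note that 02:00 Eastern wraps to 23:00 Pacific on the previous calendar day.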