# Hacker News Post Analysis
#### by Andrew MacDonald

Hacker News is a technology-focused social news site. Founded in 2007, it allows users to submit both news article links and self-posts. After submission, other users can comment on the post and reply to other comments.

This dataset consists of ~20,000 submissions that contain at least one comment. Submissions were randomly selected.

Columns in the dataset include:
* **id**: The unique identifier from Hacker News for the post
* **title**: The title of the post
* **url**: The URL that the post links to, if the post has a URL
* **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments**: The number of comments that were made on the post
* **author**: The username of the person who submitted the post
* **created_at**: The date and time at which the post was submitted
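
Every field arrives from the CSV as a string, so numeric and date columns need converting before any arithmetic. The sketch below shows one way to do that for a single row. `parse_row` is a hypothetical helper that is not part of the original notebook, and treating an empty `url` as "no URL" is an assumption about how the file encodes self-posts. The date format matches the one used later in this analysis.

```python
import datetime as dt

# Sample row copied from the dataset preview; every field is a string.
raw_row = ['12224879', 'Interactive Dynamic Video',
           'http://www.interactivedynamicvideo.com/',
           '386', '52', 'ne0phyte', '8/4/2016 11:52']

def parse_row(row):
    """Hypothetical helper: convert one CSV row into a dict of typed values."""
    return {
        'id': int(row[0]),
        'title': row[1],
        'url': row[2] or None,  # assumption: an empty string means no URL
        'num_points': int(row[3]),
        'num_comments': int(row[4]),
        'author': row[5],
        'created_at': dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M'),
    }

post = parse_row(raw_row)
print(post['num_comments'], post['created_at'].hour)
```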

### Initial Import and Review of Dataset

In [2]:
# import modules
from csv import reader
import pandas as pd
import datetime as dt

In [3]:
# import dataset
# use a context manager so the file is closed after reading
with open('hacker_news.csv') as f:
    hn = list(reader(f))

In [4]:
# separate out title row
hn_with_title = hn
hn_title = hn[:1]
hn = hn[1:]

In [5]:
# print title row
print(hn_title)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [6]:
# print the first 5 rows, with each row's fields separated by spaces
for post in hn[:5]:
    print(*post)

12224879 Interactive Dynamic Video http://www.interactivedynamicvideo.com/ 386 52 ne0phyte 8/4/2016 11:52
10975351 How to Use Open Source and Shut the Fuck Up at the Same Time http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/ 39 10 josep2 1/26/2016 19:30
11964716 Florida DJs May Face Felony for April Fools' Water Joke http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/ 2 1 vezycash 6/23/2016 22:20
11919867 Technology ventures: From Idea to Enterprise https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429 3 1 hswarna 6/17/2016 0:01
10301696 Note by Note: The Making of Steinway L1037 (2007) http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0 8 2 walterbell 9/30/2015 4:12


In [7]:
# print first 5 rows as a pandas dataframe
hn_df = pd.DataFrame(hn_with_title[1:], columns=hn_with_title[0])
hn_df[:5]

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
| 2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |


### Ask Hacker News / Show Hacker News

Some of the self-posts that users submit fall into one of two categories: "Ask HN" posts or "Show HN" posts. The code below sorts the posts into separate lists by category, with everything else going into a third list.

Which of these two post categories receives more user comments?

In [8]:
# create three lists to store posts for the respective categories
ask_posts = []
show_posts = []
other_posts = []

In [9]:
# append posts to respective lists based on category
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [10]:
# calculate the average number of comments on an "Ask HN" post
total_ask_comments = 0 

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661


In [11]:
# calculate the average number of comments on a "Show HN" post
total_show_comments = 0 

for post in show_posts:
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993


As shown by the calculations above, "Ask HN" posts receive an average of ~14 comments per post. "Show HN" posts receive fewer comments, at an average of ~10 per post.
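
The two averages above use the same loop, so the calculation could be factored into one function. The following is a minimal sketch under that assumption. `average_comments` is a hypothetical helper that is not in the original notebook, and the sample rows are made up purely to show the call (they are not taken from the dataset).

```python
def average_comments(posts, comments_index=4):
    """Hypothetical helper: mean of the num_comments column for a list of rows."""
    if not posts:
        return 0.0  # avoid ZeroDivisionError when a category is empty
    return sum(int(post[comments_index]) for post in posts) / len(posts)

# Made-up rows in the dataset's column order (illustration only, not real data).
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]

print(average_comments(sample_posts))  # 7.0
```

In the notebook, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.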

### Ask HN: Time of Day Influence

Does the time of day posted have an influence on the number of comments on an "Ask HN" post?

In [12]:
# create a list of lists that stores the time of each post and number of comments
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

In [27]:
# create two dictionaries keyed by hour of the day
# counts_by_hour: number of posts created in each hour
# comments_by_hour: total number of comments on posts created in each hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    posttime = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    posthour = posttime.strftime('%H')
    if posthour in counts_by_hour:
        counts_by_hour[posthour] += 1
        comments_by_hour[posthour] += row[1]
    else:
        counts_by_hour[posthour] = 1
        comments_by_hour[posthour] = row[1]

In [28]:
# create a list of lists that stores the average number of comments for each hour of the day
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [29]:
# reverse the child lists so that the average number of comments is first and hour is second
swap_avg_by_hour = []

for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])

In [36]:
# sort the reversed list in descending order by average number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [35]:
# format the output to display the top five hours of the day to post in order to receive the most comments
for hour in sorted_swap[:5]:
    postdatetime = dt.datetime.strptime(hour[1], '%H')
    post_hour = postdatetime.strftime('%H:%M')
    hour_comments = str(round(hour[0], 2))
    print(f"{post_hour}: {hour_comments} average comments per post")

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
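
The steps above (count, total, average, swap, sort) can also be condensed by sorting on a key instead of swapping the list elements. This is a sketch of that alternative, not the notebook's own method. `top_hours` is a hypothetical function name, and the sample input is made up to illustrate the shape of `result_list`.

```python
import datetime as dt
from collections import defaultdict

def top_hours(time_and_comments, n=5):
    """Hypothetical helper: return the n hours with the highest average comments."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in time_and_comments:
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
        counts[hour] += 1
        comments[hour] += num_comments
    averages = {hour: comments[hour] / counts[hour] for hour in counts}
    # sort (hour, average) pairs by the average, highest first
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up [created_at, num_comments] pairs in the same shape as result_list.
sample = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 9:14', 2]]

for hour, avg in top_hours(sample):
    print(f"{hour:02d}:00: {avg:.2f} average comments per post")
```

Keeping the hour as an integer avoids the second `strptime` call when formatting, and the `key` argument removes the need for a swapped copy of the list.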
