# Analysis of Posts from Hacker News

### A Data Science project with Python

The popular website [Hacker News](https://news.ycombinator.com/) is a nice source of information and fun. If you click the link to this site, you will see it is composed of a great many numbered posts by contributors on disparate technology topics. You can view comments (if any) for each post, and select special posts that begin with "Ask HN:" (ask Hacker News) and "Show HN:" (show Hacker News). It doesn't take long to learn to navigate through all these posts and find interesting topics and questions.

In our project we will analyze and otherwise have some fun with a data set taken from these posts on Hacker News. The data set samples a year of posts and a description of the data can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). One thing to note in the link is that time stamps in the posts are Eastern Time. 

We will use the Python programming language and concepts at an intermediate level to accomplish our mission. Roughly put, our mission is to analyze the popularity of posts made on Hacker News at different times of the day. We will focus on the analysis of string objects and of time and date objects in what follows.

Let's open up this data set we will use for the project, and look at the header and first four rows.

In [183]:
from csv import reader

#this is our data set for the project; the with block closes the file for us
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

for row in hn[0:5]:
    print(row)
    print()

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']



Here is the meaning of the columns you see in the header row:

column heading | meaning
--- | --- 
id | the unique identifier from Hacker News for the post
title | title of the post
url | the URL the post links to, if it has one
num_points | the number of points the post got (upvotes minus downvotes)
num_comments | the number of comments made on the post
author | username of the person who made the post
created_at | date and time the post was submitted 



In our project we will be interested mainly in posts of the variety "Ask HN" and "Show HN". Let's list just a few of these in rough fashion.

In [184]:
ask_hn = []
show_hn = []

for row in hn[1:]:
    title = row[1]
    
    if "Ask HN:" in title:
        ask_hn.append(row)
    if "Show HN:" in title:
        show_hn.append(row)
        
print(ask_hn[0:2])
print("...and {:,} more rows like this".format(len(ask_hn)))
print()
print(show_hn[0:2])
print("...and {:,} more rows like this".format(len(show_hn)))

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]
...and 1,736 more rows like this

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]
...and 1,161 more rows like this


Turning to our main data set `hn`, we remove the header (shown and explained above) and work with the remaining rows. Let us also print the number of rows in this data set.

In [185]:
hn = hn[1:] #all rows except the header row
n = len(hn)
print("There are {:,} rows".format(n))

There are 20,100 rows


Now that we have a feel for the data, let's group it more carefully into the categories we want to study, making sure we don't lose anything to differences in capitalization in the titles. For example, if a title begins with "ask hn" or "Ask HN" we put it into the list `ask_posts`, but if "ask hn" appears elsewhere in the title, we don't want it in `ask_posts`.

In [186]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower() #get rid of all capitalizations
    
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("There are {:,} posts in ask_posts".format(len(ask_posts)))
print("There are {:,} posts in show_posts".format(len(show_posts)))
print("There are {:,} posts in other_posts".format(len(other_posts)))

There are 1,744 posts in ask_posts
There are 1,162 posts in show_posts
There are 17,194 posts in other_posts
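
The prefix test used above can be checked in isolation. The sketch below wraps it in a small helper, `classify_title`, which is a hypothetical name introduced here for illustration and is not part of the project code. Two of the titles come from the sample output earlier; the other two are made up to exercise the edge cases.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # normalize case before comparing
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


examples = [
    "Ask HN: How to improve my personal website?",  # from the data set
    "ASK HN: is anyone there?",                      # made up: unusual capitalization
    "Why I never ask HN for advice",                 # made up: phrase not at the start
    "Show HN: Something pointless I made",           # from the data set
]
labels = [classify_title(t) for t in examples]
print(labels)
```

The third title contains "ask HN" but does not begin with it, so it lands in the "other" group, which is the behavior described above.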


First of all, let's see if ask posts or show posts receive more comments on average. Remember from the table above that the number of comments is in column 5 (corresponding to index 4).

In [187]:
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
avg = total_ask_comments/len(ask_posts)
print("Ask posts get {:,.2f} comments on average".format(avg))

total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
avg = total_show_comments/len(show_posts)
print("Show posts get {:,.2f} comments on average".format(avg))

Ask posts get 14.04 comments on average
Show posts get 10.32 comments on average
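
The two loops above differ only in the list they iterate over, so the same calculation could be written once as a helper. This is a minimal sketch: `average_comments` is a hypothetical name, and the two sample rows are copied from the "Ask HN" output printed earlier. The last lines work out the relative difference between the two averages reported above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a non-empty list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows from the "Ask HN" sample shown earlier (6 and 29 comments)
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_avg = average_comments(sample_rows)
print(sample_avg)  # (6 + 29) / 2

# Relative difference between the averages printed above
ask_avg, show_avg = 14.04, 10.32
relative_diff = (ask_avg / show_avg - 1) * 100
print("Ask posts get about {:.0f}% more comments".format(relative_diff))
```

Note that the helper would raise `ZeroDivisionError` on an empty list, which is not a concern for the lists built above.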


We see that the average "Ask HN" post receives about 36% more comments than the average "Show HN" post on Hacker News (14.04 vs. 10.32). Let's focus our remaining analysis on the "Ask HN" posts. In particular, let's see if ask posts created during certain times of the day are more likely to attract comments. We do this by computing the average number of comments per "Ask HN" post for each hour of the day.

Recall that the seventh column (index 6), labeled `created_at`, contains the time stamp for each post. We work with Python datetime objects below to accomplish our goal.

In [188]:
import datetime as dt #import Python's datetime module and abbreviate it
result_list = []

for row in ask_posts:
    t = row[6] #raw time data, ex '6/23/2016 22:10'
    n = int(row[4]) #number of comments
    result_list.append([t,n])

#create frequency tables
counts_by_hour = {} 
comments_by_hour = {}

for row in result_list:
    #get the hour the post was created
    t = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hr = t.hour
    
    n = row[1]
    
    if hr in counts_by_hour:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += n
    else:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = n
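
To see what the parsing step and the frequency tables do on a small input, here is a sketch with three `[created_at, num_comments]` pairs. The first two come from the "Ask HN" rows printed earlier; the third is hypothetical and is included only so that one hour occurs twice. It uses `dict.get` with a default of 0 as an alternative to the `if`/`else` branch; that is a stylistic choice, not a change in the result.

```python
import datetime as dt

# Format of the created_at column, e.g. '8/16/2016 9:55'
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"

pairs = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["1/5/2016 9:10", 4],  # hypothetical row so that hour 9 appears twice
]

counts = {}
comments = {}
for created_at, num_comments in pairs:
    hour = dt.datetime.strptime(created_at, TIMESTAMP_FORMAT).hour
    # dict.get supplies 0 the first time an hour is seen
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + num_comments

print(counts)
print(comments)
```

Hour 9 ends up with two posts and ten comments in total, and hour 13 with one post and 29 comments.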

Now to find the average number of comments per post during each hour:

In [189]:
avg_by_hour = []

for hr in counts_by_hour:
    avg = comments_by_hour[hr]/counts_by_hour[hr]
    avg_by_hour.append([hr,avg])
    
print(avg_by_hour[0:5])
print("...etc.")

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447]]
...etc.


Let's display the five hours with the highest number of comments per post in descending order. We swap the order of the columns in `avg_by_hour` and use the `sorted` function on this new list.

In [190]:
swap_avg_by_hour = []
#swap columns in the lists
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

#sort in descending order, highest to lowest
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print(sorted_swap[0:5])
print("...etc.")
print()
#print this out in readable form
print("Top 5 Hours for Ask Posts Comments:")
print()

for row in sorted_swap[0:5]:
    hr = dt.time(row[1], 0, 0) #time object
    hr = hr.strftime("%H:%M") #format time as eg. 14:00
    template = "{}: {:.2f} average comments per post" #named template to avoid shadowing the built-in str
    avg = row[0]
    print(template.format(hr, avg))

[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21]]
...etc.

Top 5 Hours for Ask Posts Comments:

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
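
As an aside, the column swap is not strictly needed: `sorted` accepts a `key` argument that says which element to sort on. The sketch below applies this to the first five `[hour, average]` pairs from the `avg_by_hour` output shown earlier, so its ranking covers only those five hours and not the full day.

```python
# First five [hour, average] pairs from the avg_by_hour output above
avg_by_hour_sample = [
    [0, 8.127272727272727],
    [1, 11.383333333333333],
    [2, 23.810344827586206],
    [3, 7.796296296296297],
    [4, 7.170212765957447],
]

# Sort on the second element (the average) without building a swapped copy
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:3]:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

The two approaches can order exact ties differently: the swapped lists fall back to comparing the hour, while the `key` version keeps tied rows in their original order.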


We noted earlier that these times are Eastern Time. What about someone in Central Time (e.g. the author of this report) who wants to make posts on Hacker News during these peak times? We have to subtract an hour from each of these times; this is done in Python with a `timedelta` object.

In [191]:
print("Top 5 Hours for Ask Posts Comments in Central Time:")
print()
for row in sorted_swap[0:5]:
    #use a dummy date; year 1 is the earliest datetime, so shifting hour 0 back from it would overflow
    hr = dt.datetime(2000,1,1,row[1],0,0)
    hr = hr - dt.timedelta(hours = 1) #now subtract an hour
    hr = hr.strftime("%H:%M") #format the shifted time as eg. 14:00
    template = "{}: {:.2f} average comments per post"
    avg = row[0]
    print(template.format(hr, avg))

Top 5 Hours for Ask Posts Comments in Central Time:

14:00: 38.59 average comments per post
01:00: 23.81 average comments per post
19:00: 21.52 average comments per post
15:00: 16.80 average comments per post
20:00: 16.01 average comments per post
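
Because only the hour matters here, the same shift can also be done with modular arithmetic, which wraps around midnight (hour 0 Eastern becomes hour 23 Central) and needs no dummy date. This is a sketch of an alternative, not the method used in the cell above; `shift_hour` is a hypothetical helper, and hour 0 is added to the list purely as an edge case.

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of the day (0-23) by whole hours, wrapping at midnight."""
    return (hour + offset_hours) % 24


# Eastern Time hours from the top-five list above, plus midnight as an edge case
eastern_hours = [15, 2, 20, 16, 21, 0]
central_hours = [shift_hour(h, -1) for h in eastern_hours]
print(central_hours)
```

The first five values match the Central Time hours printed above.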


In conclusion, if I wanted to submit "Ask HN" posts to Hacker News and receive lots of comments, the best time of day for me would appear to be during the 2:00 p.m. (14:00) hour, followed in succession by the 1:00 a.m., 7:00 p.m., 3:00 p.m., and 8:00 p.m. hours. 

**Broadly put, the best times to get comments on Ask Posts are mid-afternoon, late evening, and the middle of the night.**

This project could be extended by:

1. Comparing the popularity of "Ask HN" and "Show HN" posts
2. Seeing if posts made at certain times receive more points on average
    
The datasets created above harbor the answers, and analysis would proceed in similar fashion. 
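
As a starting point for the second extension, the hourly averaging could be generalized to any numeric column. The sketch below is one possible shape for that, under stated assumptions: `average_by_hour` is a hypothetical helper, and `demo_rows` are invented rows in the same column layout as `hn`, not rows from the data set, so the printed numbers say nothing about Hacker News itself.

```python
import datetime as dt


def average_by_hour(rows, value_index, time_index=6):
    """Return {hour: mean of the chosen numeric column} for a list of rows."""
    totals = {}
    counts = {}
    for row in rows:
        hour = dt.datetime.strptime(row[time_index], "%m/%d/%Y %H:%M").hour
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Invented rows: id, title, url, num_points, num_comments, author, created_at
demo_rows = [
    ["1", "Ask HN: first example", "", "10", "4", "user_a", "3/1/2016 15:05"],
    ["2", "Ask HN: second example", "", "20", "8", "user_b", "3/2/2016 15:40"],
    ["3", "Ask HN: third example", "", "6", "2", "user_c", "3/2/2016 2:15"],
]

points_by_hour = average_by_hour(demo_rows, value_index=3)         # num_points
comments_by_hour_demo = average_by_hour(demo_rows, value_index=4)  # num_comments
print(points_by_hour)
print(comments_by_hour_demo)
```

Calling the helper with `ask_posts` or `show_posts` and `value_index=3` would give average points per hour for the real data.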