# Exploring Hacker News Posts

In this project we analyse [Hacker News](https://news.ycombinator.com/) posts. Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The aim of our analysis is to identify the types of posts which would generate the most interest as well as when it is best to post on Hacker News to generate the most attention. The data set that we will be using for this analysis is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts). Below are the descriptions of the columns:

|  Column Name |                                                      Description                                                      |
|:------------:|:---------------------------------------------------------------------------------------------------------------------:|
|      id      |                                  The unique identifier from Hacker News for the post                                  |
|     title    |                                                 The title of the post                                                 |
|      url     |                                 The URL that the post links to, if the post has a URL                                 |
|  num_points  | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments |                                   The number of comments that were made on the post                                   |
|    author    |                                   The username of the person who submitted the post                                   |
|  created_at  |                                   The date and time (GMT-5) at which the post was submitted                           |


We are specifically interested in posts whose titles start with "Ask HN" or "Show HN". Users submit "Ask HN" posts to ask the Hacker News community a specific question. Users submit "Show HN" posts to show the Hacker News community a project, a product, or just something generally interesting. The goal of both types of posts is to garner attention: Ask HN posts prompt users to comment with an answer, while Show HN posts serve as a type of advertisement for the user. In both cases it would be useful to know the best time to post. We will use the number of comments on a post as a proxy for the attention that post received.

## Opening and Cleaning the Data Set

Let's open the dataset and inspect the first five rows:

In [1]:
from csv import reader

with open("hacker_news.csv", encoding="utf8") as file:
    read_file = reader(file)
    hn_data = list(read_file)
    hn_header = hn_data[0]
    hn = hn_data[1:]
  
print(f"Total Rows: {len(hn):,}")

import pandas as pd
data = hn[0:5]
pd.DataFrame(data, columns=hn_header, index=[f"Row: {x + 1}" for x in range(len(data))])

Total Rows: 293,119


|        |    id    |                        title                       |                        url                        | num_points | num_comments |    author    |   created_at   |
|:------:|:--------:|:--------------------------------------------------:|:-------------------------------------------------:|:----------:|:------------:|:------------:|:--------------:|
| Row: 1 | 12579008 | You have two days to comment if you want stem ...  | http://www.regulations.gov/document?D=FDA-2015... |      1     |       0      |    altstar   | 9/26/2016 3:26 |
| Row: 2 | 12579005 |              SQLAR the SQLite Archiver             |  https://www.sqlite.org/sqlar/doc/trunk/README.md |      1     |       0      |   blacksqr   | 9/26/2016 3:24 |
| Row: 3 | 12578997 | What if we just printed a flatscreen televisio...  | https://medium.com/vanmoof/our-secrets-out-f21... |      1     |       0      | pavel_lishin | 9/26/2016 3:19 |
| Row: 4 | 12578989 |                  algorithmic music                 | http://cacm.acm.org/magazines/2011/7/109891-al... |      1     |       0      | poindontcare | 9/26/2016 3:16 |
| Row: 5 | 12578979 | How the Data Vault Enables the Next-Gen Data W...  | https://www.talend.com/blog/2016/05/12/talend-... |      1     |       0      |  markgainor1 | 9/26/2016 3:14 |


The dataset has almost 300,000 rows of data. However, most of these posts received no comments, as the first five rows of the dataset suggest. To reduce the bias this introduces, we will remove all submissions that did not receive any comments.

In [2]:
hn_reduced = []    

for row in hn:
    comments = float(row[4])
    if comments != 0:
        hn_reduced.append(row)
        
print(f"Total Rows: {len(hn_reduced):,}")
        
import pandas as pd
data = hn_reduced[0:5]
pd.DataFrame(data, columns=hn_header, index=[f"Row: {x + 1}" for x in range(len(data))])

Total Rows: 80,401


|        |    id    |                        title                       |                        url                        | num_points | num_comments |    author   |   created_at   |
|:------:|:--------:|:--------------------------------------------------:|:-------------------------------------------------:|:----------:|:------------:|:-----------:|:--------------:|
| Row: 1 | 12578975 |            Saving the Hassle of Shopping           | https://blog.menswr.com/2016/09/07/whats-new-w... |      1     |       1      |    bdoux    | 9/26/2016 3:13 |
| Row: 2 | 12578908 | Ask HN: What TLD do you use for local developm...  |                                                   |      4     |       7      |   Sevrene   | 9/26/2016 2:53 |
| Row: 3 | 12578822 |   Amazons Algorithms Dont Find You the Best Deals  | https://www.technologyreview.com/s/602442/amaz... |      1     |       1      |  yarapavan  | 9/26/2016 2:26 |
| Row: 4 | 12578694 | Emergency dose of epinephrine that does not co...  |          http://m.imgur.com/gallery/th6Ua         |      2     |       1      | dredmorbius | 9/26/2016 1:54 |
| Row: 5 | 12578624 | Phone Makers Could Cut Off Drivers. So Why Don...  | http://www.nytimes.com/2016/09/25/technology/p... |      4     |       1      |    danso    | 9/26/2016 1:37 |


This reduces the dataset to a little over 80,000 rows.
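
The same filter can also be written as a list comprehension. The snippet below is a minimal sketch on made-up rows (the names `sample_rows`, `NUM_COMMENTS` and `with_comments` are illustrative and not part of the original analysis); it assumes the same column order as the dataset, with `num_comments` at index 4.

```python
# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "First post", "http://example.com/a", "1", "0", "alice", "9/26/2016 3:26"],
    ["2", "Ask HN: An example question?", "", "4", "7", "bob", "9/26/2016 2:53"],
    ["3", "Second post", "http://example.com/b", "2", "1", "carol", "9/26/2016 1:54"],
]

NUM_COMMENTS = 4  # index of the num_comments column

# Keep only the rows that received at least one comment.
with_comments = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]

print(f"Total Rows: {len(with_comments):,}")
```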

## Extracting Ask HN and Show HN Posts

Now we will inspect the data at a more granular level by categorising each post as "Ask HN", "Show HN" or "Other Posts". The rules are:

- If the title begins with "Ask HN", then it is an "Ask HN" post
- If the title begins with "Show HN", then it is a "Show HN" post
- Otherwise it is categorised as "Other Posts"

To avoid any issues with upper and lower case, we convert the title string to lower case and then use the "startswith" string method to check which of the categories the post belongs to.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_reduced:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(f"Ask Posts has {len(ask_posts):,} rows")
print(f"Show Posts has {len(show_posts):,} rows")
print(f"Other Posts has {len(other_posts):,} rows")

Ask Posts has 6,911 rows
Show Posts has 5,059 rows
Other Posts has 68,431 rows


After all the exclusions, we are left with a small subset of the original data: just under 12,000 rows of Ask HN and Show HN posts.
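
As an illustration of the categorisation rule, the check can be wrapped in a small helper. This is only a sketch: the function name `categorise_title` and the example titles are hypothetical and do not appear in the original notebook.

```python
def categorise_title(title):
    """Return 'ask', 'show' or 'other' based on how the title begins."""
    lowered = title.lower()  # makes the check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles for illustration.
for title in ["Ask HN: An example question?", "SHOW HN: My project", "SQLAR the SQLite Archiver"]:
    print(title, "->", categorise_title(title))
```

Lower-casing the title first means that variants such as "ASK HN" or "show hn" fall into the same category as the usual capitalisation.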

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next we will inspect the distributions of the number of comments for Ask HN and Show HN posts. To do this we:

- Extract the number of comments for Ask HN and Show HN posts into separate lists. It's important to convert num_comments into a float before appending it to its respective list so that we can perform numerical operations on it
- Calculate the average number of comments in each list by dividing the sum of the list by its length
- Calculate the variance using the "variance" function from the built-in statistics module
- Calculate the skew using the "skew" function from the scipy.stats module
- Display all the relevant results in a table using DataFrame from the pandas module

In [4]:
from statistics import variance
from scipy.stats import skew

ask_comments = []
show_comments = []

for row in hn_reduced:
    title = row[1].lower()
    comments = float(row[4])
    if title.startswith("ask hn"):
        ask_comments.append(comments)
    elif title.startswith("show hn"):
        show_comments.append(comments)

avg_ask_comments = round(sum(ask_comments) / len(ask_comments),2)
avg_show_comments = round(sum(show_comments) / len(show_comments),2)

var_ask_comments = round(variance(ask_comments),2)
var_show_comments = round(variance(show_comments),2)

skew_ask_comments = round(skew(ask_comments),2)
skew_show_comments = round(skew(show_comments),2)

data_ask = [len(ask_comments), avg_ask_comments, var_ask_comments, skew_ask_comments]
data_show = [len(show_comments), avg_show_comments, var_show_comments, skew_show_comments]
data = [data_ask, data_show]

import pandas as pd
pd.DataFrame(data, columns=["Number of Posts", "Average", "Variance", "Skew"], index = ["Ask HN Comments", "Show HN Comments"] )

|                  | Number of Posts | Average | Variance |  Skew |
|:----------------:|:---------------:|:-------:|:--------:|:-----:|
|  Ask HN Comments |       6911      |  13.74  |  2457.25 | 12.01 |
| Show HN Comments |       5059      |   9.81  |  475.71  |  5.41 |


On average, Ask HN posts generate more comments than Show HN posts (13.74 versus 9.81 comments per post). This is unsurprising since the main purpose of Ask HN posts is to encourage users to comment, whereas Show HN posts are more for advertising and users aren't as inclined to comment unless the post really captures their interest.

It's important to note that the average does not capture the full picture of the distribution of comments for each type of post. Ask HN posts have a greater spread in the number of comments than Show HN posts, as shown by the higher variance. The skewness is also highly positive for both types of posts, which suggests that most posts receive fewer comments than the average.

Interestingly, as well as a higher average and a higher variance, Ask HN posts also have a higher skewness. This suggests that the high average may be driven by a small number of posts with a very high number of comments. This is expected, as there is a greater emphasis on discussion than in Show HN posts.
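
To make the effect of positive skew concrete, here is a small sketch on made-up comment counts (not taken from the dataset). It computes the skewness by hand as the mean cubed deviation divided by the cubed population standard deviation, which should correspond to what `scipy.stats.skew` returns with its default settings.

```python
from statistics import mean, median, pstdev

# Made-up comment counts: most posts get a few comments, one post gets a lot.
comments = [1, 2, 2, 3, 3, 4, 5, 100]

avg = mean(comments)
mid = median(comments)

# Population skewness: mean cubed deviation over the cubed standard deviation.
sd = pstdev(comments)
skewness = sum((x - avg) ** 3 for x in comments) / (len(comments) * sd ** 3)

# How many posts fall below the average?
below_average = sum(1 for x in comments if x < avg)

print(f"mean={avg}, median={mid}, skew={skewness:.2f}, below average={below_average} of {len(comments)}")
```

A single very large value pulls the mean well above the median, so nearly every post ends up below the average. This is the same pattern the positive skew indicates for the Ask HN and Show HN comment counts.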

## Finding the Amount of Ask Posts and Comments by Hour Created

For the next part of our analysis, we will work out the best time to post on Hacker News to generate the highest number of comments. We will first analyse the Ask HN posts, then move on to the Show HN posts.

To do this we create two frequency tables:

- Create two empty dictionaries to store the frequency tables
- The keys for both dictionaries will be the hour at which the post was made
- To determine the hour, we parse the date string with the "strptime" method from the built-in datetime module (imported as dt), then format the hour with "strftime"
- The first dictionary holds the number of posts by hour: each time the loop finds a post made in an hour that is already a key, it adds one to the value; otherwise it initialises that key with a value of one
- The second dictionary holds the number of comments by hour: each time the loop finds a post made in an hour that is already a key, it adds the post's number of comments to the value; otherwise it initialises that key with the post's number of comments
- We then display the frequency tables together in a table using DataFrame from the pandas module

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created = row[0]
    created = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(created, "%H")
    
    if hour in counts_by_hour.keys():
        counts_by_hour[hour] += 1
    else:
        counts_by_hour[hour] = 1
    
    if hour in comments_by_hour.keys():
        comments_by_hour[hour] += row[1]
    else:
        comments_by_hour[hour] = row[1]

i = []
data = []

for key in counts_by_hour:
    i.append(f"{key}:00 to {key}:59")
    data.append([counts_by_hour[key], comments_by_hour[key]])

import pandas as pd
table = pd.DataFrame(data, columns=["Posts by Hour", "Comments by Hour"], index=i)
table.sort_index()

|                | Posts by Hour | Comments by Hour |
|:--------------:|:-------------:|:----------------:|
| 00:00 to 00:59 |      231      |       2277       |
| 01:00 to 01:59 |      223      |       2089       |
| 02:00 to 02:59 |      227      |       2996       |
| 03:00 to 03:59 |      212      |       2154       |
| 04:00 to 04:59 |      186      |       2360       |
| 05:00 to 05:59 |      165      |       1838       |
| 06:00 to 06:59 |      176      |       1587       |
| 07:00 to 07:59 |      157      |       1585       |
| 08:00 to 08:59 |      190      |       2362       |
| 09:00 to 09:59 |      176      |       1477       |


At first glance, the hour from 15:00 to 15:59 has both the highest number of Ask HN posts and the highest total number of comments.
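
For readers unfamiliar with "strptime", the sketch below shows how a date string in the dataset's format is parsed and how the two frequency tables accumulate. The (date, comments) pairs are made up for illustration, and `dict.get` with a default of 0 is used here as a compact alternative to the explicit if/else check.

```python
import datetime as dt

# Made-up (created_at, num_comments) pairs in the dataset's date format.
sample = [
    ("9/26/2016 3:26", 2),
    ("9/25/2016 3:05", 5),
    ("9/26/2016 15:40", 10),
]

counts_by_hour = {}
comments_by_hour = {}

for created_at, n_comments in sample:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "03"
    # get() returns 0 the first time an hour is seen, so no if/else is needed.
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + n_comments

print(counts_by_hour)
print(comments_by_hour)
```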

## Calculating the Average Number of Comments for Ask HN Posts by Hour

Next we calculate the average number of comments per Ask HN post for each hour. This allows a fair comparison between hours, since a different number of posts is made in each hour. To do this, we use a loop that divides the total number of comments for posts made in that hour by the number of posts made in that hour.

In [6]:
avg_by_hour = {}

for key in counts_by_hour:
    count = counts_by_hour[key]
    comments = comments_by_hour[key]
    average = comments / count
    avg_by_hour[key] = average

i = []
data = []

for key in counts_by_hour:
    i.append(f"{key}:00 to {key}:59")
    data.append([counts_by_hour[key], comments_by_hour[key], round(avg_by_hour[key],2)])

import pandas as pd
table = pd.DataFrame(data, columns=["Posts by Hour", "Comments by Hour", "Average Comments by Hour"], index=i)
table.sort_index()

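|                | Posts by Hour | Comments by Hour | Average Comments by Hour |
|:--------------:|:-------------:|:----------------:|:------------------------:|
| 00:00 to 00:59 |      231      |       2277       |           9.86           |
| 01:00 to 01:59 |      223      |       2089       |           9.37           |
| 02:00 to 02:59 |      227      |       2996       |           13.2           |
| 03:00 to 03:59 |      212      |       2154       |           10.16          |
| 04:00 to 04:59 |      186      |       2360       |           12.69          |
| 05:00 to 05:59 |      165      |       1838       |           11.14          |
| 06:00 to 06:59 |      176      |       1587       |           9.02           |
| 07:00 to 07:59 |      157      |       1585       |           10.1           |
| 08:00 to 08:59 |      190      |       2362       |           12.43          |
| 09:00 to 09:59 |      176      |       1477       |           8.39           |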


Posts made between 15:00 and 15:59 generate the highest average number of comments.

## Sorting and Printing Values from a List of Lists

To assist with our analysis, we will be sorting the table by average number of comments by hour. To do this, we first need to create a list of lists where the first value of the inner list is the average comments by hour.

In [7]:
swap_avg_by_hour = []

for entry in avg_by_hour:
    swap_avg_by_hour.append([avg_by_hour[entry], entry])

print(swap_avg_by_hour)

[[13.198237885462555, '02'], [9.367713004484305, '01'], [11.749128919860627, '22'], [11.056511056511056, '21'], [9.414285714285715, '19'], [13.73019801980198, '17'], [39.66809421841542, '15'], [13.153439153439153, '14'], [22.2239263803681, '13'], [11.143426294820717, '11'], [13.757990867579908, '10'], [8.392045454545455, '09'], [10.095541401273886, '07'], [10.160377358490566, '03'], [10.76144578313253, '16'], [12.43157894736842, '08'], [9.857142857142858, '00'], [8.322463768115941, '23'], [11.38265306122449, '20'], [10.789823008849558, '18'], [15.452554744525548, '12'], [12.688172043010752, '04'], [9.017045454545455, '06'], [11.139393939393939, '05']]


Next we use the built-in "sorted" function to sort the list in descending order (hence the reverse=True argument). Because each inner list starts with the average, the lists are ordered by average. We then print the top 5 hours to make a post.

In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments:")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = dt.datetime.strftime(hour, "%H:%M")
    print(f"{hour}: {row[0]:.2f} average comments per post")

Top 5 Hours for Ask Posts Comments:
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
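
As an alternative sketch, the swap step can be avoided by sorting the dictionary's items directly and passing a `key` function to "sorted". The dictionary below uses made-up values in the same shape as avg_by_hour; the names `example_avg_by_hour` and `ranked` are illustrative only.

```python
# Made-up averages keyed by hour, in the same shape as avg_by_hour.
example_avg_by_hour = {"09": 8.4, "15": 39.7, "13": 22.2, "02": 13.2}

# Sort the (hour, average) pairs by the average, largest first.
ranked = sorted(example_avg_by_hour.items(), key=lambda item: item[1], reverse=True)

for hour, average in ranked[:3]:
    print(f"{hour}:00: {average:.2f} average comments per post")
```

Sorting on an explicit key keeps the hour and average in their natural order and avoids comparing the hour strings when two averages are equal.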


It's important to note that the times are in US Eastern Time (i.e. GMT-5). Australia (GMT+10) is 15 hours ahead, so 15:00 in the dataset corresponds to 06:00 the following day. The best time to post Ask HN posts from Australia is therefore 06:00 to 06:59.
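
The conversion can be checked with the datetime module. This is a minimal sketch that assumes fixed offsets of GMT-5 and GMT+10 and ignores daylight saving; the date used is arbitrary, and the variable names are illustrative.

```python
import datetime as dt

# Assumption: fixed offsets of GMT-5 and GMT+10, ignoring daylight saving.
source_tz = dt.timezone(dt.timedelta(hours=-5))
target_tz = dt.timezone(dt.timedelta(hours=10))

# The date is arbitrary; only the hour of day matters here.
best_hour_source = dt.datetime(2016, 9, 26, 15, 0, tzinfo=source_tz)
best_hour_target = best_hour_source.astimezone(target_tz)

print(best_hour_source.strftime("%m/%d/%Y %H:%M"), "->", best_hour_target.strftime("%m/%d/%Y %H:%M"))
```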

## Analysing Show Posts by Hour

Next we conduct the same analysis on the Show HN posts:

In [9]:
import datetime as dt

result_list = []

for row in show_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created = row[0]
    created = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(created, "%H")
    
    if hour in counts_by_hour.keys():
        counts_by_hour[hour] += 1
    else:
        counts_by_hour[hour] = 1
    
    if hour in comments_by_hour.keys():
        comments_by_hour[hour] += row[1]
    else:
        comments_by_hour[hour] = row[1]

avg_by_hour = {}

for key in counts_by_hour:
    count = counts_by_hour[key]
    comments = comments_by_hour[key]
    average = comments / count
    avg_by_hour[key] = average

swap_avg_by_hour = []

for entry in avg_by_hour:
    swap_avg_by_hour.append([avg_by_hour[entry], entry])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Show Posts Comments:")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = dt.datetime.strftime(hour, "%H:%M")
    print(f"{hour}: {row[0]:.2f} average comments per post")

Top 5 Hours for Show Posts Comments:
07:00: 12.42 average comments per post
12:00: 12.03 average comments per post
14:00: 11.60 average comments per post
08:00: 11.07 average comments per post
04:00: 10.87 average comments per post


The best time to post Show HN posts differs from the best time for Ask HN posts. The top hour for Show HN posts is 07:00 to 07:59 (GMT-5), which corresponds to 22:00 to 22:59 in Australia (GMT+10).
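
Since the Ask HN and Show HN analyses follow the same steps, they could be wrapped in one reusable function. The sketch below is one possible way to do this; the function name `top_hours`, its parameters and the sample rows are hypothetical and not part of the original notebook. It assumes rows in the dataset's column order, with num_comments at index 4 and created_at at index 6.

```python
import datetime as dt


def top_hours(posts, n=5, created_index=6, comments_index=4):
    """Return the n (hour, average comments) pairs with the highest averages."""
    counts = {}
    totals = {}
    for row in posts:
        created = dt.datetime.strptime(row[created_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[comments_index])
    # Average comments per post for each hour.
    averages = {hour: totals[hour] / counts[hour] for hour in counts}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up rows for illustration.
sample_posts = [
    ["1", "Show HN: A", "", "1", "4", "alice", "9/26/2016 7:10"],
    ["2", "Show HN: B", "", "1", "8", "bob", "9/25/2016 7:45"],
    ["3", "Show HN: C", "", "1", "3", "carol", "9/26/2016 12:00"],
]

for hour, average in top_hours(sample_posts):
    print(f"{hour}:00: {average:.2f} average comments per post")
```

With a helper like this, the same call could be applied to ask_posts and show_posts without repeating the loop.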

## Conclusion

On average, Ask HN posts generate more comments than Show HN posts, which is unsurprising given the nature of the posts.

Interestingly, the best time to post is different for each type of post. For Ask HN posts the best time to post in Australia is 6AM. For Show HN posts the best time to post in Australia is 10PM.