# Exploring the Best Type of Post and Timing for Posting on Hacker News Portal 

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

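You can find the [data set](https://www.kaggle.com/hacker-news/hacker-news-posts) here. Note that the data set is described as having been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The file loaded below, however, has 293,119 rows and includes posts with zero comments, so it appears to be the unreduced version.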

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

- Ask HN: How to improve my personal website?
- Ask HN: Am I the only one outraged by Twitter shutting down share counts?
- Ask HN: Any recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

- Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
- Show HN: Something pointless I made
- Show HN: Shanhu.io, a programming playground powered by e8vm

**In this project, we'll aim to find out what kind of posts are more likely to receive attention on Hacker News. To do so, we will answer two questions:**

- **Do Ask HN or Show HN posts receive more comments on average?**
- **Do posts created at a certain time receive more comments on average?**

### Summary of Results

The conclusion from this data analysis is that Ask HN posts on Hacker News received more comments on average than Show HN posts. In addition, Ask HN posts created during the 15:00 EST hour received the highest average number of comments.

### Opening and Preparing the Data

First, we read the hacker_news.csv data set using pandas and assign it to the variable hn.

In [11]:
import pandas as pd
hn = pd.read_csv("hacker_news.csv", encoding='utf8')
print(type(hn))

<class 'pandas.core.frame.DataFrame'>


Then we print the first five rows of the data set.

In [12]:
hn.head(5)

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12579008 | You have two days to comment if you want stem ... | http://www.regulations.gov/document?D=FDA-2015... | 1 | 0 | altstar | 9/26/2016 3:26 |
| 1 | 12579005 | SQLAR the SQLite Archiver | https://www.sqlite.org/sqlar/doc/trunk/README.md | 1 | 0 | blacksqr | 9/26/2016 3:24 |
| 2 | 12578997 | What if we just printed a flatscreen televisio... | https://medium.com/vanmoof/our-secrets-out-f21... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
| 3 | 12578989 | algorithmic music | http://cacm.acm.org/magazines/2011/7/109891-al... | 1 | 0 | poindontcare | 9/26/2016 3:16 |
| 4 | 12578979 | How the Data Vault Enables the Next-Gen Data W... | https://www.talend.com/blog/2016/05/12/talend-... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |


In [13]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            293119 non-null  int64 
 1   title         293119 non-null  object
 2   url           279256 non-null  object
 3   num_points    293119 non-null  int64 
 4   num_comments  293119 non-null  int64 
 5   author        293119 non-null  object
 6   created_at    293119 non-null  object
dtypes: int64(3), object(4)
memory usage: 11.2+ MB


## Data Analysis

### Counting number of posts (Ask HN vs Show HN) and which of them receive more comments on average

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll count the number of posts belonging to each of these groups.

In [14]:
# Lowercase the titles once, then build one boolean mask per category
titles = hn['title'].str.lower()
is_ask, is_show = titles.str.startswith('ask hn'), titles.str.startswith('show hn')
ask = hn[is_ask].copy()
show = hn[is_show].copy()
other = hn[~(is_ask | is_show)].copy()

In [15]:
print(ask.shape[0])
print(show.shape[0])
print(other.shape[0])

9139
10158
273822


There are 9,139 Ask HN posts and 10,158 Show HN posts in the data set. Next, we will check whether Ask HN or Show HN posts receive more comments on average.
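
The same kind of count can also be produced in one pass by labelling each post with a category. The snippet below is only a minimal sketch on a tiny hand-made sample: the `sample` DataFrame, the `categorize` helper and the `category` column are illustrative assumptions, not part of the data set or the analysis above.

```python
import pandas as pd

# Tiny hypothetical sample standing in for the real hn DataFrame
sample = pd.DataFrame({
    "title": [
        "Ask HN: How to improve my personal website?",
        "Show HN: Something pointless I made",
        "ask hn: lowercase prefix still counts",
        "SQLAR the SQLite Archiver",
    ]
})

def categorize(title: str) -> str:
    """Return 'ask', 'show' or 'other' based on the (case-insensitive) title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Label every post, then count how many posts fall into each category
sample["category"] = sample["title"].apply(categorize)
counts = sample["category"].value_counts()
print(counts)
```

On the real data, applying the same idea to hn should reproduce the three counts printed above, assuming the prefixes are matched the same way.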

In [16]:
ask.describe()

| | id | num_points | num_comments |
|---|---|---|---|
| count | 9139.0 | 9139.0 | 9139.0 |
| mean | 11389960.0 | 11.311741 | 10.393478 |
| std | 702638.1 | 41.946308 | 43.508148 |
| min | 10176920.0 | 1.0 | 0.0 |
| 25% | 10793490.0 | 1.0 | 1.0 |
| 50% | 11376890.0 | 3.0 | 2.0 |
| 75% | 12021290.0 | 6.0 | 6.0 |
| max | 12578910.0 | 1213.0 | 1007.0 |


In [17]:
show.describe()

| | id | num_points | num_comments |
|---|---|---|---|
| count | 10158.0 | 10158.0 | 10158.0 |
| mean | 11313510.0 | 14.843572 | 4.8861 |
| std | 698379.7 | 51.04185 | 16.154288 |
| min | 10177420.0 | 1.0 | 0.0 |
| 25% | 10712290.0 | 2.0 | 0.0 |
| 50% | 11255890.0 | 3.0 | 0.0 |
| 75% | 11938300.0 | 7.0 | 2.0 |
| max | 12578340.0 | 1624.0 | 306.0 |


Ask HN posts receive, on average, more than twice as many comments as Show HN posts (10.39 compared to 4.89). Since Ask HN posts receive more comments on average, we'll focus the remaining analysis on these posts.
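
Because the comment counts are heavily skewed (for Ask HN the maximum is 1007 while the median is 2), it can help to look at the mean and the median side by side. The following is a minimal sketch with made-up numbers; the `posts` DataFrame and its `category` column are assumptions introduced for illustration only.

```python
import pandas as pd

# Hypothetical comment counts; the real analysis would use the ask and show frames
posts = pd.DataFrame({
    "category": ["ask", "ask", "ask", "show", "show", "show"],
    "num_comments": [1, 2, 30, 0, 0, 6],
})

# Mean and median number of comments per category
summary = posts.groupby("category")["num_comments"].agg(["mean", "median"])
print(summary)
```

In this toy sample a single busy post pulls the mean well above the median, which is the same pattern the describe() output shows for the real data.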

Next, we'll determine if Ask HN posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- We will calculate the number of Ask HN posts created in each hour of the day, along with the number of comments they received.
- Then we will calculate the average number of comments that Ask HN posts receive, by hour created.
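
The two steps above can be sketched explicitly as follows. This is an illustration on made-up numbers, assuming an `hour` column has already been extracted; `ask_sample` and the output column names (`posts`, `comments`, `avg_comments`) are hypothetical.

```python
import pandas as pd

# Hypothetical Ask HN posts with an already-extracted hour column
ask_sample = pd.DataFrame({
    "hour": [15, 15, 13, 2, 2, 2],
    "num_comments": [40, 20, 16, 3, 6, 9],
})

by_hour = ask_sample.groupby("hour")["num_comments"].agg(
    posts="count",     # step 1a: number of posts created in each hour
    comments="sum",    # step 1b: total comments those posts received
)

# Step 2: average number of comments per post for each hour
by_hour["avg_comments"] = by_hour["comments"] / by_hour["posts"]
print(by_hour)
```

Dividing the per-hour total by the per-hour count gives the same value as taking the mean of num_comments within each hour, which is what the analysis below does in a single call.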

### Determining the hour during which Ask HN posts receive the most comments on average

The creation time of each post is stored in the created_at column of the data set. To determine the best time of day to create an Ask HN post, we will convert these values to datetime and then create an additional 'hour' column.

In [18]:
ask['created_at'] = pd.to_datetime(ask['created_at'])  # convert strings to datetime

In [19]:
ask['hour'] = ask['created_at'].dt.hour.astype(int)  # extract the hour of day (0-23)
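
For reference, the conversion can also be written with an explicit format string, which avoids relying on pandas to infer the layout. This is a sketch under the assumption that every value looks like the ones shown in the preview (month/day/year, then hour:minute); only the first string below comes from the data preview, the other two are made up.

```python
import pandas as pd

# First value is taken from the preview above; the other two are hypothetical
created = pd.Series(["9/26/2016 3:26", "8/16/2016 9:55", "11/22/2015 13:43"])

# Month/day/year followed by a 24-hour time without seconds
parsed = pd.to_datetime(created, format="%m/%d/%Y %H:%M")

# The .dt accessor exposes the hour of day as an integer from 0 to 23
hours = parsed.dt.hour
print(hours.tolist())  # [3, 9, 13]
```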

Now we can calculate the average number of comments per post created in each hour of the day.

In [20]:
ask.groupby('hour').num_comments.mean()

hour
0      7.564784
1      7.407801
2     11.137546
3      7.948339
4      9.711934
5      8.794258
6      6.782051
7      7.013274
8      9.190661
9      6.653153
10    10.684397
11     8.964744
12    12.380117
13    16.317568
14     9.692008
15    28.676471
16     7.713299
17     9.449744
18     7.942997
19     7.163043
20     8.749020
21     8.687259
22     8.804178
23     6.696793
Name: num_comments, dtype: float64
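
To read the ranking off more easily, the averages can be sorted in descending order. The sketch below rebuilds a small Series from a handful of the values printed above (the variable names are illustrative); on the real result the same call could be chained directly onto the groupby expression.

```python
import pandas as pd

# A subset of the hourly averages from the output above (hour -> average comments)
avg_by_hour = pd.Series(
    {2: 11.137546, 9: 6.653153, 10: 10.684397,
     12: 12.380117, 13: 16.317568, 15: 28.676471},
    name="num_comments",
)

# Sort from highest to lowest average and keep the three best hours
top_hours = avg_by_hour.sort_values(ascending=False).head(3)
print(top_hours)
```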

### Conclusions

In this data set, Ask HN posts created during the 15:00 EST hour received the highest average number of comments.

There is also a large gap between the 15:00 EST hour and the hour with the second-highest average (13:00 EST): 28.68 versus 16.32 average comments per post, a ratio of roughly 1.8.
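
The size of that gap can be checked with a quick calculation using the two averages from the hourly output (the variable names are illustrative):

```python
# Hourly averages taken from the groupby output above
top_avg = 28.676471      # 15:00 hour
second_avg = 16.317568   # 13:00 hour

# How many times larger the top average is than the runner-up
ratio = top_avg / second_avg
print(round(ratio, 2))  # 1.76
```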

A fair conclusion from this analysis is that, historically, among the Ask HN posts in this data set, those created during the 15:00 EST hour received the highest average number of comments.