# Exploring Hacker News Posts

In this project, I will compare three types of posts from [Hacker News](https://news.ycombinator.com), a popular technology website where users ask questions, share information, and so on.

In the analysis, the posts are divided into three groups based on their type:

1. __Posts to ask questions__<br>
Users submit these posts with titles starting with Ask HN<br>
<br>
2. __Posts to show something interesting__<br>
Users submit these posts with titles starting with Show HN<br>
<br>
3. __Other posts__<br>
All posts that do not belong to the first two groups<br>
    
___I'll specifically compare these three types of posts to determine the following:___

- Do show, ask or other posts receive more comments/points on average?
- Are posts created at a certain time more likely to receive more comments/points?
- Which post types appear among the top 10/50/100/1000 posts with the most comments/points?

According to the data set documentation, the timezone used is Eastern Time in the US, so 15:00 can also be written as 3:00 pm EST.
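As a small illustration of the 24-hour to 12-hour conversion mentioned above, here is a minimal sketch. The helper name `to_12_hour` is hypothetical and not part of the original notebook.

```python
def to_12_hour(hour):
    """Format an hour on the 24-hour clock (0-23) as a 12-hour clock string."""
    suffix = "am" if hour < 12 else "pm"
    display_hour = hour % 12 or 12  # 0 and 12 are both displayed as 12
    return f"{display_hour}:00 {suffix}"

print(to_12_hour(15))  # 3:00 pm
```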

# Exploration of Data

First, I'll read in the data and find the size of the data.

In [1]:
# Import library
import pandas as pd

# Read the data
hn = pd.read_csv('hacker_news.csv')
hn.head()

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12579008 | You have two days to comment if you want stem ... | http://www.regulations.gov/document?D=FDA-2015... | 1 | 0 | altstar | 9/26/2016 3:26 |
| 1 | 12579005 | SQLAR the SQLite Archiver | https://www.sqlite.org/sqlar/doc/trunk/README.md | 1 | 0 | blacksqr | 9/26/2016 3:24 |
| 2 | 12578997 | What if we just printed a flatscreen televisio... | https://medium.com/vanmoof/our-secrets-out-f21... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
| 3 | 12578989 | algorithmic music | http://cacm.acm.org/magazines/2011/7/109891-al... | 1 | 0 | poindontcare | 9/26/2016 3:16 |
| 4 | 12578979 | How the Data Vault Enables the Next-Gen Data W... | https://www.talend.com/blog/2016/05/12/talend-... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |


In [2]:
hn.info()
print('\nThe size of data:', hn.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293119 entries, 0 to 293118
Data columns (total 7 columns):
id              293119 non-null int64
title           293119 non-null object
url             279256 non-null object
num_points      293119 non-null int64
num_comments    293119 non-null int64
author          293119 non-null object
created_at      293119 non-null object
dtypes: int64(3), object(4)
memory usage: 15.7+ MB

The size of data: (293119, 7)


# Cleaning of Data

Looking at the data, I will keep the columns that contain:

1) The title of each post

2) The number of points each post received

3) The number of comments each post received

4) The author of each post

5) The creation time of each post

The remaining columns, id and url, will be dropped.

In [3]:
# Drop id and url column
hn = hn.drop(['id','url'], axis=1)
hn.head()

| | title | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|
| 0 | You have two days to comment if you want stem ... | 1 | 0 | altstar | 9/26/2016 3:26 |
| 1 | SQLAR the SQLite Archiver | 1 | 0 | blacksqr | 9/26/2016 3:24 |
| 2 | What if we just printed a flatscreen televisio... | 1 | 0 | pavel_lishin | 9/26/2016 3:19 |
| 3 | algorithmic music | 1 | 0 | poindontcare | 9/26/2016 3:16 |
| 4 | How the Data Vault Enables the Next-Gen Data W... | 1 | 0 | markgainor1 | 9/26/2016 3:14 |


Next, I notice that the title column contains both upper-case and lower-case text. I'll convert the titles to lower case so that matching on the title prefix does not depend on capitalization.

In [4]:
# Convert the title column to lower case
hn['title'] = hn['title'].str.lower()

Finally, to carry out the time-based analysis, I need to convert the 'created_at' column to datetime.

In [5]:
# Convert column into datetime
date_format = "%m/%d/%Y %H:%M" # format of date to use datetime class
hn['created_at'] = pd.to_datetime(hn['created_at'], format=date_format)
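To see how the format string maps onto the raw values, the sketch below parses one `created_at` value from the data preview using only the standard library's `datetime.strptime`, which understands the same format directives. It is an illustration on a single string, not a replacement for the `pd.to_datetime` call.

```python
from datetime import datetime

date_format = "%m/%d/%Y %H:%M"  # month/day/year, then 24-hour clock time

# One raw value copied from the first row of the data preview
parsed = datetime.strptime("9/26/2016 3:26", date_format)

print(parsed)       # 2016-09-26 03:26:00
print(parsed.hour)  # 3
```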

# Grouping the Ask Post, Show Post and Other Posts

Now I will tag each post with its category by creating a new column called 'label'. The tag is determined by checking whether the title starts with "ask hn", "show hn" or neither.

In [6]:
# Function which labels the posts according to its category
def tag(title):
    if title.startswith('ask hn'):
        return 'ask'
    elif title.startswith('show hn'):
        return 'show'
    else:
        return 'other'
hn['label'] = hn['title'].apply(tag)
print('Count of categories:\n', hn.groupby('label').count()['title'])

Count of categories:
 label
ask        9139
other    273822
show      10158
Name: title, dtype: int64
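The same labelling can also be done without calling a Python function once per row. The sketch below is one possible vectorized alternative using `numpy.select` on a small made-up Series; the sample titles and the variable names `titles`, `conditions` and `labels` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up, already lower-cased titles (not taken from the data set)
titles = pd.Series([
    "ask hn: how do you learn a new language?",
    "show hn: a tiny static site generator",
    "an article about databases",
])

# Conditions are checked in order; the first match decides the label
conditions = [titles.str.startswith("ask hn"), titles.str.startswith("show hn")]
labels = pd.Series(
    np.select(conditions, ["ask", "show"], default="other"),
    index=titles.index,
)

print(labels.tolist())  # ['ask', 'show', 'other']
```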


In [7]:
# Example row for each type of post
print('An example row from ask post:')
print(hn[hn['label']=='ask'].iloc[0])

print('\nAn example row from show post:')
print(hn[hn['label']=='show'].iloc[0])

print('\nAn example row from other post:')
print(hn[hn['label']=='other'].iloc[0])

An example row from ask post:
title           ask hn: what tld do you use for local developm...
num_points                                                      4
num_comments                                                    7
author                                                    Sevrene
created_at                                    2016-09-26 02:53:00
label                                                         ask
Name: 10, dtype: object

An example row from show post:
title           show hn: finding puns computationally
num_points                                          2
num_comments                                        0
author                                          saamm
created_at                        2016-09-26 00:36:00
label                                            show
Name: 52, dtype: object

An example row from other post:
title           you have two days to comment if you want stem ...
num_points                                                      1
num_c

# Average Number of Comments and Points for Each Type of Post

After the separation of data into groups, I will calculate the __average number of comments and points received__ for each type of post.

In [8]:
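# Average number of points and comments for each post type
# (the numeric columns are selected explicitly; newer pandas versions
# raise an error when asked to average text columns)
print('Average number of points and comments by post type:')
hn.groupby('label')[['num_points', 'num_comments']].mean()

Average number of points and comments by post type: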


| label | num_points | num_comments |
|---|---|---|
| ask | 11.311741 | 10.393478 |
| other | 15.15601 | 6.457268 |
| show | 14.843572 | 4.8861 |


As shown above, the average number of comments on an ask post __(10.39)__ is about __2.1 times__ that of a show post __(4.89)__. The average number of comments on an other post __(6.46)__ lies between those of ask posts and show posts.

The average number of points received by show posts and other posts is very close, at around __15__. Interestingly, other posts receive __the most__ points on average with __15.16__, whereas they receive __the second most__ comments among the types. Ask posts receive __the fewest__ points with __11.31__ per post, whereas they receive __the most__ comments.

This result seems plausible: people tend to comment on ask posts rather than upvote them, whereas show posts are more likely to receive points than comments.

# Average Number of Comments as well as Points for Each Type of Post by Hour Created

Next, I will investigate whether creating each type of post at a certain time affects the number of comments and points received.

In [9]:
# Extract the hour from the created_at column into a new column
hn['hour'] = hn['created_at'].dt.hour

In [10]:
# Top 5 hours by average number of comments for ask posts (average points shown alongside)
print('Average number of points and comments of ask posts by hour created:')
hn.groupby(['label', 'hour'])[['num_points', 'num_comments']].mean().loc['ask'].sort_values(by='num_comments', ascending=False).head()

Average number of points and comments of ask posts by hour created:


| hour | num_points | num_comments |
|---|---|---|
| 15 | 21.637771 | 28.676471 |
| 13 | 17.932432 | 16.317568 |
| 12 | 13.576023 | 12.380117 |
| 2 | 10.944238 | 11.137546 |
| 10 | 13.43617 | 10.684397 |


Among the top 5 hours for ask posts, the hour with the most comments per post, 28.68, is __15:00 - 16:00__. The average number of comments for ask posts created at 15:00 is about __76%__ higher than the 16.32 received at 13:00. The top three hours also have something in common: they all fall at or just after noon and are almost consecutive.

Comparing points with comments for ask posts, the three hours with the most comments are also the three with the most points among the hours listed: __15:00__ with 21.64 points per post, __13:00__ with 17.93 and __12:00__ with 13.58. This suggests that these are the best hours to create ask posts to maximize both points and comments.
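One way to check whether the same hours lead on both measures is to compare the top three hours by each column. The sketch below does this only for the five ask-post rows shown in the output above (the values are copied from it); on the full data the same comparison would be run on the complete per-hour table. The variable names are illustrative.

```python
import pandas as pd

# The five ask-post rows from the output above, indexed by hour
ask_by_hour = pd.DataFrame(
    {
        "num_points": [21.637771, 17.932432, 13.576023, 10.944238, 13.43617],
        "num_comments": [28.676471, 16.317568, 12.380117, 11.137546, 10.684397],
    },
    index=pd.Index([15, 13, 12, 2, 10], name="hour"),
)

# Hours with the three largest averages for each measure
top_by_comments = set(ask_by_hour["num_comments"].nlargest(3).index.tolist())
top_by_points = set(ask_by_hour["num_points"].nlargest(3).index.tolist())

print(top_by_comments == top_by_points)  # True
```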

In [11]:
# Top 5 hours by average number of comments for show posts (average points shown alongside)
print('Average number of points and comments of show posts by hour created:')
hn.groupby(['label', 'hour'])[['num_points', 'num_comments']].mean().loc['show'].sort_values(by='num_comments', ascending=False).head()

Average number of points and comments of show posts by hour created:


| hour | num_points | num_comments |
|---|---|---|
| 12 | 20.905039 | 6.994186 |
| 7 | 13.995763 | 6.682203 |
| 11 | 19.258706 | 6.002488 |
| 8 | 14.683544 | 5.60443 |
| 14 | 15.090517 | 5.515805 |


Among the top 5 hours for show posts, the hour with the most comments per post, 6.99, is __12:00 - 13:00__. However, the differences among the top 5 hours in comments per show post are small.

Among these five hours, __12:00 - 13:00__ also has the most points per show post, with 20.91, so it is the best hour for both points and comments. Another hour that ranks high on both measures is __11:00__, with 19.26 points per post.

In [12]:
# Top 5 hours by average number of comments for other posts (average points shown alongside)
print('Average number of points and comments of other posts by hour created:')
hn.groupby(['label', 'hour'])[['num_points', 'num_comments']].mean().loc['other'].sort_values(by='num_comments', ascending=False).head()

Average number of points and comments of other posts by hour created:


| hour | num_points | num_comments |
|---|---|---|
| 12 | 16.699394 | 7.585214 |
| 11 | 16.292903 | 7.374144 |
| 2 | 16.712054 | 7.180737 |
| 13 | 16.017749 | 7.146833 |
| 5 | 15.697319 | 6.78684 |


Among the top 5 hours for other posts, the hour with the most comments per post, 7.59, is __12:00 - 13:00__. However, the differences among the top 5 hours in the average number of comments are small. Interestingly, the top hour for comments on other posts, __12:00__, is the same as for show posts. Also, show posts and other posts do not differ much in the average number of comments received in their top hours.

Among the top 5 hours for other posts, the average number of points stays around __16__ and __doesn't change much__.

In addition, comparing the top hour for comments of each post type, ask posts receive considerably more comments per post than the other types: __310%__ more than show posts and __278%__ more than other posts.
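The percentages above follow from the top-hour comment averages in the three tables. A minimal sketch of the arithmetic, using a hypothetical helper named `percent_more`:

```python
def percent_more(value, baseline):
    """Return how much larger `value` is than `baseline`, in percent."""
    return (value / baseline - 1) * 100

ask_top = 28.676471    # ask posts, 15:00
show_top = 6.994186    # show posts, 12:00
other_top = 7.585214   # other posts, 12:00

print(round(percent_more(ask_top, show_top)))   # 310
print(round(percent_more(ask_top, other_top)))  # 278
```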

__12:00__ is the most common hour among the top three hours of the different post types, both for the average number of comments and for the points received.

Furthermore, comparing the average points in the top hour of each post type, ask posts (21.64) and show posts (20.91) are very close to each other and roughly __25-30%__ higher than other posts (16.70).

# Finding the Type of Posts in Top 1000 / 100 / 50 / 10 Comments and Points 

Finally, I will count how many posts of each type are in the __top 1000 / 100 / 50 / 10__ by number of comments and points received.

I will sort the whole data set in descending order by comments and by points, then look at the top 1000, 100, 50 and 10 posts and count the post types in each.

In [13]:
top_n = [10, 50, 100, 1000] # Top n post
comment_or_point = ['num_comments', 'num_points'] # For sorting the posts
for n in top_n:
    for value in comment_or_point:
        title = 'Top {} {}:'
        print(title.format(n, value.split('_')[1]), hn.sort_values(by=value, ascending=False).head(n).groupby('label').count()['title'], '\n')

Top 10 comments: label
ask      4
other    6
Name: title, dtype: int64 

Top 10 points: label
other    10
Name: title, dtype: int64 

Top 50 comments: label
ask      17
other    33
Name: title, dtype: int64 

Top 50 points: label
other    49
show      1
Name: title, dtype: int64 

Top 100 comments: label
ask      19
other    81
Name: title, dtype: int64 

Top 100 points: label
ask       1
other    96
show      3
Name: title, dtype: int64 

Top 1000 comments: label
ask       44
other    950
show       6
Name: title, dtype: int64 

Top 1000 points: label
ask       21
other    955
show      24
Name: title, dtype: int64 
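An alternative way to count the labels among the top posts is to combine `nlargest` with `value_counts`. The sketch below uses a small made-up DataFrame; the data and the helper name `label_counts_in_top` are assumptions for illustration, not results from the data set.

```python
import pandas as pd

# Made-up posts for illustration only
posts = pd.DataFrame({
    "label": ["ask", "other", "show", "other", "ask", "other"],
    "num_comments": [50, 40, 5, 30, 20, 1],
})

def label_counts_in_top(df, column, n):
    """Count how many posts of each label are among the n largest values of `column`."""
    return df.nlargest(n, column)["label"].value_counts().to_dict()

print(label_counts_in_top(posts, "num_comments", 3))  # {'other': 2, 'ask': 1}
```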



In the top comments, ask posts always outnumber show posts, whereas in the top points, except for the top 10, show posts always outnumber ask posts. Among 293,119 posts, there are 4 ask posts (__40%__) but no show posts in the top 10 by comments, and 17 ask posts (__34%__) in the top 50 by comments, which shows the power of good questions. This suggests that ask posts mostly attract comments whereas show posts mostly attract points, even though there are only __3__ show posts in the top 100 by points.
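To put these shares in context, the sketch below recomputes the percentage of ask posts in each top-n comment list from the counts printed above and compares it with the overall share of ask posts in the data set (9,139 of 293,119). The variable names are illustrative.

```python
# Number of ask posts among the top-n posts by comments (from the output above)
ask_in_top_comments = {10: 4, 50: 17, 100: 19, 1000: 44}

# Share of ask posts in each top-n list, in percent
shares = {n: 100 * count / n for n, count in ask_in_top_comments.items()}

# Share of ask posts in the whole data set, in percent
overall_share = 100 * 9139 / 293119

print(shares)                   # {10: 40.0, 50: 34.0, 100: 19.0, 1000: 4.4}
print(round(overall_share, 1))  # 3.1
```

Under this comparison, ask posts make up a much larger share of the most-commented posts than of the data set as a whole.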

# Conclusion

In this project, I analyzed the number of comments and points received by posts submitted on [Hacker News](https://news.ycombinator.com), based on post type (ask, show or other) and submission time. Additionally, I analyzed the types of the top 10, 50, 100 and 1000 posts among 293,119 posts by number of comments and points received.

Based on the analysis, to maximize both the comments and the points a post receives, I recommend submitting an __ask post__ __between 15:00 and 16:00__. However, in general, to get more points on average, submitting a __show post__ is a better option than an ask post.

For other posts, submission time has __no significant effect__ on the number of comments or points received, whereas for ask posts I observe a __significant change__ in the number of comments and points depending on the hour.

For each post type, __12:00__ is among the top 5 hours by average number of comments, with a high average number of points as well.

Among the top 100 posts by points or by comments, there are only __3 show posts__, whereas there are __20__ ask posts __(19 in top comments and 1 in top points)__, almost __7 times__ the number of show posts.