# Exploring Hacker News Posts
Hacker News is a site where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. In this project, we'll compare two types of posts from Hacker News: Ask HN and Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll specifically compare these two types of posts to determine the following:

    Do Ask HN or Show HN posts receive more comments on average?
    Do posts created at a certain time receive more comments on average?

Note that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
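
The reduction described above can be sketched in pandas as follows. This is only an illustration on a small made-up frame: the `raw` data, the sample size, and the `random_state` seed are assumptions, not the procedure actually used to build the data set.

```python
import pandas as pd

# Toy stand-in for the full data set (values are invented for illustration)
raw = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f"],
    "num_comments": [0, 3, 0, 7, 1, 12],
})

# Step 1: drop submissions that received no comments
commented = raw[raw["num_comments"] > 0]

# Step 2: randomly sample from what remains (n and the seed are arbitrary here)
sample = commented.sample(n=3, random_state=0)

print(len(commented), len(sample))
```

Because every zero-comment post is removed before sampling, any average computed later describes only posts that received at least one comment.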

# Exploring the dataset

The dataset comes from the following link:
https://www.kaggle.com/hacker-news/hacker-news-posts

In [1]:
import pandas as pd

In [2]:
hn = pd.read_csv("hacker_news.csv")
hn.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


# Extract 'ask hn' and 'show hn' posts
Identify the 'ask hn' and 'show hn' posts by their title prefix and create a column 'posts' that indicates the type of each post.

In [3]:
# Lowercase the titles once so that the prefix checks are case-insensitive
titles = hn['title'].str.lower()
# Start with every post labelled 'other_posts', then relabel by title prefix
hn['posts'] = 'other_posts'
hn.loc[titles.str.startswith('ask hn'), 'posts'] = 'ask_posts'
hn.loc[titles.str.startswith('show hn'), 'posts'] = 'show_posts'
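
The same prefix rule can also be written as a small function, which makes it easy to check on a handful of titles. The helper name `classify_title` and the toy titles below are hypothetical and not part of the notebook.

```python
import pandas as pd

def classify_title(title):
    """Return the post type implied by a title's prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask_posts"
    if lowered.startswith("show hn"):
        return "show_posts"
    return "other_posts"

# Invented titles, used only to check the labelling
toy = pd.DataFrame({"title": ["Ask HN: Best course?", "SHOW HN: My app", "Some news story"]})
toy["posts"] = toy["title"].apply(classify_title)
print(toy["posts"].tolist())
```

Lowercasing before the comparison means titles such as "SHOW HN" and "show hn" land in the same category.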

# Calculating the number of posts and the sum and average of comments per post type

In [4]:
hn['posts'].value_counts()

other_posts    17194
ask_posts       1744
show_posts      1162
Name: posts, dtype: int64

In [5]:
# Select the column before aggregating so that only num_comments is summed
hn.groupby('posts')['num_comments'].sum()

posts
ask_posts       24483
other_posts    462055
show_posts      11988
Name: num_comments, dtype: int64

In [6]:
# Select the column first: averaging the text columns fails in recent pandas versions
hn.groupby('posts')['num_comments'].mean()

posts
ask_posts      14.038417
other_posts    26.873037
show_posts     10.316695
Name: num_comments, dtype: float64

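Of the two types, Ask HN posts receive more comments on average: about 14.0 per post, versus about 10.3 for Show HN posts. Posts in neither category average more still, at about 26.9.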
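
The count, total, and mean per post type can also be computed in a single pass with `agg`. The sketch below uses an invented four-row frame rather than the real data, so the numbers are purely illustrative.

```python
import pandas as pd

# Invented rows, used only to show the shape of the result
toy = pd.DataFrame({
    "posts": ["ask_posts", "ask_posts", "show_posts", "other_posts"],
    "num_comments": [10, 20, 6, 40],
})

# One row per post type, one column per statistic
summary = toy.groupby("posts")["num_comments"].agg(["count", "sum", "mean"])
print(summary)
```

Keeping the three statistics side by side makes it easier to see whether a high average rests on many posts or only a few.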

# Calculating the time of day at which ask_posts receive more comments
Create a column 'hour' that records the hour part of the 'created_at' column.

In [7]:
# Take an explicit copy so that adding columns does not raise SettingWithCopyWarning
hn_ask_posts = hn[hn['posts'] == 'ask_posts'].copy()
hn_ask_posts['created_at'] = pd.to_datetime(hn_ask_posts['created_at'])
# Zero-padded hour of the day as a string, e.g. '09' or '15'
hn_ask_posts['hour'] = hn_ask_posts['created_at'].dt.strftime('%H')
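
As a standalone illustration of the hour extraction, the sketch below parses three timestamps copied from the `created_at` values shown earlier. Passing an explicit `format` is an assumption about the column layout (month/day/year, 24-hour clock); the notebook itself lets pandas infer it.

```python
import pandas as pd

# Timestamps in the same layout as the created_at column
stamps = pd.Series(["8/4/2016 11:52", "1/26/2016 19:30", "9/30/2015 4:12"])

# An explicit format avoids ambiguity between month-first and day-first dates
parsed = pd.to_datetime(stamps, format="%m/%d/%Y %H:%M")

hours_as_int = parsed.dt.hour            # integer hours
hours_as_str = parsed.dt.strftime("%H")  # zero-padded strings, as used in this notebook
print(hours_as_int.tolist(), hours_as_str.tolist())
```

The zero-padded strings sort in clock order, which is why the grouped output lists the hours from 00 to 23.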


In [8]:
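# Number of Ask HN posts per hour (not displayed: a cell shows only its last expression)
hn_ask_posts.groupby('hour')['num_comments'].count()
# Total number of comments on Ask HN posts per hour
hn_ask_posts.groupby('hour')['num_comments'].sum()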

hour
00     447
01     683
02    1381
03     421
04     337
05     464
06     397
07     267
08     492
09     251
10     793
11     641
12     687
13    1253
14    1416
15    4477
16    1814
17    1146
18    1439
19    1188
20    1722
21    1745
22     479
23     543
Name: num_comments, dtype: int64

Calculate the mean of 'num_comments' for each hour.

In [9]:
# Select num_comments before averaging so the datetime and text columns are not averaged
hn_avg_ask_posts = hn_ask_posts.groupby('hour')['num_comments'].mean()

In [10]:
# Five hours with the highest average number of comments
hn_avg_ask_posts.sort_values(ascending=False).head(5)

hour
15    38.594828
02    23.810345
20    21.525000
16    16.796296
21    16.009174
Name: num_comments, dtype: float64
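
An hourly mean can be pulled up by a single very popular post, so it may help to look at the number of posts and the median next to it. The frame below is invented to show the effect; the column names `posts`, `mean`, and `median` in the result are choices made for this sketch only.

```python
import pandas as pd

# Invented Ask HN posts: hour '15' has one very popular post among three
toy = pd.DataFrame({
    "hour": ["15", "15", "15", "02", "02"],
    "num_comments": [100, 5, 3, 20, 24],
})

# Post count, mean, and median per hour, ranked by the mean
by_hour = (
    toy.groupby("hour")["num_comments"]
    .agg(posts="count", mean="mean", median="median")
    .sort_values("mean", ascending=False)
)
print(by_hour)
```

In this toy data, hour 15 ranks first by mean but has the lower median, which is the kind of pattern worth checking for before reading much into a single peak hour.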

# Conclusion

Because the data set we analyzed excludes posts without any comments, our conclusions apply only to posts that received comments. Among those posts, Ask HN posts received more comments on average than Show HN posts, and Ask HN posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm EST) received the most comments on average.
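
Readers outside the Eastern time zone may want the 15:00 to 16:00 window in their own local time. The sketch below assumes the timestamps are Eastern Standard Time (UTC-5), as stated above, and ignores daylight saving time; the example date and the CET target zone are arbitrary choices.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are Eastern Standard Time (UTC-5); daylight saving is ignored
EST = timezone(timedelta(hours=-5), "EST")
# Hypothetical reader zone, here Central European Time (UTC+1)
CET = timezone(timedelta(hours=1), "CET")

window_start = datetime(2016, 1, 26, 15, 0, tzinfo=EST)  # arbitrary winter date
window_end = window_start + timedelta(hours=1)

# Print the same one-hour window in each target zone
for label, zone in (("UTC", timezone.utc), ("CET", CET)):
    start = window_start.astimezone(zone)
    end = window_end.astimezone(zone)
    print(f"{label}: {start:%H:%M} - {end:%H:%M}")
```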