# Hacker News 

In this project we work with a dataset of submissions (posts) to Hacker News. Hacker News is a popular technology site, similar to Reddit, where posts receive votes and comments.
The objective of this project is to determine:
* whether the **Ask HN** or the **Show HN** posts receive more comments on average.
* whether posts created at a certain time receive more comments on average.

Column descriptions for the Hacker News dataset: 
* id : the unique identifier from Hacker News for the post
* title : the title of the post 
* url : the URL that the post links to, if the post has a URL
* num_points : the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes. 
* num_comments : the number of comments on the post 
* author : the username of the person who submitted the post
* created_at : the date and time of the post's submission

In [454]:
# importing libraries 
import pandas as pd 
import numpy as np
from datetime import date 
import string 

In [455]:
# load hacker news dataset 
hacker_news_df = pd.read_csv('hacker_news.csv')

# Preview data 
hacker_news_df.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


In [456]:
# Get info about the entire DataFrame
hacker_news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20100 non-null  int64 
 1   title         20100 non-null  object
 2   url           17660 non-null  object
 3   num_points    20100 non-null  int64 
 4   num_comments  20100 non-null  int64 
 5   author        20100 non-null  object
 6   created_at    20100 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


The `url` column has missing values: only 17660 of the 20100 rows are non-null. Every other column is complete.
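
The missing `url` values noted above can also be counted directly rather than read off the `info()` summary. A minimal sketch, assuming a small hypothetical stand-in frame (`sample_df`) instead of the real `hacker_news.csv`:

```python
import pandas as pd

# Hypothetical stand-in for hacker_news_df; the real data comes from hacker_news.csv
sample_df = pd.DataFrame({
    'title': ['ask hn: example question', 'interactive dynamic video', 'ask hn: another question'],
    'url': [None, 'http://www.interactivedynamicvideo.com/', None],
})

# isna() flags missing entries; summing the booleans counts them
missing_urls = sample_df['url'].isna().sum()
print(f'{missing_urls} of {len(sample_df)} posts have no URL')
```

Running the same `isna().sum()` call on `hacker_news_df['url']` should agree with the non-null count that `info()` reports.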

In [457]:
hacker_news_df['title'] = hacker_news_df['title'].str.lower()
hacker_news_df.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,interactive dynamic video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,how to use open source and shut the fuck up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,florida djs may face felony for april fools' w...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,technology ventures: from idea to enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,note by note: the making of steinway l1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


In [458]:
# Create separate DataFrames for the different types of posts
ask_posts_df = hacker_news_df[hacker_news_df['title'].str.startswith('ask hn')]
show_posts_df = hacker_news_df[hacker_news_df['title'].str.startswith('show hn')]
other_posts_df = hacker_news_df[~((hacker_news_df['title'].str.startswith('ask hn')) | (hacker_news_df['title'].str.startswith('show hn')))]

ask_posts_df.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
7,12296411,ask hn: how to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55
17,10610020,ask hn: am i the only one outraged by twitter ...,,28,29,tkfx,11/22/2015 13:43
22,11610310,ask hn: aby recent changes to css that broke m...,,1,1,polskibus,5/2/2016 10:14
30,12210105,ask hn: looking for employee #3 how do i do it?,,1,3,sph130,8/2/2016 14:20
31,10394168,ask hn: someone offered to buy my browser exte...,,28,17,roykolak,10/15/2015 16:38


In [459]:
# The distribution of the posts 
print(f'The total number of posts is {len(hacker_news_df)}.')
print(f'There are {len(ask_posts_df)} ask hn posts.')
print(f'There are {len(show_posts_df)} show hn posts.')
print(f'Finally there are {len(other_posts_df)} other posts.')

The total number of posts is 20100.
There are 1744 ask hn posts.
There are 1162 show hn posts.
Finally there are 17194 other posts.
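
As an alternative to keeping three separate DataFrames, the post type can be stored in a single column and counted in one step. This is only a sketch: the `post_type` column, the `classify_title` helper, and the sample titles are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical lower-cased titles standing in for hacker_news_df['title']
sample_df = pd.DataFrame({
    'title': ['ask hn: a question', 'show hn: a project', 'some other story', 'ask hn: another question'],
})

def classify_title(title):
    """Label a lower-cased title as 'ask', 'show', or 'other' (hypothetical helper)."""
    if title.startswith('ask hn'):
        return 'ask'
    if title.startswith('show hn'):
        return 'show'
    return 'other'

# One label per row, then a count of rows per label
sample_df['post_type'] = sample_df['title'].apply(classify_title)
counts = sample_df['post_type'].value_counts()
print(counts)
```

The three counts always sum to the total number of rows, which gives a quick sanity check on the split.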


## Post engagement 

In [460]:
# Calculating average comments on ask_hn posts
total_ask_comments = ask_posts_df['num_comments'].sum()
average_ask_comments = total_ask_comments/len(ask_posts_df)

# Calculating average comments on show_hn posts
total_show_comments = show_posts_df['num_comments'].sum()
average_show_comments = total_show_comments/len(show_posts_df)

print(f'The average number of comments on an Ask HN post is {round(average_ask_comments,3)}')
print(f'The average number of comments on a Show HN post is {round(average_show_comments,3)}')

The average number of comments on an Ask HN post is 14.038
The average number of comments on a Show HN post is 10.317


On average, Ask HN posts receive more comments than Show HN posts (about 14.0 versus 10.3 per post).
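
The sum-divided-by-count calculation above is the arithmetic mean, so pandas' built-in `Series.mean()` should give the same figures. A small sketch on hypothetical sample data (the titles and comment counts below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset
sample_df = pd.DataFrame({
    'title': ['ask hn: a', 'ask hn: b', 'show hn: c', 'show hn: d'],
    'num_comments': [6, 10, 4, 8],
})

# Boolean masks for each post type
is_ask = sample_df['title'].str.startswith('ask hn')
is_show = sample_df['title'].str.startswith('show hn')

# mean() is equivalent to sum() / len() for a column with no missing values
avg_ask = sample_df.loc[is_ask, 'num_comments'].mean()
avg_show = sample_df.loc[is_show, 'num_comments'].mean()
print(f'Ask HN: {avg_ask}, Show HN: {avg_show}')
```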

In [461]:
# Determine whether Ask HN posts created at a certain time attract more comments
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
ask_posts_df = ask_posts_df.copy()

# Parse the creation timestamps into datetime values
ask_posts_df['created_at'] = pd.to_datetime(ask_posts_df['created_at'], format='%m/%d/%Y %H:%M')

ask_posts_df.groupby('created_at')


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016139C9BE30>
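
The groupby object above has not been aggregated yet, and grouping on the full `created_at` timestamp would put almost every post in its own group. One possible next step, sketched here on hypothetical sample rows, is to extract the hour of day and average `num_comments` per hour. The `hour` column and the sample values are assumptions for illustration, not results from the real dataset.

```python
import pandas as pd

# Hypothetical rows in the same 'm/d/YYYY H:MM' format as created_at
sample_ask_df = pd.DataFrame({
    'created_at': ['8/16/2016 9:55', '11/22/2015 13:43', '5/2/2016 9:14', '8/2/2016 13:20'],
    'num_comments': [6, 29, 2, 3],
})

# Parse the strings into datetime values
sample_ask_df['created_at'] = pd.to_datetime(sample_ask_df['created_at'], format='%m/%d/%Y %H:%M')

# Extract the hour of day (0-23) and average the comments within each hour
sample_ask_df['hour'] = sample_ask_df['created_at'].dt.hour
avg_by_hour = (
    sample_ask_df.groupby('hour')['num_comments']
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_hour)
```

Sorting in descending order puts the hours with the highest average comment count first, which is the comparison the second project objective asks for.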