# Hacker News Posts

Hacker News is a site similar to Reddit where users submit topics that other users can upvote, discuss, and comment on. It is a popular forum for discussing startups and technology.

The dataset originally has about 300k rows from 2016. After removing submissions with no comments, about 80k posts remain. The original dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts) and has the following columns:

* **id**: The unique identifier from Hacker News for the post
* **title**: The title of the post
* **url**: The URL that the post links to, if the post has a URL
* **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments**: The number of comments that were made on the post
* **author**: The username of the person who submitted the post
* **created_at**: The date and time at which the post was submitted
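
The code in this notebook reads fields by position (for example `row[4]` for `num_comments`). As a small sketch, assuming the header row matches the column list above, a name-to-index lookup makes those positions explicit. The names `HEADERS`, `COL`, and `sample_row` are illustrative and not part of the original notebook.

```python
# Header row, assumed to match the column list above.
HEADERS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row.
COL = {name: index for index, name in enumerate(HEADERS)}

# One Ask HN row taken from the dataset.
sample_row = ['12578908', 'Ask HN: What TLD do you use for local development?', '',
              '4', '7', 'Sevrene', '9/26/2016 2:53']

print(COL['num_comments'])                   # position of the comment count
print(int(sample_row[COL['num_comments']]))  # comment count as an integer
```

With such a lookup, `row[COL['created_at']]` could replace `row[6]` wherever the meaning of the index is not obvious.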

There are two topics that we're interested in: 'Show HN' and 'Ask HN'. In particular, we want to answer the following questions:
* Do these two topics receive more comments on average than other topics?
* Do posts created at certain times receive more comments?


In [2]:
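from csv import reader

# Read the whole CSV into a list of rows (each row is a list of strings)
opened_file = open('HN_posts_year_to_Sep_26_2016.csv')
read_file = reader(opened_file)
hn_data = list(read_file)
opened_file.close()

# The first row holds the column names
headers = hn_data[0]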
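
As an alternative sketch of the loading step, the header can be separated from the data rows in a small helper. `read_posts` is a hypothetical name, and the in-memory sample below only stands in for the real file so the example runs without the CSV.

```python
import csv
import io

def read_posts(lines):
    """Split CSV input into (header, data_rows). Hypothetical helper."""
    rows = list(csv.reader(lines))
    return rows[0], rows[1:]

# In-memory stand-in for the real file; the quoted title contains a comma
# to show that csv.reader keeps it as a single field.
sample = io.StringIO('id,title,num_comments\n1,"Ask HN: a, b",3\n')
header, posts = read_posts(sample)

print(header)
print(posts)
```

With the real file, the helper could be called inside a `with open('HN_posts_year_to_Sep_26_2016.csv', newline='', encoding='utf-8') as f:` block, which closes the file automatically. The UTF-8 encoding is an assumption about the file, not something the notebook states.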

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(hn_data, 0, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']




In [8]:
len(hn_data)

293120

## Remove topics with 0 comments

In [12]:
hn = []

for row in hn_data[1:]:
    num_comments = int(row[4])
    if num_comments > 0:
        hn.append(row)

In [13]:
len(hn)

80401
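
The same filter can be written as a list comprehension. This is a minimal sketch, assuming `num_comments` sits at index 4 as in the header row; `with_comments` is a hypothetical helper name.

```python
def with_comments(rows, comments_index=4):
    """Keep only rows whose comment count is greater than zero."""
    return [row for row in rows if int(row[comments_index]) > 0]

# Two rows taken from the dataset: one without comments, one with seven.
sample = [
    ['12579005', 'SQLAR  the SQLite Archiver',
     'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'],
    ['12578908', 'Ask HN: What TLD do you use for local development?',
     '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
]

kept = with_comments(sample)
print(len(kept))
```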

## Split data into Ask, Show and Other

In [25]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
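    else:
        other_posts.append(row)

# Report the size of each group
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))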

6911
5059
68431
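
The prefix test used for the split can be isolated in a small function, which makes it easy to check on individual titles. This is a sketch only; `classify` and `groups` are hypothetical names, and the titles are taken from the dataset.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Cursor that Screenshot',
    'Saving the Hassle of Shopping',
]

# Group the titles by their classification.
groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    groups[classify(title)].append(title)

for name, members in groups.items():
    print(name, len(members))
```

Because the title is lowercased before the comparison, variants such as 'ASK HN' land in the same group.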


In [26]:
explore_data(ask_posts, 0, 5)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']


['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']




In [27]:
explore_data(show_posts, 0, 5)

['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']


['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']


['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']


['12575471', 'Show HN: Project-Okot: Novel, CODE-FREE data-apps in mere seconds', 'https://studio.nuchwezi.com/', '3', '1', 'nfixx', '9/25/2016 14:30']


['12574773', 'Show HN: Cursor that Screenshot', 'http://edward.codes/cursor-that-screenshot', '3', '3', 'ed-bit', '9/25/2016 10:50']




In [28]:
explore_data(other_posts, 0, 5)

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']


['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']


['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']


['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']


['12578556', 'OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation', 'https://openmw.org/en/', '32', '3', 'rocky1138', '9/26/2016 1:24']




In [29]:
len_ask = len(ask_posts)

total_ask_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

print(total_ask_comments/len_ask)

13.744175951381855


In [30]:
len_show = len(show_posts)

total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
print(total_show_comments/len_show)

9.810832180272781


In [31]:
len_other = len(other_posts)

total_other_comments = 0

for row in other_posts:
    comments = int(row[4])
    total_other_comments += comments
    
print(total_other_comments/len_other)

25.838318890561297
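
The three cells above repeat the same loop for each group. As a sketch, the calculation could be factored into one helper; `average_comments` is a hypothetical name and index 4 is assumed to hold `num_comments`.

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments per post for a list of rows."""
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)  # raises ZeroDivisionError for an empty list

# Two Ask HN rows taken from the dataset (7 and 3 comments).
sample = [
    ['12578908', 'Ask HN: What TLD do you use for local development?',
     '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?',
     '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample))
```

Applied to the full lists, `average_comments(ask_posts)`, `average_comments(show_posts)`, and `average_comments(other_posts)` would compute the same three averages.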


## Avg. Comments by Topic

* Other Topics: ~26 
* Ask Topics: ~14
* Show Topics: ~10

Among posts with at least one comment, other topics receive more comments on average (~26) than Ask HN topics (~14) and Show HN topics (~10).

In [34]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])

# Check the timestamp format of the first entry
print(result_list[0][0])

9/26/2016 2:53


In [35]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    # Parse the timestamp, then keep the zero-padded hour (e.g. '03') as the key
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

In [51]:
avg_by_hour = {}

for hr in comments_by_hour:
    avg_by_hour[hr] = comments_by_hour[hr] / counts_by_hour[hr]
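
The counting and averaging steps can also be combined with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is a sketch under stated assumptions: `average_by_hour` is a hypothetical helper, it keys the result by integer hour rather than by zero-padded string, and the last pair in the sample is invented so that one hour has two posts.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Return {hour: average comments per post} from [created_at, comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).hour  # integer 0..23
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}

sample = [
    ['9/26/2016 2:53', 7],   # from the dataset
    ['9/26/2016 1:17', 3],   # from the dataset
    ['9/25/2016 22:48', 3],  # from the dataset
    ['9/25/2016 2:10', 1],   # invented for illustration
]

print(average_by_hour(sample))
```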

In [62]:
for k in range(24):
    hour_key = str(k).zfill(2)  # keys are zero-padded hour strings such as '03'
    print(k, round(avg_by_hour[hour_key], 2), "avg. comments per post")

0 9.86 avg. comments per post
1 9.37 avg. comments per post
2 13.2 avg. comments per post
3 10.16 avg. comments per post
4 12.69 avg. comments per post
5 11.14 avg. comments per post
6 9.02 avg. comments per post
7 10.1 avg. comments per post
8 12.43 avg. comments per post
9 8.39 avg. comments per post
10 13.76 avg. comments per post
11 11.14 avg. comments per post
12 15.45 avg. comments per post
13 22.22 avg. comments per post
14 13.15 avg. comments per post
15 39.67 avg. comments per post
16 10.76 avg. comments per post
17 13.73 avg. comments per post
18 10.79 avg. comments per post
19 9.41 avg. comments per post
20 11.38 avg. comments per post
21 11.06 avg. comments per post
22 11.75 avg. comments per post
23 8.32 avg. comments per post


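Hours with the highest average number of comments per Ask HN post:
* 3 PM: 39.67
* 1 PM: 22.22
* 12 PM: 15.45
* 10 AM: 13.76
* 5 PM: 13.73

Ask HN posts created around lunchtime receive more comments on average, and the average peaks for posts created at 3 PM. This hourly breakdown covers Ask HN posts only.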
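
The ranking above can be produced by sorting the hourly averages instead of reading them off by eye. This is a minimal sketch; `top_hours` is a hypothetical helper, and `avg_sample` holds a subset of the hourly averages printed earlier.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n (hour, average) pairs with the highest averages, best first."""
    ranked = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Subset of the hourly averages for Ask HN posts, keyed by zero-padded hour.
avg_sample = {'10': 13.76, '12': 15.45, '13': 22.22, '15': 39.67, '17': 13.73, '23': 8.32}

for hour, avg in top_hours(avg_sample, n=3):
    print(f"{hour}:00 {avg:.2f} average comments per post")
```

Calling `top_hours(avg_by_hour)` on the full dictionary would rank all 24 hours the same way.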