# Exploring Hacker News Posts

This project works with a data set of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.

In this project, the data set has been reduced from roughly 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

We are specifically interested in posts whose titles begin with either __Ask HN__ or __Show HN__. Users submit __Ask HN__ posts to ask the Hacker News community a specific question. Below are a few examples:

- __Ask HN__: How to improve my personal website?
- __Ask HN__: Am I the only one outraged by Twitter shutting down share counts?
- __Ask HN__: Any recent changes to CSS that broke mobile?

Likewise, users submit __Show HN__ posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:

- __Show HN__: Wio Link ESP8266 Based Web of Things Hardware Development Platform
- __Show HN__: Something pointless I made
- __Show HN__: Shanhu.io, a programming playground powered by e8vm

This project will compare these two types of posts to determine the following:

- Do __Ask HN__ or __Show HN__ posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

\* The data set can be found here https://www.kaggle.com/hacker-news/hacker-news-posts

----------------------------------------------------------

# Code

### Read the file
First, we are going to read the file, convert it into a list, then separate the header.

In [1]:
# import csv reader library to read the file
from csv import reader

In [2]:
# read the file in as a list of lists
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
hn_header = hn[0]
hn = hn[1:]

### Check the header and first 5 lines of the data set.
Confirm that the header has been separated.

In [3]:
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
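
As a minimal sketch of the same header/body split, the snippet below reads a tiny in-memory CSV instead of `hacker_news.csv`, so it runs without the data file. The two data rows and the `col` lookup dictionary are illustrative assumptions and not part of the original notebook; only the column names come from the header printed above.

```python
import io
from csv import reader

# Hypothetical two-row sample using the same column layout as hacker_news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,8/4/2016 11:52\n"
    "2,\"Show HN: Example, with a comma\",http://example.com,7,2,bob,1/26/2016 19:30\n"
)

with sample_csv as f:
    rows = reader(f)
    sample_header = next(rows)  # the first row is the header
    sample_body = list(rows)    # the remaining rows are the data

# Map column names to positions so later code can avoid hard-coded indexes
col = {name: i for i, name in enumerate(sample_header)}
print(col['title'], col['num_comments'], col['created_at'])  # prints: 1 4 6
```

Looking up `col['num_comments']` gives the same index (4) that the cells below use directly as `row[4]`.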


In [4]:
for row in hn[:5]:
    print(row)
    print('\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




### Extract Ask HN and Show HN posts
Now that the header is removed from the data set, we are ready to filter our data. Since we are only concerned with post titles beginning with *Ask HN* or *Show HN*, we will create new lists of lists containing just the data for those titles.

In [5]:
# create empty lists
ask_posts = []
show_posts = []
other_posts = []

In [6]:
# loop through each row in the data set
for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

Check the number of posts in each list.

In [7]:
print('ask_posts: {}'.format(len(ask_posts)))
print('show_posts: {}'.format(len(show_posts)))
print('other_posts: {}'.format(len(other_posts)))

ask_posts: 1744
show_posts: 1162
other_posts: 17194
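
The filter above is a case-insensitive prefix match. The sketch below isolates that rule in a hypothetical helper, `classify_title`, to show which titles it does and does not catch. The helper name and the third and fifth sample titles are assumptions made for illustration; the other titles appear earlier in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the lowercased title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'SHOW HN: uppercase prefix still matches',  # hypothetical title
    'Interactive Dynamic Video',
    'Why I ask HN for advice',                  # hypothetical: phrase is not a prefix
]

labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'show', 'other', 'other']
```

Because only the start of the title is checked, a title that mentions "ask HN" later in the sentence falls into the "other" group.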


### Find the total number and average number of comments in each list
Now that ask posts and show posts are separated, we are going to get the total number and average number of comments to determine which type of post receives more comments on average.

In [8]:
# ask_posts
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)

In [9]:
# show_posts
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
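
The two cells above repeat the same loop. One possible refactoring, shown only as a sketch, wraps the calculation in a small helper. The function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions introduced here. The last two lines re-derive the averages from the totals and post counts printed in this notebook.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the data set's column order
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '6', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '10', 'bob', '8/4/2016 12:10'],
    ['3', 'Ask HN: C?', '', '9', '2', 'carol', '8/4/2016 13:45'],
]
sample_avg = average_comments(sample_posts)
print(sample_avg)  # 6.0

# Cross-check using the totals and counts printed in this notebook
print(round(24483 / 1744, 2))  # 14.04 (ask posts)
print(round(11988 / 1162, 2))  # 10.32 (show posts)
```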

### Compare
Now, we are going to check and compare the total number and average number of comments.

In [10]:
# total number of comments
print('total_ask_comments: {}'.format(total_ask_comments))
print('total_show_comments: {}'.format(total_show_comments))

total_ask_comments: 24483
total_show_comments: 11988


In [11]:
# average number of comments
print('avg_ask_comments: {}'.format(avg_ask_comments))
print('avg_show_comments: {}'.format(avg_show_comments))

avg_ask_comments: 14.038417431192661
avg_show_comments: 10.31669535283993


In this sample, *ask posts* receive about 14 comments on average, compared with about 10 for *show posts*. Since *ask posts* attract more comments on average, we will focus the remaining analysis on these posts.

### Calculate Posts by Hour Created
Next, we will determine if ask posts created at a certain *time* are more likely to attract comments.

In [12]:
# datetime module to work with the data in the *created_at* column
import datetime as dt

In [13]:
# iterate over *ask posts* and append a two element list
result_list = []

for row in ask_posts:
    comments = int(row[4])
    created_at = row[6]
    result_list.append([created_at, comments])

In [14]:
# iterate over *result_list* to extract the hour from the date
counts_by_hour = {}
comments_by_hour = {}
datetime_format = '%m/%d/%Y %H:%M'

for row in result_list:
    date_time = row[0]
    comments = row[1]
    
    hour_time = dt.datetime.strptime(date_time, datetime_format).strftime('%H')
    
    if hour_time in counts_by_hour:
        counts_by_hour[hour_time] += 1
        comments_by_hour[hour_time] += comments
    else:
        counts_by_hour[hour_time] = 1
        comments_by_hour[hour_time] = comments
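
A variation on the loop above, offered as a sketch and not as a replacement: `collections.defaultdict` starts every missing key at zero, which removes the `if`/`else` branch. The timestamps use the same `created_at` format as the data set; the third sample entry is made up so that one hour has two posts, and all variable names prefixed with `sample_` are assumptions.

```python
import datetime as dt
from collections import defaultdict

# [created_at, comments] pairs in the same shape as result_list
sample_results = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['8/5/2016 11:05', 4],  # hypothetical entry
]

datetime_format = '%m/%d/%Y %H:%M'
sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour

for created_at, comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour as a string
    hour = dt.datetime.strptime(created_at, datetime_format).strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += comments

sample_avg = {h: sample_comments[h] / sample_counts[h] for h in sample_counts}
print(sample_avg)  # {'11': 28.0, '19': 10.0}
```

Note that `strptime` accepts the non-padded month, day, and hour values found in the data set (for example `8/4/2016` or `0:01`), while `strftime('%H')` always returns a two-digit hour.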

Create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [15]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

In [16]:
# Display the results
avg_by_hour

[['13', 14.741176470588234],
 ['07', 7.852941176470588],
 ['02', 23.810344827586206],
 ['09', 5.5777777777777775],
 ['11', 11.051724137931034],
 ['18', 13.20183486238532],
 ['06', 9.022727272727273],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['17', 11.46],
 ['23', 7.985294117647059],
 ['14', 13.233644859813085],
 ['20', 21.525],
 ['19', 10.8],
 ['15', 38.5948275862069],
 ['03', 7.796296296296297],
 ['12', 9.41095890410959],
 ['04', 7.170212765957447],
 ['21', 16.009174311926607],
 ['08', 10.25],
 ['05', 10.08695652173913],
 ['00', 8.127272727272727],
 ['16', 16.796296296296298],
 ['10', 13.440677966101696]]

In [17]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [18]:
for row in swap_avg_by_hour:
    print(row)

[14.741176470588234, '13']
[7.852941176470588, '07']
[23.810344827586206, '02']
[5.5777777777777775, '09']
[11.051724137931034, '11']
[13.20183486238532, '18']
[9.022727272727273, '06']
[11.383333333333333, '01']
[6.746478873239437, '22']
[11.46, '17']
[7.985294117647059, '23']
[13.233644859813085, '14']
[21.525, '20']
[10.8, '19']
[38.5948275862069, '15']
[7.796296296296297, '03']
[9.41095890410959, '12']
[7.170212765957447, '04']
[16.009174311926607, '21']
[10.25, '08']
[10.08695652173913, '05']
[8.127272727272727, '00']
[16.796296296296298, '16']
[13.440677966101696, '10']


Sort the list containing the hours during which posts were created and the average number of comments those posts received.

In [28]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [29]:
for row in sorted_swap:
    print(row)

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
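
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be sorted directly without building a swapped list. The four pairs below are copied from the `avg_by_hour` output above; the names `sample_avg_by_hour` and `by_average` are assumptions.

```python
# A few [hour, average] pairs taken from the avg_by_hour output
sample_avg_by_hour = [
    ['13', 14.741176470588234],
    ['15', 38.5948275862069],
    ['09', 5.5777777777777775],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first
by_average = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in by_average:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

With this approach the hour stays in the first position, so the rows keep the same layout as `avg_by_hour`.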


### Top 5 Hours for Ask Post Comments

Based on this analysis, the five hours in which *ask posts* receive the most comments on average are:

In [27]:
datetime_format = '%H'
for row in sorted_swap[:5]:
    time = dt.datetime.strptime(row[1], datetime_format).strftime('%H:%M')
    print("'{}': {:.2f} average comments per post.".format(time, row[0]))

'15:00': 38.59 average comments per post.
'02:00': 23.81 average comments per post.
'20:00': 21.52 average comments per post.
'16:00': 16.80 average comments per post.
'21:00': 16.01 average comments per post.


# Conclusion

Based on the analysis in this project, we can conclude that, to attract the most comments on average, a post should have a title beginning with 'Ask HN' and be created between 15:00 and 16:00 (in the time zone used by the data set's `created_at` column).

*Note: The "Hacker News Posts" data set from kaggle used in this project has been reduced by removing all submissions that did not receive any comments.*