# Exploring Y Combinator Hacker News Posts

## Introduction

Hacker News was created by the startup incubator Y Combinator. As on Reddit, users submit stories (known as posts) that other users vote and comment on. Hacker News is extremely popular in technology and startup circles, and top posts can attract hundreds of thousands of visitors.


## Data Set Description

The columns from the original data set are as follows:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post has acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
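
As a small illustration of how these columns line up with a row of data, the sketch below pairs the column names with the first sample row printed later in this notebook. It assumes the header row of the CSV uses exactly the names listed above; the variable names `header`, `sample_row` and `post` are ours and are not used elsewhere in the project.

```python
# Column names as listed above (assumed to match the CSV header row).
header = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# First sample row from the data set, as printed later in this notebook.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Pair each column name with its value so fields can be read by name.
post = dict(zip(header, sample_row))

# The csv module reads every field as a string, so numeric columns need converting.
num_comments = int(post["num_comments"])
print(post["title"], num_comments)
```

Reading fields by name can be easier to follow than positional indexes such as `rows[4]`, although the rest of this notebook keeps the positional style.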

## Project Goals

For this project, the posts that we're interested in have titles that begin with either "Ask HN" or "Show HN". "Ask HN" style posts ask the Hacker News community specific questions in a format such as:

Ask HN: Any recent changes to CSS that broke mobile?

Similarly, "Show HN" style posts are used to show the Hacker News community a project, product or something interesting. These posts use the format:

Show HN: Check out this website I made

What we would like to achieve as part of this project is to determine the following:

- Do "Ask HN" or "Show HN" posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

## File Import

To begin, we import the CSV file, which you can find [here](https://www.kaggle.com/hacker-news/hacker-news-posts/download).

In [1]:
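from csv import reader

# Read every row into a list; the with block closes the file when we are done.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

hn_header = hn[0]  # first row holds the column names
hn = hn[1:]        # remaining rows are the posts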

In [2]:
for rows in hn[:5]:
    print(rows,"\n")

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 



## Methodology

As noted above, we're only interested in posts whose titles begin with "Ask HN" or "Show HN". We need to filter the data set by creating new lists that contain only these posts.

To filter the data, we will use the string method "startswith", which returns True if a string begins with the given prefix and False otherwise. As this method is case sensitive, we first use the "lower" method to convert each title to lowercase, so that titles are matched regardless of capitalization.
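
The short sketch below shows why the lowercase conversion matters. The helper name `starts_with_prefix` is hypothetical and only used for illustration; the two titles are the example titles given earlier.

```python
def starts_with_prefix(title, prefix):
    """Return True if title begins with prefix, ignoring case."""
    return title.lower().startswith(prefix.lower())

ask_title = "Ask HN: Any recent changes to CSS that broke mobile?"
show_title = "Show HN: Check out this website I made"

print(starts_with_prefix(ask_title, "ask hn"))    # True
print(starts_with_prefix(show_title, "ask hn"))   # False

# Without lowercasing, the comparison is case sensitive and the match fails.
print(ask_title.startswith("ask hn"))             # False
```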

In [3]:
'''
We begin by creating three empty lists
'''

ask_posts = []
show_posts = []
other_posts = []

'''
We loop through each row in the hackernews data set and we assign the title
in each row to a variable named title and convert the title to lowercase. 

If the title begins with "ask hn" we add the row to the ask_posts list and if the
title begins with "show hn" we add that row to the show_posts list. Otherwise it
will go to the other_posts list.
'''

for rows in hn:
    title = rows[1].lower()

    # startswith already returns True or False, so no comparison is needed
    if title.startswith('ask hn'):
        ask_posts.append(rows)
    elif title.startswith('show hn'):
        show_posts.append(rows)
    else:
        other_posts.append(rows)


In [4]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Now that we have separated the posts, we can see that our data set contains 1,744 "Ask HN" posts, 1,162 "Show HN" posts and 17,194 other posts.

We will now determine whether "Ask HN" or "Show HN" posts receive more comments on average.

In [5]:
'''
We create a counter called total_ask_comments and set it at 0. For each row in 
the ask_posts list, we add the number of comments to the counter.
'''

total_ask_comments = 0

for rows in ask_posts:
    num_comments = rows[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments

In [6]:
print(total_ask_comments)

24483


In [7]:
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [8]:
total_show_comments = 0

for rows in show_posts:
    num_comments = rows[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

print(total_show_comments)

11988


In [9]:
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


As we can see, "Ask HN" posts receive almost 4 more comments per post on average than "Show HN" posts (about 14.04 versus 10.32). This may be because people are more willing to give advice when asked than they are to provide feedback proactively.
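
The two averages above repeat the same steps, so they could be wrapped in a single helper. The sketch below is one possible version, not the code used to produce the figures above. The function name `average_comments` is hypothetical, and the sample rows are illustrative: only the comment counts and timestamps come from the data set, while the ids, titles and authors are placeholders.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Illustrative rows: placeholder ids, titles and authors.
sample_posts = [
    ["1", "Ask HN: placeholder A", "", "0", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: placeholder B", "", "0", "29", "user_b", "11/22/2015 13:43"],
    ["3", "Ask HN: placeholder C", "", "0", "1", "user_c", "5/2/2016 10:14"],
]

print(average_comments(sample_posts))  # 12.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages computed above.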

## Are posts created at a certain time more likely to attract comments?

Now that we have determined that "Ask HN" posts attract more comments on average, we want to investigate whether posts created at certain times attract more comments.

To do this we will calculate the number of "Ask HN" posts created in each hour of the day and the number of comments they receive.

We will then calculate the average number of comments per post for each hour.

In [19]:
'''
We need to import the datetime module and create an empty list. 

For each row in the ask_posts list, we take out the created at time and the
number of comments and create a pair to put into the new list.

'''

import datetime as dt

result_list = []

for rows in ask_posts:
    created_at = rows[6]
    num_comments = rows[4]
    
    
    num_comments = int(num_comments)
    
    time_comment_pair = [created_at, num_comments]
    
    result_list.append(time_comment_pair)


In [23]:
result_list[:3]

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

In [29]:
'''
We now create two empty dictionaries.

We need to create a datetime object from the created_at data so we use the
datetime.strptime() method to parse the date and create a datetime object.

We then use the datetime.strftime() method to select just the hour from the 
datetime object. 

From there if the hour selected is not in the counts_by_hour dictionary, we 
add it as a new key with a count of 1. We do the same for comment_by_hour 
except we set the initial value to the number of comments on that post.

If the hour is already in the counts_by_hour dictionary, we increase the count 
by 1 and also add the number of comments to the comment_by_hour total.

'''



counts_by_hour = {}
comment_by_hour = {}

for rows in result_list:
    dt_x = rows[0]
    dt_x = dt.datetime.strptime(dt_x, "%m/%d/%Y %H:%M")
    
    dt_hour = dt_x.strftime("%H")
    
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1
        comment_by_hour[dt_hour] = rows[1]
    else:
        counts_by_hour[dt_hour] += 1
        comment_by_hour[dt_hour] += rows[1]

In [30]:
print(counts_by_hour)

{'13': 85, '16': 108, '14': 107, '07': 34, '00': 55, '17': 100, '02': 58, '03': 54, '05': 46, '15': 116, '04': 47, '11': 58, '21': 109, '08': 48, '09': 45, '18': 109, '12': 73, '10': 59, '06': 44, '20': 80, '19': 110, '01': 60, '22': 71, '23': 68}


In [35]:
print(comment_by_hour)

{'13': 1253, '16': 1814, '14': 1416, '07': 267, '00': 447, '17': 1146, '02': 1381, '03': 421, '05': 464, '15': 4477, '04': 337, '11': 641, '21': 1745, '08': 492, '09': 251, '18': 1439, '12': 687, '10': 793, '06': 397, '20': 1722, '19': 1188, '01': 683, '22': 479, '23': 543}
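
The same bookkeeping can be written without the explicit membership check. The sketch below is an alternative under the same assumptions about the date format; it uses `collections.defaultdict`, which the code above does not, and runs on the three sample pairs shown earlier rather than the full list.

```python
import datetime as dt
from collections import defaultdict

# First three [created_at, num_comments] pairs from result_list.
pairs = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

# Missing keys start at 0, so no "is the hour already present?" check is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in pairs:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```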


Now that we have our dictionaries, we can calculate the average number of comments per post for each hour.

In [33]:
avg_by_hour = []

for item in counts_by_hour:
    avg_by_hour.append([item,comment_by_hour[item] / counts_by_hour[item]])

In [43]:
print(avg_by_hour)

[['13', 14.741176470588234], ['16', 16.796296296296298], ['14', 13.233644859813085], ['07', 7.852941176470588], ['00', 8.127272727272727], ['17', 11.46], ['02', 23.810344827586206], ['03', 7.796296296296297], ['05', 10.08695652173913], ['15', 38.5948275862069], ['04', 7.170212765957447], ['11', 11.051724137931034], ['21', 16.009174311926607], ['08', 10.25], ['09', 5.5777777777777775], ['18', 13.20183486238532], ['12', 9.41095890410959], ['10', 13.440677966101696], ['06', 9.022727272727273], ['20', 21.525], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['23', 7.985294117647059]]


We now have the results we're after, but they are hard to read. We will sort them from highest to lowest average and print them in a friendlier format.

In [42]:
swap_avg_by_hour = []

for rows in avg_by_hour:
    first_element = rows[0]
    second_element = rows[1]
    
    swap_avg_by_hour.append([second_element,first_element])

print(swap_avg_by_hour)

[[14.741176470588234, '13'], [16.796296296296298, '16'], [13.233644859813085, '14'], [7.852941176470588, '07'], [8.127272727272727, '00'], [11.46, '17'], [23.810344827586206, '02'], [7.796296296296297, '03'], [10.08695652173913, '05'], [38.5948275862069, '15'], [7.170212765957447, '04'], [11.051724137931034, '11'], [16.009174311926607, '21'], [10.25, '08'], [5.5777777777777775, '09'], [13.20183486238532, '18'], [9.41095890410959, '12'], [13.440677966101696, '10'], [9.022727272727273, '06'], [21.525, '20'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [7.985294117647059, '23']]


In [40]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

In [41]:
for rows in sorted_swap:
    dt_x = rows[1]
    dt_x = dt.datetime.strptime(dt_x,"%H")
    
    dt_y = rows[0]
    
    dt_hour = dt_x.strftime("%H:%M")
    
    print("{0}: {1:.2f} average comments per post".format(dt_hour,dt_y))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post
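
The swap-then-sort steps above could also be replaced by sorting directly on the second element of each pair. The sketch below shows this with the `key` argument of `sorted`, using five of the hourly averages from the output above; the names `sample_avg_by_hour` and `top_hours` are ours.

```python
# A subset of the [hour, average] pairs computed earlier.
sample_avg_by_hour = [
    ['13', 14.741176470588234],
    ['16', 16.796296296296298],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
    ['20', 21.525],
]

# Sort on the average (index 1) instead of swapping the columns first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print("{0}:00: {1:.2f} average comments per post".format(hour, average))
```

Sorting with a key leaves the pairs in their original `[hour, average]` order, so no second list is needed.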


We can see from the above that the best times to create "Ask HN" posts are 3pm, 2am and 8pm US Eastern Standard Time (EST). The top-ranked time of 3pm EST is noon on the US West Coast, roughly when people in LA, a hub of tech and startup activity, are on their lunch break. It is also 8pm in London, when people would typically be at home.

We would expect this to be a good time to post, given that San Francisco, which is in the same time zone as LA, is a major hub of tech activity.
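
The time zone arithmetic behind this interpretation can be checked with a short sketch. It assumes, as stated above, that the timestamps are in US Eastern Standard Time, and it uses fixed UTC offsets, so daylight saving time is ignored. The date is arbitrary and only the hour matters.

```python
import datetime as dt

# Fixed offsets from UTC (assumption: standard time, no daylight saving).
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
PST = dt.timezone(dt.timedelta(hours=-8), "PST")
GMT = dt.timezone(dt.timedelta(hours=0), "GMT")

# 3pm Eastern on an arbitrary date.
post_time = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

for zone in (PST, GMT):
    local_time = post_time.astimezone(zone)
    print(local_time.strftime("%H:%M"), local_time.tzname())
```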