# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator. It works in a similar way to Reddit: users submit posts, and other users vote on, comment on and discuss them. Hacker News is especially popular in technology and startup circles, and posts that reach the top of its listings can draw hundreds of thousands of visitors.

The aim of this project is to determine which types of post receive more comments on average, specifically whether 'Ask HN' or 'Show HN' posts receive more comments. 'Ask HN' posts are those where a user asks the Hacker News community a question. 'Show HN' posts are those where a user shares something with the community.

Additionally, the project investigates whether posts created at certain times receive more comments.

## Opening and Exploring the Data

The data used for this project covers a 12-month period between September 2015 and September 2016. The original dataset contained almost 300,000 posts but has been reduced to approximately 20,000 posts by removing posts with no comments and then taking a random sample of the remaining entries. The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts) on Kaggle.

In [1]:
from csv import reader

# Open the file in a with-block so it is closed once it has been read
with open('hacker_news.csv', encoding='utf8') as openfile:
    hn = list(reader(openfile))  # Create a list of lists from the data

hn_header = hn[0]  # Split the header row from the main body of data
hn_data = hn[1:]

To aid in data analysis, a function `explore_data` will be used repeatedly to visualise the data more clearly. It is defined below:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # prints two blank lines after each row (print adds its own newline)

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

To begin, the first five rows of the dataset will be explored.

In [3]:
print(hn_header)
print('\n')
explore_data(hn_data, 0, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




## Filtering the Data

Below, the data is filtered into 3 lists based on whether the post is 'Ask HN', 'Show HN', or neither of the two.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_data:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Number of ask posts: ' + str(len(ask_posts)))
print('Number of show posts: ' + str(len(show_posts)))
print('Number of other posts: ' + str(len(other_posts)))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


As can be seen above, the sample contains 20,100 posts, of which only a minority (about 14%) are ask or show posts. There are about 50% more ask posts than show posts in the dataset.
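
The proportions can be checked with a little arithmetic. The sketch below hard-codes the three counts printed by the filtering cell; the `share` helper is a hypothetical name introduced only for this illustration.

```python
# Counts copied from the output of the filtering cell above.
ask_count = 1744
show_count = 1162
other_count = 17194

total_count = ask_count + show_count + other_count


def share(count, total):
    """Return count as a percentage of total (hypothetical helper)."""
    return 100 * count / total


print('Total posts:', total_count)  # 20100
print('Ask HN share: {:.1f}%'.format(share(ask_count, total_count)))    # 8.7%
print('Show HN share: {:.1f}%'.format(share(show_count, total_count)))  # 5.8%
print('Ask-to-show ratio: {:.2f}'.format(ask_count / show_count))       # 1.50
```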

## Calculating the Average Number of Comments for 'Ask HN' and 'Show HN' Posts

Next, the total and average number of comments will be determined for both the ask post and show post data:

In [5]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Total 'Ask HN' Comments: " + str(total_ask_comments))
print("Average 'Ask HN' Comments: " + str(avg_ask_comments) + '\n')

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print("Total 'Show HN' Comments: " + str(total_show_comments))
print("Average 'Show HN' Comments: " + str(avg_show_comments))

Total 'Ask HN' Comments: 24483
Average 'Ask HN' Comments: 14.038417431192661

Total 'Show HN' Comments: 11988
Average 'Show HN' Comments: 10.31669535283993


After analysing both the 'Ask HN' and 'Show HN' data as above, it can be seen that 'Ask HN' posts attract more user interaction: they have just over double the total number of comments (24,483 versus 11,988) and about 36% more comments per post on average (14.04 versus 10.32).

This is likely due to how an ask post is naturally going to create more discussion. Users are more likely to respond to a post when it is a direct question rather than a project, fact or general information sharing.
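
The two loops above repeat the same logic, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `average_comments` and the sample rows are hypothetical, and the rows only mimic the dataset's column layout. The last two lines recompute the ask-to-show ratios from the figures printed above.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) comments for a list of post rows.

    The comment count is expected as a string at `comments_index`.
    """
    total = 0
    for row in posts:
        total += int(row[comments_index])
    # Guard against an empty list to avoid a ZeroDivisionError.
    average = total / len(posts) if posts else 0.0
    return total, average


# Hypothetical sample rows in the same column layout as the dataset.
sample_posts = [
    ['1', 'Ask HN: Example question one?', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Example question two?', '', '2', '10', 'user_b', '11/22/2015 13:43'],
]

total, average = average_comments(sample_posts)
print(total, average)  # 16 8.0

# Ratios of ask to show, using the totals and averages reported above.
print(round(24483 / 11988, 2))                           # 2.04 (total comments)
print(round(14.038417431192661 / 10.31669535283993, 2))  # 1.36 (average comments)
```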

## Finding the Average Number of Comments of 'Ask HN' Posts by Hour

The following code cells calculate the average number of comments an 'Ask HN' post receives based on the hour of the day in which it was created (US Eastern Time). This makes it possible to determine which hours of the day are best for posting on Hacker News to receive the most user interaction.

The code cell below creates two frequency tables: one for the number of posts by hour (`counts_by_hour`) and one for the number of comments by hour (`comments_by_hour`).

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
counts_by_hour = {} # Create empty freq. table for posts by hour
comments_by_hour = {} # Create empty freq. table comments by hour

for row in result_list:
    time = row[0]
    time = time.split(' ')[1] # Single out time
    
    if len(time) == 4:
        time = '0' + time # Zero pad the hour where hour < 10

    time = dt.datetime.strptime(time, "%H:%M")
    hour = time.strftime("%H")
    
    if hour not in counts_by_hour: # Populate both frequency tables
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
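
As an alternative sketch of the same step, the whole timestamp can be parsed in one call, since `strptime` accepts unpadded months, days and hours. This removes the need for the manual split and zero padding. The `tally_by_hour` function and the sample rows are hypothetical and only mirror the `[created_at, num_comments]` layout of `result_list`; `collections.defaultdict` replaces the explicit membership check.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows):
    """Build post counts and comment totals keyed by zero-padded hour.

    Each row is [created_at, num_comments], as in `result_list`.
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        # Parse the full timestamp, e.g. '8/16/2016 9:55'.
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical sample rows, not taken from the dataset.
sample = [['8/16/2016 9:55', 6], ['11/22/2015 9:10', 4], ['5/2/2016 15:14', 29]]
counts, comments = tally_by_hour(sample)
print(counts)    # {'09': 2, '15': 1}
print(comments)  # {'09': 10, '15': 29}
```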

With both frequency tables created, the next cell builds a list of lists pairing each hour of the day with the average number of comments per post for that hour.

In [7]:
avg_by_hour = []

for hour in counts_by_hour:
    comments_hourly = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, comments_hourly])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

To sort this list of lists by the average number of comments, the indices must be swapped in every row so that the average comes first. The whole list can then be sorted using the `sorted()` function, with `reverse=True` to sort in descending order.

After the list has been sorted, the top 5 hours will be formatted and printed in an easy to read way.

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]
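
The swap is needed because `sorted()` compares lists element by element, starting with the first. An alternative sketch, assuming the same `[hour, average]` row layout, passes a `key` function to `sorted()` so the rows can stay in their original order. The sample below copies a handful of rows from `avg_by_hour`; the name `avg_by_hour_sample` is introduced only for this illustration.

```python
# A few [hour, average] pairs copied from `avg_by_hour` above.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# key= tells sorted() to compare rows by their second element (the average).
sorted_sample = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_sample[:3]:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

With `key`, no second list is needed and each row keeps the hour first, which can make later formatting code easier to follow.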

In [9]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for 'Ask HN' Post Comments:")

for avg, hour in sorted_swap[:5]:
    
    time_format = dt.datetime.strptime(hour, "%H")
    time_format = time_format.strftime("%H:%M")
    
    print("{time}: {average:.2f} average comments per post.".format(average=avg, time=time_format))
      

Top 5 Hours for 'Ask HN' Post Comments:
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


As can be seen above, the hour that receives the most comments per 'Ask HN' post is `15:00`, with almost 40 comments per post. This is a large increase over `02:00`, with just under 24 comments per post.

At first glance, 2AM coming in second place seems strange, but it is worth remembering that this data is in US Eastern Time. When it is 2AM US Eastern Time, it is mid-afternoon in Japan (4PM during Eastern Standard Time), for example. The best time to post may therefore depend on which timezone the target audience is in.

For a poster in the UK targeting an audience in the US, for example, the peak hour of 15:00 US Eastern Time corresponds to around 20:00 (8PM) UK time, so 'Ask HN' posts should be created at about that time to maximise user interaction.
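
The timezone conversions above can be sketched with the standard library. This is only an illustration under a stated assumption: it uses fixed standard-time offsets (EST as UTC-5, GMT as UTC, JST as UTC+9) and ignores daylight saving time. The `convert_hour` helper is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard-time offsets, ignoring daylight saving time.
EST = timezone(timedelta(hours=-5), "EST")
GMT = timezone.utc
JST = timezone(timedelta(hours=9), "JST")


def convert_hour(hour, source, target):
    """Return the hour of day in `target` for an hour of day in `source`."""
    # The date is arbitrary; only the hour is of interest here.
    moment = datetime(2016, 1, 15, hour, 0, tzinfo=source)
    return moment.astimezone(target).hour


print(convert_hour(15, EST, GMT))  # 20
print(convert_hour(2, EST, JST))   # 16
```

During US daylight saving time the Eastern offset is UTC-4, so the Japanese equivalent of 2AM would be 3PM rather than 4PM.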

## Conclusion 

From this project, it can be concluded that, among posts that receive comments, 'Ask HN' posts receive about 36% more comments on average than 'Show HN' posts (roughly 14 versus 10 comments per post). A post aiming to generate discussion is therefore better framed as an 'Ask HN' post.

It has also been determined that the best time to create an 'Ask HN' post is 15:00 US Eastern Time, which is 8PM GMT for UK users, as this maximises engagement with a US audience. It is worth remembering that the data was filtered to exclude posts with no comments, so these conclusions only apply to posts that receive comments to begin with.