# Exploring Hacker News Posts

## 1. Introduction

In this project, we'll work with a dataset of submissions to a popular technology site known as Hacker News.

Hacker News was started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments in a similar vein to Reddit. Hacker News is overly popular within technology and startup circles and hence, posts at the top of the Hacker News listings can get thousands of visitors as a result.

The original data source in Kaggle which can be accessed here contains nearly 300,000 rows in total but this has been reduced to roughly 20,000 rows by removing submissions without any comments, then randomly sampling from the remaining observations. The descriptions of the columns are as follows:

- **id**: the unique identifier from Hacker News for the post
- **title**: the title of the post
- **url**: the URL that the posts links to, if the post has a URL
- **num_points**: the number of points the post acquired, - calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: the number of comments on the post
- **author**: the username of the person who submitted the post
- **created_at**: the date and time of the post's submission

### 1.1 Loading the Data

Before we get started, we have to read in the data from the **hacker_news.csv** file as a list of lists.

In [1]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

Now let's print out the first five rows of our loaded dataset.

In [2]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## 2. Removing Headers from our Dataset

Notice that the first list in the inner lists contains the column headers, and the lists after contain the data for one row. In order to analyze our data, we need to first remove the row containing the column headers.

In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## 3. Extracting Ask HN and Show HN Posts

Having removed the headers from the **hn** dataset, we are now ready to filter our data. Given that we only care about post titles beginning with **Ask HN** or **Show HN**, we'll create new lists of lists containing only the data for these titles, respectively.

In [4]:
ask_posts, show_posts, other_posts = [], [], []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('There are {} Ask HN posts.'.format(len(ask_posts)))
print('There are {} Show HN posts.'.format(len(show_posts)))
print('There are {} other posts.'.format(len(other_posts)))

There are 1744 Ask HN posts.
There are 1162 Show HN posts.
There are 17194 other posts.


## 4. Calculating Average Number of Comments for Ask HN and Show HN Posts

Next, let's determine if ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments = 0
for post in ask_posts:
    n_comments = int(post[4])
    total_ask_comments += n_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('The average number of comments on ask posts is {:.0f}.'.format(avg_ask_comments))

total_show_comments = 0
for post in show_posts:
    n_comments = int(post[4])
    total_show_comments += n_comments
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments on show posts is {:.0f}.'.format(avg_show_comments))

The average number of comments on ask posts is 14.
The average number of comments on show posts is 10.


Based on the findings from the analysis above, it appears that ask posts receive considerably more comments on average than show posts. This makes perfect sense because users who post questions would normally expect others to comment with answers to their ask posts.

## 5. Finding the Number of Ask Posts and Comments by Hour Created

Next we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
2. Calculate the average number of comments ask posts receive by hour created

In [6]:
import datetime as dt

result_list = []
for post in ask_posts:
    create_time = post[6]
    n_comments = int(post[4])
    result_list.append([create_time, n_comments])

counts_by_hour, comments_by_hour = {}, {}
for row in result_list:
    date_str = row[0]
    date = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M')
    hour = dt.datetime.strftime(date, '%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

## 6. Calculating the Average Number of Comments for Ask HN Posts by Hour

In the previous section, we created two dictionaries:
- **counts_by_hour**: contains the number of ask posts created during each hour of the day
- **comments_by_hour**: contains the corresponding number of comments on ask posts created at each hour received

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [7]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]])
print(avg_by_hour)

[['09', 251], ['13', 1253], ['10', 793], ['14', 1416], ['16', 1814], ['23', 543], ['12', 687], ['17', 1146], ['15', 4477], ['21', 1745], ['20', 1722], ['02', 1381], ['18', 1439], ['03', 421], ['05', 464], ['19', 1188], ['01', 683], ['22', 479], ['08', 492], ['04', 337], ['00', 447], ['06', 397], ['07', 267], ['11', 641]]


## 7. Sorting and Printing Values from a List of Lists

We have calculated the average number of comments for posts created during each hour of the day, and stored the results in a list of lists named **avg_by_hour**.

However, the results are currently formatted in a way that makes if hard to identify the hours with the highest values. Let's finish off the task by sorting the list of lists and printing the five highest values in a format that's human readable.

In [11]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
# print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')

for avg, hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour, '%H')
    hour = dt.datetime.strftime(hour, '%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour, avg))

Top 5 Hours for Ask Posts Comments
15:00: 4477.00 average comments per post
16:00: 1814.00 average comments per post
21:00: 1745.00 average comments per post
20:00: 1722.00 average comments per post
18:00: 1439.00 average comments per post


Based on the above findings, the best time to create an Ask HN post is around mid-afternoon after 3pm but before 5pm. Certain hours in the evening before 10pm are also good times to post as commentors are pretty active during these hours also. Interestingly, posts created during the 3pm hour seem to receive by far the most comments; more than double that of the second best hour 4pm. This is quite likely the result of a extremely popular or viral post that happened to be posted at 3pm on a particular day which had significantly more comments than any other post, thus driving the average amount up.