
# Exploring Hacker News Posts

In this project, we will explore a dataset of submissions to the technology site [Hacker News](https://news.ycombinator.com/). We are going to focus our analysis on posts whose titles begin with `Ask HN` or `Show HN` in order to compare them and determine:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

For this purpose, we are going to use [a dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) extracted from the Hacker News site. This dataset has been reduced from almost 300,000 entries to around 20,000 by removing those submissions without any comments and then selecting a random sample from the remaining entries.

The dataset has the following columns:

- `id`: The unique identifier from Hacker News for the post.
- `title`: The title of the post.
- `url`: The URL that the post links to, if the post has a URL.
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes.
- `num_comments`: The number of comments that were made on the post.
- `author`: The username of the person who submitted the post.
- `created_at`: The date and time at which the post was submitted.

## Exploring the data

We are going to start by opening and reading the dataset:

In [1]:
from csv import reader

# Opening the file, reading it into a list of lists, and closing it automatically.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

Let's check the first rows of the file:

In [2]:
# Printing the first five rows of the dataset.
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing Headers From a List of Lists

Notice that the first row of the dataset is the header row. In order to properly analyze the data, we are going to separate the header from the rest of the dataset:

In [3]:
# Storing the header in a separate variable and then removing it from the dataset.
headers = hn[0]
hn = hn[1:]

We can check the header and the first five rows of the remaining dataset:

In [4]:
# Printing the headers variable and the first five rows of the dataset.
print(headers, '\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting Ask HN and Show HN Posts

We are going to separate the posts that start with `Ask HN` or `Show HN` from the rest, given that we want to make our analysis with those specific posts.

For this purpose, we are going to separate those posts into two different lists, and store the rest of the dataset in a third list. We start by creating three empty lists to store each kind of post in our dataset.

In [5]:
# Creating the empty lists to store each kind of post.
ask_posts = []
show_posts = []
other_posts = []


Now, we loop through the dataset and check whether each post's title starts with `Ask HN`, `Show HN`, or anything else. The comparison is made on the lowercased title, so it is case-insensitive. Each row is then stored in the corresponding list.

In [6]:
# Looping through the dataset to separate each kind of entry.
for post in hn:
    title = post[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

We can check the length of each list:

In [7]:
# Printing the length of each resulting dataset.
print('Ask HN posts:', len(ask_posts))
print('Show HN posts:', len(show_posts))
print('Other posts:', len(other_posts))

Ask HN posts: 1744
Show HN posts: 1162
Other posts: 17194
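
The prefix check used in the loop above can also be sketched as a small standalone function, which makes it easy to try on individual titles. The function name `classify_post` and the sample titles are hypothetical and are not part of the original notebook.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # Lowercase so the check is case-insensitive.
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles, used only for illustration.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'show hn: My weekend project',
    'Interactive Dynamic Video',
]

labels = [classify_post(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```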


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that we have separated the ask posts and the show posts, we can check which one received more comments on average.

In [8]:
# Average number of comments in Ask HN posts.
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments = total_ask_comments + num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments in Ask HN posts:', round(avg_ask_comments))

# Average number of comments in Show HN posts.
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments = total_show_comments + num_comments

avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments in Show HN posts:', round(avg_show_comments))

Average number of comments in Ask HN posts: 14
Average number of comments in Show HN posts: 10


On average, `Ask HN` posts received more comments (14) than `Show HN` posts (10). Since ask posts are more likely to receive comments, we are going to focus our remaining analysis on these posts.
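
Since the same calculation is repeated for both lists, it could be wrapped in a helper. The following is a minimal sketch, assuming the comment count sits at index 4 as in this dataset; the name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a non-empty list of post rows."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows following the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '8/4/2016 12:10'],
]

sample_avg = average_comments(sample_posts)
print(sample_avg)  # 15.0
```

Note that this sketch raises a `ZeroDivisionError` if the list is empty, which is the same behavior as the division in the cell above.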

## Finding the Number of Ask Posts and Comments by Hour Created

Next, we are going to determine if ask posts created at a certain time are more likely to receive comments. For this analysis, we will use the following steps:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

We will start by calculating the number of ask posts created per hour, along with the total number of comments.

In [9]:
# Importing the datetime module.
import datetime as dt

result_list = []

# Looping through ask_posts to separate the time of the post and the number of comments.
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

# Dictionaries to store the number of ask posts by hour and the number of comments by hour.
counts_by_hour = {} # Ask posts by hour
comments_by_hour = {} # Comments by hour

# Counting the number of posts by hour and the number of comments by hour.
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    date = row[0]
    comments = row[1]
    date = dt.datetime.strptime(date, date_format)
    hour = date.strftime('%H')
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
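
To see what the parsing step does on its own, here is a small sketch that runs the same `strptime` and `strftime` calls on a few rows and tallies them. It uses `collections.defaultdict` in place of the `if`/`else` check, which is a substitution for illustration and not what the notebook does; the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# Hypothetical [created_at, num_comments] pairs in the dataset's date format.
sample_rows = [
    ['8/4/2016 11:52', 52],
    ['6/17/2016 0:01', 1],
    ['9/30/2015 11:05', 8],
]

# defaultdict(int) starts every missing key at 0, so no membership check is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_rows:
    parsed = dt.datetime.strptime(created_at, date_format)
    hour = parsed.strftime('%H')  # Zero-padded hour as a string, e.g. '00' or '11'.
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'11': 2, '00': 1}
print(dict(sample_comments))  # {'11': 60, '00': 1}
```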


## Calculating the Average Number of Comments for Ask HN Posts by Hour

Now, we can use the two dictionaries created above to calculate the average number of comments for posts created during each hour of the day. We store the results in a list of `[hour, average]` pairs, dividing the total comments for each hour by the number of posts created in that hour.

In [10]:
# Creating a list with the average number of comments per hour of the day.
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour] / counts_by_hour[hour])])

# Displaying the list in ascending order of the hours of the day.
sorted(avg_by_hour)

[['00', 8.127272727272727],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['12', 9.41095890410959],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['18', 13.20183486238532],
 ['19', 10.8],
 ['20', 21.525],
 ['21', 16.009174311926607],
 ['22', 6.746478873239437],
 ['23', 7.985294117647059]]
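
The division behind each pair can be checked by hand on a tiny example. This sketch uses two hypothetical dictionaries with made-up totals and builds the list with a list comprehension, which is equivalent to the loop above.

```python
# Hypothetical totals for two hours, used only for illustration.
sample_counts = {'15': 2, '02': 4}      # Posts created in each hour.
sample_comments = {'15': 80, '02': 60}  # Comments received by those posts.

# One [hour, average] pair per hour: total comments divided by number of posts.
sample_avg_by_hour = [
    [hour, sample_comments[hour] / sample_counts[hour]]
    for hour in sample_counts
]

print(sorted(sample_avg_by_hour))  # [['02', 15.0], ['15', 40.0]]
```

Because the hours are zero-padded strings, sorting them alphabetically gives the same order as sorting them numerically.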

## Sorting and Printing Values From a List of Lists

Since we want to know which hours of the day have the highest number of comments per post, we need to sort the results above by the average number of comments instead of by hour.

First, we create a new list from `avg_by_hour` with the two columns swapped, so that the average comes first and the hour second. `sorted` compares lists element by element, so this makes it order the rows by average.

In [11]:
# Inverting the places of hours and average number of comments per hour for sorting purposes.
swap_avg_by_hour = []

for row in avg_by_hour:
    avg_comments = row[1]
    hour = row[0]
    swap_avg_by_hour.append([avg_comments, hour])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Then, we sort the new list in descending order to place the hours with the highest average number of comments on top. We can print the five hours with the highest averages in a more readable way.

In [12]:
# Sort and print the five hours with the highest average number of comments.
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Top 5 hours for ask Posts Comments')

for avg, hour in sorted_swap[:5]:
    print('{}: {:.2f} average comments per post.'.format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg))

Top 5 hours for ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
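
As an alternative to swapping the columns, `sorted` accepts a `key` function that selects the value to sort by. The sketch below is one possible variant and not the approach used in this notebook; the sample list holds three of the averages from the results above, rounded to two decimals.

```python
# Three [hour, average] pairs, rounded from the results above.
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# key=... tells sorted to compare rows by their second element (the average).
ranked = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in ranked[:2]:
    print('{}:00: {:.2f} average comments per post.'.format(hour, avg))
```

With this approach the original `[hour, average]` order is kept, so no intermediate swapped list is needed.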


According to the results, the hour in which `Ask HN` posts receive the most comments on average is 15:00 EST in the US. Posts created during that hour average 38.59 comments, considerably more than the 23.81 of the second-ranked hour, 02:00 (around 60% more average comments per post).
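
The relative difference between the top two hours can be verified with a short calculation, using the rounded averages printed above:

```python
top_avg = 38.59     # 15:00, from the printed results.
second_avg = 23.81  # 02:00, from the printed results.

# Relative increase of the top hour over the second one.
relative_increase = (top_avg - second_avg) / second_avg
print('{:.0%}'.format(relative_increase))  # 62%
```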


## Conclusions

In this project, we analyzed Hacker News posts to identify which of the posts that start with `Ask HN` or `Show HN` receive more comments per post. We also wanted to know at what hour of the day a post should be published to receive the most comments.

According to the results, posts that start with `Ask HN` receive more comments per post on average. Among those, the posts created at 15:00 (3:00 pm) EST in the US receive the most comments per post on average.