# Exploring Hacker News Posts

Hacker News is a popular site where users submit and discuss posts on technology and startup-related topics. This project looks at two types of posts: ``Ask HN`` and ``Show HN``.

The aim of this project is to determine which of these two post types receives more comments on average, and whether the time of posting influences the average number of comments received.

The original data set can be found on [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). For this project, the data set was reduced to approximately 20,000 rows by eliminating posts that didn't receive any comments.

# Introduction

Let's start by reading in the file as a list of lists.

In [1]:
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


We can see that each row of the data set contains the post's title, the number of comments it received, and its creation date, along with a few other columns.
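As a side note, the columns can also be read by name rather than by position. The sketch below is only an illustration: it uses `csv.DictReader` on a small in-memory sample (two rows copied from the output above) instead of the real file, and the names `sample_csv`, `rows`, and `first` are our own.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv, using two rows from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader uses the header row as keys, so each row becomes a dict.
rows = list(csv.DictReader(sample_csv))
first = rows[0]
print(first["title"], int(first["num_comments"]))
```

The rest of this project keeps the list-of-lists layout, where `row[1]` is the title, `row[4]` the number of comments, and `row[6]` the creation date.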

# Remove Header from Data Set

Next, we'll remove the first row of data containing column names to focus on the data rows.

In [2]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


# Extract Ask HN and Show HN Posts

In this step, we'll loop through the posts, identify each post's type by checking its lowercased title with the `startswith()` string method, and append it to the ``ask_posts``, ``show_posts``, or ``other_posts`` list.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
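The same prefix check can be wrapped in a small helper, which makes the rule easy to test on its own. This is a minimal sketch: `classify_post` is a hypothetical name, and the first two titles below are made up for illustration.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # make the check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles, except the last one, which appears in the data set.
print(classify_post("Ask HN: How do you learn a new language?"))
print(classify_post("SHOW HN: My weekend project"))
print(classify_post("Interactive Dynamic Video"))
```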


# Ask HN and Show HN: Calculate the Average Number of Comments

Next, we'll calculate the average number of comments for both types of posts.

In [4]:
# Ask HN posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
    

14.038417431192661


In [5]:
# Show HN posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


``Ask HN`` posts receive approximately 14 comments on average, while ``Show HN`` posts receive approximately 10. Since ``Ask HN`` posts receive more comments, we'll focus the remainder of our analysis on this post type.
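Since the two loops above differ only in the list they iterate over, the calculation could be factored into one function. The following is a sketch under our own naming (`average_comments` and `sample_posts` are hypothetical), run on two made-up rows that follow the data set's column order.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '4', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(sample_posts))  # (10 + 4) / 2 = 7.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.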

# Ask HN Posts

## Find the Number of Posts and Comments Created by the Hour

In this section, we'll find out if the creation time of posts influences the number of comments.

Let's determine whether we can maximize the number of comments an ``Ask HN`` post receives by creating it at a certain time. First, we'll find out how many ``Ask HN`` posts are created during each hour of the day, together with the number of comments they received. Then, we'll calculate the average number of comments per post for every hour.

In [33]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
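The `if`/`else` bookkeeping in the cell above can be shortened with `collections.defaultdict`, which creates a missing key with a default of `0` on first access. This is a sketch on a tiny sample, not the full data set: the first two pairs come from rows shown earlier, and the third is made up so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# [created_at, num_comments] pairs; the last one is a made-up example.
sample = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 1],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```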

## Calculate the Average Number of Comments by the Hour

In [8]:
avg_by_hour = []

# Each key of comments_by_hour is an hour string such as '15'
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting the List

The list above shows the average number of comments for each hour, but it isn't easy to read. We'll therefore sort it in descending order of the average number of comments. To do this, we'll swap the columns of ``avg_by_hour`` so that the average comes first in each pair, then pass the result to the ``sorted()`` function, which compares lists element by element starting with the first.

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
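Swapping the columns isn't the only option. `sorted()` also accepts a `key` function, so the original `[hour, average]` pairs can be sorted by their second element directly. A minimal sketch on three of the pairs from above (`avg_sample` and `sorted_sample` are our own names):

```python
# Three [hour, average] pairs taken from the avg_by_hour output above.
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the average (index 1) in descending order, keeping the column order.
sorted_sample = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
print(sorted_sample)
```

This keeps the hour in the first column, so a later loop would unpack each pair as `hour, avg` rather than `avg, hour`.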

In [32]:
print("Top 5 Hours for Ask Post Comments")

for avg, time in sorted_swap[:5]:
    
    time = dt.datetime.strptime(time, "%H").strftime("%H:%M:")
    
    print(
        time,"{:.2f} average comments per post".format(avg)
        )

Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour with the highest average number of comments per post is 15:00 (3pm) with 38.59 comments, followed by 02:00 (2am) with 23.81 comments. Posts created at 15:00 average about 62% more comments than those created at 02:00.
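The relative difference between the top two hours can be checked with a short calculation, using the two averages from the sorted list (the variable names are our own):

```python
top = 38.5948275862069       # average for 15:00
second = 23.810344827586206  # average for 02:00

# Relative increase of the top hour over the runner-up, in percent.
pct_increase = (top - second) / second * 100
print(f"{pct_increase:.1f}%")
```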

According to the data set documentation, the times are given in Eastern Time in the US (EST, UTC-5), which is 13 hours behind Malaysia Time (MYT, UTC+8).
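To translate a posting hour into local time, the standard library's fixed-offset time zones are enough for a rough conversion. The sketch below assumes a constant UTC-5 offset for Eastern Time, so it ignores daylight saving time, and the date is arbitrary, chosen only for illustration.

```python
import datetime as dt

# Fixed offsets only: this deliberately ignores daylight saving time.
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
MYT = dt.timezone(dt.timedelta(hours=8), "MYT")

# 15:00 EST on an arbitrary date.
posted_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)
posted_myt = posted_est.astimezone(MYT)

print(posted_myt.strftime("%Y-%m-%d %H:%M %Z"))  # 2016-01-27 04:00 MYT
```

Under these assumptions, the 15:00 EST slot corresponds to 04:00 the following day in MYT.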

# Conclusion

In this project, we analyzed two types of posts on Hacker News (``Ask HN`` and ``Show HN``) to determine which one receives more comments on average, and whether creating a post at a certain time would attract more comments. Based on our analysis, to maximize the number of comments, we recommend submitting ``Ask HN`` posts between 15:00 and 16:00 (3pm to 4pm EST).

However, this analysis didn't take into account posts without comments. It is therefore more accurate to conclude that, among posts **with** comments, ``Ask HN`` posts received more comments on average than ``Show HN`` posts, and ``Ask HN`` posts created between 15:00 and 16:00 received the highest average number of comments.