# Hacker News Posts  

In this project I will analyse a reduced version of [this dataset](https://www.kaggle.com/hacker-news/hacker-news-posts). The original dataset contains stats on posts from September 2015 to September 2016. The columns are as follows:

* id: the unique identifier of the post

* title: the title of the post

* url: the url of the item being linked to

* num_points: the number of upvotes the post received

* num_comments: the number of comments the post received

* author: the name of the account that made the post

* created_at: the date and time the post was made (the time zone is Eastern Time in the US)

The analysis focuses on posts whose titles start with Ask HN, which pose a question to the community, and Show HN, which share something with it.  
**The project's aim** is to find out which of these posts receive more comments and whether posts created at a particular time of day receive more comments.
***

## Exploring the data

First, I will read in the dataset.


In [1]:
from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])



[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


To prepare the data for analysis, I will remove the header row and store it in a separate variable.

In [2]:
# Keep the header row in its own variable and drop it from the data
headers = hn[0]
hn = hn[1:]
print(headers)

print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Ask or Share

Now I will try to find out which category of HN posts receives more involvement from the community. I will measure this by comparing the number of comments on posts starting with Ask HN and Show HN.
First, I need to separate those two groups of posts. I will create three lists: Ask posts, Show posts and Other posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

# Note: startswith is case-sensitive, so a title such as 'ask HN: ...' goes to other_posts
for row in hn:
    title = row[1]
    if title.startswith('Ask HN'):
        ask_posts.append(row)
    elif title.startswith('Show HN'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Ask posts: ", len(ask_posts))
print("Show posts: ", len(show_posts))
print("Other posts: ", len(other_posts))




Ask posts:  1742
Show posts:  1161
Other posts:  17197
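
Because the prefix check is case-sensitive, a title written in a different case would end up in the Other group. A minimal sketch of a case-insensitive variant is shown below; the `classify` helper and the sample titles are hypothetical and only for illustration. Applying it to the full dataset may shift the counts slightly.

```python
def classify(title):
    """Return 'ask', 'show' or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, for illustration only
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'ask hn: Any book recommendations?',
    'Show HN: A tiny static site generator',
    'Interactive Dynamic Video',
]
labels = [classify(title) for title in sample_titles]
print(labels)
```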


Now that we have the Ask posts separated from the Show posts, we can compare the average number of comments for each category of posts.

In [4]:
#Calculate average number of comments per Ask post
total_ask_comments = 0
for row in ask_posts:
    num_com = int(row[4])
    total_ask_comments = total_ask_comments + num_com
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments per Ask post: ', avg_ask_comments)

#Calculate average number of comments per Show post
total_show_comments = 0
for row in show_posts:
    num_com = int(row[4])
    total_show_comments = total_show_comments + num_com

avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments per Show post: ', avg_show_comments)

Average number of comments per Ask post:  14.044776119402986
Average number of comments per Show post:  10.324720068906116


As we can see, Ask posts receive about 3.7 more comments on average than Show posts (14.04 versus 10.32).

This may be because a question invites active involvement from the reader, whereas a curious fact or other information shared under a Show HN title does not necessarily provoke comments, even if the audience likes it very much.
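
The two averaging loops repeat the same steps, so they could be folded into one helper. The sketch below assumes rows shaped like those in `hn`, with the number of comments at index 4; the `average_comments` name and the sample rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the same column order as the dataset
sample_posts = [
    ['1', 'Ask HN: Example question?', '', '5', '6', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Another question?', '', '2', '10', 'user_b', '1/26/2016 19:30'],
]
sample_average = average_comments(sample_posts)
print(sample_average)  # (6 + 10) / 2 = 8.0

# Gap between the two averages reported in the output above
difference = 14.044776119402986 - 10.324720068906116
print(round(difference, 2))  # 3.72
```

Guarding against an empty list avoids a division by zero if a category happens to contain no posts.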



## Best time to post

Our next question is whether posts published at a certain time receive more comments. To find out, I will work with the Ask HN posts, which proved to attract more comments on average.
I will need the datetime module to complete this task.

In [5]:
import datetime as dt

# Create a list of lists with only the creation time and the number of comments
result_list = []
for row in ask_posts:
    time = row[6]
    nc = int(row[4])
    result_list.append([time, nc])

# Create a frequency table of posts per hour and
# a dictionary with the total number of comments per hour
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    ncom = row[1]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = ncom
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += ncom

print(counts_by_hour)
print('\n')
print(comments_by_hour)
    
    


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 108, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 54, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1430, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 439, '06': 397, '07': 267, '11': 641}


Above I created two dictionaries: one with the number of posts created at each hour of the day, and another with the total number of comments received by those posts.
Below I will calculate the average number of comments per post for each hour of creation.

In [6]:
#Calculate average num of comments by hour
avg_by_hour = []
for hour in comments_by_hour:
    avg_com = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour,avg_com])    

#Display the results as a table
print('Average number of comments by hour')
print('\n')
for row in avg_by_hour:
    hour = row[0]
    average = row[1]
    print(hour, average)

Average number of comments by hour


09 5.5777777777777775
13 14.741176470588234
10 13.440677966101696
14 13.233644859813085
16 16.796296296296298
23 7.985294117647059
12 9.41095890410959
17 11.46
15 38.5948275862069
21 16.009174311926607
20 21.525
02 23.810344827586206
18 13.24074074074074
03 7.796296296296297
05 10.08695652173913
19 10.8
01 11.383333333333333
22 6.746478873239437
08 10.25
04 7.170212765957447
00 8.12962962962963
06 9.022727272727273
07 7.852941176470588
11 11.051724137931034
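
For reference, the counting and averaging steps can also be done in a single pass with `collections.defaultdict`. This is only a sketch on made-up `(created_at, num_comments)` pairs; the variable names are illustrative and not part of the notebook.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs, for illustration only
sample = [
    ('8/4/2016 11:52', 4),
    ('1/26/2016 11:05', 8),
    ('6/23/2016 22:20', 3),
]

date_format = "%m/%d/%Y %H:%M"
totals = defaultdict(lambda: [0, 0])  # hour -> [number of posts, total comments]
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    totals[hour][0] += 1
    totals[hour][1] += num_comments

# Divide the total comments by the number of posts for each hour
averages = {hour: total / count for hour, (count, total) in totals.items()}
print(averages)
```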


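As I live in Buenos Aires, for this analysis to be useful I need to convert the times from US Eastern Time to Buenos Aires time. Buenos Aires is one hour ahead of Eastern Time while the US observes daylight saving time and two hours ahead during the rest of the year; for simplicity, I will apply the one-hour offset throughout.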

In [7]:
# Convert the hour from US Eastern Time to Buenos Aires time (one hour ahead)
for row in avg_by_hour:
    hour = int(row[0])
    hour_BA = (hour + 1) % 24  # wrap 23 + 1 around to 0 instead of an invalid hour 24
    row[0] = str(hour_BA)
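
A fixed shift is an approximation, because the offset between the two cities depends on the time of year. A sketch of a per-timestamp conversion with the standard `zoneinfo` module follows; it assumes Python 3.9+ with time zone data available on the system, and the `to_buenos_aires` helper name is hypothetical. The two timestamps are taken from the rows printed earlier.

```python
import datetime as dt
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
BUENOS_AIRES = ZoneInfo("America/Argentina/Buenos_Aires")
DATE_FORMAT = "%m/%d/%Y %H:%M"

def to_buenos_aires(created_at):
    """Parse a US Eastern timestamp string and return it in Buenos Aires time."""
    eastern_time = dt.datetime.strptime(created_at, DATE_FORMAT).replace(tzinfo=EASTERN)
    return eastern_time.astimezone(BUENOS_AIRES)

summer = to_buenos_aires('8/4/2016 11:52')   # US daylight saving time in effect
winter = to_buenos_aires('1/26/2016 19:30')  # US standard time in effect
print(summer.strftime("%H:%M"))  # 12:52, one hour ahead
print(winter.strftime("%H:%M"))  # 21:30, two hours ahead
```

Converting each timestamp before extracting the hour would change which hour bucket some posts fall into, so the per-hour averages could differ from the ones reported here.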

To make comparison between the rows easier, I'll sort the list from the highest average number of comments to the lowest.

In [8]:
# Put the average first so that sorted() orders the rows by it
swap_avg_by_hour = []
for row in avg_by_hour:
    first = row[1]
    second = row[0]
    swap_avg_by_hour.append([first, second])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')
print('\n')
for row in sorted_swap[:5]:
    avg_com = row[0]
    hour = row[1]
    # Format the hour as HH:MM, e.g. '16' becomes '16:00'
    hour_f = dt.datetime.strptime(hour, '%H').strftime("%H:%M")
    print("{} has {:.2f} average comments per post".format(hour_f, avg_com))

Top 5 Hours for Ask Posts Comments


16:00 has 38.59 average comments per post
03:00 has 23.81 average comments per post
21:00 has 21.52 average comments per post
17:00 has 16.80 average comments per post
22:00 has 16.01 average comments per post
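
Swapping the columns is one way to sort by the average; passing a `key` function to `sorted()` orders the rows by the average without rebuilding the list. A small sketch, using three of the Buenos Aires rows from the results above with the averages rounded to two decimals:

```python
# Three [hour, average] rows taken from the results above, rounded to two decimals
rows = [['21', 21.52], ['16', 38.59], ['3', 23.81]]

# Sort by the second element (the average), highest first
ranked = sorted(rows, key=lambda row: row[1], reverse=True)
for hour, average in ranked:
    # zfill pads single-digit hours, so '3' is shown as '03'
    print("{}:00 has {:.2f} average comments per post".format(hour.zfill(2), average))
```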


It looks like three time windows (Buenos Aires time) are best for receiving more engagement from readers on Hacker News:
**between 16:00 and 18:00**,   
at night **between 21:00 and 23:00**,   
or before dawn **between 3:00 and 4:00**. 

***

## Conclusion

Based on the analysis of a reduced dataset of posts from 2015-2016, for maximum response from the Hacker News community, a person in Argentina should:  
- post a question rather than share something, and    
- post it in the afternoon between 16:00 and 18:00.