# Analysis of Hacker News posts

Goal: This project recommends the best time of day to create a post in order to receive the largest number of comments, based on an analysis of posts from the Hacker News web platform.

Method: The analysis targets posts whose titles begin with *Ask HN* or *Show HN*: posts in which users ask the community a specific question, and posts in which users present projects they have created. We are interested in the number of comments these posts receive on average and in how that number depends on the hour at which a post was created.

Data set: ~20,000 posts from Hacker News.

Key parameter: *Ask HN* or *Show HN* posts

## Data

The sample data set of ~20,000 posts from Hacker News is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

## Open the data

In [1]:
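from csv import reader

# Read the whole file into a list of rows; the context manager closes the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # the remaining rows are the posts
hn_ln = len(hn)
print('Length of Hacker News data without the header:', hn_ln)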
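
The analysis addresses columns by position (for example `row[4]` for the number of comments). As an alternative, the rows can be read with `csv.DictReader`, which keys each value by its header name. The sketch below is only an illustration: the helper name `read_posts` is hypothetical, and the two sample rows are copied from the first rows of the data set and held in memory so that the code runs without `hacker_news.csv` on disk.

```python
import csv
import io

# Two rows copied from the data set, kept in memory so the sketch
# does not depend on hacker_news.csv being present.
SAMPLE_CSV = """id,title,url,num_points,num_comments,author,created_at
12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12
"""


def read_posts(file_obj):
    """Return the posts as a list of dicts keyed by the header names (hypothetical helper)."""
    return list(csv.DictReader(file_obj))


posts = read_posts(io.StringIO(SAMPLE_CSV))
for post in posts:
    # Values are still strings, so numeric columns need an explicit int()
    print(post["title"], "->", int(post["num_comments"]), "comments")
```

With the real file, the same helper could be called as `read_posts(opened_file)` inside a `with open('hacker_news.csv') as opened_file:` block.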


The headers of the data set:

In [2]:
for index, name in enumerate(headers):
    print(index, ": ", name)

0 :  id
1 :  title
2 :  url
3 :  num_points
4 :  num_comments
5 :  author
6 :  created_at


The first 5 rows of the data set:

In [3]:
for row in hn[0:5]:
    print(row)
    print("\n")

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




To split the posts into three separate lists (titles starting with *Ask HN*, titles starting with *Show HN*, and all other posts):

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_low = title.lower()
    if title_low.startswith('ask hn'):
        ask_posts.append(row)
    elif title_low.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)        
        
n_ask_posts = len(ask_posts)
n_show_posts = len(show_posts)
n_other_posts = len(other_posts)
print("Number of Ask HN posts: ", n_ask_posts)
print("Number of Show HN posts: ", n_show_posts)
print("Number of other posts: ", n_other_posts)

print(ask_posts[0:5])

Number of Ask HN posts:  1744
Number of Show HN posts:  1162
Number of other posts:  17194
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


To determine whether *Ask HN* or *Show HN* posts receive more comments (the *num_comments* column, index 4 in our data set):

In [5]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = round(total_ask_comments/len(ask_posts),2)
print("Total number of Ask HN comments: ", total_ask_comments)
print("Average number of Ask HN comments: ", avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = round(total_show_comments/len(show_posts),2)
print("Total number of Show HN comments: ", total_show_comments)
print("Average number of Show HN comments: ", avg_show_comments)

Total number of Ask HN comments:  24483
Average number of Ask HN comments:  14.04
Total number of Show HN comments:  11988
Average number of Show HN comments:  10.32
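
The total and the average are computed twice with the same loop, once per list. One possible way to avoid the repetition is a small helper function. The name `average_comments` and its `comments_index` parameter are assumptions made for this sketch, and the three sample rows are *Ask HN* posts taken from the data set.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) number of comments for a list of post rows.

    Hypothetical helper: comments_index is the position of num_comments.
    """
    total = sum(int(row[comments_index]) for row in posts)
    # Guard against an empty list to avoid a ZeroDivisionError
    average = round(total / len(posts), 2) if posts else 0.0
    return total, average


# Three Ask HN rows taken from the data set
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

total, average = average_comments(sample_ask_posts)
print("Total:", total, "Average:", average)
```

Applied to the full lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.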


According to the analysis, *Ask HN* posts receive more comments than *Show HN* posts, both in total (24,483 vs. 11,988) and on average (~14 vs. ~10 comments per post). This could be due to the nature of these posts: they ask a question and so invite other users into a conversation.
The remaining analysis will focus on *Ask HN* posts.

First, we will calculate the number of *Ask HN* posts created and the number of comments received for each hour of the day, with the help of the **datetime** module.

In [6]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

print("Example of the list of lists containing only the date of the post and the number of comments:")
print("\n")
print(result_list[0:5])
print("\n")

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comment = int(row[1])
    date = row[0]
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date_dt.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1 
        comments_by_hour[hour] += comment

print("Number of posts created by hour: ")
print(counts_by_hour)
print("\n")
print("Number of comments received by hour: ")
print(comments_by_hour)

Example of the list of lists containing only the date of the post and the number of comments:


[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


Number of posts created by hour: 
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Number of comments received by hour: 
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
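
The per-hour bookkeeping above checks by hand whether an hour is already a key in each dictionary. A more compact variant, shown here only as a sketch, uses `collections.defaultdict` so that missing hours start at zero. The function name `aggregate_by_hour` is an assumption made for this example, and the sample rows are the five entries printed from *result_list*.

```python
import datetime as dt
from collections import defaultdict


def aggregate_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) for [created_at, num_comments] rows.

    Hypothetical helper that mirrors the loop used in the analysis.
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# The five entries shown in the example output of result_list
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

sample_counts, sample_comments = aggregate_by_hour(sample_rows)
print(sample_counts)
print(sample_comments)
```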


We will calculate the average number of comments for posts created during each hour of the day. The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post.

In [7]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour],2)])

print("The average number of comments for posts created during each hour of the day:")    
print("\n")
print(avg_by_hour)

The average number of comments for posts created during each hour of the day:


[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]


We will print the five highest values from the *avg_by_hour* list. First, we sort the hours by their average number of comments in descending order, then we select the top five.

In [8]:
# Swap each pair to [average, hour] so that sorted() orders by the average
swap_avg_by_hour = []
for hour, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hour])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for av_n_com, hour in sorted_swap[0:5]:
    template = "{h}: {av} average comments per post"
    # Turn the hour string (e.g. '15') into the 'HH:MM' form (e.g. '15:00')
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_min = hour_dt.strftime("%H:%M")
    output = template.format(h=hour_min, av=av_n_com)
    print(output)
        

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
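
The swap step exists only so that `sorted()` compares the averages first. The same ranking can be obtained without swapping by passing a `key` function to `sorted()`. This is a sketch of that alternative, not the approach used in the notebook; the sample list holds a few `[hour, average]` pairs copied from *avg_by_hour*, and the name `top_hours` is hypothetical.

```python
# A few [hour, average] pairs copied from avg_by_hour, deliberately unordered
avg_sample = [
    ['09', 5.58], ['13', 14.74], ['16', 16.8], ['15', 38.59],
    ['21', 16.01], ['20', 21.52], ['02', 23.81],
]

# Sort by the second element (the average) in descending order, keep the top five
top_hours = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print(f"{hour}:00: {avg} average comments per post")
```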


## Conclusions


Based on the hours with the most comments per post, the best times to create an *Ask HN* post with a higher chance of collecting many comments are 15:00-16:00, 20:00-21:00, and late at night at ~2:00.

The first interval, 15:00-16:00, corresponds to the time after lunch. Possibly users are back at their desks and, instead of jumping straight into their workflow, are reading Hacker News and perhaps discussing it with colleagues.

The second interval, 20:00-21:00, corresponds to the time after dinner (assuming that the typical dinner time in the US is 18:00-19:30). Possibly users are back home by then and would like to relax by reading Hacker News.

The ~2:00 slot perhaps corresponds to the time when freelance IT specialists, programmers, and others with flexible schedules read Hacker News after finishing work, maybe before going to bed. Note also that all of these times are given in a specific time zone, namely Eastern Time in the US.
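
Because the recommended hours are expressed in US Eastern Time, readers elsewhere need to convert them to their own time zone. The sketch below shows one way to do that with the standard `zoneinfo` module. It assumes Python 3.9 or later with a time zone database available on the system, and it assumes the timestamps are Eastern Time as stated above. The function name `convert_from_eastern`, the example timestamp, and the target zone are hypothetical choices for illustration.

```python
import datetime as dt
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")  # assumption: timestamps are US Eastern Time


def convert_from_eastern(timestamp, target_zone, date_format="%m/%d/%Y %H:%M"):
    """Interpret a naive timestamp as Eastern Time and convert it to target_zone."""
    naive = dt.datetime.strptime(timestamp, date_format)
    # Attach the Eastern zone first, then convert; daylight saving time is handled by zoneinfo
    return naive.replace(tzinfo=EASTERN).astimezone(ZoneInfo(target_zone))


# Hypothetical example: the 15:00 slot on a summer day, seen from Berlin
local_time = convert_from_eastern("8/16/2016 15:00", "Europe/Berlin")
print(local_time.strftime("%m/%d/%Y %H:%M %Z"))
```

The offset between two zones can change during the year because of daylight saving time, so the conversion is done per date and not with a fixed number of hours.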