# When to ask your question on Hacker News to get the highest number of comments
In this project, we examine roughly 300,000 user-submitted posts on Hacker News to explore some of the factors that affect how many comments a post receives.

The full dataset, and related documentation, can be found on Kaggle: https://www.kaggle.com/hacker-news/hacker-news-posts

The key questions we seek to answer are:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


## Preparing the dataset

Let's start by opening the CSV file containing our Hacker News dataset.

In [63]:
import csv
import datetime as dt

with open("Documents/datasets/HN_posts_year_to_Sep_26_2016.csv", encoding="utf-8") as a:
    read_file = csv.reader(a)
    hn = list(read_file)

Now, let's familiarize ourselves with the dataset by looking at the first five rows. As we'll see below, the first row contains the table headers.

In [64]:
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


To make the dataset easier to work with, we'll split the column headers and the actual post data into separate variables.

In [65]:
headers = hn[0]

hn = hn[1:]

print("Headers:\n", headers,"\n")
print("First 5 rows of HN post data:\n", hn[0:5])

Headers:
 ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

First 5 rows of HN post data:
 [['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markga

## Some initial insights: types of posts

Let's continue by counting the number of posts belonging to each of the following categories:
- "Ask HN" posts
- "Show HN" posts
- Other posts

In [66]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Ask HN: ", len(ask_posts), "\nShow HN: ", len(show_posts), "\nOther posts: ", len(other_posts))
        

Ask HN:  9139 
Show HN:  10158 
Other posts:  273822
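
The title check used above can also be wrapped in a small helper, which makes it easy to try on individual titles. This is a minimal sketch: the name `classify_title` and the sample titles are hypothetical and do not come from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # case-insensitive match, as in the loop above
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, for illustration only
sample_titles = [
    "Ask HN: What should I read next?",
    "show hn: a tiny side project",
    "algorithmic music",
]

for sample in sample_titles:
    print(classify_title(sample), "<-", sample)
```

Because the check uses `startswith`, a title that merely mentions "Ask HN" somewhere in the middle still lands in the "other" group.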


We'll continue by analyzing the average number of user comments on Ask HN and Show HN posts, respectively.

In [67]:
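# Assumption: sum_comments is not defined in the cells shown,
# so a minimal version is added here.
def sum_comments(posts):
    """Return the total of the num_comments column (index 4) across posts."""
    return sum(int(row[4]) for row in posts)

total_ask_comments = sum_comments(ask_posts)
total_show_comments = sum_comments(show_posts)

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Average number of comments, Ask HN: ", avg_ask_comments)
print("Average number of comments, Show HN: ", avg_show_comments)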


Average number of comments, Ask HN:  10.393478498741656
Average number of comments, Show HN:  4.886099625910612


We can conclude that Ask HN posts receive more comments on average than Show HN posts: about 10.4 versus 4.9.

## A deep dive into Ask HN posts

### Publication hour and number of comments
Having concluded that Ask HN posts receive more comments on average than Show HN posts, let's dig deeper into the Ask HN posts. We will start by finding out at which hour of the day (24-hour clock) these posts were published, and how many comments were received by posts published at each hour.

In [68]:
result_list = []

for post in ask_posts:
    post_data = []
    post_data.append(post[6])
    post_data.append(int(post[4]))
    result_list.append(post_data)

In [69]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time_posted = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = time_posted.strftime("%H")  # zero-padded hour, e.g. "03"
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]  # already an int from the previous cell
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
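
To make the parsing step concrete, here is a small self-contained sketch of the same tally using `collections.defaultdict`, which removes the need for the `if`/`else` branch. The three sample rows are hypothetical; only the timestamp format is taken from the dataset.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the dataset's timestamp format
sample = [
    ["9/26/2016 3:26", 0],
    ["9/26/2016 3:24", 4],
    ["9/25/2016 15:02", 10],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded, so "3" becomes "03"
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```

Missing keys start at 0 with `defaultdict(int)`, so each hour is handled the same way whether or not it has been seen before.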

### Average number of comments per post publication hour

Next up, let's find out the average number of comments per post publication hour.

In [70]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
        
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


In [71]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
print(swap_avg_by_hour)

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [72]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]
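
The swap step exists only so that `sorted` compares the averages first. As an alternative sketch, `sorted` accepts a `key` function, which sorts the `[hour, average]` pairs directly. The sample values below are three of the averages from the output above, rounded to two decimals.

```python
# [hour, average] pairs, rounded from the notebook output for brevity
avg_by_hour_sample = [["02", 11.14], ["15", 28.68], ["13", 16.32]]

# Sort on the second element (the average), highest first
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top:
    print("{}:00 {:.2f} average comments per post".format(hour, avg))
```

Both approaches produce the same ranking; the `key` version simply keeps the hour in the first position.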


### Top 5 Hours for Ask Posts Comments
Let's have a look at the top 5 post publication hours for Ask Posts comments.

In [73]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[0:5]:
    hour = dt.datetime.strptime(row[1],"%H")
    hour_frmt = hour.strftime("%H:%M")  # %M is minutes; %S would print seconds
    avg = row[0]
    
    print("{h} {a:.2f} average comments per post".format(h = hour_frmt, a = avg))

Top 5 Hours for Ask Posts Comments
15:00 28.68 average comments per post
13:00 16.32 average comments per post
12:00 12.38 average comments per post
02:00 11.14 average comments per post
10:00 10.68 average comments per post


As shown above, Ask HN posts created during the following hours receive the highest average number of comments:
- 15:00 ET
- 13:00 ET
- 12:00 ET
- 02:00 ET
- 10:00 ET

This means that, to increase their chances of getting more comments, an HN user in Sweden should publish their Ask HN post at one of the following hours (ET plus six hours):
- 21:00 CET
- 19:00 CET
- 18:00 CET
- 08:00 CET
- 16:00 CET
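
The conversion from ET to CET can be sketched in a few lines. This assumes a fixed six-hour gap between the two zones; the helper name `et_hour_to_cet` and the constant are hypothetical and not part of the original notebook.

```python
import datetime as dt

# Assumption: Swedish local time is 6 hours ahead of US Eastern Time
ET_TO_CET_HOURS = 6


def et_hour_to_cet(hour_str, offset_hours=ET_TO_CET_HOURS):
    """Shift an 'HH' hour string by offset_hours, wrapping past midnight."""
    hour = dt.datetime.strptime(hour_str, "%H")
    shifted = hour + dt.timedelta(hours=offset_hours)
    return shifted.strftime("%H:%M")


top_hours_et = ["15", "13", "12", "02", "10"]
for h in top_hours_et:
    print(h + ":00 ET ->", et_hour_to_cet(h), "CET")
```

The six-hour gap holds for most of the year. For a few weeks around the spring and autumn clock changes, the US and the EU switch on different dates and the gap is five hours, so a time-zone-aware conversion would be more robust for those periods.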

While this leaves some choice for both early birds and night owls, 15:00 ET/21:00 CET is the clear winner.