# ANALYSIS OF SUBMISSIONS TO HACKER NEWS - INSPIRE OTHERS WITH YOUR STORIES

Hacker News is a forum where users submit stories that other users vote and comment on. It is very popular in technology and startup circles, and top posts can attract hundreds of thousands of views. The site was started by *Y Combinator*.

The aim of this project is to investigate the number of comments that posts receive, with a specific focus on **ASK HN and SHOW HN** posts. We will also seek to determine whether the time at which a post is submitted has any effect on the average number of comments it receives. ASK HN posts are those where users ask the community a question, while SHOW HN posts are those where users show the community a product, a project or anything else of interest.

To achieve these goals, we compare the average number of comments for ASK HN and SHOW HN posts. We then calculate the average number of comments per post for each hour of the day to find the best time to post a story. The analysis shows that ASK HN posts are more likely to receive comments and that the best time to post is 11pm (GMT+3).

In [1]:
# Import reader and read the file
from csv import reader

# The with block closes the file once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    # Transform the read file to a list of lists
    hacker_news = list(read_file)
print(hacker_news[:5])
print()

# Extract the headers from the first row of the dataset
headers = hacker_news[0]
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [2]:
#Update the dataset to remove the headers
hacker_news = hacker_news[1:]
print(hacker_news[:5])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [3]:
ask_posts = []
show_posts = []
other_posts = []

# For each row we convert the title to lowercase and check if it starts with 'ask hn' or 'show hn'.
# If so we append the rows to their specific lists
for row in hacker_news:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    

In [5]:
#Check the number of posts for each category
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


9139
10158
273822


In [10]:
print(ask_posts[:5])
print()
print(show_posts[:5])
print()
print(other_posts[:5])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/2

In [13]:
# Calculating the average number of ask comments
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])   # We convert the data type of number of comments to allow mathematical operations
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [14]:
# Calculating the average number of show comments
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])   # We convert the data type of number of comments to allow mathematical operations
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


## Findings for the post that receives most comments 
The averages show that ask posts receive more comments than show posts: about 10.4 per post against about 4.9, slightly more than twice as many. Since ask posts tend to receive more comments, we focus the time analysis on them.
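
The two averaging loops above differ only in the list they iterate over, so they could be folded into one helper. The following is a minimal sketch, not the code used in this notebook: `average_comments` is a hypothetical name, and the sample rows are made up to keep the example self-contained. It assumes the same column layout as the dataset, with `num_comments` at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty category
    # Comment counts are stored as strings, so convert before summing
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Illustrative rows only (not taken from the dataset)
sample_posts = [
    ['1', 'Ask HN: example one', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: example two', '', '6', '3', 'user_b', '9/26/2016 1:17'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace the two separate loops.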

In [25]:
# Creating a list of lists holding each post's creation date and number of comments

import datetime as dt
result_list = []

for row in ask_posts:
    #To ensure the result_list is a list of lists we initialize a list, for each iteration, that is appended to the result_list
    
    initial_list = [] 
    date_created = row[6]
    num_comments = int(row[4])
    initial_list.append(date_created)
    initial_list.append(num_comments)
    result_list.append(initial_list)
    
result_list[:5]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2]]

In [39]:
# Calculating the posts and comments made in each hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    date = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")     # Only the hour of the day will be used in the analysis
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print(counts_by_hour)
comments_by_hour

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

In [44]:
# Calculating the average number of comments per post in each hour

avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]]) 

    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

In [50]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]]) #swapping the columns
swap_avg_by_hour

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

In [56]:
# Sorting the average number of comments by hour list of lists
sorted_swap = sorted(swap_avg_by_hour, reverse=True)   #Descending order
sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
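
The column swap is only needed because `sorted` compares lists by their first element. As a possible alternative, a `key` function can sort on the average directly and leave the columns in place. The sketch below uses three of the averages computed above, rounded to two decimals, as sample data; the variable names are illustrative.

```python
# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [['02', 11.14], ['15', 28.68], ['13', 16.32]]

# Sort on the second column (the average) without swapping columns
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)
print(sorted_by_avg)
```

With this approach each row keeps the `[hour, average]` order, so later code would read the hour from `row[0]` and the average from `row[1]`.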

In [62]:
# Displaying the output in the desired format

print('Top 5 Hours for Asks Posts Comments')
for row in sorted_swap[:5]:
    hour = row[1]
    hour = dt.datetime.strptime(hour, '%H')
    hour = hour.strftime("%H:%M")
    output = "{}: {:.2f} average comments per post.".format(hour, row[0])
    print(output)

Top 5 Hours for Asks Posts Comments
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


## Findings for the effect of the time at which the post is submitted
The hours at which ask posts receive the highest average number of comments per post are, in descending order, 3pm, 1pm, 12pm, 2am and 10am, Eastern US time. For the GMT+3 time zone (Kenya) we add 8 hours, which gives 11pm, 9pm, 8pm, 10am and 6pm respectively.
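
The conversion is a fixed shift of the hour. Below is a minimal sketch of that arithmetic. It takes the 8-hour offset between Eastern US time and GMT+3 as stated here rather than deriving it from time zone data, so daylight saving is ignored. `shift_hour` is a hypothetical helper name.

```python
import datetime as dt

def shift_hour(hour_string, offset_hours=8):
    """Shift an 'HH' hour string by a fixed number of hours, wrapping past midnight."""
    base = dt.datetime.strptime(hour_string, '%H')
    shifted = base + dt.timedelta(hours=offset_hours)
    return shifted.strftime('%H:%M')

# Top five hours from the analysis, in Eastern US time
top_hours = ['15', '13', '12', '02', '10']
converted = [shift_hour(hour) for hour in top_hours]
print(converted)
```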

The goal of the project was to determine the best kind of post and the best time to post in order to maximise the chances of receiving comments. To achieve this, the average number of comments for each post type and the average number of comments per post for each hour were calculated. The best type is an ASK HN post and the most favourable time is 11pm (GMT+3).