COMPARING COMMENTS ON ASK HN AND SHOW HN POSTS AND THE INFLUENCE OF POSTING TIME ON THE NUMBER OF COMMENTS

INTRODUCTION
Hacker News is a site by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments. It is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The data was obtained from Dataquest and contains approximately 20,000 rows. It was generated by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

id: the unique identifier from Hacker News for the post
title: the title of the post
url: the URL that the post links to, if the post has a URL
num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time of the post's submission

The focus here is on posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. For example:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users also submit Show HN posts to show the Hacker News community a project, product, or just something interesting. For example:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

OBJECTIVE
These two types of posts will be used to determine the following:

Do Ask HN or Show HN receive more comments on average?
Do posts created at a certain time receive more comments on average?

LOADING AND EXPLORING DATA
I start by importing reader from the csv module, then open the dataset and read it into a list of lists. Next, the first five rows are explored.

In [1]:
from csv import reader

# Read the whole file into a list of lists; the with block closes the file.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

first_five_rows = hn[:5]
print(first_five_rows)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


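The first row is extracted and assigned to headers, and the data set is then updated through list slicing to exclude it. We print the headers and the first five rows of hn to confirm that the header row has been removed. The output shown below has a data row where the header should be, which suggests the cell was run more than once: each re-run of hn = hn[1:] drops one more row.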

In [5]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
[['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos'
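
Slicing with hn = hn[1:] drops a row every time the cell runs, so running it twice silently discards a data row. One way to make the step safe to re-run is to strip the first row only when it still looks like the header. The sketch below assumes the header is the only row whose first field is the literal string 'id'; split_header is a hypothetical helper, not part of the original notebook, and the sample rows are made up.

```python
def split_header(rows, header_key="id"):
    """Return (header, data), stripping the first row only if it is the header.

    Assumption: the header row is the only row whose first field equals
    header_key. If no header is found, header is None and rows are unchanged.
    """
    if rows and rows[0] and rows[0][0] == header_key:
        return rows[0], rows[1:]
    return None, rows


# Made-up rows for illustration only.
sample = [
    ["id", "title", "num_comments"],
    ["1", "Ask HN: example question", "3"],
    ["2", "Show HN: example project", "5"],
]

headers, data = split_header(sample)
# Calling it again on the already-stripped data leaves the rows untouched.
headers_again, data_again = split_header(data)
print(headers)
print(len(data), len(data_again))
```

In the notebook, the same guard could be applied to hn in place of the unconditional slice.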

We separate posts whose titles begin with Ask HN or Show HN (in any letter case) into two lists, and collect the posts that fall into neither category in a third list.

In [6]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('There are', len(ask_posts), 'ask HN posts')
print('There are', len(show_posts), 'show HN posts')
print('There are', len(other_posts), 'other posts')        

There are 1744 ask HN posts
There are 1162 show HN posts
There are 17192 other posts


ANALYSIS
Here we determine whether ask posts or show posts receive more comments on average. To do this, we sum the number of comments across all ask posts and divide by the number of ask posts (the length of the list). The same procedure is repeated for the show posts.

In [7]:
# Sum the comments across all ask posts, then divide once after the loop.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
average_ask_comments = total_ask_comments / len(ask_posts)
print(f'the average comment in ask posts is {average_ask_comments:.2f}')

# Repeat the same calculation for show posts.
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(f'the average comment in show posts is {avg_show_comments:.2f}')

the average comment in ask posts is 14.04
the average comment in show posts is 10.32


The numbers show that ask posts receive more comments on average than show posts. However, to state this conclusively, the difference would have to be subjected to a statistical test of significance.
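
One possible form for such a test is a permutation test on the difference in means, which needs only the standard library. The sketch below is an assumption about how the check might be done, not part of the original analysis; permutation_test is a hypothetical helper and the toy comment counts are made up.

```python
import random


def permutation_test(sample_a, sample_b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference (mean_a - mean_b) and the estimated
    p-value: the share of random relabelings whose absolute difference in
    means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(sample_a)
    observed = sum(sample_a) / n_a - sum(sample_b) / len(sample_b)

    pooled = list(sample_a) + list(sample_b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        group_a = pooled[:n_a]
        group_b = pooled[n_a:]
        diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up comment counts, for illustration only (not from hacker_news.csv).
toy_ask = [12, 30, 4, 18, 25, 9, 40, 16]
toy_show = [5, 8, 3, 11, 6, 2, 9, 7]
observed_diff, p_value = permutation_test(toy_ask, toy_show)
print(f"observed difference: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

With the notebook's data, the inputs would be the integer comment counts from each list, for example [int(row[4]) for row in ask_posts] and [int(row[4]) for row in show_posts].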

Next, since ask posts had a higher value, we will check if ask posts created at a certain time are more likely to attract comments. First we calculate the number of ask posts created in each hour of the day, along with the number of comments received.

In [10]:
import datetime as dt
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append((created_at, num_comments))
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    num_comments = int(row[1])
    date = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
print('Post counts per Hour:', counts_by_hour)

print('Comment counts per Hour:', comments_by_hour)

Post counts per Hour: {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Comment counts per Hour: {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


Next, we compute the average number of comments that ask posts receive for each hour of creation.

In [15]:
avg_by_hour = []

for key in comments_by_hour:
    total_comments = comments_by_hour[key]
    total_posts = counts_by_hour[key]
    average_comment_per_post = total_comments / total_posts
    avg_by_hour.append([key, average_comment_per_post])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


For readability, we swap the two columns so that the average comes first, sort the list in descending order, and then format the five hours with the highest averages.

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [
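
The swap is only needed because sorted compares lists by their first element. Passing a key function sorts on the average directly and leaves the [hour, average] order intact. A minimal sketch, using three pairs copied from the avg_by_hour output above; avg_sample and sorted_sample are illustrative names, not part of the original notebook.

```python
# Three [hour, average] pairs taken from the avg_by_hour output.
avg_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort on the second element (the average), highest first.
sorted_sample = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_sample:
    print(f"{hour}:00 : {avg:.2f} average comments per post")
```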

In [18]:
for avg, hr in sorted_swap[:5]:
    print("{} : {:.2f} average comments per post".format
          (dt.datetime.strptime(hr,"%H").strftime("%H:%M"), avg))

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
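
An average can be pulled up by a handful of heavily commented posts, so the per-hour median is a possible robustness check on the ranking above. The sketch below is an assumption about how such a check might look, not part of the original analysis; median_comments_by_hour is a hypothetical helper and the sample rows are made up.

```python
import datetime as dt
from collections import defaultdict
from statistics import median


def median_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour ('00'..'23') to the median comment count of its posts.

    rows is an iterable of (created_at, num_comments) pairs, the same shape
    as result_list in the notebook.
    """
    comments = defaultdict(list)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        comments[hour].append(int(num_comments))
    return {hour: median(values) for hour, values in comments.items()}


# Made-up rows for illustration only.
toy_rows = [
    ("8/4/2016 15:05", 2),
    ("8/5/2016 15:40", 4),
    ("8/6/2016 15:59", 300),  # one outlier dominates the mean for hour 15
    ("8/4/2016 9:10", 6),
    ("8/7/2016 9:45", 8),
]
medians = median_comments_by_hour(toy_rows)
print(medians)
```

In the notebook, result_list could be passed in directly. Hours whose median and mean differ sharply would be the ones to inspect for outliers.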


CONCLUSION
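By the numbers, ask posts created during the 15:00 hour receive the most comments, averaging 38.59 per post, followed by 02:00 (23.81), 20:00 (21.52), 16:00 (16.80) and 21:00 (16.01). The lowest average belongs to posts created during the 09:00 hour, at 5.58 comments per post. Ask posts also average more comments than show posts (14.04 versus 10.32). Neither result has been tested for statistical significance, and the sample excludes submissions that received no comments.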