# Comparing two types of posts on Hacker News

**Hacker News** is a site where user-submitted stories (posts) are voted and commented on, similar to Reddit. It is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

In this project, we are specifically interested in two types of posts: those whose titles begin with either `Ask HN` or `Show HN`. The first is used to ask the Hacker News community a specific question, and the second to show the community a project, a product, or just something generally interesting.

We will compare these two types of posts to determine **which type (Ask HN or Show HN) receives more comments on average, and whether posts created at a certain time receive more comments on average.**

In [1]:
from csv import reader

# Read the whole dataset into a list of rows; the file is closed automatically
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [2]:
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
hn = hn[1:]
print(hn[:5])
print('\n')
print(len(hn))

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


20100


# Determining which type receives more comments

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


After separating `Ask HN` and `Show HN` into two different lists, we're going to determine whether ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments = 0
for row in ask_posts:
    comment = int(row[4])
    total_ask_comments += comment
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [6]:
total_show_comments = 0
for row in show_posts:
    comment = int(row[4])
    total_show_comments += comment
    
avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993
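
The two cells above repeat the same calculation, so it could be wrapped in a small helper. The following is only a sketch: the function name `average_comments`, its `comments_index` parameter and the sample rows are hypothetical and are not part of the original notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical sample rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example B', '', '3', '20', 'user_b', '1/2/2016 11:00'],
]
print(average_comments(sample_posts))
```

In the notebook, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.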


From this overview, there are more ask posts than show posts (1,744 against 1,162). Even with the larger denominator in the average calculation, the average number of comments on ask posts (14.04) is higher than on show posts (10.32).

However, to check that these numbers aren't distorted by a few posts with a lot of comments, let's sort the comment counts of each list in descending order and take a look.

In [7]:
sorted_ask_posts = []
for row in ask_posts:
    comments = int(row[4])
    sorted_ask_posts.append(comments)
sorted_ask_posts = sorted(sorted_ask_posts, reverse=True)
print(sorted_ask_posts[:15])

[947, 910, 868, 691, 520, 514, 477, 383, 283, 281, 266, 250, 234, 231, 202]


In [8]:
sorted_show_posts = []
for row in show_posts:
    comments = int(row[4])
    sorted_show_posts.append(comments)
sorted_show_posts = sorted(sorted_show_posts, reverse=True)
print(sorted_show_posts[:15])

[306, 233, 206, 197, 168, 167, 163, 143, 134, 113, 106, 103, 102, 100, 98]


Comparing the 15 most commented ask posts with the 15 most commented show posts reveals a huge difference between them. Let's sum each group to quantify it.

In [9]:
# Sum the comment counts of the 15 most commented posts of each type
sum_ask = sum(sorted_ask_posts[:15])
sum_show = sum(sorted_show_posts[:15])

print('Sum of the most commented 15 Ask Posts: ' + str(sum_ask))
print('Sum of the most commented 15 Show Posts: ' + str(sum_show))

Sum of the most commented 15 Ask Posts: 7057
Sum of the most commented 15 Show Posts: 2339
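
One rough way to gauge how much these top posts weigh on each average is to compute their share of all comments, and the average that remains once they are set aside. The sketch below reuses only the figures printed above (post counts, averages and top 15 sums); the function name `top_share_and_trimmed_avg` is hypothetical, and the total number of comments is recovered by multiplying each average by its post count.

```python
# Figures copied from the outputs above
ask_count, ask_avg, ask_top15 = 1744, 14.038417431192661, 7057
show_count, show_avg, show_top15 = 1162, 10.31669535283993, 2339


def top_share_and_trimmed_avg(count, avg, top_sum, top_n=15):
    """Return the share of comments held by the top posts and the
    average number of comments once those posts are excluded."""
    total = round(count * avg)  # recover the total from the average
    share = top_sum / total
    trimmed_avg = (total - top_sum) / (count - top_n)
    return share, trimmed_avg


for label, figures in [('Ask HN', (ask_count, ask_avg, ask_top15)),
                       ('Show HN', (show_count, show_avg, show_top15))]:
    share, trimmed = top_share_and_trimmed_avg(*figures)
    print(f"{label}: top 15 posts hold {share:.1%} of the comments; "
          f"average without them: {trimmed:.2f}")
```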


We can say that although ask posts have a higher average number of comments, that average is heavily influenced by a few posts with a lot of comments. Strictly speaking, the more rigorous approach would be to calculate the standard deviation of both lists and run a hypothesis test to confirm whether the difference between the averages is statistically significant. However, that's beyond the scope of this project, and we can deepen the analysis later after gaining more experience with Python.
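
As a pointer for that later analysis, here is a minimal sketch of the standard deviation and of a Welch t-statistic using only the standard library. The two short lists are hypothetical stand-ins for the real comment counts, and `welch_t` is an illustrative name. Note also that comment counts are heavily skewed, so a t-test is only one possible choice and its assumptions would need checking.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for the difference between two sample means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical comment counts, NOT taken from the dataset
ask_sample = [0, 2, 5, 10, 50]
show_sample = [1, 3, 4, 6, 8]

print('Ask std dev:', statistics.stdev(ask_sample))
print('Show std dev:', statistics.stdev(show_sample))
print('Welch t-statistic:', welch_t(ask_sample, show_sample))
```

Turning the statistic into a p-value requires the t distribution, which is not in the standard library, so that step is left out here.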

# Time to attract comments
Now, let's determine if posts created at a certain time are more likely to attract comments, starting with ask posts. 

In [10]:
import datetime as dt

# Keep only the creation timestamp and the comment count of each ask post
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# Tally the number of posts and comments per hour of creation
counts_by_hour = {}
comments_by_hour = {}
for dt_str, comment in result_list:
    # Timestamps look like '8/4/2016 11:52'; keep only the time part
    dt_str_hour = dt_str.split()[1]
    dt_object = dt.datetime.strptime(dt_str_hour, "%H:%M")
    dt_hour = dt_object.strftime("%H")
    if dt_hour not in counts_by_hour:
        counts_by_hour[dt_hour] = 1
        comments_by_hour[dt_hour] = comment
    else:
        counts_by_hour[dt_hour] += 1
        comments_by_hour[dt_hour] += comment

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [11]:
counts_by_hour

{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}
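
The same tally can be sketched more compactly by parsing the full timestamp in one step and letting `collections.defaultdict` handle missing keys. The function name `tally_by_hour`, its parameters and the sample rows below are hypothetical; the format string `"%m/%d/%Y %H:%M"` assumes every `created_at` value looks like the ones printed earlier (for example `8/4/2016 11:52`).

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows, date_index=6, comments_index=4):
    """Count posts and comments per hour of creation ('00' to '23')."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)


# Hypothetical rows in the same column layout as the dataset
sample_rows = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '1/26/2016 11:05'],
    ['3', 'Ask HN: Example C', '', '1', '7', 'user_c', '6/17/2016 0:01'],
]
sample_counts, sample_comments = tally_by_hour(sample_rows)
print(sample_counts)
print(sample_comments)
```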

In [12]:
# Average comments per post for each hour (both dictionaries share the same keys)
average = []
for hour in comments_by_hour:
    avg_by_hour = comments_by_hour[hour] / counts_by_hour[hour]
    average.append([hour, avg_by_hour])

average

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

To make the results easier to read, we're going to sort this list of lists and print the five highest values.

In [13]:
swap_avg_by_hour = []
for row in average:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [20]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    avg = row[0]
    time = row[1]    
    string = "{hour}: {average:.2f} average comments per post"
    str_formatted = string.format(hour=dt.datetime.strptime(time, "%H").strftime("%H:%M"),
              average=avg)
    print(str_formatted)
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
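
As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function, so the list can be ordered by its second element directly. The sketch below uses a subset of the averages printed above; the names `sample_average` and `top_five` are illustrative only.

```python
# A subset of the [hour, average] pairs printed earlier
sample_average = [
    ['09', 5.5777777777777775],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['22', 6.746478873239437],
]

# Sort by the average (second element), highest first, and keep five rows
top_five = sorted(sample_average, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```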


As we can see above, the highest average is at 15:00, during working hours, with 38.59 comments per post, about 62% higher than the runner-up. The 16:00 hour comes fourth with 16.80, which suggests a good chance of receiving comments for posts created in the 15:00 - 17:00 interval. The 20:00 - 22:00 range also looks favorable, since 20:00 is in 3rd place and 21:00 in 5th.

According to the [dataset documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timezone used is Eastern Time in the US (GMT-5).

In Brazil, where I live, the timezone (GMT-3) is two hours ahead, so the top 5 would be:
* 17:00: 38.59 average comments per post
* 04:00: 23.81 average comments per post
* 22:00: 21.52 average comments per post
* 18:00: 16.80 average comments per post
* 23:00: 16.01 average comments per post
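
The conversion above can be sketched with fixed UTC offsets from the `datetime` module. This assumes GMT-5 and GMT-3 hold all year, so daylight saving time is ignored; the constants `EASTERN` and `BRASILIA`, the function `convert_hour` and the placeholder date are hypothetical choices for illustration.

```python
import datetime as dt

# Fixed offsets, assuming no daylight saving time adjustments
EASTERN = dt.timezone(dt.timedelta(hours=-5))
BRASILIA = dt.timezone(dt.timedelta(hours=-3))


def convert_hour(hour_str, source=EASTERN, target=BRASILIA):
    """Convert an hour string such as '15' from one timezone to another."""
    # The date is an arbitrary placeholder; only the hour matters here
    moment = dt.datetime(2016, 1, 1, int(hour_str), tzinfo=source)
    return moment.astimezone(target).strftime("%H:%M")


top_hours = ['15', '02', '20', '16', '21']
converted = [convert_hour(hour) for hour in top_hours]
print(converted)
```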
