# Exploring the Hacker News Dataset

In this project we take a quick look at the Hacker News dataset. Two main questions drew our attention:
- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

First, we import the relevant module and read the CSV file:

In [6]:
from csv import reader

# Read the whole CSV file into a list of rows; the with block closes the file.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Then we separate the header row from the data:

In [7]:
headers = hn[0]
hn = hn[1:]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [8]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Do Ask HN or Show HN posts receive more comments on average?

Let's count how many Ask HN posts, Show HN posts and other posts we have:

In [9]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


In [10]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
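The prefix check used above can also be wrapped in a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` and the second sample title are made up for illustration and are not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first and third titles come from the dataset preview; the second is made up.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: A made-up project',
    'Interactive Dynamic Video',
]
print([classify_title(t) for t in sample_titles])  # ['ask', 'show', 'other']
```
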


Now let's find out how many comments Ask HN and Show HN posts receive in total and on average:

In [12]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
total_ask_comments

24483

In [14]:
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_ask_comments

14.038417431192661

In [15]:
total_show_comments = 0
for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
total_show_comments

11988

In [16]:
avg_show_comments = total_show_comments / len(show_posts)
avg_show_comments

10.31669535283993

In [17]:
avg_ask_comments > avg_show_comments

True

As we can see, the value of avg_ask_comments is greater than the value of avg_show_comments (14.0 versus 10.3). Therefore, the answer to the first question is:
    Ask HN posts receive more comments on average.
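Since the total and the average are computed the same way for both groups, the two loops could be folded into one helper. The sketch below assumes the same column layout as the dataset (comment count as a string at index 4); `average_comments` and the two sample rows are hypothetical and only illustrate the idea.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of the given rows (counts are stored as strings)."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two made-up rows in the same column layout as the dataset.
sample_posts = [
    ['1', 'Ask HN: Example one?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Example two?', '', '28', '29', 'user_b', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 17.5
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same two averages as the cells above.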

### Do posts created at a certain time receive more comments on average?

To answer this question, let's create a new list that holds only the creation time and the number of comments for each Ask HN post.

In [18]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    comments_count = row[4]
    comments_count = int(comments_count)
    result_list.append([created_at, comments_count])

The next step is to create two dictionaries that count the Ask HN posts and their comments by hour.

In [23]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    time_string = dt.datetime.strftime(date_dt, "%H")
    if time_string not in counts_by_hour:
        counts_by_hour[time_string] = 1
        comments_by_hour[time_string] = row[1]
    else:
        counts_by_hour[time_string] += 1
        comments_by_hour[time_string] += row[1]
        
print(counts_by_hour)
print(comments_by_hour)

{'05': 46, '13': 85, '17': 100, '16': 108, '19': 110, '00': 55, '02': 58, '12': 73, '09': 45, '10': 59, '01': 60, '22': 71, '04': 47, '15': 116, '06': 44, '14': 107, '20': 80, '11': 58, '03': 54, '21': 109, '23': 68, '08': 48, '18': 109, '07': 34}
{'05': 464, '13': 1253, '17': 1146, '16': 1814, '19': 1188, '00': 447, '02': 1381, '12': 687, '09': 251, '10': 793, '01': 683, '22': 479, '04': 337, '15': 4477, '06': 397, '14': 1416, '20': 1722, '11': 641, '03': 421, '21': 1745, '23': 543, '08': 492, '18': 1439, '07': 267}
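The same tally can be written a little more compactly with `dict.get`, which supplies a default of 0 for hours that have not been seen yet. This is a sketch on three made-up `[created_at, comments_count]` pairs, not the notebook's own code; `tally_by_hour` is a hypothetical name.

```python
import datetime as dt

def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Count posts and sum comments per hour for [created_at, comments] pairs."""
    counts = {}
    comments = {}
    for created_at, n_comments in pairs:
        # Extract the zero-padded hour, e.g. '09', from the timestamp string.
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + n_comments
    return counts, comments

# Made-up sample pairs in the same shape as result_list.
sample_pairs = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]
counts, comments = tally_by_hour(sample_pairs)
print(counts)    # {'09': 2, '13': 1}
print(comments)  # {'09': 7, '13': 29}
```
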


Now let's calculate the average number of comments for posts created during each hour of the day.

In [24]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour

[['05', 10.08695652173913],
 ['13', 14.741176470588234],
 ['17', 11.46],
 ['16', 16.796296296296298],
 ['19', 10.8],
 ['00', 8.127272727272727],
 ['02', 23.810344827586206],
 ['12', 9.41095890410959],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['04', 7.170212765957447],
 ['15', 38.5948275862069],
 ['06', 9.022727272727273],
 ['14', 13.233644859813085],
 ['20', 21.525],
 ['11', 11.051724137931034],
 ['03', 7.796296296296297],
 ['21', 16.009174311926607],
 ['23', 7.985294117647059],
 ['08', 10.25],
 ['18', 13.20183486238532],
 ['07', 7.852941176470588]]

Next we swap the columns in avg_by_hour so that the average comes first and the list can be sorted by it:

In [26]:
swap_avg_by_hour = []
for row in avg_by_hour:
    first_el = row[1]
    second_el = row[0]
    swap_avg_by_hour.append([first_el, second_el])
swap_avg_by_hour

[[10.08695652173913, '05'],
 [14.741176470588234, '13'],
 [11.46, '17'],
 [16.796296296296298, '16'],
 [10.8, '19'],
 [8.127272727272727, '00'],
 [23.810344827586206, '02'],
 [9.41095890410959, '12'],
 [5.5777777777777775, '09'],
 [13.440677966101696, '10'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [7.170212765957447, '04'],
 [38.5948275862069, '15'],
 [9.022727272727273, '06'],
 [13.233644859813085, '14'],
 [21.525, '20'],
 [11.051724137931034, '11'],
 [7.796296296296297, '03'],
 [16.009174311926607, '21'],
 [7.985294117647059, '23'],
 [10.25, '08'],
 [13.20183486238532, '18'],
 [7.852941176470588, '07']]

Let's sort our list in descending order:

In [27]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
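As an alternative to swapping the columns before sorting, `sorted` accepts a `key` function that picks the value to sort by. The sketch below uses three rounded values from the table above; `sample_avg_by_hour` and `sorted_by_avg` are names introduced here for illustration.

```python
# Three [hour, average] pairs, rounded from the results above.
sample_avg_by_hour = [['05', 10.09], ['15', 38.59], ['02', 23.81]]

# Sort by the second element (the average), largest first, without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)  # [['15', 38.59], ['02', 23.81], ['05', 10.09]]
```

The pairs keep their original `[hour, average]` order, so later code would read the hour from index 0 and the average from index 1.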

And finally, we print the top 5 hours for Ask HN post comments:

In [28]:
print("Top 5 Hours for Ask Posts Comments")

Top 5 Hours for Ask Posts Comments


In [32]:
for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    hour_dt = dt.datetime.strptime(hour, "%H")
    hour_str = dt.datetime.strftime(hour_dt, "%H:%M")
    string_avg_cmt = "{0}: {1:.2f} average comments per post."
    print(string_avg_cmt.format(hour_str, average))

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


So we got the top 5 hours of the day in which Ask HN posts receive the most comments on average. The times are in US Eastern Time. The last step is to convert them to Minsk time. The difference is +8 hours relative to Eastern Standard Time (+7 hours while daylight saving time is in effect in the US).

In [33]:
for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    hour_dt = dt.datetime.strptime(hour, "%H")
    time_diff = dt.timedelta(hours = 8)
    hour_dt = hour_dt + time_diff
    hour_str = dt.datetime.strftime(hour_dt, "%H:%M")
    string_avg_cmt = "{0}: {1:.2f} average comments per post."
    print(string_avg_cmt.format(hour_str, average))

23:00: 38.59 average comments per post.
10:00: 23.81 average comments per post.
04:00: 21.52 average comments per post.
00:00: 16.80 average comments per post.
05:00: 16.01 average comments per post.


To have a higher chance of receiving comments, we should create an Ask HN post at around 23:00, 10:00, 04:00, 00:00 or 05:00 Minsk time.
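The conversion can also be expressed with timezone-aware datetimes instead of adding a timedelta by hand. The sketch below assumes fixed offsets (UTC-5 for Eastern Standard Time, UTC+3 for Minsk) and therefore ignores US daylight saving time; `eastern_hour_to_minsk` and the placeholder date are assumptions made for illustration.

```python
import datetime as dt

# Assumed fixed offsets: Eastern Standard Time is UTC-5, Minsk is UTC+3.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5))
MINSK = dt.timezone(dt.timedelta(hours=3))

def eastern_hour_to_minsk(hour_str):
    """Convert an hour string such as '15' from Eastern Standard Time to Minsk time."""
    # The date is an arbitrary placeholder; only the hour of day matters here.
    eastern = dt.datetime(2016, 1, 1, int(hour_str), tzinfo=EASTERN_STANDARD)
    return eastern.astimezone(MINSK).strftime("%H:%M")

for hour in ['15', '02', '20', '16', '21']:
    print(hour, '->', eastern_hour_to_minsk(hour))
```

For the five hours above this gives 23:00, 10:00, 04:00, 00:00 and 05:00, the same values as the manual +8 hour shift.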