# Exploring Hacker News Posts

## Description of the project

This is a project to showcase my Python data analysis skills. I am going to analyze [a dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) of submissions to [Hacker News](https://news.ycombinator.com/).

For the purpose of the analysis, the dataset was first reduced to approximately 20,000 rows by deleting submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
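
The reduction step itself is not part of this notebook. As a rough illustration only, it could look something like the sketch below. The function name `reduce_dataset`, its `sample_size` and `seed` parameters, and the demo rows are all assumptions made for this example, not the procedure actually used to build the file.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample.

    Hypothetical helper: `rows` is a list of lists without the header row,
    and index 4 is assumed to hold num_comments as a string.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Never ask for more rows than are available.
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows in the same column order as the dataset.
demo_rows = [
    ['1', 'A', 'http://a', '3', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'B', 'http://b', '5', '2', 'u2', '1/2/2016 11:00'],
    ['3', 'C', 'http://c', '1', '7', 'u3', '1/3/2016 12:00'],
    ['4', 'D', 'http://d', '9', '0', 'u4', '1/4/2016 13:00'],
]
reduced = reduce_dataset(demo_rows, sample_size=2)
print(len(reduced))
```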

### Research questions

* Do Ask HN or Show HN submissions get more comments?
* Do posts created at a certain time receive more comments on average?

Let's start by importing the dataset and converting it into a list of lists, which makes it easier to analyze. I am going to print the first rows to get a glimpse of the data.

In [11]:
from csv import reader

# Use a context manager so the file is closed once it has been read.
with open('/Users/Damiano/Datasets/hacker_news.csv', encoding="utf8") as opened:
    hn = list(reader(opened))

print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Next I am going to isolate the headers and delete them from the original dataset.

In [12]:
headers = hn[0]

print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [13]:
del hn[0]

Let's check that everything worked as expected:

In [14]:
print(headers)
print('\n')
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


My analysis is focused only on the posts marked as **Show HN** and **Ask HN**, so I am going to isolate them from the rest of the dataset.

In [16]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print('\n')
print(len(show_posts))
print('\n')
print(len(other_posts))

1744


1162


17194
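
The prefix test used above can also be written as a small function, which makes the three branches easy to check in isolation. This is only a sketch: the name `categorize` and the sample titles are made up for illustration.

```python
def categorize(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, only to exercise the three branches.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [categorize(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```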


Now I will compute the average **number of comments** per post (column index **4**) for each category, to see which category gets more comments.

In [21]:
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print('The average number of comments in every Ask HN post is:', avg_ask_comments)
print('\n')

total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print('The average number of comments in every Show HN post is:', avg_show_comments)
print('\n')

total_other_comments = 0

for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments
    
avg_other_comments = total_other_comments / len(other_posts)

print('The average number of comments in every other post is:', avg_other_comments)

The average number of comments in every Ask HN post is: 14.038417431192661


The average number of comments in every Show HN post is: 10.31669535283993


The average number of comments in every other post is: 26.8730371059672
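
The three loops above repeat the same logic, so one possible refactoring is a single helper. The sketch below assumes the same row layout (comment count as a string at index 4); the name `average_comments` and the demo rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the same column order as the dataset.
demo_posts = [
    ['1', 'Ask HN: A?', '', '3', '6', 'u1', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '5', '2', 'u2', '1/2/2016 11:00'],
]
print(average_comments(demo_posts))  # 4.0
```

With this helper, each category average becomes a single call, for example `average_comments(ask_posts)`.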


Based on this preliminary analysis, **Ask HN** posts receive more comments on average than **Show HN** posts (about 14.0 versus 10.3). The high average for the **other** category is probably inflated by a few discussions with a very large number of comments.
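
One way to check whether a few very large discussions are inflating an average is to compare the mean with the median, which is much less sensitive to outliers. The numbers below are made up to show the effect; they are not taken from the dataset.

```python
from statistics import mean, median

# Made-up comment counts: four small threads and one very large one.
comment_counts = [1, 2, 3, 4, 500]

mean_comments = mean(comment_counts)      # pulled up by the single large thread
median_comments = median(comment_counts)  # middle value, barely affected by it

print(mean_comments, median_comments)  # 102 3
```

Applied to this notebook, the same comparison could be run on `[int(post[4]) for post in other_posts]`.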

Now it's time to find out whether the **time** of posting influences the number of comments. To do that, I first need to parse the **created_at** column into *datetime* objects and then calculate the number of posts and comments for each hour of the day. The rest of the analysis looks at **Ask HN** posts only.

In [24]:
import datetime as dt

result_list = []

for post in ask_posts:
    date = post[6]
    num_comments = int(post[4])
    result_list.append([date, num_comments])
    
print(result_list[0:5])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


In [48]:
counts_by_hour = {} # will contain the number of ask posts created during each hour of the day.
comments_by_hour = {} # will contain the corresponding number of comments ask posts created at each hour received
date_formula = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    num_comments = row[1]
    hours_object = dt.datetime.strptime(date, date_formula).strftime('%H')
    # strptime parses the string into a datetime; strftime then formats that datetime back into a string.
    # Any format code can be passed to strftime to extract a specific part, e.g. '%H' for the hour.
    # Equivalent form, with datetime_object being the parsed datetime:
    # hours_object = dt.datetime.strftime(datetime_object, '%H')
    if hours_object not in counts_by_hour:
        counts_by_hour[hours_object] = 1
        comments_by_hour[hours_object] = num_comments
    else:
        counts_by_hour[hours_object] += 1
        comments_by_hour[hours_object] += num_comments

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
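
To make the difference between `strptime` and `strftime` concrete, here is a small standalone example using the first date printed earlier. The variable names are chosen for this sketch only.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
created_at = '8/16/2016 9:55'  # first Ask HN date shown in the output above

parsed = dt.datetime.strptime(created_at, date_format)  # str -> datetime
hour_as_text = parsed.strftime('%H')  # datetime -> zero-padded string
hour_as_int = parsed.hour             # plain integer attribute

print(hour_as_text, hour_as_int)  # 09 9
```

The zero-padded string is what the dictionaries above use as keys; the integer attribute would be an alternative if numeric keys were preferred.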

We can use the two dictionaries created above to calculate the average number of comments for posts created during each hour of the day.

In [52]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
                       
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
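
The counting and averaging steps can also be combined into one function. The following is a sketch under the same assumptions as the code above (rows of `[created_at, num_comments]`); the name `average_by_hour` is hypothetical, and the second demo row is invented so that two posts share an hour.

```python
from collections import defaultdict
import datetime as dt

def average_by_hour(rows, date_format='%m/%d/%Y %H:%M'):
    """Map each hour ('00'..'23') to the mean comment count of its posts."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
        totals[hour][0] += 1
        totals[hour][1] += num_comments
    return {hour: total / count for hour, (count, total) in totals.items()}

demo = [['8/16/2016 9:55', 6], ['8/17/2016 9:10', 2], ['5/2/2016 10:14', 1]]
averages = average_by_hour(demo)
print(averages)  # {'09': 4.0, '10': 1.0}
```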

Let's sort this data in a more readable way. First we need to swap the two items in each row, so that the *sorted()* built-in function orders the rows by average.

In [55]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour[0:5])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16']]


In [59]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
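
Swapping the columns is one way to sort by average. An alternative is to leave the rows as `[hour, average]` and pass a `key` function to `sorted()`. The sample below uses three rows with values rounded from the results above.

```python
# Three [hour, average] rows, rounded from the results above.
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# Sort on the second item of each row, highest average first.
ranked = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
print(ranked)  # [['15', 38.59], ['02', 23.81], ['09', 5.58]]
```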

In [60]:
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post.".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
