# Project: Exploring Hacker News Posts

### We'll explore the Hacker News Post dataset to find an answer to the following questions:

#### Do Ask HN or Show HN receive more comments on average?

#### Do posts created at a certain time receive more comments on average?

We'll work with the "HN_posts_year_to_Sep_26_2016.csv" dataset, read here from a file named "hacker_news.csv". It contains a collection of HN posts. We'll import the CSV as a list of lists.

In [278]:
import csv
# Open the file in a with-block so it is closed automatically after reading
with open('hacker_news.csv') as opened_file:
    hn = list(csv.reader(opened_file))

In [279]:
print("The total number of elements (Hacker News posts) in the CSV file is {quantity}.".format(quantity=len(hn)))

The total number of elements (Hacker News posts) in the CSV file is 20101.


Show the first five (5) elements of the list.

In [280]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Assign the header to a variable.

In [281]:
headers = hn[0]

Delete the header from the dataset.

In [282]:
del hn[0]

Print the value of the headers variable to see the column names.

In [283]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Show the first five elements of the list to verify that the header is no longer present.

In [284]:
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
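
The header handling above can also be done in a single pass while reading the file. The sketch below is only an illustration: it reads from a small in-memory string instead of the real file, and the names `sample_csv`, `sample_headers`, and `sample_rows` are made up for this example.

```python
import csv
import io

# Tiny in-memory stand-in for the CSV file; the single data row is copied
# from the notebook output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

reader = csv.reader(sample_csv)
sample_headers = next(reader)  # consume the first row as the header
sample_rows = list(reader)     # every remaining row is data

print(sample_headers)
print(len(sample_rows))
```

With this approach the header never enters the data list, so no `del` step is needed afterwards.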


We'll create three (3) lists to use as containers for classifying posts according to the strings found at the start of each post's title.

In [285]:
ask_posts = list()
show_posts = list()
other_posts = list()

The code below checks the 'title' column for certain prefixes so we can classify each post into one of the three buckets:

In [286]:
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [287]:
print("The total amount of posts on the list ask_posts is {quantity}".format(quantity=len(ask_posts)))

The total amount of posts on the list ask_posts is 1744


In [288]:
print("The total amount of posts on the list show_posts is {quantity}".format(quantity=len(show_posts)))

The total amount of posts on the list show_posts is 1162


In [289]:
print("The total amount of posts on the list other_posts is {quantity}".format(quantity=len(other_posts)))

The total amount of posts on the list other_posts is 17194
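
As a possible variation on the loop above, the prefix check can be pulled into a small function so it can be tried on individual titles. This is a sketch only: `classify_title` is a hypothetical helper, not part of the notebook, and two of the three sample titles are invented.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: What are you reading?",  # invented title
    "SHOW HN: My weekend project",    # invented title
    "Interactive Dynamic Video",      # title taken from the dataset
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first is what lets "SHOW HN" and "Show HN" land in the same bucket, just as in the loop above.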


Find the total number of comments in ask posts

In [290]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
print(total_ask_comments)

24483


Compute the average number of comments on ask posts, and assign it to avg_ask_comments.

In [291]:
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


Find the total number of comments in show posts

In [292]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
print(total_show_comments)

11988


Compute the average number of comments on show posts, and assign it to avg_show_comments.

In [293]:
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


### Do show posts or ask posts receive more comments on average?

Ask posts receive more comments on average (14.04 vs. 10.32).
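
The two averages can be double-checked directly from the totals and list lengths printed earlier. The sketch below also wraps the total-then-divide steps in a hypothetical helper, `average_comments`, which is not part of the notebook; the two sample rows are the first two rows of the dataset shown above.

```python
def average_comments(posts, comments_index=4):
    """Average the num_comments column (index 4) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Totals and counts copied from the notebook outputs above.
avg_ask_check = 24483 / 1744
avg_show_check = 11988 / 1162
print(round(avg_ask_check, 2), round(avg_show_check, 2))

# First two rows of the dataset, as printed earlier.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```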

We'll create a list called result_list where each element is a list containing two items: a) the creation date/time; b) the number of comments.

In [294]:
from datetime import datetime
result_list = list()
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    

We'll create two dictionaries to store two pieces of information: a) the number of ask posts per hour; b) the number of comments per hour.

In [295]:
counts_by_hour = dict()
comments_by_hour = dict()

for row in result_list:    
    date_time_obj = datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date_time_obj.hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1        
    else:
        counts_by_hour[hour] = 1
    
    if hour in comments_by_hour:
        comments_by_hour[hour] += row[1]
    else:
        comments_by_hour[hour] = row[1]
        

print(result_list[0])

['8/16/2016 9:55', 6]
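
The two `if`/`else` blocks above follow the usual "initialize or increment" pattern. One alternative, sketched here under the assumption that a missing hour should simply start at zero, is `collections.defaultdict`. The first sample row comes from the output above; the other two rows and all variable names are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

sample_result_list = [
    ["8/16/2016 9:55", 6],    # row printed in the output above
    ["8/16/2016 9:10", 4],    # invented row, same hour
    ["1/26/2016 19:30", 10],  # invented row, different hour
]

sample_counts = defaultdict(int)    # hour -> number of posts
sample_comments = defaultdict(int)  # hour -> total number of comments

for created_at, num_comments in sample_result_list:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    sample_counts[hour] += 1        # missing keys start at 0
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```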


The following dictionary shows the hour as the key and the number of ask posts created during that hour as the value:

In [296]:
print(counts_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


The following dictionary shows the hour as the key and the total number of comments on ask posts created during that hour as the value:

In [297]:
print(comments_by_hour)

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


We'll create a list of lists in which the first element is the hour and the second element is the average number of comments

In [298]:
avg_by_hour = list()
for item in counts_by_hour.items():
    avg_by_hour.append([item[0], comments_by_hour[item[0]] / item[1]])
    

Below, each inner list holds the hour of the day (first element) and the average number of comments per ask post created during that hour (second element).

In [299]:
print(avg_by_hour)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
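
To make the arithmetic concrete, the sketch below recomputes the averages for three hours using the counts and comment totals printed above. The dictionary names are made up for this example, and only a subset of hours is included.

```python
# Values copied from the counts_by_hour and comments_by_hour outputs above.
counts_subset = {15: 116, 2: 58, 20: 80}
comments_subset = {15: 4477, 2: 1381, 20: 1722}

# Average comments per post = total comments in that hour / posts in that hour.
avg_subset = {
    hour: comments_subset[hour] / count
    for hour, count in counts_subset.items()
}

for hour, avg in avg_subset.items():
    print(hour, avg)
```

For hour 15, for instance, 4477 comments spread over 116 posts gives roughly 38.59 comments per post, matching the entry in the list above.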


We'll create a dictionary where the key is the hour and the value is the average number of comments per ask post. After that, we'll sort the hours by that average in descending order to find the top five (5) hours in which ask posts receive the most comments on average.

In [300]:
avg_by_hour_dict = dict()
for element in avg_by_hour:    
    avg_by_hour_dict[element[0]] = element[1]
print(sorted(avg_by_hour_dict, key=avg_by_hour_dict.get, reverse=True)[:5])

[15, 2, 20, 16, 21]


The top five (5) hours of the day to create Ask HN posts on Hacker News, ranked by average comments per post, are the following:

In [302]:
top_5 = sorted(avg_by_hour_dict, key=avg_by_hour_dict.get, reverse=True)[:5]
for element in top_5:
    print("{time}: {posts:.2f} average comments per post".format(time=datetime.strptime(str(element), "%H").time(), posts=avg_by_hour_dict[element]))

15:00:00: 38.59 average comments per post
02:00:00: 23.81 average comments per post
20:00:00: 21.52 average comments per post
16:00:00: 16.80 average comments per post
21:00:00: 16.01 average comments per post
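
The same ranking can be obtained by sorting the (hour, average) pairs directly, which keeps each average next to its hour and avoids a second dictionary lookup. This is a sketch on a subset of the averages printed earlier; `avg_sample`, `top_pairs`, and `top_hours` are names invented for this example.

```python
# A subset of the averages from the avg_by_hour output above.
avg_sample = {
    9: 5.5777777777777775,
    16: 16.796296296296298,
    15: 38.5948275862069,
    21: 16.009174311926607,
    20: 21.525,
    2: 23.810344827586206,
}

# Sort the pairs by their second item (the average), highest first.
top_pairs = sorted(avg_sample.items(), key=lambda pair: pair[1], reverse=True)[:5]
top_hours = [hour for hour, _ in top_pairs]

for hour, avg in top_pairs:
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```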
