# Exploring Hacker News Posts

In this project, we analyze two types of posts on the Hacker News website: 'Ask HN' posts, which ask a specific question, and 'Show HN' posts, which show a project, product, or something interesting. We want to identify which type receives more comments on average, and whether posts created at certain times receive more comments on average.

------

## Opening and exploring the data

We import the necessary libraries.

In [1]:
from csv import reader
import datetime as dt

We open the file, read it, and convert it into a list of rows so we can work with it. Finally, we close the file.

In [2]:
opened_file = open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()
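
As an alternative, a `with` block closes the file automatically, even if reading fails partway through. The sketch below is a minimal illustration, not the cell used in this project: it reads from an in-memory `io.StringIO` holding two made-up rows so that it runs on its own. The name `sample_csv` and its contents are assumptions; with the real dataset, the `open(...)` call from the cell above would take its place.

```python
from csv import reader
import io

# Hypothetical stand-in for the CSV file, so the example is self-contained.
sample_csv = io.StringIO(
    'id,title,num_comments\n'
    '1,"Ask HN: Example, with a comma",3\n'
)

# The with block closes the file-like object when the block ends.
with sample_csv as f:
    rows = list(reader(f))

print(rows[0])  # header row
print(rows[1])  # the quoted title keeps its comma as one field
```

Note that every value comes back as a string, which is why the later cells call `int()` on the comment counts.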

We print the first five rows of the dataset to see what it looks like. 

In [3]:
for i in range(5):
    print(hn[i])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


We store the header row in a separate variable and then we remove the header from our dataset.

In [4]:
header = hn[0]
hn = hn[1:]

## Analyzing the data

### Calculating the average number of comments for each type of post

We split our dataset into three types of posts: 'Ask Posts', 'Show Posts', and 'Other Posts'. To do this, we loop through the whole dataset and check the title of each post, ignoring case. If it starts with 'Ask HN', we add it to the 'Ask Posts' list, and we do the same for posts that start with 'Show HN'. All other posts are added to the 'Other Posts' list.

Finally, we print the number of posts of each type.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('Num. of Ask Posts: {:,}'.format(len(ask_posts)))
print('Num. of Show Posts: {:,}'.format(len(show_posts)))
print('Num. of Other Posts: {:,}'.format(len(other_posts)))

Num. of Ask Posts: 9,139
Num. of Show Posts: 10,158
Num. of Other Posts: 273,822
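
The prefix check can also be isolated in a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper that does not appear in the notebook, and the upper-case 'SHOW HN' title was altered from the dataset to show the effect of lower-casing.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        return 'ask'
    if title_lower.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'SHOW HN: Finding puns computationally',  # case altered for illustration
    'algorithmic music',
]

for title in sample_titles:
    print(classify_title(title), '-', title)
```

Because the test is a plain prefix match, a title such as 'Ask HNers ...' would also be counted as an 'Ask Post'.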


We print the first five rows of the 'Ask Posts' and 'Show Posts' to check that the titles were correctly identified.

In [6]:
print('Ask Posts:\n{}\n'.format(ask_posts[:5]))
print('Show Posts:\n{}\n'.format(show_posts[:5]))

Ask Posts:
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]

Show Posts:
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1'

We calculate the average number of comments that the 'Ask Posts' receive. We take the number of comments in each row, sum them, and divide the total by the number of posts in that category. We then print the result.

In [7]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = round(total_ask_comments / len(ask_posts), 2)
print('Average Ask Comments: {}'.format(avg_ask_comments))

Average Ask Comments: 10.39


We calculate the average number of comments that the 'Show Posts' receive. As before, we take the number of comments in each row, sum them, and divide the total by the number of posts in that category. We then print the result.

In [8]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = round(total_show_comments / len(show_posts), 2)
print('Average Show Comments: {}'.format(avg_show_comments))

Average Show Comments: 4.89
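
Since the two averages are computed the same way, the calculation could be wrapped in one function. The sketch below assumes a hypothetical helper, `average_comments`, and runs it on the first three 'Ask Posts' rows shown earlier (with 7, 3, and 0 comments).

```python
def average_comments(rows, comments_index=4):
    """Average the comment counts of a list of rows, rounded to 2 decimals.

    Raises ZeroDivisionError if rows is empty.
    """
    total = sum(int(row[comments_index]) for row in rows)
    return round(total / len(rows), 2)

# First three 'Ask Posts' rows from the output above.
sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

print(average_comments(sample_rows))  # (7 + 3 + 0) / 3 -> 3.33
```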


**Comment:** On average, 'Ask Posts' receive more than twice as many comments as 'Show Posts' (10.39 vs. 4.89), so we focus the rest of our analysis on the 'Ask Posts'.

### Calculating the average number of comments for each hour

For each row of the 'Ask Posts', we take the 'created_at' and 'num_comments' columns and add them to a list as a pair. We print the first five pairs to take a look at the data.

In [9]:
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
print(result_list[:5])

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2]]


We calculate the number of posts and comments by hour. To do this, we loop through the list we created above and extract the hour from each date. We create two dictionaries keyed by hour: one counts the posts created in each hour, and the other sums the comments those posts received.

We print the results to get an idea of how they look.

In [10]:
counts_by_hour = {}
comments_by_hours = {}

for row in result_list:
    created_at = row[0]
    created_at_dt = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = created_at_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hours[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hours[hour] += row[1]
        
print('Posts per Hour: {}\n'.format(counts_by_hour))
print('Comments per Hour: {}'.format(comments_by_hours))

Posts per Hour: {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}

Comments per Hour: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
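
To see the parsing and counting on a small scale, the sketch below runs the same idea on the first four rows of `result_list`. It swaps the explicit `if`/`else` for `collections.defaultdict`, which is an alternative technique and not what the cell above does; the names `sample_rows`, `counts`, and `comments` are chosen for illustration only.

```python
import datetime as dt
from collections import defaultdict

# Sample rows copied from the output above: [created_at, num_comments].
sample_rows = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
]

# Missing keys start at 0, so no "first time seen" branch is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {2: 1, 1: 1, 22: 2}
print(dict(comments))  # {2: 7, 1: 3, 22: 3}
```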


Using the two dictionaries we created above, we can calculate the average number of comments per post for each hour. Both dictionaries share the same keys, so we loop through the hours in one of them and, for each hour, divide the number of comments by the number of posts.

We then print the results. 

In [11]:
avg_by_hour = []

for hour in counts_by_hour:
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hours[hour]
    avg_comments = round(num_comments / num_posts, 2)
    avg_by_hour.append([hour, avg_comments])
    
print('Avg. Num. of Comments per Post by Hour: {}'.format(avg_by_hour))

Avg. Num. of Comments per Post by Hour: [[2, 11.14], [1, 7.41], [22, 8.8], [21, 8.69], [19, 7.16], [17, 9.45], [15, 28.68], [14, 9.69], [13, 16.32], [11, 8.96], [10, 10.68], [9, 6.65], [7, 7.01], [3, 7.95], [23, 6.7], [20, 8.75], [16, 7.71], [8, 9.19], [0, 7.56], [18, 7.94], [12, 12.38], [4, 9.71], [6, 6.78], [5, 8.79]]
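
As a worked check of the division, the sketch below recomputes the averages for two of the hours using the totals printed earlier. It uses a dictionary comprehension in place of the loop, purely as an illustration; `sample_counts`, `sample_comments`, and `sample_avg` are names introduced here.

```python
# Totals for hours 15 and 13, copied from the earlier output.
sample_counts = {15: 646, 13: 444}
sample_comments = {15: 18525, 13: 7245}

# Average comments per post for each hour, rounded to two decimals.
sample_avg = {
    hour: round(sample_comments[hour] / sample_counts[hour], 2)
    for hour in sample_counts
}

print(sample_avg)  # {15: 28.68, 13: 16.32}
```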


With these averages in hand, we swap the two columns so that the average comes first, which lets us sort the list from the highest average number of comments to the lowest. We then print the top five hours.

In [12]:
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]
    swap_avg_by_hour.append([avg_comments, hour])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments:')

for row in sorted_swap[:5]:
    comments = row[0]
    hour = row[1]
    print('{}:00: {} average comments per post.'.format(hour, comments))

Top 5 Hours for Ask Posts Comments:
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
2:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.
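
Swapping the columns is one way to sort by average; another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. The sketch below applies this to five entries of `avg_by_hour` copied from the output above; `sample_avg_by_hour` and `top_three` are names used only for this illustration.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [[2, 11.14], [1, 7.41], [15, 28.68], [13, 16.32], [12, 12.38]]

# Sort by the second element (the average), highest first.
top_three = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)[:3]

for hour, avg_comments in top_three:
    print('{}:00: {} average comments per post.'.format(hour, avg_comments))
```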


**Comment:** In this dataset, 'Ask Posts' created at 15:00 EST receive the most comments on average (28.68 per post), followed by those created at 13:00 EST (16.32) and 12:00 EST (12.38). To have a higher chance of receiving comments on an 'Ask Post', these are the hours to post in.