# Exploring Hacker News Posts

Hacker News, created by [Y Combinator](https://www.ycombinator.com/), is a site of user-submitted posts where people who work in or follow startup and technology circles ask questions, respond to others' inquiries, and share stories. Because the site is so popular, posts that reach the top of the listings can receive hundreds of thousands of views and a large number of comments.

Two types of posts on Hacker News are of interest to us: those whose titles begin with `Ask HN` and those whose titles begin with `Show HN`. Think of `Ask HN` posts as questions posed to the Hacker News community, and `Show HN` posts as things shared with the community, such as products, projects, or general topics of interest.

Let's compare a sample of both post types to answer these questions:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?


The data set contains Hacker News posts from the 12 months up to September 26, 2016.

It includes the following columns:

- id: the unique identifier of the post

- title: the title of the post

- url: the url of the item being linked to

- num_points: the number of upvotes the post received

- num_comments: the number of comments the post received

- author: the name of the account that made the post

- created_at: the date and time the post was made (the time zone is Eastern Time in the US)

## Loading the Data 

To begin, let's import the necessary library and read the data set into a list of lists:

In [1]:
from csv import reader

# the with block closes the file automatically once the rows are read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# remove the header
header = hn[0]
hn = hn[1:]
# display header, first five rows, and number of instances
print(header, '\n')
print(hn[:5], '\n')
print(len(hn))


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']] 

20100
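
The same load-and-split-header pattern can be tried without the CSV file on disk. The sketch below assumes a small in-memory stand-in for `hacker_news.csv` built from two of the rows printed above; `sample_csv`, `rows`, and `data` are names introduced only for this illustration.

```python
import io
from csv import reader

# Hypothetical in-memory stand-in for hacker_news.csv
# (header plus two rows copied from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),"
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,"
    "8,2,walterbell,9/30/2015 4:12\n"
)

rows = list(reader(sample_csv))

# Separate the header from the data rows.
header, data = rows[0], rows[1:]

print(header)
print(len(data))
```

Note that every field, including `num_points` and `num_comments`, comes back as a string, which is why later cells convert with `int()`.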


## Extracting Ask HN and Show HN Posts

Now, let's separate the posts whose titles begin with `Ask HN` or `Show HN` from the rest:

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# display the number of posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
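
The title check above can be pulled into a small function, which makes the case-insensitive prefix matching easy to test in isolation. This is a minimal sketch: `classify_title` and the sample titles are hypothetical and not part of the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the function.
sample_titles = [
    'Ask HN: How do you learn a new codebase?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',
]

for title in sample_titles:
    print(classify_title(title), '-', title)
```

Because the check uses `startswith`, a title that merely contains the phrase later on, like the last sample, falls into the `other` group.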


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Next, let's determine whether ask posts or show posts receive more comments on average:

In [3]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [4]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


It seems that `Ask HN` posts receive more comments on average (14.04 comments) than `Show HN` posts (10.32 comments). Given this finding, let's focus the rest of the analysis on `Ask HN` posts and explore our second question: does the time a post is created influence the average number of comments it receives?
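
Since the two averaging cells differ only in the list they loop over, the calculation can be sketched as one reusable function. `average_comments` and the sample rows are hypothetical; the rows follow the data set's column order, with `num_comments` at index 4.

```python
def average_comments(rows, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not rows:
        return 0.0  # avoid dividing by zero when a group is empty
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Hypothetical rows in the data set's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['1', 'Ask HN: First question', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Second question', '', '3', '20', 'user_b', '1/2/2016 15:30'],
]
sample_show = [
    ['3', 'Show HN: A project', 'http://example.com', '7', '6', 'user_c', '1/3/2016 9:15'],
]

print(average_comments(sample_ask))
print(average_comments(sample_show))
```

With the real lists, the same two calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.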

## Finding the Number of Ask Posts and Comments by Hour Created

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M').strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour]= row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
print("Counts by hour: \n", counts_by_hour)
print("\nComments by hour: \n", comments_by_hour)

Counts by hour: 
 {'16': 108, '01': 60, '06': 44, '18': 109, '04': 47, '15': 116, '05': 46, '09': 45, '12': 73, '03': 54, '10': 59, '00': 55, '08': 48, '14': 107, '17': 100, '23': 68, '13': 85, '21': 109, '11': 58, '20': 80, '19': 110, '07': 34, '22': 71, '02': 58}

Comments by hour: 
 {'16': 1814, '01': 683, '06': 397, '18': 1439, '04': 337, '15': 4477, '05': 464, '09': 251, '12': 687, '03': 421, '10': 793, '00': 447, '08': 492, '14': 1416, '17': 1146, '23': 543, '13': 1253, '21': 1745, '11': 641, '20': 1722, '19': 1188, '07': 267, '22': 479, '02': 1381}
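
As an alternative to the `if hour not in ...` branch, the same tally can be sketched with `collections.defaultdict`, which starts every missing key at zero. This is a variation on the cell above, not the method it uses, and the sample pairs are hypothetical values in the data set's date format.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the data set's date format.
sample_results = [
    ['8/4/2016 11:52', 52],
    ['6/17/2016 0:01', 1],
    ['9/30/2015 11:05', 2],
]

# defaultdict(int) returns 0 for a missing key, so no membership check is needed.
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '00' or '11'.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Note that `strptime` accepts the unpadded months, days, and hours found in the data (such as `6/17/2016 0:01`), while `strftime('%H')` always returns two digits.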



## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [6]:
avg_by_hour = []

for hr in counts_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
    
avg_by_hour

[['16', 16.796296296296298],
 ['01', 11.383333333333333],
 ['06', 9.022727272727273],
 ['18', 13.20183486238532],
 ['04', 7.170212765957447],
 ['15', 38.5948275862069],
 ['05', 10.08695652173913],
 ['09', 5.5777777777777775],
 ['12', 9.41095890410959],
 ['03', 7.796296296296297],
 ['10', 13.440677966101696],
 ['00', 8.127272727272727],
 ['08', 10.25],
 ['14', 13.233644859813085],
 ['17', 11.46],
 ['23', 7.985294117647059],
 ['13', 14.741176470588234],
 ['21', 16.009174311926607],
 ['11', 11.051724137931034],
 ['20', 21.525],
 ['19', 10.8],
 ['07', 7.852941176470588],
 ['22', 6.746478873239437],
 ['02', 23.810344827586206]]
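
Each average is simply the hour's comment total divided by its post count. The sketch below repeats that division for two hours, using the counts and totals printed earlier; `sample_counts`, `sample_comments`, and `sample_avg` are names introduced for this illustration.

```python
# Post counts and comment totals for two hours, copied from the output above.
sample_counts = {'15': 116, '02': 58}
sample_comments = {'15': 4477, '02': 1381}

# A dictionary comprehension keeps each hour paired with its average.
sample_avg = {
    hr: sample_comments[hr] / sample_counts[hr]
    for hr in sample_counts
}

for hr, avg in sample_avg.items():
    print(hr, round(avg, 2))
```

The 15:00 average is high partly because 4,477 comments are spread over only 116 posts, so a few heavily discussed posts can have a large effect on it.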

## Sorting and Printing Values from a List of Lists

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[16.796296296296298, '16'], [11.383333333333333, '01'], [9.022727272727273, '06'], [13.20183486238532, '18'], [7.170212765957447, '04'], [38.5948275862069, '15'], [10.08695652173913, '05'], [5.5777777777777775, '09'], [9.41095890410959, '12'], [7.796296296296297, '03'], [13.440677966101696, '10'], [8.127272727272727, '00'], [10.25, '08'], [13.233644859813085, '14'], [11.46, '17'], [7.985294117647059, '23'], [14.741176470588234, '13'], [16.009174311926607, '21'], [11.051724137931034, '11'], [21.525, '20'], [10.8, '19'], [7.852941176470588, '07'], [6.746478873239437, '22'], [23.810344827586206, '02']]


In [8]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
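
Swapping the columns works because `sorted` compares lists by their first element. A possible alternative, sketched here rather than taken from the cells above, is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The four pairs are copied from the earlier output; `sample_avg_by_hour` and `sorted_by_avg` are illustrative names.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['09', 5.5777777777777775],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first, without swapping columns.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hr, avg in sorted_by_avg:
    print(hr, round(avg, 2))
```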

In [9]:
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    # strftime already returns HH:MM, so the template adds no extra ":00"
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


With an average of 38.59 comments per post, 15:00 tops our list of hours by comments received per post. To give some perspective, that is roughly 62% more than the second-ranked hour, 02:00, which averages 23.81 comments per post.
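
The size of the gap between the top two hours can be checked with a quick calculation. The two averages are copied from the sorted output above; `top_avg`, `second_avg`, and `pct_increase` are names introduced for this sketch.

```python
# Averages for the two highest-ranked hours, copied from the sorted output.
top_avg = 38.5948275862069       # 15:00
second_avg = 23.810344827586206  # 02:00

# Relative increase of the top hour over the second-ranked hour, in percent.
pct_increase = (top_avg - second_avg) / second_avg * 100

print("{:.1f}%".format(pct_increase))
```

The result comes out at a little over 62%.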


According to the data set documentation, the time zone used is Eastern Time in the US, so 15:00 can also be written as 3:00 pm ET.
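
Rewriting a 24-hour value in 12-hour form can be sketched with a small helper. `to_12_hour` is a hypothetical function written for this illustration; it only reformats the hour and does no time zone conversion, so the result stays in Eastern Time.

```python
def to_12_hour(hour_string):
    """Convert a zero-padded 24-hour string such as '15' to '3:00 pm'."""
    hour = int(hour_string)
    suffix = 'am' if hour < 12 else 'pm'
    # 0 and 12 both display as 12 on a 12-hour clock.
    hour_12 = hour % 12 or 12
    return '{}:00 {}'.format(hour_12, suffix)


for hr in ['15', '02', '20', '00', '12']:
    print(hr, '->', to_12_hour(hr))
```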

## Conclusion

The goal of this project was to determine which type of post, ask or show, and which time of day receive the most comments on average on Hacker News. Our analysis shows that ask posts receive more comments on average than show posts, and that ask posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm ET) receive the most comments on average of any hour of the day.


However, the data set we analyzed excludes posts that received no comments. It is therefore more accurate to say that, among posts that received comments, ask posts received more comments on average than show posts, and ask posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm ET) received the most comments on average.