# Exploring Hacker News Posts

Darren Ho

## Introduction

In this project, we will work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time of the post's submission

Here are the first five rows of the dataset:

In [4]:
from csv import reader

# Read the CSV into a list of lists; the `with` block closes the file afterwards
with open('hacker_news.csv') as open_file:
    hn = list(reader(open_file))

for row in hn[:5]:
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




For this project, we are interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting. 

We will compare the two types of posts to determine the following:
- On average, do `Ask HN` or `Show HN` posts receive more comments?
- On average, do posts created at a certain time receive more comments?

## Removing Headers from a List of Lists

In the previous section, we read our data into a list of lists called `hn`. Notice that the first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers.

In [5]:
hn_header = hn[0]   # Extract the header row
hn = hn[1:]         # Keep only the data rows

print(hn_header)    # Display the column headers
print('\n')
for row in hn[:5]:  # Display the first 5 data rows
    print(row)
    print('\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
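
With the header row kept separately in `hn_header`, column positions can be looked up by name rather than hard-coded. The following is a minimal sketch of that idea; the `column_index` helper is hypothetical and not part of the original notebook, and the sample row is copied from the output above.

```python
# Header row as printed above.
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

def column_index(header, name):
    """Hypothetical helper: return the position of `name` in the header row."""
    return header.index(name)

# Sample data row copied from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

title_idx = column_index(hn_header, 'title')
comments_idx = column_index(hn_header, 'num_comments')
created_idx = column_index(hn_header, 'created_at')

print(title_idx, comments_idx, created_idx)   # 1 4 6
print(sample_row[comments_idx])               # 52
```

These are the same positions (`row[1]`, `row[4]`, `row[6]`) that the rest of the analysis uses directly.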




## Extracting Ask HN and Show HN Posts

Now that we have removed the headers from `hn`, we are ready to filter our data. Since we are only concerned with post titles beginning with `Ask HN` or `Show HN`, we will create new lists of lists containing just the data for those specific titles. 

To do so, we will use the string method `startswith`. Given a string object, say `string1`, we can check whether it starts with, say, `dq` by calling `string1.startswith('dq')`. If `string1` starts with `dq`, the call returns `True`; otherwise, it returns `False`. Because `startswith` is case-sensitive, we lowercase each title before checking it.
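
As a quick illustration of how `startswith` behaves, here is a small sketch. The titles are made up for the example and the `classify` function is a hypothetical helper, not something from the data set or the notebook.

```python
# Made-up titles for illustration; they are not taken from the data set.
titles = [
    'Ask HN: How do you learn SQL?',
    'SHOW HN: My weekend project',
    'Asking for a friend',
]

def classify(title):
    """Label a title as 'ask', 'show', or 'other', ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

labels = [classify(title) for title in titles]
print(labels)   # ['ask', 'show', 'other']

# startswith is case-sensitive, and it also accepts a tuple of prefixes.
print('Ask HN: hello'.startswith('ask hn'))                # False
print('ask hn: hello'.startswith(('ask hn', 'show hn')))   # True
```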

In [18]:
ask_posts = []                # Create three empty lists
show_posts = []
other_posts = []

for row in hn:                          # Loop through each row in the hn data set
    title = row[1].lower()              # Lowercase the title so the check ignores case
    if title.startswith('ask hn'):      # Titles starting with 'ask hn' are Ask HN posts
        ask_posts.append(row)
    elif title.startswith('show hn'):   # Titles starting with 'show hn' are Show HN posts
        show_posts.append(row)
    else:
        other_posts.append(row)         # Everything else

In [25]:
print('Total Ask Posts: ', len(ask_posts)) 
print('Total Show Posts: ', len(show_posts)) 
print('Total Other Posts: ', str(len(other_posts))) 

Total Ask Posts:  1744
Total Show Posts:  1162
Total Other Posts:  17194


Checking the number of posts in the 3 new lists, we see that there are 1,744 `Ask HN` posts, 1,162 `Show HN` posts, and 17,194 `Other` posts.

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

We can now use the lists that we have created to determine whether ask posts or show posts receive more comments on average. Let's begin by calculating the average number of comments for `Ask HN` posts.

In [32]:
# Ask HN Posts

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Total Number of Ask Comments: ", str(total_ask_comments))
print("Average Number of Ask Comments: ", str(avg_ask_comments))
print("Rounded Average of Ask Comments: {:.2f}".format(avg_ask_comments))

Total Number of Ask Comments:  24483
Average Number of Ask Comments:  14.038417431192661
Rounded Average of Ask Comments: 14.04


In [33]:
# Show HN Posts

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print("Total Number of Show Comments: ", str(total_show_comments))
print("Average Number of Show Comments: ", str(avg_show_comments))
print("Rounded Average of Show Comments: {:.2f}".format(avg_show_comments))

Total Number of Show Comments:  11988
Average Number of Show Comments:  10.31669535283993
Rounded Average of Show Comments: 10.32


In [34]:
# Other Posts

total_other_comments = 0

for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
    
avg_other_comments = total_other_comments / len(other_posts)

print("Total Number of Other Comments: ", str(total_other_comments))
print("Average Number of Other Comments: ", str(avg_other_comments))
print("Rounded Average of Other Comments: {:.2f}".format(avg_other_comments))

Total Number of Other Comments:  462055
Average Number of Other Comments:  26.8730371059672
Rounded Average of Other Comments: 26.87


- `Ask HN` Average Comments = 14.04
- `Show HN` Average Comments = 10.32
- `Other` Average Comments = 26.87


After calculating the average number of comments for the three lists, we see that `Other` posts actually receive more comments on average than either `Ask HN` or `Show HN` posts. For the purpose of our analysis, however, we can conclude that `Ask HN` posts receive more comments on average than `Show HN` posts (14.04 versus 10.32).
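
The three cells above repeat the same loop with different lists. One way to avoid the repetition is a small helper function; the sketch below is only an illustration. The name `average_comments`, its default column index, and the toy rows are assumptions introduced for the example.

```python
def average_comments(rows, comments_idx=4):
    """Hypothetical helper: mean of the num_comments column.

    Returns 0.0 for an empty list to avoid dividing by zero.
    """
    if not rows:
        return 0.0
    total = sum(int(row[comments_idx]) for row in rows)
    return total / len(rows)

# Toy rows in the same column order as the data set; the values are made up.
toy_posts = [
    ['1', 'Ask HN: a', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: b', '', '3', '20', 'user_b', '1/1/2016 10:00'],
]

print(average_comments(toy_posts))   # 15.0
print(average_comments([]))          # 0.0
```

With the real lists, the same function could be called as `average_comments(ask_posts)`, `average_comments(show_posts)`, and `average_comments(other_posts)`.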

## Finding the Number of Ask Posts and Comments by Hour Created

Because `Ask HN` posts receive more comments on average than `Show HN` posts, we will focus our remaining analysis on just these posts.

We will now determine whether ask posts created at a certain *time* attract more comments on average. We will use the [`datetime` module](https://docs.python.org/3/library/datetime.html) to work with the data in the `created_at` column. We will use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received
2. Calculate the average number of comments ask posts receive by hour created
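
The first step relies on parsing a `created_at` string and grouping by its hour. A minimal sketch of that idea follows, using `collections.defaultdict` as an alternative way to accumulate the totals. The sample pairs and the `demo_` names are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the data set's date format.
samples = [
    ('8/4/2016 11:52', 52),
    ('8/5/2016 11:05', 8),
    ('6/17/2016 0:01', 1),
]

demo_counts = defaultdict(int)     # posts per hour
demo_comments = defaultdict(int)   # comments per hour

for created_at, n_comments in samples:
    # Parse month/day/year and 24-hour time, then keep only the hour
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    demo_counts[hour] += 1
    demo_comments[hour] += n_comments

print(dict(demo_counts))     # {11: 2, 0: 1}
print(dict(demo_comments))   # {11: 60, 0: 1}
```

A `defaultdict(int)` starts every missing key at 0, so no explicit check for a new hour is needed.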

In [47]:
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_str = row[0]
    comment_num = row[1]
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    hour_only = date_dt.hour
    if hour_only not in counts_by_hour:
        counts_by_hour[hour_only] = 1
        comments_by_hour[hour_only] = comment_num
    else:
        counts_by_hour[hour_only] += 1
        comments_by_hour[hour_only] += comment_num
        
print("Number of Ask Posts created during each hour of the day:", counts_by_hour) 
print("\n")
print("Number of comments Ask Posts created at each hour received:", comments_by_hour)  

Number of Ask Posts created during each hour of the day: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


Number of comments Ask Posts created at each hour received: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


## Calculating the Average Number of Comments for `Ask HN` Posts by Hour

In the code above, we created two dictionaries:

- `counts_by_hour`: contains the number of ask posts created during each hour of the day
- `comments_by_hour`: contains the corresponding number of comments ask posts created at each hour received

Next, we will use the two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [53]:
avg_by_hour = []

for hour in counts_by_hour:
    # Average number of comments per post for posts created during this hour
    avg_com_per_post = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_com_per_post])   # Store [hour, average]

for row in avg_by_hour:
    print(row)
    

[9, 5.5777777777777775]
[13, 14.741176470588234]
[10, 13.440677966101696]
[14, 13.233644859813085]
[16, 16.796296296296298]
[23, 7.985294117647059]
[12, 9.41095890410959]
[17, 11.46]
[15, 38.5948275862069]
[21, 16.009174311926607]
[20, 21.525]
[2, 23.810344827586206]
[18, 13.20183486238532]
[3, 7.796296296296297]
[5, 10.08695652173913]
[19, 10.8]
[1, 11.383333333333333]
[22, 6.746478873239437]
[8, 10.25]
[4, 7.170212765957447]
[0, 8.127272727272727]
[6, 9.022727272727273]
[7, 7.852941176470588]
[11, 11.051724137931034]


## Sorting and Printing Values from a List of Lists

We calculated the average number of comments for posts created during each hour of the day and stored the results in a list of lists named `avg_by_hour`. Although we have the results we need, the format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that is easier to read. Because `sorted` compares inner lists by their first element, we first swap the columns so that the average comes before the hour.

In [56]:
swap_avg_by_hour = []    # Same data as avg_by_hour, with the columns swapped to [average, hour]

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, 9],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [16.796296296296298, 16],
 [7.985294117647059, 23],
 [9.41095890410959, 12],
 [11.46, 17],
 [38.5948275862069, 15],
 [16.009174311926607, 21],
 [21.525, 20],
 [23.810344827586206, 2],
 [13.20183486238532, 18],
 [7.796296296296297, 3],
 [10.08695652173913, 5],
 [10.8, 19],
 [11.383333333333333, 1],
 [6.746478873239437, 22],
 [10.25, 8],
 [7.170212765957447, 4],
 [8.127272727272727, 0],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [11.051724137931034, 11]]

In [60]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    hour_form = dt.datetime.strptime(str(row[1]), '%H')
    hour_form = hour_form.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_form, row[0]))


Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
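
As an alternative to building a swapped copy, `sorted` can be given a `key` function that picks the value to sort on. The sketch below illustrates this with a few `[hour, average]` pairs copied from the `avg_by_hour` output above. The names `avg_by_hour_sample` and `ranked` are introduced only for the example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [20, 21.525],
    [2, 23.810344827586206],
]

# Sort on the second element (the average) without swapping columns.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    # Zero-pad the hour so it prints in HH:00 form
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

This keeps the `[hour, average]` order intact, and formatting the hour with `{:02d}` avoids the round trip through `strptime` and `strftime`.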


## Conclusion

Through our analysis, we are now able to answer the two questions we set out to investigate:

- On average, do `Ask HN` or `Show HN` posts receive more comments?
- On average, do posts created at a certain time receive more comments?

We are able to conclude that, on average, `Ask HN` posts receive more comments than `Show HN` posts. With our latest findings above, we can also conclude that `Ask HN` posts created at certain times receive more comments on average.

The Top 5 Hours for Ask Posts Comments (Eastern Time):

- **3:00 PM**: 38.59 average comments per post
- **2:00 AM**: 23.81 average comments per post
- **8:00 PM**: 21.52 average comments per post
- **4:00 PM**: 16.80 average comments per post
- **9:00 PM**: 16.01 average comments per post

If you were to create an `Ask HN` post with the goal of receiving as many comments as possible, our analysis suggests posting during the 3:00 PM hour (Eastern Time), when `Ask HN` posts in this sample averaged 38.59 comments per post.

## Next Steps

Possible next steps for further analysis:

- Determine if show or ask posts receive more points on average

- Determine if posts created at a certain time are more likely to receive more points

- Compare these results to the average number of comments and points other posts receive
