# Exploring Hacker News Posts

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

### 1. The Dataset

The data set for this project can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts?select=HN_posts_year_to_Sep_26_2016.csv). Note that the file used in this notebook is the full data set of almost 300,000 rows (the cells below load 293,119 rows, including submissions with no comments), not a version downsized to approximately 20,000 rows by removing uncommented submissions and randomly sampling from the rest. Below are descriptions of the columns:

- **id:** The unique identifier from Hacker News for the post
- **title:** The title of the post
- **url:** The URL that the post links to, if the post has a URL
- **num_points:** The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments:** The number of comments that were made on the post
- **author:** The username of the person who submitted the post
- **created_at:** The date and time at which the post was submitted
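
Because each row is read as a plain list, the columns are accessed by position later on (for example, index 1 for the title and index 4 for the number of comments). The following is a minimal sketch of mapping column names to positions, assuming the header row shown in this notebook. The `COL` dictionary and `sample_row` are illustrative names and are not part of the original analysis.

```python
# Header row as it appears in the CSV file
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its list index
COL = {name: index for index, name in enumerate(headers)}

# Sample row taken from the data set
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

print(COL['title'], COL['num_comments'], COL['created_at'])
print(sample_row[COL['author']])
```

With such a mapping, `row[COL['num_comments']]` reads the same value as `row[4]` but documents which column is meant.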

### 2. Project Purpose

We're specifically interested in posts whose titles begin with either **Ask HN** or **Show HN**. Users submit **Ask HN** posts to ask the Hacker News community a specific question. Below are a few examples:

**Ask HN:** How to improve my personal website?<br>
**Ask HN:** Am I the only one outraged by Twitter shutting down share counts?<br>
**Ask HN:** Aby recent changes to CSS that broke mobile?<br>

Likewise, users submit **Show HN** posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

**Show HN:** Wio Link ESP8266 Based Web of Things Hardware Development Platform<br>
**Show HN:** Something pointless I made<br>
**Show HN:** Shanhu.io, a programming playground powered by e8vm<br>

We'll compare these two types of posts to determine the following:

**1. Do Ask HN or Show HN posts receive more comments on average?**<br>
**2. Do posts created at a certain time receive more comments on average?**<br>

### 3. Read in Data

In [36]:
#read in data
import csv

#use a context manager so the file is closed after reading
with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf-8') as file:
    hn = list(csv.reader(file))

#View first 5 rows (the header row plus 4 posts)
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

### 4. Removing Headers from a List of Lists 

In [37]:
#remove the headers
headers = hn[0]
hn = hn[1:]

#display headers and first 5 rows
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


Now that we've removed the headers from hn, we're ready to filter our data. Since we're only concerned with post titles beginning with **Ask HN** or **Show HN**, we'll create new lists of lists containing just the data for those titles.

### 5. Extracting Ask HN and Show HN Posts

- We'll create three empty lists called ask_posts, show_posts, and other_posts.
- The goal is to identify posts that begin with either **Ask HN** or **Show HN** and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.

In [38]:
#Create 3 lists
ask_posts = []
show_posts = []
other_posts = []


#loop through each row in hn
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        

#check number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


9139
10158
273822


- There are 9,139 posts in the ask_posts ("Ask HN") list
- There are 10,158 posts in the show_posts ("Show HN") list
- There are 273,822 posts in the other_posts list
- There are slightly more "Show HN" posts than "Ask HN" posts, but the vast majority of posts fall into the "other" category.
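
To put these counts in proportion, the share of each category can be computed from the three list lengths. The following is a small sketch that uses the counts printed above as literals. The `counts` and `shares` names are illustrative and are not part of the original notebook.

```python
# Counts printed by the cell above
counts = {"ask": 9139, "show": 10158, "other": 273822}

total = sum(counts.values())

# Share of each category as a percentage of all posts
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```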

### 6. Calculating the Average Number of Comments for Ask HN, Show HN Posts, and Other Posts

In the previous cell, we separated the data into three lists of lists named ask_posts, show_posts, and other_posts. Now, let's determine whether ask posts, show posts, or other posts receive more comments on average.

In [40]:
#calculate the average number of comments 'ASK HN' posts receive
total_ask_comments = 0

#create for loop to iterate over ask posts
for post in ask_posts:
    total_ask_comments += int(post[4])

#Compute the average number of comments on ask posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

10.393478498741656


In [41]:
#Calculate the average number of comments 'Show HN' posts receive
total_show_comments = 0

#create for loop to iterate over show_posts
for post in show_posts:
    total_show_comments += int(post[4])

#compute average number of comments on show posts
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

4.886099625910612


In [52]:
#Calculate the average number of comments for "Other Posts"
total_other_comments = 0

#create for loop to iterate over other_posts
for post in other_posts:
    total_other_comments += int(post[4])

#compute average number of comments on other posts
avg_other_comments = total_other_comments / len(other_posts)
print(avg_other_comments)

6.4572678601427205


### Summary

Our results show that Ask HN posts receive the highest average number of comments (about 10.39 per post), even though the Show HN category contains more posts. The other posts come in second (about 6.46 comments per post), and Show HN posts come last (about 4.89).

Since ask posts receive the most comments on average, we'll focus our remaining analysis on just these posts.
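
The three averaging loops in this section share the same shape, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code. The function name `average_comments`, its `comments_index` parameter, and the made-up `example_posts` rows are assumptions introduced for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of 0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Tiny made-up rows in the same column order as the data set
example_posts = [
    ['1', 'Ask HN: example A', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example B', '', '5', '4', 'user_b', '9/26/2016 15:02'],
]

print(average_comments(example_posts))  # 7.0
```

Called as `average_comments(ask_posts)`, such a helper would compute the same quantity as the first loop above.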

### 7. Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

In this section, we will tackle the first step: calculating the number of ask posts and comments by hour created. We'll use the datetime module to work with the data in the created_at column.
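
As a quick illustration of the parsing step, the sketch below converts a single created_at value into its hour. It uses the same format string as the cell that follows. The `created_at`, `parsed`, and `hour` variable names are chosen here for illustration.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# A created_at value taken from the first data row
created_at = "9/26/2016 3:26"

# Parse the string into a datetime object, then format only the hour
parsed = dt.datetime.strptime(created_at, date_format)
hour = parsed.strftime("%H")  # zero-padded hour as a string

print(parsed)  # 2016-09-26 03:26:00
print(hour)    # 03
```

Keeping the hour as a zero-padded string is what produces dictionary keys such as '03' and '15' in the results.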

In [45]:
#first import datetime module
import datetime as dt

#build a list of [created_at, num_comments] pairs for ask posts
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])

counts_by_hour = {}    #number of ask posts created in each hour
comments_by_hour = {}  #total comments on ask posts created in each hour
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    #extract the hour as a zero-padded string, e.g. "15"
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += comment
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comment
        counts_by_hour[hour] = 1

comments_by_hour
         

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
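
The if/else bookkeeping in the cell above can also be written with `collections.defaultdict`, which starts every missing key at 0. This is an alternative sketch, not the approach the notebook uses. The `sample_rows` data is made up and only mirrors the shape of result_list.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs, same shape as result_list
sample_rows = [
    ["9/26/2016 3:26", 2],
    ["9/25/2016 3:05", 5],
    ["9/26/2016 15:40", 10],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, n_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))    # {'03': 2, '15': 1}
print(dict(comments))  # {'03': 7, '15': 10}
```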

### 8. Calculating the Average Number of Comments for Ask HN Posts by Hour

In [46]:
#calculate the average number of comments per post for posts created during each hour of the day
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
    
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

### 9. Sorting and Printing Values from a List of Lists

In [50]:
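#swap the columns so the average comes first, which lets sorted() order by it
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

#sort in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap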


[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
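
The column swap is needed because `sorted` compares lists by their first element. An alternative sketch, assuming the same [hour, average] layout as avg_by_hour, passes a `key` function instead and leaves the columns in place. The `sample_avg_by_hour` list below holds a few pairs copied from the output above, and the variable names are illustrative.

```python
# A few [hour, average] pairs copied from avg_by_hour
sample_avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort on the average (index 1), highest first, without swapping columns
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hr, avg in sorted_by_avg[:3]:
    print(f"{hr}:00: {avg:.2f} average comments per post")
```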

In [51]:
# Print the 5 hours with the highest average number of comments.

print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


The hour with the highest average number of comments per post is 15:00 (3pm, Eastern Time in the US), at 28.68. That is roughly 1.76 times the average for the second-place hour, 13:00, which has only 16.32.

# Conclusion

This analysis compared the average number of comments for three categories of posts: Ask HN, Show HN, and Other. We examined only the Ask HN posts in more detail, since they receive the most comments on average.

Our analysis suggests that the best time to create an Ask HN post is in the afternoon. Posts created at 15:00 (3pm Eastern Time) received the most comments on average (28.68 per post), followed by 13:00 (1pm, 16.32) and 12:00 (noon, 12.38). Someone hoping to maximize comments would therefore want to post in the early to mid afternoon.

We did not analyze the Show HN or other posts in more detail, which could be the subject of a future analysis. It would be especially interesting because the large majority of posts fall into the "other" category.