# Exploring Hacker News Posts

In this guided project by Dataquest (DQ), I worked with a data set of submissions to [Hacker News](https://news.ycombinator.com/) (HN), ["a social news website focusing on computer science and entrepreneurship"](https://en.wikipedia.org/wiki/Hacker_News). According to DQ, HN was started by the startup incubator Y Combinator, where user-submitted stories or "posts" are voted and commented upon, similar to Reddit. HN is extremely popular in technology and startup circles, and posts that make it to the top of HN's listings can get hundreds of thousands of visitors as a result. 

DQ provided the data set that I used for analysis. To produce the data set, DQ initially removed the submissions without comments from the original [data set](https://www.kaggle.com/hacker-news/hacker-news-posts/) and performed a random sampling of the remaining submissions. In the data set, I am interested in posts whose titles begin with `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the HN community a specific question. Meanwhile, `Show HN` posts present the HN community an interesting product or project. In this project, I wanted to know the following:

* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?
* Do `Ask HN` or `Show HN` posts receive more upvotes or points on average?
* Do posts created at a certain time receive more points on average?
* Do posts other than `Ask HN` or `Show HN` receive more comments and points on average?

## Introduction

I first opened and explored the data set. 

In [1]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

headers = hn[0]
hn = hn[1:]

print(headers)
hn[0:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

The data set `hn` has the following columns:

* `id` - A unique identifier from HN for the post
* `title` - The title of the post
* `url` - The url of the item being linked to 
* `num_points` - The number of upvotes the post received
* `num_comments` - The number of comments the post received
* `author` - The name of the account that made the post
* `created_at` - The date and time the post was made (the time zone is Eastern Time in the US)

## Extracting `Ask HN` and `Show HN` Posts

I looped through each row of the `hn` data set and extracted the entries with a `title` that starts with `Ask HN` (`ask_posts`) or `Show HN` (`show_posts`). I saved the remaining posts to `other_posts`. `Ask HN` and `Show HN` posts have 1,744 and 1,162 entries while the other posts have 17,194 entries. 

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask HN posts have' ,len(ask_posts), 'entries.')
print('Show HN posts have', len(show_posts), 'entries.')
print('Other posts have', len(other_posts), 'entries.')

Ask HN posts have 1744 entries.
Show HN posts have 1162 entries.
Other posts have 17194 entries.


## Calculating the Average Number of Comments for `Ask HN` and `Show HN` Posts

After the extraction and separation of `Ask HN` and `Show HN` posts from the `hn` data set, I calculated the average number of comments per post using the `num_comment` column. Results show that on the average, `Ask HN` posts receive 14 comments per post while `Show HN` posts get 10 comments per post. 

In [3]:
total_ask_comments = 0
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments

avg_ask_comments = total_ask_comments/len(ask_posts)
print('The average number of comments for Ask HN posts is', round(avg_ask_comments, 2), 'comments per post.')

total_show_comments = 0
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments

avg_show_comments = total_show_comments/len(show_posts)
print('The average number of comments for Show HN posts is', round(avg_show_comments, 2),'comments per post.')

The average number of comments for Ask HN posts is 14.04 comments per post.
The average number of comments for Show HN posts is 10.32 comments per post.


## Finding the Number of Ask Posts and Comments by Hour Created

Since `Ask HN` posts receive more comments than `Show HN` posts, I used the `ask_posts` data set to determine the number of posts and comments by hour created (`created_at` column). Results indicate that the highest number of `Ask HN` posts (116 posts) and comments (4,477 comments) were both observed at 15:00.  

In [4]:
import datetime as dt

result_list = []
for row in ask_posts:
    date_comment = [row[6], int(row[4])]
    result_list.append(date_comment)

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_dt = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour_str = date_dt.strftime('%H')
    
    if hour_str not in counts_by_hour:
        counts_by_hour[hour_str] = 1
        comments_by_hour[hour_str] = row[1]
    else:
        counts_by_hour[hour_str] += 1
        comments_by_hour[hour_str] += row[1]

print('Hour : Posts : Comments')
for key in sorted(counts_by_hour.keys()):
    print(key , ":" , counts_by_hour[key], ":", comments_by_hour[key])
    

Hour : Posts : Comments
00 : 55 : 447
01 : 60 : 683
02 : 58 : 1381
03 : 54 : 421
04 : 47 : 337
05 : 46 : 464
06 : 44 : 397
07 : 34 : 267
08 : 48 : 492
09 : 45 : 251
10 : 59 : 793
11 : 58 : 641
12 : 73 : 687
13 : 85 : 1253
14 : 107 : 1416
15 : 116 : 4477
16 : 108 : 1814
17 : 100 : 1146
18 : 109 : 1439
19 : 110 : 1188
20 : 80 : 1722
21 : 109 : 1745
22 : 71 : 479
23 : 68 : 543


## Calculating the Average Number of Comments for Ask HN Posts by Hour

I then calculated the average number of comments per `Ask HN` post by the hour.  After sorting, I observed the highest average value (38.59 comments per post) at 15:00. This means that for an `Ask HN` post to receive more comments, it must be posted at around 3:00 EST. 

In [5]:
avg_by_hour = []
for key in counts_by_hour:
    avg_by_hour.append([key, round(comments_by_hour[key]/counts_by_hour[key], 2)])

print('Hour, Comments per post')
sorted(avg_by_hour)

Hour, Comments per post


[['00', 8.13],
 ['01', 11.38],
 ['02', 23.81],
 ['03', 7.8],
 ['04', 7.17],
 ['05', 10.09],
 ['06', 9.02],
 ['07', 7.85],
 ['08', 10.25],
 ['09', 5.58],
 ['10', 13.44],
 ['11', 11.05],
 ['12', 9.41],
 ['13', 14.74],
 ['14', 13.23],
 ['15', 38.59],
 ['16', 16.8],
 ['17', 11.46],
 ['18', 13.2],
 ['19', 10.8],
 ['20', 21.52],
 ['21', 16.01],
 ['22', 6.75],
 ['23', 7.99]]

In [6]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    print('{}: {} comments per post'.format(hour, row[0]))
    

Top 5 Hours for Ask Posts Comments
15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.8 comments per post
21:00: 16.01 comments per post


## Calculating the Average Number of Points for `Ask HN` and `Show HN` Posts

Aside from the average number of comments, I also calculated the average number of upvotes or points per post for `Ask HN` and `Show HN` posts using the `num_points` column. Similar to the number of comments, results show that on the average, `Ask HN` posts have a higher number of points (15.06 points per post) than `Show HN` posts (10.32 points per post). 

In [7]:
total_ask_points = 0
for row in ask_posts:
    n_points = int(row[3])
    total_ask_points += n_points

avg_ask_points = total_ask_points/len(ask_posts)
print('The average number of points for Ask HN posts is', round(avg_ask_points, 2), 'points per post.')

total_show_points = 0
for row in show_posts:
    n_points = int(row[3])
    total_show_points += n_points

avg_show_points = total_show_points/len(show_posts)
print('The average number of points for Show HN posts is', round(avg_show_comments, 2), 'points per post.')

The average number of points for Ask HN posts is 15.06 points per post.
The average number of points for Show HN posts is 10.32 points per post.


## Finding the Number of Points by Hour Created for `Ask HN` Posts

Since `Ask HN` posts receive more points than `Show HN` posts, I determined the number of points by hour created for `Ask HN` posts. Similar to the number of posts and comments, I observed the highest number of points (3,479 points) at 15:00. 

In [8]:
points_by_hour = {}
for row in ask_posts:
    hour_str = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').strftime('%H')
    
    if hour_str not in points_by_hour:
        points_by_hour[hour_str] = int(row[3])
    else:
        points_by_hour[hour_str] += int(row[3])

print('Hour : Points')
for key in sorted(points_by_hour.keys()):
    print(key , ":" , points_by_hour[key])

Hour : Points
00 : 451
01 : 700
02 : 793
03 : 374
04 : 389
05 : 552
06 : 591
07 : 361
08 : 515
09 : 329
10 : 1102
11 : 825
12 : 782
13 : 2062
14 : 1282
15 : 3479
16 : 2522
17 : 1941
18 : 1741
19 : 1513
20 : 1151
21 : 1721
22 : 511
23 : 581


## Calculating the Average Number of Points for Ask HN Posts by Hour

I also calculated the average number of points per `Ask HN` post by the hour. I observed the highest average value (30 points per post) at 15:00. Thus, aside from receiving a high number of comments, an `Ask HN` post created around 3:00 EST will also receive a high number of upvotes. 

In [9]:
avg_points_by_hour = []
for key in counts_by_hour:
    avg_points_by_hour.append([key, round(points_by_hour[key]/counts_by_hour[key], 2)])

print('Hour, Points per post')
sorted(avg_points_by_hour)

Hour, Points per post


[['00', 8.2],
 ['01', 11.67],
 ['02', 13.67],
 ['03', 6.93],
 ['04', 8.28],
 ['05', 12.0],
 ['06', 13.43],
 ['07', 10.62],
 ['08', 10.73],
 ['09', 7.31],
 ['10', 18.68],
 ['11', 14.22],
 ['12', 10.71],
 ['13', 24.26],
 ['14', 11.98],
 ['15', 29.99],
 ['16', 23.35],
 ['17', 19.41],
 ['18', 15.97],
 ['19', 13.75],
 ['20', 14.39],
 ['21', 15.79],
 ['22', 7.2],
 ['23', 8.54]]

In [10]:
swap_avgpts_by_hour = []
for row in avg_points_by_hour:
    swap_avgpts_by_hour.append([row[1], row[0]])
    
sorted_swap_pts = sorted(swap_avgpts_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Points')

for row in sorted_swap_pts[:5]:
    hour = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    print('{}: {} points per post'.format(hour, row[0]))

Top 5 Hours for Ask Posts Points
15:00: 29.99 points per post
13:00: 24.26 points per post
16:00: 23.35 points per post
17:00: 19.41 points per post
10:00: 18.68 points per post


## Average Number of Comments and Points for Other Posts

For comparison to `Ask HN` and `Show HN` posts, I obtained the average number of comments and points for the other posts (`other_posts` data set). Due to its large number of entries, other posts have a higher average number of comments (26.87 comments per post) and points (55.41 points per post) than `Ask HN` and `Show HN` posts. 

In [11]:
total_other_comments = 0
total_other_points = 0
for row in other_posts:
    n_comments = int(row[4])
    n_points = int(row[3])
    total_other_comments += n_comments
    total_other_points += n_points

avg_other_comments = total_other_comments/len(other_posts)
avg_other_points = total_other_points/len(other_posts)

print('The average number of comments for other posts is', round(avg_other_comments, 2), 'comments per post.')
print('The average number of points for other posts is', round(avg_other_points, 2), 'points per post.')

The average number of comments for other posts is 26.87 comments per post.
The average number of points for other posts is 55.41 points per post.


## Conclusion

Based on my data analysis, to maximize the number of comments and upvotes a post receives, the post must be classified as an `Ask HN` post and created between 3:00 - 4:00 EST. 