<img src="https://s3.amazonaws.com/dq-content/354/hacker_news.jpg" alt="Hacker News Logo" title="Hacker News Logo"/>

# Hacker News Post Analysis
#### Author: Frank Pereny
#### Date: November, 2020

## Introduction:
### Project Summary:
[Hacker News](https://news.ycombinator.com/) is a social news site, similar to [reddit](https://www.reddit.com/), focused on technology, computer science and entrepreneurship.  Hacker News was developed by [Paul Graham](https://en.wikipedia.org/wiki/Paul_Graham_(programmer)) and launched in February 2007 as Startup News.  Its current name was adopted on August 14, 2007.

This project looks at two categories of posts:
- [Ask HN](https://news.ycombinator.com/ask)
- [Show HN](https://news.ycombinator.com/show)

***Ask HN*** posts allow users to ask the Hacker News community questions such as:
- Ask HN: Is it time to quit tech industry?
- Ask HN: What is the best money you have spent on professional development?
- Ask HN: How to get rid of impostor syndrome?

***Show HN*** is used to share projects, information, news or anything interesting with the community.  For example:
- Show HN: Podcast API 
- Show HN: A simple but powerful UI for SSH port forwarding 
- Show HN: 15FPS to 60FPS, new GPU real-time flow-based method

### Goals:
The goal of this project is to compare ***Ask HN*** and ***Show HN*** posts to determine the following:
1) Which type of post receives more comments on average?
2) Does the date or time a post is created affect how many comments it receives?

### Source Data:
The original source data contained nearly 300,000 posts.  All submissions without any comments were removed.  The final data set of roughly 20,000 entries (20,100 rows in this analysis) was created by random sampling.
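
The filtering and sampling were done before this notebook, and the exact procedure is not documented here. Purely as a sketch, assuming the posts are held as a list of rows, the two steps could look like the following. The stand-in rows, `sample_size` and the seed are made up for illustration.

```python
import random

# Hypothetical stand-in rows: [id, title, num_comments].
all_posts = [[str(i), "Post {}".format(i), str(i % 4)] for i in range(1000)]

# Step 1: drop submissions that received no comments.
with_comments = [row for row in all_posts if int(row[2]) > 0]

# Step 2: draw a random sample without replacement.
random.seed(0)      # arbitrary seed, only so the sketch is repeatable
sample_size = 200   # illustrative; the real data set has roughly 20,000 rows
sample = random.sample(with_comments, sample_size)

print("{:,} posts with comments, {:,} sampled".format(len(with_comments), len(sample)))
```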

#### Download:
[Source Data CSV Download](https://www.kaggle.com/hacker-news/hacker-news-posts)

#### Data Format:
The data is a CSV (Comma-separated Values) file with the following format.

| |0|1|2|3|4|5|6|
|--|--|--|--|--|--|--|--| 
|CSV Header String|id| title | url | num_points | num_comments | author | created_at|
|Meaning|Unique post ID| Title of post | Post URL | Number of points the post has received | Number of comments the post has received | Author of the post | Date and time the post was created|
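
The header strings in the table can also be used to access fields by name instead of by position. This is a minimal sketch, assuming the standard library's `csv.DictReader` and an in-memory copy of two rows that appear in this notebook's output. The analysis itself uses positional indexes.

```python
import csv
import io

# Two rows copied from the sample output in this notebook; the real analysis
# reads 'hacker_news.csv' rather than this in-memory string.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,"
    "8/16/2016 9:55\n"
)

rows = list(csv.DictReader(sample_csv))
for row in rows:
    # Every value arrives as a string, so numeric columns need an explicit int().
    print(row["title"], int(row["num_comments"]), row["created_at"])
```

Note that Ask HN posts have an empty `url` field, since they link to no external page.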

## Results:
### ***Ask HN*** vs. ***Show HN*** Posts
On average, ***Ask HN*** posts receive approximately 36% more comments than ***Show HN*** posts.  Although not part of this project, it is interesting to note that other posts received the most comments on average, nearly double those of ***Ask HN***.

|   |Posts Analysed | Comment Totals | Average Comments per Post | 
|---|--------------|---------------|------------------------|
|Ask HN Posts| 1,744 | 24,483 | 14.04 |
|Show HN Posts| 1,162 | 11,988 | 10.32 |
| Other Posts| 17,194 | 462,055 | 26.87|
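
The percentages quoted above follow directly from the totals in the table. A short check of the arithmetic, using only the figures reported here (the variable names are illustrative):

```python
# Totals and post counts copied from the results table.
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162
other_total, other_count = 462055, 17194

ask_avg = ask_total / ask_count
show_avg = show_total / show_count
other_avg = other_total / other_count

# Relative difference between the Ask HN and Show HN averages, in percent.
ask_vs_show = (ask_avg / show_avg - 1) * 100
# How many times larger the "other" average is than the Ask HN average.
other_vs_ask = other_avg / ask_avg

print("Ask HN vs. Show HN: {:.1f}% more comments".format(ask_vs_show))
print("Other vs. Ask HN: {:.2f} times as many comments".format(other_vs_ask))
```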

### Posting Time
***Ask HN*** posts created around 3PM Eastern Standard Time receive the most comments on average.  At over 38 comments per post, the 3PM average is nearly 7 times the 9AM EST average.  This suggests that the time of posting has a very large impact on the number of comments one can expect to receive.

|Average Comments Per Post | Time Posted|
|------------------------ | ----------|
38.60 | 03pm EST |
23.80 | 02am EST |
21.50 | 08pm EST |
16.80 | 04pm EST |
16.00 | 09pm EST |
14.70 | 01pm EST |
13.40 | 10am EST |
13.20 | 02pm EST |
13.20 | 06pm EST |
11.50 | 05pm EST |
11.40 | 01am EST |
11.10 | 11am EST |
10.80 | 07pm EST |
10.20 | 08am EST |
10.10 | 05am EST |
9.40 | 12pm EST |
9.00 | 06am EST |
8.10 | 12am EST |
8.00 | 11pm EST |
7.90 | 07am EST |
7.80 | 03am EST |
7.20 | 04am EST |
6.70 | 10pm EST |
5.60 | 09am EST |


## Analysis
### Opening & Reading the Data Set

In [10]:
from csv import reader

file_name = 'hacker_news.csv'
# Read every row into memory; the context manager closes the file afterwards.
with open(file_name, newline='') as opened_file:
    hn = list(reader(opened_file))

print("The first 5 rows of the data set:\n")
for row in hn[:5]:
    print(row, '\n')

The first 5 rows of the data set:

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



### Extracting Header from Data Set

In [11]:
headers = hn[0:1]
hn = hn[1:]

print("Headers:\n", headers, '\n')
print("The first 5 rows of the data set:\n")
for row in hn[:5]:
    print(row, '\n')

Headers:
 [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']] 

The first 5 rows of the data set:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2

### Separate by Post Type

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()

    # Classify by title prefix, ignoring case.
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of Ask HN posts: {:,}".format(len(ask_posts)))
print("Number of Show HN posts: {:,}".format(len(show_posts)))
print("Number of Other posts: {:,}".format(len(other_posts)))

Number of Ask HN posts: 1,744
Number of Show HN posts: 1,162
Number of Other posts: 17,194
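
The prefix test can also be wrapped in a small function so the rule is easy to check on its own. A minimal sketch; `classify_post` is a hypothetical helper, not part of the original notebook, and the titles are taken from the samples in this document.

```python
def classify_post(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_post(title) for title in titles]
print(labels)
```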


#### Ask Post Sample

In [13]:
for row in ask_posts[:3]:
    print(row, '\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'] 

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'] 

['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'] 



#### Show Post Sample

In [14]:
for row in show_posts[:3]:
    print(row, '\n')

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'] 

['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'] 

['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'] 



#### Other Post Sample

In [15]:
for row in other_posts[:3]:
    print(row, '\n')

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 



## Ask HN vs. Show HN Post Analysis
### Effect on Number of Comments Received

In [16]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
total_other_comments = 0
for row in other_posts:
    total_other_comments += int(row[4])

ask_text = "Total number of comments in Ask HN posts: {:,}"
show_text = "Total number of comments in Show HN posts: {:,}"
other_text = "Total number of comments in other HN posts: {:,}"
print("Comment Totals:")
print(ask_text.format(total_ask_comments))
print(show_text.format(total_show_comments))
print(other_text.format(total_other_comments))

avg_ask_comments = float(total_ask_comments) / len(ask_posts)
avg_show_comments = float(total_show_comments) / len(show_posts)
avg_other_comments = float(total_other_comments) / len(other_posts)

print("\nComment Averages:")
print('Ask HN average number of comments: {:.2f}'.format(avg_ask_comments))
print('Show HN average number of comments: {:.2f}'.format(avg_show_comments))
print('Other post average number of comments: {:.2f}'.format(avg_other_comments))

Comment Totals:
Total number of comments in Ask HN posts: 24,483
Total number of comments in Show HN posts: 11,988
Total number of comments in other HN posts: 462,055

Comment Averages:
Ask HN average number of comments: 14.04
Show HN average number of comments: 10.32
Other post average number of comments: 26.87
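
The three near-identical loops above could be folded into one helper. This is only a sketch: `average_comments` is a hypothetical name, and the sample rows are the three Ask HN posts printed earlier, not the full data set.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# The three rows shown in the Ask Post Sample.
ask_sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

print("Average comments in sample: {:.2f}".format(average_comments(ask_sample)))
```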


### Result
On average, ***Ask HN*** posts receive approximately 36% more comments than ***Show HN*** posts.  Although not part of this project, it is interesting to note that other posts received the most comments on average, nearly double those of ***Ask HN***.

|   |Posts Analysed | Comment Totals | Average Comments per Post | 
|---|--------------|---------------|------------------------|
|Ask HN Posts| 1,744 | 24,483 | 14.04 |
|Show HN Posts| 1,162 | 11,988 | 10.32 |
| Other Posts| 17,194 | 462,055 | 26.87|

## Posting Date & Time Analysis of Ask HN Posts

In [17]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
count_by_hour = {}
comment_by_hour = {}

for result in result_list:
    post_time = dt.datetime.strptime(result[0], '%m/%d/%Y %H:%M')
    hour = post_time.strftime('%H')
    
    if hour not in count_by_hour:
        count_by_hour[hour] = 1
    else:
        count_by_hour[hour] += 1
    if hour not in comment_by_hour:
        comment_by_hour[hour] = result[1]
    else:
        comment_by_hour[hour] += result[1]
        
avg_by_hour = []
for hour in count_by_hour:
    comments = comment_by_hour[hour]
    count = float(count_by_hour[hour])
    average = comments / count
    
    avg_by_hour.append([hour, average])
avg_by_hour.sort()


print("Average Comments Per Post")
for row in avg_by_hour:
    hour = row[0]
    dt_hour = dt.datetime.strptime(hour, '%H')
    avg_comments = round(row[1], 1)
    # '%P' is a glibc-only directive; '%I%p' plus lower() is portable.
    print("{} - {:,.1f}".format(dt_hour.strftime('%I%p').lower() + ' EST', avg_comments))


Average Comments Per Post
12am EST - 8.1
01am EST - 11.4
02am EST - 23.8
03am EST - 7.8
04am EST - 7.2
05am EST - 10.1
06am EST - 9.0
07am EST - 7.9
08am EST - 10.2
09am EST - 5.6
10am EST - 13.4
11am EST - 11.1
12pm EST - 9.4
01pm EST - 14.7
02pm EST - 13.2
03pm EST - 38.6
04pm EST - 16.8
05pm EST - 11.5
06pm EST - 13.2
07pm EST - 10.8
08pm EST - 21.5
09pm EST - 16.0
10pm EST - 6.7
11pm EST - 8.0
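
The two frequency dictionaries can be merged into a single pass with `collections.defaultdict`. A minimal sketch of the same hour bucketing: the first three pairs come from the Ask HN sample rows above, and the fourth is a made-up row added only so that one hour holds more than one post.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs; the last one is hypothetical.
sample = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 10:14", 1),
    ("1/1/2016 13:05", 11),  # made-up row for illustration
]

# hour -> [post count, comment total]
totals = defaultdict(lambda: [0, 0])
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    totals[hour][0] += 1
    totals[hour][1] += num_comments

# Average comments per post for each hour, ordered by hour of day.
avg_by_hour = {hour: total / count for hour, (count, total) in sorted(totals.items())}
print(avg_by_hour)
```

Keeping the post count next to the total also makes it easy to see how many posts each hourly average rests on.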


In [18]:
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]
    swap_avg_by_hour.append([avg_comments, hour])
swap_avg_by_hour.sort(reverse=True)

print("Average Comments Per Post")
for row in swap_avg_by_hour:
    hour = row[1]
    dt_hour = dt.datetime.strptime(hour, '%H')
    avg_comments = round(row[0], 1)
    # '%P' is a glibc-only directive; '%I%p' plus lower() is portable.
    print("{:,.2f} - {}".format(avg_comments, dt_hour.strftime('%I%p').lower() + ' EST'))

Average Comments Per Post
38.60 - 03pm EST
23.80 - 02am EST
21.50 - 08pm EST
16.80 - 04pm EST
16.00 - 09pm EST
14.70 - 01pm EST
13.40 - 10am EST
13.20 - 02pm EST
13.20 - 06pm EST
11.50 - 05pm EST
11.40 - 01am EST
11.10 - 11am EST
10.80 - 07pm EST
10.20 - 08am EST
10.10 - 05am EST
9.40 - 12pm EST
9.00 - 06am EST
8.10 - 12am EST
8.00 - 11pm EST
7.90 - 07am EST
7.80 - 03am EST
7.20 - 04am EST
6.70 - 10pm EST
5.60 - 09am EST
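
Swapping the columns before sorting works, but `sorted` with a `key` function gives the same ordering without building a second list by hand. A sketch using four of the hourly averages reported above:

```python
import datetime as dt

# A few [hour, average] pairs copied from the results above.
avg_by_hour = [["02", 23.8], ["09", 5.6], ["15", 38.6], ["20", 21.5]]

# Sort on the average (index 1) directly, highest first.
# sorted() returns a new list and leaves avg_by_hour untouched.
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    # '%I%p' yields a 12-hour label; lower() matches the style used in the tables.
    label = dt.datetime.strptime(hour, "%H").strftime("%I%p").lower()
    print("{:.2f} - {} EST".format(average, label))
```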


### Result
***Ask HN*** posts created around 3PM Eastern Standard Time receive the most comments on average.  At over 38 comments per post, the 3PM average is nearly 7 times the 9AM EST average.  This suggests that the time of posting has a very large impact on the number of comments one can expect to receive.