# Hacker News Posts

### About this project

This project examines Hacker News, a site where user-submitted stories receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles. Posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors. This analysis focuses on two categories of user posts: `Ask` and `Show`.

Users submit Ask posts to ask the Hacker News community a specific question. Below are a few examples:

`Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?`

Likewise, users submit Show posts to show the Hacker News community a project, product, or something interesting. Below are a few examples: 

`Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm`

By comparing these two types of posts in a dataset of roughly 20,000 rows, this project answers two questions:
* Do `Ask` or `Show` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [73]:
from csv import reader
import datetime as dt

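# Read the whole CSV into a list of rows; the with-block closes the file
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
headers = hn[0]  # first row holds the column names
hn = hn[1:]      # remaining rows are the posts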

print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting Ask HN and Show HN posts

Since this project is only concerned with posts whose titles begin with `Ask HN` or `Show HN`, the built-in string method `startswith` is used to separate them. Each title is converted to lowercase first, so the check is case-insensitive.
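As a minimal sketch of that check, the snippet below lowercases a title before calling `startswith`, so variants such as `ASK HN` or `show hn` are matched too. The helper name `classify_title` is hypothetical, and the second and third sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How to improve my personal website?',  # taken from the dataset
    'SHOW HN: An upper-case prefix',                # made-up title
    'A title that only mentions Ask HN later',      # made-up title
]

for title in sample_titles:
    print(classify_title(title), '|', title)
```

Because `startswith` only inspects the beginning of the string, a title that mentions `Ask HN` somewhere in the middle still falls into the `other` group.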



In [74]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('The number of posts in ask posts:', len(ask_posts))
print('The number of posts in show posts:', len(show_posts))
print('The number of posts in other posts:', len(other_posts))

The number of posts in ask posts: 1744
The number of posts in show posts: 1162
The number of posts in other posts: 17194


According to the output above, the dataset contains:
- 1,744 Ask posts
- 1,162 Show posts
- 17,194 other posts
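
To put these counts in proportion, the short sketch below computes each group's share of all rows. The counts are copied from the printed output above; the variable names are illustrative.

```python
# Counts copied from the printed output above.
counts = {'ask': 1744, 'show': 1162, 'other': 17194}
total = sum(counts.values())

shares = {}
for name, count in counts.items():
    # Share of all rows, as a percentage rounded to one decimal place.
    shares[name] = round(100 * count / total, 1)
    print('{}: {} posts ({}% of {} rows)'.format(name, count, shares[name], total))
```

Ask and Show posts together make up only about 14.5% of the rows, so the analysis that follows covers a small subset of the dataset.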

The first five rows of `ask_posts`:

In [75]:
print(ask_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]


The first five rows of `show_posts`:

In [76]:
print(show_posts[:5])

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]


In [77]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


Ask posts receive about 14 comments per post on average.

In [78]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


Show posts receive about 10 comments per post on average.

### Do show posts or ask posts receive more comments on average?

Based on this dataset, ask posts receive more comments on average: about 14 comments per post, compared with about 10 for show posts.

Since ask posts are more likely to receive comments, the remaining analysis focuses on these posts.
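
The two averages compared above come from the same computation, so it could be factored into one function. The sketch below is one possible way to do that; `average_comments` is a hypothetical helper, and it assumes the comment count is stored as a string at index 4, as in the rows printed earlier. The sample rows are the first three rows of `ask_posts`.

```python
def average_comments(posts, comments_index=4):
    """Mean comment count per post; returns 0.0 for an empty list (hypothetical helper)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# First three rows of ask_posts, copied from the output shown earlier.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3 = 12.0
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.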

## Finding the Number of Ask Posts and Comments by Hour Created

### Are posts created at a certain time more likely to attract comments?

In [79]:

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_string = row[0]
    comments = row[1]
    created_at = dt.datetime.strptime(date_string, '%m/%d/%Y %H:%M')
    hour = created_at.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
#     date_string = row[0]
#     date, hour = date_string.split()
#     hr, min = hour.split(':')
#     print(hr)

print('Counts by hour:', counts_by_hour)
print('Comments by hour:', comments_by_hour)
    

Counts by hour: {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Comments by hour: {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


The output above shows the two dictionaries, `counts_by_hour` and `comments_by_hour`:
* Counts by hour: the number of ask posts created during each hour of the day.
* Comments by hour: the total number of comments received by the ask posts created during each hour.
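
The same aggregation can be written more compactly with `collections.defaultdict`, which removes the need to check whether an hour is already a key. This is an alternative sketch, not the approach used in the cell above. The first three sample rows reuse timestamps and comment counts from `ask_posts`; the last row is made up so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs; the last row is a made-up example.
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 9:05', 4],
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total number of comments

for date_string, num_comments in sample_rows:
    created_at = dt.datetime.strptime(date_string, '%m/%d/%Y %H:%M')
    hour = created_at.strftime('%H')  # zero-padded hour, e.g. '09'
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

Missing keys start at `0`, so both dictionaries can be updated with `+=` on the first occurrence of an hour.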

### Calculating the Average Number of Comments for Ask Posts by Hour

In [80]:
avg_by_hour = []
for hour in comments_by_hour:
    comments_in_hour = comments_by_hour[hour]
    avg_hour = round(comments_in_hour / counts_by_hour[hour], 1)
    avg_by_hour.append([hour, avg_hour])
    
print(avg_by_hour)

[['09', 5.6], ['13', 14.7], ['10', 13.4], ['14', 13.2], ['16', 16.8], ['23', 8.0], ['12', 9.4], ['17', 11.5], ['15', 38.6], ['21', 16.0], ['20', 21.5], ['02', 23.8], ['18', 13.2], ['03', 7.8], ['05', 10.1], ['19', 10.8], ['01', 11.4], ['22', 6.7], ['08', 10.2], ['04', 7.2], ['00', 8.1], ['06', 9.0], ['07', 7.9], ['11', 11.1]]


Each entry above pairs an hour of the day with the average number of comments per post for that hour. Hours are given in 24-hour format: '09' is 9AM and '23' is 11PM.
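
For readers less used to 24-hour time, the conversion to a 12-hour label can be sketched as follows. The function name `to_12_hour` is hypothetical and is not used elsewhere in this project.

```python
def to_12_hour(hour_string):
    """Convert a zero-padded 24-hour string such as '15' to a label such as '3PM'."""
    hour = int(hour_string)
    suffix = 'AM' if hour < 12 else 'PM'
    hour_12 = hour % 12
    if hour_12 == 0:
        hour_12 = 12  # midnight and noon are written as 12, not 0
    return '{}{}'.format(hour_12, suffix)


for hour_string in ['00', '09', '12', '15', '23']:
    print(hour_string, '->', to_12_hour(hour_string))
```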

The hour with the **highest** average number of comments per post is **'15'** or **3PM**, with an average of *38.6 comments*.

The hour with the **lowest** average number of comments per post is **'09'** or **9AM**, with an average of *5.6 comments*.

### Sorting and Printing Values

Below are the same values as above, but with the two elements of each pair swapped: the average comes first, followed by the hour. Because `sorted` compares lists element by element, this lets the data be sorted by average.

In [81]:
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])
    
print(swap_avg_by_hour)

[[5.6, '09'], [14.7, '13'], [13.4, '10'], [13.2, '14'], [16.8, '16'], [8.0, '23'], [9.4, '12'], [11.5, '17'], [38.6, '15'], [16.0, '21'], [21.5, '20'], [23.8, '02'], [13.2, '18'], [7.8, '03'], [10.1, '05'], [10.8, '19'], [11.4, '01'], [6.7, '22'], [10.2, '08'], [7.2, '04'], [8.1, '00'], [9.0, '06'], [7.9, '07'], [11.1, '11']]


With the averages in first position, the list can be sorted in descending order, starting with the highest average. The output below confirms that **hour 15** is the **highest** with an *average of 38.6*, and that **hour 09** is the **lowest** with an *average of 5.6*.

In [82]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[38.6, '15'], [23.8, '02'], [21.5, '20'], [16.8, '16'], [16.0, '21'], [14.7, '13'], [13.4, '10'], [13.2, '18'], [13.2, '14'], [11.5, '17'], [11.4, '01'], [11.1, '11'], [10.8, '19'], [10.2, '08'], [10.1, '05'], [9.4, '12'], [9.0, '06'], [8.1, '00'], [8.0, '23'], [7.9, '07'], [7.8, '03'], [7.2, '04'], [6.7, '22'], [5.6, '09']]
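
Swapping the columns is one way to sort by average. An alternative sketch, not used in this project, passes a `key` function to `sorted` so the original `[hour, average]` order is kept. The sample below uses five of the pairs from `avg_by_hour`.

```python
# A subset of the [hour, average] pairs computed earlier.
avg_by_hour_sample = [['09', 5.6], ['13', 14.7], ['15', 38.6], ['02', 23.8], ['20', 21.5]]

# Sort on the second element (the average), highest first.
by_average = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

print(by_average)
# [['15', 38.6], ['02', 23.8], ['20', 21.5], ['13', 14.7], ['09', 5.6]]
```

One difference from the swap approach is how ties are handled: with a `key`, rows with equal averages keep their original relative order, whereas sorting the swapped lists breaks ties by comparing the hour strings.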


In [83]:
print('Top 5 Hours for Ask Posts Comments')

# Each entry is [average, hour]; unpack it instead of reusing the loop variable
for avg, hour in sorted_swap[:5]:
    print('{}:00: {} average comments per post'.format(hour, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.6 average comments per post
02:00: 23.8 average comments per post
20:00: 21.5 average comments per post
16:00: 16.8 average comments per post
21:00: 16.0 average comments per post


## Top 5 Hours for Ask Posts Comments:

1. 3PM (15:00) with an average of 38.6 comments per post
2. 2AM (02:00) with an average of 23.8 comments per post
3. 8PM (20:00) with an average of 21.5 comments per post
4. 4PM (16:00) with an average of 16.8 comments per post
5. 9PM (21:00) with an average of 16 comments per post

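Based on these findings, ask posts created during the 3PM hour (15:00, in whatever time zone the dataset's timestamps use) received the most comments on average, making it the best time to submit an ask post.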