# Exploring Hacker News Posts
In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

### Analysis goal
Our main goal is to determine the **top 5 hours for posting to get the most comments**.

## Step 1
First, we need to read in the data set and separate the header row from the rest of the data.

In [1]:
import csv

with open("data_sets/HN_posts_year_to_Sep_26_2016.csv", encoding='utf8') as data_file:
    hn = list(csv.reader(data_file))

print(len(hn))
headers = hn[0]
hn.remove(headers)
print(len(hn))

293120
293119
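As a side note, the header can also be split off by position rather than by value. The sketch below is only an illustration: `sample_csv` is a small in-memory stand-in for the real file, and `split_header` is a hypothetical helper name, not part of the original notebook.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file: a header plus two rows.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,First example post,http://example.com/a,1,0,alice,9/26/2016 3:26\n"
    "2,Second example post,http://example.com/b,5,2,bob,9/26/2016 3:24\n"
)

def split_header(rows):
    """Return (header, data_rows) by position, without mutating the input list."""
    return rows[0], rows[1:]

sample_rows = list(csv.reader(sample_csv))
sample_header, sample_data = split_header(sample_rows)

print(sample_header)
print(len(sample_rows), len(sample_data))
```

Slicing always drops exactly the first row, whereas `list.remove` searches for the first element equal to its argument, so the positional form states the intent a little more directly.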


Now let's take a look at the first few rows of our data set.

In [2]:
print("Headers:\n%s\n\nData:" % headers)
for row in hn[:5]:
    print(row)

Headers:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Data:
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


## Step 2
As we can see, some posts have no comments.
Since we're interested in which posts attract comments, we'll remove those posts from the data.

In [3]:
clean_hn = []
print("hn before cleaning: %s" % len(hn))
for row in hn:
    n_comments = int(row[4])
    if n_comments > 0:
        clean_hn.append(row)
print("clean_hn after cleaning: %s" % len(clean_hn))
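# Note: the header row was already removed in Step 1, so the next two lines
# drop the first data row rather than a header. They are kept unchanged so
# that the counts printed below stay reproducible.
headers = clean_hn[0]
clean_hn.remove(headers)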
print("clean_hn without header: %s" % len(clean_hn))

hn before cleaning: 293119
clean_hn after cleaning: 80401
clean_hn without header: 80400
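The same filter can be sketched as a list comprehension. This is a minimal illustration on made-up rows: `keep_commented`, `NUM_COMMENTS`, and the sample data are assumptions introduced here, laid out in the column order of the header row printed earlier.

```python
NUM_COMMENTS = 4  # position of 'num_comments' in the header row

def keep_commented(rows):
    """Return only the rows whose comment count is greater than zero."""
    return [row for row in rows if int(row[NUM_COMMENTS]) > 0]

# Hypothetical rows in the data set's column order.
sample_rows = [
    ['1', 'Post without comments', 'http://example.com/a', '1', '0', 'alice', '9/26/2016 3:26'],
    ['2', 'Post with comments', 'http://example.com/b', '7', '3', 'bob', '9/26/2016 3:24'],
]

commented = keep_commented(sample_rows)
print(len(commented))
```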


## Step 3
We're specifically interested in posts whose titles begin with either _Ask HN_ or _Show HN_. 

Users submit _Ask HN_ posts to ask the Hacker News community a specific question. Below are a few examples:
* Ask HN: How to improve my personal website?
* Ask HN: Am I the only one outraged by Twitter shutting down share counts?
* Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit _Show HN_ posts to show the Hacker News community a project, a product, or just something interesting. Below are a few examples:
* Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform
* Show HN: Something pointless I made
* Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

Let's separate posts beginning with _Ask HN_ and _Show HN_ (and case variations) into two different lists next.

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in clean_hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

6911
5059
68430
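The prefix test used in the cell above can be isolated in a small function to make the case-insensitive matching easy to check. `classify_title` is a hypothetical helper introduced for illustration; the sample titles are taken from the examples listed earlier, with the casing of one changed.

```python
def classify_title(title):
    """Label a title as 'ask', 'show', or 'other' by its case-insensitive prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: Something pointless I made",  # upper-case variation
    "algorithmic music",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Because this is a plain prefix match, any title that merely starts with the letters "ask hn" or "show hn" is counted, whether or not a colon follows.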


## Step 4
Next, let's determine if ask posts or show posts receive more comments on average.

In [5]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average comments in 'Ask HN' posts: %s" % avg_ask_comments)

Average comments in 'Ask HN' posts: 13.744175951381855


In [6]:
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print("Average comments in 'Show HN' posts: %s" % avg_show_comments)

Average comments in 'Show HN' posts: 9.810832180272781


We've determined that, on average, _Ask HN_ posts receive more comments than _Show HN_ posts (about 13.7 versus 9.8 comments per post, counting only posts with at least one comment). Since ask posts attract more comments on average, we'll focus our remaining analysis on these posts.
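The two averaging cells differ only in the list they loop over, so the calculation can be sketched once as a function. `average_comments` and the sample rows are illustrative assumptions, not part of the original notebook; the comment count is assumed to sit at index 4, as in the data set.

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count over rows, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Hypothetical rows in the data set's column order.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '3', '6', 'alice', '9/26/2016 3:26'],
    ['2', 'Ask HN: Example B', '', '2', '10', 'bob', '9/26/2016 4:10'],
]
print(average_comments(sample_posts))
```

With the real lists, this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.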

## Step 5
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the total number of comments those posts received.

In [7]:
import datetime as dt

# Pair each ask post's creation timestamp with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments received by those posts

for created_at, n_comments in result_list:
    # Timestamps look like '9/26/2016 3:26'; extract a zero-padded hour such as '03'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
    
print("Posts count by hour:\n%s\n" % counts_by_hour)
print("Comments count by hour:\n%s" % comments_by_hour)

Posts count by hour:
{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}

Comments count by hour:
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
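The dictionary keys above are zero-padded strings such as `'03'` because the hour is produced with `strftime("%H")`. The short sketch below shows that parsing step on its own; `hour_of` is a hypothetical helper name, and the first timestamp is copied from the data preview.

```python
import datetime as dt

def hour_of(created_at):
    """Parse a 'month/day/year hour:minute' timestamp and return the hour as 'HH'."""
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return parsed.strftime("%H")

print(hour_of("9/26/2016 3:26"))
print(hour_of("12/1/2015 15:05"))
```

`strptime` accepts the unpadded month and hour found in the file, while `strftime("%H")` always returns two digits, which keeps the keys consistent across all posts.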


* Calculate the average number of comments per ask post for each hour.

In [8]:
# Average comments per ask post for each hour, in the same order as counts_by_hour.
avg_by_hour = []
for hour, n_posts in counts_by_hour.items():
    avg_by_hour.append([hour, comments_by_hour[hour] / n_posts])
            
print("Average number of comments ask posts receive by hour created:\n%s" % avg_by_hour)

Average number of comments ask posts receive by hour created:
[['02', 13.198237885462555], ['01', 9.367713004484305], ['22', 11.749128919860627], ['21', 11.056511056511056], ['19', 9.414285714285715], ['17', 13.73019801980198], ['15', 39.66809421841542], ['14', 13.153439153439153], ['13', 22.2239263803681], ['11', 11.143426294820717], ['10', 13.757990867579908], ['09', 8.392045454545455], ['07', 10.095541401273886], ['03', 10.160377358490566], ['16', 10.76144578313253], ['08', 12.43157894736842], ['00', 9.857142857142858], ['23', 8.322463768115941], ['20', 11.38265306122449], ['18', 10.789823008849558], ['12', 15.452554744525548], ['04', 12.688172043010752], ['06', 9.017045454545455], ['05', 11.139393939393939]]
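To connect these averages back to the analysis goal, one possible next step is to sort the hours by average and keep the first five. The sketch below is self-contained: `printed_averages` is copied from the output above, while `top_five` and the output format are choices made here for illustration.

```python
# Averages copied from the printed output above: [hour, average comments per ask post].
printed_averages = [
    ['02', 13.198237885462555], ['01', 9.367713004484305],
    ['22', 11.749128919860627], ['21', 11.056511056511056],
    ['19', 9.414285714285715], ['17', 13.73019801980198],
    ['15', 39.66809421841542], ['14', 13.153439153439153],
    ['13', 22.2239263803681], ['11', 11.143426294820717],
    ['10', 13.757990867579908], ['09', 8.392045454545455],
    ['07', 10.095541401273886], ['03', 10.160377358490566],
    ['16', 10.76144578313253], ['08', 12.43157894736842],
    ['00', 9.857142857142858], ['23', 8.322463768115941],
    ['20', 11.38265306122449], ['18', 10.789823008849558],
    ['12', 15.452554744525548], ['04', 12.688172043010752],
    ['06', 9.017045454545455], ['05', 11.139393939393939],
]

# Sort by the average (second element), highest first, and keep five entries.
top_five = sorted(printed_averages, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

With the figures above, this ranks the hours 15, 13, 12, 10, and 17 highest, with hour 15 well ahead at roughly 39.67 comments per post. The hours are in whatever time zone the `created_at` column uses, which the data shown here does not state.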
