# Hacker News Analysis (Amit)

---

## Introduction

In this project we will be cleaning and analyzing Hacker News data. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

## Goals

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Below are a few examples:

* `Ask HN: How to improve my personal website?`
* `Ask HN: Am I the only one outraged by Twitter shutting down share counts?`
* `Ask HN: Aby recent changes to CSS that broke mobile?`

Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

* `Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform`
* `Show HN: Something pointless I made`
* `Show HN: Shanhu.io, a programming playground powered by e8vm`

We'll compare these two types of posts to determine the following:

* **Do `Ask HN` or `Show HN` posts receive more comments on average?**
* **Do posts created at a certain time receive more comments on average?**

## About the data

* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted
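
Because each row is read as a plain list, the columns above are addressed by position in the code that follows. The sketch below pairs the column names with their indices; the names `COLUMNS`, `col_index`, and `sample_row` are introduced here for illustration only, and the sample row is the first data row of the file.

```python
# Column names in file order, as listed above
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# First data row of the file (every value is read as a string)
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Map each column name to its list index
col_index = {name: idx for idx, name in enumerate(COLUMNS)}

print(col_index['title'])         # 1
print(col_index['num_comments'])  # 4
print(col_index['created_at'])    # 6

# Look up a value by column name instead of a bare number
print(sample_row[col_index['num_comments']])  # 52
```

These are the indices (`1`, `4`, and `6`) that the helper functions rely on.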

---

# Code Execution

## Import statements

In [1]:
from csv import reader
import datetime as dt

## Functions

Function `csv_to_list` opens a CSV file and returns its rows as a list of lists. The optional `inc_header` argument controls whether the header row is included.

In [2]:
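def csv_to_list(file_name, inc_header=True):
    # Use a context manager so the file is closed once it has been read
    with open(file_name, newline='') as open_file:
        retr_data = list(reader(open_file))

    # Drop the first (header) row when it is not wanted
    if not inc_header:
        retr_data = retr_data[1:]

    return retr_data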
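
To see what the `inc_header` switch changes without needing the real file, here is a minimal sketch that runs `csv.reader` over an in-memory string. The two-row sample and the names `sample_csv`, `with_header`, and `without_header` are made up for illustration.

```python
import csv
import io

# Tiny stand-in for a CSV file: one header row and one data row (made up)
sample_csv = 'id,title\n1,"Ask HN: Hello, world?"\n'

# csv.reader accepts any iterable of lines, including a StringIO object
rows = list(csv.reader(io.StringIO(sample_csv)))

with_header = rows          # what inc_header=True would return
without_header = rows[1:]   # what inc_header=False would return

print(with_header)     # [['id', 'title'], ['1', 'Ask HN: Hello, world?']]
print(without_header)  # [['1', 'Ask HN: Hello, world?']]
```

Note that the comma inside the quoted title does not split the field, which is why the `csv` module is used rather than `str.split(',')`.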
    

Function `explore_data` takes a data set (a list of rows) and prints the rows from `start_row` up to, but not including, `end_row`. When `disp_info` is `True`, it also prints the number of rows in the full data set.

In [3]:
def explore_data(data_set, start_row, end_row, disp_info=False):
    
    for idx in range(start_row, end_row):
        print(data_set[idx])
        print('\n')
    
    if disp_info:
        print(len(data_set), 'row(s) in the data set') 

Function `create_title_lists` takes the data set and returns the rows whose `title` (column index `1`) starts with `title_start`. The comparison is case-insensitive.

In [4]:
def create_title_lists(data_set, title_start):
    retr_list = []
    
    for row in data_set:
        if row[1].lower().startswith(title_start.lower()):
            retr_list.append(row)
    
    return retr_list
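
The case-insensitive prefix test is the core of this function. A minimal sketch of just that test, run on a few titles, might look like this; the lowercase title is made up to show that it is matched too, and the names `titles`, `prefix`, and `matches` are illustrative.

```python
# Three titles from the data set plus one made-up lowercase title
titles = [
    'Ask HN: How to improve my personal website?',
    'ask hn: is lowercase matched too?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

prefix = 'Ask HN'

# Lowercase both sides so the match ignores case
matches = [t for t in titles if t.lower().startswith(prefix.lower())]

print(len(matches))  # 2
```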

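Function `get_comment_count` takes a data set and sums the comment counts in column index `4` (`num_comments`). When `get_average` is `True`, it also returns the average number of comments per row, rounded to two decimal places.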

In [5]:
def get_comment_count(data_set, get_average=True):
    rtrn_count = 0

    for row in data_set:
        # Skip rows where the comment count is missing
        if row[4] != '':
            rtrn_count += int(row[4])

    if get_average:
        # Average over all rows, including any with a missing count
        rtrn_average = round(rtrn_count / len(data_set), 2)
        return rtrn_count, rtrn_average

    return rtrn_count

Function `get_posts_hours` returns the number of posts created in each hour of the day, as a list of `(hour, post count)` tuples sorted by hour.

In [6]:
def get_posts_hours(data_set):
    main_dict = {}

    # Example: 11/22/2015 13:43
    date_format = '%m/%d/%Y %H:%M'

    for row in data_set:
        if row[6] != '':
            parse_date = dt.datetime.strptime(row[6], date_format)
            created_at = parse_date.hour
            # Count one post for the hour in which it was created
            if created_at not in main_dict:
                main_dict[created_at] = 1
            else:
                main_dict[created_at] += 1

    # List of (hour, post count) tuples, ordered by hour
    return sort_dict(main_dict)
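
As a rough illustration of the two steps involved (parse `created_at`, then count by hour), the sketch below runs on five timestamps copied from rows shown in this notebook. It uses `collections.Counter` from the standard library in place of the manual dictionary, which is a substitution made here and not what the function above does.

```python
import datetime as dt
from collections import Counter

date_format = '%m/%d/%Y %H:%M'

# created_at values copied from rows shown in this notebook
samples = [
    '11/22/2015 13:43',
    '8/4/2016 11:52',
    '8/16/2016 9:55',
    '11/29/2015 22:46',
    '6/23/2016 22:20',
]

# Step 1: parse each string and keep only the hour
hours = [dt.datetime.strptime(s, date_format).hour for s in samples]
print(hours)  # [13, 11, 9, 22, 22]

# Step 2: count how many timestamps fall in each hour, ordered by hour
posts_per_hour = sorted(Counter(hours).items())
print(posts_per_hour)  # [(9, 1), (11, 1), (13, 1), (22, 2)]
```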

Function `get_comment_hours` returns the total number of comments received by posts created in each hour of the day, as a list of `(hour, comment count)` tuples sorted by hour.

In [7]:
def get_comment_hours(data_set):
    main_dict = {}

    # Example: 11/22/2015 13:43
    date_format = '%m/%d/%Y %H:%M'

    for row in data_set:
        if row[6] != '':
            parse_date = dt.datetime.strptime(row[6], date_format)
            created_at = parse_date.hour
            comments = int(row[4])
            # Add this post's comments to the total for its hour
            if created_at not in main_dict:
                main_dict[created_at] = comments
            else:
                main_dict[created_at] += comments

    # List of (hour, comment count) tuples, ordered by hour
    return sort_dict(main_dict)
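
The only difference from counting posts is that each row contributes its comment count instead of `1`. Here is a small sketch of that accumulation, assuming `collections.defaultdict` as a shortcut for the "not in dictionary yet" branch; the `(created_at, num_comments)` pairs are copied from rows shown in this notebook and mix post types purely for illustration.

```python
import datetime as dt
from collections import defaultdict

date_format = '%m/%d/%Y %H:%M'

# (created_at, num_comments) pairs copied from rows shown in this notebook
sample_rows = [
    ('11/29/2015 22:46', '102'),
    ('6/23/2016 22:20', '1'),
    ('11/22/2015 13:43', '29'),
]

# Missing hours start at 0, so no explicit "first time seen" check is needed
comments_per_hour = defaultdict(int)

for created_at, num_comments in sample_rows:
    hour = dt.datetime.strptime(created_at, date_format).hour
    comments_per_hour[hour] += int(num_comments)

totals = sorted(comments_per_hour.items())
print(totals)  # [(13, 29), (22, 103)]
```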

Function `sort_dict` sorts a dictionary's items by key in ascending order and returns them as a list of `(key, value)` tuples.

In [8]:
def sort_dict(obj_dict):
    sort_list = []

    # Collect (key, value) pairs so they can be sorted by key
    for item in obj_dict:
        sort_list.append((item, obj_dict[item]))

    sort_list.sort(reverse=False)

    return sort_list
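
For reference, the built-in `sorted` applied to `dict.items()` gives the same kind of result in one line. The sketch below is an equivalent shortcut, not the function used in this notebook; the sample dictionary holds three hour-to-post-count pairs from the ask post results, deliberately out of order.

```python
# Three hour -> post count pairs, deliberately out of order
counts = {15: 116, 2: 58, 20: 80}

# items() yields (key, value) pairs; sorted() orders tuples by key first
as_sorted_list = sorted(counts.items())

print(as_sorted_list)  # [(2, 58), (15, 116), (20, 80)]
```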

## Workflow

### 1. Open the file and explore the data set

Open the file and print the header plus the first two data rows, followed by the total row count.

In [9]:
file_name = 'hacker_news.csv'

hn_data = csv_to_list(file_name, True)
explore_data(hn_data, 0, 3, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


20101 row(s) in the data set


---
Save the header row to `headers` and the remaining rows back to `hn_data`.

In [10]:
headers = hn_data[0]
hn_data = hn_data[1:]
explore_data(hn_data, 0, 3, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


20100 row(s) in the data set


---
Now that we've removed the headers from `hn_data`, we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

Create new lists
1. **Ask Posts**
2. Show Posts
3. Other Posts

In [11]:
ask_posts = create_title_lists(hn_data, 'Ask HN')
explore_data(ask_posts, 0, 3, True)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


1744 row(s) in the data set


Create new lists
1. Ask Posts
2. **Show Posts**
3. Other Posts

In [12]:
show_posts = create_title_lists(hn_data, 'Show HN')
explore_data(show_posts, 0, 3, True)

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']


['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']


1162 row(s) in the data set


Create new lists
1. Ask Posts
2. Show Posts
3. **Other Posts**

In [13]:
combined_ahn_shn = ask_posts + show_posts

In [14]:
other_posts = [post for post in hn_data if post not in combined_ahn_shn]
explore_data(other_posts, 0, 3, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


17194 row(s) in the data set


In [15]:
if (len(show_posts) + len(ask_posts) + len(other_posts)) == len(hn_data):
    print('Data split went well.')
else:
    print('Something didn\'t work.')

Data split went well.
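
As an alternative sketch (not the approach used in this notebook), the same three groups can be built in a single pass over the rows, which avoids testing every row for membership in the combined ask/show list. The function name `split_posts` and the shortened two-column sample rows are assumptions for illustration.

```python
def split_posts(rows):
    """Partition rows into ask, show, and other posts by title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()  # title is at column index 1
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Shortened sample rows: only id and title are kept here
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?'],
    ['10646440', 'Show HN: Something pointless I made'],
    ['12224879', 'Interactive Dynamic Video'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?'],
]

ask, show, other = split_posts(sample_rows)
print(len(ask), len(show), len(other))  # 2 1 1

# Every row lands in exactly one list, so the sizes add up by construction
print(len(ask) + len(show) + len(other) == len(sample_rows))  # True
```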


---
### Next, let's determine if ask posts or show posts receive more comments on average.

In [16]:
total_ask_comments, avg_ask_comments = get_comment_count(ask_posts, True)
print('Total Comments on ask posts are {} and average is {} comments per post'.format(total_ask_comments, avg_ask_comments))

Total Comments on ask posts are 24483 and average is 14.04 comments per post


In [17]:
total_show_comments, avg_show_comments = get_comment_count(show_posts, True)
print('Total Comments on show posts are {} and average is {} comments per post'.format(total_show_comments, avg_show_comments))

Total Comments on show posts are 11988 and average is 10.32 comments per post


### Ask posts have more comments at 14.04/post compared to 10.32/post in show posts

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
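
The two averages can be checked by hand from the totals and row counts printed earlier. The short sketch below redoes that arithmetic; the variable names are illustrative.

```python
# Totals and row counts copied from the outputs above
ask_total, ask_rows = 24483, 1744
show_total, show_rows = 11988, 1162

ask_avg = round(ask_total / ask_rows, 2)
show_avg = round(show_total / show_rows, 2)

print(ask_avg)   # 14.04
print(show_avg)  # 10.32

# Ask posts draw roughly a third more comments per post than show posts
print(round(ask_avg / show_avg, 2))  # 1.36
```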

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

1. **Calculate the number of ask posts created in each hour of the day, along with the number of comments received.**
2. Calculate the average number of comments ask posts receive by hour created.

In [20]:
counts_by_hour = get_posts_hours(ask_posts)
print(counts_by_hour)

[(0, 55), (1, 60), (2, 58), (3, 54), (4, 47), (5, 46), (6, 44), (7, 34), (8, 48), (9, 45), (10, 59), (11, 58), (12, 73), (13, 85), (14, 107), (15, 116), (16, 108), (17, 100), (18, 109), (19, 110), (20, 80), (21, 109), (22, 71), (23, 68)]


In [21]:
comments_by_hour = get_comment_hours(ask_posts)
print(comments_by_hour)

[(0, 447), (1, 683), (2, 1381), (3, 421), (4, 337), (5, 464), (6, 397), (7, 267), (8, 492), (9, 251), (10, 793), (11, 641), (12, 687), (13, 1253), (14, 1416), (15, 4477), (16, 1814), (17, 1146), (18, 1439), (19, 1188), (20, 1722), (21, 1745), (22, 479), (23, 543)]


1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. **Calculate the average number of comments ask posts receive by hour created.**
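
To make step 2 concrete, the average for an hour is that hour's comment total divided by its post count. Here is a small worked sketch for three hours, using values copied from the two outputs above; the dictionary names are illustrative.

```python
# Post counts and comment totals for three hours, copied from the outputs above
posts = {15: 116, 2: 58, 9: 45}
comments = {15: 4477, 2: 1381, 9: 251}

# Average comments per post for each hour, rounded to two decimals
avg = {hour: round(comments[hour] / posts[hour], 2) for hour in posts}

print(avg)  # {15: 38.59, 2: 23.81, 9: 5.58}
```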

In [41]:
# Both lists are sorted by hour, so zip pairs each hour's post count with its comment total
avg_by_hour = [(round(comments / posts, 2), hour) for (hour, posts), (_, comments) in zip(counts_by_hour, comments_by_hour)]
print(avg_by_hour)

[(8.13, 0), (11.38, 1), (23.81, 2), (7.8, 3), (7.17, 4), (10.09, 5), (9.02, 6), (7.85, 7), (10.25, 8), (5.58, 9), (13.44, 10), (11.05, 11), (9.41, 12), (14.74, 13), (13.23, 14), (38.59, 15), (16.8, 16), (11.46, 17), (13.2, 18), (10.8, 19), (21.52, 20), (16.01, 21), (6.75, 22), (7.99, 23)]


In [42]:
avg_by_hour.sort(reverse=True)
print(avg_by_hour)

[(38.59, 15), (23.81, 2), (21.52, 20), (16.8, 16), (16.01, 21), (14.74, 13), (13.44, 10), (13.23, 14), (13.2, 18), (11.46, 17), (11.38, 1), (11.05, 11), (10.8, 19), (10.25, 8), (10.09, 5), (9.41, 12), (9.02, 6), (8.13, 0), (7.99, 23), (7.85, 7), (7.8, 3), (7.17, 4), (6.75, 22), (5.58, 9)]
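
Sorting works here because the average comes first in each tuple. If the pairs were kept as `(hour, average)` instead, the same ranking could be produced with a `key` function, as in this sketch on a four-hour subset of the values above. The names `hour_first` and `ranked` are illustrative.

```python
# (average comments per post, hour) pairs: a subset of the output above
avg_by_hour = [(8.13, 0), (23.81, 2), (38.59, 15), (21.52, 20)]

# Swap each pair to (hour, average)
hour_first = [(hour, avg) for avg, hour in avg_by_hour]

# Rank by the average (second element), highest first
ranked = sorted(hour_first, key=lambda pair: pair[1], reverse=True)

print(ranked[:2])  # [(15, 38.59), (2, 23.81)]
```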


In [46]:
print('Posting at {}:00 hours with Ask Posts seems to get the best response'.format(avg_by_hour[0][1]))

Posting at 15:00 hours with Ask Posts seems to get the best response


In [50]:
print('Top 5 ask hour posts!', end='\n\n')
for x in range(5):
    print('{}:00 hour is number {} with {} comments/post.'.format(avg_by_hour[x][1], x+1, avg_by_hour[x][0]), end='\n\n')

Top 5 ask hour posts!

15:00 hour is number 1 with 38.59 comments/post.

2:00 hour is number 2 with 23.81 comments/post.

20:00 hour is number 3 with 21.52 comments/post.

16:00 hour is number 4 with 16.8 comments/post.

21:00 hour is number 5 with 16.01 comments/post.



# End of Analysis