# Exploring Hacker News Posts

For this project, we will explore the popular website [Hacker News](https://news.ycombinator.com/), run by the startup incubator [Y Combinator](https://www.ycombinator.com/). Users of the site submit stories, known as "posts", which are voted on and commented on by its community.

In this study, we are especially interested in posts whose titles start with either `Ask HN` or `Show HN`. `Ask HN` posts ask the Hacker News community a specific question, while `Show HN` posts show the community something of general interest. The goal of our study is to answer these questions:

- On average, do `Ask HN` posts receive more comments than `Show HN` posts?
- At what time of day should a post be created to have a higher chance of receiving comments?

## About Our Data Set:

The data set we will be using can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It was last updated in September 2016 and has approximately 20,000 rows. Below are descriptions of its columns:

- `id`: The unique identifier from Hacker News for the post
- `title`: The title of the post
- `url`: The URL that the post links to, if the post has a URL
- `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: The number of comments that were made on the post
- `author`: The username of the person who submitted the post
- `created_at`: The date and time at which the post was submitted
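
Because we read the file as plain lists, each column is later accessed by its position rather than by its name. The sketch below shows one way to keep that mapping explicit. The names `HEADERS` and `COLUMN_INDEX` are our own illustrative choices and are not used elsewhere in this study; the sample row is the first post shown in the output further down.

```python
# Column order as listed above; each row read from the CSV is a list in this order
HEADERS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: maps each column name to its list position
COLUMN_INDEX = {name: position for position, name in enumerate(HEADERS)}

sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Looks up values by column name instead of a bare index
print(sample_row[COLUMN_INDEX['num_comments']])
print(sample_row[COLUMN_INDEX['created_at']])
```

This is why the code in this study uses `row[1]` for the title, `row[4]` for the number of comments, and `row[6]` for the creation time.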

In [1]:
def explore_data(dataset, begin, end, show_len=True):
    #Prints each row in the requested slice, followed by a blank line
    dataset_slice = dataset[begin:end]
    for row in dataset_slice:
        print(row, "\n")

    #When show_len is True, displays the total number of rows in our data set
    if show_len:
        data_set_len = len(dataset)
        print("Length of data set is: {:,}".format(data_set_len))

Our function `explore_data` above will be used throughout our study. It displays a specified range of rows from our data set and tells us how many rows there are in total.

In [2]:
#Opens our data set and saves it into a variable named hn
from csv import reader

#The with statement closes the file once its rows have been read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

explore_data(hn, 0, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

Length of data set is: 20,101


Using our code above, we opened and read our data set, stored in `hacker_news.csv`, and saved its contents in a variable called `hn`. We also displayed its first 5 rows (the first of which contains our data set's headers) and now know that it has 20,101 rows, including the header row.

## Data Cleaning:

**Removing the Header Row:**

In [3]:
#Extracts our data set's headers and puts them in the headers variable
headers = hn[0]
hn = hn[1:]

explore_data(hn, 0, 5)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'] 

Length of data set is: 20,100


Our code above saves the header row in a variable named `headers` and then removes it from our data set. We do this so that the header row is not mistaken for a "post" when we process our data.

We then displayed the first 5 rows to make sure that the headers were indeed removed, and verified that the length of our data set is now 20,100, down from 20,101.
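
As an alternative sketch, the header can be separated while the file is being read by calling `next()` on the reader before collecting the remaining rows. To keep the example self-contained, it reads from an in-memory string instead of `hacker_news.csv`; the names `sample_csv`, `sample_headers`, and `sample_rows` are illustrative only.

```python
import csv
import io

# Two lines in the same layout as our data set: the header and one post
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

sample_reader = csv.reader(sample_csv)
sample_headers = next(sample_reader)  # consumes only the first row
sample_rows = list(sample_reader)     # everything that remains is post data

print(sample_headers)
print(len(sample_rows))
```

Both approaches give the same result; slicing with `hn[1:]` is simply easier to follow when the whole file is already in memory.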

**Extracting Ask HN and Show HN Posts:**

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    #Checks if the title of the post begins with "ask hn"
    if title.startswith('ask hn'):
        ask_posts.append(row)

    #Checks if the title of the post begins with "show hn"
    elif title.startswith('show hn'):
        show_posts.append(row)

    else:
        other_posts.append(row)

Our code above extracted the posts whose titles start with either `Ask HN` or `Show HN` and placed them in lists named `ask_posts` and `show_posts` respectively. Because the titles are lowercased before the check, the match is case-insensitive. We also placed the posts that do not belong to either category in a list called `other_posts`.
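
To make the matching rule easier to see, here is a minimal sketch that applies the same prefix check to a few made-up titles. The function name `classify_title` and the sample titles are hypothetical and are not part of our data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles, chosen to show the case-insensitive prefix match
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
    'Why I ask HN for advice',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Note that the last title contains "ask HN" but does not start with it, so it is classified as `other`.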

In [5]:
print ("Number of Ask HN posts: {:,}".format(len(ask_posts)))
print ("Number of Show HN posts: {:,}".format(len(show_posts)))
print ("Number of Other posts: {:,}".format(len(other_posts)))

Number of Ask HN posts: 1,744
Number of Show HN posts: 1,162
Number of Other posts: 17,194


Our code above shows that our data set has 1,744 `Ask HN` posts, 1,162 `Show HN` posts, and 17,194 `Other` posts.
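
As a quick sanity check, the three counts should add up to the 20,100 rows left after removing the header. The sketch below hard-codes the counts reported above and also computes each category's share of the data set; the variable names are illustrative.

```python
# Counts reported by the cell above
category_counts = {'Ask HN': 1744, 'Show HN': 1162, 'Other': 17194}

total_posts = sum(category_counts.values())
print("Total posts: {:,}".format(total_posts))

# Share of the data set that falls in each category
for category, count in category_counts.items():
    print("{}: {:.2%}".format(category, count / total_posts))
```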

## Data Processing:

**Calculating the Average Number of Comments for Ask HN and Show HN Posts:**

In [6]:
def get_avg_no_comments (dataset):
    total_no_of_comments = 0
    for row in dataset:
        num_comments = int(row[4])
        
        #Adds the number of comments
        total_no_of_comments += num_comments 
        
    #Divides the total number of comments by the number of rows
    avg_no_comments = total_no_of_comments / len(dataset)
    return avg_no_comments

Our function `get_avg_no_comments` will be used to get the average number of comments for our newly separated data sets. It does this by adding up the number of comments on each post and storing the sum in our variable `total_no_of_comments`. We then divide `total_no_of_comments` by the number of rows in the data set.
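
The same calculation can be sketched with `statistics.mean` from the standard library. The rows below are made up and only mimic the layout of our data set; the guard for an empty list is our own addition, since dividing by `len(dataset)` fails when there are no rows.

```python
from statistics import mean

# Made-up rows in the same layout; only index 4 (num_comments) matters here
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '6', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '29', 'user_b', '8/4/2016 15:10'],
    ['3', 'Ask HN: C', '', '1', '1', 'user_c', '8/5/2016 9:26'],
]

# The CSV stores every value as text, so we convert before averaging
comment_counts = [int(row[4]) for row in sample_posts]

# Falls back to 0.0 for an empty list instead of raising an error
sample_average = mean(comment_counts) if comment_counts else 0.0
print("{:.2f}".format(sample_average))
```

Here the three posts have 6, 29, and 1 comments, so the average is 36 / 3 = 12.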

In [7]:
#Displays the average number of comments for Ask HN posts
avg_ask_comments = get_avg_no_comments(ask_posts)
print ("Average number of comments of Ask HN posts:", "{:.2f}".format(avg_ask_comments))

#Displays the average number of comments for Show HN posts
avg_show_comments = get_avg_no_comments(show_posts)
print ("Average number of comments of Show HN posts:", "{:.2f}".format(avg_show_comments))

Average number of comments of Ask HN posts: 14.04
Average number of comments of Show HN posts: 10.32


Our code above shows that `Ask HN` posts receive an average of 14.04 comments, while `Show HN` posts receive an average of 10.32. Our findings indicate that `Ask HN` posts on average receive more comments than `Show HN` posts. Since `Ask HN` posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

**Finding the Number of Ask Posts and Comments by Hour Created:**

The next step in our data analysis is to find out if `Ask HN` posts that are created at a certain time are more likely to attract comments. We will do this in 2 steps:

1. Calculate the number of `Ask HN` posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments `Ask HN` posts receive by hour created.

In [8]:
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, num_comments])

We created a new data set `result_list`. It contains the `created_at` and `num_comments` values of each row in our `ask_posts` data set.

In [9]:
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

#Loops through our result_list data set
for row in result_list:
    comments = int(row[1])
    #Parses the created_at string, then keeps only the zero-padded hour (e.g. '09')
    created = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created.strftime("%H")
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments

In the code above, we looped through `result_list` and created 2 new dictionaries:
- `counts_by_hour`: contains the number of `Ask HN` posts created during each hour.
- `comments_by_hour`: contains the total number of comments received by `Ask HN` posts created during each hour.
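
The two dictionaries can also be built without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at 0. This is only a sketch on made-up `[created_at, num_comments]` pairs; the names `sample_results`, `sample_counts`, and `sample_comments` are illustrative.

```python
import datetime as dt
from collections import defaultdict

# Made-up pairs in the same shape as result_list
sample_results = [
    ['8/4/2016 15:52', '10'],
    ['9/1/2016 15:05', '4'],
    ['8/16/2016 9:55', '6'],
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments on posts created in each hour

for created_at, num_comments in sample_results:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded string such as '09'
    sample_counts[hour] += 1
    sample_comments[hour] += int(num_comments)

print(dict(sample_counts))
print(dict(sample_comments))
```

Two of the sample posts fall in hour `15`, so that key ends up with a count of 2 and a comment total of 14.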

**Calculating the Average Number of Comments for Ask HN Posts by Hour:**

In [10]:
avg_by_hour = []

#Divides comments_by_hour by counts_by_hour to get the average number of comments per hour
for hour in counts_by_hour:
    avg_comment_hour = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour , avg_comment_hour])

print (avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


Using our code above, we divided the values in our `comments_by_hour` dictionary by the corresponding values in our `counts_by_hour` dictionary to get the average number of comments per post for each hour. We stored these averages in our new list `avg_by_hour` and displayed them, but the output is hard to read in its current format.
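
A minimal sketch of the same division, using a dictionary comprehension and printing one hour per line so the result is easier to scan. The totals below are made up and only mirror the shape of `counts_by_hour` and `comments_by_hour`.

```python
# Made-up totals in the same shape as counts_by_hour and comments_by_hour
sample_counts = {'15': 2, '09': 1, '20': 4}
sample_comments = {'15': 14, '09': 6, '20': 10}

# Average comments per post for each hour
sample_avg = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_counts
}

# Zero-padded hour strings sort correctly as text
for hour in sorted(sample_avg):
    print("{}:00 -> {:.2f}".format(hour, sample_avg[hour]))
```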

**Sorting and Formatting Average Number of Comments Data:**

In [11]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1] , row[0]])

#Sorts our data set from greatest to least comments per hour
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print ("Top 5 Hours for Ask Posts Comments:")

#Formats our time to HH:MM and limits the average number of comments to 2 decimal places
for row in sorted_swap[:5]:
    time = dt.datetime.strptime(row[1], "%H")
    time = dt.datetime.strftime(time, "%H:%M")
    
    avg_comments = "{:.2f}".format(row[0])
    
    print (time , "-" , avg_comments)

Top 5 Hours for Ask Posts Comments:
15:00 - 38.59
02:00 - 23.81
20:00 - 21.52
16:00 - 16.80
21:00 - 16.01


Our code above sorted our data from greatest to least and displayed only the top 5 results. It also formatted each hour as HH:MM and the average number of comments to 2 decimal places.
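
Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. The sketch below uses four of the averages shown above, rounded to 2 decimal places; `sample_avg_by_hour` and `top_hours` are illustrative names.

```python
# A few [hour, average] pairs from the output above, rounded to 2 decimals
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sorts by the second element (the average), highest first, and keeps the top 3
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg_comments in top_hours:
    print("{}:00 - {:.2f}".format(hour, avg_comments))
```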

## Conclusion:

For a post to have a higher chance of getting more comments, it should be an `Ask HN` post created around `15:00`, the hour with the highest average in our data set (38.59 comments per post).