## An Analysis of _Hacker News_ Posts

This project will examine a sample of posts from the _Hacker News_ website. 

The data set comprises a sample of c. 20,000 user-submitted stories, and it includes only posts that have been commented on by other users.

The analysis will focus on "Ask HN" posts and "Show HN" posts.
* "Ask HN" are posts where users ask the Hacker News community a question. 
* "Show HN" are posts where users show the community something of interest. 

The two main questions this project seeks to answer are:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

### Introduction 

First, we'll open the file and take a look at the first four rows (the header row and three posts):

In [1]:
## Opening Hacker News dataset and printing the first 4 rows (header + 3 posts)

from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close() # close the file once the rows are in memory

print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
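
As an aside, the same loading step can be sketched with a context manager, which closes the file automatically. The example below reads from an in-memory string instead of `hacker_news.csv` so that it runs on its own. The `sample_csv` text is a one-post excerpt of the output above, and `load_rows` is a hypothetical helper name that is not part of the original notebook.

```python
import csv
import io

# One header row and one post, copied from the output above.
# This stands in for the real file so the sketch is self-contained.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)


def load_rows(file_obj):
    """Return every CSV row in file_obj as a list of strings (hypothetical helper)."""
    return list(csv.reader(file_obj))


# With the real dataset this would be: with open('hacker_news.csv') as f:
with io.StringIO(sample_csv) as f:
    rows = load_rows(f)

print(rows[0])  # header row
print(rows[1])  # first post
```

Note that every value, including `num_points` and `num_comments`, comes back as a string, which is why the later cells convert them with `int()`.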


### Removing Headers from a List of Lists

Next, we will:
* extract the header row and assign it to a variable called `headers`
* remove this first row from `hn`
* display `headers` and the first five rows of `hn`


In [2]:
from csv import reader

opened_file = open('hacker_news.csv') # re-read the file so this cell can be re-run safely
read_file = reader(opened_file)
hn = list(read_file)
opened_file.close()

headers = hn[0] # the first row holds the column names
hn = hn[1:] # keep everything after the header

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Extracting "Ask HN" and "Show HN" posts from the dataset

In this step, we will filter the data so that we can look at just the "Ask HN" and "Show HN" posts and set all other posts aside. For this, we will create three separate lists:
* `ask_posts` = all the "Ask HN" posts
* `show_posts` = all the "Show HN" posts
* `other_posts` = all the remaining posts in the dataset

We will also count the total number of posts for each list below, and print the result. 

In [3]:
ask_posts = [] # start with empty lists 
show_posts = [] 
other_posts = [] 

for row in hn: 
    title = row[1].lower() # lowercase the title so the prefix check is case-insensitive
    
    if title.startswith('ask hn'):
        ask_posts.append(row) # add the whole row to the list 
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask HN posts:', len(ask_posts)) # print the length of the list
print('Number of Show HN posts:', len(show_posts))
print('Number of other posts:', len(other_posts))
    

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194
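
The prefix check used in the loop above can also be sketched as a small function, which makes the rule easy to try on individual titles. `classify_title` is a hypothetical helper introduced only for illustration. The first two sample titles are made up, and the third is taken from the output shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # compare in lowercase so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles only: the first two are invented for this sketch.
sample_titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Because the check uses `startswith`, a title that only mentions "Ask HN" somewhere in the middle would be classified as `'other'`, which matches the behaviour of the original loop.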


### Calculating the Average Number of Comments for Ask HN and Show HN Posts 

Since we want to find out which type of post (Ask HN or Show HN) receives more comments, the next thing we will do is find the average number of comments for each type of post. 

Here we will start with the average number of comments for each "Ask HN" post: 

In [4]:
total_ask_comments = 0 # start at a count of zero 

for row in ask_posts: # loop over each row in ask_posts
    num_comments = int(row[4]) # convert number of comments to an integer
    total_ask_comments += num_comments # add number of comments to the running total

avg_ask_comments = total_ask_comments / len(ask_posts) # calculate the average number of comments per post
print(int(avg_ask_comments)) # int() drops the decimal part (it truncates rather than rounds)
    
    

14


Next, we will calculate the average number of comments for each "Show HN" post:

In [5]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(int(avg_show_comments))

10


#### Results:

* Average number of comments for an "Ask HN" post: about 14
* Average number of comments for a "Show HN" post: about 10


We can see that "Ask HN" posts get more comments than "Show HN" posts on average. 
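
Since the two cells above repeat the same steps, the calculation can be sketched as one reusable function. `average_comments` and the two sample rows are hypothetical and exist only for this illustration. The rows follow the dataset's column order, with `num_comments` at index 4, but their ids, titles and authors are made up.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post (hypothetical helper).

    Assumes each row stores the comment count as a string at comments_index.
    """
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: example A', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example B', '', '3', '29', 'user_b', '11/22/2015 13:43'],
]

avg = average_comments(sample_posts)
print(avg)            # the exact mean
print(int(avg))       # int() drops the decimal part
print(f"{avg:.2f}")   # formatted to two decimal places
```

The last three lines show why the choice of display matters: `int()` discards the fractional part, so two averages that differ by almost a whole comment can print as the same number.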

### Finding the Number of Ask Posts and Comments by Hour Created

Next, we're going to see whether posts created at a certain time of day are likely to receive more comments. For this, we will just use the `ask_posts` list, since Ask HN posts had the higher average number of comments per post. 

We will first calculate how many posts are created in each hour of the day. We will also calculate the number of comments received in each hour. 

We will then calculate the average number of comments a post receives by hour.  

The code below: 
* imports the datetime module, which we'll use to isolate the hour of the day, e.g. 09 (9am), 14 (2pm), etc. 
* creates a new list of lists from `ask_posts`, called `result_list`, which just contains `created_at` and `num_comments`.
* loops over `result_list` and creates two dictionaries: 
  - `counts_by_hour` = the number of Ask HN posts created in each hour of the day
  - `comments_by_hour` = the total number of comments received by Ask HN posts created in each hour of the day


In [6]:
import datetime as dt # import datetime module

result_list = [] # create empty list

for row in ask_posts: 
    created_at = row[6] # get created_at element
    num_comments = int(row[4]) # get num_comments element
    result_list.append([created_at, num_comments]) # add elements to the result_list

print(result_list[:11])   



[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17], ['9/26/2015 23:23', 1], ['4/22/2016 12:24', 4], ['11/16/2015 9:22', 1], ['2/24/2016 17:57', 1], ['6/4/2016 17:17', 2], ['9/19/2015 17:04', 7]]


In [7]:
counts_by_hour = {} # new dictionary to count how many posts in each hour of the day
comments_by_hour = {} # new dictionary to count total no. of comments in each hour of the day

for row in result_list:
    created_at = row[0]
    num_comments = row[1] # already converted to an integer in the previous cell
    dt_created_at = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M") # convert created_at into a datetime object
    hour = dt_created_at.strftime("%H") # isolate the hour as a zero-padded string, e.g. '09'
    if hour not in counts_by_hour:  
        counts_by_hour[hour] = 1 # if hour isn't counted yet, assign the key a value of 1.
        comments_by_hour[hour] = num_comments # if hour isn't counted yet, assign the key a value that equals number of comments 
    else:
        counts_by_hour[hour] += 1 # if hour is counted, increase count by 1
        comments_by_hour[hour] += num_comments # if hour is counted, add the number of comments to the existing number of comments (the total)
print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
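
The same tallying pattern can be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is an alternative to the cell above, not the method the notebook uses. The `sample` list holds three entries copied from the `result_list` output shown earlier.

```python
import datetime as dt
from collections import defaultdict

# Three [created_at, num_comments] pairs copied from the result_list output above.
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['11/16/2015 9:22', 1],
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the timestamp, then format only the hour as a zero-padded string.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

Two of the sample posts were created in the 09:00 hour, so their comment counts are added together under the key `'09'`.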


Next, we need to calculate the average number of comments a post receives in a certain hour of the day. 

The code below: 
* creates an empty list, called `avg_by_hour`
* loops over the dictionary called `comments_by_hour`
* appends to the list two elements:
  - the key (`hour`) as the first element in the list
  - and then for the second element of the list, we calculate the average number of comments per post in that hour, using the values from the two dictionaries we created earlier (`comments_by_hour`, and `counts_by_hour`).  


In [8]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])


for row in avg_by_hour:
    print(row)
    

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]


### Sorting and Printing Values from a List of Lists

We'll now sort and print the values so that we can see which hours of the day attract the most comments, on average, for an Ask HN post.

To do this, we will first sort the data from the highest average number of comments to the lowest. 

The code below:
* creates an empty list, which will store the swapped values
* loops over `avg_by_hour` list
* adds `avg` as the first element of the list 
* adds `hour` as the second element of the list
* uses the sorted() function to sort the values from highest to lowest, using `reverse=True`
* assigns the sorted values to `sorted_swap` and prints the results. 


In [9]:
swap_avg_by_hour = [] # create empty list

for row in avg_by_hour: 
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour]) # append to list, swapping the current order

sorted_swap = sorted(swap_avg_by_hour, reverse=True) # sorting
print(sorted_swap)

    


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
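
Swapping the columns is one way to sort by the average. A possible alternative, sketched below, is to leave the `[hour, average]` order untouched and pass a `key` function to `sorted()`. The `avg_by_hour_sample` list is a four-row excerpt of the values printed earlier, and the variable names are introduced only for this example.

```python
# Four [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key=... tells sorted() to compare rows by their second element (the average).
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

hours_in_order = [row[0] for row in sorted_by_avg]
print(hours_in_order)
```

With this approach the rows keep their original `[hour, average]` layout, so any later code would read `row[0]` for the hour and `row[1]` for the average.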


Next, we will print our formatted results.  

The code below: 
* prints the first statement
* loops through our sorted list
* formats the average to 2 decimal places
* converts the hour into a datetime object and then formats it as a string in our desired format (`HH:MM`)
* formats the final output into a statement using `str.format`

In [35]:
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    average = "{:.2f}".format(row[0]) # gets average and formats to 2 dp
    hour = row[1] 
    hour_dt = dt.datetime.strptime(hour, "%H") # converts to datetime
    hour_dt_object = dt.datetime.strftime(hour_dt, "%H:%M") # converts to our desired format of datetime
    output = "{}: {} average comments per post".format(hour_dt_object, average) # formats the output into a sentence
    print(output) # prints the results

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


### Conclusion

The results show that if you want the best chance of receiving comments, you should create an Ask HN post at 15:00, 02:00, 20:00, 16:00, or 21:00. According to the dataset's documentation, the times in the dataset are in US Eastern Time.

So if I wanted to create a post with a higher chance of receiving comments, I should convert these times into my timezone (GMT). Adding four hours, which assumes Eastern Daylight Time (UTC-4; the offset is five hours during Eastern Standard Time), this means that I should create my posts at 19:00, 06:00, 00:00, 20:00, or 01:00. 
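
The conversion can be checked with a short sketch. It assumes a fixed four-hour offset (Eastern Daylight Time, UTC-4) and treats GMT as UTC. Both are assumptions of this example, and `eastern_hour_to_utc` is a hypothetical helper. The date inside the function is arbitrary, since only the hour matters.

```python
import datetime as dt

# Assumption: Eastern Daylight Time, a fixed offset of UTC-4.
# During Eastern Standard Time the offset would be UTC-5 instead.
EDT = dt.timezone(dt.timedelta(hours=-4))
UTC = dt.timezone.utc


def eastern_hour_to_utc(hour, tz=EDT):
    """Convert an hour of the day in tz to an 'HH:MM' string in UTC (hypothetical helper)."""
    local = dt.datetime(2016, 7, 1, hour, 0, tzinfo=tz)  # arbitrary date
    return local.astimezone(UTC).strftime("%H:%M")


top_hours = [15, 2, 20, 16, 21]  # the top five hours found above
converted = [eastern_hour_to_utc(h) for h in top_hours]
print(converted)
```

A fixed offset keeps the sketch simple. For dates that span the daylight saving change, a time zone database (for example the standard library's `zoneinfo` module) would be a more robust choice.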