# Hacker News 

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. 

The present analysis aims to determine which type of post, `Ask HN` or `Show HN`, receives more comments on average. It then brings in posting time to check whether posts created at certain hours receive more comments on average.

The current data set has approximately 20,000 rows with each row containing the following columns:
* `id`: The unique identifier from Hacker News for the post
* `title`: The title of the post
* `url`: The URL that the post links to, if the post has a URL
* `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the post
* `author`: The username of the person who submitted the post
* `created_at`: The date and time at which the post was submitted
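
Because each row is read as a plain list, columns are accessed by position in the rest of the analysis. The following is a minimal sketch of a name-to-index lookup built from the column names above; `COLUMNS`, `col_index`, and `sample_row` are illustrative names that are not part of the original notebook.

```python
# Column names in the order listed above
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
col_index = {name: position for position, name in enumerate(COLUMNS)}

# One row from the dataset, used only as sample input
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

print(col_index['title'])                   # position of the title column
print(col_index['num_comments'])            # position of the comment count
print(sample_row[col_index['created_at']])  # timestamp of the sample row
```

The positions found this way (1 for `title`, 4 for `num_comments`, 6 for `created_at`) are the indices used directly in the cells that follow.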

---
## Read and Store file in dataset variable `hn`

Below, the `reader` function from the `csv` module is used to read the `.csv` file, which is then stored as a list of rows in the `hn` variable.

In [1]:
from csv import reader

# Read every row of the CSV file into a list of lists and close the file afterwards
with open("hacker_news.csv") as f:
    hn = list(reader(f))

def print_first_5_rows(dataset):
    for row in dataset[:5]:
        print(row)
        print('\n')

print_first_5_rows(hn)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




---
## Separate header row from dataset

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print_first_5_rows(hn)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




---
## Create lists containing `Ask HN` and `Show HN` rows

Below, the dataset rows are split into three lists: `ask_posts` stores the `Ask HN` rows (questions asked on the Hacker News platform), `show_posts` stores the `Show HN` rows (posts by users who want to show something), and `other_posts` stores the remaining rows.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

#check number of posts in ask_posts, show_posts and other_posts
print('The number of posts in ask_posts is: {}'.format(len(ask_posts)))
print('The number of posts in show_posts is: {}'.format(len(show_posts)))
print('The number of posts in other_posts is: {}'.format(len(other_posts)))

The number of posts in ask_posts is: 1744
The number of posts in show_posts is: 1162
The number of posts in other_posts is: 17194
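
The prefix check used for the split can be isolated into a small function. This is only a sketch: `classify_post` is a hypothetical helper, and the titles passed to it are made-up examples rather than rows from the dataset.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles used only to illustrate the prefix check
print(classify_post('Ask HN: How do you learn a new language?'))
print(classify_post('SHOW HN: My weekend project'))
print(classify_post('Interactive Dynamic Video'))
```

Lowercasing the title before the comparison is what lets `Ask HN`, `ask hn`, and `ASK HN` all land in the same list.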


---
## Which type of post receives more comments on average?

Determine which list, `ask_posts` or `show_posts`, has more comments per post on average.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print('The average number of comments in ask_posts is: {:.2f}'.format(avg_ask_comments))    

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments in show_posts is: {:.2f}'.format(avg_show_comments))    


The average number of comments in ask_posts is: 14.04
The average number of comments in show_posts is: 10.32


As the average number of comments in `ask_posts` is about 36% larger than that in `show_posts` (14.04 versus 10.32), posts containing questions tend to attract more engagement from the Hacker News community.
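
The size of the gap between the two averages can be checked with a short calculation. The sketch below hard-codes the two rounded averages printed above, so the result is only as precise as those two figures.

```python
avg_ask_comments = 14.04   # rounded value printed above
avg_show_comments = 10.32  # rounded value printed above

# Relative difference, using the show posts average as the baseline
relative_difference = (avg_ask_comments - avg_show_comments) / avg_show_comments
print('{:.1%}'.format(relative_difference))
```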

For the next step, checking whether there is a best time to post, only the `ask_posts` list is used.

---
## Posting time vs number of comments

Are ask posts created at a certain time more likely to attract a larger number of comments? We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

In [5]:
# start by importing the datetime module to parse date strings into datetime objects
import datetime as dt

results_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    results_list.append([created_at, n_comments])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in results_list:
    date = dt.datetime.strptime(row[0], date_format)
    hour = date.strftime("%H")
    n_comments = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
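
The format string used in the cell above can be tried on a single timestamp. This is a small sketch using one `created_at` value from the dataset preview; it shows that `%m`, `%d`, and `%H` parse values written without a leading zero, and that `strftime("%H")` returns the zero-padded hour string used as the dictionary key.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Timestamp copied from one of the rows shown earlier
parsed = dt.datetime.strptime("6/17/2016 0:01", date_format)

print(parsed)                 # the full datetime object
print(parsed.strftime("%H"))  # zero-padded hour string
```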
        

In [6]:
# calculate the average number of comments for posts created during each hour of the day.
avg_by_hour = []
for hour in counts_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])

for row in sorted(avg_by_hour):
    print("The average number of comments in hour: {} is {:.2f}".format(row[0], row[1]))


The average number of comments in hour: 00 is 8.13
The average number of comments in hour: 01 is 11.38
The average number of comments in hour: 02 is 23.81
The average number of comments in hour: 03 is 7.80
The average number of comments in hour: 04 is 7.17
The average number of comments in hour: 05 is 10.09
The average number of comments in hour: 06 is 9.02
The average number of comments in hour: 07 is 7.85
The average number of comments in hour: 08 is 10.25
The average number of comments in hour: 09 is 5.58
The average number of comments in hour: 10 is 13.44
The average number of comments in hour: 11 is 11.05
The average number of comments in hour: 12 is 9.41
The average number of comments in hour: 13 is 14.74
The average number of comments in hour: 14 is 13.23
The average number of comments in hour: 15 is 38.59
The average number of comments in hour: 16 is 16.80
The average number of comments in hour: 17 is 11.46
The average number of comments in hour: 18 is 13.20
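The average number of comments in hour: 19 is 10.80
The average number of comments in hour: 20 is 21.52
The average number of comments in hour: 21 is 16.01
The average number of comments in hour: 22 is 6.75
The average number of comments in hour: 23 is 7.99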

To make the results easier to analyze, another list is created that stores the average number of comments first and the corresponding hour second.

In [7]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[8.127272727272727, '00'], [11.383333333333333, '01'], [23.810344827586206, '02'], [11.051724137931034, '11'], [11.46, '17'], [10.8, '19'], [38.5948275862069, '15'], [9.41095890410959, '12'], [9.022727272727273, '06'], [5.5777777777777775, '09'], [21.525, '20'], [13.233644859813085, '14'], [6.746478873239437, '22'], [7.796296296296297, '03'], [16.009174311926607, '21'], [10.25, '08'], [13.440677966101696, '10'], [7.852941176470588, '07'], [14.741176470588234, '13'], [10.08695652173913, '05'], [13.20183486238532, '18'], [7.985294117647059, '23'], [16.796296296296298, '16'], [7.170212765957447, '04']]
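
As an alternative to swapping the columns, `sorted()` accepts a `key` function that selects the value to sort by. The sketch below is not the approach used in this notebook; it uses four `[hour, average]` pairs taken from the output above, and `sorted_by_avg` is an illustrative name.

```python
# A few [hour, average] pairs taken from the output above
avg_by_hour = [['00', 8.127272727272727], ['15', 38.5948275862069],
               ['02', 23.810344827586206], ['20', 21.525]]

# Sort by the average (second element), largest first, without swapping columns
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```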


In [9]:
# Use the sorted() function to sort swap_avg_by_hour in descending order.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    date = dt.datetime.strptime(row[1], "%H")
    time = date.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(time, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Taking into account the results shown above, that my timezone is GMT (Greenwich Mean Time), and that the dataset uses ET (Eastern Time, GMT-5 under standard time), I would create a post at around 20:00 GMT (= 15:00 + 5:00) to increase the chance of receiving a larger number of comments.
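
The hour conversion can also be done with `datetime` objects. The sketch below assumes fixed offsets (UTC-5 for Eastern Standard Time, UTC+0 for GMT) and an arbitrary illustrative date; the names `eastern_standard`, `best_hour_et`, and `best_hour_gmt` are not part of the original notebook.

```python
import datetime as dt

# Assumption: fixed offsets, UTC-5 for Eastern Standard Time and UTC+0 for GMT
eastern_standard = dt.timezone(dt.timedelta(hours=-5))
gmt = dt.timezone.utc

# 15:00 Eastern on an arbitrary illustrative date
best_hour_et = dt.datetime(2016, 1, 15, 15, 0, tzinfo=eastern_standard)
best_hour_gmt = best_hour_et.astimezone(gmt)

print(best_hour_gmt.strftime("%H:%M"))
```

While daylight saving time is in effect, Eastern Time is UTC-4 rather than UTC-5, so the same 15:00 slot corresponds to 19:00 GMT.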

---

# NEXT STEPS

* Determine if show or ask posts receive more points on average.
* Determine if posts created at a certain time are more likely to receive more points.
* Compare these results to the average number of comments and points other posts receive.
* Use Dataquest's data science project style guide to format this project.
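
The first of these steps could reuse the pattern from the comment averages. The following is a minimal sketch, assuming `num_points` sits at index 3 as in the header row; `average_column` is a hypothetical helper, and the two rows are copied from the dataset preview only to make the snippet runnable, so the printed values say nothing about ask or show posts.

```python
def average_column(rows, index):
    """Return the mean of an integer-valued column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(int(row[index]) for row in rows) / len(rows)

# Two rows from the dataset preview, used only as sample input
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(average_column(sample_rows, 3))  # average num_points of the sample rows
print(average_column(sample_rows, 4))  # average num_comments of the sample rows
```

With the lists built earlier, `average_column(ask_posts, 3)` and `average_column(show_posts, 3)` would give the two figures to compare.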