# Exploring Hacker News Posts
In this project, we will compare two types of posts from [Hacker News](https://news.ycombinator.com/), an extremely popular site in technology and startup circles where user-submitted stories (also known as "posts") are voted and commented upon. The posts we're specifically interested in exploring begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question (e.g., "Ask HN: How to improve my personal website?"). Likewise, users submit `Show HN` posts to show the Hacker News community a project, a product, or simply something interesting.

We will compare these two types of posts to determine the following:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

The data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that the data set we will be working with has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

## Introduction
Let's start by reading in the data.

In [1]:
# Read in the data.
import csv

# Open the file in a `with` block so it is closed once the rows are read.
with open('hacker_news.csv', newline='') as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)

# Display the first five rows.
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

## Removing Headers from a List of Lists

In [2]:
# Remove the headers.
hn_header = hn[0]
hn = hn[1:]

# Display the headers and display the first five rows of data.
print(hn_header)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
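As a rough sketch, the read and the header split can also be written together with a `with` block and tuple unpacking. The in-memory `sample_csv` below is a hypothetical stand-in for `hacker_news.csv`, built from two rows shown in the output above, so the snippet runs without the real file.

```python
import csv
import io

# Hypothetical stand-in for hacker_news.csv: the header plus two rows
# copied from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# The `with` block closes the file object once the rows are read.
with sample_csv as f:
    rows = list(csv.reader(f))

# Split the header from the data rows in one step.
header, *data = rows

print(header)
print(len(data))  # 2
```

With the real file, `open('hacker_news.csv', newline='')` would take the place of `sample_csv`.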


## Extracting Ask HN and Show HN Posts
Since we're only concerned with posts beginning with `Ask HN` or `Show HN`, we will separate the data for these posts into different lists of lists, making the data easier to analyze.

In [3]:
# Identify posts beginning with `Ask HN` or `Show HN`.
# Separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
        
# Check the number of posts in each list.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
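The prefix test used above can be tried on a few titles in isolation. This is only a sketch: `classify_post` is a hypothetical helper, not part of the notebook, and the second sample title is invented to show that the match is case-insensitive.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: A hypothetical side project",  # invented title
    "Interactive Dynamic Video",
]

labels = [classify_post(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```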


## Calculating the Average Number of Comments for Ask HN and Show HN Posts
Let's determine if ask posts or show posts receive more comments on average.

In [4]:
# Calculate average number of comments `Ask HN` posts receive.
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
# Calculate average number of comments `Show HN` posts receive.
total_show_comments = 0

for post in show_posts:
    total_show_comments += int(post[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993
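The two cells above run the same loop on different lists. One possible way to avoid the repetition is a small helper. `average_comments` is a hypothetical name, and the sample rows are invented, following the data set's column order (`num_comments` at index 4).

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Invented rows in the same column order as the data set.
sample_posts = [
    ["1", "Ask HN: Example A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Example B", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

print(average_comments(sample_posts))  # 7.0
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it should reproduce the two averages printed above.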


On average, ask posts (about 14 comments per post) receive more comments than show posts (about 10 comments per post). Since ask posts receive more comments on average, we'll focus our remaining analysis on these posts.

## Finding the Number of Ask Posts and Comments by Hour Created
Next, we will determine whether creating a post at a certain time maximizes the number of comments an ask post receives. To perform this analysis, we will first find how many ask posts were created in each hour of the day, along with how many comments those posts received. Then, we will calculate the average number of comments ask posts receive by hour of the day created.

In [10]:
# Count the ask posts created in each hour of the day.
# Sum the comments those posts received.
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    n_comment = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comment

# Check how many comments ask posts created at each hour received.        
comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
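The tallying step can also be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. `tally_by_hour` is a hypothetical helper. The first two sample rows reuse timestamps and comment counts from the data shown earlier, and the third is invented so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(rows):
    """Take (created_at, n_comments) pairs; return (counts, comments) keyed by 'HH'."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in rows:
        # Parse the timestamp, then keep only the zero-padded hour.
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)


sample = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/17/2016 11:05", 3),  # invented row
]

counts, comments = tally_by_hour(sample)
print(counts)    # {'11': 2, '19': 1}
print(comments)  # {'11': 55, '19': 10}
```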

## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [12]:
# Calculate the average number of comments `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append(
        [hour, comments_by_hour[hour]/counts_by_hour[hour]]
    )
    
avg_by_hour

[['20', 21.525],
 ['11', 11.051724137931034],
 ['23', 7.985294117647059],
 ['09', 5.5777777777777775],
 ['06', 9.022727272727273],
 ['12', 9.41095890410959],
 ['19', 10.8],
 ['05', 10.08695652173913],
 ['22', 6.746478873239437],
 ['03', 7.796296296296297],
 ['00', 8.127272727272727],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['21', 16.009174311926607],
 ['14', 13.233644859813085],
 ['08', 10.25],
 ['10', 13.440677966101696],
 ['15', 38.5948275862069],
 ['13', 14.741176470588234],
 ['18', 13.20183486238532],
 ['07', 7.852941176470588],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['04', 7.170212765957447]]

## Sorting and Printing Values from a List of Lists

In [14]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

# Sort list by descending average number of comments received by posts created at each hour.
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[21.525, '20'], [11.051724137931034, '11'], [7.985294117647059, '23'], [5.5777777777777775, '09'], [9.022727272727273, '06'], [9.41095890410959, '12'], [10.8, '19'], [10.08695652173913, '05'], [6.746478873239437, '22'], [7.796296296296297, '03'], [8.127272727272727, '00'], [16.796296296296298, '16'], [11.46, '17'], [16.009174311926607, '21'], [13.233644859813085, '14'], [10.25, '08'], [13.440677966101696, '10'], [38.5948275862069, '15'], [14.741176470588234, '13'], [13.20183486238532, '18'], [7.852941176470588, '07'], [11.383333333333333, '01'], [23.810344827586206, '02'], [7.170212765957447, '04']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
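Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be sorted in place without building a second list. The three rows below are copied from the output above.

```python
# Three [hour, average] rows copied from the output above.
avg_by_hour = [
    ["20", 21.525],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, avg)
```

The hour stays in the first position, so later code can unpack each row as `hour, avg`.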

In [17]:
# Display the 5 hours with the highest average number of comments.
print("Top 5 Hours for Ask Posts Comments")

for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg)
    )

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


According to our results, `Ask HN` posts created at 15:00 receive the most comments on average: 38.59 comments per post. Even among the top 5 hours, this average is about 62% higher than the second-highest result (23.81 average comments per post) and about 141% higher than the fifth-highest result (16.01 average comments per post).
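The relative differences quoted above can be checked directly from the averages in the sorted output. The variable names are illustrative only.

```python
# Averages copied from the sorted output above.
top = 38.5948275862069      # 15:00
second = 23.810344827586206  # 02:00
fifth = 16.009174311926607   # 21:00

# Percentage by which the top hour exceeds the other two.
pct_over_second = (top / second - 1) * 100
pct_over_fifth = (top / fifth - 1) * 100

print(round(pct_over_second), round(pct_over_fifth))  # 62 141
```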

According to this [documentation for the data set](https://www.kaggle.com/hacker-news/hacker-news-posts), the time zone these results are based on is Eastern Time in the US (i.e., 15:00 is 3:00 pm EST).

For Pacific Time in the US (such as in California), the top times for ask post comments, in order, would be 12 pm, 11 pm, 5 pm, 1 pm, and 6 pm.
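The conversion above amounts to subtracting three hours and wrapping around midnight. The sketch below assumes a fixed three-hour offset between Eastern and Pacific Time. `eastern_to_pacific` and `to_12_hour` are hypothetical helpers.

```python
# Top five Eastern Time hours, in order, from the results above.
eastern_hours = ["15", "02", "20", "16", "21"]


def eastern_to_pacific(hour_str, offset_hours=3):
    """Shift an 'HH' string back by the assumed offset, wrapping past midnight."""
    return (int(hour_str) - offset_hours) % 24


def to_12_hour(hour):
    """Format a 0-23 hour as a 12-hour label such as '5pm'."""
    suffix = "am" if hour < 12 else "pm"
    return f"{hour % 12 or 12}{suffix}"


pacific_labels = [to_12_hour(eastern_to_pacific(h)) for h in eastern_hours]
print(pacific_labels)  # ['12pm', '11pm', '5pm', '1pm', '6pm']
```

Note that 02:00 Eastern maps to 11 pm Pacific on the previous day, which the modulo step handles.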

## Conclusion
In this project, we compared ask posts to show posts to determine which kind of post receives more comments on average, and then we determined which hours of the day receive the most comments on average. Based on our findings, we recommend creating an ask post between 15:00 and 16:00 Eastern Time (12:00 pm to 1:00 pm Pacific Time) to receive the most comments on average.

Note that our data set excluded posts without any comments. Our conclusions therefore apply only to posts that received at least one comment: among those posts, ask posts received more comments on average than show posts, and ask posts created between 12:00 pm and 1:00 pm Pacific Time received the most comments on average.