# Exploring Hacker News Posts

In this project, I'll compare two types of posts from [Hacker News](https://news.ycombinator.com/), a popular site where users vote and comment on technology-related stories. The two types are Ask HN and Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or simply something interesting.

I'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

The data set I'll be working with contains approximately 20,000 rows, which should be enough for this analysis.

# Introduction

I will first import the data and remove the header row.

In [1]:
# Read in the data.
import csv

with open('hacker_news.csv') as f:
    hn = list(csv.reader(f))
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
# Separate the header row from the data rows.
header = hn[0]
hn = hn[1:]
print(header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
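The header split above can also be wrapped in a small function. This is only a sketch: `split_header` is a hypothetical helper name, and the in-memory sample text stands in for `hacker_news.csv` so the example runs without the file.

```python
import csv
import io


def split_header(file_obj):
    """Return (header, rows) from an open CSV file object."""
    all_rows = list(csv.reader(file_obj))
    return all_rows[0], all_rows[1:]


# Stand-in for the real file: the header plus the first data row shown above.
# With the real data this would be: with open('hacker_news.csv') as f: ...
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

header, rows = split_header(sample)
print(header)
print(len(rows))
```

Returning the header and the rows together keeps the two from drifting apart if the cell is re-run.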


# Extracting Ask HN and Show HN Posts

I will first identify the posts that are either Ask HN or Show HN and separate the data into different lists.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
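The prefix test used in the loop above can be isolated in a function, which makes the case-insensitive matching easy to check on its own. This is a sketch: `classify_title` is a hypothetical helper, and the example titles are made up for illustration rather than taken from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles for illustration only; they are not rows from the data set.
examples = [
    "Ask HN: Example question?",
    "SHOW HN: Example project",
    "Tell HN: I will ask hn later",  # contains "ask hn" but does not start with it
]

labels = [classify_title(title) for title in examples]
print(labels)
```

Only the start of the title is checked, so a title that merely mentions "ask hn" later on is counted as an other post.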


# Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that I have separated ask posts and show posts into different lists, I'll calculate the average number of comments each type of post receives.

In [4]:
# Average Ask HN comments
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
# Average Show HN comments
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


Ask posts receive approximately 14 comments on average, while show posts receive approximately 10. Since ask posts receive more comments on average, I'll focus the remaining analysis on just these posts.
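The two averaging cells above repeat the same loop, so the calculation could be shared. Here is a minimal sketch, assuming the comment count sits at index 4 as in this data set. `average_comments` is a hypothetical helper, and the sample rows are placeholders, not real posts.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows."""
    if not posts:
        # Assumption: report 0.0 for an empty list instead of raising an error.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Placeholder rows with the same column layout as the data set.
sample_posts = [
    ['1', 'Ask HN: placeholder A', '', '0', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: placeholder B', '', '0', '10', 'user_b', '1/26/2016 19:30'],
    ['3', 'Ask HN: placeholder C', '', '0', '1', 'user_c', '6/23/2016 22:20'],
]

print(average_comments(sample_posts))
```

With the real lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.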

# Finding the Number of Ask Posts and Comments by Hour Created

Next, I'll determine whether I can maximize the number of comments an ask post receives by creating it at a certain time.

I will first count the ask posts created during each hour of the day and total the comments those posts received. I'll then calculate the average number of comments that ask posts created at each hour receive.

In [6]:
import datetime as dt

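# Pair each post's creation time with its comment count.
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}
# Named date_format so it is not confused with the datetime module.
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comments = row[1]
    # Extract the hour as a zero-padded string such as '09'.
    hour = dt.datetime.strptime(date, date_format).strftime("%H")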
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments

comments_by_hour

{'00': 447,
 '01': 683,
 '02': 1381,
 '03': 421,
 '04': 337,
 '05': 464,
 '06': 397,
 '07': 267,
 '08': 492,
 '09': 251,
 '10': 793,
 '11': 641,
 '12': 687,
 '13': 1253,
 '14': 1416,
 '15': 4477,
 '16': 1814,
 '17': 1146,
 '18': 1439,
 '19': 1188,
 '20': 1722,
 '21': 1745,
 '22': 479,
 '23': 543}
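The tally above can be written without the `if`/`else` branch by using `collections.defaultdict`. This is a sketch of that alternative, not the approach used in the notebook. `tally_by_hour` is a hypothetical helper, and the sample pairs reuse timestamps from the rows shown earlier plus one made-up entry.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(pairs):
    """Return (counts, comments) dictionaries keyed by zero-padded hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


sample_pairs = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/17/2016 0:01", 1),
    ("8/5/2016 11:05", 8),  # made-up entry to show two posts in the same hour
]

counts, comments = tally_by_hour(sample_pairs)
print(counts)
print(comments)
```

A missing key in a `defaultdict(int)` starts at 0, so the first post for an hour needs no special case.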

# Calculating the Average Number of Comments for Ask HN Posts by Hour

In [7]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    
avg_by_hour

[['03', 7.796296296296297],
 ['13', 14.741176470588234],
 ['20', 21.525],
 ['17', 11.46],
 ['12', 9.41095890410959],
 ['04', 7.170212765957447],
 ['01', 11.383333333333333],
 ['09', 5.5777777777777775],
 ['11', 11.051724137931034],
 ['08', 10.25],
 ['00', 8.127272727272727],
 ['23', 7.985294117647059],
 ['16', 16.796296296296298],
 ['05', 10.08695652173913],
 ['10', 13.440677966101696],
 ['21', 16.009174311926607],
 ['18', 13.20183486238532],
 ['15', 38.5948275862069],
 ['14', 13.233644859813085],
 ['07', 7.852941176470588],
 ['06', 9.022727272727273],
 ['22', 6.746478873239437],
 ['19', 10.8],
 ['02', 23.810344827586206]]

# Sorting and Printing Values from a List of Lists

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('\n')
print(sorted_swap)

[[7.796296296296297, '03'], [14.741176470588234, '13'], [21.525, '20'], [11.46, '17'], [9.41095890410959, '12'], [7.170212765957447, '04'], [11.383333333333333, '01'], [5.5777777777777775, '09'], [11.051724137931034, '11'], [10.25, '08'], [8.127272727272727, '00'], [7.985294117647059, '23'], [16.796296296296298, '16'], [10.08695652173913, '05'], [13.440677966101696, '10'], [16.009174311926607, '21'], [13.20183486238532, '18'], [38.5948275862069, '15'], [13.233644859813085, '14'], [7.852941176470588, '07'], [9.022727272727273, '06'], [6.746478873239437, '22'], [10.8, '19'], [23.810344827586206, '02']]


[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'],
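The swap step is needed only because `sorted` compares the first element of each inner list by default. As a sketch of an alternative, the `key` argument can sort the original `[hour, average]` pairs directly. The sample below copies four of the averages computed above.

```python
# Four [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['03', 7.796296296296297],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (index 1), highest first, without building a swapped list.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

The result holds the same pairs in the same order as the swapped version, but each pair keeps the hour first.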

In [9]:
# Print the 5 hours with the highest average number of comments.

print("Top 5 Hours for Ask Posts Comments: \n")
for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg)
    )

Top 5 Hours for Ask Posts Comments: 

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


In [13]:
# Show the time zone as ET and convert from 24-hour to 12-hour time.

print("Top 5 Hours for Ask Posts Comments: \n")
for avg, hour in sorted_swap[:5]:
    if int(hour) < 12:
        print(
            "{} a.m. ET: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%I"), avg)
        )
    else:
        print(
            "{} p.m. ET: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%I"), avg)
        )

Top 5 Hours for Ask Posts Comments: 

03 p.m. ET: 38.59 average comments per post
02 a.m. ET: 23.81 average comments per post
08 p.m. ET: 21.52 average comments per post
04 p.m. ET: 16.80 average comments per post
09 p.m. ET: 16.01 average comments per post
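The two branches above differ only in the suffix, so the conversion could be pulled into one function. This sketch also drops the leading zero. `to_12_hour` is a hypothetical helper and does the arithmetic directly instead of calling `strftime`.

```python
def to_12_hour(hour):
    """Convert a 24-hour string such as '15' to a label such as '3 p.m.'."""
    hour_number = int(hour)
    suffix = "a.m." if hour_number < 12 else "p.m."
    # Hours 0 and 12 both display as 12 on a 12-hour clock.
    display_hour = hour_number % 12 or 12
    return "{} {}".format(display_hour, suffix)


for hour in ["15", "02", "00", "12"]:
    print("{} ET".format(to_12_hour(hour)))
```

Midnight and noon are the edge cases worth checking: `'00'` becomes 12 a.m. and `'12'` becomes 12 p.m.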


The hour that receives the most comments per post on average is 3 p.m. ET, with an average of 38.59 comments per post. That is about 62% more than the second-highest hour, 2 a.m. ET, which averages 23.81 comments per post.
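The relative gap between the top two hours can be checked directly from the averages printed above. This is a small arithmetic sketch, and the variable names are my own.

```python
# Averages copied from the sorted output above.
top_avg = 38.5948275862069        # 15:00 (3 p.m. ET)
second_avg = 23.810344827586206   # 02:00 (2 a.m. ET)

# Percentage increase from the second-highest hour to the highest hour.
pct_increase = (top_avg - second_avg) / second_avg * 100
print("{:.1f}%".format(pct_increase))
```

The increase is measured relative to the second-highest hour. Dividing by `top_avg` instead would give the smaller figure for how far 2 a.m. falls below 3 p.m.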

# Conclusion

In this project, I analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on this analysis, I'd recommend categorizing a post as an ask post and creating it between 3:00 p.m. and 4:00 p.m. ET.