# Exploring Hacker News Posts
### We would like to compare Ask HN and Show HN posts to answer the following questions:<br>Do Ask HN or Show HN posts receive more comments on average?<br>Do posts created at a certain time receive more comments on average?

In [1]:
# Read the data set and store it as a list of lists
from csv import reader

# Use a context manager so the file is closed once it has been read
with open("hacker_news.csv") as open_file:
    hn = list(reader(open_file))

# Display the first five rows to get an idea of the format of the data set
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:
# Separate the data set into the header and data parts
headers = hn[0]
hn = hn[1:]
print("The header is ", headers, "\n")
print("The first five rows of the data are ", hn[:5])

The header is  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

The first five rows of the data are  [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', 

In [3]:
# Categorize all the posts into three types: ask, show, and others
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("The number of posts in 'ask hn' is ", len(ask_posts), "\n")
print("The number of posts in 'show hn' is ", len(show_posts), "\n")
print("The number of posts in 'others' is ", len(other_posts), "\n")

# Check the first few rows of each list
print(ask_posts[:5], "\n")
print(show_posts[:5])

The number of posts in 'ask hn' is  1744 

The number of posts in 'show hn' is  1162 

The number of posts in 'others' is  17194 

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']] 

[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson'

In [4]:
# The next step is to find the average comments received for the ask posts and show posts
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print("The average number of comments received for ask posts is ", avg_ask_comments, "\n")
print("The average number of comments received for show posts is ", avg_show_comments, "\n")

The average number of comments received for ask posts is  14.038417431192661 

The average number of comments received for show posts is  10.31669535283993 



Ask posts receive about 14 comments on average, while show posts receive about 10, so ask posts receive roughly 36% more comments than show posts. However, the average number of comments is limited for both types of posts.
Since ask posts receive more comments on average, we will focus the rest of the analysis on ask posts.
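
The relative gap between the two averages can be checked directly from the values printed above. The snippet below is a minimal sketch; the helper name `percent_more` is hypothetical and not part of the original analysis.

```python
def percent_more(larger, smaller):
    """Return how much bigger `larger` is than `smaller`, as a percentage."""
    return (larger - smaller) / smaller * 100

# Averages copied from the output of the previous cell
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

gap = percent_more(avg_ask_comments, avg_show_comments)
print("Ask posts receive {:.1f}% more comments than show posts".format(gap))
```

Expressing the gap relative to the show-post average makes it clear which group is the baseline for the comparison.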

In [5]:
# These ask posts will be separated by the time period they were posted and the analysis will be based on this
import datetime as dt
result_list = []

# Store pairs of the creation time and the number of comments for each post in a list
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

# Count the posts created in each hour and sum the comments they received
for created_at, num_comments in result_list:
    # Parse the timestamp and keep only the two-digit hour, e.g. "15"
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

In [6]:
# Calculate the average number of comments for posts created during each hour
avg_by_hour = []

# Iterate over the hours to calculate the average numbers of comments
for key in counts_by_hour:
    avg_num_comments = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg_num_comments])

print(avg_by_hour)

[['01', 11.383333333333333], ['06', 9.022727272727273], ['07', 7.852941176470588], ['03', 7.796296296296297], ['21', 16.009174311926607], ['18', 13.20183486238532], ['23', 7.985294117647059], ['09', 5.5777777777777775], ['12', 9.41095890410959], ['14', 13.233644859813085], ['05', 10.08695652173913], ['20', 21.525], ['19', 10.8], ['04', 7.170212765957447], ['02', 23.810344827586206], ['13', 14.741176470588234], ['15', 38.5948275862069], ['10', 13.440677966101696], ['00', 8.127272727272727], ['22', 6.746478873239437], ['08', 10.25], ['16', 16.796296296296298], ['17', 11.46], ['11', 11.051724137931034]]


In [7]:
# Reformat the result so that it is easier to read
swap_avg_by_hour = []

# Put the average first in each inner list so that sorting uses it as the key
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

# Sort by the average number of comments in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments")

# Print the top 5 hours for ask posts comments with a specific format
for row in sorted_swap[:5]:
    output_time_dt = dt.datetime.strptime(row[1], "%H")
    output_time_str = output_time_dt.strftime("%H:%M")
    output = "{}: {} average comments per post".format(output_time_str, row[0])
    print(output)

[[11.383333333333333, '01'], [9.022727272727273, '06'], [7.852941176470588, '07'], [7.796296296296297, '03'], [16.009174311926607, '21'], [13.20183486238532, '18'], [7.985294117647059, '23'], [5.5777777777777775, '09'], [9.41095890410959, '12'], [13.233644859813085, '14'], [10.08695652173913, '05'], [21.525, '20'], [10.8, '19'], [7.170212765957447, '04'], [23.810344827586206, '02'], [14.741176470588234, '13'], [38.5948275862069, '15'], [13.440677966101696, '10'], [8.127272727272727, '00'], [6.746478873239437, '22'], [10.25, '08'], [16.796296296296298, '16'], [11.46, '17'], [11.051724137931034, '11']]
Top 5 Hours for Ask Posts Comments
15:00: 38.5948275862069 average comments per post
02:00: 23.810344827586206 average comments per post
20:00: 21.525 average comments per post
16:00: 16.796296296296298 average comments per post
21:00: 16.009174311926607 average comments per post


Posts created at 15:00 receive about 38.6 comments on average, which is much higher than for any other hour. The runners-up are 02:00 and 20:00, each with an average of more than 20 comments per post.
The timestamps in the data set are in US Eastern Time, which is also the time zone I live in, so the best time of day for me to create an ask post and receive more comments is 15:00.
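
For readers outside Eastern Time, the top hours can be shifted to another zone. The snippet below is a rough sketch under stated assumptions: it uses fixed UTC offsets (Eastern Daylight Time as UTC-4) and an arbitrary summer date taken from the first data row, so it ignores daylight-saving transitions. The helper name `convert_eastern_hour` is hypothetical.

```python
import datetime as dt

# Assumption: Eastern Daylight Time (UTC-4), valid for summer dates only
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4), "EDT")

def convert_eastern_hour(hour, target_offset_hours):
    """Return the "HH:MM" time at the given UTC offset for an Eastern hour."""
    # Assumption: an arbitrary summer date; only the hour matters here
    eastern_time = dt.datetime(2016, 8, 4, hour, 0, tzinfo=EASTERN_DAYLIGHT)
    target_zone = dt.timezone(dt.timedelta(hours=target_offset_hours))
    return eastern_time.astimezone(target_zone).strftime("%H:%M")

# Top three hours from the analysis, shown in UTC and in UTC-7
for hour in (15, 2, 20):
    in_utc = convert_eastern_hour(hour, 0)
    in_utc_minus_7 = convert_eastern_hour(hour, -7)
    print("{:02d}:00 Eastern -> {} UTC, {} UTC-7".format(hour, in_utc, in_utc_minus_7))
```

A conversion that must stay correct across the whole year would need real time-zone rules rather than fixed offsets, for example through the standard-library `zoneinfo` module.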