# Exploring Hacker News Posts
## Introduction

In this project we will be analysing a dataset of submissions to the popular technology site Hacker News (https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. For example:

 - Ask HN: How to improve my personal website?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. For example:

 - Show HN: Something pointless I made

We will compare these two types of posts to determine the following:

 - Do Ask HN or Show HN posts receive more comments on average?
 - Do posts created at a certain time receive more comments on average?

We will begin by importing the libraries we need and reading in the dataset.

In [2]:
import csv
# open the file in a with block so that it is closed once the rows are read
with open("hacker_news.csv", encoding="utf8") as opened_file:
    read_file = csv.reader(opened_file)
    hn = list(read_file)
# header only
header = hn[0]
# data
hn = hn[1:]
# peek at the header and data
print(header)
print("\n")
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
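
The load-and-split step can also be packaged as a small reusable helper. The sketch below is one possible way to do that, not part of the original notebook: `load_rows`, `sample_csv`, `sample_header` and `sample_rows` are hypothetical names, and the helper reads from an in-memory sample (two rows taken from the outputs in this notebook) so that it runs without `hacker_news.csv`.

```python
import csv
import io


def load_rows(file_obj):
    """Return (header, data_rows) from an open CSV file-like object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical in-memory sample using the same column layout as hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
    "12578335,Show HN: Finding puns computationally,http://puns.samueltaylor.org/,2,0,saamm,9/26/2016 0:36\n"
)

# The with block closes the file-like object as soon as the rows are read.
with io.StringIO(sample_csv) as f:
    sample_header, sample_rows = load_rows(f)

print(sample_header)
print(len(sample_rows))
```

With the real file, the same helper could be called inside `with open("hacker_news.csv", encoding="utf8") as f:`.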


We are only concerned with post titles that begin with either Ask HN or Show HN, so we will create new lists of lists containing only those.
To find the posts that begin with either Ask HN or Show HN, we'll use the string method `startswith`. Given a string, say `string1`, we can check whether it starts with, say, `dq` by calling `string1.startswith('dq')`, which returns `True` if `string1` starts with `dq` and `False` otherwise.
Note that this check is case sensitive. To ignore case, we can first call the `lower` method, which returns a lowercase copy of the string, and then compare the result against a lowercase prefix.
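
As a quick illustration of the case sensitivity described above, here is a minimal sketch. The variable `example_title` is a hypothetical, upper-cased variant of the Ask HN example from the introduction.

```python
# Hypothetical title: an upper-cased variant of the introduction's example.
example_title = "ASK HN: How to improve my personal website?"

# startswith compares characters exactly, so a case mismatch returns False.
print(example_title.startswith("Ask HN"))          # False

# Lowercasing the title first makes the check case insensitive.
print(example_title.lower().startswith("ask hn"))  # True
```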

**Let's use these methods to separate posts beginning with Ask HN and Show HN (and case variations) into two different lists next.**

In [3]:
# create empty lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# peek at ask_posts
print(ask_posts[:5])
print("Number of ask posts is: {posts:,}".format(posts = len(ask_posts)))
print("\n")
# peek at show_posts
print(show_posts[:5])
print("Number of show posts is: {posts:,}".format(posts = len(show_posts)))
print("\n")
# peek at other_posts
print(other_posts[:5])
print("Number of other posts is: {posts:,}".format(posts = len(other_posts)))

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
Number of ask posts is: 9,139


[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.ht

Now let's determine if ask posts or show posts receive more comments on average.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

print("Total number of comments in the ask_posts set = {comments:,}".format(comments = total_ask_comments)) # total number of comments in the ask_posts set

# to determine the average, divide by length of ask_posts (number of posts with 'ask hn')
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Average number of comments for Ask HN = {avg_posts:.2f}".format(avg_posts = avg_ask_comments))

Total number of comments in the ask_posts set = 94,986
Average number of comments for Ask HN = 10.39


Now we will do the same for show posts.

In [5]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

print("Total number of comments in the show_posts set = {comments:,}".format(comments = total_show_comments)) # total number of comments in the show_posts set

# to determine the average, divide by length of show_posts (number of posts with 'show hn')
avg_show_comments = total_show_comments / len(show_posts)

print("Average number of comments for Show HN = {avg_posts:.2f}".format(avg_posts = avg_show_comments))

Total number of comments in the show_posts set = 49,633
Average number of comments for Show HN = 4.89


The results above indicate that, in this dataset, ask posts received almost twice as many comments in total as show posts (94,986 versus 49,633) and, on average, more than twice as many comments per post (10.39 versus 4.89). This is interesting given that the dataset (which covers a period between 2015 and 2016) contains about 1,000 more show posts than ask posts.

Since ask posts receive more comments than show posts, the remaining analysis focuses on ask posts alone.
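
The comparison can be checked with a little arithmetic on the printed figures. This is only a sketch: the variable names are hypothetical, and because the averages are rounded to two decimals, the implied number of show posts is approximate.

```python
# Totals and averages copied from the printed outputs above.
total_ask, avg_ask, n_ask = 94_986, 10.39, 9_139
total_show, avg_show = 49_633, 4.89

total_ratio = total_ask / total_show  # ratio of total comments
avg_ratio = avg_ask / avg_show        # ratio of average comments per post

# Approximate number of show posts, recovered from total / rounded average.
implied_show_posts = round(total_show / avg_show)

print(f"Total comments ratio (ask/show): {total_ratio:.2f}")    # 1.91
print(f"Average comments ratio (ask/show): {avg_ratio:.2f}")    # 2.12
print(f"Implied extra show posts: {implied_show_posts - n_ask:,}")
```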

Next, we will determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

 - Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
 - Calculate the average number of comments ask posts receive by hour created.

First, we'll tackle the first step: calculating the number of ask posts and comments by hour created. We'll use the datetime module to work with the data in the created_at column.
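
Before looping over every post, it may help to see the parsing step on a single value. The sketch below uses the `created_at` string of the first ask post printed earlier; `sample_created_at` and `parsed` are hypothetical names used only for this illustration.

```python
import datetime as dt

# created_at value of the first ask post shown earlier.
sample_created_at = "9/26/2016 2:53"

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time.
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

print(parsed)                 # 2016-09-26 02:53:00
print(parsed.strftime("%H"))  # 02 (zero-padded string)
print(parsed.hour)            # 2 (integer)
```

Using the zero-padded string from `strftime("%H")` as a dictionary key keeps the hours in a uniform two-character format.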

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments]) # append both elements to the result_list, creating a list of lists

print("Extract from the result list showing the date/time of the post and the number of comments:")
print(result_list[:5])
print("\n")

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print("Our dictionary of counts by hour (number of ask hn posts per hour):")
print(counts_by_hour)
print("\n")
print("Our dictionary of comments by hour (number of comments posted per hour for ask hn posts):")
print(comments_by_hour)
print("\n")



Extract from the result list showing the date/time of the post and the number of comments:
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2]]


Our dictionary of counts by hour (number of ask hn posts per hour):
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


Our dictionary of comments by hour (number of comments posted per hour for ask hn posts):
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
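
The `if hour not in ...` bookkeeping used above can also be written with `collections.defaultdict`, which supplies a starting value of 0 for any new key. This is an alternative sketch rather than the approach taken in this notebook; it runs on the first three pairs of the extract printed above, and the `sample_` names are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# First three [created_at, num_comments] pairs from the extract shown above.
sample_results = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 22:57", 0],
]

# defaultdict(int) returns 0 for missing keys, so no membership check is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'02': 1, '01': 1, '22': 1}
print(dict(sample_comments))  # {'02': 7, '01': 3, '22': 0}
```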




Now that we have two dictionaries, one with the number of posts per hour and one with the number of comments per hour, we can calculate the average number of comments for posts created during each hour of the day. This identifies the hour in which an Ask HN post is likely to generate the most comments.

The following is a very basic example of the logic we will follow to do this:

In [7]:
# let's use the following dictionary:
sample_dict = {
                'apple': 2, 
                'banana': 4, 
                'orange': 6
               }
# Suppose we wanted to multiply each of the values by ten and return the results as a list of lists. We can use the following code:

fruits = []

for fruit in sample_dict:
    fruits.append([fruit, 10 * sample_dict[fruit]])

# which gives:
print(fruits)

[['apple', 20], ['banana', 40], ['orange', 60]]


In the example above, we:

 - Initialized an empty list (of lists) and assigned it to fruits.
 - Iterated over the keys of sample_dict and appended to fruits a list whose:
    - First element is the key from sample_dict.
    - Second element is the value corresponding to that key multiplied by ten.
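
The same steps can be condensed into a list comprehension. This is an equivalent sketch for readers who prefer the shorter form; `fruits_times_ten` is a hypothetical name, and the notebook itself continues with the explicit loop.

```python
sample_dict = {"apple": 2, "banana": 4, "orange": 6}

# .items() yields (key, value) pairs, so no second dictionary lookup is needed.
fruits_times_ten = [[fruit, 10 * value] for fruit, value in sample_dict.items()]

print(fruits_times_ten)  # [['apple', 20], ['banana', 40], ['orange', 60]]
```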

**Let's use this format to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.**

In [8]:
# result will be a list of lists in which the first element is the hour and the second element is the average number of comments per post.

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 2)]) # extract total number of comments by hour and divide by extracted total number of posts per hour

print("Following list of lists displays the hour of the day, followed by the average number of comments an ask hn post gets in that hour")
print("\n")
print(avg_by_hour)

Following list of lists displays the hour of the day, followed by the average number of comments an ask hn post gets in that hour


[['02', 11.14], ['01', 7.41], ['22', 8.8], ['21', 8.69], ['19', 7.16], ['17', 9.45], ['15', 28.68], ['14', 9.69], ['13', 16.32], ['11', 8.96], ['10', 10.68], ['09', 6.65], ['07', 7.01], ['03', 7.95], ['23', 6.7], ['20', 8.75], ['16', 7.71], ['08', 9.19], ['00', 7.56], ['18', 7.94], ['12', 12.38], ['04', 9.71], ['06', 6.78], ['05', 8.79]]


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [9]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print("list of hour and average number of comments with swapped columns:")
print("\n")
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse= True)
print("\n")
print("Sorted (desc):")
print(sorted_swap)
print("\n")

print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour.strftime("%H")
    average = row[0]
    template = "{hour}:00: {average:.2f} average comments per post".format(hour = hour, average = average)
    print(template)

list of hour and average number of comments with swapped columns:


[[11.14, '02'], [7.41, '01'], [8.8, '22'], [8.69, '21'], [7.16, '19'], [9.45, '17'], [28.68, '15'], [9.69, '14'], [16.32, '13'], [8.96, '11'], [10.68, '10'], [6.65, '09'], [7.01, '07'], [7.95, '03'], [6.7, '23'], [8.75, '20'], [7.71, '16'], [9.19, '08'], [7.56, '00'], [7.94, '18'], [12.38, '12'], [9.71, '04'], [6.78, '06'], [8.79, '05']]


Sorted (desc):
[[28.68, '15'], [16.32, '13'], [12.38, '12'], [11.14, '02'], [10.68, '10'], [9.71, '04'], [9.69, '14'], [9.45, '17'], [9.19, '08'], [8.96, '11'], [8.8, '22'], [8.79, '05'], [8.75, '20'], [8.69, '21'], [7.95, '03'], [7.94, '18'], [7.71, '16'], [7.56, '00'], [7.41, '01'], [7.16, '19'], [7.01, '07'], [6.78, '06'], [6.7, '23'], [6.65, '09']]


Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
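
Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. The sketch below shows that alternative on a handful of pairs copied from `avg_by_hour`; `sample_avg_by_hour` and `top_three` are hypothetical names.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
sample_avg_by_hour = [
    ["02", 11.14],
    ["15", 28.68],
    ["13", 16.32],
    ["09", 6.65],
    ["12", 12.38],
]

# key picks the value to sort on (the average); reverse=True puts the largest first.
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_three:
    print(f"{hour}:00: {average:.2f} average comments per post")
```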


## Conclusion

**On average, Ask HN posts created during the 15:00 (3pm) hour receive the most comments, at 28.68 comments per post.** Therefore, if we were to recommend a time to post an Ask HN with the aim of receiving the maximum number of comments, it would be then!