### Exploring Hacker News Posts


Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit.

We'll work with a subsample of these posts, stored in hacker_news.csv.

Below are descriptions of the columns:

* id: The unique identifier from Hacker News for the post
* title: The title of the post
* url: The URL that the post links to, if the post has a URL
* num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* num_comments: The number of comments that were made on the post
* author: The username of the person who submitted the post
* created_at: The date and time at which the post was submitted

We're specifically interested in posts whose titles begin with Ask HN or Show HN. We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [1]:
from csv import reader

with open("datasets/hacker_news.csv") as file:
    hn = list(reader(file))
    
for row in hn[:5]:
    print(row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']




The first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers.

In [3]:
headers = hn[0]

hn = hn[1:]  # drop the header row

In [4]:
print(headers)
print("\n")

for row in hn[:5]:
    print(row)
    print("\n")

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




We're only concerned with post titles beginning with Ask HN or Show HN, so we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method **startswith**.

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts)) 

1744
1162
17194
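
As a quick check on the matching logic, here is a minimal sketch of the same case-insensitive prefix test wrapped in a helper. The function name `classify_title` and the first two sample titles are hypothetical and used only for illustration; the third title comes from the data shown earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase first so the match ignores case
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# The first two titles are made up; the third appears in the data set.
sample_titles = [
    "Ask HN: How do you learn?",
    "SHOW HN: My project",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Lowercasing before calling `startswith` is what lets a title such as "SHOW HN: ..." land in the show group.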


Next, let's determine whether ask posts or show posts receive more comments on average.

In [6]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments/len(ask_posts)

print(avg_ask_comments)

14.038417431192661


Next, we do the same for show posts: we total their comments and compute the average.

In [7]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments/len(show_posts)

print(avg_show_comments)

10.31669535283993


We can see that posts that ask a question receive more comments on average: about 14 per post versus about 10 for show posts, or roughly 36% more.
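
The relative difference can be checked directly from the two averages printed above. This is a small sketch; the variable names `avg_ask`, `avg_show`, and `pct_more` are introduced here for illustration only.

```python
# Averages copied from the outputs of the two cells above
avg_ask = 14.038417431192661
avg_show = 10.31669535283993

# Relative difference, expressed as a percentage of the show-post average
pct_more = (avg_ask - avg_show) / avg_show * 100
print(f"Ask posts receive {pct_more:.1f}% more comments on average")
```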

Next, we'll determine whether *ask posts* created at a certain time are more likely to attract comments.

In [9]:
import datetime as dt

result_list = []

for row in ask_posts:
    date = dt.datetime.strptime(row[-1], "%m/%d/%Y %H:%M")
    result_list.append([date, int(row[4])])

In [10]:
result_list[0:3]

[[datetime.datetime(2016, 8, 16, 9, 55), 6],
 [datetime.datetime(2015, 11, 22, 13, 43), 29],
 [datetime.datetime(2016, 5, 2, 10, 14), 1]]
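
To make the format string concrete, the following sketch parses a single `created_at` value taken from the first data row shown earlier and reads off the hour. The variable names `raw` and `parsed` are illustrative.

```python
import datetime as dt

raw = "8/4/2016 11:52"  # created_at value from the first data row

# %m/%d/%Y matches month/day/four-digit year; %H:%M matches 24-hour time
parsed = dt.datetime.strptime(raw, "%m/%d/%Y %H:%M")

print(parsed)       # 2016-08-04 11:52:00
print(parsed.hour)  # 11
```

The `hour` attribute is what the next step groups on.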

In [11]:
counts_by_hour = {} # contains the number of ask posts created during each hour of the day.
comments_by_hour = {} # contains the corresponding number of comments ask posts created at each hour received

for row in result_list:
    hour = row[0].hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
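
The same tallying pattern can also be written with `collections.defaultdict`, which removes the need for the `if hour not in ...` check. This is an alternative sketch, not the approach used in this notebook. The `sample` list mirrors the shape of `result_list`; its last row is made up so that one hour appears twice.

```python
from collections import defaultdict
import datetime as dt

# [created_at, num_comments] pairs; the last row is hypothetical
sample = [
    [dt.datetime(2016, 8, 16, 9, 55), 6],
    [dt.datetime(2015, 11, 22, 13, 43), 29],
    [dt.datetime(2016, 5, 2, 10, 14), 1],
    [dt.datetime(2016, 1, 1, 9, 5), 4],
]

counts = defaultdict(int)    # missing keys start at 0
comments = defaultdict(int)

for created, n_comments in sample:
    counts[created.hour] += 1
    comments[created.hour] += n_comments

print(dict(counts))    # {9: 2, 13: 1, 10: 1}
print(dict(comments))  # {9: 10, 13: 29, 10: 1}
```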

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [12]:
counts_by_hour

{9: 45,
 13: 85,
 10: 59,
 14: 107,
 16: 108,
 23: 68,
 12: 73,
 17: 100,
 15: 116,
 21: 109,
 20: 80,
 2: 58,
 18: 109,
 3: 54,
 5: 46,
 19: 110,
 1: 60,
 22: 71,
 8: 48,
 4: 47,
 0: 55,
 6: 44,
 7: 34,
 11: 58}

In [13]:
comments_by_hour

{9: 251,
 13: 1253,
 10: 793,
 14: 1416,
 16: 1814,
 23: 543,
 12: 687,
 17: 1146,
 15: 4477,
 21: 1745,
 20: 1722,
 2: 1381,
 18: 1439,
 3: 421,
 5: 464,
 19: 1188,
 1: 683,
 22: 479,
 8: 492,
 4: 337,
 0: 447,
 6: 397,
 7: 267,
 11: 641}

In [14]:
avg_by_hour = []
for key in comments_by_hour:
    avg = comments_by_hour[key]/counts_by_hour[key]
    avg_by_hour.append([key, avg])

In [15]:
avg_by_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. We will sort the list of lists and print the five highest values in a format that's easier to read.

In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [17]:
swap_avg_by_hour

[[5.5777777777777775, 9],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [16.796296296296298, 16],
 [7.985294117647059, 23],
 [9.41095890410959, 12],
 [11.46, 17],
 [38.5948275862069, 15],
 [16.009174311926607, 21],
 [21.525, 20],
 [23.810344827586206, 2],
 [13.20183486238532, 18],
 [7.796296296296297, 3],
 [10.08695652173913, 5],
 [10.8, 19],
 [11.383333333333333, 1],
 [6.746478873239437, 22],
 [10.25, 8],
 [7.170212765957447, 4],
 [8.127272727272727, 0],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [11.051724137931034, 11]]

We use the sorted() function to sort swap_avg_by_hour in descending order.

In [18]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [19]:
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    date = dt.datetime.strptime(str(row[1]), "%H")
    hour = date.strftime("%H:%M")
    print(f"At {hour} there are {row[0]:.0f} comments per post")


Top 5 Hours for Ask Posts Comments
At 15:00 there are 39 comments per post
At 02:00 there are 24 comments per post
At 20:00 there are 22 comments per post
At 16:00 there are 17 comments per post
At 21:00 there are 16 comments per post
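
The swap step above exists only so that `sorted()` orders the rows by average rather than by hour. As an alternative sketch, the `key` argument of `sorted()` can select the sort column directly, so no swapped copy is needed. The `pairs` list below holds a few `[hour, average]` rows copied from `avg_by_hour`; the names `pairs` and `top` are illustrative.

```python
# A few [hour, average] rows copied from avg_by_hour
pairs = [
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [9, 5.5777777777777775],
    [20, 21.525],
]

# Sort by the second element (the average), highest first
top = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in top:
    print(f"At {hour:02d}:00 there are {avg:.0f} comments per post")
```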


Based on our findings, the best time to create an **Ask Post** is during the 15:00 hour. On average, ask posts created during this hour receive about 39 comments.