# Exploring Hacker News Posts

In this project, we want to look at Hacker News, a website where users share, ask questions, and comment on tech related stories.
Users can ask questions, which is denoted by Ask HN, or show stories, denoted by Show HN. The goal of the project is to  determine which of the two types of posts to the website recieve more comments on average, and how the time of day affects the number of comments on average.

The data set that will be used consists of approximately 20 000 rows. The headings for the data set are as follows:

id - The unique identifier from Hacker News for the post
title - The title of the post
url - The URL that the posts links to, if it the post has a URL
num_points - The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments - The number of comments that were made on the post
author - The username of the person who submitted the post
created_at - The date and time at which the post was submitted

The first few rows of the data set appear as such:

|id|title|url|num_points|num_comments|author|created_at|
|--|-----|---|----------|------------|------|----------|
|12224879|	Interactive Dynamic Video|	http://www.interactivedynamicvideo.com/|	386|	52|	ne0phyte|	8/4/2016 11:52|
|10975351|	How to Use Open Source and Shut the Fuck Up at the Same Time|	http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/|	39|	10	|josep2|	1/26/2016 19:30|
|11964716|	Florida DJs May Face Felony for April Fools' Water Joke|	http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/|	2|	1	|vezycash|	6/23/2016 22:20|
|11919867|	Technology ventures: From Idea to Enterprise	|https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429|	3	|1	|hswarna	|6/17/2016 0:01|
|10301696	|Note by Note: The Making of Steinway L1037 (2007)	|http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0	|8	|2	|walterbell|	9/30/2015 4:12|




# Introduction

First, we will read in the dataset as a list of lists (1st cell), and remove the header row (2nd cell)

In [2]:
from csv import reader
opened_file  = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:6])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


In [3]:
headers = hn[0]
hn.remove(hn[0])
print(headers)
print(hn[:6])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Ti

# Extract Ask HN and Show HN Posts

Now, we will look at posts that contain Ask HN or Show HN in the title, and create a list for each type in order to better analyze the data.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))


1744
1162
17194


# Calculate Average Number of Comments for Ask HN & Show HN Posts

We will calculate the average number of comments for Ask HN & Show HN posts by iterating through each list and summing the total number of comments. Then we will divide by the length of each list to get the average.

In [5]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print('Average number of comments on ask posts:',avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print('Average number of comments on show posts:',avg_show_comments)


Average number of comments on ask posts: 14.038417431192661
Average number of comments on show posts: 10.31669535283993


From our analysis, we can see that the average number of comments on ask posts is higher (approximately 14 comments) to the number of comments on show posts (approximately 10 comments). Therefore, for the rest of our analysis, we will look at ask posts.

# Finding Number of Posts & Average Comments Per Post Per Hour of the Day (Ask Posts)

We will now seperate posts based on the hour of the day, find the number of posts per hour, and then the average number of comments per post per hour. 

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])
    

  
counts_by_hour = {}
comments_by_hour = {}
template = "%m/%d/%Y %H:%M"
for rows in result_list:
    date = rows[0] 
    comments = rows[1]
    hour = dt.datetime.strptime(date, template).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        

comments_by_hour    
    

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [7]:


avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr,(comments_by_hour[hr]/counts_by_hour[hr])]) 

avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

# Sorting and Printing Values from a List of Lists

In [8]:
swap_avg_by_hour = []

for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])

print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour,reverse=True)




[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [9]:
print("Top 5 Hours for Ask Posts Comments")
for avg,hour in sorted_swap[:5]:
    hour = dt.datetime.strptime(hour,"%H").strftime("%H:%M")
    template = "{}: {:.2f} average comments per post.".format(hour,avg)
    print(template)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


According to the results, the time of day that recieves the most comments per post is 15:00. According to the [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts/home), the timezone used is EST, so 15:00 is equivalent to 3:00 pm.

# Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and time receive the most comments on average. Based on our analysis, to maximize the amount of comments a post receives, the best time to post to the Hacker New website (for Ask Posts) is 15:00 - 16:00 (3-4 pm est). Another point to take into account while examining the results is that the analysis excluded posts without comments.

In [19]:
sum_askpoints = 0
for row in ask_posts:
   n_points = row[3]
   sum_askpoints += int(n_points)
    
sum_showpoints = 0
for row in show_posts:
    n_points = row[3]
    sum_showpoints += int(n_points)
    
avg_askpoints = sum_askpoints/len(ask_posts)
print("The average number of points per ask post is",avg_askpoints)
avg_showpoints = sum_showpoints/len(show_posts)
print("The average number of points per show post is",avg_showpoints)

The average number of points per ask post is 15.061926605504587
The average number of points per show post is 27.555077452667813


Here, we can see that show posts generate more points on average, so we will continue to look at them in determining how the time of day affects the number of points a post gets.

# Finding the Average Number of Points Per Post For Different Times of the Day (Show Posts)

We will follow the same procedure for finding the average number of comments, but this time we will be looking at the average number of points.

In [61]:
import datetime as dt

points_list = []

for row in show_posts:
    created_at = row[6]
    num_points = int(row[3])
    points_list.append([created_at,num_points])
    

  
show_counts_by_hour = {}
points_by_hour = {}
template = "%m/%d/%Y %H:%M"
for rows in points_list:
    date = rows[0] 
    points = rows[1]
    hour = dt.datetime.strptime(date, template).strftime("%H")
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] = 1
        points_by_hour[hour] = points
    else:
        show_counts_by_hour[hour] += 1
        points_by_hour[hour] += points

        

In [62]:
avg_points_by_hour = []

for hr in points_by_hour:
    avg_points_by_hour.append([hr,(points_by_hour[hr]/counts_by_hour[hr])]) 

avg_points_by_hour

[['14', 25.430232558139537],
 ['22', 40.34782608695652],
 ['18', 36.31147540983606],
 ['07', 19.0],
 ['20', 30.316666666666666],
 ['05', 5.473684210526316],
 ['16', 28.322580645161292],
 ['19', 30.945454545454545],
 ['15', 28.564102564102566],
 ['03', 25.14814814814815],
 ['17', 27.107526881720432],
 ['06', 23.4375],
 ['02', 11.333333333333334],
 ['13', 24.626262626262626],
 ['08', 15.264705882352942],
 ['21', 18.425531914893618],
 ['04', 14.846153846153847],
 ['11', 33.63636363636363],
 ['12', 41.68852459016394],
 ['23', 42.388888888888886],
 ['09', 18.433333333333334],
 ['01', 25.0],
 ['10', 18.916666666666668],
 ['00', 37.83870967741935]]

In [64]:
swap_avg_points_by_hour = []

for hour in avg_points_by_hour:
    swap_avg_points_by_hour.append([hour[1],hour[0]])

print(swap_avg_points_by_hour)
sorted_swap_points = sorted(swap_avg_points_by_hour,reverse=True)

[[25.430232558139537, '14'], [40.34782608695652, '22'], [36.31147540983606, '18'], [19.0, '07'], [30.316666666666666, '20'], [5.473684210526316, '05'], [28.322580645161292, '16'], [30.945454545454545, '19'], [28.564102564102566, '15'], [25.14814814814815, '03'], [27.107526881720432, '17'], [23.4375, '06'], [11.333333333333334, '02'], [24.626262626262626, '13'], [15.264705882352942, '08'], [18.425531914893618, '21'], [14.846153846153847, '04'], [33.63636363636363, '11'], [41.68852459016394, '12'], [42.388888888888886, '23'], [18.433333333333334, '09'], [25.0, '01'], [18.916666666666668, '10'], [37.83870967741935, '00']]


In [66]:
print("Top 5 Hours for Ask Posts Points")
for avg,hour in sorted_swap_points[:5]:
    hour = dt.datetime.strptime(hour,"%H").strftime("%H:%M")
    template = "{}: {:.2f} average points per post.".format(hour,avg)
    print(template)

Top 5 Hours for Ask Posts Points
23:00: 42.39 average points per post.
12:00: 41.69 average points per post.
22:00: 40.35 average points per post.
00:00: 37.84 average points per post.
18:00: 36.31 average points per post.


The results show that the top 5 show posts that get the most points are close to midnight (18:00,22:00,23:00,0:00) or noon (12:00).

# Finding Number of Posts & Average Comments Per Post Per Hour of the Day (Other Posts)

We will now look at the same statistics (average comments, average points, etc) for the posts that do not contain "ask hn" or "show hn" in the title.

In [32]:
total_other_comments = 0

for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
    
avg_other_comments = total_other_comments/len(other_posts)

print('Average number of comments on other posts:',avg_other_comments)

Average number of comments on other posts: 26.8730371059672


In [72]:
result_list_other = []

for row in other_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list_other.append([created_at,num_comments])
    

  
other_counts_by_hour = {}
other_comments_by_hour = {}
template = "%m/%d/%Y %H:%M"
for rows in result_list_other:
    date = rows[0] 
    comments = rows[1]
    hour = dt.datetime.strptime(date, template).strftime("%H")
    if hour not in other_counts_by_hour:
        other_counts_by_hour[hour] = 1
        other_comments_by_hour[hour] = comments
    else:
        other_counts_by_hour[hour] += 1
        other_comments_by_hour[hour] += comments
        

In [73]:
avg_other_by_hour = []

for hr in other_comments_by_hour:
    avg_other_by_hour.append([hr,(other_comments_by_hour[hr]/other_counts_by_hour[hr])]) 

avg_other_by_hour

[['11', 29.593939393939394],
 ['19', 26.701020408163266],
 ['22', 23.265171503957784],
 ['00', 27.076923076923077],
 ['04', 24.125550660792953],
 ['09', 27.588014981273407],
 ['16', 25.394187102633968],
 ['18', 26.924354243542435],
 ['10', 26.612521150592215],
 ['12', 30.34727503168568],
 ['20', 23.13940724478595],
 ['03', 26.825552825552826],
 ['17', 27.99572284003422],
 ['14', 32.33089770354906],
 ['13', 30.896514161220043],
 ['01', 23.072],
 ['23', 24.617210682492583],
 ['08', 27.026209677419356],
 ['02', 27.786848072562357],
 ['21', 23.60983981693364],
 ['15', 29.51923076923077],
 ['06', 21.357843137254903],
 ['05', 25.175257731958762],
 ['07', 26.808035714285715]]

In [75]:
swap_avg_other_by_hour = []

for hour in avg_other_by_hour:
    swap_avg_other_by_hour.append([hour[1],hour[0]])

print(swap_avg_other_by_hour)
sorted_swap_other = sorted(swap_avg_other_by_hour,reverse=True)



[[29.593939393939394, '11'], [26.701020408163266, '19'], [23.265171503957784, '22'], [27.076923076923077, '00'], [24.125550660792953, '04'], [27.588014981273407, '09'], [25.394187102633968, '16'], [26.924354243542435, '18'], [26.612521150592215, '10'], [30.34727503168568, '12'], [23.13940724478595, '20'], [26.825552825552826, '03'], [27.99572284003422, '17'], [32.33089770354906, '14'], [30.896514161220043, '13'], [23.072, '01'], [24.617210682492583, '23'], [27.026209677419356, '08'], [27.786848072562357, '02'], [23.60983981693364, '21'], [29.51923076923077, '15'], [21.357843137254903, '06'], [25.175257731958762, '05'], [26.808035714285715, '07']]


In [77]:
print("Top 5 Hours for Other Posts Comments")
for avg,hour in sorted_swap_other[:5]:
    hour = dt.datetime.strptime(hour,"%H").strftime("%H:%M")
    template = "{}: {:.2f} average comments per post.".format(hour,avg)
    print(template)

Top 5 Hours for Other Posts Comments
14:00: 32.33 average comments per post.
13:00: 30.90 average comments per post.
12:00: 30.35 average comments per post.
11:00: 29.59 average comments per post.
15:00: 29.52 average comments per post.


For other posts, the top 5 hours for the highest average comments per post was 32.33 at 14:00. The other 4 hours were during the same time of the day (afternoon), ranging from 11:00 to 15:00. 

# Finding the Average Number of Points Per Post For Different Times of the Day (Other Posts)

In [78]:
other_points_list = []

for row in other_posts:
    created_at = row[6]
    num_points = int(row[3])
    other_points_list.append([created_at,num_points])
    

  
other_points_counts_by_hour = {}
other_points_by_hour = {}
template = "%m/%d/%Y %H:%M"
for rows in other_points_list:
    date = rows[0] 
    points = rows[1]
    hour = dt.datetime.strptime(date, template).strftime("%H")
    if hour not in other_points_counts_by_hour:
        other_points_counts_by_hour[hour] = 1
        other_points_by_hour[hour] = points
    else:
        other_points_counts_by_hour[hour] += 1
        other_points_by_hour[hour] += points


In [80]:
avg_other_points_by_hour = []

for hr in other_points_by_hour:
    avg_other_points_by_hour.append([hr,(other_points_by_hour[hr]/other_points_counts_by_hour[hr])]) 

avg_other_points_by_hour

[['11', 57.56818181818182],
 ['19', 60.01122448979592],
 ['22', 50.236147757255935],
 ['00', 58.4582651391162],
 ['04', 49.66740088105727],
 ['09', 53.93632958801498],
 ['16', 54.182561307901906],
 ['18', 53.928966789667896],
 ['10', 60.4839255499154],
 ['12', 57.3979721166033],
 ['20', 45.24478594950604],
 ['03', 56.92137592137592],
 ['17', 57.97861420017109],
 ['14', 61.78601252609603],
 ['13', 62.525054466230934],
 ['01', 50.606],
 ['23', 52.02967359050445],
 ['08', 54.09274193548387],
 ['02', 58.471655328798185],
 ['21', 49.369565217391305],
 ['15', 60.542307692307695],
 ['06', 46.23529411764706],
 ['05', 49.96649484536083],
 ['07', 56.832589285714285]]

In [81]:
swap_avg_other_points_by_hour = []

for hour in avg_other_points_by_hour:
    swap_avg_other_points_by_hour.append([hour[1],hour[0]])

print(swap_avg_other_points_by_hour)
sorted_swap_points_other = sorted(swap_avg_other_points_by_hour,reverse=True)

[[57.56818181818182, '11'], [60.01122448979592, '19'], [50.236147757255935, '22'], [58.4582651391162, '00'], [49.66740088105727, '04'], [53.93632958801498, '09'], [54.182561307901906, '16'], [53.928966789667896, '18'], [60.4839255499154, '10'], [57.3979721166033, '12'], [45.24478594950604, '20'], [56.92137592137592, '03'], [57.97861420017109, '17'], [61.78601252609603, '14'], [62.525054466230934, '13'], [50.606, '01'], [52.02967359050445, '23'], [54.09274193548387, '08'], [58.471655328798185, '02'], [49.369565217391305, '21'], [60.542307692307695, '15'], [46.23529411764706, '06'], [49.96649484536083, '05'], [56.832589285714285, '07']]


In [83]:
print("Top 5 Hours for Other Posts Points")
for avg,hour in sorted_swap_points_other[:5]:
    hour = dt.datetime.strptime(hour,"%H").strftime("%H:%M")
    template = "{}: {:.2f} average points per post.".format(hour,avg)
    print(template)

Top 5 Hours for Other Posts Points
13:00: 62.53 average points per post.
14:00: 61.79 average points per post.
15:00: 60.54 average points per post.
10:00: 60.48 average points per post.
19:00: 60.01 average points per post.


For other posts, the highest average number of points was 62.53, and the times at which these highest instances occured ranged from 10:00 to 19:00.