# Hacker News Project

Hacker News is a site where users submit posts that other users can upvote and comment on, similar to Reddit. Many posts start with "Ask HN" or "Show HN": an Ask HN post puts a question to the community, while a Show HN post shares a project or something else of interest. The goal of this project is to examine these two groups, determine whether one receives more comments than the other, and find out whether posts created at certain hours of the day receive more comments on average.

### Importing and examining the data

In [1]:
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file) 

# import data as a list of lists

In [2]:
print(hn[0:5]) 

#print the header row and the first four data rows

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
headers = hn[0]
print(headers)

 #extract the header row

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [4]:
hn = hn[1:] 
print(hn[:5])

#removing the header row from the source data

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
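
The loading steps above could also be written with a context manager so the file is closed automatically. The following is a minimal sketch rather than the notebook's own code: `load_rows` is a hypothetical helper, and the sketch reads from an in-memory sample (two rows copied from the output above) instead of `hacker_news.csv` so that it runs on its own.

```python
import csv
import io

# Header plus two data rows copied from the output above.
# The notebook itself reads the full hacker_news.csv file.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)


def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object (hypothetical helper)."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# The with-block closes the file object as soon as the rows are read.
with io.StringIO(SAMPLE_CSV) as f:
    headers, hn = load_rows(f)

print(headers)
print(len(hn))
```

With the real data, the same helper would presumably be called inside `with open('hacker_news.csv', newline='') as f:`.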


### Separating data into ask, show and other posts

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row) 
        
#separates the posts into three lists, one for each category we're analyzing

In [6]:
print("total ask posts " + str(len(ask_posts)))
print("total show posts " + str(len(show_posts)))
print("total other posts " + str(len(other_posts))) 

#prints total rows for each category

total ask posts 1744
total show posts 1162
total other posts 17194


In [7]:
total_rows_separated = len(ask_posts) + len(show_posts) + len(other_posts)
total_rows = len(hn)
print(total_rows_separated == total_rows)

#confirms the three groups together contain every row of the original data

True
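
The prefix test used to split the posts can be isolated in a small function, which makes it easy to check on a few titles. This is a sketch only: `categorize` is a hypothetical helper, and the first two titles below are invented for illustration.

```python
def categorize(title):
    """Label a title 'ask', 'show', or 'other' by its prefix (hypothetical helper)."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are made up; the third appears in the data above.
titles = [
    "Ask HN: How do you learn Python?",
    "SHOW HN: My side project",
    "Interactive Dynamic Video",
]

groups = {"ask": [], "show": [], "other": []}
for title in titles:
    groups[categorize(title)].append(title)

print({name: len(items) for name, items in groups.items()})
```

Because every title receives exactly one label, the group sizes always add up to the number of titles, which is the property the check above confirms.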


### Computing average comments for ask vs show posts

In [8]:
total_ask_comments = 0

for row in hn:
    comment_total = int(row[4])
    title = row[1].lower()
    if title.startswith("ask hn"):
        total_ask_comments += comment_total
        
print(total_ask_comments)

#computes the total comments for ask posts

24483


In [9]:
total_show_comments = 0

for row in hn:
    comment_total = int(row[4])
    title = row[1].lower()
    if title.startswith("show hn"):
        total_show_comments += comment_total
        
print(total_show_comments)

#computes the total comments for show posts

11988


In [10]:
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)

print("Ask posts comment average is " + str(avg_ask_comments))
print("Show posts comment average is " + str(avg_show_comments))

#computes the average number of comments per post for ask and show posts

Ask posts comment average is 14.038417431192661
Show posts comment average is 10.31669535283993


### Ask vs Show Post Analysis
Ask posts receive about 4 more comments per post on average than show posts (roughly 14.0 versus 10.3). This comparison does not yet take the time of posting into account.
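
Since the posts are already separated into `ask_posts` and `show_posts`, the averages could also be computed from those lists directly instead of scanning `hn` again. The sketch below assumes a hypothetical helper, `average_comments`, and uses two invented rows in the dataset's column layout. It then repeats the arithmetic on the totals and counts reported above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column (index 4 in this dataset) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty group
    return sum(int(row[comments_index]) for row in posts) / len(posts)


# Invented rows with the same seven columns as the dataset.
sample_posts = [
    ["1", "Ask HN: example one", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: example two", "", "2", "29", "user_b", "11/22/2015 13:43"],
]
print(average_comments(sample_posts))  # (6 + 29) / 2 = 17.5

# Totals and post counts taken from the outputs above.
ask_avg = 24483 / 1744
show_avg = 11988 / 1162
print(round(ask_avg, 2), round(show_avg, 2), round(ask_avg - show_avg, 2))
```

The last line shows that the gap is closer to 3.7 comments per post, which the text rounds to "about 4".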

### Analyzing ask posts by hour

In [11]:
import datetime as dt

#importing the datetime module

In [12]:
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
#new list holding each ask post's creation time and number of comments, to iterate over below

In [14]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    comments = row[1]
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

comments_by_hour  
# creates data dictionaries to track posts per hour and comments per hour for ask posts

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
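
The same tally can be written without the `if`/`else` branch by using `collections.defaultdict`, which starts every missing key at zero. This is an alternative sketch, not the notebook's method, and the three `[created_at, num_comments]` pairs are invented for illustration.

```python
import datetime as dt
from collections import defaultdict

# Invented pairs in the same shape as result_list.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

counts_by_hour = defaultdict(int)    # posts created in each hour
comments_by_hour = defaultdict(int)  # comments received by those posts

for created_at, comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour as the key.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```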

In [15]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

#calculates the average number of comments per ask post for each hour

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In [16]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

#swaps the first and second elements of each list from the code above so the average comes first

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [17]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
for row in sorted_swap:
    print(row)
    
#sorts the average comment by hour descending for ask posts

[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']
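
The swap step exists only so that `sorted` compares the averages first. An alternative, sketched below under the assumption that the `[hour, average]` layout of `avg_by_hour` is kept, is to pass a `key` function to `sorted`. Only four pairs are copied from the output above to keep the example short.

```python
# Four [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort on the average (second element) in descending order, no swap needed.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

One difference from the swapped version: `sorted` is stable, so hours with equal averages keep their original relative order rather than being ordered by the hour string. The `:.2f` format also always prints two decimal places, unlike `round`.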


In [18]:
for row in sorted_swap[0:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    print(hour.strftime("%H:%M") + ": " + str(round(row[0], 2)) + " average comments per post")

#prints the five hours with the highest average number of comments per ask post

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


In [19]:
for row in sorted_swap[0:5]:
    print(dt.datetime.strptime(row[1], "%H").strftime("%H:%M") + ": " + str(round(row[0], 2)) + " average comments per post")

#another way to print the top five, with the hour conversion done inline

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post


### Conclusions

We determined in this analysis that ask posts receive approximately 4 more comments per post on average than show posts (about 14.0 versus 10.3). We then grouped the ask posts by the hour in which they were created to find the hours with the highest average number of comments per post. Posts created at 15:00 stand out with 38.59 comments per post on average. The rest of the top five suggests that the end of a typical workday (16:00) and late evening (20:00 and 21:00) also draw more discussion, with some late-night activity at 02:00.