# Analyzing Hacker News Posts

## Downloading The Data

The original data set and data description can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts). A reduced data set of Hacker News Posts are used for this analysis.

## Importing packages

In [13]:
#import packages
from csv import reader
import datetime as dt

In [14]:
#import csv file and save as a list of lists
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)

## Data Inspection

In [15]:
#display first five rows of hn data set
hn_head = hn[0:5]
print(hn_head)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [16]:
#save header row as 'headers'
headers = hn[0]
#print header row from hn dataset
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [17]:
#remove header row from hn dataset
hn = hn[1:]
#print first five rows in hn dataset
print(hn[0:5])


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Data preparation

In [18]:
#create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

In [19]:
#loop through the titles in hn dataset and fill the 3 lists
for row in hn:
    title = row[1]
    if (title.lower()).startswith('ask hn'):
        ask_posts.append(row)
    elif (title.lower()).startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [20]:
#display the number of lowercase titled posts in the 3 lists
print("Number of ask posts: " + str(len(ask_posts)))
print("Number of show posts: " + str(len(show_posts)))
print("Number of other posts: " + str(len(other_posts)))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


## Data Exploration 
Do ask posts or show posts receive more comments on average?

In [21]:
#determine the average number of ask comments
total = 0
for row in ask_posts:
    total += int(row[4])
    
avg_ask_comments = total / len(ask_posts)
print('Average number of ask posts : ' + str(avg_ask_comments))

Average number of ask posts : 14.038417431192661


In [22]:
#determine the average number of show comments
total = 0
for row in show_posts:
    total += int(row[4])
    
avg_show_comments = total / len(show_posts)
print('Average number of show posts : ' + str(avg_show_comments))

Average number of show posts : 10.31669535283993


Show posts receive more comments on average than ask posts.

## Calculate the average number of ask posts and comments created each hour

In [26]:
#create empty list
result_list = []
#add to the list:
#(date & time post was created, number of comments)
for row in ask_posts:
    result_list.append([row[6], int(row[4])])    

In [33]:
#create two dictionaries
counts_by_hour = {}
comments_by_hour = {}

#loop over result_list
for row in result_list:
    #create a datetime object, assign datetime format that matches the date time stamp of the data
    date_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    #select only the hour from the datetime object
    hour = date_time.strftime("%H")
    #conditional statement: 
    #if the 'hour' is not in 'counts_by_hour' dictionary,
    #then add the 'hour' to the dictionary and assign it a value of 1
    #then add the corresponding number of comments to the 'comments_by_hour' dictionary
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    #conditional statement: 
    #if the 'hour' is in 'counts_by_hour' dictionary,
    #then increase it's value by 1
    #then increase the number of comments by the comment number
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

In [35]:
#check format
print(counts_by_hour)
print(comments_by_hour)

{'14': 108, '08': 49, '05': 47, '04': 48, '17': 101, '03': 55, '22': 72, '09': 46, '20': 81, '15': 117, '06': 45, '12': 74, '02': 59, '10': 60, '13': 86, '00': 56, '11': 59, '07': 35, '18': 110, '21': 110, '16': 109, '19': 111, '01': 61, '23': 69}
{'14': 1419, '08': 497, '05': 493, '04': 340, '17': 1147, '03': 422, '22': 481, '09': 257, '20': 1724, '15': 4478, '06': 398, '12': 691, '02': 1384, '10': 794, '13': 1282, '00': 457, '11': 643, '07': 269, '18': 1441, '21': 1749, '16': 1831, '19': 1191, '01': 716, '23': 544}


In [39]:
#calculate the average number of comments per post
#created during each hour of the day
avg_by_hour = []

for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour,avg])
    
print(avg_by_hour)

[['14', 13.13888888888889], ['08', 10.142857142857142], ['05', 10.48936170212766], ['04', 7.083333333333333], ['17', 11.356435643564357], ['03', 7.672727272727273], ['22', 6.680555555555555], ['09', 5.586956521739131], ['20', 21.28395061728395], ['15', 38.27350427350427], ['06', 8.844444444444445], ['12', 9.337837837837839], ['02', 23.45762711864407], ['10', 13.233333333333333], ['13', 14.906976744186046], ['00', 8.160714285714286], ['11', 10.898305084745763], ['07', 7.685714285714286], ['18', 13.1], ['21', 15.9], ['16', 16.798165137614678], ['19', 10.72972972972973], ['01', 11.737704918032787], ['23', 7.884057971014493]]


In [40]:
#swap the columns of 'avg_by_hour' list of lists
#Now, the average of comments by the hour column is the first element
#Now, the hour column is the second element
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[13.13888888888889, '14'], [10.142857142857142, '08'], [10.48936170212766, '05'], [7.083333333333333, '04'], [11.356435643564357, '17'], [7.672727272727273, '03'], [6.680555555555555, '22'], [5.586956521739131, '09'], [21.28395061728395, '20'], [38.27350427350427, '15'], [8.844444444444445, '06'], [9.337837837837839, '12'], [23.45762711864407, '02'], [13.233333333333333, '10'], [14.906976744186046, '13'], [8.160714285714286, '00'], [10.898305084745763, '11'], [7.685714285714286, '07'], [13.1, '18'], [15.9, '21'], [16.798165137614678, '16'], [10.72972972972973, '19'], [11.737704918032787, '01'], [7.884057971014493, '23']]


In [41]:
#sort the 'swap_avg_by_hour' list of lists
#sort by descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[38.27350427350427, '15'], [23.45762711864407, '02'], [21.28395061728395, '20'], [16.798165137614678, '16'], [15.9, '21'], [14.906976744186046, '13'], [13.233333333333333, '10'], [13.13888888888889, '14'], [13.1, '18'], [11.737704918032787, '01'], [11.356435643564357, '17'], [10.898305084745763, '11'], [10.72972972972973, '19'], [10.48936170212766, '05'], [10.142857142857142, '08'], [9.337837837837839, '12'], [8.844444444444445, '06'], [8.160714285714286, '00'], [7.884057971014493, '23'], [7.685714285714286, '07'], [7.672727272727273, '03'], [7.083333333333333, '04'], [6.680555555555555, '22'], [5.586956521739131, '09']]


In [46]:
#print the string : "Top 5 Hours for Ask Posts Comments"
print("Top 5 Hours for Ask Posts Comments") 

#loop through first 5 rows of sorted_swap
for row in sorted_swap[:5]:
    #create a datetime object, assign datetime format that matches the date time stamp of the data
    time = dt.datetime.strptime(row[1], "%H")
    #select only the hour, and assign the hour:minute format to the datetime object
    hour_min = time.strftime("%H:00")
    #print statement for each hour and its average
    print("{hour}: {avg:.2f} average comments per post".format(hour = hour_min, avg = row[0]))


Top 5 Hours for Ask Posts Comments
15:00: 38.27 average comments per post
02:00: 23.46 average comments per post
20:00: 21.28 average comments per post
16:00: 16.80 average comments per post
21:00: 15.90 average comments per post


The highest average comments for a post occur at 3:00pm EST.