# Exploring Hacker News Posts

In this project, we work with a dataset of submissions to Hacker News, a site popular in technology and startup circles where posts are voted and commented on, much like Reddit. Posts whose titles begin with "Ask HN" ask the community a question, whereas posts whose titles begin with "Show HN" show the community a product, project, or just something interesting. We compare the two post types and conclude by finding the hours of the day at which "Ask HN" posts receive the most comments.
Here is the link to the dataset: https://www.kaggle.com/hacker-news/hacker-news-posts

In [1]:
from csv import reader

# Open the file in a context manager so it is closed once it has been read
with open('hacker_news.csv', newline='') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


### Extracting header row

In [2]:
hn_header = hn[0]
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [3]:
hn = hn[1:]
print (hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


### Separating "Ask HN" and "Show HN" posts

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [5]:
print (len(ask_posts))
print (len(show_posts))
print (len(other_posts))

1744
1162
17194


In [6]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    num_comments = row[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


From the analysis above, "Ask HN" posts receive more comments on average (about 14.04) than "Show HN" posts (about 10.32). By this measure, "Ask HN" posts draw more interaction on Hacker News, so the remaining analysis focuses on them.
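The two loops above differ only in the list they iterate over, so the averaging step could be factored into a small helper. The following is a minimal sketch; the function name `average_comments` and the sample rows are hypothetical and not part of the original notebook. It assumes the comment count sits at index 4, as in the dataset's header row.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across posts, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical sample rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask = [
    ['1', 'Ask HN: Example question?', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Another question?', '', '3', '4', 'user_b', '1/2/2016 11:00'],
]

print(average_comments(sample_ask))  # 7.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.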

### Number of Post Interactions based on the Time of Posting

In [7]:
import datetime as dt

result_list = []

for row in ask_posts:
    ask_time = row[6]
    num_comments = int(row[4])
    result_list.append([ask_time, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time_of_post = row[0]
    comment_count = row[1]
    
    # Parse the timestamp string from result_list into a datetime object
    time_of_post_dt = dt.datetime.strptime(time_of_post, "%m/%d/%Y %H:%M")

    hour_of_post = time_of_post_dt.strftime("%H")
    
    
    if hour_of_post in counts_by_hour:
        counts_by_hour[hour_of_post] += 1
        comments_by_hour[hour_of_post] += comment_count
    else:
        counts_by_hour[hour_of_post] = 1
        comments_by_hour[hour_of_post] = comment_count
    
print(counts_by_hour)
print(comments_by_hour)
    


{'14': 107, '20': 80, '07': 34, '04': 47, '19': 110, '15': 116, '18': 109, '00': 55, '09': 45, '22': 71, '05': 46, '12': 73, '17': 100, '08': 48, '03': 54, '10': 59, '06': 44, '16': 108, '23': 68, '02': 58, '01': 60, '11': 58, '21': 109, '13': 85}
{'14': 1416, '20': 1722, '07': 267, '04': 337, '19': 1188, '15': 4477, '18': 1439, '00': 447, '09': 251, '22': 479, '05': 464, '12': 687, '17': 1146, '08': 492, '03': 421, '10': 793, '06': 397, '16': 1814, '23': 543, '02': 1381, '01': 683, '11': 641, '21': 1745, '13': 1253}


Above, we created two dictionaries, counts_by_hour and comments_by_hour:

counts_by_hour: the number of ask posts created during each hour of the day.

comments_by_hour: the total number of comments received by ask posts created during each hour.
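As an alternative to the explicit `if`/`else` branch, the same two tallies can be built with `collections.defaultdict`, which starts every missing key at zero. This is only a sketch: the function name `tally_by_hour` and the sample rows are hypothetical, and it assumes each row is a `[created_at, num_comments]` pair like those stored in `result_list`.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(rows):
    """Count posts and sum comments for each hour of the day ('00' to '23')."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical sample rows in the same shape as result_list
sample = [["8/4/2016 11:52", 52], ["1/26/2016 11:05", 10], ["6/23/2016 22:20", 1]]
counts, comments = tally_by_hour(sample)
print(counts)    # {'11': 2, '22': 1}
print(comments)  # {'11': 62, '22': 1}
```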

In [8]:
avg_comments_by_hr = []

for hour in counts_by_hour:
    avg_comments_by_hr.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print (avg_comments_by_hr)

[['14', 13.233644859813085], ['20', 21.525], ['07', 7.852941176470588], ['04', 7.170212765957447], ['19', 10.8], ['15', 38.5948275862069], ['18', 13.20183486238532], ['00', 8.127272727272727], ['09', 5.5777777777777775], ['22', 6.746478873239437], ['05', 10.08695652173913], ['12', 9.41095890410959], ['17', 11.46], ['08', 10.25], ['03', 7.796296296296297], ['10', 13.440677966101696], ['06', 9.022727272727273], ['16', 16.796296296296298], ['23', 7.985294117647059], ['02', 23.810344827586206], ['01', 11.383333333333333], ['11', 11.051724137931034], ['21', 16.009174311926607], ['13', 14.741176470588234]]


### Improving Readability

In [27]:
swap_avg_by_hour = []

for row in avg_comments_by_hr:
    swap_avg_by_hour.append([row[1], row[0]]) #swap columns

swap_avg_by_hour = sorted(swap_avg_by_hour, reverse=True) #descending sort
print(swap_avg_by_hour)

print ("Top 5 Hours for Ask Posts Comments")

for row in swap_avg_by_hour[:5]:
    hour = row[1]
    avg = row[0]
    
    hour = dt.datetime.strptime(hour, "%H") #parse hour string
    hour = dt.datetime.strftime(hour, "%H:00") #format hour as hour:minute
    
    template = "{hour} --- {avg:.2f}" #floating decimal precision
    output = template.format(hour = hour, avg = avg)
    print(output)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
Top 5 Hours for Ask Posts Comments
15:00 --- 38.59
02:00 --- 23.81
20:00 --- 21.52
16:00 --- 16.80
21:00 --- 16.01


The output above lists the five hours with the highest average number of comments per ask post: 3 PM, 2 AM, 8 PM, 4 PM, and 9 PM. Posts created at 3 PM stand out, averaging about 38.59 comments, well ahead of the second-placed hour (about 23.81).
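The ranking and the 12-hour labels used in the conclusion can also be produced without swapping columns, by sorting on the average with a `key` function. The sketch below is one way to do this; the helper names `to_12_hour` and `top_hours` are hypothetical, and the sample list is a subset of the averages printed earlier, rounded to two decimals.

```python
def to_12_hour(hour):
    """Convert a zero-padded 24-hour string such as '15' to a label such as '3 PM'."""
    h = int(hour)
    suffix = "AM" if h < 12 else "PM"
    return f"{h % 12 or 12} {suffix}"


def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    return sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:n]


# A subset of the averages from the notebook output, rounded to two decimals
sample_avgs = [['14', 13.23], ['15', 38.59], ['02', 23.81],
               ['20', 21.52], ['16', 16.80], ['21', 16.01]]

for hour, avg in top_hours(sample_avgs):
    print(f"{to_12_hour(hour)}: {avg:.2f} average comments per post")
```

Sorting with `key=lambda row: row[1]` keeps each pair in its original `[hour, average]` order, so the intermediate swapped list is not needed.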