# HACKER NEWS ENGAGEMENT ANALYSIS

### Introduction:

Hacker News is a well-known tech forum that hosts user-generated content (UGC) from the tech and startup communities. Users engage with each other's posts by voting, which aggregates into a total number of "points" (total up-votes minus total down-votes), and by commenting directly under posts. We use a downsampled data set (300,000 rows down to 20,000), produced by first eliminating posts without any comments and then randomly sampling the remaining data. This analysis homes in on the "Ask HN" and "Show HN" posts, identified by the beginning of the title. These two post types serve as our comparison for understanding how users engage with posts, measured by the number of comments.

### Understanding the Dataset:

To begin, let's open the dataset and look at the first few rows of data:

In [3]:
from csv import reader 

with open("hacker_news.csv") as file:
    read_file = reader(file) 
    hn = list(read_file)
    
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


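For easier analysis, we extract the header row from the dataset using indexing. This cell should be run only once, because each re-run strips one more row from hn. The output below shows a data row where the headers are expected, which suggests the cell was executed twice: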

In [5]:
headers = hn[0]
print(headers)

hn = hn[1:]
print(hn[0:5])

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
[['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http:

Now that we understand what's in our dataset, we can sort the data into three distinct lists: ask_posts, show_posts, and other_posts. Lowercasing each title with .lower() and checking its beginning with .startswith() lets us match the prefix regardless of capitalization (title case, all caps, etc.).

In [8]:
ask_posts = []
show_posts = [] 
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)

print(len(ask_posts))
print(len(show_posts)) 
print(len(other_posts))

1744
1162
17193
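
To illustrate why the lowercase step matters, here is a minimal sketch of the same prefix check applied to a few titles. The titles and the `classify` helper are hypothetical and invented for this example; they are not taken from the dataset or the cell above.

```python
# Hypothetical titles, invented for illustration (not from the dataset).
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "ASK HN: Best laptop for development?",
    "Show HN: My weekend project",
    "Why I ask HN before buying hardware",
]


def classify(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # normalize case once, then test the prefix
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


labels = [classify(title) for title in sample_titles]
print(labels)  # ['ask', 'ask', 'show', 'other']
```

The last title contains "ask HN" in the middle, but .startswith() only looks at the beginning, so it lands in the "other" group.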


### Ask Comments vs. Show Comments:

We have our distinct lists, so we can begin our analysis of user engagement through the number of comments under a post. First, we are going to focus on whether ask posts or show posts receive more comments on average.

In [11]:
total_ask_comments = 0

for post in ask_posts:
    total_ask_comments += int(post[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
    
print(avg_ask_comments)

14.038417431192661


In [14]:
total_show_comments = 0

for post in show_posts: 
    total_show_comments += int(post[4])
    
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993


Based on the analysis, show posts receive an average of about 10 comments, compared to about 14 for ask posts. Since ask posts receive more comments on average, the rest of the analysis focuses on them.
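
The two averaging cells above differ only in the list they loop over, so the calculation could be factored into one function. The sketch below assumes a hypothetical helper named `average_comments` and two made-up rows in the dataset's column order; it is one possible refactoring, not the code used to produce the numbers above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
demo_posts = [
    ["1", "Ask HN: example A", "", "5", "10", "user_a", "1/1/2016 9:00"],
    ["2", "Ask HN: example B", "", "3", "4", "user_b", "1/1/2016 10:00"],
]

demo_avg = average_comments(demo_posts)
print(demo_avg)  # 7.0
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to reproduce the two averages printed above.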

### Timing of Posts and User Engagement

We want to see whether ask posts receive more engagement depending on the time they were created. This requires a two-step approach. First, we'll calculate the number of ask posts created within each hour of the day, as well as the number of comments they received. Then, we'll calculate the average number of comments ask posts receive by the hour created.
To deal with the datetime format in the created_at column, we'll need to import the datetime module.
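
As a small illustration of the parsing step, the sketch below converts one created_at value (copied from the first data row shown earlier) into a datetime object and extracts the hour. The variable names are chosen for this example only.

```python
import datetime as dt

# Format of the created_at column: month/day/year hour:minute
date_format = "%m/%d/%Y %H:%M"

created_at = "8/4/2016 11:52"  # value from the first data row shown earlier
parsed = dt.datetime.strptime(created_at, date_format)

hour_key = parsed.strftime("%H")  # zero-padded hour as a string
print(hour_key)     # 11
print(parsed.hour)  # 11 (the same hour as an integer)
```

Using the zero-padded string from strftime("%H") keeps keys such as '09' and '15' the same width, which is how the dictionaries later in this section are keyed.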

In [16]:
import datetime as dt

result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
         )
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
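
The if/else branch in the loop above exists only to initialize missing keys. A sketch of the same tally using `collections.defaultdict` from the standard library is shown below. The first two demo rows reuse created_at and comment values from rows printed earlier, the third is hypothetical, and none of this replaces the cell above.

```python
from collections import defaultdict
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# [created_at, num_comments] pairs; the last row is hypothetical.
demo_rows = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 3],
]

counts = defaultdict(int)    # posts per hour, missing keys start at 0
comments = defaultdict(int)  # comments per hour, missing keys start at 0

for created_at, n_comments in demo_rows:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))    # {'11': 2, '19': 1}
print(dict(comments))  # {'11': 55, '19': 10}
```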

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [18]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append(
        [hour, comments_by_hour[hour] / counts_by_hour[hour]]
    )
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

### Sorting and Printing Values from a List of Lists

In [20]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
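
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` also accepts a `key` function, which sorts the original [hour, average] pairs directly. The sample below copies four pairs from avg_by_hour above; the variable names and the output wording are assumptions made for this example.

```python
import datetime as dt

# A few [hour, average] pairs copied from avg_by_hour above.
avg_by_hour_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort by the second element (the average), highest first, without swapping.
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

lines = []
for hour, avg in top[:3]:
    # Turn the hour string into an HH:MM label for readability.
    label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    lines.append("{}: {:.2f} average comments per post".format(label, avg))

print("\n".join(lines))
```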

The hour with the highest average number of comments per post is hour 15, which is 3:00pm on a 12-hour clock (Eastern Time, according to the dataset documentation). Ask HN posts created during the 3pm hour garner almost 39 comments on average. This is much higher than even the next highest hour, 2am, which averages just under 24 comments: an increase of roughly 62% from the second-highest hour to the highest.
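
The relative gap between the top two hours can be checked directly from the averages in the sorted list. The short calculation below uses those two values; the variable names are chosen for this example.

```python
top_avg = 38.5948275862069       # hour 15, from the sorted list above
second_avg = 23.810344827586206  # hour 02, from the sorted list above

# Percent increase from the second-highest hour to the highest hour.
pct_increase = (top_avg - second_avg) / second_avg * 100
print(round(pct_increase, 1))  # 62.1
```

This works out to roughly 62%.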

### Conclusion

Among Hacker News posts that received at least one comment, Ask HN posts receive more comments on average than Show HN posts, and the Ask HN posts that receive the most comments on average are those created between 3:00pm and 4:00pm EST.