# Exploring Hacker News Posts

This project looks at the posts that make it to the top of the listings on Hacker News (a site started by the startup incubator Y Combinator). These listings can get hundreds of thousands of visitors, so it is worth exploring what the submissions look like. The dataset is available [here](https://www.kaggle.com/hacker-news/hacker-news).

First, we'll take a look at the CSV file, which comes from a scrape of the site and is provided on Kaggle:

In [1]:
from csv import reader

with open('hacker_news.csv') as f:
    hn = list(reader(f))
    
# Print each header name prefixed with its column index
print([f'{i}: {name}' for i, name in enumerate(hn[0])])
print('\n')
    
for i in range(1,6):
    print(hn[i])
    print('\n')

['0: id', '1: title', '2: url', '3: num_points', '4: num_comments', '5: author', '6: created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/201
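Since later cells refer to columns by position, it can help to build a name-to-index lookup from the header row. The following is a minimal sketch, not part of the original notebook: the in-memory sample stands in for `hacker_news.csv`, and the name `col_idx` is a hypothetical helper introduced only for illustration.

```python
from csv import reader
from io import StringIO

# Small in-memory sample standing in for hacker_news.csv (assumption for illustration).
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = list(reader(sample_csv))
header, data = rows[0], rows[1:]

# Map each column name to its position so later code can avoid hard-coded numbers.
col_idx = {name: i for i, name in enumerate(header)}

print(col_idx['num_comments'])           # position of the num_comments column
print(data[0][col_idx['num_comments']])  # value in the first data row
```

With a lookup like this, a value such as `num_comments_idx = 4` could be derived from the header instead of being typed by hand.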

## Investigating Types of Posts

We'll first check which types of posts merit further investigation.

In [2]:
ask_posts = []
show_posts = []
other_posts = []

# Ask and show posts are identified by their title prefixes ('Ask HN', 'Show HN')
for row in hn[1:]:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else: 
        other_posts.append(row)
        
print('# of ask posts: ' + str(len(ask_posts)))
print('# of show posts: ' + str(len(show_posts)))
print('# of other posts: ' + str(len(other_posts)))

# of ask posts: 1744
# of show posts: 1162
# of other posts: 17194
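The prefix check used for categorization can also be isolated in a small function, which makes it easy to try on a few titles. This is only a sketch: `post_type` is a hypothetical helper, and two of the three titles below are made up for illustration.

```python
def post_type(title: str) -> str:
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Example titles; the first two are invented for illustration.
titles = [
    'Ask HN: How do you learn?',
    'show hn: My side project',
    'Interactive Dynamic Video',
]
print([post_type(t) for t in titles])
```

Lowercasing before the comparison means that 'Ask HN', 'ask hn', and 'ASK HN' all land in the same category, which matches the behavior of the loop above.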


In [3]:
num_comments_idx = 4

def avg_comments(posts: list[list[str]], post_name: str) -> None:
    '''
    Prints the average number of comments for a list of posts, labeled with the list's name
    '''
    total_comments = 0
    for post in posts:
        total_comments += int(post[num_comments_idx])
    # Use a distinct local name so the result does not shadow the function name
    avg = total_comments / len(posts)
    print('The average number of comments on ' + post_name + ' is: ' + '{:.0f}'.format(avg))

avg_comments(ask_posts, 'Ask Posts')
avg_comments(show_posts, 'Show Posts')
avg_comments(other_posts, 'Other Posts')

The average number of comments on Ask Posts is: 14
The average number of comments on Show Posts is: 10
The average number of comments on Other Posts is: 27


Setting aside other posts for now, ask posts receive more comments per post than show posts (14 versus 10). We will therefore continue the analysis with ask posts.
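The per-category average is simply the sum of the comment counts divided by the number of posts. As a rough sketch of the same arithmetic using the standard library's `statistics.mean`, with made-up rows in the dataset's column layout (`mean_comments` and `sample_posts` are hypothetical names):

```python
from statistics import mean

NUM_COMMENTS_IDX = 4  # same column position as num_comments_idx above

# Made-up rows in the dataset's column layout, for illustration only.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '8/4/2016 12:10'],
]

def mean_comments(posts):
    """Average of the num_comments column; returns None for an empty list."""
    if not posts:
        return None
    return mean(int(post[NUM_COMMENTS_IDX]) for post in posts)

print(mean_comments(sample_posts))
```

Returning `None` for an empty list is one possible design choice; it avoids the division-by-zero error that a plain sum-and-divide would raise.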

## What is the Best Time to Ask an Ask Post?

We'll build a frequency table of posts per hour, along with the total comments per hour, to find out when ask posts get the highest average number of comments.

In [4]:
created_at_idx = 6
num_points_idx = 3

import datetime as dt

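# Build per-hour counts and totals, then average them
def avg_count_per_hour(posts: list[list[str]], stat_idx: int) -> list[list]:
    '''
    Returns a list of [hour, average] pairs, where hour is a zero-padded string ('00'-'23')
    and average is the mean of the chosen statistic for posts created in that hour
    '''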
    counts_by_hour = {}
    stat_by_hour = {}
    avg_by_hour = []
    for row in posts:
        date = dt.datetime.strptime(row[created_at_idx], '%m/%d/%Y %H:%M')
        hour = date.strftime('%H')
        
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            stat_by_hour[hour] += int(row[stat_idx])
        else:
            counts_by_hour[hour] = 1
            stat_by_hour[hour] = int(row[stat_idx])
    
    for key in counts_by_hour:
        avg_stat = stat_by_hour[key] / counts_by_hour[key]
        avg_for_hour = [key, avg_stat]
        avg_by_hour.append(avg_for_hour)
    
    return avg_by_hour
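The same per-hour averaging can be written more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sketch below is an alternative formulation under stated assumptions, not the notebook's own function: `hourly_average` is a hypothetical name, it returns a dict rather than a list of pairs, and the rows are made up.

```python
import datetime as dt
from collections import defaultdict

CREATED_AT_IDX = 6
NUM_COMMENTS_IDX = 4

# Made-up rows in the dataset's column layout, for illustration only.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 15:52'],
    ['2', 'Ask HN: B', '', '3', '20', 'user_b', '8/5/2016 15:10'],
    ['3', 'Ask HN: C', '', '1', '4', 'user_c', '8/5/2016 2:05'],
]

def hourly_average(posts, stat_idx):
    """Return {hour: average of the chosen column} with zero-padded hour keys."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[CREATED_AT_IDX], '%m/%d/%Y %H:%M')
        hour = created.strftime('%H')
        totals[hour] += int(row[stat_idx])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}

result = hourly_average(sample_posts, NUM_COMMENTS_IDX)
print(result)
```

Here the two posts created in hour 15 average to 15 comments, and the single post in hour 02 keeps its own value of 4.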

In [5]:
avg_comments_by_hour = avg_count_per_hour(ask_posts, num_comments_idx)

avg_comments_by_hour.sort()
for hour_avg in avg_comments_by_hour:
    print('Hour '+str(hour_avg[0])+': {:.0f}'.format(hour_avg[1]))

Hour 00: 8
Hour 01: 11
Hour 02: 24
Hour 03: 8
Hour 04: 7
Hour 05: 10
Hour 06: 9
Hour 07: 8
Hour 08: 10
Hour 09: 6
Hour 10: 13
Hour 11: 11
Hour 12: 9
Hour 13: 15
Hour 14: 13
Hour 15: 39
Hour 16: 17
Hour 17: 11
Hour 18: 13
Hour 19: 11
Hour 20: 22
Hour 21: 16
Hour 22: 7
Hour 23: 8


In [6]:
from operator import itemgetter

sorted_swap = sorted(avg_comments_by_hour,key=itemgetter(1),reverse=True)
print('Top 5 Hours for Ask Posts Comments')

txt = '{hour}:00: {avg:.0f} average comments per post'
for i in range(5):
    hour = sorted_swap[i][0]
    avg = sorted_swap[i][1]
    print(txt.format(hour=hour, avg=avg))

Top 5 Hours for Ask Posts Comments
15:00: 39 average comments per post
02:00: 24 average comments per post
20:00: 22 average comments per post
16:00: 17 average comments per post
21:00: 16 average comments per post


Ask posts created at 15:00 (3:00 PM) receive the most comments on average, about 39 per post, which makes it the best hour to post one.
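The hour keys in the table are zero-padded 24-hour strings such as '15'. A small sketch of converting one into a 12-hour label like the one quoted above, assuming the default C locale for the AM/PM marker (`to_12_hour` is a hypothetical helper):

```python
import datetime as dt

def to_12_hour(hour: str) -> str:
    """Convert a 24-hour string such as '15' into a label such as '3:00 PM'."""
    parsed = dt.datetime.strptime(hour, '%H')
    # %I is zero-padded, so strip a leading zero for a more natural label.
    return parsed.strftime('%I:%M %p').lstrip('0')

for hour in ['15', '02', '00']:
    print(hour, '->', to_12_hour(hour))
```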

## Further Investigations Between Ask and Show Posts

First, we'll check whether show posts or ask posts receive more points on average.

In [7]:
def avg_points(posts: list[list[str]], post_name: str) -> None:
    '''
    Prints the average number of points for a list of posts, labeled with the list's name
    '''
    total_points = 0
    for post in posts:
        total_points += int(post[num_points_idx])
    # Use a distinct local name so the result does not shadow the function name
    avg = total_points / len(posts)
    print('The average number of points on ' + post_name + ' is: ' + '{:.0f}'.format(avg))
    
avg_points(ask_posts, 'Ask Posts')
avg_points(show_posts, 'Show Posts')
avg_points(other_posts, 'Other Posts')

The average number of points on Ask Posts is: 15
The average number of points on Show Posts is: 28
The average number of points on Other Posts is: 55


Show posts receive more points on average than ask posts (28 versus 15). Next, we'll look at how the points on show posts vary by hour.

In [8]:
avg_points_by_hour_show = avg_count_per_hour(show_posts, num_points_idx)

In [9]:
sorted_swap = sorted(avg_points_by_hour_show,key=itemgetter(1),reverse=True)
print('Top 5 Hours for Show Posts Points')

txt = '{hour}:00: {avg:.1f} average points per post'
for i in range(5):
    hour = sorted_swap[i][0]
    avg = sorted_swap[i][1]
    print(txt.format(hour=hour, avg=avg))

Top 5 Hours for Show Posts Points
23:00: 42.4 average points per post
12:00: 41.7 average points per post
22:00: 40.3 average points per post
00:00: 37.8 average points per post
18:00: 36.3 average points per post


Show posts receive the highest average number of points per post when created at 23:00 (11:00 PM).

Now let's compare the results above with other posts.

In [10]:
avg_points_by_hour_other = avg_count_per_hour(other_posts, num_points_idx)

In [11]:
sorted_swap = sorted(avg_points_by_hour_other,key=itemgetter(1),reverse=True)
print('Top 5 Hours for Other Posts Points')

txt = '{hour}:00: {avg:.1f} average points per post'
for i in range(5):
    hour = sorted_swap[i][0]
    avg = sorted_swap[i][1]
    print(txt.format(hour=hour, avg=avg))

Top 5 Hours for Other Posts Points
13:00: 62.5 average points per post
14:00: 61.8 average points per post
15:00: 60.5 average points per post
10:00: 60.5 average points per post
19:00: 60.0 average points per post


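Other posts receive more points on average than show posts, in every one of their top five hours and overall (55 versus 28). The best hours also differ: other posts peak in the early afternoon (13:00 to 15:00), while show posts peak late in the evening (22:00 to 00:00) and around noon.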

## Conclusion

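The creation time and type of a post are associated with different levels of engagement: ask posts drew the most comments at 15:00, show posts the most points at 23:00, and other posts the most points at 13:00. These are averages only, however, and many other factors contribute to the overall engagement with a post.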