# Analyzing Submissions to Hacker News


## Introduction

[Hacker News](https://news.ycombinator.com/) is a popular website where users can post stories and exchange ideas with other members of an online community. When users want to ask the community a specific question, the post title is prefixed with 'Ask HN:'. If a member wants to share a project or something interesting with the community, the post title is prefixed with 'Show HN:'. Users can also submit links to other stories they find interesting and want to share with the community.

Members can 'upvote' and comment on posts, which affects their ranking in the newsfeed. It is not unusual for top-ranked posts on Hacker News to receive more than 100,000 views.

### Objectives

We will use Jupyter Notebook to analyze a dataset of posts submitted to [Hacker News](https://news.ycombinator.com/).

The purpose of our analysis is to determine the following:

1) Which type of post receives the most comments on average?

2) Do posts created at a certain time receive more comments on average?

### Data

The data for this analysis was obtained from [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). We will use a random sample of posts from the Kaggle dataset that received comments; posts that did not receive comments have been excluded from this analysis. The dataset includes more than 20,000 posts and spans from September 2015 until September 2016.

Each record contains data for up to seven fields. A description of each field is provided below:
    
`id`: a unique identifier

`title`: the title of the post

`url`: if applicable, the url link included in the post

`num_points`: the number of upvotes acquired by the post

`num_comments`: the number of comments on the post

`author`: the username of the member who submitted the post

`created_at`: the date and time the post was submitted


## Importing Data

The first step is to gather the data. We will read the dataset into Jupyter Notebook using the `reader` function from Python's `csv` module.

In [92]:
# Import the reader function from the CSV module

from csv import reader

In [93]:
# Read the csv file as a list of lists

# Open the file in a with-block so it is closed once the rows have been read
with open(r"C:\Users\awaul\OneDrive\Documents\Data\Hacker_News_Data\hacker_news.csv") as open_file:
    reader_file = reader(open_file)
    hn_data = list(reader_file)

In [94]:
# Print the first 5 rows of data

print(hn_data[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['10176923', "Why we aren't tempted to use ACLs on our Unix machines", 'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation', '34', '23', 'mjn', '9/6/2015 6:03'], ['10177011', 'Video Poker Hackers Cleared of Federal Charges', 'http://www.wired.com/2013/11/video--poker-case/', '23', '3', 'trengrj', '9/6/2015 7:25'], ['10177048', 'The Microservices Way  Weekly Microserivces Newsletter', 'https://www.getrevue.co/profile/microservices', '1', '1', 'britman', '9/6/2015 7:50'], ['10177077', 'The Hitler at Home stories of the pre-WWII American press', 'http://www.atlasobscura.com/articles/the-american-medias-awkward-fawning-over-hitlers-taste-in-home-decor', '75', '75', 'aaronbrethorst', '9/6/2015 8:05']]


The dataset is stored as a list of lists in the variable `hn_data`. To simplify our code, we will extract the header row and assign it to the variable `headers`.

In [95]:
# Extract the header row

headers = hn_data[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [96]:
# Remove the header row from hn_data

hn_data = hn_data[1:]
print(hn_data[:5])

[['10176923', "Why we aren't tempted to use ACLs on our Unix machines", 'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation', '34', '23', 'mjn', '9/6/2015 6:03'], ['10177011', 'Video Poker Hackers Cleared of Federal Charges', 'http://www.wired.com/2013/11/video--poker-case/', '23', '3', 'trengrj', '9/6/2015 7:25'], ['10177048', 'The Microservices Way  Weekly Microserivces Newsletter', 'https://www.getrevue.co/profile/microservices', '1', '1', 'britman', '9/6/2015 7:50'], ['10177077', 'The Hitler at Home stories of the pre-WWII American press', 'http://www.atlasobscura.com/articles/the-american-medias-awkward-fawning-over-hitlers-taste-in-home-decor', '75', '75', 'aaronbrethorst', '9/6/2015 8:05'], ['10177103', 'GM crops created superweed, say scientists (2005)', 'http://www.theguardian.com/science/2005/jul/25/gm.food', '58', '27', 'x5n1', '9/6/2015 8:24']]


In [97]:
# Review the field names in the header row

print("Data Fields: ", headers)

Data Fields:  ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [98]:
# Examine the number of records or rows in the dataset

print("There are", f'{len(hn_data):,}',"records in the dataset.")

There are 20,099 records in the dataset.


In [99]:
# Examine the timespan covered by the dataset

import datetime as dt
date_format = "%m/%d/%Y %H:%M"

dates = []

for row in hn_data:
    date = row[-1]
    date = dt.datetime.strptime(date,date_format)
    dates.append(date)

print("Dataset begins on: ", min(dates).strftime("%m/%d/%Y"))
print("Dataset ends on: ", max(dates).strftime("%m/%d/%Y"))


Dataset begins on:  09/06/2015
Dataset ends on:  09/26/2016
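
To make the format string concrete, here is a minimal sketch that parses a single timestamp copied from the first data row shown earlier. It assumes every `created_at` value follows the same month/day/year hour:minute layout.

```python
import datetime as dt

# Month/day/year hour:minute, matching the created_at field
date_format = "%m/%d/%Y %H:%M"

# Sample value copied from the first data row of the dataset
sample = "9/6/2015 6:03"

parsed = dt.datetime.strptime(sample, date_format)

print(parsed)                       # 2015-09-06 06:03:00
print(parsed.strftime("%m/%d/%Y"))  # 09/06/2015
print(parsed.hour)                  # 6
```

Because the parsed values are `datetime` objects, `min` and `max` compare them chronologically rather than as text.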


## Filter the Data

The first objective of this study is to identify which category of posts receives the most comments on average. We will segment the data into three groups: `ask_posts`, `show_posts` and `other_posts`.

We can use the `.startswith` method to filter posts by their title and append each record to the appropriate list.

In [100]:
# Filter the posts and assign to the appropriate category
# (ask_posts, show_posts or other_posts)

ask_posts = []
show_posts = []
other_posts = []

for post in hn_data:
    title = post[1].lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(post)
        
    elif title.startswith('show hn'):
        show_posts.append(post)
        
    else:
        other_posts.append(post)
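
The prefix test can be tried on individual titles. The sketch below wraps the same logic in a hypothetical helper, `categorize`, which is not part of the notebook; the second title is made up to show why the title is lowercased before matching.

```python
def categorize(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # lowercase first so the match ignores capitalization
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# First and third titles are copied from the dataset; the second is invented
print(categorize('Ask HN: How to keep young developers'))            # ask
print(categorize('show hn: a made-up lowercase title'))              # show
print(categorize('Video Poker Hackers Cleared of Federal Charges'))  # other
```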

In [101]:
# View the first 3 posts for each category

print(ask_posts[:3])
print('\n')
print(show_posts[:3])
print('\n')
print(other_posts[:3])

[['10177801', 'Ask HN: How to keep young developers', '', '3', '6', 'orangeplus', '9/6/2015 14:53'], ['10182770', 'Ask HN: If you are learning Chinese', '', '1', '2', 'goodcharacters', '9/7/2015 19:54'], ['10182780', 'Ask HN: Are freemium microservices a thing?', '', '1', '2', 'hyperpallium', '9/7/2015 19:56']]


[['10177459', 'Show HN: AppyPaper  Gift wrap with app icons printed on it', 'http://www.appypaper.com/', '6', '4', 'submitstartup', '9/6/2015 12:38'], ['10179920', 'Show HN: Easiest way to build html tables in React', 'https://github.com/legitcode/table', '3', '2', 'zackify', '9/7/2015 3:20'], ['10180369', 'Show HN: Chemozart  molecule editor and visualizer with mechanics calculators', 'https://github.com/mohebifar/chemozart', '34', '17', 'mohebifar', '9/7/2015 6:50']]


[['10176923', "Why we aren't tempted to use ACLs on our Unix machines", 'https://utcc.utoronto.ca/~cks/space/blog/sysadmin/NoACLTemptation', '34', '23', 'mjn', '9/6/2015 6:03'], ['10177011', 'Video Poker Hacke

In [102]:
# Review the number of posts for each category 

print("Number of 'Ask HN' posts:", f'{len(ask_posts):,}')
print("Number of 'Show HN' posts:", f'{len(show_posts):,}')
print("Number of  'Other' posts:", f'{len(other_posts):,}')

Number of 'Ask HN' posts: 1,744
Number of 'Show HN' posts: 1,162
Number of  'Other' posts: 17,193


## Average Number of Comments by Category

We can write a function, `avg_comments`, to determine the average number of comments per post within each category. The function takes a list of posts and a column index; the index defaults to 4, the position of the `num_comments` field.

Writing a function for this task allows us to reuse the same code to calculate the average number of comments for every category.

In [103]:
# Which type of post receives the most comments on average?

# Use a function to calculate the average number of comments per category

def avg_comments(posts, index=4):
    # 'posts' avoids shadowing the built-in list type
    total_comments = 0
    for row in posts:
        num_comments = int(row[index])
        total_comments += num_comments

    return total_comments / len(posts)

In [104]:
# Which category of posts receives the most comments on average?

print("Average comments for 'Ask HN':",f'{avg_comments(ask_posts):.1f}')
print('\n')
print("Average comments for 'Show HN':",f'{avg_comments(show_posts):.1f}')
print('\n')
print("Average comments for 'Other':",f'{avg_comments(other_posts):.1f}')
print('\n')

Average comments for 'Ask HN': 14.0


Average comments for 'Show HN': 10.3


Average comments for 'Other': 26.9




**On average, 'Ask HN' posts generate more comments than 'Show HN' posts. Posts within the 'Other' category attract the greatest number of comments on average.**

## Best Times to Post on Hacker News

The next objective is to determine the best times to submit a post that will attract comments from the Hacker News community. We will write a series of functions that build, for each type of post, a table of the average number of comments per post for each hour of the day. Sorting each table by average comments in descending order shows the best hours for a user to submit a post.

In [105]:
# Write a function to calculate the frequency of posts by hour
# Recall, date_format = "%m/%d/%Y %H:%M"

def posts_by_hour(category_list):
    
    counts_by_hour = {}
    
    for post in category_list:
        date = post[-1]
        time = dt.datetime.strptime(date, date_format).strftime('%H')
        
        if time in counts_by_hour:
            counts_by_hour[time] += 1
        else:
            counts_by_hour[time] = 1
            
    return counts_by_hour


a = posts_by_hour(ask_posts)
print(a)

{'14': 107, '19': 110, '15': 116, '20': 80, '00': 55, '01': 60, '03': 54, '07': 34, '16': 108, '22': 71, '05': 46, '13': 85, '10': 59, '11': 58, '17': 100, '23': 68, '12': 73, '06': 44, '18': 109, '09': 45, '04': 47, '21': 109, '02': 58, '08': 48}


The keys in the `counts_by_hour` dictionary represent the hour of the day; the values represent the number of posts submitted during that hour.

Example: A total of 107 Ask HN posts were submitted during the 14:00 hour (2 PM).

In [115]:
print(posts_by_hour(ask_posts)['14'])

107
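
The same tally can be written with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's method; `posts_by_hour_counter` is a hypothetical name, and the three sample rows are the 'Ask HN' posts printed earlier.

```python
from collections import Counter
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Three 'Ask HN' rows copied from the output shown earlier
sample_posts = [
    ['10177801', 'Ask HN: How to keep young developers', '', '3', '6', 'orangeplus', '9/6/2015 14:53'],
    ['10182770', 'Ask HN: If you are learning Chinese', '', '1', '2', 'goodcharacters', '9/7/2015 19:54'],
    ['10182780', 'Ask HN: Are freemium microservices a thing?', '', '1', '2', 'hyperpallium', '9/7/2015 19:56'],
]

def posts_by_hour_counter(category_list):
    # Extract the two-digit hour from each timestamp and let Counter tally them
    hours = (
        dt.datetime.strptime(post[-1], date_format).strftime('%H')
        for post in category_list
    )
    return dict(Counter(hours))

print(posts_by_hour_counter(sample_posts))  # {'14': 1, '19': 2}
```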


In [106]:
# Write a function to calculate the total number of comments for posts submitted in a particular time range

def comments_by_hour(category_list):
    com_by_hour = {}
    for post in category_list:
        date = post[-1]
        time = dt.datetime.strptime(date, date_format).strftime('%H')
        num_comments = int(post[4])
        
        if time in com_by_hour:
            com_by_hour[time] += num_comments
        else:
            com_by_hour[time] = num_comments
            
    return com_by_hour

b = comments_by_hour(ask_posts)
print(b)

{'14': 1416, '19': 1188, '15': 4477, '20': 1722, '00': 447, '01': 683, '03': 421, '07': 267, '16': 1814, '22': 479, '05': 464, '13': 1253, '10': 793, '11': 641, '17': 1146, '23': 543, '12': 687, '06': 397, '18': 1439, '09': 251, '04': 337, '21': 1745, '02': 1381, '08': 492}


The keys in the `com_by_hour` dictionary represent the hour of the day; the values represent the total number of comments received by posts submitted during that hour.

Example: Ask HN posts submitted during the 14:00 hour (2 PM) received a total of 1,416 comments.

In [107]:
print(comments_by_hour(ask_posts)['14'])

1416


In [108]:
# Write a function to calculate the average number of comments for posts submitted in a particular time range

def avg_by_hour(category_comments_by_hr, category_posts_by_hour):
    # Both dictionaries are keyed by hour, so each comment total
    # is divided by the post count for the same hour
    averages = []

    for hr in category_comments_by_hr:
        averages.append([hr, category_comments_by_hr[hr] / category_posts_by_hour[hr]])
    
    return averages

In [121]:
print(avg_by_hour(ask_hn_com_by_hr, ask_hn_posts_by_hr)[0])

['14', 13.233644859813085]
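
As a worked check, the average for the 14:00 hour is the comment total divided by the post count for that hour. The sketch below uses the two figures reported earlier for 'Ask HN' posts; the variable names are illustrative only.

```python
# Figures reported earlier for 'Ask HN' posts submitted during the 14:00 hour
comments_14 = 1416  # total comments
posts_14 = 107      # number of posts

avg_14 = comments_14 / posts_14
print(round(avg_14, 2))  # 13.23
```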


In [120]:
# Write a function to sort avg_by_hour results in descending order

def sort_by_avg_comments(category_avg_by_hr):
    swap_columns = []
    for row in category_avg_by_hr:
        swap_columns.append([row[1],row[0]])
    sorted_swap = sorted(swap_columns, reverse=True)
    return sorted_swap

In [122]:
print(sort_by_avg_comments(ask_hn_avg_by_hr))

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
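
Swapping the columns works because lists are compared element by element. An alternative sketch is to keep the `[hour, average]` order and pass a `key` function to `sorted`. The three pairs below use values taken from the results above.

```python
# [hour, average] pairs using values from the 'Ask HN' results above
avg_pairs = [
    ['14', 13.233644859813085],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first
sorted_pairs = sorted(avg_pairs, key=lambda pair: pair[1], reverse=True)

for hr, avg in sorted_pairs:
    print(hr, round(avg, 2))
# 15 38.59
# 02 23.81
# 14 13.23
```

With this variant each row stays in `[hour, average]` order, so any code that unpacks the rows would need to read the hour first.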


In [110]:
# Write a function to format the results

def f_print_top5(title, list_sort_by_avg):
    print(title)
    for avg, hr in list_sort_by_avg[:5]:
        print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%I:%M %p"),avg))
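
The formatting step parses the two-digit hour with `%H` and writes it back out on a 12-hour clock with `%I:%M %p`. A minimal sketch of that conversion follows; `to_12_hour` is a hypothetical helper name, and the AM/PM text assumes an English locale.

```python
import datetime as dt

def to_12_hour(hr):
    """Convert a two-digit 24-hour string such as '15' to a label such as '03:00 PM'."""
    return dt.datetime.strptime(hr, "%H").strftime("%I:%M %p")

for hr in ['15', '02', '00']:
    print(hr, '->', to_12_hour(hr))
# 15 -> 03:00 PM
# 02 -> 02:00 AM
# 00 -> 12:00 AM
```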

Now we can call on the functions to produce a frequency table for each category. We will start with the posts prefixed with Ask HN, then we will compare the results for the Show HN and Other categories.

In [111]:
# When are the best times for Ask HN posts?

ask_hn_com_by_hr = comments_by_hour(ask_posts)

ask_hn_posts_by_hr = posts_by_hour(ask_posts)

ask_hn_avg_by_hr = avg_by_hour(ask_hn_com_by_hr, ask_hn_posts_by_hr)

best_time_to_post_ask_hn = sort_by_avg_comments(ask_hn_avg_by_hr)

f_print_top5("Best Times to Post for 'Ask HN' Comments",best_time_to_post_ask_hn)

Best Times to Post for 'Ask HN' Comments
03:00 PM: 38.59 average comments per post
02:00 AM: 23.81 average comments per post
08:00 PM: 21.52 average comments per post
04:00 PM: 16.80 average comments per post
09:00 PM: 16.01 average comments per post


The best time to ask the Hacker News community a question is generally in the afternoon or evening hours. Posts submitted at 3 PM stand well ahead of every other hour, and 2 AM is the one overnight hour in the top five.

In [113]:
# When are the best times for Show HN posts?

show_hn_com_by_hr = comments_by_hour(show_posts)

show_hn_posts_by_hr = posts_by_hour(show_posts)

show_hn_avg_by_hr = avg_by_hour(show_hn_com_by_hr, show_hn_posts_by_hr)

best_time_to_post_show_hn = sort_by_avg_comments(show_hn_avg_by_hr)

f_print_top5("Best Times to Post for 'Show HN' Comments", best_time_to_post_show_hn)


Best Times to Post for 'Show HN' Comments
06:00 PM: 15.77 average comments per post
12:00 AM: 15.71 average comments per post
02:00 PM: 13.44 average comments per post
11:00 PM: 12.42 average comments per post
10:00 PM: 12.39 average comments per post


The best time to submit a post to show the Hacker News community a project that you are working on is generally later in the evening.

In [114]:
# When are the best times for Other Posts?

other_com_by_hr = comments_by_hour(other_posts)

other_posts_by_hr = posts_by_hour(other_posts)

other_avg_by_hr = avg_by_hour(other_com_by_hr, other_posts_by_hr)

best_time_to_post_other = sort_by_avg_comments(other_avg_by_hr)

f_print_top5("Best Times to Post 'Other Content' for Comments", best_time_to_post_other)

Best Times to Post 'Other Content' for Comments
02:00 PM: 32.33 average comments per post
01:00 PM: 30.90 average comments per post
12:00 PM: 30.35 average comments per post
11:00 AM: 29.59 average comments per post
03:00 PM: 29.52 average comments per post


The afternoon is generally the best time to submit a post with 'Other Content' to generate comments from the Hacker News community.
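
The three cells above repeat the same four calls for each category. As a sketch only, the steps could be folded into one hypothetical helper, `top_hours`, that counts posts and sums comments in a single pass. It is not part of the notebook, and it assumes the same row layout (comments at index 4, timestamp last).

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"

def top_hours(posts, n=5):
    """Return up to n [hour, average comments] pairs, highest average first."""
    totals = {}  # hour -> [post count, comment total]
    for post in posts:
        hour = dt.datetime.strptime(post[-1], DATE_FORMAT).strftime("%H")
        entry = totals.setdefault(hour, [0, 0])
        entry[0] += 1
        entry[1] += int(post[4])

    averages = [[hour, total / count] for hour, (count, total) in totals.items()]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]

# Three 'Ask HN' rows copied from the output shown earlier
sample_posts = [
    ['10177801', 'Ask HN: How to keep young developers', '', '3', '6', 'orangeplus', '9/6/2015 14:53'],
    ['10182770', 'Ask HN: If you are learning Chinese', '', '1', '2', 'goodcharacters', '9/7/2015 19:54'],
    ['10182780', 'Ask HN: Are freemium microservices a thing?', '', '1', '2', 'hyperpallium', '9/7/2015 19:56'],
]

print(top_hours(sample_posts))  # [['14', 6.0], ['19', 2.0]]
```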

## Conclusion

To summarize our results, we found that on average 'Ask HN' posts received more comments than posts in the 'Show HN' category. But overall, posts to the 'Other' category received the most comments, on average, from the Hacker News community.

The afternoon is the best time for members to post 'Other' content to generate comments from the Hacker News community. The top 5 hours to submit a post are between 11 AM and 3 PM.

The best time to submit a post to ask the Hacker News community a question is generally later in the afternoon or early evening hours. The best hours to submit posts to 'Ask HN' were 3 PM, 4 PM, 8 PM, 9 PM and 2 AM.

Posts to 'Show HN' attract more comments on average if they are submitted later in the afternoon or evening hours. The best hours to submit a 'Show HN' post were 2 PM, 6 PM, 10 PM, 11 PM and 12 AM.