# Exploring Hacker News Posts

## Introduction

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

For this project, we are interested in posts with titles that begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question. Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just something interesting. We'll examine these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Note that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

## Getting Started

We'll start by importing the `reader` function from the `csv` module and reading the dataset into a list of lists.

In [1]:
from csv import reader

# Read every row into a list of lists; the `with` block closes the file for us
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to remove the row containing the column headers.

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

We can see that each row consists of a unique post ID, title, URL, number of upvotes, number of comments, author, and date and time of creation. The columns that concern us are the title, the number of comments, and the date and time of creation: `title`, `num_comments`, and `created_at`.

Now we're ready to filter our data. Since we're only concerned with post titles beginning with `Ask HN` or `Show HN`, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either `Ask HN` or `Show HN`, we'll use the string method `startswith`. We can also make the match case-insensitive by first calling the `lower` method, which returns a lowercase copy of the string it is called on.
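As a quick illustration of how the two methods combine, here is a small sketch. The titles in `sample_titles` are made up for this example and are not rows from the dataset.

```python
# Hypothetical titles, made up for illustration only
sample_titles = [
    'Ask HN: How do you learn?',
    'ASK HN: Favorite editor?',
    'Show HN: My project',
    'Tell me about Ask HN',
]

# lower() returns a new lowercase string, so the original title is unchanged;
# startswith() then checks only the beginning of that lowercase copy
ask_flags = [title.lower().startswith('ask hn') for title in sample_titles]

for title, is_ask in zip(sample_titles, ask_flags):
    print(title, '->', is_ask)
```

Only the first two titles match: the third starts with a different prefix, and the last one contains `Ask HN` but does not begin with it.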

Using these methods, we're going to separate posts beginning with `Ask HN` and `Show HN` (and case variations) into their own lists.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that we have our lists, we'll be able to determine if ask posts or show posts receive more comments on average.

In [4]:
# Average for ask posts

total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [5]:
# Average for show posts

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


We see that ask posts receive more comments on average: about 14 comments per post, compared with about 10 for show posts.
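The two cells above repeat the same loop. One possible way to avoid the duplication is a small helper function. This is only a sketch: `average_comments` is a hypothetical name, the toy rows are made up, and it assumes the comment count sits at index 4 as it does in this dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment-count column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Toy rows in the same column layout as the dataset (values are made up)
toy_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'alice', '1/1/2016 10:00'],
    ['2', 'Ask HN: B', '', '3', '20', 'bob', '1/2/2016 11:00'],
]

print(average_comments(toy_posts))  # 15.0
```

With the real lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.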

## Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts receive more comments on average, we'll focus our remaining analysis on just these posts. Next, we'll determine whether ask posts created at a certain *time* attract more comments on average. We'll use the following steps to perform this analysis:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

For the first step, we'll use the `datetime` module to work with the data in the `created_at` column. Recall that the `datetime.strptime()` constructor parses a date stored as a string and returns a `datetime` object.
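As a minimal sketch of that parsing step, here is one `created_at` value from the rows shown earlier run through `strptime` with the format string this notebook uses:

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
created_at = '6/17/2016 0:01'  # taken from one of the data rows shown earlier

parsed = dt.datetime.strptime(created_at, date_format)

hour_as_int = parsed.hour            # 0
hour_as_str = parsed.strftime('%H')  # '00', zero-padded

print(hour_as_int, hour_as_str)
```

The notebook uses the zero-padded string form as the dictionary key, which is why the hours appear as `'09'` or `'00'` in the output.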

In [6]:
# First import the datetime module

import datetime as dt

# Create a modified list with which to construct the dictionaries

result_list = []

for row in ask_posts:
    result_list.append(
        [row[6], int(row[4])]
    )

# Create dictionaries - one for ask posts created per hour
# Another for the corresponding number of comments
    
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    comments = row[1]
    post_hour = dt.datetime.strptime(date, date_format).strftime('%H')
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = comments
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += comments
        
comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

## Calculating the Average Number of Comments for Ask HN Posts by Hour

To recap, we just created two dictionaries:

- `counts_by_hour`: contains the number of ask posts created during each hour of the day.
- `comments_by_hour`: contains the corresponding number of comments ask posts created at each hour received.
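The same two dictionaries could also be built without the `if`/`else` branch by using `dict.get` with a default of 0. This is a sketch of an alternative, not the approach used above, and `toy_rows` is made-up data.

```python
import datetime as dt

# Toy (created_at, num_comments) pairs, made up for illustration
toy_rows = [
    ('8/4/2016 11:52', 6),
    ('8/5/2016 11:10', 4),
    ('8/5/2016 9:30', 2),
]

counts_by_hour = {}
comments_by_hour = {}

for created_at, comments in toy_rows:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    # get() returns 0 the first time an hour is seen, so no branch is needed
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + comments

print(counts_by_hour)
print(comments_by_hour)
```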

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Here we initialized an empty list and assigned it to `avg_by_hour`, then iterated over the keys of `comments_by_hour` and appended to `avg_by_hour` a two-element list for each hour:

- The first element is the key from `comments_by_hour`.
- The second element is the average number of comments per post. To get it, we divided the value for that key in `comments_by_hour` (i.e. the total number of comments on posts created in a given hour) by the value for the same key in `counts_by_hour` (i.e. the total number of posts created in that hour).

## Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. We're going to finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [8]:
# Create a new list that equals `avg_by_hour` with swapped columns

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append(
    [row[1], row[0]]
    )
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [9]:
# Create another new list with the values of `swap_avg_by_hour` in descending order

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# I will display all the rows here for edification purposes

sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
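Swapping the columns works because `sorted` compares lists element by element. An alternative sketch that skips the swapped copy is to pass a `key` function that picks out the average. The sample below reuses four of the `[hour, average]` pairs computed above; `avg_by_hour_sample` and `top_hours` are names introduced only for this example.

```python
# Four [hour, average] pairs copied from the avg_by_hour output above
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), largest first
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours:
    print(hour, round(avg, 2))
```

One difference to keep in mind: with a `key` function, rows with equal averages keep their original relative order, whereas the swapped version breaks ties by comparing the hour strings.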

In [10]:
# Present the findings in an easily readable form - stick to top 5 hours

print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    post_h = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    post_c = row[0]
    listing = '{}: {:.2f} average comments per post'.format(post_h, post_c)
    print(listing)

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


At last we have the findings we set out to discover. After converting from the 24-hour format, the hour with the highest average number of comments per post begins at 3pm. That average is about 60% higher than the average for the second-highest hour.
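The 60% figure can be checked directly from the two highest averages printed above. A quick sketch of the arithmetic:

```python
top_avg = 38.5948275862069       # 15:00, from the sorted output above
second_avg = 23.810344827586206  # 02:00, from the sorted output above

# Relative increase of the top hour over the second-place hour, in percent
pct_increase = (top_avg - second_avg) / second_avg * 100

print(round(pct_increase, 1))
```

This comes out at roughly 62%, which rounds to the figure quoted above.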

According to the [original dataset's](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) compiler, the post creation times in the `created_at` column reflect US Eastern Standard Time. Hence it would be most accurate to report our hour as 3pm EST, and luckily that also happens to be my time zone.

## Conclusion

In this project, we analyzed ask posts and show posts to determine which type of post and which creation time receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 15:00 and 16:00 (3pm - 4pm EST).

With all that said, there are a few caveats to this analysis. Firstly, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that *of the posts that received comments,* ask posts received more comments on average and ask posts created between 15:00 and 16:00 received the most comments on average.

Additionally, we noted that the average number of comments in our most popular hour is about 60% higher than in our second-place hour. This seems pretty high, and because we only worked with averages, it is entirely possible that one or a few viral posts artificially inflated the average number of comments. However, testing this would necessitate a deeper analysis of the dataset, and for now that is beyond the scope of this project.
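One way such a check could start is by comparing the mean with the median for a given hour, since the median is much less sensitive to a single extreme value. The sketch below uses made-up comment counts, not values from the dataset, purely to show the effect.

```python
from statistics import mean, median

# Hypothetical comment counts for one hour:
# seven ordinary posts plus a single viral post
comment_counts = [3, 5, 4, 6, 2, 7, 5, 900]

hour_mean = mean(comment_counts)      # pulled far upward by the outlier
hour_median = median(comment_counts)  # stays near the typical post

print(hour_mean, hour_median)
```

If the real 15:00 data showed a similarly large gap between mean and median, that would suggest a few posts are driving the average.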