# Exploring Hacker News Posts

### Context 

[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as 'posts') are voted and commented upon. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

In this project, we'll work with a [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts) of submissions to Hacker News (HN).

| Column | Description |
| --- | --- |
| `id` | The unique identifier from Hacker News for the post |
| `title` | The title of the post |
| `url` | The URL that the post links to, if the post has a URL |
| `num_points` | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| `num_comments` | The number of comments that were made on the post |
| `author` | The username of the person who submitted the post |
| `created_at` | The date and time at which the post was submitted |
  
  
  

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. 

- Users submit `Ask HN` posts to ask the Hacker News community a specific question.
   - `Ask HN: How to improve my personal website?`
   - `Ask HN: Am I the only one outraged by Twitter shutting down share counts?`
   
   
- Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.
   - `Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform`
   - `Show HN: Something pointless I made`


### Aim 

The aim of this project is to compare these two types of posts (`Ask HN` and `Show HN`) to determine the following:

1. Do `Ask HN` or `Show HN` receive more comments on average?
2. Do posts created at a certain time receive more comments on average?


### Skills / Libraries / Tools

We use only the Python standard library for this analysis, touching upon:
- Working with strings
- Dates and times
- Jupyter Notebook


### Read in our Data and Remove Headers from a List of Lists

We start by opening `hacker_news.csv` and converting it to a list of lists.

We remove the header row and store it separately.

We now have our header row `headers` and our data set `hn`.

In [1]:
from csv import reader

# Open the file with a context manager so it is closed once it has been read
with open('hacker_news.csv') as f:
    hn = list(reader(f))

# Examine the first 5 rows
for row in hn[:5]:
    print(row)
    print('\n')

# Split into header and data set
headers = hn[0]
hn = hn[1:]
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
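
As a variation on the cell above, the header can be peeled off with `next()` while the rows are being read, so the header row never ends up in the data list. This is a minimal sketch: it reads from an in-memory string (`sample_csv`, a hypothetical two-row excerpt) rather than `hacker_news.csv` so that it runs on its own, and the names `sample_headers` and `sample_hn` are illustrative only.

```python
import io
from csv import reader

# Hypothetical excerpt standing in for hacker_news.csv
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

# With the real file this would be: with open('hacker_news.csv') as f:
with io.StringIO(sample_csv) as f:
    rows = reader(f)
    sample_headers = next(rows)   # the first row is the header
    sample_hn = list(rows)        # the remaining rows are the data

print(sample_headers)
print(len(sample_hn))
```

Either approach leaves the same two objects: a single header list and a list of data rows.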

## Extracting Ask HN and Show HN Posts

As we are only interested in posts whose titles begin with `Ask HN` or `Show HN`, we will work on extracting these posts from the data set.

We will create two new lists of lists containing just the posts with those titles, plus a third list for all other posts.

In [2]:
# Instantiate lists that will be used to store our data
ask_posts = []
show_posts = []
other_posts = []

# Iterate over data set and append rows to lists as appropriate
# Titles are lowercased first so the prefix match is case-insensitive
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(len(hn) == len(ask_posts) + len(show_posts) + len(other_posts))


1744
1162
17194
True


We now have our two lists of lists, `ask_posts` and `show_posts`.
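
The prefix test used above can be isolated in a small helper, which makes it easy to try against individual titles. This is only a sketch: `classify_title` and `sample_titles` are hypothetical names, and the last title is an invented example showing that the match is case-insensitive.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'show hn: lowercase still counts',   # invented title for illustration
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```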


## Calculate the average number of comments for `Ask HN` and `Show HN` posts

Let's see which category receives more comments on average, `Ask HN` or `Show HN` posts.

In [3]:
# Get avg number of comments for `Ask HN` posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments per Ask post: ', round(avg_ask_comments, 2))


# Get avg number of comments for `Show HN` posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments per Show post: ', round(avg_show_comments, 2))


# Print results
if avg_ask_comments > avg_show_comments:
    print('Ask posts receive {} more comments on average than Show posts.'.format(round(avg_ask_comments - avg_show_comments, 2)))
else:
    print('Show posts receive {} more comments on average than Ask posts.'.format(round(avg_show_comments - avg_ask_comments, 2)))


Average number of comments per Ask post:  14.04
Average number of comments per Show post:  10.32
Ask posts receive 3.72 more comments on average than Show posts.


As we can see above, `Ask HN` posts receive more comments on average than `Show HN` posts.

> Average number of comments per Ask post:  14.04

> Average number of comments per Show post:  10.32

> Ask posts receive 3.72 more comments on average than Show posts.

Given this, we will focus our remaining analysis on **`Ask HN` posts only**.
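
The two averaging loops follow the same pattern, so the calculation could be factored into one function. The sketch below assumes the comment count sits at index 4, as in the data set; `average_comments` and the two sample rows are hypothetical and are not taken from the data.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts (stored as strings) over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: first example', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second example', '', '2', '4', 'user_b', '8/4/2016 12:10'],
]

sample_avg = average_comments(sample_posts)
print(round(sample_avg, 2))
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it would reproduce the two averages computed above.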



## Find the Number of `Ask HN` Posts and Comments by Hour Created

We want to find out if the *time of day* at which a post is created impacts the number of comments it receives.

Considering the ask posts only, we will calculate: 
1. The number of ask posts created in each hour of the day along with the number of comments received
2. The average number of comments ask posts receive by hour created

Let's look at the first step: calculating the number of ask posts and their comments by hour created.

In [4]:
# Import the datetime module
import datetime as dt


# Create a list of lists, containing the created time and number of comments for each ask post
result_list = []

for row in ask_posts:
    created_at = row[-1]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

    
# Dictionaries holding the number of posts and the number of comments for each hour
counts_by_hour = {}
comments_by_hour = {}


# Turn the created time string into a usable format
# First turn it into a datetime object, then a string with just the hour info
for row in result_list:

    # Parse the created-at string (month/day/year hour:minute) into a datetime object
    dt_obj = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')

    # Convert the datetime object to a string displaying just the hour
    hour_str = dt_obj.strftime('%H')

    num_comments = row[1]

    if hour_str not in counts_by_hour:
        counts_by_hour[hour_str] = 1
        comments_by_hour[hour_str] = num_comments
    else:
        counts_by_hour[hour_str] += 1
        comments_by_hour[hour_str] += num_comments

# Output
print('Counts by hour:', counts_by_hour)
print('\n')
print('Comments by hour:', comments_by_hour)


Counts by hour: {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Comments by hour: {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


We now know: 
- How many posts were created in each hour
- How many comments were received by the posts created in each hour
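
The same tally can be written without the membership check by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch on three made-up `[created_at, num_comments]` pairs; the `sample_` names are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Made-up pairs in the same shape as result_list
sample_results = [
    ['8/4/2016 11:52', 52],
    ['8/4/2016 11:05', 3],
    ['6/17/2016 0:01', 1],
]

sample_counts = defaultdict(int)     # missing hours start at 0
sample_comments = defaultdict(int)

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```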


## Calculate the Average Number of Comments for `Ask HN` Posts by Hour

Next we tackle step 2 - calculating the average number of comments per post created during each hour of the day.

Let's create a list of lists containing: 
- The hours that posts were created
- The average number of comments those posts received.

In [5]:
avg_by_hour = []

# Both dictionaries share the same keys, so each hour can be looked up directly
for hour in comments_by_hour:
    average = round(comments_by_hour[hour] / counts_by_hour[hour], 3)
    avg_by_hour.append([hour, average])

avg_by_hour

[['09', 5.578],
 ['13', 14.741],
 ['10', 13.441],
 ['14', 13.234],
 ['16', 16.796],
 ['23', 7.985],
 ['12', 9.411],
 ['17', 11.46],
 ['15', 38.595],
 ['21', 16.009],
 ['20', 21.525],
 ['02', 23.81],
 ['18', 13.202],
 ['03', 7.796],
 ['05', 10.087],
 ['19', 10.8],
 ['01', 11.383],
 ['22', 6.746],
 ['08', 10.25],
 ['04', 7.17],
 ['00', 8.127],
 ['06', 9.023],
 ['07', 7.853],
 ['11', 11.052]]

We now know the average number of comments per post for each hour of the day. However, the results are not easy to read in this format.
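
One quick way to make the averages easier to scan is to key them by hour and iterate in chronological order. The sketch below uses three of the hourly totals from the output above; the `sample_` names are hypothetical.

```python
# Three hours taken from the counts and comments printed earlier
sample_counts = {'15': 116, '02': 58, '09': 45}
sample_comments = {'15': 4477, '02': 1381, '09': 251}

# Zero-padded hour strings sort chronologically, so sorted() is enough
sample_avg = {
    hour: round(sample_comments[hour] / sample_counts[hour], 3)
    for hour in sorted(sample_comments)
}

for hour, average in sample_avg.items():
    print(hour, average)
```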


## Sorting and Printing Values from a List of Lists

Let's finish by sorting the list of lists and printing the five highest values in a format that is easier to read.

In [6]:
# Create a list that is the same as the above but with swapped columns
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print('swap_avg_by_hour:', swap_avg_by_hour, '\n')


# Sort swap_avg_by_hour in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('sorted_swap:', sorted_swap[:5], '\n')
print('Top 5 Hours for Ask Posts Comments')


# Make our results more readable
for average, hour in sorted_swap[:5]:
    
    # Parse hour string to create datetime object
    hour_dt_obj = dt.datetime.strptime(hour, '%H')
    
    # Convert datetime object to string in format HH:MM, round average to precision 2 and format
    print('{:%H:%M}: {:.2f} average comments per post'.format(hour_dt_obj, average))
    

swap_avg_by_hour: [[5.578, '09'], [14.741, '13'], [13.441, '10'], [13.234, '14'], [16.796, '16'], [7.985, '23'], [9.411, '12'], [11.46, '17'], [38.595, '15'], [16.009, '21'], [21.525, '20'], [23.81, '02'], [13.202, '18'], [7.796, '03'], [10.087, '05'], [10.8, '19'], [11.383, '01'], [6.746, '22'], [10.25, '08'], [7.17, '04'], [8.127, '00'], [9.023, '06'], [7.853, '07'], [11.052, '11']] 

sorted_swap: [[38.595, '15'], [23.81, '02'], [21.525, '20'], [16.796, '16'], [16.009, '21']] 

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


We have found the top 5 hours in which to create an `Ask HN` post, in order to receive the most comments.
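
Swapping the columns is one way to sort by average. An alternative is to pass a `key` function to `sorted()`, which leaves the `[hour, average]` order intact. This is a minimal sketch on four of the pairs computed above; `sample_avg_by_hour` and `top_three` are hypothetical names.

```python
# Four [hour, average] pairs taken from avg_by_hour
sample_avg_by_hour = [['09', 5.578], ['15', 38.595], ['02', 23.81], ['20', 21.525]]

# Sort on the second element (the average), highest first
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_three:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```

The `key` approach also avoids comparing hour strings when two averages happen to be equal.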

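The hours above are in **US Eastern Time**, the [timezone](https://www.kaggle.com/hacker-news/hacker-news-posts/home) used by the data set.

To make this locally applicable, we convert the times to our local timezone in eastern Australia, labelled AEDT below.

At the time of writing (August), local time is 14 hours ahead of US Eastern time. Strictly speaking, that offset is between AEST (UTC+10) and EDT (UTC-4), since daylight saving is not in effect in eastern Australia in August, so the offset differs at other times of the year.

To do this, we create a `timedelta` of 14 hours and add it to the times above.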

In [8]:
# Create time delta
time_diff_td = dt.timedelta(hours=14)

print('Current time difference between US Eastern and AEDT:', time_diff_td, '\n')

# List times in AEDT
aedt_list = []

for row in sorted_swap:
    dt_obj = dt.datetime.strptime(row[1], '%H')
    aedt_time_obj = dt_obj + time_diff_td   #AEDT is 14 hours ahead
    aedt_time_str = aedt_time_obj.strftime('%H')
    aedt_list.append([row[0], aedt_time_str])

# Output top 5    
print('Top 5 Hours for Ask Posts Comments - AEDT:')

# change the format of the hour, round the average to precision 2, print in a more readable format.
for average, hour in aedt_list[:5]:
    
    # parse hour string to create datetime object
    hour_dt_obj = dt.datetime.strptime(hour, '%H')
    
    # convert datetime object to string in format HH:MM, round average to precision 2 and print in a nice format
    print('{:%H:%M} - {:.2f} average comments per post'.format(hour_dt_obj, average))
    

Current time difference between US Eastern and AEDT: 14:00:00 

Top 5 Hours for Ask Posts Comments - AEDT:
05:00 - 38.59 average comments per post
16:00 - 23.81 average comments per post
10:00 - 21.52 average comments per post
06:00 - 16.80 average comments per post
11:00 - 16.01 average comments per post
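
The 14-hour figure can be cross-checked with timezone-aware datetimes. This is a minimal sketch, assuming fixed August offsets of UTC-4 for US Eastern (EDT) and UTC+10 for eastern Australia (AEST). Both offsets and the sample date are assumptions for illustration and are not taken from the data set.

```python
import datetime as dt

# Assumed fixed offsets for August: EDT is UTC-4, AEST is UTC+10
us_eastern_august = dt.timezone(dt.timedelta(hours=-4), 'EDT')
sydney_august = dt.timezone(dt.timedelta(hours=10), 'AEST')

# Hypothetical posting time: 15:00 US Eastern on 4 August 2016
eastern_time = dt.datetime(2016, 8, 4, 15, 0, tzinfo=us_eastern_august)
local_time = eastern_time.astimezone(sydney_august)

# Local wall-clock time and the gap between the two offsets
print(local_time.strftime('%H:%M %Z'))
print(local_time.utcoffset() - eastern_time.utcoffset())
```

Because the offsets are fixed, this sketch does not follow daylight saving changes. A timezone database would be needed for dates at other times of the year.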


# Conclusion

We have been able to answer the questions posed at the beginning of this analysis:

#### 1. Do `Ask HN` or `Show HN` receive more comments on average?

`Ask HN` posts receive more comments on average.
   

#### 2. Do posts created at a certain time receive more comments on average?

Yes, the average number of comments varies considerably with the hour at which a post is created.

We have found the top 5 hours of the day (US Eastern Time) in which to create an `Ask HN` post in order to receive the most comments.

**Top 5 Hours for Ask Posts Comments - US Eastern**

- 15:00: 38.59 average comments per post
- 02:00: 23.81 average comments per post
- 20:00: 21.52 average comments per post
- 16:00: 16.80 average comments per post
- 21:00: 16.01 average comments per post




**Note: to make this information locally applicable, we have converted the times from US Eastern to AEDT by adding 14 hours**

**Top 5 Hours for Ask Posts Comments - AEDT**

- 05:00: 38.59 average comments per post
- 16:00: 23.81 average comments per post
- 10:00: 21.52 average comments per post
- 06:00: 16.80 average comments per post
- 11:00: 16.01 average comments per post
