# Determining the Kind of Content on the Hacker News Website that Prompts Higher Community Interaction

## Introduction
The objective of this project is to find what type of post on the Hacker News website tends to receive high numbers of comments from the community. Within this subset of posts, a second objective is to determine how the time of day at which a post is created affects the number of comments it receives, and which time of day is most likely to attract the most comments. There is a particular focus on 'Ask HN' and 'Show HN' posts.

The dataset used for this investigation comes from [here](https://www.kaggle.com/hacker-news/hacker-news-posts/data), though the .csv file processed herein, and downloaded from the Dataquest course, may have been cleaned up somewhat. This data set includes data on posts created in the 12 months leading up to 26th September 2016. The data includes the following columns:

- title: title of the post (self explanatory)

- url: the url of the item being linked to

- num_points: the number of upvotes the post received

- num_comments: the number of comments the post received

- author: the name of the account that made the post

- created_at: the date and time the post was made (the time zone is Eastern Time in the US)

Summary findings are that 'Ask HN' posts tend to receive more comments than 'Show HN' posts, and that 'Ask HN' posts created between the hours of 12:00 and 16:00 US Eastern time are most likely to receive high numbers of comments from the community.

## Importing the Data
To start with, the data from the 'hacker_news.csv' file is imported as a list of lists called `hn`, with the header row stored in a separate variable `headers`.

In [1]:
# import the data as a list of lists
from csv import reader
with open('./hacker_news.csv') as file:
    rows = list(reader(file))
headers = rows[0]    # keep the header row in its own list
hn = rows[1:]        # data rows only

# print the header and the first three rows of data
print(headers)
for row in hn[0:3]:
    print('\n', row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

 ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']

 ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']

 ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


## Filtering for the Posts of Interest
The posts of specific interest are the 'Ask HN' and 'Show HN' ones, so the imported data is split into three sets:

1) Ask HN posts, `ask_posts`

2) Show HN posts, `show_posts`

3) All other posts, `other_posts`

In [2]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()  # post title is in column index 1; lowercase once for case-insensitive matching
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
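
As a quick check of the prefix matching above, the same rule can be applied to a few made-up titles. This is only a minimal sketch: the `classify_title` helper and the sample titles are hypothetical and do not come from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, for illustration only
sample_titles = [
    'Ask HN: How do you organise your notes?',
    'SHOW HN: A tiny CSV viewer',
    'Why I ask HN for advice',   # 'ask hn' is not at the start, so this counts as 'other'
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Note that only the start of the title is checked, so a post that mentions 'Ask HN' later in its title falls into the 'other' group.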

## Determining the Average Numbers of Comments for Ask HN and Show HN Posts

In [3]:
# find average number of comments for Ask HN posts
total_ask_comments = 0
zero_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])  # the number of comments is in column index 4
    total_ask_comments += num_comments
    if num_comments == 0:
        zero_ask_comments += 1
avg_ask_comments = total_ask_comments / len(ask_posts)

# find average number of comments for Show HN posts
total_show_comments = 0
zero_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    if num_comments == 0:
        zero_show_comments += 1
avg_show_comments = total_show_comments / len(show_posts)

# print average comment numbers
avg_comments_string = "The average number of comments for {} posts is {}."
print(avg_comments_string.format("Ask HN", round(avg_ask_comments, 1)))
print(avg_comments_string.format("Show HN", round(avg_show_comments, 1)))

The average number of comments for Ask HN posts is 10.4.
The average number of comments for Show HN posts is 4.9.


Ask HN posts receive about 10 comments on average, and Show HN posts about 5. It should also be noted that around a quarter of Ask HN posts in this data set do not have any comments, while around half of Show HN posts do not have any comments. It is natural for Ask HN posts to get more community interaction, though, because by their nature they are likely looking for quick responses from community members.
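
The zero-comment counters in the cell above are collected but never printed. The shares quoted here could be derived along the following lines. The `comment_summary` helper and the sample rows are hypothetical, so the numbers this sketch prints are not the data set's figures.

```python
def comment_summary(posts, comments_index=4):
    """Return (average comments per post, share of posts with zero comments)."""
    counts = [int(post[comments_index]) for post in posts]
    average = sum(counts) / len(counts)
    zero_share = counts.count(0) / len(counts)
    return average, zero_share

# Hypothetical rows in the same column layout as `hn` (illustration only)
sample_posts = [
    ['1', 'Ask HN: A', '', '3', '0', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: B', '', '5', '12', 'user_b', '9/26/2016 4:10'],
    ['3', 'Ask HN: C', '', '2', '4', 'user_c', '9/26/2016 5:02'],
    ['4', 'Ask HN: D', '', '1', '0', 'user_d', '9/26/2016 6:45'],
]
average, zero_share = comment_summary(sample_posts)
print("Average comments: {:.1f}".format(average))
print("Share with no comments: {:.0%}".format(zero_share))
```

With the real data, calling `comment_summary(ask_posts)` and `comment_summary(show_posts)` would return both figures for each group.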

## How Time of Day Affects Number of Comments
Further assessment of this data set will focus on Ask HN posts, since they receive more comments on average. The code below counts the posts created in each hour of the day and totals the comments those posts received.

In [4]:
from datetime import datetime as dt

# extract post creation date and number of comments into separate list of lists
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# determine post count and comment total by hour
count_by_hour = {}
comments_by_hour = {}
for row in result_list:
    post_date = dt.strptime(row[0], '%m/%d/%Y %H:%M')
    post_hour = post_date.strftime('%H')
    if post_hour in count_by_hour:
        count_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += row[1]
    else:
        count_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = row[1]
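
To see what the parsing step does to a single value, the format string can be tried on one of the timestamps printed earlier. This sketch assumes the `created_at` strings always follow the month/day/year hour:minute layout seen in the sample rows.

```python
from datetime import datetime as dt

created_at = '9/26/2016 3:26'   # taken from the first data row shown earlier
post_date = dt.strptime(created_at, '%m/%d/%Y %H:%M')
post_hour = post_date.strftime('%H')   # zero-padded hour, as a string

print(post_date)
print(post_hour)
```

Because the hour is kept as a zero-padded string, sorting the dictionary keys alphabetically also sorts them chronologically.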

With the number of Ask HN posts categorised by hour of day along with their associated numbers of comments, an average number of comments can be derived. This is the average number of comments an Ask HN post would receive when posted at a given hour of the day.

In [5]:
# make a list of lists to store hour of day and average number of comments for posts in that hour
avg_by_hour = []
for hour in count_by_hour:
    avg = comments_by_hour[hour] / count_by_hour[hour]
    avg_by_hour.append([hour, avg])
for hour, avg in sorted(avg_by_hour):
    print(hour, ': ', round(avg, 1)) # print the list in order of hour of day

00 :  7.6
01 :  7.4
02 :  11.1
03 :  7.9
04 :  9.7
05 :  8.8
06 :  6.8
07 :  7.0
08 :  9.2
09 :  6.7
10 :  10.7
11 :  9.0
12 :  12.4
13 :  16.3
14 :  9.7
15 :  28.7
16 :  7.7
17 :  9.4
18 :  7.9
19 :  7.2
20 :  8.7
21 :  8.7
22 :  8.8
23 :  6.7


Let's look at the 5 hours with the highest average numbers of comments.

In [6]:
# swap the columns around on the avg_by_hour list so the average comes first
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

# sort the new list in reverse order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
template = "{hr}: {cmt:.2f} average comments per post."
for row in sorted_swap[0:5]:
    hour = dt.strptime(row[1], '%H')
    formatted_hour = hour.strftime('%H:%M')
    print(template.format(hr=formatted_hour, cmt=row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.
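
Swapping the columns is one way to sort by average. An alternative sketch, not the approach used in this notebook, passes a `key` function to `sorted` so the original [hour, average] order is kept. The sample values are copied from the rounded per-hour output for four of the hours, and the name `avg_by_hour_sample` is introduced here for illustration.

```python
# [hour, average comments] pairs, copied from the rounded per-hour output
avg_by_hour_sample = [['02', 11.1], ['12', 12.4], ['13', 16.3], ['15', 28.7]]

# sort by the second element of each pair (the average), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.1f} average comments per post.".format(hour, avg))
```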


The above analysis suggests that the best time to create an Ask HN post is between around 12:00 and 16:00, with the 15:00 hour standing out at an average of 28.7 comments per post. The 14:00 hour is an exception within this window, averaging 9.7. The data set logs post creation times in the US Eastern time zone, so 15:00 corresponds to 12:00 US Pacific time, which could be a time of day when website traffic tends to be at its highest.
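
The conversion between the two time zones can be sketched as a simple hour shift. This assumes a constant three-hour offset between US Eastern and US Pacific time, and the helper name `eastern_to_pacific_hour` is hypothetical.

```python
def eastern_to_pacific_hour(eastern_hour):
    """Convert an hour of day (0-23) from US Eastern to US Pacific time.

    Assumes a constant three-hour offset between the two zones.
    """
    return (eastern_hour - 3) % 24

for hour in (12, 13, 15):
    print("{:02d}:00 Eastern = {:02d}:00 Pacific".format(hour, eastern_to_pacific_hour(hour)))
```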

## Conclusions
The objectives of this project were to see which kind of post on the Hacker News website tends to receive higher numbers of comments from the community, and at what time of day a post is most likely to generate high numbers of comments.

'Ask HN' posts tend to receive more comments than 'Show HN' posts (about 10 per post on average, compared with about 5), though around 25% of the Ask HN posts in the data set have not received any comments.

Among the Ask HN posts, those created between around 12:00 and 16:00 US Eastern time are more likely to receive higher numbers of comments from the community.

This guided project was originally completed in May 2020.