<a href="https://colab.research.google.com/github/andrewbeeksma/Exploring-Hacker-News/blob/master/Exploring_Hacker_News_Posts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Hacker News

Hacker News is a site where users submit posts (questions, projects, etc.) and receive feedback from the community. Like Reddit, it is very popular in technology and startup circles. In this project, I explore a Kaggle data set containing information on Hacker News posts (you can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts)) and begin to draw insights and conclusions from the available data.

In [None]:
from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Notice that the first row in my list of lists hn is a header row. So, before moving forward in the cleaning and analysis process, I'm going to extract the header row and reassign hn so that it only contains rows of data.

In [None]:
header = hn[0]
hn = hn[1:]

print(header)
print(hn[:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]


For this project, I will primarily be exploring "Ask HN" and "Show HN" posts, so I will go through the rows and extract the data that is relevant to my analysis. I will start by initializing three empty lists and sorting each post into the list for its category, based on how its title begins.

In [None]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
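
To illustrate how the case-insensitive prefix check above behaves, here is a minimal sketch on a few made-up titles. The helper name categorize and the sample titles are assumptions for illustration only; they are not part of the data set or the original analysis.

```python
def categorize(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # lowercase first so 'ASK HN' and 'ask hn' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, not taken from the data set.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'ASK HN: Any tips for remote work?',
    'show hn: my weekend project',
    'A post that mentions Ask HN later in the title',
]

labels = [categorize(title) for title in sample_titles]
print(labels)
```

Only the start of the title is checked, so a post that mentions "Ask HN" somewhere later in its title still lands in the "other" category.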


The first question I'm going to ask about the data is which type of post receives more comments on average. I want to determine whether an "Ask HN" or a "Show HN" post tends to receive more feedback from the community, so I'm now going to iterate through the ask_posts and show_posts lists and find the average number of comments each type of post receives.

In [None]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = round(total_ask_comments/len(ask_posts), 2)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = round(total_show_comments/len(show_posts), 2)
print(avg_show_comments)

10.39
4.89
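
The two loops above repeat the same logic, so the calculation could also be wrapped in a small helper. This is only a sketch: the function name average_comments, its comments_index parameter, and the sample rows are assumptions made for illustration, with the comment count assumed to sit at index 4 as in the header row.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, rounded to 2 decimals."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 2)

# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: example one', '', '3', '4', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example two', '', '1', '7', 'user_b', '9/26/2016 4:10'],
]

sample_avg = average_comments(sample_posts)
print(sample_avg)  # (4 + 7) / 2 = 5.5
```

With a helper like this, the two averages would come from average_comments(ask_posts) and average_comments(show_posts).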


Here we find that the average number of comments that an "Ask HN" post receives is 10.39, whereas the average number of comments that a "Show HN" post receives is only 4.89. On average, then, an "Ask HN" post draws more than twice as many comments, which suggests that it is the better choice for someone looking for feedback from the technology and startup community. For that reason, I am going to focus my remaining analysis on this category of Hacker News posts.

The next question I will ask of this data set is whether there is a time of day at which posts tend to receive more comments. I can answer this by counting how many "Ask HN" posts were created in each hour of the day and totaling the comments those posts received. Dividing the total comments by the number of posts then gives the average number of comments per post for each hour.
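
As a quick illustration of how the hour can be pulled out of a created_at value, here is a minimal sketch that parses one timestamp copied from the first data row shown earlier. The variable names timestamp and parsed are chosen here for illustration.

```python
import datetime as dt

# A created_at value in the data set's 'month/day/year hour:minute' layout.
timestamp = '9/26/2016 3:26'

# %m, %d, and %H accept values without a leading zero when parsing.
parsed = dt.datetime.strptime(timestamp, '%m/%d/%Y %H:%M')

print(parsed.hour)  # 3
```

The hour attribute is an integer from 0 to 23, which makes it a convenient dictionary key for per-hour tallies.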

In [None]:
import datetime as dt

# Pair each post's creation timestamp with its comment count.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# Tally the number of posts and the total comments for each hour of the day.
counts_by_hour = {}
comments_by_hour = {}
for created_at, num_comments in result_list:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments

In [None]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour]/counts_by_hour[hour], 2)])
print(avg_by_hour)

[[2, 11.14], [1, 7.41], [22, 8.8], [21, 8.69], [19, 7.16], [17, 9.45], [15, 28.68], [14, 9.69], [13, 16.32], [11, 8.96], [10, 10.68], [9, 6.65], [7, 7.01], [3, 7.95], [23, 6.7], [20, 8.75], [16, 7.71], [8, 9.19], [0, 7.56], [18, 7.94], [12, 12.38], [4, 9.71], [6, 6.78], [5, 8.79]]


These results are a little hard to read because the hours appear in no particular order, so next I'll sort them by average number of comments, highest first.

In [None]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [None]:
print('Top five hours for "Ask HN" posts: ')
for row in sorted_swap[:5]:
    template = '{0}:00 {1} average comments per post'.format(row[1], row[0])
    print(template)

Top five hours for "Ask HN" posts: 
15:00 28.68 average comments per post
13:00 16.32 average comments per post
12:00 12.38 average comments per post
2:00 11.14 average comments per post
10:00 10.68 average comments per post
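
The swap-and-sort step can also be written with a key function, which sorts on the average without building a second list. The sketch below uses a handful of the [hour, average] pairs printed earlier; the names sample_avg_by_hour and top_five are assumptions for illustration.

```python
# A subset of the [hour, average comments] pairs from the output above.
sample_avg_by_hour = [
    [2, 11.14], [15, 28.68], [13, 16.32],
    [12, 12.38], [10, 10.68], [9, 6.65],
]

# Sort on the second element (the average), highest first, and keep five.
top_five = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print('{0}:00 {1} average comments per post'.format(hour, avg))
```

One difference worth noting: when two hours have the same average, the swapped-list approach breaks the tie by hour, while a key-based sort keeps tied entries in their original order.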


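Here we can see that "Ask HN" posts created during the 15:00 (3pm), 13:00 (1pm), 12:00 (12pm), 2:00 (2am), and 10:00 (10am) hours receive the highest average number of comments. The 15:00 hour stands out at 28.68 comments per post, well ahead of the 16.32 for 13:00. Based on these averages, posting a question during one of these hours, in the time zone used by the data set's created_at column, may improve one's chances of receiving feedback on Hacker News.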