# Analysis of Hacker News Posts

**Hacker News** is a popular technology site where users submit posts that other users vote and comment on. This project analyzes the **Ask HN** and **Show HN** post types on the site.

Posts whose titles begin with Ask HN ask the Hacker News community a specific question, while Show HN posts show the community a project, product, or something else.

Our goal is to determine which of the two post types receives more comments on average, and whether posts created at a certain time receive more comments on average.

The raw data set is located [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
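The reduction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that produced `hacker_news.csv`: the function name `reduce_rows`, the fixed seed, and the toy rows are hypothetical, and the comment count is assumed to sit at index 4 of each row.

```python
import random

def reduce_rows(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample the rest.

    Assumes each row stores num_comments as a string at index 4.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(commented, min(sample_size, len(commented)))

# Hypothetical toy rows: id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Ask HN: Example question?', '', '5', '3', 'user_a', '1/1/2016 10:00'],
    ['2', 'Show HN: Example project', 'http://example.com', '2', '0', 'user_b', '1/1/2016 11:00'],
    ['3', 'Example article', 'http://example.org', '9', '7', 'user_c', '1/2/2016 12:00'],
    ['4', 'Another article', 'http://example.net', '1', '0', 'user_d', '1/2/2016 13:00'],
]

sample = reduce_rows(toy_rows, sample_size=2)
print(len(sample))  # only the two rows with comments can be sampled
```

Because the zero-comment rows are dropped before sampling, every average computed later describes only posts that received at least one comment.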

## Reading Data and Removing Headers

In [24]:
#Read data
import csv

with open('hacker_news.csv', 'r') as file:
    hn = list(csv.reader(file))
    
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

The first of the inner lists contains the column headers. To analyze the data, we need to remove this header row.

In [25]:
#Extract headers
headers = hn[0]
hn = hn[1:]
headers
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

## Extracting Ask HN and Show HN Posts

Since we are only concerned with posts whose titles begin with Ask HN or Show HN, we need to filter the data and separate those posts into different lists.

In [5]:
#Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.

ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

print(f'Number of posts in ask_posts: {len(ask_posts)}')
print(f'Number of posts in show_posts: {len(show_posts)}')
print(f'Number of posts in other_posts: {len(other_posts)}')

Number of posts in ask_posts: 1744
Number of posts in show_posts: 1162
Number of posts in other_posts: 17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [10]:
# Find average number of comments on ask posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts) 
print(f'avg_ask_comments: {avg_ask_comments :.2f}')

avg_ask_comments: 14.04


In [11]:
# Find average number of comments on show posts
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(f'avg_show_comments: {avg_show_comments :.2f}')

avg_show_comments: 10.32


On average, ask posts in this sample receive about four more comments than show posts (14.04 versus 10.32). Since ask posts tend to attract more comments, we'll focus the rest of our analysis on these posts.
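The two averaging loops above share the same logic, so they could be folded into one small helper. The sketch below is one possible way to do that; the name `average_comments` and the toy rows are hypothetical, and the comment count is assumed to be at index 4, as in the `hn` rows.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count, or 0.0 when the list is empty."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical toy rows; only the num_comments column (index 4) is filled in
toy_ask = [['', '', '', '', '10'], ['', '', '', '', '20']]
toy_show = [['', '', '', '', '6'], ['', '', '', '', '12']]

ask_avg = average_comments(toy_ask)
show_avg = average_comments(toy_show)
print(f'difference: {ask_avg - show_avg:.2f}')
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should match the values computed by the loops, and the empty-list guard avoids a division by zero if a filter returns nothing.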

## Finding the Number of Ask Posts and Comments by Hour Created

We will determine whether ask posts created at a certain time are more likely to attract comments.

We will calculate the number of ask posts created in each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments ask posts receive by hour created.
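As a small illustration of the approach, the sketch below parses a `created_at` string into a two-digit hour and tallies posts and comments per hour. The helper name `hour_of` and the toy `(created_at, num_comments)` pairs are assumptions made for this example, and `defaultdict` is used here only as a compact alternative to checking whether a key already exists.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'  # matches strings such as 8/4/2016 11:52

def hour_of(created_at):
    """Return the hour of a created_at string as a zero-padded string."""
    return dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')

# Hypothetical toy data: (created_at, num_comments)
toy = [('8/4/2016 11:52', 52), ('1/26/2016 19:30', 10), ('6/17/2016 11:05', 4)]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # total comments for posts created in each hour
for created_at, num_comments in toy:
    hour = hour_of(created_at)
    counts[hour] += 1
    comments[hour] += num_comments

averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Keeping the hour as a zero-padded string means midnight appears as `'00'`, which keeps the keys consistent when they are sorted or formatted later.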

In [34]:
# Calculate the number of ask posts created during each hour of the day and the number of comments they received
import datetime as dt
result_list = []

for post in ask_posts:
    result_list.append([post[6], int(post[4])])
    
counts_by_hour = {} # contains the number of ask posts created during each hour of the day
comments_by_hour = {} # contains the corresponding number of comments ask posts created at each hour received.

# string format in the created_at column: 8/4/2016 11:52
date_format = '%m/%d/%Y %H:%M'

for row in result_list:
    date = row[0]
    comments = row[1]
    time = dt.datetime.strptime(date, date_format).strftime('%H')
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comments
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comments
    
# Show total comments per hour, sorted by the totals (iterate over items, not just keys)
dict(sorted(comments_by_hour.items(), key=lambda x: x[1]))

## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [45]:
# Calculate the average number of comments that `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]]) # [hour, average number of comments per post]
avg_by_hour

[['15', 38.5948275862069],
 ['02', 23.810344827586206],
 ['20', 21.525],
 ['16', 16.796296296296298],
 ['21', 16.009174311926607],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['18', 13.20183486238532],
 ['17', 11.46],
 ['01', 11.383333333333333],
 ['11', 11.051724137931034],
 ['19', 10.8],
 ['08', 10.25],
 ['05', 10.08695652173913],
 ['12', 9.41095890410959],
 ['06', 9.022727272727273],
 ['00', 8.127272727272727],
 ['23', 7.985294117647059],
 ['07', 7.852941176470588],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['22', 6.746478873239437],
 ['09', 5.5777777777777775]]

## Sorting and Printing Values from a List of Lists

In [48]:
# Sort by average in descending order
avg_by_hour.sort(key= lambda x: x[1], reverse=True)
avg_by_hour

[['15', 38.5948275862069],
 ['02', 23.810344827586206],
 ['20', 21.525],
 ['16', 16.796296296296298],
 ['21', 16.009174311926607],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['18', 13.20183486238532],
 ['17', 11.46],
 ['01', 11.383333333333333],
 ['11', 11.051724137931034],
 ['19', 10.8],
 ['08', 10.25],
 ['05', 10.08695652173913],
 ['12', 9.41095890410959],
 ['06', 9.022727272727273],
 ['00', 8.127272727272727],
 ['23', 7.985294117647059],
 ['07', 7.852941176470588],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['22', 6.746478873239437],
 ['09', 5.5777777777777775]]

In [58]:
print("Top 5 Hours for Ask Posts Comments")
for row in avg_by_hour[:5]:
    hour = dt.datetime.strptime(row[0], '%H').strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour, row[1]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The best hour to create an ask post is 15:00 (3:00 pm) Eastern Time in the US, which is the time zone specified by the data set documentation. To find the best hour where you live, convert this time to your own time zone.
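A conversion like this can be sketched with the standard `datetime` module. The example below is a simplification under stated assumptions: it treats Eastern Time as a fixed UTC-5 offset (EST) and ignores daylight saving time, the January date is arbitrary, and the function name `convert_hour` is hypothetical.

```python
import datetime as dt

# Assumption: Eastern Standard Time as a fixed UTC-5 offset (no daylight saving)
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def convert_hour(hour, target_offset_hours):
    """Convert an hour in EST to HH:MM in a zone with the given UTC offset."""
    source = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EST)  # arbitrary winter date
    target = dt.timezone(dt.timedelta(hours=target_offset_hours))
    return source.astimezone(target).strftime('%H:%M')

print(convert_hour(15, 0))   # 15:00 EST expressed in UTC
print(convert_hour(15, 1))   # ... in a UTC+1 zone
print(convert_hour(15, -8))  # ... in a UTC-8 zone
```

For dates when daylight saving time applies, a named zone from the standard `zoneinfo` module (Python 3.9+) would be a more accurate choice than a fixed offset.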

The average for this hour is about 62% higher than the second-highest average (23.81 comments per post at 02:00), so creating an ask post around 15:00 EST gives it the best chance of receiving answers.
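The size of the gap between the top two hours can be checked directly from the averages listed in `avg_by_hour`. The variable names below are introduced only for this check.

```python
# Averages copied from the avg_by_hour output above
top_avg = 38.5948275862069       # hour 15
second_avg = 23.810344827586206  # hour 02

# Relative increase of the best hour over the runner-up, as a percentage
pct_higher = (top_avg - second_avg) / second_avg * 100
print(f'{pct_higher:.1f}% higher')  # prints 62.1% higher
```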

## Conclusion

In this project, our goal was to analyze ask posts and show posts to determine which type of post and which posting time receive the most comments on average. Based on the analysis, to maximize the number of comments a post receives, we'd recommend that the post be categorized as an ask post and created between 15:00 and 16:00 EST.

However, it should be noted that our sample excluded posts without any comments. It is therefore more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created around 3:00 pm EST received the most comments on average.

Possible next steps:

- Determine whether show or ask posts receive more points on average.
- Determine whether posts created at a certain time are more likely to receive more points.
- Compare these results with the average number of comments and points that other posts receive.