## Hacker News: Analyzing Types of Posts to Determine the Best Type and Time to Publish
* Analysis by **Eduardo Torres**
* Last Updated: 04/25/2020

In this project, I analyze two types of posts from Hacker News, a popular technology site where users submit technology-related posts and the community votes and comments on them. The two types are posts whose titles begin with either 'Ask HN' or 'Show HN'.

The 'Ask HN' and 'Show HN' posts serve different purposes. 'Ask HN' posts are for users looking for specific answers from the community, for example, 'Ask HN: How do you pass on your work when you die?'. 'Show HN' posts are for users who want to share a project, findings from their studies, or general knowledge.

The purpose of this project is to compare the two types of posts to determine the following:
- Do 'Ask HN' or 'Show HN' receive more comments on average?
- Do posts created at a certain time receive more comments on average?

**<font color=Blue>Datasource Documentation:</font>**
**1.** [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts/version/1) 

Please note that the file loaded in this notebook, `HN_posts_year_to_Sep_26_2016.csv`, is the full data set of almost 300,000 rows. It has not been reduced or sampled, so it includes submissions that did not receive any comments.

## Summary of Results

Based on the analysis, my recommendation for anyone seeking the most attention in the form of comments is to write an 'Ask HN' post between 1:00 PM and 3:00 PM EST.

## Exploring and Cleaning Publicly Available Data

In [1]:
from csv import reader

# The datasource function helps automate the import of all datasets
def datasource(source):
    # The with block closes the file even if reading fails;
    # newline='' is what the csv module recommends when opening files
    with open(source, newline='') as opened_file:
        return list(reader(opened_file))

# Importing Hacker News Posts Dataset (hn)
hn_data = datasource('HN_posts_year_to_Sep_26_2016.csv')
hn_header = hn_data[0]
hn = hn_data[1:]

In [2]:
# The explore_data function helps make the datasource exploration readable
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

**<font color=Grey>Explores the Hacker News Raw Data:</font>**

Exploring the data shows that everything needed for the analysis is available: the post title (`title`), the number of comments (`num_comments`), and the creation time (`created_at`).

In [3]:
# Prints the data header and explores the first five rows of the raw hn data
print(hn_header)
print('\n')
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows: 293119
Number of columns: 7
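The cells that follow read columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, those positions can be derived from the header row printed above rather than memorized. The `col` mapping and `sample_row` name are illustrative helpers, not part of the original notebook.

```python
# Header row as printed above; in the notebook this is hn_header.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper mapping: column name -> position within each row.
col = {name: index for index, name in enumerate(header)}

# One row copied from the exploration output above.
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

title = sample_row[col['title']]
num_comments = int(sample_row[col['num_comments']])  # csv values are strings
created_at = sample_row[col['created_at']]

print(col['title'], col['num_comments'], col['created_at'])  # 1 4 6
print(title, num_comments, created_at)
```

This confirms that index 1 is the title, index 4 is the comment count, and index 6 is the creation time, which are the indexes used in the rest of the analysis.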

## Isolating the Types of Posts to Prepare for Analysis

In this section, I iterate through the raw data to separate the 'Ask HN' and 'Show HN' posts into different lists, which makes the analysis easier. Titles are lowercased before the prefix check so that the match is case-insensitive.

In [12]:
# Performs an iteration to separate the types of posts into their own lists
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        ask_posts.append(row)
    elif title_lower.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
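The same prefix check can be sketched as a small function, which makes it easy to try on individual titles. The function name `classify_title` and the second sample title are invented for illustration; the other two titles appear earlier in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' using a case-insensitive prefix check."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How do you pass on your work when you die?',  # from the introduction
    'SHOW HN: an invented example title',                  # invented for this sketch
    'algorithmic music',                                   # from the raw data output
]

labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Because the check only looks at the start of the title, a title that mentions 'Ask HN' later in the text is classified as 'other'.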


## Calculating the Average Number of Comments for 'Ask HN' and 'Show HN' Posts

Now that the data has been prepared for analysis, I will use the two separate lists to calculate the average number of comments for the two different kinds of posts.

In [5]:
# Performs the Average Comment Calculation and Formats the Result
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments    
    
avg_ask_comments = total_ask_comments/len(ask_posts)
print("avg_ask_comments: {0:,.2f}".format(avg_ask_comments))

avg_show_comments = total_show_comments/len(show_posts)
print("avg_show_comments: {0:,.2f}".format(avg_show_comments))

avg_ask_comments: 10.39
avg_show_comments: 4.89


**<font color=Grey>Result:</font>**

The first part of my analysis was to determine which type of post receives more comments on average. Based on the result, 'Ask HN' posts receive more: about 10.39 comments per post, compared with about 4.89 for 'Show HN' posts. For this reason, I perform the next part of the analysis with only the 'Ask HN' posts.
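As a rough sketch, the averaging step could also be wrapped in a helper that guards against an empty list. The name `average_comments` and the sample rows are invented for illustration; only the two averages at the end come from the output above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Invented rows that follow the same column layout as the data set.
sample_posts = [
    ['1', 'Ask HN: example one', '', '3', '6', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example two', '', '1', '0', 'user_b', '9/26/2016 4:10'],
    ['3', 'Ask HN: example three', '', '5', '9', 'user_c', '9/26/2016 15:02'],
]

print(average_comments(sample_posts))  # 5.0
print(average_comments([]))            # 0.0

# Ratio of the two averages reported above (10.39 and 4.89).
ratio = 10.39 / 4.89
print("{:.2f}".format(ratio))  # 2.12
```

On the reported averages, 'Ask HN' posts draw roughly twice as many comments per post as 'Show HN' posts.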

## Determining the Quantity of Posts and Comments by Hour Created

In this section, I determine the hour at which a user can maximize the number of comments a post receives. The first step is to count the posts created and the comments received during each hour of the day; then I calculate the average number of comments per post for each hour.

**<font color=Grey>Number of Posts and Total Comments for Each Hour of the Day</font>**

In [6]:
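import datetime as dt
import operator

# Keep only the creation timestamp and comment count of each Ask HN post
result_list = []

for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )

comments_by_hour = {}  # hour -> total comments on posts created in that hour
counts_by_hour = {}    # hour -> number of posts created in that hour
date_format = "%m/%d/%Y %H:%M"

for each_row in result_list:
    date = each_row[0]
    comment = each_row[1]
    # Extract the zero-padded hour, e.g. '9/26/2016 3:26' -> '03'
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += comment
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comment
        counts_by_hour[hour] = 1

# Display total comments per hour, largest first
sorted(comments_by_hour.items(), key=operator.itemgetter(1), reverse=True)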

[('15', 18525),
 ('13', 7245),
 ('17', 5547),
 ('14', 4972),
 ('18', 4877),
 ('21', 4500),
 ('16', 4466),
 ('20', 4462),
 ('12', 4234),
 ('19', 3954),
 ('22', 3372),
 ('10', 3013),
 ('02', 2996),
 ('11', 2797),
 ('08', 2362),
 ('04', 2360),
 ('23', 2297),
 ('00', 2277),
 ('03', 2154),
 ('01', 2089),
 ('05', 1838),
 ('06', 1587),
 ('07', 1585),
 ('09', 1477)]
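The hourly bucketing above can also be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch. The sample timestamps and comment counts below are invented; only the date format matches the notebook.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

# Invented (created_at, num_comments) pairs in the data set's date format.
sample = [
    ("9/26/2016 3:26", 4),
    ("9/26/2016 3:50", 2),
    ("9/25/2016 15:05", 10),
    ("9/24/2016 15:45", 0),
]

counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '03'
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))    # {'03': 2, '15': 2}
print(dict(comments_by_hour))  # {'03': 6, '15': 10}
```

Printing `counts_by_hour` alongside `comments_by_hour` also shows how many posts each hourly total is based on.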

**<font color=Grey>Average Number of Comments by Hour</font>**

In [7]:
# Average comments per post for each hour: total comments / number of posts
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

**<font color=Grey>Sorting the Newly Calculated Average by Hour</font>**

In [8]:
# Put the average first in each pair so that sorted() orders by average
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
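As an alternative sketch, the same ordering can be obtained without swapping the columns by passing a `key` function to `sorted`. The three pairs below are copied from the unsorted output above; the name `sorted_avg` is illustrative.

```python
# Three [hour, average] pairs copied from the unsorted output above.
avg_by_hour = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
]

# Sort on the average (index 1), largest first, leaving the pair order intact.
sorted_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_avg:
    print("{}:00 {:.2f}".format(hour, avg))
# 15:00 28.68
# 13:00 16.32
# 02:00 11.14
```

Keeping the hour first means later code can unpack each pair as `hour, avg` without remembering that the columns were swapped.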

In [11]:
print("Top 5 Hours for Ask Posts Comments")

template = "{}: {:,.2f} average comments per post."
for avg, hr in sorted_swap[:5]:
    output = template.format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg)
    print(output)

Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.


**<font color=Grey>Insights:</font>**

Posts created in the 3:00 PM EST hour (15:00) have the highest average number of comments per post. It is worth noting that this average is approximately 76% higher than that of the second-ranked hour (28.68 versus 16.32 for 13:00).
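The 76% figure can be checked with a short calculation using the two top averages from the sorted output above. The variable names are illustrative.

```python
top_avg = 28.676470588235293    # 15:00, from the sorted output above
second_avg = 16.31756756756757  # 13:00, from the sorted output above

# Relative increase of the top hour over the second-ranked hour.
pct_increase = (top_avg - second_avg) / second_avg * 100
print("{:.1f}%".format(pct_increase))  # 75.7%
```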

## Conclusion

As part of the project, I analyzed both the 'Ask HN' and 'Show HN' types of posts to determine which type of post, and at what time, received the most comments. Because the data set includes posts that received no comments, the averages reflect all submissions rather than only those that drew a response. My recommendation for anyone seeking the most attention in the form of comments is to write an 'Ask HN' post between 1:00 PM and 3:00 PM EST.
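For readers outside the Eastern time zone, here is a minimal sketch of converting the recommended hours to UTC. It assumes the timestamps are in EST as a fixed UTC-5 offset, which this notebook states but does not verify; the helper name `est_hour_to_utc` is invented for illustration.

```python
import datetime as dt

# Assumption: timestamps are in EST, treated as a fixed UTC-5 offset
# with no daylight saving adjustment.
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

def est_hour_to_utc(hour):
    """Convert an hour of the day in EST to the corresponding hour in UTC."""
    # The date is arbitrary; only the hour matters for a fixed offset.
    est_time = dt.datetime(2016, 9, 26, hour, 0, tzinfo=EST)
    return est_time.astimezone(dt.timezone.utc).hour

for hour in (13, 15):
    print("{:02d}:00 EST = {:02d}:00 UTC".format(hour, est_hour_to_utc(hour)))
# 13:00 EST = 18:00 UTC
# 15:00 EST = 20:00 UTC
```

If the timestamps instead follow US Eastern time with daylight saving, the standard library's `zoneinfo.ZoneInfo("America/New_York")` (Python 3.9+) would be the more appropriate time zone object.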