# Exploring Hacker News posts

This project works with a __[data set](https://www.kaggle.com/hacker-news/hacker-news-posts)__ of submissions to __[Hacker News](https://news.ycombinator.com/)__, a site where user-submitted stories are voted and commented on. Hacker News is extremely popular in technology and startup circles.
In this project, posts whose titles begin with 'Ask HN' or 'Show HN' will be analyzed to answer two questions:

- Do Ask HN posts or Show HN posts receive more comments on average?
(Ask HN posts ask the community a specific question; Show HN posts show the community a project, a product, or anything else users find interesting.)

- Do posts created at a certain time receive more comments on average?

Note: The original data set has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
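The reduction described in the note can be illustrated with a small sketch. This is not the procedure actually used to build the data set; it only shows the two steps (filter, then sample) on made-up rows. The helper name `sample_commented_posts`, the seed, and the assumption that the comment count sits in column index 4 are all illustrative.

```python
import random


def sample_commented_posts(rows, sample_size, seed=0):
    """Drop rows with zero comments, then draw a random sample of the rest."""
    # Assumption: the comment count is stored as a string in column index 4.
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # seeded so the sketch gives repeatable samples
    return rng.sample(commented, min(sample_size, len(commented)))


# Made-up rows: id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ["1", "Post A", "", "3", "0", "user_a", "1/1/2016 10:00"],
    ["2", "Post B", "", "5", "4", "user_b", "1/2/2016 11:00"],
    ["3", "Post C", "", "1", "0", "user_c", "1/3/2016 12:00"],
    ["4", "Post D", "", "9", "7", "user_d", "1/4/2016 13:00"],
    ["5", "Post E", "", "2", "1", "user_e", "1/5/2016 14:00"],
]

sampled = sample_commented_posts(toy_rows, sample_size=2)
print(len(sampled))  # 2 rows, each with at least one comment
```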

This is my second project in Data Analytics. This time I use Python to work with dates and times.

## Introducing Hacker News data

In [2]:
# Read in the data
from csv import reader

# The with block closes the file once the rows are loaded
with open("hacker_news.csv", encoding="utf8") as opened_file:
    hn = list(reader(opened_file))

# Remove the header from the data set
headers = hn[0]
hn = hn[1:]

# Check header and first 5 rows without header of the data set
print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
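As a possible variation on the loading step, the header split can be wrapped in a small helper that takes any open file object. The sketch below is only an illustration: `read_rows` is a hypothetical name, and the data is an in-memory string holding two of the rows printed above, so it runs without `hacker_news.csv`.

```python
import csv
import io


def read_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Two rows copied from the output above, held in memory for illustration.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

# With the real file, the same helper could be called as:
#     with open("hacker_news.csv", encoding="utf8", newline="") as f:
#         headers, hn = read_rows(f)
with io.StringIO(sample_csv) as f:
    headers, rows = read_rows(f)

print(headers)
print(len(rows))
```

Every field comes back as a string, which is why the later cells convert `num_comments` with `int()` before adding it up.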


## Extracting Ask HN and Show HN posts

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the number of posts in the three lists
print('Number of Ask HN posts:', len(ask_posts))
print('Number of Show HN posts:', len(show_posts))
print('Number of other posts:', len(other_posts))

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


According to the counts above, there are 1,744 posts whose titles start with 'Ask HN' and 1,162 whose titles start with 'Show HN' to analyze. The other 17,194 posts are not needed for now.
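The split above depends on how the title starts, not on whether it contains the phrase somewhere. A minimal sketch of the same rule as a standalone function follows; `classify_title` and the titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, used only to exercise the helper.
toy_titles = [
    "Ask HN: What are you reading?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
    "Why I ask HN for advice",  # contains 'ask HN' but does not start with it
]

for title in toy_titles:
    print(classify_title(title), "-", title)
```

Lowercasing the title first makes the match case-insensitive, so 'SHOW HN' and 'Show HN' land in the same group.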

## Calculating the Average Number of Comments for Ask HN and Show HN posts

In [7]:
total_ask_comments = 0
total_show_comments = 0
    
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
    
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments

# Calculate the average number of comments for 2 types of post
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_show_comments = total_show_comments/len(show_posts)

print('Average of Ask HN posts:', round(avg_ask_comments))
print('Average of Show HN posts:', round(avg_show_comments))  

Average of Ask HN posts: 14
Average of Show HN posts: 10


On average, Ask HN posts receive more comments than Show HN posts (about 14 versus 10 per post). Based on this finding, the rest of the analysis focuses on Ask HN posts.
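The two loops above do the same arithmetic (total comments divided by number of posts), so they could be folded into one helper. The sketch below assumes the same column order as the data set; `average_comments` and the toy rows are hypothetical and only show the calculation.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across posts; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set.
toy_ask = [
    ["1", "Ask HN: A?", "", "3", "6", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: B?", "", "5", "10", "user_b", "1/2/2016 11:00"],
]
toy_show = [
    ["3", "Show HN: C", "", "7", "4", "user_c", "1/3/2016 12:00"],
]

print(average_comments(toy_ask))   # 8.0
print(average_comments(toy_show))  # 4.0
```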

## Finding the Number of Ask posts and comments by hour

In [10]:
# Use the datetime module
import datetime as dt

result_list = []

for row in ask_posts:
    created_time = row[6]
    n_comments = int(row[4])
    result_list.append([created_time, n_comments])

# Create frequency tables for posts and comments by hour
counts_by_hour = {} # number of Ask posts created in each hour
comments_by_hour = {} # total comments on Ask posts created in each hour
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H") # extract the hour from the date
    if time not in counts_by_hour:
        counts_by_hour[time] = 1 
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

print(counts_by_hour)
print('\n')
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
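To make the hour extraction easier to follow, here is a small sketch of the same idea on three timestamps. It uses the same format string as the cell above; `hour_of` is a hypothetical helper, and `collections.defaultdict` is swapped in for the explicit `if`/`else` bookkeeping. The first two timestamps come from the sample rows printed earlier (they are not Ask HN posts) and the third is made up so that one hour repeats.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def hour_of(created_at):
    """Return the zero-padded hour ('00'-'23') of a created_at string."""
    return dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")


# (created_at, num_comments) pairs. The first two timestamps appear in the
# sample rows printed earlier; the third is made up so one hour repeats.
toy_pairs = [
    ("8/4/2016 11:52", 52),
    ("9/30/2015 4:12", 2),
    ("8/5/2016 11:05", 8),
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # total comments per hour
for created_at, n_comments in toy_pairs:
    hour = hour_of(created_at)
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))
print(dict(comments))
```

Note that `strptime` accepts the unpadded '4:12' while `strftime("%H")` returns the padded '04', which is why the dictionary keys above are always two characters long.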


## Calculating the Average Number of Comments for Ask posts by hour

In [13]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_comments_by_hour = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments_by_hour])
        
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

## Sorting the list

Although we have the results we need, the format above makes it hard to identify which hours have the highest values. Let's sort the list of lists and print the 5 highest values in a format that is easier to read.

In [14]:
# create a list that equals 'avg_by_hour' with swapped columns
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg_comments_by_hour = row[1]
    swap_avg_by_hour.append([avg_comments_by_hour, hour])
    
print(swap_avg_by_hour)

# sort the list in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [17]:
# Print the 5 hours with the highest average number of comments
print('Top 5 hours gain the most comments:')

for avg, hour in sorted_swap[:5]:
    hour_modified = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post.".format(hour_modified, avg))

Top 5 hours gain the most comments:
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
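As an alternative to swapping the columns before sorting, `sorted` can be told which element to compare through its `key` argument. The sketch below is one possible way to do this, using a few `[hour, average]` pairs copied from `avg_by_hour`; the name `avg_by_hour_sample` is made up for the example.

```python
# A few [hour, average] pairs copied from avg_by_hour above.
avg_by_hour_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
    ["16", 16.796296296296298],
    ["21", 16.009174311926607],
    ["13", 14.741176470588234],
]

# Sort by the average (second element) without building a swapped list.
top_five = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

Because the hour is already a two-digit string, appending ':00' gives the same label as the `strptime`/`strftime` round trip in the cell above.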


## Conclusions

In this project, two types of posts (Ask HN and Show HN) were analyzed to see which type receives more comments and at which hours posts receive the most comments. The results show that posts whose titles start with 'Ask HN' receive more comments on average. Moreover, Ask HN posts created around 15:00 receive the highest number of comments on average. The next four hours with the highest averages are 02:00, 20:00, 16:00 and 21:00.

It can be concluded that, to get more comments, a user should post around 15:00-16:00 or 20:00-21:00. When choosing between Ask HN and Show HN to gather the most comments, the user should go with Ask HN. However, since the data set had been modified to exclude all posts with 0 comments, this conclusion does not necessarily apply to all posts on Hacker News.