# Analysing Hacker News data

Hacker News is a site where user-submitted stories (known as 'posts') are voted on and commented on.

In this project we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com).
You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

For this analysis we reduced the data from almost 300,000 to approximately 20,000 rows by removing submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Below are descriptions of the columns.
- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
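Because the data is read as a list of lists, each column is addressed by position rather than by name. As a minimal sketch, assuming the CSV header follows the order listed above, a small lookup table can make the positions explicit. The names `COLUMNS` and `COL` are illustrative and are not used elsewhere in this project.

```python
# Column names in the order described above (assumed to match the CSV header).
COLUMNS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position within a row.
COL = {name: index for index, name in enumerate(COLUMNS)}

# One row in the data set's layout.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

print(row[COL['title']])              # Interactive Dynamic Video
print(int(row[COL['num_comments']]))  # 52
```

The cells below use the numeric positions directly: 1 for the title, 4 for the number of comments, and 6 for the creation time.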

We are specifically interested in posts whose titles begin with either <font color = blue>Ask HN</font> or <font color = blue>Show HN</font>. Users submit <font color = blue>Ask HN</font> posts to ask the Hacker News community a specific question, and <font color = blue>Show HN</font> posts to show the community a project, product, or just something interesting.

We'll compare these two types of posts to determine the following.
- Do <font color = blue>Ask HN</font> or <font color = blue>Show HN</font> posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [1]:
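import csv

# Read the whole CSV file into a list of lists; the first row is the header.
# The with-block closes the file once reading is done.
with open('hacker_news.csv') as opened_file:
    hn = list(csv.reader(opened_file))
print(hn[:5])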

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
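As the output shows, the header row is the first element of `hn`. One way to keep it out of later calculations is to split it off right after reading. The sketch below uses a small in-memory sample in place of `hacker_news.csv` so that it runs on its own. The names `sample_csv`, `header`, and `rows` are illustrative.

```python
import csv
import io

# Two lines in the same layout as hacker_news.csv, held in memory for the example.
sample_csv = io.StringIO(
    'id,title,url,num_points,num_comments,author,created_at\n'
    '12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,'
    '386,52,ne0phyte,8/4/2016 11:52\n'
)

# Read every line, then separate the column names from the posts.
with sample_csv as f:
    data = list(csv.reader(f))

header = data[0]   # column names
rows = data[1:]    # the posts themselves

print(header)
print(len(rows))
```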


Let's separate the posts whose titles begin with <font color = blue>Ask HN</font> and <font color = blue>Show HN</font>.

In [7]:
ask_posts = []
show_posts = []
other_posts = []

# Skip the header row (hn[0]) so that it is not counted as a post.
for row in hn[1:]:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(f'Number of Ask HN posts: {len(ask_posts)}')
print(f'Number of Show HN posts: {len(show_posts)}')
print(f'Number of other posts: {len(other_posts)}')


Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


Let's check whether Ask HN posts or Show HN posts receive more comments on average.

In [11]:
total_ask_comments = 0
for item in ask_posts:
    total_ask_comments += int(item[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(f'average "ask comment" {avg_ask_comments}')

total_show_comments = 0
for item in show_posts:
    total_show_comments += int(item[4])
avg_show_comments = total_show_comments / len(show_posts)
print(f'average "show comment" : {avg_show_comments}')

average "ask comment" 14.038417431192661
average "show comment" : 10.31669535283993
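The two loops in the cell above have the same shape, so they could be folded into one helper. Here is a minimal sketch, assuming the comment count sits at index 4 as in this data set. The function name `average_comments` and the sample rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Tiny made-up rows in the same column layout as the data set.
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: example two', '', '3', '20', 'user_b', '8/4/2016 12:10'],
]

print(average_comments(sample_posts))  # 15.0
```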


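Ask HN posts receive more comments on average than Show HN posts: about 14.04 versus 10.32 comments per post, a difference of roughly 3.7. Since Ask HN posts attract more comments, the rest of the analysis focuses on them. We'll group the Ask HN posts by the hour in which they were created, counting the posts and summing the comments for each hour.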

In [24]:
import datetime as dt

# Collect the creation time and comment count of every Ask HN post.
result_list = []

for item in ask_posts:
    created_at = item[6]
    num_comments = int(item[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}    # hour -> number of posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts

for result in result_list:
    date_str = result[0]
    # Parse timestamps such as '8/4/2016 11:52' and keep the two-digit hour.
    date_obj = dt.datetime.strptime(date_str,"%m/%d/%Y %H:%M")
    hour = date_obj.strftime('%H')

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]

    


In [28]:
list(comments_by_hour.items())[:4]

[('09', 251), ('13', 1253), ('10', 793), ('14', 1416)]
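The `if`/`else` bookkeeping used to build the two dictionaries can also be written with `collections.defaultdict`, which supplies a zero the first time an hour is seen. This is a sketch on made-up sample data, not the approach taken in this project. The names `sample_results`, `counts`, and `comments` are illustrative.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same format as result_list.
sample_results = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total comments

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'11': 2, '19': 1}
print(dict(comments))  # {'11': 10, '19': 10}
```

The rest of this project keeps using `counts_by_hour` and `comments_by_hour` as built above.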

Next, we'll use these two dictionaries to calculate the average number of comments per post for each hour of the day.

In [33]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
    

In [34]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's sort the list and print the highest values in a format that's easier to read.

In [35]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [36]:
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [40]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [41]:
sorted_swap[:5]

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]
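Swapping the columns works because `sorted` compares lists element by element, so the average comes first. An alternative is to leave the pairs as they are and pass a `key` function to `sorted`. Here is a short sketch using a few of the pairs from `avg_by_hour`; the names `sample_avg_by_hour` and `top` are illustrative.

```python
# A few [hour, average] pairs copied from avg_by_hour.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first.
top = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print(f'{hour}:00, {avg:.2f} average comments per post')
```

This avoids building the intermediate swapped list. The remaining cells continue with `sorted_swap`.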

In [49]:
for value in sorted_swap[:5]:
    dt_obj = dt.datetime.strptime(value[1],'%H')
    print(f"{dt_obj.strftime('%H:%M')}, {value[0]:.2f} average comments per post")


15:00, 38.59 average comments per post
02:00, 23.81 average comments per post
20:00, 21.52 average comments per post
16:00, 16.80 average comments per post
21:00, 16.01 average comments per post


The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. That is about 62% more than the second-highest hour, 02:00, which averages 23.81 comments per post.
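The gap between the top two hours can be checked with a quick calculation on the two averages printed above. The variable names here are illustrative.

```python
highest = 38.5948275862069            # average for 15:00
second_highest = 23.810344827586206   # average for 02:00

# Relative increase from the second-highest average to the highest.
increase = (highest - second_highest) / second_highest
print(f'{increase:.0%}')  # 62%
```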

## Conclusion

In this project we analyzed Ask HN and Show HN posts to determine which type of post, and which time of day, receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an Ask HN post created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, Ask HN posts received more comments on average, and Ask HN posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.