# Exploring Hacker News Posts

In this project, we will be using a dataset from Hacker News, a site where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit.

We are interested in posts whose titles start with Ask HN or Show HN. From these, we want to determine:
1. Whether "Ask HN" or "Show HN" posts receive more comments on average
2. Whether posts created at a specific time receive more comments on average

The dataset for this project can be found at:     
https://www.kaggle.com/hacker-news/hacker-news-posts

## 1. Reading the file and separating the header

In [2]:
#opening and reading the file
from csv import reader

#the with block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
#separating the header
headers = hn[0]
hn = hn[1:]
print(headers)
#display the first five rows
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## 2. Separating posts beginning with Ask HN and Show HN

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
#checking the length of each list of posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


## 3. Determining whether ask posts or show posts receive more comments on average

In [4]:
#calculating total and average of ask comments
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average ask comments:", avg_ask_comments)


#calculating total and average of show comments
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average show comments:", avg_show_comments)

Average ask comments: 14.038417431192661
Average show comments: 10.31669535283993


We can see that ask posts receive approximately 14 comments on average, whereas show posts receive around 10. This suggests that ask posts attract more comments, so the rest of the analysis focuses on ask posts.
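The two loops above repeat the same logic, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: the function name `average_comments` and the sample rows are hypothetical, and the rows only mimic the dataset's column layout.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count, where counts are stored as strings."""
    if not posts:
        # Avoid dividing by zero when a category has no posts
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows for illustration only:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: Example A", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "2", "29", "user_b", "11/22/2015 13:43"],
    ["3", "Ask HN: Example C", "", "1", "1", "user_c", "5/2/2016 10:14"],
]

print(average_comments(sample_posts))  # 12.0
```

With the real data, the same helper could be called once for `ask_posts` and once for `show_posts`.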

## 4. Calculating the number of ask posts and comments per hour

In [5]:
import datetime as dt

result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments]) #append a single two-item list: [created_at, num_comments]
    
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime('%H')
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

# print total comments and number of posts for each hour
print(comments_by_hour)
print(counts_by_hour)

{'18': 1439, '09': 251, '10': 793, '11': 641, '05': 464, '22': 479, '02': 1381, '03': 421, '23': 543, '13': 1253, '15': 4477, '12': 687, '00': 447, '19': 1188, '01': 683, '04': 337, '21': 1745, '14': 1416, '16': 1814, '06': 397, '20': 1722, '08': 492, '07': 267, '17': 1146}
{'18': 109, '09': 45, '10': 59, '11': 58, '05': 46, '22': 71, '02': 58, '03': 54, '23': 68, '13': 85, '15': 116, '12': 73, '00': 55, '19': 110, '01': 60, '04': 47, '21': 109, '14': 107, '16': 108, '06': 44, '20': 80, '08': 48, '07': 34, '17': 100}
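The `if`/`else` bookkeeping in the loop above can also be written with `collections.defaultdict`, which starts every missing key at zero. This is a sketch of that alternative on a few made-up `[created_at, num_comments]` pairs; the names `sample_results`, `counts`, and `comments` are hypothetical.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Made-up [created_at, num_comments] pairs, for illustration only
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # total comments on those posts

for created_at, num_comments in sample_results:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. "09"
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1}
print(dict(comments))  # {'09': 7, '13': 29}
```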


## 5. Calculating average number of comments for posts created during each hour

In [6]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])

avg_by_hour

[['18', 13.20183486238532],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['05', 10.08695652173913],
 ['22', 6.746478873239437],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['23', 7.985294117647059],
 ['13', 14.741176470588234],
 ['15', 38.5948275862069],
 ['12', 9.41095890410959],
 ['00', 8.127272727272727],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['04', 7.170212765957447],
 ['21', 16.009174311926607],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['06', 9.022727272727273],
 ['20', 21.525],
 ['08', 10.25],
 ['07', 7.852941176470588],
 ['17', 11.46]]
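As a quick check of the arithmetic, the average for an hour is its total comments divided by its number of posts. The sketch below recomputes three of the averages with a dict comprehension, using totals and counts copied from the output of step 4; the variable names are hypothetical.

```python
# Totals and counts copied from the step 4 output for three hours
comments_sample = {"15": 4477, "02": 1381, "19": 1188}
counts_sample = {"15": 116, "02": 58, "19": 110}

# Average comments per post = total comments / number of posts
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in comments_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 2))
# 15 38.59
# 02 23.81
# 19 10.8
```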

## 6. Sorting and printing the list of lists

In [13]:
swap_avg_by_hour = []

for value in avg_by_hour:
    swap_avg_by_hour.append([value[1], value[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
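Swapping the columns is one way to sort by the average. Another option is to pass a `key` function to `sorted`, which leaves each `[hour, average]` pair in its original order. A small sketch, using four pairs copied from `avg_by_hour`; the names `avg_by_hour_sample` and `ranked` are hypothetical.

```python
# A few [hour, average] pairs copied from avg_by_hour
avg_by_hour_sample = [
    ["18", 13.20183486238532],
    ["02", 23.810344827586206],
    ["15", 38.5948275862069],
    ["09", 5.5777777777777775],
]

# Sort on the second element (the average), highest first
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(f"{hour}:00: {avg:.2f} average comments per post")
# 15:00: 38.59 average comments per post
# 02:00: 23.81 average comments per post
# 18:00: 13.20 average comments per post
# 09:00: 5.58 average comments per post
```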

In [15]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Ask posts created during the 15:00 hour receive the most comments, with an average of 38.59 comments per post.

## Conclusions

In this project, we analysed ask posts and show posts to determine which receive the most comments on average. To do this, we excluded posts whose titles did not start with "Ask HN" or "Show HN", which narrowed down the results.

From the above analysis, ask posts created during the 15:00 hour receive the most comments on average. Since the dataset is based on Eastern Standard Time (EST), the recommended time to create an ask post is between 3:00 PM and 4:00 PM EST.
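Readers outside the Eastern time zone may want to translate these hours into UTC. The sketch below assumes the timestamps are in Eastern Standard Time with a fixed UTC-5 offset and ignores daylight saving time; the helper name `est_hour_to_utc` is hypothetical.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset (EST); daylight saving time is ignored
EST = dt.timezone(dt.timedelta(hours=-5), "EST")


def est_hour_to_utc(hour):
    """Return the UTC hour (0-23) that corresponds to a whole hour in EST."""
    # The date is arbitrary; only the hour matters for a fixed offset
    moment = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return moment.astimezone(dt.timezone.utc).hour


print(est_hour_to_utc(15))  # 20
print(est_hour_to_utc(20))  # 1
```

Under that assumption, the 3:00 PM to 4:00 PM EST window corresponds to 20:00 to 21:00 UTC.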