# Exploring Hacker News Posts

#### This project compares and analyzes two types of posts from Hacker News (HN), a popular site where technology-related stories (posts) are voted and commented on. The two types examined here are "Ask HN" and "Show HN".

#### In "Ask HN" posts, users ask the community technology-related questions, such as "How many photons are received per bit transmitted from Voyager 1?". In "Show HN" posts, users show the HN community a project, product, or anything else interesting related to technology.

### For this project we want to know the following:
#### - Do "Ask HN" or "Show HN" posts receive more comments on average?
#### - Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

with open('hacker_news.csv', 'r') as open_file:
    read_file = reader(open_file)
    hn = list(read_file)

In [2]:
for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [3]:
headers = hn[0]
hn = hn[1:]
print(headers)

for row in hn[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
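
The cells in this notebook read columns by position (for example `row[1]` for the title). As a minimal sketch of an alternative, `csv.DictReader` from the standard library uses the header row as keys, so columns can be read by name. The `sample_csv` object below is an assumption for illustration: it holds two lines copied from the output above and stands in for the opened file.

```python
import csv
import io

# Two lines copied from the printed output; io.StringIO stands in for the opened file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(sample_csv))
print(rows[0]['title'], int(rows[0]['num_comments']))
# Interactive Dynamic Video 52
```

Values still arrive as strings, so numeric columns such as `num_comments` need an explicit `int()` conversion.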


### Filtering the Data Set and Creating a New List for "Ask HN" and "Show HN"

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
        
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))



1744
1162
17194
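
The prefix test used in the loop above can also be written as a small function, which makes the rule easy to check on its own. This is only a sketch: `classify_title` and the sample titles are hypothetical and are not taken from `hacker_news.csv`.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles used only to exercise the function.
sample_titles = [
    'Ask HN: How do you learn a new codebase?',
    'SHOW HN: A tiny static site generator',
    'Interactive Dynamic Video',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    groups[classify_title(title)].append(title)

for category, titles in groups.items():
    print(category, len(titles))
```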


### Which Received More Comments: "Ask HN" or "Show HN"

In [5]:
total_ask_comments = 0

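for row in ask_posts:
    num_comments = int(row[-3])  # num_comments is the third column from the end
    total_ask_comments += num_comments

# Compute the average once, after every post has been summed
avg_ask_comments = total_ask_comments / len(ask_posts)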

print(avg_ask_comments)

14.038417431192661


In [6]:
total_show_comments = 0

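for row in show_posts:
    num_comments = int(row[-3])  # num_comments is the third column from the end
    total_show_comments += num_comments

# Compute the average once, after every post has been summed
avg_show_comments = total_show_comments / len(show_posts)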
    
print(avg_show_comments)

10.31669535283993
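
The two cells above run the same loop on different lists. One way to avoid the repetition is a helper function, sketched below. The name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions for illustration; the rows only mimic the column order of `hacker_news.csv`.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B', '', '3', '4', 'user_b', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))
# 7.0
```

With the lists built earlier, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.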


#### Comparing the average number of comments in the two categories, "Ask HN" posts receive more comments on average (about 14) than "Show HN" posts (about 10).

#### Since "Ask HN" posts have a higher average than "Show HN" posts, we will focus the remaining analysis on "Ask HN" posts.

### Number of Comments Attracted at Certain Time Points

In [7]:
import datetime as dt

result_list = []  # Nested list of [created_at, num_comments] pairs

for row in ask_posts:
    time_created_at = row[-1]
    num_comments = int(row[-3])
    result_list.append([time_created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    
    date_time = row[0]
    comments = row[1]
    
    date_time_format = "%m/%d/%Y %H:%M"
    date_time_obj = dt.datetime.strptime(date_time, date_time_format) #Convert string date to object
    
    hour = date_time_obj.strftime("%H") #Extract the hour from the date-time object
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

print(comments_by_hour)
print(counts_by_hour)


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


In [8]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
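
The counting and averaging steps above can be combined into one function. The sketch below uses `collections.defaultdict` so no "first time seen" branch is needed. The function name `average_comments_by_hour` and the sample pairs are assumptions for illustration; the date format is the one used in the cell above.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def average_comments_by_hour(pairs):
    """Map each two-digit hour to the mean comment count of posts created in it.

    pairs is an iterable of (created_at string, comment count).
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Hypothetical (created_at, num_comments) pairs.
sample = [('8/4/2016 11:52', 6), ('9/1/2016 11:05', 2), ('6/17/2016 0:01', 3)]
print(average_comments_by_hour(sample))
# {'11': 4.0, '00': 3.0}
```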


### Sort by the Highest Amount of Comments

In [9]:
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg_comments = row[1]

    # Put the average first so the sort below compares on it
    swap_avg_by_hour.append([avg_comments, hour])

def sort_func(list_of_lists):
    """Bubble sort in descending order of each inner list's first element."""
    n = len(list_of_lists)
    for i in range(n):
        for j in range(0, n - i - 1):
            if list_of_lists[j][0] < list_of_lists[j + 1][0]:
                # Swap neighbours that are out of order
                list_of_lists[j], list_of_lists[j + 1] = list_of_lists[j + 1], list_of_lists[j]

    return list_of_lists

sorted_swap = sort_func(swap_avg_by_hour)

print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
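
As an alternative to swapping the columns and hand-writing a sort, Python's built-in `sorted` can order the `[hour, average]` pairs directly by their second element. This is a sketch of that alternative, not the method used in the cell above; `sample_avg_by_hour` is a hypothetical, shortened stand-in for `avg_by_hour` with rounded values.

```python
from operator import itemgetter

# Hypothetical [hour, average] pairs in the same shape as avg_by_hour.
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort on the average (index 1), largest first; the pairs keep their original order.
top_hours = sorted(sample_avg_by_hour, key=itemgetter(1), reverse=True)
print(top_hours[:3])
# [['15', 38.59], ['02', 23.81], ['20', 21.52]]
```

`sorted` returns a new list and leaves the input unchanged, whereas the bubble sort above reorders its argument in place.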


### Top 5 Hours for Ask Posts Comments

In [10]:
import datetime as dt

for row in sorted_swap:
    avg_comments = row[0]
    hour = row[1]

    # Parse the two-digit hour and render it as HH:MM
    hour_obj = dt.datetime.strptime(hour, '%H')
    hour_formatted = hour_obj.strftime('%H:%M')

    comments_formatted = f'{avg_comments:.2f}'

    print(f"{hour_formatted} : {comments_formatted}")

15:00 : 38.59
02:00 : 23.81
20:00 : 21.52
16:00 : 16.80
21:00 : 16.01
13:00 : 14.74
10:00 : 13.44
14:00 : 13.23
18:00 : 13.20
17:00 : 11.46
01:00 : 11.38
11:00 : 11.05
19:00 : 10.80
08:00 : 10.25
05:00 : 10.09
12:00 : 9.41
06:00 : 9.02
00:00 : 8.13
23:00 : 7.99
07:00 : 7.85
03:00 : 7.80
04:00 : 7.17
22:00 : 6.75
09:00 : 5.58


In [11]:
# Each entry of sorted_swap is [average comments, hour]; unpack the loop item itself
for avg_comments, hour in sorted_swap[:5]:
    print(f"Hour: {hour} and Number of Average Comments: {avg_comments}")

Hour: 15 and Number of Average Comments: 38.5948275862069
Hour: 02 and Number of Average Comments: 23.810344827586206
Hour: 20 and Number of Average Comments: 21.525
Hour: 16 and Number of Average Comments: 16.796296296296298
Hour: 21 and Number of Average Comments: 16.009174311926607


## Conclusion

#### In this analysis we compared two types of posts, "Ask HN" and "Show HN", to determine which type receives more comments on average and at what time of day. "Ask HN" posts received more comments on average than "Show HN" posts (about 14 versus about 10), so the rest of the analysis focused on "Ask HN" posts. To maximize the number of comments, we recommend creating an "Ask HN" post between 15:00 and 16:00 (3:00 p.m. - 4:00 p.m. EST): posts created in that hour averaged 38.59 comments, the highest of any hour, followed by 02:00 (23.81) and 20:00 (21.52).