## Hacker News project

In this project, a data set of Hacker News posts and the comments they received is analysed to find out which type of post receives the most comments on average, and at which hours of the day posts attract the most comments.

In [2]:
# Read the Hacker News file into a list of lists
from csv import reader

# The with-statement closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

#display the first five rows 
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
#Extract the header row into a variable
headers = hn[0]
hn = hn[1:]
print(headers)
print([0,1,2,3,4,5,6])

print('\n')
print('\n')
print('\n')

print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[0, 1, 2, 3, 4, 5, 6]






[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2

Since we are only interested in posts whose titles start with **Ask HN** or **Show HN**, we extract those posts into separate lists.

In [4]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

num_askpost = len(ask_posts)
num_showpost = len(show_posts)
num_otherpost = len(other_posts)


1744
1162
17194
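
The prefix test used in the cell above can be wrapped in a small function, which makes it easy to try on a few titles. This is only a sketch: `classify_title` and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles, for illustration only
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny CSV viewer',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Because the title is lowercased before the comparison, titles such as `SHOW HN: ...` are still counted as show posts, while a title that merely contains "ask hn" somewhere in the middle is not.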


Now that we have separated the posts based on their titles, we can check which type receives the higher average number of comments.

In [5]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / num_askpost
print(avg_ask_comments)

14.038417431192661


In [6]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / num_showpost
print(avg_show_comments)

10.31669535283993
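
The two averaging cells above follow the same pattern, so they could be folded into one helper. The sketch below assumes rows in the same column layout as `hn` (comment count at index 4); `average_comments` and the sample rows are hypothetical, with invented values.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column layout as hn (values invented)
sample_posts = [
    ['1', 'Ask HN: first', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: second', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]
print(average_comments(sample_posts))
```

With such a helper, the notebook's two averages would be `average_comments(ask_posts)` and `average_comments(show_posts)`. The empty-list guard avoids a division by zero if one category has no posts.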


Comparing the two averages, ask posts receive more comments per post (about 14.04) than show posts (about 10.32).

From this we can infer that, in this data set, an ask post generally attracts more comments than a show post.

Using the ask posts, we count the number of posts created in each hour of the day and total the comments they received, so that we can compute the average number of comments per post for each hour.

In [9]:
import datetime as dt

# Collect [created_at, num_comments] for every ask post
result_list = []
for row in ask_posts:
    result_list.append([row[6], row[4]])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments on those posts
for row in result_list:
    period = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = period.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(row[1])
        # Debug output: show the running totals each time a new hour is first seen
        print(row[1])
        print(comments_by_hour)
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[1])


6
{'09': 6}
29
{'09': 6, '13': 29}
1
{'09': 6, '13': 29, '10': 1}
3
{'09': 6, '13': 29, '10': 1, '14': 3}
17
{'09': 6, '13': 29, '10': 1, '14': 3, '16': 17}
1
{'09': 6, '13': 29, '10': 1, '14': 3, '16': 17, '23': 1}
4
{'09': 6, '13': 29, '10': 1, '14': 3, '16': 17, '23': 1, '12': 4}
1
{'09': 7, '13': 29, '10': 1, '14': 3, '16': 17, '23': 1, '12': 4, '17': 1}
1
{'09': 7, '13': 30, '10': 1, '14': 3, '16': 17, '23': 1, '12': 4, '17': 10, '15': 1}
4
{'09': 7, '13': 30, '10': 1, '14': 3, '16': 17, '23': 1, '12': 4, '17': 10, '15': 1, '21': 4}
2
{'09': 7, '13': 30, '10': 1, '14': 3, '16': 17, '23': 1, '12': 4, '17': 10, '15': 1, '21': 8, '20': 2}
3
{'09': 7, '13': 30, '10': 1, '14': 3, '16': 17, '23': 1, '12': 4, '17': 10, '15': 1, '21': 8, '20': 2, '02': 3}
2
{'09': 7, '13': 30, '10': 1, '14': 5, '16': 17, '23': 1, '12': 5, '17': 10, '15': 1, '21': 8, '20': 2, '02': 25, '18': 2}
1
{'09': 7, '13': 37, '10': 1, '14': 7, '16': 24, '23': 1, '12': 5, '17': 13, '15': 7, '21': 8, '20': 2, '02': 25

In [10]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_comments])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
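
The hourly counts and averages computed above can also be produced in a single function. The following is a minimal sketch, assuming the same `created_at` format as the data set (`month/day/year hour:minute`); `average_comments_by_hour` and the sample rows are hypothetical and the values are invented.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows):
    """rows: iterable of (created_at, num_comments) string pairs.

    Returns {hour: average comments}, with hours as zero-padded strings.
    """
    counts = defaultdict(int)   # hour -> number of posts
    totals = defaultdict(int)   # hour -> total comments
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        totals[hour] += int(num_comments)
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Hypothetical sample rows, for illustration only
sample_rows = [
    ('8/4/2016 11:52', '6'),
    ('1/26/2016 11:05', '2'),
    ('6/23/2016 9:20', '5'),
]
averages = average_comments_by_hour(sample_rows)
print(averages)
```

Using `defaultdict(int)` removes the need for the `if hour not in counts_by_hour` branch, because a missing hour starts at zero.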


In order to sort the results by average, we swap the columns of the avg_by_hour list of lists so that the average comes first.

In [11]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [12]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
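
As an alternative to swapping the columns, `sorted` accepts a `key` function that selects the value to sort on. The sketch below uses hypothetical `[hour, average]` pairs in the same shape as `avg_by_hour`; the rounded values are for illustration only.

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour
sample_avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), largest first, without swapping columns
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in sorted_by_avg[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

One difference worth noting: when two hours have exactly the same average, the swapped-list approach breaks the tie by comparing the hour strings, whereas the `key` version keeps tied entries in their original order.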

In [13]:
print("Top 5 Hours for Ask Posts Comments")

Top 5 Hours for Ask Posts Comments


In [22]:
for row in sorted_swap[:5]:
    template = "{time_}: {average:.2f} average comments per post"
    time = dt.datetime.strptime(row[1], "%H")
    print(template.format(time_ = time.strftime("%H:00"), average = row[0]))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


## To be continued...