# Hacker News
## What it's about:
Data analysis project on the Hacker News 2016 [dataset](https://www.kaggle.com/hacker-news/hacker-news-posts).
## Goal of the Project:
Find the differences between Ask HN and Show HN posts on Hacker News by answering the following questions:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [64]:
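from csv import reader

# Read every row (header included) into a list of lists; the `with` block
# closes the file once reading is done.
with open('HN_posts_year_to_Sep_26_2016.csv') as f:
    hn = list(reader(f))
print(hn[:4])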

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]


In [65]:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


> Separated the header from the dataset 

In [66]:
ask_post = []
show_post = []
other_post = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_post.append(row)
    elif title.startswith("show hn"):
        show_post.append(row)
    else:
        other_post.append(row)
        
print(len(ask_post))
print(len(show_post))
print(len(other_post))

9139
10158
273822


> Separated the ASK HN and SHOW HN posts. There are more SHOW HN posts (10158) than ASK HN posts (9139). OTHER posts lead by far (273822), but we will ignore them because they are not relevant to our task.
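
The title check can also be factored into a small function, which makes the prefix rule easy to test on its own. This is a minimal sketch, not the notebook's code: `classify_post` and `sample_titles` are hypothetical names, and the titles are made up for illustration.

```python
from collections import Counter


def classify_post(title):
    """Label a title 'ask', 'show', or 'other' by its case-insensitive prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles (assumption), not rows from the dataset.
sample_titles = [
    "Ask HN: How do you learn?",
    "Show HN: My side project",
    "SQLite tips",
    "ask hn: lowercase prefix",
]

# Count how many titles fall into each category.
category_counts = Counter(classify_post(title) for title in sample_titles)
print(category_counts)
```

On the real data, the same function could be applied to `row[1]` for each row in `hn`.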

In [67]:
total_ask_comments = 0
for row in ask_post:
    n_comm = int(row[4])
    total_ask_comments += n_comm
avg_ask_comments = total_ask_comments / len(ask_post)
print(avg_ask_comments)

10.393478498741656


> Found the average number of comments for ASK HN posts

In [68]:
total_show_comments = 0
for row in show_post:
    n_comm = int(row[4])
    total_show_comments += n_comm
avg_show_comments = total_show_comments / len(show_post)
print(avg_show_comments)

4.886099625910612


> Found the average number of comments for SHOW HN posts

## Answer 1
Ask HN posts receive more comments on average than Show HN posts: about 10.39 comments per post versus about 4.89.
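
The two averaging cells differ only in the list they loop over, so the calculation could live in one function. A minimal sketch under stated assumptions: `average_comments` and `sample_rows` are hypothetical names, and the rows are made up in the dataset's column order (`num_comments` at index 4).

```python
def average_comments(rows, comments_index=4):
    """Return the mean of the integer values stored in column `comments_index`."""
    if not rows:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Made-up rows (assumption) that follow the dataset's column order.
sample_rows = [
    ["1", "Ask HN: example A", "", "3", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example B", "", "1", "8", "user_b", "9/26/2016 15:02"],
]

print(average_comments(sample_rows))  # 6.0
```

With the notebook's lists, `average_comments(ask_post)` and `average_comments(show_post)` would replace the two loops.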

In [69]:
import datetime as dt
result_list = []
for row in ask_post:
    created = row[6]
    n_comm = int(row[4])
    result_list.append([created, n_comm])

In [70]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created = row[0]
    # Parse the timestamp, then keep only the zero-padded hour (e.g. '03').
    time = dt.datetime.strptime(created, '%m/%d/%Y %H:%M')
    hour = time.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(comments_by_hour)

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [71]:
avg_by_hour = []
for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
print(avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]


> Found average number of comments at each hour of the day
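
The counting and averaging steps can be combined with `collections.defaultdict`, which removes the "is this hour already a key?" branch. This is a sketch rather than the notebook's method: `average_comments_by_hour` and `sample_pairs` are hypothetical names, and the sample values are made up.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Map each zero-padded hour string to its average comment count.

    `pairs` is a sequence of [created_at, num_comments] items, in the same
    shape as `result_list` in the notebook.
    """
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created, n_comm in pairs:
        hour = dt.datetime.strptime(created, fmt).strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comm
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Made-up timestamps and comment counts (assumption), not dataset rows.
sample_pairs = [
    ["9/26/2016 3:26", 4],
    ["9/25/2016 3:05", 6],
    ["9/26/2016 15:02", 9],
]

print(average_comments_by_hour(sample_pairs))  # {'03': 5.0, '15': 9.0}
```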

In [72]:
swapped = []
for data in avg_by_hour:
    swapped.append([data[1], data[0]])

swapped.sort(reverse = True)
print(swapped)

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]
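
Swapping the columns before sorting works, but `sorted` with a `key` function can rank the `[hour, average]` pairs directly and leave the pair order unchanged. A minimal sketch, assuming a hypothetical `avg_sample` list with rounded, illustrative values:

```python
# Illustrative [hour, average] pairs (assumption), in the shape of avg_by_hour.
avg_sample = [["09", 6.65], ["15", 28.68], ["13", 16.32]]

# Sort by the average (index 1), highest first, without swapping columns.
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Because each pair keeps the hour first, the loop can unpack `hour, avg` in reading order.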


In [73]:
print("Top 5 Hours for 'Ask HN' Comments")
for avg, hr in swapped[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg
        )
    )

Top 5 Hours for 'Ask HN' Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


## Answer 2
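Ask HN posts created during certain hours receive more comments on average. Posts created in the 15:00 hour lead with 28.68 comments per post, followed by 13:00 (16.32), 12:00 (12.38), 02:00 (11.14), and 10:00 (10.68). This part of the analysis covers Ask HN posts only, and the hours are those recorded in the `created_at` column.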