# Hacker News Analysis

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

In [38]:
from csv import reader
import datetime as dt

In [9]:
opened_file = open('HN_posts_year_to_Sep_26_2016.csv')

In [10]:
read_file = reader(opened_file)

In [11]:
hn_posts = list(read_file)

In [14]:
print(hn_posts[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]


In [15]:
headers = hn_posts[:1]  # the slice keeps the header row wrapped in an outer list

In [16]:
hn_posts = hn_posts[1:]

In [17]:
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [18]:
print(hn_posts[:5])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
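
As a side note, the header/data split above can be wrapped in a small helper. The sketch below is only an illustration: `split_header` and the in-memory `sample_csv` are hypothetical names that do not appear in the notebook, and the sample row is made up. It returns the header as a flat list (`rows[0]`) rather than a list inside a list.

```python
import csv
import io


def split_header(rows):
    """Return (header, data_rows) from a list of parsed CSV rows."""
    header = rows[0]   # flat list of column names
    data = rows[1:]    # every row after the header
    return header, data


# Hypothetical in-memory sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example?,,3,7,someone,9/26/2016 2:53\n"
)

header, data = split_header(list(csv.reader(sample_csv)))
print(header)
print(len(data))
```

With a real file, the same helper could be fed from inside a `with open(...) as f:` block, which closes the file automatically once the rows have been read.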


In [19]:
ask_posts = []
show_posts = []
other_posts = []

In [20]:
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [22]:
for row in hn_posts:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
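
The same classification can be sketched as a reusable function. This is a minimal illustration on made-up rows; `classify_posts`, `title_index`, and `sample_rows` are hypothetical names, not part of the notebook. Lowercasing the title once makes the prefix check case-insensitive, as in the loop above.

```python
def classify_posts(rows, title_index=1):
    """Split rows into (ask, show, other) lists based on the title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[title_index].lower()  # lowercase once for both checks
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Made-up rows with the same column layout as the data set.
sample_rows = [
    ['1', 'Ask HN: How do you learn?', '', '2', '5', 'a', '9/26/2016 2:53'],
    ['2', 'Show HN: My project', 'http://example.com', '4', '1', 'b', '9/26/2016 1:17'],
    ['3', 'algorithmic music', 'http://example.com/x', '1', '0', 'c', '9/26/2016 3:16'],
    ['4', 'ASK HN: uppercase works too', '', '1', '0', 'd', '9/25/2016 22:57'],
]

ask, show, other = classify_posts(sample_rows)
print(len(ask), len(show), len(other))  # 2 1 1
```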

In [23]:
print(ask_posts)



In [24]:
print(len(ask_posts))

9139


In [25]:
print(len(show_posts))

10158


In [26]:
print(len(other_posts))

273822


In [28]:
print(headers)

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]


In [29]:
total_ask_comments = 0

In [30]:
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

In [31]:
avg_ask_comments = total_ask_comments / len(ask_posts)

In [32]:
print(avg_ask_comments)

10.393478498741656


In [33]:
total_show_comments = 0

In [34]:
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

In [36]:
avg_show_comments = total_show_comments / len(show_posts)

In [37]:
print(avg_show_comments)

4.886099625910612


On average, ask posts receive about twice as many comments as show posts (roughly 10.39 versus 4.89 comments per post).
We are going to analyze the comments on ask posts in more depth.
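
The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below assumes the comment count sits at column index 4, as in the header printed earlier; `average_comments`, `sample_ask`, and `sample_show` are hypothetical names, and the rows are made up.

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments per row, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Made-up rows: only the num_comments column (index 4) matters here.
sample_ask = [['', '', '', '', '7'], ['', '', '', '', '3']]
sample_show = [['', '', '', '', '2']]

print(average_comments(sample_ask))   # 5.0
print(average_comments(sample_show))  # 2.0
```

Guarding against an empty list avoids a `ZeroDivisionError` if one of the categories happens to contain no posts.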

In [39]:
result_list = []

In [55]:
for row in ask_posts:
    created_at = row[-1]        # 'created_at' is the last column
    num_comments = int(row[4])  # 'num_comments' is the fifth column
    result_list.append([created_at, num_comments])

In [58]:
print(result_list[:5])

[('9/26/2016 2:53', '7'), ('9/26/2016 1:17', '3'), ('9/25/2016 22:57', '0'), ('9/25/2016 22:48', '3'), ('9/25/2016 21:50', '2')]


In [89]:
counts_by_hour = {}
comments_by_hour = {}

In [99]:
for each in result_list:
    date_time = each[0]
    # Split e.g. '9/26/2016 2:53' into date and time, then keep the zero-padded hour.
    date, time = date_time.split(' ')
    hour = dt.datetime.strptime(time, "%H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(each[1])
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(each[1])
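
The hourly tally above can also be sketched with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is an illustrative variant, not the notebook's own code: `tally_by_hour` and `sample` are hypothetical names, and it parses the whole timestamp in one `strptime` call on the assumption that every value has the `month/day/year hour:minute` shape seen in the output above.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs):
    """Return (counts, comments) dictionaries keyed by zero-padded hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Assumed timestamp format, e.g. '9/26/2016 2:53'.
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(num_comments)  # accepts '7' as well as 7
    return dict(counts), dict(comments)


# Made-up (created_at, num_comments) pairs.
sample = [('9/26/2016 2:53', 7), ('9/26/2016 2:10', '3'), ('9/25/2016 22:57', 0)]
counts, comments = tally_by_hour(sample)
print(counts)    # {'02': 2, '22': 1}
print(comments)  # {'02': 10, '22': 0}
```

Because the function builds fresh dictionaries on every call, running it a second time does not add the same posts to the totals again.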

counts_by_hour: the number of ask posts created during each hour of the day.

In [100]:
print(counts_by_hour)

{'02': 2157, '01': 2259, '22': 3070, '21': 4146, '19': 4418, '17': 4696, '15': 5168, '14': 4104, '13': 3552, '11': 2496, '10': 2256, '09': 1776, '07': 1808, '03': 2168, '23': 2744, '20': 4080, '16': 4632, '08': 2056, '00': 2408, '18': 4912, '12': 2736, '04': 1944, '06': 1872, '05': 1672}


comments_by_hour: the total number of comments received by the ask posts created during each hour of the day.

In [101]:
print(comments_by_hour)

{'02': 23989, '01': 16721, '22': 26982, '21': 36002, '19': 31633, '17': 44376, '15': 148200, '14': 39776, '13': 57960, '11': 22376, '10': 24104, '09': 11816, '07': 12680, '03': 17232, '23': 18376, '20': 35696, '16': 35728, '08': 18896, '00': 18216, '18': 39016, '12': 33872, '04': 18880, '06': 12696, '05': 14704}


Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [119]:
av_comments_hour_post = []
for hour in counts_by_hour:
    avg_by_hour = comments_by_hour[hour]/counts_by_hour[hour]
    av_comments_hour_post.append([hour, avg_by_hour])

In [120]:
print(*av_comments_hour_post, sep = "\n")

['02', 11.121464997681965]
['01', 7.4019477644975655]
['22', 8.788925081433225]
['21', 8.683550410033767]
['19', 7.160027161611589]
['17', 9.449744463373083]
['15', 28.676470588235293]
['14', 9.692007797270955]
['13', 16.31756756756757]
['11', 8.96474358974359]
['10', 10.684397163120567]
['09', 6.653153153153153]
['07', 7.013274336283186]
['03', 7.948339483394834]
['23', 6.696793002915452]
['20', 8.749019607843136]
['16', 7.713298791018998]
['08', 9.190661478599221]
['00', 7.5647840531561465]
['18', 7.94299674267101]
['12', 12.380116959064328]
['04', 9.7119341563786]
['06', 6.782051282051282]
['05', 8.794258373205741]


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [122]:
swap_avg_by_hour = []
for row in av_comments_hour_post:
    # Put the average first so that sorted() orders the rows by it.
    swap_avg_by_hour.append([row[1], row[0]])
print(*swap_avg_by_hour, sep="\n")

[11.121464997681965, '02']
[7.4019477644975655, '01']
[8.788925081433225, '22']
[8.683550410033767, '21']
[7.160027161611589, '19']
[9.449744463373083, '17']
[28.676470588235293, '15']
[9.692007797270955, '14']
[16.31756756756757, '13']
[8.96474358974359, '11']
[10.684397163120567, '10']
[6.653153153153153, '09']
[7.013274336283186, '07']
[7.948339483394834, '03']
[6.696793002915452, '23']
[8.749019607843136, '20']
[7.713298791018998, '16']
[9.190661478599221, '08']
[7.5647840531561465, '00']
[7.94299674267101, '18']
[12.380116959064328, '12']
[9.7119341563786, '04']
[6.782051282051282, '06']
[8.794258373205741, '05']


In [124]:
sorted_swap_avg_by_hour = sorted(swap_avg_by_hour, reverse=True)
print(*sorted_swap_avg_by_hour, sep = "\n")

[28.676470588235293, '15']
[16.31756756756757, '13']
[12.380116959064328, '12']
[11.121464997681965, '02']
[10.684397163120567, '10']
[9.7119341563786, '04']
[9.692007797270955, '14']
[9.449744463373083, '17']
[9.190661478599221, '08']
[8.96474358974359, '11']
[8.794258373205741, '05']
[8.788925081433225, '22']
[8.749019607843136, '20']
[8.683550410033767, '21']
[7.948339483394834, '03']
[7.94299674267101, '18']
[7.713298791018998, '16']
[7.5647840531561465, '00']
[7.4019477644975655, '01']
[7.160027161611589, '19']
[7.013274336283186, '07']
[6.782051282051282, '06']
[6.696793002915452, '23']
[6.653153153153153, '09']


In [151]:
for each in sorted_swap_avg_by_hour:
    avg_comments = each[0]
    hour = each[1]
    # Turn the two-digit hour string (e.g. '15') into 'HH:MM' form.
    hour = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    result = "{:.2f}  {}".format(avg_comments, hour)
    print(result)

28.68  15:00
16.32  13:00
12.38  12:00
11.12  02:00
10.68  10:00
9.71  04:00
9.69  14:00
9.45  17:00
9.19  08:00
8.96  11:00
8.79  05:00
8.79  22:00
8.75  20:00
8.68  21:00
7.95  03:00
7.94  18:00
7.71  16:00
7.56  00:00
7.40  01:00
7.16  19:00
7.01  07:00
6.78  06:00
6.70  23:00
6.65  09:00
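
The averaging, sorting, and formatting steps above can be condensed by sorting on the average directly with a `key` function, which makes the swapped list unnecessary, and by slicing to keep only the top entries. The sketch below runs on made-up dictionaries; `top_hours`, `sample_counts`, and `sample_comments` are hypothetical names.

```python
def top_hours(counts, comments, n=5):
    """Return the n (hour, average) pairs with the highest average comments."""
    averages = [(hour, comments[hour] / counts[hour]) for hour in counts]
    # Sort by the average (second item of each pair), highest first.
    ranked = sorted(averages, key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Made-up tallies keyed by zero-padded hour.
sample_counts = {'15': 2, '13': 4, '09': 1}
sample_comments = {'15': 60, '13': 40, '09': 3}

best = top_hours(sample_counts, sample_comments, n=2)
for hour, avg in best:
    print("{}:00  {:.2f} average comments per post".format(hour, avg))
```

Passing `n=5` together with the notebook's `counts_by_hour` and `comments_by_hour` would list only the five highest hourly averages.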
