## Exploring Hacker News Posts
In this project I'll compare two types of posts from Hacker News, a popular site where technology-related stories (or 'posts') are voted and commented on. The titles of these two types of posts begin with either `Ask HN` or `Show HN`.

Users submit `Ask HN` posts to ask the Hacker News community a specific question, such as "What is the best online course you've ever taken?" 
Likewise, users submit `Show HN` posts to show the Hacker News community a project, product, or just generally something interesting.

I'll compare these two types of posts to answer the following questions:

1. Do `Ask HN` or `Show HN` posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

#### Reading and extracting data
First, I'll open and read the dataset, then separate the header row from the data rows.

In [8]:
import csv
import datetime as dt

In [9]:
with open('HN_posts_year_to_Sep_26_2016.csv', mode='r') as opened_file:
  reader = csv.reader(opened_file)
  hn = list(reader)
  header = hn[0]
  hn = hn[1:]

print(header)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
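The cell above addresses columns by position (for example, `row[4]` for the comment count). As a hedged alternative, the sketch below reads the same header with `csv.DictReader`, so columns can be addressed by name. The two data rows are made-up stand-ins for the real file, held in memory with `io.StringIO`.

```python
import csv
import io

# A two-row stand-in for the real file (made-up values), using the same header.
csv_text = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,3,7,user_a,9/26/2016 2:53\n"
    "2,Show HN: Example project,http://example.com,5,4,user_b,9/25/2016 20:06\n"
)

# DictReader consumes the first line as the header and yields one dict per row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)
print(rows[0]['title'], rows[0]['num_comments'])
```

The rest of this notebook keeps the list-of-lists layout, so this is only an illustration of the trade-off: named access is easier to read, while positional access matches the original cells.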


#### Removing posts with 0 comments

I'll keep only the posts that have at least one comment.

In [47]:
new_hn = []
hn_zero = []

for row in hn:
  comments = row[4]
  if comments == '0':
    hn_zero.append(row)
  else:
    new_hn.append(row)

print('Records with at least 1 comment (used):', len(new_hn))
print('Records with 0 comments (not used):', len(hn_zero))


Records with at least 1 comment (used): 80401
Records with 0 comments (not used): 212718
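The same split can be sketched with list comprehensions. The rows below are made-up values in the dataset's column order, and `NUM_COMMENTS` is a name introduced here for illustration. Comparing as integers rather than against the string `'0'` is an assumption about robustness, not something the dataset is known to require.

```python
# Toy rows in the same column order as the dataset (made-up values, for illustration only).
sample_rows = [
    ['1', 'Ask HN: Example question?', '', '3', '0', 'user_a', '9/26/2016 3:26'],
    ['2', 'Show HN: Example project', 'http://example.com', '5', '4', 'user_b', '9/26/2016 3:24'],
    ['3', 'Some other story', 'http://example.org', '2', '12', 'user_c', '9/25/2016 19:30'],
]

NUM_COMMENTS = 4  # index of the num_comments column (hypothetical constant name)

# Compare as integers so that a value such as '00' would also count as zero.
with_comments = [row for row in sample_rows if int(row[NUM_COMMENTS]) > 0]
without_comments = [row for row in sample_rows if int(row[NUM_COMMENTS]) == 0]

print(len(with_comments), len(without_comments))
```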


Now it's time to split the data into three lists: ask posts, show posts and other posts.

In [37]:
ask_posts = []
show_posts = []
other_posts = []


for row in new_hn:
  title = row[1].lower()
  if title.startswith('ask hn'):
    ask_posts.append(row)
  elif title.startswith('show hn'):
    show_posts.append(row)
  else:
    other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))



6911
5059
68431
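The prefix check used above can also be wrapped in a small helper, which makes the rule easy to test on its own. The function name `post_type` and the sample titles are hypothetical. Note that a plain prefix match also accepts titles such as "Ask HNers ...", which may or may not be what is wanted.

```python
def post_type(title):
    """Return 'ask', 'show' or 'other' based on how the title starts (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, for illustration only.
titles = [
    'Ask HN: Example question?',
    'SHOW HN: Example project',
    'An ordinary story',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    groups[post_type(title)].append(title)

print({kind: len(items) for kind, items in groups.items()})
```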


In [38]:
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
print('\n')
print(other_posts[:5])


[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'], ['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']]


[['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Por

#### Average number of comments by post type
I'll create a function that computes the average number of comments for each type of post.


In [39]:
def avg_comments(list_name, col_index):
  # Sum the integer values in column col_index, then divide by the number of rows.
  total_comments = 0
  for row in list_name:
    total_comments += int(row[col_index])

  return total_comments / len(list_name)

In [40]:
avg_ask_comments = avg_comments(ask_posts, 4)
print('Average ask post comments: ', avg_ask_comments)
print('\n')
avg_show_comments = avg_comments(show_posts, 4)
print('Average show post comments: ', avg_show_comments)
print('\n')
avg_other_comments = avg_comments(other_posts, 4)
print('Average other post comments: ', avg_other_comments)

Average ask post comments:  13.744175951381855


Average show post comments:  9.810832180272781


Average other post comments:  25.838318890561297
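The averaging function above divides by `len(list_name)`, so it raises `ZeroDivisionError` if a list is empty. A minimal sketch of a guarded variant, using `statistics.mean` from the standard library, is shown below. The name `avg_column` and the toy rows are assumptions made for illustration.

```python
from statistics import mean


def avg_column(rows, col_index):
    """Average of an integer column; returns None for an empty list instead of raising."""
    values = [int(row[col_index]) for row in rows]
    return mean(values) if values else None


# Made-up rows in the dataset's column order, for illustration only.
toy_rows = [
    ['1', 'Ask HN: Example A?', '', '3', '4', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: Example B?', '', '1', '12', 'user_b', '9/26/2016 1:17'],
]

print(avg_column(toy_rows, 4))
print(avg_column([], 4))
```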


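On average, ask posts received about 13.7 comments, compared with about 9.8 for show posts. Keep in mind that posts with 0 comments were removed earlier, so these averages cover only posts that received at least one comment. Since ask posts receive more comments on average, I'll continue the analysis with only these posts.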

#### Number of ask posts and comments by hour
The next two steps are:
1. Calculate the number of ask posts created in each hour of the day, and the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

In [41]:
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_str = row[0]
    date_dt = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M')
    hour = date_dt.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1  
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print('Counts:', counts_by_hour)
print('\n')
print('Comments:', comments_by_hour)

Counts: {'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}


Comments: {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
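The counting step above can be sketched more compactly with `collections.defaultdict`, which removes the need for the "first time seen" branch. The `(created_at, num_comments)` pairs below are made-up values in the dataset's date format. The format string `'%m/%d/%Y %H:%M'` is the one used in the cell above.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs, for illustration only.
sample = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 2:10', 3),
    ('9/25/2016 15:30', 10),
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total number of comments

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour ('00' to '23').
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```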


#### Calculating the average number of comments ask posts receive by hour created

In [42]:
avg_by_hour = []

for hour in counts_by_hour:
  avg_comments_per_hour = comments_by_hour[hour] / counts_by_hour[hour]
  avg_by_hour.append([hour, avg_comments_per_hour])

print(avg_by_hour)

[['02', 13.198237885462555], ['01', 9.367713004484305], ['22', 11.749128919860627], ['21', 11.056511056511056], ['19', 9.414285714285715], ['17', 13.73019801980198], ['15', 39.66809421841542], ['14', 13.153439153439153], ['13', 22.2239263803681], ['11', 11.143426294820717], ['10', 13.757990867579908], ['09', 8.392045454545455], ['07', 10.095541401273886], ['03', 10.160377358490566], ['16', 10.76144578313253], ['08', 12.43157894736842], ['00', 9.857142857142858], ['23', 8.322463768115941], ['20', 11.38265306122449], ['18', 10.789823008849558], ['12', 15.452554744525548], ['04', 12.688172043010752], ['06', 9.017045454545455], ['05', 11.139393939393939]]


#### Sorting the list and showing the top 5 results

In [43]:
swap_avg_by_hour = []

for row in avg_by_hour:
  swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[13.198237885462555, '02'], [9.367713004484305, '01'], [11.749128919860627, '22'], [11.056511056511056, '21'], [9.414285714285715, '19'], [13.73019801980198, '17'], [39.66809421841542, '15'], [13.153439153439153, '14'], [22.2239263803681, '13'], [11.143426294820717, '11'], [13.757990867579908, '10'], [8.392045454545455, '09'], [10.095541401273886, '07'], [10.160377358490566, '03'], [10.76144578313253, '16'], [12.43157894736842, '08'], [9.857142857142858, '00'], [8.322463768115941, '23'], [11.38265306122449, '20'], [10.789823008849558, '18'], [15.452554744525548, '12'], [12.688172043010752, '04'], [9.017045454545455, '06'], [11.139393939393939, '05']]


In [44]:
# Top 5 hours with the highest average number of comments per post
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap[:5])

[[39.66809421841542, '15'], [22.2239263803681, '13'], [15.452554744525548, '12'], [13.757990867579908, '10'], [13.73019801980198, '17']]
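Swapping the columns works because lists are compared element by element, so the average is compared first. A hedged alternative that skips the swap is to pass a `key` function to `sorted`. The sample below uses a few `[hour, average]` pairs with values rounded from the output above.

```python
# A few [hour, average] pairs, rounded from the notebook output, for illustration.
avg_sample = [['02', 13.2], ['15', 39.7], ['13', 22.2], ['09', 8.4]]

# Sort by the second element (the average), highest first, and keep the top 3.
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top:
    print('{}:00 -> average {:.2f} comments per post'.format(hour, avg))
```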


In [45]:
for row in sorted_swap[:5]:
  hr = dt.datetime.strptime(row[1], '%H')
  hr_formatted = hr.strftime('%H:%M')

  conclusion = '{time} -> average {avg:.2f} comments per post'.format(time=hr_formatted, avg=row[0])
  print(conclusion)


15:00 -> average 39.67 comments per post
13:00 -> average 22.22 comments per post
12:00 -> average 15.45 comments per post
10:00 -> average 13.76 comments per post
17:00 -> average 13.73 comments per post


## Conclusion
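In this project I analyzed `Ask HN` and `Show HN` posts that received at least one comment. Ask posts received more comments on average than show posts (about 13.7 versus 9.8). Among ask posts, those created during the 15:00 hour received the most comments on average (39.67 per post), followed by 13:00 (22.22) and 12:00 (15.45). The hours are taken as recorded in the dataset's `created_at` column.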