# Introduction

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.
The data set is available at https://www.kaggle.com/hacker-news/hacker-news-posts. Below are descriptions of the columns:

- id: The unique identifier from Hacker News for the post
- title: The title of the post
- url: The URL that the post links to, if the post has a URL
- num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- num_comments: The number of comments that were made on the post
- author: The username of the person who submitted the post
- created_at: The date and time at which the post was submitted
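
To make the column layout concrete, here is a minimal sketch that pairs the column names with one row from the data set. The names `columns`, `sample_row`, and `record` are introduced only for this illustration and are not used elsewhere in the notebook.

```python
# Column names in file order, as described in the list above.
columns = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# One row from the data set; every field is read from the CSV as a string.
sample_row = ['12578908', 'Ask HN: What TLD do you use for local development?', '',
              '4', '7', 'Sevrene', '9/26/2016 2:53']

# Pair each column name with its value to make the positions explicit.
record = dict(zip(columns, sample_row))

print(record['num_comments'])         # the value is still a string here
print(columns.index('num_comments'))  # position of the column within a row
```

Because every value arrives as a string, numeric columns such as num_points and num_comments need an int() conversion before any arithmetic.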

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. 

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. 

We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader

# Read the whole CSV into a list of lists; the first row holds the column names.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

In [2]:
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16']]

# Removing Headers from a List of Lists

In [3]:
headers = hn[0]
hn = hn[1:]

In [4]:
hn[0:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

# Extracting Ask HN and Show HN Posts

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Given a string object, say string1, we can check whether it starts with, say, ask hn by calling string1.startswith('ask hn'). The call returns True if string1 starts with ask hn and False otherwise. Because startswith is case-sensitive, we convert each title to lowercase before checking it.

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
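
The check described above can be sketched on a few titles from the data set. The helper name `classify` is hypothetical and exists only for this illustration; the notebook itself performs the same test inline in a loop.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalise case so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
]

for title in sample_titles:
    print(classify(title), '<-', title)
```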

In [5]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [6]:
print("Ask posts:",len(ask_posts))

Ask posts: 9139


In [7]:
print("Show posts:",len(show_posts))

Show posts: 10158


In [8]:
print("Other posts:",len(other_posts))

Other posts: 273822


In [9]:
ask_posts[0:3]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57']]

In [10]:
show_posts[0:3]

[['12578335',
  'Show HN: Finding puns computationally',
  'http://puns.samueltaylor.org/',
  '2',
  '0',
  'saamm',
  '9/26/2016 0:36'],
 ['12578182',
  'Show HN: A simple library for complicated animations',
  'https://christinecha.github.io/choreographer-js/',
  '1',
  '0',
  'christinecha',
  '9/26/2016 0:01'],
 ['12578098',
  'Show HN: WebGL visualization of DNA sequences',
  'http://grondilu.github.io/dna.html',
  '1',
  '0',
  'grondilu',
  '9/25/2016 23:44']]

In [11]:
other_posts[0:3]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19']]

# Calculating the Average Number of Comments for Ask HN and Show HN Posts

Let's determine if ask posts or show posts receive more comments on average.

##### Calculating Avg Ask Comments

In [12]:
total_ask_comments = 0
for row in ask_posts:
    ask_num_comments = int(row[4])
    total_ask_comments += ask_num_comments

In [13]:
total_ask_comments

94986

In [14]:
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Average ask comments:",avg_ask_comments)

Average ask comments: 10.393478498741656


##### Calculating Avg Show Comments

In [15]:
total_show_comments = 0

for row in show_posts:
    show_num_comments = int(row[4])
    total_show_comments += show_num_comments

In [16]:
total_show_comments

49633

In [17]:
avg_show_comments = total_show_comments / len(show_posts)

print("Average show comments:",avg_show_comments)

Average show comments: 4.886099625910612


From the analysis above, Ask HN posts receive more than twice as many comments on average as Show HN posts (about 10.39 versus 4.89). This may be because Ask HN posts explicitly solicit responses, whereas Show HN posts present something and do not require a reply.
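
The two averages follow the same pattern, so the calculation can be sketched as a single helper. The function name `average_of_column` is hypothetical and not part of the notebook; the three sample rows are the first ask posts shown earlier.

```python
def average_of_column(rows, index):
    """Average the integer values stored as strings at position `index`."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)

sample_ask_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

# Index 4 is num_comments: (7 + 3 + 0) / 3
print(average_of_column(sample_ask_rows, 4))
```

Under this sketch, calling the helper with ask_posts or show_posts and index 4 would compute the same averages as the cells above.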

# When Are Ask Posts Most Active

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain time (Hour) are more likely to attract comments. 
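
Extracting the hour from a created_at value can be sketched as follows, using one timestamp from the rows shown earlier. The variable names `created_at` and `parsed` are introduced only for this illustration.

```python
import datetime as dt

# Format of the created_at strings in this data set, e.g. '9/26/2016 2:53'.
date_format = '%m/%d/%Y %H:%M'

created_at = '9/26/2016 2:53'  # sample value taken from an ask post
parsed = dt.datetime.strptime(created_at, date_format)

print(parsed.hour)            # the hour as an int
print(parsed.strftime('%H'))  # the hour as a zero-padded two-character string
```

The zero-padded string form is convenient as a dictionary key because every hour then has the same width, which is the form used in the cells below.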

In [18]:
ask_posts[0:5]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57'],
 ['12577870',
  'Ask HN: Why join a fund when you can be an angel?',
  '',
  '1',
  '3',
  'anthony_james',
  '9/25/2016 22:48'],
 ['12577647',
  'Ask HN: Someone uses stock trading as passive income?',
  '',
  '5',
  '2',
  '00taffe',
  '9/25/2016 21:50']]

In [19]:
#The created_at column is the seventh column in ask_posts, so we take the element at index 6 (the last one) in each row
#The second element is the number of comments on the post, at index 4 (third from last) in each row
import datetime as dt
result_list = [[row[-1],int(row[-3])] for row in ask_posts]

In [20]:
result_list[0:5]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2]]

In [21]:
#We'll use the datetime.strptime() method to parse the date and create a datetime object.
counts_by_hour = {}
comments_by_hour = {}
            
date_format = '%m/%d/%Y %H:%M'

#Parse each 'M/D/YYYY H:MM' string into a datetime object, replacing the string in place
for row in result_list:
    dttime = row[0]
    dttime = dt.datetime.strptime(dttime,date_format)
    row[0] = dttime

In [22]:
print(result_list[0:2])

[[datetime.datetime(2016, 9, 26, 2, 53), 7], [datetime.datetime(2016, 9, 26, 1, 17), 3]]


In [42]:
for row in result_list:
    dthour = row[0]
    dthour = dthour.strftime("%H")
    if dthour not in counts_by_hour:
        counts_by_hour[dthour] = 1
        comments_by_hour[dthour] = row[1]
    else:
        counts_by_hour[dthour] += 1
        comments_by_hour[dthour] += row[1]

In [24]:
#contains the number of ask posts created during each hour of the day. 
counts_by_hour

{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

In [25]:
#contains the corresponding number of comments ask posts created at each hour received.
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

# Calculating the Average Number of Comments for Ask HN Posts by Hour

We'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. We'll create a list of lists in which the first element is the hour and the second element is the average number of comments per post
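
As a small worked example of that calculation, the sketch below uses two entries copied from the dictionaries above. The names `counts_sample`, `comments_sample`, and `avg_sample` are hypothetical and exist only for this illustration.

```python
# Two entries copied from counts_by_hour and comments_by_hour.
counts_sample = {'15': 646, '13': 444}
comments_sample = {'15': 18525, '13': 7245}

# For each hour, divide the total comments by the number of posts.
avg_sample = [[hour, comments_sample[hour] / counts_sample[hour]]
              for hour in counts_sample]

for hour, avg in avg_sample:
    print(hour, round(avg, 2))
```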

In [26]:
avg_by_hour = []

for row in counts_by_hour:
    temp_comments = []
    hour = row
    no_posts = counts_by_hour[row] #number of ask posts created during this hour
    no_comments = comments_by_hour[row] #total comments received by those posts
    avg_hour = int(no_comments)/int(no_posts)
    temp_comments = [hour,avg_hour]
    avg_by_hour.append(temp_comments)

In [27]:
avg_by_hour

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]

# Sorting and Printing Values from a List of Lists

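We want to identify the hours with the highest average number of comments, so we sort the list in descending order. Because sorted() compares lists element by element, starting with the first, we first swap each pair so that the average comes before the hour.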
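
As an alternative sketch, sorted() accepts a key function, which ranks the pairs by their average without building a swapped copy. This is not the approach taken in the cells below; the names `avg_sample` and `ranked` are hypothetical, and the values are copied from avg_by_hour.

```python
# Three [hour, average] pairs copied from avg_by_hour.
avg_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['09', 6.653153153153153],
]

# Sort by the second element (the average), largest first.
ranked = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{0}:00: {1:,.2f} average comments per post".format(hour, avg))
```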

In [28]:
swap_avg_by_hour = [[row[1],row[0]] for row in avg_by_hour]

In [29]:
swap_avg_by_hour

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

In [30]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]

In [31]:
Top_5_Hours_for_Ask_Posts_Comments = sorted_swap[0:5]
Top_5_Hours_for_Ask_Posts_Comments

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10']]

In [32]:
hour_format = '%H'

for row in Top_5_Hours_for_Ask_Posts_Comments:
    avg = row[0]
    hours = row[1]
    hours = dt.datetime.strptime(hours,hour_format)
    hours = hours.strftime("%H")
    output = "{0}:00: {1:,.2f} average comments per post".format(hours,avg)
    print(output)

15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post


The average number of comments per post is highest for ask posts created at 15:00, followed by 13:00 and 12:00. Perhaps people working in offices take a break after lunch and comment then. The 10:00 and 02:00 slots are more surprising. The 10:00 peak may come from office workers browsing after checking their email, and the 02:00 peak may come from night owls.

# Do Show or Ask Posts Receive More Points on Average

#### Calculating Avg Ask Points

In [33]:
#Points is at the 4th column
total_ask_points = 0
for row in ask_posts:
    ask_num_points = int(row[3])
    total_ask_points += ask_num_points

In [34]:
total_ask_points

103378

In [35]:
avg_ask_points = total_ask_points / len(ask_posts)

print("Average ask points:",avg_ask_points)

Average ask points: 11.31174089068826


#### Calculating Avg Show Points

In [36]:
#Points is at the 4th column
total_show_points = 0
for row in show_posts:
    show_num_points = int(row[3])
    total_show_points += show_num_points

In [37]:
total_show_points

150781

In [38]:
avg_show_points = total_show_points / len(show_posts)

print("Average show points:",avg_show_points)

Average show points: 14.843571569206537


On the other hand, Show HN posts receive more points on average than Ask HN posts (about 14.84 versus 11.31).

# Are Posts Created at a Certain Time More Likely to Receive More Points

Let's check this by hour of the day for show posts, the same way we did for comments on ask posts.
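
Since the steps are the same as for comments, the whole per-hour calculation can be sketched as one reusable function. The name `average_by_hour` and its parameters are hypothetical and not part of the notebook; the sample rows are the first three show posts printed earlier.

```python
import datetime as dt

def average_by_hour(rows, value_index, date_format='%m/%d/%Y %H:%M'):
    """Return {hour: average of the integer column at value_index}.

    Assumes the timestamp is the last element of each row.
    """
    totals = {}
    counts = {}
    for row in rows:
        hour = dt.datetime.strptime(row[-1], date_format).strftime('%H')
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

sample_show_rows = [
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'],
    ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'],
    ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'],
]

# Index 3 is num_points; two posts fall in hour 00 and one in hour 23.
print(average_by_hour(sample_show_rows, 3))
```

Passing index 4 instead of 3 would average comments rather than points, so the same sketch covers both analyses.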

In [39]:
result_pts_list = [[row[-1],int(row[-4])] for row in show_posts]
result_pts_list[0:5]

[['9/26/2016 0:36', 2],
 ['9/26/2016 0:01', 1],
 ['9/25/2016 23:44', 1],
 ['9/25/2016 23:17', 2],
 ['9/25/2016 20:06', 1]]

In [43]:
counts_pts_by_hour = {}
pts_by_hour = {}

for row in result_pts_list:
    dttime = row[0]
    dttime = dt.datetime.strptime(dttime,date_format)
    row[0] = dttime

In [44]:
print(result_pts_list[0:2])

[[datetime.datetime(2016, 9, 26, 0, 36), 2], [datetime.datetime(2016, 9, 26, 0, 1), 1]]


In [45]:
for row in result_pts_list:
    dthour = row[0]
    dthour = dthour.strftime("%H")
    if dthour not in counts_pts_by_hour:
        counts_pts_by_hour[dthour] = 1
        pts_by_hour[dthour] = row[1]
    else:
        counts_pts_by_hour[dthour] += 1
        pts_by_hour[dthour] += row[1]

In [46]:
counts_pts_by_hour

{'00': 276,
 '23': 319,
 '20': 525,
 '19': 556,
 '18': 656,
 '16': 801,
 '14': 696,
 '10': 323,
 '09': 302,
 '08': 316,
 '06': 192,
 '03': 206,
 '21': 430,
 '17': 761,
 '15': 836,
 '11': 402,
 '07': 236,
 '04': 194,
 '13': 610,
 '12': 516,
 '01': 247,
 '22': 377,
 '02': 209,
 '05': 172}

In [47]:
pts_by_hour

{'00': 4291,
 '23': 5060,
 '20': 6948,
 '19': 8928,
 '18': 9935,
 '16': 11487,
 '14': 10503,
 '10': 4303,
 '09': 3762,
 '08': 4640,
 '06': 3071,
 '03': 2168,
 '21': 5990,
 '17': 10563,
 '15': 11657,
 '11': 7742,
 '07': 3303,
 '04': 2707,
 '13': 10381,
 '12': 10787,
 '01': 2931,
 '22': 5026,
 '02': 2764,
 '05': 1834}

In [49]:
avg_pts_by_hour = []

for row in counts_pts_by_hour:
    temp_pts = []
    hour = row
    no_posts = counts_pts_by_hour[row] #number of show posts created during this hour
    no_pts = pts_by_hour[row] #total points received by those posts
    avg_pts_hour = int(no_pts)/int(no_posts)
    temp_pts = [hour,avg_pts_hour]
    avg_pts_by_hour.append(temp_pts)

In [50]:
avg_pts_by_hour

[['00', 15.547101449275363],
 ['23', 15.862068965517242],
 ['20', 13.234285714285715],
 ['19', 16.057553956834532],
 ['18', 15.144817073170731],
 ['16', 14.340823970037453],
 ['14', 15.09051724137931],
 ['10', 13.321981424148607],
 ['09', 12.456953642384105],
 ['08', 14.683544303797468],
 ['06', 15.994791666666666],
 ['03', 10.524271844660195],
 ['21', 13.930232558139535],
 ['17', 13.88042049934297],
 ['15', 13.94377990430622],
 ['11', 19.258706467661693],
 ['07', 13.995762711864407],
 ['04', 13.95360824742268],
 ['13', 17.018032786885247],
 ['12', 20.905038759689923],
 ['01', 11.866396761133604],
 ['22', 13.331564986737401],
 ['02', 13.224880382775119],
 ['05', 10.662790697674419]]

In [51]:
swap_pts_avg_by_hour = [[row[1],row[0]] for row in avg_pts_by_hour]

In [52]:
swap_pts_avg_by_hour

[[15.547101449275363, '00'],
 [15.862068965517242, '23'],
 [13.234285714285715, '20'],
 [16.057553956834532, '19'],
 [15.144817073170731, '18'],
 [14.340823970037453, '16'],
 [15.09051724137931, '14'],
 [13.321981424148607, '10'],
 [12.456953642384105, '09'],
 [14.683544303797468, '08'],
 [15.994791666666666, '06'],
 [10.524271844660195, '03'],
 [13.930232558139535, '21'],
 [13.88042049934297, '17'],
 [13.94377990430622, '15'],
 [19.258706467661693, '11'],
 [13.995762711864407, '07'],
 [13.95360824742268, '04'],
 [17.018032786885247, '13'],
 [20.905038759689923, '12'],
 [11.866396761133604, '01'],
 [13.331564986737401, '22'],
 [13.224880382775119, '02'],
 [10.662790697674419, '05']]

In [53]:
sorted_pts_swap = sorted(swap_pts_avg_by_hour, reverse = True)
sorted_pts_swap

[[20.905038759689923, '12'],
 [19.258706467661693, '11'],
 [17.018032786885247, '13'],
 [16.057553956834532, '19'],
 [15.994791666666666, '06'],
 [15.862068965517242, '23'],
 [15.547101449275363, '00'],
 [15.144817073170731, '18'],
 [15.09051724137931, '14'],
 [14.683544303797468, '08'],
 [14.340823970037453, '16'],
 [13.995762711864407, '07'],
 [13.95360824742268, '04'],
 [13.94377990430622, '15'],
 [13.930232558139535, '21'],
 [13.88042049934297, '17'],
 [13.331564986737401, '22'],
 [13.321981424148607, '10'],
 [13.234285714285715, '20'],
 [13.224880382775119, '02'],
 [12.456953642384105, '09'],
 [11.866396761133604, '01'],
 [10.662790697674419, '05'],
 [10.524271844660195, '03']]

In [54]:
sorted_pts_swap[0:5]

[[20.905038759689923, '12'],
 [19.258706467661693, '11'],
 [17.018032786885247, '13'],
 [16.057553956834532, '19'],
 [15.994791666666666, '06']]

For points, the averages are highest around lunch time, at 12:00, 11:00, and 13:00. The next highest hours are 19:00 and 06:00, perhaps because people take a break after work or check the news early in the morning.

# Summary

Looking at both points and comments by time of day, we see a similarity: average engagement is highest around lunch time. This may be a useful guide to when Hacker News is most active.