# Hacker News

This project analyzes posts on the Hacker News website. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We would like to find out which factors make a post attract a lot of attention, measured here by its number of comments. We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

In [1]:
open_file = open("C:/Users/dowre/GIT/Hacker News/HN_posts_year_to_Sep_26_2016.csv" , encoding="utf8" )

In [2]:
from csv import reader
read_file = reader(open_file)
hn_org = list(read_file)
hn_header = hn_org[0]
hn = hn_org[1:]

In [3]:
hn_org[:2]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26']]

In [4]:
hn_header

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [5]:
hn[:3]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19']]

In [6]:
# Total number of rows in hn (header excluded)
len(hn)

293119

## Data cleaning
1. Verify unique values

Only the id column should contain unique values, so we verify that every id is unique.

In [7]:
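# Collect the id (first column) of every post, then compare the number of
# distinct ids with the number of rows. Checking hn[0] would only inspect
# the fields of the first row, not the id column.
ids = [post[0] for post in hn]
flag = len(set(ids)) == len(ids)
if flag:
    print("List contains all unique id ")
else:
    print("List does not contain all unique id")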


List contains all unique id 
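
If the check ever fails, it helps to know which ids are repeated. The following is a minimal sketch of that idea, not part of the original analysis: `find_duplicate_ids` and `sample_rows` are hypothetical names, and the sample rows are made up to mirror the dataset's column order.

```python
from collections import Counter

def find_duplicate_ids(rows, id_index=0):
    """Return a dict of id -> count for every id that appears more than once."""
    counts = Counter(row[id_index] for row in rows)
    return {post_id: n for post_id, n in counts.items() if n > 1}

# Hypothetical sample rows in the same column order as the dataset.
sample_rows = [
    ['1', 'First post', '', '3', '0', 'alice', '9/26/2016 3:26'],
    ['2', 'Second post', '', '1', '2', 'bob', '9/26/2016 3:24'],
    ['1', 'Duplicate of first', '', '5', '1', 'carol', '9/26/2016 3:19'],
]

duplicates = find_duplicate_ids(sample_rows)
print(duplicates)
```

An empty dict means every id is unique; on the real data the same call would be `find_duplicate_ids(hn)`.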


2. Verify the missing data.
After a rough look through the data, the following columns should not be empty:
- id
- title
- num_points
- num_comments
- author
- created_at

For created_at, which holds a date, we can use 1/1/2016 0:00 as the default value.

In [8]:
missing_id = []
missing_title = []
missing_num_points = []
missing_num_comments = []
missing_author = []
missing_created_at = []
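# Note: because of the elif chain, each row is counted only under the
# first column (in the order below) that is found to be empty.
for post in hn: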
    id_post = post[0]
    title = post[1]
    num_points = post[3]
    num_comments = post[4]
    author= post[5]
    created_at = post[6]
    if not id_post:
        missing_id.append(post)
    elif not title:
        missing_title.append(post)
    elif not num_points:
        missing_num_points.append(post)
    elif not num_comments:
        missing_num_comments.append(post)
    elif not author:
        missing_author.append(post)
    elif not created_at:
        missing_created_at.append(post)

print('Number of missing id row is {:,}'.format(len(missing_id)))
print('Number of missing title row is {:,}'.format(len(missing_title)))
print('Number of missing num_points row is {:,}'.format(len(missing_num_points)))
print('Number of missing num_comments row is {:,}'.format(len(missing_num_comments)))
print('Number of missing author row is {:,}'.format(len(missing_author)))
print('Number of missing created_at row is {:,}'.format(len(missing_created_at)))


Number of missing id row is 0
Number of missing title row is 0
Number of missing num_points row is 30
Number of missing num_comments row is 1
Number of missing author row is 0
Number of missing created_at row is 0


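As a result, the num_points and num_comments columns have some missing values. Both columns hold integers, so an empty value can be treated as 0 without changing any totals. Note, however, that int('') raises a ValueError in Python, so an empty string cannot be converted directly; the type check in the next step also flags these rows.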
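
One way to treat an empty field as 0 is a small conversion helper. This is a sketch only: `to_int` is a hypothetical name that does not appear in the notebook, and the default of 0 is an assumption taken from the paragraph above.

```python
def to_int(value, default=0):
    """Convert a string to int, falling back to `default` when conversion fails."""
    try:
        return int(value)
    except ValueError:
        # Raised for '' and for any other non-numeric text.
        return default

print(to_int('7'))
print(to_int(''))
print(to_int('not a number', default=-1))
```

A fallback like this also hides malformed values such as a URL sitting in a numeric column, so it should only be applied after the type checks below have been run.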

3. Verify data type.
All values in every column are currently strings. However, some columns should hold other types, as follows:
- id : int
- num_points : int
- num_comments : int
- created_at : date

We should verify that the data in each field is in the correct format.

In [9]:
# Create function to verify that the value can convert to int type.
def isint(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

In [10]:
# Create function to verify that the value can convert to datetime type.
import datetime as dt
def isdate(value):
    try:
        dt.datetime.strptime(value,"%m/%d/%Y %H:%M" )
        return True
    except ValueError:
        return False

In [11]:
wrong_type_id = []
wrong_type_num_points = {}
wrong_type_num_comments = {}
wrong_type_created_at = {}
for post in hn:
    id_type = isint(post[0])
    num_points_type = isint(post[3])
    num_comments_type = isint(post[4])
    created_at_type = isdate(post[6])
    if not id_type:
        wrong_type_id.append(post[0])
    if not num_points_type:
        wrong_type_num_points[post[0]] = post[3]
    if not num_comments_type:
        wrong_type_num_comments[post[0]] = post[4]
    if not created_at_type:
        wrong_type_created_at[post[0]] = post[6]
        
print("Id that is not in int type are", len(wrong_type_id))
print("num_points that is not in int type are", len(wrong_type_num_points))
print("num_comments that is not in int type are", len(wrong_type_num_comments))
print("created_at that is not in date type are", len(wrong_type_created_at))

Id that is not in int type are 0
num_points that is not in int type are 681
num_comments that is not in int type are 44
created_at that is not in date type are 681
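
The same checks can also be expressed per row, which makes it easy to see every failing column of a single post at once. The sketch below is an alternative arrangement, not the notebook's own code: `CHECKS`, `invalid_columns`, and both sample rows are hypothetical, and the validators are re-declared so the block runs by itself.

```python
import datetime as dt

def is_int(value):
    """Return True if the string can be converted to int."""
    try:
        int(value)
        return True
    except ValueError:
        return False

def is_date(value, fmt="%m/%d/%Y %H:%M"):
    """Return True if the string matches the dataset's date format."""
    try:
        dt.datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Column index -> (column name, validator), following the dataset's column order.
CHECKS = {
    0: ('id', is_int),
    3: ('num_points', is_int),
    4: ('num_comments', is_int),
    6: ('created_at', is_date),
}

def invalid_columns(row):
    """Return the names of the columns in `row` that fail their type check."""
    return [name for index, (name, check) in CHECKS.items() if not check(row[index])]

# Hypothetical rows: one well formed, one with its fields shifted to the right.
good_row = ['12579008', 'A title', 'http://example.com', '1', '0', 'altstar', '9/26/2016 3:26']
bad_row = ['12570231', 'A title', 'extra field', 'http://example.com', '2', '0', 'stevemao']

print(invalid_columns(good_row))
print(invalid_columns(bad_row))
```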


In [12]:
#wrong_num_points_list 
dict(list(wrong_type_num_points.items())[0: 5])

{'12570231': 'https://github.com/you-dont-need/You-Dont-Need-GUI',
 '12563145': 'https://github.com/Dobiasd/articles/blob/master/functional_programming_in_cpp_with_the_functionalplus_library_today_hackerrank_challange_gemstones.md',
 '12560370': 'https://aenramsden.tumblr.com/post/150263745069/personality-not-just-for-humans',
 '12559863': 'http://www.intp.io/blog/2016/09/20/Of-Mercenaries-and-Soldiers-In-House-Versus-Third-Party-Tech-Teams/',
 '12559266': 'http://futurism.com/new-form-of-atomic-nuclei-just-confirmed-and-it-suggests-time-travel-is-impossible/'}

In [13]:
#wrong_num_comments_list 
dict(list(wrong_type_num_comments.items())[0: 5])

{'12499077': '_fruit_flies_like_a_banana',
 '12251820': 'pgtype=Homepage&amp',
 '12224842': 'http://esports-marketing-blog.com/eleague-season-1-by-the-numbers/',
 '12133174': 'http://breuleux.net/blog/language-howto.html',
 '12116178': 'http://www.themacro.com/articles/2016/07/nick-grandy/'}

In [14]:
#wrong_created_at_list 
dict(list(wrong_type_created_at.items())[0: 5])

{'12570231': 'stevemao',
 '12563145': 'Dobiasd',
 '12560370': 'Mz',
 '12559863': 'intpsoftware',
 '12559266': 'wolfgke'}

The num_points and created_at columns have exactly the same number of incorrectly typed values, 681. We want to make sure that these come from the same rows.

In [15]:
set(wrong_type_num_points) == set(wrong_type_created_at)

True

The num_comments column has 44 incorrectly typed values. We check whether those rows are a subset of the rows with an incorrect num_points value.

In [16]:
if(set(wrong_type_num_comments).issubset(set(wrong_type_num_points))):
    print('Yes')
else:
    print('No')

Yes


##### Create new clean data

In [17]:
hn_clean =[]
for post in hn:
    post_id = post[0]
    if post_id not in wrong_type_num_points:
        hn_clean.append(post)

print('Total clean data is {:,}'.format(len(hn_clean)))

Total clean data is 292,438


## Data Analysis

As mentioned earlier, we are interested in posts whose titles begin with Ask HN or Show HN. We will split the data into separate lists.

In [18]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_clean:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of Ask HN post are {0:,}".format(len(ask_posts)))
print("Number of Show HN post are {0:,}".format(len(show_posts))) 
print("Number of Other post are {0:,}".format(len(other_posts)))
        

Number of Ask HN post are 9,126
Number of Show HN post are 10,146
Number of Other post are 273,166
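
The title test can be pulled into a small function so that the rule is easy to reuse and check. This is a minimal sketch with a hypothetical helper name, `classify_post`, applying the same case-insensitive prefix rule as the loop above.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles used only to illustrate the rule.
for title in ['Ask HN: What TLD do you use?', 'SHOW HN: My project', 'SQLAR the SQLite Archiver']:
    print(classify_post(title), '<-', title)
```

Note that a prefix test also matches titles such as "Ask HN readers..." that do not follow the usual "Ask HN:" form.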


In [19]:
ask_posts[:4]

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57'],
 ['12577870',
  'Ask HN: Why join a fund when you can be an angel?',
  '',
  '1',
  '3',
  'anthony_james',
  '9/25/2016 22:48']]

In [20]:
show_posts[:4]

[['12578335',
  'Show HN: Finding puns computationally',
  'http://puns.samueltaylor.org/',
  '2',
  '0',
  'saamm',
  '9/26/2016 0:36'],
 ['12578182',
  'Show HN: A simple library for complicated animations',
  'https://christinecha.github.io/choreographer-js/',
  '1',
  '0',
  'christinecha',
  '9/26/2016 0:01'],
 ['12578098',
  'Show HN: WebGL visualization of DNA sequences',
  'http://grondilu.github.io/dna.html',
  '1',
  '0',
  'grondilu',
  '9/25/2016 23:44'],
 ['12577991',
  'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules',
  'https://github.com/jakebian/zeal',
  '2',
  '0',
  'dbranes',
  '9/25/2016 23:17']]

##### Let's check whether ask posts or show posts receive more comments on average.

We start by finding the average number of comments on Ask HN posts.

In [21]:
total_ask_comments = 0
for post in ask_posts:
    ask_comment = int(post[4])
    total_ask_comments += ask_comment
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average comment for Ask HN post is {number:.2f}".format(number=avg_ask_comments))


Average comment for Ask HN post is 10.39


Next, we find the average number of comments on Show HN posts.

In [22]:
total_show_comments = 0
for post in show_posts:
    show_comment = int(post[4])
    total_show_comments += show_comment
avg_show_comments = total_show_comments/len(show_posts)
print("Average comment for Show HN post is {number:.2f}".format(number=avg_show_comments))

Average comment for Show HN post is 4.89


From the above results, we can see that Ask HN posts receive more comments on average than Show HN posts (10.39 versus 4.89). Therefore, we will focus on the Ask HN posts.
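
The two averaging cells repeat the same steps, so they could share one helper. The sketch below assumes a hypothetical function `average_comments` and uses made-up sample rows, not the real dataset.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical sample rows in the dataset's column order.
sample_ask = [
    ['1', 'Ask HN: A?', '', '4', '7', 'a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '6', '3', 'b', '9/26/2016 1:17'],
]
sample_show = [
    ['3', 'Show HN: C', 'http://example.com', '2', '0', 'c', '9/26/2016 0:36'],
]

for label, posts in [('Ask HN', sample_ask), ('Show HN', sample_show)]:
    print("Average comments for {} posts is {:.2f}".format(label, average_comments(posts)))
```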

Next, we'll determine if ask posts created at a certain time are more likely to attract comments.

In [23]:
counts_by_hour = {}
comments_by_hour ={}
for post in ask_posts:
    post_datetime = post[6]
    post_datetime_1 = dt.datetime.strptime(post_datetime, "%m/%d/%Y %H:%M")
    post_hour = post_datetime_1.hour
    post_comment = int(post[4])
    if post_hour not in counts_by_hour:
        counts_by_hour[post_hour] = 1
        comments_by_hour[post_hour] = post_comment
    else:
        counts_by_hour[post_hour] += 1
        comments_by_hour[post_hour] += post_comment


print("counts_by_hour",counts_by_hour)
print("comments_by_hour",comments_by_hour)

counts_by_hour {2: 269, 1: 282, 22: 382, 21: 517, 19: 548, 17: 585, 15: 645, 14: 513, 13: 444, 11: 311, 10: 281, 9: 222, 7: 226, 3: 271, 23: 343, 20: 509, 16: 579, 8: 257, 0: 301, 18: 613, 12: 342, 4: 243, 6: 234, 5: 209}
comments_by_hour {2: 2996, 1: 2089, 22: 3353, 21: 4478, 19: 3933, 17: 5542, 15: 18524, 14: 4972, 13: 7245, 11: 2795, 10: 2963, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4441, 16: 4466, 8: 2362, 0: 2277, 18: 4876, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
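
The same tally can be written without the explicit "key not seen yet" branch by using `collections.defaultdict`. This is an alternative sketch, not the notebook's code: `tally_by_hour` and `sample_posts` are hypothetical names, and the rows are invented for illustration.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, date_index=6, comments_index=4):
    """Return (posts per hour, comments per hour) as two plain dicts."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for post in posts:
        hour = dt.datetime.strptime(post[date_index], "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += int(post[comments_index])
    return dict(counts), dict(comments)

# Hypothetical sample rows in the dataset's column order.
sample_posts = [
    ['1', 'Ask HN: A?', '', '4', '7', 'a', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '6', '3', 'b', '9/26/2016 2:17'],
    ['3', 'Ask HN: C?', '', '1', '0', 'c', '9/25/2016 22:57'],
]

counts, comments = tally_by_hour(sample_posts)
print(counts)
print(comments)
```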


In [24]:
#sort the counts_by_hour from max to min
import operator
sort_counts_by_hour = sorted(counts_by_hour.items(), key = operator.itemgetter(1), reverse = True)
sort_counts_by_hour

[(15, 645),
 (18, 613),
 (17, 585),
 (16, 579),
 (19, 548),
 (21, 517),
 (14, 513),
 (20, 509),
 (13, 444),
 (22, 382),
 (23, 343),
 (12, 342),
 (11, 311),
 (0, 301),
 (1, 282),
 (10, 281),
 (3, 271),
 (2, 269),
 (8, 257),
 (4, 243),
 (6, 234),
 (7, 226),
 (9, 222),
 (5, 209)]

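Most Ask HN posts are created in the afternoon and evening: the eight busiest hours all fall between 14:00 and 21:00, with 15:00 being the busiest (645 posts).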

In [25]:
#sort the comments_by_hour from max to min
sort_comments_by_hour = sorted(comments_by_hour.items(),key = operator.itemgetter(1), reverse = True)
sort_comments_by_hour

[(15, 18524),
 (13, 7245),
 (17, 5542),
 (14, 4972),
 (18, 4876),
 (21, 4478),
 (16, 4466),
 (20, 4441),
 (12, 4234),
 (19, 3933),
 (22, 3353),
 (2, 2996),
 (10, 2963),
 (11, 2795),
 (8, 2362),
 (4, 2360),
 (23, 2297),
 (0, 2277),
 (3, 2154),
 (1, 2089),
 (5, 1838),
 (6, 1587),
 (7, 1585),
 (9, 1477)]

Posts created in the afternoon collect more comments in total than posts created in the morning.
The 15:00 hour in particular has both the highest number of posts created and the highest total number of comments.

Next, we will calculate the average number of comments per post for each hour of the day.

In [26]:
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])

In [27]:
sorted_avg_by_hour = sorted(avg_by_hour, key =operator.itemgetter(1), reverse= True)
#display top 5 
sorted_avg_by_hour[:5]

[[15, 28.71937984496124],
 [13, 16.31756756756757],
 [12, 12.380116959064328],
 [2, 11.137546468401487],
 [10, 10.544483985765124]]

Display the hour in time format.

In [28]:
for post in sorted_avg_by_hour[:5]:
    hour = str(post[0])
    datetime_format =  dt.datetime.strptime(hour, "%H")
    hour_format = datetime_format.strftime("%H:%M:")
    print(hour_format,' {:.2f} average comments per post'.format(post[1]))

15:00:  28.72 average comments per post
13:00:  16.32 average comments per post
12:00:  12.38 average comments per post
02:00:  11.14 average comments per post
10:00:  10.54 average comments per post


## Conclusion

Ask HN posts attract more comments on average than Show HN posts on Hacker News. The best time of day to create an Ask HN post is the afternoon, in particular around 15:00 Eastern Time, which is 14:00 Central Time, our current location. Ask HN posts created in that hour receive about 28.72 comments on average.
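
The time zone conversion used in the conclusion can be sketched as follows. It assumes, as the conclusion does, that the dataset's timestamps are in Eastern Time, and it relies on Central Time being one hour behind Eastern Time; `eastern_hour_to_central` is a hypothetical helper, and the date inside it is an arbitrary placeholder.

```python
import datetime as dt

# Central Time is one hour behind Eastern Time.
EASTERN_TO_CENTRAL = dt.timedelta(hours=-1)

def eastern_hour_to_central(hour):
    """Convert an hour of the day in Eastern Time to an HH:MM string in Central Time."""
    eastern = dt.datetime(2016, 1, 1, hour)  # arbitrary date, only the hour matters
    central = eastern + EASTERN_TO_CENTRAL
    return central.strftime("%H:%M")

for hour in [15, 13, 12, 2, 10]:
    print("{:02d}:00 Eastern -> {} Central".format(hour, eastern_hour_to_central(hour)))
```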