# Analysing Hacker News Posts

Hacker News is a social news website created by [Y Combinator](https://www.ycombinator.com/), a seed-money startup accelerator headquartered in the United States. [Hacker News](https://news.ycombinator.com/) is especially popular in the tech startup scene. Users can submit different types of posts, vote them up or down, and comment on them.

The posts with the most points (upvotes minus downvotes) appear at the top and, as a result, can attract thousands of visitors.

In this project, we are going to focus on two types of posts:
- __Ask HN__: posts of specific questions to the Hacker News community.
- __Show HN__: posts to share with the Hacker News community a product or an interesting project/idea.

The aim of this project is to answer the following questions:
- Which of the two post types receives more comments on average?
- Do posts made at a certain time of day receive more comments on average?

For our analysis we are going to use the [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)
data set which contains information about the posts for one year.

The data set can be downloaded [here](https://www.kaggle.com/hacker-news/hacker-news-posts/download).

The data set contains the following columns:
- `id`: the identifier of the post
- `title`: the title of the post
- `url`: the URL of the item being linked to
- `num_points`: the number of points the post received (upvotes minus downvotes)
- `num_comments`: the number of comments the post received
- `author`: the name of the account that made the post
- `created_at`: the date and time the post was made (the time zone is Eastern Time in the US)
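
Because each row is read as a plain list, a column is addressed by its position. The following is a minimal sketch of looking positions up by name instead of hard-coding them. It assumes the header row shown in the notebook output, and `column_index` is an illustrative name that is not part of the original analysis.

```python
# Header row as it appears in the notebook output.
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row (illustrative helper).
column_index = {name: position for position, name in enumerate(hn_header)}

print(column_index['title'])         # 1
print(column_index['num_comments'])  # 4
print(column_index['created_at'])    # 6
```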


## Exploring the Data Set

First, we start by opening the data set and exploring the data.

In [1]:
from csv import reader

# Read the whole CSV file into a list of rows; the file is closed when the block ends.
with open("/Users/abdallarashwan/Documents/Python Projects/Datasets/HN_posts_year_to_Sep_26_2016.csv") as opened_file:
    hn = list(reader(opened_file))
hn_header = hn[0]  # column names
hn_data = hn[1:]   # data rows

Next, we define the function `explore_first_five()`, which prints the first five rows of a data set.
The function arguments are:
- dataset: the data set to explore, as a list of rows.
- num_col: a boolean; when `True`, the number of rows in the data set is also printed.

In [2]:
def explore_first_five(dataset, num_col = True):
    for row in dataset[:5]:
        print(row)
    print("\n")
    if num_col:
        print("Number of rows: ",len(dataset))

Now let's explore the data set:

In [3]:
print(hn_header)
print("\n")
explore_first_five(hn_data)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of rows:  2

## Data Cleaning

Since we are only interested in posts that received comments, let's remove all rows that have no comments (`num_comments` = 0).

In [4]:
hn_data_commented = []
for row in hn_data:
    num_comments = int(row[4])
    if num_comments != 0:
        hn_data_commented.append(row)
explore_first_five(hn_data_commented)

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']
['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']


Number of rows:  80401
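
The same filter can also be written as a list comprehension. This is only a sketch on a few made-up rows (`sample_rows` and its contents are hypothetical) that follow the column order of the data set:

```python
# Hypothetical sample rows in the same column order as the data set.
sample_rows = [
    ['1', 'First post', 'http://example.com/a', '3', '0', 'alice', '9/26/2016 3:26'],
    ['2', 'Second post', 'http://example.com/b', '5', '2', 'bob', '9/26/2016 3:24'],
    ['3', 'Third post', '', '1', '7', 'carol', '9/26/2016 3:19'],
]

# Keep only the rows whose num_comments value (index 4) is non-zero.
commented = [row for row in sample_rows if int(row[4]) != 0]

print(len(commented))  # 2
```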


Next, we separate the `Ask HN`, `Show HN` and `other` posts.
The type of a post can be determined from the `title` column.

We create three lists, one for each type of post: Ask HN, Show HN and other.
We do so by checking how the title of each post starts and then appending the post to the corresponding list.

In [5]:
ask_posts = []
show_posts = []
other_posts = []
for post in hn_data_commented:
    title = post[1]
    if title.startswith("Ask HN"):
        ask_posts.append(post)
    elif title.startswith("Show HN"):
        show_posts.append(post)
    else:
        other_posts.append(post)
explore_first_five(ask_posts)
explore_first_five(show_posts)
explore_first_five(other_posts)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']
['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']


Number of rows:  6899
['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']
['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']
['12576090', 'Show HN: Markov chain Twitter bot. Trained on comment
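
Note that `str.startswith()` is case-sensitive, so a title beginning with `ask hn` in lower case would end up in `other_posts` with the check used above. Whether such titles occur in this data set has not been verified here. If they do, one possible approach is to lower-case the title before comparing, as in this sketch (`post_type` is a hypothetical helper, not part of the original notebook):

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other', ignoring letter case."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(post_type("Ask HN: What TLD do you use for local development?"))  # ask
print(post_type("show hn: my side project"))                            # show
print(post_type("algorithmic music"))                                   # other
```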

## Average Number of Comments

Next, we calculate the average number of comments for each type of post.

The `num_comments` column is at index 4.

## _Ask HN_ Average Number of Comments 

In [6]:
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

In [7]:
print(avg_ask_comments)

13.759965212349616


## _Show HN_ Average Number of Comments 

In [8]:
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)

In [9]:
print(avg_show_comments)

9.82125890736342


As we can see, Ask HN posts receive more comments on average (about 13.76 per post) than Show HN posts (about 9.82 per post).
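
The two calculations above repeat the same loop, so they could be folded into one small function. The following is a sketch only: `average_comments` and the sample rows are hypothetical, and the default index of 4 is the position of `num_comments` stated earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: a', '', '4', '7', 'u1', '9/26/2016 2:53'],
    ['2', 'Ask HN: b', '', '6', '3', 'u2', '9/26/2016 1:17'],
]

print(average_comments(sample_posts))  # 5.0
```

With the lists built earlier, `average_comments(ask_posts)` and `average_comments(show_posts)` would take the place of the two separate loops.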

# Time Frequency Table

Next, we analyse the Ask HN posts to determine whether posts created at a certain time receive more comments.

## First Step

In this step, we count the posts made in each hour of the day and the total number of comments they received. To do so, we first parse the `created_at` values into `datetime` objects.


In [10]:
import datetime as dt
for post in ask_posts:
    date_time_str = post[-1]
    post[-1] = dt.datetime.strptime(date_time_str, "%m/%d/%Y %H:%M")


In [11]:
print(ask_posts[0][-1],type(ask_posts[0][-1]))


2016-09-26 02:53:00 <class 'datetime.datetime'>
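
As a small worked example of the format string used above: `%m`, `%d` and `%Y` match the month, the day and the four-digit year, while `%H` and `%M` match the hour (24-hour clock) and the minute. The sketch below parses a single value in the same format as the `created_at` column and shows two ways of getting the hour:

```python
import datetime as dt

created_at = "9/26/2016 2:53"  # same format as the created_at column
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)            # 2  (an integer)
print(parsed.strftime("%H"))  # 02 (a zero-padded string)
```

The notebook uses the zero-padded string form as the dictionary key in the next step.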


In [12]:
posts_per_hour = {}
comments_per_hour = {}
for post in ask_posts:
    hour = post[-1].strftime("%H")
    if hour in posts_per_hour:
        posts_per_hour[hour] += 1 
        comments_per_hour[hour] += int(post[4])
    else:
        posts_per_hour[hour] = 1
        comments_per_hour[hour] = int(post[4])
        


In [13]:
print(posts_per_hour)

{'02': 227, '01': 223, '22': 286, '21': 407, '19': 420, '17': 404, '15': 467, '14': 377, '13': 324, '11': 250, '10': 219, '09': 176, '07': 156, '03': 211, '16': 414, '08': 190, '00': 229, '23': 276, '20': 392, '18': 451, '12': 274, '04': 185, '06': 176, '05': 165}


In [14]:
print(comments_per_hour)

{'02': 2996, '01': 2089, '22': 3369, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4970, '13': 7227, '11': 2794, '10': 3013, '09': 1477, '07': 1584, '03': 2153, '16': 4461, '08': 2362, '00': 2265, '23': 2297, '20': 4462, '18': 4868, '12': 4234, '04': 2358, '06': 1587, '05': 1838}
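
The `if`/`else` branch in the counting loop only handles the first time an hour is seen. As an alternative sketch, `collections.defaultdict` starts every missing key at zero, which removes the branch. The `sample` pairs below are made up and stand in for the parsed Ask HN posts.

```python
from collections import defaultdict

# Hypothetical (hour, num_comments) pairs standing in for the parsed posts.
sample = [("15", 10), ("15", 30), ("02", 4)]

posts_per_hour = defaultdict(int)     # missing hours start at 0
comments_per_hour = defaultdict(int)
for hour, num_comments in sample:
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

print(dict(posts_per_hour))     # {'15': 2, '02': 1}
print(dict(comments_per_hour))  # {'15': 40, '02': 4}
```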


In [15]:
avg_comments_per_hour = []
for hour in posts_per_hour:
    avg_comments_per_hour.append([hour, comments_per_hour[hour]/posts_per_hour[hour]])

In [16]:
print(avg_comments_per_hour)

[['02', 13.198237885462555], ['01', 9.367713004484305], ['22', 11.77972027972028], ['21', 11.056511056511056], ['19', 9.414285714285715], ['17', 13.73019801980198], ['15', 39.66809421841542], ['14', 13.183023872679046], ['13', 22.305555555555557], ['11', 11.176], ['10', 13.757990867579908], ['09', 8.392045454545455], ['07', 10.153846153846153], ['03', 10.203791469194313], ['16', 10.77536231884058], ['08', 12.43157894736842], ['00', 9.890829694323145], ['23', 8.322463768115941], ['20', 11.38265306122449], ['18', 10.793791574279378], ['12', 15.452554744525548], ['04', 12.745945945945946], ['06', 9.017045454545455], ['05', 11.139393939393939]]


Next, we swap the two elements (the hour and the average number of comments) in each list so that the `sorted()` built-in function orders the data by the average number of comments per hour, in descending order.

In [17]:
swap_avg_comments_per_hour = []
for element in avg_comments_per_hour:
    swap_avg_comments_per_hour.append([element[1], element[0]])
    

In [18]:
print(swap_avg_comments_per_hour)

[[13.198237885462555, '02'], [9.367713004484305, '01'], [11.77972027972028, '22'], [11.056511056511056, '21'], [9.414285714285715, '19'], [13.73019801980198, '17'], [39.66809421841542, '15'], [13.183023872679046, '14'], [22.305555555555557, '13'], [11.176, '11'], [13.757990867579908, '10'], [8.392045454545455, '09'], [10.153846153846153, '07'], [10.203791469194313, '03'], [10.77536231884058, '16'], [12.43157894736842, '08'], [9.890829694323145, '00'], [8.322463768115941, '23'], [11.38265306122449, '20'], [10.793791574279378, '18'], [15.452554744525548, '12'], [12.745945945945946, '04'], [9.017045454545455, '06'], [11.139393939393939, '05']]


Then the swapped list is sorted in descending order.

In [19]:
sorted_res = sorted(swap_avg_comments_per_hour, reverse = True)

In [20]:
print(sorted_res)

[[39.66809421841542, '15'], [22.305555555555557, '13'], [15.452554744525548, '12'], [13.757990867579908, '10'], [13.73019801980198, '17'], [13.198237885462555, '02'], [13.183023872679046, '14'], [12.745945945945946, '04'], [12.43157894736842, '08'], [11.77972027972028, '22'], [11.38265306122449, '20'], [11.176, '11'], [11.139393939393939, '05'], [11.056511056511056, '21'], [10.793791574279378, '18'], [10.77536231884058, '16'], [10.203791469194313, '03'], [10.153846153846153, '07'], [9.890829694323145, '00'], [9.414285714285715, '19'], [9.367713004484305, '01'], [9.017045454545455, '06'], [8.392045454545455, '09'], [8.322463768115941, '23']]
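
The swap is one way to make `sorted()` compare the averages first. A possible alternative, sketched below on three of the `[hour, average]` pairs from the earlier output, is to pass a `key` function so that the original element order can be kept:

```python
# A few [hour, average] pairs copied from the earlier output.
avg_comments_per_hour = [
    ['02', 13.198237885462555],
    ['15', 39.66809421841542],
    ['13', 22.305555555555557],
]

# Sort by the average (element 1), largest first, without swapping the fields.
sorted_by_avg = sorted(avg_comments_per_hour, key=lambda pair: pair[1], reverse=True)

print([hour for hour, avg in sorted_by_avg])  # ['15', '13', '02']
```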


Next, let's format our data in a more readable way.

In [39]:
for e in sorted_res:
    print("{}: {:.2f} average comments per post".format(dt.datetime.strptime(e[1], "%H").strftime("%H:%M"), e[0]))

15:00: 39.67 average comments per post
13:00: 22.31 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
02:00: 13.20 average comments per post
14:00: 13.18 average comments per post
04:00: 12.75 average comments per post
08:00: 12.43 average comments per post
22:00: 11.78 average comments per post
20:00: 11.38 average comments per post
11:00: 11.18 average comments per post
05:00: 11.14 average comments per post
21:00: 11.06 average comments per post
18:00: 10.79 average comments per post
16:00: 10.78 average comments per post
03:00: 10.20 average comments per post
07:00: 10.15 average comments per post
00:00: 9.89 average comments per post
19:00: 9.41 average comments per post
01:00: 9.37 average comments per post
06:00: 9.02 average comments per post
09:00: 8.39 average comments per post
23:00: 8.32 average comments per post


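From these results, we can see that Ask HN posts made around midday (Eastern Time) receive the most comments on average. The highest average occurs at 15:00 (39.67 comments per post), followed by 13:00 (22.31) and 12:00 (15.45). Note that these figures cover only posts that received at least one comment.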