# Exploring Hacker News
In this project we analyze a partial dataset from the <a href="https://news.ycombinator.com/">Hacker News website</a>. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We are especially interested in posts of these two types:
* <font color='red'>Ask HN</font>: posts to ask the Hacker News community a specific question.
* <font color='red'>Show HN</font>: posts to show the Hacker News community a project, product, or just generally something interesting.

With our analysis we want to answer the following questions:
* Do <font color='red'>Ask HN</font> or <font color='red'>Show HN</font> receive more comments on average?
* Do posts created at a certain time receive more comments on average?


## Importing the Data

In [13]:
from csv import reader

opened_file = open('datasets/HN_posts_year_to_Sep_26_2016.csv', encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)
hn_header = hn[0]
hn = hn[1:]
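
The loading step above leaves the file open after reading. A minimal sketch of a variant that splits header and rows in a helper and can be used inside a `with` block is shown below. The helper name `load_csv` is hypothetical, and the example runs on a small in-memory sample (illustrative data, not rows from the real file) so that it is self-contained.

```python
import csv
import io

def load_csv(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Tiny in-memory sample standing in for the real CSV file (illustrative data only).
sample = io.StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: example question?,3\n"
    "2,Show HN: example project,0\n"
)
sample_header, sample_rows = load_csv(sample)
print(sample_header)
print(len(sample_rows))

# With the real file, the same helper could be used as:
# with open('datasets/HN_posts_year_to_Sep_26_2016.csv', encoding='utf8', newline='') as f:
#     hn_header, hn = load_csv(f)
```

The `with` block closes the file automatically, even if reading raises an error.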

To make it easier to explore the data set, we'll first write a function named `explore_data()` that we can use repeatedly to print rows in a more readable way. We'll also add an option for our function to show the number of rows and columns of any data set.

In [16]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(hn_header)
print('\n')
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of 

Since we're only concerned with posts whose titles begin with Ask HN or Show HN, we'll create new lists of lists containing just the rows for those titles. For this purpose we are going to use the string method `startswith()`, applied to the lowercase title so that the match is case-insensitive.

In [24]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    # if the lowercase version of title starts with ask hn, append the row to ask_posts
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    # if the lowercase version of title starts with show hn, append the row to show_posts
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    # otherwise append to the other_posts
    else:
        other_posts.append(row)
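
To see how the prefix check behaves on edge cases, the same logic can be wrapped in a small function and tried on a few titles. This is only a sketch: the function name `classify_title` is hypothetical, and all example titles except the first are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles used only to illustrate the matching behavior.
examples = [
    'Ask HN: What TLD do you use for local development?',
    'SHOW HN: a made-up project',
    'algorithmic music',
    'Why I ask HN for advice',  # 'ask hn' appears, but not at the start
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Because `startswith()` only looks at the beginning of the string, a title that merely mentions "ask HN" somewhere in the middle ends up in the "other" category.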

Now we check the number of posts in each category and confirm that the total number of posts agrees with the one calculated above by the function `explore_data()`.

In [35]:
print("Number of ask_posts: ",len(ask_posts))
print("Number of show_posts: ",len(show_posts))
print("Number of other_posts: ",len(other_posts))
print("Total Number of posts: ", len(ask_posts)+len(show_posts)+len(other_posts))

Number of ask_posts:  9139
Number of show_posts:  10158
Number of other_posts:  273822
Total Number of posts:  293119


In [36]:
explore_data(ask_posts, 0, 5, True)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']


Number of rows: 9139
Number of columns: 7


Next, let's determine whether ask posts or show posts receive more comments on average. We first determine the total number of comments for the ask_posts, then we divide it by the number of ask posts to get the average per post. Next we calculate the same for show_posts and other_posts.

In [43]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print("Total number of comments on ask_posts: ", total_ask_comments)
print("Average number of comments per ask post: {:.2f}".format(avg_ask_comments))

Total number of comments on ask_posts:  94986
Average number of comments per ask post: 10.39


In [45]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print("Total number of comments on show_posts: ", total_show_comments)
print("Average number of comments per show post: {:.2f}".format(avg_show_comments))

Total number of comments on show_posts:  49633
Average number of comments per show post: 4.89


In [46]:
total_other_comments = 0
for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
    
avg_other_comments = total_other_comments / len(other_posts)

print("Total number of comments on other_posts: ", total_other_comments)
print("Average number of comments per other post: {:.2f}".format(avg_other_comments))

Total number of comments on other_posts:  1768142
Average number of comments per other post: 6.46


We can see from the results above that ask_posts receive more comments on average (10.39) than other kinds of posts (6.46), while show_posts receive fewer comments on average (4.89).
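
The three nearly identical loops above could be folded into one helper. The sketch below is one possible way to do that, not the approach used in this notebook: the function name `average_comments` and its `comments_index` parameter are hypothetical, and the sample rows are copied from the ask_posts preview shown earlier.

```python
def average_comments(rows, comments_index=4):
    """Return (total, average) of the comment counts stored as strings in rows."""
    total = sum(int(row[comments_index]) for row in rows)
    # Guard against an empty list to avoid a ZeroDivisionError.
    average = total / len(rows) if rows else 0.0
    return total, average

# Three rows copied from the ask_posts preview shown earlier.
sample_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
total, average = average_comments(sample_rows)
print("Total comments: {}, average: {:.2f}".format(total, average))
```

The same call would then work unchanged for `ask_posts`, `show_posts`, and `other_posts`.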

We will therefore use the ask_posts for our further analysis to determine if there is a certain time when a higher number of comments are written, following these steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

First we store the creation time of each entry in ask_posts, together with its number of comments, in a separate variable `result_list`.

In [74]:
import datetime as dt
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments]) 
    


Now we are going to group the posts and their comments into one-hour time slots, based on the hour in which each post was created. For this we make use of dictionaries and the `datetime` module.

In [81]:
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    # parse the creation timestamp and keep only the hour (0-23)
    date_dt = dt.datetime.strptime(row[0], date_format)
    hour = date_dt.hour

    num_comments = row[1]

    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
        
    


`counts_by_hour` and `comments_by_hour` map each hour of the day to the number of ask posts created in that hour and to the total number of comments those posts received, respectively.
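
The same tally can be written more compactly with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is a sketch of an alternative, not the code used in this notebook: the function name `tally_by_hour` is hypothetical, and the sample pairs are taken from the ask_posts preview shown earlier.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def tally_by_hour(pairs):
    """Return (counts, comments) dicts keyed by hour for [created_at, num_comments] pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).hour
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Timestamps and comment counts taken from the ask_posts preview shown earlier.
sample_pairs = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:57', 0],
    ['9/25/2016 22:48', 3],
]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```

In this sample, the two posts created during hour 22 are counted together, so that hour has a count of 2 and a comment total of 3.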

Now we calculate the average number of comments per post for each hour of the day, by dividing the total number of comments by the number of posts created in that hour.

In [105]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour] ])
avg_by_hour

[[2, 11.137546468401487],
 [1, 7.407801418439717],
 [22, 8.804177545691905],
 [21, 8.687258687258687],
 [19, 7.163043478260869],
 [17, 9.449744463373083],
 [15, 28.676470588235293],
 [14, 9.692007797270955],
 [13, 16.31756756756757],
 [11, 8.96474358974359],
 [10, 10.684397163120567],
 [9, 6.653153153153153],
 [7, 7.013274336283186],
 [3, 7.948339483394834],
 [23, 6.696793002915452],
 [20, 8.749019607843136],
 [16, 7.713298791018998],
 [8, 9.190661478599221],
 [0, 7.5647840531561465],
 [18, 7.94299674267101],
 [12, 12.380116959064328],
 [4, 9.7119341563786],
 [6, 6.782051282051282],
 [5, 8.794258373205741]]

To make the results easier to read, we sort the list in descending order of the average value and print the top five hours.

In [128]:
sorted_avg_by_hour = sorted(avg_by_hour, key=lambda x: x[1], reverse=True)

print('Top 5 Hours for Ask Posts Comments\n')

date_format = "%H"

for row in sorted_avg_by_hour[:5]:
    date_dt = dt.datetime.strptime(str(row[0]),date_format)
    dt_string = date_dt.strftime("%H:%M")
    
    print('{}: {:.2f} average comments per post.'.format(dt_string, row[1]))

Top 5 Hours for Ask Posts Comments

15:00: 28.68 average comments per post.
13:00: 16.32 average comments per post.
12:00: 12.38 average comments per post.
02:00: 11.14 average comments per post.
10:00: 10.68 average comments per post.
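
Since the hour is already an integer, the ranking and formatting step can also be done without a round trip through `strptime`/`strftime`. The sketch below shows that alternative under the assumption that each entry is an `[hour, average]` pair as above; the function name `top_hours` is hypothetical, and the sample values are copied from the `avg_by_hour` output above.

```python
def top_hours(avg_by_hour, n=5):
    """Return formatted lines for the n [hour, average] pairs with the highest average."""
    ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
    return ['{:02d}:00: {:.2f} average comments per post.'.format(hour, avg)
            for hour, avg in ranked[:n]]

# A few values copied from the avg_by_hour output above.
sample_avgs = [[2, 11.137546468401487], [15, 28.676470588235293], [9, 6.653153153153153]]
lines = top_hours(sample_avgs, n=2)
for line in lines:
    print(line)
```

The `{:02d}` format specifier pads single-digit hours with a leading zero, which gives the same `HH:00` layout as the `strftime("%H:%M")` call.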


In this project, we analyzed ask posts and show posts to determine which type of post and which time of creation receive the most comments on average. Ask posts receive the most comments on average (10.39 per post, compared with 4.89 for show posts).

Based on the analysis of ask posts, to maximize the number of comments a post receives, we'd recommend that the post be categorized as an ask post and created between 15:00 and 16:00 (28.68 comments per post on average). Note that posts without any comments were not excluded from these averages.

It is interesting to note that posts created between 02:00 and 03:00 also receive many comments (11.14 on average), which suggests that hackers never sleep ;-)