### Introduction:  
  
In this project, we will compare two types of posts from [Hacker News](https://news.ycombinator.com/), a site where technology-related stories are voted and commented on.  
The two types of posts are *Ask HN* and *Show HN*.  
  
Users submit *Ask HN* posts to ask the Hacker News community a question. Likewise, users submit *Show HN* posts to show the community a project, a product, or simply something interesting.  
  
We will specifically compare these two types of posts to determine the following:
* Do *Ask HN* or *Show HN* posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [1]:
import re
import csv
import pprint as pp
import datetime as dt

#### Reading a CSV file and separating data from header:

In [2]:
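def read_csv(filename):
    # newline='' is what the csv module documentation recommends when opening files
    with open(filename, encoding='utf8', newline='') as fd:
        all_data = list(csv.reader(fd))

    # the first row is the header; the remaining rows are the data
    header = all_data[0]
    data = all_data[1:]

    return header, data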
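
As a quick illustration of the header/data split, here is a minimal sketch that runs the same logic on an in-memory CSV instead of a file. The sample rows and the names `sample` and `rows` are hypothetical and only serve the example.

```python
import csv
import io

# Hypothetical two-row CSV with a header line, held in memory instead of a file
sample = io.StringIO(
    'id,title,num_comments\n'
    '1,Ask HN: example,7\n'
    '2,Show HN: example,0\n'
)

rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]  # first row is the header, the rest is data

print(header)
print(data)
```

Every value comes back as a string, which is why later steps convert the comment counts with `int()`.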

In [3]:
header, hacker_data = read_csv('hacker_news.csv')
print('Hacker News posts:')
for row in hacker_data[:5]:
    print(row)

Hacker News posts:
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


#### Extracting *Ask HN* and *Show HN* posts:  
We will identify the posts whose titles begin with either *Ask HN* or *Show HN* and separate them into two lists. Separating the data makes the dataset easier to analyze later.

In [4]:
def get_ask_show_posts(dataset):
    ask_posts = list()
    show_posts = list()
    other_posts = list()
    
    for row in dataset:
        title = row[1]
        if re.search('^Ask HN', title, re.I): # check if title begins with 'Ask HN'
            ask_posts.append(row)
        elif re.search('^Show HN', title, re.I): # check if title begins with 'Show HN'
            show_posts.append(row)
        else:
            other_posts.append(row)
    
    return ask_posts, show_posts
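
To show the prefix check in isolation, here is a minimal sketch that classifies a few titles. The helper name `classify_title` and the sample titles are hypothetical (two are adapted from the dataset). It uses `re.match`, which only matches at the start of the string, so no `^` anchor is needed.

```python
import re

def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins (case-insensitive)."""
    if re.match(r'ask hn', title, re.I):
        return 'ask'
    if re.match(r'show hn', title, re.I):
        return 'show'
    return 'other'

# Hypothetical sample titles, for illustration only
samples = [
    'Ask HN: What TLD do you use for local development?',
    'show hn: Finding puns computationally',
    'Why I ask HN before shipping',  # contains "ask HN" but does not begin with it
]

labels = [classify_title(t) for t in samples]
print(labels)
```

The third title is classified as `other` because the match is anchored to the beginning of the title, which is the same behavior as the `'^Ask HN'` pattern.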

In [5]:
ask_posts, show_posts = get_ask_show_posts(hacker_data)
print('Ask posts:')
for row in ask_posts[:5]:
    print(row)

print()
print('Show posts:')
for row in show_posts[:5]:
    print(row)

Ask posts:
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']
['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']
['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']
['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']

Show posts:
['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36']
['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01']
['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'g

#### Calculating the average number of comments for *Ask HN* and *Show HN* posts:  
Now that we have separated *Ask HN* and *Show HN* posts into separate lists, we will calculate the average number of comments each type of post receives.

In [6]:
def get_average_comments(dataset, index):
    total_comments = 0
    
    for row in dataset:
        total_comments += int(row[index])
    
    return total_comments/len(dataset)
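
Because the function divides by `len(dataset)`, it would raise a `ZeroDivisionError` for an empty list. A minimal sketch of a guarded variant follows, assuming we want `0.0` for an empty input. The name `safe_average_comments` and the toy rows are hypothetical.

```python
def safe_average_comments(dataset, index):
    """Average of the integer column at `index`.

    Returns 0.0 for an empty dataset (an assumed convention, not part of the original code).
    """
    if not dataset:
        return 0.0
    total = sum(int(row[index]) for row in dataset)
    return total / len(dataset)

# Toy rows in the same column layout as the dataset (values are made up)
toy_rows = [
    ['1', 'Ask HN: example A', '', '4', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: example B', '', '6', '3', 'user_b', '9/26/2016 1:17'],
]

toy_avg = safe_average_comments(toy_rows, 4)   # (7 + 3) / 2
empty_avg = safe_average_comments([], 4)       # guarded case
print(toy_avg, empty_avg)
```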

In [7]:
avg_ask_comments = get_average_comments(ask_posts, 4) # num_comments is column 5
print('Average ask comments: {:.2f}'.format(avg_ask_comments))
avg_show_comments = get_average_comments(show_posts, 4) # num_comments is column 5
print('Average show comments: {:.2f}'.format(avg_show_comments))

Average ask comments: 10.39
Average show comments: 4.89


On average, *Ask* posts receive about 10 comments (10.39), whereas *Show* posts receive about 5 (4.89). Since *Ask* posts tend to receive more comments, we will focus the rest of the analysis on these posts.  
  
#### Finding the number of *Ask* posts and comments by hour created:  
We will determine whether an *Ask* post is likely to receive more comments if it is created at a certain time.  
First, we will count the *Ask* posts created during each hour of the day, along with the total number of comments those posts received. Then we will calculate the average number of comments per post for each hour.

In [8]:
def get_comments_by_hour(dataset, time_index, comment_index):
    result_list = list()
    
    # get post created time and total number of comments
    for row in dataset:
        result_list.append([row[time_index], int(row[comment_index])])
    
    counts_by_hour = dict()
    comments_by_hour = dict()
    date_format = '%m/%d/%Y %H:%M' # date taken from dataset will be the format ex: 9/26/2016 3:26
    
    for row in result_list:
        # create a datetime object
        date = dt.datetime.strptime(row[0], date_format)
        comments = row[1]
        # get hour from datetime object
        hour = date.strftime('%H')
        
        # number of posts created in that hour
        counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
        # total number of comments received by posts created in that hour
        comments_by_hour[hour] = comments_by_hour.get(hour, 0) + comments
    
    return comments_by_hour, counts_by_hour
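
To make the timestamp handling concrete, the sketch below parses a single timestamp in the dataset's format and extracts the hour. The variable names are illustrative only.

```python
import datetime as dt

date_format = '%m/%d/%Y %H:%M'
stamp = '9/26/2016 3:26'  # timestamp of the first row printed earlier

parsed = dt.datetime.strptime(stamp, date_format)
hour_str = parsed.strftime('%H')  # zero-padded string, as used for the dictionary keys
hour_int = parsed.hour            # integer alternative

print(hour_str, hour_int)
```

Note that `strptime` accepts the unpadded month and hour in the input, while `strftime('%H')` always produces a two-digit string. This is why the dictionary keys look like `'03'` rather than `'3'`.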

In [9]:
comments_by_hour, counts_by_hour = get_comments_by_hour(ask_posts, 6, 4) # index 4: total number of comments, index 6: timestamp at which post was created

#### Calculating the average number of comments for *Ask HN* posts by hour:

In [10]:
def average_comments_per_hour(comments_by_hour, counts_by_hour):
    average_by_hour = dict()
    
    for hour, comments in comments_by_hour.items():
        average_by_hour[hour] = comments/counts_by_hour[hour]
    
    # sort in descending order by average number of comments per post
    sorted_avg_by_hour = sorted(average_by_hour.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_avg_by_hour

In [11]:
avg_comments_per_hour = average_comments_per_hour(comments_by_hour, counts_by_hour)
pp.pprint(avg_comments_per_hour)

[('15', 28.676470588235293),
 ('13', 16.31756756756757),
 ('12', 12.380116959064328),
 ('02', 11.137546468401487),
 ('10', 10.684397163120567),
 ('04', 9.7119341563786),
 ('14', 9.692007797270955),
 ('17', 9.449744463373083),
 ('08', 9.190661478599221),
 ('11', 8.96474358974359),
 ('22', 8.804177545691905),
 ('05', 8.794258373205741),
 ('20', 8.749019607843136),
 ('21', 8.687258687258687),
 ('03', 7.948339483394834),
 ('18', 7.94299674267101),
 ('16', 7.713298791018998),
 ('00', 7.5647840531561465),
 ('01', 7.407801418439717),
 ('19', 7.163043478260869),
 ('07', 7.013274336283186),
 ('06', 6.782051282051282),
 ('23', 6.696793002915452),
 ('09', 6.653153153153153)]


In [12]:
print('Top 5 hours for Ask HN comments:')
for hour, avg in avg_comments_per_hour[:5]:
    print('Hr {} - Average {:.2f} comments per post'.format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg))

Top 5 hours for Ask HN comments:
Hr 15:00 - Average 28.68 comments per post
Hr 13:00 - Average 16.32 comments per post
Hr 12:00 - Average 12.38 comments per post
Hr 02:00 - Average 11.14 comments per post
Hr 10:00 - Average 10.68 comments per post


The hour with the most comments per post on average is 15:00, with 28.68 comments per post. That is about 75% more than the second-highest hour, 13:00, which averages 16.32 comments per post.  
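
The size of that gap can be checked with a short calculation. This is a minimal sketch using the two averages printed above. The variable names are illustrative.

```python
top_avg = 28.676470588235293     # 15:00 average from the output above
second_avg = 16.31756756756757   # 13:00 average from the output above

# relative increase of the top hour over the second-highest hour, in percent
pct_increase = (top_avg - second_avg) / second_avg * 100
print('{:.1f}%'.format(pct_increase))
```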
  
According to the dataset [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time, so 15:00 can also be written as 3:00 PM ET.  
  
### Conclusion:
In this project, we analyzed *Ask* and *Show* posts to determine which type of post, and which posting time, receives the most comments on average. Based on our analysis, to maximize the number of comments, we recommend submitting an *Ask* post between 15:00 and 16:00 (3:00 PM to 4:00 PM ET).