# Exploring Hacker News Posts

In this short project, we will analyze data on Hacker News posts collected from September 2015 to September 2016. The primary goal of our analysis is to answer the following questions:

1. Do *Ask HN* or *Show HN* posts receive more comments on average?
2. Do posts created at a certain time of day receive more comments on average?

## Open and Explore the Dataset

The data used for this project comes from a public-domain dataset [available on Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts). Below, we will read in the CSV file, inspect the header row, and explore the first five rows of the dataset.

In [1]:
def open_dataset(file_name, header=True):
    """Opens a csv file found at a relative path file_name and returns it as a list of lists.
       If a header row exists, returns a tuple containing the header row and the dataset.
       
       Parameters:
       file_name: non-empty string; the relative path of the csv file.
       header: boolean; set to True by default to indicate that the csv file contains a header row
       
       Returns:
       if header is True: tuple of lists
       if header is False: list
    """
    from csv import reader

    # the context manager closes the file once all rows have been read
    with open(file_name, newline='') as opened_file:
        data = list(reader(opened_file))
    
    if header:
        return data[0], data[1:]
    else:
        return data

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """Used to visualize a subsection of dataset. Slices dataset and prints the rows in the slice.
       There is also an option to print the number of rows and columns in the full dataset.
       
       Parameters:
       dataset: non-empty list of non-empty lists; the input data to be explored, with no header row
       start: int >= 0; the index of the first element of the slice
       end: int > start; the index of the last element of the slice + 1
       rows_and_columns: boolean; optional parameter; if True, explore_data() will also print number of rows and columns in the full dataset
       
       Returns:
       None
    """
    
    data_slice = dataset[start:end]
    
    for row in data_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# open the file
hn_header, hn = open_dataset('hacker_news.csv')

# inspect the header row
print('Header:')
print(hn_header)
print('\n')

# inspect the first 5 rows of the dataset
print('Dataset:')
explore_data(hn, 0, 5, rows_and_columns=True)

Header:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Dataset:
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:
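
The rest of this analysis refers to columns by position (`row[1]` for the title, `row[4]` for the comment count, `row[6]` for the timestamp). As an alternative, the sketch below shows how the standard library's `csv.DictReader` could expose the same columns by name. It is not the approach used in this project. The two sample rows are copied from the output above and embedded as text, so the snippet runs without `hacker_news.csv`; the variable names `sample_csv`, `rows`, and `first` are only illustrative.

```python
import csv
import io

# Two rows copied from the output above, embedded as text so the sketch
# runs without hacker_news.csv on disk.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# DictReader consumes the header row and yields one dict per data row.
rows = list(csv.DictReader(sample_csv))

first = rows[0]
print(first["title"])
print(int(first["num_comments"]))  # values are read as strings, so convert
```

With this layout, `row["num_comments"]` replaces `row[4]`, which may be easier to read at the cost of slightly more memory per row.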

## Analyzing the average number of comments for Ask HN posts versus Show HN posts

We will now split the dataset into three separate chunks: *Ask HN* posts, *Show HN* posts, and *Other* posts.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'): # use lower case version of title to avoid case irregularities
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Check that Ask HN + Show HN + Other = total rows in dataset:',
      len(ask_posts) + len(show_posts) + len(other_posts) == len(hn))

Check that Ask HN + Show HN + Other = total rows in dataset: True
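
To make the prefix rule easier to test in isolation, the same logic can be sketched as a small function. The name `classify_post` and the sample titles are hypothetical and used only for illustration; the rule itself (a case-insensitive check on the start of the title) is the one applied in the cell above.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # lower-case once to avoid case irregularities
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, for illustration only.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny CSV viewer',
    'algorithmic music',
    'Why I ask HN for advice',  # contains "ask HN" but does not start with it
]

labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Note that only the prefix matters: a title that merely mentions "ask HN" somewhere in the middle is classified as `other`.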


Next, we will calculate the average number of comments for *Ask HN* posts, and then *Show HN* posts.

In [5]:
# Ask HN
total_ask_comments = 0 

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

total_ask_posts = len(ask_posts)
avg_ask_comments = round(total_ask_comments / total_ask_posts, 1)
print('The average number of comments for an Ask HN post is:', avg_ask_comments)

The average number of comments for an Ask HN post is: 10.4


In [6]:
# Show HN
total_show_comments = 0 

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

total_show_posts = len(show_posts)
avg_show_comments = round(total_show_comments / total_show_posts, 1)
print('The average number of comments for a Show HN post is:', avg_show_comments)

The average number of comments for a Show HN post is: 4.9


The results show that *Ask HN* posts receive, on average, about twice as many comments as *Show HN* posts: an *Ask HN* post receives an average of 10.4 comments, while a *Show HN* post receives an average of 4.9.
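
The two cells above repeat the same loop for each post type. One way to avoid the duplication is a helper along the following lines. This is a sketch, not part of the original notebook: the function name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for illustration, and returning `0.0` for an empty list is a design choice rather than something the analysis requires.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, rounded to one decimal.

    posts: list of rows (lists of strings) in the same layout as hn.
    comments_index: position of the num_comments column (4 in this dataset).
    """
    if not posts:
        return 0.0  # assumption: treat "no posts" as an average of zero
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 1)


# Made-up rows in the dataset's column order, for illustration only.
sample_posts = [
    ['1', 'Ask HN: first example', '', '3', '6', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second example', '', '5', '9', 'user_b', '9/26/2016 4:10'],
]

print(average_comments(sample_posts))
```

In the notebook, the same helper could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.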

## Analyzing average number of comments per post based on time of day when posting

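The `created_at` timestamps are in US Eastern Time, so the hours below should be read in that time zone. Because *Ask HN* posts attract more comments on average, this part of the analysis looks only at *Ask HN* posts.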

In [26]:
# import datetime module
import datetime as dt

In [27]:
# create a subset of the full dataset with only created_at and num_comments columns
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at, num_comments]
    result_list.append(result)

In [29]:
# calculate the number of Ask HN posts created in each hour of the day, and
# calculate the aggregate number of comments received by each of those collections of posts (by hour created)
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    num_comments = row[1] # type is int
    created_at = row[0]
    created_at_dt = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M") # convert from string to datetime
    hour = created_at_dt.strftime("%H:00") # just take the hour and format as string
    
    # populate the dictionaries with the hours as keys
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
print('Counts by hour:', '\n', counts_by_hour)
print('\n')
print('Comments by hour:', '\n', comments_by_hour)

Counts by hour: 
 {'02:00': 269, '01:00': 282, '22:00': 383, '21:00': 518, '19:00': 552, '17:00': 587, '15:00': 646, '14:00': 513, '13:00': 444, '11:00': 312, '10:00': 282, '09:00': 222, '07:00': 226, '03:00': 271, '23:00': 343, '20:00': 510, '16:00': 579, '08:00': 257, '00:00': 301, '18:00': 614, '12:00': 342, '04:00': 243, '06:00': 234, '05:00': 209}


Comments by hour: 
 {'02:00': 2996, '01:00': 2089, '22:00': 3372, '21:00': 4500, '19:00': 3954, '17:00': 5547, '15:00': 18525, '14:00': 4972, '13:00': 7245, '11:00': 2797, '10:00': 3013, '09:00': 1477, '07:00': 1585, '03:00': 2154, '23:00': 2297, '20:00': 4462, '16:00': 4466, '08:00': 2362, '00:00': 2277, '18:00': 4877, '12:00': 4234, '04:00': 2360, '06:00': 1587, '05:00': 1838}
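
The parsing and tallying steps above can be seen more easily on a few rows. The sketch below uses made-up `[created_at, num_comments]` pairs in the same shape as `result_list`, and swaps the explicit `if hour not in ...` check for `collections.defaultdict`, which is an alternative to the approach in the cell above rather than a description of it. The names `sample_results`, `counts`, and `comments` are illustrative.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs, shaped like result_list.
sample_results = [
    ['9/26/2016 3:26', 4],
    ['9/26/2016 3:05', 2],
    ['8/16/2016 15:40', 10],
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample_results:
    # parse the timestamp string, then keep only the hour as "HH:00"
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H:00")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

Both posts created during the 3 a.m. hour land in the `'03:00'` bucket, regardless of their minutes or dates.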


In [39]:
# calculate average number of comments per post by hour
avg_by_hour = {}

for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour[hour] = avg

# build a list of hours in descending order of avg comments per post
hour_sorted = sorted(avg_by_hour, key=avg_by_hour.get, reverse=True)

# print the top 5 hours for average number of comments per post
print('Top 5 Hours for Ask Posts Comments:')
for hour in hour_sorted[:5]:
    avg = avg_by_hour[hour]
    template = '{hour}: {avg:.2f} average comments per post'
    print(template.format(hour=hour, avg=avg))

Top 5 Hours for Ask Posts Comments:
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
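
As a sanity check on the arithmetic, the ranking can be reproduced from a handful of the per-hour totals printed earlier. The figures below are copied from the `Counts by hour` and `Comments by hour` output for six hours only; the variable names and the choice to rank a list of `(hour, average)` pairs are illustrative.

```python
# Figures copied from the printed output above (six of the 24 hours).
counts_subset = {'15:00': 646, '13:00': 444, '12:00': 342,
                 '02:00': 269, '10:00': 282, '09:00': 222}
comments_subset = {'15:00': 18525, '13:00': 7245, '12:00': 4234,
                   '02:00': 2996, '10:00': 3013, '09:00': 1477}

# average comments per post = total comments / number of posts, per hour
averages = {hour: comments_subset[hour] / counts_subset[hour]
            for hour in counts_subset}

# rank hours from highest to lowest average
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)

for hour, avg in ranked[:5]:
    print('{}: {:.2f} average comments per post'.format(hour, avg))
```

For example, the 15:00 figure is 18525 comments divided by 646 posts, which rounds to 28.68 and matches the first line of the output above.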
