# Exploring Hacker News Posts

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News. We're specifically interested in posts with titles that begin with either *Ask HN* or **Show HN**.

We'll compare these two types of posts to determine the following:

- Do *Ask HN* or **Show HN** posts receive more comments on average?

- Do posts created at a certain time receive more comments on average?

First, we import the `csv` reader and read the dataset into a list of lists.

In [1]:
from csv import reader

# Open the file in a context manager so it is closed after reading
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Notice that the first inner list contains the column headers, and each list after it contains the data for one row. To analyze the data, we first need to separate the header row from the rest. Let's do that next.

In [2]:
# Removing Headers from a List of Lists

headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## Extracting *Ask HN* and **Show HN** Posts

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of posts in Ask HN:", len(ask_posts))
print("Number of posts in Show HN:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of posts in Ask HN: 1744
Number of posts in Show HN: 1162
Number of other posts: 17194
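
The prefix check is case-insensitive because each title is lowercased before `startswith` is called. Here is a minimal sketch of that logic on a few made-up titles. The `classify_title` helper and the sample titles are illustrative assumptions; they are not part of the dataset or the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # normalize case so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the helper
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Why I ask HN for advice',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

The third title contains "ask HN" but does not start with it, so it falls into the "other" group, just as it would in the loop above.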


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

As shown above, the dataset contains 1,744 *Ask HN* posts and 1,162 **Show HN** posts. We will now calculate the average number of comments for each of the two types.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments = total_ask_comments + num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Comments on Ask Posts:", round(avg_ask_comments,2))

total_show_comments = 0

for row in show_posts:
    comments = int(row[4])
    total_show_comments = total_show_comments + comments
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average Comments on Show Posts:", round(avg_show_comments, 2))

Average Comments on Ask Posts: 14.04
Average Comments on Show Posts: 10.32


From the above analysis, *Ask HN* posts receive about **14** comments on average (14.04), while **Show HN** posts receive about **10** (10.32). On average, then, *Ask HN* posts receive more comments than **Show HN** posts.

This is plausible. A post that asks a question (for instance, a request for help with a problem) invites replies, whereas a post that simply shows or explains something may not be looking for a response.
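
The two loops above do the same work on different lists, so the calculation could be folded into one helper. The sketch below is one way to do that. The `average_comments` name and its `comments_index` parameter are assumptions made for illustration, and the sample rows are taken from the dataset preview rather than from the Ask HN or Show HN lists.

```python
def average_comments(posts, comments_index=4):
    """Average the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows from the dataset preview (52, 1, and 2 comments)
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'],
]

sample_average = average_comments(sample_rows)
print(round(sample_average, 2))
```

With the notebook's lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.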

## Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain time are more likely to attract comments. 
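
Each `created_at` value is a string such as `9/30/2015 4:12`, so the hour has to be parsed out before posts can be grouped. Here is a minimal sketch of that step on a single value from the dataset preview, using the same format string as the cell below:

```python
import datetime as dt

created_at = '9/30/2015 4:12'  # sample value from the created_at column

# %m/%d/%Y matches month/day/four-digit year; %H:%M matches a 24-hour time
parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

# strftime('%H') returns the hour as a zero-padded string
hour = parsed.strftime('%H')
print(parsed)
print(hour)
```

Because `%H` is zero-padded on output, hours come back as strings like `'04'` rather than integers, which is why the dictionary keys in this analysis look like `'09'` and `'15'`.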

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6] # the created_at column is the 7th col in ask_posts
    comments = int(row[4])
    result_list.append([created_at, comments]) # a list with 2 elements

print(result_list[:5])
print('\n')

counts_by_hour = {}
comments_by_hour = {}

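for result in result_list:
    created_at = result[0]  # full timestamp string, e.g. '8/16/2016 9:55'
    comments = result[1]
    # parse the timestamp and keep only the zero-padded hour, e.g. '09'
    time = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")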
    
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comments
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comments

print(comments_by_hour)  # total comments received by ask posts created in each hour
print(counts_by_hour)  # number of ask posts created in each hour of the day
        
        

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
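
The `if`/`else` bookkeeping in the loop above can also be written with `collections.defaultdict`, which supplies a starting value of 0 for unseen keys. The sketch below shows this alternative on a small sample. The first three pairs come from the output above, and the fourth is a hypothetical pair added so that one hour appears twice.

```python
from collections import defaultdict
import datetime as dt

sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 9:05', 4],  # hypothetical pair, not from the dataset
]

# defaultdict(int) returns 0 for a missing key, so += works on first sight
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

This is a stylistic alternative; it produces the same two dictionaries as the explicit membership check.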


## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [6]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/ counts_by_hour[hour])])

print("Average Number of Comments for Ask HN Posts by Hour:")
print("_____________________________________________________")
print(avg_by_hour)

           

Average Number of Comments for Ask HN Posts by Hour:
_____________________________________________________
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


## Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [7]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print(sorted_swap)
print('\n')
print("Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    avg_comments = row[0]
    hour = row[1]
    time = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print(time, round(avg_comments, 2), 'average comments per post')
    




[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


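[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]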
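
The swap-and-sort step can also be done without building a second list, by passing a `key` function to `sorted`. The sketch below is an alternative to the approach in the cell above, not a replacement for it. It uses four `[hour, average]` pairs copied from the `avg_by_hour` output.

```python
# Four [hour, average] pairs copied from the avg_by_hour output
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the second element (the average), highest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00 {avg:.2f} average comments per post")
```

Sorting on a key leaves each pair in its original `[hour, average]` order, so no swapping is needed before or after.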

During which hours should you create a post to have a higher chance of receiving comments? 

Assuming the dataset's timestamps are in a time zone similar to yours, the best time to create an *Ask HN* post is between 3:00 PM and 4:00 PM. Posts created in hour 15 averaged about 38.59 comments, well ahead of the next-best hours, 02:00 (about 23.81) and 20:00 (about 21.5).
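
If your time zone differs from the dataset's, the hour needs to be shifted before you act on it. The sketch below shows the arithmetic with fixed UTC offsets. Both offsets are assumptions chosen for illustration: the source does not state which time zone the timestamps use, and fixed offsets ignore daylight saving time.

```python
import datetime as dt

# Assumption: the dataset's timestamps are at UTC-5 (not stated in the source)
source_tz = dt.timezone(dt.timedelta(hours=-5))
# Hypothetical reader time zone at UTC+1
reader_tz = dt.timezone(dt.timedelta(hours=1))

# Hour 15 on an arbitrary date, tagged with the assumed source offset
post_time = dt.datetime(2016, 8, 16, 15, 0, tzinfo=source_tz)

# Convert the same instant to the reader's offset
local_time = post_time.astimezone(reader_tz)
print(local_time.strftime('%H:%M'))
```

Under these assumed offsets, 15:00 in the dataset corresponds to 21:00 for the reader; with different offsets the result shifts accordingly.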