# Exploring Hacker News
In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/). We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or something else they find interesting.

We'll compare these two types of posts to answer two main questions:
- Do 'Ask HN' or 'Show HN' posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's get started.

### Preparing the Data:

In [13]:
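from csv import reader

# Read the whole CSV into a list of rows; each row is a list of strings.
# The with-block closes the file once the rows have been read.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))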

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Looking at the first five rows, it seems like the data imported correctly. However, we can also see that the headers for each column imported as the first row. In order to analyze our data, we need to first remove the row containing the column headers.

In [14]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


Printing our variables right after creating them is a quick way to verify that the data reads as we expect.

'headers' and 'hn' are both reading correctly, so we can proceed.
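
The later cells pick columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, assuming the header row printed above, those positions can be looked up by name instead. The names `col` and `sample_row` are illustrative and are not part of the original notebook.

```python
# Header row copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Illustrative lookup table: column name -> position in each row.
col = {name: index for index, name in enumerate(headers)}

# First data row, copied from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

print(col['title'], col['num_comments'], col['created_at'])  # 1 4 6
print(sample_row[col['title']])                # Interactive Dynamic Video
print(int(sample_row[col['num_comments']]))    # 52
```

This confirms that index 1 is the title, index 4 the comment count, and index 6 the creation time, which are the positions used in the rest of the analysis.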

### Filtering the Data:

Since we're initially concerned with posts whose titles begin with 'Ask HN' or 'Show HN', we'll create new lists of lists containing just the rows for those posts. We lowercase each title before checking its prefix, so the match is case-insensitive.

In [15]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()  # lowercase once so the prefix check ignores case
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


Together, the three lists account for all 20,100 posts in the data set (1744 + 1162 + 17194). Of these, only 1744 fall under 'ask' posts and 1162 fall under 'show' posts.
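
The prefix check can also be written as a small function, which makes it easy to test on its own. This is a sketch only: `classify_title` is a hypothetical helper name, and the sample titles (apart from the last, which appears in the data above) are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Illustrative titles; the first two are invented, not taken from the data set.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "show hn: A tiny CSV viewer",
    "Interactive Dynamic Video",
]

counts = {"ask": 0, "show": 0, "other": 0}
for title in sample_titles:
    counts[classify_title(title)] += 1

print(counts)

# Every title lands in exactly one bucket, so the buckets sum to the total.
assert sum(counts.values()) == len(sample_titles)

# The same check applied to the counts printed above.
assert 1744 + 1162 + 17194 == 20100
```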

Now, let's determine whether ask posts or show posts receive more comments on average:

In [16]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Above, we totaled the comments on ask posts and on show posts, then divided each total by the number of posts of that type. On average, ask posts receive about 14.0 comments and show posts about 10.3, a difference of roughly 3.7 comments per post. Since ask posts receive more comments on average, we'll focus our remaining analysis on them and check whether ask posts created at a certain time of day attract more comments. To do this, we'll count the ask posts created in each hour of the day and total the comments they received. Then, we'll calculate the average number of comments per ask post for each hour.
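
Grouping by hour depends on turning each `created_at` string into an hour label. The sketch below tries that step in isolation on the `created_at` value of the first data row shown earlier. The format string assumes every timestamp follows the month/day/year hour:minute layout seen in the printed rows.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"   # assumed layout, based on the rows printed earlier
sample = "8/4/2016 11:52"        # created_at of the first data row

parsed = dt.datetime.strptime(sample, date_format)  # string -> datetime object
hour_label = parsed.strftime("%H")                  # datetime -> zero-padded hour string

print(parsed)      # 2016-08-04 11:52:00
print(hour_label)  # 11
```

Using the zero-padded string from `strftime("%H")` rather than the integer `parsed.hour` keeps the dictionary keys in the same `'00'` to `'23'` form that appears in the output.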

In [17]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    post_comments = int(row[4])
    result_list.append([created_at, post_comments])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment

print(comments_by_hour)
print('\n')
print(counts_by_hour)

{'21': 1745, '07': 267, '20': 1722, '22': 479, '13': 1253, '19': 1188, '14': 1416, '03': 421, '17': 1146, '06': 397, '23': 543, '01': 683, '12': 687, '11': 641, '16': 1814, '04': 337, '10': 793, '09': 251, '00': 447, '05': 464, '08': 492, '02': 1381, '15': 4477, '18': 1439}


{'21': 109, '07': 34, '20': 80, '22': 71, '13': 85, '19': 110, '14': 107, '03': 54, '17': 100, '06': 44, '23': 68, '01': 60, '12': 73, '11': 58, '16': 108, '04': 47, '10': 59, '09': 45, '00': 55, '05': 46, '08': 48, '02': 58, '15': 116, '18': 109}
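
As an alternative sketch of the same tally, `collections.defaultdict` removes the need for the `if time not in counts_by_hour` branch, because missing keys start at zero. The three sample rows use the same `[created_at, comment_count]` shape as `result_list`, but their values are made up for illustration.

```python
from collections import defaultdict
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Made-up rows in the same [created_at, comment_count] shape as result_list.
sample_results = [
    ["8/4/2016 11:52", 6],
    ["1/26/2016 11:05", 4],
    ["6/23/2016 22:20", 1],
]

sample_counts = defaultdict(int)    # posts per hour, missing hours default to 0
sample_comments = defaultdict(int)  # comments per hour, missing hours default to 0

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))    # {'11': 2, '22': 1}
print(dict(sample_comments))  # {'11': 10, '22': 1}
```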


In [18]:
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

avg_by_hour

[['21', 16.009174311926607],
 ['07', 7.852941176470588],
 ['20', 21.525],
 ['22', 6.746478873239437],
 ['13', 14.741176470588234],
 ['19', 10.8],
 ['14', 13.233644859813085],
 ['03', 7.796296296296297],
 ['17', 11.46],
 ['06', 9.022727272727273],
 ['23', 7.985294117647059],
 ['01', 11.383333333333333],
 ['12', 9.41095890410959],
 ['11', 11.051724137931034],
 ['16', 16.796296296296298],
 ['04', 7.170212765957447],
 ['10', 13.440677966101696],
 ['09', 5.5777777777777775],
 ['00', 8.127272727272727],
 ['05', 10.08695652173913],
 ['08', 10.25],
 ['02', 23.810344827586206],
 ['15', 38.5948275862069],
 ['18', 13.20183486238532]]
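
The per-hour average is the comment total divided by the post count for that hour. As a quick check, the sketch below recomputes three of the averages from the totals printed earlier, using a dictionary comprehension. `avg_lookup` is an illustrative name; the notebook itself stores the averages as a list of lists.

```python
# Three-hour subset copied from the comments_by_hour and counts_by_hour output above.
comments_subset = {'15': 4477, '02': 1381, '20': 1722}
counts_subset = {'15': 116, '02': 58, '20': 80}

# Average comments per post for each hour in the subset.
avg_lookup = {hr: comments_subset[hr] / counts_subset[hr] for hr in comments_subset}

# Print in chronological order; zero-padded hour strings sort correctly as text.
for hr in sorted(avg_lookup):
    print(hr, avg_lookup[hr])
```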

## Sorting and Understanding Our Data
Now that we have the results we need, it's important to display them in an understandable fashion. We'll sort the hours by average comments per post and print the top five:

In [19]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]]) 

print(swap_avg_by_hour)
print("\n")
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 hours for Ask Posts Comments")

for average, hour in sorted_swap[:5]:
    # Turn the hour string (e.g. '15') into an 'HH:MM' label (e.g. '15:00').
    hour_label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_label, average))

[[16.009174311926607, '21'], [7.852941176470588, '07'], [21.525, '20'], [6.746478873239437, '22'], [14.741176470588234, '13'], [10.8, '19'], [13.233644859813085, '14'], [7.796296296296297, '03'], [11.46, '17'], [9.022727272727273, '06'], [7.985294117647059, '23'], [11.383333333333333, '01'], [9.41095890410959, '12'], [11.051724137931034, '11'], [16.796296296296298, '16'], [7.170212765957447, '04'], [13.440677966101696, '10'], [5.5777777777777775, '09'], [8.127272727272727, '00'], [10.08695652173913, '05'], [10.25, '08'], [23.810344827586206, '02'], [38.5948275862069, '15'], [13.20183486238532, '18']]


Top 5 hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
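
The swap step exists only so that `sorted` compares the averages first. A possible alternative, sketched below, is to pass a `key` function to `sorted` and leave each `[hour, average]` pair in its original order. `sample_avgs` is an illustrative six-hour subset copied from the `avg_by_hour` output above.

```python
# Subset of avg_by_hour copied from the output above ([hour, average] pairs).
sample_avgs = [
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['16', 16.796296296296298],
    ['09', 5.5777777777777775],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
]

# Sort by the average (second element), highest first, and keep the top five.
top = sorted(sample_avgs, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

Because these six hours include the five highest averages, the loop prints them in the same order as the top-five list above.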


# Conclusion
Sorting by average comments per post, we can see that 15:00 (3 PM EST) is the best hour to create an Ask HN post in order to receive comments: posts created in that hour average 38.59 comments, about 62% more than the next-highest hour (02:00, at 23.81). Further, Ask HN posts receive more comments on average than Show HN posts (14.04 versus 10.32). We did not compute an average for other post types, so this analysis does not show how Ask HN compares with them.
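
The size of the gaps mentioned here can be checked directly from the averages printed earlier. The variable names in this short sketch are illustrative.

```python
# Averages copied from the output above.
top_avg = 38.5948275862069        # 15:00
second_avg = 23.810344827586206   # 02:00
avg_ask = 14.038417431192661      # all ask posts
avg_show = 10.31669535283993      # all show posts

# Relative increase of the top hour over the second-highest hour, in percent.
relative_increase = (top_avg - second_avg) / second_avg * 100
print("{:.1f}%".format(relative_increase))  # 62.1%

# Absolute difference between ask and show posts, in comments per post.
ask_show_gap = avg_ask - avg_show
print("{:.2f}".format(ask_show_gap))  # 3.72
```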