### In this project, we explore the Hacker News data set. We are interested only in posts whose titles begin with "Ask HN" or "Show HN". We will determine which of the two types receives more comments on average, and whether posts created at a certain hour of the day receive more comments on average.


In [1]:
# Import the data set into a list of lists and display the first 5 rows
import csv

with open("hacker_news.csv", "r") as file:
    reader = csv.reader(file)
    hn = list(reader)
    for row in hn[:5]:
        print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


In [3]:
# Extract the first row into the headers variable and keep only the data rows in hn
headers = hn[0]
hn = hn[1:]
print(headers)
for row in hn[:5]:
    print(row)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://ars

In [8]:
# Split the posts into three lists: titles starting with "Ask HN", titles starting with "Show HN", and all others
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts), len(show_posts), len(other_posts))

1744 1162 17193
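
The prefix check above can also be isolated in a small function, which makes the case-insensitive matching easy to try on its own. This is a minimal sketch: the name `classify_title` and the sample titles are hypothetical and are not part of the notebook or of hacker_news.csv.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # compare in lower case so "ASK HN" also matches
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample titles, not rows from hacker_news.csv.
samples = [
    "Ask HN: How do you learn?",
    "SHOW HN: My project",
    "Why I ask HN for advice",
]
labels = [classify_title(title) for title in samples]
print(labels)
```

The third sample contains "ask HN" in the middle of the title, so it is classified as "other"; only the start of the title is checked.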


### Now that we have separate lists of the "Ask HN" and "Show HN" posts, we will determine which type receives more comments on average.


In [12]:
# Average number of comments per Ask HN post (index 4 is num_comments)
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Ask HN Comments Per Post: %.2f" % avg_ask_comments)

# Average number of comments per Show HN post
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average Show HN Comments Per Post: %.2f" % avg_show_comments)

Average Ask HN Comments Per Post: 14.04
Average Show HN Comments Per Post: 10.32
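
Both loops above do the same computation on different lists, so they could be folded into one helper. The sketch below assumes a hypothetical function name, `average_comments`, and uses made-up rows in the same column layout as the data set; it also returns 0.0 for an empty list, which is a choice made here and not something the notebook does.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the same column order as hacker_news.csv.
sample_posts = [
    ["1", "Ask HN: A", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "3", "4", "user_b", "8/4/2016 12:10"],
]
print("%.2f" % average_comments(sample_posts))
```

With the notebook's own lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.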


### On average, Ask HN posts receive about 4 more comments than Show HN posts (14.04 vs. 10.32). Because Ask HN posts receive more comments, the rest of the analysis uses only the Ask HN list.
### Next, we will determine whether Ask HN posts created at a certain hour of the day receive more comments on average.


In [13]:
import datetime as dt

# Collect [created_at, num_comments] pairs for every Ask HN post
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])

In [24]:
# Count the posts and sum the comments for each hour of the day
counts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    date = result[0]
    date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = date.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]

In [28]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
for hr, avg in avg_by_hour:
    print("%s, %.2f" % (hr, avg))

0, 8.13
1, 11.38
2, 23.81
3, 7.80
4, 7.17
5, 10.09
6, 9.02
7, 7.85
8, 10.25
9, 5.58
10, 13.44
11, 11.05
12, 9.41
13, 14.74
14, 13.23
15, 38.59
16, 16.80
17, 11.46
18, 13.20
19, 10.80
20, 21.52
21, 16.01
22, 6.75
23, 7.99
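
The counting and averaging steps above can be combined with `collections.defaultdict`, which removes the "is this hour already a key?" branch. This is a sketch under stated assumptions: the function name `average_comments_by_hour` is hypothetical, and the sample pairs are illustrative rather than the full Ask HN list.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour (0-23) to the mean comment count of posts created then."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment sum]
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).hour
        totals[hour][0] += 1
        totals[hour][1] += num_comments
    # Build the result in ascending hour order.
    return {
        hour: comments / count
        for hour, (count, comments) in sorted(totals.items())
    }


# Illustrative [created_at, num_comments] pairs in the notebook's date format.
sample = [["8/4/2016 11:52", 52], ["1/26/2016 11:05", 10], ["6/17/2016 0:01", 1]]
print(average_comments_by_hour(sample))
```

Passing the notebook's `result_list` to this function would produce a dictionary with the same hour-to-average mapping as `avg_by_hour`.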


### Now that we have the results, we will sort them by average number of comments and print the top hours in a more readable format.

In [37]:
# Swap the columns so that sorting the list orders it by average first
swap_avg_by_hour = []
for hr, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hr])
print(swap_avg_by_hour)

[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]


In [39]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[:5]:
    # Turn the integer hour into an "HH:MM" label, e.g. 15 -> "15:00"
    hour_label = dt.datetime.strptime(str(hr), "%H").strftime("%H:%M")
    print("{hr}: {avg:.2f} average comments".format(hr=hour_label, avg=avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments
02:00: 23.81 average comments
20:00: 21.52 average comments
16:00: 16.80 average comments
21:00: 16.01 average comments
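
Swapping the columns is one way to sort by average; another is to pass a `key` function to `sorted`, which leaves the `[hour, average]` pairs as they are. The sketch below assumes a hypothetical helper name, `top_hours`, and uses a few rounded values taken from the hourly output above.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]


# A few rounded [hour, average] pairs from the hourly results above.
sample_avg = [[15, 38.59], [2, 23.81], [9, 5.58], [20, 21.52]]
for hour, avg in top_hours(sample_avg, n=3):
    # {:02d} pads the hour to two digits, so 2 is printed as "02:00"
    print("{:02d}:00: {:.2f} average comments".format(hour, avg))
```

Formatting the hour with `{:02d}` gives the same "HH:00" labels without a round trip through `strptime` and `strftime`, since the minutes are always zero here.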
