# Guided Dataquest.io Project: Exploring Hacker News Posts
This project analyzes a dataset of posts submitted to the website Hacker News. The dataset originally contained 300,000 rows. Dataquest reduced it to roughly 20,000 rows by removing posts that received no comments and then randomly sampling from the remaining posts. We compare Ask HN and Show HN posts by how many of each there are, how many comments they receive per post, and how the number of comments varies with the hour at which a post was created.

In [2]:
from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file afterwards.
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

header = hn[0]
hn = hn[1:]
print(header)
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


We are interested in posts whose titles begin with "Ask HN" or "Show HN". Using a case-insensitive check on the start of each title, we split the posts into three lists: Ask HN posts, Show HN posts, and all other posts.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("Number of Ask HN posts: ", len(ask_posts))
print("Number of Show HN posts: ", len(show_posts))
print("All other posts: ", len(other_posts))

Number of Ask HN posts:  1744
Number of Show HN posts:  1162
All other posts:  17194
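
The prefix check can also be wrapped in a small helper, which makes the rule easy to test on its own. The sketch below is only an illustration: `classify_title` is a hypothetical name that does not appear in the notebook, and the second and third sample titles are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Sample titles used only to exercise the helper (the last two are hypothetical).
sample_titles = [
    'Interactive Dynamic Video',
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

A helper like this keeps the classification rule in one place if it later needs to change, for example to require a colon after the prefix.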


We are interested in the number of comments each type of post receives. Below we calculate the total number of comments and the average number of comments per post for Ask HN and Show HN posts.

In [4]:
total_ask_comments = 0
for row in ask_posts:
    ask_comments_count = int(row[4])
    total_ask_comments += ask_comments_count

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of Ask HN post comments: ", avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    show_comments_count = int(row[4])
    total_show_comments += show_comments_count
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of Show HN post comments: ", avg_show_comments)
    

Average number of Ask HN post comments:  14.038417431192661
Average number of Show HN post comments:  10.31669535283993


Based on the averages above, Ask HN posts receive more comments per post than Show HN posts (about 14.04 versus 10.32).
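
The two loops above differ only in the list they read, so the calculation could be shared. The following is a minimal sketch under that assumption; `average_comments` is a hypothetical helper, and the sample rows are abbreviated stand-ins that follow the dataset's column order (the comment count is at index 4).

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count for rows whose counts are stored as strings."""
    if not rows:
        return 0.0  # guard against dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Abbreviated, hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ['1', 'Ask HN: First question', '', '386', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Second question', '', '3', '1', 'user_b', '6/17/2016 0:01'],
]
sample_average = average_comments(sample_rows)
print(sample_average)
```

With the notebook's lists, the same function would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.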

## Ask HN posts analysis

In the code below, we create two dictionaries: one counts the Ask HN posts created in each hour of the day, and the other sums the comments those posts received. To do this, we parse each post's `created_at` string into a datetime object, extract the hour as a two-digit string, and use that string as the key in both dictionaries.

In [5]:
from datetime import datetime

result_list = []

for row in ask_posts:
    created_time = row[6]
    comments_count = int(row[4])
    result_list.append((created_time, comments_count))
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    time = row[0]
    time_object = datetime.strptime(time, date_format)
    hour_object = time_object.strftime("%H")
    comments_count = row[1]
    if hour_object not in counts_by_hour:
        counts_by_hour[hour_object] = 1
        comments_by_hour[hour_object] = comments_count
    else:
        counts_by_hour[hour_object] += 1
        comments_by_hour[hour_object] += comments_count

print("Count of Ask HN posts by hour: ", counts_by_hour)
print("Comments by hour for Ask HN posts: ", comments_by_hour)

Count of Ask HN posts by hour:  {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
Comments by hour for Ask HN posts:  {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
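
To make the hour extraction concrete, here is the parsing step applied to a single `created_at` value taken from the first row of the dataset preview. The variable names `hour_key` and `hour_number` are illustrative only.

```python
from datetime import datetime

date_format = "%m/%d/%Y %H:%M"
created_at = '8/4/2016 11:52'  # created_at value of the first row in the preview

parsed = datetime.strptime(created_at, date_format)
hour_key = parsed.strftime("%H")  # zero-padded string, as used for the dictionary keys
hour_number = parsed.hour         # integer alternative to the string key

print(hour_key, hour_number)
```

The notebook uses the zero-padded string form, which is why the keys in the output look like `'09'` rather than `9`.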


The code below calculates the average number of comments per Ask HN post for each hour of the day by dividing the total number of comments for that hour by the number of Ask HN posts created in that hour.

In [6]:
avgs_by_hour = []
for hour in comments_by_hour:
    avgs_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
print(avgs_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
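
As a worked check of the division, the sketch below recomputes the averages for two hours using the totals printed earlier (4477 comments over 116 posts at hour 15, and 1381 comments over 58 posts at hour 02). The `_sample` names are illustrative, and the dictionary comprehension is just one possible way to write the step.

```python
# Totals copied from the printed output for two hours of the day.
comments_by_hour_sample = {'15': 4477, '02': 1381}
counts_by_hour_sample = {'15': 116, '02': 58}

# Average comments per post = total comments for the hour / posts in the hour.
avg_sample = {
    hour: comments_by_hour_sample[hour] / counts_by_hour_sample[hour]
    for hour in comments_by_hour_sample
}

for hour, avg in avg_sample.items():
    print(f"{hour}: {avg:.2f}")
```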


In the code below, we determine the 5 hours with the highest average number of comments per Ask HN post. We first swap the two columns of `avgs_by_hour` so that the average comes first, then sort the swapped list in descending order, and finally format each hour as a time (for readability) and print the 5 rows with the highest averages.

In [7]:
swap_avg_by_hour = []
for row in avgs_by_hour:
    swap_avg_by_hour.append((row[1], row[0]))
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("\n")
print("Top 5 Hours with the most Ask Post Comments:")
for row in sorted_swap[:5]:
    hour = datetime.strptime(row[1], '%H')
    hour_object = hour.strftime('%H:%M')
    string = "{time}: {comments:.2f} average comments per post".format(time=hour_object, comments=row[0])
    print(string)

[(5.5777777777777775, '09'), (14.741176470588234, '13'), (13.440677966101696, '10'), (13.233644859813085, '14'), (16.796296296296298, '16'), (7.985294117647059, '23'), (9.41095890410959, '12'), (11.46, '17'), (38.5948275862069, '15'), (16.009174311926607, '21'), (21.525, '20'), (23.810344827586206, '02'), (13.20183486238532, '18'), (7.796296296296297, '03'), (10.08695652173913, '05'), (10.8, '19'), (11.383333333333333, '01'), (6.746478873239437, '22'), (10.25, '08'), (7.170212765957447, '04'), (8.127272727272727, '00'), (9.022727272727273, '06'), (7.852941176470588, '07'), (11.051724137931034, '11')]


Top 5 Hours with the most Ask Post Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


Findings: Based on the data above, Ask HN posts created during the 15:00 (3:00 pm) hour receive the most comments on average, about 38.59 per post.
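
The column swap is one way to sort by the average. An alternative, sketched here as an option rather than as the notebook's method, is to pass a `key` function to `sorted` so the `[hour, average]` pairs can be ordered without building a second list. The pairs in `avgs_sample` are copied from the `avgs_by_hour` output.

```python
# A few [hour, average] pairs copied from the avgs_by_hour output.
avgs_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first, and keep the top two.
top_two = sorted(avgs_sample, key=lambda pair: pair[1], reverse=True)[:2]

for hour, avg in top_two:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```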

## Show HN posts analysis

In the code below, we repeat the same steps for Show HN posts: we create two dictionaries, one counting the Show HN posts created in each hour of the day and one summing the comments those posts received.

In [8]:
from datetime import datetime

show_result_list = []

for row in show_posts:
    created_time = row[6]
    comments_count = int(row[4])
    show_result_list.append((created_time, comments_count))
    
show_counts_by_hour = {}
show_comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in show_result_list:
    time = row[0]
    time_object = datetime.strptime(time, date_format)
    hour_object = time_object.strftime("%H")
    comments_count = row[1]
    if hour_object not in show_counts_by_hour:
        show_counts_by_hour[hour_object] = 1
        show_comments_by_hour[hour_object] = comments_count
    else:
        show_counts_by_hour[hour_object] += 1
        show_comments_by_hour[hour_object] += comments_count

print("Count of Show HN posts by hour: ", show_counts_by_hour)
print("Comments by hour for Show HN posts: ", show_comments_by_hour)

Count of Show HN posts by hour:  {'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
Comments by hour for Show HN posts:  {'14': 1156, '22': 570, '18': 962, '07': 299, '20': 612, '05': 58, '16': 1084, '19': 539, '15': 632, '03': 287, '17': 911, '06': 142, '02': 127, '13': 946, '08': 165, '21': 272, '04': 247, '11': 491, '12': 720, '23': 447, '09': 291, '01': 246, '10': 297, '00': 487}


The code below calculates the average number of comments per Show HN post for each hour of the day.

In [9]:
show_avgs_by_hour = []
for hour in show_comments_by_hour:
    show_avgs_by_hour.append([hour, show_comments_by_hour[hour] / show_counts_by_hour[hour]])
    
print(show_avgs_by_hour)

[['14', 13.44186046511628], ['22', 12.391304347826088], ['18', 15.770491803278688], ['07', 11.5], ['20', 10.2], ['05', 3.0526315789473686], ['16', 11.655913978494624], ['19', 9.8], ['15', 8.102564102564102], ['03', 10.62962962962963], ['17', 9.795698924731182], ['06', 8.875], ['02', 4.233333333333333], ['13', 9.555555555555555], ['08', 4.852941176470588], ['21', 5.787234042553192], ['04', 9.5], ['11', 11.159090909090908], ['12', 11.80327868852459], ['23', 12.416666666666666], ['09', 9.7], ['01', 8.785714285714286], ['10', 8.25], ['00', 15.709677419354838]]


In the code below, we determine the 5 hours with the highest average number of comments per Show HN post.

In [12]:
show_swap_avg_by_hour = []
for row in show_avgs_by_hour:
    show_swap_avg_by_hour.append((row[1], row[0]))
    
print(show_swap_avg_by_hour)

show_sorted_swap = sorted(show_swap_avg_by_hour, reverse = True)

print("\n")
print("Top 5 Hours with the most Show HN Post Comments:")
for row in show_sorted_swap[:5]:
    hour = datetime.strptime(row[1], '%H')
    hour_object = hour.strftime('%H:%M')
    string = "{time}: {comments:.2f} average comments per post".format(time=hour_object, comments=row[0])
    print(string)

[(13.44186046511628, '14'), (12.391304347826088, '22'), (15.770491803278688, '18'), (11.5, '07'), (10.2, '20'), (3.0526315789473686, '05'), (11.655913978494624, '16'), (9.8, '19'), (8.102564102564102, '15'), (10.62962962962963, '03'), (9.795698924731182, '17'), (8.875, '06'), (4.233333333333333, '02'), (9.555555555555555, '13'), (4.852941176470588, '08'), (5.787234042553192, '21'), (9.5, '04'), (11.159090909090908, '11'), (11.80327868852459, '12'), (12.416666666666666, '23'), (9.7, '09'), (8.785714285714286, '01'), (8.25, '10'), (15.709677419354838, '00')]


Top 5 Hours with the most Show HN Post Comments:
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post


Findings: Based on the data above, Show HN posts created during the 18:00 (6:00 pm) hour receive the most comments on average, about 15.77 per post, closely followed by 00:00 with 15.71.
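
The Ask HN and Show HN analyses follow the same steps, so they could share one function. The sketch below is one possible refactoring, not the notebook's code: `average_comments_by_hour` is a hypothetical name, the sample rows are made up, and the column positions (comments at index 4, creation time at index 6) are taken from the dataset header.

```python
from datetime import datetime


def average_comments_by_hour(posts, date_format="%m/%d/%Y %H:%M"):
    """Map each two-digit hour to the average number of comments per post."""
    counts = {}
    comments = {}
    for row in posts:
        hour = datetime.strptime(row[6], date_format).strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[4])
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Hypothetical rows in the dataset's column order, used only for illustration.
sample_posts = [
    ['1', 'Show HN: A', '', '10', '4', 'user_a', '8/4/2016 11:52'],
    ['2', 'Show HN: B', '', '3', '8', 'user_b', '1/26/2016 11:05'],
    ['3', 'Show HN: C', '', '7', '5', 'user_c', '6/23/2016 22:20'],
]
result = average_comments_by_hour(sample_posts)
print(result)
```

Called with `ask_posts` or `show_posts`, a function like this would produce the per-hour averages for either post type without duplicating the loop.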

## Analysis: Ask HN vs. Show HN posts

- For both post types, the hours with the highest average number of comments fall mostly in the afternoon and evening, but the peak hour is later for Show HN posts (18:00) than for Ask HN posts (15:00).
- Ask HN posts receive more comments per post on average than Show HN posts (about 14.04 versus 10.32).