# Hacker News Post Data Analysis

This analysis asks whether Ask HN or Show HN posts receive more comments on average. It then looks at when Ask HN posts are created to see whether posts made at certain hours attract more comments.

The data set that we'll be working with contains the following columns:

|Column          |Description                                    |
|----------------|-----------------------------------------------|
|id              |Unique ID for each post                        |
|title           |Title of the post                              |
|url             |URL the post links to                          |
|num_points      |Number of upvotes minus number of downvotes    |
|num_comments    |Number of comments                             |
|author          |Author of the post                             |
|created_at      |Date and time post was created                 |
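
The code cells in this notebook refer to columns by position (for example, `row[4]` for `num_comments`). As a small sketch, assuming the CSV columns appear in the order listed in the table, a name-to-index lookup makes those positions explicit. The names `COLUMNS`, `COLUMN_INDEX`, and `col` are hypothetical helpers for illustration and are not part of the original notebook.

```python
# Column names in the order given in the table (assumed to match the CSV).
COLUMNS = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Map each column name to its position in a row.
COLUMN_INDEX = {name: position for position, name in enumerate(COLUMNS)}


def col(row, name):
    """Return the value of the named column from a row (a list of strings)."""
    return row[COLUMN_INDEX[name]]


# Sample row copied from the notebook's own printed output.
sample = ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '',
          '1', '3', 'sph130', '8/2/2016 14:20']

print(col(sample, "num_comments"))  # 3
print(col(sample, "created_at"))    # 8/2/2016 14:20
```

Values stay as strings, just as the `csv` reader returns them, so numeric columns still need `int(...)` before any arithmetic.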

In [3]:
# open the csv file and read every row into a list of lists

from csv import reader

# the with-block closes the file once the rows have been read
with open('hacker_news.csv', newline='') as opened_file:
    hacker_news = list(reader(opened_file))

# print(hacker_news[:6])

In [None]:
# remove the header row of the data set
headers = hacker_news[0]
hacker_news = hacker_news[1:]

print(headers)
print(hacker_news[:5])

In [5]:
# split the posts into 'Show HN', 'Ask HN', and all other posts based on the title prefix

show_hn = []
ask_hn = []
other_posts = []

for row in hacker_news:
    title = row[1].lower()
    if title.startswith('show hn'):
        show_hn.append(row)
    elif title.startswith('ask hn'):
        ask_hn.append(row)
    else:
        other_posts.append(row)

In [6]:
# display the number of rows in each data set
print(len(show_hn))
print(len(ask_hn))
print(len(other_posts))

1162
1744
17194


In [7]:
# display the average number of comments for each data set
total_comments_show_hn = 0
total_comments_ask_hn = 0

for post in show_hn:
    num_comments = int(post[4])
    total_comments_show_hn += num_comments
    
for record in ask_hn:
    num_comms = int(record[4])
    total_comments_ask_hn += num_comms
    
avg_num_comments_show_hn = total_comments_show_hn / len(show_hn)
avg_num_comments_ask_hn = total_comments_ask_hn / len(ask_hn)

print(avg_num_comments_show_hn)
print(avg_num_comments_ask_hn)

10.31669535283993
14.038417431192661


# Initial findings

The results above show that Show HN posts average about 10.32 comments, whereas Ask HN posts average about 14.04. Ask HN posts therefore attract more comments per post on average than Show HN posts.
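
The two averaging loops in the cell above repeat the same steps, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `average_comments` is a hypothetical name, it assumes `num_comments` sits at index 4 as in the rows shown in this notebook, and the demo rows are made up for illustration.

```python
def average_comments(rows, comments_index=4):
    """Return the mean comment count for rows whose counts are stored as strings."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty subset
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Made-up rows for illustration only; they are not taken from hacker_news.csv.
demo_rows = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example B', '', '2', '4', 'user_b', '1/1/2016 10:00'],
]

print(average_comments(demo_rows))  # 7.0
```

With the notebook's lists, the same call would be `average_comments(show_hn)` and `average_comments(ask_hn)`.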

In [10]:
print(ask_hn[3])

['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


In [16]:
# tally the number of Ask HN posts and their total comments for each hour of creation
import datetime as dt

result_list = []

for each_row in ask_hn:
    created_at = each_row[6]
    num_com = int(each_row[4])
    a_list = [created_at, num_com]
    result_list.append(a_list)
    
counts_by_hour = {}
comments_by_hour = {}

for each_record in result_list:
    dt_created = each_record[0]
    n_com = each_record[1]
    date_created_dt = dt.datetime.strptime(dt_created, '%m/%d/%Y %H:%M')
    hour = date_created_dt.hour
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
    else:
        counts_by_hour[hour] = 1
        
    if hour in comments_by_hour:
        comments_by_hour[hour] += n_com
    else:
        comments_by_hour[hour] = n_com
        

In [17]:
# display dictionaries

print(comments_by_hour)
print(counts_by_hour)

{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}
{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}


In [18]:
# calculate the average number of comments per post for each hour

avg_comms = []

for i in comments_by_hour:
    avg_comts = comments_by_hour[i] / counts_by_hour[i]
    avg_comms.append([i, avg_comts])
    
print(avg_comms)

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]


In [21]:
# swap each pair to [average, hour] so that sorting orders by average, then print the top 5 hours

list_reformatted = []

for rec in avg_comms:
    hour = rec[0]
    avg = rec[1]
    list_reformatted.append([avg, hour])
    
list_sorted = sorted(list_reformatted, reverse=True)

print("Top 5 Hours for Ask Posts Comments:")

for line in list_sorted[:5]:
    hr = dt.datetime.strptime(str(line[1]), '%H')
    hr_time_obj = hr.time()
    avrg = line[0]
        
    message = "{hour}: {av:.2f} number of comments per post on average".format(hour=hr_time_obj, av=avrg)
    print(message)

Top 5 Hours for Ask Posts Comments:
15:00:00: 38.59 number of comments per post on average
02:00:00: 23.81 number of comments per post on average
20:00:00: 21.52 number of comments per post on average
16:00:00: 16.80 number of comments per post on average
21:00:00: 16.01 number of comments per post on average


# Results

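The output above shows that Ask HN posts created during the 3 pm hour (15:00) receive the most comments on average, at 38.59 per post. The 2 am hour (02:00) ranks second with 23.81 comments per post. These figures are averages over the posts in this data set, so they indicate when a post is more likely to attract comments rather than guaranteeing it.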
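
The hourly aggregation used in this analysis can also be written more compactly. The sketch below is one possible variant under stated assumptions, not the notebook's original code: it uses `collections.defaultdict` in place of the explicit `if hour in ...` checks, the function name `average_comments_by_hour` is hypothetical, and the demo rows are made up. It assumes `created_at` is at index 6 in the `'%m/%d/%Y %H:%M'` format and `num_comments` is at index 4.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_index=6, comments_index=4):
    """Return {hour: average comments per post} for the given rows."""
    totals = defaultdict(int)  # hour -> summed comment count
    counts = defaultdict(int)  # hour -> number of posts
    for row in rows:
        created = dt.datetime.strptime(row[date_index], '%m/%d/%Y %H:%M')
        totals[created.hour] += int(row[comments_index])
        counts[created.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


# Made-up rows for illustration only; they are not taken from hacker_news.csv.
demo_rows = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '8/2/2016 14:20'],
    ['2', 'Ask HN: example B', '', '2', '4', 'user_b', '8/3/2016 14:45'],
    ['3', 'Ask HN: example C', '', '1', '3', 'user_c', '8/3/2016 9:05'],
]

averages = average_comments_by_hour(demo_rows)

# Sort by average (highest first) and print each hour with its average.
for hour, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{hour:02d}:00 {avg:.2f}")
```

Sorting with `key=lambda item: item[1]` orders the pairs by average directly, which removes the need to swap each pair to `[average, hour]` before sorting.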