# Hacker News Comments Analysis

In this project, I take posts from the Hacker News website (https://news.ycombinator.com/) and analyze which types of posts receive the most comments, and how the time of posting affects the number of comments received. The analysis focuses on 2 types of posts:

1) Ask HN - Users ask the Hacker News community a question.
2) Show HN - Users share projects or information that the Hacker News community may find useful.

This is a guided data science project from the DataQuest "Data Science in Python" course.

In [9]:
from csv import reader
import datetime as dt

def str_to_int(str_in):
    # return the integer value of a numeric string, or 0 for anything else
    if str_in.isdigit():
        return int(str_in)
    return 0

# open and read the file, then convert it to a list of lists
# (the "with" block closes the file once it has been read)
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# remove the first row of the data, which
# contains the column names
headers = hn[0]
hn = hn[1:]

ask_posts = []
show_posts = []
other_posts = []

total_ask_comments = 0
total_show_comments = 0

for row in hn:
    title = row[1] 
    num_comments = str_to_int(row[4])
    
    # determine what type of post this is and record the total number of comments for each 
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
        total_ask_comments += num_comments
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
        total_show_comments += num_comments
    else:
        other_posts.append(row)
        
avg_ask_comments = float(total_ask_comments) / len(ask_posts)
avg_show_comments = float(total_show_comments) / len(show_posts)
        
print(avg_ask_comments)
print(avg_show_comments)

14.038417431192661
10.31669535283993


Dividing the total number of comments by the number of posts for each post type (ask/show) shows that "ask" posts receive more comments on average. The average number of comments for each post type is:

Ask HN:  14.038417431192661
Show HN: 10.31669535283993

Note: Appending each row from "hn" to another list (ask_posts/show_posts/other_posts) during each iteration is not strictly necessary to compute the averages. The lists only hold references to the existing rows, so the extra memory is small, and it is done for the sake of the exercise and to separate the first section from the next.
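
As a minimal sketch of the alternative mentioned in the note, the averages could be computed in a single pass without building the intermediate lists. The function name `average_comments` and the sample rows are hypothetical; only the title column (index 1), the comment count column (index 4), and the timestamp column (index 6) are taken from the code above, and the remaining column values are placeholders.

```python
def average_comments(rows, prefix):
    """Return the mean comment count for rows whose title starts with prefix."""
    total = 0
    count = 0
    for row in rows:
        if row[1].lower().startswith(prefix):
            comments = row[4]
            # treat non-numeric comment counts as 0, as str_to_int does
            total += int(comments) if comments.isdigit() else 0
            count += 1
    # avoid dividing by zero when no title matches the prefix
    return total / count if count else 0.0


# hypothetical sample rows (title at index 1, comments at index 4)
sample_rows = [
    ["1", "Ask HN: Best editor?", "", "5", "10", "alice", "8/16/2016 9:55"],
    ["2", "Show HN: My project", "http://example.com", "3", "4", "bob", "8/16/2016 10:05"],
    ["3", "Ask HN: Career advice", "", "2", "6", "carol", "8/17/2016 15:30"],
]

ask_avg = average_comments(sample_rows, "ask hn")
show_avg = average_comments(sample_rows, "show hn")
print(ask_avg, show_avg)
```

This version scans the data once per post type, which trades a little extra iteration for not keeping three additional lists around.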

Next, I will determine whether ask posts receive more comments when they are posted at a certain time of day.
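
Grouping posts by time of day depends on extracting the hour from each post's creation timestamp. The short sketch below shows that step in isolation, using the same `'%m/%d/%Y %H:%M'` format string as the analysis; the sample timestamp is a made-up value, not a row from the dataset.

```python
import datetime as dt

# hypothetical timestamp in the month/day/year hour:minute layout
sample = "8/16/2016 9:55"

# parse the string into a datetime object
created_at = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

# format the hour as a zero-padded, two-digit string
hour = created_at.strftime("%H")
print(hour)
```

Keeping the hour as a zero-padded string means the dictionary keys sort and print consistently, for example "02" rather than "2".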

In [38]:

counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts:
    created_at = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
    hour = created_at.strftime('%H')
    num_comments = str_to_int(row[4])
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        

avg_by_hour = []

# calculate the average number of comments for each hour of the day
for key in counts_by_hour:
    avg = float(comments_by_hour[key]) / counts_by_hour[key]
    avg_by_hour.append([avg, key])
    
# sort the list in descending order of average comments per post
avg_by_hour = sorted(avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
template = "{hour}:00: {avg:2f} average comments per post"
for avg, hour in avg_by_hour[:5]:
    print(template.format(hour=hour, avg=avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.594828 average comments per post
02:00: 23.810345 average comments per post
20:00: 21.525000 average comments per post
16:00: 16.796296 average comments per post
21:00: 16.009174 average comments per post
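
The per-hour bookkeeping can also be sketched with `collections.defaultdict`, which removes the need to check whether an hour has been seen before. This is only an illustrative alternative, not the approach used above: the function name `average_comments_by_hour` and the sample rows are hypothetical, and the column positions (comments at index 4, timestamp at index 6) are assumed to match the dataset used in this project.

```python
from collections import defaultdict
import datetime as dt


def average_comments_by_hour(rows):
    """Return (average comments, hour) pairs sorted from highest average."""
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for row in rows:
        created_at = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
        hour = created_at.strftime("%H")
        counts[hour] += 1
        # treat non-numeric comment counts as 0
        comments[hour] += int(row[4]) if row[4].isdigit() else 0
    averages = [(comments[h] / counts[h], h) for h in counts]
    return sorted(averages, reverse=True)


# hypothetical ask posts (comments at index 4, timestamp at index 6)
sample_ask_posts = [
    ["1", "Ask HN: Best editor?", "", "5", "10", "alice", "8/16/2016 9:55"],
    ["2", "Ask HN: Career advice", "", "2", "6", "carol", "8/17/2016 15:30"],
    ["3", "Ask HN: Book tips?", "", "1", "2", "dave", "8/18/2016 15:05"],
]

result = average_comments_by_hour(sample_ask_posts)
for avg, hour in result:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```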
