# Exploring Hacker News Posts

This notebook is a challenge from dataquest.io

Problem description:

The dataset contains information about posts from the Hacker News website. Two popular post types are question posts, whose titles start with "Ask HN", and project posts, whose titles start with "Show HN". We want to know which of the two types receives more comments on average, and at what hour of the day posts of the more popular type receive the most comments on average.

Dataset:

-id: The unique identifier from Hacker News for the post\
-title: The title of the post\
-url: The URL that the post links to, if the post has a URL\
-num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes\
-num_comments: The number of comments that were made on the post\
-author: The username of the person who submitted the post\
-created_at: The date and time at which the post was submitted\

In [36]:
from csv import reader
import datetime as dt

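# Read the whole file; the with-block closes it once the rows are loaded
with open("hacker_news.csv") as f:
    hn = list(reader(f))
headers = hn[0]  # first row holds the column names
hn = hn[1:]      # remaining rows are the posts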
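
The numeric positions used throughout this notebook (`row[1]` for the title, `row[4]` for the comment count, `row[6]` for the creation time) follow the order of the fields in the dataset description. A minimal sketch of a name-to-index lookup, assuming the header row matches that field list; `sample_headers` and `column_index` are illustrative names that are not part of the original notebook:

```python
# Hypothetical header row, assuming the column order given in the dataset description.
sample_headers = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]

# Map each column name to its position so later code can avoid bare index numbers.
column_index = {name: position for position, name in enumerate(sample_headers)}

print(column_index["title"])         # 1
print(column_index["num_comments"])  # 4
print(column_index["created_at"])    # 6
```

With the real data, the same lookup could be built from `headers` instead of the sample list.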

Create lists that store the posts according to their title prefix

In [37]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
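
The prefix check can also be isolated in a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper and the titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, for illustration only.
print(classify_title("Ask HN: How do you learn SQL?"))  # ask
print(classify_title("SHOW HN: My weekend project"))    # show
print(classify_title("A new programming language"))     # other
```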

A function that receives a list of posts and calculates the total number of comments

In [38]:
def count_total_comments(data):
    total = 0
    for row in data:
        total += int(row[4])
    return total
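
The same total can be written with the built-in `sum` and a generator expression. The sketch below assumes, as in the rest of the notebook, that the comment count sits in column 4; the function name and the sample rows are hypothetical.

```python
def count_total_comments_sum(data, comments_column=4):
    """Sum the comment counts; column 4 is num_comments in this dataset."""
    return sum(int(row[comments_column]) for row in data)

# Made-up rows for illustration; only column 4 matters here.
sample_rows = [
    ["1", "Ask HN: a", "", "3", "5", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "1", "7", "bob", "8/4/2016 15:10"],
]

print(count_total_comments_sum(sample_rows))  # 12
```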

Calculate the average number of comments for question posts and project posts

In [39]:
total_ask_comments = count_total_comments(ask_posts)
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Average of comments in ask posts: {0:.2f}".format(avg_ask_comments))

total_show_comments = count_total_comments(show_posts)
avg_show_comments = total_show_comments/len(show_posts)
print("Average of comments in show posts: {0:.2f}".format(avg_show_comments))

Average of comments in ask posts: 14.04
Average of comments in show posts: 10.32


Question posts receive more comments on average (14.04 versus 10.32 per post)

result_list stores the creation time and the number of comments for each question post

In [40]:
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

Two dictionaries are created: one stores the number of posts per hour, the other stores the number of comments per hour

In [41]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(date, "%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
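
The first-seen check on each hour can be avoided with `collections.defaultdict`, which starts every missing key at zero. A minimal sketch on made-up data, assuming the same `"%m/%d/%Y %H:%M"` timestamp format; `sample_results`, `counts`, and `comments` are illustrative names only.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the notebook's timestamp format.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 4],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # number of comments per hour

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. "09".
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))    # {'09': 2, '13': 1}
print(dict(comments))  # {'09': 10, '13': 29}
```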

avg_by_hour is a list of lists in which each row contains an hour and the average number of comments per post for that hour

In [42]:
avg_by_hour = []

# Both dictionaries share the same hour keys, so the total can be looked up directly
for hour, count in counts_by_hour.items():
    avg_by_hour.append([hour, comments_by_hour[hour] / count])

The columns of each row are swapped and appended to swap_avg_by_hour, so that sorting orders the rows by average rather than by hour
<br>
swap_avg_by_hour is sorted in descending order, and the results show that 15:00 is the hour at which question posts receive the most comments on average

In [43]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for row in sorted_swap[:5]:
    print("{0}:00: {1:.2f} average comments per post".format(row[1], row[0]))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
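
Swapping the columns is one way to sort by average. An alternative is to pass a `key` function to `sorted`, which leaves the rows in their original `[hour, average]` layout. The sketch below uses made-up values, not the notebook's results, and `sample_avg_by_hour` and `ranked` are hypothetical names.

```python
# Made-up [hour, average] rows for illustration only.
sample_avg_by_hour = [["09", 5.0], ["13", 29.0], ["21", 12.5]]

# Sort by the average (second column), highest first, without swapping columns.
ranked = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, average in ranked:
    print("{0}:00: {1:.2f} average comments per post".format(hour, average))
```

One difference to keep in mind: with the swapped lists, ties on the average are broken by the hour string, whereas `sorted` with a `key` keeps tied rows in their original order.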
