# Analyzing Hacker News Posts

In this project, I will analyze a dataset of posts from Hacker News, a popular technology website where users share stories and discuss tech topics. 

The dataset contains about 20,000 posts, each with information like the title, number of comments, author, and submission time. I am particularly interested in two types of posts: "Ask HN" posts where users ask questions to the community, and "Show HN" posts where users share their projects or interesting findings. 

My goal is to find out which type gets more comments and whether the time a post is submitted affects how much discussion it generates.

## Reading and Exploring the Dataset

I will read the CSV file and display the first 5 rows to understand the structure of the data. Then I will check the total number of rows in the dataset.

In [170]:
from csv import reader

# Use a separate name for the file handle so it is not confused with the parsed rows
with open('hacker-news.csv', newline='') as file:
    read_dataset = reader(file)
    dataset = list(read_dataset)

for row in dataset[:5]:
    print(row)

print('\n')

print('Total rows in the dataset:', len(dataset))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Total rows in the dataset: 20101


The file contains 20,101 rows: one header row and 20,100 posts. Each post has 7 columns: `id`, `title`, `url`, `num_points`, `num_comments`, `author`, and `created_at`.
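
As an aside, the columns can also be read by name instead of by position. The sketch below is an alternative to the list-based approach used in this project, not a replacement for it. It uses `csv.DictReader` on two rows copied from the output above, held in an in-memory string so that the example runs without the CSV file.

```python
import csv
import io

# Two rows copied from the output above, standing in for hacker-news.csv
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader consumes the header row and maps every later row to its column names
posts = list(csv.DictReader(sample_csv))

for post in posts:
    # Values are still strings, so numeric columns need an explicit conversion
    print(post["title"], int(post["num_comments"]))
```

With this approach the header never appears among the data rows, and `post["num_comments"]` is easier to read than `row[4]`.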

## Separating the Header Row

I will extract the header row from the dataset and remove it from the data to work with the actual post records separately.


In [171]:
header = dataset[0]
dataset = dataset[1:]

print(header)
print('\n')
for row in dataset[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


## Filtering Posts by Type

I will separate the dataset into three categories based on post titles: "Ask HN" posts, "Show HN" posts, and all other posts. This will allow me to analyze each type separately.

In [172]:
ask_posts = []
show_posts = []
other_posts = []

for row in dataset:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of 'Ask HN' posts:", len(ask_posts))
print("Number of 'Show HN' posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of 'Ask HN' posts: 1744
Number of 'Show HN' posts: 1162
Number of other posts: 17194
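
The prefix check used above can also be wrapped in a small function, which makes the rule easy to test on its own. This is only a sketch: the function name `classify_post` and the sample titles are hypothetical and do not come from the dataset.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the comparison ignores case
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, for illustration only
sample_titles = [
    "Ask HN: How do you learn a new codebase?",
    "SHOW HN: A tiny static site generator",
    "Why I still ask HN for advice",
]

for title in sample_titles:
    print(classify_post(title), "-", title)
```

The third title contains "ask HN" but does not start with it, so it falls into the "other" category, which matches the behavior of `startswith` in the cell above.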


## Calculating Average Comments by Post Type

I will calculate the total and average number of comments for both "Ask HN" and "Show HN" posts to determine which type receives more engagement.

In [173]:
# Ask HN posts

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

print("Total number of comments on 'Ask HN' posts:", total_ask_comments)

avg_ask_comment = total_ask_comments / len(ask_posts)

print("Average number of comments on 'Ask HN' posts:", round(avg_ask_comment, 2))

Total number of comments on 'Ask HN' posts: 24483
Average number of comments on 'Ask HN' posts: 14.04


In [174]:
# Show HN posts

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

print("Total number of comments on 'Show HN' posts:", total_show_comments)

avg_show_comment = total_show_comments / len(show_posts)

print("Average number of comments on 'Show HN' posts:", round(avg_show_comment, 2))

Total number of comments on 'Show HN' posts: 11988
Average number of comments on 'Show HN' posts: 10.32


Ask HN posts receive more comments on average:
- Ask HN: 14.04 comments per post (24,483 total comments across 1,744 posts)
- Show HN: 10.32 comments per post (11,988 total comments across 1,162 posts)

Ask HN posts receive about 36% more comments per post than Show HN posts (14.04 vs. 10.32), likely because questions naturally invite responses and discussion.
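
The two cells above repeat the same loop, so the calculation could be factored into one helper. The sketch below assumes rows shaped like the ones in this dataset, with the comment count stored as a string at index 4. The name `average_comments` is hypothetical, and the sample rows reuse the first five posts shown earlier purely to exercise the function.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid ZeroDivisionError when a category has no posts
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# id, title, url, num_points, num_comments, author, created_at
# (titles and URLs shortened to '...' because only index 4 matters here)
sample_posts = [
    ["12224879", "...", "...", "386", "52", "ne0phyte", "8/4/2016 11:52"],
    ["10975351", "...", "...", "39", "10", "josep2", "1/26/2016 19:30"],
    ["11964716", "...", "...", "2", "1", "vezycash", "6/23/2016 22:20"],
    ["11919867", "...", "...", "3", "1", "hswarna", "6/17/2016 0:01"],
    ["10301696", "...", "...", "8", "2", "walterbell", "9/30/2015 4:12"],
]

print(round(average_comments(sample_posts), 2))
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace the two separate loops.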

## Analyzing Comments by Hour

I will determine which hours receive the most comments on Ask HN posts. First, I will extract the creation time and comment count from each post, then parse the timestamps and group them by hour.

In [175]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[-1]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

In [176]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    created_at = row[0]
    num_comments = row[1]
    created_at = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    hour = created_at.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

Next, I will display the number of posts created during each hour of the day.

In [177]:
# Number of posts by hour

for hour in sorted(counts_by_hour):
    print(f'{hour}:00 - {counts_by_hour[hour]}')

00:00 - 55
01:00 - 60
02:00 - 58
03:00 - 54
04:00 - 47
05:00 - 46
06:00 - 44
07:00 - 34
08:00 - 48
09:00 - 45
10:00 - 59
11:00 - 58
12:00 - 73
13:00 - 85
14:00 - 107
15:00 - 116
16:00 - 108
17:00 - 100
18:00 - 109
19:00 - 110
20:00 - 80
21:00 - 109
22:00 - 71
23:00 - 68


Now I will show the total number of comments received during each hour.

In [178]:
# Number of comments by hour

for hour in sorted(comments_by_hour):
    print(f'{hour}:00 – {comments_by_hour[hour]}')

00:00 – 447
01:00 – 683
02:00 – 1381
03:00 – 421
04:00 – 337
05:00 – 464
06:00 – 397
07:00 – 267
08:00 – 492
09:00 – 251
10:00 – 793
11:00 – 641
12:00 – 687
13:00 – 1253
14:00 – 1416
15:00 – 4477
16:00 – 1814
17:00 – 1146
18:00 – 1439
19:00 – 1188
20:00 – 1722
21:00 – 1745
22:00 – 479
23:00 – 543


The data shows post counts and total comments for each hour of the day in Eastern Time. Hour 15 (3 PM ET) has both the most posts (116) and the most total comments (4,477). 

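Activity is highest from mid-afternoon into the evening (2 PM to 9 PM ET), when every hour except 8 PM sees at least 100 posts. The quietest hours fall in the morning: 7 AM ET has the fewest posts (34) and 9 AM ET has the fewest total comments (251).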
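
The grouping behind these counts can be written more compactly with `collections.defaultdict`, which removes the need for the `if hour in counts_by_hour` branch. The sketch below is one possible variant, not the code that produced the results above. The sample pairs are illustrative values rather than actual Ask HN rows, and the hour is kept as an integer instead of a zero-padded string.

```python
from collections import defaultdict
from datetime import datetime

# (created_at, num_comments) pairs; illustrative values only
sample = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/17/2016 0:01", 1),
    ("9/30/2015 4:12", 2),
    ("8/5/2016 11:05", 6),
]

counts_by_hour = defaultdict(int)    # posts created in each hour
comments_by_hour = defaultdict(int)  # comments received by those posts

for created_at, num_comments in sample:
    # Same timestamp format as the dataset: month/day/year hour:minute
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# The hour with the most posts is the key with the largest count
busiest_hour = max(counts_by_hour, key=counts_by_hour.get)
print(busiest_hour, counts_by_hour[busiest_hour], comments_by_hour[busiest_hour])
```

The same `max(..., key=...)` call on `comments_by_hour` gives the hour with the most total comments.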

## Calculating Average Comments per Hour

I will calculate the average number of comments per Ask HN post for each hour by dividing the total comments by the number of posts created during that hour.

In [180]:
# Average number of comments per post by hour

avg_by_hour = []

for hour_key in comments_by_hour:
    avg = comments_by_hour[hour_key] / counts_by_hour[hour_key]
    avg_by_hour.append((hour_key, avg))

for hour, avg in sorted(avg_by_hour):
    print(f'{hour}:00 – {avg:.2f}')

00:00 – 8.13
01:00 – 11.38
02:00 – 23.81
03:00 – 7.80
04:00 – 7.17
05:00 – 10.09
06:00 – 9.02
07:00 – 7.85
08:00 – 10.25
09:00 – 5.58
10:00 – 13.44
11:00 – 11.05
12:00 – 9.41
13:00 – 14.74
14:00 – 13.23
15:00 – 38.59
16:00 – 16.80
17:00 – 11.46
18:00 – 13.20
19:00 – 10.80
20:00 – 21.52
21:00 – 16.01
22:00 – 6.75
23:00 – 7.99


Hour 15 (3 PM ET) has the highest average, 38.59 comments per post, well ahead of every other hour. Hour 2 (2 AM ET) comes second with 23.81 comments per post, followed by hour 20 (8 PM ET) with 21.52.

The top-performing hours are spread throughout the day rather than concentrated in a specific time period. Hour 9 (9 AM ET) has the lowest average with only 5.58 comments per post.
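
To read the ranking directly instead of scanning the full table, the `(hour, average)` pairs can be sorted by their second element. The sketch below uses six of the averages printed above rather than the full `avg_by_hour` list, so that it runs on its own.

```python
# (hour, average comments per post), copied from the output above
avg_by_hour = [
    ("02", 23.81),
    ("09", 5.58),
    ("15", 38.59),
    ("16", 16.80),
    ("20", 21.52),
    ("21", 16.01),
]

# Sort by the average (second element), highest first, and keep the top five
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print(f"{hour}:00 ET: {avg:.2f} average comments per post")
```

Sorting with `key=lambda pair: pair[1]` avoids swapping the tuple order just to sort by the average.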