# Analyzing Hacker News Posts

In this project, I will analyze a dataset of posts from Hacker News, a popular technology website where users share stories and discuss tech topics. 

The dataset contains about 20,000 posts, each with information like the title, number of comments, author, and submission time. I am particularly interested in two types of posts: "Ask HN" posts where users ask questions to the community, and "Show HN" posts where users share their projects or interesting findings. 

My goal is to find out which type gets more comments and whether the time a post is submitted affects how much discussion it generates.

## Reading and Exploring the Dataset

I will read the CSV file and display the first 5 rows to understand the structure of the data. Then I will check the total number of rows in the dataset.

In [44]:
from csv import reader

# Read the whole CSV into a list of rows; each row is a list of strings.
with open('hacker-news.csv') as file:
    dataset = list(reader(file))

for row in dataset[:5]:
    print(row)

print('\n')

print('Total rows in the dataset:', len(dataset))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Total rows in the dataset: 20101


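The file contains 20,101 rows: one header row followed by 20,100 posts. Each post has 7 columns: `id`, `title`, `url`, `num_points`, `num_comments`, `author`, and `created_at`. All values are read as strings, so numeric columns such as `num_comments` must be converted before doing arithmetic.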
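
The statement that every row has the same seven columns can be checked directly. The sketch below runs that check on a small in-memory sample (the header and two rows copied from the output above) rather than on the full file; `sample_csv` and the helper `rows_with_wrong_width` are hypothetical names introduced only for illustration.

```python
from csv import reader
from io import StringIO

# A tiny in-memory stand-in for the CSV file: the header plus two rows from the output above.
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

def rows_with_wrong_width(rows):
    """Return the data rows whose column count differs from the header's."""
    expected = len(rows[0])
    return [row for row in rows[1:] if len(row) != expected]

sample_rows = list(reader(sample_csv))
bad_rows = rows_with_wrong_width(sample_rows)

print('Columns in header:', len(sample_rows[0]))
print('Rows with unexpected width:', len(bad_rows))
```

Running the same helper on the full list of rows would flag any malformed lines before the analysis relies on fixed column positions such as `row[1]` or `row[4]`.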

## Separating the Header Row

I will extract the header row from the dataset and remove it from the data to work with the actual post records separately.


In [45]:
header = dataset[0]
dataset = dataset[1:]

print(header)
print('\n')
for row in dataset[:5]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
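
An alternative is to pull the header off while reading, so the list of posts never contains it in the first place. This is a sketch on an in-memory sample, not the file used above; `sample_csv`, `sample_header`, and `sample_posts` are illustrative names.

```python
from csv import reader
from io import StringIO

# In-memory sample: the header plus one row from the output above.
sample_csv = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = reader(sample_csv)
sample_header = next(rows)  # consume the first row: the column names
sample_posts = list(rows)   # everything that remains is post data

print(sample_header)
print(len(sample_posts))
```

Because the header is consumed exactly once by `next`, re-running a later cell cannot accidentally drop a real post the way re-running a `dataset[1:]` slice can.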


## Filtering Posts by Type

I will separate the dataset into three categories based on post titles: "Ask HN" posts, "Show HN" posts, and all other posts. This will allow me to analyze each type separately.

In [49]:
ask_posts = []
show_posts = []
other_posts = []

for row in dataset:
    # Lowercase the title once so the prefix match is case-insensitive.
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of 'Ask HN' posts:", len(ask_posts))
print("Number of 'Show HN' posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of 'Ask HN' posts: 1744
Number of 'Show HN' posts: 1162
Number of other posts: 17194
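
The same rule can be wrapped in a small function, which makes the case-insensitive prefix match easy to test on its own. `classify_post` is a hypothetical helper, not part of the analysis above, and the first two titles in the usage lines are made up for illustration.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_post('Ask HN: How do you learn a new language?'))
print(classify_post('SHOW HN: My weekend project'))
print(classify_post('Interactive Dynamic Video'))
```

Note that only the start of the title is examined, so a title that merely mentions "Ask HN" somewhere in the middle falls into the "other" group.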


## Calculating Average Comments by Post Type

I will calculate the total and average number of comments for both "Ask HN" and "Show HN" posts to determine which type receives more engagement.

In [50]:
# Ask HN posts

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

print("Total number of comments on 'Ask HN' posts:", total_ask_comments)

# True division keeps the fractional part of the average.
avg_ask_comment = total_ask_comments / len(ask_posts)

print("Average number of comments on 'Ask HN' posts:", round(avg_ask_comment, 2))

Total number of comments on 'Ask HN' posts: 24483
Average number of comments on 'Ask HN' posts: 14.04


In [51]:
# Show HN posts

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

print("Total number of comments on 'Show HN' posts:", total_show_comments)

# True division keeps the fractional part of the average.
avg_show_comment = total_show_comments / len(show_posts)

print("Average number of comments on 'Show HN' posts:", round(avg_show_comment, 2))

Total number of comments on 'Show HN' posts: 11988
Average number of comments on 'Show HN' posts: 10.32


Ask HN posts receive more comments on average:
- Ask HN: about 14.04 comments per post (24,483 comments across 1,744 posts)
- Show HN: about 10.32 comments per post (11,988 comments across 1,162 posts)

Ask HN posts generate roughly 36% more comments per post than Show HN posts (14.04 vs. 10.32), likely because a question directly invites responses and discussion.
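
As a check on this comparison, the averages and their relative difference can be recomputed from the totals printed above, using unrounded values throughout. The variable names here are illustrative and separate from the ones used in the cells above.

```python
# Totals and post counts taken from the output above.
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

ask_avg = ask_total / ask_count
show_avg = show_total / show_count

# Relative difference: how much larger the Ask HN average is, as a percentage of the Show HN average.
relative_diff = (ask_avg - show_avg) / show_avg * 100

print(f"Ask HN average:  {ask_avg:.2f}")
print(f"Show HN average: {show_avg:.2f}")
print(f"Ask HN posts get about {relative_diff:.0f}% more comments per post")
```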