# Analysis of Ask HN and Show HN Posts on Hacker News

## Introduction

Hacker News is a popular platform within the technology and startup 
communities where users submit posts to share news, ask questions, or showcase 
projects. This project involves analyzing a dataset of Hacker News posts to 
answer two primary questions:

1. Do "Ask HN" or "Show HN" posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

The dataset provided has been filtered to include only posts that received 
comments, reducing the original dataset from nearly 300,000 rows to 
approximately 20,000 rows. The columns in the dataset include:


 - `id`: The unique identifier for each post.

 - `title`: The title of the post.

 - `url`: The URL the post links to, if applicable.

 - `num_points`: The number of points the post acquired, calculated as upvotes minus downvotes.

 - `num_comments`: The number of comments on the post.

 - `author`: The username of the person who submitted the post.

 - `created_at`: The date and time of the post's submission.

We will use Python to analyze this data, focusing on string manipulation, 
object-oriented programming, and date and time operations to gain insights 
into user engagement on Hacker News.

## Importing Libraries and Reading the Dataset

Let's start by importing the necessary libraries and reading the dataset into 
a list of lists.

In [13]:
import csv
from datetime import datetime as dt

In [14]:
# File path for the dataset
hacker_news_csv = "hacker_news.csv"

In [15]:
# extract_csv() function: extracts the data from a csv file
# file_name: string with the name of the file
# header: boolean parameter with True as default argument
# return: the data and the header or just the data if header is False as a list
def extract_csv(file_name, header=True):
    if file_name is None:
        print("Error: no file name provided for extract_csv function")
        return None
    try:
        # The with block closes the file even if reading fails
        with open(file_name, encoding="utf-8", newline="") as csv_file:
            data = list(csv.reader(csv_file))
    except FileNotFoundError:
        print(f"Error: {file_name} not found")
        return None
    except Exception as e:
        print(f"Error: {e}")
        return None

    if header:
        # Separate the header row from the data rows
        return data[1:], data[0]
    return data

In [16]:
# Extracting the dataset and its header
hn, headers = extract_csv(hacker_news_csv, True)
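
As a minimal sketch of what `csv.reader` yields, the snippet below parses a small in-memory sample instead of `hacker_news.csv`. The sample text and the names `sample_csv`, `rows`, `sample_header`, and `sample_data` are illustrative assumptions, not part of the project. Every field comes back as a string, which is why numeric columns such as `num_comments` need an explicit `int()` conversion later.

```python
import csv
import io

# Hypothetical sample standing in for the real file
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    '1,"Ask HN: Example, with a comma",,5,3,alice,8/4/2016 11:52\n'
)

# csv.reader accepts any iterable of lines, including an in-memory buffer
rows = list(csv.reader(io.StringIO(sample_csv)))

# Split the header row from the data rows
sample_header, sample_data = rows[0], rows[1:]

print(sample_header)
print(sample_data[0])
print(type(sample_data[0][4]))  # fields are strings, not integers
```

Note that the quoted title keeps its embedded comma as part of a single field, which a plain `str.split(",")` would not handle.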

In [17]:
# print_dataset_slice() function: allows us to explore the rows and columns of a 
# dataset
# dataset: list of lists
# start and end: integers that slice the dataset
# rows_and_columns: boolean parameter with False as default argument
# return: nothing, just prints the number of rows and columns and slices the 
# dataset
def print_dataset_slice(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")

    if rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [18]:
# print_separator() function: prints a separator
# return: nothing, just prints a separator
def print_separator():
    print("\n")
    print("----------------------------------------\n")
    print("\n")

In [19]:
# print the header and the first 5 rows of the dataset to check if the 
# extraction was successful
print(headers)
print_dataset_slice(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows:  20100
Number of columns:  7

## Extracting Ask HN and Show HN Posts

We will now extract the "Ask HN" and "Show HN" posts from the dataset to analyze them separately. We will also count the number of posts in each category.


In [36]:
# extract_post_types() function: splits the dataset into "Ask HN" posts, 
# "Show HN" posts, and other posts
# hn: list of lists with the dataset
# return: the "Ask HN" posts, "Show HN" posts, and other posts as lists of lists
def extract_post_types(hn):
    ask_posts = []
    show_posts = []
    other_posts = []

    for post in hn:
        title = post[1].lower()
        if title.startswith("ask hn"):
            ask_posts.append(post)
        elif title.startswith("show hn"):
            show_posts.append(post)
        else:
            other_posts.append(post)

    return ask_posts, show_posts, other_posts

# Lists to store the "Ask HN" and "Show HN" posts and other posts
ask_posts, show_posts, other_posts = extract_post_types(hn)

In [37]:
# Check the number of "Ask HN" and "Show HN" posts and other posts
print("Number of Ask HN posts:", len(ask_posts))
print("Number of Show HN posts:", len(show_posts))
print("Number of other posts:", len(other_posts))
print_separator()

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194


----------------------------------------
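
The prefix test behind this split can be checked on a few made-up titles. The sketch below is illustrative only: the titles and the helper name `classify_title` are assumptions, not part of the dataset or the project code. It shows that lowercasing before `str.startswith` makes the match case-insensitive, and that a title that merely mentions "Ask HN" later in the text is not counted.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles chosen to exercise each branch
sample_titles = [
    "Ask HN: How do you learn a new codebase?",
    "SHOW HN: My weekend project",
    "Why I stopped reading Ask HN threads",
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```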





## Calculating the Average Number of Comments for Ask HN and Show HN Posts

We will now calculate the average number of comments for "Ask HN" and "Show HN"
posts to determine which type of post receives more comments on average.

In [38]:
# calculate_average_comments() function: calculates the average number of comments
# for a given list of posts
# posts: list of posts
# return: the average number of comments
def calculate_average_comments(posts):
    total_comments = 0
    for post in posts:
        total_comments += int(post[4])
    return total_comments / len(posts)

# Calculate the average number of comments for "Ask HN" and "Show HN" posts
avg_ask_comments = calculate_average_comments(ask_posts)
print("Average number of comments for Ask HN posts:", avg_ask_comments)

avg_show_comments = calculate_average_comments(show_posts)
print("Average number of comments for Show HN posts:", avg_show_comments)
print_separator()

Average number of comments for Ask HN posts: 14.038417431192661
Average number of comments for Show HN posts: 10.31669535283993


----------------------------------------
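
For comparison, the same average can be written with the built-in `sum()` and a guard for an empty list, which the loop version would hit as a `ZeroDivisionError`. This is a sketch under assumptions: the helper name `mean_comments` and the three sample rows are invented for illustration, and returning `0.0` for an empty list is one possible convention rather than the project's behavior.

```python
def mean_comments(posts, column=4):
    """Average the integer values in `column`; return 0.0 for no posts."""
    if not posts:
        return 0.0
    return sum(int(post[column]) for post in posts) / len(posts)

# Hypothetical rows with the same seven-column layout as the dataset
sample_posts = [
    ["1", "Ask HN: A", "", "5", "6", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "2", "1", "bob", "8/4/2016 12:10"],
    ["3", "Ask HN: C", "", "9", "2", "carol", "8/5/2016 9:30"],
]

print(mean_comments(sample_posts))  # (6 + 1 + 2) / 3
print(mean_comments([]))
```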





## (1) Do show posts or ask posts receive more comments on average?
The average number of comments for "Ask HN" posts is approximately 14.04, while
the average number of comments for "Show HN" posts is approximately 10.32.
Therefore, "Ask HN" posts receive more comments on average compared to "Show HN"
posts.

## Analyzing the Number of Comments by Hour

Next, we will analyze the number of comments for "Ask HN" posts by hour to
determine if posts created at a certain time receive more comments on average.
We will follow these steps:

1. Calculate the number of "Ask HN" posts created in each hour of the day, along
with the number of comments received.
2. Calculate the average number of comments "Ask HN" posts receive by hour created.
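
The two steps above can be sketched on a small sample as follows. This is an illustration under assumptions, not the project's implementation: the sample pairs and the names `sample_results`, `totals`, and `averages` are hypothetical, and a `defaultdict` is used here to hold both running totals per hour.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (created_at, num_comments) pairs
sample_results = [
    ("8/4/2016 11:52", 4),
    ("8/5/2016 11:05", 2),
    ("8/5/2016 15:30", 10),
]

# Step 1: per hour, accumulate [number of posts, total comments]
totals = defaultdict(lambda: [0, 0])
for created_at, num_comments in sample_results:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    totals[hour][0] += 1
    totals[hour][1] += num_comments

# Step 2: divide total comments by number of posts for each hour
averages = {hour: total / count for hour, (count, total) in totals.items()}
print(averages)
```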


In [28]:
# extract_results_list() function: extracts the created_at and num_comments
# columns from the "Ask HN" posts
# posts: list of lists with the "Ask HN" posts
# return: a list of lists with the created_at and num_comments columns
def extract_results_list(posts):
    results_list = []
    
    for post in posts:
        created_at = post[6]
        num_comments = int(post[4])
        results_list.append([created_at, num_comments])
    return (results_list)

# List to store the created_at and num_comments columns from the "Ask HN" posts
results_list = extract_results_list(ask_posts)

In [40]:
# parse_comments_by_hour() function: parses the date and time and calculates the
# number of posts and comments by hour
# results_list: list of lists with the created_at and num_comments columns
# return: dictionaries with the number of posts and comments by hour
def parse_comments_by_hour(results_list):
    counts_by_hour = {}
    comments_by_hour = {}

    for result in results_list:
        date_str = result[0]
        num_comments = result[1]
        date_dt = dt.strptime(date_str, "%m/%d/%Y %H:%M")
        hour = date_dt.strftime("%H")
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = num_comments
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += num_comments

    return (counts_by_hour, comments_by_hour)

# Creating dictionaries to store the number of posts and comments by hour
counts_by_hour, comments_by_hour = parse_comments_by_hour(results_list)
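
The format string passed to `strptime` does most of the work in this step, so it may help to look at it in isolation. The sketch below parses one `created_at` value from the sample rows printed earlier; the names `CREATED_AT_FORMAT`, `parsed`, `hour_as_int`, and `hour_as_str` are illustrative. It also shows why the dictionary keys look like `'09'` rather than `9`: `strftime("%H")` returns a zero-padded string, while the `.hour` attribute is an integer.

```python
from datetime import datetime

# Timestamp format used in the created_at column:
# %m month, %d day, %Y four-digit year, %H hour (24-hour clock), %M minute
CREATED_AT_FORMAT = "%m/%d/%Y %H:%M"

parsed = datetime.strptime("9/30/2015 4:12", CREATED_AT_FORMAT)

hour_as_int = parsed.hour            # integer hour
hour_as_str = parsed.strftime("%H")  # zero-padded two-character string

print(parsed)
print(hour_as_int, hour_as_str)
```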

In [42]:
# average_comments_per_post_by_hour() function: calculates the average
# number of comments per post by hour
# counts_by_hour: dictionary with the number of posts by hour
# comments_by_hour: dictionary with the number of comments by hour
# return: a list of lists with the average number of comments per post by hour
def average_comments_per_post_by_hour(counts_by_hour, comments_by_hour):
    avg_by_hour = []
    
    for hour in counts_by_hour:
        avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
    return (avg_by_hour)

# Calculating the average number of comments per post by hour
avg_by_hour = average_comments_per_post_by_hour(counts_by_hour, comments_by_hour)

print("Average number of comments per post by hour:\n")
print_dataset_slice(avg_by_hour, 0, len(avg_by_hour), False)
print_separator()

Average number of comments per post by hour:

['09', 5.5777777777777775]


['13', 14.741176470588234]


['10', 13.440677966101696]


['14', 13.233644859813085]


['16', 16.796296296296298]


['23', 7.985294117647059]


['12', 9.41095890410959]


['17', 11.46]


['15', 38.5948275862069]


['21', 16.009174311926607]


['20', 21.525]


['02', 23.810344827586206]


['18', 13.20183486238532]


['03', 7.796296296296297]


['05', 10.08695652173913]


['19', 10.8]


['01', 11.383333333333333]


['22', 6.746478873239437]


['08', 10.25]


['04', 7.170212765957447]


['00', 8.127272727272727]


['06', 9.022727272727273]


['07', 7.852941176470588]


['11', 11.051724137931034]




----------------------------------------





In [44]:
# swap_columns() function: swaps the columns in a list of lists
# data: list of lists
# return: a new list of lists with the columns swapped
def swap_columns(data):
    swapped_data = []
    for row in data:
        swapped_data.append([row[1], row[0]])
    return swapped_data

# Swap the columns in the avg_by_hour list of lists
swap_avg_by_hour = swap_columns(avg_by_hour)

print("Swapped columns in the avg_by_hour list of lists:\n")
print_dataset_slice(swap_avg_by_hour, 0, len(swap_avg_by_hour), False)
print_separator()

Swapped columns in the avg_by_hour list of lists:

[5.5777777777777775, '09']


[14.741176470588234, '13']


[13.440677966101696, '10']


[13.233644859813085, '14']


[16.796296296296298, '16']


[7.985294117647059, '23']


[9.41095890410959, '12']


[11.46, '17']


[38.5948275862069, '15']


[16.009174311926607, '21']


[21.525, '20']


[23.810344827586206, '02']


[13.20183486238532, '18']


[7.796296296296297, '03']


[10.08695652173913, '05']


[10.8, '19']


[11.383333333333333, '01']


[6.746478873239437, '22']


[10.25, '08']


[7.170212765957447, '04']


[8.127272727272727, '00']


[9.022727272727273, '06']


[7.852941176470588, '07']


[11.051724137931034, '11']




----------------------------------------





In [45]:
# Sort the swap_avg_by_hour list of lists in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments:\n")
for avg, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.strptime(hour, "%H").strftime("%H:%M"), avg
        )
    )
print_separator()

Top 5 Hours for Ask Posts Comments:

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


----------------------------------------
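
Swapping the columns is one way to sort by the average. An alternative sketch, assuming the same `[hour, average]` row layout, passes a `key` function to `sorted()` so the original column order is kept. The three rows below are copied from the results above; the names `sample_avg_by_hour` and `top_hours` are illustrative.

```python
# A few [hour, average] rows taken from the results above
sample_avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort by the second column (the average), largest first
top_hours = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```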





## Conclusion

In this project, we analyzed a dataset of Hacker News posts to determine which post type and which posting time receive more comments on average. We found that "Ask HN" posts receive more comments on average than "Show HN" posts (about 14.04 versus 10.32). Additionally, we found that "Ask HN" posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm EST) receive the most comments on average, about 38.59 per post.
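
To read the top hour in another time zone, the conversion can be sketched with fixed UTC offsets. This assumes, as the conclusion does, that the timestamps are in EST (UTC-5) and ignores daylight saving time. The date is an arbitrary placeholder, the names `EST`, `top_hour_est`, and `top_hour_utc` are illustrative, and a fuller analysis might prefer the standard-library `zoneinfo` module for named zones.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dataset timestamps are EST, a fixed UTC-5 offset
EST = timezone(timedelta(hours=-5), "EST")

# Arbitrary placeholder date; only the hour matters here
top_hour_est = datetime(2016, 1, 1, 15, 0, tzinfo=EST)

# Convert the same instant to UTC
top_hour_utc = top_hour_est.astimezone(timezone.utc)

print(top_hour_est.strftime("%H:%M %Z"))
print(top_hour_utc.strftime("%H:%M %Z"))
```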