# <p align='center'><strong>Hacker News Posts</strong></p>

### About

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

This analysis focuses on two types of posts:
<ol>    
    <li>Ask HN posts: posts that ask the Hacker News community a specific question</li>
    <li>Show HN posts: posts that show the community a project, a product, or just something interesting</li>
</ol>

### Analysis

This project analyzes this [dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts), focusing on posts whose titles begin with 'Ask HN' or 'Show HN'. We will compare the average number of comments each type receives, and how the number of comments varies with the hour of submission, to determine which posts attract the most engagement.

Specifically, we will look to answer these questions:

<ul>
  <li>Do Ask HN or Show HN receive more comments on average?</li>
  <li>Do posts created at a certain time receive more comments on average?</li>
</ul>

Below are the details of the dataset:


<table>
  <tbody style="font-size:15px">
  <tr>
    <th>Column</th>
    <th>Description</th>
  </tr>
  <tr>
    <th>id</th>
    <td>the unique identifier from Hacker News for the post</td> 
  </tr>
  <tr>
    <th>title</th>
    <td>the title of the post</td>
  </tr>
  <tr>  
    <th>url</th>
    <td>the URL that the post links to, if the post has a URL</td>
  </tr>
  <tr>
    <th>num_points</th>
    <td>the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes</td>
  </tr>
  <tr>
    <th>num_comments</th>
    <td>the number of comments on the post</td>
  </tr>
  <tr>
    <th>author</th>
    <td>the username of the person who submitted the post</td>
  </tr>
  <tr>
    <th>created_at</th>
    <td>the date and time of the post's submission</td>
  </tr>
  </tbody>
</table>

#### Import necessary libraries

In [1]:
"""
This code sets up the necessary imports for a Jupyter Notebook project that works with Hacker News posts.

The `reader` function from the `csv` module is imported to read CSV data.
The `PrettyTable` class from the `prettytable` module is imported to create formatted tables.
The `datetime` module is imported and aliased as `dt` for working with date and time data.

The `%load_ext jupyter_black` line is a Jupyter Notebook magic command that loads the `jupyter_black` extension, which formats the Python code in the notebook's cells using the Black code formatter.
"""

from csv import reader
from prettytable import PrettyTable
import datetime as dt

%load_ext jupyter_black

#### Load the data

The following steps are performed whilst loading the data:
<ol>
    <li>Load the data from the CSV file</li>
    <li>Separate the column headers from the data rows</li>
</ol>

In [2]:
"""
The data is loaded using a context manager to ensure the file is properly closed after use.
"""

# Load the data
with open(file="hacker_news.csv", mode="r") as file:
    hn = list(reader(file))

# View the first 5 rows
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [3]:
# Separate the headers from the data
headers, hn = hn[0], hn[1:]

In [4]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
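
As a minimal sketch of the same header/data split, the header row can also be taken with `next()` before the remaining rows are read, which avoids slicing the full list afterwards. The in-memory `sample_csv` below is a stand-in for `hacker_news.csv` built from one row of the output above; the names `sample_csv`, `sample_headers`, and `sample_data` are hypothetical and not part of the notebook.

```python
import csv
import io

# One row copied from the output above stands in for hacker_news.csv.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

rows = csv.reader(sample_csv)
sample_headers = next(rows)  # the first row holds the column names
sample_data = list(rows)  # every remaining row is a post

print(sample_headers)
print(sample_data[0][1], sample_data[0][4])
```

Every field is read as a string, which is why the comment counts are converted with `int()` later in the analysis.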

#### Extracting Ask HN and Show HN Posts

Since our focus is on 'Ask HN' & 'Show HN' posts, we will split the data into separate lists using a case-insensitive search to make analysis much easier.


In [5]:
ask_posts, show_posts, other_posts = [], [], []

for row in hn:
    title: str = row[1].lower()

    if title.startswith("ask hn"):  # title was lowercased above, so the match is case-insensitive
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# View the number of posts in each list
num_posts_per_type = PrettyTable(
    field_names=[
        "Type of Posts",
        "Number of Posts",
    ]
)
# Add each post type and the number of posts to the table
num_posts_per_type.add_rows(
    [
        ["Ask", len(ask_posts)],
        ["Show", len(show_posts)],
        ["Other", len(other_posts)],
        ["Total", len(hn)],
    ]
)

num_posts_per_type

Type of Posts,Number of Posts
Ask,1744
Show,1162
Other,17194
Total,20100
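
The split above relies on lowercasing each title before calling `startswith`, since `str.startswith` itself is case-sensitive. A small sketch of that rule as a standalone function follows; `classify_title` and the first two example titles are hypothetical and only illustrate the matching logic.

```python
def classify_title(title: str) -> str:
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # normalise case so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are made up; the third comes from the dataset preview.
examples = [
    "Ask HN: How do you learn SQL?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]

for example in examples:
    print(f"{classify_title(example):5} | {example}")
```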


#### Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now that the 'Ask HN' & 'Show HN' posts have been separated, we can calculate and compare their respective averages.

In [6]:
def calculate_total_comments(posts: list[list[str]], index: int) -> int:
    """Calculate the total number of comments from the provided posts.

    Args:
      posts: The list of posts.
      index: The index of the field containing the number of comments.

    Returns:
      The total number of comments.
    """
    return sum([int(post[index]) for post in posts])


def calculate_avg_comments(posts: list[list[str]], index: int) -> float:
    """Calculate the average number of comments for the posts.

    Args:
      posts: The list of posts.
      index: The index of the field containing the number of comments.

    Returns:
      The average number of comments, rounded to 4 decimal places.
    """
    return round(calculate_total_comments(posts, index) / len(posts), 4)

In [7]:
"""
Calculates the total and average number of comments for "Ask" and "Show" posts on Hacker News.

The `calculate_total_comments` and `calculate_avg_comments` functions are used to compute the total and average number of comments for the given list of posts and the specified index of the comment count.

The results are then displayed in a PrettyTable with the following columns:
- Type of Posts
- Total Comments
- Average Comments
"""

total_ask_posts_comments: int = calculate_total_comments(ask_posts, 4)
avg_ask_posts_comments: float = calculate_avg_comments(ask_posts, 4)
total_show_posts_comments: int = calculate_total_comments(show_posts, 4)
avg_show_posts_comments: float = calculate_avg_comments(show_posts, 4)

# View the total posts and averages
posts_per_type = PrettyTable(
    field_names=["Type of Posts", "Total Comments", "Average Comments"]
)
# Add each post type and their information
posts_per_type.add_rows(
    [
        ["Ask", total_ask_posts_comments, avg_ask_posts_comments],
        ["Show", total_show_posts_comments, avg_show_posts_comments],
    ]
)

posts_per_type

Type of Posts,Total Comments,Average Comments
Ask,24483,14.0384
Show,11988,10.3167


As 'Ask HN' posts have more comments on average, we will shift our focus towards those posts.

In [8]:
# Calculate the percentage difference between the comments
pct_diff: float = (
    100
    * abs(avg_ask_posts_comments - avg_show_posts_comments)
    / avg_show_posts_comments
)

print(f"On average, Ask HN posts get {pct_diff:.2f}% more comments than Show HN posts")

On average, Ask HN posts get 36.07% more comments than Show HN posts
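
The percentage above is a relative difference that uses the Show HN average as the baseline. As a quick check of the arithmetic, the sketch below recomputes it from the two averages in the earlier table; the helper name `pct_more` is hypothetical.

```python
def pct_more(value: float, baseline: float) -> float:
    """Return how much larger `value` is than `baseline`, as a percentage."""
    return 100 * (value - baseline) / baseline


# Averages taken from the table above: Ask HN 14.0384, Show HN 10.3167.
result = pct_more(14.0384, 10.3167)
print(f"{result:.2f}%")  # 36.07%
```

Using the Ask HN average as the baseline instead would give a different figure, so it helps to state which group the percentage is relative to.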


#### Finding the Number of Ask Posts and Comments by Hour Created

Next, we'll determine whether Ask HN posts created at a certain time are more likely to attract comments. We will group the posts, and the comments they received, by the hour in which they were created. Then we can calculate the average number of comments per post for each hour.

In [9]:
# Extract the 'created_at' and number of comments for each 'Ask HN' post
result_list: list[list[str | int]] = [[post[6], int(post[4])] for post in ask_posts]

counts_per_hour: dict = {}
comments_per_hour: dict = {}

for row in result_list:
    date, comment = row[0], row[1]
    created = dt.datetime.strptime(
        date, "%m/%d/%Y %H:%M"
    )  # Posts are submitted at US Eastern Time
    timezone_adjustment = dt.timedelta(
        hours=6
    )  # South African Standard Time (UTC+2) is 6 hours ahead of US Eastern Daylight Time (UTC-4); the gap is 7 hours during Eastern Standard Time
    adjusted_hour = (created + timezone_adjustment).strftime(
        "%H"
    )  # Extract the hour only after timezone adjustment

    if adjusted_hour not in counts_per_hour:
        counts_per_hour[adjusted_hour] = 1
        comments_per_hour[adjusted_hour] = comment
    else:
        counts_per_hour[adjusted_hour] += 1
        comments_per_hour[adjusted_hour] += comment
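
The same grouping can be sketched with `collections.defaultdict`, which removes the need for the `if ... not in` branch. This is an illustration under the same assumption of a fixed 6-hour offset, not a replacement for the cell above; `comments_by_hour` and the third sample row are hypothetical, while the first two rows reuse timestamps and comment counts from the dataset preview.

```python
from collections import defaultdict
import datetime as dt


def comments_by_hour(rows, offset_hours=6):
    """Group (created_at, num_comments) pairs by the offset-adjusted hour."""
    counts = defaultdict(int)  # posts per hour
    comments = defaultdict(int)  # total comments per hour
    offset = dt.timedelta(hours=offset_hours)  # fixed shift, no daylight-saving handling

    for created_at, num_comments in rows:
        local = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M") + offset
        hour = local.strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments

    return dict(counts), dict(comments)


sample = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 11:05", 8),  # made-up row for illustration
]
counts, comments = comments_by_hour(sample)
print(counts)
print(comments)
```

Note that adding the offset before extracting the hour lets times wrap past midnight correctly, so 19:30 becomes hour `01` of the next day.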

#### Calculating the Average Number of Comments for Ask HN Posts by Hour



In [10]:
"""
Calculates the average number of comments per post for each hour and appends the hour and average to the `avg_by_hour` list.

For each hour in the `comments_per_hour` dictionary, this code:
1. Calculates the average number of comments per post by dividing the total comments for that hour by the count of posts for that hour.
2. Rounds the average to 2 decimal places.
3. Appends a list containing the hour and the calculated average to the `avg_by_hour` list.
"""

avg_by_hour: list = []

for hour in comments_per_hour:
    avg = round((comments_per_hour[hour] / counts_per_hour[hour]), 2)
    avg_by_hour.append([hour, avg])

avg_by_hour

[['15', 5.58],
 ['19', 14.74],
 ['16', 13.44],
 ['20', 13.23],
 ['22', 16.8],
 ['05', 7.99],
 ['18', 9.41],
 ['23', 11.46],
 ['21', 38.59],
 ['03', 16.01],
 ['02', 21.52],
 ['08', 23.81],
 ['00', 13.2],
 ['09', 7.8],
 ['11', 10.09],
 ['01', 10.8],
 ['07', 11.38],
 ['04', 6.75],
 ['14', 10.25],
 ['10', 7.17],
 ['06', 8.13],
 ['12', 9.02],
 ['13', 7.85],
 ['17', 11.05]]

#### Sorting and Printing Values from a List of Lists

In [11]:
"""
Sorts the hourly averages in descending order, with the average value as the primary sort key and the hour as the secondary sort key.

The `swap_avg_by_hour` list swaps the order of the elements in each inner list of `avg_by_hour`, putting the average first and the hour second.

`sorted_swap` then sorts `swap_avg_by_hour` in descending order by average and then by hour.
"""

swap_avg_by_hour = [[avg[1], avg[0]] for avg in avg_by_hour]
sorted_swap = sorted(swap_avg_by_hour, key=lambda x: (x[0], x[1]), reverse=True)

sorted_swap

[[38.59, '21'],
 [23.81, '08'],
 [21.52, '02'],
 [16.8, '22'],
 [16.01, '03'],
 [14.74, '19'],
 [13.44, '16'],
 [13.23, '20'],
 [13.2, '00'],
 [11.46, '23'],
 [11.38, '07'],
 [11.05, '17'],
 [10.8, '01'],
 [10.25, '14'],
 [10.09, '11'],
 [9.41, '18'],
 [9.02, '12'],
 [8.13, '06'],
 [7.99, '05'],
 [7.85, '13'],
 [7.8, '09'],
 [7.17, '10'],
 [6.75, '04'],
 [5.58, '15']]
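
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` can be pointed at the second element of each `[hour, average]` pair through its `key` argument, leaving the pairs in their original order. The list `sample_avg_by_hour` is a hypothetical four-row excerpt of the values above.

```python
# A few [hour, average] pairs taken from the avg_by_hour output above.
sample_avg_by_hour = [["15", 5.58], ["21", 38.59], ["08", 23.81], ["02", 21.52]]

# Sort on the average (index 1) without rearranging each pair.
ranked = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(f"{hour}:00 -> {avg:.2f} average comments per post")
```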

In [12]:
# Print the first 5 rows of results
for post in sorted_swap[:5]:
    time: str = dt.datetime.strptime(post[1], "%H").strftime("%X")
    avg_comments: float = post[0]
    print(f"{time}: {avg_comments:.2f} average comments per post")

# One-liner
# for idx in range(5):
#     print(
#     f"{dt.datetime.strptime(sorted_swap[idx][1], '%H').strftime('%X')}: {sorted_swap[idx][0]:.2f} average comments per post"
#     )

21:00:00: 38.59 average comments per post
08:00:00: 23.81 average comments per post
02:00:00: 21.52 average comments per post
22:00:00: 16.80 average comments per post
03:00:00: 16.01 average comments per post


### Conclusion

Based on the analysis done, we can now answer our initial questions.

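<ul>
  <li>Do Ask HN or Show HN receive more comments on average?<br>
  - 'Ask HN' posts receive <strong>36.07%</strong> more comments on average than 'Show HN' posts.<br><br></li>
  <li>Do posts created at a certain time receive more comments on average?<br>
  - Yes. 'Ask HN' posts created in the 21:00 hour receive the most comments on average (38.59 per post), followed by 08:00, 02:00, 22:00, and 03:00. These times are in South African Standard Time, so posts submitted late at night or early in the morning tend to receive more comments on average.</li>
</ul>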

Further analysis may indicate if there are any outliers that have skewed these averages and if this is a general trend or not.
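
One simple way to start that check, sketched here under the assumption that a large gap between mean and median signals skew, is to compare the two for the comment counts in a given hour. The function `summarize` and the numbers in `sample_counts` are made up for illustration and are not drawn from the dataset.

```python
from statistics import mean, median


def summarize(comment_counts):
    """Return the mean, median, and maximum of a list of comment counts."""
    return {
        "mean": mean(comment_counts),
        "median": median(comment_counts),
        "max": max(comment_counts),
    }


# Made-up counts: four quiet posts and one very popular post.
sample_counts = [1, 2, 3, 4, 90]
summary = summarize(sample_counts)
print(summary)
```

In this toy example the single popular post pulls the mean to 20 while the median stays at 3, which is the kind of gap that would suggest an hourly average is driven by a few outliers.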
