# Exploring Hacker News Posts
[Hacker News](https://news.ycombinator.com/) is a website where technology articles, ideas, and related subjects are shared. Users submit stories known as "posts", which can receive votes and comments, much like on other discussion forums such as Reddit. The site is extremely popular within startup and technology circles. Posts that get upvoted enough to reach the top of the Hacker News listings can receive hundreds of thousands of visitors as a result.

## Exploring the Data
The original [dataset](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) contains about 300,000 rows. The [version used here](https://dq-content.s3.amazonaws.com/356/hacker_news.csv) has been reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time of the post's submission
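
As a quick illustration of the column layout, the sketch below reads two sample rows with `csv.DictReader`, so each value can be looked up by column name. The in-memory `sample_csv` text is an assumption made for this example (it stands in for the real file), and the notebook itself works with plain lists rather than dictionaries.

```python
import csv
import io

# Assumption: a small in-memory stand-in for hacker_news.csv,
# built from the header row plus two rows taken from the dataset.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),"
    "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,"
    "8,2,walterbell,9/30/2015 4:12\n"
)

# DictReader uses the first row as keys for every following row.
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Every value is read as a string, so numeric columns need int().
print(first["title"], int(first["num_points"]), int(first["num_comments"]))
```

Note that all values arrive as strings, which is why the later cells convert `num_comments` with `int()` before summing.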

Let's start by opening the dataset and displaying the first few rows. For that, we will need two functions:
- `open_dataset` opens the provided dataset and returns either the full list of rows or, when `header` is `True`, the header row and the body separately.
- `explore_dataset` prints a selected slice of a dataset for easier inspection. It can also display the total number of rows and columns in that dataset.

In [1]:
# Open a CSV dataset and return it whole, or split into header and body
def open_dataset(file_name="hacker_news.csv", header=True):
    from csv import reader

    # The with block closes the file once all rows have been read
    with open(file_name) as opened_file:
        dataset = list(reader(opened_file))

    # header=False: return every row, header row included, as one list
    if not header:
        return dataset

    # header=True: return the header row and the remaining rows separately
    return dataset[0], dataset[1:]

# Explore a certain section of the dataset and opt to show the total count of rows and columns
def explore_dataset(dataset, start, end, rows_and_columns =  False):
    for row in dataset[start:end]:
        print(row)
        print("\n")
        
    if rows_and_columns:
        columns = len(dataset[0])
        rows = "{:,d}".format(len(dataset))
        print("Number of columns: ", columns)
        print("Number of rows: ", rows)
        
headers, hn = open_dataset()
explore_dataset(hn, 0, 5, True)

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of columns:  7
Number of rows:  20,100


These are the headers contained within the dataset. They match the column descriptions given earlier.

In [2]:
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


## Extracting Ask HN and Show HN Posts
After opening the data, we need to separate Ask HN and Show HN posts into their own lists, with every other post going into a third list. This makes the later steps simpler, since each group can be explored on its own.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Number of Ask HN posts: ", len(ask_posts))        
print("Number of Show HN posts: ", len(show_posts))
print("Number of Other HN posts: ", len(other_posts))

Number of Ask HN posts:  1744
Number of Show HN posts:  1162
Number of Other HN posts:  17194


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Each post on the website receives a certain number of comments. Let us find the average number of comments for Ask HN and Show HN posts.

In [4]:
total_ask_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average Ask HN comments: ", avg_ask_comments)

Average Ask HN comments:  14.038417431192661


In [5]:
total_show_comments = 0

for row in show_posts:
    total_show_comments += int(row[4])
    
avg_show_comments = total_show_comments / len(show_posts)
print("Average Show HN comments: ", avg_show_comments)

Average Show HN comments:  10.31669535283993


The calculations show that Ask HN posts receive more comments on average than Show HN posts: about 14 comments per post for Ask HN versus about 10 for Show HN.
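
The two averaging cells repeat the same steps, so they could be folded into one helper. The following is a minimal sketch rather than the notebook's own code: the function name `average_comments` and its `comments_index` parameter are hypothetical, and the two sample rows are ordinary posts from the dataset preview used only to show the arithmetic.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count for a list of rows.

    Hypothetical helper: comments_index=4 assumes the column order
    id, title, url, num_points, num_comments, author, created_at.
    """
    if not posts:
        # Avoid dividing by zero when a category has no posts.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows from the dataset preview, with 52 and 2 comments.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)',
     'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
     '8', '2', 'walterbell', '9/30/2015 4:12'],
]

sample_average = average_comments(sample_posts)
print(sample_average)  # (52 + 2) / 2 = 27.0
```

With such a helper, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.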

## Finding the Number of Ask Posts and Comments by Hour Created
Given the previous results, we will focus only on Ask HN posts, since they attract more comments on average. Now, let us determine whether ask posts created at a certain time are more likely to attract comments. This takes two tasks:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.

Let's proceed with the first task.

In [6]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, num_comments])
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    num_comments = int(row[1])
    date_format = "%m/%d/%Y %H:%M"
    datetime_obj = dt.datetime.strptime(date, date_format)
    hour = datetime_obj.strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
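
To make the parsing step concrete, here is a small sketch that applies the same format string to two `created_at` values taken from the dataset preview. The variable names are chosen for this example only.

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"

# Two created_at values from the dataset preview.
afternoon_stamp = "8/4/2016 11:52"
early_stamp = "9/30/2015 4:12"

# strptime accepts months, days, and hours without a leading zero.
afternoon = dt.datetime.strptime(afternoon_stamp, date_format)
early = dt.datetime.strptime(early_stamp, date_format)

# strftime("%H") always returns a zero-padded, two-character hour,
# which is why the dictionary keys look like '04' rather than '4'.
afternoon_hour = afternoon.strftime("%H")
early_hour = early.strftime("%H")

print(afternoon_hour, early_hour)  # 11 04
```

Because the hour is stored as a zero-padded string, the keys of `counts_by_hour` and `comments_by_hour` are strings such as `'04'`, not integers.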

## Calculating the Average Number of Comments for Ask HN Posts by Hour


The two dictionaries produced above are defined as follows:
- `counts_by_hour`: contains the number of ask posts created during each hour of the day.
- `comments_by_hour`: contains the total number of comments received by the ask posts created during each hour.

We will now calculate the average number of comments per post for each hour. The result will be stored in a list of lists called `avg_by_hour`.

In [7]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


## Sorting and Printing Values from a List of Lists
The `avg_by_hour` list is hard to read as printed, and sorting it as-is would order it by hour rather than by average. Let us swap the order of the columns within the list so that the average comes first.

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


With the columns swapped, we can sort the list by average, print the five best hours in a readable format, and make a recommendation.

In [24]:
sorted_swap = sorted(swap_avg_by_hour, reverse =  True)
print("Top 5 hours for Ask Posts Comments")
string = "{}: {:.2f} average comments per post"

for row in sorted_swap[:5]:
    avg_comments_by_hour = row[0]
    time = row[1]
    time = dt.datetime.strptime(time, "%H")
    formatted_time = time.strftime("%H:%M")
    result_string = string.format(formatted_time, avg_comments_by_hour)
    print(result_string)

Top 5 hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


These five hours are the ones at which ask posts in this sample received the most comments on average. A writer looking for the best time to publish a post would find these to be their best options.
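
The same ranking can also be produced without building a swapped copy of the list, by passing a `key` function to `sorted`. This is an alternative sketch, not the approach used in the cells above, and `avg_by_hour_sample` is a four-entry excerpt of the printed `avg_by_hour` values used only for illustration.

```python
# Excerpt of the avg_by_hour output, in [hour, average] order.
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

template = "{}:00: {:.2f} average comments per post"
for hour, average in ranked[:3]:
    print(template.format(hour, average))

top_hours = [hour for hour, _ in ranked[:3]]
```

Sorting with a key keeps the `[hour, average]` order intact, so no second list is needed.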

## Converting Time Stamps Between Time Zones
Let us explore another scenario. The timestamps in the dataset are in US Eastern Time, so what would be the best time for me to make a post and have a higher chance of receiving comments? Answering this requires converting the hours to my time zone: Mexico Central Standard Time (CST/UTC-6).

Normally, this would be a good use case for the `zoneinfo` module. Because the conversion is a one-off, it can be done by hand instead. Assuming Eastern Standard Time (UTC-5), Eastern Time is one hour ahead of CST (UTC-6), so each hour shifts back by one. These are the converted best times for a person in the CST time zone to make a post on Hacker News:
- 14:00
- 01:00
- 19:00
- 15:00
- 20:00
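
For reference, the conversion can be sketched in code under a simplifying assumption: both zones are treated as fixed offsets, UTC-5 for Eastern Standard Time and UTC-6 for Mexico Central Standard Time, which ignores daylight saving time. The constants, the function name `eastern_hour_to_central`, and the placeholder date are all hypothetical choices for this example. A full solution would use `zoneinfo` with named zones instead.

```python
import datetime as dt

# Assumption: fixed offsets with no daylight saving adjustments.
EASTERN = dt.timezone(dt.timedelta(hours=-5))         # EST, UTC-5
MEXICO_CENTRAL = dt.timezone(dt.timedelta(hours=-6))  # CST, UTC-6


def eastern_hour_to_central(hour_string):
    """Convert a zero-padded Eastern hour such as '15' to 'HH:MM' in CST."""
    # The date is an arbitrary placeholder; with fixed offsets it
    # does not affect the resulting hour.
    eastern_time = dt.datetime(2016, 1, 15, int(hour_string), 0, tzinfo=EASTERN)
    return eastern_time.astimezone(MEXICO_CENTRAL).strftime("%H:%M")


# The five top hours found earlier, in Eastern Time.
top_eastern_hours = ["15", "02", "20", "16", "21"]
converted = [eastern_hour_to_central(hour) for hour in top_eastern_hours]
print(converted)  # ['14:00', '01:00', '19:00', '15:00', '20:00']
```

Under these fixed offsets every hour moves back by exactly one. During periods when only one of the two regions observes daylight saving time the difference can be two hours, which is the case `zoneinfo` would handle automatically.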


## Conclusion
The Hacker News dataset provides excellent opportunities for exploration; which ones are worth pursuing depends on the needs and goals of the explorer. Based on this sample, it is reasonable to conclude that the hours identified above give Hacker News post writers a higher chance of receiving feedback.