# Hacker News Interaction Analysis: What Types of Posts Receive the Most Comments and When?

In this analysis, we will be focusing on a dataset from Kaggle containing 12-months of post information from the social technology forum 'Hacker News' between September 2015 and 2016. We are interested in using it to understand more about user interaction on the site. Specifically to answer the following:
1. Of the categorized types of posts a user can generate, between 'Ask HN' posts and 'Show HN' posts, which receives more interaction?
2. Do posts created at a certain time receive more comments on average?

Let us examine the breakdown of the data we will be analyzing. Below are the column headers and their description:
    
    id: The unique identifier from Hacker News for the post
    title: The title of the post
    url: The URL that the posts links to, if the post has a URL
    num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    num_comments: The number of comments that were made on the post
    author: The username of the person who submitted the post
    created_at: The date and time at which the post was submitted

We will import the dataset from Kaggle (https://www.kaggle.com/hacker-news/hacker-news-posts) locally and display the first five rows below to give us a snapshot of what the data looks like.

In [27]:
from csv import reader
open_file = open(r'C:\Users\bbeckenb\OneDrive\Documents\Local Datasets\Hacker_News_Post_Data.csv', encoding="utf8")
read_file = reader(open_file)
hn = list(read_file)

for row in hn[:5]:
    print(row)


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


# Data Cleaning
Now, we will need to clean the data. I will go through a process to remove:
1. Erroneous data
2. Duplicate data

Removing Erroneous data:
To qualify as erroneous, the data must be incomplete. To check this, we will write a script to compare the length of the column header to the length of all rows that follow.

In [28]:
print(f"The length of the original Hacker News dataset is {len(hn[1:])}")
hn_erroneous = []

for row in hn[1:]:
    header_length = len(hn[0])
    row_length = len(row)
    if row_length != header_length:
        hn_erroneous.append(row)

print(f"The erroneous data script found {len(hn_erroneous)} erroneous rows")

The length of the original Hacker News dataset is 293119
The erroneous data script found 0 erroneous rows


Next step in our data cleansing process is to check for duplicates. To accomplish this, we will store a list of the unique ID numbers stored in column 0 and run this new list through a filter. The filter will run the ID list through two empty lists (clean and duplicate) using a for loop. We will append rows to the clean list after checking if they are already in the list. If they are already in the list, the logic will append them to the duplicates list instead.

In [29]:
#This filtering takes much too long on my local machine, I will find a solution for this, but when I ran it all the way through there were 0 duplicate rows.
unique_id = []

for row in hn[1:]:
    u_id = row[0]
    unique_id.append(u_id)


clean_list = []
duplicate_list = []
for element in unique_id[:20000]:
    if element in clean_list:
        duplicate_list.append(element)
    else:
        clean_list.append(element)
        
print(f"The duplicate data script found {len(duplicate_list)} duplicate rows, there are {len(clean_list)} clean rows")        

The duplicate data script found 0 duplicate rows, there are 20000 clean rows


Now that we have determined there are no erroneous rows of data and no duplicate rows, we can move forward with our questions at hand.

We are looking for information that guides us on the following queries:
1. Of the categorized types of posts a user can generate, between 'Ask HN' posts and 'Show HN' posts, which receives more interaction?
2. Do posts created at a certain time receive more comments on average?

# Between 'Ask HN' posts and 'Show HN' posts, which receives more interaction?
We are looking for an average interaction number or score from two categories of posts, 'Ask Hacker News' and 'Show Hacker News'. To help find this, we will need to separate the two different categories of data into separate lists then focus in on the num_comments and num_points information. To refresh, here are the definitions of these two columns of data:
    
    num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    num_comments: The number of comments that were made on the post   
    
To separate the 'Ask Hacker News' and 'Show Hacker News' data points, we will write a script to put the two in different lists. 'Ask Hacker News' rows have "Ask HN:" in the beginning of their title. In the script, we will check if "Ask HN:" is in the title for each row. If it is, we will put it in an Ask Hacker News list, if it contains "Show HN:", we will put the row in a Show Hacker News list, anything else, we will put in a miscellaneous list named 'other_hn_list'. We will check the length total of our three new lists against the length of the original list to see the percentage breakdown.

In [30]:
ask_hn_list = []
show_hn_list = []
other_hn_list = []

for row in hn[1:]:
    title = row[1]
    if "Ask HN:" in title:
        ask_hn_list.append(row)
    elif "Show HN:" in title:
        show_hn_list.append(row)
    else:
        other_hn_list.append(row)

ask_percentage = len(ask_hn_list) / len(hn) * 100
show_percentage = len(show_hn_list) / len(hn) * 100
print(f"There are {len(ask_hn_list)} 'Ask HN' posts and {len(show_hn_list)} 'Show HN' posts in our dataset, giving us {round(ask_percentage, 2)}% Ask posts and {round(show_percentage, 2)}% Show posts")

There are 9110 'Ask HN' posts and 10140 'Show HN' posts in our dataset, giving us 3.11% Ask posts and 3.46% Show posts


Now that we have separated our data, we can see we have a similar amount of 'Show HN' posts (3.46%) when compared to 'Ask Posts' (3.11%). We can use the interaction information provided in the rows to establish an average interaction score to help us understand whether Asking or Showing on Hacker News typically receives more interactions.

We will utilize num_points and num_comments where num_points is the total number of upvotes minus the total number of downvotes and num_comments is the number of comments made on the post to help us. We will group average points, average comments, and average interaction (comments plus points) separately for each list to give an initial impression.

In [31]:
tot_points_shn = 0
tot_points_ahn = 0
tot_comments_shn = 0 
tot_comments_ahn = 0

for row in ask_hn_list:
    points = int(row[3])
    comments = int(row[4])
    tot_points_ahn += points
    tot_comments_ahn += comments
    
for row in show_hn_list:
    points = int(row[3])
    comments = int(row[4])
    tot_points_shn += points
    tot_comments_shn += comments
    
avg_points_ahn = round(tot_points_ahn / len(ask_hn_list), 2)
avg_points_shn = round(tot_points_shn / len(show_hn_list), 2)
avg_comments_ahn = round(tot_comments_ahn / len(ask_hn_list), 2)
avg_comments_shn = round(tot_comments_shn / len(show_hn_list), 2)
avg_points_and_comments_ahn = round((tot_points_ahn + tot_comments_ahn) / len(ask_hn_list), 2)                                 
avg_points_and_comments_shn = round((tot_points_shn + tot_comments_shn) / len(show_hn_list), 2)

print(f"The average points for the 'Ask' List is {avg_points_ahn}, the average comments for the 'Ask' List is {avg_comments_ahn}, average total interaction (comments plus points) is {avg_points_and_comments_ahn}")
print(f"The average points for the 'Show' List is {avg_points_shn}, the average comments for the 'Show' List is {avg_comments_shn}, average total interaction (comments plus points) is {avg_points_and_comments_shn}")

The average points for the 'Ask' List is 11.33, the average comments for the 'Ask' List is 10.41, average total interaction (comments plus points) is 21.74
The average points for the 'Show' List is 14.87, the average comments for the 'Show' List is 4.89, average total interaction (comments plus points) is 19.76


As we can see from an initial pass, average points for "Show" posts are 23.81% higher when compared to "Ask" posts. Meanwhile, average comments for the "Ask" posts are 53.03% higher compared to "Show" Posts. This makes sense when thinking about the nature of the different categories. When you are "Showing" some technology you have found in the forum, it is likely something exciting or cutting edge which may appeal to some stripe of the general Hacker News user base. Meanwhile, I would expect comments to be higher for an "Ask" post compared to a "Show" post due to the fact that the action of asking is a request for a response. The average of total interaction (likes plus comments given equal weight) is different by 9.1% with 'Ask' posts having the edge.

Let's breakdown the data more to see what other insights may be gleaned. Our next action will be to create frequency distribution tables to display where a majority of the comment and upvote magnitudes are. I will define a function to generate these frequency distribution tables, then call it for the comment and upvote columns of the 'Ask' and 'Show' lists. 

In [32]:
def frequency_distribution(data_set, column_num, std_dev, context, int_convert = True):
    column_list = []
    for row in data_set:
        if int_convert == True:
            element = int(row[column_num])
        else:
            element = row[column_num]     
        column_list.append(element)
    freq_dist_dictionary = {}    
    key = "x = 0"
    freq_dist_dictionary[key] = 0
    key_0 = f"0 < x <= {std_dev}"
    freq_dist_dictionary[key_0] = 0
    key_1 = f"{std_dev} < x <= {std_dev * 2}"
    freq_dist_dictionary[key_1] = 0
    key_2 = f"{std_dev * 2} < x <= {std_dev * 3}"
    freq_dist_dictionary[key_2] = 0
    key_3 = f"{std_dev * 3} < x <= {std_dev * 4}"
    freq_dist_dictionary[key_3] = 0
    key_4 = f"{std_dev * 4} < x"
    freq_dist_dictionary[key_4] = 0
    
    for row in column_list:
        if column_list[row] == 0:
            freq_dist_dictionary[key] += 1
        if column_list[row] <= std_dev and column_list[row] != 0:
            freq_dist_dictionary[key_0] += 1
        if column_list[row] > std_dev and column_list[row] <= (2 * std_dev):
            freq_dist_dictionary[key_1] += 1
        if column_list[row] > (2 * std_dev) and column_list[row] <= (3 * std_dev):
            freq_dist_dictionary[key_2] += 1  
        if column_list[row] > (3 * std_dev) and column_list[row] <= (4 * std_dev):
            freq_dist_dictionary[key_3] += 1
        if column_list[row] > (4 * std_dev):
            freq_dist_dictionary[key_4] += 1
    print(f"{context} : (Frequency) : Percentage of {len(column_list)} Total Data Points")
    for row in freq_dist_dictionary:
        print(row, ': (', freq_dist_dictionary[row], ') :', round(100 * (freq_dist_dictionary[row] / len(column_list)), 2), '% \n')

frequency_distribution(ask_hn_list, 4, 1, 'Comments in Ask List')
frequency_distribution(ask_hn_list, 3, 1, 'Upvotes in Ask List')  
frequency_distribution(show_hn_list, 4, 1, 'Comments in Show List')
frequency_distribution(show_hn_list, 3, 1, 'Upvotes in Show List') 

Comments in Ask List : (Frequency) : Percentage of 9110 Total Data Points
x = 0 : ( 1941 ) : 21.31 % 

0 < x <= 1 : ( 564 ) : 6.19 % 

1 < x <= 2 : ( 853 ) : 9.36 % 

2 < x <= 3 : ( 2548 ) : 27.97 % 

3 < x <= 4 : ( 76 ) : 0.83 % 

4 < x : ( 3128 ) : 34.34 % 

Upvotes in Ask List : (Frequency) : Percentage of 9110 Total Data Points
x = 0 : ( 0 ) : 0.0 % 

0 < x <= 1 : ( 3878 ) : 42.57 % 

1 < x <= 2 : ( 1031 ) : 11.32 % 

2 < x <= 3 : ( 163 ) : 1.79 % 

3 < x <= 4 : ( 26 ) : 0.29 % 

4 < x : ( 4012 ) : 44.04 % 

Comments in Show List : (Frequency) : Percentage of 10140 Total Data Points
x = 0 : ( 8971 ) : 88.47 % 

0 < x <= 1 : ( 942 ) : 9.29 % 

1 < x <= 2 : ( 33 ) : 0.33 % 

2 < x <= 3 : ( 54 ) : 0.53 % 

3 < x <= 4 : ( 4 ) : 0.04 % 

4 < x : ( 136 ) : 1.34 % 

Upvotes in Show List : (Frequency) : Percentage of 10140 Total Data Points
x = 0 : ( 0 ) : 0.0 % 

0 < x <= 1 : ( 6734 ) : 66.41 % 

1 < x <= 2 : ( 1993 ) : 19.65 % 

2 < x <= 3 : ( 859 ) : 8.47 % 

3 < x <= 4 : ( 68 ) : 0.67 

The data lists are similar in length with the length of 'Show' Post data being 10140, and the length of 'Ask' Post data being 9110. The frequency distribution tables above give us an interesting break down of the upvotes and comments in the 'Show' and 'Ask' lists. Using '1' as the standard deviation, some of the returned numbers give us a strong indication of interaction behavior. These were the main points that stood out:
1. There must be a default 1 upvote given immediately to posts because both the Ask and Show posts have no 0 vote posts
2. 97.76% of posts in the Show list received 1 comment or less, 88.47% received 0 comments, only 1.34% of 'Show' posts received 5 or more comments
3. 78.69% of posts in the Ask list receive at least 1 comment, over one third (34.34%) received at least 5 comments
4. 4.79% of Show posts received 5 or more upvotes whereas 44.04% of the Ask posts received 5 or more upvotes. The magnitudes of the posts receiving 5 or more upvotes for 'Ask' and 'Show' posts are 4012 and 486 respectively, making 'Show' posts with upvotes of 5 or more stack up to only 12% of the 'Ask' posts with 5 or more upvotes.

Our first objective in this analysis was to determine if Ask HN or Show HN posts receive more comments and/ or points on average. Given the data from the initial pass to find average points and comments and the insights from the frequency distribution tables above, we can confidently say Ask HN posts are more likely to be interacted with by general users on Hacker News. With 'Show' Posts having 11.53% of their posts with at least 1 comment and 'Ask' Posts having 78.69% of posts with at least 1 comment, you would be 6.82 times more likely to receive a comment on an 'Ask' Post. With 'Show' Posts having 33.59% of their posts with at least 1 outside upvote and 'Ask' Posts having 57.43% of posts with at least 1 outside upvote, you would be 1.71 times more likely to receive an upvote from another user on an 'Ask' Post. As mentioned previously, having more comments on an 'Ask' post makes a lot of sense given asking a question incites a response and that is supported strongly by the data.

Now we will move onto the 2nd objective of this analysis: Do posts created at a certain time receive more comments on average?

First we will revisit the make up of our data by looking at a few rows to refamiliarize ourselves and make decisions on which columns will be useful to accomplish this objective. Due to the nature of a time related question, we are already aware that the timestamp will be important. I checked its class type below to see what next steps we need to take.

In [33]:
for row in ask_hn_list[:5]:
    print(row, '\n')

print(f"Time Stamp, 'created_at' column data is type {type(ask_hn_list[1][6])}")

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'] 

['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'] 

['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'] 

['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'] 

['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'] 

Time Stamp, 'created_at' column data is type <class 'str'>


As we can see above, these are 'Ask HN' posts, but all rows in the data have the same format:
    
    id: The unique identifier from Hacker News for the post
    title: The title of the post
    url: The URL that the posts links to, if the post has a URL
    num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    num_comments: The number of comments that were made on the post
    author: The username of the person who submitted the post
    created_at: The date and time at which the post was submitted
    
# When do Different kinds of Posts Receive the Most Comments?
Our second objective is asking us to find a correlation between post time and number of comments to find the optimal time to post to receive the most comments. With that in mind, we will be focusing on the 'num_comments' and 'created_at' columns to guide us. We will need to find an average comments per post creation number over time to answer our question.

We will look at three different categories from the original HN dataset: Ask HN post data, Show HN post data, and the entire HN list to see how they compare.

We can see from the last row in our sample above the time stamp of the final row is '9/25/2016 21:50' giving the 'created_at' column a format of 'MM/DD/YYYY HH/mm' with hours going from 0 to 23. As we can see above, the column data for 'created_at' is already '<class 'str'>', so we will be able to manipulate it as such for the time-being. We will first group the 'comment_num' data with the 'created_date' data for the three datasets mentioned above. To improve efficiency and reduce code length, I am storing the three lists in aligned containers (meaning they are in the same order in the different containers). I will run the containers through the processing scripts. We will then print the first five rows of our new lists to give us an idea of what the data inside looks like.

In [34]:
hn_no_header = hn[1:]
shn_comments_dates = []
ahn_comments_dates = []
hn_no_header_comments_dates = []
raw_list_container = [hn_no_header, ask_hn_list, show_hn_list]
comments_dates_list_container = [hn_no_header_comments_dates, ahn_comments_dates, shn_comments_dates]
list_count = 0

for data_set in comments_dates_list_container:    
    for row in raw_list_container[list_count]:
        comments = int(row[4])
        date = row[6]
        comments_and_dates = [comments, date]
        comments_dates_list_container[list_count].append(comments_and_dates)
    list_count += 1

list_count = 0
for data_set in comments_dates_list_container: 
    for row in comments_dates_list_container[list_count][:5]:
        print(row)
    print('\n')
    list_count += 1

[0, '9/26/2016 3:26']
[0, '9/26/2016 3:24']
[0, '9/26/2016 3:19']
[0, '9/26/2016 3:16']
[0, '9/26/2016 3:14']


[7, '9/26/2016 2:53']
[3, '9/26/2016 1:17']
[0, '9/25/2016 22:57']
[3, '9/25/2016 22:48']
[2, '9/25/2016 21:50']


[0, '9/26/2016 0:36']
[0, '9/26/2016 0:01']
[0, '9/25/2016 23:44']
[0, '9/25/2016 23:17']
[1, '9/25/2016 20:06']




This interpretation of finding the optimal time to post to receive most comments is centered around hours and minutes of any given day. There certainly could be further analysis on a specific day or time of year to post, but for now we will focus our scope on the hours and minutes. To do this, we will first want to isolate the hours and minutes portion of our date elements. We will utilize the 'split()' function (https://www.w3schools.com/python/ref_string_split.asp) to change the date into a list of two strings, the 0th element containing the month/day/year, the 1st containing the hours and minutes information we want. We will store this truncated time information along with the comment information in new lists tagged with '_hours_min' in a new container tagged with the same.


In [35]:
shn_comments_hours_min = []
ahn_comments_hours_min = []
hn_no_header_comments_hours_min = []
comments_hours_mins_container = [hn_no_header_comments_hours_min, ahn_comments_hours_min, shn_comments_hours_min]
list_count = 0 

for data_set in comments_dates_list_container:
    for row in comments_dates_list_container[list_count]:
        comment = row[0]
        date = row[1]
        date = date.split(' ', 1)
        hours_and_min = date[1]
        new_row = [comment, hours_and_min]
        comments_hours_mins_container[list_count].append(new_row)
    list_count += 1

list_count = 0
for data_set in comments_hours_mins_container:     
    for row in comments_hours_mins_container[list_count][:5]:
        print(row)
    print('\n')
    list_count += 1
    

[0, '3:26']
[0, '3:24']
[0, '3:19']
[0, '3:16']
[0, '3:14']


[7, '2:53']
[3, '1:17']
[0, '22:57']
[3, '22:48']
[2, '21:50']


[0, '0:36']
[0, '0:01']
[0, '23:44']
[0, '23:17']
[1, '20:06']




Now that we have isolated the appropriate information in a single list, we are going to create a series of frequency distribution tables to narrow in on an appropriate range of time that contains the most comments. Let's start with AM versus PM times. 

We will import the datetime class as dt with all of its available methods in order to convert the timestamps from the string class to datetime objects to take advantage of pre-built functionality to analyze and manipulate time information.

The script below will establish two empty dictionaries, one to track comment magnitude over the AM and PM time-ranges, one to track posts generated during the two time ranges. That way we will be able to create an average by dividing the amounts of comments by the amount of posts over the time range in question. 

We will run the three datasets (All HN, Ask HN, and Show Hn) through the script, resetting then reusing the dictionaries 'am_vs_pm_comment_freq' and 'am_vs_pm_post_creation_freq' each iteration of the primary loop. In each iteration of the secondary loop, we will set the integer of the magnitude of comments to a variable 'comment' then the string timestamp of hours and minutes to variable 'hours_and_min'. We covert the string timestamp to a datetime object and compare it to 'noon', a datetime object we created representing the time 12:00. The logic is set so that if the 'hours_and_min' time is above or below 'noon' it will be sorted to the respective key for the two dictionaries 'AM' or 'PM'. The appropriate key of the dictionary will be incremented one for post frequency and the comments by the magnitude stored in the 'comment' variable.

We then print the name of the dataset that was processed, a header for the information we are about to output to add context to the numbers, and the output data for each row in the dictionary (only 2, for 'AM' and 'PM').

In [36]:
import datetime as dt
from datetime import *

am_vs_pm_comment_freq = {}
am_vs_pm_post_creation_freq = {}
data_set_names = ['All HN', 'Ask HN', 'Show HN']
list_count = 0

for data_set in comments_hours_mins_container:
    am_vs_pm_comment_freq['AM'] = 0
    am_vs_pm_comment_freq['PM'] = 0
    am_vs_pm_post_creation_freq['AM'] = 0
    am_vs_pm_post_creation_freq['PM'] = 0
    total_comments = 0

    for row in comments_hours_mins_container[list_count]:
        comment = row[0]
        hours_and_min = row[1]
        hours_and_min = dt.datetime.strptime(hours_and_min, "%H:%M")
        noon = dt.datetime.strptime('12:00', "%H:%M")
        if hours_and_min >= noon: 
            am_vs_pm_comment_freq['PM'] += comment
            am_vs_pm_post_creation_freq['PM'] += 1
        else:
            am_vs_pm_comment_freq['AM'] += comment
            am_vs_pm_post_creation_freq['AM'] += 1
        total_comments += comment
    print(f"For {data_set_names[list_count]}")
    print('AM/ PM : Avg Comments/ Posts Created : (Comments Generated) : Percentage of Total Comments')
    for row in am_vs_pm_comment_freq:
        print(row, ':', round(am_vs_pm_comment_freq[row]/ am_vs_pm_post_creation_freq[row], 2), ': (', am_vs_pm_comment_freq[row], ') :', round((100 * am_vs_pm_comment_freq[row] / total_comments), 2), '%\n')
    list_count += 1

For All HN
AM/ PM : Avg Comments/ Posts Created : (Comments Generated) : Percentage of Total Comments
AM : 6.61 : ( 646982 ) : 33.82 %

PM : 6.48 : ( 1265779 ) : 66.18 %

For Ask HN
AM/ PM : Avg Comments/ Posts Created : (Comments Generated) : Percentage of Total Comments
AM : 8.55 : ( 26484 ) : 27.93 %

PM : 11.36 : ( 68339 ) : 72.07 %

For Show HN
AM/ PM : Avg Comments/ Posts Created : (Comments Generated) : Percentage of Total Comments
AM : 4.94 : ( 15166 ) : 30.56 %

PM : 4.87 : ( 34454 ) : 69.44 %



From our first sweep, we can see that the average comments per post is fairly close for the overall list and Show HN posts, the 'Ask HN' list is 24.74% higher in the afternoon. Looking at Comments generated and the percentage of total comments, we can see the general amount of activity is much higher in the afternoon. Overall list comments roughly double in the afternoon, Ask HN comments almost triple, Show HN comments over double. However, we still do not have enough resolution to answer our question. In this next sweep, we will have four rows in our frequency distribution table, each consisting of a 6 hour window: 0:00 - 5:59, 6:00 - 11:59, 12:00 - 17:59, 18:00 - 23:59.

The process for this sweep is the same as the 'AM' vs 'PM' sweep, with two caveats: 
1. The magnitude of comments and the amount of posts created will be stored in a list in the same dictionary in the format "[comments, posts]". 
2. We are looking at additional time-windows

In [37]:
day_quartered_freq = {}
data_set_names = ['All HN', 'Ask HN', 'Show HN']
list_count = 0

for data_set in comments_hours_mins_container:
    day_quartered_freq['0:00 - 5:59'] = [0, 0]
    day_quartered_freq['6:00 - 11:59'] = [0, 0]
    day_quartered_freq['12:00 - 17:59'] = [0, 0]
    day_quartered_freq['18:00 - 23:59'] = [0, 0]
    midnight = dt.datetime.strptime('0:00', "%H:%M")
    six_a = dt.datetime.strptime('6:00', "%H:%M")
    noon = dt.datetime.strptime('12:00', "%H:%M")
    six_p = dt.datetime.strptime('18:00', "%H:%M")
    total_comments = 0

    for row in comments_hours_mins_container[list_count]:
        comment = row[0]
        hours_and_min = row[1]
        hours_and_min = dt.datetime.strptime(hours_and_min, "%H:%M")   
        if hours_and_min >= midnight and hours_and_min < six_a: 
            day_quartered_freq['0:00 - 5:59'][0] += comment
            day_quartered_freq['0:00 - 5:59'][1] += 1
        elif hours_and_min >= six_a and hours_and_min < noon:
            day_quartered_freq['6:00 - 11:59'][0] += comment
            day_quartered_freq['6:00 - 11:59'][1] += 1
        elif hours_and_min >= noon and hours_and_min < six_p:
            day_quartered_freq['12:00 - 17:59'][0] += comment
            day_quartered_freq['12:00 - 17:59'][1] += 1
        elif hours_and_min >= six_p:
            day_quartered_freq['18:00 - 23:59'][0] += comment
            day_quartered_freq['18:00 - 23:59'][1] += 1
        total_comments += comment
    
    print(f"For {data_set_names[list_count]}")
    print('4-Quarters of Day : Average Comments per Post : (Magnitude of Comments) : Percentage of Total Comments')
    for row in day_quartered_freq:
        print(row, ':', round(day_quartered_freq[row][0] / day_quartered_freq[row][1], 2), ': (', day_quartered_freq[row][0], ') :', round((100 * day_quartered_freq[row][0] / total_comments), 2), '%\n')
    list_count += 1

For All HN
4-Quarters of Day : Average Comments per Post : (Magnitude of Comments) : Percentage of Total Comments
0:00 - 5:59 : 6.68 : ( 301219 ) : 15.75 %

6:00 - 11:59 : 6.55 : ( 345763 ) : 18.08 %

12:00 - 17:59 : 6.79 : ( 721066 ) : 37.7 %

18:00 - 23:59 : 6.12 : ( 544713 ) : 28.48 %

For Ask HN
4-Quarters of Day : Average Comments per Post : (Magnitude of Comments) : Percentage of Total Comments
0:00 - 5:59 : 8.72 : ( 13669 ) : 14.42 %

6:00 - 11:59 : 8.38 : ( 12815 ) : 13.51 %

12:00 - 17:59 : 14.49 : ( 44936 ) : 47.39 %

18:00 - 23:59 : 8.04 : ( 23403 ) : 24.68 %

For Show HN
4-Quarters of Day : Average Comments per Post : (Magnitude of Comments) : Percentage of Total Comments
0:00 - 5:59 : 4.51 : ( 5865 ) : 11.82 %

6:00 - 11:59 : 5.26 : ( 9301 ) : 18.74 %

12:00 - 17:59 : 5.12 : ( 21580 ) : 43.49 %

18:00 - 23:59 : 4.5 : ( 12874 ) : 25.95 %



The second sweep adds more color, it looks like the '12:00 - 17:59' has the most interaction (comments and posts), but the average comments per post remain close for all but the 'Ask HN' posts. The '12:00 - 17:59' window may contain our optimal time but the time-windows are not tight enough to make that call. There may also be a shifted window that contains the most comments. With that in mind, we will break the data down into 24 windows, 1 hour each using the same script process as above, just for 24 time-windows this time.

In [38]:
hourly_freq = {}
data_set_names = ['All HN', 'Ask HN', 'Show HN']
list_count = 0

for data_set in comments_hours_mins_container:
    hourly_freq['0:00 - 0:59'] = [0, 0]
    hourly_freq['1:00 - 1:59'] = [0, 0]
    hourly_freq['2:00 - 2:59'] = [0, 0]
    hourly_freq['3:00 - 3:59'] = [0, 0]
    hourly_freq['4:00 - 4:59'] = [0, 0]
    hourly_freq['5:00 - 5:59'] = [0, 0]
    hourly_freq['6:00 - 6:59'] = [0, 0]
    hourly_freq['7:00 - 7:59'] = [0, 0]
    hourly_freq['8:00 - 8:59'] = [0, 0]
    hourly_freq['9:00 - 9:59'] = [0, 0]
    hourly_freq['10:00 - 10:59'] = [0, 0]
    hourly_freq['11:00 - 11:59'] = [0, 0]
    hourly_freq['12:00 - 12:59'] = [0, 0]
    hourly_freq['13:00 - 13:59'] = [0, 0]
    hourly_freq['14:00 - 14:59'] = [0, 0]
    hourly_freq['15:00 - 15:59'] = [0, 0]
    hourly_freq['16:00 - 16:59'] = [0, 0]
    hourly_freq['17:00 - 17:59'] = [0, 0]
    hourly_freq['18:00 - 18:59'] = [0, 0]
    hourly_freq['19:00 - 19:59'] = [0, 0]
    hourly_freq['20:00 - 20:59'] = [0, 0]
    hourly_freq['21:00 - 21:59'] = [0, 0]
    hourly_freq['22:00 - 22:59'] = [0, 0]
    hourly_freq['23:00 - 23:59'] = [0, 0]
    midnight = dt.datetime.strptime('0:00', "%H:%M")
    increment_hour = dt.timedelta(hours=1)
    total_comments = 0

    for row in comments_hours_mins_container[list_count]:
        comment = row[0]
        hours_and_min = row[1]
        hours_and_min = dt.datetime.strptime(hours_and_min, "%H:%M")   
        if hours_and_min >= midnight and hours_and_min < (midnight + increment_hour): 
            hourly_freq['0:00 - 0:59'][0] += comment
            hourly_freq['0:00 - 0:59'][1] += 1
        elif hours_and_min >= (midnight + increment_hour) and hours_and_min < (midnight + 2 * increment_hour): 
            hourly_freq['1:00 - 1:59'][0] += comment
            hourly_freq['1:00 - 1:59'][1] += 1
        elif hours_and_min >= (midnight + 2 * increment_hour) and hours_and_min < (midnight + 3 * increment_hour): 
            hourly_freq['2:00 - 2:59'][0] += comment
            hourly_freq['2:00 - 2:59'][1] += 1
        elif hours_and_min >= (midnight + 3 * increment_hour) and hours_and_min < (midnight + 4 * increment_hour): 
            hourly_freq['3:00 - 3:59'][0] += comment
            hourly_freq['3:00 - 3:59'][1] += 1
        elif hours_and_min >= (midnight + 4 * increment_hour) and hours_and_min < (midnight + 5 * increment_hour): 
            hourly_freq['4:00 - 4:59'][0] += comment
            hourly_freq['4:00 - 4:59'][1] += 1
        elif hours_and_min >= (midnight + 5 * increment_hour) and hours_and_min < (midnight + 6 * increment_hour): 
            hourly_freq['5:00 - 5:59'][0] += comment
            hourly_freq['5:00 - 5:59'][1] += 1
        elif hours_and_min >= (midnight + 6 * increment_hour) and hours_and_min < (midnight + 7 * increment_hour): 
            hourly_freq['6:00 - 6:59'][0] += comment
            hourly_freq['6:00 - 6:59'][1] += 1
        elif hours_and_min >= (midnight + 7 * increment_hour) and hours_and_min < (midnight + 8 * increment_hour): 
            hourly_freq['7:00 - 7:59'][0] += comment
            hourly_freq['7:00 - 7:59'][1] += 1
        elif hours_and_min >= (midnight + 8 * increment_hour) and hours_and_min < (midnight + 9 * increment_hour): 
            hourly_freq['8:00 - 8:59'][0] += comment
            hourly_freq['8:00 - 8:59'][1] += 1
        elif hours_and_min >= (midnight + 9 * increment_hour) and hours_and_min < (midnight + 10 * increment_hour): 
            hourly_freq['9:00 - 9:59'][0] += comment
            hourly_freq['9:00 - 9:59'][1] += 1
        elif hours_and_min >= (midnight + 10 * increment_hour) and hours_and_min < (midnight + 11 * increment_hour): 
            hourly_freq['10:00 - 10:59'][0] += comment
            hourly_freq['10:00 - 10:59'][1] += 1
        elif hours_and_min >= (midnight + 11 * increment_hour) and hours_and_min < (midnight + 12 * increment_hour): 
            hourly_freq['11:00 - 11:59'][0] += comment
            hourly_freq['11:00 - 11:59'][1] += 1
        elif hours_and_min >= (midnight + 12 * increment_hour) and hours_and_min < (midnight + 13 * increment_hour): 
            hourly_freq['12:00 - 12:59'][0] += comment
            hourly_freq['12:00 - 12:59'][1] += 1
        elif hours_and_min >= (midnight + 13 * increment_hour) and hours_and_min < (midnight + 14 * increment_hour): 
            hourly_freq['13:00 - 13:59'][0] += comment
            hourly_freq['13:00 - 13:59'][1] += 1
        elif hours_and_min >= (midnight + 14 * increment_hour) and hours_and_min < (midnight + 15 * increment_hour): 
            hourly_freq['14:00 - 14:59'][0] += comment
            hourly_freq['14:00 - 14:59'][1] += 1
        elif hours_and_min >= (midnight + 15 * increment_hour) and hours_and_min < (midnight + 16 * increment_hour): 
            hourly_freq['15:00 - 15:59'][0] += comment
            hourly_freq['15:00 - 15:59'][1] += 1
        elif hours_and_min >= (midnight + 16 * increment_hour) and hours_and_min < (midnight + 17 * increment_hour): 
            hourly_freq['16:00 - 16:59'][0] += comment
            hourly_freq['16:00 - 16:59'][1] += 1
        elif hours_and_min >= (midnight + 17 * increment_hour) and hours_and_min < (midnight + 18 * increment_hour): 
            hourly_freq['17:00 - 17:59'][0] += comment
            hourly_freq['17:00 - 17:59'][1] += comment
        elif hours_and_min >= (midnight + 18 * increment_hour) and hours_and_min < (midnight + 19 * increment_hour): 
            hourly_freq['18:00 - 18:59'][0] += comment
            hourly_freq['18:00 - 18:59'][1] += 1
        elif hours_and_min >= (midnight + 19 * increment_hour) and hours_and_min < (midnight + 20 * increment_hour): 
            hourly_freq['19:00 - 19:59'][0] += comment
            hourly_freq['19:00 - 19:59'][1] += 1
        elif hours_and_min >= (midnight + 20 * increment_hour) and hours_and_min < (midnight + 21 * increment_hour): 
            hourly_freq['20:00 - 20:59'][0] += comment
            hourly_freq['20:00 - 20:59'][1] += 1
        elif hours_and_min >= (midnight + 21 * increment_hour) and hours_and_min < (midnight + 22 * increment_hour): 
            hourly_freq['21:00 - 21:59'][0] += comment
            hourly_freq['21:00 - 21:59'][1] += 1
        elif hours_and_min >= (midnight + 22 * increment_hour) and hours_and_min < (midnight + 23 * increment_hour): 
            hourly_freq['22:00 - 22:59'][0] += comment
            hourly_freq['22:00 - 22:59'][1] += 1
        elif hours_and_min >= (midnight + 23 * increment_hour): 
            hourly_freq['23:00 - 23:59'][0] += comment
            hourly_freq['23:00 - 23:59'][1] += 1
        total_comments += comment

  
    print(f"For {data_set_names[list_count]}")
    print('Post Creation Time : Average Comments per Post : (Magnitude of Comments) : Percentage of Total Comments')
    for row in hourly_freq:
        print(row, ':', round(hourly_freq[row][0] / hourly_freq[row][1], 2), ': (', hourly_freq[row][0], ') :', round((100 * hourly_freq[row][0] / total_comments), 2), '%\n')
     
    list_count += 1

For All HN
Post Creation Time : Average Comments per Post : (Magnitude of Comments) : Percentage of Total Comments
0:00 - 0:59 : 6.58 : ( 59051 ) : 3.09 %

1:00 - 1:59 : 6.42 : ( 50851 ) : 2.66 %

2:00 - 2:59 : 7.27 : ( 54172 ) : 2.83 %

3:00 - 3:59 : 6.43 : ( 45851 ) : 2.4 %

4:00 - 4:59 : 6.63 : ( 47091 ) : 2.46 %

5:00 - 5:59 : 6.76 : ( 44203 ) : 2.31 %

6:00 - 6:59 : 6.17 : ( 45541 ) : 2.38 %

7:00 - 7:59 : 6.1 : ( 47586 ) : 2.49 %

8:00 - 8:59 : 6.34 : ( 53937 ) : 2.82 %

9:00 - 9:59 : 6.52 : ( 59029 ) : 3.09 %

10:00 - 10:59 : 6.51 : ( 63388 ) : 3.31 %

11:00 - 11:59 : 7.37 : ( 76282 ) : 3.99 %

12:00 - 12:59 : 7.69 : ( 97925 ) : 5.12 %

13:00 - 13:59 : 7.34 : ( 116861 ) : 6.11 %

14:00 - 14:59 : 6.46 : ( 117088 ) : 6.12 %

15:00 - 15:59 : 7.05 : ( 137635 ) : 7.2 %

16:00 - 16:59 : 6.18 : ( 124557 ) : 6.51 %

17:00 - 17:59 : 1.0 : ( 127000 ) : 6.64 %

18:00 - 18:59 : 6.46 : ( 120621 ) : 6.31 %

19:00 - 19:59 : 6.33 : ( 107872 ) : 5.64 %

20:00 - 20:59 : 5.95 : ( 94965 ) : 4.96 %


# Conclusion
The number that sticks out is 28.68 posts per comments generated for the 15:00 - 15:59 time-slot of ther 'Ask HN' posts, the next highest is 16.37 at 13:00 - 13:59 for 'Ask HN' posts, 57.08% of the max. The highest posts per comment average in the Overall HN Data List is 7.69 at 12:00 - 12:59 and 6.99 at 12:00 - 12:59 for 'Show HN' Posts, 3.72 and 4.10 times less than the 28.68 max, respectively.

But, why?

The time zone for our timestamp data is Eastern Time (EST) in the US, which makes the rise in comment activity correlate well with working America's lunch break on the east coast. As lunch breaks occur in the time zones Westward, the magnitude of comments increases across the board, peaking when the West coast and tech hub of the US (thinking Silicon Valley in the year ranges of the data 2015 - 2016) has their lunch break. Y Combinator, start-up that created Hacker News is based in Mountain View, CA. It would make sense that the West Coast would contain a majority of users as the popularity of the social news site would diffuse from that central location. That combined with our initial hypothesis that Asking a question incites a higher response rate gives us our time and post-type for maximum comment return: 'Ask HN' at the 13:00 - 13:59 time slot.