# Looking for Ways to Get the Most Comments on Hacker News Posts

## Introduction

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

Our main task in this project is to analyze **Hacker News posts**. We want to find out which type of post receives the most comments on average. We also want to know whether the time a post is created affects the number of comments it receives.


**Hacker News** features two types of posts that are of particular interest here:
 * **Ask HN**: users ask the Hacker News community a specific question.
 * **Show HN**: users show the Hacker News community a project, product, or just something interesting.

Our analysis focuses on these two types of posts.

In our analysis, we found that **Ask HN** posts receive more comments on average. The creation time of a post also affects the number of comments it receives. The following sections show the steps we used to reach these conclusions.

## Reading and Opening the Data Set

* We start by importing the necessary modules.

In [18]:
# Import the necessary libraries

from csv import reader
import datetime as dt

Now we are ready to open and read our dataset. We are using [this](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) dataset from **Kaggle**. You can download it directly from [this link](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts/download?datasetVersionNumber=1).

In [19]:
# Open the file in a context manager so it is closed once the rows are read
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Extract the first row (the header row) of the data set into a variable
headers = hn[0]

# Remove the header row from the rest of the data
del hn[0]

Let us get a better understanding of the columns that make up our data.

In [20]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

* `id`: the unique identifier from **Hacker News** for the post
* `title`: the title of the post
* `url`: the URL that the post links to, if the post has a URL
* `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: the number of comments on the post
* `author`: the username of the person who submitted the post
* `created_at`: the date and time of the post's submission
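The later cells read columns by position (`row[1]`, `row[4]`, `row[6]`). As a small sketch, assuming the header row printed above, those positions can be looked up by name instead of being hard-coded. The `col` dictionary and `sample_row` are illustrative names and values, not part of the original notebook.

```python
# Header row as printed above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row
col = {name: index for index, name in enumerate(headers)}

# Hypothetical row, made up purely for illustration
sample_row = ['1', 'Ask HN: Example question?', '', '5', '12', 'someuser', '8/16/2016 9:55']

# Read fields by name rather than by a bare index
title = sample_row[col['title']]
num_comments = int(sample_row[col['num_comments']])

print(col['title'], col['num_comments'], col['created_at'])
```

This makes it easier to see why the notebook uses index 1 for the title, 4 for the comment count, and 6 for the creation time.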

## Extracting `Ask HN` and `Show HN` Posts

* We now want to isolate the `Ask HN` and `Show HN` posts. This is done as follows:

In [24]:
# Holds the Ask HN posts
ask_posts = []

# Holds the Show HN posts
show_posts = []

# Holds the other posts that are neither Ask HN nor Show HN posts
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("There are {} 'Ask HN' Posts and {} 'Show HN' posts ".format(len(ask_posts),len(show_posts)))

There are 1744 'Ask HN' Posts and 1162 'Show HN' posts 


There are **1,744 Ask HN** posts and **1,162 Show HN** posts, so the **Ask HN** category has the greater number of posts.
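The classification rule used in the loop (lowercase the title, then check how it starts) can also be written as a small function, which makes it easy to try on a few titles. This is only a sketch: `classify_post` and the sample titles are hypothetical and do not come from the data set.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles, made up for illustration
sample_titles = [
    "Ask HN: How do you organize your notes?",
    "show hn: a tiny static site generator",
    "Why I ask HN for advice",
    "A new approach to caching",
]

labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Because `startswith` only looks at the beginning of the string, a title that merely mentions "ask HN" somewhere in the middle is counted as an other post.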

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

Now, let us compute the average number of comments per post for Ask HN and Show HN posts.

In [27]:
# Find the total number of comments in the ask_posts list
total_ask_comments = 0

for post in ask_posts:
    num_comments = post[4]
    num_comments = int(num_comments)
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_ask_comments

14.038417431192661

In [6]:
# Find the total number of comments in the show_posts list
total_show_comments = 0
for post in show_posts:
    num_comments = post[4]
    num_comments = int(num_comments)
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

avg_show_comments


10.31669535283993
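The two cells above repeat the same loop for each list. One possible way to avoid the duplication is a small helper; this is a sketch, assuming the comment count sits at index 4 as in the data set. The name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set
sample_posts = [
    ['1', 'Ask HN: First question?', '', '3', '10', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Second question?', '', '1', '4', 'user_b', '8/16/2016 10:05'],
]

print(average_comments(sample_posts))
```

With the notebook's own lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`. The empty-list check avoids a `ZeroDivisionError` if a category has no posts.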

On average, **Ask HN posts** receive about **14 comments per post** and **Show HN posts** about **10 comments per post**. This suggests that people in the Hacker News community are more willing to comment on questions.

Because **Ask HN posts** are more likely to receive comments, the rest of the analysis focuses on them.

## Finding the Number of Ask Posts and Comments by Hour Created

In [28]:
result_list = []  # List of lists containing creation dates and comment counts

# Dictionary holding the number of `Ask HN` posts created during each hour of the day
counts_by_hour = {}

# Dictionary holding the total number of comments on posts created during each hour of the day
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"

for post in ask_posts:
    created_at = post[6]
    num_comments = post[4]
    num_comments = int(num_comments)
    result_list.append([created_at,num_comments])
    
for row in result_list:
    date = row[0]
    num_comments = row[1]
    date = dt.datetime.strptime(date,date_format)
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
        
print(counts_by_hour)

print("--------")

print(comments_by_hour)


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
--------
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
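The counting loop above can be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in counts_by_hour` branch. The following is a minimal sketch of that alternative, not the notebook's own method; the `sample_results` pairs are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Hypothetical [created_at, num_comments] pairs in the data set's date format
sample_results = [
    ["8/16/2016 9:55", 6],
    ["8/16/2016 9:05", 4],
    ["11/22/2015 13:43", 29],
]

# Missing keys start at 0, so no membership check is needed
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_results:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Both versions produce the same two dictionaries; the `defaultdict` form simply has one code path per row instead of two.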


Now we have the number of posts and the total number of comments for each hour. With the `counts_by_hour` and `comments_by_hour` dictionaries, we can calculate the **average number of comments for Ask HN posts by hour** by dividing the comment total for each hour by the number of posts created in that hour. For example, the 116 posts created at 15:00 received 4,477 comments, or about 38.59 comments per post.

In [29]:
avg_by_hour = []
for key in comments_by_hour:
    avg_by_hour.append([key,comments_by_hour[key] / counts_by_hour[key]])    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

The result obtained so far is hard to draw conclusions from, because the list of lists is not sorted and is difficult to scan visually.

Our next step is to sort the `avg_by_hour` list of lists and print the five hours with the highest average number of comments, in descending order.

In [33]:
# Swap the element positions in the `avg_by_hour` list of lists so that sorting is done by average rather than by hour
swap_avg_by_hour = []
for item in avg_by_hour:
    hour,avg = item
    swap_avg_by_hour.append([avg,hour])
    
# reverse = True so that the sorting happens in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True) 

print('The Top 5 Hours for Ask Posts Comments')
print("\n")
for item in sorted_swap[:5]:
    print(item)
    print("\n")

The Top 5 Hours for Ask Posts Comments


[38.5948275862069, '15']


[23.810344827586206, '02']


[21.525, '20']


[16.796296296296298, '16']


[16.009174311926607, '21']




* Let us now format the result to be visually more pleasing so that conclusions can easily be drawn.

In [34]:
for item in sorted_swap[:5]:
    average,hour = item
    date = dt.datetime.strptime(hour,"%H")
    date = date.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(date,average))
    

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
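Swapping the columns before sorting works, but `sorted` also accepts a `key` function, so the list can be ordered by its second element directly. The sketch below uses a subset of the `avg_by_hour` values printed earlier; it is an alternative way to reach the same ranking, not the approach the notebook takes.

```python
# A subset of the [hour, average] pairs printed earlier in the notebook
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
    ['21', 16.009174311926607],
]

# Sort by the average (second element), highest first, and keep the top five
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

Sorting with `key` leaves each pair in its original `[hour, average]` order, so no second list is needed.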


Ranked by average number of comments, **15:00** leads with **38.59** comments per post, followed by **02:00** with **23.81**, **20:00** with **21.52**, **16:00** with **16.80**, and **21:00** with **16.01** comments per post.

With this information, we can say that post creation time has a big impact on the number of comments a post receives. The best time for an **Ask HN** post to get the most comments is between 13:00 and 16:00, with 15:00 the strongest single hour.

## Conclusion

The goal of our project was to understand whether **Ask HN** or **Show HN** posts get more comments on average, and whether the time of day affects this. To do this, we first calculated the average number of comments for each type of post. Then we calculated the average number of comments **Ask HN** posts receive by hour created, to see whether the hour affects the average.

In the end, we learned that **Ask HN** posts gain more comments on average. Creation time also greatly affects the average number of comments per post. The best hours for **Ask HN** posts are between 13:00 and 16:00.