# Exploring Hacker News Posts

[Hacker News](https://news.ycombinator.com/) is a site where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. The site is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). 
Below are the descriptions of the columns.

| Column | Description |
| :------| -----------:|
| `id`   | the unique identifier from Hacker News for the post|
| `title`| the title of the post |
| `url` | the URL the post links to, if the post has a URL |
| `num_points`| the number of points the post acquired, calculated as the total no. of upvotes minus total no. of downvotes|
| `num_comments` | the number of comments on the post |
| `author` | the username of the person who submitted the post |
| `created_at` | the date and time of the post's submission |

Posts whose titles begin with **Ask HN** or **Show HN** have a specific purpose:
- **Ask HN** posts ask the Hacker News community a specific question.
- **Show HN** posts show the Hacker News community a project, product, or something interesting.

We are interested in these two types of posts and will use them to answer two questions:
1. Do **Ask HN** or **Show HN** posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?

In [29]:
import datetime as dt
from csv import reader
# Opening the dataset which is in the form of csv

opened_file = open('HN_posts_year_to_Sep_26_2016.csv',encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)

# Displaying the first four rows of hn (the header plus three posts)
print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]


In [5]:
# Extracting the first row of data which is the header
headers = hn[0]

# Removing the first row of data from hn
hn = hn[1:]

In [8]:
# Displaying headers to check if our header is correct
print(headers)

print('\n')

# Displaying the first four rows of hn to ensure the header has been removed
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
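The header/data split above can also be done while reading. The following is a minimal sketch, not the notebook's own code: it uses a hypothetical in-memory sample (`sample_csv`, built from two rows of the output above) in place of the real file, and the names `sample_headers` and `sample_posts` are illustrative only.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file; the two data rows are
# copied from the output printed above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

# The with-block closes the file-like object automatically when done
with sample_csv as f:
    rows = csv.reader(f)
    sample_headers = next(rows)  # the first row is the header
    sample_posts = list(rows)    # every remaining row is a post

print(sample_headers)
print(len(sample_posts))
```

With a real file, the same pattern would presumably start from `open(path, encoding="utf8", newline="")` inside the `with` statement, so the file is closed even if reading fails.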


As mentioned above, we are only concerned with posts whose titles begin with **Ask HN** or **Show HN**. We will isolate them in new lists of lists containing the data for those posts.

In [9]:
ask_posts = [] # Ask HN posts
show_posts = [] # Show HN posts
other_posts = [] # Other posts

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [11]:
# Checking the number of posts in ask, show and other posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822
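To illustrate why the titles are lowercased before matching, here is a small sketch of the same prefix test wrapped in a helper. The function name `classify_title` and the last sample title are assumptions made for this example; the other titles come from the dataset output.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'algorithmic music',
    'ASK HN: a made-up title in capitals',  # hypothetical, not from the dataset
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Without the call to `lower()`, the last title would fall into the "other" group even though it is clearly an Ask HN post.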


Let's have a look at ask_posts and show_posts by printing a few rows of each list of lists.

In [15]:
print(headers)
print('\n')
print(ask_posts[:4])
print('\n')
print(show_posts[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']]


[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show 

Now, using the data in ask_posts and show_posts, let's answer the first question we raised at the beginning: do **Ask HN** or **Show HN** posts receive more comments on average?

In [23]:
# We need the average number of comments for both ask and show posts,
# so we create a function that can be reused.

def average_of_comments(dataset, index):
    '''Loops through the dataset with the mentioned
    index, adds up the number of comments, and returns
    the average'''
    total_comments = 0
    for row in dataset:
        comment = int(row[index])
        total_comments += comment
    
    average = total_comments / len(dataset)
    return average

In [28]:
# Finding the average of ask_posts comments

avg_ask_comments = average_of_comments(ask_posts, 4)
print(avg_ask_comments)

print('\n')

# Finding average of show_posts_comments
avg_show_comments = average_of_comments(show_posts, 4)
print(avg_show_comments)

10.393478498741656


4.886099625910612
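As a cross-check on a hand-written averaging function, the same calculation can be sketched with `statistics.mean` from the standard library. The sample below holds only the four Ask HN rows printed earlier, so its result is an illustration and not the dataset average; `sample_ask` and `sample_average` are names chosen for this example.

```python
from statistics import mean

# Four Ask HN rows copied from the output printed earlier (a tiny sample)
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

# Column 4 holds num_comments as text, so convert it before averaging
comment_counts = [int(row[4]) for row in sample_ask]
sample_average = mean(comment_counts)
print(sample_average)
```

Both approaches fail on an empty list (`mean` raises `StatisticsError`, the manual version raises `ZeroDivisionError`), so an empty group would need separate handling either way.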


## Finding for the first question

**Ask HN** posts receive about 10 comments on average, while **Show HN** posts receive about 5. Posts that ask the Hacker News community a question are therefore likely to receive more comments than posts that show the community a project or idea. This may be because an **Ask HN** post is a direct question to the community, and users are more likely to comment on a question or topic to engage in conversation with users who share similar interests.

Now that we have a finding for the first question, we will determine whether posts created at a certain time receive more comments on average. We will use the following steps:

1. Calculate the number of posts created in each hour of the day, along with the number of comments.
2. Calculate the average number of comments posts receive by hour created.
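The two steps can be sketched on a tiny sample before applying them to the full lists. This is a simplified illustration under stated assumptions: the `(created_at, num_comments)` pairs come from the four Ask HN rows printed earlier, and the use of `defaultdict` plus the variable names are choices made for this example.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs from the four Ask HN rows printed earlier
sample = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
]

posts_per_hour = defaultdict(int)     # hour -> number of posts
comments_per_hour = defaultdict(int)  # hour -> total number of comments

# Step 1: count the posts and add up the comments for each hour
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Step 2: divide total comments by number of posts for each hour
average_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour]
    for hour in posts_per_hour
}
print(average_per_hour)
```

Two of the sample posts fall in the 22:00 hour, so that hour's average is computed from both of them.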

In [79]:
# Creating a function to calculate the number of posts created per hour, along with the total number of comments
# Using keyword arguments with defaults because created_at (index 6) and num_comments (index 4) sit at the same column indexes in both lists

def posts_and_comments(dataset, index=6, comments=4):     
    result_list = []
    
    for row in dataset:
        created_at = row[index]
        number_of_comment = int(row[comments])
        result_list.append([created_at, number_of_comment])
    
    counts_by_hour = {} # contains number of posts created during each hour
    comments_by_hour = {} # Contains the corresponding number of comments for each hour
    
    for row in result_list:
        hour = row[0]
        comment = row[1]
        parsed_datetime = dt.datetime.strptime(hour, '%m/%d/%Y %H:%M')
        parsed_hour = parsed_datetime.strftime('%H')
        
        if parsed_hour not in counts_by_hour:
            counts_by_hour[parsed_hour] = 1
            comments_by_hour[parsed_hour] = comment
        else:
            counts_by_hour[parsed_hour] += 1
            comments_by_hour[parsed_hour] += comment
    return counts_by_hour, comments_by_hour
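The date handling inside the function depends on the format string matching the `created_at` column exactly. A short sketch of what the format codes do, using one timestamp from the dataset and one hypothetical ISO-style timestamp that does not match; the constant `DATE_FORMAT` is a name introduced for this example.

```python
import datetime as dt

# month/day/year hour:minute on a 24-hour clock, as in the created_at column
DATE_FORMAT = '%m/%d/%Y %H:%M'

parsed = dt.datetime.strptime('9/26/2016 3:26', DATE_FORMAT)
hour_text = parsed.strftime('%H')  # zero-padded string, as used for the dictionary keys
hour_number = parsed.hour          # the same hour as a plain integer

# A timestamp in a different layout raises ValueError instead of being guessed
try:
    dt.datetime.strptime('2016-09-26 03:26', DATE_FORMAT)  # hypothetical input
    iso_parsed = True
except ValueError:
    iso_parsed = False

print(hour_text, hour_number, iso_parsed)
```

Using the zero-padded string keeps keys such as `'03'` and `'15'` the same width, which makes them sort in clock order as text.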

In [80]:
# Creating the number of posts created per hour and total number of comments for ask_posts
ask_posts_by_hour, ask_comments_by_hour = posts_and_comments(ask_posts)

# Creating the number of posts created per hour and total number of comments for show_posts
show_posts_by_hour, show_comments_by_hour = posts_and_comments(show_posts)

Now that we have two dictionaries holding the number of posts by hour and the number of comments by hour, let's calculate the average number of comments for each hour for both **Ask HN** and **Show HN** posts.

In [107]:
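# Creating a function which gives the average number of comments per post for each hour
# dict_a holds the posts per hour, dict_b holds the comments per hour; the result is rounded to a whole number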

def average(dict_a, dict_b):
    average_list = []
    
    for key in dict_a:
        avg = dict_b[key]/dict_a[key]
        avg = round(avg)
        average_list.append([key, avg])
    
    return average_list
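Because `round(avg)` returns a whole number, hours with different averages can end up with the same value, and Python rounds exact halves to the nearest even integer. A small sketch of both effects; the first two inputs are made-up values, and the third is the overall Ask HN average computed earlier.

```python
# Exact halves go to the nearest even integer in Python 3
half_down = round(10.5)  # hypothetical average
half_up = round(11.5)    # hypothetical average

# Keeping two decimal places preserves more of the difference between hours
two_decimals = round(10.393478498741656, 2)

print(half_down, half_up, two_decimals)
```

Rounding to whole numbers keeps the printed table compact, at the cost of producing ties between hours whose averages are close.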

In [108]:
# Computing average for both ask and show posts
avg_ask_comments = average(ask_posts_by_hour, ask_comments_by_hour)
avg_show_comments = average(show_posts_by_hour,  show_comments_by_hour)

In [113]:
print(avg_ask_comments)
print('\n')
print(avg_show_comments)

[['02', 11], ['01', 7], ['22', 9], ['21', 9], ['19', 7], ['17', 9], ['15', 29], ['14', 10], ['13', 16], ['11', 9], ['10', 11], ['09', 7], ['07', 7], ['03', 8], ['23', 7], ['20', 9], ['16', 8], ['08', 9], ['00', 8], ['18', 8], ['12', 12], ['04', 10], ['06', 7], ['05', 9]]


[['00', 5], ['23', 5], ['20', 4], ['19', 5], ['18', 5], ['16', 5], ['14', 6], ['10', 4], ['09', 5], ['08', 6], ['06', 5], ['03', 5], ['21', 4], ['17', 4], ['15', 5], ['11', 6], ['07', 7], ['04', 5], ['13', 5], ['12', 7], ['01', 4], ['22', 4], ['02', 5], ['05', 3]]


The format above makes it difficult to identify the hours with the highest values. We will sort the list of lists and print the five highest values in a format that's easier to read.

In [114]:
swap_avg_ask_comments_by_hour = []
swap_avg_show_comments_by_hour = []

for row in avg_ask_comments:
    swap_avg_ask_comments_by_hour.append([row[1], row[0]])

for row in avg_show_comments:
    swap_avg_show_comments_by_hour.append([row[1], row[0]])
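The swap exists so that `sorted` compares the averages first. As an alternative sketch, a `key` function can sort the original `[hour, average]` pairs directly. The list below is copied from the Ask HN output printed above; the names `ask_averages` and `top_five` are assumptions for this example.

```python
# [hour, average] pairs copied from the Ask HN output printed above
ask_averages = [
    ['02', 11], ['01', 7], ['22', 9], ['21', 9], ['19', 7], ['17', 9],
    ['15', 29], ['14', 10], ['13', 16], ['11', 9], ['10', 11], ['09', 7],
    ['07', 7], ['03', 8], ['23', 7], ['20', 9], ['16', 8], ['08', 9],
    ['00', 8], ['18', 8], ['12', 12], ['04', 10], ['06', 7], ['05', 9],
]

# Sort the whole list by the average (second element), highest first,
# and only then keep the first five entries
top_five = sorted(ask_averages, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print(f"{hour}:00: {avg} average comments per post")
```

The order of operations matters here: slicing before sorting would rank only the first five entries of the list instead of the five highest averages. Ties keep their original relative order because Python's sort is stable.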

We have swapped the hour and comments in each list. Let's take a look at the average number of comments in descending order by printing only the top 5 elements of each list.

In [119]:
# Average comments for ask HN posts
print('Top 5 hours for ask posts comments.')
# Sort the whole list first, then keep the five highest averages
for row in sorted(swap_avg_ask_comments_by_hour, reverse=True)[:5]:
    hour = row[1]
    avg = row[0]
    hour_stripped = dt.datetime.strptime(hour, '%H')
    hour_stripped = hour_stripped.strftime('%H:%M:%S')
    print(f"{hour_stripped}: {avg} average comments per post")
    
print('-'*40)

# Average comments for show HN posts
print('Top 5 hours for show posts comments')
# Sort the whole list first, then keep the five highest averages
for row in sorted(swap_avg_show_comments_by_hour, reverse=True)[:5]:
    hour = row[1]
    avg = row[0]
    hour_stripped = dt.datetime.strptime(hour, '%H')
    hour_stripped = hour_stripped.strftime('%H:%M:%S')
    print(f"{hour_stripped}: {avg} average comments per post")

Top 5 hours for ask posts comments.
15:00:00: 29 average comments per post
13:00:00: 16 average comments per post
12:00:00: 12 average comments per post
10:00:00: 11 average comments per post
02:00:00: 11 average comments per post
----------------------------------------
Top 5 hours for show posts comments
12:00:00: 7 average comments per post
07:00:00: 7 average comments per post
14:00:00: 6 average comments per post
11:00:00: 6 average comments per post
08:00:00: 6 average comments per post


## Finding for the second question

Looking at the output above, **Ask HN** posts created during the 15:00 hour receive by far the most comments, about 29 per post on average. The 13:00 hour follows with about 16, and the remaining top hours (12:00, 10:00, and 02:00) average 11 to 12.

For **Show HN** posts, the averages are much flatter: the top five hours (12:00, 07:00, 14:00, 11:00, and 08:00) all average 6 to 7 comments per post.

# Conclusion

1. Ask HN posts receive about 10 comments on average, while Show HN posts receive about 5. Posts that ask the Hacker News community a question are therefore likely to receive more comments than posts that show the community a project or idea. This may be because an Ask HN post is a direct question to the community, and users are more likely to comment on a question or topic to engage in conversation with users who share similar interests.

2. Ask HN posts created during the 15:00 hour receive a significantly higher number of comments, about 29 per post on average, compared with 16 or fewer for every other hour. For Show HN posts, the hour of creation matters much less: the hourly averages range only from 3 to 7 comments per post.