# Finding trends in Hacker News Posts

In this project, we will work with data from the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted posts are voted on and commented on, similar to Reddit.

We will analyse posts whose titles begin with either `Ask HN` or `Show HN`:

- `Ask HN` posts are used to ask the Hacker News community a specific question.
- `Show HN` posts are used to show the Hacker News community a project, product or something generally interesting.

The goal of this project is to compare these two types of posts to determine the following:

- Do `Ask HN` or `Show HN` receive more comments on average?
- Do posts created at a certain time receive more comments on average?

The data set is available [on Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).

It should be noted that the data set we're working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

## Opening and reading the file

Let's start by opening and reading the file and displaying the header row and the first three data rows.

In [1]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
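The cell above never closes the file it opens. A minimal sketch of the same read using a `with` block is shown here. To keep it self-contained it reads from an in-memory sample instead of `hacker_news.csv`; the helper name `read_rows` and the `utf-8` encoding mentioned in the comment are assumptions, not part of the original notebook.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus one row from the output above
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)


def read_rows(file_obj):
    """Return every row of an open CSV file as a list of lists."""
    return list(csv.reader(file_obj))


# With the real file this would look like (encoding is an assumption):
#     with open('hacker_news.csv', newline='', encoding='utf-8') as f:
#         rows = read_rows(f)
with sample_csv as f:
    rows = read_rows(f)

print(rows[0])  # header row
print(rows[1])  # first data row
```

The `with` block closes the file as soon as the rows have been read, even if an error occurs while reading.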


## Removing headers

We are going to save the row with column headers and the rest of the rows in separate variables.

In [2]:
headers = hn[0]
hn = hn[1:]

print(headers)
print('\n')
print(hn[:4])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Filtering `Ask HN` and `Show HN` posts

As we only want to analyse posts whose titles begin with `Ask HN` or `Show HN`, we will separate them from the rest and store them in two new lists of lists: `ask_posts` and `show_posts`. We will keep all the remaining posts in the list `other_posts`.

For that we will use two string methods:
- `lower` to get the lowercase version of the post title
- `startswith` to check whether the lowercase title starts with `ask hn` or `show hn`
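As a small illustration of how these two methods combine, the sketch below classifies a few made-up titles. The function name `classify_title` and the sample titles are hypothetical and only serve the example.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' depending on how the title starts."""
    title_low = title.lower()  # makes the check case-insensitive
    if title_low.startswith('ask hn'):
        return 'ask'
    if title_low.startswith('show hn'):
        return 'show'
    return 'other'


# Made-up titles, not taken from the data set
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]

for title in sample_titles:
    print(classify_title(title), '<-', title)
```

Lowercasing first means that variants such as `ASK HN` or `ask hn` are counted as well.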

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    
    title = row[1]
    title_low = title.lower()
    
    if title_low.startswith('ask hn'):
        ask_posts.append(row)
        
    elif title_low.startswith('show hn'):
        show_posts.append(row)
        
    else:
        other_posts.append(row)
        
print('Number of Ask HN posts:',len(ask_posts))
print('Number of Show HN posts:',len(show_posts))
print('Number of other posts:',len(other_posts))
print('Total posts:',len(hn))
                                                             

Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of other posts: 17194
Total posts: 20100


We have 1744 `Ask HN` posts and 1162 `Show HN` posts out of 20100 posts.

## Do `Ask HN` or `Show HN` receive more comments on average?

Now let's find out which of these two types of posts receives more comments on average.

In [4]:
#Find the total number of comments in Ask HN posts
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
    
#Calculate the average number of comments in Ask HN posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print ('Ask HN posts average comments:',avg_ask_comments)

#Find the total number of comments in Show HN posts
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
    
#Calculate the average number of comments in Show HN posts
avg_show_comments = total_show_comments / len(show_posts)
print ('Show HN posts average comments:',avg_show_comments)
    

Ask HN posts average comments: 14.038417431192661
Show HN posts average comments: 10.31669535283993


On average, `Ask HN` posts receive about 4 more comments than `Show HN` posts (14.04 versus 10.32).

However, as the following calculation shows, both types receive fewer comments on average than the other posts on Hacker News. The average for the other posts could be skewed by a few extremely popular posts.

In [5]:
#Find the total number of comments in other posts
total_other_comments = 0
for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments
    
#Calculate the average number of comments in other posts
avg_other_comments = total_other_comments / len(other_posts)
print ('Other posts average comments:',avg_other_comments)

Other posts average comments: 26.8730371059672
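One way to check whether a handful of very popular posts inflates an average is to compare the mean with the median. The sketch below uses made-up comment counts, not values from the data set, purely to show the effect.

```python
from statistics import mean, median

# Hypothetical comment counts: six ordinary posts and one extremely popular one
comment_counts = [1, 2, 2, 3, 4, 5, 250]

avg = mean(comment_counts)
mid = median(comment_counts)

print('Mean comments:', round(avg, 2))
print('Median comments:', mid)
```

A mean far above the median suggests that a few outliers dominate the average. To try this on the real data, the list could be built with `[int(post[4]) for post in other_posts]`.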


## When is the best time to create a post to receive more comments?

Since our purpose is to compare `Ask HN` and `Show HN` posts, we will focus on analysing only these posts.

We want to determine whether posts created at a certain *time* are more likely to attract comments. We will use the following strategy to do the analysis:

1. Count the posts created in each hour of the day.
2. Add up the comments received by the posts created in each hour of the day.
3. Calculate the average number of comments per post for each hour of the day.
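The three steps can be sketched on a tiny, made-up sample before applying them to the real posts. The `sample_posts` values are hypothetical, and this sketch keeps the hour as an integer, whereas the cells below keep it as a zero-padded string.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the data set's date format
sample_posts = [
    ('8/4/2016 11:52', 52),
    ('1/26/2016 11:05', 10),
    ('6/23/2016 22:20', 1),
]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts_by_hour[hour] += 1               # step 1: posts per hour
    comments_by_hour[hour] += num_comments  # step 2: comments per hour

# Step 3: average number of comments per post for each hour
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

print(avg_by_hour)
```

Using `defaultdict(int)` removes the need to check whether an hour is already a key before updating it.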

### Finding the best time to create an `Ask HN` post

We will start by analysing `Ask HN` posts.

#### Number of Posts and Comments by Hour

In [6]:
import datetime as dt
  
ask_counts_by_hour = {}
ask_comments_by_hour = {}

for post in ask_posts:
    date = post[6]
    comments = int(post[4])
    
    # Parse the date and create a datetime object
    dt_date = dt.datetime.strptime(date,"%m/%d/%Y %H:%M")
    
    # Select just the hour from the datetime object
    hour = dt_date.strftime("%H")
    
    # Update the two dictionaries that store the number of posts and comments by hour
    if hour not in ask_counts_by_hour:
        ask_counts_by_hour[hour] = 1
        ask_comments_by_hour[hour] = comments
    else:
        ask_counts_by_hour[hour] += 1
        ask_comments_by_hour[hour] += comments
        
print(ask_counts_by_hour)
print("\n")
print(ask_comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


#### Average Number of Comments by Hour

We will use the data in the two dictionaries that we have just created to create a new list of lists.

Each element of the new list will contain an hour during which posts were created and the average number of comments those posts received.

In [7]:
ask_avg_by_hour = []

for hour in ask_counts_by_hour:
    comments = ask_comments_by_hour[hour]
    counts = ask_counts_by_hour[hour]
    average =  comments / counts
    ask_avg_by_hour.append([hour,average])

ask_avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

#### Sorting the values of the list of lists

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. 

Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [8]:
# Swapping the columns of our list of lists
ask_swap_avg_by_hour = []

for row in ask_avg_by_hour:
    ask_swap_avg_by_hour.append([row[1],row[0]])
    
# Sort the list by the average number of comments in descending order
ask_sorted_swap = sorted(ask_swap_avg_by_hour,reverse=True)

print("Top 5 Hours for Ask Posts Comments")
print("\n")

template = "{hour}: {avg_com:.2f} average comments per post"

for row in ask_sorted_swap[:5]:
    
    # Format the hour
    hour = row[1]
    hour = dt.datetime.strptime(hour,"%H")
    hour = hour.strftime("%H:%M")
    
    #Format the average with 2 decimals and print the output
    output = template.format(hour = hour, avg_com = row[0])
    print(output)

Top 5 Hours for Ask Posts Comments


15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
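As an alternative to swapping the columns, `sorted` accepts a `key` function that picks the value to sort by. The sketch below applies it to a few `[hour, average]` pairs taken from the averages listed earlier and rounded to two decimals; the variable names are chosen for the example.

```python
# A few [hour, average] pairs from the averages listed earlier (rounded)
avg_by_hour = [
    ['09', 5.58],
    ['15', 38.59],
    ['02', 23.81],
    ['20', 21.52],
]

# Sort by the second column (the average) in descending order
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f"{hour}:00: {avg:.2f} average comments per post")
```

This keeps each row in its original `[hour, average]` order, so no second list is needed.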


#### Conclusion

Based on our analysis, `Ask HN` posts created at 3pm or 2am Eastern Time receive the most comments on average, so those are the best times to post. Those times correspond to 9pm and 8am in Central European Time (CET).
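The conversion to CET can be sketched with fixed UTC offsets. This assumes the data set's timestamps are in Eastern Time, as the summary at the end states, and it ignores daylight saving time on both sides; the function name `eastern_hour_to_cet` and the placeholder date are hypothetical.

```python
import datetime as dt

# Fixed offsets, ignoring daylight saving time (a simplifying assumption)
EASTERN = dt.timezone(dt.timedelta(hours=-5), 'EST')  # UTC-5
CET = dt.timezone(dt.timedelta(hours=1), 'CET')       # UTC+1


def eastern_hour_to_cet(hour):
    """Convert an hour of the day in Eastern Time to an 'HH:MM' string in CET."""
    # The date is an arbitrary placeholder; only the hour matters here
    eastern_time = dt.datetime(2016, 1, 1, hour, tzinfo=EASTERN)
    return eastern_time.astimezone(CET).strftime('%H:%M')


for hour in (15, 2):
    print(f"{hour:02d}:00 Eastern ->", eastern_hour_to_cet(hour), 'CET')
```

For dates where daylight saving rules differ between the two regions, a time zone database such as the one behind the standard `zoneinfo` module would be the safer choice.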

Now let's repeat the same analysis for `Show HN` posts.

### Finding the best time to create a `Show HN` post
#### Number of Posts and Comments by Hour

In [9]:
import datetime as dt
  
show_counts_by_hour = {}
show_comments_by_hour = {}

for post in show_posts:
    date = post[6]
    comments = int(post[4])
    
    # Parse the date and create a datetime object
    dt_date = dt.datetime.strptime(date,"%m/%d/%Y %H:%M")
    
    # Select just the hour from the datetime object
    hour = dt_date.strftime("%H")
    
    # Update the two dictionaries that store the number of posts and comments by hour
    if hour not in show_counts_by_hour:
        show_counts_by_hour[hour] = 1
        show_comments_by_hour[hour] = comments
    else:
        show_counts_by_hour[hour] += 1
        show_comments_by_hour[hour] += comments
        
print(show_counts_by_hour)
print("\n")
print(show_comments_by_hour)

{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}


{'14': 1156, '22': 570, '18': 962, '07': 299, '20': 612, '05': 58, '16': 1084, '19': 539, '15': 632, '03': 287, '17': 911, '06': 142, '02': 127, '13': 946, '08': 165, '21': 272, '04': 247, '11': 491, '12': 720, '23': 447, '09': 291, '01': 246, '10': 297, '00': 487}


#### Average Number of Comments by Hour

This time we will create the list of lists with the columns already swapped. This allows us to sort the list directly, without any previous transformation.

In [10]:
show_avg_by_hour = []

for hour in show_counts_by_hour:
    comments = show_comments_by_hour[hour]
    counts = show_counts_by_hour[hour]
    average =  comments / counts
    show_avg_by_hour.append([average,hour])

# Sort the list by the average number of comments in descending order
show_sorted = sorted(show_avg_by_hour,reverse=True)

print("Top 5 Hours for Show Posts Comments")
print("\n")

template = "{hour}: {avg_com:.2f} average comments per post"

for row in show_sorted[:5]:
    
    # Format the hour
    hour = row[1]
    hour = dt.datetime.strptime(hour,"%H")
    hour = hour.strftime("%H:%M")
    
    #Format the average with 2 decimals and print the output
    output = template.format(hour = hour, avg_com = row[0])
    print(output)

Top 5 Hours for Show Posts Comments


18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post


#### Conclusion

Based on our analysis, `Show HN` posts created at 6pm or 12am Eastern Time receive the most comments on average, so those are the best times to post. Those times correspond to 12am and 6am in Central European Time (CET).

## Summary

In this project, we have analysed data from the technology site Hacker News to find out the following:

- Do `Ask HN` or `Show HN` receive more comments on average?
- Do posts created at a certain time receive more comments on average?

After looking into the data, we have seen that `Ask HN` posts receive about 4 more comments on average than `Show HN` posts.

The posts that receive the most comments on average are the ones created between 3pm and 4pm for `Ask HN` posts and between 6pm and 7pm for `Show HN` posts, both in Eastern Time.

However, it should be noted that the data set we analysed excluded posts without any comments. Given that, it's more accurate to say that our conclusions only apply to the posts that received comments.