# Finding the best time to publish a post on Hacker News to reach a larger audience

## Introduction

In this project we'll aim to find the best time to publish a post on [Hacker News](https://news.ycombinator.com/). This is a popular website where users can publish technology-related stories and vote or comment on them. We are going to explore two types of posts, **Ask HN** and **Show HN**. These are title prefixes: in an Ask HN post the user asks the community a question, and in a Show HN post the user describes a new idea or project.

We will work with the **hacker_news.csv** dataset, which can be downloaded from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts/version/1). To make our recommendation we are going to find out:
- Which type of post receives more comments on average?
- Do posts created at a certain time receive more comments on average?

### Summary of results

After analyzing the data, we conclude that the best time to publish an ask post on Hacker News is 15:00 EST, which is 21:00 CET.

For more details, please refer to the full analysis below.

## Data analysis

### Importing data and removing header
We will start by importing the **csv** module and building a list from our data.

In [3]:
from csv import reader
import datetime as dt

# Reading the whole file into a list of rows; the with-block closes the file
with open('data/hacker_news.csv', encoding="utf-8") as openObj:
    hn=list(reader(openObj))

# Separating the header from the data
headers=hn[0]
hn=hn[1:]
print(headers)
print(hn[0:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
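As a side note, the column positions used throughout this notebook (for example `row[4]` for `num_comments`) could also be addressed by name. The sketch below is only an illustration, not part of the original analysis: it feeds two of the rows printed above through `csv.DictReader` from an in-memory string, where `sample_csv` is a stand-in for the real file.

```python
import csv
import io

# Two rows copied from the output above; in the notebook this text
# would come from hacker_news.csv instead of an in-memory string.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11964716,Florida DJs May Face Felony for April Fools' Water Joke,"
    "http://www.thewire.com/entertainment/2013/04/"
    "florida-djs-april-fools-water-joke/63798/,2,1,vezycash,6/23/2016 22:20\n"
)

# DictReader consumes the header row and yields one dict per post
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Values are still strings, so numeric columns need int()
print(first["title"], int(first["num_comments"]))
```

The rest of the notebook keeps the list-of-lists layout, so the index-based access below still applies.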


Here we have imported our dataset and separated the header from the data. We printed the first few rows of our data. Now let's split our posts into 3 categories: **ask_posts**, **show_posts** and **other_posts**.

### Splitting posts into 3 categories

In [4]:
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1]
    
    # Building a list of show type posts
    if title.lower().startswith('show hn'):
        show_posts.append(row)              
    
    # Building a list of ask type posts
    elif title.lower().startswith('ask hn'):
        ask_posts.append(row)
    
    # Keeping the remaining posts in the other posts list
    else:
        other_posts.append(row)
        
print(len(show_posts),len(ask_posts),len(other_posts))

1162 1744 17194


We are interested in the **ask_posts** and **show_posts** categories. As we can see, show posts appear **1162** times in our dataset and ask posts **1744** times. The remaining **17194** posts fall into the other category.

### Counting the average number of comments per category

In [5]:
total_ask_comments=0

# Calculating the total number of comments for ask posts
for row in ask_posts:
    total_ask_comments+=int(row[4])

# Calculating the average number of comments per post
avg_ask_comments=total_ask_comments/len(ask_posts)
print(round(avg_ask_comments))

14


In [6]:
total_show_comments=0

for row in show_posts:
    total_show_comments+=int(row[4])
    
avg_show_comments=total_show_comments/len(show_posts)
print(round(avg_show_comments))

10


Here we sum the number of comments for both the **ask_posts** and **show_posts** lists, so we can compare the average number of comments per post in these two groups. As we can see, ask posts receive more comments on average than show posts (14 versus 10). Let's compare that with the other type of posts.

In [70]:
total_other_comments=0

for row in other_posts:
    total_other_comments+=int(row[4])
    
avg_other_comments=total_other_comments/len(other_posts)
print(round(avg_other_comments))

27


We can see that the most commented posts are the other type of posts. \
Note:\
We could also check the distribution here, analyzing whether any of these groups contains a few heavily commented posts that could skew our results.
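One hedged way to run the distribution check mentioned in the note is to compare the mean with the median and the maximum. The helper `summarize_comments` and the `toy_posts` rows below are made up for illustration; they only mimic the dataset's column order and are not taken from the real data.

```python
from statistics import mean, median

def summarize_comments(rows, comments_index=4):
    """Return (mean, median, max) of the comment counts in rows."""
    counts = [int(row[comments_index]) for row in rows]
    return mean(counts), median(counts), max(counts)

# Made-up rows in the dataset's column order, for illustration only
toy_posts = [
    ['1', 'Ask HN: a', '', '1', '2', 'u1', '1/1/2016 10:00'],
    ['2', 'Ask HN: b', '', '1', '3', 'u2', '1/1/2016 11:00'],
    ['3', 'Ask HN: c', '', '1', '4', 'u3', '1/1/2016 12:00'],
    ['4', 'Ask HN: d', '', '1', '91', 'u4', '1/1/2016 13:00'],
]

avg, mid, top = summarize_comments(toy_posts)
print(avg, mid, top)
```

In this toy example a single post with 91 comments pulls the mean up to 25 while the median stays at 3.5. A similar gap in `ask_posts`, `show_posts` or `other_posts` would suggest that a few top commented posts dominate the averages.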

### Counting the average number of comments for ask posts per hour 

Let's now focus on the **ask_posts** group only and calculate the number of comments received by posts created at each hour.

In [7]:
result_list=[]

# Creating the list with the date of published posts and number of comments
for post in ask_posts:
    result_list.append([post[6],int(post[4])])

# Creating two frequency tables with the number of comments and posts per hour
counts_by_hour={}
comments_by_hour={}

for row in result_list:
    timeObj=dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    hour=timeObj.strftime('%H')
    comm_num=row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=comm_num
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=comm_num

Here we have created two dictionaries. The first one holds the number of ask posts created during each hour, and the second one holds the total number of comments those posts received. Let's now calculate the average number of comments per post created during each hour.

In [27]:
avg_by_hour=[]

# Creating the list of the average comments per post in each hour
for hour in counts_by_hour:
    avg_per_post=comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour,avg_per_post])

# Sorting by hour once, after the loop, instead of on every iteration
avg_by_hour.sort()

print(avg_by_hour)

[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]
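The two frequency tables and the averaging step can also be folded into one reusable function. This is a minimal sketch under stated assumptions, not the notebook's own method: `average_by_hour` is a hypothetical helper, the rows follow the dataset's column order, and `toy_posts` is invented data.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(rows, value_index, date_index=6):
    """Map each posting hour ('00'..'23') to the mean of a numeric column."""
    totals = defaultdict(int)   # hour -> summed value
    counts = defaultdict(int)   # hour -> number of posts
    for row in rows:
        created = dt.datetime.strptime(row[date_index], '%m/%d/%Y %H:%M')
        hour = created.strftime('%H')
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    # Build the result in ascending hour order
    return {hour: totals[hour] / counts[hour] for hour in sorted(counts)}

# Invented rows in the dataset's column order, for illustration only
toy_posts = [
    ['1', 'Ask HN: a', '', '5', '10', 'u1', '8/4/2016 15:10'],
    ['2', 'Ask HN: b', '', '7', '20', 'u2', '8/5/2016 15:45'],
    ['3', 'Ask HN: c', '', '9', '4', 'u3', '8/6/2016 2:30'],
]

comment_avgs = average_by_hour(toy_posts, value_index=4)
print(comment_avgs)
```

Because the column index is a parameter, the same function could be called with `value_index=3` to average points instead of comments.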


In [28]:
swap_avg_by_hour=[]

# Swapping values in the list to sort them
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

# Sorting values in descending order
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
print(sorted_swap[0:5])

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]


In [42]:
# Editing and printing top 5 hours with the most comments
for avg,hour in sorted_swap[0:5]:
    hour_format=dt.datetime.strptime(hour,'%H')
    hour_format=hour_format.strftime('%H:%M')
    template='{}: {:.2f} average comments per post.'.format(hour_format,avg)
    print(template)

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


Based on the results, we can say that the best time to publish an ask post is 15:00: posts created during that hour received the most comments on average. One thing to note is that these hours are in the EST time zone. Let's convert them to our CET time zone by using `timedelta`.

In [52]:
# Converting the results from EST to CET (a fixed shift of 6 hours)
for avg,hour in sorted_swap[0:5]:
    hour_format=dt.datetime.strptime(hour,'%H')
    hour_format_CET=hour_format+dt.timedelta(hours=6)
    hour_format=hour_format_CET.strftime('%H:%M')
    template='{}: {:.2f} average comments per post.'.format(hour_format,avg)
    print(template)

21:00: 38.59 average comments per post.
08:00: 23.81 average comments per post.
02:00: 21.52 average comments per post.
22:00: 16.80 average comments per post.
03:00: 16.01 average comments per post.
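The six-hour shift can also be expressed with timezone-aware datetimes, which makes the source and target zones explicit. The sketch below assumes fixed offsets (EST as UTC-5, CET as UTC+1) and ignores daylight saving time; the names `EST`, `CET` and `est_hour_to_cet` are illustrative and do not appear in the original notebook.

```python
import datetime as dt

# Fixed offsets, assumed for illustration: EST = UTC-5, CET = UTC+1.
# Daylight saving time is deliberately ignored here.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
CET = dt.timezone(dt.timedelta(hours=1), 'CET')

def est_hour_to_cet(hour):
    """Convert an hour string such as '15' from EST to CET, as 'HH:MM'."""
    # strptime fills in a default date; only the time of day is used
    naive = dt.datetime.strptime(hour, '%H')
    aware = naive.replace(tzinfo=EST)
    return aware.astimezone(CET).strftime('%H:%M')

# The top five EST hours found above
for hour in ['15', '02', '20', '16', '21']:
    print(hour, '->', est_hour_to_cet(hour))
```

For dates where daylight saving time matters, the standard library's `zoneinfo` module (Python 3.9+) with named zones would be a more robust choice than fixed offsets.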


In CET, the best time to publish an ask post on **Hacker News** is 21:00. \
Now let's focus on the number of points received by different types of posts. Let's check whether show or ask posts got more points on average.

### Counting the average number of points per category

In [53]:
# Calculating the average number of points for ask posts
total_ask_points=0

for row in ask_posts:
    total_ask_points+=int(row[3])

avg_ask_points=total_ask_points/len(ask_posts)
print(round(avg_ask_points))

15


In [56]:
# Calculating the average number of points for show posts
total_show_points=0

for row in show_posts:
    total_show_points+=int(row[3])

avg_show_points=total_show_points/len(show_posts)
print(round(avg_show_points))

28


As we can clearly see, **show posts** receive more points on average (28 versus 15). Let's compare the average number of points with the other type of posts.

In [69]:
# Calculating the average number of points for other posts
total_other_points=0

for row in other_posts:
    total_other_points+=int(row[3])

avg_points_other=total_other_points/len(other_posts)
print(round(avg_points_other))

55


We can see that, on average, neither show nor ask posts can compete with the other posts. Let's now check whether the time at which a post is created has any influence on the number of points received. To do this, let's calculate the hourly distribution for **show posts**.

### Counting the average number of points for show posts per hour 

In [58]:
# Creating two frequency tables with the number of points and posts per hour
num_show_posts_hour={}
num_show_points_hour={}

for row in show_posts:
    hour_dt=dt.datetime.strptime(row[-1],'%m/%d/%Y %H:%M')
    hour=hour_dt.strftime('%H')
    if hour not in num_show_posts_hour:
        num_show_posts_hour[hour]=1
        num_show_points_hour[hour]=int(row[3])
    else:
        num_show_posts_hour[hour]+=1
        num_show_points_hour[hour]+=int(row[3])

Now let's calculate the average number of points per show post in each hour.

In [66]:
# Calculating the average number of points per show post in each hour
avg_point_hour=[]
for hour in num_show_posts_hour:
    avg_point=num_show_points_hour[hour]/num_show_posts_hour[hour]
    avg_point_hour.append([avg_point,hour])

# Sorting in descending order once, after the loop
avg_point_hour.sort(reverse=True)

# Printing the top 5 results in a tidy format
for n in avg_point_hour[0:5]:
    score=n[0]
    hour_dt=dt.datetime.strptime(n[1],'%H')
    hour=hour_dt.strftime('%H:%M')
    print('{}: {:.2f} average points per post.'.format(hour,score))


23:00: 42.39 average points per post.
12:00: 41.69 average points per post.
22:00: 40.35 average points per post.
00:00: 37.84 average points per post.
18:00: 36.31 average points per post.


As we can see, there is no strong relationship between the hour at which a show post was created and the number of points it received: the top five hourly averages all fall between roughly 36 and 42 points.

## Conclusion

After analyzing the data, we conclude that the best time to publish an ask post on Hacker News is 15:00 EST, which is 21:00 CET. Ask posts received more comments on average than show posts, which is logical: people who ask a question expect replies. Looking at the number of points gathered by these two types of posts, we can say that show posts received more positive votes on average.
