# Exploring Hacker News Posts

In this project we will explore data on posts from [Hacker News](https://news.ycombinator.com/). Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We will look at two types of posts, whose titles begin with either *Ask HN* or *Show HN*. Users submit *Ask HN* posts to ask the Hacker News community a specific question, such as "*Ask HN: How to improve my personal website?*". Likewise, users submit *Show HN* posts to show the community a project, a product, or just something interesting.

## Goals
We'll compare these two types of posts to determine the following:

 * Do Ask HN or Show HN posts receive more comments and points on average?
 * Do posts created at a certain time receive more comments and points on average?
 
## Data set
[Hacker News data set on Kaggle](https://www.kaggle.com/keplaxo/hacker-news)

## Introduction
First, we will read the file *hacker_news.csv* into a list of lists. The header row is removed in the next step.

In [3]:
#Opening file
from csv import reader

with open('hacker_news.csv', encoding="utf8", newline='') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file) #Convert to list of lists

#Display first 5 rows:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing Headers

In [4]:
headers = hn[0] #Header row (column names)
hn = hn[1:] #Remove headers from data set

print(headers)
print('\n')
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


The output above shows that the data set contains, among other things, the number of comments for each post (*num_comments* **column**) and the time when the post was created (*created_at* **column**). We will explore these columns further.
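As a side note, the header row can also be split off while reading, by taking the first row from the reader with `next()`. The sketch below is only an illustration: it reads from an in-memory string (`sample_csv`, an assumption standing in for the real file) that holds the header and one row copied from the output above.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv (assumption for this sketch):
# the header plus one row copied from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

sample_reader = csv.reader(sample_csv)
sample_headers = next(sample_reader)  # first row: column names
sample_rows = list(sample_reader)     # remaining rows: data

print(sample_headers)
print(sample_rows)
```

With this approach the data list never contains the header, so no slicing is needed afterwards.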

## Separating ask and show posts

Below, we create three empty lists and loop through each row in *hn*, appending the row to the list that matches its title: *ask_posts*, *show_posts*, or *other_posts*.

In [5]:
#Separating posts that begin with "Ask HN" and "Show HN". 

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
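The title check used in the loop can also be isolated in a small function, which makes it easy to try on individual titles. This is a minimal sketch: `classify_post` is a hypothetical helper, not part of the original notebook, and the last two titles are made up for illustration.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first title is the example from the introduction; the others are made up.
for sample_title in [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]:
    print(classify_post(sample_title), '|', sample_title)
```

Lowercasing the title first means that variants such as "ASK HN" or "ask hn" land in the same group.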


## Calculating average comments in ask and show posts

Let's check which group has the higher average number of comments.
For each group of posts, we will sum the total number of comments and divide it by the number of posts.

In [6]:
#Ask posts 
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661


In [7]:
#Show posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993


Ask posts receive more comments on average than show posts: approximately 14 comments per post, compared with around 10 for show posts. Since ask posts are more likely to receive comments, we'll focus our remaining comment analysis on these posts.
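Because the same sum-and-divide calculation is applied to both groups, it could be written once as a function. The sketch below assumes rows shaped like those in *hn*; `average_column` is a hypothetical name, and the sample rows are shortened and partly made up.

```python
def average_column(rows, column_index):
    """Return the mean of an integer-valued column, or 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[column_index]) for row in rows)
    return total / len(rows)

# Two shortened sample rows; only index 4 (num_comments) matters here.
sample_rows = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52'],
    ['10975351', 'Another post', '', '39', '10'],
]
print(average_column(sample_rows, 4))
```

In the notebook, the equivalent calls would be `average_column(ask_posts, 4)` and `average_column(show_posts, 4)`. The empty-list guard avoids a division by zero if a group has no posts.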

In [8]:
#Creating list to store 2 values: the date as a string and number of comments as an integer

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append((created_at, num_comments))   
    

## Looking for the number of Ask Posts and Comments by hour created

Let's check which hour is the best for receiving the most comments on *Ask posts*. First, we will find the number of ask posts created during each hour of the day (*counts_by_hour* **dictionary**) and the number of comments those posts received (*comments_by_hour* **dictionary**).

In [9]:
#Finding the number of ask posts created during each hour of the day
import datetime as dt

counts_by_hour = {} #contains the number of ask posts created during each hour of the day
comments_by_hour = {} #contains the corresponding number of comments ask posts created at each hour received.

for date, n_comments in result_list:
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
    
print(counts_by_hour)
print("\n")
print(comments_by_hour) 

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
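The `if hour not in ...` check can be avoided with `collections.defaultdict`, which creates missing keys with a starting value of 0. The sketch below is one possible variant, not the notebook's method; `sample_results` is made-up data in the same `(created_at, num_comments)` shape as *result_list*.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the same shape as result_list
sample_results = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 11:20", 1),
]

sample_counts = defaultdict(int)    # posts per hour; missing keys start at 0
sample_comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample_results:
    # Parse the timestamp and keep only the two-digit hour, e.g. "11"
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += n_comments

print(dict(sample_counts))
print(dict(sample_comments))
```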


## Calculating the Average Number of Comments for Ask HN Posts by Hour

In [10]:
# Calculate the average number of comments that `Ask HN` posts created at each hour of the day receive.
avg_by_hour = []

for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

In [11]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
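The same averages can be kept in a dictionary keyed by hour, built with a dictionary comprehension. This is a sketch of an alternative, shown on two entries copied from the output above (hours 15 and 02); the `sample_` names are introduced only for this illustration.

```python
# Two entries copied from counts_by_hour and comments_by_hour above
sample_counts_by_hour = {'15': 116, '02': 58}
sample_comments_by_hour = {'15': 4477, '02': 1381}

# Divide total comments by number of posts for each hour
sample_avg_by_hour = {
    hour: sample_comments_by_hour[hour] / sample_counts_by_hour[hour]
    for hour in sample_comments_by_hour
}

for hour, avg in sample_avg_by_hour.items():
    print(f'{hour}: {avg:.2f}')
```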

## Sorting average comments from highest to lowest values

In [12]:
#Create empty list to swap the items from avg_by_hour list
swap_avg_by_hour = []
for hour, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hour])
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


In [13]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#Print 5 top values 
print(sorted_swap[:5]) 

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
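Swapping the columns is one way to sort by the average. Another is to pass a `key` function to `sorted()`, which leaves each `[hour, average]` pair in its original order. The sketch below uses three pairs copied from *avg_by_hour* above; `sample_avg_by_hour` is a name introduced only for this example.

```python
# Three [hour, average] pairs copied from avg_by_hour above
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(f'{hour}: {avg:.2f}')
```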


In [14]:
for avg, h in sorted_swap[:5]:
    hour = dt.datetime.strptime(h, "%H")
    print(f'{hour.time()} {avg:.2f} average comments per post.')

15:00:00 38.59 average comments per post.
02:00:00 23.81 average comments per post.
20:00:00 21.52 average comments per post.
16:00:00 16.80 average comments per post.
21:00:00 16.01 average comments per post.


Posts created at 15:00 receive the most comments on average: 38.59 per post. The 2:00 hour is in second place with 23.81 comments per post, so the gap between the top two hours is large.

According to the data set [documentation](https://www.kaggle.com/hacker-news/hacker-news-posts), the timestamps are in US Eastern Time, so we need to convert them to our own timezone.

We will convert the hours from US Eastern Time to Central European Time (CET). For most of the year, CET is 6 hours ahead of US Eastern Time.

In [15]:
for avg, h in sorted_swap[:5]:
    hour = dt.datetime.strptime(h, "%H")
    six_hour = dt.timedelta(hours=6) #Add 6 hours to get time in Central European Standard Time timezone
    hour = hour + six_hour
    print(f'{hour.time()} {avg:.2f} average comments per post.')

21:00:00 38.59 average comments per post.
08:00:00 23.81 average comments per post.
02:00:00 21.52 average comments per post.
22:00:00 16.80 average comments per post.
03:00:00 16.01 average comments per post.
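The 6-hour shift can also be expressed with timezone-aware `datetime` objects, which makes the source and target zones explicit. The sketch below assumes fixed offsets of UTC-5 for US Eastern Time and UTC+1 for Central European Time and ignores daylight saving time; `eastern_hour_to_cet` is a hypothetical helper, and the date inside it is only a placeholder.

```python
import datetime as dt

# Fixed offsets, ignoring daylight saving time (an assumption of this sketch)
EASTERN = dt.timezone(dt.timedelta(hours=-5))
CENTRAL_EUROPEAN = dt.timezone(dt.timedelta(hours=1))

def eastern_hour_to_cet(hour_string):
    """Convert an hour such as '15' from US Eastern Time to CET, returned as 'HH:MM'."""
    # The date is a placeholder; only the hour matters for the conversion.
    eastern_time = dt.datetime(2016, 1, 1, int(hour_string), tzinfo=EASTERN)
    return eastern_time.astimezone(CENTRAL_EUROPEAN).strftime("%H:%M")

for h in ['15', '02', '20']:
    print(h, '->', eastern_hour_to_cet(h))
```

For these three hours the results agree with the converted hours printed above (21:00, 08:00 and 02:00).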


## Checking if ask posts or show posts receive more points on average

Let's explore the *num_points* **column** to check whether ask posts or show posts receive more points on average.

For the record, we have already separated the posts into the *ask_posts* and *show_posts* lists:

In [24]:
print(len(ask_posts)) 
print(len(show_posts)) 



1744
1162


Let's check which group has the higher average number of points.
We will calculate the average for each group of posts in the same way as for the number of comments.

In [23]:
#Ask posts 
total_ask_points = 0

for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
    
avg_ask_points = total_ask_points / len(ask_posts)

print(avg_ask_points)

15.061926605504587


In [25]:
#Show posts 
total_show_points = 0

for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
    
avg_show_points = total_show_points / len(show_posts)

print(avg_show_points)

27.555077452667813


Show posts receive more points than ask posts. Show posts average 27.56 points, whereas ask posts average only 15.06, a little over half as many.

Show posts often describe an interesting project or product. People award points (similar to likes on Facebook) to show appreciation for the work.

Now we will explore show posts in more detail.

In [28]:
#Creating list to store 2 values: the date as a string and number of points as an integer

result_list = []
for row in show_posts:
    created_at = row[6]
    num_points = int(row[3])
    result_list.append((created_at, num_points))    


## Looking for the number of Show Posts and Points by hour created

Previously, we found the best hour to create an ask post to increase the chance of getting more comments. We can use the same approach to look for the best time to publish a show post and get more points.

In [29]:
#Finding the number of show posts created during each hour of the day

counts_by_hour = {} #contains the number of show posts created during each hour of the day
points_by_hour = {} #contains the corresponding number of points show posts created at each hour received.

for date, n_points in result_list:
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = n_points
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += n_points
    
print(counts_by_hour)
print("\n")
print(points_by_hour) 

{'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}


{'14': 2187, '22': 1856, '18': 2215, '07': 494, '20': 1819, '05': 104, '16': 2634, '19': 1702, '15': 2228, '03': 679, '17': 2521, '06': 375, '02': 340, '13': 2438, '08': 519, '21': 866, '04': 386, '11': 1480, '12': 2543, '23': 1526, '09': 553, '01': 700, '10': 681, '00': 1173}


## Calculating the Average Number of Points for Show HN Posts by Hour

In [31]:
# Calculate the average amount of points `Show HN` posts created at each hour of the day receive.
avg_by_hour = []

for hour in points_by_hour:
    avg = points_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
print(avg_by_hour)

[['14', 25.430232558139537], ['22', 40.34782608695652], ['18', 36.31147540983606], ['07', 19.0], ['20', 30.316666666666666], ['05', 5.473684210526316], ['16', 28.322580645161292], ['19', 30.945454545454545], ['15', 28.564102564102566], ['03', 25.14814814814815], ['17', 27.107526881720432], ['06', 23.4375], ['02', 11.333333333333334], ['13', 24.626262626262626], ['08', 15.264705882352942], ['21', 18.425531914893618], ['04', 14.846153846153847], ['11', 33.63636363636363], ['12', 41.68852459016394], ['23', 42.388888888888886], ['09', 18.433333333333334], ['01', 25.0], ['10', 18.916666666666668], ['00', 37.83870967741935]]


## Sorting average points from highest to lowest values

In [32]:
#Create empty list to swap the items from avg_by_hour list
swap_avg_by_hour = []
for hour, avg in avg_by_hour:
    swap_avg_by_hour.append([avg, hour])
print(swap_avg_by_hour)

[[25.430232558139537, '14'], [40.34782608695652, '22'], [36.31147540983606, '18'], [19.0, '07'], [30.316666666666666, '20'], [5.473684210526316, '05'], [28.322580645161292, '16'], [30.945454545454545, '19'], [28.564102564102566, '15'], [25.14814814814815, '03'], [27.107526881720432, '17'], [23.4375, '06'], [11.333333333333334, '02'], [24.626262626262626, '13'], [15.264705882352942, '08'], [18.425531914893618, '21'], [14.846153846153847, '04'], [33.63636363636363, '11'], [41.68852459016394, '12'], [42.388888888888886, '23'], [18.433333333333334, '09'], [25.0, '01'], [18.916666666666668, '10'], [37.83870967741935, '00']]


In [33]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

#Print 5 top values 
print(sorted_swap[:5]) 

[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18']]


In [34]:
for avg, h in sorted_swap[:5]:
    hour = dt.datetime.strptime(h, "%H")
    print(f'{hour.time()} {avg:.2f} average points per post.')

23:00:00 42.39 average points per post.
12:00:00 41.69 average points per post.
22:00:00 40.35 average points per post.
00:00:00 37.84 average points per post.
18:00:00 36.31 average points per post.


Posts created at 23:00 receive the most points on average: 42.39 per post. The 12:00 hour is second with 41.69 points per post, followed by 22:00 with 40.35. The differences between these hours are small.

As before, the hours above are in US Eastern Time. Let's convert them to Central European Time.

In [35]:
for avg, h in sorted_swap[:5]:
    hour = dt.datetime.strptime(h, "%H")
    six_hour = dt.timedelta(hours=6) #Add 6 hours to get time in Central European Standard Time timezone
    hour = hour + six_hour
    print(f'{hour.time()} {avg:.2f} average points per post.')

05:00:00 42.39 average points per post.
18:00:00 41.69 average points per post.
04:00:00 40.35 average points per post.
06:00:00 37.84 average points per post.
00:00:00 36.31 average points per post.
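The comment analysis and the points analysis follow the same steps: group by hour, average, sort, and take the top entries. They could therefore be folded into one function. The sketch below assumes rows in the data set's column order (`num_points` at index 3, `num_comments` at index 4, `created_at` at index 6); `top_hours_by_average` is a hypothetical helper and the sample rows are made up.

```python
import datetime as dt
from collections import defaultdict

def top_hours_by_average(rows, value_index, date_index=6, limit=5):
    """Return up to `limit` (hour, average) pairs, highest average first."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:limit]

# Made-up rows in the data set's column order
sample_rows = [
    ['1', 'Show HN: A', '', '30', '4', 'a', '8/4/2016 11:52'],
    ['2', 'Show HN: B', '', '10', '2', 'b', '8/5/2016 11:05'],
    ['3', 'Show HN: C', '', '50', '6', 'c', '8/6/2016 23:40'],
]

print(top_hours_by_average(sample_rows, value_index=3))  # average points per hour
print(top_hours_by_average(sample_rows, value_index=4))  # average comments per hour
```

In the notebook, `top_hours_by_average(show_posts, 3)` and `top_hours_by_average(ask_posts, 4)` would correspond to the two top-five lists computed above, with the hours still in US Eastern Time.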


## Conclusion

In this project, we analyzed Ask HN and Show HN posts to determine which type receives more comments and points on average, and how those averages vary with the hour of creation. We found that ask posts receive more comments than show posts, while show posts receive more points. To maximize the number of comments, an ask post should be created between 21:00 and 22:00 Central European Time. To have the best chance of receiving the most points, a show post should be created between 5:00 and 6:00 or between 18:00 and 19:00 Central European Time.
