## Analyzing Hacker News Posts ##

In this project, we'll work with a dataset of submissions to the technology site [Hacker News](https://news.ycombinator.com/). Hacker News is a social news website focusing on computer science and entrepreneurship.

The dataset used in this project can be downloaded from [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

The dataset contains Hacker News posts from the 12 months up to September 26, 2016.

It includes the following columns:

- **id**: the identifier of the post
- **title**: the title of the post
- **url**: the URL of the item being linked to
- **num_points**: the number of upvotes the post received
- **num_comments**: the number of comments the post received
- **author**: the name of the account that made the post
- **created_at**: the date and time the post was submitted

Some titles begin with `Ask HN` or `Show HN`. `Ask HN` marks a question addressed to the Hacker News community, and `Show HN` marks a post that shows the community something. These are the two types of submissions we want to compare.

## Reading the dataset ##

The dataset is stored in *csv* format. We read it and convert it to a *list of lists*, where each inner list holds one row.
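As a minimal sketch of what this conversion produces, the snippet below parses a small in-memory CSV string instead of the real file. The sample text and the names `sample_csv` and `sample_rows` are made up for illustration; only `csv.reader` and `io.StringIO` come from the standard library.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line CSV used only for illustration
sample_csv = "id,title,num_comments\n1,Ask HN: Example question?,12\n"

# csv.reader yields one list of strings per line of input
sample_rows = list(reader(StringIO(sample_csv)))
print(sample_rows)
```

Every value comes back as a string, which is why the numeric columns are converted with `int()` later on.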

In [2]:
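from csv import reader

# Read every row of the CSV file into a list of lists.
# The with-block closes the file once reading is finished.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)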

We store the first row of the dataset in a separate variable called `headers` and keep the remaining rows in `hn`.

In [3]:
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
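The rest of the analysis accesses columns by position (`row[1]`, `row[4]`, `row[6]`). The sketch below shows one way to check which position belongs to which column. The name `column_index` is introduced here for illustration and is not used elsewhere in the notebook.

```python
# Header row copied from the output above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position within a row
column_index = {name: position for position, name in enumerate(headers)}

print(column_index['title'])         # 1
print(column_index['num_comments'])  # 4
print(column_index['created_at'])    # 6
```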


## Breaking the dataset into sub-datasets ##
As mentioned above, some titles start with `Ask HN` or `Show HN`. We divide the dataset into three groups: `ask_posts`, `show_posts` and `other_posts`. Each title is lowercased before the prefix check, so the comparison is case-insensitive.

In [4]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else: 
        other_posts.append(row)

We look at how many posts each group contains.

In [5]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
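To put these counts in proportion, the following sketch computes each group's share of all posts. It uses the three counts printed above as literal values; `group_sizes` and `total_posts` are names introduced only for this example.

```python
# Group sizes copied from the output above
group_sizes = {"ask_posts": 1744, "show_posts": 1162, "other_posts": 17194}

total_posts = sum(group_sizes.values())

# Print each group's size and its percentage of all posts
for name, size in group_sizes.items():
    share = size / total_posts * 100
    print("{}: {} posts ({:.1f}%)".format(name, size, share))
```

Ask and show posts together account for less than 15% of the rows, so both comparisons below rest on a small part of the dataset.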


## Comparison of the two datasets ##
Let's look at the average number of comments for posts in the `ask_posts` dataset.

In [6]:
total_ask_comments = 0

for row in ask_posts:
    com = int(row[4])
    total_ask_comments += com

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


Let's look at the average number of comments for posts in the `show_posts` dataset.

In [7]:
total_show_comments = 0

for row in show_posts:
    com = int(row[4])
    total_show_comments += com

avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, ask posts receive about 3.7 more comments per post than show posts (14.04 versus 10.32).
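The two cells above repeat the same loop. One possible way to avoid the duplication is a small helper function, sketched below. The function `average_comments`, its `comments_index` parameter, and the sample rows are all hypothetical; returning `0.0` for an empty group is an assumption made here, not something the notebook defines.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored at comments_index."""
    if not posts:
        return 0.0  # assumption: treat an empty group as zero instead of raising
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: Example A?', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: Example B?', '', '3', '20', 'user_b', '8/4/2016 12:10'],
]

print(average_comments(sample_posts))  # 15.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.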

## Posting time and number of comments ##

We analyze how the posting time relates to the number of comments on ask posts.

First, we count the posts and sum the comments for each hour of the day. The output below shows the total number of comments per hour.

In [11]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    com_num = int(row[4])
    result_list.append([created_at, com_num])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = hour.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else: 
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
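The hour extraction used in the cell above can be checked in isolation. The sketch below parses one timestamp taken from the sample rows printed earlier. It also shows that `strftime("%H")` returns a zero-padded string, which is why the dictionary keys look like `'04'` rather than `4`.

```python
import datetime as dt

# Timestamp copied from the fifth sample row shown earlier
created_at = "9/30/2015 4:12"

# Parse the month/day/year hour:minute layout used by the dataset
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.strftime("%H"))  # 04 (zero-padded string, as used for the dictionary keys)
print(parsed.hour)            # 4 (the same hour as an integer)
```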

Next, we compute the average number of comments per post for each hour.

In [12]:
avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

The list is unsorted, which makes the hours and average comment counts hard to compare. Sorting it by average makes the pattern easier to see.

Let's swap the hours and the averages in each pair so that `sorted()` orders the list by the average number of comments.

In [15]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [16]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")

    print("{} {:.2f} average comments per post".format(hour, row[0]))

15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post
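The swap step is one way to sort by the average. As an alternative sketch, the `key` argument of `sorted()` can sort the original `[hour, average]` pairs directly. The list `avg_by_hour_sample` below holds four pairs copied from the earlier output and is used only for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average) instead of swapping the columns
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00 {:.2f} average comments per post".format(hour, avg))
```

Both approaches give the same ordering. The `key` version keeps the pairs in their original `[hour, average]` layout.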


The average exceeds 20 comments per post at 15:00, 02:00 and 20:00.

Let's perform the same operations for show posts.

In [20]:
result_list = []

for row in show_posts:
    created_at = row[6]
    com_num = int(row[4])
    result_list.append([created_at, com_num])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = hour.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else: 
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

avg_by_hour = []

for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])
    
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H").strftime("%H:%M")

    print("{} {:.2f} average comments per post".format(hour, row[0]))

18:00 15.77 average comments per post
00:00 15.71 average comments per post
14:00 13.44 average comments per post
23:00 12.42 average comments per post
22:00 12.39 average comments per post
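Because the same steps were run once for ask posts and once for show posts, they could be wrapped in a single function. The sketch below is one possible refactoring, not the notebook's own code. The function name `top_hours`, its parameter `n`, and the sample rows are hypothetical, and it assumes the same column positions (`row[4]` for comments, `row[6]` for the timestamp).

```python
import datetime as dt

def top_hours(posts, n=5):
    """Return the n [average, hour] pairs with the highest average comments."""
    counts_by_hour = {}
    comments_by_hour = {}
    for row in posts:
        hour = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M").strftime("%H")
        counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
        comments_by_hour[hour] = comments_by_hour.get(hour, 0) + int(row[4])
    averages = [[comments_by_hour[h] / counts_by_hour[h], h] for h in counts_by_hour]
    return sorted(averages, reverse=True)[:n]

# Hypothetical rows in the dataset's column layout
sample_posts = [
    ['1', 'Show HN: A', '', '1', '4', 'u1', '8/4/2016 11:52'],
    ['2', 'Show HN: B', '', '1', '8', 'u2', '8/5/2016 11:05'],
    ['3', 'Show HN: C', '', '1', '3', 'u3', '8/5/2016 9:30'],
]

print(top_hours(sample_posts))  # [[6.0, '11'], [3.0, '09']]
```

With the real data, `top_hours(ask_posts)` and `top_hours(show_posts)` would produce the two rankings without repeating the cell.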


## Conclusion ##

It turns out that if we want to ask a question on Hacker News, the best time is between 15:00 and 16:00: ask posts created in that hour receive 38.59 comments on average, the highest value of any hour. We are more likely to get answers to a question posted during this time of day.

Interestingly, the hours at which ask posts and show posts receive the most comments are different. The peak hours for ask posts are 15:00, 02:00 and 20:00, while for show posts they are 18:00, 00:00 and 14:00.

Users may prefer to answer ask posts when they have more free time, because responding to such posts can take time. However, further analysis beyond what is done here is required to verify this hypothesis.

In their peak hours, ask posts also have higher average comment counts than show posts (38.59 versus 15.77 for the top hour of each group). This was expected, because we already found that ask posts receive more comments on average overall.