## Analysis of a forum-like website

In this project the objective will be to analyze the dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The complete dataset is available [here](https://www.kaggle.com/hacker-news/hacker-news-posts), with a brief description of all the features present in it. The general structure of the first features of the dataset is provided as an example below:

| id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|--- |---|
| 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |

In [6]:
# Importing and managing the dataset #

def import_dataset(dataset):
    from csv import reader
    
    opened_file = open(dataset)
    read_file = reader(opened_file)
    dataset = list(read_file)
    return dataset[0], dataset[1:]

# Function to print the first rows of the dataset #

def print_rows(dataset, n_rows):
    for row in dataset[:n_rows]:
        print(row)
        print('\n')
        
# Code #

headers, hn = import_dataset('hacker_news.csv')
print(headers)
print_rows(hn, 5)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']




For the dataset we're specifically interested in posts whose titles begin with either _Ask HN_ or _Show HN_. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. We'll compare these two types of posts to determine if:
- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?

In [11]:
# Code #

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    if title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('The number of posts as Ask HN is: ', len(ask_posts))
print('The number of posts as Show HN is: ', len(show_posts))
print('The number of other posts is: ', len(other_posts))
print('\n')
print_rows(ask_posts, 5)
print_rows(show_posts, 5)

The number of posts as Ask HN is:  1744
The number of posts as Show HN is:  1162
The number of other posts is:  18938


['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']


['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/

This far, we have shown that the number of posts labelled as "Ask" is higher than the number of posts labelled as "Show", and they both are quite smaller in number than the rest of the webiste's posts.



Next, let's determine if ask posts or show posts receive more comments on average.

In [12]:
# Function to extract the total and average value of a given feature #

def tot_avg_feat(dataset, index_f):
    total = 0
    for row in dataset:
        value = int(row[index_f])
        total += value
    avg = total / len(dataset)
    return total, avg


# Code #

total_ask_comments, avg_ask_comments = tot_avg_feat(ask_posts, 4)
total_show_comments, avg_show_comments = tot_avg_feat(show_posts, 4)
print('The average number of comments on ask posts is: ', avg_ask_comments)
print('The average number of comments on show posts is: ', avg_show_comments)

The average number of comments on ask posts is:  14.038417431192661
The average number of comments on show posts is:  10.31669535283993


As we can see, in average the number of comments that an Ask post receive is higher than the number of comments of a Show post. The community is more active in requests from them than in showing them a project, product or something interesting.

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

    1. Calculate the amount of ask posts created in each hour of the day, along with the number of comments received.
    2. Calculate the average number of comments ask posts receive by hour created.


In [22]:
# Code #

import datetime as dt

result_list = []

for row in ask_posts:
    creation_date = row[6]
    n_comments = int(row[4])
    result_list.append([creation_date, n_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    datetime_format = "%m/%d/%Y %H:%M"
    datetime = dt.datetime.strptime(row[0], datetime_format)
    hour = datetime.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
        
print('The number of posts created by hour is:\n', sorted(counts_by_hour.items(), key=lambda item: item[1], reverse=True))
print('\n')
print('The number of comments received by hour is:\n', sorted(comments_by_hour.items(), key=lambda item: item[1], reverse=True))

The number of posts created by hour is:
 [('15', 117), ('19', 111), ('18', 110), ('21', 110), ('16', 109), ('14', 108), ('17', 101), ('13', 86), ('20', 81), ('12', 74), ('22', 72), ('23', 69), ('01', 61), ('10', 60), ('11', 59), ('02', 59), ('00', 56), ('03', 55), ('08', 49), ('04', 48), ('05', 47), ('09', 46), ('06', 45), ('07', 35)]


The number of comments received by hour is:
 [('15', 4478), ('16', 1831), ('21', 1749), ('20', 1724), ('18', 1441), ('14', 1419), ('02', 1384), ('13', 1282), ('19', 1191), ('17', 1147), ('10', 794), ('01', 716), ('12', 691), ('11', 643), ('23', 544), ('08', 497), ('05', 493), ('22', 481), ('00', 457), ('03', 422), ('06', 398), ('04', 340), ('07', 269), ('09', 257)]


Several useful information can be obtained from this extracts. The best hour to receive comments is 15:00, followed by 16:00, 21:00 and 20:00. This coincides with the main hour at which a most number of post are created: 15:00, followed by 19:00, 18:00 and 21:00.

However, we can see that the number of posts created is quite similar throughout the whole day. This is not true for tue number of comments received, as we can see a significant difference between the first 4 hours indicated and the rest of the day.

As this are absolute values, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. With that, we can have a better understanding of the global idea about the frequencies.  

In [34]:
# Code #

avg_by_hour = []

for hour_1 in counts_by_hour:
    for hour_2 in comments_by_hour:
        if hour_1 == hour_2:
            avg_by_hour.append([hour_1 , comments_by_hour[hour_1] / counts_by_hour[hour_1]])

print('The average number of comments by post created by hour is:\n', sorted(avg_by_hour, key=lambda item: item[1], reverse=True))       

The average number of comments by post created by hour is:
 [['15', 38.27350427350427], ['02', 23.45762711864407], ['20', 21.28395061728395], ['16', 16.798165137614678], ['21', 15.9], ['13', 14.906976744186046], ['10', 13.233333333333333], ['14', 13.13888888888889], ['18', 13.1], ['01', 11.737704918032787], ['17', 11.356435643564357], ['11', 10.898305084745763], ['19', 10.72972972972973], ['05', 10.48936170212766], ['08', 10.142857142857142], ['12', 9.337837837837839], ['06', 8.844444444444445], ['00', 8.160714285714286], ['23', 7.884057971014493], ['07', 7.685714285714286], ['03', 7.672727272727273], ['04', 7.083333333333333], ['22', 6.680555555555555], ['09', 5.586956521739131]]


In [49]:
# Another way to sort the list and show results #

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments in EST\n")
for row in sorted_swap[:5]:
    n_comments = row[0]
    h_format = "%H"
    hour = dt.datetime.strptime(row[1], h_format)
    hour = hour.strftime("%H:%M")
    txt = "{time}: {comments:.2f} average comments per post"
    print(txt.format(time=hour , comments=n_comments))

Top 5 Hours for Ask Posts Comments in EST

15:00: 38.27 average comments per post
02:00: 23.46 average comments per post
20:00: 21.28 average comments per post
16:00: 16.80 average comments per post
21:00: 15.90 average comments per post


The results confirm that the best hour to create a post in which the chances of receiving comments are higher is, by far, 15:00, followed by 2:00 and 20:00 in a certain distance.

This analysis throw an interesting pattern: the more activity hours inside the forum are concentrated around post-lunch time and post-dinner time. This obeys to the particular fact that the website is mostly frequented by u.s. citizens.

It is important to recall that the time zone is Eastern Time in the US, something to take into account if you plan to create a post in a different timezone. For example, for the Central European Summer Time (CEST) it would be necessary to add 6 hours to each one of the times previously calculated.

In [51]:
# Conversion to CEST #

print("Top 5 Hours for Ask Posts Comments in CEST\n")
for row in sorted_swap[:5]:
    n_comments = row[0]
    h_format = "%H"
    hour = dt.datetime.strptime(row[1], h_format)
    hour += dt.timedelta(hours=6)
    hour = hour.strftime("%H:%M")
    txt = "{time}: {comments:.2f} average comments per post"
    print(txt.format(time=hour , comments=n_comments))

Top 5 Hours for Ask Posts Comments in CEST

21:00: 38.27 average comments per post
08:00: 23.46 average comments per post
02:00: 21.28 average comments per post
22:00: 16.80 average comments per post
03:00: 15.90 average comments per post
