
# Exploring Hacker News Posts
***

## Table of contents <a name="begin"></a>
### 1. [Introduction](#introduction)
* [Aim](#aim)
* [Project Goal](#goal)

### 2. [Opening Data](#open)


### 3. [Removing Headers from a List of Lists](#remove)


### 4. [Extracting Ask HN and Show HN Posts](#extract)


### 5. [Data Analysis](#analysis)
* [Calculating the Average Number of Comments for Ask HN and Show HN](#average)
* [Finding the number of Ask Posts and Comments by Hour Created](#hour)


### 6. [Conclusion](#conclusion)



## Introduction <a name="introduction"></a>

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.


**Aim** <a name="aim"></a>

The aim of this project is to help users identify which kind of post receives more comments, and at which hour of the day a post should be created to attract the most comments.

**Project Goal** <a name="goal"></a>

The goal of this project is to analyze Ask HN and Show HN posts to determine which type receives more comments on average, and which hour of creation is associated with the highest average number of comments.

## Opening Data <a name="open"></a>

In [5]:
from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file afterwards.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


## Removing Headers from a List of Lists <a name="remove"></a>

In [6]:
# Displaying the header in the dataset
headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [9]:
# Removing the header row from the dataset (run this cell only once)
del hn[0]
print(hn[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
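
Because the del statement changes hn in place, running that cell a second time would also remove the first data row. A minimal sketch of a non-destructive alternative using slicing is shown below. The names sample, sample_headers, and sample_rows are illustrative and not part of the notebook; the rows are copied from the output above.

```python
# Small in-memory stand-in for the full hn list (rows copied from the output above).
sample = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'],
]

# Slicing builds a new list, so sample itself is left untouched
# and the cell can be re-run without losing data rows.
sample_headers = sample[0]
sample_rows = sample[1:]

print(sample_headers)
print(len(sample_rows))
```

The trade-off is a second list in memory, which is unlikely to matter for a dataset of this size.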


## Extracting Ask HN and Show HN Posts <a name="extract"></a>

In [12]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    # startswith() is case sensitive, so compare against a lower-case copy of the title.
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    

In [21]:
print('The number of posts in ask_posts is', len(ask_posts))
print('\n')
print(ask_posts[:2])
print('\n')
print('\n')
print('The number of posts in show_posts is', len(show_posts))
print('\n')
print(show_posts[:2])
print('\n')
print('\n')
print('The number of posts in other_posts is', len(other_posts))
print('\n')
print(other_posts[:2])

The number of posts in ask_posts is 1744


[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]




The number of posts in show_posts is 1162


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]




The number of posts in other_posts is 17194


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/201
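
The prefix-based split above can also be wrapped in a small function, which makes it easy to try on a handful of rows. This is only a sketch: split_posts and sample_rows are hypothetical names introduced here, and the three rows are copied from the outputs above.

```python
def split_posts(rows):
    """Split rows into ask, show, and other posts based on the title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()  # startswith() is case sensitive
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
]

ask, show, other = split_posts(sample_rows)
print(len(ask), len(show), len(other))
```

A useful sanity check on the full dataset is that the three list lengths add up to the number of rows: 1744 + 1162 + 17194 = 20100.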

## Data Analysis <a name="analysis"></a>

First, we want to determine which type of post receives more comments on Hacker News.

### Calculating the Average Number of Comments for Ask HN and Show HN <a name="average"></a>

In [33]:
# Calculating the average number of ask post comments
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments,2))

14.04


In [35]:
# Calculating the average number of show post comments
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments,2))

10.32


***
From these results, Ask HN posts receive an average of 14.04 comments, while Show HN posts receive an average of 10.32.

Hence, Ask HN posts receive more comments on average than Show HN posts.
***
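
The two averaging cells above repeat the same logic, so it could be factored into a helper. The sketch below is an illustration only: average_comments and sample_ask are hypothetical names, and the two rows are the Ask HN posts shown earlier in the output.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comment counts stored as strings in each row."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]

# (6 + 29) / 2
print(average_comments(sample_ask))
```

Note that the helper would raise ZeroDivisionError on an empty list, which is acceptable here because both ask_posts and show_posts are non-empty.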

### Finding the number of Ask Posts and Comments by Hour Created <a name="hour"></a>

Since Ask posts receive more comments on average, we will focus the remaining analysis on these posts.

Now, we will determine whether Ask posts created at a certain time are more likely to attract comments. To do this, we perform the following steps:

1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
2. Calculate the average number of comments ask posts receive by hour created.


**Importing the datetime module**

The analysis below uses the ask_posts list.

### 1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

In [42]:
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])  # Converting number of comments to integer
    result_list.append([created_at, num_comments])
 
    
print(result_list[:10])

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17], ['9/26/2015 23:23', 1], ['4/22/2016 12:24', 4], ['11/16/2015 9:22', 1], ['2/24/2016 17:57', 1], ['6/4/2016 17:17', 2]]
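
Each entry pairs a timestamp string with a comment count. Before tallying by hour, it may help to see how one of these strings is parsed. The sketch below uses the first timestamp from the output above; the variable names hour_as_string and hour_as_int are illustrative.

```python
import datetime as dt

date_time_string = '8/16/2016 9:55'  # first created_at value in the output above

# "%m/%d/%Y %H:%M" matches month/day/year followed by a 24-hour time.
date_time = dt.datetime.strptime(date_time_string, "%m/%d/%Y %H:%M")

hour_as_string = date_time.strftime("%H")  # zero-padded string
hour_as_int = date_time.hour               # plain integer

print(hour_as_string, hour_as_int)
```

Using the zero-padded string as a dictionary key, as this notebook does, keeps the keys the same width; the integer form would work equally well and sorts numerically.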


**Creating Dictionary for Counts by Hour and Comments by Hour**

In [71]:
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date_time_string = row[0]
    date_time = dt.datetime.strptime(date_time_string, "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")
    comments = row[1]
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
print('The dictionary below contains the number of ask posts created during each hour of the day.') 
print("\n")
print(counts_by_hour)
print("\n")
print("\n")
print('The dictionary below contains the total number of comments received by ask posts created during each hour of the day.')
print("\n")
print(comments_by_hour)

The dictionary below contains the number of ask posts created during each hour of the day.


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}




The dictionary below contains the total number of comments received by ask posts created during each hour of the day.


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
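
The if/else tally above can also be written with dict.get, which supplies a default of 0 for an hour that has not been seen yet. This is a sketch of that alternative, not the code used in this notebook; the sample_ names are hypothetical and the three entries are copied from the result_list output.

```python
import datetime as dt

# Three [created_at, num_comments] entries copied from the result_list output.
sample_result_list = [
    ['8/16/2016 9:55', 6],
    ['11/16/2015 9:22', 1],
    ['11/22/2015 13:43', 29],
]

sample_counts = {}
sample_comments = {}
for created_at, num_comments in sample_result_list:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # dict.get returns 0 when the hour is not yet a key, so no if/else is needed.
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + num_comments

print(sample_counts)
print(sample_comments)
```

As a consistency check on the full result, the values in counts_by_hour above add up to 1744, which matches the number of Ask posts found earlier.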


### 2.  Calculating the Average Number of Comments for Ask HN Posts by Hour


In [88]:
avg_by_hour = []
for hour in comments_by_hour:
    # Average = total comments / number of posts created in that hour
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
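
As a quick check of one entry, the average for hour '15' can be recomputed directly from the two dictionaries printed earlier (4477 comments across 116 posts). The variable names below are illustrative.

```python
# Values for hour '15' copied from comments_by_hour and counts_by_hour.
comments_at_15 = 4477
posts_at_15 = 116

avg_at_15 = comments_at_15 / posts_at_15
print(round(avg_at_15, 2))
```

The rounded result agrees with the 38.59 reported for 15:00 in the final ranking.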

We now have the results we need, but this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that is easier to read.

**Sorting and Printing Values from a List of Lists** 

By default, the sorted() function compares lists element by element, starting with the first. We want to sort by the average number of comments, which is the second element of each list.
Hence, we swap the first and second elements of every list using a for loop.
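
An alternative to swapping, offered as a sketch rather than the approach used in this notebook: sorted() accepts a key argument that selects the value to compare, so the list can be sorted by its second element directly. The names sample_avg_by_hour and sorted_by_avg are hypothetical, and the four hour and average pairs are taken from the averages computed above.

```python
# A few [hour, average] pairs taken from the averages computed above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key selects the value to compare: here the second element of each pair.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```

Either approach yields the same ordering by average; the key version simply avoids building the intermediate swapped list.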


In [89]:
#Creating a new avg_by_hour list with swapped columns. 

swap_avg_by_hour = []
for row in avg_by_hour:
    first = row[0]
    second = row[1]
    swap_avg_by_hour.append([second,first])

swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [93]:
# Using the sorted() function to sort swap_avg_by_hour in descending order

sorted_swap = sorted(swap_avg_by_hour,reverse=True)
print('Top 5 Hours for Ask Posts Comments')
print('\n')

for row in sorted_swap[:5]:
    hour = row[1]
    avg_comment = row[0]
    
    date = dt.datetime.strptime(hour,"%H")
    time = date.strftime("%H:%M")
    
    template = "{}: {:.2f} average comments per post.".format(time,avg_comment)
    
    print(template)

Top 5 Hours for Ask Posts Comments


15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


From this analysis, we can conclude that Ask HN posts receive the most comments on average when created around 15:00 (3pm). After that, the highest averages belong to posts created in the early morning around 02:00 (2am), and in the late afternoon and evening around 16:00 (4pm), 20:00 (8pm) and 21:00 (9pm).

## Conclusion <a name="conclusion"></a>

In this project, we aimed to analyze Ask posts and Show posts to determine which type receives more comments on average, and which hour of creation is associated with the highest average number of comments.

Based on our analysis, Ask posts receive more comments on average than Show posts. We also found that Ask posts receive the most comments on average when created in the afternoon around 3:00pm–4:00pm, and in the early morning around 2:00am.

Hence, to maximize the number of comments a post receives, I would recommend creating Ask HN posts in the early morning around 2:00am or in the afternoon between 3:00pm and 4:00pm.

However, it should be noted that the dataset used for this analysis excluded posts without comments. It is therefore more accurate to say that, among posts that received comments, Ask posts received more comments on average, and Ask posts created around 3:00pm received the most comments on average.


[Back to Table of Contents](#begin)