# Exploring Hacker News Posts

This is a data analysis project in which we analyze Hacker News posts. We are specifically interested in posts whose titles begin with 'Ask HN' or 'Show HN'.

We will compare these two types of posts to see whether Ask HN or Show HN posts receive more comments on average.

We will also check whether posts created at a certain time of day receive more comments on average than posts created at other times.

The dataset documentation can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

# Data Collection (Kaggle dataset)

Below we will:
- import the needed modules
- open the dataset
- read the csv file
- save the data as a list of lists
- separate the header from the data for ease of manipulation
- print the header
- print the first 5 rows to get a feel for the data

In [1]:
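from csv import reader
import datetime as dt

# Read the whole CSV into a list of lists.
# The `with` block closes the file once reading is done.
with open('hacker_news.csv', newline='') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # Header row
hn = hn[1:]      # Data rows only

print(headers)
print('\n')
print(hn[0:5])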

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


# Data Cleaning

`Modelling the data to fit our needs`

Below we will isolate the posts whose titles start with Ask HN or Show HN and store them in lists of lists.

We keep each category in its own list so the analysis stays easy to follow.

In [2]:
# storage lists
ask_posts = []
show_posts = []
other_posts = []

# NOTE: append the whole post (row), not just the title.
for row in hn:
    title = row[1] # Title is in the second column (index 1).
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Checking the number of each category
print('ask hn: ' + str(len(ask_posts)))
print('show hn: ' + str(len(show_posts)))
print('others: ' + str(len(other_posts)))

ask hn: 1744
show hn: 1162
others: 17194
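
The prefix check used above can also be written as a small reusable function. This is only a minimal sketch: the name `classify_title` and the made-up upper-case title are illustrative assumptions, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()  # lower-case once so the match ignores case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Three titles from the sample rows above, plus one made-up title
# that illustrates the case-insensitive match.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'SHOW HN: a made-up upper-case title',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other', 'show']
```

Lower-casing the title before matching is what lets variants such as 'ASK HN' or 'show hn' land in the right category.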


In [3]:
# First 5 in ask posts
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [4]:
# First 5 in show posts
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

# Exploratory Data Analysis

Below we shall determine if ask posts or show posts receive more comments on average.

In [5]:
# Find total number of comments in ask posts.
total_ask_comments = 0

for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments

# Compute the average comments in ask posts.
avg_ask_comments = total_ask_comments / len(ask_posts)
# show
print(avg_ask_comments)

14.038417431192661


In [6]:
# Find total number of comments in show posts.
total_show_comments = 0

for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments

# Compute the average comments in show posts.
avg_show_comments = total_show_comments / len(show_posts)
# show
print(avg_show_comments)

10.31669535283993


# Findings

As the computations above show, `Ask Posts` receive about 14 comments per post on average, compared with about 10 for `Show Posts`.

Therefore, from here on we will focus only on the Ask Posts for deeper analysis.

We shall now check whether posts created at a certain `time` of day tend to attract more comments.

In [7]:
result_list = []

for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    list_append = [created_at, num_comments]
    result_list.append(list_append)

# print(result_list[:10])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time = row[0]
    num_comment = row[1]
    date_time = dt.datetime.strptime(date_time, "%m/%d/%Y %H:%M")
    hour = date_time.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comment
       
print(counts_by_hour)
print('\n')
print(comments_by_hour)
print('\n')
print(len(counts_by_hour))
print(len(comments_by_hour))

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}


24
24
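
The same tally can be sketched more compactly with `collections.defaultdict`, which removes the "first time we see this hour" branch. The function name `tally_by_hour` and the last sample row are assumptions for illustration. Note that this version keys the dictionaries by integer hour (`9`) rather than by the zero-padded string (`'09'`) used in the cell above.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Count posts and sum comments per hour from [created_at, num_comments] pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).hour  # integer 0..23
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Timestamps and comment counts from the first five ask posts shown earlier,
# plus one made-up row so that one hour appears twice.
sample_pairs = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 4],  # hypothetical row
]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)    # {9: 2, 13: 1, 10: 1, 14: 1, 16: 1}
print(sample_comments)  # {9: 10, 13: 29, 10: 1, 14: 3, 16: 17}
```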


# Average comments per post in each Hour of the Day

The two dictionaries above are:

`counts_by_hour` - contains the number of ask posts created during each hour of the day.

`comments_by_hour` - contains the total number of comments received by ask posts created during each hour of the day.

We will now use these dictionaries to calculate the average number of comments per post for each hour of the day.

In [11]:
avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
 
# Testing the list of list results        
print(len(avg_by_hour)) # if 24 - complete.
print('\n')
print(avg_by_hour) 

24


[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


We now have the results we need, but they are hard to read. We will therefore swap the columns and sort the rows so that the hours with the highest averages come first.

In [13]:
# Column swap avg_by_hour
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
print('\n')
print(len(swap_avg_by_hour))

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


24


In [17]:
# Sort in descending order of average comments (first element of each row)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print(sorted_swap)
print('Top 5')
print(sorted_swap[:5])

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
Top 5
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]


In [22]:
print("Top 5 Hours for Ask Posts Comments")
print('\n')
for row in sorted_swap[:5]:
    hour = row[1]
    avg_comments = row[0]
    time_obj = dt.datetime.strptime(hour, '%H')
    time = time_obj.strftime('%H:%M')
    result = '{}: {:.2f} average comments per post'.format(time, avg_comments)
    print(result)

Top 5 Hours for Ask Posts Comments


15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
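
The column swap followed by a sort can also be done in one step by passing a `key` function to `sorted`. The sketch below is an alternative to the approach used in the notebook, not a replacement for it. The list `avg_sample` is an illustrative subset of `avg_by_hour`.

```python
# A few [hour, average] rows copied from avg_by_hour above.
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the average (index 1) directly; no column swap is needed.
top_hours = sorted(avg_sample, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

Keeping the `[hour, average]` order means the rows never have to be rebuilt, and the `key` makes the sort criterion explicit.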


# Conclusion

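Based on the results of our analysis, the `Hour of the day` with the highest average number of comments per ask post is `3:00 P.M. Eastern Time, US`.

Locally this is `3:00 A.M. PH Time` on the following day while US daylight saving time (EDT, UTC-4) is in effect, and `4:00 A.M. PH Time` during US standard time (EST, UTC-5), because Philippine time is UTC+8 all year.

The next best hours (Eastern Time) are 2 AM, 8 PM, 4 PM and lastly 9 PM.

Converting to Philippine time by swapping the meridian (PM becomes AM, or AM becomes PM) works only during daylight saving time, when the difference is exactly 12 hours. During standard time the difference is 13 hours, so one more hour must be added.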
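
The conversion can be checked with fixed UTC offsets from the standard `datetime` module. This is a minimal sketch under stated assumptions: Philippine time is UTC+8 all year, and US Eastern time is UTC-4 during daylight saving time (EDT) and UTC-5 during standard time (EST). The helper name `eastern_hour_to_pht` and the placeholder date are illustrative.

```python
import datetime as dt

# Fixed offsets (assumptions stated above); no time zone database is needed.
PHT = dt.timezone(dt.timedelta(hours=8), 'PHT')
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def eastern_hour_to_pht(hour, eastern_tz):
    """Convert an Eastern clock hour to Philippine time as 'HH:MM'."""
    # The date is an arbitrary placeholder; only the hour matters here.
    eastern = dt.datetime(2016, 1, 1, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(PHT).strftime('%H:%M')

for hour in [15, 2, 20, 16, 21]:  # top 5 hours from the analysis
    print('{:02d}:00 Eastern -> {} PHT (EDT) / {} PHT (EST)'.format(
        hour,
        eastern_hour_to_pht(hour, EDT),
        eastern_hour_to_pht(hour, EST),
    ))
```

For the top hour, 15:00 Eastern maps to 03:00 PHT under EDT and to 04:00 PHT under EST. The AM/PM swap therefore holds only while daylight saving time is in effect.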