# Exploring Hacker News Posts

This project compares the interactions on Queries and Showcases on Hacker News. Queries are posts whose titles are prefixed with 'Ask HN', written by authors seeking answers to their questions. Posts prefixed with 'Show HN' are Showcases, where authors present something to the community, such as a project or a proposal.

Both types of posts receive replies from the Hacker News community. The objective here is to analyse which type of post garners more responses, and whether other factors such as creation time affect the number of responses received.

We will be using a snapshot of Hacker News articles captured in 2016. More details of this data set can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).


In [48]:
from csv import reader

# open the file in a context manager so that it is closed after reading
with open('hacker_news.csv') as open_file:
    hn_list = list(reader(open_file))

hn_headers = hn_list[0] # header row
hn = hn_list[1:] # actual data

print('Header columns:')
print(hn_headers)

# quick check on row integrity
malformed_row_count = 0

for i in range(len(hn)):
    if len(hn[i]) != len(hn_headers): #check for column-count mismatch
        print('Malformed row detected at index '+str(i))
        malformed_row_count += 1
print('\n{} malformed row(s) detected.'.format(malformed_row_count))

Header columns:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

0 malformed row(s) detected.


Column details for this data set are as follows:

| Column Name | Description |
|-------------|-------------|
| id | Unique ID of article |
| title | Title of the post |
| url | Hyperlink of the item being linked to |
| num_points | Number of upvotes the post received |
| num_comments | Number of comments the post received |
| author | Name of the account that made the post |
| created_at | Date and time the post was made (the time zone is Eastern Time in the US) |
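
As a side note, the columns listed above can also be read by name rather than by position. The following is a minimal sketch using `csv.DictReader` from the standard library. The two sample rows are copied from this data set and held in an in-memory string (`sample_csv`, a name introduced here for illustration) so that the sketch runs without the CSV file.

```python
import csv
import io

# Two sample rows copied from the data set; sample_csv is an illustrative name.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

# DictReader uses the header row as keys, so each row is a dict keyed by column name.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Values are still strings, so numeric columns need an explicit conversion.
    print(row['title'], int(row['num_comments']))
```

This avoids repeated `hn_headers.index(...)` lookups, at the cost of each row being a dictionary instead of a list.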

Since num_comments is a required column for our analysis, we also do a quick check to see if there are any invalid values present:

In [50]:
malformed_comment_count = 0
for i in range(len(hn)):
    if not hn[i][hn_headers.index('num_comments')].isnumeric(): #check num_comments column
        print('Malformed comment-count detected at index '+str(i))
        malformed_comment_count += 1
print('\n{} row(s) with malformed comment-count detected.'.format(malformed_comment_count))


0 row(s) with malformed comment-count detected.


Now that the data set has passed both checks, we proceed to inspect its first 5 rows:

In [2]:
print('1st 5 rows:')
for i in range(5):
    print(hn[i])

1st 5 rows:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Here, we start to divide up the data set into 3 broad categories:
- 'Ask' posts (queries)
- 'Show' posts (showcases)
- Other posts (all other articles that do not fit the above 2 categories)

As mentioned earlier, queries and showcases are prefixed with specific strings, which serve as the criterion for the categorisation logic.

In [3]:
ask_prefix = 'Ask HN'.lower()
show_prefix = 'Show HN'.lower()

ask_posts = []
show_posts = []
other_posts = []

for record in hn:
    title = record[hn_headers.index('title')].lower()
    if title.startswith(ask_prefix):
        ask_posts.append(record)
    elif title.startswith(show_prefix):
        show_posts.append(record)
    else:
        other_posts.append(record)

print('Number of \'Ask\' posts: {}'.format(len(ask_posts)))
print('Number of \'Show\' posts: {}'.format(len(show_posts)))
print('Number of other posts: {}'.format(len(other_posts)))

Number of 'Ask' posts: 1744
Number of 'Show' posts: 1162
Number of other posts: 17194


With the 3 lists built, we inspect the first 5 rows of the 'Ask' posts:

In [4]:
for i in range(5):
    print(ask_posts[i])

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


And the first 5 rows of the 'Show' posts:

In [5]:
for i in range(5):
    print(show_posts[i])

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']


With each list, we can derive the average number of comments given per type of article:

In [19]:
import statistics as stat

# derive average number of comments per 'Ask' post
avg_ask_comments = stat.mean([int(rec[hn_headers.index('num_comments')]) for rec in ask_posts])

# derive average number of comments per 'Show' post
avg_show_comments = stat.mean([int(rec[hn_headers.index('num_comments')]) for rec in show_posts])

print(' Average number of comments per \'Ask\' post: {:.2f}'.format(avg_ask_comments))
print('Average number of comments per \'Show\' post: {:.2f}'.format(avg_show_comments))

 Average number of comments per 'Ask' post: 14.04
Average number of comments per 'Show' post: 10.32


As shown above, an 'Ask' post garners about 4 more comments on average than a 'Show' post (14.04 versus 10.32). One possible explanation is that 'Ask' posts invite differing answers from members, which leads to further discussion. 'Show' posts, on the other hand, may attract one-off comments that express an opinion on the subject, with less chance of generating further discussion.
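
Because a mean can be pulled upwards by a handful of heavily commented posts, it may be worth comparing it with the median. The sketch below uses a made-up list of comment counts (`sample_counts`, not taken from the data set) purely to illustrate the difference. Whether the same effect is present in the 'Ask' and 'Show' lists has not been checked here.

```python
import statistics as stat

# Hypothetical comment counts, chosen only to illustrate skew; not real data.
sample_counts = [1, 2, 2, 3, 4, 5, 250]

mean_value = stat.mean(sample_counts)
median_value = stat.median(sample_counts)

print('Mean: {:.2f}'.format(mean_value))
print('Median: {:.2f}'.format(median_value))
```

The same two calls could be applied to the comment counts of `ask_posts` and `show_posts` to see how far the averages reported above are driven by a few large threads.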

Focusing on 'Ask' posts, we now derive the average number of comments per post for each hour of the day, to see if there is a time of day when the interaction rate is higher:

In [None]:
import datetime as dt

# collect the creation time and comment count of each 'Ask' post in a list
result_list = []
for rec in ask_posts:
    created_at_str = rec[hn_headers.index('created_at')]
    comment_count = int(rec[hn_headers.index('num_comments')])
    result_list.append([created_at_str, comment_count])

counts_by_hour = {}
comments_by_hour = {}

# With 2 dictionaries, capture the number of posts created at each hour of the day,
# and the total comment count for each hour of the day
for rec in result_list:
    created_at_str = rec[0]
    count_value = rec[1]
    dt_format = '%m/%d/%Y %H:%M' #example: 11/25/2015 14:03
    created_at_datetime = dt.datetime.strptime(created_at_str, dt_format)
    hour = created_at_datetime.strftime('%H')
    if hour not in counts_by_hour.keys():
        counts_by_hour[hour] = 1
    else:
        counts_by_hour[hour] += 1
        
    if hour not in comments_by_hour.keys():
        comments_by_hour[hour] = count_value
    else:
        comments_by_hour[hour] += count_value

# With the above 2 dictionaries, we can derive the average number of comments
# per post at the different times of the day
avg_by_hour = []

for key,value in comments_by_hour.items():
    avg_by_hour.append([key, (value / counts_by_hour[key])])

# Print raw findings out
for rec in avg_by_hour:
    print('{}: {:.2f}'.format(rec[0], rec[1]))
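
The same per-hour averaging can be written more compactly. Below is a minimal sketch, assuming rows shaped like `result_list` above (a creation-time string and a comment count). The first five sample rows reuse timestamps and counts from the 'Ask' posts shown earlier, the last row is made up so that one hour has two posts, and `average_by_hour` is a helper name introduced here for illustration.

```python
from collections import defaultdict
from datetime import datetime

def average_by_hour(rows, dt_format='%m/%d/%Y %H:%M'):
    """Return {hour: average comments per post} for [created_at, count] rows."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment total]
    for created_at_str, comment_count in rows:
        hour = datetime.strptime(created_at_str, dt_format).hour
        totals[hour][0] += 1
        totals[hour][1] += comment_count
    return {hour: total / posts for hour, (posts, total) in totals.items()}

sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 10],  # hypothetical row, added so hour 9 has two posts
]

result = average_by_hour(sample_rows)
for hour in sorted(result):
    print('{:02d}: {:.2f}'.format(hour, result[hour]))
```

Note that this sketch keys the result by integer hour, whereas the cell above uses zero-padded strings such as `'09'`.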

With the raw results in hand, we convert the hours to Singapore time to find the time of day in Singapore at which an 'Ask' post published on Hacker News is likely to garner the greatest number of comments.

In [45]:
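# US Eastern Daylight Time is GMT-4 and Singapore time is GMT+8, i.e. 12 hours apart
# (during US standard time, GMT-5, the gap is 13 hours; that case is not handled here)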
avg_by_sgp_hour = []
for rec in avg_by_hour:
    eastern_time_str_value = rec[0]
    eastern_time_int_value = int(eastern_time_str_value) # convert zero-padded strings to int
    sgp_time_int_value = (eastern_time_int_value + 12) % 24 # account for time-diff
    sgp_time_str_value = '{0:02d}'.format(sgp_time_int_value) # convert back to zero-padded string
    avg_by_sgp_hour.append([sgp_time_str_value, eastern_time_str_value, rec[1]]) # put results in new list

# sort the list according to time of day
sorted_avg_by_sgp_hour = sorted(avg_by_sgp_hour, key=lambda x: x[0])

# print out sorted results
print('Average number of comments per \'Ask\' post at hour of day (GMT+8):')
for rec in sorted_avg_by_sgp_hour:
    print('{}:00 SST ({}:00 EST): {:.0f}'.format(rec[0], rec[1], rec[2]))

Average number of comments per 'Ask' post at hour of day (GMT+8):
00:00 SST (12:00 EST): 9
01:00 SST (13:00 EST): 15
02:00 SST (14:00 EST): 13
03:00 SST (15:00 EST): 39
04:00 SST (16:00 EST): 17
05:00 SST (17:00 EST): 11
06:00 SST (18:00 EST): 13
07:00 SST (19:00 EST): 11
08:00 SST (20:00 EST): 22
09:00 SST (21:00 EST): 16
10:00 SST (22:00 EST): 7
11:00 SST (23:00 EST): 8
12:00 SST (00:00 EST): 8
13:00 SST (01:00 EST): 11
14:00 SST (02:00 EST): 24
15:00 SST (03:00 EST): 8
16:00 SST (04:00 EST): 7
17:00 SST (05:00 EST): 10
18:00 SST (06:00 EST): 9
19:00 SST (07:00 EST): 8
20:00 SST (08:00 EST): 10
21:00 SST (09:00 EST): 6
22:00 SST (10:00 EST): 13
23:00 SST (11:00 EST): 11


Based on the above hour-of-day tabulation, 'Ask' posts created at 3am Singapore time garner the highest average, at 39 comments per post.

This corresponds to 3pm US Eastern time. One possible explanation is that site traffic peaks in the US mid-afternoon, when many members browse content and interact with others in the community.
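
The fixed 12-hour offset used in the conversion above holds only while US daylight saving time is in effect; in winter the gap is 13 hours. As a hedged sketch, the standard-library `zoneinfo` module (Python 3.9+, and it needs a time zone database available on the system) can do the conversion per timestamp. The helper name `eastern_to_singapore` is introduced here for illustration, and the sketch assumes `created_at` values are US Eastern wall-clock times, as the column description states.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo('America/New_York')
SINGAPORE = ZoneInfo('Asia/Singapore')

def eastern_to_singapore(created_at_str, dt_format='%m/%d/%Y %H:%M'):
    """Parse a US Eastern wall-clock string and return it in Singapore time."""
    naive = datetime.strptime(created_at_str, dt_format)
    # Attach the Eastern zone (DST-aware), then convert to Singapore time.
    return naive.replace(tzinfo=EASTERN).astimezone(SINGAPORE)

summer = eastern_to_singapore('8/4/2016 15:00')   # daylight saving time in effect
winter = eastern_to_singapore('1/26/2016 15:00')  # standard time in effect

print(summer.strftime('%m/%d/%Y %H:%M'))
print(winter.strftime('%m/%d/%Y %H:%M'))
```

With these two inputs, 3pm Eastern maps to 3am in Singapore in August but to 4am in January, so grouping by the converted hour rather than adding 12 could shift some posts into a neighbouring hour.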