# Exploring Hacker News Posts

This data exploration analyzes posts from the site Hacker News submitted in the 12 months leading up to September 26, 2016. The dataset can be found [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

In this project, I will use only the Python standard library, which requires no additional installation, even though a library such as pandas or Matplotlib might make this analysis more efficient.

The columns in the dataset can be described as follows:
- `id` - The unique identifier from Hacker News for the post
- `title` - The title of the post
- `url` - The URL that the post links to, if the post has a URL
- `num_points` - The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments` - The number of comments that were made on the post
- `author` - The username of the person who submitted the post
- `created_at` - The date and time at which the post was submitted
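The code later in this analysis refers to columns by position (for example, index 1 for `title`, 4 for `num_comments`, and 6 for `created_at`). As a minimal sketch, the mapping from column name to index can be derived from the header row. The `col_index` name is introduced here for illustration only and is not used elsewhere in the analysis.

```python
# Header row as it appears in the dataset
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper mapping: column name -> position in each row
col_index = {name: position for position, name in enumerate(header)}

print(col_index['title'])         # 1
print(col_index['num_comments'])  # 4
print(col_index['created_at'])    # 6
```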

We're specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project, product, or just generally something interesting.

Let's compare these two types of posts to determine the following:

- **Do `Ask HN` or `Show HN` posts receive more comments on average?**
- **Do posts created at a certain time receive more comments on average?**

First, we'll import the dataset as a list of lists and display the first five rows.

In [1]:
import csv

# opening the file in a with block so that it is closed once it has been read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf-8") as f:
    hn = list(csv.reader(f))

# saving the header separately
header = hn[0]

# removing the header from the main dataset
hn = hn[1:]

print(header)
print()
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


Now, we're going to use regular expressions to separate out the articles whose titles begin with `Ask HN` or `Show HN`:

In [2]:
import re

ask_posts, show_posts, other_posts = [], [], []

# creating our re patterns that look for articles that start with the appropriate texts
pattern_ask = r'^Ask HN'
pattern_show = r'^Show HN'

# filtering out articles into 3 different types using list comprehensions and regex
ask_posts = [row for row in hn if re.search(pattern_ask, row[1], flags = re.I)]
show_posts = [row for row in hn if re.search(pattern_show, row[1], flags = re.I)]
other_posts = [row for row in hn if not (re.search(pattern_ask, row[1], flags = re.I) or re.search(pattern_show, row[1], flags = re.I))]

# test printing some titles:
print('Ask HN articles:', len(ask_posts), '\n', [row[1] for row in ask_posts[0:5]], '\n')
print('Show HN articles:', len(show_posts), '\n', [row[1] for row in show_posts[0:5]], '\n')
print('Other articles:', len(other_posts), '\n', [row[1] for row in other_posts[0:5]], '\n')

Ask HN articles: 9139 
 ['Ask HN: What TLD do you use for local development?', 'Ask HN: How do you pass on your work when you die?', 'Ask HN: How a DNS problem can be limited to a geographic region?', 'Ask HN: Why join a fund when you can be an angel?', 'Ask HN: Someone uses stock trading as passive income?'] 

Show HN articles: 10158 
 ['Show HN: Finding puns computationally', 'Show HN: A simple library for complicated animations', 'Show HN: WebGL visualization of DNA sequences', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'Show HN: Jumble  Essays on the go #PaulInYourPocket'] 

Other articles: 273822 
 ['You have two days to comment if you want stem cells to be classified as your own', 'SQLAR  the SQLite Archiver', 'What if we just printed a flatscreen television on the side of our boxes?', 'algorithmic music', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake'] 
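To see what the two patterns are doing, here is a small sketch on made-up titles. The sample titles and the `classify` helper are illustrative and not part of the dataset or the analysis. The `^` anchor only matches at the start of the title, and `re.I` makes the match case-insensitive.

```python
import re

def classify(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    if re.search(r'^Ask HN', title, flags=re.I):
        return 'ask'
    if re.search(r'^Show HN', title, flags=re.I):
        return 'show'
    return 'other'

# Made-up titles for illustration
samples = [
    'Ask HN: What editor do you use?',
    'ask hn: lowercase still matches',
    'Show HN: My weekend project',
    'Why I stopped reading Ask HN',   # phrase is not at the start
]

for title in samples:
    print(classify(title), '|', title)
```

The last sample falls into the `other` group because the phrase appears in the middle of the title rather than at the beginning.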



Now, let's analyze these different datasets and see which type of article receives more comments: `Ask HN` or `Show HN`. We will find the average number of comments for each using the `num_comments` column at index 4:

In [4]:
# calculating average number of comments in each:
avg_comments_ask = round(sum([int(row[4]) for row in ask_posts]) / len(ask_posts), 2)
avg_comments_show = round(sum([int(row[4]) for row in show_posts]) / len(show_posts), 2)

print('Ask Hacker News - average number of comments:')
print(avg_comments_ask)
print()
print('Show Hacker News - average number of comments:')
print(avg_comments_show)

Ask Hacker News - average number of comments:
10.39

Show Hacker News - average number of comments:
4.89


It's no surprise that posts ASKING the Hacker News community a question receive more comments on average than posts SHOWING the community something, since questions invite answers and responses.

**Since `Ask HN` posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.**

# Time Based Analysis

Next, we'll determine if Ask HN posts created at a certain time are more likely to attract comments. We are going to: 
- Calculate the number of Ask HN posts created in each hour of the day
- Calculate the number of comments received based on hour of creation
- Use the above information to calculate the average number of comments a single Ask HN post receives depending on the hour created.
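The three steps above can be sketched on a few made-up rows before running them on the full dataset. The `sample_posts` values are invented for illustration, and `defaultdict` is used here only to keep the sketch short.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the dataset's date format
sample_posts = [
    ('9/26/2016 3:26', '4'),
    ('9/25/2016 3:50', '6'),
    ('9/25/2016 15:05', '20'),
]

# hour -> [number of posts, total comments]
by_hour = defaultdict(lambda: [0, 0])

for created_at, num_comments in sample_posts:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    by_hour[hour][0] += 1
    by_hour[hour][1] += int(num_comments)

# average comments per post for each hour
avg_by_hour = {hour: total / count for hour, (count, total) in by_hour.items()}

print(avg_by_hour)  # {3: 5.0, 15: 20.0}
```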

Let's calculate the number of `Ask HN` posts created in each hour of the day by creating a frequency table and using the `created_at` column at index 6:

In [82]:
import datetime as dt

# initializing our frequency table which will be a dictionary that stores hours as keys, 
# and a list of [posts created, num comments for posts created in that hour] as values
posts_created_hour = {}

# creating our frequency table for posts created at each hour of the day
for row in ask_posts:
    hour = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').hour
    numcomments = int(row[4])
    if hour in posts_created_hour:
        posts_created_hour[hour][0] += 1
        posts_created_hour[hour][1] += numcomments
    else:
        posts_created_hour[hour] = [1, numcomments]

# sorting by hour, and separately by number of posts (highest first):
posts_sorted_hour = sorted(posts_created_hour.items())
posts_sorted_posts = sorted(posts_created_hour.items(), key=lambda x: x[1], reverse=True)

# creating an output print format for the table:
output = "{0:<10} {1:<8} {2:<10}"

print('Printing out our results sorted by hour:')
print(output.format("Time", "#_Posts", "#_Comments"))
for row in posts_sorted_hour:
    time = dt.datetime(2016, 1, 1, hour=row[0])
    numposts = row[1][0]
    numcomments = row[1][1]
    print(output.format(time.strftime('%I:%M %p'), numposts, numcomments))

print()
print('Printing out our results sorted by most posts:')
print(output.format("Time", "#_Posts", "#_Comments"))
for row in posts_sorted_posts:
    time = dt.datetime(2016, 1, 1, hour=row[0])
    numposts = row[1][0]
    numcomments = row[1][1]
    print(output.format(time.strftime('%I:%M %p'), numposts, numcomments))

Printing out our results sorted by hour:
Time       #_Posts  #_Comments
12:00 AM   301      2277      
01:00 AM   282      2089      
02:00 AM   269      2996      
03:00 AM   271      2154      
04:00 AM   243      2360      
05:00 AM   209      1838      
06:00 AM   234      1587      
07:00 AM   226      1585      
08:00 AM   257      2362      
09:00 AM   222      1477      
10:00 AM   282      3013      
11:00 AM   312      2797      
12:00 PM   342      4234      
01:00 PM   444      7245      
02:00 PM   513      4972      
03:00 PM   646      18525     
04:00 PM   579      4466      
05:00 PM   587      5547      
06:00 PM   614      4877      
07:00 PM   552      3954      
08:00 PM   510      4462      
09:00 PM   518      4500      
10:00 PM   383      3372      
11:00 PM   343      2297      

Printing out our results sorted by most posts:
Time       #_Posts  #_Comments
03:00 PM   646      18525     
06:00 PM   614      4877      
05:00 PM   587      5547      
04:00 PM   5

**As you can see, we have the most posts created between 3-4 pm followed by posts created between 6-7 pm.** Now, let's calculate the average number of comments based on the time the posts were created - which can be found by dividing the total number of comments by the total number of posts at a given hour.

***We want to find out whether posts created at a particular hour attract more user activity overall.***

In [61]:
# using a list comprehension to create an average comments based on hour created:
avg_comments_based_hour_created = [[key, posts_created_hour.get(key)[1] / posts_created_hour.get(key)[0]] for key in posts_created_hour]

print('Printing out our results sorted by hour:')
print(*(item for item in sorted(avg_comments_based_hour_created)), sep='\n')

print()
print('Results sorted by highest average number of comments based on time of date created:')
print(*(item for item in sorted(avg_comments_based_hour_created, key = lambda x: x[1], reverse=True)), sep='\n')

Printing out our results sorted by hour:
[0, 7.5647840531561465]
[1, 7.407801418439717]
[2, 11.137546468401487]
[3, 7.948339483394834]
[4, 9.7119341563786]
[5, 8.794258373205741]
[6, 6.782051282051282]
[7, 7.013274336283186]
[8, 9.190661478599221]
[9, 6.653153153153153]
[10, 10.684397163120567]
[11, 8.96474358974359]
[12, 12.380116959064328]
[13, 16.31756756756757]
[14, 9.692007797270955]
[15, 28.676470588235293]
[16, 7.713298791018998]
[17, 9.449744463373083]
[18, 7.94299674267101]
[19, 7.163043478260869]
[20, 8.749019607843136]
[21, 8.687258687258687]
[22, 8.804177545691905]
[23, 6.696793002915452]

Results sorted by highest average number of comments based on time of date created:
[15, 28.676470588235293]
[13, 16.31756756756757]
[12, 12.380116959064328]
[2, 11.137546468401487]
[10, 10.684397163120567]
[4, 9.7119341563786]
[14, 9.692007797270955]
[17, 9.449744463373083]
[8, 9.190661478599221]
[11, 8.96474358974359]
[22, 8.804177545691905]
[5, 8.794258373205741]
[20, 8.749019607843136

**As we can see, posts created between 3-4 pm average about 28.7 comments, nearly double the next-highest hour (1-2 pm, about 16.3), whereas posts created between 9-10 am have the lowest average (about 6.7).** One possible explanation is that people are beginning their work day around 9-10 am and starting to finish it around 3-4 pm, when they have more time to read posts and write comments on Hacker News.

Before we come to our conclusion, let's check whether any posts created during this hour are extreme outliers with very large comment counts that might skew our analysis, and whether we need to remove them.
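One quick way to check for this kind of skew, sketched here on made-up numbers, is to compare the mean with the median: a handful of very large values pulls the mean up but barely moves the median. This uses the standard-library `statistics` module and is an illustration rather than a step from the analysis itself. The factor of 5 is an arbitrary threshold chosen for the example.

```python
import statistics

# Made-up comment counts for one hour: mostly small, plus two very large threads
comment_counts = [2, 0, 5, 3, 1, 4, 900, 1000]

mean = statistics.mean(comment_counts)
median = statistics.median(comment_counts)

print('mean:  ', mean)    # 239.375
print('median:', median)  # 3.5

# A mean far above the median suggests a few posts dominate the total
# (the factor of 5 is an arbitrary cutoff for this sketch)
print('skewed:', mean > 5 * median)
```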

# Finding Outliers

Let's look more closely at the posts created from 3-4 pm by printing out the ones with the highest number of comments.

In [57]:
# filtering out all posts that are posted at 3 pm
three_pm_posts = [row for row in ask_posts if dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').hour == 15]

# sorting the posts by maximum number of comments
three_pm_posts = sorted(three_pm_posts, key = lambda x: int(x[4]), reverse = True)

# printing the top 50 posts with the highest number of comments
for row in three_pm_posts[0:50]:
    print(row[4], ':', row[1])

1007 : Ask HN: Who is hiring? (June 2016)
947 : Ask HN: Who is hiring? (August 2016)
937 : Ask HN: Who is hiring? (May 2016)
910 : Ask HN: Who is hiring? (September 2016)
898 : Ask HN: Who is hiring? (July 2016)
896 : Ask HN: Who is hiring? (November 2015)
825 : Ask HN: Who is hiring? (March 2016)
778 : Ask HN: Who is hiring? (February 2016)
720 : Ask HN: Who is hiring? (April 2016)
705 : Ask HN: Who is hiring? (October 2015)
626 : Ask HN: What are you working on and why is it awesome? Please include URL
472 : Ask HN: Who is hiring? (January 2016)
431 : Ask HN: Who is hiring? (December 2015)
283 : Ask HN: Who wants to be hired? (April 2016)
258 : Ask HN: What are some examples of beautiful software?
250 : Ask HN: Who wants to be hired? (June 2016)
210 : Ask HN: Who wants to be hired? (July 2016)
202 : Ask HN: Who wants to be hired? (March 2016)
200 : Ask HN: Freelancer? Seeking freelancer? (June 2016)
169 : Ask HN: Who wants to be hired? (May 2016)
166 : Ask HN: Who wants to be hired? 

The number of comments falls off gradually, and it looks like there aren't any extreme single outliers that we need to remove. However, the top posts of this hour are ones that ask the question `Who is hiring?`. If these `Who is hiring?` posts are mostly created between 3-4 pm, that could explain the uptick of activity in this time period. Let's build a frequency table of `Who is hiring?` posts by hour of the day:

In [85]:
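# matching on the phrase itself: an unescaped ? in a regex makes the preceding character optional instead of matching a literal "?"
hiring_posts = [row for row in ask_posts if re.search(r'Who is hiring', row[1], flags = re.I)]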
hiring_posts_created_hour = {}

# creating our frequency table for 'who is hiring' posts created at each hour of the day
for row in hiring_posts:
    hour = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').hour
    numcomments = int(row[4])
    if hour in hiring_posts_created_hour:
        hiring_posts_created_hour[hour][0] += 1
        hiring_posts_created_hour[hour][1] += numcomments
    else:
        hiring_posts_created_hour[hour] = [1, numcomments]

# sorting by hour, and separately by number of posts (highest first):
hiring_posts_sorted_hour = sorted(hiring_posts_created_hour.items())
hiring_posts_sorted_posts = sorted(hiring_posts_created_hour.items(), key=lambda x: x[1], reverse=True)

# creating an output print format for the table:
output = "{0:<10} {1:<8} {2:<10}"

print('Printing out our results of WHO IS HIRING posts sorted by hour:')
print(output.format("Time", "#_Posts", "#_Comments"))
for row in hiring_posts_sorted_hour:
    time = dt.datetime(2016, 1, 1, hour=row[0])
    numposts = row[1][0]
    numcomments = row[1][1]
    print(output.format(time.strftime('%I:%M %p'), numposts, numcomments))

print()
print('Printing out our results of WHO IS HIRING posts sorted by most posts:')
print(output.format("Time", "#_Posts", "#_Comments"))
for row in hiring_posts_sorted_posts:
    time = dt.datetime(2016, 1, 1, hour=row[0])
    numposts = row[1][0]
    numcomments = row[1][1]
    print(output.format(time.strftime('%I:%M %p'), numposts, numcomments))

Printing out our results of WHO IS HIRING posts sorted by hour:
Time       #_Posts  #_Comments
02:00 AM   2        120       
09:00 AM   1        2         
01:00 PM   2        16        
03:00 PM   13       9526      
05:00 PM   1        0         
06:00 PM   1        2         
07:00 PM   2        1         
08:00 PM   1        0         
09:00 PM   1        2         

Printing out our results of WHO IS HIRING posts sorted by most posts:
Time       #_Posts  #_Comments
03:00 PM   13       9526      
02:00 AM   2        120       
01:00 PM   2        16        
07:00 PM   2        1         
06:00 PM   1        2         
09:00 AM   1        2         
09:00 PM   1        2         
08:00 PM   1        0         
05:00 PM   1        0         


Wow! It seems that the uptick in the 3-4 pm interval is driven by the `Who is hiring` articles. Even though these articles are also posted at other times, the 13 articles posted between 3-4 pm have 9,526 comments combined, which is just over half of the 18,525 comments on all `Ask HN` posts created in that hour!
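As a quick back-of-the-envelope check, the totals from the two tables above can be combined to see how much these threads weigh on the 3 pm hour. The variable names are introduced here for illustration, and the figures are copied from the printed output.

```python
# Totals for the 3 pm hour, taken from the tables above
all_posts, all_comments = 646, 18525               # every Ask HN post at 3 pm
hiring_posts_3pm, hiring_comments_3pm = 13, 9526   # "Who is hiring" posts at 3 pm

# Share of the hour's comments that come from the hiring threads
share = hiring_comments_3pm / all_comments
print(round(share * 100, 1))    # 51.4

# Average comments per post at 3 pm once the hiring threads are excluded
remaining_avg = (all_comments - hiring_comments_3pm) / (all_posts - hiring_posts_3pm)
print(round(remaining_avg, 2))  # 14.22
```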

# Removing the Discovered Outliers

Let's remove the outliers in our data by re-running the code above and purposely filtering out "Who is hiring" posts using regular expressions and then check again to see what our averages look like by hour.
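A minimal sketch of that filtering step, run on a few made-up rows, might look like the following. The `sample_ask_posts` rows and the `filtered_posts` and `filtered_by_hour` names are assumptions for illustration. On the real data the same list comprehension would be applied to `ask_posts`.

```python
import re
import datetime as dt

# Made-up rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_ask_posts = [
    ['1', 'Ask HN: Who is hiring? (June 2016)', '', '10', '1000', 'a', '6/1/2016 15:00'],
    ['2', 'Ask HN: Best book on compilers?', '', '5', '12', 'b', '6/2/2016 15:30'],
    ['3', 'Ask HN: How do you back up your data?', '', '3', '8', 'c', '6/3/2016 9:10'],
]

# keep only the posts whose title does not contain "Who is hiring"
filtered_posts = [row for row in sample_ask_posts
                  if not re.search(r'Who is hiring', row[1], flags=re.I)]

# hour -> [posts, comments], same layout as the frequency tables above
filtered_by_hour = {}
for row in filtered_posts:
    hour = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M').hour
    counts = filtered_by_hour.setdefault(hour, [0, 0])
    counts[0] += 1
    counts[1] += int(row[4])

# average comments per post by hour, without the hiring threads
for hour, (posts, comments) in sorted(filtered_by_hour.items()):
    print(hour, round(comments / posts, 2))
```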