# Analyzing Hacker News Dataset

- author: Victor Omondi
- toc: true
- comments: true
- categories: [data-analysis, hacker-news]
- image: images/ahnd-shield.gif

![Hacker News](https://s3.amazonaws.com/dq-content/354/hacker_news.jpg)

# Introduction

## About Hacker News

Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles; stories that make it to the top of Hacker News' listings can get hundreds of thousands of visitors.

## The dataset

The dataset we will be working with is based on this [CSV](https://www.kaggle.com/hacker-news/hacker-news-posts) of Hacker News stories from September 2015 to September 2016. The columns in the dataset are explained below:

* `id`: The unique identifier from Hacker News for the story
* `title`: The title of the story
* `url`: The URL that the story links to, if the story has a URL
* `num_points`: The number of points the story acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments`: The number of comments that were made on the story
* `author`: The username of the person who submitted the story
* `created_at`: The date and time at which the story was submitted

For analysis purposes, we have reduced the dataset from the almost 300,000 rows in its original form to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Let's start by reading our Hacker News dataset into a pandas dataframe.

# Import Libraries

In [1]:
import pandas as pd
import re

# Read the Dataset

In [2]:
hn = pd.read_csv('../hacker_news.csv')
hn.head()

| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 2 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 3 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
| 4 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |


## Dataset Shape

In [3]:
hn.shape

(20099, 7)

In [4]:
hn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20099 entries, 0 to 20098
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20099 non-null  int64 
 1   title         20099 non-null  object
 2   url           17659 non-null  object
 3   num_points    20099 non-null  int64 
 4   num_comments  20099 non-null  int64 
 5   author        20099 non-null  object
 6   created_at    20099 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB


In [5]:
hn.isnull().sum()

id                 0
title              0
url             2440
num_points         0
num_comments       0
author             0
created_at         0
dtype: int64

In [6]:
hn.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 20099.0 | 11317550.0 | 696453.087424 | 10176908.0 | 10701720.0 | 11284523.0 | 11926127.0 | 12578975.0 |
| num_points | 20099.0 | 50.29663 | 107.110322 | 1.0 | 3.0 | 9.0 | 54.0 | 2553.0 |
| num_comments | 20099.0 | 24.80303 | 56.108639 | 1.0 | 1.0 | 3.0 | 21.0 | 1733.0 |


# How many times is Python mentioned in the story titles of our Hacker News dataset?

In [7]:
len([title for title in hn.title.to_list() if re.search('[Pp]ython', title)])

160

In [8]:
hn.title.str.contains('[Pp]ython').sum()

160
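Both cells above arrive at the same count by different routes. As a minimal sketch of why they agree, the snippet below runs the two approaches on a handful of made-up titles (the `sample` Series is hypothetical and not taken from the dataset):

```python
import re
import pandas as pd

# Hypothetical titles, made up for illustration only
sample = pd.Series([
    "Learning Python the hard way",
    "python 3 tips",
    "Monty and the Holy Grail",
    "Why Go beats PYTHON",  # all caps: not matched by [Pp]ython
])

pattern = r"[Pp]ython"

# Approach 1: loop over the titles and test each one with re.search
loop_count = len([t for t in sample.to_list() if re.search(pattern, t)])

# Approach 2: vectorized boolean mask; summing it counts the True values
vector_count = sample.str.contains(pattern).sum()

print(loop_count, vector_count)
```

The character set `[Pp]` only covers the first letter, so an all-caps `PYTHON` is missed by both approaches; a case-insensitive flag would be needed to catch it.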

# Titles that mention the programming language Ruby

In [9]:
hn.title[hn.title.str.contains('[Rr]uby')]

190                     Ruby on Google AppEngine Goes Beta
484           Related: Pure Ruby Relational Algebra Engine
1388     Show HN: HTTPalooza  Ruby's greatest HTTP clie...
1949     Rewriting a Ruby C Extension in Rust: How a Na...
2022     Show HN: CrashBreak  Reproduce exceptions as f...
2163                   Ruby 2.3 Is Only 4% Faster than 2.2
2306     Websocket Shootout: Clojure, C++, Elixir, Go, ...
2620                       Why Startups Use Ruby on Rails?
2645     Ask HN: Should I continue working a Ruby gem f...
3290     Ruby on Rails and the importance of being stup...
3749     Telegram.org Bot Platform Webhooks Server, for...
3874     Warp Directory (wd) unix command line tool for...
4026     OS X 10.11 Ruby / Rails users can install ther...
4163     Charles Nutter of JRuby Banned by Rubinius for...
4602     Quiz: Ruby or Rails? Matz and DHH were not abl...
5832     Show HN: An experimental Python to C#/Go/Ruby/...
6180     Shrine  A new solution for handling file uploa.

# How many titles in our dataset mention email or e-mail?

In [10]:
hn.title[hn.title.str.contains('e-?mail')]

119      Show HN: Send an email from your shell to your...
313          Disposable emails for safe spam free shopping
1361     Ask HN: Doing cold emails? helps us prove this...
1750     Protect yourself from spam, bots and phishing ...
2421                    Ashley Madison hack treating email
                               ...                        
18098    House panel looking into Reddit post about Cli...
18583    Mailgen  Generates clean, responsive HTML for ...
18847    Show HN: Crisp iOS keyboard for email and text...
19303    Ask HN: Why big email providers don't sign the...
19446    Tell HN: Secure email provider Riseup will run...
Name: title, Length: 86, dtype: object

# How many titles in our dataset have tags?

In [11]:
hn.title[hn.title.str.contains(r'\[\w+\]')]

66       Analysis of 114 propaganda sources from ISIS, ...
100      Munich Gunman Got Weapon from the Darknet [Ger...
159           File indexing and searching for Plan 9 [pdf]
162      Attack on Kunduz Trauma Centre, Afghanistan  I...
195                 [Beta] Speedtest.net  HTML5 Speed Test
                               ...                        
19763    TSA can now force you to go through body scann...
19867                       Using Pony for Fintech [video]
19947                                Swift Reversing [pdf]
19979    WSJ/Dowjones Announce Unauthorized Access Betw...
20089    Users Really Do Plug in USB Drives They Find [...
Name: title, Length: 444, dtype: object

We were able to calculate that 444 of the 20,099 Hacker News stories in our dataset contain tags. What if we wanted to find out what the text of these tags is, and how many of each are in the dataset? To do this, we'll need to use capture groups.

# Extract all of the tags from the Hacker News titles and build a frequency table of those tags

In [12]:
hn['title'].str.extract(r'\[(\w+)\]')[0].value_counts().head()

pdf       276
video     111
2015        3
audio       3
slides      2
Name: 0, dtype: int64
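The parentheses in `\[(\w+)\]` form a capture group: the brackets must be present for a match, but only the text between them is returned. A minimal sketch of the same idea using only the standard library, on made-up titles (the `titles` list is hypothetical):

```python
import re
from collections import Counter

# Hypothetical titles, made up for illustration only
titles = [
    "Intro to type theory [pdf]",
    "A talk on compilers [video]",
    "Notes on garbage collection [pdf]",
    "No tag in this title",
]

tag_pattern = re.compile(r"\[(\w+)\]")

tags = []
for title in titles:
    match = tag_pattern.search(title)
    if match:
        # group(1) is the text inside the brackets, without the brackets
        tags.append(match.group(1))

tag_counts = Counter(tags)
print(tag_counts.most_common())
```

Titles without a tag are simply skipped here, which mirrors how `value_counts()` ignores the missing values that `str.extract` produces for non-matching rows.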

In [13]:
def first_10_matches(pattern):
    """
    Return the first 10 story titles that match
    the provided regular expression
    """
    titles = hn['title']
    return titles[titles.str.contains(pattern)].head(10)

# Titles that contain Java

In [14]:
hn.title[hn.title.str.contains(r'[Jj]ava[^Ss]')]

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1840                     Adopting RxJava on the Airbnb App
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
2910                 2016 JavaOne Intel Keynote  32mn Talk
3452     What are the Differences Between Java Platform...
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
5947                                        JavaFX is dead
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.

In [15]:
hn.title[hn.title.str.contains(r'\b[Jj]ava\b')]

436      Unikernel Power Comes to Java, Node.js, Go, an...
811      Ask HN: Are there any projects or compilers wh...
1023                          Pippo  Web framework in Java
1972           Node.js vs. Java: Which Is Faster for APIs?
2093                     Java EE and Microservices in 2016
2367     Code that is valid in both PHP and Java, and p...
2493     Ask HN: I've been a java dev for a couple of y...
2751                 Eventsourcing for Java 0.4.0 released
3228                               Comparing Rust and Java
3452     What are the Differences Between Java Platform...
3627                     Friends don't let friends do Java
4273      Ask HN: Is Bloch's Effective Java Still Current?
4624     Oracle Discloses Critical Java Vulnerability i...
5461                        Lambdas (in Java 8) Screencast
5847     IntelliJ IDEA and the whole IntelliJ platform ...
6268             Oracle deprecating Java applets in Java 9
7436     Forget Guava: 5 Google Libraries Java Develope.
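The two Java searches above return different titles. The negative set `[^Ss]` requires some character after "Java", so it misses titles that end with "Java" and still accepts words such as "RxJava", while the word boundary `\b` matches "Java" only as a standalone word. A minimal sketch of the difference on made-up titles (the `titles` list is hypothetical):

```python
import re

# Hypothetical titles, made up for illustration only
titles = [
    "Why I still like Java",       # ends with "Java"
    "JavaScript fatigue is real",  # should not count as Java
    "Adopting RxJava on Android",  # "Java" inside a longer word
    "Java 9 release notes",
]

negative_set = re.compile(r"[Jj]ava[^Ss]")
word_boundary = re.compile(r"\b[Jj]ava\b")

by_negative_set = [t for t in titles if negative_set.search(t)]
by_word_boundary = [t for t in titles if word_boundary.search(t)]

print(by_negative_set)
print(by_word_boundary)
```

On this sample the negative set keeps the RxJava and Java 9 titles, while the word boundary keeps the title ending in "Java" and the Java 9 title.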

# How many titles have tags at the start versus the end of the story title in our Hacker News dataset?

In [16]:
hn.title.str.contains(r'^\[\w+\]').sum()

15

In [17]:
hn.title.str.contains(r'\[\w+\]$').sum()

417
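The anchors `^` and `$` pin a pattern to the start or the end of the string. The two counts, 15 and 417, add up to 432, fewer than the 444 tagged titles found earlier, so some tags must sit in the middle of a title. A minimal sketch on made-up titles (the `titles` list is hypothetical):

```python
import re

# Hypothetical titles, made up for illustration only
titles = [
    "[Beta] A faster speed test",
    "Swift Reversing [pdf]",
    "Using [brackets] in the middle",
]

# ^ anchors the tag to the start of the title, $ to the end
starts_with_tag = [t for t in titles if re.search(r"^\[\w+\]", t)]
ends_with_tag = [t for t in titles if re.search(r"\[\w+\]$", t)]

print(starts_with_tag)
print(ends_with_tag)
```

The third title contains a tag but matches neither anchored pattern, which is the kind of title that accounts for the gap between the counts.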

# Count the number of times that email is mentioned in story titles

In [18]:
hn.title.str.contains(r'\be\-?\s?mails?\b', flags=re.I).sum()

141
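This pattern is broader than the earlier `e-?mail` search: it ignores case, allows an optional hyphen or space after the "e", allows a plural "s", and uses word boundaries so that longer words are not counted. A minimal sketch that tries the same pattern on a few made-up strings (the `candidates` list is hypothetical):

```python
import re

email_pattern = re.compile(r"\be\-?\s?mails?\b", flags=re.I)

# Hypothetical strings, made up for illustration only
candidates = ["email", "Email", "e-mail", "e mail", "E-Mails", "emailing", "freemail"]

matched = [c for c in candidates if email_pattern.search(c)]
rejected = [c for c in candidates if not email_pattern.search(c)]

print(matched)
print(rejected)
```

"emailing" fails the trailing `\b` and "freemail" fails the leading `\b`, so both are rejected.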

We'll continue to analyze and count mentions of different programming languages in the dataset, and then we'll finish by extracting the different components of the URLs submitted to Hacker News.

# Count the number of times that SQL is mentioned in story titles

In [19]:
hn.title.str.contains(r'sql', flags=re.I).sum()

108

In [20]:
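# Keep only titles that name a SQL flavor such as "MySQL" or "PostgreSQL"
hn_sql = hn[hn.title.str.contains(r'\w+sql', flags=re.I)].copy()
# Lowercase the flavor so that "MySQL" and "mysql" are grouped together
hn_sql['flavor'] = hn_sql['title'].str.extract(r'(\w+sql)', flags=re.I)[0].str.lower()
# pivot_table averages num_comments per flavor (its default aggregation is the mean)
sql_pivot = hn_sql.pivot_table(index='flavor', values='num_comments')
sql_pivot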

| flavor | num_comments |
|---|---|
| cloudsql | 5.0 |
| memsql | 14.0 |
| mysql | 12.230769 |
| nosql | 14.529412 |
| postgresql | 25.962963 |
| sparksql | 1.0 |
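The values in this table are average comment counts per SQL flavor, because `pivot_table` aggregates with the mean unless told otherwise. A minimal sketch of the same steps on a made-up frame (the `toy` DataFrame and its numbers are hypothetical), with the aggregation spelled out explicitly:

```python
import re
import pandas as pd

# Hypothetical data, made up for illustration only
toy = pd.DataFrame({
    "title": ["MySQL tips", "PostgreSQL 9.6 released", "Why mysql is fine", "No database here"],
    "num_comments": [10, 30, 20, 5],
})

# Capture the word ending in "sql" and lowercase it; rows without a match become NaN
toy["flavor"] = toy["title"].str.extract(r"(\w+sql)", flags=re.I)[0].str.lower()

# Average the comment counts per flavor; rows with a missing flavor are left out
avg_comments = toy.pivot_table(index="flavor", values="num_comments", aggfunc="mean")
print(avg_comments)
```

Lowercasing before grouping is what merges "MySQL" and "mysql" into a single row.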


# Version of Python that is mentioned most often in our dataset

In [21]:
hn.title.str.extract(r'python ([\d\.]+)', flags=re.I)[0].value_counts().to_dict()

{'3': 10,
 '2': 3,
 '3.5': 3,
 '3.6': 2,
 '2.7': 1,
 '8': 1,
 '1.5': 1,
 '3.5.0': 1,
 '4': 1}
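One thing to keep in mind with `[\d\.]+` is that it accepts any run of digits and dots, so a sentence-ending period right after the number is captured too. A minimal sketch on made-up titles (the `titles` list is hypothetical, and the stricter pattern is only one possible alternative, not the notebook's own):

```python
import re
from collections import Counter

# Hypothetical titles, made up for illustration only
titles = [
    "Python 3.5 released",
    "Moving to python 3",
    "Python 3 is here",
    "I moved to Python 3. Here is why",
    "Pythonic code",
]

def count_versions(pattern):
    """Count the first captured version string in each title that matches."""
    compiled = re.compile(pattern, flags=re.I)
    found = [m.group(1) for m in map(compiled.search, titles) if m]
    return Counter(found)

# Pattern used in the cell above
loose_counts = count_versions(r"python ([\d\.]+)")

# Assumed alternative: digits, optionally followed by ".digits" groups
strict_counts = count_versions(r"python (\d+(?:\.\d+)*)")

print(loose_counts)
print(strict_counts)
```

With the loose pattern the fourth title contributes a separate `'3.'` entry, while the stricter pattern folds it into `'3'`.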

# C programming titles

In [22]:
hn.title[hn.title.str.contains(r'(?!<series)\bc\b(?![\.\+])', flags=re.I)]

221                   MemSQL (YC W11) Raises $36M Series C
365                       The new C standards are worth it
444            Moz raises $10m Series C from Foundry Group
521           Fuchsia: Micro kernel written in C by Google
1307             Show HN: Yupp, yet another C preprocessor
                               ...                        
18549            Show HN: An awesome C library for Windows
18649                 Python vs. C/C++ in embedded systems
18689                    Philz Coffee raises $45M Series C
19151                      Ask HN: How to learn C in 2016?
19933    Lightweight C library to parse NMEA 0183 sente...
Name: title, Length: 105, dtype: object
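Note that the results above still include "Series C" funding headlines. The group `(?!<series)` is read by the regex engine as a negative lookahead for the literal text `<series`, not as a lookbehind, so it never excludes anything here. A minimal sketch of how a negative lookbehind might be written instead, tried on made-up titles (the `titles` list and the exact form `(?<!series\s)` are assumptions, not the notebook's own code):

```python
import re

# Hypothetical titles, made up for illustration only
titles = [
    "Startup raises $36M Series C",
    "Ask HN: How to learn C in 2016?",
    "Python vs. C/C++ in embedded systems",
    "Modern C++ features",
    "Reading a C.V. critically",
]

# Pattern as written in the cell above: (?!<series) is a lookahead
as_written = re.compile(r"(?!<series)\bc\b(?![\.\+])", flags=re.I)

# Assumed fix: (?<!series\s) is a lookbehind that rejects "Series C"
with_lookbehind = re.compile(r"(?<!series\s)\bc\b(?![\.\+])", flags=re.I)

matches_as_written = [t for t in titles if as_written.search(t)]
matches_with_lookbehind = [t for t in titles if with_lookbehind.search(t)]

print(matches_as_written)
print(matches_with_lookbehind)
```

On this sample, only the lookbehind version drops the funding headline; both versions reject "C++" and "C.V." through the trailing negative lookahead.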

# Make all the different variations of "email" in the dataset uniform

In [24]:
hn['title'] = hn.title.str.replace(r'e[\-\s]?mail', 'email', flags=re.I, regex=True)
hn.title[hn.title.str.contains('email')]

119      Show HN: Send an email from your shell to your...
161      Computer Specialist Who Deleted Clinton emails...
174                                        email Apps Suck
261      emails Show Unqualified Clinton Foundation Don...
313          Disposable emails for safe spam free shopping
                               ...                        
19303    Ask HN: Why big email providers don't sign the...
19395    I used HTML email when applying for jobs, here...
19446    Tell HN: Secure email provider Riseup will run...
19838                       Petition to Open Sourcemailbox
19905    Gmail Will Soon Warn Users When emails Arrive ...
Name: title, Length: 151, dtype: object
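One row in the output above reads "Petition to Open Sourcemailbox". Most likely the original title contained "Source mailbox", and the pattern matched the "e" at the end of "Source" followed by a space and "mail". A minimal sketch of how a leading word boundary might avoid this, using `re.sub` on made-up titles (the `titles` list and the anchored pattern are assumptions, not the notebook's own code):

```python
import re

# Hypothetical titles, made up for illustration only
titles = [
    "Petition to Open Source mailbox",
    "Why E-Mail still matters",
    "Sending e mail from the shell",
]

loose = r"e[\-\s]?mail"       # pattern used in the cell above
anchored = r"\be[\-\s]?mail"  # assumed variant: the "e" must start a word

loose_result = [re.sub(loose, "email", t, flags=re.I) for t in titles]
anchored_result = [re.sub(anchored, "email", t, flags=re.I) for t in titles]

print(loose_result)
print(anchored_result)
```

Both versions normalize "E-Mail" and "e mail", but only the anchored one leaves "Open Source mailbox" untouched.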

# Extract components of URLs from our dataset

Most stories on Hacker News contain a link to an external resource. We start by extracting the domain from each URL and building a frequency table to find the most popular domains. There are over 7,000 unique domains in our dataset, so we focus on the most frequent ones at the top of the table.

In [26]:
hn.url.str.extract(r'https?://([\w\-\.]+)', flags=re.I)[0].value_counts()

github.com                1008
medium.com                 825
www.nytimes.com            525
www.theguardian.com        248
techcrunch.com             245
                          ... 
pss-camera.appspot.com       1
www.mrrrgn.com               1
ams-ix.net                   1
www.codeshare.co.uk          1
lambdaschool.com             1
Name: 0, Length: 7251, dtype: int64

In [28]:
hn_urls = hn.url.str.extract(r'(?P<protocol>\w+)://(?P<domain>[\w\.\-]+)/?(?P<path>.*)', flags=re.I)
hn_urls.head()

| | protocol | domain | path |
|---|---|---|---|
| 0 | http | www.interactivedynamicvideo.com | |
| 1 | http | www.thewire.com | entertainment/2013/04/florida-djs-april-fools-... |
| 2 | https | www.amazon.com | Technology-Ventures-Enterprise-Thomas-Byers/dp... |
| 3 | http | www.nytimes.com | 2007/11/07/movies/07stein.html?_r=0 |
| 4 | http | arstechnica.com | business/2015/10/comcast-and-other-isps-boost-... |


In [30]:
hn_urls.domain.value_counts()

github.com                1008
medium.com                 825
www.nytimes.com            525
www.theguardian.com        248
techcrunch.com             245
                          ... 
pss-camera.appspot.com       1
www.mrrrgn.com               1
ams-ix.net                   1
www.codeshare.co.uk          1
lambdaschool.com             1
Name: domain, Length: 7251, dtype: int64
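As a cross-check on the regex approach, Python's standard library can also split a URL into its components. A minimal sketch using `urllib.parse.urlparse` on the nytimes.com URL from the rows shown earlier (using `urlparse` here is a suggestion, not part of the original analysis):

```python
from urllib.parse import urlparse

# URL reassembled from the domain and path columns shown for row 3
url = "http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0"

parts = urlparse(url)

print(parts.scheme)  # protocol
print(parts.netloc)  # domain
print(parts.path)    # path, including the leading slash
print(parts.query)   # query string, kept separate from the path
```

Unlike the regex, `urlparse` keeps the leading slash on the path and separates the query string, so the two approaches are not column-for-column identical.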