<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DBSCAN Practice

_Authors: Joseph Nelson (DC)_

---

Now that you're familiar with how DBSCAN works, let's practice it in scikit-learn.

We'll start out working with the [NHL data](https://github.com/josephnelson93/GA-DSI/blob/master/NHL_Data_GA.csv). We're going to investigate clustering teams based on their counting stats.



In [3]:
# Use this glossary of hockey terms as a
# reference guide for what the columns indicate:
#
# Statistic 	Definition
# TOI 	Time On Ice
# GF20 	Goals For while player is on ice per 20 minutes of ice time.
# GA20 	Goals Against while player is on ice per 20 minutes of ice time.
# GF% 	GF% = Goals For / (Goals For + Goals Against)
# TMGF20 	Weighted (by TOI together) average of all teammates GF20.
# TMGA20 	Weighted (by TOI together) average of all teammates GA20.
# TMGF% 	Weighted (by TOI together) average of all teammates GF%.
# OPPGF20 	Weighted (by TOI against each other) average of all opponents GF20.
# OPPGA20 	Weighted (by TOI against each other) average of all opponents GA20.
# OPPGF% 	Weighted (by TOI against each other) average of all opponents GF%.
# HARO 	Hockey Analysis Rating - Offense
# HARD 	Hockey Analysis Rating - Defense
# HART 	Hockey Analysis Rating - Total (average of HARO and HARD)
# HARO+ 	Hockey Analysis Rating - Offense - Enhanced (experimental)
# HARD+ 	Hockey Analysis Rating - Defense - Enhanced (Experimental)
# HART+ 	Hockey Analysis Rating - Total - Enhanced (Experimental, average of HARO+ and HARD+)
# SF20 	Shots For while player is on ice per 20 minutes of ice time.
# SA20 	Shots Against while player is on ice per 20 minutes of ice time.
# SF% 	SF% = Shots For / (Shots For + Shots Against)
# TMSF20 	Weighted (by TOI together) average of all teammates SF20.
# TMSA20 	Weighted (by TOI together) average of all teammates SA20.
# TMSF% 	Weighted (by TOI together) average of all teammates SF%.
# OppSF20 	Weighted (by TOI against each other) average of all opponents SF20.
# OppSA20 	Weighted (by TOI against each other) average of all opponents SA20.
# OppSFor% 	Weighted (by TOI against each other) average of all opponents SF%.
# ShotHARO 	Hockey Analysis Shot Rating - Offense
# ShotHARD 	Hockey Analysis Shot Rating - Defense
# ShotHART 	Hockey Analysis Shot Rating - Total (average of HARO and HARD)
# ShotHARO+ 	Hockey Analysis Shot Rating - Offense - Enhanced (experimental)
# ShotHARD+ 	Hockey Analysis Shot Rating - Defense - Enhanced (Experimental)
# ShotHART+ 	Hockey Analysis Shot Rating - Total - Enhanced (Experimental, average of HARO+ and HARD+)
# CorF20 	Corsi For while player is on ice per 20 minutes of ice time.
# CorA20 	Corsi Against while player is on ice per 20 minutes of ice time.
# CorF% 	CorF% = Corsi For / (Corsi For + Corsi Against)
# TMCorF20 	Weighted (by TOI together) average of all teammates CorF20.
# TMCorA20 	Weighted (by TOI together) average of all teammates CorA20.
# TMCorF% 	Weighted (by TOI together) average of all teammates CorF%.
# OppCorF20 	Weighted (by TOI against each other) average of all opponents CorF20.
# OppCorA20 	Weighted (by TOI against each other) average of all opponents CorA20.
# OppCorF% 	Weighted (by TOI against each other) average of all opponents CorF%.
# CorHARO 	Hockey Analysis Corsi Rating - Offense
# CorHARD 	Hockey Analysis Corsi Rating - Defense
# CorHART 	Hockey Analysis Corsi Rating - Total (average of HARO and HARD)
# CorHARO+ 	Hockey Analysis Corsi Rating - Offense - Enhanced (experimental)
# CorHARD+ 	Hockey Analysis Corsi Rating - Defense - Enhanced (Experimental)
# CorHART+ 	Hockey Analysis Corsi Rating - Total - Enhanced (Experimental, average of HARO+ and HARD+)
# FenF20 	Fenwick For while player is on ice per 20 minutes of ice time.
# FenA20 	Fenwick Against while player is on ice per 20 minutes of ice time.
# FenF% 	FenF% = Fenwick For / (Fenwick For + Fenwick Against)
# TMFenF20 	Weighted (by TOI together) average of all teammates FenF20.
# TMFenA20 	Weighted (by TOI together) average of all teammates FenA20.
# TMFenF% 	Weighted (by TOI together) average of all teammates FenF%.
# OppFenF20 	Weighted (by TOI against each other) average of all opponents FenF20.
# OppFenA20 	Weighted (by TOI against each other) average of all opponents FenA20.
# OppFenF% 	Weighted (by TOI against each other) average of all opponents FenF%.
# FenHARO 	Hockey Analysis Fenwick Rating - Offense
# FenHARD 	Hockey Analysis Fenwick Rating - Defense
# FenHART 	Hockey Analysis Fenwick Rating - Total (average of HARO and HARD)
# FenHARO+ 	Hockey Analysis Fenwick Rating - Offense - Enhanced (experimental)
# FenHARD+ 	Hockey Analysis Fenwick Rating - Defense - Enhanced (Experimental)
# FenHART+ 	Hockey Analysis Fenwick Rating - Total - Enhanced (Experimental, average of HARO+ and HARD+)
# Corsi 	Corsi = Shots + Missed Shots + Blocked Shots
# Fenwick 	Fenwick = Shots + Missed Shots 

In [1]:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### 1) Load our data and perform any basic cleaning and/or exploratory data analysis (EDA).

In [2]:
nhl = pd.read_csv('./datasets/nhl.csv')


In [3]:
# A:

### 2) Set up an `X` matrix to perform clustering with DBSCAN.

Let's cluster on all features except `team` and `rank`.

Make `rank` our `y` vector, which we can use to perform cluster validation. 

In [4]:
# A:
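
As a minimal sketch of this step, the cell below splits a small made-up table into features and a validation target. Only the `team` and `rank` column names come from the prompt above; the `toy` frame, its `stat_1` and `stat_2` columns, and all values are hypothetical stand-ins for the real NHL data.

```python
import pandas as pd

# Toy stand-in for the NHL table; the numeric columns and values are made up.
toy = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "rank": [1, 2, 3, 1],
    "stat_1": [10.0, 12.5, 9.0, 11.0],
    "stat_2": [3.0, 4.5, 2.0, 3.5],
})

# Features: every column except the identifier and the validation target.
X = toy.drop(columns=["team", "rank"])

# Target kept aside for cluster validation only; DBSCAN never sees it.
y = toy["rank"]

print(X.columns.tolist())
print(y.tolist())
```

The same two lines should carry over to the `nhl` frame, assuming its columns are named exactly `team` and `rank`.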

### 3) Scatterplot EDA.

Create two scatterplots. At least one axis in one of the plots should represent points (goals for and goals against). Do the scatterplots give us a general idea of how many clusters we should expect to extract with a clustering algorithm?

In [5]:
# A:

### 4) Scale our data.

Standardize the data and compare at least one of the scatterplots for the scaled data to the unscaled data above.

In [6]:
# A:
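
Scaling matters for DBSCAN because a single `eps` radius is applied to distances across all features, so a feature measured in large units can dominate the neighborhood search. The sketch below standardizes a made-up array (`X_toy` and its values are hypothetical) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (a large count next to a ratio).
X_toy = np.array([
    [1200.0, 0.48],
    [1500.0, 0.55],
    [900.0, 0.51],
    [1350.0, 0.43],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After standardizing, each column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```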

### 5) Fit a DBSCAN clusterer.

Remember to pass an `eps` and `min_samples` of your choice.

In [7]:
# A:
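
One possible shape for this step, shown on synthetic blobs rather than the NHL data. The `eps=0.3` and `min_samples=5` values are arbitrary starting points, not recommendations, and `X_blobs` is a hypothetical stand-in for your scaled feature matrix.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a scaled feature matrix.
X_blobs, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
X_blobs = StandardScaler().fit_transform(X_blobs)

# eps: radius of the neighborhood around each point.
# min_samples: points required inside that radius (the point itself counts)
# for it to be treated as a core point.
db = DBSCAN(eps=0.3, min_samples=5)
db.fit(X_blobs)

print(db.labels_[:10])
```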

### 6) Check out the assigned cluster labels.

Access the `.labels_` attribute of the fitted DBSCAN object.

In [8]:
# A:
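
DBSCAN marks noise points with the label `-1`, so the number of clusters is the number of distinct labels minus the noise label when it is present. The sketch below works on a hypothetical hand-written `labels` array of the kind `.labels_` holds, so the counts are easy to check by eye.

```python
import numpy as np

# Hypothetical label array, shaped like what a fitted DBSCAN stores in labels_.
labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2, 2])

# -1 is DBSCAN's noise label, not a cluster.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Size of each label group, noise included.
values, counts = np.unique(labels, return_counts=True)
sizes = dict(zip(values.tolist(), counts.tolist()))

print(n_clusters, n_noise)  # 3 2
print(sizes)                # {-1: 2, 0: 3, 1: 2, 2: 3}
```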

### 7) Evaluate the DBSCAN clusters.

**7.A) Check the silhouette score.**

How well separated are the clusters?

If you're up for a challenge, see how you can adjust `eps` and `min_samples` to improve them.

In [9]:
# A:
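
A rough sketch of comparing a few `eps` values by silhouette score, again on synthetic blobs. The candidate `eps` values are arbitrary, and `X_demo` is a hypothetical stand-in for your scaled data. Two caveats: `silhouette_score` raises an error unless there are at least two distinct labels, and it treats the noise label `-1` as just another cluster, which can pull the score down.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a scaled feature matrix.
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=1)
X_demo = StandardScaler().fit_transform(X_demo)

results = {}
for eps in (0.2, 0.3, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    n_labels = len(set(labels))
    # The score is only defined for 2 <= n_labels <= n_samples - 1.
    if 2 <= n_labels < len(X_demo):
        results[eps] = silhouette_score(X_demo, labels)
    else:
        results[eps] = None
    print(eps, n_labels, results[eps])
```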

**7.B) Check the homogeneity, completeness, and V-measure against the stored rank `y`.**

In [10]:
# A:
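
These three scores compare cluster assignments with known labels and ignore what the cluster IDs are actually called. A small hand-made illustration follows; `y_true` and `labels_pred` are hypothetical and chosen so that every cluster is pure while one true class is split across two clusters.

```python
from sklearn import metrics

# Hypothetical ground truth and cluster assignments (-1 is a noise point).
y_true = [0, 0, 0, 1, 1, 1]
labels_pred = [1, 1, 1, 0, 0, -1]

h, c, v = metrics.homogeneity_completeness_v_measure(y_true, labels_pred)

# Every cluster holds a single true class, so homogeneity is 1.0
# (up to floating point). Class 1 is split across two clusters,
# so completeness is below 1. V-measure is their harmonic mean.
print(h, c, v)
```

Applied to this notebook, `y_true` would be the stored `rank` vector and `labels_pred` the fitted `.labels_`, assuming `rank` is treated as a class label.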

### 8) Plot the clusters.

You can choose any two variables for the axes.

In [11]:
# A:

### 9) Fit DBSCAN on an easier data set.

Import the `make_circles()` function from `sklearn.datasets`. You can use it to create some fake clusters that DBSCAN should handle well.

Create some `X` and `y` using the function. Here is some sample code:
```python
from sklearn.datasets import make_circles
circles_X, circles_y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)
```

**9.A) Plot the fake circles data.**

In [12]:
# A:

**9.B) Scale the data and fit DBSCAN on it.**

In [13]:
# A:
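
A possible starting point for this step, reusing the `make_circles()` call from the sample code above. The `eps=0.2` and `min_samples=5` values are untuned guesses, and the names `circles_Xs`, `circles_db`, `circles_labels`, and `n_found` are introduced here for illustration; expect to adjust `eps` until the inner and outer rings separate.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler

circles_X, circles_y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

# Standardize both coordinates before clustering.
circles_Xs = StandardScaler().fit_transform(circles_X)

# eps and min_samples are starting guesses, not tuned values.
circles_db = DBSCAN(eps=0.2, min_samples=5).fit(circles_Xs)
circles_labels = circles_db.labels_

# Count clusters, excluding the -1 noise label if it appears.
n_found = len(set(circles_labels)) - (1 if -1 in circles_labels else 0)
n_noise = int((circles_labels == -1).sum())
print("clusters found:", n_found, "noise points:", n_noise)
```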

**9.C) Evaluate DBSCAN visually, with the silhouette score, and with the metrics against the true `y`.**

In [14]:
# A: