#### DBSCAN and a use case for exploring anomalous instances:

In [2]:
# tabular manipulation:
import numpy as np
import pandas as pd
# visualization:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib qt
import seaborn as sns
# sklearn for scaling and clustering:
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN
# environment:
from env import host, user, password

def get_db_url(database, host=host, user=user, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'

url = get_db_url("grocery_db")

sql = """
select *
from grocery_customers
"""

df = pd.read_sql(sql, url, index_col="customer_id")
df.head()


customer_id,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


#### We need to work through several steps to get value from DBSCAN:

- Select the variables/features that we wish to examine
- Scale these features (DBSCAN is distance based, so it is most useful for continuous variables on a comparable scale)
- Ensure that our features are in a numpy array for fitting DBSCAN
- Select our epsilon (eps) and min_samples to fit our clusters
- Use our clusters to label outliers
- Explore our clusters
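
The steps above can be sketched end to end on made-up data. This is a minimal illustration, not the grocery workflow itself: the synthetic array `X`, the three far-away points, and the `eps`/`min_samples` values are all assumptions chosen so the result is easy to read.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for three continuous features (assumption: NOT the grocery data):
# one dense blob plus three points placed far away from it.
rng = np.random.default_rng(0)
dense = rng.normal(loc=100.0, scale=5.0, size=(200, 3))
far = np.array([[300.0, 300.0, 300.0],
                [0.0, 400.0, 50.0],
                [500.0, 0.0, 0.0]])
X = np.vstack([dense, far])

# Scale every feature to [0, 1] so no single feature dominates the distance
X_scaled = MinMaxScaler().fit_transform(X)

# eps and min_samples are illustrative guesses, not tuned values
model = DBSCAN(eps=0.1, min_samples=20).fit(X_scaled)

# -1 marks noise; every other label is a cluster id
labels = model.labels_
is_outlier = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("clusters:", n_clusters, "outliers:", int(is_outlier.sum()))
```

The boolean mask `is_outlier` can be used to pull the flagged rows back out of the unscaled data for inspection.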

In [3]:
selected_feats = ['Fresh', 'Milk', 'Grocery']

**note:** continuous variables are significantly more valuable
for distance-based clustering: their points do not snap
to a fixed set of values, so density carries more meaning

In [5]:
# we are treating this data as if it were already the train split

grocery_milk_fresh = df[selected_feats]

In [6]:

# Make the scaler
scaler = MinMaxScaler()

# Fit the scaler
scaler.fit(grocery_milk_fresh)

# Use the scaler
grocery_milk_fresh = scaler.transform(grocery_milk_fresh)
grocery_milk_fresh

array([[0.11294004, 0.13072723, 0.08146416],
       [0.06289903, 0.13282409, 0.10309667],
       [0.05662161, 0.11918086, 0.08278992],
       ...,
       [0.1295431 , 0.21013575, 0.32594285],
       [0.091727  , 0.02622442, 0.02402535],
       [0.02482434, 0.02237109, 0.02702178]])
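
For reference, min-max scaling maps each column with x' = (x - min) / (max - min), so every feature ends up in the range [0, 1]. The small check below uses a toy array (an assumption, not the grocery data) to show that computing this by hand matches `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): two features on very different scales
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [40.0, 1000.0]])

# Column-wise min-max scaling written out by hand
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```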

#### Time for the DBSCAN
- after scaling, we fit the DBSCAN object
- the hyperparameters (eps and min_samples) are largely a shot in the dark on the first round; refine them after looking at the results
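
One commonly used heuristic for making the first guess at eps less of a shot in the dark is a k-distance curve: for a chosen min_samples, compute each point's distance to its min_samples-th nearest point, sort those distances, and look for the "elbow" where they start to climb. The sketch below is not part of the original workflow; `X_scaled` is a random placeholder standing in for the scaled feature array.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder for the scaled feature array (assumption: random data, not the grocery data)
rng = np.random.default_rng(42)
X_scaled = rng.random((300, 3))

min_samples = 20

# Querying with the training data itself returns each point as its own
# nearest neighbour (distance 0), which matches how DBSCAN counts min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Distance from every point to its min_samples-th nearest point, sorted ascending
k_distances = np.sort(distances[:, -1])

# Plotting k_distances (e.g. plt.plot(k_distances)) shows the curve;
# here we just print a few percentiles as candidate eps values.
for pct in (50, 90, 95):
    print(pct, round(float(np.percentile(k_distances, pct)), 3))
```

Values of eps near the elbow are a reasonable starting point; the resulting outlier counts still need a visual sanity check.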

In [9]:
# Make the object
dbsc = DBSCAN(eps = .10, min_samples = 20)

# Fit the object
dbsc.fit(grocery_milk_fresh)

DBSCAN(eps=0.1, min_samples=20)

In [10]:
# Now, let's add the scaled value columns back onto the dataframe
# Reuse selected_feats so the names match the column order of the scaled array
scaled_columns = ["Scaled_" + column for column in selected_feats]

# Save a copy of the original dataframe
original_df = df.copy()

# Create a dataframe containing the scaled values
scaled_df = pd.DataFrame(grocery_milk_fresh, columns=scaled_columns)

# Merge the scaled and non-scaled values into one dataframe
# (this relies on customer_id running 0..n-1, matching the row positions of scaled_df)
df = df.merge(scaled_df, on=df.index)
df = df.drop(columns=['key_0'])
df.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Scaled_Fresh,Scaled_Milk,Scaled_Grocery
0,2,3,12669,9656,7561,214,2674,1338,0.11294,0.130727,0.081464
1,2,3,7057,9810,9568,1762,3293,1776,0.062899,0.132824,0.103097
2,2,3,6353,8808,7684,2405,3516,7844,0.056622,0.119181,0.08279
3,1,3,13265,1196,4221,6404,507,1788,0.118254,0.015536,0.045464
4,2,3,22615,5410,7198,3915,1777,5185,0.201626,0.072914,0.077552
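
An alternative sketch for attaching scaled columns is to build the scaled dataframe on the original index and use `join`, which keeps customer_id as the index and avoids the temporary key_0 column. The three rows below are copied from the table above purely for illustration, so the scaled values differ from the full-data output; the name `combined` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# First three rows of the selected features (scaling only these rows,
# so the numbers will not match the full-data output)
df = pd.DataFrame(
    {"Fresh": [12669, 7057, 6353],
     "Milk": [9656, 9810, 8808],
     "Grocery": [7561, 9568, 7684]},
    index=pd.Index([0, 1, 2], name="customer_id"),
)

features = ["Fresh", "Milk", "Grocery"]
scaled = MinMaxScaler().fit_transform(df[features])

# Name the scaled columns in the same order as the features that were scaled,
# and reuse the original index so rows line up by customer_id
scaled_df = pd.DataFrame(
    scaled,
    columns=[f"Scaled_{name}" for name in features],
    index=df.index,
)

combined = df.join(scaled_df)
print(combined)
```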


In [12]:
# Assign the cluster labels:
#   .labels_ comes out as an array with one entry per row,
#   so it can be assigned directly to a column in the dataframe
# Recall that cluster labels don't have inherent meaning
# DBSCAN uses the label -1 for the "noise" points, i.e. the outliers
# Any other label is the number of the cluster the point belongs to
#   e.g. with 3 clusters you'll have 0, 1, 2, and then -1 for anything else
df['labels'] = dbsc.labels_
df['labels']

0      0
1      0
2      0
3      0
4      0
      ..
435    0
436    0
437   -1
438    0
439    0
Name: labels, Length: 440, dtype: int64

In [13]:
# look at value counts of labels
df.labels.value_counts()

 0    409
-1     31
Name: labels, dtype: int64

**Cluster Labels**

Clusters: 1 (label: 0), 409 points

Outliers: 31 points (label: -1)
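
Beyond the raw label counts, a fitted DBSCAN object also records which points are core samples (`core_sample_indices_`). Points that belong to a cluster without being core are border points, which is one way to look at observations that sit on the edge of a cluster. The sketch below uses synthetic data (an assumption, not the grocery data) to split points into core, border, and noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (hypothetical): a tight blob plus a handful of scattered points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.5, 0.03, size=(150, 2)),
               rng.uniform(0, 1, size=(10, 2))])

model = DBSCAN(eps=0.1, min_samples=20).fit(X)
labels = model.labels_

# Core points: at least min_samples points (including itself) within eps
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[model.core_sample_indices_] = True

# Noise points carry the label -1; border points are clustered but not core
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

# Same information as value_counts() on the labels column
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
print("core:", int(core_mask.sum()),
      "border:", int(border_mask.sum()),
      "noise:", int(noise_mask.sum()))
```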

### Explore
- use the label as a hue

In [17]:
df.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Scaled_Fresh,Scaled_Milk,Scaled_Grocery,labels
0,2,3,12669,9656,7561,214,2674,1338,0.11294,0.130727,0.081464,0
1,2,3,7057,9810,9568,1762,3293,1776,0.062899,0.132824,0.103097,0
2,2,3,6353,8808,7684,2405,3516,7844,0.056622,0.119181,0.08279,0
3,1,3,13265,1196,4221,6404,507,1788,0.118254,0.015536,0.045464,0
4,2,3,22615,5410,7198,3915,1777,5185,0.201626,0.072914,0.077552,0


In [22]:
# because of %matplotlib qt, the plot pops up in a new window
sns.scatterplot(data = df, x = 'Fresh', y = 'Milk', hue = 'labels')

<AxesSubplot:xlabel='Fresh', ylabel='Milk'>

In [23]:
sns.scatterplot(data = df, x = 'Fresh', y = 'Grocery', hue = 'labels')

<AxesSubplot:xlabel='Fresh', ylabel='Grocery'>

In [26]:
# let's examine it in 3D

# create matplotlib figure
fig = plt.figure(1, figsize = (10,10))

# add 3D axes to the figure
# (Axes3D(fig) no longer attaches itself to the figure in recent matplotlib versions)
ax = fig.add_subplot(projection = '3d')

# treat ax object like we have before
ax.scatter(df.Fresh,
          df.Milk,
          df.Grocery,
          c = df.labels,   # this instead of hue
          edgecolor = 'k')

# remove chart junk (tick labels on the axes)
# (ax.w_xaxis and friends were removed from matplotlib; use ax.xaxis etc.)
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

# create labels for each axis
ax.set_xlabel('Fresh')
ax.set_ylabel('Milk')
ax.set_zlabel('Grocery')

Text(0.5, 0, 'Grocery')

We made an awesome 3D plot! It is only viewable when running the notebook, where it opens in a pop-out window.

### Takeaways
- There are at least 10 points that are definite outliers
- We have some data points that may or may not belong to the main cluster
- In the dimensions we observed, there doesn't appear to be a need for more than one cluster
- We may want to adjust our hyperparameters to also catch data points that are not as extreme as the farthest outliers in the data set; a smaller eps or a larger min_samples flags more points as noise
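
To see how sensitive the outlier count is to the hyperparameters, one option is to refit DBSCAN over a small grid of eps values and compare how many points are labeled -1. This is a sketch on synthetic data; the helper `count_outliers`, the placeholder `X_scaled`, and the eps grid are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder for the scaled features (assumption: synthetic, not the grocery data)
rng = np.random.default_rng(7)
X_scaled = np.vstack([rng.normal(0.2, 0.05, size=(300, 3)).clip(0, 1),
                      rng.uniform(0, 1, size=(20, 3))])

def count_outliers(X, eps, min_samples):
    """Fit DBSCAN and return how many points are labeled as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    return int((labels == -1).sum())

# For a fixed min_samples, a larger eps can only shrink the set of noise points
results = {eps: count_outliers(X_scaled, eps, min_samples=20)
           for eps in (0.05, 0.10, 0.15, 0.20)}

for eps, n_outliers in results.items():
    print(f"eps={eps:.2f} -> {n_outliers} outliers")
```

Plotting the flagged points for a few of these settings, as in the scatterplots above, helps decide which threshold matches what we would call an anomaly.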