### Data Mining Spring 2021 Project: Deliverable 3 Clustering and Final Results
#### Understanding and predicting Shark Presence in Near Shore Waters
#### Group Members:<br><br>
<p> Your overall project has three deliverables: we are on the final deliverable for clustering<br>
Deliverable 1:  Domain Understanding, Data Exploration and Preparation, Decision Trees and Random Forests<br>
Deliverable 2:  Association Rules<br>
Deliverable 3:  Clustering and Final Results (due 5/9)<br>

#### Deliverable 3:  Clustering
    
<p>Directions:  Review the notebook by reading the markdown and running the code.  Answer the questions at the end which draw upon the knowledge gained from notebooks 1, 2 and 3 associated with the project.</p>
<p>Steps for Deliverable 3:<br>
    1.  Understanding the K-means clustering algorithm - an examination of the modeling algorithm<br>
    2.  Import libraries and read data, prepare the data for clustering (unscaled)<br>
    3.  Run the clustering algorithm<br>
    4.  Analyze the results<br>
    5.  Final Questions<br>

### 1. Understanding the K Means Clustering Algorithm

The $K$-means algorithm divides a set of $N$ samples $X$ into $K$ disjoint clusters $C$, each described by the mean $\mu_j$ of the samples in the cluster. The means are commonly called the **cluster “centroids”**; note that they are not, in general, points from $X$, although they live in the same space. The $K$-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion:

$$\sum_{i=1}^{N}\min_{\mu_j \in C}\left(||x_i - \mu_j||^2\right)$$
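
To make the criterion concrete, here is a minimal sketch that computes the inertia of a tiny made-up data set with NumPy. The four points, the two centroids, and the function name `inertia` are assumptions chosen for illustration; they are not part of the shark data.

```python
import numpy as np

def inertia(X, centroids):
    """Sum over all samples of the squared distance to the nearest centroid."""
    # Pairwise squared distances, shape (n_samples, n_centroids)
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Each sample contributes only its distance to the closest centroid
    return sq_dist.min(axis=1).sum()

# Four made-up 2-D points and two made-up centroids
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Every point is exactly 1 unit from its nearest centroid, so the sum is 4.0
print(inertia(X, centroids))
```

A lower inertia means the points sit closer to their centroids, which is what the algorithm tries to achieve.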

### How the algorithm works

The $K$-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters $K$ and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the $K$ centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:

**Data assignment step**: Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is a centroid in the set of centroids $C$, then each data point $x$ is assigned to a cluster based on

$$\underset{c_i \in C}{\arg\min} \; dist(c_i,x)^2$$
where $dist(\cdot)$ is the standard ($L_2$) Euclidean distance. Let $S_i$ be the set of data points assigned to the $i$th cluster centroid.

**Centroid update step**: In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster.

$$c_i=\frac{1}{|S_i|}\sum_{x \in S_i} x$$

The algorithm iterates between steps one and two until a stopping criterion is met (e.g., no data points change clusters, the sum of the squared distances no longer decreases, or some maximum number of iterations is reached).
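
The two steps can be sketched in a few lines of NumPy. This is a teaching sketch under stated assumptions, not the scikit-learn implementation used later in this notebook: the function name `kmeans_sketch`, the choice to start from randomly selected data points, and the six made-up points are all illustrative.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points selected at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Data assignment step: nearest centroid by squared Euclidean distance
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        # Stopping criterion: no data point changed cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Centroid update step: mean of the points assigned to each centroid
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[i] = members.mean(axis=0)
    return centroids, labels

# Two well-separated made-up groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)     # the first three points share one id, the last three the other
print(centroids)  # one centroid near (0.33, 0.33), the other near (10.33, 10.33)
```

Which group receives id 0 and which receives id 1 depends on the random start, so only the grouping itself is meaningful.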

**Convergence and random initialization**

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not necessarily the best possible outcome), meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.

<img src=https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif style="width: 500px;"/>
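
The effect of the starting centroids can be checked by running the algorithm several times with different seeds and comparing the final inertia. The sketch below assumes synthetic data from `make_blobs` rather than the shark data, and the seed range is arbitrary. Setting `n_init=1` forces a single run per seed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points scattered around 4 centers (not the shark data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=1)

inertias = {}
for seed in range(5):
    # init="random" picks starting centroids at random from the data;
    # n_init=1 means exactly one run for this seed
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    inertias[seed] = km.inertia_
    print(f"seed {seed}: inertia = {km.inertia_:.1f}")

# The run with the lowest inertia is the best of the local optima found
best_seed = min(inertias, key=inertias.get)
print("lowest inertia came from seed", best_seed)
```

scikit-learn can do this comparison for you: the `n_init` parameter of `KMeans` runs the algorithm that many times and keeps the run with the lowest inertia.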




## 2. Import Libraries and Read Data

#### 2. A.  Import Libraries
<p>We are importing pandas and numpy for working with data, seaborn and matplotlib for plotting, and scikit-learn (sklearn) to easily perform modeling.</p><p>You can simply run this code.</p>


In [None]:
#some code so those pesky warnings from deprecated code won't appear
#filterwarnings applies to the rest of the notebook, not just to one block
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
#the rest of the imports
#pandas for working with datasets
import pandas as pd
#numpy for working with arrays
import numpy as np
#seaborn for plotting and styling visualizations
import seaborn as sns
#matplotlib for additional customization
import matplotlib.pyplot as plt
# import KMeans from sklearn
# want to learn more?  visit https://scikit-learn.org/stable/modules/clustering.html
from sklearn.cluster import KMeans
#some we may not use
#scikit-learn for preprocessing and modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, r2_score, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier



<h4>2. B. Input Data, Review and Prepare Attributes for Clustering</h4><br>
  <p>  NOTE:  This data has had transformations applied for the purpose of education and ease of understanding the process we use to apply data mining to predictive analysis.  Transformations include balancing the data set, discretization according to domain understanding and other methods, merging with other data sets according to date, and imputation or removal of null values by row or column. </p>
<p>Due to these changes, this particular data set should not be used for an actual production system for shark presence or attacks. For further studies, the data should be updated with additional years and rebuilt. It can be used, however, to gain an understanding of the problem in order to continue addressing the matter in a scientific manner.</p>
<p>We won't be using all of the attributes for our clustering model, just a few of them. You can, however, use any of the attributes for your visualization.</p>

In [None]:
# encoding is a statement of the kinds of characters used
# this data set includes some special characters
# read the csv file sharkdata.csv into bdf
# you can examine the csv file on the github site for class
file = "sharkdata.csv"
bdf = pd.read_csv(file, encoding="ISO-8859-1")
#let's take a look at the attributes and file size
bdf.info()

In [None]:
#let's take a look at the data - again!
bdf.head()

In [None]:
# this time we will use numeric data
# we need to change the Attack Yes or No Feature to 0 or 1
bdf["Attack"] = bdf["Attack"].astype('category')
bdf["AttackCat"] = bdf["Attack"].cat.codes
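
The following small sketch shows what the category codes look like. It assumes the `Attack` column holds the strings "No" and "Yes"; check `bdf["Attack"].unique()` to confirm the actual values in your data. The `demo` dataframe is made up.

```python
import pandas as pd

# Made-up stand-in for the Attack column
demo = pd.DataFrame({"Attack": ["No", "Yes", "No", "Yes"]})
demo["Attack"] = demo["Attack"].astype("category")

# String categories are sorted alphabetically, so "No" gets code 0 and "Yes" gets code 1
print(dict(enumerate(demo["Attack"].cat.categories)))  # {0: 'No', 1: 'Yes'}

demo["AttackCat"] = demo["Attack"].cat.codes
print(demo["AttackCat"].tolist())  # [0, 1, 0, 1]
```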

#### 2. C. Only include the attributes needed for modeling - create "df" dataframe with these attributes
<p>You can build df to include the features you want. Here we are using the numeric features that we have already learned may be relevant from our previous work.</p>
<p>You can simply run this code but do pay attention to the attributes we will use for modeling. We are going to use the variables unscaled for our clustering model.  We will then use scaled features to see if results improve.  </p>

In [None]:
#df will include numeric attributes for clustering
#we are leaving the turtles and crabs out for now, also temperature (it's always hot in summer) and more!
#.copy() makes df independent of bdf so adding columns later does not raise a SettingWithCopyWarning
df = bdf[["MoonPhaseIntExtend", "StationPressure",
          "WindSpeed", "Salinity", "Turbidity",
          "DissolvedO2", "DirectionDiscInt","AttackCat"]].copy()
#take a look
df.info()

In [None]:
# examine first 15 records
df.head(15)

### 3. Create an instance of a K Means model with 4 clusters

In [None]:
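# Create the K Means Clustering Model on our data
# n_init=10 restarts the algorithm from 10 random initializations and keeps the best run
# random_state fixes the random seed so repeated runs give the same clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(df)
# cluster id (0 to 3) assigned to each record
y_kmeans = kmeans.predict(df)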

In [None]:
# show the cluster centers
clus_cent=kmeans.cluster_centers_
clus_cent

In [None]:
# let's take a look
df.head(10)
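
K-means relies on Euclidean distance, so a feature with a large numeric range (such as a pressure reading) can dominate a feature with a small range (such as a 0/1 flag). Section 2. C. mentions trying scaled features; one possible way to do that is sketched below with `StandardScaler`. The `demo` dataframe and its values are made up so the example runs on its own. To try it on the real data, replace `demo` with `df`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for df: three features on very different scales
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "StationPressure": rng.normal(1015, 5, size=200),  # hypothetical values
    "Turbidity": rng.normal(10, 3, size=200),          # hypothetical values
    "AttackCat": rng.integers(0, 2, size=200),
})

# Standardize every column to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = scaler.fit_transform(demo)

kmeans_scaled = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)

# Convert the centroids back to original units so they are readable
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans_scaled.cluster_centers_),
    columns=demo.columns,
)
print(centers.round(2))
```

Comparing these centroids with the unscaled ones shows whether a single large-valued feature was driving the original clusters.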

### 4.  Analyze the Results<br>
<p>Let's take a look at the four clusters and the cluster centers.  If you examine them carefully, clusters 0 and 2 are rather interesting.  AttackCat is either 0 or 1, and the AttackCat center for cluster 0 is close to 0 (attack = no), while the center for cluster 2 is 1 (attack = yes).  Cluster numbering is arbitrary and may differ in your run, so identify these clusters in your own output by their AttackCat centers.  Look at the centers of the other attributes and be ready to report on the results below.  For example, turbidity is much lower in cluster 0 (attack = no) than in cluster 2 (attack = yes).</p>

#### 4. A.  Analyze the centroids with respect to AttackCat

In [None]:
# Analyze the clusters with respect to the centers and AttackCat.
# For help in understanding some of the data, look back at previous notebooks (ex. Wind Direction Disc)
#DirectionDiscInt is the Wind Direction discretized
#NE = 1, E = 2, SE = 3, S = 4, W = 5, SW = 6
#MoonPhaseCat is the actual MoonPhase as a string
#MoonPhaseCatExtended is the Extended MoonPhase
#0 is Quarter moons, 1 is wan gibb and wax cres, 2 is wax gibb and wan cres, 3 is Full and New
#label each centroid coordinate with the feature (column of df) it belongs to
feat = list(df.columns)
kmclus = pd.DataFrame(clus_cent,columns=feat)
kmclus

#### 4. B.  Add the cluster id to the dataframe for visualization

In [None]:
# add the cluster id to a dataframe called df_clus
# then add this to our original dataset so we have each
# record and the cluster that the model assigned the record to
# index=df.index keeps the cluster ids aligned with the rows of df
df_clus = pd.DataFrame({'cluster_id': y_kmeans}, index=df.index)


In [None]:
# merge with original data frame (pandas may show a SettingWithCopyWarning here; it can be ignored)
# let's take a look - you can see the id of the cluster now along with AttackCat
df['id_of_cluster'] = df_clus['cluster_id']
df.head(15)
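
Once each record carries its cluster id, a cross-tabulation is a quick way to see how attacks are distributed across clusters. The sketch below uses a made-up eight-row dataframe named `demo`; on the real data you would pass `df["id_of_cluster"]` and `df["AttackCat"]` instead.

```python
import pandas as pd

# Made-up cluster assignments and attack labels, for illustration only
demo = pd.DataFrame({
    "AttackCat":     [0, 0, 0, 1, 1, 1, 0, 1],
    "id_of_cluster": [0, 0, 0, 2, 2, 2, 1, 3],
})

# Rows are clusters, columns are AttackCat values, cells are record counts
table = pd.crosstab(demo["id_of_cluster"], demo["AttackCat"])
print(table)

# Because AttackCat is 0/1, its mean per cluster is the share of attack records
attack_rate = demo.groupby("id_of_cluster")["AttackCat"].mean()
print(attack_rate)
```

A cluster whose attack share is close to 0 or 1 separates the two outcomes well; a share near 0.5 means the cluster mixes them.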

#### 4. C.  Visualization with Scatter Plots and color coded clusters

In [None]:
# let's do some visualization
# we are going to set up one color for each of the four clusters (cluster ids 0 to 3)
cluster_colors = {0:'blue', 1:'red', 2:'yellow', 3:'green'}
pd.plotting.scatter_matrix(df.loc[:,"MoonPhaseIntExtend":"AttackCat"],figsize=(30,30),grid=True,
                           marker='o', c= df['id_of_cluster'].map(cluster_colors))
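
A scatter matrix with eight features can be hard to read. One alternative is to plot a single pair of features and overlay the centroids. The sketch below uses made-up values for two features (labelled Salinity and Turbidity purely for illustration) so that it runs on its own; it is not a plot of the shark data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Made-up values standing in for two features
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(35, 2, 150), rng.normal(10, 4, 150)])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

fig, ax = plt.subplots(figsize=(6, 5))
# One color per cluster id
ax.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap="viridis", s=20)
# Mark the four centroids with black crosses
ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
           c="black", marker="x", s=100, label="centroids")
ax.set_xlabel("Salinity (made-up values)")
ax.set_ylabel("Turbidity (made-up values)")
ax.legend()
```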


## 5.  Final Questions
<p>Review the original problem statement and Deliverables 1 through 3.  After discussing all results, answer the following questions.</p>
<p>Question 1:  Summarize your work in two to four paragraphs (be concise).  What was the problem?  What did you do to solve it?  What did you learn?  </p>
<p>Question 2:  In order to continue the shark research and improve on the results, what would be some logical next steps?</p>
<p>Question 3:  Have you noticed if any features are of particular importance?  What are they and why?</p>
<p>Question 4:  What features do not seem important?  Why?</p>
<p>Question 5:  What was your favorite algorithm?  Why?</p>


Question 1 answer.

Question 2 answer.

Question 3 answer.

Question 4 answer.

Question 5 answer.