# Classification

As always, we will start by importing all of the libraries that we know we will need.
There is one new library here called seaborn. It is a plotting library built on matplotlib, and it has some really useful presets that we will be using.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

We will be looking at a dataset about iris flowers today, found in **IRIS.csv**.

This dataset includes sepal length and width, along with petal length and width, and the particular species of iris. Here is a helpful diagram:

![](https://miro.medium.com/max/2550/0*GVjzZeYrir0R_6-X.png)

Let's look at our dataset:
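
A minimal sketch of this step. In the notebook we would read the file with `pd.read_csv("IRIS.csv")`; so that the example runs on its own, scikit-learn's bundled copy of the iris data stands in for the file here, and the column names are an assumption about what the CSV contains.

```python
import pandas as pd
from sklearn.datasets import load_iris

# In the notebook: df = pd.read_csv("IRIS.csv")
# Stand-in data below; the column names are assumed, not taken from the file.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target.map(dict(enumerate(iris.target_names)))

print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)
```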

Notice that one of our columns, **species**, is categorical. Since we know this may cause issues, let's go ahead and encode it to be numerical.
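
One possible way to do the encoding is scikit-learn's `LabelEncoder`, sketched here on a tiny hand-made frame. The species strings and the `species_code` column name are illustrative assumptions; the real values in **IRIS.csv** may be spelled differently.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame for illustration only; the real labels come from IRIS.csv.
df = pd.DataFrame(
    {"species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

encoder = LabelEncoder()
# Classes are sorted alphabetically and mapped to 0, 1, 2, ...
df["species_code"] = encoder.fit_transform(df["species"])

print(df)
print(list(encoder.classes_))  # position in this list is the numeric code
```

Writing the codes to a new column keeps the original labels around for plotting; overwriting **species** in place works just as well.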

There are a few easy ways we can get a look at how the columns in our dataset are related to each other. First, we can look at its correlation matrix:
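
A short sketch of the correlation matrix using `DataFrame.corr()`. scikit-learn's bundled iris data stands in for **IRIS.csv**, with assumed column names and species already encoded as 0, 1, 2.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Stand-in for the encoded IRIS.csv frame; column names are assumed.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df["species"] = iris.target  # integer codes 0, 1, 2

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))
```

If a picture is easier to read than a table, `sns.heatmap(corr, annot=True)` draws the same matrix as a colored grid.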

We can also plot what is called a **pairplot**:

## Training and Testing Sets

Let's split our data into training and testing sets using the **train_test_split** function from sklearn.

First we need to import it:

In [None]:
from sklearn.model_selection import train_test_split

Now we can use it to split up our data in an 80:20 ratio.
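
A minimal sketch of the 80:20 split. `X` and `y` come from scikit-learn's bundled iris data as a stand-in for the features and encoded species in **IRIS.csv**, and `random_state=42` is an arbitrary choice made only so the split is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for the feature columns (X) and the encoded species column (y).
X, y = load_iris(return_X_y=True)

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```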

## Lasso Regression

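Lasso regression is a type of linear regression that uses shrinkage. Shrinkage here means that the model's coefficient estimates are pulled towards zero by a penalty on their absolute values, and a large enough penalty sets some coefficients to exactly zero.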
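
For reference, and assuming the `Lasso` we use is scikit-learn's implementation, the quantity it minimizes can be written as below. The symbols are our own notation: $n$ is the number of samples, $X$ the feature matrix, $y$ the target, $w$ the coefficient vector, and $\alpha$ the model's `alpha` argument.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The first term is the usual least-squares fit, and the second term charges a price for large coefficients, so a larger $\alpha$ means a stronger pull of the coefficients towards zero.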

![](https://media-exp1.licdn.com/dms/image/C5112AQGeaIOJ4uR63g/article-cover_image-shrink_600_2000/0/1572439564210?e=1658966400&v=beta&t=I2zuDNrOXiAP8LUgK_oD1apDwrOdA6QSCLr3vxPVKQ4)

First we need to import **Lasso** from sklearn

In [None]:
from sklearn.linear_model import Lasso

Now we can create and fit the model.
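
A sketch of creating and fitting the model. The data is a stand-in for **IRIS.csv**, and `alpha=0.1` is just an arbitrary starting value, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Stand-in data and the same 80:20 split as before (random_state is arbitrary).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# alpha controls the strength of the penalty; 0.1 is an assumed value.
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
```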

Then we can go ahead and predict the species type with our model for both the training and testing set.
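
A sketch of the prediction step under the same assumptions (stand-in data, `alpha=0.1`). Because lasso is a regression model, its predictions are continuous numbers rather than species codes, so the last line rounds them to the nearest valid code as one possible way to read them as classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# Continuous predictions for both sets.
y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)

# Optional: round to the nearest species code and keep it inside 0..2.
y_test_class = np.clip(np.rint(y_test_pred), 0, 2).astype(int)
print(y_test_pred[:5])
print(y_test_class[:5])
```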

We can print the lasso coefficients like this:
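
A sketch of printing the coefficients from the fitted model's `coef_` and `intercept_` attributes. The feature names are assumed, and the data is a stand-in for **IRIS.csv**.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# Assumed names, in the same order as the feature columns.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One coefficient per feature; lasso may set some of them to exactly zero.
for name, coef in zip(feature_names, lasso.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {lasso.intercept_:.3f}")
```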

Now let's make a parity plot to visualize the predictions:
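
A sketch of such a plot with matplotlib: actual values go on the x-axis, predicted values on the y-axis, and points on the dashed diagonal are perfect predictions. The data and `alpha=0.1` are the same stand-in assumptions as before.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_train, lasso.predict(X_train), alpha=0.5, label="train")
ax.scatter(y_test, lasso.predict(X_test), alpha=0.5, label="test")
# Diagonal where predicted == actual (species codes run from 0 to 2).
ax.plot([0, 2], [0, 2], color="black", linestyle="--", label="perfect prediction")
ax.set_xlabel("actual species code")
ax.set_ylabel("predicted species code")
ax.legend()
```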

We can also make violin plots to give us a better idea of how many points we are seeing.
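
A sketch of a violin plot with seaborn. Since the actual values only take three codes, many points overlap on a scatter plot; the violins show how the predictions are distributed within each actual species. The column names `actual` and `predicted` are our own, and the data is a stand-in for **IRIS.csv**.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# Collect actual codes (as strings, so they are treated as categories)
# next to the continuous predictions.
results = pd.DataFrame(
    {"actual": y_test.astype(str), "predicted": lasso.predict(X_test)}
)

# One violin per actual species; its width reflects how many points fall there.
ax = sns.violinplot(data=results, x="actual", y="predicted")
```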

Now let's look at the mean squared error and r-squared values.
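
A sketch using the two metric functions imported at the top of the notebook, computed for both sets so we can compare them. The data and `alpha=0.1` are the same stand-in assumptions as before.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

# Both functions take the true values first, then the predictions.
mse_train = mean_squared_error(y_train, lasso.predict(X_train))
mse_test = mean_squared_error(y_test, lasso.predict(X_test))
r2_train = r2_score(y_train, lasso.predict(X_train))
r2_test = r2_score(y_test, lasso.predict(X_test))

print(f"train: MSE={mse_train:.3f}, R^2={r2_train:.3f}")
print(f"test:  MSE={mse_test:.3f}, R^2={r2_test:.3f}")
```

A much lower error on the training set than on the testing set would be a sign of overfitting.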

Now, what happens when we adjust the alpha level in our model? How does this change the mean squared error and r-squared values?
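
One way to explore this is to refit the model for a handful of alpha values and compare the results, as sketched below. The list of alphas is an arbitrary choice, and the data is a stand-in for **IRIS.csv**.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rows = []
for alpha in [0.001, 0.01, 0.1, 1.0]:  # arbitrary values to compare
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append(
        {
            "alpha": alpha,
            "test_mse": mean_squared_error(y_test, y_pred),
            "test_r2": r2_score(y_test, y_pred),
            # How many coefficients survived the penalty.
            "nonzero_coefs": int(np.sum(model.coef_ != 0)),
        }
    )

for row in rows:
    print(row)
```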

## K Nearest Neighbors (KNN)

KNN is an approach to data classification that estimates how likely a data point is to be a member of one group or another depending on what group the data points nearest to it are in.

![](https://miro.medium.com/max/753/0*jqxx3-dJqFjXD6FA)

First we need to import **KNeighborsRegressor** from sklearn

In [None]:
from sklearn.neighbors import KNeighborsRegressor

Now we can create and fit the model.
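
A sketch of creating and fitting the model. The data is a stand-in for **IRIS.csv**, and `n_neighbors=5` (which is also scikit-learn's default) is just a starting value.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Stand-in data and an 80:20 split (random_state is arbitrary).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each prediction will be the average target of the 5 closest training points.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
```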

Then we can go ahead and predict the species type with our model for both the training and testing set.
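
A sketch of the prediction step under the same assumptions. `KNeighborsRegressor` averages the species codes of the nearest neighbors, so a prediction such as 1.4 means the neighbors were a mix of species 1 and species 2.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Predictions for both sets; values are averages of neighbor codes.
y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)
print(y_test_pred[:10])
```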

Unlike lasso, KNN has no coefficients to print: its predictions come directly from the stored training points.

Now let's make a parity plot to visualize the predictions:

We can also make violin plots to give us a better idea of how many points we are seeing.

Now let's look at the mean squared error and r-squared values.

Now, what happens when we adjust the number of neighbors (**n_neighbors**) in our model? How does this increase or decrease the accuracy?
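
One way to explore this is to refit the model with different settings and compare the test metrics, sketched here for the model's `n_neighbors` parameter. The list of values is an arbitrary choice, and the data is a stand-in for **IRIS.csv**.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rows = []
for k in [1, 3, 5, 10, 20]:  # arbitrary neighbor counts to compare
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append(
        {
            "n_neighbors": k,
            "test_mse": mean_squared_error(y_test, y_pred),
            "test_r2": r2_score(y_test, y_pred),
        }
    )

for row in rows:
    print(row)
```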

## K Means

K-Means is a simple unsupervised learning algorithm used to solve clustering problems. It partitions a given data set into a number of clusters, "k", which is fixed beforehand. Each cluster is represented by a center point (its centroid), and every data point is assigned to the nearest centroid. The centroids are then recomputed as the mean of their assigned points, and the process repeats with the updated centroids until the assignments stop changing.

![](https://miro.medium.com/max/1200/1*rw8IUza1dbffBhiA4i0GNQ.png)

First we need to import **KMeans** from sklearn.

In [None]:
from sklearn.cluster import KMeans 

Now we can create and fit the model.
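
A sketch of creating and fitting the model. K-Means is unsupervised, so only the features are passed to `fit`. We assume `n_clusters=3` because the dataset has three species; `n_init=10` and `random_state=0` are arbitrary choices for reproducibility, and the data is a stand-in for **IRIS.csv**.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No y here: the algorithm never sees the species labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X_train)

print(kmeans.cluster_centers_)  # one row of feature values per centroid
```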

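Then we can go ahead and assign each point in the training and testing sets to a cluster with our model. Note that K-Means never sees the species labels, so the cluster numbers do not necessarily match our encoded species.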
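
A sketch of this step under the same assumptions (stand-in data, three clusters). `predict` returns the index of the nearest centroid for each point.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Cluster index (0, 1, or 2) for every point; the numbering is arbitrary.
train_clusters = kmeans.predict(X_train)
test_clusters = kmeans.predict(X_test)
print(test_clusters)
```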

Additionally we can visualize the clusters and their centers.

In [None]:
# Visualising the clusters.
# Assumes x is a NumPy array of features, kmeans is the fitted KMeans model,
# and y_kmeans holds the cluster index assigned to each row of x.
# Cluster indices are arbitrary, so the species names in the labels are a guess.
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='purple', label='Iris-setosa')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='orange', label='Iris-versicolour')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label='Iris-virginica')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c='red', label='Centroids')

plt.legend()

Now let's make a parity plot to visualize the predictions:

We can also make violin plots to give us a better idea of how many points we are seeing.

Now let's look at the mean squared error and r-squared values.

Now, what happens when we adjust the number of clusters in our model? How does this change the clustering?
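
For K-Means, one setting worth varying is the number of clusters. The sketch below refits the model for several values of `n_clusters` and prints the inertia (the sum of squared distances from each point to its centroid), which is a common way to compare them. The range of values is an arbitrary choice, and the data is a stand-in for **IRIS.csv**.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

inertias = []
for k in range(1, 7):  # try 1 through 6 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    inertias.append(model.inertia_)
    print(f"k={k}: inertia={model.inertia_:.1f}")
```

Inertia keeps dropping as clusters are added, so we usually look for the point where the improvement levels off rather than for the smallest value.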

## Decision Tree

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

![](https://www.explorium.ai/wp-content/uploads/2019/12/Decision-Trees-2.png)

First we need to import **tree** from sklearn

In [None]:
from sklearn import tree
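
A minimal sketch of fitting a tree and reading its structure. Since each leaf holds a class label, we assume `tree.DecisionTreeClassifier` here; `max_depth=3` and `random_state=0` are arbitrary choices, the feature names are assumed, and the data is a stand-in for **IRIS.csv**.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A shallow tree keeps the printed flowchart short enough to read.
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)  # fraction of test rows classified correctly
print(f"test accuracy: {accuracy:.2f}")

# Text version of the tree: each indented line is a test on one feature.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
print(tree.export_text(clf, feature_names=feature_names))
```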