# About this Notebook

In this notebook we use several well-known scikit-learn models for classification.

## Iris Dataset

First, we work with the Iris dataset, a classic classification problem.

The goal is to identify the iris species from the dimensions of the flower.

<img src="https://machinelearninghd.com/wp-content/uploads/2021/03/iris-dataset.png">

In [None]:
from sklearn import datasets

iris = datasets.load_iris()

In [None]:
y = iris.target
X = iris.data[:,:2]

In [None]:
import pandas as pd

In [None]:
y
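
The `y` array holds integer class labels. As a quick orientation (a minimal sketch using only attributes that `load_iris` provides), the following shows which species and measurements those integers and columns stand for:

```python
from sklearn import datasets

iris = datasets.load_iris()

# Integer labels 0, 1, 2 index into target_names
print(list(iris.target_names))
# The four measured dimensions, in the column order of iris.data
print(iris.feature_names)
# Number of flowers and number of measurements per flower
print(iris.data.shape)
```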

We split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
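
As a sanity check on the split, the sketch below prints the resulting sizes. It also shows the optional `stratify` argument of `train_test_split`, which the cell above does not use; it is included here only as an assumption about what may be useful on small datasets, since it keeps the class proportions equal in both parts. The variable names are hypothetical and chosen so that they do not overwrite the notebook's own.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X2, labels = iris.data[:, :2], iris.target

# Same settings as above: 70% train, 30% test
X_tr, X_te, y_tr, y_te = train_test_split(X2, labels, test_size=0.3, random_state=42)
print(X_tr.shape, X_te.shape)

# Optional variant: stratify preserves the class proportions in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X2, labels, test_size=0.3, random_state=42, stratify=labels
)
print(np.bincount(y_te_s))  # test instances per class
```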

Now, we are going to use different classic algorithms.

## Linear Classification

First, we are going to use a Linear Classifier.

In [None]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(max_iter=100)
model.fit(X_train, y_train)

In [None]:
predict = model.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [None]:
accuracy_score(y_test, predict)  # signature is (y_true, y_pred)

The accuracy shows how well the linear model classifies the test set using only two features.

Next, we visualize the decision boundaries.

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    model,
    X[:,:2],
    ax=plt.gca(),
    response_method="predict",
    xlabel=iris.feature_names[0], ylabel=iris.feature_names[1], alpha=0.5
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=iris.target, edgecolor="k")

We can see that linear boundaries divide the feature space into three regions, one per class.

We have used only two variables so that the result can be visualized; with all four we achieve better results.
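
The three regions arise because `SGDClassifier` handles multiple classes with a one-vs-rest scheme: it fits one linear scoring function per class and predicts the class with the highest score. A minimal sketch of that idea, refitting a model on the two features (the `random_state` and the larger `max_iter` are assumptions added for reproducibility and convergence):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier

iris = datasets.load_iris()
X2, labels = iris.data[:, :2], iris.target

clf = SGDClassifier(max_iter=1000, random_state=42).fit(X2, labels)

# One row of weights and one intercept per class
print(clf.coef_.shape, clf.intercept_.shape)

# Each instance gets one linear score per class ...
scores = clf.decision_function(X2)
# ... and the predicted class is the one with the highest score
manual = clf.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual, clf.predict(X2)))
```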

**Task: Modify the code to use the 4 attributes, and check the accuracy.**

In [None]:
predict = model.predict(X_test)

In [None]:
cross_val_score(model, X, y, cv=5).mean()
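
`cross_val_score` with `cv=5` splits the data into five folds, trains on four of them, scores on the remaining one, and repeats this for each fold; `.mean()` then averages the five scores. A short sketch that shows the individual fold scores (the `random_state` and `max_iter` values are assumptions added for reproducibility and convergence):

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X2, labels = iris.data[:, :2], iris.target

clf = SGDClassifier(max_iter=1000, random_state=42)
fold_scores = cross_val_score(clf, X2, labels, cv=5)

print(fold_scores)         # one accuracy value per fold
print(fold_scores.mean())  # average over the folds
print(fold_scores.std())   # spread across the folds
```

Looking at the spread as well as the mean gives a rough idea of how much the score depends on the particular split.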

## K-Nearest Neighbors

This algorithm classifies an instance according to the classes of its nearest neighbors in the training data.
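
The idea can be sketched in a few lines of NumPy. This is an illustrative toy version, not how `KNeighborsClassifier` is implemented; the function name `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the class of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training instance
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest instances
    nearest = np.argsort(distances)[:k]
    # Most frequent class among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups (hypothetical values)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # query near the first group
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # query near the second group
```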

In [None]:

from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

### Important: Data must be normalized

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
model_knn = Pipeline([
            ("scaler", StandardScaler()),
            ("knn", knn)])
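
Scaling matters because k-NN relies on distances: a feature measured on a larger scale would otherwise dominate the neighbor search. The sketch below shows what `StandardScaler` does to the features, namely shifting each column to mean 0 and scaling it to standard deviation 1:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X_raw = datasets.load_iris().data

scaler = StandardScaler().fit(X_raw)
X_scaled = scaler.transform(X_raw)

# Before scaling, the columns have different means and spreads
print(X_raw.mean(axis=0), X_raw.std(axis=0))
# After scaling, every column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Placing the scaler inside a `Pipeline` means it is fitted only on the data passed to `fit`, so statistics from the test set do not leak into training.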

In [None]:
model_knn.fit(X_train, y_train)

In [None]:
predict = model_knn.predict(X_test)

In [None]:
accuracy_score(y_test, predict)  # signature is (y_true, y_pred)

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    model_knn,
    X[:,:2],
    ax=plt.gca(),
    response_method="predict",
    xlabel=iris.feature_names[0], ylabel=iris.feature_names[1], alpha=0.5
)
disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k")

In this case, the regions form around the instances of each class rather than being separated by straight lines.

**Task: Try several values for the number of neighbors.**

In [None]:
cross_val_score(model_knn, X, y, cv=5).mean()

## Support Vector Machine

SVM is a very popular classifier that divides the space with the boundary that leaves the largest margin between the classes.

<img src="https://miro.medium.com/max/1200/1*06GSco3ItM3gwW2scY6Tmg.png" width="50%">

In [None]:
from sklearn.svm import LinearSVC

In [None]:
svc = LinearSVC()

In [None]:
model_svc = Pipeline([("scale", StandardScaler()), ("svc", svc)])

In [None]:
model_svc.fit(X_train, y_train)

In [None]:
predict = model_svc.predict(X_test)
accuracy_score(y_test, predict)  # signature is (y_true, y_pred)

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    model_svc,
    X[:,:2],
    response_method="predict",
    xlabel=iris.feature_names[0], ylabel=iris.feature_names[1], alpha=0.5
)
disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k")

In [None]:
cross_val_score(model_svc, X, y, cv=5).mean()

## Titanic Dataset

We are going to use another dataset: predicting which passengers survived the Titanic.

In [None]:
data = pd.read_csv("titanic.csv").dropna()

In [None]:
data.shape
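
Note that `dropna()` discards every row with at least one missing value, so a single sparsely filled column can remove most of the data. The toy frame below (hypothetical values, not the real file) illustrates the effect and one possible alternative, dropping the sparse column before dropping rows:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with a mostly empty 'Cabin' column
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Dropping rows with any missing value keeps only fully populated rows
print(len(toy.dropna()))

# Dropping the sparse column first preserves more rows
print(len(toy.drop(columns=["Cabin"]).dropna()))
```

Comparing `data.shape` before and after `dropna()` is a quick way to check how many rows were lost.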

In [None]:
data.head()

In [None]:
y = data['Survived']

In [None]:
X = data.drop(['Survived', 'Cabin', 'PassengerId', 'Name', 'Ticket'], axis=1)

In [None]:
import seaborn as sns
disp = sns.countplot(x = 'Survived', hue = 'Sex', palette = 'Set1', data = data)
disp.set(title = 'Passenger status (Survived/Died) by Sex', 
       xlabel = 'Survived', ylabel = 'Total')
plt.show()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
labels_t = {}

for col in ['Sex', 'Embarked']:
    labels_t[col] = LabelEncoder().fit(X[col])
    X[col] = labels_t[col].transform(X[col])
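
`LabelEncoder` maps each distinct value to an integer, numbering the values in sorted order. A small sketch with made-up values shows the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["male", "female", "female", "male"])

# Distinct values in sorted order; the position is the integer code
print(list(encoder.classes_))
# Each value is replaced by its position in classes_
print(encoder.transform(["male", "female"]).tolist())
# Codes can be mapped back to the original values
print(encoder.inverse_transform([0, 1]).tolist())
```

Keeping the fitted encoders in a dictionary, as `labels_t` does above, makes it possible to decode the columns later.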

In [None]:
X.info()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

## Decision Tree

Decision trees are among the most intuitive models for predicting a category. The idea is to learn a tree of tests on the attributes automatically, so that each instance follows a path determined by its attribute values and is assigned the category of the leaf it reaches.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model_tree = DecisionTreeClassifier(max_depth=3)

In [None]:
model_tree.fit(X_train, y_train)

We are going to visualize it.

In [None]:

from sklearn import tree

In [None]:
tree.plot_tree(model_tree)
plt.show()

In [None]:
plt.figure(figsize=(50,50))
tree.plot_tree(model_tree, feature_names=list(X_train.columns))  # a plain list is accepted by all scikit-learn versions
plt.show()
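
A fitted tree can also be printed as nested if/else rules with `tree.export_text`, which makes explicit how an instance is routed by its attributes to a class. The sketch below uses the Iris data so that it is self-contained; the `random_state` and the depth of 2 are assumptions chosen to keep the output short and reproducible.

```python
from sklearn import datasets, tree
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Each indented line is a test on one attribute; leaves show the assigned class
rules = tree.export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```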

In [None]:
cross_val_score(model_tree, X, y, cv=5).mean()

## Ensemble models and Random Forest

In this example, we use an ensemble model, the Random Forest, which combines the predictions of many decision trees.

<img src="https://miro.medium.com/max/1482/0*Srg7htj4TOMP5ldX.png">

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_rf = RandomForestClassifier(n_estimators=50) # Limit the number of trees

In [None]:
cross_val_score(model_rf, X, y, cv=5).mean()
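
A random forest trains many decision trees on random resamples of the data, considering random subsets of the features at each split, and then combines their predictions. The sketch below, on the Iris data for self-containedness and with an assumed `random_state`, shows the individual trees and the feature importances that a fitted forest exposes:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(iris.data, iris.target)

# The ensemble is a list of fitted decision trees
print(len(forest.estimators_))

# Importance of each feature across the trees (the values sum to 1)
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```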

# Task: Tackle 

We have a dataset of pumpkin seeds, and we want to predict the Class from the features. Compare the different models seen above using cross-validation, and identify the best ones.

In [None]:
from scipy.io import arff
import pandas as pd

data = arff.loadarff('Pumpkin_Seeds_Dataset.arff')
seeds = pd.DataFrame(data[0])
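
`scipy.io.arff.loadarff` returns nominal (string) attributes as byte strings, so the `Class` column may display as `b'...'` values. If plain strings are preferred, they can be decoded; the sketch below uses a hypothetical stand-in frame, and the UTF-8 encoding is an assumption about the file:

```python
import pandas as pd

# Hypothetical stand-in for a frame built from loadarff output
toy_seeds = pd.DataFrame({"Area": [1.0, 2.0], "Class": [b"A", b"B"]})

# Decode the byte strings into ordinary Python strings
toy_seeds["Class"] = toy_seeds["Class"].str.decode("utf-8")
print(toy_seeds["Class"].tolist())
```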

In [None]:
seeds.head()

In [None]:
target_labels = LabelEncoder().fit(seeds['Class'])

In [None]:
y = target_labels.transform(seeds['Class'])
X = seeds.drop(['Class'], axis=1)