# Classification using structured data

## Example: iris data

### Vectorization

![Iris setosa](https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg/220px-Kosaciec_szczecinkowaty_Iris_setosa.jpg) ![Iris versicolor](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Iris_versicolor_3.jpg/220px-Iris_versicolor_3.jpg) ![Iris virginica](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Iris_virginica.jpg/220px-Iris_virginica.jpg)

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
iris

Convert to a `DataFrame` with `pandas`:

In [None]:
import pandas as pd
idf = pd.DataFrame(iris["data"], columns=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"])
idf["target"] = iris["target"]
idf["name"] = [iris["target_names"][target] for target in iris["target"]]
idf
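As an alternative to building the `DataFrame` by hand, scikit-learn (version 0.23 or newer) can return pandas objects directly via `as_frame=True`. The following is a minimal sketch; `iris_frame` and `idf_alt` are illustrative names, and the feature columns keep scikit-learn's own labels (for example `sepal length (cm)`) rather than the shorter names used above.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of arrays.
iris_frame = datasets.load_iris(as_frame=True)

# .frame holds the four feature columns plus a numeric "target" column.
idf_alt = iris_frame.frame.copy()

# Add the readable species name, mirroring the "name" column built by hand.
idf_alt["name"] = iris_frame.target_names[idf_alt["target"].to_numpy()]

print(idf_alt.head())
```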

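*Summary statistics* (`describe()` reports the five-number summary of each column together with count, mean, and standard deviation)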

In [None]:
idf.describe()
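To isolate the five values themselves (minimum, first quartile, median, third quartile, maximum), the quantiles can be requested directly. This is a sketch only; `features` and `five_num` are illustrative names, and the columns use scikit-learn's `feature_names`.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
features = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# Quantiles 0, 0.25, 0.5, 0.75 and 1 are min, Q1, median, Q3 and max.
five_num = features.quantile([0, 0.25, 0.5, 0.75, 1])
five_num.index = ["min", "Q1", "median", "Q3", "max"]

print(five_num)
```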

Plot a single feature as a histogram:

In [None]:
idf["Petal Width"].plot.hist()

Which features might be suitable for vectorization?

In [None]:
idf.plot.scatter(x="Sepal Length", y="Sepal Width", c="target", cmap="Set1")

Or is it better to use all of them? Plot every pair of features against each other to see which ones contribute most to separating the classes:

In [None]:
import seaborn as sns
sns.pairplot(idf.drop(columns=["target"]), hue="name")
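The visual impression from the pair plot can be backed by numbers. One possible check, sketched here under the assumption that per-class means and standard deviations are a reasonable first indicator, is to group by species: features whose class means lie far apart relative to their spread are likely to separate the classes well. `per_class` is an illustrative name.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
idf = pd.DataFrame(
    iris["data"],
    columns=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"],
)
idf["name"] = iris["target_names"][iris["target"]]

# Mean and standard deviation of every feature within each species.
per_class = idf.groupby("name").agg(["mean", "std"])

print(per_class.round(2))
```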

### Iris classification

In [None]:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
svm.fit(iris['data'], iris['target'])
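With `loss='hinge'`, `SGDClassifier` fits a linear support vector machine by stochastic gradient descent, which explains the variable name `svm`. Stochastic gradient descent is generally sensitive to feature scale, so a common variation is to standardize the features first. The following sketch is not part of the original workflow; `scaled_svm` is a hypothetical name, and whether scaling helps on this particular data set is left open.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Standardize each feature to zero mean and unit variance, then fit the
# same hinge-loss SGD classifier on the scaled values.
scaled_svm = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42),
)
scaled_svm.fit(iris["data"], iris["target"])

print(scaled_svm.predict(iris["data"][:5]))
```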

In [None]:
svm.predict(iris['data'])

In [None]:
svm.predict(iris['data']) == iris['target']
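A long boolean array is hard to judge by eye. Because `True` counts as 1 and `False` as 0, its mean is the fraction of correct predictions, that is, the accuracy. The labels below are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

# Hypothetical labels, not taken from the iris run above.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 1])

matches = y_pred == y_true   # element-wise comparison, as in the cell above
accuracy = matches.mean()    # True counts as 1, False as 0

print(f"{matches.sum()} of {matches.size} correct, accuracy {accuracy:.2f}")
```

Applied to the notebook's own arrays, the same idea would read `(svm.predict(iris['data']) == iris['target']).mean()`.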

Looks good, but the model is predicting the very samples it was trained on. Does it really generalize?

### Train-test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.25, random_state=42)
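It is worth checking what the split produced. With 150 samples and `test_size=0.25`, scikit-learn rounds the test set up to 38 samples and leaves 112 for training. The sketch below additionally passes `stratify`, which is an assumption not used in the original split; it keeps the class proportions nearly equal in both parts.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"],
    iris["target"],
    test_size=0.25,
    random_state=42,
    stratify=iris["target"],  # assumption: not part of the original call
)

print(X_train.shape, X_test.shape)  # sizes of the training and test sets
print(np.bincount(y_test))          # number of test samples per class
```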

In [None]:
svm = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42)
svm.fit(X_train, y_train)

In [None]:
svm.predict(X_test) == y_test
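To answer the generalization question with a single number per data set, the accuracy on the training part can be compared with the accuracy on the held-out part. This is a sketch that repeats the split and fit from above so it runs on its own; `train_acc` and `test_acc` are illustrative names.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], test_size=0.25, random_state=42
)

svm = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42)
svm.fit(X_train, y_train)

# score() returns the mean accuracy on the data it is given.
train_acc = svm.score(X_train, y_train)
test_acc = svm.score(X_test, y_test)

print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A test accuracy far below the training accuracy would suggest that the model has memorized the training samples rather than learned a rule that carries over.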

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, svm.predict(X_test))
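In the matrix returned by `confusion_matrix`, row `i` is the true class and column `j` is the predicted class, so correct predictions sit on the diagonal. The counting can be reproduced by hand; `toy_confusion_matrix` is a hypothetical helper written for illustration, and the labels are made up.

```python
import numpy as np


def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Count (true, predicted) label pairs; illustrative helper only."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # row = true class, column = predicted class
    return cm


# Hypothetical labels, not taken from the iris run above.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 1]

cm = toy_confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
```

Here class 0 is always predicted correctly, while one sample of class 1 is mistaken for class 2 and one sample of class 2 for class 1.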

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, svm.predict(X_test)))
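The per-class columns of the report can be derived from a confusion matrix: precision divides the diagonal by the column sums, recall divides it by the row sums, F1 is their harmonic mean, and support is the number of true samples per class. The matrix below is hypothetical; the sketch also assumes every class is predicted at least once, since otherwise the precision would divide by zero.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true, columns: predicted).
cm = np.array([
    [2, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / all predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / all truly in the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)               # true samples per class

for label, (p, r, f, s) in enumerate(zip(precision, recall, f1, support)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```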