# Machine Learning 3

The cells below recreate the model we just built. It's a function that gives us the desired output. Now, we want to __improve__ on our model.

In [11]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data  # our input data
y = iris.target # our labels

feature_names = iris.feature_names # column names (pre-defined)
target_names = iris.target_names # target names (pre-defined)
feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [12]:
from sklearn.model_selection import train_test_split

from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data  # our input data
y = iris.target # our labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% training and 20% test
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)


In [13]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)

In [14]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_predict))

0.9666666666666667


How can we improve on this model? So far, all we've done is split our data into `train` and `test` sets. But we can decide how much data to give the model to train on.
Above, we chose to use 80% of the data for training and 20% for testing. Every row we hold back for testing is a row the model can't learn from, so a larger test set leaves less training data, which tends to lower our accuracy.

If we give it less data to test, `test_size=0.1`, we may well get 100% accuracy. But that partly reflects how small the test set is. We only have 15 inputs to test! With a test set that small, we can't be sure the model is really that accurate, because we only tested it `15` times.

The trade-off:

- How much data do we want to train the model with?
- How much data do we want to test it with?

The more data we have, the more we can train our model.
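To make the trade-off concrete, here is a minimal sketch that counts how many rows end up on each side for a few `test_size` values. The `random_state=0` seed and the `split_sizes` dictionary are assumptions added for this illustration; they are not part of the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# random_state=0 is an arbitrary seed, chosen only so the split is repeatable
split_sizes = {}
for test_size in (0.1, 0.2, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    # Record (training rows, test rows) for this test_size
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} training rows, {len(X_test)} test rows")
```

Iris only has 150 rows, so every row we move to one side is taken away from the other.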

## How to improve?

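With the parameters in `KNeighborsClassifier()`, we can do more, such as changing `n_neighbors`. This is the number of nearest training points (dots) that vote on each prediction. It is not the number of flower types, so it doesn't have to match our 3 classes. Above we used 4. With an even number the vote can split two against two, so it's worth trying 3 as well and comparing the accuracy, which may shift a little in either direction.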

Another thing we can do, if we can collect more data, is to add more columns. These are usually called features, and the more features a model has to look at, the more information it has.

That's not to say the more features, the better. But if there were another informative feature of the flowers that we could measure, we could build a better model to predict the flower's type.

The most interesting part is the actual algorithm that we use. In this case, we use the `KNeighborsClassifier()` algorithm, but we can swap in whatever algorithm we want.

If we go to the [Scikit-Learn library](https://scikit-learn.org/stable/supervised_learning.html), there are many options to try. One of them is a decision tree.

In [15]:
from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data  # our input data
y = iris.target # our labels

feature_names = iris.feature_names # column names (pre-defined)
target_names = iris.target_names # target names (pre-defined)
feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [22]:
from sklearn.model_selection import train_test_split

from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data  # our input data
y = iris.target # our labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% training and 20% test
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)


In [23]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)

In [24]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_predict))

0.9666666666666667


Does this mean that a test size of 0.1 is the best option? With `test_size=0.1`, our test set is very small. Only 15 inputs are being tested, while we're learning from 135 entries.

Because our test data is so small, even a score of 15 out of 15 doesn't tell us that our model is that accurate, because __we only tested it 15 times__.
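One way to see how much a 15-row test set can swing is to repeat the split with different random seeds and compare the scores. This is only a sketch: the ten seeds and the `scores` list are assumptions made for the illustration, not something the cells above do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

scores = []
for seed in range(10):  # ten arbitrary seeds, one fresh split per seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=4)
    knn.fit(X_train, y_train)
    scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

print("lowest accuracy: ", min(scores))
print("highest accuracy:", max(scores))
```

With only 15 test rows, each wrong prediction moves the accuracy by 1/15, or almost 7 percentage points, so a single split can look better or worse than the model really is.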

This is the trade-off between how much data we train the model with and how much we test it with. Companies really value data. __The more data we have, the more we can train our models__.

To improve on what we have here, we can change the parameters in `KNeighborsClassifier(n_neighbors=4)`. `n_neighbors` is the number of __nearest dots__ (training points) that vote on each prediction. It is not the number of Iris flower types, so it doesn't have to match our three classes. But with an even value like `(n_neighbors=4)`, the vote can split two against two, so let's try 3 and see whether the accuracy changes.

In [27]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_predict = knn.predict(X_test)

In [28]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_predict))

0.9666666666666667
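Rather than changing `n_neighbors` by hand, we could try a range of values on one fixed split and compare them. The sketch below assumes a seed of `random_state=0` and a range of 1 to 10; both are arbitrary choices for the illustration, and `accuracy_by_k` is a name introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

# Fix the split so that n_neighbors is the only thing that changes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

accuracy_by_k = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = metrics.accuracy_score(y_test, knn.predict(X_test))
    print(f"n_neighbors={k}: accuracy={accuracy_by_k[k]:.3f}")
```

On a dataset this small, the differences between neighboring values are often within one or two test rows, so they should be read with caution.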


What else can we do to improve the model? If we can collect more data, we may be able to add more columns. These are usually called features, and the more features a model has to look at, the more information it has.

That's not to say the more features, the better. But if there were another informative feature of the flowers that we could measure, we could build a better model to predict that flower's type.
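We can't measure new features for Iris, but we can get a feel for what features contribute by going the other way and hiding some columns from the model. This sketch trains the same classifier on different column subsets. The subsets, the `random_state=0` seed, and the `accuracy_by_features` name are assumptions made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Column indices follow the order of iris.feature_names:
# 0 = sepal length, 1 = sepal width, 2 = petal length, 3 = petal width
feature_sets = {
    "sepal only": [0, 1],
    "petal only": [2, 3],
    "all four": [0, 1, 2, 3],
}

accuracy_by_features = {}
for name, columns in feature_sets.items():
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train[:, columns], y_train)  # train on the chosen columns only
    y_predict = knn.predict(X_test[:, columns])
    accuracy_by_features[name] = metrics.accuracy_score(y_test, y_predict)
    print(f"{name}: accuracy={accuracy_by_features[name]:.3f}")
```

If one subset scores about as well as all four columns, that suggests the remaining columns add little for this model, which is why more features are not automatically better.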

The most interesting part is the actual algorithm that we use. In this case, we use `KNeighborsClassifier`, but we can swap in whatever algorithm we want.

If we go to the [Scikit-Learn Library](https://scikit-learn.org/stable/model_selection.html#model-selection), there are a lot of things we can try, like a __decision tree__, using a decision tree classifier. We simply need to adjust our imports and swap the classifier:

In [38]:
from sklearn.model_selection import train_test_split

from sklearn.datasets import load_iris
iris = load_iris()

X = iris.data  # our input data
y = iris.target # our labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5) # 50% training and 50% test
print(X_train.shape)
print(X_test.shape)

(75, 4)
(75, 4)


In [39]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()  # a decision tree, not KNN, so we name it accordingly
tree.fit(X_train, y_train)
y_predict = tree.predict(X_test)

In [40]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_predict))

0.9466666666666667
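The decision tree above was scored on a different split (`test_size=0.5`) than the KNN model, so the two numbers aren't directly comparable. A fairer comparison, sketched below, fits both models on the same split. The `random_state=0` seeds and the `models` and `results` dictionaries are assumptions added for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

# One shared split, so both models see exactly the same training and test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
    # random_state makes the tree's tie-breaking repeatable
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = metrics.accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy={results[name]:.3f}")
```

Because both classifiers share the same `fit` and `predict` interface, trying another algorithm from the library only means adding one more entry to `models`.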
