# 📝 Exercise M1.02

The goal of this exercise is to fit a model similar to the one in the previous
notebook, to get familiar with manipulating scikit-learn objects and in
particular the `.fit/.predict/.score` API.

Let's load the adult census dataset with only numerical variables.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")
data = adult_census.drop(columns="class")
target = adult_census["class"]

In the previous notebook we used `model = KNeighborsClassifier()`. All
scikit-learn models can be created without arguments. This is convenient
because it means that you don't need to understand the full details of a model
before starting to use it.

One of the `KNeighborsClassifier` parameters is `n_neighbors`. It controls the
number of neighbors we are going to use to make a prediction for a new data
point.

What is the default value of the `n_neighbors` parameter?

**Hint**: Look at the documentation on the [scikit-learn
website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
or directly access the description inside your notebook by running the
following cell. This opens a pager pointing to the documentation.

In [2]:
from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier?

Init signature:
KNeighborsClassifier(
    n_neighbors=5,
    *,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None,
)
Docstring:
Classifier implementing the k-nearest neighbors vote.

Read more in the :ref:`User Guide <classification>`.

Parameters
----------
n_
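As an alternative to the pager, the default values can also be read programmatically. The sketch below assumes only that scikit-learn is installed; `default_model` and `default_params` are illustrative names that do not appear elsewhere in this notebook.

```python
from sklearn.neighbors import KNeighborsClassifier

# Create the model without arguments so every parameter keeps its default.
default_model = KNeighborsClassifier()

# get_params() returns a dict mapping parameter names to their current values.
default_params = default_model.get_params()
print(default_params["n_neighbors"])
```

The printed value should agree with the `n_neighbors=5` entry shown in the init signature.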

Create a `KNeighborsClassifier` model with `n_neighbors=50`.

In [3]:
# Write your code here.
classifier = KNeighborsClassifier(n_neighbors=50)

Fit this model on the data and target loaded above.

In [4]:
# Write your code here.
classifier.fit(data, target)
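To see what `fit` does without needing the census file, here is a minimal sketch on synthetic data. The data from `make_classification` is an assumption made for illustration and is unrelated to the adult census dataset; `X`, `y`, and `fitted` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (data, target): 200 samples, 4 numerical features.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

classifier = KNeighborsClassifier(n_neighbors=50)

# fit() stores what the model needs for prediction and returns the estimator itself.
fitted = classifier.fit(X, y)
print(fitted is classifier)

# After fitting, learned attributes (with a trailing underscore) become available.
print(classifier.classes_)
```

Because `fit` returns the estimator, construction and fitting can be chained as `KNeighborsClassifier(n_neighbors=50).fit(X, y)`.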

Use your model to make predictions on the first 10 data points of `data`.
Do they match the actual target values?

In [6]:
# Write your code here.
pred = classifier.predict(data[:10])
pred

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K',
       ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

In [8]:
target[:10]

0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
5     <=50K
6     <=50K
7      >50K
8     <=50K
9      >50K
Name: class, dtype: object

In [9]:
pred == target[:10]

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: class, dtype: bool
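The element-wise comparison can be condensed into a count and a fraction of correct predictions. The sketch below is self-contained: it rebuilds the ten predictions and ten target values by copying them from the outputs shown above instead of reloading the dataset, and `target_head` and `matches` are illustrative names.

```python
import numpy as np
import pandas as pd

# Values copied from the outputs above (note the leading space in each label).
pred = np.array([" <=50K"] * 7 + [" >50K", " <=50K", " <=50K"], dtype=object)
target_head = pd.Series([" <=50K"] * 7 + [" >50K", " <=50K", " >50K"], name="class")

# Element-wise comparison gives a boolean Series, True where the prediction is right.
matches = pred == target_head

# Summing booleans counts the correct predictions; the mean is their fraction.
print(matches.sum(), "of", len(matches))
print(matches.mean())
```

Only the last of the ten samples is misclassified, so the fraction of correct predictions on this small slice is nine out of ten.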

Compute the accuracy on the training data.

In [11]:
# Write your code here.
classifier.score(data, target)

0.8290379545978042
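For a classifier, `score` returns the mean accuracy, that is, the fraction of samples whose prediction equals the true label. The sketch below checks this on synthetic data from `make_classification`, which is an assumption for illustration and not the census data; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (data, target).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
classifier = KNeighborsClassifier(n_neighbors=50).fit(X, y)

predictions = classifier.predict(X)

# Three ways to compute the same quantity.
manual_accuracy = np.mean(predictions == y)
metric_accuracy = accuracy_score(y, predictions)
score_accuracy = classifier.score(X, y)
print(manual_accuracy, metric_accuracy, score_accuracy)
```

All three printed values should be equal, since they all measure the fraction of correct predictions on the same samples.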

Now load the test data from `"../datasets/adult-census-numeric-test.csv"` and
compute the accuracy on the test data.

In [13]:
# Write your code here.
test_data = pd.read_csv("../datasets/adult-census-numeric-test.csv")
test_data

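|      | age | capital-gain | capital-loss | hours-per-week | class |
|------|-----|--------------|--------------|----------------|-------|
| 0    | 20  | 0            | 0            | 35             | <=50K |
| 1    | 53  | 0            | 0            | 72             | >50K  |
| 2    | 41  | 0            | 0            | 50             | >50K  |
| 3    | 20  | 0            | 0            | 40             | <=50K |
| 4    | 25  | 0            | 0            | 40             | <=50K |
| ...  | ... | ...          | ...          | ...            | ...   |
| 9764 | 30  | 0            | 0            | 49             | <=50K |
| 9765 | 57  | 0            | 0            | 50             | >50K  |
| 9766 | 63  | 0            | 0            | 35             | <=50K |
| 9767 | 59  | 0            | 0            | 40             | <=50K |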


In [14]:
classifier.score(test_data.drop(columns="class"), test_data["class"])

0.8177909714402702