# 📝 Exercise M1.02

The goal of this exercise is to fit a model similar to the one in the previous
notebook, to get familiar with manipulating scikit-learn objects and, in
particular, the `.fit/.predict/.score` API.

Let's load the adult census dataset with only numerical variables:

In [44]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census-numeric.csv")
X_train = adult_census.drop(columns="class")
y_train = adult_census["class"]

In the previous notebook we used `model = KNeighborsClassifier()`. All
scikit-learn models can be created without arguments. This is convenient
because it means that you don't need to understand the full details of a model
before starting to use it.

One of the `KNeighborsClassifier` parameters is `n_neighbors`. It controls the
number of neighbors we are going to use to make a prediction for a new data
point.

What is the default value of the `n_neighbors` parameter?

**Hint**: Look at the documentation on the [scikit-learn
website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
or directly access the description inside your notebook by running the
following cell. This opens a pager pointing to the documentation.

In [45]:
from sklearn.neighbors import KNeighborsClassifier

# documentation in-line
KNeighborsClassifier?

Init signature:
KNeighborsClassifier(
    n_neighbors=5,
    *,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None,
)
Docstring:
Classifier implementing the k-nearest neighbors vote.

Read more in the :ref:`User Guide <classification>`.

Parameters
----------
n_neighbors : int, default=5
    Number of neighbors to use by default for :meth:`kneighbors` queries.

weights : {'uniform', 'distance'}, callable or None, default='uniform'
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
  

Create a `KNeighborsClassifier` model with `n_neighbors=50`.

In [46]:
# Write your code here.
model = KNeighborsClassifier(n_neighbors=50)
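
To check which value a model actually uses, the parameters can be read back from the estimator. The following is a minimal sketch, not part of the original exercise; the names `default_model` and `custom_model` are introduced here for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier

# A model created without arguments keeps every parameter at its default.
default_model = KNeighborsClassifier()

# A model created with an explicit value overrides that default.
custom_model = KNeighborsClassifier(n_neighbors=50)

# get_params() returns a dict mapping parameter names to their current values.
default_n_neighbors = default_model.get_params()["n_neighbors"]
custom_n_neighbors = custom_model.get_params()["n_neighbors"]

print("default n_neighbors:", default_n_neighbors)
print("custom n_neighbors:", custom_n_neighbors)
```

Reading the parameters this way does not require the model to be fitted.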

Fit this model on the data and target loaded above.

In [47]:
KNeighborsClassifier.fit?

Signature: KNeighborsClassifier.fit(self, X, y)
Docstring:
Fit the k-nearest neighbors classifier from the training dataset.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples) if metric='precomputed'
    Training data.

y : {array-like, sparse matrix} of shape (n_samples,) or (n_samples, n_outputs)
    Target values.

Returns
-------
self : KNeighborsClassifier
    The fitted k-nearest neighbors classifier.
File:      ~/miniconda3/envs/mlmooc/lib/python3.13/site-packages/sklearn/neighbors/_classification.py
Type:      function

In [48]:
# Write your code here.
model.fit(X_train, y_train)

Use your model to make predictions on the first 10 data points inside the
data. Do they match the actual target values?

In [49]:
model.predict?

Signature: model.predict(X)
Docstring:
Predict the class labels for the provided data.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_queries, n_features), or (n_queries, n_indexed) if metric == 'precomputed', or None
    Test samples. If `None`, predictions for all indexed points are
    returned; in this case, points are not considered their own
    neighbors.

Returns
-------
y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
    Class labels for each data sample.
File:      ~/miniconda3/envs/mlmooc/lib/python3.13/site-packages/sklearn/neighbors/_classification.py
Type:      method

In [50]:
# Write your code here.
predictions = model.predict(X_train)
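
The cell above predicts on the whole training set. To answer the question about the first 10 data points, one option is to place the predictions and the true targets side by side. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the census file, and the names `X_demo`, `y_demo`, `demo_model` and `comparison` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (assumption: not the real dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
y_demo = pd.Series(y, name="class")

demo_model = KNeighborsClassifier(n_neighbors=50).fit(X_demo, y_demo)

# Predict only the first 10 rows and line them up with the true targets.
first_predictions = demo_model.predict(X_demo.iloc[:10])
comparison = pd.DataFrame(
    {
        "predicted": first_predictions,
        "actual": y_demo.iloc[:10].to_numpy(),
    }
)
comparison["match"] = comparison["predicted"] == comparison["actual"]

print(comparison)
print("Correct predictions:", int(comparison["match"].sum()), "out of 10")
```

With the objects from this notebook, the same comparison would use `model`, `X_train.iloc[:10]` and `y_train.iloc[:10]`.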

Compute the accuracy on the training data.

In [52]:
# option 1:
print((predictions == y_train).mean())

# option 2:
model.score(X_train, y_train)

0.8290635477183733


0.8290635477183733
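
Both options print the same number because, for a classifier, `score` returns the mean accuracy, which is the fraction of predictions equal to the true labels. The following sketch checks this on synthetic data (an assumption, used here so that the example runs without the census file); `accuracy_score` from `sklearn.metrics` is shown as a third equivalent route.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

demo_model = KNeighborsClassifier(n_neighbors=50).fit(X_demo, y_demo)
demo_predictions = demo_model.predict(X_demo)

# Option 1: fraction of predictions that equal the true labels.
manual_accuracy = (demo_predictions == y_demo).mean()

# Option 2: the estimator's own score method (mean accuracy for classifiers).
score_accuracy = demo_model.score(X_demo, y_demo)

# Option 3: the standalone metric function.
metric_accuracy = accuracy_score(y_demo, demo_predictions)

print(manual_accuracy, score_accuracy, metric_accuracy)
```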

Now load the test data from `"../datasets/adult-census-numeric-test.csv"` and
compute the accuracy on the test data.

In [53]:
model.score?

Signature: model.score(X, y, sample_weight=None)
Docstring:
Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like of shape (n_samples, n_features), or None
    Test samples. If `None`, predictions for all indexed points are
    used; in this case, points are not considered their own
    neighbors. This means that `knn.fit(X, y).score(None, y)`
    implicitly performs a leave-one-out cross-validation procedure
    and is equivalent to `cross_val_score(knn, X, y, cv=LeaveOneOut())`
    but typically much faster.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for `X`.

sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.

Returns
-------
score : float
    Mean accuracy of ``self.predi

In [54]:
# Write your code here.
adult_census_test = pd.read_csv("../datasets/adult-census-numeric-test.csv")
X_test = adult_census_test.drop(columns=["class"])
y_test = adult_census_test["class"]
model.score(X_test, y_test)

0.8194288054048521