In [66]:
import pandas as pd

In [67]:
adult_census = pd.read_csv('../data/adult_census_train.csv')

In [68]:
adult_census.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [69]:
numerical_catagories = ['age', 'capital-gain', 'capital-loss','hours-per-week', 'class']
adult_census_numerical = adult_census[numerical_catagories]

In [70]:
adult_census_numerical['class'].value_counts()

 <=50K    29802
 >50K      9286
Name: class, dtype: int64

In [71]:
target_name = 'class'
target = adult_census_numerical[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
39083     <=50K
39084     <=50K
39085     <=50K
39086     <=50K
39087      >50K
Name: class, Length: 39088, dtype: object

In [72]:
data = adult_census_numerical.drop(columns=[target_name])
data.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


We can now take a closer look at the variables, also called features, that we
will use to build our predictive model. We can also check how many samples are
available in our dataset.

In [73]:
data.columns

Index(['age', 'capital-gain', 'capital-loss', 'hours-per-week'], dtype='object')

In [74]:
print(f'The dataset contains {data.shape[0]} samples and {data.shape[1]} features')

The dataset contains 39088 samples and 4 features


## Fit a model and make predictions

We will build a classification model using the "K-nearest neighbors"
strategy. To predict the target of a new sample, a k-nearest neighbors model
takes into account its `k` closest samples in the training set and predicts
the majority target of these samples.
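
To make the majority-vote idea concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements `KNeighborsClassifier`; the helper `knn_predict` and the toy data are made up for illustration, and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, new_point, k=5):
    """Predict the majority label among the k training points closest to new_point."""
    # Sort training indices by Euclidean distance to the new point.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], new_point))
    # Keep the labels of the k closest samples and return the most common one.
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): (age, hours-per-week) pairs.
toy_points = [(25, 40), (38, 50), (28, 40), (44, 40), (18, 30), (52, 60)]
toy_labels = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]

prediction = knn_predict(toy_points, toy_labels, new_point=(45, 45), k=3)
print(prediction)
```

With `k=3`, the three closest toy points are `(44, 40)`, `(38, 50)` and `(52, 60)`; two of them are labeled `>50K`, so that label wins the vote.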

<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p class="last">We use a K-nearest neighbors model here. However, be aware that it is seldom
useful in practice. We use it because it is an intuitive algorithm. In the next
notebook, we will introduce better models.</p>
</div>

The `fit` method is called to train the model from the input (features) and
target data.

In [75]:
from sklearn import set_config
set_config()

In [76]:
from sklearn.neighbors import KNeighborsClassifier

In [77]:
model = KNeighborsClassifier()
model.fit(data, target)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

The method `fit` is composed of two elements: (i) a **learning algorithm**
and (ii) some **model states**. The learning algorithm takes the training
data and training target as input and sets the model states. These model
states will be used later to either predict (for classifiers and regressors)
or transform data (for transformers).

Both the learning algorithm and the type of model states are specific to each
type of model.
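
The split between learning algorithm and model states can be sketched with a deliberately trivial model. The class `MajorityClassifier` below is hypothetical and not part of scikit-learn; it only mimics the `fit`/`predict` naming to show where the state is set and where it is read.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy model: always predicts the most frequent training target."""

    def fit(self, data, target):
        # Learning algorithm: count the targets and keep the most frequent one.
        # The stored attribute is the model state.
        self.majority_ = Counter(target).most_common(1)[0][0]
        return self

    def predict(self, data):
        # Prediction function: combine the input data with the stored state.
        return [self.majority_ for _ in data]


# Made-up training data for illustration.
toy_data = [[25], [38], [28], [44], [18]]
toy_target = ["<=50K", "<=50K", ">50K", ">50K", "<=50K"]

toy_model = MajorityClassifier().fit(toy_data, toy_target)
toy_predictions = toy_model.predict([[30], [60]])
print(toy_model.majority_)
print(toy_predictions)
```

Here the state is a single label. Other models keep different states, which is why both the learning algorithm and the prediction function differ from one model type to another.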

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Here and later, we use the name <tt class="docutils literal">data</tt> and <tt class="docutils literal">target</tt> to be explicit. In
scikit-learn documentation, <tt class="docutils literal">data</tt> is commonly named <tt class="docutils literal">X</tt> and <tt class="docutils literal">target</tt> is
commonly called <tt class="docutils literal">y</tt>.</p>
</div>

In [78]:
target_predicted = model.predict(data)

To predict, a model uses a **prediction function** that will use the input
data together with the model states. As for the learning algorithm and the
model states, the prediction function is specific for each type of model.

In [79]:
target_predicted[:5]

array([' <=50K', ' <=50K', ' <=50K', ' >50K', ' <=50K'], dtype=object)

In [80]:
target[:5]

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

In [81]:
target_predicted[:5] == target[:5]

0     True
1     True
2    False
3     True
4     True
Name: class, dtype: bool

In [82]:
print(f'Number of correct predictions: '
     f'{(target[:5] == target_predicted[:5]).sum()}/5')

Number of correct predictions: 4/5


Here, we see that our model makes a mistake when predicting for the third
sample (index 2).

To get a better assessment, we can compute the average success rate.

In [83]:
(target_predicted == target).mean()

0.8223239869013508

## Why the evaluation above is flawed

The accuracy above was computed on the same data that was used to fit the
model. Can this evaluation be trusted, or is it too good to be true?

## Train-test data split

When building a machine learning model, it is important to evaluate the
trained model on data that was not used to fit it, because generalization is
more than memorization: we want a rule that generalizes to new data, not one
that only compares new samples to data it has memorized.
It is harder to draw correct conclusions about never-seen instances than about
already-seen ones.

Correct evaluation is easily done by leaving out a subset of the data when
training the model and using it afterwards for model evaluation.
The data used to fit a model is called training data while the data used to
assess a model is called testing data.
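
This notebook relies on a test file that was stored separately. When only one dataset is available, the same idea can be sketched by setting aside a random subset. The helper `holdout_split`, its parameters and the toy data are assumptions made for illustration; scikit-learn provides `train_test_split` in `sklearn.model_selection` for this purpose.

```python
import random


def holdout_split(samples, targets, test_fraction=0.25, seed=0):
    """Shuffle sample indices and set aside a fraction of them for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [targets[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [targets[i] for i in test_idx])
    return train, test


toy_samples = [[age] for age in range(20, 40)]    # 20 made-up samples
toy_targets = [age % 2 for age in range(20, 40)]  # made-up binary targets

(train_X, train_y), (test_X, test_y) = holdout_split(toy_samples, toy_targets)
print(len(train_X), len(test_X))
```

The model would then be fitted on `train_X`, `train_y` only and evaluated on `test_X`, `test_y`, so that no sample is used for both purposes.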

We can load more data, which was left out of the original dataset.


In [84]:
adult_census_test = pd.read_csv('../data/adult_census_test.csv')

In [85]:
adult_census_test = adult_census_test[numerical_catagories]

In [86]:
target_test = adult_census_test[target_name]
data_test = adult_census_test.drop(columns=[target_name])

In [87]:
print(f"The testing dataset contains {data_test.shape[0]} samples and "
      f"{data_test.shape[1]} features")

The testing dataset contains 9754 samples and 4 features



Instead of computing the predictions and manually computing the average
success rate, we can use the method `score`. For classifiers, this method
returns the mean accuracy, that is, the fraction of correctly classified
samples.

In [88]:
accuracy = model.score(data_test, target_test)

# Access the model name from the model class.
model_name = model.__class__.__name__

print(f'The test accuracy using a {model_name} is {accuracy:.3f}')

The test accuracy using a KNeighborsClassifier is 0.807


To compute the score, the predictor first computes the predictions (using
the `predict` method) and then uses a scoring function to compare the
true target `y` and the predictions. Finally, the score is returned.
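
The two steps described above can be sketched in plain Python. The functions `accuracy` and `score` and the class `AlwaysLow` are hypothetical stand-ins written for this illustration; they are not scikit-learn's implementation.

```python
def accuracy(true_targets, predictions):
    """Fraction of predictions equal to the true targets."""
    matches = [t == p for t, p in zip(true_targets, predictions)]
    return sum(matches) / len(matches)


def score(predictor, data, true_targets):
    """Predict first, then compare the predictions with the true targets."""
    predictions = predictor.predict(data)
    return accuracy(true_targets, predictions)


class AlwaysLow:
    """Made-up predictor used only to exercise the sketch."""

    def predict(self, data):
        return ["<=50K"] * len(data)


example_score = score(AlwaysLow(),
                      [[25], [38], [28], [44], [18]],
                      ["<=50K", "<=50K", ">50K", ">50K", "<=50K"])
print(example_score)
```

Three of the five made-up targets are `<=50K`, so this predictor is right on 3 samples out of 5.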

If we compare with the accuracy obtained by wrongly evaluating the model
on the training set (about 0.822), we find that this evaluation was indeed
optimistic compared to the score obtained on a held-out test set (0.807).
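
The optimism of a training-set evaluation can be pushed to an extreme. The sketch below assumes a 1-nearest-neighbor rule (not the 5-neighbor model used in this notebook) and made-up one-dimensional data with one noisy training label; all names are hypothetical.

```python
from math import dist


def one_nn_predict(train_points, train_labels, point):
    # Return the label of the single closest training point.
    closest = min(range(len(train_points)),
                  key=lambda i: dist(train_points[i], point))
    return train_labels[closest]


def one_nn_accuracy(points, labels, train_points, train_labels):
    hits = [one_nn_predict(train_points, train_labels, p) == y
            for p, y in zip(points, labels)]
    return sum(hits) / len(hits)


# Made-up data: the label is 1 when x > 5, except for a noisy label at x = 3.
train_points = [(1,), (2,), (3,), (4,), (6,), (7,), (8,), (9,)]
train_labels = [0, 0, 1, 0, 1, 1, 1, 1]
test_points = [(2.4,), (2.9,), (3.2,), (3.4,), (5.6,), (7.5,)]
test_labels = [0, 0, 0, 0, 1, 1]

train_accuracy = one_nn_accuracy(train_points, train_labels,
                                 train_points, train_labels)
test_accuracy = one_nn_accuracy(test_points, test_labels,
                                train_points, train_labels)
print(train_accuracy, test_accuracy)
```

Each training point is its own nearest neighbor, so the training accuracy is perfect even though one label is noisy. The test points near `x = 3` inherit that noisy label, which only the held-out evaluation reveals.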

This shows the importance of always testing the statistical performance of
predictive models on a different set than the one used to train them.
We will discuss later in more detail how predictive models should be
evaluated.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">In this MOOC, we will refer to the <strong>statistical performance</strong> of a model when
referring to the test score or test error obtained by comparing the
predictions of a model with the true targets. Equivalent terms for
<strong>statistical performance</strong> are predictive performance and generalization
performance. We will refer to the <strong>computational performance</strong> of a predictive
model when assessing the computational costs of training a predictive model
or using it to make predictions.</p>
</div>

In this notebook we:

* fitted a **k-nearest neighbors** model on a training dataset;
* evaluated its statistical performance on the testing data;
* introduced the scikit-learn API `.fit(X, y)` (to train a model),
  `.predict(X)` (to make predictions) and `.score(X, y)`
  (to evaluate a model).