In [8]:
#!pip install --upgrade scikit-learn

# First model with scikit-learn

In this notebook, we present how to build predictive models on tabular datasets, with only numeric features.

In particular, we will highlight:
* the scikit-learn API: `.fit(X, y)`/`.predict(X)`/`.score(X, y)`;
* how to evaluate the generalization performance of a model with a train-test split.

## Loading the dataset with Pandas

We will load a subset of the original data that contains only the numerical columns.

In [9]:
import pandas as pd

adult_census = pd.read_csv("https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/master/datasets/adult-census-numeric.csv")
adult_census.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,class
0,41,0,0,92,<=50K
1,48,0,0,40,<=50K
2,60,0,0,25,<=50K
3,37,0,0,45,<=50K
4,73,3273,0,40,<=50K


## Separate the data and the target

In [10]:
target_name = "class"
target = adult_census[target_name]
target

0         <=50K
1         <=50K
2         <=50K
3         <=50K
4         <=50K
          ...  
39068     <=50K
39069     <=50K
39070      >50K
39071     <=50K
39072      >50K
Name: class, Length: 39073, dtype: object

In [11]:
data = adult_census.drop(columns=[target_name])
data.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,41,0,0,92
1,48,0,0,40
2,60,0,0,25
3,37,0,0,45
4,73,3273,0,40


In [12]:
data.columns

Index(['age', 'capital-gain', 'capital-loss', 'hours-per-week'], dtype='object')

In [13]:
print(f"The dataset contains {data.shape[0]} samples and "
      f"{data.shape[1]} features")

The dataset contains 39073 samples and 4 features


## Fit a model and make predictions

We will build a classification model using the "K-nearest neighbors" strategy. To predict the target of a new sample, a k-nearest neighbors model takes into account its `k` closest samples in the training set and predicts the majority target of these samples.
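
The majority-vote rule described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (Euclidean distance, a brute-force search, made-up toy data); the helper `knn_predict` and the toy values are hypothetical and this is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k training points closest to the query
    nearest = sorted(range(len(train_points)), key=distances.__getitem__)[:k]
    # Majority label among those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (assumption): (age, hours-per-week) pairs with made-up labels
toy_points = [(25, 20), (30, 35), (45, 50), (50, 60), (28, 25)]
toy_labels = ["<=50K", "<=50K", ">50K", ">50K", "<=50K"]

prediction = knn_predict(toy_points, toy_labels, query=(47, 55), k=3)
print(prediction)
```

With `k=3`, two of the three closest toy points carry the label `>50K`, so that label wins the vote.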

Note that k-nearest neighbors is used here because it is an intuitive algorithm. It is seldom useful in practice.

The `fit` method is called to train the model from the input features and target data.

In [14]:
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

In [16]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(data, target)

The method `fit` is composed of two elements: (i) a **learning algorithm** and (ii) some **model states**. The learning algorithm takes the training data and training target as input and sets the model states. These model states will be used later to either predict (for classifiers and regressors) or transform data (for transformers). Both the learning algorithm and the type of model states are specific to each type of model. In the scikit-learn documentation, `data` is commonly named `X` and `target` is commonly called `y`.
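
To make the split between learning algorithm and model state concrete, here is a minimal sketch of a toy estimator that follows the same `fit`/`predict` pattern. The class `MajorityClassifier` and its attribute `most_frequent_` are hypothetical names invented for this illustration; they are not part of scikit-learn.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learning algorithm: count the labels and remember the most frequent one.
        # The trailing underscore mimics scikit-learn's naming for learned state.
        self.most_frequent_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows chaining, as scikit-learn estimators do

    def predict(self, X):
        # Prediction only reads the state that fit stored on the object.
        return [self.most_frequent_ for _ in X]


# Made-up toy data: one feature (age) per sample
toy_X = [[41], [48], [60], [37]]
toy_y = ["<=50K", "<=50K", ">50K", "<=50K"]

toy_model = MajorityClassifier().fit(toy_X, toy_y)
print(toy_model.most_frequent_)
print(toy_model.predict([[73], [25]]))
```

Here the model state is a single label. For a k-nearest neighbors model, the state is essentially the training set itself, since predictions need to look up the closest training samples.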

Let's use our model to make some predictions on the same dataset that was used for training.

In [17]:
target_predicted = model.predict(data)

In [18]:
# Look at first five predicted targets
target_predicted[:5]

array([' >50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K'], dtype=object)

In [19]:
# And compare to actual data
target[:5]

0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: class, dtype: object

In [20]:
# Check if predictions agree with the real targets
target[:5] == target_predicted[:5]

0    False
1     True
2     True
3     True
4     True
Name: class, dtype: bool

In [21]:
# Compute average success rate
(target == target_predicted).mean()

0.8224349294909529

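But this evaluation cannot be trusted: the model was scored on the same data it was trained on, so this number does not tell us how well the model generalizes to samples it has never seen.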
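
To see why a score computed on the training data tends to be optimistic, consider an extreme case: a 1-nearest-neighbor model. When all training points are distinct, each one is its own nearest neighbor, so the training accuracy is perfect regardless of how well the model does on new data. The sketch below uses made-up points and labels; the helper names are hypothetical.

```python
import math


def one_nn_predict(train_points, train_labels, query):
    """Return the label of the single training point closest to `query`."""
    closest = min(range(len(train_points)),
                  key=lambda i: math.dist(train_points[i], query))
    return train_labels[closest]


def accuracy_on(points, labels, train_points, train_labels):
    """Fraction of `points` whose 1-NN prediction matches `labels`."""
    predictions = [one_nn_predict(train_points, train_labels, p) for p in points]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Made-up training set with distinct points and arbitrary labels
train_points = [(1, 1), (2, 5), (4, 2), (6, 6), (7, 1)]
train_labels = ["a", "b", "a", "b", "b"]

# Made-up held-out points that the model has never seen
test_points = [(1, 2), (6, 2), (3, 4)]
test_labels = ["a", "a", "b"]

train_accuracy = accuracy_on(train_points, train_labels, train_points, train_labels)
test_accuracy = accuracy_on(test_points, test_labels, train_points, train_labels)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

The size of the gap depends entirely on the toy data chosen here. The point is only that the training score can look perfect while the held-out score does not.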

## Train-test data split

We can load more data that was left out of the original dataset.

In [22]:
adult_test = pd.read_csv("https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/master/datasets/adult-census-numeric-test.csv")

In [23]:
target_test = adult_test[target_name]
data_test = adult_test.drop(columns=[target_name])

In [24]:
print(f"Testing dataset has {data_test.shape[0]} samples and "
      f"{data_test.shape[1]} features")

Testing dataset has 9769 samples and 4 features


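For classifiers, the method `score` computes predictions for the given data and returns their accuracy, that is, the fraction of predictions that match the given target.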

In [25]:
accuracy = model.score(data_test, target_test)
model_name = model.__class__.__name__

print(f"The test accuracy using a {model_name} is "
      f"{accuracy:.3f}")

The test accuracy using a KNeighborsClassifier is 0.807
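
For a classifier, the value returned by `score` is the accuracy: the fraction of predictions that match the true target. A minimal sketch of that computation, using a hypothetical helper `accuracy` and the five labels and predictions displayed earlier in this notebook, could look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# First five true targets and predictions, copied from the outputs above
true_labels = [" <=50K", " <=50K", " <=50K", " <=50K", " <=50K"]
predicted_labels = [" >50K", " <=50K", " <=50K", " <=50K", " <=50K"]

first_five_accuracy = accuracy(true_labels, predicted_labels)
print(first_five_accuracy)  # 4 of the 5 predictions match
```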
