# Working with numerical data
In this notebook, we aim at:

- identifying numerical data in a heterogeneous dataset;
- selecting the subset of columns corresponding to numerical data;
- using a scikit-learn helper to separate data into train-test sets;
- training and evaluating a more complex scikit-learn model.



### Load in the data

In [2]:
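import pandas as pd
# Load the adult census dataset
adult_census = pd.read_csv("../datasets/adult-census.csv")
# Drop "education-num", which duplicates the information in "education"
adult_census = adult_census.drop(columns="education-num")
adult_census.head(2)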

| | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |


In [3]:
# Separate data and target
data, target = adult_census.drop(columns="class"), adult_census["class"]

In [4]:
data.shape

(48842, 12)

In [5]:
target.shape

(48842,)

### Select only numerical features

In [6]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [7]:
numerical_columns = ["age", "capital-gain",
                     "capital-loss", "hours-per-week"]

In [8]:
data_numeric = data[numerical_columns]

In [9]:
data_numeric.head(2)

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
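The column list above was typed by hand after reading `data.dtypes`. As a minimal sketch, the same selection can be derived from the dtypes with pandas' `select_dtypes`. The `toy` frame and its values are made up for illustration and only stand in for the census data.

```python
import pandas as pd

# Toy frame standing in for the census data (values are made up).
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "capital-gain": [0, 0, 0],
    "hours-per-week": [40, 50, 40],
})

# Keep only the columns whose dtype is numeric (int64, float64, ...).
numerical_columns = toy.select_dtypes(include="number").columns.tolist()
toy_numeric = toy[numerical_columns]
print(numerical_columns)
```

A numeric dtype does not guarantee that a column is truly numerical (integer-coded categories also pass this filter), so the result is worth checking by eye.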


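#### Missing data
- A numerical column such as age can contain missing values (NaN). They can be replaced with a sentinel such as -1 or, better, with the mean value of the column.
- Filling in missing values this way is called imputation.
- In categorical data, missing values may be encoded with a placeholder such as '?'.
- An indicator column can record which values were missing: age -> age_missing (0 or 1, true or false).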
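The notes above can be sketched with scikit-learn's `SimpleImputer`. This is an illustration on a made-up frame, not a step the notebook performs: the `toy` data, the choice of the mean strategy, and the `age_missing` column name follow the notes but are otherwise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing age and one '?' placeholder (made-up values).
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 55.0],
    "workclass": ["Private", "?", "Private", "Local-gov"],
})

# Turn the '?' placeholder into a proper missing value.
toy["workclass"] = toy["workclass"].mask(toy["workclass"] == "?")

# Replace missing ages with the column mean; add_indicator=True appends
# a 0/1 column flagging the rows that were imputed.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
age_imputed = imputer.fit_transform(toy[["age"]])
toy["age"] = age_imputed[:, 0]
toy["age_missing"] = age_imputed[:, 1].astype(bool)
print(toy)
```

In a real workflow the imputer would be fitted on the training set only and then applied to the test set, so that the test data does not influence the imputed value.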

## Train-test split the dataset

In [10]:
from sklearn.model_selection import train_test_split

In [12]:
# Hold out 25% of the samples for testing; random_state makes the split reproducible
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, test_size=0.25, random_state=42)

In [13]:
data_train.shape, data_test.shape

((36631, 4), (12211, 4))

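With 1000 samples in your data and a 20% test split, you get 200 samples in your test set, so one test sample is worth 0.5 percentage points of accuracy. A change in accuracy from 50.5% to 51% is then a single sample, and a change from 98.8% to 98.9% is smaller than one sample; only a large change such as 50% to 70% is clearly meaningful.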
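The note above on test-set size can be checked with a little arithmetic. The sketch below computes the smallest accuracy change a test set can resolve. It also computes a binomial standard error, which is not part of the original notes and assumes independent test samples. Both helper functions are hypothetical names introduced for this illustration.

```python
import math


def accuracy_resolution(n_test):
    """Smallest accuracy change measurable on n_test samples (one sample)."""
    return 1 / n_test


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimate (assumes i.i.d. samples)."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


# Compare a 200-sample test set with the 12211-sample test set used here.
for n_test in (200, 12211):
    print(n_test,
          accuracy_resolution(n_test),
          accuracy_standard_error(0.5, n_test))
```

With 200 test samples the standard error at 50% accuracy is several percentage points, which is why small differences in accuracy should not be over-interpreted on small test sets.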

In [14]:
data_train.head(3)

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 27859 | 41 | 0 | 2415 | 12 |
| 5654 | 39 | 0 | 0 | 37 |
| 3779 | 34 | 0 | 0 | 50 |


### Train the model

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
model = LogisticRegression()

In [17]:
model.fit(data_train, target_train)

In [18]:
# For classifiers, score returns the mean accuracy on the given data
accuracy = model.score(data_test, target_test)

In [19]:
accuracy

0.8070592089099992
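To make explicit what `score` computes for a classifier, the sketch below recomputes the accuracy by hand as the fraction of correct predictions. It uses synthetic data from `make_classification` as a stand-in for the census features, so the numbers it produces are unrelated to the result above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four numerical census features.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy is the fraction of test samples predicted correctly.
predictions = model.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

print(manual_accuracy, model.score(X_test, y_test))
```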

### Exercise: Compare with simple baselines
#### 1. Compare with simple baseline
The goal of this exercise is to compare the performance of the classifier trained above (roughly 81% accuracy with LogisticRegression) to some simple baseline classifiers. The simplest baseline classifier is one that always predicts the same class, irrespective of the input data.

What would be the score of a model that always predicts ' >50K'?

What would be the score of a model that always predicts ' <=50K'?

Is 81% or 82% accuracy a good score for this problem?

Use a DummyClassifier such that the resulting classifier will always predict the class ' >50K'. What is the accuracy score on the test set? Repeat the experiment by always predicting the class ' <=50K'.

Hint: you can set the strategy parameter of the DummyClassifier to achieve the desired behavior.

You can import DummyClassifier like this:
```python
from sklearn.dummy import DummyClassifier
```
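As a minimal sketch of the API only, the example below fits a `DummyClassifier` on made-up toy labels (`"low"` and `"high"` are hypothetical class names, not the census classes). The `strategy="constant"` option with the `constant` parameter always predicts the given class, which must appear in the training labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: the features are ignored by DummyClassifier, only labels matter.
X_toy = np.zeros((10, 1))
y_toy = np.array(["low"] * 7 + ["high"] * 3)

# Always predict the class "high", whatever the input.
always_high = DummyClassifier(strategy="constant", constant="high")
always_high.fit(X_toy, y_toy)
score_high = always_high.score(X_toy, y_toy)

# Always predict the most frequent class seen during fit.
most_frequent = DummyClassifier(strategy="most_frequent")
most_frequent.fit(X_toy, y_toy)
score_most_frequent = most_frequent.score(X_toy, y_toy)

print(score_high, score_most_frequent)
```

The accuracy of a constant classifier equals the proportion of the predicted class in the evaluated data, so the same pattern applied to the census split gives the baseline scores asked for above.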

#### 2. (optional) Try out other baselines
What other baselines can you think of? How well do they perform?