# Working with numerical data


#### What we'll do
- identify numerical data in a heterogeneous dataset
- select corresponding columns
- use a scikit-learn helper to split the data into training and test sets
- train and evaluate a more complex scikit-learn model

### Loading the entire dataset

In [44]:
import pandas as pd
import numpy as np
np.set_printoptions(legacy='1.25')
adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop duplicated column
adult_census = adult_census.drop(columns=["education-num"])
adult_census.head()

,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
# as before
data, target = adult_census.drop(columns="class"), adult_census["class"]

In [5]:
data.head()

,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [6]:
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

### Identifying numerical data
- represented by numbers (but beware: categories can be encoded as numbers too)
- measurable quantities, e.g. age or hours worked
- require little preprocessing before training models

In [7]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [8]:
data.dtypes.unique()

array([dtype('int64'), dtype('O')], dtype=object)

In [9]:
data.head()

,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


Observations
- columns of dtype `object` contain strings; these are categorical and are handled later
- for now, select the integer columns and check their content
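The column names can be typed by hand, as in the next cell, or picked out by dtype. A minimal sketch of the second option, assuming a small made-up frame (`toy` and its values are hypothetical, not the census data):

```python
import pandas as pd

# Tiny hypothetical frame mimicking a few columns of the census data.
toy = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "workclass": ["Private", "Private", "Local-gov"],
        "hours-per-week": [40, 50, 40],
    }
)

# Keep only the columns whose dtype is numeric (integers or floats).
numeric_toy_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_toy_columns)
```

Selecting by dtype is convenient, but it would also pick up categories that happen to be stored as integers, so the result is still worth checking by eye.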

In [10]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data[numerical_columns].head()

,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


In [11]:
data["age"].describe()

count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

In [12]:
# store subset of numerical columns in a new dataframe
data_numeric = data[numerical_columns]

### Train-test split the dataset
- rather than splitting by hand, as we did earlier, we can use scikit-learn's `train_test_split` helper


In [13]:
from sklearn.model_selection import train_test_split

In [14]:
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25
)

Notes
- setting `random_state` fixes the seed of the random number generator used for shuffling, so the split is the same on every run
- `test_size=0.25` puts 25% of the rows in the test set and the remaining 75% in the training set
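To see what the fixed seed buys us, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical): two calls with the same `random_state` should return the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Two independent calls with the same seed.
_, _, _, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
_, _, _, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# The held-out samples are identical across both calls.
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split, len(y_test_a))
```

Without a fixed `random_state`, the two calls would generally hold out different rows, and scores would vary from run to run.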

In [16]:
fraction_test = data_test.shape[0] / data_numeric.shape[0]
fraction_test

0.2500102370910282
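The fraction is not exactly 0.25 because 25% of 48842 rows is not a whole number. The sketch below reproduces the arithmetic, assuming (as far as I understand scikit-learn's behavior) that the test set size is rounded up:

```python
import math

n_samples = 48842  # number of rows reported for `target` above
test_size = 0.25

# Assumption: the test set size is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)       # 12211 36631
print(n_test / n_samples)    # 0.2500102370910282
```

These counts agree with the shapes shown elsewhere in this notebook: 12211 test rows and 36631 training rows.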

### Train a logistic regression model

In [18]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [19]:
model.fit(data_train, target_train)

LogisticRegression()

Parameter,Value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


In [20]:
accuracy = model.score(data_test, target_test)

In [21]:
print(f"Accuracy of logistic regression: {accuracy:.3f}")

Accuracy of logistic regression: 0.807
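For a classifier, `score` returns the accuracy: the fraction of samples whose predicted label equals the true label. A minimal sketch on synthetic data (the arrays and labels below are made up for illustration) that computes it by hand and compares:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: two features, label depends on their sum.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, "high", "low")

clf = LogisticRegression().fit(X_toy, y_toy)

# Accuracy by hand: share of predictions that match the true labels.
manual_accuracy = (clf.predict(X_toy) == y_toy).mean()

print(np.isclose(manual_accuracy, clf.score(X_toy, y_toy)))
```

Here the model is scored on the data it was trained on only to keep the sketch short; in the notebook the score is computed on the held-out test set.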


### Conclusion
- logistic regression finds the right income type in around 8 out of 10 cases
- **is this generalization performance relevant for a good predictive model?**

## Exercise: Compare with simple baselines
#### 1. Compare with simple baseline
The goal of this exercise is to compare the performance of the classifier trained above (roughly 81% accuracy with `LogisticRegression`) to some simple baseline classifiers. The simplest baseline classifier is one that always predicts the same class, irrespective of the input data.

What would be the score of a model that always predicts ' >50K'?

What would be the score of a model that always predicts ' <=50K'?

Is roughly 81% accuracy a good score for this problem?

Use a `DummyClassifier` such that the resulting classifier will always predict the class ' >50K'. What is the accuracy score on the test set? Repeat the experiment by always predicting the class ' <=50K'.

Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve the desired behavior.

You can import DummyClassifier like this:
```python
from sklearn.dummy import DummyClassifier
```

#### 2. (optional) Try out other baselines
What other baselines can you think of? How well do they perform?

## Solution



In [22]:
from sklearn.dummy import DummyClassifier

In [23]:
?DummyClassifier

Init signature: DummyClassifier(*, strategy='prior', random_state=None, constant=None)
Docstring:
DummyClassifier makes predictions that ignore the input features.

This classifier serves as a simple baseline to compare against other more
complex classifiers.

The specific behavior of the baseline is selected with the `strategy`
parameter.

All strategies make predictions that ignore the input feature values passed
as the `X` argument to `fit` and `predict`. The predictions, however,
typically depend on values observed in the `y` parameter passed to `fit`.

Note that the "stratified" and "uniform" strategies lead to
non-deterministic predictions that can be rendered deterministic by setting
the `random_state` parameter if needed. The other strategies are naturally
deterministic and, once fit, always return the same constant prediction
for any value of `X`.

Read more in the :ref:`User Guide <dummy_estimators>`

In [30]:
model = DummyClassifier(strategy="most_frequent", random_state=42)

In [31]:
model.fit(data_train, target_train)

DummyClassifier(random_state=42, strategy='most_frequent')

Parameter,Value
strategy,'most_frequent'
random_state,42
constant,None


In [34]:
model.predict_proba(data_test)

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]], shape=(12211, 2))

In [32]:
accuracy = model.score(data_test, target_test)
accuracy

0.7660306281221849

In [36]:
model = DummyClassifier(strategy="prior", random_state=42)
model.fit(data_train, target_train)
model.predict_proba(data_test)

array([[0.75894734, 0.24105266],
       [0.75894734, 0.24105266],
       [0.75894734, 0.24105266],
       ...,
       [0.75894734, 0.24105266],
       [0.75894734, 0.24105266],
       [0.75894734, 0.24105266]], shape=(12211, 2))

In [37]:
accuracy = model.score(data_test, target_test)
accuracy

0.7660306281221849

We can also verify that ` <=50K` is the most frequent class and that its share of the test set equals the baseline accuracy:

In [39]:
target_train.value_counts()

class
<=50K    27801
>50K      8830
Name: count, dtype: int64

In [40]:
target_test.value_counts()

class
<=50K    9354
>50K     2857
Name: count, dtype: int64

In [42]:
(target_test == " <=50K").mean()

np.float64(0.7660306281221849)
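The exercise also asks for a classifier that always predicts ` >50K`. One way to get that is the `constant` strategy; the sketch below uses made-up labels (`X_toy`, `y_toy` are hypothetical) rather than the census split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels; note the leading space, as in the census classes.
y_toy = np.array([" <=50K"] * 6 + [" >50K"] * 2)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

# Always predict the label passed as `constant`.
always_high = DummyClassifier(strategy="constant", constant=" >50K")
always_high.fit(X_toy, y_toy)

high_accuracy = always_high.score(X_toy, y_toy)
print(high_accuracy)  # 0.25, since 2 of the 8 toy labels are " >50K"
```

On the real test set, the counts above (2857 of 12211 rows are ` >50K`) imply that such a model would score about 0.234, the complement of the most-frequent baseline.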

### Conclusion
- we want our models to perform better than always predicting the most frequent class
- we can use `DummyClassifier` as a benchmark to compare our models against