### Working with numerical data

In [1]:
import pandas as pd
import numpy as np

In [2]:
adult_census = pd.read_csv("../datasets/adult-census.csv")

In [3]:
adult_census = adult_census.drop(columns=["education-num"])

In [4]:
adult_census.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [5]:
data, target = adult_census.drop(columns="class"), adult_census["class"]

In [6]:
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

In [7]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [8]:
data.dtypes.unique()

array([dtype('int64'), dtype('O')], dtype=object)
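
The two dtypes above suggest that numerical columns could be picked by dtype rather than typed out by hand. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical, not the census data) and using pandas' `select_dtypes`:

```python
import pandas as pd

# Hypothetical toy frame that mimics a few census columns.
toy = pd.DataFrame(
    {
        "age": [25, 38],
        "workclass": ["Private", "Local-gov"],
        "capital-gain": [0, 7688],
        "hours-per-week": [40, 50],
    }
)

# Keep only the columns with a numeric dtype (here: the int64 columns).
toy_numerical_columns = toy.select_dtypes(include="number").columns.tolist()
print(toy_numerical_columns)
```

An integer dtype does not guarantee that a column is a true quantity (it may be an integer-coded category), so the selected list is still worth checking by eye.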

In [9]:
data.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [10]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

In [11]:
data[numerical_columns].head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


In [12]:
data["age"].describe()

count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

In [13]:
data_numeric = data[numerical_columns]

In [15]:
data_numeric.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,25,0,0,40
1,38,0,0,50
2,28,0,0,40
3,44,7688,0,40
4,18,0,0,30


### Train-test split the dataset

In [14]:
from sklearn.model_selection import train_test_split

In [16]:
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25
)

In [17]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [18]:
model.fit(data_train, target_train)

LogisticRegression parameters:
parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


In [19]:
accuracy = model.score(data_test, target_test)

In [20]:
print(f"The accuracy of logistic regression is: {accuracy}")

The accuracy of logistic regression is: 0.8070592089099992



### Exercise: Comparing Logistic Regression to a Baseline

The goal of this exercise is to compare the performance of our classifier in
the previous notebook (roughly 81% accuracy with `LogisticRegression`) to some
simple baseline classifiers. The simplest baseline classifier is one that
always predicts the same class, irrespective of the input data.

- What would be the score of a model that always predicts `' >50K'`?
- What would be the score of a model that always predicts `' <=50K'`?
- Is an accuracy of roughly 81% a good score for this problem?

Use a `DummyClassifier` and a train-test split to evaluate its accuracy on
the test set. The scikit-learn user guide section on
[dummy estimators](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)
shows a few examples of how to evaluate the generalization performance of
these baseline models.
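
To make the idea of a constant baseline concrete, here is a minimal sketch on made-up labels (the 75/25 split and the placeholder feature column are assumptions, not the census data). It illustrates that a classifier that always predicts the most frequent class scores exactly the share of that class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 75% " <=50K" and 25% " >50K".
y = np.array([" <=50K"] * 75 + [" >50K"] * 25)
# DummyClassifier ignores the features, so a placeholder column is enough.
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Accuracy of the baseline versus the share of the majority class.
baseline_accuracy = baseline.score(X, y)
majority_share = np.mean(y == " <=50K")
print(baseline_accuracy, majority_share)
```

On imbalanced data this share can be high, which is why a raw accuracy figure is best read against such a baseline.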

```python
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
```

We first split our dataset to have the target separated from the data used to
train our predictive model.

```python
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)
```

We start by selecting only the numerical columns as seen in the previous
notebook.

```python
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]
```

Split the data and target into a train and test set.

```python
from sklearn.model_selection import train_test_split

# Write your code here.
```

Use a `DummyClassifier` such that the resulting classifier always predicts the
class `' >50K'`. What is the accuracy score on the test set? Repeat the
experiment by always predicting the class `' <=50K'`.

Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve
the desired behavior.


In [21]:
from sklearn.dummy import DummyClassifier

In [22]:
model = DummyClassifier(strategy="most_frequent", random_state=42)

In [26]:
?DummyClassifier

Init signature: DummyClassifier(*, strategy='prior', random_state=None, constant=None)
Docstring:
DummyClassifier makes predictions that ignore the input features.

This classifier serves as a simple baseline to compare against other more
complex classifiers.

The specific behavior of the baseline is selected with the `strategy`
parameter.

All strategies make predictions that ignore the input feature values passed
as the `X` argument to `fit` and `predict`. The predictions, however,
typically depend on values observed in the `y` parameter passed to `fit`.

Note that the "stratified" and "uniform" strategies lead to
non-deterministic predictions that can be rendered deterministic by setting
the `random_state` parameter if needed. The other strategies are naturally
deterministic and, once fit, always return the same constant prediction
for any value of `X`.

Read more in the :ref:`User Guide <dummy_estimators>`
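
The docstring's note on `random_state` can be checked with a short sketch. The toy labels and the helper `stratified_predictions` are hypothetical; the point is only that the `"stratified"` strategy draws random predictions, and that fixing the seed makes them repeatable:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels; the features are ignored by the classifier.
y = np.array([" <=50K"] * 75 + [" >50K"] * 25)
X = np.zeros((len(y), 1))


def stratified_predictions(seed):
    """Fit a stratified dummy classifier with the given seed and predict on X."""
    clf = DummyClassifier(strategy="stratified", random_state=seed)
    clf.fit(X, y)
    return clf.predict(X)


first = stratified_predictions(0)
second = stratified_predictions(0)

# With the same integer seed, both runs give identical predictions.
same_seed_matches = bool(np.array_equal(first, second))
print(same_seed_matches)
```

For `"most_frequent"` and `"constant"`, the seed has no effect because the prediction is always the same class.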

In [25]:
target_train[:5]

27859      >50K
5654      <=50K
3779       >50K
10522      >50K
22461     <=50K
Name: class, dtype: object

In [23]:
model.fit(data_train, target_train)

DummyClassifier parameters:
parameter,value
strategy,'most_frequent'
random_state,42
constant,None


In [24]:
accuracy = model.score(data_test, target_test)

In [None]:
accuracy

0.7660306281221849

In [47]:
model1 = DummyClassifier(strategy="constant", constant=" >50K")

In [48]:
model1.fit(data_train, target_train)

DummyClassifier parameters:
parameter,value
strategy,'constant'
random_state,None
constant,' >50K'


In [49]:
model1.predict(data_test)

array([' >50K', ' >50K', ' >50K', ..., ' >50K', ' >50K', ' >50K'],
      shape=(12211,), dtype='<U5')
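
The cells above stop at `predict`. A minimal sketch of the remaining step, scoring one constant baseline per class, is given below on made-up labels (the toy splits are assumptions, not the census data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy splits; the features are ignored by DummyClassifier.
y_train = np.array([" <=50K"] * 6 + [" >50K"] * 2)
y_test = np.array([" <=50K"] * 3 + [" >50K"] * 1)
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Fit one constant baseline per class and record its test accuracy.
scores = {}
for label in [" >50K", " <=50K"]:
    clf = DummyClassifier(strategy="constant", constant=label)
    clf.fit(X_train, y_train)
    scores[label] = clf.score(X_test, y_test)

print(scores)
```

With two classes, the two constant baselines always sum to an accuracy of 1. On the census split used in this notebook, `model1.score(data_test, target_test)` and its `' <=50K'` counterpart would play the same role; if `' <=50K'` is the majority class in the training set, its baseline should match the `most_frequent` accuracy of about 0.766 reported earlier.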