# **S07: EXERCISES**

## Exercise 01: Dealing with categories

Load the Adults dataset from `data/adult.csv.zip` and build a machine learning model to estimate the target `income`. This dataset contains categorical variables as well as numerical ones, so don't forget to preprocess these variables before training the model. Use the `LogisticRegression` model from `scikit-learn` for training.

*Note: This dataset is for classification, so feel free to experiment with any model you want from the classification `scikit-learn` catalog*

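Here is the documentation of the dataset. Note that the CSV loaded below names some columns differently: `educational-num`, `gender`, and `income` instead of `education-num`, `sex`, and `class`.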

```
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
class: >50K, <=50K
```

In [11]:
import pandas as pd

In [12]:
data = pd.read_csv("data/adult.csv.zip")
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [13]:
data.isna().sum()

age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
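
Every count above is zero, yet the preview shows `?` in `workclass` and `occupation`, so missing values appear to be stored as a placeholder string rather than as `NaN`. Below is a minimal sketch of two ways to surface them. It uses a small made-up frame instead of the real file: `toy_csv` and its contents are hypothetical, and the real file may encode the placeholder slightly differently.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the "?" placeholder seen in the preview.
toy_csv = "age,workclass,occupation\n25,Private,Sales\n18,?,?\n40,?,Tech-support\n"

# Option 1: count the placeholder after loading.
toy = pd.read_csv(io.StringIO(toy_csv))
placeholder_counts = toy[["workclass", "occupation"]].eq("?").sum()

# Option 2: let pandas convert "?" to NaN while parsing.
toy_nan = pd.read_csv(io.StringIO(toy_csv), na_values="?")
nan_counts = toy_nan.isna().sum()

print(placeholder_counts)
print(nan_counts)
```

With the second option, `isna().sum()` reports the gaps directly, and they can then be imputed or dropped before encoding.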

In [14]:
from sklearn.preprocessing import LabelEncoder

cat_columns = [col for col in data.select_dtypes(include=['object']).columns]

for cat_col in cat_columns:
    print(cat_col, f"{len(data[cat_col].unique())} categories")
    data[cat_col] = LabelEncoder().fit_transform(data[cat_col])
    
data

workclass 9 categories
education 16 categories
marital-status 7 categories
occupation 15 categories
relationship 6 categories
race 5 categories
gender 2 categories
native-country 42 categories
income 2 categories


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,4,226802,1,7,4,7,3,2,1,0,0,40,39,0
1,38,4,89814,11,9,2,5,0,4,1,0,0,50,39,0
2,28,2,336951,7,12,2,11,0,4,1,0,0,40,39,1
3,44,4,160323,15,10,2,7,0,2,1,7688,0,40,39,1
4,18,0,103497,15,10,4,0,3,4,0,0,0,30,39,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,0
48838,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,1
48839,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,0
48840,22,4,201490,11,9,4,1,3,4,1,0,0,20,39,0
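
`LabelEncoder` turns each category into an integer code, which a linear model such as `LogisticRegression` reads as an ordered quantity (for example `workclass` 4 being "more" than `workclass` 2). One common alternative is one-hot encoding the nominal columns and scaling the numeric ones. The sketch below is not the notebook's method: it runs on a hypothetical six-row frame (`toy`), and the column choice is only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the Adults data: two numeric and two nominal columns.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 52],
    "hours-per-week": [40, 50, 40, 40, 30, 45],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?", "Self-emp-inc"],
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})
numeric_cols = ["age", "hours-per-week"]
nominal_cols = ["workclass", "gender"]

# Scale numeric columns; expand each nominal column into one indicator per category.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
])
clf = make_pipeline(preprocess, LogisticRegression(max_iter=200))

X_toy = toy[numeric_cols + nominal_cols]
y_toy = (toy["income"] == ">50K").astype(int)
clf.fit(X_toy, y_toy)

# 2 numeric columns + 4 workclass indicators + 2 gender indicators
n_features = clf[0].transform(X_toy).shape[1]
print(n_features)
```

Wrapping the preprocessing and the model in one pipeline also means the encoder and scaler are fitted on the training rows only when the pipeline is fitted after the train/test split.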


In [15]:
data_corr = data.corr()
data_corr["income"].sort_values(ascending=False)

income             1.000000
educational-num    0.332613
age                0.230369
hours-per-week     0.227687
capital-gain       0.223013
gender             0.214628
capital-loss       0.147554
education          0.080091
occupation         0.076722
race               0.070934
workclass          0.052674
native-country     0.012210
fnlwgt            -0.006339
marital-status    -0.199072
relationship      -0.253214
Name: income, dtype: float64

In [16]:
X = pd.DataFrame(data.drop(columns="income"))
X

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,25,4,226802,1,7,4,7,3,2,1,0,0,40,39
1,38,4,89814,11,9,2,5,0,4,1,0,0,50,39
2,28,2,336951,7,12,2,11,0,4,1,0,0,40,39
3,44,4,160323,15,10,2,7,0,2,1,7688,0,40,39
4,18,0,103497,15,10,4,0,3,4,0,0,0,30,39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,4,257302,7,12,2,13,5,4,0,0,0,38,39
48838,40,4,154374,11,9,2,7,0,4,1,0,0,40,39
48839,58,4,151910,11,9,6,1,4,4,0,0,0,40,39
48840,22,4,201490,11,9,4,1,3,4,1,0,0,20,39


In [17]:
y = pd.Series(data["income"])
y

0        0
1        0
2        1
3        1
4        0
        ..
48837    0
48838    1
48839    0
48840    0
48841    1
Name: income, Length: 48842, dtype: int32

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(  # returns four objects
    X, y,               # arrays or DataFrames to split
    test_size=0.3,      # fraction of the data held out for testing (an integer means a number of rows)
    random_state=123,   # any fixed integer; makes the split reproducible
    shuffle=True,       # shuffle the rows before splitting (avoid this for time series)
    stratify=None       # for classification: pass the target here to keep class proportions in both splits (useful when classes are imbalanced)
)

print(f"size X_train{X_train.shape}")
print(f"size X_test{X_test.shape}")
print(f"size y_train{y_train.shape}")
print(f"size y_test{y_test.shape}")

size X_train(34189, 14)
size X_test(14653, 14)
size y_train(34189,)
size y_test(14653,)
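
The split above passes `stratify=None`, so the share of `>50K` rows can differ slightly between the training and test sets. A minimal sketch of a stratified split follows, using a hypothetical imbalanced target (`y_toy`) rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives and 20 positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# Both splits keep the original 20% share of positives.
print(y_tr.mean(), y_te.mean())
```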


In [19]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200)  # max_iter caps the number of solver iterations (the default is 100)
 
model.fit(X_train, y_train)

In [20]:
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
y_pred_proba

array([0.24188038, 0.33867886, 0.17734459, ..., 0.64193708, 0.11034803,
       0.16106942])

In [21]:
# Evaluate the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy*100}%")
print(f"Precision: {precision*100}%")
print(f"Recall: {recall*100}%")
print(f"F1 Score: {f1*100}%")
print(f"ROC AUC Score: {roc_auc*100}%")

Accuracy: 78.96676448508838%
Precision: 62.019230769230774%
Recall: 29.63813900057438%
F1 Score: 40.10882238631947%
ROC AUC Score: 71.38876710132615%
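
Accuracy is close to 79% while recall is under 30%, which suggests the model misses most `>50K` rows. F1 summarizes that trade-off as the harmonic mean of precision and recall, so it sits much nearer the lower of the two. As a quick check, the reported F1 can be recomputed from the precision and recall printed above:

```python
# Precision and recall copied from the output above, as fractions.
precision = 0.62019230769230774
recall = 0.2963813900057438

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from precision and recall: {f1 * 100:.2f}%")
# F1 from precision and recall: 40.11%
```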


## Exercise 02: Dealing with text

Load the **20 newsgroups** dataset from `scikit-learn` with the code below.
1. Build a classification model (`LogisticRegression`) on the training set
2. Load the "test" set and use your model to `predict` the new texts' category
3. Calculate the `accuracy` of the model on the test set

*Note: this is a text dataset, so use the tools available to you to process the text before training a model*
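
One way to approach steps 1 to 3 is to put the vectorizer and the classifier in a single pipeline, so the vocabulary is learned from the training texts only and then reused on the test texts. The sketch below uses hypothetical four-document and two-document lists as stand-ins for the `subset="train"` and `subset="test"` data, so it illustrates the workflow rather than a realistic score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the "train" and "test" subsets.
train_texts = ["the engine and the brakes", "new tires for my car",
               "the rocket reached orbit", "a satellite in low orbit"]
train_labels = ["autos", "autos", "space", "space"]
test_texts = ["my car needs brakes", "orbit of the satellite"]
test_labels = ["autos", "space"]

# fit() learns the vocabulary and the model from the training texts only;
# predict() reuses that vocabulary to transform the unseen test texts.
text_clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
text_clf.fit(train_texts, train_labels)

test_pred = text_clf.predict(test_texts)
test_accuracy = accuracy_score(test_labels, test_pred)
print(test_accuracy)
```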

In [22]:
from sklearn.datasets import fetch_20newsgroups

In [23]:
# first load the dataset from sklearn package
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
text = data["data"]
target = data["target"]
target_names = dict(enumerate(data["target_names"]))

In [24]:
# prepare data in a DataFrame

data = pd.DataFrame({
    "text": text,
    "target": target
})

data.target = data.target.replace(target_names)

In [25]:
data.head()

Unnamed: 0,text,target
0,I was wondering if anyone out there could enli...,rec.autos
1,A fair number of brave souls who upgraded thei...,comp.sys.mac.hardware
2,"well folks, my mac plus finally gave up the gh...",comp.sys.mac.hardware
3,\nDo you have Weitek's address/phone number? ...,comp.graphics
4,"From article <C5owCB.n3p@world.std.com>, by to...",sci.space


In [26]:
# to print the text of one particular sample

print(data.iloc[1].text)

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.


In [27]:
data["target"].nunique()

20

In [28]:
data.isna().sum()

text      0
target    0
dtype: int64

In [29]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

data["target_encoded"] = encoder.fit_transform(data["target"])

data

Unnamed: 0,text,target,target_encoded
0,I was wondering if anyone out there could enli...,rec.autos,7
1,A fair number of brave souls who upgraded thei...,comp.sys.mac.hardware,4
2,"well folks, my mac plus finally gave up the gh...",comp.sys.mac.hardware,4
3,\nDo you have Weitek's address/phone number? ...,comp.graphics,1
4,"From article <C5owCB.n3p@world.std.com>, by to...",sci.space,14
...,...,...,...
11309,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...,sci.med,13
11310,"I have a (very old) Mac 512k and a Mac Plus, b...",comp.sys.mac.hardware,4
11311,I just installed a DX2-66 CPU in a clone mothe...,comp.sys.ibm.pc.hardware,3
11312,\nWouldn't this require a hyper-sphere. In 3-...,comp.graphics,1


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=500,  # keep only the 500 most frequent terms
    stop_words="english", # remove stop words from the vocabulary
    ngram_range=(1, 2) # unigrams and bigrams, e.g. "cake", "carrot", and "carrot cake"
)

# vectorize the text
tfidf_text = tfidf.fit_transform(data["text"])

# convert the sparse matrix to a DataFrame with the terms as column names
tfidf_text = pd.DataFrame(tfidf_text.toarray(), columns=tfidf.get_feature_names_out())
tfidf_text

Unnamed: 0,00,000,0t,10,100,11,12,13,14,145,...,working,works,world,wouldn,write,written,wrong,year,years,yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.150876,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.449279,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11309,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.166292,0.0
11310,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
11311,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0
11312,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.259632,0.0,0.0,0.234983,0.0,0.000000,0.0
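
With `ngram_range=(1, 2)`, the vocabulary candidates are single words plus pairs of adjacent words. The helper below is a hypothetical pure-Python illustration of that idea (`word_ngrams` is not part of `scikit-learn`, and it skips the tokenization and stop-word handling that `TfidfVectorizer` performs):

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams(["carrot", "cake", "recipe"]))
# ['carrot', 'cake', 'recipe', 'carrot cake', 'cake recipe']
```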


In [31]:
y = pd.Series(data["target_encoded"])

y

0         7
1         4
2         4
3         1
4        14
         ..
11309    13
11310     4
11311     3
11312     1
11313     8
Name: target_encoded, Length: 11314, dtype: int32

In [32]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(  # returns four objects
    tfidf_text, y,      # arrays or DataFrames to split
    test_size=0.3,      # fraction of the data held out for testing (an integer means a number of rows)
    random_state=123,   # any fixed integer; makes the split reproducible
    shuffle=True,       # shuffle the rows before splitting (avoid this for time series)
    stratify=None       # for classification: pass the target here to keep class proportions in both splits (useful when classes are imbalanced)
)

In [33]:
print(f"size X_train{X_train.shape}")
print(f"size X_test{X_test.shape}")
print(f"size y_train{y_train.shape}")
print(f"size y_test{y_test.shape}")

size X_train(7919, 500)
size X_test(3395, 500)
size y_train(7919,)
size y_test(3395,)


In [34]:
# Hyperparameters such as C and max_iter can be set in the constructor; the defaults are used here
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # default hyperparameters
 
model.fit(X_train, y_train)

In [35]:
# Predict the newsgroup category for the test set
y_pred = model.predict(X_test)

y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 only; a multiclass ROC AUC would need the full probability matrix

y_pred_proba

array([0.04649996, 0.04002544, 0.00162684, ..., 0.02803953, 0.02510459,
       0.06936542])

In [36]:
# Evaluate the model (only accuracy is computed: with 20 classes the other metrics need an `average` argument)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(y_test, y_pred)
#precision = precision_score(y_test, y_pred)
#recall = recall_score(y_test, y_pred)
#f1 = f1_score(y_test, y_pred)
#roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy}")
#print(f"Precision: {precision}")
#print(f"Recall: {recall}")
#print(f"F1 Score: {f1}")
#print(f"ROC AUC Score: {roc_auc}")

Accuracy: 0.4992636229749632
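
The precision, recall, and F1 lines are commented out above because their default `average="binary"` setting only applies to two-class targets. For a multiclass problem, one option is macro averaging, which computes the metric per class and takes the unweighted mean. A minimal sketch on hypothetical three-class labels (`y_true_toy` and `y_pred_toy` are made up for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical three-class labels and predictions.
y_true_toy = [0, 1, 2, 2, 1, 0]
y_pred_toy = [0, 2, 2, 2, 1, 0]

# average="macro": compute the metric for each class, then take the plain mean.
macro_precision = precision_score(y_true_toy, y_pred_toy, average="macro")
macro_recall = recall_score(y_true_toy, y_pred_toy, average="macro")
macro_f1 = f1_score(y_true_toy, y_pred_toy, average="macro")

print(f"{macro_precision:.3f} {macro_recall:.3f} {macro_f1:.3f}")
# 0.889 0.833 0.822
```

`average="weighted"` is an alternative that weights each class by its number of true samples, which may suit imbalanced class counts better.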


## Exercise 03 (Optional): Using feature engineering

Given the following time series dataset, create a function that receives the data and the following arguments:
* `lag_sizes`: a list of integers with the lag sizes to use
* `window_sizes`: a list of integers with the window sizes to use
* `window_function`: a function to apply to the window. This function should receive a list of values and return a single value

and returns a new dataset with the columns of the original dataset plus the new columns created with the lag and window functions.
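
As a reminder of the two building blocks, here is a minimal sketch on a hypothetical five-value series: `shift` produces a lag feature and `rolling` produces a window feature. The first rows come out as `NaN` because there is not enough history yet.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

lag_1 = s.shift(1)                    # value from one step earlier
mean_3 = s.rolling(window=3).mean()   # mean of the current and the two previous values

print(pd.DataFrame({"s": s, "lag_1": lag_1, "mean_3": mean_3}))
```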


In [37]:
import numpy as np

ts_df = pd.DataFrame(
    {
        "ts1": np.random.randn(1000),
        "ts2": np.random.randn(1000),
        "ts3": np.random.randn(1000),
    }, 
    index=pd.date_range("2020-01-01", periods=1000)
)

ts_df.head()

Unnamed: 0,ts1,ts2,ts3
2020-01-01,-0.991823,0.785317,-0.555628
2020-01-02,0.532758,0.592826,-1.275738
2020-01-03,-0.023586,-1.356016,-0.470226
2020-01-04,0.360267,0.408593,1.482617
2020-01-05,-1.160826,-0.73096,1.711713


In [38]:
import pandas as pd

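def apply_lags_and_windows(ts_df, lag_sizes, window_sizes, window_function):
    """Return a copy of ts_df with lagged columns plus rolling-window features of each lagged column."""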
    new_df = ts_df.copy()
    
    for col in ts_df.columns:
        # Apply each lag size
        for lag in lag_sizes:
            lagged_col_name = f"{col}_lag_{lag}"
            new_df[lagged_col_name] = ts_df[col].shift(lag)
            
            # Apply each window size to the lagged column
            for window in window_sizes:
                windowed_col_name = f"{lagged_col_name}_window_{window}"
                # Apply the window function to the rolling window
                new_df[windowed_col_name] = new_df[lagged_col_name].rolling(window=window).apply(window_function, raw=True)
    
    return new_df


# Define the lag sizes, window sizes, and the window function
lag_sizes = [1, 2]  # Example lag sizes
window_sizes = [3, 5]  # Example window sizes
window_function = np.mean  # Using mean as the window function

# Apply the function to the dataset
new_ts_df = apply_lags_and_windows(ts_df, lag_sizes, window_sizes, window_function)

# Display the first few rows of the new dataset
print(new_ts_df.head())

                 ts1       ts2       ts3  ts1_lag_1  ts1_lag_1_window_3  \
2020-01-01 -0.991823  0.785317 -0.555628        NaN                 NaN   
2020-01-02  0.532758  0.592826 -1.275738  -0.991823                 NaN   
2020-01-03 -0.023586 -1.356016 -0.470226   0.532758                 NaN   
2020-01-04  0.360267  0.408593  1.482617  -0.023586           -0.160884   
2020-01-05 -1.160826 -0.730960  1.711713   0.360267            0.289813   

            ts1_lag_1_window_5  ts1_lag_2  ts1_lag_2_window_3  \
2020-01-01                 NaN        NaN                 NaN   
2020-01-02                 NaN        NaN                 NaN   
2020-01-03                 NaN  -0.991823                 NaN   
2020-01-04                 NaN   0.532758                 NaN   
2020-01-05                 NaN  -0.023586           -0.160884   

            ts1_lag_2_window_5  ts2_lag_1  ...  ts2_lag_1_window_5  ts2_lag_2  \
2020-01-01                 NaN        NaN  ...                 NaN        NaN