# Car Prices

🎯 The goal of this challenge is to prepare a dataset and apply some feature selection techniques that you have learned so far.

🚗 We are dealing with a dataset about cars and we would like to predict whether a car is expensive or cheap.

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Checking whether a numerical feature has a normal distribution or not
from statsmodels.graphics.gofplots import qqplot

# Preprocessing tools
from sklearn.preprocessing import RobustScaler, StandardScaler, LabelEncoder  # <-- Include LabelEncoder here!

# Modelling and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# For challenge result testing
from nbresult import ChallengeResult


In [49]:
url = "https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset.csv"

❓ Go ahead and load the CSV into a dataframe called `df`.

In [50]:
df = pd.read_csv(url)
df.head()

| | aspiration | enginelocation | carwidth | curbweight | enginetype | cylindernumber | stroke | peakrpm | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | std | front | 64.1 | 2548 | dohc | four | 2.68 | 5000 | expensive |
| 1 | std | front | 64.1 | 2548 | dohc | four | 2.68 | 5000 | expensive |
| 2 | std | front | 65.5 | 2823 | ohcv | six | 3.47 | 5000 | expensive |
| 3 | std | front | NaN | 2337 | ohc | four | 3.4 | 5500 | expensive |
| 4 | std | front | 66.4 | 2824 | ohc | five | 3.4 | 5500 | expensive |


ℹ️ The description of the dataset is available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset_description.txt). Make sure to refer to it throughout the exercise.

## (1) Duplicates

❓ Remove the duplicates from the dataset if there are any. ❓

*Overwrite the dataframe `df`*

In [51]:
df.drop_duplicates(inplace=True)
df.shape

(191, 9)

## (2) Missing values

❓ Find the missing values and impute them either with `strategy = "most_frequent"` (categorical variables) or `strategy = "median"` (numerical variables) ❓


In [52]:
# (2) MISSING VALUES
# ❓ Find the missing values and impute them:
# - categorical features with strategy="most_frequent"
# - numerical features with strategy="median" if they have relatively few missing points

# Inspect the DataFrame
df.info()
df.isna().sum()

# "carwidth" has two representations for missing values ("*" and np.nan).
# Replace "*" with np.nan, then fill the gaps with the median of the numeric values
# (the column is still stored as object dtype, so the generic loops below would
# otherwise impute it with the most frequent value instead of the median)
df["carwidth"] = df["carwidth"].replace("*", np.nan)
df["carwidth"] = df["carwidth"].fillna(pd.to_numeric(df["carwidth"]).median())

# Identify which features are numerical vs categorical
cat_cols = df.select_dtypes(include=["object"]).columns
num_cols = df.select_dtypes(exclude=["object"]).columns

# Fill each numerical column that has missing values with its median
for col in num_cols:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].median())

# Fill each categorical column that has missing values with its most frequent value
for col in cat_cols:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode()[0])


<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      191 non-null    object 
 1   enginelocation  181 non-null    object 
 2   carwidth        189 non-null    object 
 3   curbweight      191 non-null    int64  
 4   enginetype      191 non-null    object 
 5   cylindernumber  191 non-null    object 
 6   stroke          191 non-null    float64
 7   peakrpm         191 non-null    int64  
 8   price           191 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 14.9+ KB


### `carwidth`

<details>
    <summary> 💡 <i>Hint</i> </summary>
    <br>
    ℹ️ <code>carwidth</code> has multiple representations for missing values. Some are <code>np.nan</code>, some are  <code>*</code>. Once located, they can be imputed by the median value, since missing values make up less than 30% of the data.
</details> 
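
The hint above can be sketched with scikit-learn's `SimpleImputer`. This is a minimal illustration on a made-up `toy` dataframe (its values are assumptions, not taken from the real dataset); it unifies the two missing-value markers, converts the column to numeric, and then imputes each column with the strategy suited to its type.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "carwidth": ["64.1", "*", "65.5", np.nan, "66.4"],
    "enginelocation": ["front", "front", np.nan, "rear", "front"],
})

# Unify the two missing-value markers, then make the column numeric
toy["carwidth"] = pd.to_numeric(toy["carwidth"].replace("*", np.nan))

# Median for the numerical column, most frequent value for the categorical one
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

toy["carwidth"] = median_imputer.fit_transform(toy[["carwidth"]]).ravel()
toy["enginelocation"] = mode_imputer.fit_transform(toy[["enginelocation"]]).ravel()

print(toy)
```

Note that the strategy name is spelled `"most_frequent"`, with an underscore.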

In [53]:
# Specifically address the hints:
# "carwidth" -> we replaced '*' with np.nan, then used median
# "enginelocation" -> if missing, fill with its most frequent (which is 'front')


### `enginelocation`

<details>
    <summary>💡 <i>Hint</i> </summary>
    <br>
    ℹ️ Considering that <code>enginelocation</code> is a categorical feature, and that the vast majority of the category is <code>front</code>, impute with the most frequent.
</details>

In [54]:
# Assign the result back instead of using inplace=True on a column selection
df["enginelocation"] = df["enginelocation"].fillna(df["enginelocation"].mode()[0])

🧪 **Test your code**

In [55]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = df)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/aheggs/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/aheggs/code/andyheggs/05-ML/02-Prepare-the-dataset/data-car-prices/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 2 items

test_missing_values.py::TestMissing_values::test_carwidth PASSED         [ 50%]
test_missing_values.py::TestMissing_values::test_engine_location PASSED  [100%]



💯 You can commit your code:

git add tests/missing_values.pickle

git commit -m 'Completed missing_values step'

git push origin master



In [56]:
!git add tests/missing_values.pickle

!git commit -m 'Completed missing_values step'

!git push origin master

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Car-Prices.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


Everything up-to-date


## (3) Scaling the numerical features

In [57]:
# As a reminder, some information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      191 non-null    object 
 1   enginelocation  191 non-null    object 
 2   carwidth        191 non-null    object 
 3   curbweight      191 non-null    int64  
 4   enginetype      191 non-null    object 
 5   cylindernumber  191 non-null    object 
 6   stroke          191 non-null    float64
 7   peakrpm         191 non-null    int64  
 8   price           191 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 14.9+ KB


In [58]:
# And here are the numerical features of the dataset we need to scale
numerical_features = df.select_dtypes(exclude=['object']).columns
numerical_features

Index(['curbweight', 'stroke', 'peakrpm'], dtype='object')

❓ **Question: Scaling the numerical features** ❓

Investigate the numerical features for outliers and distribution, and apply the solutions below accordingly:
- Robust Scaler
- Standard Scaler

Replace the original columns with the transformed values.
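
One possible way to carry out this investigation is to count outliers with the 1.5 × IQR rule and pick the scaler accordingly. The sketch below runs on a made-up feature; the helper `count_iqr_outliers` and the decision rule are illustrative assumptions, not part of the challenge, and a box plot or `qqplot` remains useful for checking the distribution visually.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler


def count_iqr_outliers(series: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(is_outlier.sum())


# Made-up feature: roughly normal values plus two extreme points
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(5000, 300, size=100), [9000, 9500]))

# Hypothetical rule of thumb: outliers present => RobustScaler, else StandardScaler
n_outliers = count_iqr_outliers(values)
scaler = RobustScaler() if n_outliers > 0 else StandardScaler()
scaled = scaler.fit_transform(values.to_frame(name="feature"))

print(n_outliers, type(scaler).__name__)
```

`RobustScaler` centres on the median and divides by the interquartile range, so a few extreme points barely influence the scaled values of the remaining rows.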

### `peakrpm` , `carwidth` , & `stroke`

<details>
    <summary>💡 <i>Hint</i> </summary>

    
ℹ️ <code>peakrpm</code>, <code>carwidth</code>, & <code>stroke</code> have normal distributions but also some outliers. Hence, it is advisable to use `RobustScaler()`.
</details>

In [59]:
from sklearn.preprocessing import RobustScaler, StandardScaler

robust_scaler = RobustScaler()
standard_scaler = StandardScaler()

# 3.1) peakrpm, carwidth, stroke => Robust Scaler

# Overwrite each column with its scaled values
for col in ["peakrpm", "carwidth", "stroke"]:
    # df[[col]] is a one-column DataFrame (the 2D input the scaler expects);
    # the scaler is refitted for each column
    df[col] = robust_scaler.fit_transform(df[[col]])

### `curbweight`

<details>
    <summary>💡 <i>Hint</i> </summary>
    <br>
    ℹ️ <code>curbweight</code> has a normal distribution and no outliers. It can be Standard Scaled.
</details>

In [60]:
# 3.2) curbweight => Standard Scaler
df["curbweight"] = standard_scaler.fit_transform(df[["curbweight"]])

🧪 **Test your code**

In [61]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = df
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/aheggs/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/aheggs/code/andyheggs/05-ML/02-Prepare-the-dataset/data-car-prices/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 4 items

test_scaling.py::TestScaling::test_carwidth PASSED                       [ 25%]
test_scaling.py::TestScaling::test_curbweight PASSED                     [ 50%]
test_scaling.py::TestScaling::test_peakrpm PASSED                        [ 75%]
test_scaling.py::TestScaling::test_stroke PASSED                         [100%]



💯 You can commit your code:

git add tests/scaling.pickle

git commit -m 'Completed scaling step'

git push origin master



In [62]:
!git add tests/scaling.pickle

!git commit -m 'Completed scaling step'

!git push origin master

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Car-Prices.ipynb

no changes added to commit (use "git add" and/or "git commit -a")
Everything up-to-date


## (4) Encoding the categorical features

❓ **Question: encoding the categorical variables** ❓

👇 Investigate the features that require encoding, and apply the following techniques accordingly:

- One-hot encoding
- Manual ordinal encoding

In the Dataframe, replace the original features with their encoded version(s).

### `aspiration` & `enginelocation`

<details>
    <summary>💡 <i>Hint</i> </summary>
    <br>
    ℹ️ <code>aspiration</code> and <code>enginelocation</code> are binary categorical features.
</details>
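
As an illustration of encoding a binary feature, here is a minimal sketch using `OneHotEncoder(drop="if_binary")` on a made-up column. This is one possible approach rather than the method required by the challenge: for a two-category feature it yields a single 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up binary feature for illustration
toy = pd.DataFrame({"aspiration": ["std", "turbo", "std", "std"]})

# drop="if_binary" keeps a single 0/1 column for a two-category feature
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(toy[["aspiration"]]).toarray().ravel()

toy["aspiration"] = encoded.astype(int)
print(encoder.categories_)
print(toy)
```

The categories are sorted alphabetically and the first one is dropped, so `std` becomes 0 and `turbo` becomes 1.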

In [63]:
# (4) ENCODING THE CATEGORICAL FEATURES

# The instructions:
# - "aspiration" & "enginelocation" => binary categorical => label or ordinal encoding
# - "enginetype" => multi-categorical => one-hot
# - "cylindernumber" => ordinal => manual numeric mapping
# - Then scale "cylindernumber" if needed
# - Finally, encode "price" (the target) with label encoding

df.info()  # review the object columns

# 4.1) aspiration & enginelocation => binary => simplest approach is to map them to 0 and 1

# Imported in this cell so that LabelEncoder is defined when the cell runs
from sklearn.preprocessing import LabelEncoder

binary_encoder = LabelEncoder()
df["aspiration"] = binary_encoder.fit_transform(df["aspiration"])
# LabelEncoder sorts the classes alphabetically: 'std' => 0, 'turbo' => 1

df["enginelocation"] = binary_encoder.fit_transform(df["enginelocation"])
# Likewise: 'front' => 0, 'rear' => 1

<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      191 non-null    object 
 1   enginelocation  191 non-null    object 
 2   carwidth        191 non-null    float64
 3   curbweight      191 non-null    float64
 4   enginetype      191 non-null    object 
 5   cylindernumber  191 non-null    object 
 6   stroke          191 non-null    float64
 7   peakrpm         191 non-null    float64
 8   price           191 non-null    object 
dtypes: float64(4), object(5)
memory usage: 14.9+ KB

### `enginetype`

<details>
    <summary>💡 <i>Hint</i> </summary>
    <br>
    ℹ️ <code>enginetype</code> is a multicategorical feature and must be One hot encoded.
</details>

In [None]:
# 4.2) enginetype => multi-categorical => one-hot encode
# Drop the first category to avoid the dummy variable trap.
# The guard keeps the cell safe to re-run: a second call would otherwise raise
# a KeyError, because get_dummies removes the original "enginetype" column.
if "enginetype" in df.columns:
    df = pd.get_dummies(df, columns=["enginetype"], drop_first=True, dtype=int)
df.shape

In [None]:
df.shape

(191, 14)

### `cylindernumber`

<details>
    <summary>💡 Hint </summary>

ℹ️ <code>cylindernumber</code> is an ordinal feature and must be manually encoded into numeric.

</details>

In [None]:
# 4.3) cylindernumber => ordinal => manual numeric
# The dataset has strings: 'two','three','four','five','six','eight','twelve'
# We'll map them to numeric:

mapping = {
    "two": 2,
    "three": 3,
    "four": 4,
    "five": 5,
    "six": 6,
    "eight": 8,
    "twelve": 12
}

df["cylindernumber"] = df["cylindernumber"].map(mapping)

❓ Now that you've made `cylindernumber` into a numeric feature between 2 and 12, you need to scale it ❓

<br/>

<details>
    <summary>💡 Hint </summary>

Look at the current distribution of the `cylindernumber` and ask yourself the following questions:
- Does scaling affect a feature's distribution ?
- According to the distribution of this feature, what is the most appropriate scaling method?
</details>
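
To explore the first question, the sketch below scales a made-up, heavily skewed cylinder count and compares its skewness before and after. `MinMaxScaler` is used here purely as an example of a scaler that does not assume a normal distribution; the data and the choice of scaler are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up cylinder counts, heavily concentrated on one value
cylinders = pd.DataFrame({"cylindernumber": [4] * 15 + [6] * 3 + [8, 12]})

# Min-max scaling is a linear transformation onto the [0, 1] range
scaled = pd.Series(MinMaxScaler().fit_transform(cylinders).ravel())

skew_before = cylinders["cylindernumber"].skew()
skew_after = scaled.skew()

print(f"skew before: {skew_before:.3f}, skew after: {skew_after:.3f}")
print(f"range after scaling: [{scaled.min()}, {scaled.max()}]")
```

Because the transformation is linear, the shape of the distribution (here measured by its skewness) is unchanged; only the range of the values differs.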

In [None]:
# cylindernumber takes only a handful of discrete values, so check its distribution
# (e.g. with df["cylindernumber"].value_counts()) before settling on a scaler.
# For simplicity, StandardScaler is applied here.
df["cylindernumber"] = standard_scaler.fit_transform(df[["cylindernumber"]])

<details>
    <summary><i>Here is a screenshot of what your dataframe should look like after scaling and encoding</i></summary>
    
    
<img src="https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/02-Prepare-the-dataset/car_price_after_scaling_and_encoding.png">    

</details>

### `price`

👇 Encode the target `price`.

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>price</code> is the target and must be Label encoded.
</details>

In [None]:
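# 4.4) price => Label encode (the target)
# We want to predict whether a car is expensive or cheap => 0 or 1

# Imported in this cell so that LabelEncoder is defined when the cell runs
from sklearn.preprocessing import LabelEncoder

price_encoder = LabelEncoder()
df["price"] = price_encoder.fit_transform(df["price"])
# LabelEncoder sorts the classes alphabetically: 'cheap' => 0, 'expensive' => 1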

🧪 **Test your code**

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding',
                         dataset = df)
result.write()
print(result.check())

## (5) Base Modelling

👏 The dataset has been preprocessed and is now ready to be fitted to a model. 

❓**Question: a first attempt to evaluate a classification model** ❓

Cross-validate a `LogisticRegression` on this preprocessed dataset and save its score under a variable named `base_model_score`.

In [None]:
# (5) BASE MODELLING
# ❓**Question: a first attempt to evaluate a classification model**
# Cross-validate a LogisticRegression on this preprocessed dataset
# Save its score under a variable named `base_model_score`.

# First separate features (X) and target (y)
X = df.drop(columns=["price"])
y = df["price"]

# Every feature must be numeric at this point: a leftover string column, such as
# an unencoded `aspiration`, makes all 5 fits fail with
# "ValueError: could not convert string to float: 'std'"

# Cross-validate
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
base_model_score = cv_scores.mean()
base_model_score


🧪 **Test your code**

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('base_model',
                         score = base_model_score
)

result.write()
print(result.check())

## (6) Feature Selection (with _Permutation Importance_)

👩🏻‍🏫 A powerful way to detect whether a feature is relevant or not to predict a target is to:
1. Run a model and score it
2. Shuffle this feature, re-run the model and score it
    - If the performance drops significantly, the feature is important and you shouldn't drop it.
    - If the performance doesn't decrease much, the feature may be discarded.
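
The procedure above is also available as scikit-learn's built-in `sklearn.inspection.permutation_importance`. The sketch below applies it to synthetic data in which only one feature carries signal; the data and the names `toy_X`, `toy_y` and `toy_model` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target depends on "informative" only
rng = np.random.default_rng(0)
n_rows = 300
toy_X = pd.DataFrame({
    "informative": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
toy_y = (toy_X["informative"] > 0).astype(int)

toy_model = LogisticRegression().fit(toy_X, toy_y)

# Each feature is shuffled n_repeats times; the mean score decrease is reported
result = permutation_importance(
    toy_model, toy_X, toy_y, n_repeats=10, random_state=0
)

importance_df = pd.DataFrame({
    "feature": toy_X.columns,
    "score_decrease": result.importances_mean,
}).sort_values("score_decrease", ascending=False)

print(importance_df)
```

Shuffling `informative` should cost the model a large share of its accuracy, whereas shuffling `noise` should leave the score almost unchanged.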

❓ **Questions** ❓

1. Perform a feature permutation to detect which features bring the least amount of information to the model. 
2. Remove the weak features from your dataset until you notice model performance dropping substantially
3. Using your new set of strong features, cross-validate a new model, and save its score under variable name `strong_model_score`.

In [None]:
# (6) FEATURE SELECTION (Permutation Importance)

# 1) Perform feature permutation to detect which features bring the least info.
# 2) Remove weak features until you see model performance degrade significantly
# 3) Save new cross-validation score in `strong_model_score`.

# We'll implement a naive permutation importance approach:
from sklearn.metrics import accuracy_score

def permutation_importance(model, X, y):
    """
    Fit model on (X, y), compute baseline accuracy.
    Then, for each feature, shuffle it and compute drop in accuracy.
    Return a list of (feature, importance).
    """
    model.fit(X, y)
    baseline = accuracy_score(y, model.predict(X))

    importances = {}
    for col in X.columns:
        X_shuffled = X.copy()
        # shuffle current feature
        X_shuffled[col] = np.random.permutation(X_shuffled[col])

        shuffled_acc = accuracy_score(y, model.predict(X_shuffled))
        importances[col] = baseline - shuffled_acc

    # Sort by descending importance
    return sorted(importances.items(), key=lambda x: x[1], reverse=True)


# Perform the permutation (seeded so that the shuffles are reproducible)
np.random.seed(0)
model = LogisticRegression()
importances = permutation_importance(model, X, y)
importances[:10]  # top 10 features for reference

In [None]:
# Pick a threshold below which a feature counts as 'low importance' and is removed.
# The threshold is somewhat arbitrary: adjust it iteratively and stop before
# the cross-validated score drops substantially.

threshold = 0.01  # example threshold for "low" importance
weak_features = [feat for feat, imp in importances if imp < threshold]

weak_features

In [None]:
# Create a new dataset without the weak features
X_strong = X.drop(columns=weak_features)

# Let's cross-validate again
model_strong = LogisticRegression()
cv_scores_strong = cross_val_score(model_strong, X_strong, y, cv=5, scoring='accuracy')
strong_model_score = cv_scores_strong.mean()
strong_model_score

🧪 **Test your code**

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('strong_model',
                         score = strong_model_score
)

result.write()
print(result.check())

## Bonus - Stratifying your data ⚖️

💡 As we split our data into training and testing, we need to be mindful of the proportion of categorical variables in our dataset - whether it's the classes of our target `y` or a categorical feature in `X`.

Let's have a look at an example 👇

❓ Split your original `X` and `y` into training and testing data, using sklearn's `train_test_split`; use `random_state=1` and `test_size=0.3` to have comparable results.

In [None]:
# BONUS - STRATIFYING YOUR DATA
# (For demonstration purposes)

# Imported in this cell so that train_test_split is defined when the cell runs
from sklearn.model_selection import train_test_split

# 1) Take the original X, y
X = df.drop(columns=["price"])
y = df["price"]

# 2) Use train_test_split with random_state=1, test_size=0.3
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=1
)

❓ Check the proportion of `price` class `1` cars in your training dataset and testing dataset.

> _If you check the proportion of them in the raw `df`, it should be very close to 50/50_

In [None]:
# 3) Check the proportion of 'price' class 1 in training/test
train_prop_class1 = y_train.mean()
test_prop_class1 = y_test.mean()
train_prop_class1, test_prop_class1

It should still be pretty close to 50/50 ☝️ 

***But what if we change the random state?*** 

❓ Loop through random states 1 through 10, each time calculating the share of `price` class `1` cars in the training and testing data. ❓

In [None]:
# 4) Loop through random states 1 through 10 and record the proportion of price=1 in train/test

for rs in range(1, 11):
    X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(
        X, y,
        test_size=0.3,
        random_state=rs
    )
    print(f"Random State {rs}: train proportion class=1: {y_train_rs.mean():.3f}, test proportion class=1: {y_test_rs.mean():.3f}")


You will observe that the proportion changes every time, sometimes even quite drastically 😱! This can affect model performance!

❓ Compare the test score of a logistic regression when trained using `train_test_split(random_state=1)` _vs._ `random_state=9` ❓ 

Remember to fit on training data and score on testing data.

In [None]:
# 5) Compare logistic regression test score for random_state=1 vs random_state=9

for rs in [1, 9]:
    X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(
        X, y,
        test_size=0.3,
        random_state=rs
    )
    model_rs = LogisticRegression()
    model_rs.fit(X_train_rs, y_train_rs)
    score_rs = model_rs.score(X_test_rs, y_test_rs)
    print(f"Random State {rs} => Test Accuracy: {score_rs:.3f}")


👀 You should see a much lower score with `random_state=9` because the proportion of class `1` cars in that test set is 34.5%, quite far from the 57.9% in the training set or even the 50% in the original dataset.

This is substantial, as this accidental imbalance in our dataset can not only make model performance worse, but also distort the "reality" during training or scoring 🧐

***So how do we fix this issue? How do we keep the same distribution of classes across the train set and the test set? 🔧***

🎁 Luckily, this is taken care of by `cross_validate` in sklearn, when the estimator (a.k.a the model) is a classifier and the target is a class. Check out the documentation of the `cv` parameter in 📚 [**sklearn.model_selection.cross_validate**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).

The answer is to use the following:

>📚 [**Stratification**](https://scikit-learn.org/stable/modules/cross_validation.html#stratification)
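
To make the effect of stratified folds concrete, here is a small sketch on a made-up, sorted binary target. It compares the share of class `1` in each test fold produced by a plain `KFold` with the share produced by `StratifiedKFold`; the toy arrays and the helper `class1_share_per_fold` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up binary target, sorted so that plain KFold produces lopsided folds
toy_y = np.array([0] * 50 + [1] * 50)
toy_X = np.zeros((100, 1))  # the feature values do not matter for the split


def class1_share_per_fold(splitter):
    """Share of class 1 in each test fold produced by the splitter."""
    return [
        float(toy_y[test_idx].mean())
        for _, test_idx in splitter.split(toy_X, toy_y)
    ]


plain_shares = class1_share_per_fold(KFold(n_splits=5))
stratified_shares = class1_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_shares)
print("StratifiedKFold:", stratified_shares)
```

Without shuffling, `KFold` cuts the sorted target into consecutive blocks, so some test folds contain a single class, while every `StratifiedKFold` test fold keeps the original 50/50 balance.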

### Stratification of the target

💡 We can also use the ***stratification*** technique in a `train_test_split`.

❓ Run through the same 1 to 10 random state loop again, but this time also ***pass `stratify=y` into the holdout method***. ❓

In [None]:
# 6) Now let's do the same 1->10 random-state loop, but with stratify=y
# This should preserve the overall proportion of 'price' classes.

for rs in range(1, 11):
    X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(
        X, y,
        test_size=0.3,
        random_state=rs,
        stratify=y   # Key difference
    )
    print(f"Random State {rs}: train proportion class=1: {y_train_rs.mean():.3f}, test proportion class=1: {y_test_rs.mean():.3f}")


👀 Even if the random state is changing, the proportion of classes inside the training and testing data is kept the same as in the original `y`. This is what _stratification_ is.

Using `train_test_split` with the `stratify` parameter, we can also preserve proportions of a feature across training and testing data. This can be extremely important, for example:

- preserving proportion of male and female customers in predicting churn 🙋‍♂️ 🙋
- preserving the proportion of big and small houses in predicting their prices 🏠 🏰
- preserving distribution of 1-5 review scores (multiclass!) in recommending the next product 🛍️
- etc...

For instance, in our dataset, to hold out the same share of each `aspiration` value in both the training and testing data, we could simply write `train_test_split(X, y, test_size=0.3, stratify=X.aspiration)`
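
A minimal sketch of stratifying on a feature, using made-up data in which 20% of the rows have `aspiration == 1` (the `toy_X` and `toy_y` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 80 rows with aspiration=0 and 20 rows with aspiration=1
toy_X = pd.DataFrame({
    "aspiration": [0] * 80 + [1] * 20,
    "curbweight": range(100),
})
toy_y = pd.Series([0, 1] * 50)

# Stratify on the feature rather than on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.3,
    random_state=1,
    stratify=toy_X["aspiration"],
)

train_share = X_tr["aspiration"].mean()
test_share = X_te["aspiration"].mean()
print(f"aspiration=1 share, train: {train_share:.2f}, test: {test_share:.2f}")
```

Both splits keep the 20% share of `aspiration == 1`, whatever the random state.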

---

As we saw, **`cross_validate` [can automatically stratify the target](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#:~:text=For%20int/None%20inputs%2C%20if%20the%20estimator%20is%20a%20classifier%20and%20y%20is%20either%20binary%20or%20multiclass%2C%20StratifiedKFold%20is%20used.), but not the features...** 🤔 We need a bit of extra work for that.

We need `StratifiedKFold` 🔬



### Stratification - generalized

📚 [**StratifiedKFold**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) allows us to split the data into `K` splits, while stratifying on certain columns (features or target).

This way, we can do a manual cross-validation while keeping proportions on the categorical features of interest - let's try it with the binary `aspiration` feature:

In [None]:
from sklearn.model_selection import StratifiedKFold

# initializing a stratified k-fold that will split the data into 5 folds
skf = StratifiedKFold(n_splits=5)
scores = []

# .split() method creates an iterator; 'X.aspiration' is the feature that we stratify by
for train_indices, test_indices in skf.split(X, X.aspiration):

    # 'train_indices' and 'test_indices' are NumPy arrays of row positions that produce proportional splits
    X_train, X_test = X.iloc[train_indices], X.iloc[test_indices]
    y_train, y_test = y.iloc[train_indices], y.iloc[test_indices]

    # initialize and fit a model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # append a score to get an average of 5 folds in the end
    scores.append(model.score(X_test, y_test))

np.array(scores).mean()

📖 Some sklearn reads on **stratification**:

- [Visualization of how different holdout methods in sklearn work](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py)
- [Overall cross-validation and stratification understanding](https://scikit-learn.org/stable/modules/cross_validation.html#stratification)

🏁 Congratulations! You have prepared a whole dataset, run feature selection and even learned about stratification 💪

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!