# Stratified K-Fold Cross Validation


In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

In [2]:
df = pd.read_csv("../Datasets/SocialNetworkAds.csv")
df.head()

,Gender,Age,EstimatedSalary,Purchased
0,1,19,19000,0
1,1,35,20000,0
2,2,26,43000,0
3,2,27,57000,0
4,1,19,76000,0


In [3]:
df.columns

Index(['Gender', 'Age', 'EstimatedSalary', 'Purchased'], dtype='object')

In [4]:
df.shape

(400, 4)

In [5]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

`StratifiedKFold` is a scikit-learn cross-validator class that splits a dataset into k folds while keeping the proportion of samples from each class roughly the same in every fold. Its constructor takes three parameters:

- `n_splits`: The number of folds to create. It must be at least 2 and defaults to 5.
- `shuffle`: Whether to shuffle the samples of each class before splitting them into folds. The default value is `False`.
- `random_state`: The seed that controls the shuffling. It only has an effect when `shuffle` is `True`; leave it as `None` otherwise.
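
The parameters above can be illustrated with a small sketch. The arrays `X_demo` and `y_demo` are made-up stand-ins (not the SocialNetworkAds data), used only to show the default fold count and the effect of a fixed seed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption, not the notebook's dataset):
# 20 samples, 2 features, 14 of class 0 and 6 of class 1.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 14 + [1] * 6)

# n_splits is optional; when omitted, the splitter uses 5 folds.
default_skf = StratifiedKFold()
print(default_skf.get_n_splits())

# With shuffle=True, the same random_state yields the same folds every time.
skf_a = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
skf_b = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
folds_a = [test.tolist() for _, test in skf_a.split(X_demo, y_demo)]
folds_b = [test.tolist() for _, test in skf_b.split(X_demo, y_demo)]
print(folds_a == folds_b)
```

Fixing the seed is mainly useful when fold assignments need to be compared across runs or across models.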


In [6]:
skf = StratifiedKFold(shuffle=True, random_state=0)
skf.get_n_splits(X, y)

5

This creates a `StratifiedKFold` object that splits the data into 5 folds (the default, since `n_splits` is not passed), shuffles the samples of each class before splitting, and uses a seed of 0 so that the shuffle is reproducible. The resulting `skf` object can be used to iterate over the folds for cross-validation while preserving the percentage of samples for each class.
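
To see what "preserving the percentage of samples for each class" means in practice, the following sketch measures the share of positive labels in each fold. It uses synthetic labels (`y_demo`, 30% positive) as an assumption in place of the `Purchased` column.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (an assumption): 70 negatives and 30 positives, i.e. 30% positive.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not influence the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_fractions = []
for fold, (train_index, test_index) in enumerate(skf_demo.split(X_demo, y_demo), start=1):
    # The mean of a 0/1 label array is the share of positive samples.
    train_frac = y_demo[train_index].mean()
    test_frac = y_demo[test_index].mean()
    test_fractions.append(test_frac)
    print(f"Fold {fold}: positive share train={train_frac:.2f}, test={test_frac:.2f}")
```

Because both class counts here divide evenly by the number of folds, every fold reproduces the overall 30% share exactly; with counts that do not divide evenly, the shares would only match approximately.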


In [7]:
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i+1}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 1:
  Train: index=[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  36
  37  38  39  41  42  44  45  46  47  48  49  50  51  54  55  57  58  59
  60  61  63  64  65  66  67  68  69  70  71  72  74  75  77  78  80  81
  82  84  85  86  87  88  89  90  92  93  94  95  97  98  99 100 101 102
 103 105 106 107 111 114 115 116 117 118 120 121 122 123 124 127 128 129
 130 132 133 134 137 140 142 144 145 146 147 148 149 150 151 152 153 154
 155 157 158 160 161 162 163 164 165 166 167 168 169 170 172 173 174 177
 178 179 181 183 184 185 186 187 188 190 191 192 193 194 195 196 197 198
 199 200 202 203 204 205 206 208 209 210 212 213 216 217 220 221 222 223
 224 225 226 227 233 234 235 236 237 238 239 241 242 245 247 248 250 251
 252 253 254 256 258 259 260 261 263 264 266 267 268 269 271 273 274 275
 276 277 278 279 280 281 283 284 285 286 289 290 292 293 294 295 296 297
 298 300 302 303 304 305 306

In [8]:
# No random_state is set on the forest, so the scores vary slightly between runs.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=10), X, y, scoring="accuracy", cv=skf
)
scores

array([0.85  , 0.8875, 0.9125, 0.9   , 0.875 ])
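
As a rough sketch of what `cross_val_score` does with the splitter, the loop below fits a fresh copy of the model on each training fold and scores it on the held-out fold. The data comes from `make_classification` (an assumption, standing in for the CSV file), and `random_state=0` is set on the forest only so that the manual loop and the library call can be compared exactly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; the notebook reads SocialNetworkAds.csv).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0)

manual_scores = []
for train_index, test_index in skf_demo.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_index], y_demo[train_index])
    # For classifiers, .score() returns accuracy on the held-out fold.
    manual_scores.append(fold_model.score(X_demo[test_index], y_demo[test_index]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=skf_demo)

print(manual_scores)
print(library_scores)
```

Writing the loop by hand is mostly useful when per-fold artifacts are needed, such as the fitted models or the predictions themselves; otherwise `cross_val_score` is the shorter route.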

In [9]:
print("%f accuracy with a standard deviation of %f" % (scores.mean(), scores.std()))

0.885000 accuracy with a standard deviation of 0.021506
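
The standard deviation printed above comes from NumPy's `std`, which by default divides by the number of scores (`ddof=0`, the population formula). With only five folds, the sample formula (`ddof=1`) gives a noticeably larger spread. The sketch below recomputes both from the fold scores shown earlier; the variable names are illustrative.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output in this notebook.
fold_scores = np.array([0.85, 0.8875, 0.9125, 0.9, 0.875])

mean_score = fold_scores.mean()
population_std = fold_scores.std()      # ddof=0, the value printed by the notebook
sample_std = fold_scores.std(ddof=1)    # divides by n - 1 instead of n

print(f"mean={mean_score:.6f}")
print(f"population std={population_std:.6f}")
print(f"sample std={sample_std:.6f}")
```

Which formula to report is a judgment call; the important part is to state which one was used.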
