# Leave-P-Out Cross-Validation

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeavePOut, cross_val_score

In [2]:
df = pd.read_csv("../Datasets/LiteSocialNetworkAds.csv")
df.head()

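|   | Gender | Age | EstimatedSalary | Purchased |
|---|--------|-----|-----------------|-----------|
| 0 | 2      | 32  | 150000          | 1         |
| 1 | 1      | 47  | 25000           | 1         |
| 2 | 1      | 45  | 26000           | 1         |
| 3 | 1      | 46  | 28000           | 1         |
| 4 | 2      | 48  | 29000           | 1         |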


In [3]:
df.columns

Index(['Gender', 'Age', 'EstimatedSalary', 'Purchased'], dtype='object')

In [4]:
df.shape

(120, 4)

In [5]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

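`LeavePOut` is a cross-validation strategy in scikit-learn that generates every possible train/test split in which exactly `p` samples form the test set and the remaining samples form the training set. For a dataset of `n` samples this produces "n choose p" splits, and for `p > 1` the test sets overlap. The number of splits grows quickly with `n` and `p`, so the technique is practical mainly for small datasets.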
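To make "every possible split" concrete, here is a minimal sketch on a toy array of four samples. The names `X_toy`, `lpo_toy`, and `splits` are illustrative and not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import LeavePOut

# Toy data: 4 samples with 1 feature each (values are arbitrary).
X_toy = np.arange(4).reshape(-1, 1)

lpo_toy = LeavePOut(p=2)

# Collect every (train, test) index pair as plain lists for easy printing.
splits = [(train.tolist(), test.tolist()) for train, test in lpo_toy.split(X_toy)]

for train, test in splits:
    print(f"train={train} test={test}")

# Each sample lands in several test sets, so the test sets overlap.
print(f"{len(splits)} splits in total")
```

With four samples and `p=2` there are six splits, and every sample appears in three of the six test sets.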

The `LeavePOut` class takes a single parameter, `p`, which sets the number of samples held out as the test set in each split. Here is an example of how to use it:


In [6]:
lpo = LeavePOut(p=2)
lpo.get_n_splits(X)

7140

In this example, we create a `LeavePOut` object called `lpo` with `p=2`. Calling `get_n_splits(X)` reports 7140 splits, which is the number of ways to choose 2 test samples out of 120 (120 × 119 / 2). The `split` method, used in the next cell, yields the training and test indices for each split; these indices select the rows on which a model is trained and evaluated.
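The split count can be checked by hand with the binomial coefficient. The sketch below assumes only the dataset size of 120 rows reported above; `placeholder` is a stand-in array of the same length, not the real feature matrix.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n_samples = 120  # number of rows in the dataset above

# Closed-form count: "n choose p" possible test sets.
for p in (1, 2, 3, 4):
    print(f"p={p}: {comb(n_samples, p)} splits")

# Cross-check against scikit-learn using a placeholder array of the same length.
placeholder = np.zeros((n_samples, 1))
n_splits = LeavePOut(p=2).get_n_splits(placeholder)
assert n_splits == comb(n_samples, 2) == 7140
```

Raising `p` by even one step multiplies the number of model fits considerably, which is worth checking before starting a long run.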

Note that every split requires fitting the model once, so the cost of `LeavePOut` grows combinatorially with the dataset size. For larger datasets, consider a cross-validation technique with a fixed number of folds, such as `KFold` or `StratifiedKFold`.
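As a rough illustration of the cheaper alternative, the sketch below builds a `StratifiedKFold` splitter on synthetic data of the same size. `X_demo` and `y_demo` are made-up arrays with a balanced label, not the notebook's dataset, and the choice of 5 folds is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 120 samples, 3 features, balanced binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = np.array([0, 1] * 60)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
n_folds = skf.get_n_splits(X_demo, y_demo)
print(f"{n_folds} model fits instead of one per LeavePOut split")

# Each test fold is disjoint from the others and keeps the class balance.
fold_sizes = []
for train_index, test_index in skf.split(X_demo, y_demo):
    fold_sizes.append(len(test_index))
    print(f"test size={len(test_index)}, class counts={np.bincount(y_demo[test_index])}")
```

Such a splitter can be passed as `cv=` to `cross_val_score` in the same way as a `LeavePOut` object.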


In [7]:
for i, (train_index, test_index) in enumerate(lpo.split(X)):
    print(f"Fold {i+1}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

Fold 1:
  Train: index=[  2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37
  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55
  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109
 110 111 112 113 114 115 116 117 118 119]
  Test:  index=[0 1]
Fold 2:
  Train: index=[  1   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37
  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55
  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73
  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91
  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109
 110 111 112 11
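The indices yielded by `split` can also be used to fit and score a model by hand. The following is a minimal sketch on a small synthetic dataset (`X_small` and `y_small` are made up for illustration, and `random_state=0` is an arbitrary choice to keep the run repeatable); it is one possible way to write the loop, not necessarily how scikit-learn implements it internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeavePOut

# Small synthetic dataset so that the loop finishes quickly.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(12, 3))
y_small = np.array([0, 1] * 6)

lpo_small = LeavePOut(p=2)
manual_scores = []
for train_index, test_index in lpo_small.split(X_small):
    # Fit a fresh model on the training rows of this split.
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X_small[train_index], y_small[train_index])
    # Accuracy on the two held-out rows.
    manual_scores.append(model.score(X_small[test_index], y_small[test_index]))

manual_scores = np.array(manual_scores)
print(manual_scores.shape)
```

The helper `cross_val_score` wraps the same fit-and-score pattern and returns one score per split.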

In [8]:
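# Fit and score one random forest per split (7140 fits for p=2 on 120 samples).
scores = cross_val_score(
    RandomForestClassifier(n_estimators=10),
    X,
    y,  # the NumPy label array can be passed directly
    scoring="accuracy",
    cv=lpo,  # the LeavePOut splitter defined above
)
scores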

array([1. , 1. , 1. , ..., 1. , 0.5, 1. ])

In [9]:
print(f"{scores.mean():f} accuracy with a standard deviation of {scores.std():f}")

0.802731 accuracy with a standard deviation of 0.281900
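Because each test set holds only `p=2` samples, a per-split accuracy can only be 0, 0.5, or 1, which goes some way toward explaining the large standard deviation. One way to inspect this is to count how often each value occurs. The sketch below uses a short made-up array, `example_scores`, purely for illustration; in the notebook the same call would be applied to `scores`.

```python
import numpy as np

# Made-up per-split accuracies, for illustration only (not real results).
example_scores = np.array([1.0, 1.0, 0.5, 1.0, 0.0, 0.5, 1.0, 1.0])

# Count how many splits produced each distinct accuracy value.
values, counts = np.unique(example_scores, return_counts=True)
for value, count in zip(values, counts):
    print(f"accuracy {value:.1f}: {count} splits")
```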
