# First step in stacking - Creating folds

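Creating folds once, up front, is useful when ensembling or combining many kinds of models from different sources: every model trains and validates on the same splits, so their out-of-fold predictions can be combined consistently.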

In [1]:
import pandas as pd
from sklearn.model_selection import StratifiedKFold

In [2]:
INPUT_DIR="../data-nlp/input/"
OUTPUT_DIR="../data-nlp/output/"

RAND=10

In [3]:
df = pd.read_csv(INPUT_DIR + "labeledTrainData.tsv", sep="\t")
df.loc[:, "kfold"]=-1
df.head()

Unnamed: 0,id,sentiment,review,kfold
0,5814_8,1,With all this stuff going down at the moment w...,-1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",-1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-1
3,3630_4,0,It must be assumed that those who praised this...,-1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-1


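Shuffle the rows before splitting. With `frac=1`, `sample` returns every row in random order; a smaller `frac` would take a subsample instead of processing all rows (optional). No `random_state` is passed here, so the order changes on every run.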

In [4]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,id,sentiment,review,kfold
0,11703_9,1,This is a great movie. Some will disagree with...,-1
1,6742_8,1,"Gene Tierney and Dana Andrews, who were both s...",-1
2,10071_1,0,I saw this film at its premier at Sundance 09....,-1
3,9841_7,1,I love cartoons. They can show things that fil...,-1
4,11579_10,1,this movie is a masterpiece a story of a young...,-1
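The difference between shuffling and subsampling can be shown on a toy frame. This is a minimal sketch: the `toy` dataframe and its values are made up for illustration, and passing `random_state` is an assumption added here to make the result repeatable (the cell above does not set it).

```python
import pandas as pd

RAND = 10  # same seed value the notebook defines in cell [2]

# Hypothetical toy data, standing in for the review dataframe.
toy = pd.DataFrame({"id": range(6), "sentiment": [0, 1, 0, 1, 0, 1]})

# frac=1 returns every row in random order; random_state fixes that order.
shuffled_a = toy.sample(frac=1, random_state=RAND).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=RAND).reset_index(drop=True)

# frac < 1 actually subsamples: here half of the rows are kept.
subsample = toy.sample(frac=0.5, random_state=RAND).reset_index(drop=True)

print(len(shuffled_a), len(subsample))
```

`reset_index(drop=True)` matters for the later steps: it gives the shuffled frame a fresh 0..n-1 index, so positional indices and index labels agree.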


Perfectly balanced data

In [5]:
df.sentiment.value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

In [6]:
y = df.sentiment.values

Stratified K-Fold splitting (into 5 splits)

The `.split()` method yields a pair of index arrays for each fold: the training indices and the validation (held-out) indices.
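To see what those indices look like, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` are hypothetical and only serve the illustration; the call pattern is the same one used on the real dataframe.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 10 samples, 5 per class (hypothetical, for illustration only).
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 1] * 5)

skf_toy = StratifiedKFold(n_splits=5)
val_index_sets = []
for fold, (train_idx, val_idx) in enumerate(skf_toy.split(X_toy, y_toy)):
    # train_idx and val_idx are integer positions into X_toy / y_toy.
    print(fold, "train:", train_idx, "val:", val_idx, "val labels:", y_toy[val_idx])
    val_index_sets.append(val_idx)

# Across the folds, every sample lands in a validation set exactly once.
all_val = np.sort(np.concatenate(val_index_sets))
```

Because the split is stratified, each validation set here contains one sample of each class, mirroring the class ratio of the whole toy set.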

In [7]:
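# Stratification keeps the 50/50 class ratio in every fold.
skf = StratifiedKFold(n_splits=5)

# t_ holds the training indices (unused here), v_ the validation indices.
# Tag each row with the fold in which it serves as validation data.
for f, (t_, v_) in enumerate(skf.split(X=df, y=y)):
    df.loc[v_, "kfold"] = f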

df.sample(frac=0.1, random_state=RAND).head()

Unnamed: 0,id,sentiment,review,kfold
18634,1927_8,1,"With the death of GEORGE NADER, on 4 February ...",3
1333,8398_3,0,Schlocky '70s horror films...ya gotta love 'em...,0
20315,12385_1,0,This film is so ridiculously idiot that you ma...,4
6357,12330_4,0,Let's put political correctness aside and just...,1
10496,10645_8,1,Adolf Hitler's maniacal desire to impose his w...,2
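A quick sanity check on the assignment is to confirm that no row kept the placeholder `-1` and that each fold has the same class mix. The sketch below runs on a small synthetic frame (`demo`, made up for illustration) rather than the real reviews, so the counts it prints are not the notebook's actual counts.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Small synthetic stand-in for the review dataframe (hypothetical data).
demo = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": [0, 1] * 10,
})
demo["kfold"] = -1

skf_demo = StratifiedKFold(n_splits=5)
for f, (_, v_) in enumerate(skf_demo.split(X=demo, y=demo.sentiment.values)):
    # v_ holds positions; .loc works here because the index is 0..n-1.
    demo.loc[v_, "kfold"] = f

# No row should keep the placeholder value -1.
unassigned = int((demo.kfold == -1).sum())

# Rows per class within each fold.
fold_class_counts = pd.crosstab(demo.kfold, demo.sentiment)
print(fold_class_counts)
```

Note that `.split()` returns positional indices while `.loc` selects by label. The two agree only when the dataframe has a default 0..n-1 index, which is why the `reset_index(drop=True)` after shuffling is needed.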


Saving folded dataframe

In [8]:
df.to_csv(OUTPUT_DIR + "01_train_folds.csv", index=False)
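A later stacking step would typically read this file back and pick one fold as validation data. The following is a rough sketch under assumptions: `get_fold` is a hypothetical helper, not part of this notebook, and the small `folds_df` frame stands in for the saved CSV.

```python
import pandas as pd

def get_fold(folds_df: pd.DataFrame, fold: int):
    """Hypothetical helper: split a folded dataframe into train/validation parts."""
    train_df = folds_df[folds_df.kfold != fold].reset_index(drop=True)
    valid_df = folds_df[folds_df.kfold == fold].reset_index(drop=True)
    return train_df, valid_df

# Tiny stand-in for the saved file; real code would use pd.read_csv instead.
folds_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [0, 1] * 5,
    "kfold": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
})

# Rows tagged with fold 0 become validation data, the rest training data.
train_df, valid_df = get_fold(folds_df, fold=0)
print(len(train_df), len(valid_df))
```

Since every model reads the same `kfold` column, their validation predictions line up row for row and can be combined afterwards.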
