# TRAINING AND TESTING SET:

Splitting a data set into training and testing sets is vital in Machine Learning. Let's look at different methods that help us accomplish that:

## 1. train_test_split:

Let's look at the code snippet below:

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [2]:
target=np.ones(16)
target[-4:]=0
df=pd.DataFrame({'col':np.random.random(16),'target':target})
df

|   | col | target |
|---|-----|--------|
| 0 | 0.712569 | 1.0 |
| 1 | 0.082269 | 1.0 |
| 2 | 0.436717 | 1.0 |
| 3 | 0.868498 | 1.0 |
| 4 | 0.49462 | 1.0 |
| 5 | 0.980849 | 1.0 |
| 6 | 0.838208 | 1.0 |
| 7 | 0.094207 | 1.0 |
| 8 | 0.7393 | 1.0 |
| 9 | 0.818747 | 1.0 |


In [3]:
X,y=df.col,df.target                                     

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,shuffle=False)
print('Train Index: {}\n Test Index: {}'.format(X_train.index,X_test.index))

Train Index: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
 Test Index: Int64Index([12, 13, 14, 15], dtype='int64')


Since shuffle is turned off, the last four indexes, which are the only rows with target 0, have all made their way into the test set. Shuffling is necessary here because otherwise the model is trained on only one target value, which is not good: it will not generalize to new data points whose targets it has never seen.
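
The effect can be checked directly by listing which target values land on each side of the split. This is a minimal sketch that rebuilds the same 16-row layout; the names `df_demo`, `train_classes` and `test_classes` are illustrative and not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same layout as above: twelve rows with target 1 followed by four rows with target 0
target = np.ones(16)
target[-4:] = 0
df_demo = pd.DataFrame({'col': np.random.random(16), 'target': target})

# Without shuffling the split is a plain cut: first rows go to train, last rows to test
X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo['col'], df_demo['target'], test_size=0.2, shuffle=False)

# Distinct target values present on each side of the split
train_classes = sorted(y_tr.unique())
test_classes = sorted(y_te.unique())
print('Classes in train:', train_classes)
print('Classes in test: ', test_classes)
```

The training side contains only target 1 and the test side only target 0, so a model fitted on this split never sees the class it is later evaluated on.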

Let's turn on shuffle, and also set random_state so that the split stays the same every time we rerun our cells.

In [4]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,shuffle=True,random_state=42)
print('Train Index: {}\n Test Index: {}'.format(X_train.index,X_test.index))

Train Index: Int64Index([13, 11, 8, 9, 2, 15, 4, 7, 10, 12, 3, 6], dtype='int64')
 Test Index: Int64Index([0, 1, 5, 14], dtype='int64')
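
As a quick check of what random_state does, the sketch below calls train_test_split twice with the same seed and compares the results. The array `data` and the variable names are illustrative assumptions, not part of the notebook above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: the values double as row positions
data = np.arange(16)

# Two independent calls with the same random_state
_, test_a = train_test_split(data, test_size=0.2, shuffle=True, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, shuffle=True, random_state=42)

# True when both calls selected the same test rows in the same order
same_seed_match = np.array_equal(test_a, test_b)
print('Same split with same seed:', same_seed_match)
```

Leaving random_state unset would draw a fresh shuffle on each call, so results could differ between runs.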


Now it's better: both target values appear in the training set.
Another point to take into account is **Stratification**.
Our test set should have the same data proportions as our train set, right?

In [5]:
# Rows 12-15 are the only rows with target 0, so isin(...) is True exactly for class 0
df1=pd.DataFrame(X_train.index.isin([12,13,14,15]))
print('0 Distribution : {}\n 1 Distribution : {}'.format(df1.value_counts()[1]/df1.shape[0],df1.value_counts()[0]/df1.shape[0]))

0 Distribution : True    0.25
dtype: float64
 1 Distribution : False    0.75
dtype: float64


In [6]:
# Same check for the test set: True marks rows with target 0
df2=pd.DataFrame(X_test.index.isin([12,13,14,15]))
print('0 Distribution : {}\n 1 Distribution : {}'.format(df2.value_counts()[1]/df2.shape[0],df2.value_counts()[0]/df2.shape[0]))

0 Distribution : True    0.25
dtype: float64
 1 Distribution : False    0.75
dtype: float64


Fortunately, the distribution came out the same in both sets here, but a plain shuffle does not guarantee that.
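
The same proportions can also be read off the target column directly, without checking index membership. The sketch below is one possible way to do it, using `value_counts(normalize=True)` on a rebuilt copy of the 16-row data; `df_demo`, `train_dist` and `test_dist` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the 16-row layout: twelve rows with target 1, four rows with target 0
target = np.ones(16)
target[-4:] = 0
df_demo = pd.DataFrame({'col': np.random.random(16), 'target': target})

X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo['col'], df_demo['target'],
    test_size=0.2, shuffle=True, random_state=42)

# normalize=True returns fractions instead of raw counts
train_dist = y_tr.value_counts(normalize=True)
test_dist = y_te.value_counts(normalize=True)
print('Train distribution:\n', train_dist)
print('Test distribution:\n', test_dist)
```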

When there are more features, it's difficult to maintain the same data proportions in both the train and test sets. You might ask your data science team which feature or features they think are most important, or you can run a correlation matrix on the dataframe to get insight into which feature's distribution matters most.
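
A rough sketch of the correlation-matrix idea follows. The data is synthetic and the column names `feature_a`, `feature_b` and `target` are hypothetical. Binning the chosen numeric feature with `pd.qcut` so it can be passed to `stratify` is an assumption added here for illustration, not something the text above prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: target depends strongly on feature_a and not on feature_b
rng = np.random.default_rng(0)
n = 200
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
target = 3 * feature_a + rng.normal(scale=0.5, size=n)
df_demo = pd.DataFrame(
    {'feature_a': feature_a, 'feature_b': feature_b, 'target': target})

# Absolute correlation of each feature with the target, strongest first
corr = df_demo.corr()
ranking = corr['target'].drop('target').abs().sort_values(ascending=False)
top_feature = ranking.index[0]
print(ranking)

# stratify expects discrete labels, so cut the numeric feature into 4 quantile bins
bins = pd.qcut(df_demo[top_feature], q=4, labels=False)
train_part, test_part = train_test_split(
    df_demo, test_size=0.2, stratify=bins, random_state=0)

# Number of test rows that fall in each bin
test_bin_counts = bins.loc[test_part.index].value_counts()
print(test_bin_counts)
```

Correlation only captures linear relationships, so treat the ranking as a starting point for discussion rather than a final answer.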

In [7]:
df3=pd.DataFrame({'Languages':['Hindi','English','French','English','Hindi','Hindi','French','Hindi','French','Hindi','English','Hindi','French','English','Hindi','French'],'scores':[12,55,65,23,98,76,9,123,49,64,76,59,31,11,63,38],
                 'percentage':[45,32,65,67,45,98,21,32,54,87,59,43,26,10,4,76]})

In [8]:
df3

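|   | Languages | scores | percentage |
|---|-----------|--------|------------|
| 0 | Hindi | 12 | 45 |
| 1 | English | 55 | 32 |
| 2 | French | 65 | 65 |
| 3 | English | 23 | 67 |
| 4 | Hindi | 98 | 45 |
| 5 | Hindi | 76 | 98 |
| 6 | French | 9 | 21 |
| 7 | Hindi | 123 | 32 |
| 8 | French | 49 | 54 |
| 9 | Hindi | 64 | 87 |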


In [9]:
df3['Languages'].value_counts()/len(df3['Languages'])                 #overall distribution

Hindi      0.4375
French     0.3125
English    0.2500
Name: Languages, dtype: float64

In [10]:
train_set,test_set=train_test_split(df3,test_size=0.2,stratify=df3['Languages'],random_state=0)
print('train_set index:\t {}'.format(train_set.index))
print('test_set index:\t {}'.format(test_set.index))

train_set index:	 Int64Index([6, 3, 2, 8, 14, 10, 9, 4, 15, 13, 7, 0], dtype='int64')
test_set index:	 Int64Index([5, 11, 12, 1], dtype='int64')


In [11]:
print('Train set distribution:\n',train_set['Languages'].value_counts()/len(train_set['Languages']))
print('Test set distribution:\n',test_set['Languages'].value_counts()/len(test_set['Languages']))

Train set distribution:
 Hindi      0.416667
French     0.333333
English    0.250000
Name: Languages, dtype: float64
Test set distribution:
 Hindi      0.50
French     0.25
English    0.25
Name: Languages, dtype: float64


In [12]:
train_set1,test_set1=train_test_split(df3,test_size=0.2,random_state=0)
print('Train set distribution:\n',train_set1['Languages'].value_counts()/len(train_set1['Languages']))
print('Test set distribution:\n',test_set1['Languages'].value_counts()/len(test_set1['Languages']))

Train set distribution:
 Hindi      0.50
English    0.25
French     0.25
Name: Languages, dtype: float64
Test set distribution:
 French     0.50
English    0.25
Hindi      0.25
Name: Languages, dtype: float64


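As seen above, the stratify parameter helps maintain the overall distribution of a chosen feature in both the train and test sets. With only four rows in the test set, proportions can only change in steps of 0.25, so 0.50/0.25/0.25 is as close to the overall distribution as the stratified test set can get. The unstratified split, by contrast, puts French at 0.50 and Hindi at 0.25 in the test set.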

### FITTING AN ML MODEL ON THE TRAINING DATA ALONE MIGHT NOT DO THE TRICK, BECAUSE IT DOES NOT TELL US HOW WELL THE MODEL GENERALIZES. THAT IS WHERE CROSS VALIDATION COMES INTO THE PICTURE:

- CROSS VALIDATION is a very important concept because it estimates how our trained ML model will generalize when unseen data is presented to it.

- CROSS VALIDATION splits our training set into N splits. In each round, one split is set aside as the hold-out (reserved or evaluation) split and the ML model is trained on the remaining N-1 splits. After training, the model is evaluated on the hold-out split and an evaluation score is calculated. This is repeated N times so that every split serves as the hold-out split exactly once.

- Thus N evaluation scores are obtained. The type of evaluation depends upon the 'scoring' parameter.
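
The procedure in the list above can be sketched as an explicit loop. This is an illustrative version built on `KFold` with synthetic data, not necessarily how `cross_val_score` is implemented internally; the data and the name `fold_rmse` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic toy data: a slightly noisy straight line
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=20)

n_splits = 5
fold_rmse = []
# Each iteration holds out one split and trains on the other n_splits - 1
for train_idx, holdout_idx in KFold(n_splits=n_splits).split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[holdout_idx])
    # One evaluation score (RMSE here) per hold-out split
    fold_rmse.append(np.sqrt(mean_squared_error(y[holdout_idx], preds)))

print('Number of scores:', len(fold_rmse))
print('RMSE per fold:', fold_rmse)
```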

In [13]:
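from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# Toy data: Value is an exact linear function of Year
df=pd.DataFrame({'Year':np.arange(2001,2011),'Value':np.linspace(20,100,10)})
X,y=df[['Year']],df[['Value']]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
lin_reg=LinearRegression()
# This fit is not needed by cross_val_score, which fits a fresh clone of the model on each fold
lin_reg.fit(X_train,y_train)
# scoring='neg_mean_squared_error' returns negated MSE values, one per fold
scores=cross_val_score(lin_reg,X_train,y_train,cv=5,scoring='neg_mean_squared_error')
# Negate and take the square root to get RMSE per fold
lin_rmse=np.sqrt(-scores)
lin_rmse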

array([5.72769729e-13, 4.01472365e-13, 5.72769729e-13, 1.20792265e-12,
       1.22213351e-12])
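
The errors above are essentially zero because Value is an exact linear function of Year, so a linear model fits every fold perfectly. On noisy data the per-fold scores vary, and it is common to summarize them with a mean and standard deviation. The sketch below does that on synthetic data and also tries a second value of the 'scoring' parameter; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic noisy line so that the fold scores are not all zero
rng = np.random.default_rng(0)
X = np.arange(40, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=40)

model = LinearRegression()
# Both scorers return negated errors, so that higher is always better
neg_mse = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
neg_mae = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')

rmse = np.sqrt(-neg_mse)
mae = -neg_mae
print('RMSE per fold:', rmse)
print('RMSE mean: {:.3f}, std: {:.3f}'.format(rmse.mean(), rmse.std()))
print('MAE mean: {:.3f}'.format(mae.mean()))
```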