# **Creating Train and Test Datasets Example**

Our main goal here is to predict how a student will perform in the national exam from their mock exam score. There are a few steps we need to take to achieve this.

First, we need to split the dataset into training and test datasets so that we can train the model to predict our desired outcome.

After splitting the dataset, we are going to employ a method for training a model on the training dataset.

The following example is split into two parts. The first part shows how to split the dataset into train and test datasets. The second part shows how to train a model on the data using linear regression.


In this example, we are going to learn how to split a dataset into train and test sets so that we can start training our model. We will first show a naive way of splitting a dataset, then show better ways of splitting it that preserve the proportions of the original data.

The dataset we are going to use comprises exam data for 1,000 students from both public and private schools in Kenya. 50% of this data is from public schools and the other 50% is from private schools. We need to maintain this proportion when creating our train and test samples.

[Download dataset](https://drive.google.com/file/d/12OGVlkFkLwycegmG5zkdDfzoxCJ3qU_k/view?usp=sharing)


In [3]:
import pandas as pd 

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline



In [7]:
from sklearn.model_selection import train_test_split

In [8]:
df = pd.read_csv('student_exam_data.csv')

In [9]:
df.head()

   mock_result school_type  national_result
0           27      PUBLIC               55
1           60     PRIVATE               35
2           57      PUBLIC               39
3           52      PUBLIC               39
4           44      PUBLIC               63


In [16]:
# Split the dataset into train and test sets.
# We use the first 700 entries of the dataset as the train set
# and the remaining 300 entries as the test set.

train = df[:700]
test = df.drop(train.index)
# This is the same as
# test = df[700:]

print("train:" + str(len(train)))

print("test:" + str(len(test)))


train:700
test:300


`Analyzing the sets`


How similar are the training and test datasets?

In [15]:
# Let's analyse the training and test datasets and see if they have the right proportions.
# Ideally, we want both our training and test datasets to have a 50-50 apportionment of private and public schools

# Check the apportionment of Private and Public schools in the train data set
train_count=train['school_type'].value_counts()

# Check the apportionment of Private and Public in the test data set
test_count=test['school_type'].value_counts()

# Print out the apportionment of private and public schools in both train and test dataset
print(train_count)
print('*************************')
print(test_count)

PUBLIC     450
PRIVATE    250
Name: school_type, dtype: int64
*************************
PRIVATE    250
PUBLIC      50
Name: school_type, dtype: int64


As you can see, the train dataset contains 450 students from public schools and 250 from private schools. This translates to roughly 64% and 36% respectively, which is not the proportion we are aiming for.

Similarly, the test dataset contains 250 students from private schools and only 50 from public schools. This translates to roughly 83% and 17% respectively. Again, this is not the proportion we are aiming for.

In conclusion, this differs greatly from our target, which is an equal proportion of public and private schools in both sets: 50% public and 50% private in the train dataset, and the same in the test dataset.

This is why we call this a naive way of splitting the dataset: it does not preserve the population's initial proportions.
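The share of each school type in a split can be computed directly with `value_counts(normalize=True)`. The following is a minimal sketch on a synthetic DataFrame, not the real exam data: the rows are assumed to be ordered so that the first 700 hold 450 PUBLIC and 250 PRIVATE entries, mirroring the counts printed earlier. The names `toy_df` and `school_type_share` are hypothetical and only used for illustration.

```python
import pandas as pd

# Synthetic stand-in for the exam data (an assumption for illustration):
# the first 700 rows hold 450 PUBLIC and 250 PRIVATE entries,
# the last 300 rows hold 250 PRIVATE and 50 PUBLIC entries.
school_type = (
    ["PUBLIC"] * 450 + ["PRIVATE"] * 250    # rows 0..699
    + ["PRIVATE"] * 250 + ["PUBLIC"] * 50   # rows 700..999
)
toy_df = pd.DataFrame({"school_type": school_type})


def school_type_share(frame):
    """Return the percentage of rows per school type, rounded to 1 decimal."""
    return (frame["school_type"].value_counts(normalize=True) * 100).round(1)


# Naive, order-based split: first 700 rows for training, the rest for testing
naive_train = toy_df[:700]
naive_test = toy_df[700:]

print(school_type_share(toy_df))       # whole population: 50% each
print(school_type_share(naive_train))  # skewed towards PUBLIC
print(school_type_share(naive_test))   # skewed towards PRIVATE
```

Comparing the three printouts makes the imbalance visible at a glance: the population is balanced, but neither slice is.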

To achieve the proportion we want, we will employ one of the sampling techniques we covered in module 1.

` Sampling`

Recalling module 1, let's apply stratified sampling and check that our train and test sets now resemble each other in their public vs private student representation.

In [22]:
# Using the stratified technique, we want to split the dataset so that 70% of it becomes the train set and 30% becomes the test set.
# Furthermore, the proportion of public and private schools should be equal in both the train and test datasets.
# For example, in the train dataset we should have 350 public schools and 350 private schools represented.
# The same goes for the test dataset, where we expect 150 public schools and 150 private schools.

# Stratified train sample

train_strat_dataset=df.groupby('school_type', group_keys=False).apply(lambda grouped_subset : grouped_subset.sample(frac=0.7))
# preview the stratified train dataset


# Stratified test sample
test_strat_dataset = df.drop(train_strat_dataset.index)

# Preview the stratified test dataset
test_strat_dataset

# Print out the proportion of private vs public schools in both train and test dataset
test_strat_count=test_strat_dataset['school_type'].value_counts()
train_strat_count=train_strat_dataset['school_type'].value_counts()

print(train_strat_count)
print('*************************************************')
print(test_strat_count)


PRIVATE    350
PUBLIC     350
Name: school_type, dtype: int64
*************************************************
PUBLIC     150
PRIVATE    150
Name: school_type, dtype: int64
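The sampling above draws different rows on every run. If a repeatable split is needed, one option is to pass a seed. The sketch below is an assumption-laden variant rather than the notebook's own method: it uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) with `random_state`, on a synthetic DataFrame named `toy_df` in place of the real exam data.

```python
import pandas as pd

# Synthetic stand-in for the exam data: 500 PUBLIC and 500 PRIVATE rows
toy_df = pd.DataFrame({
    "school_type": ["PUBLIC"] * 500 + ["PRIVATE"] * 500,
    "mock_result": range(1000),
})

# Draw 70% of the rows within each school type.
# random_state fixes the draw so the split is the same on every run.
train_part = toy_df.groupby("school_type").sample(frac=0.7, random_state=42)

# Everything that was not drawn for training becomes the test set
test_part = toy_df.drop(train_part.index)

print(train_part["school_type"].value_counts().sort_index())
print(test_part["school_type"].value_counts().sort_index())
```

Because each group is sampled separately, both parts keep the 50-50 balance of the population.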


train_test_split and its options

Next, we are going to demonstrate another way to achieve similar results, using the sklearn library.
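Before applying it to the exam data, here is a minimal sketch of `train_test_split` on a small synthetic DataFrame. The names `toy_df`, `toy_train` and `toy_test` are hypothetical, and `random_state=42` is an arbitrary seed added only to make the result repeatable. The point to note is the order of the return values: the training portion comes first, then the test portion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exam data: 500 PUBLIC and 500 PRIVATE rows
toy_df = pd.DataFrame({
    "mock_result": range(1000),
    "school_type": ["PUBLIC", "PRIVATE"] * 500,
})

# train_test_split returns the train portion first, then the test portion.
# stratify keeps the school_type proportions the same in both portions.
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.3,
    stratify=toy_df["school_type"],
    random_state=42,
)

print(len(toy_train), len(toy_test))
print(toy_train["school_type"].value_counts().sort_index())
print(toy_test["school_type"].value_counts().sort_index())
```

Unpacking the result in the wrong order silently swaps the two sets, so checking the lengths right after the split is a cheap safeguard.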

In [32]:
# Now we will use the sklearn library to split the dataset into train and test datasets, using the train_test_split function.
# The function takes the dataframe we want to split, plus either test_size or train_size, which sets the size of the test or train dataset.
# We'll also use a third argument called stratify, which keeps the school_type proportions the same in both splits.
# You can read more on this function through this link https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

# Split our dataset into train_data and test_data using sklearn's train_test_split function.
# Note that train_test_split returns the train split first, then the test split.
train_data, test_data = train_test_split(df, test_size=0.3, stratify=df['school_type'])

# Preview the test dataset
print(test_data)

# Preview the train dataset
print(train_data)

# Print out the proportion of private vs public schools in both train and test dataset
train_split_count = train_data['school_type'].value_counts()
test_split_count = test_data['school_type'].value_counts()

print(train_split_count)
print('*************************************************')
print(test_split_count)

     mock_result school_type  national_result
99            36      PUBLIC               42
571           68      PUBLIC               59
956           58     PRIVATE               82
877           32     PRIVATE               59
694           41      PUBLIC               58
..           ...         ...              ...
54            18     PRIVATE               74
845           26     PRIVATE               42
85            46     PRIVATE               48
634           75      PUBLIC               48
18            61      PUBLIC               65

[300 rows x 3 columns]
     mock_result school_type  national_result
581           32      PUBLIC               58
48            87     PRIVATE               33
554           25      PUBLIC               56
747           74     PRIVATE               21
625           56      PUBLIC               27
..           ...         ...              ...
961           87     PRIVATE               23
12            62     PRIVATE               18
895       