
# **Crafting sets**

### **Understanding your variables:**

First, analyze your variables and determine which one you want your model to predict. We will refer to it as the dependent variable.

Next, you must establish which other variables will help you predict your dependent variable. These will be referred to as independent variables.

It is important to perform exploratory data analysis to check whether there is a relationship between your dependent and independent variables. A relationship does not mean that your independent variable causes the dependent one, only that the two are associated.

For example, in a dataset on students we may find variables such as student height, mock exam results, and national exam results. If you plot mock exam results against national exam results, the points will roughly follow a line. This makes intuitive sense: students who do poorly in the mock exam are likely not ready for the national exam, and vice versa.

Plotting height against national exam results will probably lead to a much more scattered plot, indicating that there isn't a strong relationship between height and academic performance. 

Therefore, as we create our training and test sets to predict national exam performance, we will want to include mock exam performance, but not height.
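One way to put a number on the visual impression described above is to compute correlations. The sketch below uses synthetic data, not the course dataset: the column names, the sample size, and the assumed linear link between mock and national results are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, purely illustrative data (NOT the course dataset).
rng = np.random.default_rng(seed=0)
n_students = 200

mock = rng.uniform(20, 90, size=n_students)
height = rng.normal(165, 10, size=n_students)
# Assumption: national results track mock results, plus some random noise.
national = 0.8 * mock + 10 + rng.normal(0, 5, size=n_students)

students = pd.DataFrame({
    "height_cm": height,
    "mock_result": mock,
    "national_result": national,
})

# Pearson correlation of every column with the dependent variable.
correlations = students.corr()["national_result"]
print(correlations.round(2))
```

A correlation close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests the variable is unlikely to help a linear model. As with the plots, this shows association only, not causation.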

### **Why do we need two sets?**

The training set is where the machine learning actually happens. It includes data on your dependent variable alongside all the independent variables you choose to include. Your supervised learning algorithm goes through this data set and, for a given row, tries to predict the dependent variable from the independent ones, then adjusts its parameters based on how good the prediction was. Over time, the algorithm gets good at recognizing the patterns in your training data.

Why do we need the test set then? The test set is not used for training. It is held back to evaluate how well the model you've created predicts the dependent variable on data it has not seen.

Later this week, we will explore ethical considerations when creating train and test datasets. Remember this though: your predictive model is only as good as the data you've used to train it. Training comes with many challenges, and the reading and exercises below will run you through ways to deal with them.
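The roles of the two sets can be sketched in a few lines. The example below is a minimal illustration on synthetic data: the numbers are made up, and linear regression is used only as a stand-in model. The point is that the model is fitted on the training set and scored on the held-out test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, purely illustrative data (NOT the course dataset).
rng = np.random.default_rng(seed=1)
mock = rng.uniform(20, 90, size=(300, 1))  # independent variable, shape (n, 1)
national = 0.8 * mock[:, 0] + 10 + rng.normal(0, 5, size=300)  # dependent variable

# Hold back 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    mock, national, test_size=0.3, random_state=1
)

# Training: the model adjusts its parameters using the training set only.
model = LinearRegression().fit(X_train, y_train)

# Evaluation: R^2 on the training rows versus the unseen test rows.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R^2 on training set: {train_r2:.2f}")
print(f"R^2 on test set:     {test_r2:.2f}")
```

If the test score is much lower than the training score, the model has likely memorized the training data instead of learning a pattern that generalizes.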

[Dataset Download](https://drive.google.com/file/d/12OGVlkFkLwycegmG5zkdDfzoxCJ3qU_k/view?usp=sharing)

Our major goal here is to predict how a student will perform in the national exam by using their mock exam scores. There are a few steps we need to take to achieve this.

First, we need to split the dataset into training and test datasets so that we can train the model to predict our desired outcome.

After splitting the dataset, we are going to employ a method for training a model on the training dataset.

The following example is split into two parts: the first shows how to split the dataset into train and test datasets, and the second shows how to train a model using linear regression.

In this example, we are going to learn how to split a dataset into train and test sets so that we can start training our model. We will first show a naive way of splitting a dataset then continue to show different ways of efficiently splitting the dataset.

The dataset we are going to use comprises exam data for 1000 students from both public and private schools in Kenya. 50% of this data is from public schools and the other 50% is from private schools. We need to maintain this proportion when creating our train and test datasets.

**Naive splitting:**

- Show a simple table with one independent variable (the `mock_result` column), one dependent variable (the `national_result` column), and the `school_type` column.
- Use simple splits to create two datasets: one for training and one for testing.

## **Import the Relevant Libraries**

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


## **Loading the data**



In [6]:
# Load the dataset
data = pd.read_csv('/content/student_exam_data.csv')

# Display the first three rows of the dataset
data.head(3)


Unnamed: 0,mock_result,school_type,national_result
0,27,PUBLIC,55
1,60,PRIVATE,35
2,57,PUBLIC,39
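The `Unnamed: 0` column in the preview usually means the CSV file was saved together with its row index. The sketch below assumes that is the case here and re-types the three preview rows as in-memory CSV text, so it runs without the file. It shows how `index_col=0` keeps that column out of the feature columns.

```python
import io

import pandas as pd

# The three preview rows, re-typed as CSV text so the example runs without the file.
# Assumption: the original file has an empty header for its first column.
csv_text = """,mock_result,school_type,national_result
0,27,PUBLIC,55
1,60,PRIVATE,35
2,57,PUBLIC,39
"""

# Default read: the unnamed first column becomes a regular column called "Unnamed: 0".
with_extra_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))
print(list(clean.columns))
```

Dropping the stray index column early avoids accidentally feeding a row counter to a model as if it were a feature.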


## **Splitting the dataset**

In [7]:
# Split the dataset into train and test sets.
# We take the first 700 entries of the dataset as train and the remaining 300 entries as test.

train = data.iloc[:700]

# Drop the rows already assigned to train from the main dataset and store the rest in a variable called test
test = data.drop(train.index)

# Confirm that the train and test datasets have our desired length
print(f"train:{len(train)}")
print(f"test:{len(test)}")


train:700
test:300


## **Analyzing the sets**


How similar are the training and test datasets?


In [8]:
# Let's analyse the training and test datasets and see if they have the right proportions.
# Ideally, we want both our training and test datasets to have a 50-50 apportionment of private and public schools

# Check the apportionment of private and public schools in the train dataset
train_count = train['school_type'].value_counts()

# Check the apportionment of private and public schools in the test dataset
test_count = test['school_type'].value_counts()

# Print out the apportionment of private and public schools in both train and test datasets
print(train_count)
print('*************************')
print(test_count)

PUBLIC     450
PRIVATE    250
Name: school_type, dtype: int64
*************************
PRIVATE    250
PUBLIC      50
Name: school_type, dtype: int64


As you can see, the train dataset contains 450 entries from public schools and 250 from private schools. This translates to roughly 64% and 36% respectively, which is not the proportion we are aiming for.

Similarly, the test dataset contains 250 entries from private schools and 50 from public schools. This translates to roughly 83% and 17% respectively. Again, this is not the proportion we were aiming for.

In conclusion, this differs greatly from what we are aiming for, which is an equal proportion of private and public schools in both the train and test datasets: 50% public and 50% private in each.
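The shares can also be computed directly instead of by hand. The sketch below rebuilds the label columns from the counts printed above, since the data file itself is not assumed to be available, and uses `value_counts(normalize=True)` to get proportions.

```python
import pandas as pd

# Label columns rebuilt from the counts in the value_counts() output above.
train_labels = pd.Series(["PUBLIC"] * 450 + ["PRIVATE"] * 250, name="school_type")
test_labels = pd.Series(["PRIVATE"] * 250 + ["PUBLIC"] * 50, name="school_type")

# normalize=True returns fractions of the total instead of raw counts.
train_share = train_labels.value_counts(normalize=True)
test_share = test_labels.value_counts(normalize=True)

# Express the fractions as percentages, rounded to one decimal place.
print((train_share * 100).round(1))
print((test_share * 100).round(1))
```

On the real data, the same check would be `train['school_type'].value_counts(normalize=True)`.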

This is why we call this a naive way of splitting the dataset: it does not preserve the population's initial proportions.

To achieve the proportion we want, we will employ one of the sampling techniques we covered in module 1.
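Assuming the technique meant here is stratified sampling, a minimal pandas sketch could look like the following. The dataframe is a stand-in with made-up values, ordered on purpose so that a naive head/tail split would be unbalanced. `groupby(...).sample(...)` requires pandas 1.1 or later.

```python
import pandas as pd

# Stand-in for the course data: 1000 rows, half PUBLIC and half PRIVATE.
# The rows are deliberately sorted by school type (hypothetical ordering).
data = pd.DataFrame({
    "school_type": ["PUBLIC"] * 500 + ["PRIVATE"] * 500,
    "mock_result": range(1000),  # placeholder values, not real scores
})

# Draw 70% of the rows from each school type separately.
train = data.groupby("school_type").sample(frac=0.7, random_state=0)

# Everything that was not drawn goes to the test set.
test = data.drop(train.index)

print(train["school_type"].value_counts())
print(test["school_type"].value_counts())
```

Because each school type is sampled separately, both resulting sets keep the 50-50 split regardless of how the rows are ordered in the file.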

## **train_test_split and its options**

Next, we are going to demonstrate another way to split the dataset, this time using the sklearn library.

In [15]:
# Now we will use the sklearn library to split the dataset into train and test datasets, using the train_test_split function.
# The function takes a dataframe and either test_size or train_size as arguments. The dataframe is the data we want to split, and test_size/train_size sets the size of the test or train dataset.
# We'll also use a third argument called stratify, which keeps the proportion of each school type the same in both splits.
# You can read more on this function through this link https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

# Split our dataset into train_data and test_data using sklearn's train_test_split function
train_data, test_data = train_test_split(data, test_size=0.3, stratify=data['school_type'])

# Check the shape of the train dataset
print(train_data.shape)

# Check the shape of the test dataset
print(test_data.shape)




(700, 3)
(300, 3)
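The shapes confirm the 700/300 split, but not the school-type balance. The sketch below repeats the stratified split on a stand-in dataframe with made-up values and checks the proportions in each part. The `random_state` argument is an addition not used in the cell above; fixing it to any integer makes the split repeatable between runs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the course data: 1000 rows, half PUBLIC and half PRIVATE.
data = pd.DataFrame({
    "mock_result": range(1000),  # placeholder values, not real scores
    "school_type": ["PUBLIC"] * 500 + ["PRIVATE"] * 500,
})

train_data, test_data = train_test_split(
    data,
    test_size=0.3,
    stratify=data["school_type"],  # preserve the PUBLIC/PRIVATE ratio in both parts
    random_state=42,               # assumption: any fixed integer gives a repeatable split
)

# Proportion of each school type in the two parts.
print(train_data["school_type"].value_counts(normalize=True))
print(test_data["school_type"].value_counts(normalize=True))
```

Running the same `value_counts(normalize=True)` check on the real `train_data` and `test_data` is a quick way to confirm that the stratified split kept the 50-50 proportion.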
