# <p style="color:green;">Data Splitting</p>

In data splitting, we first separate a labelled dataset into inputs and outputs so that the system can learn which output corresponds to which input. The training input is named 'x-train' and the training output 'y-train'. This applies only to supervised (labelled) datasets.

The system learns from this training data. It is then given a second set of inputs from the same dataset, known as 'x-test', whose real outputs ('y-test') we already know, and it predicts an output for each of them.

Finally, we compare the predicted outputs with the real outputs present in the dataset.
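The learn, predict and compare cycle described above can be sketched on a tiny made-up dataset. This is only an illustration under stated assumptions: the data (`y = 2x + 1`), the 7/3 split and the choice of `LinearRegression` as a stand-in model are all hypothetical and are not part of this notebook's housing workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy supervised dataset (made up for illustration): y = 2x + 1
x_all = np.arange(10, dtype=float).reshape(-1, 1)  # inputs, one feature
y_all = 2 * x_all.ravel() + 1                      # known outputs

# Train on the first 7 rows, hold out the last 3 for testing
x_train, y_train = x_all[:7], y_all[:7]
x_test, y_test = x_all[7:], y_all[7:]

model = LinearRegression()
model.fit(x_train, y_train)      # learn from (x-train, y-train)
y_pred = model.predict(x_test)   # predict outputs for unseen inputs (x-test)

# Compare the predictions with the real outputs (y-test)
mean_abs_error = np.mean(np.abs(y_pred - y_test))
print("predicted:", y_pred)
print("actual:   ", y_test)
print("mean absolute error:", mean_abs_error)
```

Because the toy data is exactly linear, the predictions should match the held-out outputs almost perfectly; on real data such as the housing set the error would be noticeably larger.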

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np

### Loading Data

In [2]:
data = pd.read_csv('C:\\Users\\Vivek hotti\\Desktop\\Practice\\ML DA\\housing.csv')

#### A little bit of Data Exploration

In [3]:
data.head(5)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
data.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


<b><p style="color:blue;">As the output above shows, every column has 20640 non-null records except "total_bedrooms", which has only 20433. That column therefore contains 207 NULL values.</p></b>

<b><p style="color:green;">Hence the data must be pre-processed to eliminate the NULL values.</p></b>

In [6]:
data.isnull().sum()
# isnull().sum() counts the NULL values present in each column.

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [7]:
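data['total_bedrooms'] = data['total_bedrooms'].fillna(data['total_bedrooms'].mean())
# Filling the NULL values in the total_bedrooms column with that column's mean (537.870553, as shown in cell 4), computed directly instead of being hard-coded, and without touching the other columns.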

In [8]:
data.isnull().sum()
# checking if any NULL Values are present in the dataset after filling in.

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
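As a small illustration of mean imputation, the sketch below fills a missing value using the mean computed from the column itself rather than a number typed in by hand. The frame `toy` and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'rooms' has one missing value, 'label' is a text column
toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0],
    "label": ["a", "b", "c"],
})

# Compute the mean from the column itself (NaN entries are skipped by default)
rooms_mean = toy["rooms"].mean()

# Fill only that column, leaving every other column untouched
toy["rooms"] = toy["rooms"].fillna(rooms_mean)

print(toy)
print(toy.isnull().sum())
```

Filling a single named column keeps a numeric mean from ever being written into a text column such as `ocean_proximity`, should that column contain NULL values in some other version of the data.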

In [9]:
data.shape

(20640, 10)

# <p style="color:red;">Approach 1 : Manual Splitting with Slicing</p>

### <b>Splitting data to input and output data</b>

Median house value (the median_house_value column) will be our output parameter, and all the remaining columns will be our input parameters.
- x-train : training input
- y-train : training output
- x-test : testing input
- y-test : testing output

In [10]:
# Our input will be 'x': every column except median_house_value, because that is the value we want to predict (the output).
x = data.drop('median_house_value', axis = 1)
# General form: newDatasetName = data.drop('columnNameToBeDropped', axis = 1)
# axis = 1 means a column is dropped; with axis = 0, a row with that label would be dropped instead.

In [11]:
x.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | NEAR BAY |


In [12]:
# Our output will be 'y': a single column, median_house_value, because that is the value being predicted (the output).
y = data['median_house_value']

In [13]:
y.head()

0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
Name: median_house_value, dtype: float64

In [15]:
x.shape
# dataframe.shape (an attribute, so no parentheses)
# Quickly finding out the number of rows and columns.

(20640, 9)

In [16]:
y.shape
# series.shape (an attribute, so no parentheses)
# y is a single column (a Series), so only the number of rows is reported.

(20640,)
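The input/output separation above can be tried on a small made-up frame. This is a minimal sketch; the frame `toy` and its columns are hypothetical stand-ins for the housing data. It also shows the difference between `axis=1` and `axis=0` in `drop`.

```python
import pandas as pd

# Made-up frame standing in for the housing data
toy = pd.DataFrame({
    "rooms": [3, 5, 4],
    "income": [2.5, 8.1, 4.0],
    "price": [100.0, 300.0, 180.0],
})

x_toy = toy.drop("price", axis=1)  # axis=1: drop a column, leaving the inputs
y_toy = toy["price"]               # a single column (a Series): the output
fewer_rows = toy.drop(0, axis=0)   # axis=0: drop the row labelled 0 instead

print(x_toy.shape, y_toy.shape, fewer_rows.shape)

# x and y keep the same row labels, so row i of x still matches row i of y
print(x_toy.index.equals(y_toy.index))
```

Neither `drop` call modifies `toy` itself; each returns a new object, which is why the original frame can still be used to build `y_toy`.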

### <b>Further splitting input and output data to training & testing data respectively</b>

<img src="2.jpg" align="left" width="250">

- x-train : training input   -  20,001 rows (labels 0 to 20000) ; all 9 columns
- y-train : training output  -  20,001 values
- x-test : testing input     -  640 rows (labels 20000 to 20639) ; all 9 columns
- y-test : testing output    -  640 values

In [23]:
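# Splitting our input (x) into training input (x_train) & testing input (x_test) :
# Note: .loc slices by label and includes the end label, so row 20000 ends up in both
# the training and the testing set. x.iloc[:20000] and x.iloc[20000:] would avoid the overlap.
x_train = x.loc[0:20000]
x_test = x.loc[20000:]

# Splitting our output (y) into training output (y_train) & testing output (y_test) :
y_train = y.loc[0:20000]
y_test = y.loc[20000:]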

In [24]:
x_train.shape

(20001, 9)

In [25]:
x_test.shape

(640, 9)

In [26]:
y_train.shape

(20001,)

In [27]:
y_test.shape

(640,)
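The shapes above add up to 20,641 rows rather than 20,640 because `.loc` slices by label and includes both endpoints, so the row labelled 20000 lands in both sets. The sketch below reproduces the effect on a made-up 10-row frame and compares it with position-based `.iloc` slicing, which excludes the end position. The frame and the split point `split_at` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up frame with the default integer index 0..9
toy = pd.DataFrame({"value": range(10)})
split_at = 7  # hypothetical split point

# Label-based slicing: the end label is included
loc_train = toy.loc[0:split_at]
loc_test = toy.loc[split_at:]

# Position-based slicing: the end position is excluded
iloc_train = toy.iloc[:split_at]
iloc_test = toy.iloc[split_at:]

# Rows that appear in both the training and the testing set
shared_loc = loc_train.index.intersection(loc_test.index)
shared_iloc = iloc_train.index.intersection(iloc_test.index)

print(len(loc_train), len(loc_test), list(shared_loc))
print(len(iloc_train), len(iloc_test), list(shared_iloc))
```

A row shared between the two sets means the model is tested on data it has already seen, so the `.iloc` form is usually the safer choice for a manual split.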

# <p style="color:red;">Approach 2 : Splitting Data using sklearn</p>

This approach randomly distributes the rows between the training and testing sets while keeping each input paired with its corresponding output. In the manual slicing approach, we selected roughly the first 20,000 rows as x-train (training input) and had to manually select the matching values as y-train (training output).

Using sklearn, we can instead randomly choose the training data (x-train) along with their corresponding outputs (y-train) in a single call.

In [29]:
# importing the function train_test_split from sklearn library
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20)
# We have defined test_size as 0.20, so the test data will be 20% of the records present in the whole data frame.
# The remaining 0.80, or 80%, becomes the training data.

In [31]:
x_train.shape

(16512, 9)

In [32]:
y_train.shape

(16512,)

In [33]:
x_test.shape

(4128, 9)

In [34]:
y_test.shape

(4128,)
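The shapes above match the arithmetic: 20% of 20,640 rows is 4,128 for testing, leaving 16,512 for training. Because the split is random, rerunning the cell gives different rows each time unless a seed is fixed. The sketch below, on made-up 10-row data, shows the optional `random_state` parameter (the value 42 is an arbitrary assumption) and checks that inputs and outputs stay paired.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, where the target is always 10 times the feature
x_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series([v * 10 for v in range(10)], name="target")

# random_state fixes the shuffle so the same split is produced on every run
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.20, random_state=42
)
print(x_tr.shape, x_te.shape)

# Inputs and outputs are shuffled together, so they stay paired row by row
print(x_te.index.equals(y_te.index))

# Same seed, same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_toy, y_toy, test_size=0.20, random_state=42
)
print(x_te.index.equals(x_te2.index))
```

Unlike the manual slicing approach, no row can appear in both sets here, and the shuffle avoids a test set made up only of the last rows of the file.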