# Splitting the Dataset into Training and Testing Sets

<h2> Importing Libraries </h2>

In [2]:
import pandas as pd
import numpy as np

<h2> Importing Dataset </h2>

In [3]:
df = pd.read_excel("Salary_Dataset_02.xls")

In [4]:
df.head()

Unnamed: 0,Australia,Canada,Dubai,USA,Salary,YearsExperience,Purchased
0,0,0,1,0,39343,1.1,0
1,0,1,0,0,46205,1.3,1
2,0,1,0,0,37731,1.5,0
3,0,1,0,0,43525,2.0,0
4,0,0,0,1,39891,2.2,0


<h2> Dividing into Target and Predictor Variables </h2>

### The approach used here is to split the available data into two sets

#### Training: we train our model on this subset
#### Testing: we use this subset to check the trained model's predictions on data it has not seen

In [None]:
# X holds the predictor variables.
# X is a DataFrame because it has more than one feature (column).
X = df.iloc[:, :-1]
# y holds the target variable.
# y is a pandas Series because it has only one column.
y = df.iloc[:, -1]

# X and y are kept as pandas objects here because their labeled display is easier to read.
# They can also be converted to NumPy arrays like this:
        # X = df.iloc[:, :-1].values
        # y = df.iloc[:, -1].values
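As a small sketch of the same split, the five rows shown by `df.head()` above can be rebuilt by hand and divided by column name instead of by position. The name `sample` and the name-based variant are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# The five rows printed by df.head() above, rebuilt by hand for illustration.
sample = pd.DataFrame({
    "Australia": [0, 0, 0, 0, 0],
    "Canada": [0, 1, 1, 1, 0],
    "Dubai": [1, 0, 0, 0, 0],
    "USA": [0, 0, 0, 0, 1],
    "Salary": [39343, 46205, 37731, 43525, 39891],
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Purchased": [0, 1, 0, 0, 0],
})

# Positional split, as in the cell above: every column but the last, then the last.
X_pos = sample.iloc[:, :-1]
y_pos = sample.iloc[:, -1]

# Equivalent split by column name, which keeps working if columns are reordered.
X_named = sample.drop(columns="Purchased")
y_named = sample["Purchased"]

print(type(X_pos).__name__, X_pos.shape)  # a DataFrame of predictor columns
print(type(y_pos).__name__, y_pos.shape)  # a Series with one value per row
```

Both variants produce the same objects here; the positional form relies on the target being the last column.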

<h2> Divide Data as Training Set and Testing Set </h2>

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=90)
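The following sketch shows what `test_size` and `random_state` do, using made-up data of the same size as the notebook's (30 rows). The names `X_demo`, `y_demo`, and the generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the notebook's data: 30 rows, 2 predictors, 1 binary target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Salary": rng.integers(35_000, 125_000, size=30),
    "YearsExperience": rng.uniform(1.0, 10.5, size=30).round(1),
})
y_demo = pd.Series(rng.integers(0, 2, size=30), name="Purchased")

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=90)
print(X_tr.shape, X_te.shape)  # 24 training rows and 6 test rows

# Calling again with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=90)
print(X_te.index.equals(X_te2.index))
```

The row labels of `X_tr` and `y_tr` stay aligned, which is why the outputs below show shuffled index values rather than 0, 1, 2, and so on.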

In [7]:
X_train

Unnamed: 0,Australia,Canada,Dubai,USA,Salary,YearsExperience
29,0,1,0,0,121872,10.5
6,0,1,0,0,60150,3.0
20,0,1,0,0,91738,6.8
23,1,0,0,0,113812,8.2
25,1,0,0,0,105582,9.0
14,1,0,0,0,61111,4.5
8,0,0,1,0,64445,3.2
22,1,0,0,0,101302,7.9
15,0,0,0,1,67938,4.9
21,1,0,0,0,98273,7.1


In [8]:
X_test

Unnamed: 0,Australia,Canada,Dubai,USA,Salary,YearsExperience
17,0,1,0,0,83088,5.3
24,0,0,1,0,109431,8.7
26,0,0,1,0,116969,9.5
2,0,1,0,0,37731,1.5
1,0,1,0,0,46205,1.3
12,0,0,1,0,56957,4.0


In [9]:
y_train

29    1
6     1
20    0
23    1
25    1
14    0
8     1
22    0
15    0
21    0
9     0
13    0
16    0
19    0
11    0
0     0
10    1
28    0
5     0
4     0
7     0
3     0
18    1
27    1
Name: Purchased, dtype: int64

In [10]:
y_test

17    0
24    0
26    0
2     0
1     1
12    1
Name: Purchased, dtype: int64
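With only 30 rows, the share of positive `Purchased` labels can differ between the training and test sets from one random split to another. One option, sketched below on made-up data, is the `stratify` parameter of `train_test_split`, which keeps the class proportions similar in both subsets. The notebook itself does not use it, and `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same size as the notebook's: 30 rows, 10 of them positive.
X_demo = pd.DataFrame({"YearsExperience": np.linspace(1.0, 10.5, 30)})
y_demo = pd.Series([1] * 10 + [0] * 20, name="Purchased")

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=90, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # share of positives in each subset
```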

# Feature Scaling

Feature scaling is a technique for bringing the independent features in the data onto a comparable scale. It is performed during data pre-processing to handle features whose magnitudes, values, or units vary widely. Without feature scaling, many machine learning algorithms (for example, those based on distances or gradient descent) tend to give more weight to features with larger numeric values and less weight to features with smaller ones, regardless of the units involved. The scaler used below, `StandardScaler`, standardizes each feature to zero mean and unit standard deviation.
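As a minimal sketch of what standardization computes, the snippet below rescales the `Salary` and `YearsExperience` values of the five rows shown earlier by hand, using z = (x - mean) / std per column, and compares the result with scikit-learn's `StandardScaler`. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Salary and YearsExperience for the five rows shown by df.head() earlier.
features = np.array([
    [39343, 1.1],
    [46205, 1.3],
    [37731, 1.5],
    [43525, 2.0],
    [39891, 2.2],
], dtype=float)

# Standardization by hand: subtract each column's mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0), as StandardScaler does.
manual = (features - features.mean(axis=0)) / features.std(axis=0)

# The same computation through scikit-learn.
scaled = StandardScaler().fit_transform(features)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns are expressed in standard deviations from their own mean, so a salary in the tens of thousands no longer dominates a value measured in years.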

### Importing Libraries

In [11]:
import pandas as pd
import numpy as np

### Importing Dataset

In [13]:
df = pd.read_csv("Dataset_03.csv")
df.head()

Unnamed: 0,Australia,Canada,Dubai,USA,Salary,YearsExperience,Purchased
0,0,0,1,0,39343,1.1,0
1,0,1,0,0,46205,1.3,1
2,0,1,0,0,37731,1.5,0
3,0,1,0,0,43525,2.0,0
4,0,0,0,1,39891,2.2,0


In [14]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

### Splitting Dataset

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Perform Feature Scaling

In [16]:
# Import StandardScaler.
from sklearn.preprocessing import StandardScaler
# Create an instance of StandardScaler.
sc = StandardScaler()
# Scale the numeric columns of X_train (Salary and YearsExperience) with fit_transform:
#   fit computes the mean and standard deviation of each column of X_train,
#   transform standardizes the values using that mean and standard deviation,
#   fit_transform does both in one call.
X_train.iloc[:, 4:] = sc.fit_transform(X_train.iloc[:, 4:])

# For the test set, use transform only: the mean and std were already learned from X_train.
# Fitting the scaler on the test data would let test-set statistics influence
# preprocessing, which leads to information leakage.
X_test.iloc[:, 4:] = sc.transform(X_test.iloc[:, 4:])
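Depending on the pandas version, writing scaled floats into a slice returned by `train_test_split`, or into an integer column such as `Salary`, can trigger warnings. The sketch below is one possible way to avoid that, assuming made-up data and hypothetical names (`X_demo`, `num_cols`, `sc_demo`): copy the splits, cast the numeric columns to float, then apply the same fit-on-train, transform-on-test pattern.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the notebook's predictors: two dummy columns plus two numeric ones.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "Canada": rng.integers(0, 2, size=30),
    "USA": rng.integers(0, 2, size=30),
    "Salary": rng.integers(35_000, 125_000, size=30),
    "YearsExperience": rng.uniform(1.0, 10.5, size=30).round(1),
})
y_demo = pd.Series(rng.integers(0, 2, size=30), name="Purchased")
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

num_cols = ["Salary", "YearsExperience"]  # columns to scale; dummy columns stay 0/1

# Copy the splits and cast the numeric columns to float before writing scaled values.
X_tr = X_tr.copy()
X_te = X_te.copy()
X_tr[num_cols] = X_tr[num_cols].astype(float)
X_te[num_cols] = X_te[num_cols].astype(float)

sc_demo = StandardScaler()
X_tr[num_cols] = sc_demo.fit_transform(X_tr[num_cols])  # learn mean/std from training rows
X_te[num_cols] = sc_demo.transform(X_te[num_cols])      # reuse them on the test rows

print(X_tr.head())
```

Selecting the columns by name rather than by position (`iloc[:, 4:]`) is a design choice of this sketch; it avoids scaling the wrong columns if the column order changes.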

### Check the Result

In [17]:
X_train

Unnamed: 0,Australia,Canada,Dubai,USA,Salary,YearsExperience
26,0,0,1,0,1.461305,1.39108
3,0,1,0,0,-1.067603,-1.058963
24,0,0,1,0,1.201748,1.129742
22,1,0,0,0,0.921841,0.868404
23,1,0,0,0,1.3526,0.966406
4,0,0,0,1,-1.192733,-0.993629
2,0,1,0,0,-1.267108,-1.222299
25,1,0,0,0,1.069215,1.227744
6,0,1,0,0,-0.495152,-0.732291
18,1,0,0,0,0.235279,0.215059


In [18]:
X_test

Unnamed: 0,Australia,Canada,Dubai,USA,Salary,YearsExperience
17,0,1,0,0,0.294676,0.019056
21,1,0,0,0,0.817543,0.607066
10,0,0,1,0,-0.389511,-0.438286
19,0,1,0,0,0.668344,0.247727
14,1,0,0,0,-0.462061,-0.242282
20,0,1,0,0,0.592523,0.509065
