In [66]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [67]:
# Importing the dataset
dataset = pd.read_csv('Data.csv')

In [68]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [69]:
# Build the matrix of independent variables X (every column except the last)
X = dataset.iloc[:, :-1].values
# Build the dependent variable vector y (the last column, Purchased)
y = dataset.iloc[:, -1].values

In [70]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [71]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

The first problem we have to deal with is missing data, which happens quite a lot in real-life data sets. We need a way to handle it so that the machine learning model can run correctly.

Let's have a look at the data set here.

In [72]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


As you can see, there are two missing values.
One is in the `Age` column for a ``Spain`` row (row 6) and one is in the `Salary` column for a ``Germany`` row (row 4).


The first idea is to remove the observations (the rows) that contain missing data. Here that would mean removing rows 4 and 6. But that can be quite dangerous, because those rows may contain crucial information.

So we need a better idea to handle this problem.
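
The row-removal idea can be sketched with pandas. This is only a minimal illustration: the frame `df` is hand-built here to mirror the table shown above (an assumption, since it does not read `Data.csv`), and it simply shows how many observations `dropna` would throw away.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (assumption: same values as Data.csv)
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Drop every row that has at least one missing value
df_complete = df.dropna()
print(len(df), 'rows before,', len(df_complete), 'rows after')

# The rows that would be lost
dropped = df[df.isna().any(axis=1)]
print(dropped)
```

On a data set this small, losing two rows out of ten already means losing a fifth of the observations.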

Another idea, and a very common one, is to replace each missing value with the mean of its column.

Here we are going to replace the missing value in the `Age` column by the mean of all the known values in that column, and do the same for `Salary`, the other feature that contains missing data.
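
To make the idea concrete, here is a small sketch that computes the column means by hand with NumPy. The arrays `age` and `salary` are typed in from the table above (an assumption, they are not read from `Data.csv`), and `np.nanmean` is used because it skips the missing entries when averaging.

```python
import numpy as np

# Values copied from the Age and Salary columns shown above
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# nanmean averages only the known values and ignores NaN
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Put the column mean wherever a value is missing
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two means are 349/9 (about 38.78) for the age and 574000/9 (about 63777.78) for the salary.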


We're not going to implement the mean computation ourselves. We are going to use a library to do this job for us.
The library is scikit-learn: from its `sklearn.impute` module we are going to import the `SimpleImputer` class.



In [73]:
# Taking care of missing data
from sklearn.impute import SimpleImputer

In [74]:
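# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# Learn the column means from the Age and Salary columns (indices 1 and 2)
imputer = imputer.fit(X[:, 1:3])
# Fill in the missing values
X[:, 1:3] = imputer.transform(X[:, 1:3])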

In [75]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
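
If you want to check which values the imputer used, a fitted `SimpleImputer` keeps the per-column fill values in its `statistics_` attribute. The sketch below is self-contained: the array `numeric` is typed in from the Age and Salary columns above (an assumption, it is not the notebook's `X`).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the table above
numeric = np.array([[44, 72000], [27, 48000], [30, 54000], [38, 61000],
                    [40, np.nan], [35, 58000], [np.nan, 52000],
                    [48, 79000], [50, 83000], [37, 67000]], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform learns the column means and fills the gaps in one call
filled = imputer.fit_transform(numeric)

# One fill value per column: the mean age and the mean salary
print(imputer.statistics_)
```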

Now we're going to learn how to deal with categorical data. But first we will discuss why we need to do this.
In this dataset we have two categorical variables: the `Country` variable and the `Purchased` variable. They are called categorical variables simply because they contain categories! :)
`Country` contains three categories: `France`, `Spain` and `Germany`.
`Purchased` contains two categories: `Yes` and `No`.

Since machine learning models are based on mathematical equations, you can intuitively understand that keeping text in the equations would cause problems, because we only want numbers there. That's why we need to encode the categorical variables, that is, turn the text into numbers.

We're going to encode these two variables, `Country` and `Purchased`.

In [76]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode the country names as integers (France=0, Germany=1, Spain=2)
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

We replaced the text by numbers so that we can include the numbers in the equations.
However, since one is greater than zero and two is greater than one, the equations in the model will think that Spain has a higher value than Germany and France, and that Germany has a higher value than France. That's not the case. These are three categories and there is no order between them.
We cannot compare France, Spain and Germany by saying that Spain is greater than Germany or Germany is greater than France. This wouldn't make any sense.
If we had, for example, a variable size with values like small, medium and large, then yes, we could express an order between the values, because large is greater than medium and medium is greater than small.
So we have to prevent the machine learning equations from thinking that Germany is greater than France and Spain is greater than Germany. To prevent this we're going to use what are called dummy variables.
That means that instead of having one column here, we are going to have a number of columns equal to the number of categories. Since we have three categories here (France, Spain and Germany), we're going to have three columns, each corresponding to one country: the France column, the Spain column and the Germany column. In each column there's going to be either one or zero.
For example, in the France column the value is one if the country is France and zero if the country is not France.
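
Before using a library for this, here is what dummy variables look like when built by hand. This is a plain-Python sketch, not the encoder's actual implementation; the list `countries` is copied from the Country column above and the columns are put in alphabetical order (France, Germany, Spain).

```python
# Country column copied from the table above
countries = ['France', 'Spain', 'Germany', 'Spain', 'Germany',
             'France', 'Spain', 'France', 'Germany', 'France']

# One dummy column per category, in alphabetical order
categories = sorted(set(countries))

# For each row: 1 in the column of its own country, 0 everywhere else
dummies = [[1 if country == category else 0 for category in categories]
           for country in countries]

print(categories)
for country, row in zip(countries, dummies):
    print(country, row)
```

Every row contains exactly one 1, so no country is "greater" than another.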

In [77]:
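# Note: the categorical_features argument was removed in scikit-learn 0.22,
# so this cell only runs on older versions.
# One-hot encode column 0 (Country); Age and Salary are kept as they are
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
print(X)
# Show the same array as a DataFrame
ohe_df = pd.DataFrame(X)
ohe_df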

[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]]




Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,44.0,72000.0
1,0.0,0.0,1.0,27.0,48000.0
2,0.0,1.0,0.0,30.0,54000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,1.0,0.0,40.0,63777.777778
5,1.0,0.0,0.0,35.0,58000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,1.0,0.0,0.0,48.0,79000.0
8,0.0,1.0,0.0,50.0,83000.0
9,1.0,0.0,0.0,37.0,67000.0
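
The `categorical_features` argument of `OneHotEncoder` was removed in scikit-learn 0.22, so on a recent version the cell above raises an error. One way to get the same layout is `ColumnTransformer`. The sketch below is an assumption about how you might adapt the step, not the original notebook's code; `X_demo` is a made-up three-row sample and `'country'` is just a label chosen here for the transformer.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small sample in the same layout as X: Country, Age, Salary
X_demo = np.array([['France', 44.0, 72000.0],
                   ['Spain', 27.0, 48000.0],
                   ['Germany', 30.0, 54000.0]], dtype=object)

# One-hot encode column 0 and pass the other columns through untouched.
# sparse_threshold=0 asks for a dense result in every case.
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0)

# The encoded columns come first, followed by Age and Salary
X_encoded = np.array(ct.fit_transform(X_demo), dtype=float)
print(X_encoded)
```

With this approach the `LabelEncoder` step on the country column is no longer needed, because `OneHotEncoder` accepts the text values directly.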


In [90]:
# Encode the Purchased column as integers (No=0, Yes=1)
labelencoder_Y = LabelEncoder()
y = labelencoder_Y.fit_transform(y)
# Keep the original text labels for reference
y_label = dataset.iloc[:, 3].values
y_label

In [None]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)
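
If you later need to go back from the numbers to the text, the fitted `LabelEncoder` remembers the mapping. The sketch below is self-contained and uses an array `purchased` typed in from the Purchased column above (an assumption, it is not the notebook's `y`).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Purchased column copied from the table above
purchased = np.array(['No', 'Yes', 'No', 'No', 'Yes',
                      'Yes', 'No', 'Yes', 'No', 'Yes'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(purchased)

# classes_ lists the categories; the position of each one is its code
print(encoder.classes_)

# inverse_transform maps the codes back to the original text
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Here a label encoding is fine for `Purchased`, because with only two categories the codes 0 and 1 do not introduce a misleading order.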