In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Importing the dataset
dataset = pd.read_csv('Data.csv')

In [3]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [4]:
# Independent variables X: every column except the last (Country, Age, Salary)
X = dataset.iloc[:, :-1].values
# Dependent variable y: the last column (Purchased)
y = dataset.iloc[:, -1].values

In [6]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [7]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

The first problem we have to deal with is missing data, which happens quite often in real-life data sets. We need a way to handle it so that the machine learning model can run correctly.

Let's have a look at the data set again.

In [9]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


As you can see, there are two missing values.
One is in the `Age` column for ``Spain`` (row 6) and the other is in the `Salary` column for ``Germany`` (row 4).


The first idea is to remove the observations (rows) that contain missing data. Here that would mean removing row 4 and row 6, but that can be quite dangerous: imagine those rows contain crucial information.

So we need a better idea to handle this problem.
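To make the cost of dropping rows concrete, here is a minimal sketch using pandas `isna` and `dropna`. It rebuilds the ten rows shown above by hand (an assumption, so that it runs without `Data.csv`); the names `missing_per_column` and `dropped` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the ten rows shown above so the sketch runs without Data.csv
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

# Count the missing values in each column
missing_per_column = dataset.isna().sum()

# Drop every row that contains at least one missing value
dropped = dataset.dropna()

print(missing_per_column)
print(len(dataset), 'rows before,', len(dropped), 'rows after')
```

Two of the ten observations disappear, which is a large share of such a small data set.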

Another idea, and one of the most common ways to handle missing data, is to replace each missing value with the mean of its column.

Here we are going to replace the missing age with the mean of all the values in the `Age` column, and the missing salary with the mean of all the values in the `Salary` column.

The same applies to every feature that contains missing data.
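To see what the replacement value looks like, here is a small sketch that computes the mean by hand with pandas for the `Age` column only. The values are copied from the table above; `ages`, `age_mean` and `filled` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Age column copied from the data set above (row 6 is missing)
ages = pd.Series([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)

# Series.mean() skips NaN by default, so this is the mean of the 9 known ages
age_mean = ages.mean()

# Replace the missing entry with that mean
filled = ages.fillna(age_mean)

print(age_mean)
print(filled[6])
```

The mean is 349 / 9, roughly 38.78, and that is the value that ends up in row 6.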


We're not going to implement the mean function ourselves. We are going to use a library to do this job for us.
The module that we're going to use for this is `sklearn.impute`, part of scikit-learn, and from this module we are going to import the `SimpleImputer` class.



In [15]:
# Taking care of missing data
from sklearn.impute import SimpleImputer

In [22]:
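# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
# Learn the column means from the numeric columns (Age and Salary)
imputer = imputer.fit(X[:, 1:3])
# Fill the missing values in those columns with the learned means
X[:, 1:3] = imputer.transform(X[:, 1:3])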

In [23]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
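As a quick sanity check, the sketch below runs `SimpleImputer` on just the two numeric columns and inspects its `statistics_` attribute, which holds the per-column fill values learned by `fit`. The array `numeric` is typed in by hand from the table above (an assumption, so that it runs without `Data.csv`), and `fit_transform` is used as a shorthand for calling `fit` and then `transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the data set above
numeric = np.array([
    [44, 72000],
    [27, 48000],
    [30, 54000],
    [38, 61000],
    [40, np.nan],
    [35, 58000],
    [np.nan, 52000],
    [48, 79000],
    [50, 83000],
    [37, 67000],
], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform learns the column means and fills the NaNs in one call
filled = imputer.fit_transform(numeric)

# statistics_ holds the value used to fill each column
print(imputer.statistics_)
print(filled[4], filled[6])
```

The two statistics should match the values that appear in rows 4 and 6 of `X` above: about 38.78 for the age and about 63777.78 for the salary.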