<img src="images/picture1.png" alt="Drawing" style="width: 1000px;"/>

<a id='toc'></a>

### Table of Contents
* [0. Identify Business needs](#business)<br>
* [1. Import Data](#import) <br>
    * [1.1. Import the needed libraries](#lib)<br>
    * [1.2. Import and integrate data](#integrate)<br>
    * [1.3. Set index](#index)<br>
    * [1.4. Check for duplicates](#duplicates)<br>
* [2. Explore Data](#explore) <br>
    * [2.1. Basic Exploration](#basic)<br>
    * [2.2. Statistical Exploration](#stats)<br>
        * [2.2.1. Numerical Variables](#stats_num)<br>
        * [2.2.2. Categorical Variables](#stats_cat)<br>
    * [2.3. Visual Exploration](#visual)<br>
        * [2.3.1. Numerical Variables](#visual_num)<br>
        * [2.3.2. Categorical Variables](#visual_cat)<br>
    * [2.4. In-depth Exploration](#depth)<br>
* [3. Preprocess Data](#preprocess) <br>
    * [3.1. Data Cleaning](#clean)<br>
        * [3.1.1. Outliers](#outliers)<br>
        * [3.1.2. Missing Values](#missing)<br>
    * [3.2. Data Transformation](#transform)<br>
        * [3.2.1. Create new variables](#new)<br>
        * [3.2.2. Misclassifications](#misc)<br>
        * [3.2.3. Incoherencies](#inco)<br>
        * [3.2.4. Binning](#bin)<br>
        * [3.2.5. Reclassify](#rec)<br>
        * [3.2.6. Power Transform](#power)<br>
        * [3.2.7. Apply ordinal encoding and create Dummy variables](#dummy)<br>
        * [3.2.8. Scaling](#scale)<br>
    * [3.3. Data Reduction](#reduce)<br>
        * [3.3.1. Multicollinearity - Check correlation](#corr)<br>
        * [3.3.2. Unary Variables](#unary)<br>
        * [3.3.3. Variables with a high percentage of missing values](#na)<br>

<img src="images/process_ML.png" alt="Drawing" style="width: 1000px;"/>

<div class="alert alert-block alert-success">
<a id='business'>
<font color = '#006400'> 
    
# 0. Identify Business needs </font>
</a>
    
</div>

First of all, we need to clearly identify the business needs.

• TugasRWe is a Portuguese retailer offering an assortment of goods within 5 major categories: clothes, housekeeping,
kitchen, small appliances and toys. <br><br>
• Tugas started a loyalty program 2 years ago. Among other objectives, the program aims to gather customer information
to better drive the marketing efforts. <br><br>
• There is enough historical information to start producing sound knowledge about their customer database. IT extracted two files (at customer level) to be used by the analytical team. <br>

__Demographic.xlsx__

| Attribute | Description | 
| --- | --- |
| Custid | Unique identification of the customer |
| Year_Birth | Customer Year of Birth |
| Gender | Customer Gender |
| Education | Customer Education |
| Marital_Status | Customer Marital Status |
| Dependents | Dependents (Yes = 1) |
| Income | Customer Household Income |
| Country | Customer's Country |
| City | Customer's City |


__Firmographic.csv__

| Attribute | Description | 
| --- | --- |
| Custid | Unique identification of the customer |
| Rcn | Recency in days |
| Frq | Total Number of Purchases |
| Mnt | Total Amount spent on Purchases |
| Clothes | % Amount spent on clothes |
| Kitchen | % Amount spent on kitchen products |
| SmallAppliances | % Amount spent on small appliances |
| HouseKeeping | % Amount spent on housekeeping products |
| Toys | % Amount spent on toys |
| NetPurchase | % Purchases through the net channel |
| StorePurchase | % Purchases through the store |
| Recomendation | Recommendation [1-5] |
| Credit_Card | Information about Customer Credit Card - Flag variable |

[BACK TO TOC](#toc)
    
<div class="alert alert-block alert-success">
<a id='import'>
<font color = '#006400'> 
    
# 1. Import Data </font>
</a>
    
</div>


<div class="alert alert-block alert-warning">

<a id='lib'></a>

## 1.1. Import the needed libraries
    
</div>

__`Step 1`__ Import the following libraries/functions: <br>
- pandas as pd <br>
    - <font color=#7a8a7c>_Pandas is a Python library for data manipulation and analysis, providing easy-to-use data structures and data analysis tools_</font>
- numpy as np <br>
    - <font color=#7a8a7c>_NumPy is a Python library for numerical computing that provides efficient arrays and matrices operations, as well as mathematical functions for arrays._</font>
- pyplot from matplotlib as plt <br>
    - <font color=#7a8a7c>_Matplotlib is a Python library for creating high-quality visualizations, including line plots, scatter plots, bar plots, and more, with extensive customization options._</font>
- seaborn as sns<br>
    - <font color=#7a8a7c>_Seaborn is a Python library for data visualization based on Matplotlib, providing additional high-level interface for creating informative statistical graphics with ease._</font>
    
    
We are also going to import some tools from sklearn: <br>
<font color=#7a8a7c>_Scikit-learn (sklearn) is a Python library for machine learning, providing a wide range of supervised and unsupervised learning algorithms, as well as tools for data preprocessing, model selection, and evaluation._</font>
- MinMaxScaler from sklearn.preprocessing<br>
- KNNImputer from sklearn.impute<br>

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

<div class="alert alert-block alert-warning">

<a id='integrate'></a>

## 1.2. Import and integrate data
    
</div>

__`Step 2`__ Import the excel file `demographic.xlsx` and store it in the object `demo` <br>

https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

__`Step 3`__ Import the csv file `firmographic.csv` and store it in the object `firmo`<br>

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

In [1]:
# if your variables are not separated by a ',', you can define the delimiter used by defining the parameter sep
# for example, in case of ';', you should use: firmo = pd.read_csv('firmographic.csv', sep = ';') 

__`Step 4`__ Merge the data from the two previous files and store it in the object `df`. By default, the merge uses the method "inner join". <br>

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

<font color=#7a8a7c>_The merge method in pandas combines two dataframes into a single dataframe, based on common columns or indices, using various types of joins._</font>
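
A minimal sketch of the inner join in Step 4, using two tiny in-memory dataframes as hypothetical stand-ins for `demo` and `firmo` (in the notebook these come from `pd.read_excel` and `pd.read_csv`). The values are invented for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical values, not the real data)
demo_toy = pd.DataFrame({"Custid": [1, 2, 3], "Gender": ["F", "M", "F"]})
firmo_toy = pd.DataFrame({"Custid": [2, 3, 4], "Mnt": [120, 45, 300]})

# how="inner" is the default: only Custid values present in BOTH frames are kept
df_toy = demo_toy.merge(firmo_toy, on="Custid")
print(df_toy)
```

Customers 1 and 4 are dropped because each appears in only one of the two frames. Passing `how="left"` or `how="outer"` would keep them, with missing values in the unmatched columns.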

<div class="alert alert-block alert-warning">

<a id='index'></a>

## 1.3. Set Index 
    
</div>

__`Step 5`__ Define the variable "Custid" as the index of the dataframe using the method `set_index()`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html

<div class="alert alert-block alert-warning">

<a id='duplicates'></a>

## 1.4. Check for duplicates
    
</div>

__`Step 6`__ Check for duplicated rows with `duplicated()` and drop any duplicate rows present in the dataframe with the method `drop_duplicates()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html <br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
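
Steps 5 and 6 could look like the sketch below, shown on a small invented dataframe. Note that `duplicated()` compares column values only and ignores the index, so after `set_index()` two different customers with identical values in every column would also be flagged.

```python
import pandas as pd

# Hypothetical data: customer 11 appears twice with identical values
toy = pd.DataFrame({"Custid": [10, 11, 11], "Frq": [5, 8, 8], "Mnt": [100, 250, 250]})

# Step 5: use Custid as the index
toy = toy.set_index("Custid")

# Step 6: count the duplicated rows, then keep the first occurrence of each
n_dup = toy.duplicated().sum()
toy = toy.drop_duplicates()
print(n_dup, len(toy))
```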

[BACK TO TOC](#toc)
    
<div class="alert alert-block alert-success">
<a id='explore'>
<font color = '#006400'> 
    
# 2. Explore Data </font>
</a>
    
</div>

<div class="alert alert-block alert-warning">

<a id='basic'></a>

## 2.1. Basic Exploration
    
</div>

__`Step 7`__ Check the number of rows and columns in the dataset using the attribute `shape`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html

Our dataset contains 2500 rows and 20 columns.

__`Step 8`__ Check the name of the columns of our dataset using the attribute `columns`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html

__`Step 9`__ Check the first three rows of the dataset using the method `head()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

In the same way, the method `tail()` returns the last rows of the dataset.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html

__`Step 10`__ Get more information about the dataset by calling the method `info()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html

We can verify that we are working with:
- 3 float variables
- 11 integer variables
- 6 object variables

We can also check that some of the variables have missing values. We are going to deal with this in a further step.

<div class="alert alert-block alert-warning">

<a id='stats'></a>

## 2.2. Statistical Exploration
    
</div>

<div class="alert alert-block alert-info">
    
<a id='stats_num'></a>

### 2.2.1. Numerical Variables
    
</div>

__`Step 11`__ Get the main descriptive statistics for all the numeric variables using the method `describe()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

From the above table, we can draw some conclusions. Some examples are:
- `count` - The Income variable has 2431 valid values, so we have missing values here.
- `mean` - On average, our customers spent 654 monetary units in the store.
- `std` - The standard deviation of Income is quite high, which indicates that the values are spread out over a wide range.
- `min` - All the customers have bought in the store at least 3 times.
- `50%` - Half of our customers spent up to 402 monetary units in the store.
- `max` - The maximum value for Recomendation is 6. This is an incoherence: according to the business needs, the range is between 1 and 5.

The `describe()` method provides the main descriptive statistics: count, mean, standard deviation, minimum value, 25th percentile, 50th percentile (median), 75th percentile and maximum value. 
<br>
However, you can also call other measures directly, such as the skewness or the kurtosis.

__`Step 12`__ Get the skewness associated with each variable in the dataset using the method `skew()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.skew.html

Concerning the variables' skewness, we can conclude the following:
- `Moderate skewness (between |0.5| and |1.0|)`: Dependents, Frq, Mnt
- `High skewness (higher than |1.0|)`: Rcn, Kitchen, HouseKeeping, Toys

__`Step 13`__ Get the kurtosis associated with the variables using the method `kurt()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.kurt.html

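High kurtosis in a data set is an indicator that the data has heavy tails or outliers. A normal distribution has a kurtosis of 3, which corresponds to an excess kurtosis of 0. Note that `kurt()` in pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores 0 there, and clearly higher values could indicate the presence of outliers. <br>

We need to investigate the presence of possible outliers in the following variables: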
- Rcn
- Kitchen
- HouseKeeping
- Toys
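
As an illustration of Steps 12 and 13, the sketch below compares a symmetric column with a long-tailed one. The column names and values are invented for the example. Since `kurt()` uses Fisher's definition, the symmetric, flat column gets a negative value rather than something close to 3.

```python
import pandas as pd

# Hypothetical data: one symmetric column and one with a single extreme value
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "long_tail": [1, 1, 2, 2, 3, 3, 60],
})

skewness = toy.skew()   # close to 0 for the symmetric column
kurtosis = toy.kurt()   # excess kurtosis: a normal distribution would score 0

print(skewness)
print(kurtosis)
```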

<div class="alert alert-block alert-info">
    
<a id='stats_cat'></a>

### 2.2.2. Categorical Variables
    
</div>

__`Step 14`__ Get the main descriptive statistics for all the categorical variables using the method `describe(include = ['O'])`

We can verify some problems or issues that we need to address before applying a model:
- We have 3 possible values for Gender, 6 for Education and 9 for Marital Status: we need to check further if all those values are acceptable;
- The variable country is unary - we only have one possible value;
- The variable City also has only one possible value, and only 73 rows are filled. Maybe we should drop this variable.
- The variable Credit_Card has only 67 values filled out of 2500. 

__`Step 15`__ Check the levels/possible values in the variables "Gender", "Education", "Marital_Status", "Country", "City" and "Credit_Card" using the method `value_counts()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html
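
A small sketch of Step 15 on an invented `Gender` column. By default `value_counts()` leaves missing values out, so passing `dropna=False` is one way to see invalid levels and missing values in the same table.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one invalid level ("?") and one missing value
gender = pd.Series(["F", "M", "F", "?", np.nan, "M", "F"], name="Gender")

# dropna=False also counts the missing values
counts = gender.value_counts(dropna=False)
print(counts)
```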

__`Gender`__

We have 5 observations with a value of "?". We need to change this value, since it is not a valid value.

__`Education`__

We have some problems in this variable: <br>
- OldSchool is not a valid value;<br>
- It is not clear what 2n Cycle and Basic Education mean. <br>
      
We also need to make some transformations in this variable.

__`Marital Status`__

It seems that this variable also needs some transformations: <br>
- Together, Married and Divorced are sometimes written in capital letters;<br>
- BigConfusion is not a valid value;<br>
- Does it make sense to keep Together and Married as two different levels?

__`Country`__

In Country, there are no missing values, but the value is always the same. It does not make sense to keep this variable for the modelling phase.

__`City`__

In city, we have a significant quantity of values that are missing, and the city is always the same. Assuming that there is no reason behind the missingness on this variable, we are going to delete this variable in a further step.

__`Credit Card`__

In the variable Credit Card, we also have a significant number of missing values. Assuming that there is a reason behind the missingness, we are going to fill all the missing values with a constant.

<div class="alert alert-block alert-warning">

<a id='visual'></a>

## 2.3. Visual Exploration
    
</div>    

<div class="alert alert-block alert-info">
    
<a id='visual_num'></a>

### 2.3.1. Numerical Variables
    
</div>

__`Step 16`__ Check the distribution of the variable 'Mnt' using a `histplot()`. Define the color as green and the number of bins equal to 10.

https://seaborn.pydata.org/generated/seaborn.histplot.html

__`Step 17`__ Using `seaborn`, create a `scatterplot` where the x axis represents the Income and the y axis represents the Mnt spent by each customer.

https://seaborn.pydata.org/generated/seaborn.scatterplot.html

We can clearly see that the higher the Income, the higher the amount spent in our store.

__`Step 17.B`__ This time, create a scatterplot similar to the previous one, but with the following changes:
- Define a figure with size equal to (12,8), composed of two axes.
- The first axes will contain a scatterplot similar to the one in the previous step.
- The second axes will display a scatterplot where the x axis represents the Income and the y axis represents the Mnt. Use the parameter hue to represent a third variable, the Recomendation.
- Define the lower limit of y as -200
- Define the lower limit of x as 0
- Define the ticks of the x axis between 0 and 160000, in steps of 30000
- Define the title of the plot as "Income vs Monetary vs Recomendation", with a fontsize of 16 and a blue color
- The legend of the plot should be in the upper left area and the title of the legend should be "Recomendation"
- Define the label of the x axis as "Customer's Income"
- Remove the top and right spines of the plot
- Save the figure as "my_plot.png", with a resolution of 300 dots per inch and with no background.

In [None]:
# Define a figure of size (12,8) with two axes side by side
fig, (ax1, ax2) = plt.subplots(1,2, figsize = (12,8))

# create first plot without changes, similar to step 17
sns.scatterplot(ax = ax1, data = df, x = 'Income', y = 'Mnt')

# create second plot with changes
sns.scatterplot(ax = ax2, data = df, x = 'Income', y = 'Mnt', hue = 'Recomendation')

# the pyplot functions below act on the current axes, here ax2 (the last one created)

# define the lower limit of the y axis using matplotlib.pyplot.ylim (upper limit stays automatic)
plt.ylim(-200,None)
# define the lower limit of the x axis using matplotlib.pyplot.xlim (upper limit stays automatic)
plt.xlim(0,None)

# define the ticks in x axis using matplotlib.pyplot.xticks
# np.arange(start, stop, step) - Return evenly spaced values within a given interval (stop excluded).
plt.xticks(np.arange(0,160000,30000))

# define the title using matplotlib.pyplot.title, with fontsize 16 and blue color as requested
plt.title('Income vs Monetary vs Recomendation', fontsize= 16, color = 'blue')

# define the legend using matplotlib.pyplot.legend
plt.legend(loc = 'upper left', title = 'Recomendation', frameon = False)

# define the labels for the x and y axes using matplotlib.pyplot.xlabel / ylabel
plt.xlabel("Customer's Income")
plt.ylabel("Monetary spent")

# Remove the top and right spines of the plot
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)

# Save the figure as "my_plot.png", with a resolution of 300 dots per inch and with no background.
plt.savefig('my_plot.png', dpi = 300, transparent = True)

__`Step 18`__ Plot the pairwise relationships of the variables "Clothes", "Toys" and "HouseKeeping" using a `pairplot`

https://seaborn.pydata.org/generated/seaborn.pairplot.html

__`Step 19`__ Check the spearman correlation between numerical variables using the method `corr(method = 'spearman')`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html <br>
https://seaborn.pydata.org/generated/seaborn.heatmap.html

We can verify that NetPurchase and StorePurchase have a perfect negative correlation, so we don't need both variables. We are going to remove one of them in __`Step 41`__.
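
The sketch below shows why two complementary percentages produce a Spearman correlation of -1. It assumes, for illustration only, that the two shares always add up to 100, and the values are invented.

```python
import pandas as pd

# Hypothetical shares: the store share is assumed to be 100 minus the net share
toy = pd.DataFrame({"NetPurchase": [10, 35, 60, 80], "Mnt": [900, 300, 450, 120]})
toy["StorePurchase"] = 100 - toy["NetPurchase"]

# Spearman correlation works on ranks, so any strictly decreasing relation gives -1
corr = toy.corr(method="spearman")
print(corr.loc["NetPurchase", "StorePurchase"])
```

The resulting matrix can be passed directly to `sns.heatmap(corr, annot=True)` to get the visual version used in this step.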

<div class="alert alert-block alert-info">

<a id='visual_cat'></a>
    
### 2.3.2. Categorical Variables

</div>

__`Step 20`__ Show the counts of observations in each categorical bin using bars for the variable "Marital_Status" using a `countplot()`.
Define the hue as "Gender". Show only the counting for Single, Divorced, Widow, Married and Together in this order.

https://seaborn.pydata.org/generated/seaborn.countplot.html

__`Step 21`__ Draw a scatterplot between Income (numerical variable) and Education (categorical variable) using the `stripplot()`

https://seaborn.pydata.org/generated/seaborn.stripplot.html

We cannot see significant differences in Income depending on the Education level.

<div class="alert alert-block alert-warning">

<a id='depth'></a>

## 2.4. In-depth Exploration
    
</div>

We can go further and try to understand better our population of study using the methods `groupby()` and `query()`

__`Step 22`__ What is the mean value of `Mnt` when `Dependents` is equal to 0? And when it is equal to 1? Use `groupby()`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html <br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html

__`Step 23`__ What is the median value of `Mnt` spent by female customers when `Dependents` is equal to 0? And when it is equal to 1? Use `query()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html
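
One possible shape for Steps 22 and 23, on an invented dataframe. The gender code `'F'` is an assumption about how female customers are labelled in the real data.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "F", "M", "F"],
    "Dependents": [0, 1, 0, 0, 1, 1],
    "Mnt": [900, 200, 700, 1100, 150, 300],
})

# Step 22: mean Mnt for each value of Dependents
mean_by_dep = toy.groupby("Dependents")["Mnt"].mean()

# Step 23: median Mnt of female customers, filtered with query()
median_f_dep0 = toy.query("Gender == 'F' and Dependents == 0")["Mnt"].median()
median_f_dep1 = toy.query("Gender == 'F' and Dependents == 1")["Mnt"].median()

print(mean_by_dep)
print(median_f_dep0, median_f_dep1)
```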

[BACK TO TOC](#toc)
    
<div class="alert alert-block alert-success">
<a id='preprocess'>
<font color = '#006400'> 
    
# 3. Preprocess Data </font>
</a>
    
</div>

<div class="alert alert-block alert-warning">

<a id='clean'></a>

## 3.1. Data Cleaning
    
</div>

<div class="alert alert-block alert-info">

<a id='outliers'></a>

### 3.1.1. Outliers
    
</div>

In __Step 13__ we saw that the variables "Rcn", "Kitchen", "HouseKeeping" and "Toys" could have potential outliers, due to their high kurtosis. In the following steps we are going to investigate this further.

__`Step 24`__ Create a figure with four axes, where the boxplots of the variables "Rcn", "Kitchen", "HouseKeeping" and "Toys" are shown. Use the `boxplot()` from seaborn. 

https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize = (14,6))
sns.boxplot(ax = ax1, data = df, x = 'Rcn')
sns.boxplot(ax = ax2, data = df, x = 'Kitchen')
sns.boxplot(ax = ax3, data = df, x = 'HouseKeeping')
sns.boxplot(ax = ax4, data = df, x = 'Toys')

__`Step 25`__ Create a figure with two axes, where the histplots of the variables "Rcn" and "Kitchen" are shown. Use the `histplot()` from seaborn. 

__`Step 26`__ Remove the observations where 

- Kitchen is higher than 50 or 
- Toys is higher than 50 <br> 

using the method `drop()`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
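
A sketch of Step 26 on invented values: first collect the index labels of the rows that break either rule, then pass them to `drop()`.

```python
import pandas as pd

# Hypothetical data indexed by Custid
toy = pd.DataFrame(
    {"Kitchen": [5, 60, 12, 8], "Toys": [3, 4, 55, 9]},
    index=pd.Index([101, 102, 103, 104], name="Custid"),
)

# Index labels of the rows where Kitchen > 50 OR Toys > 50
to_drop = toy[(toy["Kitchen"] > 50) | (toy["Toys"] > 50)].index

# drop() removes rows by index label
toy = toy.drop(index=to_drop)
print(toy)
```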

<div class="alert alert-block alert-info">

<a id='missing'></a>

### 3.1.2. Missing Values
    
</div>

__`Step 27`__ Check how many missing values you have in the dataset using `isna().sum()`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html <br>
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html

#### 3.1.2.1 Fill with constant

__`Step 28`__ Fill the missing values in `Credit_Card` with the constant "Missing" using the method `fillna()`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

Confirm, using the method `value_counts()`, that the variable `Credit_Card` has no more missing values.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html

#### 3.1.2.2 Fill with mean / median / mode

__`Step 29`__ Fill the missing values in Marital Status and Education with the mode and in Income with median using `fillna()`

Check that these 3 variables have no missing values anymore using `isna().sum()`
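
Steps 28 and 29 could be sketched as follows. The data is invented, and the `"Yes"` flag used for `Credit_Card` is a hypothetical value, since the actual levels of that variable are not listed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in each column
toy = pd.DataFrame({
    "Credit_Card": ["Yes", np.nan, np.nan, "Yes"],
    "Marital_Status": ["Married", np.nan, "Single", "Married"],
    "Income": [30000.0, np.nan, 50000.0, 90000.0],
})

# Step 28: fill with a constant
toy["Credit_Card"] = toy["Credit_Card"].fillna("Missing")

# Step 29: mode()[0] is the most frequent value; median() ignores missing values
toy["Marital_Status"] = toy["Marital_Status"].fillna(toy["Marital_Status"].mode()[0])
toy["Income"] = toy["Income"].fillna(toy["Income"].median())

print(toy.isna().sum())
```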

#### 3.1.2.3 Fill with KNNImputer

__`Step 30`__ Use a predictive model to fill the missing values in Clothes. You can use the variables Kitchen, SmallAppliances, HouseKeeping and Toys, that have a correlation of -0.7 with clothes, to fill the missing values with KNNImputer.

In [133]:
df_products = df[['Clothes','Kitchen','SmallAppliances','HouseKeeping','Toys']]

imputer = KNNImputer(n_neighbors=1)
array_impute = imputer.fit_transform(df_products) # this is an array
df_products = pd.DataFrame(array_impute, columns = df_products.columns)

In [None]:
df['Clothes'] = df_products['Clothes'].values
df.info()

[BACK TO TOC](#toc)

<div class="alert alert-block alert-warning">

<a id='transform'></a>

## 3.2. Data Transformation
    
</div>

<div class="alert alert-block alert-info">

<a id='new'></a>

### 3.2.1. Create new variables
    
</div>

__`Step 31`__ Create the variable "Age" from the "Year_Birth". Tip: check the method `date.today()`

In [None]:
from datetime import date

__`Step 32`__ Create a new variable whose purpose is to understand how much money a customer spends in the store, on average, on each purchase.
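
A minimal sketch of Steps 31 and 32 on invented values. The column name `Mnt_per_purchase` is a hypothetical choice, and the age is approximated from the birth year only.

```python
from datetime import date

import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"Year_Birth": [1980, 1995], "Frq": [10, 4], "Mnt": [500, 100]})

# Step 31: approximate age as current year minus year of birth
toy["Age"] = date.today().year - toy["Year_Birth"]

# Step 32: average amount spent per purchase (hypothetical column name)
toy["Mnt_per_purchase"] = toy["Mnt"] / toy["Frq"]
print(toy)
```

Rows where `Frq` is 0 would produce infinite or missing values in the ratio, so it is worth checking the frequency values before relying on this variable.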

<div class="alert alert-block alert-info">

<a id='misc'></a>

### 3.2.2. Misclassifications
    
</div>

__`Step 33`__ Review the counting for possible values in the Gender variable using `value_counts()`

__`Step 33.B`__ Replace the "?" with the most frequent value using `mode()[0]`, which is going to return the most frequent value.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

__`Step 34`__ Review the counting for possible values in the `Marital_Status` variable with `value_counts()`

__`Step 34.B`__ Change "TOGETHER" to "Together" and do the same (Capitalize the words) for "DIVORCED" and "MARRIED" using `str.capitalize()`

__`Step 34.C`__ Replace the "BigConfusion" with the most frequent value. Apply the same procedure as in `Gender`

__`Step 35`__ Review the counting for possible values in the `Education` variable.

__`Step 35.B`__ Replace the "OldSchool" with the most frequent value with `mode()[0]`. Apply the same procedure as in `Gender` and `Marital_Status`
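
The replacements in Steps 33.B to 35.B can be sketched as below, on invented data. One detail worth noting: `str.capitalize()` lowercases everything after the first letter, so if it is applied to the whole column, "BigConfusion" becomes "Bigconfusion" and that is the string to replace afterwards.

```python
import pandas as pd

# Hypothetical data containing the invalid levels mentioned above
toy = pd.DataFrame({
    "Gender": ["F", "M", "?", "F"],
    "Marital_Status": ["MARRIED", "Married", "BigConfusion", "TOGETHER"],
})

# Replace "?" with the most frequent value
toy["Gender"] = toy["Gender"].replace("?", toy["Gender"].mode()[0])

# Capitalize first, so that "MARRIED" and "Married" count as the same level
toy["Marital_Status"] = toy["Marital_Status"].str.capitalize()

# After capitalize() the invalid level reads "Bigconfusion"
most_frequent = toy["Marital_Status"].mode()[0]
toy["Marital_Status"] = toy["Marital_Status"].replace("Bigconfusion", most_frequent)
print(toy)
```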

<div class="alert alert-block alert-info">

<a id='inco'></a>

### 3.2.3. Incoherencies
    
</div>

__`Step 36`__ Check for possible incoherencies in your data. One impossible situation is a frequency equal to 0 when the customer has spent some money. Are there any such incoherencies? If yes, change those values of Frq to 1.
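
One way to approach Step 36 is a boolean mask combined with `loc`, as in this sketch on invented values.

```python
import pandas as pd

# Hypothetical data: the first row has money spent but zero purchases
toy = pd.DataFrame({"Frq": [0, 3, 0], "Mnt": [40, 200, 0]})

# Rows where no purchase is recorded although some money was spent
mask = (toy["Frq"] == 0) & (toy["Mnt"] > 0)
print(mask.sum())

# Fix only those rows
toy.loc[mask, "Frq"] = 1
```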
    


<div class="alert alert-block alert-info">

<a id='bin'></a>

### 3.2.4. Binning
    
</div>

__`Step 37`__ Create a new variable named "Income_bins" where Income is represented by three possible values: "Low", "Medium" and "High". By using the method `cut()` with a number of bins, these are going to be equal-width bins.

https://pandas.pydata.org/docs/reference/api/pandas.cut.html
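
A sketch of equal-width binning with `pd.cut()` on invented incomes. With `bins=3`, the range between the minimum and maximum is split into three intervals of the same width, regardless of how many customers fall in each one.

```python
import pandas as pd

# Hypothetical incomes ranging from 10000 to 100000 (bin width 30000)
income = pd.Series([10000, 35000, 55000, 75000, 100000], name="Income")

# Three equal-width bins, labelled from lowest to highest
income_bins = pd.cut(income, bins=3, labels=["Low", "Medium", "High"])
print(income_bins)
```

The result has the `category` dtype. If bins with a similar number of customers were preferred, `pd.qcut()` would be the alternative to explore.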

<div class="alert alert-block alert-info">

<a id='rec'></a>

### 3.2.5. Reclassify
    
</div>

__`Step 38`__ Due to the similarity of the classification, change the value "Together" to "Married" in Marital_Status using the method `replace()`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

__`Step 39`__ Since we are not sure what 2nd Cycle and Basic School mean, we are going to create a new binary variable called `Higher_Educ`, where we assign the value 1 if the customer has higher education and 0 otherwise. One option is to explore the numpy function `np.where()`.
Remove the variable `Education`.
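
A possible sketch of Step 39. The education labels below, and the choice of which ones count as higher education, are assumptions made for the example and should be adapted to the levels actually found in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical education levels
toy = pd.DataFrame({"Education": ["Graduation", "Basic", "PhD", "2n Cycle", "Master"]})

# Assumption: these labels are the ones that mean higher education
higher = ["Graduation", "Master", "PhD"]

# np.where(condition, value_if_true, value_if_false)
toy["Higher_Educ"] = np.where(toy["Education"].isin(higher), 1, 0)

# The original variable is no longer needed
toy = toy.drop(columns="Education")
print(toy)
```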

<div class="alert alert-block alert-info">

<a id='power'></a>

### 3.2.6. Power Transform
    
</div>

__`Step 40`__ Create two new variables, `sqrt_rcn` and `sqrt_mnt`, by applying a square root transformation to the variables `Rcn` and `Mnt`, in order to bring their distributions closer to normal. Use the numpy function `np.sqrt()`

__`Step 40.B`__ Compare the distribution of the variables with and without the sqrt transformation using a histplot.
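
The effect of the square root on a right-skewed variable can also be checked numerically, as in this sketch. The values are invented (the squares of 1 to 10), chosen so that the transformed column is perfectly symmetric. The transformation assumes non-negative values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: the squares of 1..10
toy = pd.DataFrame({"Mnt": [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]})

# Square root transformation (only valid for non-negative values)
toy["sqrt_mnt"] = np.sqrt(toy["Mnt"])

skew_before = toy["Mnt"].skew()
skew_after = toy["sqrt_mnt"].skew()
print(skew_before, skew_after)
```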

Before applying scaling in our final dataset, we are going to remove some features that could lead to problems on modelling or even on the scaling.

[BACK TO TOC](#toc)

<div class="alert alert-block alert-warning">

<a id='reduce'></a>

## 3.3. Data Reduction 
    
</div>

<div class="alert alert-block alert-info">

<a id='corr'></a>

### 3.3.1. Multicollinearity - Check correlation
    
</div>

We understood in __Step 19__, using the heatmap to check the spearman correlation between the variables, that NetPurchase had a perfect negative relationship with StorePurchase. We don't need both, so we are going to remove one of those.

__`Step 41`__ Using `drop()`, drop the variable `NetPurchase`, since it is highly correlated with `StorePurchase`. Do the same with `Year_Birth`, since we used this variable to calculate `Age` and the two are highly correlated.

<div class="alert alert-block alert-info">

<a id='unary'></a>

### 3.3.2. Unary Variables
    
</div>

__`Step 42`__ Drop the variable `Country` with `drop()`, since it is a unary variable.

<div class="alert alert-block alert-info">

<a id='na'></a>

### 3.3.3. Variables with a high percentage of missing values
    
</div>

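__`Step 43`__ Drop the variable `City`, since it has 97% of its values missing. Try using `dropna()` on the columns (`axis=1`), setting the `thresh` parameter to 90% of the length of our dataset, so that only columns with at least that many non-missing values are kept.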

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
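
A sketch of Step 43 on an invented ten-row dataframe. `thresh` is the minimum number of non-missing values a column needs in order to be kept, and it should be an integer. The city name is a hypothetical value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: City is filled in only one of the ten rows
toy = pd.DataFrame({
    "City": [np.nan] * 9 + ["Lisbon"],
    "Income": [20000.0, 25000.0, 30000.0, 35000.0, 40000.0,
               45000.0, 50000.0, 55000.0, 60000.0, 65000.0],
})

# Keep only the columns with at least 90% non-missing values
min_non_missing = int(0.9 * len(toy))
toy = toy.dropna(axis=1, thresh=min_non_missing)
print(toy.columns.tolist())
```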

<div class="alert alert-block alert-warning">

## 3.2. Back to Data Transformation
    
</div>

<div class="alert alert-block alert-info">

<a id='dummy'></a>

### 3.2.7. Apply ordinal encoding and create Dummy variables
    
</div>

__`Step 44`__ For the variable `Income_bins` where we have an order, we are going to apply ordinal encoding. Define the low value to 0, medium to 1 and high to 2 using the method `replace()`.

__`Step 44.B`__ We can see from the `info()` of the dataset that "Income_bins" is now a category. Convert this variable into an integer using `astype()` and check the new data type with the attribute `dtype`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

__`Step 45`__ For the categorical variables, apply `get_dummies()`.

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
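
Steps 44 to 45 could look like the sketch below, on invented data. Two choices here are assumptions rather than the notebook's prescribed solution: the ordinal encoding goes through strings and `map()` instead of `replace()`, because some recent pandas versions warn when `replace()` is used on a categorical column, and `drop_first=True` is used to avoid one redundant dummy per variable.

```python
import pandas as pd

# Hypothetical data: Income_bins is an ordered categorical, as produced by pd.cut
toy = pd.DataFrame({
    "Income_bins": pd.Categorical(["Low", "High", "Medium"],
                                  categories=["Low", "Medium", "High"], ordered=True),
    "Gender": ["F", "M", "F"],
})

# Ordinal encoding: Low -> 0, Medium -> 1, High -> 2, stored as integers
mapping = {"Low": 0, "Medium": 1, "High": 2}
toy["Income_bins"] = toy["Income_bins"].astype(str).map(mapping).astype(int)
print(toy["Income_bins"].dtype)

# Dummy variables for the remaining categorical column
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True)
print(encoded)
```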

<div class="alert alert-block alert-info">

<a id='scale'></a>

### 3.2.8. Scaling
    
</div>

__`Step 46`__ Scale the data using `MinMaxScaler()` in the range [0,1]. Check how `KNNImputer()` was applied. `MinMaxScaler()` implementation is very similar.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
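
A minimal sketch of Step 46 on invented values, following the same fit-and-transform pattern used with `KNNImputer`. The result of `fit_transform()` is a numpy array, so it is wrapped back into a dataframe with the original column names and index.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data
toy = pd.DataFrame({"Income": [20000.0, 50000.0, 80000.0], "Age": [30.0, 40.0, 70.0]})

# The default feature_range is (0, 1)
scaler = MinMaxScaler()
scaled_array = scaler.fit_transform(toy)  # this is an array

# Rebuild a dataframe, keeping column names and index
scaled = pd.DataFrame(scaled_array, columns=toy.columns, index=toy.index)
print(scaled)
```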