# Machine Learning Tutorial!



### Categorical data Template `( encoding categorical data )`

#### Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')
dataset.head()

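|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |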


## Converting the values to an array
> ## That is, we create our **_matrix of features_** (observations)
> ### `X := independent variable columns`

In [3]:
X = dataset.iloc[:, :-1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### ` y := dependent variable vector`

In [4]:
y = dataset.iloc[:, 3].values
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

### `To show every column with the total number of NaNs in it:`
> #### `dataset.isnull().sum()`
> #### `dataset.isna().sum()` (`isnull` is an alias of `isna`, so both give the same result)

In [5]:
dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

# Taking care of missing data

In [6]:
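from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 (Age and Salary) are the numeric columns with missing values
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)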

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
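
As a quick sanity check, the two imputed values can be reproduced by hand. The sketch below re-types the Age and Salary columns from the array printed earlier (the names `age` and `salary` are only for illustration) and uses `np.nanmean`, which averages while ignoring NaNs. Under these assumptions it should give the same fill values as `strategy='mean'`.

```python
import numpy as np

# Age and Salary columns copied from the array printed in In [3]
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores NaN entries when averaging
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Fill each NaN with the mean of its own column
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(round(age_mean, 2), round(salary_mean, 2))
```

The two means correspond to the `38.77777777777778` and `63777.77777777778` entries in the imputed output above.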


## What is Categorical Data?

Categorical data are variables that contain **label values** rather than **numeric values**.

The number of possible values is often limited to a fixed set.

Categorical variables are often called **_`nominal`_**.

Some examples include:

> `A “pet” variable with the values: “dog” and “cat”.`  <br>
> `A “color” variable with the values: “red”, “green” and “blue”.`<br>
> `A “place” variable with the values: “first”, “second” and “third”.`<br>

Each **value** represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The **“place”** variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable.
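
For an ordinal variable, an integer encoding that follows the natural order keeps that order intact. Here is a minimal sketch in plain Python; the `order` list and the sample `places` values are assumptions made up for illustration.

```python
# The ordering is stated explicitly; it is an assumption about the data
order = ["first", "second", "third"]

# Hypothetical observations of the "place" variable
places = ["second", "first", "third", "first"]

# Map each category to its rank in the stated order
rank = {label: i for i, label in enumerate(order)}
encoded = [rank[p] for p in places]

print(encoded)  # [1, 0, 2, 0]
```

Because the integers follow the real ordering, comparisons such as `rank["first"] < rank["third"]` are meaningful here, which is not true for a nominal variable like "color".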

**`Machine learning models are based on numerical equations and calculations on numerical variables. Most of the time, however, our dataset has columns that are non-numeric, such as countries, names, or cities. In that case we need to convert those columns into numeric values that can be used for further processing.`**


## What is the Problem with Categorical Data?

Some algorithms can work with categorical data directly.

For example, a **decision tree** can be learned directly from categorical data with **<ins>no data transform required</ins>** (this depends on the specific implementation).

<ins>Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be **numeric**.</ins>

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

[source](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/)
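
Converting an output variable to numbers and back can be sketched with NumPy alone. The example below uses the `y` vector from above; the `predicted_codes` array is a made-up stand-in for model predictions. scikit-learn's `LabelEncoder` offers the same round trip through `fit_transform` and `inverse_transform`.

```python
import numpy as np

# The dependent variable vector from In [4]
y = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# classes holds the sorted unique labels; codes holds each label's index in classes
classes, codes = np.unique(y, return_inverse=True)

# Hypothetical integer predictions from some model
predicted_codes = np.array([1, 0, 0, 1])

# Indexing classes with integer codes recovers the categorical labels
predicted_labels = classes[predicted_codes]

print(classes)
print(codes)
print(predicted_labels)
```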


# `Encoding categorical data`
### Encoding the Independent Variable

In [7]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

## ColumnTransformer
> #### Applies transformers to columns of an array or pandas DataFrame.
> ` This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.`

In [8]:
column_transformer = ColumnTransformer(
    [('onehotencoder', OneHotEncoder(categories='auto'), [0])],    # Column indices to transform (here [0], but it could be a list such as [0, 1, 3])
    remainder='passthrough'                         # Leave the rest of the columns untouched
)
column_transformer

ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('onehotencoder', OneHotEncoder(categorical_features=None, categories='auto',
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True), [0])])

## `One-Hot Encoding`

For categorical variables where no ordinal relationship exists, an integer encoding (each category mapped to an integer, such as red → 0, green → 1, blue → 2) is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the *“color”* variable example, there are 3 categories and therefore 3 binary variables are needed. A **“1” value is placed in the binary variable for the color and “0” values for the other colors**.

For example:

| red | green | blue |
|-----|-------|------|
| 1   | 0     | 0    |
| 0   | 1     | 0    |
| 0   | 0     | 1    |

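Another example:
The following table shows the same encoding for a language variable. It is often called **dummy coding** or **dummy encoding**, although strictly speaking dummy coding drops one of the columns, because that column can be inferred from the others: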

| French | Italian | German |
|----|----|----|
| 1  | 0  | 0  |
| 0  | 1  | 0  |
| 0  | 0  | 1  |
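
The tables above can be reproduced with `pandas.get_dummies`. The sketch below uses a made-up `language` column and shows both the full one-hot encoding and the variant with `drop_first=True`, which keeps one column fewer. Note that pandas orders the columns alphabetically, not in the order used in the table.

```python
import pandas as pd

# Hypothetical language column, for illustration only
language = pd.Series(["French", "Italian", "German", "French"])

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(language, dtype=int)

# Dropping the first category: a row of all zeros then means "French"
dummy = pd.get_dummies(language, drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```

Dropping one column loses no information and avoids perfectly correlated inputs, which matters for some models such as linear regression.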

In [9]:
X = np.array(column_transformer.fit_transform(X), dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])
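
To see which of the first three columns belongs to which country, the fitted encoder can be inspected. This is a small self-contained sketch on a three-row stand-in (`X_small` and `ct` are illustrative names), assuming a scikit-learn version where `named_transformers_` and `categories_` are available.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A small stand-in for X: the first three rows of the data above
X_small = np.array([['France', 44.0, 72000.0],
                    ['Spain', 27.0, 48000.0],
                    ['Germany', 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    [('onehotencoder', OneHotEncoder(categories='auto'), [0])],
    remainder='passthrough'
)
encoded = np.array(ct.fit_transform(X_small), dtype=float)

# The fitted encoder is stored under the name given in the transformer list
fitted_encoder = ct.named_transformers_['onehotencoder']

# categories_ lists the categories in the order of the one-hot columns
print(fitted_encoder.categories_)
print(encoded[:, :3])
```

The categories come out in sorted order, so in the array above the first column marks France, the second Germany, and the third Spain.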

**`Why not stop at label encoding, that is, replacing each country name with an integer? Depending on the data, label encoding introduces a new problem. Country names are categorical data, and there is no ordering of any kind between the categories.`**

**`The problem is that, with different numbers in the same column, the model will take the data to be ordered, 0 < 1 < 2. Machine learning models are based on equations, and these equations would treat Spain (2) as greater than Germany (1), and Germany (1) as greater than France (0). This isn’t the case at all. To prevent this problem we use dummy variables, which is what the One Hot Encoder produces.`**
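
One way to make the ordering problem concrete is to compare distances between countries under both encodings. The sketch below is only an illustration; the dictionaries `label` and `one_hot` are written out by hand, not produced by scikit-learn.

```python
import numpy as np

# Integer (label) encoding in alphabetical order
label = {"France": 0, "Germany": 1, "Spain": 2}

# One-hot encoding: each country gets its own axis
one_hot = {"France": np.array([1, 0, 0]),
           "Germany": np.array([0, 1, 0]),
           "Spain": np.array([0, 0, 1])}

pairs = [("France", "Germany"), ("Germany", "Spain"), ("France", "Spain")]

# Distance between the codes of each pair under both encodings
label_dist = [abs(label[a] - label[b]) for a, b in pairs]
one_hot_dist = [float(np.linalg.norm(one_hot[a] - one_hot[b])) for a, b in pairs]

print(label_dist)    # [1, 1, 2]
print(one_hot_dist)  # three equal values, each sqrt(2)
```

With integer codes, France and Spain appear twice as far apart as France and Germany, a relationship that does not exist in the data. With one-hot vectors every pair of countries is equally distant.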