# Data Preprocessing

**Outline**
* [Required python libraries](#Required-python-libraries)
* [Dataset overall characteristics](#Dataset-and-its-overall-characteristics)
* [Missing values](#Missing-Values)
 * [Deleting the column with missing data](#Deleting-the-column-with-missing-data)
 * [Deleting the row with missing data](#Deleting-the-row-with-missing-data)
 * [Filling the Missing Values – Imputation](#Filling-the-Missing-Values-%E2%80%93-Imputation)
      * [Filling with regression model](#Filling-with-a-Regression-Model)
      * [Filling with KNN](#Filling-with-KNN)
* [Data Transformation](#Data-Transformation)
 * [Clipping](#Clipping)
 * [Basic Transformation](#Basic-transformation)
 * [Scaling Techniques](#Scaling-Techniques)
      * [MinMax Scaler](#MinMax-Scaler)
      * [Standard Scaler](#Standard-Scaler)
      * [MaxAbsScaler](#MaxAbsScaler)
      * [Robust Scaler](#Robust-Scaler)
      * [Quantile Transformer Scaler](#Quantile-Transformer-Scaler)
 * [Transforming non-normal data](#Transformating-non-normal-data)
      * [Skewness and Kurtosis](#Skewness-and-Kurtosis)
      * [Square root transformation](#Square-root-transformation)
      * [Log transformation](#Log-transformation)
      * [Box-Cox Transformation](#Box-Cox-Transformation)
* [Exercise](#Exercise)

## Required python libraries

In [1]:
# Install necessary python libraries
import sys
## for data
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install seaborn
!{sys.executable} -m pip install scipy
!{sys.executable} -m pip install statsmodels
# sklearn.preprocessing, sklearn.linear_model and sklearn.impute all ship in the scikit-learn package
!{sys.executable} -m pip install scikit-learn
# datetime is part of the Python standard library and needs no installation



In [2]:
# Load necessary python libraries
import pandas as pd
import numpy as np

## for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## for statistical tests
import scipy
import statsmodels.formula.api as smf
import statsmodels.api as sm
import sklearn.preprocessing as preproc
import sklearn.linear_model as lm
import sklearn.impute as im

## for date/time
import datetime as dt

## Dataset and its overall characteristics 
**Titanic dataset from Kaggle:**

For the details see: https://www.kaggle.com/competitions/titanic/data

In [3]:
url = 'https://raw.githubusercontent.com/ZIFODS/Training/master/data/data_titanic.csv'
df = pd.read_csv(url)
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Missing Values

Adapted from https://www.analyticsvidhya.com/blog/2021/05/dealing-with-missing-values-in-python-a-complete-guide/

In [4]:
# To show number of nulls (Not available data points or NAs) in the dataset
missing = df.isnull().sum()
missing

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are missing values in columns Age (177), Cabin (687) and Embarked (2).

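Only some machine learning algorithms can work with missing data. For example, some k-nearest-neighbours implementations can skip `NaN` values when computing distances, whereas most other models raise an error when they encounter them.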

### Deleting the column with missing data
This is an extreme case and should only be used when there are many null values in the column.

In [5]:
updated_df = df.dropna(axis=1).copy()
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
dtypes: float64(1), int64(5), object(3)
memory usage: 62.8+ KB


The problem with this method is that we may lose valuable information on the features, as we have deleted them completely due to some null values.
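A less drastic variant is to drop a column only when most of it is missing. The following is a minimal sketch on a made-up toy frame; the 0.5 threshold and the names `toy`, `threshold` and `cols_to_drop` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Hypothetical cut-off: drop a column only if more than half of it is missing.
threshold = 0.5
cols_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(missing_fraction)
print("Dropped:", cols_to_drop)
print("Kept:", reduced.columns.tolist())
```

Here only `Cabin` exceeds the threshold, so `Age` survives and can still be imputed later.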

### Deleting the row with missing data
If there is a certain row with missing data, then you can delete the entire row with all the features in that row.

axis=1 is used to drop the column with `NaN` values.

axis=0 is used to drop the row with `NaN` values.

In [6]:
updated_df = df.dropna(axis=0).copy()
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  183 non-null    int64  
 1   Survived     183 non-null    int64  
 2   Pclass       183 non-null    int64  
 3   Name         183 non-null    object 
 4   Sex          183 non-null    object 
 5   Age          183 non-null    float64
 6   SibSp        183 non-null    int64  
 7   Parch        183 non-null    int64  
 8   Ticket       183 non-null    object 
 9   Fare         183 non-null    float64
 10  Cabin        183 non-null    object 
 11  Embarked     183 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB


We reduced the number of records significantly (from 891 to 183). A better approach is to first get rid of the columns that have null values and that we are not that interested in (like Cabin and Embarked), and then apply the same method.

In [7]:
updated_df = df.copy()
updated_df.drop("Cabin",axis=1,inplace=True)
updated_df.drop("Embarked",axis=1,inplace=True)
updated_df = updated_df.dropna(axis=0)
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  714 non-null    int64  
 1   Survived     714 non-null    int64  
 2   Pclass       714 non-null    int64  
 3   Name         714 non-null    object 
 4   Sex          714 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        714 non-null    int64  
 7   Parch        714 non-null    int64  
 8   Ticket       714 non-null    object 
 9   Fare         714 non-null    float64
dtypes: float64(2), int64(5), object(3)
memory usage: 61.4+ KB


### Filling the Missing Values – Imputation

In this case, we will be filling the missing values with a certain number.

The possible ways to do this are:

1. Filling the missing data with the mean or median value if it’s a numerical variable.
2. Filling the missing data with mode if it’s a categorical value.
3. Filling the numerical value with 0 or -999, or some other number that will not occur in the data. This can be done so that the machine can recognize that the data is not real or is different.

You can use the `fillna()` function to fill the null values in the dataset.
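The three options listed above can be sketched on a small made-up frame as follows. The column values and the variable names `toy`, `filled` and `sentinel` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing numerical and one missing categorical value.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 30.0],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

filled = toy.copy()

# 1. Numerical variable: fill with the median (the mean works the same way).
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# 2. Categorical variable: fill with the mode.
#    mode() returns a Series because ties are possible, so take the first entry.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

# 3. Sentinel value that cannot occur in real data.
sentinel = toy["Age"].fillna(-999)

print(filled)
print(sentinel.tolist())
```

The median is usually preferred over the mean when the variable is skewed, since it is less affected by extreme values.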

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [9]:
updated_df = df.copy()
updated_df['Age']=updated_df['Age'].fillna(updated_df['Age'].mean())
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Filling with a Regression Model

In this case, the null values in one column are filled by fitting a regression model using other columns in the dataset.

That is, the regression model uses all the remaining columns except Age as the features X, and Age as the target y.

In [10]:
lr = lm.LinearRegression()
updated_df = df.copy()
updated_df.drop("Name",axis=1,inplace=True)
updated_df.drop("Ticket",axis=1,inplace=True)
updated_df.drop("PassengerId",axis=1,inplace=True)
updated_df.drop("Cabin",axis=1,inplace=True)
updated_df.drop("Embarked",axis=1,inplace=True)
le = preproc.LabelEncoder()
updated_df['Sex'] = le.fit_transform(updated_df['Sex'])

testdf = updated_df[updated_df['Age'].isnull()==True].copy()
traindf = updated_df[updated_df['Age'].isnull()==False].copy()
y = traindf['Age']
traindf.drop("Age",axis=1,inplace=True)
testdf.drop("Age",axis=1,inplace=True)
lr.fit(traindf,y)

LinearRegression()

In [11]:
testdf

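|   | Survived | Pclass | Sex | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 5 | 0 | 3 | 1 | 0 | 0 | 8.4583 |
| 17 | 1 | 2 | 1 | 0 | 0 | 13.0000 |
| 19 | 1 | 3 | 0 | 0 | 0 | 7.2250 |
| 26 | 0 | 3 | 1 | 0 | 0 | 7.2250 |
| 28 | 1 | 3 | 0 | 0 | 0 | 7.8792 |
| ... | ... | ... | ... | ... | ... | ... |
| 859 | 0 | 3 | 1 | 0 | 0 | 7.2292 |
| 863 | 0 | 3 | 0 | 8 | 2 | 69.5500 |
| 868 | 0 | 3 | 1 | 0 | 0 | 9.5000 |
| 878 | 0 | 3 | 1 | 0 | 0 | 7.8958 |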


In [12]:
pred = lr.predict(testdf)

In [13]:
pred

array([29.07080066, 30.10833306, 22.44685065, 29.08927347, 22.43705181,
       29.07922599, 32.43692984, 22.43898701, 22.15615704, 29.07922599,
       29.07691632, 24.96460346, 22.43898701, 20.8713251 , 37.80993305,
       44.85950626, 17.21443083, 29.07922599, 29.07691632, 22.43842532,
       29.07691632, 29.07691632, 29.07922599, 22.14798185, 18.1926178 ,
       29.07691632, 29.08140983, 17.39852793, 27.61791252, 29.08796287,
       29.06774208, -5.49189866, 36.98755908, 44.88640441, 15.9929439 ,
       -5.20126796, 37.01068094, 44.52580031, 18.32218064, 29.08140983,
       22.43898701, -5.49189866, 25.08068578, 29.07922599, 16.2835746 ,
       29.37503621, 25.27089854, 18.32218064, 29.08889901, 37.44600929,
       29.08140983, 29.37204054, 44.81038921, 22.43898701, 37.23610531,
       44.88528103, 44.85950626, 37.88482487, 22.43898701, 13.91474356,
       30.40869971, 29.07691632, 36.97144531, -5.49189866, 14.20537427,
       32.62971335, 29.07922599, 18.31319362, 44.75047576, 29.08
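
The predictions still have to be written back into the Age column. The sketch below shows one way to do that on a made-up toy frame. Clipping at 0 is an added assumption, motivated by the negative values (such as -5.49) visible in the output above, since a linear model can extrapolate to impossible ages.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the prepared Titanic frame (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare":   [71.3, 7.9, 13.0, 8.1, 53.1, 26.0],
    "Age":    [38.0, np.nan, 27.0, 22.0, np.nan, 31.0],
})

features = ["Pclass", "Fare"]
missing_mask = toy["Age"].isnull()

# Fit on the rows where Age is known.
model = LinearRegression()
model.fit(toy.loc[~missing_mask, features], toy.loc[~missing_mask, "Age"])

# Predict Age for the rows where it is missing.
predicted = model.predict(toy.loc[missing_mask, features])

# Assumption: an age cannot be negative, so clip the predictions at 0.
predicted = np.clip(predicted, 0, None)

# Write the predictions back; known ages stay untouched.
imputed = toy.copy()
imputed.loc[missing_mask, "Age"] = predicted
print(imputed)
```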

### Filling with KNN

KNN stands for k-nearest neighbours. The KNN imputer replaces each missing value with the mean of that feature over the `n_neighbors` nearest neighbours found in the training set. By default, scikit-learn's `KNNImputer` measures distance with a NaN-aware Euclidean metric (`nan_euclidean`), which ignores the coordinates that are missing.
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

In [14]:
imputer = im.KNNImputer(n_neighbors=5)

#### Textual Values

One thing to note here is that the KNN Imputer does not accept text values; it raises an error unless they are converted to numbers first. For example, in our Titanic dataset, the categorical columns ‘Sex’ and ‘Embarked’ contain text.
We drop all the textual columns except 'Sex', which is converted into numerical values using LabelEncoder from sklearn.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [16]:
updated_df = df.copy()
updated_df.drop("Name",axis=1,inplace=True)
updated_df.drop("Ticket",axis=1,inplace=True)
updated_df.drop("PassengerId",axis=1,inplace=True)
updated_df.drop("Cabin",axis=1,inplace=True)
updated_df.drop("Embarked",axis=1,inplace=True)
# LabelEncoder lives in sklearn.preprocessing, imported above as preproc
le = preproc.LabelEncoder()
updated_df['Sex'] = le.fit_transform(updated_df['Sex'])

#### Scaling (Normalisation)
Another critical point here is that the KNN Imputer is a distance-based imputation method, so it requires us to **normalize** our data. Otherwise, the different scales of our variables will lead the KNN Imputer to generate biased replacements for the missing values. For simplicity, we will use Scikit-Learn’s MinMaxScaler, which scales our variables to have values between 0 and 1.

In [None]:
scaler = preproc.MinMaxScaler()
updated_df = pd.DataFrame(scaler.fit_transform(updated_df), columns = updated_df.columns)
updated_df.head()

In [None]:
updated_df = pd.DataFrame(imputer.fit_transform(updated_df),columns = updated_df.columns)
updated_df.info()

In [None]:
updated_df['Age'].to_numpy(dtype=None, copy=False)
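
After these steps the imputed values are still on the 0 to 1 scale. A minimal sketch of the full round trip (scale, impute, then map back to the original units with `inverse_transform`) is shown below on made-up data; the toy values and `n_neighbors=2` are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data with two missing ages (values are made up).
toy = pd.DataFrame({
    "Age":  [22.0, 38.0, np.nan, 35.0, 54.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 8.46],
})

# MinMaxScaler ignores NaN when fitting and keeps it in the transformed output.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

# Impute on the scaled data so that both columns contribute comparably to the distance.
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map everything back to the original units (years, fare).
restored = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=toy.columns)
print(restored)
```

Because each imputed age is an average of observed ages, it always falls between the smallest and largest observed age.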

## Data Transformation
Data transformation is the process of converting raw data into a format or structure that is more suitable for the model or algorithm, and for data discovery in general. It is an essential step in feature engineering that facilitates discovering insights.

Numeric data transformation is needed when the data distribution is skewed since the ML algorithm is more likely to be biased in such a case. Besides, transforming data into the same scale allows the algorithm to compare the relative relationship between data points better.

Adapted from https://www.visual-design.net/post/data-transformation-and-feature-engineering-in-python

### Clipping

This approach is most suitable when there are outliers in the dataset. Clipping sets an upper and a lower bound, and every data point is forced into that range.

We can use `quantile()` to find the range that holds the bulk of the data (between the 0.05 and 0.95 quantiles, i.e. the 5th and 95th percentiles). Any value below the lower bound (the 0.05 quantile) is raised to the lower bound. Similarly, any value above the upper bound (the 0.95 quantile) is lowered to the upper bound.

In [None]:
df.head()

In [None]:
# clipping methods to handle outliers

clip_var = ['Age', 'Fare']
for i in clip_var:
    transformed = 'clipped_' + i
    # upper limit - 0.95 quantile
    upper_limit = df[i].quantile(0.95)
    # lower limit - 0.05 quantile
    lower_limit = df[i].quantile(0.05)
    df[transformed] = df[i].clip(lower_limit, upper_limit, axis = 0)
    # print(df[i].describe())
    # print(df[transformed].describe())

### Basic transformation
Basic data transformation is needed to make data tidier and more insightful. 

In [None]:
df = pd.DataFrame({
    'Income$': ['$150,000', '$10,800', '$120,0000', '$100,000'],
    'Year_Birth': [2000, 2004, 1977, 1973],
    'Joined': ['2018-12-19', '2022-1-23', '2010-10-3', '2001-5-30'],   
    'Balance':[100.0, -263.0, 2000.0, -5.0],
    'Department': ['HR','Legal','Marketing','Management']
})

df['Joined'] = pd.to_datetime(df['Joined'])
df.head()

In [None]:
# Examples of basic transformations:

# 1. Transform Year of Birth into Age
df['Age'] = dt.date.today().year - df['Year_Birth']

# 2. Transform date into enrollment length:
# datetime is imported as dt, so the date class is dt.date
df['Enrollment_Length'] = ((pd.to_datetime(dt.date.today()) - df['Joined'])/np.timedelta64(1, 'D')).astype(int)

# 3. Transform Currency into numbers
# This involves four steps: 
#   1) clean data to remove characters ", $ ." 
#   2) substitute null value to 0; 
#   3) convert string into integer; 
#   4) scale down the numbers into thousand dollar which helps with visualizing the data distribution
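# regex=False treats ',' and '$' as literal characters ('$' as a regex would only match the end of the string)
df['Income'] = df['Income$'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).fillna(0).astype(int)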
#df['Income'] = df['Income'].apply(lambda x: round(x/1000))

df.head()

### Scaling Techniques

Adapted from https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/

Often, we have datasets in which different columns have different units – like one column can be in kilograms, while another column can be in centimeters. Furthermore, we can have columns like income which can range from 20,000 to 100,000, and even more; while an age column which can range from 0 to 100. Thus, Income is about 1,000 times larger than age.

When we feed these features to the model as is, there is every chance that the income will influence the result more due to its larger value. But this doesn’t necessarily mean it is more important as a predictor. So, to give importance to both Age, and Income, we need feature scaling.

In [None]:
# Before directly applying any feature transformation or scaling technique, 
# we need to remember the categorical columns and first deal with them. This is because we cannot scale non-numeric values.
    
df_scaled = df.copy()
col_names = ['Income', 'Age', 'Enrollment_Length','Balance']
features = df_scaled[col_names]

#### MinMax Scaler
The MinMax scaler is one of the simplest scalers to understand.  It just scales all the data between 0 and 1. 

The formula for calculating the scaled value is:

**x_scaled = (x – x_min)/(x_max – x_min)**
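
As a quick check, the formula can be applied by hand and compared with scikit-learn's `MinMaxScaler`. The three income values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up incomes as a single feature column.
x = np.array([[20000.0], [50000.0], [100000.0]])

# Manual application of (x - x_min) / (x_max - x_min).
manual = (x - x.min()) / (x.max() - x.min())

# The same transformation via scikit-learn.
from_sklearn = MinMaxScaler().fit_transform(x)

print(manual.ravel())        # 0, 0.375 and 1
print(from_sklearn.ravel())
```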

In [None]:
scaler = preproc.MinMaxScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

In [None]:
# Suppose we don’t want the income or age to have values like 0. Let us take the range to be (5, 10).
scaler = preproc.MinMaxScaler(feature_range=(5, 10))

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

#### Standard Scaler
For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation is 1.

**x_scaled = (x – mean)/std_dev**

However, Standard Scaler works best when the variable is approximately normally distributed. Thus, if the variables are not normally distributed, we

1. either choose a different scaler
2. or first, convert the variables to a normal distribution and then apply this scaler
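
The standardisation step itself, subtracting the mean and then dividing by the standard deviation, can be verified by hand against `StandardScaler`. This is a small sketch on made-up numbers; note that scikit-learn divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual standardisation: subtract the mean, then divide by the standard deviation.
manual = (x - x.mean()) / x.std()   # np.std uses ddof=0 by default

from_sklearn = StandardScaler().fit_transform(x)

print(manual.ravel())
print("mean:", manual.mean(), "std:", manual.std())
```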

In [None]:
scaler = preproc.StandardScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

In [None]:
df_scaled.describe()

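# The means of the scaled columns are not exactly 0, only very close to it,
# because of the numerical precision of floating-point numbers in Python.
# The standard deviations reported by describe() differ from 1 because pandas uses the
# sample standard deviation (ddof=1), whereas StandardScaler uses the population one (ddof=0).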

#### MaxAbsScaler
This scaler finds the maximum absolute value of each column and divides each value in the column by it.

Thus, it first takes the absolute value of each value in the column and then takes the maximum out of those. This operation scales the data to the range [-1, 1]. 
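
A minimal sketch of the same computation by hand, using the four `Balance` values from the example frame above, compared with `MaxAbsScaler`:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Balance values from the example frame, as a single feature column.
x = np.array([[100.0], [-263.0], [2000.0], [-5.0]])

# Divide every value by the largest absolute value in the column (2000 here).
manual = x / np.abs(x).max()

from_sklearn = MaxAbsScaler().fit_transform(x)

print(manual.ravel())   # signs are preserved, values lie in [-1, 1]
```

Unlike the MinMax scaler, this scaler does not shift the data, so zeros stay zero and the signs are kept.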

In [None]:
scaler = preproc.MaxAbsScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

#### Robust Scaler
Each of the scalers we have seen so far uses values like the mean, maximum and minimum of the columns. All these values are sensitive to outliers. If there are too many outliers in the data, they will influence the mean and the maximum or minimum value. Thus, even if we scale the data using the above methods, we cannot guarantee balanced data with a normal distribution.

The Robust Scaler, as the name suggests, is not sensitive to outliers. This scaler removes the median from the data and scales the data by the interquartile range (IQR).

IQR is the difference between the first and third quartile of the variable: **IQR = Q3 – Q1**

Thus, the formula would be:

**x_scaled = (x – median)/(Q3 – Q1)**
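
The sketch below checks this by hand on made-up data with one obvious outlier, assuming the scaler subtracts the median and divides by the IQR as described above, which is how scikit-learn's `RobustScaler` behaves with its default settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up column with one outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x)                  # 3
q1, q3 = np.percentile(x, [25, 75])    # 2 and 4
manual = (x - median) / (q3 - q1)

from_sklearn = RobustScaler().fit_transform(x)

print(manual.ravel())   # -1, -0.5, 0, 0.5 and 48.5
```

The outlier stays far away from the rest, but it no longer determines how the other four values are scaled.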

In [None]:
scaler = preproc.RobustScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

#### Quantile Transformer Scaler
The Quantile Transformer Scaler maps the variable onto a chosen target distribution and scales it accordingly. In scikit-learn the target is uniform on [0, 1] by default; pass `output_distribution='normal'` to obtain a normal distribution instead.

Since it reshapes the distribution through the ranks of the values, it also dampens the effect of outliers. Here are a few important points regarding the Quantile Transformer Scaler:

1. It computes the cumulative distribution function (CDF) of the variable

2. It uses this CDF to map the values to a uniform distribution on [0, 1]

3. It maps the obtained values to the desired output distribution using the associated quantile function

A caveat to keep in mind though: since this scaler changes the very distribution of the variables, linear relationships among variables may be destroyed by using this scaler. Thus, it is best to use this for non-linear data. 
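
The choice of output distribution can be illustrated with a short sketch on synthetic right-skewed data. The sample size, the seed and `n_quantiles=100` are arbitrary assumptions.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed data: 500 draws from an exponential distribution.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(500, 1))

# Default target: uniform on [0, 1].
to_uniform = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
uniform = to_uniform.fit_transform(x)

# Alternative target: standard normal.
to_normal = QuantileTransformer(n_quantiles=100, output_distribution="normal")
normal = to_normal.fit_transform(x)

print("uniform range:", uniform.min(), uniform.max())
print("normal median:", np.median(normal))
```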

In [None]:
scaler = preproc.QuantileTransformer(n_quantiles=4)

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

### Transformating non-normal data

In some cases it is important that the data we have is of normal shape (also known as following a Bell curve). This includes regression analysis, the two-sample t-test, and Analysis of Variance (ANOVA), to name a few.


Adapted from https://www.marsja.se/transform-skewed-data-using-square-root-log-box-cox-methods-in-python/

#### Skewness and Kurtosis
Briefly, skewness is a measure of lack of symmetry. The larger its absolute value, the less symmetric (and hence the less normal) the data are. Kurtosis, on the other hand, is a measure of whether data is heavy- or light-tailed relative to a normal distribution.

**Fairly Symmetrical** Skewness:	-0.5 to 0.5

**Moderately Skewed**    Skewness:	-0.5 to -1.0 and 0.5 to 1.0

**Highly Skewed**      Skewness:	< -1.0 and > 1.0


There are also statistical tests that can be used to check whether data is normally distributed (e.g. the Shapiro-Wilk test).
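
A minimal sketch of both checks on synthetic data follows. The two columns, the sample size and the seed are assumptions; `scipy.stats.shapiro` returns the test statistic and a p-value, where a small p-value speaks against normality.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: one symmetric column and one right-skewed column.
rng = np.random.default_rng(42)
sample = pd.DataFrame({
    "symmetric": rng.normal(size=300),
    "right_skewed": rng.exponential(size=300),
})

# Skewness and (excess) kurtosis per column.
summary = sample.agg(["skew", "kurtosis"]).transpose()
print(summary)

# Shapiro-Wilk test per column.
p_values = {}
for col in sample.columns:
    statistic, p_value = stats.shapiro(sample[col])
    p_values[col] = p_value
    print(f"{col}: W={statistic:.3f}, p={p_value:.4f}")
```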

In [None]:
df = pd.read_csv('./data/data_to_transform.csv')
df.hist(grid=False,
       figsize=(10, 6),
       bins=30)

In [None]:
df.agg(['skew', 'kurtosis']).transpose()

In [None]:
# Quantile Transformer is explained above
scaler = preproc.QuantileTransformer()

df.insert(len(df.columns), 'A_Quantile',
         scaler.fit_transform(np.array(df.iloc[:,0]).reshape(-1,1)))
df.insert(len(df.columns), 'B_Quantile',
         scaler.fit_transform(np.array(df.iloc[:,1]).reshape(-1,1)))
df.insert(len(df.columns), 'C_Quantile',
         scaler.fit_transform(np.array(df.iloc[:,2]).reshape(-1,1)))
df.insert(len(df.columns), 'D_Quantile',
         scaler.fit_transform(np.array(df.iloc[:,3]).reshape(-1,1)))

#### Square root transformation
The square root method is typically used when your data is moderately skewed. 
Taking the square root (e.g., sqrt(x)) has a moderate effect on the shape of the distribution and is generally used to reduce right skew. It can be applied to zero values and is most commonly used on count data.

In [None]:
# Python Square root transformation
df.insert(len(df.columns), 'A_Sqrt',
         np.sqrt(df.iloc[:,0]))

# Square root transformation on left skewed data in Python:
df.insert(len(df.columns), 'C_Sqrt',
         np.sqrt(max(df.iloc[:, 2]+1) - df.iloc[:, 2])) # Here we have to reverse the distribution


#### Log transformation
The logarithm is a strong transformation that has a major effect on distribution shape. 
Like the square root method, this technique is often used for reducing right skewness. 
It cannot be applied to zero or negative values.
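
When the data contain zeros, one common workaround is to take log(1 + x) instead of log(x). This is a different transformation from the plain logarithm used in this notebook and is shown only as a hedged sketch on made-up counts; NumPy provides it as `np.log1p`, with `np.expm1` as its inverse.

```python
import numpy as np

# Made-up count data that includes a zero.
counts = np.array([0.0, 1.0, 9.0, 99.0])

# log(1 + x) is defined at zero, unlike log(x).
log1p_counts = np.log1p(counts)

# expm1 undoes the transformation.
restored = np.expm1(log1p_counts)

print(log1p_counts)
print(restored)
```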

In [None]:
# Python log transform
df.insert(len(df.columns), 'B_log',
         np.log(df['Highly Positive Skew']))


df.insert(len(df.columns), 'C_log',
         np.log(max(df.iloc[:, 2] + 1) - df.iloc[:, 2]))

#### Box-Cox Transformation
The Box-Cox transformation is named after the statisticians George Box and Sir David Roxbee Cox, who developed the technique in a 1964 paper.
Formula:
* y(λ) = (y^λ – 1) / λ  if λ ≠ 0
* y(λ) = log(y)  if λ = 0

This is a procedure to identify a suitable exponent (Lambda) to use to transform skewed data.
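
The formula can be written out directly and compared with SciPy, reading yλ as y raised to the power λ and taking the case split to be on λ. In this sketch the helper `box_cox` and the sample values are hypothetical; `scipy.stats.boxcox` returns only the transformed array when `lmbda` is given, and a pair (transformed array, fitted λ) when it is not.

```python
import numpy as np
from scipy import stats

def box_cox(y, lam):
    """Box-Cox transform of positive data y for a fixed lambda (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1) / lam

# Made-up positive, right-skewed values.
y = np.array([1.0, 2.0, 5.0, 10.0, 50.0])

# The manual formula agrees with SciPy for fixed lambdas.
for lam in (0.0, 0.5, 2.0):
    print(lam, np.allclose(box_cox(y, lam), stats.boxcox(y, lmbda=lam)))

# Without lmbda, SciPy also estimates the lambda that makes the data most normal-like.
transformed, fitted_lambda = stats.boxcox(y)
print("fitted lambda:", fitted_lambda)
```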

In [None]:
# Box-Cox Transformation in Python
df.insert(len(df.columns), 'A_Boxcox', 
              scipy.stats.boxcox(df.iloc[:, 0])[0])

In [None]:
df.agg(['skew']).transpose()

**NB!** SciPy’s boxcox() raises “ValueError: Data must be positive” when the variable contains zero or negative numbers. np.sqrt() and np.log() do not raise an error in that situation; they return `NaN` (or `-inf` for np.log(0)), typically with a RuntimeWarning. To solve this, you can reverse the distribution.
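
Another common remedy, different from reversing the distribution, is to shift the data by a constant so that every value becomes positive. The sketch below assumes made-up data and an arbitrary choice of shift that moves the minimum to 1.

```python
import numpy as np
from scipy import stats

# Made-up data containing a negative value and a zero.
data = np.array([-3.0, 0.0, 1.5, 4.0, 20.0])

# Box-Cox rejects the raw data.
raised = False
try:
    stats.boxcox(data)
except ValueError as err:
    raised = True
    print("boxcox rejected the raw data:", err)

# Assumption: shift so that the smallest value becomes 1.
shift = 1 - data.min()
shifted = data + shift

transformed, fitted_lambda = stats.boxcox(shifted)
print("shift:", shift, "fitted lambda:", fitted_lambda)
```

The shift has to be remembered if the transformed values are ever mapped back to the original scale.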

## Exercise

1. Load the "Titanic" dataset (url = 'https://raw.githubusercontent.com/ZIFODS/Training/master/data/data_titanic.csv').
2. Delete the rows with missing Age values. How many records remain?
3. Apply the clipping method to the Age column using the 0.1 and 0.8 quantiles as the lower and upper limits. How many values are replaced with the upper limit?
4. What type of skewness does the column "Fare" have?
5. Quantile-transform the column "Fare". What value in the transformed "Fare" column corresponds to PassengerId 6 (rounded to two decimal places)?