# Data Pre-Processing

Aims of this practical:

* Identify continuous and discrete variables
* Learn how to standardise and normalise data
* Deal with missing and extreme data
* Transform discrete variables into numerical ones

In this practical, we use `sklearn` and other libraries we have seen so far:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Understanding the data

`rental_data.csv` contains 410 records, each one representing a rental property. The variables associated with each property record indicate the neighbourhood it is in, how many people it accommodates, how many bathrooms and bedrooms there are, the rental price, the property size in square feet, the score given by reviewers and the number of reviews.

Let's load the file and look at each variable in turn.

In [None]:
data = pd.read_csv('data/rental_data.csv')

data.info()
data.describe()

For each variable, is it continuous or discrete?

Will it need to be pre-processed and, if so, how? What pre-processing tools would you use?

In [None]:
# Your thoughts here...
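
One rough way to approach this question is sketched below on a small made-up table rather than the real data (the `toy` frame and its values are illustrative assumptions). It compares each column's dtype with its number of distinct values: text columns and integer columns with few distinct values are usually discrete, while float columns with many distinct values are usually continuous. This is only a heuristic and does not replace thinking about what each variable means.

```python
import pandas as pd

# Hypothetical toy data, not taken from rental_data.csv
toy = pd.DataFrame({
    "host_neighbourhood": ["North", "South", "North", "East"],
    "bedrooms": [1, 2, 2, 3],
    "price": [55.0, 80.5, 72.25, 120.0],
})

# One row per column: its dtype and how many distinct values it takes
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)
```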


### Using `ydata_profiling` to quickly explore data

The `ydata_profiling` package generates detailed summaries of all variables in a dataset.

It is a useful first step for spotting issues you need to deal with, such as missing values or unexpected zeroes.

It renders an interactive HTML frame inside your Python notebook. To save the report to a file, use the `.to_file()` method of the profile object; `.to_html()` and `.to_json()` return the report as a string instead.

This step can take some time to process and show the exploratory analysis.

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df=data, title="Pandas Profiling Report")

profile.to_notebook_iframe()

### Sanity checking

Some rows have a value of `0` in the columns `number_of_reviews` and `reviews_per_month`. This is plausible, since some properties may not have any reviews yet.

Examine the rows where either of these variables is equal to zero.

In [None]:
# Your code here...
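
As a hint, the general pattern is a boolean mask combined with `|` ("or"). The sketch below uses a small made-up frame (the `toy` data and the names `zero_mask` and `zero_rows` are assumptions for illustration), so the same idea still has to be applied to `data`.

```python
import pandas as pd

# Hypothetical toy data, not taken from rental_data.csv
toy = pd.DataFrame({
    "number_of_reviews": [12, 0, 3, 0],
    "reviews_per_month": [1.5, 0.0, 0.2, 0.0],
    "review_scores_rating": [95.0, None, 80.0, 90.0],
})

# True where either column is zero; each comparison needs its own parentheses
zero_mask = (toy["number_of_reviews"] == 0) | (toy["reviews_per_month"] == 0)
zero_rows = toy[zero_mask]
print(zero_rows)
```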


What is the `review_scores_rating` for these rows? What would you expect it to be if there are no reviews?

In [None]:
# Your thoughts here...


Some combinations of values for `number_of_reviews`, `reviews_per_month`, `review_scores_rating`, and `square_feet` are inconsistent, and the corresponding rows need to be dropped:

* Drop rows with `NaN`s in `review_scores_rating`
* Drop rows with `review_scores_rating` > 0 but `number_of_reviews` == 0

In [None]:
# Drop rows with NaN in `review_scores_rating`
# Your code here...


In [None]:
# Drop rows with `review_scores_rating` > 0 but `number_of_reviews` == 0
# Your code here...
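
Both drops can be sketched on a made-up frame as follows (the `toy` data and the name `inconsistent` are illustrative assumptions). `dropna(subset=...)` removes rows with `NaN` in the named column only, and a negated boolean mask keeps the rows that pass the consistency check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from rental_data.csv
toy = pd.DataFrame({
    "number_of_reviews": [12, 0, 3, 0],
    "review_scores_rating": [95.0, np.nan, 80.0, 90.0],
})

# Step 1: drop rows where the rating is missing
toy = toy.dropna(subset=["review_scores_rating"])

# Step 2: drop rows that have a rating but no reviews
inconsistent = (toy["review_scores_rating"] > 0) & (toy["number_of_reviews"] == 0)
toy = toy[~inconsistent]
print(toy)
```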


## Outliers

Look again at a summary of the values...

In [None]:
data.describe(percentiles=[0.99])

Are there any columns with suspicious values?

In [None]:
# Your thoughts here...


### Visualising the data

A boxplot quickly shows us the distribution of continuous variables. `seaborn` provides the `boxplot()` function for this.

In [None]:
g = sns.boxplot(data=data)

plt.xticks(rotation=90);

### Removing outliers

Using the z-score (computed by the `zscore` function below), find any rows where the value for `bedrooms`, `price` or `square_feet` is more than 3 standard deviations from the column mean.

Print them out to see their values and then add the index of the row to the list `to_drop`.

Then use that to drop the outliers.

**IMPORTANT NOTE:** this is an exercise to show how outliers can be identified using the standard deviation. Dropping observations deemed to be outliers like this is **not advised**, unless you are **very clear** on the implications!

In [None]:
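def zscore(sample):
    """Return the z-score of each value in a pandas Series."""
    mean = sample.mean()
    std = sample.std()  # pandas uses the sample standard deviation (ddof=1)

    return (sample - mean) / std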

to_drop = []

# Your code here...
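
The loop this exercise asks for can be sketched on made-up numbers (the `toy` frame, its values and the name `cleaned` are assumptions for illustration). Note that a sample needs at least 11 values before any single one can lie more than 3 sample standard deviations from the mean, which is why the toy column has 20 entries.

```python
import pandas as pd

def zscore(sample):
    return (sample - sample.mean()) / sample.std()

# Hypothetical toy data: one extreme price among 20 otherwise similar rows
toy = pd.DataFrame({
    "price": list(range(91, 110)) + [1000],
    "square_feet": list(range(500, 700, 10)),
})

to_drop = []
for col in ["price", "square_feet"]:
    z = zscore(toy[col])
    outliers = toy[z.abs() > 3]          # rows more than 3 std from the mean
    print(col, outliers[col].tolist())
    to_drop.extend(outliers.index.tolist())

# A row can be flagged by several columns, so remove duplicate indices
to_drop = sorted(set(to_drop))
cleaned = toy.drop(index=to_drop)
```

Collecting the indices first and dropping once at the end means every column's z-scores are computed on the same, unmodified data.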


The box plots should look a bit more reasonable now.

In [None]:
g = sns.boxplot(data=data)

plt.xticks(rotation=90);

## Normalising and standardising

The scales of the variables are still very different. Should you normalise or standardise the data?

Recall that standardisation rescales each variable to have a mean of 0 and a standard deviation of 1, based on its overall mean and standard deviation. Normalisation rescales each variable to lie between 0 and 1, based on its minimum and maximum.
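
Written out for a single variable $x$, with $\mu$ and $\sigma$ its mean and standard deviation and $x_{\min}$ and $x_{\max}$ its smallest and largest values, the usual definitions are as follows (the symbols are introduced here for illustration; the first formula is the same calculation as the `zscore` function above):

$$
z = \frac{x - \mu}{\sigma}, \qquad x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$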

Let's see what each one looks like when applied.

Using `sklearn.preprocessing.StandardScaler` and `sklearn.preprocessing.MinMaxScaler`, fit and transform the continuous variables.

Then, view their new values as a boxplot using the same command as above.

In [None]:
cont_vars = ['accommodates', 'bathrooms', 'bedrooms', 'price',
             'review_scores_rating', 'square_feet', 'number_of_reviews',
             'reviews_per_month']

In [None]:
from sklearn.preprocessing import StandardScaler

stander = StandardScaler()

# Your code here...


In [None]:
from sklearn.preprocessing import MinMaxScaler

normer = MinMaxScaler()

# Your code here...
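
For reference, here is a minimal sketch of the fit-and-transform pattern on made-up numbers (the `toy` frame and the names `standardised` and `normalised` are illustrative assumptions). Both scalers return a NumPy array, so the result is wrapped back into a DataFrame to keep the column names for plotting.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data, not taken from rental_data.csv
toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0, 200.0],
    "bedrooms": [1.0, 2.0, 2.0, 3.0],
})

# Standardise: each column ends up with mean 0 and standard deviation 1
standardised = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# Normalise: each column ends up between 0 and 1
normalised = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

print(standardised)
print(normalised)
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`), so its output differs slightly from a z-score computed with pandas.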


## Encoding discrete variables

In the last part of this practical, we look at the `host_neighbourhood` variable. This is a discrete nominal variable, which is easy to read and understand for us but cannot be processed as is by a machine learning model.

### Using discrete variables as features

To use `host_neighbourhood` as a feature in a model (for instance to predict the value for `price`), it needs to be transformed into a vector representation.

Use `pandas.get_dummies()` to get a vectorized form of `host_neighbourhood`.

This function lets you set a `prefix` for the new columns it creates. Since the default (`host_neighbourhood`) is a bit long, set it to something shorter.

In [None]:
from pandas import get_dummies

# Your code here...
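
A sketch of the call on made-up neighbourhood names (the values and the `hood` prefix are assumptions for illustration). Each distinct value becomes its own indicator column, and every row has exactly one indicator switched on.

```python
import pandas as pd

# Hypothetical neighbourhood names, not taken from rental_data.csv
toy = pd.DataFrame({"host_neighbourhood": ["North", "South", "East", "North"]})

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["host_neighbourhood"], prefix="hood")
print(dummies)
```

Depending on the pandas version, the indicator columns are boolean or small integers; passing `dtype=int` to `get_dummies()` makes the output 0/1 integers in either case.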


### Using discrete variables as targets

`get_dummies()` is a good choice if you want to one-hot encode input features. There are instances, however, where you want numerical labels instead (often the case for the categorical output of a model).

Many machine learning packages, `sklearn` classifiers among them, convert string-valued categorical targets into numerical labels automatically.

For instance, you can convert a pandas Series with `.astype('category')` and read the numerical category labels from `.cat.codes`:

In [None]:
data.host_neighbourhood.astype('category').cat.codes
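
To see which code stands for which label, the categories can be read back from the same accessor. The sketch below uses made-up neighbourhood names (the values and the names `hoods` and `code_to_label` are illustrative assumptions).

```python
import pandas as pd

# Hypothetical neighbourhood names, not taken from rental_data.csv
hoods = pd.Series(["North", "South", "East", "North"], name="host_neighbourhood")

as_category = hoods.astype("category")
codes = as_category.cat.codes

# Categories are sorted, and each code is a position in that sorted list
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())   # [1, 2, 0, 1]
print(code_to_label)    # {0: 'East', 1: 'North', 2: 'South'}
```

Missing values receive the code `-1`, so it is worth checking for them before using the codes as labels.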

## Next steps

* Apply these ideas to your own data, where you have a better understanding of what should be done with `NaN`s, zeroes and so on.
* Train some simple models using original features, standardised features and normalised features - what happens to model performance in each case?
* We used `pandas` functionality here for one-hot encoding because it works easily on a single column of data, but try `sklearn.preprocessing.OneHotEncoder`, which expects 2D input, so a single column has to be selected as a DataFrame or reshaped first.
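
As a starting point for the last item, here is a minimal sketch of `OneHotEncoder` on made-up neighbourhood names (the `toy` frame and its values are assumptions for illustration). Selecting the column with double brackets keeps it two-dimensional, which is the shape the encoder expects.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical neighbourhood names, not taken from rental_data.csv
toy = pd.DataFrame({"host_neighbourhood": ["North", "South", "East", "North"]})

encoder = OneHotEncoder()

# toy[["host_neighbourhood"]] is a one-column DataFrame (2D), not a Series (1D).
# The encoder returns a sparse matrix by default, so convert it for display.
encoded = encoder.fit_transform(toy[["host_neighbourhood"]]).toarray()

# The learned categories give the meaning of each output column, in order
categories = encoder.categories_[0].tolist()
print(categories)
print(encoded)
```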