# Data preparation
We often need to preprocess the data before using it for model building, for a variety of reasons:
* due to errors in data capture, the data may contain outliers or missing values
* different features may be on different scales
* the current data distribution may not be amenable to learning

Typical steps in data preprocessing are as follows:
* separate features and labels
* handle missing values and outliers
* scale features to bring them all onto the same scale
* apply transformations such as log or square root to the features

In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(data_url, sep=";")

In [5]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["quality"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

In [6]:
wine_features = strat_train_set.drop("quality", axis=1)
wine_labels = strat_train_set["quality"].copy()

`wine_features` holds all the features except `quality`, since quality is what we want to predict.
* `strat_train_set` is our full training data set
* `drop("quality", axis=1)` returns a new DataFrame without the `quality` column; `axis=1` means a column is dropped (axis 0 is for rows and axis 1 is for columns)

`wine_labels` just takes the `quality` column and makes a copy of it.
* `strat_train_set["quality"]` selects the target column which we are trying to predict
* `.copy()` prevents side effects or accidental edits to the original data set

### Data cleaning
Before applying data cleaning, let's first check for missing values. One way of finding them is to count them column by column:

In [7]:
wine_features.isna().sum()

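| feature | missing values |
|---------|----------------|
| fixed acidity | 0 |
| volatile acidity | 0 |
| citric acid | 0 |
| residual sugar | 0 |
| chlorides | 0 |
| free sulfur dioxide | 0 |
| total sulfur dioxide | 0 |
| density | 0 |
| pH | 0 |
| sulphates | 0 |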


In our case there are no missing values.

In case we have missing values:
* if they were not recorded
    * use imputation techniques to fill in the missing values
    * or drop the rows containing missing values
* if they do not exist
    * it's better to keep them as `NaN`

To drop rows or columns containing missing values, pandas provides
* `.dropna()`
* `.drop()`

Scikit-Learn provides the `SimpleImputer` class, which can fill missing values with, for example, the median value of each feature.
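
The options above can be sketched on a tiny made-up DataFrame. The column values here are hypothetical and are not taken from the wine data; the point is only to show what each call does to rows and columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.4, 3.3],
    "alcohol": [9.4, 9.8, np.nan, 10.0],
})

# Option 1: drop every row that contains at least one missing value.
dropped_rows = toy.dropna()

# Option 2: drop a whole column by name.
dropped_column = toy.drop("alcohol", axis=1)

# Option 3: fill each missing value with the median of its column.
filled = toy.fillna(toy.median())

print(dropped_rows.shape)
print(list(dropped_column.columns))
print(filled)
```

Dropping rows is simple but throws data away, which is why imputation is often preferred when only a few values are missing.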


In [8]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

imputer.fit(wine_features)

In case the features contain non-numerical attributes, they should be dropped before calling the `fit` method on the imputer object, because a median can only be computed for numerical columns.

In [9]:
imputer.statistics_

array([ 7.9    ,  0.52   ,  0.26   ,  2.2    ,  0.08   , 14.     ,
       39.     ,  0.99675,  3.31   ,  0.62   , 10.2    ])

These are the median values of each feature in our training set, as learned by the imputer.

In [10]:
wine_features.median()

| feature | median |
|---------|--------|
| fixed acidity | 7.9 |
| volatile acidity | 0.52 |
| citric acid | 0.26 |
| residual sugar | 2.2 |
| chlorides | 0.08 |
| free sulfur dioxide | 14.0 |
| total sulfur dioxide | 39.0 |
| density | 0.99675 |
| pH | 3.31 |
| sulphates | 0.62 |


In [11]:
tr_features =  imputer.transform(wine_features)

This returns a NumPy array, which we can convert back into a DataFrame if needed.

In [12]:
tr_features.shape

(1279, 11)
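
Converting the imputed array back into a DataFrame could look roughly like this. It is a minimal sketch on made-up data (the `toy_*` names and values are assumptions for illustration); with the wine data the same pattern would use `wine_features.columns` and `wine_features.index`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical features with a non-default index, as after a train/test split.
toy_features = pd.DataFrame(
    {"pH": [3.2, np.nan, 3.4], "alcohol": [9.4, 9.8, np.nan]},
    index=[10, 11, 12],
)

toy_imputer = SimpleImputer(strategy="median")
toy_array = toy_imputer.fit_transform(toy_features)  # plain NumPy array

# Re-attach the original column names and row index.
toy_df = pd.DataFrame(
    toy_array,
    columns=toy_features.columns,
    index=toy_features.index,
)
print(toy_df)
```

Passing the original index matters after a shuffled split, since otherwise the new DataFrame would get a fresh 0..n-1 index that no longer lines up with the labels.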

### Handling text and categorical attributes
Most models expect numerical input, so categories have to be converted to numbers.
### Ordinal Encoding
It is used to convert text to numbers where the order matters, for example:
* low, medium, high
* good, bad, ugly


In [13]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

Call the `fit_transform()` method on the `ordinal_encoder` object to convert text to numbers.

The list of categories can be obtained via the `categories_` attribute.

`ordinal_encoder.fit_transform(data)`:
* fit step:
    * Scans each column in your data
    * Figures out all the unique values in each one
    * Sorts them (alphabetically for text) unless an explicit order is passed through the `categories` parameter
    * stores them internally as `[category_1, category_2, category_3, ...]`
    * and assigns them `[0.0, 1.0, 2.0, ...]`

* transform step:
    * takes the input data
    * replaces each category with corresponding number
    * the result is a NumPy array, and by default the output values are floats, not integers

`ordinal_encoder.categories_`:
After fitting, this attribute lists all the categories found in each column.
* This is super useful for:
    * Debugging
    * Reverse mapping
    * Making sure the categories were processed in the order you expected
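
The fit and transform steps can be sketched on a small made-up column. The `level` column and its values are hypothetical; the example also shows why checking `categories_` is worthwhile, because the default sorted order is not the intended low < medium < high order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered category column.
sizes = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# Default behaviour: categories are sorted, so the order is high, low, medium.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes)
print(default_encoder.categories_)

# Passing the order explicitly gives low=0, medium=1, high=2.
ordered_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_encoder.fit_transform(sizes)
print(ordered_codes.ravel())

# Reverse mapping from numbers back to the original labels.
recovered = ordered_encoder.inverse_transform(ordered_codes)
print(recovered.ravel())
```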

If the categories have no natural order, for example `["red", "green", "blue", ...]`, then the ordinal encoder should not be used: if red is given 0 and green is given 1, the model will treat green as greater than red. So we use it only when the categories have an order.

### OneHot Encoding:
When the categories have no inherent order, we use one-hot encoding instead.

In [14]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()

This transforms your category into separate columns. One for each unique category.

And then it puts a 1 in the column that matches your data, and 0 everywhere else.

what does it look like?

From this:

| wine_color |
|------------|
| red        |
| white      |
| rosé       |
| red        |

to this:

| red | white | rosé |
|-----|-------|------|
| 1   | 0     | 0    |
| 0   | 1     | 0    |
| 0   | 0     | 1    |
| 1   | 0     | 0    |

Here too we can call `fit_transform` on the `OneHotEncoder`: `fit` scans each column for its unique entries, and `transform` builds a binary vector for each value.

By default it outputs a SciPy sparse matrix rather than a NumPy array, which saves memory when there is a large number of categories.

We can use `toarray()` to convert it to a NumPy array, and the list of categories can be viewed through the `categories_` attribute.
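
As a rough illustration, the `wine_color` table above can be encoded as follows. The column is the same made-up example, not part of the wine data set. Note that the output columns follow the sorted category order, so they come out as red, rosé, white rather than in the order drawn in the table.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The hypothetical wine_color column from the table above.
colors = pd.DataFrame({"wine_color": ["red", "white", "rosé", "red"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)  # SciPy sparse matrix by default

# Convert to a dense NumPy array for inspection.
dense = encoded.toarray()

print(encoder.categories_)  # column order of the one-hot output
print(dense)
```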

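If the number of categories is very large, one-hot encoding produces a very large number of columns, which can slow down training and use a lot of memory. This can be addressed with one of the following approaches:
* replace the categorical feature with related numerical features
* convert the categories into low-dimensional learnable vectors called embeddings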


### Feature Scaling
* Many ML algorithms, notably those based on gradient descent or on distances, do not perform well when input features are on different scales.
* Scaling of the target label is generally not required.

Imagine this:
* let's say we have two features:
    * alcohol_content: ranges from 0 to 100
    * residual_sugar: ranges from 0.0 to 1.0
* our model sees alcohol_content and goes: "this number is HUGE compared to that tiny sugar value. Must be more important!"

But is it? No! It's just in a different unit; the model got confused by units.
* So we scale features to:
    * Bring them to a common range
    * Prevent one feature from dominating purely due to magnitude
    * Help gradient descent converge faster
    * Make distance-based algorithms actually useful
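
A small numeric sketch of the "dominating feature" problem, using three made-up samples and the feature ranges assumed above (0 to 100 and 0.0 to 1.0). Dividing by the assumed range is just a stand-in for a proper scaler.

```python
import numpy as np

# Hypothetical samples: [alcohol_content (0-100), residual_sugar (0.0-1.0)]
a = np.array([10.0, 0.1])
b = np.array([12.0, 0.9])  # close to a in alcohol, far away in sugar
c = np.array([20.0, 0.1])  # far from a in alcohol, identical sugar

# Raw Euclidean distances are driven almost entirely by alcohol_content.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Divide each feature by its assumed range so both live on a 0-1 scale.
ranges = np.array([100.0, 1.0])
scaled_ab = np.linalg.norm((a - b) / ranges)
scaled_ac = np.linalg.norm((a - c) / ranges)

print(raw_ab < raw_ac)        # on raw values, b looks like the nearer neighbour
print(scaled_ab > scaled_ac)  # after scaling, c is the nearer neighbour
```

The nearest neighbour of `a` flips once both features are on the same scale, which is exactly the effect that matters for distance-based algorithms.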

### min-max scaling or Normalization
* We subtract the minimum value of a feature from the current value and divide by the difference between the maximum and the minimum value of that feature: `x_scaled = (x - min) / (max - min)`.
* Values are shifted and scaled so that they range between 0 and 1.
* Scikit-Learn provides the `MinMaxScaler` transformer for this.
* Good when you know the bounds or want all features to be non-negative
* Sensitive to outliers (because outliers stretch the range)
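
A minimal sketch of min-max scaling on a made-up single feature, comparing `MinMaxScaler` with the formula written out by hand. The value 100.0 is a deliberately planted outlier to show how it squeezes the other values toward 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column; 100.0 acts as an outlier.
X = np.array([[1.0], [2.0], [3.0], [100.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Same computation by hand: (x - min) / (max - min), per column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled.ravel())
print(np.allclose(X_scaled, manual))
```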

### Standardisation
* We subtract the mean value of each feature from the current value and divide by the standard deviation, so that the resulting feature has unit variance: `x_scaled = (x - mean) / std`.
* While normalization bounds values between 0 and 1, standardization does not bound values to a specific range.
* Features become centered at 0, with variance = 1
* Works best when the data is roughly normally distributed
* Standardization is less affected by outliers than normalization.
* Scikit-Learn provides the `StandardScaler` transformer for feature standardization.
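
Standardization can be sketched the same way on a small made-up matrix with two features on very different scales. The hand-written version uses NumPy's default population standard deviation, which is what `StandardScaler` uses as well.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Same computation by hand: (x - mean) / std, per column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```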

### Transformation pipeline
* Scikit-Learn provides a `Pipeline` class to line up transformations in an intended order

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

transform_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scalar", StandardScaler()),
])

wine_features_tr = transform_pipeline.fit_transform(wine_features)

### `Pipeline`:
* it takes the data and passes it through all the steps in order
* each step is a `(name, transformer)` tuple
* the steps run in sequence
* each step name should be unique and should not contain `__` (double underscore)

### `imputer`:
* scans the features and finds missing values
* replaces them with the median value of the feature

### `std_scalar`:
* standardizes all feature values
* makes sure each one has:
    * Mean = 0
    * Standard Deviation = 1
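
The step names are more than labels: they give access to the fitted steps, and the double underscore is reserved for addressing a parameter of a step as `<step name>__<parameter>`. The sketch below shows both on made-up data; `demo_pipeline` and the array values are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in the first column.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 60.0]])

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scalar", StandardScaler()),
])
X_tr = demo_pipeline.fit_transform(X)

# Fitted steps can be looked up by name.
medians = demo_pipeline.named_steps["imputer"].statistics_
print(medians)

# "<step name>__<parameter>" changes a parameter of one step.
demo_pipeline.set_params(imputer__strategy="mean")
print(demo_pipeline.named_steps["imputer"].strategy)
```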

### Column transformers:
* Real-world data often contains both numerical and categorical features, which need different transformations.
* Scikit-Learn provides `ColumnTransformer` for this purpose
* we don't have any categorical features in our data set
* let's assume that we have a **place of manufacturing** feature, that `num_attribs` contains all the numerical features, and that `cat_attribs` contains all the categorical features, which in our case is just place of manufacturing
* let `num_pipeline` be the pipeline created only for the numerical features (the same steps as `transform_pipeline` above)
* we use `OneHotEncoder()` for categorical features
* we combine both of them using the `ColumnTransformer`

In [None]:
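from sklearn.compose import ColumnTransformer

# Hypothetical categorical column: it is NOT in the wine data set, so this cell
# only runs once such a column has been added to `wine_features`.
cat_attribs = ["place_of_manufacturing"]
# Every remaining column is treated as numerical.
num_attribs = [col for col in wine_features.columns if col not in cat_attribs]

# Numerical pipeline: reuse the imputer + scaler pipeline defined above.
num_pipeline = transform_pipeline

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
wine_features_tr = full_pipeline.fit_transform(wine_features)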
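
Since the wine data has no categorical column, here is a self-contained sketch of the same idea on a tiny made-up DataFrame. All values, the place names, and the `toy_*` variable names are hypothetical; only the structure mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: two numerical features and one hypothetical categorical feature.
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.4, 3.3],
    "alcohol": [9.4, 9.8, 10.5, 11.0],
    "place_of_manufacturing": ["Porto", "Douro", "Porto", "Minho"],
})

toy_cat_attribs = ["place_of_manufacturing"]
toy_num_attribs = [c for c in toy.columns if c not in toy_cat_attribs]

# Numerical columns: impute with the median, then standardize.
toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scalar", StandardScaler()),
])

# Route each group of columns to its own transformer.
toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, toy_num_attribs),
    ("cat", OneHotEncoder(), toy_cat_attribs),
])

toy_tr = toy_full_pipeline.fit_transform(toy)

# 2 scaled numerical columns + 3 one-hot columns (one per distinct place).
print(toy_tr.shape)
```

Depending on how sparse the combined result is, `ColumnTransformer` may return either a dense array or a sparse matrix (controlled by its `sparse_threshold` parameter), so downstream code should not assume one or the other.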