# Preprocessing for Machine Learning

### I. Introduction to preprocessing

#### Preprocessing data for machine learning
* What is data preprocessing?
    * Beyond cleaning and exploratory data analysis
    * Prepping data for modeling
    * Modeling in Python requires numerical input
        * So if your dataset has categorical variables, you'll need to transform them.
    * Data preprocessing is a prerequisite to modeling.

* One of the first preprocessing steps is handling missing data, often by removing it
* Drop specific rows by passing index labels to `drop()`, which drops rows by default (`axis=0`)
* Often you'll instead want to drop a particular column, especially if all or most of its values are missing:
    * `df.drop("A", axis=1)`
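
* A minimal sketch of the two `drop` modes described above. The toy frame, its index labels, and the variable names are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column "A" is mostly missing
df = pd.DataFrame(
    {"A": [np.nan, np.nan, 1.0], "B": [7, 8, 9], "C": [0.1, 0.2, 0.3]},
    index=["r0", "r1", "r2"],
)

# drop() removes rows by default (axis=0), matched by index label
without_r1 = df.drop("r1")

# axis=1 removes a column instead
without_a = df.drop("A", axis=1)

print(without_r1.index.tolist())
print(without_a.columns.tolist())
```

* `drop` returns a new dataframe and leaves `df` itself unchanged, so assign the result if you want to keep it.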

* Drop rows where data is missing in a particular column:
    * Do this with boolean indexing, which filters a dataframe based on a condition on its values
    * `df[df["B"] == 7]`

```
# Count the missing values in column B
print(df["B"].isnull().sum())

# Keep only the rows where column B is not missing
print(df[df["B"].notnull()])
```

* To keep only rows with **at least 3 non-missing values** (`thresh` counts non-null values, and the default `axis=0` operates on rows):
    * `volunteer2 = volunteer.dropna(thresh=3)`
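
* A small sketch of how `dropna` behaves, assuming a hypothetical four-column frame. Note that `thresh` is the minimum number of non-missing values a row needs in order to be kept, and `subset` restricts the check to particular columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has 4 non-null values, row 1 has 2, row 2 has 1
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
    "d": [4.0, np.nan, 9.0],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isnull().sum()

# Keep rows with at least 3 non-null values
kept = toy.dropna(thresh=3)

# Drop rows only when column "d" is missing
kept_d = toy.dropna(subset=["d"])

print(missing_per_column)
print(kept.index.tolist(), kept_d.index.tolist())
```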

#### Working with Data Types
* Why are types important?
    * Models need numerical input, so you need to know which columns are already numeric and which must be converted
    * Recall that you can check the types of a dataframe with the `.dtypes` attribute
* Pandas datatypes are similar to native Python types, but there are a couple of things to be aware of.
    * The `object` type is what pandas uses for a column that consists of string values or of mixed types.
* Converting column types:
    * `df.dtypes`
    * `df["C"] = df["C"].astype("float")`
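
* A short sketch of a type conversion, assuming a hypothetical column `C` that was read in as strings. The `pd.to_numeric(..., errors="coerce")` call is an alternative for messy columns, since it turns unparseable entries into `NaN` where `astype` would raise an error:

```python
import pandas as pd

# Hypothetical frame: column C holds numbers stored as strings
df = pd.DataFrame({"A": [1, 2, 3], "C": ["1.5", "2.0", "3.25"]})
print(df.dtypes)

# Convert the string column to floats
df["C"] = df["C"].astype("float")
print(df.dtypes)

# For columns with unparseable entries, coerce them to NaN instead
messy = pd.Series(["1.5", "n/a", "3"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```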

#### Training and Test Sets
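* We split our data into training and test sets to detect overfitting: a model scored on the data it was trained on looks better than it really is
* Holding out a test set preserves some data the model hasn't seen yet, which gives a fairer estimate of how it will perform on new data.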
* In many scenarios, the default splitting parameters of `train_test_split()` will work well. However, if your labels have an uneven distribution, your test and training sets might not be representative samples of your dataset and could bias the model you're trying to train.
* A good technique for sampling more accurately when you have imbalanced classes is **stratified sampling.**
* **Stratified sampling** is a way of sampling that takes into account the distribution of classes or features in your dataset
* We want the distribution of our training and testing samples to be on par with the distribution of the classes in the original dataset

#### Stratified sampling
* There's an easy way to stratify on the *classification* labels in the `train_test_split` function: pass them to the `stratify` parameter
    * `X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)`
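
* A minimal sketch of the effect of `stratify`, assuming a hypothetical label array with a 90/10 class imbalance. The `test_size` and `random_state` values are arbitrary choices for the illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The share of class 1 is preserved in both splits
print(y_train.mean(), y_test.mean())
```

* Without `stratify=y`, a small test set could by chance contain very few (or none) of the minority class.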




### II. Standardizing Data
* You may come across datasets with lots of numerical noise built in, such as high variance or differently scaled features
    * The preprocessing solution for that is standardization
* **Standardization** is a preprocessing method that transforms continuous data so that it is on a comparable scale and looks closer to normally distributed
* In sklearn this is often a necessary step, because many models assume that the training data is roughly normally distributed and comparably scaled; if it isn't, you risk biasing your model
* **Many sklearn models (though not all) assume normally distributed data.**
* There are different ways to standardize data; in this course we'll focus on **log normalization** and **scaling.**
* Standardization is a preprocessing method applied to **continuous, numerical data.**

* **When to standardize**
* Model in linear space
    * K-nearest neighbors
    * Linear regression
    * k-means clustering
    * etc
* Dataset with features that have high variance
    * If a feature in your dataset has a variance that's an order of magnitude or more greater than the other features, this could impact the model's ability to learn from other features in the dataset.
* Dataset with features that are continuous and on different scales
* Linearity assumptions
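
* One way to check the "order of magnitude" rule is to compare per-column variances. This sketch assumes scikit-learn's bundled wine dataset as a stand-in for the `wine` dataframe used in these notes, so the column names are lowercase:

```python
from sklearn.datasets import load_wine

# Assumption: scikit-learn's bundled wine data stands in for `wine`
wine = load_wine(as_frame=True).data

# Variance of every feature, largest first
variances = wine.var().sort_values(ascending=False)
print(variances.head(3))

# How much larger the top variance is than the runner-up
ratio = variances.iloc[0] / variances.iloc[1]
print(ratio)
```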

#### Log normalization
* **Log normalization** is a method for standardizing your data that can be useful when you have a particular column with high variance
* **Log normalization** applies a log transformation to your values, which transforms your values onto a scale that approximates normality (an assumption about your data that a lot of models make).
* **Log normalization:**
    * Applies a log transformation
    * Uses the natural log, with base _e_ (≈ 2.718)
    * Captures relative changes and the magnitude of change, and keeps values greater than 1 in the positive space
    * Requires strictly positive input, since the log of zero or a negative number is undefined
* It's a nice way to reduce the variance of a column and make it comparable to other columns for modeling.

#### Log normalization in Python
* Fairly straightforward
* Use log function from numpy

```
import numpy as np
df['log_2'] = np.log(df['col2'])
print(df)
```
* Pass the column we want to log normalize directly into the function.

```
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())
```
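
* Because `np.log` is only defined for positive numbers, it can help to guard the transformation. The helper `log_normalize` below is hypothetical (it is not part of numpy or pandas), and the sample values are made up:

```python
import numpy as np
import pandas as pd

def log_normalize(series: pd.Series) -> pd.Series:
    """Hypothetical helper: natural-log transform that refuses non-positive input."""
    if (series <= 0).any():
        raise ValueError("np.log is undefined for zero or negative values")
    return np.log(series)

# Made-up column spanning several orders of magnitude
skewed = pd.Series([10.0, 100.0, 1000.0, 10000.0], name="col2")
logged = log_normalize(skewed)

# The variance shrinks dramatically after the transform
print(skewed.var(), logged.var())
```

* For columns that contain zeros, `np.log1p` (which computes `log(1 + x)`) is a common alternative, though it is a different transformation from the one used in these notes.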

#### Scaling Data
* **Scaling** is a method of standardization that's most useful when you're working with a dataset that contains continuous features that are on different scales, and you're using a model that operates in some sort of linear space (like linear regression or k-nearest neighbors).
* **Feature scaling:**
    * Useful when features are on different scales
    * Useful for models with linear characteristics
    * Transforms the features in your dataset so they have a mean of zero and a variance of one
    * Shifts and rescales each feature without changing the shape of its distribution, so a skewed feature stays skewed
    * Makes it easier to linearly compare features
    
* **StandardScaler():**
    * This transformer works by subtracting each feature's mean and dividing by its standard deviation, so every feature ends up with unit variance
    
```
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fit the scaler on df and return the scaled values as a DataFrame
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df),
                         columns=df.columns)
```
* There's a simpler `scale` function in sklearn, but the benefit of the `StandardScaler` object is that, once fitted, it can apply the same transformation to other data, such as a test set or newly collected data, without refitting.
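
* A minimal sketch of that reuse, assuming tiny made-up arrays. The scaler learns the mean and standard deviation from the training data only, then applies those same statistics to the new data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data and one new observation
train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[4.0]])

# fit() learns the mean and standard deviation from the training data
scaler = StandardScaler().fit(train)

# transform() reuses those learned statistics on any other data
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)

print(scaler.mean_, scaler.scale_)
print(new_scaled)
```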

```
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)
```
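
* To see what the scaler actually computes, its output can be compared with a manual calculation. This sketch assumes scikit-learn's bundled wine dataset as a stand-in for `wine`, so the column names differ from the ones above:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for `wine`
wine = load_wine(as_frame=True).data
wine_subset = wine[["ash", "alcalinity_of_ash", "magnesium"]]

ss = StandardScaler()
wine_subset_scaled = ss.fit_transform(wine_subset)  # returns a NumPy array

# Manual equivalent: subtract the mean, divide by the population std (ddof=0)
manual = (wine_subset - wine_subset.mean()) / wine_subset.std(ddof=0)

col_means = wine_subset_scaled.mean(axis=0)
col_stds = wine_subset_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```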

#### Standardized data and modeling
* Many models in scikit-learn require your data to be scaled appropriately across columns, otherwise you risk biasing your results
* This unit models both unscaled and scaled data, so you can see the difference in model performance.
* Recap: `KNeighborsClassifier` (k-nearest neighbors) is a model that classifies data based on its distance to the training set data
    * A new data point is assigned the label of the class that the majority of its nearest neighbors belong to.
* **K-Nearest Neighbors:**

```
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

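# Preprocess X and y first, then split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit a k-nearest neighbors classifier and score it on the held-out data
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))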
```

* **KNN on un-scaled data:**

```
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))
```
* Output: `0.64444`

* **KNN on scaled data:**

```
# Create the scaling method.
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling.
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data.
knn.fit(X_train, y_train)

# Score the model on the test data.
print(knn.score(X_test, y_test))
```
* Output: `0.95556`
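
* The scaled example above fits the scaler on the full dataset before splitting, which lets test-set statistics influence the transformation. A hedged alternative is to split first and wrap the scaler and model in a pipeline, so the scaler is fitted on the training data only. This sketch assumes scikit-learn's bundled wine dataset and an arbitrary `random_state`, so its scores will not match the outputs quoted above:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for X and y
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: KNN on the raw, unscaled features
unscaled = KNeighborsClassifier().fit(X_train, y_train)
unscaled_score = unscaled.score(X_test, y_test)

# Pipeline: the scaler is fitted on X_train only, then reused on X_test
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled.fit(X_train, y_train)
scaled_score = scaled.score(X_test, y_test)

print(unscaled_score, scaled_score)
```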

#### Medium article: How data normalization affects your Random Forest algorithm
* **Data normalization won’t affect the output for Random Forest *classifiers* while it will affect the output for Random Forest *regressors*.**
* Regarding the regressor, the algorithm will be more affected by the high-end values if the data is not transformed. This means that they will probably be more accurate in predicting high values than low values.
* Consequently, transformations such as log-transform will reduce the relative importance of these high values, hence generalizing better.
* **MinMax scaler generally performs better than StandardScaler for RF Regression models.**

### III. Feature Engineering

#### Feature Engineering 
* A very important part of the preprocessing workflow: feature engineering
* **Feature engineering** is the creation of new features based on existing features; it adds information to your dataset that is useful in some way.
     * It adds features that are useful for your prediction or clustering task, or
     * It sheds light on relationships between features
     * It extracts and expands data
     * It is very dependent on the particular dataset you're analyzing.

* Real world data is often not neat and tidy, and in addition to preprocessing steps like standardization, you'll likely have to extract and expand information that exists in the columns in your dataset.

#### Encoding Categorical Variables
* Because models in sklearn require numerical input, if your dataset contains categorical variables, you'll have to encode them 

* **Encoding binary variables:**
* Can be done in both pandas and sklearn.
* In **pandas**, we can use the `apply` function to encode 1s and 0s in a dataframe column
* If the dataframe `users` has a column called `subscribed` containing binary values (`y` or `n`):
    * `users['sub_enc'] = users['subscribed'].apply(lambda val: 1 if val == 'y' else 0)`
* Encoding in pandas is convenient if you're not finished preprocessing, or if you're interested in further exploratory work once you've encoded your categorical variables.
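
* A runnable sketch of the pandas approach, assuming a made-up `users` frame. The `map` variant is an equivalent alternative that makes the encoding explicit as a dictionary:

```python
import pandas as pd

# Made-up users frame with a binary y/n column
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y"]})

# Encode with apply and a lambda
users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)

# Equivalent encoding with an explicit mapping
users["sub_enc_map"] = users["subscribed"].map({"y": 1, "n": 0})

print(users)
```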

* You can also do this in `sklearn` using **`LabelEncoder`**.
* It's useful to know both methods if, for example, you're implementing encoding as part of sklearn's pipeline functionality.
* Creating a `LabelEncoder` object also allows you to reuse this encoding on other data, such as new data or a test set.

```
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
users['sub_enc_le'] = le.fit_transform(users['subscribed'])
```
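
* A sketch of reusing a fitted `LabelEncoder`, assuming made-up data. The encoder sorts the classes it sees, so here `n` becomes 0 and `y` becomes 1. Note that scikit-learn documents `LabelEncoder` as intended for target labels rather than input features:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up users frame with a binary y/n column
users = pd.DataFrame({"subscribed": ["y", "n", "n", "y"]})

le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])
print(le.classes_)

# Reuse the fitted encoder on new data, and map codes back to labels
new_users = pd.Series(["n", "y", "y"])
new_encoded = le.transform(new_users)
decoded = le.inverse_transform(new_encoded)
print(new_encoded, decoded)
```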

* Use the `get_dummies` function in pandas to directly one-hot encode categorical values, producing one indicator column per category
* `print(pd.get_dummies(users['fav_color']))`
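
* A minimal sketch of one-hot encoding, assuming a made-up `fav_color` column. Passing `dtype=int` is a choice made here to get 0/1 columns; the default output type depends on the pandas version:

```python
import pandas as pd

# Made-up users frame with a three-category column
users = pd.DataFrame({"fav_color": ["blue", "green", "orange", "green"]})

# One indicator column per category, as 0/1 integers
dummies = pd.get_dummies(users["fav_color"], dtype=int)

# Attach the encoded columns to the original frame
users_enc = pd.concat([users, dummies], axis=1)
print(users_enc)
```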