# Preprocessing for Machine Learning

#### Preprocessing data for machine learning
* What is data preprocessing?
    * Beyond cleaning and exploratory data analysis
    * Prepping data for modeling
    * Modeling in Python requires numerical input
        * So if your dataset has categorical variables, you'll need to transform them.
    * Data preprocessing is a prerequisite to modeling.
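
As a minimal sketch of turning a categorical variable into numerical input, the example below uses `pd.get_dummies()`, which is only one of several possible encodings. The dataframe and its column names (`color`, `size`) are hypothetical and chosen for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Replace the categorical column with one indicator column per category
encoded = pd.get_dummies(df, columns=["color"])

print(encoded)
```

The numeric column is left untouched, and each category of `color` becomes its own indicator column.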

* One of the first steps you can take to preprocess your data is to remove missing data.
* Drop specific rows by passing index labels to the `drop()` method, which drops rows by default (`axis=0`).
* Often you'll want to drop a particular column instead, especially if all or most of its values are missing:
    * `df.drop("A", axis=1)`
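
The two uses of `drop()` described above can be sketched as follows. The dataframe contents are made up for illustration; note that `drop()` returns a new dataframe and leaves the original unchanged unless you reassign the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column A is entirely missing
df = pd.DataFrame({
    "A": [np.nan, np.nan, np.nan],
    "B": [1.0, np.nan, 7.0],
    "C": [4, 5, 6],
})

# Drop rows by index label (axis=0 is the default)
without_rows = df.drop([0, 2])

# Drop a column by name
without_col = df.drop("A", axis=1)

print(without_rows)
print(without_col)
```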

* To drop rows where data is missing in a particular column:
    * Use boolean indexing, which filters a dataframe based on a condition on its values.
    * For example, `df[df["B"] == 7]` keeps only the rows where column `B` equals 7.
    
```
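# Count the missing values in column B
print(df["B"].isnull().sum())
# Keep only the rows where column B is not missing
print(df[df["B"].notnull()])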
```
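
To make the boolean-indexing step concrete, here is a small self-contained sketch. The data is hypothetical. It also shows `dropna(subset=...)`, a pandas alternative that gives the same result for this case.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values in column B
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [7.0, np.nan, 3.0, np.nan]})

# Number of missing values in column B
n_missing = df["B"].isnull().sum()

# Boolean mask: True where B is present, False where it is missing
mask = df["B"].notnull()
cleaned = df[mask]

# Equivalent alternative: drop rows with a missing value in column B
cleaned_alt = df.dropna(subset=["B"])

print(n_missing)
print(cleaned)
```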

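* To keep only rows with **at least 3 non-missing values**, use `thresh`, which sets the minimum number of non-missing values a row needs in order to be kept:
    * `volunteer2 = volunteer.dropna(thresh=3)`
    * Passing `axis=1` would apply the same rule to columns instead of rows.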
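
The behavior of `thresh` is easy to misread, so here is a small sketch using a made-up four-column stand-in for the `volunteer` dataframe. It also shows how to express "drop rows with at least 3 missing values" by converting that rule into a minimum count of non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the volunteer data (4 columns)
volunteer = pd.DataFrame({
    "a": [1, np.nan, np.nan, 4],
    "b": [1, 2, np.nan, 4],
    "c": [1, np.nan, np.nan, 4],
    "d": [1, 2, 3, np.nan],
})

# Keep rows that have at least 3 non-missing values
volunteer2 = volunteer.dropna(thresh=3)

# Drop rows with at least 3 missing values: a row survives only if it has
# more than (number of columns - 3) non-missing values
min_present = volunteer.shape[1] - 3 + 1
few_missing = volunteer.dropna(thresh=min_present)

print(volunteer2)
print(few_missing)
```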

#### Working with Data Types
* Why are types important?
    * Recall that you can check the types of a dataframe by using the `.dtypes` attribute
* Pandas data types are similar to native Python types, but there are a couple of things to be aware of.
    * The `object` type is what pandas uses to refer to a column that consists of string values or is of mixed types.
* Converting column types:
    * `df.dtypes`
    * `df["C"] = df["C"].astype("float")`
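
A short sketch of checking and converting types, assuming a hypothetical column `C` that holds numbers stored as strings:

```python
import pandas as pd

# Hypothetical data: column C contains numeric values stored as strings
df = pd.DataFrame({"A": [1, 2, 3], "C": ["1.5", "2.0", "3.25"]})

# Inspect the type of each column before converting
print(df.dtypes)

# Convert column C to floating point so it can be used in arithmetic
df["C"] = df["C"].astype("float")

print(df.dtypes)
print(df["C"].sum())
```

Note that `astype("float")` raises an error if any value cannot be parsed as a number, so the column may need cleaning first.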

#### Training and Test Sets
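* We split our data into training and test sets so that we can detect overfitting: a model that scores well on the data it was trained on may still generalize poorly.
* Holding out a test set preserves some data the model hasn't seen yet, which we can use to evaluate it.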
* In many scenarios, the default splitting parameters of `train_test_split()` will work well. However, if your labels have an uneven distribution, your test and training sets might not be representative samples of your dataset and could bias the model you're trying to train.
* A good technique for sampling more accurately when you have imbalanced classes is **stratified sampling.**
* **Stratified sampling** is a way of sampling that takes into account the distribution of classes or features in your dataset
* We want the distribution of our training and testing samples to be on par with the distribution of the classes in the original dataset

#### Stratified sampling
* The `train_test_split()` function makes it easy to stratify on a *classification* target: pass the labels to its `stratify` parameter.
    * `X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)`
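
The effect of `stratify` can be checked by comparing class proportions before and after the split. The sketch below uses made-up data with an 80/20 class imbalance. The `random_state` argument is an addition here, included only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 samples of class "a", 20 of class "b"
X = pd.DataFrame({"feature": range(100)})
y = pd.Series(["a"] * 80 + ["b"] * 20)

# Stratify on the labels so both splits keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Class proportions in the full data and in each split
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without `stratify=y`, the proportion of the minority class in a small test set can drift noticeably from its proportion in the full dataset.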


