<hr style="border-top: 10px groove goldenrod; margin-top: 1px; margin-bottom: 1px"></hr>

# <font color=goldenrod>Prepare Data for Classification</font>

<hr style="border-top: 10px groove goldenrod; margin-top: 1px; margin-bottom: 1px"></hr>

# Big Ideas

- Reproducibility puts the science in data science!

- Docstrings are also your friends.

- Reusable helper functions work like plastic brick toys. ;) 

# Objectives 

**By the end of the prepare lesson and exercises, you will be able to...**

- **:**


- **:**


<hr style="border-top: 10px groove goldenrod; margin-top: 1px; margin-bottom: 1px"></hr>

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from darden_class_acquire import get_titanic_data, get_iris_data

<hr style="border-top: 10px groove goldenrod; margin-top: 1px; margin-bottom: 1px"></hr>

## Iris Data

___

### Use the function defined in acquire.py to load the iris data.

In [None]:
iris = get_iris_data()
iris.head()

___

### Drop the species_id and measurement_id columns.

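- Only species_id appears in this version of the data (there is no measurement_id column in the output of prep_iris below), so that's the only column dropped here.

- Let's also build a little helper function here.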

In [None]:
iris = iris.drop(columns='species_id')
iris.head(2)
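
The helper mentioned above isn't shown in the notebook, so here is one possible shape for it. This is only a sketch: `drop_columns` is a hypothetical name, the small frame is a stand-in for the iris data, and `errors='ignore'` is used on the assumption that a column such as measurement_id may or may not be present.

```python
import pandas as pd


def drop_columns(df, cols):
    '''
    Hypothetical helper: returns a copy of df without the columns in cols.
    Columns that are not present are skipped instead of raising a KeyError.
    '''
    return df.drop(columns=cols, errors='ignore')


# tiny stand-in for the iris data, not the real table
demo = pd.DataFrame({
    'species_id': [1, 2],
    'measurement_id': [1, 2],
    'sepal_length': [5.1, 7.0],
    'species_name': ['setosa', 'versicolor'],
})

trimmed = drop_columns(demo, ['species_id', 'measurement_id'])
print(trimmed.columns.tolist())  # ['sepal_length', 'species_name']
```

Calling the helper a second time on `trimmed` is harmless, which makes it easier to rerun cells out of order.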

___

### Rename the species_name column to just species.

In [None]:
iris = iris.rename(columns={'species_name': 'species'})
iris.head(2)

___

### Create dummy variables of the species name.

In [None]:
species_dummies = pd.get_dummies(iris.species, drop_first=True)
species_dummies.head(3)

In [None]:
iris = pd.concat([iris, species_dummies], axis=1)
iris.head()
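
To see what `drop_first=True` is doing, it can help to compare both encodings on a few values. The following is a minimal sketch on a made-up series, not the iris data; `dtype=int` is passed only so the result prints as 0/1 regardless of pandas version.

```python
import pandas as pd

species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'], name='species')

# one column per category
all_levels = pd.get_dummies(species, dtype=int)

# first category (alphabetically) is dropped; a row of all zeros now means setosa
dropped = pd.get_dummies(species, drop_first=True, dtype=int)

print(all_levels.columns.tolist())  # ['setosa', 'versicolor', 'virginica']
print(dropped.columns.tolist())     # ['versicolor', 'virginica']
```

Dropping the first level loses no information, because the remaining columns fully determine it.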

___

### Create a function named prep_iris that accepts the untransformed iris data, and returns the data with the transformations above applied.

In [2]:
def prep_iris(cached=True):
    '''
    This function acquires and prepares the iris data, reading from a local csv by default.
    Passing cached=False acquires fresh data from the Codeup db and writes it to csv.
    Returns the iris df with dummy variables encoding species.
    '''
    
    # use my acquire function to read data into a df from a csv file
    df = get_iris_data(cached)
    
    # drop and rename columns
    df = df.drop(columns='species_id').rename(columns={'species_name': 'species'})
    
    # create dummy columns for species
    species_dummies = pd.get_dummies(df.species, drop_first=True)
    
    # add dummy columns to df
    df = pd.concat([df, species_dummies], axis=1)
    
    return df

In [7]:
iris = prep_iris()
iris.sample(7)

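|     | species | sepal_length | sepal_width | petal_length | petal_width | versicolor | virginica |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 123 | virginica | 6.3 | 2.7 | 4.9 | 1.8 | 0 | 1 |
| 47 | setosa | 4.6 | 3.2 | 1.4 | 0.2 | 0 | 0 |
| 37 | setosa | 4.9 | 3.6 | 1.4 | 0.1 | 0 | 0 |
| 58 | versicolor | 6.6 | 2.9 | 4.6 | 1.3 | 1 | 0 |
| 55 | versicolor | 5.7 | 2.8 | 4.5 | 1.3 | 1 | 0 |
| 7 | setosa | 5.0 | 3.4 | 1.5 | 0.2 | 0 | 0 |
| 141 | virginica | 6.9 | 3.1 | 5.1 | 2.3 | 0 | 1 |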


<hr style="border-top: 10px groove goldenrod; margin-top: 1px; margin-bottom: 1px"></hr>

## Titanic Data

___

### Use the function you defined in acquire.py to load the titanic data set.

In [9]:
titanic = get_titanic_data()
titanic.head()

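|     | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third |  | Southampton | 0 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | C | Cherbourg | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third |  | Southampton | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | C | Southampton | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third |  | Southampton | 1 |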


___

### Handle the missing values in the embark_town and embarked columns.

- A quick check shows that the same two rows are missing both embark_town and embarked, so I'm going to handle these missing values by dropping those two rows.

In [10]:
titanic[titanic.embark_town.isnull()]

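|     | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 61 | 61 | 1 | 1 | female | 38.0 | 0 | 0 | 80.0 |  | First | B |  | 1 |
| 829 | 829 | 1 | 1 | female | 62.0 | 0 | 0 | 80.0 |  | First | B |  | 1 |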


In [11]:
titanic[titanic.embarked.isnull()]

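|     | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | deck | embark_town | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 61 | 61 | 1 | 1 | female | 38.0 | 0 | 0 | 80.0 |  | First | B |  | 1 |
| 829 | 829 | 1 | 1 | female | 62.0 | 0 | 0 | 80.0 |  | First | B |  | 1 |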


In [12]:
titanic = titanic[~titanic.embarked.isnull()]
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  889 non-null    int64  
 1   survived      889 non-null    int64  
 2   pclass        889 non-null    int64  
 3   sex           889 non-null    object 
 4   age           712 non-null    float64
 5   sibsp         889 non-null    int64  
 6   parch         889 non-null    int64  
 7   fare          889 non-null    float64
 8   embarked      889 non-null    object 
 9   class         889 non-null    object 
 10  deck          201 non-null    object 
 11  embark_town   889 non-null    object 
 12  alone         889 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 97.2+ KB
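
The quick check described above can also be done for every column at once. The following is a sketch on a small made-up frame, not the titanic data; it counts missing values per column and then drops only the rows missing `embarked`, which gives the same result as the boolean mask used above.

```python
import numpy as np
import pandas as pd

# small stand-in frame, not the real titanic data
demo = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'embarked': ['S', 'C', None, 'S', 'Q'],
    'deck': [None, 'C', None, None, None],
})

# count and share of missing values per column
missing = pd.DataFrame({
    'n_missing': demo.isnull().sum(),
    'pct_missing': demo.isnull().mean().round(2),
})
print(missing)

# drop only the rows that are missing embarked
cleaned = demo.dropna(subset=['embarked'])
print(cleaned.shape)  # (4, 3)
```

A column that is mostly missing, like `deck` in this toy frame, is usually a candidate for dropping the column rather than the rows.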


___

### Remove the deck column.

In [13]:
titanic = titanic.drop(columns='deck')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  889 non-null    int64  
 1   survived      889 non-null    int64  
 2   pclass        889 non-null    int64  
 3   sex           889 non-null    object 
 4   age           712 non-null    float64
 5   sibsp         889 non-null    int64  
 6   parch         889 non-null    int64  
 7   fare          889 non-null    float64
 8   embarked      889 non-null    object 
 9   class         889 non-null    object 
 10  embark_town   889 non-null    object 
 11  alone         889 non-null    int64  
dtypes: float64(2), int64(6), object(4)
memory usage: 90.3+ KB


___

### Create a dummy variable of the embarked column.

In [14]:
titanic_dummies = pd.get_dummies(titanic.embarked, drop_first=True)
titanic_dummies.sample(10)

|     | Q | S |
| --- | --- | --- |
| 575 | 0 | 1 |
| 758 | 0 | 1 |
| 16 | 1 | 0 |
| 306 | 0 | 0 |
| 770 | 0 | 1 |
| 425 | 0 | 1 |
| 264 | 1 | 0 |
| 763 | 0 | 1 |
| 101 | 0 | 1 |
| 129 | 0 | 1 |


In [15]:
titanic = pd.concat([titanic, titanic_dummies], axis=1)
titanic.head()

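|     | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | embark_town | alone | Q | S |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | Southampton | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | Cherbourg | 0 | 0 | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | Southampton | 1 | 0 | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | Southampton | 0 | 0 | 1 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | Southampton | 1 | 0 | 1 |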


___

### Scale the age and fare columns using a min max scaler. 

**- Why might this be beneficial?**

- Age and fare are measured in different units and span different ranges, so scaling puts them on a common 0-to-1 range. That gives us a more meaningful way to compare them and keeps the larger-valued column from dominating algorithms that rely on distances.

**- When might you not want to do this?**

- You wouldn't want to scale these columns when you're exploring the data or explaining it to others. Scaled values can be very confusing in explanatory charts.

**- Before I scale these columns, I need to split my data so that no information from my unseen validate and test data leaks into what I fit on train.**

### Split Data

In [None]:
train_validate, test = train_test_split(titanic, test_size=.2, 
                                        random_state=123, 
                                        stratify=titanic.survived)

In [None]:
train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.survived)

In [None]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')

#### Create MVP function to split titanic data

In [19]:
def titanic_split(df):
    '''
    This function splits the titanic data, stratifying on survived.
    Returns train, validate, and test dfs.
    '''
    train_validate, test = train_test_split(df, test_size=.2, 
                                        random_state=123, 
                                        stratify=df.survived)
    train, validate = train_test_split(train_validate, test_size=.3, 
                                   random_state=123, 
                                   stratify=train_validate.survived)
    return train, validate, test

In [20]:
train, validate, test = titanic_split(titanic)

In [21]:
print(f'train -> {train.shape}')
print(f'validate -> {validate.shape}')
print(f'test -> {test.shape}')

train -> (497, 14)
validate -> (214, 14)
test -> (178, 14)
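
The shapes above follow from the two `test_size` values: test takes 20% of the rows, validate takes 30% of the remaining 80% (24% overall), and train keeps the other 56%. The following sketch repeats the same two-step split on synthetic data (an assumption, not the titanic data) to show those shares and to check that stratifying keeps the survival rate close to the same in each split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in: 1,000 rows with roughly 38% positives
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    'survived': (rng.random(1000) < 0.38).astype(int),
    'fare': rng.random(1000) * 100,
})

# same two-step split as above
train_validate, test = train_test_split(demo, test_size=.2,
                                        random_state=123,
                                        stratify=demo.survived)
train, validate = train_test_split(train_validate, test_size=.3,
                                   random_state=123,
                                   stratify=train_validate.survived)

# share of rows and survival rate in each split
for name, part in [('train', train), ('validate', validate), ('test', test)]:
    share = len(part) / len(demo)
    print(f'{name:<8} share = {share:.2f}, survival rate = {part.survived.mean():.3f}')
```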


___

### Fill the missing values in age. 

- The way you fill these values is up to you. Consider the tradeoffs of different methods.

In [26]:
# Create the imputer object.

imputer = SimpleImputer(strategy = 'mean')

In [27]:
# Fit the imputer to train and transform.

train['age'] = imputer.fit_transform(train[['age']])

In [28]:
# quick check

train['age'].isnull().sum()

0

In [29]:
# Transform the validate and test df age columns

validate['age'] = imputer.transform(validate[['age']])
test['age'] = imputer.transform(test[['age']])
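
One tradeoff worth seeing in numbers: the mean is pulled toward outliers, while the median is not. The following is a minimal sketch with toy ages (not the titanic splits) that fits `SimpleImputer` on train with each strategy and compares the value each one would fill in.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy ages with one large outlier; stand-ins, not the titanic splits
train_demo = pd.DataFrame({'age': [20.0, 22.0, 24.0, np.nan, 80.0]})
validate_demo = pd.DataFrame({'age': [30.0, np.nan]})

fill_values = {}
for strategy in ['mean', 'median']:
    imputer = SimpleImputer(strategy=strategy)
    # fit on train only, then reuse the learned value on validate
    imputer.fit(train_demo[['age']])
    filled = imputer.transform(validate_demo[['age']])
    fill_values[strategy] = imputer.statistics_[0]
    print(strategy, fill_values[strategy], filled.ravel())

# fill_values -> mean: 36.5, median: 23.0
```

Either way, the fill value is learned from train and only applied to validate and test.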

#### Build a helper function for imputing

In [None]:
def impute_mean_age(train, validate, test):
    '''
    This function imputes the mean of the age column into
    observations with missing values.
    Returns transformed train, validate, and test df.
    '''
    # create the imputer object with mean strategy
    imputer = SimpleImputer(strategy = 'mean')
    
    # fit on and transform age column in train
    train['age'] = imputer.fit_transform(train[['age']])
    
    # transform age column in validate
    validate['age'] = imputer.transform(validate[['age']])
    
    # transform age column in test
    test['age'] = imputer.transform(test[['age']])
    
    return train, validate, test
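
With the data split and age imputed, the min-max scaling named earlier can be applied. The notebook imports `MinMaxScaler` but does not show this step, so the following is only a sketch: `scale_age_fare` is a hypothetical helper name, and toy frames stand in for the titanic splits.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_age_fare(train, validate, test, cols=('age', 'fare')):
    '''
    Hypothetical helper: min-max scales cols using the min and max of train only.
    Returns copies of train, validate, and test with the scaled columns.
    '''
    cols = list(cols)
    scaler = MinMaxScaler()
    # learn min and max from train only
    scaler.fit(train[cols])

    scaled = []
    for df in (train, validate, test):
        df = df.copy()
        df[cols] = scaler.transform(df[cols])
        scaled.append(df)
    return tuple(scaled)


# toy splits, not the real titanic data
train_demo = pd.DataFrame({'age': [10.0, 30.0, 50.0], 'fare': [0.0, 50.0, 100.0]})
validate_demo = pd.DataFrame({'age': [20.0], 'fare': [25.0]})
test_demo = pd.DataFrame({'age': [60.0], 'fare': [200.0]})

train_s, validate_s, test_s = scale_age_fare(train_demo, validate_demo, test_demo)
print(train_s)
print(test_s)
```

Because the scaler only knows the range seen in train, values in validate or test can land outside 0 to 1, as the toy test row does here.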

___

### Create a function named `prep_titanic` 

- It should accept the untransformed titanic data and return the data with the transformations above applied.

In [None]:
def prep_titanic(cached=True):
    '''
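    This function acquires the titanic data, reading from a local csv by default,
    drops null embarked rows and the deck column, encodes embarked,
    splits the data, and imputes the mean of age.
    Returns prepped train, validate, and test dfs.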
    '''
    # use my acquire function to read data into a df from a csv file
    df = get_titanic_data(cached)
    
    # drop rows where embarked/embark town are null values
    df = df[~df.embarked.isnull()]
    
    # encode embarked using dummy columns
    titanic_dummies = pd.get_dummies(df.embarked, drop_first=True)
    
    # join dummy columns back to df
    df = pd.concat([df, titanic_dummies], axis=1)
    
    # drop the deck column
    df = df.drop(columns='deck')
    
    # split data into train, validate, test dfs
    train, validate, test = titanic_split(df)
    
    # impute mean of age into null values in age column
    train, validate, test = impute_mean_age(train, validate, test)
    
    return train, validate, test

<hr style="border-top: 10px groove goldenrod; margin-top: 1px; margin-bottom: 1px"></hr>

### Test Functions in Testing Notebook

<div class="alert alert-block alert-info">If my MVP functions work properly, and I have time, I can go back and generalize my helper functions if I want to reuse them in the future. Most importantly, I have functions that I built step-by-step, testing the code along the way and testing that I can import and use them in another notebook from my module.</div>
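
When testing in another notebook, a few quick assertions can confirm the prepared splits look the way they should. The following is a sketch under assumptions: `check_splits` is a hypothetical name, the checks are only examples of what might be worth verifying, and toy frames stand in for the output of `prep_titanic`.

```python
import pandas as pd


def check_splits(train, validate, test, full_row_count):
    '''
    Hypothetical check: raises AssertionError if the splits overlap,
    lose rows, disagree on columns, or still contain nulls in age.
    '''
    parts = [train, validate, test]

    # every original row lands in exactly one split
    combined_index = pd.concat(parts).index
    assert combined_index.is_unique, 'splits share rows'
    assert len(combined_index) == full_row_count, 'rows were lost or added'

    # all splits expose the same columns in the same order
    assert all(list(p.columns) == list(train.columns) for p in parts), 'column mismatch'

    # imputation should leave no missing ages
    assert all(p['age'].notnull().all() for p in parts), 'age still has nulls'
    return True


# toy stand-ins for the output of prep_titanic
train_demo = pd.DataFrame({'age': [22.0, 38.0], 'survived': [0, 1]}, index=[0, 1])
validate_demo = pd.DataFrame({'age': [26.0], 'survived': [1]}, index=[2])
test_demo = pd.DataFrame({'age': [35.0], 'survived': [1]}, index=[3])

print(check_splits(train_demo, validate_demo, test_demo, full_row_count=4))  # True
```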