### Exercises

Our scenario continues:

As a customer analyst, I want to know who has spent the most money with us over their lifetime. I have monthly charges and tenure, so I think I will be able to use those two attributes as features to estimate total_charges. I need to do this within an average of $5.00 per customer.

Create split_scale.py to hold the functions that follow. Each scaler function should create the scaler object, fit it on train, and transform both train and test. Each should return the scaler, the scaled train dataframe, and the scaled test dataframe. Make sure the scaled dataframes keep the original indices from train and test, since those are the indices from the original dataframe. Set a random state where applicable for reproducibility!

        split_my_data(X, y, train_pct)

        standard_scaler()

        scale_inverse()

        uniform_scaler()

        gaussian_scaler()

        min_max_scaler()

        iqr_robust_scaler()

In [18]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pydataset import data
import wrangle
import util
import split_scale
from sklearn.model_selection import train_test_split

#### Class Examples and Explanation

X: the independent variables (features). The uppercase X signals a dataframe, that is, a set of variables.

y: the target. The lowercase y signals a single target variable.

train_test_split accepts an array, a dataframe, or several dataframes at once.

In [None]:
y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y

In [None]:
train_test_split(y, random_state=123)
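
To make the point about passing several inputs concrete, here is a small sketch. The arrays `X_demo` and `y_demo` are made up for illustration and are not part of the telco data. `train_test_split` returns a train/test pair for each input, split on the same rows, and when no size is given scikit-learn holds out 25% of the rows for test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, 1 target.
# Row i of X_demo is [2*i, 2*i + 1], so rows are easy to trace back to y_demo.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# One input returns two pieces: train and test
y_train, y_test = train_test_split(y_demo, random_state=123)

# Two inputs return four pieces, split on the same rows
X_train, X_test, y_train2, y_test2 = train_test_split(
    X_demo, y_demo, random_state=123
)

print(len(y_train), len(y_test))   # sizes of the default 75/25 split
print(X_train[:, 0] // 2)          # row labels recovered from X_train
print(y_train2)                    # matching targets
```

Because the rows stay aligned, the features and the target can be split in a single call instead of splitting a combined dataframe and separating the columns afterwards.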

#### Acquire the Data

In [2]:
telco = wrangle.wrangle_telco()
telco.head()

| | customer_id | monthly_charges | total_charges | tenure |
|---|---|---|---|---|
| 0 | 0013-SMEOE | 109.7 | 7904.25 | 71 |
| 1 | 0014-BMAQU | 84.65 | 5377.8 | 63 |
| 2 | 0016-QLJIS | 90.45 | 5957.9 | 65 |
| 3 | 0017-DINOC | 45.2 | 2460.55 | 54 |
| 4 | 0017-IUDMW | 116.8 | 8456.75 | 72 |


In [3]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1685 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1685 non-null object
monthly_charges    1685 non-null float64
total_charges      1685 non-null float64
tenure             1685 non-null int64
dtypes: float64(2), int64(1), object(1)
memory usage: 65.8+ KB


In [4]:
telco.describe()

| | monthly_charges | total_charges | tenure |
|---|---|---|---|
| count | 1685.0 | 1685.0 | 1685.0 |
| mean | 60.872374 | 3728.933947 | 57.07181 |
| std | 34.71221 | 2571.252806 | 17.72913 |
| min | 18.4 | 20.35 | 1.0 |
| 25% | 24.05 | 1278.8 | 48.0 |
| 50% | 64.45 | 3623.95 | 64.0 |
| 75% | 90.55 | 5999.85 | 71.0 |
| max | 118.75 | 8672.45 | 72.0 |


In [5]:
telco = telco.drop(columns="customer_id")
telco.head()

| | monthly_charges | total_charges | tenure |
|---|---|---|---|
| 0 | 109.7 | 7904.25 | 71 |
| 1 | 84.65 | 5377.8 | 63 |
| 2 | 90.45 | 5957.9 | 65 |
| 3 | 45.2 | 2460.55 | 54 |
| 4 | 116.8 | 8456.75 | 72 |


#### Split My Data

In [21]:
# split one dataframe into train and test sets (features and target are separated later)
def split_my_data(df, train_ratio=.8, seed=123):
    train, test = train_test_split(df, train_size = train_ratio, random_state=seed)
    return train, test
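
The later cells call the scalers on `train` and `test`, so a call to `split_my_data` has to happen first. The sketch below shows one way that call might look. The `demo` dataframe is a hypothetical stand-in for `telco` (the real data comes from `wrangle.wrangle_telco()`), and its index is deliberately non-contiguous to show that the original row labels survive the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_my_data(df, train_ratio=.8, seed=123):
    train, test = train_test_split(df, train_size=train_ratio, random_state=seed)
    return train, test


# Hypothetical stand-in for the telco dataframe, labelled 100..119
demo = pd.DataFrame(
    {
        "monthly_charges": np.linspace(20, 115, 20),
        "tenure": np.arange(1, 21),
    },
    index=np.arange(100, 120),
)
demo["total_charges"] = demo.monthly_charges * demo.tenure

train, test = split_my_data(demo)
train_again, _ = split_my_data(demo)

print(train.shape, test.shape)
print(sorted(test.index))                      # labels come from demo's index
print(train.index.equals(train_again.index))   # same seed, same rows
```

Fixing the seed matters here because every scaler below is fit on `train`; a different split would give different fitted parameters.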

#### Scale My Data

In [22]:
from sklearn.preprocessing import StandardScaler, QuantileTransformer, PowerTransformer, RobustScaler, MinMaxScaler


In [55]:
# Create StandardScaler function

def standard_scaler(train, test):
    """z-scores, removes mean and scales to unit var
       Takes in a train and test set of data,
       creates and fits a scaler to the train set,
       returns the scaler, train_scaled, test_scaled
    """
    scaler = StandardScaler(copy=True, with_mean=True, with_std=True).fit(train)
    train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values])
    test_scaled =  pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])
    return scaler, train_scaled, test_scaled
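
As a quick sanity check of what standard scaling does, the sketch below uses a tiny made-up train/test pair (`train_demo` and `test_demo` are illustrative names, not notebook variables). It also passes `index=` straight to the `pd.DataFrame` constructor, which is an alternative to the `set_index` chain used above and should give the same result.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy sets that keep their original row labels
train_demo = pd.DataFrame(
    {"monthly_charges": [20.0, 50.0, 80.0, 110.0],
     "tenure": [10.0, 30.0, 50.0, 70.0]},
    index=[7, 2, 9, 4],
)
test_demo = pd.DataFrame({"monthly_charges": [65.0], "tenure": [40.0]}, index=[5])

# The mean and standard deviation are learned from train only
scaler = StandardScaler().fit(train_demo)

train_scaled = pd.DataFrame(scaler.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

print(train_scaled.mean().round(6))        # each column centered on 0
print(train_scaled.std(ddof=0).round(6))   # population std of 1
print(test_scaled)                         # test row scaled with train's statistics
```

Note that pandas' default `.std()` uses `ddof=1`, while `StandardScaler` divides by the population standard deviation, so `ddof=0` is needed to see exactly 1.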

In [53]:
standard_scaler(train, test)

(StandardScaler(copy=True, with_mean=True, with_std=True),
       monthly_charges  total_charges    tenure
 119          0.419607       0.572659  0.729412
 1424        -1.169158      -1.035331 -0.130571
 385          1.385242       1.310036  0.442751
 1140         1.075836       1.213291  0.729412
 1504         1.592472       1.876641  0.786745
 435          1.462954       1.466844  0.614748
 571         -0.098468      -0.184616 -0.015907
 656         -0.610787      -0.703237 -0.417232
 756          1.343508       1.233986  0.270754
 574          0.284332       0.435457  0.614748
 1216         0.238281       0.362207  0.557416
 1057        -1.025248      -0.850223  0.213422
 1134         1.229820       1.514886  0.844077
 1253        -1.005100      -0.756220  0.672080
 1035         0.292967       0.577443  0.786745
 775         -1.051152      -0.903516  0.213422
 1403        -1.179231      -0.910732  0.672080
 515         -1.173475      -1.092962 -0.417232
 1106        -1.156206      -

In [54]:
# Create Uniform Scaler function - Quantile Transformer

def uniform_scaler(train, test):
    """Quantile transformer, non_linear transformation - uniform
       Takes in a train and test set of data,
       creates and fits a scaler to the train set,
       returns the scaler, uniform_train, uniform_test
    """
    scaler = QuantileTransformer(n_quantiles=100, output_distribution='uniform', random_state=123, copy=True).fit(train)
    uniform_train = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values]) 
    uniform_test = pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])
    return scaler, uniform_train, uniform_test
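
The following sketch illustrates the uniform quantile transform on a made-up, right-skewed column (the data is generated, not taken from telco). It assumes at least as many train rows as `n_quantiles`; with fewer rows scikit-learn warns and lowers `n_quantiles` to the number of samples.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(123)

# Hypothetical right-skewed feature: 200 train rows, 50 test rows
train_demo = pd.DataFrame({"total_charges": rng.exponential(scale=2000, size=200)})
test_demo = pd.DataFrame({"total_charges": rng.exponential(scale=2000, size=50)},
                         index=range(200, 250))

scaler = QuantileTransformer(n_quantiles=100, output_distribution="uniform",
                             random_state=123).fit(train_demo)

uniform_train = pd.DataFrame(scaler.transform(train_demo),
                             columns=train_demo.columns, index=train_demo.index)
uniform_test = pd.DataFrame(scaler.transform(test_demo),
                            columns=test_demo.columns, index=test_demo.index)

# Values now sit between 0 and 1 and are spread roughly evenly
print(uniform_train.describe())
```

Each value is replaced by its approximate quantile in the train distribution, which is why the skew disappears and the output is bounded.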

In [26]:
uniform_scaler(train, test)

(QuantileTransformer(copy=True, ignore_implicit_zeros=False, n_quantiles=100,
                     output_distribution='uniform', random_state=123,
                     subsample=100000),
       monthly_charges  total_charges    tenure
 119          0.580710       0.657682  0.686869
 1424         0.181818       0.181987  0.323232
 385          0.898309       0.872947  0.515152
 1140         0.800304       0.850193  0.686869
 1504         0.987325       0.995336  0.752525
 435          0.942289       0.904016  0.606061
 571          0.445493       0.464668  0.353535
 656          0.390683       0.405886  0.262626
 756          0.884032       0.856202  0.449495
 574          0.553059       0.622568  0.606061
 1216         0.537953       0.597060  0.575758
 1057         0.333333       0.337921  0.429293
 1134         0.838005       0.916107  1.000000
 1253         0.367138       0.386044  0.641414
 1035         0.555556       0.659050  0.752525
 775          0.270346       0.302899  0.429

In [56]:
# Create Normal Scaler function - Quantile Transformer

def normal_scaler(train, test, seed=123):
    """Quantile transformer, non_linear transformation - normal
       Takes in a train and test set of data,
       creates and fits a scaler to the train set,
       returns the scaler, normal_train, normal_test
    """
    scaler = QuantileTransformer(n_quantiles=100, output_distribution='normal', random_state=seed, copy=True).fit(train)
    normal_train = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values]) 
    normal_test = pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])
    return scaler, normal_train, normal_test
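
The exercise list names a `gaussian_scaler()`, and `PowerTransformer` is imported above but never used. One possible reading, and it is only an assumption, is that `gaussian_scaler` was meant to wrap `PowerTransformer`. A minimal sketch under that assumption, using generated data rather than the telco set:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer


def gaussian_scaler(train, test, method="yeo-johnson"):
    """Hypothetical sketch: fit a PowerTransformer on train and
       return the scaler, train_scaled, test_scaled.
    """
    scaler = PowerTransformer(method=method, standardize=True).fit(train)
    train_scaled = pd.DataFrame(scaler.transform(train),
                                columns=train.columns, index=train.index)
    test_scaled = pd.DataFrame(scaler.transform(test),
                               columns=test.columns, index=test.index)
    return scaler, train_scaled, test_scaled


rng = np.random.default_rng(123)

# Hypothetical right-skewed feature
demo = pd.DataFrame({"monthly_charges": 18 + rng.exponential(scale=30, size=250)})
train_demo, test_demo = demo.iloc[:200], demo.iloc[200:]

scaler, gauss_train, gauss_test = gaussian_scaler(train_demo, test_demo)

print(train_demo.skew())    # skew before
print(gauss_train.skew())   # skew after, closer to 0
```

Unlike the quantile approach in `normal_scaler`, a power transform is a smooth parametric mapping, so it reduces skew without forcing the data onto an exact normal shape.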

In [34]:
normal_scaler(train, test, seed=123)

(QuantileTransformer(copy=True, ignore_implicit_zeros=False, n_quantiles=100,
                     output_distribution='normal', random_state=123,
                     subsample=100000),
       monthly_charges  total_charges    tenure
 119          0.203711       0.406144  0.486994
 1424        -0.908458      -0.907821 -0.458679
 385          1.271976       1.140433  0.037988
 1140         0.842707       1.037261  0.486994
 1504         2.236023       2.599815  0.682458
 435          1.574284       1.304781  0.269066
 571         -0.137056      -0.088680 -0.375793
 656         -0.277540      -0.238140 -0.635270
 756          1.195384       1.063410 -0.126937
 574          0.133393       0.312233  0.269066
 1216         0.095279       0.245744  0.191052
 1057        -0.430727      -0.418143 -0.178175
 1134         0.986291       1.379351  5.199338
 1253        -0.339444      -0.289644  0.362241
 1035         0.139710       0.409873  0.682458
 775         -0.611766      -0.516080 -0.1781

In [57]:
# Create MinMaxScaler function

def min_max_scaler(train, test):
    """Transforms features by scaling each feature to a given range.
       Takes in train and test data and returns
       the scaler and train and test scaled within range
       Sensitive to outliers.
    """
    scaler = MinMaxScaler(copy=True, feature_range=(0,1)).fit(train)
    train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values])
    test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])
    return scaler, train_scaled, test_scaled
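
One consequence of fitting on train only is worth seeing once: test values outside train's range land outside the 0 to 1 interval. The toy frames below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tenure values; the test set contains a value above train's max
train_demo = pd.DataFrame({"tenure": [1.0, 24.0, 48.0, 60.0]}, index=[3, 0, 8, 5])
test_demo = pd.DataFrame({"tenure": [30.0, 72.0]}, index=[1, 6])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_demo)

train_scaled = pd.DataFrame(scaler.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

print(train_scaled)   # spans exactly 0 to 1
print(test_scaled)    # row 6 is (72 - 1) / (60 - 1), which is above 1
```

This is expected behavior rather than a bug: the scaler only knows the minimum and maximum it saw in train.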

In [73]:
min_max_scaler(train, test)

(      monthly_charges  total_charges    tenure
 119          0.569008       0.600120  0.971831
 1424         0.018934       0.122363  0.760563
 385          0.903338       0.819206  0.901408
 1140         0.796213       0.790461  0.971831
 1504         0.975087       0.987552  0.985915
 435          0.930244       0.865796  0.943662
 571          0.389636       0.375123  0.788732
 656          0.212257       0.221033  0.690141
 756          0.888889       0.796610  0.859155
 574          0.522172       0.559356  0.943662
 1216         0.506228       0.537592  0.929577
 1057         0.068759       0.177362  0.845070
 1134         0.849527       0.880070  1.000000
 1253         0.075735       0.205291  0.957746
 1035         0.525162       0.601542  0.985915
 775          0.059791       0.161527  0.845070
 1403         0.015446       0.159383  0.957746
 515          0.017439       0.105240  0.690141
 1106         0.023418       0.134719  0.802817
 1694         0.404584       0.426168  0

In [59]:
# Create RobustScaler function

def iqr_robust_scaler(train, test):
    """Scales features using stats that are robust to outliers
       by removing the median and scaling data to the IQR.
       Takes in train and test sets and returns
       the scaler and scaled train and test sets

    """
    scaler = RobustScaler(quantile_range=(25.0,75.0), copy=True, with_centering=True, with_scaling=True).fit(train)
    train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns.values).set_index([train.index.values])
    test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns.values).set_index([test.index.values])
    return scaler, train_scaled, test_scaled
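
To see why the median and IQR are described as robust to outliers, the sketch below scales a made-up column containing one extreme value with both `RobustScaler` and `StandardScaler` and compares how the ordinary values end up spread out.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical train set with one extreme value, plus a single test row
train_demo = pd.DataFrame({"monthly_charges": [20.0, 40.0, 60.0, 80.0, 1000.0]})
test_demo = pd.DataFrame({"monthly_charges": [50.0]}, index=[99])

robust = RobustScaler(quantile_range=(25.0, 75.0)).fit(train_demo)
standard = StandardScaler().fit(train_demo)

robust_train = pd.DataFrame(robust.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
standard_train = pd.DataFrame(standard.transform(train_demo),
                              columns=train_demo.columns, index=train_demo.index)
robust_test = pd.DataFrame(robust.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

print(robust.center_, robust.scale_)                      # median and IQR
print(robust_train.monthly_charges.tolist())
print(standard_train.monthly_charges.round(3).tolist())

# Spread of the four ordinary values under each scaler
robust_spread = robust_train.monthly_charges.iloc[:4].max() - robust_train.monthly_charges.iloc[:4].min()
standard_spread = standard_train.monthly_charges.iloc[:4].max() - standard_train.monthly_charges.iloc[:4].min()
print(robust_spread, standard_spread)
```

The outlier inflates the mean and standard deviation, which squeezes the ordinary values together under `StandardScaler`; the median and IQR barely notice it.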

In [81]:
scaler, train_scaled, test_scaled = iqr_robust_scaler(train, test)

In [61]:
# Create Scaler Inverse function to return data to unscaled state

def scale_inverse(scaler, train_scaled, test_scaled):
    """Takes in the scaler and scaled train and test sets
       and returns the train and test sets
       in their original forms before scaling
    """
    train = pd.DataFrame(scaler.inverse_transform(train_scaled), columns=train_scaled.columns.values).set_index([train_scaled.index.values])
    test = pd.DataFrame(scaler.inverse_transform(test_scaled), columns=test_scaled.columns.values).set_index([test_scaled.index.values])
    return train, test
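
A round trip is a simple way to check an inverse function. The sketch below scales a made-up train/test pair with `StandardScaler`, undoes the scaling, and compares against the originals. The local `scale_inverse` is a compact restatement for the sake of a self-contained example, and the demo frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_inverse(scaler, train_scaled, test_scaled):
    """Undo a fitted scaler, keeping column names and row labels."""
    train = pd.DataFrame(scaler.inverse_transform(train_scaled),
                         columns=train_scaled.columns, index=train_scaled.index)
    test = pd.DataFrame(scaler.inverse_transform(test_scaled),
                        columns=test_scaled.columns, index=test_scaled.index)
    return train, test


# Hypothetical toy sets
train_demo = pd.DataFrame({"monthly_charges": [20.0, 50.0, 80.0, 110.0],
                           "tenure": [10.0, 30.0, 50.0, 72.0]}, index=[7, 2, 9, 4])
test_demo = pd.DataFrame({"monthly_charges": [65.0], "tenure": [40.0]}, index=[5])

scaler = StandardScaler().fit(train_demo)
train_scaled = pd.DataFrame(scaler.transform(train_demo),
                            columns=train_demo.columns, index=train_demo.index)
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=test_demo.columns, index=test_demo.index)

train_back, test_back = scale_inverse(scaler, train_scaled, test_scaled)

print(np.allclose(train_back, train_demo), np.allclose(test_back, test_demo))
```

For the linear scalers the round trip should match up to floating point error. For `QuantileTransformer` the inverse is based on interpolation between stored quantiles, so the recovered values may only be approximate.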

In [None]:
# telco[["total_charges"]] (double brackets) returns a one-column dataframe, not a 1D Series
telco_y = telco[["total_charges"]]
telco_y.head()

# x = df.drop(columns="total_charges")
# y = df.total_charges

In [None]:
telco_x = telco[["monthly_charges", "tenure"]]
telco_x.head()

In [None]:
# using mpg data
# breaking the two dataframes, train and test, into four

# x_train = train_df.drop(columns="hwy")
# y_train = train_df.hwy
# x_test = test_df.drop(columns="hwy")
# y_test = test_df.hwy
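
The same four-way breakdown can be sketched for the telco scenario, where `total_charges` is the target. The `demo` dataframe below is a generated stand-in, not the real data, and the variable names follow the commented mpg example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the telco dataframe
demo = pd.DataFrame(
    {
        "monthly_charges": np.linspace(20, 115, 20),
        "tenure": np.arange(1, 21),
    }
)
demo["total_charges"] = demo.monthly_charges * demo.tenure

train, test = train_test_split(demo, train_size=.8, random_state=123)

target = "total_charges"

# Features: everything except the target
x_train = train.drop(columns=target)
x_test = test.drop(columns=target)

# Target: double brackets keep a one-column dataframe
y_train = train[[target]]
y_test = test[[target]]

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

Keeping `y` as a one-column dataframe rather than a Series is convenient here because the scikit-learn scalers expect 2D input.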