# Deskew and Scale

#### Goals
* Split the data into training and test sets with `train_test_split`.
* Deskew using `BoxCoxTransformer` and scale using `StandardScaler`.
* Encode the categorical features.
* Replace the raw numeric features in the dummy-encoded set with the deskewed and scaled values.

#### Output
* DataFrames ready for benchmark model scoring.

In [1]:
cd ..

/home/jovyan/dsi/CAPSTONE


In [2]:
%run lib/__init__.py
%matplotlib inline

## 0. Load Data

In [3]:
# The whole dataset: numeric and categorical features together.
commute_df = pd.read_pickle('./data/dropped_correlated_features_df.pkl')
commute_df.shape

(1038, 102)

In [4]:
# Summary statistics frame; its index is used below to list the numerical columns.
commute_stats_df = pd.read_pickle('./data/commute_stats_dropped_correlated_features_df.pkl')
commute_stats_df.shape

(55, 8)

In [5]:
data_set   = commute_df.drop(['Alone_Share'], axis=1)
target_set = commute_df['Alone_Share']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(data_set, target_set, test_size=0.3)

In [7]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((726, 101), (726,), (312, 101), (312,))
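The split above passes no seed, so each run draws a different random partition and the row indices shown later will vary. A minimal sketch of how `random_state` would make the partition repeatable, assuming a small synthetic frame (`demo_df` is hypothetical) and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for commute_df: 10 rows, one feature, one target.
demo_df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "Alone_Share": np.linspace(0.1, 1.0, 10),
})

demo_X = demo_df.drop(["Alone_Share"], axis=1)
demo_y = demo_df["Alone_Share"]

# Passing the same random_state twice yields the same partition.
split_a = train_test_split(demo_X, demo_y, test_size=0.3, random_state=42)
split_b = train_test_split(demo_X, demo_y, test_size=0.3, random_state=42)

same_rows = split_a[0].index.equals(split_b[0].index)
print(same_rows, split_a[0].shape, split_a[1].shape)
```

The seed value itself is arbitrary; what matters is that it stays fixed between runs so downstream pickles stay comparable.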

## 1. Identify Numerical Features in `commute_df`
* Make a DataFrame containing only the numerical features.

In [8]:
numerical_columns = list(commute_stats_df.index)
numerical_columns.remove('Alone_Share')

In [9]:
X_train_numeric = X_train[numerical_columns]
X_test_numeric  = X_test[numerical_columns]

In [10]:
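# Box-Cox requires strictly positive inputs, so shift every value by a tiny
# offset to keep zeros from breaking the transform.
X_train_numeric = X_train_numeric + 1E-9
X_test_numeric  = X_test_numeric + 1E-9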
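Box-Cox is only defined for strictly positive inputs, which is why the cell above adds a tiny offset. The sketch below illustrates this with `scipy.stats.boxcox` on a hypothetical column (`raw`) that contains a zero; it is not the notebook's own `BoxCoxTransformer`, whose internals are not shown here.

```python
import numpy as np
from scipy import stats

# Hypothetical numeric column that contains a zero.
raw = np.array([0.0, 1.0, 2.0, 5.0, 10.0])

# scipy refuses non-positive data outright.
try:
    stats.boxcox(raw)
    zero_rejected = False
except ValueError:
    zero_rejected = True

# After the same tiny shift used above, the transform goes through.
shifted = raw + 1e-9
transformed, fitted_lambda = stats.boxcox(shifted)
print(zero_rejected, fitted_lambda)
```

One caveat worth keeping in mind: an offset this small turns former zeros into extreme values on a log-like scale, which can influence the fitted lambda. A larger shift such as 1 is a common alternative, though whether it suits this dataset is a judgment call.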

## 2. Pipeline Build

In [11]:
from lib.preprocessing import BoxCoxTransformer

In [12]:
pipeline = Pipeline([
    ('boxcox'  , BoxCoxTransformer()),
    ('ss'      , StandardScaler())
])

In [13]:
X_train_scaled = pipeline.fit_transform(X_train_numeric)

In [14]:
X_test_scaled = pipeline.transform(X_test_numeric)
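The pipeline is fitted on the training data only and then applied unchanged to the test data, so no test-set information leaks into the learned parameters. Since `lib.preprocessing.BoxCoxTransformer` is not shown in this notebook, the following is only a sketch of what such a transformer might look like; `SketchBoxCoxTransformer` and the synthetic `demo_train` / `demo_test` arrays are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class SketchBoxCoxTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: fits one Box-Cox lambda per column."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn a lambda for each column from the training data only.
        self.lambdas_ = [stats.boxcox(X[:, j])[1] for j in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Reuse the stored lambdas so new data is transformed like the training data.
        columns = [stats.boxcox(X[:, j], lmbda=lam)
                   for j, lam in enumerate(self.lambdas_)]
        return np.column_stack(columns)


# Synthetic right-skewed, strictly positive data.
rng = np.random.default_rng(0)
demo_train = rng.lognormal(size=(200, 2))
demo_test = rng.lognormal(size=(50, 2))

sketch_pipeline = Pipeline([
    ("boxcox", SketchBoxCoxTransformer()),
    ("ss", StandardScaler()),
])
demo_train_scaled = sketch_pipeline.fit_transform(demo_train)
demo_test_scaled = sketch_pipeline.transform(demo_test)
print(demo_train_scaled.mean(axis=0), demo_train_scaled.std(axis=0))
```

scikit-learn also ships `PowerTransformer(method="box-cox")`, which covers the same ground and standardizes its output by default; it may be a drop-in option if the custom class is not required.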

## 3. Represent Deskewed/Scaled Data in DataFrames
* `X_train_scaled_df`.
* `X_test_scaled_df`.

In [15]:
X_train_scaled_df = pd.DataFrame(X_train_scaled,
                                 columns=X_train_numeric.columns,
                                 index=X_train.index)
X_train_scaled_df.head()

Unnamed: 0,Response_Rate,Total_Employees,Goal_VMT,Total_VMT,Total_Goal_VMT,Goal_NDAT_Rate_(Worksite_only),Total_Goal_NDAT_Trips,Total_Annual_Greenhouse_Gas_Emissions_-__All_Employees_(Metric_Tons_CO2e),GHGforAgg_(Pounds),Total_Weekly_Trips,...,num_employees_using_bike_subsidy,num_employees_using_other_transportation_subsidy,num_parking_spaces_reserved_for_employee_usage,num_HOV_parking_spaces,num_shared_parking_spaces,cost_of_program_in_past_year,cost_of_meeting_program_requirements,cost_of_financial_incentives_subsidies_paid_to_employees,cost_of_facility_upkeep,cost_of_other
523,0.437152,0.463558,0.537554,0.81238,0.648889,0.315736,-0.456055,0.5769,0.822272,0.800148,...,2.114425,2.658552,0.975558,-0.683034,-0.646379,0.558697,0.937415,1.050897,-0.648478,-0.424837
421,-0.48639,0.868731,0.711178,1.447592,0.402041,0.467714,-0.456055,1.294936,1.451067,1.022063,...,-0.473585,-0.376177,1.055351,1.462402,1.576897,0.450305,0.713476,1.020397,-0.648478,-0.424837
1018,1.367102,-0.646111,0.054266,-0.662309,0.044093,0.565948,-0.456055,-0.853712,-0.653843,-0.199223,...,-0.473585,-0.376177,0.837192,-0.683034,-0.646379,0.41461,0.715195,0.999402,-0.648478,-0.424837
515,0.699034,0.13975,0.406376,-0.561566,0.10216,0.124206,-0.456055,0.076098,-0.609912,-0.991342,...,-0.473585,-0.376177,0.808302,1.42239,-0.646379,0.558697,0.802323,0.792646,1.545325,-0.424837
89,-1.112634,-1.5037,0.747509,-0.238811,0.297416,0.203777,-0.456055,-0.118634,-0.2237,-1.705242,...,-0.473585,-0.376177,0.860793,-0.683034,-0.646379,0.066609,0.719976,0.816699,-0.648478,-0.424837


In [16]:
X_test_scaled_df = pd.DataFrame(X_test_scaled,
                                columns=X_test_numeric.columns,
                                index=X_test.index)
X_test_scaled_df.head()

Unnamed: 0,Response_Rate,Total_Employees,Goal_VMT,Total_VMT,Total_Goal_VMT,Goal_NDAT_Rate_(Worksite_only),Total_Goal_NDAT_Trips,Total_Annual_Greenhouse_Gas_Emissions_-__All_Employees_(Metric_Tons_CO2e),GHGforAgg_(Pounds),Total_Weekly_Trips,...,num_employees_using_bike_subsidy,num_employees_using_other_transportation_subsidy,num_parking_spaces_reserved_for_employee_usage,num_HOV_parking_spaces,num_shared_parking_spaces,cost_of_program_in_past_year,cost_of_meeting_program_requirements,cost_of_financial_incentives_subsidies_paid_to_employees,cost_of_facility_upkeep,cost_of_other
757,0.323694,0.81855,-2.08619,0.624763,-2.006833,-2.120769,-0.456055,0.501015,0.581558,0.988425,...,-0.473585,-0.376177,0.822868,-0.683034,-0.646379,0.52449,0.572734,1.016707,1.564145,2.357765
998,-0.576198,0.513434,0.582508,0.648204,0.352973,0.253424,-0.456055,0.602568,0.605011,0.470548,...,-0.473585,-0.376177,0.912025,1.440436,-0.646379,0.547529,0.614431,0.845463,1.507287,-0.424837
700,-0.584288,1.805148,0.372468,1.775566,1.22177,0.455708,2.195848,1.63585,1.782911,1.925261,...,2.118665,2.658581,1.153941,1.504745,-0.646379,0.649882,1.491742,1.140845,1.584666,2.352755
213,1.474612,-0.034422,-0.025109,-0.043755,0.073447,0.731482,-0.456055,-0.435334,-0.029775,0.426354,...,-0.473585,-0.376177,-1.189249,-0.683034,-0.646379,0.41461,0.886891,0.797328,-0.648478,-0.424837
278,1.985462,-0.265411,-2.08619,-0.28017,-2.006833,-2.120769,-0.456055,-0.704244,-0.271608,0.339158,...,-0.473585,-0.376177,-1.189249,-0.683034,1.524916,1.018734,1.092158,1.018048,-0.648478,-0.424837


## 4. Encode Categorical Columns and Replace Raw Numeric Values in `X_train` and `X_test` with Deskewed/Scaled Values
* Drop the `UUID` column; it is an identifier, not a predictive feature.
* One-hot encode `X_train` and `X_test`.
* Overwrite the raw numeric columns in the encoded sets with their deskewed/scaled values.

In [17]:
# Reassign instead of dropping in place: X_train and X_test are slices produced
# by train_test_split, and in-place edits on them raise SettingWithCopyWarning.
X_train = X_train.drop(['UUID'], axis=1)
X_test  = X_test.drop(['UUID'], axis=1)


In [18]:
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies  = pd.get_dummies(X_test)
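Encoding the two sets separately only yields identical columns when every category appears in both. A category seen in just one split produces a column the other lacks, and matching shapes alone do not prove the names line up. A minimal sketch of one way to guard against this, using hypothetical toy frames (`demo_train`, `demo_test`) and `DataFrame.reindex`:

```python
import pandas as pd

# Hypothetical toy data: "ferry" appears only in test, "bike" and "car" only in train.
demo_train = pd.DataFrame({"mode": ["bus", "bike", "car"], "n": [1, 2, 3]})
demo_test = pd.DataFrame({"mode": ["bus", "ferry"], "n": [4, 5]})

train_dummies = pd.get_dummies(demo_train)
test_dummies = pd.get_dummies(demo_test)

# Force the test columns to match training: categories unseen in training
# are dropped, and categories missing from test are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Treating the training columns as the reference is the usual choice, because a model fitted on `X_train_dummies` can only consume the features it saw during fitting.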

In [19]:
for col in X_train_scaled_df.columns:
    X_train_dummies[col] = X_train_scaled_df[col]

In [20]:
for col in X_test_scaled_df.columns:
    X_test_dummies[col] = X_test_scaled_df[col]
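These column assignments rely on pandas aligning rows by index label, which is why the scaled DataFrames in section 3 were built with `index=X_train.index` and `index=X_test.index`. A small sketch with hypothetical values showing what happens with and without the matching index:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is shuffled, as after train_test_split.
demo = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[523, 421, 1018])
scaled_values = np.array([[-1.0], [0.0], [1.0]])

# With the original index, each scaled value lands on the matching row.
good = pd.DataFrame(scaled_values, columns=["x"], index=demo.index)
aligned = demo.copy()
aligned["x"] = good["x"]

# With a default RangeIndex (0, 1, 2) no labels match, so the column becomes NaN.
bad = pd.DataFrame(scaled_values, columns=["x"])
misaligned = demo.copy()
misaligned["x"] = bad["x"]

print(aligned["x"].tolist(), misaligned["x"].isna().all())
```

A quick `isna().sum()` on the replaced columns is a cheap way to confirm the alignment worked before pickling.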

In [21]:
X_test_dummies.shape, X_train_dummies.shape, y_train.shape, y_test.shape

((312, 1173), (726, 1173), (726,), (312,))

## 5. Pickling and Saving
* Encoded categorical and deskewed/scaled numeric DataFrames, for benchmark scoring.
* Numeric deskewed/scaled DataFrames, for outlier removal.

In [22]:
X_test_dummies.to_pickle('./data/X_test_dummies_df.pkl')
X_train_dummies.to_pickle('./data/X_train_dummies_df.pkl')

y_train.to_pickle('./data/y_train.pkl')
y_test.to_pickle('./data/y_test.pkl')
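Before later notebooks depend on these files, it can be worth confirming that a pickle reloads to an identical object. A minimal sketch of such a round-trip check, using a hypothetical Series (`demo_y`) and a temporary directory rather than the project's `./data` folder:

```python
import os
import tempfile

import pandas as pd

# Hypothetical target values keyed by the original row labels.
demo_y = pd.Series([0.71, 0.64, 0.80], index=[523, 421, 1018], name="Alone_Share")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_y.pkl")
    demo_y.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, index, and dtype.
round_trip_ok = restored.equals(demo_y)
print(round_trip_ok)
```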