# Deskew and Scale

#### Goals
* Split the data into training and test sets with `train_test_split`.
* Deskew using `BoxCoxTransformer` and scale using `StandardScaler`.
* Encode the categorical features.
* Replace raw numeric features in dummy set with deskewed and scaled values.

#### Output
* DataFrames ready for benchmark model scoring.

In [1]:
cd ..

/home/jovyan/Capstone


In [2]:
%run lib/__init__.py
%matplotlib inline

## 0. Load Data

In [3]:
# The whole dataset: numeric and categorical features together.
commute_df = pd.read_pickle('./data/dropped_correlated_features_df.pkl')
commute_df.shape

(1038, 87)

In [4]:
# Summary-statistics table; its index is used below to list the numerical columns.
commute_stats_df = pd.read_pickle('./data/commute_stats_dropped_correlated_features_df.pkl')
commute_stats_df.shape

(40, 8)

In [5]:
data_set   = commute_df.drop(['Alone_Share'], axis=1)
target_set = commute_df['Alone_Share']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(data_set, target_set, test_size=0.3)

In [7]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((726, 86), (726,), (312, 86), (312,))
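
The split above does not fix a seed, so each run selects different rows. A minimal sketch of a repeatable split, using toy data in place of `data_set` and `target_set` (the values and the seed `42` are assumptions for illustration, not taken from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for data_set / target_set (hypothetical values).
toy_X = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) * 2.0})
toy_y = pd.Series(np.arange(10) / 10.0, name='Alone_Share')

# A fixed random_state makes the shuffled 70/30 split repeatable across runs.
split_1 = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)
split_2 = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)

# Both calls select exactly the same training rows.
same_rows = split_1[0].index.equals(split_2[0].index)
train_rows, test_rows = len(split_1[0]), len(split_1[1])
```

With a fixed seed, downstream pickles and benchmark scores can be regenerated from the same rows.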

## 1. Identify Numerical Features in `commute_df`
* Make a DataFrame containing only the numerical features.

In [8]:
numerical_columns = list(commute_stats_df.index)
numerical_columns.remove('Alone_Share')

In [9]:
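# .copy() gives independent frames, so the in-place offset below cannot touch X_train / X_test
X_train_numeric = X_train[numerical_columns].copy()
X_test_numeric  = X_test[numerical_columns].copy()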

In [10]:
# Box-Cox needs strictly positive inputs; a tiny offset keeps zero values valid
X_train_numeric += 1E-9
X_test_numeric += 1E-9
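
Why the offset matters can be seen with `scipy.stats.boxcox`, which rejects any value that is not strictly positive. This is a sketch on a made-up feature; whether `BoxCoxTransformer` calls SciPy internally is an assumption, since its source is not shown here.

```python
import numpy as np
from scipy import stats

# Hypothetical skewed feature that contains a zero.
counts = np.array([0.0, 1.0, 4.0, 20.0, 150.0])

# stats.boxcox raises ValueError when any value is <= 0.
try:
    stats.boxcox(counts)
    raw_ok = True
except ValueError:
    raw_ok = False

# After the same tiny shift used above, the transform succeeds.
shifted = counts + 1e-9
transformed, fitted_lambda = stats.boxcox(shifted)
```

The offset only rescues zeros. A column with negative values would still fail and would need a larger shift or a different transform.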

## 2. Pipeline Build

In [11]:
from lib.preprocessing import BoxCoxTransformer

In [12]:
pipeline = Pipeline([
    ('boxcox'  , BoxCoxTransformer()),
    ('ss'      , StandardScaler())
])

In [13]:
X_train_scaled = pipeline.fit_transform(X_train_numeric)

In [14]:
X_test_scaled = pipeline.transform(X_test_numeric)
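
`BoxCoxTransformer` lives in `lib.preprocessing` and its code is not shown in this notebook. The sketch below is a hypothetical stand-in (`SketchBoxCoxTransformer` is an invented name) that illustrates the fit/transform contract the pipeline relies on: one Box-Cox lambda per column is learned from the training data, and the same lambdas are reused on the test data.

```python
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class SketchBoxCoxTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: learns one Box-Cox lambda per column in fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # With lmbda unset, stats.boxcox returns (transformed, fitted_lambda).
        self.lambdas_ = [stats.boxcox(X[:, j])[1] for j in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Reuse the lambdas learned on the training data.
        cols = [stats.boxcox(X[:, j], lmbda=lam)
                for j, lam in enumerate(self.lambdas_)]
        return np.column_stack(cols)


# Toy strictly positive, right-skewed data (assumed for illustration).
rng = np.random.RandomState(0)
toy_train = rng.lognormal(size=(200, 3))
toy_test = rng.lognormal(size=(50, 3))

sketch_pipeline = Pipeline([
    ('boxcox', SketchBoxCoxTransformer()),
    ('ss', StandardScaler()),
])
toy_train_scaled = sketch_pipeline.fit_transform(toy_train)
toy_test_scaled = sketch_pipeline.transform(toy_test)
```

Fitting on the training set only, then calling `transform` on the test set, keeps information from the test rows out of the learned lambdas and scaling statistics.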

## 3. Represent Deskewed/Scaled Data in DataFrames
* `X_train_scaled_df`.
* `X_test_scaled_df`.

In [15]:
X_train_scaled_df = pd.DataFrame(X_train_scaled,
                                 columns=X_train_numeric.columns,
                                 index=X_train.index)
X_train_scaled_df.head()

| | `Response_Rate` | `Total_Employees` | `VMT/ Employee` | `Goal_VMT` | `Total_VMT` | `Total_Goal_VMT` | `Total_Annual_Greenhouse_Gas_Emissions_-__All_Employees_(Metric_Tons_CO2e)` | `Daily_Roundtrip_GHG_Per_Employee_(Pounds)` | `Weekly_CWW_Days` | `Weekly_Overnight_Business_Trip` | ... | `num_employees_using_bike_subsidy` | `num_employees_using_other_transportation_subsidy` | `num_parking_spaces_reserved_for_employee_usage` | `num_HOV_parking_spaces` | `num_shared_parking_spaces` | `cost_of_program_in_past_year` | `cost_of_meeting_program_requirements` | `cost_of_financial_incentives_subsidies_paid_to_employees` | `cost_of_facility_upkeep` | `cost_of_other` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 134 | 0.456246 | -1.741409 | -0.837291 | -2.030305 | -1.648725 | -1.964458 | -1.626407 | -0.913909 | -1.250537 | 0.293563 | ... | -0.493101 | -0.36415 | -1.169933 | -0.693884 | -0.659225 | -1.470294 | -1.228342 | 0.956089 | -0.616545 | -0.413481 |
| 410 | 2.259471 | 0.858069 | 0.587983 | 0.544787 | 0.96378 | 0.566393 | 0.903881 | 0.612237 | 0.555684 | -1.419059 | ... | 2.029609 | -0.36415 | 0.930982 | 1.460636 | 1.541344 | 0.072496 | 0.564846 | 0.407263 | 1.624221 | -0.413481 |
| 43 | 0.269411 | -0.195934 | 0.189247 | 0.682996 | 0.146021 | 0.602948 | -0.028154 | 0.207436 | 0.768923 | 0.712107 | ... | -0.493101 | -0.36415 | 0.880743 | -0.693884 | -0.659225 | 0.451868 | 0.678434 | -1.115292 | -0.616545 | -0.413481 |
| 710 | 0.532237 | -0.395032 | 0.863619 | 0.76037 | 0.455975 | 0.490265 | 0.177673 | 0.87294 | 0.723289 | -1.419059 | ... | -0.493101 | -0.36415 | 0.621417 | -0.693884 | -0.659225 | 0.585624 | 0.744948 | 0.986381 | -0.616545 | -0.413481 |
| 718 | -0.637704 | 1.238708 | -0.284978 | 0.49848 | -0.098461 | 0.506806 | 0.791365 | -0.363963 | 0.723289 | 0.504873 | ... | -0.493101 | -0.36415 | 0.787347 | 1.460636 | 1.553309 | 0.496422 | 0.825845 | 0.524255 | -0.616545 | -0.413481 |


In [16]:
X_test_scaled_df = pd.DataFrame(X_test_scaled,
                                columns=X_test_numeric.columns,
                                index=X_test.index)
X_test_scaled_df.head()

| | `Response_Rate` | `Total_Employees` | `VMT/ Employee` | `Goal_VMT` | `Total_VMT` | `Total_Goal_VMT` | `Total_Annual_Greenhouse_Gas_Emissions_-__All_Employees_(Metric_Tons_CO2e)` | `Daily_Roundtrip_GHG_Per_Employee_(Pounds)` | `Weekly_CWW_Days` | `Weekly_Overnight_Business_Trip` | ... | `num_employees_using_bike_subsidy` | `num_employees_using_other_transportation_subsidy` | `num_parking_spaces_reserved_for_employee_usage` | `num_HOV_parking_spaces` | `num_shared_parking_spaces` | `cost_of_program_in_past_year` | `cost_of_meeting_program_requirements` | `cost_of_financial_incentives_subsidies_paid_to_employees` | `cost_of_facility_upkeep` | `cost_of_other` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 1.258809 | 0.270873 | 1.005276 | 0.786664 | 1.026471 | 0.684649 | 0.71233 | 1.02644 | 0.660217 | -1.419059 | ... | -0.493101 | -0.36415 | 0.84757 | 1.433742 | -0.659225 | 0.762388 | -1.228342 | -1.115292 | 1.645262 | -0.413481 |
| 58 | 0.29715 | -0.179586 | -1.586423 | -2.030305 | -1.693471 | -1.964458 | -1.253397 | -1.655216 | 0.555684 | 0.610508 | ... | -0.493101 | -0.36415 | -1.169933 | -0.693884 | -0.659225 | 0.243269 | 0.79202 | 0.899306 | -0.616545 | -0.413481 |
| 614 | -0.020179 | -0.388417 | 1.811998 | 1.034533 | 0.887389 | 0.537382 | 0.676779 | 1.839676 | 0.723289 | -1.419059 | ... | -0.493101 | -0.36415 | 0.903914 | 1.421742 | -0.659225 | 0.886196 | 1.021792 | 1.249013 | 1.641478 | -0.413481 |
| 404 | 2.316566 | 1.488512 | 0.826552 | 0.67555 | 1.701801 | 0.929962 | 1.545771 | 0.844571 | 1.09785 | -1.419059 | ... | -0.493101 | -0.36415 | 0.994716 | 1.4764 | -0.659225 | -1.470294 | -1.228342 | -1.115292 | -0.616545 | -0.413481 |
| 628 | 2.078644 | 0.532725 | 0.730687 | 0.668029 | 0.724611 | 0.377252 | 0.587073 | 0.746936 | 0.901621 | 0.293563 | ... | -0.493101 | -0.36415 | 1.033583 | 1.408057 | -0.659225 | -1.470294 | -1.228342 | -1.115292 | -0.616545 | -0.413481 |
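
A quick sanity check on frames like these is that every scaled training column has mean 0 and standard deviation 1, while the test columns are only close to those values because the scaler was fit on the training rows. A minimal sketch on toy data (the column names `f1` and `f2` and all values are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for the real train/test frames.
rng = np.random.RandomState(1)
toy_train = pd.DataFrame(rng.normal(5, 2, size=(100, 2)), columns=['f1', 'f2'])
toy_test = pd.DataFrame(rng.normal(5, 2, size=(40, 2)), columns=['f1', 'f2'])

# Fit on train only, then rebuild DataFrames with the original labels.
scaler = StandardScaler().fit(toy_train)
train_scaled = pd.DataFrame(scaler.transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)
test_scaled = pd.DataFrame(scaler.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

# StandardScaler divides by the population standard deviation, hence ddof=0.
train_means = train_scaled.mean()
train_stds = train_scaled.std(ddof=0)
test_means = test_scaled.mean()  # near 0, but not exactly 0
```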


## 4. Encode Categorical Columns and Replace Raw Numeric Values in `X_train` and `X_test` with Deskewed/Scaled Values
* Drop the `UUID` column; an identifier should not be used as a predictor.
* One-hot encode the categorical columns of `X_train` and `X_test`.
* Overwrite the raw numeric columns in the encoded DataFrames with their deskewed/scaled values.

In [17]:
# Reassign rather than drop in place: X_train and X_test come from train_test_split,
# and in-place edits on them trigger pandas' SettingWithCopyWarning.
X_train = X_train.drop(['UUID'], axis=1)
X_test  = X_test.drop(['UUID'], axis=1)


In [18]:
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies  = pd.get_dummies(X_test)
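
Encoding the two sets separately only yields matching columns when both contain the same category values. The shapes reported further down show that the column counts agree for this split, but that is not guaranteed in general. A hedged sketch of one way to align a test frame to the training columns, on toy data (the `mode` and `n` columns are hypothetical):

```python
import pandas as pd

# Toy frames: 'ferry' appears only in test, 'bike' and 'car' only in train.
toy_train = pd.DataFrame({'mode': ['bus', 'bike', 'car'], 'n': [1, 2, 3]})
toy_test = pd.DataFrame({'mode': ['bus', 'ferry'], 'n': [4, 5]})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Reindex test to the training columns: categories unseen in training
# ('mode_ferry') are dropped, and categories missing from test are added
# as all-zero columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

After alignment, a model fit on `train_dummies` sees the same features in the same order at prediction time.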

In [19]:
for col in X_train_scaled_df.columns:
    X_train_dummies[col] = X_train_scaled_df[col]

In [20]:
for col in X_test_scaled_df.columns:
    X_test_dummies[col] = X_test_scaled_df[col]

In [21]:
X_test_dummies.shape, X_train_dummies.shape, y_train.shape, y_test.shape

((312, 1158), (726, 1158), (726,), (312,))
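
The column replacement in the loops above works because pandas aligns a Series on index labels when it is assigned to a DataFrame column. That is why the scaled frames were built with `index=X_train.index` and `index=X_test.index`. A small sketch with made-up values shows the alignment:

```python
import pandas as pd

# Encoded frame with a raw numeric column 'x' (hypothetical values).
dummies = pd.DataFrame({'x': [10.0, 20.0, 30.0], 'cat_a': [1, 0, 1]},
                       index=[134, 410, 43])

# Scaled values for the same rows, deliberately stored in a different order.
scaled = pd.DataFrame({'x': [1.2, -1.2, 0.0]}, index=[410, 134, 43])

# Assignment matches rows by index label, not by position.
for col in scaled.columns:
    dummies[col] = scaled[col]
```

If the scaled frame had been built without the original index, its default `0..n-1` labels would not match the shuffled row labels, and unmatched rows would receive `NaN`.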

## 5. Pickling and Saving
* Encoded categorical and deskewed/scaled numeric DataFrames, for benchmark scoring.
* Deskewed/scaled numeric features, for later outlier removal.

In [22]:
X_test_dummies.to_pickle('./data/X_test_dummies_df.pkl')
X_train_dummies.to_pickle('./data/X_train_dummies_df.pkl')

y_train.to_pickle('./data/y_train.pkl')
y_test.to_pickle('./data/y_test.pkl')
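
A later notebook can reload these files with `pd.read_pickle`. The sketch below checks a round trip on a toy frame written to a temporary directory (the file name `toy_df.pkl` is an assumption, chosen so that nothing under `./data/` is touched):

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({'a': [0.5, -1.5], 'b': [1, 0]}, index=[231, 58])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_df.pkl')
    toy_df.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Values, dtypes, index, and column labels all survive the round trip.
round_trip_ok = reloaded.equals(toy_df)
```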