# Deskew and Scale

#### Goals
* Split the data into training and test sets with `train_test_split`.
* Deskew using `BoxCoxTransformer` and scale using `StandardScaler`.
* Encode the categorical features.
* Replace the raw numeric features in the dummy-encoded set with the deskewed and scaled values.

#### Output
* DataFrames ready for benchmark model scoring.

In [1]:
cd ..

/home/jovyan/dsi/CAPSTONE


In [2]:
%run lib/__init__.py
%matplotlib inline

## 0. Load Data

In [3]:
# The full dataset: numeric and categorical features together.
commute_df = pd.read_pickle('./data/dropped_correlated_features_df.pkl')
commute_df.shape

(1038, 86)

In [4]:
# Summary-statistics table; its index lists the numerical columns.
commute_stats_df = pd.read_pickle('./data/commute_stats_dropped_correlated_features_df.pkl')
commute_stats_df.shape

(39, 8)

In [5]:
data_set   = commute_df.drop(['Alone_Share'], axis=1)
target_set = commute_df['Alone_Share']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(data_set, target_set, test_size=0.3)

In [7]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((726, 85), (726,), (312, 85), (312,))
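
The split in cell 6 does not fix a seed, so the rows that land in each set change on every run. A minimal sketch, assuming reproducibility is wanted; the toy frame and the seed value `42` are illustrative and not taken from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for data_set / target_set (hypothetical values).
demo_X = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2.0})
demo_y = pd.Series(np.arange(10) / 10.0, name="target")

# Passing random_state makes the shuffle deterministic across runs.
split_one = train_test_split(demo_X, demo_y, test_size=0.3, random_state=42)
split_two = train_test_split(demo_X, demo_y, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_one
print(X_tr.shape, X_te.shape)
```

With the same `random_state`, both calls return identical splits, which keeps downstream scores comparable between runs.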

## 1. Identify Numerical Features in `commute_df`
* Make a DataFrame containing only the numerical features.

In [8]:
numerical_columns = list(commute_stats_df.index)
numerical_columns.remove('Alone_Share')

In [9]:
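# .copy() gives independent frames, so later in-place edits do not act on slices of X_train / X_test.
X_train_numeric = X_train[numerical_columns].copy()
X_test_numeric  = X_test[numerical_columns].copy()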

In [10]:
# Box-Cox requires strictly positive inputs, so shift every value by a tiny offset.
X_train_numeric += 1E-9
X_test_numeric += 1E-9
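
Box-Cox is only defined for strictly positive inputs, which is why the cell above shifts every value before transforming. The small sketch below uses `scipy.stats.boxcox` on made-up numbers to show the failure on a zero and the effect of the offset. This is an assumption for illustration: the notebook's own `BoxCoxTransformer` may be implemented differently.

```python
import numpy as np
from scipy import stats

values = np.array([0.0, 1.0, 2.0, 5.0, 10.0])  # toy data containing a zero

try:
    stats.boxcox(values)
    raised = False
except ValueError:
    # SciPy rejects any non-positive entry.
    raised = True

# After the same tiny shift used in the notebook, the transform succeeds
# and also returns the lambda it estimated.
shifted, fitted_lambda = stats.boxcox(values + 1e-9)
```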

## 2. Pipeline Build

In [11]:
from lib.preprocessing import BoxCoxTransformer
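
`lib.preprocessing.BoxCoxTransformer` is project-local and its source is not shown here. The following is a hypothetical sketch of what such a transformer could look like, assuming it fits one Box-Cox lambda per column on the training data and reuses those lambdas at transform time. The class name `SketchBoxCoxTransformer` and the `lambdas_` attribute are inventions for illustration, not the project's API.

```python
import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin


class SketchBoxCoxTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for lib.preprocessing.BoxCoxTransformer."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Estimate one lambda per column from the data passed to fit.
        self.lambdas_ = np.array(
            [stats.boxcox(X[:, j])[1] for j in range(X.shape[1])]
        )
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Reuse the fitted lambdas so train and test share one mapping.
        return np.column_stack(
            [stats.boxcox(X[:, j], lmbda=self.lambdas_[j])
             for j in range(X.shape[1])]
        )
```

Because it implements `fit` and `transform`, a class like this can be dropped into a scikit-learn `Pipeline` ahead of `StandardScaler`.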

In [12]:
pipeline = Pipeline([
    ('boxcox'  , BoxCoxTransformer()),
    ('ss'      , StandardScaler())
])
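
If the project-local transformer is not available, scikit-learn's `PowerTransformer(method="box-cox")` is one possible substitute. This is a swapped-in technique rather than the notebook's method, and its results may differ from those of `BoxCoxTransformer`. The random data below is purely illustrative.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.lognormal(size=(200, 3))  # strictly positive toy data
X_test_demo = rng.lognormal(size=(50, 3))

alt_pipeline = Pipeline([
    # standardize=False leaves the scaling to the explicit StandardScaler step.
    ("boxcox", PowerTransformer(method="box-cox", standardize=False)),
    ("ss", StandardScaler()),
])

train_out = alt_pipeline.fit_transform(X_train_demo)
test_out = alt_pipeline.transform(X_test_demo)
```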

In [13]:
X_train_scaled = pipeline.fit_transform(X_train_numeric)

In [14]:
X_test_scaled = pipeline.transform(X_test_numeric)
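
Cells 13 and 14 fit the pipeline on the training split and only transform the test split, so the test rows never influence the learned parameters. The sketch below illustrates the same pattern with a bare `StandardScaler` on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])  # hypothetical training column
test_demo = np.array([[10.0], [20.0]])               # hypothetical test column

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean and std from train only
test_scaled = scaler.transform(test_demo)        # reuses the training statistics

# The stored mean is the training mean; the test data did not change it.
print(scaler.mean_)
```

The scaled training column is centered on zero, while the scaled test column generally is not, which is expected when the statistics come from the training data alone.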

## 3. Represent Deskewed/Scaled Data in DataFrames
* `X_train_scaled_df`.
* `X_test_scaled_df`.

In [15]:
X_train_scaled_df = pd.DataFrame(X_train_scaled,
                                 columns=X_train_numeric.columns,
                                 index=X_train.index)
X_train_scaled_df.head()

index,Response_Rate,Total_Employees,Goal_VMT,Total_VMT,Total_Goal_VMT,Total_Annual_Greenhouse_Gas_Emissions_-__All_Employees_(Metric_Tons_CO2e),Daily_Roundtrip_GHG_Per_Employee_(Pounds),Weekly_CWW_Days,Weekly_Overnight_Business_Trip,Weekly_Did_Not_Work,...,num_employees_using_bike_subsidy,num_employees_using_other_transportation_subsidy,num_parking_spaces_reserved_for_employee_usage,num_HOV_parking_spaces,num_shared_parking_spaces,cost_of_program_in_past_year,cost_of_meeting_program_requirements,cost_of_financial_incentives_subsidies_paid_to_employees,cost_of_facility_upkeep,cost_of_other
429,0.770679,-0.443029,0.696461,0.562495,0.665233,0.281685,1.089895,0.69263,0.586213,0.37189,...,-0.458257,-0.354375,-1.137151,1.482903,1.574046,0.609278,0.665775,1.001802,-0.605931,-0.404303
677,1.540363,0.825398,-0.016709,-0.569912,0.29527,-0.775412,-1.902601,1.103329,1.256703,1.426133,...,2.18465,-0.354375,-1.137151,-0.667929,-0.637851,-1.407234,-1.178743,1.062026,-0.605931,-0.404303
682,-1.938914,1.353697,0.675026,1.565248,0.82792,1.716053,1.352643,0.9554,0.887688,1.115471,...,-0.458257,-0.354375,-1.137151,-0.667929,-0.637851,-0.034231,0.938142,1.096511,1.614309,-0.404303
206,-0.250261,1.231941,-2.005668,0.730569,-1.944997,0.577863,-0.678404,0.796708,1.083946,0.482577,...,-0.458257,-0.354375,0.79517,-0.667929,1.596272,0.45083,0.863267,0.818004,-0.605931,-0.404303
875,-0.685495,0.4299,-2.005668,0.310486,-1.944997,0.331011,0.06569,0.796708,-1.386123,0.135168,...,2.18354,-0.354375,0.994845,1.457119,-0.637851,1.736993,0.962564,0.82278,-0.605931,-0.404303


In [16]:
X_test_scaled_df = pd.DataFrame(X_test_scaled,
                                columns=X_test_numeric.columns,
                                index=X_test.index)
X_test_scaled_df.head()

index,Response_Rate,Total_Employees,Goal_VMT,Total_VMT,Total_Goal_VMT,Total_Annual_Greenhouse_Gas_Emissions_-__All_Employees_(Metric_Tons_CO2e),Daily_Roundtrip_GHG_Per_Employee_(Pounds),Weekly_CWW_Days,Weekly_Overnight_Business_Trip,Weekly_Did_Not_Work,...,num_employees_using_bike_subsidy,num_employees_using_other_transportation_subsidy,num_parking_spaces_reserved_for_employee_usage,num_HOV_parking_spaces,num_shared_parking_spaces,cost_of_program_in_past_year,cost_of_meeting_program_requirements,cost_of_financial_incentives_subsidies_paid_to_employees,cost_of_facility_upkeep,cost_of_other
638,0.325947,-0.435773,-2.005668,-0.622037,-1.944997,-0.88678,-0.81107,-1.224324,0.934854,0.439733,...,2.18354,2.822108,0.932171,-0.667929,-0.637851,1.5052,1.117837,1.255224,1.671971,2.476006
895,0.118604,1.519155,0.185072,0.931451,0.792965,0.706854,-0.954542,1.29162,0.530506,1.711614,...,-0.458257,-0.354375,-1.137151,-0.667929,1.589777,1.323447,-1.178743,1.261797,-0.605931,-0.404303
9,0.441827,-0.232053,0.689387,0.596051,0.572806,0.378329,0.98263,0.69263,0.329706,-0.140092,...,2.174233,2.820505,0.393177,1.439191,1.482008,0.45083,0.624939,0.496711,-0.605931,-0.404303
553,0.0724,0.174648,0.521614,0.534995,0.546757,0.278005,0.454582,0.69263,0.530506,0.836072,...,-0.458257,-0.354375,-1.137151,-0.667929,-0.637851,1.192573,-1.178743,-1.089071,1.661972,2.475886
426,-0.188895,0.461692,0.724086,1.198421,0.448794,0.95595,1.352643,-1.224324,0.811742,1.042323,...,-0.458257,-0.354375,-1.137151,-0.667929,-0.637851,0.45083,0.624939,0.496711,-0.605931,-0.404303


## 4. Encode Categorical Columns and Replace Raw Numeric Values in `X_train` and `X_test` with Deskewed/Scaled Values
* Drop the `UUID` column; an identifier should not be used as a predictor.
* Encode the categorical columns of `X_train` and `X_test`.
* Overwrite the raw numeric columns in the encoded sets with their deskewed/scaled values.

In [17]:
# Reassign instead of dropping in place: X_train / X_test come from
# train_test_split, and inplace=True on them raises SettingWithCopyWarning.
X_train = X_train.drop(['UUID'], axis=1)
X_test = X_test.drop(['UUID'], axis=1)

In [18]:
X_train_dummies = pd.get_dummies(X_train)
X_test_dummies  = pd.get_dummies(X_test)

In [19]:
for col in X_train_scaled_df.columns:
    X_train_dummies[col] = X_train_scaled_df[col]

In [20]:
for col in X_test_scaled_df.columns:
    X_test_dummies[col] = X_test_scaled_df[col]
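
The column-by-column assignment in cells 19 and 20 works because pandas aligns the assigned Series on the index, and the scaled DataFrames were built with `index=X_train.index` and `index=X_test.index`. The toy example below (values and column name invented) shows what would happen if the scaled frame had kept a default `RangeIndex` instead:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[429, 677, 682])
scaled_values = np.array([[-1.0], [0.0], [1.0]])

# Same index as the raw frame: the assignment lines up row by row.
aligned = pd.DataFrame(scaled_values, columns=["x"], index=raw.index)
# Default RangeIndex (0, 1, 2): no labels in common with the raw frame.
misaligned = pd.DataFrame(scaled_values, columns=["x"])

good = raw.copy()
good["x"] = aligned["x"]

bad = raw.copy()
bad["x"] = misaligned["x"]  # every row becomes NaN

print(good["x"].tolist())
print(bad["x"].isna().all())
```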

In [21]:
X_test_dummies.shape, X_train_dummies.shape, y_train.shape, y_test.shape

((312, 1157), (726, 1157), (726,), (312,))
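
Here both encoded sets end up with 1157 columns. When `pd.get_dummies` is applied to each split separately, that is not guaranteed in general: a category that appears in only one split produces a column in only that split. A hedged sketch of one common remedy, reindexing the test columns to match the training columns, on toy data (column and category names are invented):

```python
import pandas as pd

train_demo = pd.DataFrame({"mode": ["bus", "bike", "car"]})
test_demo = pd.DataFrame({"mode": ["bus", "train"]})  # "train" is unseen in training

train_dummies_demo = pd.get_dummies(train_demo, dtype=int)

# Keep exactly the training columns: categories missing from the test set are
# filled with 0, and categories unseen in training are dropped.
test_dummies_demo = pd.get_dummies(test_demo, dtype=int).reindex(
    columns=train_dummies_demo.columns, fill_value=0
)

print(list(test_dummies_demo.columns))
```

A quick `list(X_train_dummies.columns) == list(X_test_dummies.columns)` check before scoring would confirm that the names and order agree, not only the counts.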

## 5. Pickling and Saving
* Encoded categorical / normalized numeric DataFrames for benchmark scoring.
* Deskewed/scaled numeric DataFrames for outlier removal.

In [22]:
X_test_dummies.to_pickle('./data/X_test_dummies_df.pkl')
X_train_dummies.to_pickle('./data/X_train_dummies_df.pkl')

y_train.to_pickle('./data/y_train.pkl')
y_test.to_pickle('./data/y_test.pkl')
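
A later notebook can reload these files with `pd.read_pickle`. The round trip below is a sketch on a toy frame written to a temporary directory, so it does not touch `./data`; the file name is made up.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"a": [1.0, 2.0], "b": [0, 1]}, index=[429, 677])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_df.pkl")  # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# Index, dtypes, and values are compared by DataFrame.equals.
print(restored.equals(demo))
```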