# 3 train_test_split data<a id='3_train_test_split data'></a>

Start with the trimmed set of columns from the ps_performance and concordance results, which were visualized mainly in this Tibco Spotfire file:
/git_repositories/DataScienceCapstoneTwo/spotfire/data_cleaning_step2_EDA.dxp

In this dataset, most of the unnecessary columns have been removed, along with columns highly correlated with the remaining metrics.

Do some final feature engineering.

Split into 70:30 train:test sets, preserving the proportions of both OriginalCT AND the quality_binary bin.

Be prepared during model building to then subsample the quality_binary=good class to about 7% of the original (again preserving proportions of OriginalCT), so that the good:bad split in the training set is about 60:40.

Don't drop sparse columns yet, since some models can handle them.

Don't scale columns yet, since some models can handle unscaled data.

In [1]:
import os
import pandas as pd
#import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

## 3.1 Load and inspect data

In [2]:
# also set probeset_id as the index.  Will propagate to all derived dataframes.
ps_data = pd.read_csv('../data/data_cleaning_step2.zip',sep='\t', index_col = 'probeset_id')

In [3]:
ps_data.head(7).T

probeset_id,AX-100003653,AX-100004573,AX-100004941,AX-100006840,AX-100007392,AX-100007701,AX-100008742
quality_bin,high,low,high,low,high,low,marginal
quality_score,4.463,0.733,5.0,2.638,4.68,1.968,4.12
OriginalCT.recommended,True,False,True,True,True,True,False
OriginalCT,PolyHighResolution,Other,MonoHighResolution,PolyHighResolution,PolyHighResolution,PolyHighResolution,Other
CC,0.996,0.953,1.0,0.975,0.996,0.963,0.989
CR,98.913,97.464,100.0,98.188,100.0,99.638,100.0
FLD,6.761,3.694,,4.352,6.359,7.161,
HetSO,0.745,-0.28,,0.421,0.36,0.161,
Nclus,3,2,1,3,3,3,1
MMD,36.317,,,29.674,47.444,50.152,


In [4]:
# Some metrics are only computed for certain categories of probeset, so they are missing for the rest.
missing = pd.concat([ps_data.isnull().sum(), 100 * ps_data.isnull().mean()], axis=1)
missing.columns = ['count_missing', 'pct_missing']  # second column is a percentage (0-100), not a fraction
missing.sort_values(by='count_missing', ascending=False)

,count_missing,pct_missing
MMD,549639,68.011582
AA.meanX.clean,340804,42.170623
AA.varY.clean,340804,42.170623
AA.varX.clean,340804,42.170623
AA.meanY.clean,340804,42.170623
BB.varY.clean,208499,25.799383
BB.varX.clean,208499,25.799383
BB.meanY.clean,208499,25.799383
BB.meanX.clean,208499,25.799383
FLD,30841,3.816223
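
To see which probeset categories account for the missing values, one option is to compute the missing percentage per metric within each OriginalCT group. The sketch below uses a small invented DataFrame (`toy`) as a stand-in for `ps_data`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ps_data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "OriginalCT": ["PolyHighResolution", "PolyHighResolution",
                       "MonoHighResolution", "MonoHighResolution", "Other"],
        "FLD": [6.8, 4.4, np.nan, np.nan, 3.7],
        "MMD": [36.3, 29.7, np.nan, np.nan, np.nan],
    }
)

# Percentage of missing values per metric within each OriginalCT category.
is_missing = toy.drop(columns="OriginalCT").isna()
pct_missing_by_ct = is_missing.groupby(toy["OriginalCT"]).mean() * 100
print(pct_missing_by_ct)
```

A metric that is 100% missing in one category and 0% missing in another is structurally absent rather than randomly missing, which may matter when choosing an imputation strategy later.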


## 3.2 More feature engineering

### 3.2.1 Prediction feature

In [5]:
# count of probesets in each group that I want to split into test/train sets by same proportion
ps_data.pivot_table(index=['OriginalCT'],columns=['quality_bin'],values='CR',aggfunc='count') \
       .sort_values(by='high',ascending=False)[['high','marginal','low']]

OriginalCT,high,marginal,low
NoMinorHom,423812.0,74654.0,5736.0
PolyHighResolution,191118.0,37684.0,6774.0
MonoHighResolution,25750.0,2116.0,952.0
Other,3795.0,4382.0,16754.0
ABvarianceY,1263.0,1061.0,549.0
ABvarianceX,1244.0,1007.0,256.0
AAvarianceY,1125.0,260.0,73.0
BBvarianceX,1094.0,265.0,67.0
BBvarianceY,981.0,319.0,142.0
AAvarianceX,671.0,141.0,74.0


In [6]:
# I want the model to predict only 2 classes, 'good' and 'bad', not the current 3 ('high', 'marginal', 'low').
# Define good quality as Concordance (CC) with reference data above 98.5% and Call Rate (CR) above 95%.
# Note the strict '>' comparisons: a probeset sitting exactly on either threshold is labelled 'bad'.
# This roughly corresponds to combining the 'high' and 'marginal' bins into 'good', and 'low' into 'bad'.
# There are no missing values for CC and CR.
ps_data['quality_binary'] = np.where((ps_data['CC'] > 0.985) & (ps_data['CR'] > 95), 'good', 'bad')

In [7]:
ps_data[['CC','CR','quality_bin','quality_binary']].head(7)

probeset_id,CC,CR,quality_bin,quality_binary
AX-100003653,0.996,98.913,high,good
AX-100004573,0.953,97.464,low,bad
AX-100004941,1.0,100.0,high,good
AX-100006840,0.975,98.188,low,bad
AX-100007392,0.996,100.0,high,good
AX-100007701,0.963,99.638,low,bad
AX-100008742,0.989,100.0,marginal,good
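
The label definition uses strict `>` comparisons, so values lying exactly on a threshold fall into 'bad'. A minimal sketch on invented values shows how strict and inclusive thresholds differ at the boundary; the probeset ids `ps_a` to `ps_d` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented CC/CR values; ps_b and ps_c sit exactly on a threshold.
toy = pd.DataFrame(
    {"CC": [0.996, 0.985, 0.990, 0.953], "CR": [98.9, 99.0, 95.0, 97.5]},
    index=["ps_a", "ps_b", "ps_c", "ps_d"],  # hypothetical probeset ids
)

# Strict thresholds, as used for quality_binary in this notebook.
strict = np.where((toy["CC"] > 0.985) & (toy["CR"] > 95), "good", "bad")
# Inclusive thresholds, shown only for comparison.
inclusive = np.where((toy["CC"] >= 0.985) & (toy["CR"] >= 95), "good", "bad")

print(pd.DataFrame({"strict": strict, "inclusive": inclusive}, index=toy.index))
```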


In [8]:
# count of probesets in each OriginalCT & quality_binary group that I want to split into test/train sets by same proportion
ps_data.pivot_table(index=['OriginalCT'],columns=['quality_binary'],values='CR',aggfunc='count') \
       .sort_values(by='good',ascending=False)[['good','bad']]

OriginalCT,good,bad
NoMinorHom,498455.0,5747.0
PolyHighResolution,228775.0,6801.0
MonoHighResolution,27866.0,952.0
Other,8175.0,16756.0
ABvarianceY,2322.0,551.0
ABvarianceX,2251.0,256.0
AAvarianceY,1383.0,75.0
BBvarianceX,1359.0,67.0
BBvarianceY,1300.0,142.0
OTV,927.0,1027.0


### 3.2.2 Make stratification column to prepare for train-test splitting

Because I want to preserve the proportions of the imbalanced classes when doing a train-test split, I want to stratify the split by BOTH 'OriginalCT' and 'quality_binary'.  However, stratification works more predictably with a single stratification column.  Therefore, let's concatenate these two string columns.

In [9]:
# Build a single-column DataFrame of 'OriginalCT.quality_binary' labels, indexed by probeset_id.
ps_data_stratifier = (ps_data['OriginalCT'] + '.' + ps_data['quality_binary']).to_frame('OriginalCTquality_binary')

In [10]:
ps_data_stratifier.head()

probeset_id,OriginalCTquality_binary
AX-100003653,PolyHighResolution.good
AX-100004573,Other.bad
AX-100004941,MonoHighResolution.good
AX-100006840,PolyHighResolution.bad
AX-100007392,PolyHighResolution.good
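
Before stratifying on a combined key, it can help to check how many rows fall into each stratum, because `train_test_split` raises a `ValueError` when a stratification class has fewer than two members. A minimal sketch on invented data (the `toy` frame is an assumption, not the real `ps_data`):

```python
import pandas as pd

# Invented category/label combinations, some of them deliberately rare.
toy = pd.DataFrame(
    {
        "OriginalCT": ["NoMinorHom"] * 6 + ["Other"] * 3 + ["OTV"],
        "quality_binary": ["good"] * 5 + ["bad"] + ["bad"] * 2 + ["good"] + ["bad"],
    }
)

# Same concatenation idea as the stratification column above.
strata = toy["OriginalCT"] + "." + toy["quality_binary"]
stratum_sizes = strata.value_counts()

# Strata with fewer than 2 rows cannot be split across train and test.
too_small = stratum_sizes[stratum_sizes < 2]
print(too_small)
```

If any stratum turns out to be too small, one possible workaround is to merge rare categories into a shared label for stratification purposes only.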


### 3.2.3 One-hot encoding

"Avoid OneHot for decision tree-based algorithms." 
https://web.archive.org/web/20200924113639/https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/

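So maybe I shouldn't do this yet?

However, as far as I know, sklearn's decision tree estimators require numeric input and do not handle string categories natively, so the categorical columns need some numeric encoding before modeling. I'll one-hot encode them here.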
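
For comparison, here is a minimal sketch of two ways to turn a string column into numbers: one-hot columns and integer category codes. The toy column is invented, and integer codes impose an arbitrary (alphabetical) order, so whether they suit a given tree model is an assumption to test rather than a recommendation.

```python
import pandas as pd

toy = pd.DataFrame(
    {"OriginalCT": ["PolyHighResolution", "Other", "MonoHighResolution", "Other"]}
)

# Option 1: one-hot encoding, one indicator column per category.
one_hot = pd.get_dummies(toy, columns=["OriginalCT"])

# Option 2: integer codes, assigned in sorted category order.
codes = toy["OriginalCT"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(codes.tolist())
```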


In [11]:
ps_data.select_dtypes(exclude=float)

probeset_id,quality_bin,OriginalCT.recommended,OriginalCT,Nclus,quality_binary
AX-100003653,high,True,PolyHighResolution,3,good
AX-100004573,low,False,Other,2,bad
AX-100004941,high,True,MonoHighResolution,1,good
AX-100006840,low,True,PolyHighResolution,3,bad
AX-100007392,high,True,PolyHighResolution,3,good
...,...,...,...,...,...
AX-98295628,low,False,ABvarianceY,3,bad
AX-98295631,high,True,PolyHighResolution,3,good
AX-98295632,high,True,PolyHighResolution,3,good
AX-98295636,high,True,PolyHighResolution,3,good


In [12]:
# will drop 'quality_bin', so don't bother encoding
# will encode Nclus, since actually categorical, not ordinal (order doesn't matter)
unencoded_columns = ['OriginalCT.recommended', 'OriginalCT', 'Nclus', 'quality_binary']

# If I don't provide columns, it'll guess which to encode, and will skip OriginalCT.recommended (True/False) and Nclus
encoded = pd.get_dummies(ps_data[unencoded_columns], columns=unencoded_columns)

In [13]:
encoded.head().T

probeset_id,AX-100003653,AX-100004573,AX-100004941,AX-100006840,AX-100007392
OriginalCT.recommended_False,0,1,0,0,0
OriginalCT.recommended_True,1,0,1,1,1
OriginalCT_AAvarianceX,0,0,0,0,0
OriginalCT_AAvarianceY,0,0,0,0,0
OriginalCT_ABvarianceX,0,0,0,0,0
OriginalCT_ABvarianceY,0,0,0,0,0
OriginalCT_BBvarianceX,0,0,0,0,0
OriginalCT_BBvarianceY,0,0,0,0,0
OriginalCT_CallRateBelowThreshold,0,0,0,0,0
OriginalCT_MonoHighResolution,0,0,1,0,0
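
The effect of passing `columns=` explicitly can be seen on a small invented frame: by default `pd.get_dummies` only encodes object, string, or category columns and leaves boolean and integer columns untouched. The column names below are shortened stand-ins, not the real ones.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "recommended": [True, False, True],   # boolean column
        "OriginalCT": ["Poly", "Other", "Mono"],  # string column
        "Nclus": [3, 2, 1],                   # integer column
    }
)

# Default behaviour: only the string column is encoded.
default = pd.get_dummies(toy)
# Explicit list: boolean and integer columns are encoded as well.
explicit = pd.get_dummies(toy, columns=["recommended", "OriginalCT", "Nclus"])

print(sorted(default.columns))
print(sorted(explicit.columns))
```

Note that recent pandas versions (2.0 and later) return boolean dummy columns by default; passing `dtype=int` gives 0/1 values like those shown in this notebook.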


In [14]:
# drop unnecessary encoded columns
encoded_to_drop = ['OriginalCT.recommended_False','quality_binary_bad']
encoded.drop(columns=encoded_to_drop, inplace=True)

# let's preserve dropped original columns in a new dataset
ps_data_dropped = ps_data[unencoded_columns]

# Replace original features with encoded features. DataFrame.join merges tables on common row indices by default.
ps_data = ps_data.drop(columns=unencoded_columns).join(encoded)
ps_data.head().T

probeset_id,AX-100003653,AX-100004573,AX-100004941,AX-100006840,AX-100007392
quality_bin,high,low,high,low,high
quality_score,4.463,0.733,5.0,2.638,4.68
CC,0.996,0.953,1.0,0.975,0.996
CR,98.913,97.464,100.0,98.188,100.0
FLD,6.761,3.694,,4.352,6.359
HetSO,0.745,-0.28,,0.421,0.36
MMD,36.317,,,29.674,47.444
het_frac,0.238095,0.026022,0.0,0.092251,0.105072
MinorAlleleFrequency,0.145,0.013,0.0,0.061,0.056
H.W.p-Value,0.469615,1.0,1.0,0.010957,0.590233


### 3.2.4 Features to drop

Some features are not normally available at prediction time and were only added to create the target feature.  They should be dropped so that they are not used for modeling.
* CC
* quality_score
* quality_bin

In [15]:
columns_to_drop = ['CC','quality_score','quality_bin']
ps_data.drop(columns=columns_to_drop, inplace=True)

In [16]:
ps_data.dtypes

CR                                   float64
FLD                                  float64
HetSO                                float64
MMD                                  float64
het_frac                             float64
MinorAlleleFrequency                 float64
H.W.p-Value                          float64
AA.meanX.clean                       float64
AB.meanX.abs_clean                   float64
BB.meanX.clean                       float64
HomRO                                float64
AA.meanY.clean                       float64
AB.meanY.clean                       float64
BB.meanY.clean                       float64
meanY                                float64
Hom.meanY.delta                      float64
AA.varX.clean                        float64
AB.varX.clean                        float64
BB.varX.clean                        float64
AA.varY.clean                        float64
AB.varY.clean                        float64
BB.varY.clean                        float64
OriginalCT

## 3.3 Split into train and test sets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

"stratify" on two columns
https://stackoverflow.com/questions/45516424/sklearn-train-test-split-on-pandas-stratify-by-multiple-columns

The strategy is to create a new column concatenating OriginalCT & quality_binary (done in 3.2.2), and stratify on that new column when splitting.


In [17]:
# Define X (features) and y (target)
X = ps_data.drop(columns=['quality_binary_good'])
y = ps_data['quality_binary_good']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=ps_data_stratifier)

In [18]:
type(X_train), X.shape, type(y_train), y.shape

(pandas.core.frame.DataFrame,
 (808155, 38),
 pandas.core.series.Series,
 (808155,))

In [19]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((565708, 38), (565708,), (242447, 38), (242447,))
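
One way to confirm that a stratified split preserved the joint proportions is to compare the normalized stratum counts in the train and test sets. The sketch below does this on invented data; `toy` and its counts are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data with four OriginalCT/quality_binary strata.
toy = pd.DataFrame(
    {
        "OriginalCT": ["NoMinorHom"] * 60 + ["Other"] * 40,
        "quality_binary": ["good"] * 50 + ["bad"] * 10 + ["good"] * 10 + ["bad"] * 30,
        "CR": range(100),
    }
)
strata = toy["OriginalCT"] + "." + toy["quality_binary"]

train, test = train_test_split(toy, test_size=0.3, random_state=42, stratify=strata)

# Share of each stratum in the train and test sets, side by side.
comparison = pd.DataFrame(
    {
        "train": strata.loc[train.index].value_counts(normalize=True),
        "test": strata.loc[test.index].value_counts(normalize=True),
    }
)
print(comparison)
```

The two columns should agree closely; large differences would suggest the stratification key was not applied as intended.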

## 3.4 Subsample to balance the good/bad classes

In [20]:
# relative difference in good vs bad category probesets
# y_train is a Series of quality_binary_good values
train_good = (y_train == 1).sum()
train_bad = (y_train == 0).sum()
print(train_good,' / ',train_bad, ' = ',train_good/train_bad)

541536  /  24172  =  22.403441999007114


A ratio of 22.4 good probesets for every bad one is heavily imbalanced for training models.  I'd prefer a maximum imbalance of about 1.5:1 (60:40).  There seem to be enough good probesets that I can subsample only the good probesets in the training set.

In [21]:
# if I retain 1 in every 15 "good" data points, the ratio of good to bad observations is about 1.5:1
print(train_good/15,' / ',train_bad, ' = ',(train_good/15)/train_bad)

36102.4  /  24172  =  1.4935627999338077
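
Instead of picking the 1/15 fraction by trial, the number of good probesets to keep can be derived directly from the target ratio. This is a small sketch using the counts printed above; the helper name `n_majority_to_keep` is hypothetical.

```python
def n_majority_to_keep(n_minority: int, target_ratio: float) -> int:
    """Majority-class rows to keep so that majority:minority is about target_ratio."""
    return int(n_minority * target_ratio)


train_good, train_bad = 541536, 24172  # training-set counts printed above

n_keep = n_majority_to_keep(train_bad, target_ratio=1.5)
keep_fraction = n_keep / train_good

print(n_keep, round(keep_fraction, 4))
```

This gives a slightly different count than keeping exactly 1/15 of the good rows, since 1/15 yields a ratio just under 1.5:1.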


In [22]:
y_train.head()

probeset_id
AX-34925385     1
AX-152970946    1
AX-41872981     1
AX-36809499     1
AX-148865368    1
Name: quality_binary_good, dtype: uint8

In [23]:
# sklearn.utils.resample(*arrays, replace=True, n_samples=None, random_state=None, stratify=None)
# Subsample only the quality_binary_good = 1 rows (y_train==1), keeping about 1/15 of them (36102 probesets).
# Note: this draw is random and is not stratified by OriginalCT.
X_train_subsample1, y_train_subsample1 = resample(X_train[y_train==1],
                                                  y_train[y_train==1],
                                                  replace=False, n_samples=36102, random_state=42)

# Then concatenate all the quality_binary_good = 0 rows back on, for both X_train and y_train.
# verify_integrity=True raises an error if any probeset_id ends up duplicated in the result.
# pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
X_train_subsample = pd.concat([X_train_subsample1, X_train[y_train==0]], verify_integrity=True)
y_train_subsample = pd.concat([y_train_subsample1, y_train[y_train==0]], verify_integrity=True)

X_train_subsample1.shape, X_train[y_train==0].shape, X_train_subsample.shape, len(y_train_subsample1), len(y_train[y_train==0]), len(y_train_subsample)

((36102, 38), (24172, 38), (60274, 38), 36102, 24172, 60274)
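
The plan at the top of this notebook mentions preserving OriginalCT proportions when subsampling the good class. A possible way to do that is the `stratify` argument of `sklearn.utils.resample`, sketched here on invented data. In this notebook the original OriginalCT labels would have to come from somewhere like `ps_data_dropped`, which is an assumption about how one might wire it up.

```python
import pandas as pd
from sklearn.utils import resample

# Invented "good" rows: 150 of one category and 30 of another.
good = pd.DataFrame(
    {
        "OriginalCT": ["NoMinorHom"] * 150 + ["PolyHighResolution"] * 30,
        "CR": range(180),
    }
)

# Draw 10% of the rows without replacement, keeping category proportions.
good_sub = resample(
    good,
    replace=False,
    n_samples=18,
    random_state=42,
    stratify=good["OriginalCT"],
)

print(good_sub["OriginalCT"].value_counts())
```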

In [24]:
y_train.head()

probeset_id
AX-34925385     1
AX-152970946    1
AX-41872981     1
AX-36809499     1
AX-148865368    1
Name: quality_binary_good, dtype: uint8

## 3.5 (No Normalization)

Scaling is deferred; it may be done later as part of a specific model pipeline.
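
As a rough sketch of what deferring scaling to a model pipeline could look like, the example below chains imputation, scaling, and a classifier with `make_pipeline`, so the scaler is fitted on training rows only. The data is random, and the choice of median imputation and logistic regression is an assumption for illustration, not a decision made in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random toy features; the label depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
X_toy[::10, 1] = np.nan  # mimic a sparsely populated metric

# Imputation and scaling are learned from the training rows only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_toy[:150], y_toy[:150])

score = model.score(X_toy[150:], y_toy[150:])
print(score)
```

Keeping these steps inside the pipeline avoids leaking test-set statistics into the scaler, which is the main reason not to scale the saved datasets up front.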

## 3.6 Save to zip

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

In [25]:
%whos

Variable             Type         Data/Info
-------------------------------------------
X                    DataFrame                       CR    <...>808155 rows x 38 columns]
X_test               DataFrame                      CR     <...>242447 rows x 38 columns]
X_train              DataFrame                       CR    <...>565708 rows x 38 columns]
X_train_subsample    DataFrame                       CR    <...>[60274 rows x 38 columns]
X_train_subsample1   DataFrame                       CR    <...>[36102 rows x 38 columns]
columns_to_drop      list         n=3
encoded              DataFrame                  OriginalCT.<...>808155 rows x 17 columns]
encoded_to_drop      list         n=2
missing              DataFrame                            c<...>     208499     25.799383
np                   module       <module 'numpy' from '/Us<...>kages/numpy/__init__.py'>
os                   module       <module 'os' from '/Users<...>ard/lib/python3.8/os.py'>
pd                   modul

In [26]:
os.getcwd()

'/Users/Carsten/OneDrive/Documents/Springboard/git_repositories/DataScienceCapstoneTwo/notebooks'

In [32]:
os.chdir('../data')
os.getcwd()

'/Users/Carsten/OneDrive/Documents/Springboard/git_repositories/DataScienceCapstoneTwo/data'

In [None]:
# Write each split as a tab-separated file inside its own zip archive (e.g. X_train.zip contains X_train.tsv).
datasets = {'X_train_subsample': X_train_subsample, 'y_train_subsample': y_train_subsample,
            'X_train': X_train, 'y_train': y_train,
            'X_test': X_test, 'y_test': y_test}
for name, data in datasets.items():
    data.to_csv(f'{name}.zip', sep='\t', compression={'method': 'zip', 'archive_name': f'{name}.tsv'})
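
To check that files written this way can be read back for modeling, a small round trip on an invented Series is sketched below. The file name `y_toy.zip` and the probeset ids are hypothetical, and a temporary directory is used so nothing is written to the project's data folder.

```python
import os
import tempfile

import pandas as pd

# Invented target Series shaped like y_train (index named probeset_id).
y_toy = pd.Series(
    [1, 0, 1],
    index=pd.Index(["ps_a", "ps_b", "ps_c"], name="probeset_id"),
    name="quality_binary_good",
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_toy.zip")
    y_toy.to_csv(path, sep="\t", compression={"method": "zip", "archive_name": "y_toy.tsv"})
    # Compression is inferred from the .zip extension when reading.
    y_back = pd.read_csv(path, sep="\t", index_col="probeset_id")["quality_binary_good"]

print(y_back)
```

Reading a saved Series returns a DataFrame first, so selecting the single column by name restores a Series; the integer dtype may differ from the original (for example int64 instead of uint8).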