# 3 train_test_split data<a id='3_train_test_split data'></a>

Start with the trimmed set of columns from the ps_performance and concordance results, which were visualized mainly in the TIBCO Spotfire file:
/git_repositories/DataScienceCapstoneTwo/spotfire/data_cleaning_step2_EDA.dxp

This file already has most of the unnecessary columns removed, along with columns that are highly correlated with the remaining metrics.

Do some final feature engineering.

Split into 70:30 train:test sets, preserving the proportions of both OriginalCT and the quality_binary bin.

In model building, be prepared to subsample the quality_binary=good class to about 7% of its original size (again preserving the proportions of OriginalCT), so that the good:bad split in the training set is about 60:40.

Don't drop sparse columns yet, since some models can handle them.

Don't scale columns yet, since some models can handle unscaled data.
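The planned subsampling of the good class could look roughly like the sketch below. It runs on a small made-up frame rather than the real training set; the helper name `subsample_good`, the toy counts, and the 10% fraction are assumptions chosen so the result is easy to check by eye.

```python
import pandas as pd

# Toy stand-in for a training set (made-up counts, not the real data)
toy_train = pd.DataFrame({
    "OriginalCT": ["NoMinorHom"] * 600 + ["PolyHighResolution"] * 400,
    "quality_binary": ["good"] * 500 + ["bad"] * 100 + ["good"] * 300 + ["bad"] * 100,
})

def subsample_good(df, frac, seed=0):
    """Hypothetical helper: keep every 'bad' row and a random fraction
    of the 'good' rows drawn separately within each OriginalCT group."""
    good = df[df["quality_binary"] == "good"]
    bad = df[df["quality_binary"] == "bad"]
    # groupby(...).sample draws the same fraction from every OriginalCT group,
    # so the OriginalCT proportions among 'good' rows are preserved
    good_sub = good.groupby("OriginalCT").sample(frac=frac, random_state=seed)
    return pd.concat([good_sub, bad]).sort_index()

balanced = subsample_good(toy_train, frac=0.1)
print(balanced.groupby(["OriginalCT", "quality_binary"]).size())
```

On the real training set the call would presumably use `frac=0.07`, and it should be applied only after the train/test split so the test set keeps the original class balance.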

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split

## 3.1 Load and inspect data

In [2]:
ps_data = pd.read_csv('../data/data_cleaning_step2.zip',sep='\t')

In [3]:
ps_data.head(7).transpose()

,0,1,2,3,4,5,6
probeset_id,AX-100003653,AX-100004573,AX-100004941,AX-100006840,AX-100007392,AX-100007701,AX-100008742
quality_bin,high,low,high,low,high,low,marginal
quality_score,4.463,0.733,5.0,2.638,4.68,1.968,4.12
OriginalCT.recommended,True,False,True,True,True,True,False
OriginalCT,PolyHighResolution,Other,MonoHighResolution,PolyHighResolution,PolyHighResolution,PolyHighResolution,Other
CC,0.996,0.953,1.0,0.975,0.996,0.963,0.989
CR,98.913,97.464,100.0,98.188,100.0,99.638,100.0
FLD,6.761,3.694,,4.352,6.359,7.161,
HetSO,0.745,-0.28,,0.421,0.36,0.161,
Nclus,3,2,1,3,3,3,1


In [4]:
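# Some metrics are only computed for certain categories of probeset, so they are missing for the rest.
# Note: 'frac_missing' holds the percentage of missing values (0-100), not a fraction.
missing = pd.concat([ps_data.isnull().sum(), 100 * ps_data.isnull().mean()], axis=1)
missing.columns = ['count_missing', 'frac_missing']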
missing.sort_values(by='count_missing',ascending = False)

,count_missing,frac_missing
MMD,549639,68.011582
AA.meanX.clean,340804,42.170623
AA.varY.clean,340804,42.170623
AA.varX.clean,340804,42.170623
AA.meanY.clean,340804,42.170623
BB.varX.clean,208499,25.799383
BB.meanY.clean,208499,25.799383
BB.meanX.clean,208499,25.799383
BB.varY.clean,208499,25.799383
FLD,30841,3.816223
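To see which categories of probeset lack a given metric, the missing percentage can be broken down by OriginalCT. This is a minimal sketch on a toy frame: the rows mimic the preview above, where the MonoHighResolution probeset has no FLD value, and the name `pct_missing_by_ct` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows modeled on the preview: FLD is absent for MonoHighResolution probesets
toy = pd.DataFrame({
    "OriginalCT": ["PolyHighResolution", "PolyHighResolution",
                   "MonoHighResolution", "MonoHighResolution"],
    "FLD": [6.761, 4.352, np.nan, np.nan],
    "CR": [98.913, 98.188, 100.0, 100.0],
})

# Percentage of missing values per metric within each OriginalCT group
pct_missing_by_ct = (
    toy.drop(columns="OriginalCT")
       .isnull()
       .groupby(toy["OriginalCT"])
       .mean()
       .mul(100)
)
print(pct_missing_by_ct)
```

Run against ps_data, a table like this would show whether a sparse column such as MMD is missing at random or only for particular call types.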


## 3.2 More feature engineering

### 3.2.1 Prediction feature

In [5]:
# count of probesets in each group that I want to split into test/train sets by same proportion
ps_data.pivot_table(index=['OriginalCT'],columns=['quality_bin'],values='probeset_id',aggfunc='count') \
       .sort_values(by='high',ascending=False)[['high','marginal','low']]

OriginalCT,high,marginal,low
NoMinorHom,423812.0,74654.0,5736.0
PolyHighResolution,191118.0,37684.0,6774.0
MonoHighResolution,25750.0,2116.0,952.0
Other,3795.0,4382.0,16754.0
ABvarianceY,1263.0,1061.0,549.0
ABvarianceX,1244.0,1007.0,256.0
AAvarianceY,1125.0,260.0,73.0
BBvarianceX,1094.0,265.0,67.0
BBvarianceY,981.0,319.0,142.0
AAvarianceX,671.0,141.0,74.0


In [6]:
# I want the model to predict only 2 classes, 'good' and 'bad', not the current 3 ('high', 'marginal', 'low').
# Define good quality as concordance with reference data (CC) above 98.5% and call rate (CR) above 95%.
# This is close to, but not exactly, combining the 'high' and 'marginal' bins into 'good' and 'low' into 'bad':
# the per-group counts before and after differ by a few probesets.
# There are no missing values for CC and CR.
ps_data['quality_binary'] = np.where((ps_data['CC']>0.985) & (ps_data['CR']>95), 'good', 'bad')

In [7]:
ps_data[['probeset_id','CC','CR','quality_bin','quality_binary']].head(7)

,probeset_id,CC,CR,quality_bin,quality_binary
0,AX-100003653,0.996,98.913,high,good
1,AX-100004573,0.953,97.464,low,bad
2,AX-100004941,1.0,100.0,high,good
3,AX-100006840,0.975,98.188,low,bad
4,AX-100007392,0.996,100.0,high,good
5,AX-100007701,0.963,99.638,low,bad
6,AX-100008742,0.989,100.0,marginal,good
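A quick way to check how the new two-class label lines up with the original three bins is a cross-tabulation. The sketch below uses a toy frame. Its first five rows copy values from the preview above; the last row is a made-up boundary case (CC exactly 0.985, labeled 'marginal' purely for illustration) to show how the strict `>` comparison treats it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CC": [0.996, 0.953, 1.0, 0.975, 0.989, 0.985],
    "CR": [98.913, 97.464, 100.0, 98.188, 100.0, 99.0],
    # last label is hypothetical; how quality_bin treats CC == 0.985 is not known here
    "quality_bin": ["high", "low", "high", "low", "marginal", "marginal"],
})

# Same rule as the cell above: strictly greater than both thresholds
toy["quality_binary"] = np.where((toy["CC"] > 0.985) & (toy["CR"] > 95), "good", "bad")

# Rows: original bins; columns: new binary label
check = pd.crosstab(toy["quality_bin"], toy["quality_binary"])
print(check)
```

Any nonzero count in an unexpected cell (for example marginal/bad) marks probesets where the two definitions disagree.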


In [8]:
# count of probesets in each OriginalCT & quality_binary group that I want to split into test/train sets by same proportion
ps_data.pivot_table(index=['OriginalCT'],columns=['quality_binary'],values='probeset_id',aggfunc='count') \
       .sort_values(by='good',ascending=False)[['good','bad']]

OriginalCT,good,bad
NoMinorHom,498455.0,5747.0
PolyHighResolution,228775.0,6801.0
MonoHighResolution,27866.0,952.0
Other,8175.0,16756.0
ABvarianceY,2322.0,551.0
ABvarianceX,2251.0,256.0
AAvarianceY,1383.0,75.0
BBvarianceX,1359.0,67.0
BBvarianceY,1300.0,142.0
OTV,927.0,1027.0


In [9]:
ps_data[['quality_binary']].value_counts()

quality_binary
good              773625
bad                34530
dtype: int64

### 3.2.2 One-hot encoding

"Avoid OneHot for decision tree-based algorithms."

So maybe I shouldn't do this yet, and should instead encode only for the models that need it.

In [10]:
# Non-float columns: identifiers, categoricals, and the integer Nclus
ps_data.select_dtypes(exclude=float)

,probeset_id,quality_bin,OriginalCT.recommended,OriginalCT,Nclus,quality_binary
0,AX-100003653,high,True,PolyHighResolution,3,good
1,AX-100004573,low,False,Other,2,bad
2,AX-100004941,high,True,MonoHighResolution,1,good
3,AX-100006840,low,True,PolyHighResolution,3,bad
4,AX-100007392,high,True,PolyHighResolution,3,good
...,...,...,...,...,...,...
808150,AX-98295628,low,False,ABvarianceY,3,bad
808151,AX-98295631,high,True,PolyHighResolution,3,good
808152,AX-98295632,high,True,PolyHighResolution,3,good
808153,AX-98295636,high,True,PolyHighResolution,3,good
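If encoding is deferred, the two options for a column like OriginalCT might look like the sketch below, shown on a toy frame. The `CT` prefix and the `OriginalCT_code` column name are arbitrary choices for illustration, and integer codes are just one possible alternative for tree-based models, not something these notes have settled on.

```python
import pandas as pd

toy = pd.DataFrame({
    "OriginalCT": ["PolyHighResolution", "Other", "MonoHighResolution", "PolyHighResolution"],
    "Nclus": [3, 2, 1, 3],
})

# Option 1: one-hot encoding, one indicator column per category
encoded = pd.get_dummies(toy, columns=["OriginalCT"], prefix="CT")

# Option 2: a single integer code per category (categories are sorted alphabetically)
coded = toy.assign(OriginalCT_code=toy["OriginalCT"].astype("category").cat.codes)

print(encoded.columns.tolist())
print(coded["OriginalCT_code"].tolist())
```

Neither transformation touches ps_data here, so the choice can be made per model later.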


### 3.2.3 Features to drop

Some features are not normally available and were added only to create the prediction feature. They should be dropped so that they are not used for modeling.
* CC
* quality_score
* quality_bin

In [None]:
columns_to_drop = ['CC','quality_score','quality_bin']
#ps_data.drop(columns=columns_to_drop, inplace=True)
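One way to keep these columns out of modeling without modifying ps_data in place is to drop them only when the feature matrix is built. This is a sketch on two toy rows copied from the preview above; treating `probeset_id` and `quality_binary` as non-features is an assumption, as are the names `non_features`, `X`, and `y`.

```python
import pandas as pd

# Two toy rows with values taken from the preview of ps_data
toy = pd.DataFrame({
    "probeset_id": ["AX-100003653", "AX-100004573"],
    "quality_bin": ["high", "low"],
    "quality_score": [4.463, 0.733],
    "CC": [0.996, 0.953],
    "CR": [98.913, 97.464],
    "quality_binary": ["good", "bad"],
})

columns_to_drop = ["CC", "quality_score", "quality_bin"]
# Assumption: the identifier and the target are also excluded from the features
non_features = columns_to_drop + ["probeset_id", "quality_binary"]

X = toy.drop(columns=non_features)  # returns a copy; toy itself is unchanged
y = toy["quality_binary"]
print(X.columns.tolist())
```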

## 3.3 Split into train and test sets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

"stratify" on two columns
https://stackoverflow.com/questions/45516424/sklearn-train-test-split-on-pandas-stratify-by-multiple-columns

The strategy appears to be to create a new column that concatenates OriginalCT and quality_binary, then stratify on that new column when splitting.
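A minimal sketch of that approach, on a toy frame with made-up counts. The variable name `strata`, the `_` separator, and `random_state=42` are arbitrary choices; only `train_test_split` and its `stratify` argument come from scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows spread over four (OriginalCT, quality_binary) combinations
toy = pd.DataFrame({
    "OriginalCT": ["NoMinorHom"] * 60 + ["Other"] * 40,
    "quality_binary": ["good"] * 50 + ["bad"] * 10 + ["good"] * 10 + ["bad"] * 30,
    "CR": range(100),
})

# Combined key: one label per (OriginalCT, quality_binary) pair
strata = toy["OriginalCT"] + "_" + toy["quality_binary"]

# 70:30 split that preserves the proportion of every combined label
train, test = train_test_split(toy, test_size=0.3, stratify=strata, random_state=42)

print(strata.loc[test.index].value_counts())
```

One caveat to check on the real data: `train_test_split` raises an error if any combined label occurs only once, so very rare OriginalCT/quality_binary combinations may need to be merged or handled separately first.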


## 3.4 Normalization

This may be deferred until it is part of a specific model's pipeline.

In [None]:
ps_data.describe().T.sort_values(by='count',ascending=False)
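If scaling is deferred to the model pipeline, it could be wired in roughly as below. This is a sketch only: `LogisticRegression` is a placeholder for whichever scale-sensitive model ends up being used, and the two-column toy matrix is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: two features on very different scales
X_train = np.column_stack([np.linspace(95, 100, 40), np.linspace(0, 10, 40)])
y_train = (X_train[:, 1] > 5.0).astype(int)

# The scaler is fit on the training data only, as part of fitting the pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print(scaler.mean_)
```

Keeping the scaler inside the pipeline means the test set is transformed with training-set statistics, and models that don't need scaling can simply omit the step.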