### Deletions: Random Selection of Structural Variants

**Note: tSNE Plots and DBSCAN Histogram/CDF**

**Background**
- 5000 Insertions and 5000 Deletions were randomly selected from our union callset of sequence resolved variants.

- **3991** unique deletions are described below.

- Features were generated by svviz to describe each variant

- tSNE was used to visualize the structure of the data

- The goal is to randomly select datapoints from each unique group/tSNE cluster and distribute these selected variants for manual curation. 

- To sample from each tSNE cluster, DBSCAN is used to generate cluster labels. For each set of DBSCAN cluster labels, a fixed number of variants is then randomly selected from each cluster.
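
The per-cluster selection described above might look like the following minimal sketch. The helper name `sample_per_cluster`, the `n_per_cluster` value, and the toy data frame are hypothetical; only the `clusterLabel` column name comes from this notebook. Skipping noise points (DBSCAN label -1) is an assumption here, not a documented choice.

```python
import numpy as np
import pandas as pd

def sample_per_cluster(df, label_col="clusterLabel", n_per_cluster=2, seed=0):
    """Randomly pick up to n_per_cluster rows from every DBSCAN cluster."""
    rng = np.random.RandomState(seed)
    picks = []
    for label, group in df.groupby(label_col):
        if label == -1:
            # Assumption: DBSCAN noise points are not sent for curation.
            continue
        n = min(n_per_cluster, len(group))  # small clusters contribute all rows
        picks.append(group.sample(n=n, random_state=rng))
    return pd.concat(picks)

# Hypothetical toy data: one noise point and three clusters.
toy = pd.DataFrame({
    "id": range(10),
    "clusterLabel": [-1, 0, 0, 0, 1, 1, 2, 2, 2, 2],
})
selected = sample_per_cluster(toy, n_per_cluster=2)
print(selected.sort_values("clusterLabel"))
```

Fixing the seed keeps the curated subset reproducible between runs.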

In [1]:
'''
Import statements
'''
import sqlite3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns
from scipy import stats
from scipy.linalg import svd
from scipy.stats import ks_2samp
from fancyimpute import KNN
from sklearn import preprocessing
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import LabelEncoder
# bokeh.charts was split out of Bokeh into the separate bkcharts package
# (around release 0.12.9), so this import needs an older Bokeh release.
from bokeh.charts import Scatter, Histogram, output_file, show
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

In [2]:
'''
Load Data
'''

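# df is label-encoded and imputed below; df4 is a second copy of the same file
# that keeps the original values for annotating the plots.
df = pd.read_csv("svviz.Annotate.DEL.HG002.csv")
df4 = pd.read_csv("svviz.Annotate.DEL.HG002.csv")
log_size = pd.read_csv("log_Size.csv")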

***
**Part 1: Data Preprocessing**
***

In [3]:
'''
Convert Categorical to Numerical
'''

#Label Encoding: convert categorical to numerical
label_encoder = preprocessing.LabelEncoder()
df['chrom'] = label_encoder.fit_transform(df['chrom'])
df['SVtype'] = label_encoder.fit_transform(df['SVtype'])
df['sample'] = label_encoder.fit_transform(df['sample'])
df['type'] = label_encoder.fit_transform(df['type'])
# Count Number of NaN in each column
dfNaN = pd.DataFrame()
df.isnull().sum()

chrom                                                 0
start                                                 0
end                                                   0
sample                                                0
id                                                    0
type                                                  0
SVtype                                                0
Size                                                  0
Ill300x.alt_alnScore_mean                             9
Ill300x.alt_alnScore_std                              9
Ill300x.alt_count                                     9
Ill300x.alt_insertSize_mean                           9
Ill300x.alt_insertSize_std                            9
Ill300x.alt_reason_alignmentScore                     9
Ill300x.alt_reason_insertSizeScore                    9
Ill300x.alt_reason_orientation                        9
Ill300x.amb_alnScore_mean                             9
Ill300x.amb_alnScore_std                        

In [5]:
'''
Missing values
'''
# Count Number of NaN in each column
dfNaN = pd.DataFrame()
df.isnull().sum()

Imputing row 1/3996 with 1 missing, elapsed time: 16.743
Imputing row 101/3996 with 1 missing, elapsed time: 16.751
Imputing row 201/3996 with 1 missing, elapsed time: 16.757
Imputing row 301/3996 with 2 missing, elapsed time: 16.805
Imputing row 401/3996 with 2 missing, elapsed time: 16.825
Imputing row 501/3996 with 2 missing, elapsed time: 16.838
Imputing row 601/3996 with 1 missing, elapsed time: 16.873
Imputing row 701/3996 with 1 missing, elapsed time: 16.888
Imputing row 801/3996 with 1 missing, elapsed time: 16.920
Imputing row 901/3996 with 1 missing, elapsed time: 16.930
Imputing row 1001/3996 with 1 missing, elapsed time: 16.943
Imputing row 1101/3996 with 1 missing, elapsed time: 16.979
Imputing row 1201/3996 with 2 missing, elapsed time: 16.996
Imputing row 1301/3996 with 1 missing, elapsed time: 17.007
Imputing row 1401/3996 with 1 missing, elapsed time: 17.020
Imputing row 1501/3996 with 1 missing, elapsed time: 17.034
Imputing row 1601/3996 with 1 missing, elapsed time:

Unnamed: 0,chrom,start,end,sample,id,type,SVtype,Size,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,...,pacbio.ref_reason_alignmentScore,pacbio.GT,GTconflict,GTsupp,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct
0,0.0,1685921.0,1685944.0,0.0,669.0,0.0,0.0,-23.0,576.973451,12.756716,...,13.0,1.0,-1.0,2.0,0.0,0.0,0.0,0.000000,0.0,0.0
1,0.0,4137242.0,4137420.0,0.0,170.0,0.0,0.0,-178.0,562.000000,17.962925,...,18.0,0.0,-1.0,1.0,0.0,0.0,1.0,0.679775,0.0,0.0
2,0.0,16363392.0,16363413.0,0.0,777.0,1.0,0.0,-20.0,571.285714,14.578255,...,35.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
3,0.0,33901563.0,33901589.0,0.0,952.0,1.0,0.0,-25.0,576.336245,6.124298,...,0.0,2.0,-1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
4,0.0,41024252.0,41024371.0,0.0,553.0,0.0,0.0,-119.0,577.461538,10.676108,...,12.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
5,0.0,52270083.0,52270310.0,0.0,641.0,1.0,0.0,-90.0,580.000000,18.547237,...,47.0,0.0,-1.0,2.0,0.0,0.0,1.0,1.000000,0.0,0.0
6,0.0,62589448.0,62589500.0,0.0,454.0,1.0,0.0,-51.0,559.516854,13.513300,...,16.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
7,0.0,69330658.0,69330680.0,0.0,436.0,0.0,0.0,-22.0,581.960699,5.946136,...,1.0,-1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
8,0.0,81960824.0,81960847.0,0.0,413.0,0.0,0.0,-23.0,581.541284,7.321618,...,9.0,1.0,-1.0,2.0,0.0,0.0,0.0,0.000000,0.0,0.0
9,0.0,164176123.0,164176335.0,0.0,606.0,1.0,0.0,-97.0,581.144543,9.004083,...,1.0,-1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0


In [6]:
'''
Use KNN to impute missing values
'''

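# Convert the dataframe to a NumPy array
# (DataFrame.as_matrix() was removed in pandas 1.0; .values works in all versions)
X = df.values

# Impute missing values from the three closest observations
# (in newer fancyimpute releases the equivalent call is KNN(k=3).fit_transform(X))
X_imputed = KNN(k=3).complete(X)
df2 = pd.DataFrame(X_imputed)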

#Re-label all columns in the dataframe
df2.columns = ['chrom', 'start', 'end', 'sample', 'id', 'type', 'SVtype', 'Size', 'Ill300x.alt_alnScore_mean', 'Ill300x.alt_alnScore_std', 'Ill300x.alt_count', 'Ill300x.alt_insertSize_mean', 'Ill300x.alt_insertSize_std', 'Ill300x.alt_reason_alignmentScore', 'Ill300x.alt_reason_insertSizeScore', 'Ill300x.alt_reason_orientation', 'Ill300x.amb_alnScore_mean', 'Ill300x.amb_alnScore_std', 'Ill300x.amb_count', 'Ill300x.amb_insertSize_mean', 'Ill300x.amb_insertSize_std', 'Ill300x.amb_reason_alignmentScore_alignmentScore', 'Ill300x.amb_reason_alignmentScore_orientation', 'Ill300x.amb_reason_flanking', 'Ill300x.amb_reason_insertSizeScore_alignmentScore', 'Ill300x.amb_reason_insertSizeScore_insertSizeScore', 'Ill300x.amb_reason_insertSizeScore_orientation', 'Ill300x.amb_reason_multimapping', 'Ill300x.amb_reason_orientation_alignmentScore', 'Ill300x.amb_reason_orientation_orientation', 'Ill300x.amb_reason_same_scores', 'Ill300x.ref_alnScore_mean', 'Ill300x.ref_alnScore_std', 'Ill300x.ref_count', 'Ill300x.ref_insertSize_mean', 'Ill300x.ref_insertSize_std', 'Ill300x.ref_reason_alignmentScore', 'Ill300x.ref_reason_insertSizeScore', 'Ill300x.ref_reason_orientation', 'Ill300x.GT', 'Ill250.alt_alnScore_mean', 'Ill250.alt_alnScore_std', 'Ill250.alt_count', 'Ill250.alt_insertSize_mean', 'Ill250.alt_insertSize_std', 'Ill250.alt_reason_alignmentScore', 'Ill250.alt_reason_insertSizeScore', 'Ill250.alt_reason_orientation', 'Ill250.amb_alnScore_mean', 'Ill250.amb_alnScore_std', 'Ill250.amb_count', 'Ill250.amb_insertSize_mean', 'Ill250.amb_insertSize_std', 'Ill250.amb_reason_alignmentScore_alignmentScore', 'Ill250.amb_reason_alignmentScore_orientation', 'Ill250.amb_reason_flanking', 'Ill250.amb_reason_insertSizeScore_alignmentScore', 'Ill250.amb_reason_multimapping', 'Ill250.amb_reason_orientation_alignmentScore', 'Ill250.amb_reason_orientation_orientation', 'Ill250.amb_reason_same_scores', 'Ill250.ref_alnScore_mean', 'Ill250.ref_alnScore_std', 'Ill250.ref_count', 
'Ill250.ref_insertSize_mean', 'Ill250.ref_insertSize_std', 'Ill250.ref_reason_alignmentScore', 'Ill250.ref_reason_orientation', 'Ill250.GT', 'IllMP.alt_alnScore_mean', 'IllMP.alt_alnScore_std', 'IllMP.alt_count', 'IllMP.alt_insertSize_mean', 'IllMP.alt_insertSize_std', 'IllMP.alt_reason_alignmentScore', 'IllMP.alt_reason_insertSizeScore', 'IllMP.alt_reason_orientation', 'IllMP.amb_alnScore_mean', 'IllMP.amb_alnScore_std', 'IllMP.amb_count', 'IllMP.amb_insertSize_mean', 'IllMP.amb_insertSize_std', 'IllMP.amb_reason_alignmentScore_alignmentScore', 'IllMP.amb_reason_alignmentScore_orientation', 'IllMP.amb_reason_flanking', 'IllMP.amb_reason_insertSizeScore_alignmentScore', 'IllMP.amb_reason_insertSizeScore_insertSizeScore', 'IllMP.amb_reason_multimapping', 'IllMP.amb_reason_orientation_alignmentScore', 'IllMP.amb_reason_orientation_orientation', 'IllMP.amb_reason_same_scores', 'IllMP.ref_alnScore_mean', 'IllMP.ref_alnScore_std', 'IllMP.ref_count', 'IllMP.ref_insertSize_mean', 'IllMP.ref_insertSize_std', 'IllMP.ref_reason_alignmentScore', 'IllMP.ref_reason_insertSizeScore', 'IllMP.ref_reason_orientation', 'IllMP.GT', 'TenX.HP1_alt_alnScore_mean', 'TenX.HP1_alt_alnScore_std', 'TenX.HP1_alt_count', 'TenX.HP1_alt_insertSize_mean', 'TenX.HP1_alt_insertSize_std', 'TenX.HP1_alt_reason_alignmentScore', 'TenX.HP1_alt_reason_insertSizeScore', 'TenX.HP1_alt_reason_orientation', 'TenX.HP1_amb_alnScore_mean', 'TenX.HP1_amb_alnScore_std', 'TenX.HP1_amb_count', 'TenX.HP1_amb_insertSize_mean', 'TenX.HP1_amb_insertSize_std', 'TenX.HP1_amb_reason_alignmentScore_alignmentScore', 'TenX.HP1_amb_reason_alignmentScore_orientation', 'TenX.HP1_amb_reason_flanking', 'TenX.HP1_amb_reason_insertSizeScore_alignmentScore', 'TenX.HP1_amb_reason_insertSizeScore_insertSizeScore', 'TenX.HP1_amb_reason_multimapping', 'TenX.HP1_amb_reason_orientation_alignmentScore', 'TenX.HP1_amb_reason_orientation_orientation', 'TenX.HP1_amb_reason_same_scores', 'TenX.HP1_ref_alnScore_mean', 
'TenX.HP1_ref_alnScore_std', 'TenX.HP1_ref_count', 'TenX.HP1_ref_insertSize_mean', 'TenX.HP1_ref_insertSize_std', 'TenX.HP1_ref_reason_alignmentScore', 'TenX.HP1_ref_reason_insertSizeScore', 'TenX.HP1_ref_reason_orientation', 'TenX.HP2_alt_alnScore_mean', 'TenX.HP2_alt_alnScore_std', 'TenX.HP2_alt_count', 'TenX.HP2_alt_insertSize_mean', 'TenX.HP2_alt_insertSize_std', 'TenX.HP2_alt_reason_alignmentScore', 'TenX.HP2_alt_reason_insertSizeScore', 'TenX.HP2_alt_reason_orientation', 'TenX.HP2_amb_alnScore_mean', 'TenX.HP2_amb_alnScore_std', 'TenX.HP2_amb_count', 'TenX.HP2_amb_insertSize_mean', 'TenX.HP2_amb_insertSize_std', 'TenX.HP2_amb_reason_alignmentScore_alignmentScore', 'TenX.HP2_amb_reason_alignmentScore_orientation', 'TenX.HP2_amb_reason_flanking', 'TenX.HP2_amb_reason_insertSizeScore_alignmentScore', 'TenX.HP2_amb_reason_insertSizeScore_insertSizeScore', 'TenX.HP2_amb_reason_multimapping', 'TenX.HP2_amb_reason_orientation_alignmentScore', 'TenX.HP2_amb_reason_orientation_insertSizeScore', 'TenX.HP2_amb_reason_orientation_orientation', 'TenX.HP2_amb_reason_same_scores', 'TenX.HP2_ref_alnScore_mean', 'TenX.HP2_ref_alnScore_std', 'TenX.HP2_ref_count', 'TenX.HP2_ref_insertSize_mean', 'TenX.HP2_ref_insertSize_std', 'TenX.HP2_ref_reason_alignmentScore', 'TenX.HP2_ref_reason_orientation', 'TenX.GT', 'pacbio.alt_alnScore_mean', 'pacbio.alt_alnScore_std', 'pacbio.alt_count', 'pacbio.alt_insertSize_mean', 'pacbio.alt_insertSize_std', 'pacbio.alt_reason_alignmentScore', 'pacbio.amb_alnScore_mean', 'pacbio.amb_alnScore_std', 'pacbio.amb_count', 'pacbio.amb_insertSize_mean', 'pacbio.amb_insertSize_std', 'pacbio.amb_reason_alignmentScore_alignmentScore', 'pacbio.amb_reason_flanking', 'pacbio.amb_reason_multimapping', 'pacbio.amb_reason_same_scores', 'pacbio.ref_alnScore_mean', 'pacbio.ref_alnScore_std', 'pacbio.ref_count', 'pacbio.ref_insertSize_mean', 'pacbio.ref_insertSize_std', 'pacbio.ref_reason_alignmentScore', 'pacbio.GT', 'GTcons', 'GTconflict', 'GTsupp', 
'tandemrep_cnt', 'tandemrep_pct', 'segdup_cnt', 'segdup_pct', 'refN_cnt', 'refN_pct']
# Display only: drop() returns a copy, so df2 itself still contains GTcons
df2.drop('GTcons',axis=1)

Imputing row 1/3996 with 1 missing, elapsed time: 16.987
Imputing row 101/3996 with 1 missing, elapsed time: 16.995
Imputing row 201/3996 with 1 missing, elapsed time: 17.004
Imputing row 301/3996 with 2 missing, elapsed time: 17.046
Imputing row 401/3996 with 2 missing, elapsed time: 17.062
Imputing row 501/3996 with 2 missing, elapsed time: 17.075
Imputing row 601/3996 with 1 missing, elapsed time: 17.107
Imputing row 701/3996 with 1 missing, elapsed time: 17.121
Imputing row 801/3996 with 1 missing, elapsed time: 17.155
Imputing row 901/3996 with 1 missing, elapsed time: 17.163
Imputing row 1001/3996 with 1 missing, elapsed time: 17.175
Imputing row 1101/3996 with 1 missing, elapsed time: 17.210
Imputing row 1201/3996 with 2 missing, elapsed time: 17.225
Imputing row 1301/3996 with 1 missing, elapsed time: 17.237
Imputing row 1401/3996 with 1 missing, elapsed time: 17.248
Imputing row 1501/3996 with 1 missing, elapsed time: 17.261
Imputing row 1601/3996 with 1 missing, elapsed time:

Unnamed: 0,chrom,start,end,sample,id,type,SVtype,Size,Ill300x.alt_alnScore_mean,Ill300x.alt_alnScore_std,...,pacbio.ref_reason_alignmentScore,pacbio.GT,GTconflict,GTsupp,tandemrep_cnt,tandemrep_pct,segdup_cnt,segdup_pct,refN_cnt,refN_pct
0,0.0,1685921.0,1685944.0,0.0,669.0,0.0,0.0,-23.0,576.973451,12.756716,...,13.0,1.0,-1.0,2.0,0.0,0.0,0.0,0.000000,0.0,0.0
1,0.0,4137242.0,4137420.0,0.0,170.0,0.0,0.0,-178.0,562.000000,17.962925,...,18.0,0.0,-1.0,1.0,0.0,0.0,1.0,0.679775,0.0,0.0
2,0.0,16363392.0,16363413.0,0.0,777.0,1.0,0.0,-20.0,571.285714,14.578255,...,35.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
3,0.0,33901563.0,33901589.0,0.0,952.0,1.0,0.0,-25.0,576.336245,6.124298,...,0.0,2.0,-1.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
4,0.0,41024252.0,41024371.0,0.0,553.0,0.0,0.0,-119.0,577.461538,10.676108,...,12.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
5,0.0,52270083.0,52270310.0,0.0,641.0,1.0,0.0,-90.0,580.000000,18.547237,...,47.0,0.0,-1.0,2.0,0.0,0.0,1.0,1.000000,0.0,0.0
6,0.0,62589448.0,62589500.0,0.0,454.0,1.0,0.0,-51.0,559.516854,13.513300,...,16.0,1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
7,0.0,69330658.0,69330680.0,0.0,436.0,0.0,0.0,-22.0,581.960699,5.946136,...,1.0,-1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
8,0.0,81960824.0,81960847.0,0.0,413.0,0.0,0.0,-23.0,581.541284,7.321618,...,9.0,1.0,-1.0,2.0,0.0,0.0,0.0,0.000000,0.0,0.0
9,0.0,164176123.0,164176335.0,0.0,606.0,1.0,0.0,-97.0,581.144543,9.004083,...,1.0,-1.0,-1.0,1.0,0.0,0.0,0.0,0.000000,0.0,0.0
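
For readers without `fancyimpute`, the same idea (fill each missing value from the three nearest rows) can be sketched with scikit-learn's `KNNImputer`. This is a swapped-in alternative, not the call used in this notebook, and its distance handling may give slightly different values. The toy matrix and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical feature table with two missing entries.
toy = pd.DataFrame({
    "alt_count": [10.0, 12.0, np.nan, 11.0, 50.0],
    "ref_count": [5.0, 6.0, 5.5, np.nan, 40.0],
    "Size": [23.0, 25.0, 24.0, 22.0, 900.0],
})

# Each NaN is replaced by the mean of that column over the 3 nearest rows.
imputer = KNNImputer(n_neighbors=3)
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# Passing columns and index explicitly avoids re-typing column names by hand.
print(filled.isnull().sum())
```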


In [7]:
# Count NaNs post KNN imputation
df2.isnull().sum()

chrom                                                 0
start                                                 0
end                                                   0
sample                                                0
id                                                    0
type                                                  0
SVtype                                                0
Size                                                  0
Ill300x.alt_alnScore_mean                             0
Ill300x.alt_alnScore_std                              0
Ill300x.alt_count                                     0
Ill300x.alt_insertSize_mean                           0
Ill300x.alt_insertSize_std                            0
Ill300x.alt_reason_alignmentScore                     0
Ill300x.alt_reason_insertSizeScore                    0
Ill300x.alt_reason_orientation                        0
Ill300x.amb_alnScore_mean                             0
Ill300x.amb_alnScore_std                        

In [8]:
# Scale Data
# Standardize features by removing the mean and scaling to unit variance
# For more information see the following:
# http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(df2)
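
As a quick check of what the scaling step does, the sketch below applies `StandardScaler` to a made-up two-column matrix and compares the result with the manual calculation `(x - mean) / std`. The toy values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

scaled = StandardScaler().fit_transform(toy)

# Same transformation by hand; StandardScaler uses the population std (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))  # per-column mean after scaling
print(scaled.std(axis=0))   # per-column std after scaling
```

Without this step, features with large numeric ranges (such as genomic coordinates) would dominate the distances used by SVD, tSNE, and DBSCAN.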

***
**Part 2: Data Visualization and tSNE analysis**
***

In [9]:
'''
SVD
'''
ncomps = 100
svd = TruncatedSVD(n_components=ncomps)
svd_fit = svd.fit(X)
Y = svd.fit_transform(X)
dfsvd = pd.DataFrame(Y, columns=['c{}'.format(c) for c in range(ncomps)], index=df.index)

In [10]:
'''
tSNE
'''

tsne = TSNE(n_components=2, random_state=0)
Z = tsne.fit_transform(dfsvd)
dftsne = pd.DataFrame(Z, columns=['x','y'], index=dfsvd.index)
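
The two-stage reduction above (TruncatedSVD to a moderate number of components, then tSNE to 2-D) can be tried end to end on synthetic data. Everything below other than the `TruncatedSVD` and `TSNE` calls is illustrative: the blob data, its size, and `n_components=10` are assumptions chosen to keep the run short.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled feature matrix.
X_toy, _ = make_blobs(n_samples=150, n_features=30, centers=3, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

# Stage 1: linear reduction; the retained variance helps judge n_components.
svd_toy = TruncatedSVD(n_components=10, random_state=0)
Y_toy = svd_toy.fit_transform(X_toy)
print("variance retained:", svd_toy.explained_variance_ratio_.sum())

# Stage 2: non-linear embedding of the reduced data into two dimensions.
Z_toy = TSNE(n_components=2, random_state=0).fit_transform(Y_toy)
emb = pd.DataFrame(Z_toy, columns=["x", "y"])
print(emb.shape)
```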

In [11]:
'''
DBSCAN
'''
# DBSCAN with tSNE data
dbscan = DBSCAN()
labels = dbscan.fit_predict(Z)
print("Unique labels: {}".format(np.unique(labels)))
df['clusterLabel'] = labels

# DBSCAN with SVD data
labels_SVD = dbscan.fit_predict(Y)
print("Unique labels: {}".format(np.unique(labels_SVD)))
df['clusterLabel.SVD'] = labels_SVD

# DBSCAN with raw data
labels_raw = dbscan.fit_predict(X)
print("Unique labels: {}".format(np.unique(labels_raw)))
df['clusterLabel.raw'] = labels_raw

Unique labels: [-1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94]
Unique labels: [-1]
Unique labels: [-1]


In [12]:
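# Note on the output above:
# DBSCAN on the SVD and raw data returns only the label -1, meaning every
# point was classified as noise at the default eps=0.5 and min_samples=5.
# Cluster labels are therefore imported from a previous run.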

In [13]:
df_dbscan = pd.read_csv('DEL.tSNE.raw.csv')
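
A label set containing only `-1` means DBSCAN classified every point as noise; with the default `eps=0.5`, standardized high-dimensional points are rarely that close together. One common heuristic, not used in this notebook, is to pick `eps` from the distribution of k-th nearest-neighbor distances. The sketch below uses synthetic data, and the 90th-percentile cutoff is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a scaled, high-dimensional feature matrix.
X_toy, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

min_samples = 5
default_labels = DBSCAN(min_samples=min_samples).fit_predict(X_toy)
print("default eps:", np.unique(default_labels))

# Distance from each point to its min_samples-th nearest neighbor,
# counting the point itself (distance 0) as the first neighbor.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
dist, _ = nn.kneighbors(X_toy)
kth = dist[:, -1]

# Assumption: use the 90th percentile of those distances as eps.
eps = np.percentile(kth, 90)
tuned_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_toy)
print("tuned eps={:.2f}:".format(eps), np.unique(tuned_labels))
```

Any point whose k-th neighbor distance is at most `eps` becomes a core point, so the tuned run yields at least one real cluster.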

In [14]:
'''
Generate tSNE Plots
'''

'\nGenerate tSNE Plots\n'

In [15]:
'''
Data Cleaning
'''
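dftsne['tandemrep_pct'] = df4['tandemrep_pct']
dftsne['segdup_pct'] = df4['segdup_pct']
# Assign the result instead of calling replace(..., inplace=True) on a column
# selection, which newer pandas versions do not apply to the parent dataframe
dftsne['segdup_pct'] = dftsne['segdup_pct'].replace(0, -1)
dftsne['tandemrep_pct'] = dftsne['tandemrep_pct'].replace(0, -1)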

bins = [-1, 0.2, 0.5, 1]
group_names = ['0-0.2', '0.2-0.5', '0.5-1']
df4['cat'] = pd.cut(df4['segdup_pct'], bins, labels=group_names)
df4['cat2'] = pd.cut(df4['tandemrep_pct'], bins, labels=group_names)

# Size bins (include_lowest=True keeps sizes equal to the lowest edge, 20, in the first bin)
bins = [20,50,100,1000,3000,9062]
df4['Size'] = df4['Size'].abs()
group_names_size = ['20-50', '50-100', '100-1000', '1000-3000', '3000-9062']
df4['size_bin'] = pd.cut(df4['Size'], bins, labels=group_names_size, include_lowest=True)
dftsne['cat'] = df4['cat']
dftsne['cat2'] = df4['cat2']
dftsne['size_bin'] = df4['size_bin']


df4['Size2'] = df4['Size'].apply(lambda x: x/1000)
dftsne['Size2'] = df4['Size2']
dftsne['Size'] = df4['Size']
dftsne['GTcons'] = df4['GTcons']
dftsne['sample'] = df4['sample']
dftsne['refN_pct'] = df4['refN_pct']
dftsne['label'] = df_dbscan['clusterLabel']
dftsne['label.SVD'] = df_dbscan['clusterLabel.SVD']
dftsne['label.raw'] = df_dbscan['clusterLabel.raw']
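
One detail of `pd.cut` worth checking when binning sizes: intervals are right-closed by default, so a value equal to the lowest edge is left unbinned (NaN) unless `include_lowest=True` is passed. The small example below uses made-up sizes with the same edges as the size bins above.

```python
import pandas as pd

# Hypothetical absolute sizes, including both outer bin edges.
sizes = pd.Series([20, 23, 50, 178, 9062])
edges = [20, 50, 100, 1000, 3000, 9062]
names = ['20-50', '50-100', '100-1000', '1000-3000', '3000-9062']

# Default: bins are (20, 50], (50, 100], ... so 20 falls outside every bin.
default_bins = pd.cut(sizes, edges, labels=names)

# include_lowest=True closes the first bin on the left: [20, 50].
inclusive_bins = pd.cut(sizes, edges, labels=names, include_lowest=True)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```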

In [16]:
'''
Generate Plots
'''

'\nGenerate Plots\n'

**Size Distribution - Histogram**

In [17]:
p = Histogram(log_size, values='DEL_log_size', title='HG002 DEL: Size Distribution [5000 Samples]', color='LightSlateGray', bins=19, xlabel="Size[log10]", ylabel="Frequency")
output_file("tSNE4_DEL_Histo_logsize.html")
# show(p)

**Size Distribution - tSNE**

In [18]:
p = Scatter(dftsne, x='x', y='y', color='size_bin', title='HG002 DEL:Size', legend="top_left")
output_file("tSNE7_DEL_SizeBin.html")
# show(p)

**Tandem Repeat Plot**

In [19]:
p = Scatter(dftsne, x='x', y='y', color='cat2', title='HG002 DEL: tSNE tandem repeats', legend="top_left")
output_file("tSNE2_DEL_tandRep.html")
# show(p)

**Segmental Duplication Plot**

In [21]:
p = Scatter(dftsne, x='x', y='y', color='cat', title='HG002 DEL: tSNE segmental duplications', legend="top_left")
output_file("tSNE3_DEL_segDup.html")
# show(p)

**Consensus Genotype**

In [22]:
p = Scatter(dftsne, x='x', y='y', color='GTcons', title='HG002 DEL: Consensus Genotypes', legend="top_left")
output_file("tSNE6_DEL_GTcons.html")
# show(p)

**DBSCAN with tSNE Data**

In [23]:
p = Scatter(dftsne, x='x', y='y', color='label', title='HG002 DEL: tSNE DBSCAN labels', legend="top_left")
output_file("tSNE2_DBSCAN_DEL_label.html")
# show(p)

**DBSCAN with SVD Data**

In [25]:
p = Scatter(dftsne, x='x', y='y', color='label.SVD', title='HG002 DEL: SVD DBSCAN labels', legend="top_left")
output_file("SVD_DBSCAN_DEL_label.html")
show(p)

**DBSCAN with Raw Data**

In [None]:
p = Scatter(dftsne, x='x', y='y', color='label.raw', title='HG002 DEL: Raw DBSCAN labels', legend="top_left")
output_file("raw_DBSCAN_DEL_label.html")
# show(p)

***
**Part 3: Describe DBSCAN labels - Frequency distribution and CDF**
***

In [26]:
lab = df_dbscan['clusterLabel']
lab.SVD = df_dbscan['clusterLabel.SVD']
lab.raw = df_dbscan['clusterLabel.raw']
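
A frequency table and empirical CDF of cluster sizes, as named in this part's title, can be computed directly from a label series. The sketch below is one possible approach on made-up labels; the variable names are hypothetical and the notebook's own implementation may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical DBSCAN labels: one noise point (-1) and three clusters.
labels = pd.Series([-1, 0, 0, 0, 1, 1, 2, 2, 2, 2], name="clusterLabel")

# Frequency distribution: number of variants carrying each label.
freq = labels.value_counts().sort_index()

# Empirical CDF of cluster sizes: after sorting, the i-th smallest size
# is paired with the cumulative fraction i / n.
sizes = np.sort(freq.values)
cdf = np.arange(1, len(sizes) + 1) / len(sizes)

print(freq)
print(list(zip(sizes, cdf)))
```

Reading the CDF at a given size shows what share of clusters are too small to supply the requested number of variants.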