### CapstoneTwo: Genotyping SNP classification
Carsten Bruckner

## Step 1: Data Wrangling
* Read a probeset performance file that summarizes the performance of more than 800,000 probesets as a number of figures of merit, such as call rate.
* Read a similar file of predictor metrics that, in this case, summarizes the agreement of some of these probesets with reference data (the 1000 Genomes Project) and their reproducibility.
* Preprocess these files to exclude measurements from non-standard probesets, which use different sets of metrics and may not be comparable to standard "diploid, biallelic" probesets.
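
The exclusion step in the last bullet could be sketched as below. The rule itself is an assumption made only for illustration: it treats a probeset as non-standard when `multi_snp_id` is filled in or `hemizygous` equals 1. Both columns exist in the performance file, but the criterion actually used in this project may differ, and the function name `keep_standard_probesets` and the toy rows are hypothetical.

```python
import pandas as pd


def keep_standard_probesets(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that look like standard diploid, biallelic probesets.

    ASSUMPTION (not confirmed by the notebook): a probeset is non-standard
    if it has a multi_snp_id or is flagged hemizygous.
    """
    has_multi_id = df["multi_snp_id"].notna()
    is_hemizygous = df["hemizygous"] == 1
    return df.loc[~has_multi_id & ~is_hemizygous].copy()


# Toy rows, invented for the example.
toy = pd.DataFrame(
    {
        "probeset_id": ["ps1", "ps2", "ps3"],
        "multi_snp_id": [None, "multi-1", None],
        "hemizygous": [0, 0, 1],
        "CR": [99.5, 98.0, 97.2],
    }
)
standard = keep_standard_probesets(toy)
print(standard)
```

Returning a copy keeps later column assignments on the filtered frame from touching the original data.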

In [None]:
%reset

In [1]:
import pandas as pd
import os
from zipfile import ZipFile
import pandas_profiling

In [2]:
pwd

'/Users/Carsten/OneDrive/Documents/Springboard/git_repositories/DataScienceCapstoneTwo/notebooks'

In [3]:
# os.chdir('/Users/Carsten/OneDrive/Documents/Springboard/git_repositories/DataScienceCapstoneTwo/Notebooks')

In [4]:
zip_absolute_path = '/Users/Carsten/OneDrive/Documents/Springboard/git_repositories/DataScienceCapstoneTwo/raw_data/Output.allps.zip'

In [5]:
zip_relative_path = os.path.relpath(zip_absolute_path)
zip_relative_path

'../raw_data/Output.allps.zip'

In [6]:
# Raw data files inside the zip package; paths are relative to the project's Notebooks directory.
# Note: this file name (Output_allps.zip) differs from the one in zip_absolute_path above (Output.allps.zip).
zip_file = ZipFile('../raw_data/Output_allps.zip')
ps_file = zip_file.open('Output_allps/genotype-inliers/filtered/Ps.performance.txt')
snp_file = zip_file.open('Output_allps/genotype-inliers-gtools/SnpSummary/Axiom_CombinedSnpSummary_0.15.txt')

# While testing the notebook, read only the first 1000 rows (commented-out line below).
# Warning: ps_file is a stream, so a second read_csv call on it might continue
# from the current position (e.g. read the next 1000 rows). Recreate the file
# handle ps_file earlier in this cell before reading again.
ps_df = pd.read_csv(ps_file, delimiter='\t', skiprows=89)
#ps_df = pd.read_csv(ps_file, delimiter='\t', skiprows=89, nrows=1000)

# It might be better to load the entire file and then randomly sample rows
# for compute-intensive operations using
# sample_df = df.sample(n_rows)

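
One way to avoid the stale file handle mentioned in the cell above is to reopen the zip member on every read. The sketch below is a minimal illustration, not the notebook's own code: the helper name `read_zip_member` and the tiny in-memory zip are hypothetical stand-ins so the example runs without the raw data.

```python
import io
from zipfile import ZipFile

import pandas as pd


def read_zip_member(zip_source, member, **read_csv_kwargs):
    """Open `member` fresh on each call so every read starts at the top."""
    with ZipFile(zip_source) as zf:
        with zf.open(member) as handle:
            return pd.read_csv(handle, delimiter="\t", **read_csv_kwargs)


# Build a small in-memory zip (hypothetical data) to stand in for the raw file.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    rows = "".join(f"ps{i}\t{90 + i}\n" for i in range(10))
    zf.writestr("demo/Ps.performance.txt", "probeset_id\tCR\n" + rows)

# Two partial reads return the same first rows because the handle is recreated.
first = read_zip_member(buffer, "demo/Ps.performance.txt", nrows=3)
second = read_zip_member(buffer, "demo/Ps.performance.txt", nrows=3)

# Alternative from the comments: load everything, then sample rows.
full = read_zip_member(buffer, "demo/Ps.performance.txt")
sample_df = full.sample(n=5, random_state=0)
```

For the real file, the same helper would take the zip path and pass `skiprows=89` through `read_csv_kwargs`.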


Could the mixed-type warning for the third column, `multi_snp_id`, arise because the value is sometimes 'NaN'?
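
A quick way to investigate this question is to count the Python types stored in the column. The sketch below uses a small invented table, not the real file. It relies on two pandas behaviours: by default `read_csv` treats the literal text `NaN` and empty fields as missing, and missing values in an object column are stored as float `nan` next to the strings. Passing an explicit `dtype` for the column is the usual way to skip per-chunk type inference, which is what typically triggers the warning; whether that is the cause here is an assumption to verify on the full data.

```python
import io

import pandas as pd

# Invented rows: two real IDs, one literal "NaN", one empty field.
text = (
    "probeset_id\tmulti_snp_id\n"
    "ps1\tmulti-1\n"
    "ps2\tNaN\n"
    "ps3\t\n"
    "ps4\tmulti-4\n"
)

# An explicit dtype avoids type inference for this column.
demo = pd.read_csv(
    io.StringIO(text), delimiter="\t", dtype={"multi_snp_id": object}
)

# Count which Python types actually occur in the column.
type_counts = (
    demo["multi_snp_id"].map(lambda v: type(v).__name__).value_counts().to_dict()
)
n_missing = int(demo["multi_snp_id"].isna().sum())
print(type_counts, n_missing)
```

Running the same two lines on `ps_df['multi_snp_id']` would show whether the column mixes strings with float `nan` values.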

In [7]:
#check number of rows loaded
len(ps_df)

884158

In [None]:
# Takes many minutes; slows down during the correlation step and eventually crashes with these settings.
#profile_report = ps_df.profile_report(html={'style': {'full_width': True}})
#profile_report.to_file("reports/ps_file.html")


In [None]:
#del profile_report      # still doesn't help if the input parameters in the next cell are changed and it is rerun

In [9]:
# https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html
# Many other options can be set.
# pool_size is the number of CPUs; 0 uses all available. CB: but it is not used for the intensive tasks, so not useful.
# minimal=True skips compute-intensive operations for large datasets.
# correlations=None turns off the correlation steps (which are compute-intensive). CB: also removes 'Missing Values'.

# 10000 rows fails with standard settings
#sample_df = ps_df
sample_df = ps_df.sample(1000)          # subset of rows to profile
file_title = 'PS_performance.sample1000'     # remember to update this when the sample size changes

profile_report = sample_df.profile_report(
    html={'style': {'full_width': True}},
    title= file_title,
)
profile_report.to_file('../reports/'+file_title+'.html')

(using `df.profile_report(correlations={"cramers": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')


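
Since the report title has to be updated by hand whenever the sample size changes, one option is to derive both from a single number. This is a minimal sketch with a hypothetical helper, `make_profile_sample`, and a toy frame; the fixed `seed` is an added assumption so that reruns profile the same rows.

```python
import pandas as pd


def make_profile_sample(df, n_rows, prefix="PS_performance", seed=0):
    """Return a reproducible row sample and a title that encodes its size."""
    n = min(n_rows, len(df))  # never ask for more rows than exist
    sample_df = df.sample(n=n, random_state=seed)
    file_title = f"{prefix}.sample{n}"
    return sample_df, file_title


# Toy frame standing in for ps_df.
toy_df = pd.DataFrame({"CR": range(5000)})
sample_df, file_title = make_profile_sample(toy_df, 1000)
report_path = "../reports/" + file_title + ".html"
print(file_title, report_path)
```

The returned `sample_df` and `file_title` can then be passed to `profile_report(...)` and `to_file(...)` exactly as in the cell above.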

In [10]:
print(ps_df.dtypes[0:43])

probeset_id            object
affy_snp_id            object
multi_snp_id           object
CR                    float64
FLD                   float64
HomFLD                float64
HomFLD_hap            float64
HetSO                 float64
HomRO                 float64
HomRO_hap             float64
nMinorAllele          float64
Nclus                   int64
n_AA                  float64
n_AB                  float64
n_BB                  float64
n_A                   float64
n_B                   float64
n_CN0                 float64
n_NC                    int64
hemizygous              int64
specialSNP_chr         object
gender_metrics         object
ConversionType         object
CopyNumIssue          float64
BestProbeset            int64
BestandRecommended      int64
HomHet                float64
AA.meanX              float64
AA.meanY              float64
AA.varX               float64
AA.varY               float64
AB.meanX              float64
AB.meanY              float64
AB.varX   
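
When the dtype listing is printed in slices, hand-written bounds make it easy to skip a column. A minimal sketch of generating consecutive slices instead, using a hypothetical helper `dtype_chunks` and a toy frame:

```python
import pandas as pd


def dtype_chunks(df, chunk_size):
    """Split df.dtypes into consecutive slices that cover every column once."""
    dtypes = df.dtypes
    return [
        dtypes.iloc[start:start + chunk_size]
        for start in range(0, len(dtypes), chunk_size)
    ]


# Toy frame with seven float columns.
toy_df = pd.DataFrame({f"col{i}": [0.0] for i in range(7)})
chunks = dtype_chunks(toy_df, chunk_size=3)
for chunk in chunks:
    print(chunk)
    print()
```

For the full table, `dtype_chunks(ps_df, 43)` would print the listing in blocks of 43 columns without gaps.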

In [11]:
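# Note: the previous cell printed columns 0-42, so starting at 44 skips column 43;
# use ps_df.dtypes[43:] to include it.
print(ps_df.dtypes[44:])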

CN0.meanY                   float64
NC.meanX                    float64
NC.meanY                    float64
AA.varX.Z                   float64
AA.varY.Z                   float64
AB.varX.Z                   float64
AB.varY.Z                   float64
BB.varX.Z                   float64
BB.varY.Z                   float64
MMD                         float64
MinorAlleleFrequency        float64
H.W.p-Value                 float64
H.W.chisquared.statistic    float64
nSamples                    float64
nCalls                      float64
count_ma_A                  float64
count_ma_B                  float64
count_ma_C                  float64
count_ma_D                  float64
count_ma_E                  float64
count_ma_F                  float64
nAllelesTested              float64
nAllelesDetected            float64
NHetClus                    float64
nMajorAlleles               float64
maxMinorAllele              float64
nMinorAlleles               float64
MAFall                      