# Determine possible confounder effects
There are three possible confounders:
- Batch effect - region: "scz_s234_eur", "scz_swe1_eur", "scz_swe5_eur", "scz_swe6_eur"
- Gender
- PCA

In [1]:
import pandas as pd, numpy as np
import os
from collections import Counter
cwd = os.path.expanduser("~/GIT/cnv-gene-mapping/data/swcnv")

Import phenotype.

In [None]:
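pheno = pd.read_csv(f"{cwd}/swcnv.pheno", header = 0, sep = r"\s+", usecols = [0,3])
# Drop samples with missing phenotype (-9); copy so the assignment below does not act on a view
pheno = pheno[pheno["SCZ"] != -9].copy()
# Shift SCZ codes from 1/2 to 0/1
pheno["SCZ"] = pheno["SCZ"] - 1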

In [3]:
Counter(pheno["SCZ"])

Counter({0: 6256, 1: 4978})

## Batch effect: region

In [4]:
batch = pd.read_csv(f"{cwd}/swcnv.clusters", sep = r"\s+", usecols = [0,2], header = 0, names = ["FID", "batch"])
batch = pd.merge(pheno, batch, how = "inner", on = "FID")

Samples from the Swedish Schizophrenia Study were collected in a multi-year project and genotyped in six batches (sw1-6); see the paper [description](https://www.sciencedirect.com/science/article/pii/S0092867418306585). The cluster file contains four batch labels (sw2-4 appear to be combined as "scz_s234_eur"):

         'scz_swe1_eur': 417
         'scz_s234_eur': 4299
         'scz_swe5_eur': 4376
         'scz_swe6_eur': 1999


In [5]:
Counter(batch["batch"])

Counter({'scz_swe1_eur': 417,
         'scz_s234_eur': 4299,
         'scz_swe6_eur': 1999,
         'scz_swe5_eur': 4376})

Create an integer-coded variable "batch_n" to stand for batches:

        sw234: 0
        sw1: 1
        sw5: 2
        sw6: 3

In [15]:
batch["batch_n"] = batch["batch"].apply(lambda x: 0 if x == "scz_s234_eur" else 1 if x == "scz_swe1_eur" else 2 if x == "scz_swe5_eur" else 3)

Create 4 dummy variables "batch1" to "batch4" to stand for sw1, sw234, sw5, sw6.

          batch1    batch2    batch3    batch4
    sw1        1         0         0         0
    sw234      0         1         0         0
    sw5        0         0         1         0
    sw6        0         0         0         1
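
A side note on this coding, as a minimal sketch: if all four indicators enter a regression together with an intercept, the design matrix is rank deficient, because the four columns sum to the intercept column. The toy design below (two hypothetical samples per batch, not the real data) checks this with NumPy.

```python
import numpy as np

# Hypothetical toy design: two samples per batch, ordered sw1, sw234, sw5, sw6.
labels = np.repeat(np.arange(4), 2)
dummies = np.eye(4, dtype=int)[labels]            # 8 x 4 one-hot matrix
intercept = np.ones((len(labels), 1), dtype=int)

full = np.hstack([intercept, dummies])            # intercept + 4 dummies
reduced = np.hstack([intercept, dummies[:, :3]])  # intercept + 3 dummies

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full:    columns =", full.shape[1], "rank =", rank_full)
print("reduced: columns =", reduced.shape[1], "rank =", rank_reduced)
```

The full design has one more column than its rank, so one coefficient cannot be estimated; dropping any one indicator makes that batch the reference level.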

In [16]:
batch["batch1"] = [1 if x == "scz_swe1_eur" else 0 for x in batch["batch"]]
batch["batch2"] = [1 if x == "scz_s234_eur" else 0 for x in batch["batch"]]
batch["batch3"] = [1 if x == "scz_swe5_eur" else 0 for x in batch["batch"]]
batch["batch4"] = [1 if x == "scz_swe6_eur" else 0 for x in batch["batch"]]

In [17]:
batch.head()

Unnamed: 0,FID,SCZ,batch,batch_n,batch1,batch2,batch3,batch4
0,PT-1RTW,0,scz_swe1_eur,1,1,0,0,0
1,PT-1RTX,0,scz_swe1_eur,1,1,0,0,0
2,PT-1RTY,0,scz_swe1_eur,1,1,0,0,0
3,PT-1RTZ,1,scz_swe1_eur,1,1,0,0,0
4,PT-1RU1,0,scz_swe1_eur,1,1,0,0,0
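
The same indicator columns can probably be built in one step with `pd.get_dummies`. This is a sketch on a hypothetical four-row frame (the FID values are made up); the column names "batch1" to "batch4" follow the table above.

```python
import pandas as pd

# Hypothetical toy frame with one sample per batch label.
toy = pd.DataFrame({
    "FID": ["A", "B", "C", "D"],
    "batch": ["scz_swe1_eur", "scz_s234_eur", "scz_swe5_eur", "scz_swe6_eur"],
})

# Fix the category order so the columns map to sw1, sw234, sw5, sw6.
order = ["scz_swe1_eur", "scz_s234_eur", "scz_swe5_eur", "scz_swe6_eur"]
toy["batch"] = pd.Categorical(toy["batch"], categories=order)

onehot = pd.get_dummies(toy["batch"], dtype=int)
onehot.columns = ["batch1", "batch2", "batch3", "batch4"]
toy = pd.concat([toy, onehot], axis=1)
print(toy)
```

Passing `drop_first=True` to `pd.get_dummies` would omit the first category, which is one way to choose a reference batch before fitting a model with an intercept.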


Use the R function glm to fit logistic regressions of SCZ status on each possible confounder.

In [18]:
%get batch

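### Use the integer-coded variable "batch_n" as a covariate
- z score: -2.278
- p-value: 0.0227
- Note: "batch_n" enters the model as a numeric variable, so this fit assumes a linear trend in log-odds across the (arbitrary) batch codes.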

In [19]:
res = glm(batch$SCZ ~ batch$batch_n, family = binomial())
summary(res)


Call:
glm(formula = batch$SCZ ~ batch$batch_n, family = binomial())

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.107  -1.077  -1.062   1.265   1.298  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.16700    0.02933  -5.695 1.24e-08 ***
batch$batch_n -0.03721    0.01634  -2.278   0.0227 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 15245  on 11090  degrees of freedom
Residual deviance: 15239  on 11089  degrees of freedom
AIC: 15243

Number of Fisher Scoring iterations: 3


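### Use 4 dummy variables "batch1" to "batch4" as covariates
- batch3 (sw5) has a p-value of 5.41e-07; batch1 and batch2 are not significant.
- batch4 (sw6) is not estimated because it is collinear with the intercept and the other three dummies, so sw6 acts as the reference batch.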

In [20]:
res1 = glm(batch$SCZ ~ batch$batch1 + batch$batch2 + batch$batch3 + batch$batch4, family = binomial())
summary(res1)


Call:
glm(formula = batch$SCZ ~ batch$batch1 + batch$batch2 + batch$batch3 + 
    batch$batch4, family = binomial())

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.196  -1.119  -1.023   1.237   1.340  

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.10314    0.04479  -2.303   0.0213 *  
batch$batch1  0.14632    0.10772   1.358   0.1744    
batch$batch2 -0.03525    0.05423  -0.650   0.5157    
batch$batch3 -0.27232    0.05434  -5.011 5.41e-07 ***
batch$batch4       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 15245  on 11090  degrees of freedom
Residual deviance: 15198  on 11087  degrees of freedom
AIC: 15206

Number of Fisher Scoring iterations: 4


## Gender
- Male (coded 0): 5697
- Female (coded 1): 4629

In [21]:
gender = pd.read_table(f"{cwd}/swcnv.qc6.fam", sep=r"\s+", header = None, usecols = (0,4,5), names = ["FID", "sex", "pheno"])
# Shift both codes from 1/2 to 0/1
gender["pheno"] = gender["pheno"] - 1
gender["sex"] = gender["sex"] - 1

In [22]:
gender = pd.merge(pheno, gender[["FID", "sex"]], how = "inner", on = "FID")

In [23]:
Counter(gender["sex"])

Counter({0: 5697, 1: 4629})

### Use covariate "gender"
- z score: -8.801
- p-value: < 2e-16

In [24]:
%get gender

In [25]:
res2 = glm(gender$SCZ ~ gender$sex, family = binomial())
summary(res2)


Call:
glm(formula = gender$SCZ ~ gender$sex, family = binomial())

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.146  -1.146  -1.002   1.209   1.364  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.07481    0.02652  -2.821  0.00478 ** 
gender$sex  -0.35285    0.04009  -8.801  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 14178  on 10325  degrees of freedom
Residual deviance: 14100  on 10324  degrees of freedom
AIC: 14104

Number of Fisher Scoring iterations: 4


## PCA
MDS: multidimensional scaling on Euclidean distance, which for classical MDS is equivalent to PCA. The components C1 to C10 are used as covariates.

In [28]:
mds = pd.read_table(f"{cwd}/swcnv.mds", sep=r"\s+", usecols = [0,3,4,5,6,7,8,9,10,11,12])
mds = pd.merge(pheno, mds, how = "inner", on = "FID")
mds.head()

Unnamed: 0,FID,SCZ,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10
0,PT-1RTW,0,-0.008445,0.005512,0.00123,0.002499,0.009046,0.000775,0.002276,-0.001839,0.003567,-0.00405
1,PT-1RTX,0,0.002004,-0.004334,-0.001712,0.003688,0.000497,0.006221,-0.009228,-0.001324,-0.002098,0.000272
2,PT-1RTY,0,0.015069,0.001424,-0.002671,0.001604,0.000752,0.003769,-0.001655,-5e-05,-0.000774,-0.000185
3,PT-1RTZ,1,0.00104,-0.00595,-0.001848,-0.004309,-0.007455,0.002835,-0.01296,-0.005529,0.004817,-0.011632
4,PT-1RU1,0,-0.002875,-0.001082,-0.002621,0.002374,-0.000494,0.004072,-0.003412,-0.000758,-0.003522,0.002001


In [36]:
mds.shape

(11112, 12)

### Use all 10 eigenvectors
The first 4 are significant at a p-value threshold of 0.001.
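
As a sketch of the threshold rule, the p-values below are copied from the summary(res3) coefficient table in this section. C1 is reported only as "< 2e-16", so 2e-16 is used as an upper bound.

```python
# p-values copied from the summary(res3) coefficient table.
p_values = {
    "C1": 2e-16,   # reported as "< 2e-16"; upper bound used here
    "C2": 2.17e-05,
    "C3": 4.18e-05,
    "C4": 5.16e-05,
    "C5": 0.319,
    "C6": 0.033,
    "C7": 0.860,
    "C8": 0.148,
    "C9": 0.764,
    "C10": 0.976,
}

threshold = 0.001
significant = [name for name, p in p_values.items() if p < threshold]
print(significant)
```

C6 (p = 0.033) would pass a 0.05 cutoff but not the 0.001 threshold used here.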

In [29]:
%get mds

In [32]:
res3 = glm(mds$SCZ ~ mds$C1 + mds$C2 + mds$C3 + mds$C4 + mds$C5 + mds$C6 + mds$C7 + mds$C8 + mds$C9 + mds$C10, family = binomial())

In [33]:
summary(res3)


Call:
glm(formula = mds$SCZ ~ mds$C1 + mds$C2 + mds$C3 + mds$C4 + mds$C5 + 
    mds$C6 + mds$C7 + mds$C8 + mds$C9 + mds$C10, family = binomial())

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6544  -1.0548  -0.9341   1.2707   1.6321  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.2244     0.0194 -11.571  < 2e-16 ***
mds$C1       36.0621     2.2357  16.130  < 2e-16 ***
mds$C2      -13.6085     3.2043  -4.247 2.17e-05 ***
mds$C3      -13.6412     3.3294  -4.097 4.18e-05 ***
mds$C4       16.7767     4.1443   4.048 5.16e-05 ***
mds$C5        3.7912     3.8056   0.996    0.319    
mds$C6       -8.6163     4.0418  -2.132    0.033 *  
mds$C7        0.7361     4.1746   0.176    0.860    
mds$C8        6.2205     4.2996   1.447    0.148    
mds$C9        1.3255     4.4080   0.301    0.764    
mds$C10      -0.1356     4.5557  -0.030    0.976    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion paramete