# Genetic Association Testing

This is an example of the analysis workflow used for the genetic association study described in https://doi.org/10.1016/j.jdermsci.2023.02.003. All samples were genotyped using TaqMan™ SNP Genotyping Assays. The genotype distributions of the 3 tag SNPs did not deviate significantly from Hardy-Weinberg equilibrium in either the case or the control groups. Genotypes were successfully called for more than 99% of samples, and the minor allele frequencies in the control groups were consistent with those reported for the East Asian (EAS) population of the 1000 Genomes Project Phase 3 (1KGPh3). Together, this evidence supports the reliability of the generated data (for further details, please see the article).
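The Hardy-Weinberg check mentioned above can be illustrated with a 1-degree-of-freedom chi-square goodness-of-fit test. This is a minimal sketch, not the procedure used in the article: the function name `hwe_chi_square` and the genotype counts are hypothetical, and tools such as PLINK (`--hardy`) report an exact test rather than this approximation.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test (1 df) for Hardy-Weinberg equilibrium.

    Takes the observed counts of the three genotypes (AA, AB, BB) and
    returns the chi-square statistic and its p-value.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, chosen to sit exactly at equilibrium
chi2, p_value = hwe_chi_square(n_aa=360, n_ab=480, n_bb=160)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

A large p-value means there is no evidence of deviation from equilibrium, which is the situation described for the 3 tag SNPs.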
    
    
## Working files
#### 1. PED
The PED file is a whitespace-delimited (space or tab) file. The first 6 columns are mandatory, and the genotypes follow from column 7 onwards:
- Family ID
- Individual ID
- Paternal ID
- Maternal ID
- Sex (1=male; 2=female; other=unknown)
- Phenotype (case=2, control=1)
- Genotypes (column 7 onwards): two allele columns per marker, coded as e.g. A, C, G, or T, with 0 as the missing genotype character. All markers should be biallelic.
#### 2. MAP
Each line of the MAP file describes a single marker and must contain exactly 4 columns:
- Chromosome (1-22, X, Y or 0 if unplaced)
- rs# or SNP identifier
- Genetic distance (morgans)
- Base-pair position (bp units)
#### 3. Phenotype file
The phenotype file contains 3 columns (one row per individual):
- Family ID
- Individual ID
- Phenotype: for this example, male cases with hair loss pattern type 2 and male controls were selected.
#### 4. Covariate file
The covariate file contains 3 columns (one row per individual):
- Family ID
- Individual ID
- AGE
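The layout of the phenotype and covariate files can be sketched as follows. This is an illustration under stated assumptions, not the files used in the study: the sample records are invented, the header names `FID`, `IID`, `MALES_2` and `AGE` are assumed (a header row is what lets PLINK select a column by name), and the coding rule in `males_type2_phenotype` (2 = case, 1 = control, -9 = excluded) is a hypothetical reading of the selection described above.

```python
import csv
import io

# Hypothetical sample records, invented for illustration only:
# (family ID, individual ID, sex [1=male, 2=female],
#  hair-loss pattern type or None for controls, age)
samples = [
    ("10", "CASE_A", 1, 2, 41),
    ("10", "CASE_B", 1, 3, 55),
    ("20", "CTRL_A", 1, None, 47),
    ("20", "CTRL_B", 2, None, 39),
]


def males_type2_phenotype(sex, hair_loss_type):
    """Assumed coding: 2 = male type-2 case, 1 = male control, -9 = excluded."""
    if sex != 1:
        return -9
    if hair_loss_type is None:
        return 1
    return 2 if hair_loss_type == 2 else -9


def to_tsv(header, rows):
    """Render a header and rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


pheno_text = to_tsv(
    ["FID", "IID", "MALES_2"],
    [(fid, iid, males_type2_phenotype(sex, hl)) for fid, iid, sex, hl, _ in samples],
)
covar_text = to_tsv(
    ["FID", "IID", "AGE"],
    [(fid, iid, age) for fid, iid, _, _, age in samples],
)
print(pheno_text)
print(covar_text)
```

Individuals coded -9 are treated as having a missing phenotype and therefore drop out of the case-control comparison.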

In [33]:
import pandas as pd

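# Example of a PED file (whitespace-delimited, no header row).
# A raw string avoids the invalid-escape warning for '\s'.
ped_file = pd.read_csv("SELL_AllSNPs.ped", sep=r'\s+', header=None)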
ped_file.head(5)

    0          1  2  3  4  5  6  7  8  9  10  11
0  10  FAGA_0001  0  0  2  2  C  T  T  T   T   T
1  10  FAGA_0002  0  0  2  2  C  T  A  T   C   T
2  10  FAGA_0003  0  0  2  2  T  T  T  T   C   T
3  10  FAGA_0004  0  0  2  2  T  T  T  T   C   T
4  10  FAGA_0005  0  0  2  2  T  T  A  A   T   T


In [228]:
# Example of a MAP file (tab-delimited, no header row)
map_file = pd.read_csv("SELL_AllSNPs.map", sep='\t', header=None)
map_file

   0          1  2          3
0  1  rs2223286  0  169696491
1  1  rs2420381  0  169706566
2  1  rs4987349  0  169698838
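Because every marker in the MAP file contributes two allele columns to the PED file, each PED line should hold 6 + 2 × (number of markers) fields. The check below is a minimal sketch using the first two PED rows and the three MAP rows shown above; the helper names `check_ped_against_map` and `genotypes_by_marker` are hypothetical.

```python
ped_text = """\
10 FAGA_0001 0 0 2 2 C T T T T T
10 FAGA_0002 0 0 2 2 C T A T C T
"""

map_text = """\
1 rs2223286 0 169696491
1 rs2420381 0 169706566
1 rs4987349 0 169698838
"""


def check_ped_against_map(ped_lines, map_lines):
    """Return the expected PED field count and the 1-based numbers of bad lines."""
    expected = 6 + 2 * len(map_lines)
    bad_lines = [
        line_no
        for line_no, line in enumerate(ped_lines, start=1)
        if len(line.split()) != expected
    ]
    return expected, bad_lines


def genotypes_by_marker(ped_line, map_lines):
    """Pair the allele columns of one PED line with the marker IDs, in MAP order."""
    fields = ped_line.split()
    alleles = fields[6:]
    marker_ids = [line.split()[1] for line in map_lines]
    return {
        marker: (alleles[2 * i], alleles[2 * i + 1])
        for i, marker in enumerate(marker_ids)
    }


ped_lines = ped_text.splitlines()
map_lines = map_text.splitlines()

expected_fields, bad_lines = check_ped_against_map(ped_lines, map_lines)
first_sample = genotypes_by_marker(ped_lines[0], map_lines)
print(expected_fields, bad_lines)
print(first_sample)
```

Note that the allele columns follow the order of the MAP file, not the base-pair order of the markers.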


## Case-control analyses using PLINK (version 1.9)

PLINK (version 1.9) was used to perform logistic regression case-control analyses on the genotypes, assuming additive, dominant, recessive, and 2-degrees-of-freedom (genotypic) models.

For each tag SNP, the odds ratio (OR), p-value, and 95% confidence interval (CI) were calculated.
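These three quantities are linked by the usual Wald relations for logistic regression: OR = exp(beta), CI = exp(beta ± z × SE), and a two-sided p-value from the statistic z = beta / SE. The sketch below is illustrative only. The OR (1.350) and STAT (1.633) inputs are the rounded values of the rs4987349 ADD row in the genotypic-model output of this document, the standard error is back-calculated from them, and the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Odds ratio and confidence limits from a log-odds coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Rounded values from the rs4987349 ADD row (genotypic model)
odds_ratio = 1.350
stat = 1.633

beta = math.log(odds_ratio)  # log-odds coefficient
se = beta / stat             # back-calculated, so only approximate

or_value, lower, upper = odds_ratio_with_ci(beta, se)
p_value = two_sided_p(stat)
print(f"OR = {or_value:.3f}, 95% CI = ({lower:.3f}, {upper:.3f}), p = {p_value:.4f}")
```

An interval that contains 1 corresponds to a p-value above 0.05, which is the case for this row.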

Subgroup analyses were performed comparing male pattern hair loss type 2 (pheno_name = MALES_2) against male controls, with age adjustment using the covariate file.

Results with a p-value < 0.05 after multiple-testing correction were considered statistically significant.
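The correction method is not stated in this section, so the sketch below shows two common options, Bonferroni and Benjamini-Hochberg, purely as an assumption (PLINK's `--adjust` option reports these among others). The input p-values are the three GENO_2DF values from the genotypic-model output in this document, and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# GENO_2DF p-values for rs2223286, rs4987349 and rs2420381
snps = ["rs2223286", "rs4987349", "rs2420381"]
raw_p = [0.006985, 0.2048, 0.6182]

bonf_p = bonferroni(raw_p)
bh_p = benjamini_hochberg(raw_p)

for snp, p, pb, pf in zip(snps, raw_p, bonf_p, bh_p):
    print(f"{snp}: raw={p:.6f} bonferroni={pb:.6f} bh={pf:.6f}")
```

With only 3 tests, both corrections multiply the smallest p-value by 3, so the two methods agree on which SNP remains below 0.05.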

In [1]:
from genetic_assoc_testing import main
main("SELL_AllSNPs", "PHENO_ALL_10Jan2022.txt", "COVAR_1_AGE.txt", pheno_name="MALES_2")

PLINK command for chi-square allelic test has been executed successfully.
PLINK command for association analysis using Fisher's exact test has been executed successfully.
PLINK command for logistic regression, assuming additive model, has been executed successfully.
PLINK command for logistic regression, assuming dominant model, has been executed successfully.
PLINK command for logistic regression, assuming recessive model, has been executed successfully.
PLINK command for logistic regression, assuming genotypic model, has been executed successfully.
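The source of `genetic_assoc_testing.main` is not shown here, so the following is only a guess at the kind of PLINK 1.9 command lines that could produce the six analyses reported above. The function `build_plink_commands`, the output prefixes, and the choice of the `hide-covar` modifier are assumptions; the flags themselves (`--file`, `--pheno`, `--pheno-name`, `--covar`, `--assoc`, `--logistic`, `--ci`, `--out`) are standard PLINK 1.9 options.

```python
import subprocess


def build_plink_commands(prefix, pheno_file, covar_file, pheno_name, plink="plink"):
    """Build one PLINK 1.9 argument list per analysis (nothing is executed here)."""
    base = [plink, "--file", prefix, "--pheno", pheno_file, "--pheno-name", pheno_name]
    # Logistic models are adjusted for the covariates and report 95% CIs
    logistic = base + ["--covar", covar_file, "--ci", "0.95", "--logistic"]

    def out(label):
        # Hypothetical output prefix, e.g. MALES_2_ADDITIVE
        return ["--out", f"{pheno_name}_{label}"]

    return {
        "ALLELIC": base + ["--assoc"] + out("ALLELIC"),
        "FISHER": base + ["--assoc", "fisher"] + out("FISHER"),
        "ADDITIVE": logistic + ["hide-covar"] + out("ADDITIVE"),
        "DOMINANT": logistic + ["dominant", "hide-covar"] + out("DOMINANT"),
        "RECESSIVE": logistic + ["recessive", "hide-covar"] + out("RECESSIVE"),
        "GENOTYPIC": logistic + ["genotypic", "hide-covar"] + out("GENOTYPIC"),
    }


def run_all(commands):
    """Run every command in turn; requires PLINK 1.9 on the PATH."""
    for label, command in commands.items():
        subprocess.run(command, check=True)
        print(f"PLINK command for {label} has been executed successfully.")


commands = build_plink_commands(
    "SELL_AllSNPs", "PHENO_ALL_10Jan2022.txt", "COVAR_1_AGE.txt", pheno_name="MALES_2"
)
for label, command in commands.items():
    print(label, " ".join(command))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names; `run_all(commands)` would execute the analyses once the input files are in place.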


In [31]:
import os

# List all files in the current directory
files = os.listdir('./')

# Collect the logistic regression result files for the 'MALES_2' phenotype
results_files = []
for file in files:
    if file.startswith('MALES_2') and file.endswith('.assoc.logistic'):
        results_files.append(file)
print(results_files)

['MALES_2control_AGE_GENOTYPIC.assoc.logistic', 'MALES_2Control_AGE_ADDITIVE.assoc.logistic', 'MALES_2Control_AGE_DOMINANT.assoc.logistic', 'MALES_2control_AGE_RECESSIVE.assoc.logistic']


In [43]:
# Dictionary to store DataFrames
dataframes = {}

# Iterate over the files and read them into DataFrames
for file in results_files:
    df = pd.read_csv(file, sep=r'\s+', header=0)
    dataframes[file] = df

# Print each DataFrame separately
for filename, df in dataframes.items():
    print(f"Results from file: {filename}")
    print(df)
    print("\n")

Results from file: MALES_2control_AGE_GENOTYPIC.assoc.logistic
   CHR        SNP         BP A1      TEST  NMISS     OR      STAT         P
0    1  rs2223286  169696491  C       ADD    521  1.005  0.009481  0.992400
1    1  rs2223286  169696491  C    DOMDEV    521  2.289  1.345000  0.178700
2    1  rs2223286  169696491  C  GENO_2DF    521    NaN  9.928000  0.006985
3    1  rs4987349  169698838  C       ADD    517  1.350  1.633000  0.102500
4    1  rs4987349  169698838  C    DOMDEV    517  1.024  0.095820  0.923700
5    1  rs4987349  169698838  C  GENO_2DF    517    NaN  3.172000  0.204800
6    1  rs2420381  169706566  A       ADD    518  0.760 -0.976900  0.328600
7    1  rs2420381  169706566  A    DOMDEV    518  1.237  0.632900  0.526800
8    1  rs2420381  169706566  A  GENO_2DF    518    NaN  0.961900  0.618200
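To pick out the tests that fall below a significance threshold in a table such as the one above, the rows can be filtered by `TEST` and `P`. This is a minimal sketch: the rows are copied from the genotypic-model output above (written with `NA` for missing values, as in PLINK's files), and the nominal threshold `alpha = 0.05` is applied to uncorrected p-values as an illustration only.

```python
import io

import pandas as pd

# Rows copied from the genotypic-model output above
genotypic_text = """\
CHR SNP BP A1 TEST NMISS OR STAT P
1 rs2223286 169696491 C ADD 521 1.005 0.009481 0.992400
1 rs2223286 169696491 C DOMDEV 521 2.289 1.345000 0.178700
1 rs2223286 169696491 C GENO_2DF 521 NA 9.928000 0.006985
1 rs4987349 169698838 C ADD 517 1.350 1.633000 0.102500
1 rs4987349 169698838 C DOMDEV 517 1.024 0.095820 0.923700
1 rs4987349 169698838 C GENO_2DF 517 NA 3.172000 0.204800
1 rs2420381 169706566 A ADD 518 0.760 -0.976900 0.328600
1 rs2420381 169706566 A DOMDEV 518 1.237 0.632900 0.526800
1 rs2420381 169706566 A GENO_2DF 518 NA 0.961900 0.618200
"""

results = pd.read_csv(io.StringIO(genotypic_text), sep=r"\s+")

alpha = 0.05  # nominal threshold, before any multiple-testing correction

# Keep only the 2-degrees-of-freedom omnibus test for each SNP
omnibus = results[results["TEST"] == "GENO_2DF"]
hits = omnibus[omnibus["P"] < alpha].sort_values("P")
print(hits[["SNP", "TEST", "STAT", "P"]])
```

The GENO_2DF rows have no odds ratio because the 2-degrees-of-freedom test evaluates the ADD and DOMDEV terms jointly.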


Results from file: MALES_2Control_AGE_ADDITIVE.assoc.logistic
   CHR        SNP         BP A1 TEST  NMISS      OR      SE     L95    U95  \
0    1  rs2223286  169696491  C  AD