# Sample QC for GWAS analysis

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

[MIT License](https://github.com/dnanexus/UKB_RAP/blob/main/LICENSE) applies to this notebook.

This notebook loads cohorts created in the Cohort Browser, performs sample QC, and creates a file containing the phenotype and covariate information needed for GWAS analysis.

This work was done mainly by Yih-Chii Hwang, PhD as a part of her work on [AD-by-proxy GWAS Guide](https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/gwas-ex).

## JupyterLab app details (launch configuration)

Recommended configuration
- runtime: < 10 min
- cluster configuration: `Spark cluster`
- number of nodes: 2
- recommended instance: `mem1_ssd1_v2_x16`
- cost: < £0.09

1. Specify the whole exome sequencing (WES) data directory, the exome field ID, and the output directory. The first two depend on the WES release (e.g. 200K, 300K or 450K).

In [None]:
exome_folder = 'Population level exome OQFE variants, PLINK format - interim 200k release'
exome_field_id = '23155'
output_dir = '/Data/'

2. Import libraries and initialize Spark connection.

In [None]:
import databricks.koalas as ks
import dxpy
import dxdata
import pandas as pd
import pyspark
import re

In [None]:
# Initialize Spark
# Spark initialization (Done only once; do not rerun this cell unless you select Kernel -> Restart kernel).
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

3. Load the dataset description and select the entity containing phenotypic data.

In [None]:
# Automatically discover dispensed dataset ID and load the dataset
dispensed_dataset = dxpy.find_one_data_object(
    typename="Dataset", 
    name="app*.dataset", 
    folder="/", 
    name_mode="glob")
dispensed_dataset_id = dispensed_dataset["id"]
dataset = dxdata.load_dataset(id=dispensed_dataset_id)

In [None]:
participant = dataset['participant']

4. Load the cohorts that were created in the Cohort Browser.

In [None]:
case = dxdata.load_cohort("/Cohorts/diabetes_cases")  
cont = dxdata.load_cohort("/Cohorts/diabetes_controls")  

5. Specify the field IDs to retrieve, get the corresponding UKB RAP field names, and print a description table.

In [None]:
field_ids = ['31', '22001', '22006', '22019', '22021', '21022', '41270']

In [None]:
# Return all fields (named e.g. "p<field_id>_iYYY_aZZZ") that belong to a single field ID, sorted by name
def fields_for_id(field_id):
    from distutils.version import LooseVersion
    field_id = str(field_id)
    fields = participant.find_fields(name_regex=r'^p{}(_i\d+)?(_a\d+)?$'.format(field_id))
    return sorted(fields, key=lambda f: LooseVersion(f.name))
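
To illustrate which field names the pattern in `fields_for_id` accepts, here is a small standalone sketch that uses only the standard library. The candidate names and the `natural_key` helper are hypothetical and exist only for illustration; `natural_key` is one possible stand-in for `LooseVersion` ordering, not what the notebook itself uses.

```python
import re

def field_pattern(field_id):
    """Build the same regex that fields_for_id passes to find_fields."""
    return re.compile(r'^p{}(_i\d+)?(_a\d+)?$'.format(field_id))

def natural_key(name):
    """Hypothetical sort key: split digits from text so that i2 sorts before i10."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r'(\d+)', name)]

# Made-up field names, for illustration only
candidates = ['p20160_i10', 'p20160_i0', 'p20160_i2', 'p201600', 'p20160_i0_a1', 'p2016']

pattern = field_pattern(20160)
matched = sorted((n for n in candidates if pattern.match(n)), key=natural_key)
print(matched)
```

Names such as `p201600` and `p2016` are rejected because the pattern is anchored at both ends, so a field ID never matches a longer or shorter ID that merely shares its digits. The first element of the sorted list is what `fields_for_id(f)[0]` picks up in the next cell.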

In [None]:
fields = [fields_for_id(f)[0] for f in field_ids] + [participant.find_field(name='p20160_i0')] + [participant.find_field(name='eid')]
field_description = pd.DataFrame({
    'Field': [f.name for f in fields],
    'Title': [f.title for f in fields],
    'Coding': [f.coding.codes if f.coding is not None else '' for f in fields ]
 })
field_description

6. Retrieve data for both cohorts.

In [None]:
# Cases are retrieved with the default connection settings
case_df = participant.retrieve_fields(
    fields=fields, filter_sql=case.sql, engine=dxdata.connect()
).to_koalas()
# Controls are retrieved with a larger Kryo serializer buffer and broadcast joins disabled
cont_df = participant.retrieve_fields(
    fields=fields,
    filter_sql=cont.sql,
    engine=dxdata.connect(
        dialect="hive+pyspark",
        connect_args={
            'config': {
                'spark.kryoserializer.buffer.max': '256m',
                'spark.sql.autoBroadcastJoinThreshold': '-1',
            }
        },
    ),
).to_koalas()

7. Create phenotype variable and concatenate cohorts into one dataframe.

In [None]:
case_df['diabetes_cc'] = 1
cont_df['diabetes_cc'] = 0

In [None]:
df = ks.concat([case_df, cont_df])

In [None]:
df.shape

In [None]:
df.diabetes_cc.value_counts()

Here is an example of the retrieved data.

|    |     eid |   p21022 | p41270                                         | p41271   |   p20160_i0 |   p31 |   p22001 |   p22006 |   p22019 |   p22021 |   diabetes_cc |
|---:|--------:|---------:|:-----------------------------------------------|:---------|------------:|------:|---------:|---------:|---------:|---------:|--------------:|
|  0 | 1234567 |       67 | ['E119', 'M179', 'M431']                       |          |           0 |     0 |      nan |      nan |      nan |      nan |             1 |
|  1 | 1234568 |       62 | ['E119', 'R15', 'R32', 'R55', 'Z922']          |          |           0 |     1 |        0 |        1 |      nan |        0 |             1 |
|  2 | 1234569 |       50 | ['E119', 'I050', 'I080', 'I10', 'I270']        |          |           1 |     1 |        0 |      nan |      nan |        0 |             0 |
|  3 | 1234570 |       60 | ['A099', 'D128', 'D70', 'E114']                |          |           1 |     0 |        1 |      nan |      nan |        0 |             0 |
|  4 | 1234571 |       58 | ['A082',  'Z867', 'Z948', 'Z960']              |          |           0 |     1 |        1 |        1 |      nan |        0 |             0 |

8. QC samples based on several conditions.

In [None]:
df_qced = df[
    (df['p31'] == df['p22001']) & # Reported sex matches genetic sex
    (df['p22006'] == 1) &         # In the white British ancestry subset
    (df['p22019'].isnull()) &     # No sex chromosome aneuploidy
    (df['p22021'] == 0)           # No kinship found
]
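
The four conditions above can be restated on plain Python records, which makes one consequence explicit: a sample with a missing value in any genetic QC field (other than the aneuploidy flag, which must be missing) is dropped, because comparisons against a missing value are never true. This is a minimal sketch on made-up records; `passes_qc` is a hypothetical helper and `None` stands in for the `NaN` values of the DataFrame.

```python
def passes_qc(rec):
    """Return True when a record satisfies all four QC conditions of the filter above."""
    return (
        rec['p31'] is not None and rec['p31'] == rec['p22001']  # reported sex matches genetic sex
        and rec['p22006'] == 1       # in the white British ancestry subset
        and rec['p22019'] is None    # no sex chromosome aneuploidy flag
        and rec['p22021'] == 0       # no kinship found
    )

# Made-up records; None stands in for a missing value (NaN in the DataFrame)
records = [
    {'eid': 'A', 'p31': 1, 'p22001': 1, 'p22006': 1, 'p22019': None, 'p22021': 0},
    {'eid': 'B', 'p31': 0, 'p22001': None, 'p22006': None, 'p22019': None, 'p22021': None},
    {'eid': 'C', 'p31': 1, 'p22001': 0, 'p22006': 1, 'p22019': None, 'p22021': 0},
    {'eid': 'D', 'p31': 0, 'p22001': 0, 'p22006': 1, 'p22019': 1, 'p22021': 0},
]
kept = [r['eid'] for r in records if passes_qc(r)]
print(kept)
```

Only record `A` survives: `B` has no genetic QC data, `C` has mismatching sex, and `D` carries an aneuploidy flag.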

In [None]:
df_qced.diabetes_cc.value_counts()

9. Rename the columns and organize them in a format suitable for PLINK and regenie.

In [None]:
# Rename columns for better readability
df_qced = df_qced.rename(columns=
                         {'eid':'IID', 'p31': 'sex', 'p21022': 'age',
                          'p20160_i0': 'ever_smoked',
                          'p22006': 'ethnic_group',                           
                          'p22019': 'sex_chromosome_aneuploidy',                          
                          'p22021': 'kinship_to_other_participants'})
# Add FID column -- required input format for regenie 
df_qced['FID'] = df_qced['IID']

# Create a phenotype table from our QCed data
df_phenotype = df_qced[['FID', 'IID', 'diabetes_cc', 'sex', 'age', 'ethnic_group', 'ever_smoked']]

In [None]:
df_phenotype = df_phenotype.to_pandas()

10. Select only the samples that have WES data available and save them to a tab-separated file.

In [None]:
# Read the WES .fam file (whitespace-delimited, no header row)
path_to_family_file = f'/mnt/project/Bulk/Exome sequences/{exome_folder}/ukb{exome_field_id}_c1_b0_v1.fam'
plink_fam_df = pd.read_csv(path_to_family_file, delimiter=r'\s+', dtype='object',
                           names = ['FID','IID','Father ID','Mother ID', 'sex', 'Pheno'], engine='python')
# Intersect the phenotype file and the 200K WES .fam file
# to generate phenotype DataFrame for the 200K participants
diabetes_wes_200k_df = df_phenotype.join(plink_fam_df.set_index('IID'), on='IID', rsuffix='_fam', how='inner')
# Drop the columns inherited from the .fam file that are not needed
diabetes_wes_200k_df.drop(
    columns=['FID_fam','Father ID','Mother ID','sex_fam', 'Pheno'], inplace=True, errors='ignore'
)
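
The intersection step can also be pictured without pandas. The sketch below parses the six whitespace-separated columns of a PLINK `.fam` file and keeps only the phenotype rows whose `IID` appears in it. The file content, the phenotype rows and the `read_fam_iids` helper are all made up for illustration; IDs are compared as strings, mirroring `dtype='object'` in the cell above.

```python
import io

FAM_COLUMNS = ['FID', 'IID', 'Father ID', 'Mother ID', 'sex', 'Pheno']

def read_fam_iids(handle):
    """Collect the IID (second column) of every non-empty line in a .fam file."""
    iids = set()
    for line in handle:
        parts = line.split()  # split on any run of whitespace
        if parts:
            iids.add(dict(zip(FAM_COLUMNS, parts))['IID'])
    return iids

# Made-up .fam content and phenotype rows, for illustration only
fam_text = "1000001 1000001 0 0 2 -9\n1000003 1000003 0 0 1 -9\n"
phenotype_rows = [
    {'FID': '1000001', 'IID': '1000001', 'diabetes_cc': 1},
    {'FID': '1000002', 'IID': '1000002', 'diabetes_cc': 0},
    {'FID': '1000003', 'IID': '1000003', 'diabetes_cc': 0},
]

wes_iids = read_fam_iids(io.StringIO(fam_text))
with_wes = [row for row in phenotype_rows if row['IID'] in wes_iids]
print([row['IID'] for row in with_wes])
```

Sample `1000002` is dropped because it has no entry in the `.fam` content, which is the same effect as the inner join in the notebook.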

In [None]:
# Write the phenotype table to a tab-separated file (quoting=3 is csv.QUOTE_NONE)
diabetes_wes_200k_df.to_csv('diabetes_wes_200k.phe', sep='\t', na_rep='NA', index=False, quoting=3)
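
Before uploading, it can be worth checking that the written file has the layout described in step 9: tab-separated, `FID` and `IID` as the first two columns, and the binary phenotype coded as `0`, `1` or `NA`. The following is a minimal sketch of such a check, not part of the original workflow; `check_phenotype_table` is a hypothetical helper and the sample text is made up.

```python
import csv
import io

def check_phenotype_table(handle, binary_column):
    """Return a list of problems found in a tab-separated phenotype table."""
    problems = []
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader, [])
    if header[:2] != ['FID', 'IID']:
        problems.append('first two columns must be FID and IID')
    if binary_column not in header:
        problems.append('missing column: ' + binary_column)
        return problems
    col = header.index(binary_column)
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append('line {}: expected {} fields'.format(line_no, len(header)))
        elif row[col] not in ('0', '1', 'NA'):
            problems.append('line {}: unexpected value {!r}'.format(line_no, row[col]))
    return problems

# Made-up file contents, for illustration only
good = "FID\tIID\tdiabetes_cc\tsex\n1000001\t1000001\t1\t0\n1000003\t1000003\tNA\t1\n"
bad = "FID\tIID\tdiabetes_cc\tsex\n1000001\t1000001\t2\t0\n"

good_problems = check_phenotype_table(io.StringIO(good), 'diabetes_cc')
bad_problems = check_phenotype_table(io.StringIO(bad), 'diabetes_cc')
print(good_problems, bad_problems)
```

On the real file, the same function could be called as `check_phenotype_table(open('diabetes_wes_200k.phe'), 'diabetes_cc')`; an empty list means none of the checked problems were found.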

11. Upload the file to project storage.

In [None]:
%%bash -s "$output_dir"
# Upload the geno-pheno intersect phenotype file back to the RAP project
dx upload diabetes_wes_200k.phe -p --path "$1" --brief

Here is an example of the phenotype file.

|    |     FID |     IID |   diabetes_cc |   sex |   age |   ethnic_group |   ever_smoked |
|---:|--------:|--------:|--------------:|------:|------:|---------------:|--------------:|
|  1 | 1234567 | 1234567 |             1 |     0 |    67 |              1 |             0 |
|  4 | 1234568 | 1234568 |             1 |     1 |    62 |              1 |             0 |
|  6 | 1234569 | 1234569 |             0 |     1 |    50 |              1 |             1 |
| 19 | 1234570 | 1234570 |             0 |     0 |    60 |              1 |             1 |
| 20 | 1234571 | 1234571 |             0 |     1 |    58 |              1 |             0 |
