# Notebook 1: Create phenotype data

This notebook uses proteomic expression data from the [Kivisakk et al. publication](https://academic.oup.com/braincomms/article/4/4/fcac155/6608340#366642284) as a test case for this proteomics pQTL analysis workflow.

This notebook reads in the post-QC'd data from [Table S1 of the associated publication](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/braincomms/4/4/10.1093_braincomms_fcac155/7/fcac155_supplementary_data.zip?Expires=1677768977&Signature=hS7ey1m3UtIF3cV8qEUrfVjgbMTVcf0GPOZpHhVqRh3H44MhG0cFcZz6qwP6GbY2mize0Z1qG87iuYvNQd6-T~KEAPlNR-Ub1YVmenkT~MhkvtURg-MEIns79I9Q49DsKu8LzdbPMWIHvICoiQd~5ET3cUyWRacOkdgfnPsvkN4QTIKWY5uAnHOejaWZHTaf5KgzvqtMcg-dZMx4uuXUyb~3bFLwVCFtU-NwV4J0WdWd0R2QeQuVQMfi5aTdhZWI-QeeAUNvtm1VSQx0NzdQ9TG2Hyfitd8FczWFI32cwWLj~CrtTYbGgtEW3wANXrf89i0fknzRI141ir5XwoHSbQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA) and creates a new phenotype file by replacing the sample IDs with sample IDs found in the genotype data files from UKB. Once the sample IDs map between the phenotype (proteomic) data and the genotype (array, GEL imputed) data, the phenotype file can be used as input to REGENIE.

### As-Is Software Disclaimer

This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.

[MIT License](https://github.com/dnanexus/UKB_RAP/blob/main/LICENSE) applies to this notebook.

### JupyterLab app details

<b>Launch spec:</b>
- App name: JupyterLab
- Kernel: Python_R
- Instance type: mem1_ssd1_v2_x2
- Runtime: ~1 min
- Cost: ~£0.0069


<b>Data description:</b> The input file for this notebook is a CSV file containing post-QC'd data from [Table S1 of the associated publication](https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/braincomms/4/4/10.1093_braincomms_fcac155/7/fcac155_supplementary_data.zip?Expires=1677768977&Signature=hS7ey1m3UtIF3cV8qEUrfVjgbMTVcf0GPOZpHhVqRh3H44MhG0cFcZz6qwP6GbY2mize0Z1qG87iuYvNQd6-T~KEAPlNR-Ub1YVmenkT~MhkvtURg-MEIns79I9Q49DsKu8LzdbPMWIHvICoiQd~5ET3cUyWRacOkdgfnPsvkN4QTIKWY5uAnHOejaWZHTaf5KgzvqtMcg-dZMx4uuXUyb~3bFLwVCFtU-NwV4J0WdWd0R2QeQuVQMfi5aTdhZWI-QeeAUNvtm1VSQx0NzdQ9TG2Hyfitd8FczWFI32cwWLj~CrtTYbGgtEW3wANXrf89i0fknzRI141ir5XwoHSbQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA).

### Dependencies

|Library |License|
|:------------- |:-------------|
|[pandas](https://pandas.pydata.org/) |[BSD-3](https://github.com/pandas-dev/pandas/blob/main/LICENSE)|

In [None]:
import os
import pandas as pd

## 1. Load proteomic expression data

Load in the proteomics data and get the number of samples.

In [None]:
# Number of times to replicate NPX dataframe to boost
# the number of samples we have.
# If there are not enough samples we will have too low variance for
# running REGENIE.
n = 700

In [None]:
# Load data
filename_download = "<file path on where Supplementary_Table1_Baseline_Olink_Data.csv is stored off of RAP>"
filename = "Supplementary_Table1_Baseline_Olink_Data.csv"

# Output directory
output_dir = "/output/"

In [None]:
os.system(" ".join(["dx download", filename_download]))
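
`os.system` returns the exit status but does not raise when the command fails, so a failed download can go unnoticed until `read_csv` cannot find the file. A minimal sketch of a stricter alternative using the standard library's `subprocess.run` is shown below. The helper name `run_command` is hypothetical, and the demonstration uses the Python interpreter instead of `dx` so that it runs anywhere.

```python
import shlex
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and raise if it fails."""
    print("Running:", shlex.join(args))
    # check=True raises subprocess.CalledProcessError on a non-zero exit code
    return subprocess.run(args, check=True)


# Demonstrated with the Python interpreter so the sketch runs anywhere.
result = run_command([sys.executable, "-c", "print('ok')"])
print(result.returncode)

# In this notebook the equivalent call would be, for example:
# run_command(["dx", "download", filename_download])
```

Passing the arguments as a list also avoids problems with spaces in file paths, which the string `" ".join(...)` approach does not handle.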

In [None]:
df = pd.read_csv(filename, index_col=0, header=0)

In [None]:
print(df.shape)
df.head()

### Remove missing data

Check for missing values before extending the expression matrix. This covers `None` and `numpy.nan`, and also `numpy.inf` because we set `pandas.options.mode.use_inf_as_na = True`. Empty fields in the CSV are already read as `NaN` by `pandas.read_csv`.

REGENIE will throw an error if missing values are included (`ERROR: could not convert value to double: ''`).
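
The `mode.use_inf_as_na` option has been deprecated in recent pandas releases, so the check may need to be done without it. The sketch below is one way to do that on a toy dataframe: the column names and values are made up for illustration and are not taken from the publication.

```python
import numpy as np
import pandas as pd

# Toy expression matrix (hypothetical values, not from the publication)
toy = pd.DataFrame(
    {
        "P1": [1.0, 2.0, 3.0],
        "P2": [0.5, np.nan, 1.5],  # ordinary missing value
        "P3": [2.0, np.inf, 4.0],  # infinite value
    },
    index=["s1", "s2", "s3"],
)

# Treat +/- infinity as missing without changing global pandas options
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# Count missing values per column and list the affected columns
na_counts = cleaned.isna().sum()
cols_with_na = na_counts[na_counts > 0].index.tolist()
print(cols_with_na)

# Drop the affected columns so every remaining protein is fully observed
complete = cleaned.drop(columns=cols_with_na)
print(complete.shape)
```

Dropping by the computed list rather than by a hard-coded column name keeps the step valid if the input table changes.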

In [None]:
pd.options.mode.use_inf_as_na = True
# Total number of missing values across the whole dataframe
print(df.isna().values.sum())

In [None]:
# Get columns with NaN
df.columns[df.isna().any()].tolist()

In [None]:
# Drop the column that contains missing values
df = df.drop(columns="TNC")
print(df.shape)

In [None]:
# Replicate the dataframe n times to boost the number of samples
df = pd.concat([df] * n)
print(df.shape)
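
As a small illustration of what this replication does, the sketch below applies the same `pd.concat([df] * n)` pattern to a toy dataframe. The values and the name `n_copies` are hypothetical. The rows are exact copies, so the original index labels repeat, which is one reason the proteomic sample IDs are replaced later in this notebook.

```python
import pandas as pd

# Toy dataframe with two samples and two proteins (made-up values)
toy = pd.DataFrame({"P1": [1.0, 2.0], "P2": [3.0, 4.0]}, index=["s1", "s2"])
n_copies = 3  # hypothetical; the notebook uses n = 700

# Stack n_copies copies of the dataframe on top of each other
replicated = pd.concat([toy] * n_copies)

print(replicated.shape)            # (6, 2)
print(replicated.index.is_unique)  # False
```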

In [None]:
n_samples = df.shape[0]

## 2. Get sample IDs

Load the sample data from the `.sample` file in UKB and select the first `n_samples` identifiers.

In [None]:
# Path to the .sample file (read directly from /mnt/project/, no download needed)
gel_sample_filename = (
    "<file path on platform containing ukb21008_c1_b0_v1.sample using /mnt/project/>"
)

In [None]:
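# skiprows=[1] skips the second line of the .sample file,
# which describes column types rather than a sample
sample_df = pd.read_csv(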
    gel_sample_filename, sep=" ", skiprows=[1], index_col=0, header=0
)
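
To make the `skiprows=[1]` argument concrete, the sketch below parses a small in-memory `.sample` file with the same `read_csv` call. The file contents and sample IDs are invented for illustration, assuming the usual layout of a header row followed by a column-type row.

```python
import io

import pandas as pd

# Hypothetical .sample contents: header row, column-type row, then samples
sample_text = (
    "ID_1 ID_2 missing sex\n"
    "0 0 0 D\n"
    "1000011 1000011 0 1\n"
    "1000022 1000022 0 2\n"
    "1000033 1000033 0 1\n"
)

# Same arguments as the notebook cell, applied to the in-memory text
toy_sample_df = pd.read_csv(
    io.StringIO(sample_text), sep=" ", skiprows=[1], index_col=0, header=0
)

print(toy_sample_df.shape)
print(list(toy_sample_df.index))
```

Without `skiprows=[1]`, the type row would be read as a sample with ID `0`.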

In [None]:
print(sample_df.shape)
sample_df.head()

In [None]:
sample_ids = list(sample_df.head(n_samples).index)

In [None]:
print(len(sample_ids))
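
`head(n_samples)` silently returns fewer rows if the sample file is shorter than the replicated expression matrix, which would only surface later as a length mismatch. A minimal sketch of an explicit guard is shown below. The helper name `take_sample_ids` is hypothetical.

```python
def take_sample_ids(available_ids, n_needed):
    """Return the first n_needed IDs, or raise if too few are available."""
    available_ids = list(available_ids)
    if len(available_ids) < n_needed:
        raise ValueError(
            f"Need {n_needed} sample IDs but only {len(available_ids)} are available"
        )
    return available_ids[:n_needed]


# Usage with made-up IDs
ids = take_sample_ids([101, 102, 103, 104], 3)
print(ids)
```

In this notebook the call would be `take_sample_ids(sample_df.index, n_samples)`.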

## 3. Format expression data

Now that we have sample IDs that map to our genotype data, we'll replace the proteomic sample IDs (`Plasma_Sample`) with the genotype sample IDs. We'll also remove every column that does not contain expression values (i.e. PIDN, Age_at_Baseline, Sex, and Outcome).

Note: REGENIE requires both an FID column and an IID column.
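
The target layout can be sketched on toy data as follows. The protein names, values, and IDs are made up, and the same ID is used for FID and IID, as in this notebook.

```python
import pandas as pd

# Toy expression values indexed by the original proteomic sample IDs
toy_npx = pd.DataFrame(
    {"P1": [0.1, 0.2, 0.3], "P2": [1.1, 1.2, 1.3]},
    index=["a", "b", "c"],
)
toy_ids = [1000011, 1000022, 1000033]  # hypothetical genotype sample IDs

pheno = toy_npx.copy()
pheno.insert(0, "IID", toy_ids)
pheno.insert(0, "FID", toy_ids)
# Discard the original proteomic sample IDs held in the index
pheno = pheno.reset_index(drop=True)

print(pheno.columns.tolist())

# Tab-separated text with FID and IID as the first two columns
pheno_text = pheno.to_csv(sep="\t", index=False)
print(pheno_text.splitlines()[0])
```

This variant keeps FID as an ordinary column and writes with `index=False`; the notebook instead sets FID as the index, which produces the same column order on disk.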

In [None]:
metadata_colnames = ["PIDN", "Age_at_Baseline", "Sex", "Outcome"]
npx_df = df.drop(metadata_colnames, axis=1)

In [None]:
# Add FID
npx_df["FID"] = sample_ids

In [None]:
# Set index to FID
npx_df = npx_df.set_index("FID")

In [None]:
# IID column
npx_df.insert(0, "IID", sample_ids)

In [None]:
print(npx_df.shape)
npx_df.head()

In [None]:
# Check for missing values (presence, then total count)
print(npx_df.isnull().values.any())
print(npx_df.isnull().values.sum())

In [None]:
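# Histogram of per-protein standard deviations
# (IID is an identifier, so it is excluded)
npx_df.drop(columns="IID").std().plot.hist()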

## 4. Save

Save un-normalized and normalized data to test the effect of normalization on differential expression analysis results. Note that the cell below writes only the un-normalized values.

In [None]:
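# To write every protein instead, use:
# npx_df.to_csv("pheno.txt", sep="\t")
# Write IID plus the first 200 protein columns (FID is the index)
npx_df.iloc[:, :201].to_csv("pheno_200.txt", sep="\t")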
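
A quick way to confirm the written file has the expected layout is to read it back. The sketch below does this on a toy dataframe written to a temporary directory. The values, the file name `pheno_toy.txt`, and the variable `k` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy phenotype frame: FID as the index, IID first, then protein columns
toy = pd.DataFrame(
    {"IID": [11, 22], "P1": [0.1, 0.2], "P2": [0.3, 0.4], "P3": [0.5, 0.6]},
    index=pd.Index([11, 22], name="FID"),
)

k = 2  # hypothetical number of proteins to keep (the notebook keeps 200)
subset = toy.iloc[:, : k + 1]  # +1 because IID occupies the first column

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pheno_toy.txt")
    subset.to_csv(path, sep="\t")
    reloaded = pd.read_csv(path, sep="\t")

print(reloaded.columns.tolist())
print(reloaded.isna().values.any())
```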

In [None]:
# Upload the phenotype file to the project
os.system(" ".join(["dx upload", "pheno_200.txt", "--destination", output_dir]))