# StringDB - A Biology Resource for Bioinformatics


## This tutorial walks through how to use StringDB to investigate biological relationships between groups of genes
* As an example, we will generate a list of genes involved in smoking-associated lung cancer
* The data come from The Cancer Genome Atlas (TCGA) lung adenocarcinoma and lung squamous cell carcinoma cohorts
* We will identify genes with large differences in median expression between smokers and non-smokers with lung cancer

In [2]:
import pandas as pd

In [3]:
expr = pd.read_csv("./tcga_selected_lung_expr.csv", index_col = 0)


In [4]:
expr.head()

Hugo_Symbol,TCGA-49-4512,TCGA-55-8619,TCGA-75-6212,TCGA-97-A4M6,TCGA-50-6597,TCGA-69-7760,TCGA-55-6986,TCGA-55-8513,TCGA-75-5147,TCGA-55-A57B,...,TCGA-55-8089,TCGA-99-AA5R,TCGA-49-AARO,TCGA-55-8510,TCGA-55-7914,TCGA-64-1679,TCGA-55-8096,TCGA-91-6831,TCGA-38-4625,TCGA-75-7027
UBE2Q2P2,10.5919,12.1666,27.2807,11.4083,2.6202,6.0025,9.9855,31.6469,9.0652,35.6276,...,5.2768,24.0879,18.674,12.4602,9.995,7.9803,36.898,16.9631,2.3768,17.5941
HMGB1P1,71.9754,96.028,108.678,73.6311,58.1904,62.792,107.827,99.9036,208.927,84.3653,...,142.895,71.8784,79.6771,113.158,164.598,76.3349,83.8594,109.543,87.8119,238.002
RNU12-2P,0.7686,0.0,0.0,1.3706,0.0,0.6246,0.0,0.7418,0.2949,0.7052,...,0.8317,1.6287,2.3061,0.612,0.9995,0.2902,0.6237,0.7718,0.5632,0.2895
SSX9P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2949,3.1735,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
EZHIP,0.7686,0.0,0.0,0.4569,12.2526,0.0,1.4535,2.9674,7.9623,0.0,...,0.4158,4.886,0.0,1.836,0.0,2.3215,0.3118,0.7718,0.2816,0.0


# Now we need to read in the clinical annotations
* The column we are interested in is "tobacco_smoking_history"

In [9]:
clin_annot = pd.read_csv("./tcga_annot_selected.csv",
                         index_col = 0)

In [10]:
clin_annot.head()

Unnamed: 0,bcr_patient_uuid,acronym,gender,vital_status,days_to_birth,days_to_death,days_to_last_followup,days_to_initial_pathologic_diagnosis,age_at_initial_pathologic_diagnosis,icd_10,...,histological_type,tissue_source_site,form_completion_date,pathologic_T,pathologic_M,clinical_M,pathologic_N,system_version,pathologic_stage,tobacco_smoking_history
TCGA-49-4512,a1e65587-24c1-4b41-92a7-4e1f15fffd78,LUAD,FEMALE,Dead,-25502,905,157,0,69,C34.2,...,Lung Adenocarcinoma- Not Otherwise Specified (...,49,8/9/11,T2,MX,[Not Applicable],N2,6th,Stage IIIA,Lifelong Non-smoker
TCGA-55-8619,772324a5-5513-454d-ad6b-605798f69b73,LUAD,FEMALE,Alive,-26616,[Not Applicable],416,0,72,C34.30,...,Mucinous (Colloid) Carcinoma,55,1/11/13,T3,MX,[Not Applicable],N0,7th,Stage IIB,Lifelong Non-smoker
TCGA-75-6212,23069f19-eba5-4ea5-ae7d-1ee4399ce7c1,LUAD,FEMALE,Dead,[Not Available],1516,[Not Available],[Not Available],[Not Available],C34.3,...,Lung Micropapillary Adenocarcinoma,75,7/21/11,T2,M0,[Not Applicable],N1,6th,Stage IIB,Lifelong Non-smoker
TCGA-97-A4M6,31FF69B5-9E58-44DA-8326-BDFC7EE495C4,LUAD,FEMALE,Alive,-16764,[Not Applicable],568,0,45,C34.3,...,Lung Adenocarcinoma Mixed Subtype,97,3/7/13,T1a,M0,[Not Applicable],N0,7th,Stage IA,Lifelong Non-smoker
TCGA-50-6597,0d66bf6c-eed0-4726-bd5b-3bf6d610b4e0,LUAD,FEMALE,Dead,-29195,1268,1015,0,79,C34.3,...,Lung Adenocarcinoma- Not Otherwise Specified (...,50,8/26/11,T2,M0,[Not Applicable],N0,6th,Stage IB,Lifelong Non-smoker


In [11]:
clin_annot.tobacco_smoking_history.value_counts()

Current smoker         20
Lifelong Non-smoker    20
Name: tobacco_smoking_history, dtype: int64

In [14]:
curr_smokers = clin_annot.query("tobacco_smoking_history == 'Current smoker'").index
nonsmokers = clin_annot.query("tobacco_smoking_history == 'Lifelong Non-smoker'").index
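Slicing `expr` with these index objects only works if every sample ID in the annotations is also a column of the expression matrix. The sketch below checks that on toy data; the frames `expr_toy` and `clin_toy` and their sample IDs are hypothetical stand-ins, not the TCGA files used here.

```python
import pandas as pd

# Hypothetical stand-ins for `expr` (genes x samples) and `clin_annot` (samples x fields).
expr_toy = pd.DataFrame(
    {"S1": [1.0, 5.0], "S2": [2.0, 6.0], "S3": [3.0, 7.0]},
    index=["GENE_A", "GENE_B"],
)
clin_toy = pd.DataFrame(
    {"tobacco_smoking_history": ["Current smoker", "Lifelong Non-smoker", "Current smoker"]},
    index=["S1", "S2", "S4"],  # S4 has no expression data in this toy example
)

smokers_toy = clin_toy.query("tobacco_smoking_history == 'Current smoker'").index

# Samples annotated as smokers but absent from the expression matrix.
missing = smokers_toy.difference(expr_toy.columns)

# Keep only the smokers that can safely be used to slice `expr_toy`.
usable_smokers = smokers_toy.intersection(expr_toy.columns)

print("missing from expression data:", list(missing))
print("usable smoker samples:", list(usable_smokers))
```

If `missing` is non-empty on real data, `expr.loc[:, curr_smokers]` would raise a `KeyError`, so slicing with the intersection is the safer choice.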

# Now that we know which TCGA samples come from current smokers and which come from lifelong non-smokers, we want to identify genes whose median expression differs between the two groups

In [16]:
# Per-gene difference in median expression: non-smokers minus current smokers
non_sub_curr_median_expr = expr.loc[:, nonsmokers].median(axis = 1) - expr.loc[:, curr_smokers].median(axis = 1)
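To make the arithmetic concrete, here is the same calculation on a small toy matrix. The gene names, sample names, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical expression matrix: rows are genes, columns are samples.
toy = pd.DataFrame(
    {
        "N1": [10.0, 1.0], "N2": [12.0, 3.0], "N3": [14.0, 2.0],  # non-smokers
        "C1": [4.0, 2.0], "C2": [6.0, 8.0], "C3": [5.0, 9.0],     # current smokers
    },
    index=["GENE_A", "GENE_B"],
)
toy_nonsmokers = ["N1", "N2", "N3"]
toy_smokers = ["C1", "C2", "C3"]

# axis=1 takes the median across samples, giving one value per gene.
toy_diff = toy.loc[:, toy_nonsmokers].median(axis=1) - toy.loc[:, toy_smokers].median(axis=1)
print(toy_diff)
```

A positive value (GENE_A: 12 - 5 = 7) means higher median expression in non-smokers, and a negative value (GENE_B: 2 - 8 = -6) means higher median expression in current smokers.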

# Now we extract the 20 genes with the largest excess in median expression in non-smokers relative to current smokers

In [17]:
# Select the 20 genes with the largest (non-smoker minus current-smoker) median difference
candidate_genes = non_sub_curr_median_expr.nlargest(20).index

print("\n".join(candidate_genes))

SFTPB
CD74
SLC34A2
NAPSA
PIGR
B2M
ADAM6
SFTPA2
SFTPC
CTSD
SFTPA1
CEACAM6
HLA-DRA
MUC1
AQP1
SDC1
ATP1A1
C1orf116
A2M
VIM
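`nlargest` only captures one direction of the difference. As a sketch on made-up values, the genes with higher median expression in current smokers could be pulled out the same way with `nsmallest`; `diff_toy` is a hypothetical stand-in for the series computed above.

```python
import pandas as pd

# Hypothetical per-gene differences (non-smokers minus current smokers).
diff_toy = pd.Series({"GENE_A": 7.0, "GENE_B": -6.0, "GENE_C": 0.5, "GENE_D": 3.0})

# Largest positive differences: higher in non-smokers.
higher_in_nonsmokers = diff_toy.nlargest(2).index

# Most negative differences: higher in current smokers.
higher_in_smokers = diff_toy.nsmallest(2).index

print(list(higher_in_nonsmokers))
print(list(higher_in_smokers))
```

Because these are raw differences rather than ratios, highly expressed genes tend to dominate either list.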


# Finally, we want to load these genes into StringDB
* StringDB (STRING) is a centralized resource of manually and programmatically curated gene-gene relationships, represented as interactions between the proteins the genes encode
* It helps identify connections within a group of genes
* https://string-db.org/
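
The gene list can be pasted into the multiple-protein search on the website. As an alternative, the sketch below builds a request URL for STRING's REST API without sending it. The endpoint layout (`/api/<format>/network`), the carriage-return separator for identifiers, and the `species` and `caller_identity` parameters are assumptions based on the public API documentation and should be checked against the current docs. The helper name `build_string_network_url` and the caller identity string are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the STRING REST API; verify against the official docs.
STRING_API = "https://string-db.org/api"


def build_string_network_url(genes, species=9606, output_format="tsv"):
    """Build (but do not send) a STRING network request for a list of gene symbols.

    species=9606 is the NCBI taxonomy ID for human.
    """
    params = {
        # Multiple identifiers are assumed to be separated by carriage returns.
        "identifiers": "\r".join(genes),
        "species": species,
        # Hypothetical label identifying this script to the STRING server.
        "caller_identity": "stringdb_tutorial",
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


# A few of the candidate genes from the list above.
url = build_string_network_url(["SFTPB", "CD74", "HLA-DRA"])
print(url)
```

The resulting URL could then be fetched with, for example, `pd.read_csv(url, sep="\t")` to get the interaction table as a DataFrame, provided the assumptions above hold.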