In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
import seaborn as sns

pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

print(sys.version)
print(pd.__version__)
print(np.__version__)

import io
import os
pd.set_option("display.max_rows", 1000)
pd.options.display.max_colwidth = 1000

3.9.13 (main, Aug 25 2022, 18:29:29) 
[Clang 12.0.0 ]
1.4.4
1.21.5


---

## 4: Clinical disease data (40 pts)

Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Look at the file and tell me what gene and mutation combinations are classified as dangerous.”

Make sure that you only give your boss the dangerous mutations and include:

1) Gene name

2) Mutation ID number

3) Mutation Position (chromosome & position)

4) Mutation value (reference & alternate bases)

5) Clinical significance (CLNSIG)

6) Disease that is implicated

**Requirements**

1) The deliverables are the final result as a dataframe with a short discussion of any specifics. (that is, what data you would present to your boss with the explanation of your results)

2) Limit your output to the first 100 harmful mutations and tell your boss how many total harmful mutations were found in the file

3) Use the instructor-modified "clinvar_final.txt" at this link: https://drive.google.com/file/d/1Zps0YssoJbZHrn6iLte2RDLlgruhAX1s/view?usp=sharing This file was modified to be not exactly the same as 'standard' .vcf file to test your data parsing skills. **This is a large file so do NOT upload it into your github repo!**

4) Replace missing values in the dataframe with: 'Not_Given'. Print or display this (including the Not_Given count) for the column `CLNSIG` by using pandas value_counts() function (https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html).

5) State in your answer how you define harmful mutations

**6) Do your best on getting to above requirements and submit whatever you do before the deadline. If your work is incomplete be sure to describe the blockers that got in your way and how you might get past them (if given more time).**

7) You can use as many code blocks as you need. Please clean-up your code and make it readable for the graders!

**Hints** 
* We do not expect you to have any medical knowledge to solve this problem; look at the data, read the documentation provided, and write down your assumptions!

* Correct pseudocode will give you partial credit so start with that. 

* Map out which fields you want to extract: Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* A good way to start is to print out each line, then practice parsing them to see if you can recover the fields of interest

* A starting solution for parsing .vcfs can be found here: https://gist.github.com/dceoy/99d976a2c01e7f0ba1c813778f9db744 This solution does **NOT** work due to the changes we've made but can be modified to work. As with any solution that needs modifications, it may take less time to make your own solution!

* Filter out junk and lines with no mutation data. Just focus on the data you need to deliver to your boss.

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* This is similar to a task that one of us tackled at work. You can answer the question with the information provided below or using the (partial) data dictionary file at this link: https://drive.google.com/file/d/1lx9yHdlcqmU_OlHiTUXKC_LQDqYBypH_/view?usp=sharing. Our goal is to see that you can put together a sensible plan, implement a solid parsing strategy, and document and justify the decisions that you made.

### VCF file description (Summarized from version 4.1)

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also can contain genotype information on samples for each position.

* Fixed fields:

There are 8 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. REF - reference base(s)
5. ALT - alternate base(s)
6. FILTER - filter status
7. QUAL - quality
8. INFO - a semicolon-separated series of keys with values in the format: <key>=<data>

```
### Applicable INFO field specifications

```
GENEINFO = <Gene name>
CLNSIG =  <Clinical significance>
CLNDN = <Disease name>
```

### Sample ClinVar data (vcf file format - not exactly the same as the file to download!)

```
##fileformat=VCFv4.1
##fileDate=2019-03-19
##source=ClinVar
##reference=GRCh38							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDN=Heart_dis 
```
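
The INFO layout described above (semicolon-separated `<key>=<data>` pairs, with keys that may be absent or misspelled, as in the `CLNDBN` sample row) can be explored with a small helper before touching the full file. This is a minimal sketch only; `parse_info` is a hypothetical name, and storing `True` for keys without a value is an assumption.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a {key: value} dictionary."""
    fields = {}
    for item in info.split(";"):
        item = item.strip()  # tolerate stray spaces such as "; CLNDN=Cancer"
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Assumption: a key without "=" is a flag, stored here as True.
        fields[key] = value if sep else True
    return fields


record = parse_info("GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer")
print(record.get("CLNSIG", "Not_Given"))  # prints 2
print(record.get("CLNDBN", "Not_Given"))  # prints Not_Given (key is absent)
```

Looking keys up with `.get(..., "Not_Given")` means a missing or misspelled key never raises an error, which fits the requirement to report missing values explicitly.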

In [2]:
# 4) Your code here - can use as many code blocks as you would like

4) Please Write your assumptions here:

4) Findings / What would you present to your boss?

----

In [3]:
# Gene name --> GENEINFO (gene symbol before the ":")

# Mutation ID number --> ID

# Mutation Position (chromosome & position) --> CHROM + POS

# Mutation value (reference & alternate bases) --> REF & ALT column values

# Clinical significance --> CLNSIG

# Disease that is implicated --> CLNDN

# <font color = teal> Explore Raw Data

# <font color = gold> Current GOLD

## Try another import

In [4]:
# import pandas as pd
# with open('../clinvar_final.txt', "r") as f:
#     lines = f.readlines()
#     chrom_index = [i for i, line in enumerate(lines) if line.strip().startswith("#CHROM")] 
#     data = lines[chrom_index[0]:]
        

#     header = data[0].strip().split("\t")
#     informations = [d.strip().split("\t") for d in data[1:]]


# vcf = pd.DataFrame(informations, columns=header)

In [5]:
df = pd.read_csv('../clinvar_final.txt', comment='#', sep='\t')
df.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
0,1,1014O42,475283,G,A,.,.,"AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;ALLELEID=446939;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014042G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=143888043"
1,1,1O14122,542074,C,T,.,.,"AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014122C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=150861311"
2,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014143C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:147571.0003;GENEINFO=ISG15:9636;MC=SO:0001587|nonsense;ORIGIN=1;RS=786201005"
3,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014179C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1553169766"
4,1,1014217,475278,C,T,.,.,"AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;ALLELEID=446987;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014217C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=61766284"


In [6]:
df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')
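
The extraction above can be tried on a toy frame to see how absent keys behave. This is a sketch under assumptions: the two INFO strings are made up for illustration, and the `(?:^|;)` prefix is an optional extra guard so that a key only matches at the start of a field. Passing `expand=False` makes `str.extract` return a Series for a single capture group.

```python
import pandas as pd

# Hypothetical INFO strings; the second row lacks CLNSIG and CLNDN.
toy = pd.DataFrame({"INFO": [
    "ALLELEID=1;CLNDN=Disease_A;CLNSIG=Benign;GENEINFO=ISG15:9636",
    "ALLELEID=2;CLNDNINCL=Disease_B;GENEINFO=AGRN:375790",
]})

for key in ["GENEINFO", "CLNSIG", "CLNDN"]:
    # Match the key only at the start of the string or right after a ";".
    pattern = rf"(?:^|;){key}=([^;]*)"
    toy[key] = toy["INFO"].str.extract(pattern, expand=False)

print(toy[["GENEINFO", "CLNSIG", "CLNDN"]])
```

Rows without a key come back as `NaN`, which is what the later `fillna('Not_Given')` step relies on.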

In [7]:
# additional validations

# INFO -- SOMATIC indicates that the record is a somatic mutation, for cancer genomics
# PEDIGREE

In [8]:
df.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO,GENEINFO,CLNSIG,CLNDN
0,1,1014O42,475283,G,A,.,.,"AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;ALLELEID=446939;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014042G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=143888043",ISG15:9636,Benign,Immunodeficiency_38_with_basal_ganglia_calcification
1,1,1O14122,542074,C,T,.,.,"AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014122C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=150861311",ISG15:9636,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
2,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014143C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:147571.0003;GENEINFO=ISG15:9636;MC=SO:0001587|nonsense;ORIGIN=1;RS=786201005",ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
3,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014179C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1553169766",ISG15:9636,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
4,1,1014217,475278,C,T,.,.,"AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;ALLELEID=446987;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014217C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=61766284",ISG15:9636,Benign,Immunodeficiency_38_with_basal_ganglia_calcification


In [9]:
# print(df['GENEINFO'].head())
# print(df['CLNSIG'].head())
# print(df['CLNDN'].head())

# <font color = green> START HERE

#### Start by creating a new DF with just the columns needed

In [10]:
data = df.filter(['GENEINFO','ID', 'POS', 'REF', 'ALT', 'CLNSIG', 'CLNDN'], axis=1)
data

Unnamed: 0,GENEINFO,ID,POS,REF,ALT,CLNSIG,CLNDN
0,ISG15:9636,475283,1014O42,G,A,Benign,Immunodeficiency_38_with_basal_ganglia_calcification
1,ISG15:9636,542074,1O14122,C,T,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
2,ISG15:9636,183381,1014143,C,T,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
3,ISG15:9636,542075,1014179,C,T,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
4,ISG15:9636,475278,1014217,C,T,Benign,Immunodeficiency_38_with_basal_ganglia_calcification
...,...,...,...,...,...,...,...
102316,PIK3CA:5290,403908,179210507,A,G,Uncertain_significance,Cowden_syndrome
102317,PIK3CA:5290,526648,179210511,T,C,Likely_benign,Cowden_syndrome
102318,PIK3CA:5290,526640,179210515,A,C,Uncertain_significance,Cowden_syndrome|Hereditary_cancer-predisposing_syndrome
102319,PIK3CA:5290,246681,179210516,A,G,Uncertain_significance,Cowden_syndrome
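
The deliverable asks for chromosome and position together, while `data` keeps only `POS`. A minimal sketch of one way to combine them, using a toy frame; the `CHROM:POS` label format and the column name `Mutation_Position` are assumptions, not something the assignment prescribes.

```python
import pandas as pd

# Toy rows standing in for the parsed file.
toy = pd.DataFrame({"CHROM": [1, 1], "POS": ["1014143", "1014179"]})

# Cast both parts to str so integer and text columns concatenate cleanly.
toy["Mutation_Position"] = toy["CHROM"].astype(str) + ":" + toy["POS"].astype(str)
print(toy)
```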


---

#### determine null values - sanity counts

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 7 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   GENEINFO  97603 non-null   object
 1   ID        102321 non-null  int64 
 2   POS       102321 non-null  object
 3   REF       102321 non-null  object
 4   ALT       102321 non-null  object
 5   CLNSIG    100524 non-null  object
 6   CLNDN     89651 non-null   object
dtypes: int64(1), object(6)
memory usage: 5.5+ MB
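
`POS` is stored as `object` rather than an integer because some values contain a letter, for example `1014O42` in the first row, whose `CLNHGVS` entry reads `g.1014042G>A`. The sketch below assumes the letter `O` was substituted for the digit `0`; that reading is an assumption based on those first rows and should be checked against the whole file.

```python
import pandas as pd

# The first three POS values shown in the preview above.
pos = pd.Series(["1014O42", "1O14122", "1014143"], name="POS")

# Flag values that are not made up of digits only.
bad = ~pos.str.fullmatch(r"\d+")
print(bad.sum())

# Assumption: the letter "O" stands in for the digit "0".
# errors="coerce" turns anything still non-numeric into NaN instead of raising.
pos_clean = pd.to_numeric(pos.str.replace("O", "0", regex=False), errors="coerce")
print(pos_clean.tolist())
```

Counting the flagged rows before repairing them gives a number that can be reported as part of the parsing assumptions.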


---

#### replace null values with text

In [12]:
data.fillna('Not_Given', inplace = True)

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 7 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   GENEINFO  102321 non-null  object
 1   ID        102321 non-null  int64 
 2   POS       102321 non-null  object
 3   REF       102321 non-null  object
 4   ALT       102321 non-null  object
 5   CLNSIG    102321 non-null  object
 6   CLNDN     102321 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.5+ MB


---

#### <font color = orange> Analyze Significance (CLNSIG)

In [14]:
data['CLNSIG'].value_counts().head(10)

Uncertain_significance                          47980
Likely_benign                                   17885
Pathogenic                                      12313
Likely_pathogenic                                6269
Benign                                           6138
Conflicting_interpretations_of_pathogenicity     5404
Benign/Likely_benign                             3338
Not_Given                                        1797
Pathogenic/Likely_pathogenic                      854
risk_factor                                        98
Name: CLNSIG, dtype: int64

#### Focus on Pathogenic Values in Clinical Significance

In [15]:
data[data['CLNSIG'].str.contains('athog')].sample(10)

Unnamed: 0,GENEINFO,ID,POS,REF,ALT,CLNSIG,CLNDN
31558,FH:2271,198391,241504231,TGACAAAA,T,Pathogenic,Multiple_cutaneous_leiomyomas|Fumarase_deficiency|not_provided
7530,KCNQ4:9132,505302,40819463,G,C,Likely_pathogenic,Nonsyndromic_hearing_loss_and_deafness|not_specified
73768,BMPR2:659,425820,202518972,GC,G,Pathogenic,Primary_pulmonary_hypertension
9833,MMACHC:25974,558633,45508362,C,T,Likely_pathogenic,Methylmalonic_acidemia_with_homocystinuria
91044,Not_Given,17582,41224613,G,T,Conflicting_interpretations_of_pathogenicity,Hepatocellular_carcinoma|Hepatoblastoma|Medulloblastoma|Adrenocortical_carcinoma|Cutaneous_melanoma|Malignant_tumor_of_prostate|Craniopharyngioma|Lung_adenocarcinoma|Squamous_cell_carcinoma_of_the_head_and_neck|Malignant_melanoma_of_skin|Malignant_neoplasm_of_body_of_uterus|Adenocarcinoma_of_stomach
72802,Not_Given,144005,190995151,T,C,Pathogenic,Immunodeficiency_31C
57379,SCN1A:6323,189910,166051931,AT,A,Pathogenic,Severe_myoclonic_epilepsy_in_infancy
83707,Not_Given,2225,10149885,C,G,Conflicting_interpretations_of_pathogenicity,"Pheochromocytoma|Von_Hippel-Lindau_syndrome|Hereditary_cancer-predisposing_syndrome|Erythrocytosis,_familial,_2|not_provided"
40241,MSH2:4436,414995,47470955,C,T,Conflicting_interpretations_of_pathogenicity,Hereditary_nonpolyposis_colon_cancer|Hereditary_cancer-predisposing_syndrome|not_provided
48750,ACTG2:72,495145,73914698,G,A,Likely_pathogenic,Visceral_myopathy
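
The substring `'athog'` also matches `Conflicting_interpretations_of_pathogenicity`, as the sample above shows. One possible stricter definition of harmful, offered as an assumption rather than a clinical standard, keeps only labels whose parts are `Pathogenic` or `Likely_pathogenic`. The names `HARMFUL` and `is_harmful` are hypothetical.

```python
import pandas as pd

# Assumption: only these two labels count as harmful.
HARMFUL = {"Pathogenic", "Likely_pathogenic"}

def is_harmful(clnsig) -> bool:
    # Combined labels such as "Pathogenic/Likely_pathogenic" are split on "/".
    return any(part in HARMFUL for part in str(clnsig).split("/"))

labels = pd.Series([
    "Pathogenic",
    "Likely_pathogenic",
    "Pathogenic/Likely_pathogenic",
    "Conflicting_interpretations_of_pathogenicity",
    "Benign",
    "Not_Given",
])

strict_mask = labels.map(is_harmful)
loose_mask = labels.str.contains("athog")
print(pd.DataFrame({"label": labels, "strict": strict_mask, "loose": loose_mask}))
```

Whichever definition is used, stating it explicitly covers requirement 5.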


#### Select top 3 impacted diseases

In [16]:
disease_list = ['Hereditary_cancer-predisposing_syndrome', 'Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G', 'Lynch_syndrome']

In [17]:
data = data[data['CLNSIG'].str.contains('athog', na=False)]
data


Unnamed: 0,GENEINFO,ID,POS,REF,ALT,CLNSIG,CLNDN
2,ISG15:9636,183381,1014143,C,T,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
8,ISG15:9636,161455,1014316,C,CG,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
9,ISG15:9636,161454,1014359,G,T,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
20,AGRN:375790,210112,1020239,G,C,Conflicting_interpretations_of_pathogenicity,"Myasthenic_syndrome,_congenital,_8|not_specified|not_provided"
24,AGRN:375790,243036,1022225,G,A,Pathogenic,Congenital_myasthenic_syndrome
...,...,...,...,...,...,...,...
102307,PIK3CA:5290,39706,179210289,TAGA,T,Pathogenic,Megalencephaly_cutis_marmorata_telangiectatica_congenita|not_provided
102309,PIK3CA:5290,376470,179210291,G,A,Likely_pathogenic,Hepatocellular_carcinoma|Transitional_cell_carcinoma_of_the_bladder|Lung_adenocarcinoma|Squamous_cell_lung_carcinoma|Neoplasm_of_brain|Neoplasm_of_the_breast|Squamous_cell_carcinoma_of_the_head_and_neck|Glioblastoma|Malignant_neoplasm_of_body_of_uterus|Adenocarcinoma_of_stomach
102310,PIK3CA:5290,376471,179210291,G,C,Likely_pathogenic,Hepatocellular_carcinoma|Transitional_cell_carcinoma_of_the_bladder|Lung_adenocarcinoma|Squamous_cell_lung_carcinoma|Neoplasm_of_brain|Neoplasm_of_the_breast|Squamous_cell_carcinoma_of_the_head_and_neck|Glioblastoma|Malignant_neoplasm_of_body_of_uterus|Adenocarcinoma_of_stomach
102311,PIK3CA:5290,45465,179210292,AAGATTTGCTGAACCC,A,Likely_pathogenic,Non-small_cell_lung_cancer


## Sample of 10 Records for the Selected Diseases

In [18]:
pd.set_option('display.max_colwidth', 40)
final = data[data['CLNDN'].isin(disease_list)].sample(10)
final

Unnamed: 0,GENEINFO,ID,POS,REF,ALT,CLNSIG,CLNDN
42393,MSH6:2956,89185,47799282,T,A,Pathogenic,Lynch_syndrome
40527,MSH2:4436,428476,47475194,AG,A,Pathogenic,Hereditary_cancer-predisposing_syndrome
40024,MSH2:4436,90667,47463086,TAAGAG,T,Pathogenic,Lynch_syndrome
87941,MLH1:4292,141632,37025998,GC,G,Pathogenic,Hereditary_cancer-predisposing_syndrome
38593,MSH2:4436,91193,47403264,G,GC,Pathogenic,Lynch_syndrome
38949,MSH2:4436,91093,47410125,AC,A,Pathogenic,Lynch_syndrome
62830,TTN:7273|TTN-AS1:100506866,202470,178569031,G,GT,Pathogenic/Likely_pathogenic,"Limb-girdle_muscular_dystrophy,_type..."
39111,Not_Given,91146,47410322,T,C,Pathogenic,Lynch_syndrome
88644,MLH1:4292,90045,37049018,G,C,Likely_pathogenic,Lynch_syndrome
39826,MSH2:4436,90595,47429943,T,A,Likely_pathogenic,Lynch_syndrome
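
Requirement 2 asks for the first 100 harmful mutations plus the total count, whereas `.sample(10)` draws a different random subset on each run. A minimal sketch with a made-up frame (`harmful` is a hypothetical stand-in for the filtered data):

```python
import pandas as pd

# Made-up stand-in for the filtered harmful rows.
harmful = pd.DataFrame({"ID": range(1, 251), "CLNSIG": ["Pathogenic"] * 250})

total_harmful = len(harmful)        # total number of harmful rows
first_100 = harmful.head(100)       # deterministic: the first 100 in file order

# If a random sample is wanted, a fixed seed makes it repeatable.
reproducible = harmful.sample(10, random_state=0)

print(f"Showing {len(first_100)} of {total_harmful} harmful mutations")
```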


In [83]:
final = data[data['CLNDN'].isin(disease_list)].sample(10)
final

Unnamed: 0,GENEINFO,ID,POS,REF,ALT,CLNSIG,CLNDN
40639,MSH2:4436,90862,47476381,GGT,G,Pathogenic,Lynch_syndrome
65181,TTN:7273|TTN-AS1:100506866,534995,178605552,G,A,Likely_pathogenic,"Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G"
39535,MSH2:4436,428458,47416341,C,CT,Pathogenic,Hereditary_cancer-predisposing_syndrome
88520,MLH1:4292,486840,37048611,T,C,Likely_pathogenic,Hereditary_cancer-predisposing_syndrome
42904,MSH6:2956,89243,47800030,G,GCT,Pathogenic,Lynch_syndrome
38892,MSH2:4436,428484,47408554,A,CGG,Pathogenic,Hereditary_cancer-predisposing_syndrome
40417,MSH2:4436,428548,47475073,A,AT,Pathogenic,Hereditary_cancer-predisposing_syndrome
44524,MSH6:2956,237199,47806511,T,TA,Pathogenic,Lynch_syndrome
43331,MSH6:2956,237162,47800663,C,T,Pathogenic,Lynch_syndrome
88377,MLH1:4292,89908,37047640,A,TTCTT,Pathogenic,Lynch_syndrome


#### Combine REF & ALT Bases to get Mutation Value

In [84]:
final['MUTATION_VALUE'] = final["REF"].astype(str) + "_" + final["ALT"]
final

Unnamed: 0,GENEINFO,ID,POS,REF,ALT,CLNSIG,CLNDN,MUTATION_VALUE
40639,MSH2:4436,90862,47476381,GGT,G,Pathogenic,Lynch_syndrome,GGT_G
65181,TTN:7273|TTN-AS1:100506866,534995,178605552,G,A,Likely_pathogenic,"Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G",G_A
39535,MSH2:4436,428458,47416341,C,CT,Pathogenic,Hereditary_cancer-predisposing_syndrome,C_CT
88520,MLH1:4292,486840,37048611,T,C,Likely_pathogenic,Hereditary_cancer-predisposing_syndrome,T_C
42904,MSH6:2956,89243,47800030,G,GCT,Pathogenic,Lynch_syndrome,G_GCT
38892,MSH2:4436,428484,47408554,A,CGG,Pathogenic,Hereditary_cancer-predisposing_syndrome,A_CGG
40417,MSH2:4436,428548,47475073,A,AT,Pathogenic,Hereditary_cancer-predisposing_syndrome,A_AT
44524,MSH6:2956,237199,47806511,T,TA,Pathogenic,Lynch_syndrome,T_TA
43331,MSH6:2956,237162,47800663,C,T,Pathogenic,Lynch_syndrome,C_T
88377,MLH1:4292,89908,37047640,A,TTCTT,Pathogenic,Lynch_syndrome,A_TTCTT


### Reorder Columns

In [85]:
final = final[["GENEINFO", "ID", "POS", "MUTATION_VALUE","CLNSIG", "CLNDN", "REF", "ALT"]]
final

Unnamed: 0,GENEINFO,ID,POS,MUTATION_VALUE,CLNSIG,CLNDN,REF,ALT
40639,MSH2:4436,90862,47476381,GGT_G,Pathogenic,Lynch_syndrome,GGT,G
65181,TTN:7273|TTN-AS1:100506866,534995,178605552,G_A,Likely_pathogenic,"Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G",G,A
39535,MSH2:4436,428458,47416341,C_CT,Pathogenic,Hereditary_cancer-predisposing_syndrome,C,CT
88520,MLH1:4292,486840,37048611,T_C,Likely_pathogenic,Hereditary_cancer-predisposing_syndrome,T,C
42904,MSH6:2956,89243,47800030,G_GCT,Pathogenic,Lynch_syndrome,G,GCT
38892,MSH2:4436,428484,47408554,A_CGG,Pathogenic,Hereditary_cancer-predisposing_syndrome,A,CGG
40417,MSH2:4436,428548,47475073,A_AT,Pathogenic,Hereditary_cancer-predisposing_syndrome,A,AT
44524,MSH6:2956,237199,47806511,T_TA,Pathogenic,Lynch_syndrome,T,TA
43331,MSH6:2956,237162,47800663,C_T,Pathogenic,Lynch_syndrome,C,T
88377,MLH1:4292,89908,37047640,A_TTCTT,Pathogenic,Lynch_syndrome,A,TTCTT


---

### Revise Column Names

In [86]:
col_name = {'GENEINFO': "Gene_Name", "ID": "Mutation_ID", "POS": "Mutation_Position", "REF": "Reference", 
            "ALT": "Alternate", "CLNSIG": "Significance", "CLNDN": "Impacted_Disease", "MUTATION_VALUE": "Mutation_Value"}

In [87]:
final = final.rename(columns = col_name)
final

Unnamed: 0,Gene_Name,Mutation_ID,Mutation_Position,Mutation_Value,Significance,Impacted_Disease,Reference,Alternate
40639,MSH2:4436,90862,47476381,GGT_G,Pathogenic,Lynch_syndrome,GGT,G
65181,TTN:7273|TTN-AS1:100506866,534995,178605552,G_A,Likely_pathogenic,"Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G",G,A
39535,MSH2:4436,428458,47416341,C_CT,Pathogenic,Hereditary_cancer-predisposing_syndrome,C,CT
88520,MLH1:4292,486840,37048611,T_C,Likely_pathogenic,Hereditary_cancer-predisposing_syndrome,T,C
42904,MSH6:2956,89243,47800030,G_GCT,Pathogenic,Lynch_syndrome,G,GCT
38892,MSH2:4436,428484,47408554,A_CGG,Pathogenic,Hereditary_cancer-predisposing_syndrome,A,CGG
40417,MSH2:4436,428548,47475073,A_AT,Pathogenic,Hereditary_cancer-predisposing_syndrome,A,AT
44524,MSH6:2956,237199,47806511,T_TA,Pathogenic,Lynch_syndrome,T,TA
43331,MSH6:2956,237162,47800663,C_T,Pathogenic,Lynch_syndrome,C,T
88377,MLH1:4292,89908,37047640,A_TTCTT,Pathogenic,Lynch_syndrome,A,TTCTT


### Final Check - Drop Redundant Columns

In [88]:
final = final.drop(['Reference', 'Alternate'], axis = 1)
final

Unnamed: 0,Gene_Name,Mutation_ID,Mutation_Position,Mutation_Value,Significance,Impacted_Disease
40639,MSH2:4436,90862,47476381,GGT_G,Pathogenic,Lynch_syndrome
65181,TTN:7273|TTN-AS1:100506866,534995,178605552,G_A,Likely_pathogenic,"Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G"
39535,MSH2:4436,428458,47416341,C_CT,Pathogenic,Hereditary_cancer-predisposing_syndrome
88520,MLH1:4292,486840,37048611,T_C,Likely_pathogenic,Hereditary_cancer-predisposing_syndrome
42904,MSH6:2956,89243,47800030,G_GCT,Pathogenic,Lynch_syndrome
38892,MSH2:4436,428484,47408554,A_CGG,Pathogenic,Hereditary_cancer-predisposing_syndrome
40417,MSH2:4436,428548,47475073,A_AT,Pathogenic,Hereditary_cancer-predisposing_syndrome
44524,MSH6:2956,237199,47806511,T_TA,Pathogenic,Lynch_syndrome
43331,MSH6:2956,237162,47800663,C_T,Pathogenic,Lynch_syndrome
88377,MLH1:4292,89908,37047640,A_TTCTT,Pathogenic,Lynch_syndrome
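
`Gene_Name` still carries the number after the colon (for example `MSH2:4436`). If only the gene symbol is wanted, the numeric part can be stripped. This sketch assumes the text after each `:` is always a numeric gene identifier, which holds for the rows shown here but has not been verified for the whole file.

```python
import pandas as pd

geneinfo = pd.Series(["ISG15:9636", "TTN:7273|TTN-AS1:100506866", "Not_Given"])

# Remove every ":<digits>" suffix; multi-gene entries keep their "|" separator.
gene_names = geneinfo.str.replace(r":\d+", "", regex=True)
print(gene_names.tolist())
```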


---

#### <font color = orange> Analyze Implicated Disease (CLNDN)

In [23]:
pd.reset_option('display.max_rows')
pd.set_option('display.max_rows', 100)

In [24]:
data['CLNDN'].value_counts()

Not_Given                                                                             3760
Lynch_syndrome                                                                         790
not_specified|not_provided                                                             456
Hereditary_cancer-predisposing_syndrome                                                434
Ehlers-Danlos_syndrome,_type_4                                                         422
                                                                                      ... 
Retinitis_pigmentosa|Retinal_dystrophy|Stargardt_disease_1|not_provided                  1
Nephronophthisis|Joubert_syndrome|Renal_dysplasia_and_retinal_aplasia|not_provided       1
Retinitis_pigmentosa_19|not_provided                                                     1
Nephronophthisis|Nephronophthisis_1|not_provided                                         1
Non-small_cell_lung_cancer                                                               1
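
Many `CLNDN` values list several diseases separated by `|`, so counting whole strings treats each combination as its own category. A sketch of counting individual disease names instead, using a few values taken from the outputs above:

```python
import pandas as pd

clndn = pd.Series([
    "Lynch_syndrome",
    "Cowden_syndrome|Hereditary_cancer-predisposing_syndrome",
    "Hereditary_cancer-predisposing_syndrome",
    "Not_Given",
])

# Split on "|" into lists, then give every disease name its own row.
per_disease = clndn.str.split("|").explode().value_counts()
print(per_disease)
```

Counted this way, a disease that appears both alone and in a combined entry is tallied once per mutation it is linked to.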

## Import Data - Header

In [25]:
df_header =  pd.read_table('../clinvar_final.txt',nrows=26)
pd.options.display.max_colwidth = 1000

In [26]:
df_header

Unnamed: 0,#fileformat=VCFv4.1
0,#fileDate=2019-03-01
1,#source=ClinVar
2,#reference=GRCh38
3,"#ID=<Description=""ClinVar Variation ID"">"
4,"#INFO=<ID=AF_ESP,Number=1,Type=Float,Description=""allele frequencies from GO-ESP"">"
5,"#INFO=<ID=AF_EXAC,Number=1,Type=Float,Description=""allele frequencies from ExAC"">"
6,"#INFO=<ID=AF_TGP,Number=1,Type=Float,Description=""allele frequencies from TGP"">"
7,"#INFO=<ID=ALLELEID,Number=1,Type=Integer,Description=""the ClinVar Allele ID"">"
8,"#INFO=<ID=CLNDN,Number=.,Type=String,Description=""ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB"">"
9,"#INFO=<ID=CLNDNINCL,Number=.,Type=String,Description=""For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB"">"
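
The `#INFO=<...>` lines follow a regular pattern, so they can be turned into a small data dictionary. This is a sketch only: `parse_info_meta` is a hypothetical helper, and the pattern assumes every INFO line carries `ID`, `Number`, `Type`, and `Description` in that order, as the lines shown above do.

```python
import re

# "#+" accepts one or two leading "#" characters.
META = re.compile(
    r'#+INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">'
)

def parse_info_meta(line: str):
    """Return the parts of an INFO meta line, or None if the line does not match."""
    match = META.match(line.strip())
    if match is None:
        return None
    key, number, type_, description = match.groups()
    return {"ID": key, "Number": number, "Type": type_, "Description": description}


line = '#INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">'
print(parse_info_meta(line))
print(parse_info_meta("#source=ClinVar"))  # prints None
```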


---

## Import Data - Lines

In [27]:
df_lines = pd.read_table('../clinvar_final.txt', sep='\t', skiprows=27, header=0)

In [28]:
df_lines.head(5)

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
0,1,1014O42,475283,G,A,.,.,"AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;ALLELEID=446939;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014042G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=143888043"
1,1,1O14122,542074,C,T,.,.,"AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014122C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=150861311"
2,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014143C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:147571.0003;GENEINFO=ISG15:9636;MC=SO:0001587|nonsense;ORIGIN=1;RS=786201005"
3,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014179C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1553169766"
4,1,1014217,475278,C,T,.,.,"AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;ALLELEID=446987;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014217C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=61766284"


In [29]:
df_lines.shape

(102321, 8)

In [30]:
df_lines.columns

Index(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'FILTER', 'QUAL', 'INFO'], dtype='object')

In [31]:
df_lines.count()

CHROM     102321
POS       102321
ID        102321
REF       102321
ALT       102321
FILTER    102321
QUAL      102321
INFO      102321
dtype: int64

In [32]:
df_lines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 8 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   CHROM   102321 non-null  int64 
 1   POS     102321 non-null  object
 2   ID      102321 non-null  int64 
 3   REF     102321 non-null  object
 4   ALT     102321 non-null  object
 5   FILTER  102321 non-null  object
 6   QUAL    102321 non-null  object
 7   INFO    102321 non-null  object
dtypes: int64(2), object(6)
memory usage: 6.2+ MB


In [33]:
df_lines.describe()

# The count for both numeric columns matches the number of rows in the DataFrame

Unnamed: 0,CHROM,ID
count,102321.0,102321.0
mean,1.88,340282.99
std,0.71,163372.0
min,1.0,20.0
25%,1.0,216958.0
50%,2.0,342510.0
75%,2.0,479701.0
max,3.0,620635.0


## Count null value columns

In [34]:
print('-'*100)
print('Count of null values arranged in descending order')
print('-'*100)
df_lines.isna().sum(axis=0).sort_values(ascending = False)

----------------------------------------------------------------------------------------------------
Count of null values arranged in descending order
----------------------------------------------------------------------------------------------------


CHROM     0
POS       0
ID        0
REF       0
ALT       0
FILTER    0
QUAL      0
INFO      0
dtype: int64

    It appears that no column contains null values, although `.` placeholders are not counted as null
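
The VCF description states that missing values are written as a dot, and `isna()` does not treat the string `.` as null. A minimal sketch of two ways to surface those placeholders, using a made-up two-row table in place of the real file:

```python
import io
import pandas as pd

# Made-up tab-delimited text standing in for the data lines.
text = "CHROM\tPOS\tFILTER\tQUAL\n1\t1014143\t.\t.\n1\t1014179\t.\t.\n"

# Option 1: read everything as text and count the "." placeholders directly.
raw = pd.read_csv(io.StringIO(text), sep="\t", dtype=str, keep_default_na=False)
dot_counts = (raw == ".").sum()
print(dot_counts)

# Option 2: tell the parser that "." means missing, so isna() can see it.
parsed = pd.read_csv(io.StringIO(text), sep="\t", na_values=".")
print(parsed.isna().sum())
```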

# Understand Documentation for Columns

In [35]:
# A VCF file contains meta-information lines, a header line, and data lines, each containing information about a position in the genome. The format can also contain genotype information on samples for each position.

# allele: one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome.

# Genotype data are given for three samples, two of which are phased and the third unphased, with per-sample genotype quality,
    # depth, and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased.

## Types

### <font color = red> (1) Two alternate alleles

![image.png](attachment:a6752872-1d56-4117-9bb7-7a56b2b0dc1c.png)

In [36]:
# with one of them (T) being ancestral (possibly a reference sequencing error),

---

### <font color = red>  (2) Monomorphic reference

![image.png](attachment:4c6c0882-11f9-4f22-98c2-71812b49b6c2.png)
![image.png](attachment:99c85c9b-527c-4aff-aaa8-323140294c58.png)

In [37]:
# (i.e. with no alternate alleles),

---

### <font color = red> (3) Microsatellite

![image.png](attachment:21f0c9b1-6e74-4ff6-8486-2d4f777a35dd.png)
![image.png](attachment:0cdfc389-8632-4777-9ff7-87b97b2c433a.png)

In [38]:
# with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T).

---

## Data

## <font color = green> VCF Spec : Review

### CHROM - chromosome

In [39]:
# An identifier from the reference genome or an angle-bracketed ID String
# The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends

### POS - position

In [40]:
# The reference position, with the 1st base having position 1
# Positions are sorted numerically, in increasing order, within each reference sequence CHROM.
# It is permitted to have multiple records with the same POS
# Telomeres? A telomere is a region of repetitive nucleotide sequences associated with specialized proteins at the ends of 
        # linear chromosomes. Although there are different architectures, telomeres, in a broad sense, are a widespread genetic 
        # feature most commonly found in eukaryotes

### ID - identifier

In [41]:
# Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). 
# No identifier should be present in more than one data record. 
# If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted) 

### REF - reference base(s)

In [42]:
# Each base must be one of A,C,G,T,N (case insensitive). 
# Multiple bases are permitted. 
# The value in the POS field refers to the position of the first base in the String.
# For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, 
    # the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), 
    # unless the event occurs at position 1 on the contig in which case it must include the base after the event; 
        # this padding base is not required (although it is permitted) for e.g. complex substitutions or other events 
        # where all alleles have at least one base represented in their Strings.
# If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String “<ID>”) then the padding base is required and 
    # POS denotes the coordinate of the base preceding the polymorphism.

### ALT - alternate base(s)

In [43]:
# Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. 
# Options are base Strings made up of the bases A,C,G,T,N, (case insensitive) or an angle-bracketed ID String (“<ID>”) 
    # or a breakend replacement string as described in the section on breakends. 
#  If there are no alternative alleles, then the missing value should be used.

### FILTER - Filter Status

In [44]:
# PASS if this position has passed all filters, i.e. a call is made at this position. 
# Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail.
# e.g. “q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is 
    # below 50% of the total number of samples
#  ‘0’ is reserved and should not be used as a filter String. 
# If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted) 

### QUAL - Quality

In [45]:
# Phred-scaled quality score for the assertion made in ALT. 
    # −10log10 prob(call in ALT is wrong)
# If ALT is ‘.’ (no variant) then this is −10log10 prob(variant)
# and if ALT is not ‘.’ this is −10log10 prob(no variant)
# High QUAL scores indicate high confidence calls.
# Although traditionally people use integer phred scores, this field is permitted to be a 
    # floating point to enable higher resolution for low confidence calls if desired
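The Phred formula above can be turned into a small conversion helper. This is a sketch only; the function names are hypothetical and are not used elsewhere in this notebook. (In this file `QUAL` is `.` throughout, so there is nothing to convert here, but the arithmetic is worth seeing once.)

```python
import math

def phred_to_error_prob(qual: float) -> float:
    """Invert QUAL = -10 * log10(p): return the probability p that the call is wrong."""
    return 10 ** (-qual / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred-scale a probability: QUAL = -10 * log10(p)."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each additional 10 QUAL points corresponds to a tenfold lower error probability.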

### <font color = red> INFO - additional information

In [46]:
# commas are permitted only as delimiters for lists of values
# INFO fields are encoded as a semicolon-separated series of short keys with 
    # optional values in the format: <key>=<data>[,data]
# Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional): 


* AA : ancestral allele 
* AC : allele count in genotypes, for each ALT allele, in the same order as listed 
* AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes 
* AN : total number of alleles in called genotypes 
* BQ : RMS base quality at this position 
* CIGAR : cigar string describing how to align an alternate allele to the reference allele 
* DB : dbSNP membership 
* DP : combined depth across samples, e.g. DP=154 
* END : end position of the variant described in this record (for use with symbolic alleles) 
* H2 : membership in hapmap2 
* H3 : membership in hapmap3 
* MQ : RMS mapping quality, e.g. MQ=52 
* MQ0 : Number of MAPQ == 0 reads covering this record 
* NS : Number of samples with data 
* SB : strand bias at this position 
* SOMATIC : indicates that the record is a somatic mutation, for cancer genomics 
    * An alteration in DNA that occurs after conception. Somatic mutations can occur in any of the cells of the body except the germ cells (sperm and egg) and therefore are not passed on to children. 
    * These alterations can (but do not always) cause cancer or other diseases.
* VALIDATED : validated by follow-up experiment 
* 1000G : membership in 1000 Genomes 


In [47]:
# The exact format of each INFO sub-field should be specified in the meta-information (as described above). 
    # Example for an INFO field: DP=154;MQ=52;H2.
# Keys without corresponding values are allowed in order to indicate group membership 
    # (e.g. H2 indicates the SNP is found in HapMap 2).
# It is not necessary to list all the properties that a site does NOT have
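The `<key>=<data>` layout described above can be split with plain string operations. The helper below is a minimal sketch (`parse_info` is a hypothetical name) that assumes keys never contain `=` or `;`. It deliberately keeps each value as a raw string instead of splitting on commas, because ClinVar values such as `CLNREVSTAT=criteria_provided,_single_submitter` contain commas that are not list separators.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; keys without a value become True flags."""
    fields = {}
    if info in ("", "."):          # '.' is the VCF missing-value marker
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

parsed = parse_info("DP=154;MQ=52;H2")
print(parsed)
```

Applied row by row, e.g. `df_lines['INFO'].apply(parse_info)`, this would give one dict per variant from which `CLNSIG`, `GENEINFO`, and `CLNDN` can be looked up.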

---

### <font color = red> Genotype Fields

In [48]:
# If genotype information is present, then the same types of data must be present for all samples. 
# First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric String).
# This is followed by one field per sample, with the colon-separated data in this field corresponding to the types specified in the format. 
# The first sub-field must always be the genotype (GT) if it is present. There are no required sub-fields. 

#-------------
# GT : genotype, encoded as allele values separated by either of / or |
    #  The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 
    # 2 for the second allele listed in ALT and so on.
    # ----------
    # diploid calls examples could be 0/1, 1 | 0, or 1/2, etc
    # triploid call might look like 0/0/1
    # ---------
    # The meanings of the separators are
    # ◦ / : genotype unphased 
    # ◦ | : genotype phased 
#  DP : read depth at this position for this sample
#  FT : sample genotype filter
#  GL : genotype likelihoods
#  GLE : genotype likelihoods of heterogeneous ploidy
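The GT encoding described above is easy to decode. The following sketch uses a hypothetical helper, `parse_genotype`, and assumes a single GT string whose separators are all of one kind. It returns the allele indices and a flag saying whether the call is phased. The ClinVar file used here has no sample columns, so this is for reference only.

```python
import re

def parse_genotype(gt: str):
    """Return (allele_indices, phased) for a GT string such as '0/1' or '1|0'."""
    cleaned = gt.replace(" ", "")          # tolerate spacing like '1 | 0'
    phased = "|" in cleaned                # '|' means phased, '/' means unphased
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", cleaned)]
    return alleles, phased

for example in ("0/1", "1 | 0", "1/2", "0/0/1", "./."):
    print(example, parse_genotype(example))
```

Index 0 refers to REF, and index `n` refers to the `n`-th allele in the comma-separated ALT field; a missing call (`.`) is returned as `None`.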
    

## <font color = red> INFO keys used for structural variants

#### The following INFO keys are reserved for encoding structural variants. 


In [49]:
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation"> 
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation"> 
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> 

## <font color = red> FORMAT keys used for structural variants 

In [50]:
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events"> 
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events"> 
##FORMAT=<ID=CNL,Number=.,Type=Float,Description="Copy number genotype likelihood for imprecise events"> 
##FORMAT=<ID=NQ,Number=1,Type=Integer,Description="Phred style probability score that the variant is novel"> 
##FORMAT=<ID=HAP,Number=1,Type=Integer,Description="Unique haplotype identifier"> 
##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype"> 

### Creating VCF entries for SNPs and small indels

#### example 1

![image.png](attachment:ed41d07b-8dce-4575-abbc-1287835bc19d.png)

In [51]:
# looking for 
# #CHROM POS ID REF ALT QUAL FILTER INFO 
# 20 3 . C G . PASS DP=100 
# 20 2 . TC T . PASS DP=100 
# 20 2 . TC TCA . PASS DP=100 


In [52]:
##fileformat=VCFv4.1
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=DEL:ME:L1,Description="Deletion of L1 element">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:L1,Description="Insertion of L1 element">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
#CHROM

In [53]:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
# 1 2827694 rs2376870 CGTGGATGCGGGGAC C . PASS SVTYPE=DEL;END=2827708;HOMLEN=1;HOMSEQ=G;SVLEN=-14 GT:GQ 1/1:14
# 2 321682 . T <DEL> 6 PASS SVTYPE=DEL;END=321887;SVLEN=-205;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
# 2 14477084 . C <DEL:ME:ALU> 12 PASS SVTYPE=DEL;END=14477381;SVLEN=-297;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
# 3 9425916 . C <INS:ME:L1> 23 PASS SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22 GT:GQ 1/1:15
# 3 12665100 . A <DUP> 14 PASS SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500 GT:GQ:CN:CNQ ./.:0:3:16.2
# 4 18665128 . T <DUP:TANDEM> 11 PASS SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10 GT:GQ:CN:CNQ ./.:0:5:8.3

![image.png](attachment:1940e1af-afa4-43bf-88dc-fe2e3b684961.png)

1. A precise deletion with known breakpoint, a one base micro-homology, and a sample that is homozygous for
the deletion.
2. An imprecise deletion of approximately 205 bp.
3. An imprecise deletion of an ALU element relative to the reference.
4. An imprecise insertion of an L1 element relative to the reference.
5. An imprecise duplication of approximately 21Kb. The sample genotype is copy number 3 (one extra copy of
the duplicated sequence).
6. An imprecise tandem duplication of 76bp. The sample genotype is copy number 5 (but the two haplotypes are
not known).
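As a sanity check on the six records above, the `END`, `POS`, and `SVLEN` values can be compared directly. The sketch below copies the numbers from the example data lines and tests a simple relation that happens to hold for them: `END = POS + |SVLEN|` for deletions and duplications, and `END = POS` for the insertion. This is an observation about these examples, not a rule quoted from the specification.

```python
# (POS, SVTYPE, SVLEN, END) copied from the six example data lines.
records = [
    (2827694, "DEL", -14, 2827708),
    (321682, "DEL", -205, 321887),
    (14477084, "DEL", -297, 14477381),
    (9425916, "INS", 6027, 9425916),
    (12665100, "DUP", 21100, 12686200),
    (18665128, "DUP", 76, 18665204),
]

def expected_end(pos: int, svtype: str, svlen: int) -> int:
    """Relation observed in the examples; treat it as an assumption elsewhere."""
    return pos if svtype == "INS" else pos + abs(svlen)

mismatches = [r for r in records if expected_end(r[0], r[1], r[2]) != r[3]]
print("records that break the relation:", mismatches)
```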

#### example 2

![image.png](attachment:01fac560-76ce-44ad-b11b-6ad3690d2370.png)

![image.png](attachment:dbdb296f-8d9a-48bd-9061-2ca6bd3a4775.png)

In [54]:
# looking for
# #CHROM POS ID REF ALT QUAL FILTER INFO 
# 20 2 . TCG TG,T,TCAG . PASS DP=100 


---

#### Clonal derivation relationships (Exploration Potential)

In [55]:
# In cancer, each VCF file represents several genomes from a patient, but one genome is special in that it represents
# the germline genome of the patient. This genome is contrasted to a second genome, the cancer tumor genome. In
# the simplest case the VCF file for a single patient contains only these two genomes. This is assumed in most of the
# discussion of the sections below.

![image.png](attachment:465fd8cc-bcd6-4b5b-9d88-40d86ea85ca0.png)

![image.png](attachment:383fa68e-93cc-4a26-aa6f-aa682b5dd88d.png)

---

# <font color = teal> Explore Columns

## <font color = blue> CHROM

In [56]:
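df_lines['CHROM'].describe

# note: describe is written without call parentheses, so the output below is the bound method's repr, not summary statistics
# not sure what this is useful for. Values are 1, 2, 3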

<bound method NDFrame.describe of 0         1
1         1
2         1
3         1
4         1
         ..
102316    3
102317    3
102318    3
102319    3
102320    3
Name: CHROM, Length: 102321, dtype: int64>

In [57]:
df_lines.tail(10)

# Confirms CHROM is stored as an integer (these last rows are all chromosome 3)

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
102311,3,179210292,45465,AAGATTTGCTGAACCC,A,.,.,"ALLELEID=54632;CLNDISDB=Human_Phenotype_Ontology:HP:0030358,MeSH:D002289,MedGen:C0007131,SNOMED_CT:254637007;CLNDN=Non-small_cell_lung_cancer;CLNHGVS=NC_000003.12:g.179210293_179210307delAGATTTGCTGAACCC;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_pathogenic;CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=PIK3CA:5290;ORIGIN=2;RS=397517200"
102312,3,179210293,376472,A,T,.,.,"ALLELEID=363351;CLNDISDB=Human_Phenotype_Ontology:HP:0001402,MedGen:C2239176,OMIM:114550,Orphanet:ORPHA88673,SNOMED_CT:187769009,SNOMED_CT:25370001|Human_Phenotype_Ontology:HP:0006740,MedGen:C0279680|Human_Phenotype_Ontology:HP:0030078,MeSH:C538231,MedGen:C0152013|Human_Phenotype_Ontology:HP:0030359,MedGen:C0149782|Human_Phenotype_Ontology:HP:0030692,MeSH:D001932,MedGen:C0006118,SNOMED_CT:126952004|Human_Phenotype_Ontology:HP:0100013,MeSH:D001943,MedGen:C1458155,Orphanet:ORPHA180250,SNOMED_CT:126926005|MeSH:C535575,MedGen:C1168401,OMIM:275355,Orphanet:ORPHA67037|MeSH:D005909,MedGen:C0017636,Orphanet:ORPHA360,SNOMED_CT:63634009|MedGen:C0153574,Orphanet:ORPHA213569|MedGen:C0278701;CLNDN=Hepatocellular_carcinoma|Transitional_cell_carcinoma_of_the_bladder|Lung_adenocarcinoma|Squamous_cell_lung_carcinoma|Neoplasm_of_brain|Neoplasm_of_the_breast|Squamous_cell_carcinoma_of_the_head_and_neck|Glioblastoma|Malignant_neoplasm_of_body_of_uterus|Adenocarcinoma_of_stomach;CLNHGVS=NC_000003.12:g...."
102313,3,179210318,584654,A,G,.,.,"ALLELEID=575696;CLNDISDB=MedGen:C0027672,SNOMED_CT:699346009;CLNDN=Hereditary_cancer-predisposing_syndrome;CLNHGVS=NC_000003.12:g.179210318A>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001583|missense_variant;ORIGIN=0"
102314,3,179210438,376359,C,T,.,.,ALLELEID=363238;CLNDISDB=MedGen:C0302182;CLNDN=Trabecular_adenocarcinoma;CLNHGVS=NC_000003.12:g.179210438C>T;CLNREVSTAT=no_assertion_provided;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001583|missense_variant;ORIGIN=2;RS=1057519872
102315,3,179210475,412645,G,C,.,.,"AF_EXAC=0.00007;ALLELEID=393605;CLNDISDB=MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58037000;CLNDN=Cowden_syndrome;CLNHGVS=NC_000003.12:g.179210475G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=202186428"
102316,3,179210507,403908,A,G,.,.,"ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58037000;CLNDN=Cowden_syndrome;CLNHGVS=NC_000003.12:g.179210507A>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1060500027"
102317,3,179210511,526648,T,C,.,.,"ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58037000;CLNDN=Cowden_syndrome;CLNHGVS=NC_000003.12:g.179210511T>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=1480813252"
102318,3,179210515,526640,A,C,.,.,"AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58037000|MedGen:C0027672,SNOMED_CT:699346009;CLNDN=Cowden_syndrome|Hereditary_cancer-predisposing_syndrome;CLNHGVS=NC_000003.12:g.179210515A>C;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001583|missense_variant;ORIGIN=1;RS=199563773"
102319,3,179210516,246681,A,G,.,.,"AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58037000;CLNDN=Cowden_syndrome;CLNHGVS=NC_000003.12:g.179210516A>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001583|missense_variant;ORIGIN=1;RS=753879573"
102320,3,179210538,259958,A,T,.,.,"AF_EXAC=0.00001;ALLELEID=251013;CLNDISDB=MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58037000|MedGen:CN169374;CLNDN=Cowden_syndrome|not_specified;CLNHGVS=NC_000003.12:g.179210538A>T;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Likely_benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=PIK3CA:5290;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=201108344"
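The rows above show that the fields the boss asked for (`GENEINFO`, `CLNSIG`, `CLNDN`) all live inside `INFO`, and that `CLNSIG` can be absent (row 102314). One way to pull them out is a regular expression per key. This is a sketch on a hypothetical three-row frame with INFO strings abbreviated from the rows above; `extract_key` is an illustrative helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# INFO strings abbreviated from rows 102311, 102314 and 102315 above.
demo = pd.DataFrame({"INFO": [
    "ALLELEID=54632;CLNDN=Non-small_cell_lung_cancer;CLNSIG=Likely_pathogenic;GENEINFO=PIK3CA:5290;ORIGIN=2",
    "ALLELEID=363238;CLNDN=Trabecular_adenocarcinoma;CLNVC=single_nucleotide_variant;GENEINFO=PIK3CA:5290",
    "AF_EXAC=0.00007;ALLELEID=393605;CLNDN=Cowden_syndrome;CLNSIG=Likely_benign;GENEINFO=PIK3CA:5290",
]})

def extract_key(info: pd.Series, key: str) -> pd.Series:
    """Return the value of `key` from each INFO string, NaN where the key is absent."""
    # Anchoring on start-of-string or ';' plus the '=' keeps CLNSIG from matching CLNSIGCONF.
    return info.str.extract(rf"(?:^|;){key}=([^;]*)", expand=False)

demo["CLNSIG"] = extract_key(demo["INFO"], "CLNSIG").fillna("Not_Given")
demo["CLNDN"] = extract_key(demo["INFO"], "CLNDN").fillna("Not_Given")
# GENEINFO looks like 'SYMBOL:id' (or several joined by '|'); this keeps the first symbol only.
demo["GENE"] = extract_key(demo["INFO"], "GENEINFO").str.split(":").str[0]

print(demo["CLNSIG"].value_counts())
```

Filling with `'Not_Given'` before calling `value_counts()` keeps the missing rows visible in the tally, as the assignment requires.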


In [58]:
df_lines['CHROM'].unique()

# confirms there are only 3 values for CHROM

array([1, 2, 3])

---

## <font color = blue> POS

In [59]:
df_lines.POS.describe()

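# describe() returns count/unique/top/freq rather than numeric statistics because POS is stored as strings (dtype object)
# there are 95554 unique values (of a total of 102321) -- need to look into duplicates

# the most frequent value appears 31 times (freq) -- i.e. the repetitions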

count       102321
unique       95554
top       37025629
freq            31
Name: POS, dtype: object

In [60]:
df['POS'].value_counts().sample(10)

42688311     1
197102249    1
1887487      1
148469863    1
220171031    1
52451148     1
12002021     1
100462862    1
110172076    1
46194914     1
Name: POS, dtype: int64

In [61]:
#df['POS'].value_counts(ascending = False).sample(100)
df_lines['POS'].value_counts(ascending = False).head(30)

# it would be interesting to explore some of these cases

37025629     31
73385903     13
149172318    13
92478757     11
178713381    11
47414420     11
241500602    10
20651801     10
165294040     9
168293284     9
55058666      8
120992007     8
51032510      8
158611131     8
70955832      8
124746755     7
24118973      7
160039373     6
50346870      6
211492048     6
178698916     6
47806285      6
10376834      6
172666616     6
176119435     6
46709583      6
37025608      6
178535858     5
103005900     5
37050632      5
Name: POS, dtype: int64
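A position is only meaningful together with its chromosome, so counting `POS` alone can merge unrelated sites that happen to share a coordinate on different chromosomes. A sketch of counting per `(CHROM, POS)` pair follows, on a small hypothetical frame rather than `df_lines`.

```python
import pandas as pd

# Hypothetical rows: position '100' occurs on chromosomes 1 and 2.
demo = pd.DataFrame({
    "CHROM": [1, 2, 2, 3, 3, 3],
    "POS": ["100", "100", "100", "200", "200", "200"],
    "ALT": ["A", "TA", "TTA", "G", "GC", "GCC"],
})

# Counting POS alone lumps chromosome 1 and chromosome 2 together.
by_pos = demo["POS"].value_counts()

# Counting the pair keeps each genomic site separate.
by_site = demo.groupby(["CHROM", "POS"]).size().sort_values(ascending=False)

print(by_pos)
print(by_site)
```

On the real data, `df_lines.groupby(['CHROM', 'POS']).size()` would confirm whether the 31 rows at 37025629 all sit on one chromosome.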

In [62]:
df_lines[df_lines['POS'] == '37025629']

# this is very interesting -- what is ALT?
# also, these repetitions do not have any filter or qual

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
87677,3,37025629,36539,T,A,.,.,"AF_EXAC=0.00014;ALLELEID=45201;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090|MedGen:C0027672,SNOMED_CT:699346009|MedGen:C1333990,Orphanet:ORPHA144,SNOMED_CT:315058005|MedGen:C1333991,OMIM:609310|MedGen:C2936783,OMIM:120435|MedGen:CN169374;CLNDN=Hereditary_nonpolyposis_colon_cancer|Hereditary_cancer-predisposing_syndrome|Lynch_syndrome|Lynch_syndrome_II|Lynch_syndrome_I|not_specified;CLNHGVS=NC_000003.12:g.37025629T>A;CLNREVSTAT=reviewed_by_expert_panel;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=Center_for_Human_Genetics,_Inc:MLH1-A5|International_Society_for_Gastrointestinal_Hereditary_Tumours_(InSiGHT):c.1039-8T>A;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=193922367"
87678,3,37025629,140797,T,TA,.,.,"ALLELEID=150511;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090|MedGen:C0027672,SNOMED_CT:699346009|MedGen:CN169374;CLNDN=Hereditary_nonpolyposis_colon_cancer|Hereditary_cancer-predisposing_syndrome|not_specified;CLNHGVS=NC_000003.12:g.37025631dupA;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign;CLNVC=Duplication;CLNVCSO=SO:1000035;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=1553650466"
87679,3,37025629,182538,T,TTA,.,.,"ALLELEID=180144;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090|MedGen:C0027672,SNOMED_CT:699346009;CLNDN=Hereditary_nonpolyposis_colon_cancer|Hereditary_cancer-predisposing_syndrome;CLNHGVS=NC_000003.12:g.37025629_37025630insTA;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign/Likely_benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87680,3,37025629,215443,T,TTTA,.,.,"ALLELEID=212308;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090;CLNDN=Hereditary_nonpolyposis_colon_cancer;CLNHGVS=NC_000003.12:g.37025629_37025630insTTA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87681,3,37025629,215444,T,TTTTA,.,.,"ALLELEID=212309;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090;CLNDN=Hereditary_nonpolyposis_colon_cancer;CLNHGVS=NC_000003.12:g.37025629_37025630insTTTA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87682,3,37025629,215445,T,TTTTTA,.,.,"ALLELEID=212310;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090;CLNDN=Hereditary_nonpolyposis_colon_cancer;CLNHGVS=NC_000003.12:g.37025629_37025630insTTTTA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87683,3,37025629,237302,T,TTTTTAA,.,.,"ALLELEID=239163;CLNDISDB=MedGen:C1333990,Orphanet:ORPHA144,SNOMED_CT:315058005;CLNDN=Lynch_syndrome;CLNHGVS=NC_000003.12:g.37025629_37025630insTTTTAA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87684,3,37025629,215446,T,TTTTTTA,.,.,"ALLELEID=212311;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090|MedGen:CN169374;CLNDN=Hereditary_nonpolyposis_colon_cancer|not_specified;CLNHGVS=NC_000003.12:g.37025629_37025630insTTTTTA;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign/Likely_benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87685,3,37025629,215447,T,TTTTTTTA,.,.,"ALLELEID=212312;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090;CLNDN=Hereditary_nonpolyposis_colon_cancer;CLNHGVS=NC_000003.12:g.37025629_37025630insTTTTTTA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"
87686,3,37025629,215448,T,TTTTTTTTA,.,.,"ALLELEID=212313;CLNDISDB=MedGen:C0009405,Orphanet:ORPHA443090;CLNDN=Hereditary_nonpolyposis_colon_cancer;CLNHGVS=NC_000003.12:g.37025629_37025630insTTTTTTTA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Insertion;CLNVCSO=SO:0000667;GENEINFO=MLH1:4292;MC=SO:0001627|intron_variant;ORIGIN=1;RS=535965616"


In [63]:
# run this for another POS value

df_lines[df_lines['POS'] == '73385903']

# interesting relationship between REF and ALT

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
47466,2,73385903,218713,T,TGGA,.,.,"AF_TGP=0.76597;ALLELEID=215261;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009|MedGen:CN169374;CLNDN=Alstrom_syndrome|not_specified;CLNHGVS=NC_000002.12:g.73385940_73385942dupGGA;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Benign(5)%3BLikely_benign(1)%3BUncertain_significance(5);CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:497990|Illumina_Clinical_Services_Laboratory,Illumina:652996|Illumina_Clinical_Services_Laboratory,Illumina:686207|Illumina_Clinical_Services_Laboratory,Illumina:696234|Illumina_Clinical_Services_Laboratory,Illumina:834309;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47467,2,73385903,193379,T,TGGAGGA,.,.,"AF_TGP=0.76597;ALLELEID=190543;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009|MedGen:CN169374|MedGen:CN517202;CLNDN=Alstrom_syndrome|not_specified|not_provided;CLNHGVS=NC_000002.12:g.73385937_73385942dupGGAGGA;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Benign(3)%3BLikely_benign(2)%3BUncertain_significance(1);CLNVC=Duplication;CLNVCSO=SO:1000035;GENEINFO=ALMS1:7840;ORIGIN=5;RS=55889738"
47468,2,73385903,412658,T,TGGAGGAGGA,.,.,"AF_TGP=0.76597;ALLELEID=393066;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009;CLNDN=Alstrom_syndrome;CLNHGVS=NC_000002.12:g.73385934_73385942dupGGAGGAGGA;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Likely_benign(1)%3BUncertain_significance(1);CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:721510;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47469,2,73385903,220621,T,TGGAGGAGGAGGA,.,.,"AF_TGP=0.76597;ALLELEID=221323;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009;CLNDN=Alstrom_syndrome;CLNHGVS=NC_000002.12:g.73385931_73385942dupGGAGGAGGAGGA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Duplication;CLNVCSO=SO:1000035;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47470,2,73385903,459877,T,TGGAGGAGGAGGAGGA,.,.,"AF_TGP=0.76597;ALLELEID=451832;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009|MedGen:CN169374;CLNDN=Alstrom_syndrome|not_specified;CLNHGVS=NC_000002.12:g.73385928_73385942dup;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Likely_benign(1)%3BUncertain_significance(1);CLNVC=Duplication;CLNVCSO=SO:1000035;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47471,2,73385903,241005,T,TGGAGGAGGAGGAGGAGGA,.,.,"AF_TGP=0.76597;ALLELEID=238973;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009;CLNDN=Alstrom_syndrome;CLNHGVS=NC_000002.12:g.73385925_73385942dup;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Duplication;CLNVCSO=SO:1000035;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47472,2,73385903,193377,TGGA,T,.,.,"AF_TGP=0.76597;ALLELEID=190541;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009|MedGen:CN169374|MedGen:CN517202;CLNDN=Alstrom_syndrome|not_specified|not_provided;CLNHGVS=NC_000002.12:g.73385940_73385942delGGA;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign/Likely_benign;CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47473,2,73385903,193378,TGGAGGA,T,.,.,"AF_TGP=0.76597;ALLELEID=190542;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009|MedGen:CN169374;CLNDN=Alstrom_syndrome|not_specified;CLNHGVS=NC_000002.12:g.73385937_73385942delGGAGGA;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign/Likely_benign;CLNVC=Deletion;CLNVCSO=SO:0000159;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:497988;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47474,2,73385903,221075,TGGAGGAGGA,T,.,.,"AF_TGP=0.76597;ALLELEID=221322;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009;CLNDN=Alstrom_syndrome;CLNHGVS=NC_000002.12:g.73385934_73385942delGGAGGAGGA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
47475,2,73385903,403930,TGGAGGAGGAGGA,T,.,.,"AF_TGP=0.76597;ALLELEID=393438;CLNDISDB=MedGen:C0268425,OMIM:203800,Orphanet:ORPHA64,SNOMED_CT:63702009|MedGen:CN169374;CLNDN=Alstrom_syndrome|not_specified;CLNHGVS=NC_000002.12:g.73385931_73385942delGGAGGAGGAGGA;CLNREVSTAT=criteria_provided,_conflicting_interpretations;CLNSIG=Conflicting_interpretations_of_pathogenicity;CLNSIGCONF=Likely_benign(2)%3BUncertain_significance(1);CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=ALMS1:7840;ORIGIN=1;RS=55889738"
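The REF/ALT pattern in these rows (a shared leading `T`, with extra bases on one side) is the padding-base convention from the spec review above. A rough classifier based only on string lengths is sketched below. `classify_variant` is a hypothetical helper, and its labels are coarser than ClinVar's own `CLNVC` field, which distinguishes duplications from other insertions.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Coarse label from REF/ALT strings alone (assumes a single ALT allele)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"      # ALT = REF plus inserted bases
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"       # REF = ALT plus deleted bases
    return "complex"

# REF/ALT pairs taken from the rows shown above.
for ref, alt in [("T", "TGGA"), ("TGGA", "T"), ("T", "A")]:
    print(ref, alt, classify_variant(ref, alt))
```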


In [64]:
# check another
df_lines[df_lines['POS'] == '149172318']

Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
101033,3,149172318,343716,T,A,.,.,"ALLELEID=293218;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172318T>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:614092;GENEINFO=CP:1356|HPS3:84343;MC=SO:0001624|3_prime_UTR_variant;ORIGIN=1;RS=879086473"
101034,3,149172318,343717,T,TATCACA,.,.,"ALLELEID=292941;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172318_149172319insATCACA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Insertion;CLNVCSO=SO:0000667;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:632553;GENEINFO=CP:1356|HPS3:84343;MC=SO:0001624|3_prime_UTR_variant;ORIGIN=1;RS=374839757"
101035,3,149172318,343718,T,TCA,.,.,"ALLELEID=292950;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172355_149172356dupCA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:564127;GENEINFO=CP:1356|HPS3:84343;ORIGIN=1;RS=113015797"
101036,3,149172318,343719,T,TCACA,.,.,"ALLELEID=289088;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172353_149172356dupCACA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:562086;GENEINFO=CP:1356|HPS3:84343;ORIGIN=1;RS=113015797"
101037,3,149172318,343720,T,TCACACA,.,.,"ALLELEID=289848;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172351_149172356dupCACACA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:604269;GENEINFO=CP:1356|HPS3:84343;ORIGIN=1;RS=113015797"
101038,3,149172318,343721,T,TCACACACA,.,.,"ALLELEID=292947;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172349_149172356dupCACACACA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:747386;GENEINFO=CP:1356|HPS3:84343;ORIGIN=1;RS=113015797"
101039,3,149172318,343722,T,TCACACACACACA,.,.,"ALLELEID=293229;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172345_149172356dupCACACACACACA;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Duplication;CLNVCSO=SO:1000035;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:679019;GENEINFO=CP:1356|HPS3:84343;ORIGIN=1;RS=113015797"
101040,3,149172318,343723,T,TCTCACA,.,.,"ALLELEID=293224;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172319_149172320insTCACAC;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Insertion;CLNVCSO=SO:0000667;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:578518;GENEINFO=CP:1356|HPS3:84343;MC=SO:0001624|3_prime_UTR_variant;ORIGIN=1;RS=72453449"
101041,3,149172318,343724,T,TCTCACACA,.,.,"ALLELEID=289085;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172319_149172320insTCACACAC;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Insertion;CLNVCSO=SO:0000667;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:629305;GENEINFO=CP:1356|HPS3:84343;MC=SO:0001624|3_prime_UTR_variant;ORIGIN=1;RS=72453449"
101042,3,149172318,343725,T,TCTCACACACA,.,.,"ALLELEID=293225;CLNDISDB=MedGen:C0079504,Orphanet:ORPHA79430,SNOMED_CT:9311003;CLNDN=Hermansky-Pudlak_syndrome;CLNHGVS=NC_000003.12:g.149172319_149172320insTCACACACAC;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=Insertion;CLNVCSO=SO:0000667;CLNVI=Illumina_Clinical_Services_Laboratory,Illumina:669257;GENEINFO=CP:1356|HPS3:84343;MC=SO:0001624|3_prime_UTR_variant;ORIGIN=1;RS=72453449"


---

In [65]:
# Convert CHROM from int64 to string (pandas reports string columns as dtype object)

df_lines['CHROM'] = df_lines['CHROM'].astype('str') 

In [66]:
# df['CHROM2'] = df['CHROM'].astype('str') 

In [67]:
df_lines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 8 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   CHROM   102321 non-null  object
 1   POS     102321 non-null  object
 2   ID      102321 non-null  int64 
 3   REF     102321 non-null  object
 4   ALT     102321 non-null  object
 5   FILTER  102321 non-null  object
 6   QUAL    102321 non-null  object
 7   INFO    102321 non-null  object
dtypes: int64(1), object(7)
memory usage: 6.2+ MB
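`POS` is also reported as `object`, which is why the filters above compare against strings such as `'37025629'`. If numeric positions are needed (sorting, range queries), `pd.to_numeric` with `errors='coerce'` converts what it can and marks the rest as `NaN` for inspection. This is a sketch on hypothetical data; it assumes that a few non-numeric entries from the instructor's file modifications are what force the column to `object`, which has not been verified here.

```python
import pandas as pd

# Hypothetical frame: one malformed position among valid ones.
demo = pd.DataFrame({
    "CHROM": [1, 2, 3],
    "POS": ["37025629", "73385903", "not_a_number"],
})

demo["CHROM"] = demo["CHROM"].astype(str)                      # same cast as above
demo["POS_num"] = pd.to_numeric(demo["POS"], errors="coerce")  # invalid entries become NaN

# Rows whose POS could not be parsed deserve a manual look before any filtering.
bad_rows = demo[demo["POS_num"].isna()]
print(bad_rows)
```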
