In [1]:
import io
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

print(sys.version)
print(pd.__version__)
print(np.__version__)

3.9.13 (main, Aug 25 2022, 18:29:29) 
[Clang 12.0.0 ]
1.4.4
1.21.5


---

## 4: Clinical disease data (40 pts)

Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Look at the file and tell me what gene and mutation combinations are classified as dangerous.”

Make sure that you only give your boss the dangerous mutations and include:

1) Gene name

2) Mutation ID number

3) Mutation Position (chromosome & position)

4) Mutation value (reference & alternate bases)

5) Clinical significance (CLNSIG)

6) Disease that is implicated

**Requirements**

1) The deliverable is the final result as a dataframe, with a short discussion of any specifics (that is, the data you would present to your boss, with an explanation of your results).

2) Limit your output to the first 100 harmful mutations and tell your boss how many total harmful mutations were found in the file

3) Use the instructor-modified "clinvar_final.txt" at this link: https://drive.google.com/file/d/1Zps0YssoJbZHrn6iLte2RDLlgruhAX1s/view?usp=sharing This file was modified so that it is not exactly the same as a 'standard' .vcf file, to test your data parsing skills. **This is a large file, so do NOT upload it to your GitHub repo!**

4) Replace missing values in the dataframe with: 'Not_Given'. Print or display this (including the Not_Given count) for the column `CLNSIG` by using pandas value_counts() function (https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html).

5) State in your answer how you define harmful mutations

**6) Do your best to meet the above requirements and submit whatever you have before the deadline. If your work is incomplete, be sure to describe the blockers that got in your way and how you might get past them (if given more time).**

7) You can use as many code blocks as you need. Please clean up your code and make it readable for the graders!
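
Requirement 4 can be sketched on a toy frame. This is only an illustration: the `toy` dataframe, its `GENE` column, and its values are hypothetical stand-ins for whatever the real parse of `clinvar_final.txt` produces. It assumes that the VCF missing-value dot (`.`) should be treated the same as an empty cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real one would come from parsing clinvar_final.txt.
toy = pd.DataFrame({
    "GENE": ["ISG15", "ISG15", "AGRN", "AGG", "AGG"],
    "CLNSIG": ["Pathogenic", np.nan, "Benign", None, "."],
})

# Assumption: a lone '.' means "missing", as in the VCF description.
toy = toy.replace(".", np.nan)

# Replace every missing value in the frame with the required placeholder.
toy = toy.fillna("Not_Given")

# value_counts() now reports Not_Given as its own category.
clnsig_counts = toy["CLNSIG"].value_counts()
print(clnsig_counts)
```

Filling before counting matters here, because `value_counts()` drops missing values by default and the `Not_Given` count would otherwise not appear.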

**Hints** 
* We do not expect you to have any medical knowledge to solve this problem; look at the data, read the documentation provided, and write down your assumptions!

* Correct pseudocode will give you partial credit so start with that. 

* Map out which fields you want to extract: Are they in the same place every time? What strategy will you use to robustly extract and filter your data of interest? How do you plan to handle missing data?

* A good way to start is to print out each line, then practice parsing them to see if you can recover the fields of interest

* A starting solution for parsing .vcfs can be found here: https://gist.github.com/dceoy/99d976a2c01e7f0ba1c813778f9db744 This solution does **NOT** work due to the changes we've made but can be modified to work. As with any solution that needs modifications, it may take less time to make your own solution!

* Filter out junk and lines with no mutation data. Just focus on the data you need to deliver to your boss.

* Pandas and NumPy parsers correctly recognize the end of each line in the ClinVar file.

* The unit of observation of this dataset is one row per mutation.

* This is similar to a task that one of us tackled at work. You can answer the question with the information provided below or using the (partial) data dictionary file at this link: https://drive.google.com/file/d/1lx9yHdlcqmU_OlHiTUXKC_LQDqYBypH_/view?usp=sharing. Our goal is to see that you can put together a sensible plan, implement a solid parsing strategy, and document and justify the decisions that you made.
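
One way to act on the "filter and document your assumptions" hints is to write the definition of harmful down as an explicit boolean mask. The sketch below is a guess at what that could look like, not the expected answer: the `records` frame and its labels are hypothetical, and it assumes `CLNSIG` holds text labels. If the file uses numeric codes instead, the mask would need to test for those codes.

```python
import pandas as pd

# Hypothetical parsed records; the CLNSIG labels are illustrative assumptions.
records = pd.DataFrame({
    "GENE": ["ISG15", "AGRN", "AGG", "AGG", "TTN"],
    "ID": [1, 2, 3, 4, 5],
    "CLNSIG": ["Pathogenic", "Benign", "Likely_pathogenic",
               "Conflicting_interpretations_of_pathogenicity", "Not_Given"],
})

# Assumed definition: the label mentions "pathogenic" but is not a conflicting call.
sig = records["CLNSIG"].str.lower()
is_harmful = sig.str.contains("pathogenic") & ~sig.str.contains("conflicting")

harmful = records[is_harmful]
total_harmful = len(harmful)    # total count to report
first_100 = harmful.head(100)   # deliverable is capped at the first 100 rows

print(total_harmful)
print(first_100)
```

Keeping the rule in one named mask makes it easy to state in the write-up and to change if the definition of harmful is revised.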

### VCF file description (Summarized from version 4.1)

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also can contain genotype information on samples for each position.

* Fixed fields:

There are 8 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position, as a DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. REF - reference base(s)
5. ALT - alternate base(s)
6. FILTER - filter status
7. QUAL - quality
8. INFO - a semicolon-separated series of keys with values in the format: <key>=<data>

```
### Applicable INFO field specifications

```
GENEINFO = <Gene name>
CLNSIG =  <Clinical significance>
CLNDN = <Disease name>
```
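
Because INFO keys are not guaranteed to appear in the same place, or at all, one robust option is to split each INFO string into a dictionary and look keys up by name. The helper names `parse_info` and `extract_wanted` are hypothetical, and the `Not_Given` default simply mirrors requirement 4.

```python
WANTED = ("GENEINFO", "CLNSIG", "CLNDN")

def parse_info(info):
    """Split a semicolon-separated INFO string into a {key: value} dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep and key:  # skip empty chunks and entries without '='
            fields[key] = value.strip()
    return fields

def extract_wanted(info):
    """Return only the keys of interest, defaulting to 'Not_Given'."""
    parsed = parse_info(info)
    return {key: parsed.get(key, "Not_Given") for key in WANTED}

# Two INFO strings in the style of the sample ClinVar data.
example = extract_wanted("GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer")
missing = extract_wanted("GENEINFO=ISG15;CLNDBN=Cancer")
print(example)
print(missing)
```

Stripping whitespace around each chunk handles stray spaces after a semicolon, and an exact key lookup keeps a key such as `CLNDBN` from being mistaken for `CLNDN`.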

### Sample ClinVar data (vcf file format - not exactly the same as the file to download!)

```
##fileformat=VCFv4.1
##fileDate=2019-03-19
##source=ClinVar
##reference=GRCh38							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDN=Heart_dis 
```
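
As a rough sketch of how the sample could be loaded and its INFO keys pulled into columns, the snippet below reads three of the sample records from a string and uses `Series.str.extract` with one pattern per key. It is only tested against this sample layout; the modified assignment file differs and would need its own checks.

```python
import io
import pandas as pd

sample = """##fileformat=VCFv4.1
##source=ClinVar
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t949523\trs786201005\tC\tT\t.\t.\tGENEINFO=ISG15;CLNSIG=5
1\t949739\trs672601312\tG\tT\t.\t.\tGENEINFO=ISG15;CLNDBN=Cancer
1\t955597\trs115173026\tG\tT\t.\t.\tGENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer
"""

# Keep the header and data lines; drop the '##' meta-information lines.
body = "".join(
    line for line in sample.splitlines(keepends=True) if not line.startswith("##")
)
vcf = pd.read_csv(io.StringIO(body), sep="\t", dtype=str).rename(
    columns={"#CHROM": "CHROM"}
)

keys = ["GENEINFO", "CLNSIG", "CLNDN"]
for key in keys:
    # Anchor on start-of-string or ';' so a key never matches inside a longer key.
    vcf[key] = vcf["INFO"].str.extract(rf"(?:^|;)\s*{key}=([^;]*)", expand=False)

# Keys absent from a record come back as NaN; label them explicitly.
vcf[keys] = vcf[keys].fillna("Not_Given")
print(vcf[["CHROM", "POS", "ID"] + keys])
```

Reading every column as `str` avoids type-inference surprises while the data is still being inspected; numeric columns can be converted once they are known to be clean.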

In [2]:
# 4) Your code here - can use as many code blocks as you would like

4) Please write your assumptions here:

4) Findings / What would you present to your boss?

----

In [3]:
def read_vcf(path):
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

In [21]:
df = pd.read_table('../clinvar_final.txt', sep='\t', skiprows=27, header=0)
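
A hardcoded `skiprows=27` breaks if the number of preamble lines changes. A possible alternative, sketched here under assumptions, is to locate the header row by its first field. The function name `read_vcf_like` is hypothetical, the "junk" line in the demo is invented for illustration, and the sketch assumes the header's first tab-separated field is `CHROM` or `#CHROM`.

```python
import io
import pandas as pd

def read_vcf_like(handle):
    """Read a VCF-like text stream, locating the header row by content."""
    lines = handle.readlines()
    # Assumption: the header is the first line whose first field is CHROM or #CHROM.
    header_idx = next(
        i for i, line in enumerate(lines)
        if line.split("\t")[0].strip().lstrip("#") == "CHROM"
    )
    text = "".join(lines[header_idx:])
    return pd.read_csv(io.StringIO(text), sep="\t", dtype=str)

# Tiny in-memory demo; the data row mirrors one record shown in df.head().
demo = io.StringIO(
    "##source=ClinVar\n"
    "junk line before the header\n"
    "CHROM\tPOS\tID\tREF\tALT\tFILTER\tQUAL\tINFO\n"
    "1\t1014143\t183381\tC\tT\t.\t.\tALLELEID=181485\n"
)
demo_df = read_vcf_like(demo)
print(demo_df)
```

For the real file, the same function could be called as `with open('../clinvar_final.txt') as f: df = read_vcf_like(f)`.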

In [22]:
df.head(5)

| | CHROM | POS | ID | REF | ALT | FILTER | QUAL | INFO |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1014O42 | 475283 | G | A | . | . | AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;... |
| 1 | 1 | 1O14122 | 542074 | C | T | . | . | AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926... |
| 2 | 1 | 1014143 | 183381 | C | T | . | . | ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:... |
| 3 | 1 | 1014179 | 542075 | C | T | . | . | ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:... |
| 4 | 1 | 1014217 | 475278 | C | T | . | . | AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;... |


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 8 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   CHROM   102321 non-null  int64 
 1   POS     102321 non-null  object
 2   ID      102321 non-null  int64 
 3   REF     102321 non-null  object
 4   ALT     102321 non-null  object
 5   FILTER  102321 non-null  object
 6   QUAL    102321 non-null  object
 7   INFO    102321 non-null  object
dtypes: int64(2), object(6)
memory usage: 6.2+ MB


In [27]:
df['POS'].unique()

array(['1014O42', '1O14122', '1014143', ..., '179210515', '179210516',
       '179210538'], dtype=object)
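
The `object` dtype of `POS` is explained by values such as `1014O42`, which contain the letter `O` where a digit is expected. A minimal sketch of detecting and repairing these follows. It assumes that a capital `O` is a stand-in for zero; that assumption should be stated in the write-up and checked against the full column.

```python
import pandas as pd

# Three POS values copied from the df['POS'].unique() output; two contain the letter 'O'.
pos = pd.Series(["1014O42", "1O14122", "1014143"])

# Flag entries that are not made up purely of digits.
bad = ~pos.str.fullmatch(r"\d+")
print(pos[bad].tolist())

# Assumption: 'O' is a typo for '0'; anything still non-numeric becomes <NA>.
pos_fixed = pd.to_numeric(
    pos.str.replace("O", "0", regex=False), errors="coerce"
).astype("Int64")
print(pos_fixed.tolist())
```

Using `errors="coerce"` with the nullable `Int64` dtype means any other kind of corruption shows up as a missing value rather than raising partway through the column.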

In [32]:
pd.set_option("display.max_rows", None)

In [34]:
df['POS'].value_counts().sample(100)

178572986    1
81705620     1
239134438    1
119415387    1
39056718     1
216198366    1
73424847     1
74368830     1
120989380    1
42930724     2
27322487     1
166045072    1
237380912    1
37025874     1
73600746     1
3405543      1
202565727    1
219055644    1
165314010    1
43404420     1
189064580    2
62489068     1
178558419    1
2306689      1
102888543    1
13875181     1
121530919    1
47806640     1
178565431    1
149207611    1
38550890     1
45331761     1
21007058     1
61031834     1
141559949    1
127654492    1
178575154    1
188991049    1
52407375     1
63492941     1
147908442    1
10142006     1
3425566      1
237454544    1
218263064    1
37025884     1
71570214     1
188990317    2
193249726    1
168968486    1
162276924    1
186119831    1
197442274    1
26773661     1
68446845     1
158691473    1
44832154     1
45508329     1
96254056     1
12609236     1
219418935    1
233767922    1
178559608    1
189004248    1
197124103    1
25300227     1
201077934 

In [20]:
# Convert CHROM from int64 to string
df['CHROM'] = df['CHROM'].astype(str)

In [18]:
# df['CHROM2'] = df['CHROM'].astype('str') 

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 9 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   CHROM   102321 non-null  object
 1   POS     102321 non-null  object
 2   ID      102321 non-null  int64 
 3   REF     102321 non-null  object
 4   ALT     102321 non-null  object
 5   FILTER  102321 non-null  object
 6   QUAL    102321 non-null  object
 7   INFO    102321 non-null  object
 8   CHROM2  102321 non-null  object
dtypes: int64(1), object(8)
memory usage: 7.0+ MB


In [8]:
data.describe()

| | CHROM | ID |
|---|---|---|
| count | 102321.0 | 102321.0 |
| mean | 1.88 | 340282.99 |
| std | 0.71 | 163372.0 |
| min | 1.0 | 20.0 |
| 25% | 1.0 | 216958.0 |
| 50% | 2.0 | 342510.0 |
| 75% | 2.0 | 479701.0 |
| max | 3.0 | 620635.0 |


In [11]:
data['POS'].head()

0    1014O42
1    1O14122
2    1014143
3    1014179
4    1014217
Name: POS, dtype: object

In [12]:
# A Series has no `unique_values` method; `unique()` returns the distinct values.
data['CHROM'].unique()