# Clinical Disease Data EDA

### Business Case:
Your boss comes to you Monday morning and says “I figured out our next step; we are going to pivot from an online craft store and become a data center for genetic disease information! I found **ClinVar** which is a repository that contains expert curated data, and it is free for the taking. This is a gold mine! Look at the file and tell me what gene and mutation combinations are classified as dangerous.”

Make sure that you only give your boss the dangerous mutations and include:

1) Gene name

2) Mutation ID number

3) Mutation Position (chromosome & position)

4) Mutation value (reference & alternate bases)

5) Clinical significance (CLNSIG)

6) Disease that is implicated

As a final deliverable, you're planning to provide a dataframe with a short discussion of any specifics you plan to present to your boss (alongside the explanation of your results). You're also planning to limit your output to the first 100 harmful mutations and tell your boss how many total harmful mutations were found in the file.

**Dataset: clinvar_final.txt file** (not disclosed here for privacy purposes)

*Note: Missing values should be replaced with 'Not_Given'.*

The unit of observation in this dataset is one row per mutation.

A partial data dictionary for the dataset is available here: https://drive.google.com/file/d/1lx9yHdlcqmU_OlHiTUXKC_LQDqYBypH_/view?usp=sharing.

### VCF File Description (Summarized from Version 4.1):

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also can contain genotype information on samples for each position.

* Fixed fields:

There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position as a DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. REF - reference base(s)
5. ALT - alternate base(s)
6. FILTER - filter status
7. QUAL - quality
8. INFO - a semicolon-separated series of keys with values in the format: <key>=<data>

```
### Applicable INFO Field Specifications:

```
GENEINFO = <Gene name>
CLNSIG =  <Clinical significance>
CLNDN = <Disease name>
```

### Sample ClinVar Data (VCF File Format):

```
##fileformat=VCFv4.1
##fileDate=2019-03-19
##source=ClinVar
##reference=GRCh38							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDN=Heart_dis 
```
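
To make the `<key>=<data>` layout of the INFO field concrete, here is a minimal sketch that splits one INFO string from the sample above into a dictionary. The helper name `parse_info` is hypothetical, and the handling of stray spaces, empty segments, and keys without a value is an assumption based on the sample rows rather than something the summary above specifies.

```python
def parse_info(info: str) -> dict:
    """Split a semicolon-separated INFO string into a {key: value} dict."""
    fields = {}
    for item in info.split(";"):
        item = item.strip()  # the sample rows contain stray spaces after ';'
        if not item:
            continue  # skip empty segments, e.g. from a trailing ';'
        key, sep, value = item.partition("=")
        # Assumption: a key with no '=' is a bare flag, stored as True
        fields[key] = value if sep else True
    return fields


# One INFO string copied from the sample data above
record = parse_info("GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer")
print(record)
```

Every value comes back as a string, so numeric fields would still need an explicit conversion.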

In [12]:
# ==============================================================================================================
# STEP 1: INITIAL SETUP - LIBRARY AND DATA FILE IMPORTS
# ==============================================================================================================

# Import necessary libraries
import pandas as pd
import numpy as np
import json

# Set up pandas display options to display dataframes with 100 rows/columns maximum
pd.set_option("display.max_rows", 100, "display.max_columns", 100)
pd.options.display.float_format = "{:,.2f}".format

# Read vcf file (.txt format)
# All data lines are tab-delimited, based on dataset documentation
# Skip first 27 rows in dataset to eliminate meta-information lines
clinical_df = pd.read_table("clinvar_final.txt", delimiter = "\t", skiprows = 27)

# Display initial dataframe
clinical_df


Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
0,1,1014O42,475283,G,A,.,.,AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;...
1,1,1O14122,542074,C,T,.,.,AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926...
2,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:..."
3,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:..."
4,1,1014217,475278,C,T,.,.,AF_ESP=0.00515;AF_EXAC=0.00831;AF_TGP=0.00339;...
...,...,...,...,...,...,...,...,...
102316,3,179210507,403908,A,G,.,.,"ALLELEID=393412;CLNDISDB=MedGen:C0018553,Orpha..."
102317,3,179210511,526648,T,C,.,.,"ALLELEID=519163;CLNDISDB=MedGen:C0018553,Orpha..."
102318,3,179210515,526640,A,C,.,.,AF_EXAC=0.00002;ALLELEID=519178;CLNDISDB=MedGe...
102319,3,179210516,246681,A,G,.,.,AF_EXAC=0.00001;ALLELEID=245287;CLNDISDB=MedGe...


#### Step 1 Notes/Observations:
- vcf file/dataset imported and loaded correctly, with clean column names
- Immediately apparent that this file is large based on its shape (102321 rows x 8 columns)
- The INFO column appears to have several additional data fields embedded within it, which makes sense based on the documentation - will require some additional work to extract these fields
- Looks like the data is a mix of numeric/integer and string data
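
The hard-coded `skiprows = 27` works for this particular file, but the number of meta-information lines can differ between VCF files. As a hedged alternative, the sketch below counts the leading `##` lines before reading the data. It runs on a small in-memory sample rather than the real file, and the helper name `count_meta_lines` is hypothetical.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a VCF file (based on the sample data above)
sample_vcf = (
    "##fileformat=VCFv4.1\n"
    "##source=ClinVar\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t949523\trs786201005\tC\tT\t.\t.\tGENEINFO=ISG15;CLNSIG=5\n"
)


def count_meta_lines(handle) -> int:
    """Return the number of leading '##' meta-information lines."""
    n = 0
    for line in handle:
        if not line.startswith("##"):
            break
        n += 1
    return n


n_meta = count_meta_lines(io.StringIO(sample_vcf))
demo_df = pd.read_csv(io.StringIO(sample_vcf), sep="\t", skiprows=n_meta)

# Only needed when the header line starts with '#', as in the sample above
demo_df = demo_df.rename(columns={"#CHROM": "CHROM"})
print(n_meta, list(demo_df.columns))
```

With the real file, the same helper could be called on `open("clinvar_final.txt")` and its result passed to `skiprows`.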

In [13]:
# ==============================================================================================================
# STEP 2a: CLEAN AND MERGE DATA AS NECESSARY TO PREP FOR EDA
# ==============================================================================================================

# Get basic info and Dtypes for dataframe, look for null values, get original shape to check dimensions, etc.
print("Dataframe info:")
print(clinical_df.info(), end = "\n\n")
print("Dataframe shape:", clinical_df.shape)


Dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 8 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   CHROM   102321 non-null  int64 
 1   POS     102321 non-null  object
 2   ID      102321 non-null  int64 
 3   REF     102321 non-null  object
 4   ALT     102321 non-null  object
 5   FILTER  102321 non-null  object
 6   QUAL    102321 non-null  object
 7   INFO    102321 non-null  object
dtypes: int64(2), object(6)
memory usage: 6.2+ MB
None

Dataframe shape: (102321, 8)


#### Step 2a Notes/Observations:
- No null values - makes sense and is expected given the documentation above, i.e., "In all cases, missing values are specified with a dot (‘.’)"
- All expected columns are present based on the documentation above
- CHROM column Dtype looks okay based on the documentation
- POS column should have Dtype _int_ based on the documentation, but is currently Dtype _object_ - will need to investigate further
- INFO column contains a large number of sub-components within it and is semicolon-delimited based on dataset documentation - will require an additional parsing step
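
One way to investigate why POS is stored as _object_ is to coerce it to numeric and look at the values that fail. The sketch below uses a few POS values copied from the Step 1 output; `demo_df` is a stand-in for `clinical_df`, not the real data.

```python
import pandas as pd

# POS values copied from the first rows of the Step 1 output
demo_df = pd.DataFrame({"POS": ["1014O42", "1O14122", "1014143", "1014179"]})

# Coerce to numeric: anything that is not a valid number becomes NaN
pos_numeric = pd.to_numeric(demo_df["POS"], errors="coerce")
bad_pos = demo_df.loc[pos_numeric.isna(), "POS"]
print(bad_pos.tolist())

# Collect the non-digit characters that appear in the offending values
bad_chars = sorted(set("".join(bad_pos)) - set("0123456789"))
print(bad_chars)
```

Listing the offending characters first makes it easier to decide whether a simple character substitution is a safe fix.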

In [14]:
# ==============================================================================================================
# STEP 2b: CLEAN AND MERGE DATA AS NECESSARY TO PREP FOR EDA
# ==============================================================================================================

# Define list of all sub-components within INFO column based on dataset documentation/meta-information
# THIS LIST IS STRICTLY USED FOR REFERENCE
info_scs_reflist = [
    "AF_ESP",
    "AF_EXAC",
    "AF_TGP",
    "ALLELEID",
    "CLNDN",
    "CLNDNINCL",
    "CLNDISDB",
    "CLNDISDBINCL",
    "CLNHGVS",
    "CLNREVSTAT",
    "CLNSIG",
    "CLNSIGCONF",
    "CLNSIGINCL",
    "CLNVC",
    "CLNVCSO",
    "CLNVI",
    "DBVARID",
    "GENEINFO",
    "MC",
    "ORIGIN",
    "RS",
    "SSR"
]

# Separate INFO column into sub-components based on consistent delimiters/characters (e.g., ";", "=", etc.)
# Transform INFO sub-components into key-value string representations in a dictionary - necessary for JSON prep
# pd.json_normalize method takes INFO dictionary and turns it into a flat table, with sub-components as columns
# Inspiration for parsing method (referencing here for transparency): https://www.biostars.org/p/428170/
info_subs = '{"' + clinical_df["INFO"].str.split(';').str.join('","').str.replace('=','":"').str.replace("\"\",", "") + '"}' 
# json.loads parses each string as JSON data without executing it as Python code
clinical_info_df = pd.json_normalize(info_subs.apply(json.loads))

# Display new dataframe containing sub-components of INFO column split out into individual columns
clinical_info_df


Unnamed: 0,AF_ESP,AF_EXAC,AF_TGP,ALLELEID,CLNDISDB,CLNDN,CLNHGVS,CLNREVSTAT,CLNSIG,CLNVC,CLNVCSO,GENEINFO,MC,ORIGIN,RS,CLNVI,CLNSIGCONF,CLNDISDBINCL,CLNDNINCL,CLNSIGINCL,DBVARID
0,0.00546,0.00165,0.00619,446939,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014042G>A,"criteria_provided,_single_submitter",Benign,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001583|missense_variant,1,143888043,,,,,,
1,0.00015,0.00010,,514926,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014122C>T,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001583|missense_variant,1,150861311,,,,,,
2,,,,181485,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014143C>T,no_assertion_criteria_provided,Pathogenic,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001587|nonsense,1,786201005,OMIM_Allelic_Variant:147571.0003,,,,,
3,,,,514896,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014179C>T,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001583|missense_variant,1,1553169766,,,,,,
4,0.00515,0.00831,0.00339,446987,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014217C>T,"criteria_provided,_single_submitter",Benign,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001819|synonymous_variant,1,61766284,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102316,,,,393412,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome,NC_000003.12:g.179210507A>G,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001583|missense_variant,1,1060500027,,,,,,
102317,,,,519163,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome,NC_000003.12:g.179210511T>C,"criteria_provided,_single_submitter",Likely_benign,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001819|synonymous_variant,1,1480813252,,,,,,
102318,,0.00002,,519178,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome|Hereditary_cancer-predisposing...,NC_000003.12:g.179210515A>C,"criteria_provided,_multiple_submitters,_no_con...",Uncertain_significance,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001583|missense_variant,1,199563773,,,,,,
102319,,0.00001,,245287,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome,NC_000003.12:g.179210516A>G,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001583|missense_variant,1,753879573,,,,,,


#### Step 2b Notes/Observations:
- Parsing method worked: the INFO column was split into its various sub-components, with each one stored as a new column
- Only the "SSR" sub-component field is missing from the parsed columns, which suggests that no record in the file carries an SSR entry
- NaN values are apparent throughout new clinical_info_df - these values will need to be replaced at a later step so that missing data is marked consistently across the whole dataframe
- The new clinical_info_df should now be merged with/joined to the original clinical_df to bring all columns into one combined dataframe
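
The one-liner in step 2b builds a dictionary literal through string substitution. As an alternative sketch (not the method used in this notebook), each INFO string can be split directly into key/value pairs, which avoids constructing and evaluating an intermediate string. The two INFO strings below are abbreviated from the output above, and `info_to_dict` is a hypothetical helper name.

```python
import pandas as pd

# Abbreviated INFO strings based on rows 0 and 2 of the output above
demo_info = pd.Series([
    "AF_ESP=0.00546;AF_EXAC=0.00165;ALLELEID=446939;CLNSIG=Benign",
    "ALLELEID=181485;CLNSIG=Pathogenic;GENEINFO=ISG15:9636",
])


def info_to_dict(info: str) -> dict:
    """Turn 'K1=V1;K2=V2' into {'K1': 'V1', 'K2': 'V2'}."""
    return dict(
        item.split("=", 1)  # split on the first '=' only
        for item in info.split(";")
        if "=" in item  # ignore empty or malformed segments
    )


# A list of dicts becomes a flat table; keys absent from a row become NaN
demo_info_df = pd.DataFrame(
    demo_info.map(info_to_dict).tolist(), index=demo_info.index
)
print(demo_info_df)
```

Because the index of `demo_info` is carried over, the result lines up row-for-row with the original Series.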

In [15]:
# ==============================================================================================================
# STEP 2c: CLEAN AND MERGE DATA AS NECESSARY TO PREP FOR EDA
# ==============================================================================================================

# Merge original clinical_df and clinical_info_df together (inner join - 1:1 index match)
clinical_expanded_df = pd.merge(
    clinical_df,
    clinical_info_df,
    left_index = True,
    right_index = True
)

# Drop original INFO column now that all INFO sub-components are included in clinical_expanded_df
clinical_expanded_df.drop(["INFO"], axis = 1, inplace = True)

# Display new merged dataframe
clinical_expanded_df


Unnamed: 0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,AF_ESP,AF_EXAC,AF_TGP,ALLELEID,CLNDISDB,CLNDN,CLNHGVS,CLNREVSTAT,CLNSIG,CLNVC,CLNVCSO,GENEINFO,MC,ORIGIN,RS,CLNVI,CLNSIGCONF,CLNDISDBINCL,CLNDNINCL,CLNSIGINCL,DBVARID
0,1,1014O42,475283,G,A,.,.,0.00546,0.00165,0.00619,446939,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014042G>A,"criteria_provided,_single_submitter",Benign,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001583|missense_variant,1,143888043,,,,,,
1,1,1O14122,542074,C,T,.,.,0.00015,0.00010,,514926,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014122C>T,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001583|missense_variant,1,150861311,,,,,,
2,1,1014143,183381,C,T,.,.,,,,181485,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014143C>T,no_assertion_criteria_provided,Pathogenic,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001587|nonsense,1,786201005,OMIM_Allelic_Variant:147571.0003,,,,,
3,1,1014179,542075,C,T,.,.,,,,514896,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014179C>T,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001583|missense_variant,1,1553169766,,,,,,
4,1,1014217,475278,C,T,.,.,0.00515,0.00831,0.00339,446987,"MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563",Immunodeficiency_38_with_basal_ganglia_calcifi...,NC_000001.11:g.1014217C>T,"criteria_provided,_single_submitter",Benign,single_nucleotide_variant,SO:0001483,ISG15:9636,SO:0001819|synonymous_variant,1,61766284,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102316,3,179210507,403908,A,G,.,.,,,,393412,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome,NC_000003.12:g.179210507A>G,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001583|missense_variant,1,1060500027,,,,,,
102317,3,179210511,526648,T,C,.,.,,,,519163,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome,NC_000003.12:g.179210511T>C,"criteria_provided,_single_submitter",Likely_benign,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001819|synonymous_variant,1,1480813252,,,,,,
102318,3,179210515,526640,A,C,.,.,,0.00002,,519178,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome|Hereditary_cancer-predisposing...,NC_000003.12:g.179210515A>C,"criteria_provided,_multiple_submitters,_no_con...",Uncertain_significance,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001583|missense_variant,1,199563773,,,,,,
102319,3,179210516,246681,A,G,.,.,,0.00001,,245287,"MedGen:C0018553,Orphanet:ORPHA201,SNOMED_CT:58...",Cowden_syndrome,NC_000003.12:g.179210516A>G,"criteria_provided,_single_submitter",Uncertain_significance,single_nucleotide_variant,SO:0001483,PIK3CA:5290,SO:0001583|missense_variant,1,753879573,,,,,,


#### Step 2c Notes/Observations:
- Merging clinical_df and clinical_info_df worked exactly as expected and brought in all sub-components/sub-fields from the INFO column
- Original INFO column in new clinical_expanded_df was redundant after the merge, so it was dropped
- Now that we have a relatively clean and organized dataframe, this would be a good time to update all missing values ("." and NaN) with "Not_Given" - per the instructions above - and update all other improper characters in additional columns (e.g., letter characters in the POS column) as necessary. This will help get the dataset ready for additional column-specific deep-dives and analysis (e.g., value_counts/distributions of values, etc.)
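
To back up the claim that the merge is a 1:1 index match, `pd.merge` can be asked to check this itself via its `validate` argument. The sketch below uses small stand-in frames (`left` and `right` are hypothetical names, with values borrowed from the output above).

```python
import pandas as pd

# Stand-ins for clinical_df and clinical_info_df (same default RangeIndex)
left = pd.DataFrame({"CHROM": [1, 1, 3], "ID": [475283, 542074, 246681]})
right = pd.DataFrame(
    {"CLNSIG": ["Benign", "Uncertain_significance", "Uncertain_significance"]}
)

# validate="one_to_one" raises MergeError if either index has duplicate keys
merged = pd.merge(
    left,
    right,
    left_index=True,
    right_index=True,
    how="inner",
    validate="one_to_one",
)

# An inner join silently drops unmatched rows, so compare row counts too
assert len(merged) == len(left) == len(right)
print(merged.shape)
```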

In [16]:
# ==============================================================================================================
# STEP 2d: CLEAN AND MERGE DATA AS NECESSARY TO PREP FOR EDA
# ==============================================================================================================

# Replace all NaN values and standard missing values (".") in dataframe with "Not_Given"
clinical_expanded_df.replace(
    {
        np.nan: "Not_Given",
        ".": "Not_Given"
    },
    inplace = True
)

# Replace "O" characters in "POS" column with "0", since column is supposed to only include integer values
clinical_expanded_df["POS"] = clinical_expanded_df["POS"].str.replace("O", "0")

# Subset clinical_expanded_df to only columns of interest, filtering out any unnecessary/extraneous columns
clinical_filtered_df = clinical_expanded_df[
    [
        "CHROM",
        "POS",
        "ID",
        "REF",
        "ALT",
        "CLNDN",
        "CLNSIG",
        "GENEINFO"
    ]
]

# Display filtered dataframe (post-cleaning)
clinical_filtered_df


Unnamed: 0,CHROM,POS,ID,REF,ALT,CLNDN,CLNSIG,GENEINFO
0,1,1014042,475283,G,A,Immunodeficiency_38_with_basal_ganglia_calcifi...,Benign,ISG15:9636
1,1,1014122,542074,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Uncertain_significance,ISG15:9636
2,1,1014143,183381,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Pathogenic,ISG15:9636
3,1,1014179,542075,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Uncertain_significance,ISG15:9636
4,1,1014217,475278,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Benign,ISG15:9636
...,...,...,...,...,...,...,...,...
102316,3,179210507,403908,A,G,Cowden_syndrome,Uncertain_significance,PIK3CA:5290
102317,3,179210511,526648,T,C,Cowden_syndrome,Likely_benign,PIK3CA:5290
102318,3,179210515,526640,A,C,Cowden_syndrome|Hereditary_cancer-predisposing...,Uncertain_significance,PIK3CA:5290
102319,3,179210516,246681,A,G,Cowden_syndrome,Uncertain_significance,PIK3CA:5290


#### Step 2d Notes/Observations:
- Updates to dataframe worked successfully - all missing/NaN entries now show "Not_Given" as expected and POS values now contain only digits (no more "O" characters), although the column is still stored as Dtype _object_ (strings) rather than _int_
- clinical_filtered_df (the subset of clinical_expanded_df limited to the columns of interest) now appears to be clean and ready for further investigation/analysis
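
Replacing the "O" characters leaves POS as a column of strings. If an integer column is wanted, an explicit conversion is still needed. A minimal sketch, assuming "O" is the only stray character (the check before the cast guards that assumption), with `demo_df` standing in for the real dataframe:

```python
import pandas as pd

# POS values copied from the Step 1 output
demo_df = pd.DataFrame({"POS": ["1014O42", "1O14122", "1014143"]})

# Literal (non-regex) replacement of the letter O with the digit 0
cleaned = demo_df["POS"].str.replace("O", "0", regex=False)

# Fail loudly if anything other than digits is left
assert cleaned.str.isdigit().all()

demo_df["POS"] = cleaned.astype("int64")
print(demo_df["POS"].dtype, demo_df["POS"].tolist())
```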

In [17]:
# ==============================================================================================================
# STEP 3a: PERFORM INITIAL/BASIC EDA
# ==============================================================================================================

# Get basic info, Dtypes, and shape for updated/cleaned dataframe to ensure no rows have dropped along the way
print("Dataframe info:")
print(clinical_filtered_df.info(), end = "\n\n")
print("Dataframe shape:", clinical_filtered_df.shape, end = "\n\n")
print("-" * 100)

# Display distribution (counts) of variants by clinical significance to see unique/highest-density CLNSIG values
print("Distribution of Variants by Clinical Significance:")
print(clinical_filtered_df["CLNSIG"].value_counts())


Dataframe info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102321 entries, 0 to 102320
Data columns (total 8 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   CHROM     102321 non-null  int64 
 1   POS       102321 non-null  object
 2   ID        102321 non-null  int64 
 3   REF       102321 non-null  object
 4   ALT       102321 non-null  object
 5   CLNDN     102321 non-null  object
 6   CLNSIG    102321 non-null  object
 7   GENEINFO  102321 non-null  object
dtypes: int64(2), object(6)
memory usage: 6.2+ MB
None

Dataframe shape: (102321, 8)

----------------------------------------------------------------------------------------------------
Distribution of Variants by Clinical Significance:
Uncertain_significance                                                                      47980
Likely_benign                                                                               17885
Pathogenic                                        

#### Step 3a Notes/Observations:
- No rows were dropped in the cleaning and filtering steps above (good)
- Distribution of variant counts (rows of data) by clinical significance category provides a helpful view:
    - Almost half (~47%, or 47,980 of 102,321) of all variants (rows of data) have the "Uncertain_significance" label for CLNSIG
    - The remaining variants (rows of data) are distributed across a variety of CLNSIG categories, with many containing labels for "pathogenic" or "benign" (though not all)
    - These labels will serve as the starting point for mapping variants into a simple binary classification: "dangerous" versus "all other" (which would include both non-dangerous/benign variants and variants with inconclusive or conflicting significance)
- A good next step will be to collapse this data down to a simpler, binary (0 vs. 1) variable
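
The shares quoted above can be checked with simple arithmetic on the counts reported in the step 3a output. The sketch below uses only the two counts that are visible there; on the real dataframe, `clinical_filtered_df["CLNSIG"].value_counts(normalize=True)` would give the same proportions for every category.

```python
# Counts copied from the step 3a output
total_rows = 102321
counts = {"Uncertain_significance": 47980, "Likely_benign": 17885}

# Share of all rows for each visible CLNSIG category
shares = {label: n / total_rows for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```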

In [18]:
# ==============================================================================================================
# STEP 3b: PERFORM INITIAL/BASIC EDA
# ==============================================================================================================

# Work on an explicit copy so the new column is not assigned to a slice of clinical_expanded_df
# (this avoids pandas' SettingWithCopyWarning)
clinical_filtered_df = clinical_filtered_df.copy()

# Create new IMPLIEDCLASS column in dataframe by mapping certain CLNSIG values to 1s
# This new column assigns a value of 1 to presumed "dangerous" variants (defined below)
# CLNSIG values not listed in the mapping come back as NaN, so they are filled with 0s
clinical_filtered_df["IMPLIEDCLASS"] = clinical_filtered_df["CLNSIG"].map(
    {
        "Pathogenic": 1,
        "Likely_pathogenic": 1,
        "Pathogenic/Likely_pathogenic": 1,
        "risk_factor": 1,
        "Pathogenic,_risk_factor": 1,
        "Pathogenic/Likely_pathogenic,_other": 1,
        "Pathogenic/Likely_pathogenic,_risk_factor": 1,
        "Likely_pathogenic,_risk_factor": 1,
        "Pathogenic,_other": 1,
        "Pathogenic,_Affects": 1,
        "Likely_pathogenic,_other": 1,
        "Pathogenic,_protective": 1,
        "Likely_pathogenic,_association": 1,
        "Pathogenic,_association,_protective": 1
    }
).fillna(0)

# Display new version of dataframe with additional column
clinical_filtered_df


Unnamed: 0,CHROM,POS,ID,REF,ALT,CLNDN,CLNSIG,GENEINFO,IMPLIEDCLASS
0,1,1014042,475283,G,A,Immunodeficiency_38_with_basal_ganglia_calcifi...,Benign,ISG15:9636,0.00
1,1,1014122,542074,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Uncertain_significance,ISG15:9636,0.00
2,1,1014143,183381,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Pathogenic,ISG15:9636,1.00
3,1,1014179,542075,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Uncertain_significance,ISG15:9636,0.00
4,1,1014217,475278,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Benign,ISG15:9636,0.00
...,...,...,...,...,...,...,...,...,...
102316,3,179210507,403908,A,G,Cowden_syndrome,Uncertain_significance,PIK3CA:5290,0.00
102317,3,179210511,526648,T,C,Cowden_syndrome,Likely_benign,PIK3CA:5290,0.00
102318,3,179210515,526640,A,C,Cowden_syndrome|Hereditary_cancer-predisposing...,Uncertain_significance,PIK3CA:5290,0.00
102319,3,179210516,246681,A,G,Cowden_syndrome,Uncertain_significance,PIK3CA:5290,0.00


#### Step 3b Notes/Observations:
- New IMPLIEDCLASS column now categorizes variants as potentially "dangerous" (1) or "non-dangerous/benign/inconclusive" (0) for easy grouping and summarizing
- A good next step will be to determine what percentage of variants (rows) in the dataframe are potentially dangerous/harmful based upon this categorization (i.e., variants labeled with 1 only), and then to limit the dataframe to the first 100 potentially dangerous/harmful variants per the instructions 
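
The explicit mapping in step 3b has to list every label combination by hand. A rule-based alternative is sketched below: split each CLNSIG value into its component terms and flag it when any term is in a small "harmful" set. This is an illustration and not the method used in this notebook; it would also flag combinations that the explicit mapping leaves at 0 (for example, any label that merely includes `risk_factor`), so the two approaches can produce different counts. The names `HARMFUL_TERMS` and `is_harmful` are hypothetical.

```python
import re

import pandas as pd

# Assumption: these three terms are what "dangerous" means for this analysis
HARMFUL_TERMS = {"Pathogenic", "Likely_pathogenic", "risk_factor"}


def is_harmful(clnsig: str) -> int:
    """Return 1 if any component term of a CLNSIG label is in HARMFUL_TERMS."""
    # Component terms are separated by '/' or ',_' in the labels seen above
    terms = re.split(r"/|,_", clnsig)
    return int(any(term in HARMFUL_TERMS for term in terms))


demo = pd.Series([
    "Pathogenic/Likely_pathogenic,_risk_factor",
    "Conflicting_interpretations_of_pathogenicity",
    "Likely_benign",
    "Not_Given",
])
print(demo.map(is_harmful).tolist())
```

Matching whole terms rather than substrings matters here: a plain search for "pathogenic" would also hit `Conflicting_interpretations_of_pathogenicity`.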

In [19]:
# ==============================================================================================================
# STEP 3c: PERFORM INITIAL/BASIC EDA
# ==============================================================================================================

# Determine counts of potentially dangerous/harmful and non-dangerous/non-harmful variants in dataframe
print("Number of Potentially Dangerous/Harmful Variants (IMPLIEDCLASS = 1):")
print(clinical_filtered_df["IMPLIEDCLASS"].sum(), end = "\n\n")

# Determine percentage of potentially dangerous/harmful variants out of all variants in dataframe
print("Percentage of Potentially Dangerous/Harmful Variants out of Total Variants:")
print(round(clinical_filtered_df["IMPLIEDCLASS"].sum() / len(clinical_filtered_df), 2) * 100)

# Filter dataframe to just potentially dangerous/harmful variants (IMPLIEDCLASS = 1)
clinical_harmful_df = clinical_filtered_df[clinical_filtered_df["IMPLIEDCLASS"] == 1]

# Create final dataframe, limited to only first 100 potentially dangerous/harmful variants in dataset
clinical_final_df = clinical_harmful_df[:100].reset_index()

# Display distribution (counts) of variants by clinical disease name to see unique/highest-density CLNDN values
# clinical_final_df["CLNDN"].value_counts()

# Display final dataframe
clinical_final_df


Number of Potentially Dangerous/Harmful Variants (IMPLIEDCLASS = 1):
19572.0

Percentage of Potentially Dangerous/Harmful Variants out of Total Variants:
19.0


Unnamed: 0,index,CHROM,POS,ID,REF,ALT,CLNDN,CLNSIG,GENEINFO,IMPLIEDCLASS
0,2,1,1014143,183381,C,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Pathogenic,ISG15:9636,1.0
1,8,1,1014316,161455,C,CG,Immunodeficiency_38_with_basal_ganglia_calcifi...,Pathogenic,ISG15:9636,1.0
2,9,1,1014359,161454,G,T,Immunodeficiency_38_with_basal_ganglia_calcifi...,Pathogenic,ISG15:9636,1.0
3,24,1,1022225,243036,G,A,Congenital_myasthenic_syndrome,Pathogenic,AGRN:375790,1.0
4,26,1,1022313,243037,A,T,Congenital_myasthenic_syndrome,Pathogenic,AGRN:375790,1.0
5,46,1,1041354,574478,CGCCCGCCAGGAGAATGTCTTCAAGAAGTTCGACG,C,"Myasthenic_syndrome,_congenital,_8",Pathogenic,AGRN:375790,1.0
6,49,1,1041582,126556,C,T,Congenital_myasthenic_syndrome|Myasthenic_synd...,Pathogenic,AGRN:375790,1.0
7,63,1,1042136,243038,T,TC,Congenital_myasthenic_syndrome,Pathogenic,AGRN:375790,1.0
8,237,1,1049672,489335,C,T,Not_Given,Likely_pathogenic,AGRN:375790,1.0
9,270,1,1050473,243039,G,A,Congenital_myasthenic_syndrome,Pathogenic,AGRN:375790,1.0


**ASSUMPTIONS:**
- The first 27 rows of the vcf file/dataset do not need to be included in the final dataframe because they simply contain meta-information. I'm assuming it's okay to not even read them into the original dataframe.
- The POS column from the original vcf file/dataset contains a mix of Integer and String values. Based on the dataset documentation, this column _should_ be in **Integer** form only. Some of the values contain "O" characters in addition to numbers/digits, making them Strings. I'm assuming these "O" values were unintentional and should be replaced with 0s (which provides an opportunity for data-cleaning).
- After performing the parsing step on the INFO column (step 2b), I'm assuming that no additional parsing of any of the INFO sub-components/sub-fields (e.g., AF_ESP, AF_EXAC, etc.) is necessary for the ask from the boss.
- The "SSR" sub-component/sub-field within the INFO column is actually non-existent in the data (though it's mentioned in the meta-information data for the vcf file). I'm assuming this to be the case given the output of the INFO column parsing step (resulting from pd.json_normalize in step 2b) - which should capture data for each sub-component/sub-field of INFO _wherever data actually exists_. To be extra sure, I also imported the vcf file data into an Excel workbook and performed a Find search for "SSR" in the INFO column and found no hits.
- The exact CLNSIG descriptions I categorized as potentially dangerous/harmful are shown in step 3b. I chose these particular descriptions because each one is defined as pathogenic, likely pathogenic, or a risk factor, which implies that there's a high probability each of these variants is harmful. I excluded every other CLNSIG description from my potentially dangerous/harmful categorization (including descriptions such as benign, likely benign, inconclusive, conflicting, etc.) because I assumed they had a low (or no) probability of actually being harmful.
- I'm assuming that _first 100 harmful mutations_ (as described in the "Business Case" above) literally means the first 100 harmful mutations that appear in the dataset, starting at the top (index/row 0). I am _not_ assuming that statement to mean the top 100 most common harmful mutations in the dataset (which would be based on counts of occurrence).

Assumed Mappings of Required Fields to Dataframe Column Names (based on documentation) - keep these columns:
- 1) Gene name -> **GENEINFO** column
- 2) Mutation ID number -> **ID** column
- 3) Mutation Position (chromosome & position) -> **CHROM** and **POS** columns
- 4) Mutation value (reference & alternate bases) -> **REF** and **ALT** columns
- 5) Clinical significance (CLNSIG) -> **CLNSIG** column
- 6) Disease that is implicated -> **CLNDN** column
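
The mapping above can be turned into a boss-ready table with a simple column selection. The sketch below also pulls the gene symbol out of GENEINFO, assuming values look like `AGRN:375790` (symbol, colon, numeric ID) as in the output above; entries that list several genes would need extra handling. `demo_df`, `report`, and the new `GENE` column are hypothetical names, and the two rows are copied from the step 3c output.

```python
import pandas as pd

# Two rows copied from the step 3c output
demo_df = pd.DataFrame({
    "CHROM": [1, 1],
    "POS": ["1022225", "1049672"],
    "ID": [243036, 489335],
    "REF": ["G", "C"],
    "ALT": ["A", "T"],
    "CLNDN": ["Congenital_myasthenic_syndrome", "Not_Given"],
    "CLNSIG": ["Pathogenic", "Likely_pathogenic"],
    "GENEINFO": ["AGRN:375790", "AGRN:375790"],
})

# Keep the part before the first ':' as the gene symbol;
# values without a ':' (e.g. "Not_Given") pass through unchanged
report = demo_df.assign(GENE=demo_df["GENEINFO"].str.split(":").str[0])

# Order the columns to follow the six required fields
report = report[["GENE", "ID", "CHROM", "POS", "REF", "ALT", "CLNSIG", "CLNDN"]]
print(report)
```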

### What would you present to your boss?

The final dataframe I would share with my boss is displayed under step 3c above. This dataframe output shows **the first 100 potentially dangerous/harmful variants in the dataset.**

I would also let my boss know that after performing some initial/basic EDA on the dataset and grouping variants by their clinical significance (CLNSIG), I'd conclude that **19,572 of the 102,321 records in the file (approximately 19%) represent potentially dangerous/harmful variants.**

To support these numbers, I would share with my boss how I classified potentially dangerous/harmful variants and potentially non-dangerous/non-harmful variants. For transparency, I would briefly summarize **all of the assumptions I listed above** but would specifically call out the fifth bullet point (my CLNSIG categorization), mentioning that my classification of variants was based on the CLNSIG descriptions in the file and was relatively intuitive (i.e., I treated "pathogenic", "likely pathogenic", and "risk factor" variants as potentially dangerous/harmful and treated all others as non-dangerous/non-harmful). Of course, this classification scheme also makes the baseline assumption that the CLNSIG data is credible, so I would qualify everything I share with a statement that I could certainly learn more about this data and would be eager to spend more time getting comfortable with the documentation before we make any recommendations or decisions. (For instance, I'd probably say something along these lines: "I know you pulled this file off the Internet and had me dive into it right away to take a look - and that's awesome - but I'm also still relatively new to everything that's in here and what it all means ... so this is _truly_ just a first pass at some initial insights!")

A cursory glance at the final dataframe reveals that a variety of diseases (CLNDN descriptions) are implicated in the first 100 records. For instance, I see a handful of hits for "Congenital_myasthenic_syndrome", "Robinow_syndrome", "Neurodevelopmental_Disability", and quite a few others. Based on this quick review, I'd mention to my boss that **a deeper dive into the CLNDN data - as well as several of the other data columns - would be helpful in order to better understand the exact gene and mutation combinations that are associated with dangerous/harmful conditions.** As a next step, we could take a closer look at how much overlap there might be - if any - between some of these diseases and the gene/mutation combinations listed in the file (e.g., are some gene/mutation combinations linked to a wider array of diseases than others). We could also try to learn more about the CLNSIG descriptions and confirm their accuracy (or the overall confidence level put into each category), and see if there are certain descriptions/categories that are more severe - or that have a higher degree of confidence around pathogenicity - than others.