# Introduction to Pandas for Feature Engineering
Welcome! If you are looking for a short and sweet, but effective, introduction to some of the coding steps required to perform feature engineering and data analysis for machine learning, look no further. 

This notebook will use a genomics data set, analyze it, and prepare it for machine learning modeling. 

I'm assuming you are comfortable reading and writing code, and that you've seen some Python syntax. I'll help you ramp up quickly.

### Get libraries
First, we'll want to import the open-source libraries that make all of this possible. I'm using Pandas as my backbone here, but we'll build on top of that as we need more libraries.

In [8]:
import pandas as pd 

### Get some data
Ok! We're going to pull some genomics data, specifically data on how genetic variants relate to human health. This is a public dataset from ClinVar.
- https://www.ncbi.nlm.nih.gov/clinvar/ 

In [5]:
!wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz

--2019-08-13 17:48:33--  ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz
           => ‘clinvar.vcf.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::11
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/clinvar/vcf_GRCh37 ... done.
==> SIZE clinvar.vcf.gz ... 19531482
==> PASV ... done.    ==> RETR clinvar.vcf.gz ... done.
Length: 19531482 (19M) (unauthoritative)


2019-08-13 17:48:34 (247 MB/s) - ‘clinvar.vcf.gz’ saved [19531482]



### Load into a Pandas DataFrame

If you're thinking, "Now what on Earth is a .vcf file?", that makes two of us! Fortunately, the kind developers who announced this dataset also have a script that converts it into a CSV file. That script is hosted on GitHub, right here:
- https://github.com/arvkevi/clinvar-kaggle/blob/master/process_clinvar.py 

Let's clone that repository, and run this script.

In [3]:
!git clone https://github.com/arvkevi/clinvar-kaggle.git

Cloning into 'clinvar-kaggle'...
remote: Enumerating objects: 66, done.
remote: Total 66 (delta 0), reused 0 (delta 0), pack-reused 66
Unpacking objects: 100% (66/66), done.


In [6]:
!cp clinvar.vcf.gz clinvar-kaggle
!python clinvar-kaggle/process_clinvar.py

  cv_df.set_value(cv_df.CLNSIGCONF.notnull(), 'CLASS', 1)
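
The line echoed above appears to be the tail of a pandas deprecation warning: `DataFrame.set_value` was deprecated in pandas 0.21 and removed in 1.0, so the script may fail outright on a newer install. Here is a minimal sketch of what looks like the intended assignment, written with `.loc` instead. The toy frame and its values are invented for illustration; only the names `cv_df`, `CLNSIGCONF`, and `CLASS` come from the warning line.

```python
import pandas as pd

# Toy stand-in for the script's cv_df; the values are invented for illustration.
cv_df = pd.DataFrame({
    "CLNSIGCONF": [None, "Benign(1),Likely_benign(1)", None],
    "CLASS": [0, 0, 0],
})

# Boolean-mask assignment: set CLASS to 1 wherever CLNSIGCONF is present.
cv_df.loc[cv_df["CLNSIGCONF"].notnull(), "CLASS"] = 1

print(cv_df["CLASS"].tolist())
```

If the script errors on your machine, editing that one line in your local clone along these lines is probably the smallest fix.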


In [11]:
!cp clinvar_conflicting.csv raw_data.csv

If that copy command ran, then you successfully converted the VCF file! Congrats. Now, let's load it into a DataFrame. Thanks to the Python kung fu performed by Kevin Arvai, we can now use the single most important function in pandas: `pd.read_csv()`.

You'll want to make sure you have run `import pandas as pd` before calling `read_csv`. Give it the name of a CSV file, and it returns a DataFrame.

In [9]:
df = pd.read_csv('clinvar_conflicting.csv')

  interactivity=interactivity, compiler=compiler, result=result)
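
The stray line above looks like the end of a warning that pandas printed while loading. The full message is not shown, so this is an assumption, but a common cause with large CSVs is a column that mixes numbers and text (chromosome names such as `1` and `X`, for example). A minimal sketch of telling `read_csv` which type to use, run on a tiny in-memory CSV whose rows are invented:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for clinvar_conflicting.csv (rows are invented).
csv_text = """CHROM,POS,REF,ALT,CLASS
1,955563,G,C,0
X,2700157,A,G,1
MT,3243,A,G,0
"""

# Reading CHROM as a string avoids a column that mixes integers and text.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"CHROM": str})

print(sample.dtypes)
```

The same `dtype=` argument works when you pass a file name instead of a `StringIO` object.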


As long as that worked, you can now call the second most commonly used method in Pandas: `df.head()`. It shows the first five rows by default, which gives you a quick visual check that your data loaded as expected. Give it a try!

In [10]:
df.head()

Unnamed: 0,CHROM,POS,REF,ALT,AF_ESP,AF_EXAC,AF_TGP,CLNDISDB,CLNDN,CLNHGVS,CLNVC,GENEINFO,MC,ORIGIN,CLNVI,CLNDISDBINCL,CLNDNINCL,CLNSIGINCL,CLASS
0,1,955563,G,C,0.0,0.00114,0.00958,"MedGen:C3808739,OMIM:615120|MedGen:CN169374","Myasthenic_syndrome,_congenital,_8|not_specified",NC_000001.10:g.955563G>C,single_nucleotide_variant,AGRN:375790,SO:0001583|missense_variant,1.0,,,,,0
1,1,955597,G,T,0.0,0.42418,0.28255,MedGen:CN169374,not_specified,NC_000001.10:g.955597G>T,single_nucleotide_variant,AGRN:375790,SO:0001819|synonymous_variant,1.0,,,,,0
2,1,955619,G,C,0.0,0.03475,0.00879,"MedGen:C3808739,OMIM:615120|MedGen:CN169374|Me...","Myasthenic_syndrome,_congenital,_8|not_specifi...",NC_000001.10:g.955619G>C,single_nucleotide_variant,AGRN:375790,SO:0001583|missense_variant,1.0,,,,,1
3,1,957568,A,G,0.01761,0.00493,0.01418,MedGen:CN169374,not_specified,NC_000001.10:g.957568A>G,single_nucleotide_variant,AGRN:375790,SO:0001627|intron_variant,1.0,,,,,0
4,1,957640,C,T,0.03175,0.02016,0.03275,"MedGen:C3808739,OMIM:615120|MedGen:CN169374","Myasthenic_syndrome,_congenital,_8|not_specified",NC_000001.10:g.957640C>T,single_nucleotide_variant,AGRN:375790,SO:0001819|synonymous_variant,1.0,,,,,0
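
Beyond `head()`, a few other one-liners are handy for a first look. The sketch below rebuilds a handful of columns from the five rows shown above by hand, so it runs without the full file; on the real data you would call the same methods on `df`.

```python
import pandas as pd

# A few columns from the first rows shown above, typed in by hand.
preview = pd.DataFrame({
    "CHROM": [1, 1, 1, 1, 1],
    "POS": [955563, 955597, 955619, 957568, 957640],
    "REF": ["G", "G", "G", "A", "C"],
    "ALT": ["C", "T", "C", "G", "T"],
    "AF_EXAC": [0.00114, 0.42418, 0.03475, 0.00493, 0.02016],
    "CLNVI": [None, None, None, None, None],
    "CLASS": [0, 0, 1, 0, 0],
})

print(preview.shape)                    # (rows, columns)
print(preview.dtypes)                   # data type of each column
print(preview.isnull().sum())           # missing values per column
print(preview["CLASS"].value_counts())  # how balanced the label is
```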


In [4]:
# For loops

In [5]:
# dictionaries

In [None]:
# functions in Python 

In [9]:
# missing values
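
One possible way to fill in this cell, offered as a sketch rather than the notebook's intended solution: measure how much of each column is missing, drop columns that are mostly empty, and fill the remaining numeric gaps. The toy values and the 0.5 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame; column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "AF_EXAC": [0.00114, None, 0.03475, 0.00493],
    "CLNVI": [None, None, None, "variant_id_1"],
    "CLASS": [0, 0, 1, 0],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Drop columns that are mostly empty (the 0.5 threshold is an arbitrary choice).
kept = toy.loc[:, missing_share <= 0.5]

# Fill the remaining numeric gaps with the column median.
filled = kept.fillna({"AF_EXAC": kept["AF_EXAC"].median()})

print(filled)
```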

In [8]:
# one hot encoding
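
A minimal sketch of one-hot encoding with `pd.get_dummies`, which turns a text column into one indicator column per category. The `CLNVC` column name comes from the dataset; the three rows are invented.

```python
import pandas as pd

# Toy frame with one categorical column; rows are invented.
toy = pd.DataFrame({
    "CLNVC": ["single_nucleotide_variant", "Deletion", "single_nucleotide_variant"],
    "CLASS": [0, 1, 0],
})

# Replace CLNVC with one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["CLNVC"], prefix="CLNVC")

print(encoded.columns.tolist())
```

Columns with thousands of distinct values (gene names, for instance) can explode the width of the frame, so it is worth checking `nunique()` before encoding.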

In [13]:
# detecting anomalies

In [12]:
# unit tests in Python

In [14]:
# map reduce for feature engineering 

In [6]:
# plotting

In [7]:
# correlation & decorrelation
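
A sketch of one way to approach this cell: compute pairwise correlations with `DataFrame.corr()`, then drop one column from each highly correlated pair. The values are the three allele-frequency columns from the five rows displayed by `df.head()`; five rows are far too few for a meaningful estimate, and the 0.9 threshold is an arbitrary assumption.

```python
import pandas as pd

# Allele-frequency values from the five rows displayed by df.head().
af = pd.DataFrame({
    "AF_ESP": [0.0, 0.0, 0.0, 0.01761, 0.03175],
    "AF_EXAC": [0.00114, 0.42418, 0.03475, 0.00493, 0.02016],
    "AF_TGP": [0.00958, 0.28255, 0.00879, 0.01418, 0.03275],
})

# Pearson correlation between every pair of columns.
corr = af.corr()
print(corr.round(2))

# Flag a column if it correlates strongly with any column that precedes it.
threshold = 0.9  # arbitrary cutoff for this sketch
to_drop = [
    col
    for i, col in enumerate(corr.columns)
    if (corr.iloc[:i][col].abs() > threshold).any()
]

decorrelated = af.drop(columns=to_drop)
print(decorrelated.columns.tolist())
```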

In [None]:
# package your ETL code as an inference pipeline 

In [10]:
# splitting 
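
A minimal sketch of a train/test split using only pandas: sample a fraction of the rows for training and keep the rest for testing. The toy frame, the 80/20 ratio, and the seed are assumptions for illustration.

```python
import pandas as pd

# Toy frame with ten rows; values are invented.
toy = pd.DataFrame({"POS": range(10), "CLASS": [0, 1] * 5})

# Randomly sample 80% of the rows for training; fix the seed for repeatability.
train = toy.sample(frac=0.8, random_state=42)

# Everything not sampled becomes the test set.
test = toy.drop(train.index)

print(len(train), len(test))
```

Libraries such as scikit-learn offer a ready-made `train_test_split` with stratification, which may be a better fit when the label is imbalanced.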