**This notebook is about preparing the data: <u>Exploratory Data Analysis (EDA) and Cleaning</u>. The first part explores and describes each feature on its own; the second part looks for correlations and relationships between the features.**

<u>*EDA and data cleaning are two separate processes.*</u>

*"There are different approaches to EDA, so it's going to be hard to know what analysis to perform and how to do it properly."*

#### EDA's purpose is to give an understanding of: 
- **Properties of the data** (*statistical properties, schema...*)<br>
- **Quality of the data:**
    - Missing Values. 
    - Inconsistent data types. <br>
- **Predictive power of the data:**
    - Correlation of features against the target. 

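The data-quality checks listed above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up frame: the name `toy` and its columns are assumptions for the example, not columns of the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "age": [34, 51, np.nan, 29],
    "chrom": ["1", 2, "X", "3"],   # mixed str/int values in one column
    "label": [0, 1, 0, 1],
})

# Properties: the dtype pandas inferred for each column
schema = toy.dtypes

# Quality: percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Quality: number of distinct Python types stored in each column;
# a value above 1 hints at inconsistent data types
type_counts = toy.apply(lambda col: col.dropna().map(type).nunique())

print(schema, missing_pct, type_counts, sep="\n\n")
```

A column whose `type_counts` entry is greater than 1 (here `chrom`) is a candidate for cleaning before any further analysis.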






### 1) Descriptive analysis (Univariate Analysis) 
 - Provides an understanding of each attribute/variate of the dataset. <br>
 - Offers evidence for feature preprocessing and selection for later stages. 

There are three common types of attributes: **Numerical, Categorical, Textual**(?). 


| Attribute type | Analysis/calculation | Details |
| :- | :-: | :- |
| ***Common*** |<br> *Data Type* <br> <br>*Missing values*<br><br> | <br> Attribute's data type<br><br>Percentage of missing values<br><br> |
| ***Numerical*** |<br> *Quantile statistics* <br><br>*Descriptive statistics*<br><br>*Distribution histogram*<br><br> | Q1, Q2, Q3, min, max, range, interquartile range<br><br>Mean, mode, standard deviation, median absolute deviation, kurtosis, skewness<br><br>Histogram based on an appropriate number of bins |
| ***Categorical*** |<br> *Cardinality* <br> <br>*Unique counts*<br><br> | <br> Number of unique values for the categorical attribute<br><br>Number of occurrences for each unique value of the categorical attribute<br><br> |
| ***Textual*** |<br> *Tokens* <br> <br>*DF/TF*<br><br> | <br> Number of unique tokens<br><br>Distribution of document frequency and term frequency with/without standard English stop words<br><br> |

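The numerical row of the table above could be computed roughly as follows. This is a sketch on a hypothetical series named `values`; the Freedman-Diaconis rule (`bins="fd"`) is only one common heuristic for an "appropriate" number of bins and is an assumption here, not a choice made by the source.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical attribute
values = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 9.0, 20.0], name="score")

# Quantile statistics
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

summary = {
    "min": values.min(),
    "max": values.max(),
    "range": values.max() - values.min(),
    "Q1": q1, "Q2": q2, "Q3": q3,
    "IQR": q3 - q1,
    # Descriptive statistics
    "mean": values.mean(),
    "mode": values.mode().iloc[0],
    "std": values.std(),
    "MAD": (values - values.median()).abs().median(),  # median absolute deviation
    "skewness": values.skew(),
    "kurtosis": values.kurt(),
}

# Distribution histogram with bin edges picked by the Freedman-Diaconis rule
bin_edges = np.histogram_bin_edges(values, bins="fd")
counts, _ = np.histogram(values, bins=bin_edges)

print(pd.Series(summary))
print(bin_edges, counts)
```

The "Common" row (data type and percentage of missing values) would be `values.dtype` and `values.isna().mean() * 100` on the same series.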

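For the categorical and textual rows, a similar sketch is shown below. The data, the whitespace tokenizer, and the two-word `stop_words` set are all illustrative assumptions; a real analysis would use a standard English stop-word list and a proper tokenizer.

```python
from collections import Counter

import pandas as pd

# Hypothetical categorical attribute
category = pd.Series(["red", "blue", "red", "green", "red"])
cardinality = category.nunique()          # number of unique values
unique_counts = category.value_counts()   # occurrences of each unique value

# Hypothetical textual attribute: one short document per entry
docs = ["the sky is blue", "the sea is deep blue", "blue sky"]
# Tiny illustrative stop-word set, NOT a standard list
stop_words = {"the", "is"}


def term_and_doc_frequency(documents, stop=frozenset()):
    """Return (term frequency, document frequency) counters."""
    tf, df = Counter(), Counter()
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t not in stop]
        tf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts once per token
    return tf, df


tf_all, df_all = term_and_doc_frequency(docs)
tf_nostop, df_nostop = term_and_doc_frequency(docs, stop_words)
n_unique_tokens = len(tf_all)

print(cardinality, unique_counts.to_dict())
print(n_unique_tokens, tf_all.most_common(3), df_nostop.most_common(3))
```

Comparing `tf_all` with `tf_nostop` shows how much of the frequency distribution is taken up by stop words.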
### 2) Correlation analysis (Bivariate Analysis)
Examines the <u>*relationship between two attributes*</u>, whether they are correlated or not. This analysis is done from two perspectives: 
- **Qualitative:** <br><br> Descriptive analysis of the dependent attributes (num/cat) against each unique value of THE independent categorical attribute(?) *(Are the dependent attributes numerical or categorical? Which one is THE independent categorical attribute?)* <br><br>
    This perspective helps understand ***intuitively*** the relationship between X and Y.<br><br>
    Visualizations are used together with the qualitative analysis to aid understanding. 
    
| Attribute type | Analysis |
| :- | :- | 
| *Both Categorical (X,Y)* | Contingency table with unique counts of X(Y) per unique value of Y(X)|
| *Categorical (X) <br>versus numerical (Y)* |Descriptive statistics or histogram of Y per unique value of X|

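Both rows of the qualitative table map onto standard pandas calls. The sketch below uses a made-up frame with hypothetical columns `x_cat`, `y_cat` and `y_num`.

```python
import pandas as pd

# Hypothetical toy data: one categorical X, one categorical Y, one numerical Y
toy = pd.DataFrame({
    "x_cat": ["a", "a", "b", "b", "b", "c"],
    "y_cat": ["yes", "no", "yes", "yes", "no", "no"],
    "y_num": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
})

# Both categorical: contingency table of counts of Y per unique value of X
contingency = pd.crosstab(toy["x_cat"], toy["y_cat"])

# Categorical X versus numerical Y: descriptive statistics of Y per value of X
per_group = toy.groupby("x_cat")["y_num"].describe()

print(contingency)
print(per_group)
```

A per-group histogram, the other option named in the table, could be drawn from the same `groupby` object.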
- **Quantitative:** <br><br>
 Test of the relationship between X and Y based on <u>*hypothesis testing framework.*</u> <br><br>
 Provides a mathematical methodology to quantitatively determine the existence and/or the strength of the relationship. 
 

|X/Y | Categorical | Numerical |
| :-: | :-: | :-: |
|**Categorical**| Chi-square test <br><br> Information gain | <br><br>Student T-test <br><br> ANOVA<br><br> Logistic regression<br><br>Discretize Y *(left column)*<br><br>|
|**Numerical**|<br><br>Student T-test <br><br> ANOVA<br><br> Logistic regression<br><br>Discretize X *(right column)*<br><br> | <br><br>Correlation <br><br> Linear regression <br><br> Discretize Y *(left column)*<br><br>Discretize X *(right column)*<br><br>|

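Three of the tests named in the table above can be sketched with `scipy.stats`. The data are synthetic and generated so that the relationships exist by construction; every variable name is hypothetical, and the threshold used to discretize Y is arbitrary. Small p-values here only confirm that the tests detect a dependence that was built in.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# Synthetic data: y_num depends on x_cat by construction
x_cat = rng.choice(["a", "b", "c"], size=n)
offsets = pd.Series(x_cat).map({"a": 0.0, "b": 1.0, "c": 2.0}).to_numpy()
y_num = offsets + rng.normal(size=n)
y_cat = np.where(y_num > 1.0, "high", "low")    # discretized version of Y
x_num = y_num + rng.normal(scale=0.5, size=n)   # numerical X correlated with Y

# Categorical vs categorical: chi-square test on the contingency table
chi2, p_chi2, dof, expected = stats.chi2_contingency(pd.crosstab(x_cat, y_cat))

# Categorical vs numerical: one-way ANOVA across the levels of X
groups = [y_num[x_cat == level] for level in ["a", "b", "c"]]
f_stat, p_anova = stats.f_oneway(*groups)

# Numerical vs numerical: Pearson correlation
r, p_r = stats.pearsonr(x_num, y_num)

print(f"chi2={chi2:.1f} (dof={dof}), p={p_chi2:.3g}")
print(f"F={f_stat:.1f}, p={p_anova:.3g}")
print(f"r={r:.2f}, p={p_r:.3g}")
```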
In [62]:
# Core libraries for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook

# Load the dataset (local path on the author's machine)
df = pd.read_csv("C:/Users/elkha/Desktop/clinvar_conflicting.csv")



#### Crosstab for Feature and SYMBOL
*During the univariate analysis, going through each feature on its own, I noticed that the most frequent values in Feature and SYMBOL have equal frequencies. I figured that a cross-tabulation might help show the overlap between Feature and SYMBOL.*

In [65]:
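# Cross-tabulate SYMBOL (rows) against Feature (columns)
cross = pd.crosstab(df.SYMBOL, df.Feature)
# The 25 most frequent values of each attribute
a = df["SYMBOL"].value_counts().index[:25].tolist()
b = df["Feature"].value_counts().index[:25]
# Keep only those rows and columns (despite its name, cross_20 is 25 x 25)
cross_20 = cross.loc[a, b]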

In [66]:
# Heatmap of the counts for the 25 most frequent SYMBOL and Feature values
f, ax = plt.subplots(figsize=(13, 9))
sns.heatmap(cross_20, cmap="YlGnBu", annot=False, cbar=True, ax=ax)


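One possible way to put a number on the overlap that the heatmap shows visually is a row-normalized crosstab. This is a sketch on a hypothetical toy frame that only mimics the `SYMBOL` and `Feature` columns; it is not a result from the actual dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the SYMBOL and Feature columns
toy = pd.DataFrame({
    "SYMBOL": ["G1", "G1", "G1", "G2", "G2", "G3"],
    "Feature": ["T1", "T1", "T1", "T2", "T3", "T4"],
})

# Row-normalized crosstab: each row sums to 1
shares = pd.crosstab(toy["SYMBOL"], toy["Feature"], normalize="index")

# Largest share per SYMBOL; 1.0 means the SYMBOL always appears
# together with a single Feature value
top_share = shares.max(axis=1)

print(top_share)
```

If most values of `top_share` were close to 1.0 on the real data, that would suggest the two attributes carry largely the same information.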
##### References
- https://cloud.google.com/blog/products/ai-machine-learning/building-ml-models-with-eda-feature-selection