# <font color= "purple">Regression : Explore Stage</font>

* Explore the interactions of all attributes and target variable to help discover drivers of our target variable.

```
"Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis, and to check assumptions with the help of summary statistics and graphical representations." - Prasad Patil
```



## <font color=" green">Main Stages in Exploration Phase</font>

1. **Hypothesize**: Form and document your initial hypotheses about how the predictors (independent variables, features, or attributes) interact with the target (y-value or dependent variable).

2. **Visualize**: Use visualization techniques (scatterplots, jointplots, pair grids, heatmaps) to identify drivers. When a visualization raises a question that calls for a statistical test, run that test.

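3. **Test your Hypotheses**: Analyze the drivers of a continuous target using the statistical test that fits the variables involved: correlation tests for continuous features, t-tests for comparing the target across two groups, and chi-squared tests for relationships between discrete variables.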
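
As a minimal sketch of step 3, the block below runs a correlation test and a t-test and reports whether each rejects its null hypothesis. The data is synthetic, the `tutored` column is a hypothetical two-group flag invented for illustration, and the use of `scipy.stats` is an assumption, since this lesson does not name a testing library.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for a grades dataset (illustrative only).
rng = np.random.default_rng(123)
exam1 = rng.normal(75, 10, size=200)
demo = pd.DataFrame({
    "exam1": exam1,
    "final_grade": 0.9 * exam1 + rng.normal(10, 5, size=200),
    "tutored": rng.integers(0, 2, size=200).astype(bool),  # hypothetical flag
})

alpha = 0.05  # significance level chosen before testing

# Continuous vs. continuous: Pearson correlation.
# H0: there is no linear relationship between exam1 and final_grade.
r, p_corr = stats.pearsonr(demo["exam1"], demo["final_grade"])
print(f"r = {r:.3f}, p = {p_corr:.4f}, reject H0: {p_corr < alpha}")

# Two groups vs. continuous: independent two-sample t-test.
# H0: mean final_grade is the same for tutored and non-tutored students.
t, p_ttest = stats.ttest_ind(
    demo.loc[demo["tutored"], "final_grade"],
    demo.loc[~demo["tutored"], "final_grade"],
    equal_var=False,  # Welch's t-test; does not assume equal variances
)
print(f"t = {t:.3f}, p = {p_ttest:.4f}, reject H0: {p_ttest < alpha}")
```

Writing the null hypothesis as a comment next to each test keeps the documentation and the result together.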



## <font color="orange">Standing Orders for the Exploration Stage</font>

* Document your initial hypotheses. Write them down so they're concrete and not in your head.
* Document any surprises you may find in visualizing.
* Document your hypothesis test results. That means writing up when the tests reject the null hypothesis or fail to reject your null hypothesis, for each hypothesis you make.
* Write down your takeaways. Documenting your takeaways is a huge component of your final deliverable/analysis.
* Identify features that correlate with each other. If feature A and feature B are each tightly correlated with the target variable but are also tightly correlated with each other, use only the one that correlates better with the target rather than both.
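
The last standing order can be sketched in code. The helper below, `correlated_feature_pairs`, is a hypothetical name, and the 0.8 threshold is an arbitrary assumption rather than a rule from this lesson. It lists feature pairs that are strongly correlated with each other and notes which member of each pair correlates better with the target. The data is synthetic.

```python
import numpy as np
import pandas as pd


def correlated_feature_pairs(df, target, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold.

    For each pair, also name the feature that correlates more strongly
    with the target column.
    """
    corr = df.select_dtypes("number").corr()
    features = [c for c in corr.columns if c != target]
    rows = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                a_wins = abs(corr.loc[a, target]) >= abs(corr.loc[b, target])
                rows.append({
                    "feature_a": a,
                    "feature_b": b,
                    "r": r,
                    "stronger_with_target": a if a_wins else b,
                })
    columns = ["feature_a", "feature_b", "r", "stronger_with_target"]
    return pd.DataFrame(rows, columns=columns)


# Synthetic data: exam2 is nearly a copy of exam1, exam3 is unrelated.
rng = np.random.default_rng(0)
exam1 = rng.normal(75, 10, 300)
demo = pd.DataFrame({
    "exam1": exam1,
    "exam2": exam1 + rng.normal(0, 2, 300),
    "exam3": rng.normal(70, 8, 300),
})
demo["final_grade"] = (
    0.5 * demo["exam1"] + 0.4 * demo["exam3"] + rng.normal(0, 3, 300)
)

pairs = correlated_feature_pairs(demo, target="final_grade")
print(pairs)
```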


#### <font color ="darkgreen">General Steps for Visualizations in your Explore Stage</font>

* Plot out the distributions of each feature.

    * This is critical because many of our statistical tools and machine learning algorithms assume certain distributions. If your data isn't even approximately normally distributed, avoid tools that assume normally distributed data.

* Plot out the interaction of 2 or more variables.

* Plot out how subgroups compare to each other and to the overall population.
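
A rough sketch of the first and last steps above, using synthetic data. The `section` column is a hypothetical subgroup label, not a column from the lesson's dataset. The block draws one histogram per numeric feature and then compares each subgroup's mean to the overall mean.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data (illustrative only).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "exam1": rng.normal(75, 10, 300),
    "final_grade": rng.normal(80, 8, 300),
    "section": rng.choice(["A", "B", "C"], size=300),  # hypothetical subgroup
})

# 1. Distribution of each numeric feature, one histogram per column.
numeric_cols = demo.select_dtypes("number").columns
fig, axes = plt.subplots(
    1, len(numeric_cols), figsize=(4 * len(numeric_cols), 3)
)
for ax, col in zip(np.atleast_1d(axes), numeric_cols):
    ax.hist(demo[col], bins=20)
    ax.set_title(col)
fig.tight_layout()

# 2. Each subgroup's mean compared with the overall mean.
overall_mean = demo["final_grade"].mean()
by_section = demo.groupby("section")["final_grade"].mean()
comparison = (by_section - overall_mean).rename("diff_from_overall")
print(comparison)
```

In a notebook the figure renders on its own; in a script, add a call to `plt.show()` at the end.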



## <font color="blue">Types of Visualizations</font>

* **Continuous and Continuous**
    * Scatter-Matrix with seaborn's pairplot
    * Scatterplot: relplot and lmplot in seaborn
    * Heatmaps of correlation coefficients

* **Discrete and Continuous**
    * Swarmplot, Violinplot, Box plots


* **Discrete and Discrete**
    * Bar plots
    * Cross tabulations and heatmaps of them
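
The examples later in this lesson cover the continuous pairings. As a minimal sketch of the two discrete cases, the block below draws a box plot (discrete and continuous) and a heatmap of a cross tabulation (discrete and discrete). The `section` and `passed` columns and the passing cutoff of 70 are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data (illustrative only).
rng = np.random.default_rng(7)
final_grade = rng.normal(80, 8, 300)
demo = pd.DataFrame({
    "section": rng.choice(["A", "B", "C"], size=300),  # hypothetical category
    "final_grade": final_grade,
    "passed": np.where(final_grade >= 70, "yes", "no"),  # hypothetical cutoff
})

fig, (ax_box, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Discrete and continuous: one box per category.
sns.boxplot(data=demo, x="section", y="final_grade", ax=ax_box)

# Discrete and discrete: cross tabulation drawn as an annotated heatmap.
counts = pd.crosstab(demo["section"], demo["passed"])
sns.heatmap(counts, annot=True, fmt="d", cmap="Blues", ax=ax_heat)

fig.tight_layout()
```

Swapping `sns.boxplot` for `sns.violinplot` or `sns.swarmplot` with the same `x` and `y` arguments gives the other discrete-and-continuous views listed above.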


```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# local project modules
import env
import wrangle
import split_scale

# acquire data and remove null values 
df = wrangle.wrangle_grades()

# split into train, validate, and test sets
# Notice that we are keeping X and Y together, so far
train_and_validate, test = train_test_split(df, random_state=123)
train, validate = train_test_split(train_and_validate, random_state=123)
```
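
Because `train_test_split` holds out 25% of the rows by default, splitting twice as above leaves roughly 56% of the data in train, 19% in validate, and 25% in test. The sketch below checks those proportions on a synthetic 1,000-row frame; the `demo` frame and its columns are placeholders, not the grades data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame standing in for the wrangled data.
demo = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000)})

# Same two-step split as above, relying on the default test_size of 0.25.
train_and_validate, test = train_test_split(demo, random_state=123)
train, validate = train_test_split(train_and_validate, random_state=123)

for name, part in [("train", train), ("validate", validate), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(demo):.1%})")
```

Passing `test_size` explicitly to either call changes these proportions.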

### seaborn.jointplot
```python
with sns.axes_style('white'):
    # x and y are passed by keyword; recent seaborn releases reject them as positional arguments
    j = sns.jointplot(x="exam1", y="final_grade", data=train, kind='reg', height=5)
plt.show()
```

### seaborn.pairgrid + matplotlib.pyplot.hist + matplotlib.pyplot.scatter

```python

# This is roughly equivalent to sns.pairplot, but we see here that we have the
# flexibility to customize the type of plot in each position.

g = sns.PairGrid(train)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);
```

### seaborn.heatmap

```python
plt.figure(figsize=(8,6))
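sns.heatmap(train.corr(), cmap='Blues', annot=True)
# This y-limit appears to be a workaround for matplotlib 3.1.1, which clipped the
# top and bottom rows of seaborn heatmaps; on other versions it can likely be dropped.
plt.ylim(0, 4)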
```

## <font color ="red"> Further Reading</font>

* <a href="https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html">Visualization With Seaborn</a>

* https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15

* https://www.itl.nist.gov/div898/handbook/index.html

* https://adataanalyst.com/data-analysis-resources/visualise-categorical-variables-in-python/

* <a href="https://matplotlib.org/3.2.1/gallery/statistics/boxplot_vs_violin.html">Boxplot vs. Violin example </a>

* https://datavizcatalogue.com/
