# Introduction to Exploratory Data Analysis (EDA)

### What is EDA?

**Exploratory Data Analysis (EDA)** is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

### Why is EDA Useful? 
EDA allows us to get an initial feel for the data.

This lets us determine whether the data makes sense, or whether further cleaning or more data is needed.

EDA also helps to identify patterns and trends in the data; these can be just as important as findings from modeling.
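
A first pass over a new data set might look like the sketch below. The small `applicants` DataFrame and its column names are invented for illustration; the checks themselves (`shape`, `dtypes`, `isna`, `duplicated`) are standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing score and one duplicated row
applicants = pd.DataFrame({
    "city": ["Austin", "Boston", "Boston", "Denver", "Denver"],
    "interview_score": [7.5, 8.0, 8.0, np.nan, 6.0],
    "years_experience": [2, 5, 5, 10, 1],
})

n_rows, n_cols = applicants.shape              # size of the table
column_types = applicants.dtypes               # does each column have the expected type?
missing_per_column = applicants.isna().sum()   # missing values per column
n_duplicates = applicants.duplicated().sum()   # fully repeated rows

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(missing_per_column)
print(f"duplicate rows: {n_duplicates}")
```

Missing values or duplicate rows found at this stage are a signal that the data may need cleaning before any further analysis.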

---
### Techniques for EDA

- **Summary Statistics**:
    - Average, Median, Min, Max, Correlation, etc.
- **Data Visualization**:
    - Histograms, Box Plots, Scatter Plots, etc.
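
As a minimal sketch, the summary statistics listed above can be computed with pandas as follows. The `toy` DataFrame holds made-up values so that the example runs on its own; the column names mirror the ones used later in these notes.

```python
import pandas as pd

# Made-up measurements, only to make the example self-contained
toy = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0, 6.0],
    "sepal_width": [3.0, 3.5, 4.0, 3.5],
})

# One statistic at a time
mean_length = toy["sepal_length"].mean()
median_length = toy["sepal_length"].median()
min_length = toy["sepal_length"].min()
max_length = toy["sepal_length"].max()

# Several statistics for every column at once
summary = toy.agg(["mean", "median", "min", "max"])

# Pairwise Pearson correlation between the numeric columns
correlations = toy.corr()

print(summary)
print(correlations)
```

`DataFrame.describe()` is a common shortcut that reports count, mean, standard deviation, min, quartiles, and max in one call.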

---
### Tools for EDA

- **Data Wrangling**:
    - Pandas
- **Data Visualization**:
    - Matplotlib, Seaborn

---
### EDA: Job Applicant Summary Statistics
Suppose we want to examine characteristics of job applicants. 

- **Average**: We could look at the average of all interview scores (perhaps by city or job function).

- **Correlation**: We could look at the correlation between technical assessment scores and years of experience (perhaps by type of experience).

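
The two ideas above could be sketched with pandas as follows. The `applicants` DataFrame, its column names, and its values are all hypothetical and stand in for whatever applicant data is actually available.

```python
import pandas as pd

# Hypothetical applicant data; columns and values are invented for illustration
applicants = pd.DataFrame({
    "city": ["Austin", "Austin", "Boston", "Boston"],
    "interview_score": [6.0, 8.0, 7.0, 9.0],
    "tech_assessment": [55, 70, 58, 80],
    "years_experience": [1, 5, 3, 7],
})

# Average interview score within each city
avg_score_by_city = applicants.groupby("city")["interview_score"].mean()

# Pearson correlation between assessment score and years of experience
assessment_vs_experience = applicants["tech_assessment"].corr(
    applicants["years_experience"]
)

print(avg_score_by_city)
print(assessment_vs_experience)
```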
---
### Sampling from DataFrames

```python
# Sample 5 rows without replacement
sample = data.sample(n=5, replace=False)
# Show only the last three columns of the sample
print(sample.iloc[:, -3:])
```
![image.png](attachment:image.png)


- There are many reasons to draw random samples from a DataFrame:
    - For large data sets, a random sample can make computation easier.
    - We may want to train models on a random sample of the data.
    - We may want to over- or under-sample observations when outcomes are uneven.
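
The sketch below illustrates these cases on a small, made-up DataFrame (`toy`, with hypothetical `feature` and `outcome` columns). It assumes pandas 1.1 or later, where `groupby(...).sample` is available; `random_state` is set only to make the draws reproducible.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 negative outcomes, 2 positive
toy = pd.DataFrame({
    "feature": range(10),
    "outcome": [0] * 8 + [1] * 2,
})

# Reproducible 50% sample without replacement
half = toy.sample(frac=0.5, random_state=0)

# Under-sample: keep 2 rows from each outcome class
under = toy.groupby("outcome").sample(n=2, random_state=0)

# Over-sample: draw 8 rows per class, with replacement so that
# rows from the minority class can be repeated
over = toy.groupby("outcome").sample(n=8, replace=True, random_state=0)

print(len(half))
print(under["outcome"].value_counts())
print(over["outcome"].value_counts())
```

Both `under` and `over` end up with the same number of rows per class, which is one simple way to balance uneven outcomes before modeling.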


--- 
## Visualization Libraries

**Visualizations** can be created in multiple ways:

- Matplotlib
- Pandas (via Matplotlib)
- Seaborn
    - Statistically-focused plotting methods.
    - Global style settings that also apply to Matplotlib plots.
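
To illustrate what "Pandas (via Matplotlib)" means, here is a minimal sketch on made-up values (the `toy` DataFrame is an assumption, not the data used elsewhere in these notes). The pandas plotting method draws with Matplotlib and returns a Matplotlib `Axes`, so the usual Matplotlib customization still applies.

```python
import matplotlib.axes
import pandas as pd

# Made-up values, only to make the example self-contained
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
})

# pandas draws the plot with Matplotlib and returns the Axes it used
ax = toy.plot.scatter(x="sepal_length", y="sepal_width", alpha=0.5)

# The returned object accepts ordinary Matplotlib customization
ax.set_title("Sepal measurements")
print(isinstance(ax, matplotlib.axes.Axes))
```

In a notebook the figure is displayed inline; in a script, call `plt.show()` from `matplotlib.pyplot` to display it.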

---
### Basic scatter plots with Matplotlib

```python
# Matplotlib approach: pass DataFrame columns to plt.plot and draw markers only
import matplotlib.pyplot as plt
plt.plot(data.sepal_length,
         data.sepal_width,
         linestyle='none',
         marker='o',
         alpha=0.5)
plt.show()
```
![image-2.png](attachment:image-2.png)

---
### Scatter Plots with Multiple Layers

```python
# First plot statement
plt.plot(data.sepal_length,
         data.sepal_width,
         linestyle='none',
         marker='o',
         alpha=0.5,
         label='sepal')

# Second plot statement
plt.plot(data.petal_length,
         data.petal_width,
         linestyle='none',
         marker='o',
         alpha=0.5,
         label='petal')

# Show the labels in a legend, then render both layers
plt.legend()
plt.show()
```
![image-3.png](attachment:image-3.png)

---
### Histograms
```python
# Matplotlib approach: histogram of one DataFrame column
plt.hist(data.sepal_length, bins=25)
plt.show()
```
![image-4.png](attachment:image-4.png)

---
### Customizing Plots

```python
# Matplotlib object-oriented syntax
import numpy as np

fig, ax = plt.subplots()
# Horizontal bars for the first 10 sepal lengths
ax.barh(np.arange(10),
        data.sepal_length.iloc[:10])

# Bars are centered on 0..9 by default, so place one tick at each
# bar center and label the bars 1..10
ax.set_yticks(np.arange(10))
ax.set_yticklabels(np.arange(1, 11))
ax.set(xlabel='xlabel', ylabel='ylabel',
       title='Title')
plt.show()
```
![image-5.png](attachment:image-5.png)

--- 
### Customizing plots: by group

```python
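# Pandas DataFrame approach: plot the mean of each feature by species
# (the parentheses let the method chain span several lines)
(data.groupby('species').mean()
     .plot(color=['red', 'blue', 'green'],
           fontsize=15.0, figsize=(4, 4)))
plt.show()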
```
![image-6.png](attachment:image-6.png)

---
### Pair plots for features
    
```python
# Seaborn plot, feature correlations
import seaborn as sns

# `height` replaces the older `size` argument (renamed in seaborn 0.9)
sns.pairplot(data, hue='species', height=3)
plt.show()
```
![image-7.png](attachment:image-7.png)

![image-8.png](attachment:image-8.png)

---
### Seaborn Example: Hexbin plot

```python 
# Seaborn hexbin plot
sns.jointplot(x=data['sepal_length'],
              y=data['sepal_width'],
              kind="hex", color="k")    
plt.show()
```
![image-9.png](attachment:image-9.png)

---
### Seaborn Example: Facet Grid

```python
# Seaborn plot, Facet Grid
# First plot statement
plot = sns.FacetGrid(data,
                     col='species',
                     margin_titles=True)
plot.map(plt.hist, 'sepal_length', color='green')

# Second plot statement
plot = sns.FacetGrid(data,
                     col='species',
                     margin_titles=True)
plot.map(plt.hist, 'sepal_width', color='blue')

# Render both grids
plt.show()
```
![image-10.png](attachment:image-10.png)

![image-11.png](attachment:image-11.png)