# Exploratory Data Analysis

## Agenda

- The first hour of class will be a lesson on using the data visualization library Seaborn.
- The rest of class will be dedicated to an exploratory data analysis group exercise. You'll be divided into two groups, and each group will be assigned a different dataset. Your objective is to come up with business-oriented questions about the dataset and attempt to answer them using the EDA techniques we've learned so far in this class.

### Seaborn
["Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics."](https://seaborn.pydata.org/index.html)

In [1]:
#Imports

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#Load in the famous iris dataset.

path = "../data/iris.csv"
iris = pd.read_csv(path)
iris.head()


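| Unnamed: 0 | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |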


In [None]:
#Scatter plot of sepal_length and sepal_width in matplotlib



Let's import seaborn and run this code again.

Run "!pip install seaborn" in a Jupyter notebook cell to install the library if you don't have it.

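In seaborn versions before 0.8, importing seaborn automatically restyled your matplotlib plots. In current versions, call seaborn's set_theme() function after the import to apply its style.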
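
As a rough sketch of the restyled scatter plot, the cell below imports seaborn, applies its theme explicitly with `sns.set_theme()` (available in seaborn 0.11 and later), and redraws the sepal scatter plot in matplotlib. The `iris_sample` frame is a stand-in built from the five rows shown in the `iris.head()` output above, so that the example runs without the CSV file; in class you would use the full `iris` frame.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in for iris: only the five rows shown by iris.head() above
iris_sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
})

# Apply seaborn's default theme to every matplotlib plot drawn afterwards
sns.set_theme()

# Plain matplotlib scatter plot; only the styling comes from seaborn
fig, ax = plt.subplots()
ax.scatter(iris_sample["sepal_length"], iris_sample["sepal_width"])
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.set_title("Sepal length vs. sepal width")
```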

Let's take a stroll through the seaborn art gallery. 

First up, swarmplot.

In [None]:
#Call .swarmplot() on sns. Set x = "species", y = "petal_length", data = iris



The x variable should be a categorical variable that we use to group the data. The y variable should be continuous.
<br><br>
How does this compare to boxplot? Better or worse? Let's see how to make boxplots in seaborn.

This is the box plot version of the swarmplot above.
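
To compare the two plot types, here is a minimal sketch that draws a swarm plot and a box plot of the same groups side by side. The `flowers` frame and its petal lengths are hypothetical values made up for illustration, not the real iris measurements; with the class data you would pass `data = iris` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical petal lengths (not the real iris data), four flowers per species
flowers = pd.DataFrame({
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
    "petal_length": [1.3, 1.4, 1.5, 1.7, 3.9, 4.2, 4.5, 4.7, 5.1, 5.6, 5.9, 6.1],
})

fig, (ax_swarm, ax_box) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Swarm plot: one dot per row, nudged sideways so the dots do not overlap
sns.swarmplot(x="species", y="petal_length", data=flowers, ax=ax_swarm)

# Box plot: the same groups summarized by median, quartiles, and whiskers
sns.boxplot(x="species", y="petal_length", data=flowers, ax=ax_box)

# The median line inside each box can be checked directly with pandas
medians = flowers.groupby("species")["petal_length"].median()
print(medians)
```

The swarm plot shows every observation, which works well for small datasets; the box plot hides individual points but stays readable as the number of rows grows.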

In [None]:
#Call .boxplot() on sns. Set x = "species", y = "petal_length", data = iris



Grouped bar plots with another categorical variable.

In [None]:
#Load in the tips dataset

#Seaborn provides example datasets, including iris, through sns.load_dataset(), which downloads them on first use.


In [None]:
#Create grouped barplots x = "day", y = "total_bill", hue = "sex", data = tips



What does this show us?
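
One way to read a grouped bar plot: by default, `sns.barplot` draws the mean of `y` for each combination of `x` and `hue`, with an error bar around it. The sketch below uses a small hypothetical `bills` frame with the same column names as the tips dataset (the numbers are invented for illustration) and prints the group means so you can match them to the bar heights.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows with the same column names as the tips dataset
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Thur", "Fri", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"],
    "total_bill": [20.0, 24.0, 15.0, 17.0, 30.0, 26.0, 18.0, 22.0],
})

# Each bar's height is the mean total_bill of one (day, sex) group
group_means = bills.groupby(["day", "sex"])["total_bill"].mean()
print(group_means)

# hue splits every day into one bar per value of "sex"
fig, ax = plt.subplots()
sns.barplot(x="day", y="total_bill", hue="sex", data=bills, ax=ax)
```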

Violin plot version of the plots above.

Is this a better way to show the story of this data?

Swarm plot version

Bar plot version

Seaborn can fit a regression line to a scatter plot.
<br><br>
We're going to plot the amount of a restaurant check against the tip a waiter receives on that bill.

In [None]:
#Call lmplot() on sns. Set x = "total_bill", y = "tip", data = tips



For every 5 dollar increase in the bill, how much more in tips should a server receive? What about a 10 dollar increase? 

If a bill is 35 dollars, how much in tips should a waiter expect to receive? What about if the bill is 60 dollars?

How does the prediction for 35 dollars compare to 60 dollars?
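
The questions above can be answered numerically once you have the slope and intercept of the fitted line. Seaborn draws the line but does not hand back its coefficients, so this sketch refits a straight line with `np.polyfit`, which is a substitute for reading values off the chart, not what `lmplot` exposes. The bill and tip values are hypothetical, and `predict_tip` is a helper name introduced only for this example.

```python
import numpy as np

# Hypothetical bills and tips in dollars (not the real tips data)
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([2.0, 3.5, 4.5, 6.5, 7.5])

# Least-squares straight line: tip ~ intercept + slope * total_bill
slope, intercept = np.polyfit(total_bill, tip, deg=1)


def predict_tip(bill):
    """Predicted tip for a bill of the given size, using the fitted line."""
    return intercept + slope * bill


# The slope is the extra tip per extra dollar of bill
print(f"Extra tip for a $5 larger bill:  {5 * slope:.2f}")
print(f"Extra tip for a $10 larger bill: {10 * slope:.2f}")
print(f"Predicted tip on a $35 bill: {predict_tip(35):.2f}")
print(f"Predicted tip on a $60 bill: {predict_tip(60):.2f}")
```

Note that a 60 dollar bill lies outside the range of bills in this small sample, so that prediction is an extrapolation and deserves less trust than the one for 35 dollars.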

Let's go back to the iris dataset and look at the relationships between the numerical features.

In pandas we used a correlation matrix to look at the correlations between columns.

In [None]:
#Call .corr() on iris (on pandas 2.0+, pass numeric_only=True so the text column species is skipped)


What do you notice about the correlations of this data?

Now let's "visualize" this data using a heatmap.

In [None]:
#Call .heatmap on sns and pass in iris data correlations


Does that look better? Again, no right or wrong answer.
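
A minimal sketch of the correlation matrix and its heatmap, using a small illustrative `measurements` frame (six made-up rows, not the class CSV). The `annot`, `vmin`, `vmax`, and `cmap` settings are one reasonable choice, not the only one: fixing the color scale to the range -1 to 1 keeps colors comparable between heatmaps.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative rows only; with the class data you would use the iris frame
measurements = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# numeric_only=True leaves out the text column "species"
corr = measurements.corr(numeric_only=True)
print(corr)

# annot=True writes each correlation value inside its cell
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
```

If your frame still contains the "Unnamed: 0" column from the CSV, consider dropping it first, since it is a row counter and not a measurement.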

<b>Presenting my favorite plot in seaborn!!</b>

Pairplot: A matrix of scatter plots that plots every numerical column against every other. It can also color-encode the points by a categorical variable.

In [None]:
#Call .pairplot() on sns, pass in iris and set hue = "species"



What values does this provide us?

More plots in seaborn

Countplot

In [None]:
#Countplot: a bar plot showing the number of rows in each category

sns.countplot(x = "species", data= iris);

In [None]:
sns.countplot(x = "time", data= tips);

Distribution plot

In [None]:

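#distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable histogram with a density curve
sns.histplot(iris.sepal_length, kde=True);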

Density plot

In [None]:
#Pass x and y by keyword; recent seaborn versions reject a second positional argument
sns.kdeplot(x=tips.total_bill, y=tips.tip);

Joint distribution plot

In [None]:
sns.jointplot(x='sepal_width', y='sepal_length', data=iris);

Seaborn resources:

https://www.datacamp.com/community/tutorials/seaborn-python-tutorial#sm

https://seaborn.pydata.org/examples/index.html

https://elitedatascience.com/python-seaborn-tutorial#step-3

## Exploratory Data Analysis exercise!!!

<b>Objective:</b> 

- Each group will investigate the business-scenario dataset assigned to it.
- Your goal is to find points of interest that are useful to a business.
- Devise several questions that you will answer.
- Use the tools that we've learned so far in this class. Make lots of charts!!
- I will be available to help, but lean on each other too! Work as a team.

<b>Presentation:</b>

- At the end of class, each group will have ten minutes to present its findings.
- Go over the interesting things you learned from the project. Tell us the questions you came up with and the answers you got for them.
- Show us the visualizations you've made.
- You don't have to make a PowerPoint; you can show us a Jupyter notebook, but try to keep it organized.
- Decide who will be the presenter.

<b>Team Pandas:</b> Shivendra, Eric, Darren, Ana Varejao
<br><br>
Dataset: Employee churn dataset

Link to information: https://www.kaggle.com/ludobenistant/hr-analytics


In [None]:
#Team pandas dataset
path = "../data/HR_comma_sep.csv"

df = pd.read_csv(path)

<b>Team Numpy:</b> Shohba, Corey, Cedric, Anna Gadiraju

Dataset: Peer-to-peer loans dataset

Link to information: https://www.kaggle.com/wordsforthewise/lending-club

In [None]:
#Team numpy dataset

path = "../data/loansData.csv"

df = pd.read_csv(path)