# Data Visualization

In data science and analytics, data visualization is a fundamental tool for communicating information clearly and effectively. Statistical graphics and plots turn complex numerical information into meaningful insights and results. Visual elements such as charts, graphs, boxplots, histograms and heatmaps provide an easy way to interpret results, understand trends, detect patterns in data and spot outliers. Visual communication is both an art and a science and can be viewed as a branch of descriptive statistics. Here we explore two Python libraries for data visualization: <i>matplotlib</i> and <i>seaborn</i>.

#### <i>Matplotlib</i>

One of the most widely used Python libraries for data visualization is `matplotlib`. It was started in 2002 by John Hunter as a project inspired by MATLAB's plotting style. It can be used in combination with <i>NumPy</i> and `pandas`.

#### <i>Seaborn</i>

Seaborn is a Python library for data visualization. It is based on `matplotlib` and integrates well with `pandas`. It is particularly useful for <i>statistical</i> data visualization, as it provides a high-level interface for designing high-quality statistical plots. Some of its most useful functions are `regplot`, `distplot` and `pairplot`, among others.

<b>Useful links</b>: 
https://seaborn.pydata.org/
https://seaborn.pydata.org/examples/index.html
https://matplotlib.org/
https://matplotlib.org/tutorials/colors/colormaps.html



Let's start by importing the main libraries, generating a simple sequence of numbers and plotting it.


In [2]:


## import matplotlib
import matplotlib.pyplot as plt

import numpy as np

data = np.arange(100)
data

In [3]:
plt.plot(data)



Based on the plot above, what does the `plot` function assume to be the 'x' values and the 'y' values?


#### `ANSWER`

`plot` assumes the x values are the indices (positions) of the values in the `data` array. Since `data` holds the 100 values from 0 to 99 in steps of 1, each value equals its own index and the result is a perfect 45-degree line.
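The default x values can be checked directly. The following is a minimal sketch, assuming a small made-up array (`values` is illustrative and not part of the notebook): the `Line2D` object returned by `plot` exposes the x coordinates that matplotlib chose.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, chosen only for illustration
values = np.array([10, 20, 15, 30])

fig, ax = plt.subplots()
(line,) = ax.plot(values)      # only the y values are passed

x_used = line.get_xdata()      # x coordinates chosen by matplotlib
y_used = line.get_ydata()      # the values that were passed in

print(x_used)  # the positions 0, 1, 2, 3
print(y_used)  # the original values
plt.close(fig)
```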

Now that you understand that `plot` implicitly uses the indices of the array as 'x' values, plot a sequence that covers the range from 0 to 100 with only 20 points.

In [7]:

data = np.arange(0,100,5)
plt.plot(data, 'o')



Let's explore customization to understand how the plots can be visually more interesting than the previous ones. The main object in matplotlib is a `Figure`. We can create a figure such as: 
````
fig = plt.figure()
````
We can also add multiple subplots to that figure, by using: 
````
fig.add_subplot(2,2,1)
````
The expression above selects the \\(1\\)st position in a grid of \\(2\\) rows by \\(2\\) columns \\((2,2,1)\\). Let's take a closer look at the example below. These are the fundamental building blocks of plotting with matplotlib.

In [9]:
 

## Write Python code here

fig = plt.figure()
fig.add_subplot(2,2,1)
fig.add_subplot(2,2,2)
fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)
ax4.plot(np.random.randn(50).cumsum(), 'k--')

In the previous example, the line was drawn in the last subplot because `plot` was called on `ax4`, the Axes returned by `fig.add_subplot(2,2,4)`; the other three subplots were created without keeping a reference to them. To draw in a particular subplot, store the Axes returned by `add_subplot` in a variable (e.g. `ax1`, `ax2`, `ax3`, `ax4`) and call `plot` on that Axes. For instance:

````
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax1.plot(np.random.randn(50).cumsum(), 'k--')
````



Rewrite the previous plotting expression with the graph in the \\(2\\)nd subplot.

In [12]:
 

fig = plt.figure()
fig.add_subplot(2,2,1)
fig.add_subplot(2,2,3)
fig.add_subplot(2,2,4)
ax1 = fig.add_subplot(2,2,2)
ax1.plot(np.random.randn(50).cumsum(), 'k--')


Write a script to create a figure with \\(4\\) subplots.

In [14]:
 
fig = plt.figure()

fig.add_subplot(2,2,1).plot(np.random.randn(50).cumsum(), 'k--')
fig.add_subplot(2,2,2).plot(np.random.randn(50).cumsum(), 'k--')
fig.add_subplot(2,2,3).plot(np.random.randn(50).cumsum(), 'k--')
fig.add_subplot(2,2,4).plot(np.random.randn(50).cumsum(), 'k--')




A more convenient way to create the figure and its Axes is `plt.subplots()`. This function returns a tuple containing the figure and a NumPy array of Axes. For example:
````
fig, axes = plt.subplots(2,3)
````
The expression above creates a \\(2\\) by \\(3\\) grid of subplots (\\(6\\) subplots), each with its own Axes.

To draw in the third subplot of the first row, we just need the following script:

```Python
axes[0,2].plot(np.random.randn(50).cumsum(), 'k--')
```
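The object returned by `plt.subplots` can be inspected to see how the indexing works. This is a small sketch (the variable `n_axes` is introduced here only for illustration): the Axes come back as a NumPy array, so they are addressed as `axes[row, col]` with zero-based indices.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3)

print(type(axes).__name__)   # ndarray
print(axes.shape)            # (2, 3)

# axes[row, col] uses zero-based indices: row 0, column 2 is the
# third subplot of the first row
axes[0, 2].plot(np.random.randn(50).cumsum(), 'k--')

# axes.flat visits every Axes row by row
n_axes = sum(1 for _ in axes.flat)
print(n_axes)                # 6
plt.close(fig)
```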


In [16]:
 

## Write Python code here
fig, axes = plt.subplots(2,3)
axes[0,2].plot(np.random.randn(50).cumsum(), 'k--')
#plt.show()


Using the ``.subplots()`` function, create a \\(3 \times 3\\) grid of subplots and draw a random series in the center subplot.

In [18]:
 

## Write Python code here
fig, axes = plt.subplots(3,3)
axes[1,1].plot(np.random.randn(50).cumsum(), 'k--')
axes[1,1].set_xlim(1,30)

In [19]:
fig, axes = plt.subplots(3,3)
for axe in axes:
    for ax in axe:
        print(ax)

In [20]:
list_color = ['#0f0f0f80','#0f0f0f','g','or', 'g--', 'k--', '^k', 'b-.', 'pb']
i = 0
fig, axes = plt.subplots(3,3)
for axe in axes:
    for ax in axe:
        ax.plot(np.random.randn(50).cumsum(), list_color[i])
        ax.set_xlim(1,30)
        i += 1
plt.show()

# a complete list of format strings can be found at https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.pyplot.plot.html
# how colors work in matplotlib: https://matplotlib.org/tutorials/colors/colors.html
# it accepts RGB tuples, hex RGB strings, CSS4 colors, xkcd colors, Tableau colors, and others.

Here is a list of methods that are very useful when working with Axes:

Use the following code to set the axis labels and the plot title:
```Python
Axes.set_xlabel("X axis label")
Axes.set_ylabel("Y axis label")
Axes.set_title("Plot title")
```

If you want to limit the range shown on each axis, you can use the two methods below:
```Python
Axes.set_xlim(begin, end)
Axes.set_ylim(begin, end)

# begin: the lowest value shown on the axis
# end: the highest value shown on the axis
# Both can be integers or floats
```
If you need to change the tick labels and rotate them to fit better in the plot, use the following two methods:
```Python
Axes.set_xticklabels(labels, rotation = angle)
Axes.set_yticklabels(labels, rotation = angle)

# labels: list containing the new labels for all ticks
# angle: a number indicating the rotation angle in degrees
```


To add a legend in your plot:

```Python
Axes.legend(labels, loc = 'best')

# labels: optional, use it if you want to change the legend labels
# loc: legend location; it can be 'best', 'upper left', 'upper right', 'lower left', 'lower right', 'upper center', 'lower center', 'center left', 'center right', 'center'

```

If you think your plot is not fitting well in the figure area:
```
plt.tight_layout()
```
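The helpers above can be combined on a single Axes. The following is a minimal sketch with made-up data (the curve, labels and limits are illustrative choices, not values from the notebook). It fixes the tick positions with `set_xticks` before relabeling them, so that each label is attached to a known position.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
fig, ax = plt.subplots()
ax.plot(x, x ** 2, label="x squared")   # label is picked up by legend()

ax.set_xlabel("X axis label")
ax.set_ylabel("Y axis label")
ax.set_title("Plot title")

ax.set_xlim(0.5, 9.5)                   # limits may be floats
ax.set_ylim(0, 100)

# Fix the tick positions first, then relabel and rotate them
ax.set_xticks([1, 3, 5, 7, 9])
ax.set_xticklabels(["one", "three", "five", "seven", "nine"], rotation=45)

ax.legend(loc="upper left")
fig.tight_layout()
```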

The remaining examples use a dataset with credit card default information for clients in Taiwan. An entire modeling methodology is explored, starting from the basics of data exploration and treatment and ending with different techniques for predictive analytics (logistic regression, decision trees, gradient boosting, etc.) <br>

What follows is a brief description of the 25 variables:
<b>ID</b>: ID of each client
<b>LIMIT_BAL</b>: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
<b>SEX</b>: Gender (1 = male; 2 = female).
<b>EDUCATION</b>: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
<b>MARRIAGE</b>: Marital status (1 = married; 2 = single; 3 = others).
<b>AGE</b>: Age (year).

History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:

<b>PAY_0</b>:  the repayment status in September, 2005;
<b>PAY_2</b>: the repayment status in August, 2005; . . .;
<b>PAY_3</b>: . . .
<b>PAY_4</b>: . . .
<b>PAY_5</b>: . . .
<b>PAY_6</b>: the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

Amount of bill statement (NT dollar).

<b>BILL_AMT1</b>: amount of bill statement in September, 2005;
<b>BILL_AMT2</b>: amount of bill statement in August, 2005; . . .;
<b>BILL_AMT3</b>: . . .;
<b>BILL_AMT4</b>: . . .;
<b>BILL_AMT5</b>: . . .;
<b>BILL_AMT6</b>: amount of bill statement in April, 2005.

Amount of previous payment (NT dollar).

<b>PAY_AMT1</b>: amount paid in September, 2005;
<b>PAY_AMT2</b>: amount paid in August, 2005; . . .;
<b>PAY_AMT3</b>: . . .;
<b>PAY_AMT4</b>: . . .;
<b>PAY_AMT5</b>: . . .;
<b>PAY_AMT6</b>: amount paid in April, 2005;
<b>default.payment.next.month</b>: payment default (1 = yes; 0 = no)

<b>References/Sources:</b>

[1] UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
[2] Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.<br>
[3] Name: I-Cheng Yeh 
email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw 
institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan. 
other contact information: 886-2-26215656 ext. 3181 





In [23]:
print("Importing required libraries")
import pandas as pd
import os, boto3, subprocess, re, sys, gc
from botocore.client import Config

print("All libraries successfully loaded!")

kms_key = os.environ['AW_S3_ENCRYPTION_KEY']

bucket_name = os.environ['AW_S3_STORAGE_BUCKET']
storage_key = os.environ['AW_S3_STORAGE_KEY'] + '/awdata/rawfiles/'
full_s3_location = 's3://' + bucket_name + '/' + storage_key 
print("full_s3_location: '{}'".format(full_s3_location))

df= pd.read_csv(full_s3_location + "UCI_Credit_Card.csv", nrows = 100)
z.show(df)

The bar graph is commonly used for categorical or string variables and shows how many times each value of a variable appears.

When using ``matplotlib`` you can use the  ``bar`` or ``barh`` (horizontal bars) methods to make this type of graph. The syntax is defined as follows:


```Python
bar(x, height, width = 0.8, color = 'blue')
barh(y, width, height = 0.8, color = 'blue')
```


where:

* **x** (for ``barh``: **y**): sequence of the unique values of the categorical variable
* **height** (for ``barh``: **width**): number of times each unique value appears, i.e. the length of each bar
* **width** (for ``barh``: **height**): the thickness of each bar
* **color**: bar color. List of named colors: https://matplotlib.org/3.1.0/gallery/color/named_colors.html
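In matplotlib's own signatures, `bar` takes `(x, height, width=...)` while `barh` takes `(y, width, height=...)`, so the roles of the two size arguments are swapped. A minimal sketch with a made-up series of letters (the data and variable names are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data
letters = pd.Series(['A', 'A', 'B', 'C', 'C', 'C'])
counts = letters.value_counts().sort_index()   # A: 2, B: 1, C: 3
categories = counts.index.tolist()

fig, ax = plt.subplots(1, 2, figsize=(8, 3))

# Vertical bars: categories on x, counts as bar heights
v_bars = ax[0].bar(categories, counts.values, width=0.8, color='blue')

# Horizontal bars: categories on y, counts as bar widths (lengths)
h_bars = ax[1].barh(categories, counts.values, height=0.8, color='red')

v_heights = [bar.get_height() for bar in v_bars]
h_lengths = [bar.get_width() for bar in h_bars]
print(v_heights, h_lengths)
plt.close(fig)
```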


In [25]:
import numpy as np
import pandas as pd
df_bar = pd.DataFrame(np.array(['A','A','C','C','C','A','A','B','A','C','C','A','A','B','A']), columns = ['Letters'])

df_bar_gb = df_bar.groupby(['Letters']).size()
df_bar_gb

In [26]:
import pandas as pd
import matplotlib.pyplot as plt

df_bar = pd.DataFrame(np.array(['A','A','C','C','C','A','A','B','A','C','C','A','A','B','A']), columns = ['Letters'])

df_bar_gb = df_bar.groupby(['Letters']).size()

fig, ax = plt.subplots(1,2, figsize=(11, 4))

ax[0].bar(df_bar_gb.index.values, df_bar_gb)
ax[0].set_xlabel("Letter")
ax[0].set_ylabel("QTT")
ax[0].set_title("Quantity per letter - vertical")
ax[0].set_xticklabels(['a', 'b', 'c'], rotation = 45)

ax[1].barh(df_bar_gb.index.values, df_bar_gb, color = 'red')
ax[1].set_xlabel("QTT")
ax[1].set_ylabel("Letter")
ax[1].set_title("Quantity per letter - Horizontal")
ax[1].set_yticklabels(['a', 'b', 'c'], rotation = 90)


In the ``seaborn`` package we can use the ``barplot`` function to draw a bar plot. The syntax is as follows:

```Python
barplot(x, y, hue, data, ci, estimator, palette, ax)
barplot(x, y, hue, data, ci, estimator, palette, ax, orient = "h")
```

where:

* **x**, **y**, and **hue**: names of the columns you want to plot.
* **data**: dataset for plotting.
* **ci**: size of the confidence interval to draw around each estimate (seaborn 0.12 and later deprecate ``ci`` in favor of ``errorbar``).
* **palette**: color palette, see https://seaborn.pydata.org/tutorial/color_palettes.html
* **estimator**: statistical function to estimate within each category. By default it is the mean.
* **orient**: vertical by default; set it to "h" for a horizontal plot.
* **ax**: if you need to specify the Axes for the plot, use this option.
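The `estimator` aggregates `y` within each category of `x`, so the bar heights can be cross-checked with a pandas `groupby`. This is a sketch on a small made-up frame (the values below are invented for illustration and are not taken from the credit card dataset):

```python
import pandas as pd

# Hypothetical toy data that reuses two column names from the dataset
toy = pd.DataFrame({
    'SEX':       [1, 1, 2, 2, 2],
    'LIMIT_BAL': [20000, 40000, 10000, 50000, 90000],
})

# What estimator = np.mean would show: one bar per SEX value
mean_by_sex = toy.groupby('SEX')['LIMIT_BAL'].mean()

# What a counting estimator (len) or a summing estimator (sum) would show
count_by_sex = toy.groupby('SEX')['LIMIT_BAL'].size()
sum_by_sex = toy.groupby('SEX')['LIMIT_BAL'].sum()

print(mean_by_sex)
```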

To add/change some titles use:

```Python
.set(xlabel = "X axis title", ylabel = "Y axis title", title = "Plot title")
```

In [28]:
import seaborn as sns
fig, ax = plt.subplots(1,2, figsize = (15,7))

sns.barplot  (x = 'SEX'
            , y = 'LIMIT_BAL'
            , hue = 'PAY_0'  # optional
            , data = df
            , palette = 'Blues'
            , estimator = np.mean # Other options: "lambda x: len(x)", "sum"
            , ci=0
            , ax = ax[0]).set(title = "Sex and PAY_0 vs LIMIT_BAL")
            
            
            
sns.barplot  (x = 'PAY_0'
            , y = 'LIMIT_BAL'
            , hue = 'SEX'  # optional
            , data = df
            , palette = 'Blues'
            , estimator = np.mean # Other options: "lambda x: len(x)", "sum"
            , ci=0
            , ax = ax[1])
            
ax[1].legend(loc='best')
ax[1].set_title("PAY_0 and Sex vs LIMIT_BAL")

Which category of ``EDUCATION`` has the highest ``PAY_AMT1`` average? And which one has the highest max value?

In [30]:
import seaborn as sns

sns.barplot  (x = 'EDUCATION'
            , y = 'PAY_AMT1'
            , data = df
            , palette = 'Blues'
            , estimator = np.mean # Other options: lambda x: len(x), sum
            , ci=0).set(title = "EDUCATION vs AVG(PAY_AMT1)")
plt.show()

In [31]:
import seaborn as sns

sns.barplot  (x = 'EDUCATION'
            , y = 'PAY_AMT1'
            , data = df
            , palette = 'Reds'
            , estimator = lambda x: np.max(x) # Other options: lambda x: len(x), sum
            , ci=0).set(title = "EDUCATION vs MAX(PAY_AMT1)")

In [32]:
df.groupby(["EDUCATION"])["PAY_AMT1"].max() 

#gb_df = df.groupby(["EDUCATION"])[["PAY_AMT1"]].max() 
#gb_df.loc[gb_df.PAY_AMT1 == gb_df.PAY_AMT1.max(),:]



In [33]:
### Pandas + Matplotlib + Stacked Barplot 

fig, ax = plt.subplots(1)

df_pivot = df.groupby(["EDUCATION","SEX"])["PAY_AMT1"].mean().unstack().fillna(0)
df_pivot.plot(kind = 'bar', stacked = True, ax = ax)
plt.show()

# kind types: 'bar', 'barh', 'hist', 'box', 'scatter', and others

In [34]:
### Pandas + matplotlib + 100% Stacked Barplot 
fig, ax = plt.subplots(1)
df_pivot2 = df.groupby(["EDUCATION","SEX"])["PAY_AMT1"].mean().unstack().fillna(0)

df_pivot2['Row_SUM'] = df_pivot2.sum(axis = 1)
df_pivot2.loc[:,[1,2]].div(df_pivot2["Row_SUM"], axis=0)
df_pivot2.loc[:,[1,2]].div(df_pivot2["Row_SUM"], axis=0).plot(kind = 'bar', stacked = True, ax = ax)


A very common task in the daily life of a data scientist is bivariate analysis, that is, understanding the behavior of two variables together. To help with this, we can use the **Scatter Plot**, which matplotlib provides through the ``scatter`` method. The syntax is defined as follows:

```python
plt.scatter(x, y, c = 'blue', marker = 'o')
```

Where:

* **x** and **y**: the variables to be plotted. They need to have the same size.
* **c**: color of the markers
* **marker**: the shape of the markers. All available options are listed here: https://matplotlib.org/3.1.1/api/markers_api.html#module-matplotlib.markers
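The same-size requirement can be seen in a small experiment. A minimal sketch with synthetic data (the random values, the seed and the variable names are assumptions made for illustration): a valid call draws one marker per pair, while mismatched lengths raise a `ValueError`.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)      # seeded so the sketch is repeatable
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)

fig, ax = plt.subplots()
points = ax.scatter(x, y, c='blue', marker='o')
n_points = len(points.get_offsets())   # one (x, y) pair per marker

# Mismatched lengths are rejected
try:
    ax.scatter(x, y[:10])
    size_error = None
except ValueError as err:
    size_error = str(err)

print(n_points, size_error)
plt.close(fig)
```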

In [36]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1)

ax.scatter( x = df['PAY_AMT1'], y = df['BILL_AMT1'], c = 'red', marker = "D" )


ax.set_xlabel("PAY_AMT1")
ax.set_ylabel("BILL_AMT1")
ax.set_title("ScatterPlot: PAY vs BILL")
#ax.legend()
plt.show() #optional

We can do the same thing using ``seaborn`` with the following syntax:

```Python
sns.scatterplot(x,y, hue, data,  palette, marker, ax)
```

where:

* **x**, **y**, and **hue**: name of the columns you want to plot.
* **data**: Dataset for plotting.
* **palette**: Palette's color https://seaborn.pydata.org/tutorial/color_palettes.html
* **marker**: the style of the marker
* **ax**: if you need to specify the Axes for the plot, use this option.


In [38]:
import seaborn as sns

sns.scatterplot( x = 'PAY_AMT1', y = 'BILL_AMT1', data = df, palette = 'coolwarm')
plt.show()



In [39]:
import seaborn as sns

scat_df = df.loc[ (df['PAY_AMT2'] < 100000) & (df['PAY_AMT1'] < 100000), ]

fig, ax = plt.subplots(1)

sns.scatterplot( x = 'PAY_AMT1', y = 'PAY_AMT2', hue = 'default.payment.next.month', data = scat_df, palette = sns.color_palette("cubehelix", 2), ax = ax)
plt.show()


In [40]:
### Pandas + Matplotlib + Scatter 

fig, ax = plt.subplots(1)

df_pivot = df[["PAY_AMT1","LIMIT_BAL"]].plot(kind = 'scatter', x = 'PAY_AMT1', y = 'LIMIT_BAL', ax = ax)
plt.show()

# kind types: 'bar', 'barh', 'hist', 'box', 'scatter', and others

The histogram is very helpful when the goal is to understand the dispersion of a single variable or to compare the distributions of several variables. To create the plot, the range of the variable is divided into equally sized **bins** and the number of rows that fall in each bin is counted. It can be drawn with ``matplotlib`` using the following syntax:



```Python
plt.hist(x, bins, orientation, color , edgecolor, alpha )
```

where:

* **x**: Input variables
* **bins**: number of bins to compute
* **orientation**: can be 'horizontal' or 'vertical'
* **color**: plot color
* **edgecolor**: bin edge color
* **alpha**: float between 0 and 1. Lower values mean more transparency.
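The binning step can be reproduced with NumPy to see exactly what the bars represent. This sketch uses a short list of six values as an assumption for illustration; `np.histogram` returns the same counts and bin edges that `hist` draws.

```python
import matplotlib.pyplot as plt
import numpy as np

values = [4, 2, 2, 3, 1, 5]

# Three equal-width bins between the minimum (1) and the maximum (5)
counts, edges = np.histogram(values, bins=3)
print(edges)    # bin boundaries: 1, 2.33..., 3.67..., 5
print(counts)   # rows per bin: 3, 1, 2

fig, ax = plt.subplots()
plot_counts, plot_edges, _ = ax.hist(values, bins=3, color='blue',
                                     edgecolor='black', alpha=0.4)
plt.close(fig)
```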



In [42]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1)

ax.hist(  df['LIMIT_BAL'],  bins = 10, orientation = 'vertical', edgecolor='black', color = 'blue', alpha = 0.4)

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

plt.show()


The same can be done with the ``seaborn`` library using ``distplot`` (deprecated since seaborn 0.11 in favor of ``histplot`` and ``displot``), which has the following syntax:

```Python
sns.distplot(a, bins , hist, kde, norm_hist, color, hist_kws , ax)
```

where:

* **a**: Input variable
* **bins**: number of bins
* **hist**: if ``True`` will plot a histogram
* **kde**: ``True`` by default; if ``True``, a Gaussian kernel density estimate is drawn
* **norm_hist**: ``False`` by default; if ``True``, the histogram height shows a density rather than a count.
* **color**: Plot Color
* **hist_kws**: Use this if you want some matplotlib histogram options. For example, use ``hist_kws=dict(edgecolor="black")`` to use the edgecolor feature.
* **ax**: Optional, if provided will plot on the axis


In [44]:
import seaborn as sns
fig, ax = plt.subplots(1)

sns.distplot(df['LIMIT_BAL'], bins = 30 , kde = True, hist_kws=dict(edgecolor="k") )
plt.show()


Using histograms, compare the distribution of the variable **LIMIT_BAL** by **default.payment.next.month** when **LIMIT_BAL** values are lower than 600,000.00

In [46]:
import seaborn as sns

PAY_00 = df.loc[ (df['default.payment.next.month'] == 0) & (df['LIMIT_BAL'] < 600000),'LIMIT_BAL']
PAY_01 = df.loc[(df['default.payment.next.month'] == 1) & (df['LIMIT_BAL'] < 600000),'LIMIT_BAL']

fig, ax = plt.subplots(1)
sns.distplot(PAY_00, bins = 30, ax = ax , kde = True, norm_hist = True)
ax.set_xlim(1,600000)

sns.distplot(PAY_01, bins = 30, ax = ax , kde = True, norm_hist = True)


In [47]:
### Pandas + Matplotlib + Histogram

fig, ax = plt.subplots(1)

df_pivot = df[["LIMIT_BAL"]].plot(kind = 'hist', ax = ax, bins = 30)
plt.show()

# kind types: 'bar', 'barh', 'hist', 'box', 'scatter', and others

In [48]:
l = [4,2,2,3,1,5]
fig, ax = plt.subplots(1,3, figsize = (10,5))

sns.distplot(l, norm_hist=True, kde = False, bins=3, ax = ax[0])

weights = np.ones_like(np.array(l))/(len(np.array(l)))

sns.distplot(l, bins=3, kde = False, ax = ax[1], hist_kws = {'weights':weights})


ax[2].hist(l, weights=weights, bins = 3)
plt.show()

In [49]:
# if norm_hist = True then the y axis will represent the density (and not a count or probability)
# This is implied if a KDE or fitted density is plotted.

# The normalization makes the sum of the bar heights times the bar widths equal to 1. The y-axis therefore does not show probabilities, but the total area of the bars is equal to 1.

# in this particular case we have:
bar_widths = (5 - 1)/3
print("Every bar has a width of " + str(bar_widths) )


y_axis = [i.get_height() for i in sns.distplot(l, kde=False,norm_hist = True, bins=3).patches]

print("Y_axis:"+str(y_axis))

print("Sum of all bar heights is "+ str(sum(y_axis)))
print("Final Plot Density " + str(sum(y_axis) * bar_widths))
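The same normalization can be checked with NumPy alone, without drawing anything. A minimal sketch, assuming the same six values as in the cell above: each density is the bin count divided by the number of values times the bin width, so the bar areas add up to 1.

```python
import numpy as np

values = [4, 2, 2, 3, 1, 5]

counts, edges = np.histogram(values, bins=3)
density, _ = np.histogram(values, bins=3, density=True)
widths = np.diff(edges)                      # every bin is (5 - 1) / 3 wide

# density = count / (number of values * bin width)
by_hand = counts / (len(values) * widths)

print(density)                    # 0.375, 0.125, 0.25
print((density * widths).sum())   # total area, 1 up to rounding
```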

We have previously seen some graphs that assist in the analysis of the dispersion of a single variable such as the **Histogram** plot. Another way to analyze data dispersion is through the Boxplot, which has a very different way of construction, as shown in the figure below:

![Imgur](https://i.imgur.com/LUZgGM5.png)

Before drawing the plot, matplotlib computes the main quartiles, the minimum and the maximum of the data. An interesting point is the definition of outliers, as this may vary from case to case. By default, matplotlib treats as an outlier any value that lies more than \\(1.5\\) times the interquartile range (the distance between \\(Q1\\) and \\(Q3\\)) below \\(Q1\\) or above \\(Q3\\).

For example, if \\(Q1 = 5\\) and \\(Q3 = 8\\), the interquartile range is \\(3\\). Any value is then considered an outlier if it is:

* Greater than \\( 8 + (3 \times 1.5) = 12.5 \\)
* Lower than \\( 5 - (3 \times 1.5) = 0.5 \\)
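The thresholds from this example can be reproduced numerically. The sketch below uses a made-up sample constructed so that \\(Q1 = 5\\) and \\(Q3 = 8\\) (the sample and variable names are assumptions for illustration) and compares the hand-computed outliers with the fliers that matplotlib draws with its default `whis = 1.5`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample built so that Q1 = 5 and Q3 = 8
sample = np.array([-2, 5, 5, 6, 7, 8, 8, 9, 30])

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr     # 0.5
upper = q3 + 1.5 * iqr     # 12.5
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)

# matplotlib marks the same points as fliers with its default whis=1.5
fig, ax = plt.subplots()
box = ax.boxplot(sample)
fliers = np.sort(box['fliers'][0].get_ydata())
plt.close(fig)
```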

To use this graph we can use the `boxplot` method from ``matplotlib``, which has the following syntax:



```python
boxplot(x, vert = True, whis = 1.5)
```

where:

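* **x**: array or sequence of arrays. If it contains multiple arrays, one boxplot is drawn for each
* **vert**: if ``True``, the boxes are drawn vertically
* **whis**: reach of the whiskers as a multiple of the interquartile range; points beyond the whiskers are drawn as outliers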



In [51]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1)

ax.boxplot( [df['PAY_AMT1'],df['LIMIT_BAL'] ] )



ax.set_xticklabels(['PAY_AMT1', 'LIMIT_BAL'])
plt.show()


Using the same data, create two separate boxplots side by side.

In [53]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,2, figsize = (15,5))

ax[0].boxplot(df['LIMIT_BAL'])
ax[0].set_ylim(0,1001000)
ax[0].set_xticklabels(['LIMIT_BAL'], fontsize=12)

ax[1].boxplot(df['PAY_AMT1'])
ax[1].set_ylim(0,1001000)
ax[1].set_xticklabels(['PAY_AMT1'], fontsize=12)

plt.show()

With ``seaborn`` we can draw a boxplot using the following syntax:

```Python
sns.boxplot(x, y, hue, data, whis = 1.5, color, ax)
```

Where:

* **x**, **y** and **hue** : name of the columns you want to plot.
* **data**: dataframe name
* **whis**: outlier identification metric
* **color**: Plot color
* **ax**: if you need to specify the Axes for the plot, use this option.

In [55]:
fig, ax = plt.subplots(1)
sns.boxplot(x = 'EDUCATION', y=  'PAY_AMT1', hue = 'SEX', data = df, ax = ax)
plt.show()

Adjust the axis limits so you can see better what is happening in the plot.




In [57]:
fig, ax = plt.subplots(1)

sns.boxplot(x = 'EDUCATION', y=  'PAY_AMT1', hue = 'SEX', data = df, ax = ax)
ax.set_ylim(0,15000)
plt.show()


In [58]:
### Pandas + Matplotlib + BoxPlot 

fig, ax = plt.subplots(1)

df_pivot = df[["PAY_AMT1","LIMIT_BAL"]].plot(kind = 'box', ax = ax)
plt.show()

# kind types: 'bar', 'barh', 'hist', 'box', 'scatter', and others

Together with the bar graph, the pie graph is an excellent choice when we want to look at categorical data to discover and compare the amounts of values between categories of the same variable.

To perform this representation in ``matplotlib`` we can use the ``pie`` method, which has the following standard syntax:

```Python
plt.pie(x)
```

where:

* **x** : Array or sequence of values representing the weight of each category

To help improve the pie chart we can add some options:

* **labels** : Sequence of labels
* **autopct** : Number format that will appear in the graph
* **pctdistance**: Distance from the center that the numbers will appear
* **labeldistance**: distance from the center that the labels will appear
* **startangle**: Initial angle to start the pie plot
* **explode**: sequence of values indicating how far each wedge is pulled away from the center of the graph
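These options can be tried on a tiny example. A minimal sketch with made-up category counts (the numbers and labels are assumptions for illustration): when `autopct` is given, `pie` also returns the percentage texts, which makes it easy to see what will be printed inside each wedge.

```python
import matplotlib.pyplot as plt

# Hypothetical category counts
sizes = [5, 3, 2]
labels = ['married', 'single', 'others']

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    sizes,
    labels=labels,
    autopct='%1.1f%%',       # percentage text inside each wedge
    pctdistance=0.6,
    labeldistance=1.1,
    startangle=90,
    explode=(0, 0.1, 0),     # pull the second wedge away from the center
)
shown_pct = [t.get_text() for t in autotexts]
print(shown_pct)
plt.close(fig)
```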

In [60]:
df_pie_gb = df.groupby("MARRIAGE").size()/df.shape[0]

fig, axes = plt.subplots(1,2, figsize = (10,4))

axes[0].pie(x = df_pie_gb)
axes[0].set_title("Simple Pie Chart")


explode = (0, 0.1, 0)
axes[1].pie(  x = df_pie_gb
            , labels = df_pie_gb.index 
            , autopct='%1.1f%%'
            , pctdistance=0.6
            , labeldistance=1.1
            , startangle=90
            , explode=explode)

axes[1].set_title("Pie Chart + options")

plt.show()

A heatmap is another great tool to analyse the interaction between two categorical variables. It is available in ``seaborn``:

```Python
sns.heatmap(data, annot )
```

where:

* **data**: rectangular data, such as a cross table.
* **annot**: if ``True``, the value of each cell of ``data`` is written inside that cell.


In [62]:
matrix_count = np.round(df.groupby(["default.payment.next.month",'PAY_0']).size().unstack().fillna(0)/ df.shape[0],2)
matrix_count

In [63]:
matrix_count = np.round(df.groupby(["default.payment.next.month",'PAY_0']).size().unstack().fillna(0)/df.shape[0],2)
sns.heatmap(matrix_count,  annot = True)

In [64]:
fig, ax = plt.subplots(1,1,figsize=(16,10))

corr = df.astype('int64').corr(method='pearson')

mask = np.zeros_like(corr, dtype=bool)  # builtin bool; the np.bool alias is unavailable in some NumPy versions
print("Mask Shape"+str(mask.shape))
mask[np.triu_indices_from(mask)] = True

# np.triu_indices_from(arr) return the indices for the upper-triangle of arr.

import seaborn as sns
sns.heatmap(corr,
            vmin=-1,
            vmax=1.0,
            mask=mask,
            cmap='RdBu',
            ax=ax,
            annot=True,
            fmt='.2f',
            annot_kws={'size':10,'weight':'bold'})
plt.tight_layout()

# where:
# vmin and vmax are the minimum and maximum values (respectively) to anchor the colormap
# cmap sets the colormap used to color the cells
# if annot = True, it will annotate each cell with its data value
# fmt is used to specify the number format in the plot
# mask needs to receive a boolean mask and it will "delete" (from the plot) all the positions with value equal True
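The effect of the upper-triangle mask can be inspected on a small matrix before passing it to a heatmap. This is a sketch on a made-up three-column frame (the data and the `visible` variable are assumptions for illustration): cells where the mask is `True` are hidden, which leaves only the lower triangle.

```python
import numpy as np
import pandas as pd

# Hypothetical data that yields a 3 x 3 correlation matrix
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 1, 4, 3], 'c': [4, 3, 2, 1]})
corr = toy.corr(method='pearson')

mask = np.zeros(corr.shape, dtype=bool)       # all False: nothing hidden
mask[np.triu_indices_from(mask)] = True       # hide diagonal and upper triangle
print(mask)

# Cells that stay visible when this mask is passed to a heatmap
visible = corr.where(~mask)
print(visible)
```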


In [65]:
g =  sns.pairplot(df[['PAY_AMT1', 'PAY_AMT2']] , height=4)  # 'size' was renamed to 'height' in seaborn 0.9
g.fig.draw(
    g.fig.canvas.get_renderer()
)  # required, as matplotlib calculates ticks during draw time
for ax in g.axes.flat: 
    for label in ax.get_xticklabels():
        label.set_rotation(45)
    

In [66]:
fig, ax = plt.subplots(1,2, figsize = (13,5))

sns.violinplot(x = 'LIMIT_BAL', data = df, ax = ax[0]).set(title = 'Violin Plot')

sns.boxenplot(x = 'LIMIT_BAL', data = df, ax = ax[1]).set(title = 'Boxen Plot')
plt.show()

Create a function that plots a single variable with a plot type that fits its data type. The function needs to receive:

1. The DataFrame
2. The name of the variable to plot
3. A string that indicates the data type (only two values are accepted: "numeric" or "string")


``NOTE`` - if you need more inputs it's ok.

In [68]:
def make_plot(data, idvar, var, type):
    if type == 'string':

        fig, ax = plt.subplots(1, figsize = (4,3))
        sns.barplot  (x = var
                    , y = idvar
                    , data = data
                    , palette = 'Blues'
                    , estimator = lambda x: len(x) # Other options: lambda x: len(x), sum
                    , ci=0
                    ,ax = ax).set(title = var + " Barplot")    
        fig.tight_layout()

    if type == 'numeric':

        fig, ax = plt.subplots(1, figsize = (4,3))
        sns.distplot( data[var]
                    , bins = 30 
                    , kde = False
                    , hist_kws=dict(edgecolor="k") ).set(title = var + " Histogram")                       
        fig.tight_layout()

Create a function that plots every variable of a given DataFrame (use the function from the previous exercise) under the following conditions:

1. The function needs to guess the data type of each column of the given DataFrame (use this rule: a variable is treated as a string if it has fewer than 15 unique values)
2. The function needs to receive a DataFrame as input

``NOTE`` - if you need more inputs it's ok.
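The type-guessing rule can be sketched on its own before it is wired into the plotting function. The helper name `guess_types`, its `max_unique` parameter and the toy frame are hypothetical and only illustrate the rule; note that `nunique` ignores missing values.

```python
import pandas as pd

def guess_types(df, max_unique=15):
    """Label a column 'string' when it has fewer than max_unique distinct
    values, otherwise 'numeric' (the rule stated in the exercise)."""
    return {col: 'string' if df[col].nunique() < max_unique else 'numeric'
            for col in df.columns}

# Hypothetical frame: 'grade' has 3 distinct values, 'amount' has 20
toy = pd.DataFrame({
    'grade':  [1, 2, 3, 1] * 5,
    'amount': range(100, 120),
})
guessed = guess_types(toy)
print(guessed)   # {'grade': 'string', 'amount': 'numeric'}
```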

In [70]:
def make_plot(data, idvar, var, type, ax):
    if type == 'string':
        sns.barplot  (x = var
                    , y = idvar
                    , data = data
                    , palette = 'Blues'
                    , estimator = lambda x: len(x) # Other options: lambda x: len(x), sum
                    , ci=0
                    ,ax = ax).set(title = var + " Barplot")    

    if type == 'numeric':
        sns.distplot( data[var]
                    , bins = 30 
                    , kde = False
                    , hist_kws=dict(edgecolor="k")
                    , ax = ax).set(title = var + " Histogram")  # draw on the Axes passed in



import seaborn as sns
import matplotlib.pyplot as plt

def plot_maker(df, id_var = "ID",  guess_type = True, all = True):
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    if all:
        if guess_type:
            columns = df.columns.values 
            unique_cols = ['string' if len(np.unique(df[str(col)])) < 15 else 'numeric' for col in columns ]
            
            
            
            for col in zip(columns, unique_cols):
                fig, ax = plt.subplots(1, figsize = (5,2))
                
                make_plot(df, id_var, col[0], col[1],ax)
                
                fig.tight_layout()

plot_maker(df)


Create a function with the following conditions:

1. Receive a maximum of 6 column names from a given DataFrame as input
2. Return an error message if no column names are given
3. Create the grid that best fits the number of variables (for example, with 5 variables it is better to create a 2 x 3 plot grid)
4. Guess the data types (use the same rule as in the previous exercise)
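One possible way to pick the grid for condition 3 is to compute a near-square shape from the number of variables instead of listing every case by hand. The helper `grid_shape` below is a hypothetical sketch, not the only valid choice; for 1 to 6 plots it yields (1, 1), (1, 2), (2, 2), (2, 2), (2, 3) and (2, 3).

```python
import math

def grid_shape(n_plots):
    """Smallest near-square grid (rows, cols) with room for n_plots."""
    n_cols = math.ceil(math.sqrt(n_plots))
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols

shapes = {n: grid_shape(n) for n in range(1, 7)}
print(shapes)
```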

In [72]:
reference_ = [(6,(2,3)),(5,(2,3)),(4,(2,2)),(3,(2,2)),(2,(1,2)),(1,(1,1))]
    
[i[1] for i in reference_ if i[0] == 5][0]

In [73]:
def make_plot(data, idvar, var, type, ax):
    if type == 'string':
        sns.barplot  (x = var
                    , y = idvar
                    , data = data
                    , palette = 'Blues'
                    , estimator = lambda x: len(x) # Other options: lambda x: len(x), sum
                    , ci=0
                    ,ax = ax).set(title = var + " Barplot")    

    if type == 'numeric':
        sns.distplot( data[var]
                    , bins = 30 
                    , kde = False
                    , hist_kws=dict(edgecolor="k")
                    , ax = ax).set(title = var + " Histogram")  

import seaborn as sns
import matplotlib.pyplot as plt

def plot_maker(df, id_var = "ID",  guess_type = True, list_vars = [] , not_consider = None):
    import seaborn as sns
    import matplotlib.pyplot as plt

    if list_vars == []:
        print("list_vars need to have at least len equal 1")
        return None
        
    if len(list_vars) > 6:
        print("list_vars cannot be greater than 6")
        return None

    reference_ = [(6,(2,3)),(5,(2,3)),(4,(2,2)),(3,(2,2)),(2,(1,2)),(1,(1,1))]
    
    nRows, nCols = [i[1] for i in reference_ if i[0] == len(list_vars)][0]

    if guess_type:
        columns = df[list_vars].columns.values
        
        unique_cols = ['string' if len(np.unique(df[str(col)])) < 15 else 'numeric' for col in columns ]
        zip_obj = zip(columns, unique_cols)
        i = 0
        # squeeze = False always returns a 2-D array of Axes, even for a 1 x 1 or 1 x 2 grid
        fig, ax = plt.subplots(nRows, nCols, squeeze = False)
        
        for r in range(nRows):
            for c in range(nCols):
                if i==len(list_vars):
                    # all variables are plotted: remove the unused subplot
                    fig.delaxes(ax[r,c])
                    break

                make_plot(df, id_var,columns[i], unique_cols[i], ax = ax[r,c])
                i+=1
        fig.tight_layout()

plot_maker(df.loc[0:1000,:], id_var = "ID", list_vars = ["PAY_0","PAY_2","PAY_3","PAY_4","PAY_5"])


