# Data Visualization

**The ability to produce meaningful and insightful data visualizations is an essential part of your skill set as a data analyst. Data visualization is the graphical representation of information and data. By using visual techniques such as graphs and maps, it provides an easier way to see and understand trends, outliers, correlations, and patterns that might go undetected in raw data. In the world of Big Data, data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions. Data visualization is also a good technique to use before performing any formal analysis.**





# An introduction to seaborn

**Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures.**

**I recommend keeping the data and the Jupyter notebook in the same folder.**
**If you have not done so, either provide the full path of the file or change the working directory as shown below.**

In [None]:
# to change the working directory
# this is just an example, I have commented this code. In case you want to do it, use your folder location in chdir
import os
#os.chdir('G:\\My Drive\\TCNJ Courses\\IST350Spring2018\\IST350Spring2020\\Week5-6')
os.getcwd()
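
As an alternative to changing the working directory, the path to the file can be built explicitly. This is only a minimal sketch: `data_dir` is a placeholder that you would point at your own folder, and here it simply defaults to the current working directory.

```python
from pathlib import Path

# Placeholder folder: replace with the folder that holds your data file.
data_dir = Path.cwd()

# Build the full path without changing the working directory.
csv_path = data_dir / 'Pokemonstudent.csv'

# Check before reading so a wrong folder fails with a clear message.
if csv_path.exists():
    print('Found:', csv_path)
else:
    print('Not found, check the folder:', csv_path)
```

The resulting `csv_path` can be passed directly to `pd.read_csv`.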

## To install the Python packages on your system
1. ! pip install seaborn
2. ! pip install plotly
3. ! pip install cufflinks

1. Matplotlib is one of the most popular libraries for plotting
2. seaborn is another library, built on top of matplotlib


In [None]:
# to check the installed packages, this may take some time (couple of seconds) to execute
#help('modules')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# allow you to see the plot inside the jupyter notebook
%matplotlib inline
import seaborn as sns

In [None]:
# Visit this site to learn more about seaborn
import webbrowser
webbrowser.open('https://seaborn.pydata.org/index.html')

# Categorical Data Plots from the perspective of the seaborn library

Here we will discuss using seaborn to plot categorical data! There are a few main plot types for this:

* barplot (a bar chart is usually based on counts, but seaborn's version plots an aggregate of a quantitative variable for each category)
* countplot
* boxplot (a plot of a quantitative variable, which seaborn can split by the levels of a qualitative variable)
* violinplot (a plot of a quantitative variable, which seaborn can split by the levels of a qualitative variable)
* stripplot
* swarmplot


**We will work with the Pokemon data to explore seaborn's data visualisation capabilities.**

Pokemon is a TV series that has expanded into video games, card games, movies, merchandise and everything in between. The motivation behind this analysis is to understand the dynamics of the Pokemon universe through data visualisation.

**Source of the data**
https://www.kaggle.com/code/mmetter/pokemon-data-analysis-tutorial/data
**Variables**
* Name: The English name of the Pokemon
* Type 1: The Primary Type of the Pokemon
* Type 2: The Secondary Type of the Pokemon
* Total:
* HitPoints: The Base HP of the Pokemon
* Attack: The Base Attack of the Pokemon
* Defense: The Base Defense of the Pokemon
* SpecialAttack: The Base Special Attack of the Pokemon
* SpecialDefense: The Base Special Defense of the Pokemon
* Speed: The Base Speed of the Pokemon
* Generation: The numbered generation in which the Pokemon was first introduced
* Legendary: Denotes if the Pokemon is legendary.



In [None]:
# Read the data
pokemon = pd.read_csv('Pokemonstudent.csv',index_col = 0)
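
To see what `index_col = 0` does, here is a small sketch on an in-memory file. The three-row CSV text and its `#` first column are assumptions made for illustration; the real `Pokemonstudent.csv` may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical three-row file; the real Pokemonstudent.csv has more columns.
csv_text = """#,Name,Type 1,Attack
1,Bulbasaur,Grass,49
2,Ivysaur,Grass,62
4,Charmander,Fire,52
"""

# index_col=0 turns the first column into the row index instead of a data column.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.columns.tolist())   # column names, including the space in 'Type 1'
print(toy.loc[4, 'Name'])     # rows are looked up by the first column's values
```

Because the first column becomes the index, it no longer appears among the data columns.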

In [None]:
# Basic information about the variables
pokemon.info()

In [None]:
pokemon.head()

## barplot and countplot

These two very similar plots summarise data for each level of a categorical feature. **barplot** is a general plot that aggregates a numeric variable within each category using some function, by default the mean.

**countplot** is essentially the same as barplot, except that the bar height is the number of occurrences of each category, which is why we only pass the x value:

In [None]:
pokemon['Type 1'].unique()

In [None]:
pokemon['Type 2'].unique()

In [None]:
pokemon.groupby('Type 1')[['Attack']].mean()


In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x='Type 1',y='Attack',data=pokemon)
#A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle 
#and provides some indication of the uncertainty around that estimate using error bars.
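
The bar heights and error bars can be reproduced by hand. The sketch below uses a small made-up table rather than the course data, and assumes seaborn's default error bar, a 95% bootstrap interval around the mean; the resampling details are illustrative, not seaborn's exact internals.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Pokemon types.
toy = pd.DataFrame({
    'Type 1': ['Grass'] * 4 + ['Fire'] * 4,
    'Attack': [49, 62, 82, 55, 52, 64, 84, 41],
})

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
summary = {}

for name, values in toy.groupby('Type 1')['Attack']:
    data = values.to_numpy()
    # Bar height: the mean of the group.
    mean = data.mean()
    # Error bar: resample the group with replacement many times,
    # then take the middle 95% of the resampled means.
    resampled = rng.choice(data, size=(1000, len(data)), replace=True)
    low, high = np.percentile(resampled.mean(axis=1), [2.5, 97.5])
    summary[name] = (mean, low, high)
    print(f'{name}: mean={mean:.1f}, 95% interval=({low:.1f}, {high:.1f})')
```

The means match what `groupby(...).mean()` returns; the interval is what the error bar approximates.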

In [None]:
# Create a Bar Plot between Type 1 and Defense

In [None]:
pokemon['Type 1'].value_counts().sort_index()

In [None]:
sns.countplot(x='Type 1',data=pokemon)

In [None]:
# Create a count plot for Type 2

## boxplot and violinplot

Box plots and violin plots show the distribution of a quantitative variable across the levels of a categorical variable. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
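
The numbers behind a box plot can be computed directly. This sketch uses made-up values and assumes the common 1.5 × IQR rule for the whiskers, which is the default in matplotlib and seaborn.

```python
import numpy as np

# Made-up values with one obviously extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# The box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print('box:', q1, median, q3)
print('whiskers:', whisker_low, whisker_high)
print('outliers:', outliers)
```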

In [None]:
#plt.figure(figsize=(12,8))
sns.boxplot(x='Type 1', y='Attack', data=pokemon)
# Rotate x-labels
plt.xticks(rotation=45)

#Add the Palette
#palette = 'Set3'
#palette="coolwarm"
#sns.boxplot(x='Type 1', y='Attack', data=pokemon, palette='rainbow')


In [None]:
# Violin plots combine the box plot with a kernel density estimate.
# Kernel Density Estimation (KDE) is a way to estimate the probability density function of
# a continuous random variable. It is used for non-parametric analysis.
#
# The quartile and whisker values from the boxplot are shown inside the violin. As the
# violin plot uses KDE, the wider portion of the violin indicates higher density and the narrow
# region represents relatively lower density. The inter-quartile range in the boxplot and the
# higher-density portion of the KDE fall in the same region of each category of the violin plot.
# Set theme
plt.figure(figsize=(12,8))
sns.set_style('whitegrid')
 
# Violin plot is alternative to box plot
sns.violinplot(x='Type 1', y='Attack', data=pokemon)
#  Dragon types tend to have higher Attack stats than Ghost types, but they also have greater variance
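
The kernel density estimate that shapes each violin can be sketched in a few lines of NumPy. The sample values and the bandwidth below are assumptions chosen for illustration; seaborn picks the bandwidth automatically.

```python
import numpy as np

# Made-up Attack-like values for a single category.
samples = np.array([45.0, 50.0, 52.0, 60.0, 65.0, 80.0, 100.0])
bandwidth = 8.0                        # assumed smoothing width
grid = np.linspace(0.0, 150.0, 601)    # points where the density is evaluated

# One Gaussian bump per observation, centred on that observation.
z = (grid[:, None] - samples[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The KDE is the average of the bumps; its height sets the violin's width.
density = bumps.mean(axis=1)

step = grid[1] - grid[0]
area = float(density.sum() * step)
peak = float(grid[density.argmax()])
print('area under the curve:', round(area, 3))
print('densest point:', peak)
```

A larger bandwidth gives a smoother, wider violin; a smaller one follows the individual points more closely.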

In [None]:
# Box plot and violin plot using the Type 2 and Attack

In [None]:
# You can use a custom color palette.
# The hexadecimal color code for each Pokemon type is listed on this site:
#https://bulbapedia.bulbagarden.net/wiki/Category:Type_color_templates
import webbrowser
webbrowser.open('https://bulbapedia.bulbagarden.net/wiki/Category:Type_color_templates')

In [None]:
pkmntypecolors = ['#78C850',  # Grass
                    '#F08030',  # Fire
                    '#6890F0',  # Water
                    '#A8B820',  # Bug
                    '#A8A878',  # Normal
                    '#A040A0',  # Poison
                    '#F8D030',  # Electric
                    '#E0C068',  # Ground
                    '#EE99AC',  # Fairy
                    '#C03028',  # Fighting
                    '#F85888',  # Psychic
                    '#B8A038',  # Rock
                    '#705898',  # Ghost
                    '#98D8D8',  # Ice
                    '#7038F8',  # Dragon
                   ]
# Pass this list as the palette argument of a seaborn plot

In [None]:
# Violin plot of Attack by Type 2 using the palette defined above
# Set theme
plt.figure(figsize=(12,8))
sns.set_style('whitegrid')
sns.violinplot(x='Type 2', y='Attack', data=pokemon, palette=pkmntypecolors)

# factorplot/catplot
catplot (named factorplot before seaborn 0.9) is the most general form of a categorical plot. Its kind parameter selects the plot type, for example 'bar', 'count', 'box' or 'swarm'.
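
A quick way to see the effect of `kind` is to draw the same data several ways. The sketch below uses a small made-up table instead of the course data; the list of kinds is just a sample of the values catplot accepts.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data: three Pokemon per generation.
toy = pd.DataFrame({
    'Generation': [1, 1, 1, 2, 2, 2],
    'SpecialDefense': [65, 80, 100, 50, 63, 85],
})

grids = []
for kind in ['strip', 'box', 'violin', 'bar']:
    # Same x, y and data every time; only the kind of plot changes.
    g = sns.catplot(x='Generation', y='SpecialDefense', data=toy, kind=kind)
    g.ax.set_title(f"kind='{kind}'")
    grids.append(g)
```

Each call returns a FacetGrid, which is why the later cells can call methods such as `set_xticklabels` on the result.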

In [None]:
sns.catplot(x = 'Generation', y = 'SpecialDefense', data=pokemon,kind='bar')

In [None]:
sns.catplot(x='Generation', data = pokemon, kind = 'count', palette = 'rainbow')

In [None]:
#sns.catplot(x='SpecialDefense', data = pokemon, kind = 'box', palette = 'rainbow')
sns.catplot(x='Generation',y='SpecialDefense', data = pokemon, kind = 'box', palette = 'rainbow')

In [None]:
# Factor Plot
g = sns.catplot(x='Type 1', 
                   y='Attack', 
                   data=pokemon, 
                   hue='Generation',  # Color by Generation
                   col='Generation',  # Separate by Generation
                   kind='swarm') # Swarmplot
 
# Rotate x-axis labels
g.set_xticklabels(rotation=-45)

In [None]:
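# distplot is deprecated since seaborn 0.11; histplot and kdeplot together draw the same picture
#sns.histplot(pokemon['SpecialDefense'], bins=30)
sns.histplot(pokemon['SpecialDefense'], stat='density')
sns.kdeplot(pokemon['SpecialDefense'], color='red', linewidth=3, label='KDE')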

In [None]:
sns.jointplot(x='SpecialDefense',y='Defense',data=pokemon,kind='scatter')

## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [None]:
sns.pairplot(pokemon[['Attack','Defense']])

## Scatterplot

In [None]:
# lmplot draws a scatter plot with a fitted regression line
# x and y must be passed as keyword arguments in recent seaborn versions
sns.lmplot(x='Attack', y='Defense', data=pokemon)
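
The straight line that lmplot overlays is, by default, an ordinary least-squares fit. A rough sketch of that fit with NumPy, on made-up values rather than the course data:

```python
import numpy as np

# Made-up Attack and Defense values.
attack = np.array([49, 62, 82, 52, 64, 84], dtype=float)
defense = np.array([49, 63, 83, 43, 58, 78], dtype=float)

# Fit Defense = slope * Attack + intercept by least squares.
slope, intercept = np.polyfit(attack, defense, deg=1)

# Residuals are the vertical distances from the points to the fitted line.
fitted = slope * attack + intercept
residuals = defense - fitted

print(f'Defense ~ {slope:.2f} * Attack + {intercept:.2f}')
```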

In [None]:
# The regression line is drawn by default; fit_reg=False removes it
sns.lmplot(x='Attack', y='Defense', data=pokemon,
           fit_reg=False,     # no regression line
           hue='Generation')  # colour by generation

In [None]:
# Integration with matplotlib
# fit_reg=False removes the regression line that lmplot draws by default
sns.lmplot(x='Attack', y='Defense', data=pokemon,
           fit_reg=False,     # no regression line
           hue='Generation')  # colour by generation
# Tweak using Matplotlib
plt.ylim(0, None)
plt.xlim(0, None)

In [None]:
# Pre-format DataFrame
plt.figure(figsize=(12,8))
pm_df = pokemon.drop(['Year', 'Generation', 'Legendary'], axis=1)
 
# New boxplot using pm_df
sns.boxplot(data=pm_df)

In [None]:
# Swarm plot with Pokemon color palette
sns.swarmplot(x='Type 1', y='Attack', data=pokemon, 
              palette=pkmntypecolors)


In [None]:
# overlaying
# Set figure size with matplotlib
plt.figure(figsize=(12,8))
 
# Create plot
sns.violinplot(x='Type 1',
               y='Attack', 
               data=pokemon, 
               inner=None, # Remove the bars inside the violins
               palette=pkmntypecolors)
 
sns.swarmplot(x='Type 1', 
              y='Attack', 
              data=pokemon, 
              color='k', # Make points black
              alpha=0.7) # and slightly transparent
 
# Set title with matplotlib
plt.title('Attack by Type')


In [None]:
pm_df.head()

In [None]:
# HeatMap
# numeric_only=True (pandas 1.5+) skips text columns such as 'Type 1'
corr = pm_df.corr(numeric_only=True)
print(corr)

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(corr)
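
A heatmap is often easier to read with the values printed in the cells and a color scale fixed to the range of a correlation. The sketch below shows one possible set of options on a made-up table; the colormap choice is a matter of taste.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data with three numeric columns.
toy = pd.DataFrame({
    'Attack': [49, 62, 82, 52, 64, 84],
    'Defense': [49, 63, 83, 43, 58, 78],
    'Speed': [45, 60, 80, 65, 80, 100],
})

toy_corr = toy.corr()

# annot prints each correlation in its cell; vmin/vmax pin the colors to [-1, 1].
ax = sns.heatmap(toy_corr, annot=True, fmt='.2f',
                 cmap='coolwarm', vmin=-1, vmax=1)
```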

**Plotly's Python graphing library makes interactive graphs.**
**cufflinks connects plotly with pandas so that graphs and charts can be created directly from dataframes.**

In [None]:
# to install the package
#! pip install plotly
#! pip install cufflinks

In [None]:
from plotly.offline import download_plotlyjs, init_notebook_mode,plot, iplot

In [None]:
import cufflinks as cf

In [None]:
init_notebook_mode(connected=True)

In [None]:
# offline
cf.go_offline()

In [None]:
pokemon[['Attack','SpecialAttack']].iplot()

In [None]:
pokemon.iplot(kind = 'scatter', x = 'Attack', y = 'SpecialAttack', mode = 'markers', size = 3)

In [None]:
pokemon.iplot(kind = 'bar',x = 'Generation', y = 'SpecialAttack')

In [None]:
pokemon[['Generation', 'SpecialAttack']].iplot(kind = 'box')

In [None]:
#import all the necessary functions and classes
import plotly.express as px
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from plotly.figure_factory import create_table

In [None]:
px.bar(pokemon,x='Generation',y='Attack',height=400) 

In [None]:
px.bar(pokemon,x='Generation',y='Attack',color='SpecialAttack',height=400,
      labels={'Attack':'BaseAttack'})   # add color to the barchart to see SpecialAttack and on y axis plot BaseAttack

In [None]:
px.scatter(pokemon,x='Attack',y='Defense')  

In [None]:
px.scatter(pokemon,x='Attack',y='Defense', color = 'Generation')

In [None]:
#draw a bubble plot
px.scatter(pokemon,x='Attack',y='Defense', color = 'Type 1', size = 'Generation', size_max = 30)


In [None]:
#add hover name - Type
px.scatter(pokemon,x='Attack',y='Defense', color = 'Type 1', size = 'Generation', size_max = 30,
          hover_name='Type 2')



In [None]:
#facetplot
px.scatter(pokemon,x='Attack',y='Defense', color = 'Type 1', size = 'Generation', size_max = 30,
          hover_name='Type 2', facet_col = 'Generation')


In [None]:
# Add animation to the bubble plot: one frame per Year
px.scatter(pokemon,x='Attack',y='Defense', color = 'Type 1', size = 'Generation', size_max = 30,
          hover_name='Type 2', log_x = True, animation_frame = 'Year', animation_group = 'Generation')

