<a href="https://colab.research.google.com/github/alitarraf/Data-Science-Training/blob/master/Data_Visualization_in_Python_Unsolved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization 101

We sift through a lot of data for our jobs. Data about website performance, sales performance, product adoption, customer service, marketing campaign results ... the list goes on.

When you manage multiple content assets, such as social media or a blog, with multiple sources of data, it can get overwhelming. What should you be tracking? What actually matters? How do you visualize and analyze the data so you can extract insights and actionable information?

More importantly, how can you make reporting more efficient when you're busy working on multiple projects at once?

<img src=https://i.ibb.co/XYSq34N/good-visuals.png width="600"/>

## Why does visualization matter?

1. Visualization lets you see things that would otherwise go unnoticed. All data contains information, but without a visual representation you miss trends, behavior patterns, and dependencies.

2. Visualization gives answers faster. Identifying a trend on a graph takes an instant. Now imagine how long it would take to scan rows of numbers by eye.

3. A good visualization invites you to explore the data, play with it, and investigate curious cause-and-effect relationships. This is very important for investigative and research work, as in journalism.

4. Data volume is growing at a rapid rate. Visualization helps you make use of not only the volume but also the ever-increasing diversity of data.

# Plotting with Plotly + Cufflinks in Python

In this notebook, we will see how to use [plotly](https://plot.ly/python/) and [cufflinks](https://github.com/santosjorge/cufflinks) to create stunning, interactive figures in a single line of Python. This combination of libraries is simple to use, makes excellent charts, and is, in my opinion, much more efficient than other methods of plotting in Python.

This introduction will show us the basics of using plotly + cufflinks, focusing on what we can do in one line of code (for the most part). 


In [0]:
!pip install plotly_express

In [0]:
!pip install statsmodels==0.10.0rc2 --pre

In [0]:
import plotly
from plotly.offline import init_notebook_mode
import plotly_express as px
plotly.__version__

In [0]:
import statsmodels.api as sm
import warnings

## Plotly + Cufflinks Overview

You can create a free account and upload your graphs to share with others (this requires making the graphs and data public). 

We will run plotly completely in offline mode which means that we won't be publishing any of our graphs online.

The plotly Python package is an open-source library built on plotly.js which in turn is built on d3.js. We’ll be using a wrapper on plotly called cufflinks designed to work with Pandas dataframes. So, our entire stack is cufflinks > plotly > plotly.js > d3.js which means we get the efficiency of coding in Python with the incredible interactive graphics capabilities of d3.

[Cufflinks is a wrapper](https://github.com/santosjorge/cufflinks) around the plotly library specifically for plotting with Pandas dataframes. With cufflinks, we don't have to dig into the details of plotly; instead, we build our charts with minimal code. Basically, you can make charts directly in plotly for more control, or you can use cufflinks to rapidly prototype plots and explore the data.

In [0]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

In [0]:
# plotly standard imports
import plotly.graph_objs as go
import plotly.plotly as py

# Cufflinks wrapper on plotly
import cufflinks

# Data science imports
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = 30

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

We'll be using plotly + cufflinks in offline mode. We will also set the global `cufflinks` theme to `pearl`. 

In [0]:
from plotly.offline import iplot
cufflinks.go_offline()

# Set global theme
cufflinks.set_config_file(world_readable=True, theme='pearl')

After importing cufflinks, plotly plots can be made using `df.iplot()` and then specifying parameters. 

### Data

We are using medium article statistics data. 

In [0]:
df = pd.read_parquet('https://github.com/WillKoehrsen/Data-Analysis/blob/master/plotly/data/medium_data_2019_01_06?raw=true')
#df = pd.read_parquet('medium_data_2019_01_06')
df.head()

The read ratio is the ratio of your reads to your views, that is, reads divided by views.
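As a quick illustration of the read ratio, here is a minimal sketch on made-up numbers. The column names `views` and `reads` follow the dataset, but the values and the `read_ratio_pct` column name are hypothetical, and expressing the ratio as a percentage is an assumption.

```python
import pandas as pd

# Hypothetical article statistics (not taken from the real dataset)
stats = pd.DataFrame({
    'views': [200, 1000, 50],
    'reads': [50, 400, 40],
})

# Read ratio: share of views that turned into reads, as a percentage
stats['read_ratio_pct'] = stats['reads'] / stats['views'] * 100
print(stats)
```

An article with 200 views and 50 reads ends up with a read ratio of 25%.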

We will also be using the gapminder dataset, which contains information about countries, such as GDP and life expectancy by year, and comes pre-loaded with Plotly Express, as well as the tips dataset, which contains information about food servers’ tips in restaurants.

#### Tips dataset

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. 

In [0]:
tips = px.data.tips()
tips.head()

# Univariate (Single Variable) Distributions

For single variables, I generally start out with histograms. Plotly has these basic charts well-covered.

## Histograms

In [0]:
enable_plotly_in_cell()

df['claps'].iplot(
    kind='hist',
    bins=30,
    xTitle='likes',
    linecolor='black',
    yTitle='count',
    title='Claps Distribution')

* Notice that we can hover over any of the bars to get the exact numbers. You can also format the `text` to display different information on hovering.
* Histograms use quantitative (numeric) data.
* A histogram doesn't have any gaps between its bars; if it does, you are looking at a bar graph, which shows categorical data. We will cover bar graphs later in this session.
* The number of bins determines how many intervals the range of the data is divided into.
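To make the role of the bin count concrete, here is a small sketch that bins a made-up sample by hand with `numpy.histogram`. The sample values are hypothetical, and numpy is used here only to make the arithmetic visible; the plotting library chooses its own bin edges.

```python
import numpy as np

# Hypothetical sample of eight values
values = [1, 2, 2, 3, 5, 8, 9, 10]

# Ask for 3 equal-width bins spanning min(values)..max(values)
counts, edges = np.histogram(values, bins=3)

print(edges.tolist())   # [1.0, 4.0, 7.0, 10.0]
print(counts.tolist())  # [4, 1, 3]
```

Three bins need four edges, and each count is the height of one bar.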

In [0]:
enable_plotly_in_cell()

tips['tip'].iplot(
    kind='hist',
    bins=10,   # number of bins
    xTitle='tips',
    linecolor='black',
    yTitle='count',
    title='Tips Distribution')

### Percentage Histogram

To get the same chart but showing percentages instead of counts, we simply pass in `percent` as the `histnorm` parameter.

In [0]:
enable_plotly_in_cell()

tips['total_bill'].iplot(
    kind='hist',
    bins=11,
    xTitle='Total bill',
    linecolor='black',
    histnorm='percent',
    yTitle='percentage (%)',
    title='Total bill Distribution in Percent')
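The percent normalization itself is simple arithmetic. The following sketch, on a hypothetical sample, shows the conversion from counts to percentages under the assumption that each bar's count is divided by the total number of values.

```python
import numpy as np

values = [1, 2, 2, 3, 5, 8, 9, 10]  # hypothetical sample
counts, _ = np.histogram(values, bins=3)

# Percent normalization: each bar is its count divided by the total count
percent = counts / counts.sum() * 100
print(percent.tolist())  # [50.0, 12.5, 37.5]
```

The shape of the histogram is unchanged; only the y-axis is rescaled so that the bars sum to 100.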

## Grouped Histogram

When we want to display two different distributions on the same plot, we can group together the data to show it side-by-side. This means setting `barmode` to `group` with two distributions.

In [0]:
df.head()

In [0]:
def to_time(dt):
    return dt.hour + dt.minute / 60

The previous function turns a time of day into a single number by dividing the minutes by 60 and adding the result to the hour.

Example: 14:24 becomes 14.4
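The same conversion can be checked on a standalone timestamp. This sketch repeats the function so that it runs on its own; the example date is hypothetical.

```python
from datetime import datetime


def to_time(dt):
    """Convert a timestamp to a fractional hour of the day."""
    return dt.hour + dt.minute / 60


# Hypothetical timestamp at 14:24
example = datetime(2019, 1, 6, 14, 24)
print(round(to_time(example), 2))  # 14.4
```

Note that seconds are ignored, so 14:24:00 and 14:24:59 map to the same value.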

In [0]:
df['time_started'] = df['started_date'].apply(to_time)
df['time_published'] = df['published_date'].apply(to_time)

enable_plotly_in_cell()
df[['time_started', 'time_published']].iplot(
    kind='hist',
    linecolor='black',
    bins=24,
    histnorm='percent',
    bargap=0.1,
    opacity=0.8,
    barmode='group',
    xTitle='Time of Day',
    yTitle='(%) of Articles',
    title='Time Started and Time Published')

## Overlaid Histogram

If we prefer the bars to be laid over one another, we specify the `barmode` to be `overlay`.

In [0]:
enable_plotly_in_cell()
df[['time_published', 'time_started']].iplot(
    kind='hist',
    bins=24,
    linecolor='black',
    opacity=0.8,
    histnorm='percent',
    barmode='overlay',
    xTitle='Time of day',
    yTitle='(%) of articles',
    title='Time Started and Time Published Overlaid')

**Exercise**: Plot an overlaid histogram showing "total bill" and "tip" percentage from the tips dataset

**Instructions:**
- "total bill" can be referenced by column name "total_bill"
- "tip" can be referenced by column name "tip"
- Use 10 as number of bins
- After coding your function, run the cell to check if your result is correct.

In [0]:
enable_plotly_in_cell()
### START CODE HERE ### (≈ 10 lines of code)

### END CODE HERE ###

#HINT
##You can run tips.head() to see how the data looks

**Expected output**:

---


<img src=https://i.ibb.co/HVkh34h/newplot-1.png width="800"/>

Why does the plot look like this? Because the data is not evenly distributed.

## Bar Plot

For a bar plot, we need to apply some sort of aggregation function and then plot the result. For example, we can show the count of recorded tips for each gender with the following.

In [0]:
enable_plotly_in_cell()
tips['sex'].value_counts().iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Visualize Count of Tips Recorded by Gender')

In [0]:
enable_plotly_in_cell()
tips['day'].value_counts().iplot(
    kind='bar',
    xTitle='Day',
    yTitle='Count of tips',
    title='Visualize Count of Days for Recorded Tips',
    linecolor='black')

How is a histogram different from a bar chart?


1.   A histogram groups numerical data into ranges
2.   A bar graph uses categorical data, as we discussed earlier



## Bar Plot with Two Categories

Here we'll show two distributions side-by-side. First, we'll set the index to be the date, then resample to month frequency, then take the mean and plot.

In [0]:
df.head()

We will only be working with the `views`, `reads`, and `published_date` columns.

In [0]:
df2 = df[['views', 'reads',
          'published_date']].set_index('published_date').resample('M').mean()
# We set the index to published_date so the time series can be resampled.
# resample is a pandas method that takes the time frequency as its argument;
# 'M' tells pandas to group records by month and to label each group
# with the last day of that month.

# After resampling, we can apply functions such as sum or mean to each group.
df2.head()
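To see what a monthly average amounts to, here is a minimal sketch on made-up rows. It groups by calendar month with `dt.to_period('M')` rather than `resample`, which is a swapped-in technique chosen so the example does not depend on a resampling frequency alias. Unlike `resample`, this grouping omits months that have no rows and labels each group by its month rather than by its last day. All values are hypothetical.

```python
import pandas as pd

# Hypothetical article statistics (not taken from the real dataset)
articles = pd.DataFrame({
    'published_date': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-02-10']),
    'views': [100, 300, 50],
})

# Group rows by the calendar month of publication, then average each group
month = articles['published_date'].dt.to_period('M')
monthly = articles.groupby(month)['views'].mean()
print(monthly)
```

The two January articles are averaged to 200 views, and the single February article keeps its 50 views.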

In [0]:
enable_plotly_in_cell()
df2.iplot(
    kind='bar',
    xTitle='Date',
    yTitle='Average',
    title='Monthly Average Views and Reads')

By hovering over any month on the graph, we can make direct comparisons. This is a very handy way to explore your data! 

## Bar Plot with Second Y-Axis

If we want to put two very different ranges on the same graph, we can just use a secondary y-axis.

In [0]:
enable_plotly_in_cell()
df2 = df[['views', 'read_time',
          'published_date']].set_index('published_date').resample('M').mean()

df2.iplot(
    kind='bar',
    xTitle='Date',
    secondary_y='read_time',
    secondary_y_title='Average Read Time',
    yTitle='Average Views',
    title='Monthly Averages')

# Scatter Plots

The scatterplot is the heart of most analyses. It allows us to see the evolution of a variable over time or the relationship between two (or more) variables.

## Time-Series

A considerable portion of real-world data has a time element. Luckily, plotly + cufflinks was designed with time-series visualizations in mind.

It's very simple to make time-series plots if we set the index to be the datetime. Then we can simply pass in a column as `y`, and plotly will use the index to make a date x-axis.

In [0]:
df.head()

In [0]:
tds = df[df['publication'] == 'Towards Data Science'].set_index(
    'published_date')

# Here we filter by publication and set the index to published_date (time series)

tds.head()

In [0]:
enable_plotly_in_cell()
tds['reads'].iplot(
    mode='lines+markers',
    opacity=0.8,
    size=8,
    symbol=1,
    xTitle='Date',
    yTitle='No. Readers',
    title='Number of Readers Trend')

Notes about the previous plot:

* The first thing to look for is whether the data has a positive or negative trend to it as time progresses.
* You need enough data (enough time) to assess a trend; data for 2-3 days is not a trend.
* What's the variation? High points, max points...

## Two Variables Time-Series

To plot a second variable with a different range, we put it on a secondary y-axis.


In [0]:
enable_plotly_in_cell()
tds[['claps', 'views', 'title']].iplot(
    y='claps',
    mode='lines+markers',
    secondary_y = 'views',
    secondary_y_title='views',
    opacity=0.8,
    size=8,
    symbol=1,
    xTitle='Date',
    yTitle='claps',
    text='title',
    title='Claps and Views over Time')

Here we are doing quite a few different things all in one line:

*   Getting a nicely formatted time-series x-axis automatically
*   Adding a secondary y-axis because our variables have different ranges
*   Adding in the title of the articles as hover information
*   Although the ranges are different, can you see the relation between the variables?

# Heatmap

To visualize the correlations between numeric variables, we calculate the correlations and then make an annotated heatmap:

In [0]:
colorscales = [
    'Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu', 'Reds', 'Blues',
    'Picnic', 'Rainbow', 'Portland', 'Jet', 'Hot', 'Blackbody', 'Earth',
    'Electric', 'Viridis', 'Cividis'
]

In [0]:
import plotly.figure_factory as ff
corrs = df.corr()
# DataFrame.corr() computes the pairwise correlation of
# all columns in the dataframe.
# Any NA values are automatically excluded.
# Non-numeric columns are ignored.
corrs.head()

In [0]:
enable_plotly_in_cell()

figure = ff.create_annotated_heatmap(
    z=corrs.values,
    x=list(corrs.columns),
    y=list(corrs.index),#rows
    colorscale='Earth',
    annotation_text=corrs.round(2).values,#Round a DataFrame to a variable number of decimal places.
    showscale=True, #Display colorscale
    reversescale=True)

figure.layout.margin = dict(l=200, t=200)
figure.layout.height = 800
figure.layout.width = 1000

iplot(figure)
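To read the heatmap, it helps to know what the extreme values mean. The sketch below builds a correlation matrix for a tiny made-up dataframe; the column names and values are hypothetical and chosen so that the correlations are exactly 1 and -1.

```python
import pandas as pd

# Hypothetical numeric columns
data = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'doubled': [2, 4, 6, 8],    # rises with x
    'reversed': [4, 3, 2, 1],   # falls as x rises
})

example_corrs = data.corr()  # pairwise Pearson correlation by default
print(example_corrs.round(2))
```

A value near 1 means two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.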

# Pie Chart

In [0]:
enable_plotly_in_cell()
df.groupby(
    'publication', as_index=False)['reads'].count().iplot(
        kind='pie', labels='publication', values='reads', title='Percentage of Articles by Publication')  # count() counts articles, not reads

In [0]:
enable_plotly_in_cell()
df.groupby(
    'publication', as_index=False)['word_count'].sum().iplot(
        kind='pie', labels='publication', values='word_count', title='Percentage of Words by Publication')

# Plotly Express

Plotly Express is a terse, consistent, high-level wrapper around Plotly.py for rapid data exploration and figure generation.

Let's take a look at the gapminder dataset.

In [0]:
gapminder = px.data.gapminder()
gapminder.head()
gapminder2007=gapminder.query("year==2007")

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


## Scatterplot

If you want a basic scatter plot, it’s just `px.scatter(data, x="column_name", y="column_name")`.

In [0]:
enable_plotly_in_cell()
px.scatter(gapminder2007, x="gdpPercap", y="lifeExp")

Notes about the previous plot:

* The goal of this scatter plot is to see the relation between GDP per capita (`gdpPercap`) and life expectancy.
* When GDP per capita is very low, life expectancy covers a wide range: it is low in some countries and reaches the 60s in others.
* However, as GDP per capita increases, life expectancy clearly increases as well. Better income tends to mean better healthcare!
* As you can see, there are some outliers in the data.

**Exercise**: Plot a scatterplot on the tips dataset showing "total bill" and "tip" colored by smoker

**Instructions:**
- "total bill" can be referenced by column name "total_bill" (x-axis)
- "tip" can be referenced by column name "tip" (y-axis)
- Use the argument color="column name" to color your plot by a certain column
- After coding your function, run the cell to check if your result is correct.

In [0]:
enable_plotly_in_cell()
### START CODE HERE ### (≈ 1 line of code)

### END CODE HERE ###

#HINT
##You can run tips.head() to see how the data looks

**Expected output**:

---


<img src=https://i.ibb.co/sbLYkJZ/newplot-2.png width="800"/>

Back to the gapminder dataset: maybe we want to scale the points by the country population… no problem, there’s an argument for that too! Unsurprisingly, it’s called `size`.
Curious about which point is which country? Add a `hover_name` and you can easily identify any point: never again wonder “what is that outlier?”... just mouse over the point you're interested in!

**Exercise**: Plot a scatterplot on the gapminder 2007 dataset scaled by country population where points are identified by name and colored by continent

**Instructions:**
- use color argument for coloring
- use size argument for points size
- set size_max attribute to 60
- use hover_name argument for point identification

In [0]:
enable_plotly_in_cell()

### START CODE HERE ### (≈ 1 line of code)

### END CODE HERE ###

#HINT
##You can run gapminder2007.head() to see how the data looks

**Expected output**:

---


<img src=https://i.ibb.co/371xqhL/newplot.png width="800"/>

## Animated Scatter Plot

Here’s an example with the gapminder dataset showing life expectancy vs. GDP per capita by country, colored by continent and scaled by the country population.

Curious about which point is which country? Add a hover_name and you can easily identify any point: never again wonder “what is that outlier?”

We also want to see how this chart evolved over time. You can animate it by setting animation_frame="year" and animation_group="country" to identify which circles match which ones across frames. 

In [0]:
enable_plotly_in_cell()
px.scatter(gapminder, x="gdpPercap", y="lifeExp",size="pop", size_max=60, color="continent", hover_name="country",
           animation_frame="year", animation_group="country", log_x=True, range_x=[100,100000], range_y=[25,90],
           labels=dict(pop="Population", gdpPercap="GDP per Capita", lifeExp="Life Expectancy"))

## Whisker Plot

* Box and whisker plots are ideal for comparing distributions, as they display the range of the data along a number line.
* The data is ordered from least to greatest.
* The median is the middle value, which splits the data into two equal groups.
* The lower quartile (often called Q1) is the median of the lower half of the data.
* The upper quartile (often called Q3) is the median of the upper half of the data.
* The extremes are the smallest and largest values of the data.
* The distances between these points show how spread out the data is.


---


<img src=https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/img/5214889_01-eng.gif width="400"/>

<img src=https://i.stack.imgur.com/mpbTr.gif width="400"/>
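The quartile definitions in the bullets above can be sketched in a few lines of plain Python. The helper name `five_number_summary` and the sample values are hypothetical, and the median-of-halves rule used here is only one convention; plotting libraries may interpolate quartiles differently, so their numbers can differ slightly.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the middle
    upper = data[-half:]   # values above the middle
    return data[0], median(lower), median(data), median(upper), data[-1]


print(five_number_summary([7, 1, 12, 4, 8, 3, 10, 6]))
# (1, 3.5, 6.5, 9.0, 12)
```

The box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach toward the extremes.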



In [0]:
enable_plotly_in_cell()
px.box(tips,x="size",y="tip")

Notes about the previous plot:

* Groups of 6 people have a high minimum in their tip range.
* Why does the box for groups of 5 look like this?
* The median line doesn't have to be in the middle of the box!
* Can you spot the difference between the upper extreme and the outliers?
* The benefit of interactivity is that we can explore and subset the data as we like. There’s a lot of information in a boxplot, and without the ability to see the numbers, we’ll miss most of it!

**Exercise**: Plot a Whisker plot showing "total bill" distributed by day colored by Smoking habit and plotted horizontally

**Instructions:**
- "total bill" can be referenced by column name "total_bill"
- "days" can be referenced by column name "day"
- Color your plot using "smoker" column
- You can set plot orientation by using argument --->orientation="h"
- After coding your function, run the cell to check if your result is correct.

In [0]:
enable_plotly_in_cell()
### START CODE HERE ### (≈ 1 line of code)

### END CODE HERE ###

#BONUS
#To order the days, you can add the argument: category_orders={"day":["Thur","Fri","Sat","Sun"]}

**Expected output**:

---


<img src=https://i.ibb.co/JmkL9GZ/newplot.png width="800"/>

## Scatter and Box and Violin

In [0]:
warnings.filterwarnings('ignore')
enable_plotly_in_cell()
gapminder2=gapminder.query("year>1980")
px.scatter(gapminder2, x="gdpPercap", y="lifeExp", color="continent", marginal_y="violin",
           marginal_x="box", trendline="ols", hover_name="country")  # ordinary least squares regression

Notes about the previous plot:

* If we look at the points of a single continent, there appears to be a roughly linear relationship.
* We want to find the best-fit regression line. A regression model lets us predict what values to expect for inputs we have not observed.
<img src=https://i.ibb.co/hXWYZRV/1.png width="800"/>
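For intuition about what a best-fit line is, here is a minimal sketch of ordinary least squares for a single input variable, written in plain Python. The helper name `ols_line` and the data points are hypothetical; this shows the textbook closed-form solution, not the internals of any plotting library.

```python
def ols_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1
slope, intercept = ols_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

With real data the points do not fall exactly on a line, and the fit is the line that minimizes the sum of squared vertical distances to the points.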


---


* What's the point of a violin plot if we can use a box plot?

<img src=https://datavizcatalogue.com/methods/images/anatomy/SVG/violin_plot.svg width="800"/>

* It's like a histogram of the data, only smoothed to look nicer.
* A box plot shows the range of the data but doesn't show the density of values.
* Looking at two continents that have the same median using box plots only can lead to the incorrect conclusion that they both have the same data distribution.
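The last point can be demonstrated with two made-up samples whose box plot summaries are identical even though their shapes differ. This is a sketch under stated assumptions: the helper `box_summary`, the sample names, and the values are hypothetical, and quartiles are computed as medians of each half of the sorted data.

```python
from collections import Counter
from statistics import median


def box_summary(values):
    """(min, Q1, median, Q3, max) with quartiles as medians of each half."""
    data = sorted(values)
    half = len(data) // 2
    return data[0], median(data[:half]), median(data), median(data[-half:]), data[-1]


# Two hypothetical samples
two_clusters = [0, 2, 2, 4, 5, 6, 8, 8, 10]  # values pile up near 2 and 8
one_peak = [0, 1, 3, 5, 5, 5, 7, 9, 10]      # values pile up at 5

print(box_summary(two_clusters) == box_summary(one_peak))  # True
print(Counter(two_clusters).most_common(2))
print(Counter(one_peak).most_common(1))
```

A box plot would draw the same box for both samples, while a violin plot would show two bulges for the first sample and one for the second.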

## Parallel Categories Diagram

The parallel categories diagram is a visualization of multi-dimensional categorical data sets. Each variable in the data set is represented by a column of rectangles, where each rectangle corresponds to a discrete value taken on by that variable. The relative heights of the rectangles reflect the relative frequency of occurrence of the corresponding value.

Combinations of category rectangles across dimensions are connected by ribbons, where the height of the ribbon corresponds to the relative frequency of occurrence of the combination of categories in the data set.

In [0]:
enable_plotly_in_cell()
px.parallel_categories(tips, color="size", color_continuous_scale=px.colors.sequential.Inferno)
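The quantities that the diagram draws are plain frequency counts. The following sketch computes them for a handful of made-up records; the rows are hypothetical and only mimic the shape of the tips data.

```python
from collections import Counter

# Hypothetical (sex, smoker, day) records, one per restaurant bill
records = [
    ('Male', 'No', 'Sat'),
    ('Male', 'Yes', 'Sat'),
    ('Female', 'No', 'Thur'),
    ('Male', 'No', 'Sat'),
    ('Female', 'Yes', 'Sat'),
]

# Rectangle heights: how often each value occurs within one variable
sex_counts = Counter(sex for sex, _, _ in records)

# Ribbon heights: how often each combination of two variables occurs
sex_smoker_counts = Counter((sex, smoker) for sex, smoker, _ in records)

print(sex_counts)
print(sex_smoker_counts)
```

Each ribbon between two columns corresponds to one key of the combination counter, and its thickness to the count.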

Notes about the previous plot:

* There are more male smokers than female smokers.
* The highest number of smokers is seen on Saturday.
* Groups of 4 often visit at dinner time.
* Apparently people only have lunch on Thursday!

## Parallel Coordinates Diagram

Parallel coordinates plots are richly interactive by default and let you visualize the relations between variables.

Drag the lines along the axes to filter values and drag the axis names across the plot to rearrange variables.

In [0]:
enable_plotly_in_cell()
px.parallel_coordinates(tips,color="size",color_continuous_scale=["red","green","blue"])

## Dash with Tips Showcase
<img src=https://aws1.discourse-cdn.com/standard17/uploads/plot/original/2X/e/e1628056dbc0e8b57586b8ab264991da933152ca.gif width="900"/>

# Where to take it from here?

Hopefully you now have an idea of the capabilities of plotly, Plotly Express, and cufflinks. We have only scratched the surface of these libraries, so check out the cufflinks documentation and the plotly documentation for plenty more examples.

As we have seen, designing an efficient and effective data visualization application is a systematic process. 
This process involves representing the data of interest, processing the data to extract relevant information for the problem at hand, designing a mapping of this information to a visual representation, rendering this representation, and combining all this functionality in an easy-to-use application.

# Data Visualization Showcase


* https://www.plotly.express/
* [Arlington Visual Budget](http://arlingtonvisualbudget.org/revenues/2019/t/352a9d4d) #Treemap
* [Migration Flow](http://download.gsb.bund.de/BIB/global_flow/) #chord diagram
* [D3 library](https://github.com/d3/d3/wiki/Gallery) #D3.js is a JavaScript library for producing dynamic, interactive data visualizations in web browsers.