<a href="https://colab.research.google.com/github/alishaminj12/alisha_data690/blob/main/chapter_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 6 - Go Beyond Plotly Express

Plotly Express is convenient and fast, but it can take you only to select destinations. To go where Plotly Express cannot reach, you can turn to Plotly, the foundational library that Plotly Express is built on.

Since Plotly Express does not provide a Pareto chart, let's build one from scratch using Plotly.

This chapter:

- uses the 2019 population of countries as an example,
- demonstrates subplots with a secondary Y axis, and
- demonstrates color scales and how to hide the scale.

In [1]:
# Google Colab comes with an older version of Plotly pre-installed (5.5.0 in the run below).
# We upgrade it to the latest version.

!pip install --upgrade plotly

Collecting plotly
  Downloading plotly-5.7.0-py2.py3-none-any.whl (28.8 MB)
     |████████████████████████████████| 28.8 MB 252 kB/s 
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.5.0
    Uninstalling plotly-5.5.0:
      Successfully uninstalled plotly-5.5.0
Successfully installed plotly-5.7.0


Note that in this chapter we will not import the Plotly Express module. We will import Plotly's `graph_objects` module instead.

In [2]:
import numpy as np                  # We use numpy to generate some sample data for plotting
import plotly.graph_objects as go   # The graph_objects package is the core of Plotly
import plotly.io as pio
import pandas as pd
import plotly
plotly.__version__

'5.7.0'

## 6.1 Plotly Basics

Plotly uses the JavaScript Object Notation (JSON) format to describe how data are visualized. JSON is a standard format for web applications and data integration. It is similar to Python's dictionary object and uses key-value pairs to describe data and computing instructions.

A Plotly data visualization is represented by a **Figure** object. A figure has two components: **Data** and **Layout**. 

The data component is a list of **Traces**. A trace describes one chart of a predefined type, such as a boxplot, bar chart, or scatter plot, or of a custom-coded type.

The layout component describes the overall characteristics of a figure such as its title, legend, and titles of the axes among many others. 

A sophisticated visualization can be implemented by incorporating multiple traces each representing a unique visual component with customized layout.
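
To make the Data and Layout structure concrete, here is a minimal sketch that writes a figure as plain nested dictionaries and converts it to JSON text using only the standard library. The ages and titles are made-up sample values, and `figure_spec` is just an illustrative name. A dictionary shaped like this, with `data` and `layout` keys, should also be accepted by `go.Figure()`, but that step is left out to keep the sketch free of dependencies.

```python
import json

# A figure described as nested dictionaries: a list of traces plus a layout.
figure_spec = {
    "data": [
        {"type": "box", "x": [23, 35, 41, 58, 62], "name": "Male"},  # one trace
    ],
    "layout": {
        "title": {"text": "Boxplot of Ages"},
        "xaxis": {"title": {"text": "Age"}},
    },
}

# Serialize the description to JSON text, then parse it back into Python objects.
figure_json = json.dumps(figure_spec, indent=2)
round_trip = json.loads(figure_json)

print(figure_json)
print(round_trip == figure_spec)  # the structure survives the round trip
```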

![](https://github.com/wcj365/plotly-express/blob/main/static/images/plotly_module.jpg?raw=1)


## 6.2 A "Hello World" Chart
This example uses the `update_layout()` method of the Figure class to add a title to the figure.

This simple chart has no data to display. 



In [3]:
fig = go.Figure()
fig.update_layout(title="Hello World!")

# Alternatively,
# my_layout = go.Layout(title="Hello World!")
# fig = go.Figure(layout = my_layout)

fig.show()

Here, the output shows that the figure has no data, but its layout has a value for the title.

In [4]:
print(fig)

Figure({
    'data': [], 'layout': {'template': '...', 'title': {'text': 'Hello World!'}}
})


## 6.3 A Boxplot of Ages of Some Men

Here, we create a trace of type "box" for a list of numbers and add it to the Figure object using the `add_trace()` method of the Figure class. We use the `go.Box` class from the `graph_objects` module to create the trace. Alternatively, we can describe the trace with a plain Python dictionary of key-value pairs. Section 6.5 compares the two options.

#### Statistics Background

"A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed."

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

Fences can be used to illustrate extreme values (outliers) in box plots. Sometimes you might see reference to “inner fences” and “outer fences”. These are defined as:
- Lower inner fence: Q1 – (1.5 * IQR)
- Upper inner fence: Q3 + (1.5 * IQR)
- Lower outer fence: Q1 – (3 * IQR)
- Upper outer fence: Q3 + (3 * IQR)

Points beyond an inner fence but within the corresponding outer fence are mild outliers; points beyond an outer fence are extreme outliers.
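
The fence formulas can be checked by hand with a short sketch. The ten values below are hypothetical, chosen so that one lands between the fences and one beyond the outer fence. The quartiles use the standard library's "inclusive" interpolation method. Plotly may compute quartiles with a different method, so the numbers shown when hovering over a boxplot can differ slightly.

```python
import statistics

# Hypothetical sample: mostly clustered values plus two large ones.
values = [12, 14, 15, 16, 17, 18, 19, 20, 30, 45]

# Quartiles by linear interpolation between neighboring data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

# Extreme outliers lie beyond an outer fence; mild outliers lie
# beyond an inner fence but not beyond an outer fence.
extreme = [v for v in values if v < lower_outer or v > upper_outer]
mild = [
    v for v in values
    if (v < lower_inner or v > upper_inner) and v not in extreme
]

print(q1, median, q3, iqr)       # 15.25 17.5 19.75 4.5
print(lower_inner, upper_inner)  # 8.5 26.5
print(lower_outer, upper_outer)  # 1.75 33.25
print(mild, extreme)             # [30] [45]
```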


In the example below, we also add a title for the X axis.

In [5]:
# Here, we use Numpy to generate a list of random numbers to represent the ages for a group of men.

male_ages = np.random.randint(low=1, high=101, size=20)   # 20 random integers between 1 and 101 excluding 101.

print(male_ages)

[ 3 54 54 84 47 73  2 51 71 72 27 63 64  1  7 58 33 91 23 44]


In [6]:
# A trace is represented by a Python Dict object which contains key-value pairs.
# The key "x" represents the X Axis and its values are represented by a Python List object
# The key "type" represents the type of the chart: "box" for a boxplot, "scatter" for a scatter plot, etc.

trace_0 = go.Box(   # type of chart: Boxplot
    x=male_ages,    
    name="Male"     # The name of the trace, used as a legend to distinguish multiple traces.

)

# Alternatively, just use the Python dictionary 
# trace_0 = {
#     "x":male_ages,
#     "type":"box",    # type of chart: Boxplot
#     "name":"Male"    # The name of the trace, used as a legend to distinguish multiple traces.
# }

fig = go.Figure()
fig.add_trace(trace_0)

# Alternatively,
# fig = go.Figure(data=[trace_0])

fig.update_layout(
    title="Boxplot of Ages of Some Men",
    xaxis={"title":"Age"}         # This is equivalent to xaxis_title="Age"
)

fig.show()

Since Plotly figures are interactive, you can move your mouse around to see the five summary statistics.

Here the `print()` function shows that the figure has one trace of type box along with its data points. The figure also has some custom layout properties, including its title and the title for the X axis.

In [7]:
print(fig)

Figure({
    'data': [{'name': 'Male',
              'type': 'box',
              'x': array([ 3, 54, 54, 84, 47, 73,  2, 51, 71, 72, 27, 63, 64,  1,  7, 58, 33, 91,
                          23, 44])}],
    'layout': {'template': '...', 'title': {'text': 'Boxplot of Ages of Some Men'}, 'xaxis': {'title': {'text': 'Age'}}}
})


## 6.4 A Boxplot of Ages of Some Men and Women
We add another trace representing the boxplot of ages of some women.

In [8]:
male_ages = np.random.randint(low=1, high=100, size=20)

trace_0 = go.Box(   
    x=male_ages,    
    name="Male"   
)

female_ages = np.random.randint(low=1, high=100, size=20)

trace_1 = go.Box(   
    x=female_ages,    
    name="Female"    
)

fig = go.Figure()
fig.add_trace(trace_0)
fig.add_trace(trace_1)

# Alternatively,
# fig = go.Figure(data=[trace_0, trace_1])

fig.update_layout(
    title="Boxplot of Ages of Some Men and Women",
    xaxis={"title":"Age"},
    showlegend=True             # The legend can be shown or hidden
)

fig.show()

Since we already have the labels "Male" and "Female" on the Y axis, the color legend on the upper right is not necessary. We can hide it by changing the `showlegend` property of the Layout to `False`.

In [9]:
fig.update_layout(showlegend=False)

# Alternatively,
# fig.layout.showlegend = False

fig.show()

## 6.5 Plotly Flexibility

### 6.5.1 Different Ways to Create a Figure 

We can create an empty Figure object and then add traces and update layout properties like this:
```
trace_0 = go.Box(   
    x=male_ages,    
    name="Male"   
)

fig = go.Figure()
fig.add_trace(trace_0)
fig.update_layout(title="A Boxplot")
```
Alternatively, we can first create the traces and a Layout object with the desired properties, and then pass the list of traces and the Layout object to the Figure constructor:

```
trace_0 = go.Box(   
    x=male_ages,    
    name="Male"   
)

my_layout = go.Layout(title="A Boxplot")
fig = go.Figure(data=[trace_0], layout=my_layout)
```


    

### 6.5.2 Different Ways to Create a Trace

We can use a specific trace class from the `graph_objects` module. Here we use `go.Box()` to create a boxplot. The resulting trace object holds the same key-value pairs as a Python dictionary.

```
trace_0 = go.Box(
    x=[10, 3, -5, -35, 23, 8, 78, -65, 13,31, 82],  
    name="Trace Name"                   
)
```

Alternatively, we can use a Python dictionary object to represent a trace:

```
trace_0 = {                         
    "x":[10, 3, -5, -35, 23, 8, 78, -65, 13,31, 82],  
    "type":"box",
    "name":"Trace Name"                   
}
```

### 6.5.3 Different Ways to Specify a Layout Property

For example, to specify the title of the X axis, the following three statements have the same effect:

- `fig.update_layout(xaxis={"title":"Age"})`
- `fig.update_layout(xaxis_title="Age")`
- `fig.layout.xaxis.title = "Age"`

Python is a flexible language and offers alternative ways to achieve the same outcome. In some cases, there are industry best practices. For example, the commonly used indentation is four spaces. In other cases, it is up to your personal preference. In the latter, you should try to pick one and use it consistently. 


## 6.6 Steps to Create a Plotly Visualization

Here are the steps to create a Plotly chart:

1. Create an instance of the Figure class. 
2. Create traces (one or more) each representing a plot.
3. Add the traces to the Figure instance.
4. Update the layout of the figure (title, legend, etc.).
5. Display or export the figure.

## 6.7 Create a Pareto Chart



In [10]:
df = pd.read_csv("https://raw.githubusercontent.com/wcj365/jay_data690/main/wdi_data.csv")
df.head()

| | Unnamed: 0 | Year | Country Code | Country Name | Region | Income Group | Lending Type | NY.GDP.PCAP.PP.CD | SH.STA.SUIC.P5 | SP.DYN.LE00.IN | SP.POP.TOTL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2016 | AFG | Afghanistan | South Asia | Low income | IDA | 1981.118069 | 4.0 | 63.763 | 35383028.0 |
| 1 | 2 | 2016 | AGO | Angola | Sub-Saharan Africa | Lower middle income | IBRD | 7103.226431 | 6.2 | 59.925 | 28842482.0 |
| 2 | 3 | 2016 | ALB | Albania | Europe & Central Asia | Upper middle income | IBRD | 12078.843136 | 4.7 | 78.194 | 2876101.0 |
| 3 | 5 | 2016 | ARE | United Arab Emirates | Middle East & North Africa | High income | Not classified | 63968.888039 | 6.0 | 77.47 | 9360975.0 |
| 4 | 6 | 2016 | ARG | Argentina | Latin America & Caribbean | Upper middle income | IBRD | 20307.870052 | 8.3 | 76.221 | 43590368.0 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 704 entries, 0 to 703
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         704 non-null    int64  
 1   Year               704 non-null    int64  
 2   Country Code       704 non-null    object 
 3   Country Name       704 non-null    object 
 4   Region             704 non-null    object 
 5   Income Group       704 non-null    object 
 6   Lending Type       704 non-null    object 
 7   NY.GDP.PCAP.PP.CD  704 non-null    float64
 8   SH.STA.SUIC.P5     704 non-null    float64
 9   SP.DYN.LE00.IN     704 non-null    float64
 10  SP.POP.TOTL        704 non-null    float64
dtypes: float64(4), int64(2), object(5)
memory usage: 60.6+ KB


In [12]:
df = df.query("Year == 2019")
df.sample(5)

| | Unnamed: 0 | Year | Country Code | Country Name | Region | Income Group | Lending Type | NY.GDP.PCAP.PP.CD | SH.STA.SUIC.P5 | SP.DYN.LE00.IN | SP.POP.TOTL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 553 | 680 | 2019 | BTN | Bhutan | South Asia | Lower middle income | IDA | 12366.526476 | 4.6 | 71.777 | 763094.0 |
| 608 | 747 | 2019 | JAM | Jamaica | Latin America & Caribbean | Upper middle income | IBRD | 10190.474651 | 2.4 | 74.475 | 2948277.0 |
| 533 | 658 | 2019 | ARM | Armenia | Europe & Central Asia | Upper middle income | IBRD | 14231.180189 | 3.3 | 75.087 | 2957728.0 |
| 590 | 725 | 2019 | GNB | Guinea-Bissau | Sub-Saharan Africa | Low income | IDA | 2021.301943 | 7.0 | 58.322 | 1920917.0 |
| 596 | 734 | 2019 | HND | Honduras | Latin America & Caribbean | Lower middle income | IDA | 5978.764399 | 2.1 | 75.27 | 9746115.0 |


In [13]:
trace_0 = go.Bar(
    x=df["Country Name"],
    y=df["SP.POP.TOTL"]
)

fig = go.Figure()

fig.add_trace(trace_0)

fig.show()

In [14]:
# Reassign the sorted result instead of sorting in place;
# sorting a filtered DataFrame in place triggers pandas' SettingWithCopyWarning.
df = df.sort_values(by="SP.POP.TOTL", ascending=False)
df.head()



| | Unnamed: 0 | Year | Country Code | Country Name | Region | Income Group | Lending Type | NY.GDP.PCAP.PP.CD | SH.STA.SUIC.P5 | SP.DYN.LE00.IN | SP.POP.TOTL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 559 | 687 | 2019 | CHN | China | East Asia & Pacific | Upper middle income | IBRD | 16653.338844 | 8.1 | 76.912 | 1407745000.0 |
| 601 | 740 | 2019 | IND | India | South Asia | Lower middle income | IBRD | 6997.863988 | 12.7 | 69.656 | 1366418000.0 |
| 695 | 854 | 2019 | USA | United States | North America | High income | Not classified | 65279.529026 | 16.1 | 78.787805 | 328330000.0 |
| 600 | 738 | 2019 | IDN | Indonesia | East Asia & Pacific | Lower middle income | IBRD | 12311.503273 | 2.4 | 71.716 | 270625600.0 |
| 653 | 801 | 2019 | PAK | Pakistan | South Asia | Lower middle income | Blend | 4896.393145 | 8.9 | 67.273 | 216565300.0 |


In [15]:
df2 = df.head(10)
df2.shape

(10, 11)

In [16]:
trace_0 = go.Bar(
    x=df2["Country Name"],
    y=df2["SP.POP.TOTL"],
    marker=dict(color=df2["SP.POP.TOTL"], coloraxis="coloraxis")   # color each bar by its value, taken from the same df2 rows as x and y
)

fig = go.Figure()

fig.add_trace(trace_0)

fig.update_layout(
    title="2019 Population by Country",
    yaxis={"title":"2019 Population"}
)

fig.show()

In [17]:
# Work on an independent copy so that adding a column does not trigger SettingWithCopyWarning
df2 = df2.copy()
df2["cumulative_%"] = 100 * df2["SP.POP.TOTL"].cumsum() / df2["SP.POP.TOTL"].sum()
df2.head(6)



| | Unnamed: 0 | Year | Country Code | Country Name | Region | Income Group | Lending Type | NY.GDP.PCAP.PP.CD | SH.STA.SUIC.P5 | SP.DYN.LE00.IN | SP.POP.TOTL | cumulative_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 559 | 687 | 2019 | CHN | China | East Asia & Pacific | Upper middle income | IBRD | 16653.338844 | 8.1 | 76.912 | 1407745000.0 | 31.729375 |
| 601 | 740 | 2019 | IND | India | South Asia | Lower middle income | IBRD | 6997.863988 | 12.7 | 69.656 | 1366418000.0 | 62.527269 |
| 695 | 854 | 2019 | USA | United States | North America | High income | Not classified | 65279.529026 | 16.1 | 78.787805 | 328330000.0 | 69.927546 |
| 600 | 738 | 2019 | IDN | Indonesia | East Asia & Pacific | Lower middle income | IBRD | 12311.503273 | 2.4 | 71.716 | 270625600.0 | 76.027216 |
| 653 | 801 | 2019 | PAK | Pakistan | South Asia | Lower middle income | Blend | 4896.393145 | 8.9 | 67.273 | 216565300.0 | 80.908415 |
| 550 | 677 | 2019 | BRA | Brazil | Latin America & Caribbean | Upper middle income | IBRD | 15388.234916 | 6.9 | 75.881 | 211049500.0 | 85.665291 |
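
The `cumulative_%` column is a running total divided by the grand total, expressed as a percentage. The following sketch shows the same arithmetic on a tiny list using only the standard library. The counts are hypothetical and already sorted from largest to smallest, as a Pareto chart requires.

```python
from itertools import accumulate

# Hypothetical category counts, sorted in descending order.
counts = [50, 30, 15, 5]

total = sum(counts)

# Running totals are 50, 80, 95, 100; each is scaled to a percentage of the total.
cumulative_pct = [100 * running / total for running in accumulate(counts)]

print(cumulative_pct)  # [50.0, 80.0, 95.0, 100.0]
```

The last value is always 100, which is why the line in a Pareto chart ends at the top of the secondary axis.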


In [18]:
trace_1 = go.Scatter(
    x=df2["Country Name"],
    y=df2["cumulative_%"],
    mode="markers+lines"
)

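from plotly.subplots import make_subplots   # import the submodule explicitly rather than relying on "import plotly"

fig = make_subplots(specs=[[{"secondary_y": True}]])   # a single subplot with a secondary Y axis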

fig.add_trace(trace_0)

fig.add_trace(trace_1,secondary_y=True)

fig.update_layout(
    title="2019 Population by Country",
    yaxis={"title":"2019 Population"},
    showlegend=False,
    coloraxis_showscale=False
)

#fig.update(layout_coloraxis_showscale=False)

fig.show()