# Tutorial: Creating a Pareto Chart Using Plotly

## What is a Pareto Chart?

The Pareto chart is a visual representation of the 80-20 rule that combines a bar chart with a line chart. The 80-20 rule, also known as the Pareto principle, is a widespread phenomenon that often goes unnoticed. It is named after Vilfredo Pareto, who found that approximately 80 percent of the wealth in the Italian cities he studied was held by only 20 percent of the families. According to Wikipedia, the Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of causes (Wikipedia, "Pareto principle", 2022).

The Pareto chart visualizes the Pareto principle by displaying the ordered frequencies of categories (Count) along with their cumulative frequencies (Cumulative Percentage). The lengths of the bars represent frequency; the bars are arranged with the longest on the left and the shortest on the right, showing which categories are most significant. The line represents the cumulative percentage of the values. A horizontal reference line at 80% can be added to the graph to help you quickly identify where the cumulative total reaches 80%.
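To make the two series concrete, here is a minimal sketch in plain Python that builds the numbers behind a Pareto chart: categories sorted by count, plus the running cumulative percentage. The defect categories and counts are made-up illustration data, not taken from any real dataset.

```python
from itertools import accumulate

# Hypothetical defect counts, for illustration only
counts = {"Login": 8, "Crash": 45, "Layout": 5, "Timeout": 30, "Typo": 12}

# Bars: categories ordered from most to least frequent
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Line: running total expressed as a percentage of the grand total
total = sum(counts.values())
cumulative_pct = [
    round(running / total * 100, 2)
    for running in accumulate(count for _, count in ordered)
]

for (category, count), pct in zip(ordered, cumulative_pct):
    print(f"{category:<8} {count:>3} {pct:>7.2f}%")
```

The sorted counts become the bars and `cumulative_pct` becomes the line; the last value of the line is always 100.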

It's also worth noting that in the Pareto principle, the two numbers do not have to add up to 100, because they measure different things: the share of outcomes and the share of causes. The relationship could just as well be 80/40, 60/20, 90/30, or 100/10. The 80/20 distribution is simply the most common one (Gulbis, 2016).

The Pareto chart is a great tool for decision-making. It can come in handy in the following circumstances:

1. You have a list of root causes of a problem and want to determine where to spend the most resources. For example: which software bugs cause the most crashes?
2. You want to organize and sort the data by the categories' priorities.
3. You want to explore whether the dataset follows the 80-20 rule.
4. You want to analyze business data that you can count or measure, such as: Which customers contribute most of your income or profit? Which products are the most popular, and why? Which employees contribute most to company sales? (Gulbis, 2016)

The Pareto chart is commonly used in quality control analysis to show the major defects or problems in descending order of frequency (number of occurrences), together with their cumulative impact. It makes clear which defects should be resolved first and helps in planning the corrective actions to take (GeeksforGeeks, 2020).

This visualization technique also has some drawbacks:

1. It does not represent the severity of a defect or problem, since it only shows frequencies. People may tend to ignore problems with lower frequencies even though those problems may cause larger losses. In other words, Pareto analysis draws attention to the most frequent categories at the expense of the less frequent ones, which can skew priorities.

2. The Pareto chart has limited use cases. When the dataset doesn't follow the 80-20 rule closely, the chart may not be very helpful for decision-making, and additional tools need to be explored.

3. Pareto charts can only show qualitative data that can be observed; they merely show the frequency of an attribute or measurement. They cannot be used to calculate the average of the data, its variability, or changes in the measured attribute over time (Wilhite, 2021).
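The first drawback can be illustrated with a small sketch: ranking the same categories by raw frequency and by frequency weighted with a cost per occurrence can give opposite orders. The categories, counts, and costs below are hypothetical values chosen only to make the effect visible.

```python
# Hypothetical data: (category, occurrences, cost per occurrence)
defects = [
    ("Typo in label", 120, 1),
    ("Slow page load", 60, 5),
    ("Data loss on save", 4, 500),
]

# Ranking used by a plain Pareto chart: frequency only
by_frequency = [
    name for name, _, _ in sorted(defects, key=lambda d: d[1], reverse=True)
]

# Alternative ranking: frequency weighted by cost per occurrence
by_total_cost = [
    name for name, _, _ in sorted(defects, key=lambda d: d[1] * d[2], reverse=True)
]

print("By frequency: ", by_frequency)
print("By total cost:", by_total_cost)
```

If severity matters for the decision, one option is to chart the weighted totals instead of the raw counts.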




## Visualization using Plotly

Plotly is an open-source visualization toolkit licensed under the MIT license. It doesn't require account registration, it's free for individual users, and it works offline. You can view the source, report issues, or contribute on GitHub.

Plotly is more interactive and visually flexible than Matplotlib or Seaborn. It works well with my dataset (a CSV file) via pandas, and it makes building a Pareto chart straightforward. Interactivity makes a Pareto chart more readable, especially when your data has many rows (categories). The dataset I use in this demonstration, "2021 Tokyo Olympics Medals Count", contains 93 countries, so interactivity is vital: you can hover over each bar or point to see its value.



#### Installation

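Run the commands below in a Jupyter cell to install the package and the extensions needed for JupyterLab. Note that with recent versions (JupyterLab 3 or later and Plotly 5 or later) the `labextension install` step is generally not required, because the extension ships with the pip package.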

`!pip install plotly`

`!jupyter labextension install jupyterlab-plotly`

`!pip install "jupyterlab>=3" "ipywidgets>=7.6"`


Plotly is declarative: you describe what the figure should contain rather than writing out every drawing step, and much of the interactivity comes automatically.
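As a rough illustration of what "declarative" means here: a Plotly figure is ultimately a description (traces plus layout) rather than a sequence of drawing commands. The sketch below writes such a description as a plain Python dictionary; the category names and values are placeholders, and no Plotly import is needed to build it.

```python
import json

# A figure described as data: what to show, not how to draw it.
# Category names and values are placeholders for illustration.
figure_spec = {
    "data": [
        {"type": "bar", "x": ["A", "B", "C"], "y": [5, 3, 1], "name": "Count"},
    ],
    "layout": {"title": {"text": "A minimal declarative figure"}},
}

# The description is plain data, so it serializes to JSON unchanged
print(json.dumps(figure_spec, indent=2))
```

With Plotly installed, passing this dictionary to `go.Figure(figure_spec)` and calling `.show()` should render it as an interactive bar chart.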

Plotly can integrate with Jupyter easily by installing the extensions mentioned above. 

I chose Plotly for the Pareto chart for the following reasons:
1. Interactive, presentation-quality graphs can be made with simple code. It requires fewer lines of code than Matplotlib to produce the same graph.
2. Clear documentation helps you identify which graph fits your data. I found the documentation easier to understand than Matplotlib's.
3. Beautiful galleries on the Plotly website show how to make bar charts and line charts, and how to change the formatting to build a Pareto chart.
4. There is a community called Chart Studio, where you can search public charts created by users. I found a good example of a Pareto chart there to learn from: https://chart-studio.plotly.com/~timopyr/11/diagram-pareto/#/code

If you are interested, I recommend this article on 'towardsdatascience.com' which compares Matplotlib vs Plotly by demonstrating some basic charts. 
https://towardsdatascience.com/matplotlib-vs-plotly-express-which-one-is-the-best-library-for-data-visualization-7a96dbe3ff09

Limitations of Plotly:
The documentation can be out of date. There is a large range of Plotly tools (Chart Studio, Express, etc.) that can be confusing and hard to keep up with. It's less popular than Matplotlib, so you won't find as many resources online for Plotly.




### Step by Step Tutorial


The dataset I picked is the "Tokyo 2020 Olympic Medal Count(2021)" dataset downloaded from Kaggle (ALAN, 2021). 

Data Source : https://www.kaggle.com/datasets/berkayalan/2021-olympics-medals-in-tokyo

The 2020 Summer Olympics, officially the Games of the XXXII Olympiad and branded as Tokyo 2020, was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.

Off the top of my head, I believed that the count of Olympic medals by country approximately follows the 80-20 rule. I picked this dataset to verify that with a Pareto chart.

First, let's import the libraries:



In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio

# Render figures in an iframe so they display inside the notebook
pio.renderers.default = 'iframe'

# import warnings
# warnings.filterwarnings('ignore')

Next, we use pandas to read the CSV file. I first sort the DataFrame by gold medal count in descending order, then add a column with the cumulative percentage. I also add a column with the constant value 80, used later to mark where 80% is.


(Note that some countries have the same gold medal count, so I used "Rank By Total" as the second sort criterion.)

In [12]:
#import the 2021 Tokyo Medals dataset from csv file and create a DataFrame
df = pd.read_csv('./TokyoMedals2021.csv')

#sort DataFrame by Gold Medal count descending
df = df.sort_values(by=['Gold Medal','Rank By Total'], ascending=[False, False])

#add a column to calculate cumulative percentage
df['cumulative_perc_gold'] = round(df['Gold Medal'].cumsum()/df['Gold Medal'].sum()*100,2)

#add a column for 80
df['eighty'] = 80

#Now let's take a look at the dataframe
print(len(df))
df.head(10)


93


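| Unnamed: 0 | Country | Gold Medal | Silver Medal | Bronze Medal | Total | Rank By Total | cumulative_perc_gold | eighty |
|---|---|---|---|---|---|---|---|---|
| 0 | United States of America | 39 | 41 | 33 | 113 | 1 | 11.47 | 80 |
| 1 | People's Republic of China | 38 | 32 | 18 | 88 | 2 | 22.65 | 80 |
| 2 | Japan | 27 | 14 | 17 | 58 | 5 | 30.59 | 80 |
| 3 | Great Britain | 22 | 21 | 22 | 65 | 4 | 37.06 | 80 |
| 4 | ROC | 20 | 28 | 23 | 71 | 3 | 42.94 | 80 |
| 5 | Australia | 17 | 7 | 22 | 46 | 6 | 47.94 | 80 |
| 7 | France | 10 | 12 | 11 | 33 | 10 | 50.88 | 80 |
| 6 | Netherlands | 10 | 12 | 14 | 36 | 9 | 53.82 | 80 |
| 8 | Germany | 10 | 11 | 16 | 37 | 8 | 56.76 | 80 |
| 9 | Italy | 10 | 10 | 20 | 40 | 7 | 59.71 | 80 |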
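The tie-breaking in the sort can be reproduced on a few rows with plain Python. This is only a sketch of the ordering logic, using five rows copied from the output above; it shows why France (rank 10) appears before Italy (rank 7) among the countries with 10 gold medals: the second sort key is also descending.

```python
# (country, gold medals, rank by total), copied from the output above
rows = [
    ("Italy", 10, 7),
    ("Germany", 10, 8),
    ("Australia", 17, 6),
    ("Netherlands", 10, 9),
    ("France", 10, 10),
]

# Sort by gold medals descending, then by "Rank By Total" descending,
# mirroring ascending=[False, False] in the pandas call
ordered = sorted(rows, key=lambda r: (r[1], r[2]), reverse=True)

print([country for country, _, _ in ordered])
```

To list the better-ranked country first among ties instead, the key could be `lambda r: (-r[1], r[2])` without `reverse=True`, which corresponds to `ascending=[False, True]` in pandas.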


Based on the DataFrame we just created, we build the first graph (Graph 1) by specifying the x and y axes. The result is a basic bar chart showing the gold medal count by country in descending order. If you hover over the bars, you can already see the interactivity: the count is shown for each country. This shows how easy it is to create interactive charts with Plotly.

#### First let's make a basic bar chart with gold medal counts

In [13]:
trace1 = go.Bar(
    x=df['Country'],
    y=df['Gold Medal']   
)
fig = go.Figure(data = trace1)
fig.update_layout(width = 1000,margin=dict(l=0, r=100, t=50, b=50))
fig.show()

#### Let's add some formatting to make it prettier 
This adds a title, and appropriate x-axis label font size to show all the country names:

In [14]:
trace1 = go.Bar(
    x=df['Country'],
    y=df['Gold Medal']
)
data = [trace1]
fig = go.Figure(data = data)
fig.update_layout(
    title = 'Tokyo 2020 Olympics - Gold Medal By Country',
    width=1300,
    margin=dict(l=0, r=100, t=50, b=50),
    xaxis = dict(
        tickfont = dict(size=10),
        tickangle=-70),
    annotations = [dict(xref='paper',
                        yref='paper',
                        x=0.02, y=1.05,
                        showarrow=False,
                        text='93 Countries in total')]  # use an annotation as a subtitle
)
fig.show()

#### Let's plot the other half of the Pareto Chart - the cumulative percentage

This is a basic scatter chart. Again, it is interactive automatically. Note that in Plotly, go.Scatter can plot points (markers), lines, or both, depending on the value of 'mode' (for example 'lines', 'markers', or 'lines+markers'). For more info, see the reference: https://plotly.com/python/reference/scatter/

In [15]:
trace2 = go.Scatter(
    x=df['Country'],
    y=df['cumulative_perc_gold'],
    mode = 'lines',
)

fig = go.Figure(data=trace2)
fig.update_traces(line_color='orange')  # set the color of the line
fig.update_layout(width = 1300, margin=dict(l=0, r=150, t=50, b=50))
fig.show()

#### Combining the two together 
Let's add a second Y-axis to Graph 1, and combine the two basic charts. I also created a layout object to decorate the visualization. 

In [16]:
trace1 = go.Bar(
    x=df['Country'],
    y=df['Gold Medal'],
    name='Count of Gold Medal',
    marker_color = '#219ebc'
)


trace2 = go.Scatter(
    x=df['Country'],
    y=df['cumulative_perc_gold'],
    name='Cumulative Percentage',
    marker_color='orange',
    yaxis='y2',
    line=dict(
        width=2.4
       )
)

layout = go.Layout(
    # `titlefont` is deprecated in favor of `title.font`,
    # so the title text and font are set together under `title`
    title=dict(
        text='Tokyo 2020 Olympics - Gold Medal By Country',
        font=dict(
            family='Balto, sans-serif',
            size=20
        )
    ),
    width=1300,
    height=600,
    paper_bgcolor='white',
    plot_bgcolor='#e6f6fe',
    showlegend=True,
    legend=dict(
        x=.83,
        y=1.4,
        font=dict(
            family='Balto, sans-serif',
            size=12
        ),
    ),

    # annotation used as a subtitle
    annotations=[dict(
        xref='paper',
        yref='paper',
        x=0.03, y=1.3,
        showarrow=False,
        text='93 Countries in total')],

    xaxis=dict(
        tickfont=dict(size=10),
        tickangle=-70),

    # left axis: medal counts (bars)
    yaxis=dict(
        title=dict(
            text='Count',
            font=dict(
                family='Balto, sans-serif',
                size=14,
                color='#219ebc')
        ),
        range=[0, 50],
        tickfont=dict(color='#219ebc'),
        tickvals=[0, 10, 20, 30, 40, 50]
    ),

    # right axis: cumulative percentage (line), drawn over the left axis
    yaxis2=dict(
        title=dict(
            text='Cumulative Percentage',
            font=dict(
                family='Balto, sans-serif',
                size=14,
                color='darkorange')
        ),
        range=[0, 101],
        tickfont=dict(color='darkorange'),
        tickvals=[0, 20, 40, 60, 80, 100],
        overlaying='y',
        side='right'
    )
)

data = [trace1,trace2] #Combine the two charts 
fig = go.Figure(data=data,layout = layout)
fig.update_layout(width = 1500,margin=dict(l=0, r=200, t=200, b=50))
fig.show()
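Most of the cell above is styling. The part that actually creates the second y-axis is small, and the sketch below isolates it as a plain dictionary in the shape of a Plotly figure description. The categories and values are placeholders; only the `yaxis`, `overlaying`, and `side` settings mirror the tutorial's code.

```python
# Stripped-down figure description showing only the dual-axis wiring.
# Values are placeholders; styling is omitted on purpose.
dual_axis_spec = {
    "data": [
        {"type": "bar", "x": ["A", "B", "C"], "y": [6, 3, 1], "name": "Count"},
        {"type": "scatter", "x": ["A", "B", "C"], "y": [60, 90, 100],
         "name": "Cumulative Percentage",
         "yaxis": "y2"},  # draw this trace against the second y-axis
    ],
    "layout": {
        "yaxis": {"title": {"text": "Count"}},
        "yaxis2": {
            "title": {"text": "Cumulative Percentage"},
            "overlaying": "y",  # share the plotting area with the first axis
            "side": "right",    # put its ticks on the right-hand side
            "range": [0, 101],
        },
    },
}

# Traces that point at the secondary axis
on_secondary = [t["name"] for t in dual_axis_spec["data"] if t.get("yaxis") == "y2"]
print(on_secondary)
```

Any trace without a `yaxis` setting is drawn against the default left axis, which is why the bar trace needs no extra configuration.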

#### Looking pretty good, now let's add the 80% line 
Thanks to Plotly's interactivity, we can simply hover over the intersection of the orange line and the grey dashed line to find that the cumulative percentage reaches 80% at Bulgaria. I disabled hover on the grey dashed line on purpose so that it doesn't get in the way.

In [17]:
trace3 = go.Scatter(
    x=df['Country'],
    y=df['eighty'],
    name='80 Percentage Line',
    hoverinfo='skip', #disable the hover feature
    yaxis='y2',#specify which yaxis to use
    line=dict(
        color='grey',
        dash = 'dash',
        width=2.0
       )#adjust color, type and width of the line
)
data = [trace1, trace2,trace3]#Now combine all three


fig = go.Figure(data=data,layout = layout)
fig.update_layout(width = 1500,margin=dict(l=0, r=200, t=200, b=50))
fig.show()

#### Let's add some final touches
First, add rank labels to the bars:

In [18]:
rank = list(range(1, len(df) + 1))  # one rank label per country, in sorted order
 
fig.update_traces(
    text = rank,
    textposition='inside',
    textfont = dict(color='white',size = 9),
    selector=dict(type="bar")
)

Secondly, let's add a vertical line so we can easily identify the cutoff at which the cumulative percentage reaches 80%. I also added a text annotation to mark the intersection.

### Final Visualization

In [19]:
fig.add_shape(type="line",
    xref="x", yref="y",
    x0='Bulgaria', y0=0, x1='Bulgaria', y1=50,
    line=dict(
        color="LightSeaGreen",
        dash = 'dash',
        width=1))

fig.add_annotation(dict(
    xref="x", 
    yref="y",
    x= 'Bulgaria', 
    y= 38,
    showarrow=False,
    text ='Bulgaria, Rank 24th among 93 countries',))

fig.show()


Now if you hover over Bulgaria in the chart above, you can see that Bulgaria is ranked 24th. Our dataset has 93 countries, so 24 of 93 countries (about 26%) account for 80% of the gold medals. It's not exactly 80/20, but it's close. The Pareto chart is thus a useful visualization technique for checking whether a dataset follows the 80/20 rule and for identifying where the cutoff is.
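The cutoff does not have to be read off the chart by hovering; it can also be computed. The helper below is a minimal sketch, not part of Plotly or pandas: `pareto_cutoff` is a hypothetical name, and the labels and counts are placeholder data. It assumes the counts are already sorted in descending order.

```python
from itertools import accumulate

def pareto_cutoff(labels, counts, target_pct=80.0):
    """Return (label, position) of the first category at which the
    cumulative share reaches target_pct. Assumes counts are already
    sorted in descending order and sum to a positive number."""
    total = sum(counts)
    running_totals = accumulate(counts)
    for position, (label, running) in enumerate(zip(labels, running_totals), start=1):
        # Compare without dividing to avoid floating-point surprises
        if running * 100 >= target_pct * total:
            return label, position
    return None

# Placeholder data, for illustration only
labels = ["A", "B", "C", "D", "E"]
counts = [40, 25, 20, 10, 5]
print(pareto_cutoff(labels, counts))
```

With the tutorial's DataFrame, calling `pareto_cutoff(df['Country'].tolist(), df['Gold Medal'].tolist())` should identify the cutoff country and its position, which could then be used for the vertical line instead of a hard-coded name.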

In a nutshell, Plotly is very helpful for drawing a Pareto chart thanks to its built-in interactivity, concise code, and clear documentation.

## Other Use Cases: A bit more exploration of the dataset and Plotly

Like Matplotlib, Plotly can create subplots. Using the same dataset, let's create subplots to see the gold, silver, and bronze medal counts together. First, let's create DataFrames for the silver and bronze medal charts. Note that we need to sort each DataFrame separately by its medal count column so that the cumulative percentages are calculated correctly:

(Some countries have the same silver/bronze medal count, so we sort by the medal count column as well as the "Rank By Total" column.)

In [10]:
# Sort separately for each medal type so that each cumulative percentage
# is computed in descending (Pareto) order; sort_values returns a new DataFrame
df2 = df.sort_values(by=['Silver Medal','Rank By Total'], ascending=[False, False])
df3 = df.sort_values(by=['Bronze Medal','Rank By Total'], ascending=[False, False])
df2['cumulative_perc_silver'] = round(df2['Silver Medal'].cumsum()/df2['Silver Medal'].sum()*100,2)
df3['cumulative_perc_bronze'] = round(df3['Bronze Medal'].cumsum()/df3['Bronze Medal'].sum()*100,2)

df3

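| Unnamed: 0 | Country | Gold Medal | Silver Medal | Bronze Medal | Total | Rank By Total | cumulative_perc_gold | eighty | cumulative_perc_bronze |
|---|---|---|---|---|---|---|---|---|---|
| 0 | United States of America | 39 | 41 | 33 | 113 | 1 | 11.47 | 80 | 8.21 |
| 4 | ROC | 20 | 28 | 23 | 71 | 3 | 42.94 | 80 | 13.93 |
| 5 | Australia | 17 | 7 | 22 | 46 | 6 | 47.94 | 80 | 19.40 |
| 3 | Great Britain | 22 | 21 | 22 | 65 | 4 | 37.06 | 80 | 24.88 |
| 9 | Italy | 10 | 10 | 20 | 40 | 7 | 59.71 | 80 | 29.85 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37 | Ecuador | 2 | 1 | 0 | 3 | 60 | 88.82 | 80 | 100.00 |
| 51 | South Africa | 1 | 2 | 0 | 3 | 60 | 96.18 | 80 | 100.00 |
| 45 | Romania | 1 | 3 | 0 | 4 | 47 | 96.47 | 80 | 100.00 |
| 46 | Venezuela | 1 | 3 | 0 | 4 | 47 | 96.76 | 80 | 100.00 |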
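The reason each medal type needs its own sort can be seen on a tiny example: the cumulative percentage depends on the order of the rows, and only a descending order gives the steep-then-flattening Pareto curve. The sketch below uses placeholder counts and a hypothetical helper name, `cumulative_pct`.

```python
from itertools import accumulate

def cumulative_pct(counts):
    """Running share of the total, in percent, in the given order."""
    total = sum(counts)
    return [round(running / total * 100, 1) for running in accumulate(counts)]

# Placeholder medal counts, for illustration only
counts = [2, 10, 3, 5]

unsorted_curve = cumulative_pct(counts)                      # order as given
pareto_curve = cumulative_pct(sorted(counts, reverse=True))  # descending order

print(unsorted_curve)
print(pareto_curve)
```

Reusing the gold-medal ordering for the silver or bronze columns would correspond to the first case: the percentages would still end at 100, but the curve would no longer describe how few countries account for most medals.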


In [11]:
from plotly.subplots import make_subplots

fig2 = make_subplots(rows=3, cols=1,subplot_titles=('Gold Medal',  'Silver Medal','Bronze Medal'))

fig2.append_trace(go.Bar(
    x=df['Country'],
    y=df['Gold Medal'],
    name='Count of Gold Medal',
    marker_color = 'gold'
), row=1, col=1)

fig2.append_trace(go.Scatter(
    x=df['Country'],
    y=df['cumulative_perc_gold'],
    name='Cumulative Percentage',
    marker_color='orange',
    line=dict(
        width=2.4
       )
), row=1, col=1)


fig2.append_trace(go.Bar(
    x=df2['Country'],
    y=df2['Silver Medal'],
    name='Count of Silver Medal',
    marker_color = 'silver'
), row=2, col=1)

fig2.append_trace(go.Scatter(
    x=df2['Country'],
    y=df2['cumulative_perc_silver'],
    name='Cumulative Percentage',
    marker_color='lightblue',
    line=dict(
        width=2.4
       )
), row=2, col=1)


fig2.append_trace(go.Bar(
    x=df3['Country'],
    y=df3['Bronze Medal'],
    name='Count of Bronze Medal',
    marker_color = '#C9B037'
), row=3, col=1)



fig2.append_trace(go.Scatter(
    x=df3['Country'],
    y=df3['cumulative_perc_bronze'],
    name='Cumulative Percentage',
    marker_color='#B4B4B4',
    line=dict(
        width=2.4
       )
), row=3, col=1)

rank = list(range(1, len(df) + 1))  # one rank label per country, in sorted order
 
fig2.update_traces(
    text = rank,
    textposition='inside',
    textfont = dict(color='white',size = 9),
    selector=dict(type="bar")
)

fig2.update_xaxes(
    tickfont_size= 8,
    tickangle=-70)

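# Bars and cumulative lines share one y-axis in each subplot,
# so the axis title mentions both quantities
fig2.update_yaxes(
    range=[0,101],
    tickvals = [0,20,40,60,80,100],
    title_text = 'Count / Cumulative %'
    )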
fig2.update_layout(
    height=1000,
    width=1500,
    margin=dict(l=0, r=200, t=200, b=50),
    title_text="Tokyo 2020 Olympics Medal Count",
    showlegend=True,
    legend=dict(
        x=.83,
        y=1.2,
        font=dict(
            family='Balto, sans-serif',
            size=12
        ),
    ),
)

fig2.show()

The chart above suggests that all three subplots follow the Pareto principle, though the pattern is more pronounced in the gold and silver medal charts. I didn't draw the 80% line and the vertical line in this chart, but thanks to Plotly's interactivity, I can still get the same information by hovering over the chart and doing a little bit of calculation.

For the silver medal chart: Republic of Korea, ranked 26th, is the cutoff for cumulative 80%. So 26 out of 93 countries (about 28%) won 80% of the silver medals, which approximately follows the 80-20 principle.

For the bronze medal chart: Azerbaijan, ranked 29th, is the cutoff for cumulative 80%. So 29 out of 93 countries (about 31%) won 80% of the bronze medals. This is closer to an 80-30 relationship, but still a clear Pareto pattern.
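The "little bit of calculation" is just the cutoff rank divided by the number of countries. As a worked example, the sketch below uses the three cutoff ranks reported in this tutorial (24, 26, and 29 out of 93); the variable names are illustrative.

```python
# Cutoff ranks read from the charts (rank of the country at which the
# cumulative share reaches 80%), out of the 93 countries in the dataset
total_countries = 93
cutoff_rank = {"Gold": 24, "Silver": 26, "Bronze": 29}

share_of_countries = {
    medal: round(rank / total_countries * 100, 1)
    for medal, rank in cutoff_rank.items()
}

for medal, share in share_of_countries.items():
    print(f"{medal}: {share}% of countries won about 80% of the medals")
```

All three shares fall between roughly a quarter and a third of the countries, which is why the charts look similar even though none of them is exactly 80/20.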




## Summary

In summary, the Pareto principle shows up in many places. I chose the Olympic dataset based on intuition, so it was something of a guess, but the data does indeed follow the Pareto principle. It's interesting to see that confirmed.

In addition, Plotly does a great job of helping us build interactive visualizations with few lines of code. The visualizations made with Plotly are polished enough for business presentations and publications, and they make data exploration much easier, so Plotly is a great tool for exploratory data analysis. Further analysis may be needed depending on the goal, but what I've just walked through is a good start.

Hope you enjoyed this tutorial!

## References:


Plotly JSON chart schema. (2018). Plotly. https://plotly.com/chart-studio-help/json-chart-schema/


Wikipedia contributors. (2022, February 25). Pareto principle. Wikipedia. https://en.wikipedia.org/wiki/Pareto_principle


Gulbis, J. (2016). The 80/20 Rule – The Law of Unfair Advantage. eazyBI. https://eazybi.com/blog/the-80-20-rule


Wilhite, T. (2021, November 20). The Disadvantages of Pareto Analysis. Bizfluent. https://bizfluent.com/list-6831238-disadvantages-pareto-analysis.html


GeeksforGeeks. (2020, September 15). Advantages and Disadvantages of Pareto Chart. https://www.geeksforgeeks.org/advantages-and-disadvantages-of-pareto-chart/


Araujo, I. (2022, January 6). Matplotlib vs. Plotly Express: Which One is the Best Library for Data Visualization? Medium. https://towardsdatascience.com/matplotlib-vs-plotly-express-which-one-is-the-best-library-for-data-visualization-7a96dbe3ff09


ALAN, B. (2021, August 9). Tokyo 2020 Olympics Medals. Kaggle. https://www.kaggle.com/datasets/berkayalan/2021-olympics-medals-in-tokyo


Diagram Pareto | bar chart made by Timopyr | plotly. (2018). Chart-Studio. https://chart-studio.plotly.com/%7Etimopyr/11/diagram-pareto/#/code
