### About rendering interactive plots in this notebook

If you are viewing this notebook on GitHub, none of the interactive plots in this file will be shown, because GitHub's file view does not render the embedded JavaScript that they need to work. To view this file properly, open it with **nbviewer**. The link is in the README file.

---
# Interactive Plots Handout


Interactive plots are dynamic visualizations that let users explore and manipulate the data. Basic features include **zooming**, **panning**, and showing a **tooltip** with additional information on hover. More advanced features include **linked interactions** (several plots are linked, so that selecting or interacting with one updates the others), **added controls** such as sliders or dropdown menus that change parameters and update the plots in real time, and **3D navigation** (rotating a plot to view it from any side). These features let an interactive plot convey more information than an ordinary static one.


Plenty of Python libraries can produce interactive plots. Here is an incomplete list of the most popular ones:
*   [Plotly](https://plotly.com/python/)
*   [Bokeh](https://docs.bokeh.org/en/latest/index.html)
*   [mpld3](https://mpld3.github.io/)
*   [Altair](https://altair-viz.github.io/)
*   [pygal](https://www.pygal.org/en/stable/index.html)
*   [bqplot](https://bqplot.readthedocs.io/en/latest/usage/pyplot/)

In this handout, I will focus on the first two, mainly Plotly with a bit of Bokeh, and show some of the interactive plots that can be made with them.



## 0. Load Data

I will use an annotated text of Leo Tolstoy's *War and Peace* as the source of data for visualisation. So, let's first import the necessary packages and load the data:

In [1]:
import pandas as pd
import numpy as np
import requests

In [2]:
tolstoy_df = pd.read_csv('https://raw.githubusercontent.com/dashapopova/Data-Analysis-Python-II/main/24.09/tolstoy.csv', sep='\t').fillna('')


In [3]:
tolstoy_df.shape

(216099, 25)

In [4]:
tolstoy_df.head(15)

Unnamed: 0,lex,word,POS,time,gender,case,number,verbal,adj_form,comp,...,имя,отч,фам,вводн,гео,сокр,обсц,разг,редк,устар
0,том,том,S,,муж,вин,ед,,,,...,,,,,,,,,,
1,первый,первый,ANUM,,муж,вин,ед,,,,...,,,,,,,,,,
2,часть,часть,S,,жен,вин,ед,,,,...,,,,,,,,,,
3,первый,первая,ANUM,,жен,им,ед,,,,...,,,,,,,,,,
4,ну,ну,PART,,,,,,,,...,,,,,,,,,,
5,здравствовать,здравствуйте,V,,,,мн,пов,,,...,,,,,,,,,,
6,здравствовать,здравствуйте,V,,,,мн,пов,,,...,,,,,,,,,,
7,садиться,садитесь,V,непрош,,,мн,изъяв,,,...,,,,,,,,,,
8,и,и,CONJ,,,,,,,,...,,,,,,,,,,
9,рассказывать,рассказывайте,V,,,,мн,пов,,,...,,,,,,,,,,


## 1. Plotly

The first library that we will look at is Plotly. Let's install it (this notebook pins version 5.13.0):

In [5]:
pip install plotly==5.13.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import plotly and initialise it:

In [6]:
import plotly.offline as pyo
import plotly.express as px
import plotly.io as pio

pyo.init_notebook_mode(connected=True)
pio.renderers.default = 'notebook_connected'
pio.renderers

Renderers configuration
-----------------------
    Default renderer: 'notebook_connected'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']

The renderer setting should be configured depending on where the code is going to be shown.

I use `notebook_connected` for plots to be visible in **nbviewer**.

### 1.1 Bar chart

As a first interactive plot, let's make a bar chart that shows the distribution of names that are mentioned in the text. We will show only name lexemes that are mentioned more than 15 times, and color them according to their gender.

In [7]:
names_count = tolstoy_df[tolstoy_df['имя'] != ''][['lex', 'gender']].value_counts().reset_index()
names_count.columns = ['lex', 'gender', 'count']
names_count = names_count[names_count['count'] > 15]

In [8]:
bar_fig = px.bar(names_count, x='lex', y='count', color='gender')
bar_fig.update_layout(xaxis={'categoryorder':'total descending'}) # descending order of bars
bar_fig.update_layout(xaxis_title="Character Name", yaxis_title="Number of Mentions", title='Character Mentions')
bar_fig.show()

We can see the exact number of mentions on hover, along with the other information. The chart can also be zoomed, or a particular horizontal or vertical slice of it selected. Plotly lets you configure the information in the hover tooltip in any format. For example, for each name lexeme let's create a list of its attested forms and show it in the tooltip.

In [9]:
names_count = tolstoy_df[tolstoy_df['имя'] != ''][['lex', 'gender', 'word']].groupby(['lex', 'gender'], as_index=False).agg({'word': [set, 'count']})
names_count = names_count[names_count['word']['count'] > 15]
names_count['forms'] = names_count['word']['set'].apply(lambda x: ", ".join(list(x))) # create a column 'forms' with string values -- all forms separated by commas

In [10]:
bar_fig2 = px.bar(x=names_count['lex'], y=names_count['word']['count'], color=names_count['gender'], hover_data={'forms': names_count['forms']})
bar_fig2.update_layout(xaxis={'categoryorder':'total descending'}) # descending order of bars
bar_fig2.update_layout(xaxis_title="Character Name", yaxis_title="Number of Mentions", title='Character Mentions')
bar_fig2.show()

### 1.2 Distribution plot

Now let's select the three most mentioned characters (Пьер, Андрей, Наташа) and see how their mentions are distributed along the text.

A distribution plot, which is similar to a histogram, can help with that. The difference is that besides the histogram, it plots a KDE curve. This will make the visualisation more informative if there are several distributions on one plot.

In [11]:
import plotly.figure_factory as ff

In [12]:
main_characters = ['пьер', 'андрей', 'наташа']

main_characters_df_list = [list(tolstoy_df[tolstoy_df['lex'] == character].index) for character in main_characters]

In [13]:
dist_fig = ff.create_distplot(main_characters_df_list, group_labels=main_characters, bin_size=1000)
dist_fig.show()

We can now hide the histograms. Also remember that every part of the plot can be zoomed in here.

In [14]:
dist_fig2 = ff.create_distplot(main_characters_df_list, group_labels=main_characters, bin_size=1000, show_hist=False)
dist_fig2.show()
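
For intuition about the KDE curve: a kernel density estimate places a small bell curve on every data point and averages them. The sketch below implements a Gaussian KDE with NumPy on made-up token positions. The function name `gaussian_kde_sketch` and the fixed `bandwidth` are assumptions for illustration; `create_distplot` computes its curve internally and does not use this function.

```python
import numpy as np


def gaussian_kde_sketch(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at those distances
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the estimate integrates to 1
    return kernels.sum(axis=1) / (len(samples) * bandwidth)


# Made-up positions of one character's mentions (illustration only)
toy_positions = [100, 150, 400, 420, 900]
grid = np.linspace(-1000, 2000, 3001)
density = gaussian_kde_sketch(toy_positions, grid, bandwidth=100.0)
```

A larger bandwidth gives a smoother curve, while a smaller one follows individual mentions more closely.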

### 1.3 Pie chart



Let's draw a pie chart to show the distribution of different Parts Of Speech in the text:

In [15]:
pos_shares = tolstoy_df['POS'].value_counts(normalize=True)

In [16]:
pie_fig = px.pie(values=pos_shares, names=pos_shares.keys(), title='Parts of Speech distribution')
pie_fig.show()

### 1.4 Box plot

Now let's see, for each part of speech, how many times its lexemes are typically repeated in the text. A box plot can show that. To keep the plot readable, only lexemes that occur more than 10 and fewer than 3000 times are kept.

In [17]:
lex_frequency_df = tolstoy_df[['lex', 'POS', 'gender']].groupby(['lex', 'POS'], as_index=False).count()
lex_frequency_df.columns = ['lex', 'POS', 'total']
lex_frequency_df = lex_frequency_df[(lex_frequency_df.total > 10) & (lex_frequency_df.total < 3000)]

In [18]:
box_fig = px.box(lex_frequency_df, x='POS', y='total', color='POS', hover_data=['lex'])
box_fig.show()

The box plot shows that the most frequently repeated lexemes are pronouns, which is to be expected. The tooltip also lets us see the exact values on hover, and hovering over an outlier shows its lexeme, which is quite useful. And of course, the plot is zoomable as well.

### 1.5 Heatmap

Now let's draw a heatmap to illustrate the distribution of part-of-speech bigrams.
Namely, for every ordered pair of parts of speech, we will calculate the share of all bigrams whose first and second words have those parts of speech.

In [19]:
import nltk
from collections import Counter

bigrams = list(nltk.bigrams(tolstoy_df['POS']))
sorted_bigrams = sorted(zip(Counter(bigrams).values(), Counter(bigrams).keys()), reverse=True)

In [20]:
parts_of_speech = list(tolstoy_df['POS'].unique())
bigrams_stats = list([0] * len(parts_of_speech) for i in range(len(parts_of_speech)))

In [21]:
for count, bigram in sorted_bigrams:
  bigrams_stats[parts_of_speech.index(bigram[0])][parts_of_speech.index(bigram[1])] = count / len(bigrams)

In [22]:
heat_fig = px.imshow(bigrams_stats, x=parts_of_speech, y=parts_of_speech)
heat_fig.update_layout(xaxis_title="2nd word", yaxis_title="1st word", title='Bigrams POS distribution')
heat_fig.show()
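
The same kind of matrix can also be sketched without NLTK: shifting the POS column by one position pairs each tag with the next one, and `pd.crosstab` with `normalize=True` turns the pair counts into shares. The tag sequence in `toy_pos` is made up for illustration, and the names `toy_pos`, `first`, `second` and `toy_matrix` are not part of the original notebook.

```python
import pandas as pd

# Made-up sequence of POS tags (illustration only)
toy_pos = pd.Series(['S', 'V', 'S', 'V', 'CONJ', 'S'])

# Pair each tag with the tag that follows it
first = toy_pos.iloc[:-1].reset_index(drop=True)
second = toy_pos.iloc[1:].reset_index(drop=True)

# Rows are the first tag, columns the second; values are shares of all bigrams
toy_matrix = pd.crosstab(first, second,
                         rownames=['1st word'], colnames=['2nd word'],
                         normalize=True)
print(toy_matrix)
```

The row and column labels of `toy_matrix` play the role of `parts_of_speech` above, and its values the role of `bigrams_stats`.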

## 2. Bokeh

Let's look at another library for drawing interactive plots, Bokeh. Install it:

In [23]:
pip install bokeh

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import and initialise:

In [24]:
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot
from bokeh.models import Title, ColumnDataSource
from bokeh.transform import factor_cmap
from bokeh.io import output_notebook

output_notebook()

### 2.1 Scatter plot

In my opinion, Bokeh is better suited for scatter plots, as its grid is infinitely scalable. Let's create a scatter plot. I can't think of any meaningful visualisation of Tolstoy's data in the form of a scatter plot, so I will use the standard Iris dataset here.

In [25]:
iris_df = px.data.iris()
iris_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_id
0,5.1,3.5,1.4,0.2,setosa,1
1,4.9,3.0,1.4,0.2,setosa,1
2,4.7,3.2,1.3,0.2,setosa,1
3,4.6,3.1,1.5,0.2,setosa,1
4,5.0,3.6,1.4,0.2,setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,3
146,6.3,2.5,5.0,1.9,virginica,3
147,6.5,3.0,5.2,2.0,virginica,3
148,6.2,3.4,5.4,2.3,virginica,3


In [26]:
scat_fig = figure(width=600, height=600)
scat_fig.add_layout(Title(text="Sepal Length", align="center"), "below")
scat_fig.add_layout(Title(text="Sepal Width", align="center"), "left")

scat_fig.scatter('sepal_length', 'sepal_width', source=iris_df,
                 color=factor_cmap('species', 'Category10_3', sorted(iris_df.species.unique())), # map species values to colors
                 legend_group='species', size=10, alpha=0.5)

show(scat_fig)

This scatter plot can be easily zoomed in and out and dragged.

### 2.2 Linked scatter plots

Bokeh lets you link several plots that are built on the same data and interact with them simultaneously.

In [27]:
SPECIES = sorted(iris_df.species.unique())
TOOLS = "pan,wheel_zoom,box_zoom,reset,box_select,lasso_select,help"

SOURCE = ColumnDataSource(iris_df) # unified data source to select from both simultaneously

left_scat = figure(width=500, height=500, title="Sepal Size",
                   tools=TOOLS, background_fill_color="#fafafa")
left_scat.add_layout(Title(text="Sepal Length", align="center"), "below")
left_scat.add_layout(Title(text="Sepal Width", align="center"), "left")
left_scat.scatter("sepal_length", "sepal_width", source=SOURCE,
                  color=factor_cmap('species', 'Category10_3', SPECIES),
                  legend_group='species', size=10, alpha=0.7)

right_scat = figure(width=500, height=500, title="Petal Size",
                    tools=TOOLS, background_fill_color="#fafafa",
                    x_range=left_scat.x_range, y_range=left_scat.y_range) # share the ranges to move around them simultaneously
right_scat.add_layout(Title(text="Petal Length", align="center"), "below")
right_scat.add_layout(Title(text="Petal Width", align="center"), "left")
right_scat.scatter("petal_length", "petal_width", source=SOURCE,
                   color=factor_cmap('species', 'Category10_3', SPECIES),
                   legend_group='species', size=10, alpha=0.7)

show(gridplot([[left_scat, right_scat]]))

As you can see, these two graphs are panned and zoomed simultaneously. Also, if you choose the lasso or box selection tool and select some points on one graph, the corresponding points on the other graph are highlighted as well. This can be quite useful.
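
To isolate the two linking mechanisms, here is a minimal sketch that links only the selection: both figures read from one `ColumnDataSource`, but each keeps its own axis ranges. The data and the names `toy_source`, `SELECT_TOOLS`, `left`, `right` and `toy_grid` are made up for illustration.

```python
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# One shared data source: a selection made in either plot is stored here
toy_source = ColumnDataSource({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1], 'c': [2, 4, 1, 3]})
SELECT_TOOLS = "box_select,lasso_select,reset"

left = figure(width=300, height=300, tools=SELECT_TOOLS, title="a vs b")
left_renderer = left.scatter('a', 'b', source=toy_source, size=10)

# No x_range/y_range sharing here, so panning and zooming stay independent
right = figure(width=300, height=300, tools=SELECT_TOOLS, title="a vs c")
right_renderer = right.scatter('a', 'c', source=toy_source, size=10)

toy_grid = gridplot([[left, right]])
# show(toy_grid)  # uncomment inside a notebook after calling output_notebook()
```

Sharing the source links selections, while passing `x_range` and `y_range` from one figure to another (as in the cell above) links panning and zooming; the two can be combined or used separately.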

## 3. Conclusions

In conclusion, interactive plots are a powerful tool for exploring and visualizing data. By allowing interaction and showing tooltips, they can convey more information while remaining comprehensible.

In this handout, I've shown only a few of their possibilities; many features have not been covered here.