### About rendering interactive plots in this notebook

If you are viewing this notebook on GitHub, the interactive plots in this file will not be shown, because GitHub's file view does not render the embedded JavaScript that they need in order to work. To view this file properly, open it with **nbviewer** at:

[https://nbviewer.org/github/gbulg/Data-analysis-class/blob/main/Final%20Project/handout.ipynb](https://nbviewer.org/github/gbulg/Data-analysis-class/blob/main/Final%20Project/handout.ipynb)

---
# Interactive Plots Handout


Interactive plots are dynamic visualizations that let users explore and manipulate the data. The basic features are **zooming**, **panning**, and showing a **tooltip** with additional information on hover. More advanced features include **linked interactions** (several plots are linked together, so that selecting or interacting with one plot updates the others), **adding controls** such as sliders or dropdown menus that change parameters and update the plots in real time, and **navigating 3D plots** (rotating them to view them from any side). Together, these features make interactive plots much more informative than ordinary static ones.


Plenty of Python libraries and packages can produce interactive plots. Here is an incomplete list of the most popular ones:
*   [Plotly](https://plotly.com/python/)
*   [Bokeh](https://docs.bokeh.org/en/latest/index.html)
*   [mpld3](https://mpld3.github.io/)
*   [Altair](https://altair-viz.github.io/)
*   [pygal](https://www.pygal.org/en/stable/index.html)
*   [bqplot](https://bqplot.readthedocs.io/en/latest/usage/pyplot/)

In this handout, I will focus on the first two, Plotly and Bokeh, and show some of the interactive plots that can be made with them.



## 0. Load Data

I will use an annotated text of Leo Tolstoy's War and Peace as the source of data for visualisation. So, let's first import all the necessary packages and load the data:

In [1]:
import pandas as pd
import numpy as np
import requests

In [2]:
tolstoy_df = pd.read_csv('https://raw.githubusercontent.com/dashapopova/Data-Analysis-Python-II/main/24.09/tolstoy.csv', sep='\t').fillna('')



In [3]:
tolstoy_df.shape

(216099, 25)

In [4]:
tolstoy_df.head(15)

Unnamed: 0,lex,word,POS,time,gender,case,number,verbal,adj_form,comp,...,имя,отч,фам,вводн,гео,сокр,обсц,разг,редк,устар
0,том,том,S,,муж,вин,ед,,,,...,,,,,,,,,,
1,первый,первый,ANUM,,муж,вин,ед,,,,...,,,,,,,,,,
2,часть,часть,S,,жен,вин,ед,,,,...,,,,,,,,,,
3,первый,первая,ANUM,,жен,им,ед,,,,...,,,,,,,,,,
4,ну,ну,PART,,,,,,,,...,,,,,,,,,,
5,здравствовать,здравствуйте,V,,,,мн,пов,,,...,,,,,,,,,,
6,здравствовать,здравствуйте,V,,,,мн,пов,,,...,,,,,,,,,,
7,садиться,садитесь,V,непрош,,,мн,изъяв,,,...,,,,,,,,,,
8,и,и,CONJ,,,,,,,,...,,,,,,,,,,
9,рассказывать,рассказывайте,V,,,,мн,пов,,,...,,,,,,,,,,


## 1. Plotly

The first library that we will look at is Plotly. Let's install it, pinned to version 5.13.0:

In [5]:
pip install plotly==5.13.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting plotly==5.13.0
  Downloading plotly-5.13.0-py2.py3-none-any.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 72.2 MB/s eta 0:00:00
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.5.0
    Uninstalling plotly-5.5.0:
      Successfully uninstalled plotly-5.5.0
Successfully installed plotly-5.13.0


Import plotly and initialise it:

In [6]:
import plotly.offline as pyo
import plotly.express as px
import plotly.io as pio

pyo.init_notebook_mode(connected=True)
pio.renderers.default = 'notebook_connected+colab'
pio.renderers

Renderers configuration
-----------------------
    Default renderer: 'notebook_connected+colab'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']

The renderer should be chosen according to where the notebook is going to be shown.

I use `notebook_connected` so that the plots are visible in **nbviewer**, combined with `colab` (via `+`) so that they also show up in Google Colab.

### 1.1 Bar chart

As a first interactive plot, let's make a bar chart with the distribution of names that are mentioned in the text. We will show only name lexemes that are mentioned more than 15 times (the code filters on `count > 15`), and color them according to their gender.

In [33]:
names_count = tolstoy_df[tolstoy_df['имя'] != ''][['lex', 'gender']].value_counts().reset_index()
names_count.columns = ['lex', 'gender', 'count']
names_count = names_count[names_count['count'] > 15]

In [35]:
bar_fig = px.bar(names_count, x='lex', y='count', color='gender')
bar_fig.update_layout(xaxis={'categoryorder':'total descending'}) # descending order of bars
bar_fig.update_layout(xaxis_title="Character Name", yaxis_title="Number of Mentions", title='Character Mentions')
bar_fig.show()

On hover, we can see the exact number of mentions along with the other information. The chart can also be zoomed in, or a particular slice of it (horizontal or vertical) can be selected. Plotly lets us put arbitrary information into the hover tooltip. For example, for each name lexeme let's collect the list of its word forms and show it in the tooltip.

In [88]:
names_count = tolstoy_df[tolstoy_df['имя'] != ''][['lex', 'gender', 'word']].groupby(['lex', 'gender'], as_index=False).agg({'word': [set, 'count']})
names_count = names_count[names_count['word']['count'] > 15]
names_count['forms'] = names_count['word']['set'].apply(lambda x: ", ".join(list(x))) # create a column 'forms' with string values -- all forms separated by commas

In [90]:
bar_fig2 = px.bar(x=names_count['lex'], y=names_count['word']['count'], color=names_count['gender'], hover_data={'forms': names_count['forms']})
bar_fig2.update_layout(xaxis={'categoryorder':'total descending'}) # descending order of bars
bar_fig2.update_layout(xaxis_title="Character Name", yaxis_title="Number of Mentions", title='Character Mentions')
bar_fig2.show()
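
The multi-level columns produced by `agg({'word': [set, 'count']})` can be awkward to index. As an alternative sketch (not the approach used above), pandas named aggregation gives flat column names. The toy data frame is made up for illustration, and `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up stand-in for the name rows of tolstoy_df
toy = pd.DataFrame({
    'lex': ['пьер', 'пьер', 'пьер', 'наташа', 'наташа'],
    'gender': ['муж', 'муж', 'муж', 'жен', 'жен'],
    'word': ['пьер', 'пьера', 'пьер', 'наташа', 'наташу'],
})

# Named aggregation: each keyword becomes a flat output column
summary = toy.groupby(['lex', 'gender'], as_index=False).agg(
    forms=('word', lambda s: ', '.join(sorted(set(s)))),  # unique forms, comma-separated
    count=('word', 'count'),                              # number of mentions
)
print(summary)
```

With flat columns, the result could be passed to Plotly directly, for example `px.bar(summary, x='lex', y='count', color='gender', hover_data=['forms'])`.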

### 1.2 Distribution plot

Now let's select the three most mentioned characters (Пьер, Андрей, Наташа) and see how their mentions are distributed along the text.

A distribution plot, which is similar to a histogram, can help with that. The difference is that, besides the histogram, it also plots a kernel density estimate (KDE) curve, a smoothed version of the histogram. This makes the visualisation more readable when several distributions share one plot.

In [91]:
import plotly.figure_factory as ff

In [92]:
main_characters = ['пьер', 'андрей', 'наташа']

main_characters_df_list = [list(tolstoy_df[tolstoy_df['lex'] == character].index) for character in main_characters]

In [93]:
dist_fig = ff.create_distplot(main_characters_df_list, group_labels=main_characters, bin_size=1000)
dist_fig.show()

We can now hide the histograms. Also, remember that every part of the plot can be zoomed in here.

In [96]:
dist_fig2 = ff.create_distplot(main_characters_df_list, group_labels=main_characters, bin_size=1000, show_hist=False)
dist_fig2.show()
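
To get a feel for what the KDE curve represents, here is a minimal hand-rolled Gaussian kernel density estimate. It is only a sketch: the positions are made-up values, the bandwidth is picked by hand, and the curve that Plotly draws may use a different bandwidth, so the numbers will not match.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on every sample
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical token positions at which one character is mentioned
positions = [100, 120, 150, 900, 950]
density = gaussian_kde(positions, bandwidth=50)  # bandwidth chosen by hand

# The estimate is higher inside a cluster of mentions than in the gap between clusters
print(density(120) > density(500))
```

A larger bandwidth gives a smoother curve that hides short bursts of mentions, while a smaller one follows individual mentions more closely.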

### 1.3 Pie chart



Let's draw a pie chart to show the distribution of the different parts of speech in the text:

In [97]:
pos_shares = tolstoy_df['POS'].value_counts(normalize=True)

In [99]:
pie_fig = px.pie(values=pos_shares, names=pos_shares.keys(), title='Parts of Speech distribution')
pie_fig.show()

### 1.4 Box plot

Now let's see, for each part of speech, how many times its lexemes typically repeat in the text. A box plot can show that.

In [105]:
lex_frequency_df = tolstoy_df[['lex', 'POS', 'gender']].groupby(['lex', 'POS'], as_index=False).count()
lex_frequency_df.columns = ['lex', 'POS', 'total']
lex_frequency_df = lex_frequency_df[(lex_frequency_df.total > 10) & (lex_frequency_df.total < 3000)]

In [106]:
box_fig = px.box(lex_frequency_df, x='POS', y='total', color='POS', hover_data=['lex'])
box_fig.show()

The box plot shows that the most frequently repeated lexemes are pronouns, which is to be expected. The tooltip lets us see the exact values on hover, and hovering over an outlier shows its lexeme, which is quite useful and could not be achieved with a static plot. And of course, the plot is zoomable as well.
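
The outlier points in a box plot are commonly defined by the 1.5 × IQR rule: anything beyond the whisker fences is drawn as a separate point. The sketch below applies that rule to made-up counts using only the standard library. The quartile method (`'inclusive'`) is my assumption, and the quartiles that Plotly computes may give slightly different fences.

```python
import statistics

# Made-up lexeme frequencies for one part of speech
counts = [11, 12, 14, 15, 18, 20, 25, 400]

# Quartiles split the sorted data into four equal parts
q1, median, q3 = statistics.quantiles(counts, n=4, method='inclusive')
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond the fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in counts if c < lower_fence or c > upper_fence]
print(outliers)
```

In the interactive plot, these are exactly the points whose lexeme is worth inspecting on hover.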

### 1.5 Heatmap

Now let's draw a heatmap to illustrate the distribution of part-of-speech bigrams.
Namely, for every ordered pair of parts of speech, we will calculate the share of all bigrams whose first and second words have those parts of speech.

In [107]:
import nltk
from collections import Counter

bigrams = list(nltk.bigrams(tolstoy_df['POS']))
bigram_counts = Counter(bigrams)  # count every (1st POS, 2nd POS) pair once
sorted_bigrams = sorted(zip(bigram_counts.values(), bigram_counts.keys()), reverse=True)

In [108]:
parts_of_speech = list(tolstoy_df['POS'].unique())
bigrams_stats = list([0] * len(parts_of_speech) for i in range(len(parts_of_speech)))

In [111]:
for count, bigram in sorted_bigrams:
  bigrams_stats[parts_of_speech.index(bigram[0])][parts_of_speech.index(bigram[1])] = count / len(bigrams)

In [117]:
heat_fig = px.imshow(bigrams_stats, x=parts_of_speech, y=parts_of_speech)
heat_fig.update_layout(xaxis_title="2nd word", yaxis_title="1st word", title='Bigrams POS distribution')
heat_fig.show()
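
The bigram matrix computed above can be checked on a toy sequence. The following sketch uses only the standard library; the tag sequence is made up, and `pos_tags`, `labels`, and `matrix` are hypothetical names, not the variables from the cells above.

```python
from collections import Counter

# Toy POS sequence standing in for tolstoy_df['POS']
pos_tags = ['S', 'V', 'S', 'V', 'PR', 'S']

# Consecutive pairs: (S, V), (V, S), (S, V), (V, PR), (PR, S)
bigram_counts = Counter(zip(pos_tags, pos_tags[1:]))
total = sum(bigram_counts.values())

labels = sorted(set(pos_tags))
index = {tag: i for i, tag in enumerate(labels)}  # tag -> row/column number

# matrix[i][j] is the share of bigrams whose 1st word has POS labels[i]
# and whose 2nd word has POS labels[j]
matrix = [[0.0] * len(labels) for _ in labels]
for (first, second), count in bigram_counts.items():
    matrix[index[first]][index[second]] = count / total

for tag, row in zip(labels, matrix):
    print(tag, row)
```

Because every bigram falls into exactly one cell, all cells of the matrix sum to 1, which is a quick sanity check for the real data as well.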

## 2. Bokeh

Let's look at the other library for drawing interactive plots, Bokeh. Install it:

In [6]:
pip install bokeh

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import and initialise:

In [7]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

### 2.1 Scatter plot

In my opinion, Bokeh is better suited for scatter plots, as its grid is infinitely scalable. Let's create a scatter plot.

In [92]:
scat_y_data = tolstoy_df.groupby(['lex', 'word', 'POS'], as_index=False).count().groupby(['lex', 'POS'], as_index=False).count()

In [105]:
scat_x_data = tolstoy_df.groupby(['word'], as_index=False).count()
scat_x_data = scat_x_data[scat_x_data.time > 10]

In [106]:
p2 = figure(width=600, height=600)

p2.circle(list(scat_x_data.time), list(len(x) for x in scat_x_data.word), size=20, color='navy', alpha=0.5)
show(p2)

In [21]:
p = figure(width=600, height=600)

# add a circle renderer with a size, color, and alpha
p.circle([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=20, color="navy", alpha=0.5)

# show the results
show(p)
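
Like Plotly, Bokeh can show tooltips on hover. Here is a minimal sketch, assuming a reasonably recent Bokeh version; the words and counts are made-up stand-ins for the columns computed from `tolstoy_df`, and the names `source` and `hover_fig` are hypothetical.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up stand-in data: word form, number of occurrences, length in characters
words = ['и', 'пьер', 'сказал', 'наташа']
source = ColumnDataSource(data={
    'word': words,
    'occurrences': [5000, 1200, 900, 700],
    'length': [len(w) for w in words],
})

hover_fig = figure(width=600, height=600, tools='pan,wheel_zoom,box_zoom,reset')
# scatter() reads the 'occurrences' and 'length' columns from the shared data source
hover_fig.scatter('occurrences', 'length', source=source, size=12, color='navy', alpha=0.5)
# '@name' in a tooltip refers to a column of the data source
hover_fig.add_tools(HoverTool(tooltips=[('word', '@word'), ('occurrences', '@occurrences')]))
```

Calling `show(hover_fig)` after `output_notebook()` displays the plot in the notebook.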

## 3. Conclusions

some conclusions

---
---
---

end of finished version.

manipulatable plot in plotly?

bokeh:

1. scatter plot with ability to move around
2. maybe line chart also with ability to move around
3. smth with linked selection?
4. network plot in bokeh
5. mb scatterplot grid?
6. manipulatable plot
7. timeseries plot

Let's plot a heatmap that shows how animacy and case co-occur.

In [21]:
case_df = tolstoy_df[(tolstoy_df.animacy != '')&(tolstoy_df.case != '')]

In [22]:
fig8 = px.density_heatmap(case_df, x=case_df.case, y=case_df.animacy)
fig8.show()

In [9]:
list(tolstoy_df.columns)

['lex',
 'word',
 'POS',
 'time',
 'gender',
 'case',
 'number',
 'verbal',
 'adj_form',
 'comp',
 'person',
 'aspect',
 'voice',
 'animacy',
 'transitivity',
 'имя',
 'отч',
 'фам',
 'вводн',
 'гео',
 'сокр',
 'обсц',
 'разг',
 'редк',
 'устар']

In [19]:
tolstoy_df[(tolstoy_df['comp'] != '')][['lex', 'word', 'POS', 'comp']]

Unnamed: 0,lex,word,POS,comp
381,великий,величайшая,A,прев
407,ужасно,ужаснее,ADV,срав
1154,высокий,высшая,A,прев
1338,хорошо,лучше,ADV,срав
1390,милый,милее,A,срав
...,...,...,...,...
215012,сильный,сильнейший,A,прев
215414,малый,малейшего,A,прев
215620,храбрый,храбрейшему,A,прев
215626,храбро,храбрее,ADV,срав


In [42]:
set(tolstoy_df.case)

{'', 'вин', 'дат', 'зват', 'им', 'пр', 'род', 'твор'}

In [24]:
tolstoy_df[(tolstoy_df['case'] != '') & (tolstoy_df['animacy'] != '')][['lex', 'word', 'case', 'animacy']]  # one combined mask avoids the reindexing warning


Unnamed: 0,lex,word,case,animacy
0,том,том,вин,неод
1,первый,первый,вин,неод
2,часть,часть,вин,неод
13,июль,июле,пр,неод
14,год,года,вин,неод
...,...,...,...,...
216093,честный,честный,вин,неод
216094,человек,человек,вин,од
216096,след,следам,дат,неод
216097,этот,этого,вин,од


You can start with a brief overview of what interactive plots are, followed by a detailed explanation of two popular libraries in Python: Plotly and Bokeh. You can then compare the two libraries, highlighting their pros and cons, use cases, and functionality. Finally, you can conclude with some best practices and recommendations for further study.

This should give you a comprehensive handout on interactive plots in Python that covers both the basics and advanced topics.

### Interactive Plots in Python

Interactive plots allow users to explore and manipulate data in a more engaging and informative way than static plots. In Python, there are several libraries that provide functionality for creating interactive plots, such as Plotly, Bokeh, and Altair. These libraries offer a wide range of interactive features, such as zooming, panning, hovering, and selecting data points.

### Usage in Linguistic Research

Interactive plots can be particularly useful in linguistic research, where complex linguistic data often requires visual exploration to understand patterns and relationships. Here are some examples of how interactive plots can be used in linguistic research:

*   **Visualizing word frequency distributions:** Interactive plots can be used to create interactive word clouds, where users can zoom in and out to see the most frequent words in a corpus. This is especially useful in linguistic research as it allows researchers to quickly identify the most important words in a large corpus of text.
*   **Plotting linguistic change over time:** Interactive line charts can be used to plot linguistic change over time, allowing researchers to see the trends in language usage. For example, researchers can plot the frequency of certain words or phrases over time to see how they have changed in usage.
*   **Analyzing lexical similarity:** Interactive plots can be used to visualize the similarity between words in a lexicon. Researchers can create interactive heatmaps to see the similarity between words, or use interactive scatter plots to see the relationships between words.
*   **Exploring linguistic diversity:** Interactive plots can be used to explore linguistic diversity in a corpus. For example, researchers can use interactive bar charts to plot the frequency of different languages in a corpus, or use interactive scatter plots to visualize the diversity of language usage in different regions.

### Conclusion

In conclusion, interactive plots are a powerful tool for exploring and visualizing linguistic data. By allowing users to interact with data, they provide a more engaging and informative way to understand linguistic patterns and relationships. Whether you're a researcher, student, or simply someone interested in language, interactive plots can help you get the most out of your linguistic data.

### What data to use

For a talk on visualization in Python, a good dataset to use might be a linguistic one that allows you to demonstrate various types of plots and how they can be used to explore linguistic data. Here are a few suggestions:

*   **Word frequency data:** You can use a dataset of word frequency counts in a specific corpus to create bar plots, histograms, and word clouds. This will help you demonstrate how to visualize the distribution of words in a corpus and identify common patterns.
*   **Part of speech data:** You can use a dataset of part of speech annotations for a corpus to create pie charts, bar plots, and stacked bar plots. This will allow you to demonstrate how to visualize the distribution of different parts of speech in a corpus and identify patterns in their usage.
*   **N-gram data:** You can use a dataset of n-grams (sequences of words) from a corpus to create bar plots and histograms. This will allow you to demonstrate how to visualize the distribution of n-grams in a corpus and identify common patterns.
*   **Sentiment analysis data:** You can use a dataset of sentiment annotations for a corpus to create bar plots, pie charts, and scatter plots. This will allow you to demonstrate how to visualize the distribution of sentiments in a corpus and identify patterns in the data.

These are just a few examples, and there are many other datasets you could use for a talk on visualization in Python. The important thing is to choose a dataset that is interesting and relevant to you and your audience, and that allows you to demonstrate various types of plots and how they can be used to explore linguistic data.




In [27]:
print('Hello world')

Hello world
