### About rendering interactive plots in this notebook

If you are viewing this notebook on GitHub, the interactive plots in this file will not be shown, because the GitHub file view doesn't render embedded JavaScript, which the plots need in order to work. To view this file properly, open it with **nbviewer** at:

[https://nbviewer.org/github/gbulg/Data-analysis-class/blob/main/Final%20Project/handout.ipynb](https://nbviewer.org/github/gbulg/Data-analysis-class/blob/main/Final%20Project/handout.ipynb)

---
# Interactive Plots Handout


In addition to ordinary static plots, there are ways to make plots interactive.

Interactivity lets popups appear on hover, which gives a better understanding of the data, and lets plots be zoomed in and out. Other common features include panning, tooltips, and interactive selection of data points.

Plenty of Python libraries and packages can make interactive plots. Here is a non-exhaustive list of them:
*   [Plotly](https://plotly.com/python/)
*   [Bokeh](https://docs.bokeh.org/en/latest/index.html)
*   [mpld3](https://mpld3.github.io/)
*   [Altair](https://altair-viz.github.io/)
*   [pygal](https://www.pygal.org/en/stable/index.html)
*   [bqplot](https://bqplot.readthedocs.io/en/latest/usage/pyplot/)

In this handout, I will focus on the first two of them, Plotly and Bokeh, and show some of the interactive plots they can make.



## 0. Load Data

I will use an annotated text of Leo Tolstoy's War and Peace as the source of data for visualisation. So first, let's import all the necessary packages and load the data:

In [1]:
import pandas as pd
import numpy as np
import requests

In [66]:
tolstoy_df = pd.read_csv('https://raw.githubusercontent.com/dashapopova/Data-Analysis-Python-II/main/24.09/tolstoy.csv', sep='\t').fillna('')


Columns (21,23,24) have mixed types.Specify dtype option on import or set low_memory=False.
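
The warning above appears because pandas found more than one value type in some columns. Here is a minimal sketch of one way to avoid it, assuming it is acceptable to read every column as text. `sample_tsv` is a hypothetical stand-in for the real file, and `keep_default_na=False` plays the role of `.fillna('')`:

```python
import io

import pandas as pd

# Hypothetical two-row TSV standing in for tolstoy.csv
sample_tsv = "lex\tword\tPOS\tgender\nтом\tтом\tS\tмуж\nну\tну\tPART\t\n"

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype=str,              # read every column as text, so no column has mixed types
    keep_default_na=False,  # keep empty cells as '' instead of NaN
)
print(sample_df)
```

The same `dtype` and `keep_default_na` arguments could be passed to the `pd.read_csv` call on the real URL.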



In [70]:
tolstoy_df.shape

(216099, 25)

In [72]:
tolstoy_df.head(15)

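| | lex | word | POS | time | gender | case | number | verbal | adj_form | comp | ... | имя | отч | фам | вводн | гео | сокр | обсц | разг | редк | устар |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | том | том | S | | муж | вин | ед | | | | ... | | | | | | | | | | |
| 1 | первый | первый | ANUM | | муж | вин | ед | | | | ... | | | | | | | | | | |
| 2 | часть | часть | S | | жен | вин | ед | | | | ... | | | | | | | | | | |
| 3 | первый | первая | ANUM | | жен | им | ед | | | | ... | | | | | | | | | | |
| 4 | ну | ну | PART | | | | | | | | ... | | | | | | | | | | |
| 5 | здравствовать | здравствуйте | V | | | | мн | пов | | | ... | | | | | | | | | | |
| 6 | здравствовать | здравствуйте | V | | | | мн | пов | | | ... | | | | | | | | | | |
| 7 | садиться | садитесь | V | непрош | | | мн | изъяв | | | ... | | | | | | | | | | |
| 8 | и | и | CONJ | | | | | | | | ... | | | | | | | | | | |
| 9 | рассказывать | рассказывайте | V | | | | мн | пов | | | ... | | | | | | | | | | |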


## 1. Plotly

Plotly can draw many kinds of charts, with hover tooltips, zooming, and panning built in.

Install Plotly (version 5.13.0 is pinned here):

In [9]:
pip install plotly==5.13.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import plotly and initialise it:

In [94]:
import plotly.express as px
import plotly.offline as pyo
import plotly.io as pio

import re


# init_notebook_mode lives in plotly.offline, imported above as pyo
pyo.init_notebook_mode(connected=True)

pio.renderers.default += '+colab'
pio.renderers

Renderers configuration
-----------------------
    Default renderer: 'plotly_mimetype+notebook_connected+colab'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode',
         'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab',
         'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg',
         'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe',
         'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']

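All the basic features are in `plotly.express`, while more complex ones need additional imports.

A few words about the renderer: it determines how a figure is displayed when `fig.show()` is called. The cell above appends the `colab` renderer to the default one so that the plots also render in Google Colab.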

### 1.1 Bar chart

Let's make a bar plot with the distribution of all names in the book.

In [95]:
names_count = tolstoy_df[tolstoy_df['имя'] != '']['lex'].value_counts()

In [96]:
fig2 = px.bar(names_count[names_count > 10])
fig2.update_layout(xaxis_title="Character Name", yaxis_title="Number of Mentions", title='Characters')
fig2.show()

### 1.2 Distribution plot

Now let's select the three most mentioned characters (Пьер, Андрей, Наташа) and see how their mentions are distributed along the text.

A distribution plot, which is similar to a histogram, can help with that. The difference is that besides the histogram it draws a smoothed density curve for each group, and the curves are much easier to compare when there are several distributions on one plot.

In [14]:
import plotly.figure_factory as ff  

In [15]:
main_characters = ['пьер', 'андрей', 'наташа']

main_characters_df_list = [list(tolstoy_df[tolstoy_df['lex'] == character].index) for character in main_characters]

In [92]:
fig4 = ff.create_distplot(main_characters_df_list, group_labels=main_characters, bin_size=1000)
fig4.show()

We can hide the histograms:

In [91]:
fig5 = ff.create_distplot(main_characters_df_list, group_labels=main_characters, bin_size=1000, show_hist=False)
fig5.show()

### 1.3 Pie chart



Let's show the distribution of different parts of speech in the text:

In [38]:
pos_shares = tolstoy_df['POS'].value_counts(normalize=True)
pos_shares

S         0.256952
V         0.198913
PR        0.102078
SPRO      0.099019
CONJ      0.093596
A         0.064887
APRO      0.057747
ADV       0.053642
PART      0.046344
ADVPRO    0.017645
NUM       0.005141
ANUM      0.002383
INTJ      0.001652
Name: POS, dtype: float64

In [40]:
fig3 = px.pie(values=pos_shares, names=pos_shares.keys(), title='Part of Speech distribution')
fig3.show()

### 1.4 Box plot

Let's count how often each lemma occurs, keep the lemmas that occur more than 10 but fewer than 2000 times, and compare the frequency distributions across parts of speech:

In [86]:
lex_frequency_df = tolstoy_df[['lex', 'POS', 'gender']].groupby(['lex', 'POS'], as_index=False).count()
lex_frequency_df.columns = ['lex', 'POS', 'total']
lex_frequency_df = lex_frequency_df[(lex_frequency_df.total > 10) & (lex_frequency_df.total < 2000)]

In [87]:
fig7 = px.box(lex_frequency_df, x='POS', y='total')
fig7.show()

### 1.5 Heatmap

Let's plot a heatmap that shows how often each grammatical case occurs together with each animacy value.

Another idea to try: parts of speech in bigrams.

In [56]:
case_df = tolstoy_df[(tolstoy_df.animacy != '')&(tolstoy_df.case != '')]

In [61]:
fig8 = px.density_heatmap(case_df, x='case', y='animacy')
fig8.show()

---

End of the finished version.


6. maybe a line chart, also with exact values on hover
7. 3D plot
8. manipulable plot

Bokeh:

1. scatter plot that can be panned around
2. maybe a line chart that can also be panned around
3. something with linked selection?
4. network plot in Bokeh
5. maybe a scatter plot grid?
6. manipulable plot
7. time series plot

In [68]:
list(tolstoy_df.columns)

['lex',
 'word',
 'POS',
 'time',
 'gender',
 'case',
 'number',
 'verbal',
 'adj_form',
 'comp',
 'person',
 'aspect',
 'voice',
 'animacy',
 'transitivity',
 'имя',
 'отч',
 'фам',
 'вводн',
 'гео',
 'сокр',
 'обсц',
 'разг',
 'редк',
 'устар']

In [5]:
# Combine both conditions in one mask instead of chaining two boolean filters
tolstoy_df[(tolstoy_df['case'] != '') & (tolstoy_df['animacy'] != '')][['lex', 'word', 'case', 'animacy']]


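| | lex | word | case | animacy |
|---|---|---|---|---|
| 0 | том | том | вин | неод |
| 1 | первый | первый | вин | неод |
| 2 | часть | часть | вин | неод |
| 13 | июль | июле | пр | неод |
| 14 | год | года | вин | неод |
| ... | ... | ... | ... | ... |
| 216093 | честный | честный | вин | неод |
| 216094 | человек | человек | вин | од |
| 216096 | след | следам | дат | неод |
| 216097 | этот | этого | вин | од |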


## 2. Bokeh

Now let's make some plots with the Bokeh library.

In [17]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

p = figure(width=400, height=400)

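# add a circle marker renderer with a size, color, and alpha
# (scatter is used because newer Bokeh releases deprecate circle() with a size argument)
p.scatter([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], marker="circle", size=20, color="navy", alpha=0.5)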

# show the results
show(p)
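
Building on the scatter plot above, here is a hedged sketch of two Bokeh features from the notes: hover tooltips and linked selection. It assumes that both plots share one `ColumnDataSource`. The `z` values are invented, and the tool list is just one reasonable choice:

```python
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# x and y repeat the points used above; z is a hypothetical third column
source = ColumnDataSource(data={
    "x": [1, 2, 3, 4, 5],
    "y": [6, 7, 2, 4, 5],
    "z": [5, 3, 8, 1, 6],
})

tools = "pan,wheel_zoom,box_select,reset"
tooltips = [("x", "@x"), ("y", "@y"), ("z", "@z")]  # shown on hover

left = figure(width=300, height=300, tools=tools, tooltips=tooltips,
              title="y against x")
left_r = left.scatter("x", "y", source=source, size=12, color="navy", alpha=0.5)

right = figure(width=300, height=300, tools=tools, tooltips=tooltips,
               title="z against x")
right_r = right.scatter("x", "z", source=source, size=12, color="firebrick", alpha=0.5)

# Because both renderers share `source`, a box selection in one plot
# highlights the same rows in the other
grid = gridplot([[left, right]])
# show(grid)  # with `from bokeh.plotting import show`, after output_notebook()
```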



I. Introduction

A. Explanation of what interactive plots are and their importance in data analysis

B. Overview of Python libraries used for creating interactive plots (e.g. Plotly, Bokeh)


In this plan, you can start with a brief overview of what interactive plots are, followed by a detailed explanation of two popular libraries in Python: Plotly and Bokeh. You can then compare the two libraries, highlighting their pros and cons, use cases, and functionality. Finally, you can conclude with some best practices and recommendations for further study.

This should give you a comprehensive handout on interactive plots in Python that covers both the basics and advanced topics.

### Interactive Plots in Python

Interactive plots allow users to explore and manipulate data in a more engaging and informative way than static plots. In Python, there are several libraries that provide functionality for creating interactive plots, such as Plotly, Bokeh, and Altair. These libraries offer a wide range of interactive features, such as zooming, panning, hovering, and selecting data points.

### Usage in Linguistic Research

Interactive plots can be particularly useful in linguistic research, where complex linguistic data often requires visual exploration to understand patterns and relationships. Here are some examples of how interactive plots can be used in linguistic research:

*   Visualizing word frequency distributions: Interactive plots can be used to create interactive word clouds, where users can zoom in and out to see the most frequent words in a corpus. This is especially useful in linguistic research as it allows researchers to quickly identify the most important words in a large corpus of text.
*   Plotting linguistic change over time: Interactive line charts can be used to plot linguistic change over time, allowing researchers to see the trends in language usage. For example, researchers can plot the frequency of certain words or phrases over time to see how they have changed in usage.
*   Analyzing lexical similarity: Interactive plots can be used to visualize the similarity between words in a lexicon. Researchers can create interactive heatmaps to see the similarity between words, or use interactive scatter plots to see the relationships between words.
*   Exploring linguistic diversity: Interactive plots can be used to explore linguistic diversity in a corpus. For example, researchers can use interactive bar charts to plot the frequency of different languages in a corpus, or use interactive scatter plots to visualize the diversity of language usage in different regions.

### Conclusion

In conclusion, interactive plots are a powerful tool for exploring and visualizing linguistic data. By allowing users to interact with data, they provide a more engaging and informative way to understand linguistic patterns and relationships. Whether you're a researcher, student, or simply someone interested in language, interactive plots can help you get the most out of your linguistic data.

### What data to use

For a talk on visualization in Python, a good dataset to use might be a linguistic one that allows you to demonstrate various types of plots and how they can be used to explore linguistic data. Here are a few suggestions:

*   Word frequency data: You can use a dataset of word frequency counts in a specific corpus to create bar plots, histograms, and word clouds. This will help you demonstrate how to visualize the distribution of words in a corpus and identify common patterns.
*   Part of speech data: You can use a dataset of part of speech annotations for a corpus to create pie charts, bar plots, and stacked bar plots. This will allow you to demonstrate how to visualize the distribution of different parts of speech in a corpus and identify patterns in their usage.
*   N-gram data: You can use a dataset of n-grams (sequences of words) from a corpus to create bar plots and histograms. This will allow you to demonstrate how to visualize the distribution of n-grams in a corpus and identify common patterns.
*   Sentiment analysis data: You can use a dataset of sentiment annotations for a corpus to create bar plots, pie charts, and scatter plots. This will allow you to demonstrate how to visualize the distribution of sentiments in a corpus and identify patterns in the data.
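
For the n-gram suggestion, the counting itself needs only the standard library. A minimal sketch that counts part-of-speech bigrams, using the tags of six consecutive tokens from the sample rows shown earlier (PART, V, V, V, CONJ, V):

```python
from collections import Counter

# POS tags of six consecutive tokens (rows 4 to 9 of the sample output)
toy_pos = ["PART", "V", "V", "V", "CONJ", "V"]

# Pair every tag with its right-hand neighbour and count the pairs
pos_bigrams = Counter(zip(toy_pos, toy_pos[1:]))

for (first, second), count in pos_bigrams.most_common():
    print(f"{first} {second}: {count}")
```

The resulting counts can then be passed to a bar chart in the same way as `names_count` above.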

These are just a few examples, and there are many other datasets you could use for a talk on visualization in Python. The important thing is to choose a dataset that is interesting and relevant to you and your audience, and that allows you to demonstrate various types of plots and how they can be used to explore linguistic data.

### About sentiment analysis

Sentiment analysis is a field of study within computational linguistics that focuses on determining the sentiment expressed in a piece of text. Sentiment can be characterized as positive, negative, or neutral, and sentiment analysis algorithms use various techniques, such as natural language processing and machine learning, to classify text according to these sentiments.

Sentiment analysis data is a type of linguistic data that is annotated to indicate the sentiment expressed in a piece of text. This data can come from a variety of sources, such as online reviews, social media posts, and news articles. The sentiment annotations can be made by humans, or generated automatically using algorithms.

In the context of visualization, sentiment analysis data can be used to create various types of plots, such as bar plots, pie charts, and scatter plots, to visualize the distribution of sentiments in a corpus of text. This can be useful for identifying patterns in the data and exploring how sentiment varies across different texts, authors, or time periods.







(6p) a handout on one of the topics we haven't covered in class, e.g., BERT, clustering, machine learning, regression, interactive plots

(2p) a corresponding homework with a solution