<center><b>DIGHUM101</b></center>
<center>3-2: Data Visualization</center>

---

# Fast review

1. What are some key components of text preprocessing? 
2. Why is text preprocessing important?

# Learning objectives

1. Learn some theory and best practices of data visualization.
2. Make a matplotlib histogram, barplot, boxplot, and scatterplot.
3. Optionally, begin to read Claus O. Wilke's [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/)

In [None]:
import pandas as pd

# Import matplotlib (https://en.wikipedia.org/wiki/Matplotlib)
import matplotlib

# (https://www.quora.com/What-is-the-difference-between-Python-modules-packages-libraries-and-frameworks)
# Import matplotlib.pyplot and assign it to an alias `plt`
import matplotlib.pyplot as plt

# Import seaborn and assign it to an alias `sns`
import seaborn as sns

💡 **Tip**: `pyplot` is a collection of command style functions that make matplotlib work like MATLAB and save many lines of repeated code. By convention, `pyplot` is aliased to `plt`, which we just did in the above import cell. 

In [None]:
# You can also change the default style of matplotlib plots

# Indicate the template to use for the plot (not required)
plt.style.use('seaborn-v0_8-bright')

In [None]:
# Run plt.style.available to experiment with different styles
plt.style.available

## When to use each package

`matplotlib`:
- Versatile (basic and complex plots)
- Foundation of using other packages
- Lengthy syntax
- Ideal for customization
- Not ideal for presentation and publication

`pandas`:
- Plot basic plots
- Handy to use for Exploratory Data Analysis (EDA)
- Built directly into `pandas` data frames
- Not ideal for customization

`seaborn`:
- Easier for complex plots 
- Shorter syntax
- Requires familiarity with reading documentation
- Well connected with `pandas`
- Ideal for customization and presentation

## What is Data Visualization?

Data visualization is an art and a science of transforming data into meaningful visual stories, where your choices in design—from labels to color—can reveal insights or obscure them.

As with any art or science, it begins with understanding foundational principles and learning from examples that illustrate what is possible, what is effective, and how our choices shape what we see and understand.

We can look at some examples in [Python Graph Gallery](https://python-graph-gallery.com/) and in [Seaborn's Example Gallery](https://seaborn.pydata.org/examples/index.html)

<a id='section1'></a>

# Principles of Data Visualization

Visualization is meant to convey information.

> The power of a graph is its ability to enable one to take in the quantitative information, organize it, and see patterns and structure not readily revealed by other means of studying the data.

\- Cleveland and McGill, 1984

That said, to accurately and efficiently communicate the information hidden within the data, we should also be aware of the common pitfalls of data visualization.

It's always good to sit back and ask ourselves:
- Does the plot include sufficient text descriptions (e.g. labels, legend, and title)?
- Does the plot have an appropriate size and scale?
- Does the plot contain too much or too little data?
- Does the plot include a common scale for group comparison?
- And does the chosen color contrast accurately convey the differences?

The answers to these questions vary depending on the data we have and the message we want to convey through the plot.

Throughout the class meeting, we will discuss the decisions we need to make when encountering such questions, as well as the solutions to address them.

# Theory of Data Visualization

Certain techniques make that information easier to interpret and understand. In their 1984 paper titled, "[Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods](https://www-jstor-org.libproxy.berkeley.edu/stable/2288400?seq=1#page_scan_tab_contents)," Cleveland and McGill identify 10 elementary perceptual tasks that are used to "extract quantitative information from graphs." Their premise is:

> A graphical form that involves elementary perceptual tasks that lead to more accurate judgments than another graphical form (with the same quantitative information) will result in better organization and increase the chances of a correct perception of patterns and behavior.

Whereas graph design had, up to that point, been "largely unscientific," Cleveland and McGill took a systematic approach in analyzing human graphical perception through experimentation. Their research helped identify the most and least accurate elementary perceptual tasks, ordered below:

1. Position along a common scale
2. Position along non-aligned scales
3. Length, direction, angle
4. Area
5. Volume, curvature
6. Shading, color saturation

In 2010, [Heer and Bostock](http://vis.stanford.edu/files/2010-MTurk-CHI.pdf) confirmed these results using Amazon's Mechanical Turk.

Let's take a look at a few examples. Because we're only interested in relative sizes, we don't include a legend with size information or reference points.

![circles](../img/circles.png)

For circles of distinctly different sizes, the comparison is simple. For example, "A" is smaller than "B." However, for circles, such as "L" and "M," that are almost the same size, it's difficult to tell which is smaller. Area, according to Cleveland and McGill's research, is less accurate than, say, length, which we consider next.

![bars](../img/bars.png)

Focusing on "L" and "M," it is easy to see which is larger. You might be wondering whether scale makes a difference, that is, whether the small circle sizes are what make the comparison difficult. It doesn't.

Next, we consider a case where we want to plot two series. For this example, let's suppose we're working with student English and Math test scores. Here, we'll want to use bars, which we arbitrarily label Z-L. The question is, which bars should we use? This is a case where the answer depends on what we're trying to communicate. If we're interested in showing total scores, we could use a stacked bar chart.

![stacked bars](../img/two-series-0.png)

We can tell that "Y" and "L" had the highest cumulative scores. What if we want to know which students scored highest on the math exam? Because the math portions of each bar are stacked on top of the English scores (they are on "non-aligned scales," as Cleveland and McGill call it), it's difficult to tell. One solution is to plot these on opposite sides of the x-axis.

![bars on opposite sides of the x-axis](../img/two-series-1.png)

Now, it's easier to see that "R" scored quite well on the math exam. The tradeoff with this layout is that it's difficult to compare cumulative scores. Comparing "Z" and "O," for example, is a challenge. Again, it depends on what the message is.
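The two layouts discussed above can be sketched in matplotlib roughly as follows. The student labels and scores are made up for illustration and are not the data behind the figures. The stacked version relies on the `bottom=` argument of `ax.bar()`, and the opposite-sides version simply negates one series.

```python
import matplotlib.pyplot as plt

# Hypothetical scores, invented for this sketch
students = ["Z", "Y", "X", "W"]
english = [70, 85, 60, 75]
math = [65, 90, 80, 50]

fig, (ax_stacked, ax_split) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked: math bars start where the English bars end, so totals are easy to read
ax_stacked.bar(students, english, label="English")
ax_stacked.bar(students, math, bottom=english, label="Math")
ax_stacked.set_title("Stacked (compare totals)")
ax_stacked.legend()

# Opposite sides: both series start at zero, so each subject is easy to compare
ax_split.bar(students, english, label="English")
ax_split.bar(students, [-score for score in math], label="Math")
ax_split.axhline(0, color="black", linewidth=0.8)
ax_split.set_title("Opposite sides (compare each subject)")
ax_split.legend();
```

In the stacked panel only the English bars share a common baseline, which is why the math scores are harder to compare there.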

These findings are a *guide* for what works when the goal is to make accurate judgments. Sometimes, however, the goal might not be to allow for precise comparisons but, rather, to facilitate the perception of larger patterns.

## Form and Function

> A good graphic realizes two basic goals: It **presents** information, and it allows users to **explore** that information.

\- Alberto Cairo

> A data visualization should only be beautiful when beauty can promote understanding in some way without undermining it in another. Is beauty sometimes useful? Certainly. Is beauty always useful? Certainly not.

\- Stephen Few

> Good displays of data help to reveal knowledge relevant to understanding mechanism, process and dynamics, cause and effect.

\- Edward Tufte

A figure is ineffective if it "wastes too much real estate (and the designer's time) on things that don't help readers understand [it]." - Alberto Cairo

> The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

\- William Cleveland

> [A]lways take advantage of the space you have available to seek depth within reasonable limits. After that, *and only after that*, worry about how to make the presentation prettier.

\- Alberto Cairo

Ultimately, identify your audience and their needs and interests. The same data should be visualized differently for a scientific publication versus a magazine.

## Dataset

Remember we looked at this dataset last week? 

We've decided to use the so-called Gapminder dataset, which was compiled by Jennifer Bryan. This dataset contains several key demographic and economic statistics for many countries, across many years. For more information, see the [gapminder](https://github.com/jennybc/gapminder) repository.

In [None]:
# Load the gapminder dataset
# Notice we did not set our working directory first - what did we do instead?
gap = pd.read_csv("../Data/gapminder-FiveYearData.csv")

In [None]:
print(gap.shape)
gap.head()

In [None]:
gap.dtypes

Note that the "64" in the "int64" and "float64" data types refers to the number of bits used to store each value. The size of these data types in pandas matters only if you are using very large datasets.
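As a small illustration of how a data type relates to memory, the sketch below builds a made-up column (not the gapminder data) and checks how many bytes each value occupies. The variable names are only for this example.

```python
import pandas as pd

# A small made-up column of whole numbers
values = pd.Series([1952, 1957, 1962, 1967], dtype="int64")

# Each int64 value occupies 64 bits, which is 8 bytes
print(values.dtype.itemsize)     # bytes per value
print(values.to_numpy().nbytes)  # total bytes for the four values

# Converting to int32 halves the storage, provided the values fit in 32 bits
smaller = values.astype("int32")
print(smaller.to_numpy().nbytes)
```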

# Histogram

Histograms plot a discretized distribution of a one-dimensional dataset across all the values it has taken. They visualize how many of the data points are in each of $b$ bins, each of which has a pre-defined range.

Use a histogram when you would like to visualize the distribution of a single float or integer variable. 
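To see what binning means in practice, here is a minimal sketch that counts values per bin with `numpy.histogram()`, using a handful of made-up numbers. A histogram plot draws one bar per bin with these counts as heights.

```python
import numpy as np

values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]  # made-up data

# Split the range of the data into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(edges)   # the boundaries of the 3 bins
print(counts)  # how many values fall in each bin
```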

In [None]:
# Note the semicolon, which tells Jupyter not to print additional output about the plot
gap.hist();

In [None]:
# This is the additional information btw
gap.hist()

### We can also adjust the figure size.

In [None]:
gap.hist(figsize = (15, 5)); # width, height

### Change the number of bins

Bins are like intervals in which to break up the distribution of data. Say we want a "higher resolution" than the plots above:

In [None]:
gap.hist(bins = 25, figsize = (15, 5), color = "green");

### Plot a single variable

In [None]:
gap.hist(column = "lifeExp", figsize = (8, 4), color = "black");

We can use these visualizations as a starting point for further analysis. For instance, we see that a few country/year combinations have a very low life expectancy. Let's have a closer look using `sort_values()`:

In [None]:
gap.sort_values(by='lifeExp', ascending=True)

💡 **Tip**: We are now jumping ahead a little and using a bar plot. We could have created a histogram from this data, but it is not the right choice for what we would like to show here. Histograms are typically used for continuous data to show the distribution of a variable, whereas our example is about showing the top 10 countries with the lowest life expectancy, which is a discrete data set.

In [None]:
# Let's plot the top 10 countries with the lowest life expectancy

top_10_lowest = gap.sort_values(by='lifeExp', ascending=True).head(10)

# Plot a horizontal bar chart for the top 10 lowest life expectancies
plt.figure(figsize=(10, 6))
plt.barh(top_10_lowest['country'], top_10_lowest['lifeExp'], color='skyblue')
plt.xlabel('Life Expectancy')
plt.ylabel('Country')
plt.title('Top 10 Countries with the Lowest Life Expectancy')
plt.show()

### Let's find the issues with the previous example.

Hint 1: How many data points are in our plot? How many did we expect?

Hint 2: What is the difference in data points between the data frame and the plot?


### Countries are repeated in this dataset! 

Remember from last week that the statistics are reported 12 times for each country, once for each of 12 different years?

A more meaningful way to compare the population statistics across different countries would be to do it by year.
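The idea can be sketched on a tiny made-up table with the same long format (one row per country per year). The table `toy` and its values are invented for illustration only.

```python
import pandas as pd

# A tiny made-up table in the same long format as gapminder
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [50.0, 52.0, 70.0, 71.0],
})

# Each country appears once per year, so sorting all rows can repeat a country
print(toy["country"].value_counts())

# Keeping a single year leaves exactly one row per country
toy_latest = toy[toy["year"] == toy["year"].max()]
print(toy_latest)
```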

In [None]:
# Let's subset our data to only include the latest year
latest_year = gap['year'].max()
latest_year

In [None]:
gap_2007 = gap[gap['year'] == latest_year]
gap_2007.head(10)

In [None]:
# Let's double check that we did the subsetting correctly

print('Number of unique countries in the entire dataset:', gap['country'].nunique())
print('Number of rows in the 2007 subset:', gap_2007.shape[0])

In [None]:
# Now we can plot the 10 lowest life expectancies in 2007, with one row per country
top_10_lowest_2007 = gap_2007.sort_values(by='lifeExp', ascending=True).head(10)

# Plot a horizontal bar chart for life expectancy in 2007
plt.figure(figsize=(15, 8))
plt.barh(top_10_lowest_2007['country'], top_10_lowest_2007['lifeExp'], color='lightgreen')
plt.xlabel('Life Expectancy')
plt.ylabel('Country')
plt.title('Life Expectancy by Country in 2007')
plt.show()

### Let's see one more histogram example

We will now plot the per-capita GDP for each country in 2007. The GDP data is a perfect starting point for us to try out histograms as we'll be able to visualize the distribution of per-capita GDP across countries.

Also, this time we will focus on the differences between the three libraries we can use for histograms: `matplotlib`, `pandas`, and `seaborn`.

### matplotlib

Syntax: call the **library** (`plt`), followed by the **plot type** (`.hist()`): `plt.hist()`

Let's take a look at the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) for `plt.hist()` together. The input value that the function takes should be an array, so we pass in the `gdpPercap` column to the function. As a generic plotting method, the function `plt.hist()` itself is not specific to the data we are plotting, so we need to pass in the data we want to plot.

In [None]:
# default number of bins: 10
plt.hist(gap_2007['gdpPercap']);

Each bar in the histogram represents a bin. The height of the bar corresponds to the number of items (countries in this case) within the range of values covered by the bin. In the previous plot, we used the default number of bins (10). Now, let's increase the number of bins by specifying the `bins=30` parameter.

In [None]:
plt.hist(gap_2007['gdpPercap'], bins=30);

This histogram tells us that many of the countries had a low per-capita GDP, below 5,000. There is also a second "bump" in the histogram around 30,000. This type of distribution is known as **bimodal**, since there are two modes, or common values.

To make this histogram more interpretable let's add a title and labels for the x and y axes. We'll pass strings to `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` to do so.

In [None]:
plt.hist(gap_2007['gdpPercap'], bins=30)

plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');

### pandas

Syntax: call the **dataframe** (`gap_2007`), followed by the plot type (`.hist()`): `gap_2007.hist()`

Let's take a look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) of `DataFrame.hist()`. The method is called on a dataframe, so it already knows where the data comes from. We no longer need to select the column using square brackets; instead, we simply pass the name of the column to the `column` parameter. This highlights how `pandas` differs from `matplotlib`. The method `DataFrame.hist()` is not intended to be a generic plotting function; rather, it is specific to the dataframe where the data comes from.

In [None]:
gap_2007.hist(column='gdpPercap');

We can remove the grid by specifying `grid=False`; to make the plot complete, let's add the title and labels. 

In [None]:
gap_2007.hist(column='gdpPercap', bins=30, grid=False)

plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');

### seaborn

Syntax: call the **library** (`sns`), followed by the plot type (`.histplot()`): `sns.histplot()`

Let's check out the [documentation](https://seaborn.pydata.org/generated/seaborn.histplot.html) of `sns.histplot()`. A `seaborn` plotting function typically requires two things: the dataframe and the specific subset of the data we want to plot. In this case, we pass `gap_2007` to the `data` parameter to indicate that the GDP data we want to plot comes from the `gap_2007` dataframe. We then specify `x='gdpPercap'` to indicate that the column we want to plot is `gdpPercap`. Now you can see that `sns.histplot()` is still a generic plotting function, but it integrates well with `pandas`.

### On a more general note:

Seaborn is a library built on top of matplotlib that integrates closely with pandas data structures. It has three basic figure-level plotting functions: `relplot()`, `displot()`, and `catplot()`. Each of them has a number of related axes-level functions (for example, `histplot()` belongs to the `displot()` family), which are basically shorthands for the main functions.

<img src="https://seaborn.pydata.org/_images/function_overview_8_0.png">

You can see some code examples with seaborn in this excerpt from [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html)

In [None]:
sns.histplot(data=gap_2007, x='gdpPercap');

In [None]:
sns.histplot(data=gap_2007, x='gdpPercap', bins=30)

plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');

Now we know the basic syntax of each plotting library. What's your favourite one?

In this course, we will mainly use `seaborn` to create all kinds of plots. Note that `seaborn` is built on top of `matplotlib`: behind the scenes, `seaborn` uses `matplotlib` to draw plots, but it provides a high-level interface that is easier to learn and interact with.

## Kernel Density Plot

Histograms represent the distribution with discrete bins. A similar method is the kernel density estimate (KDE) plot, which visualizes the distribution with a continuous probability density curve. A KDE can be plotted on its own; quite frequently, it is overlaid on a histogram. In seaborn, this is straightforward: set the `kde` parameter to `True`.

KDE smooths out the rough edges of the histogram by using a kernel function (often Gaussian) to estimate the probability density.

KDE is useful for understanding the overall distribution of data, especially when you want to see the general shape of the distribution without the sharp edges that histograms can have due to binning.
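To make the idea of a kernel function concrete, here is a hand-rolled sketch of a Gaussian KDE in NumPy. It is not how seaborn computes its curve internally: the function name `gaussian_kde_sketch`, the made-up samples, and the hand-picked bandwidth of 0.8 are all assumptions for illustration (seaborn chooses a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    # Distance from every grid point to every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.sum(axis=1) / (len(samples) * bandwidth)

samples = [1.0, 1.5, 2.0, 6.0, 6.5]  # made-up data with two clusters
grid = np.linspace(-5, 12, 500)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.8)

# A probability density should integrate to roughly 1
step = grid[1] - grid[0]
print(round(float(density.sum() * step), 3))
```

A larger bandwidth gives a smoother curve that can hide the two clusters, while a smaller one gives a bumpier curve that follows individual data points.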

In [None]:
sns.histplot(data=gap_2007, 
             x='gdpPercap', 
             bins=30, 
             kde=True) # this adds a kernel density estimate (KDE) line

plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');

# Bar Plot

You can use a bar plot when you want to illustrate differences in frequencies of some category. Let's look at the 12 most frequent words in "jordan2013.txt"

In [None]:
jordan = open("../Data/human-rights/jordan2013.txt", 
              encoding = "utf-8").read()
print(len(jordan.split()))
print(jordan[0:500])

In [None]:
from collections import Counter

In [None]:
# Tokenize jordan.txt into single words
jordan_tokens = jordan.split()
print(jordan_tokens[0:25])

In [None]:
# Count the 12 most common words
jordan_freq = Counter(jordan_tokens)
jordan_barplot = jordan_freq.most_common(12)
jordan_barplot

In [None]:
# Convert to data frame
jordan_df = pd.DataFrame(data = jordan_barplot, 
                         columns = ["Word", "Frequency"])
jordan_df

In [None]:
# Plot it! Tip: press TAB after typing "jordan_df.plot." to see other chart types you could use, such as "barh".
jordan_df.plot.bar(x = "Word", y = "Frequency", figsize = (6,4));

# another way of doing this: using the `kind` argument
# jordan_df.plot(kind="bar", x = "Word", y = "Frequency", figsize = (6,4));

In [None]:
# Change x and y axis labels; add title; change to horizontal bar chart; invert y-axis
jordan_df.plot.barh(x = "Word", y = "Frequency", figsize = (5,3)).invert_yaxis()
# In a horizontal bar chart, the frequencies run along the x-axis and the words along the y-axis
plt.xlabel("Frequency")
plt.ylabel("Word")
plt.title("Bar plot of most common words in Jordan 2013");

### What does this plot tell us really? 

Hint: Think of what you learnt about this text that is upwards of 11k words by only looking at this plot

### You guessed it right, the issue was stopwords

Let's clean this up a bit and plot it again

In [None]:
# Let's start with lowercasing everything
jordan_lower = jordan.lower()


In [None]:
# Remove punctuation and stopwords
from string import punctuation

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

In [None]:
for char in punctuation:
    jordan_lower = jordan_lower.replace(char, "")

In [None]:
# Let's see what this did
print(jordan[0:500])

In [None]:
print(jordan_lower[0:500])

In [None]:
# Let's tokenize again and remove stopwords
jordan_tokens_cleaned = [word for word in jordan_lower.split() if word not in stop_words]
print(jordan_tokens_cleaned[0:25])


### List Comprehension

What we did above was list comprehension. For some coding tasks, there are many ways to solve the same problem. They all offer different advantages and sometimes disadvantages.

We could have repeated the code above to create a list of tokens first and then removed stopwords while counting them to create a data frame of word frequencies.

Instead we created a list and cleaned the stopwords while creating it.

Even here, we could have done a for loop instead of list comprehension. They are equivalent in this case. Let's take a closer look:

```python
jordan_tokens_cleaned = [word for word in jordan_lower.split() if word not in stop_words]
```

VS


```python
jordan_tokens_cleaned = []
jordan_lower_tokens = jordan_lower.split()
for token in jordan_lower_tokens:
    if token not in stop_words:
        jordan_tokens_cleaned.append(token)
```



In [None]:
# Let's count the most common 12 words again
jordan_freq_cleaned = Counter(jordan_tokens_cleaned)
jordan_barplot_cleaned = jordan_freq_cleaned.most_common(12)
jordan_barplot_cleaned

In [None]:
# Convert to data frame
jordan_df_cleaned = pd.DataFrame(data = jordan_barplot_cleaned, 
                         columns = ["Word", "Frequency"])
jordan_df_cleaned

In [None]:
# Let's plot the 12 most common words again
jordan_df_cleaned.plot.barh(x = "Word", y = "Frequency", figsize = (5,3)).invert_yaxis()
# In a horizontal bar chart, the frequencies run along the x-axis and the words along the y-axis
plt.xlabel("Frequency")
plt.ylabel("Word")
plt.title("Bar plot of most common words in Jordan 2013");

### Doesn't this give us a better understanding of the contents of this document?

Some other directions we could have taken include lemmatization, to make sure "recommendations" and "recommendation" count as the same word. We could have also removed "jordan" and related words like "jordanian", because we already know that this document is about Jordan.
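Removing such document-specific words could look roughly like the sketch below. The token list and the `domain_words` set are made up for illustration; in the notebook, the same filter would be applied to `jordan_tokens_cleaned`.

```python
from collections import Counter

# Made-up tokens standing in for the cleaned Jordan tokens
tokens = ["jordan", "rights", "jordanian", "rights", "women", "jordan", "law"]

# Hypothetical set of document-specific words to drop
domain_words = {"jordan", "jordanian"}

# Same list comprehension pattern as for the stopwords
filtered = [token for token in tokens if token not in domain_words]

print(Counter(filtered).most_common(2))
```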

# Box plot

A boxplot, also called a box-and-whisker plot, displays the distribution and skewness of numerical data through its quartiles. It is built from a five-number summary: the minimum, first quartile, median, third quartile, and maximum.

In a box plot, we draw a box from the [first quartile to the third quartile](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review). A line goes through the box at the median (it is horizontal in the vertical boxplots below). The whiskers go from each quartile toward the minimum or maximum; with matplotlib's default settings, they stop at the most extreme data point within 1.5 times the interquartile range of the box, and any points beyond that are drawn individually as outliers.

Use boxplots when you want to illustrate variation in a single float or integer, and to identify outliers.
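The numbers behind a boxplot can be computed by hand. The sketch below uses made-up values and assumes matplotlib's default whisker rule of 1.5 times the interquartile range (IQR) to decide which points count as outliers.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # made-up data

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR away from the box are drawn as individual markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```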

In [None]:
# For the entire dataset...
gap.boxplot(column=["lifeExp"]);

The line inside the box above is the median value in the data; the bottom and top of the box are the first and third quartiles.

Let's make boxplots of life expectancy **_by_** continent in the gapminder dataset.

In [None]:
# For each continent
gap.boxplot(column=["lifeExp"], 
            by = "continent", 
            figsize = (5, 4)
           )

plt.title("");

Note the circles, which refer to outliers in the data.

# Scatterplot

Scatterplots are useful to show the relationships between two float/integer variables. Make a scatterplot with life expectancy on the x-axis and GDP per capita on the y-axis from the gapminder dataset.

In [None]:
# Note the accessor-style syntax (.plot.scatter); the columns are passed by name as strings!
gap.plot.scatter(x = "lifeExp", 
                 y = "gdpPercap", 
                 figsize = (4, 3));
plt.xlabel("Life Expectancy")
plt.ylabel("GDP per Capita")
plt.title("GDP vs Life Expectancy");

In [None]:
# We could have done this with seaborn as well
scatterplot = sns.scatterplot(data = gap, 
                               x = "lifeExp", 
                               y = "gdpPercap", 
                               marker = ".")

In [None]:
# An even more advanced option is to use seaborn's FacetGrid
# (https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)
# This allows us to create a grid of plots based on a categorical variable
# In this case, we will create a grid of scatter plots for each continent

scatterplot_facet = sns.FacetGrid(gap, col = "continent", 
                                  col_wrap = 3, height = 4, sharex = False)
scatterplot_facet.map(plt.scatter, "lifeExp", "gdpPercap", marker = ".");

# Exporting Figures

You will want to export some figures to include in your presentations and other work. Add the `plt.savefig();` call as your last line of code! Remember: this will save to your working directory.

In [None]:
jordan_df_cleaned.plot.bar(x = "Word", y = "Frequency", figsize = (8,4))
plt.xlabel("Word (stop words removed)")
plt.ylabel("Frequency")
plt.title("Bar plot of most common words in Jordan 2013")
plt.savefig("barplot_example.jpg", dpi = 100);
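A slightly more defensive export might look like the sketch below. The data, the file name, and the choice of a temporary folder are assumptions for illustration. The `bbox_inches="tight"` option asks matplotlib to fit the saved image around everything that was drawn, which helps when long tick labels would otherwise be cut off.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["a", "b", "c"], [3, 1, 2])  # made-up data
ax.set_title("Export example")

# Hypothetical output location: a temporary folder instead of the working directory
out_path = os.path.join(tempfile.mkdtemp(), "export_example.png")

# A higher dpi gives a sharper image; bbox_inches="tight" keeps labels from being clipped
fig.savefig(out_path, dpi=150, bbox_inches="tight")
print(os.path.exists(out_path))
```

Calling `savefig()` on the figure object rather than on `plt` makes it explicit which figure is being exported when several are open.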