<center><b>DIGHUM101</b></center>
<center>2-2: Data Visualization</center>

---

# Fast review

1. What are some key components of text preprocessing? 
2. Why is text preprocessing important?

# Learning objectives

1. Learn some theory and best practices of data visualization.
2. Make a matplotlib histogram, barplot, boxplot, and scatterplot.
3. Begin to read Claus O. Wilke's [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/).

In [None]:
import pandas as pd

# Import matplotlib (https://en.wikipedia.org/wiki/Matplotlib)
import matplotlib

# Import the pyplot "submodule" for fast plotting
# (https://www.quora.com/What-is-the-difference-between-Python-modules-packages-libraries-and-frameworks)
import matplotlib.pyplot as plt

# Graphics should appear "inline" (within the Jupyter Notebook instead of somewhere else)
%matplotlib inline

# Indicate the template to use for the plot (not required)
# Type plt.style.available to experiment with different styles
# Note: matplotlib 3.6 and later list this style as 'seaborn-v0_8-bright'
plt.style.use('seaborn-bright')

import seaborn as sns

In [None]:
plt.style.available

In [None]:
%pwd # print working directory

In [None]:
# Sometimes there is no need to set the working directory! 
# What is going on in the file path in pd.read_csv()?

lit = pd.read_csv("../../Data/childrens_lit.csv", sep = "\t", index_col=False)

print(lit.shape)
lit.head(8)

In [None]:
# Let's get rid of that "Unnamed: 0" column; it has to do with how the CSV was saved.
del(lit['Unnamed: 0'])
lit

In [None]:
# How to get the text of a book from a single cell?

lit["text"][0][:100]

# Or... remember loc and iloc?
#lit.loc[0,"text"][:100]
#lit.iloc[0,3][:100]
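The three ways of reaching a single cell shown above can be compared side by side. This is a minimal sketch on a tiny made-up data frame; the `toy` name, its rows, and its column order are assumptions for illustration and are not the real `lit` data.

```python
import pandas as pd

# Hypothetical stand-in for the `lit` data frame (made-up rows, not the real data)
toy = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "author": ["Author A", "Author B"],
    "year": [1850, 1872],
    "text": ["Once upon a time there was a mouse.", "It was a dark night."],
})

# Three equivalent ways to reach the same cell, then slice the string
by_column = toy["text"][0][:9]       # column first, then row label
by_label = toy.loc[0, "text"][:9]    # row label, column name
by_position = toy.iloc[0, 3][:9]     # row position, column position

print(by_column, by_label, by_position, sep=" | ")
```

`loc` looks cells up by label and `iloc` by integer position, so the `3` in the `iloc` call only works because "text" happens to be the fourth column here.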

# Theory of Data Visualization

Visualization is meant to convey information.

> The power of a graph is its ability to enable one to take in the quantitative information, organize it, and see patterns and structure not readily revealed by other means of studying the data.

\- Cleveland and McGill, 1984

Certain techniques make that information easier to interpret and understand. In their 1984 paper, "[Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods](https://www-jstor-org.libproxy.berkeley.edu/stable/2288400?seq=1#page_scan_tab_contents)," Cleveland and McGill identify 10 elementary perceptual tasks that are used to "extract quantitative information from graphs." Their premise is:

> A graphical form that involves elementary perceptual tasks that lead to more accurate judgments than another graphical form (with the same quantitative information) will result in better organization and increase the chances of a correct perception of patterns and behavior.

Whereas graph design had, up to that point, been "largely unscientific," Cleveland and McGill took a systematic approach in analyzing human graphical perception through experimentation. Their research helped identify the most and least accurate elementary perceptual tasks, ordered below from most to least accurate:

1. Position along a common scale
2. Position along non-aligned scales
3. Length, direction, angle
4. Area
5. Volume, curvature
6. Shading, color saturation

In 2010, [Heer and Bostock](http://vis.stanford.edu/files/2010-MTurk-CHI.pdf) confirmed these results using Amazon's Mechanical Turk.

Let's take a look at a few examples. Because we're only interested in relative sizes, we don't include a legend with size information or reference points.

![circles](../../Img/circles.png)

For circles of distinctly different sizes, the comparison is simple. For example, "A" is smaller than "B." However, for circles, such as "L" and "M," that are almost the same size, it's difficult to tell which is smaller. Area, according to Cleveland and McGill's research, is less accurate than, say, length, which we consider next.

![bars](../../Img/bars.png)

Focusing on "L" and "M," it is easy to see which is larger. You might be wondering whether scale makes a difference, that is, whether the small circle sizes are what make the comparison difficult. It doesn't.

Next, we consider a case where we want to plot two series. For this example, let's suppose we're working with student English and Math test scores. Here, we'll want to use bars, which we arbitrarily label Z-L. The question is, which bars should we use? This is a case where the answer depends on what we're trying to communicate. If we're interested in showing total scores, we could use a stacked bar chart.

![stacked bars](../../Img/two-series-0.png)

We can tell that "Y" and "L" had the highest cumulative scores. What if we want to know which students scored highest on the math exam? Because the math portions of each bar are stacked on top of the English scores (they are on "non-aligned scales," as Cleveland and McGill call it), it's difficult to tell. One solution is to plot these on opposite sides of the x-axis.

![circles](../../Img/two-series-1.png)

Now, it's easier to see that "R" scored quite well on the math exam. The tradeoff with this layout is that it's difficult to compare cumulative scores. Comparing "Z" and "O," for example, is a challenge. Again, it depends on what the message is.

These findings are a *guide* for what works when the goal is to make accurate judgments. Sometimes, however, the goal might not be to allow for precise comparisons but, rather, to facilitate the perception of larger patterns.
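The two layouts discussed above can be reproduced with a few lines of matplotlib. This is only a sketch: the student labels and scores are made up for illustration and are not the data behind the images.

```python
import matplotlib.pyplot as plt

# Made-up scores, for illustration only
students = ["Z", "Y", "X", "W"]
english_scores = [70, 85, 60, 75]
math_scores = [65, 90, 80, 55]

fig, (ax_stacked, ax_split) = plt.subplots(1, 2, figsize=(8, 3))

# Stacked: totals are easy to compare, but the math bars start at different heights
ax_stacked.bar(students, english_scores, label="English")
ax_stacked.bar(students, math_scores, bottom=english_scores, label="Math")
ax_stacked.set_title("Stacked")
ax_stacked.legend()

# Opposite sides of the x-axis: both subjects start from the same baseline
ax_split.bar(students, english_scores, label="English")
ax_split.bar(students, [-score for score in math_scores], label="Math")
ax_split.axhline(0, color="black", linewidth=0.8)
ax_split.set_title("Opposite sides of the x-axis")
ax_split.legend()

fig.tight_layout()
```

In the second panel the math scores are plotted as negative numbers, so the tick labels below zero would need relabelling before the figure is shown to an audience.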

## Form and Function

> A good graphic realizes two basic goals: It **presents** information, and it allows users to **explore** that information.

\- Alberto Cairo

> A data visualization should only be beautiful when beauty can promote understanding in some way without undermining it in another. Is beauty sometimes useful? Certainly. Is beauty always useful? Certainly not.

\- Stephen Few

> Good displays of data help to reveal knowledge relevant to understanding mechanism, process and dynamics, cause and effect.

\- Edward Tufte

A figure is ineffective if it "wastes too much real estate (and the designer's time) on things that don't help readers understand [it]." - Alberto Cairo

> The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

\- William Cleveland

> [A]lways take advantage of the space you have available to seek depth within reasonable limits. After that, *and only after that*, worry about how to make the presentation prettier.

\- Alberto Cairo

Ultimately, identify your audience and their needs and interests. The same data should be visualized differently for a scientific publication versus a magazine.

In [None]:
# Load the gapminder dataset
# Notice we did not set our working directory first - what did we do instead?
gap = pd.read_csv("../../Data/gapminder-FiveYearData.csv")

In [None]:
print(gap.shape)
gap.head()

In [None]:
gap.dtypes

Note that the "64" in the "int64" and "float64" data types refers to the number of bits (8 bytes) used to store each value. The size of these data types in pandas matters only if you are using very large datasets.
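To see what the width of a data type means in practice, here is a small sketch that stores the same 1,000 made-up integers at two widths and asks pandas how many bytes the values occupy. The variable names are hypothetical.

```python
import pandas as pd

# 1,000 made-up integers stored at two different widths
as_int64 = pd.Series(range(1000), dtype="int64")
as_int32 = as_int64.astype("int32")

# memory_usage(index=False) reports the bytes used by the values alone
print(as_int64.memory_usage(index=False))  # 8000 bytes: 1000 values x 8 bytes
print(as_int32.memory_usage(index=False))  # 4000 bytes: 1000 values x 4 bytes
```

For a dataset the size of gapminder the difference is negligible, which is why the default 64-bit types are fine here.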

# Histogram

Use a histogram when you would like to visualize the distribution of a single float or integer variable. 

In [None]:
# Note the semicolon, which tells Jupyter not to print additional output about the plot
gap.hist();

### We can also adjust the figure size.

In [None]:
gap.hist(figsize = (15, 5)); # width, height

### Change the number of bins

Bins are like intervals in which to break up the distribution of data. Say we want a "higher resolution" than the plots above:

In [None]:
gap.hist(bins = 25, figsize = (15, 5), color = "green");
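To make the idea of bins concrete, the sketch below counts values per bin with `numpy.histogram`, which uses equal-width bins between the minimum and maximum by default. The values are made up for illustration and are not taken from gapminder.

```python
import numpy as np

# Made-up life expectancies, for illustration only
values = np.array([38.0, 42.5, 47.1, 55.3, 61.0, 63.2, 70.4, 72.8, 75.9, 80.1])

# Ask for 4 equal-width bins between the minimum and the maximum
counts, edges = np.histogram(values, bins=4)

# Each bin is described by its left edge, right edge, and count
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n} value(s)")
```

Four bins need five edges, and every value lands in exactly one bin, so the counts always add up to the number of observations. Raising `bins` narrows each interval, which is what gives the "higher resolution" look.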

### Plot a single variable

In [None]:
gap.hist(column = "lifeExp", figsize = (8, 4), color = "black");

We can use these visualizations as a starting point for further analysis. For instance, we see that a few country/year combinations have a very low life expectancy. Let's have a closer look using `sort_values()`:

In [None]:
gap.sort_values(by='lifeExp', ascending=True)

# Bar Plot

You can use a bar plot when you want to illustrate differences in the frequencies of some category. Let's look at the 12 most frequent words in "jordan2013.txt".

In [None]:
# Open the file, read its contents, and close it again automatically
with open("../../Data/human-rights/jordan2013.txt", 
          encoding = "utf-8") as f:
    jordan = f.read()
print(jordan[0:500])

In [None]:
from collections import Counter

In [None]:
# Tokenize jordan2013.txt into single words by splitting on whitespace
jordan_tokens = jordan.split()
print(jordan_tokens[0:25])

In [None]:
# Count the 12 most common words
jordan_freq = Counter(jordan_tokens)
jordan_barplot = jordan_freq.most_common(12)
jordan_barplot
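Because `split()` only breaks on whitespace, capitalized words and words with attached punctuation are counted separately from their plain forms. The sketch below shows one possible way to normalize tokens before counting; the sample sentence is made up and the cleaning steps are an assumption, not part of the original workflow.

```python
from collections import Counter
import string

# A made-up sentence standing in for the report text
sample = "The court said the law applies. The Law, however, was unclear."

# Plain whitespace split, as above: "The"/"the" and "law"/"Law," stay separate
raw_counts = Counter(sample.split())

# Lowercase each token and strip surrounding punctuation before counting
clean_tokens = [tok.lower().strip(string.punctuation) for tok in sample.split()]
clean_counts = Counter(clean_tokens)

print(raw_counts.most_common(3))
print(clean_counts.most_common(3))
```

After cleaning, "the" is counted three times and "law" twice, whereas the raw counts split them across several spellings.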

In [None]:
# Convert to data frame
jordan_df = pd.DataFrame(data = jordan_barplot, 
                         columns = ["Word", "Frequency"])
jordan_df

In [None]:
# Plot it! Tip: press TAB after typing "jordan_df.plot." to see other chart types you could use.
jordan_df.plot.bar(x = "Word", y = "Frequency", figsize = (6,4));

# Another way of doing this: pass the chart type via the "kind" argument
# jordan_df.plot(kind="bar", x = "Word", y = "Frequency", figsize = (6,4));

In [None]:
# Change to horizontal bar chart; invert y-axis; change x and y axis labels; add title
# In a horizontal bar chart, the words run along the y-axis and the frequencies along the x-axis
jordan_df.plot.barh(x = "Word", y = "Frequency", figsize = (5,3)).invert_yaxis()
plt.xlabel("Frequency")
plt.ylabel("Word (including stop words)")
plt.title("Bar plot of most common words in Jordan 2013");

# Box plot

A boxplot, also called a box-and-whisker plot, displays the distribution and skewness of numerical data through its quartiles. It summarizes the data with the minimum, first quartile, median, third quartile, and maximum.

In a box plot, we draw a box from the [first quartile to the third quartile](https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review). A line goes through the box at the median (a horizontal line in the vertical boxplots below). The whiskers extend from each quartile toward the minimum and maximum; points that fall unusually far from the box are drawn individually as outliers.

Use boxplots when you want to illustrate variation in a single float or integer variable, and to identify outliers.

In [None]:
# For the entire dataset...
gap.boxplot(column=["lifeExp"]);

The line inside the box above is the median value in the data; the bottom and top of the box are the first and third quartiles.

Let's make boxplots of life expectancy **_by_** continent in the gapminder dataset.

In [None]:
# For each continent
gap.boxplot(column=["lifeExp"], 
            by = "continent", 
            figsize = (5, 4)
           )

plt.title("");

Note the circles, which mark outliers in the data.
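Which points get drawn as circles follows a simple rule: by default, matplotlib flags anything more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below applies that rule by hand with numpy; the values are made up for illustration.

```python
import numpy as np

# Made-up values with one unusually low observation
values = np.array([23.6, 55.0, 58.0, 60.0, 62.0, 65.0, 67.0, 70.0, 72.0, 75.0])

# Quartiles and interquartile range (the height of the box)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers)
```

Only 23.6 falls outside the fences here. An "outlier" in this sense is just a point far from the bulk of the data; whether it is an error or a real observation is a separate question.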

# Scatterplot

Scatterplots are useful to show the relationships between two float/integer variables. Make a scatterplot with life expectancy on the x-axis and GDP per capita on the y-axis from the gapminder dataset.

In [None]:
# Note that the columns are passed by name, as strings!
gap.plot.scatter(x = "lifeExp", 
                 y = "gdpPercap", 
                 figsize = (4, 3));
plt.xlabel("Life Expectancy")
plt.ylabel("GDP per Capita")
plt.title("GDP vs Life Expectancy");

# Exporting Figures

You will want to export some figures to include in your presentations and other work. Add the `plt.savefig();` call as your last line of code! Remember: this will save to your working directory.

In [None]:
jordan_df.plot.bar(x = "Word", y = "Frequency", figsize = (8,4))
plt.xlabel("Word (including stop words)")
plt.ylabel("Frequency")
plt.title("Bar plot of most common words in Jordan 2013")
plt.savefig("barplot_example.jpg", dpi = 100);

You may then need to convert the saved file to a .TIFF or similar format, depending on submission requirements.

Mac users: open the file in Preview, then click File --> Export --> select TIFF as the format --> select JPEG compression.

Windows users: try the same thing, but use "Save As" instead of "Export".
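`savefig()` picks the output format from the file extension, so the same figure can be written as a raster image and as a vector graphic. The sketch below is one way to do that; the figure contents, file names, and the temporary output folder are assumptions for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# A throwaway figure with made-up numbers
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["a", "b", "c"], [3, 5, 2])
ax.set_title("Example figure")

# Save into a temporary folder so the working directory stays clean
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "barplot_example.png")
pdf_path = os.path.join(out_dir, "barplot_example.pdf")

# The file extension selects the output format
fig.savefig(png_path, dpi=300, bbox_inches="tight")  # raster image
fig.savefig(pdf_path, bbox_inches="tight")           # vector graphic

print(sorted(os.listdir(out_dir)))
```

A higher `dpi` gives a sharper raster image at the cost of file size, and `bbox_inches="tight"` trims the surrounding whitespace so axis labels are less likely to be cut off.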

# Using Seaborn

Matplotlib is good, but seaborn is better! We can do more with less code - you might even find the syntax a little easier to understand. [Start looking through examples here](https://seaborn.pydata.org/tutorial.html).

Seaborn is a library built on top of matplotlib that integrates closely with pandas data structures. It has three figure-level plotting functions: `relplot()`, `displot()`, and `catplot()`. Each has a number of axes-level counterparts (shown in the diagram below), which are essentially shorthands for particular kinds of the main functions.

<img src="https://seaborn.pydata.org/_images/function_overview_8_0.png">

Let's begin with a histogram that counts the number of times values in one column (the x axis) fall into each bin. Let's also colour by continent (`hue=`) and layer the different continents on top of each other (`multiple="layer"`).

In [None]:
sns.histplot(              
    data=gap,         
    x="lifeExp",
    hue="continent", 
    multiple="layer",
    bins=20
);

A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram.

In [None]:
sns.kdeplot(data=gap, x="lifeExp", hue="year");
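To see what a KDE computes, the sketch below builds one by hand with numpy: it places a small Gaussian bump on every observation and averages the bumps. The observations and the bandwidth are made up; seaborn chooses a bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

# Made-up observations, for illustration only
observations = np.array([48.0, 52.0, 55.0, 70.0, 72.0, 75.0, 78.0])
bandwidth = 3.0  # hypothetical smoothing width, chosen by hand

# Evaluate the estimate on a grid that extends past the data
grid = np.linspace(30.0, 95.0, 651)

# One Gaussian bump per observation, evaluated at every grid point
z = (grid[:, None] - observations[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# Approximate the area under the curve with a Riemann sum
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

The area under the curve comes out very close to 1, which is what makes the result a density rather than a count. A larger bandwidth gives a smoother curve, much as fewer bins give a coarser histogram.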

Let's make the same scatterplot we did above using seaborn. We use `relplot()`, which takes column names as strings for `x` and `y`, as well as a `kind` argument that should be either "scatter" or "line".
We take into account two dimensions, life expectancy and GDP per capita, while applying colours according to continent.

In [None]:
sns.relplot(
    data=gap,
    kind="scatter",
    x="lifeExp",
    y="gdpPercap",
    hue="continent"
);

Finally, if we want scatterplots for all continents separately, we can do that using [FacetGrid](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html).

In [None]:
scatterplot_facet = sns.FacetGrid(gap, col = "continent", 
                                  col_wrap = 3, height = 4, sharex = False)
scatterplot_facet.map(plt.scatter, "lifeExp", "gdpPercap", marker = ".");
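The same kind of faceted scatterplot can also be requested from `relplot()` directly, since it builds a FacetGrid behind the scenes. The sketch below is one way to do it on a small made-up data frame; the `toy` rows are assumptions for illustration and not gapminder values.

```python
import pandas as pd
import seaborn as sns

# Made-up data, for illustration only
toy = pd.DataFrame({
    "lifeExp": [45.0, 52.3, 60.1, 68.4, 71.2, 74.9],
    "gdpPercap": [800.0, 1500.0, 3000.0, 9000.0, 12000.0, 25000.0],
    "continent": ["Africa", "Africa", "Asia", "Asia", "Europe", "Europe"],
})

grid = sns.relplot(
    data=toy,
    kind="scatter",
    x="lifeExp",
    y="gdpPercap",
    col="continent",              # one panel per continent
    col_wrap=3,                   # start a new row after three panels
    height=4,
    facet_kws={"sharex": False},  # passed through to FacetGrid
)

# axes_dict maps each continent to the Axes of its panel
print(sorted(grid.axes_dict))
```

Using `FacetGrid` with `map()` gives more control over what is drawn in each panel, while `relplot()` keeps the call shorter and handles the axis labels.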