# Lecture 7 – Data Visualization 📈
## DSC 10, Winter 2022

### Announcements

- Homework 2 is due on **Saturday, 1/22 at 11:59pm**.
- Lab 3 is due on **Tuesday, 1/25 at 11:59pm**.
- Watch the [supplemental video](https://youtu.be/xg7rnjWnZ48) for Lecture 6 with answers to the challenge problems.
    - Or look at [this post](https://campuswire.com/c/G6950E967/feed/78).
- There was a [discussion walkthrough video](https://youtu.be/Q3mww8m3iIQ) posted yesterday. Watch it!

### Agenda

- Why visualize?
- Terminology.
- Scatter plots.
- Line plots.
- Bar charts.

**Note:** Don't forget about the [resources](https://dsc10.com/resources) tab of the course website!

Run the next 3 cells; don't worry about their content.

In [None]:
# Run this cell.
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

plt.style.use('fivethirtyeight')

from IPython.display import HTML, display, IFrame

In [None]:
chapters = open('data/lw.txt').read().split('CHAPTER ')[1:]

In [None]:
# Counts of names in the chapters of Little Women

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)
counts

# cumulative number of times each name appears

lw_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 48, 1)
)
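
For anyone curious about what the setup cells do, here is a minimal sketch on three made-up "chapters" (the strings are invented for illustration and are not taken from the novel). It assumes only NumPy: `np.char.count` counts occurrences of a substring in each string, and `np.cumsum` turns per-chapter counts into running totals.

```python
import numpy as np

# Three made-up "chapters"; the real notebook reads them from data/lw.txt.
toy_chapters = ['Jo met Amy. Jo laughed.', 'Beth played piano.', 'Amy and Jo argued.']

# Number of times the substring 'Jo' appears in each chapter.
jo_per_chapter = np.char.count(toy_chapters, 'Jo')

# Running total of appearances up to and including each chapter.
jo_cumulative = np.cumsum(jo_per_chapter)

print(jo_per_chapter)
print(jo_cumulative)
```

Because the count is based on substrings, `'Jo'` would also be counted inside a longer word such as "John", so counts like these are best read as approximate.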

## Why visualize?

### Little Women

- Who is the main character?
- Which pair of characters gets married in Chapter 35?

In [None]:
lw_counts.plot(x='Chapter');

### Napoleon's March

<center><img src="./data/minard.jpg"/></center>

### John Snow

<center><img src='data/map.jpg'></center>

### Why visualize?

- Information in DataFrames can be hard to interpret.
- Visualizations make it easier to spot trends, gather insight, and communicate our results to others.
- There are many types of visualizations: scatter plots, line plots, bar charts, etc.
    - The right choice depends on the type of data.

## Terminology

### Individuals and variables

<center><img src='data/individuals-variables.png' width=800/></center>

- **Individual (row)**: Person/place/thing for which data is recorded.
- **Variable (column)**: Something that is recorded for each individual, a.k.a. a "feature".

### Types of variables

There are two main types of variables:

- **Numerical**: It makes sense to do arithmetic with the values.
- **Categorical**: Values fall into categories.

### Examples of numerical variables

- Salaries of NBA players 🏀.
    - An individual is an NBA player; the variable is salary.
- Gross earnings of movies 💰.
    - An individual is a movie; the variable is gross earnings.
- Booster doses administered per day 💉.
    - An individual is a date; the variable is number of booster doses administered that day.

### Examples of categorical variables

- High schools of all students in DSC 10 🧑‍🎓.
    - An individual is a student; the variable is high school.
- Zip codes 🏠.
    - An individual is a US resident; the variable is zip code.
        - Even though they look like numbers, zip codes are categorical (arithmetic doesn't make sense).
- Movie genres 🎬.
    - An individual is a movie; the variable is genre.
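
The distinction can be sketched in code. The example below uses plain pandas (which babypandas wraps) and a small made-up table, so the names and values are assumptions for illustration only. Zip codes are stored as strings on purpose, since averaging them would not mean anything.

```python
import pandas as pd

# Made-up data for illustration only.
people = pd.DataFrame({
    'Salary': [50000, 62000, 58000],
    'Zip Code': ['92093', '92122', '92093'],  # strings, not numbers
})

# Numerical: arithmetic such as an average is meaningful.
average_salary = people.get('Salary').mean()

# Categorical: counting rows per category is meaningful,
# but an "average zip code" would not be.
zip_counts = people.groupby('Zip Code').count()

print(average_salary)
print(zip_counts)
```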

### Types of visualizations

The type of visualization we create depends on the kinds of variables we're visualizing.

- **Scatter plot**: numerical vs. numerical.
- **Line plot**: sequential numerical (time) vs. numerical.
- **Bar chart**: categorical vs. numerical.
- **Histogram**: distribution of numerical.
    - Will cover next time.
    
**Note:** We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

### Discussion Question

Which of these is **not** a numerical variable?

A. Fuel economy in miles per gallon.

B. Number of quarters at UCSD.

C. College at UCSD (Sixth, Seventh, etc.).

D. Bank account number.

E. More than one of these are not numerical variables.


### To answer, go to **[menti.com](https://menti.com)** and enter the code **7882 3531**.

## Scatter plots

### Dataset of 50 top-grossing actors

|Column|Contents|
|----------|------------|
|Actor|Name of actor|
|Total Gross|Total gross domestic box office receipt, in millions of dollars, of all of the actor’s movies|
|Number of Movies|The number of movies the actor has been in|
|Average per Movie|Total gross divided by number of movies|
|#1 Movie|The highest grossing movie the actor has been in|
|Gross|Gross domestic box office receipt, in millions of dollars, of the actor’s #1 Movie|

In [None]:
actors = bpd.read_csv('data/actors.csv').set_index('Actor')
actors

### Scatter plots

What is the relationship between `'Number of Movies'` and `'Total Gross'`?

In [None]:
actors.plot(kind='scatter', x='Number of Movies', y='Total Gross');

### Scatter plots

- Scatter plots visualize the relationship between two numerical variables.
- To create one from a DataFrame `df`, use
```
df.plot(
    kind='scatter', 
    x=x_column_for_horizontal, 
    y=y_column_for_vertical
)
```
- The resulting scatter plot has one point per row of `df`.
- If you put a semicolon after a call to `.plot`, it hides the text output (a description of the plot object) that would otherwise display.
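
As a minimal sketch of the pattern above, the following builds a tiny made-up DataFrame with plain pandas (the column names and values are invented) and checks that the scatter plot contains one point per row. The `matplotlib.use('Agg')` line is only there so the sketch runs without a display; it can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'Hours Studied': [1, 2, 3, 4],
    'Exam Score': [55, 70, 72, 90],
})

ax = toy.plot(kind='scatter', x='Hours Studied', y='Exam Score')

# The plotted points, as (x, y) pairs: one per row of the DataFrame.
points = ax.collections[0].get_offsets()
print(len(points), toy.shape[0])
```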

### Scatter plots

What is the relationship between `'Number of Movies'` and `'Average per Movie'`?

In [None]:
actors.plot(kind='scatter', x='Number of Movies', y='Average per Movie');

**What do you notice about the above plot?**

- There's a _negative_ association.

- There's an outlier.

### Who was in 60 or more movies?

In [None]:
actors[actors.get('Number of Movies') >= 60]

### Who is the outlier?

Whoever they are, they made very few, high-grossing movies.

In [None]:
actors[actors.get('Number of Movies') < 10]

<center><img src='data/c3po.png' width=200></center>

## Line plots 📉

### New dataset aggregating movies by year

|Column|Contents|
|------|-----------|
|Year|Year|
|Total Gross|Total domestic box office gross, in millions of dollars, of all movies released|
|Number of Movies|Number of movies released|
|#1 Movie|Highest grossing movie|

In [None]:
# Load in movies_by_year.csv
movies_by_year = bpd.read_csv('data/movies_by_year.csv').set_index('Year')
movies_by_year

### Line plots

- How has the number of movies changed over time? 🤔

In [None]:
movies_by_year.plot(kind='line', y='Number of Movies');

### Line plots

- Line plots show trends in numerical variables over time.
- To create one from a DataFrame `df`, use
```
df.plot(
    kind='line', 
    x=x_column_for_horizontal, 
    y=y_column_for_vertical
)
```

### Plotting tip

- **Tip**: if you want the x-axis to be the index, omit the `x=` argument!
- Doesn't work for scatter plots, but works for most other plot types.
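
To see the tip in action on data small enough to check by hand, the sketch below (plain pandas, made-up values) draws the same line twice: once with the years in the index and no `x=`, and once with the years as an ordinary column passed as `x=`. Both plots should end up with the same horizontal positions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import pandas as pd

# Made-up data for illustration only.
toy_by_year = pd.DataFrame({
    'Year': [2000, 2001, 2002],
    'Count': [5, 8, 6],
}).set_index('Year')

# x omitted: the index ('Year') is used for the horizontal axis.
ax_index = toy_by_year.plot(kind='line', y='Count')

# x given explicitly, after moving 'Year' back into a column.
ax_column = toy_by_year.reset_index().plot(kind='line', x='Year', y='Count')

x_from_index = list(ax_index.lines[0].get_xdata())
x_from_column = list(ax_column.lines[0].get_xdata())
print(x_from_index == x_from_column)
```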

In [None]:
movies_by_year.plot(kind='line', y='Number of Movies');

### Focus on 2000-2015

We can create a line plot of just 2000 to 2015 by querying `movies_by_year` before calling `.plot`.

In [None]:
movies_by_year[movies_by_year.index >= 2000].plot(kind='line', y='Number of Movies');
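
The query above only sets a lower bound on the year. If a dataset also contained years after 2015, an upper bound could be added by combining two conditions with `&`. A minimal sketch with plain pandas and made-up values (the parentheses around each condition are required):

```python
import pandas as pd

# Made-up data for illustration only.
toy_by_year = pd.DataFrame({
    'Year': [1998, 2000, 2007, 2015, 2016],
    'Number of Movies': [10, 12, 15, 9, 11],
}).set_index('Year')

# Lower bound only: everything from 2000 onward.
from_2000 = toy_by_year[toy_by_year.index >= 2000]

# Lower and upper bound: 2000 through 2015, inclusive.
from_2000_to_2015 = toy_by_year[
    (toy_by_year.index >= 2000) & (toy_by_year.index <= 2015)
]

print(list(from_2000.index))
print(list(from_2000_to_2015.index))
```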

Why is there a big drop between 2007 and 2009?

### How did this affect total gross?

In [None]:
movies_by_year[movies_by_year.index >= 2000].plot(kind='line', y='Total Gross');

### Trivia: what was the top grossing movie of 2009?

A. Avatar

B. Harry Potter and the Half-Blood Prince

C. The Hangover

D. Up

### To answer, go to **[menti.com](https://menti.com)** and enter the code **7882 3531**.

### Answer

In [None]:
...

## Bar charts 📊

### Dataset of the top 200 songs on Spotify as of Monday (1/17)

[Downloaded from here – check it out!](https://spotifycharts.com/regional)

In [None]:
charts = bpd.read_csv('data/regional-global-daily-latest.csv').set_index('Position')
charts

### Bar charts

How many streams do the top 15 songs have?

In [None]:
charts.take(np.arange(15))

In [None]:
charts.take(np.arange(15)).plot(kind='barh', x='Track Name', y='Streams');

### Bar charts

- Bar charts visualize the relationship between a categorical variable and a numerical variable.
- In a bar chart...
    - The thickness and spacing of bars is arbitrary.
    - The order of the categorical labels doesn't matter.
- To create one from a DataFrame `df`, use
```
df.plot(
    kind='barh', 
    x=categorical_column_name, 
    y=numerical_column_name
)
```
- The **"h"** in `'barh'` stands for **"horizontal"**.
    - It's easier to read labels this way.
- In the previous chart, we set `y='Streams'` even though, in a horizontal bar chart, the number of streams is shown by each bar's length along the horizontal axis.
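
Here is a minimal sketch of a horizontal bar chart on made-up data, using plain pandas. It inspects the bars that were drawn to show two things: each bar's length matches its `'Streams'` value, and the first row of the DataFrame is drawn lowest on the chart.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook
import pandas as pd

# Made-up data for illustration only, sorted from most to fewest streams.
toy_songs = pd.DataFrame({
    'Track Name': ['Song A', 'Song B', 'Song C'],
    'Streams': [300, 200, 100],
})

ax = toy_songs.plot(kind='barh', x='Track Name', y='Streams')

# One rectangle per row, in row order.
widths = [bar.get_width() for bar in ax.patches]   # bar lengths
bottoms = [bar.get_y() for bar in ax.patches]      # vertical positions

print(widths)
print(bottoms)
```

Since the vertical positions increase with row order, the row with the most streams ends up at the bottom here, which is why sorting in ascending order before plotting puts the largest bar on top.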

In [None]:
# The bars appear in the opposite order relative to the DataFrame
charts.take(np.arange(15)).sort_values(by='Streams').plot(kind='barh', x='Track Name', y='Streams');

### How many songs do the top 15 artists have in the top 200?

In [None]:
# Create a DataFrame with a single column that describes the number of songs in the top 200 per artist.
# Keep only the top 15 artists.

top_15_artists = charts.groupby('Artist') \
                       .count() \
                       .sort_values(by='Streams', ascending=False) \
                       .take(np.arange(15)) \
                       .get(['Streams'])
top_15_artists

In [None]:
# Relabel the column in top_15_artists to be Count.

top_15_artists = top_15_artists.assign(Count=top_15_artists.get('Streams')).drop(columns=['Streams'])
top_15_artists
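
The same group, count, and relabel steps can be traced on a table small enough to check by hand. This sketch uses plain pandas and made-up artists and tracks. After `.count()`, each remaining column holds the number of rows in the group (assuming no missing values), which is why `'Streams'` can simply be relabeled as `'Count'`.

```python
import pandas as pd

# Made-up data for illustration only.
toy_charts = pd.DataFrame({
    'Track Name': ['T1', 'T2', 'T3', 'T4'],
    'Artist': ['A', 'B', 'A', 'A'],
    'Streams': [40, 30, 20, 10],
})

# Count rows per artist, largest count first, keeping a single column.
songs_per_artist = (
    toy_charts.groupby('Artist')
    .count()
    .sort_values(by='Streams', ascending=False)
    .get(['Streams'])
)

# Relabel 'Streams' as 'Count', since it now holds counts, not stream totals.
songs_per_artist = songs_per_artist.assign(
    Count=songs_per_artist.get('Streams')
).drop(columns=['Streams'])

print(songs_per_artist)
```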

In [None]:
# Again, we have to sort in **ascending** order because a horizontal
# bar chart draws the first row at the bottom and the last row at the top.
# (This only matters for horizontal bar charts.)

top_15_artists.sort_values(by='Count').plot(kind='barh', y='Count');

### Vertical bar charts

- Use `kind='bar'` instead of `kind='barh'`.

In [None]:
top_15_artists.plot(kind='bar', y='Count');

### Discussion Question

Suppose we run the following block of code. What does the resulting bar chart tell us?

```py
mystery = charts.columns[1]
charts.groupby(mystery).sum() \
      .sort_values(by='Streams', ascending=False) \
      .take(np.arange(10)) \
      .sort_values(by='Streams') \
      .plot(kind='barh', y='Streams');
```

A. The number of total streams for the top 10 artists in the charts.

B. The number of total streams for the bottom 10 artists in the charts.

C. The number of total streams for the top 10 songs in the charts.

D. The number of total streams for the bottom 10 songs in the charts.

E. Something else entirely.

### To answer, go to **[menti.com](https://menti.com)** and enter the code **7882 3531**.

In [None]:
...

### How many streams did Justin Bieber's songs on the chart receive?

In [None]:
charts[charts.get('Artist') == 'Justin Bieber'].sort_values('Streams').plot(kind='barh', x='Track Name', y='Streams');

It seems like we're forgetting a very popular song.

In [None]:
charts

### How do we include featured songs, as well?

Answer: Using `.str.contains`.

In [None]:
charts[charts.get('Track Name').str.contains('Justin Bieber')]

In [None]:
bieber = charts[(charts.get('Artist') == 'Justin Bieber') | (charts.get('Track Name').str.contains('Justin Bieber'))]
bieber
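
To make the combined query easier to follow, here is a minimal sketch on a made-up table, using plain pandas. The artist and track names are placeholders. Each condition produces a Series of `True`/`False` values, and `|` keeps the rows where at least one of them is `True`.

```python
import pandas as pd

# Made-up data for illustration only.
toy_charts = pd.DataFrame({
    'Track Name': ['Solo Song', 'Duet (with Artist X)', 'Other Song'],
    'Artist': ['Artist X', 'Artist Y', 'Artist Z'],
})

# True where Artist X is the listed artist.
is_lead = toy_charts.get('Artist') == 'Artist X'

# True where Artist X is mentioned in the track name (a featured credit).
is_featured = toy_charts.get('Track Name').str.contains('Artist X')

# Rows where either condition holds.
either = toy_charts[is_lead | is_featured]
print(either)
```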

In [None]:
bieber.sort_values('Streams').plot(kind='barh', x='Track Name', y='Streams');

## Fun demo

In [None]:
# Run this cell, don't worry about what it does
def show_spotify(url):
    code = url[url.rfind('/')+1:]
    src = f"https://open.spotify.com/embed/track/{code}"
    width = 400
    height = 75
    display(IFrame(src, width, height))
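
For the curious, the string handling in `show_spotify` can be tried on its own. The sketch below uses a hypothetical URL (`abc123` is not a real track ID): `rfind('/')` gives the position of the last slash, and slicing from one past that position keeps everything after it.

```python
# Hypothetical URL for illustration; 'abc123' is not a real track ID.
example_url = 'https://open.spotify.com/track/abc123'

# Position of the last '/' in the URL.
last_slash = example_url.rfind('/')

# Everything after the last slash is treated as the track code.
code = example_url[last_slash + 1:]

# The same embed address that show_spotify builds.
embed_src = f"https://open.spotify.com/embed/track/{code}"

print(code)
print(embed_src)
```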

#### Let's find the URL of a song we care about.

In [None]:
charts

In [None]:
favorite_song = "STAY (with Justin Bieber)"

In [None]:
song_url = charts[charts.get('Track Name') == favorite_song].get('URL').iloc[0]
song_url

Watch what happens! 🎶

In [None]:
show_spotify(song_url);

Try it out yourself!

## Summary

### Summary

- Visualizations make it easy to extract patterns from datasets.
- There are two main types of variables: categorical and numerical.
- The types of the variables we're visualizing inform our choice of which type of visualization to use.
- Today, we looked at scatter plots, line plots, and bar charts.
- **Next time:** Histograms and overlaid plots.