In [None]:
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

from IPython.display import display, IFrame, YouTubeVideo

def show_grouping_animation():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTgVlFngQcLMYHP-z1vq5lVXjsBgcHebc-3TX7SW6L_gjX6TD1gsflvVDQUpWiDdeEPqJASenUIfBVd/embed?start=false&loop=false&delayms=60000"
    width = 960
    height = 509
    display(IFrame(src, width, height))
    
import warnings
warnings.simplefilter('ignore')

# Lecture 6 – Data Visualization 📈
## DSC 10, Spring 2023

### Announcements

- Lab 1 is due **tomorrow at 11:59PM**.
- Homework 1 is due on **Tuesday at 11:59PM**.
- Come to office hours for help! The schedule is [here](https://dsc10.com/calendar).
- Make sure to try the "challenge problems" at the end of Wednesday's lecture and watch [this](https://www.youtube.com/watch?v=xg7rnjWnZ48) walkthrough video for the answers.

#### Don't forget about these resources!

- [DSC 10 Reference Sheet 📌](https://drive.google.com/file/d/1ky0Np67HS2O4LO913P-ing97SJG0j27n/view). 
- [`babypandas` notes](https://notes.dsc10.com).
- [`babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html).
- [The Resources tab of the course website](https://dsc10.com/resources/).

### Agenda

- Recap: GroupBy.
- Why visualize?
- Terminology.
- Scatter plots.
- Line plots.
- Bar charts.

### Aside: Keyboard shortcuts

There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).

Particularly useful shortcuts:

| Action | Keyboard shortcut |
| --- | --- |
| Run cell + jump to next cell | SHIFT + ENTER |
| Save the notebook | CTRL/CMD + S |
| Create new cell above/below | A/B |
| Delete cell | DD |

## Recap: GroupBy

In [None]:
show_grouping_animation()

Run the cell below to load in the `requests` DataFrame from last class.

In [None]:
requests = bpd.read_csv('data/get-it-done-requests.csv')
requests = requests.assign(total=requests.get('closed') + requests.get('open'))
requests

### Which neighborhood had the most requests?

In [None]:
requests

In [None]:
requests.groupby('neighborhood').sum()

In [None]:
# Note the use of .index – remember, the index isn't a column!
(
    requests
    .groupby('neighborhood')
    .sum()
    .sort_values(by='total', ascending=False)
    .index[0]
)

### Example: Number of different services

How do we find the number of different services requested in each neighborhood?

As always when using `groupby`, there are two steps:

1. Choose a column to group by.
    - Here, `'neighborhood'` seems like a good choice.

2. Choose an aggregation method.
   - Common aggregation methods include `.count()`, `.sum()`, `.mean()`, `.median()`, `.max()`, and `.min()`.
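
The two steps above can be sketched on a tiny table. This is only a minimal sketch: it assumes plain `pandas` in place of `babypandas`, and the rows are made up to mimic the shape of `requests`. The names `toy`, `totals`, and `counts` are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the shape of `requests` (not the real data).
toy = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'B', 'B'],
    'service': ['Pothole', 'Graffiti', 'Pothole', 'Parking', 'Graffiti'],
    'closed': [10, 5, 2, 8, 1],
    'open': [1, 0, 3, 2, 4],
})

# Step 1: choose the column to group by ('neighborhood').
# Step 2: choose an aggregation method.
# For .sum(), keep only the numerical columns, since adding service names is not meaningful.
totals = toy.get(['neighborhood', 'closed', 'open']).groupby('neighborhood').sum()

# .count() reports the number of rows in each group, for every remaining column.
counts = toy.groupby('neighborhood').count()
```

In this sketch, `totals` has one row per neighborhood, and the group labels become the index rather than a column.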

In [None]:
# How many different requests are there for the neighborhood 'University'?
requests[requests.get('neighborhood') == 'University']

In [None]:
# How do we find this result for every neighborhood?

### Observation #4

The column names of the output of `.groupby` don't make sense when using the `.count()` aggregation method.

In [None]:
num_diff_services = requests.groupby('neighborhood').count()
num_diff_services

Consider dropping unneeded columns and renaming columns as follows:
1. Use `.assign` to create a new column containing the same values as the old column(s).
2. Use `.drop(columns=list_of_column_labels)` to drop the old column(s). Alternatively, use `.get(list_of_column_labels)` to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
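
Both options can be sketched side by side. This is a hedged illustration that assumes plain `pandas` and a made-up table standing in for the output of `.count()`; `toy_counts`, `renamed`, and `reordered` are hypothetical names.

```python
import pandas as pd

# Made-up output of a .count() aggregation; every column holds the same counts.
toy_counts = pd.DataFrame(
    {'service': [2, 3], 'closed': [2, 3], 'open': [2, 3]},
    index=pd.Index(['A', 'B'], name='neighborhood'),
)

# Option 1: copy one column under a clearer name, then drop the old columns.
renamed = (
    toy_counts
    .assign(count_of_services=toy_counts.get('open'))
    .drop(columns=['service', 'closed', 'open'])
)

# Option 2: keep only the listed columns, in the order given.
reordered = toy_counts.get(['open', 'service'])
```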

In [None]:
num_diff_services = (
    num_diff_services
    .assign(count_of_services=num_diff_services.get('open'))
    .drop(columns=['service', 'closed', 'open', 'total'])
)
num_diff_services

## Why visualize?

Run these cells to load the _Little Women_ data from Lecture 1.

In [None]:
chapters = open('data/lw.txt').read().split('CHAPTER ')[1:]

In [None]:
# Counts of names in the chapters of Little Women.

counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)

# Cumulative number of times each name appears.

lw_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 48, 1)
)

lw_counts

### Little Women

In Lecture 1, we were able to answer questions about the plot of _Little Women_ without having to read the novel and without having to understand Python code. Some of those questions included:

- Who is the main character?
- Which pair of characters gets married in Chapter 35?

We answered these questions from a data visualization alone!

In [None]:
lw_counts.plot(x='Chapter');

### Napoleon's March

> "Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812." ([source](https://www.edwardtufte.com/tufte/posters))

<center><img src="./images/minard.jpg"/></center>

### Why visualize?

- Computers are better than humans at crunching numbers, but humans are better at identifying visual patterns.

- Visualizations allow us to understand lots of data quickly – they make it easier to spot trends and communicate our results with others.

- There are many types of visualizations; in this class, we'll look at scatter plots, line plots, bar charts, and histograms, but there are many others.
    - The right choice depends on the type of data.

## Terminology

### Individuals and variables

<center><img src='images/ind-var.png' width=90%/></center>

- <span style="color:#6d9eeb"><b>Individual (row)</b></span>: Person/place/thing for which data is recorded. Also called an **observation**.

- <span style="color:#ff9900"><b>Variable (column)</b></span>: Something that is recorded for each individual. Also called a **feature**.

### Types of variables

There are two main types of variables:

- **Numerical**: It makes sense to do arithmetic with the values.
- **Categorical**: Values fall into categories, which may or may not have some _order_ to them.

Note that here, "variable" does not mean a variable in Python, but rather it means a column in a DataFrame.

### Examples of numerical variables

- Salaries of NBA players 🏀.
    - Individual: An NBA player.
    - Variable: Their salary.

- Movie gross earnings 💰.
    - Individual: A movie.
    - Variable: Its gross earnings.

- Booster doses administered per day 💉.
    - Individual: Date.
    - Variable: Number of booster doses administered on that date.

### Examples of categorical variables

- Movie genres 🎬.
    - Individual: A movie.
    - Variable: Its genre.

- Zip codes 🏠.
    - Individual: US resident.
    - Variable: Zip code.
        - Even though they look like numbers, zip codes are categorical (arithmetic doesn't make sense).

- Level of prior programming experience for students in DSC 10 🧑‍🎓.
    - Individual: Student in DSC 10.
    - Variable: Their level of prior programming experience, e.g. none, low, medium, or high. 
        - There is an _order_ to these categories!
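
To see why zip codes are better treated as categorical, here is a small sketch in plain Python. The two zip codes are chosen only for illustration, and the variable names are hypothetical.

```python
# Two zip codes stored as text, the way categorical labels are usually kept.
zip_codes = ['02139', '92093']

# Treating them as numbers silently drops the leading zero of the first one.
as_numbers = [int(z) for z in zip_codes]

# An average can be computed, but it does not describe anything about the two places.
average = sum(as_numbers) / len(as_numbers)
```

Converting back with `str(as_numbers[0])` gives `'2139'`, which is no longer the original label.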

### Concept Check ✅ – Answer at [cc.dsc10.com](http://cc.dsc10.com) 

Which of these is **not** a numerical variable?

A. Fuel economy in miles per gallon.

B. Number of quarters at UCSD.

C. College at UCSD (Sixth, Seventh, etc).

D. Bank account number.

E. More than one of these are not numerical variables.

### Types of visualizations

The type of visualization we create depends on the kinds of variables we're visualizing.

- **Scatter plot**: Numerical vs. numerical.
- **Line plot**: Sequential numerical (time) vs. numerical.
- **Bar chart**: Categorical vs. numerical.
- **Histogram**: Numerical.
    - Will cover next time.
    
We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

## Scatter plots

### Dataset of 50 top-grossing actors

|Column |Contents|
|----------|------------|
|`'Actor'`|Name of actor|
|`'Total Gross'`|Total gross domestic box office receipts, in millions of dollars, of all of the actor’s movies|
|`'Number of Movies'`|The number of movies the actor has been in|
|`'Average per Movie'`|Total gross divided by number of movies|
|`'#1 Movie'`|The highest grossing movie the actor has been in|
|`'Gross'`|Gross domestic box office receipts, in millions of dollars, of the actor’s #1 Movie|

In [None]:
actors = bpd.read_csv('data/actors.csv').set_index('Actor')
actors

### Scatter plots

What is the relationship between `'Number of Movies'` and `'Total Gross'`?

In [None]:
actors.plot(kind='scatter', x='Number of Movies', y='Total Gross');

### Scatter plots

- Scatter plots visualize the relationship between two numerical variables.
- To create one from a DataFrame `df`, use
```
df.plot(
    kind='scatter', 
    x=x_column_for_horizontal, 
    y=y_column_for_vertical
)
```
- The resulting scatter plot has one point per row of `df`.
- If you put a semicolon after a call to `.plot`, it hides the text description of the plot object that would otherwise be displayed along with the plot.
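
Here is a runnable sketch of the pattern above. It assumes plain `pandas` with `matplotlib` rather than `babypandas`, and the values are made up (they are not from the actors dataset). The `matplotlib.use('Agg')` line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend; not needed inside a notebook.
import pandas as pd

# Made-up values, not the actors dataset.
toy = pd.DataFrame({
    'Number of Movies': [10, 20, 30, 40, 50],
    'Total Gross': [900, 1500, 2100, 2600, 3300],
})

# x comes from one numerical column and y from another.
ax = toy.plot(kind='scatter', x='Number of Movies', y='Total Gross')

# The plotted points, as (x, y) pairs: one per row of the DataFrame.
points = ax.collections[0].get_offsets()
```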

### Scatter plots

What is the relationship between `'Number of Movies'` and `'Average per Movie'`?

In [None]:
actors.plot(kind='scatter', x='Number of Movies', y='Average per Movie');

Note that in the above plot, there's a _negative_ association and an outlier.

### Who was in 60 or more movies?

In [None]:
actors[actors.get('Number of Movies') >= 60]

### Who is the outlier?

Whoever they are, they made very few, high-grossing movies.

In [None]:
actors[actors.get('Number of Movies') < 10]

<center><img src='images/c3po.png' width=200></center>

## Line plots 📉

### Dataset aggregating movies by year

|Column|Content|
|------|-----------|
|`'Year'`|Year|
|`'Total Gross in Billions'`|Total domestic box office gross, in billions of dollars, of all movies released|
|`'Number of Movies'`|Number of movies released|
|`'#1 Movie'`|Highest grossing movie|

In [None]:
movies_by_year = bpd.read_csv('data/movies_by_year.csv').set_index('Year')
movies_by_year

### Line plots

How has the number of movies changed over time? 🤔

In [None]:
movies_by_year.plot(kind='line', y='Number of Movies');

### Line plots

- Line plots show trends in numerical variables over time.
- To create one from a DataFrame `df`, use
```
df.plot(
    kind='line', 
    x=x_column_for_horizontal, 
    y=y_column_for_vertical
)
```

### Plotting tip

- If you want the x-axis to be the index, omit the `x=` argument!
- This doesn't work for scatter plots, which need both `x=` and `y=`, but it works for most other plot types.
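
The sketch below compares the two ways of choosing the x-axis for a line plot. It assumes plain `pandas` with `matplotlib`, and the yearly counts are made up; `toy`, `ax_explicit`, and `ax_index` are hypothetical names.

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend; not needed inside a notebook.
import pandas as pd

# Made-up yearly counts, not the movies_by_year dataset.
toy = pd.DataFrame({
    'Year': [2000, 2001, 2002, 2003],
    'Number of Movies': [300, 320, 310, 350],
})

# Option 1: name the x column explicitly.
ax_explicit = toy.plot(kind='line', x='Year', y='Number of Movies')

# Option 2: make 'Year' the index and omit x=.
ax_index = toy.set_index('Year').plot(kind='line', y='Number of Movies')

# Both lines are drawn over the same x values.
x_explicit = list(ax_explicit.get_lines()[0].get_xdata())
x_index = list(ax_index.get_lines()[0].get_xdata())
```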

In [None]:
movies_by_year.plot(kind='line', y='Number of Movies');

### Zooming in

We can create a line plot of just 2000 onwards by querying `movies_by_year` before calling `.plot`.

In [None]:
movies_by_year[movies_by_year.index >= 2000].plot(kind='line', y='Number of Movies');

What do you think explains the declines around 2008 and 2020?

### How did this affect total gross?

In [None]:
movies_by_year[movies_by_year.index >= 2000].plot(kind='line', y='Total Gross in Billions');

### What was the top grossing movie of 2018?

In [None]:
...

### Extra video on line plots

If you're curious how line plots work under the hood, watch [this video](https://www.youtube.com/watch?v=glzZ04D1kDg) we made a few quarters ago.

In [None]:
YouTubeVideo('glzZ04D1kDg')

## Bar charts 📊

### Dataset of the top 200 songs in the US on Spotify as of Thursday (4/13/2023)

[Downloaded from here – check it out!](https://spotifycharts.com/regional)

In [None]:
charts = (bpd.read_csv('data/regional-us-daily-2023-04-13.csv')
          .set_index('rank')
          .get(['track_name', 'artist_names', 'streams', 'uri'])
         )
charts

### Bar charts

How many streams do the top 10 songs have?

In [None]:
charts

In [None]:
charts.take(np.arange(10))

In [None]:
charts.take(np.arange(10)).plot(kind='barh', x='track_name', y='streams');

### Bar charts

- Bar charts visualize the relationship between a categorical variable and a numerical variable.
- In a bar chart...
    - The thickness and spacing of bars are arbitrary.
    - The order of the categorical labels doesn't matter.
- To create one from a DataFrame `df`, use
```
df.plot(
    kind='barh', 
    x=categorical_column_name, 
    y=numerical_column_name
)
```
- The **"h"** in `'barh'` stands for **"horizontal"**.
    - It's easier to read labels this way.
- In the previous chart, we set `y='streams'` even though streams are measured by x-axis length.
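
The `x=`/`y=` convention for horizontal bar charts can be checked on a small example. This sketch assumes plain `pandas` with `matplotlib`, and the song names and stream counts are made up.

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend; not needed inside a notebook.
import pandas as pd

# Made-up songs and stream counts, not the Spotify chart.
toy = pd.DataFrame({
    'track_name': ['Song A', 'Song B', 'Song C'],
    'streams': [300, 200, 100],
})

# x= names the categorical column and y= the numerical one, even though
# with 'barh' the numerical values end up as horizontal bar lengths.
ax = toy.plot(kind='barh', x='track_name', y='streams')

# Each bar's length (the width of its rectangle) equals its value in 'streams'.
bar_lengths = [bar.get_width() for bar in ax.patches]
```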

In [None]:
# The bars appear in the opposite order relative to the DataFrame.
(charts
 .take(np.arange(10))
 .sort_values(by='streams')
 .plot(kind='barh', x='track_name', y='streams')
);

In [None]:
# Change "barh" to "bar" to get a vertical bar chart. These are a little harder to read.
(charts
 .take(np.arange(10))
 .sort_values(by='streams')
 .plot(kind='bar', x='track_name', y='streams')
);

### Aside: How many streams did The Weeknd's songs on the chart receive?

In [None]:
(charts
 [charts.get('artist_names') == 'The Weeknd']
 .sort_values('streams')
 .plot(kind='barh', x='track_name', y='streams')
);

It seems like we're missing some popular songs...

### How do we include songs with other artists, as well?

Answer: Using `.str.contains`.
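
The difference between an exact match and `.str.contains` can be sketched on made-up data. The sketch assumes plain `pandas` and that collaborators are listed together in a single string; the artist and song names are placeholders.

```python
import pandas as pd

# Made-up chart rows; collaborators share one 'artist_names' string.
toy = pd.DataFrame({
    'track_name': ['Song 1', 'Song 2', 'Song 3'],
    'artist_names': ['Artist A', 'Artist A, Artist B', 'Artist C'],
})

# == keeps only rows whose artist string is exactly 'Artist A'.
exact = toy[toy.get('artist_names') == 'Artist A']

# .str.contains keeps every row whose artist string includes 'Artist A' anywhere.
partial = toy[toy.get('artist_names').str.contains('Artist A')]
```

Here `exact` misses the collaboration, while `partial` keeps it.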

In [None]:
weeknd = charts[charts.get('artist_names').str.contains('The Weeknd')]
weeknd

In [None]:
weeknd.sort_values('streams').plot(kind='barh', x='track_name', y='streams');

## Fun demo 🎵

In [None]:
# Run this cell; don't worry about what it does.
def show_spotify(uri):
    code = uri[uri.rfind(':')+1:]
    src = f"https://open.spotify.com/embed/track/{code}"
    width = 400
    height = 75
    display(IFrame(src, width, height))

#### Let's find the URI of a song we care about.

In [None]:
charts

In [None]:
favorite_song = 'Die For You (with Ariana Grande) - Remix'

In [None]:
song_uri = (charts
            [charts.get('track_name') == favorite_song]
            .get('uri')
            .iloc[0])
song_uri

Watch what happens! 🎶

In [None]:
show_spotify(song_uri)

Try it out yourself!
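
For the curious, the string slicing inside `show_spotify` can be sketched on its own. This assumes a URI of the form `spotify:track:<id>`; the ID below is a made-up placeholder, not a real track.

```python
# Hypothetical URI of the form 'spotify:track:<id>'; the ID is a placeholder.
uri = 'spotify:track:EXAMPLEID123'

# rfind gives the position of the last colon; slicing from one past it keeps only the ID.
code = uri[uri.rfind(':') + 1:]

# The ID is then placed at the end of the embed URL.
src = f"https://open.spotify.com/embed/track/{code}"
```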

## Summary

### Summary

- Visualizations make it easy to extract patterns from datasets.
- There are two main types of variables: categorical and numerical.
- The types of the variables we're visualizing inform our choice of which type of visualization to use.
- Today, we looked at scatter plots, line plots, and bar charts.
- **Next time**: More bar charts, histograms, and overlaid plots.