# Lecture 8 – Visualizing Categorical Variables

## Data 6, Summer 2022

In [None]:
from datascience import * # datascience has plotting features built in
import numpy as np

Table.interactive_plots() 

## Bar Charts

Bar charts are helpful for visualizing the relationship between a categorical variable and a numerical variable, or for visualizing the distribution of a categorical variable. For example, we can visualize how many of the [R1 Universities](https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States) in the U.S. are public vs. private schools.

In [None]:
schools = Table.read_table('data/r1_with_students.csv')
schools

Don't worry about what `tbl.group()` does. You will learn about this later.

In [None]:
schools_by_type = schools.group('Type')

In [None]:
schools_by_type.barh('Type')
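
Conceptually, a bar chart of a categorical distribution has one bar per category, and each bar's length is the number of rows in that category. The following is a minimal plain-Python sketch of that idea. The list of school types is made up for illustration, and `text_bars` is a hypothetical helper, not part of the `datascience` library.

```python
from collections import Counter

# Hypothetical school types, not taken from the real `schools` table
types = ['Public', 'Private', 'Public', 'Public', 'Private', 'Public']

# One count per distinct category
counts = Counter(types)

def text_bars(counts):
    """Return one line per category: the label, a bar of '#', and the count."""
    lines = []
    for label, n in sorted(counts.items()):
        lines.append(f'{label:<8} {"#" * n} {n}')
    return lines

for line in text_bars(counts):
    print(line)
```

Each printed line plays the role of one horizontal bar: the category is the label on the y axis and the count sets the bar's length along the x axis.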

If your table has more than two columns, you may need to specify both the column of categories (shown on the y axis) and the column of values (shown on the x axis) by following this form: `tbl.barh(y_label, x_label)`

For example, given `state_mean_enrollments`, a table of average enrollment at R1 universities by state, we can generate a bar chart of average enrollment by state.

In [None]:
# Don't worry about how this cell works — just run it for now
state_mean_enrollments = schools.select('State', 'Number_students', 'Score_Result').group('State', np.mean) \
                                .relabeled('Number_students mean', 'Average Enrollment')
state_mean_enrollments

In [None]:
state_mean_enrollments.barh('State', 'Average Enrollment')

Looking at this graph, Minnesota clearly stands out. Why is Minnesota's average enrollment more than 20,000 students above that of every other state?
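
When one group's average stands out, one useful check is how many rows were averaged to produce it, since a group with a single row simply reports that row's value. The sketch below is only an illustration of that check: the state codes and enrollments are made up, `mean_and_count_by_group` is a hypothetical helper, and it uses plain Python rather than the `datascience` API.

```python
from collections import defaultdict

# Hypothetical (state, enrollment) rows, not the real `schools` data
rows = [
    ('AA', 30000), ('AA', 10000), ('AA', 20000),
    ('BB', 45000),
]

def mean_and_count_by_group(rows):
    """Map each group label to (mean of its values, number of rows)."""
    values = defaultdict(list)
    for label, value in rows:
        values[label].append(value)
    return {label: (sum(v) / len(v), len(v)) for label, v in values.items()}

summary = mean_and_count_by_group(rows)
for label, (mean, count) in summary.items():
    print(f'{label}: mean = {mean:.0f}, rows = {count}')
```

In this made-up data, group `BB` has the highest mean only because it contains a single large value, while `AA` averages over three rows.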

### Example 1: Top 10 Songs on Spotify

The streaming service Spotify has a lot of data we can work with. For the next few questions, we will work with Spotify data from 2021. You can download an up-to-date copy of this data [here](https://spotifycharts.com/regional).

In [None]:
streams = Table.read_table('data/regional-global-daily-latest.csv', header = 1)
top_10 = streams.select('Track Name', 'Streams').take(np.arange(10))

In [None]:
top_10

In [None]:
... # Create a bar chart of streams for each of the songs in the top 10

### Example 2: Artists with the Most Songs in the Spotify Top 200 (2021)

In [None]:
streams

In [None]:
# Don't worry about this yet — just run this cell
artists_top_200 = streams.group('Artist') \
                         .sort('count', descending = True) \
                         .where('count', are.above(2))
artists_top_200

In [None]:
... # Plot a bar chart of artists by number of songs in the top 200

### Example 3: Number of Students at the 15 Largest Universities (in our dataset)

In [None]:
schools

In [None]:
# Find the 15 schools with the largest enrollment
# Don't worry, you haven't learned .take() yet
top_15_schools = schools.select('University', 'Number_students') \
                        .sort('Number_students', descending = True) \
                        .take(np.arange(15))

Can you find Berkeley?

In [None]:
... # Plot the enrollment of the top 15 schools

### Quick Check 1

Given the table `schools`, generate a bar chart of the enrollments at universities in Georgia (GA) (James' home state). _(Hint: `.where` will be helpful here)_

In [None]:
...

## Grouped Bar Charts

To illustrate the power of bar charts (and visualization more generally), we are going to work with economic data from countries across the world. The `gdp` table includes both _nominal_ GDP (essentially total economic output in dollars) and GDP [Purchasing Power Parity (PPP)](https://en.wikipedia.org/wiki/Purchasing_power_parity), which adjusts for the cost of living across countries.

In [None]:
# Just run this cell.
def remove_comma(s):
    return int(s.replace(',', ''))

nominal = Table.read_table('data/gdp-nominal.csv')
ppp = Table.read_table('data/gdp-ppp.csv').drop(3)
gdp = nominal.join('Country/Territory', ppp) \
       .drop(1, 3) \
       .relabeled(['GDP(US$million)', 'GDP(millions of current Int$)'], ['GDP Nominal', 'GDP PPP'])
gdp = gdp.with_columns(
    'GDP Nominal', gdp.apply(remove_comma, 'GDP Nominal'),
    'GDP PPP', gdp.apply(remove_comma, 'GDP PPP')
)
gdp = gdp.sort('GDP Nominal', descending = True)

In [None]:
gdp

In the cell below, we first take the first 15 countries sorted by nominal GDP. The specifics of how we do this aren't in scope yet, but consider why we first take a subset of all 181 rows in our table. _(Hint: what would our bar chart look like if we tried to plot bars for all countries?)_

In [None]:
gdp_top_15 = gdp.take(np.arange(15))
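
For the curious, `np.arange(15)` is the array of integers 0 through 14, so `gdp.take(np.arange(15))` keeps the first 15 rows of the already sorted table. The same "sort, then keep the first few" idea can be sketched in plain Python. The rows below are made-up labels and values, and `top_n` is a hypothetical helper, not a `datascience` method.

```python
# Hypothetical (country, GDP) rows with made-up labels and values
rows = [('C', 5), ('A', 40), ('E', 1), ('B', 25), ('D', 3)]

def top_n(rows, n):
    """Sort rows by value, largest first, and keep the first n."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return ordered[:n]

top_3 = top_n(rows, 3)
print(top_3)
```

Keeping only the first few rows after sorting is what makes the resulting chart short enough to read.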

Now, let's look at the nominal GDP for these top 15 countries:

In [None]:
gdp_top_15.select('Country/Territory', 'GDP Nominal').barh('Country/Territory')

We can also look at the GDP PPP for these countries.

In [None]:
gdp_top_15.select('Country/Territory', 'GDP PPP').barh('Country/Territory')

**Since the scales/magnitudes of both of these variables are similar**, it makes sense to plot them in one overlaid (or "grouped") bar chart:

In [None]:
... # Create an overlaid bar chart of nominal and PPP GDP for `gdp_top_15`
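
The condition in bold matters because an overlaid chart draws every variable against one shared axis. The sketch below illustrates this with text bars scaled against the single largest value. The measure names and numbers are made up, and `scaled_bars` is a hypothetical helper written for this illustration.

```python
def scaled_bars(values, width=40):
    """Scale every value against the single largest one, as a shared axis does."""
    largest = max(values.values())
    return {name: round(width * v / largest) for name, v in values.items()}

# Hypothetical numbers: two measures of similar size, then one far larger
similar = {'measure_a': 900, 'measure_b': 1000}
mismatched = {'measure_a': 900, 'measure_c': 1_000_000}

for data in (similar, mismatched):
    for name, length in scaled_bars(data).items():
        print(f'{name:<10} {"#" * length}')
    print()
```

With similar magnitudes both bars are easy to compare. With mismatched magnitudes the smaller bar shrinks to nothing, which is why variables on very different scales are usually plotted separately.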

We can sort by GDP PPP, too:

In [None]:
gdp.sort('GDP PPP', descending = True).take(np.arange(15)).barh('Country/Territory')

## Customization

You can also customize bar charts (and other plots) by adding in the following optional inputs:
* `title`: The title of the bar chart
* `xaxis_title`: The x axis label
* `yaxis_title`: The y axis label
* `width`: The width of the chart
* `height`: The height of the chart

For example:

In [None]:
gdp_bottom_15 = gdp.sort('GDP Nominal', descending = False).take(np.arange(15))
gdp_bottom_15

In [None]:
gdp_bottom_15.barh('Country/Territory', 
                   title = '15 Countries with the Lowest Nominal GDP',
                   xaxis_title = 'Nominal GDP (Millions of $)',
                   yaxis_title = 'Country',
                   height = 600,
                   width = 900,)

### Quick Check 2

In [None]:
top_gdp = gdp.take(np.arange(7))
top_gdp

Fill in the blanks in the cell below to produce this bar chart:

<img src="bar_chart_example.png" width="70%">

In [None]:
top_gdp.barh(..., width = 800, height = 500,
             ... = 'GDP (Millions of USD)', ... = 'Country',
             ... = 'The top 7 ranked countries by GDP (Nominal) in 2020')