In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

<img style="display: block; margin-left: auto; margin-right: auto" src="./ccsf-logo.png" width="250rem;" alt="The CCSF black and white logo">

# Lab 04 - Visualizations

## References

* [Sections 7.0 - 7.3 of the Textbook](https://inferentialthinking.com/chapters/07/Visualization.html)
* [datascience Documentation](https://datascience.readthedocs.io/)
* [[Optional] Matplotlib Documentation](https://matplotlib.org/stable/api/index.html)
* [[Optional] Plotly Documentation](https://plotly.com/python/)

## Assignment Reminders

- Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- For all tasks marked with a 🔎, which require written explanations, provide your answer in the designated space.
- Throughout this assignment and all future ones, please be sure to not re-assign variables throughout the notebook! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!_
- Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.
- View the related <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page for additional details.

Run the following cell to set up the lab, and make sure you run the cell at the top of the notebook that initializes Otter.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Attributes

Visualizing data is an essential step in gaining insight from large, complex datasets. Many techniques and tools exist for turning raw data into meaningful representations, and a set of standard visualizations has emerged as go-to options, each with its own strengths and applications. Which standard visualization to use depends on several factors, a key one being the attribute type of the data under investigation. Note that the attribute type does not always match the data type in which the information is stored, so choosing an appropriate visualization takes some judgment.

To streamline our understanding of attribute types for data visualization, we can simplify them into two broad categories: **numerical** and **categorical**. 
* Numerical attributes encompass data that consists of continuous or discrete numeric values. These attributes are typically quantitative and can be operated on mathematically. Examples of numerical attributes include variables like age, temperature, or income.
* Categorical attributes deal with data that fall into distinct categories or labels. They represent qualitative information where mathematical operations typically do not have clear meanings. Examples of categorical attributes include gender (male, female, nonbinary), color (red, blue, green), or product categories (electronics, clothing, food).

By classifying attributes into these two fundamental types, we can better tailor our choice of visualization methods to the nature of the data, allowing us to extract more valuable insights from it.

### Numbers for Categories

Just because an attribute has values that are numbers does not mean you should treat the attribute as numerical. Postal codes are numbers, but their attribute type is categorical rather than numerical. Postal codes represent specific geographical regions and are not meant for mathematical operations like addition or subtraction. For example, `90210` (Beverly Hills) and `10001` (Manhattan) are categorical values representing different locations. An appropriate visualization for postal codes treats them categorically, not numerically, despite their data type.
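
As a rough illustration of the point above, the sketch below stores a few postal codes as integers. The list `postal_codes` is made up for this example (only the codes `90210` and `10001` come from the text). Averaging the codes runs without error but produces a number that names no real region, while counting how often each code appears is a meaningful summary.

```python
from collections import Counter

# Hypothetical sample of postal codes stored as integers
postal_codes = [90210, 10001, 90210, 10001, 90210]

# Arithmetic works, but the result does not describe any location
average_code = sum(postal_codes) / len(postal_codes)

# Counting occurrences treats each code as a category label
code_counts = Counter(postal_codes)

print(average_code)
print(code_counts[90210], code_counts[10001])
```

The count per code is the kind of summary a categorical visualization would display.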

### Task 01 📍

Which of the following attributes are categorical in nature? Assign `categorical_attributes` to an array with the numbers for the variables that represent categorical attributes.

1. Height in centimeters
2. Eye color (e.g., blue, brown, green)
3. Temperature in degrees Celsius
4. Years of education completed
5. Vehicle make and model (e.g., Toyota Camry)
6. Employee identification number
7. Blood type (e.g., A, B, AB, O)
8. Stock prices
9. Time of day (e.g., morning, afternoon, evening)
10. Mobile phone number

In [None]:
categorical_attributes = ...

In [None]:
grader.check("task_01")

## Bar Charts

When it comes to visualizing categorical data, one of the standard and effective methods is the use of bar charts. Bar charts allow us to represent categorical variables by displaying their frequencies or proportions as bars of different lengths or heights. 

In Python, you can create bar charts easily using libraries like matplotlib or datascience. Specifically, the datascience library provides the `bar` and `barh` table methods, which simplify the process of generating bar charts. The `bar` method is used for vertical bar charts, while the `barh` method is employed for horizontal bar charts.

Run the following code cell to create the table `car_inventory`.

In [None]:
car_inventory = Table().with_columns(
    'Car Type', ['Sedan', 'SUV', 'Truck', 'Hatchback'],
    'Count', [25, 15, 12, 8]
)
car_inventory

Using the `barh` table method, you can create a horizontal bar chart to visualize the distribution of car types in `car_inventory`.

Run the following code cell to generate the bar chart.

In [None]:
car_inventory.barh('Car Type', 'Count')

# Optional Customization
plt.title('Distribution of Car Types')
plt.show()

This visualization method is invaluable for quickly grasping the distribution and comparison of categorical data.

The `barh` method takes several arguments, and the two most important are the first two.

* The first argument `column_for_categories` specifies which column in the table to use for the categorical values. In this case, `column_for_categories='Car Type'`.
* The second argument `select` specifies what values to use as the length of the bars. If you don't specify this argument, then it will try to use columns with numerical data to generate the bars. In this case, `select='Count'`.

There are other arguments to experiment with, but you will typically just need to work with these two.
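
To make the roles of the two arguments concrete, here is a text-only sketch of what a horizontal bar chart encodes. It does not use `barh` at all. The helper `text_bar` is a hypothetical function written just for this illustration: the category supplies the label and the count supplies the bar length.

```python
# Same data as car_inventory, written as plain Python lists
car_types = ['Sedan', 'SUV', 'Truck', 'Hatchback']
counts = [25, 15, 12, 8]

def text_bar(label, length):
    """Return one chart row: a padded label followed by `length` marks."""
    return f"{label:<10}" + "#" * length

# One row per category; the count sets the bar length
rows = [text_bar(label, count) for label, count in zip(car_types, counts)]
print("\n".join(rows))
```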


### Campaign Expenditures

According to the [OpenSecrets Campaign Expenditures page](https://www.opensecrets.org/campaign-expenditures):

>While disclosure is often vague or incomplete, the FEC's (Federal Election Commission) expenditures data sheds light on the strategies that campaigns use to turn dollars into votes, the vendors making a fortune on elections, and the groups living large on their donors' money.
>
>Campaigns must report to the FEC the purpose and payee of all disbursements over $200. OpenSecrets uses this information to classify campaign expenditures into nine major categories: Administrative, Campaign Expenses, Fundraising, Media, Contributions, Strategy & Research, Transfers, Wages & Salaries, and Unclassifiable.

Run the following code cell to create a table called `expenditures` that contains the campaign expenditures for the 2022 cycle.

In [None]:
categories = ['Fundraising', 'Media', 'Unclassifiable', 
              'Salaries', 'Campaign Expenses', 
              'Administrative', 'Strategy & Research']
percentages = [12.95, 44.90, 11.27, 8.93, 8.41, 8.11, 5.41]

expenditures = Table().with_columns(
    'Categories', categories,
    'Expenditure Percentage', percentages
)
expenditures

### Task 02 📍🔎

<!-- BEGIN QUESTION -->

Using `expenditures`, create a bar chart showing the distribution of expenditure categories where the lengths of the bars are determined by the reported expenditure percentages. Make sure that the bars of your chart are organized such that the largest bars are at the top and the smallest are at the bottom.

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is not an auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
...

plt.title('Distribution of Campaign Expenditures')
plt.show()

<!-- END QUESTION -->

Hopefully, you see that the visualization really stresses how much more campaigns spend on the media compared to other categories.

### Getting Counts

It is pretty common that a table doesn't actually have a column of values that represent counts or proportions to use for the length of bars, so part of your analysis process involves creating that information.

The `group` table method in the `datascience` library is a powerful tool for summarizing and aggregating data based on the unique values within one or more columns. At its core, it works by grouping rows in a table based on the distinct values in one or more specified columns.

Here is a simple example illustrating the need for the `group` method. 

Run the following code cell to create the table `survey` that contains the responses from a survey.

In [None]:
survey_responses = ['Yes', 'No', 'Maybe', 'No', 'Yes',
                    'Maybe', 'No', 'Yes', 'Maybe', 'No',
                    'Yes', 'Maybe', 'No', 'Yes', 'Maybe',
                    'No', 'Yes', 'Maybe', 'No', 'Yes']

survey = Table().with_column('Response', survey_responses)
survey

If you want to summarize this distribution, you need a count of the `Yes`, `No`, and `Maybe` values. Use `survey.group('Response')` to have the computer do that for you. 

Run the following code cell to see the results.

In [None]:
survey_grouped = survey.group('Response')
survey_grouped

The `group` method also takes more than one argument. For now, we will focus on the first one, `column_or_label`, which is the column index or label to group by. You will learn more about `group` later in the course. For now, you want to learn enough to make bar charts.
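
If it helps to see the counting spelled out, the following plain-Python sketch tallies the same survey responses with `collections.Counter`. It is only an approximation of what `group` does, not its implementation. Sorting the labels alphabetically is an assumption made here so the rows are easy to compare with the grouped table.

```python
from collections import Counter

survey_responses = ['Yes', 'No', 'Maybe', 'No', 'Yes',
                    'Maybe', 'No', 'Yes', 'Maybe', 'No',
                    'Yes', 'Maybe', 'No', 'Yes', 'Maybe',
                    'No', 'Yes', 'Maybe', 'No', 'Yes']

# Tally how many times each distinct response appears
response_counts = Counter(survey_responses)

# Sort by label (an assumption, to line up with the grouped table)
grouped_rows = sorted(response_counts.items())

for label, count in grouped_rows:
    print(label, count)
```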

With the new table that contains the counts for each response value, we can create a bar chart by using that table with the command `survey_grouped.barh('Response')`. 

Run the following code to see the results.

In [None]:
survey_grouped.barh('Response')

# Optional Customization
plt.title('Distribution of Survey Responses')
plt.show()

You would get an error if you tried to use `survey.barh('Response')`, because there is not a column of values to use for the lengths of bars. That is why it was important to create the grouped table.

### Corporate Contributions

Continuing with election campaigning, the OpenSecrets Corporate Contributions to Outside Groups page states that:

> Unlike conventional PACs (political action committees), super PACs and other outside groups can receive direct contributions from corporate treasuries.

Run the following code cell to see a list of corporations that have made direct contributions of $1,000,000 or more to an outside group this cycle.

_Note: This cell contains extra code that you are not responsible for. Extra formatting was needed to get the dates in a usable format._

In [None]:
corporate_contributions = Table.read_table('corporate_contributions_2022.csv')

# Convert all date strings to datetime objects
from datetime import datetime
input_format = "%m/%d/%Y"
dates = corporate_contributions.column('Date')
dates = [datetime.strptime(date_string, input_format) for date_string in dates]
corporate_contributions = corporate_contributions.with_column('Date', dates)

corporate_contributions
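
The sketch below suggests why the date strings are converted before plotting. The three dates are made up for this illustration and are not taken from the data file. Sorted as text, dates in month/day/year form land in the wrong order, while `datetime` objects sort chronologically.

```python
from datetime import datetime

# Hypothetical date strings in the same month/day/year format
date_strings = ["12/01/2021", "01/15/2022", "03/02/2022"]
input_format = "%m/%d/%Y"

# Text sorting compares characters, so the 2021 date ends up last
sorted_as_text = sorted(date_strings)

# Parsed dates compare by actual calendar order
parsed = [datetime.strptime(s, input_format) for s in date_strings]
sorted_as_dates = sorted(parsed)

print(sorted_as_text)
print([d.strftime(input_format) for d in sorted_as_dates])
```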

### Task 03 📍

Use the `group` method with the `corporate_contributions` table to create a table called `viewpoints` with two columns in the following order: `'Recipient Viewpoint'` and `'Percentage'`. This table should show the distribution of `'Recipient Viewpoint'` values in this data set.

**Hint**: The group method gives you counts by default. Extract the count information from the grouped table, divide each count by the total number of items in the table, and multiply by 100 to get the correct percentage. You can then put the percentages back in the table using `with_column` and clean up the table using the `drop` method.
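
The arithmetic in the hint can be sketched on a smaller case. The example below uses the counts from the survey example earlier in this lab, not the contribution data, and works with plain Python lists rather than tables.

```python
# Counts from the survey example earlier in this lab
labels = ['Maybe', 'No', 'Yes']
counts = [6, 7, 7]

# Divide each count by the total, then scale to a percentage
total = sum(counts)
percentages = [100 * count / total for count in counts]

for label, percentage in zip(labels, percentages):
    print(label, percentage)
```

With a NumPy array, such as one returned by a table's `column` method, the same division and multiplication can be applied to the whole array at once without writing a loop.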

In [None]:
viewpoints = ...

In [None]:
grader.check("task_03")

### Task 04 📍🔎

<!-- BEGIN QUESTION -->

Using the `viewpoints` table, create a bar chart showing the distribution of recipient viewpoints for corporate contributions of at least $1,000,000. Make sure that the bars of your chart are organized such that the largest bars are at the top and the smallest are at the bottom.

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is not an auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
...

plt.title('Distribution of Corporate Contribution Recipient Viewpoints')
plt.show()

<!-- END QUESTION -->

One thing to keep in mind when looking at this visualization is that **some corporations made more than one contribution to the same recipient**. So, once you know how to work with the `group` method a little bit more, you might want to revisit this example and create a visualization that better shows the association between corporate contributions and recipient viewpoints.

## Histograms

When it comes to visualizing numerical data, one of the default and fundamental techniques is to use a histogram. Histograms provide a graphical representation of the distribution of numerical values within a dataset, allowing you to observe patterns, central tendencies, and variations. 

In the `datascience` library, you can create histograms conveniently using the `hist` table method. 

When working with histograms, it's essential to consider the choice of bin sizes or intervals, as this can impact the interpretation of the data. 

The `hist` method defaults to displaying data density on the vertical axis rather than raw counts. This means that the height of each bar represents the density of data within that bin, and the area of the bar, not its height, reflects the proportion of the data that falls in the bin. This distinction is crucial for accurately understanding the distribution of numerical data and is a core concept in data visualization and analysis.

Run the following code cell to create a table called `ages` that contains the ages of a group of people.

In [None]:
ages = Table().with_column(
    'Age', [12, 15, 18, 20, 22, 25, 26, 28, 30, 32, 35, 36, 38, 40, 45, 50, 55]
)
ages

To visualize this distribution, you can use the code `ages.hist('Age')`. `hist` has several arguments, but the first argument `columns` identifies the column(s) that contain the numerical data for the histogram.

In [None]:
ages.hist('Age')

# Optional Customization
plt.title('Distribution of Ages')
plt.show()

The `hist` table method offers two other arguments that are worth mentioning for this class: `bins` and `unit`. These arguments play a pivotal role in customizing the appearance and interpretation of the histogram. 

* The `bins` argument allows you to specify the number of bins or intervals into which the data range will be divided. A well-chosen number of bins can significantly affect the visual representation of the data, influencing the granularity of the histogram. By default, there is an algorithm that attempts to generate "good" bins, but you might need to specifically define the bins with an array or the number of bins with an integer to get the histogram to look good for your situation.
* The `unit` argument adds labels to the horizontal and vertical axes as a reminder of the units of the data.

The ages are most likely measured in years and it might make sense to bin these ages by creating bins that are 10 years wide. You can achieve this with the parameters `unit="Years"` and `bins=np.arange(10, 61, 10)`. 
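
A quick way to see why the stop value is `61` rather than `60`: the stop value itself is never included. The sketch below uses Python's built-in `range`, which follows the same start, stop, step pattern as `np.arange` for integers, to list the bin edges and the bins they define.

```python
# Stop value 61 is excluded, so the last edge produced is 60
bin_edges = list(range(10, 61, 10))

# With a stop value of 60, the edge at 60 would be missing
too_short = list(range(10, 60, 10))

# Pair neighboring edges to see the bins themselves
bins = list(zip(bin_edges[:-1], bin_edges[1:]))

print(bin_edges)
print(too_short)
print(bins)
```

Six edges define five bins, each 10 years wide.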

Run the following code cell to see the results.

In [None]:
ages.hist('Age', unit="Years", bins=np.arange(10, 61, 10))

# Optional Customization
plt.title('Distribution of Ages')
plt.show()

Notice how the shape of the histogram changes! The same numerical data can look very different in a histogram depending on how it is binned. This can be used as a tool for analysis and inquiry, but it can also be used as a tool to mislead.
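
To connect the bars to the idea of density, the sketch below computes bar heights by hand for the 10-year bins. It is a plain-Python approximation that assumes the vertical axis is measured in percent per year, and it is not how `hist` is implemented. Each bin here includes its left edge and excludes its right edge, which is enough for this data because no age equals 60.

```python
ages = [12, 15, 18, 20, 22, 25, 26, 28, 30, 32, 35, 36, 38, 40, 45, 50, 55]
bin_edges = [10, 20, 30, 40, 50, 60]

bin_counts = []
heights = []   # percent per year
areas = []     # percent of all ages in the bin
for left, right in zip(bin_edges[:-1], bin_edges[1:]):
    count = sum(1 for age in ages if left <= age < right)
    percent = 100 * count / len(ages)
    width = right - left
    bin_counts.append(count)
    heights.append(percent / width)   # height = area / width
    areas.append(percent)

print(bin_counts)
print(heights)
print(sum(areas))
```

The areas add up to 100 percent (up to floating-point rounding), which is the sense in which area, not height, measures how much data a bin holds.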

### Task 05 📍

The `'Amount'` column in the `corporate_contributions` table shows the amount that each corporation contributed. Your next task will be to visualize the distribution of contribution amounts, but there is a problem with the data. The amounts are presented as strings. Later in the course, you'll learn how to deal with this. For now, we will help you out by creating an array called `amounts` of all the dollar amounts as floats. 

Run the following cell to create that array.

In [None]:
# Just run this cell. You will learn a way to do this later in the course
amounts_as_strings = corporate_contributions.column('Amount')
amounts = np.array([float(s.replace('$', '').replace(',', '')) for s in amounts_as_strings])
amounts

Now, update the data in the `'Amount'` column in `corporate_contributions` with this array of float values.

**Hint**: If you use the `with_column` method with a column label that already exists in the table, then the information in that column will be updated with the array you use with `with_column`.

In [None]:
corporate_contributions = ...

In [None]:
grader.check("task_05")

### Task 06 📍🔎

<!-- BEGIN QUESTION -->

Now that you have the contribution amounts as a numerical data type, use the `hist` method on `corporate_contributions` to show the distribution of contribution amounts. The default bins do not end up creating a good-looking chart, so try adjusting the bins to see how they impact the visualization.

_Ultimately, the issue with the binning is that there are a few really large contributions and most contributions are exactly $1,000,000. A better practice might be to visualize the distribution of really large contribution amounts separately._

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is not an auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
...

plt.title('Distribution of Corporate Contributions')
plt.show()

<!-- END QUESTION -->

## Line Plots and Scatter Plots

When it comes to visualizing numerical relationships in data, scatter plots and line plots are two fundamental tools that provide valuable insights. 

In the `datascience` library, the `scatter` method is used to create scatter plots, while the `plot` method is employed to generate line plots. 

These visualization techniques share a conceptual similarity: both display data points on a two-dimensional plane, typically with one numerical variable on the x-axis and another on the y-axis. However, the key distinction lies in the purpose and interpretation.

* Scatter plots are versatile and are primarily used to showcase the association and general pattern between two numerical variables. They are excellent for revealing relationships, correlations, and outliers in the data.
* Line plots are best suited when the horizontal axis represents sequential data, such as time or distance. These plots connect the data points with lines, making them ideal for visualizing trends and showing how a numerical variable changes over a continuous range, as in the case of tracking revenue over time.

In essence, scatter plots excel at depicting associations, while line plots are tailor-made for illustrating trends and sequential relationships in numerical data.

For example, run the following code cell to generate a table called `company_data` showing revenue and profit data for the last ten years for some hypothetical company.

In [None]:
# Generate random data
np.random.seed(0)
years = np.arange(2014, 2024)
revenue = np.random.poisson(14, 10) * 2_500
profit = revenue * np.random.normal(0.08, 0.002, 10)

company_data = Table().with_columns('Year', years, 'Revenue', revenue, 'Profit', profit)
company_data

A line plot would be a standard choice to visualize the trend of profit over time. This can be done with the command `company_data.plot('Year', 'Profit')`.

Run the following code cell to see the results.

In [None]:
company_data.plot('Year', 'Profit')

# Optional Customization
plt.title('Profits Over Time')
plt.show()

The trend of this line shows that something very significant happened around 2018 to make the company very profitable. After a short period of high profits, there was a steep decline to profit levels lower than those in the years before 2018. For the last few years, the company's profits seem to lack stability. This is likely due to economic instability surrounding the pandemic, but all this profit data was made up and so is money. 🤓

It might be nice to compare the trends of two numerical distributions over the same horizontal axis. This would be a great time to try an overlaid line plot. For example, you could plot the lines for both profit and revenue over time.

The `plot` method can handle this: make sure the table contains only the variables you are interested in (`'Year'`, `'Profit'`, and `'Revenue'`) and specify only the horizontal axis in the `plot` method. For example, use `company_data.plot('Year')`.

Run the following cell to see that a line is created for every numerical column in the table other than `'Year'`.

In [None]:
company_data.plot('Year')

# Optional Customization
plt.title('Revenue and Profits Over Time')
plt.show()

Revenue looks much less stable on this graph because of the scale of the values. Profits were hovering around 8% of revenue, so putting both lines on the same y-axis doesn't offer a fair comparison.
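
The effect of scale can be sketched with a few numbers. The revenue values below are hypothetical (they are not the randomly generated ones above), and profit is fixed at exactly 8% of revenue to keep the arithmetic simple.

```python
# Hypothetical revenue values, not the randomly generated ones above
revenue = [30_000, 37_500, 42_500, 35_000]

# Profit fixed at exactly 8% of revenue for this illustration
profit = [r * 8 // 100 for r in revenue]

# How far each series swings between its lowest and highest value
revenue_spread = max(revenue) - min(revenue)
profit_spread = max(profit) - min(profit)

print(profit)
print(revenue_spread, profit_spread)
```

Both series rise and fall together, but on a shared vertical axis the profit swing is a small fraction of the revenue swing, so the profit line looks nearly flat.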

How do you better see the relationship between revenue and profit? Since the data are not sequential and you are just looking to visualize the association, use a scatter plot. The `scatter` method would help out with this. Since profit follows from revenue, it is standard practice to have the horizontal axis reflect revenue values. Use the command `company_data.scatter('Revenue', 'Profit')` to make this happen.

Run the following cell to see the results.

In [None]:
company_data.scatter('Revenue', 'Profit')

# Optional Customization
plt.title('Revenue vs. Profit')
plt.show()

This shows a pretty strong (linear) positive relationship between revenue and profit. Dividing the profit values by the revenue shows a pretty stable profit percentage of roughly 8%.

In [None]:
company_data.column('Profit') / company_data.column('Revenue')

### Task 07 📍🔎

<!-- BEGIN QUESTION -->

Let's return to the corporate contribution data. Create a line plot to visualize the trend of corporate contributions over the election cycle.

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is not an auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
...

# Customization
plt.title('Contributions Over Time')
plt.gcf().set_size_inches(15, 5)
plt.xticks(rotation=45)
plt.show()

<!-- END QUESTION -->

### Task 08 📍🔎

<!-- BEGIN QUESTION -->

What do you notice about the line graph in terms of contribution trends in relation to the election cycle?

_Make sure to check your response with a classmate, a tutor, or the instructor before moving on since there is not an auto-grader for this lab task._

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Outside Spending

According to the [OpenSecrets Outside Spending page](https://www.opensecrets.org/outside-spending/summary):

> A January 2010 Supreme Court decision (Citizens United v. Federal Election Commission) permits corporations and unions to make political expenditures from their treasuries directly and through other organizations, as long as the spending -- often in the form of TV ads -- is done independently of any candidate. In many cases, the activity takes place without complete or immediate disclosure about who is funding it, preventing voters from understanding who is truly behind many political messages. The spending figures cited are what the groups reported to the FEC; it does not account for all the money the groups spent, since certain kinds of ads are not required to be reported.

Run the following cell to create the table `outside_spending` showing the outside spending for the top 10 races, excluding party committees, in the 2022 election cycle.

In [None]:
outside_spending = Table.read_table('outside_spending.csv')
outside_spending

### Task 09 📍🔎

<!-- BEGIN QUESTION -->

Create a few scatter plots using the `outside_spending` data. Stop when you find a visualization that shows a positive association.

**Suggestion**: Try to find the positive association between two different attributes that seems the strongest!

_Make sure to check your visualization with a classmate, a tutor, or the instructor before moving on since there is not an auto-grader for this lab task._

In [None]:
# Generate your chart in this cell
...

<!-- END QUESTION -->

Great work! You now have some practice with the basics of data visualization.

## Submit your Lab to Canvas

Once you have finished working on the lab questions, prepare to submit your work in Canvas by completing the following steps.

1. In the related Canvas Assignment page, check the requirements for a Complete score for this lab assignment.
2. Double-check that you have run the code cell near the end of the notebook that contains the command `grader.check_all()`. This command will run all of the tests on all your responses to the auto-graded tasks marked with 📍.
3. Double-check your responses to the manually graded tasks marked with 📍🔎.
4. Select the menu items `File`, `Save and Export Notebook As...`, and `Html_embed` in the notebook's Toolbar to download an HTML version of this notebook file.
5. In the related Canvas Assignment page, click Start Assignment or New Attempt to upload the downloaded HTML file.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()