In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

<img src="https://github.com/data-6-berkeley/materials-fa24/blob/main/hw/hw03/data6.png?raw=true" style="width: 15%; float: right; padding: 1%; margin-right: 2%;"/>

# Lab 7 – Data Visualization

## Data 6 Visualizations Module
So far, we have discussed methods to interpret the data, but what if we want to present our data in a visual format? In this lab, you'll learn several important table methods for producing data visualizations. **Visualizations** are some of the most powerful tools in data science; they're helpful for showing data to people who don't necessarily have a background in data science, and allow data scientists like yourselves to help others understand the data in a more intuitive way.

In Lecture 8, we talked about methods we could use to visualize one variable, namely the `barh` and `hist` methods. We added the `scatter` and `plot` methods in Lecture 9. These methods allow us to visualize two or more variables at once, which can open up more patterns in the data and can further improve your ability to visualize data for people who do not necessarily understand data science.

As data scientists, it is not only our job to be able to implement various visualization methods, but also to know *when* to use each method. As we build our toolkit of visualization techniques going forward, it's important to understand the **advantages and disadvantages of each visualization type.**

In [None]:
# Run this cell to load all required Python libraries
import numpy as np
from datascience import *

import matplotlib.pyplot as plt
plt.style.use("ggplot")
%matplotlib inline

In [None]:
salary = Table.read_table("Salary_data.csv")
clean_salary = Table.read_table("clean_data.csv")
clean_salary

<div class="alert alert-warning">
Something important to note before we begin is that the <code>salary</code> dataset that we'll be using today, which includes information on jobholders and their salaries, came from <a href=https://www.kaggle.com/datasets/mohithsairamreddy/salary-data/data>Kaggle</a> and was supposedly combined from multiple surveys, job postings, and other public sources. However, the Kaggle source does not provide any of the original sources that the data was taken from, so we have no idea how reliable or real this data is. It's okay to use data like this for the sake of practice, but when doing so, it is important to remember that the conclusions you can make become much less reliable and trustworthy. When looking to use data that can make an impact, be sure to thoroughly research where your data is coming from and how it was collected. Keep this fact in mind as you're going through the lab!
</div>

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Part 1: Data Visualization Methods for Multiple Variables

<div class="alert alert-warning">
In last week's lab, we saw how we could use bar charts and histograms to visualize individual (and occasionally multiple) variables, in order to get a better idea of how our dataset is broken down and distributed across different features. In this section, we'll dive deeper into how we can visualize the relationships between variables and how one variable may affect another. For this part, we'll focus on the <code>"Years of Experience"</code> variable to test our informal hypothesis that an individual's years of experience may be positively correlated with their salary. Let's start this exploration with scatter plots.
</div>

### **The [scatter](http://data8.org/datascience/_autosummary/datascience.tables.Table.scatter.html#datascience.tables.Table.scatter) method**

As we mentioned, visualizing two variables can show us patterns in the data. The `scatter` method allows us to see the relationship between two numerical variables in our data by producing a **scatter plot**. The first provided column name goes along the x-axis and the second goes along the y-axis.

Let's take a look at the relationship between **years of experience** and **salary** using our `clean_salary` table.

### Producing Scatter Plots

Now, we can call `scatter` on the `clean_salary` table. Run the following cell to do so.

In [None]:
clean_salary.scatter("Years of Experience", "Salary")

Just like that, you've produced your first scatter plot! It looks a little messy, however. Often, scatter plots can suffer from what's known as **[overplotting](https://www.displayr.com/what-is-overplotting/)**: when many data points fall on top of each other, creating a blob of data. When this happens, it's often difficult to see the individual data points.
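
To make overplotting concrete, here is a minimal sketch in plain Python (not the `datascience` library) that counts how many rows share exactly the same coordinates. The `points` list is made-up illustration data, not values from `clean_salary`.

```python
from collections import Counter

# Hypothetical (years of experience, salary) pairs, invented for illustration
points = [(2, 50000), (2, 50000), (2, 50000), (5, 80000), (5, 80000), (10, 120000)]

# Count how many rows land on each exact (x, y) location
location_counts = Counter(points)

# Rows beyond the first at each location are drawn on top of an existing dot
hidden_points = len(points) - len(location_counts)

print("Rows in the data:", len(points))
print("Distinct dots visible:", len(location_counts))
print("Rows hidden behind another dot:", hidden_points)
```

When the number of hidden rows is large relative to the number of visible dots, the plot understates how much data sits at each location.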

To fix this, we can focus in on a smaller subset of the data. In this case, we'll look at individuals who have a PhD education level.

<div class="alert alert-warning">
We keep only individuals with PhDs because doing so significantly reduces the size of the data. Note, however, that whatever trends we find in the scatter plot below might not match the trends in the whole dataset.
</div>

### Question 1.1
Using `clean_salary`, create a smaller subset of the data named `scatter_phd` that contains only individuals with a PhD.

In [None]:
# Create a smaller subset of data; only individuals with a PhD
scatter_phd = clean_salary.where("Education Level", "PhD")
scatter_phd

In [None]:
grader.check("q1_1")

<!-- BEGIN QUESTION -->

### Question 1.2
Using the `scatter_phd` table, produce a scatter plot that plots `"Years of Experience"` on the x-axis and `"Salary"` on the y-axis. Your code should be very similar to the previous scatter plot.

In [None]:
# Replace the ... with the necessary code to plot the scatter plot

<!-- END QUESTION -->

That looks a little better! There is still a cluster of data points in the bottom left corner, but a relationship can be seen between the two variables.

<div class="alert alert-warning">
Analyze your scatter plot above and see if you notice anything interesting. One question to consider is why there are distinct vertical lines of data points, and how this makes sense based on how the <code>"Years of Experience"</code> feature is represented.
</div>

<!-- BEGIN QUESTION -->

### Question 1.3 (Discussion)
What relationship between years of experience and salary (for PhD holders specifically, in this case) does the above scatter plot reveal? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Optional argument: `group`

The `scatter` method also allows you to specify a group for each data point using the `group` keyword argument.

Say we wanted to investigate the relationship between an individual's **years of experience** and their **salary** with respect to their reported **gender**.

In [None]:
scatter_phd.scatter("Years of Experience", "Salary", group = "Gender")

By utilizing the `group` argument, we see our scatter plot stratified into the different categories our data has for gender. This gives us a better insight into the trends of the relationship between years of experience and salary for each gender, rather than simply looking at all gender categories together.
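
Conceptually, grouping splits the points into one collection per category before drawing them. The plain-Python sketch below illustrates that idea with invented rows and placeholder labels `"A"` and `"B"`; it is an assumption-laden illustration, not how `scatter` is implemented.

```python
from collections import defaultdict

# Hypothetical rows: (years of experience, salary, category label)
rows = [
    (3, 60000, "A"),
    (8, 95000, "B"),
    (5, 72000, "A"),
    (12, 130000, "B"),
]

# Collect the (x, y) points for each category separately
points_by_category = defaultdict(list)
for years, salary, label in rows:
    points_by_category[label].append((years, salary))

# Each category would be drawn in its own color on the shared axes
for label, points in sorted(points_by_category.items()):
    print(label, points)
```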

<!-- BEGIN QUESTION -->

### Question 1.4 (Discussion)
Are there any patterns you can notice from the scatter plot? Gender biases, when one gender is given preferential treatment (promotions, higher salaries, less work, etc.) over another or when there is a prejudice against one gender, can be prevalent within the workplace. Does this scatter plot show any gender biases? What might this look like in a real-world setting?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Scatter plots are useful for visualizing two numerical variables together. If we want to plot two numerical variables and one of them corresponds to time, we can use a line plot instead.

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Part 2: Visualizing with `plot`

---
### **The [plot](http://data8.org/datascience/_autosummary/datascience.tables.Table.plot.html#datascience.tables.Table.plot) method**

Similar to `scatter`, we give `plot` the names of two numerical columns and it creates a **line plot** for us. If we want to draw multiple line plots on the same set of axes, we give it a table with multiple numerical columns, and tell it which one contains the values for the x-axis.

The `plot` method allows us to see how non-time variables change over time. Let's use `plot` to look at the age patterns over the course of years of experience. First, we will look at a single line plot using `plot`:

In [None]:
# Just run this cell -- don't worry about the `group` or `drop` methods
experience_age = clean_salary.group("Years of Experience", np.mean).drop("Gender mean", "Job Title mean", "Education Level mean", "Salary mean")
experience_age
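
For the curious, the idea behind `group("Years of Experience", np.mean)` can be sketched in plain Python: gather the values that share a years-of-experience value, then average each bucket. The rows below are invented for illustration, and this is a conceptual sketch rather than the library's actual implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (years of experience, age) rows, invented for illustration
rows = [(1, 24), (1, 26), (3, 28), (3, 30), (3, 32)]

# Bucket the ages by years of experience
ages_by_experience = defaultdict(list)
for years, age in rows:
    ages_by_experience[years].append(age)

# Average each bucket: one output row per distinct years-of-experience value
age_mean = {years: mean(ages) for years, ages in sorted(ages_by_experience.items())}
print(age_mean)
```

The result has one entry per distinct x-value, which is the shape a line plot needs.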

<!-- BEGIN QUESTION -->

### Question 2.1
Using the `experience_age` table and the `plot` method, produce a *line plot* that plots the average age over years of experience.

*Hint*: You'll want to plot the years of experience on the x-axis and average age on the y-axis.

In [None]:
# Replace the ... with the necessary code to produce the line plot

<!-- END QUESTION -->

### Identifying Temporal Patterns

Line plots are effective tools for identifying temporal patterns (i.e. changes over time). Let's use the `plot` method to uncover how average salary changes with years of experience within each education level. Run the following cell to create one table per education level, each containing the average salary for every value of years of experience. The subsequent cells will create their respective plots. Analyze the graphs and answer the question that follows.

In [None]:
# Create tables for each education level
hs_salary_avg = clean_salary.where("Education Level", are.equal_to("High School")).group("Years of Experience", np.mean).drop("Gender mean", "Job Title mean", "Education Level mean", "Age mean")
bachelor_salary_avg = clean_salary.where("Education Level", are.equal_to("Bachelor's Degree")).group("Years of Experience", np.mean).drop("Gender mean", "Job Title mean", "Education Level mean", "Age mean")
master_salary_avg = clean_salary.where("Education Level", are.equal_to("Master's Degree")).group("Years of Experience", np.mean).drop("Gender mean", "Job Title mean", "Education Level mean", "Age mean")
phd_salary_avg = clean_salary.where("Education Level", are.equal_to("PhD")).group("Years of Experience", np.mean).drop("Gender mean", "Job Title mean", "Education Level mean", "Age mean")

In [None]:
# Run this cell to produce a line plot for the high school education salary average
hs_salary_avg.plot("Years of Experience", "Salary mean")

In [None]:
# Run this cell to produce a line plot for the bachelor's degree salary average
bachelor_salary_avg.plot("Years of Experience", "Salary mean")

In [None]:
# Run this cell to produce a line plot for the master's degree salary average
master_salary_avg.plot("Years of Experience", "Salary mean")

In [None]:
# Run this cell to produce a line plot for the PhD salary average
phd_salary_avg.plot("Years of Experience", "Salary mean")

<!-- BEGIN QUESTION -->

### Question 2.5 (Discussion)
What patterns do you notice when comparing these line plots? Do any of them stand out to you? Do the results you are seeing make sense with respect to your knowledge of education levels? Be sure to pay close attention to the scales of the axes for each plot!

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Multiple Variables
If we want to visualize multiple variables on one plot, we can include them all in the table we call `plot` on.

In [None]:
experience_age_salary = clean_salary.group("Years of Experience", np.mean).drop("Gender mean", "Education Level mean", "Job Title mean")
experience_age_salary

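Since we are trying to compare `"Salary mean"` and `"Age mean"` and their units are different, we have to manipulate the data before plotting. Salaries are in the tens of thousands while ages are in the tens, so on a shared y-axis the age line would look nearly flat. To address this, let's divide the `"Salary mean"` column by 1000 so that it is expressed in thousands and sits on a scale closer to age. The cell below does this data manipulation for you.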

In [None]:
experience_age_salary = experience_age_salary.with_column('Salary mean', experience_age_salary.column('Salary mean') / 1000)
experience_age_salary
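
To see why the rescaling helps, compare the range of each variable before and after dividing by 1000. The numbers below are invented for illustration and are not taken from `experience_age_salary`; the sketch uses plain Python lists rather than table columns.

```python
# Hypothetical averages, invented for illustration
age_mean = [25, 30, 35]
salary_mean = [50000, 80000, 110000]

# Express salary in thousands so both variables are on a similar scale
salary_mean_thousands = [salary / 1000 for salary in salary_mean]

print("Age range:", min(age_mean), "to", max(age_mean))
print("Salary range:", min(salary_mean), "to", max(salary_mean))
print("Salary range (thousands):", min(salary_mean_thousands), "to", max(salary_mean_thousands))
```

After rescaling, both ranges fall within the same order of magnitude, so neither line is squashed when they share one set of axes.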

<!-- BEGIN QUESTION -->

### Question 2.6
Using the `experience_age_salary` table, produce a line plot with *one line per variable* other than `"Years of Experience"`. That is, `"Years of Experience"` should be plotted on the x-axis.

In [None]:
# Replace the ... with the necessary code to produce the line plot

<!-- END QUESTION -->

---
## Done! 😇

---

## Pets of Data 6
Make sure to be well rested!

<img src="https://github.com/data-6-berkeley/materials-su24/blob/main/lab/lab03/paulina.JPG?raw=true" width="50%" alt="Cute dog"/>

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)