# lab_simpsons_paradox

Hi Data Scientists,

Welcome to the fourth lab!  In this lab, you will look for confounding variables and find Simpson's paradox in several datasets using Python. Along the way, you'll also get more practice writing conditionals for pandas dataframes. As you go through the lab:

- Make sure to run every cell with Python code and observe the output
- Complete all puzzles and submit this lab before Monday evening at 11:59pm

### Labs are Collaborative

In your lab section (right now!), make sure you are working with a small group! Find a person or two around you, or if you're on Zoom, join a breakout room with one or two people. Work on this lab together with your group, making sure all of you understand each component of the lab.

### Part 0: Your Group!

Edit the next Python cell to add information about who you're working with in your lab section:

In [None]:
# First, meet your CAs and TA if you haven't already!
# ...first name is enough, we'll know who they are! :)
ta_name = "Heman"
ca1_name = "Michelle"
ca2_name = "Tamun"


# Also, make sure to meet your team for this lab! Find out their name, what major they're in,
# and learn something new about them that you never knew before!
partner1_name = "Dev"
partner1_netid = "devhp2"
partner1_major = "Statistics"

partner2_name = "Collin"
partner2_netid = "collinmc3"
partner2_major = "Statistics"

partner3_name = "Sam"
partner3_netid = "sstef3"
partner3_major = "Statistics"



### Getting Help

Remember, there are a lot of ways to get help if you find yourself stuck:

1. In lab section, your TA, CAs, and your peers are here for you!

2. On the course Discord!

3. Office Hours:

  - Open office hours in person **every Monday, Tuesday, Thursday, and Friday** from 4:00pm - 6:00pm in 0060 Siebel Center for Design (SCD)

  - Zoom office hours **every Wednesday** from 5:00pm - 6:00pm and **every Thursday** from 9:00pm - 11:00pm

## Part 1: The Setup

Follow the usual steps for setting up labs. You will need to:

1. Make sure the dataset(s) you need are in your **working directory**.  
*We will need `DiscoveryPizza.csv` and `PythonPizza.csv` for the first part of the lab. We will need `hello.csv` for the second part of the lab!*
2. Import any libraries you will need and give them an abbreviation so they are easy to refer to.   
*We will just need `pandas`, typically abbreviated as `pd`.*
3. Read your datasets into dataframes. The dataframes should have useful names so you can easily identify which one is which.   
*We will use `df_pythonpizza`, `df_discoverypizza`, and `df_hello`.*

### Puzzle 1.1: Setup for the lab
Follow the steps above to get set up for today's lab.

In [2]:
# Import the pandas library as pd
import pandas as pd

# Read our two pizza datasets into dataframes
df_pythonpizza = pd.read_csv('PythonPizza.csv')
df_discoverypizza = pd.read_csv('DiscoveryPizza.csv')

In [3]:
## == TEST CASES for Puzzle 1.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(len(df_discoverypizza) == 40), "This is not the Discovery Pizza dataset you're looking for"
assert(len(df_pythonpizza) == 46), "This is not the Python Pizza dataset you're looking for"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### The Scenario

You and your friends are visiting a city and want to find the best pizza place to eat nearby. None of you knows much about the food scene, but your friends (and maybe you!) are particular about making the best of your time and trying the best food possible. What to do but turn to the answer to all of life's questions: *online reviews*.

You discover that there are only two pizza places in town and you have to decide which one is better! The two pizza places are called "Discovery Pizza" and "Python Pizza" (LOL!). The dataset `DiscoveryPizza.csv` contains data for all people who reviewed Discovery Pizza, specifically showing whether or not they recommend it and their gender. The dataset `PythonPizza.csv` contains the same information for Python Pizza.  

### Cleaning the Data

Sometimes, the data you get for analysis can be a bit messy. For example, run the code below to see how gender has been listed in our two pizza datasets. From `pandas`, we can first use `concat` to temporarily combine the `Gender` columns of our two datasets, then use the `unique()` function to see all the different ways people wrote their gender.

*(***Quick Tip***: Remember, if you want to figure out something to do with pandas, the pandas cheat sheet can help!)*

In [4]:
print(pd.unique(
    pd.concat([df_discoverypizza["Gender"],
               df_pythonpizza["Gender"]])))

['Male' 'Man' 'male' 'Men' 'Female' 'Woman' 'female' 'woman' 'girl' 'man']


As we can see, the reviewers all identified as either male or female, but wrote their gender in different ways. We will need to change all the different ways of identifying as male or female into two gender categories: `Male` and `Female`.

### The List Type

So far in Python, we have encountered **string**, **number**, and **DataFrame** types. In order to list our multiple ways of identifying as male or female, we can use a fourth data type: **Lists**. These are native to Python, so we do not need to import a library to use them.  
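
As a quick illustration, here is a minimal sketch of creating a list and checking what it contains. The list name `pizza_toppings` and its values are made up for this example and are not part of the lab's datasets.

```python
# A list is written with square brackets, with items separated by commas.
# `pizza_toppings` is a hypothetical example, not data from the lab.
pizza_toppings = ["cheese", "pepperoni", "mushroom"]

# len() gives the number of items in the list
num_toppings = len(pizza_toppings)

# The `in` keyword checks whether a value appears in the list
has_cheese = "cheese" in pizza_toppings
has_pineapple = "pineapple" in pizza_toppings

print(num_toppings, has_cheese, has_pineapple)
```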

### Puzzle 1.2: Lists of words for male and female

The list of words used for male has been put into a Python **List** below. Try the same with the words for female using the list printed in the previous code cell.

In [5]:
male = ["Male", "Man", "male", "Men", "man"]
female = ["Female", "Woman", "female", "girl", "woman"]

In [6]:
## == TEST CASES for Puzzle 1.2 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(set(female) == {'Female', 'Woman', 'female', 'woman', 'girl'}), "The list does not match the list of all words used for female"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Conditionals with Multiple Acceptable Values

Great! Now that we have our lists of words, we can use them in conditionals for our dataframes where multiple values are acceptable. We can use the `isin()` function and our list `male` to select all the Discovery Pizza data for males regardless of how the gender is written.

In [7]:
df_discoverypizza[df_discoverypizza["Gender"].isin(male)]

Unnamed: 0,Gender,Recommend
0,Male,Yes
1,Male,Yes
2,Male,Yes
3,Male,Yes
4,Male,Yes
5,Male,No
6,Man,No
7,Male,No
8,Male,No
9,male,No
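
The same lists can also be used to rewrite the labels themselves. The sketch below is one possible approach, not a required step in this lab: it builds a small made-up dataframe (`df_example`) and uses the pandas `replace()` method to collapse every spelling into `Male` or `Female` in a new column. The column name `Gender Clean` is an assumption chosen for illustration.

```python
import pandas as pd

male = ["Male", "Man", "male", "Men", "man"]
female = ["Female", "Woman", "female", "girl", "woman"]

# Hypothetical mini-dataset, not the lab's CSV files
df_example = pd.DataFrame({
    "Gender": ["Male", "man", "girl", "Woman", "Men"],
    "Recommend": ["Yes", "No", "Yes", "Yes", "No"],
})

# Replace every spelling in each list with a single standard label
df_example["Gender Clean"] = (
    df_example["Gender"].replace(male, "Male").replace(female, "Female")
)

print(df_example)
```

With a standardized column like this, a plain `==` comparison would be enough to select one gender, instead of `isin()` with a list.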


### Puzzle 1.3 

Let's try it! Show all the Python Pizza recommendation data from females.

In [8]:
df_pythonpizza[df_pythonpizza["Gender"].isin(female)]

Unnamed: 0,Gender,Recommend
36,Female,Yes
37,woman,Yes
38,Female,Yes
39,Female,Yes
40,female,Yes
41,Female,Yes
42,female,Yes
43,Female,Yes
44,Woman,Yes
45,Female,No


## Part 2: Initial Analysis

Now that we've made sense of our slightly messy data, we can use it to figure out where to go for dinner. "That's easy," your friend says, "We'll just see which restaurant has a greater percent of recommendations."

### Puzzle 2.1: Comparing Overall Percentages

Try out your friend's suggestion. Calculate the percent of people who recommend Discovery Pizza and the percent of people who recommend Python Pizza. Your answer should be a percentage between 0% and 100%.

**Hint:** There are many ways to do this. One way is to use `len(df)` to give you the length of a dataframe. You can also find the length of a dataframe after filtering it with a conditional.

In [11]:
discovery_recpercent = (len(df_discoverypizza[(df_discoverypizza.Recommend == "Yes")])/len(df_discoverypizza))*100
python_recpercent = (len(df_pythonpizza[(df_pythonpizza.Recommend == "Yes")])/len(df_pythonpizza))*100
print("discovery_recpercent =", discovery_recpercent,
      "\npython_recpercent =", python_recpercent)

discovery_recpercent = 62.5 
python_recpercent = 58.69565217391305
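
Since the hint says there are many ways to compute a percentage, here is a sketch of two of them side by side on a small made-up dataframe (`df_example` is hypothetical, not one of the lab's datasets). The second approach relies on the fact that pandas treats `True` as 1 and `False` as 0 when averaging a column of conditional results.

```python
import pandas as pd

# Hypothetical reviews: 5 "Yes" out of 8
df_example = pd.DataFrame({
    "Recommend": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
})

# Approach 1: count the rows that match, divide by the total number of rows
percent_with_len = len(df_example[df_example["Recommend"] == "Yes"]) / len(df_example) * 100

# Approach 2: take the mean of the True/False values directly
percent_with_mean = (df_example["Recommend"] == "Yes").mean() * 100

print(percent_with_len, percent_with_mean)
```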


In [12]:
## == TEST CASES for Puzzle 2.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(discovery_recpercent == 62.5), "The overall percentage of people who recommended Discovery Pizza does not appear to have been correctly calculated"
assert(python_recpercent == 100*(27/46)), "The overall percentage of people who recommended Python Pizza does not appear to have been correctly calculated"
## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 2.2: Reflection

❓ **Individual Reflection Question** ❓: Given what you've learned about experimental design, what are some reasons (specific to this dataset) you may not trust this result when making your decision? If you would trust it, explain why.

We do not know if the two groups are similar. It is possible that the people who rated Discovery Pizza are older and, for that reason, more inclined to recommend Discovery Pizza than not.

**Group Discussion**: Discuss the following in your groups!

Previously, we learned the importance of randomized controlled experiments. Discuss why randomized controlled experiments are ideal.  Our dataset, however, is from an observational study. What are some problems with observational studies?  Why do you think an observational study was done as opposed to a randomized controlled experiment in this case?  

Randomized controlled experiments are ideal because we can control confounding variables and make the two groups similar, but this is not possible with observational studies. It's possible multiple confounding variables are at work. In this case, an observational study was done because we want to see just the results and not make conclusions about what factors may have caused them to recommend a certain pizza place or not.

## Part 3:  Observational Studies and Stratification

Since this is an observational study, we have no guarantee that the two groups we are comparing are similar. But since we have information on the reviewers' genders, we can at least control for gender differences by comparing across people of the same gender.
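
To make the idea of comparing within each gender concrete, here is a minimal sketch on a made-up dataframe (`df_example` is hypothetical and its labels are already standardized, unlike the lab's data). It uses the pandas `groupby()` method, which is one possible way to stratify and not the only one.

```python
import pandas as pd

# Hypothetical reviews with gender already written consistently
df_example = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Female"],
    "Recommend": ["Yes", "No", "Yes", "Yes", "Yes", "No"],
})

# True where the reviewer recommends the restaurant
is_yes = df_example["Recommend"] == "Yes"

# Average the True/False values separately within each gender
percent_by_gender = is_yes.groupby(df_example["Gender"]).mean() * 100

print(percent_by_gender)
```

Each value in `percent_by_gender` is a percentage computed only from reviewers of that gender, so a comparison between restaurants made this way is not distorted by one restaurant having more reviewers of one gender.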

### Puzzle 3.1: Calculating separate percentages

Use conditionals to calculate the percentage of men who recommend Discovery Pizza, the percentage of men who recommend Python Pizza, the percentage of women who recommend Discovery Pizza, and the percentage of women who recommend Python Pizza. There are multiple ways to do this, so choose the way that makes the most sense to you! Your answers should be percentages between 0% and 100%. Make sure you include all people in the dataset in your calculations (think about all of the different ways to write gender that we looked at previously).

In [48]:
# Subset each dataset by gender, accepting every spelling in the `male` and `female` lists
discovery_m = df_discoverypizza[df_discoverypizza["Gender"].isin(male)]
python_m = df_pythonpizza[df_pythonpizza["Gender"].isin(male)]
discovery_f = df_discoverypizza[df_discoverypizza["Gender"].isin(female)]
python_f = df_pythonpizza[df_pythonpizza["Gender"].isin(female)]

# Percent recommending = (number of "Yes" rows / number of rows in the subset) * 100
discovery_mpercent = len(discovery_m[discovery_m.Recommend == "Yes"]) / len(discovery_m) * 100
python_mpercent = len(python_m[python_m.Recommend == "Yes"]) / len(python_m) * 100
discovery_fpercent = len(discovery_f[discovery_f.Recommend == "Yes"]) / len(discovery_f) * 100
python_fpercent = len(python_f[python_f.Recommend == "Yes"]) / len(python_f) * 100

print("discovery_mpercent =", discovery_mpercent,
      "\npython_mpercent =", python_mpercent,
      "\ndiscovery_fpercent =", discovery_fpercent,
      "\npython_fpercent =", python_fpercent)

discovery_mpercent = 33.33333333333333 
python_mpercent = 50.0 
discovery_fpercent = 80.0 
python_fpercent = 90.0


In [22]:
## == TEST CASES for Puzzle 3.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(discovery_mpercent == 100*(1/3)), "The percentage of men who recommended Discovery Pizza does not appear to have been correctly calculated"
assert(python_mpercent == 100*(1/2)), "The percentage of men who recommended Python Pizza does not appear to have been correctly calculated"
assert(discovery_fpercent == 800/10), "The percentage of women who recommended Discovery Pizza does not appear to have been correctly calculated"
assert(python_fpercent == 100*(90/100)), "The percentage of women who recommended Python Pizza does not appear to have been correctly calculated"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Observe the Results

Run the following cell to format all of your answers as a DataFrame:

In [23]:
pd.DataFrame( [
    {'Female': discovery_fpercent, 'Male': discovery_mpercent, 'Overall': discovery_recpercent },
    {'Female': python_fpercent, 'Male': python_mpercent, 'Overall': python_recpercent },
], index=['Discovery Pizza', 'Python Pizza'] )

Unnamed: 0,Female,Male,Overall
Discovery Pizza,80.0,33.333333,62.5
Python Pizza,90.0,50.0,58.695652


You should see the pattern reverse when you look at the overall recommendation percentage vs. the percentage stratified by gender. This is called **Simpson's Paradox**: a pattern within a population can appear, disappear, or reverse when you look at subpopulations.
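
The arithmetic behind the reversal can be sketched with plain Python. The counts below are inferred from the percentages and dataset sizes reported in this lab (40 and 46 reviewers); they are an assumption for illustration and are not read from the CSV files.

```python
# (recommendations, reviewers) per group. These counts are inferred from the
# percentages and dataset sizes reported in this lab, not read from the CSV files.
counts = {
    "Discovery Pizza": {"Male": (5, 15), "Female": (20, 25)},
    "Python Pizza": {"Male": (18, 36), "Female": (9, 10)},
}

overall = {}
for place, groups in counts.items():
    # Add up recommendations and reviewers across both genders
    total_yes = sum(yes for yes, n in groups.values())
    total_n = sum(n for yes, n in groups.values())
    overall[place] = total_yes / total_n * 100

    for gender, (yes, n) in groups.items():
        print(place, gender, round(yes / n * 100, 1), "% of", n, "reviewers")
    print(place, "overall", round(overall[place], 1), "%")
```

Under these counts, most Discovery Pizza reviewers are female (25 of 40) while most Python Pizza reviewers are male (36 of 46). Because women recommend both restaurants at higher rates than men, Discovery Pizza's overall percentage is pulled up and Python Pizza's is pulled down, even though Python Pizza has the higher percentage within each gender.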

### Puzzle 3.2: Reflection

❓ **Individual Reflection Question** ❓: Should you enjoy your dinner at Discovery Pizza OR Python Pizza? In other words, which comparison of the reviews should you trust and why?

Trust the male and female subpopulation percentages over the overall percentages because they are stratified by gender, which could be a confounding variable. The overall percentages are probably affected by that confounding variable. As a result, I would go with Python Pizza.

**Group Discussion**: Discuss the following in your groups!

We happened to have gender data alongside our recommendation data, but if you could decide what data to gather, or you had data on more characteristics, how might you decide what to stratify by? Are there any possible issues that may be caused by stratification? Why is stratification a good technique to control for confounding variables?

Stratification allows us to control confounding variables. We would stratify by potential confounders or outside factors that could most influence the final data which in this case is % recommended. Another confounder could also be age as it's possible younger people might naturally prefer a specific pizza place. 

## Part 4: Shoe Size vs Shoes Owned

Let's look at some characteristics of another dataset. You recently had the opportunity to complete a "Hello Survey."  We compiled the results into a dataset and we are going to explore the following two questions from that survey: 

- "What is your shoe size?" The answers are stored in a variable called "Shoe Size."
- "How many pairs of shoes do you have?" The answers are stored in a variable called "Shoe Number."

### Puzzle 4.1: Load the Dataset

Read the hello dataset into a DataFrame. It is in the lab_simpsons_paradox folder and is called `hello.csv`.

In [24]:
df_hello = pd.read_csv("Hello.csv")
df_hello

Unnamed: 0,Name,Major,Year,Phone,Computer,Holes In Straw,Dogs,Cats,Fish,Chickens,...,Twitch,Have you ever programmed before?,What background in Python programming do you have?,Statistics Courses,Study Hours Per Week,Siblings,Sleep Hours,Shoe Number,How many different people (including each person in group chats!) did you text yesterday?,"Introvert, extrovert, ambivert?"
0,Devang Ghela,Information Sciences,Junior,iPhone,Windows-based computer,1.0,0,0,0,0,...,1,Yes,Some Python -- I have had one class or written...,2.0,2.0,1,7.0,8.0,33.0,Introvert
1,Michelle,information sciences,Senior,iPhone,Mac OS X-based computer,1.0,1,1,0,0,...,0,Yes,Some Python -- I have had one class or written...,1.0,2.0,5,7.0,20.0,20.0,Ambivert
2,Ethan,Data science,Freshman,iPhone,Mac OS X-based computer,1.0,0,0,0,0,...,1,Yes,Some Python -- I have had one class or written...,2.0,10.0,0,8.0,7.0,10.0,Introvert
3,Krish,Stats + CS,Freshman,iPhone,Mac OS X-based computer,1.0,0,0,0,0,...,0,Yes,Some Python -- I have had one class or written...,1.0,4.0,0,6.0,6.0,10.0,Introvert
4,Kelly,Information Science,Junior,Android,Mac OS X-based computer,1.0,0,1,0,0,...,0,Yes,Some Python -- I have had one class or written...,1.0,1.0,1,6.0,3.0,6.0,Introvert
5,Carson,Biological Anthropology,Junior,iPhone,Windows-based computer,2.0,0,0,0,0,...,0,No,No programming background -- is Python Taylor ...,1.0,4.0,0,7.0,2.0,12.0,Extrovert
6,Aayush,Computer Science and Economics,Freshman,iPhone,Mac OS X-based computer,1.0,0,0,0,0,...,0,Yes,Some Python -- I have had one class or written...,1.0,2.0,1,7.0,2.0,20.0,Ambivert
7,Shubh,Cs+Stats,Freshman,iPhone,Windows-based computer,1.0,0,0,0,0,...,0,Yes,Some Python -- I have had one class or written...,1.0,2.0,1,8.0,2.0,10.0,Introvert
8,Kaylee Moore,Information Science,Sophomore,iPhone,Mac OS X-based computer,2.0,0,1,0,0,...,0,No,No programming background -- is Python Taylor ...,1.0,3.0,2,6.0,15.0,4.0,Ambivert
9,Andrew,Statistics & Computer Science,Freshman,iPhone,Mac OS X-based computer,2.0,1,0,0,0,...,0,Yes,Some Python -- I have had one class or written...,1.0,3.0,1,8.0,4.0,5.0,Introvert


In [35]:
## == TEST CASES for Puzzle 4.1 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(len(df_hello) == 267), "This is not the dataset you're looking for"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 4.2: Observation Subsets

In this situation, let's define shoe sizes that are strictly greater than size 9 to be "large" and shoe sizes that are less than or equal to size 9 to be "small".
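
Here is a minimal sketch of splitting rows with `>` and `<=`, using a made-up dataframe (`df_example` and its values are hypothetical, not the survey data). One detail worth knowing: a missing value (`NaN`) fails both comparisons, so a row with a missing shoe size would land in neither subset.

```python
import pandas as pd

# Hypothetical shoe sizes; None becomes a missing value (NaN)
df_example = pd.DataFrame({"Shoe Size": [7.5, 9.0, 9.5, 11.0, None]})

# Strictly greater than 9 is "large"; less than or equal to 9 is "small"
df_example_large = df_example[df_example["Shoe Size"] > 9.0]
df_example_small = df_example[df_example["Shoe Size"] <= 9.0]

# The row with the missing shoe size is in neither subset
print(len(df_example_large), len(df_example_small), len(df_example))
```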

Create two DataFrames that contain subsets of our data: one DataFrame that includes everyone who has a "large" shoe size and one DataFrame that includes everyone who has a "small" shoe size:

In [38]:
df_large = df_hello[df_hello["Shoe Size"] > 9.0]
df_small = df_hello[df_hello["Shoe Size"] <= 9.0]


In [39]:
## == TEST CASES for Puzzle 4.2 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(len(df_large) == 141), "This is not the dataframe subset you're looking for."
assert(len(df_small) == 126), "This is not the dataframe subset you're looking for."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 4.3: Average Number of Pairs of Shoes By Group 

Find the average number of pairs of shoes of each group (large shoes and small shoes):

In [46]:
large_avg_shoes = df_large["Shoe Number"].mean()
small_avg_shoes = df_small["Shoe Number"].mean()

print("large_avg_shoes =", large_avg_shoes,
      "\nsmall_avg_shoes =", small_avg_shoes)

large_avg_shoes = 5.928571428571429 
small_avg_shoes = 7.784
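
One behavior of `.mean()` that matters here: by default, pandas skips missing values instead of counting them as zero. The sketch below shows this on a made-up Series (`shoe_counts` is hypothetical, not the survey data).

```python
import pandas as pd

# Hypothetical numbers of shoe pairs, with one missing answer
shoe_counts = pd.Series([4, 6, None, 10])

num_rows = len(shoe_counts)        # counts every row, including the missing one
num_answers = shoe_counts.count()  # counts only the non-missing values
average = shoe_counts.mean()       # averages only the non-missing values

print(num_rows, num_answers, average)
```

The test cases for Puzzle 4.3 divide by 140 and 125, while the subsets in Puzzle 4.2 have 141 and 126 rows. That is consistent with one missing `Shoe Number` value in each group, although the lab does not state this directly.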


In [47]:
## == TEST CASES for Puzzle 4.3 ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.

assert(large_avg_shoes == 830/140),  "The average number of shoe pairs among large shoe wearers does not appear correct. Make sure you are taking the mean of the correct variable."
assert(small_avg_shoes == 973 / 125), "The average number of shoe pairs among small shoe wearers does not appear correct. Make sure you are taking the mean of the correct variable."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 4.4: Reflection

❓ **Individual Reflection Question** ❓: What is the relationship between shoe size and shoe number?  Can you think of a possible confounding variable in the observed relationship (or lack thereof) between shoe size and shoe number?

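On average, people with large shoe sizes own fewer pairs of shoes (about 5.9) than people with small shoe sizes (about 7.8), so the observed relationship is negative. Gender is a possible confounding variable, since it could be related to both shoe size and the number of pairs a person owns. Because of this, we cannot say with certainty that shoe size itself affects shoe number.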

## Submit Your Work!

You're almost done -- congratulations!

You need to do two more things:

1.  Save your work. To do this, go to File -> Save All

2.  After you have saved, exit this notebook and follow the webpage instructions to commit this lab to your Git repository!