Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Simpson's Paradox - Kidney stone treatment A vs. B

**Name:** Amy Yang

**Email address associated with your DataCamp account:** yangy.ustc@gmail.com


**Project description**: 

Simpson's paradox is an interesting phenomenon in statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined. In this project, we will examine a real-life example of Simpson's paradox from a medical study comparing the success rates of two treatments for kidney stones. We will identify the 'lurking' variable that causes the confusion and use multiple logistic regression to solve the puzzle!

You will apply the skills you learned in Multiple and Logistic Regression, along with content from Introduction to the Tidyverse and Data Manipulation in R with dplyr.

This project uses a simplified dataset from the original study published [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1339981/).

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. A new look at an old research study

In 1986, a group of urologists in London published a research paper in the British Medical Journal comparing the effectiveness of two methods for removing kidney stones. Treatment A is open surgery (invasive) and treatment B is percutaneous nephrolithotomy (less invasive). When they looked at the results for all 700 patients, treatment B had a higher success rate. However, when they looked separately at the subgroups of patients with small kidney stones and with large kidney stones, treatment A had the better outcome in both. What is going on here? In this notebook, we are going to solve the puzzle using multiple logistic regression and other statistical tools. Let's dive in!
<img src="img/img1.jpg" height="500" width="500">

In [10]:
# Load the tidyverse package
library(tidyverse)

# Read datasets/kidney_stone_data.csv into data
data <- read_csv("datasets/kidney_stone_data.csv")

# Take a look at the first six rows of the dataset
head(data)

Parsed with column specification:
cols(
  treatment = col_character(),
  stone_size = col_character(),
  success = col_double()
)


treatment,stone_size,success
B,large,1
A,large,1
A,large,0
A,large,1
A,large,1
B,large,1


## 2. Recreate the Treatment X Success summary table

The data contains three columns: treatment (A or B), stone_size (large or small) and success (0 = failure, 1 = success).
To start, we want to know which treatment had the higher success rate overall, regardless of stone size. Let's create a table with the number of patients and the relative frequency of each outcome for each treatment, using tidyverse syntax.

In [16]:
# Use %>% to link three steps together:
# 1. group by treatment and success
# 2. count the patients in each treatment x success combination with summarise()
# 3. convert the counts to frequencies within each treatment with mutate()
#    (summarise() drops the last grouping level, so sum(N) is taken per treatment)

data %>%
  group_by(treatment, success) %>%
  summarise(N = n()) %>%
  mutate(Freq = N / sum(N))

treatment,success,N,Freq
A,0,77,0.22
A,1,273,0.78
B,0,61,0.1742857
B,1,289,0.8257143
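
Because `success` is coded 0/1, its mean within a group is that group's success rate, which gives a quick cross-check of the `Freq` column. The sketch below is an illustration rather than part of the original analysis: it uses base R only and rebuilds patient-level rows from the counts in the table above instead of reading the CSV, and the names `counts`, `patients` and `success_rate` are made up for this example.

```r
# Counts copied from the treatment x success table above
counts <- data.frame(
  treatment = c("A", "A", "B", "B"),
  success   = c(0, 1, 0, 1),
  N         = c(77, 273, 61, 289)
)

# Expand to one row per patient (hypothetical stand-in for the CSV data)
patients <- counts[rep(seq_len(nrow(counts)), counts$N), c("treatment", "success")]

# The mean of a 0/1 column is the proportion of 1s, i.e. the success rate
success_rate <- tapply(patients$success, patients$treatment, mean)
print(round(success_rate, 4))
```

If the counts were copied correctly, the result should match the `Freq` values on the `success == 1` rows: 0.78 for treatment A and about 0.826 for treatment B.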


## 3. Bringing stone size into the picture 

From the treatment by success table, we saw that treatment B performed better overall than treatment A (83% vs 78% success rate). Now, let's bring stone size into the picture and see what we find. In this task, we are going to further divide the data into small and large stone subgroups and compute the same success counts and rates by treatment as in the previous task. The final table will be treatment X stone size X success.

In [17]:
# We can do this by simply adding stone_size to the group_by() variable list

data %>%
  group_by(treatment, stone_size, success) %>%
  summarise(N = n()) %>%
  mutate(Freq = N / sum(N))

treatment,stone_size,success,N,Freq
A,large,0,71,0.26996198
A,large,1,192,0.73003802
A,small,0,6,0.06896552
A,small,1,81,0.93103448
B,large,0,25,0.3125
B,large,1,55,0.6875
B,small,0,36,0.13333333
B,small,1,234,0.86666667
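
One way to see how this table and the overall table can both be true is to note that each treatment's overall success rate is a weighted average of its stone-size-specific rates, with weights equal to the share of that treatment's patients in each stone-size group. The sketch below is an illustration only (base R, with counts copied from the table above; the column names `patients`, `rate` and `share` are made up for this example) and recomputes the overall rates that way.

```r
# Counts copied from the treatment x stone size x success table above
counts <- data.frame(
  treatment  = rep(c("A", "B"), each = 2),
  stone_size = rep(c("large", "small"), times = 2),
  failures   = c(71, 6, 25, 36),
  successes  = c(192, 81, 55, 234)
)

# Patients and success rate within each treatment x stone size group
counts$patients <- counts$failures + counts$successes
counts$rate     <- counts$successes / counts$patients

# Share of each treatment's patients that falls in each stone-size group
counts$share <- counts$patients / ave(counts$patients, counts$treatment, FUN = sum)

# Overall rate per treatment = sum over stone sizes of share * rate
overall <- tapply(counts$share * counts$rate, counts$treatment, sum)

print(counts)
print(round(overall, 4))
```

In these counts, treatment A was used mostly on large stones (263 of 350 patients), which have lower success rates under both treatments, while treatment B was used mostly on small stones (270 of 350). That imbalance pulls A's overall rate below B's even though A does better within each stone size.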


*Stop here! Only the three first tasks. :)*