In [None]:
%pip install pandas seaborn pingouin

In [2]:
import pandas as pd
import pingouin as pg
import seaborn as sns

## Using Data Science Notebooks to Report Analysis Results




Data science notebooks, like Jupyter Notebooks, have changed how researchers do and share their work. These notebooks let you mix code, data, and text in one place. This helps researchers explain their methods and show their results clearly. You can include code, text, equations, charts, and even videos. This makes it easier to share complex ideas with others, like peers, reviewers, and the public.

One big benefit of data science notebooks is that they help make research reproducible. Reproducibility means that others can repeat your work and get the same results. Notebooks save the whole process of your analysis, from data cleaning to final results. By sharing the notebook, you give others everything they need to repeat your study, including the exact code and outputs. This builds trust in your findings and helps others build on your work.

In this notebook, we'll try out the following Python packages, doing a few analyses and showing their results right next to the code:
  - `pandas`: Makes it simple to reference variables in a study and show a table of the data,
  - `seaborn`: Makes it simple to make plots from pandas tables,
  - `pingouin`: Makes nice statistical tables from pandas tables.

### Our Dataset: The Passengers on the Titanic

Below, we load the data. Every row is a passenger, and every column is a variable describing that passenger. Please run the code and take a look at the dataset. We'll use it in the next two sections.

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/titanic.csv')
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
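
A quick way to check the "one row per passenger, one column per variable" layout is to look at the table's shape and count missing values per column. The sketch below is only an illustration: it rebuilds the first five rows shown above with a subset of the columns, and the names `csv_text`, `sample`, and `missing_per_column` are made up for this example.

```python
import io

import pandas as pd

# First five rows from the output above, with a subset of the columns.
csv_text = """survived,pclass,sex,age,fare,class,deck,alone
0,3,male,22.0,7.2500,Third,,False
1,1,female,38.0,71.2833,First,C,False
1,3,female,26.0,7.9250,Third,,True
1,1,female,35.0,53.1000,First,C,False
0,3,male,35.0,8.0500,Third,,True
"""

sample = pd.read_csv(io.StringIO(csv_text))

# (number of passengers, number of variables)
n_rows, n_cols = sample.shape

# Empty cells are read as NaN; count them column by column.
missing_per_column = sample.isna().sum()
print(n_rows, n_cols)
print(missing_per_column)
```

The same two calls, `df.shape` and `df.isna().sum()`, should work on the full `df` loaded above.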


In [22]:
# Among male passengers, count how many are flagged as adult males
df.loc[df['sex'] == 'male', 'adult_male'].value_counts()

adult_male
True     537
False     40
Name: count, dtype: int64

In [None]:
# Pearson correlation of survival (0/1) with other variables

dv = 'survived'
predictors = ['pclass', 'alone', 'age', 'fare', 'adult_male']  # avoids shadowing the built-in vars()
df[[dv] + predictors].corr()[dv][predictors].sort_values(ascending=False)

fare          0.257307
age          -0.077221
alone        -0.203367
pclass       -0.338481
adult_male   -0.557080
Name: survived, dtype: float64
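
When one of the two variables is a True/False column such as `adult_male`, the Pearson correlation is tied directly to the difference in group means (this is often called the point-biserial correlation). The sketch below uses invented toy values, not the Titanic data, and the variable names are hypothetical.

```python
import math

import pandas as pd

# Toy data, invented for illustration only (not the Titanic dataset).
toy = pd.DataFrame({
    "survived":   [1, 0, 0, 1, 1, 0, 1, 0],
    "adult_male": [False, True, True, False, False, True, True, False],
})

# Pearson correlation after coding False/True as 0/1.
is_male = toy["adult_male"].astype(int)
r_pearson = toy["survived"].corr(is_male)

# The same number from the two group means.
m1 = toy.loc[toy["adult_male"], "survived"].mean()    # survival rate, adult males
m0 = toy.loc[~toy["adult_male"], "survived"].mean()   # survival rate, everyone else
p = is_male.mean()                                     # share of adult males
sd = toy["survived"].std(ddof=0)                       # population standard deviation
r_point_biserial = (m1 - m0) * math.sqrt(p * (1 - p)) / sd

print(r_pearson, r_point_biserial)
```

A negative value therefore means the True group has the lower mean, which is how the negative `adult_male` correlation above can be read.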

In [16]:
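# Caution: pd.factorize numbers categories by order of first appearance, so r for multi-level columns (e.g. pclass) is hard to interpret.
codes = df.apply(lambda col: pd.factorize(col)[0])
r = codes.corr(method='pearson', min_periods=1)['survived'][['pclass', 'alone', 'embark_town', 'sex']]
r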

pclass         0.247845
alone         -0.203367
embark_town    0.101849
sex            0.543351
Name: survived, dtype: float64
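
Note that `pclass` correlates at -0.338 in the earlier cell but at 0.248 here. A likely reason is that `pd.factorize` assigns integer codes in order of first appearance rather than in sorted order. The sketch below shows this on a short made-up series (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Made-up class labels; 3 appears first, then 1, then 2.
pclass = pd.Series([3, 1, 3, 1, 2, 3])

# Default: codes follow order of first appearance (3 -> 0, 1 -> 1, 2 -> 2).
codes, uniques = pd.factorize(pclass)

# With sort=True the codes follow the sorted values (1 -> 0, 2 -> 1, 3 -> 2).
sorted_codes, sorted_uniques = pd.factorize(pclass, sort=True)

print(list(codes), list(uniques))
print(list(sorted_codes), list(sorted_uniques))
```

Because the default codes scramble the natural order of an ordinal variable, correlations computed on them are best treated as a rough screen only.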

## Reporting Statistics using the Pingouin Package and Visualizing Data with the Seaborn Package


|  Code  | Description |
| :-- | :-- |
| **`import pingouin as pg`**  | Imports the (already-installed) `pingouin` package. Its functions are called as `pg.<function_name>` |
| **`pg.anova(data=df, dv='measurement_variable', between='group_variable', detailed=True)`** | Run a one-way ANOVA comparing the measurement variable across N groups |
| **`pg.pairwise_tukey(data=df, dv='measurement_variable', between='group_variable')`** | Run pairwise Tukey HSD post-hoc tests on all pairs of levels of the grouping variable |
| **`import seaborn as sns`** | Imports the (already-installed) `seaborn` package. Its functions are called as `sns.<function_name>` |
| **`sns.barplot(data=df, x='Group Variable', y='Measurement Variable', hue='An Extra Grouping Variable')`** | Make a bar plot of the data. |


**Exercises**

**Example: Was there a significant difference in mean passenger age between the passenger classes?**

Run an ANOVA to check whether a difference exists anywhere among the groups:

In [None]:
pg.anova(data=df, dv='age', between='class')
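
To make the F statistic in an ANOVA table less of a black box, here is a hand-computed one-way ANOVA on invented toy data. It is a sketch of the textbook formula, not pingouin's implementation, and the column names `group` and `value` are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "value": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

groups = toy.groupby("group")["value"]
grand_mean = toy["value"].mean()

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()

# Within-group sum of squares: how far each value is from its own group mean.
ss_within = ((toy["value"] - groups.transform("mean")) ** 2).sum()

df_between = groups.ngroups - 1
df_within = len(toy) - groups.ngroups

# F is the ratio of the two mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

A large F means the group means differ by more than the spread inside the groups would suggest.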

Post-hoc Tukey HSD tests to compare all pairs of groups (i.e. if I compare one class against another, will I see a significant difference?)

In [None]:
pg.pairwise_tukey(data=df, dv='age', between='class')
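
A pairwise table has one row per pair of groups, so three classes give three comparisons. The sketch below only lists the pairs and their raw mean differences on invented toy data. It does not compute the Tukey-adjusted p-values, and the names used are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "value": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

means = toy.groupby("group")["value"].mean()

# Every unordered pair of group labels.
pairs = list(combinations(means.index, 2))

# Raw difference in means for each pair (first minus second).
diffs = {(a, b): means[a] - means[b] for a, b in pairs}

for pair, diff in diffs.items():
    print(pair, diff)
```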

Make a bar plot to show the mean value of each group:

In [None]:
class_order = ['Third', 'Second', 'First']
sns.barplot(data=df, x='class', y='age', order=class_order);
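
By default, the height of each bar in `sns.barplot` is the mean of `y` within that group. The same numbers can be checked with a pandas `groupby`, sketched here on invented toy values (not the Titanic data) with one missing age to show that missing values are skipped.

```python
import pandas as pd

# Toy data, invented for illustration only; one age is missing on purpose.
toy = pd.DataFrame({
    "class": ["Third", "First", "Second", "Third", "First", "Second"],
    "age":   [20.0, 40.0, 30.0, None, 44.0, 28.0],
})

class_order = ["Third", "Second", "First"]

# Mean age per class (NaN is ignored), arranged in the plotting order.
bar_heights = toy.groupby("class")["age"].mean().reindex(class_order)
print(bar_heights)
```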

**Was there a significant difference in mean fare between the passenger classes?**

Run an ANOVA to check whether a difference exists anywhere among the groups:

Post-hoc Tukey HSD tests to compare all pairs of groups (i.e. if I compare one class against another, will I see a significant difference?)

Make a bar plot to show the mean value of each group:

**Was there a significant difference in survival rate between the passenger classes?**

Run an ANOVA to check whether a difference exists anywhere among the groups:

Post-hoc Tukey HSD tests to compare all pairs of groups (i.e. if I compare one class against another, will I see a significant difference?)

Make a bar plot to show the mean value of each group:

Extra bar plot, just for fun: What was the survival rate, broken down by both sex and class? (hint: `hue=`)