# Worksheet 26: Python for Data Science Part 1

We can use Python for data wrangling and visualizations.

In this notebook, we will:
- Introduce how to perform data wrangling in Python using `pandas` (equivalent to `dplyr`).
- Introduce how to create visualizations in Python using `seaborn` (equivalent to `ggplot2`).
- Compare Python syntax with R's.

### 1. Set up

By default, a notebook cell displays only the value of its last expression. To display the output of every expression in a cell, let's run the following:

In [2]:
# Run this first
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

We introduced a bunch of packages in the last worksheet:

In [3]:
import pandas as pd # data manipulation
import seaborn as sns # statistical visualizations (also ships example datasets)
import matplotlib.pyplot as plt # more visualization options

Let's manipulate a dataset we are already familiar with:

In [4]:
# Load dataset from library
titanic = sns.load_dataset('titanic')

# Take a look
titanic.head() 

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


*Note: calling a method with `.` works a little bit like the pipe in `R`: the object on the left is passed to the method on the right!*

### 2. Data wrangling with pandas

The `pandas` package contains functions for data wrangling that are similar to `dplyr` in R.

| dplyr     | pandas      |
|-----------|-------------|
| filter    | query       |
| select    | filter      |
| group_by  | groupby     |
| summarize | agg         |
| mutate    | assign      |
| arrange   | sort_values |

Here are some examples of using `pandas` functions:

In [5]:
# Filter passengers who survived
survived = titanic.query('survived == 1')
survived.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
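
The `query` call above uses a single condition. As a minimal sketch, assuming a small made-up data frame in place of the real `titanic` data, conditions can be combined with `and`/`or`, and an ordinary Python variable (here the hypothetical `min_age`) can be referenced with `@`:

```python
import pandas as pd

# Tiny made-up stand-in for the titanic data (values are illustrative only)
passengers = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "sex": ["male", "female", "female", "male", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 4.0],
})

min_age = 18  # ordinary Python variable, referenced inside query() with @

# Combine conditions with `and` / `or`; quote string values inside the expression
adult_female_survivors = passengers.query(
    "survived == 1 and sex == 'female' and age >= @min_age"
)

# The same filter written with boolean indexing
same_rows = passengers[
    (passengers["survived"] == 1)
    & (passengers["sex"] == "female")
    & (passengers["age"] >= min_age)
]
```

Both forms keep the same rows; `query` tends to read closer to `dplyr::filter`, while boolean indexing is the form you will see most often in other people's pandas code.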


In [6]:
# Select specific columns
selected_columns = titanic.filter(items=['who', 'age', 'class'])
selected_columns.head()

Unnamed: 0,who,age,class
0,man,22.0,Third
1,woman,38.0,First
2,woman,26.0,Third
3,woman,35.0,First
4,man,35.0,Third
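
There are a few other common ways to pick columns. The sketch below uses a small made-up data frame (not the real `titanic` data) to compare them with `.filter(items=...)`:

```python
import pandas as pd

# Made-up rows that reuse a few titanic column names
passengers = pd.DataFrame({
    "who": ["man", "woman", "child"],
    "age": [22.0, 38.0, 4.0],
    "class": ["Third", "First", "Second"],
    "embarked": ["S", "C", "S"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton"],
})

# Same result as .filter(items=[...]): index with a list of column names
by_list = passengers[["who", "age", "class"]]

# .filter(like=...) keeps columns whose name contains a substring,
# similar in spirit to select(contains("embark")) in dplyr
embark_cols = passengers.filter(like="embark")

# .drop(columns=...) removes columns, like select(-age) in dplyr
without_age = passengers.drop(columns=["age"])
```

One difference worth knowing: indexing with a list raises a `KeyError` if a name is misspelled, whereas `.filter(items=...)` silently skips names that do not exist.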


In [7]:
# Find average age of passengers by who (child, man, woman)
titanic.groupby('who', observed=True).agg({'age': 'mean'})

who,age
child,6.369518
man,33.173123
woman,32.0


In [8]:
# Add other stats and format
titanic.groupby('who', observed=True).agg(
    age_mean = ('age', 'mean'), 
    count = ('sex', 'size')
)

who,age_mean,count
child,6.369518,83
man,33.173123,537
woman,32.0,271
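
The aggregation above uses `'size'` to count passengers. A related aggregation, `'count'`, behaves differently when values are missing. The sketch below uses a made-up data frame with some missing ages to show the difference; the names `n_rows` and `n_ages` are only illustrative:

```python
import numpy as np
import pandas as pd

# Made-up data with two missing ages
passengers = pd.DataFrame({
    "who": ["man", "man", "woman", "woman", "child"],
    "age": [22.0, np.nan, 38.0, 26.0, np.nan],
})

summary = passengers.groupby("who").agg(
    age_mean=("age", "mean"),   # NaN ages are skipped when averaging
    n_rows=("age", "size"),     # all rows in the group, missing or not
    n_ages=("age", "count"),    # only rows with a non-missing age
)
```

Here the `man` group has two rows but only one known age, so `n_rows` is 2 and `n_ages` is 1. Reporting both next to a mean makes it clear how many values the mean is actually based on.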


In [9]:
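# Create a variable of age group: child (under 18) or adult
# Note: passengers with a missing age are also labelled 'adult', because NaN < 18 is False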
titanic = titanic.assign(age_group=titanic['age'].apply(lambda x: 'child' if x < 18 else 'adult'))
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age_group
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,adult
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,adult
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,adult
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,adult
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,adult
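
One caveat with the `age_group` rule: a comparison such as `NaN < 18` evaluates to `False`, so passengers with a missing age fall into the `'adult'` branch. A possible alternative, sketched on made-up ages, is to build the labels with `numpy.where` and then blank out the rows where age is unknown:

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
passengers = pd.DataFrame({"age": [22.0, 4.0, np.nan, 17.9, 18.0]})

# Vectorized version of the lambda: NaN < 18 is False, so a missing age
# would also end up labelled 'adult' at this point
labels = np.where(passengers["age"] < 18, "child", "adult")

# Keep missing ages missing by masking them out afterwards
passengers = passengers.assign(
    age_group=pd.Series(labels, index=passengers.index).where(passengers["age"].notna())
)
```

Whether unknown ages should stay missing or be assigned to a group is an analysis decision; this sketch only makes the choice explicit.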


In [10]:
# Sort by fare in descending order
sorted_titanic = titanic.sort_values(by='fare', ascending=False)
sorted_titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,age_group
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True,adult
737,1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True,adult
679,1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False,adult
88,1,1,female,23.0,3,2,263.0,S,First,woman,False,C,Southampton,yes,False,adult
27,0,1,male,19.0,3,2,263.0,S,First,man,True,C,Southampton,no,False,adult
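
`sort_values` also accepts several columns, each with its own direction, much like `arrange(pclass, desc(fare))` in `dplyr`. A short sketch on made-up values:

```python
import pandas as pd

# Made-up fares for five passengers
passengers = pd.DataFrame({
    "pclass": [3, 1, 1, 2, 3],
    "fare": [7.25, 71.28, 512.33, 30.07, 8.05],
})

# Sort by class (ascending), then by fare within each class (descending)
by_class_then_fare = passengers.sort_values(
    by=["pclass", "fare"], ascending=[True, False]
)

# Shortcut for "top n rows by one column"
top_two = passengers.nlargest(2, "fare")
```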


Note that we can chain multiple methods across several lines by wrapping the whole expression in parentheses `()` and using `.` like a pipe:

In [11]:
# Chain operations
(titanic[titanic['survived'] == 1] # keep passengers who survived
    .groupby('who', observed=True) # split by who
    .agg(total_fare=('fare', 'sum')) # find the total fare paid by the survivors
)

who,total_fare
child,1611.6751
man,3702.7251
woman,11236.8292
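
A chain can also create a new column on the way. Inside `assign`, a `lambda` receives the data frame as it exists at that point in the chain, which plays the role of `mutate` in the middle of a pipe. The sketch below uses made-up rows, and `family_size` is a hypothetical variable introduced only for illustration:

```python
import pandas as pd

# Made-up rows that reuse a few titanic column names
passengers = pd.DataFrame({
    "survived": [1, 1, 0, 1, 0],
    "who": ["woman", "man", "man", "woman", "child"],
    "sibsp": [1, 0, 1, 0, 3],
    "parch": [0, 0, 0, 2, 1],
})

result = (
    passengers
    .query("survived == 1")                                      # like filter()
    .assign(family_size=lambda d: d["sibsp"] + d["parch"] + 1)   # like mutate()
    .groupby("who")                                              # like group_by()
    .agg(mean_family_size=("family_size", "mean"))               # like summarize()
    .sort_values("mean_family_size", ascending=False)            # like arrange()
    .reset_index()                                               # make 'who' a column again
)
```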


#### Try it! Knowing that 1 dollar in 1912 is equivalent to 32.54 dollars today, find the mean fare paid by each class in today's money.

In [12]:
# Write code here

### 3. Data visualization with seaborn

Many of the plot types we learned in R have counterparts in Python:

| ggplot2        | seaborn       |
|----------------|---------------|
| geom_bar       | barplot       |
| geom_histogram | histplot      |
| geom_boxplot   | boxplot       |
| geom_point     | scatterplot   |
| facet_wrap     | FacetGrid.map |

In [13]:
# For a categorical variable: first find counts
# (reset_index() turns 'class' from the index back into a regular column)
class_counts = titanic.groupby('class', observed=True).agg(counts=('class', 'size')).reset_index()

# Then make a plot to represent counts
sns.barplot(data=class_counts, x='class', y='counts')
plt.title('Number of Passengers per Class')
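
When the goal is only to show how many rows fall in each category, seaborn's `countplot` does the counting itself, which is closer to how `geom_bar` behaves by default in `ggplot2`. A minimal sketch on a made-up column of classes (not the real `titanic` data):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up class labels for six passengers
passengers = pd.DataFrame(
    {"class": ["Third", "First", "Third", "Second", "Third", "First"]}
)

# countplot tallies the rows per category itself, so no groupby step is needed
ax = sns.countplot(data=passengers, x="class", order=["First", "Second", "Third"])
ax.set_title("Number of Passengers per Class")
```

`barplot` remains the right choice when the bar heights are already computed, as in a table of counts or means.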

In [None]:
# Age distribution
sns.histplot(data=titanic, x='age')
plt.title('Age Distribution')

In [None]:
# Age distribution by who
sns.FacetGrid(titanic, col='who', col_wrap=3, height=4).map(sns.histplot, 'age')

In [None]:
# Boxplots often make it easier to compare groups
sns.boxplot(data=titanic, x='who', y='age')
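
A boxplot summarizes each group by its quartiles: the box spans the first to the third quartile and the line inside marks the median. The numbers behind such a plot can be computed with pandas. This sketch uses made-up ages rather than the real `titanic` data:

```python
import pandas as pd

# Made-up ages for two groups
passengers = pd.DataFrame({
    "who": ["man", "man", "man", "woman", "woman", "woman"],
    "age": [20.0, 30.0, 40.0, 24.0, 32.0, 40.0],
})

# Quartiles per group
quartiles = (
    passengers
    .groupby("who")["age"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()   # one row per group, one column per quantile
)
```

Reporting these values next to the plot lets a reader check exact medians instead of estimating them from the axis.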

In [None]:
# Relationship between age and fare
sns.scatterplot(data=titanic, x='age', y='fare')
plt.title('Age vs Fare')

In [None]:
# Adding some color to the scatterplot
sns.scatterplot(data=titanic, x='age', y='fare', hue='who')
plt.title('Age vs Fare by Who')

#### Try it! Make an appropriate graph and report appropriate statistics to investigate whether age is related to survival.

In [None]:
# Write code here

Next, we will review some machine learning algorithms to make predictions!