# Exercise 1: Familiarize yourself with `pandas`

Skills:
* Use `pandas`, one of the core Python packages for working with tabular data
* Group and aggregate data (there are many ways to do this)
* Export to Google Cloud Storage
* Practice committing on GitHub

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/data-analysis-intro.html
* https://docs.calitp.org/data-infra/analytics_tools/saving_code.html
* https://docs.calitp.org/data-infra/analytics_examples/warehouse_tutorial.html

In [9]:
from siuba import *
from siuba.data import mtcars

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Groupby / Aggregation

* For each cylinder category, calculate the average mpg and the difference between the maximum and minimum weight.
* Hint for `pandas`: `groupby / agg`, `pivot_table`, `groupby / transform`
* Hint for `siuba`: `group_by`, `summarize`

In [10]:
mtcars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


#### Siuba

In [19]:
# Average mpg

# For all cylinders combined:
# print(f"The average mpg is {mtcars.mpg.mean()}.")

df1 = mtcars >> group_by(_.cyl) >> summarize(avg_mpg = _.mpg.mean()) >> ungroup()
df1.head()

Unnamed: 0,cyl,avg_mpg
0,4,26.663636
1,6,19.742857
2,8,15.1


#### Pandas

In [12]:
# agg() takes a dictionary mapping each column to its aggregation function(s)
# Pattern: df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
agg_by_cyl1 = mtcars.groupby('cyl').agg({'wt': 'min',
                                         'hp': 'max',
                                         'mpg': 'sum'}).reset_index()
agg_by_cyl1.columns

Index(['cyl', 'wt', 'hp', 'mpg'], dtype='object')

In [13]:
# Same aggregation without reset_index(): 'cyl' stays as the index instead of becoming a column
agg_by_cyl2 = mtcars.groupby('cyl').agg({'wt': 'min',
                                         'hp': 'max',
                                         'mpg': 'sum'})
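
The two cells above differ only in `reset_index()`. A minimal sketch of what that changes, assuming a small hand-built `sample` frame (five rows copied from the `mtcars.head()` output, not the full dataset) and the illustrative names `indexed` and `flat`:

```python
import pandas as pd

# Five rows copied from the mtcars.head() output above (illustrative subset only)
sample = pd.DataFrame({
    "cyl": [6, 6, 4, 6, 8],
    "hp": [110, 110, 93, 110, 175],
    "wt": [2.62, 2.875, 2.32, 3.215, 3.44],
})

# Without reset_index(): the grouping key 'cyl' becomes the index
indexed = sample.groupby("cyl").agg({"wt": "min", "hp": "max"})

# With reset_index(): 'cyl' is moved back into an ordinary column
flat = indexed.reset_index()

print(indexed.index.name, list(indexed.columns))
print(list(flat.columns))
```

Passing `as_index=False` to `groupby` is another way to keep the key as a column.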

In [14]:
mtcars.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [15]:
pd.pivot_table(mtcars, values=["wt"], columns = ['cyl'])

cyl,4,6,8
wt,2.285727,3.117143,3.999214


In [16]:
# pd.pivot_table(index=...) is equivalent to df.groupby(...).agg(...)

pd.pivot_table(
    mtcars,
    index=["cyl"], # groupby
    values = ["hp", "mpg"],
    aggfunc = "sum"
)



Unnamed: 0_level_0,hp,mpg
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1
4,909,293.3
6,856,138.2
8,2929,211.4
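
The equivalence noted in the cell above can be checked directly. This is a rough sketch on a five-row `sample` frame copied from the `mtcars.head()` output (so the sums differ from the full-dataset output shown above); `pivoted`, `grouped`, and `same_values` are illustrative names:

```python
import numpy as np
import pandas as pd

# Five rows copied from the mtcars.head() output above (illustrative subset only)
sample = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6, 6, 4, 6, 8],
    "hp": [110, 110, 93, 110, 175],
})

# Sum of hp and mpg per cylinder group, computed two ways
pivoted = pd.pivot_table(sample, index="cyl", values=["hp", "mpg"], aggfunc="sum")
grouped = sample.groupby("cyl")[["hp", "mpg"]].sum()

# Compare the numbers column by column, ignoring dtype differences
same_values = np.allclose(
    pivoted[["hp", "mpg"]].to_numpy(dtype=float),
    grouped[["hp", "mpg"]].to_numpy(dtype=float),
)
print(same_values)
```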


In [17]:
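# By cylinders (rows) and vs (columns), calculate the sum of hp.
# Only the last expression in a cell is displayed, so this result is not shown.
pd.pivot_table(mtcars, index='cyl', values=['hp'], columns=['vs'], aggfunc='sum')

# By cylinders and vs groupings, calculate the sum and max of hp.
pd.pivot_table(mtcars, index=['cyl', 'vs'], values='hp', aggfunc=['sum', 'max'])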


Unnamed: 0_level_0,Unnamed: 1_level_0,sum,max
Unnamed: 0_level_1,Unnamed: 1_level_1,hp,hp
cyl,vs,Unnamed: 2_level_2,Unnamed: 3_level_2
4,0,91,91
4,1,818,113
6,0,395,175
6,1,461,123
8,0,2929,335


In [20]:
mtcars.groupby(['cyl', 'vs']).agg({'hp': 'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,hp
cyl,vs,Unnamed: 2_level_1
4,0,91
4,1,818
6,0,395
6,1,461
8,0,2929
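
The task at the top of this section asks for the average mpg and the weight range (max minus min) per cylinder category. One possible way to get both in a single step is named aggregation, sketched here on a five-row `sample` frame copied from the `mtcars.head()` output. The column names `avg_mpg`, `wt_range`, and `avg_mpg_by_cyl` are illustrative choices. The last line shows how `transform` differs from `agg`: it returns one value per original row rather than one row per group.

```python
import pandas as pd

# Five rows copied from the mtcars.head() output above (illustrative subset only)
sample = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "cyl": [6, 6, 4, 6, 8],
    "wt": [2.62, 2.875, 2.32, 3.215, 3.44],
})

# Named aggregation: new_column=(source_column, function)
summary = (
    sample.groupby("cyl")
    .agg(
        avg_mpg=("mpg", "mean"),
        wt_range=("wt", lambda s: s.max() - s.min()),
    )
    .reset_index()
)

# transform keeps the original number of rows and repeats the group mean on each row
sample["avg_mpg_by_cyl"] = sample.groupby("cyl")["mpg"].transform("mean")

print(summary)
print(sample)
```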


## Functions

* Create a new column using a lambda function and tag 6 cylinder values with "six", and all other values as "other"
* Write a function that tags each cylinder value with the word (ex: 6 as "six", 8 as "eight")

In [27]:
# View type
# type(mtcars)

# Create a new df; .copy() keeps the original mtcars unchanged
mtcars2 = mtcars.copy()

# Create a new column flagging 6-cylinder cars as "six" and all others as "other"
mtcars2["cylinder_flag1"] = mtcars2.apply(
    lambda x: "six" if x.cyl == 6 else "other", axis=1)

mtcars2.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,cylinder_flag1
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,six
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,six
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,other
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,six
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,other


In [28]:
# Define function cylinder_word to convert a row's numeric 'cyl' value to text
def cylinder_word(row):
    if row.cyl == 4:
        return 'four'
    elif row.cyl == 6:
        return 'six'
    elif row.cyl == 8:
        return 'eight'

# Apply the function row by row (axis=1)
function_digit_to_text = mtcars2.apply(cylinder_word, axis=1)

# Add the result as a new column in mtcars3
mtcars3 = mtcars2.copy()
mtcars3['cylinder_flag2'] = function_digit_to_text
mtcars3.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,cylinder_flag1,cylinder_flag2
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,six,six
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,six,six
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,other,four
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,six,six
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,other,eight
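
Row-wise `apply` works, but the same two columns can also be built with vectorized operations. A minimal sketch, assuming a small `sample` frame with the `cyl` values from the `mtcars.head()` output and a hypothetical lookup dictionary `cyl_words`:

```python
import numpy as np
import pandas as pd

# cyl values copied from the mtcars.head() output above (illustrative subset only)
sample = pd.DataFrame({"cyl": [6, 6, 4, 6, 8]})

# np.where(condition, value_if_true, value_if_false) tags 6 as "six", everything else as "other"
sample["cylinder_flag1"] = np.where(sample["cyl"] == 6, "six", "other")

# Series.map looks up each value in the dictionary; values missing from it become NaN
cyl_words = {4: "four", 6: "six", 8: "eight"}
sample["cylinder_flag2"] = sample["cyl"].map(cyl_words)

print(sample)
```

This avoids calling a Python function once per row, which tends to matter more as the number of rows grows.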


## Export to Google Cloud Storage (GCS)

* Make sure your credentials work
* Use this path: "gs://calitp-analytics-data/data-analyses/FILENAME"
* Export using `df.to_parquet()` and `df.to_csv()`
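
One detail worth knowing about `df.to_csv()`: by default it also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows this with an in-memory buffer instead of the GCS path, so it runs without credentials. `sample`, `with_index`, and `without_index` are illustrative names.

```python
import io

import pandas as pd

sample = pd.DataFrame({"cyl": [6, 4, 8], "hp": [110, 93, 175]})

# to_csv() with no path returns the CSV text; the index is included by default
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```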

In [None]:
# Files were removed from GCS after running
mtcars2.to_parquet('gs://calitp-analytics-data/data-analyses/example_report/practice_exercise1_Julia.parquet')
mtcars2.to_csv('gs://calitp-analytics-data/data-analyses/example_report/practice_exercise1_Julia.csv')

## Make a chart

* Read in the parquet file from GCS.
* Make a visualization using one of the charting packages.
* Name this notebook `YOURNAME_exercise1.ipynb`
    * If you need to rename it because you already gave it another name, do so in the terminal.
    * `git mv OLDNAME.ipynb NEWNAME.ipynb`
    * The `mv` stands for move, and renaming a file is basically "moving" its path. Renaming this way stages the change in one step, so git records it as a rename and the notebook's history stays easy to follow. If you rename directly with right click > rename, git sees a deleted file plus a new untracked file until both changes are staged.
* Use a descriptive commit message (ex: adding chart, etc). GitHub already tracks who made the commit, its date and timestamp, and the files affected, so your commit message should say more than the metadata already stored.

In [None]:
pd.read_parquet('gs://calitp-analytics-data/data-analyses/example_report/practice_exercise1_Julia.parquet').head()

In [None]:
# Scatterplot of horsepower by number of cylinders
plt.scatter(mtcars2.cyl, mtcars2.hp)
plt.xlabel("cyl")
plt.ylabel("hp")