# MSAS Tutorial Eight

**This tutorial will cover Pandas idioms and Goodhart's law**

**Written by: Ben Weber**

**Ideas drawn from Professor Chris Brooks**

**Pandas Idioms**

* We have talked a little bit about how there are many different ways to write code that accomplishes the same task
* However, some are more appropriate than others. These are called idiomatic solutions

Can you think of any reasons why some solutions could be more appropriate than others?

**Functionality:**
* Code performs better 
* This means that runtimes are faster with certain solutions
* May not be important with smaller amounts of data, but could be significant in large datasets, files, or webpages

**Style/Readability:**
* Code could be made more succinct
* Code could be easier to read
* Code could be easier to understand conceptually 

In [None]:
# Bring in your usual libraries, with the addition of the timeit module (Use import keyword)

import pandas as pd
import numpy as np

import timeit

We will use timeit to test the performance of different coding solutions
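As a minimal sketch of how timeit works (the two toy functions below are made up for illustration and are not part of the tutorial data), `timeit.timeit` takes a callable and a `number` of runs, and returns the total time in seconds for all of those runs:

```python
import timeit

# Two hypothetical functions that compute the same result in different ways
def sum_with_loop(n=1000):
    total = 0
    for i in range(n):
        total += i
    return total

def sum_with_builtin(n=1000):
    return sum(range(n))

# number=200 means each function is called 200 times; the result is the total seconds
loop_seconds = timeit.timeit(sum_with_loop, number=200)
builtin_seconds = timeit.timeit(sum_with_builtin, number=200)

print(f"loop: {loop_seconds:.4f}s, builtin: {builtin_seconds:.4f}s")
```

The exact numbers depend on your machine, so compare the two timings against each other rather than reading too much into either one alone.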

In [None]:
# Load in our usual df:

pbp = pd.read_csv("UofM 2022 pbp")

**Method Chaining**

* We have talked a lot about methods now. Method chaining is simply attaching several methods one after another in a single statement
* This works because most pandas methods return a new DataFrame or Series rather than modifying the original in place
* Therefore we can keep attaching methods without error, since each one is applied to the object returned by the previous method
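To make this concrete, here is a small sketch on a made-up DataFrame (the `toy` frame and its `team` and `yards` columns are assumptions for illustration only). Each step returns a new DataFrame, so the next method has something to attach to:

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's play-by-play file
toy = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "yards": [5, 12, -2, 7],
})

# Each method returns a new DataFrame, which the next method is called on
chained = (toy.loc[lambda d: d["team"] == "A"]
              .sort_values(by="yards", ascending=False)
              .set_index("team"))

print(chained)
print(toy.shape)  # the original frame is left untouched
```

Wrapping the chain in parentheses lets each method sit on its own line, which keeps longer chains readable.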

To get a good idea of what is going on, we will look at an example from a previous tutorial:

Here is the **non**-idiomatic way of writing this code:

In [None]:
pbp = pbp.loc[:,['defense_team_name','play_pass_placement_displacement']]

pbp = pbp[pbp['defense_team_name'] == 'Ohio State Buckeyes']

pbp.sort_values(by='play_pass_placement_displacement',ascending=False,inplace=True)

pbp.set_index('defense_team_name',inplace=True)

In [None]:
pbp.head()

We can see that this works just fine, and it is probably useful for understanding exactly what is happening line by line. However, with method chaining, it could be much more succinct

Here is a more idiomatic way to write the code:

In [None]:
pbp = pd.read_csv('UofM 2022 pbp')

In [None]:
(pbp.filter(items=['defense_team_name','play_pass_placement_displacement'], axis=1)
    .sort_values(by='play_pass_placement_displacement',ascending=False)
    .set_index('defense_team_name')
    .filter(like='Ohio', axis=0)
    .head())

As you can see, method chaining allowed us to accomplish the same task while writing less code

But which is faster? We can use timeit to check:

In [None]:
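def approach_one():
    # Work on a local result so the global pbp is never modified between timing runs
    new_pbp = pbp.loc[:, ['defense_team_name', 'play_pass_placement_displacement']]
    # Filter the two-column frame (not the full pbp), otherwise the column selection is lost
    new_pbp = new_pbp[new_pbp['defense_team_name'] == 'Ohio State Buckeyes']
    new_pbp = new_pbp.sort_values(by='play_pass_placement_displacement', ascending=False)
    # set_index(..., inplace=True) returns None, so return the new DataFrame instead
    return new_pbp.set_index('defense_team_name')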

pbp = pd.read_csv('UofM 2022 pbp')

In [None]:
timeit.timeit(approach_one, number = 25)

In [None]:
def approach_two():
    global pbp
    
    return (pbp.filter(items=['defense_team_name','play_pass_placement_displacement'], axis=1)
    .sort_values(by='play_pass_placement_displacement',ascending=False)
    .set_index('defense_team_name')
    .filter(like='Ohio', axis=0)
    .head())

pbp = pd.read_csv('UofM 2022 pbp')

In [None]:
timeit.timeit(approach_two, number=25)

In this case, method chaining was faster, but this will not always be the case

**The big takeaway is that the most idiomatic code can change based on the situation, and for larger projects you might have to test which approach performs best**
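Single timings can be noisy, so one option is `timeit.repeat`, which runs the whole measurement several times and returns a list of totals. The sketch below uses randomly generated data (the `toy` frame and the function names `step_by_step` and `chained` are assumptions for illustration) and compares the minimum of each list:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data standing in for a real play-by-play file
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "team": rng.choice(["A", "B", "C"], size=10_000),
    "yards": rng.normal(size=10_000),
})

def step_by_step():
    out = toy.loc[:, ["team", "yards"]]
    out = out[out["team"] == "A"]
    out = out.sort_values(by="yards", ascending=False)
    return out.set_index("team")

def chained():
    return (toy.loc[:, ["team", "yards"]]
               .loc[lambda d: d["team"] == "A"]
               .sort_values(by="yards", ascending=False)
               .set_index("team"))

# repeat=5 gives five separate totals, each covering number=20 calls
step_times = timeit.repeat(step_by_step, number=20, repeat=5)
chain_times = timeit.repeat(chained, number=20, repeat=5)

print(f"step by step: {min(step_times):.4f}s, chained: {min(chain_times):.4f}s")
```

Both functions do the same work here, so any difference is likely to be small. The point is the measurement pattern, not the result.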

**Also important:**
* The code solution you decide is most idiomatic can change based on the goal of your code
* Think about which factors are most important in each specific situation, because there is no general rule

**Apply method:**

* This method is another pandas idiom, and since it came up in the last tutorial we will go over it again here
* Apply takes a function as a parameter, and if you call it on a Series (a single column), this is all you need to know
* If you are using apply on a dataframe, you will have to specify which axis you want to apply the function to

**Be careful when choosing an axis, because choosing columns (axis=1) will apply your function to each row, and vice versa**
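The axis behaviour is easier to see on a tiny example. This is a sketch on a made-up frame (`toy`, with columns `a` and `b`, is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# On a Series, apply calls the function once per value
doubled = toy["a"].apply(lambda x: x * 2)

# axis=0 (or 'index'): the function receives each COLUMN as a Series
col_totals = toy.apply(lambda col: col.sum(), axis=0)

# axis=1 (or 'columns'): the function receives each ROW as a Series
row_totals = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(doubled.tolist())
print(col_totals.tolist())
print(row_totals.tolist())
```

One way to remember it: the axis names the direction the function moves along, so axis=1 walks across the columns of a single row.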

In [None]:
pbp = pd.read_csv('UofM 2022 pbp')

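def play_quality_func(EPA):
    # Map a single EPA value to a label; use `and` (not `&`) for scalar comparisons
    # Note: a missing (NaN) EPA fails every comparison and falls through to the else branch
    if EPA < -.5:
        return 'abysmal'
    elif EPA >= -.5 and EPA < -.1:
        return 'bad'
    elif EPA >= -.1 and EPA < .05:
        return 'decent'
    elif EPA >= .05 and EPA < .5:
        return 'good'
    else:
        return 'phenomenal'

# apply accepts the function directly, so no lambda wrapper is needed
pbp['play_quality_epa'] = pbp['play_scrimmage_epa'].apply(play_quality_func)
pbp['play_quality_epa']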

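**The apply method is a good example of a pandas idiom being cleaner than writing your own loop over the rows. Keep in mind that apply still calls your function once per value, so a truly vectorized operation on the whole column is usually faster still**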
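As one possible vectorized alternative, `pd.cut` can bin a whole column at once. The sketch below reuses the bin edges from `play_quality_func` on a made-up Series (the `epa` values are an assumption for illustration) and checks that both approaches give the same labels:

```python
import numpy as np
import pandas as pd

# Hypothetical EPA values, chosen to include the bin edges
epa = pd.Series([-0.8, -0.5, -0.3, 0.0, 0.05, 0.2, 0.5, 1.4])

def play_quality_func(EPA):
    if EPA < -.5:
        return 'abysmal'
    elif EPA < -.1:
        return 'bad'
    elif EPA < .05:
        return 'decent'
    elif EPA < .5:
        return 'good'
    else:
        return 'phenomenal'

# Element-by-element version
labels_apply = epa.apply(play_quality_func)

# Vectorized version: right=False makes each bin include its left edge,
# matching the >= / < comparisons in play_quality_func
labels_cut = pd.cut(
    epa,
    bins=[-np.inf, -.5, -.1, .05, .5, np.inf],
    labels=['abysmal', 'bad', 'decent', 'good', 'phenomenal'],
    right=False,
)

print(labels_cut.astype(str).tolist())
print(labels_apply.tolist() == labels_cut.astype(str).tolist())
```

One difference to be aware of: `pd.cut` leaves missing values as NaN, while the function above would send a NaN to its else branch.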

**Goodhart's Law:**

Similarly to the discussion around scales in tutorial seven, Goodhart's law is important to be aware of in data analysis

When we analyze data, we are basically making inferences and suggestions based on the data we observe, whether these conclusions are backed with statistical methods or not

**However, it is important to remember that these conclusions are just based on observations, not experiments**

Therefore, we find measures of correlation, not causation

* **Goodhart's law states that once a measure becomes a target, it ceases to be a good measure**

What does this mean?

* If we think about it in terms of correlation vs. causation, it would be like assuming causation for two correlated variables
* This has led to a variety of statistical blunders throughout history, but it can easily be avoided by understanding the difference

In sports analytics, it can lead to coaching errors due to poor strategy. One example is the Vikings and Randy Moss:

Back when Randy Moss was a Viking, somebody noticed the Vikings won a large percentage of their games when Randy Moss caught a certain number of passes. The coach, ignoring that this was just a measure of correlation, developed an offense designed to force-feed Moss the ball. Unsurprisingly, the strategy did not work. The Vikings' offense was poor and they were losing games. This was a lesson for coaches around the league that offenses operate best when a quarterback can go through his progressions naturally to find the open receiver.

**As you can see, the Vikings took a measure (they won more when Randy Moss had a good game) and made it a target. The measure stopped being useful once they started losing games due to a stale offense**

That's all for this tutorial. If you want to learn more about Goodhart's law and its applications, visit this article/podcast:

https://dataskeptic.com/blog/episodes/2016/goodharts-law