# Missing and Extreme Values

## Discussion of Missing Data

[Textbook](https://ledatascifi.github.io/ledatascifi-2024/content/03/05c_missingdata.html)

**Missing data is a common feature of real-world datasets.**

The most important question to ask is "Why is this variable blank for this observation?"
- Is that variable missing completely randomly, or systematically?
- If a systematic explanation is plausible, it will affect many steps of our analysis. [Gif: Truncation on y.](https://lh6.googleusercontent.com/h_laUnAP-yOolhyqjcHDzAElFHDJiaaO49SmCQaFfVRqZZ369V7KGKTcAozQYxHwVUPi2lGP7HUpMdfUPsngV0_0idyKW-DOvGN0mgydqOSLxeAymloswdQtcoDruoxreg=w1280)

Systematic reasons include censoring and truncation mechanisms:
- A central bank intervenes to stop an exchange rate falling below or going above certain levels.
- Dividends paid by a company may remain zero until earnings reach some threshold value.
- A government imposes price controls on some goods.
- A survey of only working women, ignoring non-working women.
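
To see why a systematic mechanism matters, here is a minimal simulation sketch of truncation on y. The data, the threshold of 2, and the variable names are all made up for illustration; the point is only that dropping observations based on the outcome changes the estimated slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): the true relationship is y = 1 + 2x + noise
n = 5_000
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

# Slope estimated on the full sample (polyfit returns [slope, intercept])
slope_full = np.polyfit(x, y, 1)[0]

# Truncation on y: observations with y at or above the threshold never enter the data
keep = y < 2
slope_trunc = np.polyfit(x[keep], y[keep], 1)[0]

print(f"slope, full sample:      {slope_full:.2f}")
print(f"slope, truncated sample: {slope_trunc:.2f}")
```

The truncated-sample slope comes out flatter than the full-sample slope, even though nothing about the underlying relationship changed.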

## What **CAN** you do about missing data?

(Not "should".)

Option | Pro | Con 
:-- | :--- | :---
Find it | Basically "free" lunch on multiple dimensions | Data collection ain't free
**Leave blank.** For each test, use all observations with no missing values for the variables in that specific test. | Doesn't _add_ noise or bias | Less data = less power, non-missing sample might not be representative
Deduce value (my height is the same as last year) | When deduction is exact, great | Rarely possible
Interpolate (my height this year is halfway between my recorded heights in the prior and subsequent years) | In some settings, this adds viable data | Can artificially smooth time trends and cross-sectional differences, often not possible
Fill with other values (median, mean, "fancy" imputation) | Common in prediction problems because it can "allow" you to use more data | Lots (see below), but tl;dr: Don't do this for causal inference and non-prediction problems
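
As a rough sketch of the "leave blank" and "interpolate" rows in pandas: the toy data and variable names below are hypothetical, and `interpolate()` is used with its default linear method.

```python
import pandas as pd

# Hypothetical toy data with gaps
toy = pd.DataFrame({"A": [12, 4, 5, None, 1],
                    "B": [None, 2, 54, 3, None],
                    "C": [20, 16, None, 3, 8]})

# "Leave blank": each test uses only the rows that are complete for *its* variables
n_ab = len(toy.dropna(subset=["A", "B"]))  # rows usable for a test of A and B
n_ac = len(toy.dropna(subset=["A", "C"]))  # rows usable for a test of A and C

# Interpolate: linear fill between the neighboring non-missing values
c_interp = toy["C"].interpolate()

print(n_ab, n_ac)
print(c_interp)
```

Note that the usable sample differs across tests, which is the "less power, possibly unrepresentative" cost listed in the table.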

## My recommendations 

1. Find it. The remaining options aren't super. 
1. If you can deduce the correct value, go ahead.
1. Otherwise, tend towards leaving blank 
    - On a test-by-test basis, you'll delete or ignore observations where any variable in the test has missing values.
    - Most common choice in finance research, and if your question is about _causal inference_, the only choice.
1. Create a flag (binary variable) to indicate observations with missing values.
    - Then replace the NaNs in the original variable with a consistent value, such as 0 or -999.
    - In regressions, you must include both variables (the flag and the filled variable) in the analysis.
    - Allows you to keep the observation and not lose the rest of its information.
    - Easy. Works well for both prediction and causal inference. (I've done this in many projects.)
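
A minimal sketch of the flag approach in the last recommendation. The column names `A_missing` and `A_filled` are made up for this example, and 0 is just one possible placeholder.

```python
import pandas as pd

flag_df = pd.DataFrame({"A": [12, 4, 5, None, 1]})  # toy data

# Flag: 1 where A is missing, 0 otherwise
flag_df["A_missing"] = flag_df["A"].isna().astype(int)

# Replace NaN with a consistent placeholder; the flag absorbs the placeholder's effect
flag_df["A_filled"] = flag_df["A"].fillna(0)

print(flag_df)
```

In a regression, both `A_filled` and `A_missing` would then enter as explanatory variables.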
    

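```python
import pandas as pd
import numpy as np

# Toy dataset with missing values in every column
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[None, 2, 54, 3, None], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6]}) 

# Copy the toy data for firm 1, using the row index as the date
_df1 = df.copy()
_df1['firm'] = 1
_df1['date'] = _df1.index

# Same data again for firm 2
_df2 = df.copy()
_df2['firm'] = 2
_df2['date'] = _df2.index

# Stack the two firms into a firm-date panel
df2 = pd.concat([_df1, _df2])
```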

## Warm up

Play around with each of the functions on `df`. Look at the possible parameters each can take, and try a few. 

(Don't change the underlying data yet, just use the functions without assigning the output to a new object.)

## Questions for df:

Fill all missing values with -1

Fill missing values for variable `B` with -1

Fill all missing values with the mean for the variable

Fill all missing values with the median for the variable

Fill missing values by taking the most recent non-missing prior value
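
One possible set of answers, as a sketch (other approaches work too). The result names are made up, the toy data is repeated so the block runs on its own, and none of the calls modify `df`.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

all_minus1 = df.fillna(-1)            # every NaN becomes -1
b_minus1 = df.fillna({"B": -1})       # only column B is filled
mean_fill = df.fillna(df.mean())      # each column filled with its own mean
median_fill = df.fillna(df.median())  # each column filled with its own median
ffilled = df.ffill()                  # carry the last non-missing value forward

print(ffilled)
```

Forward filling cannot fill a gap in the first row, because there is no prior value to carry.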

## Questions for df2:

Carry missing values forward without carrying values from firm 1 to firm 2

Fill missing values with the average for firms on that date
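
A sketch of both panel fills using `groupby`. To make the results visible, it uses a small hypothetical panel (`panel`) in which the two firms differ, rather than `df2`; it also assumes rows are already sorted by date within firm.

```python
import pandas as pd

# Hypothetical firm-date panel, sorted by firm then date
panel = pd.DataFrame({
    "firm": [1, 1, 1, 2, 2, 2],
    "date": [0, 1, 2, 0, 1, 2],
    "A":    [10.0, None, 30.0, None, 40.0, None],
})

# Naive forward fill: firm 1's last value leaks into firm 2's first row
naive = panel["A"].ffill()

# Forward fill within firm: nothing is carried across firms
within_firm = panel.groupby("firm")["A"].ffill()

# Fill with the average across firms on the same date
date_mean = panel.groupby("date")["A"].transform("mean")
filled_by_date = panel["A"].fillna(date_mean)

print(pd.DataFrame({"naive": naive,
                    "within_firm": within_firm,
                    "filled_by_date": filled_by_date}))
```

If every firm is missing the variable on a given date, the date average is itself missing and the gap remains.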

## Outliers

[Let's look at Anscombe's quartet again.](https://ledatascifi.github.io/ledatascifi-2024/content/03/05d_outliers.html#example-2-visual-intuition)

--> Outliers _can_ cause analysis to misstate relationships

## Outliers exercises (in class)

What can we do to find outliers?
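
Two possible starting points, sketched on simulated data: summary statistics with extra percentiles, and an interquartile-range rule of thumb. The data, the planted extreme values, and the 3x IQR multiplier are assumptions for illustration, not a recommended threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical variable: 1,000 well-behaved draws plus two planted extreme values
x = pd.Series(np.append(rng.normal(size=1000), [25.0, -30.0]), name="x")

# 1. Summary statistics with extra percentiles: compare min/max to p1/p99
summary = x.describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99])
print(summary)

# 2. Flag values far outside the interquartile range
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = x[(x < q1 - 3 * iqr) | (x > q3 + 3 * iqr)]
print(flagged)
```

A large gap between the max and p99 (or the min and p1) is a hint to look closer. Plots such as box plots or scatter plots are another common way to spot candidates.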

## Dealing with outliers

To paraphrase @ChelseaParlett: We don't remove or change outliers because they are extreme. We remove them because they aren't part of the data generating process (DGP) we want to study. ("How are X and Y related, **on average**?") 

After finding outliers, investigate them. (This requires domain knowledge and costs time + effort. But it is necessary for valid analysis!)
- If they are errors, fix or remove.
- If they are from another DGP, remove (censor). 
- If they are correct and from the DGP you are interested in, keep. 

## Doing analysis with extreme values

As our quartet showed, extreme values can influence your analysis. You want to know $E(Y|X)$ but outliers can weigh on some kinds of analysis so that your result reflects the outliers more than the central tendencies. 

[Play with this.](https://ledatascifi.github.io/ledatascifi-2024/content/03/05d_outliers.html#example-2-visual-intuition)

In practice, applied researchers handle this in three ways:

1. Winsorize: Any value of a variable above or below some extreme limit is changed to that limit
    - Q1: What cutoff? p0.1, p1, or p5?
    - Q2: Is the cutoff set using all observations, or within subgroups? 
        - It matters if the distribution shifts over time or across groups. 
        - Viz ex: Shifting ridgeline + fixed limit. 
        - Word ex: If you winsorize "extremely low" GDP per capita at $250 USD, you'll change no countries today, but many countries in prior years.
1. Transform
    - log(), a la Assignment 2 - **by far most common**
        - Many variables are ~lognormal (income, etc)
        - Interpretation often easier (proportions and elasticities, covered later)
    - Others: z-score, inverse hyperbolic sine, Box-Cox, square root, exponential
    - Normalization on the (0, 1) interval: min-max, sigmoidal, hyperbolic tangent.
1. Use an estimator that is robust to outliers
    - [Hot, recent research](https://www.sciencedirect.com/science/article/abs/pii/S0304405X2200174X) on issues with the log transformation and an estimator that solves the issue without the transformation
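
The first two approaches can be sketched in pandas as follows. The `winsorize` helper, the 10%/90% cutoffs, and the toy data are assumptions for illustration only; this is not the textbook's `winsorizer_with_missing()`.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: the whole distribution shifts up between years
gdp = pd.DataFrame({
    "year": [2000] * 5 + [2020] * 5,
    "gdp_pc": [100.0, 200.0, 300.0, 400.0, 50_000.0,
               1_000.0, 2_000.0, 3_000.0, 4_000.0, 500_000.0],
})

def winsorize(s, lower=0.01, upper=0.99):
    """Clip a Series at its own lower/upper quantiles (NaNs stay NaN)."""
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# Cutoffs computed from the pooled sample
gdp["win_pooled"] = winsorize(gdp["gdp_pc"], 0.10, 0.90)

# Cutoffs computed separately within each year
gdp["win_by_year"] = (gdp.groupby("year")["gdp_pc"]
                         .transform(lambda s: winsorize(s, 0.10, 0.90)))

# Log transform (requires strictly positive values)
gdp["log_gdp_pc"] = np.log(gdp["gdp_pc"])

print(gdp)
```

In this toy data the pooled cutoffs leave the extreme value in 2000 untouched, while the within-year cutoffs pull it in. That is the Q2 choice above in action.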

## Outliers exercises (after class)

[Load CCM](https://ledatascifi.github.io/ledatascifi-2024/content/03/05d_outliers.html#finding-outliers).

1. Get the correlation between `mb` and `prodmktfluid`.
1. Create a variable called `mb_win1` and `prodmktfluid_win1` using the default values of `winsorizer_with_missing()`.
1. Get the correlation between `mb_win1` and `prodmktfluid_win1`.
1. Let me know what you found!

Load the 2020 slice of Compustat that we used on Assignment 2. 
1. Plot the kde of the raw assets variable.
1. Plot the kde of the log assets variable.

## A short list of problems when filling missing values

(I'm including this because it was promised above.)

1. Which "other observations" to fill based on? All? A subset?
    - Imagine you're filling the "height" variable. 
    - Same gender? Same age? Same gender and age?
1. Still, is your subset of comparable obs "enough"? Birth year? Birth country? Ethnicity?
1. If you do this for many variables one at a time, separately, that means you're not using the covariance between variables... 
1. What if missing values are missing for systematic reasons?

This is not an exhaustive list. 
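
To make the first problem concrete, here is a small sketch with made-up heights showing how much the filled value depends on which "other observations" you use. The data and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: heights (cm) for two age groups, with one gap in each
people = pd.DataFrame({
    "age_group": ["child", "child", "child", "adult", "adult", "adult"],
    "height":    [120.0, 130.0, np.nan, 170.0, 180.0, np.nan],
})

# Fill using all other observations
overall = people["height"].fillna(people["height"].mean())

# Fill using only comparable observations (same age group)
by_group = people["height"].fillna(
    people.groupby("age_group")["height"].transform("mean")
)

print(pd.DataFrame({"overall": overall, "by_group": by_group}))
```

The overall mean assigns the same height to the child and the adult, while the group means do not. Neither choice addresses values that are missing for systematic reasons.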

## Acknowledgments 

Material remixed from many sources, but punched up this year with stuff from:
- Rauli Susmel ([censoring](https://www.bauer.uh.edu/rsusmel/phd/ec1-23.pdf))
- @ChelseaParlett (DGP quote)
- [Nick Hagerty](https://github.com/msu-econ-data-analytics/course-materials)