In [None]:
import pandas as pd
import numpy as np
import os
pd.options.plotting.backend = 'plotly'

# Lecture 8 – Unfaithful Data, Hypothesis Testing

## DSC 80, Spring 2023

### Agenda

- Messy data.
- Unfaithful data.
- Hypothesis testing.

## Messy data

### More data type ambiguities

- 1649043031 looks like a number, but is probably a date.
    - As we saw earlier, [Unix timestamps](https://www.unixtimestamp.com) count the number of seconds since midnight UTC on January 1st, 1970.

- "USD 1,000,000" looks like a string, but is actually a number **and** a unit.
    
- 92093 looks like a number, but is really a zip code (and isn't equal to 92,093).
    
- Sometimes, `False` appears in a column of country codes. Why might this be? 
🤔
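
A minimal sketch of how the first three ambiguities might be handled in `pandas`. The variable names and the second zip code are made up for illustration.

```python
import pandas as pd

# A Unix timestamp counts seconds since 1970-01-01 00:00:00 UTC.
timestamp = pd.to_datetime(1649043031, unit='s')

# Separate the unit from the amount, then drop the thousands separators.
unit, amount_str = 'USD 1,000,000'.split(' ')
amount = int(amount_str.replace(',', ''))

# Keeping zip codes as strings preserves leading zeros and prevents arithmetic on them.
zip_codes = pd.Series(['92093', '02139'])
```

Here `timestamp` is a date in April 2022, `amount` is the integer one million, and `unit` is kept separately as `'USD'`.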

### Example: The Norway problem 🇳🇴

In [None]:
import yaml

player = '''
name: Magnus Carlsen
age: 32
country: NO
'''

In [None]:
yaml.safe_load(player)
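
The `False` in the country column comes from the file format itself: PyYAML follows YAML 1.1, in which an unquoted `NO` is read as a Boolean. A small sketch of the problem and one common workaround, quoting the value (the variable names are illustrative):

```python
import yaml

# Unquoted NO is interpreted as the Boolean False.
unquoted = yaml.safe_load('country: NO')

# Quoting the value forces it to be read as a string.
quoted = yaml.safe_load('country: "NO"')
```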

## Unfaithful data

### Is the data "faithful" to the data generating process (DGP)?

- In other words, how well does the data represent reality?

- Does the data contain unrealistic or "incorrect" values?
    - Dates in the future for events in the past.
    - Locations that don't exist.
    - Negative counts.
    - Misspellings of names.
    - Large outliers.

### Is the data "faithful" to the DGP?
    
- Does the data violate obvious dependencies?
    - Age and birthday don't match. 
- Was the data entered by hand?
    - Spelling errors.
    - Fields shifted.
    - Did the form require fields or provide default values?
- Are there obvious signs of data falsification (also known as "curbstoning")?
    - Repeated names.
    - Fake-looking email addresses.
    - Repeated use of uncommon names or fields.

<center><img src='imgs/data-sd.png' width=70%></center>

### Example: Police vehicle stops 🚔

The dataset we're working with contains all of the vehicle stops that the San Diego Police Department made in 2016.

In [None]:
stops = pd.read_csv(os.path.join('data', 'vehicle_stops_2016_datasd.csv'))
stops.head()

In [None]:
stops.shape

### Data types

Are the data types correct? If not, are they easily fixable?

In [None]:
stops.head(1)

In [None]:
stops.info()

### Unfaithfulness
* Are there suspicious values?
* If a value is suspicious, can we trust the observation?
* For example, consider `'subject_age'` – some are too high to be true, some are too low to be true.

In [None]:
stops['subject_age'].unique()

In [None]:
ages = pd.to_numeric(stops['subject_age'], errors='coerce')
ages.describe()

Ages range all over the place, from 0 to 220. Was a 220-year-old really pulled over?

In [None]:
stops[ages > 100]

What about all of the stops that involved people under the legal driving age?

In [None]:
ages[ages < 16].value_counts()

In [None]:
stops[ages < 16]

### Unfaithful `'subject_age'`

* Ages of `'No Age'` and `0` are likely explicit null values.
* What do we do about the exceptionally small and large ages?
    - Do we throw the entire row away, even if the rest of the row is well-formed?
* What about the 14- and 15-year-olds?
    - Each has more than one occurrence – these could be real entries!
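
One possible treatment, sketched on a small made-up sample rather than the real column: convert the explicit null markers to `NaN`, and only flag the remaining suspicious ages instead of dropping their rows. The sample values and the thresholds of 16 and 100 are assumptions chosen to mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw, string-valued 'subject_age' column.
raw_ages = pd.Series(['25', 'No Age', '0', '15', '220', '41'])

# 'No Age' cannot be parsed as a number, so it becomes NaN.
ages = pd.to_numeric(raw_ages, errors='coerce')

# Treat 0 as an explicit null as well.
ages = ages.replace(0, np.nan)

# Flag, rather than delete, ages that need a closer look.
suspicious = (ages < 16) | (ages > 100)
```

Comparisons involving `NaN` evaluate to `False`, so the null entries are not flagged as suspicious.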

### Human-entered data
* Which fields were likely entered by a human?
* Which fields were likely generated by code?
    - What was the original source?

Let's look at all unique stop causes. Notice that there are three different causes related to bicycles, which should probably all fall under the same cause.

In [None]:
stops['stop_cause'].value_counts()
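
A sketch of how near-duplicate categories could be merged. The cause labels below are hypothetical stand-ins, not the actual strings in `stops['stop_cause']`; the mapping would need to be written after inspecting the real output above.

```python
import pandas as pd

# Hypothetical labels; check the real value_counts output before writing the mapping.
causes = pd.Series(['Moving Violation', 'Bicycle', 'BICYCLE',
                    'Bicycle Bicycle', 'Equipment Violation'])

# Map each variant spelling to a single canonical label.
canonical = {'BICYCLE': 'Bicycle', 'Bicycle Bicycle': 'Bicycle'}
cleaned = causes.replace(canonical)
```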

Let's plot the distribution of ages, within a reasonable range (16 to 85). What do you notice? How could we address this?

In [None]:
ages[(ages > 15) & (ages <= 85)].plot(kind='hist')

Now let's look at the first few and last few rows of `stops`.

In [None]:
stops[['timestamp', 'stop_date', 'stop_time']].head()

In [None]:
stops[['timestamp', 'stop_date', 'stop_time']].tail(10)

Do you think `'-0:81'` is a time that a computer would record?
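
One way to surface such entries is to parse the times with an explicit format and see which ones fail. This is a sketch on made-up values; it assumes the times are stored as `'HH:MM'` strings, which may not hold for every row of the real data.

```python
import pandas as pd

# Hypothetical sample of recorded stop times.
times = pd.Series(['14:05', '-0:81', '23:59'])

# Entries that don't match the format become NaT instead of raising an error.
parsed = pd.to_datetime(times, format='%H:%M', errors='coerce')
invalid = times[parsed.isna()]
```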

### Unfaithful data vs. outliers

* Unfaithful data are data that don't accurately represent the data generating process.
* Outliers are "unusual" observations, unlike the rest of the data. They may be real, or they may be unfaithful.
    - For instance, it's possible that a 102-year-old was pulled over for speeding.
* The two are hard to tell apart; doing so often requires research and domain knowledge.

### Watch out for...

* **Consistently "incorrect" values**.
    - Example: Recorded ages of -1 or 99.
    - These are often "default" values, used when the true value is missing.
    - Solution: Change the value to the correct one if it is known; otherwise, treat it as missing.
    
* **Abnormal artifacts from the data collection process**.
    - Example: Spikes in recorded ages at round numbers (25, 30, 35, 40), or spikes in recorded COVID cases on Mondays.
    - Solution: Try "smoothing", e.g. binning the ages.
        
* **Unreasonable outliers**.
    - Example: Age of 200.
    - Solution: Not sure. Could remove the row. Could be indicative of a bug in the data collection process. Could be real!
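
The "smoothing" idea can be sketched with `pd.cut`, which groups ages into bins so that a spike at a round number is absorbed into its bin. The ages and the 5-year bin width below are made up for illustration.

```python
import pandas as pd

# Hypothetical ages with spikes at 25 and 30.
ages = pd.Series([24, 25, 25, 25, 26, 29, 30, 30, 30, 31])

# 5-year bins, closed on the left: [20, 25), [25, 30), [30, 35).
binned = pd.cut(ages, bins=[20, 25, 30, 35], right=False)
bin_counts = binned.value_counts().sort_index()
```

The bin width is a judgment call: wider bins hide more of the rounding artifact but also hide more real detail.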

## Missing values

### Where'd you go?

* Missing, or null, values in a dataset can arise from:
    - Intentional logic, where a value doesn't make sense.
    - A non-response in the measurement process.
    - Mistakes in the data recording process.
    
* Missing values are most often encoded with `NULL`, `None`, `NaN`, `''`, `0`, etc.

- **Important**: While you may have "dealt" with null values in Project 1, you likely did little more than call `.fillna(0)`. As we'll see over the next few weeks, null values deserve more care than that!

### Common representations of "null"

- All forms of `0` (e.g. `0`, `'0'`, `'zero'`) are common substitutes for null, **but as we'll see, simply filling nulls with 0 is not always useful statistically**.
- -1 is common if a column must be non-negative.
- 1900 and 1970 are common if a non-null date is required.
    - Remember, Unix time starts counting from January 1, 1970.

### Common representations of "null"

- Some common representations for "null" are also real values themselves!
- For instance, the point 0°00'00.0"N 0°00'00.0"E, in the Gulf of Guinea in the Atlantic Ocean, is called "Null Island."

<center><img src='imgs/null.png' width=60%></center>

- [This person's name is Mr. Null!](https://www.wired.com/2015/11/null/)
- Watch: [Null Island: The Busiest Place That Doesn't Exist](https://www.youtube.com/watch?v=bjvIpI-1w84).

### Missing values in the stops dataset

What are the non-`NaN` null values in the stops dataset?
- `'service_area'`: `'Unknown'`.
- `'subject_age'`: `0`, `'No Age'`.
- Others?
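
Placeholders like these can be converted to `NaN` at load time with the `na_values` argument of `pd.read_csv`. The sketch below uses a made-up three-row extract rather than the real file; only the placeholder strings come from the list above.

```python
import io
import pandas as pd

# Hypothetical extract with the same two column names as the stops data.
csv_text = 'service_area,subject_age\n110,24\nUnknown,No Age\n320,0\n'

# Declare, per column, which raw values should be read as missing.
sample = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'service_area': ['Unknown'], 'subject_age': ['No Age', '0']},
)
null_counts = sample.isna().sum()
```

Declaring the placeholders per column avoids accidentally nulling out a legitimate `0` in some other column.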

In [None]:
stops

### Finding null values in `pandas`

* Null values are typically encoded using NumPy's `NaN` value (`np.nan`), which is of type `float`.
* The `isna` method for DataFrame/Series detects missing values.
    - It returns a Boolean DataFrame/Series.
    - `isnull` is equivalent to `isna`.

In [None]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0.
type(np.nan)

In [None]:
# All of the rows where the subject age is missing.
stops[stops['subject_age'].isna()]

In [None]:
# Proportion of values missing in the subject_age column.
stops['subject_age'].isna().mean()

In [None]:
# Proportion of missing values in all columns.
stops.isna().mean()

### Dropping observations with null values
- The `dropna` method,
    - when used on a Series, returns a new Series with all null entries removed.
    - when used on a DataFrame, returns a new DataFrame where all rows with at least one null value are removed.
- Don't drop rows unless absolutely necessary!
    - Usually, there is still useful information in the other columns.

In [None]:
stops.dropna().head()

In [None]:
stops.dropna().shape

In [None]:
stops.shape

### Dropping observations with null values

When used on a DataFrame:

* `.dropna()` drops **rows** containing **at least one** null value.
* `.dropna(how='all')` drops **rows** containing **only** null values.
* `.dropna(axis=1)` drops **columns** containing at least one null value.
* Other keyword arguments: `thresh`, `subset`.

In [None]:
# np.nan is used instead of np.NaN, which was removed in NumPy 2.0.
nans = pd.DataFrame([['dog', 0, 1, np.nan], ['cat', np.nan, np.nan, np.nan], ['dog', 1, 2, 3]], columns='A B C D'.split())
nans

In [None]:
nans.dropna(how='any')

In [None]:
nans.dropna(how='all')

In [None]:
nans.dropna(subset=['B', 'C'])

In [None]:
nans.dropna(axis=1)
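
The cells above cover `how`, `subset`, and `axis`. A short sketch of `thresh`, which keeps only the rows with at least that many non-null values, on the same small DataFrame:

```python
import numpy as np
import pandas as pd

nans = pd.DataFrame(
    [['dog', 0, 1, np.nan], ['cat', np.nan, np.nan, np.nan], ['dog', 1, 2, 3]],
    columns=['A', 'B', 'C', 'D'],
)

# Keep rows with at least 3 non-null values; the 'cat' row has only 1.
kept = nans.dropna(thresh=3)
```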

### Filling null values

As you've seen, the `fillna` method replaces all null values. Specifically:

* `.fillna(val)` fills null entries with the value `val`.
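* `.fillna(dict)` fills null entries using a dictionary `dict` that maps column names (for a DataFrame) or index labels (for a Series) to fill values.
* `.bfill()` and `.ffill()` fill null entries using neighboring non-null entries. Older code often writes these as `.fillna(method='bfill')` and `.fillna(method='ffill')`, which are deprecated as of pandas 2.1.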

In [None]:
nans

In [None]:
# Filling all NaNs with the same value.
nans.fillna('zebra')

In [None]:
# Filling NaNs differently for each column.
nans.fillna({'B': 'f0', 'C': 'f1', 'D': 'f2'})

In [None]:
# Dictionary of column means.
# Note that most numerical methods ignore null values.
means = {c: nans[c].mean() for c in nans.columns[1:]}
means

In [None]:
# Filling NaNs with column means
nans.fillna(means)

In [None]:
# Another way of filling with column means; this version returns only the numeric columns.
nans.iloc[:, 1:].apply(lambda x: x.fillna(x.mean()), axis=0)

In [None]:
# bfill stands for "backfill"; this is equivalent to the older nans.fillna(method='bfill').
nans.bfill()

In [None]:
# ffill stands for "forward fill"; this is equivalent to the older nans.fillna(method='ffill').
nans.ffill()

### Filling null values, so far

- Both in Project 1, and in our brief exploration today, we filled null values either with a constant (like 0), or by using other information in the same column as the null value.

- What if we could use values in **other** columns to inform us on how to fill null values in a particular column? For instance, what if we could fill missing values in the `'D'` column differently for rows where:
    - The `'A'` column contains `'dog'`.
    - The `'A'` column contains `'cat'`.

- To do so, we'd need to know whether the missingness of a particular column (like `'D'`) looks like it **depends** on the value of `'A'`.

- To establish this relationship, we'd need to perform a **permutation test**, which is a form of **hypothesis test**.
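
As a first, informal look at this idea, the sketch below compares the proportion of missing `'D'` values across the values of `'A'`, and then fills `'D'` with the mean of its own group. With only three rows this is purely illustrative; it is not a substitute for the permutation test mentioned above.

```python
import numpy as np
import pandas as pd

nans = pd.DataFrame(
    [['dog', 0, 1, np.nan], ['cat', np.nan, np.nan, np.nan], ['dog', 1, 2, 3]],
    columns=['A', 'B', 'C', 'D'],
)

# Proportion of missing 'D' values within each value of 'A'.
missing_by_group = nans['D'].isna().groupby(nans['A']).mean()

# Fill each missing 'D' with the mean of 'D' among rows sharing the same 'A'.
filled_D = nans.groupby('A')['D'].transform(lambda s: s.fillna(s.mean()))
```

The `'cat'` row stays null after filling, since its group has no observed `'D'` values to average.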

## Hypothesis testing

### Hypothesis testing

- There are many competing ways of viewing how the world works.
    - That is, competing **models** for how the data were **generated**.
- How do we decide which models are unlikely to be true, and which are reasonable?

- We'll start with an intuitive exploration, and cover more details next class.

### Example: Coin flipping

- Suppose we find a coin on the ground. We flip it 100 times, and see 59 heads and 41 tails.

- We want to answer the question "**does this observation (59 heads, 41 tails) look like it came from a fair coin?**"

- More precisely, we want to decide between two possible **hypotheses**.
    - **Null Hypothesis**: The coin is fair.
    - **Alternative Hypothesis**: The coin is biased in favor of heads.

- To decide, we need to know how rare it is to see 59 heads and 41 tails, or a result that's even more biased in favor of heads, when flipping a fair coin 100 times.

### Test statistics

> To decide, we need to know how rare it is to see **59 heads and 41 tails, or a result that's even more biased in favor of heads**, when flipping a fair coin 100 times.

- To make this notion precise, we need to summarize our observation (59 heads and 41 tails) into a single number, called a **test statistic**, which will **help us decide between the two hypotheses**.

- Then, we can find the **distribution of the test statistic, under the null hypothesis**, which will give us a range of typical values for the test statistic under the assumption that the null hypothesis is true.

- **Question**: What are some possible test statistics in this scenario?

For the alternative hypothesis "the coin is biased in favor of heads", we could use:
* $N_H$ (number of heads).
* $\frac{N_H}{100}$ (proportion of heads).
* $N_H - 50$ (difference from expected number of heads).

For simplicity, we'll start with $N_H$.

- **If a value of this test statistic is large, it means there are many heads in 100 flips.** This would suggest that the coin flipped is biased in favor of heads (alternative hypothesis).

- **If a value of this test statistic is small, it means there are few heads in 100 flips.** This gives no evidence that the coin is biased in favor of heads, so it is consistent with the null hypothesis rather than the alternative.
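
For concreteness, here is how the three candidate statistics could be computed from the observed flips. Encoding heads as 1 and tails as 0 is an assumption made for this sketch.

```python
import numpy as np

# Observed sample: 59 heads (1) and 41 tails (0).
flips = np.array([1] * 59 + [0] * 41)

n_heads = flips.sum()              # number of heads
prop_heads = flips.mean()          # proportion of heads
diff_from_expected = n_heads - 50  # difference from the expected number of heads
```

All three move together: a larger number of heads means a larger value of each statistic, so they lead to the same conclusion.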

### Generating the null distribution

- Now that we've chosen a test statistic, we need to generate the distribution of the test statistic under the assumption the null hypothesis is true, i.e. the **null distribution**.

- This distribution will give us, for instance:
    - The probability of seeing 4 heads in 100 flips of a fair coin.
    - The probability of seeing at most 46 heads in 100 flips of a fair coin.
    - **The probability of seeing at least 59 heads in 100 flips of a fair coin.**

### Generating the null distribution, using math

The number of heads in 100 flips of a fair coin follows the $\text{Binomial(100, 0.5)}$ distribution, in which

$$P(\text{# heads} = k) = {100 \choose k} (0.5)^k{(1-0.5)^{100-k}} = {100 \choose k} 0.5^{100}$$

In [None]:
from scipy.special import comb

def p_k_heads(k):
    return comb(100, k) * (0.5) ** 100

The probability that we see at least 59 heads is then:

In [None]:
sum([p_k_heads(k) for k in range(59, 101)])
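
As a cross-check, the same tail probability, along with the other two example probabilities listed earlier, can be read off `scipy.stats.binom`. This is a sketch of an alternative to the hand-written formula, not a replacement for it.

```python
from scipy.stats import binom

# Number of heads in 100 flips of a fair coin.
null = binom(n=100, p=0.5)

p_exactly_4 = null.pmf(4)      # P(# heads = 4)
p_at_most_46 = null.cdf(46)    # P(# heads <= 46)
p_at_least_59 = null.sf(58)    # P(# heads >= 59), since sf(k) = P(# heads > k)
```

Note the off-by-one in the last line: the survival function is strict, so "at least 59" is `sf(58)`.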

Let's look at this distribution visually.

In [None]:
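# Exact null probabilities for k = 0, 1, ..., 100 heads.
plot_df = pd.DataFrame().assign(k = range(101))
plot_df['p_k'] = p_k_heads(plot_df['k'])
# The strings below act as category labels, not literal colors; plotly assigns its own palette,
# and with the default palette the k >= 59 bars typically render in red.
plot_df['color'] = plot_df['k'].apply(lambda k: 'orange' if k >= 59 else 'blue')

fig = plot_df.plot(kind='bar', x='k', y='p_k', color='color', width=1000)
fig.add_annotation(text='This red area is called the p-value!', x=77, y=0.008, showarrow=False)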

### Making a decision

We saw that, in 100 flips of a fair coin, $P(\text{# heads} \geq 59)$ is only about 4.4%.

- This is quite low – it suggests that our observed result is quite unlikely **under** the null.

- As such, we will **reject** the null hypothesis – our observation is **not consistent** with the hypothesis that the coin is fair.

- The null still may be true – it's possible that the coin we flipped was fair, and we just happened to see a rare result. For the same reason, we also **cannot "accept"** the alternative.

- This probability – **the probability of seeing a result at least as extreme as the observed, under the null hypothesis** – is called the p-value.
    - If the p-value is below a pre-defined cutoff (often 5%), we reject the null.
    - Otherwise, we fail to reject the null.
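
The decision rule can be written out in a few lines. The 5% cutoff is the conventional choice mentioned above, and the variable names are illustrative.

```python
from math import comb

# p-value: P(# heads >= 59) in 100 flips of a fair coin.
p_value = sum(comb(100, k) for k in range(59, 101)) / 2 ** 100

cutoff = 0.05
reject_null = p_value < cutoff
```

With a stricter cutoff such as 1%, the same observation would lead us to fail to reject the null, which is why the cutoff should be chosen before looking at the data.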

### Fun fact

- One researcher found that coin flips aren't 50/50, but rather are closer to 51/49, biased towards whichever side started facing up.
- [Read this](https://www.smithsonianmag.com/science-nature/gamblers-take-note-the-odds-in-a-coin-flip-arent-quite-5050-145465423) for more details.

## Summary, next time

### Summary

- Data cleaning is the process of transforming data so that it is an accurate representation of the data generating process.
- Unfaithful data is data that is not representative of the data generating process. When working with messy data, we must look for:
    - Missing values (i.e. "null" values).
    - Incorrect values.
- Useful methods to be aware of: `fillna`, `isna`/`isnull`, `dropna`.
- Hypothesis testing allows us to assess whether observed data are consistent with a proposed model of the data generating process.

### Next time

- Hypothesis testing via simulation, as you saw in DSC 10. 
- More examples and test statistics (e.g. TVD).