# Lecture 12 (4/22/2022)

**Announcements**



*Last time we covered:*
- Tidy data (wide / long format)

**Today's agenda:**
- Data transformations
> Part 1: logarithmic transformations
>
> Part 2: z-scoring


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Part 1: log transformations

## Why we (sometimes) need log transformations

To get a sense of when we may need to transform our data, let's take a look at our `gapminder` dataset once again:

In [None]:
gap = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/gapminder.csv")

gap

Below we plot the distribution of *population* across the countries in this dataset for the most recent year available.

What does it look like?

In [None]:
g = sns.histplot(
    data = gap[gap["year"] == 2007],
    x = "pop",
    bins = 20 # even with more bins the distribution is similar
)
g.set_xlabel("Population (billions)")
g.set_title("Population by country, 2007")

In [None]:
# How skewed is the data above?

# Print both values (a notebook cell only displays its last expression)
pop_2007 = gap["pop"][gap["year"] == 2007]
print(pop_2007.mean()) # ~44M
print(pop_2007.median()) # ~10.5M


# ...big difference

**Why is this a problem?**

Many standard statistical methods assume that the variables being predicted or serving as predictors are (roughly) normally distributed. Our population data above clearly isn't!
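One quick numeric check for this kind of skew is to compare the mean and the median: in a right-skewed sample, a handful of very large values pull the mean well above the median. The sketch below uses a synthetic log-normal sample (the parameters are made up for illustration; this is not the gapminder data).

```python
import numpy as np

# Synthetic right-skewed sample; mean/sigma/size are arbitrary assumptions
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=16, sigma=1.5, size=5000)

sample_mean = sample.mean()
sample_median = np.median(sample)

# In a right-skewed distribution the mean sits well above the median
print(f"mean:          {sample_mean:,.0f}")
print(f"median:        {sample_median:,.0f}")
print(f"mean / median: {sample_mean / sample_median:.2f}")
```

A ratio far from 1 is a hint (not a formal test) that the variable is strongly skewed.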

**How common is this?**

The pattern above isn't unique to population. Many other common variables tend to have similarly shaped distributions. 

*Can you think of some others?* (hint: remember Zipf's Law back in pset 1?)

## What to do with data like the above: log transform!

**What do we mean by log transforming?**

We want to convert our population data to a logarithmic scale rather than a linear scale.

We'll illustrate this difference below.
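As a minimal sketch of the distinction, with endpoints chosen only for illustration: points on a linear scale are separated by a fixed *difference*, while points on a logarithmic scale are separated by a fixed *ratio*.

```python
import numpy as np

# Linear scale: 5 points from 1 to 10,000, separated by a fixed difference
linear = np.linspace(1, 10_000, num=5)

# Log scale: 5 points from 10**0 to 10**4, separated by a fixed ratio
logarithmic = np.logspace(0, 4, num=5)

linear_steps = np.diff(linear)                   # constant differences
log_ratios = logarithmic[1:] / logarithmic[:-1]  # constant ratios

print("linear:     ", linear)
print("differences:", linear_steps)
print("log:        ", logarithmic)
print("ratios:     ", log_ratios)
```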

In [None]:
# Our original graph used bins spaced at linear intervals, on a linear x axis.
# We're printing the x-axis tick locations below.

_ = sns.histplot(
    data = gap[gap["year"] == 2007],
    x = "pop",
    bins = 20
)

# Here are the original x-axis ticks (not the bin edges themselves): note how they increase by a fixed amount
print("Axis ticks: {}".format([str(elem) for elem in plt.xticks()[0]]))

# Can print each element minus the previous element to show this
# [plt.xticks()[0][i+1] - plt.xticks()[0][i] for i in range(len(plt.xticks()[0])-1)]


In [None]:
# Step 1: let's generate logarithmically scaled bins instead of the above.


# Instead of increasing by a fixed amount, the bin edges will increase by a fixed *multiple* (about 2.78 here) from 100k to 1B
# Use `np.logspace` for this
# Note: values outside the range 1e5 to 1e9 fall outside these bins and won't be counted in the histogram
log_bins = np.logspace(base = 10, # specify the base (10 is default)
                       start = 5, # specify the start, which is base**start (in this case 10**5, i.e. 1e5)
                       stop = 9, # specify the end, which is base**stop (in this case 10**9, i.e. 1e9)
                       num = 10) # specify the number of bin *edges* (10 edges define 9 bins)

[str(elem) for elem in log_bins]

# These don't increase by a fixed amount
# Can print each element minus the previous element as we did above to show this

# [log_bins[i+1] - log_bins[i] for i in range(len(log_bins)-1)]

# Instead, they increase by a fixed *multiple*
# Can print each element in log_bins *divided by* the previous element to show this

# [log_bins[i+1] / log_bins[i] for i in range(len(log_bins)-1)]

In [None]:
# Now, let's use these logarithmic intervals as the basis for our histogram bins

g = sns.histplot(
    data = gap[gap["year"] == 2007],
    x = "pop",
    bins = log_bins # This is the key change
)

# NOTE: we need to also specify that the x axis should be drawn on a logarithmic scale 
# (try graphing without this to see what happens!)
g.set_xscale('log')


Our data looks roughly normally distributed when we plot it on a log scale. Woo hoo!

But we haven't changed the underlying data. 

Let's log transform *the data itself* so its (log) values are roughly normally distributed.


In [None]:
# To do this, use np.log10 (np.log uses the *natural log*)
gap['log_pop'] = np.log10(gap['pop'])

gap

Now what? Let's take a look at our *log transformed* population variable.

Is it normally distributed?

In [None]:
g = sns.histplot(data = gap[gap['year'] == 2007], x = 'log_pop')
g.set_xlabel("Log transformed population")

## Log transformations: Summary

- Statistical and modeling methods often assume that the underlying variables are *normally distributed*

- You can count on many variables in the world being roughly normally distributed (especially with large samples!) 

- But certain types of data are **reliably not** normally distributed (e.g., income, wealth, population, number of Twitter followers, number of Spotify streams, ...)

- When your data looks like the examples above (rule of thumb: roughly exponentially distributed, or with a very large right skew), it's often the case that the *logarithm* of the data *is roughly normally distributed*.

- You can check whether this is true by plotting it on a log scale as we did above. If so, consider *log transforming* your data.
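As a numeric complement to plotting, one option is to compare the sample skewness before and after the transform. The sketch below is illustrative only: it uses synthetic log-normal data with made-up parameters, and `Series.skew()` as a rough indicator rather than a formal normality test.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data; the parameters are arbitrary assumptions
rng = np.random.default_rng(1)
raw = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000))

# Same transform used for the population column: base-10 logarithm
logged = np.log10(raw)

raw_skew = raw.skew()     # large and positive for right-skewed data
log_skew = logged.skew()  # close to 0 if the log values are roughly symmetric

print(f"skew before log transform: {raw_skew:.2f}")
print(f"skew after log transform:  {log_skew:.2f}")
```

A skewness near 0 after the transform is consistent with (but does not prove) a roughly normal shape.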

Note: using the log transformed values for a variable in statistics or regression changes how you interpret your results (for example, regression coefficients on a log-transformed variable X will reflect the impact of *multiplicative* changes to X on the output variable Y). 
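To make the multiplicative interpretation concrete, here is a minimal sketch on synthetic data (the true intercept 2.0, slope 5.0, and noise level are assumptions chosen for illustration). When Y is regressed on `log10(X)`, the fitted slope is the change in predicted Y for every *tenfold* increase in X, wherever on the X scale that increase starts.

```python
import numpy as np

# Synthetic data: x spans several orders of magnitude, y is linear in log10(x)
rng = np.random.default_rng(2)
x = 10 ** rng.uniform(3, 7, size=500)
y = 2.0 + 5.0 * np.log10(x) + rng.normal(0, 0.5, size=500)

# Fit y = intercept + slope * log10(x); polyfit returns the highest degree first
slope, intercept = np.polyfit(np.log10(x), y, deg=1)

def predict(x_value):
    """Predicted y for a given (untransformed) x."""
    return intercept + slope * np.log10(x_value)

# Multiplying x by 10 adds `slope` to the prediction, from any starting point
step_low = predict(1e5) - predict(1e4)
step_high = predict(1e6) - predict(1e5)

print(f"fitted slope:            {slope:.3f}")
print(f"change from 1e4 to 1e5:  {step_low:.3f}")
print(f"change from 1e5 to 1e6:  {step_high:.3f}")
```

With a natural log (`np.log`) instead of `np.log10`, the same slope would correspond to multiplying X by *e* rather than by 10.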

## [OPTIONAL] Why log transformations work for exponential data

***

# Part 2: z-scoring