# SW 282: Lab 7 - Hypotheses

---

### Professor Erin Kerrison

In this notebook, you will use the matplotlib skills you've worked with thus far to plot histograms and to generate Z-scores for randomly selected data points.

In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [9,6]

To demonstrate the concepts in this notebook, we will use the [US EIA State Energy Data System ("SEDS")](https://www.eia.gov/state/seds/) dataset, which includes national data collected on energy use and pricing. The SEDS dataset enables us to create a historical time series of energy production, consumption, prices, and expenditures by state, all of which are defined as consistently as possible over time and across sectors for analysis and forecasting purposes.

The specific data points we will be working with describe the median total energy cost by state and year, in current dollars per million BTUs.

In [None]:
# Let's begin as we normally do, by glancing at an abbreviated table illustrating how the data are organized.

seds = Table.read_table("seds-data.csv")
seds

## Histograms

Now that we've pulled the dataset into Python and can see how the variables are organized in table form, let's turn to a second visual, which captures the distributions of the energy prices across three decades (in the years 1980, 1990, and 2000). We will also compute the standard deviation for the data distribution in each of those three years.

In [None]:
seds_1980 = seds.where("Year", are.equal_to(1980))

seds_1980.hist("Energy Price", bins=20)
plt.title("1980 Energy Price");

sd_1980 = np.std(seds_1980.column("Energy Price"))
print("The standard deviation for 1980 is: {:.5f}".format(sd_1980))

In [None]:
seds_1990 = seds.where("Year", are.equal_to(1990))

seds_1990.hist("Energy Price", bins=20)
plt.title("1990 Energy Price");

sd_1990 = np.std(seds_1990.column("Energy Price"))
print("The standard deviation for 1990 is: {:.5f}".format(sd_1990))

In [None]:
seds_2000 = seds.where("Year", are.equal_to(2000))

seds_2000.hist("Energy Price", bins=20)
plt.title("2000 Energy Price");

sd_2000 = np.std(seds_2000.column("Energy Price"))
print("The standard deviation for 2000 is: {:.5f}".format(sd_2000))

The histograms all look very similar, but that could be because they're all being plotted on axes of different scales. Recall that we don't believe in junk science or distorted visuals!!

So, let's plot all of the histograms on the same pair of axes to see how the distributions are changing when compared in a standardized visual format.

In [None]:
for year in np.arange(1980, 2001, 10):
    plt.hist(x="Energy Price", data=seds.where("Year", year), bins=20, alpha=.5)

plt.xlabel("Median Energy Costs in 2019 $, per million BTUs")
# plt.hist plots raw counts by default (not percent per unit), so label the axis accordingly
plt.ylabel("Count");

plt.legend(np.arange(1980, 2001, 10));

The plot that you just created shows that although the _shape_ of the distribution hasn't changed much over time, its _scale_ has increased. You can see that the data from 1980, 1990, and 2000 show a greater spread and are centered at a higher value as time passes.
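One caveat when overlaying histograms: if each one is drawn with its own `bins=20`, the bin widths differ from year to year, so bar heights are not directly comparable. A minimal sketch of one way to share bin edges is below. It uses hypothetical randomly generated samples (`early` and `late` are illustrative names, not SEDS data) and only NumPy; the resulting `edges` array could then be passed as `bins=edges` to each `plt.hist` call.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical samples standing in for two years of prices (not SEDS data)
early = rng.normal(loc=5.0, scale=1.0, size=50)
late = rng.normal(loc=9.0, scale=2.0, size=50)

# Compute one set of 20 bins that spans both samples
edges = np.histogram_bin_edges(np.concatenate([early, late]), bins=20)

# Count each sample using the same edges, so bar heights are comparable
early_counts, _ = np.histogram(early, bins=edges)
late_counts, _ = np.histogram(late, bins=edges)

print("number of bin edges:", len(edges))
print("early counts:", early_counts)
print("late counts: ", late_counts)
```

Because both samples are binned with the same edges, every bar covers the same price range regardless of which sample it belongs to.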

## Z-Scores

In this section, we want to calculate the Z-score of a randomly selected data point from one of the years shown above. To calculate the Z-score of a data point, we can use the function `scipy.stats.zscore`, which takes in an array of values and returns an array of Z-scores.
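As a reminder, the Z-score of a value $x$ is $z = (x - \mu)/\sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the data. The sketch below computes Z-scores by hand on a small hypothetical array (`toy_prices` is made up for illustration, not SEDS data). It uses `np.std`, which by default computes the population standard deviation (`ddof=0`); `scipy.stats.zscore` uses the same default, so the two approaches should agree.

```python
import numpy as np

# Hypothetical toy prices (not taken from the SEDS data)
toy_prices = np.array([4.0, 6.0, 8.0, 10.0, 12.0])

toy_mean = np.mean(toy_prices)
toy_sd = np.std(toy_prices)  # population SD (ddof=0), as used earlier in this notebook

# Z-score: how many standard deviations each value lies from the mean
toy_zs = (toy_prices - toy_mean) / toy_sd

for price, z in zip(toy_prices, toy_zs):
    print("price {:5.1f} -> Z-score {:8.5f}".format(price, z))
```

A value equal to the mean has a Z-score of 0, values below the mean have negative Z-scores, and values above it have positive ones.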

To randomly select the value, we'll use `np.random.choice` to choose from an array of _index values_ (generated using `np.arange` and the length of the array) which will correspond both to the values of the array of data points _and_ to the value in the array returned by `stats.zscore`. Remember that in Python, the index values for an array of length $n$ go from $0$ to $n-1$ and that we can access the element at index $i$ using square brackets: `arr[i]`.
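The index-based selection described above can be sketched on a small hypothetical array (`toy_values` is made up for illustration). Because the array of Z-scores has the same order and length as the original values, one randomly chosen index picks out a matching pair.

```python
import numpy as np

# Hypothetical values and their Z-scores, stored in the same order
toy_values = np.array([10.0, 20.0, 30.0, 40.0])
toy_value_zs = (toy_values - np.mean(toy_values)) / np.std(toy_values)

# Index values 0, 1, ..., n - 1 for an array of length n
toy_indices = np.arange(len(toy_values))

# Pick one index at random; the result changes from run to run
i = np.random.choice(toy_indices)

# The same index retrieves the value and its corresponding Z-score
print("index {}: value {} has Z-score {:.5f}".format(i, toy_values[i], toy_value_zs[i]))
```

Selecting an index rather than a value directly is what lets us look up the corresponding entry in a second array.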

In [None]:
from scipy import stats

# extract the array of data points from the table created above
prices_1980 = seds_1980.column("Energy Price")

# create an array of indices using the length of the array
num_1980 = len(prices_1980)
indices_1980 = np.arange(num_1980)   # returns an array 0, 1, 2, ..., num_1980 - 1

# randomly select the index
idx_1980 = np.random.choice(indices_1980)

# calculate the Z-scores
zs_1980 = stats.zscore(prices_1980)

print("1980: The Z-score for {} is {:.5f}".format(prices_1980[idx_1980], zs_1980[idx_1980]))

Try running that last cell a few more times and take note of how the Z-score changes each time a different state's median energy price is randomly selected.

<div class="alert alert-info">

**QUESTION:** Relying on your understanding of the code you just ran for the 1980 data sample, in the cells below, please calculate the Z-scores for randomly selected values from 1990 and 2000.

</div>

In [None]:
# extract the array of data points from the table created above
prices_1990 = seds_1990.column("Energy Price")

# create an array of indices using the length of the array
num_1990 = len(prices_1990)
indices_1990 = np.arange(num_1990)   # returns an array 0, 1, 2, ..., num_1990 - 1

# randomly select the index
idx_1990 = np.random.choice(indices_1990)

# calculate the Z-scores
zs_1990 = stats.zscore(prices_1990)

print("1990: The Z-score for {} is {:.5f}".format(prices_1990[idx_1990], zs_1990[idx_1990]))

<div class="alert alert-info">

**QUESTION:** I know it's fun to generate randomly selected median energy cost values from 1990, but please settle on one in order to answer the following question: How do you interpret that output? In other words, what does the randomly selected value tell you AND what does its Z-score suggest?

</div>

_**Type your answer here, replacing this text.**_

In [None]:
# extract the array of data points from the table created above
prices_2000 = ...

# create an array of indices using the length of the array
num_2000 = ...
indices_2000 = ...

# randomly select the index
idx_2000 = ...

# calculate the Z-scores
zs_2000 = ...

print("2000: The Z-score for {} is {:.5f}".format(prices_2000[idx_2000], zs_2000[idx_2000]))

<div class="alert alert-info">

**QUESTION:** Again, please choose one randomly selected energy cost value from 2000 to answer the following: How do you interpret that output? What does the randomly selected value tell you AND what does its Z-score suggest?

</div>

_**Type your answer here, replacing this text.**_

---

### References

The data for this notebook is from https://www.eia.gov/state/seds/seds-data-complete.php?sid=US#CompleteDataFile.