# SW 282: Lab 7 - Hypotheses

---

### Professor Erin Kerrison

In this notebook, you will use the matplotlib skills you've learned to plot histograms and to generate Z-scores for randomly selected data points.

In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = [9,6]

To demonstrate the concepts in this notebook, we will use data from the U.S. Energy Information Administration's [State Energy Data System](https://www.eia.gov/state/seds/) (SEDS), which tracks energy use and pricing. The data set we will be working with describes the median total energy cost by state and year, in current dollars per million [BTUs](https://en.wikipedia.org/wiki/British_thermal_unit).

In [None]:
seds = Table.read_table("seds-data.csv")
seds

## Histograms

In this section, we'll look at the distribution of energy prices at three points a decade apart (1980, 1990, and 2000). For each year, we also compute the standard deviation of that year's prices.
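
`np.std`, used in the cells that follow, divides by $n$ by default (`ddof=0`), which gives the population standard deviation; passing `ddof=1` gives the sample version that divides by $n - 1$. A minimal sketch of the difference, using made-up prices rather than values from `seds-data.csv`:

```python
import numpy as np

# hypothetical prices, for illustration only (not taken from the SEDS data)
prices = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# population SD: square root of the mean squared deviation (divide by n)
pop_sd = np.sqrt(np.mean((prices - prices.mean()) ** 2))

# sample SD: same sum of squared deviations, divided by n - 1
sample_sd = np.std(prices, ddof=1)

print("np.std default: {:.5f}".format(np.std(prices)))
print("by hand (divide by n): {:.5f}".format(pop_sd))
print("ddof=1 (divide by n - 1): {:.5f}".format(sample_sd))
```

The first two printed values agree, and the `ddof=1` value is slightly larger. With around 50 data points per year the two versions differ only a little, but it helps to know which one a function reports.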

In [None]:
seds_1980 = seds.where("Year", are.equal_to(1980))

seds_1980.hist("Energy Price", bins=20)
plt.title("1980 Energy Price");

sd_1980 = np.std(seds_1980.column("Energy Price"))
print("The standard deviation for 1980 is: {:.5f}".format(sd_1980))

In [None]:
seds_1990 = seds.where("Year", are.equal_to(1990))

seds_1990.hist("Energy Price", bins=20)
plt.title("1990 Energy Price");

sd_1990 = np.std(seds_1990.column("Energy Price"))
print("The standard deviation for 1990 is: {:.5f}".format(sd_1990))

In [None]:
seds_2000 = seds.where("Year", are.equal_to(2000))

seds_2000.hist("Energy Price", bins=20)
plt.title("2000 Energy Price");

sd_2000 = np.std(seds_2000.column("Energy Price"))
print("The standard deviation for 2000 is: {:.5f}".format(sd_2000))

The histograms all look very similar, but that could be because each one is plotted on axes with a different scale. Let's plot all three histograms on the same pair of axes to see how the distribution changes over time.

In [None]:
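# overlay the three years on one pair of axes so they share a scale
for year in np.arange(1980, 2001, 10):
    prices = seds.where("Year", are.equal_to(year)).column("Energy Price")
    plt.hist(prices, bins=20, alpha=.5, label=str(year))

plt.xlabel("Energy Price")
plt.legend();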

The plot above shows that although the _shape_ of the distribution hasn't changed much over time, its _scale_ has increased (meaning that the data have greater spread and are centered at a higher value).
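
One caveat when overlaying histograms: `bins=20` is computed separately for each call to `plt.hist`, so each year gets its own bin edges and bar widths. Below is a minimal sketch of one way to put every year on the same bins. It uses made-up data in place of the SEDS prices; the `fake_prices` dictionary and its parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-ins for the per-year price arrays (not real SEDS values)
fake_prices = {
    1980: rng.normal(loc=7, scale=1.0, size=51),
    1990: rng.normal(loc=9, scale=1.5, size=51),
    2000: rng.normal(loc=11, scale=2.0, size=51),
}

# compute one set of 20 bin edges spanning all years combined
all_prices = np.concatenate(list(fake_prices.values()))
shared_edges = np.histogram_bin_edges(all_prices, bins=20)

# count each year's values against the same edges
for year, prices in fake_prices.items():
    counts, _ = np.histogram(prices, bins=shared_edges)
    print(year, counts)
```

Passing an array of edges such as `shared_edges` as the `bins` argument of `plt.hist`, in place of `bins=20`, would draw every year with identical bin widths.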

## Z-Scores

In this section, we want to calculate the Z-score of a randomly selected data point from one of the years shown above. To calculate Z-scores, we can use the function `scipy.stats.zscore`, which takes in an array of values and returns an array of Z-scores in the same order. To randomly select a value, we'll use `np.random.choice` to choose from an array of _index values_; the chosen index picks out both the data point in the original array _and_ its Z-score in the array returned by `stats.zscore`.
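
For reference, the Z-score of a data point $x_i$ measures how many standard deviations it lies from the mean. With `stats.zscore` at its default setting (`ddof=0`), $\sigma$ is the population standard deviation:

$$
z_i = \frac{x_i - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(x_j - \mu\right)^2}
$$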

In [None]:
from scipy import stats

# extract the array of data points from the table created above
prices_1980 = seds_1980.column("Energy Price")

# create an array of indices using the length of the array
num_1980 = len(prices_1980)
indices_1980 = np.arange(num_1980)

# randomly select the index
idx_1980 = np.random.choice(indices_1980)

# calculate the Z-scores
zs_1980 = stats.zscore(prices_1980)

print("1980: The Z-score for {} is {:.5f}".format(prices_1980[idx_1980], zs_1980[idx_1980]))
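
To see why a single index works for both arrays, here is a minimal sketch with made-up values (not SEDS data) that computes the Z-scores by hand with NumPy. It assumes the population standard deviation, which matches the default `ddof=0` of `stats.zscore`.

```python
import numpy as np

# hypothetical prices, for illustration only
prices = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Z-score: distance from the mean in units of the population standard deviation
zs = (prices - prices.mean()) / prices.std()

# zs has the same length and order as prices, so one index addresses both
np.random.seed(0)  # fixed seed so the sketch is repeatable
idx = np.random.choice(np.arange(len(prices)))

print("The Z-score for {} is {:.5f}".format(prices[idx], zs[idx]))
```

A full set of Z-scores always has mean 0 and standard deviation 1, which is what makes values from years with different price levels comparable.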

**Question:** In the cells below, calculate the Z-scores for randomly selected values from 1990 and 2000.

In [None]:
# extract the array of data points from the table created above
prices_1990 = ...

# create an array of indices using the length of the array
num_1990 = ...
indices_1990 = ...

# randomly select the index
idx_1990 = ...

# calculate the Z-scores
zs_1990 = ...

print("1990: The Z-score for {} is {:.5f}".format(prices_1990[idx_1990], zs_1990[idx_1990]))

In [None]:
# extract the array of data points from the table created above
prices_2000 = ...

# create an array of indices using the length of the array
num_2000 = ...
indices_2000 = ...

# randomly select the index
idx_2000 = ...

# calculate the Z-scores
zs_2000 = ...

print("2000: The Z-score for {} is {:.5f}".format(prices_2000[idx_2000], zs_2000[idx_2000]))

---

### References

The data for this notebook is from https://www.eia.gov/state/seds/seds-data-complete.php?sid=US#CompleteDataFile.