# Data Science Methods for Clean Energy Research 
## *Descriptive Statistics*

## Outline
### 1. Computing the mean, median and variance of a dataset population
### 2. Drawing samples and distribution of sample mean & variance
### 3. Bootstrapping


---

## Import libraries and dataset

We import the libraries we will need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from IPython.display import clear_output, display
%matplotlib inline
matplotlib.rcParams.update({'font.size': 22})

You can find many datasets on Kaggle, ranging from the search for exoplanets to world happiness!

https://www.kaggle.com/unsdsn/world-happiness

In [None]:
df = pd.read_csv('datasets/2015.csv')
df.describe()

We will assume that this dataset is complete and that it is somehow the full **population**. In reality it would still be a sample. Let's look at a histogram of the 'Happiness Score'. The histogram represents the **distribution** of the population of Happiness Scores.

In [None]:
df['Happiness Score'].hist(bins=18, color='royalblue', alpha=0.8)
plt.xlabel('Happiness Score')
plt.ylabel('Distribution of \n Happiness Score population')
plt.show()

We can see that the population is not normally distributed, i.e. it **does not follow** a perfect Gaussian (Normal) distribution:

$$p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }} e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} }$$

In [None]:
df['Happiness Score'].hist(bins=18, density=True, color='royalblue',alpha=0.8)
plt.xlabel('Happiness Score')
plt.ylabel('Probability distribution of \n Happiness Score population')

# Let's see what the best gaussian would look like
sigma = 1.145010
mu = 5.375734
bins = np.linspace(2.839000,7.587000,30)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),linewidth=4, color='red')
plt.show()

What if we estimated the probability distribution using a [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation)? We will see that, once again, it is not a Gaussian distribution.

In [None]:
df['Happiness Score'].plot.kde(linewidth=4,color="crimson")
plt.xlabel('Happiness Score')
plt.ylabel('Probability distribution of \n Happiness Score population')
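To make the idea of a kernel density estimate concrete, here is a minimal sketch written with NumPy only: one Gaussian bump is placed on every data point and the bumps are averaged. The synthetic `scores` array, the helper name `gaussian_kde_1d` and the bandwidth of 0.4 are assumptions made for illustration. pandas' `plot.kde` relies on SciPy and chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point (rows) and every sample (columns)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Synthetic, deliberately bimodal stand-in for the Happiness Score column
scores = np.concatenate([rng.normal(4.5, 0.6, 80), rng.normal(6.8, 0.5, 78)])

grid = np.linspace(0.0, 11.0, 1101)
density = gaussian_kde_1d(scores, grid, bandwidth=0.4)

# A probability density should integrate to approximately 1
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth follows the data more closely but gives a noisier curve; a larger one smooths away features such as a second peak.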

## 1) Computing the mean, median, standard deviation and variance

Wait, how did I get the **mean** and **standard deviation**? We assumed we had the entire population, so we computed them simply as

$${\mu}=\frac{\sum_{i=1}^N X_i}{N}$$ 

$$\sigma=\sqrt{\sum_{i=1}^N\frac{(X_i-\mu)^2}{N}}$$


In [None]:
N = len(df['Happiness Score'].values )
print(N)
mean = np.sum( df['Happiness Score'].values ) / N 
print(mean)
mean = 0.0
for i in range(N):
    mean = mean + df['Happiness Score'].values[i]
mean = mean /N
print(mean)

In [None]:
df['Happiness Score'].describe()

### Exercise 1: Compute the standard deviation and variance 

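Is the standard deviation slightly different? That is because `describe` reports the corrected sample standard deviation, which divides by $N-1$ instead of $N$ (Bessel's correction). This correction makes the sample *variance* an unbiased estimate of the population variance; the standard deviation itself is still slightly biased.

$$s=\sqrt{\sum_{i=1}^{N}\frac{(X_i-\bar{X})^2}{N-1}}$$

where $\bar{X}$ is the sample mean.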
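One possible way to see the two conventions side by side is sketched below. It uses a synthetic array `x` as a stand-in for the Happiness Score values (an assumption, so the numbers will not match the notebook), and compares the explicit formulas with NumPy's `ddof` argument. pandas' `.std()` and `.describe()` use `ddof=1` by default.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5.4, 1.1, size=150)  # synthetic stand-in for the scores

N = len(x)
mu = x.sum() / N

pop_var = np.sum((x - mu) ** 2) / N            # population variance: divide by N
sample_var = np.sum((x - mu) ** 2) / (N - 1)   # corrected variance: divide by N - 1

pop_std = np.sqrt(pop_var)
sample_std = np.sqrt(sample_var)

# NumPy exposes the same choice through ddof ("delta degrees of freedom")
print(pop_std, np.std(x, ddof=0))
print(sample_std, np.std(x, ddof=1))
```

The corrected value is always slightly larger, and the gap shrinks as `N` grows.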

### Exercise 2: Compute the median 
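As a hedged starting point, the median can be computed by sorting the values and taking the middle one, or the average of the two middle ones when the count is even. The helper name `median_by_sorting` and the toy lists are assumptions for illustration. In the notebook, the result for `df['Happiness Score']` would need to be stored in a variable named `median` for the plotting cell that follows.

```python
import numpy as np

def median_by_sorting(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])

odd = [3.0, 1.0, 2.0]
even = [4.0, 1.0, 3.0, 2.0]
print(median_by_sorting(odd), median_by_sorting(even))  # 2.0 2.5
```

`np.median` and the pandas `.median()` method should give the same answers.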

Where do these values end up on the plot?

In [None]:
df['Happiness Score'].hist(bins=18,density=True,color='royalblue',alpha=0.8)
df['Happiness Score'].plot.kde(linewidth=4,color="crimson")
plt.plot(median*np.ones(10),np.linspace(0,0.5,10),lw=3,color='black',label='median')
plt.plot(mean*np.ones(10),np.linspace(0,0.5,10),lw=3,color='limegreen',label='mean')
plt.xlabel('Happiness Score')
plt.ylabel('Probability distribution of \n Happiness Score population')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize='small')
plt.xlim([2.6,7.6])

## 2) Sampling from a population, sample distributions & the central limit theorem

Let's go back to our Clean Energy Project database and load the data. Note: make sure you have the correct path to that file.

In [None]:
data = pd.read_csv('http://faculty.washington.edu/dacb/HCEPDB_moldata.zip')

In [None]:
data.describe()

This dataset is huge! We will take a random slice and assume that our slice represents a full population (again this is actually a large sample) and look at the mass values. We are making this approximation to accelerate the calculations in the notebook.

In [None]:
population = data.sample(frac = 0.05) 
population.describe()

In [None]:
true_mean = population['mass'].mean()
true_stdev = population['mass'].std(ddof=0)  # population formula: divide by N
print("pop mean", true_mean, "pop stdev", true_stdev)

In [None]:
population['mass'].plot.hist(bins=50, color='violet')

In [None]:
population['mass'].plot.kde(color='purple',lw=4)

Our goal is to **sample** from the mass values to estimate the true population mean and standard deviation. Let's write a function that samples the dataframe, with an argument *n* that is the number of data points to draw. We want to sample without replacement.

In [None]:
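def draw_sample(df, column, n):
    # Pick n distinct row labels (replace=False means sampling without replacement)
    subset_indices = np.random.choice(df[column].index.to_numpy(), size=n, replace=False)
    # Look the labels up and return the values as a one-column dataframe
    sample = pd.DataFrame(data=df[column].loc[subset_indices].values, columns=['sample'])
    return sample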

In [None]:
sample = draw_sample(population,'mass', 20)

In [None]:
sample
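The difference between sampling with and without replacement can be illustrated with a small sketch. The ten integer labels and the variable names are assumptions for illustration; the point is only the effect of the `replace` argument.

```python
import numpy as np

rng = np.random.default_rng(2)
labels = np.arange(10)

without = rng.choice(labels, size=10, replace=False)   # every label at most once
with_repl = rng.choice(labels, size=10, replace=True)  # labels may repeat

# Number of distinct labels drawn in each case
print(np.unique(without).size, np.unique(with_repl).size)

# Without replacement, asking for more points than exist is an error
try:
    rng.choice(labels, size=11, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)
```

When the sample is much smaller than the population, as it is here with the mass values, the two schemes give nearly the same results.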

Now we want to draw *M* samples of size *n* from the population and see what the mean and standard deviation are for these samples.

### Exercise 3: Breakout room

Create a function which calls <code>draw_sample</code> *M* times and returns the mean and standard deviation of each sample. 

Input arguments should include

* a variable, say <code>sample_funct</code>, that will be used to refer to the <code>draw_sample</code> function
* a variable for the dataframe
* a variable for the column of interest
* a variable for the number of calls *M*
* a variable for the number of data points per sample *n*

The output should include

* a list which contains the means
* a list which contains the standard deviations

Hint: your function might look like:

<code>def repeat_samples_stats(sample_funct, df, column, M, n):  
   means = []  
   stdevs = []  
   ...  
   return (means, stdevs)
</code>

Then use the append method to append each mean and sd value to the end of its respective list.
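One possible solution is sketched below; try the exercise first. To keep the block runnable on its own it uses a hypothetical stand-in sampler, `toy_draw_sample`, and a synthetic `toy_population`, both of which are assumptions. In the notebook you would pass `draw_sample` and `population` instead.

```python
import numpy as np
import pandas as pd

def toy_draw_sample(df, column, n):
    # Hypothetical stand-in for draw_sample: n values without replacement,
    # returned as a one-column dataframe named 'sample'
    picked = df[column].sample(n=n, replace=False)
    return picked.to_frame(name='sample').reset_index(drop=True)

def repeat_samples_stats(sample_funct, df, column, M, n):
    means = []
    stdevs = []
    for _ in range(M):
        sample = sample_funct(df, column, n)
        means.append(sample['sample'].mean())
        stdevs.append(sample['sample'].std())  # pandas default: N - 1 in the denominator
    return (means, stdevs)

rng = np.random.default_rng(3)
toy_population = pd.DataFrame({'mass': rng.normal(400.0, 60.0, size=5000)})

means, stdevs = repeat_samples_stats(toy_draw_sample, toy_population, 'mass', M=20, n=50)
print(len(means), len(stdevs))  # 20 20
```

Passing the sampling function as an argument means the same loop can later be reused with a different sampler.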

Let's use our function to draw 500 samples and compute 500 means and 500 standard deviations. We will extract 50 points per sample.

In [None]:
means_1, stdevs_1 = repeat_samples_stats(draw_sample, population, 'mass', 500, 50)

In [None]:
means_1

What does the distribution of the means look like? In other words, let's treat all these means as data points that form a sample or population of their own. We will first use matplotlib.

In [None]:
plt.hist(means_1,bins=50,color='mediumaquamarine')
plt.show()

### Exercise 4 (in class) - play with the number of samples and points per sample

In [None]:
means_2, stdevs_2 = repeat_samples_stats(draw_sample, population, 'mass', 500, 1000)

In [None]:
plt.hist(means_2, bins=50,color='mediumaquamarine')
plt.show()
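The narrowing of the histogram as *n* grows is what the central limit theorem leads us to expect: the spread of the sample means should be close to $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation. The sketch below checks this on a synthetic, deliberately skewed population. The exponential `toy_population`, the sample sizes and the `results` dictionary are assumptions for illustration, not the HCEPDB data.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic, skewed stand-in population (not the HCEPDB masses)
toy_population = rng.exponential(scale=2.0, size=100_000)
sigma = toy_population.std()  # population standard deviation (ddof=0)

M = 500  # number of samples drawn for each sample size
results = {}
for n in (10, 100, 1000):
    sample_means = np.array([
        rng.choice(toy_population, size=n, replace=False).mean()
        for _ in range(M)
    ])
    observed = sample_means.std()    # spread of the M sample means
    predicted = sigma / np.sqrt(n)   # standard error of the mean
    results[n] = (observed, predicted)
    print(n, observed, predicted)
```

Each tenfold increase in *n* should shrink the spread of the means by roughly a factor of $\sqrt{10}\approx 3.2$, even though the underlying population is far from Gaussian.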

Now let's make a function with five arguments `sample_stats_funct`, `df`, `column`, `M` and `n` that takes the return values from the last function and
* converts the lists to a single dataframe
* plots two histograms of the columns (mean, sd)
* prints out the mean and sd of the columns

In [None]:
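def describe_sample(sample_stats_funct, df, column, M, n):
    # Draw M samples of size n and collect the per-sample statistics
    means, sds = sample_stats_funct(draw_sample, df, column, M, n)
    stats = pd.DataFrame(data={'means': means, 'sds': sds})

    # One histogram per column: sample means and sample standard deviations
    stats.hist(bins=100, color='mediumorchid')
    print('Mean of sample means: {}'.format(np.round(stats['means'].mean(), 3)))
    print('Mean of sample std devs: {}'.format(np.round(stats['sds'].mean(), 3)))

    return stats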

In [None]:
df = describe_sample(repeat_samples_stats, population, 'mass', 1000, 100)

In [None]:
print(true_mean, true_stdev)

In [None]:
df = describe_sample(repeat_samples_stats, population, 'mass', 1000, 50)

## Next time - Bootstrapping