# [Computational Social Science] 
## 1-4 Statistics and Computation Refresher - Student Version

This notebook will review some basic statistical and computational concepts. We assume knowledge of Python up to the level of D-Lab's [Python Fundamentals](https://github.com/dlab-berkeley/python-fundamentals) workshop. If the materials here are challenging, be sure to review them and the Fundamentals materials, and ask for help from the instructors early and often!

In [None]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1234)

## Load Data

Throughout this course, we will make extensive use of [pandas dataframes](https://pandas.pydata.org/). Getting comfortable with pandas will be important, as it will be the primary tool you use to load, manipulate, and combine datasets. For this lab, we will use a dataset built into the [statsmodels](https://www.statsmodels.org/stable/index.html) library. Run the following code to load the dataset.

In [None]:
anes96 = sm.datasets.anes96 
dataset_anes96 = anes96.load_pandas()
df_anes96 = dataset_anes96.data

## Basic Pandas Operations

Let's run through some basic pandas operations. These methods are not an exhaustive treatment of everything pandas can do, but should provide a good refresher on some of the basics. First, try to get the first 5 rows of a pandas frame and display them in the notebook.

In [None]:
df_anes96.head()

Next, return the first 10 rows, and then return the last 10 rows.

In [None]:
# First 10 rows
df_anes96.head(10)

In [None]:
# Last 10 rows
df_anes96.tail(10)

We can see the total number of rows and columns by using a dataframe's "shape" attribute:

In [None]:
df_anes96.shape

Next, check out the data types across all of the columns.

In [None]:
df_anes96.dtypes

Now try using the [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method to see some summary statistics for each column in the dataframe.

In [None]:
df_anes96.describe()

**Question**: What can you gather from these explorations? What are the data types for all of the columns? Do these data types really make sense?

**Answer**: 

## Renaming, Indexing, and Slicing

Now let's practice with manipulating dataframes. Renaming columns and pulling particular rows and columns are useful methods for working with dataframes.

**Challenge**: Use the [`.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method to change a column name. For example, try renaming "educ" to "education."

In [None]:
df_anes96 = df_anes96.rename(columns = {"educ": "education"})
df_anes96.head()

The `.rename()` method allows you to modify index labels and/or column names. As you can see, we passed a dict to the `columns` parameter, with the original name as the key and the new name as the value. Importantly, `.rename()` returns a new DataFrame by default, so we assigned the result back to `df_anes96`. Alternatively, setting the `inplace` parameter to `True` modifies the existing DataFrame directly instead of returning a copy.
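
As a minimal sketch of the two ways to keep a renamed column, the following uses a small made-up dataframe (`toy` and its column names are illustrative assumptions, not part of the ANES data):

```python
import pandas as pd

# Hypothetical toy dataframe, for illustration only
toy = pd.DataFrame({"educ": [1, 2, 3], "age": [30, 41, 56]})

# By default .rename() returns a new dataframe; the original is untouched
renamed = toy.rename(columns={"educ": "education"})
print(list(toy.columns))      # original column names
print(list(renamed.columns))  # column names of the renamed copy

# With inplace=True the original is modified and None is returned
result = toy.rename(columns={"age": "age_years"}, inplace=True)
print(result, list(toy.columns))
```

Reassigning the result tends to be easier to follow in a notebook, because every cell states explicitly which object it changes.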

Next, let's take a look at slicing dataframes. Earlier, we used the `.head()` and `.tail()` methods to get the first n or last n rows of a dataframe. This time, use the `[]` operator to return the first 5 rows.

In [None]:
df_anes96[:5]

There are a few other methods that we can use to index data too. In particular, let's use the [.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) method. First, let's make a sample dataframe (credit to [Chris Fonnesbeck's tutorial](https://github.com/fonnesbeck/scipy2015_tutorial) for this example).

In [None]:
bacteria = pd.DataFrame({'bacteria_counts': [632, 1638, 569, 115],
                        'other_feature': [438, 833, 234, 298]},
                       index = ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

Note that to create the dataframe we first passed in a dictionary to create the columns and values, and then separately passed in a list for the index that corresponds to the taxon for each bacterium. Let's take a look at what the dataset looks like.

In [None]:
bacteria

**Challenge**: Now, use the loc method to look at the row associated with "Actinobacteria".

In [None]:
bacteria.loc["Actinobacteria"]

**Challenge**: Next, let's look at the [.iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) method. Using our original df_anes96 dataframe, try using .iloc to get the 2nd, 6th, 7th, and 10th rows. **Hint**: Remember, what number does Python start its indexes with?

In [None]:
df_anes96.iloc[[1,5,6,9]]

**Challenge**: Now try to use `.iloc` to select every 5th row between index 25 and index 50. **Hint**: Try looking at how to [slice and stride](https://towardsdatascience.com/indexing-best-practices-in-pandas-series-e455c7d2417) in Python.

In [None]:
df_anes96.iloc[25:50:5,:]
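
To make the difference between label-based and position-based indexing concrete, here is a small sketch on a made-up dataframe (`toy` and its labels are hypothetical). It also shows the `start:stop:step` pattern used above:

```python
import pandas as pd

# Hypothetical toy dataframe with string labels as the index
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]},
                   index=["a", "b", "c", "d", "e", "f"])

by_label = toy.loc["c", "value"]   # .loc looks rows up by index label
by_position = toy.iloc[2, 0]       # .iloc looks rows up by 0-based position

# .loc slices include the end label; .iloc slices exclude the stop position
label_slice = toy.loc["b":"d"]     # rows b, c, d
strided = toy.iloc[1:6:2]          # positions 1, 3, 5 -> rows b, d, f

print(by_label, by_position)
print(label_slice)
print(strided)
```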

## Calculations

Next, let's look at some common calculations you might make with real-life datasets. First, try to use the [`.unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) method to find the unique values in TVnews. What do you find?

In [None]:
df_anes96['TVnews'].unique()

**Answer**: ...

How would we get the number of unique values? Try using [`.nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) to find the number of unique values in TVnews!

In [None]:
...

**Answer**: 

Next, try to find the [`.sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) of age and [`.mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) age in the df_anes96 dataset, wrapping those calculations in the `print()` function.

In [None]:
print('sum of age is', ...)
print('mean of age is', ...)

**Challenge**: Sometimes we want to explore certain relationships between two variables in our dataset. 

Try to use the [`.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html), [`.sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) and [`.count()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html) methods to group the observations by education level, and calculate the proportion of the vote that went to Bob Dole by education level. 

**Hint**: In the vote feature, a "0" denotes a vote for Clinton and a "1" denotes a vote for Dole. Divide the `sum` of the vote by the `count` of the vote!

In [None]:
...
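
If the sum-divided-by-count pattern from the hint is unfamiliar, the sketch below shows it on a made-up dataframe. The names `toy`, `group`, and `flag` are hypothetical stand-ins, not columns of the ANES data:

```python
import pandas as pd

# Hypothetical toy data: 'flag' is a 0/1 indicator, 'group' a category
toy = pd.DataFrame({"group": ["x", "x", "x", "y", "y"],
                    "flag":  [1, 0, 1, 0, 0]})

grouped = toy.groupby("group")["flag"]

# Number of 1s divided by number of observations = share of 1s per group
proportion = grouped.sum() / grouped.count()
print(proportion)

# For a 0/1 column, the group mean gives the same proportion
same = grouped.mean()
print(same)
```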

## Visualization

Another key part of data science is using visualizations to explore your data and present results. Python provides several powerful tools for creating visualizations. In this course, we will mainly use [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/introduction.html#:~:text=Seaborn%20is%20a%20library%20for,examining%20relationships%20between%20multiple%20variables). Matplotlib is a popular visualization library, and seaborn is built on top of it and includes some integration with pandas. There are other options as well. For those of you coming from R, you might want to explore [ggplot](http://ggplot.yhathq.com/), [Bokeh](https://docs.bokeh.org/en/latest/), and [plotnine](https://plotnine.readthedocs.io/en/stable/), which are all built on top of the "grammar of graphics" that you might be familiar with.

Let's start with a simple histogram. Use the [`.hist()`](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.hist.html) method to plot a simple histogram for 'age' on top of the df_anes96 dataframe.

In [None]:
df_anes96.hist('age');

**Question**: Notice that the `.hist()` method has some additional arguments that you can supply beyond just the variable that is being plotted. Try using the `bins` argument to adjust the number of bins. What happens if you use 10? What about 100?

In [None]:
df_anes96.hist('age', bins = 10);

In [None]:
df_anes96.hist('age', bins = 100);

**Answer**: ...

What if we want to add some info to the plot? Instead of calling `.hist()` directly on the pandas dataframe, try using plt.hist().

In [None]:
plt.hist(df_anes96['age'])

We can call functions in the `plt` module multiple times within a single cell, and those functions will all work on, and modify, the current figure associated with that cell. This is because pyplot (or `plt`) keeps track of a current figure, and in a notebook each cell that uses `plt` gets its own. Try adding a [`title`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.title.html), [`xlabel`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xlabel.html), and [`ylabel`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.ylabel.html) to the histogram.

In [None]:
# adding labels and titles 
plt.clf() # clear the figure because sometimes it can get messy if you don't
plt.title('...')
plt.xlabel('...')
plt.ylabel('...')
plt.hist(df_anes96['age'], bins=...) # specify number of bins at 20
plt.show();
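
The "current figure" idea can be checked directly. The sketch below uses made-up data (`values` is randomly generated, not taken from the ANES dataset) and shows that successive `plt` calls all land on the same axes, which `plt.gca()` returns:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)  # made-up data

plt.figure()
# plt.hist returns the bin counts, the bin edges, and the drawn patches
counts, bin_edges, _ = plt.hist(values, bins=20)
plt.title("Histogram of made-up values")  # each call modifies the current figure
plt.xlabel("Value")
plt.ylabel("Count")

ax = plt.gca()  # the axes that the calls above were drawing on
print(ax.get_title(), len(bin_edges))
```

Note that asking for 20 bins produces 21 bin edges, since every bin needs a left and a right boundary.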

### Barplot

Now try it yourself! Instead of a histogram, let's make a bar plot using matplotlib's [`.bar()`](https://matplotlib.org/3.3.1/api/_as_gen/matplotlib.pyplot.bar.html) method.

**Question**: What kind of data is a bar plot good for visualizing? How is this different from a histogram?

**Answer**: ...

Make a bar plot that visualizes the votes that Bob Dole received in this sample, broken down by education level. You will need to manipulate the dataframe to get the vote counts by education level, then plot using the `.bar()` method. Also be sure to recode the numerical values in "education" to their corresponding text values. Consult the [dataset documentation](https://www.statsmodels.org/stable/datasets/generated/anes96.html). **Hint**: Consider using `groupby()`, `sum()`, and `replace()` to get the data into the correct shape before plotting.

In [None]:
# formatting the data for plotting
educ_vote_Dole = df_anes96.groupby('education', as_index=False)[...].sum()

# recoding the numerical education values as text labels
educ_vote_Dole['education_labels'] = educ_vote_Dole['education'].replace([...],
                                                                         [...])

In [None]:
educ_vote_Dole

Next, use your new dataset to make a barplot. Be sure to use [`.xticks()`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xticks.html) to label the categories.

In [None]:
## Plot
plt.clf() # clear the figure because sometimes it can get messy if you don't

x = educ_vote_Dole[...]
y = educ_vote_Dole[...]


plt.bar(x, y)

plt.title('Number of Votes for Dole by Education Level')
plt.xticks(...)
plt.show()

## Plotting with Seaborn

"`Seaborn` is a Python visualization library based on `Matplotlib`. It provides a high-level interface for drawing attractive statistical graphics."

Let's import it and give it the alias `sns`, which is done by convention.

In [None]:
import seaborn as sns
sns.set(rc={'axes.facecolor' : '#EEEEEE'})

The `sns.set()` function allows us to change some of the `rcParams`. Here, we're changing the plot's face color.

`seaborn` has the capacity to create a large number of informative, beautiful plots very easily. Here we'll review several types, but please visit their [gallery](https://seaborn.pydata.org/examples/index.html) for a more complete picture of all that you can do with `seaborn`.

Let's use the [U.S. Macroeconomics](https://www.statsmodels.org/dev/datasets/generated/macrodata.html) dataset, also from the `statsmodels` library. Load the data and explore it.

In [None]:
macro = sm.datasets.macrodata
dataset_macro = macro.load_pandas()
df_macro = dataset_macro.data

In [None]:
df_macro.head()

Now, use the `sns` library's [`.regplot()`](https://seaborn.pydata.org/generated/seaborn.regplot.html) method to visualize a regression of Real Gross Domestic Product (GDP) on Consumer Price Index (CPI). 

In [None]:
sns.regplot(...)

plt.title(...)
plt.xlabel(...)
plt.ylabel(...);

**Question**: How well does the regression fit the data? What can you conclude from this plot?

**Answer**: 

### Kernel Density Plots

Earlier, we used histograms to visualize continuous data. As we saw, the choice of bin width is consequential for the shape of the histogram. Another option for visualizing the same data is [kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation). KDE is a method for estimating the probability density function (PDF) of a random variable. Try using seaborn's [`.kdeplot()`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to plot real GDP.

In [None]:
sns.kdeplot(...)

plt.title(...)
plt.xlabel(...)
plt.ylabel(...);
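
To see what a kernel density estimate computes, here is a rough NumPy-only sketch of a Gaussian KDE on made-up data. The helper `gaussian_kde_sketch` and its `bandwidth` argument are hypothetical names written for this illustration; this is not how seaborn implements `kdeplot` internally:

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (illustrative helper)."""
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(size=300)   # made-up data
grid = np.linspace(-6, 6, 601)

narrow = gaussian_kde_sketch(samples, grid, bandwidth=0.1)  # wiggly estimate
wide = gaussian_kde_sketch(samples, grid, bandwidth=1.0)    # smooth estimate

# Both curves are densities: the area under each is approximately 1
step = grid[1] - grid[0]
print(narrow.sum() * step, wide.sum() * step)
```

The bandwidth plays a role similar to the bin width of a histogram: a small value follows individual observations closely, while a large value smooths them away.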

**Question**: What can you say about the distribution from this kernel density plot?

**Answer**: ...

You can also overlay a kernel density plot on a histogram using [`.displot()`](https://seaborn.pydata.org/generated/seaborn.displot.html) with its `kde` argument. Try using displot yourself here.

In [None]:
sns.displot(...)

plt.title(...)
plt.xlabel(...)
plt.ylabel(...);

**Question**: What can you say about the KDE plot overlaid on the histogram?

**Answer**: 

### Joint Distribution

The last visualization technique we will look at is plotting a joint distribution. Use the [`.jointplot()`](https://seaborn.pydata.org/generated/seaborn.jointplot.html) method to plot the joint distribution of unemployment and inflation. Note that `.jointplot()` returns a different type of object than the other plots we have worked with so titling it might be hard. Check out this answer on [stackoverflow](https://stackoverflow.com/questions/49065837/customize-the-axis-label-in-seaborn-jointplot) to see if you can figure it out!

In [None]:
j_plot = sns.jointplot(x = ..., y = ..., data = ..., alpha = .5)

j_plot.set_axis_labels(...)

j_plot.ax_joint.set_xlabel(...)

j_plot.ax_marg_y.grid(True)

plt.tight_layout();

**Question**: What can you say about the relationship between inflation and unemployment? Why are joint distributions interesting in general?

**Answer**: 

## Simulations

One of the advantages of computational social science is that computing gives us the tools to create simulations. Traditional pedagogy in statistics emphasizes solving problems analytically, but oftentimes we can solve the same problems computationally. For now, let's explore the [bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#:~:text=Bootstrapping%20is%20any%20test%20or,etc.\)%20to%20sample%20estimates.) as a way to use simulations. A common problem in statistics is that we usually do not know the true parameters (mean, variance, etc.) of a population. Using the bootstrap, we can estimate these values. The basic procedure for the bootstrap is to do the following:

1. For a dataset of size n, take a resample **with replacement** of size n.
2. Calculate the quantity of interest (e.g., mean, median, etc.)
3. Repeat this procedure a large number of times (for example, 10000)
4. Visualize/analyze the distribution of the resampled quantity
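
The steps above can be sketched in a few lines of NumPy. This example uses a made-up sample and the mean rather than the median, so it is an illustration of the procedure under those assumptions and not the solution to the exercises that follow:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=50)  # made-up sample, n = 50

n_reps = 2000
boot_means = np.empty(n_reps)
for i in range(n_reps):  # Step 3: repeat many times
    # Step 1: resample n values with replacement
    resample = rng.choice(data, size=data.size, replace=True)
    # Step 2: compute the quantity of interest
    boot_means[i] = resample.mean()

# Step 4: summarize the distribution, here with a 95% percentile interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), lower, upper)
```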

Let's try it ourselves. First, let's see how many observations we have in our df_macro dataset. Use the `.shape` property to find this information.

In [None]:
...

Next, find the median for the real GDP quantity.

In [None]:
df_macro['realgdp']. ...

Now, resample the dataframe with replacement and find the median of real GDP. Be sure to set the random seed.

In [None]:
# specify a sample of 203, with replacement and set the random seed
resample = df_macro.sample(n= ..., replace= ..., random_state = 1) 

In [None]:
# if you want to view the number of duplicates in the dataset
resample.duplicated().value_counts()

In [None]:
resample['realgdp'].quantile(q= ... ) # another way to obtain median

**Question**: What is the resampled median? Does the answer you got intuitively make sense?

**Answer**: 

Next, we write a function that takes a dataframe, column, and the number of replications as arguments and returns a number of resampled medians equal to the number of replications.

In [None]:
def bootstrap_median(original_sample, label, replications):
    """Returns an array of bootstrapped sample medians:
    original_sample: table containing the original sample
    label: label of column containing the variable
    replications: number of bootstrap samples
    """
    just_one_column = original_sample.loc[:, label] # Hint: slice this to include all rows and just the 'label' column
    medians = []
    
    for i in np.arange(replications):
        bootstrap_sample = just_one_column.sample(n= ..., replace= ...) 
        resampled_median = bootstrap_sample.quantile(...)
        medians.append(...)

    return medians

Plot the medians from the realgdp column of our df_macro with a histogram, and add a line with the 95% confidence interval.

In [None]:
medians = bootstrap_median(..., 
                           ..., 
                           ...)
resampled_medians = pd.DataFrame(data={'Bootstrap Sample Median': medians})
resampled_medians.hist()


# Have a look and see if you can see what's going on!
plt.plot(list([pd.Series(medians).quantile(q=.025), 
               pd.Series(medians).quantile(q=.975)]), 
         np.array([500, 500]), 
         color='yellow', 
         linewidth=10, zorder=1);

Now, we repeat this whole process 100 times to plot 100 confidence intervals. Similar to the slice and stride we applied to a pandas dataframe above, numpy's [`np.arange()`](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) function creates an array from a start, stop, and step. We use `np.arange()` below to specify how many simulations we want to run. *Note: we could use Python's `range` for this as well.* This step will take a minute to run.

In [None]:
left_ends  = []
right_ends = []

for i in np.arange(100):
    first_sample = df_macro.sample(n=203, replace=True)
    medians = bootstrap_median(..., ..., ...)
    left_ends.append(pd.Series(medians).quantile(q= ...))        
    right_ends.append(pd.Series(medians).quantile(q= ...))       

intervals = pd.DataFrame(data={"Left": left_ends, "Right": right_ends})
intervals
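
The loop above iterates over a NumPy range. As a quick, self-contained comparison with Python's built-in `range` (the variable names here are illustrative only):

```python
import numpy as np

# start, stop (exclusive), step: the same pattern as slicing with a stride
steps = np.arange(25, 50, 5)          # NumPy array
same_steps = list(range(25, 50, 5))   # plain Python list
print(steps, same_steps)

# With a single argument n, both count from 0 to n - 1
total = 0
for i in np.arange(100):
    total += 1
print(total)
```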

In [None]:
plt.figure(figsize=(8,8))
for i in np.arange(100):
    ends = intervals.iloc[i, :]
    plt.plot(ends, np.array([i + 1, i + 1]), color='gold')
plt.xlabel('Median')
plt.ylabel('Replication')
plt.title('Population Median and Intervals of Estimates');

**Question**: What can you say about the distributions of the resampled medians? Why is this method useful? Did your code take a while to run, and if so what does this suggest?

**Answer**: 

---
Notebook written by Aniket Kesari. Materials borrowed from D-Lab's [pandas](https://github.com/dlab-berkeley/introduction-to-pandas) and [data visualization](https://github.com/dlab-berkeley/visualization-with-python) workshops, and from [Legal Studies 123: Data, Prediction, and Law](https://github.com/Akesari12/LS123_Data_Prediction_Law_Spring-2019/blob/master/labs/Probability%20Distributions%2C%20Bootstrap%2C%20and%20Confidence%20Intervals/Probability%2C%20Bootstrap%20and%20Confidence%20Intervals%20Solutions.ipynb).