<center>
<h1>STAT 654: Statistical Computing with R and Python</h1>
<h2>Intro Python for Data Analysis</h2>
<strong>
Daniel Drennan<br>
Dr. Sharmistha Guha<br><br>
Department of Statistics<br>
Texas A&M University<br>
College Station, TX, USA<br><br>
Spring 2022<br>
</strong>
</center>

In [None]:
# typical python imports (not used here)
import os
import functools

# standard scipy imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats
import seaborn as sns
import statsmodels.api as sm

# Datasets

We will use two datasets, listed below: the first for regression and the second for classification.

1. Auto MPG. (1993). UCI Machine Learning Repository. <https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/>

2. Fisher, R. A. (1988). Iris. UCI Machine Learning Repository. <https://archive.ics.uci.edu/ml/machine-learning-databases/iris/>

In principle, we can read these data directly from the web without downloading them.
However, it makes more sense to download the data to your machine once, which reduces the number of times the server is queried (you will often rerun the cells that load a dataset).
Run the following commands from your terminal.

```bash
# create the target directory if needed (curl -o does not create it)
mkdir -p data
# locally download the auto-mpg data
curl -o data/auto-mpg.data archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
curl -o data/auto-mpg.names archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
# locally download the iris data
curl -o data/iris.data archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
curl -o data/iris.names archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names
```
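
The same download-once idea can be sketched in Python using only the standard library. The helper name `fetch_if_missing` is hypothetical, and the `https://` scheme in the commented usage is an assumption (the `curl` commands above omit the scheme). This is a minimal sketch, not the only way to cache the files.

```python
import os
import urllib.request


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        # Create the target directory (e.g. data/) on first use
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest


# Example usage (hypothetical; needs network access on the first run only):
# base = "https://archive.ics.uci.edu/ml/machine-learning-databases"
# fetch_if_missing(f"{base}/iris/iris.data", "data/iris.data")
```

Because the helper returns immediately when the file is already present, rerunning a notebook cell that calls it does not query the server again.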

The main Python library for working with data is Pandas.
Our first step will be to load the two datasets and then we'll look at how to create new features using Pandas.

In [None]:
# Automobile MPG data for a regression problem
# The data does not include headers, so we must read and add them manually
# See archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/ for details
auto_url = "data/auto-mpg.data"
auto_names = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "model_year",
    "origin",
    "car_name"
]

# auto = pd.read_table(auto_url, na_values="?", names=auto_names)
auto = pd.read_fwf(auto_url, na_values = "?", names=auto_names)
auto.head()

In [None]:
auto.describe()

In [None]:
# Iris data available from archive.ics.uci.edu/ml/machine-learning-databases/iris/
# to perform a classification problem
iris_path = "data/iris.data"

# The column names are not stored in the data, so we must provide them
iris_names = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "species"
]

# Now load the data and preview it
iris = pd.read_csv(iris_path, names = iris_names)
iris.head()

In [None]:
# Get summary statistics for a dataframe
iris.describe()

## Modifying Data

It is very common to transform imported data in some way before working with it.
From creating new variables to rescaling existing ones, pandas lets us express many of these transformations compactly.
A few of these ideas are sketched below.

In [None]:
# Transform a categorical variable to a 0-1 indicator (dummy) variable
pd.get_dummies(iris)

In [None]:
# The "Iris-" part of each species string is redundant, so let's strip it off every description in one clean pass
# There are multiple ways to do this, but we'll strive for one that is pretty elegant
iris['species'] = iris.species.apply(lambda x: x.split("-")[1])

iris.head()
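
As one of the "multiple ways" mentioned in the comment above, pandas' vectorized `.str` accessor avoids the Python-level lambda. The sketch below uses a small made-up Series rather than the real data. Unlike `split("-")[1]`, replacing the literal substring is safe to rerun on values that have already been cleaned.

```python
import pandas as pd

# Toy stand-in for iris.species (illustrative values, not read from the file)
species = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# Replace the literal substring "Iris-" with nothing;
# regex=False treats the pattern as plain text
cleaned = species.str.replace("Iris-", "", regex=False)

# Applying the same step again leaves the values unchanged
cleaned_twice = cleaned.str.replace("Iris-", "", regex=False)

print(cleaned.tolist())
```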

In [None]:
# Suppose we wanted to convert mpg (miles per gallon) to km/L (kilometers per liter)
# 1 mpg is about 0.425 km/L, so we divide by 2.35214583 to create the new variable
auto["kml"] = auto.mpg / 2.35214583

auto.head()
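
The divisor 2.35214583 follows from the unit definitions (1 mile = 1.609344 km, 1 US gallon = 3.785411784 L). The sketch below derives it and also illustrates the rescaling idea mentioned at the start of this section, using a small made-up data frame. The column name `weight_z` and the constant names are hypothetical.

```python
import pandas as pd

KM_PER_MILE = 1.609344
LITERS_PER_US_GALLON = 3.785411784

# Converting mpg to km/L multiplies by km-per-mile and divides by
# liters-per-gallon, which is the same as dividing by this constant
MPG_PER_KML = LITERS_PER_US_GALLON / KM_PER_MILE  # approximately 2.352146

# Toy data (illustrative values, not taken from auto-mpg.data)
toy = pd.DataFrame({"mpg": [18.0, 25.0, 32.0],
                    "weight": [3500.0, 2800.0, 2100.0]})

# New variable: fuel economy in kilometers per liter
toy["kml"] = toy["mpg"] / MPG_PER_KML

# Rescaling: standardize weight to mean 0 and sample standard deviation 1
toy["weight_z"] = (toy["weight"] - toy["weight"].mean()) / toy["weight"].std()

print(toy.round(3))
```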

## Data Visualization

In [None]:
# Making scatterplot matrices with a dataframe (does not scale well for very large datasets)
plot_vars = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year"]
sns.pairplot(auto, vars=plot_vars, hue="origin", palette="viridis")
plt.show()

In [None]:
pd.plotting.boxplot_frame_groupby(iris.groupby("species"), figsize=(12,12))
plt.show()

# Hypothesis Testing

Hypothesis testing in Python can be underwhelming compared with R, because the output is not nearly as informative.
Based on the last graph we made, we can hypothesize that mean sepal width differs
between setosa and versicolor plants and check this with a two-sample t-test (Welch's version, since the code sets `equal_var=False`).
The result is shown below.

In [None]:
versicolor_sepal_width = iris[iris.species == "versicolor"].sepal_width.to_numpy()
setosa_sepal_width = iris[iris.species == "setosa"].sepal_width.to_numpy()

t_test = sp.stats.ttest_ind(versicolor_sepal_width, setosa_sepal_width, equal_var=False)

test_stat = t_test.statistic
p_value = t_test.pvalue

print("H0: Mean sepal widths of versicolor and setosa flowers are equal")
print(f"T = {test_stat:.4f} with p-value {p_value:.4g}")
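
Because `ttest_ind` reports only the statistic and p-value, one way to recover some of what R's `t.test` prints is to compute the remaining pieces by hand. The helper name `welch_summary` is hypothetical, and the data are simulated with made-up parameters rather than taken from the iris measurements. This is a sketch of Welch's test with the Welch-Satterthwaite degrees of freedom, checked against SciPy.

```python
import numpy as np
from scipy import stats


def welch_summary(a, b, level=0.95):
    """Welch two-sample t-test with means, df, and a CI for mean(a) - mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard errors of the two sample means
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    diff = a.mean() - b.mean()
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    t_stat = diff / se
    p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided
    half_width = stats.t.ppf(0.5 + level / 2, df) * se
    return {
        "mean_a": a.mean(), "mean_b": b.mean(), "t": t_stat, "df": df,
        "p_value": p_value, "ci": (diff - half_width, diff + half_width),
    }


# Simulated example (made-up parameters, not the iris data)
rng = np.random.default_rng(0)
a = rng.normal(loc=2.8, scale=0.3, size=50)
b = rng.normal(loc=3.4, scale=0.4, size=50)

summary = welch_summary(a, b)
reference = stats.ttest_ind(a, b, equal_var=False)
print(summary)
print(np.isclose(summary["t"], reference.statistic),
      np.isclose(summary["p_value"], reference.pvalue))
```

The final line compares the hand-computed statistic and p-value with those returned by `ttest_ind(..., equal_var=False)`.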

Another, more concrete example using simulated data is shown below.

In [None]:
# I want this to be deterministically repeatable, so I'm setting a random seed now.
np.random.seed(1)

# For this demonstration, I want to construct two datasets which are similar and 
# compare their means using the t test shown before
y = np.random.normal(loc=100, scale=30, size=(1000, 2))

# Conduct a two-sample test on the two columns just simulated
# Columns are selected with y[:, j]; y[j] would select row j (only 2 values) instead
ttest = sp.stats.ttest_ind(y[:, 0], y[:, 1], equal_var=True)

print("Two sample t test\nMean1: {:.4f}\tMean2: {:.4f}\tT-stat: {:.4f}\tp-value: {:.6f}".format(*y.mean(axis=0), *ttest))

# One sample test of H0: mean = 60, using the second column
onesample = sp.stats.ttest_1samp(y[:, 1], 60)

print("One sample t test\nMean: {:.4f}\tT-stat: {:.4f}\t\tp-value: {:.6f}".format(y[:, 1].mean(), *onesample))

Let's plot the last two datasets to compare them.

In [None]:
# Generate a figure and specify its size
plt.figure(figsize=(12,8))

# Always label your plots
plt.title(r"Two Samples $Y_1, Y_2 \sim N(100, 30^2)$")
plt.xlabel("Sampled Value")
plt.ylabel("Frequency")

# Plot histograms
for i in range(2):
    plt.hist(y[:, i], bins=20, color=f"C{i}", alpha=0.75, label=f"$Y_{i+1}$")

plt.legend()
plt.show()

The two-sample t-tests shown above neglect most of the information in our datasets.
Moreover, they answer relatively uninteresting questions about differences in features.
We can do better by building classifiers and regression models of the data.


In [None]:
# Making scatterplot matrices with a dataframe (does not scale well for very large datasets)
plot_vars = ["mpg", "cylinders", "weight", "acceleration", "model_year"]
sns.pairplot(auto, vars=plot_vars, hue="origin", palette="viridis")
plt.show()

In [None]:
auto_model = sm.OLS.from_formula(
    "mpg ~ weight + acceleration + model_year + origin", 
    data=auto).fit()
auto_model.summary()

In [None]:
# The same model, but with all pairwise (first-order) interactions
# The string splitting method can be handy for converting a string into a list
# (named `terms` rather than `vars` to avoid shadowing the built-in vars())
terms = "weight + acceleration + model_year + origin".split(" + ")
# Pair each variable with every later one; in a formula, a*b expands to a + b + a:b
interactions = []
for j in range(len(terms)):
    for k in range(j + 1, len(terms)):
        interactions.append(f"{terms[j]}*{terms[k]}")

# Now we can recombine the interactions as the RHS of the formula
formula = "mpg ~ " + " + ".join(interactions)

auto_model2 = sm.OLS.from_formula(formula, data=auto).fit()
auto_model2.summary()

In [None]:
plt.figure(figsize=(12, 4))
plt.plot(auto_model2.pvalues, "o", mfc="none", markersize=10)
plt.xticks(rotation=30)

# Labels / formatting
plt.title("Auto MPG model with first-order interactions", fontsize=24)
plt.xlabel("Model Term", fontsize=16)
plt.ylabel("p-value", fontsize=16)

plt.grid(alpha=0.3)
# Reference line at the 0.05 significance level, spanning the full axis
plt.axhline(0.05, color="steelblue", linestyle="-.", linewidth=2)
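
The pairwise terms for an interaction formula like the one used above can also be generated with `itertools.combinations`, which yields each unordered pair exactly once. This is a sketch using the same four predictor names. The helper `pairwise_formula` and the variable `formula_pairs` are hypothetical names, and the `a:b` syntax adds only the interaction term, so the main effects are listed separately.

```python
from itertools import combinations


def pairwise_formula(response, predictors):
    """Build 'y ~ main effects + all two-way interactions' as a formula string."""
    main_effects = list(predictors)
    # a:b denotes the interaction only; each unordered pair appears once
    pairs = [f"{a}:{b}" for a, b in combinations(main_effects, 2)]
    return f"{response} ~ " + " + ".join(main_effects + pairs)


formula_pairs = pairwise_formula(
    "mpg", ["weight", "acceleration", "model_year", "origin"]
)
print(formula_pairs)
```

With four predictors this gives four main effects and six interaction terms, and the resulting string can be passed to `sm.OLS.from_formula` in the same way as the formula built in the loop.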