# Linear models

In this notebook, we're going to give a brief introduction to prediction using linear models, as well as some related concepts. If you enjoy this material, we highly recommend further data science and statistics classes!

In [None]:
%matplotlib inline
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

## Read in the data

We'll be using the same data from Monday's lecture. As a reminder:

_The data for this demo is a lexical decision dataset. It was elicited from 21 subjects for 79 English concrete nouns. RT stands for "response time" (in what I assume is a normalized scale, rather than seconds). Frequency indicates a calculated frequency for the noun within a given corpus. In this lexical decision task, subjects were shown a series of words (e.g., CAT, BAT, CART) and the occasional nonword (e.g., CAZ, BRIT, CHOG) and asked to simply identify whether the word was real or not. In addition to measuring accuracy, recording the response time can tell us a bit about how our brains process words, as faster (lower) response times result from faster, easier mental processing._

Let's first read in the data as a Table.

In [None]:
lex = Table.read_table('wk4-lexicaldecision.csv')
lex.show(10)

## Exploratory data analysis

Once you learn how to fit models, it can be very tempting to fit one as soon as you get your hands on some data. But we should always explore the data first. We need to understand the data and what it means before we can safely apply any modeling.

What does "Exploratory data analysis (EDA)" look like? It's often an amorphous activity guided by human intuition and questions. I like to structure it as a series of questions I have about the data and then use EDA to answer the questions.

In [None]:
# How big is the data?
lex.num_rows

In [None]:
lex.num_columns

In [None]:
# What do the columns mean?
lex.show(5)

In [None]:
# What are typical values for some important numerical columns?
lex.select(["RT", "Frequency", "Length"]).stats()

In [None]:
# What's the distribution of values in the Sex column?
lex.group("Sex")

In [None]:
# What's the distribution of values in the NativeLanguage column?
lex.group("NativeLanguage")

In [None]:
# What's the distribution of values in the Class column?
lex.group("Class")

A large part of EDA is making visualizations. Here's where you can be creative!

In [None]:
# What is the distribution of response time?
lex.hist(["RT"])

In [None]:
# Are the response times different across genders?
lex.where('Sex','M').select("RT").boxplot()

In [None]:
lex.where('Sex','F').select("RT").boxplot()

In [None]:
# Let's plot them side-by-side
time_male = lex.where('Sex','M').column('RT')
time_female = lex.where('Sex','F').column('RT')
plt.subplot(1,2,1)
plt.boxplot(time_male)
plt.title("Male")
plt.subplot(1,2,2)
plt.boxplot(time_female)
plt.title("Female");

In [None]:
# How are word length and frequency related?
lex.select(["Length", "Frequency"]).scatter("Length")

## Linear models

Recall that the Pearson correlation coefficient $r$ is a statistic (i.e., a way to calculate a number from data) that measures the strength and direction of the linear association between two variables. It always lies between $-1$ and $1$.

<img src="correlations.png" />

If we know there is an association between two variables, we can use one to predict the other. Linear models help us to do this.
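
As a refresher on how $r$ is calculated, here is a minimal sketch that computes it as the mean of the products of the two variables in standard units. The arrays `length` and `frequency` are made-up toy values, not rows from the lexical decision data, and the helper names `standard_units` and `correlation` are chosen here for illustration.

```python
import numpy as np

# Hypothetical toy data (NOT taken from the lexical decision dataset)
length = np.array([3, 4, 4, 5, 6, 7, 8, 9], dtype=float)
frequency = np.array([5.1, 4.8, 4.9, 4.2, 4.0, 3.5, 3.1, 2.9])

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """Pearson r: the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

r = correlation(length, frequency)
print("r =", round(r, 3))

# NumPy's built-in correlation matrix should give the same number
print("np.corrcoef:", round(np.corrcoef(length, frequency)[0, 1], 3))
```

A negative $r$ here would mean that longer toy words tend to have lower toy frequencies; the sign and size of $r$ for the real data may differ.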

In [None]:
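# Don't worry about understanding what this is; we'll cover the core ideas next week when we talk about functions
def linear_model(X, y):
    # Add a column of ones so the model includes an intercept term
    X = sm.add_constant(X.values)
    # Fit ordinary least squares (OLS) and print the results table
    model = sm.OLS(y, X).fit()
    print(model.summary())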

In [None]:
# Fit a linear model to predict Frequency from Length
X = lex.select(["Length"])
y = lex.column("Frequency")
linear_model(X, y)

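The linear model shows a statistically significant association between Length and Frequency, $p < 0.001$. (A p-value can never be below zero; a value displayed as 0.000 in the summary has simply been rounded.)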
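
To make the idea of "fitting" less mysterious, here is a minimal sketch of ordinary least squares using only NumPy. It builds a design matrix with a column of ones for the intercept (similar in spirit to `sm.add_constant`) and solves for the coefficients that minimize the sum of squared residuals. The data are made-up toy values and the `predict` helper is hypothetical; this is not how `statsmodels` is implemented internally.

```python
import numpy as np

# Hypothetical toy data (NOT taken from the lexical decision dataset)
length = np.array([3, 4, 4, 5, 6, 7, 8, 9], dtype=float)
frequency = np.array([5.1, 4.8, 4.9, 4.2, 4.0, 3.5, 3.1, 2.9])

# Design matrix: a column of ones (intercept) next to the predictor column
design = np.column_stack([np.ones_like(length), length])

# Least squares solution: coefficients minimizing the sum of squared residuals
coefficients, *_ = np.linalg.lstsq(design, frequency, rcond=None)
intercept, slope = coefficients

def predict(new_length):
    """Predicted frequency for a word of the given length."""
    return intercept + slope * new_length

# Residuals are the gaps between observed and predicted values
residuals = frequency - predict(length)

print("intercept:", round(intercept, 3))
print("slope:", round(slope, 3))
print("prediction for a 5-letter word:", round(predict(5.0), 3))
```

The slope is the predicted change in the outcome for each one-unit increase in the predictor, which is the number to look for in the coefficient column of the summary table.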

In [None]:
# The fitted regression line can also be drawn directly on the scatter plot
lex.select(["Length", "Frequency"]).scatter("Length", fit_line=True)
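
The regression line ties back to the correlation coefficient: its slope is $r$ times the ratio of the standard deviations, and it passes through the point of averages. The sketch below checks this on made-up toy data against `np.polyfit` with degree 1. It assumes the line drawn by `fit_line=True` is this same least squares line.

```python
import numpy as np

# Hypothetical toy data (NOT taken from the lexical decision dataset)
length = np.array([3, 4, 4, 5, 6, 7, 8, 9], dtype=float)
frequency = np.array([5.1, 4.8, 4.9, 4.2, 4.0, 3.5, 3.1, 2.9])

# Slope and intercept of the regression line, written in terms of r
r = np.corrcoef(length, frequency)[0, 1]
slope = r * np.std(frequency) / np.std(length)
intercept = np.mean(frequency) - slope * np.mean(length)

# A degree-1 polynomial fit is a least squares line: [slope, intercept]
poly_slope, poly_intercept = np.polyfit(length, frequency, 1)

print("from r:      ", round(slope, 4), round(intercept, 4))
print("from polyfit:", round(poly_slope, 4), round(poly_intercept, 4))
```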