# CSS Lab: Drawing Conclusions
This lab covers techniques for evaluating the validity of conclusions made from data. These conclusions are not always clear, and can have ethical implications. After completing this lab, you will understand some of the factors and methods used to evaluate the conclusions reached from data.

## Section 1: Background
This lab uses data from the University of Michigan Learning Analytics Data Architecture (LARC) project. The LARC project tracks the performance of all undergraduate students at the University of Michigan. To protect the privacy of students, the true data is used to generate a synthetic data set: fake data that preserves the statistical properties of the original data. This data can be used to investigate questions about the performance of students over the course of their undergraduate career, taking factors such as major, gender, and year into account.

## Section 2: Setup
### 2.1 Load Python Libraries

In [None]:
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from scipy import stats as spstats
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score
%matplotlib inline

### 2.2 Load Data
We begin by loading data about each (synthetic) student, and using the anonymous ID column (ANONID) as an index for the data frame. Some of the columns in this data frame are:

* MAJOR1_DESCR - Student's first major
* HSGPA - Student's high school GPA
* LAST_ACT_COMP_SCORE - Student's comprehensive ACT score
* LAST_SATI_TOTAL_SCORE - Student's total SAT score
* SEX - Student's sex

In [None]:
df_student = pd.read_csv("data/student.record.csv.gz").set_index("ANONID")
df_student.head()

#### Loading course data
Now we load a separate data frame containing one row for each student/course combination. Some of the columns in this data frame are:

* TERM - Academic term id
* ANONID - Student id
* SUBJECT - Course academic subject
* CATALOG_NBR - Course catalog number (not unique across subjects)
* GRADE - Student's grade in course
* GPAO - Student's GPA up to and including current semester, excluding this course.
* Season - Academic term season
* Year - Academic term year

In [None]:
df_course = pd.read_csv("data/course-modified.csv.gz")
df_course.head()

#### Examining Data
Let's plot the high school GPAs of all students in the data set.

In [None]:
plt.figure(figsize=(8,4))
df_student.HSGPA.hist(bins=np.arange(0.01,37.01))
plt.xlabel('High School GPA'); plt.ylabel('Count')
# 'nonposy' was renamed to 'nonpositive' in Matplotlib 3.3 and later removed
plt.yscale('log', nonpositive='clip')

#### Short Answer 1
The highest possible GPA is 4.0, but the above plot shows values above 30.
These values must be errors. Give at least **two** possible explanations for how a value of 31 could end up in a student's GPA record.

🤔 Your answer here:

### Clean Data
We can do some data cleaning to remove values that are obvious errors.

In [None]:
# Remove outliers
outliers = df_student.HSGPA > 4
df_student.loc[outliers] = None
# Remove missing data coded as 0
missing = df_student.HSGPA == 0
df_student.loc[missing] = None
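
One detail of the cleaning step above is worth seeing on a small scale: assigning to `df.loc[mask]` blanks every column of the flagged rows, not only `HSGPA`. The sketch below uses a made-up frame (`toy`, with invented values) to contrast that with masking a single column. Which behaviour is preferable depends on whether the rest of a flagged record is still trusted; this is an illustration, not a change to the lab's cleaning.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the values are made up for illustration.
toy = pd.DataFrame(
    {"HSGPA": [3.7, 31.0, 0.0, 3.9], "SEX": ["F", "M", "F", "M"]},
    index=pd.Index([1, 2, 3, 4], name="ANONID"),
)

# Flag impossible GPAs (above 4) and zeros used as a missing-data code.
bad = (toy.HSGPA > 4) | (toy.HSGPA == 0)

# Row-wide masking, as in the cell above: every column becomes missing.
row_masked = toy.copy()
row_masked.loc[bad] = None

# Column-wise masking: only the suspect GPA value becomes missing.
col_masked = toy.copy()
col_masked.loc[bad, "HSGPA"] = np.nan

print(row_masked)
print(col_masked)
```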

#### Plot Cleaned Data

In [None]:
plt.figure(figsize=(8,4))
df_student.HSGPA.hist(bins=25)
plt.xlabel('High School GPA'); plt.ylabel('Count')
# 'nonposy' was renamed to 'nonpositive' in Matplotlib 3.3 and later removed
plt.yscale('log', nonpositive='clip')

In [None]:
# More data cleaning

outliers = df_course.GRADE > 4
df_course.loc[outliers] = None
# Drop every row that has a missing value in any column
# (this includes the outlier rows blanked above)
df_course = df_course.dropna()

### 2.3 Combining Data
We can join the student and course data into a single data frame.

In [None]:
df_combined = df_course.join(df_student, on="ANONID")
df_combined.head()
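
To make the join above concrete, here is a minimal sketch on two made-up tables (`students` and `courses` are hypothetical stand-ins, not the LARC files). `DataFrame.join` with `on="ANONID"` looks up each course row's `ANONID` in the index of the student table and, by default, keeps every course row even when no matching student exists.

```python
import pandas as pd

# Hypothetical student table, indexed by the anonymous ID.
students = pd.DataFrame(
    {"HSGPA": [3.8, 3.5]},
    index=pd.Index([101, 102], name="ANONID"),
)

# Hypothetical course table: one row per student/course combination.
courses = pd.DataFrame(
    {"ANONID": [101, 101, 102, 103],
     "SUBJECT": ["PSYCH", "MATH", "PSYCH", "STATS"],
     "GRADE": [4.0, 3.0, 3.7, 3.3]}
)

# join() defaults to a left join: every course row is kept, and student
# columns are looked up by matching courses.ANONID to the students index.
combined = courses.join(students, on="ANONID")
print(combined)
```

Student 103 has course rows but no student record, so the `HSGPA` value on that row is missing rather than the row being dropped.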

## Section 3: Sampling
Often, data is collected from a small group in order to try to understand a larger group. For example, we might ask 100 random students to complete a survey in order to understand the entire student body. This process is called _sampling_. The data collected is called the ___sample___ and the larger group is the ___target population___.

In [None]:
# Helper functions

def get_bins(values):
    values = sorted(set(values))
    midpoints = [np.mean((values[i],values[i+1])) for i, g in enumerate(values[:-1])]
    left = values[0] - (midpoints[0] - values[0])
    right = values[-1] + (values[-1] - midpoints[-1])
    bins = [left] + midpoints + [right]
    return bins

grade_hist_bins = get_bins(df_course.GRADE)
all_grades = sorted(set(df_course.GRADE))

def get_sample_mean(df, num_samples=10):
    samples = df.GRADE.dropna().sample(num_samples, replace=True)
    return samples.mean()

def grade_hist(grades, color=(0,0,1,0.5)):
    n, bins, patches = plt.hist(grades, align="mid", bins=grade_hist_bins, width=0.1, color=color)
    plt.xticks(range(5), ['E', 'D', 'C', 'B', 'A'])
    plt.ylabel("Count")
    plt.xlabel("Grade")
    
def grade_stem(grades, color=(0,0,1,0.5)):
    bin_values, bins = np.histogram(grades, grade_hist_bins)
    # 'use_line_collection' was deprecated in Matplotlib 3.6 and later removed
    plt.stem(all_grades, bin_values, basefmt="none")
    plt.xticks(range(5), ['E', 'D', 'C', 'B', 'A'])
    ylim = plt.ylim()
    plt.ylim([0, ylim[1]])
    plt.ylabel("Count")
    plt.xlabel("Grade")

### 3.1 Sample Size

#### Population Mean
In this section, we will use the example of finding the mean course grade of students in the LARC data set. Since we have data on all students rather than just a sample, we can calculate the exact mean of all students, called the ___population mean___, but usually that's not possible. The cell below calculates the exact mean, which will be helpful to keep in mind throughout this section.

In [None]:
pop_mean = df_course.GRADE.mean()
pop_mean

#### Sampling
Now let's pretend we do not have access to everyone's grades. This is a really common situation, and it is why people conduct surveys. It is often much easier to ask a smaller group of people (called a sample) for information like grades than to get data on every single person (called the population). We can then use the group's answers to estimate everyone's answers. That is, we estimate the population data using the sample data. 
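
As a rough illustration of estimating a population value from samples, the sketch below builds a made-up population of grades (the grade proportions are invented, not taken from LARC) and measures how much the sample mean varies across repeated samples. For random samples, the spread of the sample mean is approximately the population standard deviation divided by the square root of the sample size. The helper name `spread_of_sample_means` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 100,000 grades on a 0-4 scale (not LARC data).
population = rng.choice([0.0, 1.0, 2.0, 3.0, 4.0], size=100_000,
                        p=[0.02, 0.05, 0.18, 0.40, 0.35])

def spread_of_sample_means(sample_size, repeats=1000):
    """Standard deviation of the sample mean across repeated samples."""
    means = [rng.choice(population, size=sample_size).mean()
             for _ in range(repeats)]
    return float(np.std(means))

for k in (3, 30, 300):
    observed = spread_of_sample_means(k)
    predicted = population.std() / np.sqrt(k)
    print(k, round(observed, 3), round(float(predicted), 3))
```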

#### Short Answer 2
In the cells below, you will explore how sample size influences the sample mean. Change the value of `sample_size` in the next cell, rerun the cells that follow, and answer these questions:
1. What sample sizes did you try?
2. How does the range of sample means change as sample size changes?
3. Which letter grade or grades do these ranges correspond to?

In [None]:
# change the value of sample size
sample_size = 3

The cell below simulates an experiment that draws ```sample_size``` random course grades and calculates their mean, called the sample mean. This is like asking that number of random students for a grade. The figure shows the grades sampled and is labeled with the sample mean. You can rerun this cell to see different random samples.

In [None]:
samples = df_course.GRADE.sample(sample_size)
plt.figure(figsize=(8,4))
grade_stem(samples)
plt.title("Sample mean: {:0.2f}".format(np.mean(samples)))

#### Sample Mean
The cell below repeats that process (draw a sample of `sample_size` random grades and take the mean) 10 times. It then prints the range of the sample means alongside the true population mean, so that you can see how close (or far) the sample means are from the population mean.

In [None]:
means = []
for i in range(10):
    m = df_course.GRADE.sample(sample_size).mean()
    means.append(m)
    print("Sample {} mean: {:0.2f}".format(i ,m))
print("Sample mean range: {:0.2f} — {:0.2f}".format(min(means), max(means)))
print("Population mean: {:0.2f}".format(pop_mean))

🤔Your answer here:

### 3.2 Sample Bias
In the previous section, we saw that estimates based on samples depend on the size of the sample. In this section, we explore what happens if a sample is drawn from a subset of the population, rather than the entire population. For comparison, the cell below shows the true population mean of all course grades.

In [None]:
pop_mean

#### Sampling Frame
The ___sampling frame___ is the set of people samples come from. Ideally, the sampling frame should be the entire population, but that may not always be possible.

As an example, let's assume that we're conducting a survey of student grades and are sampling student names from SAT records. Not all students took the SAT. So the set of students who might be sampled (the sampling frame) is smaller than the set of all students (the population).
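
The effect of a restricted sampling frame can be sketched with entirely made-up numbers. Below, a hypothetical population consists of two groups with different mean grades (the means 3.4 and 3.0 are invented), and only group A can be reached through the sampling frame. The sketch compares estimates drawn from the frame with estimates drawn from the whole population at several sample sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population made of two groups with different mean grades.
# Only group A is reachable through the sampling frame.
group_a = rng.normal(loc=3.4, scale=0.5, size=60_000)
group_b = rng.normal(loc=3.0, scale=0.5, size=40_000)
toy_population = np.concatenate([group_a, group_b])

print("population mean:", round(float(toy_population.mean()), 2))
for k in (10, 1000, 100_000):
    # Sampling with replacement, from the frame and from everyone.
    frame_mean = rng.choice(group_a, size=k).mean()
    full_mean = rng.choice(toy_population, size=k).mean()
    print(k, round(float(frame_mean), 2), round(float(full_mean), 2))
```

In this toy setup, larger samples make both estimates steadier, but the frame-only estimate settles on the mean of group A rather than the mean of the whole population.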

#### Short Answer 3
The cells below select students who have taken the SAT, then take 10 samples of size ```sample_size``` from those students and report the mean of each sample. Use the cells below to answer the following:
1. How does the range of sample means compare to the case where samples were drawn from all students? Consider both the size of the range and its center.
2. Does increasing the sample size improve the estimate of the population mean grade when the samples are only taken from students who completed the SAT?

In [None]:
# change the value of sample size
sample_size = 200

In [None]:
df_sat = df_combined[df_combined.LAST_SATI_VERB_SCORE.notnull()]
means = []
for i in range(10):
    m = df_sat.GRADE.sample(sample_size).mean()
    means.append(m)
    print("Sample {} mean: {:0.2f}".format(i, m))
print("Sample mean range: {:0.2f} — {:0.2f}".format(min(means), max(means)))
print("Population mean: {:0.2f}".format(pop_mean))

🤔Your answer here:

#### Short Answer 4
Compared to students who did not take the SAT, why might students who have taken the SAT have a higher grade point average?

🤔Your answer here:

### 3.3 Correlations
One of the most common tasks in quantitative analysis is to determine whether a relationship exists between two variables. When it does, those variables are said to be ___correlated___.

One way to examine correlation is to plot the two variables against each other and look at the slope of the best fit line. An upward slope is called a positive correlation and means an increase in one variable usually corresponds to an increase in the other.

A downward slope is called a negative correlation and means an increase in one variable usually corresponds to a decrease in the other.

The cell below shows examples of various kinds of correlations.

In [None]:
def rot(theta):
    theta = np.deg2rad(theta)
    return np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta), np.cos(theta)]
    ])

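def getcov(corr=1, slope=1, theta=0):
    # Variances must be positive, so use |slope| on the diagonal and carry
    # the sign of the slope on the covariance term. This keeps the matrix
    # valid for downward-sloping data.
    sign = np.sign(slope)
    cov = np.array([
        [1/abs(slope), sign * corr],
        [sign * corr, abs(slope)]
    ])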

    r = rot(theta)
    return r @ cov @ r.T

def generate_data(x=0, y=0, corr=1, slope=1, theta=0):
    # get the covariance matrix with the appropriate transforms
    cov = getcov(corr=corr, slope=slope, theta=theta)
    X, Y = np.random.multivariate_normal([x, y], cov, 2000).T   
    return X, Y 

In [None]:
plt.figure(figsize=(20, 7.5))
parameters = [
    (0.9, 'Strong Positive'),
    (0.5, 'Weak Positive'),
    (0.0, 'No Correlation'),
    (-0.5, 'Weak Negative'),
    (-0.9, 'Strong Negative')
]

for i, (r, title) in enumerate(parameters):
    x, y = generate_data(corr=r,slope=1)
    plt.subplot(2, 5, i + 1)
    plt.plot(x, y, '.', markersize=4)
    m, b = np.polyfit(x, y, 1)
    plt.title(title + " ("+str(r)+")")
    plt.xlim([-2, 2])
    plt.ylim([-2, 2])

slopes = [0.5, 1, 1.5, -0.5 , -1]
fixed_corr = 1

for i, slope in enumerate(slopes):
    x, y = generate_data(corr=fixed_corr,slope=slope)
    plt.subplot(2, 5, i + 6)
    plt.plot(x, y, '.', markersize=4)
    m, b = np.polyfit(x, y, 1)
    plt.title("Correlation = " + str(np.sign(slope)))
    plt.xlim([-2, 2])
    plt.ylim([-2, 2])
plt.tight_layout()
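
The strength of a correlation is usually summarized by the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it with `np.corrcoef` on synthetic data (the helper name `pearson_r` is hypothetical). It illustrates the point of the second row of plots above: the coefficient reflects how tightly the points follow a line and in which direction, not how steep the line is.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(size=2000)
noise = rng.normal(size=2000)

def pearson_r(a, b):
    """Pearson correlation coefficient from the 2x2 correlation matrix."""
    return float(np.corrcoef(a, b)[0, 1])

print(pearson_r(xs, 0.5 * xs))     # shallow upward line, no scatter
print(pearson_r(xs, 3.0 * xs))     # steep upward line, no scatter
print(pearson_r(xs, -2.0 * xs))    # downward line, no scatter
print(pearson_r(xs, xs + noise))   # upward trend with scatter
print(pearson_r(xs, noise))        # unrelated variables
```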

### 3.4 Spurious correlations
Correlations are a good indication that two variables are related, but are not always conclusive. If you measure a large number of variables and compare each one to the others, you will find some that correlate purely by chance. These are called ___spurious correlations___.

This section uses examples from Tyler Vigen's online directory of spurious correlations http://tylervigen.com/spurious-correlations. Each variable is a time series of 10 annual observations, such as the number of movies Nicolas Cage appeared in, or the number of Sociology PhDs awarded in the US.
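
To get a feel for how often chance alone produces a "significant" correlation, the sketch below generates 40 mutually independent random series of 10 values each (purely synthetic, not Vigen's data) and tests every pair with `scipy.stats.pearsonr`. With a significance cutoff of 0.05, roughly 5% of unrelated pairs are expected to pass the test. Real time series that share a trend can pass far more often than that.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 40 hypothetical, mutually independent series of 10 "annual" values each.
n_series, n_years = 40, 10
toy_series = rng.normal(size=(n_series, n_years))

n_pairs = 0
n_significant = 0
for i in range(n_series):
    for j in range(i + 1, n_series):
        r, p = stats.pearsonr(toy_series[i], toy_series[j])
        n_pairs += 1
        if p < 0.05:
            n_significant += 1

print(n_pairs, n_significant, round(n_significant / n_pairs, 3))
```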

In [None]:
# Helper functions

def show_correlation(pair):
    p, r, a, b = pair
    df = a.join(b)
    x, y = [df[c] for c in df.columns]
    plt.figure(figsize=(4,4))
    plt.plot(x, y, '.', markersize=10)
    plt.xlabel(a.columns[0])
    plt.ylabel(b.columns[0])
    
def show_time_correlation(pair):
    fig, ax2 = plt.subplots(1,1, figsize=(8,4))
    p, r, a, b = pair
    df = a.join(b)
    x, y = [df[c] for c in df.columns]
    ax2.set_xlabel('Year')
    ax2.set_ylabel(a.columns[0])
    lns2 = ax2.plot(df.index, x, 'or-', label=df.columns[0])
    ax3 = ax2.twinx()
    ax3.set_ylabel(b.columns[0])
    lns3 = ax3.plot(df.index, y, 'sb-', label=df.columns[1])
    lns = lns2 + lns3
    labs = [l.get_label() for l in lns]
    ax2.legend(lns, labs, loc=0)
    plt.tight_layout()
    
def plot_data(a):
    a = a.set_index('Year')
    fig, ax2 = plt.subplots(1,1, figsize=(8,4))
    ax2.set_xlabel('Year')
    ax2.set_ylabel('Count')
    x = a[a.columns[0]]
    lns2 = ax2.plot(a.index, x, 'sb-', label=a.columns[0])
    lns = lns2
    labs = [l.get_label() for l in lns]
    ax2.legend(lns, labs, loc=0)
    plt.tight_layout()

def find_correlations(df):
    correlations = []
    for i in range(len(df)):
        for j in range(i + 1, len(df)):
            a = df[i].set_index('Year')
            b = df[j].set_index('Year')
            df_both = a.join(b)
            x, y = [df_both[c] for c in df_both.columns]
            r, p = spstats.pearsonr(x, y)
            if p < 0.05:
                correlations.append( (p, r, a, b))
    return sorted(correlations, key=lambda x: x[1], reverse=True)

In [None]:
data = [
    "data/cage.csv",
    "data/fall-pool.csv",
    "data/steam.csv",
    "data/bedsheets.csv",
    "data/sociology.csv",
    "data/cs.csv",
    "data/economics.csv",
    "data/anthropology.csv"]
df = [pd.read_csv(d) for d in data]

#### Visualize data
Now we can plot the various data sets.

In [None]:
plot_data(df[4])

#### Finding Correlations
The following cell compares each of the 8 data sets with all of the others and lists the pairs that are significantly correlated with each other.

In [None]:
pairs = find_correlations(df)
for i, (p, r, a, b) in enumerate(pairs):
    print(i, ':', a.columns[0], '—', b.columns[0])

#### Short Answer 5
The code above compares all 28 possible combinations of the 8 data sets and lists just the ones with significant correlations. What fraction of these combinations are correlated?

🤔Your answer here:

#### Visualizing Correlation
The cells below visualize the correlation between variables in two different ways. The first plots one variable against the other. The next plots both variables over time.

In [None]:
show_correlation(pairs[0])

In [None]:
show_time_correlation(pairs[0])

## Section 4: Prediction
One common task in computational social science is to predict some feature of future data from data that has already been seen. This section will use the example of academic majors. Imagine you have access to the courses and grades of a group of students and want to determine what their current (or future) academic major is. This section will walk you through the process.

#### Subject grades
We will be predicting academic majors based on Grade Points Earned, the sum of all course grades a student has achieved (also called Michigan Honor Points at the University of Michigan).

Specifically, we will compare grade points earned within one subject (psychology) to grade points earned in other subjects. The cells below prepare the data by sorting it into psychology and non-psychology majors and producing a tuple of psychology and non-psychology grade points earned for each student.

In [None]:
#settings
df_combined.rename(columns={'LAST_ACT_ENGL_SCORE': 'ACT_ENGLISH', 
                            'LAST_ACT_MATH_SCORE': 'ACT_MATH',
                            'LAST_ACT_READ_SCORE': 'ACT_READING', 
                            'LAST_ACT_SCIRE_SCORE': 'ACT_WRITING',
                            'LAST_ACT_COMP_SCORE': 'ACT_SCORE',
                            'LAST_SATI_VERB_SCORE': 'SAT_VERBAL',
                            'LAST_SATI_MATH_SCORE': 'SAT_MATH',
                            'LAST_SATI_TOTAL_SCORE': 'SAT_TOTAL'}, inplace=True)

major = 'Psychology BA'
subject = 'PSYCH'
n = 5000
variables = ['ANONID', 'major', 'HSGPA', 'SEX',
             'ACT_ENGLISH', 'ACT_MATH', 'ACT_READING', 'ACT_WRITING',
             'ACT_SCORE', 'SAT_VERBAL', 'SAT_MATH', 'SAT_TOTAL']

# make sex numeric (1 == female)
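# (assign the result back; in-place replace on a selected column is unreliable in recent pandas)
df_combined['SEX'] = df_combined['SEX'].replace({'F': 1, 'M': 0})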

# categorize students and classes
df_combined['major'] = (df_combined.MAJOR1_DESCR == major).astype(int)
df_combined['major'] = df_combined['major'].replace({1: major, 0: 'Other'})
df_combined['on_topic'] = (df_combined.SUBJECT == subject).astype(int)

#Get total grades in the subject
topical = df_combined[df_combined.on_topic == 1][variables + ['GRADE']]
topical = topical.groupby(variables).sum().reset_index()
topical.rename(columns={'GRADE':'subject_grade'}, inplace=True)

#get total grades in other subjects
ot = df_combined[df_combined.on_topic == 0][['ANONID', 'GRADE']]
ot = ot.groupby(['ANONID']).sum().reset_index()
ot.rename(columns={'GRADE':'other_grade'}, inplace=True)

#merge grade info together
together = topical.merge(ot, on='ANONID')

#sample for balanced classes
majors = together[together.major == major].sample(n, replace=True)
others = together[together.major != major].sample(n, replace=True)
together = pd.concat([majors, others], sort=False)
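
The core of the preparation above is a grouped sum followed by a merge. The sketch below repeats those two steps on a tiny made-up table (`toy_courses` and `points` are hypothetical names) so the result can be checked by hand. Note that the merge is an inner merge, so a student with no course in the subject of interest does not appear in the result.

```python
import pandas as pd

# Hypothetical course rows; grades are on a 0-4 scale.
toy_courses = pd.DataFrame({
    "ANONID":  [1, 1, 1, 2, 2, 3],
    "SUBJECT": ["PSYCH", "PSYCH", "MATH", "PSYCH", "HISTORY", "MATH"],
    "GRADE":   [4.0, 3.0, 3.0, 2.0, 4.0, 3.7],
})

in_subject = toy_courses.SUBJECT == "PSYCH"

# Grade points earned = sum of grades, computed separately for
# courses inside and outside the subject of interest.
subject_points = toy_courses[in_subject].groupby("ANONID").GRADE.sum()
other_points = toy_courses[~in_subject].groupby("ANONID").GRADE.sum()

# An inner merge keeps only students who appear in both tables.
points = pd.merge(
    subject_points.rename("subject_grade").reset_index(),
    other_points.rename("other_grade").reset_index(),
    on="ANONID",
)
print(points)
```

Student 3 took no PSYCH course and is therefore absent from `points`.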

### 4.1 Visualize
The cell below visualizes these data.

In [None]:
plt.figure(figsize=(8,8))
for m in together.major.unique():
    tmp = together[together.major == m]
    plt.scatter(tmp.subject_grade, tmp.other_grade, s=2, label=m)
plt.legend()
plt.xlabel("Grade Points Earned in "+subject)
plt.ylabel("Grade Points Earned in Other Subjects")

#### Short Answer 6
Looking at the above figure, how would you predict the major of a student from their grade points earned?

🤔Your answer here:

### 4.2 Features
The cell below combines the grade points for all students into one list, and the corresponding academic major labels for students into another. These are the lists that will be used to predict academic majors and test our predictions.

In [None]:
X = together[['subject_grade', 'other_grade']]
y = together['major']
print("Features: ", X.values[0])
print("Label:", y.values[0])

### 4.3 Classifiers
Now we will create a ___classifier___ to make predictions based on the features we have created. We first give the classifier a set of features with known labels, called training data. We're asking the classifier to figure out some pattern in the input data, grade points in different subjects, that is related to the thing we want to know, a student's major. Once it learns this pattern (for now, don't worry about how it learns), it can tell us things like, "given these grade points, I think the student is a psychology major."

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95)
classifier = SVC(kernel="linear", probability=True)
classifier.fit(X_train, y_train)
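
The same fit-then-predict pattern can be tried on data where the right answer is obvious. The sketch below trains a linear `SVC` on two well-separated synthetic clusters (all names prefixed with `toy_` are hypothetical, and the cluster centers are invented) and then checks its predictions on held-out points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Two hypothetical, well-separated clusters of 2-D feature vectors.
toy_X = np.vstack([
    rng.normal(loc=[2.0, 8.0], scale=1.0, size=(200, 2)),  # label "A"
    rng.normal(loc=[8.0, 2.0], scale=1.0, size=(200, 2)),  # label "B"
])
toy_y = np.array(["A"] * 200 + ["B"] * 200)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=0, stratify=toy_y)

toy_clf = SVC(kernel="linear")
toy_clf.fit(toy_X_train, toy_y_train)

# Fraction of held-out points whose label is predicted correctly.
toy_accuracy = (toy_clf.predict(toy_X_test) == toy_y_test).mean()
print(toy_accuracy)
print(toy_clf.predict([[1.0, 9.0], [9.0, 1.0]]))
```

Real data, such as the grade points above, rarely separates this cleanly, which is why the following sections measure the classifier's errors.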

#### Testing The Classifier
The cell below picks a random student from the test set and predicts their major using the classifier. It also prints out the correct academic major so we can see whether the classifier is correct.

In [None]:
i = random.randint(0, len(X_test) - 1)
sample = X_test.values[i]
prediction = classifier.predict([sample])
true = y_test.values[i]
print(subject+" grade points:", sample[0])
print("Other grade points:", sample[1])
print("Predicted:\t", prediction[0])
print("True:\t\t", true)

#### Short Answer 7
Run the above cell several times. When the classifier makes a mistake, is it usually predicting a student is a psychology major when they aren't? Or predicting psychology majors are other majors? Or is it about 50/50?

🤔Your answer here:

#### Calculating Errors
The cell below calculates the number of Psychology majors correctly classified (True Positive), the number of Other majors correctly classified (True Negative), Psychology majors classified as Other (False Negative), and Other majors classified as Psychology majors (False Positive).

In [None]:
prediction = classifier.predict(X_test)

true_positive = sum((prediction == 'Psychology BA') & (y_test == 'Psychology BA'))
false_positive = sum((prediction == 'Psychology BA') & (y_test == 'Other'))
true_negative = sum((prediction == 'Other') & (y_test == 'Other'))
false_negative = sum((prediction == 'Other') & (y_test == 'Psychology BA'))

print('Number of Psychology BAs predicted to be Psychology BAs: ', true_positive)
print('Number of Psychology BAs predicted to be Other: ', false_negative)
print('Number of Other majors predicted to be Other: ', true_negative)
print('Number of Other majors predicted to be Psychology BAs: ', false_positive)

### 4.4 Precision and Recall

There are several ways to measure the quality of a classifier. We will talk about ___precision___ and ___recall___. Using psychology majors as an example,
- With a high ___precision___ classifier, we can be confident that anyone it says is a psychology major really is one. 

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$

- With a high ___recall___ classifier, we can be confident that few of the psychology majors were labeled as another major.

$$ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$

- An ideal classifier would have high precision and high recall, but a bad one might have neither. 
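
As a worked example of the two formulas above, the sketch below plugs in made-up counts (these are not the counts produced by the classifier in this lab).

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp = 40   # psychology majors predicted as psychology majors
fp = 10   # other majors predicted as psychology majors
fn = 20   # psychology majors predicted as other
tn = 30   # other majors predicted as other (not used by either formula)

toy_precision = tp / (tp + fp)   # 40 / 50
toy_recall = tp / (tp + fn)      # 40 / 60

print(f"precision = {toy_precision:.2f}")   # precision = 0.80
print(f"recall    = {toy_recall:.2f}")      # recall    = 0.67
```

Neither formula uses the true negative count, so both measures focus on how the positive class (here, psychology majors) is handled.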

#### Short Answer 8
Using the values above, calculate the precision and recall of the classifier.

🤔Your answer here:

Often, there is a tradeoff between precision and recall. 
- If a model says that every student is a psychology major, for example, then it would have perfect recall (all psych majors correctly guessed as psych majors), but terrible precision (all non-psych majors are also guessed as psych majors).
- If it labels only the most likely psychology major as a psychology major, then it probably has good precision (one out of one, or 100% of guesses correct), but it will have terrible recall (most psychology majors listed as other majors).
- We can look at this tradeoff by changing the "threshold" of the classifier, shown in the cell below.

In [None]:
# Helper functions

def get_proba(X, y, test_size=0.5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    classifier = SVC(kernel="linear", probability=True)
    classifier.fit(X_train, y_train)
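    y_pred_proba = classifier.predict_proba(X_test)
    # predict_proba columns follow classifier.classes_; pick the target major's column
    major_col = list(classifier.classes_).index(major)
    return y_test, y_pred_proba[:, major_col]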

def get_precision_recall(y_true, y_proba, threshold):
    y_pred = [major if x > threshold else 'Other' for x in y_proba]
    precision = precision_score(y_true, y_pred, pos_label=major)
    recall = recall_score(y_true, y_pred, pos_label=major)
    return precision, recall

In [None]:
precision = []
recall = []
thresholds = np.arange(0,1,.05)

y_true, y_proba = get_proba(X, y, test_size=0.95)

for threshold in tqdm(thresholds):
    p, r = get_precision_recall(y_true, y_proba, threshold=threshold)
    precision.append(p)
    recall.append(r)
plt.figure(figsize=(4,4))
plt.plot(thresholds, precision, label="Precision")
plt.plot(thresholds, recall, label="Recall")
plt.xlabel("Threshold");
plt.legend()

#### Short Answer 9
Depending on the application, sometimes precision is more desirable and sometimes recall is more desirable. For example, if you were using a classifier to find likely locations of a rare lost treasure, you would want a high recall, to make sure you didn't miss possible locations.

Can you think of **one** example where recall is more important and **one** where precision is more important? Explain your reasoning.

🤔Your answer here:

#### F1 Score
It can be confusing to consider two separate measures of quality, so they are sometimes combined into a single measure called the ___F1 score___. The cell below shows how F1 score changes along with precision and recall.

$$ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

In [None]:
# F1 is undefined when precision and recall are both zero; use 0 in that case
f1 = [2 * p * r / (p + r) if (p + r) > 0 else 0.0 for p, r in zip(precision, recall)]
plt.figure(figsize=(4,4))
plt.plot(thresholds, precision, label="Precision")
plt.plot(thresholds, recall, label="Recall")
plt.plot(thresholds, f1, label="F1")
plt.legend()
plt.xlabel('Threshold')
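
The F1 formula above is the harmonic mean of precision and recall, which is pulled toward the smaller of the two values. The sketch below uses a hypothetical helper, `f1_from`, with made-up inputs to show this: a classifier with perfect precision but very low recall gets a low F1 score, even though the ordinary average of the two values would be 0.55.

```python
def f1_from(precision_value, recall_value):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision_value + recall_value
    if total == 0:
        return 0.0
    return 2 * precision_value * recall_value / total

print(round(f1_from(0.8, 0.8), 3))   # 0.8
print(round(f1_from(1.0, 0.1), 3))   # 0.182
print(round(f1_from(0.0, 0.0), 3))   # 0.0
```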

### 4.5 Validation

In [None]:
# Test with training data
n = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=n, 
                                                    test_size=X.shape[0]-n, 
                                                    stratify=y)
classifier = SVC(kernel="linear")
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_train)
print("F-1 score for the training data is: ", f1_score(y_train, y_pred, pos_label=major))

In [None]:
# Test with out-of-sample data
y_pred = classifier.predict(X_test)
print("F-1 score for the testing data is: ", f1_score(y_test, y_pred, pos_label=major))

## Reflect and Try it Yourself

In the previous section, we built a classifier that used grade points earned in psychology courses and grade points earned in other courses to predict whether a student is a Psychology BA or Other major. Now it's your turn to build a classifier. We can use the other variables in our data to try and improve our classifier. Your goal will be to build a classifier that outperforms the previous classifier. This is to say that you will evaluate your classifier's precision, recall, and F1 scores and argue why your classifier is better.

In this section, you should choose 1 or more variables to use as predicting variables instead of (or alongside of) grade points.

Here are the variables you can choose from:

``subject_grade`` <br>
``other_grade`` <br>
``HSGPA`` <br>
``SEX`` <br>
``ACT_ENGLISH`` <br>
``ACT_MATH`` <br>
``ACT_READING`` <br>
``ACT_WRITING`` <br>
``ACT_SCORE`` <br>
``SAT_VERBAL`` <br>
``SAT_MATH`` <br>
``SAT_TOTAL`` <br>

In the following cell, write the name of each variable you would like to use in a list. Each item should be in quotations and separated by a comma.

In [None]:
# list the variables you would like to include in your classifier
# x_vars = ['subject_grade', 'other_grade'] these are the variables we used earlier

# enter your choice of variables in quotes, separated by a comma
x_vars = []

In [None]:
# training the classifier
X = together[x_vars]
y = together['major']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95)
classifier = SVC(kernel="linear", probability=True)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_train)
print("Precision for the training data is: ", precision_score(y_train, y_pred, pos_label=major))
print("Recall for the training data is: ", recall_score(y_train, y_pred, pos_label=major))
print("F-1 score for the training data is: ", f1_score(y_train, y_pred, pos_label=major))

# Test with out-of-sample data
y_pred = classifier.predict(X_test)
print("Precision for the testing data is: ", precision_score(y_test, y_pred, pos_label=major))
print("Recall for the testing data is: ", recall_score(y_test, y_pred, pos_label=major))
print("F-1 score for the testing data is: ", f1_score(y_test, y_pred, pos_label=major))

#### Reflection Question 1

1. In the Try it Yourself section, what variables did you use to build your classifier?
2. What are the precision, recall, and F1 scores for your classifier for the training data?
3. What are the precision, recall, and F1 scores for your classifier for the testing data?
4. The goal was to build a classifier that performs better than our earlier classifier. Why is your classifier better? (Hint: Which measure(s) do you consider to be most important for this classification task?)

🤔Your answer here:

#### Reflection Question 2

1. Consider a course where every student completing the course got an A. From only this information would you conclude that it would be easy to get an A in this class?
2. The student performance data doesn't contain grades for students who withdrew from a course without completing it. Now, you learn that 75% of the students dropped the class after failing the midterm. Does that change your answer to the above question?
3. Assuming that students only withdraw when they are performing poorly, how does the exclusion of students who have withdrawn from courses change the apparent student performance in a class?

🤔Your answer here:

#### Reflection Question 3
In Section 3 of this lab, we saw that increasing the sample size decreased the range, or ___variance___, of estimates. We also saw that systematically excluding a group of individuals from the sample can raise or lower the entire range, an effect called ___bias___.

Over the past decades, ___grade inflation___ has caused average course grades to steadily increase at many universities. Imagine you are using historical data to predict a current student's performance.

1. How, if at all, could grade inflation increase the variance of the prediction?
2. How, if at all, could grade inflation increase the bias of the prediction?

🤔Your answer here:

#### Reflection Question 4
Usually, the more data you use to train a classifier, the more accurate it will be. However, the more data you set aside for training, the less you have available for testing.

1. What is the advantage of using a large amount of training data?
2. What is the advantage of using a large amount of test data?

🤔Your answer here:
