# Engineering Data Analysis

> **Mohamad M. Hallal, PhD** <br> Teaching Professor, UC Berkeley

[![License](https://img.shields.io/badge/license-CC%20BY--NC--ND%204.0-blue)](https://creativecommons.org/licenses/by-nc-nd/4.0/)
***

# Linear Combination of Random Variables

In this notebook, we will explore how to compute numerical summaries (mean and variance) of a linear combination of random variables. We will first examine a dataset of exam scores, then generate random test scores, and compare how combining the two tests affects the total score in each case.

Let's get started!

# Dataset

Let's load the provided dataset `grades.csv`. It contains the following features:

| Feature  | Units | Description                            |
| :-       | :-    | :-                                     |
| Test 1   | %     | Test 1 score of a student              |
| Test 2   | %     | Test 2 score of a student              |

Run the cell below, which reads the data and saves it as a variable named `data`.

In [None]:
import pandas as pd

# Load dataset
data = pd.read_csv('resources/grades.csv')

# Extract Test 1 and Test 2 scores
test1 = data['Test1(%)']
test2 = data['Test2(%)']

# Display the first few rows of the data using the head() method
data.head()

Let's visually examine whether there is any relationship between Test 1 and Test 2 grades.

Run the cell below to create a scatter plot of Test 2 (y) versus Test 1 (x) grades.

In [None]:
import matplotlib.pyplot as plt

# plot the scatter plot
plt.scatter(test1, test2)

# label the axes
plt.xlabel('Test 1 (%)')
plt.ylabel('Test 2 (%)')

# set the axis limits
plt.xlim(0, 105)
plt.ylim(0, 105)

plt.show()

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>What does the scatter plot tell you about the relationship between Test 1 and Test 2 scores?</b></div> 
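One way to put a number on what a scatter plot shows is the sample correlation coefficient. The sketch below is a minimal illustration on made-up scores; the arrays `x` and `y` are hypothetical stand-ins, not values from `grades.csv`.

```python
import numpy as np

# Hypothetical scores for five students (not taken from grades.csv)
x = np.array([55.0, 62.0, 70.0, 81.0, 93.0])
y = np.array([50.0, 65.0, 68.0, 85.0, 90.0])

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is the correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(f'Correlation coefficient: {r:.2f}')
```

Values near +1 or -1 suggest a strong linear relationship, while values near 0 suggest little or none. For the notebook's data, the equivalent pandas call would be `test1.corr(test2)`.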

# Total Score

The total score on tests can be defined as the sum of the Test 1 and Test 2 scores:

$$\text{Total Score} = \text{Test 1} + \text{Test 2}$$

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>What can you say about the above equation? Is it a linear combination or not?</b></div> 

Run the code below to compute the total test score for each student.

In [None]:
total = test1 + test2

# Display the first few total scores using the head() method
total.head()

# Expected Value

We want to determine the mean total score.

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Compute the mean of the Test 1, Test 2, and Total scores.</div> 

In [None]:
test1_mean = ...
test2_mean = ...
total_mean = ...

print(f'Test 1 Mean:  {test1_mean:.0f}%')
print(f'Test 2 Mean:  {test2_mean:.0f}%')
print(f'Total Mean:   {total_mean:.0f}%')

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>Imagine we don't have access to the individual scores for Test 1 and Test 2. Instead, we're only told that the mean score on Test 1 is 79% and the mean score on Test 2 is 78%. Can we still calculate the mean total score? If so, how? Feel free to test your logic in the cell below.</b></div> 
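The reasoning behind the prompt above can be checked numerically. The sketch below assumes only the two means quoted in the prompt (79% and 78%) and, separately, a pair of hypothetical score arrays `x` and `y` that are not from `grades.csv`. It illustrates that the mean of a sum equals the sum of the means, whatever the relationship between the two tests.

```python
import numpy as np

# Means quoted in the discussion prompt
test1_mean_given = 79
test2_mean_given = 78

# Linearity of expectation: E[X + Y] = E[X] + E[Y]
total_mean_given = test1_mean_given + test2_mean_given
print(f'Total mean from summaries: {total_mean_given}%')

# Sanity check on hypothetical scores (not from grades.csv)
x = np.array([60.0, 75.0, 82.0, 90.0])
y = np.array([58.0, 80.0, 79.0, 95.0])
means_agree = np.isclose((x + y).mean(), x.mean() + y.mean())
print(f'Mean of sum equals sum of means: {means_agree}')
```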

# Variance

We want to determine the variance of the total score.

<div class="alert alert-block alert-info"> <b>YOUR TURN!</b> Compute the variance of the Test 1, Test 2, and Total scores.</div> 

In [None]:
test1_var = ...
test2_var = ...
total_var = ...

print(f'Test 1 Variance: {test1_var:.0f}%^2')
print(f'Test 2 Variance: {test2_var:.0f}%^2')
print(f'Total Variance:  {total_var:.0f}%^2')

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>Imagine we don't have access to the individual scores for Test 1 and Test 2. Instead, we're only told that the variance on Test 1 is 259%^2 and the variance on Test 2 is 309%^2. Can we still calculate the variance of the total score? If so, how? Feel free to test your logic in the cell below.</b></div> 
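A general identity worth keeping in mind here is $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$, so the two variances alone are not enough when the scores are related. The sketch below checks the identity on hypothetical, deliberately correlated scores; the arrays `x` and `y` are illustrative and not taken from `grades.csv`.

```python
import numpy as np

# Hypothetical correlated scores (not from grades.csv)
x = np.array([55.0, 62.0, 70.0, 81.0, 93.0])
y = np.array([50.0, 65.0, 68.0, 85.0, 90.0])

# Use ddof=1 (sample variance) to match the pandas default
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
cov_xy = np.cov(x, y)[0, 1]  # np.cov also uses ddof=1 by default

# Variance of the sum, computed directly and via the identity
var_sum_direct = (x + y).var(ddof=1)
var_sum_identity = var_x + var_y + 2 * cov_xy

print(f'Direct:             {var_sum_direct:.1f}')
print(f'Via identity:       {var_sum_identity:.1f}')
print(f'Ignoring Cov(X, Y): {var_x + var_y:.1f}')
```

For the notebook's pandas Series, the sample covariance would be available as `test1.cov(test2)`.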

Let's simulate a random dataset for Test 1 and Test 2 scores. Run the code below to generate random values for Test 1 and Test 2 and then create a scatter plot of the randomized Test 2 (y) versus Test 1 (x) grades.

In [None]:
import numpy as np

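# set the random seed so the results are reproducible
np.random.seed(14)

# draw 100 random integer scores for each test, from 30 to 99
# (the upper bound of np.random.randint is exclusive)
test1_rand = np.random.randint(30, 100, 100)
test2_rand = np.random.randint(30, 100, 100)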

# plot the scatter plot
plt.scatter(test1_rand, test2_rand)

# label the axes
plt.xlabel('Random Test 1 (%)')
plt.ylabel('Random Test 2 (%)')

# set the axis limits
plt.xlim(0, 105)
plt.ylim(0, 105)

plt.show()

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>What does the scatter plot tell you about the relationship between the randomized Test 1 and Test 2 scores?</b></div>

Next, run the cell below to compute the total score and then the variances.

In [None]:
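total_rand = test1_rand + test2_rand

# Note: NumPy's var() defaults to ddof=0 (population variance),
# whereas pandas' var() defaults to ddof=1 (sample variance)
test1_rand_var = test1_rand.var()
test2_rand_var = test2_rand.var()
total_rand_var = total_rand.var()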

print(f'Test 1 Variance: {test1_rand_var:.0f}%^2')
print(f'Test 2 Variance: {test2_rand_var:.0f}%^2')
print(f'Total Variance:  {total_rand_var:.0f}%^2')

<div class="alert alert-block alert-warning"> <b>DISCUSS!</b> <b>Again, imagine we don't have access to the individual scores for Test 1 and Test 2. Instead, we're only told that the variance on Test 1 is 413%^2 and the variance on Test 2 is 421%^2. Can we still calculate the variance of the total score? If so, how?</b></div> 
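For independently generated scores, the covariance term in $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$ is expected to be close to zero, so the sum of the two variances should approximate the variance of the total. The sketch below is an illustration under assumptions: it uses `np.random.default_rng` with an arbitrary seed and a larger sample than the notebook, so its numbers will not match the 413 and 421 quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen for illustration only

# Independent integer scores from 30 to 99 (upper bound is exclusive)
a = rng.integers(30, 100, 10_000)
b = rng.integers(30, 100, 10_000)

var_a = a.var()
var_b = b.var()
cov_ab = np.cov(a, b, ddof=0)[0, 1]  # ddof=0 to match ndarray.var()

var_total = (a + b).var()

print(f'Var(a) + Var(b):         {var_a + var_b:.1f}')
print(f'Var(a + b):              {var_total:.1f}')
print(f'2 * Cov(a, b) (the gap): {2 * cov_ab:.1f}')
```

With the values quoted in the prompt, the same approximation gives 413 + 421 = 834%^2, which can be compared against the total variance printed by the notebook.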