# Homework 01: Data Cleaning and Exploratory Data Analysis 
***

**Name**: 

***

This assignment is due on Canvas by **6:00PM on Friday September 10**. Your solutions to theoretical questions should be done in Markdown directly below the associated question.  Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available in the Homework 01 assignment write-up on Canvas. To make life easier on the grader if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Remember that there is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do Kernel $\rightarrow$ Restart & Run All as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 

---

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

### Problem 1 (10 points)
***

<img style="float: right; width: 200px; padding: 3mm;" src="attachment:Screen%20Shot%202021-08-16%20at%2011.50.04%20AM.png" alt="Drawing"/> 

**Part A (5 of 10 points):** There is a new co-ed engineering dorm on campus. It is called "Casa Matt & Trey", or CMT dorm. CMT dorm has 4 floors. Each floor has 24 students; 12 men and 12 women per floor. Furthermore the first floor is reserved for 18-year olds, the second floor for 19-year olds, the third floor for 20-year olds, and the 4th floor for 21-year olds. All 96 students are listed on a dorm roster.

The administration is interested in the student opinion of the new dorm. A survey is sent to 2 randomly chosen 18-year old women, 2 randomly chosen 19-year old women, 2 randomly chosen 20-year old women, and 2 randomly chosen 21-year old women. Then the same process is repeated for the men of each age in the dorm, all of whose names are listed on the roster.

$$ \quad $$
    
Identify the following: 

- the population 
- the sample frame 
- the sample 
- the type of sample 
- the variable of interest

$\color{red}{\text{Solution to Part A in this cell.}}$

**Part B (5 of 10 points):** After analyzing the results the administration decides it is wiser to perform a census sample of CMT dorm to determine the opinion of the new facility.

Answer the following:

- What is the population 
- What is the sample frame 
- What is the sample 
- How many people were surveyed this time around
- If everyone honestly answered the survey, then is this data biased towards a particular age group or gender?

$\color{red}{\text{Solution to Part B in this cell.}}$

### Problem 2 (30 points)
***

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable of interest.  Recall that the sample mean $\bar{x}$ and sample variance $s^2$ are given by 

$$
\bar{x} = \frac{1}{n}\sum_{k=1}^n x_k \quad \textrm{and} \quad s^2 = \frac{1}{n-1}\sum_{k=1}^n \left( x_k - \bar{x}\right)^2 
$$

Recall that the standard deviation is typically found by taking the square root of the variance.

Notice that the above computation of the variance requires two passes over the data: one to compute the mean, and a second to subtract the mean from each observation and compute the sum of squares. Additionally, this manner of computing sample variance can lead to numerical problems and inefficiencies. For example, if the $x$'s are "large" and the differences between them "small", computing the sample variance using the formula above requires computing a small number as the difference of two large numbers. Not good! Check out this optional Wikipedia article about [Loss of Significance](https://en.wikipedia.org/wiki/Loss_of_significance) to learn more.
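
As a small illustration of loss of significance (not part of the assignment), the sketch below computes a variance on made-up data whose values are large but close together. It uses the algebraically equivalent sum-of-squares shortcut $\frac{1}{n(n-1)}\left(n\sum x_k^2 - \left(\sum x_k\right)^2\right)$, which subtracts two very large totals, and compares the result with `statistics.variance` from the Python standard library, used here only as a trusted reference. The data values and variable names are assumptions chosen for the demonstration.

```python
import statistics

# Made-up data: large values with small differences. The true sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

n = len(data)
total = 0.0      # running sum of the values
total_sq = 0.0   # running sum of the squared values
for x in data:
    total += x
    total_sq += x * x

# Sum-of-squares shortcut: a small number formed as the difference of two huge ones.
shortcut = (n * total_sq - total * total) / (n * (n - 1))

# Reference value from the standard library.
reference = statistics.variance(data)

print("shortcut :", shortcut)
print("reference:", reference)
```

In double precision the shortcut lands far from 30, and it can even come out negative, which no true variance can be.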

It is often useful to be able to compute the variance and/or standard deviation in a single pass, inspecting each value $x_k$ only once; for example, when the data are being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation. In this problem (part B), we will compute the variance as the data arrives one at a time; we will thus not need to save the data for a second pass.  

**Part A (5 points)**: It is common to manipulate expressions algebraically to rewrite them in a more convenient way. We will practice this idea on the following identity. Please note that neither side of the identity solves the numerical issues noted above. This is simply a mathematical exercise to see how to manipulate an expression and change its form.

Prove the following identity by algebraically manipulating one side to derive the other side.

$$
\displaystyle s^2 = \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} = \frac{1}{n(n-1)} \left(n \cdot \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i \right)^2 \right)
$$
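
Before attempting the proof, it can be reassuring to check the identity numerically. The sketch below does this in exact rational arithmetic with `fractions.Fraction`, so rounding error cannot mask a mismatch. The toy data are made up, and `statistics.variance` stands in for the left-hand side. A check like this is only a sanity test and does not replace the algebraic proof.

```python
from fractions import Fraction
import statistics

# Made-up toy data, stored as exact fractions.
data = [Fraction(v) for v in (3, 1, 4, 1, 5, 9, 2, 6)]
n = len(data)

# Left-hand side: the usual sample variance (exact for Fraction inputs).
lhs = statistics.variance(data)

# Right-hand side of the identity.
sum_x = sum(data)
sum_x_sq = sum(x * x for x in data)
rhs = (n * sum_x_sq - sum_x ** 2) / (n * (n - 1))

print(lhs, rhs, lhs == rhs)
```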

$\color{red}{\text{Solution to Part A in this cell.}}$

**Part B (10 points):** Use the following algorithm to complete the function `running_variance` below that returns a running estimate of the variance. This function takes a list of integers as input and returns the variance of the list once we have iterated through the entire data set.

This algorithm provides a recursive way to compute the mean and the variance as we make one pass through the data set. 

Consider an arbitrary data set of $n$ elements: $[x_1, x_2, \ldots, x_i, \ldots, x_{n-1}, x_n]$

We will let $\bar{x}_k$ represent the mean of the first $k$ elements of this dataset. We will let $s_k$ represent the running sum of squared deviations of the first $k$ elements from their mean $\bar{x}_k$. Note that $s_k$ is not itself a standard deviation; dividing it by $k-1$ gives the variance. 

- Initialize $\bar{x}_1 = x_1$ and $s_1 = 0$.

- Use the recursive formulas below to update the mean and the sum of squared deviations as we iterate through our data set.
\begin{align*}
\bar{x}_k &= \bar{x}_{k-1} + \frac{x_k - \bar{x}_{k-1}}{k} \\
s_k &= s_{k-1} + (x_k - \bar{x}_{k-1})\cdot(x_k-\bar{x}_k) \\
\end{align*}
- For $2 \leq k \leq n$, the sample variance of the first $k$ elements is $s^2 = \frac{s_k}{k-1}$.
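
As a hand check for an implementation, here is the recursion traced on the made-up data set $[2, 4, 6]$, chosen only for easy arithmetic. Start with $\bar{x}_1 = 2$ and $s_1 = 0$.

\begin{align*}
\bar{x}_2 &= 2 + \frac{4 - 2}{2} = 3, & s_2 &= 0 + (4 - 2)\cdot(4 - 3) = 2 \\
\bar{x}_3 &= 3 + \frac{6 - 3}{3} = 4, & s_3 &= 2 + (6 - 3)\cdot(6 - 4) = 8
\end{align*}

After the final element, $s^2 = \frac{s_3}{3-1} = 4$, which agrees with the direct computation $\frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3-1} = 4$.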

Note: Don't forget as you are coding that Python is 0-indexed. 

$\color{red}{\text{Solution to Part B below in code cell.}}$

In [None]:
def running_variance(x):
    # Type your code here
    return variance

    

**Part C (3 points)** Read in the 'Boxing.csv' datafile.
Use the `running_variance` function from Part B to compute the variance of the 'Latino' column. 

The 'Boxing.csv' dataset was obtained from a [Harvard Dataverse site](https://dataverse.harvard.edu/dataset.xhtml;jsessionid=1439a44a49e91658631f515f0fe6?persistentId=doi%3A10.7910%2FDVN%2FRYGQM6&version=&q=&fileTypeGroupFacet=&fileAccess=&fileTag=Data&fileSortField=&fileSortOrder=). You can read the website's description of the data set below:
"All 704 male boxing champions among 8 weight divisions (Heavyweight, Light Heavyweight, Middleweight, Welterweight, Lightweight, Featherweight, Bantamweight, and Flyweight) from 1924-2011 were categorized according to his predominant or identified race; American Indian, Asian, Black, Latino, or White."

$\color{red}{\text{Solution to Part C in code cell below.}}$

In [None]:
#Type Code Here

**Part D (8 points):** Use the more "traditional" formula to compute variance. You may not use `np.mean`, `np.std`, or `np.var` here (or any other built-in functions for these quantities). You must complete the function `traditional_variance` below to compute variance. 

Use the formula below.

$$ s^2 = \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} $$

$\color{red}{\text{Solution to Part D in code cell below.}}$

In [None]:
def traditional_variance(x):
    # Put code here
    
    return variance_t
        

**Part E (3 points)** Use the `traditional_variance` function from Part D to again compute the variance of the 'Latino' column from the 'Boxing.csv' dataset.

$\color{red}{\text{Solution to Part E in code cell below.}}$

In [None]:
#Put Code Here

**Part F (1 point):** Run the following cells to analyze each of your functions from **Part B** and **Part D**. You can read about `time` [here](https://docs.python.org/3/library/time.html#time.time). You can read about `tracemalloc` and monitoring memory usage in Python [here](https://medium.com/survata-engineering-blog/monitoring-memory-usage-of-a-running-python-program-49f027e3d1ba).

Observe the results.

In [None]:
# When this cell is executed, we time the running_variance function

from time import time
import numpy as np

xx = np.arange(1000000)

# timing for the running_variance function
tbeg = time()
running_variance(xx)
tend = time()

print('took {} seconds to calculate a running estimate of variance'.format(tend-tbeg))

In [None]:
# When this cell is executed, we time the traditional_variance function

tbeg = time()
traditional_variance(xx)
tend = time()

print('took {} seconds to calculate a direct computation of variance'.format(tend-tbeg))

In [None]:
# When this cell is executed, we investigate memory usage of the running_variance function.

import tracemalloc

tracemalloc.start()
running_variance(xx)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()

In [None]:
# When this cell is executed, we investigate memory usage of the traditional_variance function.

tracemalloc.start()
traditional_variance(xx)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()

$\color{red}{\text{Solution to Part F in this cell.}}$
What did you observe in the results?


### Problem 3 (30 points)
***

In this bit of data analysis we are going to look at the world of boxing.
Boxing is a sport that holds no bias towards competitors whether they are rich or poor, blue collar or white collar, big or small, professor or student. In amateur boxing every boxer is matched by gender, age, weight, and ability; everyone is equal in the ring.

<img style="float: right; width: 200px; padding: 3mm;" src="attachment:Screen%20Shot%202021-08-22%20at%209.59.36%20AM.png" alt="Drawing"/> 

We will now analyze some professional boxing data.
This dataset contains the race of each of the eight weight-category champions in men's professional boxing between 1924 and 2011.

Each value in the dataset represents the number of champions that were a member of the particular race listed at the top of each respective column. Your job is to figure out what story the data are telling.

As usual make sure you have imported the proper libraries.

**Part A (2 points):** Read in the "Boxing.csv" datafile as a pandas dataframe. Take a look at the dataframe. You should see the number of male boxers from 8 weight categories and their race starting in 1924.

$\color{red}{\text{First solution to Part A in the code cell below.}}$

In [None]:
#Put Code Here

Now take a look at the more modern data (use the DataFrame method $\color{blue}{\text{.tail()}}$). These should be the years ending in 2011.

$\color{red}{\text{Second solution to Part A in the code cell below.}}$

In [None]:
#Put Code Here

**Part B (5 points):** Suppose you think you see a different distribution of races in 1924 than you do in 2011. To help verify this impression, let's visualize the races of the champions of the eight weight classes as the years progress.

We begin exploring the data by creating smaller data sets. Specifically, create new dataframes in 11-year increments beginning in 1924. So for example, the first smaller data set will be the dataframe containing data from 1924 up to and including 1934. The second such data set will contain data from 1935 up to and including 1945, etc. Once you have finished generating these new smaller data sets, print `dfBoxing68_78` to the screen so that you (and we the graders) can verify your solution.

Note: Please follow the given naming convention for each new dataframe that you create:

- dfBoxing24_34
- dfBoxing35_45
- dfBoxing46_56
- dfBoxing57_67
- dfBoxing68_78
- dfBoxing79_89
- dfBoxing90_00
- dfBoxing01_11
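
The window boundaries and the required names follow a regular pattern, so they can be generated rather than typed by hand. The sketch below only produces the (start, end) year pairs and the matching name strings; it does not touch the data, and the variable names `windows` and `names` are illustrative assumptions.

```python
# Eight 11-year windows: each starts 11 years after the previous one
# and includes both its first and last year.
windows = [(start, start + 10) for start in range(1924, 2012, 11)]

# Build the dataframe names from the last two digits of each boundary year.
names = [f"dfBoxing{start % 100:02d}_{end % 100:02d}" for start, end in windows]

for (start, end), name in zip(windows, names):
    print(name, start, end)
```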

$\color{red}{\text{Solution to Part B in the code cell below.}}$

In [None]:
#Put Code Here


**Part C (8 points):** Now let's create side-by-side box plots for every one of these dataframes for each race.

In other words, each race will have 8 side-by-side box plots (one for each 11-year chunk of time) in an attempt to see the distribution of champions for that race as the years progress.

Import $\color{blue}{\text{matplotlib.pyplot}}$ in order to start creating some visualizations.

Plotting requirements: 
- Each group of boxplots must have a title that indicates which race's data is being displayed.
- Customize the x-axis tick labels to show each time period (e.g., 1924-1934, 1935-1945, etc.).
- Standardize the y axis limits to go from -1 to 9 for each figure.
- Self-check: You should have a total of 5 figures, each with 8 boxplots.


$\color{red}{\text{Solution to Part C in the code cell below.}}$

In [None]:
#Put Code Here   

**Part D (3 points):** Now we can interpret the graphics. Look at $Q_2$ in the progressing box-and-whisker plots for each race. Some rise, some fall, some are flat, and some oscillate.

In the context of this data, what does a fall in $Q_2$ indicate?
In the context of this data, what does a flat $Q_2$ indicate?
How would you categorize the 11-year span between 1968 and 1978 for Latino boxers?

During which 11-year span do the box-and-whisker plots show the largest IQR for Black boxers?
In the context of this data, what does a large IQR indicate?

$\color{red}{\text{Solution to Part D in this cell.}}$


**Part E (10 points):** Across the years from 1924 to 2011, how many times did each race have 8 champions, 7 champions, 6 champions, and so on, all the way down to 0 champions? Create histograms to visualize an answer to this question. Note: you should have 5 separate histograms.
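
The counting idea behind these histograms can be sketched without any plotting. The example below tallies how often each champion count occurs in a short made-up list; the values are hypothetical and are not taken from `Boxing.csv`.

```python
from collections import Counter

# Hypothetical yearly champion counts for one group (not real data).
yearly_counts = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]

frequency = Counter(yearly_counts)

# Report every possible value from 0 through 8, including values that never occur.
table = {k: frequency.get(k, 0) for k in range(9)}
for k, times in table.items():
    print(f"{k} champions: {times} year(s)")
```

A histogram with one bin per integer from 0 to 8 displays exactly this table as bar heights.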

$\color{red}{\text{Solution to Part E in the cell below.}}$

In [None]:
# Put Code Here


**Part F (2 points):** Now we want to interpret the graphics. All of the histograms, except for one, are essentially right skewed, and the remaining histogram is nearly uniform.

Relative to this data, what does a right-skewed histogram indicate?

Relative to this data, what does a uniform histogram indicate?

$\color{red}{\text{Solution to Part F in this cell.}}$

### Problem 4 (16 points)
***

Consider the following 3 data sets:

`A=[8, 6, 7, 5, 3, 0, 9, 8, 6, 7, 5, 3, 0, 9]`

`B=[3, 14, 15, 9, 26, 5, 35, 8, 9, 7, 9]`

`C` is the random data set generated by using `np.random.randint(0,1000, size=200)`

For each data set, perform the following computations:

**Part A (4 points):** Compute and print the mean and standard deviation of the data set. Use built-in NumPy functions for calculation.
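
One detail worth knowing when using the built-in functions: `np.std` and `np.var` divide by $n$ by default (`ddof=0`), whereas the sample formulas in Problem 2 divide by $n-1$, which corresponds to `ddof=1`. The sketch below shows the difference on a made-up three-element array; whichever convention you use, state it in your written commentary.

```python
import numpy as np

# Made-up toy data, unrelated to data sets A, B, and C.
toy = np.array([2.0, 4.0, 6.0])

pop_sd = np.std(toy)             # default ddof=0: divides by n
sample_sd = np.std(toy, ddof=1)  # ddof=1: divides by n - 1

print("ddof=0:", pop_sd)
print("ddof=1:", sample_sd)
```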

$\color{red}{\text{Solution to Part A in the code cell below.}}$

In [None]:
#Put Code Here

**Part B (4 points):** Compute and print the mean and standard deviation of the new data set formed by subtracting the original mean from each observation. Use built-in NumPy functions.

$\color{red}{\text{Solution to Part B in the code cell below.}}$

In [None]:
#Put Code Here

**Part C (4 points):** Compute and print  the mean and standard deviation of the new data set formed by subtracting the original mean from each observation and then dividing by the original standard deviation. Use built-in NumPy functions.

$\color{red}{\text{Solution to Part C in the code cell below.}}$

In [None]:
#Put Code Here

**Part D (4 points):** What do you notice about the means? What do you notice about the standard deviations?

$\color{red}{\text{Solution to Part D in this cell.}}$


### Problem 5 (14 points)
***

Football teams are being matched for a competition. In an attempt to be efficient, it was decided to match the teams based on a single descriptor of their weight using a measure of central tendency.

The first team, team Red, sent in their weights:
`R=[58, 63, 55, 58, 60, 62, 73, 51, 62, 72, 68]`

<img style="float: middle; width: 250px; padding: 3mm;" src="attachment:Screen%20Shot%202021-08-16%20at%206.19.08%20PM.png" alt="Drawing"/> 

Team Blue-and-Yellow sent in their weights:
`BY=[50, 165, 62, 151, 52, 160, 51, 163, 52, 159, 57]`
<img style="float: middle; width: 250px; padding: 3mm;" src="attachment:Screen%20Shot%202021-08-16%20at%204.31.46%20PM.png" alt="Drawing"/> 


Team Shaq sent in their weights:
`Q=[55, 62, 62, 65, 54, 64, 59, 51, 62, 51, 427]`
<img style="float: middle; width: 150px; padding: 3mm;" src="attachment:Screen%20Shot%202021-08-16%20at%204.38.43%20PM.png" alt="Drawing"/> 

The teams that sent in their weights were not very accurate, though. For instance, the coach of Team Shaq accidentally listed his own weight with the players' weights, and Team Blue-and-Yellow thought the parents would play with the children, so the parents listed their own weights along with their children's weights.


**Part A1 (4 points)** Teams must be no more than 11 pounds different in average weight in order to compete against each other. Compute and print the mean weight and the median weight of each team. You may use built-in mean and median functions to perform this computation.

$\color{red}{\text{Solution to Part A1 in the code cell below.}}$

In [None]:
#Put Code Here

**Part A2 (2 points)**
According to the means, which teams can play each other?
According to the medians, which teams can play each other?

$\color{red}{\text{Solution to Part A2 in this cell.}}$

**Part B (2 points)** List a pro and a con for computing the mean and a pro and a con for computing the median.

$\color{red}{\text{Solution to Part B in this cell.}}$

**Part C (3 points)** In a boxing competition, the team weights are also sent in for matching purposes.
Teams are matched if their average weights are close together.

<img style="float: right; width: 250px; padding: 3mm;" src="attachment:Screen%20Shot%202021-08-16%20at%206.48.12%20PM.png" alt="Drawing"/>

Two teams (CBC and DPD) each sent the weights of 5 boxers.
Look closely at the individual weights.

Find the mean, median, and variance for each team. Again, use the built-in NumPy functions.

Team CBC weights = `[50, 54, 155, 200, 243]`

Team DPD weights = `[147, 150, 155, 156, 167]`

$\color{red}{\text{Solution to Part C in the code cell below.}}$

In [None]:
#Put Code Here

**Part D (3 points):** What does the mean indicate about matching the teams?

What does the median indicate about matching the teams?

What does the variance say about matching the teams?

$\color{red}{\text{Solution to Part D in this cell.}}$