# Homework 3 Supplemental Notebook

## History of Data Science, Winter 2022

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Question 3

In this question, we'll revisit the data Tycho Brahe recorded when trying to measure the right ascension of $\alpha$ Arietis. The table from lecture is shown below.

<a name='chart'></a>
<img src='data/brahe.png' width=250>

Recall, we verified that Brahe took the mean of each pair of observations. What's not clear is **how** he chose which observations to pair up (note that they're not paired in chronological order). The bottom 12 values in the right column (i.e. the "paired means") seem to be much less spread out than the values in the middle column – but maybe Brahe paired values in a way that would keep the spread of the paired means low.

### Question 3.1

First, it's important to establish that **no matter how he chose to pair the 24 values together, the mean of the 12 paired means would be equal to the original mean of the 24 observations**. That is, the mean of the 24 values in the middle column is equal to the mean of the bottom 12 values in the right column, and that would be true no matter what the pairings were.
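
As a quick numerical check of this claim, here is a minimal sketch on made-up numbers (the values and the helper `mean_of_paired_means` are hypothetical and are not Brahe's data). It shuffles the list a few times, pairs off neighbors, and compares the mean of the paired means to the overall mean.

```python
import random

def mean_of_paired_means(values):
    """Pair consecutive values, average each pair, then average those averages."""
    pair_means = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return sum(pair_means) / len(pair_means)

# Made-up numbers for illustration only
values = [2.0, 3.0, 7.0, 12.0, 20.0, 4.0]
overall_mean = sum(values) / len(values)

rng = random.Random(0)
results = []
for _ in range(5):
    shuffled = values[:]
    rng.shuffle(shuffled)  # a new random pairing each time
    results.append(mean_of_paired_means(shuffled))

print(overall_mean)
print(results)  # each entry should match overall_mean
```

This only checks a handful of pairings of one list, so it illustrates the fact rather than proving it.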

<b style='color:red'>Your Job:</b> To illustrate this fact on a more tangible scale, let's try an example. Consider the numbers 1, 1, 4, 5, 8, 9.

(i) Compute their mean. Show your work.

(ii) Compute the pairwise means of (1, 1), (4, 8), and (5, 9). Now take the mean of these three pairwise means. Show your work.

(iii) Compute the pairwise means of (1, 9), (4, 5), and (1, 8). Now take the mean of these three pairwise means. Show your work. Do you get the same result as in (i) and (ii)?

(iv) (Optional) Prove that no matter how many numbers you have (as long as there's an even number of them), and no matter how you pair them up, the mean of the pairwise means equals the mean of the original numbers. It is a short proof.

### Question 3.2

Run the cell below to load in a DataFrame representation of the 24 observations in the middle column from the picture above, i.e. just the observations that Brahe paired off.

In [None]:
brahe = pd.read_csv('data/tycho-brahe-pairs.csv')
brahe

As we mentioned at the start of the question, it appears to be the case that Brahe paired the data in a way that kept the spread of the paired means low, even though this would have no impact on the mean of the paired means.

Let's compute the **standard deviation** of Brahe's paired means. We'll start by copying the two functions we wrote during lecture, along with a third helper function.

In [None]:
def time_minutes(deg, arcmin, arcsec):
    total_deg = deg + arcmin / 60 + arcsec / (60 ** 2)
    
    # Each degree is 4 minutes
    return total_deg * 4
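
The factor of 4 comes from the convention that a full circle of 360° in right ascension corresponds to 24 hours (1440 minutes) of time, so each degree is 1440 / 360 = 4 minutes. The sketch below repeats `time_minutes` so that it runs on its own; the variable name `minutes_per_degree` is introduced here only for illustration.

```python
def time_minutes(deg, arcmin, arcsec):
    # Same conversion as the function above, repeated so this block is self-contained
    total_deg = deg + arcmin / 60 + arcsec / (60 ** 2)
    return total_deg * 4

# 24 hours of time spread over 360 degrees
minutes_per_degree = (24 * 60) / 360
print(minutes_per_degree)        # 4.0

print(time_minutes(1, 0, 0))     # 4.0, one degree
print(time_minutes(360, 0, 0))   # 1440.0, a full circle is 24 hours
```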

In [None]:
def arc_degree(time_minutes):
    # 1 degree = 4 minutes, let's think in terms of degrees
    total_deg = time_minutes / 4
    
    # First, round DOWN
    deg = np.floor(total_deg)
    
    # Compute the part left over – we need to describe this in arcminutes and arcseconds
    leftover = total_deg - deg
    
    # Each arcminute is 1/60th of a degree, so figure out the number of times 1/60 goes into the leftover
    arcmin = leftover // (1 / 60)
    
    # Compute the part left over - we need to describe this in arcseconds
    leftover = leftover - arcmin * (1 / 60)
    
    # Each arcsecond is equal to 1/3600th of a degree, so figure out the number of times 1/3600 goes into the leftover
    arcsec = leftover // (1 / 3600)
    
    return (deg, arcmin, arcsec)

In [None]:
def string_to_tuple(s):
    '''Takes in a string formatted as "a, b, c"
       and returns a tuple (a, b, c)
    '''
    post_split = s.split(', ')
    return tuple([int(i) for i in post_split])

# Example behavior
string_to_tuple('26, 4, 16')

In [None]:
(time_minutes(26, 4, 16) + time_minutes(25, 56, 23)) / 2

<b style='color:red'>Your Job:</b> Complete the implementation of the function `paired_means`, which takes in a DataFrame with a `'Reading'` column and returns an array containing the mean of the first two readings, the mean of the second two readings, the mean of the third two readings, and so on (all measured in regular minutes).

For instance, `paired_means(brahe)` should return an array with 12 numbers in it, the first of which is `104.02166666666668`, the mean of 26° 4' 16" and 25° 56' 23" in minutes.

In your PDF writeup, provide a screenshot of the code you write.

In [None]:
def paired_means(df):
    output_means = ...
    # For each pair
    for i in ...:
        # Get the readings of both dates in the pair
        reading1 = string_to_tuple(df.get('Reading').iloc[i])
        reading2 = string_to_tuple(df.get('Reading').iloc[i+1])
        
        # Convert both readings to regular minutes
        reading1_min = ...
        reading2_min = ...
        
        # Compute their mean
        mean = ...
        
        # Store the mean (in regular minutes) in the output array
        output_means = ...
        
    return output_means

After finishing your function, run the two cells below.

In [None]:
original_paired_means = paired_means(brahe)
original_paired_means

In [None]:
observed_std = np.std(original_paired_means)
observed_std

If you implemented `paired_means` correctly, you'll see that `observed_std` is `0.012332120255100665`. This is the **standard deviation** of the 12 paired means that Brahe computed (i.e. the standard deviation of the bottom 12 numbers in the right column of [this image](#chart)).

Below, we define a function that takes in a DataFrame, calls your `paired_means` function on the DataFrame to get an array of paired means, and returns the standard deviation of those paired means.

In [None]:
def spread_paired_means(df):
    return np.std(paired_means(df))

### Question 3.3

Let's run a simulation to test the following hypotheses:

- **Null Hypothesis:** Tycho Brahe paired the 24 observations into 12 groups at random, and any deviations from this are due to chance alone.

- **Alternative Hypothesis:** Tycho Brahe paired the 24 observations into 12 groups in a way that kept the spread of the paired means small.

To do this, we'll repeatedly
- shuffle the `brahe` DataFrame,
- call `spread_paired_means` on the shuffled DataFrame, and
- keep track of the resulting standard deviations.

By shuffling the `brahe` DataFrame and calling `spread_paired_means`, we are **randomly** grouping the 24 points into 12 pairs, computing their pairwise means (giving us 12 values), and computing the standard deviation of the pairwise means (i.e. a **test statistic**). By following this process, we will end up with an array of many values, each corresponding to the standard deviation of one set of 12 paired means. We can then see where `observed_std` lies in this distribution.

<b style='color:red'>Your Job:</b> Complete the code below to conduct the simulation. By the end of this cell, `stds` should be an array of **10000** standard deviations, as described above. In your PDF writeup, include a screenshot of both the code you write and the plot that appears at the bottom of your notebook.

**_Hint:_** This is a standard DSC 10 question; refer to the textbook if it feels unfamiliar. To shuffle a DataFrame `df`, use `df.sample(n)`, where `n` is the length of the DataFrame.
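
To see what shuffling looks like in isolation, here is a small sketch on a toy DataFrame (the `toy` DataFrame and its column names are made up for illustration and are not the Brahe data). Sampling all rows without replacement, which is the default, returns the same rows in a random order.

```python
import pandas as pd

# Toy DataFrame for illustration only
toy = pd.DataFrame({'Label': ['a', 'b', 'c', 'd'], 'Value': [10, 20, 30, 40]})

# Sample as many rows as the DataFrame has, without replacement
shuffled = toy.sample(toy.shape[0])
print(shuffled)

# Same rows as before, just (possibly) in a different order
print(sorted(shuffled.get('Value')) == sorted(toy.get('Value')))
```

Note that the shuffled DataFrame keeps its original index labels, so positional access with `.iloc` is what reflects the new order.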

In [None]:
stds = ...
stds

After you've completed the cell above, run the following cell to see a histogram of your simulated standard deviations, along with a red line at `observed_std`, the standard deviation of the means of Brahe's choice of pairs. Make sure to include this in your PDF writeup.

In [None]:
plt.figure(figsize=(10, 5))
plt.hist(stds, density=True, ec='w', bins=np.arange(0, 0.35, 0.01))
plt.axvline(x=observed_std, color='red');

If you completed all steps correctly, you'll notice that `observed_std` does not look like a typical value from the simulated distribution, which was built by selecting pairs at random. In other words, the null hypothesis (that Brahe selected his pairs at random) does not seem consistent with what we observed.

It seems like there must have been some reason he selected the pairs he did!
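
One way to put a number on this is an empirical p-value: the proportion of simulated standard deviations that are at least as small as the observed one, since small values favor the alternative hypothesis. The sketch below uses tiny stand-in values (`example_stds` and `example_observed` are hypothetical, chosen only so the arithmetic is easy to follow). In the notebook you would use your own `stds` and `observed_std` instead.

```python
import numpy as np

# Stand-in values for illustration only; not simulation results
example_stds = np.array([0.02, 0.05, 0.08, 0.11, 0.15])
example_observed = 0.06

# Proportion of simulated statistics that are <= the observed statistic
p_value = np.count_nonzero(example_stds <= example_observed) / len(example_stds)
print(p_value)  # 0.4, because 2 of the 5 stand-in values are <= 0.06
```

A p-value near 0 would mean that random pairings almost never produce a spread as small as the one observed.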