# Summary Statistics

1. Greg was 14, Marcia was 12, Peter was 11, Jan was 10, Bobby was 8, and Cindy was 6 when they started playing the Brady kids on The Brady Bunch. Cousin Oliver was 8 years old when he joined the show. What are the mean, median, and mode of the kids' ages when they first appeared on the show? What are the variance, standard deviation, and standard error?

  Let $x_i$ be each child's age when they first appeared on the show and $N$ the number of children. 

1. The mean is $\frac{\displaystyle{\sum_{i=1}^{N}} x_i}{N} = \frac{14+12+11+10+8+6+8}{7} = 9.86$ years. 
2. From young to old, the 7 children appeared on the show at age 6, 8, 8, 10, 11, 12, 14. The median is 10 years.
3. Since 8 appeared twice and all the other ages appeared only once, the mode is 8 years.

We can check the above results in Python.

In [4]:
# Load libraries
import numpy as np
import pandas as pd

In [57]:
# Create a data frame with children's names and ages
df = pd.DataFrame()
df['names'] = ['Greg', 'Marcia', 'Peter', 'Jan', 'Bobby', 'Cindy', 'Oliver']
df['age'] = [14, 12, 11, 10, 8, 6, 8]

# Mean
print(np.mean(df['age']))
# Median
print(np.median(df['age']))
# Mode 
# Write a custom function to handle multiple modes
def modes(data):
    # Count how many times each unique age occurs
    frequency = {}
    for d in data:
        frequency[d] = frequency.get(d, 0) + 1
    # Compute the highest and lowest counts once instead of once per key
    highest = max(frequency.values())
    lowest = min(frequency.values())
    # If every age is equally frequent (highest == lowest), there is no mode
    return [key for key in frequency if frequency[key] == highest and highest > lowest]

print(modes(df['age']))

9.857142857142858
10.0
[8]


All results are confirmed.
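
Question 1 also asks for the variance, standard deviation, and standard error, which the check above does not cover. The following is a minimal sketch rather than part of the original notebook: it assumes the seven kids can be treated either as a complete population (`ddof=0`) or as a sample (`ddof=1`), and it takes the standard error to be the sample standard deviation divided by $\sqrt{N}$. The variable names are illustrative.

```python
import numpy as np

# Ages of the seven kids when they first appeared (from the question)
ages = np.array([14, 12, 11, 10, 8, 6, 8])
n = len(ages)

# Population variance and standard deviation (divide by N)
pop_var = np.var(ages, ddof=0)
pop_sd = np.std(ages, ddof=0)

# Sample variance and standard deviation (divide by N - 1)
sample_var = np.var(ages, ddof=1)
sample_sd = np.std(ages, ddof=1)

# Standard error of the mean, assuming the usual sample-based definition
std_error = sample_sd / np.sqrt(n)

print(round(pop_var, 2), round(pop_sd, 2))        # 6.41 2.53
print(round(sample_var, 2), round(sample_sd, 2))  # 7.48 2.73
print(round(std_error, 2))                        # 1.03
```

Which pair to report depends on whether the seven kids are viewed as the whole population of interest or as a sample from a larger one.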

2. Using these estimates, if you had to choose only one estimate of central tendency and one estimate of variance to describe the data, which would you pick and why?

I would use the mean to describe the central tendency because it has richer [mathematical properties](https://stats.stackexchange.com/questions/7307/mean-and-median-properties) than either the median or the mode. Moreover, there are no extreme values in our data to distort the mean, so it is safe to use. To describe the spread of the age data, I would choose the standard deviation: it is expressed in the same units as the ages (years), and because the whole population is known we can compute the population variance ($\frac{\sum(x_i - \mu)^2}{N}$, where $\mu$ is the mean) directly and take its square root.

3. Next, Cindy has a birthday. Update your estimates: what changed, and what didn't?

After Cindy turned 7, the mean changes to $\frac{14+12+11+10+8+7+8}{7} = 10$ years. However, the median stays 10 years and the mode is still 8 years.

Let's again confirm the results using Python.

In [56]:
# Update Cindy's age to 7
df.loc[5,'age'] = 7

# Mean
print(np.mean(df['age']))
# Median
print(np.median(df['age']))
# Mode
print(modes(df['age']))

10.0
10.0
[8]


Again, all results are confirmed. 
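
The comparison above covers only the measures of central tendency. As a hedged sketch (not part of the original notebook), the snippet below puts the before and after values side by side and adds the population standard deviation, assuming `ddof=0` is the appropriate choice here. The names `before`, `after`, and `summary` are illustrative.

```python
import pandas as pd

before = pd.Series([14, 12, 11, 10, 8, 6, 8])  # original ages
after = pd.Series([14, 12, 11, 10, 8, 7, 8])   # Cindy is now 7

# ddof=0 gives the population standard deviation (pandas defaults to ddof=1)
summary = pd.DataFrame({
    'mean': [before.mean(), after.mean()],
    'median': [before.median(), after.median()],
    'population_sd': [before.std(ddof=0), after.std(ddof=0)],
}, index=['before', 'after'])

print(summary.round(2))
```

Under these assumptions the mean rises from about 9.86 to 10, the median stays at 10, and the population standard deviation shrinks from about 2.53 to about 2.33, because Cindy's age moved closer to the rest of the group.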

4. Nobody likes Cousin Oliver. Maybe the network should have used an even younger actor. Replace Cousin Oliver with 1-year-old Jessica, then recalculate again.  Does this change your choice of central tendency or variance estimation methods?

After replacing Oliver with Jessica (and using Cindy's original age of 6, as in the sum below), the new results are:

1. The mean is $\frac{\displaystyle{\sum_{i=1}^{N}} x_i}{N} = \frac{14+12+11+10+8+6+1}{7} = 8.86$ years. 
2. From young to old, the 7 children appeared on the show at age 1, 6, 8, 10, 11, 12, 14. The median is still 10 years.
3. With Oliver's 8 removed, every age now appears exactly once, so there is no longer a mode.

I still use the standard deviation to describe the variance. However, it's now more appropriate to use the median to describe the central tendency since the mean is strongly influenced by Jessica's young age. Let's confirm the results using Python.

In [58]:
# Replace Oliver with Jessica
df.loc[6:6,'names':'age']='Jessica',1

# Mean
print(np.mean(df['age']))
# Median
print(np.median(df['age']))
# Mode
print(modes(df['age']))

8.857142857142858
10.0
[]


Results are confirmed.
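
To make the sensitivity argument concrete, the sketch below compares the two casts directly. It is an illustration under the same assumption as the calculation above (Cindy's original age of 6), and it uses the population standard deviation, which is NumPy's default (`ddof=0`). The variable names are illustrative.

```python
import numpy as np

with_oliver = np.array([14, 12, 11, 10, 8, 6, 8])
with_jessica = np.array([14, 12, 11, 10, 8, 6, 1])

# Print mean, median, and population standard deviation for each cast
for label, ages in [('Oliver', with_oliver), ('Jessica', with_jessica)]:
    print(label, round(np.mean(ages), 2), np.median(ages), round(np.std(ages), 2))

# How far each measure of central tendency moved after the swap
mean_shift = np.mean(with_jessica) - np.mean(with_oliver)
median_shift = np.median(with_jessica) - np.median(with_oliver)
print(round(mean_shift, 2), median_shift)
```

The mean drops by a full year while the median does not move, and the population standard deviation grows from about 2.53 to about 4.02, which is the sense in which a single very young cast member distorts the mean but not the median.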

5. On the 50th anniversary of The Brady Bunch, four different magazines asked their readers whether they were fans of the show.  The answers were:
    TV Guide                20% fans
    Entertainment Weekly    23% fans
    Pop Culture Today       17% fans
    SciPhi Phanatic          5% fans

  Based on these numbers, what percentage of adult Americans would you estimate were Brady Bunch fans on the 50th anniversary of the show?

My best guess is 16.25\%, which is the mean of the four polls ($\frac{20\% + 23\% + 17\% + 5\%}{4}$). 

In [63]:
# Create a data frame with names of polls and the results
fans = pd.DataFrame()
fans['poll'] = ['TV Guide', 'Entertainment Weekly', 'Pop Culture Today', 'SciPhi Phanatic']
fans['result'] = [20, 23, 17, 5]

# Mean
print('{}% of adult Americans were Brady Bunch fans'.format(np.mean(fans['result'])))

16.25% of adult Americans were Brady Bunch fans


The arithmetic behind the guess is confirmed.
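
Because one poll sits well below the other three, it may be worth checking how much the estimate depends on it. The sketch below is only a sensitivity check, not a claim about which figure is correct: it compares the mean of all four polls with their median and with the mean of the three higher polls. The dictionary `results` and the other names are illustrative.

```python
import numpy as np

# Poll results in percent, as listed in the question
results = {
    'TV Guide': 20,
    'Entertainment Weekly': 23,
    'Pop Culture Today': 17,
    'SciPhi Phanatic': 5,
}
values = np.array(list(results.values()))

mean_all = np.mean(values)      # 16.25
median_all = np.median(values)  # 18.5

# Mean of the three polls other than the lowest one
mean_without_low = np.mean(
    [v for name, v in results.items() if name != 'SciPhi Phanatic']
)                               # 20.0

print(mean_all, median_all, mean_without_low)
```

The three summaries span roughly 16% to 20%, so any single number taken from these polls should be read as a rough estimate.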