# Describing Data

## QUESTION 1

Greg was 14, Marcia was 12, Peter was 11, Jan was 10, Bobby was 8, and Cindy was 6 when they started playing the Brady kids on The Brady Bunch. Cousin Oliver was 8 years old when he joined the show. What are the mean, median, and mode of the kids' ages when they first appeared on the show? What are the variance, standard deviation, and standard error?


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# This list will become the row names in our data frame:
names = ['Greg', 'Marcia', 'Peter', 'Jan', 'Bobby', 'Cindy', 'Oliver']

# Create an empty data frame with named rows
df = pd.DataFrame(index=names)

# Add our 'ages' column to the data frame:
df['ages'] = [14, 12, 11, 10, 8, 6, 8]

df

Unnamed: 0,ages
Greg,14
Marcia,12
Peter,11
Jan,10
Bobby,8
Cindy,6
Oliver,8


## Calculate the Mean

In [3]:
# CALCULATE THE MEAN - The average value of a list
# Using built-in Python functionality:
print(sum(df['ages']) / len(df['ages']))

# Using NumPy:
np.mean(df['ages'])

9.857142857142858


9.8571428571428577

## Calculate the Median

In [4]:
# CALCULATE THE MEDIAN - The middle value in an ordered list
# Using vanilla Python and the built-in statistics module:
import statistics
print(statistics.median(df['ages']))

# Using NumPy:
np.median(df['ages'])

10


10.0

## Calculate the Mode

In [5]:
# CALCULATE THE MODE - The most frequent value in a list
# Using vanilla Python's statistics module
import statistics
statistics.mode(df['ages'])

8

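On Python versions before 3.8, the above code raises a `StatisticsError` if the data has more than one mode; from Python 3.8 onward, `statistics.mode()` returns the first mode it encounters instead. See `Appendix A` below for what you can do if there are multiple modes in your data frame.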

## Calculate the Variance

Variance `v` is the sum of the squared differences between each data point `x` and the `mean`, divided by the number of data points `n` minus `1`.

```v = sum((x - mean) ** 2) / (n - 1)```

In [6]:
# Calculate the sample variance using pandas' Series.var(),
# which divides by n - 1 by default
df['ages'].var()

7.4761904761904754
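
As a cross-check, the variance formula can be evaluated by hand and compared with library calls. This is a minimal sketch, assuming the same seven ages held in a NumPy array; the variable names are illustrative. `statistics.variance()` divides by `n - 1`, while `np.var()` divides by `n` unless `ddof=1` is passed.

```python
import statistics
import numpy as np

ages = np.array([14, 12, 11, 10, 8, 6, 8])

n = len(ages)
mean = ages.sum() / n

# Sample variance by hand: squared deviations summed, divided by n - 1
manual_var = sum((x - mean) ** 2 for x in ages) / (n - 1)

print(manual_var)                          # hand-rolled sample variance
print(statistics.variance(ages.tolist()))  # standard library: sample variance
print(np.var(ages, ddof=1))                # NumPy needs ddof=1 to match
print(np.var(ages))                        # NumPy default ddof=0: divides by n
```

The first three values should agree; the last is smaller because it divides by `n` instead of `n - 1`.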

## Calculate the Standard Deviation
The most common estimate of variability used by statisticians is the square root of the variance, called the <em>standard deviation</em>.

```s = v ** 0.5```

NumPy gives us the useful <a href="https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.std.html">np.std()</a> function for working with standard deviations. A tricky default in NumPy is that it calculates the <em>population</em> standard deviation, dividing by `n`, rather than the <em>sample</em> standard deviation, dividing by `n - 1`. To calculate the sample standard deviation instead, we need to set the "delta degrees of freedom" manually with the `ddof` named parameter:

```np.std(df['ages'], ddof=1)```

In [7]:
# Calculate the Standard Deviation of the sample using numpy.std()
# and ddof=1
np.std(df['ages'], ddof=1)

2.7342623276105891
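
The effect of `ddof` is easiest to see side by side. A small sketch, assuming the same ages held in a plain NumPy array (the names `pop_sd` and `sample_sd` are illustrative):

```python
import numpy as np

ages = np.array([14, 12, 11, 10, 8, 6, 8])
n = len(ages)

pop_sd = np.std(ages)             # default ddof=0: divides by n
sample_sd = np.std(ages, ddof=1)  # ddof=1: divides by n - 1

print(pop_sd, sample_sd)

# The two differ only by the factor sqrt(n / (n - 1))
print(np.isclose(sample_sd, pop_sd * np.sqrt(n / (n - 1))))
```

pandas makes the opposite choice: `Series.std()` defaults to `ddof=1`, so `df['ages'].std()` already returns the sample standard deviation.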

## Calculate the Standard Error
Another useful measure of variability is the <em>standard error</em>, which quantifies the uncertainty in the sample mean as an estimate of the population mean. While the standard deviation describes how spread out the individual values are, the standard error describes the precision of the sample mean.

The formula for the standard error `se` of the mean is the standard deviation of the sample `s` divided by the square root of the sample size `n`.

```se = s / (n ** 0.5)```

In Python, this is:

```np.std(df['ages'], ddof=1) / np.sqrt(len(df['ages']))```

In [8]:
# Calculate the standard error of the mean using np.std() & np.sqrt().
# The result lands close to 1 here only because the sample standard
# deviation (about 2.73) happens to be close to sqrt(7) (about 2.65).
np.std(df['ages'], ddof=1) / np.sqrt(len(df['ages']))

1.0334540197243192
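
pandas also offers a shortcut for this calculation. A brief sketch, assuming the same ages in a `Series`; `Series.sem()` uses `ddof=1` by default, so it should agree with the manual formula:

```python
import numpy as np
import pandas as pd

ages = pd.Series([14, 12, 11, 10, 8, 6, 8])

# Manual formula: sample standard deviation over sqrt(n)
manual_se = ages.std(ddof=1) / np.sqrt(len(ages))

# Built-in standard error of the mean
pandas_se = ages.sem()

print(manual_se, pandas_se)
```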

# QUESTION 2
Using these estimates, if you had to choose only one estimate of central tendency and one estimate of variance to describe the data, which would you pick and why?

## My Answer:
I would use the mean and standard deviation.

The mean, because it summarizes the whole data set in a single average value.

The standard deviation, because it shows how far values typically fall from the mean, in the same units as the data, and also provides insight into extreme data points (if there are any).

# QUESTION 3
Next, Cindy has a birthday. Update your estimates: what changed, and what didn't?

In [9]:
df.at['Cindy', 'ages'] = df.loc['Cindy', 'ages'] + 1 
df

Unnamed: 0,ages
Greg,14
Marcia,12
Peter,11
Jan,10
Bobby,8
Cindy,7
Oliver,8


## <font color="green">This is how you add 1 to Cindy's age using a formula:</font>

`df.at['Cindy', 'ages'] = df.loc['Cindy', 'ages'] + 1`

## Updated Mean
Previous mean was 9.86 (rounded).

In [10]:
np.mean(df['ages'])

10.0

This makes sense because the average of the data set should be higher now that Cindy's age went up by 1 year.

## Updated Median
Previous median was 10

In [11]:
np.median(df['ages'])

10.0

This makes sense because there are still 7 data points and Jan, age 10, is still in the middle of the sorted data set.

## Updated Mode
Previous mode was 8.

In [12]:
statistics.mode(df['ages'])

8

This makes sense because 8 is still the only age that appears twice (Bobby and Oliver).

## Updated Variance
Previous variance was 7.4761904761904754

In [13]:
df['ages'].var()

6.333333333333333

This makes sense because there is less variability in the data now that Cindy's age is 7 and not 6 (the data frame ranges from 7 to 14 instead of 6 to 14). 

## Updated Standard Deviation
Previous std. dev. was 2.7342623276105891

In [14]:
np.std(df['ages'], ddof=1)

2.5166114784235831

Again, this makes sense because there is less variability: the values sit closer to the mean on average, so the standard deviation is smaller.

## Updated Standard Error
Previous standard error was 1.0334540197243192

In [15]:
np.std(df['ages'], ddof=1) / np.sqrt(len(df['ages']))

0.95118973121134176

## <font color="green">The standard error did something I wouldn't have expected. I thought that because we are using the entire data frame, the standard error should remain close to 1?</font>

# QUESTION 4
Nobody likes Cousin Oliver. Maybe the network should have used an even younger actor. Replace Cousin Oliver with 1-year-old Jessica, then recalculate again. Does this change your choice of central tendency or variance estimation methods?

In [16]:
df_noOliver = df.drop(['Oliver'])
df_noOliver.loc['Jessica'] = [1]
df_noOliver

Unnamed: 0,ages
Greg,14
Marcia,12
Peter,11
Jan,10
Bobby,8
Cindy,7
Jessica,1


# df_noOliver Mean
Previous mean was 10. 

In [17]:
np.mean(df_noOliver['ages'])

9.0

This makes sense because the average would go down now that we've removed Oliver, who is 8, and added Jessica, who is 1.

# df_noOliver Median

In [18]:
np.median(df_noOliver['ages'])

10.0

This makes sense because there are still 7 data points and Jan, age 10, is still in the middle of the sorted data set.

# df_noOliver Mode
### <font color="green">Question for my mentor: using `statistics.mode(df_noOliver['ages'])` returns an error because every age now appears only once, so there isn't a mode any more. What are we supposed to do to show that there isn't a mode, without showing an error?</font>
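
One possible approach, sketched here rather than taken from any official recipe, is to count the values yourself and report a mode only when some value repeats. The helper name `find_modes` is hypothetical; it returns an empty list when every value occurs exactly once.

```python
from collections import Counter

def find_modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value is unique, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

print(find_modes([14, 12, 11, 10, 8, 7, 1]))     # ages with Jessica: []
print(find_modes([14, 12, 11, 10, 8, 7, 8]))     # ages with Oliver: [8]
print(find_modes([14, 12, 11, 10, 8, 6, 8, 6]))  # two modes: [6, 8]
```

With the data frame from this question, the call would presumably be `find_modes(df_noOliver['ages'])`.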

# df_noOliver Variance
Previous variance was 6.333333333333333

In [19]:
df_noOliver['ages'].var()

18.0

This makes sense because now the data set ranges from 1 to 14 versus 7 to 14 in the previous example. 

# df_noOliver Std. Dev.
Previous Std. Dev. was 2.5166114784235831

In [20]:
np.std(df_noOliver['ages'], ddof=1)

4.2426406871192848

This makes sense because there is much more variance in the data set.

# df_noOliver Std. Err.
Previous standard error was 0.95118973121134176

In [21]:
np.std(df_noOliver['ages'], ddof=1) / np.sqrt(len(df_noOliver['ages']))

1.6035674514745462

### <font color="green">Standard Error doesn't make sense to me. I thought standard error shows the difference between the population and the sample, but we are not looking at a sample here, we are looking at the entire data set. I don't understand standard error. I'll need my mentor to explain this to me</font>
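
One way to build intuition, offered as a sketch rather than a full answer: the standard error describes how much the sample mean would vary if the sample were drawn again. The example below treats the seven ages as a whole population (an assumption made purely for illustration), lists every possible ordered sample of two kids drawn with replacement, and compares the spread of those sample means with the population standard deviation divided by the square root of the sample size.

```python
import itertools
import numpy as np

# Treated as the full population for this illustration
population = np.array([14, 12, 11, 10, 8, 7, 1])
sample_size = 2

# Every ordered sample of size 2, drawn with replacement (7 * 7 = 49 samples)
samples = itertools.product(population, repeat=sample_size)
sample_means = np.array([np.mean(s) for s in samples])

# Spread of the sample means; ddof=0 because all possible samples are listed
spread_of_means = np.std(sample_means)

# Population standard deviation divided by sqrt(sample size)
predicted = np.std(population) / np.sqrt(sample_size)

print(spread_of_means, predicted)
```

The two printed numbers should match. In practice only one sample is available, so `s / sqrt(n)` stands in as an estimate of that spread. It shrinks as `n` grows and grows with `s`, so nothing pins it near 1.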

# Does this change your choice of central tendency and variance estimation?

No, I don't think it does. I think the mean and standard deviation are still the best ways to describe the data set...

# QUESTION 5

On the 50th anniversary of The Brady Bunch, four different magazines asked their readers whether they were fans of the show. The answers were:

- TV Guide: 20% fans
- Entertainment Weekly: 23% fans
- Pop Culture Today: 17% fans
- SciPhi Phanatic: 5% fans


Based on these numbers, what percentage of adult Americans would you estimate were Brady Bunch fans on the 50th anniversary of the show?

In [22]:
# This list will become the row names in our data frame:
mags = ['TV_Guide', 'Ent_Wkly', 'Pop_Cult', 'SciPhi']

# Create an empty data frame with mags as the index
df = pd.DataFrame(index=mags)

# Add popularity column to the data frame:
df['popularity'] = [0.2, 0.23, 0.17, 0.05]

df

Unnamed: 0,popularity
TV_Guide,0.2
Ent_Wkly,0.23
Pop_Cult,0.17
SciPhi,0.05


In [23]:
np.mean(df['popularity'])

0.16250000000000003

In [24]:
np.std(df['popularity'], ddof=1)

0.078898669190297505

## My Answer to Question 5

With a mean of 16.25% and a standard deviation of about 7.9 percentage points (0.079 as a proportion), I would say that somewhere between roughly 8.4% and 24.1% of Americans were Brady Bunch fans on the 50th anniversary of the show.
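
The arithmetic behind such a range can be sketched as follows. This is only an illustration, assuming the four survey results are weighted equally; the variable names are made up for the example. It reports the mean plus or minus one standard deviation and, for comparison, plus or minus one standard error, all in percentage points.

```python
import numpy as np

# Fraction of each magazine's readers who said they were fans
popularity = np.array([0.20, 0.23, 0.17, 0.05])

mean_pct = popularity.mean() * 100
sd_pct = popularity.std(ddof=1) * 100       # sample standard deviation
se_pct = sd_pct / np.sqrt(len(popularity))  # standard error of the mean

print(f"mean: {mean_pct:.2f}%")
print(f"mean +/- 1 SD: {mean_pct - sd_pct:.2f}% to {mean_pct + sd_pct:.2f}%")
print(f"mean +/- 1 SE: {mean_pct - se_pct:.2f}% to {mean_pct + se_pct:.2f}%")
```

Converting to percentage points before adding or subtracting keeps the mean and the spread in the same units.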

# APPENDIX A
If you have more than one mode, you can use the code below to avoid the statistics error. Note that it only gives you back the first mode, not all of the modes.

## I added a second 6 to the `ages` data frame to see what would happen, and it returned 6 as the first mode.  

In [25]:
# Make a blank data frame.
df = pd.DataFrame()

# Populate it with the Brady kids and Cousin Oliver's ages.
df['ages2'] = [14, 12, 11, 10, 8, 6, 8, 6]

# Generate a list of unique elements along with how often they occur:
(values, counts) = np.unique(df['ages2'], return_counts=True)

# The location in the values list of the most frequently occurring element.
index = np.argmax(counts)
print(index) # Prints the location of the max item in the list

# The most frequent element
values[index]

0


6

The NumPy code above is an alternative to `statistics.mode()`, not a fix for it. Running `statistics.mode(df['ages2'])` still raises a `StatisticsError` on Python versions before 3.8, because the data still contains two equally common values.

## <font color="green">Note for my mentor: This isn't working as I expected. What am I misunderstanding? </font>

In [26]:
statistics.mode(df['ages2'])

StatisticsError: no unique mode; found 2 equally common values
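
For data with several modes, two library calls return all of them instead of raising an error. A short sketch, assuming Python 3.8 or later for `statistics.multimode()`; pandas' `Series.mode()` works similarly and returns the modes in sorted order:

```python
import statistics
import pandas as pd

ages2 = [14, 12, 11, 10, 8, 6, 8, 6]

# Standard library (Python 3.8+): modes in order of first appearance
print(statistics.multimode(ages2))

# pandas: a Series containing every mode, sorted ascending
print(pd.Series(ages2).mode().tolist())
```

Both calls report 6 and 8 here; only the ordering differs.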