# Drill - Describing Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## 1. "Greg was 14, Marcia was 12, Peter was 11, Jan was 10, Bobby was 8, and Cindy was 6 when they started playing the Brady kids on The Brady Bunch. Cousin Oliver was 8 years old when he joined the show. 

   - ## What are the mean, median, and mode of the kids' ages when they first appeared on the show? 

   - ## What are the variance, standard deviation, and standard error?"

## Measures of Central Tendency - Mean, Median, and Mode

In [2]:
brady_bunch = pd.DataFrame()
brady_bunch['name'] = ['Greg', 'Marcia', 'Peter', 'Jan', 'Bobby', 'Cindy', 'Oliver']
brady_bunch['age'] = [14, 12, 11, 10, 8, 6, 8]

### Mean

In [3]:
def df_mean(df, column):
    mean = df[column].mean()
    return mean

In [4]:
df_mean(brady_bunch, 'age')

9.857142857142858

### Median

In [5]:
def df_median(df, column):
    median = df[column].median()
    return median

In [6]:
df_median(brady_bunch, 'age')

10.0

### Mode(s)

In [7]:
def df_modes(df, column):
    # Count how often each distinct value occurs
    (values, counts) = np.unique(df[column], return_counts=True)
    # Keep every value tied for the highest count
    modes = [value for value, count in zip(values, counts) if count == counts.max()]
    # If every distinct value is tied (for example, all values unique), report no mode
    if len(modes) == len(set(df[column])):
        modes = None
    return modes

In [8]:
df_modes(brady_bunch, 'age')

[8]
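As a cross-check on the hand-rolled function, pandas has a built-in `Series.mode()`. The sketch below rebuilds the ages as a standalone Series (the name `ages` is only for illustration). Unlike `df_modes`, the built-in returns every value when all values are tied instead of returning `None`.

```python
import pandas as pd

# Ages at first appearance, in the same order as the DataFrame above
ages = pd.Series([14, 12, 11, 10, 8, 6, 8])

# Series.mode() returns all values tied for the highest count, sorted ascending
modes = ages.mode().tolist()
print(modes)

# With no repeated values, every value is returned (df_modes would return None)
all_unique_modes = pd.Series([1, 2, 3]).mode().tolist()
print(all_unique_modes)
```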

## Measures of Variance - Variance, Standard Deviation, and Standard Error

### Variance

In [9]:
def df_var(df, column):
    # ddof=0 gives the population variance (divide by n, not n - 1)
    var = df[column].var(ddof=0)
    return var

In [10]:
df_var(brady_bunch, 'age')

6.408163265306122

### Standard Deviation

In [11]:
def df_std(df, column):
    # ddof=0 gives the population standard deviation (divide by n, not n - 1)
    std = df[column].std(ddof=0)
    return std

In [12]:
brady_bunch_age_std = df_std(brady_bunch, 'age')
brady_bunch_age_std

2.531435020952764

### Standard Error

In [13]:
def df_sterr(df, column):
    # Population std (ddof=0) over sqrt(n - 1) equals sample std (ddof=1) over sqrt(n)
    sterr = df[column].std(ddof=0) / np.sqrt(len(df[column]) - 1)
    return sterr

In [14]:
df_sterr(brady_bunch, 'age')

1.0334540197243194
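The standard error above divides the population standard deviation by sqrt(n - 1), which can look unusual next to the more common "sample standard deviation over sqrt(n)" form. The two are algebraically the same. The sketch below checks this numerically and compares both with the pandas built-in `Series.sem()`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([14, 12, 11, 10, 8, 6, 8])
n = len(ages)

# Form used in df_sterr: population std (ddof=0) over sqrt(n - 1)
sem_population_form = ages.std(ddof=0) / np.sqrt(n - 1)

# Common textbook form: sample std (ddof=1) over sqrt(n)
sem_sample_form = ages.std(ddof=1) / np.sqrt(n)

# pandas built-in, which uses ddof=1 by default
sem_builtin = ages.sem()

print(sem_population_form, sem_sample_form, sem_builtin)
```

All three values should agree up to floating-point rounding.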

## DataFrame.describe()

In [15]:
brady_bunch.describe()

| | age |
|---|---|
| count | 7.0 |
| mean | 9.857143 |
| std | 2.734262 |
| min | 6.0 |
| 25% | 8.0 |
| 50% | 10.0 |
| 75% | 11.5 |
| max | 14.0 |
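The `std` reported by `describe()` (2.734262) differs from the 2.531435 computed earlier because `describe()` uses the sample standard deviation (`ddof=1`), while `df_std` uses the population form (`ddof=0`). A minimal sketch of the difference, using a standalone Series named `ages` for illustration:

```python
import pandas as pd

ages = pd.Series([14, 12, 11, 10, 8, 6, 8])

# Population standard deviation: divide the squared deviations by n
population_std = ages.std(ddof=0)

# Sample standard deviation: divide by n - 1 (the pandas default)
sample_std = ages.std()

# describe() reports the sample version
describe_std = ages.describe()['std']

print(population_std, sample_std, describe_std)
```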


-------------------------------------------------------------------------------------------------------------------------

## 2. Using these estimates, if you had to choose only one estimate of central tendency and one estimate of variance to describe the data, which would you pick and why?

### Describing the data with:
   - ### Measures of Central Tendency - _mean_
   - ### Measures of Variance - _standard deviation_
   
### I would use the _mean_ and the _standard deviation_ to describe the data, because together they summarize where the data points are centered and how widely they are spread.

### For this sample, the mean age was about 9.86 years, and every age in the dataset falls within two standard deviations of the mean.
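The "within two standard deviations" statement can be checked by converting each age to a z-score. This is a minimal sketch that assumes the population standard deviation (`ddof=0`), matching `df_std`; the Series `ages` and the other variable names are illustrative.

```python
import pandas as pd

ages = pd.Series(
    [14, 12, 11, 10, 8, 6, 8],
    index=['Greg', 'Marcia', 'Peter', 'Jan', 'Bobby', 'Cindy', 'Oliver'],
)

# z-score: distance from the mean, measured in standard deviations
z_scores = (ages - ages.mean()) / ages.std(ddof=0)

# True only if every age lies less than two standard deviations from the mean
within_two_std = bool((z_scores.abs() < 2).all())

print(z_scores.round(2))
print(within_two_std)
```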
-------------------------------------------------------------------------------------------------------------------------

## 3. Next, Cindy has a birthday. Update your estimates. What changed, and what didn't?

In [16]:
brady_bunch.loc[brady_bunch['name'] == 'Cindy', ['age']] = 7
brady_bunch

| | name | age |
|---|---|---|
| 0 | Greg | 14 |
| 1 | Marcia | 12 |
| 2 | Peter | 11 |
| 3 | Jan | 10 |
| 4 | Bobby | 8 |
| 5 | Cindy | 7 |
| 6 | Oliver | 8 |


In [17]:
brady_bunch.describe()

| | age |
|---|---|
| count | 7.0 |
| mean | 10.0 |
| std | 2.516611 |
| min | 7.0 |
| 25% | 8.0 |
| 50% | 10.0 |
| 75% | 11.5 |
| max | 14.0 |


In [18]:
print('New mean: ', df_mean(brady_bunch, 'age'))
print('New median: ', df_median(brady_bunch, 'age'))
print('New mode: ', df_modes(brady_bunch, 'age'))
print('New variance: ', df_var(brady_bunch, 'age'))
print('New standard deviation: ', df_std(brady_bunch, 'age'))
print('New standard error: ', df_sterr(brady_bunch, 'age'))

New mean:  10.0
New median:  10.0
New mode:  [8]
New variance:  5.428571428571429
New standard deviation:  2.32992949004287
New standard error:  0.951189731211342


### The mean, variance, standard deviation, and standard error changed when the dataframe was updated with Cindy's new age.  The median and mode remained the same.
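One way to see what changed is to compute the statistics before and after Cindy's birthday and flag the rows that differ. The sketch below is illustrative: `summarize` is a hypothetical helper, it uses the population variance and standard deviation (`ddof=0`) as in the functions above, and it leaves out the mode, which stays at 8.

```python
import numpy as np
import pandas as pd

before = pd.Series([14, 12, 11, 10, 8, 6, 8])
after = pd.Series([14, 12, 11, 10, 8, 7, 8])  # Cindy goes from 6 to 7


def summarize(ages):
    """Hypothetical helper: collect the statistics used in this drill."""
    return pd.Series({
        'mean': ages.mean(),
        'median': ages.median(),
        'variance': ages.var(ddof=0),
        'std': ages.std(ddof=0),
        'sem': ages.sem(),
    })


comparison = pd.DataFrame({'before': summarize(before), 'after': summarize(after)})

# Flag each statistic whose value moved
comparison['changed'] = ~np.isclose(comparison['before'], comparison['after'])
print(comparison)
```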
-------------------------------------------------------------------------------------------------------------------------

## 4. Nobody likes Cousin Oliver. Maybe the network should have used an even younger actor. Replace Cousin Oliver with 1-year-old Jessica, then recalculate again. Does this change your choice of central tendency or variance estimation methods?

In [19]:
# DataFrame after updating Cindy's age
brady_bunch

| | name | age |
|---|---|---|
| 0 | Greg | 14 |
| 1 | Marcia | 12 |
| 2 | Peter | 11 |
| 3 | Jan | 10 |
| 4 | Bobby | 8 |
| 5 | Cindy | 7 |
| 6 | Oliver | 8 |


In [20]:
# Replace Oliver with Jessica
brady_bunch.loc[brady_bunch['name'] == 'Oliver', ['name', 'age']] = ['Jessica', 1]

In [21]:
# DataFrame after replacing Oliver with Jessica
brady_bunch

| | name | age |
|---|---|---|
| 0 | Greg | 14 |
| 1 | Marcia | 12 |
| 2 | Peter | 11 |
| 3 | Jan | 10 |
| 4 | Bobby | 8 |
| 5 | Cindy | 7 |
| 6 | Jessica | 1 |


In [22]:
print('New mean: ', df_mean(brady_bunch, 'age'))
print('New median: ', df_median(brady_bunch, 'age'))
print('New mode: ', df_modes(brady_bunch, 'age'))
print('New variance: ', df_var(brady_bunch, 'age'))
print('New standard deviation: ', df_std(brady_bunch, 'age'))
print('New standard error: ', df_sterr(brady_bunch, 'age'))

New mean:  9.0
New median:  10.0
New mode:  None
New variance:  15.428571428571429
New standard deviation:  3.927922024247863
New standard error:  1.6035674514745466


### Replacing Oliver with one-year-old Jessica pulled the mean down from 10.0 to 9.0, while the median stayed at 10.0. Because the mean is sensitive to this one extreme value, I would now use the _median_ and _standard deviation_ to describe the data.
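The effect of a single extreme value can be shown by measuring how far each statistic moves when Oliver (8) is swapped for Jessica (1). A minimal sketch with illustrative variable names, again using the population standard deviation (`ddof=0`):

```python
import pandas as pd

with_oliver = pd.Series([14, 12, 11, 10, 8, 7, 8])
with_jessica = pd.Series([14, 12, 11, 10, 8, 7, 1])

# The mean shifts because every value contributes to it
mean_shift = with_jessica.mean() - with_oliver.mean()

# The median does not move: the middle value of the sorted ages is still 10
median_shift = with_jessica.median() - with_oliver.median()

# The spread grows because Jessica's age sits far from the rest
std_shift = with_jessica.std(ddof=0) - with_oliver.std(ddof=0)

print(mean_shift, median_shift, std_shift)
```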
----------------------------------------------------------------------------------------------------------------------

## 5. On the 50th anniversary of The Brady Bunch, four different magazines asked their readers whether they were fans of the show. The answers were:

   - ## TV Guide: 20% fans
   - ## Entertainment Weekly: 23% fans
   - ## Pop Culture Today: 17% fans
   - ## SciPhi Phanatic: 5% fans

## Based on these numbers, what percentage of adult Americans would you estimate were Brady Bunch fans on the 50th anniversary of the show?


In [23]:
# The "SciPhi Phanatic" figure is unlikely to represent the general population,
# because that magazine's readers are mostly fans of one specific genre of
# entertainment. Including it in the estimate of the show's popularity would
# likely introduce bias, so it is left out here.
bb_fans = pd.DataFrame()
bb_fans['magazine'] = ['TV Guide', 'Entertainment Weekly', 'Pop Culture Today']
bb_fans['percentage'] = [20, 23, 17]

In [24]:
bb_fans.mean(numeric_only=True)

percentage    20.0
dtype: float64

### I would estimate that 20% of adult Americans were fans of The Brady Bunch TV show.
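To show how much the excluded magazine would move the estimate, the sketch below compares the mean with and without SciPhi Phanatic, along with the median of all four figures. It treats each magazine as equally weighted, which is an assumption: the readership sizes are not given.

```python
import pandas as pd

fans = pd.Series(
    [20, 23, 17, 5],
    index=['TV Guide', 'Entertainment Weekly', 'Pop Culture Today', 'SciPhi Phanatic'],
)

# Mean of all four magazines, each weighted equally (assumption)
mean_all = fans.mean()

# Mean after dropping the genre-specific magazine, as done above
mean_without_sciphi = fans.drop('SciPhi Phanatic').mean()

# The median of all four is less affected by the single low value
median_all = fans.median()

print(mean_all, mean_without_sciphi, median_all)
```

Including the fourth magazine lowers the mean from 20.0 to 16.25, while the median of all four figures is 18.5.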