<h1><b>1. Exploratory Data Analysis</b></h1>
<p>Let's first import the required libraries for this chapter.</p>

In [1]:
%matplotlib inline

from pathlib import Path

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
try:
    import common
    DATA = common.dataDirectory()
except ImportError:
    DATA = Path().resolve() / 'data'

<p>This chapter uses several datasets. They are all stored in the <code>DATA</code> directory defined above, so every file path can be built from it.</p>

In [3]:
AIRLINE_STATS_CSV = DATA / 'airline_stats.csv'
KC_TAX_CSV = DATA / 'kc_tax.csv.gz'
LC_LOANS_CSV = DATA / 'lc_loans.csv'
AIRPORT_DELAYS_CSV = DATA / 'dfw_airline.csv'
SP500_DATA_CSV = DATA / 'sp500_data.csv.gz'
SP500_SECTORS_CSV = DATA / 'sp500_sectors.csv'
STATE_CSV = DATA / 'state.csv'

<h2>1.1 Estimates of Location</h2>
<h3>Example - Location Estimates of Population and Murder Rates</h3>
<p><strong>Table 1-2.</strong> Population and murder rates (murders per 100,000 people per year) for each state, from the 2010 Census.</p>

In [4]:
df_state = pd.read_csv(STATE_CSV)
print(df_state.head(10))

         State  Population  Murder.Rate Abbreviation
0      Alabama     4779736          5.7           AL
1       Alaska      710231          5.6           AK
2      Arizona     6392017          4.7           AZ
3     Arkansas     2915918          5.6           AR
4   California    37253956          4.4           CA
5     Colorado     5029196          2.8           CO
6  Connecticut     3574097          2.4           CT
7     Delaware      897934          5.8           DE
8      Florida    18801310          5.8           FL
9      Georgia     9687653          5.7           GA


<p>Let's compute the mean population.</p>

In [5]:
df_state['Population'].mean()

6162876.3

<p>To compute the trimmed mean we can use <code>trim_mean</code> from <code>scipy.stats</code>:</p>

In [9]:
print(trim_mean(df_state['Population'], 0.1))

4783697.125


<p>Ok! Now, how about computing the median?</p>

In [10]:
print(df_state['Population'].median())

4436369.5


<p>Calling <code>trim_mean</code> with 0.1 drops 10% of the values from each end of the sorted data, which here means the 5 least populous and the 5 most populous states. With the extremes excluded, the trimmed mean is smaller than the mean, and the median is smaller still.</p>
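<p>To make the trimming step concrete, here is a minimal sketch in plain Python. The data and the helper <code>manual_trim_mean</code> are hypothetical and only illustrate the idea: sort the values, cut <code>int(proportion * n)</code> of them from each end, and average what is left.</p>

```python
from statistics import mean

# Hypothetical toy data (not from state.csv): ten values, one of them extreme.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]


def manual_trim_mean(data, proportion):
    """Sort, drop int(proportion * n) values from each end, average the rest."""
    ordered = sorted(data)
    k = int(proportion * len(ordered))  # number of values cut from EACH end
    return mean(ordered[k:len(ordered) - k])


plain_mean = mean(values)
trimmed = manual_trim_mean(values, 0.1)
print(plain_mean, trimmed)
```

<p>With ten values and a proportion of 0.1, one value is removed from each end, so the extreme value 100 no longer pulls the average upward.</p>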
<p>To compute the average murder rate for the country, we need a <strong>weighted average</strong> that accounts for the different population of each state. For comparison, the plain mean treats every state equally, regardless of its size.</p>

In [12]:
print(df_state['Murder.Rate'].mean())

4.066


In [13]:
# Use weighted mean with np.average
print(np.average(df_state['Murder.Rate'], weights=df_state['Population']))

4.445833981123393
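<p>The weighted mean is the sum of each value times its weight, divided by the sum of the weights. The sketch below spells that out in plain Python on hypothetical toy numbers (three made-up "states"); the helper <code>weighted_mean</code> is an illustration, not part of any library.</p>

```python
# Hypothetical toy data: a rate and a population for three made-up states.
rates = [2.0, 4.0, 9.0]
populations = [1_000_000, 3_000_000, 1_000_000]


def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight


unweighted = sum(rates) / len(rates)
weighted = weighted_mean(rates, populations)
print(unweighted, weighted)
```

<p>The most populous toy state has a rate of 4.0, so the weighted result is pulled toward that value and away from the unweighted mean.</p>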


<p>To get the <strong>weighted median</strong>, use the <code>wquantiles</code> package.</p>

In [14]:
print(wquantiles.median(df_state['Murder.Rate'], weights=df_state['Population']))

4.4
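<p>One common way to define a weighted median is to sort the values, accumulate their weights, and pick the first value at which the running weight reaches half of the total. The sketch below implements that simple definition on hypothetical data. It is not the <code>wquantiles</code> algorithm, which may interpolate between neighbouring values, so the two can give different results.</p>

```python
from statistics import median


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))  # order the pairs by value
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("weighted_median() needs at least one value")


# Hypothetical toy data: the value 6.0 carries most of the weight.
rates = [2.0, 4.0, 10.0, 6.0]
weights = [1, 1, 1, 5]

plain = median(rates)
weighted = weighted_median(rates, weights)
print(plain, weighted)
```

<p>Because 6.0 holds five of the eight units of weight, the weighted median lands on it, while the unweighted median sits halfway between 4.0 and 6.0.</p>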


<div style="background: lightblack; 
            font-size: 16px; 
            padding: 10px; 
            border: 1px solid lightgray; 
            margin: 10px;">
  <h4><strong>Takeaways:</strong></h4>
<ul>
<li>Although the mean is the basic metric for location, it can be sensitive to extreme values.</li>
<li>Other metrics such as the median and the trimmed mean are more robust since they are not as affected by outliers.</li>
</ul>
</div>
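<p>A quick way to see this sensitivity is to add a single extreme value to a small sample and compare how far the mean and the median move. The numbers below are hypothetical and chosen only for illustration.</p>

```python
from statistics import mean, median

# Hypothetical sample, then the same sample with one extreme value appended.
sample = [30, 32, 35, 38, 40]
with_outlier = sample + [1000]

mean_shift = mean(with_outlier) - mean(sample)
median_shift = median(with_outlier) - median(sample)

print("mean:  ", mean(sample), "->", mean(with_outlier))
print("median:", median(sample), "->", median(with_outlier))
```

<p>The single outlier moves the mean by more than 150 units, while the median shifts by only 1.5.</p>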

<h2>1.2 Estimates of Variability</h2>
<h3>1.2.1 Standard Deviation and Related Estimates</h3>
