# Exploratory Data Analysis in Python

## Introduction

This course is written and presented by Allen Downey, Staff Scientist at DrivenData and Professor Emeritus at Olin College.

This course overlaps substantially with the Statistical Thinking in Python (Part 1) course.

Prerequisite:
- Python Data Science Toolbox (Part 2)

This course is part of these tracks:
- Data Analyst with Python (career track)
- Data Scientist with Python (career track)

## Setup

Use the directions in the README.md file in the datacamp directory to set up the virtual environment for this course.

## Imports

Imports are collected here for convenience and clarity.

See the course "Introduction to Importing Data in Python", which uses the h5py module to load a .hdf5 file correctly.

For this course, pytables (tables) is used as an alternative tool. See https://www.pytables.org/usersguide/introduction.html.

In [None]:
import empiricaldist
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns

plt.style.use("dark_background")

## Datasets

| Name | File |
| :--- | :--- |
| National Survey of Family Growth (NSFG) | nsfg.hdf5 |
| General Social Survey (GSS) | gss.hdf5 |
| Behavioral Risk Factor Surveillance System (BRFSS) | brfss.hdf5 |

### National Survey of Family Growth (NSFG)

The data comes from the National Survey of Family Growth (NSFG) (https://www.cdc.gov/nchs/nsfg/index.htm) for 2013-2015. The survey is "nationally representative of women 15-44 years of age in the ... United States." The data includes "information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and general and reproductive health."

It is necessary to read the codebook to understand the data fields. See, for example, https://www.cdc.gov/nchs/nsfg/nsfg_2013_2015_codebooks.htm.

`birthwgt_lb1` records the pounds part of the birth weight. The values are:

| value | label | total |
| :--- | :--- | ---: |
| - | INAPPLICABLE | 2673 |
| 0-5 | UNDER 6 POUNDS | 936 |
| 6 | 6 POUNDS | 1666 |
| 7 | 7 POUNDS | 2146 |
| 8 | 8 POUNDS | 1168 |
| 9-95 | 9 POUNDS OR MORE | 474 |
| 98 | Refused | 1 |
| 99 | Don't know | 94 |

`birthwgt_oz1` records the ounces part of the birth weight. The values are:

| value | label | total |
| :--- | :--- | ---: |
| - | INAPPLICABLE | 2967 |
| 0-15 | 0-15 OUNCES | 6355 |
| 98 | Refused | 1 |
| 99 | Don't know | 35 |

`outcome` encodes the outcome of the pregnancy:

| value | label |
| :--- | :--- |
| 1 | Live birth |
| 2 | Induced abortion |
| 3 | Stillbirth |
| 4 | Miscarriage |
| 5 | Ectopic pregnancy |
| 6 | Current pregnancy |

`nbrnaliv` records the number of babies born from the pregnancy:

| value | label |
| :--- | :--- |
| . | INAPPLICABLE |
| 1 | 1 BABY |
| 2 | 2 BABIES |
| 3 | 3 OR MORE BABIES |
| 8 | Refused |

#### Explore the HDFStore Object

See https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#hdf5-pytables.

In [None]:
# Get the HDFStore object and see what's inside.
# The pandas DataFrame is available from
# pd.HDFStore("nsfg.hdf5", mode="r").get("nsfg").
with pd.HDFStore("nsfg.hdf5", mode="r") as store:
    print(type(store))
    print()
    print(store.info())
    print()
    print(store.keys())
    print()
    print(store.groups())
    print()
    
    # Walk the group hierarchy (copied from the documentation).
    # Somehow, the data is stored as a pandas DataFrame.
    for (path, subgroups, subkeys) in store.walk():
        for subgroup in subgroups:
            print("GROUP: {}/{}".format(path, subgroup))
        for subkey in subkeys:
            key = "/".join([path, subkey])
            print("KEY: {}".format(key))
            print()
            data = store.get(key)
            print(type(data))
            print()
            print(data)

#### Read the Data

In [None]:
# Read the data from the file using the "nsfg" key.
nsfg = pd.read_hdf("nsfg.hdf5", "nsfg")
print(type(nsfg))
print()
print(nsfg.info())
print()
print(nsfg.head())

### General Social Survey (GSS)

The General Social Survey (GSS) is an annual sample of the U.S. population recording hundreds of variables. The survey asks about demographic, social, and political beliefs. The data are widely used by politicians, policy makers, and researchers. Allen Downey has selected a few of the variables, cleaned and validated the data, and packaged the data into the gss.hdf5 file.

#### Read the Data

In [None]:
# Read the data into a DataFrame using the key "gss".
gss = pd.read_hdf("gss.hdf5", "gss")
print(gss.info())
print()
print(gss.head(5))

### Behavioral Risk Factor Surveillance System (BRFSS)

#### Read the Data

In [None]:
# Read the data into a DataFrame using the key "brfss".
brfss = pd.read_hdf("brfss.hdf5", "brfss")
print(brfss.info())
print()
print(brfss.head())

## Read, Clean, and Validate

### DataFrames and Series

#### Reading Data (Example)

The data has been read into the `nsfg` variable, which is a pandas DataFrame.

#### Read the Codebook (Exercise)

In [None]:
# Each column is a pandas Series.
# NaN is used to indicate invalid or missing data.
pounds = nsfg["birthwgt_lb1"]
print(type(pounds))
print()
print(pounds.head())

#### Exploring the NSFG Data (Exercise)

In [None]:
# Display the number of rows and columns.
print(nsfg.shape)
# Display the names of the columns.
print(nsfg.columns)
# Select column birthwgt_oz1: ounces.
ounces = nsfg['birthwgt_oz1']
# Print the first 5 elements of ounces.
print(ounces.head())

### Clean and Validate

#### Selecting Columns (Example)

We reuse the pounds and ounces variables created above.

#### Validating Data (Example)

We can validate the numbers by comparing them to the codebook values (see above). "The results agree with the codebook, so we have some confidence that we are reading and interpreting the data correctly."

In [None]:
# We can start validating the data by counting the distinct values, which
# creates a pandas Series ordered by the counts.
pounds_counts = pounds.value_counts()
print(type(pounds_counts))
print(pounds_counts)
print()

# Sort the data by the index, the number of pounds.
sorted_pounds_counts = pounds_counts.sort_index()
print(type(sorted_pounds_counts))
print(sorted_pounds_counts)
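
The codebook groups some values (for example 0-5 and 9-95), so matching its totals means summing several rows of the sorted counts. The following is a minimal sketch on made-up values (`toy_pounds` is hypothetical, not NSFG data); it relies on label-based slicing of a sorted index.

```python
import pandas as pd

# Hypothetical toy data, including the special codes 98 and 99.
toy_pounds = pd.Series([3, 5, 6, 6, 7, 7, 7, 8, 9, 10, 98, 99, 99])
toy_counts = toy_pounds.value_counts().sort_index()

# Sum the rows that the codebook reports as a single category.
under_6 = toy_counts.loc[0:5].sum()      # "UNDER 6 POUNDS"
nine_plus = toy_counts.loc[9:95].sum()   # "9 POUNDS OR MORE"

print(under_6, toy_counts.loc[6], nine_plus)
```

The same slicing applied to `sorted_pounds_counts` should give totals that can be compared line by line with the codebook table.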

#### Validate using Describe (Example)

Another way to validate the data is to use the `.describe()` method of the pandas Series object to create summary statistics. The mean is distorted by the special values of 98 and 99.

In [None]:
# Get summary statistics.
print(pounds.describe())

#### Replace Bad Data (Example)

Replace values of 98 and 99, which indicate missing data, with NaN. The summary statistics exclude rows with NaN. The code shows two ways to replace data.

In [None]:
# Replace values of 98 and 99 with np.nan.
pounds = pounds.replace([98, 99], np.nan)
print(pounds.describe())

In [None]:
print(ounces.value_counts().sort_index())

In [None]:
ounces.replace([98, 99], np.nan, inplace=True)
print(ounces.describe())
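
To see how strongly the special codes can distort a summary statistic, here is a small sketch on a made-up Series (`toy_pounds` is an illustrative name, and the values are not NSFG data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four real weights plus the codes 98 and 99.
toy_pounds = pd.Series([6, 7, 7, 8, 98, 99])

# The special codes are treated as real weights and inflate the mean.
raw_mean = toy_pounds.mean()

# After replacement, pandas skips NaN when computing the mean.
cleaned = toy_pounds.replace([98, 99], np.nan)
clean_mean = cleaned.mean()

print(raw_mean, clean_mean)
print(cleaned.count())  # number of non-missing values
```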

#### Calculate a New Column Using Series Arithmetic (Example)

Combine pounds and ounces into a single birth weight. The code below expresses the result in pounds by dividing the ounces by 16.

In [None]:
# Combine pounds and ounces into a new birth_weight series.
birth_weight = pounds + ounces / 16
print(birth_weight.describe())
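
One consequence of Series arithmetic worth keeping in mind: if either part is NaN, the combined value is NaN too. A minimal sketch with made-up values (`toy_pounds` and `toy_ounces` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy_pounds = pd.Series([7, 6, np.nan, 8])
toy_ounces = pd.Series([8, np.nan, 4, 0])

# Element-wise arithmetic; NaN in either operand propagates to the result.
toy_weight = toy_pounds + toy_ounces / 16
print(toy_weight)
```

Only the rows where both parts are known produce a birth weight, so the count reported by `.describe()` can be smaller than the count of either input column.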

#### Validate a Variable (Exercise)

According to the codebook (see above), the outcome column contains 1 for a live birth. How many live births occurred in the data?

In [None]:
# Count the number of live births (6,489).
print(nsfg["outcome"].value_counts().sort_index())

#### Clean a Variable (Exercise)

Replace the nbrnaliv value of 8 with NaN.

In [None]:
# Convert 8 to NaN in place.
print("Before:")
print(nsfg["nbrnaliv"].info())
print()
print(nsfg["nbrnaliv"].value_counts().sort_index())
print()
print("After:")
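# Assign the result back to the column. Calling .replace(..., inplace=True)
# on a selected column may not update nsfg in recent pandas versions.
nsfg["nbrnaliv"] = nsfg["nbrnaliv"].replace(8, np.nan)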
print(nsfg["nbrnaliv"].info())
print(nsfg["nbrnaliv"].value_counts().sort_index())

#### Compute a Variable (Exercise)

The agecon and agepreg variables contain the age at conception and the age at the end of pregnancy, each multiplied by 100. Create new variables by dividing these values by 100, then calculate the length of the pregnancy (in years). Calculating a new variable is sometimes called a "recode".

In [None]:
# Get the values of agecon and agepreg in years.
# Compute the difference.
agecon = nsfg["agecon"] / 100
agepreg = nsfg["agepreg"] / 100
preg_length = agepreg - agecon
print(preg_length.describe())

### Filter and Visualize

#### Create a Histogram of Birth Weights (Example)

"There are more light babies than heavy babies."

In [None]:
# Create a histogram from the birth_weight variable.
# I applied what I learned from the "Introduction to Data Visualization with
# Matplotlib" and "Statistical Thinking in Python" courses here.
# bins is set up at 1/4-pound intervals.
# Allen applied the .dropna() method to the birth_weight variable.
fig, ax = plt.subplots()
fig.set_size_inches((12, 8))
bins = np.linspace(0, 18, 73)
_ = plt.hist(birth_weight.dropna(), bins=bins)
_ = plt.xlabel("Birth weight (pounds)")
_ = plt.xticks(np.arange(0, 19))
_ = plt.ylabel("Number of births")
plt.show()

#### Filter Preterm Births (Example)

Preterm babies are babies born less than 37 weeks after conception. We can filter for these using the prglngth column.

In [None]:
# Identify and count preterm births. True = 1; False = 0.
preterm = nsfg["prglngth"] < 37
print(preterm.head())
# Count the number of preterm births.
print(preterm.sum())
# Calculate the proportion of preterm births.
print(preterm.mean())
# Get info about preterm.
print(preterm.info())
# And summary statistics.
print(preterm.describe())
# Count True and False values.
print(preterm.value_counts())
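
A caveat when counting with Boolean Series: a comparison involving NaN evaluates to False, so rows with a missing length would be counted as "not preterm". The sketch below uses made-up values (`toy_length` is hypothetical); whether `prglngth` actually contains missing values should be checked in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical pregnancy lengths in weeks, with one missing value.
toy_length = pd.Series([35, 39, np.nan, 40, 30])

# NaN < 37 is False, so the missing row is treated as not preterm.
toy_preterm = toy_length < 37
print(toy_preterm.tolist())
print(toy_preterm.sum(), toy_preterm.mean())

# Restricting to rows with a known length changes the proportion.
known = toy_length.notna()
print(toy_preterm[known].mean())
```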

In [None]:
# Get the birth weights for preterm babies.
preterm_weight = birth_weight[preterm == True]
# Alternative formula:
# preterm_weight = birth_weight[preterm]
print(preterm_weight.mean())
fullterm_weight = birth_weight[preterm == False]
# Alternative formula:
# fullterm_weight = birth_weight[~preterm]
print(fullterm_weight.mean())

#### Plot Histograms of Preterm and Fullterm Births (Extra)

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches((12, 8))
bins = np.linspace(0, 18, 73)
alpha = 0.4
# Filter first, then drop NaN, so the mask and the data share an index.
_ = plt.hist(birth_weight[preterm].dropna(), bins=bins,
             label="Preterm births", alpha=alpha)
_ = plt.hist(birth_weight[~preterm].dropna(), bins=bins,
             label="Fullterm births", alpha=alpha)
_ = plt.xlabel("Birth weight (pounds)")
_ = plt.xticks(np.arange(0, 19))
_ = plt.ylabel("Number of births")
_ = plt.legend()
plt.show()

#### Resampling (Example)

The NSFG is not representative of the U.S. population; some groups are oversampled. Oversampling makes sure you have enough people in some subgroups to perform a reliable statistical analysis. We can correct for this using Allen's `resample_rows_weighted()` function. I obtained the code for that function from the DataCamp console using `??resample_rows_weighted`.

```Python
def resample_rows_weighted(df, column='finalwgt', seed=17):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    np.random.seed(seed)
    weights = df[column] / sum(df[column])
    indices = np.random.choice(df.index, len(df), replace=True, p=weights)
    sample = df.loc[indices]
    return sample
```
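
The effect of weighted resampling is easier to see on a tiny made-up survey. This is a sketch only: the `toy` DataFrame, its `group` and `weight` columns, and the seed are all hypothetical, and it uses `np.random.default_rng` instead of the `np.random.seed` call in the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: groups "a" and "b" each make up half of the rows,
# but every "b" row stands for three times as many people as an "a" row.
toy = pd.DataFrame({
    "group": ["a"] * 500 + ["b"] * 500,
    "weight": [1.0] * 500 + [3.0] * 500,
})

# Draw rows with replacement, with probability proportional to weight.
rng = np.random.default_rng(17)
probs = (toy["weight"] / toy["weight"].sum()).to_numpy()
indices = rng.choice(toy.index.to_numpy(), size=len(toy), replace=True, p=probs)
resampled = toy.loc[indices]

# Share of group "b" before and after resampling.
share_b_before = (toy["group"] == "b").mean()
share_b_after = (resampled["group"] == "b").mean()
print(share_b_before, share_b_after)
```

Before resampling, group "b" accounts for half of the rows; afterwards its share should be close to 0.75, the share implied by the weights.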

#### Create a Histogram of Age at Conception (Exercise)

In [None]:
# Plot the histogram
plt.hist(agecon, bins=20, histtype="step")
plt.xlabel('Age at conception')
plt.ylabel('Number of pregnancies')
plt.show()

#### Resample the Data (Demonstration)

In [None]:
# Resample the data to make it more representative using the weight
# values in the "wgt2013_2015" column.
# This code was provided.
def resample_rows_weighted(df, column, seed=17):
    """Resample a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    np.random.seed(seed)
    weights = df[column] / sum(df[column])
    indices = np.random.choice(df.index, len(df), replace=True, p=weights)
    sample = df.loc[indices]
    return sample

# Resample the data.
nsfg2 = resample_rows_weighted(nsfg, 'wgt2013_2015')
# Clean the weight variables.
pounds2 = nsfg2['birthwgt_lb1'].replace([98, 99], np.nan)
ounces2 = nsfg2['birthwgt_oz1'].replace([98, 99], np.nan)
# Compute total birth weight
birth_weight2 = pounds2 + ounces2 / 16
birth_weight2.describe()

#### Compute Mean Birth Weight (Exercise)

Using nsfg2, the resampled data, as the data source, compute the mean birth weight of full-term babies.

In [None]:
# Create a Boolean Series for full-term babies
full_term = nsfg2["prglngth"] >= 37

# Select the weights of full-term babies
full_term_weight = birth_weight2[full_term]

# Compute the mean weight of full-term babies
print(full_term_weight.mean())

#### Filter out Multiple Births (Exercise)

Some pregnancies lead to multiple births. Filter these out since the distribution of birth weight is different for twins, triplets, etc.

In [None]:
# Filter single births.
single = nsfg2["nbrnaliv"] == 1
# Compute birth weight for single full-term babies.
single_full_term_weight = birth_weight2[full_term & single]
print('Single full-term mean:', single_full_term_weight.mean())
# Compute birth weight for multiple full-term babies
mult_full_term_weight = birth_weight2[full_term & ~single]
print('Multiple full-term mean:', mult_full_term_weight.mean())
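
Note that `~single` selects every row where `nbrnaliv` is not 1, which includes rows where the value is missing. A minimal sketch on made-up data (the `toy` DataFrame is hypothetical) compares that mask with a stricter one:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one single birth, one twin birth, one unknown count,
# and one single birth that was not full term.
toy = pd.DataFrame({
    "nbrnaliv": [1, 2, np.nan, 1],
    "prglngth": [39, 38, 40, 30],
})
toy_single = toy["nbrnaliv"] == 1
toy_full_term = toy["prglngth"] >= 37

# Combine masks element-wise with & (and) and ~ (not).
single_full = (toy_full_term & toy_single).tolist()
not_single_full = (toy_full_term & ~toy_single).tolist()

# A stricter mask that requires a recorded count greater than 1.
toy_multiple = toy["nbrnaliv"] > 1
multiple_full = (toy_full_term & toy_multiple).tolist()

print(single_full)
print(not_single_full)
print(multiple_full)
```

The row with the unknown count appears in the `~toy_single` selection but not in the stricter `toy_multiple` selection.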

## Distributions

### Probability Mass Functions

The code makes use of Allen Downey's empiricaldist module, which is documented here: https://nbviewer.org/github/AllenDowney/empiricaldist/blob/master/empiricaldist/dist_demo.ipynb. By default, the probabilities are normalized.

#### Plot a Histogram of Years of Education (Demonstration)

In [None]:
# Create a variable containing the number of years of education.
# Create a histogram of the data.
# Create a bin for each year of education.
educ = gss["educ"]
plt.hist(educ.dropna(), bins=np.arange(0, 22), label="educ")
plt.xticks(np.arange(0, 23, 2))
plt.xlabel("Years of education")
plt.ylabel("Count")
plt.legend()
plt.show()

#### Create a PMF of Years of Education (Demonstration)

In [None]:
# The API has changed; call empiricaldist.Pmf.from_seq(),
# not empiricaldist.Pmf() as shown in the demonstration.
# The dataset used in the course's video is different from the one
# delivered by the course.
pmf_educ = empiricaldist.Pmf.from_seq(educ, normalize=False)
print(pmf_educ)
print(pmf_educ[12])

#### Create a Normalized PMF of Years of Education (Demonstration)

In [None]:
# Normalize the PMF.
pmf_educ2 = empiricaldist.Pmf.from_seq(educ, normalize=True)
print(pmf_educ2)
print(pmf_educ2[12])
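
For comparison, the same two quantities (counts and normalized probabilities) can be computed with plain pandas. This is a sketch on made-up values, not a description of how empiricaldist works internally; `toy_educ` is a hypothetical name.

```python
import pandas as pd

# Hypothetical years of education for eight respondents.
toy_educ = pd.Series([12, 12, 12, 14, 16, 16, 12, 14])

# Unnormalized: how many times each value appears.
toy_counts = toy_educ.value_counts().sort_index()

# Normalized: each count divided by the total, so the values sum to 1.
toy_pmf = toy_educ.value_counts(normalize=True).sort_index()

print(toy_counts.loc[12], toy_pmf.loc[12])
print(toy_pmf.sum())
```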

#### Plot the PMF of Years of Education (Demonstration)

As expected, the bar chart is similar to the histogram. There are peaks at 12, 14, and 16 years, which correspond to completing high school, two years of college, and four years of college.

In [None]:
# Plot a bar chart of the PMF.
pmf_educ2.bar(label="educ")
plt.xlabel("Years of education")
plt.ylabel("PMF")
plt.show()

#### Create a PMF of the Data in the year Column (Exercise)

In [None]:
# Create the PMF.
pmf_year = empiricaldist.Pmf.from_seq(gss["year"], normalize=False)
print(pmf_year)
# 2,867 people were interviewed in 2016.
print(pmf_year[2016])

#### Create a PMF of the Data in the age Column (Exercise)

In [None]:
# Create and plot the PMF of the data in the age column.
# The Pmf object has a method named .bar() that calls 
# matplotlib.pyplot.bar().
age = gss['age']
pmf_age = empiricaldist.Pmf.from_seq(age)
pmf_age.bar()
plt.xlabel('Age')
plt.ylabel('PMF')
plt.show()

In [None]:
# Alternatively:
# qs is quantities, ps is probabilities.
plt.bar(pmf_age.qs, pmf_age.ps)
plt.xlabel("Age")
plt.ylabel("PMF")
plt.show()

### Cumulative Distribution Functions

#### From PMF to CDF (Example)

For discrete random variables, the PMF (probability mass function) returns the probability that you get exactly x for a given value of x. The CDF returns the probability that you get a value less than or equal to x for a given value of x.

Using the empiricaldist module's Cdf class, we can calculate and plot the CDF.

See also the "Statistical Thinking in Python (Part 1)" course, which shows how to calculate the empirical CDF.

The PMF and CDF plots look like the uniform distribution up to about age 45, after which sample ages are less frequent.

In [None]:
# Calculate and plot the CDF.
# Call empiricaldist.Cdf.from_seq() to calculate the CDF correctly.

cdf = empiricaldist.Cdf.from_seq(gss["age"])
print(type(cdf))
_ = cdf.plot()
plt.xlabel("Age")
plt.ylabel("CDF")
plt.show()

# Get the empirical CDF for given age.
q = 51
p = cdf(q)
print(p)
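
The relationship between the PMF and the CDF can be checked by hand: the CDF at x is the running total of the PMF up to and including x. A minimal sketch with plain pandas on made-up ages (`toy_age` is hypothetical):

```python
import pandas as pd

# Hypothetical ages for eight respondents.
toy_age = pd.Series([20, 30, 30, 40, 50, 50, 50, 60])

# PMF: probability of each distinct value, sorted by value.
toy_pmf = toy_age.value_counts(normalize=True).sort_index()

# CDF: cumulative sum of the PMF.
toy_cdf = toy_pmf.cumsum()
print(toy_cdf)

# The CDF at 40 equals the fraction of values less than or equal to 40.
cdf_at_40 = toy_cdf.loc[40]
direct = (toy_age <= 40).mean()
print(cdf_at_40, direct)
```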

#### Evaluating the Inverse CDF (Example)

In [None]:
# Get the age at which the CDF is a given value.
# The interquartile range (IQR), a measure of the spread of the data,
# runs from age 30 to age 57, a width of 27 years. The median age is 43.
probabilities = (0.25, 0.50, 0.75)
for p in probabilities:
    print("{}: {}".format(p, cdf.inverse(p)))

#### Make and Use a CDF (Exercise)

In [None]:
# Create the CDF and use it to calculate the proportion of ages > 30.
cdf_age = empiricaldist.Cdf.from_seq(gss["age"])
prop = 1 - cdf_age(30)
print(prop)

#### Compute IQR (Exercise)

In [None]:
# Calculate the IQR for income.
cdf_income = empiricaldist.Cdf.from_seq(gss["realinc"])
percentile_75th = cdf_income.inverse(0.75)
percentile_25th = cdf_income.inverse(0.25)
iqr = percentile_75th - percentile_25th
print(iqr)
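
One common way to define the inverse CDF is "the smallest value whose cumulative probability is at least p". The sketch below implements that definition with NumPy on made-up incomes; `inverse_cdf` and `toy_income` are hypothetical names, and empiricaldist may use a different convention or interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes (arbitrary units).
toy_income = pd.Series([10, 20, 20, 30, 40, 50, 60, 80])
toy_cdf = toy_income.value_counts(normalize=True).sort_index().cumsum()


def inverse_cdf(cdf, p):
    """Return the smallest value whose cumulative probability is >= p."""
    position = np.searchsorted(cdf.to_numpy(), p, side="left")
    return cdf.index[position]


q25 = inverse_cdf(toy_cdf, 0.25)
q75 = inverse_cdf(toy_cdf, 0.75)
toy_iqr = q75 - q25
print(q25, q75, toy_iqr)
```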

#### Plot a CDF (Exercise)

In [None]:
# Create and plot the CDF for realinc.
income = gss["realinc"]
cdf_income = empiricaldist.Cdf.from_seq(income)
cdf_income.plot()
plt.xlabel('Income (1986 USD)')
plt.ylabel('CDF')
plt.show()

### Comparing Distributions

In general, CDFs are smoother than PMFs.

#### Compare Multiple PMFs (Example)

In [None]:
# Display the PMFs for age for males and females.
male = gss["sex"] == 1
age = gss["age"]
male_age = age[male]
# Use the bitwise ~ (not) operator here. "not male" does not work.
# female_age = age[~male]
female_age = age[np.logical_not(male)]
# Using a bar plot obscures the overlapping bars.
# Using a line plot is clearer.
# empiricaldist.Pmf.from_seq(male_age).bar(label="Male")
# empiricaldist.Pmf.from_seq(female_age).bar(label="Female")
empiricaldist.Pmf.from_seq(male_age).plot(label="Male")
empiricaldist.Pmf.from_seq(female_age).plot(label="Female")
plt.xlabel("Age (years)")
plt.ylabel("Probability")
plt.legend()
plt.show()

#### Determine Equality of two Pandas Series (Extra)

In [None]:
# Determine equality of two Pandas series:
female_age1 = age[np.logical_not(male)]
female_age2 = age[~male]
print(female_age1.equals(female_age2))

#### Compare Multiple CDFs (Example)

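In this example, the line for "Male" is slightly to the left of the line for "Female". This means that, at any given age, a slightly larger fraction of males than of females is at or below that age; in other words, the males in the sample tend to be slightly younger.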

In [None]:
# Plot the age CDFs for males and females.
empiricaldist.Cdf.from_seq(male_age).plot(label="Male")
empiricaldist.Cdf.from_seq(female_age).plot(label="Female")
plt.xlabel("Age (years)")
plt.ylabel("Cumulative probability")
plt.legend()
plt.show()

#### Compare Household Income before and after 1995 (Example)

1995 is the midpoint of the survey. The `realinc` variable represents household income in 1986 dollars. The `year` column provides the year of the interview.

> There are a lot of unique values in this distribution, and none of them appear very often. The PMF is so noisy, we can't really see the shape of the distribution. It looks like there are more people with high incomes after 1995, but it's hard to tell.

> Below $30,000 the CDFs are almost identical; above that, we can see that the 1995 and after distribution is shifted to the right. In other words, the fraction of people with high incomes is about the same, but the income of high earners has increased.

In general, Allen Downey recommends using CDFs to compare distributions because they give a clear view of the distribution without as much noise.

In [None]:
# Compare realinc before 1995 with realinc in 1995 and after.
income = gss["realinc"]
pre95 = gss["year"] < 1995
alpha = 0.6
empiricaldist.Pmf.from_seq(income[pre95]).plot(label="Before 1995", alpha=alpha)
empiricaldist.Pmf.from_seq(income[np.logical_not(pre95)]).plot(label="1995 and after", alpha=alpha)
plt.xlabel("Income (1986 USD)")
plt.ylabel("PMF")
plt.legend()
plt.show()

In [None]:
# Try using CDF plots.
# Compare realinc before 1995 with realinc in 1995 and after.
empiricaldist.Cdf.from_seq(income[pre95]).plot(label="Before 1995", alpha=alpha)
empiricaldist.Cdf.from_seq(income[np.logical_not(pre95)]).plot(label="1995 and after", alpha=alpha)
plt.xlabel("Income (1986 USD)")
plt.ylabel("CDF")
plt.legend()
plt.show()

#### Distribution of Education (Exercise)

In [None]:
# What fraction of respondents reported 12 years of education or less?
# I had to call .dropna() to get the result for the next exercise to match
# the result from this exercise.
educ = gss["educ"].dropna()
educ_cdf = empiricaldist.Cdf.from_seq(educ)
print(educ_cdf(12))

#### Extract Education Levels (Exercise)

Create boolean filters for different education levels. Find the fraction of respondents who reported 12 years of education or less.

In [None]:
# Bachelor's degree
bach = (educ >= 16)
# Associate degree
assc = ((educ >= 14) & (educ < 16))
# High school (12 or fewer years of education)
high = (educ <= 12)
print(high.mean())
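
One plausible reason the `.dropna()` call matters: a comparison with NaN is False, so missing responses stay in the denominator of `.mean()` and lower the fraction. A minimal sketch on made-up values (`toy_educ` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical years of education with two missing responses.
toy_educ = pd.Series([10, 12, 16, np.nan, np.nan])

# Missing rows compare as False but are still counted in the mean.
with_missing = (toy_educ <= 12).mean()

# Dropping missing rows first gives the fraction among valid responses.
valid_only = (toy_educ.dropna() <= 12).mean()

print(with_missing, valid_only)
```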

#### Plot Income CDFs (Exercise)

Compare incomes for different education levels. The CDFs show that people with more education had higher incomes.

In [None]:
# Obtain new Series objects without calling .dropna(), because
# this causes length mismatches.
educ = gss["educ"]
bach = (educ >= 16)
assc = ((educ >= 14) & (educ < 16))
high = (educ <= 12)

income = gss["realinc"]
empiricaldist.Cdf.from_seq(income[high]).plot(label="High School")
empiricaldist.Cdf.from_seq(income[assc]).plot(label="Associate")
empiricaldist.Cdf.from_seq(income[bach]).plot(label="Bachelor")
plt.xlabel("Income (1986 USD)")
plt.ylabel("CDF")
plt.legend()
plt.show()

#### Compare Incomes for Education Levels (Extra)

In [None]:
# Repeat this analysis with more education levels.
scol = educ == 13
bach = educ == 16
advc = educ > 16
empiricaldist.Cdf.from_seq(income[high]).plot(label="High School")
empiricaldist.Cdf.from_seq(income[scol]).plot(label="Some College")
empiricaldist.Cdf.from_seq(income[assc]).plot(label="Associate")
empiricaldist.Cdf.from_seq(income[bach]).plot(label="Bachelor")
empiricaldist.Cdf.from_seq(income[advc]).plot(label="Advanced")
plt.xlabel("Income (1986 USD)")
plt.ylabel("CDF")
plt.legend()
plt.show()
# Print median incomes for the extreme education levels.
print("High School:", empiricaldist.Cdf.from_seq(income[high]).inverse(0.5))
print("Advanced:", empiricaldist.Cdf.from_seq(income[advc]).inverse(0.5))

### Modeling Distributions

#### CDF of the Normal Distribution (Example)

Use a pseudorandom number generator for the normal distribution to create 1000 sample values. Create and plot the CDF of the random samples.

In [None]:
rng = np.random.default_rng()
sample = rng.normal(size=1000)
ecdf = empiricaldist.Cdf.from_seq(sample)
ecdf.plot()
plt.xlabel("Random value")
plt.ylabel("CDF")
plt.xticks(np.arange(-4, 5, 1))
plt.show()

`scipy.stats.norm` is an object that represents the normal distribution. The model CDF computed by SciPy closely overlaps the empirical CDF of the NumPy samples. If this were real data, we would conclude that the normal distribution was a good model for the data.

In [None]:
# Create a CDF from the normal distribution with mean 0 and standard
# deviation 1.
# Plot the two CDFs.
alpha = 0.4
ecdf.plot(alpha=alpha, label="Numpy")
xs = np.linspace(-3, 3)
ys = scipy.stats.norm(0, 1).cdf(xs)
plt.plot(xs, ys, alpha=alpha, label="Scipy")
plt.xlabel("Random value")
plt.ylabel("CDF")
plt.legend()
plt.show()
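
"The two curves overlap" can also be expressed as a number: the largest vertical gap between the empirical CDF and the model CDF at the sample points. The sketch below is a rough measure under stated assumptions, not a formal goodness-of-fit test; the seed and the names are hypothetical, and the normal CDF is written with `math.erf` so the example needs only NumPy.

```python
import math

import numpy as np

# Hypothetical sample: 1000 draws from the standard normal distribution.
rng = np.random.default_rng(42)
toy_sample = np.sort(rng.normal(size=1000))

# Empirical CDF at each sorted sample point: fraction of values <= x.
ecdf_values = np.arange(1, len(toy_sample) + 1) / len(toy_sample)

# Standard normal CDF expressed with the error function.
model_values = np.array(
    [0.5 * (1 + math.erf(x / math.sqrt(2))) for x in toy_sample]
)

# Largest vertical gap between the two curves at the sample points.
max_gap = np.max(np.abs(ecdf_values - model_values))
print(max_gap)
```

A small gap relative to the 0-to-1 range of the CDF is consistent with the visual impression that the model fits the sample well.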

#### The PDF of the Normal Distribution (the Bell Curve) (Example)

In [None]:
ys = scipy.stats.norm(0, 1).pdf(xs)
plt.plot(xs, ys, color="gray")
plt.xlabel("Normal random value")
plt.ylabel("PDF")
plt.show()

#### Compare PDF to PMF for the Normal Distribution (Example)

This doesn't work well. The 1000 random samples all have unique values, so the probability of each sample is 1/1000.

In [None]:
empiricaldist.Pmf.from_seq(sample).plot(label="Numpy")
plt.plot(xs, ys, label="Scipy")
plt.xlabel("Random value")
plt.ylabel("Probability")
plt.legend()
plt.show()
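
The claim that every sampled value is unique can be verified directly. A minimal sketch with plain pandas (the seed and names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1000 draws from the standard normal distribution.
rng = np.random.default_rng(0)
toy_sample = pd.Series(rng.normal(size=1000))

# Every floating-point draw is distinct, so each value gets probability 1/n.
n_unique = toy_sample.nunique()
toy_pmf = toy_sample.value_counts(normalize=True)

print(n_unique)
print(toy_pmf.min(), toy_pmf.max())
```

Because the PMF is flat at 1/1000 for every value, it carries no information about where the values are concentrated, which is why it cannot be compared with the PDF.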

#### Kernel Density Estimation (KDE) (Example)

Kernel density estimation (KDE) is
> a way of getting from a PMF, a probability mass function, to a PDF, a probability density function.

> To generate a KDE plot, we'll use the Seaborn library for data visualization, which I import as sns. Seaborn provides kdeplot, which takes the sample, estimates the PDF, and plots it. Here's what it looks like.

> The KDE plot matches the normal PDF pretty well, although the differences look bigger when we compare PDFs than they did with the CDFs. On one hand, that means that the PDF is a more sensitive way to look for differences, but often it is too sensitive. It's hard to tell whether apparent differences mean anything, or if they are just random, as in this case.

In [None]:
# Create the KDE plot.
sns.kdeplot(sample, label="Numpy + KDE")
plt.plot(xs, ys, label="Scipy")
plt.xlabel("Random value")
plt.ylabel("Density")
plt.legend()
plt.show()

#### Distribution of Income (Exercise)

> In many datasets, the distribution of income is approximately lognormal, which means that the logarithms of the incomes fit a normal distribution. We'll see whether that's true for the GSS data.

In [None]:
# Extract realinc and compute its log.
# Find the mean and standard deviation.
income = gss['realinc']
log_income = np.log10(income)
mean = log_income.mean()
std = log_income.std()
print(mean, std)
dist = scipy.stats.norm(mean, std)
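
To build intuition for what "lognormal" means, the sketch below generates made-up incomes whose base-10 logarithm is normal and then recovers the parameters. The mean of 4.4, the standard deviation of 0.4, and the seed are arbitrary assumptions, not estimates from the GSS data.

```python
import numpy as np

# Hypothetical incomes whose log10 is normal with mean 4.4 and std 0.4.
rng = np.random.default_rng(1)
toy_income = 10 ** rng.normal(loc=4.4, scale=0.4, size=10_000)

# Taking log10 recovers an approximately normal variable.
toy_log = np.log10(toy_income)
toy_mean = toy_log.mean()
toy_std = toy_log.std(ddof=1)  # ddof=1 matches the pandas default
print(toy_mean, toy_std)

# On the original scale the data are right-skewed: mean exceeds median.
skewed_right = np.mean(toy_income) > np.median(toy_income)
print(skewed_right)
```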

#### Comparing CDFs (Exercise)

> To see whether the distribution of income is well modeled by a lognormal distribution, we'll compare the CDF of the logarithm of the data to a normal distribution with the same mean and standard deviation.

> The lognormal model is a pretty good fit for the data, but clearly not a perfect match. That's what real data is like; sometimes it doesn't fit the model.

In [None]:
# Compare the model CDF to the observed CDF.
xs = np.linspace(2, 5.5, 71) # 50 xs by default
# print(xs)
ys = dist.cdf(xs)
plt.plot(xs, ys, label="Scipy")
empiricaldist.Cdf.from_seq(log_income).plot(label="GSS")
plt.xlabel('log10 of realinc')
plt.ylabel('CDF')
plt.legend()
plt.show()

#### Comparing PDFs (Exercise)

Compare a PDF (probability density function) and a KDE (kernel density estimate).

In [None]:
# Create and plot the model PDF.
xs = np.linspace(2, 5.5, 71)
ys = dist.pdf(xs)
plt.plot(xs, ys, label="Scipy")
# Plot the data KDE.
sns.kdeplot(log_income, label="GSS")
plt.xlabel("Log10(realinc)")
plt.ylabel("PDF")
plt.legend()
plt.show()

## Relationships

### Exploring Relationships

### Visualizing Relationships

### Correlation

### Simple Regression

## Multivariate Thinking

### Limits of Simple Regression

### Multiple Regression

### Visualizing Regression Results

### Logistic Regression

### Next Steps