# Lesson 10 - Goodness of Fit
***
## Lesson Objectives
- Explain why certain probability distributions are reasonable models for real-world processes
- Fit probability distributions to data using estimated parameters
- Test the fit of data to probability distributions
- Create an empirical distribution from data

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy.stats as stats
from fitter import Fitter

## Lesson 10 Prep

**Question 1:** You collected data about a process and want to see if a particular probability distribution will be a good model for that data.  You grouped the data you collected into histogram-like bins and you have 20, 90, 250, and 150 in each of those four bins.  You want to test if those are a good fit for a probability distribution in which the bins should contain 5%, 15%, 45%, and 35% of the observations.  Perform a chi-square goodness-of-fit test.  Give the p-value below, rounded to 3 decimal places.
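A minimal sketch of the chi-square setup described in Question 1, assuming `scipy.stats.chisquare` is the intended tool. The variable names are illustrative. Expected counts are the hypothesized proportions scaled by the total number of observations, and the default degrees of freedom (number of bins minus one) assume no parameters were estimated from the data.

```python
import numpy as np
from scipy import stats

# Observed bin counts and hypothesized bin proportions from Question 1
observed = np.array([20, 90, 250, 150])
proportions = np.array([0.05, 0.15, 0.45, 0.35])

# Expected counts must sum to the same total as the observed counts
expected = proportions * observed.sum()

# Chi-square goodness-of-fit test (degrees of freedom = bins - 1 by default)
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square statistic = {chi2_stat:.3f}, p-value = {p_value:.3f}")
```

A small p-value suggests the observed counts are unlikely under the hypothesized proportions.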

**Question 2:** You collected data about a process and want to see if a particular probability distribution will be a good model for that data.  Perform a KS test to see if the data is a good fit for a gamma distribution with shape parameter 5 and rate parameter 1.  Give the p-value below, rounded to 3 decimal places.
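One possible way to set up the KS test in Question 2 is sketched below. SciPy parameterizes the gamma distribution by shape, location, and scale, so a rate parameter is converted with scale = 1 / rate. The `sample` array is a hypothetical stand-in generated for illustration; replace it with the data you actually collected.

```python
import numpy as np
from scipy import stats

shape, rate = 5, 1
scale = 1 / rate  # SciPy uses scale, which is the reciprocal of the rate

# Hypothetical stand-in for the collected data (assumption, not the real data)
rng = np.random.default_rng(seed=0)
sample = rng.gamma(shape, scale, size=200)

# args are passed to the gamma CDF in the order (shape, loc, scale)
ks_stat, p_value = stats.kstest(sample, "gamma", args=(shape, 0, scale))
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")
```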

## Read the Data

Read in the data using the pandas `read_csv` function and inspect the first five rows using `.head()`.

In [41]:
# read data

# inspect first 5 lines


Save the values in the `Bus Service Time` column to another variable.
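The reading and column-extraction steps above might look like the following sketch. The lesson's file name is not given here, so the example reads from an in-memory CSV with made-up values; only the column name `Bus Service Time` comes from the lesson. With a real file, pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up rows standing in for the lesson's data file (assumption)
csv_text = """Bus Service Time
4.2
5.1
3.8
6.0
4.7
5.5
"""

# read data (replace io.StringIO(csv_text) with the path to your CSV file)
df = pd.read_csv(io.StringIO(csv_text))

# inspect first 5 rows
print(df.head())

# save the column of interest as a NumPy array for later fitting
bus_times = df["Bus Service Time"].to_numpy()
```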

## Fitting Distributions
Create a `Fitter` object from the bus times, restricted to the distributions `['gamma', 'lognorm', 'beta', 'norm']`. Then fit the distributions to the data and show a summary of the results.

## Identify the Best Distribution
Find the best distribution using the `.get_best()` method with your `Fitter` object.
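To show the idea behind fitting several candidates and picking the best one, here is a rough sketch that does the comparison by hand with SciPy rather than calling `fitter`. Each candidate is fitted by maximum likelihood and ranked by the sum of squared errors between its PDF and a density histogram. The error measure, the bin count, and the synthetic `bus_times` array are assumptions chosen for illustration. With `fitter` itself, the corresponding steps are creating the `Fitter` object, calling `.fit()`, `.summary()`, and then `.get_best()`.

```python
import warnings
import numpy as np
from scipy import stats

# Hypothetical stand-in for the bus service times (assumption)
rng = np.random.default_rng(seed=1)
bus_times = rng.gamma(4.0, 1.5, size=500)

candidates = ["gamma", "lognorm", "beta", "norm"]

# Density histogram of the data, evaluated at the bin centers
density, edges = np.histogram(bus_times, bins=100, density=True)
centers = (edges[:-1] + edges[1:]) / 2

results = {}
for name in candidates:
    dist = getattr(stats, name)
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # optimizer warnings are common
            params = dist.fit(bus_times)     # maximum likelihood estimates
        sse = float(np.sum((dist.pdf(centers, *params) - density) ** 2))
    except Exception:
        continue  # skip a candidate if fitting fails
    if np.isfinite(sse):
        results[name] = (params, sse)

# Rank candidates from smallest to largest error
for name, (params, sse) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name:8s} SSE = {sse:.5f}")

best_name = min(results, key=lambda n: results[n][1])
print("best:", best_name, results[best_name][0])
```

A lower error only means the fitted curve tracks the histogram more closely; a formal goodness-of-fit test is still needed to judge whether the fit is adequate.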

## KS Goodness of Fit Test
Calculate the KS goodness-of-fit test statistic using the [Scipy KS Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest). Pass your data as `rvs`, the name of the distribution you chose to represent the data as `cdf`, and the fitted parameters of that distribution as `args`.

**HINT:** A quick way to get the parameters of your chosen distribution is the `Fitter` object's `fitted_param` attribute, a dictionary keyed by distribution name (e.g. `fitted_param['norm']`).
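A minimal sketch of the KS test with fitted parameters follows. It assumes a gamma distribution was chosen and uses `scipy.stats.gamma.fit` on synthetic data in place of the `Fitter` object, so the example runs on its own. A parameter tuple taken from `fitted_param` can be passed as `args` in the same way.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the bus service times (assumption)
rng = np.random.default_rng(seed=2)
bus_times = rng.gamma(4.0, 1.5, size=500)

# Fitted parameters in SciPy order: (shape, loc, scale)
params = stats.gamma.fit(bus_times)

# Compare the data with the fitted gamma CDF
result = stats.kstest(bus_times, "gamma", args=params)
print(f"KS statistic = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```

Because the parameters were estimated from the same data being tested, the reported p-value tends to be optimistic and is best read as a rough guide.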

## AD Goodness of Fit Test
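Calculate the Anderson-Darling goodness-of-fit test statistic using the [Scipy AD Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html). Pass your data as `x` and the name of the distribution you chose to represent the data as `dist`. Note that `scipy.stats.anderson` accepts only a limited set of distribution names (such as `'norm'` and `'expon'`), so check the linked documentation if your chosen distribution is not accepted.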
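The sketch below shows one way to read the Anderson-Darling output. Since `scipy.stats.anderson` does not accept `'lognorm'` or `'gamma'`, it assumes a two-parameter lognormal model (location fixed at zero) and tests the logarithm of the data against `'norm'`, which is equivalent under that assumption. The data are synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for the bus service times (assumption)
rng = np.random.default_rng(seed=3)
bus_times = rng.lognormal(mean=1.5, sigma=0.4, size=500)

# If the data are lognormal (loc = 0), their logarithms are normal
result = stats.anderson(np.log(bus_times), dist="norm")
print(f"AD statistic = {result.statistic:.4f}")

# Compare the statistic with the critical value at each significance level (%)
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = "reject" if result.statistic > crit else "fail to reject"
    print(f"{sig:>5}% level: critical value = {crit:.3f} -> {verdict}")
```

The fit is rejected at a given significance level when the statistic exceeds the corresponding critical value.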