----------
# Tech 9 - Outliers and Regression
_______

## Load Relevant Packages and Dependencies

-------

Set up the environment, load packages and dependencies.

In [None]:
# data analysis tools
import pandas as pd
import numpy as np

# new package for linear regression
import statsmodels.formula.api as smf

# disable an unneeded warning
pd.options.mode.chained_assignment = None  # default='warn'

# make numbers display better
pd.options.display.float_format = '{:,.3f}'.format

In [None]:
# Import Yellow Cab visualization data from 2020
df_2020 = pd.read_csv("cab2020_visualization.csv")

In [None]:
# check shape
df_2020.shape

## Deal with outliers

### Using reasonable judgement to identify outliers

Most of the very high-cost trips are probably errors or otherwise uninteresting, since they lasted only a few seconds and covered almost no distance. For example, it is not plausible that an $8,000 cab fare went about 0 miles and took about 0 seconds.

Let's correct this by eliminating rows with unusually high dollars per second (i.e., trips with a very short duration but a high fare).
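
To see why a dollars-per-second ratio flags these rows, here is a minimal sketch on a toy frame. The four rows are invented for illustration and are not taken from the cab data; the cutoff of 1 mirrors the one used in this notebook.

```python
import pandas as pd

# Toy rows (values invented for illustration, not from the cab data)
toy = pd.DataFrame({
    "fare": [12.50, 8000.00, 30.00, 5.00],
    "trip_seconds": [600, 0, 1800, 2],
})

# A positive fare divided by zero seconds becomes inf rather than raising an error
toy["dollar_per_sec"] = toy["fare"] / toy["trip_seconds"]

# inf and NaN both fail the "< 1" comparison, so those rows are filtered out
kept = toy[toy["dollar_per_sec"] < 1]
print(kept)
```

Only the two ordinary trips should survive: the $8,000 fare over 0 seconds and the $5 fare over 2 seconds are both removed.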

In [None]:
df_2020['dollar_per_sec'] = df_2020['fare'] / df_2020['trip_seconds']

# looking at the summary statistics of just this new variable
df_2020['dollar_per_sec'].describe()

In [None]:
# looking at more percentiles in the upper tail
df_2020['dollar_per_sec'].quantile([0.90, 0.95, 0.99, 0.999])

In [None]:
# drop trips that cost $1.00 per second or more
# (a positive fare over zero trip_seconds gives inf, which also fails this test)
df_2020 = df_2020[df_2020['dollar_per_sec']<1]
df_2020.shape

In [None]:
df_2020['dollar_per_sec'].describe()

## Linear Regression
----------
We want to investigate how `fare` and `tips` (the two Y variables) are related to trip duration (`trip_seconds`), distance (`trip_miles`), month (`StartMonth`), hour (`StartHour`), and day of the week (`dayofweek`).
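
As a rough sketch of what the formula interface does, the example below fits `smf.ols` on synthetic data whose true coefficients are known, so the estimates can be checked against them. The data frame `toy`, the coefficients 2.5, 0.01, and 1.8, and the noise level are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (invented): fare = 2.5 + 0.01 * seconds + 1.8 * miles + noise
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "trip_seconds": rng.uniform(60, 3600, n),
    "trip_miles": rng.uniform(0.2, 20, n),
})
toy["fare"] = (
    2.5
    + 0.01 * toy["trip_seconds"]
    + 1.8 * toy["trip_miles"]
    + rng.normal(0, 1, n)
)

# Left of "~" is the Y variable; right of it are the X variables.
# An intercept is included automatically.
toy_fit = smf.ols("fare ~ trip_seconds + trip_miles", data=toy).fit()
print(toy_fit.params)
```

The estimated coefficients should land close to the values used to generate the data. Each one reads as the expected change in `fare` for a one-unit increase in that variable, holding the others fixed.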

In [None]:
# day of the week as an integer (Monday=0, Sunday=6)
df_2020['dayofweek'] = pd.DatetimeIndex(df_2020['startdatetime']).dayofweek
df_2020.head()

In [None]:
# regress fare on trip duration, distance, hour, month, and day of the week
results_fare = smf.ols('fare ~ trip_seconds + trip_miles + StartHour + StartMonth + dayofweek', data=df_2020).fit()
results_fare.summary()

In [None]:
# same X variables, with tips as the Y variable
results_tips = smf.ols('tips ~ trip_seconds + trip_miles + StartHour + StartMonth + dayofweek', data=df_2020).fit()
results_tips.summary()
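
The formulas above enter `StartHour`, `StartMonth`, and `dayofweek` as plain numbers, so each gets a single slope (for example, one fixed change in fare per step from Monday toward Sunday). One alternative, shown here only as a hedged sketch on synthetic data, is to wrap such a column in `C(...)` so the formula interface treats it as categorical. The toy frame and the weekend surcharge of 3.00 are invented for illustration; whether this specification suits the cab data is a separate modeling decision.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (invented): weekend trips (days 5 and 6) carry a 3.00 surcharge
rng = np.random.default_rng(1)
n = 700
toy = pd.DataFrame({
    "trip_miles": rng.uniform(0.2, 20, n),
    "dayofweek": np.arange(n) % 7,  # Monday=0 ... Sunday=6
})
weekend = toy["dayofweek"].isin([5, 6])
toy["fare"] = 2.5 + 1.8 * toy["trip_miles"] + 3.0 * weekend + rng.normal(0, 1, n)

# C(...) treats the column as categorical: each day gets its own
# coefficient, measured relative to the baseline day (0, Monday)
toy_cat = smf.ols("fare ~ trip_miles + C(dayofweek)", data=toy).fit()
print(toy_cat.params)
```

With seven days, the fit reports six day coefficients plus the intercept and the `trip_miles` slope. In this toy setup, the two weekend coefficients should stand out from the rest.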