# **Automatidata project**
**Course 5 - Regression Analysis: Simplify complex data relationships**

The data consulting firm Automatidata has recently hired you as the newest member of their data analytics team. Their newest client, the NYC Taxi and Limousine Commission (New York City TLC), wants the Automatidata team to build a multiple linear regression model to predict taxi fares using existing data that was collected over the course of a year. The team is getting closer to completing the project, having completed an initial plan of action, initial Python coding work, EDA, and A/B testing.

The Automatidata team has reviewed the results of the A/B testing. Now it’s time to work on predicting the taxi fare amounts. You’ve impressed your Automatidata colleagues with your hard work and attention to detail. The data team believes that you are ready to build the regression model and update the client New York City TLC about your progress.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# Course 5 End-of-course project: Build a multiple linear regression model

In this activity, you will build a multiple linear regression model. As you've learned, multiple linear regression helps you estimate the linear relationship between one continuous dependent variable and two or more independent variables. For data science professionals, this is a useful skill because it lets you account for several predictors at once when modeling an outcome, which makes for more thorough and flexible analysis.

Completing this activity will help you practice planning and building a multiple linear regression model based on a specific business need. The structure of this activity is designed to emulate the proposals you will likely be assigned in your career as a data professional. Completing this activity will help prepare you for those career moments.
<br/>

**The purpose** of this project is to demonstrate knowledge of EDA and multiple linear regression.

**The goal** is to build a multiple linear regression model and evaluate it.
<br/>
*This activity has three parts:*

**Part 1:** EDA & Checking Model Assumptions
* What are some purposes of EDA before constructing a multiple linear regression model?

**Part 2:** Model Building and evaluation
* What resources do you find yourself using as you complete this stage?

**Part 3:** Interpreting Model Results

* What key insights emerged from your model(s)?

* What business recommendations do you propose based on the models built?

# Build a multiple linear regression model

<img src="https://drive.google.com/uc?id=1j4eZRrDDC_ayowY7oj2ymsRMphdE4Tuf" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="https://drive.google.com/uc?id=1xQC3f1RCcZxyVUbZ71T-e4HyRIJFF94C" width="100" height="100" align=left>


## PACE: **Plan**

Consider the questions in your PACE Strategy Document to reflect on the Plan stage.


### Task 1. Imports and loading
Import the packages that you've learned are needed for building linear regression models.

In [None]:
# Imports
# Packages for numerics + dataframes
### YOUR CODE HERE ###
import numpy as np
import pandas as pd

# Packages for visualization
### YOUR CODE HERE ###
import matplotlib.pyplot as plt
import seaborn as sns

# Packages for date conversions for calculating trip durations
### YOUR CODE HERE ###
#import datetime as dt
# Exemplar
from datetime import datetime
from datetime import date
from datetime import timedelta

# Packages for OLS, MLR, confusion matrix
### YOUR CODE HERE ###
from sklearn.preprocessing import StandardScaler
#import statsmodels.api as sm
#from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

**Note:** `Pandas` is used to load the NYC TLC dataset. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load dataset into dataframe
file_path = '/content/drive/My Drive/Advanced Data Analytics Certificate/Activity Datasets/2017_Yellow_Taxi_Trip_Data.csv'
df0=pd.read_csv(file_path, index_col = 0) # Extra: added index_col

<img src="https://drive.google.com/uc?id=1kpRJdR0z6z3foENI0hyMdZ9duLvXZ8Ca" width="100" height="100" align=left>

## PACE: **Analyze**

In this stage, consider the following question where applicable to complete your code response:

* What are some purposes of EDA before constructing a multiple linear regression model?


* Understanding which variables are present in the data
* Reviewing the distribution of features, such as minimum, mean, and maximum values
* Plotting the relationship between the independent and dependent variables to visualize which features have a linear relationship
* Identifying issues with the data, such as incorrect values (e.g., typos) or missing values

**Exemplar response:**

1.   Outliers and extreme data values can significantly impact linear regression equations. After visualizing data, make a plan for addressing outliers by dropping rows, substituting extreme data with average data, and/or removing data values greater than 3 standard deviations.

2.   EDA activities also include identifying missing data to help the analyst make decisions on their exclusion or inclusion by substituting values with data set means, medians, and other similar methods.

3.   It's important to check for things like multicollinearity between predictor variables, as well as to understand their distributions, as this will help you decide what statistical inferences can be made from the model and which ones cannot.

4.  Additionally, it can be useful to engineer new features by multiplying variables together or taking the difference between one variable and another. For example, in this dataset you can create a `duration` variable by subtracting `tpep_pickup_datetime` from `tpep_dropoff_datetime`.
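As a rough illustration of the multicollinearity check mentioned in point 3, the sketch below computes pairwise correlations between candidate predictors. The `toy` DataFrame and its values are hypothetical (randomly generated, not drawn from the TLC data); it only shows the kind of pattern to look for.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the TLC dataset).
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 10.0, size=200)
toy = pd.DataFrame({
    'distance': distance,
    # Duration is built to track distance closely, so the two are collinear.
    'duration': 3.0 * distance + rng.normal(0.0, 1.0, size=200),
    # Passenger count is drawn independently of the other columns.
    'passengers': rng.integers(1, 5, size=200),
})

# Pairwise Pearson correlations between candidate predictors.
corr = toy.corr(method='pearson')
print(corr.round(2))
```

A correlation close to 1 between two predictors (here `distance` and `duration`) suggests that their individual coefficients may be hard to interpret, even if predictions remain usable.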

### Task 2a. Explore data with EDA

Analyze and discover data, looking for correlations, missing data, outliers, and duplicates.

Start with `.shape` and `.info()`.

In [None]:
# Start with `.shape` and `.info()`
### YOUR CODE HERE ###
df = df0.copy()
display(df.shape)
display(df.info())

(22699, 17)

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float64
 1

None

Check for missing data and duplicates using `.isna()` and `.drop_duplicates()`.

In [None]:
# Check for missing data and duplicates using .isna() and .drop_duplicates()
### YOUR CODE HERE ###
#display(df0.isna().sum())
#display(df0.duplicated().sum())
#df0 = df0.drop_duplicates().reset_index(drop=True)
#display(df0.duplicated().sum())
#print(df0.shape)
display(df.shape)
display(df.drop_duplicates().shape)
display(df.isna().sum().sum())
display(df.isna().sum())

(22699, 17)

(22699, 17)

0

Unnamed: 0,0
VendorID,0
tpep_pickup_datetime,0
tpep_dropoff_datetime,0
passenger_count,0
trip_distance,0
RatecodeID,0
store_and_fwd_flag,0
PULocationID,0
DOLocationID,0
payment_type,0


**Exemplar note:** There are no duplicates or missing values in the data.

Use `.describe()`.

In [None]:
# Use .describe()
### YOUR CODE HERE ###
#df0.describe(include='all')
df.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
mean,1.556236,1.642319,2.913313,1.043394,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,1.285231,3.653171,0.708391,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,1.0,0.99,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,1.0,1.61,1.0,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,2.0,3.06,1.0,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,2.0,6.0,33.96,99.0,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29


**Exemplar note:** Some things stand out from this table of summary statistics. For instance, there are clearly some outliers in several variables, like `tip_amount` (\$200) and `total_amount` (\$1,200). Also, a number of the variables, such as `mta_tax`, seem to be almost constant throughout the data, which would imply that they would not be expected to be very predictive.
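One way to quantify "almost constant" is the share of rows taken by a column's most frequent value. The following is a minimal sketch on a hypothetical toy frame (the values are invented for illustration); the idea of using this share as a screening heuristic is an assumption, not part of the exemplar.

```python
import pandas as pd

# Toy frame (hypothetical values) with one almost-constant column.
toy = pd.DataFrame({
    'mta_tax': [0.5] * 98 + [0.0, -0.5],
    'fare_amount': [5.0 + 0.5 * i for i in range(100)],
})

# Share of rows taken by the single most frequent value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(top_share)
```

A column whose top share is close to 1 carries very little variation for a regression to use.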

### Task 2b. Convert pickup & dropoff columns to datetime


In [None]:
# Check the format of the data
### YOUR CODE HERE ###
#df0.head()
df['tpep_dropoff_datetime'].iloc[0]

'03/25/2017 9:09:47 AM'

In [None]:
# Convert datetime columns to datetime
### YOUR CODE HERE ###
#df0['tpep_pickup_datetime'] = pd.to_datetime(df0['tpep_pickup_datetime'])
#df0['tpep_dropoff_datetime'] = pd.to_datetime(df0['tpep_dropoff_datetime'])
#df0.head()
# Display data types of `tpep_pickup_datetime`, `tpep_dropoff_datetime`
print('Data type of tpep_pickup_datetime:', df['tpep_pickup_datetime'].dtype)
print('Data type of tpep_dropoff_datetime:', df['tpep_dropoff_datetime'].dtype)

# Convert `tpep_pickup_datetime` to datetime format
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'], format='%m/%d/%Y %I:%M:%S %p')

# Convert `tpep_dropoff_datetime` to datetime format
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'], format='%m/%d/%Y %I:%M:%S %p')

# Display data types of `tpep_pickup_datetime`, `tpep_dropoff_datetime`
print('Data type of tpep_pickup_datetime:', df['tpep_pickup_datetime'].dtype)
print('Data type of tpep_dropoff_datetime:', df['tpep_dropoff_datetime'].dtype)

df.head(3)

Data type of tpep_pickup_datetime: object
Data type of tpep_dropoff_datetime: object
Data type of tpep_pickup_datetime: datetime64[ns]
Data type of tpep_dropoff_datetime: datetime64[ns]


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,2017-03-25 08:55:43,2017-03-25 09:09:47,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,2017-04-11 14:53:28,2017-04-11 15:19:58,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,2017-12-15 07:26:56,2017-12-15 07:34:08,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75


### Task 2c. Create duration column

Create a new column called `duration` that represents the total number of minutes that each taxi ride took.

In [None]:
# Create `duration` column
### YOUR CODE HERE ###
#df0['duration'] = (df0['tpep_dropoff_datetime'] - df0['tpep_pickup_datetime']).dt.total_seconds() / 60
#df0.head()
df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime'])/np.timedelta64(1,'m')
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration
24870114,2,2017-03-25 08:55:43,2017-03-25 09:09:47,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56,14.066667
35634249,1,2017-04-11 14:53:28,2017-04-11 15:19:58,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8,26.5
106203690,1,2017-12-15 07:26:56,2017-12-15 07:34:08,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75,7.2
38942136,2,2017-05-07 13:17:59,2017-05-07 13:48:14,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69,30.25
30841670,2,2017-04-15 23:32:20,2017-04-15 23:49:03,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8,16.716667


### Outliers

Call `df.info()` to inspect the columns and decide which ones to check for outliers.

In [None]:
### YOUR CODE HERE ###
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               22699 non-null  int64         
 1   tpep_pickup_datetime   22699 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  22699 non-null  datetime64[ns]
 3   passenger_count        22699 non-null  int64         
 4   trip_distance          22699 non-null  float64       
 5   RatecodeID             22699 non-null  int64         
 6   store_and_fwd_flag     22699 non-null  object        
 7   PULocationID           22699 non-null  int64         
 8   DOLocationID           22699 non-null  int64         
 9   payment_type           22699 non-null  int64         
 10  fare_amount            22699 non-null  float64       
 11  extra                  22699 non-null  float64       
 12  mta_tax                22699 non-null  float64       
 

Keeping in mind that many of the features will not be used to fit your model, the most important columns to check for outliers are likely to be:
* `trip_distance`
* `fare_amount`
* `duration`



### Task 2d. Box plots

Plot a box plot for each feature: `trip_distance`, `fare_amount`, `duration`.

In [None]:
### YOUR CODE HERE ###
# Box plot of trip_distance
#plt.figure(figsize=(7,2))
#sns.boxplot(x=df0['trip_distance'])
#plt.title('Box plot of trip distance')
#plt.show()
# Box plot of fare_amount
#plt.figure(figsize=(7,2))
#sns.boxplot(x=df0['fare_amount'])
#plt.title('Box plot of fare amount')
#plt.show()
# Box plot of duration
#plt.figure(figsize=(7,2))
#sns.boxplot(x=df0['duration'])
#plt.title('Box plot of duration')
#plt.show()
fig, axes = plt.subplots(1, 3, figsize=(15, 2))
fig.suptitle('Boxplots for outlier detection')
sns.boxplot(ax=axes[0], x=df['trip_distance'])
sns.boxplot(ax=axes[1], x=df['fare_amount'])
sns.boxplot(ax=axes[2], x=df['duration'])
plt.show();

**Questions:**
1. Which variable(s) contains outliers?

2. Are the values in the `trip_distance` column unbelievable?

3. What about the lower end? Do distances, fares, and durations of 0 (or negative values) make sense?

1. Which variable(s) contains outliers?
* The box plots above suggest that both `fare_amount` and `duration` have outliers, since both appear to contain negative values.
* `fare_amount` has a value around \$1000 for a single taxi trip.
* `duration` has at least one value greater than 1,400 minutes, or around 23 hours, for a single taxi trip.

2. Are the values in the `trip_distance` column unbelievable?
* The highest `trip_distance` is 33.96 miles for a single taxi trip.
* Even though it is relatively long for a taxi trip it could still be possible.

3. What about the lower end? Do distances, fares, and durations of 0 (or negative values) make sense?
* Values of 0 or lower for `trip_distance`, `fare_amount`, and `duration` appear unusual for taxi trips.

**Exemplar response:**
1. All three variables contain outliers. Some are extreme, but others not so much.

2. It's 30 miles from the southern tip of Staten Island to the northern end of Manhattan and that's in a straight line. With this knowledge and the distribution of the values in this column, it's reasonable to leave these values alone and not alter them. However, the values for `fare_amount` and `duration` definitely seem to have problematic outliers on the higher end.

3. Probably not for the latter two, but for `trip_distance` it might be okay.

### Task 2e. Imputations

#### `trip_distance` outliers

You know from the summary statistics that there are trip distances of 0. Are these reflective of erroneous data, or are they very short trips that get rounded down?

To check, sort the column values, eliminate duplicates, and inspect the 10 smallest values. Are they rounded values or precise values?

In [None]:
# Are trip distances of 0 bad data or very short trips rounded down?
### YOUR CODE HERE ###
# Sort the values of trip_distance and eliminate duplicates
#sorted_trip_distance = sorted(set(df0['trip_distance']))
# Inspect the least 10 values
#print(sorted_trip_distance[:10])
sorted(set(df['trip_distance']))[:10]

[0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09]

The distances are captured with a high degree of precision. However, it might be possible for trips to have distances of zero if a passenger summoned a taxi and then changed their mind. Besides, are there enough zero values in the data to pose a problem?

Calculate the count of rides where the `trip_distance` is zero.

In [None]:
### YOUR CODE HERE ###
# Calculate the count of rides where the trip_distance is zero
#print(df0[df0['trip_distance']==0].shape[0])
sum(df['trip_distance']==0)

148

**Exemplar note:** 148 out of ~23,000 rides is relatively insignificant. You could impute it with a value of 0.01, but it's unlikely to have much of an effect on the model. Therefore, the `trip_distance` column will remain untouched with regard to outliers.

#### `fare_amount` outliers

In [None]:
### YOUR CODE HERE ###
#sorted_fare_amount = sorted(set(df0['fare_amount']))
#print(sorted_fare_amount[:10])
df['fare_amount'].describe()

Unnamed: 0,fare_amount
count,22699.0
mean,13.026629
std,13.243791
min,-120.0
25%,6.5
50%,9.5
75%,14.5
max,999.99


**Question:** What do you notice about the values in the `fare_amount` column?
* There are negative values.

**Exemplar response:**

The range of values in the `fare_amount` column is large and the extremes don't make much sense.

* **Low values:** Negative values are problematic. Values of zero could be legitimate if the taxi logged a trip that was immediately canceled.

* **High values:** The maximum fare amount in this dataset is nearly \$1,000, which seems very unlikely. High values for this feature can be capped based on intuition and statistics. The interquartile range (IQR) is \$8. The standard formula of `Q3 + (1.5 * IQR)` yields \$26.50. That doesn't seem appropriate for the maximum fare cap. In this case, we'll use a factor of `6`, which results in a cap of \$62.50.
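The arithmetic behind those two thresholds can be checked directly from the quartiles shown in the `.describe()` output above. The variable names here are illustrative only.

```python
# Quartiles reported by `.describe()` for `fare_amount`.
q1, q3 = 6.5, 14.5
iqr = q3 - q1                      # interquartile range

standard_fence = q3 + 1.5 * iqr    # conventional box-plot upper fence
wider_fence = q3 + 6 * iqr         # more permissive cap used in this notebook
print(iqr, standard_fence, wider_fence)
```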

Impute values less than $0 with `0`.

In [None]:
# Impute values less than $0 with 0
### YOUR CODE HERE ###
#df0['fare_amount'] = np.where(df0['fare_amount'] < 0, 0, df0['fare_amount'])
df.loc[df['fare_amount'] < 0, 'fare_amount'] = 0
df['fare_amount'].min()

0.0

Now impute the maximum value as `Q3 + (6 * IQR)`.

In [None]:
### YOUR CODE HERE ###
def impute_upper_limit(column_list, iqr_factor):
    '''
    Impute upper-limit values in specified columns based on their interquartile range.

    Arguments:
        column_list: A list of columns to iterate over
        iqr_factor: A number representing x in the formula:
                    Q3 + (x * IQR). Used to determine maximum threshold,
                    beyond which a point is considered an outlier.

    The IQR is computed for each column in column_list and values exceeding
    the upper threshold for each column are imputed with the upper threshold value.
    '''
    ### YOUR CODE HERE ###
    for column in column_list:
        # Reassign minimum to zero
        ### YOUR CODE HERE ###
        # Operate on the working copy `df` (the raw `df0` has no `duration` column)
        df[column] = np.where(df[column] < 0, 0, df[column])
        # Calculate upper threshold
        ### YOUR CODE HERE ###
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - df[column].quantile(0.25)
        upper_threshold = Q3 + (iqr_factor * IQR)
        # Reassign values > threshold to threshold
        ### YOUR CODE HERE ###
        df[column] = np.where(df[column] > upper_threshold, upper_threshold, df[column])

#impute_upper_limit(['fare_amount'], 6)

In [None]:
# Exemplar version
def outlier_imputer(column_list, iqr_factor):
    '''
    Impute upper-limit values in specified columns based on their interquartile range.

    Arguments:
        column_list: A list of columns to iterate over
        iqr_factor: A number representing x in the formula:
                    Q3 + (x * IQR). Used to determine maximum threshold,
                    beyond which a point is considered an outlier.

    The IQR is computed for each column in column_list and values exceeding
    the upper threshold for each column are imputed with the upper threshold value.
    '''
    for col in column_list:
        # Reassign minimum to zero
        df.loc[df[col] < 0, col] = 0

        # Calculate upper threshold
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        upper_threshold = q3 + (iqr_factor * iqr)
        print(col)
        print('q3:', q3)
        print('upper_threshold:', upper_threshold)

        # Reassign values > threshold to threshold
        df.loc[df[col] > upper_threshold, col] = upper_threshold
        print(df[col].describe())
        print()

In [None]:
outlier_imputer(['fare_amount'], 6)

fare_amount
q3: 14.5
upper_threshold: 62.5
count    22699.000000
mean        12.897913
std         10.541137
min          0.000000
25%          6.500000
50%          9.500000
75%         14.500000
max         62.500000
Name: fare_amount, dtype: float64



#### `duration` outliers


In [None]:
# Call .describe() for duration outliers
### YOUR CODE HERE ###
df['duration'].describe()

Unnamed: 0,duration
count,22699.0
mean,17.013777
std,61.996482
min,-16.983333
25%,6.65
50%,11.183333
75%,18.383333
max,1439.55


The `duration` column has problematic values at both the lower and upper extremities.

* **Low values:** There should be no values that represent negative time. Impute all negative durations with `0`.

* **High values:** Impute high values the same way you imputed the high-end outliers for fares: `Q3 + (6 * IQR)`.
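For comparison, the same two-sided rule can be written as a small function that returns a new series instead of modifying a DataFrame in place. This is a sketch using `Series.clip`, not the notebook's `outlier_imputer`; `cap_outliers` and the toy values are hypothetical.

```python
import pandas as pd

def cap_outliers(series: pd.Series, iqr_factor: float) -> pd.Series:
    """Return a copy with values clipped to [0, Q3 + iqr_factor * IQR]."""
    # Floor negatives at zero first, so the quartiles are computed on
    # the floored values (the same order the notebook's function uses).
    floored = series.clip(lower=0)
    q1 = floored.quantile(0.25)
    q3 = floored.quantile(0.75)
    upper = q3 + iqr_factor * (q3 - q1)
    return floored.clip(upper=upper)

# Toy durations in minutes (hypothetical values).
toy_duration = pd.Series([-5.0, 4.0, 8.0, 12.0, 16.0, 1400.0])
capped = cap_outliers(toy_duration, iqr_factor=6)
print(capped.tolist())
```

Because the function does not mutate its input, the original values remain available for comparison after capping.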

In [None]:
# Impute a 0 for any negative values
### YOUR CODE HERE ###
#df0['duration'] = np.where(df0['duration'] < 0, 0, df0['duration'])
df.loc[df['duration'] < 0, 'duration'] = 0
df['duration'].min()

0.0

In [None]:
# Impute the high outliers
### YOUR CODE HERE ###
#impute_upper_limit(['duration'], 6)
outlier_imputer(['duration'], 6)

duration
q3: 18.383333333333333
upper_threshold: 88.78333333333333
count    22699.000000
mean        14.460555
std         11.947043
min          0.000000
25%          6.650000
50%         11.183333
75%         18.383333
max         88.783333
Name: duration, dtype: float64



### Task 3a. Feature engineering

#### Create `mean_distance` column

When deployed, the model will not know the duration of a trip until after the trip occurs, so you cannot train a model that uses this feature. However, you can use the statistics of trips you *do* know to generalize about ones you do not know.

In this step, create a column called `mean_distance` that captures the mean distance for each group of trips that share pickup and dropoff points.

For example, if your data were:

|Trip|Start|End|Distance|
|--: |:---:|:-:| :-- |
| 1  | A   | B | 1  |
| 2  | C   | D | 2  |
| 3  | A   | B |1.5 |
| 4  | D   | C | 3  |

The results should be:
```
A -> B: 1.25 miles
C -> D: 2 miles
D -> C: 3 miles
```

Notice that C -> D is not the same as D -> C. All trips that share a unique pair of start and end points get grouped and averaged.

Then, a new column `mean_distance` will be added where the value at each row is the average for all trips with those pickup and dropoff locations:

|Trip|Start|End|Distance|mean_distance|
|--: |:---:|:-:|  :--   |:--   |
| 1  | A   | B | 1      | 1.25 |
| 2  | C   | D | 2      | 2    |
| 3  | A   | B |1.5     | 1.25 |
| 4  | D   | C | 3      | 3    |
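The toy table above can be reproduced in a few lines. Note that this sketch uses `groupby(...).transform('mean')`, which is a different route from the dictionary-and-`map()` approach this notebook walks through next; it is shown only to make the target result concrete.

```python
import pandas as pd

trips = pd.DataFrame({
    'Trip': [1, 2, 3, 4],
    'Start': ['A', 'C', 'A', 'D'],
    'End': ['B', 'D', 'B', 'C'],
    'Distance': [1.0, 2.0, 1.5, 3.0],
})

# Average distance per (Start, End) pair, broadcast back to every row.
trips['mean_distance'] = (
    trips.groupby(['Start', 'End'])['Distance'].transform('mean')
)
print(trips)
```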


Begin by creating a helper column called `pickup_dropoff`, which contains the unique combination of pickup and dropoff location IDs for each row.

One way to do this is to convert the pickup and dropoff location IDs to strings and join them, separated by a space. The space is to ensure that, for example, a trip with pickup/dropoff points of 12 & 151 gets encoded differently than a trip with points 121 & 51.

So, the new column would look like this:

|Trip|Start|End|pickup_dropoff|
|--: |:---:|:-:|  :--         |
| 1  | A   | B | 'A B'        |
| 2  | C   | D | 'C D'        |
| 3  | A   | B | 'A B'        |
| 4  | D   | C | 'D C'        |
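To see why the separator matters, the sketch below builds keys with and without a space for the two routes mentioned above. `make_key` is a hypothetical helper written for this illustration.

```python
def make_key(pickup: int, dropoff: int, sep: str = ' ') -> str:
    """Join two location IDs into a single string key (hypothetical helper)."""
    return f'{pickup}{sep}{dropoff}'

# Without a separator, two different routes collapse into the same key.
collide = make_key(12, 151, sep='') == make_key(121, 51, sep='')
# With a space, the keys stay distinct.
distinct = make_key(12, 151) != make_key(121, 51)
print(collide, distinct)
```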


In [None]:
# Create `pickup_dropoff` column
### YOUR CODE HERE ###
#df0['pickup_dropoff'] = df0['PULocationID'].astype(str) + ' ' + df0['DOLocationID'].astype(str)
df['pickup_dropoff'] = df['PULocationID'].astype(str) + ' ' + df['DOLocationID'].astype(str)
df['pickup_dropoff'].head(2)

Unnamed: 0,pickup_dropoff
24870114,100 231
35634249,186 43


Now, use a `groupby()` statement to group each row by the new `pickup_dropoff` column, compute the mean, and capture the values only in the `trip_distance` column. Assign the results to a variable named `grouped`.

In [None]:
### YOUR CODE HERE ###
#grouped = df0.groupby('pickup_dropoff')['trip_distance'].mean()
grouped = df.groupby('pickup_dropoff').mean(numeric_only=True)[['trip_distance']]
grouped[:5]

Unnamed: 0_level_0,trip_distance
pickup_dropoff,Unnamed: 1_level_1
1 1,2.433333
10 148,15.7
100 1,16.89
100 100,0.253333
100 107,1.18


`grouped` is an object of the `DataFrame` class.

1. Convert it to a dictionary using the [`to_dict()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html) method. Assign the results to a variable called `grouped_dict`. This will result in a dictionary with a key of `trip_distance` whose values are another dictionary. The inner dictionary's keys are pickup/dropoff points and its values are mean distances. This is the information you want.

```
Example:
grouped_dict = {'trip_distance': {'A B': 1.25, 'C D': 2, 'D C': 3}}
```

2. Reassign the `grouped_dict` dictionary so it contains only the inner dictionary. In other words, get rid of `trip_distance` as a key, so:

```
Example:
grouped_dict = {'A B': 1.25, 'C D': 2, 'D C': 3}
 ```
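The two steps can be tried on a small stand-in for `grouped`. The frame below is hypothetical and mirrors the toy example; the last lines show that calling `to_dict()` on the selected column appears to give the inner dictionary directly.

```python
import pandas as pd

# Stand-in for `grouped` (hypothetical values from the toy example).
grouped = pd.DataFrame(
    {'trip_distance': [1.25, 2.0, 3.0]},
    index=pd.Index(['A B', 'C D', 'D C'], name='pickup_dropoff'),
)

nested = grouped.to_dict()          # outer key is the column name
inner = nested['trip_distance']     # keys are pickup/dropoff pairs

# Selecting the column first yields the inner dictionary in one step.
direct = grouped['trip_distance'].to_dict()
print(inner == direct)
```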

In [None]:
# 1. Convert `grouped` to a dictionary
### YOUR CODE HERE ###
grouped_dict = grouped.to_dict()

# 2. Reassign to only contain the inner dictionary
### YOUR CODE HERE ###
grouped_dict = grouped_dict['trip_distance']

1. Create a `mean_distance` column that is a copy of the `pickup_dropoff` helper column.

2. Use the [`map()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html#pandas-series-map) method on the `mean_distance` series. Pass `grouped_dict` as its argument. Reassign the result back to the `mean_distance` series.
<br/><br/>
When you pass a dictionary to the `Series.map()` method, it will replace the data in the series where that data matches the dictionary's keys. The values that get imputed are the values of the dictionary.

```
Example:
df['mean_distance']
```

|mean_distance |
|  :-:         |
| 'A B'        |
| 'C D'        |
| 'A B'        |
| 'D C'        |
| 'E F'        |

```
grouped_dict = {'A B': 1.25, 'C D': 2, 'D C': 3}
df['mean_distance'] = df['mean_distance'].map(grouped_dict)
df['mean_distance']
```

|mean_distance |
|  :-:         |
| 1.25         |
| 2            |
| 1.25         |
| 3            |
| NaN          |

When used this way, the `map()` `Series` method is very similar to `replace()`. Note, however, that `map()` will impute `NaN` for any values in the series that do not have a corresponding key in the mapping dictionary, so be careful.
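The difference is easy to see side by side on the toy keys. This is a minimal sketch; the series is created with `dtype=object` as an assumption to keep the behavior of `replace()` consistent across pandas versions.

```python
import pandas as pd

keys = pd.Series(['A B', 'C D', 'A B', 'D C', 'E F'], dtype=object)
grouped_dict = {'A B': 1.25, 'C D': 2, 'D C': 3}

mapped = keys.map(grouped_dict)        # unmatched 'E F' becomes NaN
replaced = keys.replace(grouped_dict)  # unmatched 'E F' is left as-is

print(mapped.isna().sum())
print(replaced.iloc[-1])
```

A quick `isna().sum()` after mapping is a cheap way to confirm that every key was found.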

In [None]:
# 1. Create a mean_distance column that is a copy of the pickup_dropoff helper column
### YOUR CODE HERE ###
df['mean_distance'] = df['pickup_dropoff']

# 2. Map `grouped_dict` to the `mean_distance` column
### YOUR CODE HERE ###
df['mean_distance'] = df['mean_distance'].map(grouped_dict)

# Confirm that it worked
### YOUR CODE HERE ###
#df0.head()
df[(df['PULocationID']==100) & (df['DOLocationID']==100)][['mean_distance']]

Unnamed: 0,mean_distance
22510227,0.253333
69466211,0.253333
39498898,0.253333
25060264,0.253333
26641850,0.253333
28297285,0.253333


#### Create `mean_duration` column

Repeat the process used to create the `mean_distance` column to create a `mean_duration` column.

In [None]:
### YOUR CODE HERE ###
#grouped = df0.groupby('pickup_dropoff')['duration'].mean()
grouped = df.groupby('pickup_dropoff').mean(numeric_only=True)[['duration']]
display(grouped.head())

# Create a dictionary where keys are unique pickup_dropoffs and values are
# mean trip duration for all trips with those pickup_dropoff combos
### YOUR CODE HERE ###
grouped_dict = grouped.to_dict()
grouped_dict = grouped_dict['duration']

df['mean_duration'] = df['pickup_dropoff']
df['mean_duration'] = df['mean_duration'].map(grouped_dict)

# Confirm that it worked
### YOUR CODE HERE ###
#df0.head()
display(df[(df['PULocationID']==1) & (df['DOLocationID']==1)][['mean_duration']])

Unnamed: 0_level_0,duration
pickup_dropoff,Unnamed: 1_level_1
1 1,0.466667
10 148,69.366667
100 1,48.183333
100 100,3.130556
100 107,11.2


Unnamed: 0,mean_duration
111653084,0.466667
93959863,0.466667
3055315,0.466667


#### Create `day` and `month` columns

Create two new columns, `day` (name of day) and `month` (name of month) by extracting the relevant information from the `tpep_pickup_datetime` column.

In [None]:
# Create 'day' col
### YOUR CODE HERE ###
#df0['day'] = df0['tpep_pickup_datetime'].dt.day_name()
df['day'] = df['tpep_pickup_datetime'].dt.day_name().str.lower()

# Create 'month' col
### YOUR CODE HERE ###
#df0['month'] = df0['tpep_pickup_datetime'].dt.month_name()
df['month'] = df['tpep_pickup_datetime'].dt.strftime('%b').str.lower()

df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,pickup_dropoff,mean_distance,mean_duration,day,month
24870114,2,2017-03-25 08:55:43,2017-03-25 09:09:47,6,3.34,1,N,100,231,1,...,2.76,0.0,0.3,16.56,14.066667,100 231,3.521667,22.847222,saturday,mar
35634249,1,2017-04-11 14:53:28,2017-04-11 15:19:58,1,1.8,1,N,186,43,1,...,4.0,0.0,0.3,20.8,26.5,186 43,3.108889,24.47037,tuesday,apr
106203690,1,2017-12-15 07:26:56,2017-12-15 07:34:08,1,1.0,1,N,262,236,1,...,1.45,0.0,0.3,8.75,7.2,262 236,0.881429,7.25,friday,dec
38942136,2,2017-05-07 13:17:59,2017-05-07 13:48:14,1,3.7,1,N,188,97,1,...,6.39,0.0,0.3,27.69,30.25,188 97,3.7,30.25,sunday,may
30841670,2,2017-04-15 23:32:20,2017-04-15 23:49:03,1,4.37,1,N,4,112,2,...,0.0,0.0,0.3,17.8,16.716667,4 112,4.435,14.616667,saturday,apr


#### Create `rush_hour` column

Define rush hour as:
* Any weekday (not Saturday or Sunday) AND
* Either from 06:00&ndash;10:00 or from 16:00&ndash;20:00

Create a binary `rush_hour` column that contains a 1 if the ride was during rush hour and a 0 if it was not.
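As a point of comparison with the row-wise function used below, the same definition can be sketched with vectorized pandas operations. The sample timestamps are illustrative, and the sketch assumes the windows are half-open (rides starting at exactly 10:00 or 20:00 are excluded), which matches the `6 <= hour < 10` style of check in this notebook.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    '2017-03-25 08:55:43',   # Saturday morning
    '2017-04-11 14:53:28',   # Tuesday mid-afternoon
    '2017-12-15 07:26:56',   # Friday morning
    '2017-04-11 16:00:00',   # Tuesday, start of the evening window
    '2017-04-11 20:00:00',   # Tuesday, just after the evening window
]))

hour = pickups.dt.hour
is_weekday = pickups.dt.dayofweek < 5            # Monday=0 ... Sunday=6
# `between` is inclusive on both ends, so hours 6-9 cover 06:00-09:59.
in_window = hour.between(6, 9) | hour.between(16, 19)

rush_hour = (is_weekday & in_window).astype(int)
print(rush_hour.tolist())
```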

In [None]:
# Create 'rush_hour' col
### YOUR CODE HERE ###
#df0['rush_hour'] = 1
df['rush_hour'] = df['tpep_pickup_datetime'].dt.hour

# If day is Saturday or Sunday, impute 0 in `rush_hour` column
### YOUR CODE HERE ###
#df0['rush_hour'] = np.where((df0['day'] == 'Saturday') | (df0['day'] == 'Sunday'), 0, df0['rush_hour'])
df.loc[df['day'].isin(['saturday', 'sunday']), 'rush_hour'] = 0
#df.loc[(df['rush_hour']==0) & (df['day'].isin(['saturday', 'sunday']))]

In [None]:
# prompt: Create a function called rush_hourizer() to set values in rush_hour to 1 if a ride occurs on any weekday other than Saturday or Sunday and either from 06:00-10:00 or from 16:00-20:00, set 0 if not.

#def rush_hourizer(df):
  # Create 'rush_hour' col
  #df['rush_hour'] = 1
  # If day is Saturday or Sunday, impute 0 in `rush_hour` column
  #df['rush_hour'] = np.where((df['day'] == 'Saturday') | (df['day'] == 'Sunday'), 0, df['rush_hour'])
  # Impute 0 for times outside of rush hour
  #df['rush_hour'] = np.where(((df['tpep_pickup_datetime'].dt.hour < 6) | (df['tpep_pickup_datetime'].dt.hour >= 10)) & ((df['tpep_pickup_datetime'].dt.hour < 16) | (df['tpep_pickup_datetime'].dt.hour >= 20)), 0, df['rush_hour'])
  #return df


In [None]:
### YOUR CODE HERE ###
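def rush_hourizer(row):
    # At this point `row['rush_hour']` holds the pickup hour (0-23),
    # with weekend rows already set to 0.
    if 6 <= row['rush_hour'] < 10:
        val = 1
    elif 16 <= row['rush_hour'] < 20:
        val = 1
    else:
        val = 0
    return val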

In [None]:
# Apply the `rush_hourizer()` function to the new column
### YOUR CODE HERE ###
#df0 = rush_hourizer(df0)
df.loc[(df.day != 'saturday') & (df.day != 'sunday'), 'rush_hour'] = df.apply(rush_hourizer, axis=1)
df.head()


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,...,tolls_amount,improvement_surcharge,total_amount,duration,pickup_dropoff,mean_distance,mean_duration,day,month,rush_hour
24870114,2,2017-03-25 08:55:43,2017-03-25 09:09:47,6,3.34,1,N,100,231,1,...,0.0,0.3,16.56,14.066667,100 231,3.521667,22.847222,saturday,mar,0
35634249,1,2017-04-11 14:53:28,2017-04-11 15:19:58,1,1.8,1,N,186,43,1,...,0.0,0.3,20.8,26.5,186 43,3.108889,24.47037,tuesday,apr,0
106203690,1,2017-12-15 07:26:56,2017-12-15 07:34:08,1,1.0,1,N,262,236,1,...,0.0,0.3,8.75,7.2,262 236,0.881429,7.25,friday,dec,1
38942136,2,2017-05-07 13:17:59,2017-05-07 13:48:14,1,3.7,1,N,188,97,1,...,0.0,0.3,27.69,30.25,188 97,3.7,30.25,sunday,may,0
30841670,2,2017-04-15 23:32:20,2017-04-15 23:49:03,1,4.37,1,N,4,112,2,...,0.0,0.3,17.8,16.716667,4 112,4.435,14.616667,saturday,apr,0
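The rush-hour rule defined above can also be computed without a row-wise `apply`. The following is a minimal sketch on a few made-up pickup times, not the notebook's own method. It assumes the same column name, `tpep_pickup_datetime`, and treats each window as including its starting hour and excluding its ending hour, as the `rush_hourizer()` function does.

```python
import pandas as pd

# Toy pickup times (hypothetical rows chosen to hit each case).
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime([
        '2017-12-15 07:26:56',  # Friday, inside the morning window
        '2017-04-11 14:53:28',  # Tuesday, outside both windows
        '2017-03-25 08:55:43',  # Saturday, morning hour but a weekend
        '2017-02-28 18:30:05',  # Tuesday, inside the evening window
    ])
})

pickup = toy['tpep_pickup_datetime']
is_weekday = pickup.dt.dayofweek < 5      # Monday=0 ... Sunday=6
hour = pickup.dt.hour
# Hours 6-9 cover [06:00, 10:00); hours 16-19 cover [16:00, 20:00).
in_window = hour.between(6, 9) | hour.between(16, 19)

toy['rush_hour'] = (is_weekday & in_window).astype(int)
print(toy)
```

Combining two boolean Series this way avoids storing the pickup hour in `rush_hour` as an intermediate value.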


### Task 4. Scatter plot

Create a scatterplot to visualize the relationship between `mean_duration` and `fare_amount`.

In [None]:
# Create a scatterplot to visualize the relationship between variables of interest
### YOUR CODE HERE ###
#sns.scatterplot(x='mean_duration', y='fare_amount', data=df0)
sns.set(style='whitegrid')
f = plt.figure()
f.set_figwidth(5)
f.set_figheight(5)
sns.regplot(x=df['mean_duration'], y=df['fare_amount'],
            scatter_kws={'alpha':0.5, 's':5},
            line_kws={'color':'red'})
plt.ylim(0, 70)
plt.xlim(0, 70)
plt.title('Mean duration x fare amount')
plt.show()

The `mean_duration` variable correlates with the target variable. But what are the horizontal lines around fare amounts of 52 dollars and 63 dollars? What are the values and how many are there?

You know what one of the lines represents. 62 dollars and 50 cents is the maximum that was imputed for outliers, so all former outliers will now have fare amounts of \$62.50. What is the other line?

Check the value of the rides in the second horizontal line in the scatter plot.

In [None]:
### YOUR CODE HERE ###
#df0[df0['fare_amount'] == 52].shape[0]
df[df['fare_amount'] > 50]['fare_amount'].value_counts().head()

Unnamed: 0_level_0,count
fare_amount,Unnamed: 1_level_1
52.0,514
62.5,84
59.0,9
50.5,9
57.5,8


Examine the first 30 of these trips.

In [None]:
# Set pandas to display all columns
### YOUR CODE HERE ###
pd.set_option('display.max_columns', None)
df[df['fare_amount'] == 52].head(30)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,pickup_dropoff,mean_distance,mean_duration,day,month,rush_hour
18600059,2,2017-03-05 19:15:30,2017-03-05 19:52:18,2,18.9,2,N,236,132,1,52.0,0.0,0.5,14.58,5.54,0.3,72.92,36.8,236 132,19.211667,40.5,sunday,mar,0
47959795,1,2017-06-03 14:24:57,2017-06-03 15:31:48,1,18.0,2,N,132,163,1,52.0,0.0,0.5,0.0,0.0,0.3,52.8,66.85,132 163,19.229,52.941667,saturday,jun,0
95729204,2,2017-11-11 20:16:16,2017-11-11 20:17:14,1,0.23,2,N,132,132,2,52.0,0.0,0.5,0.0,0.0,0.3,52.8,0.966667,132 132,2.255862,3.021839,saturday,nov,0
103404868,2,2017-12-06 23:37:08,2017-12-07 00:06:19,1,18.93,2,N,132,79,2,52.0,0.0,0.5,0.0,0.0,0.3,52.8,29.183333,132 79,19.431667,47.275,wednesday,dec,0
80479432,2,2017-09-24 23:45:45,2017-09-25 00:15:14,1,17.99,2,N,132,234,1,52.0,0.0,0.5,14.64,5.76,0.3,73.2,29.483333,132 234,17.654,49.833333,sunday,sep,0
16226157,1,2017-02-28 18:30:05,2017-02-28 19:09:55,1,18.4,2,N,132,48,2,52.0,4.5,0.5,0.0,5.54,0.3,62.84,39.833333,132 48,18.761905,58.246032,tuesday,feb,1
55253442,2,2017-06-05 12:51:58,2017-06-05 13:07:35,1,4.73,2,N,228,88,2,52.0,0.0,0.5,0.0,5.76,0.3,58.56,15.616667,228 88,4.73,15.616667,monday,jun,0
65900029,2,2017-08-03 22:47:14,2017-08-03 23:32:41,2,18.21,2,N,132,48,2,52.0,0.0,0.5,0.0,5.76,0.3,58.56,45.45,132 48,18.761905,58.246032,thursday,aug,0
80904240,2,2017-09-26 13:48:26,2017-09-26 14:31:17,1,17.27,2,N,186,132,2,52.0,0.0,0.5,0.0,5.76,0.3,58.56,42.85,186 132,17.096,42.92,tuesday,sep,0
33706214,2,2017-04-23 21:34:48,2017-04-23 22:46:23,6,18.34,2,N,132,148,1,52.0,0.0,0.5,5.0,0.0,0.3,57.8,71.583333,132 148,17.994286,46.340476,sunday,apr,0


**Question:** What do you notice about the first 30 trips?

* The `RatecodeID` is 2, for JFK.
* The `store_and_fwd_flag` is "N", for "not a store and forward trip".
* The `mta_tax` is 0.5
* The `improvement_surcharge` is 0.3

**Exemplar response:**

It seems that almost all of the trips in the first 30 rows where the fare amount was \$52 either begin or end at location 132, and all of them have a `RatecodeID` of 2.

There is no readily apparent reason why PULocation 132 should have so many fares of 52 dollars. They seem to occur on all different days, at different times, with both vendors, in all months. However, there are many toll amounts of \$5.76 and \$5.54. This would seem to indicate that location 132 is in an area that frequently requires tolls to get to and from. It's likely this is an airport.


The data dictionary says that `RatecodeID` of 2 indicates trips for JFK, which is John F. Kennedy International Airport. A quick Google search for "new york city taxi flat rate \$52" indicates that in 2017 (the year that this data was collected) there was indeed a flat fare for taxi trips between JFK airport (in Queens) and Manhattan.

Because `RatecodeID` is known from the data dictionary, the values for this rate code can be imputed back into the data after the model makes its predictions. This way you know that those data points will always be correct.
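As a rough illustration of that imputation step, the sketch below overwrites predictions for rate code 2 with the flat fare. The `predicted_fare` and `final_fare` column names and all of the values are hypothetical. Only `RatecodeID`, the code `2`, and the \$52 flat fare come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two JFK flat-rate trips (RatecodeID 2) and two standard-rate trips.
toy = pd.DataFrame({
    'RatecodeID': [2, 1, 2, 1],
    'predicted_fare': [48.7, 12.4, 9.1, 16.3],  # made-up model outputs
})

JFK_RATECODE = 2
JFK_FLAT_FARE = 52.0

# Keep the model's prediction unless the flat fare is known to apply.
toy['final_fare'] = np.where(toy['RatecodeID'] == JFK_RATECODE,
                             JFK_FLAT_FARE,
                             toy['predicted_fare'])
print(toy)
```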

### Task 5. Isolate modeling variables

Drop features that are redundant, irrelevant, or that will not be available in a deployed environment.

In [None]:
### YOUR CODE HERE ###
#df0.columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   VendorID               22699 non-null  int64         
 1   tpep_pickup_datetime   22699 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  22699 non-null  datetime64[ns]
 3   passenger_count        22699 non-null  int64         
 4   trip_distance          22699 non-null  float64       
 5   RatecodeID             22699 non-null  int64         
 6   store_and_fwd_flag     22699 non-null  object        
 7   PULocationID           22699 non-null  int64         
 8   DOLocationID           22699 non-null  int64         
 9   payment_type           22699 non-null  int64         
 10  fare_amount            22699 non-null  float64       
 11  extra                  22699 non-null  float64       
 12  mta_tax                22699 non-null  float64       
 

In [None]:
### YOUR CODE HERE ###
#df1 = df0.drop(columns=['passenger_count', 'RatecodeID', 'store_and_fwd_flag', 'payment_type',
#                        'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
#                        'pickup_dropoff', 'day', 'month'])

df2 = df.copy()

df2 = df2.drop(['tpep_dropoff_datetime', 'tpep_pickup_datetime', 'trip_distance', 'RatecodeID',
                'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'extra',
                'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount',
                'duration', 'pickup_dropoff', 'day', 'month'
               ], axis=1)

df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   VendorID         22699 non-null  int64  
 1   passenger_count  22699 non-null  int64  
 2   fare_amount      22699 non-null  float64
 3   mean_distance    22699 non-null  float64
 4   mean_duration    22699 non-null  float64
 5   rush_hour        22699 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 1.7 MB


### Task 6. Pair plot

Create a pairplot to visualize pairwise relationships between `fare_amount`, `mean_duration`, and `mean_distance`.

In [None]:
# Create a pairplot to visualize pairwise relationships between variables in the data
### YOUR CODE HERE ###
#sns.pairplot(df1, vars=['fare_amount', 'mean_duration', 'mean_distance'])
sns.pairplot(df2[['fare_amount', 'mean_duration', 'mean_distance']],
             plot_kws={'alpha':0.4, 's':5});

These variables all show linear correlation with each other. Investigate this further.

### Task 7. Identify correlations

Next, code a correlation matrix to help determine the most correlated variables.

In [None]:
# Correlation matrix to help determine most correlated variables
### YOUR CODE HERE ###
#corr_matrix = df1.corr()
df2.corr(method='pearson')

Unnamed: 0,VendorID,passenger_count,fare_amount,mean_distance,mean_duration,rush_hour
VendorID,1.0,0.266463,0.001045,0.004741,0.001876,-0.002874
passenger_count,0.266463,1.0,0.014942,0.013428,0.015852,-0.022035
fare_amount,0.001045,0.014942,1.0,0.910185,0.859105,-0.020075
mean_distance,0.004741,0.013428,0.910185,1.0,0.874864,-0.039725
mean_duration,0.001876,0.015852,0.859105,0.874864,1.0,-0.021583
rush_hour,-0.002874,-0.022035,-0.020075,-0.039725,-0.021583,1.0


Visualize a correlation heatmap of the data.

In [None]:
# Create correlation heatmap
### YOUR CODE HERE ###
#plt.figure(figsize=(8, 6))
#sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
#plt.title('Correlation Matrix')
#plt.show()
plt.figure(figsize=(6,4))
sns.heatmap(df2.corr(method='pearson'), annot=True, cmap='Reds')
plt.title('Correlation heatmap',
          fontsize=18)
plt.show()

**Question:** Which variable(s) are correlated with the target variable of `fare_amount`?

* `mean_distance` and `mean_duration` (along with `trip_distance`, `total_amount`, and `duration`, which were dropped before modeling)

**Exemplar response:** `mean_duration` and `mean_distance` are both highly correlated with the target variable of `fare_amount`. They're also both correlated with each other, with a Pearson correlation of 0.87.

Recall that highly correlated predictor variables can be bad for linear regression models when you want to be able to draw statistical inferences about the data from the model. However, correlated predictor variables can still be used to create an accurate predictor if the prediction itself is more important than using the model as a tool to learn about your data.
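One common way to quantify how much correlated predictors inflate coefficient uncertainty is the variance inflation factor (VIF). The notebook does not compute it, so treat the following as an illustrative sketch. It uses only the Pearson correlation between `mean_distance` and `mean_duration` reported in the matrix above and ignores the other features.

```python
import numpy as np

# Pearson correlation between mean_distance and mean_duration,
# copied from the correlation matrix above.
r = 0.874864

# Correlation matrix restricted to these two predictors (other features ignored).
corr = np.array([[1.0, r],
                 [r, 1.0]])

# The diagonal of the inverse correlation matrix holds the VIFs.
vif = np.diag(np.linalg.inv(corr))

print(vif)               # both entries equal 1 / (1 - r**2)
print(1 / (1 - r**2))
```

With only two predictors, the VIF reduces to 1 / (1 - r^2). Rule-of-thumb thresholds for a worrying VIF vary by source, so the number is best read as a relative indicator rather than a pass/fail test.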

This model will predict `fare_amount`, which will be used as a predictor variable in machine learning models.

Try modeling with both variables even though they are correlated.

<img src="https://drive.google.com/uc?id=1xa68IrpTXu0KRFO49MEMiLaje8469nsk" width="100" height="100" align=left>

## PACE: **Construct**

After analysis and deriving variables with close relationships, it is time to begin constructing the model. Consider the questions in your PACE Strategy Document to reflect on the Construct stage.


### Task 8a. Split data into outcome variable and features

In [None]:
### YOUR CODE HERE ###
#df2 = df1.copy()
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   VendorID         22699 non-null  int64  
 1   passenger_count  22699 non-null  int64  
 2   fare_amount      22699 non-null  float64
 3   mean_distance    22699 non-null  float64
 4   mean_duration    22699 non-null  float64
 5   rush_hour        22699 non-null  int64  
dtypes: float64(3), int64(3)
memory usage: 1.7 MB


Set your X and y variables. X represents the features and y represents the outcome (target) variable.

In [None]:
# Remove the target column from the features
# X = df2.drop(columns='fare_amount')
### YOUR CODE HERE ###
#X = df2[['mean_distance', 'mean_duration']]
X = df2.drop(columns=['fare_amount'])

# Set y variable
### YOUR CODE HERE ###
y = df2[['fare_amount']]

# Display first few rows
### YOUR CODE HERE ###
display(X.head())
display(y.head())

Unnamed: 0,VendorID,passenger_count,mean_distance,mean_duration,rush_hour
24870114,2,6,3.521667,22.847222,0
35634249,1,1,3.108889,24.47037,0
106203690,1,1,0.881429,7.25,1
38942136,2,1,3.7,30.25,0
30841670,2,1,4.435,14.616667,0


Unnamed: 0,fare_amount
24870114,13.0
35634249,16.0
106203690,6.5
38942136,20.5
30841670,16.5


### Task 8b. Pre-process data


Dummy encode categorical variables

In [None]:
# Convert VendorID to string
### YOUR CODE HERE ###
X['VendorID'] = X['VendorID'].astype(str)

# Get dummies
### YOUR CODE HERE ###
X = pd.get_dummies(X, drop_first=True)
X.head()

Unnamed: 0,passenger_count,mean_distance,mean_duration,rush_hour,VendorID_2
24870114,6,3.521667,22.847222,0,True
35634249,1,3.108889,24.47037,0,False
106203690,1,0.881429,7.25,1,False
38942136,1,3.7,30.25,0,True
30841670,1,4.435,14.616667,0,True
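To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up table. The values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical mini feature table.
toy = pd.DataFrame({
    'passenger_count': [6, 1, 1],
    'VendorID': [2, 1, 1],
})

# Cast to string so pandas treats VendorID as a category rather than a number.
toy['VendorID'] = toy['VendorID'].astype(str)

# drop_first=True drops the first level ('1'), leaving a single indicator column.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

Because `VendorID` has only two levels, one indicator column carries all of its information. Keeping both would add a perfectly collinear column.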


### Split data into training and test sets

Create training and testing sets. The test set should contain 20% of the total samples. Set `random_state=0`.

In [None]:
# Create training and testing sets
#### YOUR CODE HERE ####
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Standardize the data

Use `StandardScaler()`, `fit()`, and `transform()` to standardize the `X_train` variables. Assign the results to a variable called `X_train_scaled`.

In [None]:
# Standardize the X variables
### YOUR CODE HERE ###
#scaler = StandardScaler()
#X_train_scaled = scaler.fit_transform(X_train)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
print('X_train scaled:', X_train_scaled)

X_train scaled: [[-0.50301524  0.8694684   0.17616665 -0.64893329  0.89286563]
 [-0.50301524 -0.60011281 -0.69829589  1.54099045  0.89286563]
 [ 0.27331093 -0.47829156 -0.57301906 -0.64893329 -1.11998936]
 ...
 [-0.50301524 -0.45121122 -0.6788917  -0.64893329 -1.11998936]
 [-0.50301524 -0.58944763 -0.85743597  1.54099045 -1.11998936]
 [ 1.82596329  0.83673851  1.13212101 -0.64893329  0.89286563]]
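For intuition about what the fit and transform steps do, the sketch below reproduces the same arithmetic by hand with NumPy on made-up numbers. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and it learns the statistics from the training values only.

```python
import numpy as np

# Hypothetical single-feature training and test values.
train = np.array([[1.0], [2.0], [3.0], [6.0]])
test = np.array([[3.0], [9.0]])

# "fit": learn the mean and population standard deviation from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # ddof=0 by default in NumPy

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))  # approximately 0 and 1
print(test_scaled.ravel())
```

The scaled test values generally do not have mean 0 and standard deviation 1, which is expected: they are expressed in the training set's units.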


### Fit the model

Instantiate your model and fit it to the training data.

In [None]:
# Fit your model to the training data
### YOUR CODE HERE ###
# Instantiate the model
lr = LinearRegression()

# Fit the model to the training data
lr.fit(X_train_scaled, y_train)

### Task 8c. Evaluate model

### Train data

Evaluate your model performance by calculating the residual sum of squares and the coefficient of determination (R^2). Calculate the Mean Absolute Error, Mean Squared Error, and the Root Mean Squared Error.

In [None]:
# Evaluate the model performance on the training data
### YOUR CODE HERE ###
# Predict on the training data
#y_train_pred = lr.predict(X_train_scaled)

# Calculate the residual sum of squares
#rss_train = np.sum((y_train - y_train_pred)**2)

# Calculate the explained variance score (R^2)
#r2_train = metrics.explained_variance_score(y_train, y_train_pred)

# Calculate the Mean Absolute Error
#mae_train = metrics.mean_absolute_error(y_train, y_train_pred)

# Calculate the Mean Squared Error
#mse_train = metrics.mean_squared_error(y_train, y_train_pred)

# Calculate the Root Mean Squared Error
#rmse_train = np.sqrt(mse_train)

# Print the evaluation metrics
#print('Training Data Metrics:')
#print('Residual Sum of Squares (RSS):', rss_train)
#print('Explained Variance Score (R^2):', r2_train)
#print('Mean Absolute Error (MAE):', mae_train)
#print('Mean Squared Error (MSE):', mse_train)
#print('Root Mean Squared Error (RMSE):', rmse_train)
r_sq = lr.score(X_train_scaled, y_train)
print('Coefficient of determination:', r_sq)
y_pred_train = lr.predict(X_train_scaled)
print('R^2:', r2_score(y_train, y_pred_train))
print('MAE:', mean_absolute_error(y_train, y_pred_train))
print('MSE:', mean_squared_error(y_train, y_pred_train))
print('RMSE:',np.sqrt(mean_squared_error(y_train, y_pred_train)))

Coefficient of determination: 0.8398434585044773
R^2: 0.8398434585044773
MAE: 2.186666416775414
MSE: 17.88973296349268
RMSE: 4.229625629236313
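The metrics above are related by simple formulas. The sketch below computes each one by hand with NumPy on a few made-up actual and predicted fares, including the residual sum of squares, which the cell above does not print.

```python
import numpy as np

# Hypothetical actual and predicted fares.
y_true = np.array([13.0, 16.0, 6.5, 20.5])
y_hat = np.array([12.0, 17.0, 7.0, 18.5])

residuals = y_true - y_hat
rss = np.sum(residuals ** 2)                  # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - rss / tss                            # coefficient of determination
mae = np.mean(np.abs(residuals))              # mean absolute error
mse = rss / len(y_true)                       # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error

print(rss, r2, mae, mse, rmse)
```

Because MSE is RSS divided by the number of rows, RSS can be recovered from a printed MSE by multiplying by the row count.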


### Test data

Calculate the same metrics on the test data. Remember to scale the `X_test` data using the scaler that was fit to the training data. Do not refit the scaler to the testing data, just transform it. Call the results `X_test_scaled`.

In [None]:
# Scale the X_test data
### YOUR CODE HERE ###
# Transform the test data using the fitted scaler
X_test_scaled = scaler.transform(X_test)

In [None]:
# Evaluate the model performance on the testing data
### YOUR CODE HERE ###
# Predict on the test data
#y_test_pred = lr.predict(X_test_scaled)

# Calculate the residual sum of squares
#rss_test = np.sum((y_test - y_test_pred)**2)

# Calculate the explained variance score (R^2)
#r2_test = metrics.explained_variance_score(y_test, y_test_pred)

# Calculate the Mean Absolute Error
#mae_test = metrics.mean_absolute_error(y_test, y_test_pred)

# Calculate the Mean Squared Error
#mse_test = metrics.mean_squared_error(y_test, y_test_pred)

# Calculate the Root Mean Squared Error
#rmse_test = np.sqrt(mse_test)

# Print the evaluation metrics
#print('Test Data Metrics:')
#print('Residual Sum of Squares (RSS):', rss_test)
#print('Explained Variance Score (R^2):', r2_test)
#print('Mean Absolute Error (MAE):', mae_test)
#print('Mean Squared Error (MSE):', mse_test)
#print('Root Mean Squared Error (RMSE):', rmse_test)
r_sq_test = lr.score(X_test_scaled, y_test)
print('Coefficient of determination:', r_sq_test)
y_pred_test = lr.predict(X_test_scaled)
print('R^2:', r2_score(y_test, y_pred_test))
print('MAE:', mean_absolute_error(y_test,y_pred_test))
print('MSE:', mean_squared_error(y_test, y_pred_test))
print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred_test)))

Coefficient of determination: 0.8682583641795454
R^2: 0.8682583641795454
MAE: 2.1336549840593864
MSE: 14.326454156998942
RMSE: 3.7850302716093225


**Exemplar note:** The model performance is high on both training and test sets, suggesting that there is little bias in the model and that the model is not overfit. In fact, the test scores were even better than the training scores.

For the test data, an R<sup>2</sup> of 0.868 means that 86.8% of the variance in the `fare_amount` variable is described by the model.

The mean absolute error is informative here because, for the purposes of the model, an error of two is not more than twice as bad as an error of one.

<img src="https://drive.google.com/uc?id=1O04Ts47cyQs_UPSPtJxUXGEKsSpteLRL" width="100" height="100" align=left>

## PACE: **Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 9a. Results

Use the code cell below to get `actual`, `predicted`, and `residual` for the testing set, and store them as columns in a `results` dataframe.

In [None]:
# Create a `results` dataframe
### YOUR CODE HERE ###
# Create a DataFrame to store the results
#results = pd.DataFrame()

# Store the actual values of the target variable
#results['actual'] = y_test

# Store the predicted values of the target variable
#results['predicted'] = y_test_pred

# Calculate and store the residuals
#results['residual'] = results['actual'] - results['predicted']

# Display the first few rows of the results DataFrame
#results.head()
results = pd.DataFrame(data={'actual': y_test['fare_amount'],
                             'predicted': y_pred_test.ravel()})
results['residual'] = results['actual'] - results['predicted']
results.head()

Unnamed: 0,actual,predicted,residual
102188254,14.0,12.356503,1.643497
50574134,28.0,16.314595,11.685405
14767643,5.5,6.726789,-1.226789
16019414,15.5,16.227206,-0.727206
1352127,9.5,10.536408,-1.036408


### Task 9b. Visualize model results

Create a scatterplot to visualize `actual` vs. `predicted`.

In [None]:
# Create a scatterplot to visualize `predicted` over `actual`
### YOUR CODE HERE ###
#sns.scatterplot(x='actual', y='predicted', data=results)
fig, ax = plt.subplots(figsize=(6, 6))
sns.set(style='whitegrid')
sns.scatterplot(x='actual',
                y='predicted',
                data=results,
                s=20,
                alpha=0.5,
                ax=ax
)
# Draw an x=y line to show what the results would be if the model were perfect
plt.plot([0,60], [0,60], c='red', linewidth=2)
plt.title('Actual vs. predicted');

Visualize the distribution of the `residuals` using a histogram.

In [None]:
# Visualize the distribution of the `residuals`
### YOUR CODE HERE ###
#sns.histplot(results['residual'])
sns.histplot(results['residual'], bins=np.arange(-15,15.5,0.5))
plt.title('Distribution of the residuals')
plt.xlabel('residual value')
plt.ylabel('count');


In [None]:
# Calculate residual mean
### YOUR CODE HERE ###
results['residual'].mean()

-0.01544262152868054

Create a scatterplot of `residuals` over `predicted`.

In [None]:
# Create a scatterplot of `residuals` over `predicted`
### YOUR CODE HERE ###
#sns.scatterplot(x='predicted', y='residual', data=results)
sns.scatterplot(x='predicted', y='residual', data=results)
plt.axhline(0, c='red')
plt.title('Scatterplot of residuals over predicted values')
plt.xlabel('predicted value')
plt.ylabel('residual value')
plt.show()

**Exemplar note:** The model's residuals are evenly distributed above and below zero, with the exception of the sloping lines from the upper-left corner to the lower-right corner, which you know are the imputed maximum of \$62.50 and the flat rate of \$52 for JFK airport trips.

### Task 9c. Coefficients

Use the `coef_` attribute to get the model's coefficients. The coefficients are output in the order of the features that were used to train the model. Which feature had the greatest effect on trip fare?

In [None]:
# Output the model's coefficients
#lr.coef_
coefficients = pd.DataFrame(lr.coef_, columns=X.columns)
coefficients

Unnamed: 0,passenger_count,mean_distance,mean_duration,rush_hour,VendorID_2
0,0.030825,7.133867,2.812115,0.110233,-0.054373


What do these coefficients mean? How should they be interpreted?

* The features used to train the model, in order, are `passenger_count`, `mean_distance`, `mean_duration`, `rush_hour`, and `VendorID_2`.
* The coefficient for `mean_distance` is 7.133867, the largest in magnitude.
* The coefficient for `mean_duration` is 2.812115.
* Because the features were standardized, an increase of 1 standard deviation in `mean_distance` is associated with an estimated increase of \$7.13 in `fare_amount`, holding the other features constant.
* Likewise, an increase of 1 standard deviation in `mean_duration` is associated with an estimated increase of \$2.81 in `fare_amount`.

**Exemplar response:**

The coefficients reveal that `mean_distance` was the feature with the greatest weight in the model's final prediction. Be careful here! A common misinterpretation is that for every mile traveled, the fare amount increases by a mean of \$7.13. This is incorrect. Remember, the data used to train the model was standardized with `StandardScaler()`. As such, the units are no longer miles. In other words, you cannot say "for every mile traveled...", as stated above. The correct interpretation of this coefficient is: controlling for other variables, *for every +1 change in standard deviation*, the fare amount increases by a mean of \$7.13.

Note also that because some highly correlated features were not removed, the confidence interval of this assessment is wider.

So, translate this back to miles instead of standard deviation (i.e., unscale the data).

1. Calculate the standard deviation of `mean_distance` in the `X_train` data.

2. Divide the coefficient (7.133867) by the result to yield a more intuitive interpretation.

In [None]:
# 1. Calculate SD of `mean_distance` in X_train data
print(X_train['mean_distance'].std())

# 2. Divide the model coefficient by the standard deviation
print(7.133867 / X_train['mean_distance'].std())

3.574812975256436
1.9955916713344308


Now you can make a more intuitive interpretation: for every 3.57 miles traveled, the fare increased by a mean of \$7.13. Or, reduced: for every 1 mile traveled, the fare increased by a mean of \$2.00.
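The reason this division works can be checked on synthetic data. The sketch below is an illustration with made-up numbers and a hypothetical `slope()` helper, not part of the notebook. It fits a least-squares line to a raw feature and to its standardized version, then confirms that the standardized slope divided by the standard deviation equals the raw per-unit slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): fare rises about $2 per mile, plus noise.
miles = rng.uniform(0.5, 20.0, size=500)
fare = 3.0 + 2.0 * miles + rng.normal(0.0, 1.0, size=500)

def slope(x, y):
    """Ordinary least squares slope of y on x, fitted with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

sd = miles.std()                       # population SD (ddof=0), as StandardScaler uses
miles_scaled = (miles - miles.mean()) / sd

per_sd = slope(miles_scaled, fare)     # dollars per standard deviation
per_mile = slope(miles, fare)          # dollars per mile

print(per_sd, per_mile, per_sd / sd)   # the last two values agree
```

Note that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population version. With roughly 18,000 training rows, the difference is negligible.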

### Task 9d. Conclusion

1. What are the key takeaways from this notebook?
* The variables `mean_distance` and `mean_duration` are correlated with `fare_amount`.
* The variables `mean_distance` and `mean_duration` are possible candidates for predicting taxi fare amounts.

2. What results can be presented from this notebook?
* The linear regression model uses `passenger_count`, `mean_distance`, `mean_duration`, `rush_hour`, and `VendorID_2` as the independent variables and `fare_amount` as the dependent variable.
* The training set contains 80% of the total samples, while the test set contains 20% of the total samples.
* The training data metrics are as follows:
    * Residual Sum of Squares (RSS): approximately 324859.66 (MSE times 18,159 training rows)
    * Coefficient of determination (R^2): 0.8398434585044773
    * Mean Absolute Error (MAE): 2.186666416775414
    * Mean Squared Error (MSE): 17.88973296349268
    * Root Mean Squared Error (RMSE): 4.229625629236313
* The test data metrics are as follows:
    * Residual Sum of Squares (RSS): approximately 65042.10 (MSE times 4,540 test rows)
    * Coefficient of determination (R^2): 0.8682583641795454
    * Mean Absolute Error (MAE): 2.1336549840593864
    * Mean Squared Error (MSE): 14.326454156998942
    * Root Mean Squared Error (RMSE): 3.7850302716093225

**Exemplar responses:**
**What are the key takeaways from this notebook?**

* Multiple linear regression is a powerful tool to estimate a dependent continuous variable from several independent variables.
* Exploratory data analysis is useful for selecting both numeric and categorical features for multiple linear regression.
* Fitting multiple linear regression models may require trial and error to select variables that fit an accurate model while maintaining model assumptions (or not, depending on your use case).

**What results can be presented from this notebook?**

*  You can discuss meeting linear regression assumptions, and you can present the MAE and RMSE scores obtained from the model.

Bonus Content will be in another notebook. The following is the beginning portion of the content.

# BONUS CONTENT

More work must be done to prepare the predictions to be used as inputs into the model for the upcoming course. This work will be broken into the following steps:

1. Get the model's predictions on the full dataset.

2. Impute the constant fare rate of \$52 for all trips with rate codes of `2`.

3. Check the model's performance on the full dataset.

4. Save the final predictions and `mean_duration` and `mean_distance` columns for downstream use.




### 1. Predict on full dataset

In [None]:
X_scaled = scaler.transform(X)
y_preds_full = lr.predict(X_scaled)

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.