<center> <h2> DS 3000 - Fall 2019</h2> </center>
<center> <h3> DS Report </h3> </center>


<center> <h1>Analyzing the performance of the MBTA</h1> </center>
<center><h4>Lei, Dan, and Ethan</h4></center>


<hr style="height:2px; border:none; color:black; background-color:black;">

#### Executive Summary:

Add your summary here (100-150 words)

Provide a brief summary of your project. After reading this executive summary, your readers should have a rough understanding of what you did in this project. You can think of this summary in terms of the four sections of the report and write 1-2 sentences describing each section.



<hr style="height:2px; border:none; color:black; background-color:black;">

## Outline
1. <a href='#1'>INTRODUCTION</a>
2. <a href='#2'>METHOD</a>
3. <a href='#3'>RESULTS</a>
4. <a href='#4'>DISCUSSION</a>

<a id="1"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 1. INTRODUCTION (Ethan)

In this section, orient your readers to your project. You've already written some of these in previous deliverables. Based on your final analysis, revise your problem statement and write a concise introduction section. This section should touch upon the following points, but should be written in full paragraphs. Your writing should incorporate all of these points (and more if you like) in a coherent way. Remember that you are trying to convince your readers that this is an important problem to tackle. 

Problem Statement
* Describe the problem you would like to tackle. 
* What is the topic of your project? 
* What do you want to learn about it?

Significance of the Problem
* Why is it important to tackle this problem in your project?
* In what ways could the insights from this project be useful?
* Has there been previous work on your topic? Do some research into your topic. Cite your sources appropriately. You can use the numbered reference format or APA (if you are more comfortable with it).

Questions/Hypothesis
* End this section with a list of questions and hypotheses
* You should tie these questions/hypotheses to the problem statement and its significance
    * e.g. Given the aforementioned problem and its importance, we set out to tackle the following questions:
    
**Requirement:**
* You should have at least one question tapping into the comparison of various machine learning algorithms in predicting your target variable from your features variables.

<a id="2"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 2. METHOD

### 2.1. Data Acquisition (Ethan)

* Describe where you obtained your data. Provide a link to the original source. 
* If you scraped your data, include your code as a script file.
* Your data should be stored in an online repository (e.g., GitHub) and your code should retrieve your data from that online resource. You can read csv files from the Web in the same way that you read files from local drive.
* Describe the dataset and variables. What do variables represent?


All the data for project comes from this website: https://mbtabackontrack.com/performance/#/download. It contains a large amount of data concerning the MBTA that is updated on a monthly basis. 

### 2.2. Variables (Lei / Dan)
* If you are testing hypotheses, what are your IVs and DVs?
* For your predictive models, what are your features and target variables?


### 2.3. Data Analysis (Dan)
* Specifically describe your predictive model. What outcome variable are you going to predict from what feature variables?
* Describe whether this is a supervised or unsupervised learning problem. Also identify the sub-category of the learning task (e.g. classification).
* What machine learning algorithms are you going to use? Why?

<a id="3"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 3. RESULTS

### 3.1. Data Wrangling (Ethan)
* Perform simple data cleaning (delete extra columns, deal with NA values, etc.)
* Perform data wrangling to extract your features and target values (e.g., grouping your dataframe by columns, applying functions to format dataframes, etc.)
* Preprocess your variables (e.g., scaling/transforming feature variables to normalize them)
* Feature extraction (dummy variables, new features from existing features, etc.)
* Use one feature selection technique to select a subset of your original features


In [None]:
import pandas as pd

In [66]:
# Returns a DataFrame with statistics about the reliability of a transport service.
def get_reliability_df():
    df = pd.read_csv("https://raw.githubusercontent.com/ethan-leba/data-science-mtba/master/resources/combined_ontime_performance.csv")
    
    # Filter out rows concerning commuter rail as we are not analyzing this data
    df = df[df['mode_type'] != 'Commuter Rail']
    
    # Create a reliability percentage row
    df['pct_reliability'] = df['otp_numerator'] / df['otp_denominator']
    
    # Remove extraneous columns
    df = df[['gtfs_route_id', 'mode_type', 'service_date', 'pct_reliability']]
    return df

In [67]:
# Returns a DataFrame with statistics about the rider's satisfaction with the MBTA.
def get_rider_satisfaction_df():
    # The CSV file contains 7 categories of satisfaction, ranging from
    # Extremely dissatisfied to extremely satisfied, and a percentage of 
    # responses for each answer. To make analysis easier to conduct, we are
    # convert these columns in a score ranging from 1 to 7 of satisfaction.
    def response_to_num(df, amt):
        pct = df.copy()["response_1_percent"]
        for x in range(2,amt):
            pct += df[f"response_{x}_percent"] * x 
        return pct
    
    df = pd.read_csv("https://raw.githubusercontent.com/ethan-leba/data-science-mtba/master/resources/combined_rider_satisfaction.csv")
    
    # Drop rows with N/A values
    df = df.dropna()
    
    # Sets the percentage row
    df['satisfaction_score'] = response_to_num(df, 7)
    
    # Remove extraneous columns
    df = df[['survey_date', 'question_description', 'satisfaction_score']]
    
    return df

In [74]:
# Returns a DataFrame with statistics about the amount of ridership a public transport service recieves.
def get_weekly_ridership_df():
    df = pd.read_csv("https://raw.githubusercontent.com/ethan-leba/data-science-mtba/master/resources/combined_weekly_ridership.csv")
    
    # Remove extraneous columns
    df = df[["MODE", "RIDERSHIP", "AVG_RIDERSHIP", "ROUTE_OR_LINE"]]
    
    return df

In [75]:
reliability_df = get_reliability_df()
satisfaction_df = get_rider_satisfaction_df()
ridership_df = get_weekly_ridership_df()

In [77]:
ridership_df.head()

Unnamed: 0,MODE,RIDERSHIP,AVG_RIDERSHIP,ROUTE_OR_LINE
0,Bus,7106920.0,323041.818182,Bus
1,Bus,688735.0,31306.136364,Silver Line
2,Rail,1439092.0,65413.259469,Blue Line
3,Rail,3400249.0,154556.756873,Green Line
4,Rail,4220321.0,191832.785561,Orange Line


In [78]:
satisfaction_df.head()

Unnamed: 0,survey_date,question_description,satisfaction_score
0,2019-05-01,How would you rate your most recent trip?,4.092
1,2019-05-01,How satisfied are you with the MBTA's communic...,4.0201
2,2019-05-01,How would you rate the mbta overall?,4.1225
3,2019-05-01,The mbta provides reliable public transportati...,3.752
4,2018-01-01,How would you rate the mbta overall?,3.8013


In [79]:
reliability_df.head()

Unnamed: 0,gtfs_route_id,mode_type,service_date,pct_reliability
0,Red,Rail,2017-10-01,0.886728
1,Orange,Rail,2017-10-01,0.929751
2,Green-B,Rail,2017-10-01,0.738828
3,Green-C,Rail,2017-10-01,0.763832
4,Green-D,Rail,2017-10-01,0.808401


### 3.2. Data Exploration (Lei)
* Generate appropriate data visualizations for your key variables identified in the previous section
* You should have at least three visualizations (and at least two different visualization types)
* For each visualization provide an explanation regarding the variables involved and an interpretation of the graph.
* If you are using Plotly, insert your visualizations as images as well (upload the graph images to an online source, e.g. github, and link those in Jupyter Notebook)


### 3.3. Model Construction (Dan)
* If you proposed hypotheses, conduct your hypothesis tests
* For your machine learning question(s), split data into training, validation, and testing sets (or use cross-validation)
* Apply machine learning algorithms (apply at least three algorithms)
* Train your algorithms

### 3.4. Model Evaluation (Dan)
* Evaluate the performance of your algorithms on appropriate evaluation metrics, using your validation set
* Interpret your results from multiple models (and hypothesis tests, if any)

### 3.5. Model Optimization (Dan)
* Tune your models using appropriate hyperparameters
* Explain why you are doing this (e.g., to avoid overfitting, etc.)

### 3.6. Model Testing (Dan)
* Test your tuned algorithms using your testing set

<a id="4"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

## 4. DISCUSSION
* Provide a summary of the steps you took to analyze your data and test your predictive model
* Intepret your findings from 3.4., 3.5, and 3.6
    * Which algorithms did you compare?
    * Which algorithm(s) revealed best performance?
    * Which algorithm(s) should be used for your predictive model?
* If you tested hypotheses, interpret the results. What does it mean to have significant/non-significant differences with regards to your data?


* End this section with a conclusion paragraph containing some pointers for future work 
    *(e.g., get more data, perform another analysis, etc.)

<a id="5"></a>
<hr style="height:2px; border:none; color:black; background-color:black;">

### CONTRIBUTIONS
* Describe each team member's contributions to the report (who did what in each section)
* Remember this is a team effort!
* Each member of your team will provide peer evaluation of other team members. Your final grade on the project will be based on those peer evaluations. 