# Predict SEPTA delays!

Can you train a model that successfully predicts SEPTA Regional Rail delays?




## Import and tidy data

Here, we've prepared the data a bit for you. Run the code blocks below to prepare your tidy dataframe, and feel free to make additional changes as you see fit!

In [None]:
import pandas as pd

septa_data = pd.read_csv('https://raw.githubusercontent.com/arcus/education-materials/master/ml-intermediate/datasets/septa/septa_otp.csv')
backup = septa_data.copy()
septa_data.info()

In [None]:
# If you need to refresh your dataframe for any reason, restore it from this backup instead of querying GitHub again
# (just in the spirit of being a good open-source citizen ^_^).
# Copying keeps the backup itself untouched by later in-place changes.
# septa_data = backup.copy()

In [None]:
septa_data.head()

In [None]:
septa_data.status.value_counts()

In [None]:
septa_data['delayed'] = septa_data.status.str.contains("min")

In [None]:
septa_data.delayed.value_counts()

In [None]:
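# Let's create a new delay_length column by extracting integers from the status column.
# Rows whose status contains no number get a delay length of 0.

septa_data['delay_length'] = (
    septa_data.status.str.extract(r'(\d+)', expand=False)  # expand=False returns a Series
    .fillna(0)
    .astype(int)  # extracted digits are strings, so convert to integers
)
septa_data.head(10)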

In [None]:
# Now let's create weekday and time columns (which may be useful features!),
# and drop the now-extraneous status, timeStamp, and date columns.

septa_data['timeStamp'] = pd.to_datetime(septa_data['timeStamp'])
septa_data['weekday'] = septa_data['timeStamp'].dt.day_name()
septa_data['time'] = septa_data['timeStamp'].dt.time  # datetime.time objects
septa_data = septa_data.drop(columns=['status', 'timeStamp', 'date'])


septa_data.head(10)

### Encoding dummy variables

In [None]:
# first, let's encode weekday as dummy vars
septa_data = pd.get_dummies(data=septa_data, columns=['weekday'], drop_first=True)

In [None]:
septa_data.head(10)

Ah, but take a look at all of the station names! These should be treated as categorical data rather than free-form strings, so they are good candidates for dummy encoding.

In [None]:
septa_data.origin.unique()

In [None]:
# we should encode stations as dummy variables, too
septa_data = pd.get_dummies(data=septa_data, columns=['origin', 'next_station'], drop_first=True)

In [None]:
septa_data.head()

Finally, let's drop train_id as a feature. There are more than 1,000 train IDs, and leaving them out lets us build a more general model that predicts from other route features. We'll also recode direction as a boolean north column.

In [None]:
septa_data['north'] = septa_data.direction.str.contains("N")
septa_data = septa_data.drop(columns=['train_id', 'direction'])

In [None]:
septa_data.head()

Keep in mind that we are not applying any specific encoding to time here. What do you think? How might you like to encode time, and why? Consider doing extra feature engineering here if you're adventurous!
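
One possible answer, offered only as a sketch: time of day is cyclical (23:59 sits right next to 00:00), so it can be mapped onto a circle with sine and cosine features. The helper name `add_cyclical_time` and the toy dataframe below are made up for illustration; the only thing assumed about the real data is that the `time` column holds `datetime.time` objects.

```python
import datetime as dt

import numpy as np
import pandas as pd


def add_cyclical_time(df, time_col="time"):
    """Replace a column of datetime.time objects with sine/cosine features."""
    out = df.copy()
    # Minutes elapsed since midnight, in the range [0, 1440)
    minutes = out[time_col].map(lambda t: t.hour * 60 + t.minute + t.second / 60)
    # Map the day onto a full circle so late night and early morning are neighbors
    angle = 2 * np.pi * minutes / (24 * 60)
    out["time_sin"] = np.sin(angle)
    out["time_cos"] = np.cos(angle)
    return out.drop(columns=[time_col])


# Toy frame standing in for septa_data (values are made up)
toy = pd.DataFrame({"time": [dt.time(0, 0), dt.time(6, 0), dt.time(23, 59)]})
toy_encoded = add_cyclical_time(toy)
print(toy_encoded)
```

A side benefit of this encoding is that it leaves only numeric columns, which most scikit-learn estimators expect.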

### Select our target variable (and drop the other 'delay' variable from our features!!)

We must decide on something to predict! If we want to approach this as a classification question (i.e. delayed or not delayed), we should use the delayed variable as the target and exclude the delay length from the features.

Conversely, if we wish to predict the delay time as a regression task, we'll want to exclude the delayed boolean variable.

For now, I've written code to approach this as a **classification** task, but feel free to rewrite:

In [None]:
septa_data = septa_data.drop(columns=['delay_length'])
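
If you want to switch between the two framings, a small helper can make the choice explicit. This is a sketch only: `split_features_target` is a hypothetical name, and the toy dataframe stands in for the prepared data. The point it illustrates is that whichever delay column is not the target must also be removed from the features, or it leaks the answer.

```python
import pandas as pd


def split_features_target(df, task="classification"):
    """Return (X, y) for the chosen task, dropping the other delay column."""
    if task == "classification":
        target, leak = "delayed", "delay_length"
    elif task == "regression":
        target, leak = "delay_length", "delayed"
    else:
        raise ValueError("task must be 'classification' or 'regression'")
    # errors="ignore" tolerates a leak column that was already dropped
    X = df.drop(columns=[target, leak], errors="ignore")
    y = df[target]
    return X, y


# Toy frame standing in for the prepared data (values are made up)
toy = pd.DataFrame({
    "north": [True, False, True],
    "delayed": [False, True, True],
    "delay_length": [0, 7, 12],
})
X_cls, y_cls = split_features_target(toy, task="classification")
X_reg, y_reg = split_features_target(toy, task="regression")
print(list(X_cls.columns), y_cls.tolist(), y_reg.tolist())
```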

### Data preparation complete!

From here, we will start to implement a training process with cross-validation. The steps we will hit include:

* Set up cross-validation
* Define preprocessing and classification pipeline
* Fit a model
* Compute metrics
* Interpret results
* Try a new model? New parameters?

## Set up cross-validation

To get you started, here is the setup for cross-validation:

In [None]:
# Create 5 stratified folds
from sklearn.model_selection import StratifiedKFold

# Split data between features (X) and outcome (y)
X = septa_data.drop(columns=['delayed']).copy()
y = septa_data['delayed'].copy()

skf = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
train_sets = []
test_sets = []

# Each fold is stored as an (X, y) tuple
for train_index, test_index in skf.split(X, y):
    train_sets.append((X.iloc[train_index].copy(), y.iloc[train_index].copy()))
    test_sets.append((X.iloc[test_index].copy(), y.iloc[test_index].copy()))
    print(train_index.shape, test_index.shape)

What does one fold look like?

In [None]:
train_sets[0][0]
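
It can also help to confirm what "stratified" buys you: each test fold should contain roughly the same share of delayed trains as the full dataset. The sketch below uses made-up data (20 positives out of 100 rows) rather than the SEPTA dataframe, so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Made-up data: 100 rows, 20% positive class
y_toy = np.array([True] * 20 + [False] * 80)
X_toy = rng.normal(size=(100, 3))

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each test fold
fold_rates = [y_toy[test_idx].mean() for _, test_idx in skf_toy.split(X_toy, y_toy)]
print(fold_rates)
```

The same check on the real folds would be something like `[y_test.mean() for _, y_test in test_sets]`.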

Keep going! See if you can adapt some of the materials from Victor's notebook to this example, or experiment with your own steps here.

## Define preprocessing and classification pipeline

In [None]:
# Take a look at Victor's code, but consider rewriting as you go.

# For instance, consider trying a model other than a decision tree!
# e.g. support vector machine: https://scikit-learn.org/stable/modules/svm.html
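
As one possible starting point (not necessarily how Victor's notebook does it), a scikit-learn `Pipeline` can chain a preprocessing step with a classifier so that both are fit on training data only. The helper name `build_pipeline` and the synthetic data are assumptions for illustration; the sketch also assumes every feature is numeric, so a column of `datetime.time` objects would need encoding first.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_pipeline(classifier):
    """Chain feature scaling with any scikit-learn classifier."""
    return Pipeline(steps=[
        ("scale", StandardScaler()),  # SVMs are sensitive to feature scale
        ("clf", classifier),
    ])


# Synthetic stand-in for one training fold
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = build_pipeline(SVC(kernel="rbf", C=1.0))
pipe.fit(X_toy, y_toy)
toy_predictions = pipe.predict(X_toy)
print(toy_predictions[:10])
```

Swapping in another model is a one-line change, e.g. `build_pipeline(DecisionTreeClassifier(max_depth=5))`.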

## Fit a model

In [None]:
# Take a look at Victor's code, but consider rewriting as you go.
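
A minimal sketch of fitting one model per fold, assuming folds are stored as lists of `(X, y)` tuples like `train_sets` and `test_sets` above. The data here is synthetic and the `toy_` names are made up so the example runs on its own; the decision tree is just a placeholder model.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target
X_arr, y_arr = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y_toy = pd.Series(y_arr.astype(bool), name="delayed")

# Build folds in the same (X, y) tuple layout used in this notebook
skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_train_sets, toy_test_sets = [], []
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    toy_train_sets.append((X_toy.iloc[train_idx], y_toy.iloc[train_idx]))
    toy_test_sets.append((X_toy.iloc[test_idx], y_toy.iloc[test_idx]))

base_model = DecisionTreeClassifier(max_depth=3, random_state=0)
fitted_models, fold_accuracy = [], []
for (X_train, y_train), (X_test, y_test) in zip(toy_train_sets, toy_test_sets):
    model = clone(base_model)  # fresh, unfitted copy for every fold
    model.fit(X_train, y_train)
    fitted_models.append(model)
    fold_accuracy.append(model.score(X_test, y_test))  # accuracy on held-out rows

print(fold_accuracy)
```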

## Compute metrics

In [None]:
# Take a look at Victor's code, but consider rewriting as you go.
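
For a classification task, accuracy alone can mislead when one class dominates, so it is worth looking at precision, recall, F1, and the confusion matrix as well. The labels below are made up to keep the arithmetic checkable by hand (1 = delayed, 0 = not delayed); a boolean `delayed` column can be cast with `.astype(int)` to match.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 3 delayed trains, 5 on-time trains
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

accuracy = accuracy_score(y_true, y_pred)    # (2 + 4) / 8
precision = precision_score(y_true, y_pred)  # 2 true positives / 3 predicted positives
recall = recall_score(y_true, y_pred)        # 2 true positives / 3 actual positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted class

print(accuracy, precision, recall, f1)
print(cm)
```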

## Interpret results

In [None]:
# Take a look at Victor's code, but consider rewriting as you go.
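
One way to start interpreting a tree-based model is to look at its feature importances. This sketch uses synthetic data in which only the first feature determines the label, so you can see what a clear signal looks like; the feature names are placeholders, not columns from the SEPTA data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends on "signal" only; the other columns are noise
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["signal", "noise_a", "noise_b"])
y_toy = X_toy["signal"] > 0

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With dummy-encoded stations you will get one importance per dummy column, so summing them by original variable may give a clearer picture.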

## Try a new model? New parameters?