Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [83]:
import pandas as pd 
import geopandas as gpd
from glob import glob
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='pyproj')
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

In [3]:
# Use wildcard to read-in all csv files in folder
filepath = glob("/home/alex/data/la-metro-bike-share/*.csv")
# Use low memory option to avoid specifying datatypes explicitly
dataframes = [pd.read_csv(f, low_memory=False) for f in filepath]
# Concatenate each individual CSV dataframe into a single dataframe
df = pd.concat(dataframes)

In [4]:
def crop_trips_by_coordinates(dataframe):
    dataframe = dataframe[dataframe["start_lat"] > 33].copy()
    dataframe = dataframe[dataframe["end_lat"] > 33].copy()
    dataframe = dataframe[dataframe["start_lon"] < -116].copy()
    dataframe = dataframe[dataframe["end_lon"] < -116].copy()
    return dataframe

def generate_datetime_features(dataframe):
    dataframe["start_time"] = pd.to_datetime(dataframe["start_time"])
    dataframe["end_time"] = pd.to_datetime(dataframe["end_time"])
    dataframe["trip_duration_in_minutes"] = dataframe["end_time"] - dataframe["start_time"]
    df = dataframe[dataframe["trip_duration_in_minutes"] > pd.Timedelta(minutes=0)].copy()
    df = df[df["trip_duration_in_minutes"] < pd.Timedelta(hours=24)]
    df["trip_duration_in_minutes"] = df["trip_duration_in_minutes"] / pd.Timedelta(minutes=1)
    df["year"] = df["start_time"].dt.year
    df["month"] = df["start_time"].dt.month
    df["day_of_week"] = df["start_time"].dt.dayofweek
    df["hour"] = df["start_time"].dt.hour
    df = df.sort_values("start_time")
    return df

def add_coordinate_features(dataframe):
    df = dataframe.copy()
    df['StartCoordinate'] = list(zip(df.start_lat, df.start_lon))
    df['EndCoordinate'] = list(zip(df.end_lat, df.end_lon))
    return df

def engineer_data(dataframe):
    # Copy input DataFrame
    engineered_data = dataframe.copy()
    # Apply coarse spatial filter based on coordinates
    engineered_data = crop_trips_by_coordinates(engineered_data)
    # Remove duplicate trips
    engineered_data.drop_duplicates(["trip_id"], inplace=True)
    # Generate datetime features such as trip day of week and hour of day
    engineered_data = generate_datetime_features(engineered_data)
    # Keep only the columns needed downstream
    engineered_data = engineered_data[["trip_id","start_time","end_time","start_lat","start_lon","end_lat","end_lon","bike_id","trip_route_category","passholder_type","trip_duration_in_minutes","year","month","day_of_week","hour"]].copy()
    # Dictionary to map Los Angeles user classes to simplified "Customer" vs. "Subscriber" dichotomy
    simplified_dictionary = {"Annual Pass": "Subscriber", 
                             "Flex Pass":"Subscriber", 
                             "Monthly Pass": "Subscriber", 
                             "One Day Pass": "Customer", 
                             "Walk-up":"Customer"}
    engineered_data["UserType"] = engineered_data["passholder_type"].map(simplified_dictionary)
    engineered_data['DayType'] = engineered_data['day_of_week'].apply(lambda x: 'Weekday' if x <= 4 else 'Weekend')
    return engineered_data
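To make the duration and calendar features concrete, here is a minimal sketch on a hand-built two-trip frame. The `demo` frame and its timestamps are made up for illustration and are not part of the real dataset.

```python
import pandas as pd

# Hypothetical two-trip frame for illustration only (not the real dataset)
demo = pd.DataFrame({
    "start_time": ["2016-07-07 04:17:00", "2016-07-09 10:00:00"],
    "end_time": ["2016-07-07 04:20:00", "2016-07-09 10:45:30"],
})
demo["start_time"] = pd.to_datetime(demo["start_time"])
demo["end_time"] = pd.to_datetime(demo["end_time"])

# Dividing a timedelta by a one-minute Timedelta yields float minutes
demo["trip_duration_in_minutes"] = (
    demo["end_time"] - demo["start_time"]
) / pd.Timedelta(minutes=1)

# dayofweek is 0 for Monday through 6 for Sunday, so values <= 4 are weekdays
demo["day_of_week"] = demo["start_time"].dt.dayofweek
demo["DayType"] = demo["day_of_week"].apply(
    lambda x: "Weekday" if x <= 4 else "Weekend"
)

print(demo[["trip_duration_in_minutes", "day_of_week", "DayType"]])
```

The first trip starts on a Thursday (`dayofweek` 3) and the second on a Saturday (`dayofweek` 5), so they land in different `DayType` groups.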

In [5]:
# Engineer data
trip_data = engineer_data(df)

# Return snapshot of results
print(trip_data.shape)
trip_data.head()

(905474, 17)


Unnamed: 0,trip_id,start_time,end_time,start_lat,start_lon,end_lat,end_lon,bike_id,trip_route_category,passholder_type,trip_duration_in_minutes,year,month,day_of_week,hour,UserType,DayType
0,1912818,2016-07-07 04:17:00,2016-07-07 04:20:00,34.05661,-118.23721,34.05661,-118.23721,6281,Round Trip,Monthly Pass,3.0,2016,7,3,4,Subscriber,Weekday
1,1919661,2016-07-07 06:00:00,2016-07-07 06:33:00,34.05661,-118.23721,34.05661,-118.23721,6281,Round Trip,Monthly Pass,33.0,2016,7,3,6,Subscriber,Weekday
2,1933383,2016-07-07 10:32:00,2016-07-07 10:37:00,34.052898,-118.24156,34.052898,-118.24156,5861,Round Trip,Flex Pass,5.0,2016,7,3,10,Subscriber,Weekday
3,1944197,2016-07-07 10:37:00,2016-07-07 13:38:00,34.052898,-118.24156,34.052898,-118.24156,5861,Round Trip,Flex Pass,181.0,2016,7,3,10,Subscriber,Weekday
4,1940317,2016-07-07 12:51:00,2016-07-07 12:58:00,34.049889,-118.25588,34.049889,-118.25588,6674,Round Trip,Walk-up,7.0,2016,7,3,12,Customer,Weekday


In [30]:
def merge_trips_with_ancillary_data(dataframe):
    
    starts_gdf = gpd.GeoDataFrame(dataframe.drop(["end_lon","end_lat"], axis=1), geometry=gpd.points_from_xy(dataframe.start_lon, dataframe.start_lat))
    ends_gdf = gpd.GeoDataFrame(dataframe.drop(["start_lon","start_lat"], axis=1), geometry=gpd.points_from_xy(dataframe.end_lon, dataframe.end_lat))
    
    starts_gdf.crs = "EPSG:4326"
    ends_gdf.crs = "EPSG:4326"
    
    census_places = "http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/census-places-2012.geojson"
    census_tracts = "http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/census-tracts-2012.geojson"
    neighborhoods = "http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/la-county-neighborhoods-current.geojson"
    regions = "http://s3-us-west-2.amazonaws.com/boundaries.latimes.com/archive/1.0/boundary-set/la-county-regions-current.geojson"

    census_places_gdf = gpd.read_file(census_places)
    census_tracts_gdf = gpd.read_file(census_tracts)
    neighborhoods_gdf = gpd.read_file(neighborhoods)
    regions_gdf = gpd.read_file(regions)
    
    census_places_gdf = census_places_gdf.to_crs("EPSG:4326")
    census_tracts_gdf = census_tracts_gdf.to_crs("EPSG:4326")
    neighborhoods_gdf = neighborhoods_gdf.to_crs("EPSG:4326")
    regions_gdf = regions_gdf.to_crs("EPSG:4326")
    
    census_places_gdf = census_places_gdf.drop(["kind","external_id","slug","set","metadata","resource_uri"], axis=1)
    census_places_gdf.rename(columns={"name":"CensusPlace"}, inplace=True)
    
    census_tracts_gdf = census_tracts_gdf.drop(["kind","external_id","slug","set","metadata","resource_uri"], axis=1)
    census_tracts_gdf.rename(columns={"name":"CensusTract"}, inplace=True)
    
    neighborhoods_gdf.drop(["kind","external_id","slug","set","metadata","resource_uri"], axis=1, inplace=True)
    neighborhoods_gdf.rename(columns={"name":"Neighborhood"}, inplace=True)

    regions_gdf.drop(["kind","external_id","slug","set","metadata","resource_uri"], axis=1, inplace=True)
    regions_gdf.rename(columns={"name":"Region"}, inplace=True)

    starts_census_tracts_gdf = gpd.sjoin(starts_gdf, census_tracts_gdf, how="left")
    starts_census_places_gdf = gpd.sjoin(starts_gdf, census_places_gdf, how="left")
    starts_neighborhoods_gdf = gpd.sjoin(starts_gdf, neighborhoods_gdf, how="left")
    starts_regions_gdf = gpd.sjoin(starts_gdf, regions_gdf, how="left")
    
    starts_regions_gdf.drop(["index_right"], axis=1, inplace=True)
    starts_neighborhoods_gdf.drop(["index_right"], axis=1, inplace=True)
    starts_census_places_gdf.drop(["index_right"], axis=1, inplace=True)
    starts_census_tracts_gdf.drop(["index_right"], axis=1, inplace=True)
    
    ends_census_tracts_gdf = gpd.sjoin(ends_gdf, census_tracts_gdf, how="left")
    ends_census_places_gdf = gpd.sjoin(ends_gdf, census_places_gdf, how="left")
    ends_neighborhoods_gdf = gpd.sjoin(ends_gdf, neighborhoods_gdf, how="left")
    ends_regions_gdf = gpd.sjoin(ends_gdf, regions_gdf, how="left")
    
    ends_regions_gdf.drop(["index_right"], axis=1, inplace=True)
    ends_neighborhoods_gdf.drop(["index_right"], axis=1, inplace=True)
    ends_census_places_gdf.drop(["index_right"], axis=1, inplace=True)
    ends_census_tracts_gdf.drop(["index_right"], axis=1, inplace=True)
    
    starts_with_boundaries = starts_regions_gdf.merge(starts_neighborhoods_gdf[["trip_id","Neighborhood"]], on="trip_id").merge(starts_census_places_gdf[["trip_id","CensusPlace"]], on="trip_id").merge(starts_census_tracts_gdf[["trip_id","CensusTract"]], on="trip_id")
    ends_with_boundaries = ends_regions_gdf.merge(ends_neighborhoods_gdf[["trip_id","Neighborhood"]], on="trip_id").merge(ends_census_places_gdf[["trip_id","CensusPlace"]], on="trip_id").merge(ends_census_tracts_gdf[["trip_id","CensusTract"]], on="trip_id")
    
    starts_with_boundaries_df = pd.DataFrame(starts_with_boundaries.drop(["geometry"], axis=1))
    ends_with_boundaries_df = pd.DataFrame(ends_with_boundaries[["trip_id","Region","Neighborhood","CensusPlace","CensusTract"]])
    
    trips_with_boundaries_df = starts_with_boundaries_df.merge(ends_with_boundaries_df, on="trip_id", suffixes=("_start","_end"))
        
    return trips_with_boundaries_df

In [31]:
trips_with_boundaries = merge_trips_with_ancillary_data(trip_data)


trips_with_boundaries.tail()

Unnamed: 0,trip_id,start_time,end_time,start_lat,start_lon,bike_id,trip_route_category,passholder_type,trip_duration_in_minutes,year,...,UserType,DayType,Region_start,Neighborhood_start,CensusPlace_start,CensusTract_start,Region_end,Neighborhood_end,CensusPlace_end,CensusTract_end
905469,134866192,2019-12-31 23:34:46,2019-12-31 23:42:28,34.048038,-118.253738,12019,One Way,Walk-up,7.7,2019,...,Customer,Weekday,Central L.A.,Downtown,Los Angeles,6037207710,Central L.A.,Downtown,Los Angeles,6037207301
905470,134866394,2019-12-31 23:41:52,2019-12-31 23:50:58,34.04744,-118.24794,18912,One Way,Monthly Pass,9.1,2019,...,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207302,Central L.A.,Downtown,Los Angeles,6037206031
905471,134866292,2019-12-31 23:43:19,2019-12-31 23:47:41,34.045422,-118.253517,12298,One Way,Annual Pass,4.366667,2019,...,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207301,Central L.A.,Downtown,Los Angeles,6037207710
905472,134866392,2019-12-31 23:48:17,2019-12-31 23:53:55,34.04681,-118.256981,19053,One Way,Annual Pass,5.633333,2019,...,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207710,Central L.A.,Downtown,Los Angeles,6037207900
905473,134867192,2019-12-31 23:58:52,2020-01-01 00:25:27,34.04417,-118.261169,19053,One Way,Annual Pass,26.583333,2019,...,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207900,Central L.A.,Downtown,Los Angeles,6037207501


In [34]:
trips_with_boundaries["Stays in same neighborhood"] = trips_with_boundaries["Neighborhood_start"] == trips_with_boundaries["Neighborhood_end"]

In [35]:
trips_with_boundaries.head()

Unnamed: 0,trip_id,start_time,end_time,start_lat,start_lon,bike_id,trip_route_category,passholder_type,trip_duration_in_minutes,year,...,DayType,Region_start,Neighborhood_start,CensusPlace_start,CensusTract_start,Region_end,Neighborhood_end,CensusPlace_end,CensusTract_end,Stays in same neighborhood
0,1912818,2016-07-07 04:17:00,2016-07-07 04:20:00,34.05661,-118.23721,6281,Round Trip,Monthly Pass,3.0,2016,...,Weekday,Central L.A.,Downtown,Los Angeles,6037206020,Central L.A.,Downtown,Los Angeles,6037206020,True
1,1919661,2016-07-07 06:00:00,2016-07-07 06:33:00,34.05661,-118.23721,6281,Round Trip,Monthly Pass,33.0,2016,...,Weekday,Central L.A.,Downtown,Los Angeles,6037206020,Central L.A.,Downtown,Los Angeles,6037206020,True
2,1933383,2016-07-07 10:32:00,2016-07-07 10:37:00,34.052898,-118.24156,5861,Round Trip,Flex Pass,5.0,2016,...,Weekday,Central L.A.,Downtown,Los Angeles,6037207400,Central L.A.,Downtown,Los Angeles,6037207400,True
3,1944197,2016-07-07 10:37:00,2016-07-07 13:38:00,34.052898,-118.24156,5861,Round Trip,Flex Pass,181.0,2016,...,Weekday,Central L.A.,Downtown,Los Angeles,6037207400,Central L.A.,Downtown,Los Angeles,6037207400,True
4,1940317,2016-07-07 12:51:00,2016-07-07 12:58:00,34.049889,-118.25588,6674,Round Trip,Walk-up,7.0,2016,...,Weekday,Central L.A.,Downtown,Los Angeles,6037207710,Central L.A.,Downtown,Los Angeles,6037207710,True


In [38]:
trips_with_boundaries.to_csv("bike-share-trips-with-admin-boundaries.csv")


# Choose your target. Which column in your tabular dataset will you predict?

I am predicting whether a rider starts and ends a bike-share trip in the same neighborhood.

In [37]:
data_for_predictions = trips_with_boundaries.drop(["trip_id","start_time","end_time","bike_id", "trip_route_category","Region_end","Neighborhood_end","CensusPlace_end","CensusTract_end"], axis=1)

data_for_predictions.head()

Unnamed: 0,start_lat,start_lon,passholder_type,trip_duration_in_minutes,year,month,day_of_week,hour,UserType,DayType,Region_start,Neighborhood_start,CensusPlace_start,CensusTract_start,Stays in same neighborhood
0,34.05661,-118.23721,Monthly Pass,3.0,2016,7,3,4,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037206020,True
1,34.05661,-118.23721,Monthly Pass,33.0,2016,7,3,6,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037206020,True
2,34.052898,-118.24156,Flex Pass,5.0,2016,7,3,10,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207400,True
3,34.052898,-118.24156,Flex Pass,181.0,2016,7,3,10,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207400,True
4,34.049889,-118.25588,Walk-up,7.0,2016,7,3,12,Customer,Weekday,Central L.A.,Downtown,Los Angeles,6037207710,True


# Is your problem regression or classification?

This is a classification problem.

# How is your target distributed?
- Classification: How many classes? Are the classes imbalanced?
- Regression: Is the target right-skewed? If so, you may want to log transform the target.

This is a binary classification problem (two classes), and the classes are imbalanced: roughly 85% of trips stay in the same neighborhood and roughly 15% do not.

In [43]:
data_for_predictions["Stays in same neighborhood"].value_counts(normalize=True)*100

True     84.728551
False    15.271449
Name: Stays in same neighborhood, dtype: float64

# Choose your evaluation metric(s).
- Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?


Since the majority class makes up about 85% of trips, well outside the 50-70% rule of thumb, accuracy alone could be misleading. I will report precision and recall alongside accuracy.
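As a rough illustration of why accuracy alone can mislead here, the sketch below scores a "model" that always predicts the majority class. The 85/15 counts are illustrative round numbers, not the real trip counts.

```python
# Illustrative counts only: 85 same-neighborhood trips, 15 inter-neighborhood trips
y_true = [True] * 85 + [False] * 15
# A majority-class "model" predicts True for every trip
y_pred = [True] * len(y_true)

# Accuracy: share of trips where the prediction matches the label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the minority (False) class:
# minority trips the model found / minority trips that actually exist
found_minority = sum((t is False) and (p is False) for t, p in zip(y_true, y_pred))
minority_recall = found_minority / y_true.count(False)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The accuracy looks respectable even though the model never identifies a single inter-neighborhood trip, which is the case precision and recall are meant to expose.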

# Choose which observations you will use to train, validate, and test your model.
- Are some observations outliers? Will you exclude them?
- Will you do a random split or a time-based split?

Since this is a relatively large dataset, I will draw a random 20% sample of all rides and split it randomly into training, validation, and test sets. Outliers have already been removed (trips outside the coarse coordinate bounds around Los Angeles, and rides lasting 24 hours or longer). Because "trip_duration_in_minutes" is unknown until the ride is complete, this feature would leak information about the outcome into the model, so I remove it before modeling.

In [45]:
data_for_predictions.drop(["trip_duration_in_minutes"], axis=1, inplace=True)
data_for_predictions.head()

Unnamed: 0,start_lat,start_lon,passholder_type,year,month,day_of_week,hour,UserType,DayType,Region_start,Neighborhood_start,CensusPlace_start,CensusTract_start,Stays in same neighborhood
0,34.05661,-118.23721,Monthly Pass,2016,7,3,4,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037206020,True
1,34.05661,-118.23721,Monthly Pass,2016,7,3,6,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037206020,True
2,34.052898,-118.24156,Flex Pass,2016,7,3,10,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207400,True
3,34.052898,-118.24156,Flex Pass,2016,7,3,10,Subscriber,Weekday,Central L.A.,Downtown,Los Angeles,6037207400,True
4,34.049889,-118.25588,Walk-up,2016,7,3,12,Customer,Weekday,Central L.A.,Downtown,Los Angeles,6037207710,True


In [69]:
sample_for_predictions = data_for_predictions.sample(frac=.20)

In [70]:
train, test = train_test_split(
    sample_for_predictions, 
    train_size=0.80, 
    test_size=0.20, 
    random_state=42)

train, validate = train_test_split(
    train, 
    train_size=0.80, 
    test_size=0.20, 
    random_state=42)

In [71]:
train.shape, validate.shape, test.shape

((115900, 14), (28976, 14), (36219, 14))
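The split above is purely random. With a class balance of roughly 85/15, one option worth considering is a stratified split, which keeps the class proportions the same in every subset. The sketch below is an assumption-laden illustration on synthetic data (the `demo` frame is made up), not what this notebook actually ran.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sampled trips: 1,000 rows, 85% True / 15% False
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "hour": rng.integers(0, 24, size=1000),
    "Stays in same neighborhood": [True] * 850 + [False] * 150,
})

# stratify= keeps the class proportions the same on both sides of each split
train_demo, test_demo = train_test_split(
    demo, test_size=0.20, random_state=42,
    stratify=demo["Stays in same neighborhood"])
train_demo, validate_demo = train_test_split(
    train_demo, test_size=0.20, random_state=42,
    stratify=train_demo["Stays in same neighborhood"])

for name, part in [("train", train_demo),
                   ("validate", validate_demo),
                   ("test", test_demo)]:
    share_true = part["Stays in same neighborhood"].mean()
    print(name, len(part), round(share_true, 3))
```

Passing `random_state` to every sampling and splitting step also makes the subsets reproducible from run to run.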

In [72]:
# The 'Stays in same neighborhood' column is the target
target = 'Stays in same neighborhood'

# Get a dataframe with all train columns except the target
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features (no cardinality threshold is applied here)
categorical_features = cardinality.index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
features

['start_lat',
 'start_lon',
 'year',
 'month',
 'day_of_week',
 'hour',
 'passholder_type',
 'UserType',
 'DayType',
 'Region_start',
 'Neighborhood_start',
 'CensusPlace_start',
 'CensusTract_start']
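The cell above keeps every non-numeric column regardless of how many unique values it has. If a cardinality cutoff were wanted, it could be sketched as follows. The threshold of 50 and the tiny `demo` frame are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical mini-frame: one low-cardinality and one high-cardinality column
demo = pd.DataFrame({
    "UserType": ["Subscriber", "Customer"] * 50,
    "CensusTract_start": [f"tract_{i}" for i in range(100)],
    "hour": list(range(24)) * 4 + [0, 1, 2, 3],
})

# Number of unique values in each non-numeric column
cardinality = demo.select_dtypes(exclude="number").nunique()

# Keep only the categorical columns at or below the (assumed) threshold
max_cardinality = 50
low_cardinality_features = cardinality[cardinality <= max_cardinality].index.tolist()

print(low_cardinality_features)
```

Whether to filter at all is a modeling choice: an ordinal encoder followed by a tree-based model can often tolerate high-cardinality columns, while one-hot encoding usually cannot.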

Baseline accuracy, precision, and recall for majority classifier

In [95]:
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)

print("\n",classification_report(y_train, y_pred))


               precision    recall  f1-score   support

       False       0.00      0.00      0.00     17769
        True       0.85      1.00      0.92     98131

    accuracy                           0.85    115900
   macro avg       0.42      0.50      0.46    115900
weighted avg       0.72      0.85      0.78    115900



In [96]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_validate = validate[features]
y_validate = validate[target]
X_test = test[features]
y_test = test[target]

In [75]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(solver='lbfgs', n_jobs=-1, random_state=8)
)

pipeline.fit(X_train, y_train)

print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_validate, y_validate))

Train Accuracy 0.8441501294219155
Validation Accuracy 0.8450441744892324


In [77]:
y_pred = pipeline.predict(X_validate)
print(classification_report(y_validate, y_pred))

              precision    recall  f1-score   support

       False       0.41      0.06      0.11      4377
        True       0.86      0.98      0.92     24599

    accuracy                           0.85     28976
   macro avg       0.63      0.52      0.51     28976
weighted avg       0.79      0.85      0.79     28976



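This is a poor model for predicting riders who finish their bike-share rides in a different neighborhood: recall for the False class is only 0.06, and validation accuracy (about 0.85) is no better than the majority-class baseline.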
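One adjustment that is sometimes tried when a linear model ignores the minority class is `class_weight="balanced"`, which reweights training examples inversely to class frequency. The sketch below uses synthetic data from `make_classification` rather than the bike-share trips, so the numbers it prints say nothing about this project. It only shows the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% class 0, 85% class 1); not the real trips
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.15, 0.85], random_state=8)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=8)

recalls = {}
for class_weight in [None, "balanced"]:
    model = LogisticRegression(
        class_weight=class_weight, max_iter=1000, random_state=8)
    model.fit(X_tr, y_tr)
    # pos_label=0 reports recall for the minority class
    recalls[str(class_weight)] = recall_score(
        y_val, model.predict(X_val), pos_label=0)

for name, value in recalls.items():
    print(f"class_weight={name}: minority recall = {value:.2f}")
```

Reweighting typically trades some precision and overall accuracy for higher minority recall, so it should be judged against the metric chosen earlier rather than accuracy alone.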

In [81]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    DecisionTreeClassifier(random_state=8)
)

pipeline.fit(X_train, y_train)

print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_validate, y_validate))
y_pred = pipeline.predict(X_validate)
print("\n",classification_report(y_validate, y_pred))

Train Accuracy 0.9793270060396894
Validation Accuracy 0.8414204859193816

               precision    recall  f1-score   support

       False       0.48      0.53      0.50      4377
        True       0.91      0.90      0.91     24599

    accuracy                           0.84     28976
   macro avg       0.70      0.71      0.70     28976
weighted avg       0.85      0.84      0.84     28976



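This model is noticeably better at detecting inter-neighborhood rides (recall for the False class rises from 0.06 to 0.53), but the gap between train accuracy (0.98) and validation accuracy (0.84) suggests the tree is overfitting.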
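A large gap between train and validation accuracy is the usual sign that an unrestricted tree has memorized its training data. The sketch below compares a fully grown tree with a depth-limited one on synthetic data. The `max_depth=5` value is an arbitrary assumption, and the data is not the bike-share set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced data for illustration only
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.15, 0.85],
    flip_y=0.05, random_state=8)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=8)

scores = {}
for max_depth in [None, 5]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=8)
    tree.fit(X_tr, y_tr)
    # Store (train accuracy, validation accuracy) for each setting
    scores[max_depth] = (tree.score(X_tr, y_tr), tree.score(X_val, y_val))

for max_depth, (train_acc, val_acc) in scores.items():
    print(f"max_depth={max_depth}: train={train_acc:.3f}, "
          f"validation={val_acc:.3f}, gap={train_acc - val_acc:.3f}")
```

Comparing the gap rather than either score alone gives a quick read on how much of the training performance generalizes.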

In [84]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    RandomForestClassifier(random_state=8, n_jobs=-1)
)

pipeline.fit(X_train, y_train)

print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_validate, y_validate))
y_pred = pipeline.predict(X_validate)
print("\n",classification_report(y_validate, y_pred))

Train Accuracy 0.9792924935289042
Validation Accuracy 0.8850427940364439

               precision    recall  f1-score   support

       False       0.65      0.51      0.57      4377
        True       0.92      0.95      0.93     24599

    accuracy                           0.89     28976
   macro avg       0.78      0.73      0.75     28976
weighted avg       0.88      0.89      0.88     28976



In [85]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    RandomForestClassifier(random_state=8, n_jobs=-1, min_samples_leaf=2)
)

pipeline.fit(X_train, y_train)

print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_validate, y_validate))
y_pred = pipeline.predict(X_validate)
print("\n",classification_report(y_validate, y_pred))

Train Accuracy 0.9303106125970665
Validation Accuracy 0.8948785201546107

               precision    recall  f1-score   support

       False       0.73      0.48      0.58      4377
        True       0.91      0.97      0.94     24599

    accuracy                           0.89     28976
   macro avg       0.82      0.73      0.76     28976
weighted avg       0.89      0.89      0.89     28976



In [86]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

In [87]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(), 
    StandardScaler(),
    RandomForestClassifier(random_state=8)
)

param_distributions = {
    'randomforestclassifier__min_samples_leaf': randint(1, 1000),  
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': randint(50, 500), 
    'randomforestclassifier__max_depth': randint(1, 100), 
    'randomforestclassifier__max_features': uniform(0, 1), 
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=30, 
    cv=3, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

print('Best hyperparameters', search.best_params_)
# scoring='accuracy' above, so best_score_ is the mean cross-validated accuracy
print('Cross-validation accuracy', search.best_score_)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   48.5s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done  85 out of  90 | elapsed:  7.8min remaining:   27.5s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  8.3min finished


Best hyperparameters {'randomforestclassifier__max_depth': 20, 'randomforestclassifier__max_features': 0.23006352593049395, 'randomforestclassifier__min_samples_leaf': 27, 'randomforestclassifier__n_estimators': 197, 'simpleimputer__strategy': 'median'}
Cross-validation MAE -0.893477128448014
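`best_score_` always reports the mean cross-validated value of whatever metric the `scoring` argument names, so the printed label should match that argument. Because the metric of interest here is recall on the minority class, one possibility is to search on that directly with a custom scorer. The sketch below is a small illustration on synthetic data with a deliberately tiny search space. None of its settings come from this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data (label 0 is the minority class); not the real trips
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.15, 0.85], random_state=8)

# Score each candidate by recall on the minority class
minority_recall = make_scorer(recall_score, pos_label=0)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=8),
    param_distributions={
        "n_estimators": randint(10, 40),
        "min_samples_leaf": randint(1, 20),
    },
    n_iter=3,
    cv=3,
    scoring=minority_recall,
    random_state=8,
)
demo_search.fit(X_demo, y_demo)

print("Best hyperparameters", demo_search.best_params_)
print("Cross-validation minority recall", demo_search.best_score_)
```

Optimizing for minority recall alone can push a search toward models that over-predict the minority class, so it is usually worth checking precision on the winning candidate as well.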


In [89]:
print('Train Accuracy', search.score(X_train, y_train))
print('Validation Accuracy', search.score(X_validate, y_validate))
y_pred = search.predict(X_validate)
print("\n",classification_report(y_validate, y_pred))

Train Accuracy 0.896160483175151
Validation Accuracy 0.895810325786858

               precision    recall  f1-score   support

       False       0.76      0.45      0.57      4377
        True       0.91      0.97      0.94     24599

    accuracy                           0.90     28976
   macro avg       0.83      0.71      0.75     28976
weighted avg       0.89      0.90      0.88     28976



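This is a much better model than the majority classifier: validation accuracy is 0.90 versus a baseline of about 0.85, and recall on inter-neighborhood trips is 0.45 versus 0.00 for the baseline. Random Forests ftw!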
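For readers checking the final classification report, the F1 and macro-average columns can be reproduced from the precision and recall columns. The sketch below uses the rounded values printed in that report, so the results are themselves approximate.

```python
# Precision and recall as printed (rounded) in the final classification report
precision_false, recall_false = 0.76, 0.45
recall_true = 0.97

# F1 is the harmonic mean of precision and recall
f1_false = 2 * precision_false * recall_false / (precision_false + recall_false)

# Macro average weights both classes equally, regardless of support
macro_recall = (recall_false + recall_true) / 2

print(round(f1_false, 2), round(macro_recall, 2))
```

Both values agree with the report (0.57 for the False-class F1 and 0.71 for macro-average recall), which shows how strongly the weak minority recall pulls the macro average below the headline accuracy.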