Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
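
As a rough illustration of the "just guessing" question above, here is a minimal sketch for a regression target scored with mean absolute error. The arrays are made-up toy values, not taken from any dataset in this notebook, and "guessing" is assumed to mean always predicting the training mean.

```python
import numpy as np

# Toy target values (hypothetical, for illustration only)
y_train = np.array([10.0, 20.0, 30.0, 40.0])
y_test = np.array([15.0, 35.0])

# "Guessing" for a regression problem: always predict the training mean
guess = y_train.mean()

# Mean absolute error of that constant guess on the held-out targets
baseline_mae = np.mean(np.abs(y_test - guess))
print(f"Guess: {guess}, baseline MAE: {baseline_mae}")
```

Any first model is only worth keeping if its error on held-out data comes in below this number.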

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep only the Tribeca neighborhood subset and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)
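
The Tribeca subsetting step might look roughly like the sketch below. The column names `NEIGHBORHOOD` and `SALE_PRICE`, the toy rows, and the price bounds are all assumptions for illustration; check the real file's headers and pick cutoffs that suit the data.

```python
import pandas as pd

# Toy stand-in for the property sales data (hypothetical rows and column names)
df = pd.DataFrame({
    "NEIGHBORHOOD": ["TRIBECA", "TRIBECA", "SOHO", "TRIBECA", "TRIBECA"],
    "SALE_PRICE": [0, 1_500_000, 2_000_000, 950_000_000, 3_200_000],
})

# Keep only the Tribeca rows (normalizing whitespace and case first)
tribeca = df[df["NEIGHBORHOOD"].str.strip().str.upper() == "TRIBECA"]

# Drop implausible sale prices; these bounds are arbitrary example values
low, high = 100_000, 50_000_000
tribeca = tribeca[tribeca["SALE_PRICE"].between(low, high)]

print(tribeca)
```

The zero-dollar sale and the extreme outlier are both filtered out here, which is the kind of dirty data the instructions refer to.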

# Imports

In [1]:
import pandas as pd
import numpy as np

import pandas_profiling

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV

from category_encoders import OneHotEncoder

# Read in data

In [10]:
import os
from pathlib import Path

__location__ = Path(os.getcwd()).parent
__data_dir__ = __location__ / 'data'

/home/allan/workspace/lambda/DS-Unit-2-Applied-Modeling/data


In [7]:
from functions import wrangle  # TODO: figure out how to add wrangle to path
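
One possible way to make a local `functions.py` importable is to put its folder on `sys.path` before the import. This is only a sketch: the `src` folder name is a hypothetical placeholder, since the notebook does not say where `functions.py` actually lives.

```python
import sys
from pathlib import Path

# Hypothetical folder that contains functions.py; adjust to the real location
module_dir = Path.cwd().parent / "src"

# Prepend it to the import search path, once
if str(module_dir) not in sys.path:
    sys.path.insert(0, str(module_dir))

# After this, the import should resolve if functions.py is in module_dir:
# from functions import wrangle
```

Installing the project as an editable package would be a more durable alternative, but the path tweak is usually enough inside a notebook.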

In [None]:
path = r"C:\Users\allan\OneDrive\Desktop\data_sets\28524_45582_compressed_AnimeList.csv\AnimeList.csv"  # TODO: change path
df = pd.read_csv(path)

# Clean data

In [None]:
clean = wrangle(df)

# EDA

In [None]:
clean.profile_report()

# Split X and y

In [None]:
y = clean['rank'].copy()
X = clean.drop('rank',axis=1).copy()

# Train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2,random_state=42)

# Model (RandomForestRegressor)

In [None]:
forest = make_pipeline(OneHotEncoder(use_cat_names=True),RandomForestRegressor(random_state=42,n_jobs=-1))
forest.fit(X_train,y_train);

In [None]:
baseline = [y_train.mean()]

print(f"Forest train : {forest.score(X_train,y_train)}")
print(f"Forest test: {forest.score(X_test,y_test)}")
print()
print(f"Base MAE: {mean_absolute_error(y_train,baseline*len(y_train))}\n")
print(f"Train MAE: {mean_absolute_error(y_train,forest.predict(X_train))}")
print(f"Test MAE: {mean_absolute_error(y_test,forest.predict(X_test))}")

# Hyperparameters

In [None]:
hyper_forest = make_pipeline(OneHotEncoder(use_cat_names=True),
                             RandomForestRegressor(random_state=42,min_samples_split=2,
                                                   n_estimators=200,n_jobs=-1,min_samples_leaf=1,
                                                  max_features='sqrt',max_depth=30))

hyper_forest.fit(X_train,y_train);

In [None]:
print(f"Forest train : {hyper_forest.score(X_train,y_train)}")
print(f"Forest test: {hyper_forest.score(X_test,y_test)}")
print()
print(f"Base MAE: {mean_absolute_error(y_train,baseline*len(y_train))}")
print()
print(f"Train MAE: {mean_absolute_error(y_train,hyper_forest.predict(X_train))}")
print(f"Test MAE: {mean_absolute_error(y_test,hyper_forest.predict(X_test))}")

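Yeah, this is what I thought was weird about hyperparameter tuning. It tries the candidate combinations and returns whichever gave the best score; doesn't that just mean more overfitting? I feel like it should fit on training and then tune on validation, right? (`RandomizedSearchCV` does work this way internally: each candidate is scored by cross-validation on held-out folds of the training data, so the test set should only be used for the final check.)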
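
To make the "fit on training, tune on validation" idea concrete, here is a minimal sketch using an explicit three-way split. The data is synthetic and the candidate `max_depth` values are arbitrary; none of it comes from the dataset used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical stand-in for the real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Hold out a test set first, then carve a validation set out of the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Fit each candidate on train only and score it on validation only
val_scores = {}
for depth in [2, 5, 10, None]:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    val_scores[depth] = mean_absolute_error(y_val, model.predict(X_val))

best_depth = min(val_scores, key=val_scores.get)

# Refit the chosen setting on train + validation, then touch the test set once
final = RandomForestRegressor(n_estimators=50, max_depth=best_depth, random_state=42)
final.fit(X_trainval, y_trainval)
test_mae = mean_absolute_error(y_test, final.predict(X_test))

print(f"Validation MAE by max_depth: {val_scores}")
print(f"Chosen max_depth: {best_depth}, test MAE: {test_mae}")
```

Because the test set plays no part in choosing `max_depth`, the final test MAE remains an honest estimate. Cross-validation applies the same idea but rotates which slice of the training data acts as the validation set.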

# Random Params

In [None]:
rand_param = {'randomforestregressor__n_estimators': [100,200,300],
             'randomforestregressor__max_features':[1.0,'sqrt','log2'],  # 1.0 (all features) replaces 'auto', which newer scikit-learn releases no longer accept
             'randomforestregressor__max_depth':[10,20,30,40,50,None],
             'randomforestregressor__min_samples_split':[2,5,10,15,20],
             'randomforestregressor__min_samples_leaf':[1,2,5,10,15]}

In [None]:
test = make_pipeline(OneHotEncoder(use_cat_names=True),
                     RandomForestRegressor(n_jobs=-1))

rand_grid = RandomizedSearchCV(estimator=test,param_distributions=rand_param,
                               n_iter=500,verbose=2,random_state=42,n_jobs=-1)

In [None]:
#rand_grid.fit(X_train,y_train);

In [None]:
#print(rand_grid.best_params_)

# Tuned model

In [None]:
tuned = make_pipeline(OneHotEncoder(use_cat_names=True),RandomForestRegressor(random_state=42,n_jobs=-1,
                                                                              n_estimators=300,min_samples_split=2,
                                                                              min_samples_leaf=1,max_features='log2',
                                                                              max_depth=20))
tuned.fit(X_train,y_train);

In [None]:
baseline = [y_train.mean()]

print(f"Forest train : {tuned.score(X_train,y_train)}")
print(f"Forest test: {tuned.score(X_test,y_test)}")
print()
print(f"Base MAE: {mean_absolute_error(y_train,baseline*len(y_train))}\n")
print(f"Train MAE: {mean_absolute_error(y_train,tuned.predict(X_train))}")
print(f"Test MAE: {mean_absolute_error(y_test,tuned.predict(X_test))}")

# Graph Feature Importances

In [None]:
import matplotlib.pyplot as plt

In [None]:
encoder = tuned.named_steps['onehotencoder']
encoded_columns = encoder.transform(X_test).columns
importances = pd.Series(tuned.named_steps['randomforestregressor'].feature_importances_, encoded_columns)
plt.figure(figsize=(10,30))
importances.sort_values().plot.barh();