Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [x] Continue to clean and explore your data. 
- [x] For the evaluation metric you chose, what score would you get just by guessing?
- [x] Can you make a fast, first model that beats guessing?

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep only the subset for the Tribeca neighborhood and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [71]:
# imports

import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install plotly
# The PyPI package is named scikit-learn; it is imported as sklearn
!{sys.executable} -m pip install scikit-learn
# math is part of the standard library, so it needs no pip install
!{sys.executable} -m pip install category_encoders==2.*
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from category_encoders.ordinal import OrdinalEncoder

%matplotlib inline

In [5]:
# original dataframe

basedf = pd.read_csv('../module1-define-ml-problems/data/vehicles.csv')

In [60]:
# copy the dataframe so the original stays untouched

df = basedf.copy()
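
Plain assignment only binds a second name to the same DataFrame, whereas `.copy()` creates an independent object. The sketch below illustrates the difference on a tiny made-up frame; the column values are hypothetical and stand in for the vehicles data.

```python
import pandas as pd

# Toy stand-in for the vehicles data (hypothetical values).
base = pd.DataFrame({"price": [0, 5000, 12000]})

alias = base          # same object under a second name
clone = base.copy()   # independent copy (deep by default)

# An in-place change made through the alias shows up on the original...
alias.loc[0, "price"] = 999
# ...while a change made to the copy does not.
clone.loc[1, "price"] = -1

print(alias is base)           # the alias and the original are one object
print(base.loc[0, "price"])    # changed through the alias
print(base.loc[1, "price"])    # untouched by the edit to the copy
```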

In [61]:
# Drop county because it has no filled values; the other dropped columns aren't relevant
# Replace price values of 0 with NaN, since we don't want to keep these observations,
# then drop those rows

df = df.drop(['county', 'url', 'region_url', 'image_url', 'description'], axis=1)
df['price'] = df['price'].replace(0, np.nan)
df = df.dropna(axis=0, subset=['price'])

In [62]:
# Remove extreme outliers: outrageously high prices, and very low prices posted to bait responses

low, high = np.percentile(df['price'], [1, 99.9])
df = df[df['price'].between(low, high)]  # inclusive on both ends
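
To make the percentile filter easier to inspect, here is a minimal sketch on a small made-up price array. The helper name `trim_by_percentile` and the sample values are assumptions for illustration, and the 10th/90th percentile band is used only because the toy array is tiny.

```python
import numpy as np

def trim_by_percentile(values, lower_pct=1, upper_pct=99.9):
    """Keep values inside the [lower_pct, upper_pct] percentile band (inclusive)."""
    values = np.asarray(values, dtype=float)
    # Compute both cutoffs once, on the untrimmed data.
    low, high = np.percentile(values, [lower_pct, upper_pct])
    mask = (values >= low) & (values <= high)
    return values[mask], (low, high)

# Hypothetical listing prices with one bait-low and one absurdly high entry.
prices = np.array([1, 500, 4000, 7500, 12000, 18000, 25000, 40000, 65000, 9_999_999])

kept, (low, high) = trim_by_percentile(prices, lower_pct=10, upper_pct=90)
print("cutoffs:", low, high)
print("kept:", kept)
```

Printing the cutoffs alongside the surviving values is a quick way to check that the band removes the listings you intended and nothing else.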

In [16]:
len(df['price'])

400716

In [65]:
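# Baseline: predict the mean price for every row.
# The second value is sqrt(MAE), not a true RMSE (which would be the square root of the MSE).
baseline_pred = [df['price'].mean()] * len(df)
print("Baseline MAE:\n", mean_absolute_error(df['price'], baseline_pred))
print("Baseline sqrt(MAE):\n", math.sqrt(mean_absolute_error(df['price'], baseline_pred)))

Baseline MAE:
 8320.055332824919
Baseline sqrt(MAE):
 91.21433732053815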
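
For reference, RMSE is the square root of the mean squared error, which is a different quantity from the square root of the MAE. The sketch below computes both for a mean-price baseline using only the standard library; the price list is made up for illustration and is not drawn from the vehicles data.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, then take the root."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Hypothetical prices; the baseline guess is their mean for every row.
prices = [2000, 4000, 6000, 8000, 30000]
mean_price = sum(prices) / len(prices)
baseline_pred = [mean_price] * len(prices)

baseline_mae = mae(prices, baseline_pred)
baseline_rmse = rmse(prices, baseline_pred)

print("MAE:      ", baseline_mae)
print("RMSE:     ", baseline_rmse)
print("sqrt(MAE):", math.sqrt(baseline_mae))  # not the same thing as RMSE
```

RMSE is in the same units as the target and is never smaller than MAE, so a value far below the MAE is a sign that the wrong formula was used. With scikit-learn, `math.sqrt(mean_squared_error(y_true, y_pred))` gives the same quantity as `rmse` here.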


In [66]:
X = df.drop('price', axis=1)
y = df['price']

In [67]:
# Split train and validation set
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.2,
                                                  random_state=42)

In [72]:
model = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    LinearRegression()
)

In [73]:
model.fit(X_train, y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['region', 'manufacturer', 'model',
                                      'condition', 'cylinders', 'fuel',
                                      'title_status', 'transmission', 'vin',
                                      'drive', 'size', 'type', 'paint_color',
                                      'state'],
                                mapping=[{'col': 'region',
                                          'data_type': dtype('O'),
                                          'mapping': omaha / council bluffs      1
corvallis/albany            2
orlando                     3
medford-ashland             4
san diego                   5
                         ... 
la salle co               399
oneonta                   4...
dtype: int64},
                                         {'col': 'state',
                                          'data_type': dtype('O'),
                                          'mapping': ia  

In [87]:
# Note: the sqrt values below are sqrt(MAE), not true RMSE (which would be the square root of the MSE).
baseline_pred = [df['price'].mean()] * len(df)
print("Baseline MAE:\n", mean_absolute_error(df['price'], baseline_pred))
print("Baseline sqrt(MAE):\n", math.sqrt(mean_absolute_error(df['price'], baseline_pred)))

train_mae = mean_absolute_error(y_train, model.predict(X_train))
print('\nTraining MAE:\n', train_mae)
print('Training sqrt(MAE):\n', math.sqrt(train_mae))

val_mae = mean_absolute_error(y_val, model.predict(X_val))
print('\nValidation MAE:\n', val_mae)
print('Validation sqrt(MAE):\n', math.sqrt(val_mae))

Baseline MAE:
 8320.055332824919
Baseline sqrt(MAE):
 91.21433732053815

Training MAE:
 7004.030282686361
Training sqrt(MAE):
 83.69008473341607

Validation MAE:
 6952.354494747344
Validation sqrt(MAE):
 83.3807801279608
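
To put these numbers in context, the sketch below computes how far each MAE falls below the baseline MAE, using the values printed above. The helper name `relative_improvement` is an assumption for illustration.

```python
# MAE values copied from the output above.
baseline_mae = 8320.055332824919
train_mae = 7004.030282686361
val_mae = 6952.354494747344

def relative_improvement(baseline, score):
    """Fraction by which an error score falls below the baseline error."""
    return (baseline - score) / baseline

train_gain = relative_improvement(baseline_mae, train_mae)
val_gain = relative_improvement(baseline_mae, val_mae)

print(f"Training MAE improvement over baseline:   {train_gain:.1%}")
print(f"Validation MAE improvement over baseline: {val_gain:.1%}")
```

On these figures the validation MAE is roughly 16% lower than the baseline, so the fast first model does beat guessing the mean price.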
