Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone can be enough. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

# My Portfolio Dataset

In [14]:
import warnings
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score, mean_squared_error

warnings.filterwarnings('ignore')

## Exploratory Data Analysis

In [15]:
data_url = './data/ab_us_2020.csv'
df = pd.read_csv(data_url)

print(df.shape)
df.head()

(226030, 17)


,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,city
0,38585,Charming Victorian home - twin beds + breakfast,165529,Evelyne,,28804,35.65146,-82.62792,Private room,60,1,138,16/02/20,1.14,1,0,Asheville
1,80905,French Chic Loft,427027,Celeste,,28801,35.59779,-82.5554,Entire home/apt,470,1,114,07/09/20,1.03,11,288,Asheville
2,108061,Walk to stores/parks/downtown. Fenced yard/Pet...,320564,Lisa,,28801,35.6067,-82.55563,Entire home/apt,75,30,89,30/11/19,0.81,2,298,Asheville
3,155305,Cottage! BonPaul + Sharky's Hostel,746673,BonPaul,,28806,35.57864,-82.59578,Entire home/apt,90,1,267,22/09/20,2.39,5,0,Asheville
4,160594,Historic Grove Park,769252,Elizabeth,,28801,35.61442,-82.54127,Private room,125,30,58,19/10/15,0.52,1,0,Asheville


In [3]:
ProfileReport(df)

---

## Data Cleaning

In [4]:
df.describe(include='number')

,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,226030.0,226030.0,226030.0,226030.0,226030.0,226030.0,226030.0,177428.0,226030.0,226030.0
mean,25471760.0,93523850.0,35.662829,-103.220662,219.716529,452.549,34.50653,1.43145,16.698562,159.314856
std,13178140.0,98274220.0,6.849855,26.222091,570.353609,210337.6,63.602914,1.68321,51.068966,140.179628
min,109.0,23.0,18.92099,-159.7149,0.0,1.0,0.0,0.01,1.0,0.0
25%,15158900.0,13992750.0,32.761783,-118.598115,75.0,1.0,1.0,0.23,1.0,0.0
50%,25909160.0,51382660.0,37.261125,-97.8172,121.0,2.0,8.0,0.81,2.0,140.0
75%,37726240.0,149717900.0,40.724038,-76.919322,201.0,7.0,39.0,2.06,6.0,311.0
max,45560850.0,367917600.0,47.73462,-70.99595,24999.0,100000000.0,966.0,44.06,593.0,365.0


In [5]:
# Count listings whose minimum stay is 500 nights or more
mask = df['minimum_nights'] >= 500
df.loc[mask, 'minimum_nights'].value_counts()

1000         18
500          13
1125          6
1124          3
700           2
999           2
600           2
950           1
750           1
1123          1
1250          1
800           1
730           1
100000000     1
Name: minimum_nights, dtype: int64

In [6]:
print(df['minimum_nights'].max())

100000000


In [7]:
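# idxmax returns the index label of the largest value, which is what drop expects
# (argmax returns a position, which only matches the label on a default index)
outlier_index = df['minimum_nights'].idxmax()
df = df.drop(outlier_index)
df.describe(include='number')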

,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,226029.0,226029.0,226029.0,226029.0,226029.0,226029.0,226029.0,177427.0,226029.0,226029.0
mean,25471800.0,93524260.0,35.66282,-103.220577,219.7172,10.129873,34.506647,1.431458,16.698632,159.315561
std,13178150.0,98274240.0,6.849869,26.222118,570.354782,25.251182,63.60303,1.683212,51.069068,140.179538
min,109.0,23.0,18.92099,-159.7149,0.0,1.0,0.0,0.01,1.0,0.0
25%,15158880.0,13992750.0,32.76178,-118.59801,75.0,1.0,1.0,0.23,1.0,0.0
50%,25909160.0,51384450.0,37.2611,-97.81715,121.0,2.0,8.0,0.81,2.0,140.0
75%,37726330.0,149718800.0,40.72404,-76.91921,201.0,7.0,39.0,2.06,6.0,311.0
max,45560850.0,367917600.0,47.73462,-70.99595,24999.0,1250.0,966.0,44.06,593.0,365.0
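
The cell above removes only the single most extreme row, so the maximum `minimum_nights` in the table is still 1250. If more of these long minimum stays should be treated as outliers, a threshold filter is one option. The sketch below is illustrative only: the 365-night cutoff and the `drop_long_minimum_stays` helper are assumptions, not decisions made in this notebook, and the toy frame stands in for `df`.

```python
import pandas as pd

# Hypothetical cutoff: treat a minimum stay longer than a year as an outlier.
MAX_NIGHTS = 365


def drop_long_minimum_stays(frame, max_nights=MAX_NIGHTS):
    """Return a copy of frame without rows whose minimum_nights exceeds max_nights."""
    keep = frame["minimum_nights"] <= max_nights
    return frame.loc[keep].copy()


# Toy data standing in for df; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "minimum_nights": [1, 30, 500, 1250, 100000000],
        "price": [60, 470, 75, 90, 125],
    }
)

filtered = drop_long_minimum_stays(toy)
print(f"{len(toy) - len(filtered)} rows removed, {len(filtered)} rows kept")
```

Applied to the real data, the same call would be `df = drop_long_minimum_stays(df)`. Whether 365 is the right cutoff is a judgment call that depends on how long-term rentals should be handled.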


In [8]:
# Drop columns that will not be used as features
df = df.drop(
  columns=['name', 'host_id', 'host_name', 'neighbourhood_group', 'reviews_per_month']
)
df = df.set_index('id')

print(df.shape)

(226029, 11)


---
## Feature and Target Selection


In [12]:
# Feature and target matrices
target = 'price'
X = df.drop(target, axis=1)
y = df[target]

print(f"Features: {len(X.columns)}")
X.head()

Features: 10


id,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,calculated_host_listings_count,availability_365,city
38585,28804,35.65146,-82.62792,Private room,1,138,16/02/20,1,0,Asheville
80905,28801,35.59779,-82.5554,Entire home/apt,1,114,07/09/20,11,288,Asheville
108061,28801,35.6067,-82.55563,Entire home/apt,30,89,30/11/19,2,298,Asheville
155305,28806,35.57864,-82.59578,Entire home/apt,1,267,22/09/20,5,0,Asheville
160594,28801,35.61442,-82.54127,Private room,30,58,19/10/15,1,0,Asheville
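
The `last_review` feature is stored as text, apparently in day/month/two-digit-year order (for example `16/02/20` and `30/11/19`), so a model cannot use it directly. The sketch below shows one possible conversion to a numeric "days since last review" feature. It assumes that date format holds for the whole column; the `add_days_since_review` helper and the reference date are hypothetical choices for illustration.

```python
import pandas as pd


def add_days_since_review(frame, reference_date):
    """Replace the text last_review column with a numeric days_since_review column."""
    out = frame.copy()
    # Assumed format: day/month/two-digit year. Unparseable or missing values become NaT.
    parsed = pd.to_datetime(out["last_review"], format="%d/%m/%y", errors="coerce")
    out["days_since_review"] = (pd.Timestamp(reference_date) - parsed).dt.days
    return out.drop(columns="last_review")


# Three dates copied from the table above, plus a missing value.
toy = pd.DataFrame({"last_review": ["16/02/20", "07/09/20", "30/11/19", None]})

# Hypothetical reference date; in practice it should not be later than the data snapshot.
result = add_days_since_review(toy, reference_date="2020-10-01")
print(result)
```

Listings with no review end up with a missing value in the new column, which an imputation step would still need to handle.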


---



### ML Problems

Target: price

Predicting the price of an Airbnb unit is a regression problem. To model it, I'll build pipelines around two algorithms, linear regression and random forest regression, and compare their results.
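
A minimal sketch of what such a pipeline could look like, using the scikit-learn classes imported at the top of the notebook. The column groupings, the imputation strategies, the `ColumnTransformer` wrapper, and the `build_pipeline` helper are assumptions made for illustration, not the notebook's final design. The five toy rows are copied from the `df.head()` output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Assumed column groupings; the real feature list may differ.
CATEGORICAL = ["room_type", "city"]
NUMERIC = ["latitude", "longitude", "minimum_nights"]


def build_pipeline(model):
    """Wrap a regressor in shared preprocessing for text and numeric columns."""
    categorical_steps = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        # Categories unseen during fit are encoded as -1 instead of raising an error.
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    )
    numeric_steps = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    preprocess = ColumnTransformer(
        [("cat", categorical_steps, CATEGORICAL), ("num", numeric_steps, NUMERIC)]
    )
    return make_pipeline(preprocess, model)


# Toy rows taken from the first five listings shown earlier.
toy_X = pd.DataFrame(
    {
        "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                      "Entire home/apt", "Private room"],
        "city": ["Asheville"] * 5,
        "latitude": [35.65146, 35.59779, 35.6067, 35.57864, 35.61442],
        "longitude": [-82.62792, -82.5554, -82.55563, -82.59578, -82.54127],
        "minimum_nights": [1, 1, 30, 1, 30],
    }
)
toy_y = pd.Series([60, 470, 75, 90, 125], name="price")

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=10, random_state=0),
}
predictions = {
    name: build_pipeline(model).fit(toy_X, toy_y).predict(toy_X)
    for name, model in models.items()
}
for name, values in predictions.items():
    print(name, values.round(1))
```

Sharing one preprocessing definition between both models keeps the comparison fair, since any difference in score then comes from the regressor rather than from the data preparation.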


Metrics: R^2 score and mean squared error (MSE)
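
To make these metrics concrete, the sketch below computes them by hand for a mean baseline, which is a common reference point for regression models. The `mse` and `r2` helpers are illustrative re-implementations of the standard formulas (the same quantities `mean_squared_error` and `r2_score` report), and the five prices are the ones shown in the `df.head()` output.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of variance explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Prices of the first five listings.
prices = np.array([60, 470, 75, 90, 125])

# Baseline: always predict the mean price.
baseline = np.full(len(prices), prices.mean())

baseline_mse = mse(prices, baseline)
baseline_rmse = baseline_mse ** 0.5  # same units as price, easier to interpret
baseline_r2 = r2(prices, baseline)

print(f"MSE: {baseline_mse:.1f}, RMSE: {baseline_rmse:.1f}, R^2: {baseline_r2:.3f}")
```

A mean baseline scores an R^2 of 0 on the data it was computed from, so any model worth keeping should score above 0 on held-out data and have a lower MSE than the baseline.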

---
## Training, Validation, and Testing Datasets

In [11]:
# Split into training, validation, and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=122995
)

X_train, X_val, y_train, y_val = train_test_split(
  X_train, y_train, test_size=0.2, random_state=122995
)

print(f"""
  Training dataset: {len(X_train) / len(df) *100}%
  Validation dataset: {len(X_val) / len(df) *100}%
  Testing dataset: {len(X_test) / len(df) *100}%
""")


  Training dataset: 63.99975224418105%
  Validation dataset: 16.00015927159789%
  Testing dataset: 20.000088484221052%
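
The 64/16/20 proportions follow from applying `test_size=0.2` twice: the first split holds out 20% of all rows for testing, and the second holds out 20% of the remaining 80% (16% of the total) for validation. The sketch below reproduces the row counts with plain arithmetic. It assumes the held-out count is rounded up, which is consistent with the percentages printed above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, validation, test) row counts for two successive splits."""
    n_test = math.ceil(n_rows * test_size)       # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_size)    # second split: hold out validation
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_rows = 226029  # rows left after dropping the outlier
n_train, n_val, n_test = split_sizes(n_rows)

for name, count in [("Training", n_train), ("Validation", n_val), ("Testing", n_test)]:
    print(f"{name}: {count} rows ({count / n_rows:.2%})")
```

To get a different ratio, such as 60/20/20, the second `test_size` would need to be 0.25, because it is a fraction of the rows remaining after the first split rather than of the whole dataset.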

