
Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, and Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
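
The percentile trimming above can be tried on a small array to see how many rows survive. This is a minimal sketch, assuming a made-up `prices` array (hypothetical, not taken from the renthop data):

```python
import numpy as np

# Hypothetical toy prices 1..1000, chosen so the percentiles are easy to reason about.
prices = np.arange(1, 1001, dtype=float)

# Keep values between the 0.5th and 99.5th percentiles,
# which drops the most extreme 1% of values (0.5% from each tail).
low, high = np.percentile(prices, [0.5, 99.5])
kept = prices[(prices >= low) & (prices <= high)]

print(f"bounds: {low:.3f} to {high:.3f}, rows kept: {len(kept)} of {len(prices)}")
```

The same pattern applies to latitude and longitude, just with tighter bounds (0.05 and 99.95).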

### Train / Test Split

In [191]:
df.description.value_counts()

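[Output truncated: the most frequent description is blank (1627 listings), followed by an entry beginning `<p><a  website_redacted`; the remaining counts were not preserved.]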

In [0]:
def check_month(created):
  """Return the two-digit month from a 'YYYY-MM-DD ...' timestamp string."""
  return created.split("-")[1]

In [0]:
months = df['created'].apply(check_month)    # Compute each listing's month once
train = df.loc[months.isin(["04", "05"])]    # Only the months of April and May
test = df.loc[months == "06"]                # Only the month of June
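
An alternative to splitting the string by hand is to parse `created` as a datetime and compare it against a cutoff date. A minimal sketch, assuming `created` holds strings like `"2016-04-17 03:26:41"`; the `toy` frame and its values are hypothetical stand-ins for the real listings:

```python
import pandas as pd

# Hypothetical miniature of the listings table; only `created` and `price` are mimicked.
toy = pd.DataFrame({
    "created": ["2016-04-17 03:26:41", "2016-05-02 11:00:00", "2016-06-24 07:54:24"],
    "price": [3000, 2500, 4100],
})

# Parse the timestamps once, then split on a cutoff date.
created = pd.to_datetime(toy["created"])
cutoff = pd.Timestamp("2016-06-01")

toy_train = toy[created < cutoff]    # April and May
toy_test = toy[created >= cutoff]    # June

print(len(toy_train), len(toy_test))
```

Comparing against a date also respects the year, which matching on the month substring alone does not.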

## Feature Engineering

### 1

In [194]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [0]:
def is_premium(row):
  """
  Flag a listing as premium when it has every qualifying amenity.
  Intended for DataFrame.apply(..., axis=1), so `row` is a single listing.
  """
  amenities = ['swimming_pool', 'fitness_center', 'laundry_in_unit',
               'high_speed_internet', 'doorman', 'elevator', 'dishwasher']

  # Each amenity column is 0 or 1, so the sum counts how many the listing has
  amenities_sum = sum(row[amenity] for amenity in amenities)

  # Premium only if the listing has all of the amenities
  row['is_premium'] = 1 if amenities_sum == len(amenities) else 0

  return row

In [0]:
train = train.apply(is_premium, axis=1)
test = test.apply(is_premium, axis=1)       # Apply function

In [197]:
train['is_premium'].value_counts()    # Check function worked

0    31513
1      331
Name: is_premium, dtype: int64
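
Row-wise `apply` calls the function once per listing, which gets slow on tens of thousands of rows. The same flag can be computed column-wise. A minimal sketch on a hypothetical two-row `toy` frame, assuming the amenity columns hold 0/1 values:

```python
import pandas as pd

AMENITIES = ["swimming_pool", "fitness_center", "laundry_in_unit",
             "high_speed_internet", "doorman", "elevator", "dishwasher"]

# Hypothetical toy rows: the first has every amenity, the second lacks a dishwasher.
toy = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 0]],
    columns=AMENITIES,
)

# Sum the 0/1 flags across columns; premium means all of them are present.
toy["is_premium"] = (toy[AMENITIES].sum(axis=1) == len(AMENITIES)).astype(int)

print(toy["is_premium"].tolist())
```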

### 2

In [0]:
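def bed_to_bath_ratio(row):
  # Note: despite the name, this divides bathrooms by bedrooms (baths per bed).
  # The column name is kept because the model below uses it.
  try:
    row['bed_to_bath_ratio'] = row['bathrooms'] / row['bedrooms']
  except ZeroDivisionError:
    # Listings with zero bedrooms get a ratio of 0
    row['bed_to_bath_ratio'] = 0
  return row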

train = train.apply(bed_to_bath_ratio, axis=1)
test = test.apply(bed_to_bath_ratio, axis=1)
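
The ratio can also be computed for all rows at once, handling zero bedrooms explicitly instead of catching an exception. A minimal sketch on a hypothetical `toy` frame; like the cell above, it divides bathrooms by bedrooms and assigns 0 where bedrooms is 0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings, including one with zero bedrooms.
toy = pd.DataFrame({"bedrooms": [0, 1, 2, 3], "bathrooms": [1.0, 1.0, 1.0, 2.0]})

beds = toy["bedrooms"].to_numpy(dtype=float)
baths = toy["bathrooms"].to_numpy(dtype=float)

# Divide only where bedrooms is nonzero; other positions keep the 0 from `out`.
toy["bed_to_bath_ratio"] = np.divide(
    baths, beds, out=np.zeros_like(baths), where=beds != 0
)

print(toy)
```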

### 3

In [0]:
def interest_level_to_numeric(row):
  # Encode the ordered categories as integers: low < medium < high
  if row['interest_level'] == 'high':
    row['interest_level'] = 3
  elif row['interest_level'] == 'medium':
    row['interest_level'] = 2
  elif row['interest_level'] == 'low':
    row['interest_level'] = 1

  return row

In [0]:
train = train.apply(interest_level_to_numeric, axis=1)
test = test.apply(interest_level_to_numeric, axis=1)
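
An ordinal encoding like this is often written as a dictionary lookup with `Series.map`. A minimal sketch on hypothetical data; note one difference from the function above: any value missing from the dictionary becomes `NaN` rather than being left unchanged.

```python
import pandas as pd

# Ordered categories mapped to integers: low < medium < high.
LEVELS = {"low": 1, "medium": 2, "high": 3}

# Hypothetical toy column standing in for the real `interest_level`.
toy = pd.DataFrame({"interest_level": ["low", "high", "medium", "low"]})
toy["interest_level"] = toy["interest_level"].map(LEVELS)

print(toy["interest_level"].tolist())
```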

### 4

In [0]:
def is_long_description(row):
  description = row['description']
  try:
    # Flag descriptions longer than 200 characters
    row['long_description'] = 1 if len(description) > 200 else 0
  except TypeError:
    # Missing descriptions (NaN) have no length
    row['long_description'] = 0
  return row

train = train.apply(is_long_description, axis=1)
test = test.apply(is_long_description, axis=1)
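
The description-length flag can be built without a `try`/`except` by using the `.str` accessor, which returns `NaN` for missing values. A minimal sketch on a hypothetical `toy` frame, keeping the 200-character threshold from the function above:

```python
import pandas as pd

# Hypothetical descriptions: one long, one short, one missing.
toy = pd.DataFrame({"description": ["x" * 250, "short", None]})

# .str.len() yields NaN for missing descriptions; treat those as length 0.
lengths = toy["description"].str.len().fillna(0)
toy["long_description"] = (lengths > 200).astype(int)

print(toy["long_description"].tolist())
```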

## Fit Model

- Get the model's coefficients and intercept.
- Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- As always, commit your notebook to your fork of the GitHub repo.

In [0]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [0]:
model = LinearRegression()

features = ['bedrooms', 
            'bathrooms',
            'bed_to_bath_ratio', 
            'doorman',
            'fitness_center',
            'dishwasher',
            'laundry_in_unit',
            'high_speed_internet',
            # 'swimming_pool',
            'elevator',
            'is_premium',
            'interest_level',
            'long_description',
            'longitude',
            'exclusive'
            ]
target = 'price'

X_train = train[features]
X_test = test[features]

y_train = train[target]
y_test = test[target]


In [314]:

model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f"Train MAE: ${mae:,.2f}")

r2 = r2_score(y_train, y_pred)
print(f"Train R-Squared: {r2*100:.2f}%")

mse = mean_squared_error(y_train, y_pred)
rmse = np.sqrt(mse)
print(f"Train RMSE: ${rmse:,.2f}")


Train MAE: $672.73
Train R-Squared: 63.05%
Train RMSE: $1,071.09


In [315]:
y_pred_test = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_test)
print(f"Test MAE: ${mae:,.2f}")

r2 = r2_score(y_test, y_pred_test)
print(f"Test R-Squared: {r2*100:.2f}%")

mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
print(f"Test RMSE: ${rmse:,.2f}")

Test MAE: $677.97
Test R-Squared: 64.02%
Test RMSE: $1,057.47
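
To make the three metrics concrete, they can be written out directly from their definitions. A minimal sketch with NumPy; `regression_metrics` is a hypothetical helper and the toy prices are made up, so this illustrates the formulas rather than replacing the scikit-learn functions used above:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))                  # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot                          # share of variance explained

    return {"MAE": mae, "RMSE": rmse, "R2": r2}

# Hypothetical rents and predictions.
toy_metrics = regression_metrics([1000, 2000, 3000], [1100, 1900, 3000])
print(toy_metrics)
```

Calling the same helper once with the train arrays and once with the test arrays gives the side-by-side comparison the assignment asks for.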


In [0]:
coeff = model.coef_.tolist()

In [317]:
coeff

[482.0804362656699,
 1738.15917786561,
 58.53178346527551,
 440.29708639408113,
 38.82238554998939,
 9.567181153320348,
 452.55750006029643,
 -365.88887117320166,
 132.05164132053324,
 137.49447461814316,
 -427.86606046723773,
 -132.78809620598145,
 -12361.841090144035,
 132.26200739854943]

In [318]:
intercept = model.intercept_
intercept

-913351.5820728872
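
A bare list of coefficients is hard to read, so it helps to pair each one with its feature name; `model.coef_` follows the column order of `X_train`. The sketch below copies the feature list, coefficients, and intercept from the cells above. The longitude value of -73.97 is an assumption (a representative NYC longitude, not a value taken from the data), used only to show why the intercept is so large and negative: it offsets the large positive longitude term.

```python
features = ['bedrooms', 'bathrooms', 'bed_to_bath_ratio', 'doorman',
            'fitness_center', 'dishwasher', 'laundry_in_unit',
            'high_speed_internet', 'elevator', 'is_premium',
            'interest_level', 'long_description', 'longitude', 'exclusive']

coeff = [482.0804362656699, 1738.15917786561, 58.53178346527551,
         440.29708639408113, 38.82238554998939, 9.567181153320348,
         452.55750006029643, -365.88887117320166, 132.05164132053324,
         137.49447461814316, -427.86606046723773, -132.78809620598145,
         -12361.841090144035, 132.26200739854943]

intercept = -913351.5820728872

# Pair names with coefficients and list them by absolute size.
by_feature = dict(zip(features, coeff))
for name, value in sorted(by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>20}: {value:>12,.2f}")

# Assumption: -73.97 is a representative NYC longitude, not a value from the data.
longitude_term = by_feature['longitude'] * -73.97
baseline = intercept + longitude_term
print(f"intercept + longitude term: {baseline:,.2f}")
```

Because longitude barely varies across the city, its coefficient and the intercept are best read together rather than separately.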