Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
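
The split, fit, and metrics steps in the checklist could look roughly like the sketch below. It runs on a tiny made-up table rather than the real listings. It assumes the `created` column has been parsed with `pd.to_datetime`, and the two features are placeholders, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy stand-in for the rental data (values are made up for illustration).
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-03', '2016-04-20', '2016-05-02',
                               '2016-05-28', '2016-06-05', '2016-06-21']),
    'bedrooms': [1, 2, 3, 0, 2, 1],
    'bathrooms': [1.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    'price': [2500, 3200, 4800, 2000, 4000, 2600],
})

# Split on the listing date: April & May go to train, June goes to test.
cutoff = pd.Timestamp('2016-06-01')
train = toy[toy['created'] < cutoff]
test = toy[toy['created'] >= cutoff]

features = ['bedrooms', 'bathrooms']  # placeholder feature choice
target = 'price'

model = LinearRegression()
model.fit(train[features], train[target])

print('Intercept:', model.intercept_)
print('Coefficients:', dict(zip(features, model.coef_)))

def report(name, X, y):
    """Print and return MAE, RMSE and R^2 for one split."""
    pred = model.predict(X)
    mae = mean_absolute_error(y, pred)
    rmse = np.sqrt(mean_squared_error(y, pred))
    r2 = r2_score(y, pred)
    print(f'{name}: MAE={mae:.2f}, RMSE={rmse:.2f}, R^2={r2:.3f}')
    return mae, rmse, r2

train_scores = report('train', train[features], train[target])
test_scores = report('test', test[features], test[target])
```

Splitting by date instead of at random keeps the test set strictly later than the training set, which is closer to how the model would be used on new listings.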


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
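
A few of these ideas can be sketched in pandas as shown below. The toy rows are invented, the new column names (`has_description`, `description_length`, `cats_and_dogs`, `total_rooms`, `bed_bath_ratio`) are my own, and treating a whitespace-only description as "no description" is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few columns of the rental data.
toy = pd.DataFrame({
    'description': ['Sunny 2BR near the park', '   ', None],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 0.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Description features: missing or whitespace-only text counts as "no description".
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Cats *and* dogs: both flags must be 1.
toy['cats_and_dogs'] = ((toy['cats_allowed'] == 1) & (toy['dogs_allowed'] == 1)).astype(int)

# Total rooms and bed/bath ratio.
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
# Zero bathrooms would divide by zero, so the ratio is left missing (NaN) there.
toy['bed_bath_ratio'] = toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)

print(toy)
```

Linear regression in scikit-learn does not accept NaN, so a ratio feature like this would still need the missing values filled or the rows dropped before fitting.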

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!
- [ ] Create more than 2 features

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [None]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes,
# the most extreme .1% longitudes, &
# listings with more than 5 bathrooms
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))&
        (df['bathrooms'] <= 5)]
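
To see what the percentile bounds do, here is a small sketch on made-up prices from 1 to 1000. The helper name `within_percentiles` is hypothetical. It keeps values between two percentiles, both ends inclusive, which is slightly different from the cell above where the upper latitude bound is strict.

```python
import numpy as np
import pandas as pd

def within_percentiles(series, lower, upper):
    """Boolean mask: True where the value lies between the two percentiles (inclusive)."""
    lo, hi = np.percentile(series, [lower, upper])
    return series.between(lo, hi)

# Made-up prices 1, 2, ..., 1000.
toy = pd.DataFrame({'price': np.arange(1, 1001)})

# Cutting 0.5% from each tail removes the most extreme 1% overall.
mask = within_percentiles(toy['price'], 0.5, 99.5)
trimmed = toy[mask]

print(len(toy), '->', len(trimmed))
```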

In [None]:
df.bathrooms.value_counts()

## Taking Care of Imports

In [None]:
import numpy as np
import pandas as pd

import itertools

import plotly.graph_objs as go
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


* Data Exploration
* What data can be combined 
* New features

In [None]:
df.head()

## **1. Engineer at least two new features**


* Already engineered one feature in my previous notebook: the Haversine point, which collapses latitude and longitude into a single number (the great-circle distance from the (0, 0) coordinate, taking Earth's curvature into account)
* Total number of rooms (beds + baths)
* Luxury rentals - How many total perks does each apartment have?
* Are cats or dogs allowed?
* Ratio of beds to baths
* What's the neighborhood, based on address or latitude & longitude (Maybe)



**Haversine Point**

In [None]:
#Haversine Point
from math import radians, cos, sin, asin, sqrt

def single_pt_haversine(lat, lng, degrees = True):
    """
    Single point Haversine: calculates the great-circle distance
    between a point on Earth and the (0,0) lat-long coordinate.
    """
    
    r = 6371 #Earth's radius (km). Use r = 3956 if you want miles
    
    #Convert decimal degrees to radians (both lat and lng)
    if degrees:
        lat, lng = map(radians, [lat,lng])
        
    # 'Single point' Haversine formula
    a = sin(lat/2)**2 + cos(lat) * sin(lng/2)**2
    d = 2 * r * asin(sqrt(a))
    
    return d

In [None]:
df['haversine'] = [single_pt_haversine(x,y) for x, y in zip(df.latitude, df.longitude)]
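
The list comprehension above calls the function once per row. A vectorized NumPy version of the same single-point formula might look like the sketch below. The function name `single_pt_haversine_np` is my own, and it assumes the inputs are in decimal degrees.

```python
import numpy as np

def single_pt_haversine_np(lat, lng, r=6371.0):
    """Great-circle distance (km by default) from (0, 0) for arrays of lat/lng in degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lng = np.radians(np.asarray(lng, dtype=float))
    # Haversine formula with the reference point fixed at (0, 0)
    a = np.sin(lat / 2) ** 2 + np.cos(lat) * np.sin(lng / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Sanity checks: the origin is 0 km away, and a point a quarter of the way
# around the globe is a quarter of Earth's circumference away.
distances = single_pt_haversine_np([0.0, 0.0, 90.0], [0.0, 90.0, 0.0])
print(distances)
```

With a DataFrame this could be called as `single_pt_haversine_np(df['latitude'], df['longitude'])`, which avoids the Python-level loop.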

**Adding number of bathrooms and bedrooms together**

In [None]:
#Number of rooms
df['rooms'] = df['bathrooms'] + df['bedrooms']

**Changing interest level to numeric using .replace()**

In [None]:
df['interest_level'].unique()

In [None]:
df['interest_level'] = df['interest_level'].replace(['low', 'medium', 'high'], [1,2,3])

In [None]:
df['interest_level'].unique()
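
An alternative sketch for the same encoding uses an explicit dictionary with `.map()`, which keeps each label next to its number. The toy Series here is made up.

```python
import pandas as pd

# Ordered encoding: a higher number means more interest.
interest_order = {'low': 1, 'medium': 2, 'high': 3}

toy = pd.Series(['medium', 'low', 'high', 'low'], name='interest_level')
encoded = toy.map(interest_order)

print(encoded.tolist())
```

One difference worth knowing: `.map()` turns any label missing from the dictionary into NaN, while `.replace()` leaves it unchanged.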

**Are pets allowed**

(I am going to combine the two columns into one flag: 1 if cats or dogs are allowed, 0 otherwise)

In [None]:
df['cats_allowed'].sum()

In [None]:
df['dogs_allowed'].sum()

In [None]:
def pets_allowed(x):
    if x['cats_allowed'] == 1:
        return 1
    elif x['dogs_allowed'] == 1:
        return 1
    else:
        return 0

In [None]:
df['pets_allowed'] = df.apply(pets_allowed, axis=1)
df.head()
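
`df.apply(..., axis=1)` runs a Python function for every row, which can be slow on ~49k listings. A column-wise sketch of the same "cats or dogs" flag, on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'cats_allowed': [1, 0, 0, 1],
    'dogs_allowed': [0, 1, 0, 1],
})

# True where at least one of the two flags equals 1, then cast to 0/1.
toy['pets_allowed'] = (
    toy[['cats_allowed', 'dogs_allowed']].eq(1).any(axis=1).astype(int)
)

print(toy)
```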

In [None]:
#dropping cats_allowed/dogs_allowed columns
df = df.drop(['cats_allowed', 'dogs_allowed'], axis = 1)

In [None]:
df.head()

**Creating a luxury rating scale - adding all the features available to get a scale for amenities**

In [None]:
#looking at what features I want to include in the luxury rating
df.columns

In [None]:
#going to reorder the columns to make addition easier
#also, I am unsure what exclusive means - you're the only one with access to your apartment?
#dropping display address - it seems street address is more informative

df = df[['created', 'street_address', 'description', 
       'latitude', 'longitude', 'haversine',
       'price', 'interest_level', 'no_fee', 'exclusive',
       'bathrooms', 'bedrooms', 'pre-war', 'new_construction',
       'loft', 'pets_allowed', 'hardwood_floors', 'dining_room', 
       'laundry_in_unit', 'dishwasher', 'high_speed_internet', 
       'balcony', 'terrace', 'elevator', 'doorman', 'laundry_in_building', 
       'fitness_center', 'swimming_pool', 'roof_deck', 'outdoor_space',
       'garden_patio', 'common_outdoor_space', 'wheelchair_access', 'rooms']]

In [None]:
#checking it's in the right order
df.head()

In [None]:
df.shape

In [None]:
#Luxury score - row-wise sum of columns 7 through 32 (interest_level through wheelchair_access)
#this includes interest level and bathroom/bedroom
df['luxury_rating'] = df.iloc[:, 7:33].sum(axis=1)
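
The positional slice only works as long as the column order stays the same. A sketch that sums an explicit list of column names instead is shown below. The toy frame and the name `perk_count` are made up, and which columns count as perks is a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [3000, 2400],
    'elevator': [1, 0],
    'doorman': [1, 0],
    'dishwasher': [0, 1],
    'rooms': [3.0, 2.0],
})

# Name the perk columns explicitly so reordering the frame cannot change the score.
perk_columns = ['elevator', 'doorman', 'dishwasher']
toy['perk_count'] = toy[perk_columns].sum(axis=1)

print(toy[['price', 'perk_count']])
```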

In [None]:
df.luxury_rating.head()

In [None]:
#what are the different scores?
min(df['luxury_rating']), max(df['luxury_rating'])

**Checking if there are null values**

In [None]:
df.isnull().sum()

**Checking the correlation matrix to see what features most affect price.**

In [None]:
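#text columns (created, street_address, description) can't be correlated, so keep numeric columns only
corr_matrix = df.select_dtypes(include='number').corr()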
corr_matrix['price'].sort_values(ascending=False)
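
Sorting by the raw correlation puts strong negative relationships at the bottom of the list. A sketch that ranks features by the absolute correlation with price, using made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    'street_address': ['1 A St', '2 B Ave', '3 C Rd', '4 D Ln'],
    'bedrooms': [1, 2, 3, 4],
    'no_fee': [1, 0, 1, 0],
    'price': [2000, 3000, 4100, 5000],
})

# Keep numeric columns only, then take each feature's correlation with price.
numeric = toy.select_dtypes(include='number')
price_corr = numeric.corr()['price'].drop('price')

# Order by strength of the relationship, keeping the sign in the values.
ranked = price_corr.reindex(price_corr.abs().sort_values(ascending=False).index)

print(ranked)
```

Correlation only measures linear, one-feature-at-a-time relationships, so it is a rough guide for picking features rather than a guarantee of a better test MAE.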