<a href="https://colab.research.google.com/github/erivetna87/DS-Unit-2-Regression-Classification/blob/master/DS-Unit-2-Regression-Classification/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [x] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [x] As always, commit your notebook to your fork of the GitHub repo.
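
The three metrics in the checklist can be computed by hand to see what each one measures. The following is a minimal sketch on made-up numbers (the arrays `y_true` and `y_pred` are illustrative and not taken from the renthop data), checked against the scikit-learn functions used later in this notebook:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up rents and predictions, for illustration only
y_true = np.array([2000.0, 3000.0, 4000.0, 5000.0])
y_pred = np.array([2500.0, 2500.0, 4500.0, 4500.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # average size of a miss, in dollars
rmse = np.sqrt(np.mean(errors ** 2))             # like MAE, but large misses weigh more
ss_res = np.sum(errors ** 2)                     # variation the predictions fail to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn's functions
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(mae, rmse, r2)
```

MAE and RMSE are in the same units as the target (dollars of rent), while $R^2$ is unitless: it compares the model's squared error with that of always predicting the mean.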


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
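
Several of these ideas can be sketched in a few lines of pandas. The example below uses a tiny made-up DataFrame; the column names (`description`, `elevator`, and so on) are assumptions about the renthop data, and the list of perk columns is only illustrative:

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; column names are assumed to match the renthop data
listings = pd.DataFrame({
    'description': ['Sunny 2BR near park', None, '   '],
    'bedrooms': [2, 0, 3],
    'bathrooms': [1.0, 1.0, 0.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'dishwasher': [1, 0, 1],
    'elevator': [0, 1, 1],
})

# Description features: treat missing or whitespace-only text as "no description"
text = listings['description'].fillna('').str.strip()
listings['has_description'] = text != ''
listings['description_length'] = text.str.len()

# Count perks by summing the 0/1 amenity columns (this list is illustrative)
perk_cols = ['cats_allowed', 'dogs_allowed', 'dishwasher', 'elevator']
listings['perk_count'] = listings[perk_cols].sum(axis=1)

# Pets: both allowed versus at least one allowed
listings['cats_and_dogs'] = (listings['cats_allowed'] == 1) & (listings['dogs_allowed'] == 1)
listings['cats_or_dogs'] = (listings['cats_allowed'] == 1) | (listings['dogs_allowed'] == 1)

# Ratio of beds to baths; zero baths would divide by zero, so map those to NaN
listings['beds_per_bath'] = listings['bedrooms'] / listings['bathrooms'].replace(0, np.nan)

print(listings[['has_description', 'description_length', 'perk_count', 'beds_per_bath']])
```

The ratio column can contain `NaN`, which `LinearRegression` does not accept, so those rows would need to be filled or dropped before fitting.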

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module2')


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.simplefilter("ignore")

In [0]:
import numpy as np
import pandas as pd
import pandas_profiling
import plotly.express as px
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
%matplotlib inline

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# Parse the listing timestamps (infer_datetime_format is deprecated in pandas 2.x)
df['created'] = pd.to_datetime(df['created'])

In [0]:
# df.profile_report()

In [0]:
# New features:
# 1) total_rooms: bedrooms + bathrooms
# 2) cats_or_dogs: True if cats or dogs (or both) are allowed
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['cats_or_dogs'] = (df['cats_allowed'] == 1) | (df['dogs_allowed'] == 1)

In [0]:
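# April & May 2016: include midnight on April 1, exclude June 1.
# .copy() lets us add columns later without modifying a view of df.
train = df.loc[(df['created'] >= '2016-04-01') & (df['created'] < '2016-06-01')].copy()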

In [0]:
# June 2016: include midnight on June 1, exclude July 1.
test = df.loc[(df['created'] >= '2016-06-01') & (df['created'] < '2016-07-01')].copy()

In [0]:
train.shape, test.shape

((31844, 36), (16973, 36))
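
A date-based split is easiest to reason about with half-open intervals, so that every row lands in exactly one set. The following is a small sketch on made-up timestamps (the `toy` frame and `cutoff` name are hypothetical), including one listing created exactly at midnight on the boundary:

```python
import pandas as pd

# Made-up timestamps, including one exactly at midnight on the boundary
toy = pd.DataFrame({
    'created': pd.to_datetime([
        '2016-04-01 00:00:00',
        '2016-05-31 23:59:59',
        '2016-06-01 00:00:00',
        '2016-06-29 21:41:47',
    ]),
    'price': [3000, 2500, 4100, 3300],
})

cutoff = pd.Timestamp('2016-06-01')

# Half-open intervals: train is [start, cutoff), test is [cutoff, end)
toy_train = toy[toy['created'] < cutoff].copy()
toy_test = toy[toy['created'] >= cutoff].copy()

# No row is lost or counted twice
assert len(toy_train) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_test))
```

With strict `>` on both sides of a boundary, a row stamped exactly at midnight would silently fall out of both sets.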

In [0]:
# Scatter plot of listing locations, colored by price
px.scatter(train, x='longitude', y='latitude', color='price')

In [0]:
# Clustering the locations by Lat/Long
# n_jobs was removed from KMeans in recent scikit-learn releases;
# random_state makes the cluster assignments reproducible
kmeans = KMeans(n_clusters=10, random_state=42)
train['cluster'] = kmeans.fit_predict(train[['longitude', 'latitude']])
test['cluster'] = kmeans.predict(test[['longitude', 'latitude']])
px.scatter(train, x='longitude', y='latitude', color='cluster')
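
The pattern used here, fitting the clusters on the training rows only and then assigning test rows to the nearest learned center, can be sketched on made-up coordinates. The arrays and variable names below are hypothetical, and two clusters are used only to keep the example small:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (longitude, latitude) pairs forming two well-separated groups
train_xy = np.array([
    [-74.00, 40.70], [-74.01, 40.71],   # group A
    [-73.80, 40.90], [-73.81, 40.91],   # group B
])
test_xy = np.array([
    [-74.005, 40.705],                  # near group A
    [-73.805, 40.905],                  # near group B
])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
train_labels = km.fit_predict(train_xy)   # learn centers from train only
test_labels = km.predict(test_xy)         # reuse those centers for test

# Test points receive the label of the training group they sit next to
assert test_labels[0] == train_labels[0]
assert test_labels[1] == train_labels[2]
```

Cluster labels are arbitrary identifiers rather than quantities, so using them as a feature in a linear model would likely call for one-hot encoding instead of passing the raw integers.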


In [0]:
def cat_features(df, feature):
    """Add one-hot (dummy) columns for `feature` to `df` in place and return it."""
    dummies = pd.get_dummies(df[feature], prefix=feature)
    for col in dummies:
        df[col] = dummies[col]
    return df

In [0]:
train, test = cat_features(train,'interest_level'), cat_features(test,'interest_level')
train.shape, test.shape



((31844, 40), (16973, 40))
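
The three `interest_level` dummy columns always sum to 1 in every row, so together with the model's intercept they are perfectly collinear. scikit-learn's `LinearRegression` will still fit, but the individual coefficients become harder to interpret. One common option, sketched here on made-up data, is to drop one level and treat it as the baseline:

```python
import pandas as pd

# Made-up sample with the same three levels as the renthop column
toy = pd.DataFrame({'interest_level': ['low', 'medium', 'high', 'low']})

# All three levels: the dummy columns sum to 1 in every row
full = pd.get_dummies(toy['interest_level'], prefix='interest_level')

# drop_first=True drops the first level in sorted order ('high' here),
# which becomes the baseline the other coefficients are measured against
reduced = pd.get_dummies(toy['interest_level'], prefix='interest_level',
                         drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With this variant, the feature list would include only the remaining two dummy columns.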

In [0]:
# train_col = train.columns.tolist()
# print(train_col)

In [0]:
model = linear_model.LinearRegression()

features = ['bathrooms', 'longitude']
target = ['price']

X = train[features]
y = train[target]

model.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:

def lr(train, test, features, target):
    X_train = train[features]
    y_train = train[target]
    X_test = test[features]
    y_test = test[target]

    model = linear_model.LinearRegression()
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    print(f'Linear Regression with {len(features)} features')
    print('Intercept', model.intercept_)
    coefficients = pd.Series(model.coef_, features)
    print(coefficients.to_string())
    
    print('Train Root Mean Squared Error:', np.sqrt(mean_squared_error(y_train, y_pred_train)))
    print('Test Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, y_pred_test)))
    print('Train Mean Absolute Error:', mean_absolute_error(y_train, y_pred_train))
    print('Test Mean Absolute Error:', mean_absolute_error(y_test, y_pred_test))
    print('Train R^2 Score:', r2_score(y_train, y_pred_train))
    print('Test R^2 Score:', r2_score(y_test, y_pred_test))


In [0]:
# Index.get_values() was removed in pandas 1.0; tolist() works directly
train.columns.tolist()

In [0]:
target = 'price'
features = ['bathrooms', 'longitude']
lr(train, test, features, target)

In [0]:
target = 'price'
features = ['bathrooms', 'longitude','cats_or_dogs',
            'interest_level_high','interest_level_low',
            'interest_level_medium']
lr(train, test, features, target)

In [0]:
target = 'price'
features = ['bathrooms', 'longitude','cats_or_dogs',
            'interest_level_high','interest_level_low',
            'interest_level_medium','dishwasher']
lr(train, test, features, target)