BloomTech Data Science

*Unit 2, Sprint 1, Module 2*

---

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Regression II

In this project, you'll continue working with the New York City rent dataset you used in the last module project.

## Directions

The tasks for this project are as follows:

- **Task 1:** Import `csv` file using `wrangle` function.
- **Task 2:** Conduct exploratory data analysis (EDA), and modify `wrangle` function to engineer two new features.
- **Task 3:** Split data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and test sets.
- **Task 5:** Establish the baseline mean absolute error for your dataset.
- **Task 6:** Build and train a `LinearRegression` model.
- **Task 7:** Calculate the training and test mean absolute error for your model.
- **Task 8:** Calculate the training and test $R^2$ score for your model.
- **Stretch Goal:** Determine the three most important features for your linear regression model.

**Note**

You should limit yourself to the following libraries for this project:

- `matplotlib`
- `numpy`
- `pandas`
- `sklearn`

In [2]:
#Every Library Used In Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import r2_score as r2

# I. Wrangle Data

In [3]:
def wrangle(filepath):

    #Reading in the Raw Data
    df = pd.read_csv(filepath)

    #Formatting Column Names
    df.columns = df.columns.str.upper().str.replace(' ','_')

    #Declaring Subjective Wrangling Constants
    high_cardinality_percentage_threshold = 0.1
    max_cat = 5 #Max Unique Count to be Autoclassified as Categorical

    #Column Classification Lists
    date_cols = ['CREATED']
    categorical_cols = []
    numerical_cols = []
    single_value_cols = []

    for col in df:

        #print(col, df[col].dtype, df[col].nunique())

        #Find Categorical/Numerical Variables
        #(`not` is required here: `~` on a Python bool yields -1 or -2, both truthy)
        if df[col].dtype == 'object' or df[col].nunique() in range(2, max_cat + 1):
            #print("Column: ",col)
            if col not in categorical_cols:
                categorical_cols.append(col)
        elif col not in numerical_cols + categorical_cols:
            numerical_cols.append(col)

        #Format Dates
        if col in categorical_cols and col in date_cols:
            df[col] = pd.to_datetime(df[col])

        #Find Columns With Only One Value
        if df[col].nunique() == 1:
            if col not in single_value_cols:
                single_value_cols.append(col)
   
    #Drop All Rows With Null Values
    df = df.dropna(axis=0)

    # Remove the most extreme 1% prices,
    # the most extreme 0.1% latitudes, &
    # the most extreme 0.1% longitudes
    df = df[(df['PRICE'] >= np.percentile(df['PRICE'], 0.5)) & 
            (df['PRICE'] <= np.percentile(df['PRICE'], 99.5)) & 
            (df['LATITUDE'] >= np.percentile(df['LATITUDE'], 0.05)) & 
            (df['LATITUDE'] < np.percentile(df['LATITUDE'], 99.95)) &
            (df['LONGITUDE'] >= np.percentile(df['LONGITUDE'], 0.05)) & 
            (df['LONGITUDE'] <= np.percentile(df['LONGITUDE'], 99.95))]

    return df

filepath = DATA_PATH + 'apartments/renthop-nyc.csv'

df = wrangle(filepath)

In [4]:
df['DISPLAY_ADDRESS'].nunique()

8552

In [5]:
df['STREET_ADDRESS'].nunique()

14635

In [6]:
#THERE ARE REPEAT DATES
#df[ df['CREATED'].duplicated() ]

cols_to_drop = ['CREATED', 'DESCRIPTION', 'DISPLAY_ADDRESS', 'STREET_ADDRESS']

**Task 1:** Add the following functionality to the above `wrangle` function.

- The `'created'` column will be parsed as a `DateTime` object and set as the `index` of the DataFrame.
- Rows with `NaN` values will be dropped.

Then use your modified function to import the `renthop-nyc.csv` file into a DataFrame named `df`.
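
As a rough illustration of the two requirements above, the sketch below parses a date column into the index and drops incomplete rows. The inline CSV text and the name `wrangle_sketch` are made up for the example; only the `created` column name comes from the task description.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical rows, not from the dataset).
csv_text = """created,price,bedrooms
2016-04-17 03:26:41,3000,1
2016-06-01 05:44:33,,2
2016-05-12 10:15:00,2500,0
"""

def wrangle_sketch(source):
    # parse_dates + index_col turn 'created' into a DatetimeIndex on load
    df = pd.read_csv(source, parse_dates=['created'], index_col='created')
    # Drop every row that contains at least one NaN
    df = df.dropna(axis=0)
    return df

demo = wrangle_sketch(io.StringIO(csv_text))
print(type(demo.index).__name__)
print(len(demo))
```

With the date in the index, later steps can filter by time directly on `df.index` instead of on a separate column.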

**Task 2:** Using your `pandas` and dataviz skills, decide on two features that you want to engineer for your dataset. Next, modify your `wrangle` function to add those features.

**Note:** You can learn more about feature engineering [here](https://en.wikipedia.org/wiki/Feature_engineering). Here are some ideas for new features:

- Does the apartment have a description?
- Length of description.
- Total number of perks that apartment has.
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths).
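
Several of the ideas above can be sketched on a toy DataFrame. The rows are invented and the lowercase column names are assumptions modeled on the task text; the notebook's own `wrangle` function upper-cases column names, so adapt accordingly.

```python
import pandas as pd

# Hypothetical listings; values are invented for illustration only.
toy = pd.DataFrame({
    'description': ['Sunny two-bedroom', None, '   '],
    'bedrooms': [2, 1, 0],
    'bathrooms': [1.0, 1.0, 1.0],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
})

# Treat missing and whitespace-only descriptions as "no description"
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_length'] = desc.str.len()

# Cats OR dogs vs. cats AND dogs
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths)
toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

print(toy.drop(columns='description'))
```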

In [7]:
print(list(df.columns))

['BATHROOMS', 'BEDROOMS', 'CREATED', 'DESCRIPTION', 'DISPLAY_ADDRESS', 'LATITUDE', 'LONGITUDE', 'PRICE', 'STREET_ADDRESS', 'INTEREST_LEVEL', 'ELEVATOR', 'CATS_ALLOWED', 'HARDWOOD_FLOORS', 'DOGS_ALLOWED', 'DOORMAN', 'DISHWASHER', 'NO_FEE', 'LAUNDRY_IN_BUILDING', 'FITNESS_CENTER', 'PRE-WAR', 'LAUNDRY_IN_UNIT', 'ROOF_DECK', 'OUTDOOR_SPACE', 'DINING_ROOM', 'HIGH_SPEED_INTERNET', 'BALCONY', 'SWIMMING_POOL', 'NEW_CONSTRUCTION', 'TERRACE', 'EXCLUSIVE', 'LOFT', 'GARDEN_PATIO', 'WHEELCHAIR_ACCESS', 'COMMON_OUTDOOR_SPACE']


In [8]:
# Conduct your exploratory data analysis here, 
# and then modify the function above.

# Feature Formatting
df.loc[df['INTEREST_LEVEL'] == 'low', "INTEREST_LEVEL"] = 0
df.loc[df['INTEREST_LEVEL'] == 'medium', "INTEREST_LEVEL"] = 1
df.loc[df['INTEREST_LEVEL'] == 'high', "INTEREST_LEVEL"] = 2

# Additional Feature #1
feature1 = 'TOTAL_PERKS'
df[feature1] = np.repeat(0,df.shape[0])
perks = list(df.nunique()[ df.nunique() == 2].index)
perks.remove('NEW_CONSTRUCTION')
perks.remove('EXCLUSIVE')
for col in perks:
  df[feature1] += df[col]
print(df.groupby(by=feature1)['PRICE'].mean()); print()

# Additional Feature #2
feature2 = 'ANYTHING_OUTDOORS'
df[feature2] = 1 - (1-df['BALCONY'])*(1-df['SWIMMING_POOL'])*(1-df['GARDEN_PATIO'])*(1-df['COMMON_OUTDOOR_SPACE'])*(1-df['TERRACE'])*(1-df['OUTDOOR_SPACE'])*(1-df['ROOF_DECK'])
print(df.groupby(by=feature2)['PRICE'].mean()); print()
#plt.title(feature2+' vs PRICE')
#plt.scatter(y=df['PRICE'], x=df[feature2]);

# plt.scatter(y=df['PRICE'], x=df['DOGS_ALLOWED']);
# print(df.groupby(by='DOGS_ALLOWED')['PRICE'].mean()); print()

TOTAL_PERKS
0     2804.442364
1     2845.290182
2     3066.058111
3     3212.627418
4     3659.147171
5     3777.141882
6     3985.879926
7     4088.913563
8     4129.486201
9     4266.277477
10    4370.703658
11    4505.367786
12    4530.938947
13    4872.078324
14    4711.526132
15    4870.595890
16    4899.337079
17    4813.633333
18    4277.000000
Name: PRICE, dtype: float64

ANYTHING_OUTDOORS
0    3349.794731
1    4173.225573
Name: PRICE, dtype: float64



# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'price'`.

**Note:** In contrast to the last module project, this time you should include _all_ the numerical features in your dataset.
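
One possible way to keep every numerical column without listing them by hand is `DataFrame.select_dtypes`. This is a minimal sketch on invented data, not the approach used in the cell below, which removes columns by name instead.

```python
import pandas as pd

# Hypothetical frame with one non-numeric column
toy = pd.DataFrame({
    'price': [3000, 2500, 4100],
    'bedrooms': [1, 0, 2],
    'latitude': [40.71, 40.73, 40.75],
    'description': ['a', 'b', 'c'],
})

target = 'price'
# Keep numeric columns only, then remove the target from the features
X = toy.select_dtypes(include='number').drop(columns=target)
y = toy[target]

print(list(X.columns))
```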

In [9]:
target = 'PRICE'
features = list(df.columns)
for col in cols_to_drop:
  features.remove(col)
features.remove(target)

X = df[ features ]
y = df[target]

**Task 4:** Split `X` and `y` into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`).

- Your training set should include data from April and May 2016. 
- Your test set should include data from June 2016.
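
A time-based split like the one described above can be expressed with a single cutoff date. The sketch assumes the timestamps live in a `DatetimeIndex`; the rows and the name `cutoff` are illustrative.

```python
import pandas as pd

# Hypothetical listings indexed by creation date
toy = pd.DataFrame(
    {'price': [3000, 2500, 4100, 3300]},
    index=pd.to_datetime(['2016-04-02', '2016-05-20', '2016-06-01', '2016-06-15']),
)

# Everything before June 1 is training data; June onward is test data
cutoff = pd.Timestamp('2016-06-01')
train_mask = toy.index < cutoff
train, test = toy.loc[train_mask], toy.loc[~train_mask]

print(len(train), len(test))
```

Splitting by date rather than at random keeps the test set strictly later than the training set, which mirrors how the model would be used on future listings.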

In [10]:
mask = ( (df['CREATED'].dt.month == 4) | (df['CREATED'].dt.month == 5) )

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

#print(df[~mask]['CREATED'].dt.month.value_counts()) #To Check That The Split Worked

# III. Establish Baseline

**Task 5:** Since this is a **regression** problem, you need to calculate the baseline mean absolute error for your model. First, calculate the mean of `y_train`. Next, create a list `y_pred` that has the same length as `y_train` and where every item in the list is the mean. Finally, use `mean_absolute_error` to calculate your baseline.
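
To make the arithmetic concrete, here is a small worked example with four invented prices. It checks the `mean_absolute_error` result against the same quantity computed by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical training prices
y_train_toy = np.array([2000, 3000, 4000, 7000])

# Step 1: the mean of the target
mean_price = y_train_toy.mean()

# Step 2: a constant prediction, one entry per observation
y_pred_toy = [mean_price] * len(y_train_toy)

# Step 3: baseline MAE, via sklearn and by hand
baseline = mean_absolute_error(y_train_toy, y_pred_toy)
manual = np.abs(y_train_toy - mean_price).mean()

print(mean_price, baseline, manual)
```

Here the mean is 4000 and the absolute errors are 2000, 1000, 0, and 3000, so the baseline MAE is 1500. Any model worth keeping should beat this number.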

In [11]:
y_pred_baseline = [ y_train.mean() ] * len(y_train)
baseline_mae = mae(y_train, y_pred_baseline)
print('Baseline MAE:', baseline_mae)

Baseline MAE: 1202.398300781848


# IV. Build Model

**Task 6:** Build and train a `LinearRegression` model named `model` using your feature matrix `X_train` and your target vector `y_train`.

In [12]:
model = LinearRegression()

model.fit(X_train, y_train)


LinearRegression()

# V. Check Metrics

**Task 7:** Calculate the training and test mean absolute error for your model.

In [13]:
training_mae = mae(y_train, model.predict(X_train))
test_mae = mae(y_test, model.predict(X_test))

print('Training MAE:', training_mae)
print('Test MAE:', test_mae)

Training MAE: 672.531584013441
Test MAE: 675.8599022993726


**Task 8:** Calculate the training and test $R^2$ score for your model.
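
For reference, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the target from its mean. The sketch below computes it by hand on invented numbers and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and predictions
y_true = np.array([2000.0, 3000.0, 4000.0, 7000.0])
y_hat = np.array([2500.0, 2500.0, 4500.0, 6500.0])

# Residual sum of squares: how far predictions are from the truth
ss_res = np.sum((y_true - y_hat) ** 2)
# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_hat))
```

A model that always predicts the mean has $SS_{res} = SS_{tot}$ and therefore scores 0 on the data that mean was computed from, which ties this metric back to the baseline.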

In [14]:
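training_r2 = r2(y_train, model.predict(X_train))
# Score the test set on the held-out data (y_test, X_test).
# NOTE: the output below was recorded when test_r2 was mistakenly scored on the
# training data, so both values match; re-run this cell for the actual test score.
test_r2 = r2(y_test, model.predict(X_test))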

print('Training R2:', training_r2)
print('Test R2:', test_r2)

Training R2: 0.6362112842090302
Test R2: 0.6362112842090302


# VI. Communicate Results

**Stretch Goal:** What are the three most influential coefficients in your linear model? You should consider the _absolute value_ of each coefficient, so that it doesn't matter if it's positive or negative.
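
One caveat when ranking raw coefficients: their size depends on each feature's units. A feature that varies by only a few hundredths of a degree (such as longitude) can have a large coefficient yet move the prediction very little. The sketch below uses synthetic data, with made-up effect sizes, to compare the ranking from raw coefficients with the ranking after standardizing features with `StandardScaler`. It illustrates the idea and is not a re-analysis of the rent data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features: longitude has a tiny spread, bedrooms a large one
X_toy = pd.DataFrame({
    'longitude': rng.normal(-73.97, 0.03, n),
    'bedrooms': rng.integers(0, 4, n).astype(float),
})
# Invented relationship: big per-degree effect, moderate per-bedroom effect
y_toy = (3000
         + 600 * X_toy['bedrooms']
         + 3000 * (X_toy['longitude'] + 73.97)
         + rng.normal(0, 50, n))

# Ranking by raw coefficients (units: dollars per degree / per bedroom)
raw = pd.Series(LinearRegression().fit(X_toy, y_toy).coef_,
                index=X_toy.columns).abs()

# Ranking after scaling each feature to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X_toy)
scaled = pd.Series(LinearRegression().fit(X_scaled, y_toy).coef_,
                   index=X_toy.columns).abs()

print(raw.sort_values(ascending=False))
print(scaled.sort_values(ascending=False))
```

In this synthetic setup the raw ranking puts `longitude` first, while the standardized ranking puts `bedrooms` first, so it is worth stating which convention a "most influential" claim relies on.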

In [15]:
var = 'ABS COEF'
coef_df = pd.DataFrame(model.coef_, index=features, columns=[var])
print(coef_df.abs().sort_values(by=var, ascending=False))

                          ABS COEF
LONGITUDE             13067.936829
BATHROOMS              1703.109841
LATITUDE               1246.013084
BEDROOMS                486.029874
LAUNDRY_IN_UNIT         454.473246
INTEREST_LEVEL          412.073801
DOORMAN                 392.246581
HIGH_SPEED_INTERNET     307.361491
ROOF_DECK               255.603326
DINING_ROOM             244.134217
EXCLUSIVE               211.519199
COMMON_OUTDOOR_SPACE    196.048460
HARDWOOD_FLOORS         173.458382
OUTDOOR_SPACE           167.972423
WHEELCHAIR_ACCESS       151.102351
TERRACE                 150.019561
LAUNDRY_IN_BUILDING     135.234591
NO_FEE                  130.037915
NEW_CONSTRUCTION        126.646834
PRE-WAR                 117.645647
ANYTHING_OUTDOORS       115.988931
LOFT                    107.141610
ELEVATOR                105.206171
BALCONY                  93.207392
GARDEN_PATIO             85.592250
FITNESS_CENTER           75.901349
CATS_ALLOWED             59.947035
DOGS_ALLOWED        