### Group C
- Campos, Joshua
- Halili, Gesara
- Nwuzor, Chisom
- Tran, Quynh

# Data Science Project 2020

In this project, we tackle a supervised classification problem using the data from the Kaggle competition "[Two Sigma Connect: Rental Listing Inquiries](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)". Our task is to predict how much interest a rental listing will receive, in one of three categories: low, medium, or high.

For this project, we will be using three different models, and comparing their performance to find the best one for this specific task. The models will be:
- Regularized Linear Models
- Trees
- Random Forests

The project is organized into the following sections:
- Preparation & Exploratory Data Analysis
- Data Preprocessing & Feature Engineering
- Model Performance & Hyper-Parameter Tuning
- Prediction Explanation & Story Telling

## Preparation & Exploratory Data Analysis

### Importing the libraries

We first import the libraries used throughout the project, including the usual data science stack: NumPy, pandas, Matplotlib, seaborn, and scikit-learn.

In [None]:
import json
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string 

from re import search 

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import log_loss
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer, LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import ElasticNet

import functions

### Reading the file

Here we read the data, which is stored as a JSON file. We use only the competition's training data for the whole project and later split it into development, validation, and test sets. This reduces the processing power and time required for model training and hyper-parameter tuning.

In [None]:
with open('data/train.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)

### Exploratory Data Analysis

Before continuing, we need a sense of what the data looks like. We print the head of the data, along with summary statistics and information about the variables. We also look at the number of unique values for the display addresses, the manager IDs, and the features.

Finally, we encode the target variable from text to numerical values, ranging from 0 to 2, and plot the counts for each class to get an idea of how balanced the dataset is.

In [None]:
df_head = df.head()
df_describe = df.describe()
print(df_head)
print(df_describe)
df.info()  # prints its summary directly and returns None

print('\nNum. of Unique Display Addresses: {}'.format(df['display_address'].nunique()))
print('Num. of Unique Manager IDs: {}'.format(df['manager_id'].nunique()))

all_unique_features = set()    
df['features'].apply(lambda x: functions.add_unique_elements(x, all_unique_features))
print('Num. of Unique Features: {}'.format(len(all_unique_features)))


df['interest_level'] = df['interest_level'].apply(lambda x: 0 if x=='low' else 1 if x=='medium' else 2)
# Check how balanced the target classes are
sns.countplot(x='interest_level', data=df)
plt.title('Unbalanced Classes')
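
The feature count above relies on `functions.add_unique_elements`, which lives in the project's local `functions` module and is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it simply adds every element of a listing's feature list to a shared set. The function body and the toy listings are assumptions for illustration, not the project's actual implementation.

```python
def add_unique_elements(elements, unique_set):
    """Hypothetical stand-in for functions.add_unique_elements.

    Adds every item of `elements` to `unique_set` in place.
    """
    unique_set.update(elements)


# Toy data: each inner list plays the role of one listing's 'features' entry
listings = [["Doorman", "Elevator"], ["Elevator", "Laundry in Unit"], []]

seen = set()
for features in listings:
    add_unique_elements(features, seen)

print("Num. of Unique Features: {}".format(len(seen)))
```

Because the set is mutated in place, the return value of `df['features'].apply(...)` can be ignored, which matches how the cell above uses it.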

### Feature Engineering

#### Categorical Variables

In [None]:
addresses = ['display_address', 'street_address']

for address in addresses:
    print(address)
    # Delete rows whose address field holds promotional text instead of an actual address
    address_delete = []
    for i in range(len(df)):
        address_val = df[address].iloc[i]
        # Match either '!' or '*' with a character class
        if search(r'[!*]', address_val):
            address_delete.append(i)

    df = df.drop(df.index[address_delete])
    print("Num. of deleted addresses: {}".format(
        len(address_delete)))

    # Data cleaning
    address_column = df[address]
    address_column_transformed = ( address_column
                                           .apply(str.upper)
                                           .apply(lambda x: x.replace('WEST','W'))
                                           .apply(lambda x: x.replace('EAST','E'))
                                           .apply(lambda x: x.replace('STREET','ST'))
                                           .apply(lambda x: x.replace('AVENUE','AVE'))
                                           .apply(lambda x: x.replace('BOULEVARD','BLVD'))
                                           .apply(lambda x: x.replace('.',''))
                                           .apply(lambda x: x.replace(',',''))
                                           .apply(lambda x: x.replace('&',''))
                                           .apply(lambda x: x.replace('(',''))
                                           .apply(lambda x: x.replace(')',''))
                                           .apply(lambda x: x.strip())
                                           #.apply(lambda x: re.sub('(?<=\d)[A-Z]{2}', '', x))
                                           .apply(lambda x: re.sub(r'[^A-Za-z0-9 ]+', '', x)) # remove all special characters and punctuation
                                           .apply(lambda x: x.replace('FIRST','1'))
                                           .apply(lambda x: x.replace('SECOND','2'))
                                           .apply(lambda x: x.replace('THIRD','3'))
                                           .apply(lambda x: x.replace('FOURTH','4'))
                                           .apply(lambda x: x.replace('FIFTH','5'))
                                           .apply(lambda x: x.replace('SIXTH','6'))
                                           .apply(lambda x: x.replace('SEVENTH','7'))
                                           .apply(lambda x: x.replace('EIGHTH','8'))
                                           .apply(lambda x: x.replace('EIGTH','8'))
                                           .apply(lambda x: x.replace('NINTH','9'))
                                           .apply(lambda x: x.replace('TENTH','10'))
                                           .apply(lambda x: x.replace('ELEVENTH','11'))
                                         )

    print("Num. of Unique Addresses after Transformation: {}".format(
        address_column_transformed.nunique()))

    df[address] = address_column_transformed 
    

display=df["display_address"].value_counts()
manager_id=df["manager_id"].value_counts()
building_id=df["building_id"].value_counts()
street=df["street_address"].value_counts()

df["display_count"]=df["display_address"].apply(lambda x:display[x])
df["manager_count"]=df["manager_id"].apply(lambda x:manager_id[x])  
df["building_count"]=df["building_id"].apply(lambda x:building_id[x])
df["street_count"]=df["street_address"].apply(lambda x:street[x])

price_by_building = df.groupby('building_id')['price'].agg(['min', 'max', 'mean']).reset_index()
price_by_building.columns = ['building_id','min_price_by_building',
                            'max_price_by_building','mean_price_by_building']
df = pd.merge(df,price_by_building, how='left',on='building_id')

cat_vars = ['building_id','manager_id','display_address','street_address']
OE = OrdinalEncoder()
for cat_var in cat_vars:
    print ("Ordinal Encoding %s" % (cat_var))
    df[cat_var]=OE.fit_transform(df[[cat_var]])
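
The count and per-building price features built above can be illustrated on a toy frame. This is a minimal sketch with made-up data. It uses `Series.map` and `groupby(...).transform`, which are assumed here as a compact alternative to the `apply` and `merge` combination in the cell above.

```python
import pandas as pd

# Made-up listings: three in building "a", one each in "b" and "c"
toy = pd.DataFrame({
    "building_id": ["a", "b", "a", "c", "a"],
    "price": [3000, 2000, 3600, 1500, 2400],
})

# Count encoding: how many listings share each building_id
counts = toy["building_id"].value_counts()
toy["building_count"] = toy["building_id"].map(counts)

# Group statistic broadcast back to every row of the group
toy["mean_price_by_building"] = (
    toy.groupby("building_id")["price"].transform("mean")
)

print(toy)
```

`transform` keeps the original row order and index, so no merge step is needed afterwards.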

#### Text Variables

In [None]:
# Studies have shown that titles with excessive all caps and special characters give renters the impression 
# that the listing is fraudulent – i.e. BEAUTIFUL***APARTMENT***CHELSEA.
df['num_of_#']=df.description.apply(lambda x:x.count('#'))
df['num_of_!']=df.description.apply(lambda x:x.count('!'))
df['num_of_$']=df.description.apply(lambda x:x.count('$'))
df['num_of_*']=df.description.apply(lambda x:x.count('*'))
df['num_of_>']=df.description.apply(lambda x:x.count('>'))

df['has_phone'] = df['description'].apply(lambda x:re.sub('['+string.punctuation+']', '', x).split())\
        .apply(lambda x: [s for s in x if s.isdigit()])\
        .apply(lambda x: len([s for s in x if len(str(s))==10]))\
        .apply(lambda x: 1 if x>0 else 0)
df['has_email'] = df['description'].apply(lambda x: 1 if '@renthop.com' in x else 0)

description_column = df['description']
df['description'] = [functions.text_cleaner(x) for x in description_column]

df['length_description'] = df['description'].apply(lambda x: len(x))
df['num_words_description'] = df['description'].apply(lambda x: len(x.split(" ")))

df['num_features'] = df['features'].apply(len)

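# Join the words of each feature with underscores so multi-word features are not split on spaces
v = CountVectorizer(stop_words='english', max_features=50)
x = v.fit_transform(df['features']
                    .apply(lambda feats: " ".join("_".join(f.split(" ")) for f in feats)))

# get_feature_names_out() requires scikit-learn >= 1.0 (get_feature_names() was removed in 1.2)
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())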
df.drop('features', axis=1, inplace=True)
df = df.join(df1.set_index(df.index))

#### Date Variables

In [None]:
df['created'] = pd.to_datetime(df['created'])
df['created_year'] = df['created'].dt.year
df['created_month'] = df['created'].dt.month
df['created_day_of_month'] = df['created'].dt.day
df['created_day_of_week'] = df['created'].dt.dayofweek
df['created_hour'] = df['created'].dt.hour

#### Image Variables

In [None]:
df['num_photos'] = df['photos'].apply(len)
# Zero denominators become NaN before dividing, then the ratio falls back to 0
df['photos_per_bedroom'] = (df['num_photos'] / df['bedrooms'].replace(0, np.nan)).fillna(0)
df['photos_per_bathroom'] = (df['num_photos'] / df['bathrooms'].replace(0, np.nan)).fillna(0)

#### Numerical Variables

In [None]:
df['total_rooms'] = df['bathrooms'] + df['bedrooms']
# Zero denominators become NaN before dividing, then the ratio falls back to 0
df['price_per_room'] = (df['price'] / df['total_rooms'].replace(0, np.nan)).fillna(0)
df['price_per_bedroom'] = (df['price'] / df['bedrooms'].replace(0, np.nan)).fillna(0)
df['price_per_bathroom'] = (df['price'] / df['bathrooms'].replace(0, np.nan)).fillna(0)

df['price_per_word_description'] = (df['price'] / df['num_words_description'].replace(0, np.nan)).fillna(0)
df['price_per_length_description'] = (df['price'] / df['length_description'].replace(0, np.nan)).fillna(0)
df['price_per_feature'] = (df['price'] / df['num_features'].replace(0, np.nan)).fillna(0)
df['price_per_photo'] = (df['price'] / df['num_photos'].replace(0, np.nan)).fillna(0)

central_park_coordinates = (40.7799963,-73.970621)
df['distance_to_central_park'] = df[['latitude','longitude']].apply(
        lambda x: functions.calculate_distance_between_coordinates(
            central_park_coordinates, (x['latitude'], x['longitude'])), axis=1)

wall_street_coordinates = (40.7059692,-74.0099558)
df['distance_to_wall_street'] = df[['latitude','longitude']].apply(
        lambda x: functions.calculate_distance_between_coordinates(
            wall_street_coordinates, (x['latitude'], x['longitude'])), axis=1)

times_square_coordinates = (40.7567473,-73.9888876)
df['distance_to_times_square'] = df[['latitude','longitude']].apply(
        lambda x: functions.calculate_distance_between_coordinates(
            times_square_coordinates, (x['latitude'], x['longitude'])), axis=1)
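
The distance features above depend on `functions.calculate_distance_between_coordinates`, whose implementation is not shown in this notebook. One common way to compute the distance between two latitude/longitude pairs is the haversine formula. The sketch below assumes that approach, a spherical Earth with a mean radius of 6371 km, and a result in kilometers. The function name `haversine_km` is hypothetical, and the project's helper may use a different formula or unit.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(origin, destination):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, origin)
    lat2, lon2 = map(radians, destination)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


central_park = (40.7799963, -73.970621)
times_square = (40.7567473, -73.9888876)
distance = haversine_km(central_park, times_square)
print("Central Park to Times Square: {:.2f} km".format(distance))
```

At city scale the choice between haversine and a more exact ellipsoidal formula makes little difference for a feature that only needs to rank listings by proximity.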

### Correlation of Features and Target

In [None]:
# Drop identifiers, raw text, and other columns that are not used as model inputs
df = df.drop(['building_id', 'listing_id', 'description', 'created', 'display_address', 'manager_id', 
              'photos', 'street_address'], axis=1) 

# Rank the features by their absolute correlation with the encoded target
df_corr = df.corr()
df_corr_abs = np.abs(df_corr['interest_level'])

df_corr_abs_sort = df_corr_abs.sort_values(ascending = False)
print(df_corr_abs_sort)

sns.set(rc={'figure.figsize':(15.7,10.27)})
sns.heatmap(df_corr)

### Data Normalization

In [None]:
df_copy = df.drop("interest_level", axis=1)
scaler = preprocessing.MinMaxScaler()
names = df_copy.columns
d = scaler.fit_transform(df_copy)
scaled_df = pd.DataFrame(d, columns=names)
# Row order is unchanged, so the target can be attached positionally
scaled_df["interest_level"] = df["interest_level"].to_numpy()
scaled_df.head()
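
Min-max scaling maps each feature to `(x - min) / (max - min)`, using the minimum and maximum seen when the scaler is fitted. The cell above fits the scaler on the full dataset. A possible variation, sketched here on made-up prices, is to fit on the training portion only and reuse those statistics for held-out data, so that no information from the held-out rows enters the scaling. This is an illustrative alternative, not what the notebook does.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data (e.g. monthly price)
X_train = np.array([[1000.0], [3000.0], [5000.0]])
X_heldout = np.array([[2000.0], [6000.0]])

# Learn min=1000 and max=5000 from the training rows only
scaler = MinMaxScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)      # (x - 1000) / 4000
X_heldout_scaled = scaler.transform(X_heldout)  # same statistics reused

print(X_train_scaled.ravel())
print(X_heldout_scaled.ravel())
```

Held-out values outside the training range fall outside [0, 1], as the second held-out row shows. Tree-based models are insensitive to this, while regularized linear models may be affected.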

### Splitting of the Dataset

In [None]:
df_dev, df_rest = train_test_split(scaled_df, test_size=0.3)
df_test, df_val = train_test_split(df_rest, test_size=0.5)

X_val =  df_val.drop("interest_level", axis=1)
y_val = df_val["interest_level"]

X_test = df_test.drop("interest_level", axis=1)
y_test = df_test["interest_level"]

X_dev = df_dev.drop("interest_level", axis=1)
y_dev = df_dev["interest_level"]


X = scaled_df.drop("interest_level", axis=1)
y = scaled_df["interest_level"]
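
Since the classes are unbalanced, it can help to keep the class proportions similar across the three subsets and to fix the random seed so the split is reproducible. The sketch below shows one way to do this with the `stratify` and `random_state` arguments of `train_test_split`, on synthetic data with a roughly 70/23/7 class mix. The data, the seed value, and the use of stratification are assumptions for illustration. The cell above uses a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an unbalanced three-class target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 140 + [1] * 46 + [2] * 14)

# 70% development, 30% held out, preserving class proportions
X_dev_toy, X_rest_toy, y_dev_toy, y_rest_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)

# Split the held-out part evenly into test and validation sets
X_test_toy, X_val_toy, y_test_toy, y_val_toy = train_test_split(
    X_rest_toy, y_rest_toy, test_size=0.5, stratify=y_rest_toy, random_state=42)

for name, labels in [("dev", y_dev_toy), ("test", y_test_toy), ("val", y_val_toy)]:
    print(name, len(labels), np.bincount(labels, minlength=3) / len(labels))
```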

### Hyperparameter Tuning for the Random Forest