Andy Nguyen, Michael Wolfe, Spencer Fogelman, & Joseph Caguioa

DS 7331.407

Thursday 6:30pm - 8:00pm

# Logistic Regression and SVMs of AirBnB Data

### Instructions
[50 Points]
* Assess performance of each model using 80/20 training-test split
* Adjust model parameters to optimize accuracy
    * if dataset size requires stochastic gradient descent, then only linear kernel is appropriate
    
[10 Points]
* Discuss advantages of each model for each classification task
* Does one type of model offer superior performance over another in terms of prediction accuracy?
    * In terms of training time or efficiency? Explain.

[30 points]
* Use weights from logistic regression to interpret importance of different features for each classification task. Explain interpretation in detail.
    * Why do you think some variables are more important?
    
[10 points]
* Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain.

**Setup**

In [35]:
import pandas as pd
import numpy as np
import seaborn as sns
#import plotly.graph_objects as go
import datetime
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import warnings

In [36]:
df = pd.read_csv('https://raw.githubusercontent.com/anguyen-07/DS7331-ML_Labs/master/data/airbnb_train.csv')
df['grade'] = pd.cut(df.review_scores_rating, [0,60,70,80,90,101], right=False, labels = ['F', 'D', 'C', 'B', 'A'])
df.dtypes

id                           int64
log_price                  float64
property_type               object
room_type                   object
amenities                   object
accommodates                 int64
bathrooms                  float64
bed_type                    object
cancellation_policy         object
cleaning_fee                  bool
city                        object
description                 object
first_review                object
host_has_profile_pic        object
host_identity_verified      object
host_response_rate          object
host_since                  object
instant_bookable            object
last_review                 object
latitude                   float64
longitude                  float64
name                        object
neighbourhood               object
number_of_reviews            int64
review_scores_rating       float64
thumbnail_url               object
zipcode                     object
bedrooms                   float64
beds                

The first step in preparing the Airbnb dataset for logistic regression and support vector machines is to remove or impute missing data and convert the variables to compatible datatypes. Much of the logic behind this work was covered in Lab 1.

**Cleanup (from first project)**

In [37]:
##Clean up datatypes and duplicates
# .copy() avoids SettingWithCopyWarning on the column assignments below
df_ratings = df.dropna(subset=['review_scores_rating']).copy()
floats = ['log_price','bathrooms','latitude','longitude','review_scores_rating']
df_ratings[floats] = df_ratings[floats].astype(np.float64)
ints = ['id','accommodates','number_of_reviews','bedrooms','beds']
df_ratings["host_response_rate"] = df_ratings["host_response_rate"].str.rstrip('%').astype(np.float64)/100
date_time = ['first_review','host_since','last_review']
df_ratings[date_time] = df_ratings[date_time].apply(pd.to_datetime)
booleans = ['host_has_profile_pic','host_identity_verified','instant_bookable']
# np.bool was removed in NumPy 1.24; the builtin bool behaves the same here
df_ratings[booleans] = df_ratings[booleans].replace({'t':True,'f':False}).astype(bool)
categorical = ['property_type','room_type','bed_type','cancellation_policy','city','neighbourhood','zipcode']
df_ratings[categorical] = df_ratings[categorical].astype('category')
# drop_duplicates returns a new frame, so the result has to be assigned
df_ratings = df_ratings.drop_duplicates()
# Fall back to first_review where host_since is missing (.loc avoids chained assignment)
missing_host_since = df_ratings['host_since'].isna()
df_ratings.loc[missing_host_since, 'host_since'] = df_ratings.loc[missing_host_since, 'first_review']

In [38]:
##Impute missing values
df_imputed = df_ratings
for col in ["bathrooms", "bedrooms", "beds"]:
    # Fill gaps with the median of listings that share property type and capacity;
    # transform keeps the original row index, so fillna aligns row by row
    group_median = df_imputed.groupby(["property_type","accommodates"], observed=True)[col].transform("median")
    df_imputed[col] = df_imputed[col].fillna(group_median)
# Fill missing response rates with the mean for listings with the same review count
response_mean = df_imputed.groupby("number_of_reviews")["host_response_rate"].transform("mean")
df_imputed["host_response_rate"] = df_imputed["host_response_rate"].fillna(response_mean)
df_imputed[ints] = df_imputed[ints].astype(np.int64)

In [39]:
df_imputed['price'] = np.exp(df_imputed['log_price'])

In [40]:
import re
#Create a new cleaned amenities column where all amenities are in list form
df_imputed['amenities_new'] = df_imputed.apply(lambda row: re.sub(r'[{}""]', '', row['amenities']), axis=1)
df_imputed['amenities_new'] = df_imputed.apply(lambda row: row['amenities_new'].lower().split(','), axis=1)
df_imputed = df_imputed.reset_index()
df_imputed['length_amenities'] = df_imputed.apply(lambda row: len(row['amenities_new']), axis=1)

# Create separate columns based on amenities
df_imputed['internet'] = df_imputed.apply(lambda row: 'internet' in row.amenities.lower(), axis=1)
df_imputed['TV'] = df_imputed.apply(lambda row: 'tv' in row.amenities.lower(), axis=1)
df_imputed['air_conditioning'] = df_imputed.apply(lambda row: 'air conditioning' in row.amenities.lower(), axis=1)
df_imputed['kitchen'] = df_imputed.apply(lambda row: 'kitchen' in row.amenities.lower(), axis=1)
df_imputed['pool'] = df_imputed.apply(lambda row: 'pool' in row.amenities.lower(), axis=1)
df_imputed['parking'] = df_imputed.apply(lambda row: 'parking' in row.amenities.lower(), axis=1)

df_imputed['description_length'] = df_imputed['description'].apply(len)

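# Build the flag from df_imputed itself: its index was reset above, so a mask
# taken from the raw df would no longer line up with these rows
df_imputed['superuser'] = df_imputed['review_scores_rating'] >= 96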

In [48]:
import datetime
date_published = datetime.datetime(2018,3,14)
df_imputed['host_since'] = pd.to_datetime(df_imputed['host_since'])


In [50]:
# Use df_imputed, where host_since is already datetime; in the raw df it is still a string
df_imputed['host_since_days'] = (date_published - df_imputed['host_since']).dt.days

In [49]:
df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57389 entries, 0 to 57388
Data columns (total 42 columns):
index                     57389 non-null int64
id                        57389 non-null int64
log_price                 57389 non-null float64
property_type             57389 non-null category
room_type                 57389 non-null category
amenities                 57389 non-null object
accommodates              57389 non-null int64
bathrooms                 57389 non-null float64
bed_type                  57389 non-null category
cancellation_policy       57389 non-null category
cleaning_fee              57389 non-null bool
city                      57389 non-null category
description               57389 non-null object
first_review              57388 non-null datetime64[ns]
host_has_profile_pic      57389 non-null bool
host_identity_verified    57389 non-null bool
host_response_rate        57388 non-null float64
host_since                57389 non-null datetime64[ns]
instant

In [47]:
# datetime.datetime objects have no dtype attribute; type() reports the class instead
type(date_published)

## Create Models

### Logistic Regression Model

In [15]:
##Basic model, all numeric data
# review_scores_rating is left out because 'grade' is binned directly from it
numeric_features = ['log_price','accommodates','bathrooms','number_of_reviews','bedrooms','beds']
x_train, x_test, y_train, y_test = train_test_split(df_imputed[numeric_features], df_imputed['grade'],
                                                    test_size=0.2, random_state=0)
lr_basic = LogisticRegression(max_iter=1000).fit(x_train, y_train)
lr_basic.predict(x_test.iloc[[0]])  # predict the first test row, kept 2-D

### Interpretation of Feature Importance

<i>[30 points]
* Use weights from logistic regression to interpret importance of different features for each classification task. Explain interpretation in detail.
    * Why do you think some variables are more important?</i>
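
One way the weights could be read is sketched below. It is only an illustration under stated assumptions: the data is synthetic (the columns `log_price`, `number_of_reviews`, and `noise` are stand-ins, and the binary target is constructed here, not taken from the Airbnb frame). Features are standardized first so that coefficient magnitudes are on a comparable scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in features (hypothetical values, not the Airbnb data)
X = pd.DataFrame({
    'log_price': rng.normal(4.8, 0.7, n),
    'number_of_reviews': rng.poisson(20, n).astype(float),
    'noise': rng.normal(0, 1, n),
})

# Assumed target: log_price matters most, number_of_reviews a little, noise not at all
logit = 2.0 * (X['log_price'] - 4.8) / 0.7 + 0.5 * (X['number_of_reviews'] - 20) / np.sqrt(20)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize, then fit, so each weight is "change in log-odds per one standard deviation"
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

coefs = model.named_steps['logisticregression'].coef_[0]
weights = pd.Series(coefs, index=X.columns)
# Order features by absolute weight, largest first
weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(weights)
```

For a multi-class target such as `grade`, `coef_` holds one row of weights per class, so the same ranking would be done per row.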

### Support Vector Machine Model
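
With 57,389 listings, a linear SVM trained by stochastic gradient descent is one option consistent with the instructions. The sketch below assumes synthetic data from `make_classification` in place of the Airbnb features; `SGDClassifier` with `loss='hinge'` fits a linear SVM, and the `alpha` value shown is an arbitrary starting point, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix (assumption)
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=1.5, random_state=0)

# 80/20 split, stratified so both sets keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Hinge loss + L2 penalty (the default) is a linear SVM; scaling helps SGD converge
svm_sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss='hinge', alpha=1e-4, max_iter=1000, tol=1e-3, random_state=0))
svm_sgd.fit(X_train, y_train)

sgd_accuracy = accuracy_score(y_test, svm_sgd.predict(X_test))
print(f'test accuracy: {sgd_accuracy:.3f}')
```

Because the scaler sits inside the pipeline, it is fit on the training rows only, which keeps test-set statistics out of training.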

### Interpretation of Support Vectors

<i>[10 pts] Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model, then analyze the support vectors from the subsampled dataset.</i>
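
The subsampling approach could look roughly like the following. This is a sketch on synthetic data (an assumption; the subsample size of 2,000 and `C=1.0` are arbitrary choices), showing which fitted attributes of `SVC` expose the support vectors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the full prepared dataset (assumption)
X, y = make_classification(n_samples=20000, n_features=6, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

# Subsample without replacement so SVC trains in reasonable time
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=2000, replace=False)
X_sub = StandardScaler().fit_transform(X[idx])
y_sub = y[idx]

svc = SVC(kernel='linear', C=1.0).fit(X_sub, y_sub)

# support_ holds row positions (into X_sub) of the support vectors,
# n_support_ the count per class
print('support vectors per class:', svc.n_support_)
print('fraction of subsample:', len(svc.support_) / len(X_sub))

# Signed margin y * f(x): support vectors sit on or inside the margin
# (values up to about 1), other points lie outside it
signed = np.where(y_sub == 1, 1, -1) * svc.decision_function(X_sub)
sv_margin = signed[svc.support_].mean()
other_margin = np.delete(signed, svc.support_).mean()
print('mean signed margin, support vectors:', sv_margin)
print('mean signed margin, other points   :', other_margin)
```

Rows selected by `svc.support_` can then be compared against the rest of the subsample, for example by summarizing each feature for the two groups.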

## Model Comparisons: Advantages, Performance, Efficiency

<i>[10 Points]
* Discuss advantages of each model for each classification task
* Does one type of model offer superior performance over another in terms of prediction accuracy?
    * In terms of training time or efficiency? Explain.</i>
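
A comparison of accuracy and training time might be set up as below. The data is synthetic and the helper `fit_and_score` is a hypothetical name introduced for this sketch; timings from a single run are noisy, so repeated runs would be needed before drawing conclusions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features (assumption)
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)


def fit_and_score(model):
    """Fit on the training split; return test accuracy and wall-clock fit time."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return {'accuracy': model.score(X_test, y_test), 'fit_seconds': elapsed}


results = {
    'logistic_regression': fit_and_score(LogisticRegression(max_iter=1000)),
    'linear_svm_sgd': fit_and_score(SGDClassifier(loss='hinge', random_state=0)),
}

for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}, fit time={r['fit_seconds']:.4f}s")
```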