# A) Prediction problem: Logistic regression

## A.1) Exploratory data analysis

### Insight 1
Since this is a classification problem, a good place to start is to check whether `host_is_superhost` is imbalanced. If it were, the threshold applied to the predicted probabilities might need adjusting. In this scenario, neither type of misclassification carries a special cost; we simply want to predict accurately whether or not a host is a superhost. The classes are not drastically imbalanced, so the default threshold of 0.5 is suitable.
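
The balance check described above can be sketched as follows. The labels here are made up for illustration (they are not the assignment data), and `class_balance` is a hypothetical helper; on the real data the same call would be applied to `train['host_is_superhost']`.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.Series:
    """Return the share of observations in each class."""
    return labels.value_counts(normalize=True).sort_index()

# Illustrative labels only (NOT the real data): 't' = superhost, 'f' = not a superhost
toy_labels = pd.Series(['t', 'f', 'f', 't', 'f', 't', 'f', 'f', 't', 'f'])

shares = class_balance(toy_labels)
print(shares)

# Share of the smaller class; a very small value would hint at an imbalance
minority_share = shares.min()
print(f"minority class share: {minority_share:.2f}")
```

If the minority share were very small, it might be worth exploring thresholds other than 0.5; how small is "too small" is a judgment call rather than a fixed rule.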

### Insight 2

Similar to the linear regression problem, we can examine whether transforming predictors such as `number_of_reviews` helps in building the classification model. Histograms show that some of these distributions are right-skewed, so log transformations are likely to help build a more accurate model.

In the same vein, related variables such as `number_of_reviews_ltm` and `number_of_reviews_l30d` are also right-skewed, so they can be transformed as well.
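
As a rough numeric companion to the histograms, the sample skewness can be compared before and after the log transformation. This is a minimal sketch on made-up review counts (not the real data); it mirrors the notebook's approach of replacing 0 with 1 before taking the log.

```python
import numpy as np
import pandas as pd

# Illustrative review counts only (NOT the real data):
# many small values and a few very large ones
toy_reviews = pd.Series([0, 1, 1, 2, 2, 3, 5, 8, 13, 40, 120, 400])

# Replace 0 with 1 so the log is defined, then take the natural log
log_reviews = np.log(toy_reviews.replace(0, 1))

raw_skew = toy_reviews.skew()
log_skew = log_reviews.skew()
print(f"skewness before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A positive skewness indicates a long right tail, so a smaller value after the transformation suggests the log has pulled the tail in. `np.log1p` is a common alternative that handles zeros without editing them, but it is not what this notebook uses.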

### Insight 3

I also analyzed the overall correlations in the data set, just as in the linear regression problem, particularly with respect to `host_is_superhost`.

From the correlations, `log_reviews`, `log_reviews_ltm`, `host_total_listings_count`, and `has_missing` (an indicator of whether an observation has any missing values) are among the predictors most strongly correlated with `host_is_superhost`. This provided a good basis for including these predictors in the logistic regression model.
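
One way to produce such a ranking is sketched below. The data frame is a toy example (not the real data) whose column names mirror the notebook; the ranking uses absolute Pearson correlations so that strong negative relationships are not overlooked.

```python
import pandas as pd

# Illustrative data only (NOT the real data); column names mirror the notebook
toy = pd.DataFrame({
    'host_is_superhost': [0, 0, 0, 1, 1, 1],
    'log_reviews':       [0.1, 0.5, 0.9, 2.0, 2.5, 3.0],
    'has_missing':       [1, 0, 1, 0, 1, 0],
    'property_cats':     ['a', 'b', 'a', 'b', 'a', 'b'],  # non-numeric, excluded below
})

# Pearson correlation of every numeric predictor with the response
corr_with_target = (
    toy.select_dtypes(include='number')
       .corr()['host_is_superhost']
       .drop('host_is_superhost')
)

# Order predictors by the absolute size of their correlation
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Correlation with a binary response is only a rough screening tool, so the ranking is a starting point for variable selection rather than a final answer.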

## A.2) Data cleaning / preparation

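The following cleaning and preparation steps were applied to both the training and test data: a `has_missing` indicator was created before imputation; missing numeric values were imputed with the median and missing categorical values with the mode; the review counts were log-transformed (after replacing 0 with 1); date columns were converted to the number of days elapsed; percentage strings were converted to numbers; `property_type` was grouped into broader categories; a numeric bathroom count was extracted from `bathrooms_text`; and `t`/`f` columns were recoded as 1/0.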

In [41]:
import pandas as pd
import seaborn as sns
import numpy as np
from datetime import datetime as dt
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
from sklearn.linear_model import LinearRegression, LogisticRegression, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score, roc_curve, auc, \
precision_score, recall_score, confusion_matrix, r2_score
from sklearn.model_selection import cross_val_score, cross_val_predict

In [42]:
train = pd.read_csv('./train_classification.csv')
test = pd.read_csv('./test_classification.csv')

In [43]:
train['has_missing'] = train.isnull().any(axis=1).astype(int)
test['has_missing'] = test.isnull().any(axis=1).astype(int)

In [44]:
# Define a function to categorize the property types
def categorize_property(property_type):
    if 'Entire' in property_type:
        return 'Entire Home/Apartment'
    elif 'Private' in property_type:
        return 'Private Room'
    elif 'Shared' in property_type:
        return 'Shared Accommodation'
    elif property_type in ['Room in hotel', 'Room in boutique hotel', 'Boat']:
        return 'Specialty Accommodations'
    else:
        return 'Other'

In [45]:
# overall function to clean training and test data
def clean_data(df):
    
    if 'host_is_superhost' in df.columns:
        df.host_is_superhost = df.host_is_superhost.replace({'t': 1, 'f': 0})
        
    # replace missing values of numeric variables with the median
    numeric_columns = df.select_dtypes(include=['number']).columns
    df[numeric_columns] = df[numeric_columns].apply(lambda x: x.fillna(x.median()))

    # replace missing values of categorical variables with the mode 
    categorical_columns = df.select_dtypes(include=['object']).columns
    df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])
    
    # log transform price if the column is present (the original price column is kept)
    if 'price' in df.columns:
        df['log_price'] = np.log(df['price'])
    
    # replace any 0 values with 1 so that the log transformation is defined
    df['number_of_reviews'] = df['number_of_reviews'].replace(0, 1)
    df['number_of_reviews_ltm'] = df['number_of_reviews_ltm'].replace(0, 1)
    df['number_of_reviews_l30d'] = df['number_of_reviews_l30d'].replace(0, 1)

    df['log_reviews'] = np.log(df.number_of_reviews)
    df['log_reviews_ltm'] = np.log(df.number_of_reviews_ltm)
    df['log_reviews_l30d'] = np.log(df.number_of_reviews_l30d)
    
    # calculate the number of days since the host became a host
    df['host_since'] = pd.to_datetime(df['host_since'])
    current_date = dt.now()
    df['host_since_days'] = (current_date - df['host_since']).dt.days
    
    # calculate days since first/last review
    df['first_review'] = pd.to_datetime(df['first_review'], errors='coerce')
    df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')

    df['first_review_days'] = (current_date - df['first_review']).dt.days
    df['last_review_days'] = (current_date - df['last_review']).dt.days
    
    # make response_rate and acceptance_rate into numeric dtype
    df['host_response_rate'] = df['host_response_rate'].str.rstrip('%').astype('float')
    df['host_acceptance_rate'] = df['host_acceptance_rate'].str.rstrip('%').astype('float')
    
    # subgroup property_type
    df['property_cats'] = df['property_type'].apply(categorize_property)
    
    # Extract the numeric part of 'bathrooms_text' (raw string avoids invalid escape warnings)
    df['bath_numeric'] = df['bathrooms_text'].str.extract(r'(\d+\.*\d*)', expand=False).astype(float)

    # Handle "Half-bath" entries, which contain no digits, by assigning 0.5
    is_half_bath = df['bathrooms_text'].str.contains('half', case=False, na=False)
    df.loc[is_half_bath, 'bath_numeric'] = 0.5

    # to binary
    df.host_identity_verified = df.host_identity_verified.replace({'t': 1, 'f': 0})
    df.host_has_profile_pic = df.host_has_profile_pic.replace({'t': 1, 'f': 0})
    df.has_availability = df.has_availability.replace({'t': 1, 'f': 0})
    df.instant_bookable = df.instant_bookable.replace({'t': 1, 'f': 0})
    
    # drop the columns modified
    df.drop(columns = ['host_since', 'first_review', 'last_review', 'property_type', 'bathrooms_text', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d'], inplace = True)

In [46]:
clean_data(train)
clean_data(test)

In [None]:
# set response and predictors for scaling
y_train = train.host_is_superhost
X_train = train.drop(columns = ['id'])
X_test = test.drop(columns = ['id'])

## A.3) Developing the model

### Step 1: Transformations

As touched upon in the EDA section, some variables were identified as better suited to a log transformation because their distributions are right-skewed. The variables that were log-transformed are `number_of_reviews` and the other review count variables.

Simply substituting the log-transformed `review` variables for the original untransformed predictors improved the model, so I kept them in the model.

### Step 2: Variable Selection

For this prediction problem, I again started with intuition and trial and error. `host_identity_verified` seems like a reasonable predictor of `host_is_superhost`, since superhosts would generally verify their identity to build credibility with potential customers. The number of reviews a host receives is also informative: more reviews suggest that their listings are popular and credible. The review scores themselves, such as `review_scores_rating`, should help as well, since superhosts generally have more experience in catering to the customer experience. Furthermore, a higher `host_total_listings_count` and a longer `host_since_days` imply that a host is highly experienced.

### Step 3: Significant Interactions

Similar to the variable selection step, intuition and trial and error helped me identify significant interactions that more accurately predict the `host_is_superhost` class. I expected interactions between the quantity of reviews and the different review scores to add predictive power. For example, a listing can have any combination of high or low review counts and review scores. A listing where both are high is more likely to belong to a superhost, and one where both are low is less likely to.

Likewise, the interaction between `host_total_listings_count` and `host_since_days` makes sense in predicting `host_is_superhost`, as these two predictor variables give us information about how experienced a host is and how much time they have had to be reputable enough to become a superhost.

### Step 4: Using `host_id`

Because `host_id` is available in both the training and test data sets, we can use it to look up the `host_is_superhost` class directly. If a test listing's `host_id` also appears in the training set, the known training label replaces the model's prediction.
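
This lookup relies on superhost status being a property of the host rather than of an individual listing. Two quick sanity checks are sketched below on toy frames (not the real data; the column names mirror the notebook): whether each training host carries a single label, and what share of test rows can actually be matched.

```python
import pandas as pd

# Illustrative frames only (NOT the real data); column names mirror the notebook
toy_train = pd.DataFrame({
    'host_id':           [10, 10, 20, 30],
    'host_is_superhost': [1,  1,  0,  1],
})
toy_test = pd.DataFrame({'host_id': [10, 20, 40, 50]})

# 1) Does every host in the training data carry exactly one label?
labels_per_host = toy_train.groupby('host_id')['host_is_superhost'].nunique()
labels_consistent = bool((labels_per_host == 1).all())

# 2) What share of test rows belong to a host already seen in training?
coverage = toy_test['host_id'].isin(toy_train['host_id']).mean()

print(f"labels consistent: {labels_consistent}, test coverage: {coverage:.0%}")
```

If the first check failed on the real data, taking the first matching training label would be an arbitrary choice for those hosts, and the lookup would deserve a second look.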

## A.4) Model equation
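
Written out, the formula fitted in this section corresponds to the log-odds equation below. The symbols $p$ and $\beta_0, \dots, \beta_{11}$ are notation introduced here for illustration: $p$ is the probability that `host_is_superhost` equals 1, and each `a*b` term in the formula expands to both main effects plus their product. The estimated values are those reported in the model summary.

```latex
\begin{aligned}
\log\!\left(\frac{p}{1-p}\right) ={}& \beta_0 + \beta_1\,\mathrm{host\_identity\_verified} \\
&+ \beta_2\,\mathrm{log\_reviews} + \beta_3\,\mathrm{review\_scores\_location}
 + \beta_4\,(\mathrm{log\_reviews} \times \mathrm{review\_scores\_location}) \\
&+ \beta_5\,\mathrm{log\_reviews\_ltm} + \beta_6\,\mathrm{review\_scores\_rating}
 + \beta_7\,(\mathrm{log\_reviews\_ltm} \times \mathrm{review\_scores\_rating}) \\
&+ \beta_8\,\mathrm{host\_total\_listings\_count} + \beta_9\,\mathrm{host\_since\_days}
 + \beta_{10}\,(\mathrm{host\_total\_listings\_count} \times \mathrm{host\_since\_days}) \\
&+ \beta_{11}\,\mathrm{has\_missing}
\end{aligned}
```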

In [47]:
model = smf.logit(formula = 'host_is_superhost~host_identity_verified+(log_reviews*review_scores_location)+(log_reviews_ltm*review_scores_rating)+(host_total_listings_count*host_since_days) + has_missing', data = train).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.472530
         Iterations 11


| | | | |
|---|---|---|---|
| Dep. Variable: | host_is_superhost | No. Observations: | 4977 |
| Model: | Logit | Df Residuals: | 4965 |
| Method: | MLE | Df Model: | 11 |
| Date: | Tue, 12 Mar 2024 | Pseudo R-squ.: | 0.3108 |
| Time: | 21:20:26 | Log-Likelihood: | -2351.8 |
| converged: | True | LL-Null: | -3412.4 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5.6783 | 1.231 | -4.614 | 0.000 | -8.090 | -3.266 |
| host_identity_verified | 0.6164 | 0.118 | 5.209 | 0.000 | 0.384 | 0.848 |
| log_reviews | 0.3279 | 0.385 | 0.852 | 0.394 | -0.426 | 1.082 |
| review_scores_location | -0.1006 | 0.197 | -0.510 | 0.610 | -0.488 | 0.286 |
| log_reviews:review_scores_location | -0.0631 | 0.080 | -0.788 | 0.431 | -0.220 | 0.094 |
| log_reviews_ltm | -11.4127 | 0.788 | -14.482 | 0.000 | -12.957 | -9.868 |
| review_scores_rating | 0.9533 | 0.267 | 3.566 | 0.000 | 0.429 | 1.477 |
| log_reviews_ltm:review_scores_rating | 2.4757 | 0.164 | 15.116 | 0.000 | 2.155 | 2.797 |
| host_total_listings_count | -0.0038 | 0.001 | -2.924 | 0.003 | -0.006 | -0.001 |
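
Because `log_reviews_ltm` enters the model through an interaction, its large negative coefficient cannot be read on its own. The sketch below uses the two coefficients from the summary table above to compute the slope of the log-odds with respect to `log_reviews_ltm` at a few values of `review_scores_rating`, holding the other predictors fixed. The function name and the ratings in the loop are illustrative choices, not part of the original analysis.

```python
# Coefficients copied from the summary table above
b_log_reviews_ltm = -11.4127  # main effect of log_reviews_ltm
b_interaction = 2.4757        # log_reviews_ltm:review_scores_rating

def slope_of_log_reviews_ltm(rating: float) -> float:
    """Change in log-odds per unit of log_reviews_ltm at a given review_scores_rating."""
    return b_log_reviews_ltm + b_interaction * rating

# Rating at which the slope changes sign
break_even_rating = -b_log_reviews_ltm / b_interaction

# Illustrative rating values (assumed, not taken from the data)
for rating in (4.0, 4.5, 5.0):
    print(f"rating {rating}: slope {slope_of_log_reviews_ltm(rating):.4f}")
print(f"break-even rating: {break_even_rating:.2f}")
```

Under these estimates, more recent reviews raise the predicted log-odds of being a superhost only when the rating is above the break-even value, and lower it otherwise.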


In [48]:
predicted = model.predict(test) > 0.5

In [49]:
# use 'ids' rather than 'id' to avoid shadowing the built-in id()
ids = test['id'].values
predicted = predicted.values
submission = pd.DataFrame({'id': ids, 'predicted': predicted})
submission = submission.reset_index(drop=True)

In [50]:
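# add 'host_id' to submission
submission['host_id'] = test['host_id']

# known label for each host in the training data (first occurrence per host_id)
known_labels = train.drop_duplicates(subset='host_id').set_index('host_id')['host_is_superhost']

# use the known training label when the host_id matches; otherwise keep the model's prediction
# (cast to int so the column holds consistent 0/1 labels instead of a mix of booleans and integers)
matched = submission['host_id'].map(known_labels)
submission['predicted'] = matched.fillna(submission['predicted'].astype(int)).astype(int)

# drop 'host_id' from submission
submission = submission.drop(columns=['host_id'])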

In [None]:
submission.to_csv('classification_submission.csv', index=False)