# CS599 Applied Machine Learning - Fall 2022: Class Challenge

In this class challenge assignment, you will be building a machine learning model to predict the price of an Airbnb rental given the dataset we have provided. **Total points: 100 pts**

To submit your solution, upload a **Python (.py)** file named **challenge.py** to Gradescope.

Submission link: https://www.gradescope.com/courses/427800/assignments/2437846

<br>

There will be a Leaderboard for the challenge, which you can find at https://www.gradescope.com/courses/427800/assignments/2437846/leaderboard

You can use a nickname or stay anonymous on the leaderboard by using the field shown in the red box below:
<img src="https://raw.githubusercontent.com/chaudatascience/cs599_fall2022/master/challenge/submit.png" width="400">


To encourage you to get started early on the challenge, you are required to submit an **initial submission** by midterm (due on **Nov 29, 11:59 pm**). For this submission, your model needs to be better than the linear model with random weights that we provided. The **final submission** will be due on **Dec 8, 11:59 pm**. 


## Problem and dataset description
Pricing a rental property such as an apartment or house on Airbnb is a difficult challenge. A model that accurately predicts the price can potentially help renters and hosts on the platform make better decisions. In this assignment, your task is to train a model that takes features of a listing as input and predicts the price.
 
We have provided you with a dataset collected from the Airbnb website for New York, which has a total of 29,985 entries, each with 764 features. You may use the provided data as you wish in development. We will train your submitted code on the same provided dataset, and will evaluate it on 2 other test sets (one public, and one hidden during the challenge).
 
We have already done some minimal data cleaning for you, such as converting text fields into categorical values and getting rid of the NaN values. To convert text fields into categorical values, we used different strategies depending on the field. For example, we applied sentiment analysis to convert user reviews into numerical values (the 'comments' column), and we added a separate column for each state name, where a value of '1' indicates the property's location. Column names are included in the data files and are mostly descriptive.
 
Also in this data cleaning step, the price value that we are trying to predict was calculated by taking the log of the original price. Hence, on the training set, the minimum value of the output price is around 2.302 and the maximum value is around 9.21.
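
The quoted range is consistent with a natural logarithm, since ln(10) ≈ 2.302 and ln(10000) ≈ 9.21. The sketch below assumes that is the transform used; `log_price_to_price` is a hypothetical helper, not part of the provided code. It maps predictions back to the original price scale for inspection.

```python
import numpy as np

def log_price_to_price(log_price):
    """Invert the log transform, assuming the natural logarithm was used."""
    return np.exp(np.asarray(log_price, dtype=float))

# Endpoints of the training-set range quoted above.
prices = log_price_to_price([2.302, 9.21])
print(prices.round(1))
```

Because the target is in log space, errors are relative rather than absolute. For example, an MSE of 0.12 corresponds to a root-mean-square error of about 0.35 log units, meaning predictions are typically off by a factor of roughly 1.4 in price.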


## Datasets and Codebase

Please download the zip file from the link posted on Piazza/Resources. 
In this notebook, we implemented a linear regression model with random weights (**attached at the end**). The dataset consists of 2 CSV files, one for features and one for labels:

    challenge.ipynb (This file: you need to add your code in here, convert it to .py to submit)
    data_cleaned_train_comments_X.csv
    data_cleaned_train_y.csv


## Instructions to build your model
1.  Implement your model in **challenge.ipynb**. You need to modify the *train()* and *predict()* methods of the **Model** class (*attached at the end of this notebook*). You can also add other methods/attributes to the class, or even add new classes in the same file if needed, but do NOT change the signatures of *train()* and *predict()*, as we will call these 2 methods to evaluate your model.

2. To submit, you need to convert your notebook (.ipynb) to a Python **(.py)** file. Make sure the Python file contains a class named **Model** with two methods: *train* and *predict*. Remove other experimental code where needed to avoid exceeding the time limit on Gradescope.
 
3.  You can submit your code on Gradescope to test your model. You can submit as many times as you like. The last submission will count as the final model.
 
An example linear regression model with random weights is provided to you in this notebook. Please take a look and replace the code with your own.
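
To make the required interface concrete, here is a minimal sketch of what a linear model with random weights could look like. It is an illustration written under assumptions, not the course's provided baseline: the class name `RandomLinearModel`, the `seed` argument, and the normal initialisation are all hypothetical. A real submission must name its class `Model`.

```python
import numpy as np
import pandas as pd

class RandomLinearModel:
    """Hypothetical stand-in for a linear model with random weights."""

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        self.weights = None
        self.bias = None

    def train(self, X_train: pd.DataFrame, y_train: pd.DataFrame) -> None:
        # No fitting happens: one random weight is drawn per feature.
        n_features = X_train.shape[1]
        self.weights = self.rng.normal(size=n_features)
        self.bias = float(self.rng.normal())

    def predict(self, X_test: pd.DataFrame) -> np.ndarray:
        # Linear prediction X @ w + b, returned with shape (N, 1).
        y_pred = X_test.to_numpy(dtype=float) @ self.weights + self.bias
        return y_pred.reshape(-1, 1)

# Tiny made-up frame, only to exercise the two methods.
X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 0.0]})
y = pd.DataFrame({"price": [4.0, 5.0, 6.0]})
model = RandomLinearModel()
model.train(X, y)
print(model.predict(X).shape)  # (3, 1)
```

Any replacement model only needs to keep the same two method signatures; everything inside them is free to change.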


## Evaluation

We will evaluate your model as follows

    model = Model() # Model class imported from your submission
    X_train = pd.read_csv("data_cleaned_train_comments_X.csv")  # pandas Dataframe
    y_train = pd.read_csv("data_cleaned_train_y.csv")  # pandas Dataframe
    model.train(X_train, y_train) # train your model on the dataset provided to you
    y_pred = model.predict(X_test) # test your model on the hidden test set (pandas Dataframe)
    mse = mean_squared_error(y_test, y_pred) # compute mean squared error
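
One detail worth checking when reproducing this evaluation locally: the labels are loaded as a one-column DataFrame of shape (N, 1), while many estimators return predictions of shape (N,). scikit-learn's `mean_squared_error` accepts either shape, but a hand-written NumPy version can silently broadcast to an (N, N) matrix. The sketch below uses made-up numbers to show the difference.

```python
import numpy as np

y_test = np.array([[3.0], [4.0], [5.0]])  # shape (N, 1), like a one-column DataFrame
y_pred = np.array([3.0, 4.0, 6.0])        # shape (N,), as many estimators return

# Pitfall: (N, 1) - (N,) broadcasts to (N, N), so this is NOT the MSE.
wrong = np.mean((y_test - y_pred) ** 2)

# Flattening both arrays first gives the intended element-wise difference.
mse = np.mean((y_test.ravel() - y_pred.ravel()) ** 2)

print(wrong, mse)
```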

There will be 2 test sets. One is public, which means you can see your MSE on it on the Leaderboard (denoted as *MSE (PUBLIC TESTSET)*); the other is hidden during the challenge (denoted as *MSE (HIDDEN TESTSET)*).
Your score on the hidden test set will be your performance measure, so don’t try to overfit your model to the public test set. Your final grade will depend on the following criteria:


1.  	Is it original code (implemented by you)?
2.  	Does it take a reasonable time to complete?
Your model needs to finish running in under 40 minutes on our machine. We run the code on a machine with 4 CPUs, 6.0GB RAM.
3.  	Does it achieve a reasonable MSE?
    - **Initial submission (10 pts)**: Your model has to be better than the random weights linear model (denoted as RANDOM on the Leaderboard) provided in the file. Note that this is due on **Nov 29, 11:59pm**.
    - **Final submission (90 pts)**: Your last submission will count as the final submission. If its performance is better than our baseline model (denoted as BASELINE on the Leaderboard), you will get 60 points for the final submission. If its performance is better than our full credit model (denoted as FULL CREDIT), you will get 90 points for the final submission. Submissions that fall between BASELINE and FULL CREDIT will get 60 points plus some partial credit depending on the performance. We will use MSE on the hidden test set to evaluate your model (lower is better). Due date: **Dec 8, 11:59pm**.
![alt text](https://raw.githubusercontent.com/chaudatascience/cs599_fall2022/master/challenge/leaderboard.png)
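
The 40-minute limit from criterion 2 can be checked locally before submitting. The sketch below is one way to do it: `run_timed` is a hypothetical helper, and the summed range is only a stand-in workload for a call such as `model.train(X_train, y_train)`. Local timings are only indicative, since the grading machine (4 CPUs, 6.0GB RAM) may be slower or faster than your own.

```python
import time

TIME_LIMIT_SECONDS = 40 * 60  # limit stated in the grading criteria

def run_timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in practice wrap the train and predict calls instead.
total, elapsed = run_timed(sum, range(1_000_000))
print(f"elapsed: {elapsed:.3f}s, within limit: {elapsed < TIME_LIMIT_SECONDS}")
```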


**Bonus**: The **top 3** submissions with the best MSE on the hidden test set will each get a 5-point bonus.

**Note 1: This is a regression problem** in which we want to predict the price of an Airbnb property. You should try different models and tune their hyperparameters. A little feature engineering can also help boost performance.
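
One way to compare candidate models without leaning on the public leaderboard is k-fold cross-validation on the training data. The sketch below assumes scikit-learn's `cross_val_score` and uses synthetic data as a stand-in for the provided CSV files. The two candidates and their settings are arbitrary examples, not recommended choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the provided features and log-price labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "gbr": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MSE so that higher is better; flip the sign.
    scores = cross_val_score(estimator, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.4f}")
```

The same loop can be run over a grid of hyperparameter values for a single model, keeping the setting with the lowest mean cross-validated MSE.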

**Note 2**: You may NOT use additional datasets. This assignment is meant to challenge you to build a better model, not collect more training data, so please only use the data we provided. We tested the code on Python 3.10 and 3.9, thus it’s highly recommended to use these Python versions for the challenge.


In this challenge, you can only use built-in Python modules and the following packages:
- NumPy
- pandas
- scikit-learn
- matplotlib
- scipy
- torchsummary
- xgboost
- torchmetrics
- lightgbm
- catboost
- torch



In [3]:
### Sample code for the challenge

import numpy as np
import pandas as pd
from torch.utils.data.dataloader import DataLoader
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA
import xgboost as xgb
from sklearn.linear_model import Ridge, Lasso
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, Matern, RBF, DotProduct, RationalQuadratic, ExpSineSquared
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

class Model:
    # Voting ensemble of a gradient boosting regressor and an MLP, trained on decorrelated features
    
    def __init__(self):
        self.model = None
        self.to_drop = None
        self.model_2 = None
        self.model_3 = None

    def train(self, X_train: pd.DataFrame, y_train: pd.DataFrame) -> None:
        """
        Train the model on the training data.
        Drops highly correlated features, then fits a voting ensemble of a
        gradient boosting regressor and an MLP regressor.
        :param X_train: shape (N,d)
        :param y_train: shape (N,1)
            where N is the number of observations, d is feature dimension
        :return: None
        """
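        # Drop every feature whose absolute correlation with an earlier column exceeds 0.6
        cor_matrix = X_train.corr().abs()
        # Keep only the strict upper triangle so each pair is checked once
        # (np.bool was deprecated in NumPy 1.20; the built-in bool is used instead)
        upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
        to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.6)]
        self.to_drop = to_drop  # remembered so predict() drops the same columns
        X_train = X_train.drop(columns=to_drop)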
        
        
#         feature_names = list(X_train.columns)
#         for feature in feature_names:
#             upper_limit = X_train[feature].mean() + 3*X_train[feature].std()
#             lower_limit = X_train[feature].mean() - 3*X_train[feature].std()
#             new_X_train = X_train.copy()
#             new_X_train.loc[new_X_train[feature]>upper_limit, feature] = upper_limit
#             new_X_train.loc[new_X_train[feature]<lower_limit, feature] = lower_limit
#             X_train = new_X_train
        
        # Flatten the labels from shape (N, 1) to (N,), as scikit-learn regressors expect
        y_train = np.asarray(y_train).ravel()
        
        
#         regr = MLPRegressor(activation="relu", solver="sgd", batch_size=160, learning_rate="adaptive", learning_rate_init=0.002, power_t=0.1, momentum=0.9, max_iter=1000, random_state=2022)
#         regr = xgb.XGBRegressor(learning_rate=0.1, max_depth=3, n_estimators=500)
#         regr = Ridge(alpha=1.5)
#         regr = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
        model_1 = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                            max_depth=4, max_features='sqrt',
                                            min_samples_leaf=15,
                                            min_samples_split=10,
                                            loss='huber')
    
#         regr = make_pipeline(RobustScaler(),
#                     GradientBoostingRegressor(n_estimators=5000, learning_rate=0.05,
#                                    max_depth=4, max_features='sqrt',
#                                    min_samples_leaf=15, min_samples_split=10, 
#                                    loss='huber', random_state =5))
                            
        
#         model_2 = xgb.XGBRegressor(learning_rate=0.1, max_depth=3, n_estimators=500)
#         model_2 = KNeighborsRegressor(n_neighbors=10, weights='distance')
        
#         kernel = 1.0**2 * Matern(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0), nu=0.5)
#         model_2 = GaussianProcessRegressor(kernel=kernel, alpha=5e-9, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True)
    
#         model_2 = HistGradientBoostingRegressor(learning_rate=0.05,
#                                    max_depth=4,
#                                    min_samples_leaf=15)
        
        model_3 = MLPRegressor(solver="sgd", learning_rate="adaptive", learning_rate_init=0.002, max_iter=1000)
        
        eclf1 = VotingRegressor(estimators=[('gbr', model_1), ('mlp', model_3)])
        eclf1 = eclf1.fit(X_train, y_train)
        self.model = eclf1

        return None

    def predict(self, X_test: pd.DataFrame) -> np.array:
        """
        Use the trained model to predict on an unseen dataset.
        Drops the same columns that were removed during training.
        :param X_test: shape (N, d), where N is the number of observations, d is feature dimension
        :return: prediction, shape (N,)
        """
        
        X_test = X_test.drop(columns = self.to_drop)
        y_pred = self.model.predict(X_test)
#         y_pred_2 = self.model_2.predict(X_test)
#         y_pred_3 = self.model_3.predict(X_test)
        
#         y_pred = (y_pred_1+y_pred_2+y_pred_3)/3.0
        
        return y_pred
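
The correlation filter used in `train()` above can be tried in isolation on a toy frame. The sketch below wraps the same upper-triangle logic in a hypothetical helper, `correlated_columns`. The toy columns are made up so that the expected result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

def correlated_columns(df: pd.DataFrame, threshold: float) -> list:
    """Return columns whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with "a" and "b"
})
to_drop = correlated_columns(toy, threshold=0.6)
print(to_drop)  # ['b']
```

A threshold of 0.6 removes features fairly aggressively; the commented-out experiments later in this notebook also tried 0.95 and 0.99.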

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# model = Model() # Model class imported from your submission
X_train = pd.read_csv("/Users/williamlee/Desktop/Past Courses/CS599/class_challenge/data_cleaned_train_comments_X.csv")  # pandas Dataframe
y_train = pd.read_csv("/Users/williamlee/Desktop/Past Courses/CS599/class_challenge/data_cleaned_train_y.csv")  # pandas Dataframe
# X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2)
# model.train(X_train, y_train) # train your model on the dataset provided to you
# y_pred = model.predict(X_test) # test your model on the hidden test set (pandas Dataframe)
# mse = mean_squared_error(y_test, y_pred) # compute mean squared error

In [5]:
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_squared_error
# from sklearn.neural_network import MLPRegressor
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler
# import xgboost as xgb

# # regr = MLPRegressor(hidden_layer_sizes = (100,), activation="relu", solver="sgd", batch_size=160, learning_rate="adaptive", learning_rate_init=0.002, power_t=0.1, momentum=0.9, max_iter=1000)
# # regr = MLPRegressor(hidden_layer_sizes = (100,), activation="relu", solver="sgd", learning_rate_init=0.002, max_iter=1000)
# regr = xgb.XGBRegressor(objective='reg:squarederror', n_estimators = 400, learning_rate =0.1, random_state=2022)
# regr.fit(X_train, y_train)
# y_pred = regr.predict(X_test)

# mse = mean_squared_error(y_test, y_pred)
# print(mse)

In [6]:
# # for x in [0.1, 0.2, 0.3]:
# #     for y in [250, 300, 350]:
# from sklearn.linear_model import Ridge, Lasso

# X_train = pd.read_csv("/Users/williamlee/Desktop/CS599/class_challenge/data_cleaned_train_comments_X.csv")  # pandas Dataframe
# y_train = pd.read_csv("/Users/williamlee/Desktop/CS599/class_challenge/data_cleaned_train_y.csv")  # pandas Dataframe
# X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=2022)
# cor_matrix = X_train.corr().abs()
# upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(np.bool))
# to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
# X_train = X_train.drop(columns = to_drop)

# # feature_names = list(X_train.columns)
# # for feature in feature_names:
# #     upper_limit = X_train[feature].mean() + 3*X_train[feature].std()
# #     lower_limit = X_train[feature].mean() - 3*X_train[feature].std()
# #     new_X_train = X_train.copy()
# #     new_X_train.loc[new_X_train[feature]>upper_limit, feature] = upper_limit
# #     new_X_train.loc[new_X_train[feature]<lower_limit, feature] = lower_limit
# #     X_train = new_X_train

# y_train = np.array(y_train)
# r,c = y_train.shape
# y_train = y_train.reshape(r)


# #         regr = MLPRegressor(activation="relu", solver="sgd", batch_size=160, learning_rate="adaptive", learning_rate_init=0.002, power_t=0.1, momentum=0.9, max_iter=1000, random_state=2022)
# # regr = xgb.XGBRegressor(objective='reg:squarederror', n_estimators = 400, learning_rate = 0.1, random_state=2022)
# regr = Ridge(alpha=3, random_state=2022)
# regr.fit(X_train, y_train)
# X_test = X_test.drop(columns = to_drop)
# y_pred = regr.predict(X_test)
# mse = mean_squared_error(y_test, y_pred)
# print(mse)
        
# #best 0.11982430715908494  350  0.1
# #0.1452965260127538 1.0
# #0.14528293249677898 1.5

In [7]:
# # for x in [0.1, 0.2, 0.3]:
# #     for y in [250, 300, 350]:


# X_train = pd.read_csv("/Users/williamlee/Desktop/CS599/class_challenge/data_cleaned_train_comments_X.csv")  # pandas Dataframe
# y_train = pd.read_csv("/Users/williamlee/Desktop/CS599/class_challenge/data_cleaned_train_y.csv")  # pandas Dataframe
# X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33, random_state=2022)
# cor_matrix = X_train.corr().abs()
# upper_tri = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(np.bool))
# to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.99)]
# X_train = X_train.drop(columns = to_drop)

# feature_names = list(X_train.columns)
# for feature in feature_names:
#     upper_limit = X_train[feature].mean() + 3*X_train[feature].std()
#     lower_limit = X_train[feature].mean() - 3*X_train[feature].std()
#     new_X_train = X_train.copy()
#     new_X_train.loc[new_X_train[feature]>upper_limit, feature] = upper_limit
#     new_X_train.loc[new_X_train[feature]<lower_limit, feature] = lower_limit
#     X_train = new_X_train

# y_train = np.array(y_train)
# r,c = y_train.shape
# y_train = y_train.reshape(r)


# #         regr = MLPRegressor(activation="relu", solver="sgd", batch_size=160, learning_rate="adaptive", learning_rate_init=0.002, power_t=0.1, momentum=0.9, max_iter=1000, random_state=2022)
# regr = xgb.XGBRegressor(objective='reg:squarederror', n_estimators = y, learning_rate =x, random_state=2022)
# regr.fit(X_train, y_train)
# X_test = X_test.drop(columns = to_drop)
# y_pred = regr.predict(X_test)
# mse = mean_squared_error(y_test, y_pred)
# print(mse)

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from datetime import datetime
start_time = datetime.now()


model = Model()
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.33)
model.train(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)
# do your work here
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

#0.12109658812758879 for learning rate = 0.05
#0.129



0.13627437584720034
Duration: 0:03:07.515489


**GOOD LUCK!**
