# ML PROJECT: Anna Hauk 

# <font color = hotpink>PREDICTING RATING SCORE ON AIRBNB NYC DATA</font>
In this project, I will implement my machine learning project plan. I will:

1. Perform exploratory data analysis on the data to determine which feature engineering and data preparation techniques to use.
2. Prepare the data for the models; select features and a label.
3. Pick a couple of regression models.
4. Fit each model to the training data and evaluate it.
5. Improve the models by performing model selection and/or feature selection to find the best model for the problem.

### Import Packages

Before getting started, import the packages used throughout the notebook.

In [None]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV

## Part 1: Exploratory Data Analysis

We have chosen to work with one of four data sets:
* The Airbnb NYC "listings" data set, located in the file `airbnbListingsData.csv`


In [None]:
filename = "/Users/annahauk/Desktop/Machine Learning/airbnbListingsData.csv"
#filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
df = pd.read_csv(filename, header = 0)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
type(df.columns)
a = list(df.columns)
df.columns


In [None]:
#my fave
#type(df.dtypes)
df.dtypes

### How to figure out how many unique elements we have in one column:

In [None]:
a = df['host_total_listings_count'].dtype
print(a)
df['host_total_listings_count'].nunique()
#len(df['host_total_listings_count'].unique())
#helps us consider 

In [None]:
a = df['host_location'].dtype
print(a)
df['host_location'].nunique()

In [None]:
df['host_location'].sample(15)

### Turns out there's another column called <font color = lightgreen>'neighbourhood_group_cleansed'</font>

In [None]:
print(df['neighbourhood_group_cleansed'].nunique())
df['neighbourhood_group_cleansed'].unique()

# Part 2: Data Cleaning

<b>Feature Engineering</b>: selecting and transforming the most relevant variables (features) from raw data when creating a predictive model using machine learning or statistical modeling. Typical steps include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI

Okay, so we were just looking at the <font color = orange> data types </font> of each column. We'll use that now to pick which columns are relevant to keep.

In [None]:
df.dtypes

In [None]:
objects = list(df.select_dtypes(['object']))
objects
#shows all the columns with object 

In [None]:
df['name']

### Let's start picking columns that are relevant to a listing's overall rating

In [None]:
# And we'll modify a separate DataFrame (copy so that df itself stays untouched)
df_rate = df.copy()
df_rate.shape
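
Plain assignment such as `df_rate = df` binds a second name to the same object rather than creating a separate DataFrame. In this notebook the later `drop` calls return new frames, so `df` is not altered in practice, but `.copy()` makes the intent explicit. A minimal sketch on a toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy data, not taken from the Airbnb file
toy = pd.DataFrame({"price": [100, 200, 300]})

alias = toy               # same object, two names
independent = toy.copy()  # separate object with its own data

# Adding a column through the alias also changes `toy`
alias["price_doubled"] = alias["price"] * 2

same_object = alias is toy
toy_has_new_column = "price_doubled" in toy.columns
copy_has_new_column = "price_doubled" in independent.columns
print(same_object, toy_has_new_column, copy_has_new_column)
```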

In [None]:
df_rate = df_rate.drop(columns = ['name','description','neighborhood_overview','host_name','host_location',
 'host_about','amenities', 'host_acceptance_rate'])
df_rate.dtypes
# amenities would be a good column, but we'd need NLP to analyze which specific words in the amenities contribute to a high rating

In [None]:
df_rate.shape[1]

In [None]:
df_rate = df_rate.drop(columns = ['host_total_listings_count','host_has_profile_pic', 'host_identity_verified',
                             'minimum_nights','maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
                              'minimum_maximum_nights', 'maximum_maximum_nights','minimum_nights_avg_ntm',
                              'maximum_nights_avg_ntm'], axis = 1)


In [None]:
print(df_rate.shape)
df_rate.dtypes

In [None]:
df_rate['number_of_reviews_ltm'].sample(10) #out of 29
#df_rate.loc[22074]

In [None]:
df_rate.columns

In [None]:
df_rate.rename(columns = {'neighbourhood_group_cleansed':'neighborhood'}, inplace = True)

In [None]:
df_rate.columns

### Now we're going to proceed with cleaning our selected columns:

## <b>Finding NaN values within columns</b>
>>  ### <font color = #5DADE2> then we'll fill them with mean values </font>
>>  ### <font color = #E53CDA> You could also drop rows with NaN values </font>
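
The two options named above (filling with the mean versus dropping rows) can be compared on a toy frame. This is only a sketch: the values are invented, and the column names `bedrooms` and `beds` are borrowed from the data set purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column
toy = pd.DataFrame({"bedrooms": [1.0, np.nan, 3.0],
                    "beds": [1.0, 2.0, np.nan]})

# Option 1: replace each missing value with the mean of its column
filled = toy.fillna(toy.mean(numeric_only=True))

# Option 2: drop every row that contains at least one NaN
dropped = toy.dropna()

print(filled)
print(len(toy), "rows before dropping,", len(dropped), "rows after")
```

Mean filling keeps every row at the cost of inventing values; dropping keeps only observed values at the cost of losing rows.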

In [None]:
nan_count = df_rate.isnull().sum()
nan_count
#nan = df_rate.isnull().any()
#nan
#nan_count_df_rate = np.sum(df_rate.isnull(), axis = 0)
#nan_count_df_rate

In [None]:
df_rate.columns[df_rate.isna().any()].tolist()

In [None]:
df_rate.loc[df_rate['bedrooms'].isnull()].head()
# df.dropna(): removes rows or columns that contain NaN or missing values.
# df.isna(): returns a DataFrame of booleans, with True marking missing values.
# df.duplicated(): returns a boolean Series denoting duplicate rows.
# df['sex'].value_counts(): counts how often each unique value occurs in a column ('sex' is just a placeholder column name).
# df.corr(): computes the pairwise correlation of all columns in the DataFrame.

In [None]:
# compute the mean of all non-null host_response_rate values and use it to fill the gaps
mean = df_rate['host_response_rate'].mean()
df_rate['host_response_rate'] = df_rate['host_response_rate'].fillna(mean)

print("Row 0:  " + str(df_rate['bedrooms'][0]))
mean_bedrooms = df_rate['bedrooms'].mean()
df_rate['bedrooms'] = df_rate['bedrooms'].fillna(mean_bedrooms)

print("Row 0:  " + str(df_rate['bedrooms'][0]))

mean_beds = df_rate['beds'].mean()
df_rate['beds'] = df_rate['beds'].fillna(mean_beds)

In [None]:
nan_count_df_rate_after = np.sum(df_rate.isnull(), axis = 0)
nan_count_df_rate_after

In [None]:
df_rate.columns[df_rate.isna().any()].tolist()

## <b>One Hot Encoding</b>

In [None]:
objects = list(df_rate.select_dtypes(['object']))
objects

In [None]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Create the encoder:
encoder = OneHotEncoder(handle_unknown="error", sparse_output=False)

# Fit and transform the encoder:
encoded_data = encoder.fit_transform(df_rate[['neighborhood', 'room_type']])

# Get the column names
category_names = encoder.get_feature_names_out(input_features=['neighborhood', 'room_type'])

# Create a DataFrame with the one-hot encoded data and set column names
df_enc = pd.DataFrame(encoded_data, columns=category_names)

In [None]:
print(df_rate['neighborhood'].nunique())
df_rate['room_type'].nunique()

In [None]:
df_enc.head()
#the two columns are encoded
#YIPPIEE

In [None]:
df_rate.shape

In [None]:
df_rate = df_rate.join(df_enc)

# Remove the original categorical features from df_rate:
df_rate = df_rate.drop(columns = ['neighborhood','room_type'])
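
`DataFrame.join` aligns rows by index label, so the join above relies on `df_rate` still having its default 0..n-1 index (which holds here because no rows were dropped). A hedged sketch on made-up data of what happens otherwise, using `pd.get_dummies` as a stand-in for the encoder:

```python
import pandas as pd

# Toy frame whose index is NOT 0..n-1 (e.g. after rows were filtered out)
toy = pd.DataFrame({"room_type": ["Private room", "Entire home/apt"],
                    "price": [50, 120]},
                   index=[10, 11])

# get_dummies keeps the original index labels
dummies = pd.get_dummies(toy["room_type"], prefix="room_type", dtype=float)

# A frame rebuilt from a bare array gets a fresh 0..n-1 index instead
misaligned = pd.DataFrame(dummies.to_numpy(), columns=dummies.columns)

good = toy.join(dummies)     # index labels match, rows line up
bad = toy.join(misaligned)   # no labels match, encoded cells become NaN

print(good.isna().sum().sum())
print(bad.isna().sum().sum())
```

With `OneHotEncoder`, passing `index=df_rate.index` to the `pd.DataFrame` constructor gives the same protection.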

In [None]:
df_rate.shape

In [None]:
list(df_rate.select_dtypes(['object']))
#ayyy no object data types we good on that front

## <b><font color= FD8826> Winsorization</font></b>
>> ### transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers
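
As a rough illustration of the idea, the sketch below caps a made-up price list at its 10th and 90th percentiles with NumPy. This is percentile clipping, a close cousin of what `scipy.stats.mstats.winsorize` does rather than an exact reimplementation, and the numbers are invented.

```python
import numpy as np

# Made-up prices with one extreme value at each end
prices = np.array([10, 50, 60, 70, 80, 90, 100, 110, 120, 5000], dtype=float)

# Bounds: 10th and 90th percentiles of the data
low, high = np.percentile(prices, [10, 90])

# Values outside the bounds are pulled to the nearest bound; the rest are untouched
winsorized = np.clip(prices, low, high)

print("bounds:", low, high)
print(winsorized)
```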

In [None]:
price_90 = np.percentile(df_rate['price'], 90)
price_90

### We're only going to winsorize the price column

In [None]:
df_rate['price'] > 296

In [None]:
df_rate.loc[28018,'price']

In [None]:
import scipy.stats as stats
df_rate['price'] = stats.mstats.winsorize(df_rate['price'], limits=[0.01, 0.01])
df_rate.tail(5)

In [None]:
df_rate.loc[28018,'price']

### <font color=skyblue> Data is clean and we're ready to roll </font>

# Part 3: Time to build and train our models

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

In [None]:
df_rate

## <font color=hotpink> <b> Starting with the Data Set: df_rate</b> </font>

In [None]:
y = df_rate['review_scores_rating']  # the label: this is what we're predicting
X = df_rate.drop(columns = ['review_scores_rating'])  # the features: every column except the label

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = .3, random_state = 1234)

In [None]:
X_train.shape

In [None]:
X_test.shape

## <b> <font color = 0B55FE> Linear Regression: Model #1 </font> </b>
###### color credit to Bryce Lu

In [None]:
df_rate.columns

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Create the  LinearRegression model object 
model = LinearRegression()

# Fit the model to the training data 
model.fit(X_train, y_train)

#  Make predictions on the test data 
prediction = model.predict(X_test)


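# Weight of the feature host_listings_count, looked up by name rather than by position
idx = X_train.columns.get_loc('host_listings_count')
print('Model Summary\n\nWeight_1 =  ', model.coef_[idx], '[weight of feature host_listings_count]')
# alpha
print('Alpha = ', model.intercept_, '[intercept]')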

In [None]:
sns.regplot(x='host_listings_count', y='review_scores_rating', data=df_rate);

In [None]:
model2 = LinearRegression()
 
model2.fit(X_train, y_train)

prediction2 = model2.predict(X_test)

print('Model Summary:\n')

# intercept (alpha)
print('Intercept:')
print('alpha = ' , model2.intercept_)

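features = X_train.columns  # feature names in the same order as model2.coef_ (the label is excluded)

print('\nWeights:')
for i, (name, w) in enumerate(zip(features, model2.coef_), start=1):
    print('w_', i, '= ', w, ' [ weight of ', name, ']')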

host_response_rate has a positive weight of 0.0277: when host_response_rate increases by one unit, the predicted review_scores_rating increases by approximately 0.0277 units, all other features held equal. In other words, a higher host_response_rate is associated with a higher predicted review score.

host_is_superhost has a weight of approximately -1.6035e-14, which is effectively zero, so it has almost no impact on the predicted review_scores_rating. In practical terms, this feature contributes almost nothing to the model's predictions.

host_listings_count has a negative weight of approximately -9.2516e-06: when host_listings_count increases by one unit, the predicted review_scores_rating decreases by approximately 9.2516e-06 units, all other features held equal. This suggests a negative relationship between host_listings_count and the review score, although the effect is tiny.
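
The reading of the weights above boils down to one multiplication: in a linear model, the change in the prediction equals the weight times the change in the feature, with everything else held fixed. A small sketch using the two weights quoted in the text (`predicted_change` is a hypothetical helper written for this illustration, not part of scikit-learn):

```python
# Weights quoted in the text above
w_host_response_rate = 0.0277
w_host_listings_count = -9.2516e-06


def predicted_change(weight, delta):
    """Change in a linear model's prediction when one feature moves by
    `delta` and all other features stay fixed (hypothetical helper)."""
    return weight * delta


# One-unit increase in host_response_rate
print(predicted_change(w_host_response_rate, 1))

# 100 additional listings for the same host
print(predicted_change(w_host_listings_count, 100))
```

Even a change of 100 listings moves the predicted rating by less than a thousandth of a point, which is why the effect is described as tiny.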

In [None]:
lr_rmse = np.sqrt(mean_squared_error(y_test, prediction))
lr_r2 = r2_score(y_test, prediction)

print('[LR] Root Mean Squared Error: {:.10f}'.format(lr_rmse))
print('[LR] R2: {:.10f}'.format(lr_r2))

## <b><font color=tomato>Decision Tree Regressor: Model #2</font></b>

In [None]:
from sklearn.tree import DecisionTreeRegressor
max_depth = [4, 8, 12, 16]
min_samples_leaf = [5, 10, 25, 50]
param_grid = {
    'max_depth': max_depth,
    'min_samples_leaf': min_samples_leaf
}

print('Running Grid Search...')

# 1. Create a DecisionTreeRegressor model object without supplying arguments. 
#    Save the model object to the variable 'dt_regressor'

dt_regressor = DecisionTreeRegressor()

# 2. Run a Grid Search with 3-fold cross-validation and assign the output to the object 'dt_grid'.
#    * Pass the model and the parameter grid to GridSearchCV()
#    * Set the number of folds to 3
#    * Specify the scoring method

dt_grid = GridSearchCV(dt_regressor, param_grid, cv = 3,scoring='neg_root_mean_squared_error')


# 3. Fit the model (use the 'grid' variable) on the training data and assign the fitted model to the 
#    variable 'dt_grid_search'

dt_grid_search = dt_grid.fit(X_train, y_train)


print('Done')

dt_rmse1 = -1 * dt_grid_search.best_score_
print("[DT] RMSE for the best model is : {:.10f}".format(dt_rmse1) )

In [None]:
dt_best_params = dt_grid.best_params_
dt_best_params

In [None]:
dt_model = DecisionTreeRegressor(**dt_best_params)  # reuse the parameters found by the grid search
dt_model.fit(X_train, y_train)

In [None]:
y_dt_pred = dt_grid_search.predict(X_test)

# RMSE computed from the MSE (the `squared` argument is deprecated in recent scikit-learn releases)
dt_rmse = np.sqrt(mean_squared_error(y_test, y_dt_pred))

dt_r2 = r2_score(y_test, y_dt_pred)

print('[DT] Root Mean Squared Error: {:.10f}'.format(dt_rmse))
print('[DT] R2: {:.10f}'.format(dt_r2))
#0.2456132794

## <b><font color=00B91C>Random Forest Regressor: Model #3</font></b>

In [None]:
print('Begin RF_100 Implementation...')

# 1. Create the model object below and assign it to the variable 'rf_100_model'
rf_100_model = RandomForestRegressor(n_estimators = 100, max_depth = 32)

# 2. Fit the model to the training data below
rf_100_model.fit(X_train, y_train)

# scoring
y_rf_pred_100 = rf_100_model.predict(X_test)

rf_rmse_100 = np.sqrt(mean_squared_error(y_test, y_rf_pred_100))

# R2 must be computed from this model's own predictions
rf_r2_100 = r2_score(y_test, y_rf_pred_100)

                   
print('[RF_100] Root Mean Squared Error: {:.10f}'.format(rf_rmse_100))
print('[RF_100] R2: {:.10f}'.format(rf_r2_100))    

print()

print('Begin RF_20 Implementation...')

# 1. Create the model object below and assign it to the variable 'rf_20_model'
rf_20_model = RandomForestRegressor(n_estimators = 20, max_depth = 32)

# 2. Fit the model to the training data below
rf_20_model.fit(X_train, y_train)

# scoring
y_rf_pred_20 = rf_20_model.predict(X_test)

rf_rmse_20 = np.sqrt(mean_squared_error(y_test, y_rf_pred_20))

# R2 must be computed from this model's own predictions
rf_r2_20 = r2_score(y_test, y_rf_pred_20)

                   
print('[RF_20] Root Mean Squared Error: {:.10f}'.format(rf_rmse_20))
print('[RF_20] R2: {:.10f}'.format(rf_r2_20))    

print()
    
print('Begin RF_50 Implementation...')

# 1. Create the model object below and assign it to the variable 'rf_50_model'
rf_50_model = RandomForestRegressor(n_estimators = 50, max_depth = 32)

# 2. Fit the model to the training data below
rf_50_model.fit(X_train, y_train)

# scoring
y_rf_pred_50 = rf_50_model.predict(X_test)

rf_rmse_50 = np.sqrt(mean_squared_error(y_test, y_rf_pred_50))

# R2 must be computed from this model's own predictions
rf_r2_50 = r2_score(y_test, y_rf_pred_50)

                   
print('[RF_50] Root Mean Squared Error: {:.10f}'.format(rf_rmse_50))
print('[RF_50] R2: {:.10f}'.format(rf_r2_50))    

print()
print('End')

# RESULTS

### <font color = #FE0BEC><b>RMSE tells how well a regression model can predict the value of a response variable in absolute terms (approximately the standard deviation of the residuals).</b></font>

### <font color = #FE0BEC><b>$ R^2 $ tells what fraction of the variation in the response variable the predictor variables explain.</b></font>
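
To make the two definitions concrete, here is a sketch that computes both metrics by hand with NumPy on four made-up ratings (the arrays are illustrative and not taken from the model output):

```python
import numpy as np

# Invented true ratings and predictions
y_true = np.array([4.0, 4.5, 5.0, 3.5])
y_pred = np.array([4.2, 4.4, 4.8, 3.9])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the label
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (unexplained variation / total variation around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse)
print("R2:", r2)
```

A lower RMSE and a higher $R^2$ both indicate a better fit; $R^2$ can turn negative when a model does worse than always predicting the mean.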

In [None]:
import matplotlib.pyplot as plt
RMSE_Results = [lr_rmse, dt_rmse, rf_rmse_100, rf_rmse_50, rf_rmse_20]
R2_Results = [lr_r2, dt_r2, rf_r2_100, rf_r2_50, rf_r2_20]
labels = ['LR', 'DT', 'RF_100', 'RF_50', 'RF_20']

rg= np.arange(5)
width = 0.35

plt.figure(figsize=(15, 20))

plt.bar(rg, RMSE_Results, width, label="RMSE")
plt.bar(rg+width, R2_Results, width, label='R2')
plt.xticks(rg + width/2, labels)
plt.xlabel("Models")
plt.ylabel("RMSE/R2")
plt.yticks(np.arange(0, 1.1, 0.01))  # place a tick every 0.01 along the y-axis
plt.ylim([0, 1])

plt.grid(color='green', linewidth=1.5, axis='both', alpha=0.5)
plt.title('Model Performance')
plt.legend(loc='upper left', ncol=2)
plt.show()