<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_IV_14_LasVegasExampleMultiClass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Las Vegas Dataset

We return to the Las Vegas dataset that we previously modeled with linear regression. This time we interpret the ratings as a categorical variable and use multinomial regression for our predictions.

### Load and Prepare Data

Let's load the dataset from the course repository and prepare it as before:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
lasvegas = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_II_8_LasVegasTripAdvisorReviews.csv')

In [None]:
numerics = list(lasvegas.select_dtypes(include=['int64']).columns)
numerics.remove('Hotel stars')
numerics.remove('Score')
factors = list(lasvegas.select_dtypes(include=['object']).columns)
factors.append('Hotel stars')
factors.remove('User country')
factors.remove('Hotel name')

In [None]:
lasvegas_numcols = lasvegas[numerics].copy()  # explicit copy, so adding a column later does not trigger SettingWithCopyWarning

In [None]:
lasvegas_numcols['helpful_proportion'] = lasvegas_numcols['Helpful votes'] / lasvegas_numcols['Nr. reviews']

In [None]:
lasvegas_faccols = lasvegas[factors]
dummies = pd.get_dummies(lasvegas_faccols.astype('object'), drop_first=True)
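
To make the dummy coding concrete, here is a minimal sketch on toy data. The column values are hypothetical and not taken from the Las Vegas file. It shows that casting to `object` makes `get_dummies` treat the star rating as a factor, and that `drop_first=True` drops the first level of each factor so that it serves as the reference level.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Pool": ["YES", "NO", "YES"],
    "Hotel stars": [3, 4, 5],
})

# Casting to object makes get_dummies treat the numeric star rating as a factor
toy_dummies = pd.get_dummies(toy.astype("object"), drop_first=True)

# The first level of each factor ('NO' and 3) is dropped and acts as the reference
print(list(toy_dummies.columns))
```

Dropping one level per factor avoids perfect collinearity with the constant that is added to the design matrix later.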

In [None]:
lasvegas_new = pd.concat([lasvegas_numcols, dummies], axis = 1)
lasvegas_new = pd.concat([lasvegas_new, lasvegas['Score']], axis =1)

### Linear Regression

Let's rerun the linear regression model to have it as a comparison. One question we are interested in is which model performs better.

In [None]:
y = lasvegas_new['Score']
X = lasvegas_new.drop(columns=['Score'])
X = sm.add_constant(X)
# statsmodels does not work with the dtypes stored in the DataFrame, so we coerce everything to float
model_sm = sm.OLS(y, X.astype(float)).fit()
model_sm.summary()

### Multinomial Regression

Let's now predict using multinomial regression. We start by reformatting our target variable as a categorical variable.

In [None]:
y_cat = pd.cut(y, bins=5, labels=['very bad', 'bad', 'medium', 'good', 'very good'])
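
A small sketch of what `pd.cut` does here, using hypothetical toy scores: `bins=5` splits the observed range into five equal-width intervals, so as long as the scores span 1 to 5, each integer score lands in its own interval and receives its own label.

```python
import pandas as pd

# Toy scores covering the full 1-5 range (hypothetical, for illustration only)
toy_scores = pd.Series([1, 2, 3, 4, 5, 5, 4])
labels = ['very bad', 'bad', 'medium', 'good', 'very good']

# bins=5 splits the observed range [1, 5] into five equal-width intervals,
# so each integer score falls into its own interval
toy_cat = pd.cut(toy_scores, bins=5, labels=labels)

print(toy_cat.tolist())
print(toy_cat.cat.codes.tolist())  # integer codes 0-4 behind the labels
```

Note that this one-to-one mapping relies on both the lowest and the highest score appearing in the data; otherwise the equal-width intervals would be drawn over a narrower range.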

And then we run our multinomial regression using the categorical target:

In [None]:
model_mn = sm.MNLogit(y_cat, X.astype(float)).fit(maxiter = 10000)
print(model_mn.summary())

So, we see we have four tables of coefficients, one for each non-baseline category, with the first category ('very bad') serving as the baseline. Many of the coefficients do not seem to be significantly different from zero. One reason is that there are many more coefficients to estimate here!
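
To illustrate how one set of coefficients per non-baseline category turns into class probabilities, here is a minimal NumPy sketch. The function name `baseline_softmax`, the coefficient layout (one column per non-baseline class), and all numbers are assumptions made for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def baseline_softmax(X, B):
    """Class probabilities for a multinomial logit with class 0 as baseline.

    X: (n, p) design matrix; B: (p, K-1) coefficients for the non-baseline classes.
    """
    scores = X @ B                                        # (n, K-1) linear predictors
    scores = np.column_stack([np.zeros(len(X)), scores])  # baseline score fixed at 0
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical numbers: 2 observations, intercept + 1 feature, K = 3 classes
X_toy = np.array([[1.0, 0.5], [1.0, -1.0]])
B_toy = np.array([[0.2, -0.3], [1.0, 0.4]])
probs = baseline_softmax(X_toy, B_toy)
print(probs)
```

Each coefficient therefore describes a log-odds shift relative to the baseline category, which is why a model with five categories reports four coefficient tables.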

Let's check the predictions by generating a multi-class confusion matrix:

In [None]:
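from sklearn.metrics import confusion_matrix

# Predicted class probabilities, one column per category
y_hat = model_mn.predict(X.astype(float))
# Pick the most likely category for each review
y_hat_label = y_hat.idxmax(axis=1)
# Scores run 1-5 while the predicted labels run 0-4, so shift the scores by one
cm = confusion_matrix(y-1, y_hat_label)
print(cm)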

So the diagonal terms are large, suggesting we get many correct. Let's check the misclassification rate:

In [None]:
misclassification_rate = 1 - np.diag(cm).sum() / cm.sum()
print(misclassification_rate)
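
As a worked example of this calculation, the sketch below builds a confusion matrix by hand with NumPy on hypothetical labels, so that the counting is explicit, and computes the misclassification rate as the off-diagonal share. It also computes a simple benchmark, the error rate from always predicting the most frequent class, which is one possible yardstick for judging whether a rate is good.

```python
import numpy as np

# Hypothetical true and predicted labels on a 0-4 scale
y_true_toy = np.array([0, 1, 2, 3, 4, 4, 3, 4])
y_pred_toy = np.array([0, 2, 2, 3, 4, 3, 3, 4])

# Rows are true classes, columns are predicted classes
cm_toy = np.zeros((5, 5), dtype=int)
np.add.at(cm_toy, (y_true_toy, y_pred_toy), 1)

# Off-diagonal share = misclassification rate
rate_toy = 1 - np.trace(cm_toy) / cm_toy.sum()

# Benchmark: always predict the most frequent true label
baseline_rate = 1 - np.bincount(y_true_toy).max() / len(y_true_toy)
print(rate_toy, baseline_rate)
```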

### Comparison to Linear Regression Predictions

One thing we notice is that the "pseudo-R-squared" here is lower than the R-squared of the linear regression model. Does that mean the multinomial regression predictions are worse?
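
As background for this question: the pseudo-R-squared that statsmodels reports for discrete models is McFadden's, defined as one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. It is built from likelihoods rather than from explained variance, so it is not on the same scale as the OLS R-squared. A minimal sketch of the definition, with the helper name `mcfadden_r2` and the toy data as illustrative assumptions:

```python
import numpy as np

def mcfadden_r2(y, probs):
    """McFadden's pseudo-R-squared: 1 - logL(model) / logL(null).

    y: integer class labels 0..K-1; probs: (n, K) fitted class probabilities.
    The null model predicts the observed class frequencies for every row.
    """
    n = len(y)
    ll_model = np.log(probs[np.arange(n), y]).sum()
    freqs = np.bincount(y, minlength=probs.shape[1]) / n
    ll_null = np.log(freqs[y]).sum()
    return 1 - ll_model / ll_null

y_toy = np.array([0, 0, 1, 2, 2, 2])

# Fitted probabilities equal to the class frequencies: no improvement over the null
null_probs = np.tile(np.bincount(y_toy) / len(y_toy), (len(y_toy), 1))
print(mcfadden_r2(y_toy, null_probs))

# Sharper (hypothetical) probabilities: 0.8 on the true class, 0.1 elsewhere
sharp_probs = np.full((len(y_toy), 3), 0.1)
sharp_probs[np.arange(len(y_toy)), y_toy] = 0.8
print(mcfadden_r2(y_toy, sharp_probs))
```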

Let's compare:

In [None]:
y_pred_linear = model_sm.predict(X.astype(float))

# 1. Distribution of Predicted Scores (Linear Regression)
plt.hist(y_pred_linear, bins=20)
plt.xlabel("Predicted Scores (Linear Regression)")
plt.ylabel("Frequency")
plt.title("Distribution of Predicted Scores (Linear Regression)")
plt.show()

# 2. Distribution of Predicted Categories (Multinomial Model)
y_hat_label.value_counts().plot(kind='bar')
plt.xlabel("Predicted Categories (Multinomial Model)")
plt.ylabel("Frequency")
plt.title("Distribution of Predicted Categories (Multinomial Model)")
plt.show()

So it appears that the linear regression predictions are centered around 4: they rarely reach 5, never fall near 2 or 1, and in some cases even exceed 5. In contrast, the most frequent prediction for the multinomial model is the highest category!

Let's check the confusion matrix and misclassification rate for the linear regression model:

In [None]:
y_pred_linear_rounded = np.round(y_pred_linear).astype(int)
# Ensure predictions are within the valid score range (1-5)
y_pred_linear_rounded = np.clip(y_pred_linear_rounded, 1, 5)

# Generate confusion matrix
cm_linear = confusion_matrix(y, y_pred_linear_rounded)
print(cm_linear)

# Calculate misclassification rate
misclassification_rate_linear = 1 - np.diag(cm_linear).sum() / cm_linear.sum()
print(misclassification_rate_linear)

So the misclassification rate of the linear regression model is quite a bit higher. Overall, this favors the multinomial regression predictions, at least in-sample.
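
The rounding-and-clipping step that turns the continuous linear predictions into classes can be sketched on toy numbers (hypothetical values, not model output). One detail worth knowing is that `np.round` rounds exact halves to the nearest even integer.

```python
import numpy as np

# Hypothetical continuous predictions from a linear model
preds_toy = np.array([0.4, 2.5, 3.5, 4.49, 5.7])

# np.round rounds halves to the nearest even integer (2.5 -> 2, 3.5 -> 4)
rounded_toy = np.round(preds_toy).astype(int)

# Clip to the valid score range 1-5
classes_toy = np.clip(rounded_toy, 1, 5)
print(classes_toy.tolist())
```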