# Hotel's Reviews Sentiment Analysis
## Gathering Data
First step is to gather the data from DB.
```
select translation_title, translation_text, score
from hotel_reputation;
```

We then load the file to start working with. I chose JSON instead of CSV because we are working with tests, which have commas and quotes and other symbols as well, so it would be difficult for CSV to know how to separate the columns without having conflicts with the actual contents. Also, JSON is easier to look at for humans. The downside is that it weights roughly 20 MB more and tjus it's also slightly slower to load and occupies slightly more memory when loaded.

In [257]:
import pandas as pd

reviews_df = pd.read_json('reviews.json')
pd.set_option('display.max_colwidth', None)
print(reviews_df.head())

                               translation_title  \
0                                    Exceptional   
1           It doesn't get any better than that!   
2                                    Exceptional   
3                                      Excellent   
4  Well situated for the station and city centre   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     translation_text  \
0                                                                                                                                                                                     

## Data Processing
### Combining Title and Text
Next we will combine the title and text columns into a single one, so we can work with the text as a whole.

In [258]:
def combine_texts(title, text):
    if pd.isna(title) and pd.isna(text):
        return ''
    elif pd.isna(text) or text.strip() == '':
        return title
    elif pd.isna(title) or title.strip() == '':
        return text
    else:
        return f"{title} {text}"

reviews_df['combined_text'] = reviews_df.apply(lambda row: combine_texts(row['translation_title'], row['translation_text']), axis=1)
reviews_df.drop(columns=['translation_title', 'translation_text'], inplace=True)
print(reviews_df.head())

   score  \
0    5.0   
1    5.0   
2    5.0   
3    3.0   
4    5.0   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      combined_text  
0                                                                                                                                                                                                                                                                                                                    Exceptional 😀 - The hotel is located in the best location, 10 meters

### Cleaning Data

#### Removing rows with combined_text empty
Some rows didn't have nor title nor text, so we have `null` values in the `combined_text` column:

In [259]:
print('Rows with combined_text empty:')
print(len(reviews_df[reviews_df['combined_text'] == '']))

Rows with combined_text empty:
1028


We will remove these rows, as they don't bring any value to our analysis.

In [260]:
print('Length before removing rows with combined_text empty:', len(reviews_df))
reviews_df = reviews_df[reviews_df['combined_text'] != '']
print('Length after removing rows with combined_text empty:', len(reviews_df))

Length before removing rows with combined_text empty: 371448
Length after removing rows with combined_text empty: 370420


#### Fixing rows with scores not boxed in [1, 2, 3, 4, 5]
Some rows had scores not boxed in [1, 2, 3, 4, 5]:

In [261]:
print('Scores not boxed in [1, 2, 3, 4, 5]:')
print(len(reviews_df[~reviews_df['score'].isin([1, 2, 3, 4, 5])]))
print('-' * 50)
print('Value counts:')
print(reviews_df['score'].value_counts())

Scores not boxed in [1, 2, 3, 4, 5]:
110583
--------------------------------------------------
Value counts:
score
5.00    119323
4.00    103652
4.50     58790
3.50     32841
3.00     23539
2.50      8397
2.00      8327
1.00      4996
1.50      2694
0.50      2340
4.80      1003
4.60       933
4.40       693
4.15       657
3.75       591
3.95       466
3.55       318
3.35       246
2.90       169
3.10       152
2.70        87
2.30        55
2.10        44
1.90        27
1.65        26
4.75        17
1.25        16
1.45         9
4.25         4
3.25         3
2.75         1
3.45         1
1.75         1
4.05         1
4.70         1
Name: count, dtype: int64


We will not remove this rows, as their data is still useful, instead, we will think of a way to box them in one of the 5 categories.
A side investigation demonstrated that the best strategy is to ceil them (refer to the previous Google collab notebooks).

In [262]:
import numpy as np

reviews_df['score'] = reviews_df['score'].apply(np.ceil)
print('Scores not boxed in [1, 2, 3, 4, 5]:')
print(len(reviews_df[~reviews_df['score'].isin([1, 2, 3, 4, 5])]))
print('-' * 50)
print('Value counts:')
print(reviews_df['score'].value_counts())

Scores not boxed in [1, 2, 3, 4, 5]:
0
--------------------------------------------------
Value counts:
score
5.0    181422
4.0    138270
3.0     32292
2.0     11100
1.0      7336
Name: count, dtype: int64


#### Fixing rows with trailing and leading spaces
We have rows with trailing and leading spaces in the `combined_text` column:

In [263]:
print('Rows with trailing and leading spaces in the combined_text column:')
print(len(reviews_df[reviews_df['combined_text'].str.contains('^\s+|\s+$')]))

Rows with trailing and leading spaces in the combined_text column:
47101


We will remove these spaces.

In [264]:
reviews_df['combined_text'] = reviews_df['combined_text'].str.strip()
print('Rows with trailing and leading spaces in the combined_text column:')
print(len(reviews_df[reviews_df['combined_text'].str.contains('^\s+|\s+$')]))

Rows with trailing and leading spaces in the combined_text column:
0


## Data Splitting

In [265]:
from sklearn.model_selection import train_test_split

X = reviews_df['combined_text']
y = reviews_df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Data Vectorization

In [266]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

## Model Training

In [267]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1500)
model.fit(X_train_vectorized, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Model Evaluation

In [268]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test_vectorized)
train_accuracy = accuracy_score(y_train, model.predict(X_train_scaled))
test_accuracy = accuracy_score(y_test, y_pred)
print(f'Training Accuracy: {train_accuracy * 100:.2f}%')
print(f'Testing Accuracy: {test_accuracy * 100:.2f}%')

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)

Training Accuracy: 79.88%
Testing Accuracy: 87.97%
Confusion Matrix:
[[ 1221   131    58    43     9]
 [  107  1753   188   110    18]
 [   70   182  5328   880   145]
 [   26    55   582 24259  2619]
 [    9    15   170  3496 32610]]
Classification Report:
              precision    recall  f1-score   support

         1.0       0.85      0.84      0.84      1462
         2.0       0.82      0.81      0.81      2176
         3.0       0.84      0.81      0.82      6605
         4.0       0.84      0.88      0.86     27541
         5.0       0.92      0.90      0.91     36300

    accuracy                           0.88     74084
   macro avg       0.86      0.85      0.85     74084
weighted avg       0.88      0.88      0.88     74084

