# <center><span style="color:red;"> <b> Booking reviews </b></span></center>

<p align="center">
    <strong> </strong> <a href="https://www.kaggle.com/competitions/sf-booking"> Kaggle - Hotel Booking Link</a>
</p>

## <center><span style="text-decoration: underline;">Feature Engineering + Prediction of the Reviewer's Score </span></center>

<h4 style="margin-bottom: 0;">📝 Dataset Description: Hotel Reviews Dataset </h4>

- **`hotel_address`**: Address of the hotel.
- **`review_date`**: Date when the reviewer posted the corresponding review.
- **`average_score`**: Average score of the hotel, calculated based on the most recent comments over the past year.
- **`hotel_name`**: Name of the hotel.
- **`reviewer_nationality`**: Country of the reviewer.
- **`negative_review`**: Negative review provided by the reviewer.
- **`review_total_negative_word_counts`**: Total number of words in the negative review.
- **`positive_review`**: Positive review provided by the reviewer.
- **`review_total_positive_word_counts`**: Total number of words in the positive review.
- **`reviewer_score`**: Score assigned by the reviewer based on their experience.
- **`total_number_of_reviews_reviewer_has_given`**: Total number of reviews the reviewer has given in the past.
- **`total_number_of_reviews`**: Total number of valid reviews for the hotel.
- **`tags`**: Tags assigned by the reviewer to the hotel.
- **`days_since_review`**: Number of days between the review date and the dataset's last update.
- **`additional_number_of_scoring`**: Some guests provide a rating without leaving a review. This number indicates the count of valid ratings without associated reviews.
- **`lat`**: Latitude of the hotel.
- **`lng`**: Longitude of the hotel.

Source: ["Booking reviews" (kaggle.com)](https://www.kaggle.com/competitions/sf-booking) 

## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> Table of Content </b></div>

1. [Import Libraries](#chapter1)
2. [Initialize Functions](#chapter2)
3. [Descriptive Statistics](#chapter3)
4. [Feature Engineering](#chapter4)
5. [Graphical + Statistical Analyses](#chapter5)    
    &emsp; 5.1 [Insurance Charges](#chapter5a)  
    &emsp; 5.2 [Insurance Charges and Gender](#chapter5b)  
    &emsp; 5.3 [Insurance Charges and Smoking Status](#chapter5c)  
    &emsp; 5.4 [Insurance Charges across Regions](#chapter5d)              
    &emsp; 5.5 [Insurance Charges across Age Groups](#chapter5e)   
    &emsp; 5.6 [Conclusions of EDA](#chapter5f)     
6. [Modeling](#chapter6)  
    &emsp; 6.1 [Linear Model](#chapter6a)     
    &emsp; 6.2 [Stochastic Gradient Descent](#chapter6b)    
    &emsp; 6.3  [Linear Model – Outliers adjustment](#chapter6c)   
    &emsp; 6.4  [Polynomials](#chapter6d)    
    &emsp; 6.5  [Polynomials with Regularization](#chapter6e)    
7.  [Conclusion](#chapter7)    

<a id="chapter1"></a>
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> ⚙️ Download Libraries </b></div>

In [4]:
import pandas as pd 
import numpy as np
from collections import Counter
import re
from nltk.stem import WordNetLemmatizer

# Modeling
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

<a id="chapter3"></a>
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> ✨ Some preliminary descriptive analysis  </b></div>

In [43]:
# Upload Dataset
df = pd.read_csv('data/hotels.csv')
data_temp = df.copy()
display('Beginning of the dataset:')
display(data_temp.head(2))
display('End of the dataset:')
display(data_temp.head(2))
display("Number of variables:", data_temp.shape[1])
display('Number of observations:', data_temp.shape[0])

'Beginning of the dataset:'

Unnamed: 0,hotel_address,additional_number_of_scoring,review_date,average_score,hotel_name,reviewer_nationality,negative_review,review_total_negative_word_counts,total_number_of_reviews,positive_review,review_total_positive_word_counts,total_number_of_reviews_reviewer_has_given,reviewer_score,tags,days_since_review,lat,lng
0,Stratton Street Mayfair Westminster Borough Lo...,581,2/19/2016,8.4,The May Fair Hotel,United Kingdom,Leaving,3,1994,Staff were amazing,4,7,10.0,"[' Leisure trip ', ' Couple ', ' Studio Suite ...",531 day,51.507894,-0.143671
1,130 134 Southampton Row Camden London WC1B 5AF...,299,1/12/2017,8.3,Mercure London Bloomsbury Hotel,United Kingdom,poor breakfast,3,1361,location,2,14,6.3,"[' Business trip ', ' Couple ', ' Standard Dou...",203 day,51.521009,-0.123097


'End of the dataset:'

Unnamed: 0,hotel_address,additional_number_of_scoring,review_date,average_score,hotel_name,reviewer_nationality,negative_review,review_total_negative_word_counts,total_number_of_reviews,positive_review,review_total_positive_word_counts,total_number_of_reviews_reviewer_has_given,reviewer_score,tags,days_since_review,lat,lng
0,Stratton Street Mayfair Westminster Borough Lo...,581,2/19/2016,8.4,The May Fair Hotel,United Kingdom,Leaving,3,1994,Staff were amazing,4,7,10.0,"[' Leisure trip ', ' Couple ', ' Studio Suite ...",531 day,51.507894,-0.143671
1,130 134 Southampton Row Camden London WC1B 5AF...,299,1/12/2017,8.3,Mercure London Bloomsbury Hotel,United Kingdom,poor breakfast,3,1361,location,2,14,6.3,"[' Business trip ', ' Couple ', ' Standard Dou...",203 day,51.521009,-0.123097


'Number of variables:'

17

'Number of observations:'

386803

In [5]:
data_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386803 entries, 0 to 386802
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   hotel_address                               386803 non-null  object 
 1   additional_number_of_scoring                386803 non-null  int64  
 2   review_date                                 386803 non-null  object 
 3   average_score                               386803 non-null  float64
 4   hotel_name                                  386803 non-null  object 
 5   reviewer_nationality                        386803 non-null  object 
 6   negative_review                             386803 non-null  object 
 7   review_total_negative_word_counts           386803 non-null  int64  
 8   total_number_of_reviews                     386803 non-null  int64  
 9   positive_review                             386803 non-null  object 
 

In [6]:
data_temp.describe() 

Unnamed: 0,additional_number_of_scoring,average_score,review_total_negative_word_counts,total_number_of_reviews,review_total_positive_word_counts,total_number_of_reviews_reviewer_has_given,reviewer_score,lat,lng
count,386803.0,386803.0,386803.0,386803.0,386803.0,386803.0,386803.0,384355.0,384355.0
mean,498.246536,8.397231,18.538988,2743.992042,17.776985,7.17725,8.396906,49.443522,2.823402
std,500.258012,0.547881,29.703369,2316.457018,21.726141,11.05442,1.63609,3.466936,4.579043
min,1.0,5.2,0.0,43.0,0.0,1.0,2.5,41.328376,-0.369758
25%,169.0,8.1,2.0,1161.0,5.0,1.0,7.5,48.214662,-0.143649
50%,342.0,8.4,9.0,2134.0,11.0,3.0,8.8,51.499981,-0.00025
75%,660.0,8.8,23.0,3613.0,22.0,8.0,9.6,51.516288,4.834443
max,2682.0,9.8,408.0,16670.0,395.0,355.0,10.0,52.400181,16.429233


In [7]:
data_temp.describe(include='object')

Unnamed: 0,hotel_address,review_date,hotel_name,reviewer_nationality,negative_review,positive_review,tags,days_since_review
count,386803,386803,386803,386803,386803,386803,386803,386803
unique,1493,731,1492,225,248828,311737,47135,731
top,163 Marsh Wall Docklands Tower Hamlets London ...,8/2/2017,Britannia International Hotel Canary Wharf,United Kingdom,No Negative,No Positive,"[' Leisure trip ', ' Couple ', ' Double Room '...",1 days
freq,3587,1911,3587,184033,95907,26885,3853,1911


In [8]:
if data_temp.isin([np.inf, -np.inf]).any().any():
    print("Infinite values found!")
else:
    print("No infinite values found!")    

# Check for missing values
if data_temp.isnull().any().any():
    print("Missing values found!")
else:
    print("No missing values found!")

No infinite values found!
Missing values found!


In [6]:
data_null = (data_temp.isnull().mean()*100).round(3).to_frame()
data_null.columns = ['Missing %']
data_null = data_null.rename_axis('Variables')
 
display(data_null)

Unnamed: 0_level_0,Missing %
Variables,Unnamed: 1_level_1
hotel_address,0.0
additional_number_of_scoring,0.0
review_date,0.0
average_score,0.0
hotel_name,0.0
reviewer_nationality,0.0
negative_review,0.0
review_total_negative_word_counts,0.0
total_number_of_reviews,0.0
positive_review,0.0


<div style="padding: 20px; border: 2px solid #c77220; border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); max-width: 100%; margin: 0 auto;">
    <ul style="font-size: 18px; font-family: 'Arial', sans-serif; line-height: 1.5em; word-wrap: break-word; overflow-wrap: break-word;">
    <h4 style="margin-bottom: 0;">💡 Interpretation:</h4>
        <li>  Two variables (<code>lat</code> and <code>lng</code>) have missing observations of around 0.633%.  </li>
    </ul>
</div>

In [44]:
print(f'Number of duplicates: {data_temp[data_temp.duplicated()].shape[0]}')

# Drop duplicates:
data_temp = data_temp.drop_duplicates()
print(f'Dimension of the dataframe after the drop of duplicates: {data_temp.shape}')

Number of duplicates: 307
Dimension of the dataframe after the drop of duplicates: (386496, 17)


In [34]:
print(f'There are {data_temp['hotel_name'].nunique()} unique hotels')

There are 1492 unique hotels


In [12]:
print('Most frequent (%) hotel name in the sample:')
display((data_temp['hotel_name'].value_counts(normalize=True).sort_values(ascending=False)*100).round(3)[:20])

Most frequent (%) hotel name in the sample:


hotel_name
Britannia International Hotel Canary Wharf           0.928
Strand Palace Hotel                                  0.830
Park Plaza Westminster Bridge London                 0.801
Copthorne Tara Hotel London Kensington               0.695
DoubleTree by Hilton Hotel London Tower of London    0.616
Grand Royale London Hyde Park                        0.574
Holiday Inn London Kensington                        0.543
Hilton London Metropole                              0.513
Intercontinental London The O2                       0.502
Millennium Gloucester Hotel London                   0.489
Park Grand Paddington Court                          0.449
Hilton London Wembley                                0.442
Park Plaza County Hall London                        0.440
Park Plaza London Riverbank                          0.423
Blakemore Hyde Park                                  0.422
M by Montcalm Shoreditch London Tech City            0.421
DoubleTree by Hilton London Docklands Riversi

In [13]:
data_temp['review_date'] = pd.to_datetime(data_temp['review_date'])
print(f"The latest review was done on {data_temp['review_date'].sort_values(ascending=False).iloc[0].date()}")
print(f"The first review was done on {data_temp['review_date'].sort_values(ascending=True).iloc[0].date()}")

The latest review was done on 2017-08-03
The first review was done on 2015-08-04


In [14]:
# Convert the 'tags' column from string to a proper list of tags
# x = "['Leisure trip', 'Solo traveler']" print(eval(x))  # Output: ['Leisure trip', 'Solo traveler']
data_temp['tags'] = data_temp['tags'].apply(lambda x: eval(x) if isinstance(x, str) else x)
all_tags = [tag.strip() for tags_list in data_temp['tags'] for tag in tags_list]
print(f"Number of unique tags: {len(set(all_tags))}")

Number of unique tags: 2368


In [15]:
word_counts = Counter(all_tags)
most_common_words = pd.Series(word_counts).sort_values(ascending=False)[:20]
print("Most frequent tags:")
print(most_common_words)

Most frequent tags:
Leisure trip                      313353
Submitted from a mobile device    230608
Couple                            189046
Stayed 1 night                    145296
Stayed 2 nights                   100176
Solo traveler                      81166
Stayed 3 nights                    71940
Business trip                      61934
Group                              49057
Family with young children         45810
Stayed 4 nights                    35708
Double Room                        26386
Standard Double Room               24150
Superior Double Room               23518
Family with older children         19787
Deluxe Double Room                 18612
Double or Twin Room                16824
Stayed 5 nights                    15592
Standard Double or Twin Room       13058
Classic Double Room                12604
dtype: int64


In [16]:
print('Most frequent (%) hotel addresses:')
display((data_temp['hotel_address'].value_counts(normalize=True)*100).round(3)[:20])

Most frequent (%) hotel addresses:


hotel_address
163 Marsh Wall Docklands Tower Hamlets London E14 9SJ United Kingdom              0.928
372 Strand Westminster Borough London WC2R 0JJ United Kingdom                     0.830
Westminster Bridge Road Lambeth London SE1 7UT United Kingdom                     0.801
Scarsdale Place Kensington Kensington and Chelsea London W8 5SY United Kingdom    0.695
7 Pepys Street City of London London EC3N 4AF United Kingdom                      0.616
1 Inverness Terrace Westminster Borough London W2 3JP United Kingdom              0.574
Wrights Lane Kensington and Chelsea London W8 5SP United Kingdom                  0.543
225 Edgware Road Westminster Borough London W2 1JU United Kingdom                 0.513
1 Waterview Drive Greenwich London SE10 0TW United Kingdom                        0.502
4 18 Harrington Gardens Kensington and Chelsea London SW7 4LH United Kingdom      0.489
27 Devonshire Terrace Westminster Borough London W2 3DP United Kingdom            0.449
Lakeside Way Brent

In [17]:
print("Most frequent (%) reviewers' nationality:")
display((data_temp['reviewer_nationality'].value_counts(normalize=True)*100).round(3)[:20])

Most frequent (%) reviewers' nationality:


reviewer_nationality
United Kingdom               47.595
United States of America      6.855
Australia                     4.196
Ireland                       2.877
United Arab Emirates          1.969
Saudi Arabia                  1.738
Netherlands                   1.707
Switzerland                   1.680
Canada                        1.546
Germany                       1.540
France                        1.429
Israel                        1.271
Italy                         1.180
Belgium                       1.171
Turkey                        1.061
Kuwait                        0.957
Spain                         0.913
Romania                       0.885
Russia                        0.764
South Africa                  0.747
Name: proportion, dtype: float64

In [18]:
print("Most common negative review:")
display((data_temp['negative_review'].value_counts(normalize=True)*100).round(3)[:50])

Most common negative review:


negative_review
No Negative                    24.795
 Nothing                        2.777
 Nothing                        0.816
 nothing                        0.429
 N A                            0.208
 None                           0.191
                                0.157
 N a                            0.099
 Breakfast                      0.077
 Small room                     0.073
 Location                       0.072
 All good                       0.065
 Everything                     0.065
 Nothing really                 0.062
 none                           0.058
 nothing                        0.057
 No complaints                  0.052
 Nil                            0.051
 Nothing really                 0.050
 Price                          0.050
 n a                            0.046
 Nothing to dislike             0.041
 Nothing at all                 0.040
 Nothing at all                 0.036
 Small rooms                    0.035
 None                           0.

In [19]:
print("Most common positive review:")
display((data_temp['positive_review'].value_counts(normalize=True)*100).round(3)[:50])

Most common positive review:


positive_review
No Positive                    6.950
 Location                      1.766
 Everything                    0.439
 location                      0.323
 Nothing                       0.241
 The location                  0.214
 Great location                0.209
 Good location                 0.179
 Location                      0.172
 Breakfast                     0.118
 Everything                    0.116
 Friendly staff                0.099
 Staff                         0.090
 Excellent location            0.077
 Great location                0.072
 Location and staff            0.066
 everything                    0.060
 Good location                 0.055
 Nothing                       0.044
 nothing                       0.041
 Comfy bed                     0.041
 The location                  0.039
 good location                 0.039
 The staff                     0.039
 the location                  0.036
 Location was good             0.036
 Location was great   

<a id="chapter4"></a>  
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> 🧐 Feature Engineering </b></div>

### Textual Analysis

##### Negative Comments

In [45]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

def preproc_text(text):
    text = str(text).lower() 
    text = re.sub(r'\b\w\b', '', text) # remove single characters
    text = re.sub(r'\s+', ' ', text) # Replace multiple spaces with a single space
    text = re.sub(r'[^\w\s]', '', text) # Replace punctuation
    words = text.strip().split()  # Split text into words
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in words])
    return lemmatized_text    
    
data_temp['cleaned_comments'] = data_temp['negative_review'].apply(preproc_text)

In [46]:
# Use TF-IDF vectorizer to extract key textual features - extract top 20 terms based on Term Frequency-Inverse Document Frequency
tfidf_vectorizer = TfidfVectorizer(max_features=20, stop_words='english')
tfidf_features = tfidf_vectorizer.fit_transform(data_temp['cleaned_comments'])
# Convert TF-IDF features to a DataFrame
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
display(tfidf_df)

Unnamed: 0,bathroom,bed,bit,breakfast,day,did,didn,good,hotel,like,little,negative,night,room,service,shower,small,staff,time,wa
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.000000,0.0,0.0,0.000000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
386491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0
386492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0
386493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.555009,0.0,0.0,0.831844,0.0,0.0,0.0
386494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0


* As seen from the table, there are some words that don't make much sense such as `bit`, `did`, `negative`. So, I focus only on most relevant ones.

In [47]:
negative_w = ['bathroom', 'bed', 'breakfast', 'hotel', 'little', 'room', 'service', 'shower', 'small', 'staff']
# Keep only columns as in negative_w
tfidf_df_u = tfidf_df[tfidf_df.columns.intersection(negative_w)]
tfidf_df_u = tfidf_df_u.rename(columns=lambda x:f'neg_{x}')
tfidf_df_u.head(2)


Unnamed: 0,neg_bathroom,neg_bed,neg_breakfast,neg_hotel,neg_little,neg_room,neg_service,neg_shower,neg_small,neg_staff
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [48]:
# Combine the original data with TF-IDF features
data_temp = pd.concat([data_temp, tfidf_df_u], axis=1)
tfidf_df_u.describe().round(3)

Unnamed: 0,neg_bathroom,neg_bed,neg_breakfast,neg_hotel,neg_little,neg_room,neg_service,neg_shower,neg_small,neg_staff
count,386496.0,386496.0,386496.0,386496.0,386496.0,386496.0,386496.0,386496.0,386496.0,386496.0
mean,0.03,0.036,0.062,0.066,0.027,0.151,0.021,0.023,0.051,0.039
std,0.14,0.155,0.205,0.2,0.135,0.277,0.12,0.125,0.175,0.159
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.252,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


<div style="padding: 20px; border: 2px solid #c77220; border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); max-width: 100%; margin: 0 auto;">
    <ul style="font-size: 18px; font-family: 'Arial', sans-serif; line-height: 1.5em; word-wrap: break-word; overflow-wrap: break-word;">
    <h4 style="margin-bottom: 0;">💡 Interpretation:</h4>
        <li>  Most negative comments relate to `rooms` (15.1%), followed by `hotel` (6.6%) itself and `breakfast` (6.2%). </li>
    </ul>
</div>

In [49]:
# Define a simplified set of common negative words
common_negative_words = set([
    'bad', 'poor', 'terrible', 'awful', 'worst', 'negative', 'disappointing',
    'unpleasant', 'dirty', 'noisy', 'expensive', 'uncomfortable', 'rude',
    'slow', 'broken', 'ugly', 'sad', 'angry', 'disgusting', 'problem', 'issue'
])

# Optimized feature extraction function without sentiment analysis
def extract_features_optimized(comment: str) -> dict:
    # Basic text preprocessing
    comment = str(comment).strip().lower()
    words = comment.split()
        
    negative_word_count = sum(1 for word in words if word in common_negative_words)
    
    # Punctuation count
    exclamation_count = comment.count('!')
    
    # Binary feature for "No Negative"
    no_negative = int('no neg' in comment)    

    return {
        'neg_word_count': negative_word_count,
        'neg_exclam_count': exclamation_count,
        'no_negative': no_negative
    }

# Apply the optimized feature extraction to the dataset
features_optimized = data_temp['negative_review'].apply(extract_features_optimized)
# Without .tolist(), passing a Series of dictionaries directly to pd.DataFrame() might not correctly parse the nested dictionary 
# structure and could result in a single-column DataFrame with dictionary objects, rather than expanding them into multiple columns.
features_optimized_df = pd.DataFrame(features_optimized.tolist())

# Combine the original and new features
data_temp = pd.concat([data_temp, features_optimized_df], axis=1)

In [55]:
data_temp[['neg_word_count', 'neg_exclam_count', 'no_negative']].describe().round(3)

Unnamed: 0,neg_word_count,neg_exclam_count,no_negative
count,386803.0,386803.0,386803.0
mean,0.462,0.0,0.248
std,0.605,0.0,0.432
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,1.0,0.0,0.0
max,12.0,0.0,1.0


<div style="padding: 20px; border: 2px solid #c77220; border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); max-width: 100%; margin: 0 auto;">
    <ul style="font-size: 18px; font-family: 'Arial', sans-serif; line-height: 1.5em; word-wrap: break-word; overflow-wrap: break-word;">
    <h4 style="margin-bottom: 0;">💡 Interpretation:</h4>
        <li>  On average, there 0.462 negative words (`neg_word_count`) per comment. </li>
        <li>  There are no comments with exclamation mark (!). </li>
        <li>  24.8% of comments are not negative (`no_negative`). So, some comments that are labelled as negative, are not negative per se. </li>
    </ul>
</div>

### Textual Analysis

##### Positive Comments

In [None]:
y = data_temp['reviewer_score'] 

imp_cat = pd.Series(chi2(X[cat_cols], y)[0], index=cat_cols)

In [None]:
num_cols = data_temp.select_dtypes(include=['number']).columns
cat_cols = data_temp.select_dtypes(include=['number']).columns



In [None]:
data_temp['lat'].describe()

In [None]:
import pandas as pd
from textblob import TextBlob

# Load the data from the uploaded CSV file
file_path = 'negative_comments.csv'
df = pd.read_csv(file_path)

# Define a simplified set of common negative words
common_negative_words = set([
    'bad', 'poor', 'terrible', 'awful', 'worst', 'negative', 'disappointing',
    'unpleasant', 'dirty', 'noisy', 'expensive', 'uncomfortable', 'rude',
    'slow', 'broken', 'ugly', 'sad', 'angry', 'disgusting', 'problem', 'issue'
])

# Optimized feature extraction function without sentiment analysis
def extract_features_optimized(comment: str) -> dict:
    # Basic text preprocessing
    comment = str(comment).strip().lower()
    
    # Tokenization
    words = comment.split()
    word_count = len(words)
    char_count = len(comment)
    avg_word_length = sum(len(word) for word in words) / word_count if word_count > 0 else 0
    
    # Sentence count
    sentence_count = comment.count('.') + comment.count('!') + comment.count('?')
    
    # Negative word count using the simplified list
    negative_word_count = sum(1 for word in words if word in common_negative_words)
    
    # Punctuation count
    exclamation_count = comment.count('!')
    question_count = comment.count('?')
    
    # Binary feature for "No Negative"
    no_negative = int('no negative' in comment)
    
    return {
        'char_count': char_count,
        'word_count': word_count,
        'avg_word_length': avg_word_length,
        'sentence_count': sentence_count,
        'negative_word_count': negative_word_count,
        'exclamation_count': exclamation_count,
        'question_count': question_count,
        'no_negative': no_negative
    }

# Apply the optimized feature extraction to the dataset
features_optimized = df['negative_review'].apply(extract_features_optimized)
features_optimized_df = pd.DataFrame(features_optimized.tolist())

# Combine the original and new features
combined_optimized_df = pd.concat([df, features_optimized_df], axis=1)

# Save the result to a CSV file
combined_optimized_df.to_csv('extracted_features.csv', index=False)

print('Feature extraction completed and saved to extracted_features.csv.')


<a id="chapter6"></a>
## <div style="text-align:center; border-radius:30px 30px; padding:7px; color:white; margin:0; font-size:110%; font-family:Pacifico; background-color:#3170af; overflow:hidden"><b> 🛠️ Modeling </b></div>

In [None]:
X = data_temp.drop(['reviewer_score'], axis = 1)
y = data_temp['reviewer_score'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
regr = RandomForestRegressor(n_estimators=100)  
 
regr.fit(X_train, y_train) 
y_pred = regr.predict(X_test) 

In [None]:
data_temp['lat'] = data_temp['lat'].fillna(data_temp['lat'].mode()[0])
data_temp['lng'] = data_temp['lng'].fillna(data_temp['lng'].mode()[0])

In [None]:
X = data_temp.select_dtypes(include=['number']).drop(columns=['reviewer_score'])
y = data_temp['reviewer_score'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
regr = RandomForestRegressor(n_estimators=100)  
 
regr.fit(X_train, y_train) 
y_pred = regr.predict(X_test) 

In [None]:
print('MAPE:', metrics.mean_absolute_percentage_error(y_test, y_pred).round(3))