# 1. Import Libraries:
* ## Here you import the necessary libraries: `pandas` for data handling, `train_test_split` to split the data into training and testing sets, `RandomForestRegressor` to build the regression models, and `OneHotEncoder` to encode categorical features.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# 2. Load the Dataset:

* ## You load the dataset from a CSV file named `'hotelData.csv'` into a Pandas DataFrame called `data`.

In [None]:
# Load the dataset
data = pd.read_csv("hotelData.csv")

# 3. Data Preprocessing
>## Encode Categorical Features
* ## You create an instance of `OneHotEncoder`, configured to return a dense array and to ignore categories that were not seen during fitting (such categories are encoded as all zeros). You then use this encoder to transform the selected categorical columns and create a new DataFrame `encoded_cities_df` containing the encoded features.


In [None]:
# Data Preprocessing
# Encode categorical features using one-hot encoding (missing values are not handled here)
categorical_columns = ["City", "Sub City", "Image", "Description", "Restaurant name", "Most reviewed foods"]
# sparse_output replaces the sparse argument in scikit-learn 1.2+; use sparse=False on older versions
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded_cities = encoder.fit_transform(data[categorical_columns])
encoded_cities_df = pd.DataFrame(encoded_cities, columns=encoder.get_feature_names_out(categorical_columns))

>## Combine Encoded Cities
* ## You define `selected_columns` as a list containing the column name `"Uniq ID"`. Then, you concatenate `encoded_cities_df` with the selected columns from the original `data` DataFrame to create the `features` DataFrame. You also extract the three target columns, `"Food Rate"`, `"Staff Rate"`, and `"Environment Rate"`, into the `targets` DataFrame.

In [None]:
# Combine encoded cities with other features
selected_columns = ["Uniq ID"]
features = pd.concat([encoded_cities_df,data[selected_columns]], axis=1)
targets = data[["Food Rate", "Staff Rate", "Environment Rate"]]
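
One detail worth checking when combining DataFrames this way: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses hypothetical toy frames (`toy_encoded`, `toy_ids`, `shifted_ids`) to show the aligned case and what happens when the labels differ.

```python
import pandas as pd

# Hypothetical toy frames that both use the default 0..n-1 index
toy_encoded = pd.DataFrame({"City_Colombo": [1.0, 0.0], "City_Kandy": [0.0, 1.0]})
toy_ids = pd.DataFrame({"Uniq ID": [10, 11]})

# Matching labels: rows line up one to one
toy_features = pd.concat([toy_encoded, toy_ids], axis=1)
print(toy_features)

# Mismatched labels: concat aligns by label and fills the gaps with NaN
shifted_ids = pd.DataFrame({"Uniq ID": [10, 11]}, index=[1, 2])
misaligned = pd.concat([toy_encoded, shifted_ids], axis=1)
print(misaligned)
```

As long as both DataFrames share the same index labels, the rows line up; otherwise the result gains extra rows containing `NaN`.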

# 4. Split Data into Training and Testing Sets:

* ## Here you use the `train_test_split` function from the `sklearn.model_selection` module to split your data into training and testing sets. This function is commonly used in machine learning to divide a dataset into two parts: one for training a model and the other for evaluating its performance.

>## Here's a breakdown of the code:

>* ## `features`: This is the DataFrame containing the features that your machine learning models will use for prediction.

>* ## `targets`: This is the DataFrame containing the target variables you want to predict. In your case, the target variables are `"Food Rate"`, `"Staff Rate"`, and `"Environment Rate"`.

>* ## `test_size`: This parameter determines the proportion of the data that will be allocated to the testing set. Here, it's set to `0.2`, which means that `20%` of the data will be used for testing, and the remaining `80%` will be used for training.

>* ## `random_state`: This parameter sets the random seed for reproducibility. By setting it to a specific value (e.g., `42`), you ensure that the random split of data will be the same every time you run your code. This is useful for consistent and repeatable results during development and testing.

>* ## After executing this line of code, you'll have the following variables:

>* ## `X_train`: This DataFrame contains the features used for training your machine learning models.

>* ## `X_test`: This DataFrame contains the features used for evaluating the performance of your models.

>* ## `y_train`: This DataFrame contains the target variables corresponding to the training features. For your case, these are the target `"Food Rate"`, `"Staff Rate"`, and `"Environment Rate"` values.

>* ## `y_test`: This DataFrame contains the target variables corresponding to the testing features.

>* ## With these variables, you can proceed to train your models on `X_train` and `y_train`, make predictions on `X_test`, and evaluate your models' performance using `y_test`.

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=42)
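
The behaviour of `test_size` and `random_state` can be checked on a small example. This is a sketch on hypothetical toy data (`toy_features` and `toy_targets` with 10 rows), not the hotel dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows
toy_features = pd.DataFrame({"x": range(10)})
toy_targets = pd.DataFrame({"y": range(100, 110)})

# Two splits with the same seed
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_features, toy_targets, test_size=0.2, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    toy_features, toy_targets, test_size=0.2, random_state=42
)

print(Xa_train.shape, Xa_test.shape)        # 8 training rows, 2 testing rows
print(Xa_test.index.equals(Xb_test.index))  # same seed gives the same split
print(Xa_test.index.equals(ya_test.index))  # features and targets stay aligned
```

The split keeps the original row labels, which is what later allows test rows to be looked up in the original DataFrame by index.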

# 6. Train the Models:

* ## You create separate instances of `RandomForestRegressor` for each target variable and train them on the training data.

* ## `food_model = RandomForestRegressor()`:
### Here, you create an instance of the `RandomForestRegressor` class and assign it to the variable `food_model`. This instance is the machine learning model that will predict the `"Food Rate"`.

* ## `food_model.fit(X_train, y_train["Food Rate"])`:
### The `.fit()` method trains the model. You provide two arguments:
### `X_train`: The feature data for training, a DataFrame that contains all the features for the training set.
### `y_train["Food Rate"]`: The target data for training, a Series that contains the actual `"Food Rate"` values corresponding to the training rows.
* ## Together, these two lines initialize the `RandomForestRegressor` model for predicting `"Food Rate"` and train it on the training features (`X_train`) and the corresponding target values (`y_train["Food Rate"]`).

* ## The same process is repeated for `staff_model` and `env_model`:

In [None]:
# Train separate models for each target variable
food_model = RandomForestRegressor()
food_model.fit(X_train, y_train["Food Rate"])

staff_model = RandomForestRegressor()
staff_model.fit(X_train, y_train["Staff Rate"])

env_model = RandomForestRegressor()
env_model.fit(X_train, y_train["Environment Rate"])
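
Because the three models are trained the same way, the pattern can also be written as a loop over the target columns. The sketch below is one possible variant, not the notebook's method. It runs on hypothetical synthetic data (`toy_X`, `toy_y`), and `n_estimators=10` and `random_state=42` are illustrative choices. A random forest is randomized, so passing `random_state` makes repeated runs produce the same model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data standing in for X_train and y_train
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.random((50, 3)), columns=["a", "b", "c"])
toy_y = pd.DataFrame({
    "Food Rate": rng.uniform(1, 5, 50),
    "Staff Rate": rng.uniform(1, 5, 50),
    "Environment Rate": rng.uniform(1, 5, 50),
})

# Train one model per target column and keep them in a dict keyed by target name
toy_models = {}
for target_name in toy_y.columns:
    model = RandomForestRegressor(n_estimators=10, random_state=42)
    model.fit(toy_X, toy_y[target_name])
    toy_models[target_name] = model

# Each model returns one prediction per input row
toy_predictions = {name: m.predict(toy_X) for name, m in toy_models.items()}
print({name: preds.shape for name, preds in toy_predictions.items()})
```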

# 7. Make Predictions:

* ## `food_model`: This is the trained `RandomForestRegressor` model for predicting food rates.
* ## `.predict(X_test)`: This method takes the test features `X_test` as input and returns predicted values for the target variable (food rate) based on the learned model.
* ## `food_predictions`: This variable stores the predicted food rates for the test data.
* ## Similarly, for the staff and environment rate predictions:

In [None]:
# Predict review values
food_predictions = food_model.predict(X_test)
staff_predictions = staff_model.predict(X_test)
env_predictions = env_model.predict(X_test)
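
The notebook does not compute an error metric, but the predictions can be compared against `y_test` to evaluate each model. As a sketch, the example below applies scikit-learn's `mean_absolute_error` to hypothetical rating values. In the notebook, the equivalent call would be `mean_absolute_error(y_test["Food Rate"], food_predictions)`, and likewise for the other two targets.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted ratings for four test rows
true_ratings = np.array([4.0, 3.5, 5.0, 2.0])
predicted_ratings = np.array([3.5, 3.5, 4.0, 3.0])

# Mean of the absolute differences: (0.5 + 0.0 + 1.0 + 1.0) / 4
mae = mean_absolute_error(true_ratings, predicted_ratings)
print(mae)  # 0.625
```

A lower value means the predicted ratings are, on average, closer to the true ratings.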

# 8. Combining the predictions with additional information from the original data

* ## In this code, you are creating a new DataFrame named `results` that combines the predictions with additional information from the original `data` DataFrame. Let's go through each column step by step:

 >## - `"Uniq_ID": data.loc[X_test.index, "Uniq ID"]`

 >## This column represents the unique ID for each observation in the testing set. `data.loc[X_test.index, "Uniq ID"]` extracts the `"Uniq ID"` values corresponding to the indices of the `X_test` DataFrame (which holds the testing set features).
 >## - `"Food_Prediction": food_predictions`

 >## This column contains the predictions made by the food prediction model for each observation in the testing set.
 >## - `"Staff_Prediction": staff_predictions`

 >## This column contains the predictions made by the staff prediction model for each observation in the testing set.
 >## - `"Environment_Prediction": env_predictions`

 >## This column contains the predictions made by the environment prediction model for each observation in the testing set.
 >## - `"City": data.loc[X_test.index, "City"]`

 >## This column represents the city for each observation in the testing set. `data.loc[X_test.index, "City"]` extracts the `"City"` values corresponding to the indices of the `X_test` DataFrame.
 >## - `"Image": data.loc[X_test.index, "Image"]`

 >## This column represents the image information for each observation in the testing set. `data.loc[X_test.index, "Image"]` extracts the `"Image"` values corresponding to the indices of the `X_test` DataFrame.
 >## - `"Description": data.loc[X_test.index, "Description"]`

 >## This column represents the description for each observation in the testing set. `data.loc[X_test.index, "Description"]` extracts the `"Description"` values corresponding to the indices of the `X_test` DataFrame.
 >## - `"Restaurant_name": data.loc[X_test.index, "Restaurant name"]`

 >## This column represents the restaurant name for each observation in the testing set. `data.loc[X_test.index, "Restaurant name"]` extracts the `"Restaurant name"` values corresponding to the indices of the `X_test` DataFrame.
 >## - `"Most_reviewed_foods": data.loc[X_test.index, "Most reviewed foods"]`

 >## This column represents the most reviewed foods for each observation in the testing set. `data.loc[X_test.index, "Most reviewed foods"]` extracts the `"Most reviewed foods"` values corresponding to the indices of the `X_test` DataFrame.
 >## The resulting `results` DataFrame combines the extracted values to provide a comprehensive view of the predictions and relevant information for each observation in the testing set.

In [None]:
# Combine predictions and original data
results = pd.DataFrame({
    "Uniq_ID": data.loc[X_test.index, "Uniq ID"],
    "Food_Prediction": food_predictions,
    "Staff_Prediction": staff_predictions,
    "Environment_Prediction": env_predictions,
    # Other relevant columns from your dataset
    "City": data.loc[X_test.index, "City"],
    "Image": data.loc[X_test.index, "Image"],
    "Description": data.loc[X_test.index, "Description"],
    "Restaurant_name": data.loc[X_test.index, "Restaurant name"],
    "Most_reviewed_foods": data.loc[X_test.index, "Most reviewed foods"]
})

# Display the results
print(results)

     Uniq_ID  Food_Prediction  Staff_Prediction  Environment_Prediction  \
521      522            3.754             3.688                   4.070   
737      738            3.854             3.694                   3.759   
740      741            4.050             3.585                   3.950   
660      661            3.756             3.238                   3.688   
411      412            4.166             3.573                   3.982   
..       ...              ...               ...                     ...   
408      409            3.968             3.568                   3.874   
332      333            3.737             3.177                   3.287   
208      209            3.855             3.181                   3.458   
613      614            3.881             3.591                   3.676   
78        79            3.950             3.576                   3.703   

            City                                              Image  \
521   Kurunegala  https://q-
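
The unsorted row labels in the output come from the shuffled test split. The sketch below reproduces the mechanics on hypothetical toy data (`toy_data`, `toy_test_index`, `toy_food_predictions`). The columns selected with `.loc` carry the test-set row labels, and the prediction array is attached to those rows by position.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `data`, the test-set index, and one prediction array
toy_data = pd.DataFrame(
    {"Uniq ID": [1, 2, 3, 4], "City": ["Colombo", "Kandy", "Galle", "Jaffna"]}
)
toy_test_index = pd.Index([2, 0])            # shuffled row labels, as a random split produces
toy_food_predictions = np.array([3.9, 4.1])  # one value per test row, in the same order

toy_results = pd.DataFrame({
    "Uniq_ID": toy_data.loc[toy_test_index, "Uniq ID"],  # Series labeled 2, 0
    "Food_Prediction": toy_food_predictions,             # plain array, matched by position
    "City": toy_data.loc[toy_test_index, "City"],
})
print(toy_results)
```

This works because `.predict(X_test)` returns its values in the same row order as `X_test`, so the first prediction belongs to the first test row label.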

# 9. Save Results to a .pkl File:

* ## You save the `results` DataFrame with `to_pickle` to a file named `'final_sorted_results.h5'`. Note that the file is written in pickle format despite its `.h5` extension.

In [None]:
# Save the results in pickle format (the .h5 name is kept, but this is not an HDF5 file)
results.to_pickle('final_sorted_results.h5')
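
A file written with `to_pickle` can be loaded back with `pd.read_pickle`. The following sketch shows the round trip on a small hypothetical DataFrame, using a temporary directory and a made-up file name (`toy_results.pkl`) so it does not touch the notebook's output file.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical DataFrame standing in for `results`
toy_results = pd.DataFrame({"Uniq_ID": [522, 738], "Food_Prediction": [3.754, 3.854]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_results.pkl")  # hypothetical file name
    toy_results.to_pickle(path)      # write in pickle format
    restored = pd.read_pickle(path)  # read it back

print(restored.equals(toy_results))  # True: values, dtypes, and index are preserved
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.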