# Building our Machine Learning Model for the given set of data

Steps:
1. Preprocess the dataset by handling missing values, encoding categorical variables, and splitting the data into training and testing sets.
2. Select a regression algorithm (e.g., linear regression, decision tree regression) and train it on the training data.
3. Evaluate the model's performance using appropriate regression metrics (e.g., mean squared error, R-squared) on the testing data.
4. Interpret the model's results and analyze the most influential features affecting restaurant ratings.
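
The four steps above can be sketched end to end as follows. This is only a minimal illustration on synthetic data: the column names `votes` and `price_range`, the coefficients used to generate the target, and the choice of `LinearRegression` are assumptions made for the example, not the restaurant dataset or the final model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the restaurant dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "votes": rng.integers(0, 1000, size=200),
    "price_range": rng.integers(1, 5, size=200),
})
# The target depends on both features plus a little noise.
y = 2.0 + 0.002 * X["votes"] + 0.3 * X["price_range"] + rng.normal(0, 0.1, size=200)

# Step 1 (in part): hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: fit a regression model on the training split.
model = LinearRegression().fit(X_train, y_train)

# Step 3: evaluate on the held-out split.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R-squared: {r2:.4f}")

# Step 4: inspect the coefficients as a rough indication of feature influence.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
```

Raw coefficients depend on the units of each feature, so they are only comparable across features once the features have been put on a common scale.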

## Preprocessing the data

### As outlined in the steps above, we have been given a large dataset that needs to be preprocessed. Once that is done, the remaining tasks become easier to perform.

#### Importance of preprocessing

#### 1. Handling Missing Data:
##### Missing values can lead to inaccurate models and poor performance. Preprocessing helps in dealing with missing values appropriately, either by imputing them or by removing rows/columns with missing values.
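
As a small illustration of the two options mentioned above (removing rows versus imputing), here is a sketch on a hypothetical toy frame. The column names `votes` and `cost` are made up for the example, and mean imputation is just one of several possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "votes": [10.0, np.nan, 30.0],
    "cost": [100.0, 200.0, np.nan],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(dropped)
print(imputed)
```

Dropping rows keeps only fully observed records (one row here), while imputation keeps all three rows at the cost of inserting estimated values.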
  
#### 2. Encoding Categorical Variables:
##### Machine learning algorithms require numerical input. Categorical data must be converted into a numerical format using techniques like label encoding or one-hot encoding.
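
The difference between the two encodings can be seen on a tiny example. The column name `rating_text` and the numeric codes below are illustrative assumptions; the sketch uses plain pandas rather than any particular encoder class.

```python
import pandas as pd

# Hypothetical toy column of ordered categories.
toy = pd.DataFrame({"rating_text": ["Good", "Poor", "Excellent", "Good"]})

# Label-style (ordinal) encoding: suitable when the categories have a natural order.
order = {"Poor": 1, "Good": 3, "Excellent": 5}
toy["rating_text_encoded"] = toy["rating_text"].map(order)

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy["rating_text"], prefix="rating", dtype=int)

print(toy)
print(one_hot)
```

An ordinal mapping keeps a single column but tells the model that the categories are ranked; one-hot encoding avoids that assumption but adds one column per category.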

#### 3. Feature Scaling:
##### Different features might have different scales (e.g., age in years vs. salary in dollars). Feature scaling (normalization or standardization) puts all features on a comparable scale, so that no feature dominates a scale-sensitive model simply because of its units.
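
A minimal standardization sketch, assuming two hypothetical features (age and salary) on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of [age in years, salary in dollars].
features = np.array([
    [25, 30000.0],
    [35, 60000.0],
    [45, 90000.0],
])

# Standardization rescales each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

print(scaled)
```

In a real workflow the scaler would normally be fitted on the training split only and then applied to the test split with `scaler.transform`, so that no information from the test data leaks into training.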

#### 4. Dealing with Outliers:
##### Outliers can skew the results of your model. Identifying and handling outliers can improve model accuracy.
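
One common way to identify outliers is the interquartile range (IQR) rule. The sketch below applies it to a hypothetical `votes` series; the 1.5 multiplier is the conventional choice, not something prescribed by this notebook, and clipping is only one possible way to handle the flagged values.

```python
import pandas as pd

# Hypothetical vote counts with one extreme value.
votes = pd.Series([12, 15, 14, 13, 16, 15, 400])

# Compute the IQR fences.
q1, q3 = votes.quantile(0.25), votes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (votes < lower) | (votes > upper)

# Cap extreme values at the fences instead of dropping the rows.
clipped = votes.clip(lower=lower, upper=upper)

print(votes[is_outlier])
print(clipped)
```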

#### 5. Data Cleaning:
##### Removing duplicates, correcting errors, and ensuring consistency in the data helps improve the quality of the dataset and, consequently, the model performance.

#### 6. Data Transformation:
##### Certain transformations can make the data more suitable for modeling (e.g., log transformation for skewed data, polynomial features for non-linear relationships).
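
As an example of a log transformation, the sketch below applies `np.log1p` (the logarithm of `1 + x`, which is defined at zero) to a hypothetical, strongly right-skewed `votes` series.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts, including a zero.
votes = pd.Series([0, 5, 50, 500, 5000])

# log1p compresses large values while keeping zero at zero.
log_votes = np.log1p(votes)

print(log_votes)
print("skew before:", votes.skew(), "skew after:", log_votes.skew())
```

The transformed values are spread far more evenly, which tends to suit models that assume roughly symmetric inputs.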

#### 7. Feature Selection:
##### Selecting the most relevant features can improve model performance and reduce overfitting.
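
A possible univariate approach to feature selection is scikit-learn's `SelectKBest` with the `f_regression` score. The data below is synthetic and constructed so that only the first column is related to the target; it is an assumption for illustration and does not reflect which features matter in the restaurant dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: three features, but only column 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Keep the single feature with the highest univariate F-score.
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X, y)
selected_columns = selector.get_support(indices=True)

print(selected_columns)
```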

#### 8. Improving Model Convergence:
##### Properly preprocessed data helps iterative training algorithms (for example, gradient-based optimizers) converge faster, which can reduce training time and improve performance.

In [1]:
# step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.impute import SimpleImputer
import re


# step 2: Load the dataset
file_path = 'dataset.csv'
df = pd.read_csv(file_path)

# step 3: Data Cleaning & Preprocessing

# Handle missing values
# Replace '?' with NaN
df.replace('?', np.nan, inplace=True)

# Function to clean special characters from a string
def clean_text(text):
    if isinstance(text, str):
        # Remove or replace unwanted characters here
        text = re.sub(r'[^\x00-\x7F]+', '', text)  # Removing non-ASCII characters
        text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
        return text
    return text
    
# Apply the cleaning function to all string columns
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].apply(clean_text)


# Convert numerical columns to numeric, forcing errors to NaN
numerical_columns = ['Average Cost for two', 'Price range', 'Votes', 'Country Code', 'Restaurant ID']
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')

# Fill numerical NaNs with the mean
df[numerical_columns] = df[numerical_columns].fillna(df[numerical_columns].mean())

# Convert the 'Aggregate rating' column to float dtype
df['Aggregate rating'] = df['Aggregate rating'].astype(float)
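
The coerce-then-fill pattern used in the cell above can be seen in isolation on a small, hypothetical series of raw strings:

```python
import pandas as pd

# Hypothetical raw column mixing valid numbers with placeholders and junk.
raw = pd.Series(["120", "?", "abc", "45.5"])

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Fill the resulting NaNs with the mean of the values that did parse.
filled = numeric.fillna(numeric.mean())

print(numeric)
print(filled)
```

Mean imputation is reasonable for measured quantities; for identifier-like columns a mean has no real meaning, so it may be worth checking that such columns contain no missing values in the first place.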


In [2]:
# Data Formatting (Mapping categorical values to numerical)

# 1.Rating color
color_mapping = {
    'Dark Green': 5,
    'Green': 4,
    'Yellow': 3,
    'Orange': 2,
    'Red': 1,
    'White': 0 
}
df['Rating color (numerical)'] = df['Rating color'].map(color_mapping)

# 2.Define the text mapping with the correct case and spacing
text_mapping = {
    'Excellent': 5,
    'Very Good': 4,
    'Good': 3,
    'Average': 2,
    'Poor': 1,
    'Not rated': 0
}

# Apply the mapping to the 'Rating text' column
df['Rating text (numerical)'] = df['Rating text'].map(text_mapping)

# Print the head of the DataFrame to see the new columns
print(df[['Rating text', 'Rating text (numerical)']].head())

# 3.Define the mapping for 'Has Table booking' column with correct case
table_booking_mapping = {
    'Yes': 1,
    'No': 0
}

# Apply the mapping to the 'Has Table booking' column
df['Has Table booking (numerical)'] = df['Has Table booking'].map(table_booking_mapping)

# Print the head of the DataFrame to see the new columns
print(df[['Has Table booking', 'Has Table booking (numerical)']].head())


# 4.Define the mapping for 'Is delivering now' column with correct case
delivering_now_mapping = {
    'Yes': 1,
    'No': 0
}

# Apply the mapping to the 'Is delivering now' column
df['Is delivering now (numerical)'] = df['Is delivering now'].map(delivering_now_mapping)

# Print the head of the DataFrame to see the new columns
print(df[['Is delivering now', 'Is delivering now (numerical)']].head())


# 5.Define the mapping for 'Switch to order menu' column with correct case
order_menu_mapping = {
    'Yes': 1,
    'No': 0
}

# Apply the mapping to the 'Switch to order menu' column
df['Switch to order menu (numerical)'] = df['Switch to order menu'].map(order_menu_mapping)

# Print the head of the DataFrame to see the new columns
print(df[['Switch to order menu', 'Switch to order menu (numerical)']].head())


  Rating text  Rating text (numerical)
0   Excellent                        5
1   Excellent                        5
2   Very Good                        4
3   Excellent                        5
4   Excellent                        5
  Has Table booking  Has Table booking (numerical)
0               Yes                              1
1               Yes                              1
2               Yes                              1
3                No                              0
4               Yes                              1
  Is delivering now  Is delivering now (numerical)
0                No                              0
1                No                              0
2                No                              0
3                No                              0
4                No                              0
  Switch to order menu  Switch to order menu (numerical)
0                   No                                 0
1                   No                  

In [3]:
# Data Binning
# Binning 'Votes'
bins_votes = [0, 100, 500, 1000, np.inf]
labels_votes = ['Low', 'Medium', 'High', 'Very High']
df['Votes_binned'] = pd.cut(df['Votes'], bins=bins_votes, labels=labels_votes, right=False, include_lowest=True)
df['Votes_binned'] = df['Votes_binned'].cat.add_categories(['Unknown']).fillna('Unknown')

# Binning 'Average Cost for two'
bins_cost = [0, 100, 200, 300, 400, np.inf]
labels_cost = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
df['Cost_binned'] = pd.cut(df['Average Cost for two'], bins=bins_cost, labels=labels_cost)

# Binning 'Aggregate rating'
bins_rating = [0, 2, 3, 4, 5]
labels_rating = [ 'Poor', 'Average', 'Good', 'Excellent']
df['Rating_binned'] = pd.cut(df['Aggregate rating'], bins=bins_rating, labels=labels_rating)

# Binning 'Price range'
bins = [0, 1, 2, 3, 4, 5]
labels = [ 'Very Low', 'Low', 'Medium', 'High', 'Very High']
df['Price_range_binned'] = pd.cut(df['Price range'], bins=bins, labels=labels, right=False)

# Handling Missing Values after binning
df.replace('?', np.nan, inplace=True)
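
One detail of `pd.cut` is worth keeping in mind when reading the binned columns. With the default settings (`right=True`, `include_lowest=False`) each bin is open on the left, so a value exactly equal to the lowest edge falls outside every bin and becomes `NaN`. The sketch below shows this on a few hypothetical rating values, reusing the rating bin edges from the cell above.

```python
import pandas as pd

# Hypothetical ratings, including one exactly on the lowest bin edge.
ratings = pd.Series([0.0, 2.0, 3.5, 5.0])
bins = [0, 2, 3, 4, 5]
labels = ["Poor", "Average", "Good", "Excellent"]

# Default: bins are (0, 2], (2, 3], (3, 4], (4, 5], so 0.0 is not in any bin.
default_cut = pd.cut(ratings, bins=bins, labels=labels)

# include_lowest=True closes the first bin on the left, so 0.0 lands in 'Poor'.
inclusive_cut = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

print(default_cut)
print(inclusive_cut)
```

Whether a value on the lowest edge should be missing or belong to the first bin is a modelling decision; in this notebook such values end up as missing and are later labelled `'Unknown'`.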

In [4]:
# Assuming df is your DataFrame after Data Binning

# Handling Missing Values after binning

# Replace '?' with NaN
df.replace('?', np.nan, inplace=True)

# List of binned columns
binned_columns = ['Votes_binned', 'Cost_binned', 'Rating_binned', 'Price_range_binned']

# Fill missing values in binned columns with 'Unknown'
for col in binned_columns:
    if 'Unknown' not in df[col].cat.categories:
        df[col] = df[col].cat.add_categories(['Unknown'])
    df[col] = df[col].fillna('Unknown')

# Replace empty strings with NaN in binned columns
for col in binned_columns:
    df[col] = df[col].replace('', np.nan)

# Fill any remaining NaNs with 'Unknown' in binned columns
for col in binned_columns:
    df[col] = df[col].fillna('Unknown')

# Print the head of the DataFrame to see the new columns
print(df[binned_columns].head())


# Convert numerical columns to numeric, forcing errors to NaN (if not already done)
numerical_columns = ['Restaurant ID', 'Country Code', 'Longitude', 'Latitude', 'Average Cost for two',
                     'Price range', 'Aggregate rating', 'Votes']

df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')
df[numerical_columns] = df[numerical_columns].fillna(df[numerical_columns].mean())


  Votes_binned Cost_binned Rating_binned Price_range_binned
0       Medium   Very High     Excellent               High
1         High   Very High     Excellent               High
2       Medium   Very High     Excellent          Very High
3       Medium   Very High     Excellent          Very High
4       Medium   Very High     Excellent          Very High


In [5]:
# Check if there are any missing values left
print(df.isnull().sum())

# Save the preprocessed DataFrame to a CSV file
output_csv_path = 'Preprocessed_Dataset.csv'
df.to_csv(output_csv_path, index=False)

print(f"Preprocessed data saved to {output_csv_path}")


Restaurant ID                       0
Restaurant Name                     0
Country Code                        0
City                                0
Address                             0
Locality                            0
Locality Verbose                    0
Longitude                           0
Latitude                            0
Cuisines                            9
Average Cost for two                0
Currency                            0
Has Table booking                   0
Has Online delivery                 0
Is delivering now                   0
Switch to order menu                0
Price range                         0
Aggregate rating                    0
Rating color                        0
Rating text                         0
Votes                               0
Rating color (numerical)            0
Rating text (numerical)             0
Has Table booking (numerical)       0
Is delivering now (numerical)       0
Switch to order menu (numerical)    0
Votes_binned