## Supervised Learning using Regression

## Predicting Price

## Objectives
Upon completing this assignment, you will learn how to write a simple AI application involving supervised learning using regression.

## Summary
Write an AI application that predicts the price of an Android cell phone based on its attributes. Use the provided dataset (k_mobilephonepriceprediction.csv), which contains data on 1370 Android cell phones in 17 columns, including the price. Use 80% of the data for training and 20% for testing. Train sklearn's 'LinearRegression' model. After training the model, test it and compute the Mean Absolute Percentage Error (MAPE). Additionally, report the trained model's coefficients (coef_) and intercept (intercept_).
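
MAPE is the mean of the absolute errors taken relative to the actual values: the mean of |actual - predicted| / |actual|. The sketch below uses made-up prices (not taken from the dataset) to compare a manual computation with scikit-learn's `mean_absolute_percentage_error`, which returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up actual and predicted prices, for illustration only
y_true = np.array([10000.0, 20000.0, 40000.0])
y_pred = np.array([11000.0, 18000.0, 40000.0])

# Manual MAPE: mean of |actual - predicted| / |actual|
mape_manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# scikit-learn returns the same quantity as a fraction (multiply by 100 for a percentage)
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred)

print(mape_manual, mape_sklearn)
```

Because the error is divided by the actual price, a fixed error on a cheap phone counts for more than the same error on an expensive one.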

#### Regressor Models to be Used
Evaluate the following regression models from sklearn and compare their performance using MAPE:

- LinearRegression from sklearn.linear_model
- KNeighborsRegressor from sklearn.neighbors (use n_neighbors=5)
- SVR from sklearn.svm
- RandomForestRegressor from sklearn.ensemble

Try some made-up attribute values with the best performing model and report the predicted prices.

## Implementation 

#### Preprocessing
- Remove rows with missing or null values
- Remove duplicate rows

#### Columns Used

Although the dataset has 17 columns, use only the following 6 feature columns for the prediction: Rating, Spec_score, Ram, Display, Screen_resolution, and company. The target column is Price.

#### Column Cleaning
- 'Rating' and 'Spec_score' column values are already numerical.
- 'Ram' and 'Display' values need cleaning as they are strings (e.g., '4 GB RAM' and '6.7 inches'). Remove the alphabetic characters and convert the remaining number to float.
- 'Screen_resolution' values are ordinal strings (e.g., '2408 x 1080 px Display with Water Drop Notch' and '720 x 1560 px Display with Punch Hole'). Convert them to numerical values using sklearn.preprocessing's label encoder 'LabelEncoder'.
- 'company' has many unique company names. Retain the top 5 names and convert the rest to "Others". Use 'pandas.get_dummies' for one-hot encoding.
- 'Price' values are strings with commas (e.g., '9,999'). Remove the commas and convert to float.
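
The string cleaning described in this list can be sketched with pandas string methods. This is one possible approach under the assumption that every value follows the formats quoted above; the `sample` frame is made up, and the notebook further below uses its own helper functions instead.

```python
import pandas as pd

# Made-up rows in the same string formats the assignment describes
sample = pd.DataFrame({
    "Ram": ["4 GB RAM", "12 GB RAM"],
    "Display": ["6.7 inches", "6.67 inches"],
    "Price": ["9,999", "1,19,990"],
})

# Keep the leading number (digits with an optional decimal part) and cast to float
number = r"(\d+(?:\.\d+)?)"
sample["Ram"] = sample["Ram"].str.extract(number, expand=False).astype(float)
sample["Display"] = sample["Display"].str.extract(number, expand=False).astype(float)

# Remove the comma separators before casting
sample["Price"] = sample["Price"].str.replace(",", "", regex=False).astype(float)

print(sample)
```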

## Data values are either quantitative (numerical) or qualitative (categorical)

#### Quantitative (Numerical) Values

Quantitative values can be shown along a number line and support mathematical operations (+, -, *, /). They can be either discrete or continuous.

**Discrete Values**: Discrete numerical values (e.g., integers, which may be negative, zero, or positive, and whole numbers, which are zero or positive) lie along a number line within a range but skip the values in between (e.g., fractions and decimals).  
**Continuous Values**: Continuous numerical values exist along a number line within a range without exclusions (e.g., floats). For example, shoe sizes are discrete values, but foot sizes are continuous values.  
**Regressors vs. Classifiers**: For continuous target values (e.g., prices), use regressors. For categorical targets (class labels), use classifiers.

#### Qualitative (Categorical, Non-numerical) Values
Qualitative values are not shown along a number line and do not support mathematical operations. They can be either nominal or ordinal.

**Nominal Values**: Nominal values are names without any ranking (e.g., hair colors: black, brown, red, etc.).  
**Ordinal Values**: Ordinal values are names with an implied ranking (e.g., job satisfaction levels).  
**Implementing Nominal and Ordinal Values**: Convert both nominal and ordinal values to numerical values:
- For nominal values, use pandas.get_dummies to create a separate column for each unique value. For example, in a hair color column, a new column is created for each named value (e.g., black, brown, red) and assigned a value of 1 or 0, indicating the presence or absence of the color.
- For ordinal values, use LabelEncoder to substitute ordered names with numerical values. LabelEncoder does not create new columns. Instead, it substitutes the value 0, 1, 2, 3, etc., for the ordered name values. 
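
A small sketch of both encodings on made-up values. One caveat worth seeing in practice: `LabelEncoder` assigns integers in sorted order of the labels, so the codes follow alphabetical order rather than the intended ranking. The explicit `rank` mapping at the end is a hypothetical alternative, not something the assignment requires.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example values
hair = pd.Series(["black", "brown", "red", "brown"], name="hair")
satisfaction = pd.Series(["low", "high", "medium", "low"], name="satisfaction")

# Nominal: one 0/1 column per unique value
hair_dummies = pd.get_dummies(hair, dtype=int)
print(hair_dummies)

# LabelEncoder: one integer per value, assigned in sorted (alphabetical) order
le = LabelEncoder()
codes = le.fit_transform(satisfaction)
print(dict(zip(le.classes_, range(len(le.classes_)))))

# Hypothetical alternative when the rank matters: map the names explicitly
rank = {"low": 0, "medium": 1, "high": 2}
ranked = satisfaction.map(rank)
print(ranked.tolist())
```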

## Implementation Notes

#### Dataset source
The dataset was downloaded from ???.

## Submittal
Your submission should include:  

- 'ipynb' file containing the source code, output, and your interaction.
- the corresponding 'HTML' file.

## Keith Yrisarri Stateson
June 23, 2024. Python 3.11.0

## Title: Predicting Mobile Phone Prices Using Various Regression Models - Supervised Learning

## Summary
This program is an AI application to predict the prices of Android cell phones based on their attributes. Supervised learning and regression techniques are used to train and evaluate multiple models on a provided dataset, predict phone prices for new, made-up attributes, and assess model accuracy using MAPE. The goal is to determine which model performs best in predicting phone prices and to understand the influence of various features on the price.

## Table of Contents
    
Part 1: DataFrame Cleaning

Part 2: Evaluate the Features and Target variable

Part 3: Data Cleaning

Part 4: Feature Engineering

Part 5: Train-Test Split and Feature Scaling

Part 6: Modeling
- Linear Regression Model
- Random Forest Regressor Model
- KNN Model (K-Nearest Neighbors)
- SVR Model (Support Vector Regressor)

Part 7: Identify the best performing model to predict mobile phone prices

Part 8: Predict Mobile Phone Prices with New Data

## Part 1: DataFrame Cleaning

Evaluate the DataFrame for missing values, empty rows and columns, and duplicate entries.

In [2]:
import seaborn as sns
import pandas as pd
import warnings
import numpy as np
import re

warnings.filterwarnings('ignore')
#warnings.filterwarnings('ignore', category=UserWarning)


df=pd.read_csv('k_mobilephonepriceprediction.csv',index_col=0)

print(df.shape)
df.head(3)

(1370, 17)


| | Name | Rating | Spec_score | No_of_sim | Ram | Battery | Display | Camera | External_Memory | Android_version | Price | company | Inbuilt_memory | fast_charging | Screen_resolution | Processor | Processor_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Samsung Galaxy F14 5G | 4.65 | 68 | Dual Sim, 3G, 4G, 5G, VoLTE, | 4 GB RAM | 6000 mAh Battery | 6.6 inches | 50 MP + 2 MP Dual Rear & 13 MP Front Camera | Memory Card Supported, upto 1 TB | 13 | 9999 | Samsung | 128 GB inbuilt | 25W Fast Charging | 2408 x 1080 px Display with Water Drop Notch | Octa Core Processor | Exynos 1330 |
| 1 | Samsung Galaxy A11 | 4.2 | 63 | Dual Sim, 3G, 4G, VoLTE, | 2 GB RAM | 4000 mAh Battery | 6.4 inches | 13 MP + 5 MP + 2 MP Triple Rear & 8 MP Fro... | Memory Card Supported, upto 512 GB | 10 | 9990 | Samsung | 32 GB inbuilt | 15W Fast Charging | 720 x 1560 px Display with Punch Hole | 1.8 GHz Processor | Octa Core |
| 2 | Samsung Galaxy A13 | 4.3 | 75 | Dual Sim, 3G, 4G, VoLTE, | 4 GB RAM | 5000 mAh Battery | 6.6 inches | 50 MP Quad Rear & 8 MP Front Camera | Memory Card Supported, upto 1 TB | 12 | 11999 | Samsung | 64 GB inbuilt | 25W Fast Charging | 1080 x 2408 px Display with Water Drop Notch | 2 GHz Processor | Octa Core |


In [3]:
# sum missing values by column
df.isna().sum()

Name                   0
Rating                 0
Spec_score             0
No_of_sim              0
Ram                    0
Battery                0
Display                0
Camera                 0
External_Memory        0
Android_version      443
Price                  0
company                0
Inbuilt_memory        19
fast_charging         89
Screen_resolution      2
Processor             28
Processor_name         0
dtype: int64

In [4]:
# Sum of missing values across all rows and columns in the entire dataframe. It does not indicate
# the number of empty rows directly, but rather the total number of missing entries in the dataframe.

df.isna().sum().sum()

np.int64(581)

In [5]:
# Find empty rows (rows where all elements are NaN)

empty_rows = df[df.isna().all(axis=1)]
print('Empty Rows: ', empty_rows)

Empty Rows:  Empty DataFrame
Columns: [Name, Rating, Spec_score, No_of_sim, Ram, Battery, Display, Camera, External_Memory, Android_version, Price, company, Inbuilt_memory, fast_charging, Screen_resolution, Processor, Processor_name]
Index: []


In [6]:
# Find empty columns (columns where all elements are NaN)

empty_columns = df.columns[df.isna().all()].tolist()
print("Empty Columns:", empty_columns)

Empty Columns: []


In [7]:
# Find duplicate rows

duplicate_rows = df[df.duplicated()]
print('Duplicate Rows:', duplicate_rows)

Duplicate Rows: Empty DataFrame
Columns: [Name, Rating, Spec_score, No_of_sim, Ram, Battery, Display, Camera, External_Memory, Android_version, Price, company, Inbuilt_memory, fast_charging, Screen_resolution, Processor, Processor_name]
Index: []


In [8]:
# Drop rows with missing values
df = df.dropna()
print(df.shape)

# Verify that there are no rows with missing values
df.isna().sum().sum()

(817, 17)


np.int64(0)
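
The notebook found no duplicate rows, so only `dropna` was applied. For a dataset that has both problems, the two preprocessing steps from the assignment can be chained. A minimal sketch on a made-up frame follows; the `reset_index` call is optional, and the notebook keeps the original row labels.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value and one duplicated row
raw = pd.DataFrame({
    "Name": ["A", "B", "B", "C"],
    "Ram": ["4 GB RAM", "6 GB RAM", "6 GB RAM", np.nan],
})

# Drop rows with any missing value, then drop exact duplicate rows
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

print(raw.shape, "->", clean.shape)
```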

## Part 2: Evaluate the Features and Target variable

In [9]:
# Evaluate the distribution of the target variable

df.Price.value_counts()

Price
19,990      22
14,999      22
13,999      21
11,999      20
9,999       18
            ..
14,950       1
15,590       1
17,945       1
19,490       1
1,19,990     1
Name: count, Length: 343, dtype: int64

In [10]:
# Evaluate the distribution of the feature 'Rating'
df.Rating.value_counts()

Rating
4.40    67
4.30    59
4.55    59
4.60    53
4.10    52
4.00    52
4.65    51
4.50    51
4.35    50
4.15    48
4.20    47
4.25    45
4.45    44
4.05    44
4.70    44
4.75    42
3.95     6
3.90     3
Name: count, dtype: int64

In [11]:
# Evaluate the distribution of the feature 'Spec_score'

df.Spec_score.value_counts()

Spec_score
75    81
84    53
86    50
80    42
85    37
82    37
83    37
78    36
77    34
79    34
81    31
74    29
76    28
89    28
71    26
88    23
73    21
72    21
87    19
70    16
69    13
90    12
67    12
68    12
91    11
93    11
92    10
66     9
64     8
94     8
63     6
65     5
95     4
96     3
54     2
61     2
98     1
58     1
62     1
60     1
53     1
55     1
Name: count, dtype: int64

In [12]:
# Evaluate the distribution of the feature 'Ram'

df.Ram.value_counts()

Ram
8 GB RAM     306
4 GB RAM     210
6 GB RAM     177
12 GB RAM     84
16 GB RAM     16
3 GB RAM      15
2 GB RAM       6
18 GB RAM      2
24 GB RAM      1
Name: count, dtype: int64

In [13]:
# Evaluate the distribution of the feature 'Display'

df.Display.value_counts()

Display
6.67 inches    91
6.5 inches     88
6.6 inches     82
6.7 inches     64
6.78 inches    51
               ..
7.45 inches     1
7.4 inches      1
7.1 inches      1
7.63 inches     1
10 inches       1
Name: count, Length: 65, dtype: int64

In [14]:
# Evaluate the distribution of the feature 'Screen_resolution'

df.Screen_resolution.value_counts()

Screen_resolution
1080 x 2400 px                                  240
720 x 1600 px                                    57
720 x 1600 px Display with Water Drop Notch      54
1080 x 2412 px                                   45
1080 x 2408 px                                   42
                                               ... 
1440 x 3200 px Display with Punch Hole            1
720 x 1544 px Display with Water Drop Notch       1
1080 x 2408 px Display with Punch Hole            1
Full HD+ Display with Punch Hole                  1
1080 x 2388 px Display with Water Drop Notch      1
Name: count, Length: 89, dtype: int64

In [15]:
# Evaluate the distribution of the feature 'company'

df.company.value_counts()

company
Samsung     149
Realme      126
Vivo        124
Motorola     72
Xiaomi       69
Poco         54
OnePlus      37
iQOO         23
Honor        21
TCL          20
OPPO         20
POCO         18
Huawei       18
Lava         13
Oppo         11
Google        9
itel          9
Asus          6
Lenovo        5
Tecno         4
LG            3
Itel          2
Nothing       1
Gionee        1
IQOO          1
Coolpad       1
Name: count, dtype: int64
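
The counts above list several companies under more than one spelling (Poco and POCO, OPPO and Oppo, itel and Itel, iQOO and IQOO). Merged, Poco would total 72, level with Motorola and ahead of Xiaomi, so the choice of the top 5 depends on whether the spellings are combined. The notebook keeps the case-sensitive names. The sketch below shows one hypothetical way to merge them, rebuilt from a few of the reported counts.

```python
import pandas as pd

# A few of the counts reported above, rebuilt as a Series for illustration
names = pd.Series(
    ["Poco"] * 54 + ["POCO"] * 18 + ["Motorola"] * 72 + ["Xiaomi"] * 69
)

# Case-sensitive counts treat 'Poco' and 'POCO' as different companies
print(names.value_counts())

# Hypothetical normalization: compare names case-insensitively
normalized = names.str.strip().str.lower()
merged_counts = normalized.value_counts()
print(merged_counts)
```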

In [16]:
df_target = df.Price

## Part 3: Data Cleaning - Features and Target variable

*Conversion of pandas Series into NumPy arrays.*  
scikit-learn estimators work on NumPy arrays internally; they also accept pandas Series and DataFrames and convert them.  
Converting the target to a NumPy array up front gives a plain array that is indexed by position rather than by the DataFrame's row labels.

In [17]:
# Define helper functions to clean the data

def cleanup_target(item):
    """Clean a Price string such as '9,999' and return it as a float."""
    item = re.sub(r',', '', item)    # Remove the comma separators
    item = re.sub(r'\s+', '', item)  # Remove all whitespace (spaces, tabs, newlines, carriage returns)
    return float(item)

def cleanup_ram(item):
    """Clean a Ram string such as '4 GB RAM' and return the number as a float."""
    digits = re.findall(r'\d+', item)  # Extract every run of digits from the string
    return float(digits[0])            # The first run is the RAM size in GB

def cleanup_display(item):
    """Clean a Display string such as '6.7 inches' and return it as a float."""
    item = re.sub(r'inches', '', item)  # Remove the unit 'inches'
    return float(item.strip())

def cleanup_screen_resolution(item):
    """Reduce a Screen_resolution string to the format 'W x H', e.g., '1920 x 1080'."""
    if 'px' in item:
        item_index = item.find('px')  # If the data were less regular, re.findall(r'\d+', item) could be used
        cleaned_item = item[:item_index - 1]  # Keep the text before 'px', without the preceding space
        return cleaned_item.strip()
    else:
        return "1920 x 1080"  # Fallback for values without 'px', such as 'Full HD+...'

def company(item):
    """Keep the five most frequent company names and map all others to 'Other'."""
    if item in ('Samsung', 'Realme', 'Vivo', 'Motorola', 'Xiaomi'):
        return item
    return 'Other'

In [18]:
# Assign and clean the target variable

target = df.Price
print(target.head(3))
print(type(target))

print('\n')
target = target.apply(cleanup_target)
target = np.array(target)
print(target[0:3])
print(type(target))
print(type(target[0]))

0     9,999
1     9,990
2    11,999
Name: Price, dtype: object
<class 'pandas.core.series.Series'>


[ 9999.  9990. 11999.]
<class 'numpy.ndarray'>
<class 'numpy.float64'>


In [19]:
# Assign the features

df_features = df.filter(['Rating', 'Spec_score', 'Ram', 'Display', 'Screen_resolution', 'company'], axis=1)
df_features

| | Rating | Spec_score | Ram | Display | Screen_resolution | company |
|---|---|---|---|---|---|---|
| 0 | 4.65 | 68 | 4 GB RAM | 6.6 inches | 2408 x 1080 px Display with Water Drop Notch | Samsung |
| 1 | 4.20 | 63 | 2 GB RAM | 6.4 inches | 720 x 1560 px Display with Punch Hole | Samsung |
| 2 | 4.30 | 75 | 4 GB RAM | 6.6 inches | 1080 x 2408 px Display with Water Drop Notch | Samsung |
| 4 | 4.10 | 69 | 4 GB RAM | 6.5 inches | 720 x 1600 px Display with Water Drop Notch | Samsung |
| 5 | 4.40 | 75 | 6 GB RAM | 6.5 inches | 720 x 1600 px | Samsung |
| ... | ... | ... | ... | ... | ... | ... |
| 1365 | 4.05 | 75 | 4 GB RAM | 6.6 inches | 720 x 1612 px | TCL |
| 1366 | 4.10 | 80 | 8 GB RAM | 6.8 inches | 1200 x 2400 px | TCL |
| 1367 | 4.00 | 80 | 6 GB RAM | 6.6 inches | 720 x 1612 px | TCL |
| 1368 | 4.50 | 79 | 6 GB RAM | 6.6 inches | 720 x 1612 px | TCL |


In [20]:
# Clean the feature 'Ram' (after assignment the column is still a pandas Series, now holding floats)

print(df_features.Ram[0:3])

df_features.Ram = df_features.Ram.apply(cleanup_ram)
df_features.Ram = np.array(df_features.Ram)

print('\n')
print(df_features.Ram[0:3])
print(type(df_features.Ram))
print(type(df_features.Ram[0]))

0    4 GB RAM
1    2 GB RAM
2    4 GB RAM
Name: Ram, dtype: object


0    4.0
1    2.0
2    4.0
Name: Ram, dtype: float64
<class 'pandas.core.series.Series'>
<class 'numpy.float64'>


In [21]:
# Clean the feature 'Display' (after assignment the column is still a pandas Series, now holding floats)

df_features.Display = df_features.Display.apply(cleanup_display)
df_features.Display = np.array(df_features.Display)

print(type(df_features.Display))
print(type(df_features.Display[0]))

<class 'pandas.core.series.Series'>
<class 'numpy.float64'>


In [22]:
# Clean the feature 'Screen_resolution' (after assignment the column is still a pandas Series of strings)

df_features.Screen_resolution = df_features.Screen_resolution.apply(cleanup_screen_resolution)
df_features.Screen_resolution = np.array(df_features.Screen_resolution)

print(type(df_features.Screen_resolution))
print(type(df_features.Screen_resolution[0]))

<class 'pandas.core.series.Series'>
<class 'str'>


In [23]:
# Clean the feature 'company' (after assignment the column is still a pandas Series of strings)

df_features.company = df_features.company.apply(company)
df_features.company = np.array(df_features.company)

print(type(df_features.company))
print(type(df_features.company[0]))

<class 'pandas.core.series.Series'>
<class 'str'>


## Part 4: Feature Engineering

Transform nominal categorical data to numerical using pandas.get_dummies, then drop the categorical column and add the new numerical columns to the features DataFrame.

Transform ordinal categorical data to numerical using LabelEncoder.

In [24]:
print(df_features.head(3))

   Rating  Spec_score  Ram  Display Screen_resolution  company
0    4.65          68  4.0      6.6       2408 x 1080  Samsung
1    4.20          63  2.0      6.4        720 x 1560  Samsung
2    4.30          75  4.0      6.6       1080 x 2408  Samsung


In [25]:
# Encode the 'Screen_resolution' feature from ordinal categorical to numerical using LabelEncoder

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

print(df_features.Screen_resolution.head(3))
print('\n')

df_features.Screen_resolution = le.fit_transform(df_features.Screen_resolution)

print(df_features.Screen_resolution.head(3))
print('\n')
print(df_features.Screen_resolution.value_counts())
print('\n')

# Visual mapping of Screen_resolution values to integers 
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)

0    2408 x 1080
1     720 x 1560
2    1080 x 2408
Name: Screen_resolution, dtype: object


0    49
1    55
2     9
Name: Screen_resolution, dtype: int64


Screen_resolution
7     284
56    113
9      58
4      55
10     45
     ... 
37      1
41      1
2       1
36      1
51      1
Name: count, Length: 62, dtype: int64


{'1080 x 1920': 0, '1080 x 2160': 1, '1080 x 2256': 2, '1080 x 2280': 3, '1080 x 2340': 4, '1080 x 2376': 5, '1080 x 2388': 6, '1080 x 2400': 7, '1080 x 2404': 8, '1080 x 2408': 9, '1080 x 2412': 10, '1080 x 2448': 11, '1080 x 2460': 12, '1080 x 2480': 13, '1080 x 2520': 14, '1176 x 2400': 15, '1200 x 2400': 16, '1200 x 2640': 17, '1212 x 2616': 18, '1220 x 2712': 19, '1240 x 2772': 20, '1260 x 2712': 21, '1260 x 2720': 22, '1260 x 2800': 23, '1264 x 2780': 24, '1344 x 2772': 25, '1440 x 2780': 26, '1440 x 2960': 27, '1440 x 3040': 28, '1440 x 3088': 29, '1440 x 3120': 30, '1440 x 3168': 31, '1440 x 3200': 32, '1440 x 3216': 33, '1600 x 2560': 34, '1600 x 720': 35, '1
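
Note that `LabelEncoder` sorts the strings lexicographically, so in the output above '720 x 1560' receives code 55, which is higher than the 49 assigned to '2408 x 1080'. If the encoded value should grow with the actual resolution, one alternative is to parse the two numbers and use the total pixel count. This is only a sketch; `pixel_count` is a hypothetical helper and not part of the notebook.

```python
import re

def pixel_count(resolution: str) -> int:
    """Hypothetical helper: turn 'W x H' into W * H total pixels."""
    width, height = (int(n) for n in re.findall(r"\d+", resolution)[:2])
    return width * height

# Cleaned resolution strings taken from the first three rows above
resolutions = ["2408 x 1080", "720 x 1560", "1080 x 2408"]
pixels = [pixel_count(r) for r in resolutions]
print(pixels)
```

With this encoding, '2408 x 1080' and '1080 x 2408' map to the same value, since they differ only in orientation.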

In [26]:
# Convert the 'company' feature from categorical to numerical using get_dummies

print(df_features['company'].head(3))
print('\n')

df_features_company_num = pd.get_dummies(df_features.company, dtype=int, drop_first=True)
df_features_company_num

0    Samsung
1    Samsung
2    Samsung
Name: company, dtype: object




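| | Other | Realme | Samsung | Vivo | Xiaomi |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1365 | 1 | 0 | 0 | 0 | 0 |
| 1366 | 1 | 0 | 0 | 0 | 0 |
| 1367 | 1 | 0 | 0 | 0 | 0 |
| 1368 | 1 | 0 | 0 | 0 | 0 |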
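
The output has five columns for six company groups because `drop_first=True` removes the first category in sorted order, here Motorola. That category becomes the baseline: a row with zeros in all five columns stands for a Motorola phone. A small sketch with one made-up row per group:

```python
import pandas as pd

# The six company groups that remain after mapping the rest to 'Other'
company_demo = pd.Series(["Samsung", "Motorola", "Other", "Realme", "Vivo", "Xiaomi"])

# drop_first=True removes the first category in sorted order ('Motorola')
dummies = pd.get_dummies(company_demo, dtype=int, drop_first=True)
print(list(dummies.columns))

# The Motorola row is therefore encoded as all zeros
motorola_row = dummies.iloc[1]
print(motorola_row.tolist())
```

Dropping one column avoids a redundant feature in the linear model, because the dropped category is implied by the other five being zero.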


In [27]:
# Drop the 'company' column from the features dataframe and concatenate the one-hot encoded company columns

df_features = df_features.drop(['company'], axis=1)
df_features = pd.concat([df_features, df_features_company_num], axis=1)
print(df_features)

      Rating  Spec_score   Ram  Display  Screen_resolution  Other  Realme  \
0       4.65          68   4.0      6.6                 49      0       0   
1       4.20          63   2.0      6.4                 55      0       0   
2       4.30          75   4.0      6.6                  9      0       0   
4       4.10          69   4.0      6.5                 56      0       0   
5       4.40          75   6.0      6.5                 56      0       0   
...      ...         ...   ...      ...                ...    ...     ...   
1365    4.05          75   4.0      6.6                 58      1       0   
1366    4.10          80   8.0      6.8                 16      1       0   
1367    4.00          80   6.0      6.6                 58      1       0   
1368    4.50          79   6.0      6.6                 58      1       0   
1369    4.65          93  12.0     10.0                 42      1       0   

      Samsung  Vivo  Xiaomi  
0           1     0       0  
1           1  

## Part 5: Train-Test Split and Feature Scaling

Assign features and target.  
Split the dataset into 80% training, 20% testing.  
Fit the StandardScaler on the training features only.  
Apply the fitted transformation to both the training and test datasets.

In [28]:
# Assign the features and the target variable, and split the data into training and test sets

X = df_features
y = target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [29]:
# Standardize the features: fit StandardScaler on the training data, then transform both the training and test data

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
print(X_train_scaled)
print('\n')
print(X_test_scaled)

[[ 0.97353535 -0.18314954 -0.34562274 ... -0.44268578 -0.43025338
  -0.30327091]
 [-1.44866916  0.50816059  0.37085868 ...  2.25893863 -0.43025338
  -0.30327091]
 [ 0.09273371 -0.04488752 -0.34562274 ... -0.44268578 -0.43025338
  -0.30327091]
 ...
 [ 0.53313453  0.64642261  0.37085868 ... -0.44268578 -0.43025338
  -0.30327091]
 [-1.22846875  0.50816059  0.37085868 ... -0.44268578 -0.43025338
  -0.30327091]
 [ 0.09273371 -2.11881791 -1.42034486 ... -0.44268578 -0.43025338
   3.29738188]]


[[ 0.31293412 -1.42750778 -1.06210415 ... -0.44268578 -0.43025338
  -0.30327091]
 [ 1.41393617  0.36989856  0.37085868 ... -0.44268578 -0.43025338
  -0.30327091]
 [-0.1274667   1.19947072  0.37085868 ...  2.25893863 -0.43025338
  -0.30327091]
 ...
 [-0.78806793 -0.18314954 -0.34562274 ... -0.44268578 -0.43025338
  -0.30327091]
 [-1.22846875 -0.87445967  0.37085868 ... -0.44268578  2.32421186
  -0.30327091]
 [ 0.31293412 -0.18314954 -0.34562274 ...  2.25893863 -0.43025338
  -0.30327091]]
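
Standardization replaces each value with z = (x - mean) / standard deviation, where the mean and standard deviation come from the training rows only. The sketch below, on a made-up matrix, checks that `transform` on test data reuses the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on different scales
X_train_demo = np.array([[4.0, 60.0], [6.0, 75.0], [8.0, 90.0]])
X_test_demo = np.array([[12.0, 95.0]])

scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train_demo)  # Learns mean_ and scale_ from the training rows
X_test_z = scaler.transform(X_test_demo)        # Reuses the training statistics

# Same result by hand: z = (x - training mean) / training standard deviation
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(X_test_z, manual)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.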


## Part 6: Modeling

Linear Regression  
Random Forest Regressor  
KNN Model (K-Nearest Neighbor)  
SVR (Support Vector Regression)

In [30]:
# Train the model using Linear Regression, RandomForestRegressor, KNeighborsRegressor, and SVR, and make predictions

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_percentage_error

# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# RandomForestRegressor
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train_scaled, y_train)

# KNeighborsRegressor (n_neighbors=5, as the assignment requires, is also the scikit-learn default)
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train_scaled, y_train)

# SVR
svr_model = SVR()
svr_model.fit(X_train_scaled, y_train)

# Prediction using Linear Regression
y_pred_lr = lr_model.predict(X_test_scaled)
# y_pred_lr = np.maximum(0, y_pred_lr)  # This only corrects the output after the prediction, but it doesn't address
# the underlying issue with the model training and feature engineering.
print(f'Predicted Prices (Linear Regression): {y_pred_lr[:10]}')
print('\n')

# Prediction using RandomForestRegressor
y_pred_rf = rf_model.predict(X_test_scaled)
# y_pred_rf = np.maximum(0, y_pred_rf)
print(f'Predicted Prices (RandomForest): {y_pred_rf[:10]}')
print('\n')

# Prediction using KNeighborsRegressor
y_pred_knn = knn_model.predict(X_test_scaled)
# y_pred_knn = np.maximum(0, y_pred_knn)
print(f'Predicted Prices (KNeighbors): {y_pred_knn[:10]}')
print('\n')

# Prediction using SVR
y_pred_svr = svr_model.predict(X_test_scaled)
# y_pred_svr = np.maximum(0, y_pred_svr)
print(f'Predicted Prices (SVR): {y_pred_svr[:10]}')
print('\n')

# Print the actual prices of the test set
print(f'Actual Prices: {y_test[0:10]}')

Predicted Prices (Linear Regression): [10005.79775342 23906.88301057 54909.530152   17592.3095252
 42305.35025807 12969.69994686 46291.18979772 21624.45500174
 30027.20189087 -9168.89902842]


Predicted Prices (RandomForest): [10644.66       19337.505      42039.3        11326.62
 24793.815      13024.92       30742.11666667 17841.5625
 22371.84880952  8594.035     ]


Predicted Prices (KNeighbors): [11095.4 16995.2 32551.4 10537.2 29199.  12165.4 33215.8 17085.6 28193.6
  7659. ]


Predicted Prices (SVR): [15970.37158905 16028.28402133 16061.11881974 15989.94798245
 16067.70983654 16023.44945965 16067.37293901 16046.85313871
 16055.47187045 15973.20087165]


Actual Prices: [13990. 19990. 64999.  8999. 29999. 14990. 24999. 12990. 29990.  6999.]
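
The assignment also asks for the linear model's coefficients and intercept. For the notebook's model these would be `lr_model.coef_` and `lr_model.intercept_`, with one coefficient per column of `X` in column order. Because the features were standardized, each coefficient is the price change per one standard deviation of that feature. The sketch below shows the two attributes on made-up data where the true values are known.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 3*x0 - 2*x1 + 5, so the fit is exact
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature column, in column order, plus a single intercept
print("coef_:", demo_model.coef_)
print("intercept_:", demo_model.intercept_)
```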


In [31]:
# Calculate MAPE for each model

# Calculate MAPE for Linear Regression
mape_lr = mean_absolute_percentage_error(y_test, y_pred_lr)
print(f'Linear Regression MAPE: {mape_lr}')

# Calculate MAPE for RandomForestRegressor
mape_rf = mean_absolute_percentage_error(y_test, y_pred_rf)
print(f'RandomForestRegressor MAPE: {mape_rf}')

# Calculate MAPE for KNeighborsRegressor
mape_knn = mean_absolute_percentage_error(y_test, y_pred_knn)
print(f'KNeighborsRegressor MAPE: {mape_knn}')

# Calculate MAPE for SVR
mape_svr = mean_absolute_percentage_error(y_test, y_pred_svr)
print(f'SVR MAPE: {mape_svr}')

Linear Regression MAPE: 0.6308508179691432
RandomForestRegressor MAPE: 0.22001452363267027
KNeighborsRegressor MAPE: 0.2601163379108481
SVR MAPE: 0.45149758966197023
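
The SVR predictions above all sit close to 16,000. A plausible explanation, not verified on this dataset, is that SVR's default `C=1.0` limits how far predictions can move from a central value when the target is in the tens of thousands. One common remedy is to standardize the target as well, for example with scikit-learn's `TransformedTargetRegressor`. The sketch below uses synthetic data, so its numbers say nothing about the phone dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: two features and a price-like target in the tens of thousands
rng = np.random.default_rng(0)
X_syn = rng.uniform(-2, 2, size=(300, 2))
y_syn = 20000 + 4000 * X_syn[:, 0] + 2000 * X_syn[:, 1]

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Default SVR fitted directly on the raw target
plain_svr = SVR().fit(Xs_train, ys_train)

# Same SVR, but the target is standardized for fitting and converted back for predictions
scaled_svr = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled_svr.fit(Xs_train, ys_train)

pred_plain = plain_svr.predict(Xs_test)
pred_scaled = scaled_svr.predict(Xs_test)

mape_plain = mean_absolute_percentage_error(ys_test, pred_plain)
mape_scaled = mean_absolute_percentage_error(ys_test, pred_scaled)
print(mape_plain, mape_scaled)
```

Tuning `C` and `epsilon` directly is another option; either way, the comparison with the other models would need to be rerun on the real data.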


## Part 7: Identify the best performing model to predict mobile phone prices

In [32]:
# Best model based on MAPE
mape_values_dict = {'Linear Regression': mape_lr, 'RandomForestRegressor': mape_rf, 'KNeighborsRegressor': mape_knn, 'SVR': mape_svr}
print(f'Best Model is: {min(mape_values_dict, key=mape_values_dict.get)} with MAPE: {min(mape_values_dict.values())}')

Best Model is: RandomForestRegressor with MAPE: 0.22001452363267027


## Part 8: Predict Mobile Phone Prices with New Data

In [35]:
# Column order: Rating, Spec_score, Ram, Display, Screen_resolution, Other, Realme, Samsung, Vivo, Xiaomi
sc_new_data = sc.transform([[4.99, 65, 6.0, 6.5, 33, 0, 0, 1, 0, 0]])
print(rf_model.predict(sc_new_data))
print(knn_model.predict(sc_new_data))
print(svr_model.predict(sc_new_data))
print(lr_model.predict(sc_new_data))

[9759.045]
[8997.2]
[16010.54752924]
[860.26422755]
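
The value 33 passed for Screen_resolution is an encoded label; in the mapping printed in Part 4 it corresponds to '1440 x 3216'. Rather than looking the code up by hand, new inputs can be passed through the same fitted encoder and scaler. The sketch below uses a tiny made-up training frame; `le_demo` and `sc_demo` are stand-ins for the notebook's `le` and `sc`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny stand-in for the training features (made-up rows)
train = pd.DataFrame({
    "Ram": [4.0, 6.0, 8.0],
    "Screen_resolution": ["720 x 1600", "1080 x 2400", "1080 x 2400"],
})

# Fit the encoder and the scaler on the training data only
le_demo = LabelEncoder()
train["Screen_resolution"] = le_demo.fit_transform(train["Screen_resolution"])
sc_demo = StandardScaler().fit(train)

# Build the new row with the training column names, reusing the fitted encoder
new_row = pd.DataFrame({
    "Ram": [6.0],
    "Screen_resolution": le_demo.transform(["1080 x 2400"]),
})[train.columns]

new_scaled = sc_demo.transform(new_row)
print(new_scaled)
```

Passing a DataFrame with the training column names also guards against supplying the values in the wrong order.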
