## Supervised Learning using Regression

## Predicting Price

## Objectives
Upon completing this assignment, you will learn how to write a simple AI application involving supervised learning using regression.

## Summary
Write an AI application that predicts the price of an Android cell phone based on its attributes. Use the prvoided dataset (k_mobilephonepriceprediction.csv), which contains data on 1370 Android cell phones with 17 features, including the price. Use 80% of the data for training and 20% for testing. Train the sklearn's 'LinearRegression' model. After training the model, test the model and compute the Mean Absolute Percentage Error (MAPE). Additionally, report the trained model's coefficients (coef_) and intercept (intercept_). 

#### Regressor Models to be Used
Evaluate the following regression models from sklearn and compare their performance using MAPE:
- LinearRegression from sklearn.linear_model
- KNeighborsRegressor from sklearn.neighbors (use n_neighbors=5)
- SVR from sklearn.svm
- RandomForestRegressor from sklearn.ensemble

Try some made-up attribute values with the best performing model and report the predicted prices.

## Implementation 

#### Preprocessing
- Remove rows with missing or null values
- Remove duplicate rows

#### Columns Used
Although dataset has 17 columns, use only the following 6 columns for the prediction: Rating,	Spec_score,	Inbuilt_memory, Processor, company, Price

#### Column Cleaning
- Rating and 'Spec_score' column values are already numerical.
- 'Inbuilt_memory' values need cleaning as they are strings (e.g., "64 GB"). Remove "GB" and convert to flaot.
- 'Processor' values are ordinal strings (e.g., 'Octa Core Processor'). Convert them to numerical values using sklearn.preprocessing's label encoder 'LabelEncoder'.
- 'company' has many unique company names. Retain the top 4 names and vert the rest to "Others". Use 'pandas.get_dummies' for one-hot encoding.
- 'Price' values are strings with commas (e.g., '9,999'). Remove the commas and convert to float.

## Data values are either quantitative (numerical) or qualitative (categorical)

#### Quantitative (Numerical) Values

Quantitative values can be shown along a number line and support mathematical operations (+, -, *, /). They can be either discrete or continuous.

**Discrete Values**: Discrete numerical values (e.g., integers (neg, 0, and pos) and whole numbers (0 and pos #s)) exist along a number line within a range but exclude some values (e.g., fractions and decimals).

**Continuous Values**: Continuous numerical values exist along a number line within a range without exclusions (e.g., floats). For example, shoe sizes are discrete values, but foot sizes are continuous values.

**Regressors vs. Classifiers**: For continuous target values (e.g., prices), use regressors. For other targets, use classifiers.

#### Qualitative (Categorical, Non-numerical) Values
Qualitative values are not shown along a number line and do not support mathematical operations. They can be either nominal or ordinal.

**Nominal Values**: Nominal values are names without any ranking (e.g., hair colors: black, brown, red, etc.).

**Ordinal Values**: Ordinal values are names with an implied ranking (e.g., job satisfaction levels).

**Implementing Nominal and Ordinal Values**: Convert both nominal and ordinal values to numerical values:
- For nominal values, use pandas.get_dummies to create separate columns for each unique value. Pandas creates a separate column for each unique name value. For example, in a hair color column, a new column is created for each named value (e.g., black, brown, red) and assigned a value of 0 or 1, indicating the presence or absence of the color.
- For ordinal values, use LabelEncoder to substitute ordered names with numerical values. LabelEncoder does not create new columns. Instead, it substitutes the value 0, 1, 2, 3, etc., for the ordered name values. 

## Implementation Notes

#### Dataset source
The dataset was downloaded from ???.

## Submittal
Your submission should include:

- 'jpynb' file containing the source code, output, and your interaction.
- the corresponding 'HTML' file.

## Coding
Follow the steps below.

## Keith Yrisarri Stateson
June 21, 2024. Python 3.11.0

## Title: Mobile Phone Price Prediction Using Supervised Learning

## Summary
This program is an AI application to predict the prices of Android cell phones based on their attributes. Supervised learning and regression techniques are used to train and evaluate multiple models on a provided dataset. The goal is to determine which model performs best in predicting phone prices and to understand the influence of various features on the price.

- Part 1: Data Preprocessing - clean data, remove null and duplicate rows
- Part 2: EDA and Feature Engineering - analyze features v target variable, convert categorical data to numerical, and nominal and ordinal values using pandas.get_dummies and LabelEncoder
- Part 3: Model Training and Evaluation - split the dataset into 80% training, 20% testing. Train various regression models and evaluate their performance using MAPE
    - 3a Linear Regression model
    - 3b Random Forest Regressor model
- Part 4: Prediction - use the best performing model to predict mobile phone prices for new, made-up attribute values
    - 4a Linear Regression
    - 4b Random Forest Regressor

In [1]:
pip install seaborn

Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting matplotlib!=3.6.1,>=3.4 (from seaborn)
  Downloading matplotlib-3.9.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading contourpy-1.2.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (5.8 kB)
Collecting cycler>=0.10 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading fonttools-4.53.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (162 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.2/162.2 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting kiwisolver>=1.3.1 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading kiwisolver-1.4.5-cp311-cp311-macosx_11_0_arm64.whl.metadata (6.4 kB)
Collecting pillow>=8 (from matplotlib!=3.6.1,>=3.4->seaborn

In [27]:
import re

## Part 1: Data Preprocessing

In [217]:
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
#warnings.filterwarnings('ignore', category=UserWarning)

df=pd.read_csv('k_mobilephonepriceprediction.csv',index_col=0)

print(df.shape)
df


(1370, 17)


Unnamed: 0,Name,Rating,Spec_score,No_of_sim,Ram,Battery,Display,Camera,External_Memory,Android_version,Price,company,Inbuilt_memory,fast_charging,Screen_resolution,Processor,Processor_name
0,Samsung Galaxy F14 5G,4.65,68,"Dual Sim, 3G, 4G, 5G, VoLTE,",4 GB RAM,6000 mAh Battery,6.6 inches,50 MP + 2 MP Dual Rear &amp; 13 MP Front Camera,"Memory Card Supported, upto 1 TB",13,9999,Samsung,128 GB inbuilt,25W Fast Charging,2408 x 1080 px Display with Water Drop Notch,Octa Core Processor,Exynos 1330
1,Samsung Galaxy A11,4.20,63,"Dual Sim, 3G, 4G, VoLTE,",2 GB RAM,4000 mAh Battery,6.4 inches,13 MP + 5 MP + 2 MP Triple Rear &amp; 8 MP Fro...,"Memory Card Supported, upto 512 GB",10,9990,Samsung,32 GB inbuilt,15W Fast Charging,720 x 1560 px Display with Punch Hole,1.8 GHz Processor,Octa Core
2,Samsung Galaxy A13,4.30,75,"Dual Sim, 3G, 4G, VoLTE,",4 GB RAM,5000 mAh Battery,6.6 inches,50 MP Quad Rear &amp; 8 MP Front Camera,"Memory Card Supported, upto 1 TB",12,11999,Samsung,64 GB inbuilt,25W Fast Charging,1080 x 2408 px Display with Water Drop Notch,2 GHz Processor,Octa Core
3,Samsung Galaxy F23,4.10,73,"Dual Sim, 3G, 4G, VoLTE,",4 GB RAM,6000 mAh Battery,6.4 inches,48 MP Quad Rear &amp; 13 MP Front Camera,"Memory Card Supported, upto 1 TB",12,11999,Samsung,64 GB inbuilt,,720 x 1600 px,Octa Core,Helio G88
4,Samsung Galaxy A03s (4GB RAM + 64GB),4.10,69,"Dual Sim, 3G, 4G, VoLTE,",4 GB RAM,5000 mAh Battery,6.5 inches,13 MP + 2 MP + 2 MP Triple Rear &amp; 5 MP Fro...,"Memory Card Supported, upto 1 TB",11,11999,Samsung,64 GB inbuilt,15W Fast Charging,720 x 1600 px Display with Water Drop Notch,Octa Core,Helio P35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1365,TCL 40R,4.05,75,"Dual Sim, 3G, 4G, 5G, VoLTE,",4 GB RAM,5000 mAh Battery,6.6 inches,50 MP + 2 MP + 2 MP Triple Rear &amp; 8 MP Fro...,Memory Card (Hybrid),12,18999,TCL,64 GB inbuilt,15W Fast Charging,720 x 1612 px,Octa Core,Dimensity 700 5G
1366,TCL 50 XL NxtPaper 5G,4.10,80,"Dual Sim, 3G, 4G, VoLTE,",8 GB RAM,5000 mAh Battery,6.8 inches,50 MP + 2 MP Dual Rear &amp; 16 MP Front Camera,Memory Card (Hybrid),14,24990,TCL,128 GB inbuilt,33W Fast Charging,1200 x 2400 px,Octa Core,Dimensity 7050
1367,TCL 50 XE NxtPaper 5G,4.00,80,"Dual Sim, 3G, 4G, 5G, VoLTE,",6 GB RAM,5000 mAh Battery,6.6 inches,50 MP + 2 MP Dual Rear &amp; 16 MP Front Camera,"Memory Card Supported, upto 1 TB",13,23990,TCL,256 GB inbuilt,18W Fast Charging,720 x 1612 px,Octa Core,Dimensity 6080
1368,TCL 40 NxtPaper 5G,4.50,79,"Dual Sim, 3G, 4G, 5G, VoLTE,",6 GB RAM,5000 mAh Battery,6.6 inches,50 MP + 2 MP + 2 MP Triple Rear &amp; 8 MP Fro...,"Memory Card Supported, upto 1 TB",13,22499,TCL,256 GB inbuilt,15W Fast Charging,720 x 1612 px,Octa Core,Dimensity 6020


In [251]:
df.isna().sum().sum()

np.int64(0)

In [252]:
df=df.dropna()
df.isna().sum().sum()

np.int64(0)

In [253]:
df.duplicated().sum()

np.int64(0)

## Part 2: EDA and Feature Engineering

In [254]:
df.Price.value_counts()

Price
19,990      22
14,999      22
13,999      21
11,999      20
9,999       18
            ..
14,950       1
15,590       1
17,945       1
19,490       1
1,19,990     1
Name: count, Length: 343, dtype: int64

In [255]:
df.Rating.value_counts()

Rating
4.40    67
4.30    59
4.55    59
4.60    53
4.10    52
4.00    52
4.65    51
4.50    51
4.35    50
4.15    48
4.20    47
4.25    45
4.45    44
4.05    44
4.70    44
4.75    42
3.95     6
3.90     3
Name: count, dtype: int64

In [256]:
df.Spec_score.value_counts()

Spec_score
75    81
84    53
86    50
80    42
85    37
82    37
83    37
78    36
77    34
79    34
81    31
74    29
76    28
89    28
71    26
88    23
73    21
72    21
87    19
70    16
69    13
90    12
67    12
68    12
91    11
93    11
92    10
66     9
64     8
94     8
63     6
65     5
95     4
96     3
54     2
61     2
98     1
58     1
62     1
60     1
53     1
55     1
Name: count, dtype: int64

In [257]:
df.Inbuilt_memory.value_counts()

Inbuilt_memory
128 GB inbuilt    456
256 GB inbuilt    174
64 GB inbuilt     148
512 GB inbuilt     19
32 GB inbuilt      16
1 TB inbuilt        4
Name: count, dtype: int64

In [258]:
df.Processor.value_counts()

Processor
Octa Core              770
Octa Core Processor     39
2 GHz Processor          2
Nine-Cores               2
1.8 GHz Processor        1
Quad Core                1
Deca Core Processor      1
2.3 GHz Processor        1
Name: count, dtype: int64

In [259]:
df.company.value_counts()

company
Samsung     149
Realme      126
Vivo        124
Motorola     72
Xiaomi       69
Poco         54
OnePlus      37
iQOO         23
Honor        21
TCL          20
OPPO         20
POCO         18
Huawei       18
Lava         13
Oppo         11
Google        9
itel          9
Asus          6
Lenovo        5
Tecno         4
LG            3
Itel          2
Nothing       1
Gionee        1
IQOO          1
Coolpad       1
Name: count, dtype: int64

In [260]:
import re
def cleanup_target (item):
    """ This function cleans up the target column, 'Price', by removing commas and converting the string to a float. """
    item = re.sub (r'[,]', '',item)  # Replace comma with an empty string
    item = re.sub (r'\s+', '',item)  # Replace whitespace characters: spaces, tabs, newlines, and carriage returns with an empty string
    return float (item)

def cleanup_company (item):
    """ This function cleans up the 'company' column by grouping the companies into 5 categories. """
    if (item == 'Samsung' or item == 'Realme' or item == 'Vivo' or item == 'Motorola'):
        return item
    else:
        return "Others"

def cleanup_processor (item):
    """ This function cleans up the 'Processor' column by grouping the processors into 3 categories. """
    item = item.strip()
    if (item == 'Octa Core' or item == 'Octa Core Processor'):
        return item
    else:
        return "Others"

def cleanup_inbuilt_memory (item):
    """ This function cleans up the 'Inbuilt_memory' column by removing 'TB', 'GB', and 'inbuilt', and converting the string to a float. """
    item = re.sub ('TB', '',item)
    item = re.sub ('GB', '',item)
    item = re.sub ('inbuilt', '',item)
    item = re.sub (r'\s+', '',item)
    value = float(item)
    if 'TB' in item:
        value = value * 1000
    return value

#### Conversion of Panda Series into a NumPy Array

1. Compatibility: Many machine learning libraries, such as scikit-learn, expect input data to be in the form of NumPy arrays rather than pandas Series. Converting the target to a NumPy array ensures compatibility with these libraries.
2. Performance: NumPy arrays are more efficient for numerical operations due to their optimized data structures and functions.
3. Consistency: Using NumPy arrays can help maintain consistency in data handling, especially when features are also stored as NumPy arrays.


In [261]:
# Assign the target column and clean it up
import numpy as np
import re
target = df.Price
print (target[0:3])

print('\n')
target = target.apply(cleanup_target)
target = np.array (target)  # Convert the pandas series to a numpy array
print (target[0:3])
print(type (target))
print(type(target[0]))

0     9,999
1     9,990
2    11,999
Name: Price, dtype: object


[ 9999.  9990. 11999.]
<class 'numpy.ndarray'>
<class 'numpy.float64'>


In [262]:
# create the features (indepedent variables) dataframe
df_features = df.filter(['Rating','Spec_score','Inbuilt_memory','Processor','company'], axis=1)
df_features

Unnamed: 0,Rating,Spec_score,Inbuilt_memory,Processor,company
0,4.65,68,128 GB inbuilt,Octa Core Processor,Samsung
1,4.20,63,32 GB inbuilt,1.8 GHz Processor,Samsung
2,4.30,75,64 GB inbuilt,2 GHz Processor,Samsung
4,4.10,69,64 GB inbuilt,Octa Core,Samsung
5,4.40,75,128 GB inbuilt,Octa Core,Samsung
...,...,...,...,...,...
1365,4.05,75,64 GB inbuilt,Octa Core,TCL
1366,4.10,80,128 GB inbuilt,Octa Core,TCL
1367,4.00,80,256 GB inbuilt,Octa Core,TCL
1368,4.50,79,256 GB inbuilt,Octa Core,TCL


In [263]:
print (df.Inbuilt_memory[0:3])

df_features.Inbuilt_memory = df_features.Inbuilt_memory.apply(cleanup_inbuilt_memory)
df_features.Inbuilt_memory = np.array (df_features.Inbuilt_memory)

print('\n')
print(df_features.Inbuilt_memory[0:3])
print(type (df_features.Inbuilt_memory))
print(type (df_features.Inbuilt_memory[0]))

0     128 GB inbuilt
1      32 GB inbuilt
2      64 GB inbuilt
Name: Inbuilt_memory, dtype: object


0    128.0
1     32.0
2     64.0
Name: Inbuilt_memory, dtype: float64
<class 'pandas.core.series.Series'>
<class 'numpy.float64'>


In [264]:
print (df_features.company.value_counts())

df_features.company = df_features.company.apply(cleanup_company)

print('\n')
print (df_features.company.value_counts())

company
Samsung     149
Realme      126
Vivo        124
Motorola     72
Xiaomi       69
Poco         54
OnePlus      37
iQOO         23
Honor        21
TCL          20
OPPO         20
POCO         18
Huawei       18
Lava         13
Oppo         11
Google        9
itel          9
Asus          6
Lenovo        5
Tecno         4
LG            3
Itel          2
Nothing       1
Gionee        1
IQOO          1
Coolpad       1
Name: count, dtype: int64


company
Others      346
Samsung     149
Realme      126
Vivo        124
Motorola     72
Name: count, dtype: int64


In [265]:
print (df_features.Processor.value_counts())

df_features.Processor = df_features.Processor.apply(cleanup_processor)

print('\n')
print (df_features.Processor.value_counts())

Processor
Octa Core              770
Octa Core Processor     39
2 GHz Processor          2
Nine-Cores               2
1.8 GHz Processor        1
Quad Core                1
Deca Core Processor      1
2.3 GHz Processor        1
Name: count, dtype: int64


Processor
Octa Core              770
Octa Core Processor     39
Others                   8
Name: count, dtype: int64


#### What LabelEncoder Does

LabelEncoder is a utility from sklearn.preprocessing used to convert categorical text data into numerical labels. Each unique category in the data is assigned a unique integer value. This process is essential for machine learning algorithms that require numerical input rather than text.

- Fit: The fit method identifies the unique categories in the data and assigns a unique integer to each category.
- Transform: The transform method converts the categories into their respective integer labels based on the mapping created during the fit step.

In [266]:
from sklearn.preprocessing import LabelEncoder
print(df_features.Processor.value_counts())
le = LabelEncoder()
df_features.Processor=le.fit_transform(df_features.Processor)
print('\n')
print(df_features.Processor.value_counts())

Processor
Octa Core              770
Octa Core Processor     39
Others                   8
Name: count, dtype: int64


Processor
0    770
1     39
2      8
Name: count, dtype: int64


In [267]:
# convert the 'company' categorial column to a numeric column 'df_features_company_num'
df_features_company_num = pd.get_dummies (df_features.company, dtype=int,drop_first=True)
df_features_company_num

Unnamed: 0,Others,Realme,Samsung,Vivo
0,0,0,1,0
1,0,0,1,0
2,0,0,1,0
4,0,0,1,0
5,0,0,1,0
...,...,...,...,...
1365,1,0,0,0
1366,1,0,0,0
1367,1,0,0,0
1368,1,0,0,0


In [268]:
# drop the 'company' column from the dataframe
print (df_features)
df_features=df_features.drop('company',axis=1)
print (df_features)


      Rating  Spec_score  Inbuilt_memory  Processor  company
0       4.65          68           128.0          1  Samsung
1       4.20          63            32.0          2  Samsung
2       4.30          75            64.0          2  Samsung
4       4.10          69            64.0          0  Samsung
5       4.40          75           128.0          0  Samsung
...      ...         ...             ...        ...      ...
1365    4.05          75            64.0          0   Others
1366    4.10          80           128.0          0   Others
1367    4.00          80           256.0          0   Others
1368    4.50          79           256.0          0   Others
1369    4.65          93           256.0          0   Others

[817 rows x 5 columns]
      Rating  Spec_score  Inbuilt_memory  Processor
0       4.65          68           128.0          1
1       4.20          63            32.0          2
2       4.30          75            64.0          2
4       4.10          69            

In [269]:
# concatenate the 'df_features' and 'df_features_company_num' dataframes
df_features=pd.concat([df_features,df_features_company_num],axis=1)
print (df_features)

      Rating  Spec_score  Inbuilt_memory  Processor  Others  Realme  Samsung  \
0       4.65          68           128.0          1       0       0        1   
1       4.20          63            32.0          2       0       0        1   
2       4.30          75            64.0          2       0       0        1   
4       4.10          69            64.0          0       0       0        1   
5       4.40          75           128.0          0       0       0        1   
...      ...         ...             ...        ...     ...     ...      ...   
1365    4.05          75            64.0          0       1       0        0   
1366    4.10          80           128.0          0       1       0        0   
1367    4.00          80           256.0          0       1       0        0   
1368    4.50          79           256.0          0       1       0        0   
1369    4.65          93           256.0          0       1       0        0   

      Vivo  
0        0  
1        0  


## Part 3: Model Training and Evaluation for Linear Regression

In [289]:
# Assign the features and target to X and y, and print the first 3 rows of X and y
X = df_features
y = target
print ('X header:', X[0:3], sep='\n')
print('\n', "y header:")
print (y[0:3])

X header:
   Rating  Spec_score  Inbuilt_memory  Processor  Others  Realme  Samsung  \
0    4.65          68           128.0          1       0       0        1   
1    4.20          63            32.0          2       0       0        1   
2    4.30          75            64.0          2       0       0        1   

   Vivo  
0     0  
1     0  
2     0  

 y header:
[ 9999.  9990. 11999.]


In [291]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.2, random_state=0)

In [295]:
# Standardize the training and test feature data by fitting the StandardScaler on the training data and then applying the transformation to both the training and test datasets
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train_scaled = sc.fit_transform (X_train)
X_test_scaled = sc.transform (X_test)
print (X_train_scaled)
print (X_test_scaled)

[[ 0.97353535 -0.18314954 -1.00358876 ...  2.18603775 -0.44268578
  -0.43025338]
 [-1.44866916  0.50816059 -0.24994142 ... -0.45744864  2.25893863
  -0.43025338]
 [ 0.09273371 -0.04488752 -0.24994142 ... -0.45744864 -0.44268578
  -0.43025338]
 ...
 [ 0.53313453  0.64642261 -0.24994142 ... -0.45744864 -0.44268578
  -0.43025338]
 [-1.22846875  0.50816059 -0.24994142 ...  2.18603775 -0.44268578
  -0.43025338]
 [ 0.09273371 -2.11881791 -1.00358876 ... -0.45744864 -0.44268578
  -0.43025338]]
[[ 0.31293412 -1.42750778 -1.00358876 ... -0.45744864 -0.44268578
  -0.43025338]
 [ 1.41393617  0.36989856 -0.24994142 ...  2.18603775 -0.44268578
  -0.43025338]
 [-0.1274667   1.19947072  1.25735324 ... -0.45744864  2.25893863
  -0.43025338]
 ...
 [-0.78806793 -0.18314954 -0.24994142 ... -0.45744864 -0.44268578
  -0.43025338]
 [-1.22846875 -0.87445967 -0.24994142 ... -0.45744864 -0.44268578
   2.32421186]
 [ 0.31293412 -0.18314954 -0.24994142 ... -0.45744864  2.25893863
  -0.43025338]]


In [329]:
# Train the model using LinearRegression algorithm
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit (X_train_scaled,y_train)

In [330]:
# Predict the target for the standardized test data using the trained Linear Regression model
y_pred = clf.predict(X_test_scaled)

# y_pred = np.maximum(0, y_pred) # This only corrects the output after the prediction, but it doesn't address the underlying issue with the model training and feature engineering.

print (y_pred[0:10])  # print the first 10 predicted values
print (y_test[0:10])  # print the first 10 actual values

[ -3115.80525716  23008.14142159  61569.39414317  12257.83430419
  43616.33479108  22445.33905461  44803.26725534  25926.64659115
  34023.7817411  -10543.67763276]
[13990. 19990. 64999.  8999. 29999. 14990. 24999. 12990. 29990.  6999.]


## MAPE

MAPE, Mean Absolute Percentage Error, is the average absolute percentage difference between actual and predicted values.

#### Interpretation of MAPE:
- MAPE = 0: Perfect model with no prediction error.
- MAPE < 0.1 (10%): Very good prediction accuracy.
- MAPE between 0.1 and 0.2 (10% to 20%): Good prediction accuracy.
- MAPE between 0.2 and 0.5 (20% to 50%): Reasonable prediction accuracy.
- MAPE > 0.5 (50%): Poor prediction accuracy, indicating that the model's predictions are quite far off from the actual values.

In [341]:
# Evaluate the accuracy of the model using the MAPE metric

from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f'MAPE: {mape}')

MAPE: 0.6614598159990303


In [333]:
clf.intercept_

np.float64(27111.808575803996)

In [334]:
clf.n_features_in_

8

In [335]:
clf.coef_

array([ -715.96343634, 14285.98822703,  8084.07676856,  3336.18463062,
        1237.24344377,  -890.2702836 ,  4096.75874812,  1784.07130422])

## Part 4a: Linear Regression Mobile Price Prediction

In [355]:
# Use standardscaler to transform input data for prediction

# Revised format of feature data (X), header and first row:
#    Rating  Spec_score  Inbuilt_memory  Processor  Others  Realme  Samsung  Vivo
# 0    4.65          68           128.0          1       0       0        1   0

sc_query_lr=sc.transform ([[4.0,60,256,3,0,1,0, 0]])
clf.predict(sc_query_lr)

array([30698.54099888])

## Part 3b: Model Training and Evaluation for Random Forest

In [353]:
# Train the model using Random Forest Regressor algorithm
# Assuming the previous cells have already defined the data preprocessing steps and variables

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error

# Instantiate the RandomForestRegressor
rf_model = RandomForestRegressor(random_state=42)

# Train the model
rf_model.fit(X_train_scaled, y_train)

# Prediction
y_pred_rf = rf_model.predict(X_test_scaled)

# Ensure no negative predictions
# y_pred_rf = np.maximum(0, y_pred_rf)

# Calculate MAPE
mape_rf = mean_absolute_percentage_error(y_test, y_pred_rf)
print(f'RandomForestRegressor MAPE: {mape_rf}')
print('Predicted Prices (RandomForest):', '\n', y_pred_rf[:10])
print('\n')
print ('Actual Prices:', '\n', y_test[0:10])


RandomForestRegressor MAPE: 0.3030318654701749
Predicted Prices (RandomForest): 
 [ 9686.52833333 16571.6745     34006.55333333 11280.13
 26991.102      12097.69928571 36178.36833333 17403.505
 24481.4325      8162.72333333]


Actual Prices: 
 [13990. 19990. 64999.  8999. 29999. 14990. 24999. 12990. 29990.  6999.]


## Part 4b: Random Forest Regressor Mobile Price Prediction

In [356]:
# Use standardscaler to transform input data for prediction using RandomForestRegressor

sc_query_rf=sc.transform ([[4.0,60,256,3,0,1,0, 0]])
rf_model.predict (sc_query_rf)

array([9398.235])