## Supervised learning using Regression

## Predicting Price

## Objectives

On completing this assignment, you will learn how to write a simple AI application involving supervised learning using regression.

## Description

Write an AI application that, given the attributes of an Android cell phone, predicts its price. For training and testing the application, use the labeled data set provided in the file k_mobilephonepriceprediction.csv. The data set describes 1370 Android cell phones using 17 columns, including the price. Use 80% of the data items for training and the remaining 20% for testing. Train sklearn's Linear Regression (LinearRegression) model. After the model is trained, test it on the test data and report the Mean Absolute Percentage Error (MAPE), computed with sklearn.metrics' mean_absolute_percentage_error, to reflect its performance. Also report the trained model's coefficient (coef_) and intercept (intercept_) values.

#### Regressor models to be used

Try out the following regression models of sklearn's library and compare their performance using Mean Absolute Percentage Error (MAPE) values.

- Linear Regressor (LinearRegression) from sklearn.linear_model
- KNeighbors Regressor (KNeighborsRegressor) from sklearn.neighbors using n_neighbors=5
- Support Vector Regressor (SVR) from sklearn.svm
- Random Forest Regressor (RandomForestRegressor) from sklearn.ensemble 

Also try out made-up attribute values for a few cell phones with the best-performing model from the above list, and report the attribute values used and the predicted prices obtained.
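The comparison described above could be sketched roughly as follows. This is a minimal sketch, not the assignment's solution: it uses synthetic data from make_regression as a stand-in for the cleaned phone features (an assumption), shifts the target so every value is positive, and the names X_demo, y_demo, and mape_by_model are hypothetical. Each model is wrapped in a pipeline with StandardScaler so that the distance-based models (KNeighborsRegressor, SVR) see comparable feature scales.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the cleaned features and prices (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
y_demo = y_demo - y_demo.min() + 5000.0  # shift so every "price" is positive

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

# Fit each model on the training split and score it on the test split.
mape_by_model = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    mape_by_model[name] = mean_absolute_percentage_error(y_te, pipe.predict(X_te))

# Lower MAPE is better.
best_name = min(mape_by_model, key=mape_by_model.get)
for name, score in sorted(mape_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAPE = {score:.3f}")
print("Best model:", best_name)
```

The ranking on synthetic data says nothing about the ranking on the phone data; only the structure of the comparison carries over.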

## Implementation 

#### Preprocessing

- Remove rows containing missing or null values
- Remove duplicate rows
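The two preprocessing steps above map onto pandas' dropna and drop_duplicates. The sketch below uses a made-up four-row frame (an assumption, not the real data) and also shows a possible variation: passing subset= to dropna so that only the columns actually used decide which rows are dropped.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 lacks a Rating, row 2 lacks fast_charging, row 3 duplicates row 0.
toy = pd.DataFrame({
    "Rating": [4.5, np.nan, 4.1, 4.5],
    "Price": ["9,999", "8,999", "7,499", "9,999"],
    "fast_charging": ["25W", "15W", None, "25W"],
})

# Drop a row if ANY column is missing, then drop exact duplicate rows.
all_clean = toy.dropna().drop_duplicates()

# Variation: only require the columns that will be used as features or target.
used_cols = ["Rating", "Price"]
subset_clean = toy.dropna(subset=used_cols).drop_duplicates()

print(len(all_clean), len(subset_clean))
```

Dropping on all columns is the stricter choice and discards rows whose only gap is in a column that is never used.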

#### Columns Used

Although the provided data set has 17 columns including the price, use only the following columns (including the price):

Rating, Spec_score, Inbuilt_memory, Processor, company, Price

#### Column Cleaning

- Rating and Spec_score column values are already numerical.
  
- Inbuilt_memory values need cleaning because they are strings of the form 64 GB inbuilt. We need to remove the unit text (GB) and convert the values to float.
  
- Processor values are strings such as 'Octa Core Processor' and '1.8 GHz Processor', and they seem to be of ordinal type. So, we need to convert them into numerical values using sklearn.preprocessing's label encoder (LabelEncoder).
  
- Company values contain a large number of different company names. We need to keep the top 4 names and change the remaining ones into "Others". Company names are strings of nominal type. So, we need to convert them into numeric columns using pandas' get_dummies function (one-hot encoding).

- Price values need cleaning because they are given as strings with commas in them, such as 9,999. So, we need to remove the commas and convert the values to float.
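The string cleaning described in this list can also be done with pandas' vectorized string methods instead of per-item functions. The sketch below is one possible approach on made-up values; treating 1 TB as 1000 GB is an assumption, and the names raw, price, and memory_gb are hypothetical.

```python
import pandas as pd

# Made-up raw values in the same style as the data set.
raw = pd.DataFrame({
    "Inbuilt_memory": ["128 GB inbuilt", "1 TB inbuilt", "64 GB inbuilt"],
    "Price": ["9,999", "1,24,999", "11,999"],
})

# Price: remove the commas, then convert to float.
price = raw["Price"].str.replace(",", "", regex=False).astype(float)

# Inbuilt_memory: pull out the number and its unit, then express everything in GB.
parts = raw["Inbuilt_memory"].str.extract(r"(?P<size>[\d.]+)\s*(?P<unit>GB|TB)")
memory_gb = parts["size"].astype(float) * parts["unit"].map({"GB": 1.0, "TB": 1000.0})

print(price.tolist())
print(memory_gb.tolist())
```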

## Discussion

#### Column Data Types

Data values are either quantitative or qualitative.

#### Quantitative (Numerical) Values

All quantitative (numerical) values can be shown along a number line, and we can perform mathematical operations (+, -, *, /) upon them. These values could be either discrete or continuous.

##### Discrete Values
Discrete numerical values exist along a number line within a range, but some values within the range are always excluded. For example, int or whole-number values are discrete numerical values because they don't include the fractional or decimal values within the range.

##### Continuous Values

Continuous numerical values also exist along a number line within a range but they do not have any values that are excluded within the range. For example, float or decimal number values within the range are continuous numerical values because they don't exclude any value within the range.

For example, shoe sizes are discrete values because there are no shoe sizes of 8.1, 8.2, etc. However, foot sizes are considered continuous values because a foot can measure 8.1, 8.2, etc.

#### Regressors versus Classifiers

In our supervised learning problems, if the target (label) values are continuous such as prices (decimal values) then we use regressors to solve them. Otherwise we use classifiers to solve them.
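To make the distinction concrete, the sketch below fits a regressor and a classifier on the same made-up feature. The spec scores, prices, and tier labels are invented for illustration, and KNeighbors models are used only because their predictions are easy to verify by hand.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up data: one feature (a spec score), one continuous target, one categorical target.
specs = [[60], [65], [70], [80], [85], [90]]
prices = [8000.0, 9500.0, 12000.0, 25000.0, 30000.0, 42000.0]
tiers = ["budget", "budget", "budget", "premium", "premium", "premium"]

reg = KNeighborsRegressor(n_neighbors=2).fit(specs, prices)
cls = KNeighborsClassifier(n_neighbors=2).fit(specs, tiers)

# The two nearest neighbours of 82 are the phones with spec scores 80 and 85.
predicted_price = reg.predict([[82]])[0]  # mean of 25000 and 30000
predicted_tier = cls.predict([[82]])[0]   # majority label of the two neighbours

print(predicted_price, predicted_tier)
```

The regressor can return a value that never appeared in training, while the classifier can only return one of the labels it has seen.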


#### Qualitative (Categorical) (Non-numerical) Values

Qualitative (categorical, non-numerical) values cannot be shown along a number line, and we cannot perform mathematical operations (+, -, *, /) upon them. These values could be either nominal or ordinal.

##### Nominal Values
 
When data values are just names without any ranking or order to them, they are considered nominal values. For example, if a hair-color column contains values such as black, brown, red, etc., these values are considered nominal values.

##### Ordinal Values

When data values are names but there is an implied ranking or order to them, they are considered ordinal values. For example, if a job satisfaction column contains values such as unsatisfied, satisfied, very satisfied, etc., then these values are considered ordinal values.

#### Implementing nominal and ordinal values

Both nominal and ordinal value columns are converted to numerical values. For converting nominal values, we use pandas' get_dummies function. It creates a separate column for each unique name value. So, in our hair color example above, it will create a column for each color and assign 0 or 1 indicating the absence or presence of the color.

On the other hand, for an ordinal values column, we use sklearn.preprocessing module's label encoder (LabelEncoder). It does not create any new columns. Instead, it substitutes the values 0, 1, 2, 3, etc. for the name values. Note that LabelEncoder assigns these codes in sorted (alphabetical) order of the names, which does not necessarily match their intended ranking.
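The hair-color and job-satisfaction examples above can be tried directly. The sketch below uses a made-up four-row frame. It also includes sklearn's OrdinalEncoder with an explicit category order as one possible way to preserve a ranking; that is an addition for comparison, not something the assignment requires.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

demo = pd.DataFrame({
    "hair_color": ["black", "brown", "red", "brown"],
    "satisfaction": ["satisfied", "unsatisfied", "very satisfied", "satisfied"],
})

# Nominal: one 0/1 column per unique color.
hair_dummies = pd.get_dummies(demo["hair_color"], dtype=int)

# LabelEncoder: codes follow the sorted order of the names
# ('satisfied' < 'unsatisfied' < 'very satisfied'), not their rank.
label_codes = LabelEncoder().fit_transform(demo["satisfaction"])

# OrdinalEncoder with an explicit order: codes follow the stated rank.
rank_order = ["unsatisfied", "satisfied", "very satisfied"]
ordinal_codes = (
    OrdinalEncoder(categories=[rank_order])
    .fit_transform(demo[["satisfaction"]])
    .ravel()
)

print(hair_dummies)
print(label_codes)    # [0 1 2 0]
print(ordinal_codes)  # [1. 0. 2. 1.]
```

Here 'satisfied' receives code 0 from LabelEncoder even though it ranks above 'unsatisfied', which is why the explicit order matters when the ranking should carry meaning.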


## Implementation Notes


#### Dataset source

The data set was downloaded fity-data-determining-factors


## Submittal

The uploaded submittal should contain the following:

- ipynb file after running the application from start to finish, containing the marked source code, output, and your interaction.
  
- the corresponding html file.

## Coding

Follow the steps below.

In [1932]:
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
#warnings.filterwarnings('ignore', category=UserWarning)

df=pd.read_csv('k_mobilephonepriceprediction.csv',index_col=0)

print(df.shape)
df


(1370, 17)


Unnamed: 0,Name,Rating,Spec_score,No_of_sim,Ram,Battery,Display,Camera,External_Memory,Android_version,Price,company,Inbuilt_memory,fast_charging,Screen_resolution,Processor,Processor_name
0,Samsung Galaxy F14 5G,4.65,68,"Dual Sim, 3G, 4G, 5G, VoLTE,",4 GB RAM,6000 mAh Battery,6.6 inches,50 MP + 2 MP Dual Rear & 13 MP Front Camera,"Memory Card Supported, upto 1 TB",13,9999,Samsung,128 GB inbuilt,25W Fast Charging,2408 x 1080 px Display with Water Drop Notch,Octa Core Processor,Exynos 1330
1,Samsung Galaxy A11,4.20,63,"Dual Sim, 3G, 4G, VoLTE,",2 GB RAM,4000 mAh Battery,6.4 inches,13 MP + 5 MP + 2 MP Triple Rear & 8 MP Fro...,"Memory Card Supported, upto 512 GB",10,9990,Samsung,32 GB inbuilt,15W Fast Charging,720 x 1560 px Display with Punch Hole,1.8 GHz Processor,Octa Core
2,Samsung Galaxy A13,4.30,75,"Dual Sim, 3G, 4G, VoLTE,",4 GB RAM,5000 mAh Battery,6.6 inches,50 MP Quad Rear & 8 MP Front Camera,"Memory Card Supported, upto 1 TB",12,11999,Samsung,64 GB inbuilt,25W Fast Charging,1080 x 2408 px Display with Water Drop Notch,2 GHz Processor,Octa Core
3,Samsung Galaxy F23,4.10,73,"Dual Sim, 3G, 4G, VoLTE,",4 GB RAM,6000 mAh Battery,6.4 inches,48 MP Quad Rear & 13 MP Front Camera,"Memory Card Supported, upto 1 TB",12,11999,Samsung,64 GB inbuilt,,720 x 1600 px,Octa Core,Helio G88
4,Samsung Galaxy A03s (4GB RAM + 64GB),4.10,69,"Dual Sim, 3G, 4G, VoLTE,",4 GB RAM,5000 mAh Battery,6.5 inches,13 MP + 2 MP + 2 MP Triple Rear & 5 MP Fro...,"Memory Card Supported, upto 1 TB",11,11999,Samsung,64 GB inbuilt,15W Fast Charging,720 x 1600 px Display with Water Drop Notch,Octa Core,Helio P35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1365,TCL 40R,4.05,75,"Dual Sim, 3G, 4G, 5G, VoLTE,",4 GB RAM,5000 mAh Battery,6.6 inches,50 MP + 2 MP + 2 MP Triple Rear & 8 MP Fro...,Memory Card (Hybrid),12,18999,TCL,64 GB inbuilt,15W Fast Charging,720 x 1612 px,Octa Core,Dimensity 700 5G
1366,TCL 50 XL NxtPaper 5G,4.10,80,"Dual Sim, 3G, 4G, VoLTE,",8 GB RAM,5000 mAh Battery,6.8 inches,50 MP + 2 MP Dual Rear & 16 MP Front Camera,Memory Card (Hybrid),14,24990,TCL,128 GB inbuilt,33W Fast Charging,1200 x 2400 px,Octa Core,Dimensity 7050
1367,TCL 50 XE NxtPaper 5G,4.00,80,"Dual Sim, 3G, 4G, 5G, VoLTE,",6 GB RAM,5000 mAh Battery,6.6 inches,50 MP + 2 MP Dual Rear & 16 MP Front Camera,"Memory Card Supported, upto 1 TB",13,23990,TCL,256 GB inbuilt,18W Fast Charging,720 x 1612 px,Octa Core,Dimensity 6080
1368,TCL 40 NxtPaper 5G,4.50,79,"Dual Sim, 3G, 4G, 5G, VoLTE,",6 GB RAM,5000 mAh Battery,6.6 inches,50 MP + 2 MP + 2 MP Triple Rear & 8 MP Fro...,"Memory Card Supported, upto 1 TB",13,22499,TCL,256 GB inbuilt,15W Fast Charging,720 x 1612 px,Octa Core,Dimensity 6020


In [1933]:
df.isna().sum().sum()

581

In [1934]:
df=df.dropna()
df.isna().sum().sum()

0

In [1935]:
df.duplicated().sum()

0

In [1936]:
df.Processor.value_counts()

Processor
 Octa Core              770
 Octa Core Processor     39
 2 GHz Processor          2
 Nine-Cores               2
 1.8 GHz Processor        1
 Quad Core                1
 Deca Core Processor      1
 2.3 GHz Processor        1
Name: count, dtype: int64

In [1937]:
df.fast_charging.value_counts()
df.No_of_sim.value_counts()

No_of_sim
Dual Sim, 3G, 4G, 5G, VoLTE,           442
Dual Sim, 3G, 4G, VoLTE,               318
Dual Sim, 3G, 4G, 5G, VoLTE, Vo5G,      44
Single Sim, 3G, 4G, 5G, VoLTE,           7
Single Sim, 3G, 4G, VoLTE,               3
Dual Sim, 3G, 4G,                        2
No Sim Supported,                        1
Name: count, dtype: int64

In [1939]:
df.company.value_counts()

company
Samsung     149
Realme      126
Vivo        124
Motorola     72
Xiaomi       69
Poco         54
OnePlus      37
iQOO         23
Honor        21
OPPO         20
TCL          20
Huawei       18
POCO         18
Lava         13
Oppo         11
Google        9
itel          9
Asus          6
Lenovo        5
Tecno         4
LG            3
Itel          2
IQOO          1
Nothing       1
Gionee        1
Coolpad       1
Name: count, dtype: int64
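The counts above list some names under more than one casing (Poco and POCO, OPPO and Oppo, iQOO and IQOO, itel and Itel). If these are meant to be the same company, which is an assumption, the names could be normalized before picking the most frequent ones. The sketch below shows one way on made-up values; keep_top_n is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def keep_top_n(series, n, other="Others"):
    """Keep the n most frequent values and replace the rest with `other`."""
    top_values = series.value_counts().nlargest(n).index
    return series.where(series.isin(top_values), other)

# Made-up company names with inconsistent casing.
companies = pd.Series(["Samsung"] * 3 + ["Poco", "POCO"] + ["Vivo"])

# Normalize whitespace and casing so 'Poco' and 'POCO' count as one name.
normalized = companies.str.strip().str.casefold()

grouped = keep_top_n(normalized, n=2)
print(grouped.value_counts())
```

The notebook cells below do not apply this normalization, so their counts and model results reflect the names exactly as they appear in the file.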

In [1940]:
import re

def cleanup_target(item):
    # Strip thousands separators and whitespace, e.g. '9,999' -> 9999.0
    item = re.sub(r'[,\s]+', '', item)
    return float(item)

def cleanup_company(item):
    # Keep the three most frequent companies and group the rest as "Others".
    # Note: the description asks for the top 4; Motorola (72 rows) would be the fourth.
    if item in ('Samsung', 'Realme', 'Vivo'):
        return item
    return "Others"

def cleanup_processor(item):
    # Not used in the cells below. The value_counts output above suggests the
    # Processor strings carry leading whitespace, so strip before comparing.
    if item.strip() in ('Octa Core', 'Octa Core Processor'):
        return item
    return "Others"

def cleanup_feature(item):
    # Convert strings such as '128 GB inbuilt' or '1 TB inbuilt' to a size in GB.
    # 1 TB is treated as 1000 GB.
    scale = 1000 if 'TB' in item else 1
    item = re.sub(r'TB|GB|inbuilt|\s+', '', item)
    return scale * float(item)



In [1941]:
import numpy as np
import re
target = df.Price
print (target[0:3])
target = target.apply(cleanup_target)
target = np.array (target)
print (target[0:3])
print(type (target))
print(type(target[0]))

0     9,999
1     9,990
2    11,999
Name: Price, dtype: object
[ 9999.  9990. 11999.]
<class 'numpy.ndarray'>
<class 'numpy.float64'>


In [1942]:
df_features = df.filter(['Rating','Spec_score','Inbuilt_memory','Processor','company'], axis=1)
df_features

Unnamed: 0,Rating,Spec_score,Inbuilt_memory,Processor,company
0,4.65,68,128 GB inbuilt,Octa Core Processor,Samsung
1,4.20,63,32 GB inbuilt,1.8 GHz Processor,Samsung
2,4.30,75,64 GB inbuilt,2 GHz Processor,Samsung
4,4.10,69,64 GB inbuilt,Octa Core,Samsung
5,4.40,75,128 GB inbuilt,Octa Core,Samsung
...,...,...,...,...,...
1365,4.05,75,64 GB inbuilt,Octa Core,TCL
1366,4.10,80,128 GB inbuilt,Octa Core,TCL
1367,4.00,80,256 GB inbuilt,Octa Core,TCL
1368,4.50,79,256 GB inbuilt,Octa Core,TCL


In [1943]:
print (df.Inbuilt_memory[0:3])

df_features.Inbuilt_memory = df_features.Inbuilt_memory.apply(cleanup_feature)
df_features.Inbuilt_memory = np.array (df_features.Inbuilt_memory)

print (df_features.Inbuilt_memory[0:3])
print(type (df_features.Inbuilt_memory))
print(type (df_features.Inbuilt_memory[0]))

0     128 GB inbuilt
1      32 GB inbuilt
2      64 GB inbuilt
Name: Inbuilt_memory, dtype: object
0    128.0
1     32.0
2     64.0
Name: Inbuilt_memory, dtype: float64
<class 'pandas.core.series.Series'>
<class 'numpy.float64'>


In [1944]:
print (df_features.company.value_counts())

df_features.company = df_features.company.apply(cleanup_company)

print (df_features.company.value_counts())

company
Samsung     149
Realme      126
Vivo        124
Motorola     72
Xiaomi       69
Poco         54
OnePlus      37
iQOO         23
Honor        21
OPPO         20
TCL          20
Huawei       18
POCO         18
Lava         13
Oppo         11
Google        9
itel          9
Asus          6
Lenovo        5
Tecno         4
LG            3
Itel          2
IQOO          1
Nothing       1
Gionee        1
Coolpad       1
Name: count, dtype: int64
company
Others     418
Samsung    149
Realme     126
Vivo       124
Name: count, dtype: int64


In [1945]:
from sklearn.preprocessing import LabelEncoder
print(df_features.Processor.value_counts())
le = LabelEncoder()
df_features.Processor=le.fit_transform(df_features.Processor)
print(df_features.Processor.value_counts())

Processor
 Octa Core              770
 Octa Core Processor     39
 2 GHz Processor          2
 Nine-Cores               2
 1.8 GHz Processor        1
 Quad Core                1
 Deca Core Processor      1
 2.3 GHz Processor        1
Name: count, dtype: int64
Processor
5    770
6     39
1      2
4      2
0      1
7      1
3      1
2      1
Name: count, dtype: int64
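To see which code each processor label received, the fitted encoder's classes_ attribute can be inspected. The sketch below refits a LabelEncoder on the eight labels from the output above, with surrounding whitespace stripped (an assumption about the raw strings); code_by_label is a hypothetical name.

```python
from sklearn.preprocessing import LabelEncoder

# Labels copied from the value_counts output above, whitespace stripped.
processor_labels = [
    "Octa Core", "Octa Core Processor", "2 GHz Processor", "Nine-Cores",
    "1.8 GHz Processor", "Quad Core", "Deca Core Processor", "2.3 GHz Processor",
]

encoder = LabelEncoder().fit(processor_labels)

# classes_ holds the labels in sorted order; a label's position is its code.
code_by_label = {label: code for code, label in enumerate(encoder.classes_)}
for label, code in code_by_label.items():
    print(code, label)
```

The codes follow alphabetical order, so 'Quad Core' receives the highest code (7) and 'Octa Core' receives 5. The numeric order therefore does not reflect core count or clock speed.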


In [1946]:
import pandas as pd
df_features_company_num = pd.get_dummies (df_features.company, dtype=int,drop_first=True)
df_features_company_num

Unnamed: 0,Realme,Samsung,Vivo
0,0,1,0
1,0,1,0
2,0,1,0
4,0,1,0
5,0,1,0
...,...,...,...
1365,0,0,0
1366,0,0,0
1367,0,0,0
1368,0,0,0


In [1947]:
import pandas as pd
print (df_features)
df_features=df_features.drop('company',axis=1)
print (df_features)


      Rating  Spec_score  Inbuilt_memory  Processor  company
0       4.65          68           128.0          6  Samsung
1       4.20          63            32.0          0  Samsung
2       4.30          75            64.0          1  Samsung
4       4.10          69            64.0          5  Samsung
5       4.40          75           128.0          5  Samsung
...      ...         ...             ...        ...      ...
1365    4.05          75            64.0          5   Others
1366    4.10          80           128.0          5   Others
1367    4.00          80           256.0          5   Others
1368    4.50          79           256.0          5   Others
1369    4.65          93           256.0          5   Others

[817 rows x 5 columns]
      Rating  Spec_score  Inbuilt_memory  Processor
0       4.65          68           128.0          6
1       4.20          63            32.0          0
2       4.30          75            64.0          1
4       4.10          69            

In [1948]:
df_features=pd.concat([df_features,df_features_company_num],axis=1)
print (df_features)

      Rating  Spec_score  Inbuilt_memory  Processor  Realme  Samsung  Vivo
0       4.65          68           128.0          6       0        1     0
1       4.20          63            32.0          0       0        1     0
2       4.30          75            64.0          1       0        1     0
4       4.10          69            64.0          5       0        1     0
5       4.40          75           128.0          5       0        1     0
...      ...         ...             ...        ...     ...      ...   ...
1365    4.05          75            64.0          5       0        0     0
1366    4.10          80           128.0          5       0        0     0
1367    4.00          80           256.0          5       0        0     0
1368    4.50          79           256.0          5       0        0     0
1369    4.65          93           256.0          5       0        0     0

[817 rows x 7 columns]


In [1949]:
X = df_features
y = target
print (X[0:3])
print (y[0:3])

   Rating  Spec_score  Inbuilt_memory  Processor  Realme  Samsung  Vivo
0    4.65          68           128.0          6       0        1     0
1    4.20          63            32.0          0       0        1     0
2    4.30          75            64.0          1       0        1     0
[ 9999.  9990. 11999.]


In [1950]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.2, random_state=0)

In [1951]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train_scaled = sc.fit_transform (X_train)
X_test_scaled = sc.transform (X_test)
print (X_train_scaled)
print (X_test_scaled)


[[ 0.97353535 -0.18314954 -0.85278535 ...  2.18603775 -0.44268578
  -0.43025338]
 [-1.44866916  0.50816059 -0.2552866  ... -0.45744864  2.25893863
  -0.43025338]
 [ 0.09273371 -0.04488752 -0.2552866  ... -0.45744864 -0.44268578
  -0.43025338]
 ...
 [ 0.53313453  0.64642261 -0.2552866  ... -0.45744864 -0.44268578
  -0.43025338]
 [-1.22846875  0.50816059 -0.2552866  ...  2.18603775 -0.44268578
  -0.43025338]
 [ 0.09273371 -2.11881791 -0.85278535 ... -0.45744864 -0.44268578
  -0.43025338]]
[[ 0.31293412 -1.42750778 -0.85278535 ... -0.45744864 -0.44268578
  -0.43025338]
 [ 1.41393617  0.36989856 -0.2552866  ...  2.18603775 -0.44268578
  -0.43025338]
 [-0.1274667   1.19947072  0.9397109  ... -0.45744864  2.25893863
  -0.43025338]
 ...
 [-0.78806793 -0.18314954 -0.2552866  ... -0.45744864 -0.44268578
  -0.43025338]
 [-1.22846875 -0.87445967 -0.2552866  ... -0.45744864 -0.44268578
   2.32421186]
 [ 0.31293412 -0.18314954 -0.2552866  ... -0.45744864  2.25893863
  -0.43025338]]
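StandardScaler replaces each value x with (x - mean) / std, where the mean and standard deviation are taken from the data passed to fit. In the cell above they come from X_train only, and the same statistics are then reused for X_test. The sketch below checks this by hand on made-up memory sizes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (memory sizes in GB).
train = np.array([[32.0], [64.0], [128.0], [256.0]])
test = np.array([[512.0]])

scaler = StandardScaler().fit(train)  # statistics come from the training data only

# Manual standardization of the test row using the training mean and std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler.transform(test), manual)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.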


In [1952]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit (X_train_scaled,y_train)

In [1953]:
y_pred = clf.predict(X_test_scaled)
print (y_pred[0:10])
print (y_test[0:10])

[-1287.70825995 19706.13054316 60463.29142808 15375.82336102
 39909.21727923 23155.47888089 40364.6427592  25594.01260816
 31058.93274484 -6624.51562351]
[13990. 19990. 64999.  8999. 29999. 14990. 24999. 12990. 29990.  6999.]


In [1954]:
from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_test, y_pred)

0.6465496482642918
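MAPE is the mean of |actual - predicted| / |actual| over the test items. sklearn returns it as a fraction, so the value above corresponds to roughly 64.65%. The sketch below recomputes the metric by hand on the ten prices and predictions printed earlier (rounded to two decimals), so its result covers only those ten items and will differ from the full test-set value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# First ten test prices and predictions from the output above (predictions rounded).
y_true_demo = np.array([13990., 19990., 64999., 8999., 29999.,
                        14990., 24999., 12990., 29990., 6999.])
y_pred_demo = np.array([-1287.71, 19706.13, 60463.29, 15375.82, 39909.22,
                        23155.48, 40364.64, 25594.01, 31058.93, -6624.52])

# Mean of the absolute errors relative to the actual prices.
manual_mape = np.mean(np.abs(y_true_demo - y_pred_demo) / np.abs(y_true_demo))
sklearn_mape = mean_absolute_percentage_error(y_true_demo, y_pred_demo)

print(f"MAPE as a fraction: {manual_mape:.4f}")
print(f"MAPE as a percentage: {100 * manual_mape:.2f}%")
```

Two of the ten predictions are negative prices, which a plain linear model does not rule out and which contribute errors larger than 100% for those items.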

In [1955]:
clf.intercept_

27111.808575803992

In [1956]:
clf.n_features_in_

7

In [1957]:
clf.coef_

array([ -939.01236785, 10771.43207988, 13634.34235879,  1683.46211377,
       -2072.81166606,  3183.87335482,  1019.64024404])
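The coefficients above follow the column order of X (Rating, Spec_score, Inbuilt_memory, Processor, Realme, Samsung, Vivo). Because the model was trained on standardized features, each one describes the price change per one standard deviation of its feature. The sketch below shows how such coefficients could be converted back to original units on synthetic data (an assumption; X_demo, y_demo, and the other names are hypothetical), and checks the conversion against a model fitted on the unscaled features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0])

scaler = StandardScaler().fit(X_demo)
model_scaled = LinearRegression().fit(scaler.transform(X_demo), y_demo)

# Undo the scaling: divide each coefficient by its feature's standard deviation,
# then shift the intercept by the contribution of the feature means.
coef_original_units = model_scaled.coef_ / scaler.scale_
intercept_original = model_scaled.intercept_ - np.sum(coef_original_units * scaler.mean_)

# Reference: the same model fitted directly on the unscaled features.
model_raw = LinearRegression().fit(X_demo, y_demo)

print(coef_original_units, model_raw.coef_)
```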

In [1958]:
sc_q=sc.transform ([[4.0,60,256,3,0,1,0]])
clf.predict (sc_q)



array([9205.45151962])
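When trying several made-up phones, it can help to keep the attribute values in a DataFrame with named columns and to bundle the scaler and model into one pipeline, so the attribute values and predicted prices can be reported side by side. The sketch below is one way to do that. The training rows and prices are invented for illustration, so its predictions are not comparable to the result above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

columns = ["Rating", "Spec_score", "Inbuilt_memory", "Processor",
           "Realme", "Samsung", "Vivo"]

# Made-up training rows and prices (illustration only, not the real data).
train_X = pd.DataFrame([
    [4.6, 68, 128.0, 6, 0, 1, 0],
    [4.2, 63, 32.0, 0, 0, 1, 0],
    [4.3, 75, 64.0, 1, 1, 0, 0],
    [4.1, 80, 128.0, 5, 0, 0, 1],
    [4.5, 79, 256.0, 5, 0, 0, 0],
    [4.0, 85, 256.0, 5, 0, 0, 0],
    [4.4, 70, 64.0, 5, 1, 0, 0],
    [4.7, 90, 512.0, 5, 0, 0, 1],
], columns=columns)
train_y = [10000.0, 8000.0, 12000.0, 20000.0, 23000.0, 30000.0, 11000.0, 45000.0]

# The pipeline scales the features and then fits the regressor.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(train_X, train_y)

# Made-up phones to try, using the same column names as the training data.
phones = pd.DataFrame([
    [4.0, 60, 256.0, 3, 0, 1, 0],
    [4.5, 85, 128.0, 5, 0, 0, 1],
], columns=columns)
phones["predicted_price"] = pipe.predict(phones[columns])

print(phones)
```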