# Predict Mobile Phone Pricing
* Objective
Build a system that predicts the price range of a mobile phone from data on phones available in the market, classifying each phone as low, medium, high, or very high cost. Explore the data to understand the features and decide on an approach.
* Dataset
This dataset contains data on various mobile phones, their features, and pricing.
* Description of columns:
● battery_power: Battery Capacity in mAh
● blue: Has Bluetooth or not
● clock_speed: Processor speed
● dual_sim: Has dual sim support or not
● fc: Front camera megapixels
● four_g: Has 4G or not
● int_memory: Internal Memory in GB
● m_dep: Mobile depth in cm
● mobile_wt: Weight in gm
● n_cores: Processor Core Count
● pc: Primary Camera megapixels
● px_height: Pixel Resolution height
● px_width: Pixel Resolution width
● ram: RAM in MB
● sc_h: Mobile Screen height in cm
● sc_w: Mobile Screen width in cm
● talk_time: Time a single battery charge will last. In hours.
● three_g: Has 3G or not
● touch_screen: Has touch screen or not
● wifi: Has WiFi or not
● price_range: This is the target
○ 0 = low cost
○ 1 = medium cost
○ 2 = high cost
○ 3 = very high cost
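
The numeric target codes can be mapped back to readable labels when reporting results. The following is a minimal sketch on a small hand-made Series rather than the real `price_range` column; the name `PRICE_LABELS` is hypothetical and only used for illustration.

```python
import pandas as pd

# Hypothetical lookup table built from the target description above
PRICE_LABELS = {0: "low cost", 1: "medium cost", 2: "high cost", 3: "very high cost"}

# Toy stand-in for df['price_range'] (assumption, not the real data)
codes = pd.Series([1, 2, 3, 0, 0], name="price_range")

# Series.map replaces each code with its label from the dictionary
labels = codes.map(PRICE_LABELS)
print(labels.tolist())
```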

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Dataset/Unified mentors/Predict Mobile Phone Pricing.csv')

In [None]:
df.sample(10)

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
632,946,0,2.1,0,0,1,19,0.2,154,8,...,159,630,2104,7,4,16,1,1,0,1
1124,680,1,2.1,1,9,1,9,0.1,131,3,...,1428,1500,2438,14,2,17,1,0,0,2
411,1582,0,2.8,0,2,1,44,0.5,112,6,...,1486,1797,3890,17,10,10,1,1,1,3
679,675,0,2.3,0,10,0,60,0.9,144,5,...,192,757,1735,7,0,13,1,0,1,0
29,851,0,0.5,0,3,0,21,0.4,200,5,...,1171,1263,478,12,7,10,1,0,1,0
1811,537,1,2.0,0,1,1,55,0.3,103,7,...,1041,1430,2029,10,5,12,1,1,1,1
653,571,0,1.6,1,8,0,35,0.2,186,7,...,177,1282,2598,13,5,8,1,1,0,1
1192,1030,1,0.5,0,4,1,37,0.7,147,1,...,503,551,2800,8,6,12,1,0,1,2
1955,1515,1,2.1,1,4,1,24,0.9,176,5,...,747,1247,3104,6,5,20,1,0,0,3
1207,627,0,1.8,0,2,0,20,0.8,142,3,...,211,507,896,17,6,14,0,0,0,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

In [None]:
df.isnull().sum()

Unnamed: 0,0
battery_power,0
blue,0
clock_speed,0
dual_sim,0
fc,0
four_g,0
int_memory,0
m_dep,0
mobile_wt,0
n_cores,0


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score

In [None]:
# Split the data into training and testing sets
X = df.drop('price_range', axis=1)
y = df['price_range']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# 2000 rows in total: train = 1600 rows, test = 400 rows (20%)
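
The split above is not stratified, so the class proportions in the test set can drift from those in the training set. One possible variation, sketched here on synthetic data (the arrays `X_demo` and `y_demo` are assumptions, including the balanced 500-per-class layout), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 2000 rows, 4 balanced classes (assumption, not the real data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 3))
y_demo = np.repeat([0, 1, 2, 3], 500)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Per-class counts in the train and test splits
print(np.bincount(y_tr), np.bincount(y_te))
```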

In [None]:
df.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

In [None]:
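# Define the columns to preprocess.
# Note: 'pc' is not in this list, so the ColumnTransformer below
# (remainder='drop' by default) leaves it out of the model inputs.
numeric_columns = ['battery_power', 'clock_speed', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'blue', 'dual_sim', 'fc', 'four_g', 'three_g', 'touch_screen', 'wifi']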


In [None]:

# Define the preprocessing pipeline for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
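
To see what the two steps do, here is a small sketch on a toy single-column array with one missing value (the array and the name `demo_transformer` are assumptions). The null check earlier reports no missing values, so on this dataset the imputer is expected to act only as a safeguard.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing entry (assumption, not the real data)
toy = np.array([[1.0], [2.0], [np.nan], [9.0]])

demo_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Step 1 alone: the NaN is replaced by the median of the observed values
imputed = demo_transformer.named_steps["imputer"].fit_transform(toy)

# Both steps: impute, then rescale to zero mean and unit variance
scaled = demo_transformer.fit_transform(toy)

print(imputed.ravel())
print(scaled.mean(), scaled.std())
```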


In [None]:
# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_columns)
    ]
)
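
By default a `ColumnTransformer` discards every column that is not named in one of its transformers. The sketch below illustrates this on a three-column toy frame (the frame and the names `drop_rest` and `keep_rest` are assumptions) and shows `remainder="passthrough"` as one way to keep unlisted columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with three columns; only two are listed for the transformer
toy = pd.DataFrame({"ram": [512, 1024, 2048], "pc": [2, 8, 13], "wifi": [0, 1, 1]})

# Default behaviour: remainder='drop'
drop_rest = ColumnTransformer([("num", StandardScaler(), ["ram", "wifi"])])

# Alternative: pass unlisted columns through unchanged
keep_rest = ColumnTransformer(
    [("num", StandardScaler(), ["ram", "wifi"])], remainder="passthrough"
)

dropped = drop_rest.fit_transform(toy)   # 'pc' is discarded
kept = keep_rest.fit_transform(toy)      # 'pc' is appended, unscaled, as the last column
print(dropped.shape, kept.shape)
```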


In [None]:
# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])


In [None]:
pipeline.fit(X_train,y_train)

In [None]:
# Predict on the test set
y_pred = pipeline.predict(X_test)
# Report evaluation metrics
print('Accuracy:', accuracy_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred, average='weighted'))
print('Precision:', precision_score(y_test, y_pred, average='weighted'))
print('Recall:', recall_score(y_test, y_pred, average='weighted'))
print('classification', classification_report(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))

Accuracy: 0.8825
F1 Score: 0.8829308270057811
Precision: 0.884804386670567
Recall: 0.8825
classification               precision    recall  f1-score   support

           0       0.94      0.96      0.95       105
           1       0.90      0.85      0.87        91
           2       0.78      0.85      0.81        92
           3       0.92      0.87      0.89       112

    accuracy                           0.88       400
   macro avg       0.88      0.88      0.88       400
weighted avg       0.88      0.88      0.88       400

Confusion Matrix: [[101   4   0   0]
 [  7  77   7   0]
 [  0   5  78   9]
 [  0   0  15  97]]
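
The reported scores can be recomputed by hand from the confusion matrix above, which helps when checking what the weighted averages mean. This is a plain-Python sketch; the variable names are illustrative only.

```python
# Confusion matrix copied from the output above (rows = true class, columns = predicted)
cm = [
    [101, 4, 0, 0],
    [7, 77, 7, 0],
    [0, 5, 78, 9],
    [0, 0, 15, 97],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Row sums are the true counts per class; column sums are the predicted counts
support = [sum(row) for row in cm]
predicted = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]

# Correct predictions sit on the diagonal
accuracy = sum(cm[i][i] for i in range(n_classes)) / total
precision = [cm[i][i] / predicted[i] for i in range(n_classes)]
recall = [cm[i][i] / support[i] for i in range(n_classes)]

# Weighted averages use each class's support as the weight
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(accuracy, weighted_precision, weighted_recall)
```

Weighted recall reduces to the sum of the diagonal divided by the total, which is why it equals the accuracy in the output above.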


In [None]:
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None,5,10,15,20],
    'min_samples_split': [2,5,10],
    'min_samples_leaf': [1,5,10]
}

# Define the evaluation metric
scorer = make_scorer(f1_score, average='weighted')

# perform grid search with cross validation
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring= scorer)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding cross-validated score.
# The scorer is weighted F1, so best_score_ is a mean weighted F1, not accuracy.
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Weighted F1 (CV):", grid_search.best_score_)


Best Hyperparameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}
Best Weighted F1 (CV): 0.8774837051314768
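
The search above tunes a bare `RandomForestClassifier` rather than the preprocessing pipeline, and its result is not used for the final predictions. One way to tie the two together, sketched here on synthetic data with a deliberately tiny grid (`X_demo`, `demo_pipeline`, and `demo_grid` are assumptions, not the notebook's objects), is to search over the pipeline itself and predict with the refitted best estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the phone dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
demo_grid = {"classifier__n_estimators": [20, 50], "classifier__max_depth": [None, 5]}

search = GridSearchCV(demo_pipeline, demo_grid, cv=3, scoring="f1_weighted")
search.fit(X_tr, y_tr)

# refit=True (the default) retrains the best configuration on all of X_tr,
# so the search object can predict directly
tuned_pred = search.predict(X_te)
print(search.best_params_, search.best_score_)
```

Searching over the pipeline means the scaler is refitted inside each cross-validation fold, so no statistics from the validation fold reach the preprocessing step.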


In [None]:

# Save the test-set predictions as price_range.
# y_pred comes from the untuned pipeline fitted above, not from the grid search.
submission_df= pd.DataFrame({'price_range': y_pred})
submission_df.to_csv('submission.csv', index=False)

In [None]:
submission_df.shape

(400, 1)
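
Because `y_pred` is a plain array, the saved file has no link back to the rows it was predicted from. A possible refinement, sketched with toy stand-ins (`X_test_demo`, `y_pred_demo`, and the `row_id` name are all hypothetical), is to reuse the index of the test frame:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_test and y_pred (assumption: three test rows)
X_test_demo = pd.DataFrame({"ram": [2104, 478, 3890]}, index=[632, 29, 411])
y_pred_demo = np.array([1, 0, 3])

# Reuse the test index so each prediction can be traced back to its source row
submission_demo = pd.DataFrame({"price_range": y_pred_demo}, index=X_test_demo.index)
submission_demo.index.name = "row_id"  # hypothetical column name

# to_csv() without a path returns the CSV text instead of writing a file
csv_text = submission_demo.to_csv()
print(csv_text)
```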

In [None]:
from google.colab import files

# Download the file
files.download('submission.csv')
