# Mobile Price Classification


## Context
We're launching our own mobile company, aiming to rival major competitors like Apple and Samsung. Estimating the price of our mobile phones is proving to be a challenge, as mere assumptions don't suffice in this fiercely competitive market. To overcome this obstacle, we've collected sales data from various firms. Our aim is to discover a correlation between mobile phone features, such as RAM and internal memory, and the selling price. Instead of predicting the exact price, we want to establish a price range, showing the likely upper limit. By leveraging a machine learning model, we anticipate a more accurate estimation of our mobile phones' pricing, allowing us to stand toe-to-toe with other market players.


## Dataset
- Dataset source: https://www.kaggle.com/code/farzadnekouei/noise-resilient-mobile-price-classification
- Dataset columns are as follows:

- **id** - ID
- **battery_power** - Total energy the battery can store at one time, measured in mAh
- **blue** - Has Bluetooth or not
- **clock_speed** - Speed at which the microprocessor executes instructions
- **dual_sim** - Has dual SIM support or not
- **fc** - Front camera megapixels
- **four_g** - Has 4G or not
- **int_memory** - Internal Memory in Gigabytes
- **m_dep** - Mobile Depth in cm
- **mobile_wt** - Weight of mobile phone
- **n_cores** - Number of cores of processor
- **pc** - Primary camera megapixels
- **px_height** - Pixel Resolution Height
- **px_width** - Pixel Resolution Width
- **ram** - Random Access Memory in Megabytes
- **sc_h** - Screen Height of mobile in cm
- **sc_w** - Screen Width of mobile in cm
- **talk_time** - Longest time that a single battery charge will last during calls
- **three_g** - Has 3G or not
- **touch_screen** - Has touch screen or not
- **wifi** - Has Wi-Fi or not
- **price_range** - Target variable, with the following values:
    - 0 (low cost)
    - 1 (medium cost)
    - 2 (high cost)
    - 3 (very high cost)


## Objectives
- Exploring and Preprocessing Data
- Build different classification models to predict the mobile phone price range
- Predict the price range for 1000 unseen samples


## Applied Models:
- Support Vector Machine (SVM)
- Decision Tree (DT)
- Random Forest (RF)


## Table of Contents


- Step 1 | Import Libraries

- Step 2 | Read Dataset

- Step 3 | Dataset Overview
    - Step 3.1 | Dataset Basic Information
    - Step 3.2 | Statistical Description of Categorical Variables
    - Step 3.3 | Statistical Description of Numerical Variables
    
- Step 4 | Univariate Analysis
    - Step 4.1 | Categorical Variables Univariate Analysis
    - Step 4.2 | Numerical Variables Univariate Analysis
    
- Step 5 | Data Cleansing
    - Step 5.1 | Duplicate Values Detection
    - Step 5.2 | Missing Value Detection
    - Step 5.3 | Noise Detection
    - Step 5.4 | Feature Selection
        - Step 5.4.1 | Pearson Correlation
        - Step 5.4.2 | Drop-column Feature Importance
        
- Step 6 | Bivariate Analysis
    - Step 6.1 | Categorical Features vs Target
    - Step 6.2 | Numerical Features vs Target
    
- Step 7 | SVM Model Building
    - Step 7.1 | Scale Data using Standard Scaler
    - Step 7.2 | SVM Hyperparameter Tuning
    - Step 7.3 | SVM Model Evaluation
    
- Step 8 | Decision Tree Model Building
    - Step 8.1 | Noise Treatment using KNN Imputer
    - Step 8.2 | Decision Tree Hyperparameter Tuning
    - Step 8.3 | Decision Tree Feature Subset Selection
    - Step 8.4 | Decision Tree Model Evaluation
    
- Step 9 | Random Forest Model Building
    - Step 9.1 | Random Forest Hyperparameter Tuning
    - Step 9.2 | Random Forest Feature Subset Selection
    - Step 9.3 | Random Forest Model Evaluation
    
- Step 10 | Conclusion & Sample Data Prediction


# Step 1 | Import Libraries

In [1]:
!pip install pandas-profiling



In [2]:
!pip install missingno



In [3]:
!pip install plotly



In [4]:
import numpy as np
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from matplotlib import colors
from matplotlib.colors import ListedColormap, LinearSegmentedColormap

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report, RocCurveDisplay, ConfusionMatrixDisplay
from sklearn.base import clone 
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
%matplotlib inline



In [5]:
# Initialize Plotly for use in the notebook
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

# Ignore warnings
warnings.filterwarnings('ignore')

# Step 2 | Read Dataset

In [6]:
df = pd.read_csv('/Users/dooinnkim/jupyter_notebook/2023_data_portfolio/mobile_price_classification/data/train.csv')
df.head()

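| | battery_power | blue | clock_speed | dual_sim | fc | four_g | int_memory | m_dep | mobile_wt | n_cores | ... | px_height | px_width | ram | sc_h | sc_w | talk_time | three_g | touch_screen | wifi | price_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842 | 0 | 2.2 | 0 | 1 | 0 | 7 | 0.6 | 188 | 2 | ... | 20 | 756 | 2549 | 9 | 7 | 19 | 0 | 0 | 1 | 1 |
| 1 | 1021 | 1 | 0.5 | 1 | 0 | 1 | 53 | 0.7 | 136 | 3 | ... | 905 | 1988 | 2631 | 17 | 3 | 7 | 1 | 1 | 0 | 2 |
| 2 | 563 | 1 | 0.5 | 1 | 2 | 1 | 41 | 0.9 | 145 | 5 | ... | 1263 | 1716 | 2603 | 11 | 2 | 9 | 1 | 1 | 0 | 2 |
| 3 | 615 | 1 | 2.5 | 0 | 0 | 0 | 10 | 0.8 | 131 | 6 | ... | 1216 | 1786 | 2769 | 16 | 8 | 11 | 1 | 0 | 0 | 2 |
| 4 | 1821 | 1 | 1.2 | 0 | 13 | 1 | 44 | 0.6 | 141 | 2 | ... | 1208 | 1212 | 1411 | 8 | 2 | 15 | 1 | 1 | 0 | 1 |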


# Step 3 | Dataset Overview

## Step 3.1 | Dataset Basic Information

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

### Conclusion:
- The dataset covers 2000 mobile phones.
- It has 21 variables: 20 independent and 1 dependent (price_range).
- There are no missing values.
- 8 variables are categorical: n_cores, price_range, blue, dual_sim, four_g, three_g, touch_screen, wifi.
- 13 variables are numeric: battery_power, clock_speed, fc, int_memory, m_dep, mobile_wt, pc, px_height, px_width, ram, talk_time, sc_h, sc_w.
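
The categorical/numeric split above is hard-coded by name in the cells that follow. As a rough cross-check, a similar split can be derived from the number of unique values per column. The sketch below is only an illustration: the helper `split_by_cardinality`, the `max_unique` threshold, and the small `df_demo` frame with made-up values are all assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny synthetic frame standing in for the real data (values are made up).
df_demo = pd.DataFrame({
    "battery_power": [842, 1021, 563, 615, 1821, 1859],
    "ram": [2549, 2631, 2603, 2769, 1411, 1067],
    "blue": [0, 1, 1, 1, 1, 0],
    "wifi": [1, 0, 0, 0, 0, 1],
    "price_range": [1, 2, 2, 2, 1, 1],
})


def split_by_cardinality(frame, max_unique=8):
    """Return (categorical, numerical) column lists based on unique-value counts.

    Hypothetical helper: columns with at most `max_unique` distinct values
    are treated as categorical, all others as numerical.
    """
    n_unique = frame.nunique()
    categorical = n_unique[n_unique <= max_unique].index.tolist()
    numerical = n_unique[n_unique > max_unique].index.tolist()
    return categorical, numerical


# A low threshold is used here only because the demo frame has six rows.
categorical_cols, numerical_cols = split_by_cardinality(df_demo, max_unique=4)
print(categorical_cols)
print(numerical_cols)
```

On the real data the threshold would need to be checked against the list above, since a cut-off that works for one dataset may misclassify low-cardinality numeric columns in another.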

## Step 3.2 | Statistical Description of Categorical Variables

In [8]:
# Select the categorical features and cast them to strings
df_categorical = df[['price_range', 'n_cores', 'blue', 'dual_sim', 'four_g', 'three_g', 'touch_screen', 'wifi']].astype(str)

# Calculate number of unique values and unique values for each feature
unique_counts = df_categorical.nunique()
unique_values = df_categorical.apply(lambda x: x.unique())

# Create new dataframe with the results
pd.DataFrame({'Number of Unique Values': unique_counts, 'Unique Values': unique_values})


| | Number of Unique Values | Unique Values |
|---|---|---|
| price_range | 4 | [1, 2, 3, 0] |
| n_cores | 8 | [2, 3, 5, 6, 1, 8, 4, 7] |
| blue | 2 | [0, 1] |
| dual_sim | 2 | [0, 1] |
| four_g | 2 | [0, 1] |
| three_g | 2 | [0, 1] |
| touch_screen | 2 | [0, 1] |
| wifi | 2 | [1, 0] |
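
Listing the unique values shows which classes exist but not how often each one occurs. A quick way to check class shares for a column such as `price_range` might look like the sketch below. The labels are synthetic and chosen only for illustration, so the printed shares say nothing about the real dataset.

```python
import pandas as pd

# Synthetic target labels for illustration only (not the real data).
price_range = pd.Series([0, 1, 2, 3, 0, 1, 2, 3, 1, 2], name="price_range")

# Share of each class, sorted by class label for readability.
class_share = price_range.value_counts(normalize=True).sort_index()
print(class_share)
```

The same call works on any of the categorical columns, for example `df['wifi']` in this notebook.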


## Step 3.3 | Statistical Description of Numerical Variables

In [9]:
# Keep only the numerical features by dropping the categorical columns
df_numerical = df.drop(df_categorical.columns, axis=1)

# Generate descriptive statistics
df_numerical.describe().T.round(1)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| battery_power | 2000.0 | 1238.5 | 439.4 | 501.0 | 851.8 | 1226.0 | 1615.2 | 1998.0 |
| clock_speed | 2000.0 | 1.5 | 0.8 | 0.5 | 0.7 | 1.5 | 2.2 | 3.0 |
| fc | 2000.0 | 4.3 | 4.3 | 0.0 | 1.0 | 3.0 | 7.0 | 19.0 |
| int_memory | 2000.0 | 32.0 | 18.1 | 2.0 | 16.0 | 32.0 | 48.0 | 64.0 |
| m_dep | 2000.0 | 0.5 | 0.3 | 0.1 | 0.2 | 0.5 | 0.8 | 1.0 |
| mobile_wt | 2000.0 | 140.2 | 35.4 | 80.0 | 109.0 | 141.0 | 170.0 | 200.0 |
| pc | 2000.0 | 9.9 | 6.1 | 0.0 | 5.0 | 10.0 | 15.0 | 20.0 |
| px_height | 2000.0 | 645.1 | 443.8 | 0.0 | 282.8 | 564.0 | 947.2 | 1960.0 |
| px_width | 2000.0 | 1251.5 | 432.2 | 500.0 | 874.8 | 1247.0 | 1633.0 | 1998.0 |
| ram | 2000.0 | 2124.2 | 1084.7 | 256.0 | 1207.5 | 2146.5 | 3064.5 | 3998.0 |
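
The statistics report a minimum of 0.0 for px_height, which is hard to reconcile with a real display. One simple way to count such suspicious zeros is sketched below. The `df_demo` values are synthetic, and the choice of which columns must be strictly positive is an assumption: a zero in fc or pc could plausibly mean the phone has no camera, so those columns are left out here.

```python
import pandas as pd

# Synthetic values for illustration only; not taken from the real dataset.
df_demo = pd.DataFrame({
    "px_height": [20, 905, 0, 1216, 1208],
    "px_width": [756, 1988, 1716, 1786, 1212],
    "fc": [1, 0, 2, 0, 13],
})

# Columns where a zero is assumed to be physically implausible.
# fc is excluded because a phone can lack a front camera.
must_be_positive = ["px_height", "px_width"]

# Count the zero entries in each of the selected columns.
zero_counts = (df_demo[must_be_positive] == 0).sum()
print(zero_counts)
```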


# Step 4 | Univariate Analysis
To analyze the dataset in detail, we examine each variable on its own, treating continuous and categorical features separately.

## Step 4.1 | Categorical Variables Univariate Analysis

In [10]:
# Create the subplots
fig = make_subplots(rows=3, cols=3, specs=[[{'type':'domain'}]*3]*3, vertical_spacing=0.05, horizontal_spacing=0.01)

# Loop through all the features and add the pie chart to the subplot
for i, feature in enumerate(df_categorical.columns):
    value_counts = df_categorical[feature].value_counts()
    labels = value_counts.index.tolist()
    values = value_counts.values.tolist()

    # Define a color map that fades from aliceblue to white
    cmap = colors.LinearSegmentedColormap.from_list("aliceblue", ["aliceblue", "white"])
    norm = colors.Normalize(vmin=0, vmax=len(labels))
    color_list = [colors.rgb2hex(cmap(norm(k))) for k in range(len(labels))]

    # Create the pie chart
    pie_chart = go.Pie(
        labels=labels,
        values=values,
        hole=0.6,
        marker=dict(colors=color_list, line=dict(color='white', width=3)),
        textposition='inside',
        textinfo='percent+label',
        title=feature,  # Add title with the feature name
        title_font=dict(size=25, color='black', family='Calibri')
    )

    # Add the pie chart to the subplot
    if i < 8:
        row = i // 3 + 1
        col = i % 3 + 1
        fig.add_trace(pie_chart, row=row, col=col)

# Update the layout
fig.update_layout(showlegend=False, height=1000, width=980, 
                   title={
                          'text':"Distribution of Categorical Variables",
                          'y':0.95,
                          'x':0.5,
                          'xanchor':'center',
                          'yanchor':'top',
                          'font': {'size':28, 'color':'black', 'family':'Calibri'}
                         })

# Show the plot
fig.show()
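
The loop above places each pie chart using `i // 3 + 1` and `i % 3 + 1`, which turns a zero-based feature index into the one-based row and column that Plotly subplots expect. The standalone sketch below shows the same arithmetic. The helper name `grid_position` is hypothetical and not part of the notebook.

```python
def grid_position(index, n_cols=3):
    """Map a zero-based trace index to a one-based (row, col) grid position."""
    row, col = divmod(index, n_cols)
    return row + 1, col + 1


features = ['price_range', 'n_cores', 'blue', 'dual_sim',
            'four_g', 'three_g', 'touch_screen', 'wifi']

# Print where each feature's chart would land in a grid with three columns.
for i, feature in enumerate(features):
    print(feature, grid_position(i))
```

With eight features, the last chart lands in row 3, column 2, so the ninth cell of the 3x3 grid stays empty.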