# Welcome Here 

# About Dataset

In this notebook we are going to predict the price range of a mobile phone based on various input features.

The dataset has 21 features and 2000 entries. The meanings of the features are given below.

1. battery_power: Total energy a battery can store at one time, measured in mAh

2. blue: Has bluetooth or not

3. clock_speed: speed at which microprocessor executes instructions

4. dual_sim: Has dual sim support or not

5. fc: Front Camera mega pixels

6. four_g: Has 4G or not

7. int_memory: Internal Memory in Gigabytes

8. m_dep: Mobile Depth in cm

9. mobile_wt: Weight of mobile phone

10. n_cores: Number of cores of processor

11. pc: Primary Camera mega pixels

12. px_height: Pixel Resolution Height

13. px_width: Pixel Resolution Width

14. ram: Random Access Memory in Mega Bytes

15. sc_h: Screen Height of mobile in cm

16. sc_w: Screen Width of mobile in cm

17. talk_time: Longest time that a single battery charge will last when you are on a call

18. three_g: Has 3G or not

19. touch_screen: Has touch screen or not

20. wifi: Has wifi or not

21. price_range: This is the target variable with value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost).

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score,roc_auc_score,precision_score, recall_score, f1_score



import warnings 
warnings.filterwarnings("ignore")

# Loading and Preparing Dataset

In [None]:
train=pd.read_csv('/kaggle/input/mobile-price-classification/train.csv')
test=pd.read_csv('/kaggle/input/mobile-price-classification/test.csv')

# Check the shape of the training data and testing data

In [None]:
train.head()

In [None]:
train.shape

In [None]:
test.shape

# Detailed Exploratory Data Analysis

In [None]:
# check the datatypes in the dataset
train.info()

In [None]:
# Describe the dataset
train.describe()

In [None]:
# Check the  Null values in the dataset
train.isna().sum()

In [None]:
# check the duplicate values in the training dataset
train.duplicated().sum()

# Check the correlation between the variables

In [None]:
correlation_matrix= train.corr()
correlation_matrix

# Visualise the Correlation Matrix

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(correlation_matrix, annot=True, cmap="Purples", fmt=".2f")
plt.show()

> Here we can see that `ram` is highly correlated with the target variable `price_range`, with a correlation coefficient of 0.92.
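
Reading one value off a large heatmap is easy to get wrong, so it can help to rank the features by their correlation with the target directly. The sketch below does this on a small synthetic frame; the column values and the way `price_range` is generated are assumptions made purely for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 500
ram = rng.uniform(256, 4000, n)
demo = pd.DataFrame({
    "ram": ram,
    "battery_power": rng.uniform(500, 2000, n),
    "clock_speed": rng.uniform(0.5, 3.0, n),
})
# Build a 4-level target driven mostly by ram, so ram should rank first
demo["price_range"] = pd.qcut(ram + rng.normal(0, 200, n), 4, labels=False)

# Correlation of every feature with the target, strongest (by absolute value) first
target_corr = (
    demo.corr()["price_range"]
    .drop("price_range")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

On the real data the same expression can be applied to `train` in place of `demo`.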

# Let's check whether the dataset is balanced

In [None]:
train["price_range"].value_counts()

> Here we notice that `price_range` is equally distributed across its 4 categories, 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost), which means our dataset is balanced.
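
A balance check can also be expressed as numbers rather than raw counts. The following is a minimal sketch on a hypothetical target column; the `imbalance_ratio` name is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical target column with four equally sized classes
target = pd.Series([0, 1, 2, 3] * 50, name="price_range")

# Share of each class; a balanced 4-class target has 0.25 per class
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the largest and smallest class (1.0 means perfectly balanced)
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```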

# Univariate Analysis 

# Let's see the Data Distribution of Categorical Features

In [None]:
train["price_range"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="Red")
plt.xlabel("price_range")
plt.ylabel("count")
plt.title("Mobile_Price_Count")

In [None]:
train["blue"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="grey")
plt.xlabel("bluetooth")
plt.ylabel("count")
plt.title("Bluetooth")

In [None]:
train["dual_sim"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="blue")
plt.xlabel("dual_sim")
plt.ylabel("count")
plt.title("Dual_Sim")

In [None]:
train["four_g"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="green")
plt.xlabel("four_g")
plt.ylabel("count")
plt.title("4G")

In [None]:
train["three_g"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="purple")
plt.xlabel("three_g")
plt.ylabel("count")
plt.title("3G")

In [None]:
train["touch_screen"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="brown")
plt.xlabel("touch_screen")
plt.ylabel("count")
plt.title("Touch_screen")

In [None]:
train["wifi"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="orange")
plt.xlabel("wifi")
plt.ylabel("count")
plt.title("WiFi")

In [None]:
train["n_cores"].value_counts().plot(kind="bar", figsize=(6,4), rot=0, color="pink")
plt.xlabel("n_cores")
plt.ylabel("count")
plt.title("No. of Cores")

# Let's see the Data Distribution of Numerical Features

In [None]:
sns.histplot(train["ram"], color="grey", kde=True)
plt.title("Ram")

In [None]:
sns.histplot(train["battery_power"], color="blue", kde=True)
plt.title("Battery_Power")

In [None]:
sns.histplot(train["m_dep"], color="green", kde=True)   # histplot replaces the deprecated distplot
plt.title("Mobile_Depth")

In [None]:
sns.histplot(train["fc"], color="violet", label="Front_Camera", kde=True)
plt.title("Front_Camera")

In [None]:
sns.histplot(train["mobile_wt"], color="orange", kde=True)   # histplot replaces the deprecated distplot
plt.title("Mobile_Weight")

In [None]:
sns.histplot(train["px_height"], color="red", kde=True)
plt.title("Pixel_Height")

In [None]:
sns.histplot(train["px_width"], color="purple", kde=True)
plt.title("Pixel_Width")

# Let's see the outliers in the dataset with the help of violin and box plots

In [None]:
fig = px.violin(train, x="price_range", y="ram", title="Ram Vs Price Range", color="price_range", box=True, points="all")
fig.show()

In [None]:
fig = px.violin(train, x="price_range", y="battery_power", title="Battery_power Vs Price Range", color="price_range", box=True, points="all")
fig.show()

In [None]:
fig = px.violin(train, x="price_range", y="px_height", title="Pixel_height Vs Price Range", color="price_range", box=True, points="all")
fig.show()

In [None]:
fig = px.violin(train, x="price_range", y="px_width", title="Pixel_Width Vs Price Range", color="price_range", box=True, points="all")
fig.show()

In [None]:
fig = px.violin(train, x="price_range", y="int_memory", title="Internal Memory in Gigabytes Vs Price Range", color="price_range", box=True, points="all")
fig.show()

In [None]:
fig = px.violin(train, x="price_range", y="fc", title="Front_Camera Vs Price Range", color="price_range", box=True, points="all")
fig.show()

> Here we can spot the outliers. The plots also suggest that `price_range` tends to increase as `ram`, `battery_power`, `px_height`, `px_width`, `int_memory` and `fc` (front camera megapixels) increase.

# Divide the Dataset into Train and Test set

In [None]:
train.shape, test.shape

In [None]:
test.drop("id", axis=1, inplace=True)

In [None]:
def train_test_split_data(dataframe,target,test_size, random_state):
    x_train,x_test, y_train, y_test= train_test_split(dataframe.drop([target], axis=1),
                                                      dataframe[target],
                                                      test_size=test_size,
                                                      random_state=random_state,
                                                      stratify=dataframe[target]
                                                      )
    
    return x_train,x_test, y_train, y_test

In [None]:
x_train,x_test, y_train, y_test= train_test_split_data(train, target="price_range",test_size=0.3, random_state=42)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape
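
The `stratify` argument used in the helper above keeps the class proportions of the target the same in both splits. The sketch below makes that visible on a deliberately imbalanced, hypothetical target; the data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target to make the effect of stratify visible
y = pd.Series([0] * 60 + [1] * 30 + [2] * 10, name="price_range")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Class shares in each split should match the shares in the full data
full_share = y.value_counts(normalize=True).sort_index()
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "train": train_share, "test": test_share}))
```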

# Let's Detect and Remove the Outliers in the Dataset

In [None]:
def Winsorization_Method(columns, x_train, y_train , a, b):
    # Note: despite the name, this drops rows outside the percentile bounds
    # (trimming) rather than clipping their values.
    outlier_mask = pd.Series(False, index=x_train.index)

    for col in columns:
        lower = np.percentile(x_train[col], a)      # a-th percentile of the column
        upper = np.percentile(x_train[col], b)      # b-th percentile of the column
        # flag every row that falls outside [lower, upper] in this column
        outlier_mask |= (x_train[col] < lower) | (x_train[col] > upper)

    ratio = round(outlier_mask.mean()*100, 2)       # percentage of rows flagged as outliers

    # return filtered copies instead of modifying the inputs in place
    x_train = x_train.loc[~outlier_mask].copy()
    y_train = y_train.loc[~outlier_mask].copy()

    return ratio, x_train, y_train

In [None]:
ratio, x_train, y_train= Winsorization_Method(x_train.select_dtypes(exclude="object").columns, x_train, y_train , a=1, b=99)

In [None]:
x_train.shape, y_train.shape

In [None]:
print(f"Percentage of outliers detected in the training set: {ratio}%")
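
The function above removes every row that falls outside the chosen percentiles, which is usually called trimming. Winsorizing in the stricter sense keeps all rows and clips extreme values to the percentile bounds instead. The sketch below contrasts the two on a hypothetical series; the 5th/95th cut-offs are an arbitrary choice for the demo, not the values used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two extreme values
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0, -500.0])

# Percentile bounds (arbitrary 5th/95th cut-offs for this demo)
lower, upper = np.percentile(values, [5, 95])

# Winsorizing: clip values to the bounds, keeping every row
winsorized = values.clip(lower=lower, upper=upper)

# Trimming: drop rows outside the bounds
trimmed = values[(values >= lower) & (values <= upper)]

print(f"original rows: {len(values)}, winsorized rows: {len(winsorized)}, trimmed rows: {len(trimmed)}")
```

Clipping preserves the sample size, which matters on small datasets; trimming is simpler but discards the whole row, including its non-extreme features.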

# Data Preprocessing


In [None]:
robust_scaler= RobustScaler()
x_train=robust_scaler.fit_transform(x_train)
x_test=robust_scaler.transform(x_test)
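
With its default settings, `RobustScaler` centres each feature on its median and divides by its interquartile range, which makes it less sensitive to extreme values than mean and standard deviation scaling. The sketch below checks that against a manual computation on a hypothetical one-column array.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature data with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(X)

# With default settings, RobustScaler subtracts the median and divides by the IQR
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

As in the cell above, the scaler should be fitted on the training split only and then applied to the test split with `transform`.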

# Let's do the Modelling

In [None]:
def modelling(x_train, x_test, y_train, y_test):
    
    cv_result = []
    best_estimators = []
    recall_scores = []
    precision_scores = []
    roc_auc_scores = []
    f1_scores = []
    
    
    classifiers = [DecisionTreeClassifier(),
             RandomForestClassifier(),
             LogisticRegression(random_state=0),
             GradientBoostingClassifier(),
             ]

    dt_param_grid = {"min_samples_split" : range(10,500,20),
                     "max_depth": range(1,20,2)}


    rf_param_grid = {"max_features": [1,3,10],
                     "min_samples_split":[2,3,10],
                     "min_samples_leaf":[1,3,10],
                     "n_estimators":[100,300],
                     "criterion":["gini"]}
    

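    # Only "l2" is kept: the default lbfgs solver does not support "l1",
    # and the string "None" is not a valid penalty value
    logreg_param_grid = {"C":np.logspace(-4, 4, 20),
                         "penalty": ["l2"],
                         "max_iter":[1000]}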


    gbc_param_grid = {
                      "learning_rate": [0.05, 0.1, 0.2],
                      "min_samples_split": [2,3,10],
                      "min_samples_leaf": [1,3,10]
                      }


    classifier_parameters = [dt_param_grid,
                            rf_param_grid,
                            logreg_param_grid,
                            gbc_param_grid,
                             ]
    
    for i in range(len(classifiers)):
        model= GridSearchCV(classifiers[i], classifier_parameters[i], cv=5, scoring ="accuracy", n_jobs = -1)
        model.fit(x_train, y_train)
        y_pred= model.predict(x_test)
        
        cv_result.append(model.best_score_)          # mean cross-validated accuracy of the best parameters
        roc_auc_scores.append(roc_auc_score(y_test, model.predict_proba(x_test), multi_class='ovr'))
        # scikit-learn metrics expect the true labels first, then the predictions
        recall_scores.append(recall_score(y_test, y_pred, average='macro'))
        precision_scores.append(precision_score(y_test, y_pred, average='macro'))
        f1_scores.append(f1_score(y_test, y_pred, average='macro'))
        best_estimators.append(model.best_estimator_)
        
        
        print(f"Model:{classifiers[i]}")
        print(f"CV Accuracy:{round(cv_result[i]*100,2)}")
        print(f"ROC AUC:{roc_auc_scores[i]}")
        print(f"Recall:{recall_scores[i]}")
        print(f"Precision:{precision_scores[i]}")
        print(f"F1-Score:{f1_scores[i]}")
        print(f"Best Estimator:{model.best_estimator_}")
    
        print("---------------------------------------------------------------------------------------------------------------")

                         
                         
    model_names = ['DecisionTreeClassifier','RandomForestClassifier','LogisticRegression','GradientBoostingClassifier']
    result_df = pd.DataFrame({'Recall':recall_scores, 'Precision':precision_scores, 'F1_Score':f1_scores,'AUC_Score':roc_auc_scores, 'Accuracy': cv_result,},index=model_names)
    result_df=result_df.sort_values(by="AUC_Score", ascending=False)
    return result_df
        

In [None]:
result_df= modelling(x_train, x_test, y_train, y_test)

In [None]:
result_df
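
When reading the precision and recall columns, it is worth remembering that scikit-learn metric functions take the true labels first and the predictions second. The sketch below uses hypothetical labels to show that swapping the two arguments silently exchanges macro precision and macro recall.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Documented order: ground truth first, predictions second
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")

# Swapping the arguments returns the recall value under the name "precision"
swapped_precision = precision_score(y_pred, y_true, average="macro")

print(f"precision={precision:.4f} recall={recall:.4f} swapped_precision={swapped_precision:.4f}")
```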

# Result Visualization

In [None]:
result_df.plot(kind="barh", figsize=(10, 7), grid=True).legend(bbox_to_anchor=(1.2,1));

# Principal Component Analysis(PCA)
> Dimensionality Reduction Technique ---- PCA finds new axes, called principal components, each a linear combination of the original features, ordered by how much of the variance in the data they capture. It is unsupervised and does not use the target variable.

> Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It tries to preserve the essential parts of the data that have more variation and remove the non-essential parts with less variation. Dimensions are nothing but features that represent the data.
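
Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance ratio exceeds that value. The sketch below illustrates this on synthetic data built from three hypothetical underlying factors; the data-generating setup is an assumption for demonstration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features generated from 3 underlying factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

# A float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance ratio exceeds that threshold
pca_demo = PCA(n_components=0.98)
X_reduced = pca_demo.fit_transform(X)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"components kept: {pca_demo.n_components_}")
print(f"cumulative explained variance ratio: {cumulative}")
```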

In [None]:
pca=PCA(0.98)                               # Keep enough components to explain 98% of the variance
x_train_pca= pca.fit_transform(x_train)
x_test_pca=pca.transform(x_test)

In [None]:
result_df= modelling(x_train_pca, x_test_pca, y_train, y_test)   # fit the models on the PCA-transformed data

In [None]:
result_df

In [None]:
# Percentage of variance explained by each of the selected components.
print(f"Variance_Ratio:{pca.explained_variance_ratio_}")

In [None]:
# Number of principal components retained
print(f"No. of Components Used:{pca.n_components_}")

> Variances of each principal component show how much of the original variation in the dataset is explained by the principal component.


In [None]:
# The amount of variance explained by each of the selected components. 
print(f"Variance:{pca.explained_variance_}")

> Here we can see the variance of each component. These are absolute variances (eigenvalues), not percentages of the total variation; the share of total variation per component is given by `pca.explained_variance_ratio_`, printed earlier.
1. The first principal component has a variance of about 0.6474.
2. The second principal component has a variance of about 0.5816.
3. The third principal component has a variance of about 0.5513.
4. The fourth principal component has a variance of about 0.3777.
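
The two attributes `explained_variance_` and `explained_variance_ratio_` are easy to mix up. The former holds absolute variances along each component, while the latter divides those by the total variance of the data, so only the ratios can be read as percentages. A minimal sketch on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with features on different scales
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

pca_demo = PCA(n_components=3).fit(X)

# Total variance of the data is the sum of the per-feature sample variances
total_variance = X.var(axis=0, ddof=1).sum()

# Each ratio is that component's variance divided by the total variance
manual_ratio = pca_demo.explained_variance_ / total_variance

print(f"variances: {pca_demo.explained_variance_}")
print(f"ratios:    {pca_demo.explained_variance_ratio_}")
print(f"match:     {np.allclose(manual_ratio, pca_demo.explained_variance_ratio_)}")
```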

# Scree Plot
> A common method for determining the number of PCs to be retained is a graphical representation known as a scree plot. A scree plot is a simple line segment plot. The x-axis displays the principal component and the y-axis displays the variance explained by each individual principal component, often expressed as a percentage of the total variance. It always displays a downward curve. Most scree plots look broadly similar in shape, starting high on the left, falling rather quickly, and then flattening out at some point. This is because the first component usually explains much of the variability, the next few components explain a moderate amount, and the latter components only explain a small fraction of the overall variability. The scree plot criterion looks for the “elbow” in the curve and selects all components just before the line flattens out. (In the PCA literature, the plot is called a ‘scree’ plot because it often looks like a ‘scree’ slope, where rocks have fallen down and accumulated on the side of a mountain.)
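
The numbers behind a scree plot can be inspected before drawing anything. The sketch below, on hypothetical data with steadily shrinking feature spread, prints the per-component and cumulative percentage of variance; the per-component values never increase because components are sorted by variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 8 features with steadily shrinking spread
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 8)) * np.array([5.0, 3.0, 2.0, 1.0, 0.8, 0.6, 0.5, 0.4])

pca_demo = PCA().fit(X)

# Scree values: percentage of total variance per component, in component order
scree_percent = 100 * pca_demo.explained_variance_ratio_
cumulative_percent = np.cumsum(scree_percent)

for i, (pct, cum) in enumerate(zip(scree_percent, cumulative_percent), start=1):
    print(f"PC{i}: {pct:5.2f}%  (cumulative {cum:6.2f}%)")
```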

In [None]:
plt.ylim(0,max(pca.explained_variance_)) 
plt.axhline(y=0.38,color='r',linestyle='--')   # horizontal reference line at 0.38
plt.plot(pca.explained_variance_, 'o-', linewidth=2, color="blue")

plt.ylabel('Variance Explained') 
plt.xlabel('Principal Components')
plt.title('Scree Plot')
plt.show()

# Let's Visualise the Data Distribution along the First and Second Principal Components

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x_train_pca[:,0],x_train_pca[:,1],c=y_train)   # colour each point by its price_range class
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
