# Portfolio Part 3 - Analysis of Mobile Price Data (2024 S1)

In this Portfolio task, you will work on a new dataset named 'Mobile Price Data', which contains numerous details about mobile phone hardware, specifications, and prices. Your main task is to train classification models to predict **mobile phone prices** ('price range' in the dataset) and evaluate the strengths and weaknesses of these models.

Here's the explanation of each column:

|Column|Meaning|
|:-----:|:-----:|
|battery power|Total energy a battery can store in one time measured in mAh|
|blue|Has bluetooth or not|
|clock speed|speed at which microprocessor executes instructions|
|dual sim|Has dual sim support or not|
|fc|Front Camera mega pixels|
|four g|Has 4G or not|
|int memory|Internal Memory in Gigabytes|
|m dep|Mobile Depth in cm|
|mobile wt|Weight of mobile phone|
|n cores|Number of cores of processor|
|pc|Primary Camera mega pixels|
|px height|Pixel Resolution Height|
|px width|Pixel Resolution Width|
|ram|Random Access Memory in Mega Bytes|
|sc h|Screen Height of mobile in cm|
|sc w|Screen Width of mobile in cm|
|talk time|longest time that a single battery charge will last when you are|
|three g|Has 3G or not|
|touch screen|Has touch screen or not|
|wifi|Has wifi or not|
|price range|This is the target variable with value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost)|

Blue, dual sim, four g, three g, touch screen, and wifi are all binary attributes, 0 for not and 1 for yes.

Your high-level goal in this notebook is to build and evaluate predictive models for 'price range' from other available features. More specifically, you need to **complete the following major steps**:

1. ***Explore the data*** and ***clean the data if necessary***. For example, remove abnormal instances and replace missing values.

2. ***Study the correlation*** between 'price range' and the other features, and ***select the variables*** that you think are helpful for predicting the price range. We do not limit the number of variables.

3. ***Split the dataset*** (Training set : Test set = 8 : 2)

4. ***Train a logistic regression model*** to predict 'price range' based on the selected features (from the second step). ***Calculate the accuracy*** of your model. (You are required to report the accuracy from both training set and test set.) ***Explain your model and evaluate its performance*** (Is the model performing well? If yes, what factors might be contributing to the good performance of your model? If not, how can improvements be made?).

5. ***Train a KNN model*** to predict 'price range' based on the selected features (you can use the features selected in the second step and set K in an ad-hoc manner in this step). ***Calculate the accuracy*** of your model. (You are required to report the accuracy from both training set and test set.)

6. ***Tune the hyper-parameter K*** in KNN (Hint: GridSearchCV), ***visualize the results***, and ***explain*** how K influences the prediction performance.

  Hints for visualization: You can use a line chart to visualize K and mean accuracy scores on test set.

Note 1: In this assignment, we no longer provide specific guidance and templates for each sub task. You should learn how to properly comment your notebook by yourself to make your notebook file readable.

Note 2: You will not be evaluated on the accuracy of the model but on the process that you use to generate it and your explanation.

# 1.1 Explore the dataset
1. Import all necessary libraries
2. Explore the data
3. Describe the data
4. General information about the dataset


In [None]:
# 1. Import drive to Google Colab
from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [None]:
# 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [None]:
# 2. Load the CSV file into a DataFrame and display the first rows
mobile = pd.read_csv('/drive/MyDrive/MQ 2024/[S1] COMP6200/[COMP6200] Portfolio 3/Mobile_Price_Data.csv')
mobile.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7.0,0.6,188,2,...,20,756.0,2549.0,9,7,19,0.0,0,1,1
1,1021,1,0.5,1,0,1,53.0,0.7,136,3,...,905,1988.0,2631.0,17,3,7,1.0,1,0,2
2,563,1,0.5,1,2,1,41.0,0.9,145,5,...,1263,1716.0,2603.0,11,2,9,1.0,1,0,2
3,615,1,2.5,0,0,0,10.0,0.8,131,6,...,1216,1786.0,2769.0,16,8,11,1.0,0,0,2
4,1821,1,1.2,0,13,1,44.0,0.6,141,2,...,1208,1212.0,1411.0,8,2,15,1.0,1,0,1


In [None]:
mobile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     1999 non-null   float64
 7   m_dep          1999 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       1999 non-null   float64
 13  ram            1999 non-null   float64
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        1999 non-null   float64
 18  touch_sc

# 1.2 Clean and Transform the Dataset
1. Remove missing data
2. Transform the data (if necessary)

In [None]:
# 1. Clean the data
# print the length of the data before removing the missing data.
print("Length of data before removing missing data:", len(mobile))

# counting number of Null values in each column before removing the missing data.
print("Null value counts in each column:")
print(mobile.isnull().sum())

Length of data before removing missing data: 2000
Null value counts in each column:
battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       1
m_dep            1
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         1
ram              1
sc_h             0
sc_w             0
talk_time        0
three_g          1
touch_screen     0
wifi             0
price_range      0
dtype: int64


In [None]:
# Drop every row that has a missing value. The columns affected are
# int_memory, m_dep, px_width, ram and three_g (one missing value each).
mob_clean = mobile.dropna()

# print the length of the data after removing the missing data.
print("Length of data after removing missing data:", len(mob_clean))

# counting number of Null values in each column after removing the missing data.
print("Null value counts in each column:")
print(mob_clean.isnull().sum())

Length of data after removing missing data: 1995
Null value counts in each column:
battery_power    0
blue             0
clock_speed      0
dual_sim         0
fc               0
four_g           0
int_memory       0
m_dep            0
mobile_wt        0
n_cores          0
pc               0
px_height        0
px_width         0
ram              0
sc_h             0
sc_w             0
talk_time        0
three_g          0
touch_screen     0
wifi             0
price_range      0
dtype: int64
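The task brief also asks about abnormal instances, which `dropna()` alone does not address. One possible check is sketched below on a small made-up frame (not the real data): it counts zeros in columns where a zero looks physically implausible. Which columns count as "suspect" is an assumption made for this sketch and should be confirmed against the actual dataset.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the dataset.
toy = pd.DataFrame({
    "px_height": [20, 905, 0, 1216],
    "sc_w": [7, 3, 2, 0],
    "ram": [2549.0, 2631.0, None, 2769.0],
})

# Step 1: drop rows with missing values, as the notebook does.
toy_clean = toy.dropna()

# Step 2: count zeros in columns where a zero is implausible
# (the choice of columns is an assumption for this sketch).
suspect_cols = ["px_height", "sc_w"]
zero_counts = (toy_clean[suspect_cols] == 0).sum()
print(zero_counts)

# Step 3: keep only rows where every suspect column is positive.
toy_valid = toy_clean[(toy_clean[suspect_cols] > 0).all(axis=1)]
print(len(toy_clean), "->", len(toy_valid))
```

Whether such rows should be dropped or kept is a judgment call; the point is only to make the check explicit before modelling.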


# 2. Study the Correlation & Select Variables
1. Explore the correlation on clean data, numbers
2. Function to be used: df.corr() correlation between price range and others
3. Explain your choice of selected features

In [None]:
# Calculate the correlation between 'price_range' and all other variables
price_corr = mob_clean.corr()['price_range']

print("Correlation with Price:")
print(price_corr)

Correlation with Price:
battery_power    0.202652
blue             0.020846
clock_speed     -0.006926
dual_sim         0.018153
fc               0.019327
four_g           0.014572
int_memory       0.043861
m_dep            0.000776
mobile_wt       -0.028663
n_cores          0.003573
pc               0.031831
px_height        0.147946
px_width         0.164763
ram              0.917131
sc_h             0.023067
sc_w             0.037330
talk_time        0.022085
three_g          0.023771
touch_screen    -0.031155
wifi             0.020394
price_range      1.000000
Name: price_range, dtype: float64




> Based on the correlation between price_range and the other variables, the following two have the highest correlations:
- ram: 0.917131
- battery_power: 0.202652

Although ram is far more strongly correlated with price_range than battery_power is, I believe adding a second variable might improve the model. The next candidates, px_width (0.164763) and px_height (0.147946), are only slightly weaker than battery_power. It is also important to check whether the two chosen variables are highly correlated with each other, to reduce the risk of multicollinearity.



In [None]:
# Calculate the correlation between 'ram' and 'battery_power'
var_corr = mob_clean['ram'].corr(mob_clean['battery_power'])

print("Correlation between 'ram' and 'battery_power'")
print(var_corr)

Correlation between 'ram' and 'battery_power'
0.00089637767106347


**Conclusion:**
The correlation between these two variables is close to zero (about 0.0009), so multicollinearity is not a concern here. Therefore, 'ram' and 'battery_power' are two variables that we can use to develop a model to predict the price range of a mobile.
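The selection logic of this section (rank features by correlation with the target, then check the chosen pair against each other) can be written as one short routine. The sketch below uses synthetic data, not the notebook's `mob_clean`; the column `noise_feature` and the way `price_range` is generated are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the cleaned data (values are made up).
df = pd.DataFrame({
    "ram": rng.uniform(256, 4000, n),
    "battery_power": rng.uniform(500, 2000, n),
    "noise_feature": rng.normal(size=n),  # hypothetical irrelevant column
})
# Four price bands driven by ram and battery_power (an assumption).
df["price_range"] = pd.qcut(df["ram"] + df["battery_power"], 4, labels=False)

# Rank features by the absolute value of their correlation with the target.
target_corr = (
    df.corr()["price_range"]
    .drop("price_range")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)

# Check the two strongest features against each other.
top_two = list(target_corr.index[:2])
pairwise = df[top_two].corr().loc[top_two[0], top_two[1]]
print(top_two, round(pairwise, 3))
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.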

# 3. Split the data
We need to split the dataset into training and test sets for three different models:
1. Model-A: using the entire dataset
2. Model-B: using selected features (from Feature Selection (RFE))
3. Model-C: using selected features (based on calculated correlation from Section 2)

Notes:
- Function to be used: train_test_split, provided by sklearn library.
- By splitting the data, we have train_X, train_Y, test_X, and test_Y for three different models (Model-A, Model-B, and Model-C)
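Because the three models differ only in which columns they use, one option is to split the rows once and then pick columns from the same split. The sketch below shows this on synthetic data with the same row count as the cleaned dataset; `feature_sets` and `splits` are hypothetical names, and `stratify=y` is an optional extra (not used in this notebook) that keeps the class proportions equal in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(111)
n_rows = 1995  # same length as the cleaned data; values are made up

data = pd.DataFrame({
    "ram": rng.uniform(256, 4000, n_rows),
    "battery_power": rng.uniform(500, 2000, n_rows),
    "clock_speed": rng.uniform(0.5, 3.0, n_rows),
    "n_cores": rng.integers(1, 9, n_rows),
    "price_range": rng.integers(0, 4, n_rows),
})
X = data.drop(columns=["price_range"])
y = data["price_range"]

# Split the rows once (80% train, 20% test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111, stratify=y
)

# Select the columns each model needs from the same split.
feature_sets = {
    "A": list(X.columns),
    "B": ["clock_speed", "n_cores"],
    "C": ["ram", "battery_power"],
}
splits = {name: (X_train[cols], X_test[cols]) for name, cols in feature_sets.items()}

for name, (train_part, test_part) in splits.items():
    print(f"Model-{name}:", train_part.shape, test_part.shape)
```

With this layout all three models share `y_train` and `y_test`, so their accuracies are computed on exactly the same rows.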

In [None]:
# Model-A
# Training data containing 80% of the entire data
X_train_A, X_test_A, y_train_A, y_test_A = train_test_split(mob_clean.drop(columns=['price_range']), mob_clean['price_range'], test_size = 0.2, random_state = 111)

# Inspect the structure of the training and the testing sets
print('X_train_A:',X_train_A.shape)
print('y_train_A:',y_train_A.shape)
print('X_test_A:',X_test_A.shape)
print('y_test_A:',y_test_A.shape)


X_train_A: (1596, 20)
y_train_A: (1596,)
X_test_A: (399, 20)
y_test_A: (399,)


In [41]:
# Model-B
# Import LogisticRegression from sklearn library
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Select the variables using RFE
from sklearn.feature_selection import RFE
selector = RFE(lr, n_features_to_select=2) # choose the number of features
selector = selector.fit(X_train_A, y_train_A) # fit the selector to the train data
selector.ranking_ # find the ranking of the variables

array([12, 17,  1, 15,  3, 19,  7, 18,  8,  1,  6, 10, 11,  9,  5,  2,  4,
       13, 14, 16])

In [42]:
mob_clean.head() # show the variables to locate the two variables ranked as the most important

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7.0,0.6,188,2,...,20,756.0,2549.0,9,7,19,0.0,0,1,1
1,1021,1,0.5,1,0,1,53.0,0.7,136,3,...,905,1988.0,2631.0,17,3,7,1.0,1,0,2
2,563,1,0.5,1,2,1,41.0,0.9,145,5,...,1263,1716.0,2603.0,11,2,9,1.0,1,0,2
3,615,1,2.5,0,0,0,10.0,0.8,131,6,...,1216,1786.0,2769.0,16,8,11,1.0,0,0,2
4,1821,1,1.2,0,13,1,44.0,0.6,141,2,...,1208,1212.0,1411.0,8,2,15,1.0,1,0,1




>*After trying different numbers of features to select (n_features_to_select), we observe limited improvement as the number of features increases. Thus, we believe keeping the number at 2 is the best way to simplify the model. According to the ranking above, the two features ranked 1 are clock_speed and n_cores.*
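RFE ranks features by the magnitude of the fitted coefficients, and those magnitudes depend on the units of each column. A common precaution, not applied in this notebook, is to standardize the features before running RFE. The sketch below does that on synthetic data and reads the selected column names from `selector.support_` instead of counting positions by eye; the data-generating rule is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600

# Synthetic features on very different scales (values are made up).
X = pd.DataFrame({
    "ram": rng.uniform(256, 4000, n),
    "battery_power": rng.uniform(500, 2000, n),
    "clock_speed": rng.uniform(0.5, 3.0, n),
    "n_cores": rng.integers(1, 9, n).astype(float),
})
# Assumption: the price band depends only on ram and battery_power.
y = pd.qcut(X["ram"] + X["battery_power"], 4, labels=False)

# Standardize so that coefficient sizes are comparable across columns.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X_scaled, y)

# support_ is a boolean mask over the columns; use it to get the names.
selected = list(X_scaled.columns[selector.support_])
print("Selected features:", selected)
print("Ranking:", dict(zip(X_scaled.columns, selector.ranking_)))
```

In a real workflow the scaler would be fitted on the training split only, to avoid leaking test-set statistics into the selection.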



In [43]:
# Re-split the data
X_train_B, X_test_B, y_train_B, y_test_B = train_test_split(mob_clean[["clock_speed", "n_cores"]], mob_clean["price_range"], test_size=0.2, random_state=111)

# Inspect the structure of the training and the testing sets
print('X_train_B:', X_train_B.shape)
print('y_train_B:', y_train_B.shape)
print('X_test_B:', X_test_B.shape)
print('y_test_B:', y_test_B.shape)

X_train_B: (1596, 2)
y_train_B: (1596,)
X_test_B: (399, 2)
y_test_B: (399,)


In [32]:
# Model-C
# Re-split the data accordingly
X_train_C, X_test_C, y_train_C, y_test_C = train_test_split(mob_clean[["ram", "battery_power"]], mob_clean["price_range"], test_size=0.2, random_state=111)

# Inspect the structure of the training and the testing sets
print('X_train_C:', X_train_C.shape)
print('y_train_C:', y_train_C.shape)
print('X_test_C:', X_test_C.shape)
print('y_test_C:', y_test_C.shape)

X_train_C: (1596, 2)
y_train_C: (1596,)
X_test_C: (399, 2)
y_test_C: (399,)


# 4. Train a logistic regression model
1. Define the logistic regression model (the workflow mirrors the linear regression model):
   from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import accuracy_score

2. Fit the model on training data:
   lr.fit(train_X, train_Y)

3. Test the model on test data:
   pred_Y = lr.predict(test_X)

4. Remember to show the train and test accuracy:
   accuracy_score(test_Y, pred_Y)
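The task asks for accuracy on both the training set and the test set. A minimal sketch of a helper that reports both is given below; `report_accuracy` is a hypothetical name, and the data comes from `make_classification` rather than the mobile dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def report_accuracy(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return (train_accuracy, test_accuracy)."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Synthetic four-class data standing in for the notebook's features.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)

train_acc, test_acc = report_accuracy(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test
)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A training accuracy far above the test accuracy points to overfitting, while two similarly low values point to underfitting.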

In [47]:
# Model-A: Using all variables
lr.fit(X_train_A, y_train_A) # fit the model on training set of the entire data

In [48]:
# Model-A: Evaluation Metrics
from sklearn.metrics import accuracy_score
y_pred_A = lr.predict(X_test_A) # test the model on test data (y_pred)
print("Accuracy on the Logistic Regression Model-A:", accuracy_score(y_test_A, y_pred_A)) # show the accuracy score

Accuracy on the Logistic Regression Model-A: 0.6090225563909775


In [49]:
# Model-B: Using Feature Selection (RFE)
# Use RFE selected features to re-train the model
lr.fit(X_train_B, y_train_B)

In [50]:
# Model-B: Evaluation Metrics (RFE Selected Features)
y_pred_B = lr.predict(X_test_B)
print("Accuracy on logistic regression Model-B:", accuracy_score(y_test_B, y_pred_B))

Accuracy on logistic regression Model-B: 0.2531328320802005


In [51]:
# Model-C: Using selected variables from correlation
lr.fit(X_train_C, y_train_C)

In [52]:
# Model-C: Evaluation Metrics
y_pred_C = lr.predict(X_test_C)
print("Accuracy on logistic regression with Model-C:", accuracy_score(y_test_C, y_pred_C))

Accuracy on logistic regression with Model-C: 0.8421052631578947


**ANALYSIS ON THE PERFORMANCE OF THREE DIFFERENT MODELS**

Based on the provided accuracy metrics, it is evident that Model-C, which utilizes selected features based on calculated correlation, outperforms both Model-A (using the entire dataset) and Model-B (using selected features from Feature Selection with RFE).

1. **Model-A (Using Entire Dataset):**
- Accuracy: 0.61
- This model uses all available features in the dataset.
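- While it achieves a moderate accuracy, the irrelevant or redundant features and the very different feature scales (no scaling was applied) may be hurting the fit. Comparing the training and test accuracy would show whether overfitting is involved.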
2. **Model-B (Using Selected Features from RFE):**
- Accuracy: 0.25
- This model selects features using Recursive Feature Elimination (RFE), which aims to identify the most relevant features for prediction.
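- An accuracy of 0.25 is what random guessing would achieve on four roughly balanced classes, so the two features chosen by RFE (clock_speed and n_cores) carry almost no information about the price range. This agrees with their near-zero correlations with price_range in Section 2 and suggests that the feature selection step did not capture the key predictors.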
3. **Model-C (Using Selected Features Based on Correlation):**
- Accuracy: 0.84
- This model selects features based on the correlation with the target variable (price of mobile).
- The high accuracy indicates that the selected features, ram and battery_power, have a strong relationship with the target variable, enabling the model to make more accurate predictions.

In conclusion, Model-C is the best performer among the three models, with a significantly higher test accuracy. This is most likely because its features were chosen for their strong correlation with the target variable, which gives a more predictive model.
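To judge whether an accuracy such as 0.25 is poor, it helps to compare it with a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up, perfectly balanced four-class data; whether the real classes are balanced is an assumption that could be checked with `value_counts()` on the target column.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))  # features are ignored by the baseline
y = np.arange(n) % 4         # four perfectly balanced classes (assumption)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111, stratify=y
)

# Always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"baseline accuracy: {baseline_acc:.2f}")
```

A model whose test accuracy is no better than this baseline has learned nothing useful from its features.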

# 5. Train a KNN model (similar to LR Model)
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.metrics import accuracy_score

In [53]:
# Train a KNN model
from sklearn.neighbors import KNeighborsClassifier
k = KNeighborsClassifier(n_neighbors= 4)
k.fit(X_train_A, y_train_A)

In [54]:
# Get the accuracy of the KNN model (trained on all features, i.e. the Model-A split)
y_pred_knn = k.predict(X_test_A)
print("Accuracy on KNN with all features:", accuracy_score(y_test_A, y_pred_knn))

Accuracy on KNN with all features: 0.9223057644110275


**COMPARISON BETWEEN LOGISTIC MODELS AND KNN MODEL**

As we observe:

**1. Logistic Regression Models:**
- Model-A: Accuracy = 0.61
- Model-B: Accuracy = 0.25
- Model-C: Accuracy = 0.84

**2. KNN Model:**
- Accuracy = 0.92


---

**Conclusion**

Thus, we can compare them in various perspectives as follows:
- **Accuracy:** The KNN model demonstrates the highest accuracy among all models, with an accuracy score of 0.92. This indicates that the KNN model is better at predicting the price range of mobile phones compared to logistic regression models.
- **Model Complexity:** KNN typically makes fewer assumptions about the underlying data distribution than logistic regression, making it more flexible. However, it may require more computational resources during prediction due to its instance-based nature.
- **Feature Selection:** Feature selection clearly mattered for logistic regression: the model with features selected by correlation (Model-C) achieved the highest accuracy among the logistic regression models. The KNN model here was trained on all 20 features, so its sensitivity to feature selection was not tested.
- **Interpretability:** Logistic regression models provide coefficients that indicate the strength and direction of the relationship between each feature and the target variable. This makes logistic regression models more interpretable compared to KNN, which operates based on distances between data points.

In summary, while logistic regression models can be interpretable and effective with proper feature selection, the KNN model in this scenario outperforms logistic regression in terms of accuracy. However, the choice between these models ultimately depends on further factors (e.g. interpretability, computational resources, and the specific requirements of the problem at hand).
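The interpretability point can be made concrete by reading the fitted coefficients. The sketch below trains a logistic regression on synthetic, standardized features and lays out `coef_` as a table with one row per class; the data-generating rule is an assumption, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in for the two Model-C features (values are made up).
X = pd.DataFrame({
    "ram": rng.uniform(256, 4000, n),
    "battery_power": rng.uniform(500, 2000, n),
})
y = pd.qcut(X["ram"] + X["battery_power"], 4, labels=False)

# Standardize so coefficients are comparable between features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# One row per class, one column per feature.
coef_table = pd.DataFrame(
    model.coef_,
    index=[f"class {c}" for c in model.classes_],
    columns=X.columns,
)
print(coef_table.round(2))
```

A positive entry means that a higher value of the feature raises the score of that class; KNN offers no comparable summary.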

# 6. Hyperparameter tuning - K in KNN

In [55]:
from sklearn.model_selection import GridSearchCV
parameters = {'n_neighbors': range(1, 30)}
clf = GridSearchCV(k, parameters)
clf.fit(X_train_A, y_train_A)

In [56]:
# Find the best k parameters
clf.best_params_

{'n_neighbors': 13}

In [57]:
# Find the best score with the best k parameters
clf.best_score_

0.9335854231974923
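The task also asks for a visualization of how K affects performance. A fitted `GridSearchCV` stores the mean score of every candidate in `cv_results_`, so a line chart can be drawn directly from it. The sketch below uses synthetic data; note that `mean_test_score` refers to the held-out folds of the cross-validation, not to the notebook's separate test set.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class data standing in for the training set.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

parameters = {"n_neighbors": range(1, 30)}
clf = GridSearchCV(KNeighborsClassifier(), parameters)
clf.fit(X, y)

# One entry per candidate K, in the order the grid was searched.
ks = [params["n_neighbors"] for params in clf.cv_results_["params"]]
mean_scores = clf.cv_results_["mean_test_score"]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(ks, mean_scores, marker="o")
ax.axvline(clf.best_params_["n_neighbors"], linestyle="--", color="grey")
ax.set_xlabel("K (n_neighbors)")
ax.set_ylabel("Mean cross-validated accuracy")
ax.set_title("KNN: K versus mean cross-validated accuracy")
fig.tight_layout()
# In a notebook with %matplotlib inline the figure renders after the cell.
```

Very small K tends to follow noise in individual neighbors, while very large K smooths over class boundaries, so the curve usually peaks somewhere in between.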

**Comparison: KNN model with/ without hyperparameter tuning**

1. KNN Model with Hyperparameter Tuning:
- Best k: 13
- Best accuracy: 0.93 (mean cross-validated accuracy on the training set)
2. KNN Model without Hyperparameter Tuning:
- k: 4
- Accuracy: 0.92 (test set)

---

**Conclusion**

**1. Advantages:**

- **Improved Accuracy:** The tuned KNN model reached a mean cross-validated accuracy of 0.93 on the training set, slightly above the 0.92 test accuracy obtained with k = 4. The two figures are measured on different data, so they are not strictly comparable, but they suggest that optimizing the number of neighbors (k) led to a small improvement in model performance.
- **Optimized K:** The optimal value of k found through hyperparameter tuning was 13. This suggests that considering a larger number of neighbors than the ad-hoc k = 4 yielded the best results for this dataset.

**2. Disadvantages:**
- **Increased Computational Cost:** Although hyperparameter tuning led to improved accuracy, it also required additional computational resources. GridSearchCV exhaustively searches through the specified parameter grid, which can be computationally intensive, especially for larger datasets or when the parameter space is large.
- **Consideration of Trade-offs:** While hyperparameter tuning improved accuracy, it's important to consider the trade-offs between model performance and computational cost. In some cases, the marginal improvement in accuracy may not justify the increased computational overhead associated with hyperparameter tuning.


In summary, while hyperparameter tuning in KNN resulted in a slightly higher accuracy compared to the model without tuning, it's essential to assess whether the incremental improvement justifies the additional computational cost, especially in real-world applications where computational efficiency is crucial.
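For a like-for-like comparison, both the ad-hoc and the tuned model can be scored on the same held-out test set. The sketch below does this on synthetic data; it relies on `GridSearchCV` refitting the best estimator on the full training set by default, so `search.score` evaluates that refitted model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class data standing in for the mobile dataset.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111
)

# Ad-hoc choice of K, as in Section 5.
adhoc = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)
adhoc_test_acc = adhoc.score(X_test, y_test)

# Tuned K: the search only ever sees the training data.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 30)})
search.fit(X_train, y_train)
tuned_test_acc = search.score(X_test, y_test)  # uses the refitted best model

print("best K:", search.best_params_["n_neighbors"])
print(f"k=4 test accuracy:   {adhoc_test_acc:.3f}")
print(f"tuned test accuracy: {tuned_test_acc:.3f}")
```

The tuned model is not guaranteed to win on the test set, since K was chosen by cross-validation on the training data.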