# Smartphone Specification Tier Analysis  
## Machine Learning Coursework — KH6001CMD  

This notebook develops a full machine learning workflow to analyze smartphone hardware
specifications and identify natural performance tiers using **unsupervised learning**.

### Project Overview  
We apply the complete machine learning pipeline required by the coursework, including:

- Loading and exploring a real smartphone specifications dataset  
- Cleaning, preprocessing, and engineering numeric features  
- Scaling numerical attributes and encoding categorical variables  
- Applying **K-Means clustering** to discover hidden groups in the data  
- Using **PCA** to visualize cluster separation  
- Interpreting clusters based on mean hardware specifications  
- Assigning final smartphone tiers: **Budget, Midrange, Flagship**  
- Evaluating clustering performance using silhouette score  
- Visualizing differences between the tiers  

This notebook forms the technical basis for the accompanying written report.


# Section 1 — Import Libraries & Load Dataset

In this section, we import all required Python libraries and load our original smartphone dataset.

We verify:
- the dataset loads correctly
- the file shape (rows × columns)
- the first few rows to understand the structure

This establishes the foundation for all preprocessing steps.


In [1]:
# Section 1 — Import libraries & load dataset

import pandas as pd
import numpy as np
import re

# Visualization
import plotly.express as px

# Machine Learning
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# Load dataset
df = pd.read_csv("smartphones_original.csv")

# Show shape
print("Original dataset shape:", df.shape)

# Preview
df.head()


Original dataset shape: (4144, 80)


Unnamed: 0,brand,model,device_type,release_date,status,architecture,aspect_ratio,audio_jack,battery_capacity,battery_placement,...,usb_otg,usb_type_c,user_interface,video,volte,waterproof,weight,width,wifi_hotspot,wlan
0,Oppo,K13 Turbo Pro (512GB),Smartphone,25 July 2025,Available,64 bit,19.5:9,USB Type-C,Li-Po 7000mAh,,...,Yes,USB Type-C 2.0,ColorOS 15,"4K@30/60fps, 1080p@30fps",Yes,Water resistant (up to 2m for 30 min),208 grams,77.2 mm,,"Wi-Fi 7 (802.11 a/b/g/n/ac/be/ax) 5GHz 6GHz, MIMO"
1,Oppo,K13 Turbo (512GB),Smartphone,25 July 2025,Available,64 bit,19.5:9,USB Type-C,Li-Po 7000mAh,,...,Yes,USB Type-C 2.0,ColorOS 15,"4K@30/60fps, 1080p@30fps",Yes,Water resistant (up to 2m for 30 min),207 grams,77.2 mm,,"Wi-Fi 6 (802.11 a/b/g/n/ac/ax) 5GHz, MIMO"
2,Vivo,Y400,Smartphone,Not announced yet,Rumored,64 bit,,USB Type-C,Li-Po 6000mAh,,...,Yes,USB Type-C 2.0,Funtouch 15,"1080p@30fps, gyro-EIS",Yes,Water resistant (up to 1.5m for 30 min),,,,Wi-Fi 5 (802.11 a/b/g/n/ac) 5GHz
3,Infinix,Hot 60 Pro Plus (256GB),Smartphone,25 July 2025,Available,64 bit,20:9,3.5 mm,Li-Po 5160mAh,,...,Yes,USB Type-C 2.0,XOS 15.1,"1440p@30fps, 1080p@30/60fps",Yes,Splash proof,155 grams,75.8 mm,,Wi-Fi 5 (802.11 a/b/g/n/ac) 5GHz
4,Realme,Note 70T,Smartphone,Exp. 31 July 2025,Upcoming,64 bit,20:9,3.5 mm,Li-Po 6000mAh,,...,Yes,USB Type-C 2.0,Realme UI 5.0,1080p@30fps,Yes,Splash proof,201 grams,76.6 mm,,Wi-Fi 5 (802.11 a/b/g/n/ac) 5GHz


# Section 2 — Extract Essential Numeric Features Only

The raw dataset contains messy text fields such as:

- `"4500 mAh"`
- `"6.6 inches"`
- `"1080 x 2400 pixels"`
- `"50 MP"`
- `"2.2 GHz Octa Core"`
- `"128GB"`
- `"120 Hz"`
- `"80W wired"`

Since our final ML model uses only a *core subset* of specification-based features,  
we extract **only the features we will use later**, avoiding unnecessary work.

### Extracted numeric features:
- battery_mAh  
- weight_g  
- width_mm, height_mm, thickness_mm  
- screen_inches  
- pixel_density  
- rear_mp  
- cpu_speed_GHz  
- cpu_cores  
- internal_storage  
- ram  
- refresh_Hz  
- charging_watts  
- expandable_GB  
- sim_count  
- brand (categorical)

This keeps the notebook clean, efficient, and aligned with the final model.
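As a rough illustration of the extraction idea, the sketch below pulls the number that precedes a unit out of a few sample strings, using only the standard `re` module. The helper name `extract_number` and the sample strings are hypothetical, chosen to resemble the fields listed above; the notebook itself relies on pandas `str.extract`.

```python
import re

def extract_number(text, unit):
    """Return the first number directly followed by `unit`, or None.

    Hypothetical helper for illustration only.
    """
    pattern = rf'(\d+(?:\.\d+)?)\s*{re.escape(unit)}'
    match = re.search(pattern, str(text), flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

# Sample strings modelled on the raw fields (assumed formats)
samples = {
    "battery": ("Li-Po 7000mAh", "mAh"),
    "screen": ("6.6 inches", "inches"),
    "cpu": ("2.2 GHz Octa Core", "GHz"),
    "storage": ("128GB", "GB"),
    "missing": ("nan", "mAh"),
}

parsed = {name: extract_number(text, unit) for name, (text, unit) in samples.items()}
print(parsed)
```

Anchoring the pattern on the unit, rather than taking the first run of digits, is one way to avoid picking up an unrelated number that happens to appear earlier in the string.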


In [2]:
# Section 2 — Minimal Numeric Feature Extraction

df_clean = df.copy()

# ---------------------------
# BATTERY
# ---------------------------
df_clean['battery_mAh'] = (
    df_clean['battery_capacity']
    .astype(str).str.extract(r'(\d+)').astype(float)
)

# ---------------------------
# WEIGHT
# ---------------------------
df_clean['weight_g'] = (
    df_clean['weight']
    .astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
)

# ---------------------------
# DIMENSIONS
# ---------------------------
df_clean['width_mm'] = df_clean['width'].astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
df_clean['height_mm'] = df_clean['height'].astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
df_clean['thickness_mm'] = df_clean['thickness'].astype(str).str.extract(r'(\d+\.?\d*)').astype(float)

# ---------------------------
# SCREEN SIZE
# ---------------------------
df_clean['screen_inches'] = (
    df_clean['screen_size']
    .astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
)

# ---------------------------
# PIXEL DENSITY
# ---------------------------
df_clean['pixel_density'] = (
    df_clean['pixel_density']
    .astype(str).str.extract(r'(\d+\.?\d*)').astype(float)
)

# ---------------------------
# REAR CAMERA
# ---------------------------
df_clean['rear_mp'] = (
    df_clean['primary_camera_resolution']
    .astype(str).str.extract(r'(\d+)').astype(float)
)

# ---------------------------
# CPU SPEED
# ---------------------------
df_clean['cpu_speed_GHz'] = (
    df_clean['cpu']
    .astype(str).str.extract(r'(\d+\.?\d*)\s*GHz').astype(float)
)

# ---------------------------
# CPU CORES
# ---------------------------
df_clean['cpu_cores'] = (
    df_clean['cpu']
    .astype(str).str.extract(r'(\d+)\s*core', flags=re.IGNORECASE)
).astype(float)

# ---------------------------
# INTERNAL STORAGE
# ---------------------------
df_clean['internal_storage'] = (
    df_clean['internal_storage']
    .astype(str).str.extract(r'(\d+)\s*GB').astype(float)
)

# ---------------------------
# RAM
# ---------------------------
df_clean['ram'] = (
    df_clean['ram']
    .astype(str).str.extract(r'(\d+)\s*GB').astype(float)
)

# ---------------------------
# REFRESH RATE
# ---------------------------
df_clean['refresh_Hz'] = (
    df_clean['refresh_rate']
    .astype(str).str.extract(r'(\d+)').astype(float)
)

# ---------------------------
# CHARGING WATTS
# ---------------------------
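def extract_max_watt(val):
    # Compare wattages as numbers: max() on strings would rank "9" above "80".
    # The pattern also accepts decimals such as "67.5W".
    matches = re.findall(r'(\d+(?:\.\d+)?)\s*W', str(val), flags=re.IGNORECASE)
    return max(float(m) for m in matches) if matches else np.nan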

df_clean['charging_watts'] = df_clean['quick_charging'].apply(extract_max_watt)

# ---------------------------
# EXPANDABLE MEMORY (GB)
# ---------------------------
def extract_expandable(val):
    s = str(val).upper()
    # Match "NO" as a whole word so values such as "NANO ..." are not misread
    if re.search(r'\bNO\b', s):
        return 0
    tb = re.findall(r'(\d+)\s*TB', s)
    gb = re.findall(r'(\d+)\s*GB', s)
    if tb:
        return float(tb[0]) * 1024
    if gb:
        return float(gb[0])
    return 0

df_clean['expandable_GB'] = df_clean['expandable_memory'].apply(extract_expandable)

# ---------------------------
# SIM COUNT
# ---------------------------
def extract_sim_count(val):
    val = str(val).lower()
    if "triple" in val: return 3
    if "dual" in val: return 2
    if "single" in val: return 1
    return np.nan

df_clean['sim_count'] = df_clean['sim_slot'].apply(extract_sim_count)

# ---------------------------
# Keep brand as a category
# ---------------------------
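# Fill before casting: astype(str) alone would turn NaN into the literal string "nan"
df_clean['brand'] = df_clean['brand'].fillna("Unknown").astype(str)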

df_clean.head()


Unnamed: 0,brand,model,device_type,release_date,status,architecture,aspect_ratio,audio_jack,battery_capacity,battery_placement,...,width_mm,height_mm,thickness_mm,screen_inches,rear_mp,cpu_speed_GHz,refresh_Hz,charging_watts,expandable_GB,sim_count
0,Oppo,K13 Turbo Pro (512GB),Smartphone,25 July 2025,Available,64 bit,19.5:9,USB Type-C,Li-Po 7000mAh,,...,77.2,162.8,7.3,6.8,50.0,3.21,120.0,80.0,0.0,2.0
1,Oppo,K13 Turbo (512GB),Smartphone,25 July 2025,Available,64 bit,19.5:9,USB Type-C,Li-Po 7000mAh,,...,77.2,162.8,7.3,6.8,50.0,3.25,120.0,80.0,0.0,2.0
2,Vivo,Y400,Smartphone,Not announced yet,Rumored,64 bit,,USB Type-C,Li-Po 6000mAh,,...,,,,6.67,50.0,2.2,120.0,44.0,0.0,2.0
3,Infinix,Hot 60 Pro Plus (256GB),Smartphone,25 July 2025,Available,64 bit,20:9,3.5 mm,Li-Po 5160mAh,,...,75.8,164.0,6.0,6.78,50.0,2.2,144.0,45.0,0.0,2.0
4,Realme,Note 70T,Smartphone,Exp. 31 July 2025,Upcoming,64 bit,20:9,3.5 mm,Li-Po 6000mAh,,...,76.6,167.2,7.9,6.74,50.0,1.8,90.0,15.0,0.0,2.0


# Section 3 — Column Cleanup & Final Feature Selection

After extracting all essential specification-based numeric features in Section 2,
we now remove all unused, irrelevant, or redundant columns from the original dataset.

This step ensures that:
- the dataset remains clean and compact
- only meaningful features are kept for machine learning
- unnecessary preprocessing steps are avoided

### Final features we will keep:
- battery_mAh  
- weight_g  
- width_mm, height_mm, thickness_mm  
- screen_inches  
- pixel_density  
- rear_mp  
- cpu_speed_GHz  
- cpu_cores  
- internal_storage  
- ram  
- refresh_Hz  
- charging_watts  
- expandable_GB  
- sim_count  
- brand (categorical)

All other columns are left out of the final dataset (`df_final`) to simplify it; `df_clean` still holds the full set of columns.


In [3]:
# Section 3 — Keep only the final selected features

final_features = [
    'battery_mAh', 'weight_g', 'width_mm', 'height_mm', 'thickness_mm',
    'screen_inches', 'pixel_density', 'rear_mp',
    'cpu_speed_GHz', 'cpu_cores', 'internal_storage', 'ram',
    'refresh_Hz', 'charging_watts', 'expandable_GB', 'sim_count',
    'brand'
]

# Create the final dataset
df_final = df_clean[final_features].copy()

print("Final dataset shape:", df_final.shape)
df_final.head()


Final dataset shape: (4144, 17)


Unnamed: 0,battery_mAh,weight_g,width_mm,height_mm,thickness_mm,screen_inches,pixel_density,rear_mp,cpu_speed_GHz,cpu_cores,internal_storage,ram,refresh_Hz,charging_watts,expandable_GB,sim_count,brand
0,7000.0,208.0,77.2,162.8,7.3,6.8,453.0,50.0,3.21,,512.0,12.0,120.0,80.0,0.0,2.0,Oppo
1,7000.0,207.0,77.2,162.8,7.3,6.8,453.0,50.0,3.25,,512.0,12.0,120.0,80.0,0.0,2.0,Oppo
2,6000.0,,,,,6.67,,50.0,2.2,,128.0,8.0,120.0,44.0,0.0,2.0,Vivo
3,5160.0,155.0,75.8,164.0,6.0,6.78,440.0,50.0,2.2,,256.0,8.0,144.0,45.0,0.0,2.0,Infinix
4,6000.0,201.0,76.6,167.2,7.9,6.74,260.0,50.0,1.8,,128.0,4.0,90.0,15.0,0.0,2.0,Realme
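The preview above shows `cpu_cores` as missing for the first five rows. One plausible cause, assuming the `cpu` field spells out core counts as in the `"2.2 GHz Octa Core"` example from Section 2, is that a digit-based pattern never matches words such as "Octa". A minimal sketch of a word-to-number fallback follows; `CORE_WORDS` and `parse_core_count` are hypothetical names and not part of the notebook.

```python
import re

# Assumed vocabulary for spelled-out core counts (hypothetical, extend as needed)
CORE_WORDS = {
    "single": 1, "dual": 2, "quad": 4,
    "hexa": 6, "octa": 8, "deca": 10,
}

def parse_core_count(text):
    """Return the core count as a float, or NaN if none is found."""
    s = str(text).lower()

    # First try an explicit digit, e.g. "8 cores" or "8-core"
    m = re.search(r'(\d+)\s*-?\s*core', s)
    if m:
        return float(m.group(1))

    # Fall back to a spelled-out count, e.g. "octa core" or "hexa-core"
    m = re.search(r'\b(' + '|'.join(CORE_WORDS) + r')\s*-?\s*core', s)
    if m:
        return float(CORE_WORDS[m.group(1)])

    return float('nan')

print(parse_core_count("2.2 GHz Octa Core"))
print(parse_core_count("8 cores"))
print(parse_core_count("Hexa-core"))
```

If the real `cpu` strings follow a different convention, the word list and patterns would need adjusting; checking `df['cpu'].value_counts()` first would show which formats actually occur.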


# Section 4 — Missing Value Handling

To prepare the dataset for clustering and other machine learning algorithms, we must ensure that it contains no missing values.

### Why handle missing values?
- Many ML algorithms (including KMeans) do not accept NaNs.
- Rows with missing numeric values should not be dropped, as this would remove useful samples.
- Filling missing values maintains dataset size and respects the coursework requirement of having at least 1000 rows.

### Strategy:
- **Numeric columns** → fill with **median**  
  (medians are robust to outliers; note that imputation concentrates values at the median and so reduces each feature's variance)

- **Categorical columns (brand)** → fill with `"Unknown"`

After this step, the dataset will be complete and ready for scaling and clustering.
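The strategy above can be sketched on a small toy frame. The data and the `ram_was_missing` flag column are made up for illustration; the flag is an optional extra that records which values were imputed and is not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing RAM value, one missing brand, one outlier
toy = pd.DataFrame({
    "ram": [4.0, 8.0, np.nan, 6.0, 64.0],   # 64 is a deliberate outlier
    "brand": ["Oppo", None, "Vivo", "Oppo", "Realme"],
})

numeric_cols = toy.select_dtypes(include="number").columns

# Optional: remember which rows were imputed before filling them
toy["ram_was_missing"] = toy["ram"].isna()

print("mean of observed RAM:  ", toy["ram"].mean())    # pulled up by the outlier
print("median of observed RAM:", toy["ram"].median())  # unaffected by it

# Assign the filled columns back rather than filling in place
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
toy["brand"] = toy["brand"].fillna("Unknown")

print(toy)
```

With the outlier present, the mean of the observed values is 20.5 while the median is 7.0, which is why the median is the safer fill value here.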


In [4]:
# Section 4 — Missing Value Handling

# Identify numeric and categorical columns
numeric_cols_final = df_final.select_dtypes(include=['float64','int64']).columns
categorical_cols_final = df_final.select_dtypes(include=['object']).columns

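# Fill numeric NaNs with median
# (assign back instead of using inplace=True on a column selection, which
#  recent pandas versions warn about and may apply to a temporary copy)
for col in numeric_cols_final:
    df_final[col] = df_final[col].fillna(df_final[col].median())

# Fill categorical NaNs with 'Unknown'
for col in categorical_cols_final:
    df_final[col] = df_final[col].fillna("Unknown")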

# Confirm no NaNs remain
print("Remaining NaN values per column:")
print(df_final.isna().sum().sort_values(ascending=False).head(10))

# Final shape confirmation
print("\nFinal dataset shape after missing-value handling:", df_final.shape)


Remaining NaN values per column:
battery_mAh         0
cpu_cores           0
sim_count           0
expandable_GB       0
charging_watts      0
refresh_Hz          0
ram                 0
internal_storage    0
cpu_speed_GHz       0
weight_g            0
dtype: int64

Final dataset shape after missing-value handling: (4144, 17)


# Section 5 — Feature Scaling & Encoding

Clustering algorithms such as KMeans rely on distance calculations.  
To prevent certain features from dominating the distance metrics simply due to scale differences (e.g., battery capacity vs. RAM vs. weight), we apply the following preprocessing:

### 1. Standard Scaling (numeric features)
All numeric features are scaled using **StandardScaler**, which transforms values to:
- mean = 0  
- standard deviation = 1  

This puts the hardware specifications on a comparable scale, so that no single feature dominates the distance calculation because of its units.

### 2. One-Hot Encoding (categorical feature)
The `brand` column is the only categorical feature kept in the final dataset.  
We use **OneHotEncoder** to convert it into numerical binary columns.

### 3. ColumnTransformer
We combine the numeric scaler and categorical encoder into a single unified transformation pipeline.

After transforming the dataset, we obtain a fully numerical, scaled matrix suitable for PCA and KMeans.
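A minimal sketch of the same three steps on a toy frame may help make the output shape concrete. The data is made up, and `sparse_threshold=0` is added here only so that the result is a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): two numeric features and three distinct brands
toy = pd.DataFrame({
    "battery_mAh": [3000.0, 4500.0, 5000.0, 7000.0],
    "ram": [2.0, 4.0, 8.0, 12.0],
    "brand": ["Oppo", "Vivo", "Oppo", "Realme"],
})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["battery_mAh", "ram"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = pre.fit_transform(toy)

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
manual = (toy["ram"] - toy["ram"].mean()) / toy["ram"].std(ddof=0)

print("shape:", X.shape)  # 2 scaled numeric columns + 3 one-hot brand columns
print("matches manual z-score:", np.allclose(X[:, 1], manual))
```

By the same arithmetic, a 93-column result from 16 numeric features would correspond to 77 one-hot brand columns. These 0/1 columns are not standardized, so how much `brand` influences the distances is worth checking separately.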


In [5]:
# Section 5 — Scaling & Encoding

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify numeric and categorical columns
numeric_cols = df_final.select_dtypes(include=['float64','int64']).columns.tolist()
categorical_cols = ['brand']  # only one categorical feature kept

# Define preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

# Fit + transform the dataset
X_prepared = preprocessor.fit_transform(df_final)

print("Transformed feature matrix shape:", X_prepared.shape)


Transformed feature matrix shape: (4144, 93)


# Section 6 — KMeans Clustering (3 Clusters) + PCA Visualization

Earlier attempts with 4 or 5 clusters produced mixed and unstable cluster boundaries.
Three clusters gave the most interpretable grouping, matching the three performance groups commonly used to describe smartphones:

- **Budget**  
- **Midrange**  
- **Flagship**

To obtain clear and interpretable clusters that reflect actual specification tiers,
we apply **KMeans with 3 clusters**.

We then reduce the dataset to 2 principal components (PCA) to visualize the cluster
separation in 2D space.
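One common way to support a choice of cluster count is to compare silhouette scores across several values of k. The sketch below does this on synthetic, well-separated blobs rather than on the smartphone data, so the numbers it prints say nothing about this dataset; `X_demo` and the range of k are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three clearly separated groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    # n_init=10 restarts KMeans from 10 initializations and keeps the best run
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

Running the same loop on `X_prepared` would show whether k=3 also scores best on the real feature matrix, or whether it is mainly the most interpretable option.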


In [20]:
# ========================================================
# SECTION 6 — KMeans (3 clusters) + PCA Visualization
# ========================================================

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import plotly.express as px

# --------------------------------------------------------
# 6.1 Apply KMeans with 3 clusters
# --------------------------------------------------------
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X_prepared)

df_final['cluster'] = cluster_labels

# --------------------------------------------------------
# 6.2 PCA transformation for 2D visualization
# --------------------------------------------------------
pca = PCA(n_components=2, random_state=42)
pca_results = pca.fit_transform(X_prepared)

df_final['pca1'] = pca_results[:, 0]
df_final['pca2'] = pca_results[:, 1]

# --------------------------------------------------------
# 6.3 Plot PCA clusters
# --------------------------------------------------------
fig = px.scatter(
    df_final,
    x='pca1',
    y='pca2',
    color='cluster',
    title='KMeans Clustering (3 Clusters) — PCA 2D Visualization',
    hover_name='brand',
    hover_data=['ram', 'cpu_speed_GHz', 'internal_storage', 'battery_mAh'],
    color_continuous_scale='Turbo'
)

fig.show()
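A 2D PCA plot is only as faithful as the share of variance its two components retain. The sketch below shows how to read that share from a fitted PCA object, using synthetic data as a stand-in; `X_demo` and `pca_demo` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 6 features, the first two strongly correlated
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 4)),
])

pca_demo = PCA(n_components=2, random_state=0)
pca_demo.fit(X_demo)

ratios = pca_demo.explained_variance_ratio_
print("Variance retained per component:", ratios.round(3))
print("Total retained by the 2D view:  ", round(float(ratios.sum()), 3))
```

For the notebook's own model, `pca.explained_variance_ratio_.sum()` on the fitted `pca` object gives the equivalent figure. If it is low, clusters that overlap in the 2D plot may still be better separated in the full feature space.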


# Section 7 — Cluster Interpretation & Performance Tier Assignment

Using the 3-cluster KMeans model, we now interpret the clusters based on their average
hardware specifications.

We compute the mean specifications in each cluster, then generate a simple “strength score”
to rank clusters from **weakest → strongest**.

Finally, we assign each cluster to one of the final smartphone tiers:

- **Budget**  
- **Midrange**  
- **Flagship**

The clustering itself is unsupervised. The tier names are assigned afterwards by ranking the
clusters with a composite score whose weights are chosen by hand, so the labels depend on that
weighting as well as on the cluster means.


In [22]:
# ========================================================
# SECTION 7 — Interpret Clusters & Assign Final Tiers
# ========================================================

# --------------------------------------------------------
# 7.1 Compute mean specs per cluster
# --------------------------------------------------------
cluster_summary = df_final.groupby('cluster')[[
    'ram',
    'internal_storage',
    'cpu_speed_GHz',
    'battery_mAh',
    'rear_mp'
]].mean().round(2)

print("Cluster Summary (Mean Specification Values):")
display(cluster_summary)

# --------------------------------------------------------
# 7.2 Rank clusters from weakest → strongest
#     (based on a composite score of RAM, CPU, Storage, Battery)
# --------------------------------------------------------
strength_score = (
    cluster_summary['ram'] +
    cluster_summary['cpu_speed_GHz'] * 2 +
    cluster_summary['internal_storage'] / 64 +
    cluster_summary['battery_mAh'] / 2000
)

ranked_clusters = strength_score.sort_values().index.tolist()

print("\nClusters ranked from weakest → strongest:")
print(ranked_clusters)

# --------------------------------------------------------
# 7.3 Assign the final tiers using centroid ranking
# --------------------------------------------------------
tier_map = {
    ranked_clusters[0]: "Budget",
    ranked_clusters[1]: "Midrange",
    ranked_clusters[2]: "Flagship",
}

df_final['spec_tier'] = df_final['cluster'].map(tier_map)

# --------------------------------------------------------
# 7.4 Preview labeled dataset
# --------------------------------------------------------
df_final[['brand','ram','cpu_speed_GHz','internal_storage','spec_tier']].head(10)


Cluster Summary (Mean Specification Values):


cluster,ram,internal_storage,cpu_speed_GHz,battery_mAh,rear_mp
0,11.08,296.01,3.02,5435.59,64.19
1,3.04,27.97,1.48,2357.03,8.48
2,5.46,112.03,2.17,4724.22,38.59



Clusters ranked from weakest → strongest:
[1, 2, 0]


Unnamed: 0,brand,ram,cpu_speed_GHz,internal_storage,spec_tier
0,Oppo,12.0,3.21,512.0,Flagship
1,Oppo,12.0,3.25,512.0,Flagship
2,Vivo,8.0,2.2,128.0,Midrange
3,Infinix,8.0,2.2,256.0,Flagship
4,Realme,4.0,1.8,128.0,Midrange
5,Motorola,8.0,2.4,256.0,Midrange
6,Samsung,6.0,2.4,128.0,Midrange
7,Oppo,12.0,3.21,256.0,Flagship
8,Oppo,12.0,3.25,256.0,Flagship
9,Samsung,6.0,2.4,128.0,Midrange
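The ranking printed above can be reproduced by hand from the cluster summary. The sketch below copies the reported cluster means into a plain dictionary and applies the same weights as step 7.2; the names `cluster_means` and `strength` are hypothetical.

```python
# Cluster means copied from the cluster summary output
cluster_means = {
    0: {"ram": 11.08, "internal_storage": 296.01, "cpu_speed_GHz": 3.02, "battery_mAh": 5435.59},
    1: {"ram": 3.04, "internal_storage": 27.97, "cpu_speed_GHz": 1.48, "battery_mAh": 2357.03},
    2: {"ram": 5.46, "internal_storage": 112.03, "cpu_speed_GHz": 2.17, "battery_mAh": 4724.22},
}

def strength(m):
    # Same hand-chosen weights as the composite score in step 7.2
    return (
        m["ram"]
        + m["cpu_speed_GHz"] * 2
        + m["internal_storage"] / 64
        + m["battery_mAh"] / 2000
    )

scores = {c: round(strength(m), 2) for c, m in cluster_means.items()}
ranked = sorted(scores, key=scores.get)  # weakest to strongest

print(scores)
print(ranked)  # → [1, 2, 0]
```

The three scores are far apart (roughly 7.6, 13.9, and 24.5), so moderate changes to the weights would not alter the ordering here. On a dataset with closer cluster means, the hand-chosen weights could decide the ranking.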


# Section 8 — Cluster Evaluation & Visualization

In this section, we evaluate the quality of the smartphone clusters produced by KMeans
and visualize how the three performance tiers (Budget, Midrange, Flagship) differ
across key specifications such as RAM, CPU speed, and storage.

We include:

- Silhouette Score  
- Visual comparisons between tiers  
- Exporting the final dataset  
- Interpretation of the results  


In [23]:
from sklearn.metrics import silhouette_score

# Compute silhouette score for 3-cluster KMeans output
sil_score = silhouette_score(X_prepared, df_final['cluster'])
print("Silhouette Score:", sil_score)


Silhouette Score: 0.22670461629054425
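The silhouette score ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, and values near 0 indicate clusters that overlap. To give a feel for the scale, the sketch below compares tight and heavily overlapping synthetic blobs. The data is generated for illustration only and `demo_score` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[0, 0], [10, 10], [-10, 10]]

def demo_score(cluster_std):
    """Silhouette of a 3-cluster KMeans fit on blobs with the given spread."""
    X_demo, _ = make_blobs(
        n_samples=300, centers=centers,
        cluster_std=cluster_std, random_state=0,
    )
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
    return silhouette_score(X_demo, labels)

tight = demo_score(1.0)   # small spread: clusters barely touch
loose = demo_score(6.0)   # large spread: clusters bleed into each other

print(f"well separated: {tight:.3f}")
print(f"overlapping:    {loose:.3f}")
```

Read against this scale, a score of about 0.23 suggests that the three smartphone clusters overlap noticeably in the prepared feature space, which is plausible for specifications that vary along a continuum.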


In [24]:
import plotly.express as px

# --- Average RAM per Tier ---
fig1 = px.bar(
    df_final.groupby('spec_tier')['ram'].mean().reset_index(),
    x='spec_tier',
    y='ram',
    title='Average RAM per Tier'
)
fig1.show()

# --- Average CPU Speed per Tier ---
fig2 = px.bar(
    df_final.groupby('spec_tier')['cpu_speed_GHz'].mean().reset_index(),
    x='spec_tier',
    y='cpu_speed_GHz',
    title='Average CPU Speed per Tier'
)
fig2.show()

# --- Boxplot: RAM distribution ---
fig3 = px.box(
    df_final,
    x='spec_tier',
    y='ram',
    title='RAM Distribution Across Tiers'
)
fig3.show()

# --- Scatter Plot: RAM vs CPU Speed ---
fig4 = px.scatter(
    df_final,
    x='cpu_speed_GHz',
    y='ram',
    color='spec_tier',
    title='RAM vs CPU Speed by Tier',
    hover_name='brand',
    hover_data=['internal_storage','battery_mAh']
)
fig4.show()


In [25]:
df_final.to_csv("smartphones_final_with_tiers.csv", index=False)
print("Exported: smartphones_final_with_tiers.csv")


Exported: smartphones_final_with_tiers.csv


### Interpretation

The clustering results reveal three meaningful smartphone performance tiers:

#### **Budget Tier**
- Lowest values for RAM, CPU speed, storage, battery, and camera resolution (cluster means: about 3 GB RAM, 1.5 GHz, 28 GB storage, 2,357 mAh, 8.5 MP)  
- Typically older or entry-level devices with limited performance capabilities  

#### **Midrange Tier**
- Moderate RAM (4–8 GB; cluster mean about 5.5 GB), mid-level CPU speeds (mean about 2.2 GHz), and 64–256 GB storage (mean about 112 GB)  
- Balanced devices offering good performance without premium hardware  

#### **Flagship Tier**
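- Highest RAM (8–16 GB; cluster mean about 11 GB), fastest CPUs (mean about 3.0 GHz), 256–512 GB storage (mean about 296 GB)  
- Largest batteries (mean about 5,436 mAh) and highest-resolution camera sensors (mean about 64 MP)  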
- Represents modern high-end smartphones  

### Conclusion
The unsupervised clustering approach separates the dataset into three performance segments
whose mean RAM, storage, CPU speed, battery capacity, and camera resolution all rise from
Budget to Flagship. The silhouette score of about 0.23 indicates that the clusters overlap
considerably, so the tiers are best read as broad, interpretable groupings rather than
sharply separated classes.
