# PySpark ML Assignment on Clustering, Dimensionality Reduction & Imbalanced Data Handling

This notebook includes questions on:
- [Clustering](https://spark.apache.org/docs/latest/ml-clustering.html)
  - [KMeans Clustering](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html)
- [Dimensionality Reduction (PCA)](https://spark.apache.org/docs/latest/ml-features.html#pca)
- Handling Imbalanced Data in PySpark

_Note: Depending on dataset availability or environment (e.g., SMOTE support), you might need to adapt paths or use pseudocode._

In [13]:
! pip install gdown pyspark numpy pandas scikit-learn



In [14]:
import gdown
file_id = "1v0TrkO0o4_UJbBlUiqpGrne7WnQSWIac"  # Google Drive file ID of the dataset
url = f"https://drive.google.com/uc?id={file_id}"

gdown.download(url, "data_wk7.csv", quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1v0TrkO0o4_UJbBlUiqpGrne7WnQSWIac
To: /home/aymuos/Documents/Github/masters-practise-repo/TERM3/AI_at_Scale/ClassWork/data_wk7.csv
100%|██████████| 3.88k/3.88k [00:00<00:00, 5.06MB/s]


'data_wk7.csv'

## Q1: Load a sample dataset and create spark session

In [15]:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, StringType
from pyspark.sql.functions import col

# Create Spark session
spark = SparkSession.builder.appName("ImbalancedData_workings").getOrCreate()

# Explicit schema: four numeric measurements plus the class label.
# With header=True and an explicit schema, Spark skips the CSV header row and
# uses these column names; it logs a warning if the header names differ.
schema = StructType([
    StructField("sepal_length", FloatType(), True),
    StructField("sepal_width", FloatType(), True),
    StructField("petal_length", FloatType(), True),
    StructField("petal_width", FloatType(), True),
    StructField("species", StringType(), True),
])

# Load the dataset
df_spark = spark.read.csv("data_wk7.csv", header=True, schema=schema)


In [None]:
# Normalise column names to lowercase snake_case. The explicit schema above
# already uses snake_case names, so this loop is a no-op kept as a safeguard.
for col_name in df_spark.columns:
    df_spark = df_spark.withColumnRenamed(col_name, col_name.lower().replace(" ", "_"))

df_spark.show()

## Q2: Assemble features into a single vector

`VectorAssembler` is a feature transformer that merges multiple input columns into a single vector column.

In [16]:
from pyspark.ml.feature import VectorAssembler

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_features = assembler.transform(df_spark)
df_features.select("features", "species").show(5, truncate=False)

+----------------------------------------------------------------------------+-------+
|features                                                                    |species|
+----------------------------------------------------------------------------+-------+
|[5.099999904632568,3.5,1.399999976158142,0.20000000298023224]               |setosa |
|[4.900000095367432,3.0,1.399999976158142,0.20000000298023224]               |setosa |
|[4.699999809265137,3.200000047683716,1.2999999523162842,0.20000000298023224]|setosa |
|[4.599999904632568,3.0999999046325684,1.5,0.20000000298023224]              |setosa |
|[5.0,3.5999999046325684,1.399999976158142,0.20000000298023224]              |setosa |
+----------------------------------------------------------------------------+-------+
only showing top 5 rows


25/06/25 17:03:41 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), species
 Schema: sepal_length, sepal_width, petal_length, petal_width, species
Expected: sepal_length but found: sepal length (cm)
CSV file: file:///home/aymuos/Documents/Github/masters-practise-repo/TERM3/AI_at_Scale/ClassWork/data_wk7.csv
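
The warning above appears because the CSV header names (for example `sepal length (cm)`) differ from the schema names. With Spark's default `enforceSchema=True`, the schema names are applied and the header is ignored, so the data still loads correctly. If one preferred to derive the column names from the header itself, a minimal sketch could look like the following. The helper name `to_snake_case` is hypothetical, and the header list is copied from the warning message.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a raw header such as 'sepal length (cm)' to 'sepal_length'."""
    # Drop a trailing parenthesised unit, e.g. " (cm)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


# Header names as reported in the CSVHeaderChecker warning.
raw_header = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
    "species",
]
clean_header = [to_snake_case(c) for c in raw_header]
print(clean_header)
```

The cleaned names could then be applied after reading with the header's own names, for instance `spark.read.csv("data_wk7.csv", header=True, inferSchema=True).toDF(*clean_header)`. This is one option among several, not a required change.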


## Q3: Apply KMeans Clustering

## Q4: Evaluate KMeans Clustering

## Q5: Apply PCA for dimensionality reduction

## Q6: Visualize PCA-transformed data

## Q7: Create an imbalanced dataset

## Q8: Use SMOTE or resampling

## Q9: Use class weights in a classifier
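
One common heuristic for class weights is the "balanced" rule, where each class gets the weight `n_samples / (n_classes * class_count)`, so rarer classes count for more. The sketch below computes these weights in plain Python. The class counts (50, 10, 5) are invented for illustration and are not taken from `data_wk7.csv`.

```python
from collections import Counter

# Assumed, illustrative label distribution for an imbalanced dataset.
labels = ["setosa"] * 50 + ["versicolor"] * 10 + ["virginica"] * 5

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Balanced weights: inversely proportional to class frequency.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: count={counts[label]}, weight={weight:.3f}")
```

In PySpark, these values could be written into a numeric column (for example with `when`/`otherwise` on the label column), and that column's name passed through the `weightCol` parameter that classifiers such as `LogisticRegression` accept.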

## Q10: Evaluate classification on imbalanced dataset
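
Accuracy alone can be misleading on imbalanced data. The plain-Python sketch below uses invented labels (5 positives, 95 negatives) and a hypothetical classifier that always predicts the majority class. It scores 95% accuracy while finding none of the positives. The helper name `binary_metrics` is an assumption made for this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative data: 5 minority-class rows, 95 majority-class rows.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # always predict the majority class

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In PySpark, comparable numbers are available from `MulticlassClassificationEvaluator` (for example `metricName="f1"` or `"weightedRecall"`) and from `BinaryClassificationEvaluator` with `metricName="areaUnderPR"`.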