### ML PySpark

The first step is to create a .env file in the root of the project directory and add the following two variables to it: ACCESS_KEY and ACCESS_SECRET.
For a more detailed explanation, see the [python-dotenv](https://pypi.org/project/python-dotenv/) documentation, which describes what the .env file should look like. Once loaded, the variables are added to os.environ and can be accessed like entries in a plain dict.
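As a minimal sketch of the access pattern described above, the snippet below checks that both variables are present before they are used. The helper name `require_env` is hypothetical and not part of python-dotenv; it assumes `load_dotenv()` has already populated `os.environ` from a `.env` file containing one `KEY=value` pair per line.

```python
import os


def require_env(names):
    """Return a dict with the requested environment variables.

    Raises KeyError naming every missing variable, so a misconfigured
    .env file fails early instead of deep inside a Spark call.
    (Hypothetical helper, not part of python-dotenv.)
    """
    missing = [name for name in names if name not in os.environ]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Calling `require_env(["ACCESS_KEY", "ACCESS_SECRET"])` right after `load_dotenv()` would then either return both values or report which one is missing.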

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, when, input_file_name
from functools import reduce
import sys
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
import os
from dotenv import load_dotenv

In [2]:
load_dotenv()

## check that the key was loaded (without printing the credential itself)
print("ACCESS_KEY" in os.environ)

True


## Console Login

The following classes handle the Spark session on AWS.

In [3]:
from src.s3handler import Sparker

In [4]:
## Initialize the class
spark = Sparker(os.environ['ACCESS_KEY'],os.environ['ACCESS_SECRET'])

## local session
spark._create_local_session()

:: loading settings :: url = jar:file:/Users/devseed/Documents/repos/FRACTAL_Big_Data/.venv/lib/python3.13/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/devseed/.ivy2.5.2/cache
The jars for the packages stored in: /Users/devseed/.ivy2.5.2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-10decdb4-7eb9-44fd-9d00-f82aeda11347;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.1 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.901 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
:: resolution report :: resolve 99ms :: artifacts dl 4ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.901 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.1 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]
	-----------------------------------------
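The implementation of `Sparker` is not shown here. As a rough sketch of what a local session configured for S3 could look like, the snippet below uses the standard S3A connector settings and the `hadoop-aws` 3.3.1 package that appears in the log above. The function names and the `fractal-local` app name are hypothetical, and the real class may configure the session differently.

```python
def s3a_session_options(access_key, secret_key, hadoop_aws_version="3.3.1"):
    """Build Spark config options for reading s3a:// paths.

    The default version matches the hadoop-aws package resolved in the
    Ivy log; everything else is an assumption about a typical setup.
    """
    return {
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


def build_local_session(options, app_name="fractal-local"):
    """Create a local SparkSession from the options (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; only needed here

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the options in a plain dict makes them easy to inspect or log (minus the secret) before a session is created.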

## Read parquet

In [5]:
## Read the Parquet file and store it in a DataFrame
df = spark.read_parquet("ubs-datasets",
                    "FRACTAL/data/train/TRAIN-0436_6399-002955400.parquet",
                    read_all=False)

# # Read the list of parquet files
# list_s3 = ["FRACTAL/data/train/TRAIN-1200_6136-008972557.parquet", "FRACTAL/data/train/TRAIN-0436_6399-002955400.parquet"]
# df = spark.read_parquet("ubs-datasets",
#                     list_s3,
#                     read_all=False)

Reading from: ['s3a://ubs-datasets/FRACTAL/data/train/TRAIN-0436_6399-002955400.parquet']
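The "Reading from" line suggests that the bucket name and the object key(s) are combined into `s3a://` URIs before being passed to Spark. A minimal sketch of that mapping, assuming the same str-or-list argument shape as the `read_parquet` call above (the helper name `to_s3a_uris` is hypothetical and not taken from `src.s3handler`):

```python
def to_s3a_uris(bucket, keys):
    """Turn a bucket name and one key (or a list of keys) into s3a:// URIs."""
    if isinstance(keys, str):
        keys = [keys]  # accept a single key as well as a list
    return [f"s3a://{bucket}/{key.lstrip('/')}" for key in keys]


print(to_s3a_uris("ubs-datasets",
                  "FRACTAL/data/train/TRAIN-0436_6399-002955400.parquet"))
# ['s3a://ubs-datasets/FRACTAL/data/train/TRAIN-0436_6399-002955400.parquet']
```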


                                                                                

In [None]:
df.printSchema()

Note that the schema here was inferred by Spark, and it is quite different from the one I got when I had to download the file.

In [None]:
df.show(3)

In [None]:
print(f"Number of rows: {df.count()}")

In [None]:
## Count points per class (the column is named 'Classification' in the data)
classes = df.groupby('Classification').count()
classes.show()

## Preprocessing steps

In [6]:
conditions = [col(c).isNull() for c in df.columns]

## The combined condition is True for any row
## where at least one column is NULL
combined_condition = reduce(lambda a, b: a | b, conditions)

print(f"Number of rows with null values: {df.filter(combined_condition).count()}")

Number of rows with null values: 0
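The null check above builds one condition per column and folds them together with OR, so a row is counted if any of its columns is NULL. The same folding idea can be sketched in plain Python on a few made-up rows (the values below are illustrative, not taken from the dataset):

```python
from functools import reduce
from operator import or_


def rows_with_any_null(rows, columns):
    """Return the rows in which at least one of the given columns is None."""
    def has_null(row):
        # One boolean per column, folded with OR in the same way as the
        # per-column Spark conditions.
        flags = [row[c] is None for c in columns]
        return reduce(or_, flags, False)

    return [row for row in rows if has_null(row)]


rows = [
    {"Intensity": 120, "Red": 34},
    {"Intensity": None, "Red": 51},
    {"Intensity": 98, "Red": None},
]
print(len(rows_with_any_null(rows, ["Intensity", "Red"])))  # 2
```

Note that the count is a number of rows, not a number of columns.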


                                                                                

In [7]:
# Split the array column into three separate columns
df = df.withColumn("x", col("xyz")[0]) \
       .withColumn("y", col("xyz")[1]) \
       .withColumn("z", col("xyz")[2])

## Feature Engineering

In [8]:
from src.s3handler import FeatureEngineering

In [9]:
featureEngineering = FeatureEngineering(df)
df = featureEngineering.apply_all()
# Individual feature-engineering steps can also be applied one at a time:
# height_above_ground(self, grid_size=5.0)
# local_stats(self, grid_size=2.0)
# return_features(self)
# vegetation_index(self)
# water_detection(self)
# or all of them at once with apply_all(self)

In [10]:
df.show(3)


+--------------------+---------+------------+---------------+-----------------+----------------+--------------+---------+--------+--------+-------+-------------+--------+-------------+-------------------+-----------+-----+-----+-----+--------+--------------------+----------+-----------+------+--------------------+-------------+-------------------+-----------------+-------------------+------------+----------------+--------------+-------------------+------------------+------------------+
|                 xyz|Intensity|ReturnNumber|NumberOfReturns|ScanDirectionFlag|EdgeOfFlightLine|Classification|Synthetic|KeyPoint|Withheld|Overlap|ScanAngleRank|UserData|PointSourceId|            GpsTime|ScanChannel|  Red|Green| Blue|Infrared|                 wkb|         x|          y|     z| height_above_ground|local_density|        local_z_std|    local_z_range|          roughness|return_ratio|is_single_return|is_last_return|               ndvi|   green_red_ratio|              ndwi|
+-----------------
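The internals of `FeatureEngineering` are not shown, so the following is only a sketch of how spectral columns such as `ndvi`, `ndwi`, and `green_red_ratio` are commonly defined. It assumes the standard NDVI formula and the McFeeters form of NDWI. The function names and the small `eps` guard against division by zero are illustrative assumptions, and the class may compute these features differently.

```python
def ndvi(red, infrared, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (infrared - red) / (infrared + red + eps)


def ndwi(green, infrared, eps=1e-9):
    """Normalized Difference Water Index (McFeeters): (Green - NIR) / (Green + NIR)."""
    return (green - infrared) / (green + infrared + eps)


def green_red_ratio(green, red, eps=1e-9):
    """Ratio of the green band to the red band."""
    return green / (red + eps)
```

Both indices fall between -1 and 1 for non-negative band values: vegetation tends to push NDVI up (strong infrared, weak red), while open water tends to push NDWI up (weak infrared).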

                                                                                

## Standardizing

In [None]:
from pyspark.ml.feature import StandardScaler

In [None]:
scaler = StandardScaler(inputCol="features",
                        outputCol="scaledFeatures",
                        withStd=True,
                        withMean=False)
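To make the two flags concrete: with `withStd=True` and `withMean=False`, each feature is divided by its standard deviation but is not centered around zero. A plain-Python sketch of that behaviour for a single column, assuming the sample standard deviation (n - 1 denominator) that Spark's `StandardScaler` documents; the helper name `scale_to_unit_std` is hypothetical:

```python
from statistics import stdev


def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation.

    Mirrors withStd=True, withMean=False: values are rescaled
    but not shifted, so zeros stay zeros.
    """
    std = stdev(column)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        # Choice made for this sketch: a constant column maps to zeros
        # instead of raising a division-by-zero error.
        return [0.0 for _ in column]
    return [value / std for value in column]


print(scale_to_unit_std([2.0, 4.0, 6.0]))  # [1.0, 2.0, 3.0]
```

Leaving `withMean=False` keeps sparse vectors sparse, which is one reason it is the default.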

## Choosing correct columns

In [None]:
## Select cols
feature_cols = ['x', 'y', 'z', 'Intensity', 'ReturnNumber', 'NumberOfReturns', 
                'ScanAngleRank', 'EdgeOfFlightLine', 'ScanDirectionFlag', 
                'Red', 'Green', 'Blue', 'Infrared']  


## Create a VectorAssembler
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

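## Scaler: replaces the one defined above (same default settings,
## but the output column is named "scaled_features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")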


In [None]:
output = assembler.transform(df)

In [None]:
output.select("features").show(truncate=False)
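Conceptually, the assembler step concatenates the selected columns of each row into one numeric vector, in the order given by `feature_cols`. A plain-Python sketch of that idea for scalar columns (the helper `assemble` and the sample row are hypothetical and only illustrate the ordering; they do not reproduce Spark's vector types):

```python
def assemble(row, feature_cols):
    """Concatenate the selected columns of one row into a flat list of floats."""
    return [float(row[name]) for name in feature_cols]


# Made-up row for illustration only.
row = {"x": 1, "y": 2.5, "z": 3, "EdgeOfFlightLine": True, "Intensity": 120}
print(assemble(row, ["x", "y", "z", "EdgeOfFlightLine"]))  # [1.0, 2.5, 3.0, 1.0]
```

Columns that are not listed (here `Intensity`) are simply left out of the vector, and the position of each value in the vector follows the list order, which matters when interpreting scaled features or model coefficients later.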