<a href="https://colab.research.google.com/github/elsagomdef/deepLearning/blob/master/SecondAssignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BIG DATA INTELLIGENCE: METHODS AND TECHNOLOGIES.**
## *Contró, Pedro (NIA:100513487) & Defarges, Elsa (NIA: 100513002 )*
## *SECOND ASSIGNMENT: PYSPARK*


The primary objective of this assignment is to explore the impact of feature selection on the performance of a Linear Regression model in a PySpark environment. The focus is on assessing whether feature selection techniques can enhance results by eliminating irrelevant variables or, at the very least, achieve comparable results with fewer features.

The dataset ('wind_available_second.csv.gz') is split into a training partition and a test partition. Each feature selection approach is fitted on the training partition and evaluated on the separate test set. For each one, we record the execution time and the MSE of the resulting model. With these results, we will try to choose the best technique for this case.

Since we are working in Google Colab, the first step is to install the PySpark library and import the other libraries needed for the assignment.

In [12]:
!pip install pyspark



In [13]:
import sys
import os
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, Imputer
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import UnivariateFeatureSelector
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
import time
from pyspark.ml.feature import PCA
from pyspark.ml.feature import StandardScaler, VectorAssembler
from tabulate import tabulate

Next, a Spark session called "FeatureSelectionAssignment" is created and will be used to perform distributed operations on large datasets.

In [14]:
##########################################################################################################
# SPARK CONTEXT INITIALIZATION
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("FeatureSelectionAssignment") \
    .getOrCreate()
sc = spark.sparkContext
##########################################################################################################
# This is the spark context
print(spark)
print(sc)

<pyspark.sql.session.SparkSession object at 0x7cab73d9fd90>
<SparkContext master=local[*] appName=FeatureSelectionAssignment>


**Data import**

In [15]:
ava_sd=spark.read.csv(path='wind_available_second.csv.gz', header=True, inferSchema=True)

In [16]:
# show() prints the table itself and returns None, so it is not wrapped in print()
ava_sd.show(5)


### Splitting the data into train and test

As in the previous assignment, the dataset is indexed by time. To build the train and test partitions, the last year is held out as the test set, so that the model is evaluated on predicting the most recent data from earlier observations.

In [17]:
last_year=ava_sd.agg({"year": "max"}).collect()[0][0]

train_data = ava_sd.filter(ava_sd["year"] < last_year)
test_data = ava_sd.filter(ava_sd["year"] == last_year)

num_train_data = train_data.count()
num_test_data = test_data.count()

print(f"Number of data in the training set: {num_train_data}")
print(f"Number of data in the test set: {num_test_data}")

Number of data in the training set: 3827
Number of data in the test set: 921
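
The same time-based split can be sketched in plain Python, without Spark, to make the logic explicit. The rows and the `split_by_last_year` helper below are hypothetical; only the `year` column name comes from the dataset.

```python
# Hypothetical toy rows; only the "year" field mirrors the real dataset.
rows = [
    {"year": 2007, "energy": 410.0},
    {"year": 2007, "energy": 950.5},
    {"year": 2008, "energy": 120.3},
    {"year": 2009, "energy": 780.0},
    {"year": 2009, "energy": 301.9},
]


def split_by_last_year(records, year_key="year"):
    """Hold out the most recent year as the test set; keep the rest for training."""
    last_year = max(r[year_key] for r in records)
    train = [r for r in records if r[year_key] < last_year]
    test = [r for r in records if r[year_key] == last_year]
    return train, test


train_rows, test_rows = split_by_last_year(rows)
print(len(train_rows), len(test_rows))
```

Splitting by year rather than at random keeps every test observation later in time than every training observation, which is what makes the test error a fair estimate of forecasting performance.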


## ***Creation of the models with feature selection techniques***

### **UnivariateFeatureSelector and the fpr strategy**

The Univariate Feature Selector with the False Positive Rate (FPR) strategy is a distinctive **filter-based method** employed in the realm of feature selection. This approach is part of the broader category of univariate feature selection, focusing on evaluating and selecting individual features based on their statistical significance in relation to the target variable.

At its core, the Univariate Feature Selector applies a statistical test to each feature and uses the resulting **p-value** to assess the strength of its relationship with the target variable. The objective is to keep the features that are significantly associated with the target while discarding those with weaker associations.

Specifically, this strategy controls the False Positive Rate, that is, the rate at which irrelevant features are incorrectly identified as significant. It keeps every feature whose p-value falls below a chosen threshold, which **bounds** the expected proportion of irrelevant features that pass the test. The threshold limits how often a feature is selected because of randomness or noise, but it does not eliminate such selections entirely.
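
As a minimal sketch of the selection rule, assuming the p-values have already been computed, the FPR strategy reduces to a simple threshold test. The p-values and the `select_fpr` helper are hypothetical; the threshold of 0.05 is an assumption that mirrors what we understand to be the default of Spark's selector in this mode.

```python
def select_fpr(p_values, alpha=0.05):
    """Keep the index of every feature whose p-value is below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Hypothetical p-values, one per feature.
p_values = [0.0001, 0.03, 0.20, 0.049, 0.70]

selected = select_fpr(p_values)
print(selected)  # [0, 1, 3]
```

Each feature is tested on its own, so with a threshold of 0.05 roughly 5% of truly irrelevant features would still be expected to pass.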

The steps to be followed are:

1. Imputation of null values: the mean is used as the imputation method.
2. Assembler configuration for the feature vector: VectorAssembler is configured to take the imputed columns and create a new column named 'features'. This column will contain the assembled feature vector.
3. Configure the Univariate Feature Selector with the FPR method.
4. Build the linear regression model.
5. Create a pipeline with the different stages of the process.
6. Fit the model.
7. Make predictions and calculate evaluation metrics: the Mean Squared Error and the execution time.
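
Step 1 can be illustrated with a small plain-Python sketch of mean imputation. The helpers `fit_mean` and `apply_mean` and the sample values are hypothetical. The point is that the mean is learned from the training column only and then reused on the test column, which is also how a fitted pipeline behaves when it transforms new data.

```python
import math


def is_missing(value):
    """Treat both None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_mean(column):
    """Compute the mean of the observed (non-missing) values."""
    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values")
    return sum(observed) / len(observed)


def apply_mean(column, mean):
    """Replace missing entries with a previously fitted mean."""
    return [mean if is_missing(v) else v for v in column]


# Hypothetical values for one feature column.
train_col = [2.0, None, 4.0, float("nan"), 6.0]
test_col = [None, 10.0]

train_mean = fit_mean(train_col)             # learned on training data only
print(apply_mean(train_col, train_mean))
print(apply_mean(test_col, train_mean))      # test data reuses the training mean
```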

In [18]:
cols_to_impute = train_data.columns[1:]
target = train_data.columns[0]

# 1. Imputation of null values
imputer = Imputer(inputCols=cols_to_impute, outputCols=cols_to_impute, strategy="mean")

# 2. Configuring the assembler to create the feature vector
assembler = VectorAssembler(inputCols=cols_to_impute, outputCol='features')

# 3. Applying the univariate feature selector method of fpr
feature_selector = UnivariateFeatureSelector(featuresCol='features', outputCol='selected_features', labelCol=target, selectionMode="fpr")
feature_selector.setFeatureType('continuous')
feature_selector.setLabelType('continuous')

# 4. Creating the linear regression model
linear_regression = LinearRegression(featuresCol='selected_features', labelCol=target)

# 5. Setting the order of the stages in the pipeline
pipeline = Pipeline(stages=[imputer, assembler, feature_selector, linear_regression])

start_time = time.time()

# 6. Fitting the model
model= pipeline.fit(train_data)

end_time =time.time()

training_time_fpr= end_time - start_time

# 7. Making predictions
predictions = model.transform(test_data)

# Getting the features selected in ascending order
selected_features = model.stages[2].selectedFeatures
selected_features_sorted = sorted(selected_features)
num_selected_features = len(selected_features)

# Mean Squared Error (MSE)
evaluator = RegressionEvaluator(labelCol=target, predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)

print("\n\033[1mResults for the UnivariateFeatureSelector and the fpr strategy model: \033[0m")
print("--------------------------------------------------------------------------")
print("Selected Features:", selected_features_sorted)
print("Total Number of Selected Features with fpr:", num_selected_features)
print("Mean Squared Error (MSE):", round(mse, 3))
print("Execution time:", round(training_time_fpr, 3))
print(" ")
predictions.select(target, 'prediction').show(7)


Results for the UnivariateFeatureSelector and the fpr strategy model: 
--------------------------------------------------------------------------
Selected Features: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 20

The cell prints several outputs: the indices of the selected variables in ascending order and the total number of variables used by the model (these are discussed at the end of the assignment). It also reports the MSE of the model on the test set and a few predictions, which allow a quick check of the model's performance.
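
For reference, the MSE reported by the evaluator is the average of the squared differences between the true values and the predictions. A minimal sketch, using hypothetical values rather than the notebook's predictions:

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

mse_value = mean_squared_error(y_true, y_pred)
rmse_value = math.sqrt(mse_value)  # back in the units of the target
print(round(mse_value, 3), round(rmse_value, 3))
```

Because the MSE is expressed in squared units of the target, its square root (RMSE) is often easier to interpret next to the `energy` values.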

### **UnivariateFeatureSelector and the fwe strategy**

The UnivariateFeatureSelector with Family Wise Error (FWE) strategy is another facet of **filter-based feature selection** designed to improve the accuracy and relevance of features in machine learning models.

In essence, UnivariateFeatureSelector uses statistical tests, often leveraging the p-value, to measure the strength of the relationship between each feature and the target variable. The main objective is to identify the features that contribute significantly to the predictive power of the model and to discard those with weaker associations.

The FWE strategy stands out for controlling the **family-wise error rate**: the probability of incorrectly identifying at least one irrelevant feature as significant across all the tests performed. Because one test is run per feature, the significance threshold is made stricter as the number of features grows, which reduces the chance of including irrelevant variables that appear significant only by chance or noise.

In short, this is the **most conservative** of the univariate strategies considered here: it accepts the risk of discarding some weakly relevant features in exchange for stronger protection against false positives, so that only the most substantiated variables are passed on to the model.
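
A minimal sketch of why FWE is stricter than FPR, assuming the threshold is divided by the number of features (a Bonferroni-style correction, which is how we understand Spark's `fwe` mode to work). The p-values and helper names are hypothetical.

```python
def select_below(p_values, threshold):
    """Keep the index of every feature whose p-value is below the threshold."""
    return [i for i, p in enumerate(p_values) if p < threshold]


def select_fwe(p_values, alpha=0.05):
    """Scale the threshold by 1 / number of features before selecting."""
    return select_below(p_values, alpha / len(p_values))


# Hypothetical p-values, one per feature.
p_values = [0.0001, 0.03, 0.20, 0.049, 0.70]

fpr_selected = select_below(p_values, 0.05)  # plain threshold (FPR-style)
fwe_selected = select_fwe(p_values, 0.05)    # corrected threshold (FWE-style)

print(fpr_selected)  # [0, 1, 3]
print(fwe_selected)  # [0]
```

With the same nominal threshold, the FWE selection is always a subset of the FPR selection, which is consistent with FWE keeping fewer features in this notebook.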

In [19]:
feature_selector_fwe = UnivariateFeatureSelector(featuresCol='features', outputCol='selected_features', labelCol=target, selectionMode="fwe")
feature_selector_fwe.setFeatureType('continuous')
feature_selector_fwe.setLabelType('continuous')

linear_regression_fwe = LinearRegression(featuresCol='selected_features', labelCol=target)

# The initial imputer and assembler will be used.
pipeline_fwe = Pipeline(stages=[imputer, assembler, feature_selector_fwe, linear_regression_fwe])

start_time = time.time()

model_fwe = pipeline_fwe.fit(train_data)

end_time =time.time()

training_time_fwe= end_time - start_time

## Predictions
predictions_fwe = model_fwe.transform(test_data)

selected_features_fwe = model_fwe.stages[2].selectedFeatures
selected_features_sorted_fwe = sorted(selected_features_fwe)
num_selected_features_fwe = len(selected_features_fwe)

evaluator_fwe = RegressionEvaluator(labelCol=target, predictionCol="prediction", metricName="mse")
mse_fwe = evaluator_fwe.evaluate(predictions_fwe)

print("\n\033[1mResults for the UnivariateFeatureSelector and the fwe strategy model: \033[0m")
print("--------------------------------------------------------------------------")
print("Selected Features with FWE:", selected_features_sorted_fwe)
print("Total Number of Selected Features with fwe:", num_selected_features_fwe)
print("Mean Squared Error (MSE) with FWE:", round(mse_fwe, 3))
print("Execution time:", round(training_time_fwe, 3))
print(" ")
predictions_fwe.select(target, 'prediction').show(7)


Results for the UnivariateFeatureSelector and the fwe strategy model: 
--------------------------------------------------------------------------
Selected Features with FWE: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227

### **PCA with 3 components**

Principal Component Analysis (PCA) is a **dimensionality reduction technique** used to explore and visualize patterns in complex data. Its main objective is to transform a set of correlated variables into a new set of uncorrelated variables, called **principal components**. This transformation seeks to summarize the variability in the data while maintaining as much of the original variability as possible.
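
The idea can be sketched on two correlated variables in plain Python. This is an illustrative simplification, not the Spark implementation: the data are hypothetical, the `pca_2d` helper is ours, and the closed-form eigenvalue formula only applies to the 2x2 case.

```python
import math


def pca_2d(points):
    """PCA for 2-D data via the closed-form eigen-decomposition of the covariance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix entries [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Direction of the first principal component (major axis).
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    # Scores: centered points projected onto the first component.
    scores = [(x - mean_x) * direction[0] + (y - mean_y) * direction[1]
              for x, y in points]
    return lam1, lam2, direction, scores


# Hypothetical, strongly correlated data (roughly y = 2x).
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]

lam1, lam2, direction, scores = pca_2d(points)
explained_ratio = lam1 / (lam1 + lam2)
print(round(explained_ratio, 4))
```

Here a single component retains almost all of the variance because the two variables are nearly collinear. PCA is variance-based, so the result depends on the scale of each variable, and the components are chosen without looking at the target.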

In [20]:

assembler = VectorAssembler(inputCols=cols_to_impute, outputCol='features')

# 3. Applying PCA with 3 components
pca1 = PCA(k=3, inputCol="features")

linear_regression_PCA = LinearRegression(featuresCol=pca1.getOutputCol(), labelCol=target)

pipeline = Pipeline(stages=[imputer, assembler, pca1, linear_regression_PCA])

start_time = time.time()

model_PCA= pipeline.fit(train_data)

end_time =time.time()

training_time_PCA= end_time - start_time

predictions_PCA = model_PCA.transform(test_data)

# Mean Squared Error (MSE)
evaluator = RegressionEvaluator(labelCol=target, predictionCol="prediction", metricName="mse")
mse_PCA = evaluator.evaluate(predictions_PCA)

print("\n\033[1mResults for the PCA with 3 components model: \033[0m")
print("--------------------------------------------")
print("Mean Squared Error (MSE) with PCA:", round(mse_PCA, 3))
print("Execution time:", round(training_time_PCA, 3))
print(" ")
predictions_PCA.select(target, 'prediction').show(7)


Results for the PCA with 3 components model: 
--------------------------------------------
Mean Squared Error (MSE) with PCA: 382363.625
Execution time: 30.83
 
+-------+------------------+
| energy|        prediction|
+-------+------------------+
| 977.91|2595.3083458042456|
|1191.99| 2376.042267733102|
| 795.88|1970.5892903420906|
| 141.05| 1523.834644386074|
| 1124.2|1177.5345067516064|
| 916.83|1011.1861425020945|
| 441.86| 829.3159601277437|
+-------+------------------+
only showing top 7 rows



### **Model combining fpr and PCA features selection**

For the last model, the variables selected by the more permissive FPR strategy are combined with the information contained in the 3 PCA components. This approach aims to take advantage of the information that each method provides, with the objective of building a model that uses the individually significant features together with a compact summary of the joint variability of all the variables.
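
Combining the two sets of features amounts to concatenating, for each row, the selected features with the principal components. A minimal sketch with hypothetical values and a hypothetical `assemble` helper:

```python
def assemble(*vectors):
    """Concatenate several feature vectors into one, in the order given."""
    combined = []
    for vector in vectors:
        combined.extend(vector)
    return combined


# Hypothetical values for a single row.
selected_features_fpr = [0.42, 1.70, 3.15]  # features kept by the selector
pca_features = [0.91, -0.08, 0.33]          # 3 principal components

combined_features = assemble(selected_features_fpr, pca_features)
print(len(combined_features))  # 6
```

If the selector keeps the same 522 features as in the first model, the regression presumably receives 522 + 3 = 525 inputs. Since each principal component is a linear combination of all the original features, the components can only add information that comes from features the selector discarded.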

In [21]:
feature_selector_fpr = UnivariateFeatureSelector(outputCol='selected_features_fpr', featuresCol='features', labelCol=target, selectionMode='fpr')
feature_selector_fpr.setFeatureType('continuous')
feature_selector_fpr.setLabelType('continuous')

pca = PCA(k=3, inputCol='features', outputCol='pca_features')

assembler_combined = VectorAssembler(inputCols=['selected_features_fpr', 'pca_features'], outputCol='combined_features')

linear_regression_combined = LinearRegression(featuresCol="combined_features", labelCol=target)

pipeline_combined = Pipeline(stages=[imputer, assembler, feature_selector_fpr, pca, assembler_combined, linear_regression_combined])

start_time_combined = time.time()

model_combined = pipeline_combined.fit(train_data)

end_time_combined = time.time()
training_time_comb =end_time_combined - start_time_combined

predictions_combined = model_combined.transform(test_data)

evaluator_combined = RegressionEvaluator(labelCol=target, predictionCol='prediction', metricName='mse')
mse_combined = evaluator_combined.evaluate(predictions_combined)

print("\n\033[1mResults for the combined model: \033[0m")
print("--------------------------------------------")
print(f"Mean Squared Error (MSE): {mse_combined}")
print(f"Execution time: {training_time_comb} seconds")
print(" ")
predictions_combined.select(target, "prediction").show(5)


Results for the combined model: 
--------------------------------------------
Mean Squared Error (MSE): 279648.2197091595
Execution time: 39.56381678581238 seconds
 
+-------+------------------+
| energy|        prediction|
+-------+------------------+
| 977.91|1407.9700909331768|
|1191.99|1353.2528793283818|
| 795.88| 1081.099564697197|
| 141.05|1133.9571961749207|
| 1124.2| 837.1566994969726|
+-------+------------------+
only showing top 5 rows



## **Conclusions**

In [22]:
execution_times = {
    'Fpr': round(training_time_fpr, 3),
    'Fwe': round(training_time_fwe, 3),
    'PCA': round(training_time_PCA, 3),
    'Combined': round(training_time_comb, 3)
}

mse_fpr = round(mse, 2)
mse_fwe = round(mse_fwe, 2)
mse_PCA = round(mse_PCA, 2)
mse_combined = round(mse_combined, 2)

num_selec_fpr = num_selected_features
num_select_fwe = num_selected_features_fwe
comp = "-"
comb = "-"

results_df = pd.DataFrame(list(execution_times.items()), columns=['Feature Selection', 'Execution time'])

results_df['MSE'] = [mse_fpr, mse_fwe, mse_PCA, mse_combined]
results_df['Nº features'] = [num_selec_fpr, num_select_fwe, comp, comb]

print(tabulate(results_df, headers='keys', tablefmt='pretty', showindex=False))


+-------------------+----------------+-----------+-------------+
| Feature Selection | Execution time |    MSE    | Nº features |
+-------------------+----------------+-----------+-------------+
|        Fpr        |     42.199     | 279876.05 |     522     |
|        Fwe        |     24.24      | 281488.02 |     495     |
|        PCA        |     30.83      | 382363.62 |      -      |
|     Combined      |     39.564     | 279648.22 |      -      |
+-------------------+----------------+-----------+-------------+


Observing these results and taking into account the different feature selection techniques used, it could be said that:


*   **Univariate feature selection** methods like False Positive Rate (Fpr) and Family Wise Error (Fwe) operate by evaluating each feature independently based on statistical measures. Fpr tends to be more permissive, allowing a higher number of features, while Fwe is stricter, aiming to control the overall error rate. Here, Fpr selected more features (522) than Fwe (495), yet its MSE was only about 0.6% lower (279876.05 versus 281488.02). The improvement in MSE was therefore small relative to the increase in the number of features, which could indicate that some of the 27 additional features selected by Fpr do not contribute significantly to the predictive ability of the model and could be considered noise. Fwe also had the shortest execution time (24.24 s versus 42.2 s for Fpr).

* **PCA**, a dimensionality reduction technique, transforms the original features into a set of uncorrelated principal components. In this case, PCA with only three components resulted in the highest MSE (382363.62). Three components are probably too few to represent the variability of several hundred original features, leading to a loss of information.

* The **combined approach** obtained the lowest MSE (279648.22). This is far below the MSE of PCA alone, but only about 0.08% below that of Fpr alone. The result suggests that most of the predictive ability comes from the features selected by Fpr, and that the three principal components add only a marginal improvement, possibly by capturing some information from the features that Fpr discarded.
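
The relative differences between the models can be checked with a short calculation. The MSE values are copied from the results table above; taking Fpr as the baseline is our own choice for the comparison.

```python
# Test-set MSE of each model, copied from the results table.
mse_by_model = {
    "Fpr": 279876.05,
    "Fwe": 281488.02,
    "PCA": 382363.62,
    "Combined": 279648.22,
}

baseline = mse_by_model["Fpr"]

# Percentage change in MSE relative to the Fpr model (negative means better).
relative_change = {
    name: 100 * (value - baseline) / baseline
    for name, value in mse_by_model.items()
}

for name, change in relative_change.items():
    print(f"{name:<9} {change:+.2f}%")
```

Seen this way, the gap between Fpr, Fwe and the combined model is well under 1%, while PCA alone is more than 30% worse.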


