# Spark DataFrame - CRISP-DM

1	Situation Understanding

1.1	Identify the objectives of the situation.

This analysis is critical because the world's oceans face multiple pressures, including overfishing, habitat destruction, pollution, and climate change. The most pressing of these is the projected depletion of the world's fish stocks by 2048 (Worm et al., 2006).

These challenges pose significant threats to marine biodiversity and the sustainability of ocean resources. By conducting this analysis, we aim to assess potential solutions to overfishing and to identify the challenges and considerations involved in implementing aquaculture, marine protected areas, or other measures. The success criteria for this analysis include:

1.	Assessing aquaculture's potential as a sustainable solution to overfishing, with evidence-based findings and recommendations.

2.	Evaluating the effectiveness of marine protected areas as a passive solution for sustaining fish stock health.

3.	Developing a prediction model with a confidence level of more than 90% to establish the main factors behind SDG14 ratings.

4.	Producing a comprehensive list of challenges and considerations that must be addressed when implementing aquaculture as a solution to overfishing.


2	Data Understanding

The data for this project was collected from publicly available sources: the UN FAO, UN Statistics, the World Bank World Development Indicators database, and SeaAroundUs. It comprises catch data (FAO, 2023), aquaculture production data (OECD, 2023), and other relevant environmental and socioeconomic variables.
Catch data is the recorded amount of fish or other marine organisms caught or harvested from the ocean. It is often used to monitor the health of fish populations and to inform fisheries management practices. UN Statistics (UN Stats, 2023) and the World Bank World Development Indicators database (World Bank, 2023) provided additional data on factors such as gross domestic product, population, and environmental indicators. SeaAroundUs (2023) offered additional catch data and other metrics related to global fisheries.

In [None]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('basics').getOrCreate()

In [None]:
# Import necessary Python libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, when
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator

In [None]:
# Load CSV files
file_names = [
    "Datasets/2022 Policy Strength.csv",
    "Datasets/2022 Ocean Science Funding.csv",
    "Datasets/2022 Marine Key Biodiversities Areas Percentage.csv",
    "Datasets/2018 Sustainable Fish Stock.csv",
    "Datasets/1969 - 2018 capture-fisheries-vs-aquaculture.csv"
]

In [None]:
dfs = []
for file_name in file_names:
    df = spark.read.csv(file_name, header=True, inferSchema=True)
    dfs.append(df)

In [None]:
# Explore Data
for df in dfs:
    df.show(10)

In [None]:
# Data visualization for each table
# 1. 2022 Policy Strength
df_0 = dfs[0]
df_0 = df_0.withColumn("Value", df_0["Value"].cast(IntegerType()))
df_0 = df_0.na.drop()
df_0_pd = df_0.toPandas()

In [None]:
# Plot with the pandas plotting API, which draws through Matplotlib.
# DataFrame.plot creates its own figure, so the size is passed here instead of calling plt.figure().
df_0_pd.plot(kind="bar", x="GeoAreaName", y="Value", rot=90, figsize=(10, 6))
plt.xlabel("Country")
plt.ylabel("Policy Strength")
plt.title("2022 Policy Strength")
plt.show()

In [None]:
# 2. 2022 Ocean Science Funding
df_1 = dfs[1]
df_1 = df_1.withColumn("GDP", df_1["GDP"].cast(IntegerType()))
df_1 = df_1.na.drop()

# Convert to Pandas DataFrame for plotting
df_1_pd = df_1.toPandas()

# Plot using Matplotlib
plt.figure(figsize=(10, 6))
plt.barh(df_1_pd["GeoAreaName"], df_1_pd["GDP"])
plt.xlabel("GDP Allocation")
plt.ylabel("Country")
plt.title("2022 Ocean Science Funding")
plt.tight_layout()
plt.show()

In [None]:
# 3. 2022 Marine Key Biodiversities Areas Percentage (scatter plot)
df_2 = dfs[2]
df_2 = df_2.withColumn("Percent", df_2["MPA Percent"].cast(IntegerType()))
df_2 = df_2.na.drop()
df_2_pd = df_2.toPandas()

In [None]:
# Plotting with Matplotlib
plt.figure(figsize=(10, 6))
plt.scatter(df_2_pd["GeoAreaName"], df_2_pd["Percent"])
plt.xlabel("Country")
plt.ylabel("Percentage")
plt.title("2022 Marine Key Biodiversities Areas Percentage")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
# 4. 2018 Sustainable Fish Stock
df_3 = dfs[3]
df_3 = df_3.withColumn("Percentage", df_3["Percentage"].cast(IntegerType()))
df_3 = df_3.na.drop()
df_3_pd = df_3.toPandas()

In [None]:
# Plotting with Matplotlib
plt.figure(figsize=(10, 6))
plt.scatter(df_3_pd["GeoAreaName"], df_3_pd["Percentage"])
plt.xlabel("Country")
plt.ylabel("Sustainability Rating")
plt.title("2018 Sustainable Fish Stock")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
# 5. 1969-2018 Capture Fisheries vs Aquaculture
df_4 = dfs[4]
df_4 = df_4.na.drop()
df_4 = df_4.withColumn("Year", df_4["Year"].cast(IntegerType()))
df_4 = df_4.withColumn("Capture", df_4["Capture"].cast(IntegerType()))
df_4 = df_4.withColumn("Aquaculture", df_4["Aquaculture"].cast(IntegerType()))
df_4 = df_4.na.drop()
df_4_pd = df_4.toPandas()

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(df_4_pd["Year"], df_4_pd["Capture"], color="red", label="Capture Fisheries")
plt.plot(df_4_pd["Year"], df_4_pd["Aquaculture"], color="blue", label="Aquaculture")
plt.xlabel("Year")
plt.ylabel("Tonnes")
plt.legend()
plt.title("Capture Fisheries vs Aquaculture (1969-2018)")
plt.tight_layout()
plt.show()

In [None]:
# Quality report for each dataframe
for i, df in enumerate(dfs):
    print(f"Table {i + 1}:")
    
    # Check for missing values
    print("Missing values:")
    missing_counts = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
    missing_counts.show()

    # Check for duplicate rows
    print("Duplicate rows:")
    print(df.count() - df.dropDuplicates().count())

    # Data types
    print("Data types:")
    df.printSchema()

    # Summary statistics
    print("Summary statistics:")
    df.describe().show()

    print()
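
The missing-value and duplicate checks in the quality report can be illustrated without a Spark session. This is a minimal sketch on a hypothetical four-row table; the country names and values are made up. It counts a cell as missing when it is either `None` or NaN, mirroring the `isnan(c) | col(c).isNull()` condition, and counts duplicates as total rows minus distinct rows.

```python
import math

# Hypothetical sample rows: (GeoAreaName, Percentage)
columns = ["GeoAreaName", "Percentage"]
rows = [
    ("Fiji", 62.0),
    ("Chile", None),          # null value
    ("Peru", float("nan")),   # NaN value
    ("Fiji", 62.0),           # exact duplicate of the first row
]


def is_missing(value):
    """A cell is missing when it is None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Missing values per column
missing = {
    name: sum(is_missing(row[i]) for row in rows)
    for i, name in enumerate(columns)
}

# Duplicate rows: total rows minus distinct rows
duplicates = len(rows) - len(set(rows))

print("Missing values:", missing)
print("Duplicate rows:", duplicates)
```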

In [None]:
from pyspark.sql.functions import when

# 03-DP

# Print the current dataframe
dfs[3].select("GeoAreaName", "Percentage").show(5)

In [None]:
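# Add SDG14_4_1 column: 1 if Percentage is above 50, otherwise 0
# (rows with a missing Percentage also fall through to 0)
dfs[3] = dfs[3].withColumn("SDG14_4_1", when(dfs[3]["Percentage"] > 50, 1).otherwise(0))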

# Print the updated dataframe
dfs[3].select("GeoAreaName", "Percentage", "SDG14_4_1").show(5)
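
The labelling rule in the cell above can be checked in plain Python. This is a minimal sketch: the function name `label_sdg14_4_1` and the sample values are hypothetical, and only the threshold of 50 comes from the notebook. It also mimics how `when(...).otherwise(0)` treats a missing percentage, where the comparison evaluates to null and the row falls through to 0.

```python
from typing import Optional

THRESHOLD = 50  # same cut-off as the Spark cell above


def label_sdg14_4_1(percentage: Optional[float]) -> int:
    """Return 1 when the percentage is above the threshold, otherwise 0."""
    if percentage is None:
        # A null comparison is not true, so the otherwise(0) branch applies
        return 0
    return 1 if percentage > THRESHOLD else 0


# Hypothetical sample values, including the boundary and a missing entry
sample = [72.5, 50.0, 31.0, None]
labels = [label_sdg14_4_1(p) for p in sample]
print(labels)
```

Note that a value of exactly 50 is labelled 0 because the comparison is strict.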

In [None]:
# Data Cleaning
# Rename "Entity" column to "GeoAreaName" in table 5
dfs[4] = dfs[4].withColumnRenamed("Entity", "GeoAreaName")

# Merge the tables based on the "GeoAreaName" column
merged_df = dfs[0]
for i in range(1, len(dfs)):
    merged_df = merged_df.join(dfs[i], on=["GeoAreaName"], how="outer")

# Display the merged table
merged_df.show(5)

In [None]:
# Drop rows with missing values
merged_df = merged_df.dropna()

# Count the number of rows
num_rows_left = merged_df.count()

# Display the dataframe without missing values and the count
print("Number of rows left:", num_rows_left)
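
How many rows survive the outer join followed by `dropna()` depends on how many countries have a value in every table. A toy version in plain Python makes this visible. The three small tables and their values below are hypothetical; only the column names are borrowed from the notebook.

```python
# Hypothetical toy tables keyed by GeoAreaName
policy = {"A": 3, "B": 4, "C": None}   # "Value" column, C has a null
funding = {"A": 0.1, "C": 0.3}         # "GDP" column, B is absent
stock = {"A": 65, "B": 40, "C": 80}    # "Percentage" column

tables = {"Value": policy, "GDP": funding, "Percentage": stock}

# Outer join: one row per key seen in any table; absent cells become None
all_keys = sorted(set().union(*tables.values()))
merged = [
    {"GeoAreaName": key, **{name: table.get(key) for name, table in tables.items()}}
    for key in all_keys
]

# dropna(): keep only rows where every cell has a value
complete = [row for row in merged if all(v is not None for v in row.values())]

print("Rows after outer join:", len(merged))
print("Rows after dropping missing values:", len(complete))
```

Only country A appears in every table with a non-missing value, so it is the only row left after the drop.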

In [None]:
# 04-DT: Data Transformation

# Remove the "Code" column from the merged_df dataframe
merged_df = merged_df.drop("Code")

# SDG14_4_1 was already encoded as 0/1 during data preparation;
# cast it to an integer so the models receive a numeric label
merged_df = merged_df.withColumn("SDG14_4_1", col("SDG14_4_1").cast("int"))

# Display the modified merged table (first 5 rows)
merged_df.show(5)

In [None]:
# Keep only rows with a non-null value in the "SDG14_4_1" column
filtered_df = merged_df.filter(merged_df["SDG14_4_1"].isNotNull())

# Plot the distribution of the binary label (one bar for 0, one bar for 1)
import matplotlib.pyplot as plt
label_values = filtered_df.select("SDG14_4_1").rdd.flatMap(lambda x: x).collect()
plt.hist(label_values, bins=[-0.5, 0.5, 1.5], edgecolor='black')
plt.xlabel("SDG14_4_1")
plt.ylabel("Frequency")
plt.title("Distribution of SDG14_4_1")
plt.xticks([0, 1])
plt.show()

In [None]:
# Calculate counts of 0s and 1s in the "SDG14_4_1" column
counts = merged_df.groupBy("SDG14_4_1").count()
counts.show()

In [None]:
# Print the modified merged table (first 5 rows)
merged_df.show(5)

In [None]:
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, when, expr
from pyspark.sql import DataFrame
import pandas as pd
import matplotlib.pyplot as plt

# Create a Spark session (if not already created)
spark = SparkSession.builder.appName('regression').getOrCreate()

# Assuming "merged_df" contains the merged DataFrame from the previous steps

# 1. Define the features (X) and target (y) for the model
feature_cols = ["GDP", "MPA Percent", "Aquaculture", "Value"]
target_col = "SDG14_4_1"

# 2. Create a Vector Assembler to assemble feature columns
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
merged_df = assembler.transform(merged_df)

# 3. Standardize the features using StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(merged_df)
merged_df = scaler_model.transform(merged_df)

# 4. Split the data into training and testing sets
train_data, test_data = merged_df.randomSplit([0.7, 0.3], seed=42)

# 5. Build a regression model using RandomForestRegressor
rf = RandomForestRegressor(featuresCol="scaled_features", labelCol=target_col)
rf_model = rf.fit(train_data)

# 6. Evaluate the model on the test data
test_predictions = rf_model.transform(test_data)
evaluator = RegressionEvaluator(labelCol=target_col, predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(test_predictions)

# 7. Print RMSE (Root Mean Squared Error)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# 8. Create a DataFrame to store results
results_df = test_predictions.select("GeoAreaName", target_col, "prediction")

# 9. Flag each prediction as within (1) or outside (0) one RMSE of the actual value
#    (an error-tolerance check, not a predicted probability)
results_df = results_df.withColumn("confidence_score", when(expr(f"abs({target_col} - prediction) <= {rmse}"), 1).otherwise(0))

# 10. Print the results DataFrame
results_df.show()

# 11. Calculate and plot the importance of predictors using the feature importances
feature_importances = pd.Series(rf_model.featureImportances.toArray(), index=feature_cols)
feature_importances.plot(kind='bar')
plt.xlabel('Predictors')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()

# 12. Create a summary table
summary_table = pd.DataFrame(columns=["Model Information", "Evaluation Metrics"])

# 13. Add model information and evaluation metrics to the summary table
summary_table.loc[0] = ["Random Forest Regressor", f"RMSE: {rmse}"]

# 14. Count predictions within one RMSE of the actual value (treated as correct) and outside it (treated as wrong)
correct_predictions = results_df.filter(results_df.confidence_score == 1).count()
wrong_predictions = results_df.filter(results_df.confidence_score == 0).count()

# 15. Calculate the percentage of correct and wrong predictions
total_predictions = results_df.count()
percentage_correct = (correct_predictions / total_predictions) * 100
percentage_wrong = (wrong_predictions / total_predictions) * 100

# 16. Add total, correct, and wrong rows to the summary table
summary_table.loc[1] = ["Total Predictions", total_predictions]
summary_table.loc[2] = ["Correct Predictions", correct_predictions]
summary_table.loc[3] = ["Wrong Predictions", wrong_predictions]
summary_table.loc[4] = ["% Correct", percentage_correct]
summary_table.loc[5] = ["% Wrong", percentage_wrong]

# 17. Print the updated summary table
print(summary_table)

# 18. Create a DataFrame to store the predictor importances
predictor_importance_df = pd.DataFrame({
    "Predictor": feature_cols,
    "Importance": feature_importances.values
})

# 19. Print the predictor importance table
print(predictor_importance_df)
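
Because `SDG14_4_1` is a 0/1 label, the share of predictions that fall within one RMSE of the actual value is not the same thing as classification accuracy. The following plain-Python sketch computes RMSE, the within-RMSE count used in the cell above, and an accuracy obtained by thresholding the regression output at 0.5. The 0.5 cut-off and all sample labels and predictions are assumptions made for illustration; they are not part of the original analysis.

```python
import math

# Hypothetical actual labels and regression outputs
labels = [1, 0, 1, 1, 0]
predictions = [0.9, 0.45, 0.55, 0.7, 0.1]

# Root mean squared error over all predictions
squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
rmse = math.sqrt(sum(squared_errors) / len(labels))

# Count used in the notebook: predictions within one RMSE of the label
within_rmse = sum(abs(y - p) <= rmse for y, p in zip(labels, predictions))

# Alternative (assumed) measure: threshold the output at 0.5 and compare classes
predicted_class = [1 if p >= 0.5 else 0 for p in predictions]
accuracy = sum(y == c for y, c in zip(labels, predicted_class)) / len(labels)

print(f"RMSE: {rmse:.4f}")
print(f"Within one RMSE: {within_rmse} of {len(labels)}")
print(f"Accuracy at 0.5 threshold: {accuracy:.0%}")
```

On this toy data three of five predictions are within one RMSE while all five land on the correct side of 0.5, so the two measures can disagree.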

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# 1. Preprocess the data
pandas_df = merged_df.toPandas()

# 2. Define features and target
features = ["GDP", "MPA Percent", "Aquaculture", "Value"]
target = "SDG14_4_1"

X = pandas_df[features]
y = pandas_df[target]

# 3. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 4. Split the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# 5. Build a Regression Neural Network using TensorFlow
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear')
])

model.compile(optimizer='adam', loss='mean_squared_error')

# 6. Train the model
model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=2)

# 7. Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# 8. Calculate predictor importance
importance = model.layers[0].get_weights()[0]  # Get weights of the first layer

# 9. Compute additional metrics (a regression model has no correct/wrong counts)
rmse = np.sqrt(mse)

# 10. Create a summary table with the evaluation metrics
# (DataFrame.append was removed in pandas 2.0, so the rows are passed to the constructor)
summary_table = pd.DataFrame(
    [{"Metric": "MSE", "Value": mse}, {"Metric": "RMSE", "Value": rmse}],
    columns=["Metric", "Value"],
)

# 11. Plot predictor importance
importance_values = np.abs(importance).mean(axis=1)  # Mean absolute weight per input feature across first-layer neurons

plt.figure(figsize=(10, 6))
plt.bar(features, importance_values)
plt.xlabel("Predictors")
plt.ylabel("Importance")
plt.title("Predictor Importance")
plt.show()

# 12. Create an additional summary table describing the model
model_name = "Regression Neural Network"
additional_summary_table = pd.DataFrame(
    [
        {"Model": model_name, "Metric": "Number of Features", "Value": len(features)},
        {"Model": model_name, "Metric": "Number of Neurons in the First Layer", "Value": 64},
        {"Model": model_name, "Metric": "Number of Neurons in the Second Layer", "Value": 32},
        {"Model": model_name, "Metric": "Number of Epochs", "Value": 100},
    ],
    columns=["Model", "Metric", "Value"],
)

# Print the updated summary table
print("Updated Summary Table for Regression Neural Network Model:")
print(summary_table)
print("\nAdditional Summary Table for Regression Neural Network Model:")
print(additional_summary_table)
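
The predictor importance plotted for the neural network is the mean absolute first-layer weight per input feature. The sketch below reproduces that calculation in plain Python on a small hypothetical weight matrix (one row per feature, one column per neuron) and normalises the result so the shares sum to 1. The weight values and the normalisation step are assumptions for illustration. Since first-layer weights ignore what the later layers do with each neuron, this should be read as a rough heuristic rather than a rigorous importance measure.

```python
features = ["GDP", "MPA Percent", "Aquaculture", "Value"]

# Hypothetical first-layer weights: one row per input feature, one column per neuron
weights = [
    [0.2, -0.4, 0.6],
    [-0.1, 0.1, 0.1],
    [0.5, 0.5, -0.5],
    [0.0, -0.3, 0.3],
]

# Mean absolute weight per feature (same idea as np.abs(importance).mean(axis=1))
importance = [sum(abs(w) for w in row) / len(row) for row in weights]

# Normalise so the shares are comparable across models (assumed extra step)
total = sum(importance)
shares = {name: value / total for name, value in zip(features, importance)}

# Rank features from most to least important
ranking = sorted(shares, key=shares.get, reverse=True)
for name in ranking:
    print(f"{name}: {shares[name]:.2f}")
```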


In [None]:
# Import necessary libraries
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
import matplotlib.pyplot as plt
import pandas as pd

# 1. Define features and target
features = ["GDP", "MPA Percent", "Aquaculture", "Value"]
target = "SDG14_4_1"

# 2. Create a Vector Assembler to combine features into a single vector
vector_assembler = VectorAssembler(inputCols=features, outputCol="features_vector")
assembled_df = vector_assembler.transform(merged_df)  # Use the Spark DataFrame directly

# 3. Split the data
train_data, test_data = assembled_df.randomSplit([0.8, 0.2], seed=42)

# 4. Build a Decision Tree model (keep the estimator and the fitted model in separate variables)
dt = DecisionTreeRegressor(featuresCol="features_vector", labelCol=target, seed=42)
dt_model = dt.fit(train_data)

# 5. Evaluate the model
evaluator = RegressionEvaluator(labelCol=target, predictionCol="prediction", metricName="rmse")
predictions = dt_model.transform(test_data)
rmse = evaluator.evaluate(predictions)

# 6. Create a summary table
# (DataFrame.append was removed in pandas 2.0, so the rows are passed to the constructor)
summary_table = pd.DataFrame(
    [
        {"Model": "Decision Tree", "Metric": "Number of Features", "Value": len(features)},
        {"Model": "Decision Tree", "Metric": "RMSE", "Value": rmse},
    ],
    columns=["Model", "Metric", "Value"],
)

# 7. Plot Predictor Importance
importances = dt_model.featureImportances.toArray()
plt.figure(figsize=(10, 6))
plt.bar(features, importances)
plt.xlabel("Predictors")
plt.ylabel("Importance")
plt.title("Predictor Importance")
plt.show()

# 8. Create the Predictor Importance Table
predictor_importance_table = pd.DataFrame({"Predictor": features, "Importance": importances})

# Print the updated summary table and predictor importance
print("Updated Summary Table for Decision Tree Model:")
print(summary_table)
print("\nDecision Tree Predictor Importance Table:")
print(predictor_importance_table)

