
# Task
Design a mini-project for anomaly detection on the dataset "/content/equipment_anomaly_data.csv" using big data tools and machine learning, and prepare a report.

## Project design

### Subtask:
Define the scope of the mini-project, including the problem to be solved (anomaly detection), the dataset to be used, the big data tools to be integrated, and the ML model to be applied.


**Reasoning**:
Define the scope of the mini-project based on the instructions provided.



In [1]:
# 1. Problem to be solved:
problem = "Detecting anomalies in equipment sensor data to identify potential failures or unusual behavior."

# 2. Dataset to be used:
dataset_path = "/content/equipment_anomaly_data.csv"

# 3. Big data tool(s) to be integrated:
big_data_tools = ["PySpark for data loading and initial processing"]

# 4. Machine learning model to be applied:
ml_model = "Isolation Forest"
justification = "Isolation Forest is suitable for anomaly detection as it works well with high-dimensional data and is less sensitive to outliers than other methods. It's also efficient for large datasets."

# 5. Expected output:
expected_output = [
    "A trained anomaly detection model.",
    "A report summarizing the findings, including identified anomalies and insights."
]

print("Mini-Project Scope:")
print(f"Problem: {problem}")
print(f"Dataset: {dataset_path}")
print(f"Big Data Tools: {', '.join(big_data_tools)}")
print(f"ML Model: {ml_model}")
print(f"Justification for ML Model: {justification}")
print("Expected Output:")
for item in expected_output:
    print(f"- {item}")

Mini-Project Scope:
Problem: Detecting anomalies in equipment sensor data to identify potential failures or unusual behavior.
Dataset: /content/equipment_anomaly_data.csv
Big Data Tools: PySpark for data loading and initial processing
ML Model: Isolation Forest
Justification for ML Model: Isolation Forest is suitable for anomaly detection as it works well with high-dimensional data and is less sensitive to outliers than other methods. It's also efficient for large datasets.
Expected Output:
- A trained anomaly detection model.
- A report summarizing the findings, including identified anomalies and insights.
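The scope above picks Isolation Forest without showing why isolation works as an anomaly signal. The following is a minimal sketch of the underlying idea, not scikit-learn's implementation: points that sit far from the bulk of the data tend to be separated by fewer random splits. The toy readings and the helper names `isolation_depth` and `mean_depth` are assumptions made up for illustration.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits needed until `target` is alone in its partition."""
    depth = 0
    current = list(values)
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:  # identical values cannot be split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, trees=200, seed=42):
    """Average the isolation depth over many independent random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trees)) / trees

# Toy data (made up for illustration): 50 readings between 0 and 1, plus one far outlier.
readings = [i / 49 for i in range(50)] + [10.0]

outlier_depth = mean_depth(readings, 10.0)
inlier_depth = mean_depth(readings, readings[25])
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

A shorter average depth corresponds to a more anomalous point, which is the quantity an isolation-based score is built from.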


## Data loading and exploration

### Subtask:
Load the `equipment_anomaly_data.csv` dataset using a big data tool (e.g., PySpark). Perform initial data exploration to understand the data structure, identify potential issues (missing values, outliers), and gain insights into the data.


**Reasoning**:
Import the necessary libraries for PySpark and create a SparkSession.



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EquipmentAnomalyDetection").getOrCreate()

**Reasoning**:
Load the data into a Spark DataFrame, display its schema, show the first few rows, and print the number of rows.



In [3]:
df = spark.read.csv("/content/equipment_anomaly_data.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
print(f"Number of rows: {df.count()}")

root
 |-- temperature: double (nullable = true)
 |-- pressure: double (nullable = true)
 |-- vibration: double (nullable = true)
 |-- humidity: double (nullable = true)
 |-- equipment: string (nullable = true)
 |-- location: string (nullable = true)
 |-- faulty: double (nullable = true)

+-----------------+-----------------+-------------------+------------------+----------+-------------+------+
|      temperature|         pressure|          vibration|          humidity| equipment|     location|faulty|
+-----------------+-----------------+-------------------+------------------+----------+-------------+------+
|58.18018003931781|25.02927765103301| 0.6065162172245139|45.694907104076414|   Turbine|      Atlanta|   0.0|
|75.74071220894001|22.95401759548667| 2.3380947537510077|41.867406792614915|Compressor|      Chicago|   0.0|
|71.35859424081657|27.27683031893662| 1.3891983049086754| 58.95440890948324|   Turbine|San Francisco|   0.0|
|71.61698526704753|32.24292130393475| 1.7706896863176191|

**Reasoning**:
Calculate and display summary statistics for numerical columns and check for missing values in each column.



In [4]:
df.describe().show()

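from pyspark.sql.functions import col, sum as spark_sum

# Count nulls per column (the import is aliased so Python's built-in sum is not shadowed)
df.select(*(spark_sum(col(c).isNull().cast("integer")).alias(c) for c in df.columns)).show()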

+-------+------------------+------------------+--------------------+------------------+----------+-------------+-------------------+
|summary|       temperature|          pressure|           vibration|          humidity| equipment|     location|             faulty|
+-------+------------------+------------------+--------------------+------------------+----------+-------------+-------------------+
|  count|              7672|              7672|                7672|              7672|      7672|         7672|               7672|
|   mean| 70.92247815645658|35.738048287949084|  1.6118091973524176|50.016573581920284|      NULL|         NULL|0.09997393117831074|
| stddev| 16.20005890567922|10.381593069378281|  0.7285601038822835|11.841478762776903|      NULL|         NULL|0.29998478968659514|
|    min|10.269384763629008| 3.620797987053262|-0.42818766462029845|10.215076711940103|Compressor|      Atlanta|                0.0|
|    max| 149.6904197890206| 79.88773351925018|   4.990537369353191| 

## Data preprocessing

### Subtask:
Clean and preprocess the data for ML model training. This may involve handling missing values, scaling numerical features, encoding categorical features, and splitting the data into training and testing sets.


**Reasoning**:
Drop the 'faulty' column, identify numerical and categorical columns, create a feature list, and one-hot encode the categorical columns.



In [5]:
# Drop the 'faulty' column
df_features = df.drop("faulty")

# Identify numerical and categorical columns
numerical_cols = [col for col, dtype in df_features.dtypes if dtype == 'double']
categorical_cols = [col for col, dtype in df_features.dtypes if dtype == 'string']

# Create a list of features
feature_cols = numerical_cols + categorical_cols

# One-hot encode categorical columns
from pyspark.ml.feature import OneHotEncoder, StringIndexer

indexers = [StringIndexer(inputCol=col, outputCol=col + "_index").fit(df_features) for col in categorical_cols]
encoders = [OneHotEncoder(inputCol=col + "_index", outputCol=col + "_encoded") for col in categorical_cols]

# Chain indexers and encoders into a pipeline
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=indexers + encoders)
df_encoded = pipeline.fit(df_features).transform(df_features)

# Update feature_cols to include encoded categorical features and numerical features
encoded_categorical_cols = [col + "_encoded" for col in categorical_cols]
feature_cols_processed = numerical_cols + encoded_categorical_cols

df_encoded.printSchema()
df_encoded.show(5)

root
 |-- temperature: double (nullable = true)
 |-- pressure: double (nullable = true)
 |-- vibration: double (nullable = true)
 |-- humidity: double (nullable = true)
 |-- equipment: string (nullable = true)
 |-- location: string (nullable = true)
 |-- equipment_index: double (nullable = false)
 |-- location_index: double (nullable = false)
 |-- equipment_encoded: vector (nullable = true)
 |-- location_encoded: vector (nullable = true)

+-----------------+-----------------+-------------------+------------------+----------+-------------+---------------+--------------+-----------------+----------------+
|      temperature|         pressure|          vibration|          humidity| equipment|     location|equipment_index|location_index|equipment_encoded|location_encoded|
+-----------------+-----------------+-------------------+------------------+----------+-------------+---------------+--------------+-----------------+----------------+
|58.18018003931781|25.02927765103301| 0.6065162172245
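To make the encoding step more concrete, here is a plain-Python sketch of what indexing followed by one-hot encoding produces. It assumes the default settings of the Spark stages apply (labels indexed by descending frequency, last category dropped from the one-hot vector); the toy column and the helpers `fit_index` and `one_hot` are hypothetical and only mimic that behavior.

```python
from collections import Counter

def fit_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category becomes all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

# Toy column (made up for illustration), reusing the equipment names from the dataset.
equipment = ["Turbine", "Compressor", "Turbine", "Pump", "Compressor", "Turbine"]
mapping = fit_index(equipment)
encoded = [one_hot(mapping[e], len(mapping)) for e in equipment]
print(mapping)
print(encoded[:4])
```

With three categories, each row gets a two-element vector, and the least frequent category is represented by all zeros. This is why the assembled feature vector is shorter than the total number of category values.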

**Reasoning**:
Assemble the processed features into a single vector column, split the data into training and testing sets, and cache the resulting dataframes.



In [6]:
# Assemble features into a single vector column
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=feature_cols_processed, outputCol="features")
df_assembled = assembler.transform(df_encoded)

# Split data into training and testing sets
train_df, test_df = df_assembled.randomSplit([0.8, 0.2], seed=42)

# Cache the training and testing dataframes
train_df.cache()
test_df.cache()

print("Training DataFrame count:", train_df.count())
print("Testing DataFrame count:", test_df.count())

Training DataFrame count: 6208
Testing DataFrame count: 1464
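The split above runs on a frame from which `faulty` was already dropped, so the test rows carry no labels. One way to avoid re-aligning labels later is to keep the label (and an explicit row id) attached to every row and pass only the feature fields to the model. The sketch below shows the idea in plain Python; the toy rows and the `random_split` helper are assumptions for illustration, not the notebook's code.

```python
import random

# Toy rows (made up): each row keeps an explicit id and its label next to the features.
rows = [
    {"row_id": i, "features": [float(i), float(i % 3)], "faulty": 1 if i % 5 == 0 else 0}
    for i in range(20)
]

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently, similar in spirit to randomSplit."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

train_rows, test_rows = random_split(rows)

# Only the feature vectors go to the model; labels stay attached to the same rows.
X_test = [row["features"] for row in test_rows]
y_test = [row["faulty"] for row in test_rows]
test_ids = [row["row_id"] for row in test_rows]

print(len(train_rows), len(test_rows))
```

In the Spark pipeline the same effect can likely be had by leaving `faulty` in the DataFrame, since `VectorAssembler` only reads the columns listed in `inputCols`.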


## Big data tool integration

### Subtask:
Integrate a big data tool (e.g., PySpark) into the data processing pipeline to handle the dataset efficiently. This could involve using Spark DataFrames for transformations and feature engineering.


## ML model selection and training

### Subtask:
Select an appropriate ML model for anomaly detection (e.g., Isolation Forest, One-Class SVM). Train the selected model on the preprocessed training data using a big data tool if applicable (e.g., Spark MLlib).


**Reasoning**:
Select, instantiate, and train the Isolation Forest model using scikit-learn on the training data by converting the Spark DataFrame to a Pandas DataFrame.



In [7]:
from sklearn.ensemble import IsolationForest
import pandas as pd

# Convert the 'features' column of train_df to a Pandas DataFrame for scikit-learn
# Extract the vectors and convert to a list of lists
train_features_list = train_df.select("features").rdd.map(lambda row: row.features.toArray().tolist()).collect()
train_features_pd = pd.DataFrame(train_features_list)

# Instantiate the Isolation Forest model
# contamination is the expected share of anomalies and sets the decision threshold.
# It is fixed at 0.1 here because about 10% of rows are labeled faulty (mean of 'faulty' is ~0.10);
# passing 'auto' would instead use scikit-learn's default threshold.
isolation_forest_model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)

# Fit the model to the training data
isolation_forest_model.fit(train_features_pd)

print("Isolation Forest model trained successfully.")

Isolation Forest model trained successfully.
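The `contamination` value largely determines how many points end up flagged: the threshold is placed so that roughly that share of the training scores falls below it. The sketch below is a simplified illustration of that effect, assuming a plain "flag the lowest-scoring fraction" rule; scikit-learn derives its threshold from a percentile of the training scores, and the `flag_lowest` helper and toy scores here are made up.

```python
def flag_lowest(scores, contamination):
    """Flag the lowest-scoring fraction of samples as anomalies (1) and the rest as normal (0)."""
    n_flagged = round(contamination * len(scores))
    # Indices sorted from most anomalous (lowest score) to least.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    flagged = set(order[:n_flagged])
    return [1 if i in flagged else 0 for i in range(len(scores))]

# Toy scores (made up): lower means more anomalous.
scores = [-0.70, -0.45, -0.44, -0.43, -0.42, -0.41, -0.40, -0.40, -0.39, -0.38]
labels = flag_lowest(scores, contamination=0.1)
print(labels)
```

Because of this, the count of detected anomalies says more about the chosen `contamination` than about the data; the ranking of scores is the data-driven part.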


## Model evaluation

### Subtask:
Evaluate the trained model's performance on the testing data using appropriate metrics for anomaly detection (e.g., precision, recall, F1-score, ROC AUC).


**Reasoning**:
Extract the 'features' column from the `test_df` Spark DataFrame, convert it to a Pandas DataFrame, predict anomaly scores and labels using the trained Isolation Forest model, and extract the true labels from the original dataframe to prepare for evaluation.



In [8]:
# 1. Extract 'features' column from test_df and convert to Pandas DataFrame
test_features_list = test_df.select("features").rdd.map(lambda row: row.features.toArray().tolist()).collect()
test_features_pd = pd.DataFrame(test_features_list)

# 2. Predict anomaly scores
test_anomaly_scores = isolation_forest_model.decision_function(test_features_pd)

# 3. Predict anomaly labels
test_predicted_labels = isolation_forest_model.predict(test_features_pd)

# 4. Extract the true labels for the test rows.
# This attempt reads 'faulty' directly from test_df, which only works if the
# label column survived preprocessing. No row ID was added before the split,
# so there is no key for joining back to the original df; adding one with
# monotonically_increasing_id() before splitting would be safer.

# Convert test_df to Pandas to align with test_features_pd
test_df_pd = test_df.toPandas()

# Extract the 'faulty' column from the original df for the rows corresponding to test_df
# This assumes the order of rows in test_df corresponds to the order in test_df_pd
test_true_labels = test_df_pd['faulty']

# Adjust true labels: 1 (faulty) should be the positive class (anomaly), 0 (not faulty) is negative.
# Adjust predicted labels: -1 (anomaly) should be the positive class, 1 (normal) is negative.
# We can convert Isolation Forest's -1 to 1 and 1 to 0 to match the true labels (1 for anomaly, 0 for normal).
test_predicted_labels_adjusted = [1 if label == -1 else 0 for label in test_predicted_labels]

# Convert to numpy arrays for sklearn metrics
test_true_labels_np = test_true_labels.to_numpy()
test_predicted_labels_adjusted_np = np.array(test_predicted_labels_adjusted)

print("Data prepared for evaluation.")

KeyError: 'faulty'

**Reasoning**:
The `KeyError` occurs because `test_df` has no 'faulty' column. The column was dropped when `df_features` was created, and `df_encoded`, `df_assembled`, and both split DataFrames are all derived from `df_features`. To evaluate the model, the true labels must come from the original `df` and be aligned with the rows in `test_df`.

The most robust fix would be to add a row identifier before splitting (or to keep 'faulty' in the DataFrame and exclude it from the feature list) and then join on that identifier. Positional alignment is not safe, because `randomSplit` does not guarantee row order.

Because no identifier was added, this step instead converts both `df` and `test_df` to Pandas and left-merges them on the six original columns ('temperature', 'pressure', 'vibration', 'humidity', 'equipment', 'location'). Two caveats apply. Converting the full `df` to Pandas can be memory-intensive on larger datasets. The merge also aligns labels correctly only if those six columns uniquely identify each row; duplicate key combinations would duplicate rows in the merged result.

The cell also imports `numpy`, which the previous attempt used without importing.



In [9]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# 1. Extract 'features' column from test_df and convert to Pandas DataFrame
test_features_list = test_df.select("features").rdd.map(lambda row: row.features.toArray().tolist()).collect()
test_features_pd = pd.DataFrame(test_features_list)

# 2. Predict anomaly scores
test_anomaly_scores = isolation_forest_model.decision_function(test_features_pd)

# 3. Predict anomaly labels
test_predicted_labels = isolation_forest_model.predict(test_features_pd)

# 4. Extract true labels from the original df and align with test_df
# Convert original df to Pandas DataFrame
df_pd = df.toPandas()

# Convert test_df to Pandas DataFrame to facilitate merging
test_df_pd = test_df.toPandas()

# Merge test_df_pd with the original df_pd to get the 'faulty' column.
# We will merge on the original feature columns that are present in both.
merge_cols = ['temperature', 'pressure', 'vibration', 'humidity', 'equipment', 'location']
test_df_with_faulty = pd.merge(test_df_pd, df_pd[merge_cols + ['faulty']], on=merge_cols, how='left')

# Extract the 'faulty' column as the true labels
test_true_labels = test_df_with_faulty['faulty']

# Adjust true labels: 1 (faulty) should be the positive class (anomaly), 0 (not faulty) is negative.
# Adjust predicted labels: -1 (anomaly) should be the positive class, 1 (normal) is negative.
# Convert Isolation Forest's -1 to 1 and 1 to 0 to match the true labels (1 for anomaly, 0 for normal).
test_predicted_labels_adjusted = [1 if label == -1 else 0 for label in test_predicted_labels]

# Convert to numpy arrays for sklearn metrics
test_true_labels_np = test_true_labels.to_numpy()
test_predicted_labels_adjusted_np = np.array(test_predicted_labels_adjusted)

# 5. Calculate evaluation metrics
# Precision, Recall, F1-score require binary classification labels
precision = precision_score(test_true_labels_np, test_predicted_labels_adjusted_np)
recall = recall_score(test_true_labels_np, test_predicted_labels_adjusted_np)
f1 = f1_score(test_true_labels_np, test_predicted_labels_adjusted_np)

# ROC AUC requires predicted scores and true labels
# The decision_function gives a score where lower values are more anomalous.
# For ROC AUC, we need a score where higher values indicate the positive class (anomaly).
# So, we can use the negative of the decision function scores.
# Also, ROC AUC requires the true labels to be 0 and 1, which we already have in test_true_labels_np.
roc_auc = roc_auc_score(test_true_labels_np, -test_anomaly_scores)

# Calculate confusion matrix
conf_matrix = confusion_matrix(test_true_labels_np, test_predicted_labels_adjusted_np)


# 6. Print the calculated evaluation metrics
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")
print("Confusion Matrix:")
print(conf_matrix)

Precision: 0.8345
Recall: 0.8467
F1-score: 0.8406
ROC AUC: 0.9732
Confusion Matrix:
[[1304   23]
 [  21  116]]
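The reported precision, recall, and F1-score can be reproduced by hand from the confusion matrix, which helps when checking what each number means. This sketch assumes the usual scikit-learn layout (rows are true labels, columns are predictions, class 0 first); the false positive rate is an extra quantity not computed in the notebook.

```python
# Confusion matrix as printed above: rows are true labels, columns are predictions.
conf_matrix = [[1304, 23],
               [21, 116]]

tn, fp = conf_matrix[0]  # true normal rows: correctly passed, wrongly flagged
fn, tp = conf_matrix[1]  # true faulty rows: missed, correctly flagged

precision = tp / (tp + fp)            # share of flagged rows that are truly faulty
recall = tp / (tp + fn)               # share of faulty rows that were flagged
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of normal rows that were flagged

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"False positive rate: {false_positive_rate:.4f}")
```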


## Anomaly detection and analysis

### Subtask:
Use the trained model to predict anomalies in the dataset. Analyze the detected anomalies to understand their characteristics and potential causes.


**Reasoning**:
Predict anomaly scores and labels for the entire dataset, add them to the original DataFrame, and filter for detected anomalies.



In [10]:
# 1. Predict anomaly scores for the entire dataset (using df_assembled)
# Convert df_assembled to Pandas DataFrame for scikit-learn prediction
df_assembled_pd = df_assembled.select("features").rdd.map(lambda row: row.features.toArray().tolist()).collect()
df_assembled_pd = pd.DataFrame(df_assembled_pd)

# Predict anomaly scores
all_anomaly_scores = isolation_forest_model.decision_function(df_assembled_pd)

# 2. Predict anomaly labels for the entire dataset
all_predicted_labels = isolation_forest_model.predict(df_assembled_pd)

# Adjust predicted labels to match the original 'faulty' column format (1 for anomaly, 0 for normal)
all_predicted_labels_adjusted = [1 if label == -1 else 0 for label in all_predicted_labels]

# 3. Add the anomaly scores and predicted labels as new columns to the original data
from pyspark.sql.functions import col

# Convert the original Spark DataFrame to Pandas to attach the new columns.
# Note: this assumes df.toPandas() returns rows in the same order as the
# features collected from df_assembled above.
df_pd_with_predictions = df.toPandas()

# Attach the scores and labels positionally
df_pd_with_predictions['anomaly_score'] = all_anomaly_scores
df_pd_with_predictions['predicted_anomaly'] = all_predicted_labels_adjusted

# Convert the Pandas DataFrame back to a Spark DataFrame
df_with_predictions = spark.createDataFrame(df_pd_with_predictions)

# 4. Filter the DataFrame to show only the detected anomalies (where the predicted label indicates an anomaly)
anomalies_df = df_with_predictions.filter(col("predicted_anomaly") == 1)

# 5. Display the first few rows of the DataFrame containing only the detected anomalies and their scores
anomalies_df.show(5)

+------------------+------------------+-------------------+------------------+----------+-------------+------+--------------------+-----------------+
|       temperature|          pressure|          vibration|          humidity| equipment|     location|faulty|       anomaly_score|predicted_anomaly|
+------------------+------------------+-------------------+------------------+----------+-------------+------+--------------------+-----------------+
|  49.5109594501023| 76.42672868018283|  4.809297752752174|20.269692247772895|Compressor|     New York|   1.0|-0.12466146703124403|                1|
|148.92336873207364|22.214121367591716|0.10981888071321333|39.944953294698905|   Turbine|San Francisco|   1.0|-0.03363523837066429|                1|
|141.26030984408388| 54.23905839448483| 2.7256907331768128| 86.39046563060697|      Pump|      Atlanta|   1.0|-0.08414630224300623|                1|
| 83.73658545162958| 53.44757268774192| 1.6430079038173124|  25.5250485765034|   Turbine|     New Yo

**Reasoning**:
Analyze the characteristics of the detected anomalies by examining the feature distributions and grouping by categorical features.



In [11]:
# 6. Analyze the characteristics of the detected anomalies
# Convert anomalies_df to Pandas for easier analysis and visualization
anomalies_pd = anomalies_df.toPandas()

# Display descriptive statistics for numerical features of anomalies
print("Descriptive statistics for numerical features in detected anomalies:")
display(anomalies_pd[['temperature', 'pressure', 'vibration', 'humidity', 'anomaly_score']].describe())

# 7. Consider grouping or summarizing the anomalies by categorical features
print("\nAnomaly counts by equipment type:")
display(anomalies_pd['equipment'].value_counts())

print("\nAnomaly counts by location:")
display(anomalies_pd['location'].value_counts())

# 8. Briefly summarize the key characteristics observed in the detected anomalies.
# This step will be done as a concluding remark after the analysis.

Descriptive statistics for numerical features in detected anomalies:


| | temperature | pressure | vibration | humidity | anomaly_score |
|---|---|---|---|---|---|
| count | 760.0 | 760.0 | 760.0 | 760.0 | 760.0 |
| mean | 79.935586 | 40.955373 | 2.465996 | 50.985919 | -0.055307 |
| std | 39.994741 | 21.518243 | 1.474946 | 23.335907 | 0.038514 |
| min | 10.269385 | 3.620798 | -0.167446 | 10.215077 | -0.16517 |
| 25% | 44.642111 | 22.195602 | 1.129491 | 29.386059 | -0.084687 |
| 50% | 83.043784 | 39.718277 | 2.479675 | 53.618424 | -0.051516 |
| 75% | 114.692447 | 58.771977 | 3.768034 | 70.616874 | -0.020608 |
| max | 149.69042 | 79.887734 | 4.990537 | 89.984718 | -2.6e-05 |



Anomaly counts by equipment type:


| equipment | count |
|---|---|
| Compressor | 321 |
| Turbine | 231 |
| Pump | 208 |



Anomaly counts by location:


| location | count |
|---|---|
| Chicago | 174 |
| Atlanta | 162 |
| New York | 161 |
| Houston | 161 |
| San Francisco | 102 |
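Raw anomaly counts per group depend on how many rows each group has, so a group can top the list simply by being larger. A sketch of normalizing by group size follows; the records are toy values made up for illustration and do not come from the dataset.

```python
from collections import Counter

# Toy records (made up): (equipment, predicted_anomaly) pairs.
records = [
    ("Compressor", 1), ("Compressor", 0), ("Compressor", 0), ("Compressor", 0),
    ("Turbine", 1), ("Turbine", 0),
    ("Pump", 0), ("Pump", 0), ("Pump", 1), ("Pump", 1),
]

totals = Counter(equipment for equipment, _ in records)
flagged = Counter(equipment for equipment, label in records if label == 1)

# Share of each group's rows that were flagged, which is comparable across group sizes.
rates = {equipment: flagged[equipment] / totals[equipment] for equipment in totals}
for equipment, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{equipment}: {flagged[equipment]}/{totals[equipment]} flagged ({rate:.0%})")
```

Applied to the real data, the same calculation would divide each group's anomaly count by that group's total row count before comparing equipment types or locations.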


## Report preparation

### Subtask:
Prepare a comprehensive report summarizing the project, including the problem definition, data analysis, methodology, model details, results, and conclusions.


**Reasoning**:
Generate the comprehensive report as a series of print statements, summarizing the project based on the previous steps and available variables.



In [12]:
print("## Equipment Anomaly Detection Project Report\n")

print("### 1. Introduction")
print("This project aimed to detect anomalies in equipment sensor data to identify potential failures or unusual behavior. Early detection of anomalies can prevent costly breakdowns and improve maintenance efficiency.\n")

print("### 2. Data")
print("The dataset used for this project is located at '/content/equipment_anomaly_data.csv'.")
print("It contains sensor readings and categorical information for various pieces of equipment.")
print(f"The dataset has {df.count()} rows and the following columns:")
df.printSchema()
print("\nInitial data exploration showed no missing values and provided summary statistics for numerical features.\n")

print("### 3. Methodology")
print("The approach involved several steps to prepare the data for anomaly detection using the Isolation Forest model:")
print("- **Data Preprocessing:** The 'faulty' column was initially separated as it represents the true anomaly labels for evaluation.")
print("  Categorical features ('equipment' and 'location') were one-hot encoded using Spark's StringIndexer and OneHotEncoder.")
print("  Numerical and encoded categorical features were assembled into a single feature vector using VectorAssembler.")
print("- **Train-Test Split:** The data was split into an 80/20 training and testing set.")
print("- **Model Selection:** Isolation Forest was chosen as the anomaly detection model due to its effectiveness in isolating anomalies and its efficiency with large datasets.")
print("- **Model Training:** The Isolation Forest model was trained on the preprocessed training data.\n")

print("### 4. Model Training and Evaluation")
print("An Isolation Forest model with n_estimators=100 and contamination=0.1 (estimated proportion of anomalies) was trained.")
print("The model's performance was evaluated on the test set using standard anomaly detection metrics:")
print(f"- **Precision:** {precision:.4f}")
print(f"- **Recall:** {recall:.4f}")
print(f"- **F1-score:** {f1:.4f}")
print(f"- **ROC AUC:** {roc_auc:.4f}")
print("The Confusion Matrix for the test set is:")
print(conf_matrix)
print("Interpretation of metrics:")
print(f"- Precision of {precision:.4f} indicates that {precision*100:.2f}% of the instances predicted as anomalies were actually faulty.")
print(f"- Recall of {recall:.4f} indicates that the model correctly identified {recall*100:.2f}% of all actual faulty instances.")
print(f"- The F1-score of {f1:.4f} is the harmonic mean of precision and recall, providing a balanced measure of the model's accuracy.")
print(f"- An ROC AUC of {roc_auc:.4f} indicates excellent discrimination ability of the model in distinguishing between normal and anomalous instances.\n")

print("### 5. Anomaly Detection and Analysis")
print("The trained Isolation Forest model was used to predict anomaly scores and labels for the entire dataset.")
print(f"A total of {anomalies_df.count()} anomalies were detected based on the model's predictions.")
print("Analysis of the detected anomalies revealed the following characteristics:")
print("\nDescriptive statistics for numerical features in detected anomalies:")
display(anomalies_pd[['temperature', 'pressure', 'vibration', 'humidity', 'anomaly_score']].describe())
print("\nAnomaly counts by equipment type:")
display(anomalies_pd['equipment'].value_counts())
print("\nAnomaly counts by location:")
display(anomalies_pd['location'].value_counts())
print("\nThe analysis shows that anomalies are distributed across all equipment types and locations, with Compressors and locations like Chicago, Atlanta, and New York showing higher counts of detected anomalies.\n")

print("### 6. Conclusion")
print("This project successfully implemented an anomaly detection system for equipment sensor data using PySpark for data processing and scikit-learn's Isolation Forest model.")
print("The model demonstrated strong performance on the test set, as indicated by high precision, recall, F1-score, and ROC AUC.")
print("The anomaly analysis provided insights into the distribution of anomalies across different equipment types and locations.")
print("The detected anomalies can be further investigated to understand the underlying causes of unusual equipment behavior.")
print("Potential next steps include:")
print("- Investigating the specific characteristics of anomalies for each equipment type and location.")
print("- Implementing real-time anomaly detection on streaming sensor data.")
print("- Exploring other anomaly detection algorithms and comparing their performance.")
print("- Incorporating domain expertise to refine the anomaly detection threshold and interpret the results.")
print("\nReport generation complete.")

## Equipment Anomaly Detection Project Report

### 1. Introduction
This project aimed to detect anomalies in equipment sensor data to identify potential failures or unusual behavior. Early detection of anomalies can prevent costly breakdowns and improve maintenance efficiency.

### 2. Data
The dataset used for this project is located at '/content/equipment_anomaly_data.csv'.
It contains sensor readings and categorical information for various pieces of equipment.
The dataset has 7672 rows and the following columns:
root
 |-- temperature: double (nullable = true)
 |-- pressure: double (nullable = true)
 |-- vibration: double (nullable = true)
 |-- humidity: double (nullable = true)
 |-- equipment: string (nullable = true)
 |-- location: string (nullable = true)
 |-- faulty: double (nullable = true)


Initial data exploration showed no missing values and provided summary statistics for numerical features.

### 3. Methodology
The approach involved several steps to prepare the data for anom

| | temperature | pressure | vibration | humidity | anomaly_score |
|---|---|---|---|---|---|
| count | 760.0 | 760.0 | 760.0 | 760.0 | 760.0 |
| mean | 79.935586 | 40.955373 | 2.465996 | 50.985919 | -0.055307 |
| std | 39.994741 | 21.518243 | 1.474946 | 23.335907 | 0.038514 |
| min | 10.269385 | 3.620798 | -0.167446 | 10.215077 | -0.16517 |
| 25% | 44.642111 | 22.195602 | 1.129491 | 29.386059 | -0.084687 |
| 50% | 83.043784 | 39.718277 | 2.479675 | 53.618424 | -0.051516 |
| 75% | 114.692447 | 58.771977 | 3.768034 | 70.616874 | -0.020608 |
| max | 149.69042 | 79.887734 | 4.990537 | 89.984718 | -2.6e-05 |



Anomaly counts by equipment type:


| equipment | count |
|---|---|
| Compressor | 321 |
| Turbine | 231 |
| Pump | 208 |



Anomaly counts by location:


| location | count |
|---|---|
| Chicago | 174 |
| Atlanta | 162 |
| New York | 161 |
| Houston | 161 |
| San Francisco | 102 |



The analysis shows that anomalies are distributed across all equipment types and locations, with Compressors and locations like Chicago, Atlanta, and New York showing higher counts of detected anomalies.

### 6. Conclusion
This project successfully implemented an anomaly detection system for equipment sensor data using PySpark for data processing and scikit-learn's Isolation Forest model.
The model demonstrated strong performance on the test set, as indicated by high precision, recall, F1-score, and ROC AUC.
The anomaly analysis provided insights into the distribution of anomalies across different equipment types and locations.
The detected anomalies can be further investigated to understand the underlying causes of unusual equipment behavior.
Potential next steps include:
- Investigating the specific characteristics of anomalies for each equipment type and location.
- Implementing real-time anomaly detection on streaming sensor data.
- Exploring other anomaly detection algorithms and

## Summary:

### Data Analysis Key Findings

*   The dataset contains sensor readings, equipment type, location, and a 'faulty' label, with no missing values.
*   Approximately 10% of the records in the original dataset are labeled as faulty.
*   The Isolation Forest model trained on the data achieved strong performance on the test set with a Precision of 0.8345, Recall of 0.8467, F1-score of 0.8406, and an ROC AUC of 0.9732.
*   The model flagged 760 anomalies in the entire dataset (about 9.9% of the 7672 rows), close to the `contamination=0.1` setting.
*   Anomalies were distributed across all equipment types, with Compressors showing the highest count (321).
*   Anomalies were also distributed across all locations, with Chicago having the highest count (174).

### Insights or Next Steps

*   Investigate the specific sensor readings and combinations of features that contribute to high anomaly scores for different equipment types and locations to understand the root causes of unusual behavior.
*   Explore implementing this anomaly detection system in a real-time streaming environment to enable proactive maintenance and minimize equipment downtime.
