# Vessel Deficiency Severity Prediction

### Authors:

## Sections

---

### **Part 1: Logic for Deriving a Consensus Severity**

In this section, we will explore the dataset and develop a suitable logic for deriving the consensus severity for each deficiency. This involves understanding the text data and finding ways to derive a unified severity rating from multiple annotations provided by Subject Matter Experts (SMEs).

#### **Objective of Part 1**:
- Understand the deficiency text, which includes descriptions, root causes, corrective actions, and preventive measures.
- Derive a consensus severity (High, Medium, Low) for each deficiency based on the ratings given by at least three annotators.
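
As a minimal sketch of the majority rule described above, the snippet below counts the ratings for one deficiency and reports which severities share the top vote count. The function name `majority_severities` and the sample ratings are illustrative assumptions, not taken from the dataset.

```python
from collections import Counter

def majority_severities(ratings):
    """Return the set of severities that share the highest vote count."""
    counts = Counter(ratings)
    top = max(counts.values())
    return {severity for severity, n in counts.items() if n == top}

# Hypothetical ratings from three annotators for one deficiency
clear_case = majority_severities(["High", "High", "Medium"])
tied_case = majority_severities(["High", "Medium", "Low"])

print(clear_case)  # a single winning severity
print(tied_case)   # a three-way tie that needs a tie-breaker
```

When the returned set has more than one element, a further rule is needed to pick one label; the notebook uses annotator reputation for that.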

#### **Key Tasks**:
1. **Explore the Dataset**: Investigate the structure and content of the provided data, focusing on deficiency descriptions and severity annotations.
2. **Preprocess the Data**: Clean and prepare the data for analysis, including any necessary text preprocessing.

---

### **Part 2: Model Development and Severity Prediction**

In this section, we will take the processed data from Part 1 and build a machine learning model to predict the severity of deficiencies.

#### **Key Tasks**:
1. **Train the Model**: We first fine-tune DistilBERT and use it to obtain numerical values for the `def_text` information, then train a gradient boosting model on the dataset.
2. **Evaluate Performance**: Measure the model’s accuracy and performance using cross-validation or other techniques.
3. **Generate Predictions**: Use the trained model to predict severity labels for the test dataset and evaluate the model's predictions.
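
The evaluation task mentions cross-validation. As a rough, standard-library-only sketch of the idea, the generator below splits sample indices into `k` folds so that every sample is used for validation exactly once. The name `k_fold_indices` and the sample sizes are assumptions for illustration; in practice a library splitter such as scikit-learn's `StratifiedKFold` would likely be preferable because the severity classes are imbalanced.

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

A model would be trained on each `train_idx` and scored on the matching `val_idx`, and the scores averaged across folds.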

---


# Part 1: Logic for Deriving a Consensus Severity


In [1]:
# Importing necessary libraries

import pandas as pd               # Data manipulation
import numpy as np                # Numerical operations
import matplotlib.pyplot as plt    # Plotting
import seaborn as sns             # Data visualization
import re                          # Regular expressions for text processing
from sklearn.model_selection import train_test_split  # Splitting data
from sklearn.preprocessing import LabelEncoder         # Label encoding for target variable
from collections import Counter, defaultdict

In [2]:


def compute_reputation_scores(data):
    """
    Compute reputation scores for all inspectors based on majority agreement.
    Inspectors get +1 for each deficiency where their severity matches any majority severity.
    """
    reputation_scores = defaultdict(int)
    
    # Group by unique deficiency cases
    grouped = data.groupby(['VesselId', 'PscInspectionId', 'deficiency_code'])
    
    for _, group in grouped:
        # Count occurrences of each severity
        severity_counts = Counter(group['annotation_severity'])
        max_count = max(severity_counts.values())
        
        # Get all severities that have the maximum count
        majority_severities = {severity for severity, count in severity_counts.items() 
                             if count == max_count}
        
        # Update reputation scores for inspectors who chose any majority severity
        for _, row in group.iterrows():
            if row['annotation_severity'] in majority_severities:
                reputation_scores[row['username']] += 1
    
    return reputation_scores

def derive_consensus_severity(group, reputation_scores):
    """
    Derive consensus severity based on majority and reputation scores for tie-breaking.
    """
    # Count occurrences of each severity
    severity_counts = Counter(group['annotation_severity'])
    max_count = max(severity_counts.values())
    
    # Get severities with maximum count
    majority_severities = {severity for severity, count in severity_counts.items() 
                         if count == max_count}
    
    # If there's only one majority severity, return it
    if len(majority_severities) == 1:
        return majority_severities.pop()
    
    # For ties, find the inspector with highest reputation among tied severities
    highest_rep_score = -1
    consensus_severity = None
    
    # Filter to only look at inspectors who gave majority severities
    tied_inspectors = group[group['annotation_severity'].isin(majority_severities)]
    
    for _, row in tied_inspectors.iterrows():
        rep_score = reputation_scores[row['username']]
        if rep_score > highest_rep_score:
            highest_rep_score = rep_score
            consensus_severity = row['annotation_severity']
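        # If same reputation score, take the higher severity as tie-breaker.
        # Rank explicitly: plain string comparison is alphabetical, not by severity.
        elif rep_score == highest_rep_score:
            rank = {'Not a deficiency': 0, 'Low': 1, 'Medium': 2, 'High': 3}
            if rank.get(row['annotation_severity'], -1) > rank.get(consensus_severity, -1):
                consensus_severity = row['annotation_severity']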
    
    return consensus_severity

def process_severity_data(input_file, output_file):
    """
    Main function to process severity data and generate consensus outputs.
    """
    # Read and clean data
    data = pd.read_csv(input_file).dropna(subset=['annotation_severity'])
    
    # First compute reputation scores for all inspectors
    reputation_scores = compute_reputation_scores(data)
    
    # Now derive consensus severities using the pre-computed reputation scores
    consensus_severities = []
    
    grouped = data.groupby(['VesselId', 'PscInspectionId', 'deficiency_code'])
    
    for _, group in grouped:
        consensus_severity = derive_consensus_severity(group, reputation_scores)
        
        consensus_severities.append({
            'VesselId': group['VesselId'].iloc[0],
            'PscInspectionId': group['PscInspectionId'].iloc[0],
            'deficiency_code': group['deficiency_code'].iloc[0],
            'consensus_severity': consensus_severity,
            'inspectors': ', '.join(f"{user} (ID: {annot_id}, Severity: {severity})"
                                  for user, annot_id, severity in 
                                  zip(group['username'], 
                                      group['annotation_id'], 
                                      group['annotation_severity'])),
            'inspection_date': group['InspectionDate'].iloc[0],
            'VesselGroup': group['VesselGroup'].iloc[0],
            'age': group['age'].iloc[0],
            'def_text': group['def_text'].iloc[0]
        })
    
    # Create output dataframes
    consensus_df = pd.DataFrame(consensus_severities)
    reputation_df = pd.DataFrame(list(reputation_scores.items()),
                               columns=['username', 'reputation_score'])\
                               .sort_values(by=['reputation_score', 'username'],
                                          ascending=[False, True])
    
    # Sort consensus dataframe
    consensus_df = consensus_df.sort_values(
        by=['VesselId', 'PscInspectionId', 'inspection_date'],
        ascending=True
    )
    
    # Save to files
    consensus_df.to_csv(output_file, index=False)
    reputation_df.to_csv(output_file.replace('.csv', '_reputation.csv'), index=False)
    
    print(f"Consensus severity saved to {output_file}")
    print(f"Reputation scores saved to {output_file.replace('.csv', '_reputation.csv')}")

input_file = 'datasets/psc_severity_train.csv'
output_file = 'datasets/consensus_severity_output.csv'
process_severity_data(input_file, output_file)

Consensus severity saved to datasets/consensus_severity_output.csv
Reputation scores saved to datasets/consensus_severity_output_reputation.csv


In [3]:
# Import the consensus severity
df = pd.read_csv(output_file)
df.head(3)

Unnamed: 0,VesselId,PscInspectionId,deficiency_code,consensus_severity,inspectors,inspection_date,VesselGroup,age,def_text
0,66646,1695287,2113,High,"guru (ID: 42950805, Severity: High), sunil (ID...",2023-03-28,Dry Bulk,44.569473,PscInspectionId: 1695287\n\nDeficiency/Finding...
1,77681,1671010,3105,High,"guru (ID: 46013695, Severity: High), sunil (ID...",2022-12-21,Dry Bulk,40.884326,PscInspectionId: 1671010\n\nDeficiency/Finding...
2,77681,1671010,18327,Medium,"mihail (ID: 42535118, Severity: Medium), raul ...",2022-12-21,Dry Bulk,40.884326,PscInspectionId: 1671010\n\nDeficiency/Finding...



# Part 2: Model training and prediction

## a) distilbert model

### Preparing the data for training 
cleaning + tokenising
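
To illustrate how labelled fields can be pulled out of a free-text block, here is a small standalone sketch. The sample string is hypothetical and only imitates the `Label: value` layout visible in `def_text`; the helper name `extract_labelled_field` is likewise an assumption. It uses `re.escape` as a precaution in case a label contains regex metacharacters.

```python
import re

def extract_labelled_field(text, field_name):
    """Return the text after 'field_name:' up to the next capitalised line, or None."""
    pattern = rf"{re.escape(field_name)}:\s*(.*?)(?=\n[A-Z]|$)"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical text that imitates the layout of a def_text entry
sample = (
    "PscInspectionId: 123\n\n"
    "Deficiency/Finding: Hull - cracking\n"
    "Description Overview: Crack found on main deck\n"
    "Detainable Deficiency: No"
)

finding = extract_labelled_field(sample, "Deficiency/Finding")
overview = extract_labelled_field(sample, "Description Overview")
missing = extract_labelled_field(sample, "Root Cause Analysis")
print(finding, "|", overview, "|", missing)
```

Because the lookahead stops at any newline followed by a capital letter, a field value that itself spans several lines starting with capitals would be cut short; that is a known limitation of this pattern.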


In [4]:

import re

def extract_deficiency_fields(df):
    """
    Extracts fields from the 'def_text' column of a DataFrame.

    Args:
        df: The pandas DataFrame containing the 'def_text' column.

    Returns:
        A DataFrame with the extracted fields.
    """

    def extract_field(text, field_name):
        match = re.search(rf"{field_name}:\s*(.*?)(?=\n[A-Z]|$)", text, re.DOTALL)
        return match.group(1).strip() if match else None

    df['Deficiency/Finding'] = df['def_text'].apply(lambda x: extract_field(x, "Deficiency/Finding"))
    df['Description Overview'] = df['def_text'].apply(lambda x: extract_field(x, "Description Overview"))
    df['Immediate Causes'] = df['def_text'].apply(lambda x: extract_field(x, "Immediate Causes"))
    df['Root Cause Analysis'] = df['def_text'].apply(lambda x: extract_field(x, "Root Cause Analysis"))
    df['Corrective Action'] = df['def_text'].apply(lambda x: extract_field(x, "Corrective Action"))
    df['Preventive Action'] = df['def_text'].apply(lambda x: extract_field(x, "Preventive Action"))


    # Detainable Deficiency (handle differently as it's categorical)
    df['Detainable Deficiency'] = df['def_text'].str.extract(r"Detainable Deficiency:\s*(.*)")

    return df


df_separate_fields = extract_deficiency_fields(df)

df_separate_fields['combined_text'] = (df_separate_fields['Deficiency/Finding'].fillna('') + ' ' +
                               df_separate_fields['Description Overview'].fillna('') + ' ' +
                               df_separate_fields['Immediate Causes'].fillna('') + ' ' +
                               df_separate_fields['Root Cause Analysis'].fillna('') + ' ' +
                               df_separate_fields['Corrective Action'].fillna('') + ' ' +
                               df_separate_fields['Preventive Action'].fillna('')).str.strip()

df_combined_text = df_separate_fields.drop(columns=['Deficiency/Finding', 'Description Overview', 'Immediate Causes', 'Root Cause Analysis', 'Corrective Action', 'Preventive Action', 'def_text'])
df_combined_text.head(2)

Unnamed: 0,VesselId,PscInspectionId,deficiency_code,consensus_severity,inspectors,inspection_date,VesselGroup,age,Detainable Deficiency,combined_text
0,66646,1695287,2113,High,"guru (ID: 42950805, Severity: High), sunil (ID...",2023-03-28,Dry Bulk,44.569473,No,Hull - cracking Hull - cracking Crack found on...
1,77681,1671010,3105,High,"guru (ID: 46013695, Severity: High), sunil (ID...",2022-12-21,Dry Bulk,40.884326,No,Lack of maintenance on weathertight hatches/co...


In [5]:


def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII characters (optional, but often helpful)
    text = text.lower() # convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text) # Remove special characters and punctuation
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    return text.strip()

df_combined_text['cleaned_text'] = df_combined_text['combined_text'].apply(clean_text)
df_clean_text = df_combined_text.drop(columns=['combined_text'])




df_clean_text = df_clean_text.dropna(subset=['cleaned_text', 'consensus_severity'])
df_clean_text.head(1)

Unnamed: 0,VesselId,PscInspectionId,deficiency_code,consensus_severity,inspectors,inspection_date,VesselGroup,age,Detainable Deficiency,cleaned_text
0,66646,1695287,2113,High,"guru (ID: 42950805, Severity: High), sunil (ID...",2023-03-28,Dry Bulk,44.569473,No,hull cracking hull cracking crack found on mai...
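
To make the effect of the cleaning steps concrete, the sketch below applies the same four substitutions to a made-up sentence. The function name `clean_text_sketch` and the sample string are assumptions for illustration.

```python
import re

def clean_text_sketch(text):
    text = re.sub(r"[^\x00-\x7F]+", "", text)   # drop non-ASCII characters
    text = text.lower()                          # lowercase
    text = re.sub(r"[^a-z0-9\s]", "", text)      # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text.strip()

sample = "Hull - cracking:  Crack found on Main Deck (frame No. 42)!"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Note that removing punctuation also removes decimal points and hyphens, so a value such as `44.5` would become `445`.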


### Tokenising the text


In [6]:
!pip install --upgrade transformers datasets evaluate huggingface_hub torch

import numpy as np
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import classification_report  # Used later for per-class metrics
import torch


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [7]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Load DistilBERT Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# 2. Map severity labels to integer class ids
label_mapping = {"Not a deficiency": 0, "Low": 1, "Medium": 2, "High": 3}

# 3. Print the unique labels present in the data
unique_labels = df_clean_text['consensus_severity'].unique()
print(unique_labels)

# 4. Number of labels
num_labels = len(unique_labels)

# 5. Update labels in the DataFrame
df_clean_text['consensus_severity'] = df_clean_text['consensus_severity'].map(label_mapping)

# 6. Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df_clean_text[['cleaned_text', 'consensus_severity']])
dataset = dataset.rename_column("cleaned_text", "text")
dataset = dataset.rename_column("consensus_severity", "label")

# 7. Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# 8. Set the format of the dataset
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# 9. Split the dataset into train and test sets
# (named split_datasets so it does not shadow sklearn's train_test_split function)
split_datasets = tokenized_datasets.train_test_split(test_size=0.2, seed=42)
train_dataset = split_datasets["train"]
test_dataset = split_datasets["test"]

# 10. Optionally select a smaller subset for faster debugging
small_train_dataset = train_dataset.shuffle(seed=42).select(range(1000)) 
small_eval_dataset = test_dataset.shuffle(seed=42).select(range(200))   


['High' 'Medium' 'Low' 'Not a deficiency']



Fine-tuning the model takes a long time, so we run it only once: the cell below saves the trained model and skips training when a saved model already exists.
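
The run-once pattern can be sketched generically as follows. This is a simplified illustration using `pickle` and a temporary directory; the helper `load_or_compute` and the stand-in `expensive_step` are hypothetical names, not part of the notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the cached result at `path`, computing and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive_step():
    calls.append(1)  # record that the expensive work actually ran
    return {"status": "trained"}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "result.pkl")
    first = load_or_compute(cache_path, expensive_step)
    second = load_or_compute(cache_path, expensive_step)

print(first == second, len(calls))
```

One caveat of this approach is that a stale cache is never refreshed: if the input data or preprocessing changes, the saved file has to be deleted by hand.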

In [8]:
import os
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate

# Define the dataset name for saving the model
dataset_name = "large_dataset"
saved_model_path = f'model_{dataset_name}'

# Check if the model already exists
if not os.path.exists(saved_model_path):
    print(f"Model not found at '{saved_model_path}'. Starting training...")
    
    # Load DistilBERT model for sequence classification
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=num_labels)
    
    # Load accuracy metric
    metric = evaluate.load("accuracy")
    
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        disable_tqdm=False,
        logging_dir='./logs',
        logging_steps=500,
        fp16=True
    )
    
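    # Trainer (note: this trains and evaluates on the 1000/200-example subsets selected above, not the full split)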
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        compute_metrics=compute_metrics,
    )
    
    # Train the model
    trainer.train()
    
    # Save the trained model
    trainer.save_model(saved_model_path)
    
    # Evaluate the model
    results = trainer.evaluate()
    print("Training complete. Evaluation results:", results)
else:
    print(f"Model already exists at '{saved_model_path}'. Skipping training.")


Model already exists at 'model_large_dataset'. Skipping training.


Load the model

In [9]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the saved fine-tuned model (the tokenizer loaded earlier is reused)
model = AutoModelForSequenceClassification.from_pretrained('model_large_dataset')

# Put the model in evaluation mode
model.eval()

# Generate text features for one string. Despite the name, this returns the
# classification logits (one value per class), not hidden-state embeddings.
def generate_embeddings(text):
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        return outputs.logits.numpy().flatten()



In [10]:
from tqdm import tqdm
import os
import pickle

# Check if embeddings file already exists
embeddings_file = 'embeddings.pkl'

if os.path.exists(embeddings_file):
    # If embeddings already exist, load from file
    print("Loading precomputed embeddings...")
    with open(embeddings_file, 'rb') as f:
        df_for_training = pickle.load(f)
else:
    # Enable the tqdm progress bar for Pandas apply
    tqdm.pandas()

    # Generate embeddings for each cleaned_text with progress
    df_clean_text['embeddings'] = df_clean_text['cleaned_text'].progress_apply(lambda x: generate_embeddings(x))

    # Flatten the embeddings into separate columns (assuming embeddings are vectors)
    embedding_dim = len(df_clean_text['embeddings'].iloc[0])  # This is the size of the embedding vector
    embedding_df = pd.DataFrame(df_clean_text['embeddings'].to_list(), columns=[f'emb_{i}' for i in range(embedding_dim)])

    # Concatenate the flattened embeddings with the original DataFrame
    df_for_training = pd.concat([df_clean_text.reset_index(drop=True), embedding_df], axis=1)

    # Drop the 'cleaned_text' and 'embeddings' columns (optional, depending on what you need)
    df_for_training = df_for_training.drop(columns=['cleaned_text', 'embeddings'])

    # Save the resulting dataframe with embeddings
    with open(embeddings_file, 'wb') as f:
        pickle.dump(df_for_training, f)

    # Optionally, save as CSV
    # df_for_training.to_csv('df_for_training_with_embeddings.csv', index=False)

df_for_training.head()


Loading precomputed embeddings...


Unnamed: 0,VesselId,PscInspectionId,deficiency_code,consensus_severity,inspectors,inspection_date,VesselGroup,age,Detainable Deficiency,emb_0,emb_1,emb_2,emb_3
0,66646,1695287,2113,3,"guru (ID: 42950805, Severity: High), sunil (ID...",2023-03-28,Dry Bulk,44.569473,No,-2.655375,0.39065,1.109482,0.725045
1,77681,1671010,3105,3,"guru (ID: 46013695, Severity: High), sunil (ID...",2022-12-21,Dry Bulk,40.884326,No,-2.697097,0.428598,1.209944,0.722074
2,77681,1671010,18327,2,"mihail (ID: 42535118, Severity: Medium), raul ...",2022-12-21,Dry Bulk,40.884326,No,-2.414241,0.018993,1.259212,0.793717
3,83036,1688644,3104,3,"ranjit (ID: 46089058, Severity: High), man (ID...",2023-03-02,Dry Bulk,38.915811,No,-2.652054,0.793822,0.94159,0.58217
4,83036,1688644,15150,2,"guru (ID: 42436473, Severity: High), sunil (ID...",2023-03-02,Dry Bulk,38.915811,No,-2.680373,1.541046,0.847053,0.075584
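
Since `generate_embeddings` returns `outputs.logits`, the columns `emb_0` to `emb_3` are the raw class scores of the fine-tuned classifier, one per label id. As a hedged illustration, the sketch below converts the first row's values into probabilities with a softmax and reads off the class the DistilBERT head alone would choose. It assumes the logit order follows the ids in `label_mapping`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Inverse of the notebook's label_mapping
id_to_label = {0: "Not a deficiency", 1: "Low", 2: "Medium", 3: "High"}

# emb_0..emb_3 of the first row in the table above
logits = [-2.655375, 0.39065, 1.109482, 0.725045]
probs = softmax(logits)
predicted = id_to_label[max(range(len(probs)), key=probs.__getitem__)]

print([round(p, 3) for p in probs], predicted)
```

For this row the largest logit is at index 2 (Medium), while the consensus label in the table is 3 (High).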


In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Load the data
# data = pd.read_csv('datasets/consensus_severity_output.csv')
data = df_for_training

# Check for missing values in 'consensus_severity'
print(f"Missing values in consensus_severity: {data['consensus_severity'].isnull().sum()}")

# Re-check after label mapping: the encoded labels are stored in the same 'consensus_severity' column
print(f"Missing values after encoding: {data['consensus_severity'].isnull().sum()}")

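# Features and target.
# Note: the emb_* columns are not in this feature list, so the model below
# is trained on age, VesselGroup and deficiency_code only.
X = data[['age', 'VesselGroup', 'deficiency_code']]
y = data['consensus_severity']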

# Combine features and target for further processing
data_combined = pd.concat([X, y], axis=1)

# Drop rows where the target column (y) has NaN
data_combined = data_combined.dropna(subset=['consensus_severity'])

# Print to check how many rows remain after dropping NaNs
print(f"Data shape after dropping NaNs: {data_combined.shape}")

# Split back into features (X) and target (y)
X = data_combined.drop('consensus_severity', axis=1)
y = data_combined['consensus_severity']


X = pd.get_dummies(X, columns=['VesselGroup', 'deficiency_code'], drop_first=True)

# Print to check the shape after encoding categorical variables
print(f"Shape after encoding categorical variables: {X.shape}")

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Print the shape of training and testing sets
print(f"Training set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Test set shape: X_test={X_test.shape}, y_test={y_test.shape}")

# Model setup
model2 = XGBClassifier(
    n_estimators=100,      # Number of trees
    learning_rate=0.1,     # Step size shrinkage
    max_depth=6,           # Max depth of trees
    subsample=0.8,         # Fraction of samples used for training
    colsample_bytree=0.8,  # Fraction of features used per tree
    random_state=42
)

model2.fit(X_train, y_train)

# Predict class
y_pred = model2.predict(X_test)

# Predict probabilities (optional, for metrics like AUC-ROC)
y_pred_proba = model2.predict_proba(X_test)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Not a deficiency', 'Low', 'Medium', 'High']))


Missing values in consensus_severity: 0
Missing values after encoding: 0
Data shape after dropping NaNs: (4403, 4)
Shape after encoding categorical variables: (4403, 445)
Training set shape: X_train=(3962, 445), y_train=(3962,)
Test set shape: X_test=(441, 445), y_test=(441,)
Accuracy: 0.4308390022675737
Classification Report:
                  precision    recall  f1-score   support

Not a deficiency       0.00      0.00      0.00         2
             Low       0.43      0.73      0.54       181
          Medium       0.39      0.28      0.32       163
            High       0.65      0.14      0.23        95

        accuracy                           0.43       441
       macro avg       0.37      0.29      0.27       441
    weighted avg       0.46      0.43      0.39       441
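
The two summary rows of the report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 averages from the rounded values printed above, so the results are approximate to the same two decimals.

```python
# (precision, recall, f1, support) per class, copied from the report above
report = {
    "Not a deficiency": (0.00, 0.00, 0.00, 2),
    "Low": (0.43, 0.73, 0.54, 181),
    "Medium": (0.39, 0.28, 0.32, 163),
    "High": (0.65, 0.14, 0.23, 95),
}

f1_scores = [row[2] for row in report.values()]
supports = [row[3] for row in report.values()]

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is lower than the weighted one here because the two-sample "Not a deficiency" class, with an F1 of zero, weighs as much as each large class.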





Prediction for test dataset
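
One detail worth illustrating is how one-hot encoded test features are aligned with the training columns. The toy frames below are hypothetical; the sketch omits `drop_first=True` for readability and shows that `reindex` fills columns missing from the new data with 0 and drops columns the training data never had.

```python
import pandas as pd

# Hypothetical training and test frames with different category sets
train = pd.DataFrame({"VesselGroup": ["Dry Bulk", "Oil", "Dry Bulk"], "age": [10.0, 5.0, 20.0]})
test = pd.DataFrame({"VesselGroup": ["Oil", "Gas"], "age": [7.0, 3.0]})

X_train_toy = pd.get_dummies(train, columns=["VesselGroup"])
X_new = pd.get_dummies(test, columns=["VesselGroup"])

# Align to the training layout: add missing columns as 0, drop unseen ones
X_new = X_new.reindex(columns=X_train_toy.columns, fill_value=0)

print(list(X_new.columns))
```

A consequence of this alignment is that a row whose category was never seen in training (here "Gas") ends up with zeros in every one-hot column for that feature.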

In [12]:
test = pd.read_csv('datasets/psc_severity_test.csv')



test_separate_fields = extract_deficiency_fields(test)

test_separate_fields['combined_text'] = (test_separate_fields['Deficiency/Finding'].fillna('') + ' ' +
                               test_separate_fields['Description Overview'].fillna('') + ' ' +
                               test_separate_fields['Immediate Causes'].fillna('') + ' ' +
                               test_separate_fields['Root Cause Analysis'].fillna('') + ' ' +
                               test_separate_fields['Corrective Action'].fillna('') + ' ' +
                               test_separate_fields['Preventive Action'].fillna('')).str.strip()

test_combined_text = test_separate_fields.drop(columns=['Deficiency/Finding', 'Description Overview', 'Immediate Causes', 'Root Cause Analysis', 'Corrective Action', 'Preventive Action', 'def_text'])
test_combined_text.head(2)

test_combined_text['cleaned_text'] = test_combined_text['combined_text'].apply(clean_text)
test_clean_text = test_combined_text.drop(columns=['combined_text'])


test_clean_text = test_clean_text.dropna(subset=['cleaned_text'])
test_clean_text.head(1)


from tqdm import tqdm
import os
import pickle

# Check if embeddings file already exists
embeddings_file = 'embeddings_test.pkl'

if os.path.exists(embeddings_file):
    # If embeddings already exist, load from file
    print("Loading precomputed embeddings...")
    with open(embeddings_file, 'rb') as f:
        test_for_training = pickle.load(f)
else:
    # Enable the tqdm progress bar for Pandas apply
    tqdm.pandas()

    # Generate embeddings for each cleaned_text with progress
    test_clean_text['embeddings'] = test_clean_text['cleaned_text'].progress_apply(lambda x: generate_embeddings(x))

    # Flatten the embeddings into separate columns (assuming embeddings are vectors)
    embedding_dim = len(test_clean_text['embeddings'].iloc[0])  # This is the size of the embedding vector
    embedding = pd.DataFrame(test_clean_text['embeddings'].to_list(), columns=[f'emb_{i}' for i in range(embedding_dim)])

    # Concatenate the flattened embeddings with the original DataFrame
    test_for_training = pd.concat([test_clean_text.reset_index(drop=True), embedding], axis=1)

    # Drop the 'cleaned_text' and 'embeddings' columns (optional, depending on what you need)
    test_for_training = test_for_training.drop(columns=['cleaned_text', 'embeddings'])

    # Save the resulting dataframe with embeddings
    with open(embeddings_file, 'wb') as f:
        pickle.dump(test_for_training, f)


test_for_training.head()







Unnamed: 0,PscInspectionId,deficiency_code,InspectionDate,VesselId,PscAuthorityId,PortId,VesselGroup,age,Detainable Deficiency,emb_0,emb_1,emb_2,emb_3
0,1802364,14402,2024-04-05,293691,9,936,Dry Bulk,9.593429,No,-2.428447,-0.004113,1.280959,0.773754
1,1736765,10199,2023-08-17,272075,9,5237,Dry Bulk,25.21013,No,-1.809258,1.883787,0.035624,-0.578579
2,1787907,18204,2024-02-15,302667,1,953,Dry Bulk,5.793292,No,-2.777685,1.192926,0.842268,0.342324
3,1691176,14108,2023-03-13,288591,7,1439,Oil,12.44627,No,-2.57331,0.450111,1.206649,0.73564
4,1712454,5109,2023-05-26,290457,2,1366,Dry Bulk,11.731691,No,-2.717867,0.619949,1.291194,0.561986


In [14]:
import pandas as pd


print(f"Test data shape: {test_for_training.shape}")

X_test_data = test_for_training[['age', 'VesselGroup', 'deficiency_code', 'emb_0', 'emb_1', 'emb_2', 'emb_3']]

X_test_data = pd.get_dummies(X_test_data, columns=['VesselGroup', 'deficiency_code'], drop_first=True)

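# Align with the training columns: missing columns are filled with 0, and any
# column absent from X_train (unseen categories and, here, emb_*) is dropped.
X_test_data = X_test_data.reindex(columns=X_train.columns, fill_value=0)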

y_pred_test = model2.predict(X_test_data)

predictions_df = pd.DataFrame({
    'Predicted_Severity': y_pred_test
})

severity_mapping_inv = {0: 'Not a deficiency', 1: 'Low', 2: 'Medium', 3: 'High'}
predictions_df['Predicted_Severity'] = predictions_df['Predicted_Severity'].map(severity_mapping_inv)

# Save the predictions to a CSV file
predictions_df.to_csv('MaritimeHackathon2025_SeverityPredictions_SQUARE.csv', index=False)

# Print the resulting DataFrame
print(predictions_df.head())


Test data shape: (1101, 13)
  Predicted_Severity
0                Low
1             Medium
2               High
3                Low
4                Low
