# Predicting Whether an NRC Incident Will Require a 5800 Incident Report

## Motivation
The National Response Center (NRC) receives calls whenever there is an oil spill, chemical spill, or HAZMAT (hazardous material) release. Depending on the characteristics of the incident that caused the release, a DOT PHMSA 5800 incident report may need to be completed. A 5800 HAZMAT incident is a more specific and more dangerous type of incident.

The purpose of this exercise is to create a model that can predict whether or not an NRC incident will require a 5800 report to be generated. 

By knowing in advance whether a 5800 report will need to be created, the data science team at PHMSA can reduce the latency of delivering incident details to management by as much as 30 days.

## Exploratory Data Analysis

In [1]:
import pandas as pd

df = pd.read_csv("combined_data.csv", encoding="utf-8", engine="python")
df.head()

Unnamed: 0,target,description
0,1,THE CALLER IS REPORTING A DERAILMENT OF 13 CAR...
1,1,CALLER IS REPORTING A TRACTOR TRAILER CARRYING...
2,1,CALLER IS REPORTING THAT A TANKER TRUCK SWERVE...
3,1,CALLER REPORTED A 55 GALLON METAL DRUM SPILLED...
4,1,THE CALLER STATES AS THE TANK CAR (NOT PART OF...


We'll see if we can use the Event Description to predict whether or not an event will become a 5800 reportable incident.

For this, we compare Logistic Regression, a Random Forest classifier, and a CNN.

## Train and Test Logistic Regression

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score

In [3]:
X = df['description']  # Features (event descriptions)
y = df['target']    # Target (1 = 5800 reportable, 0 = not reportable)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# Create a TF-IDF vectorizer to convert text to numerical features
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Initialize and train a Logistic Regression model
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train_tfidf, y_train)

y_true = y_test  # Actual labels
y_pred_prob = logistic_regression.predict_proba(X_test_tfidf)[:, 1]  # Predicted probabilities

thresholds = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

for threshold in thresholds:
    # Classify events as positive if the predicted probability is above the threshold
    y_pred_modified = [1 if prob >= threshold else 0 for prob in y_pred_prob]

    # Calculate precision and recall
    precision = precision_score(y_true, y_pred_modified)
    recall = recall_score(y_true, y_pred_modified)

    # Calculate confusion matrix
    confusion = confusion_matrix(y_true, y_pred_modified)

    print(f"Threshold: {threshold}")
    print("Confusion Matrix:")
    print(confusion)
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print()

Threshold: 0.01
Confusion Matrix:
[[1422 2259]
 [   0  293]]
Precision: 0.11
Recall: 1.00

Threshold: 0.05
Confusion Matrix:
[[2988  693]
 [   6  287]]
Precision: 0.29
Recall: 0.98

Threshold: 0.1
Confusion Matrix:
[[3321  360]
 [  20  273]]
Precision: 0.43
Recall: 0.93

Threshold: 0.2
Confusion Matrix:
[[3522  159]
 [  46  247]]
Precision: 0.61
Recall: 0.84

Threshold: 0.3
Confusion Matrix:
[[3590   91]
 [  75  218]]
Precision: 0.71
Recall: 0.74

Threshold: 0.4
Confusion Matrix:
[[3623   58]
 [ 108  185]]
Precision: 0.76
Recall: 0.63

Threshold: 0.5
Confusion Matrix:
[[3646   35]
 [ 135  158]]
Precision: 0.82
Recall: 0.54

Threshold: 0.6
Confusion Matrix:
[[3661   20]
 [ 167  126]]
Precision: 0.86
Recall: 0.43



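The threshold is the predicted-probability cutoff at or above which an event is classified as positive (a 5800).

- Recall: the share of actual 5800 events that the model catches. High recall minimizes how often the model predicts "NOT A 5800" but is wrong.
- Precision: the share of predicted 5800 events that really are 5800s. High precision minimizes how often the model predicts "THIS IS A 5800" but is wrong.

We would rather be extremely sure that an event will be a 5800 if it is labeled as such, which calls for high precision. We also do not want to miss reportable events, which calls for high recall.

A threshold of 0.1 gives high recall (0.93) but only modest precision (0.43).

Based on the results above, we consider a threshold of 0.3 to yield the best balance (precision 0.71, recall 0.74), and we use it for the remaining models.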
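
As a quick check on the trade-off, precision and recall can be recomputed by hand from the confusion matrices printed above. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn's `confusion_matrix` uses for binary labels; `precision_recall` is a hypothetical helper name, not part of the notebook.

```python
def precision_recall(confusion):
    """Return (precision, recall) for a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)  # of everything flagged as a 5800, how much was right
    recall = tp / (tp + fn)     # of all true 5800 events, how many were flagged
    return precision, recall


# Matrices copied from the logistic regression output above
reported = {
    0.1: [[3321, 360], [20, 273]],
    0.3: [[3590, 91], [75, 218]],
}

for threshold, matrix in reported.items():
    p, r = precision_recall(matrix)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the cutoff from 0.1 to 0.3 removes 269 false positives (360 down to 91) at the cost of 55 additional missed 5800 events (20 up to 75).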

## Train and Test Random Forest Classifier

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score

# Train the Random Forest model
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train_tfidf, y_train)

# Get the predicted probabilities for class 1 (positive class)
y_pred_prob = random_forest.predict_proba(X_test_tfidf)[:, 1]

threshold = 0.3
y_pred_custom = (y_pred_prob > threshold).astype(int)

# Calculate the confusion matrix
confusion = confusion_matrix(y_test, y_pred_custom)

# Calculate precision, recall, and F1 score
precision = precision_score(y_test, y_pred_custom)
recall = recall_score(y_test, y_pred_custom)
f1 = f1_score(y_test, y_pred_custom)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_custom)

print("Confusion Matrix:\n", confusion)
print("Precision:", round(precision, 2))
print("Recall:", round(recall, 2))
print("F1 Score:", round(f1, 2))
print("Accuracy:", round(accuracy, 2))

Confusion Matrix:
 [[3568  113]
 [  80  213]]
Precision: 0.65
Recall: 0.73
F1 Score: 0.69
Accuracy: 0.95


Unfortunately, at the same 0.3 threshold the Random Forest Classifier does not perform better than Logistic Regression: precision is 0.65 versus 0.71, and recall is 0.73 versus 0.74.
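
The 0.95 accuracy reported above is easier to interpret next to the class balance of the test set, which can be read off the same confusion matrix. The following is a small arithmetic sketch, again assuming the `[[TN, FP], [FN, TP]]` layout; the variable names are illustrative only.

```python
# Random forest confusion matrix reported above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 3568, 113, 80, 213

total = tn + fp + fn + tp
positive_rate = (fn + tp) / total      # share of test rows that are 5800 events
baseline_accuracy = (tn + fp) / total  # accuracy of always predicting "not a 5800"
model_accuracy = (tn + tp) / total

print(f"test rows: {total}")
print(f"share of 5800 events: {positive_rate:.3f}")
print(f"always-negative accuracy: {baseline_accuracy:.3f}")
print(f"random forest accuracy: {model_accuracy:.3f}")
```

Because only about 7% of the test rows are 5800 events, a model that never predicts a 5800 would already score roughly 93% accuracy. This is why precision and recall are the more informative metrics here.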

Next, we will train a CNN deep learning model to attempt to obtain better results.

## Train and Test CNN

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [6]:
# Tokenize your text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to a fixed length
max_seq_length = 100  # Adjust as needed
X_train_pad = pad_sequences(X_train_seq, maxlen=max_seq_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_seq_length)

# Create a CNN model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100, input_length=max_seq_length))
model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train_pad, y_train, epochs=3, batch_size=64, validation_data=(X_test_pad, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test_pad, y_test)
print(f'Test Accuracy: {accuracy}')

Epoch 1/3
Epoch 2/3
Epoch 3/3
Test Accuracy: 0.9609964489936829
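
The padding step in the cell above forces every tokenized description to exactly `max_seq_length` integers. As a rough illustration, the pure-Python sketch below mimics the default behaviour of Keras `pad_sequences`, where zeros are added at the front and over-long sequences lose their earliest tokens. `pad_pre` is a hypothetical helper written for this example and is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                 # drop the earliest tokens if too long
        fill = [value] * (maxlen - len(kept))      # zeros go in front of the real tokens
        padded.append(fill + kept)
    return padded


# Two toy token-id sequences: one shorter and one longer than maxlen
example = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_pre(example, maxlen=4))
```

One consequence of these defaults is that descriptions longer than 100 tokens are represented only by their final 100 tokens.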


In [7]:
from sklearn.metrics import confusion_matrix

y_true = y_test  # Actual labels
y_pred = model.predict(X_test_pad)  # Predicted probabilities, shape (n_samples, 1)

# Convert predicted probabilities to binary class labels (0 or 1)
y_pred_binary = [1 if prob > 0.3 else 0 for prob in y_pred]

# Calculate the confusion matrix
confusion = confusion_matrix(y_true, y_pred_binary)

print("Confusion Matrix:")
print(confusion)

Confusion Matrix:
[[3583   98]
 [  64  229]]


In [8]:
from sklearn.metrics import precision_score, recall_score

y_true = y_test  # Actual labels
y_pred_binary = [1 if prob > 0.3 else 0 for prob in model.predict(X_test_pad)]  # Predicted binary labels

precision = precision_score(y_true, y_pred_binary)
recall = recall_score(y_true, y_pred_binary)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

Precision: 0.70
Recall: 0.78


Since these models perform similarly, the simpler logistic regression model is our preferred choice. The head-to-head test below runs all three models on the same ten new descriptions.
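
To make "perform similarly" concrete, the three confusion matrices reported at the 0.3 threshold can be put side by side. This is a sketch that only re-derives metrics from the numbers already printed in this notebook; `scores` is a hypothetical helper, and the matrices are assumed to follow the `[[TN, FP], [FN, TP]]` layout.

```python
def scores(confusion):
    """Return (precision, recall, f1) for a matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrices at threshold 0.3, copied from the outputs above
reported = {
    "logistic regression": [[3590, 91], [75, 218]],
    "random forest": [[3568, 113], [80, 213]],
    "cnn": [[3583, 98], [64, 229]],
}

for name, matrix in reported.items():
    p, r, f1 = scores(matrix)
    print(f"{name:20s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this single train/test split the F1 scores fall within a few points of each other, so the differences may not be meaningful without repeated splits or cross-validation.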

# Model Head-to-Head Test

## Logistic Regression Test

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# List of 10 new data points (text descriptions) that you want to predict
# The first 5 are non-reportable, the last 5 are reportable.
new_data_points = [
    "CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.",
    "CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.",
    "CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.",
    "CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.",
    "CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.",
    "THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUID) DUE TO POTENTIAL LOAD SHIFT *(DAMAGE TO THE TOTE DURING TRANSPORT). THE IMPACTS ARE INSIDE THE TRAILER, ROADWAY, AND INTO A STORM DRAIN.",
    "CALLER IS REPORTING THAT A TANKER TRUCK ROLLED OVER AND DISCHARGED APPROXIMATELY 16 BARRELS OF CRUDE OIL ONTO THE GROUND. NO FIRES, INJURIES, OR FATALITIES REPORTED.",
    "CALLER IS REPORTING A RELEASE OF SODIUM HYDROXIDE SOLUTION ONTO THE BALLAST FROM TANK CAR GAMX5483 DUE TO EQUIPMENT FAILURE.",
    "CALLER IS REPORTING A RELEASE OF PHOSPHOROUS ACID (UN 2834) ONTO THE PAVEMENT AT A RAILYARD FROM A SHIPPING CONTAINER FOR UNKNOWN REASONS.",
    "THE CALLER IS REPORTING A DISCHARGE OF DYED DIESEL FUEL NO. 2 (300-400 GALLONS) ONTO SOIL. THE CALLER STATED THAT WHILE TRANSFERRING FUEL FROM A TANKER TRUCK TO A STORAGE TANK A VALVE WAS IMPROPERLY LEFT OPEN."
]

# Initialize a list to store the predictions and probabilities for each data point
predictions = []
probabilities = []
custom_threshold = 0.3

# Preprocess and make predictions for each data point
for data_point in new_data_points:
    # Preprocess the data point using the same TF-IDF vectorizer
    data_point_tfidf = tfidf_vectorizer.transform([data_point])

    # Use the Logistic Regression model to predict the probability
    predicted_probability = logistic_regression.predict_proba(data_point_tfidf)[:, 1]

    # Classify based on the custom threshold
    predicted_target = 1 if predicted_probability >= custom_threshold else 0

    # Append the results to the respective lists
    predictions.append(predicted_target)
    probabilities.append(predicted_probability)

# Print the predictions and probabilities for each data point
for i, data_point in enumerate(new_data_points):
    print(f"Data Point {i + 1}: {data_point}")
    print(f"Predicted Target: {predictions[i]}")
    print(f"Predicted Probability: {probabilities[i][0]:.2f}")
    print()


Data Point 1: CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.
Predicted Target: 0
Predicted Probability: 0.01

Data Point 2: CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.
Predicted Target: 0
Predicted Probability: 0.07

Data Point 3: CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.
Predicted Target: 0
Predicted Probability: 0.06

Data Point 4: CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.
Predicted Target: 0
Predicted Probability: 0.00

Data Point 5: CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.
Predicted Target: 1
Predicted Probability: 0.78

Data Point 6: THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUI

## Random Forest Classifier Test

In [13]:
# List of 10 new data points (text descriptions) that you want to predict
# The first 5 are non-reportable, the last 5 are reportable.
new_data_points = [
    "CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.",
    "CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.",
    "CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.",
    "CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.",
    "CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.",
    "THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUID) DUE TO POTENTIAL LOAD SHIFT *(DAMAGE TO THE TOTE DURING TRANSPORT). THE IMPACTS ARE INSIDE THE TRAILER, ROADWAY, AND INTO A STORM DRAIN.",
    "CALLER IS REPORTING THAT A TANKER TRUCK ROLLED OVER AND DISCHARGED APPROXIMATELY 16 BARRELS OF CRUDE OIL ONTO THE GROUND. NO FIRES, INJURIES, OR FATALITIES REPORTED.",
    "CALLER IS REPORTING A RELEASE OF SODIUM HYDROXIDE SOLUTION ONTO THE BALLAST FROM TANK CAR GAMX5483 DUE TO EQUIPMENT FAILURE.",
    "CALLER IS REPORTING A RELEASE OF PHOSPHOROUS ACID (UN 2834) ONTO THE PAVEMENT AT A RAILYARD FROM A SHIPPING CONTAINER FOR UNKNOWN REASONS.",
    "THE CALLER IS REPORTING A DISCHARGE OF DYED DIESEL FUEL NO. 2 (300-400 GALLONS) ONTO SOIL. THE CALLER STATED THAT WHILE TRANSFERRING FUEL FROM A TANKER TRUCK TO A STORAGE TANK A VALVE WAS IMPROPERLY LEFT OPEN."
]

# Initialize a list to store the predictions and probabilities for each data point
predictions = []
probabilities = []
custom_threshold = 0.3

# Preprocess and make predictions for each data point
for data_point in new_data_points:
    # Preprocess the data point using the same TF-IDF vectorizer
    data_point_tfidf = tfidf_vectorizer.transform([data_point])

    # Use the Random Forest model to predict the probability
    predicted_probability = random_forest.predict_proba(data_point_tfidf)[:, 1]

    # Classify based on the custom threshold
    predicted_target = 1 if predicted_probability >= custom_threshold else 0

    # Append the results to the respective lists
    predictions.append(predicted_target)
    probabilities.append(predicted_probability)

# Print the predictions and probabilities for each data point
for i, data_point in enumerate(new_data_points):
    print(f"Data Point {i + 1}: {data_point}")
    print(f"Predicted Target: {predictions[i]}")
    print(f"Predicted Probability: {probabilities[i][0]:.2f}")
    print()

Data Point 1: CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.
Predicted Target: 0
Predicted Probability: 0.03

Data Point 2: CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.
Predicted Target: 0
Predicted Probability: 0.00

Data Point 3: CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.
Predicted Target: 0
Predicted Probability: 0.03

Data Point 4: CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.
Predicted Target: 0
Predicted Probability: 0.00

Data Point 5: CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.
Predicted Target: 1
Predicted Probability: 0.59

Data Point 6: THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUI

## CNN Test

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# List of 10 new data points (text descriptions) that you want to predict
new_data_points = [
    "CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.",
    "CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.",
    "CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.",
    "CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.",
    "CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.",
    "THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUID) DUE TO POTENTIAL LOAD SHIFT *(DAMAGE TO THE TOTE DURING TRANSPORT). THE IMPACTS ARE INSIDE THE TRAILER, ROADWAY, AND INTO A STORM DRAIN.",
    "CALLER IS REPORTING THAT A TANKER TRUCK ROLLED OVER AND DISCHARGED APPROXIMATELY 16 BARRELS OF CRUDE OIL ONTO THE GROUND. NO FIRES, INJURIES, OR FATALITIES REPORTED.",
    "CALLER IS REPORTING A RELEASE OF SODIUM HYDROXIDE SOLUTION ONTO THE BALLAST FROM TANK CAR GAMX5483 DUE TO EQUIPMENT FAILURE.",
    "CALLER IS REPORTING A RELEASE OF PHOSPHOROUS ACID (UN 2834) ONTO THE PAVEMENT AT A RAILYARD FROM A SHIPPING CONTAINER FOR UNKNOWN REASONS.",
    "THE CALLER IS REPORTING A DISCHARGE OF DYED DIESEL FUEL NO. 2 (300-400 GALLONS) ONTO SOIL. THE CALLER STATED THAT WHILE TRANSFERRING FUEL FROM A TANKER TRUCK TO A STORAGE TANK A VALVE WAS IMPROPERLY LEFT OPEN."
]

# Tokenize and pad sequences for the new data points
new_data_sequences = tokenizer.texts_to_sequences(new_data_points)
new_data_pad = pad_sequences(new_data_sequences, maxlen=max_seq_length)

# Make predictions for each data point using the CNN model
predictions = model.predict(new_data_pad)
predicted_targets = [1 if prob >= 0.3 else 0 for prob in predictions]

# Print the predictions and probabilities for each data point
for i, data_point in enumerate(new_data_points):
    print(f"Data Point {i + 1}: {data_point}")
    print(f"Predicted Target: {predicted_targets[i]}")
    print(f"Predicted Probability: {predictions[i][0]:.2f}")
    print()


Data Point 1: CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.
Predicted Target: 0
Predicted Probability: 0.00

Data Point 2: CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.
Predicted Target: 0
Predicted Probability: 0.01

Data Point 3: CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.
Predicted Target: 0
Predicted Probability: 0.02

Data Point 4: CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.
Predicted Target: 0
Predicted Probability: 0.00

Data Point 5: CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.
Predicted Target: 1
Predicted Probability: 0.88

Data Point 6: THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUI

# Concluding Thoughts


Both Logistic Regression with TF-IDF and CNN (Convolutional Neural Network) are valid approaches for classifying text data. Here's a comparison of the two methods:

## Logistic Regression with TF-IDF:

### Pros:
- **Simplicity**: Logistic Regression is a simple and interpretable model.
- **Efficiency**: It offers faster training and inference, especially for small to medium-sized datasets.
- **Suitability**: Works well for linearly separable or straightforward text classification tasks.

### Cons:
- **Limited Complexity**: It may not capture complex relationships in the data.
- **Semantic Constraints**: Struggles on tasks requiring semantic understanding or contextual analysis.
- **Ineffectiveness**: Less effective for sentiment analysis and text generation tasks.

## CNN Text Classification Model:

### Pros:
- **Local Patterns**: CNN can capture local patterns and relationships in text using convolutional layers.
- **Hierarchical Features**: Effective at capturing hierarchical features in text data.
- **Contextual Understanding**: Suitable for tasks that require understanding sentence context and structure.

### Cons:
- **Data Requirement**: CNN models require larger datasets for training due to more parameters.
- **Training Time**: Longer training times, especially for large models.
- **Overfitting Risk**: May overfit if not regularized properly.

## Next Steps:
Consider augmenting the CNN model with an LSTM layer to further enhance its ability to handle sequential text data.

In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense, GlobalMaxPooling1D, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenize your text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to a fixed length
max_seq_length = 100  # Adjust as needed
X_train_pad = pad_sequences(X_train_seq, maxlen=max_seq_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_seq_length)

# Create a model with CNN and LSTM layers
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100, input_length=max_seq_length))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(LSTM(64, return_sequences=True))  # LSTM layer with 64 units
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train_pad, y_train, epochs=3, batch_size=64, validation_data=(X_test_pad, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test_pad, y_test)
print(f'Test Accuracy: {accuracy}')


Epoch 1/3
Epoch 2/3
Epoch 3/3
Test Accuracy: 0.9607448577880859


In [12]:
# List of 10 new data points (text descriptions) that you want to predict
new_data_points = [
    "CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.",
    "CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.",
    "CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.",
    "CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.",
    "CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.",
    "THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUID) DUE TO POTENTIAL LOAD SHIFT *(DAMAGE TO THE TOTE DURING TRANSPORT). THE IMPACTS ARE INSIDE THE TRAILER, ROADWAY, AND INTO A STORM DRAIN.",
    "CALLER IS REPORTING THAT A TANKER TRUCK ROLLED OVER AND DISCHARGED APPROXIMATELY 16 BARRELS OF CRUDE OIL ONTO THE GROUND. NO FIRES, INJURIES, OR FATALITIES REPORTED.",
    "CALLER IS REPORTING A RELEASE OF SODIUM HYDROXIDE SOLUTION ONTO THE BALLAST FROM TANK CAR GAMX5483 DUE TO EQUIPMENT FAILURE.",
    "CALLER IS REPORTING A RELEASE OF PHOSPHOROUS ACID (UN 2834) ONTO THE PAVEMENT AT A RAILYARD FROM A SHIPPING CONTAINER FOR UNKNOWN REASONS.",
    "THE CALLER IS REPORTING A DISCHARGE OF DYED DIESEL FUEL NO. 2 (300-400 GALLONS) ONTO SOIL. THE CALLER STATED THAT WHILE TRANSFERRING FUEL FROM A TANKER TRUCK TO A STORAGE TANK A VALVE WAS IMPROPERLY LEFT OPEN."
]

# Tokenize and pad sequences for the new data points
new_data_sequences = tokenizer.texts_to_sequences(new_data_points)
new_data_pad = pad_sequences(new_data_sequences, maxlen=max_seq_length)

# Make predictions for each data point using the CNN-LSTM model
predictions = model.predict(new_data_pad)
predicted_targets = [1 if prob >= 0.3 else 0 for prob in predictions]

# Print the predictions and probabilities for each data point
for i, data_point in enumerate(new_data_points):
    print(f"Data Point {i + 1}: {data_point}")
    print(f"Predicted Target: {predicted_targets[i]}")
    print(f"Predicted Probability: {predictions[i][0]:.2f}")
    print()

Data Point 1: CALLER IS REPORTING A RELEASE OF DIESEL FUEL FROM A PUMP AT A GAS STATION.
Predicted Target: 0
Predicted Probability: 0.01

Data Point 2: CALLER IS REPORTING THAT THERE IS A PARTIALLY SUBMERGED DRUM THAT IS NEAR THE BEACH LINE. THE DRUM IS DARK IN COLOR AND CAN ONLY BE SEEN AT LOW TIDE.
Predicted Target: 0
Predicted Probability: 0.01

Data Point 3: CALLER IS REPORTING THAT ONE BULLET WAS FOUND IN A PASSENGERS CARRY ON BAG AT THE GANG WAY SECURITY CHECK.
Predicted Target: 0
Predicted Probability: 0.04

Data Point 4: CALLER STATED THAT THEY DISCOVERED A VESSEL THAT IS TAKING ON WATER AND IS HALF OUT OF THE WATER, RESTING ON THE BOTTOM. THERE WAS NO REPORT OF POLLUTION AT THE TIME OF THE REPORT.
Predicted Target: 0
Predicted Probability: 0.00

Data Point 5: CALLER IS REPORTING THAT SODIUM HYDROXIDE IS LEAKING FROM A TANKER TRUCK DUE TO UNKNOWN CAUSES.
Predicted Target: 1
Predicted Probability: 0.92

Data Point 6: THE CALLER IS REPORTING A RELEASE OF UN1993 (COMBUSTIBLE LIQUI

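We can see that, on average, this model performs better on these sample points than the CNN model without the LSTM layer, although its overall test accuracy is essentially the same (0.9607 versus 0.9610).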