### PSEUDO STEPS

1. Data Import/Collection and Preparation
    - Load and review the obtained APS Failure at Scania Trucks dataset from the UCI Machine Learning Repository.
    - Perform initial data exploration:
        - Analyze feature distributions
        - Check for missing values
        - Identify potential outliers
    - Handle missing values as described in your methodology:
        - Discard features with >70% missing values
        - Drop rows with missing values in features that have <5% missing values
        - Apply median imputation for features with 5-15% missing values
        - Use Multiple Imputation by Chained Equations (MICE) for features with 15-70% missing values
    - Conduct feature engineering:
        - Create binary features to indicate originally missing values
        - Separate histogram bin features from numerical features
2. Exploratory Data Analysis (EDA)
    - Visualize feature distributions and correlations
    - Analyze class imbalance (59,000 negative vs. 1,000 positive cases)
    - Identify potential key indicators of APS failures through statistical analysis:
        - Correlation analysis
        - Univariate studies
        - Feature importance rankings using techniques like Recursive Feature Elimination (RFE) or sklearn.feature_selection.SelectFromModel
3. Model Development and Optimization
    - Split the data into training (60,000 samples) and testing (16,000 samples) sets
    - Implement k-fold cross-validation on the training set
    - Train and evaluate multiple machine learning models:
        - Logistic Regression (baseline)
        - Random Forest
        - Support Vector Machines (SVM)
        - Gradient Boosted Decision Trees
        - Neural Networks (sklearn.neural_network.MLPClassifier)
    - Optimize hyperparameters for each model using techniques like grid search or random search
    - Address class imbalance using methods such as:
        - Oversampling (e.g., SMOTE)
        - Undersampling
        - Class weighting
4. Model Evaluation and Comparison
    - Evaluate models using performance metrics:
        - Accuracy
        - Precision
        - Recall
        - Macro-F1 Score
    - Compare model performance and select the best-performing model(s)
    - Analyze feature importance for the selected model(s) to identify key indicators of APS failures
5. Model Interpretation and Insights
    - Interpret the results of the best-performing model(s)
    - Identify the most significant features contributing to APS failures
    - Develop actionable insights for improving maintenance strategies
6. Cost-Benefit Analysis
    - Estimate the potential cost savings from implementing the predictive maintenance model
    - Compare the costs of false positives (unnecessary checks) vs. false negatives (missed failures)
    - Analyze the impact of the predictive model on overall maintenance costs
7. Model Deployment and Validation
    - Implement the selected model(s) on the test set
    - Evaluate the model's performance on unseen data
    - Develop a strategy for model deployment in real-world scenarios
    - Create a plan for continuous model monitoring and updating
8. Documentation and Reporting
    - Document the entire process, including data preprocessing steps, model development, and evaluation results
    - Prepare visualizations and summary statistics to support findings
    - Write a comprehensive report addressing the research questions and objectives
    - Develop recommendations for implementing predictive maintenance strategies in Scania trucks

## 1. Data Import and Preparation

### 1.1 Load and review the obtained APS Failure at Scania Trucks dataset

### 1.2 Initial Data Exploration

#### 1.2.1 Analyze feature distributions

#### 1.2.2 Check for missing values
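A minimal sketch of how the per-feature missing rates could be computed with pandas. The helper name `missing_report` and the toy frame are illustrative assumptions; the real data would be loaded from the UCI files instead.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, highest first."""
    report = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": df.isna().mean() * 100,
    })
    return report.sort_values("missing_pct", ascending=False)


# Toy frame standing in for the APS training data (column names reused for illustration only)
toy = pd.DataFrame({
    "aa_000": [1.0, 2.0, 3.0, 4.0],
    "ab_000": [np.nan, np.nan, np.nan, 0.0],
    "ad_000": [5.0, np.nan, 7.0, 8.0],
})

report = missing_report(toy)
print(report)
```

Sorting by `missing_pct` makes it easy to read off which features fall into each missing-value band used later in Section 1.3.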

#### Key observations:

1. High missing-value rates (>70%): Seven features (br_000, bq_000, bp_000, bo_000, ab_000, cr_000, bn_000) have extremely high missing rates, ranging from 73.3% to 82.1%. The highest rate, 82.1%, belongs to br_000.

2. Medium missing-value rates (30-70%): Features bm_000, bl_000, and bk_000 show moderate to high missing rates:
    - bm_000 has about 65.9% missing values
    - bl_000 has 45.5% missing values
    - bk_000 has 38.4% missing values

3. Lower missing-value rates (~23-25%): A large cluster of features (ad_000, cg_000, ch_000, cf_000, co_000, cx_000, cz_000, cy_000, dc_000, db_000) has consistent missing rates between 23% and 24.8%.

**This pattern suggests:**

* The data collection process might have systematic issues for certain sensors/measurements
* Features with very high missing values (>70%) might need to be dropped or require advanced imputation techniques
* The consistent ~24% missing rate across multiple features might indicate a systematic data collection issue or a specific operational condition where these measurements weren't recorded
* For machine learning purposes, handling these missing values will be crucial as they could significantly impact model performance

For this APS failure data, these missing values might represent sensor failures or conditions where measurements couldn't be taken during vehicle operation. 

### 1.3 Handle Missing Values

#### 1.3.1 Remove features with > 70% missing values

#### 1.3.2 Drop rows with missing values in features that have <5% missing values

#### 1.3.3 Apply median imputation for features with 5-15% missing values

#### 1.3.4 Use Multiple Imputation by Chained Equations (MICE) for features with 15-70% missing values
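The four tiers in Sections 1.3.1 to 1.3.4 could be sketched as one function, as below. This is an illustration under assumptions, not the notebook's final implementation: the function name `handle_missing`, the target column name `class`, the treatment of the 5% and 15% boundaries, and the toy data are all hypothetical. scikit-learn's `IterativeImputer` is used as a MICE-style imputer; it is still marked experimental and performs a single chained-equations imputation rather than producing multiple imputed datasets.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer


def handle_missing(df, target="class"):
    """Apply the tiered missing-value strategy and return a cleaned copy."""
    pct = df.drop(columns=[target]).isna().mean() * 100

    drop_cols = list(pct[pct > 70].index)                   # tier 1: discard feature
    row_cols = list(pct[(pct > 0) & (pct < 5)].index)       # tier 2: drop affected rows
    median_cols = list(pct[(pct >= 5) & (pct < 15)].index)  # tier 3: median imputation
    mice_cols = list(pct[(pct >= 15) & (pct <= 70)].index)  # tier 4: MICE-style imputation

    out = df.drop(columns=drop_cols)
    if row_cols:
        out = out.dropna(subset=row_cols).reset_index(drop=True)
    if median_cols:
        out[median_cols] = SimpleImputer(strategy="median").fit_transform(out[median_cols])
    if mice_cols:
        # Only the tier-4 columns still contain NaN; the other features act as predictors
        feature_cols = [c for c in out.columns if c != target]
        imputer = IterativeImputer(max_iter=10, random_state=42)
        out[feature_cols] = imputer.fit_transform(out[feature_cols])
    return out


# Toy data with one feature per tier (hypothetical names)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "class": rng.integers(0, 2, n),
    "f_full": rng.normal(size=n),
    "f_drop": rng.normal(size=n),
    "f_rows": rng.normal(size=n),
    "f_median": rng.normal(size=n),
    "f_mice": rng.normal(size=n),
})
toy.loc[:79, "f_drop"] = np.nan      # 80% missing
toy.loc[:1, "f_rows"] = np.nan       # 2% missing
toy.loc[10:19, "f_median"] = np.nan  # 10% missing
toy.loc[50:79, "f_mice"] = np.nan    # 30% missing

clean = handle_missing(toy)
print(clean.shape, int(clean.isna().sum().sum()))
```

In a real run the imputers would be fitted on the training set only and then applied to the test set with `transform`, so that test-set statistics do not leak into training.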

### 1.4 Perform Feature Engineering

#### 1.4.1 Separate histogram bin features from numerical features
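One possible way to separate the two groups is by column name. The sketch below assumes that the bins of one histogram share an identifier prefix and differ only in the numeric suffix (for example `ag_000`, `ag_001`, ...), while single-valued numerical features have a unique prefix. That naming convention and the helper `split_histogram_features` are assumptions to verify against the dataset description.

```python
from collections import defaultdict


def split_histogram_features(columns):
    """Split column names into (histogram bin columns, single numerical columns)."""
    groups = defaultdict(list)
    for col in columns:
        prefix = col.split("_")[0]  # e.g. "ag" from "ag_003"
        groups[prefix].append(col)

    # A prefix that occurs more than once is treated as a histogram
    hist_cols = [c for cols in groups.values() if len(cols) > 1 for c in cols]
    num_cols = [c for cols in groups.values() if len(cols) == 1 for c in cols]
    return hist_cols, num_cols


# Illustrative subset of column names
cols = ["aa_000", "ab_000", "ag_000", "ag_001", "ag_002", "ay_000", "ay_001"]
hist_cols, num_cols = split_histogram_features(cols)
print(hist_cols)
print(num_cols)
```

Because the split relies on prefixes, it should run before any derived columns that reuse a feature's prefix are added.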

#### 1.4.2 Create binary features to indicate originally missing values
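A short sketch of the indicator features, assuming a hypothetical helper `add_missing_indicators` and the suffix `_was_missing`. The indicators have to be computed before imputation (Section 1.3), since the imputed frame no longer records where values were missing.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df, columns):
    """Append a 0/1 `<col>_was_missing` column for each listed column."""
    out = df.copy()
    for col in columns:
        out[f"{col}_was_missing"] = df[col].isna().astype(int)
    return out


# Toy frame; the column names are borrowed from the dataset for illustration
toy = pd.DataFrame({
    "bk_000": [1.0, np.nan, 3.0],
    "bl_000": [np.nan, np.nan, 6.0],
})
flagged = add_missing_indicators(toy, ["bk_000", "bl_000"])
print(flagged)
```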

#### 1.4.3 Split the training dataset into features and target

## 2. Exploratory Data Analysis (EDA)

### 2.1 Visualize feature distributions and correlations

### 2.2 Analyze class imbalance (59,000 negative vs. 1,000 positive cases)
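The imbalance can be quantified in a few lines. The sketch below rebuilds a stand-in target with the class counts quoted in the heading; the label strings `neg` and `pos` are assumed.

```python
import pandas as pd

# Stand-in target with the quoted class counts; the labels "neg"/"pos" are assumed
y = pd.Series(["neg"] * 59000 + ["pos"] * 1000, name="class")

counts = y.value_counts()
imbalance_ratio = counts["neg"] / counts["pos"]
majority_baseline_accuracy = counts.max() / len(y)

print(counts)
print(f"Negative-to-positive ratio: {imbalance_ratio:.0f}:1")
print(f"Accuracy of always predicting 'neg': {majority_baseline_accuracy:.4f}")
```

A classifier that never predicts a failure already reaches about 98.3% accuracy on these counts, which is why accuracy alone is a weak criterion here and macro-averaged metrics are used later.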

### 2.3 Identify potential key indicators of APS failures through statistical analysis:

#### 2.3.1 Correlation analysis

## 3. Model Development and Optimization
    

### 3.1 Split the data into training (60,000 samples) and testing (16,000 samples) sets

### 3.2 Implement k-fold cross-validation on the training set
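A minimal sketch of stratified k-fold cross-validation, which keeps the class ratio roughly constant in every fold. The synthetic data from `make_classification`, the choice of five folds, and the logistic regression settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the prepared APS features
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Stratification preserves the minority-class share in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000, class_weight="balanced")

scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"Macro-F1 per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")
```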

### 3.3 Train and evaluate multiple machine learning models:

#### 3.3.1 Logistic Regression (baseline)

#### 3.3.2 Random Forest

#### 3.3.3 Support Vector Machines (SVM)

#### 3.3.4 Gradient Boosted Decision Trees

#### 3.3.5 Neural Networks (sklearn.neural_network.MLPClassifier)

### 3.4 Optimize hyperparameters for each model using techniques like grid search or random search

### 3.5 Address class imbalance using methods such as:

#### 3.5.1 Oversampling (e.g., SMOTE)

#### 3.5.2 Undersampling
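Random undersampling can be sketched with NumPy alone: every class is reduced to the size of the smallest one. The helper `random_undersample` and the toy arrays are hypothetical, and a dedicated resampling library could be used instead.

```python
import numpy as np


def random_undersample(X, y, random_state=42):
    """Randomly keep as many samples of each class as the smallest class has."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning the classes in sorted blocks
    return X[keep], y[keep]


# Toy data: 90 negatives, 10 positives
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = random_undersample(X, y)
print(X_bal.shape, np.bincount(y_bal))
```

Undersampling should only be applied to the training portion; the validation and test data keep their original class distribution.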

#### 3.5.3 Class weighting
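Class weights can be derived from the class counts with scikit-learn's `compute_class_weight`. The sketch uses the 59,000 / 1,000 split of the training set and assumes the target is encoded as 0 (negative) and 1 (positive).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in target with the training-set class counts (0 = negative, 1 = positive)
y = np.array([0] * 59000 + [1] * 1000)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The resulting dictionary can be passed as `class_weight=` to estimators that support it, such as `LogisticRegression`, `RandomForestClassifier`, and `SVC`; passing `class_weight="balanced"` directly has the same effect.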

## 4. Model Evaluation and Comparison

### 4.1 Evaluate models using performance metrics:

#### 4.1.1 Accuracy

#### 4.1.2 Precision

#### 4.1.3 Recall

#### 4.1.4 Macro-F1 Score

### 4.2 Compare model performance and select the best-performing model(s)

## 5. Model Interpretation and Insights


### 5.1 Interpret the results of the best-performing model(s)


### 5.2 Identify the most significant features contributing to APS failures

### 5.3 Develop actionable insights for improving maintenance strategies

## 6. Cost-Benefit Analysis

### 6.1 Estimate the potential cost savings from implementing the predictive maintenance model

### 6.2 Compare the costs of false positives (unnecessary checks) vs. false negatives (missed failures)
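The comparison can be made concrete by weighting the confusion-matrix counts with a cost per error type. The unit costs below (10 for a false positive, 500 for a false negative) are assumed values that should be checked against the cost definition in the dataset documentation; the helper `total_cost` and the toy predictions are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP = 10   # assumed cost of an unnecessary check
COST_FN = 500  # assumed cost of a missed failure


def total_cost(y_true, y_pred):
    """Total misclassification cost with 0 = negative and 1 = positive."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(COST_FP * fp + COST_FN * fn)


# Toy example: one false positive and one false negative
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0])

cost = total_cost(y_true, y_pred)
print(f"Total cost: {cost}")
```

With costs this asymmetric, a model that trades some extra false positives for fewer false negatives can be cheaper overall even if its accuracy is lower.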

### 6.3 Analyze the impact of the predictive model on overall maintenance costs

## 7. Model Deployment and Validation

### 7.1 Implement the selected model(s) on the test set

### 7.2 Evaluate the model's performance on unseen data

### 7.3 Develop a strategy for model deployment in real-world scenarios

### 7.4 Create a plan for continuous model monitoring and updating

## 8. Documentation and Reporting

### 8.1 Document the entire process, including data preprocessing steps, model development, and evaluation results

### 8.2 Prepare visualizations and summary statistics to support findings

### 8.3 Write a comprehensive report addressing the research questions and objectives

### 8.4 Develop recommendations for implementing predictive maintenance strategies in Scania trucks

### Analyze feature importance for the selected model(s) to identify key indicators of APS failures

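In [None]:
# Imports needed by the modelling cells below
import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True),  # probability=True is needed for soft voting
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
    'LightGBM': LGBMClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'BaggingClassifier': BaggingClassifier(random_state=42),
}

# Soft-voting ensemble over all base models (VotingClassifier fits its own clones)
estimators = [(name, model) for name, model in models.items()]

models['VotingClassifier'] = VotingClassifier(estimators=estimators, voting='soft')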

In [None]:
# Train and evaluate models
print("\nTraining and evaluating models...")
metrics = ['accuracy_score',
           'precision_score',
           'recall_score',
           'f1_score']

accuracy_df = pd.DataFrame(index=metrics)

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_resampled, y_train_resampled)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate metrics (macro-averaged so both classes count equally)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    f1 = f1_score(y_test, y_pred, average='macro')

    # Store the scores; index=metrics aligns them with the DataFrame rows
    # (a default 0..3 index would not match and would fill the column with NaN)
    accuracy_df[name] = pd.Series([accuracy, precision, recall, f1], index=metrics)

    print(f"{name} Results:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")

In [None]:
print("\nPerforming hyperparameter tuning...")

# Define parameter grids for each model
param_grids = {
    'Logistic Regression': {'C': [0.001, 0.01, 0.1, 1, 10, 100]},
    'Random Forest': {'n_estimators': [100, 200, 300], 'max_depth': [None, 10, 20, 30]},
    'Decision Tree': {'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]},
    'SVM': {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']},
    'GradientBoosting': {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1, 0.2]},
    'XGBoost': {'n_estimators': [100, 200], 'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1]},
    'LightGBM': {'n_estimators': [100, 200], 'max_depth': [-1, 5, 10], 'learning_rate': [0.01, 0.1]},
    'AdaBoost': {'n_estimators': [50, 100], 'learning_rate': [0.01, 0.1, 1]},
    'BaggingClassifier': {'n_estimators': [10, 20, 30], 'max_samples': [0.5, 0.7, 1.0]},
}

In [None]:
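# Perform GridSearchCV for each model
# Note: if X_train_resampled was oversampled before this step, synthetic samples
# can appear in both the training and validation folds, so the CV scores below
# may be optimistic; the held-out test set remains the fair comparison.
best_models = {}
for name, model in models.items():
    if name != 'VotingClassifier':  # Skip VotingClassifier for individual tuning
        print(f"\nTuning {name}...")
        grid_search = GridSearchCV(model, param_grids[name], cv=3, scoring='f1_macro', n_jobs=-1)
        grid_search.fit(X_train_resampled, y_train_resampled)

        # best_estimator_ is refit on the full resampled training set
        best_models[name] = grid_search.best_estimator_

        print(f"Best parameters: {grid_search.best_params_}")
        print(f"Best Macro-F1 Score: {grid_search.best_score_:.4f}")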

In [None]:
# Create a new VotingClassifier with the best models
best_estimators = [(name, model) for name, model in best_models.items()]
best_voting_classifier = VotingClassifier(estimators=best_estimators, voting='soft')

In [None]:
# Train and evaluate the best VotingClassifier
print("\nTraining and evaluating the best VotingClassifier...")
best_voting_classifier.fit(X_train_resampled, y_train_resampled)
y_pred = best_voting_classifier.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average='macro')
print(f"Best VotingClassifier Macro-F1 Score: {macro_f1:.4f}")

## Feature Importance Analysis

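In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

print("\nAnalyzing feature importance...")
# Fit a random forest on the resampled training data and use its
# impurity-based importances as a feature ranking
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_resampled, y_train_resampled)

# X_train must have the same columns, in the same order, as X_train_resampled
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rfc.feature_importances_
}).sort_values('importance', ascending=False)

# Plot the 20 highest-ranked features and save the figure to disk
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(20))
plt.title('Top 20 Most Important Features')
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.close()

print("\nAnalysis complete. Check the generated plots for visualizations.")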

## References:

Shetty, R. (2021). Predicting a Failure in Scania’s Air Pressure System. [online] Medium. Available at: https://towardsdatascience.com/predicting-a-failure-in-scanias-air-pressure-system-aps-c260bcc4d038 [Accessed 4 Jan. 2025].