# Data Loading & Initial Cleaning - CMAPSS Dataset

In [1]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


### Dataset Overview
- C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) simulates realistic sensor data from large commercial turbofan engines using a high-fidelity thermodynamic model.

- The dataset contains multivariate time-series data from multiple engines, each under various operational conditions and fault scenarios.

- It is divided into four subsets (FD001, FD002, FD003, FD004) that differ in the number of operating conditions (1 or 6) and fault modes (1 or 2).

- Each subset includes both training and test trajectories. Training trajectories run until failure (run-to-failure); test trajectories end at some point before failure.
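
As a quick reference, the subset settings described above can be written down as a small lookup table. This is a minimal sketch; `SUBSETS` and `describe_subset` are illustrative names and not part of the dataset or any library.

```python
# Operating conditions and fault modes per C-MAPSS subset.
# SUBSETS and describe_subset are hypothetical helper names used for illustration.
SUBSETS = {
    "FD001": {"operating_conditions": 1, "fault_modes": 1},
    "FD002": {"operating_conditions": 6, "fault_modes": 1},
    "FD003": {"operating_conditions": 1, "fault_modes": 2},
    "FD004": {"operating_conditions": 6, "fault_modes": 2},
}


def describe_subset(name):
    """Return a one-line description of a subset's settings."""
    cfg = SUBSETS[name]
    return (
        f"{name}: {cfg['operating_conditions']} operating condition(s), "
        f"{cfg['fault_modes']} fault mode(s)"
    )


for name in SUBSETS:
    print(describe_subset(name))
```

This notebook works with FD001, the simplest case: one operating condition and one fault mode.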

### Data Structure
- Columns: Each record has 26 columns: engine ID, cycle number, three operational settings, and 21 sensor measurements.

- Goal: Predict the remaining useful life (RUL) of each engine in the test set, using only the partial (truncated) sensor history provided for it.

- Noise and Variability: The dataset incorporates realistic elements such as sensor noise, manufacturing variance, and operational differences to mimic real-world degradation scenarios.
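
Because training trajectories run to failure, a RUL label for each training row can be derived as the engine's last recorded cycle minus the current cycle. The sketch below illustrates that common convention on a toy frame; it is an assumption about how labels might be built, not a step this notebook performs, and `add_rul` is a hypothetical helper name.

```python
import pandas as pd

# Toy data: engine 1 fails after 3 cycles, engine 2 after 2 cycles.
toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "cycle":     [1, 2, 3, 1, 2],
})


def add_rul(frame):
    """Add a RUL column: last cycle of each engine minus the current cycle."""
    out = frame.copy()
    last_cycle = out.groupby("engine_id")["cycle"].transform("max")
    out["RUL"] = last_cycle - out["cycle"]
    return out


labeled = add_rul(toy)
print(labeled)
```

This only makes sense for training data. Test trajectories stop before failure, so their last cycle does not correspond to RUL = 0.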

In [2]:
## 1. Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
## 2. Load the Dataset
# Download or place the CMAPSS file in your working directory (e.g., 'train_FD001.txt')
# Adjust the path as needed
column_names = [
    "engine_id", "cycle",
    "op_setting_1", "op_setting_2", "op_setting_3"
] + [f"sensor_{i}" for i in range(1,22)]

df = pd.read_csv('/content/drive/MyDrive/Infosys/data/train_FD001.txt',
    sep=r'\s+',  # raw string avoids an invalid-escape-sequence warning
    header=None,
    names=column_names
)

print("Shape of the DataFrame:", df.shape)
df.head()



Shape of the DataFrame: (20631, 26)


Unnamed: 0,engine_id,cycle,op_setting_1,op_setting_2,op_setting_3,sensor_1,sensor_2,sensor_3,sensor_4,sensor_5,...,sensor_12,sensor_13,sensor_14,sensor_15,sensor_16,sensor_17,sensor_18,sensor_19,sensor_20,sensor_21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044
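
The loading step above could be wrapped in a small reusable helper so the same column names apply to any of the four subsets. This is a sketch under assumptions: `load_cmapss` and `COLUMN_NAMES` are hypothetical names, and the two sample rows are synthetic placeholders rather than real C-MAPSS values.

```python
import io

import pandas as pd

# Same 26 column names as in the cell above.
COLUMN_NAMES = (
    ["engine_id", "cycle", "op_setting_1", "op_setting_2", "op_setting_3"]
    + [f"sensor_{i}" for i in range(1, 22)]
)


def load_cmapss(path_or_buffer):
    """Read a whitespace-delimited C-MAPSS file into a named DataFrame."""
    return pd.read_csv(
        path_or_buffer,
        sep=r"\s+",      # columns are separated by one or more spaces
        header=None,     # the raw files have no header row
        names=COLUMN_NAMES,
    )


# Synthetic two-row sample (placeholder values) so the sketch runs without the real file.
row1 = " ".join(["1", "1"] + ["0.5"] * 24)
row2 = " ".join(["1", "2"] + ["0.5"] * 24)
loaded = load_cmapss(io.StringIO(row1 + "\n" + row2 + "\n"))
print(loaded.shape)
```

With the real data, the same function would be called with a file path instead of the in-memory buffer.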


In [5]:
## 3. Data Profiling: Check Types and Stats
df.info()
df.describe().transpose()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20631 entries, 0 to 20630
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   engine_id     20631 non-null  int64  
 1   cycle         20631 non-null  int64  
 2   op_setting_1  20631 non-null  float64
 3   op_setting_2  20631 non-null  float64
 4   op_setting_3  20631 non-null  float64
 5   sensor_1      20631 non-null  float64
 6   sensor_2      20631 non-null  float64
 7   sensor_3      20631 non-null  float64
 8   sensor_4      20631 non-null  float64
 9   sensor_5      20631 non-null  float64
 10  sensor_6      20631 non-null  float64
 11  sensor_7      20631 non-null  float64
 12  sensor_8      20631 non-null  float64
 13  sensor_9      20631 non-null  float64
 14  sensor_10     20631 non-null  float64
 15  sensor_11     20631 non-null  float64
 16  sensor_12     20631 non-null  float64
 17  sensor_13     20631 non-null  float64
 18  sensor_14     20631 non-nu

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
engine_id,20631.0,51.506568,29.22763,1.0,26.0,52.0,77.0,100.0
cycle,20631.0,108.807862,68.88099,1.0,52.0,104.0,156.0,362.0
op_setting_1,20631.0,-9e-06,0.002187313,-0.0087,-0.0015,0.0,0.0015,0.0087
op_setting_2,20631.0,2e-06,0.0002930621,-0.0006,-0.0002,0.0,0.0003,0.0006
op_setting_3,20631.0,100.0,0.0,100.0,100.0,100.0,100.0,100.0
sensor_1,20631.0,518.67,6.537152e-11,518.67,518.67,518.67,518.67,518.67
sensor_2,20631.0,642.680934,0.5000533,641.21,642.325,642.64,643.0,644.53
sensor_3,20631.0,1590.523119,6.13115,1571.04,1586.26,1590.1,1594.38,1616.91
sensor_4,20631.0,1408.933782,9.000605,1382.25,1402.36,1408.04,1414.555,1441.49
sensor_5,20631.0,14.62,3.3947e-12,14.62,14.62,14.62,14.62,14.62
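
The summary above shows an essentially zero standard deviation for `op_setting_3`, `sensor_1`, and `sensor_5`, which means those columns carry no information in FD001. A minimal sketch for flagging such columns follows; `constant_columns` and the tolerance value are assumptions chosen for illustration, and the toy frame reuses a few values from the preview above.

```python
import pandas as pd


def constant_columns(frame, tol=1e-8):
    """Return names of numeric columns whose standard deviation is at most tol."""
    stds = frame.std(numeric_only=True)
    return stds[stds <= tol].index.tolist()


# Toy frame: two constant columns and one that varies.
toy = pd.DataFrame({
    "op_setting_3": [100.0, 100.0, 100.0],
    "sensor_1":     [518.67, 518.67, 518.67],
    "sensor_2":     [641.82, 642.15, 642.35],
})

print(constant_columns(toy))
```

A small tolerance is used instead of an exact zero check because floating-point rounding can leave a tiny nonzero standard deviation, as the `sensor_1` row in the summary shows. Whether to drop the flagged columns is a modeling decision left open here.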


In [6]:
## 4. Check for Missing Values
print("Missing values per column:")
print(df.isnull().sum())


Missing values per column:
engine_id       0
cycle           0
op_setting_1    0
op_setting_2    0
op_setting_3    0
sensor_1        0
sensor_2        0
sensor_3        0
sensor_4        0
sensor_5        0
sensor_6        0
sensor_7        0
sensor_8        0
sensor_9        0
sensor_10       0
sensor_11       0
sensor_12       0
sensor_13       0
sensor_14       0
sensor_15       0
sensor_16       0
sensor_17       0
sensor_18       0
sensor_19       0
sensor_20       0
sensor_21       0
dtype: int64


In [7]:
## 5. Handle Missing or Anomalous Values
# If any missing values: (none expected for CMAPSS, but good practice)
if df.isnull().any().any():
    # Forward-fill, then back-fill any leading gaps (fillna(method=...) is deprecated in recent pandas)
    df = df.ffill().bfill()
    print("Missing values after filling:", df.isnull().sum().sum())
else:
    print("No missing values detected.")


No missing values detected.
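
If missing values did appear, filling across the whole frame could copy readings from one engine into the next, since rows of different engines are stacked. One possible alternative is to fill within each engine's trajectory. The sketch below assumes that approach; `fill_within_engine` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the first reading of engine 2 is missing.
toy = pd.DataFrame({
    "engine_id": [1, 1, 1, 2, 2],
    "cycle":     [1, 2, 3, 1, 2],
    "sensor_2":  [641.8, np.nan, 642.3, np.nan, 643.0],
})


def fill_within_engine(frame, id_col="engine_id"):
    """Forward-fill then back-fill gaps separately for each engine."""
    out = frame.copy()
    value_cols = [c for c in out.columns if c != id_col]
    # Forward-fill inside each engine's rows only.
    out[value_cols] = out.groupby(id_col)[value_cols].ffill()
    # Back-fill whatever is still missing at the start of a trajectory.
    out[value_cols] = out.groupby(id_col)[value_cols].bfill()
    return out


filled = fill_within_engine(toy)
print(filled)
```

Here the missing first reading of engine 2 is taken from engine 2's own next cycle rather than from engine 1's last cycle.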


In [8]:
## 6. Save or Export Cleaned Data (Optional)
df.to_csv('cmapss_cleaned_train_FD001.csv', index=False)


In [9]:
## 7. Summary Report
n_rows, n_cols = df.shape
print(f"The cleaned dataset has {n_rows} rows and {n_cols} columns.")
print("Data loading and initial cleaning complete.")


The cleaned dataset has 20631 rows and 26 columns.
Data loading and initial cleaning complete.
