# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter see the excellent tutorial in
their [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not do this automatically, open http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
The code cell contains two lines of Python code.

![Markdown and code cells in the "text.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. Multiply the elements by 2 after the last line of code, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output result from :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This is confusing for Git, making
reviewing contributions very difficult.
Fortunately there is an alternative---native editing in the markdown format.

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access it through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

The above string `myserver` is the address of the remote server.
Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter notebooks. We will detail on how to run Jupyter notebooks on AWS instances
later in this appendix.

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


In [6]:
"""
Colab-ready script: Predictive maintenance (CMAPSS-like) using uploaded archive.zip
- Detects /mnt/data/archive.zip, extracts files into ./data/
- Auto-identifies train/test/RUL (truth) files using common name patterns (or substring match)
- Preprocesses, creates sliding windows, trains an LSTM, evaluates, saves model+scaler
"""
# === CELL 0: (optional) install libs in Colab ===
# Uncomment if you need to install
# !pip install --upgrade --quiet tensorflow pandas scikit-learn matplotlib joblib

# === CELL 1: imports ===
import os
import sys
import math
import zipfile
import shutil
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Masking
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras import backend as K
import joblib

print('Python', sys.version)

# === CELL 2: prepare data dir and extract archive.zip ===
DATA_DIR = Path('data')
DATA_DIR.mkdir(exist_ok=True)

ARCHIVE_PATH = Path('/content/archive.zip')  # user-upload location you mentioned

if ARCHIVE_PATH.exists():
    print('Found archive at', ARCHIVE_PATH)
    with zipfile.ZipFile(ARCHIVE_PATH, 'r') as z:
        print('Archive contents:')
        for name in z.namelist():
            print(' -', name)
        print('Extracting to ./data/ ...')
        z.extractall(DATA_DIR)
else:
    print('No /mnt/data/archive.zip found. If you uploaded files elsewhere, move them into this notebook or upload archive.zip to /mnt/data and rerun.')

print('\nFiles under ./data/:')
for f in sorted(DATA_DIR.rglob('*')):
    print('-', f.relative_to(Path.cwd()))

# === CELL 3: locate train/test/RUL files (flexible matching) ===
possible_train = ['train_FD001.txt','train_FD002.txt','train_FD003.txt','train_FD004.txt','train.txt','train.csv']
possible_test  = ['test_FD001.txt','test_FD002.txt','test_FD003.txt','test_FD004.txt','test.txt','test.csv']
possible_rul   = ['RUL_FD001.txt','RUL_FD002.txt','RUL_FD003.txt','RUL_FD004.txt','RUL.txt','RUL.csv','truth.txt','truth.csv','RUL_FD.txt']

train_path = None
test_path = None
rul_path = None

# exact matches first
for f in DATA_DIR.rglob('*'):
    name = f.name
    if name in possible_train and train_path is None:
        train_path = f
    if name in possible_test and test_path is None:
        test_path = f
    if name in possible_rul and rul_path is None:
        rul_path = f

# fallback: substring matches
if not train_path:
    for f in DATA_DIR.rglob('*'):
        if 'train' in f.name.lower():
            train_path = f
            break
if not test_path:
    for f in DATA_DIR.rglob('*'):
        if 'test' in f.name.lower():
            test_path = f
            break
if not rul_path:
    for f in DATA_DIR.rglob('*'):
        if 'rul' in f.name.lower() or 'truth' in f.name.lower():
            rul_path = f
            break

print('\nDetected files:')
print('train:', train_path)
print('test :', test_path)
print('rul  :', rul_path)

if not (train_path and test_path and rul_path):
    raise FileNotFoundError("Could not find train/test/RUL files automatically. Check ./data contents and re-run. Files found above.")

# copy to consistent names
shutil.copy(train_path, DATA_DIR / 'train.txt')
shutil.copy(test_path,  DATA_DIR / 'test.txt')
shutil.copy(rul_path,   DATA_DIR / 'RUL.txt')
train_path = DATA_DIR / 'train.txt'
test_path  = DATA_DIR / 'test.txt'
rul_path   = DATA_DIR / 'RUL.txt'
print('Files copied to data/train.txt, data/test.txt, data/RUL.txt')

# === CELL 4: load CMAPSS-like data (flexible delimiter) ===
col_names = ['unit','cycle'] + [f'op_{i}' for i in range(1,4)] + [f'sensor_{i}' for i in range(1,22)]

def load_file_try(path):
    # try whitespace separated first, then comma
    try:
        return pd.read_csv(path, sep=r'\s+', header=None, names=col_names)
    except Exception:
        return pd.read_csv(path, header=None, names=col_names)

train_df = load_file_try(train_path)
test_df  = load_file_try(test_path)
# RUL/truth usually single column
try:
    rul_df = pd.read_csv(rul_path, sep=r'\s+', header=None, names=['RUL'])
except Exception:
    rul_df = pd.read_csv(rul_path, header=None, names=['RUL'])

print('Shapes -> train:', train_df.shape, 'test:', test_df.shape, 'RUL:', rul_df.shape)

# === CELL 5: compute RUL for train/test ===
def add_rul_train(df):
    d = df.copy()
    d['RUL'] = d.groupby('unit')['cycle'].transform('max') - d['cycle']
    return d

train_df = add_rul_train(train_df)

# For test: RUL file gives remaining life at end-of-test; compute per-row RUL
last_cycle = test_df.groupby('unit')['cycle'].max().reset_index()
last_cycle.columns = ['unit','last_cycle']

if len(rul_df) != len(last_cycle):
    print('Warning: RUL length != number of unique units in test. Will attempt to align by order.')

# take the first len(last_cycle) rows from rul_df
last_cycle['future_RUL'] = rul_df['RUL'].values[:len(last_cycle)]
last_cycle_dict = last_cycle.set_index('unit')['last_cycle'].to_dict()
future_rul_dict = last_cycle.set_index('unit')['future_RUL'].to_dict()

test_df = test_df.copy()
test_df['RUL'] = test_df['unit'].map(last_cycle_dict) + test_df['unit'].map(future_rul_dict) - test_df['cycle']

print('Added RUL columns to train and test.')

# === CELL 6: quick EDA (optional) ===
print('\nTrain sample:')
display(train_df.head())

plt.figure(figsize=(6,3))
plt.hist(train_df['RUL'], bins=40)
plt.title('Train RUL distribution')
plt.xlabel('RUL')
plt.show()

# === CELL 7: feature selection and scaling ===
all_features = [c for c in train_df.columns if c.startswith('sensor_') or c.startswith('op_')]
var = train_df[all_features].var()
keep_cols = var[var > 0.0].index.tolist()
FEATURE_COLUMNS = keep_cols
LABEL_COLUMN = 'RUL'

print('Selected features:', FEATURE_COLUMNS)

scaler = StandardScaler()
scaler.fit(train_df[FEATURE_COLUMNS])
train_df_scaled = train_df.copy()
train_df_scaled[FEATURE_COLUMNS] = scaler.transform(train_df[FEATURE_COLUMNS])
test_df_scaled  = test_df.copy()
test_df_scaled[FEATURE_COLUMNS]  = scaler.transform(test_df[FEATURE_COLUMNS])

# === CELL 8: create sequences (padding for short engines) ===
def create_sequences(df, seq_len=30, features=FEATURE_COLUMNS, label_col='RUL'):
    seqs = []
    labels = []
    for uid in sorted(df['unit'].unique()):
        unit_data = df[df['unit']==uid].sort_values('cycle')
        X = unit_data[features].values
        y = unit_data[label_col].values
        if len(X) < seq_len:
            pad_len = seq_len - len(X)
            pad = np.repeat(X[[0], :], pad_len, axis=0)
            seqs.append(np.vstack([pad, X]))
            labels.append(y[-1])
        else:
            for start in range(0, len(X) - seq_len + 1):
                end = start + seq_len
                seqs.append(X[start:end])
                labels.append(y[end-1])
    return np.array(seqs), np.array(labels)

SEQ_LEN = 30
X_all, y_all = create_sequences(train_df_scaled, seq_len=SEQ_LEN)
print('Total train sequences:', X_all.shape)

# === CELL 9: train/val split by unit (no leakage) ===
unit_ids = sorted(train_df['unit'].unique())
np.random.seed(42)
np.random.shuffle(unit_ids)
train_units = unit_ids[:int(0.8*len(unit_ids))]
val_units   = unit_ids[int(0.8*len(unit_ids)):]

def sequences_for_units(df_scaled, units, seq_len=SEQ_LEN, features=FEATURE_COLUMNS):
    rows = df_scaled[df_scaled['unit'].isin(units)]
    return create_sequences(rows, seq_len=seq_len, features=features)

X_train, y_train = sequences_for_units(train_df_scaled, train_units)
X_val,   y_val   = sequences_for_units(train_df_scaled, val_units)
X_test,  y_test  = create_sequences(test_df_scaled, seq_len=SEQ_LEN)

print('Shapes:')
print('X_train', X_train.shape, 'y_train', y_train.shape)
print('X_val  ', X_val.shape,   'y_val  ', y_val.shape)
print('X_test ', X_test.shape,  'y_test ', y_test.shape)

# === CELL 10: build LSTM model ===
K.clear_session()
model = Sequential([
    Masking(mask_value=0., input_shape=(SEQ_LEN, len(FEATURE_COLUMNS))),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1, activation='linear')
])
model.compile(optimizer='adam', loss='mse', metrics=["RootMeanSquaredError"])
model.summary()

# callbacks and training
checkpoint_path = 'best_model.h5'
callbacks = [
    ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, verbose=1),
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True, verbose=1),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, verbose=1)
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=60,
    batch_size=64,
    callbacks=callbacks,
    verbose=2
)

# === CELL 11: evaluate, plot, save ===
model.load_weights(checkpoint_path)

train_pred = model.predict(X_train).ravel()
val_pred   = model.predict(X_val).ravel()
test_pred  = model.predict(X_test).ravel()

def rmse(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))

print('Train RMSE:', rmse(y_train, train_pred))
print('Val RMSE:', rmse(y_val, val_pred))
print('Test RMSE:', rmse(y_test, test_pred))

plt.figure(figsize=(6,4))
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.yscale('log')
plt.legend()
plt.title('Training loss')
plt.show()

plt.figure(figsize=(5,5))
plt.scatter(y_test, test_pred, alpha=0.4)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--')
plt.xlabel('Actual RUL')
plt.ylabel('Predicted RUL')
plt.title('Test: Actual vs Predicted RUL')
plt.show()

model.save('lstm_cmapss_model')
joblib.dump(scaler, 'feature_scaler.pkl')
print('Saved model -> lstm_cmapss_model and scaler -> feature_scaler.pkl')

# === CELL 12: quick inference example ===
if len(X_test) > 0:
    idx = np.random.randint(0, len(X_test))
    sample_seq = X_test[idx:idx+1]
    true_rul = y_test[idx]
    pred_rul = float(model.predict(sample_seq).ravel()[0])
    print(f'Sample predicted RUL: {pred_rul:.2f} ; Actual RUL: {true_rul:.2f}')
else:
    print('No test sequences created; check test file parsing / seq len.')


Python 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Found archive at /content/archive.zip
Archive contents:
 - PM_test.csv
 - PM_train.csv
 - PM_truth.csv
Extracting to ./data/ ...

Files under ./data/:


ValueError: 'data/PM_test.csv' is not in the subpath of '/content'

In [7]:
# Full Predictive Maintenance Project Code (Adjusted for '/content/archive.zip')

"""
Predictive maintenance using LSTM on aviation engine sensor data.
Updated to use the uploaded file located at: '/content/archive.zip'
This version loads PM_train.csv, PM_test.csv, PM_truth.csv directly after extraction.
"""

# ==========================
# CELL 1 — Imports & Setup
# ==========================
import os
import zipfile
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))

# ==========================
# CELL 2 — Unzip uploaded archive
# ==========================
UPLOAD_ZIP = "/content/archive.zip"  # fixed path
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print(f"Unzipping {UPLOAD_ZIP} ...")
with zipfile.ZipFile(UPLOAD_ZIP, "r") as zip_ref:
    zip_ref.extractall(DATA_DIR)

print("Extracted files:")
for f in DATA_DIR.iterdir():
    print(" -", f)

# ==========================
# CELL 3 — Load CSV files
# ==========================
train_file = DATA_DIR / "PM_train.csv"
test_file = DATA_DIR / "PM_test.csv"
truth_file = DATA_DIR / "PM_truth.csv"

train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
truth_df = pd.read_csv(truth_file)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("Truth shape:", truth_df.shape)

# ==========================
# CELL 4 — Create RUL for training
# ==========================
rul_train = train_df.groupby('unit_number')['time_in_cycles'].max().reset_index()
rul_train.columns = ['unit_number', 'max_cycle']

train_df = train_df.merge(rul_train, on='unit_number', how='left')
train_df['RUL'] = train_df['max_cycle'] - train_df['time_in_cycles']
train_df.drop('max_cycle', axis=1, inplace=True)

# ==========================
# CELL 5 — Normalization
# ==========================
FEATURES = train_df.columns.drop(['unit_number', 'time_in_cycles', 'RUL'])

scaler = MinMaxScaler()
train_df[FEATURES] = scaler.fit_transform(train_df[FEATURES])
test_df[FEATURES] = scaler.transform(test_df[FEATURES])

# ==========================
# CELL 6 — Sequence creation
# ==========================
SEQ_LEN = 30

def create_sequences(df, seq_len, target_col):
    sequences = []
    targets = []
    for engine_id in df['unit_number'].unique():
        edata = df[df['unit_number']==engine_id].reset_index(drop=True)
        matrix = edata[FEATURES].values

        for i in range(len(edata) - seq_len):
            sequences.append(matrix[i:i+seq_len])
            targets.append(edata.loc[i+seq_len, target_col])

    return np.array(sequences), np.array(targets)

X_train, y_train = create_sequences(train_df, SEQ_LEN, "RUL")

# ==========================
# CELL 7 — Build LSTM model
# ==========================
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, len(FEATURES))),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)
])

model.compile(optimizer="adam", loss="mse")
model.summary()

# ==========================
# CELL 8 — Train model
# ==========================
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=20,
    batch_size=64,
    verbose=1
)

# ==========================
# CELL 9 — Prepare test sequences
# ==========================
def last_sequence_per_engine(df, seq_len):
    lst = []
    for eid in df['unit_number'].unique():
        sub = df[df['unit_number']==eid].reset_index(drop=True)
        mat = sub[FEATURES].values
        if len(mat) >= seq_len:
            lst.append(mat[-seq_len:])
    return np.array(lst)

X_test = last_sequence_per_engine(test_df, SEQ_LEN)

# ==========================
# CELL 10 — Predict RUL
# ==========================
pred_rul = model.predict(X_test).flatten()
print("Predicted RUL sample:", pred_rul[:10])
print("Actual RUL sample:", truth_df['RUL'].values[:10])

# ==========================
# CELL 11 — Evaluation
# ==========================
rmse = np.sqrt(mean_squared_error(truth_df['RUL'], pred_rul))
print("RMSE:", rmse)

# ==========================
# CELL 12 — Save model
# ==========================
model.save("predictive_maintenance_lstm.h5")
print("Model saved: predictive_maintenance_lstm.h5")


TensorFlow version: 2.19.0
GPU available: []
Unzipping /content/archive.zip ...
Extracted files:
 - data/PM_train.csv
 - data/PM_truth.csv
 - data/PM_test.csv
Train shape: (20631, 26)
Test shape: (13096, 26)
Truth shape: (100, 2)


KeyError: 'unit_number'

In [8]:
import pandas as pd

import pandas as pd
df = pd.read_csv("/content/data/PM_train.csv")
print(df.columns)
print(df.head())


Index(['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3',
       's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14',
       's15', 's16', 's17', 's18', 's19', 's20', 's21'],
      dtype='object')
   id  cycle  setting1  setting2  setting3      s1      s2       s3       s4  \
0   1      1   -0.0007   -0.0004     100.0  518.67  641.82  1589.70  1400.60   
1   1      2    0.0019   -0.0003     100.0  518.67  642.15  1591.82  1403.14   
2   1      3   -0.0043    0.0003     100.0  518.67  642.35  1587.99  1404.20   
3   1      4    0.0007    0.0000     100.0  518.67  642.35  1582.79  1401.87   
4   1      5   -0.0019   -0.0002     100.0  518.67  642.37  1582.85  1406.22   

      s5  ...     s12      s13      s14     s15   s16  s17   s18    s19  \
0  14.62  ...  521.66  2388.02  8138.62  8.4195  0.03  392  2388  100.0   
1  14.62  ...  522.28  2388.07  8131.49  8.4318  0.03  392  2388  100.0   
2  14.62  ...  522.42  2388.03  8133.23  8.4178  0.03  390  2

In [9]:
df_test = pd.read_csv("/content/data/PM_test.csv")
print(df_test.columns)
print(df_test.head())

df_truth = pd.read_csv("/content/data/PM_truth.csv")
print(df_truth.head())


Index(['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3',
       's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14',
       's15', 's16', 's17', 's18', 's19', 's20', 's21'],
      dtype='object')
   id  cycle  setting1  setting2  setting3      s1      s2       s3       s4  \
0   1      1    0.0023    0.0003     100.0  518.67  643.02  1585.29  1398.21   
1   1      2   -0.0027   -0.0003     100.0  518.67  641.71  1588.45  1395.42   
2   1      3    0.0003    0.0001     100.0  518.67  642.46  1586.94  1401.34   
3   1      4    0.0042    0.0000     100.0  518.67  642.44  1584.12  1406.42   
4   1      5    0.0014    0.0000     100.0  518.67  642.51  1587.19  1401.92   

      s5  ...     s12      s13      s14     s15   s16  s17   s18    s19  \
0  14.62  ...  521.72  2388.03  8125.55  8.4052  0.03  392  2388  100.0   
1  14.62  ...  522.16  2388.06  8139.62  8.3803  0.03  393  2388  100.0   
2  14.62  ...  521.97  2388.03  8130.10  8.4441  0.03  393  2

In [15]:
# Full Predictive Maintenance Project Code (Adjusted for '/content/archive.zip')

"""
Predictive maintenance using LSTM on aviation engine sensor data.
Updated to use the uploaded file located at: '/content/archive.zip'
This version loads PM_train.csv, PM_test.csv, PM_truth.csv directly after extraction.
"""

# ==========================
# CELL 1 — Imports & Setup
# ==========================
import pandas as pd

import pandas as pd
df = pd.read_csv("/content/data/PM_train.csv")
print(df.columns)
print(df.head())
df_test = pd.read_csv("/content/data/PM_test.csv")
print(df_test.columns)
print(df_test.head())

df_truth = pd.read_csv("/content/data/PM_truth.csv")
print(df_truth.head())
import os
import zipfile
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))

# ==========================
# CELL 2 — Unzip uploaded archive
# ==========================
UPLOAD_ZIP = "/content/archive.zip"  # fixed path
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print(f"Unzipping {UPLOAD_ZIP} ...")
with zipfile.ZipFile(UPLOAD_ZIP, "r") as zip_ref:
    zip_ref.extractall(DATA_DIR)

print("Extracted files:")
for f in DATA_DIR.iterdir():
    print(" -", f)

# ==========================
# CELL 3 — Load CSV files
# ==========================
train_file = DATA_DIR / "PM_train.csv"
test_file = DATA_DIR / "PM_test.csv"
truth_file = DATA_DIR / "PM_truth.csv"

train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
truth_df = pd.read_csv(truth_file)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("Truth shape:", truth_df.shape)

# ==========================
# CELL 4 — Create RUL for training
# ==========================
rul_train = train_df.groupby('id')['cycle'].max().reset_index()
rul_train.columns = ['id', 'max_cycle']

train_df = train_df.merge(rul_train, on='unit_number', how='left')
train_df['RUL'] = train_df['max_cycle'] - train_df['cycles']
train_df.drop('max_cycle', axis=1, inplace=True)

# ==========================
# CELL 5 — Normalization
# ==========================
FEATURES = train_df.columns.drop(['id', 'cycles', 'RUL'])

scaler = MinMaxScaler()
train_df[FEATURES] = scaler.fit_transform(train_df[FEATURES])
test_df[FEATURES] = scaler.transform(test_df[FEATURES])

# ==========================
# CELL 6 — Sequence creation
# ==========================
SEQ_LEN = 30

def create_sequences(df, seq_len, target_col):
    sequences = []
    targets = []
    for engine_id in df['id'].unique():
        edata = df[df['id']==engine_id].reset_index(drop=True)
        matrix = edata[FEATURES].values

        for i in range(len(edata) - seq_len):
            sequences.append(matrix[i:i+seq_len])
            targets.append(edata.loc[i+seq_len, target_col])

    return np.array(sequences), np.array(targets)

X_train, y_train = create_sequences(train_df, SEQ_LEN, "RUL")

# ==========================
# CELL 7 — Build LSTM model
# ==========================
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, len(FEATURES))),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)
])

model.compile(optimizer="adam", loss="mse")
model.summary()

# ==========================
# CELL 8 — Train model
# ==========================
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=20,
    batch_size=64,
    verbose=1
)

# ==========================
# CELL 9 — Prepare test sequences
# ==========================
def last_sequence_per_engine(df, seq_len):
    lst = []
    for eid in df['id'].unique():
        sub = df[df['id']==eid].reset_index(drop=True)
        mat = sub[FEATURES].values
        if len(mat) >= seq_len:
            lst.append(mat[-seq_len:])
    return np.array(lst)

X_test = last_sequence_per_engine(test_df, SEQ_LEN)

# ==========================
# CELL 10 — Predict RUL
# ==========================
pred_rul = model.predict(X_test).flatten()
print("Predicted RUL sample:", pred_rul[:10])
print("Actual RUL sample:", truth_df['RUL'].values[:10])

# ==========================
# CELL 11 — Evaluation
# ==========================
rmse = np.sqrt(mean_squared_error(truth_df['RUL'], pred_rul))
print("RMSE:", rmse)

# ==========================
# CELL 12 — Save model
# ==========================
model.save("predictive_maintenance_lstm.h5")
print("Model saved: predictive_maintenance_lstm.h5")


Index(['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3',
       's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14',
       's15', 's16', 's17', 's18', 's19', 's20', 's21'],
      dtype='object')
   id  cycle  setting1  setting2  setting3      s1      s2       s3       s4  \
0   1      1   -0.0007   -0.0004     100.0  518.67  641.82  1589.70  1400.60   
1   1      2    0.0019   -0.0003     100.0  518.67  642.15  1591.82  1403.14   
2   1      3   -0.0043    0.0003     100.0  518.67  642.35  1587.99  1404.20   
3   1      4    0.0007    0.0000     100.0  518.67  642.35  1582.79  1401.87   
4   1      5   -0.0019   -0.0002     100.0  518.67  642.37  1582.85  1406.22   

      s5  ...     s12      s13      s14     s15   s16  s17   s18    s19  \
0  14.62  ...  521.66  2388.02  8138.62  8.4195  0.03  392  2388  100.0   
1  14.62  ...  522.28  2388.07  8131.49  8.4318  0.03  392  2388  100.0   
2  14.62  ...  522.42  2388.03  8133.23  8.4178  0.03  390  2

KeyError: 'unit_number'

In [16]:
"""
Predictive maintenance using LSTM on aviation engine sensor data.
Assumes uploaded archive at '/content/archive.zip' and CSVs inside 'data/' after extraction:
 - PM_train.csv
 - PM_test.csv
 - PM_truth.csv
"""

# --------------------------
# CELL 1 — Imports & Setup
# --------------------------
import zipfile
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))

# --------------------------
# CELL 2 — Unzip uploaded archive
# --------------------------
UPLOAD_ZIP = "/content/archive.zip"  # fixed path
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print(f"Unzipping {UPLOAD_ZIP} ...")
with zipfile.ZipFile(UPLOAD_ZIP, "r") as zip_ref:
    zip_ref.extractall(DATA_DIR)

print("Extracted files:")
for f in sorted(DATA_DIR.iterdir()):
    print(" -", f.name)

# --------------------------
# CELL 3 — Load CSV files
# --------------------------
train_file = DATA_DIR / "PM_train.csv"
test_file = DATA_DIR / "PM_test.csv"
truth_file = DATA_DIR / "PM_truth.csv"

train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
truth_df = pd.read_csv(truth_file)

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("Truth shape:", truth_df.shape)
print("Train columns:", train_df.columns.tolist())

# --------------------------
# CELL 4 — Ensure consistent column names
# --------------------------
# Common dataset variants use 'id' and 'cycle' — adapt if your CSVs have other names.
# If your files use different names (e.g. 'unit_number' or 'cycles'), change them here.
# We'll standardize to 'id' and 'cycle'
col_map = {}
if 'unit_number' in train_df.columns:
    col_map['unit_number'] = 'id'
if 'cycles' in train_df.columns:
    col_map['cycles'] = 'cycle'

# Apply to all dataframes
if col_map:
    train_df = train_df.rename(columns=col_map)
    test_df = test_df.rename(columns=col_map)

# Confirm required columns exist
required = {'id', 'cycle'}
missing = required - set(train_df.columns)
if missing:
    raise ValueError(f"Missing required columns in train_df: {missing}")

# --------------------------
# CELL 5 — Create RUL for training
# --------------------------
# Compute max cycle per engine and then RUL = max_cycle - cycle
rul_train = train_df.groupby('id')['cycle'].max().reset_index()
rul_train.columns = ['id', 'max_cycle']

train_df = train_df.merge(rul_train, on='id', how='left')
train_df['RUL'] = train_df['max_cycle'] - train_df['cycle']
train_df = train_df.drop(columns=['max_cycle'])

# --------------------------
# CELL 6 — Feature selection & Normalization
# --------------------------
# Remove identifier and target columns from features
FEATURES = [c for c in train_df.columns if c not in ('id', 'cycle', 'RUL')]

print("Using features:", FEATURES)

scaler = MinMaxScaler()
train_df[FEATURES] = scaler.fit_transform(train_df[FEATURES])
# For test set, we must apply the same scaler. If test has same sensor columns, transform directly.
test_df[FEATURES] = scaler.transform(test_df[FEATURES])

# --------------------------
# CELL 7 — Sequence creation
# --------------------------
SEQ_LEN = 30

def create_sequences(df, seq_len, target_col, feature_cols):
    sequences = []
    targets = []
    # iterate sorted ids to have deterministic order
    for engine_id in sorted(df['id'].unique()):
        edata = df[df['id'] == engine_id].sort_values('cycle').reset_index(drop=True)
        matrix = edata[feature_cols].values
        # create sliding windows; target is the RUL at the timestep after the sequence
        for i in range(len(edata) - seq_len):
            sequences.append(matrix[i:i+seq_len])
            targets.append(edata.loc[i+seq_len, target_col])
    return np.array(sequences), np.array(targets)

X_train, y_train = create_sequences(train_df, SEQ_LEN, "RUL", FEATURES)
print("X_train shape:", X_train.shape, "y_train shape:", y_train.shape)

# --------------------------
# CELL 8 — Build LSTM model
# --------------------------
model = Sequential([
    LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, len(FEATURES))),
    Dropout(0.2),
    LSTM(32),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1)
])

model.compile(optimizer="adam", loss="mse")
model.summary()

# --------------------------
# CELL 9 — Train model
# --------------------------
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=20,
    batch_size=64,
    verbose=1
)

# --------------------------
# CELL 10 — Prepare test sequences (last sequence per engine)
# --------------------------
def last_sequence_per_engine(df, seq_len, feature_cols):
    lst = []
    ids = []
    for eid in sorted(df['id'].unique()):  # sorted to align with truth_df order (assumption)
        sub = df[df['id'] == eid].sort_values('cycle').reset_index(drop=True)
        mat = sub[feature_cols].values
        if len(mat) >= seq_len:
            lst.append(mat[-seq_len:])
            ids.append(eid)
        else:
            # If not enough cycles to form a full seq, you could pad or skip; we skip here.
            print(f"Engine {eid} has {len(mat)} cycles (<{seq_len}) — skipping.")
    return np.array(lst), ids

X_test, test_ids = last_sequence_per_engine(test_df, SEQ_LEN, FEATURES)
print("X_test shape:", X_test.shape, "number of test engines used:", len(test_ids))

# --------------------------
# CELL 11 — Predict RUL
# --------------------------
pred_rul = model.predict(X_test).flatten()
print("Predicted RUL sample:", pred_rul[:10])

# --------------------------
# CELL 12 — Align predictions with truth and evaluate
# --------------------------
# PM_truth usually contains RULs per engine in the same order as engine id 1..N.
# We'll create a mapping from engine id to truth RUL.
# First, ensure truth_df has one column with RUL (commonly single column).
truth_df = truth_df.reset_index(drop=True)
# If truth_df has more than one column, take the first numeric column as RUL
if truth_df.shape[1] > 1:
    truth_col = truth_df.columns[0]
else:
    truth_col = truth_df.columns[0]

# Make dict mapping id -> truth RUL. If truth_df doesn't have ids, assume ordering matches sorted ids.
if 'id' in truth_df.columns:
    truth_map = dict(zip(truth_df['id'], truth_df[truth_col]))
else:
    # assume truth rows correspond to sorted engine ids in test set
    sorted_test_ids = sorted(test_df['id'].unique())
    truth_map = dict(zip(sorted_test_ids, truth_df[truth_col].values))

# Build ground-truth array aligned with test_ids used
truth_aligned = np.array([truth_map.get(eid, np.nan) for eid in test_ids])

# Filter out any NaN (in case of missing)
valid_mask = ~np.isnan(truth_aligned)
if valid_mask.sum() == 0:
    raise ValueError("No valid truth values found for the test engines used for prediction.")

pred_rul_aligned = pred_rul[valid_mask]
truth_aligned = truth_aligned[valid_mask]

print("Actual RUL sample:", truth_aligned[:10])

rmse = np.sqrt(mean_squared_error(truth_aligned, pred_rul_aligned))
print("RMSE:", rmse)

# --------------------------
# CELL 13 — Save model
# --------------------------
model.save("predictive_maintenance_lstm.h5")
print("Model saved: predictive_maintenance_lstm.h5")


TensorFlow version: 2.19.0
GPU available: []
Unzipping /content/archive.zip ...
Extracted files:
 - PM_test.csv
 - PM_train.csv
 - PM_truth.csv
Train shape: (20631, 26)
Test shape: (13096, 26)
Truth shape: (100, 2)
Train columns: ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
Using features: ['setting1', 'setting2', 'setting3', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
X_train shape: (17631, 30, 24) y_train shape: (17631,)


  super().__init__(**kwargs)


Epoch 1/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 28ms/step - loss: 9189.4131 - val_loss: 6678.9951
Epoch 2/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 29ms/step - loss: 3759.9985 - val_loss: 5442.7827
Epoch 3/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 31ms/step - loss: 3410.0107 - val_loss: 5380.9341
Epoch 4/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 68ms/step - loss: 3535.7532 - val_loss: 5439.9990
Epoch 5/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 30ms/step - loss: 3393.4722 - val_loss: 3577.2395
Epoch 6/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 26ms/step - loss: 1553.3059 - val_loss: 2378.3474
Epoch 7/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 30ms/step - loss: 1077.8826 - val_loss: 1794.4165
Epoch 8/20
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 26ms/step - loss: 1010.6374 - va



Predicted RUL sample: [130.39886  156.76396   57.90264   91.904854 123.12611  131.69188
 111.208595 114.26902  154.96301   89.896   ]
Actual RUL sample: [ 1  2  3  4  5  6  7  8  9 10]
RMSE: 73.79513978745484
Model saved: predictive_maintenance_lstm.h5
