# 🎬 StreamSage Module 1: The Binge Predictor (LSTM)

**Goal**: Train a Recurrent Neural Network (LSTM) to predict whether a user will keep watching movies (binge) or stop.

**Concepts**:
- **Sequence Modeling**: Treating user history as a timeline.
- **LSTM (Long Short-Term Memory)**: A recurrent neural network whose gated memory cell lets it retain information across many time steps.
- **Binary Classification**: Predicting Yes (1) or No (0).

**Steps**:
1. Load MovieLens 100K Dataset.
2. Preprocess data into user sessions.
3. Build an LSTM model with Keras.
4. Train the model.
5. Save and download `binge_model.h5`.

In [None]:
# 1. Setup & Imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Masking, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

print(f"TensorFlow Version: {tf.__version__}")

## 2. Load Data
We will use the **MovieLens 100K** dataset, a standard benchmark for recommender systems.

In [None]:
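# -nc skips the download if the zip is already present; -o overwrites without prompting on re-runs
!wget -nc https://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o -q ml-100k.zip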

# Load ratings: user_id, movie_id, rating, timestamp
columns = ['user_id', 'movie_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)

# Sort by user and time to create sequences
df = df.sort_values(by=['user_id', 'timestamp'])
print("Data loaded successfully!")
df.head()

## 3. Preprocessing: Creating Sessions

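We need to turn each user's list of ratings into fixed-length sequences ("sessions").
- We'll group by `user_id`.
- We'll calculate the time difference between consecutive entries to define "binging" (watching in quick succession). MovieLens records when a movie was rated, not when it was watched, so the rating time serves as a proxy for viewing time.
- **Feature Engineering**: We'll use the normalized `rating` and `time_diff` as the two features per time step. The raw `timestamp` is used only to compute the gaps and the labels.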
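
To make the per-user time gap concrete, here is a minimal sketch on a toy table. The values are hypothetical and chosen only so the result is easy to check by hand; the column names follow the notebook.

```python
import pandas as pd

# Toy ratings log (hypothetical values), already sorted by user and time.
toy = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "timestamp": [100, 160, 9000, 50, 8000],
})

# shift(1) inside each user group yields that user's previous timestamp.
# The first row of every user has no predecessor, so its gap is set to 0.
prev = toy.groupby("user_id")["timestamp"].shift(1)
toy["time_diff"] = (toy["timestamp"] - prev).fillna(0).astype(int)

print(toy["time_diff"].tolist())  # [0, 60, 8840, 0, 7950]
```

Because the shift happens within each group, the first entry of user 2 is not compared against the last entry of user 1.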

In [None]:
# Calculate time difference between consecutive watches for each user
df['prev_timestamp'] = df.groupby('user_id')['timestamp'].shift(1)
df['time_diff'] = df['timestamp'] - df['prev_timestamp']
df['time_diff'] = df['time_diff'].fillna(0) # First movie has 0 diff

# Scale both features to [0, 1] so they share a comparable range.
# Note: the scaler is fit on all rows, before the train/test split further below.
scaler = MinMaxScaler()
df[['rating_norm', 'time_diff_norm']] = scaler.fit_transform(df[['rating', 'time_diff']])

# Create Sequences
# Input (X): Sequence of last N movies [rating, time_gap]
# Output (y): Will they watch another movie within 2 hours? (1 = Yes, 0 = No)

SEQUENCE_LENGTH = 5 # Look at last 5 movies
BINGE_THRESHOLD = 7200 # 2 hours in seconds

sequences = []
targets = []

user_groups = df.groupby('user_id')

for user_id, group in user_groups:
    # Convert group to array
    data = group[['rating_norm', 'time_diff_norm']].values
    timestamps = group['timestamp'].values
    
    # Create sliding windows
    for i in range(len(data) - SEQUENCE_LENGTH):
        # Input: 5 movies
        seq = data[i:i+SEQUENCE_LENGTH]
        sequences.append(seq)
        
        # Target: Is the NEXT movie watched within 2 hours?
        next_time_diff = timestamps[i+SEQUENCE_LENGTH] - timestamps[i+SEQUENCE_LENGTH-1]
        target = 1 if next_time_diff < BINGE_THRESHOLD else 0
        targets.append(target)

X = np.array(sequences)
y = np.array(targets)

print(f"Created {len(X)} sequences.")
print(f"X shape: {X.shape} (Samples, TimeSteps, Features)")
print(f"y shape: {y.shape}")
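
The windowing logic can be checked on a tiny example. The sketch below wraps the same loop in a hypothetical helper, `make_windows`, and runs it on a made-up single-user history with a window of 2 instead of 5, so the alignment between each window and its label is visible.

```python
import numpy as np

def make_windows(features, timestamps, seq_len, threshold):
    """Return (X, y): every window of seq_len rows, labeled 1 if the
    event right after the window follows within `threshold` seconds."""
    X, y = [], []
    for i in range(len(features) - seq_len):
        X.append(features[i:i + seq_len])
        gap = timestamps[i + seq_len] - timestamps[i + seq_len - 1]
        y.append(1 if gap < threshold else 0)
    return np.array(X), np.array(y)

# Hypothetical history: 5 events, 1 feature per event.
timestamps = np.array([0, 60, 120, 10_000, 10_030])
features = np.arange(5, dtype=float).reshape(-1, 1)

X_toy, y_toy = make_windows(features, timestamps, seq_len=2, threshold=7200)
print(X_toy.shape)     # (3, 2, 1)
print(y_toy.tolist())  # [1, 0, 1]
```

A history of `n` events yields `n - seq_len` windows, so users with `seq_len` or fewer events contribute nothing.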

## 4. Build LSTM Model

We use a **Sequential** model with:
1. **LSTM Layer**: The "memory" layer. `units=64` sets the size of its hidden and cell state vectors to 64.
2. **Dropout**: Reduces overfitting by randomly zeroing 20% of the LSTM outputs during training.
3. **Dense Layers**: A 32-unit `relu` layer followed by the final single-unit output. Its `sigmoid` activation gives a probability between 0 and 1.
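
To show what "64 units" means in practice, here is a minimal NumPy sketch of one LSTM layer processing a single `(5, 2)` window. It uses the standard LSTM gate equations with random weights; the function `lstm_step` and the weight names are illustrative assumptions, not Keras internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.
    W: (features, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell values
    c_t = f * c_prev + i * g                      # updated cell state
    h_t = o * np.tanh(c_t)                        # updated hidden state
    return h_t, c_t

units, features, steps = 64, 2, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.random((steps, features))  # one hypothetical input window
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

n_params = W.size + U.size + b.size
print(h.shape)   # (64,)
print(n_params)  # 17152
```

With `return_sequences=False`, only the final hidden state (a vector of 64 values) is passed on to the next layer. The parameter count of 17,152 should correspond to the LSTM row printed by `model.summary()`.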

In [None]:
model = Sequential([
    # Input shape: (5 steps, 2 features)
    LSTM(64, input_shape=(SEQUENCE_LENGTH, 2), return_sequences=False),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid') # Binary classification output
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

## 5. Train the Model
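We split the sequences randomly into Training (80%) and Testing (20%) sets. The test set is also passed as `validation_data`, and overlapping windows from the same user can land on both sides of the split, so treat the reported validation accuracy as optimistic.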

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_data=(X_test, y_test)
)
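
A random split places windows from the same user in both sets. One possible alternative is to split by user instead. The sketch below assumes a hypothetical array `seq_users` holding the owning `user_id` of each sequence (the window loop above does not record it, so it would need to be appended there), and the helper `split_by_user` is illustrative only.

```python
import numpy as np

def split_by_user(seq_users, test_fraction=0.2, seed=42):
    """Return boolean masks (train, test) such that every user's
    sequences fall entirely on one side of the split."""
    rng = np.random.default_rng(seed)
    users = np.unique(seq_users)
    rng.shuffle(users)
    n_test = max(1, int(round(len(users) * test_fraction)))
    test_mask = np.isin(seq_users, users[:n_test])
    return ~test_mask, test_mask

# Hypothetical owner of each of 10 sequences.
seq_users = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 5])
train_mask, test_mask = split_by_user(seq_users)

overlap = set(seq_users[train_mask]) & set(seq_users[test_mask])
print(len(overlap))  # 0
```

With the notebook's arrays this would translate to `X[train_mask]`, `y[train_mask]`, `X[test_mask]`, and `y[test_mask]`. The test fraction then applies to users rather than to sequences, so the resulting set sizes can deviate from an exact 80/20 split.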

## 6. Evaluate & Visualize

In [None]:
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
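
Accuracy alone can mislead when one class dominates the labels, which may be the case here if most ratings are entered in quick succession. As a rough sanity check, the sketch below compares against a majority-class baseline and computes precision and recall. The labels and probabilities are made up for illustration, and both helper functions are hypothetical.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    positive_rate = float(np.mean(y_true))
    return max(positive_rate, 1.0 - positive_rate)

def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall of the positive class at a probability threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.8, 0.4, 0.7, 0.6, 0.95, 0.2, 0.85])

baseline = majority_baseline_accuracy(y_true)
precision, recall = precision_recall(y_true, y_prob)
print(baseline)  # 0.75
```

In the notebook, `y_prob` would come from `model.predict(X_test).ravel()` and `y_true` would be `y_test`. A model whose validation accuracy is no higher than the baseline has not learned anything beyond the class frequencies.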

## 7. Save & Download
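We save the model in the HDF5 (`.h5`) format. Recent Keras releases treat HDF5 as a legacy format and recommend the native `.keras` format, but an `.h5` file can still be loaded with `tf.keras.models.load_model`.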

In [None]:
model.save('binge_model.h5')
print("Model saved as binge_model.h5")

# Code to download file from Colab
try:
    from google.colab import files
    files.download('binge_model.h5')
except ImportError:
    print("Not running in Colab, check local directory for 'binge_model.h5'")