# Deep Learning Packages in Python: A Financial Perspective

## 1. Lecture Overview & Learning Objectives 

Welcome to this lecture on Deep Learning (DL) packages in Python! In the realm of finance, deep learning is rapidly transforming how we approach complex problems, from predicting market movements and managing risk to understanding sentiment and optimizing trading strategies. Python, with its rich ecosystem of libraries, has become the de facto language for implementing these advanced models.

### Why is this relevant for Finance Students?

*   **Harnessing Complexity:** Financial markets are inherently non-linear and complex. DL models, with their ability to learn intricate patterns, are well-suited to capture these dynamics where traditional models might fall short.
*   **Big Data Analytics:** The financial world generates vast amounts of data (tick data, news feeds, social media, macroeconomic indicators). DL provides powerful tools to process and extract value from this scale of data.
*   **Competitive Edge:** Proficiency in these tools is becoming a critical skill for quantitative analysts, portfolio managers, and risk managers. Understanding which tool to use for which task can provide a significant competitive advantage.

### Learning Objectives:
By the end of this lecture, you should be able to:
1.  Identify the major deep learning packages in Python: **TensorFlow/Keras, PyTorch, and JAX**.
2.  Understand the core philosophy, strengths, and weaknesses of each package.
3.  Implement basic deep learning models using TensorFlow/Keras and PyTorch.
4.  Appreciate JAX's unique capabilities for high-performance numerical computing and research.
5.  Make informed decisions on choosing the appropriate DL framework for specific financial applications.

## 2. Introduction to Deep Learning in Python 

Deep Learning is a subset of machine learning inspired by the structure and function of the human brain's neural networks. It involves training artificial neural networks with many layers (hence "deep") on vast amounts of data to learn complex representations and patterns.

### Key Concepts:
*   **Neural Networks:** Composed of interconnected nodes (neurons) organized in layers.
*   **Layers:** Input, Hidden (multiple), Output.
*   **Activation Functions:** Introduce non-linearity (e.g., ReLU, Sigmoid, Tanh).
*   **Loss Function:** Measures how well the model is performing (e.g., MSE for regression, Cross-Entropy for classification).
*   **Optimizer:** Adjusts model weights to minimize the loss (e.g., Adam, SGD).
*   **Backpropagation:** Algorithm to efficiently calculate gradients and update weights.

### Major Deep Learning Packages in Python:

Today, we will focus on the three giants in the Python deep learning ecosystem:

1.  **TensorFlow / Keras:** Google's powerful and comprehensive ecosystem. Keras is its user-friendly high-level API.
2.  **PyTorch:** Developed by Facebook AI Research (FAIR), known for its flexibility and Pythonic interface.
3.  **JAX:** Google's newer library for high-performance numerical computing and machine learning research, often used for custom models and advanced optimizations.

Each of these frameworks has its strengths and ideal use cases. Understanding them will empower you to choose the right tool for your financial modeling needs.

## Fundamental Concepts:

### 1. Neural Networks

**What it is:**
A **Neural Network** (NN) is a computational model inspired by the structure and function of the human brain. It's composed of numerous interconnected nodes, often called "neurons" or "perceptrons," organized into layers. Each connection between neurons has an associated "weight," representing the strength or importance of that connection.

| Feature | Human Brain | Artificial Neural Network |
| :--- | :--- | :--- |
| **Basic Unit** | Neuron | Node (Perceptron) |
| **Signal** | Electrical Impulses | Floating-Point Numbers |
| **Connections** | Synapses | Weights |
| **Learning** | Adjusting Synaptic Strength (Plasticity) | Adjusting Weights (Optimization) |
| **Structure** | Neural Networks & Regions | Layers (Input, Hidden, Output) |
| **Speed** | Relatively Slow (milliseconds) but massively parallel | Extremely Fast (nanoseconds) but depends on hardware |
| **Energy** | Highly efficient (~20 watts) | Very high power consumption (kilowatts) |




**Why it's important:**
Neural networks are designed to recognize patterns and make predictions. Their ability to learn complex, non-linear relationships makes them exceptionally powerful for tasks like fraud detection, algorithmic trading, credit scoring, and market sentiment analysis where traditional linear models might fall short.

---
### 2. Layers

Neural networks are structured into distinct layers, each serving a specific purpose in processing the information flow:


![Artificial Neural Network Layers](https://upload.wikimedia.org/wikipedia/commons/e/e4/Artificial_neural_network.svg "Diagram illustrating the input, hidden, and output layers of a neural network")
*Image Source: Svjo / CC BY-SA 3.0*

*   **Input Layer:**
    *   **What it is:** This is the first layer of the network. Each neuron in the input layer corresponds to a feature in the input data. For example, if you're predicting stock prices, the input layer might represent features like opening price, closing price, volume, and various technical indicators.
    *   **Why it's important:** It's the entry point for your data into the network. It receives the raw information the model will learn from.

*   **Hidden Layers:**
    *   **What it is:** These are the layers between the input and output layers. In "deep" learning, there are typically multiple hidden layers. Each neuron in a hidden layer processes information from the previous layer, applies a transformation, and passes the result to the next layer. These layers are "hidden" because their outputs are not directly visible as raw input or final output.
    *   **Why it's important:** Hidden layers are where the network learns to extract increasingly abstract and complex features from the data. The more hidden layers (and neurons within them), the more complex patterns the network can potentially learn, which is crucial for handling highly non-linear relationships in financial time series or high-dimensional datasets.

*   **Output Layer:**
    *   **What it is:** This is the final layer of the network. The number of neurons in the output layer depends on the type of problem being solved. For a regression task (e.g., predicting a continuous stock price), it might have one neuron. For a classification task (e.g., classifying credit risk into categories), it typically has one neuron per class. Binary problems, such as predicting whether a stock will go up or down, are often handled with a single sigmoid neuron that outputs the probability of one class.
    *   **Why it's important:** It provides the network's final prediction or decision.
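
To make the three layer types concrete, here is a minimal NumPy sketch of one forward pass through a tiny network. The layer sizes (4 input features, 8 hidden neurons, 1 output), the random weights, and the helper name `forward` are illustrative assumptions, not part of any framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 4 input features, 8 hidden neurons, 1 output
n_features, n_hidden, n_outputs = 4, 8, 1

# Small random weights and zero biases for each layer
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)

def forward(X):
    """One forward pass: input -> hidden (ReLU) -> output (sigmoid)."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # hidden layer with ReLU activation
    logits = hidden @ W2 + b2               # output layer before activation
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid turns the logit into a probability

# A batch of 5 samples, each with 4 (already scaled) features
X_batch = rng.normal(size=(5, n_features))
probs = forward(X_batch)
print(probs.shape)  # (5, 1): one probability per sample
```

Each row of `X_batch` enters through the input layer, is transformed by the hidden layer, and leaves the output layer as a single number between 0 and 1.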

---
### 3. Activation Functions

**What it is:**
An **Activation Function** is applied to the output of each neuron in the hidden and output layers. Its primary role is to introduce non-linearity into the network. Without activation functions, a neural network, no matter how many layers it has, would essentially behave like a single linear regression model. Common activation functions include:



*   **ReLU (Rectified Linear Unit):** `f(x) = max(0, x)`. Outputs the input directly if it's positive, otherwise outputs zero. It's computationally efficient and widely used.
*   **Sigmoid:** `f(x) = 1 / (1 + e^(-x))`. Squashes values between 0 and 1, often used in binary classification output layers to represent probabilities.
*   **Tanh (Hyperbolic Tangent):** `f(x) = (e^x - e^(-x)) / (e^x + e^(-x))`. Squashes values between -1 and 1, often preferred over Sigmoid in hidden layers because its output is zero-centered.

**Why it's important:**
Non-linearity is crucial because most real-world data, especially in finance, is non-linear. Activation functions allow neural networks to learn complex, non-linear relationships between inputs and outputs, enabling them to model intricate market dynamics that linear models cannot capture.
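
The claim that a network without activations collapses into a single linear model can be checked numerically. The sketch below defines the activations listed above in NumPy and compares two stacked linear layers with one combined layer; the matrix shapes and variable names are arbitrary choices for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))      # negative inputs are clipped to zero
print(sigmoid(x))   # values squashed into (0, 1)
print(np.tanh(x))   # values squashed into (-1, 1)

# Two linear layers with no activation in between...
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))
inputs = rng.normal(size=(5, 3))
two_layers = (inputs @ W1) @ W2

# ...are equivalent to a single linear layer with weights W1 @ W2
one_layer = inputs @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))  # True

# Inserting ReLU between the layers breaks that equivalence
with_relu = relu(inputs @ W1) @ W2
print(np.allclose(with_relu, one_layer))
```

Only the version with the activation in the middle can represent something a single linear layer cannot.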

---
### 4. Loss Function

**What it is:**
A **Loss Function** (also known as a Cost Function or Objective Function) quantifies the discrepancy between the network's predicted output and the actual target output for a given input. A lower loss value indicates a better-performing model. Examples include:

![Cost Function Diagram](https://upload.wikimedia.org/wikipedia/commons/5/5b/Gradient_descent_method.png "A 2D representation of a cost function with a global minimum")
*Image Source: wikimedia.org*

*   **Mean Squared Error (MSE):** Commonly used for regression tasks (e.g., predicting exact stock prices). It calculates the average of the squared differences between predictions and actual values.
*   **Cross-Entropy:** Commonly used for classification tasks (e.g., predicting if a stock will rise or fall, or which credit risk category a borrower belongs to). It measures the dissimilarity between predicted probability distributions and true distributions.

**Why it's important:**
The loss function is the critical metric that guides the learning process. The goal of training a neural network is to minimize this loss, effectively making the model's predictions as close as possible to the true values.
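
Both losses fit in a few lines of NumPy. This is a minimal sketch with made-up numbers; the function names `mse` and `binary_cross_entropy` and the small clipping constant `eps` are assumptions for illustration, not a library API.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Regression: made-up prices
prices_true = np.array([100.0, 102.0, 101.0])
prices_pred = np.array([101.0, 101.0, 103.0])
print(mse(prices_true, prices_pred))  # (1 + 1 + 4) / 3 = 2.0

# Classification: made-up up/down labels and predicted probabilities of "up"
labels = np.array([1, 0, 1])
mostly_right = np.array([0.9, 0.1, 0.8])
mostly_wrong = np.array([0.1, 0.9, 0.2])
print(binary_cross_entropy(labels, mostly_right))  # small loss
print(binary_cross_entropy(labels, mostly_wrong))  # much larger loss
```

Cross-entropy penalizes confident wrong probabilities far more heavily than hesitant ones, which is why it is preferred over accuracy as a training signal.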

---
### 5. Optimizer

**What it is:**
An **Optimizer** is an algorithm or method used to adjust the weights and biases of the neural network in such a way that the loss function is minimized. It uses the gradients calculated during backpropagation to determine how to update the network's parameters. Popular optimizers include:


*   **Adam (Adaptive Moment Estimation):** A very popular and generally effective optimizer that combines concepts from other optimizers like RMSprop and AdaGrad, adapting learning rates for each parameter.
*   **SGD (Stochastic Gradient Descent):** A fundamental optimizer that updates weights iteratively based on the gradient of the loss function with respect to the weights for a small batch of data.

**Why it's important:**
The optimizer dictates "how" the network learns. By intelligently adjusting weights, it navigates the complex landscape of the loss function to find the optimal set of parameters that yield the best performance. Without an effective optimizer, the network might fail to learn or converge slowly.

---
### 6. Backpropagation

**What it is:**
**Backpropagation** is a fundamental algorithm for training neural networks. It works by efficiently calculating the gradients of the loss function with respect to all the weights and biases in the network. This calculation starts from the output layer and propagates backward through the network, hence the name "backpropagation."


**Why it's important:**
Backpropagation is the engine of learning in deep neural networks. It provides the optimizer with the necessary information (the gradients) to understand how each weight and bias contributes to the overall error. This allows the optimizer to make informed adjustments to the network's parameters, enabling the network to learn from its mistakes and improve its predictive accuracy over time.
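
A small worked example can make the backward flow of gradients tangible. The sketch below applies the chain rule by hand to a tiny one-hidden-layer regression network and checks the result against finite differences. The network shape, the random data, and the helper names (`loss_fn`, `backprop`, `numerical_grad`) are all assumptions chosen for illustration; frameworks perform this step automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # 6 samples, 3 features
y = rng.normal(size=(6, 1))            # regression targets
W1 = rng.normal(size=(3, 4)) * 0.5     # input -> hidden weights
W2 = rng.normal(size=(4, 1)) * 0.5     # hidden -> output weights

def loss_fn(W1, W2):
    hidden = np.tanh(X @ W1)
    pred = hidden @ W2
    return np.mean((pred - y) ** 2)    # MSE loss

def backprop(W1, W2):
    # Forward pass, keeping the intermediate values
    hidden = np.tanh(X @ W1)
    pred = hidden @ W2
    # Backward pass: apply the chain rule from the loss back to each weight matrix
    d_pred = 2.0 * (pred - y) / len(X)        # dLoss/dpred
    grad_W2 = hidden.T @ d_pred               # dLoss/dW2
    d_hidden = d_pred @ W2.T                  # gradient flowing into the hidden layer
    d_pre = d_hidden * (1.0 - hidden ** 2)    # through tanh: tanh'(z) = 1 - tanh(z)^2
    grad_W1 = X.T @ d_pre                     # dLoss/dW1
    return grad_W1, grad_W2

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = f()
        W[idx] = original - eps
        minus = f()
        W[idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

grad_W1, grad_W2 = backprop(W1, W2)
num_W1 = numerical_grad(lambda: loss_fn(W1, W2), W1)
num_W2 = numerical_grad(lambda: loss_fn(W1, W2), W2)
print(np.allclose(grad_W1, num_W1, atol=1e-6))
print(np.allclose(grad_W2, num_W2, atol=1e-6))
```

The finite-difference version needs two loss evaluations per weight, while backpropagation obtains every gradient from a single forward and backward pass. That efficiency is what makes training large networks feasible.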

## Hiking Analogy

Let's use an analogy to make it crystal clear.

Imagine you are a hiker standing on a huge, foggy mountain range. Your goal is to find the absolute lowest point in the valley (this is the **minimum error**).

*   **The Mountain Range:** This is your "loss landscape." Every point on the landscape represents a specific set of weights for your neural network, and the altitude at that point is the error (how "wrong" the network is with those weights).
*   **Your Goal:** Get to the bottom of the deepest valley.

Now, let's define the roles of Backpropagation, SGD, and Adam.

---

#### 1. Backpropagation: The Compass and Slope Calculator

Backpropagation is not an optimizer. It's the **algorithm that tells you the direction and steepness of the slope where you are currently standing.**

*   **What it does:** It calculates the **gradient** of the loss function with respect to every single weight in the network.
*   **In our analogy:** It's like having a special tool that you can use at your current spot to figure out exactly which way is "downhill" and how steep the slope is in every direction. It doesn't move you; it just gives you the critical information: "To go down, you must take 3 steps east, 1 step north, and 10 steps down."

**Backpropagation is the engine that computes the gradients. It's the "how-to" for finding the slope.**

---

#### 2. SGD (Stochastic Gradient Descent): The Basic Hiking Strategy

SGD is the most basic **optimization algorithm**. It's a simple, direct strategy for using the slope information to actually move.

*   **What it does:** It takes the gradient calculated by Backpropagation and takes a step in the opposite direction of the steepest slope. The size of this step is controlled by a parameter called the **learning rate**.
*   **In our analogy:** SGD is the hiker's rule: "Calculate the slope (using Backpropagation), then take a fixed-size step directly downhill." If the learning rate is high, you take a big leap. If it's low, you take a tiny, careful shuffle.

**SGD is the "how-to" for moving, based on the slope.**

**Limitations of SGD:** This basic strategy has problems. The terrain (loss landscape) is not a simple smooth bowl. It has:
*   **Local Minima:** Small valleys that aren't the true lowest point. SGD can get stuck here.
*   **Saddle Points:** Areas that are flat in one direction but steep in another. SGD can get stuck here too.
*   **Noisy Terrain:** The path can be very erratic and zig-zag down a valley, making the journey slow.
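
The step rule itself is short. Here is a minimal sketch, assuming a toy one-parameter loss L(w) = (w - 3)^2. The helpers `grad` and `run_gd` are hypothetical, and the gradient is computed on the full loss at once, whereas SGD proper would estimate it from a random mini-batch at each step.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def run_gd(learning_rate, steps=50, w=0.0):
    """Repeatedly step opposite to the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

small = run_gd(0.01)      # tiny shuffles: still well short of 3 after 50 steps
good = run_gd(0.1)        # sensible steps: ends very close to 3
too_large = run_gd(1.1)   # giant leaps: overshoots further each step and diverges

print(small, good, too_large)
```

The same rule gives three very different outcomes depending only on the learning rate, which is why it is usually the first hyperparameter to tune.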

---

#### 3. Adam (Adaptive Moment Estimation): The Advanced, Smart Hiking Strategy

Adam is a more sophisticated **optimization algorithm**. It uses the same slope information from Backpropagation but decides *how* to move in a much smarter way.

Adam keeps track of two things as it hikes:
1.  **Momentum (the first moment):** It keeps an exponentially decaying average of past gradients to maintain a direction. If the gradients have consistently pointed east, it will keep you moving east, helping you power through small bumps and ravines (local minima).
    *   **Analogy:** It's like a bowling ball rolling downhill. Its momentum helps it push past small obstacles.

2.  **Adaptive Learning Rates (the second moment):** It keeps track of how large the gradients have been for each parameter (weight) individually. For weights that have large, noisy gradients, it takes smaller, more careful steps. For weights that have small, consistent gradients, it takes larger, more confident steps.
    *   **Analogy:** Imagine you're hiking on a narrow, steep ridge (high gradient) for one parameter, and a wide, gentle slope (low gradient) for another. Adam lets you shuffle carefully along the ridge while taking big strides on the gentle slope, all at the same time.

**Adam is the "how-to" for moving more efficiently and intelligently, based on the slope.**
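
The two moments translate directly into code. Below is a minimal NumPy sketch of the standard Adam update on a toy two-parameter loss whose gradient is 100 times steeper in one direction than the other. The loss, the starting point, and the function names are assumptions for illustration; the default `beta1`, `beta2`, and `eps` values follow common practice.

```python
import numpy as np

def loss(w):
    # Toy "narrow valley": very steep along w[0], gentle along w[1]
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([100.0 * w[0], w[1]])

def adam(w, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    w = w.astype(float).copy()
    m = np.zeros_like(w)   # first moment: running average of gradients (momentum)
    v = np.zeros_like(w)   # second moment: running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step size
    return w

w_start = np.array([1.0, 1.0])
w_one_step = adam(w_start, steps=1)
w_final = adam(w_start, steps=300)

print(w_one_step)                    # both parameters moved by about lr
print(loss(w_start), loss(w_final))  # the loss drops substantially
```

After the first step both parameters have moved by roughly `lr`, even though one gradient is 100 times larger than the other. This is the per-parameter adaptation described above: the step size is normalized by each weight's own gradient history.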

---

#### Putting It All Together

Here is the step-by-step process during one training iteration:

1.  **Forward Pass:** The network makes a prediction.
2.  **Calculate Loss:** You measure how wrong that prediction was (your "altitude").
3.  **Backpropagation:** You use the loss to calculate the gradient for every weight in the network. **(This is the compass telling you the slope).**
4.  **Optimizer (SGD or Adam) Takes Over:**
    *   **If using SGD:** The optimizer looks at the gradient and says, "Okay, let's take a step of size X directly opposite to this direction."
    *   **If using Adam:** The optimizer looks at the gradient, checks its history of momentum and past gradient sizes, and says, "Based on my momentum and the terrain here, I'll take this *specifically sized* step in this *general* downhill direction."
5.  **Update Weights:** The optimizer updates the weights. The hiker has moved to a new position.
6.  **Repeat:** This entire cycle is repeated thousands or millions of times until the hiker reaches the bottom of the valley.
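
The six steps map directly onto a training loop. The sketch below trains a single sigmoid neuron (logistic regression) with mini-batch SGD in NumPy. The synthetic data, learning rate, batch size, and helper names are assumptions for illustration; Keras and PyTorch run the same cycle internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (an assumption): 2 features, label driven by a noisy linear rule
X = rng.normal(size=(400, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(float)

w = np.zeros(2)
b = 0.0
learning_rate = 0.1
batch_size = 32

def predict(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def bce(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

initial_loss = bce(y, predict(X, w, b))

for epoch in range(20):                        # 6. repeat
    order = rng.permutation(len(X))            # shuffle so each batch is a random sample
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]
        p = predict(X_b, w, b)                 # 1. forward pass
        batch_loss = bce(y_b, p)               # 2. calculate loss
        error = p - y_b                        # 3. backpropagation (sigmoid + cross-entropy)
        grad_w = X_b.T @ error / len(idx)
        grad_b = error.mean()
        w -= learning_rate * grad_w            # 4-5. optimizer (plain SGD) updates weights
        b -= learning_rate * grad_b

final_loss = bce(y, predict(X, w, b))
accuracy = np.mean((predict(X, w, b) > 0.5) == (y == 1))
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, accuracy: {accuracy:.2f}")
```

Swapping SGD for Adam would change only the two update lines; the forward pass, loss, and gradient computation stay the same.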

## 3. TensorFlow & Keras: The Production Powerhouses 

### What is TensorFlow?
TensorFlow is an open-source machine learning framework developed by Google. It was originally designed around data flow graphs, where nodes represent mathematical operations and edges represent multi-dimensional data arrays (tensors). TensorFlow 2.x executes operations eagerly by default and can still compile Python functions into graphs with `tf.function`. It's highly scalable and designed for production environments, capable of running on various platforms from mobile to large-scale distributed systems.

### What is Keras?
Keras is a high-level API for building and training deep learning models. It began as a multi-backend library (supporting TensorFlow, Theano, and CNTK), became the standard high-level API for TensorFlow, and from Keras 3 onward can again run on several backends (TensorFlow, JAX, and PyTorch). Keras was designed for fast experimentation, allowing users to go from idea to result with the fewest possible steps, which makes deep learning much more accessible.

### Why use TensorFlow/Keras in Finance?

*   **Robustness & Scalability:** Ideal for deploying complex models in real-time trading systems or large-scale risk simulations.
*   **Comprehensive Ecosystem:** Tools like TensorBoard for visualization, TensorFlow Serving for model deployment, and TensorFlow Lite for edge devices.
*   **Large Community & Resources:** Extensive documentation, tutorials, and a massive user base mean plenty of support and pre-trained models.
*   **Ease of Use (Keras):** Quickly prototype and iterate on models for tasks like market prediction, sentiment analysis from financial news, or credit scoring.

### Pros and Cons of TensorFlow/Keras

| Aspect          | Pros                                                                                             | Cons                                                                                              |
| :-------------- | :----------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------ |
| **Ease of Use** | Keras API makes rapid prototyping very easy, clear and intuitive.                                | Lower-level TensorFlow can be complex; debugging can be less straightforward than PyTorch.       |
| **Flexibility** | Supports both high-level (Keras) and low-level control. Functional API for complex architectures. | Can feel overly opinionated or restrictive for highly experimental research compared to PyTorch. |
| **Performance** | Excellent performance on GPUs/TPUs. Static computation graph allows for strong optimizations.     | Initial overhead due to graph compilation.                                                        |
| **Deployment**  | Industry-leading tools for production deployment (TF Serving, TF Lite, TF.js).                   |                                                                                                   |
| **Community**   | Massive community, extensive documentation, and widespread industry adoption.                    |                                                                                                   |
| **Finance Use** | Market prediction, algorithmic trading, fraud detection, credit risk modeling, time series analysis. |                                                                                                   |

### How to use TensorFlow/Keras: A Simple Example (Stock Movement Prediction)

Let's build a simple Multi-Layer Perceptron (MLP) to predict the next day's stock price movement (up/down) from the previous day's features. We'll use a dummy dataset for illustration.

In [None]:
#pip install tensorflow

In [2]:
import tensorflow as tf

print("TensorFlow Version:", tf.__version__)

# 1. List all available physical GPU devices
gpus = tf.config.list_physical_devices('GPU')

if gpus:
    try:
        # 2. Configure GPU memory growth
        # This prevents TensorFlow from pre-allocating all memory on the GPU.
        # It will only allocate as much GPU memory as needed, and it grows dynamically.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        
        # You can also set a specific GPU to use if you have multiple
        # For example, to use the first GPU:
        # tf.config.set_visible_devices(gpus[0], 'GPU')

        # Print confirmation
        print(f"GPUs available: {len(gpus)}. Memory growth is enabled for all GPUs.")
        print("Details:", [gpu.name for gpu in gpus])

    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(f"Error configuring GPU: {e}")
        print("TensorFlow will likely fall back to CPU or default GPU settings.")
else:
    print("No GPU devices found. TensorFlow will run on CPU.")



# If you want to explicitly place operations on CPU or a specific GPU, you can use:
# with tf.device('/CPU:0'):
#     # Operations here will run on CPU
#     cpu_tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])
#     print("CPU Tensor:", cpu_tensor)

# if gpus:
#     with tf.device('/GPU:0'): # For the first GPU
#         # Operations here will run on GPU:0
#         gpu_tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])
#         print("GPU Tensor:", gpu_tensor)

TensorFlow Version: 2.20.0
No GPU devices found. TensorFlow will run on CPU.


### How Dropout Works

Dropout is a regularization technique used in deep learning to prevent overfitting in neural networks. Overfitting occurs when a model performs exceptionally well on the training data but struggles to generalize to new, unseen data.


1.  **During Training:**
    *   In each training iteration, dropout randomly deactivates (or "drops out") a fraction of neurons in a layer, along with their connections.
    *   This means that for a given training step, a temporary "thinned" network is created, as the dropped neurons do not contribute to the forward pass or backpropagation.
    *   The "dropout probability" (often a hyperparameter between 0.2 and 0.5, with 0.2 being a good baseline) determines the chance of a neuron being dropped. This random deactivation is performed independently for each training example and each layer.
    *   By randomly disabling neurons, the network is prevented from becoming overly reliant on specific neurons or co-adaptations between neurons. This forces the remaining neurons to learn more robust and generalized features.

2.  **During Inference (Testing):**
    *   In contrast to training, all neurons are active during testing or inference.
    *   To keep each neuron's expected output consistent between training and inference, the activations must be rescaled. In the original formulation, the outgoing weights are multiplied at inference time by the retention probability (1 minus the dropout rate). Keras and PyTorch instead use "inverted" dropout: the surviving activations are scaled up by 1 / (1 - rate) during training, so no adjustment is needed at inference.
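
The rescaling can be applied at either end. Here is a minimal NumPy sketch of the "inverted" variant, which rescales the surviving activations during training so that inference needs no adjustment. The function name `dropout` and its arguments are assumptions for illustration, not the Keras API.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout; `rate` is the probability of dropping a unit."""
    if not training or rate == 0.0:
        return activations                              # inference: all units active, no scaling
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob    # True where the unit is kept
    return activations * mask / keep_prob               # zero dropped units, scale up survivors

rng = np.random.default_rng(0)
h = np.ones((10000, 8))   # pretend hidden-layer activations

train_out = dropout(h, rate=0.3, training=True, rng=rng)
test_out = dropout(h, rate=0.3, training=False, rng=rng)

print((train_out == 0).mean())   # roughly 0.3 of units are dropped
print(train_out.mean())          # roughly 1.0: the expected activation is preserved
print(test_out.mean())           # exactly 1.0 at inference
```

Because the expected activation is the same in both modes, the layers that follow see inputs on a consistent scale whether the model is training or predicting.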

### Benefits of Dropout

*   **Prevents Overfitting:** This is the primary benefit. By randomly disabling neurons, the network cannot rely too heavily on specific connections, making it less likely to memorize the training data and improving its ability to generalize to new data.
*   **Ensemble Effect:** Each training step updates a different randomly "thinned" sub-network, and all of these sub-networks share weights. Dropout can therefore be viewed as approximately training a large ensemble of smaller networks and averaging them at inference, which leads to a more robust model and better generalization.
*   **Enhanced Data Representation:** It introduces noise during training, which can help the network learn more effective data representations.
*   **Computationally Efficient:** Dropout is relatively simple to implement and adds minimal computational overhead compared to its benefits in reducing overfitting.
*   **Works well with large networks:** It is particularly effective in deep neural networks with many layers where overfitting is a common challenge.

### Drawbacks of Dropout

*   **Longer Training Times:** Due to its stochastic nature and the effective training of multiple sub-networks, dropout can increase the number of epochs required for the model to converge, leading to longer overall training durations.
*   **Hyperparameter Tuning:** The dropout rate is a hyperparameter that requires careful tuning for optimal performance.
*   **Optimization Complexity:** The theoretical reasons why dropout works are still debated, so deciding where to apply it and how much to use remains largely empirical.

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# 1. Data Generation (Dummy Financial Data)
# Let's simulate some financial features and a target variable (stock movement: 0 = down/flat, 1 = up)
np.random.seed(42)
num_samples = 1000
num_features = 10

X = np.random.rand(num_samples, num_features) * 100  # e.g., technical indicators, volume, news sentiment scores
# Create a 'noisy' relationship for the target variable
y = (X[:, 0] * 0.5 + X[:, 1] * 0.3 - X[:, 2] * 0.2 + np.random.randn(num_samples) * 5 > 50).astype(int)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Target distribution: {np.bincount(y)}")

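# 2. Data Preprocessing
# Note: for brevity the scaler is fit on the full dataset before splitting. In a real
# project, fit it on the training split only so test-set statistics do not leak into training.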
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# 3. Model Definition (Keras Sequential API)
model = keras.Sequential([
    keras.layers.Input(shape=(num_features,)),  # Input layer
    keras.layers.Dense(64, activation='relu'),   # Hidden layer 1
    keras.layers.Dropout(0.3),                   # Dropout for regularization
    keras.layers.Dense(32, activation='relu'),   # Hidden layer 2
    keras.layers.Dense(1, activation='sigmoid')  # Output layer (sigmoid for binary classification)
])

# 4. Model Compilation
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

# 5. Model Training
print("\nTraining the Keras model...")
history = model.fit(
    X_train, y_train,
    epochs=20,          # Number of passes over the training data
    batch_size=32,      # Number of samples per gradient update
    validation_split=0.1, # Use 10% of training data for validation during training
    verbose=0           # Suppress output for cleaner presentation
)

print("Training complete.")
# To see training progress:
# pd.DataFrame(history.history).plot(figsize=(10, 7))
# import matplotlib.pyplot as plt; plt.show()

# 6. Model Evaluation
print("\nEvaluating the model...")
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

y_pred_proba = model.predict(X_test, verbose=0)
y_pred = (y_pred_proba > 0.5).astype(int)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

X shape: (1000, 10)
y shape: (1000,)
Target distribution: [865 135]
X_train shape: (800, 10), y_train shape: (800,)
X_test shape: (200, 10), y_test shape: (200,)



Training the Keras model...
Training complete.

Evaluating the model...
Test Loss: 0.1394
Test Accuracy: 0.9150

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95       173
           1       0.67      0.74      0.70        27

    accuracy                           0.92       200
   macro avg       0.81      0.84      0.83       200
weighted avg       0.92      0.92      0.92       200



### How to use TensorFlow/Keras: Another Example (LSTM Sentiment Analysis)

In [9]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

print("TensorFlow Version:", tf.__version__)

# --- Configuration for GPU (if available) ---
# This part is optional but good practice if you have a GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Set memory growth to True to avoid allocating all GPU memory at once
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"GPUs available: {len(gpus)}. Memory growth is enabled.")
        print("Details:", [gpu.name for gpu in gpus])
    except RuntimeError as e:
        print(f"Error configuring GPU: {e}")
else:
    print("No GPU devices found. TensorFlow will run on CPU.")
print("-" * 50)


# --- 1. Load and Preprocess Data ---

# Parameters for data loading and preprocessing
vocab_size = 10000  # Only consider the top `vocab_size` words
maxlen = 250        # Max length of a review in tokens (longer reviews will be truncated)
embedding_dim = 128 # Dimension of the word embeddings

print(f"Loading IMDB dataset (top {vocab_size} words)...")
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
print(f"Train samples: {len(x_train)}, Test samples: {len(x_test)}")

# Pad sequences to ensure uniform length for all reviews.
# pad_sequences defaults to padding='pre' and truncating='pre': shorter reviews get
# zeros added at the beginning, and longer reviews keep only their last `maxlen` tokens.
print(f"Padding sequences to maxlen={maxlen}...")
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print("Sequences padded.")

# Retrieve the word index mapping (word to integer rank)
word_index = imdb.get_word_index()
# imdb.load_data() shifts every word index by 3 (its default `index_from=3`) and
# reserves 0, 1, 2 for padding, start-of-sequence, and unknown words.
# Apply the same shift so this mapping matches the encoded reviews.
word_to_id = {key: (value + 3) for key, value in word_index.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2  # Unknown (out-of-vocabulary) word
word_to_id["<UNUSED>"] = 3 # Reserved index that is not assigned to any word

# Create a reverse mapping (id to word)
id_to_word = {value: key for key, value in word_to_id.items()}

# Example: Convert a preprocessed review back to words (for verification)
# print("\nExample preprocessed review (first 10 words):")
# print([id_to_word.get(i, "?") for i in x_train[0][:10]])
# print("Corresponding label:", y_train[0])
print("-" * 50)

# --- 2. Build the LSTM Model ---
print("Building the LSTM model...")
model = Sequential([
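    # Explicit input: each review is a sequence of `maxlen` integer word IDs.
    # (Keras 3, bundled with recent TensorFlow releases, deprecates Embedding's `input_length` argument.)
    keras.Input(shape=(maxlen,)),

    # Embedding layer: Converts integer-encoded words into dense vectors
    Embedding(vocab_size, embedding_dim),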
    
    # LSTM layer: Processes sequences. 'units' is the dimensionality of the output space.
    # It learns long-term dependencies in the sequence.
    LSTM(embedding_dim), # Using the same dimension as embedding for simplicity
    
    # Dropout layer: Helps prevent overfitting by randomly setting a fraction of input units to 0.
    Dropout(0.5), # 50% of neurons will be dropped during training
    
    # Dense output layer: For binary classification (positive/negative sentiment)
    # 1 unit and 'sigmoid' activation for probabilities between 0 and 1.
    Dense(1, activation='sigmoid')
])

model.summary()
print("-" * 50)

# --- 3. Compile and Train the Model ---
print("Compiling the model...")
model.compile(optimizer='adam',
              loss='binary_crossentropy', # Appropriate for binary classification
              metrics=['accuracy'])

print("Training the model (this may take a few minutes)...")
history = model.fit(x_train, y_train,
                    epochs=5,           # Number of times to iterate over the entire dataset
                    batch_size=64,      # Number of samples per gradient update
                    validation_split=0.2 # Use 20% of training data for validation
                   )
print("Model training finished.")
print("-" * 50)

# --- 4. Evaluate the Model ---
print("Evaluating the model on the test set...")
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
print("-" * 50)

# --- 5. Predict Sentiment for Raw Sentences ---
print("Demonstrating sentiment prediction for raw sentences...")

def preprocess_text(text, word_to_id_map, maxlen_val, vocab_size_val):
    """
    Tokenizes, converts words to IDs, and pads a raw text sentence.
    """
    # Lowercase, split on whitespace, and strip surrounding punctuation so that
    # tokens such as "fantastic!" match the vocabulary entry "fantastic"
    stripped = (raw.strip(".,!?;:\"'()") for raw in text.lower().split())
    words = [word for word in stripped if word]
    unk_id = word_to_id_map["<UNK>"]
    # Convert words to IDs. Unknown words, and words whose ID falls outside the
    # vocab_size_val most frequent words, are both mapped to <UNK>
    encoded_review = []
    for word in words:
        word_id = word_to_id_map.get(word, unk_id)
        encoded_review.append(word_id if word_id < vocab_size_val else unk_id)
    # Add the start-of-sequence token
    encoded_review = [word_to_id_map["<START>"]] + encoded_review
    
    # Pad the sequence
    padded_review = pad_sequences([encoded_review], maxlen=maxlen_val)
    return padded_review

def predict_sentiment(model, raw_sentence, word_to_id_map, maxlen_val, vocab_size_val):
    """
    Predicts the sentiment of a raw sentence using the trained model.
    """
    preprocessed_input = preprocess_text(raw_sentence, word_to_id_map, maxlen_val, vocab_size_val)
    prediction = model.predict(preprocessed_input)[0][0]
    
    sentiment = "Positive" if prediction >= 0.5 else "Negative"
    
    print(f"Sentence: \"{raw_sentence}\"")
    print(f"Predicted Probability (Positive): {prediction:.4f}")
    print(f"Predicted Sentiment: {sentiment}")
    print("-" * 20)

# Sample sentences
sample_sentence1 = "This movie was absolutely fantastic! I loved every single moment of it. Highly recommend."
sample_sentence2 = "Terrible film, a complete waste of time and money. I wouldn't watch it again even for free."
sample_sentence3 = "It was okay, not great, not bad. Just an average movie experience."
sample_sentence4 = "The acting was superb, but the plot was a bit weak and confusing."
sample_sentence5 = "An engaging story with brilliant performances, a must-see for everyone."

predict_sentiment(model, sample_sentence1, word_to_id, maxlen, vocab_size)
predict_sentiment(model, sample_sentence2, word_to_id, maxlen, vocab_size)
predict_sentiment(model, sample_sentence3, word_to_id, maxlen, vocab_size)
predict_sentiment(model, sample_sentence4, word_to_id, maxlen, vocab_size)
predict_sentiment(model, sample_sentence5, word_to_id, maxlen, vocab_size)

TensorFlow Version: 2.20.0
No GPU devices found. TensorFlow will run on CPU.
--------------------------------------------------
Loading IMDB dataset (top 10000 words)...
Train samples: 25000, Test samples: 25000
Padding sequences to maxlen=250...
Sequences padded.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
--------------------------------------------------
Building the LSTM model...


--------------------------------------------------
Compiling the model...
Training the model (this may take a few minutes)...
Epoch 1/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 40s 124ms/step - accuracy: 0.7537 - loss: 0.4877 - val_accuracy: 0.8376 - val_loss: 0.4420
Epoch 2/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 39s 125ms/step - accuracy: 0.8876 - loss: 0.2897 - val_accuracy: 0.8734 - val_loss: 0.3194
Epoch 3/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 39s 123ms/step - accuracy: 0.9216 - loss: 0.2123 - val_accuracy: 0.8648 - val_loss: 0.3607
Epoch 4/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 39s 125ms/step - accuracy: 0.9486 - loss: 0.1445 - val_accuracy: 0.8672 - val_loss: 0.3926
Epoch 5/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 39s 125ms/step - accuracy: 0.9648 - loss: 0.1018 - val_accuracy: 0.8522 - val_loss: 0.4353
Model training finished.
-----------------------------

### Epoch
In deep learning, an **epoch** refers to one complete pass through the entire training dataset.

Here's a breakdown of what that means:

1.  **Full Dataset Pass:** During one epoch, every training example in your dataset is fed forward through the neural network once. Backpropagation computes the gradients of the loss on those examples, and the optimizer uses them to update the network's weights.

2.  **Learning Cycles:** Neural networks typically require multiple epochs to learn the underlying patterns in the data effectively. A single pass might not be enough for the model to sufficiently adjust its internal parameters (weights and biases) to achieve good performance.

3.  **Relationship with Batch Size and Iterations:**
    *   **Batch Size:** The training dataset is usually too large to process all at once. Instead, it's divided into smaller subsets called "batches" (or "mini-batches"). The model processes one batch at a time, calculates the loss, and updates its weights.
    *   **Iteration/Step:** An "iteration" (or "step") refers to processing a single batch.
    *   **Number of Iterations per Epoch:** If you have `N` total training samples and a `batch_size` of `B`, then one epoch consists of `ceil(N / B)` iterations (or steps). When `N` is not a multiple of `B`, the final batch is simply smaller than the others.

    **Formula:** `Number of Iterations per Epoch = ceil(Total Training Samples / Batch Size)`

**Example:**
*   If you have a training dataset of 10,000 images.
*   And your `batch_size` is 100.
*   Then one epoch will involve `10,000 / 100 = 100` iterations.
*   During each iteration, 100 images are processed, and the model's weights are updated. After all 100 iterations are completed, one epoch is finished, meaning the model has seen all 10,000 images once.
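
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `steps_per_epoch` is made up for illustration, and the second call assumes that `validation_split=0.2` holds out 5,000 of the 25,000 IMDB training reviews, as in the LSTM example earlier in this section.

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of weight updates in one epoch; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)

# The worked example above: 10,000 samples in batches of 100
print(steps_per_epoch(10_000, 100))  # 100

# The IMDB run: 25,000 reviews, one fifth held out for validation, batch_size=64
val_samples = 25_000 // 5
train_samples = 25_000 - val_samples
print(train_samples, steps_per_epoch(train_samples, 64))  # 20000 313
```

The second result matches the `313/313` step counter shown in the Keras training log.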

**Why Multiple Epochs are Necessary:**
*   **Gradual Learning:** The model learns gradually. Each epoch allows the model to refine its understanding of the data, adjusting its parameters incrementally to minimize the loss function.
*   **Convergence:** Training for multiple epochs allows the model's performance (e.g., accuracy) to converge to a better solution.
*   **Underfitting vs. Overfitting:**
    *   **Too few epochs** can lead to **underfitting**, where the model hasn't learned enough from the data and performs poorly on both training and test sets.
    *   **Too many epochs** can lead to **overfitting**, where the model learns the training data too well, memorizing noise and specific examples, and thus performs poorly on new, unseen data. Techniques like dropout, early stopping, and regularization help mitigate overfitting even with many epochs.
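
To make the early-stopping idea concrete, here is a minimal, framework-free sketch. The function name `early_stopping_epoch` and the `patience` value of 2 are assumptions chosen for illustration; the validation losses are the `val_loss` values from the five-epoch IMDB training log earlier in this section.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss per epoch, copied from the training log
val_losses = [0.4420, 0.3194, 0.3607, 0.3926, 0.4353]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stop after epoch {stop_epoch}, keep weights from epoch {best_epoch}")
# stop after epoch 4, keep weights from epoch 2
```

In that log the training loss keeps falling while the validation loss rises after epoch 2, which is the overfitting pattern described above. In Keras, the `EarlyStopping` callback (with its `monitor`, `patience`, and `restore_best_weights` arguments) provides this behavior during `model.fit()`.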

In summary, an epoch is a fundamental unit in the training process of a neural network, representing a full cycle of seeing and learning from the entire training dataset.

### Discussion: TensorFlow/Keras in Finance

*   **Sentiment Analysis:** Using LSTMs or Transformers in Keras for analyzing financial news, earnings call transcripts, or social media for market sentiment.
*   **Algorithmic Trading:** Developing reinforcement learning agents with TensorFlow (e.g., using `tf-agents`) to learn optimal trading strategies.
*   **Credit Risk Modeling:** Building deep neural networks for credit default prediction, processing structured and unstructured data (e.g., loan applications, customer behavior data).
*   **Fraud Detection:** Identifying anomalous transactions or patterns using autoencoders or other unsupervised/semi-supervised DL models.

Keras simplifies the process, allowing finance professionals to focus on feature engineering and model architecture rather than low-level implementation details. TensorFlow then provides the robust backend for deployment.

## 4. PyTorch: The Research Favorite (45-50 minutes)

### What is PyTorch?
PyTorch is an open-source machine learning library originally developed by Facebook's AI Research lab (FAIR, now part of Meta AI) and today governed by the PyTorch Foundation. It's renowned for its flexibility, dynamic computation graph, and Pythonic interface, which makes it a favorite among researchers and developers who need fine-grained control over their models and training processes.

### Why use PyTorch in Finance?

*   **Flexibility & Debugging:** The dynamic graph allows for easier debugging (like regular Python code) and more experimental model architectures, which is crucial when exploring novel financial hypotheses.
*   **Pythonic Design:** Integrates seamlessly with the rest of the Python data science stack (NumPy, Pandas, Scikit-learn), making it very intuitive for Python developers.
*   **Research & Innovation:** Many cutting-edge research papers relevant to finance (e.g., in quantitative finance or NLP for market analysis) are implemented first in PyTorch.
*   **Growing Ecosystem:** While historically more research-focused, PyTorch's ecosystem for deployment (TorchScript, PyTorch Mobile, PyTorch Lightning for training abstraction) has matured significantly.

### Pros and Cons of PyTorch

| Aspect          | Pros                                                                                             | Cons                                                                                              |
| :-------------- | :----------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------ |
| **Ease of Use** | Highly Pythonic, intuitive API. Dynamic graph makes debugging and control straightforward.       | Requires more manual control over the training loop compared to Keras's `model.fit()`.           |
| **Flexibility** | Extremely flexible, ideal for custom layers, loss functions, and research-oriented architectures. | With great power comes great responsibility; more boilerplate code for standard tasks.           |
| **Performance** | Excellent performance on GPUs. JIT compilation (TorchScript) for optimization.                   | May require more careful optimization for deployment compared to TensorFlow's static graph.      |
| **Deployment**  | `TorchScript` enables production deployment.                                                     | The deployment ecosystem is slightly less mature than TensorFlow's.                               |
| **Community**   | Strong and rapidly growing community, especially in academia and research.                       | Historically smaller than TensorFlow, but catching up quickly.                                   |
| **Finance Use** | Developing novel trading strategies, complex risk models, advanced NLP for financial text, generative models for synthetic data. |                                                                                                   |

### How to use PyTorch: A Simple Example (Stock Price Prediction)

Let's re-implement the same stock price prediction problem using PyTorch. As before, it is framed as a binary classification task on synthetic (dummy) data.

In [None]:
#pip install torch

A PyTorch tensor is the fundamental data structure in PyTorch and the primary way to store and manipulate numerical data. It is essentially a multi-dimensional array, similar to a NumPy array, with several additional features that make it suitable for deep learning. Its key characteristics are:

1.  **Multi-dimensional Array:** At its core, a PyTorch tensor is an array that can have zero, one, or more dimensions (e.g., a scalar, a vector, a matrix, or higher-order tensors). This allows it to represent various types of data, from single numbers to images, audio, or entire batches of data.

2.  **GPU Acceleration:** One of the most significant advantages of PyTorch tensors is their ability to leverage Graphics Processing Units (GPUs) for computations. This allows for massive parallelization and significantly speeds up the training and inference of deep learning models, which often involve large-scale matrix operations.

3.  **Automatic Differentiation (Autograd):** PyTorch tensors are at the heart of PyTorch's automatic differentiation engine, called Autograd. When you perform operations on tensors that have `requires_grad=True`, PyTorch records those operations in a computational graph. Calling `.backward()` on a scalar result then traverses this graph to compute the gradients (derivatives) of that result with respect to those tensors. This feature is crucial for training neural networks using backpropagation.

4.  **Data Types:** Tensors can hold data of various types, such as floating-point numbers (e.g., `torch.float32`, `torch.float64`), integers (e.g., `torch.int32`, `torch.int64`), and booleans. The data type affects the precision and memory usage of the tensor.

5.  **Device Agnostic:** Tensors can reside on either the CPU or the GPU. You can easily move tensors between devices using methods like `.to('cuda')` or `.to('cpu')`.

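6.  **Immutable Shape (but mutable data):** Once a tensor is created, its shape (number of dimensions and size of each dimension) is generally fixed. Operations such as `reshape` or `view` return a new tensor (which may share the same underlying data) rather than changing the original tensor's shape. The numerical values stored within a tensor, however, are mutable and can be changed in place.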

7.  **Interoperability with NumPy:** PyTorch tensors can be easily converted to and from NumPy arrays, facilitating integration with the broader Python scientific computing ecosystem.

**In summary:** A PyTorch tensor is the core data container for all computations in PyTorch, providing a flexible, GPU-accelerated, and automatically differentiable multi-dimensional array structure essential for building and training deep learning models.
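
The characteristics listed above can be seen in a few lines. This is a minimal sketch with made-up numbers: the `returns` matrix and `weights` vector are hypothetical and stand in for a tiny table of asset returns and portfolio weights.

```python
import torch

# A 2x3 tensor of 32-bit floats (two periods, three assets; dummy values)
returns = torch.tensor([[0.01, -0.02, 0.03],
                        [0.00,  0.01, -0.01]], dtype=torch.float32)
print(returns.shape, returns.dtype)

# Device placement: use the GPU when one is available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
returns = returns.to(device)

# Autograd: operations on `weights` are recorded because requires_grad=True
weights = torch.tensor([0.5, 0.3, 0.2], requires_grad=True, device=device)
total_return = (returns @ weights).sum()  # scalar output
total_return.backward()                   # populates weights.grad
print(weights.grad)  # gradient equals the column sums of `returns`

# NumPy interoperability (a tensor must be on the CPU before conversion)
as_numpy = returns.cpu().numpy()
back_to_tensor = torch.from_numpy(as_numpy)
print(type(as_numpy).__name__, back_to_tensor.shape)
```

Because `total_return` is linear in `weights`, its gradient is simply the sum of each asset's returns, which makes the result easy to verify by hand.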

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# 1. Data Generation (Dummy Financial Data) - Same as before
np.random.seed(42)
num_samples = 1000
num_features = 10

X_np = np.random.rand(num_samples, num_features) * 100
y_np = (X_np[:, 0] * 0.5 + X_np[:, 1] * 0.3 - X_np[:, 2] * 0.2 + np.random.randn(num_samples) * 5 > 50).astype(int)

# 2. Data Preprocessing
scaler = StandardScaler()
X_scaled_np = scaler.fit_transform(X_np)

X_train_np, X_test_np, y_train_np, y_test_np = train_test_split(X_scaled_np, y_np, test_size=0.2, random_state=42, stratify=y_np)

# Convert NumPy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train_np, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train_np, dtype=torch.float32).unsqueeze(1) # Add a dimension for binary output
X_test_tensor = torch.tensor(X_test_np, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test_np, dtype=torch.float32).unsqueeze(1)

print(f"X_train_tensor shape: {X_train_tensor.shape}, y_train_tensor shape: {y_train_tensor.shape}")

# Create DataLoader for batching
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# 3. Model Definition (PyTorch nn.Module)
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.relu1 = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_size2, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.fc3(out)
        out = self.sigmoid(out)
        return out

input_size = num_features
hidden_size1 = 64
hidden_size2 = 32
output_size = 1

model_pt = MLP(input_size, hidden_size1, hidden_size2, output_size)

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_pt.to(device)

print(model_pt)

# 4. Loss Function and Optimizer
criterion = nn.BCELoss() # Binary Cross-Entropy Loss
optimizer = optim.Adam(model_pt.parameters(), lr=0.001)

# 5. Model Training Loop (Manual)
epochs = 20
print("\nTraining the PyTorch model...")

for epoch in range(epochs):
    model_pt.train() # Set model to training mode
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad() # Zero the parameter gradients
        outputs = model_pt(inputs)
        loss = criterion(outputs, labels)
        loss.backward()       # Backpropagation
        optimizer.step()      # Update weights
        running_loss += loss.item() * inputs.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)
    if (epoch + 1) % 5 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}')

print("Training complete.")

# 6. Model Evaluation
print("\nEvaluating the model...")
model_pt.eval() # Set model to evaluation mode
all_preds = []
all_labels = []

with torch.no_grad(): # Disable gradient calculations
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model_pt(inputs)
        # squeeze(1) removes only the output dimension, so a final batch of size 1 stays 1-D
        predicted = (outputs > 0.5).squeeze(1).cpu().numpy()
        all_preds.extend(predicted)
        all_labels.extend(labels.squeeze(1).cpu().numpy())

accuracy = accuracy_score(all_labels, all_preds)
print(f"Test Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(all_labels, all_preds))

X_train_tensor shape: torch.Size([800, 10]), y_train_tensor shape: torch.Size([800, 1])
MLP(
  (fc1): Linear(in_features=10, out_features=64, bias=True)
  (relu1): ReLU()
  (dropout): Dropout(p=0.3, inplace=False)
  (fc2): Linear(in_features=64, out_features=32, bias=True)
  (relu2): ReLU()
  (fc3): Linear(in_features=32, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

Training the PyTorch model...
Epoch [5/20], Loss: 0.2587
Epoch [10/20], Loss: 0.1537
Epoch [15/20], Loss: 0.1265
Epoch [20/20], Loss: 0.1103
Training complete.

Evaluating the model...
Test Accuracy: 0.9450

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.98      0.97       173
         1.0       0.83      0.74      0.78        27

    accuracy                           0.94       200
   macro avg       0.90      0.86      0.88       200
weighted avg       0.94      0.94      0.94       200




## BERT and FinBERT
BERT and FinBERT are both powerful language models, but they serve different purposes due to their training data and specialization.

### BERT (Bidirectional Encoder Representations from Transformers)

**What is BERT?**
BERT is an open-source language representation model for Natural Language Processing (NLP), released by Google in 2018. It revolutionized the field of NLP by introducing a novel approach to pre-training language representations.

**Key Characteristics of BERT:**
*   **Transformer Architecture:** BERT is built upon the Transformer architecture, specifically using only the encoder part. This architecture employs a self-attention mechanism that allows it to consider the entire context of a word in a sentence simultaneously, rather than processing text sequentially (left-to-right or right-to-left). This "bidirectional" understanding is a key differentiator.
*   **Pre-training:** BERT is pre-trained on massive amounts of unlabeled text, namely English Wikipedia and the BooksCorpus (about 3.3 billion words combined). This self-supervised pre-training allows it to learn deep contextual representations of words and complex language patterns.
*   **Two Training Strategies:** During pre-training, BERT uses two main tasks:
    *   **Masked Language Modeling (MLM):** Randomly masks a percentage of the tokens in a sentence (about 15% in the original BERT setup) and then tries to predict the masked tokens based on their context.
    *   **Next Sentence Prediction (NSP):** Predicts whether two sentences follow each other in the original text.
*   **Fine-tuning:** After pre-training, BERT can be fine-tuned on smaller, labeled datasets for specific NLP tasks such as sentiment analysis, question answering, named entity recognition, and text classification. At the time of its release, this approach achieved state-of-the-art results on many NLP benchmarks.
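
The masked language modeling task described above can be illustrated with a toy sketch. It makes two simplifying assumptions: it splits on whitespace, whereas BERT works on WordPiece subword tokens, and it always substitutes `[MASK]`, whereas BERT sometimes keeps or randomly replaces a selected token. The function name `mask_tokens` and the example sentence are made up for illustration.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a {position: original_token} dict
    holding the targets the model would be trained to predict.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model must recover
        masked[pos] = "[MASK]"
    return masked, targets

sentence = "the central bank raised interest rates by fifty basis points".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```

During pre-training, the loss is computed only at the masked positions, so the model has to use the words on both sides of each gap to recover the original token.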

### FinBERT (Financial Bidirectional Encoder Representations from Transformers)

**What is FinBERT?**
FinBERT is a specialized variant of the BERT model, tailored for financial sentiment analysis and other NLP tasks within the financial domain. It is built in two steps: a pre-trained general BERT model is first trained further on a large corpus of financial text, and the result is then fine-tuned on labeled financial sentences for sentiment classification.

**Key Characteristics and Why it's Important:**
*   **Domain-Specific Specialization:** While general BERT models are powerful, financial language has unique terminology, jargon, and context that can make general models less accurate for sentiment analysis. FinBERT addresses this by focusing on finance.
*   **Training Data:** FinBERT is further pre-trained on a large financial corpus (a financial subset of the Reuters TRC2 dataset) and then fine-tuned for sentiment classification on the Financial PhraseBank, a much smaller set of labeled sentences from financial news. This additional training allows it to better understand and interpret financial jargon and the nuances of financial sentiment.
*   **Improved Accuracy in Finance:** For tasks like classifying financial text as positive, negative, or neutral, FinBERT typically outperforms general-purpose BERT models because it has learned domain-specific contextual information. For example, it can understand that "The company's profits are up 10%" is positive, while "The stock price is down 5%" is negative.
*   **Use Cases:** FinBERT is highly valuable for analyzing the sentiment of financial news articles, earnings reports, market updates, regulatory alerts, and social media posts related to financial markets, assisting in investment strategies and risk management.
*   **Output:** FinBERT typically provides softmax outputs for three labels: positive, negative, or neutral, indicating the probability of each sentiment.

In essence, FinBERT takes the strong language understanding capabilities of BERT and refines them with domain-specific knowledge, making it a powerful tool for financial text analysis.

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import numpy as np

# --- 1. Load FinBERT Model and Tokenizer ---
# We'll use the 'ProsusAI/finbert' model from Hugging Face
print("Loading FinBERT tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

# Ensure the model is in evaluation mode (important for consistent predictions)
model.eval()

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"FinBERT model loaded and moved to: {device}")
print("-" * 50)

# --- 2. Prepare Sample Financial Sentences ---
# These are raw input sentences, just like you'd get from online data.
sample_sentences = [
    "The company announced strong quarterly earnings, exceeding analyst expectations.", # Positive
    "Despite record sales, the stock price plummeted due to market uncertainty.",     # Negative/Mixed
    "Analysts maintain a 'neutral' rating on the company's shares.",                  # Neutral
    "Inflation concerns are growing as commodity prices continue to rise.",           # Negative
    "A new strategic partnership is expected to boost revenue significantly.",        # Positive
    "The board approved a new share buyback program, delighting investors.",          # Positive
    "Company X filed for bankruptcy today.",                                          # Negative
    "Market shows no significant movement today.",                                    # Neutral
]

# --- 3. Process and Predict Sentiment for Each Sentence ---
print("Analyzing sentiment for sample sentences:")

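# Read the label order from the model config rather than hard-coding it
# (expected order for this model: positive, negative, neutral)
labels = [model.config.id2label[i].lower() for i in range(model.config.num_labels)]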

results = []
for sentence in sample_sentences:
    # Tokenize the sentence
    # `return_tensors="pt"` ensures PyTorch tensors are returned
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Move inputs to the same device as the model
    inputs = {key: val.to(device) for key, val in inputs.items()}

    # Perform inference (forward pass)
    with torch.no_grad(): # Disable gradient calculation for inference to save memory and speed up
        outputs = model(**inputs)

    # Get the logits (raw prediction scores)
    logits = outputs.logits

    # Apply softmax to convert logits to probabilities
    probabilities = torch.softmax(logits, dim=1).cpu().numpy()[0] # Move to CPU and convert to numpy

    # Get the predicted sentiment (index of the highest probability)
    predicted_class_id = np.argmax(probabilities)
    predicted_sentiment = labels[predicted_class_id]
    
    # Store results
    results.append({
        "sentence": sentence,
        "sentiment": predicted_sentiment,
        "probabilities": {label: prob for label, prob in zip(labels, probabilities)}
    })

# --- 4. Display Results ---
print("-" * 50)
for res in results:
    print(f"Sentence: \"{res['sentence']}\"")
    print(f"  Predicted Sentiment: {res['sentiment'].upper()}")
    print(f"  Probabilities: P(Positive)={res['probabilities']['positive']:.4f}, "
          f"P(Negative)={res['probabilities']['negative']:.4f}, "
          f"P(Neutral)={res['probabilities']['neutral']:.4f}")
    print("-" * 20)

print("FinBERT sentiment analysis complete.")

Loading FinBERT tokenizer and model...



FinBERT model loaded and moved to: cpu
--------------------------------------------------
Analyzing sentiment for sample sentences:
--------------------------------------------------
Sentence: "The company announced strong quarterly earnings, exceeding analyst expectations."
  Predicted Sentiment: POSITIVE
  Probabilities: P(Positive)=0.9531, P(Negative)=0.0234, P(Neutral)=0.0235
--------------------
Sentence: "Despite record sales, the stock price plummeted due to market uncertainty."
  Predicted Sentiment: NEGATIVE
  Probabilities: P(Positive)=0.0087, P(Negative)=0.9722, P(Neutral)=0.0191
--------------------
Sentence: "Analysts maintain a 'neutral' rating on the company's shares."
  Predicted Sentiment: NEUTRAL
  Probabilities: P(Positive)=0.0349, P(Negative)=0.4024, P(Neutral)=0.5627
--------------------
Sentence: "Inflation concerns are growing as commodity prices continue to rise."
  Predicted Sentiment: POSITIVE
  Probabilities: P(Positive)=0.8338, P(Negative)=0.1179, P(Neutral)

### Discussion: PyTorch in Finance

*   **Quantitative Research:** Developing new models for option pricing (e.g., using neural networks to approximate Black-Scholes or calibrate implied volatility surfaces).
*   **Generative Models:** Using GANs (Generative Adversarial Networks) or VAEs (Variational Autoencoders) to generate synthetic financial data for backtesting or privacy-preserving data sharing.
*   **Reinforcement Learning for Trading:** Building custom RL environments and agents with PyTorch for complex trading simulations, leveraging its flexibility for complex state and action spaces.
*   **Natural Language Processing:** State-of-the-art NLP models (e.g., Transformers from Hugging Face, often built on PyTorch) for processing earnings calls, analyst reports, and news for alpha generation.

PyTorch offers granular control, which is highly valued when you need to deviate from standard architectures or delve deep into debugging, making it a powerful tool for financial innovation and research.

## 5. JAX: The High-Performance Numerics and Research Tool (20-25 minutes)

### What is JAX?
JAX is a high-performance numerical computing library from Google that combines three key features:
1.  **Automatic Differentiation:** Capable of differentiating native Python and NumPy functions.
2.  **JIT Compilation (XLA):** Compiles Python and NumPy code into optimized, high-performance routines for CPUs, GPUs, and TPUs using XLA (Accelerated Linear Algebra).
3.  **Composability:** Its transformations (`grad`, `jit`, `vmap`, `pmap`) can be arbitrarily composed.

JAX's paradigm is functional programming – functions are pure, and state is handled explicitly, which leads to highly reproducible and performant code. It's not a full-fledged deep learning framework like TensorFlow or PyTorch, but rather a powerful toolkit upon which DL frameworks like Flax and Haiku are built.

### Why use JAX in Finance?

*   **High-Performance Numerical Computing:** Crucial for computationally intensive tasks in quantitative finance, such as Monte Carlo simulations, complex optimization problems, or large-scale calibration of financial models.
*   **Automatic Differentiation of Arbitrary Functions:** Extremely useful for calculating sensitivities (Greeks in options pricing), risk measures, or gradients for custom optimization algorithms without manual derivation.
*   **Functional Programming:** Encourages writing more robust, testable, and parallelizable code, which is important for financial models where correctness is paramount.
*   **Research & Custom Models:** For researchers developing novel financial models or optimization techniques that go beyond standard neural network architectures.
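
To see what "calculating sensitivities" means numerically, the sketch below prices a European call under Black-Scholes in plain Python and compares the closed-form Delta, `N(d1)`, with a central finite-difference estimate. All inputs are hypothetical, and the volatility of 0.2 in particular is an assumed value. Automatic differentiation targets the same derivative but does not require choosing a step size `h`.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def d1_term(S, K, T, r, sigma):
    return (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call option."""
    d1 = d1_term(S, K, T, r, sigma)
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def bs_call_delta(S, K, T, r, sigma):
    """Closed-form Delta: derivative of the call price with respect to S."""
    return norm_cdf(d1_term(S, K, T, r, sigma))

# Hypothetical inputs (sigma = 0.2 is an assumption for illustration)
S, K, T, r, sigma = 100.0, 105.0, 1.0, 0.05, 0.2

h = 1e-4  # bump size for the central difference
fd_delta = (bs_call(S + h, K, T, r, sigma) - bs_call(S - h, K, T, r, sigma)) / (2 * h)
analytic_delta = bs_call_delta(S, K, T, r, sigma)

print(f"finite-difference Delta: {fd_delta:.6f}")
print(f"closed-form Delta:       {analytic_delta:.6f}")
```

Finite differences need two extra pricing calls per input and a well-chosen `h`. With JAX, applying `jax.grad` to a pricing function written with `jax.numpy` yields the derivative with respect to its first argument directly.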

### Pros and Cons of JAX

| Aspect          | Pros                                                                                             | Cons                                                                                              |
| :-------------- | :----------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------ |
| **Ease of Use** | Simple API for core transformations (`grad`, `jit`). Feels like NumPy.                          | Learning curve for functional programming paradigm, explicit state management. Less 'plug-and-play' for standard DL. |
| **Flexibility** | Extremely flexible for custom numerical algorithms, AD, and high-performance scientific computing. | Not a full-fledged DL framework; requires building more from scratch or using frameworks like Flax/Haiku. |
| **Performance** | Strong performance on accelerators (GPUs, TPUs) thanks to XLA compilation.                     | Initial compilation time for JITted functions.                                                    |
| **Deployment**  | Compiled JAX functions can be integrated into other systems.                                     | Primarily a research/development tool; the deployment path is less direct than with TF or PyTorch. |
| **Community**   | Smaller but highly active community, primarily in research and high-performance computing.      | Less industry adoption than TF/PyTorch for general DL tasks.                                      |
| **Finance Use** | Option pricing (Greeks), risk management (sensitivities), complex optimization, Monte Carlo simulations, Bayesian inference. |                                                                                                   |

### How to use JAX: A Simple Example (Option Pricing Greeks)

Let's demonstrate JAX's automatic differentiation by computing the Black-Scholes call price and two of its Greeks (Delta and Rho) with minimal effort. The example also uses `vmap` to price several options in a single call.

In [None]:
pip install jax

In [8]:
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap
from jax.scipy.stats import norm

# 1. Define the Black-Scholes Call Option Price function
# S: Stock price
# K: Strike price
# T: Time to expiration (years)
# r: Risk-free rate
# sigma: Volatility

@jit # JIT compile for performance
def black_scholes_call(S, K, T, r, sigma):
    d1 = (jnp.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * jnp.sqrt(T))
    d2 = d1 - sigma * jnp.sqrt(T)
    call_price = S * norm.cdf(d1) - K * jnp.exp(-r * T) * norm.cdf(d2)
    return call_price

# 2. Define parameters
S_val = 100.0  # Stock price
K_val = 105.0  # Strike price
T_val = 1.0    # Time to expiration (1 year)
r_val = 0.05   # Risk-free rate (5%)
sigma_val = 0.20 # Volatility (20%)

# 3. Calculate the call option price
call_price = black_scholes_call(S_val, K_val, T_val, r_val, sigma_val)
print(f"Black-Scholes Call Price: {call_price:.4f}")

# 4. Calculate Delta (gradient with respect to Stock Price S)
# `grad` returns a function that computes the gradient of `black_scholes_call`
# with respect to its first argument (index 0 by default, which here is S).
delta_func = grad(black_scholes_call)
delta_val = delta_func(S_val, K_val, T_val, r_val, sigma_val)

print(f"Black-Scholes Delta (d(Call)/dS): {delta_val:.4f}")

# 5. Calculate Rho (gradient with respect to Risk-free rate r)
# To compute the gradient with respect to a different argument, specify `argnums`.
# r is the 4th argument (index 3)
rho_func = grad(black_scholes_call, argnums=3)
rho_val = rho_func(S_val, K_val, T_val, r_val, sigma_val)

print(f"Black-Scholes Rho (d(Call)/dr): {rho_val:.4f}")

# Example of vmap (vectorization) - processing multiple options efficiently
S_multiple = jnp.array([90.0, 100.0, 110.0])
K_multiple = jnp.array([100.0, 100.0, 100.0])
T_multiple = jnp.array([0.5, 1.0, 1.5])

# We want to map over S, K, T, keeping r and sigma constant
vmap_bs_call = vmap(black_scholes_call, in_axes=(0, 0, 0, None, None))
prices_multiple = vmap_bs_call(S_multiple, K_multiple, T_multiple, r_val, sigma_val)
print(f"\nPrices for multiple options:\n{prices_multiple}")

# And we can vmap the delta calculation too!
vmap_delta_func = vmap(grad(black_scholes_call), in_axes=(0, 0, 0, None, None))
deltas_multiple = vmap_delta_func(S_multiple, K_multiple, T_multiple, r_val, sigma_val)
print(f"Deltas for multiple options:\n{deltas_multiple}")

Black-Scholes Call Price: 8.0214
Black-Scholes Delta (d(Call)/dS): 0.5422
Black-Scholes Rho (d(Call)/dr): 46.2015

Prices for multiple options:
[ 2.3494284 10.450576  20.774443 ]
Deltas for multiple options:
[0.30940998 0.6368306  0.79325354]
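
The same approach extends to other Greeks. The sketch below is an addition to the lecture's example: it obtains Gamma by applying `grad` twice and Delta and Vega together by passing a tuple to `argnums`, then compares the results against the textbook closed-form expressions. The pricing function is repeated so the block runs on its own.

```python
import jax.numpy as jnp
from jax import grad
from jax.scipy.stats import norm

def black_scholes_call(S, K, T, r, sigma):
    d1 = (jnp.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * jnp.sqrt(T))
    d2 = d1 - sigma * jnp.sqrt(T)
    return S * norm.cdf(d1) - K * jnp.exp(-r * T) * norm.cdf(d2)

args = (100.0, 105.0, 1.0, 0.05, 0.20)  # S, K, T, r, sigma

# Gamma: second derivative with respect to S (grad applied twice).
gamma_ad = grad(grad(black_scholes_call))(*args)

# Delta and Vega in one call: argnums accepts a tuple of argument indices.
delta_ad, vega_ad = grad(black_scholes_call, argnums=(0, 4))(*args)

# Closed-form Gamma and Vega for comparison.
S, K, T, r, sigma = args
d1 = (jnp.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * jnp.sqrt(T))
gamma_cf = norm.pdf(d1) / (S * sigma * jnp.sqrt(T))
vega_cf = S * norm.pdf(d1) * jnp.sqrt(T)

print("Gamma (AD vs closed form):", float(gamma_ad), float(gamma_cf))
print("Vega  (AD vs closed form):", float(vega_ad), float(vega_cf))
```

Agreement between the two columns is a useful sanity check before relying on automatic differentiation for models that have no closed-form Greeks.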


### Discussion: JAX in Finance

*   **Risk Management:** Calculating accurate and fast sensitivities (Greeks for options, Duration/Convexity for bonds) for large portfolios.
*   **Calibration of Models:** Optimizing parameters of complex financial models (e.g., stochastic volatility models) by using JAX for efficient gradient-based optimization.
*   **Quantitative Trading Strategies:** Implementing complex custom indicators or optimization routines that require fast computations and derivatives.
*   **Bayesian Inference:** For advanced statistical modeling in finance, JAX is a powerful backend for probabilistic programming libraries that need automatic differentiation (e.g., NumPyro).

JAX empowers quants and researchers to build highly efficient, custom numerical algorithms, addressing the core mathematical challenges in finance with hardware-accelerated performance.
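
As a small illustration of gradient-based calibration, the sketch below recovers an implied volatility with Newton's method, using `grad` to supply Vega. It is the simplest one-parameter case and only an assumption about how such a routine might look; calibrating a stochastic volatility model involves more parameters and a proper optimizer. The helper `implied_vol`, the starting guess, and the iteration count are hypothetical choices.

```python
import jax.numpy as jnp
from jax import grad
from jax.scipy.stats import norm

def black_scholes_call(S, K, T, r, sigma):
    d1 = (jnp.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * jnp.sqrt(T))
    d2 = d1 - sigma * jnp.sqrt(T)
    return S * norm.cdf(d1) - K * jnp.exp(-r * T) * norm.cdf(d2)

def implied_vol(target_price, S, K, T, r, sigma_init=0.5, n_iter=20):
    # Vega (d price / d sigma) comes from automatic differentiation.
    vega = grad(black_scholes_call, argnums=4)
    sigma = sigma_init
    for _ in range(n_iter):
        diff = black_scholes_call(S, K, T, r, sigma) - target_price
        sigma = sigma - diff / vega(S, K, T, r, sigma)  # Newton update
    return sigma

# Generate a "market" price from a known volatility, then recover it.
market_price = black_scholes_call(100.0, 105.0, 1.0, 0.05, 0.20)
iv = implied_vol(market_price, 100.0, 105.0, 1.0, 0.05)
print("Recovered implied volatility:", float(iv))
```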

## 6. Comparison and Conclusion

Let's summarize the key characteristics of these powerful deep learning packages:

| Feature             | TensorFlow/Keras                                   | PyTorch                                            | JAX                                                |
| :------------------ | :------------------------------------------------- | :------------------------------------------------- | :------------------------------------------------- |
| **Primary Focus**   | Production, Scalability, General-purpose DL        | Research, Flexibility, Pythonic Interface          | High-performance Numerics, AD, Research, Functional |
| **Graph Type**      | Eager by default (TF 2.x); static graphs via `tf.function` | Dynamic (define-by-run)                            | Functional transformations, JIT compilation         |
| **Ease of Use**     | Keras: Very high; TF low-level: moderate-low       | Moderate-high (Pythonic, but training loops are written by hand) | Moderate (NumPy-like, but functional paradigm)      |
| **Debugging**       | Straightforward in eager mode; harder inside `tf.function` graphs | Excellent (standard Python debugger)               | Good (pure functions, but JIT tracing can obscure errors) |
| **Deployment**      | Excellent (TF Serving, TF Lite, TF.js)             | Good (TorchScript, PyTorch Mobile)                 | Less direct (functions can be exported)            |
| **Community/Ecosystem** | Massive, mature, industry-standard               | Large, rapidly growing, strong in academia         | Smaller, highly technical, growing                 |
| **Best For**        | Large-scale deployments, robust applications, rapid prototyping (Keras).       | Cutting-edge research, custom architectures, complex modeling, flexible experimentation. | Highly optimized numerical tasks, complex optimization, advanced quantitative models.       |

### When to choose which framework in Finance:

*   **TensorFlow/Keras:**
    *   You need to deploy models into production systems with high reliability and scalability (e.g., real-time fraud detection, automated trading).
    *   You prefer a high-level API for rapid prototyping and have standard deep learning tasks (e.g., classifying market sentiment, predicting stock movements with standard LSTMs).
    *   Your team is already familiar with the TensorFlow ecosystem.

*   **PyTorch:**
    *   You are performing cutting-edge financial research, experimenting with novel model architectures, or require deep customization (e.g., new generative models for financial data, complex reinforcement learning agents).
    *   You value a more Pythonic interface and easier debugging capabilities.
    *   You're building complex NLP models for financial text analysis, often leveraging the Hugging Face ecosystem.

*   **JAX:**
    *   You need extreme performance for numerical computations, such as Monte Carlo simulations, complex optimization, or high-dimensional calibration problems.
    *   You require automatic differentiation for custom functions, especially for calculating sensitivities (Greeks) or gradients of bespoke financial models.
    *   You are comfortable with a functional programming style and building components from a lower level, or working with libraries built on JAX (e.g., Flax, Haiku, NumPyro).

### Final Thoughts

The deep learning landscape is dynamic, with frameworks constantly evolving and learning from each other. TensorFlow has embraced Keras as its high-level API and incorporated dynamic execution. PyTorch has improved its production deployment capabilities with TorchScript. JAX, while distinct, is influencing the design of other libraries.

As Master of Finance students, your goal isn't necessarily to master all of them, but to understand their strengths and weaknesses so you can pick the *right tool for the job*. The ability to articulate *why* you chose a particular framework for a specific financial problem demonstrates a deeper understanding of both quantitative methods and practical implementation.

I encourage you to experiment with these libraries. Start with Keras for ease, then explore PyTorch for flexibility, and consider JAX when you face computationally intensive numerical challenges or need precise gradient control.

Thank you!



### Fully Connected Network (FCN) - (also known as Multi-Layer Perceptron / MLP)

A **Fully Connected Network** is a type of artificial neural network where neurons are organized into layers, and every neuron in one layer is connected to every neuron in the subsequent layer. There are no connections between neurons within the same layer or backward connections.

**Key Characteristics:**
*   **Architecture:** Consists of an input layer, one or more hidden layers (dense layers), and an output layer.
*   **Data Processing:** Each input is treated independently. Data flows in one direction, from the input layer through the hidden layers to the output layer (feedforward).
*   **Input Type:** Best suited for **structured, independent data** where the order of features does not matter, or features are processed simultaneously.
*   **Memory:** Has **no inherent memory** of past inputs. Each prediction is based solely on the current input.
*   **Parameter Sharing:** Each connection has its own weight; weights are *not* shared across different parts of the input (unlike in CNNs or RNNs). All connections from layer `L-1` to layer `L` are stored in one weight matrix, and that same matrix is applied to every input sample.
*   **Output:** Typically outputs a fixed-size vector or a single value for classification or regression.
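
The characteristics above can be sketched as a forward pass in plain NumPy. The layer sizes, the random (untrained) weights, and the credit-scoring framing are illustrative assumptions; a real model would be trained with one of the frameworks covered in this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(layer_sizes, rng):
    # One (weight matrix, bias vector) pair per pair of consecutive layers.
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def mlp_forward(params, x):
    # Hidden layers: affine map followed by ReLU.
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    # Output layer: affine map followed by a sigmoid, giving a value in (0, 1).
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

params = init_mlp([8, 16, 16, 1], rng)   # 8 input features, 1 output
batch = rng.normal(size=(4, 8))          # 4 hypothetical borrowers
probs = mlp_forward(params, batch)
print(probs.ravel())
```

Each row of `batch` is processed independently: scoring the first borrower alone gives the same value as scoring it inside the batch, which is what "no inherent memory" means in practice.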

**Pros:**
*   **Simplicity:** Conceptually straightforward and easier to implement for basic tasks.
*   **Versatility:** Can learn highly complex, non-linear relationships within the data.
*   **Efficiency:** For fixed-size, independent inputs, FCNs can be very efficient to train and deploy.
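*   **Universal Approximator:** With enough hidden neurons (even in a single hidden layer) and a suitable non-linear activation, they can approximate any continuous function on a bounded domain to arbitrary accuracy.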

**Cons:**
*   **Inefficient for Sequential Data:** Cannot naturally handle sequences or capture temporal dependencies. Treating sequence elements independently loses crucial information.
*   **Fixed Input Size:** Typically requires a fixed-size input vector, which can be problematic for variable-length sequences.
*   **Parameter Overload:** Can have a very large number of parameters if the input dimension is high, leading to overfitting with limited data.

**Common Use Cases (in Finance):**
*   **Credit Scoring:** Predicting loan default based on a fixed set of financial and demographic features.
*   **Fraud Detection:** Identifying fraudulent transactions based on a snapshot of transaction details.
*   **Equity Research (Static):** Classifying stocks based on fundamental metrics and technical indicators at a specific point in time.
*   **Cross-Sectional Alpha Factors:** Predicting stock returns based on current company characteristics relative to others.

---

### Recurrent Neural Network (RNN)

A **Recurrent Neural Network** is a type of neural network designed to recognize patterns in sequential data. Unlike FCNs, RNNs have internal memory, allowing them to use information from previous inputs in the sequence to influence the processing of the current input.

**Key Characteristics:**
*   **Architecture:** Features connections that loop back on themselves or connect to the next time step, creating a "memory" of past information.
*   **Data Processing:** Processes data sequentially, one element at a time, while maintaining an internal "hidden state" that captures information from previous steps in the sequence.
*   **Input Type:** Specifically designed for **sequential, time-dependent data** where the order of elements is crucial.
*   **Memory:** Possesses an **internal state (memory)** that is updated at each step, allowing it, in principle, to carry information across sequences of arbitrary length. In practice, vanilla RNNs retain mostly short-range information.
*   **Parameter Sharing:** Critically, the same set of weights is used across all time steps (or positions) in the sequence, which is highly efficient for learning temporal patterns.
*   **Output:** Can produce an output at each time step, or a single output after processing the entire sequence. Can handle variable-length input and output sequences.
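
A vanilla RNN step can be written in a few lines of NumPy. The sketch below uses the common update `h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)` with random, untrained weights; the dimensions and sequence lengths are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5

# One set of weights, reused at every time step.
W_xh = rng.normal(0.0, 0.3, size=(input_dim, hidden_dim))
W_hh = rng.normal(0.0, 0.3, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Run a vanilla RNN over xs of shape (seq_len, input_dim)."""
    h = np.zeros(hidden_dim)           # initial hidden state
    states = []
    for x_t in xs:                     # process one time step at a time
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)            # hidden state at every step

short_seq = rng.normal(size=(5, input_dim))
long_seq = rng.normal(size=(12, input_dim))
h_short = rnn_forward(short_seq)
h_long = rnn_forward(long_seq)
print(h_short.shape, h_long.shape)
```

The same three parameter arrays handle both sequence lengths, and reversing the input order changes the final hidden state, which reflects the order sensitivity described above.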

**Pros:**
*   **Excellent for Sequential Data:** Naturally handles data where order and context are important.
*   **Variable-Length Inputs/Outputs:** Can process and generate sequences of varying lengths.
*   **Parameter Efficiency:** Shares parameters across time steps, reducing the total number of parameters compared to an FCN trying to model sequences of fixed length.
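
A quick parameter count makes the efficiency argument tangible. The numbers below (250 time steps, 10 features per step, 64 hidden units) are assumed purely for illustration, and the count covers a single vanilla RNN layer versus the first dense layer of an FCN fed the flattened sequence.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus bias vector of one dense layer.
    return n_in * n_out + n_out

def vanilla_rnn_params(input_dim, hidden_dim):
    # Input-to-hidden weights, hidden-to-hidden weights, and bias.
    return input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim

seq_len, input_dim, hidden_dim = 250, 10, 64

fcn_first_layer = dense_params(seq_len * input_dim, hidden_dim)
rnn_layer = vanilla_rnn_params(input_dim, hidden_dim)

print(fcn_first_layer)  # 160064
print(rnn_layer)        # 4800
```

The RNN count does not depend on `seq_len`, whereas the FCN's first layer grows linearly with it.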

**Cons:**
*   **Vanishing/Exploding Gradients:** Traditional RNNs struggle with long-term dependencies because gradients vanish or explode during training. (Gated architectures such as LSTMs and GRUs largely mitigate vanishing gradients; exploding gradients are usually handled with gradient clipping.)
*   **Slower Training:** Sequential processing can be slower than parallel processing in FCNs, especially for very long sequences.
*   **Complexity:** Can be more complex to design and debug compared to simpler FCNs.
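
The vanishing/exploding behaviour follows from the chain rule: the gradient with respect to an early hidden state is a product of one Jacobian per time step. The NumPy sketch below tracks that product for a recurrence `h_t = tanh(W_hh h_{t-1} + x_t)`. The scaled orthogonal weight matrices and the zero-input "exploding" case are deliberate simplifications chosen to make the effect easy to see, not a realistic training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, seq_len = 16, 50

# Random orthogonal matrix: all singular values equal 1 before scaling.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))

def jacobian_norm(W_hh, xs):
    """Spectral norm of d h_T / d h_0 for h_t = tanh(W_hh h_{t-1} + x_t)."""
    h = np.zeros(hidden_dim)
    J = np.eye(hidden_dim)
    for x_t in xs:
        h = np.tanh(W_hh @ h + x_t)
        # Chain rule: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh
        J = (np.diag(1.0 - h**2) @ W_hh) @ J
    return np.linalg.norm(J, 2)

noisy_inputs = rng.normal(size=(seq_len, hidden_dim))
zero_inputs = np.zeros((seq_len, hidden_dim))

vanishing = jacobian_norm(0.5 * Q, noisy_inputs)  # contracting recurrence
exploding = jacobian_norm(1.5 * Q, zero_inputs)   # linearized regime (tanh' = 1)
print(vanishing, exploding)
```

With a scale below 1 the norm shrinks geometrically over 50 steps, and with a scale above 1 (and no saturation) it grows geometrically, which is why early time steps receive almost no learning signal in one case and unstable updates in the other.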

**Common Use Cases (in Finance):**
*   **Time Series Prediction:** Forecasting stock prices, exchange rates, or economic indicators.
*   **Algorithmic Trading:** Developing strategies that adapt to market dynamics over time.
*   **Sentiment Analysis of News Feeds:** Understanding the evolving sentiment from continuous streams of financial news.
*   **Credit Default Prediction (Dynamic):** Predicting default considering a borrower's payment history over time.
*   **Bond Rating Trends:** Analyzing changes in a company's financial health over quarters/years to predict rating adjustments.

---

### Comparison and Contrast

| Feature            | Fully Connected Network (FCN)                               | Recurrent Neural Network (RNN)                                     |
| :----------------- | :---------------------------------------------------------- | :----------------------------------------------------------------- |
| **Primary Use**    | Independent, structured, tabular data                       | Sequential, time-dependent data (text, time series, audio)         |
| **Architecture**   | Feedforward; no loops                                       | Contains loops; connects to previous states                        |
| **Memory**         | No inherent memory of past inputs                           | Maintains an internal hidden state (memory) over time              |
| **Input/Output**   | Fixed-size input/output                                     | Can handle variable-length sequences                               |
| **Parameter Sharing** | Weights are unique per connection (within a layer)      | Same weights are shared across different time steps                |
| **Order Sensitivity** | Feature order carries no meaning to the network, as long as it is consistent across samples | Highly sensitive to the order of elements in a sequence            |
| **Training Issues** | Standard backpropagation                                    | Vanishing/exploding gradients (in vanilla RNNs); mitigated by LSTMs/GRUs and gradient clipping |
| **Complexity**     | Relatively simpler (for basic tasks)                        | More complex, especially for long sequences (LSTMs/GRUs often needed) |
| **Financial Use**  | Credit scoring, fraud detection (snapshot), cross-sectional alpha | Stock price forecasting, sentiment analysis of news, dynamic risk modeling |

**In Summary:**

*   **FCNs** excel at processing **static, independent data points** where all relevant information is present in a single input vector. They are powerful for learning complex non-linear relationships but lack a sense of continuity or temporal order.
*   **RNNs** are specifically designed for **sequential data**, enabling them to capture temporal dependencies and context over time. They "remember" past information, making them indispensable for tasks like time series analysis and natural language processing in finance.

The choice between an FCN and an RNN (or one of its variants like LSTM/GRU) primarily depends on the **nature of your data** and whether **temporal dependencies are crucial** for the task at hand. Often, more complex financial applications might combine both, using RNNs to extract features from time series data, and then feeding those features into FCNs for final prediction or classification.
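
The hybrid pattern described in the last sentence can be sketched as follows: an RNN encoder summarizes a time series into its final hidden state, which is concatenated with static features and passed to a dense output layer. All names, dimensions, and the dynamic credit-default framing are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ts_features, hidden_dim, n_static = 4, 8, 3

# RNN encoder weights (shared across time steps).
W_xh = rng.normal(0.0, 0.3, size=(n_ts_features, hidden_dim))
W_hh = rng.normal(0.0, 0.3, size=(hidden_dim, hidden_dim))
# Dense head: its input is [final hidden state, static features].
W_out = rng.normal(0.0, 0.3, size=(hidden_dim + n_static, 1))

def encode_sequence(xs):
    # Summarize the whole sequence in the final hidden state.
    h = np.zeros(hidden_dim)
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

def predict(xs, static):
    # Concatenate sequence summary and static features, then apply the dense head.
    features = np.concatenate([encode_sequence(xs), static])
    logit = features @ W_out
    return 1.0 / (1.0 + np.exp(-logit[0]))

payment_history = rng.normal(size=(24, n_ts_features))  # 24 months of history
borrower_static = rng.normal(size=n_static)             # snapshot features
p_default = predict(payment_history, borrower_static)
print(p_default)
```

In Keras or PyTorch the same structure would typically be an LSTM or GRU layer followed by one or more dense layers, trained end to end.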