In [5]:

# Transformer models have revolutionized the field of natural language processing (NLP) and are the backbone of many state-of-the-art models 
# like BERT, GPT, and T5.

# The original transformer architecture consists of an encoder and a decoder, each built from stacked layers of self-attention and 
# feed-forward neural networks. BERT keeps only the encoder, GPT keeps only the decoder, and T5 uses both.

# The Encoder consists of:

# Input Embeddings: The input text is converted into embeddings, which are vectors representing the words.

# Positional Encoding: Unlike RNNs, transformers process all words in parallel and have no built-in notion of word order, so positional 
# encodings are added to the input embeddings to give the model information about the order of words.

# Multi-Head Self-Attention: Each word attends to every other word in the sentence, allowing the model to capture context from all positions. 
# Multiple attention heads allow the model to focus on different parts of the sentence simultaneously.

# Feed-Forward Neural Network: After attention, each position's output is passed through the same feed-forward neural network.

# Layer Normalization and Residual Connections: Applied around each sub-layer, these stabilize and speed up training.
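
# As a rough illustration of the positional-encoding step in the list above, here is a minimal NumPy sketch of the sinusoidal scheme 
# from the original Transformer design. Assumptions: the function name, the toy sizes, and the random stand-in embeddings are made up 
# for illustration, and d_model is taken to be even. BERT itself learns its position embeddings rather than using this fixed formula.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model / 2)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

# Toy example: 4 words with 8-dimensional embeddings (random stand-ins, not real word vectors)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
encoder_input = embeddings + sinusoidal_positional_encoding(4, 8)
print(encoder_input.shape)
```

# Because the encoding is simply added to the embeddings, two identical words at different positions end up with different input vectors.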

# The Decoder:
# Similar to the encoder, but with an additional (cross-)attention layer that helps the decoder focus on relevant parts of the input sentence.
# Its self-attention is also masked, so each position can only attend to earlier positions in the output.
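
# The extra decoder attention layer can be sketched as follows: the decoder supplies the "questions" and the encoder output supplies 
# what is looked up. This is a simplified sketch, not a full decoder layer: the learned projection matrices are omitted, and the 
# function name, shapes, and random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Each decoder position attends over all encoder positions."""
    d_k = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d_k)  # (target_len, source_len)
    weights = softmax(scores, axis=-1)                         # each row sums to 1
    return weights @ encoder_states, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))  # 6 input words
decoder_states = rng.normal(size=(2, 4))  # 2 output words generated so far
attended, weights = cross_attention(decoder_states, encoder_states)
print(attended.shape, weights.shape)
```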
        
# Models like BERT and GPT are available pre-trained, which is very convenient: it saves a ton of time and compute because we don't have 
# to train the models from scratch ourselves.

# Pre-training methods:
# Masked Language Modeling (MLM): Used in BERT. Some of the words in the input are randomly masked, and the model is trained to predict them.
# Next Sentence Prediction (NSP): Also used in BERT. The model predicts whether a given sentence follows another sentence in the original text.
# Causal Language Modeling (CLM): Used in GPT. Predicts the next word in a sequence, training the model to generate coherent text.
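
# To make the MLM and CLM objectives above concrete, here is a toy sketch of how training examples could be built from a sentence. 
# Assumptions: the helper names and the example sentence are hypothetical, words stand in for real subword tokens, and the masking is 
# simplified (BERT's actual recipe also sometimes keeps or swaps the selected word instead of always inserting [MASK]).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM: hide a random subset of tokens; labels keep the hidden originals."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # this position is not scored
    return inputs, labels

def causal_pairs(tokens):
    """CLM: pair each prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the federal reserve sets monetary policy".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(causal_pairs(tokens)[:2])
```

# MLM lets the model look at words on both sides of a gap, while CLM only ever sees the words to the left of the one being predicted.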

# Fine-tuning
# After pre-training, the model can be fine-tuned on specific tasks like question answering, text classification, or translation using 
# task-specific data.

# So, how does this actually work?
# Self-Attention: The core idea is to compute a weighted representation of the input sequence in which each word attends to all other words. 
# Each word's embedding is projected into three vectors:
# Query (Q): Represents the word for which attention is being calculated.
# Key (K): Represents the words being attended to.
# Value (V): The values corresponding to the keys, which contribute to the output representation.
# The attention scores are the dot products of Q and K, scaled by the square root of the key dimension and passed through a softmax to get 
# the attention weights. The output for each word is the sum of the V vectors weighted by those attention weights.
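
# The Q/K/V computation described above fits in a few lines of NumPy. This is a minimal sketch under stated assumptions: a single 
# sequence with no batching or masking, and random matrices standing in for learned projection weights. All names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # dot product of queries and keys, scaled
    weights = softmax(scores, axis=-1)  # attention weights; each row sums to 1
    return weights @ V, weights         # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # 3 words, 4-dimensional embeddings
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))  # stand-ins for learned weights
output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.sum(axis=-1))  # one sum per word
print(output.shape)
```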

# Multi-Head Attention
# Multiple attention mechanisms (heads) are used in parallel to capture different aspects of the relationships between words. 
# The outputs of the heads are concatenated and projected back to the model dimension.
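
# A sketch of running several heads in parallel, assuming the model dimension divides evenly by the number of heads. The function 
# name, the toy sizes, and the random weight matrices (including the output projection W_o) are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # separate weights for each head
    heads = weights @ V                                  # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # join the heads back together
    return concat @ W_o                                  # final output projection

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 words, 8-dimensional embeddings
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(out.shape)
```

# Each head works on its own slice of the projected vectors, so different heads are free to learn different attention patterns.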



In [1]:

# I searched Google for 'federal reserve pdf', downloaded the first 5 PDF files that I saw, and put them in a folder on my desktop.

# The script below uses a question-answering model initialized from the Hugging Face transformers library, specifically the 
# bert-large-uncased-whole-word-masking-finetuned-squad model.

import os
import pdfplumber
from tqdm import tqdm
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering

import warnings
warnings.filterwarnings('ignore')

# Function to extract text from a single PDF using pdfplumber
def extract_text_from_pdf(pdf_path):
    text = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
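            page_text = page.extract_text()
            if page_text:  # skip pages with no extractable text (e.g. scanned images)
                text += page_text + '\n'  # keep a line break between pages so words don't run together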
    return text

# Function to process all PDFs with a progress bar
def get_all_texts(folder_path):
    # Match the extension case-insensitively so files named '.PDF' are not skipped
    pdf_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
    texts = []
    for pdf_file in tqdm(pdf_files, desc="Processing PDFs"):
        texts.append(extract_text_from_pdf(pdf_file))
    return ' '.join(texts)

# Initialize the question-answering pipeline
tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = AutoModelForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)

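def ask_question(context, question):
    # Pass the inputs as keyword arguments; the pipeline returns a dict with 'answer', 'score', 'start' and 'end'
    return qa_pipeline(question=question, context=context)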

# Specify the folder containing the files
folder_path = 'C:\\Users\\ryan_\\Desktop\\All_Docs\\'
context = get_all_texts(folder_path)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Processing PDFs: 100%|███████████████████████████████████████████████████████████████████| 5/5 [00:42<00:00,  8.47s/it]
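
# Note that the context built above joins five PDFs into one very long string, far more than BERT's 512-token input limit. The 
# question-answering pipeline copes with this by splitting the context into overlapping windows and answering over each one. A rough 
# word-level sketch of that idea is below. Assumptions: split_into_windows, window and stride are hypothetical names, and the real 
# pipeline works on tokenizer tokens rather than whitespace-separated words.

```python
def split_into_windows(words, window=300, stride=250):
    """Split a list of words into overlapping windows (hypothetical helper)."""
    if stride <= 0 or stride > window:
        raise ValueError("stride must be between 1 and window")
    windows = []
    for start in range(0, len(words), stride):
        windows.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return windows

# Toy stand-in for a long document: 1000 numbered "words"
words = [f"w{i}" for i in range(1000)]
chunks = split_into_windows(words)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

# The overlap (window minus stride) is there so that an answer sitting on a window boundary still appears whole in at least one window.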


In [2]:

# Ask a question
question = "What are these documents about?"
answer = ask_question(context, question)
print(answer['answer'])


Federal Reserve Board


In [3]:

# Ask a question
question = "What is the main purpose of the Federal Reserve Bank?"
answer = ask_question(context, question)
print(answer['answer'])


supervising and examining state member banks


In [4]:

# The script below also uses a question-answering model from the transformers library, but with a different, smaller model: 
# distilbert-base-uncased-distilled-squad.

import os
import fitz  # PyMuPDF
from tqdm import tqdm
from transformers import pipeline

# Function to extract text from a single PDF using PyMuPDF
def extract_text_from_pdf(pdf_path):
    text = ''
    try:
        with fitz.open(pdf_path) as doc:  # the context manager closes the file when done
            for page in doc:  # iterate over the pages directly
                text += page.get_text()
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")
    return text

# Function to process all PDFs with a progress bar
def get_all_texts(folder_path):
    # Match the extension case-insensitively so files named '.PDF' are not skipped
    pdf_files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
    texts = []
    for pdf_file in tqdm(pdf_files, desc="Processing PDFs"):
        texts.append(extract_text_from_pdf(pdf_file))
    return ' '.join(texts)

# Initialize the question-answering pipeline
qa_pipeline = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')

def ask_question(context, question):
    result = qa_pipeline(question=question, context=context)  # keyword arguments, as documented for this pipeline
    return result['answer']

# Specify the folder containing the files
folder_path = 'C:\\Users\\ryan_\\Desktop\\All_Docs\\'
context = get_all_texts(folder_path)


Processing PDFs: 100%|███████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00,  8.87it/s]


In [10]:

# Ask a question
question = "In what year was the Federal Reserve system created?"
answer = ask_question(context, question)
print("Answer:", answer)


Answer: 1913


In [11]:

# Ask a question
question = "Who are the members of the Federal Reserve system?"
answer = ask_question(context, question)
print("Answer:", answer)


Answer: state-chartered banks


In [None]:

# Popular transformer models include the following:
# BERT (Bidirectional Encoder Representations from Transformers)
# Training: Trained using MLM and NSP.
# Usage: Good for tasks requiring understanding of context from both directions, such as question answering and text classification.

# GPT (Generative Pre-trained Transformer)
# Training: Uses CLM, training to predict the next word in a sequence.
# Usage: Excellent for text generation tasks.

# T5 (Text-to-Text Transfer Transformer)
# Training: Pre-trained with a span-corruption (denoising) objective and converts all tasks into a text-to-text format, where both input 
# and output are text strings.
# Usage: Versatile, used for translation, summarization, and more.

# Applications of Transformer Models
# Question Answering: Understanding context and extracting relevant information.
# Text Generation: Generating coherent and contextually relevant text.
# Translation: Translating text from one language to another.
# Summarization: Condensing long texts into shorter summaries.
# Sentiment Analysis: Determining the sentiment expressed in a text.

# Understanding their architecture and training helps in leveraging these models effectively for various NLP tasks. By using pre-trained 
# models like BERT and GPT, one can achieve high performance on complex language tasks with relatively little fine-tuning on specific datasets.


In [None]:

# END!!!
