
# Window Method in PyTorch

In the previous section we built a method for calculating the average sentiment for long pieces of text by breaking the text up into *windows* and calculating the sentiment for each window individually.

Our approach in the last section was a quick-and-dirty solution. Here, we will work on improving this process and implementing it solely using PyTorch functions to improve efficiency.

The first thing we will do is import modules and initialize our model and tokenizer.

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

We will be using the same text example as we did previously.

In [None]:
txt = """
I would like to get your all  thoughts on the bond yield increase this week.  I am not worried about the market downturn but the sudden increase in yields. On 2/16 the 10 year bonds yields increased by almost  9 percent and on 2/19 the yield increased by almost 5 percent.

Key Points from the CNBC Article:

* **The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program.**
* **Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic.**
* **However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation.**

The recent rise in bond yields and U.S. inflation expectations has some investors wary that a repeat of the 2013 “taper tantrum” could be on the horizon.

The benchmark U.S. 10-year Treasury note climbed above 1.3% for the first time since February 2020 earlier this week, while the 30-year bond also hit its highest level for a year. Yields move inversely to bond prices.

Yields tend to rise in lockstep with inflation expectations, which have reached their highest levels in a decade in the U.S., powered by increased prospects of a large fiscal stimulus package, progress on vaccine rollouts and pent-up consumer demand.

The “taper tantrum” in 2013 was a sudden spike in Treasury yields due to market panic after the Federal Reserve announced that it would begin tapering its quantitative easing program.

Major central banks around the world have cut interest rates to historic lows and launched unprecedented quantities of asset purchases in a bid to shore up the economy throughout the pandemic. The Fed and others have maintained supportive tones in recent policy meetings, vowing to keep financial conditions loose as the global economy looks to emerge from the Covid-19 pandemic.

However, the recent rise in yields suggests that some investors are starting to anticipate a tightening of policy sooner than anticipated to accommodate a potential rise in inflation.

With central bank support removed, bonds usually fall in price which sends yields higher. This can also spill over into stock markets as higher interest rates means more debt servicing for firms, causing traders to reassess the investing environment.

“The supportive stance from policymakers will likely remain in place until the vaccines have paved a way to some return to normality,” said Shane Balkham, chief investment officer at Beaufort Investment, in a research note this week.

“However, there will be a risk of another ‘taper tantrum’ similar to the one we witnessed in 2013, and this is our main focus for 2021,” Balkham projected, should policymakers begin to unwind this stimulus.

Long-term bond yields in Japan and Europe followed U.S. Treasurys higher toward the end of the week as bondholders shifted their portfolios.

“The fear is that these assets are priced to perfection when the ECB and Fed might eventually taper,” said Sebastien Galy, senior macro strategist at Nordea Asset Management, in a research note entitled “Little taper tantrum.”

“The odds of tapering are helped in the United States by better retail sales after four months of disappointment and the expectation of large issuance from the $1.9 trillion fiscal package.”

Galy suggested the Fed would likely extend the duration on its asset purchases, moderating the upward momentum in inflation.

“Equity markets have reacted negatively to higher yield as it offers an alternative to the dividend yield and a higher discount to long-term cash flows, making them focus more on medium-term growth such as cyclicals” he said. Cyclicals are stocks whose performance tends to align with economic cycles.

Galy expects this process to be more marked in the second half of the year when economic growth picks up, increasing the potential for tapering.

## Tapering in the U.S., but not Europe

Allianz CEO Oliver Bäte told CNBC on Friday that there was a geographical divergence in how the German insurer is thinking about the prospect of interest rate hikes.

“One is Europe, where we continue to have financial repression, where the ECB continues to buy up to the max in order to minimize spreads between the north and the south — the strong balance sheets and the weak ones — and at some point somebody will have to pay the price for that, but in the short term I don’t see any spike in interest rates,” Bäte said, adding that the situation is different stateside.

“Because of the massive programs that have happened, the stimulus that is happening, the dollar being the world’s reserve currency, there is clearly a trend to stoke inflation and it is going to come. Again, I don’t know when and how, but the interest rates have been steepening and they should be steepening further.”

## Rising yields a ‘normal feature’

However, not all analysts are convinced that the rise in bond yields is material for markets. In a note Friday, Barclays Head of European Equity Strategy Emmanuel Cau suggested that rising bond yields were overdue, as they had been lagging the improving macroeconomic outlook for the second half of 2021, and said they were a “normal feature” of economic recovery.

“With the key drivers of inflation pointing up, the prospect of even more fiscal stimulus in the U.S. and pent up demand propelled by high excess savings, it seems right for bond yields to catch-up with other more advanced reflation trades,” Cau said, adding that central banks remain “firmly on hold” given the balance of risks.

He argued that the steepening yield curve is “typical at the early stages of the cycle,” and that so long as vaccine rollouts are successful, growth continues to tick upward and central banks remain cautious, reflationary moves across asset classes look “justified” and equities should be able to withstand higher rates.

“Of course, after the strong move of the last few weeks, equities could mark a pause as many sectors that have rallied with yields look overbought, like commodities and banks,” Cau said.

“But at this stage, we think rising yields are more a confirmation of the equity bull market than a threat, so dips should continue to be bought.”
"""

This time, because we are using PyTorch, we will specify `return_tensors='pt'` when encoding our input text.

In [None]:
tokens = tokenizer.encode_plus(txt, add_special_tokens=False,
                               return_tensors='pt')

print(len(tokens['input_ids'][0]))
tokens

Token indices sequence length is longer than the specified maximum sequence length for this model (1345 > 512). Running this sequence through the model will result in indexing errors


1345


{'input_ids': tensor([[1045, 2052, 2066,  ..., 4149, 1012, 1524]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

# Tokenizes a given text using a pre-trained financial tokenizer


---


## Import necessary libraries

    from transformers import BertTokenizer

This imports the `BertTokenizer` class from the transformers library.

-------------------------
## Initialize the tokenizer

    tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')

This line loads the tokenizer published with the pre-trained `ProsusAI/finbert` model, a BERT model adapted to financial text.

------------------------
## Tokenize the text

    tokens = tokenizer.encode_plus(txt, add_special_tokens=False, return_tensors='pt')

This line tokenizes the input text `txt` using the initialized tokenizer.

    add_special_tokens=False

prevents the automatic addition of special tokens such as [CLS] and [SEP].

The notebook processes the text with a windowing method: it splits the token sequence into chunks (windows) that each fit within the model's 512-token limit.

### Why add_special_tokens=False is important

### Manual control

With `add_special_tokens=False`, the tokenizer does not add [CLS] and [SEP] at the beginning and end of the entire text, so the code decides where these tokens go.

### Windowing approach

The code later inserts [CLS] and [SEP] manually at the beginning and end of each individual chunk, so that each chunk is treated as a separate sequence for sentiment analysis.

If the special tokens were added to the entire text instead, only the first chunk would start with [CLS] and only the last chunk would end with [SEP], so the intermediate chunks would not be correctly formatted.

### Padding and chunk size

The text is split into chunks of 510 tokens, which leaves room for the two manually added special tokens. Any chunk that is still shorter than 512 tokens is then padded.

A uniform length of 512 lets the chunks be stacked into a single tensor and passed to the model as one batch.

### In summary

Setting `add_special_tokens=False` gives the code control over token placement, which the windowing approach needs in order to process long text in fixed-size chunks. Each chunk receives its own special tokens and is padded to the target length.

    return_tensors='pt'

ensures the output is returned as PyTorch tensors.

-------------------------------
## Print the length of the tokenized input

    print(len(tokens['input_ids'][0]))

This line prints the number of tokens in the input. `tokens['input_ids'][0]` is the sequence of numerical token IDs.

-------------------------
## Display the tokenized output

    tokens

This line displays the full tokenized output: the input IDs, token type IDs, and attention mask.

-----------------------------
## In essence

This code tokenizes the text with the FinBERT tokenizer and displays the length and content of the result as PyTorch tensors.

Now we have a set of tensors where each tensor contains **1345** tokens. We will use a similar approach to the one we used before: pull out a length of **510** tokens (or fewer), add the CLS and SEP tokens, then add PAD tokens where needed. To create these tensors of length **510**, we use the `torch.split` method.

In [None]:
a = torch.arange(10)
a

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Creates a 1-D tensor
This line creates a 1-D tensor named `a` using the `torch.arange` function.

    torch.arange(10)

generates the integers from 0 up to, but not including, 10 (that is, 0 to 9) and returns them as a tensor.

--------
## Display the tensor

    a

This line displays the content of the tensor `a`, which holds the numbers 0 to 9.

----------------
## In essence

This code creates a simple 1-D tensor containing a sequence of numbers and displays it. We use it as a small test case for the tensor operations that follow.

In [None]:
torch.split(a, 4)

(tensor([0, 1, 2, 3]), tensor([4, 5, 6, 7]), tensor([8, 9]))

Now we apply `split` to our *input IDs* and *attention mask* tensors. Note that we must access the first element of each tensor because they are shaped like a list within a list (you can see this by comparing the number of square brackets in tensor `a` above with the tensors shown when outputting `tokens`).

In [None]:
input_id_chunks = tokens['input_ids'][0].split(510)
mask_chunks = tokens['attention_mask'][0].split(510)

# Splitting long text into chunks


---


     tokens['input_ids'][0]:

This part accesses the input IDs from the `tokens` dictionary, which holds the output of the tokenizer (the text converted into numerical IDs).

      ['input_ids']

retrieves the entry containing the input IDs, a 2-D tensor of shape (1, 1345).

      [0]

selects the first (and only) row of that tensor, giving a 1-D tensor of 1345 token IDs.
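As a small illustration of why the `[0]` index is needed, the sketch below uses a stand-in tensor with a leading batch dimension (the name `batched` and its size are made up for this example). It compares indexing first and then splitting, as this notebook does, with the alternative of splitting along `dim=1`.

```python
import torch

# Stand-in for tokens['input_ids']: a batch holding one sequence, shape (1, 10).
batched = torch.arange(10).unsqueeze(0)

# Option 1 (used in this notebook): drop the batch dimension, then split.
chunks_a = batched[0].split(4)

# Option 2: keep the batch dimension and split along dim=1 instead.
chunks_b = batched.split(4, dim=1)

print([tuple(c.shape) for c in chunks_a])  # 1-D chunks
print([tuple(c.shape) for c in chunks_b])  # 2-D chunks with a leading batch axis
```

Both options produce the same values. The first gives 1-D chunks, which are simpler to concatenate with the special tokens later.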

--------------------------
    .split(510):

This applies the split method to the selected input IDs.

split(510) divides the input IDs into chunks of 510 tokens each.

If the total number of tokens is not perfectly divisible by 510, the last chunk will have fewer tokens.

    input_id_chunks = ...:

This assigns the resulting chunks to the variable input_id_chunks.

In essence, this line splits the tokenized input IDs into smaller chunks of 510 tokens each and stores them in input_id_chunks.

-----------------------------
Line 2:

    mask_chunks = tokens['attention_mask'][0].split(510)

This line follows the same logic as the first, but instead of input IDs, it works with the attention mask.

     tokens['attention_mask'][0]:

This selects the attention mask from the tokens dictionary and takes its first element.

    .split(510):

Similar to before, this splits the attention mask into chunks of 510 elements.

    mask_chunks = ...:

The resulting chunks are stored in the variable mask_chunks.

This line splits the attention mask into chunks of 510 elements each, aligning with the input ID chunks, and assigns them to mask_chunks.

-------------------------------
# Reasoning

This splitting into chunks is often used when dealing with long sequences in natural language processing.

It allows you to process the text in smaller, more manageable pieces, which can be important for computational efficiency or when working with models that have limitations on input sequence length.

The attention mask is used to indicate which tokens should be attended to by the model, so splitting it alongside the input IDs ensures that the attention mechanism works correctly on each chunk.

The line torch.split(a, 4) serves as an introductory example of the split functionality. The actual application of this concept occurs later in the code, utilizing the .split(510) method on the input IDs and attention mask tensors.
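To make the chunk sizes concrete, here is a minimal sketch that uses dummy tensors of the same length as our tokenized text (1345 tokens). The names `dummy_ids`, `dummy_mask`, and `mask_chunks_demo` are placeholders for this example only, not part of the notebook's pipeline.

```python
import torch

# Dummy stand-ins with the same length as the tokenized example (1345 tokens).
dummy_ids = torch.arange(1345)
dummy_mask = torch.ones(1345, dtype=torch.long)

id_chunks = dummy_ids.split(510)
mask_chunks_demo = dummy_mask.split(510)

# Both tensors are split at the same positions, so the chunks stay aligned.
sizes = [len(c) for c in id_chunks]
print(sizes)  # [510, 510, 325]
```

The final chunk holds the remaining 325 tokens, so it is the only one that needs padding once the two special tokens have been added.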

# Add Start & Finish tokens in the splits through tensor concatenation
To add our CLS (**101**) and SEP (**102**) tokens, we can use the `torch.cat` method. This method takes a *list* of tensors and con**cat**enates them. Let's try it on our example tensor `a` first:

In [None]:
a = torch.cat(
    [torch.Tensor([101]), a, torch.Tensor([102])]
)

a

tensor([101.,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9., 102.])

# Padding the Chunks
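It's that easy! (The values print as floats because `torch.Tensor` creates a float tensor; we convert back to integers at the end.) We're almost there now, but we still need to add padding to our tensors to bring them up to a length of *512*, which should only be required for the final chunk.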

To do this we will build an if-statement that checks if the tensor length requires padding, and if so add the correct amount of padding which will be something like `required_len = 512 - tensor_len`.

Again, let's test it on tensor `a` first:

In [None]:
padding_len = 20 - a.shape[0]

padding_len

8

# Padding Calculated


---


Line 1:

    padding_len = 20 - a.shape[0]

    a.shape[0]:

This retrieves the size of the first dimension of the tensor a.

In this case, since a is a 1-dimensional tensor, a.shape[0] represents the total number of elements in the tensor.

    20 - a.shape[0]:

This subtracts the size of the tensor a from 20. The result represents the number of padding elements that would be needed to make the tensor have a length of 20.

    padding_len = ...:

This assigns the calculated padding length to the variable padding_len.


Line 2:

    padding_len

This line simply displays the value of padding_len.

If the tensor a has fewer than 20 elements, padding_len will be a positive number indicating the amount of padding needed.

If the tensor a already has 20 or more elements, padding_len will be 0 or a negative number, indicating that no padding is required.

In essence, these lines calculate the amount of padding needed to bring the length of the tensor a up to 20 and store this value in padding_len.

----------------------------
## Example

If `a` had 15 elements, `a.shape[0]` would be 15 and `padding_len` would be 20 - 15 = 5, so 5 padding elements would be needed to bring `a` to a length of 20.

In our case `a` has 12 elements (the original 10 plus the two special tokens), so `padding_len` is 20 - 12 = 8, which matches the output above.

In [None]:
if padding_len > 0:
    a = torch.cat(
        [a, torch.Tensor([0] * padding_len)]
    )

a

tensor([101.,   0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9., 102.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.])
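If you would rather keep integer values throughout, one possible variant is sketched below. It is an alternative, not the approach used in this notebook: it builds the special tokens with lowercase `torch.tensor` and pads with `torch.nn.functional.pad`. The variable `b` is a fresh stand-in for the example tensor.

```python
import torch
import torch.nn.functional as F

b = torch.arange(10)  # int64 by default

# torch.tensor (lowercase) infers an integer dtype from integer inputs,
# whereas torch.Tensor always builds a float32 tensor.
b = torch.cat([torch.tensor([101]), b, torch.tensor([102])])

# Right-pad with zeros up to length 20; max() keeps the pad from going negative.
pad = max(0, 20 - b.shape[0])
b = F.pad(b, (0, pad), value=0)

print(b.dtype, b.shape)
```

Because every piece is an integer tensor, no type promotion to float takes place and no conversion is needed afterwards.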

Now let's use the same logic with our `tokens` tensors.

In [None]:
# define target chunksize
chunksize = 512

# split into chunks of 510 tokens, we also convert to list (default is tuple which is immutable)
input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))

# loop through each chunk
for i in range(len(input_id_chunks)):
    # add CLS and SEP tokens to input IDs
    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    # add attention tokens to attention mask
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])
    # get required padding length
    pad_len = chunksize - input_id_chunks[i].shape[0]
    # check if tensor length satisfies required chunk size
    if pad_len > 0:
        # if padding length is more than 0, we must add padding
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.Tensor([0] * pad_len)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.Tensor([0] * pad_len)
        ])

# check length of each tensor
for chunk in input_id_chunks:
    print(len(chunk))
# print final chunk so we can see 101, 102, and 0 (PAD) tokens are all correctly placed
chunk

512
512
512


tensor([  101.,  2153.,  1010.,  1045.,  2123.,  1521.,  1056.,  2113.,  2043.,
         1998.,  2129.,  1010.,  2021.,  1996.,  3037.,  6165.,  2031.,  2042.,
         9561.,  7406.,  1998.,  2027.,  2323.,  2022.,  9561.,  7406.,  2582.,
         1012.,  1524.,  1001.,  1001.,  4803., 16189.,  1037.,  1520.,  3671.,
         3444.,  1521.,  2174.,  1010.,  2025.,  2035., 18288.,  2024.,  6427.,
         2008.,  1996.,  4125.,  1999.,  5416., 16189.,  2003.,  3430.,  2005.,
         6089.,  1012.,  1999.,  1037.,  3602.,  5958.,  1010., 23724.,  2015.,
         2132.,  1997.,  2647., 10067.,  5656., 14459.,  6187.,  2226.,  4081.,
         2008.,  4803.,  5416., 16189.,  2020.,  2058., 20041.,  1010.,  2004.,
         2027.,  2018.,  2042.,  2474., 12588.,  1996.,  9229., 26632., 23035.,
        17680.,  2005.,  1996.,  2117.,  2431.,  1997., 25682.,  1010.,  1998.,
         2056.,  2027.,  2020.,  1037.,  1523.,  3671.,  3444.,  1524.,  1997.,
         3171.,  7233.,  1012.,  1523., 

# Long text Preprocessing steps

----------------------
## 1. Defining Chunk Size

    chunksize = 512

This line sets the target size of each chunk to 512 tokens.

------------------
## 2. Splitting into Chunks

    input_id_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))

`tokens['input_ids'][0]` and `tokens['attention_mask'][0]` access the input IDs and attention mask from the `tokens` dictionary.

`.split(chunksize - 2)` splits these into chunks of 510 tokens each. We subtract 2 to leave room for the CLS and SEP tokens that are added later.

`list(...)` converts the result, which is a tuple by default, into a list. Tuples are immutable, and the loop below needs to replace each chunk in place.

------------------------
## 3. Processing Each Chunk

    for i in range(len(input_id_chunks)):
        # ... (code inside the loop)

This loop iterates over the index of each chunk in `input_id_chunks` (and the corresponding chunk in `mask_chunks`).

-------------------

## 4. Adding Special Tokens

    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])

`torch.cat(...)` concatenates (joins) tensors.

For each chunk, it adds the CLS token (101) at the beginning and the SEP token (102) at the end of the input IDs.

It also adds an attention mask value of 1 for each of these special tokens, so the model attends to them.

-------------------
## 5. Padding

    pad_len = chunksize - input_id_chunks[i].shape[0]
    if pad_len > 0:
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.Tensor([0] * pad_len)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.Tensor([0] * pad_len)
        ])

`pad_len` is the number of padding tokens needed for the chunk to reach the target size (`chunksize`).

If `pad_len` is greater than 0, the code appends that many padding tokens (0) to the input IDs and the same number of 0s to the attention mask, so that both reach a length of 512. Because the padding is created with `torch.Tensor`, which builds float tensors, the padded chunk is displayed with decimal points; the values are converted back to integers in the next step.

-------------------
## 6. Verification

    for chunk in input_id_chunks:
        print(len(chunk))
    chunk

The loop prints the length of each chunk to verify that they are all 512 tokens long after padding.

After the loop, `chunk` still refers to the final chunk, so displaying it shows the structure of that chunk, including the special tokens and the padding.

-----------------------
# In summary

This code takes a long text, splits it into smaller chunks of 510 tokens, adds special tokens (CLS and SEP) to each chunk, pads the chunks to reach a uniform length of 512 tokens, and then verifies the lengths.


This is a common preprocessing step for feeding text data into transformer models like BERT.
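The steps above can be gathered into a single helper. The sketch below is one possible arrangement, not the notebook's own code: the function name `chunk_and_pad` and its parameters are hypothetical, and the defaults 101, 102, and 0 are the CLS, SEP, and PAD IDs used in this notebook. It builds every piece with the same dtype as its input, so the chunks stay integer tensors.

```python
import torch

def chunk_and_pad(input_ids, attention_mask, chunksize=512,
                  cls_id=101, sep_id=102, pad_id=0):
    """Split 1-D token tensors into padded windows of `chunksize` tokens."""
    id_rows, mask_rows = [], []
    # Leave room for the two special tokens in every window.
    id_parts = input_ids.split(chunksize - 2)
    mask_parts = attention_mask.split(chunksize - 2)
    for ids, mask in zip(id_parts, mask_parts):
        pad_len = chunksize - 2 - ids.shape[0]  # 0 for full windows
        id_rows.append(torch.cat([
            ids.new_tensor([cls_id]), ids, ids.new_tensor([sep_id]),
            ids.new_full((pad_len,), pad_id),
        ]))
        mask_rows.append(torch.cat([
            mask.new_ones(1), mask, mask.new_ones(1),
            mask.new_zeros(pad_len),
        ]))
    return torch.stack(id_rows), torch.stack(mask_rows)

# Dummy input of the same length as our example (1345 tokens).
ids, mask = chunk_and_pad(torch.arange(1, 1346),
                          torch.ones(1345, dtype=torch.long))
print(ids.shape, mask.shape)
```

With the real data, the call would presumably take `tokens['input_ids'][0]` and `tokens['attention_mask'][0]` as its two arguments.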

It all looks good! Now for the final step: placing our tensors back into the dictionary-style format we had before.

In [None]:
input_ids = torch.stack(input_id_chunks)
attention_mask = torch.stack(mask_chunks)

input_dict = {
    'input_ids': input_ids.long(),
    'attention_mask': attention_mask.int()
}
input_dict

{'input_ids': tensor([[  101,  1045,  2052,  ...,  1012,  1523,   102],
         [  101,  1996, 16408,  ...,  2272,  1012,   102],
         [  101,  2153,  1010,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32)}

We can now process all chunks and calculate probabilities using softmax in parallel like so:

In [None]:
outputs = model(**input_dict)
probs = torch.nn.functional.softmax(outputs[0], dim=-1)
probs = probs.mean(dim=0)
probs

tensor([0.4144, 0.4940, 0.0916], grad_fn=<MeanBackward1>)
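To show what the softmax and mean are doing without loading the model, here is a small sketch on made-up logits for three windows and three classes. The numbers are hypothetical and are not FinBERT output. The `torch.no_grad()` context is an optional addition for inference; in the real pipeline it would wrap the `model(**input_dict)` call.

```python
import torch

# Hypothetical logits for three windows and three classes (not real model output).
logits = torch.tensor([[ 1.2, 0.3, -0.5],
                       [ 0.1, 1.5, -1.0],
                       [-0.2, 0.9,  0.0]])

with torch.no_grad():  # inference only, so skip gradient tracking
    window_probs = torch.softmax(logits, dim=-1)  # one distribution per window
    mean_probs = window_probs.mean(dim=0)         # average across windows

predicted = int(torch.argmax(mean_probs))  # index of the most probable class
print(mean_probs, predicted)
```

To turn the predicted index into a label name, the mapping is typically available as `model.config.id2label`. Note that a plain mean gives the shorter, padded final window the same weight as the full windows.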