# Data Loading and Splitting

#### This notebook demonstrates how to load and split data using the LangChain library.

In [1]:
# import libraries
from langchain.document_loaders import UnstructuredURLLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

### UnstructuredURLLoader

In [2]:
loader = UnstructuredURLLoader(
    urls = [
        "https://www.reuters.com/business/finance/caixabanks-q3-net-profit-rises-70-same-period-2022-2023-10-27/",
        "https://www.reuters.com/business/finance/danske-bank-q3-profit-exceeds-expectations-2023-10-27/",
        "https://www.cnbc.com/2023/10/26/amazons-profit-margin-nears-record-high-after-ceo-jassys-cost-cuts.html"
    ]
)

In [3]:
# load data and check length (3 URLs were passed, so 3 documents are expected)
data = loader.load()
len(data)

3

In [4]:
data[0].page_content

'Finance\n\nCaixabank sees higher lending income in 2023 after beating forecasts\n\nBy\n\nJesús Aguado\n\nOctober 27, 2023\n\n7:26 AM UTC\n\nUpdated  ago\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nA man uses a Caixabank ATM in Barcelona, Spain, October 3, 2022. REUTERS/Nacho Doce/File photo  Acquire Licensing Rights\n\nSummary\n\nCompanies\n\nQ3 net profit beats market forecasts\n\nNII 2023 growth guidance lifted to 10 bln vs 9.25 bln\n\nTargets stable NII performance in 2024\n\nMADRID, Oct 27 (Reuters) - Caixabank (CABK.MC) reported third-quarter net profit on Friday which beat forecasts, helped by higher lending income, which the Spanish bank said would rise more than 50% in 2023 compared to 2022.\n\nSpanish lenders are mainly retail lenders and are benefiting from rising interest rates, with higher returns on their loans, driven mainly by floating rate credit, while keeping deposit costs under control.\n\nIn Caixabank\'s case, yields on loans rose 48 basis points in the quarter to 4.23% whi

In [5]:
# source of the data
data[0].metadata

{'source': 'https://www.reuters.com/business/finance/caixabanks-q3-net-profit-rises-70-same-period-2022-2023-10-27/'}

## Text Splitters

> To keep input text within the token limit of a large language model (LLM), the text has to be broken into smaller chunks. Besides manual splitting, LangChain provides two text splitters that are used in this notebook. *Character splitting* segments the text on a single specified separator. *Recursive splitting* takes a list of separators and tries them in order, falling back to the next one whenever a piece is still too large. This is particularly valuable for long texts, where a single delimiter may not produce chunks of the desired size. Note that both splitters measure `chunk_size` in characters by default, not in tokens.

- ### CharacterTextSplitter


In [6]:
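# create CharacterTextSplitter
splitter = CharacterTextSplitter(
    separator="\n",   # the single separator to split on
    chunk_size=200,   # target maximum chunk size, in characters
    chunk_overlap=0   # no characters shared between consecutive chunks
)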

In [7]:
chunks = splitter.split_text(data[0].page_content)
len(chunks)

Created a chunk of size 224, which is longer than the specified 200
Created a chunk of size 206, which is longer than the specified 200
Created a chunk of size 231, which is longer than the specified 200
Created a chunk of size 215, which is longer than the specified 200
Created a chunk of size 251, which is longer than the specified 200


20

In [8]:
for chunk in chunks:
    print(len(chunk))

134
175
92
224
206
231
93
215
251
68
167
82
129
182
153
196
196
122
114
195
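Five of the chunks above are longer than the requested 200 characters, which matches the five warnings printed by the splitter. The following is a minimal pure-Python sketch of why that can happen with a single separator. It is an illustration only, not LangChain's actual implementation, and the function name `split_on_separator` is hypothetical: pieces are merged greedily up to `chunk_size`, but a piece that is already too long cannot be cut any further.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Illustrative sketch: split on one separator, then greedily merge pieces."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # With only one separator there is no way to cut this piece,
            # so it is kept whole even if it exceeds chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Toy input: the middle line (30 characters) is longer than chunk_size (20).
text = "short line\n" + "x" * 30 + "\nanother short line"
chunks = split_on_separator(text, "\n", chunk_size=20)
sizes = [len(c) for c in chunks]
print(sizes)
```

In this toy input the 30-character line contains no newline, so it ends up as an oversized chunk on its own, just like the 224, 206, 231, 215 and 251 character chunks above.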


- ### RecursiveCharacterTextSplitter

In [9]:
r_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " "],  # separators tried in order (the library default is ["\n\n", "\n", " ", ""])
    chunk_size=200,  # maximum size of each chunk, in characters
    chunk_overlap=0,  # number of characters shared between consecutive chunks to preserve context
)

In [10]:
chunks = r_splitter.split_text(data[0].page_content)
len(chunks)

25

In [11]:
for chunk in chunks:
    print(len(chunk))

140
178
93
198
25
197
8
197
33
93
196
18
198
52
68
167
82
129
182
153
196
148
174
114
196
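Unlike the single-separator run, every chunk size above stays at or below 200. The sketch below illustrates the fallback idea in plain Python under simplifying assumptions (no overlap, separators dropped at cut points). It is not LangChain's code, and `recursive_split` is a hypothetical helper name: when a piece is still too long for the current separator, it is split again with the next, finer separator.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Illustrative sketch: split with the first separator, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: flush the chunk built so far.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


# Toy input: the middle line is 39 characters long but contains spaces.
text = "short line\n" + " ".join(["word"] * 8) + "\nanother short line"
chunks = recursive_split(text, ["\n\n", "\n", " "], chunk_size=20)
sizes = [len(c) for c in chunks]
print(sizes)
```

Here the long middle line is broken on spaces into two pieces that fit, so no chunk exceeds the limit. The cost is that chunks can end mid-sentence, which is the trade-off visible in the short 8, 18 and 25 character chunks in the output above.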
