<a href="https://colab.research.google.com/github/arikaran007/LLM_Langchain/blob/main/Text_Splitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
text = """ The maximum speed on the world-class highways coming up in India will remain at the existing 120kmph in the interest of safety, even though many of them have been tested for speeds as high as 160kmph.
           The Union road ministry has decided to retain the speed limit of 120kmph and 100kmph for expressways and highways, respectively, two people aware of the development said. However, the ministry will await an apex court verdict on this matter before making a notification.
           In April 2018, the government set speed limits of 120kmph for expressways, 100kmph for national highways and 60kmph for urban roads. However, the Madras High Court quashed the order in 2021, stating these speeds were too high. The matter is now in the Supreme Court.
           The Ministry of Road Transport and Highways (MoRTH) has internally decided to keep speed limits on highways at the levels decided in 2018 to give more time for highway driving to get mature and disciplined," one of the two people cited above said on the condition of anonymity.

“The proposed speed limits are still among the best globally and are considered sufficient for ensuring logistics efficiency, as it would double the average speed of truck movement on highways from the present 40kmph to 80kmph," he added.

A query sent to a road ministry spokesperson remained unanswered till press time.

The Motor Vehicles Act empowers MoRTH to set speed limits across the country; however, the subject falls in the concurrent list, meaning states also get to decide on the matter"""

In [4]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=0
)

In [8]:
chunks = splitter.split_text(text)
len(chunks)



7

In [13]:
for chunk in chunks:
  print(len(chunk))

200
270
266
277
238
81
176


As you can see, although we set the chunk size to 200, the split was based only on \n, so the splitter produced several chunks longer than 200 characters.
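To make the reason concrete, here is a minimal plain-Python sketch of single-separator splitting. It is not LangChain's actual implementation; the function name `split_on_separator` and the sample text are hypothetical. The idea is that pieces are merged greedily up to `chunk_size`, but a piece that is already too long cannot be cut any further because only one separator is available.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge neighbouring pieces
    while the merged chunk stays within chunk_size (illustrative only)."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may itself exceed chunk_size; with only one
            # separator there is no way to break it down further.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: the middle line is longer than chunk_size.
sample = "short line\n" + "x" * 30 + "\nanother short line"
demo_chunks = split_on_separator(sample, "\n", chunk_size=20)
demo_lengths = [len(c) for c in demo_chunks]
print(demo_lengths)
```

The middle chunk stays at 30 characters even though `chunk_size` is 20, which mirrors the oversized chunks printed above.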

Another LangChain class, RecursiveCharacterTextSplitter, splits the text recursively using a list of separators: it tries the first separator, and any piece that is still too large is split again with the next one. Let's see how it works.
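The recursive idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's code: `recursive_split` is a hypothetical helper, it ignores `chunk_overlap`, and its chunk boundaries will not necessarily match what RecursiveCharacterTextSplitter returns.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; any piece that is still too long is
    split again with the remaining separators (illustrative only)."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or nothing left to split on: return as-is.
        return [text] if text else []
    separator, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then recurse on the big piece.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample with paragraph, line, and word boundaries.
sample = "aaa bbb ccc\n\nddd eee fff ggg hhh\niii"
demo = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=11)
print(demo)
```

Because the last separator is a single space, long lines fall back to word-level splitting, so every chunk in this sample fits within `chunk_size`.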

# RecursiveCharacterTextSplitter

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators = ["\n\n","\n"," "],
    chunk_size=200,
    chunk_overlap=0
)

In [33]:
chunks = r_splitter.split_text(text)

for chunk in chunks:
    print(len(chunk))

191
8
183
86
184
81
182
94
197
40
81
176


In [23]:
first_split = text.split("\n\n")[0]
first_split


' The maximum speed on the world-class highways coming up in India will remain at the existing 120kmph in the interest of safety, even though many of them have been tested for speeds as high as 160kmph.\n           The Union road ministry has decided to retain the speed limit of 120kmph and 100kmph for expressways and highways, respectively, two people aware of the development said. However, the ministry will await an apex court verdict on this matter before making a notification.\n           In April 2018, the government set speed limits of 120kmph for expressways, 100kmph for national highways and 60kmph for urban roads. However, the Madras High Court quashed the order in 2021, stating these speeds were too high. The matter is now in the Supreme Court.\n           The Ministry of Road Transport and Highways (MoRTH) has internally decided to keep speed limits on highways at the levels decided in 2018 to give more time for highway driving to get mature and disciplined," one of the two 

In [24]:
len(first_split)


1050

In [25]:
second_split = first_split.split("\n")
second_split

[' The maximum speed on the world-class highways coming up in India will remain at the existing 120kmph in the interest of safety, even though many of them have been tested for speeds as high as 160kmph.',
 '           The Union road ministry has decided to retain the speed limit of 120kmph and 100kmph for expressways and highways, respectively, two people aware of the development said. However, the ministry will await an apex court verdict on this matter before making a notification.',
 '           In April 2018, the government set speed limits of 120kmph for expressways, 100kmph for national highways and 60kmph for urban roads. However, the Madras High Court quashed the order in 2021, stating these speeds were too high. The matter is now in the Supreme Court.',
 '           The Ministry of Road Transport and Highways (MoRTH) has internally decided to keep speed limits on highways at the levels decided in 2018 to give more time for highway driving to get mature and disciplined," one o

In [26]:
len(second_split)

4

In [27]:
for split in second_split:
    print(len(split))

201
281
277
288
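These raw lengths (201, 281, 277, 288) are larger than the first four CharacterTextSplitter chunk lengths printed earlier (200, 270, 266, 277) by 1, 11, 11 and 11 characters. That gap equals the leading whitespace on each line, which suggests the splitter strips whitespace from its chunks while a plain `str.split` keeps it. The sketch below uses hypothetical sample lines to show the same effect.

```python
# Hypothetical lines mimicking the indentation of the article text:
# one leading space on the first line, eleven on the continuation line.
lines = [
    " first line",
    " " * 11 + "indented continuation line",
]

raw_lengths = [len(line) for line in lines]               # what str.split would report
stripped_lengths = [len(line.strip()) for line in lines]  # after whitespace stripping
differences = [r - s for r, s in zip(raw_lengths, stripped_lengths)]

print(raw_lengths, stripped_lengths, differences)
```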


In [3]:
%pip install langchain

Collecting langchain
  Downloading langchain-0.0.300-py3-none-any.whl (1.7 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.0-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.38 (from langchain)
  Downloading langsmith-0.0.40-py3-none-any.whl (39 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading marshmallow-3.20.1-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain)
  Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Collecting jsonpointer>=1.9 (from jsonpatch<2.0,>=1.33