# Understanding Text Splitters
- Tested on SageMaker Studio with the Data Science 3.0 image on an ml.t3.medium instance.

---


Reference:
- [Using langchain for Question Answering on Own Data](https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed)
- [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)

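- This text splitter is the one recommended for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. The effect is to keep paragraphs (then sentences, then words) together for as long as possible, since those generally appear to be the most strongly semantically related pieces of text.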
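The "try each separator in order" idea described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, there is no overlap handling, and separators are dropped between chunks rather than preserved.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the coarsest separator that occurs in the
    text, merge pieces greedily, and recurse on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator that occurs in the text ("" always matches).
    sep = next((s for s in separators if s == "" or s in text), None)
    if sep is None:
        return [text]  # nothing left to split on: return the oversized text
    finer = separators[separators.index(sep) + 1:]

    pieces = list(text) if sep == "" else [p for p in text.split(sep) if p]

    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee fff ggg", chunk_size=11))
# ['aaa bbb ccc', 'ddd eee fff', 'ggg']
```

The first paragraph fits in one chunk, so it is kept whole. The second paragraph is too long, so only that paragraph is split again on the next separator that applies (a space).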

In [2]:
%load_ext autoreload
%autoreload 2

import sys, os
module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import print_ww



# 1. Understanding RecursiveCharacterTextSplitter and CharacterTextSplitter with simple examples

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

The text below has exactly 26 characters, so it fits within chunk_size and is returned as a single 26-letter chunk.

## Understanding RecursiveCharacterTextSplitter

In [4]:
# Recursive text Splitter
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)
# Output - ['abcdefghijklmnopqrstuvwxyz']


['abcdefghijklmnopqrstuvwxyz']

The text is longer than 26 characters (33), so it is split into two chunks that share 4 overlapping characters.

In [5]:
# Recursive text Splitter
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)
# Output - ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']


['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
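For text with no usable separator, the overlap can be pictured as a sliding window over characters. The sketch below is a plain-Python illustration that reproduces the two outputs above; `sliding_chunks` is a hypothetical helper, and the library's real merging logic is more involved.

```python
def sliding_chunks(text, size, overlap):
    """Cut fixed-size windows; each window starts `overlap` characters
    before the end of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += size - overlap
    return chunks


print(sliding_chunks("abcdefghijklmnopqrstuvwxyz", 26, 4))
# ['abcdefghijklmnopqrstuvwxyz']
print(sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second chunk starts at index 26 - 4 = 22, which is why it begins with 'wxyz'.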

With spaces (' ') counted toward the length, the text is split into chunks of at most 26 characters.

In [6]:
# Recursive text Splitter
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)
# output - ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']


['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

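## Understanding CharacterTextSplitter

When no separator is specified, CharacterTextSplitter falls back to its default separator, "\n\n". text3 contains no "\n\n", so it cannot be split: the chunk size of 26 is exceeded and the entire text (51 characters) is returned as a single chunk.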

In [7]:
# Character Text Splitter
print("length of text3: ", len(text3))
c_splitter.split_text(text3)
# output - ['a b c d e f g h i j k l m n o p q r s t u v w x y z']


length of text3:  51


['a b c d e f g h i j k l m n o p q r s t u v w x y z']
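A plain `str.split` shows why the result is a single oversized chunk. This is only an illustration of the splitting step, assuming the splitter's default separator is "\n\n"; it does not call LangChain.

```python
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# The separator "\n\n" never occurs, so splitting yields one piece: the whole text.
pieces = text3.split("\n\n")
print(len(pieces), len(pieces[0]))
# 1 51

# Splitting on a space instead yields 26 one-letter pieces that can be regrouped.
print(len(text3.split(" ")))
# 26
```

A splitter that works with a single separator can only regroup the pieces that separator produces, so a piece longer than chunk_size stays as it is.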

When the separator is set to a space, the text is split on spaces and the pieces are merged into chunks that stay within the chunk size of 26.

In [8]:
# Character Text Splitter with separator defined
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)
# Output - ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
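The output above can be reproduced with a small greedy merge written in plain Python. This is a sketch of the idea, not LangChain's code: `merge_with_overlap` is a hypothetical name, and it assumes that no single word is longer than chunk_size.

```python
def merge_with_overlap(words, chunk_size, chunk_overlap, sep=" "):
    """Greedily join words into chunks of at most chunk_size characters,
    carrying over a tail of at most chunk_overlap characters."""
    chunks, current = [], []
    for word in words:
        if current and len(sep.join(current + [word])) > chunk_size:
            chunks.append(sep.join(current))
            # Drop words from the front until the carried-over tail fits the
            # overlap budget and leaves room for the new word.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [word])) > chunk_size
            ):
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(sep.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_with_overlap(text3.split(" "), chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Each chunk holds 13 letters (25 characters with spaces), because a 14th letter would bring the length to 27. The carried-over tail is 'l m' (3 characters), the longest tail that fits within the overlap of 4.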

# 2. Splitting real sentences

In [9]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space. \nThis is the last sentence"""

print_ww("some_text: \n\n", some_text)


print("\nlen: ", len(some_text)) #  -> 522


some_text:

 When writing documents, writers will use document structure to group content. This can convey to
the reader, which idea's are related. For example, closely related ideas are in sentances. Similar
ideas are in paragraphs. Paragraphs form a document.

 Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are
the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,
have a space.and words are separated by space.
This is the last sentence

len:  522


## RecursiveCharacterTextSplitter
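- The text is 522 characters long and chunk_size is 450, so the splitter tries the separators "\n\n", "\n", " ", "" in that order to keep each chunk within 450 characters. The text is split at "\n\n", so the second chunk starts at "Paragraphs".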

In [10]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

docs = r_splitter.split_text(some_text)
print("len(docs): ", len(docs)) 
print_ww("docs: \n", docs)



len(docs):  2
docs:
 ["When writing documents, writers will use document structure to group content. This can convey to
the reader, which idea's are related. For example, closely related ideas are in sentances. Similar
ideas are in paragraphs. Paragraphs form a document.", 'Paragraphs are often delimited with a
carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in
this string. Sentences have a period at the end, but also, have a space.and words are separated by
space. \nThis is the last sentence']


## CharacterTextSplitter
- The first chunk is cut at the last space separator that keeps it within 450 characters, so the second chunk starts at "have".

In [11]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
docs = c_splitter.split_text(some_text)
print("len(docs): ", len(docs)) 
print("docs: \n", docs)

len(docs):  2
docs: 
 ['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,', 'have a space.and words are separated by space. \nThis is the last sentence']
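The "last space within the limit" rule can be checked on a short string with `str.rfind`. This is a minimal sketch with a hypothetical helper, `cut_at_last_space`, and a limit of 48 chosen only for illustration; it handles a single cut and no overlap.

```python
def cut_at_last_space(text, limit):
    """Split text once, at the last space that keeps the head within limit."""
    if len(text) <= limit:
        return text, ""
    # A space at index `limit` still gives a head of exactly `limit` characters.
    cut = text.rfind(" ", 0, limit + 1)
    if cut == -1:
        return text, ""  # no space within the limit: leave the text uncut in this sketch
    return text[:cut], text[cut + 1:]


sentence = "Sentences have a period at the end, but also, have a space."
head, tail = cut_at_last_space(sentence, 48)
print(repr(head))
# 'Sentences have a period at the end, but also,'
print(repr(tail))
# 'have a space.'
```

The head is 45 characters long. Adding " have" would bring it to 50, which exceeds the limit, so the cut falls in the middle of the sentence rather than at a sentence boundary.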


# 3. Splitting a long real document (State of the Union)

In [12]:
# This is a long document we can split up.
with open('data/state_of_the_union.txt') as f:
    state_of_the_union = f.read()

## RecursiveCharacterTextSplitter 
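- Split with chunk_size = 1000. To keep each chunk within 1000 characters, the splitter tries the separators "\n\n", "\n", ".", " ", "" in that order, cuts at the first one that applies, and then moves on to the next chunk.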

In [13]:
text_splitter = RecursiveCharacterTextSplitter(
    # Cap each chunk at 1000 characters.
    chunk_size = 1000,
    chunk_overlap  = 0,
    separators=["\n\n", "\n", ".", " ", ""],
    length_function = len,
    is_separator_regex = False,
)
texts = text_splitter.create_documents([state_of_the_union])

# show_context_used(texts)
from utils.rag import show_context_used

show_context_used(texts, limit=5)

-----------------------------------------------
1. Chunk: 970
-----------------------------------------------
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and
the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.
And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he
could make it bend to his menacing ways. But he badly miscalculated.He thought he could roll into
Ukraine and the world would roll over. Instead he met a wall of strength he never imagined.He met
the Ukrainian people.From President Zelenskyy to every Ukrainian, their fearlessness, their courage,
their determination, inspires the world.Let each of us here tonight in this Chamber send an
unmistakable signal to Ukraine and to the world.Please rise if you are able and sho

## CharacterTextSplitter
- Split with chunk_size = 500, using a space as the separator.

In [14]:
text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    separator = ' '
)

texts = text_splitter.create_documents([state_of_the_union])

show_context_used(texts)

-----------------------------------------------
1. Chunk: 498
-----------------------------------------------
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and
the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.
And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he
could make it bend to his menacing ways. But he badly miscalculated.He thought
-----------------------------------------------
2. Chunk: 495
-----------------------------------------------
he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never
imagined.He met the Ukrainian people.From President Zelenskyy to every Ukrainian, their
fearlessness, their courage, their determination, inspires the world.Let each of us here tonig