# 2b - Web Scraping CSV to docs_text delta table
* Notebook by Adam Lang
* Notebook was adapted from the Databricks webinar in June 2024 that streamed on the Databricks YouTube channel.
  * This is the alternative second notebook and the first step after creating the tables in Unity Catalog. ONLY USE THIS NOTEBOOK if you are uploading a CSV file that you created beforehand, for example from web scraping or another data source. If your source is a PDF file, use the previous notebook instead.

* Date: 4/30/2025

## 1. Install Dependencies
* Install `langchain`, then restart the Python process so the new package is picked up.

In [0]:
%pip install langchain
dbutils.library.restartPython()

## 2. Extract, Split and Chunk Text

In [0]:
import os 
from langchain.text_splitter import RecursiveCharacterTextSplitter

## 1. Read in csv text file with scraped web text
df = spark.read.text("<insert csv file path here.csv>")

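## 2. Collect all text into single string
## Note: spark.read.text returns one row per line, so joining with " " drops the line breaks
## and the "\n\n" and "\n" separators below will not match. Use "\n".join(...) to keep them.
## collect() also pulls every row to the driver, so this suits small files only.
text_col = " ".join([row.value for row in df.collect()])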

## 3. Setup Text splitter
## init text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, # character length
                                               chunk_overlap=200,  # Overlap
                                               length_function=len,  # Character length with len()
                                               separators=["\n\n", "\n", ".", "?", "!", " ", ""] # adding ChromaDB separators
)

## create chunks from text column
chunks = text_splitter.split_text(text_col)
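
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is only an illustration: `sliding_window_chunks` is a hypothetical helper, not part of LangChain, and unlike `RecursiveCharacterTextSplitter` it ignores separators and simply cuts on character positions.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Hypothetical helper: cut text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters after the last
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return windows


# Synthetic 2,500-character sample (an assumption for illustration, not real scraped text)
sample = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(sample)
print([len(c) for c in demo_chunks])
```

With these settings each chunk repeats the last 200 characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.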


## 3. Create Pandas UDF to chunk text for insert to delta table

In [0]:
from pyspark.sql.functions import pandas_udf 
from pyspark.sql.types import ArrayType, StringType
import pandas as pd 

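# Pandas UDF that returns the full list of chunks for every input row
@pandas_udf("array<string>")
def get_chunks(dummy: pd.Series) -> pd.Series:
    # A pandas UDF must return one value per input row, so repeat the list len(dummy) times
    return pd.Series([chunks] * len(dummy))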

# Register the UDF so it can be called from SQL
spark.udf.register("get_chunks", get_chunks)

## 4. Load/Insert chunked data into docs_text delta table

In [0]:
%sql

insert into workspace.llm_rag_demos.docs_text (text)
select explode(get_chunks('dummy')) as text;