In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
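# load_dotenv() already copies the values from .env into os.environ,
# so only check that the required keys are present.
for key in ('OPENAI_API_KEY', 'ACTIVELOOP_TOKEN'):
    if not os.environ.get(key):
        raise RuntimeError(f'{key} is not set; add it to your .env file')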

## TextLoader
Import TextLoader from langchain.document_loaders to load a plain-text file as a document.

Use the encoding argument to change the encoding used to read the file (for example, encoding="ISO-8859-1").

In [4]:
from langchain.document_loaders import TextLoader

loader = TextLoader('my_file.txt')
documents = loader.load()
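
As a minimal illustration of why the encoding argument matters, the standard-library sketch below writes a small ISO-8859-1 file and shows that reading it as UTF-8 fails while reading it with the matching encoding succeeds. The file name latin1_example.txt and the helper read_text are hypothetical, and TextLoader itself is not used here.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file using the given encoding."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Write a small file encoded as ISO-8859-1 (Latin-1).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "latin1_example.txt")
with open(path, "w", encoding="ISO-8859-1") as handle:
    handle.write("Automóviles de turismo")

# Decoding the same bytes as UTF-8 fails on the accented character.
try:
    read_text(path)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding recovers the original text.
text = read_text(path, encoding="ISO-8859-1")
```

With TextLoader, the corresponding call is TextLoader('my_file.txt', encoding="ISO-8859-1").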


## PyPDFLoader (PDF)
LangChain provides several loaders for PDF files, including PyPDFLoader and PDFMinerLoader. This section focuses on the former, which loads a PDF into a list of documents, each holding the page content and metadata that includes the page number. PyPDFLoader depends on the pypdf package, so install it first with pip (pip install pypdf).

In [8]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/Informe-Turismos-Enero-2024.pdf")
pages = loader.load_and_split()

print(pages[0])

page_content='Matriculaciones de Automóviles de turismo\nEnero 2024\n2024 2023 %24/23 2024 2023 %24/23\n68.685 64.038 7,3% 68.685 64.038 7,3%\nTop 10 Automóviles de turismo\nEnero-Enero\n1º TOYOTA 7.615 TOYOTA 7.615 DACIA SANDERO 2.233 DACIA SANDERO 2.233\n2º SEAT 5.073 SEAT 5.073 TOYOTA COROLLA 2.143 TOYOTA COROLLA 2.143\n3º KIA 4.594 KIA 4.594 MG ZS 1.626 MG ZS 1.626\n4º HYUNDAI 4.121 HYUNDAI 4.121 SEAT LEON 1.535 SEAT LEON 1.535\n5º PEUGEOT 3.811 PEUGEOT 3.811 TOYOTA YARIS CROSS 1.522 TOYOTA YARIS CROSS 1.522\n6º DACIA 3.780 DACIA 3.780 SEAT ARONA 1.501 SEAT ARONA 1.501\n7º BMW 3.593 BMW 3.593 HYUNDAI TUCSON 1.494 HYUNDAI TUCSON 1.494\n8º VOLKSWAGEN 3.405 VOLKSWAGEN 3.405 PEUGEOT 2008 1.465 PEUGEOT 2008 1.465\n9º MERCEDES 3.308 MERCEDES 3.308 SEAT IBIZA 1.371 SEAT IBIZA 1.371\n10º RENAULT 2.937 RENAULT 2.937 TOYOTA RAV 4 1.241 TOYOTA RAV 4 1.241\nAutomóviles de turismo: Detalle por carburante (Cuota)\nEne.      Feb. Mar.     Abr. May.     Jun. Jul.       Ago. Sep.     Oct.      Nov.

In [16]:
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator

In [17]:
model_name = 'gpt-3.5-turbo-instruct'
temperature = 0.0
model = OpenAI(model_name=model_name, temperature=temperature)

In [34]:
from datetime import datetime

# Define the desired data structure.
class Turismo(BaseModel):
    anio_actual: int = Field(description="Current year")
    valor_actual: int = Field(description="Number of sales in the current year")
    anio_pasado: int = Field(description="Year before the current one")
    valor_pasado: int = Field(description="Number of sales in the previous year")

    # Pydantic makes it easy to add custom validation logic.
    # Note: this compares against the year in which the code runs,
    # so a 2024 report only passes validation when run in 2024.
    @validator('anio_actual')
    def anio_is_actual(cls, field):
        if field != datetime.now().year:
            raise ValueError("Not the current year")
        return field

    # Optional sanity check, disabled in this run. If enabled, a misparsed
    # value such as 68 (from "68.685") fails validation.
    # @validator('valor_actual')
    # def valor_is_plausible(cls, field):
    #     if field < 1000:
    #         raise ValueError("The value is too small. Check the number format")
    #     return field
    
# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Turismo)
    
template = """
        I will give you a page from a document about passenger car (turismos) sales.
        Extract from the text the total number of sales for the month in question, for the current year and the previous year.

        text: {texto}
        \n{format_instructions}"""

# Alternative template (not used below) with an explicit hint about the number format.
template_bk = """
        I will give you a page from a document about passenger car (turismos) sales.
        Extract from the text the total number of sales for the month in question, for the current year and the previous year.

        Format: the number usually appears as an integer with a dot as the thousands separator.
        Example: 58.365 means 58365
        text: {texto}
        \n{format_instructions}"""

prompt = PromptTemplate(
    template=template,
    input_variables=["texto"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# Pass only the page text; pages[0] is a Document that also carries metadata.
_input = prompt.format_prompt(texto=pages[0].page_content)

output = model.invoke(_input.to_string())
parser.parse(output)

Turismo(anio_actual=2024, valor_actual=68, anio_pasado=2023, valor_pasado=64)
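
In the result above, valor_actual is 68 rather than 68685, which is consistent with the dot in 68.685 being read as a decimal point instead of a thousands separator. A minimal sketch of normalizing such values in plain Python follows. The helper name parse_thousands_int is hypothetical and is not part of LangChain or Pydantic.

```python
import re

# Either dot-grouped thousands (e.g. "68.685", "1.234.567") or plain digits.
_THOUSANDS_RE = re.compile(r"\d{1,3}(\.\d{3})*|\d+")


def parse_thousands_int(raw):
    """Convert an integer written with dots as thousands separators to int."""
    if isinstance(raw, int):
        return raw
    text = str(raw).strip()
    if not _THOUSANDS_RE.fullmatch(text):
        raise ValueError(f"Unexpected number format: {text!r}")
    return int(text.replace(".", ""))


current = parse_thousands_int("68.685")
previous = parse_thousands_int("64.038")
```

A helper like this could be applied to the raw figures before they are validated as integers.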

Testing the output-fixing parser

In [32]:
from langchain.output_parsers import OutputFixingParser

try:
    ans = parser.parse(output)
except Exception:
    print('trying to fix the output: calling the model again with the correct format')
    o_parse = OutputFixingParser.from_llm(parser=parser, llm=model)
    ans = o_parse.parse(output)

ans

trying to fix the output: calling the model again with the correct format


Turismo(anio_actual=2024, valor_actual=68685, anio_pasado=2023, valor_pasado=64038)
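
Conceptually, OutputFixingParser wraps another parser: when parsing fails, it asks the LLM to repair the malformed output and then parses the repaired text again. The sketch below mimics that control flow in plain Python. It is not LangChain's implementation; parse_sales, fake_fix, and parse_with_fix are hypothetical names, and fake_fix stands in for the LLM call.

```python
import json


def parse_sales(raw):
    """Parse a JSON string and reject implausibly small sales values."""
    value = int(json.loads(raw)["valor_actual"])
    if value < 1000:
        raise ValueError("The value is too small. Check the number format")
    return value


def fake_fix(raw, error):
    """Stand-in for the LLM repair step (hypothetical, hard-coded)."""
    return raw.replace("68.685", "68685")


def parse_with_fix(raw, parse, fix, max_retries=1):
    """Try to parse; on failure, repair the text and try again."""
    for attempt in range(max_retries + 1):
        try:
            return parse(raw)
        except ValueError as err:
            if attempt == max_retries:
                raise
            raw = fix(raw, str(err))


bad_output = '{"valor_actual": 68.685}'
fixed_value = parse_with_fix(bad_output, parse_sales, fake_fix)
```

The real parser differs in that the repair step is an LLM call that receives the format instructions, the failed output, and the error message.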

PyPDFLoader is simple to use and gives structured access to page content and metadata such as page numbers. Its main drawback is more limited text extraction than PDFMinerLoader.

## SeleniumURLLoader (URL)
SeleniumURLLoader loads HTML documents from a list of URLs that require JavaScript rendering. It depends on the selenium and unstructured packages, so install them first with pip (pip install selenium unstructured).

In [11]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])



page_content='OPENASSISTANT TAKES ON CHATGPT!\n\nBuscar\n\nInformación\n\nCompras\n\nVer más tarde\n\nCompartir\n\nCopiar enlace\n\nActivar el sonido\n\n2x\n\nSi la reproducción no empieza en breve, prueba a reiniciar el dispositivo.\n\nSiguiente\n\nEn directoPróximamente\n\nVer ahora\n\nMachine Learning Street Talk\n\nSuscribirme\n\nSuscrito\n\nNo has iniciado sesión\n\nLos vídeos que veas podrían aparecer en el historial de reproducciones de la TV e influir en las recomendaciones. Puedes evitarlo si cancelas e inicias sesión en YouTube desde tu ordenador.\n\nCompartir\n\nSe ha producido un error al recuperar la información de uso compartido. Vuelve a intentarlo más tarde.\n\n2:19\n\n2:19 / 59:51\n\nVer vídeo completo\n\n•\n\nDesliza hacia abajo para ver más detalles\n\nNaN / NaN\n\nNaN / NaN\n\nBuscar' metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s', 'title': 'OPENASSISTANT TAKES ON CHATGPT! - YouTube', 'description': 'Patreon: https://www.patreon.com/mlstDis

The SeleniumURLLoader class accepts the following arguments:

- urls (List[str]): List of URLs to load.
- continue_on_failure (bool, default=True): If True, continues loading the remaining URLs when one fails.
- browser (str, default="chrome"): Browser to use, either "chrome" or "firefox".
- executable_path (Optional[str], default=None): Path to the browser executable.
- headless (bool, default=True): If True, the browser runs in headless mode.

Set these arguments when creating the SeleniumURLLoader instance; for example, set browser to "firefox" to use Firefox instead of Chrome.
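
A minimal sketch of the Firefox case follows. The helper names selenium_loader_kwargs and make_selenium_loader are hypothetical. The import is deferred so that the keyword arguments can be inspected without Selenium installed, and actually creating the loader assumes that selenium, unstructured, and a working Firefox driver are available.

```python
def selenium_loader_kwargs(urls, browser="firefox", headless=True,
                           continue_on_failure=True):
    """Build keyword arguments for SeleniumURLLoader (hypothetical helper)."""
    if browser not in ("chrome", "firefox"):
        raise ValueError(f"Unsupported browser: {browser!r}")
    return {
        "urls": list(urls),
        "browser": browser,
        "headless": headless,
        "continue_on_failure": continue_on_failure,
    }


def make_selenium_loader(urls, **options):
    """Create the loader; requires the optional dependencies at call time."""
    # Deferred import: only needed when a loader is actually created.
    from langchain.document_loaders import SeleniumURLLoader
    return SeleniumURLLoader(**selenium_loader_kwargs(urls, **options))


kwargs = selenium_loader_kwargs(
    ["https://www.youtube.com/watch?v=TFa539R09EQ&t=139s"]
)
```

Calling make_selenium_loader(urls).load() would then return the documents in the same way as the Chrome example above.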

## Google Drive loader
The LangChain Google Drive Loader efficiently imports data from Google Drive by using the GoogleDriveLoader class. It can fetch data from a list of Google Docs document IDs or a single folder ID.

Prepare the necessary credentials and tokens:

- By default, GoogleDriveLoader looks for the credentials file at ~/.credentials/credentials.json. Use the credentials_path keyword argument to change this path.
- The token.json file follows the same principle and is created automatically the first time the loader is used.

To set up the credentials file, follow these steps:

1. Create a new Google Cloud Platform project, or use an existing one, in the Google Cloud Console. Ensure that billing is enabled for the project.
2. Enable the Google Drive API by navigating to its dashboard in the Google Cloud Console and clicking "Enable."
3. Create a service account from the Service Accounts page in the Google Cloud Console, following the prompts.
4. Assign the necessary roles to the service account, such as "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access," depending on your needs.
5. After creating the service account, open the "Actions" menu next to it, select "Manage keys," click "Add Key," and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer; this file serves as your credentials file.

Retrieve the folder or document ID from the URL:

- Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
- Document: https://docs.google.com/document/d/{document_id}/edit

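
The ID can also be pulled out of a URL programmatically. The sketch below uses regular expressions that match the two URL shapes above. extract_drive_id is a hypothetical helper, the example IDs are made up, and the assumption that IDs consist of letters, digits, hyphens, and underscores is an illustration rather than a documented guarantee.

```python
import re

# Assumed ID alphabet: letters, digits, hyphen, underscore.
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
_DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_drive_id(url):
    """Return (kind, id) for a Google Drive folder or Google Docs URL."""
    for kind, pattern in (("folder", _FOLDER_RE), ("document", _DOCUMENT_RE)):
        match = pattern.search(url)
        if match:
            return kind, match.group(1)
    raise ValueError(f"No folder or document ID found in: {url}")


folder = extract_drive_id(
    "https://drive.google.com/drive/u/0/folders/abc123_XYZ"
)
document = extract_drive_id(
    "https://docs.google.com/document/d/1a-B_c/edit"
)
```

The extracted value can then be passed as folder_id, or inside a list of document IDs, when creating the loader.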
Import the GoogleDriveLoader class:

In [12]:
from langchain.document_loaders import GoogleDriveLoader

In [13]:
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)

In [14]:
docs = loader.load()

ImportError: You must run `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib` to use the Google Drive loader.