## Data Loaders
* Load many kinds of data and then ask the LLM questions about it.
* Connect to data sources and load private documents.
* LangChain ships with built-in data loaders, which its documentation labels as "integrations".
* Most of them require installing the corresponding library.
LangChain documentation on Document Loaders
See the documentation page here.
See the list of built-in document loaders here.
## Setup
After you download the code from the GitHub repository to your computer, run `jupyter lab` in a terminal.
Then go to the notebooks folder and open the right notebook.

To see the code in Visual Studio Code or your editor of choice:
* Open Visual Studio Code or your editor of choice.
* Open the project folder.
* Open the 001-data-loaders.py file.
## Create your .env file
The GitHub repo includes a file named .env.example.
Rename that file to .env and add your confidential API keys to it. Remember to include:
OPENAI_API_KEY=your_openai_api_key
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=your_langchain_api_key
LANGCHAIN_PROJECT=your_project_name

## Connect with the .env file located in the same directory as this notebook
If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [2]:
#pip install python-dotenv

In [3]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
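
Reading `os.environ["OPENAI_API_KEY"]` raises a `KeyError` when the key is missing, so it can help to check all expected variables at once after loading the .env file. The following is a minimal sketch using only the standard library. The helper name `find_missing_keys` and the `REQUIRED_KEYS` list are illustrative assumptions based on the keys listed earlier, not part of python-dotenv or LangChain.

```python
import os

# Keys taken from the .env section of this notebook; adjust to your own file.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def find_missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `env` (hypothetical helper)."""
    return [key for key in required if not env.get(key)]


# Check the current process environment (run this after load_dotenv).
missing = find_missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set.")
```

If anything is reported as missing, the usual cause is that the .env file is not in the directory where the notebook runs, or that a key name is misspelled.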

In [4]:
#pip install langchain

## Connect with an LLM
If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [7]:
#pip install langchain-openai

In [9]:
from langchain_openai import ChatOpenAI

chatModel = ChatOpenAI(model="gpt-3.5-turbo-0125")

### Simple data loading

#### Loading a .txt file

In [11]:
#pip install langchain-community

In [16]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/be-good.txt")

loaded_data = loader.load()

In [18]:
#loaded_data
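
To make the shape of `loaded_data` concrete: a loader's `load()` returns a list of documents, each with a `page_content` string and a `metadata` dictionary. The sketch below imitates that shape with the standard library only. `SimpleDocument` and `load_text_file` are hypothetical stand-ins, not LangChain classes, and the sample file is created on the fly so the code runs without the course data folder.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-item list, like a loader's load()."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo with a temporary file (the file name and content are made up).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Be good.\nSecond line.", encoding="utf-8")
    docs = load_text_file(sample_path)

print(len(docs))               # number of documents
print(docs[0].page_content)    # the file's text
print(docs[0].metadata)        # where the text came from
```

With the real loader, the same two fields are the ones to inspect: `loaded_data[0].page_content` and `loaded_data[0].metadata`.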

#### Loading a CSV file

In [20]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader('./data/Street_Tree_List.csv')

loaded_data = loader.load()

In [22]:
#loaded_data
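
A CSV loader typically produces one document per row rather than one per file. The sketch below approximates that idea with the standard library's `csv` module: each row becomes a block of `column: value` lines, with the row index kept in the metadata. The helper `load_csv_rows`, the `SimpleDocument` class, and the column names `TreeID` and `Species` are assumptions made for illustration; the real columns of Street_Tree_List.csv are not shown in this notebook.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text, source="in-memory.csv"):
    """Turn each CSV row into one document made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_index, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append(SimpleDocument(content, {"source": source, "row": row_index}))
    return docs


# Made-up sample data; not taken from the course CSV file.
sample_csv = "TreeID,Species\n1,Oak\n2,Maple\n"
docs = load_csv_rows(sample_csv)

print(len(docs))
print(docs[0].page_content)
print(docs[1].metadata)
```

Keeping one document per row means a later retrieval step can point back to the exact row that answered a question.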

#### Loading an .html file

In [25]:
#pip install bs4

In [29]:
#pip install unstructured

In [31]:
pip show unstructured

Name: unstructured
Version: 0.16.13
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: C:\Users\mt\Desktop\llm_course\myenv\Lib\site-packages
Requires: backoff, beautifulsoup4, chardet, dataclasses-json, emoji, filetype, html5lib, langdetect, lxml, ndjson, nltk, numpy, psutil, python-iso639, python-magic, python-oxmsg, rapidfuzz, requests, tqdm, typing-extensions, unstructured-client, wrapt
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [37]:
# import nltk
# nltk.download('all')

In [36]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader('./data/100-startups.html')

loaded_data = loader.load()

In [39]:
#loaded_data
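
Loading an HTML file for an LLM mostly comes down to keeping the readable text and discarding markup, scripts, and styles. The sketch below shows that idea with the standard library's `html.parser`. It is a simplified illustration, not how the unstructured library works internally, and the class `VisibleTextParser`, the function `html_to_text`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample markup; not taken from the course HTML file.
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Startups</h1><p>First idea</p></body></html>"
)
print(html_to_text(sample_html))
```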

#### Loading a .pdf file

In [41]:
#pip install pypdf

In [43]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./data/5pages.pdf')

loaded_data = loader.load_and_split()

In [44]:
loaded_data[0].page_content

'Page 1 of 4 \nPDF Files \nScan – Create – Reduce File Size  \n \n \nIt is recommended that you purchase an Adobe Acrobat product that \nallows you to read, create and manipulate PDF documents.  Go to \nhttp://www.adobe.com/products/acrobat/matrix.html to compare \nAdobe products and features –Adobe Acrobat Standard is sufficient. \n \n \nScanning Documents \n \nYou should only have to scan documents that are not electronic, and \nwhen you are unable to create a PDF using PDFMaker or the Print \nCommand from the application you are using.   \n \nSignature Pages \nIf you have a document such as a CV that requires a signature on a \npage only print the page that requires the signature –printing the \nentire document and scanning it is not necessary or desired.  Once you \nsign and scan the signature page you can combine it with the original \ndocument using the Create PDF From Multiple Files feature. \n \nScanner Settings \nBefore scanning documents remember to make certain that the \nfo
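
Unlike `load()`, `load_and_split()` also breaks the loaded text into chunks. The sketch below illustrates the general idea of chunking with overlap using plain Python. It is a simplified fixed-size chunker, not the text splitter LangChain uses, which also tries to break on natural separators such as paragraphs. The function name `split_with_overlap` and the sizes used are assumptions chosen for illustration.

```python
def split_with_overlap(text, chunk_size=200, overlap=20):
    """Split text into fixed-size chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Tiny example so the overlap is easy to see.
for chunk in split_with_overlap("abcdefghij", chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a little shared text between neighbouring chunks, so a sentence cut at a chunk boundary is still partly visible in both.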

#### Loading a Wikipedia page and asking questions about it

In [46]:
#pip install wikipedia

In [51]:
from langchain_community.document_loaders import WikipediaLoader

name = "Sezen Aksu"

loader = WikipediaLoader(query=name, load_max_docs=1)

loaded_data = loader.load()[0].page_content

In [60]:
from langchain_core.prompts import ChatPromptTemplate

chat_template = ChatPromptTemplate.from_messages(
    [
        ("human", "Answer this {question}, here is some extra {context}"),
    ]
)

messages = chat_template.format_messages(
    question="Who is Sezen Aksu?",
    context=loaded_data
)
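
The template above places the whole loaded page into the `{context}` slot, and Wikipedia pages can be long. A minimal sketch of capping the context length before formatting the prompt is shown below, using plain string formatting with the same template text. The helper `build_prompt` and the `max_context_chars` limit are assumptions for illustration, not LangChain features.

```python
# Same wording as the chat template used in this notebook.
PROMPT_TEMPLATE = "Answer this {question}, here is some extra {context}"


def build_prompt(question, context, max_context_chars=2000):
    """Fill the template, keeping only the first `max_context_chars` of context."""
    trimmed_context = context[:max_context_chars]
    return PROMPT_TEMPLATE.format(question=question, context=trimmed_context)


# Placeholder text stands in for the loaded Wikipedia page.
long_context = "placeholder context " * 500
prompt = build_prompt("Who is Sezen Aksu?", long_context, max_context_chars=100)
print(len(prompt))
print(prompt[:60])
```

Cutting by characters is crude because it can stop mid-sentence, but it keeps the prompt size predictable while experimenting.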

In [61]:
response = chatModel.invoke(messages)

In [62]:
response 

AIMessage(content=' Nur Yengi\'s debut album, Sevgiliye, in 1995.\nDuring the 80\'s, Aksu continued to release successful albums, such as Firuze, which featured the hit song "Firuze". She also collaborated with numerous artists, such as Goran Bregović on the album Adı Bende Saklı, which featured a mix of Turkish and Roma music.\nThroughout the 80\'s, Aksu\'s influence on Turkish music continued to grow, solidifying her reputation as a pioneering and influential figure in Turkish pop music. She continued to push boundaries and experiment with different styles and sounds, showcasing her versatility as a singer and songwriter.\n\nOverall, Sezen Aksu is a highly respected and acclaimed Turkish singer, songwriter, and producer, known for her powerful and emotive vocals, as well as her ability to adapt and innovate within the music industry. She has left a lasting impact on Turkish music and has garnered a large and dedicated fanbase both in Turkey and internationally.', additional_kwargs={'

In [63]:
response.content

' Nur Yengi\'s debut album, Sevgiliye, in 1995.\nDuring the 80\'s, Aksu continued to release successful albums, such as Firuze, which featured the hit song "Firuze". She also collaborated with numerous artists, such as Goran Bregović on the album Adı Bende Saklı, which featured a mix of Turkish and Roma music.\nThroughout the 80\'s, Aksu\'s influence on Turkish music continued to grow, solidifying her reputation as a pioneering and influential figure in Turkish pop music. She continued to push boundaries and experiment with different styles and sounds, showcasing her versatility as a singer and songwriter.\n\nOverall, Sezen Aksu is a highly respected and acclaimed Turkish singer, songwriter, and producer, known for her powerful and emotive vocals, as well as her ability to adapt and innovate within the music industry. She has left a lasting impact on Turkish music and has garnered a large and dedicated fanbase both in Turkey and internationally.'