## Streamlined Data Ingestion: Text, PyPDF, Selenium URL Loaders, and Google Drive Sync

### Introduction

The TextLoader handles plain text files, while the PyPDFLoader specializes in PDF files, offering easy access to content and metadata. SeleniumURLLoader is designed for loading HTML documents from URLs that require JavaScript rendering. Lastly, the Google Drive Loader provides seamless integration with Google Drive, allowing for the import of data from Google Docs or folders.

### TextLoader
Import the necessary loaders from `langchain.document_loaders`. Remember to install the required packages with the following command: `pip install langchain==0.0.208 deeplake==3.9.27 openai==0.27.8 tiktoken`.

In [None]:
%pip install langchain==0.0.208 deeplake==3.9.27 openai==0.27.8 tiktoken

In [2]:
from langchain.document_loaders import TextLoader

loader = TextLoader('sample1.txt')
documents = loader.load()
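
Each item returned by `loader.load()` is a document that carries the file's text in `page_content` and the file path under `metadata["source"]`. The sketch below imitates that shape using only the standard library, so it runs without LangChain installed. `SimpleDocument` and `load_text_file` are hypothetical names used for illustration; they are not part of LangChain's API.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields of a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read one text file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo on a throwaway file so the snippet does not depend on sample1.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample1.txt"
    sample_path.write_text("Hello, loaders!", encoding="utf-8")
    documents = load_text_file(sample_path)

print(len(documents))
print(documents[0].page_content)
```

A plain text file yields exactly one document, which is why downstream steps usually split it into smaller chunks.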

### PyPDFLoader (PDF)
LangChain provides loaders for PDF files such as `PyPDFLoader` and `PDFMinerLoader`. We mainly focus on the former, which loads a PDF file into a list of documents, where each document contains the page content and metadata with the page number. First, install the `pypdf` package with pip.


```
!pip install -q pypdf
```





In [5]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("sample2.pdf")
pages = loader.load_and_split()

print(pages[0])

page_content='PDF Form Example\nThis is an example of a user fillable PDF form. Normally PDF is used as a final publishing format. \nHowever PDF has an option to be used as an entry form that can be edited and saved by the user.\nThe fields of this form have been selected to demonstrate as many as possible of the common \nentry fields.\nThis document and PDF form have been created with OpenOffice (version 3.4.0).\nTo fill out the form, make sure the PDF file is not read-only. If the file is read-only save it first to a \nfolder or computer desktop. Close this file and open the saved file.\nPlease fill out the following fields. Important fields are marked yellow.\nGiven Name:\nFamily Name:\nAddress 1:   House nr:\nAddress 2:\nPostcode: City:  \nCountry:\nGender:\nHeight (cm):\nDriving License:\nI speak and understand (tick all that apply): \n      \nFavourite colour:\nImportant: Save the completed PDF form (use menu File - Save).\nDeutsch English Français Esperanto Latin' metadata={'sou

PyPDFLoader is simple to use and gives structured access to page content and metadata, such as page numbers. Its main disadvantage is more limited text extraction compared to PDFMinerLoader.
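
Because each page arrives as its own document, page-level lookups reduce to a loop over `page_content` and `metadata`. The following is a minimal sketch, assuming each document's metadata carries a zero-based `page` key next to `source`. `pages_mentioning` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the loader's output so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def pages_mentioning(pages, keyword):
    """Return the sorted, de-duplicated page numbers whose text contains keyword."""
    keyword = keyword.lower()
    hits = {
        doc.metadata["page"]
        for doc in pages
        if keyword in doc.page_content.lower()
    }
    return sorted(hits)


# Stand-in objects shaped like the loader's documents (illustrative data only).
pages = [
    SimpleNamespace(page_content="PDF Form Example",
                    metadata={"source": "sample2.pdf", "page": 0}),
    SimpleNamespace(page_content="Nothing relevant here",
                    metadata={"source": "sample2.pdf", "page": 1}),
    SimpleNamespace(page_content="Save the completed PDF form",
                    metadata={"source": "sample2.pdf", "page": 2}),
]

print(pages_mentioning(pages, "form"))  # [0, 2]
```

The set removes duplicates, which matters when a splitter breaks one page into several chunks that share the same page number.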

### SeleniumURLLoader (URL)
The `SeleniumURLLoader` class offers a robust yet user-friendly way to load HTML documents from a list of URLs that require JavaScript rendering. Start by installing the required packages with pip. The code has been tested with `unstructured` 0.7.7 and `selenium` 4.10.0; however, feel free to install the latest versions.

In [7]:
%pip install -q unstructured selenium

Instantiate the `SeleniumURLLoader` class by providing a list of URLs to load, for example:

In [9]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])

page_content="OPENASSISTANT TAKES ON CHATGPT!\n\n2x\n\nIf playback doesn't begin shortly, try restarting your device.\n\nUp next\n\nLive\n\nUpcoming\n\nPlay Now\n\nYou're signed out\n\nVideos you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.\n\nMachine Learning Street Talk\n\nSubscribe\n\nSubscribed\n\nShare\n\nAn error occurred while retrieving sharing information. Please try again later.\n\n2:19\n\n2:19 / 59:51•Watch full video" metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s'}


The SeleniumURLLoader class includes the following attributes:

- urls (List[str]): List of URLs to load.
- continue_on_failure (bool, default=True): Continues loading other URLs on failure if True.
- browser (str, default="chrome"): Browser selection, either "chrome" or "firefox".
- executable_path (Optional[str], default=None): Browser executable path.
- headless (bool, default=True): Browser runs in headless mode if True.

Customize these attributes during SeleniumURLLoader instance initialization, such as using Firefox instead of Chrome by setting the browser to "firefox":
```
loader = SeleniumURLLoader(urls=urls, browser="firefox")
```
Upon invoking the load() method, a list of Document instances containing the loaded content is returned. Each Document instance includes a page_content attribute with the extracted text from the HTML and a metadata attribute containing the source URL.

Bear in mind that SeleniumURLLoader may be slower than other loaders since it drives a real browser to render each page. Nevertheless, it is the better choice for pages that require JavaScript rendering.
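
The effect of `continue_on_failure` can be illustrated without launching a browser. The sketch below is a simplified model of the behavior described above, not SeleniumURLLoader's actual code. `load_pages`, `fake_fetch`, and the example URLs are hypothetical; the `fetch` callable stands in for the step that renders a page and returns its text.

```python
def load_pages(urls, fetch, continue_on_failure=True):
    """Fetch each URL in order; skip failures or re-raise them, depending on the flag."""
    documents = []
    for url in urls:
        try:
            text = fetch(url)
        except Exception as exc:
            if continue_on_failure:
                print(f"Skipping {url}: {exc}")
                continue
            raise
        documents.append({"page_content": text, "metadata": {"source": url}})
    return documents


def fake_fetch(url):
    """Illustrative fetcher that fails for one URL."""
    if "broken" in url:
        raise RuntimeError("page failed to render")
    return f"content of {url}"


docs = load_pages(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
)
print([d["metadata"]["source"] for d in docs])
```

With the flag left at its default, one bad URL costs a single document rather than the whole batch; passing `continue_on_failure=False` surfaces the first error instead.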

### Google Drive Loader
The LangChain Google Drive Loader efficiently imports data from Google Drive by using the `GoogleDriveLoader` class. It can fetch data from a list of Google Docs document IDs or a single folder ID.

Prepare necessary credentials and tokens:

- By default, the GoogleDriveLoader looks for the credentials file at `~/.credentials/credentials.json`. Use the `credentials_path` keyword argument to point to a different location.
- The token.json file follows the same principle and will be created automatically upon the loader's first use.
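
Before the first run, it can help to confirm that the files are where the loader will look. Here is a small check using only the standard library. The directory and file names come from the defaults described above, and `credential_status` is a hypothetical helper rather than anything LangChain provides.

```python
from pathlib import Path


def credential_status(base_dir=None):
    """Report whether credentials.json and token.json exist under base_dir."""
    base = Path(base_dir) if base_dir is not None else Path.home() / ".credentials"
    return {
        name: (base / name).is_file()
        for name in ("credentials.json", "token.json")
    }


print(credential_status())
```

A missing `token.json` is expected before the first run, since that file is created on first use.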

**To set up the credentials file, follow these steps:**

1. Create a new Google Cloud Platform project or use an existing one by visiting the Google Cloud Console. Ensure that billing is enabled for your project.
2. Enable the Google Drive API by navigating to its dashboard in the Google Cloud Console and clicking "Enable."
3. Create a service account by going to the Service Accounts page in the Google Cloud Console. Follow the prompts to set up a new service account.
4. Assign necessary roles to the service account, such as "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access," depending on your needs.
5. After creating the service account, access the "Actions" menu next to it, select "Manage keys," click "Add Key," and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer; this file serves as your credentials file.

Retrieve the folder or document ID from the URL:

- Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
- Document: https://docs.google.com/document/d/{document_id}/edit
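
Copying the ID by hand is error-prone, so a small parser can pull it out of either URL shape. This is a sketch under the assumption that IDs consist of letters, digits, hyphens, and underscores; `extract_drive_id` and the example IDs are hypothetical.

```python
import re

# Patterns follow the two URL shapes listed above.
# Assumption: IDs contain only letters, digits, hyphens, and underscores.
_FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
_DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_drive_id(url):
    """Return (kind, id) for a Drive folder or Google Docs URL, else raise ValueError."""
    for kind, pattern in (("folder", _FOLDER_RE), ("document", _DOCUMENT_RE)):
        match = pattern.search(url)
        if match:
            return kind, match.group(1)
    raise ValueError(f"No folder or document ID found in: {url}")


print(extract_drive_id("https://drive.google.com/drive/u/0/folders/abc123_XYZ"))
# ('folder', 'abc123_XYZ')
print(extract_drive_id("https://docs.google.com/document/d/doc-456/edit"))
# ('document', 'doc-456')
```

The returned kind tells you whether to pass the ID as a folder ID or as one entry in a list of document IDs.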

Import the GoogleDriveLoader class:
```
from langchain.document_loaders import GoogleDriveLoader
```
Instantiate GoogleDriveLoader:

```
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False  # Optional: Fetch files from subfolders recursively. Defaults to False.
)
```
Load the documents:

```
docs = loader.load()
```
Note that currently, only Google Docs are supported.