# Data Loaders in LlamaIndex
* Notebook by Adam Lang
* Date: 3/11/2024

## What are Data Loaders?
* Data loaders read and load data from a variety of sources and file types.
* They convert each source into a **Document** format that LlamaIndex can read and use.
* LlamaIndex offers 100+ data loaders through LlamaHub.
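
As a rough mental model (not LlamaIndex's actual implementation), a data loader is a function that takes a source and returns a list of document objects, each holding text plus some metadata. The `SimpleDoc` class and `load_text_file` helper below are hypothetical names used only for illustration, and the sketch uses only the Python standard library.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document: raw text plus source metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: Path) -> list[SimpleDoc]:
    """Hypothetical loader: read one plain-text file into a single document."""
    text = path.read_text(encoding="utf-8")
    return [SimpleDoc(text=text, metadata={"file_name": path.name})]


# Usage: write a small file, then load it back as a document.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("Data loaders turn files into documents.", encoding="utf-8")
    docs = load_text_file(sample)

print(len(docs), docs[0].metadata["file_name"])
```

The real loaders used in the rest of this notebook follow the same shape: each returns a list of documents whose `.text` can be inspected.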

In [1]:
# install library
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.10.18-py3-none-any.whl (5.6 kB)
Collecting llama-index-agent-openai<0.2.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.1.5-py3-none-any.whl (12 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.8-py3-none-any.whl (25 kB)
Collecting llama-index-core<0.11.0,>=0.10.18 (from llama-index)
  Downloading llama_index_core-0.10.18.post1-py3-none-any.whl (15.3 MB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.6-py3-none-any.whl (6.0 kB)
Collecting llama-index-indices-managed-llama-cloud<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.1.3-py3-none-any.whl (6.6 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloa

# 1. Load a PDF file
* Download the data (the "Attention Is All You Need" paper) into a local `data` directory.

In [2]:
!mkdir -p data
!wget 'https://raw.githubusercontent.com/aravindpai/Speech-Recognition/1882379d3152c8cd830d74e677be4dd161d024ea/transformers.pdf' -O 'data/transformers.pdf'

--2024-03-11 20:21:07--  https://raw.githubusercontent.com/aravindpai/Speech-Recognition/1882379d3152c8cd830d74e677be4dd161d024ea/transformers.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/octet-stream]
Saving to: ‘data/transformers.pdf’


2024-03-11 20:21:08 (33.1 MB/s) - ‘data/transformers.pdf’ saved [2215244/2215244]



* Using PDFReader in LlamaIndex
    * documentation from LlamaIndex: https://docs.llamaindex.ai/en/stable/understanding/loading/loading.html

In [4]:
from pathlib import Path
from llama_index.core import download_loader

In [5]:
# PDFReader from LlamaIndex. download_loader is deprecated in llama-index 0.10.x;
# `from llama_index.readers.file import PDFReader` is the direct import.
PDFReader = download_loader("PDFReader")


In [6]:
loader = PDFReader()

In [7]:
documents = loader.load_data(file=Path('./data/transformers.pdf'))

In [8]:
# check len of documents
len(documents)

15

In [9]:
# show the text of the first document (the first page of the PDF)
documents[0].text

'Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and con

Summary:
* The cells above load a PDF file in LlamaIndex: `PDFReader` returns a list of documents (15 for this paper), and each document's `.text` holds the extracted text.

# 2. Loading CSV files

In [10]:
!wget https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv -O 'data/transactions.csv'

--2024-03-11 20:29:23--  https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv
Resolving datahack-prod.s3.amazonaws.com (datahack-prod.s3.amazonaws.com)... 16.12.40.67, 52.219.62.84, 52.219.160.47, ...
Connecting to datahack-prod.s3.amazonaws.com (datahack-prod.s3.amazonaws.com)|16.12.40.67|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 869537 (849K) [text/csv]
Saving to: ‘data/transactions.csv’


2024-03-11 20:29:26 (718 KB/s) - ‘data/transactions.csv’ saved [869537/869537]



In [16]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=['/content/data/transactions.csv'])


In [23]:
# load the CSV file into a list of Document objects
documents = reader.load_data()

In [26]:
# print the first document (the output is a truncated preview of its text)
print(documents[0])

Doc ID: 7728be25-1376-4692-8d5f-eaab7c8be713
Text: FDA15, 9.3, Low Fat, 0.016047301, Dairy, 249.8092, OUT049, 1999,
Medium, Tier 1, Supermarket Type1, 3735.138 DRC01, 5.92, Regular,
0.019278216, Soft Drinks, 48.2692, OUT018, 2009, Medium, Tier 3,
Supermarket Type2, 443.4228 FDN15, 17.5, Low Fat, 0.016760075, Meat,
141.618, OUT049, 1999, Medium, Tier 1, Supermarket Type1, 2097.27
FDX07, 19.2, Reg...
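
The preview above suggests that the CSV rows end up concatenated into one document, without column names attached to the values. If one text per row with the column names preserved is preferred, a minimal standard-library sketch could look like the following. The `rows_to_texts` helper and the sample columns are hypothetical and are not part of LlamaIndex or of the real file's header.

```python
import csv
import io


def rows_to_texts(csv_text: str) -> list[str]:
    """Hypothetical helper: render each CSV row as 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [", ".join(f"{col}: {val}" for col, val in row.items()) for row in reader]


# Made-up sample with illustrative column names (not the real file's header).
sample = "item_id,weight,category\nA1,9.3,Dairy\nB2,5.92,Soft Drinks\n"
texts = rows_to_texts(sample)
print(texts[0])  # item_id: A1, weight: 9.3, category: Dairy
```

Each resulting string could then be wrapped in its own `Document(text=...)`, which keeps the column names next to the values when the text is later chunked.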


# 3. Loading a Web Page

In [29]:
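from llama_index.core import download_loader
# download_loader is deprecated in llama-index 0.10.x; with `llama-index-readers-web`
# installed, `from llama_index.readers.web import SimpleWebPageReader` is the direct import.
SimpleWebPageReader = download_loader("SimpleWebPageReader")
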

In [30]:
loader = SimpleWebPageReader()

In [33]:
documents = loader.load_data(urls=['https://huggingface.co/blog/moe'])

In [34]:
documents[0].text

'<!doctype html>\n<html class="">\n\t<head>\n\t\t<meta charset="utf-8" />\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=no" />\n\t\t<meta name="description" content="We’re on a journey to advance and democratize artificial intelligence through open source and open science." />\n\t\t<meta property="fb:app_id" content="1321688464574422" />\n\t\t<meta name="twitter:card" content="summary_large_image" />\n\t\t<meta name="twitter:site" content="@huggingface" />\n\t\t<meta property="og:title" content="Mixture of Experts Explained" />\n\t\t<meta property="og:type" content="website" />\n\t\t<meta property="og:url" content="https://huggingface.co/blog/moe" />\n\t\t<meta property="og:image" content="https://huggingface.co/blog/assets/moe/thumbnail.png" />\n\n\t\t<link rel="stylesheet" href="/front/build/kube-78db83e/style.css" />\n\n\t\t<link rel="preconnect" href="https://fonts.gstatic.com" />\n\t\t<link\n\t\t\thref="https://fonts.googleapis.com/css2?f
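
The output above is raw HTML, markup included. `SimpleWebPageReader` accepts an `html_to_text=True` option that relies on the `html2text` package; check the documentation for the installed version. As a dependency-free illustration of the same idea, the sketch below strips tags with Python's built-in `html.parser`. The `TextExtractor` class, the `html_to_plain_text` function, and the sample HTML are hypothetical and simplified.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script/style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up HTML snippet for illustration.
sample_html = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Mixture of Experts</h1><p>Sparse models.</p></body></html>"
)
plain = html_to_plain_text(sample_html)
print(plain)
```

Plain text is usually a better input for indexing than raw markup, since tags and attributes would otherwise be chunked and embedded alongside the article content.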

In [39]:
from llama_index.core import Document
# merge the text of all loaded pages into a single Document
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [40]:
# write the HTML string to a file (explicit UTF-8, since the page contains non-ASCII characters)
with open('data/blog.html', "w", encoding="utf-8") as file:
    file.write(document.text)

# 4. Reading from Directory
* `SimpleDirectoryReader` can either load every file in a directory or load only the specific files you list by name.

In [42]:
from llama_index.core import SimpleDirectoryReader

In [43]:
documents = SimpleDirectoryReader('./data/').load_data()

Load with file names:

In [47]:
documents = SimpleDirectoryReader(input_files=['/content/data/transformers.pdf',
                                               '/content/data/transactions.csv']).load_data()
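
When the list of file names gets long, it can be built programmatically rather than typed by hand. The sketch below uses only `pathlib`; the `collect_files` helper and the file names are hypothetical. `SimpleDirectoryReader` also offers a `required_exts` argument for filtering by extension, which may be simpler when it fits the use case.

```python
from pathlib import Path
import tempfile


def collect_files(directory: Path, extensions: set[str]) -> list[str]:
    """Hypothetical helper: return sorted paths whose suffix is in `extensions`."""
    return sorted(
        str(p) for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


# Usage on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.pdf", "b.csv", "c.html"):
        (root / name).write_text("placeholder", encoding="utf-8")
    selected = collect_files(root, {".pdf", ".csv"})
    selected_names = [Path(p).name for p in selected]

print(selected_names)  # ['a.pdf', 'b.csv']
```

The resulting list of paths could then be passed as `SimpleDirectoryReader(input_files=selected)`.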

In [48]:
# show the text of the first document (the first page of the PDF)
documents[0].text

'Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and con