# Data Loaders

#### Contents:

1. Loading PDF files
2. Loading CSV files
3. Loading Web Page
4. Reading from Directory

### Install required libraries

pip install llama-index-readers-file llama-index-readers-web unstructured

## 1. Loading PDF files

First, we download a PDF file and save it into our `data` directory using the `wget` command. The URL points to the paper "Attention Is All You Need" hosted on arXiv, and we save it as `transformers.pdf`.

In [2]:
!mkdir -p data
!wget "https://arxiv.org/pdf/1706.03762" -O 'data/transformers.pdf'

--2024-06-11 12:31:22--  https://arxiv.org/pdf/1706.03762
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/pdf]
Saving to: ‘data/transformers.pdf’


2024-06-11 12:31:23 (4.07 MB/s) - ‘data/transformers.pdf’ saved [2215244/2215244]
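
If `wget` is not available (for example on Windows), the same download can be sketched with Python's standard library. The helper name `download` is an assumption introduced here for illustration and is not part of LlamaIndex. Some servers may also expect a custom `User-Agent` header, which this minimal version does not set.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download(url: str, dest: Path) -> Path:
    """Fetch `url` and save it to `dest`, creating parent folders if needed.

    `download` is a hypothetical helper, not part of LlamaIndex.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # stream the body to disk in chunks
    return dest


# Example call (performs a network request, so it is left commented out):
# download("https://arxiv.org/pdf/1706.03762", Path("data/transformers.pdf"))
```

Because `mkdir(parents=True, exist_ok=True)` tolerates an existing folder, the cell can be re-run without errors.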



### Using PDFReader

In [1]:
from pathlib import Path
from llama_index.readers.file import PDFReader

Next, we create an instance of `PDFReader`.

In [3]:
loader = PDFReader()

We then use the `load_data` method to load the content of our PDF file. The `file` parameter specifies the path to the PDF. The method reads the PDF and returns a list of documents; by default, each document holds the text of one page, which is why the page number appears in the metadata as `page_label`.

In [4]:
documents = loader.load_data(file=Path('./data/transformers.pdf'))


To check how many documents we have loaded, we can use the `len` function.

In [5]:
len(documents)

15

Finally, we can access the text of the first document in our list and display it. This gives us a peek into the content of the PDF.

In [6]:
print(documents[0].text)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experime

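Each document also carries an ID and metadata. Converting a document to a dictionary shows all of its fields.

In [7]: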
documents[0].to_dict().keys()

dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys', 'relationships', 'text', 'mimetype', 'start_char_idx', 'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator', 'class_name'])

In [8]:
documents[0].id_

'0ea80bd8-f224-4d76-82e7-2134fc263edf'

In [9]:
documents[0].metadata

{'page_label': '1', 'file_name': 'transformers.pdf'}
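
To get a quick overview of what was loaded, one option is to list the page label and text length of every document. The sketch below only relies on the `text` and `metadata` attributes shown above; `StubDocument` and `page_summary` are hypothetical names used so the example runs without a PDF on disk. With real data you would pass the `documents` list instead.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    text: str
    metadata: dict = field(default_factory=dict)


def page_summary(documents):
    """Return a (page_label, character_count) pair for each document."""
    return [(doc.metadata.get("page_label"), len(doc.text)) for doc in documents]


docs = [
    StubDocument("Attention Is All You Need",
                 {"page_label": "1", "file_name": "transformers.pdf"}),
    StubDocument("Abstract",
                 {"page_label": "2", "file_name": "transformers.pdf"}),
]
print(page_summary(docs))  # [('1', 25), ('2', 8)]
```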

## 2. Loading CSV files

In [7]:
!wget https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv -O 'data/transactions.csv'

--2024-06-11 12:32:18--  https://datahack-prod.s3.amazonaws.com/train_file/train_v9rqX0R.csv
Resolving datahack-prod.s3.amazonaws.com (datahack-prod.s3.amazonaws.com)... 52.219.158.83, 52.219.158.199, 52.219.156.91, ...
Connecting to datahack-prod.s3.amazonaws.com (datahack-prod.s3.amazonaws.com)|52.219.158.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 869537 (849K) [text/csv]
Saving to: ‘data/transactions.csv’


2024-06-11 12:32:20 (718 KB/s) - ‘data/transactions.csv’ saved [869537/869537]



In [12]:
from llama_index.readers.file import CSVReader

Next, we create an instance of `CSVReader`.

In [13]:
loader = CSVReader()

We then use the `load_data` method to load the content of our CSV file. The `file` parameter specifies the path to the CSV file. By default (`concat_rows=True`), the reader joins all rows into a single document; creating the reader with `CSVReader(concat_rows=False)` returns one document per row instead.

In [15]:
documents = loader.load_data(file=Path('./data/transactions.csv'))

We first check how many documents were loaded, then display the text of the first one by referencing its index.

In [17]:
len(documents)

1

In [16]:
documents[0].text

'Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type, Item_Outlet_Sales\nFDA15, 9.3, Low Fat, 0.016047301, Dairy, 249.8092, OUT049, 1999, Medium, Tier 1, Supermarket Type1, 3735.138\nDRC01, 5.92, Regular, 0.019278216, Soft Drinks, 48.2692, OUT018, 2009, Medium, Tier 3, Supermarket Type2, 443.4228\nFDN15, 17.5, Low Fat, 0.016760075, Meat, 141.618, OUT049, 1999, Medium, Tier 1, Supermarket Type1, 2097.27\nFDX07, 19.2, Regular, 0, Fruits and Vegetables, 182.095, OUT010, 1998, , Tier 3, Grocery Store, 732.38\nNCD19, 8.93, Low Fat, 0, Household, 53.8614, OUT013, 1987, High, Tier 3, Supermarket Type1, 994.7052\nFDP36, 10.395, Regular, 0, Baking Goods, 51.4008, OUT018, 2009, Medium, Tier 3, Supermarket Type2, 556.6088\nFDO10, 13.65, Regular, 0.012741089, Snack Foods, 57.6588, OUT013, 1987, High, Tier 3, Supermarket Type1, 343.5528\nFDP10, , Low Fat, 0.127469857, Snack
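
The single document above joins the fields of each row with `", "` and the rows with newlines. The sketch below mimics that layout with the standard `csv` module to show the difference between one combined document and one text per row. `rows_to_text` and its `concat_rows` flag are illustrative names modeled on the output seen here, not the library's own implementation.

```python
import csv
import io


def rows_to_text(csv_text: str, concat_rows: bool = True) -> list[str]:
    """Turn CSV content into text chunks.

    With concat_rows=True, return one chunk containing every row;
    otherwise return one chunk per row. Hypothetical helper for illustration.
    """
    reader = csv.reader(io.StringIO(csv_text))
    lines = [", ".join(row) for row in reader]
    return ["\n".join(lines)] if concat_rows else lines


sample = (
    "Item_Identifier,Item_Weight,Item_Fat_Content\n"
    "FDA15,9.3,Low Fat\n"
    "FDX07,,Regular\n"
)
print(len(rows_to_text(sample)))                      # 1
print(len(rows_to_text(sample, concat_rows=False)))   # 3
```

One combined chunk keeps the header next to the data, while per-row chunks are smaller but lose that context unless the header is added back.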

## 3. Loading Web Page

We start by importing `UnstructuredURLLoader` from `llama_index.readers.web`. This class helps us load and parse content from web pages.

In [2]:
from llama_index.readers.web import UnstructuredURLLoader

We create an instance of `UnstructuredURLLoader` and pass a list of URLs we want to load.

In [3]:
loader = UnstructuredURLLoader(urls=['https://huggingface.co/blog/moe'])

Using the `load_data` method, we load the content of the specified URL.

In [4]:
documents = loader.load_data()

In [5]:
len(documents)

1

We can access and display the text of the first document similarly.

In [6]:
print(documents[0].text)

Back to Articles

Mixture of Experts Explained

Published December 11, 2023

Update on GitHub

Upvote

128

osanseviero Omar Sanseviero

lewtun Lewis Tunstall

philschmid Philipp Schmid

smangrul Sourab Mangrulkar

ybelkada Younes Belkada

pcuenq Pedro Cuenca

With the release of Mixtral 8x7B (announcement, model card), a class of transformer has become the hottest topic in the open AI community: Mixture of Experts, or MoEs for short. In this blog post, we take a look at the building blocks of MoEs, how they’re trained, and the tradeoffs to consider when serving them for inference.

Let’s dive in!

Table of Contents

What is a Mixture of Experts?

A Brief History of MoEs

What is Sparsity?

Load Balancing tokens for MoEs

MoEs and Transformers

Switch Transformers

Stabilizing training with router Z-loss

What does an expert learn?

How does scaling the number of experts impact pretraining?

Fine-tuning MoEs

When to use sparse MoEs vs dense models?

Making MoEs go brrr

Expert Paralle

To combine the text from multiple documents into a single document, we use the `Document` class from `llama_index.core`.

In [7]:
from llama_index.core import Document

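We join the text of all loaded documents into a single `Document`.

In [8]:
document = Document(text="\n\n".join([doc.text for doc in documents]))

Finally, we write the combined text to a file. Note that the content is the extracted plain text rather than HTML markup, even though the file is named `blog.html`.

In [9]:
# Write the combined plain text to the file
with open('data/blog.html', "w", encoding="utf-8") as file:
    file.write(document.text)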

## 4. Reading from Directory

With `SimpleDirectoryReader`, you can either load every file in a directory or pass an explicit list of file names to read.

In [27]:
from llama_index.core import SimpleDirectoryReader

We create an instance of `SimpleDirectoryReader`, specify the directory to read from, and call `load_data`.

In [28]:
documents = SimpleDirectoryReader('./data/').load_data()

We can also specify individual files within the directory to read.

In [29]:
documents = SimpleDirectoryReader(input_files=['./data/transformers.pdf',
                                               './data/transactions.csv']).load_data()

To display the text of a specific document, we reference it by its index. Here the PDF contributes 15 documents (indices 0 to 14), so index 15 holds the CSV content.

In [34]:
documents[15].text

'FDA15, 9.3, Low Fat, 0.016047301, Dairy, 249.8092, OUT049, 1999, Medium, Tier 1, Supermarket Type1, 3735.138\nDRC01, 5.92, Regular, 0.019278216, Soft Drinks, 48.2692, OUT018, 2009, Medium, Tier 3, Supermarket Type2, 443.4228\nFDN15, 17.5, Low Fat, 0.016760075, Meat, 141.618, OUT049, 1999, Medium, Tier 1, Supermarket Type1, 2097.27\nFDX07, 19.2, Regular, 0.0, Fruits and Vegetables, 182.095, OUT010, 1998, nan, Tier 3, Grocery Store, 732.38\nNCD19, 8.93, Low Fat, 0.0, Household, 53.8614, OUT013, 1987, High, Tier 3, Supermarket Type1, 994.7052\nFDP36, 10.395, Regular, 0.0, Baking Goods, 51.4008, OUT018, 2009, Medium, Tier 3, Supermarket Type2, 556.6088\nFDO10, 13.65, Regular, 0.012741089, Snack Foods, 57.6588, OUT013, 1987, High, Tier 3, Supermarket Type1, 343.5528\nFDP10, nan, Low Fat, 0.127469857, Snack Foods, 107.7622, OUT027, 1985, Medium, Tier 3, Supermarket Type3, 4022.7636\nFDH17, 16.2, Regular, 0.016687114, Frozen Foods, 96.9726, OUT045, 2002, nan, Tier 2, Supermarket Type1, 1076.
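
One way to see why index 15 lands in the CSV content is to record the first index at which each source file appears. The sketch below assumes each document's metadata contains a `file_name` key, as in the PDF metadata shown earlier; `StubDocument` and `first_index_by_file` are hypothetical names so the example runs without the files on disk.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the attributes used below."""
    text: str
    metadata: dict = field(default_factory=dict)


def first_index_by_file(documents):
    """Map each file name to the index of its first document."""
    first = {}
    for index, doc in enumerate(documents):
        name = doc.metadata.get("file_name", "<unknown>")
        first.setdefault(name, index)  # keep only the first occurrence
    return first


# 15 page-level documents for the PDF, followed by one for the CSV.
docs = [StubDocument(f"page {i + 1}", {"file_name": "transformers.pdf"})
        for i in range(15)]
docs.append(StubDocument("FDA15, 9.3, Low Fat", {"file_name": "transactions.csv"}))

print(first_index_by_file(docs))
# {'transformers.pdf': 0, 'transactions.csv': 15}
```

Looking documents up by file name this way is less brittle than hard-coding an index, since the position changes whenever the set of input files does.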

These examples show how LlamaIndex loads PDFs, CSV files, web pages, and whole directories into a common `Document` format that can then be processed in the same way.

You can find more data loaders on [LlamaHub](https://llamahub.ai/).