# Data Ingestion Techniques using LangChain

In [4]:
# Loading text
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
loader.load()

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness be
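The output above is a list holding one `Document`, with the file's text in `page_content` and the file path under `metadata['source']`. The sketch below shows roughly what that shape amounts to. It uses only the standard library: `SimpleDocument` and `load_text` are hypothetical stand-ins written for illustration, not LangChain's own classes or implementation, and the sample file is a throwaway created on the spot.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's Document)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read the whole file into a single document and record its path as the source."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Usage with a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("First paragraph.\n\nSecond paragraph.", encoding="utf-8")
    docs = load_text(sample_path)

print(len(docs), docs[0].metadata)
```

The whole file becomes a single document here, which matches the single-element list shown in the cell output.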

In [None]:
# Loading PDF file
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('attention.pdf')
loader.load()

[Document(metadata={'source': 'attention.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Tran
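The PDF output above carries a zero-based `page` key in each document's metadata, since the loader yields one document per page by default. When the full text is needed as one string, the pages can be joined in page order. The sketch below assumes that shape; `PageRecord`, `join_pages`, and the placeholder page texts are hypothetical and stand in for real loader output.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageRecord:
    """Hypothetical stand-in for one per-page document from a PDF loader."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def join_pages(pages: List[PageRecord], separator: str = "\n\n") -> str:
    """Concatenate page texts in ascending order of metadata['page']."""
    ordered = sorted(pages, key=lambda p: p.metadata["page"])
    return separator.join(p.page_content for p in ordered)


# Placeholder pages, deliberately out of order
pages = [
    PageRecord("placeholder text for page 1", {"source": "example.pdf", "page": 1}),
    PageRecord("placeholder text for page 0", {"source": "example.pdf", "page": 0}),
]
full_text = join_pages(pages)
print(full_text)
```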

In [12]:
# Loading webpage
from langchain_community.document_loaders import WebBaseLoader
from bs4 import SoupStrainer

# Restrict parsing to elements with the CSS class "bd-article"
loader = WebBaseLoader(
    web_path='https://python.langchain.com/api_reference/community/index.html',
    bs_kwargs=dict(parse_only=SoupStrainer(class_=("bd-article",))),
)
loader.load()

[Document(metadata={'source': 'https://python.langchain.com/api_reference/community/index.html'}, page_content='\n\nlangchain-community: 0.3.13#\nMain entrypoint into package.\n\n\n\nadapters#\nClasses\n\n\nadapters.openai.Chat()\nChat.\n\nadapters.openai.ChatCompletion()\nChat completion.\n\nadapters.openai.ChatCompletionChunk\nChat completion chunk.\n\nadapters.openai.ChatCompletions\nChat completions.\n\nadapters.openai.Choice\nChoice.\n\nadapters.openai.ChoiceChunk\nChoice chunk.\n\nadapters.openai.Completions()\nCompletions.\n\nadapters.openai.IndexableBaseModel\nAllows a BaseModel to return its fields by string variable indexing.\n\n\n\n\nFunctions\n\n\nadapters.openai.aenumerate(iterable[,\xa0start])\nAsync version of enumerate function.\n\nadapters.openai.convert_dict_to_message(_dict)\nConvert a dictionary to a LangChain message.\n\nadapters.openai.convert_message_to_dict(message)\nConvert a LangChain message to a dictionary.\n\nadapters.openai.convert_messages_for_finetuning(
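The scraped `page_content` above contains long runs of newlines and non-breaking spaces (`\xa0`) left over from the HTML layout. A minimal cleanup sketch, assuming that collapsing blank-line runs is acceptable for downstream use, is given below. `normalize_whitespace` is a hypothetical helper, not part of LangChain, and the sample string is an abridged excerpt of the output shown above.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace non-breaking spaces, squeeze spaces/tabs, and cap blank-line runs at one."""
    text = text.replace("\xa0", " ")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Abridged excerpt of the scraped text
sample = (
    "\n\nadapters#\nClasses\n\n\nadapters.openai.Chat()\nChat.\n\n\n\n\n"
    "Functions\n\n\nadapters.openai.aenumerate(iterable[,\xa0start])\n"
)
cleaned = normalize_whitespace(sample)
print(cleaned)
```

The same function could be applied to each document's `page_content` after loading.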

In [15]:
# Loading arXiv paper by its identifier
from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(query="1706.03762", load_max_docs=2)
loader.load()

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

In [18]:
# Loading Wikipedia articles matching a search query
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query="TOR", load_max_docs=2)
loader.load()

[Document(metadata={'title': 'Tor (network)', 'summary': "Tor is a free overlay network for enabling anonymous communication. Built on free and open-source software and more than seven thousand volunteer-operated relays worldwide, users can have their Internet traffic routed via a random path through the network.\nUsing Tor makes it more difficult to trace a user's Internet activity by preventing any single point on the Internet (other than the user's device) from being able to view both where traffic originated from and where it is ultimately going to at the same time. This conceals a user's location and usage from anyone performing network surveillance or traffic analysis from any such point, protecting the user's freedom and ability to communicate confidentially.", 'source': 'https://en.wikipedia.org/wiki/Tor_(network)'}, page_content='Tor is a free overlay network for enabling anonymous communication. Built on free and open-source software and more than seven thousand volunteer-ope