# `TikaDocumentConverter`


In [1]:
%load_ext autoreload
%autoreload 2

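- Start an Apache Tika server with Docker before running the cells below. By default, `TikaDocumentConverter` expects the server at `http://localhost:9998/tika`.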
  ```bash
  docker run -d -p 127.0.0.1:9998:9998 apache/tika:latest
  ```
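Before running the converter, it can help to confirm that the container is actually accepting requests. The snippet below is a minimal sketch using only the standard library. The helper name `tika_is_up` is hypothetical, and the `/version` endpoint and port are assumptions based on the Docker command above and Tika Server's default REST layout.

```python
import http.client
import urllib.request


def tika_is_up(url: str = "http://localhost:9998/version", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` answers with status 200.

    `url` is an assumption: it matches the port published by the
    `docker run` command above and Tika Server's `/version` endpoint.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


print("Tika reachable:", tika_is_up())
```

If this prints `False` right after `docker run`, the server may still be starting. Waiting a few seconds and retrying is usually enough.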

## On its own


In [4]:
from haystack.components.converters import TikaDocumentConverter
from pathlib import Path

In [5]:
converter = TikaDocumentConverter()
res = converter.run(sources=[Path("./sample.pdf")])

In [6]:
res

{'documents': [Document(id=85e4d1e1926ac6bad093031bc9c1d56c3e5331ffaaa44eed39e994bc064cfb9f, content: '
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  ThuyVT2
  
  
  1
  
  ThuyVT2
  
  
  
  2
  
  VKS
  VKS (VNGCloud Kubernetes Service) is...', meta: {'file_path': 'sample.pdf'})]}
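When the converter is used on its own, without a `DocumentCleaner`, the extracted content can begin with a long run of whitespace-only lines. The following is a minimal sketch of a plain-Python cleanup step. The helper `collapse_whitespace` is hypothetical and not part of Haystack, and the sample string only imitates the shape of the converter output.

```python
def collapse_whitespace(text: str) -> str:
    """Strip every line, drop the empty ones, and join the rest with newlines."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Illustrative input imitating the shape of Tika output (not real data).
raw = "\n  \n  \n  ThuyVT2\n  \n  1\n  \n  VKS\n"
print(collapse_whitespace(raw))
```

With the real result, this would be applied as `collapse_whitespace(res["documents"][0].content)`, assuming the document's `content` is not `None`.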

## In a pipeline

In [8]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import TikaDocumentConverter
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

In [9]:
document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", TikaDocumentConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

file_paths = ["./sample.pdf"]
pipeline.run({"converter": {"sources": file_paths}})

{'writer': {'documents_written': 760}}
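The splitter above is configured with `split_by="sentence"` and `split_length=5`, so each stored document holds up to five consecutive sentences. The sketch below illustrates that grouping idea in plain Python. It is not Haystack's implementation: the regex-based `split_sentences` and the `chunk_by_sentences` helper are hypothetical, and the sketch ignores options such as `split_overlap`.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def chunk_by_sentences(text: str, split_length: int = 5) -> list[str]:
    """Group consecutive sentences into chunks of at most `split_length` sentences."""
    sentences = split_sentences(text)
    return [
        " ".join(sentences[i:i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


# Twelve synthetic sentences, grouped five at a time.
text = " ".join(f"Sentence {n}." for n in range(1, 13))
chunks = chunk_by_sentences(text, split_length=5)
for chunk in chunks:
    print(chunk)
```

The last chunk is shorter than the others whenever the sentence count is not a multiple of `split_length`.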

In [10]:
for doc in document_store.filter_documents():
    print(doc.content[:100])

ThuyVT2 1 ThuyVT22 VKS
VKS (VNGCloud Kubernetes Service) is a managed service on VNGCloud that help
 Kubernetes, an open-source platform developed by Google, is widely used to manage and deploy contai
29) to ensure you can take advantage of the most advanced features. • Kubernetes Networking: VKS int
 • Integration of Native Blockstore (Container Storage Interface - CSI): VKS allows you to manage Bl
5 How VKS works? Below are the current concepts being provided to you by VKS: When you create a Pub
 At this time, these VMs (Nodes) cannot join the K8S cluster directly. In order for these VMs to joi
 Besides, using Private Node Group will help you control applications in the cluster more securely, 
..Private Cluster is the ideal choice for services that require strict access control, ensuring comp
 Comparison between using Public Cluster
and Private Cluster7 other services in VNG Cloud. other se
 Scalability (AutoScaling) Easily scalable through Auto Scaling feature . Easily scalable t