In [1]:
import json
import os
from doctran import Doctran, ExtractProperty


In [2]:
from dotenv import load_dotenv

load_dotenv(".env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_MODEL = "gpt-4"
OPENAI_TOKEN_LIMIT = 8000

### Load Sample Content

In [3]:
content = ""
with open('sample.txt', 'r') as file:
    content = file.read()
print(content)

[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Emp

In [4]:
doctran = Doctran(openai_api_key=OPENAI_API_KEY, openai_model=OPENAI_MODEL, openai_token_limit=OPENAI_TOKEN_LIMIT)
document = doctran.parse(content=content)

### Extract properties
Uses OpenAI function calling to extract structured JSON data from a document. The same mechanism can classify, rewrite, or extract properties from unstructured text, depending on how each property is described.
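Function calling expects a JSON Schema object describing the arguments the model should return. The sketch below shows one way property definitions could be folded into such a `parameters` object. It does not reproduce doctran's internals: `build_function_parameters` and the plain-dict property format are hypothetical names chosen for illustration.

```python
import json


def build_function_parameters(properties):
    """Fold a list of property dicts into a JSON Schema `parameters` object.

    Hypothetical helper for illustration; not part of doctran.
    """
    schema = {"type": "object", "properties": {}, "required": []}
    for prop in properties:
        spec = {"type": prop["type"], "description": prop["description"]}
        # Optional schema keys are copied through only when present.
        for key in ("enum", "items"):
            if key in prop:
                spec[key] = prop[key]
        schema["properties"][prop["name"]] = spec
        if prop.get("required"):
            schema["required"].append(prop["name"])
    return schema


example_properties = [
    {
        "name": "author_generation",
        "description": "A prediction of the author's generation",
        "type": "string",
        "enum": ["millennial", "boomer"],
        "required": True,
    },
    {
        "name": "contact_info",
        "description": "A list of each person mentioned and their contact information",
        "type": "array",
        "items": {"type": "object", "properties": {"name": {"type": "string"}}},
        "required": False,
    },
]

parameters = build_function_parameters(example_properties)
print(json.dumps(parameters, indent=2))
```

Keeping `enum` and `items` optional mirrors how the properties in this section use them: only the classification property constrains its values, and only the array property describes its elements.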

In [6]:
properties = [
        ExtractProperty(
            name="millenial_or_boomer", 
            description="A prediction of whether this document was written by a millennial or a boomer",
            type="string",
            enum=["millenial", "boomer"],
            required=True
        ),
        ExtractProperty(
            name="as_gen_z", 
            description="The document summarized and rewritten as if it were authored by a Gen Z person",
            type="string",
            required=True
        ),
        ExtractProperty(
            name="contact_info", 
            description="A list of each person mentioned and their contact information",
            type="array",
            items={
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "The name of the person"
                    },
                    "contact_info": {
                        "type": "object",
                        "properties": {
                            "phone": {
                                "type": "string",
                                "description": "The phone number of the person"
                            },
                            "email": {
                                "type": "string",
                                "description": "The email address of the person"
                            }
                        }
                    }
                }
            },
            required=True
        )
]
transformed_document = document.extract(properties=properties).execute()
print(json.dumps(transformed_document.extracted_properties, indent=2))

{
  "millenial_or_boomer": "boomer",
  "as_gen_z": "Hey team, \n\nHope you're all good! Just a quick update on what's been happening. We've been working hard to keep our customers' data safe and sound, big shoutout to John Doe (john.doe@example.com) for his work on this. Remember to keep following our data protection rules, and if you see anything sketchy, let our security team know at security@example.com. \n\nWe've got some new faces in the team who are killing it! Jane Smith has been amazing in customer service, and don't forget, it's almost time to sign up for our employee benefits program. If you need help with this, hit up Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com). \n\nOur marketing team has been on fire, especially Sarah Thompson (phone: 415-555-1234) who's been bossing our social media. Also, don't forget about our product launch on July 15th - it's gonna be lit! \n\nOur R&D team has been grinding away on some cool projects, with David Rodriguez 

### Redact Sensitive Information
Uses a spaCy model to remove names, emails, phone numbers, and other sensitive information from a document. Runs locally to avoid sending sensitive data to third-party APIs.

In [6]:
transformed_document = document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"], interactive=False).execute()
print(transformed_document.transformed_content)

[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend <PERSON> (email: <EMAIL_ADDRESS>) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at <EMAIL_ADDRESS>.

HR Updates and Employee Bene
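To illustrate the placeholder format seen in the output above, here is a minimal local redaction sketch that uses regular expressions instead of a spaCy model. It is a deliberately simpler stand-in, not the method doctran uses: the patterns and the `redact_with_regex` helper are assumptions for illustration, and they cannot detect `PERSON` entities, which need a named-entity model.

```python
import re

# Illustrative patterns only; far less robust than an NER-based approach.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact_with_regex(text):
    """Replace each pattern match with a <LABEL> placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


sample = "Contact Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com)."
redacted = redact_with_regex(sample)
print(redacted)
```

The name "Michael Johnson" survives in this sketch, which is the main reason a model-based entity recognizer is the better fit for real documents.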

### Summarize Context
Summarize the information in a document.

In [15]:
transformed_document = document.summarize(token_limit=100).execute()
print(transformed_document.transformed_content)

The document discusses updates on security measures, HR, marketing initiatives, and R&D projects. It commends John Doe for enhancing network security, welcomes new team members, and recognizes Jane Smith for her customer service. It also mentions the open enrollment period for employee benefits, thanks Sarah Thompson for her social media efforts, and announces a product launch event on July 15th. David Rodriguez is acknowledged for his contributions to R&D. The document emphasizes the importance of confidentiality.


### Refine Context
Remove all information from a document unless it's related to a specific set of topics.

In [16]:
transformed_document = document.refine(topics=['marketing', 'company events']).execute()
print(transformed_document.transformed_content)

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Please treat the information in this document with utmost confidentiality and ensure that it i

### Translate Language
Translate text into another language.

In [17]:
transformed_document = document.translate(language="spanish").execute()
print(transformed_document.transformed_content)

[Generado con ChatGPT]

Documento confidencial - Solo para uso interno

Fecha: 1 de julio de 2023

Asunto: Actualizaciones y discusiones sobre varios temas

Estimado equipo,

Espero que este correo electrónico les encuentre bien. En este documento, me gustaría proporcionarles algunas actualizaciones importantes y discutir varios temas que requieren nuestra atención. Por favor, traten la información contenida aquí como altamente confidencial.

Medidas de seguridad y privacidad
Como parte de nuestro compromiso continuo para garantizar la seguridad y privacidad de los datos de nuestros clientes, hemos implementado medidas robustas en todos nuestros sistemas. Nos gustaría elogiar a John Doe (correo electrónico: john.doe@example.com) del departamento de TI por su diligente trabajo en mejorar nuestra seguridad de red. En adelante, recordamos amablemente a todos que se adhieran estrictamente a nuestras políticas y directrices de protección de datos. Además, si se encuentran con posibles riesg

### Interrogate
Convert information in a document into question and answer format. End user queries often take the form of a question, so converting information into questions and creating indexes from these questions often yields better results when using a vector database for context retrieval.
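To make the retrieval idea concrete, the sketch below indexes question and answer pairs by their questions and returns the answer whose question best matches a user query. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the in-memory list stands in for a vector database; both are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


qa_pairs = [
    {
        "question": "Who is commended for enhancing the company's network security?",
        "answer": "John Doe from the IT department is commended for enhancing the company's network security.",
    },
    {
        "question": "Who has been recognized for her outstanding performance in customer service?",
        "answer": "Jane Smith has been recognized for her outstanding performance in customer service.",
    },
]

# Index on the question text, not the answer text.
index = [(embed(pair["question"]), pair) for pair in qa_pairs]


def retrieve(query):
    """Return the Q&A pair whose question is most similar to the query."""
    query_vec = embed(query)
    return max(index, key=lambda entry: cosine(query_vec, entry[0]))[1]


best = retrieve("Who improved network security?")
print(best["answer"])
```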

In [7]:
transformed_document = document.interrogate().execute()
print(json.dumps(transformed_document.extracted_properties, indent=2))

{
  "questions_and_answers": [
    {
      "question": "What is the purpose of this document?",
      "answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."
    },
    {
      "question": "Who is commended for enhancing the company's network security?",
      "answer": "John Doe from the IT department is commended for enhancing the company's network security."
    },
    {
      "question": "Who should be contacted in case of potential security risks or incidents?",
      "answer": "Potential security risks or incidents should be reported to the dedicated team at security@example.com."
    },
    {
      "question": "Who has been recognized for her outstanding performance in customer service?",
      "answer": "Jane Smith has been recognized for her outstanding performance in customer service."
    },
    {
      "question": "Who is the HR representative to contact for questions or assistance regarding the 

### Process Template
Take text with templated placeholders {like this} and replace each placeholder with a value that follows the instruction inside it. This is useful for generating emails or variations on a text while modifying only the content in the {placeholders}. Any regex can be used to detect placeholders; the most common form is {}, which is matched by the regular expression `\{([^}]*)\}`.

In [6]:
template_string = """My name is {common american name}. Today is {first day of the work week}. On this day, I like to get to work at {some unreasonably early time in the morning}. The first thing I do at work is {some arbitrary task}."""
template = doctran.parse(content=template_string)

transformed_document = template.process_template(template_regex=r"\{([^}]*)\}").execute()
print(transformed_document.transformed_content + "\n")
print(json.dumps(transformed_document.extracted_properties, indent=2))

My name is John. Today is Monday. On this day, I like to get to work at 5:00 AM. The first thing I do at work is check my emails.

{
  "replacements": [
    {
      "index": 0,
      "placeholder": "common american name",
      "replaced_value": "John"
    },
    {
      "index": 1,
      "placeholder": "first day of the work week",
      "replaced_value": "Monday"
    },
    {
      "index": 2,
      "placeholder": "some unreasonably early time in the morning",
      "replaced_value": "5:00 AM"
    },
    {
      "index": 3,
      "placeholder": "some arbitrary task",
      "replaced_value": "check my emails"
    }
  ]
}
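The placeholder regex can be checked locally with Python's `re` module before any API call is made. The sketch below only shows what the pattern captures and how a substitution pass works; the replacement values are hand-picked here, whereas in the notebook they are generated by the model.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
PLACEHOLDER_PATTERN = re.compile(r"\{([^}]*)\}")

template_string = "My name is {common american name}. Today is {first day of the work week}."

# The capture group returns the instruction text inside each pair of braces.
placeholders = PLACEHOLDER_PATTERN.findall(template_string)
print(placeholders)

# Substitute hand-picked values in order of appearance.
values = iter(["John", "Monday"])
filled = PLACEHOLDER_PATTERN.sub(lambda match: next(values), template_string)
print(filled)
```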


## Chaining transformations
You can chain transformations together and execute them in a single step:

In [12]:
transformed_document = (document
                              .redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"])
                              .summarize(token_limit=100)
                              .translate(language="french")
                              .execute()
                              )
print(transformed_document.transformed_content)

Le document est une communication interne confidentielle discutant des mises à jour sur les mesures de sécurité et de confidentialité, des mises à jour des RH et des avantages des employés, des initiatives et campagnes de marketing, et des projets de recherche et développement. Il met en évidence les contributions de membres spécifiques de l'équipe dans chaque domaine, rappelle l'adhésion aux politiques de protection des données, la prochaine période d'inscription aux avantages, un événement de lancement de produit et une séance de remue-méninges R&D. Le document souligne l'importance de maintenir la confidentialité des informations partagées.
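Chaining of this kind is commonly built as a lazy builder: each method queues a step and returns the same object, and nothing runs until `execute()` is called. The sketch below shows that pattern with trivial string operations. `LazyPipeline` is a hypothetical class for illustration and makes no claim about how doctran implements chaining.

```python
class LazyPipeline:
    """Queues text transformations and runs them only on execute()."""

    def __init__(self, content):
        self.content = content
        self._steps = []

    def _queue(self, step):
        self._steps.append(step)
        return self  # returning self is what allows method chaining

    def strip(self):
        return self._queue(str.strip)

    def upper(self):
        return self._queue(str.upper)

    def truncate(self, limit):
        return self._queue(lambda text: text[:limit])

    def execute(self):
        # Apply the queued steps in the order they were chained.
        result = self.content
        for step in self._steps:
            result = step(result)
        return result


pipeline = LazyPipeline("  confidential document  ").strip().upper().truncate(12)
result = pipeline.execute()
print(result)
```

Because the steps run in the order they were chained, placing a redaction step first means later steps only ever see redacted text.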


### Cookbook
* **Redact** information with a local spaCy model before sending it to OpenAI to **summarize**
* **Refine** information to focus on specific topics, then **interrogate** the document. Index your documents based on embeddings generated from the questions