### Import Doctran

In [1]:
import json
import os
from doctran import Doctran, ExtractProperty

### Configure the API key and proxy base URL in the .env file

In [2]:
from dotenv import load_dotenv

load_dotenv(".env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
# To use a SiliconFlow model, remember to update the .env configuration file
#OPENAI_MODEL = "deepseek-ai/DeepSeek-V2.5"
OPENAI_MODEL = "gpt-4"
OPENAI_TOKEN_LIMIT = 16000
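
A small guard can make a missing key fail early instead of surfacing later as an authentication error. The sketch below is an assumption, not part of Doctran: `require_env` is a hypothetical helper, and the placeholder key exists only so the example runs on its own.

```python
import os
from typing import Optional


def require_env(name: str, default: Optional[str] = None) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.getenv(name, default)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Placeholder value so the sketch is self-contained; a real key comes from .env.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")

api_key = require_env("OPENAI_API_KEY")
base_url = require_env("OPENAI_BASE_URL", "https://api.openai.com/v1")
```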

### Load the text

In [3]:
# Read the sample document; an explicit encoding avoids platform-dependent defaults
with open('sample.txt', 'r', encoding='utf-8') as file:
    content = file.read()
print(content)

[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.

HR Updates and Emp

### Instantiate
***
Note that all of these parameters must be passed in.

In [4]:
doctran = Doctran(openai_api_key=OPENAI_API_KEY, openai_model=OPENAI_MODEL, openai_base_url=OPENAI_BASE_URL, openai_token_limit=OPENAI_TOKEN_LIMIT)
document = doctran.parse(content=content)

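#### Extract properties
Use OpenAI function calling to extract JSON data from any document. This is very flexible and can be used to classify unstructured text, rewrite it, or extract properties from it.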
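
Function calling works with a JSON Schema description of the expected output. As a rough picture of what that looks like, the sketch below turns plain dictionaries (mirroring two of the property definitions used in this section) into a schema object. `build_schema` is a hypothetical helper and does not reflect how Doctran builds its request internally.

```python
import json


def build_schema(properties):
    """Turn simple property specs into a JSON Schema object definition."""
    schema = {"type": "object", "properties": {}, "required": []}
    for prop in properties:
        spec = {"type": prop["type"], "description": prop["description"]}
        # Optional keys are copied only when the spec provides them.
        for key in ("enum", "items"):
            if key in prop:
                spec[key] = prop[key]
        schema["properties"][prop["name"]] = spec
        if prop.get("required"):
            schema["required"].append(prop["name"])
    return schema


specs = [
    {
        "name": "millenial_or_boomer",
        "description": "A prediction of whether this document was written by a millenial or boomer",
        "type": "string",
        "enum": ["millenial", "boomer"],
        "required": True,
    },
    {
        "name": "contact_info",
        "description": "A list of each person mentioned and their contact information",
        "type": "array",
        "items": {"type": "object", "properties": {"name": {"type": "string"}}},
        "required": True,
    },
]

schema = build_schema(specs)
print(json.dumps(schema, indent=2))
```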

In [5]:
properties = [
    ExtractProperty(
        name="millenial_or_boomer", 
        description="A prediction of whether this document was written by a millenial or boomer",
        type="string",
        enum=["millenial", "boomer"],
        required=True
    ),
    ExtractProperty(
        name="as_gen_z", 
        description="The document summarized and rewritten as if it were authored by a Gen Z person",
        type="string",
        required=True
    ),
    ExtractProperty(
        name="contact_info", 
        description="A list of each person mentioned and their contact information",
        type="array",
        items={
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
                "phone": {"type": "string"}
            }
        },
        required=True
    )
]
transformed_document = document.extract(properties=properties).execute()
print(json.dumps(transformed_document.extracted_properties, indent=2))

Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base
{
  "millenial_or_boomer": "boomer",
  "as_gen_z": "Yo, what's up team? Hope y'all good. Got some updates and stuff we need to talk about, so let's get into it. Keep this on the DL, yeah? \n\nFirst off, security. We beefed up our systems to keep our customers' data safe. Big shoutout to John Doe from IT (john.doe@example.com) for killing it with the network security. Remember to follow the data protection rules strictly. If you spot any security risks, hit up security@example.com ASAP. \n\nNext, HR stuff. We got some new homies who've been crushing it in their departments. Big props to Jane Smith for her mad skills in customer service. Also, open enrollment for benefits is coming up, so hit up Michael Johnson (418-492-3850, michael.johnson@example.com) if you need help. \n\nMarketing's been lit with new strategies to boost our brand and customer engagement. Sarah Thompson (415-555-1234) killed it with our social media, gro

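### Redact sensitive information
Use a spaCy model to remove names, email addresses, phone numbers, and other sensitive information from a document. This runs locally, so sensitive data is not sent to a third-party API.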

In [6]:
transformed_document = document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"], interactive=False).execute()
print(transformed_document.transformed_content)

[Generated with ChatGPT]

Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures
As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend <PERSON> (email: <EMAIL_ADDRESS>) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at <EMAIL_ADDRESS>.

HR Updates and Employee Bene
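
To illustrate the placeholder style in the output above, here is a minimal local redaction sketch. It uses plain regular expressions rather than spaCy, so it only covers email addresses and US-style phone numbers and cannot detect names. The patterns and the `redact` function are illustrative assumptions, not Doctran's implementation.

```python
import re

# Deliberately simple patterns; real PII detection needs far more care.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with an <ENTITY> placeholder."""
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text


sample = "Contact John Doe (email: john.doe@example.com) or call 415-555-1234."
print(redact(sample))
```

Because everything here is a local string operation, no text leaves the machine, which is the property the redaction step relies on.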

### Summarize
Summarize the information in the document.

In [7]:
transformed_document = document.summarize(token_limit=100).execute()
print(transformed_document.transformed_content)

Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base
The confidential document provides updates on security measures, HR updates including employee recognition and benefits, marketing initiatives, and R&D projects. It emphasizes the importance of confidentiality and encourages participation in upcoming events and brainstorming sessions.


### Refine
Remove all information from the document unless it relates to a specific set of topics.

In [8]:
transformed_document = document.refine(topics=['marketing', 'company events']).execute()
print(transformed_document.transformed_content)

Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base
Confidential Document - For Internal Use Only

Date: July 1, 2023

Subject: Updates and Discussions on Various Topics

Dear Team,

I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Marketing Initiatives and Campaigns
Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.

Please t

### Translate
Translate the text into another language.

In [9]:
transformed_document = document.translate(language="spanish").execute()
print(transformed_document.transformed_content)

Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base
Documento Confidencial - Solo para uso interno

Fecha: 1 de julio de 2023

Asunto: Actualizaciones y Discusiones sobre Diversos Temas

Estimado Equipo,

Espero que este correo electrónico les encuentre bien. En este documento, me gustaría proporcionarles algunas actualizaciones importantes y discutir diversos temas que requieren nuestra atención. Por favor, consideren la información contenida en este documento como altamente confidencial.

Medidas de Seguridad y Privacidad
Como parte de nuestro compromiso continuo para garantizar la seguridad y privacidad de los datos de nuestros clientes, hemos implementado medidas robustas en todos nuestros sistemas. Nos gustaría felicitar a John Doe (correo electrónico: john.doe@example.com) del departamento de TI por su diligente trabajo en la mejora de nuestra seguridad de red. En adelante, les rogamos que se adhieran estrictamente a nuestras políticas y directrices de protección de d

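### Interrogate
Convert the information in a document into question-and-answer format. End-user queries usually arrive as questions, so converting the information into questions and building an index from those questions often gives better results when a vector database is used for context retrieval.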

In [10]:
transformed_document = document.interrogate().execute()
print(json.dumps(transformed_document.extracted_properties, indent=2))

Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base
Attempting to repair JSON formatting issues...
Repaired JSON: {"questions_and_answers":[{"question":"Whatisthesubjectofthedocumentdated July1,2023,addressedtotheteamby Jason Fan,Cofounder&CEOof Psychic,andhowshouldtheinformationbetreatedbyrecipientsofthedocument?","answer":"Thesubjectofthedocumentisupdatesanddiscussionsonvarioustopicsthatrequiretheteam"sattention.Theinformationinthedocumentshouldbetreatedwithutmostconfidentialityandnotsharedwithunauthorizedindividuals."}
JSON parsing still failed: Expecting ',' delimiter: line 1 column 294 (char 293)
Falling back to the default structure...
{
  "questions_and_answers": [
    {
      "question": "\u89e3\u6790\u5931\u8d25\uff0c\u8bf7\u68c0\u67e5\u6a21\u578b\u8f93\u51fa",
      "answer": "\u89e3\u6790\u5931\u8d25\uff0c\u8bf7\u68c0\u67e5\u6a21\u578b\u8f93\u51fa"
    }
  ]
}
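
The retrieval idea can be sketched without a vector database. The example below indexes question-and-answer pairs by the words in each question and returns the pair whose question is most similar to the query, using bag-of-words cosine similarity as a stand-in for real embeddings. The two pairs are hand-written from the sample document for illustration; they are not output from `interrogate()`.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hand-written pairs based on the sample document (illustrative only).
qa_pairs = [
    {"question": "Where should security incidents be reported?", "answer": "security@example.com"},
    {"question": "When is the product launch event?", "answer": "July 15th"},
]

# Index the questions, not the answers, so queries are matched against questions.
index = [(bag_of_words(pair["question"]), pair) for pair in qa_pairs]


def retrieve(query):
    """Return the Q&A pair whose question best matches the query."""
    query_vec = bag_of_words(query)
    return max(index, key=lambda entry: cosine(query_vec, entry[0]))[1]


print(retrieve("When does the product launch happen?")["answer"])
```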


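### Process templates
Take text containing templated placeholders {like this} and replace each placeholder with a value that follows the instruction inside it. This is useful for generating emails or variations of a text in which only the content of the {placeholders} changes. Any regular expression can be used to detect placeholders, but the most common form is {}, which can be matched with the regex `\{([^}]*)\}`.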

In [11]:
template_string = """My name is {common american name}. Today is {first day of the work week}. On this day, I like to get to work at {some unreasonably early time in the morning}. The first thing I do at work is {some arbitrary task}."""
template = doctran.parse(content=template_string)

# A raw string keeps the backslashes literal and avoids an invalid-escape warning
transformed_document = template.process_template(template_regex=r"\{([^}]*)\}").execute()
print(transformed_document.transformed_content + "\n")
print(json.dumps(transformed_document.extracted_properties, indent=2))

Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base


Attempting to repair JSON formatting issues...
Repaired JSON: {"replacements":[{"index":0,"placeholder":"{commonamericanname}","replaced_value":"Michael"}
JSON parsing still failed: Expecting ',' delimiter: line 1 column 93 (char 92)
Falling back to the default structure...
My name is {common american name}. Today is {first day of the work week}. On this day, I like to get to work at {some unreasonably early time in the morning}. The first thing I do at work is {some arbitrary task}.

{
  "replacements": []
}
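
The placeholder regex can be tried on its own with Python's `re` module. This sketch only shows how `\{([^}]*)\}` finds placeholders and how replacement values could be substituted back in; the values are hard-coded assumptions standing in for what the model would generate.

```python
import re

# A raw string keeps the backslashes intact and avoids invalid-escape warnings.
PLACEHOLDER = re.compile(r"\{([^}]*)\}")

template = "My name is {common american name}. Today is {first day of the work week}."

# The capture group returns the instruction inside each pair of braces.
instructions = PLACEHOLDER.findall(template)
print(instructions)

# Hypothetical values; in Doctran these would come from the model.
values = {
    "common american name": "Michael",
    "first day of the work week": "Monday",
}

# Look up each captured instruction and substitute its value.
filled = PLACEHOLDER.sub(lambda match: values[match.group(1)], template)
print(filled)
```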


## Chaining transformations
You can chain multiple transformations together and execute them in a single step:

In [None]:
transformed_document = (document
                              .redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"])
                              .summarize(token_limit=100)
                              .translate(language="french")
                              .execute()
                              )
print(transformed_document.transformed_content)

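Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base
Warning: no matching tokenizer found for model deepseek-ai/DeepSeek-V2.5; using the default tokenizer cl100k_base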
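
Chaining of this kind is commonly implemented by having each method queue a step and return the object itself, with `execute()` running the queue in order. The toy class below sketches that pattern with trivial string operations; `Pipeline` and its methods are hypothetical and say nothing about Doctran's internals.

```python
class Pipeline:
    """Queues text transformations and applies them in order on execute()."""

    def __init__(self, content):
        self.content = content
        self._steps = []

    def strip_digits(self):
        self._steps.append(lambda text: "".join(ch for ch in text if not ch.isdigit()))
        return self  # returning self is what makes chaining possible

    def truncate(self, limit):
        self._steps.append(lambda text: text[:limit])
        return self

    def upper(self):
        self._steps.append(str.upper)
        return self

    def execute(self):
        # Nothing runs until this point; each step receives the previous result.
        result = self.content
        for step in self._steps:
            result = step(result)
        return result


result = Pipeline("call 415 now please").strip_digits().truncate(8).upper().execute()
print(repr(result))
```

Deferring the work to `execute()` means the order of the chained calls is the order of the transformations, so redacting before summarizing and translating keeps sensitive values out of the later steps.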
