In [5]:
!pip3 install unstructured

Successfully installed XlsxWriter-3.0.9 argilla-1.5.1 backoff-2.2.1 click-8.1.3 commonmark-0.9.1 deprecated-1.2.13 et-xmlfile-1.1.0 h11-0.14.0 httpcore-0.16.3 httpx-0.23.3 joblib-1.2.0 markdown-3.4.3 monotonic-1.6 msg_parser-1.2.0 nltk-3.8.1 numpy-1.23.5 olefile-0.46 openpyxl-3.1.2 pillow-9.5.0 pypandoc-1.11 python-docx-0.8.11 python-magic-0.4.27 python-pptx-0.6.21 rfc3986-1.5.0 rich-13.0.1 unstructured-0.5.11 wrapt-1.14.1


In [14]:
import os

In [28]:
from langchain import OpenAI
from langchain.document_loaders import TextLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import CharacterTextSplitter

In [166]:
os.environ['OPENAI_API_KEY'] = ''  # set your OpenAI API key here

In [25]:
loader = TextLoader('../data/raw/twitter-tos.txt')

In [26]:
data = loader.load()

### `data` contains the entire Twitter Terms of Service text

In [None]:
data

### ToS file is 19731 bytes long (`wc -c` counts bytes, not characters)

In [19]:
!wc -c ../data/raw/twitter-tos.txt

   19731 ../data/raw/twitter-tos.txt
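Byte count and character count differ whenever the text contains multi-byte characters, such as the curly quotes that appear in the ToS. The snippet below is a minimal illustration, assuming UTF-8 encoding; the sample string is made up for the example and is not read from the file.

```python
# Illustrative sample (not the real file): curly quotes are multi-byte in UTF-8.
sample = "the words “you” and “your”"

n_chars = len(sample)                  # number of Unicode code points
n_bytes = len(sample.encode("utf-8"))  # what `wc -c` would count for a UTF-8 file

print(n_chars, n_bytes)
```

In the notebook itself, `len(data[0].page_content)` would give the character count, which is the unit `chunk_size` is measured in when the splitter uses its default length function.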


### we split the long text into chunks of about 2000 characters, with a 200-character overlap

In [88]:
chunked_data = CharacterTextSplitter(chunk_size=2000, chunk_overlap=200).create_documents([data[0].page_content])

Created a chunk of size 2775, which is longer than the specified 2000
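The warning above suggests that one piece of the text is longer than `chunk_size` on its own. The sketch below shows why that can happen, assuming the splitter breaks the text on blank lines (`"\n\n"`) and never cuts inside a piece. `split_text` is a hypothetical helper written for this illustration; it ignores `chunk_overlap` and is not the library's implementation.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedy separator-based chunking (simplified: no overlap)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Four "paragraphs"; the third one is longer than the chunk size.
text = "\n\n".join(["a" * 30, "b" * 30, "c" * 120, "d" * 30])
chunks = split_text(text, chunk_size=70)
print([len(c) for c in chunks])
```

The 120-character paragraph ends up as its own oversized chunk, which mirrors the 2775-character chunk reported above.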


### we get 11 chunks

In [89]:
len(chunked_data)

11

In [90]:
chunked_data[0]

Document(page_content='1. Who May Use the Services\nYou may use the Services only if you agree to form a binding contract with Twitter and are not a person barred from receiving services under the laws of the applicable jurisdiction. In any case, you must be at least 13 years old, or in the case of Periscope 16 years old, to use the Services. If you are accepting these Terms and using the Services on behalf of a company, organization, government, or other legal entity, you represent and warrant that you are authorized to do so and have the authority to bind such entity to these Terms, in which case the words “you” and “your” as used in these Terms shall refer to such entity.\n\n \n2. Privacy\nOur Privacy Policy (https://www.twitter.com/privacy) describes how we handle the information you provide to us when you use our Services. You understand that through your use of the Services you consent to the collection and use (as set forth in the Privacy Policy) of this information, including t

In [91]:
small_chunk = chunked_data[0]

In [92]:
small_chunk

Document(page_content='1. Who May Use the Services\nYou may use the Services only if you agree to form a binding contract with Twitter and are not a person barred from receiving services under the laws of the applicable jurisdiction. In any case, you must be at least 13 years old, or in the case of Periscope 16 years old, to use the Services. If you are accepting these Terms and using the Services on behalf of a company, organization, government, or other legal entity, you represent and warrant that you are authorized to do so and have the authority to bind such entity to these Terms, in which case the words “you” and “your” as used in these Terms shall refer to such entity.\n\n \n2. Privacy\nOur Privacy Policy (https://www.twitter.com/privacy) describes how we handle the information you provide to us when you use our Services. You understand that through your use of the Services you consent to the collection and use (as set forth in the Privacy Policy) of this information, including t

In [162]:
llm = OpenAI(temperature=0)

In [93]:

summary_chain = load_summarize_chain(llm, chain_type='stuff')

In [94]:
summary_chain.run([small_chunk])

'\n\nTwitter services are available to those aged 13 or older (16 for Periscope). If using the services on behalf of a company, organization, government, or other legal entity, the user must have the authority to bind such entity to the Terms. The Privacy Policy outlines how Twitter handles user information. Users are responsible for their use of the Services and any Content they provide, and must ensure compliance with applicable laws, rules, and regulations.'
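The `stuff` chain type used above places every document it receives into a single prompt and makes one call to the model. A minimal sketch of that idea follows; `build_stuff_prompt` and the template wording are hypothetical and only illustrate the pattern, not the prompt LangChain actually sends.

```python
def build_stuff_prompt(docs, template):
    """Join every document into one block of text and fill a single prompt."""
    return template.format(text="\n\n".join(docs))


template = "Summarize the following:\n\n{text}\n\nSummary:"  # illustrative wording
docs = ["Section 1 text.", "Section 2 text."]

prompt = build_stuff_prompt(docs, template)
print(prompt)
```

Because everything goes into one prompt, this approach fits a single small chunk well, but the full document may not fit in the model's context window.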

### `refine` summarization 

In this case, a summary is generated from the first chunk and is then passed to the model together with each following chunk, so the summary is refined one chunk at a time.
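The refine loop can be sketched in plain Python. The `summarize` and `refine` callables below are stand-ins for LLM calls and are assumptions made for this illustration; the sketch shows only the control flow, not the prompts the chain uses.

```python
def refine_summarize(chunks, summarize, refine):
    """Summarize the first chunk, then fold in each later chunk in order."""
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        summary = refine(summary, chunk)
    return summary


calls = []  # records the order of the (fake) model calls


def fake_summarize(chunk):
    calls.append(("initial", chunk))
    return f"S({chunk})"


def fake_refine(summary, chunk):
    calls.append(("refine", chunk))
    return f"R({summary}+{chunk})"


result = refine_summarize(["c1", "c2", "c3"], fake_summarize, fake_refine)
print(result)
print(len(calls))
```

Each call depends on the previous summary, so the calls run one after another, one per chunk.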

In [95]:
summary_chain = load_summarize_chain(llm, chain_type='refine')

In [96]:
summary_chain.run(chunked_data)

'\n\nTwitter services are available to those aged 13 or older (16 for Periscope). Users are responsible for their use of the Services and any Content they provide, and must ensure compliance with applicable laws, rules, and regulations. Twitter reserves the right to remove Content that violates the User Agreement and users can report copyright infringement. Users retain their rights to any Content they submit, post or display on or through the Services and grant Twitter a worldwide, non-exclusive, royalty-free license to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content. In consideration for Twitter granting access to and use of the Services, users agree that Twitter and its third-party providers and partners may place advertising on the Services or in connection with the display of Content or information from the Services. Users must not misuse the Services, for example, by interfering with them or accessing them using a method other 

### `map_reduce` summarization 

In this case, each chunk is summarized independently, and a final combine step merges the chunk summaries into one summary.
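The same idea as a plain-Python sketch. The `summarize` and `combine` callables are stand-ins for LLM calls and are assumptions made for this illustration; the real chain may also need extra combine rounds when the intermediate summaries are too long for one prompt, which is left out here.

```python
def map_reduce_summarize(chunks, summarize, combine):
    """Summarize each chunk on its own, then combine the partial summaries."""
    partial = [summarize(chunk) for chunk in chunks]  # map step
    return combine(partial)                           # reduce step


result = map_reduce_summarize(
    ["c1", "c2", "c3"],
    summarize=lambda chunk: f"S({chunk})",
    combine=lambda parts: "C[" + ", ".join(parts) + "]",
)
print(result)
```

Unlike `refine`, the per-chunk summaries do not depend on each other, so the map step could run in parallel.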

In [97]:
summary_chain = load_summarize_chain(llm, chain_type='map_reduce')

In [117]:
summary_chain.run(chunked_data)

Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')).


'\n\nThe Twitter Terms of Service outlines the responsibilities of users when using the Services, including the responsibility to not post any material subject to copyright or other proprietary rights without the necessary permission or legal entitlement. It also outlines the rights and responsibilities of users when using the Services, such as the right to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute their content in any and all media or distribution methods. Additionally, it outlines the terms of use for advertising features of the Services, and the terms of ending a legal agreement with Twitter. Lastly, it states that the Services are provided to the user on an “AS IS” and “AS AVAILABLE” basis, and that the Twitter Entities are not liable for any indirect, incidental, special, consequential or punitive damages. \n\nImportant Sections: \n- Users are responsible for any Content they post and that Twitter does not endorse, support, represent o

### using a custom prompt

In [99]:
from langchain.prompts import PromptTemplate

In [100]:
prompt_template = """Write a detailed summary of the following:


{text}


Summary:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])

In [148]:
combine_prompt_template = """Write a detailed analysis and extract the effective date from the following:


{text}


Analysis: """
COMBINE_PROMPT = PromptTemplate(template=combine_prompt_template, input_variables=["text"])

In [149]:
summary_chain = load_summarize_chain(llm, chain_type='map_reduce', map_prompt=PROMPT, combine_prompt=COMBINE_PROMPT)

In [150]:
output = summary_chain.run(chunked_data)

In [151]:
output

'\n\nThe Twitter Terms of Service outlines the responsibilities of users when using the Services, including the responsibility to not misuse the Services, to not reverse engineer, decompile, or disassemble the software, and to not access, tamper with, or use non-public areas of the Services. It also outlines the rights of users, such as the right to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Services or Content on the Services. Additionally, it outlines the terms of ending a legal agreement with Twitter, such as the right to deactivate an account at any time and the right to file an appeal if an account is terminated in error. Lastly, it outlines the limitations of liability of the Twitter Entities and the governing law of the Terms.\n\nEffective Date: The effective date of the Twitter Terms of Service is when the user continues to access or use the Services after revisions become effective.

## Adding Guardrails

In [None]:
!pip3 install guardrails-ai

In [197]:
from langchain.output_parsers import GuardrailsOutputParser
from rich import print
import guardrails as gr

### create a rail spec

In [236]:
rail_spec = """
<rail version="0.1">

<output>
    <list name="winners">
        <object>
            <string name="person_name" description="Winner Name"/>
            <date name="award_year" description="Year"/>

        </object>
    </list>
</output>

<prompt>

Give the name and year of nobel peace prize winners

@complete_json_suffix_v2
</prompt>
</rail>
"""

In [237]:
output_parser = GuardrailsOutputParser.from_rail_string(rail_spec)

In [238]:
print(output_parser.guard.base_prompt)


In [239]:
prompt = PromptTemplate(
    template=output_parser.guard.base_prompt,
    input_variables=output_parser.guard.prompt.variable_names,
)

In [240]:
print(prompt)

In [241]:
llm = OpenAI(temperature=0)

In [242]:
output

'\n\nThe Twitter Terms of Service outlines the responsibilities of users when using the Services, including the responsibility to not misuse the Services, to not reverse engineer, decompile, or disassemble the software, and to not access, tamper with, or use non-public areas of the Services. It also outlines the rights of users, such as the right to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Services or Content on the Services. Additionally, it outlines the terms of ending a legal agreement with Twitter, such as the right to deactivate an account at any time and the right to file an appeal if an account is terminated in error. Lastly, it outlines the limitations of liability of the Twitter Entities and the governing law of the Terms.\n\nEffective Date: The effective date of the Twitter Terms of Service is when the user continues to access or use the Services after revisions become effective.

In [243]:
prompt.format_prompt().to_string()

'\n\nGive the name and year of nobel peace prize winners\n\n\nGiven below is XML that describes the information to extract from this document and the tags to extract it into.\n\n<output>\n    <list name="winners">\n        <object>\n            <string name="person_name" description="Winner Name"/>\n            <date name="award_year" description="Year"/>\n        </object>\n    </list>\n</output>\n\n\nONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML\'s tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise.\n\nHere are examples of simple (XML, JSON) pairs that show the expected behavior:\n- `<string name=\'foo\' format=\'two-words lower-case\' />` => `{\'foo\': \'example one\'}`\n- `<list name=\'bar\'><string format=\'up

In [244]:
llm_output = llm(prompt.format_prompt().to_string())

In [245]:
llm_output

'\n{"winners": [{"person_name": "Nelson Mandela", "award_year": 1993}, {"person_name": "Malala Yousafzai", "award_year": 2014}]}'

In [246]:
import openai

In [247]:
guard = gr.Guard.from_rail_string(rail_spec)

In [248]:
guard

Guard(RAIL=Rail(input_schema=InputSchema({}), output_schema=OutputSchema({'winners': List({'item': Object({'person_name': String({}), 'award_year': Date({})})})}), prompt=Prompt(

Give the name and year of nobel peace prize winn...), script=Script(variables={}, language='python'), version='0.1'))

In [249]:
raw_llm_response, validated_response = guard(
    llm
)

TypeError: strptime() argument 1 must be str, not int
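The traceback is consistent with the `<date>` field `award_year` receiving the integer `1993` that the model returned earlier. The snippet below reproduces the same error with the standard library only; the `"%Y"` format is an assumption for the illustration, and this is not Guardrails' own code.

```python
import json
from datetime import datetime

# Shortened copy of the model output shown earlier: the year is a JSON number.
llm_output = '{"winners": [{"person_name": "Nelson Mandela", "award_year": 1993}]}'
year = json.loads(llm_output)["winners"][0]["award_year"]

try:
    datetime.strptime(year, "%Y")  # strptime expects a string, not an int
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Converting to str first lets the year parse as a date.
parsed = datetime.strptime(str(year), "%Y")
print(parsed.year)
```

Declaring `award_year` as an integer in the rail spec, or asking the model to return the year as a string, would likely avoid the error; neither option was verified against the installed Guardrails version.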

### create a rail spec for extracting driver's license fields

In [292]:
rail_spec = """
<rail version="0.1">

<output>
        <object name='driver_details'>
            <string name="fname" description="First Name"/>
            <string name="lname" description="Last Name"/>
            <string name="address" description="Address"/>
            <string name="issue_date" description="ISS"/>
            <string name="exp_date" description="Exp Date"/>
            <string name="gender" description="Gender" format="valid-choices: choices=['M', 'F', 'O']"/>
        </object>

</output>

<prompt>

Extract entities from the below ocr output

KANSAS DRIVER'S LICENSE 


u offer utmL 1111512019 
4d LIC. NO. K12-34-5679 3 V1115/1998 class A saENDNONE 2..-NONE iss 11/15/2017 4b EXP 11115/2023 SAMPLE 2 NICK 8 123 NORTH STREET APT. KS 66612-1234 15SEX M`.. -17WGT 170 113 16 MGT 6,0r 18EYES BRO 
11 /15/98 Igx4x.g2=12 

@complete_json_suffix_v2
</prompt>
</rail>
"""

In [293]:
output_parser = GuardrailsOutputParser.from_rail_string(rail_spec)

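TypeError: ValidChoices.__init__() got multiple values for argument 'on_fail'

The error is raised by the `valid-choices` validator, so it points at the `format` attribute of the `gender` field: the installed Guardrails version apparently does not accept the `choices=[...]` form used above. The remaining cells have lower execution counts (279 to 285), so their output presumably comes from an earlier version of the spec.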

In [284]:
print(output_parser.guard.base_prompt)


In [285]:
prompt = PromptTemplate(
    template=output_parser.guard.base_prompt,
    input_variables=output_parser.guard.prompt.variable_names,
)

In [279]:
print(prompt)

In [280]:
llm_output = llm(prompt.format_prompt().to_string())

In [281]:
llm_output

'\n{"driver_details": {"fname": "Nick", "lname": "Sample", "address": "123 North Street Apt.", "issue_date": "11/15/2017", "exp_date": "11/15/2023", "sex": "M"}}'
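The output above uses the key `sex`, while the rail spec in this section names the field `gender`. A small check like the one below would flag that mismatch. It is a plain-Python sketch: `check_driver_details` is a hypothetical helper, not part of the Guardrails API, and the field list is copied from the spec above.

```python
import json

EXPECTED_FIELDS = ["fname", "lname", "address", "issue_date", "exp_date", "gender"]
GENDER_CHOICES = {"M", "F", "O"}


def check_driver_details(details):
    """Return a list of problems found in the extracted fields."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in details:
            problems.append(f"missing field: {field}")
    for field in details:
        if field not in EXPECTED_FIELDS:
            problems.append(f"unexpected field: {field}")
    if "gender" in details and details["gender"] not in GENDER_CHOICES:
        problems.append(f"invalid gender: {details['gender']}")
    return problems


# The model output shown above, copied as a string.
llm_output = (
    '\n{"driver_details": {"fname": "Nick", "lname": "Sample", '
    '"address": "123 North Street Apt.", "issue_date": "11/15/2017", '
    '"exp_date": "11/15/2023", "sex": "M"}}'
)

problems = check_driver_details(json.loads(llm_output)["driver_details"])
print(problems)
```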