<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/langchain/use_cases/Langchain_OpenAI_And_Faker_Generate_Syntethic_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain

LangChain is a framework for developing applications powered by language models.

https://python.langchain.com/docs/use_cases

## LangChain Synthetic Data
Synthetic data is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations.

Benefits of Synthetic Data:

- Privacy and Security: No real personal data at risk of breaches.
- Data Augmentation: Expands datasets for machine learning.
- Flexibility: Create specific or rare scenarios.
- Cost-effective: Often cheaper than real-world data collection.
- Regulatory Compliance: Helps navigate strict data protection laws.
- Model Robustness: Can lead to better generalizing AI models.
- Rapid Prototyping: Enables quick testing without real data.
- Controlled Experimentation: Simulate specific conditions.
- Access to Data: Alternative when real data isn't available.

https://python.langchain.com/docs/get_started/introduction

https://python.langchain.com/docs/use_cases/data_generation

https://python.langchain.com/docs/modules/model_io/prompts/

https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

https://api.python.langchain.com/en/latest/experimental_api_reference.html

## Presidio
https://microsoft.github.io/presidio/

https://spacy.io/


## Faker

https://faker.readthedocs.io/en/master/


In [27]:
! pip install langchain langchain-community tiktoken faker -q
! pip install -U unstructured numpy -q
! pip install openai  -q
! pip install presidio_analyzer presidio_anonymizer -q
! python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 2.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [28]:
! pip install langchain_experimental langchain-openai -q

In [29]:

from google.colab import output
output.enable_custom_widget_manager()

In [30]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
from google.colab import userdata
openai_api_key = userdata.get('KEY_OPENAI')


In [32]:
from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_experimental.tabular_synthetic_data.openai import (
    OPENAI_TEMPLATE,
    create_openai_data_generator,
)
from langchain_experimental.tabular_synthetic_data.prompts import (
    SYNTHETIC_FEW_SHOT_PREFIX,
    SYNTHETIC_FEW_SHOT_SUFFIX,
)
from langchain_openai import ChatOpenAI

In [33]:
class PII_entities(BaseModel):
    PERSON: str
    LOCATION: str
    CREDIT_CARD: str
    EMAIL_ADDRESS: str
    IP_ADDRESS: str
    IBAN_CODE: str

# Faker to generate PII entities



In [34]:
from faker import Faker

fake = Faker()

# Build ten fake PII records, one PII_entities instance per record.
synthetic_results = []
for _ in range(10):
    cl = PII_entities(
        PERSON=fake.name(),
        LOCATION=fake.city(),
        CREDIT_CARD=fake.credit_card_number(),
        EMAIL_ADDRESS=fake.email(),
        IP_ADDRESS=fake.ipv4_public(),
        IBAN_CODE=fake.iban(),
    )
    synthetic_results.append(cl)

In [35]:
synthetic_results[0]

PII_entities(PERSON='Olivia Berry', LOCATION='New Danielton', CREDIT_CARD='3587172382678078', EMAIL_ADDRESS='ariasrobert@example.net', IP_ADDRESS='19.29.89.27', IBAN_CODE='GB35CPKN05411498890583')
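
Card numbers generated this way are expected to pass the Luhn checksum, which is also the kind of check a PII detector can use to gain confidence in a credit card match. The sketch below is a minimal sanity check under that assumption; `luhn_valid` is a hypothetical helper written for illustration, not a Faker or Presidio function.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


sample_card = "3587172382678078"  # CREDIT_CARD value from the record shown above
print(luhn_valid(sample_card))
```

Running the same check over every element of `synthetic_results` is a cheap way to confirm the generated records look like plausible card numbers before they are passed to the LLM.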

# Create a Dataset of Synthetic Data


https://api.python.langchain.com/en/latest/tabular_synthetic_data/langchain_experimental.tabular_synthetic_data.base.SyntheticDataGenerator.html



In [37]:
from langchain_experimental.synthetic_data import DatasetGenerator

# LLM used to write the synthetic text around the Faker values
model = ChatOpenAI(model_name="gpt-4", temperature=0.7, openai_api_key=openai_api_key)


In [38]:
# Convert each PII_entities record into a plain dict for the generator.
imp = []
for r in synthetic_results:
    data = {
        "PERSON": r.PERSON,
        "LOCATION": r.LOCATION,
        "CREDIT_CARD": r.CREDIT_CARD,
        "EMAIL_ADDRESS": r.EMAIL_ADDRESS,
        "IP_ADDRESS": r.IP_ADDRESS,
        "IBAN_CODE": r.IBAN_CODE,
    }
    imp.append(data)

In [39]:
# Example input for generating synthetic customer profiles
imp[-1]

{'PERSON': 'Barry Davis',
 'LOCATION': 'Wattsview',
 'CREDIT_CARD': '374019527528556',
 'EMAIL_ADDRESS': 'mccoymark@example.net',
 'IP_ADDRESS': '181.109.191.229',
 'IBAN_CODE': 'GB68AWEJ41397223106326'}

In [40]:
generator = DatasetGenerator(model, {"style": "formal", "minimal length": 500})
dataset = generator(imp)

In [None]:
dataset[-1]

{'fields': {'PERSON': 'Lisa Henderson',
  'LOCATION': 'Port Eric',
  'CREDIT_CARD': '675968923622',
  'EMAIL_ADDRESS': 'justin59@example.com',
  'IP_ADDRESS': '192.193.68.227',
  'IBAN_CODE': 'GB87XPVY63622793797030'},
 'preferences': {'style': 'formal', 'minimal length': 500},
 'text': 'Lisa Henderson, an esteemed resident of the picturesque seaside town of Port Eric, is a woman of sophisticated tastes and impeccable credentials. Due to her meticulous nature, she is known to handle her financial affairs with utmost precision. Her credit card, the number of which is 675968923622, is a testament to her regular transactions and a symbol of her financial independence. She is an adept user of the internet and her activities can be traced back to a specific IP address, 192.193.68.227, highlighting her digital footprint. In the realm of virtual communication, she can be reached via her email address, justin59@example.com, which she checks regularly to stay updated with her personal and profe

In [None]:
dataset[-1]['text']

'Lisa Henderson, an esteemed resident of the picturesque seaside town of Port Eric, is a woman of sophisticated tastes and impeccable credentials. Due to her meticulous nature, she is known to handle her financial affairs with utmost precision. Her credit card, the number of which is 675968923622, is a testament to her regular transactions and a symbol of her financial independence. She is an adept user of the internet and her activities can be traced back to a specific IP address, 192.193.68.227, highlighting her digital footprint. In the realm of virtual communication, she can be reached via her email address, justin59@example.com, which she checks regularly to stay updated with her personal and professional engagements. As an international businesswoman, she is also known to utilize her IBAN code, GB87XPVY63622793797030, for her cross-border financial transactions, further cementing her status as a global citizen.'
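
Because the LLM writes the surrounding text, it is worth confirming that every input value survived verbatim in the generated passage. The following is a minimal sketch of such a check; `missing_fields` is a hypothetical helper, and `record` is an abbreviated stand-in for one element of `dataset` (only three fields and a shortened excerpt of the text).

```python
from typing import Dict, List


def missing_fields(fields: Dict[str, str], text: str) -> List[str]:
    """Return the names of fields whose value does not appear verbatim in text."""
    return [name for name, value in fields.items() if value not in text]


# Abbreviated stand-in for one element of `dataset` (assumption for illustration).
record = {
    "fields": {
        "PERSON": "Lisa Henderson",
        "LOCATION": "Port Eric",
        "CREDIT_CARD": "675968923622",
    },
    "text": (
        "Lisa Henderson, an esteemed resident of the picturesque seaside town "
        "of Port Eric, is a woman of sophisticated tastes. Her credit card, the "
        "number of which is 675968923622, is a testament to her regular "
        "transactions."
    ),
}

print(missing_fields(record["fields"], record["text"]))  # [] when nothing is missing
```

Records that return a non-empty list can be regenerated or dropped, so that the ground-truth fields and the text stay aligned.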

# Extraction with Parsers

https://python.langchain.com/docs/modules/model_io/output_parsers/types/pydantic


In [41]:
from langchain.output_parsers import PydanticOutputParser

from langchain_openai import OpenAI

In [45]:
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

In [44]:

parser = PydanticOutputParser(pydantic_object=PII_entities)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

In [46]:
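# Render the prompt for the first generated record and run the extraction.
_input = prompt.format_prompt(text=dataset[0]["text"])
output = llm.invoke(_input.to_string())  # invoke() replaces the deprecated llm(...) call style
parsed = parser.parse(output)
print(parsed)
print(dataset[0]["text"])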

PERSON='Olivia Berry' LOCATION='New Danielton' CREDIT_CARD='3587172382678078' EMAIL_ADDRESS='ariasrobert@example.net' IP_ADDRESS='19.29.89.27' IBAN_CODE='GB35CPKN05411498890583'
Olivia Berry, a well-known and respected resident of New Danielton, has been recently identified through her unique IP address of 19.29.89.27, associated with her personal email address, ariasrobert@example.net. Highly esteemed within her community, Ms. Berry has a reputation for her meticulous attention to detail in all aspects of her life, which includes the careful management of her financial assets. Her financial transactions are usually facilitated through her credit card numbered 3587172382678078, a number she guards closely, knowing the potential risks associated with financial fraud in the digital age. Although she had made a conscious choice to live in the small town of New Danielton, she frequently engages in international transactions, especially to the United Kingdom, facilitated by her Internationa
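
For this record the extracted values match the Faker values that seeded the text. To measure this across a whole dataset, the extracted fields can be compared with the source fields one by one. The sketch below assumes both sides are available as plain dicts (for example `dataset[i]["fields"]` and `parsed.dict()`); `field_accuracy` is a hypothetical helper, and the dicts are shortened to three fields.

```python
from typing import Dict


def field_accuracy(expected: Dict[str, str], extracted: Dict[str, str]) -> Dict[str, bool]:
    """Map each expected field name to whether the extracted value matches exactly."""
    return {
        name: extracted.get(name, "").strip() == value
        for name, value in expected.items()
    }


# Values taken from the first record shown above, shortened to three fields.
expected = {"PERSON": "Olivia Berry", "LOCATION": "New Danielton", "IP_ADDRESS": "19.29.89.27"}
extracted = {"PERSON": "Olivia Berry", "LOCATION": "New Danielton", "IP_ADDRESS": "19.29.89.27"}

result = field_accuracy(expected, extracted)
print(result)
print(sum(result.values()) / len(result))  # fraction of fields recovered exactly
```

Exact string comparison is deliberately strict here; a looser match (case-insensitive, whitespace-normalized) may suit free-text fields such as names better.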

# PII Detection

In [47]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
import pprint
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()



In [49]:
# Detect PII in each generated text, anonymize it, and print both versions.
for d in dataset:
    sample = d["text"]
    results = analyzer.analyze(text=sample, language="en")
    anonymized = anonymizer.anonymize(text=sample, analyzer_results=results)
    anonymized_text = anonymized.text
    pprint.pprint(sample)
    pprint.pprint(anonymized_text)
    print("-" * 50)


('Olivia Berry, a well-known and respected resident of New Danielton, has been '
 'recently identified through her unique IP address of 19.29.89.27, associated '
 'with her personal email address, ariasrobert@example.net. Highly esteemed '
 'within her community, Ms. Berry has a reputation for her meticulous '
 'attention to detail in all aspects of her life, which includes the careful '
 'management of her financial assets. Her financial transactions are usually '
 'facilitated through her credit card numbered 3587172382678078, a number she '
 'guards closely, knowing the potential risks associated with financial fraud '
 'in the digital age. Although she had made a conscious choice to live in the '
 'small town of New Danielton, she frequently engages in international '
 'transactions, especially to the United Kingdom, facilitated by her '
 'International Bank Account Number (IBAN) GB35CPKN05411498890583. This mode '
 'of transaction she finds convenient, efficient, and secure, espec
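
Conceptually, anonymization takes the character spans reported by the analyzer and substitutes each one, by default with a placeholder such as `<PERSON>`. The sketch below illustrates that idea only; it is not Presidio's implementation. `Span`, `redact`, and `span_for` are hypothetical names, and the spans are assumed not to overlap.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with an <ENTITY_TYPE> placeholder."""
    # Work from the end of the text so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[: span.start] + f"<{span.entity_type}>" + text[span.end :]
    return text


def span_for(text: str, entity_type: str, value: str) -> Span:
    """Build a span by locating `value` in `text` (stand-in for a detector)."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value))


text = "Olivia Berry lives in New Danielton."
spans = [
    span_for(text, "PERSON", "Olivia Berry"),
    span_for(text, "LOCATION", "New Danielton"),
]
print(redact(text, spans))  # <PERSON> lives in <LOCATION>.
```

Replacing from right to left is the key design choice: each substitution changes the length of the text, which would invalidate the offsets of any span that comes later in the string.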

In [50]:
results

[type: CREDIT_CARD, start: 406, end: 421, score: 1.0,
 type: EMAIL_ADDRESS, start: 601, end: 622, score: 1.0,
 type: IBAN_CODE, start: 999, end: 1021, score: 1.0,
 type: IP_ADDRESS, start: 824, end: 839, score: 0.95,
 type: PERSON, start: 111, end: 122, score: 0.85,
 type: LOCATION, start: 168, end: 177, score: 0.85,
 type: PERSON, start: 1206, end: 1211, score: 0.85,
 type: PERSON, start: 1364, end: 1375, score: 0.85,
 type: URL, start: 611, end: 622, score: 0.5,
 type: IN_PAN, start: 5, end: 15, score: 0.05,
 type: IN_PAN, start: 95, end: 105, score: 0.05,
 type: IN_PAN, start: 278, end: 288, score: 0.05,
 type: US_BANK_NUMBER, start: 406, end: 421, score: 0.05,
 type: IN_PAN, start: 522, end: 532, score: 0.05,
 type: IN_PAN, start: 601, end: 611, score: 0.05,
 type: IN_PAN, start: 717, end: 727, score: 0.05,
 type: IN_PAN, start: 850, end: 860, score: 0.05,
 type: IN_PAN, start: 925, end: 935, score: 0.05,
 type: IN_PAN, start: 941, end: 951, score: 0.05]
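
The list above mixes high-confidence detections with many low-scoring ones (for example the `IN_PAN` matches at 0.05). One way to keep only the reliable entries is to filter by score. The sketch below uses a hypothetical `Detection` tuple as a stand-in for the analyzer's result objects and copies a few rows from the output above; `AnalyzerEngine.analyze` also accepts a `score_threshold` argument, which may be a simpler route in practice.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def keep_confident(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d.score >= threshold]


# A few rows copied from the analyzer output above.
detections = [
    Detection("CREDIT_CARD", 406, 421, 1.0),
    Detection("EMAIL_ADDRESS", 601, 622, 1.0),
    Detection("URL", 611, 622, 0.5),
    Detection("IN_PAN", 5, 15, 0.05),
    Detection("US_BANK_NUMBER", 406, 421, 0.05),
]

confident = keep_confident(detections, threshold=0.6)
print([d.entity_type for d in confident])  # ['CREDIT_CARD', 'EMAIL_ADDRESS']
```

The threshold of 0.6 is an arbitrary choice for illustration; it also drops the `URL` match that overlaps the email address, while a lower value such as 0.4 would keep it.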