# Advanced Graph Build


## Filter Master CSV to First 6 Categories

Process:
1. Load the full `master_clauses.csv`.  
2. Select our six clause-of-interest columns (`Parties`, `Agreement Date`, `Effective Date`, `Expiration Date`, `Renewal Term`, `Notice Period To Terminate Renewal`) plus their corresponding “-Answer” fields.  
3. Filter the DataFrame to retain only contracts with at least one non-empty/positive answer.  
4. Reset the index and save the result to `filtered_master_clauses.csv`.
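
The filtering rule in step 3 can be illustrated on a toy frame. This is a minimal sketch, not the notebook's actual cell: the toy column values are made up, and it uses a vectorized `any(axis=1)` form as an assumed equivalent of a per-column loop.

```python
import pandas as pd

# Toy stand-in for the master table; filenames and values are illustrative only.
toy = pd.DataFrame({
    "Filename": ["a.pdf", "b.pdf", "c.pdf"],
    "Agreement Date-Answer": ["5/8/14", None, "[]"],
    "Renewal Term-Answer": ["No", "", "2 years"],
})
answer_cols = ["Agreement Date-Answer", "Renewal Term-Answer"]
EMPTY_MARKERS = ["No", "[]", ""]

# A cell counts as "answered" if it is not NaN and not one of the empty markers.
answered = toy[answer_cols].notna() & ~toy[answer_cols].isin(EMPTY_MARKERS)

# Keep a contract if at least one of its answer cells is answered.
kept = toy[answered.any(axis=1)].reset_index(drop=True)
print(kept["Filename"].tolist())
```

Here `b.pdf` is dropped because both of its answer cells are empty or missing, while the other two contracts each have one real answer.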


In [4]:
import pandas as pd

# 1. Load the master CSV
df_master = pd.read_csv(r'C:\Repositories\USA_Project\Graph-Test\master_clauses.csv')

# 2. Exact column names for our first six categories:
context_cols = [
    "Parties",
    "Agreement Date",
    "Effective Date",
    "Expiration Date",
    "Renewal Term",
    "Notice Period To Terminate Renewal"
]

# Note the precise answer column names (matching the CSV):
answer_cols = [
    "Parties-Answer",
    "Agreement Date-Answer",
    "Effective Date-Answer",
    "Expiration Date-Answer",
    "Renewal Term-Answer",
    "Notice Period To Terminate Renewal- Answer"
]

# 3. Subset to these columns + Filename
cols_to_keep = ["Filename"] + context_cols + answer_cols
df_sub = df_master[cols_to_keep].copy()

# 4. Filter to rows where at least one answer is non-empty/positive
mask = pd.Series(False, index=df_sub.index)
for ans in answer_cols:
    mask |= df_sub[ans].notna() & ~df_sub[ans].isin(["No", "[]", ""])
df_mini = df_sub[mask].reset_index(drop=True)

# 5. Inspect and save
print(f"Filtered to {len(df_mini)} contracts out of {len(df_master)} total.")
print(df_mini.head())

df_mini.to_csv(r'C:\Repositories\USA_Project\Graph-Test\filtered_master_clauses.csv', index=False)
print("Saved filtered CSV to filtered_master_clauses.csv")


Filtered to 509 contracts out of 510 total.
                                            Filename  \
0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   
1  EuromediaHoldingsCorp_20070215_10SB12G_EX-10.B...   
2  FulucaiProductionsLtd_20131223_10-Q_EX-10.9_83...   
3  GopageCorp_20140221_10-K_EX-10.1_8432966_EX-10...   
4  IdeanomicsInc_20160330_10-K_EX-10.26_9512211_E...   

                                             Parties  \
0  ['BIRCH FIRST GLOBAL INVESTMENTS INC.', 'MA', ...   
1  ['EuroMedia Holdings Corp.', 'Rogers', 'Rogers...   
2  ['Producer', 'Fulucai Productions Ltd.', 'Conv...   
3  ['PSiTech Corporation', 'Licensor', 'Licensee'...   
4  ['YOU ON DEMAND HOLDINGS, INC.', 'Licensor', '...   

                           Agreement Date  \
0  ['8th day of May 2014', 'May 8, 2014']   
1                      ['July 11 , 2006']   
2                   ['November 15, 2012']   
3                        ['Feb 10, 2014']   
4                   ['December 21, 2015']   

           

## Explode to Clause-Level Snippets

Process:
1. Read `filtered_master_clauses.csv` (contract-level).  
2. Define the six clause context columns.  
3. Iterate over each contract and each context column, extracting non-empty snippets.  
4. Build a snippet-level DataFrame with columns:  
   - `doc_idx` (contract row index)  
   - `filename`  
   - `category`  
   - `snippet_text`  
5. Save the snippet DataFrame to `mini_cuad_snippets.csv`.

In short, for each document with a non-empty value in a clause column, that snippet becomes its own row.
A document with values for several clauses therefore contributes several rows. Each snippet row is a child node, and the document is its parent node.
See `mini_cuad_snippets.csv` for the result.
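
The parent/child relationship described above could be derived from the snippet table roughly as follows. This is only a sketch under assumptions: the node id scheme (`doc:<doc_idx>`, `snip:<row index>`) is hypothetical and the toy values are made up.

```python
import pandas as pd

# Toy snippet table with the same columns as mini_cuad_snippets.csv (values illustrative).
snips = pd.DataFrame({
    "doc_idx": [0, 0, 1],
    "filename": ["a.pdf", "a.pdf", "b.pdf"],
    "category": ["Parties", "Agreement Date", "Parties"],
    "snippet_text": ["['X', 'Y']", "['May 8, 2014']", "['Z']"],
})

# Hypothetical node ids: "doc:<doc_idx>" for parents, "snip:<row index>" for children.
edges = [(f"doc:{d}", f"snip:{i}") for i, d in snips["doc_idx"].items()]

# Number of child snippets attached to each parent document.
children_per_doc = snips.groupby("doc_idx").size().to_dict()
print(edges)
print(children_per_doc)
```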

In [5]:
import pandas as pd

# 1. Load the filtered master CSV
df_filtered = pd.read_csv(r'C:\Repositories\USA_Project\Graph-Test\filtered_master_clauses.csv')

# 2. Define the six clause context columns exactly
context_cols = [
    "Parties",
    "Agreement Date",
    "Effective Date",
    "Expiration Date",
    "Renewal Term",
    "Notice Period To Terminate Renewal"
]

# 3. Explode into one row per non-empty snippet
rows = []
for doc_idx, row in df_filtered.reset_index(drop=True).iterrows():
    for cat in context_cols:
        snippet = row[cat]
        if pd.notna(snippet) and snippet not in ["", "No", "[]"]:
            rows.append({
                "doc_idx": doc_idx,
                "filename": row["Filename"],
                "category": cat,
                "snippet_text": str(snippet).strip()  # str() guards against non-string cells
            })

snips_df = pd.DataFrame(rows)

# 4. Inspect and save
print(f"Extracted {len(snips_df)} snippets:")
print(snips_df.head())

snips_df.to_csv(r'C:\Repositories\USA_Project\Graph-Test\mini_cuad_snippets.csv', index=False)
print("Snippet‐level CSV saved to /mnt/data/mini_cuad_snippets.csv")


Extracted 2069 snippets:
   doc_idx                                           filename  \
0        0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   
1        0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   
2        0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   
3        0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   
4        0  CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   

          category                                       snippet_text  
0          Parties  ['BIRCH FIRST GLOBAL INVESTMENTS INC.', 'MA', ...  
1   Agreement Date             ['8th day of May 2014', 'May 8, 2014']  
2   Effective Date  ['This agreement shall begin upon the date of ...  
3  Expiration Date  ['This agreement shall begin upon the date of ...  
4     Renewal Term  ['This agreement shall begin upon the date of ...  
Snippet‐level CSV saved to /mnt/data/mini_cuad_snippets.csv
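
The `snippet_text` values in the output above look like Python lists but are stored as plain strings after the CSV round trip. If individual spans are needed later, one option is to parse them with `ast.literal_eval`. The helper name `parse_snippet` is hypothetical, and the fallback behavior is an assumption about what is convenient downstream.

```python
import ast

def parse_snippet(text):
    """Return the list encoded in `text`, or a one-item list with the raw text if parsing fails."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the raw string as a single span.
        return [text]
    # Lists and tuples become lists; any other literal is wrapped as one string.
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]

print(parse_snippet("['8th day of May 2014', 'May 8, 2014']"))
print(parse_snippet("not a list"))
```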


## Embed Clause Snippets


1. Load the snippet‐level CSV (`mini_cuad_snippets.csv`) produced in the previous step.  
2. Initialize a legal-domain SBERT model (`Stern5497/sbert-legal-xlm-roberta-base`). This model choice is still being tested.  
3. Encode each `snippet_text` into a dense vector, storing the result in a new `embedding` column.  
4. Persist the enriched DataFrame to `mini_cuad_snippets_emb.pkl` for downstream graph construction.


In [None]:
import os
import pandas as pd
from sentence_transformers import SentenceTransformer

# 1. Load snippet table
snips = pd.read_csv(r'C:\Repositories\USA_Project\Graph-Test\mini_cuad_snippets.csv')

# 2. Initialize a legal SBERT model
model = SentenceTransformer('Stern5497/sbert-legal-xlm-roberta-base')

# 3. Embed snippets
snip_texts = snips['snippet_text'].tolist()
snip_embs  = model.encode(snip_texts, show_progress_bar=True)
snips['embedding'] = list(snip_embs)
snips.to_pickle(r'C:\Repositories\USA_Project\Graph-Test\mini_cuad_snippets_emb.pkl')
print(f"Encoded {len(snip_embs)} snippets.")



## Test – Verify TXT File Presence

The first attempt was a direct existence check:
1. Load `filtered_master_clauses.csv`.  
2. List all expected PDF filenames.  
3. Check for each whether a corresponding `.txt` exists in `full_contract_txt`, and record its file size.  
4. Display a summary table of existence counts and identify any missing or empty files.

Some files were not found because of slight name mismatches, such as `_` in place of `'`, so fuzzy matching was tried next.

### Test – Fuzzy Match Basenames
1. List all actual `.txt` basenames in the `full_contract_txt` folder.  
2. For each PDF basename with no direct match, use `difflib.get_close_matches` (cutoff=0.8) to propose the closest `.txt` basename.  
3. Display the mapping suggestions to review and ensure slight naming differences (underscores, punctuation) are handled automatically.
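
The fuzzy-matching step can be tried in isolation. In this sketch the helper name `suggest_txt_basename` and the example basenames are hypothetical; only `difflib.get_close_matches` and the 0.8 cutoff come from the steps above.

```python
import difflib

def suggest_txt_basename(pdf_base, txt_bases, cutoff=0.8):
    """Return an exact match, else the closest basename above `cutoff`, else None."""
    if pdf_base in txt_bases:
        return pdf_base
    matches = difflib.get_close_matches(pdf_base, txt_bases, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Made-up basenames that differ only by an apostrophe vs. an underscore.
txt_bases = ["Reed_s_Inc_2019", "FuelcellEnergy_2019"]
print(suggest_txt_basename("Reed's_Inc_2019", txt_bases))
print(suggest_txt_basename("zzzz", txt_bases))
```

A single-character difference in a 15-character name scores well above 0.8, so the first call resolves to the underscore variant, while the second call has no plausible candidate and returns `None`.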




In [8]:
# import os
# import pandas as pd
# from sentence_transformers import SentenceTransformer

# model = SentenceTransformer('Stern5497/sbert-legal-xlm-roberta-base')

# # 4. Embed full contracts
# master   = pd.read_csv(r'C:\Repositories\USA_Project\Graph-Test\filtered_master_clauses.csv')
# files    = master['Filename'].unique()
# BASE_TXT_DIR = r'C:\Repositories\USA_Project\Graph-Test\CUAD_v1\full_contract_txt'


# doc_texts = []
# for fname in files:
#     # Split off the extension (whether .pdf, .PDF, .Pdf, etc.)
#     base, _ = os.path.splitext(fname)
#     txt_name = base + '.txt'
#     txt_path = os.path.join(BASE_TXT_DIR, txt_name)

#     try:
#         with open(txt_path, encoding='utf-8', errors='ignore') as f:
#             doc_texts.append(f.read())
#     except FileNotFoundError:
#         print(f"⚠️  Missing {txt_path}, adding empty text")
#         doc_texts.append("")


# doc_embs = model.encode(doc_texts, show_progress_bar=True)
# docs_df  = pd.DataFrame({'Filename': files, 'embedding': list(doc_embs)})
# docs_df.to_pickle(r'C:\Repositories\USA_Project\Graph-Test\mini_cuad_docs_emb.pkl')
# print(f"Encoded {len(doc_embs)} documents.")

In [None]:
# import os
# import pandas as pd

# # Adjust to your local paths
# filtered_csv = r'C:\Repositories\USA_Project\Graph-Test\filtered_master_clauses.csv'
# BASE_TXT_DIR = r'C:\Repositories\USA_Project\Graph-Test\CUAD_v1\full_contract_txt'

# master = pd.read_csv(filtered_csv)
# files = master['Filename'].unique().tolist()

# checks = []
# for fname in files:
#     base, _ = os.path.splitext(fname)
#     txt_name = base + '.txt'
#     txt_path = os.path.join(BASE_TXT_DIR, txt_name)
#     exists = os.path.exists(txt_path)
#     size = os.path.getsize(txt_path) if exists else 0
#     checks.append({'Filename': fname, 'TxtName': txt_name, 'Exists': exists, 'SizeBytes': size})

# df_checks = pd.DataFrame(checks)
# print(df_checks.head(20))
# print("\nExistence counts:\n", df_checks['Exists'].value_counts())
# print("\nMissing files:\n", df_checks.loc[~df_checks['Exists'], 'TxtName'].tolist())


                                             Filename  \
0   CybergyHoldingsInc_20140520_10-Q_EX-10.27_8605...   
1   EuromediaHoldingsCorp_20070215_10SB12G_EX-10.B...   
2   FulucaiProductionsLtd_20131223_10-Q_EX-10.9_83...   
3   GopageCorp_20140221_10-K_EX-10.1_8432966_EX-10...   
4   IdeanomicsInc_20160330_10-K_EX-10.26_9512211_E...   
5   DeltathreeInc_19991102_S-1A_EX-10.19_6227850_E...   
6   EdietsComInc_20001030_10QSB_EX-10.4_2606646_EX...   
7   IntegrityMediaInc_20010329_10-K405_EX-10.17_23...   
8   MusclepharmCorp_20170208_10-KA_EX-10.38_989358...   
9   TomOnlineInc_20060501_20-F_EX-4.46_749700_EX-4...   
10  ConformisInc_20191101_10-Q_EX-10.6_11861402_EX...   
11  EtonPharmaceuticalsInc_20191114_10-Q_EX-10.1_1...   
12  FuelcellEnergyInc_20191106_8-K_EX-10.1_1186800...   
13  ReedsInc_20191113_10-Q_EX-10.4_11888303_EX-10....   
14  FuseMedicalInc_20190321_10-K_EX-10.43_11575454...   
15  GentechHoldingsInc_20190808_1-A_EX1A-6 MAT CTR...   
16  ImineCorp_20180725_S-1_EX-1

## Final Load & Embed Contracts with Automatic Mapping

1. Build an in-memory mapping from each PDF basename to the best matching TXT basename (exact or fuzzy).  
2. Load each contract’s text via that mapping (falling back to `""` if still unmatched).  
3. Encode all contract texts with the legal SBERT model `Stern5497/sbert-legal-xlm-roberta-base`.  
4. Save the resulting DataFrame of (`Filename`, `embedding`) to `mini_cuad_docs_emb.pkl`.


In [None]:
import os
import difflib
import pandas as pd
from sentence_transformers import SentenceTransformer

# ── CONFIGURE THESE THREE PATHS ────────────────────────────────
CSV_PATH    = r'C:\Repositories\USA_Project\Graph-Test\filtered_master_clauses.csv'
TXT_FOLDER  = r'C:\Repositories\USA_Project\Graph-Test\CUAD_v1\full_contract_txt'
OUTPUT_PKL  = r'C:\Repositories\USA_Project\Graph-Test\mini_cuad_docs_emb.pkl'
# ────────────────────────────────────────────────────────────────

# 1. Load the list of filenames from your filtered master CSV
df_master = pd.read_csv(CSV_PATH)
pdf_files = df_master['Filename'].unique().tolist()

# 2. List the actual .txt files on disk and strip off their extensions
all_txt = os.listdir(TXT_FOLDER)
txt_basenames = {os.path.splitext(f)[0]: f for f in all_txt if f.lower().endswith('.txt')}

# 3. Build a mapping PDF-basename → TXT-basename (fuzzy match if needed)
mapping = {}
for pdf in pdf_files:
    base, _ = os.path.splitext(pdf)            
    if base in txt_basenames:
        mapping[base] = base                     # exact match
    else:
        # find the single best match above a 0.8 similarity threshold
        candidates = difflib.get_close_matches(base, txt_basenames.keys(), n=1, cutoff=0.8)
        mapping[base] = candidates[0] if candidates else None

# (Optional) print out any that still didn’t match
unmatched = [b for b,m in mapping.items() if m is None]
if unmatched:
    print(" No .txt match for these basenames:")
    for u in unmatched:
        print("   ", u)

# 4. Load each contract’s text via the mapping
docs_texts = []
for pdf in pdf_files:
    base, _ = os.path.splitext(pdf)
    txt_base = mapping.get(base)
    if txt_base:
        path = os.path.join(TXT_FOLDER, txt_base + '.txt')
        try:
            with open(path, encoding='utf-8', errors='ignore') as f:
                docs_texts.append(f.read())
        except Exception as e:
            print(f"Failed reading {path}: {e}")
            docs_texts.append("")
    else:
        # no good match found
        docs_texts.append("")
        
df_checks = pd.DataFrame(docs_texts)
print(df_checks.head(20))
rows, cols = df_checks.shape
print(f"There are {rows} rows and {cols} columns.")




                                                    0
0   Exhibit 10.27\n\nMARKETING AFFILIATE AGREEMENT...
1   Exhibit 10.B.01 EXECUTION COPY\n\nVIDEO-ON-DEM...
2   CONTENT DISTRIBUTION AND LICENSE AGREEMENT   D...
3   CONFIDENTIAL\n\n  PSiTECHCORPORATION   WEBSITE...
4   CONTENT LICENSE AGREEMENT\n\nTHIS CONTENT LICE...
5   Execution Copy\n\n                       CO-BR...
6   EXHIBIT 10.4\n\n                              ...
7   1                                             ...
8   ENDORSEMENT LICENSING AND CO-BRANDING AGREEMEN...
9   Exhibit 4.46     6 rue Adolphe Fischer L-1520 ...
10  Execution Version Certain identified informati...
11  Exhibit 10.1 Certain information identified by...
12  EXHIBIT 10.1\n\nJOINT DEVELOPMENT AGREEMENT\n\...
13  RECIPE DEVELOPMENT AGREEMENT This Recipe Devel...
14  EXHIBIT 10.43 Dated 29/3/18\n\nDistributorship...
15  Exhibit 6.1 DISTRIBUTOR AGREEMENT THIS DISTRIB...
16  EXHIBIT 10.5 NON-EXCLUSIVE DISTRIBUTOR AGREEME...
17  Exhibit 10.6 ATTACHMENT 

In [None]:
# 5. Embed with SBERT
model     = SentenceTransformer('Stern5497/sbert-legal-xlm-roberta-base')
doc_embs  = model.encode(docs_texts, show_progress_bar=True)

# 6. Save to disk
output_df = pd.DataFrame({
    'Filename': pdf_files,
    'embedding': list(doc_embs)
})
output_df.to_pickle(OUTPUT_PKL)
print(f"Encoded {len(doc_embs)} documents and saved to {OUTPUT_PKL}")
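
One caveat worth checking: sentence-transformers models have a `max_seq_length` and truncate longer inputs, so encoding a full contract in one call may only reflect its opening text. A possible workaround, which is not what the cell above does, is to split each contract into chunks, encode the chunks, and average the vectors. The sketch below assumes a word-based chunk size of 200 and uses a stand-in encoder; in practice `model.encode` could be passed as `encode`.

```python
import numpy as np

def embed_long_text(text, encode, chunk_words=200):
    """Split `text` into word chunks, encode each chunk, and return the mean vector."""
    words = text.split()
    if words:
        chunks = [" ".join(words[i:i + chunk_words])
                  for i in range(0, len(words), chunk_words)]
    else:
        # Empty documents still get one (empty) chunk so a vector is returned.
        chunks = [""]
    vectors = np.asarray(encode(chunks), dtype=float)
    return vectors.mean(axis=0)

# Stand-in encoder for demonstration only: [word count, character count] per chunk.
def fake_encode(chunks):
    return [[len(c.split()), len(c)] for c in chunks]

vec = embed_long_text("a b c d e", fake_encode, chunk_words=2)
print(vec)
```

Mean pooling is the simplest choice here; it weights every chunk equally, which may or may not suit contracts with long boilerplate sections.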