# Working with Instagram Threads from UCSF Industry Documents

This notebook extracts and processes Instagram messages from a Parquet file, using an Instagram chat conversation as an example. The process includes:

1. Load the Parquet file containing multiple document records.

2. Filter the dataset to Instagram threads and select one document by its ID (qhnj0338 in this example), referencing the UCSF Industry Documents Library.

3. Extract and structure the document's content into a table format with fields:

    - num_participants (number of chat participants)
    - start_date and start_time (when the first message was sent)
    - end_date and end_time (when the last message was sent)
    - user (sender of the message)
    - date_time (date and time the message was sent)
    - text (message content)

4. Print the extracted Instagram thread as a Polars DataFrame and save it as a CSV file.

Import the necessary libraries. We use Polars for efficient data handling and for displaying the extracted content as a structured table, and `re` for regular-expression parsing of the OCR text.

In [184]:
import polars as pl
import re

Load documents parquet file and display sample data.

Here, we load the Parquet file `juul_nc_documents.parquet` into a Polars DataFrame. The code will display a preview of the dataset (first few rows) to inspect its structure. This is useful for verifying available columns before filtering or processing.

Most of the 71 columns are metadata, so let's cut down the columns and rows and view just what we want to see!

In [130]:
# load documents parquet file
input_parquet_filename = "juul_nc_documents.parquet"
df = pl.read_parquet(input_parquet_filename)

# show dataframe
print("Sample Data from Parquet:")
print(df.head())

Sample Data from Parquet:
shape: (5, 71)
┌──────────┬─────┬─────────────┬──────────┬───┬──────────────┬──────────┬───────────┬──────────────┐
│ id       ┆ tid ┆ bates       ┆ type     ┆ … ┆ timereceived ┆ timesent ┆ redaction ┆ ocr_text     │
│ ---      ┆ --- ┆ ---         ┆ ---      ┆   ┆ ---          ┆ ---      ┆ ---       ┆ ---          │
│ str      ┆ str ┆ str         ┆ str      ┆   ┆ str          ┆ str      ┆ str       ┆ str          │
╞══════════╪═════╪═════════════╪══════════╪═══╪══════════════╪══════════╪═══════════╪══════════════╡
│ phjp0299 ┆     ┆ JLI05520907 ┆ document ┆ … ┆              ┆          ┆           ┆ SOPHIA PERRY │
│          ┆     ┆             ┆          ┆   ┆              ┆          ┆           ┆ -JOHNSON     │
│          ┆     ┆             ┆          ┆   ┆              ┆          ┆           ┆ UCSF Red…    │
│ xtly0299 ┆     ┆ JLI04260046 ┆ document ┆ … ┆              ┆          ┆           ┆ .01110iillgi │
│          ┆     ┆             ┆          ┆   ┆   

In [186]:
# filter dataframe for only instagram threads 
instagram_docs = df.filter(pl.col("ocr_text").str.to_lowercase().str.contains("cloud instagram direct message"))
print(f"Found {instagram_docs.shape[0]} Instagram thread(s).\n")
print(instagram_docs.head())

Found 6355 Instagram thread(s).

shape: (5, 71)
┌──────────┬─────┬─────────────┬──────────┬───┬──────────────┬──────────┬───────────┬──────────────┐
│ id       ┆ tid ┆ bates       ┆ type     ┆ … ┆ timereceived ┆ timesent ┆ redaction ┆ ocr_text     │
│ ---      ┆ --- ┆ ---         ┆ ---      ┆   ┆ ---          ┆ ---      ┆ ---       ┆ ---          │
│ str      ┆ str ┆ str         ┆ str      ┆   ┆ str          ┆ str      ┆ str       ┆ str          │
╞══════════╪═════╪═════════════╪══════════╪═══╪══════════════╪══════════╪═══════════╪══════════════╡
│ qhnj0338 ┆     ┆ JLI42801990 ┆ document ┆ … ┆              ┆          ┆           ┆ CLOUD        │
│          ┆     ┆             ┆          ┆   ┆              ┆          ┆           ┆ INSTAGRAM…   │
│ ltgm0321 ┆     ┆ JLI42801257 ┆ document ┆ … ┆              ┆          ┆           ┆ CLOUD        │
│          ┆     ┆             ┆          ┆   ┆              ┆          ┆           ┆ INSTAGRAM    │
│          ┆     ┆             ┆          ┆

In [200]:
# document id
document_id = "qhnj0338"

# filter for just this document
filtered_doc = instagram_docs.filter(instagram_docs["id"] == document_id)
print(filtered_doc)

shape: (1, 71)
┌──────────┬─────┬─────────────┬──────────┬───┬──────────────┬──────────┬───────────┬──────────────┐
│ id       ┆ tid ┆ bates       ┆ type     ┆ … ┆ timereceived ┆ timesent ┆ redaction ┆ ocr_text     │
│ ---      ┆ --- ┆ ---         ┆ ---      ┆   ┆ ---          ┆ ---      ┆ ---       ┆ ---          │
│ str      ┆ str ┆ str         ┆ str      ┆   ┆ str          ┆ str      ┆ str       ┆ str          │
╞══════════╪═════╪═════════════╪══════════╪═══╪══════════════╪══════════╪═══════════╪══════════════╡
│ qhnj0338 ┆     ┆ JLI42801990 ┆ document ┆ … ┆              ┆          ┆           ┆ CLOUD        │
│          ┆     ┆             ┆          ┆   ┆              ┆          ┆           ┆ INSTAGRAM…   │
└──────────┴─────┴─────────────┴──────────┴───┴──────────────┴──────────┴───────────┴──────────────┘


In [201]:
def find_group(pattern, text, group=1):
    # return the captured group, or None when the pattern does not match,
    # instead of raising AttributeError on a missing field
    match = re.search(pattern, text)
    return match.group(group) if match else None

def extract_data(text):
    # OCR drops the "ti" ligature ("par cipants", "date/ me") and the spacing
    # around values varies between documents, so the patterns allow both
    stamp = r"(\d{1,2}/\d{1,2}/\d{4})\s+(\d{1,2}:\d{1,2}:\d{1,2} \w{2})"
    first = r"First message sent date/\s*(?:ti)?me\s+" + stamp
    last = r"Last message sent date/\s*(?:ti)?me\s+" + stamp
    convo_data = {
        "num_participants": find_group(
            r"(?:Number\s*of par(?:ti)?\s*cipants|Display names)\s+(\d{1,2})\b", text
        ),
        "start_date": find_group(first, text, 1),
        "start_time": find_group(first, text, 2),
        "end_date": find_group(last, text, 1),
        "end_time": find_group(last, text, 2),
    }

    # the conversation starts after the time zone line, e.g. "Coordinated Universal Time"
    convo_text = find_group(r"(\w+ \w+ Time)(.+)", text, 2)
    if convo_text is None:
        return []
    split_text = convo_text.split("  ")
    cleaned_text = [x for x in split_text if (x not in ["", "CONFIDENTIAL", "NC-JLI-Consent Judgment"]) and ("JLI" not in x)]

    # assumes the remaining tokens arrive as (user, date_time, text) triplets;
    # an incomplete trailing triplet is dropped
    messages = []
    for i in range(0, len(cleaned_text) - 2, 3):
        user, date_time, message_text = cleaned_text[i:i + 3]
        messages.append({**convo_data, "user": user, "date_time": date_time, "text": message_text})
    return messages

In [202]:
text_example = filtered_doc["ocr_text"].to_list()[0]

In [204]:
parsed_text = re.sub(r"\s{2,}", "  ", text_example.strip())
print(parsed_text)

CLOUD INSTAGRAM DIRECT MESSAGES  CHAT PARTICIPANTS  Numberof par cipants 2  Display names juulvapor  rcorrato  Local user juulvapor  CONVERSATION DETAILS  Numberof messages 12  First message sent date/ me 9/13/2017 12:07:47 AM  Last message sent date/ me 10/7/2017 3:10:56 AM  Case me zone (UTC) Coordinated Universal Time  rcorrato  9/13/2017 12:07:47 AM  r  Are you guys going to make non nic pods  rcorrato  9/18/2017 4:46:09 PM r  Thanks!  CONFIDENTIAL JLI42801990  rcorrato 9/18/2017 7:13:31 PM  CONFIDENTIAL JLI42801991  rcorrato  9/18/2017 11:04:28 PM I/  Free pods since I'm bangin'?  rcorrato  9/19/2017 8:26:01 PM I/  Aw w  rcorrato  9/19/2017 8:28:55 PM r  Do you like menes  rcorrato  9/19/2017 8:28:58 PM I/  Memes*  rcorrato  9/23/2017 12:19:56 AM r  CONFIDENTIAL JLI42801992  CONFIDENTIAL JLI42801993  Redacted  CONFIDENTIAL JLI42801994  rcorrato 10/7/2017 3:10:56 AM  CONFIDENTIAL JLI42801995  Redacted  CONFIDENTIAL JLI42801996


In [205]:
extract_data(parsed_text)
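
The printed text above shows that this thread does not always split into clean user / timestamp / message pieces: some timestamps carry a trailing OCR mark (`r`, `I/`), and some headers keep the user and timestamp in one token. One possible alternative, sketched here under assumptions, is to locate each message header with a regular expression and treat everything up to the next header as the message text. `parse_messages`, `HEADER`, and `FOOTER` are hypothetical names, the header shape is inferred from this one document, and the function should only be given the conversation part of the text (otherwise the "First message sent" line would also look like a header).

```python
import re

# Assumed header shape: "<username> <M/D/YYYY> <H:MM:SS> <AM|PM>".
HEADER = re.compile(r"(\S+)\s+(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)")
# Page-footer stamps such as "CONFIDENTIAL JLI42801990".
FOOTER = re.compile(r"CONFIDENTIAL\s+JLI\d+")


def parse_messages(text):
    """Split conversation text into user/date_time/text dicts (hypothetical helper)."""
    text = FOOTER.sub(" ", text)
    headers = list(HEADER.finditer(text))
    messages = []
    # Pair each header with the one that follows it; the last header runs to the end.
    for current, following in zip(headers, headers[1:] + [None]):
        end = following.start() if following else len(text)
        body = text[current.end():end].strip()
        messages.append(
            {"user": current.group(1), "date_time": current.group(2), "text": body}
        )
    return messages


# Excerpt of the conversation text printed above.
sample = (
    "rcorrato  9/18/2017 4:46:09 PM r  Thanks!  CONFIDENTIAL JLI42801990  "
    "rcorrato 9/18/2017 7:13:31 PM  CONFIDENTIAL JLI42801991  "
    "rcorrato  9/19/2017 8:28:58 PM I/  Memes*"
)
for message in parse_messages(sample):
    print(message)
```

Stray OCR marks such as `r` or `I/` stay at the start of the message text in this sketch, and messages whose content was not captured by OCR come back with an empty `text`.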

In [None]:
if filtered_doc.shape[0] > 0:
    print(f"Found Instagram thread with ID '{document_id}'\n")
    text_example = filtered_doc["ocr_text"].to_list()[0]
    parsed_text = re.sub(r"\s{2,}", "  ", text_example.strip())

    messages = extract_data(parsed_text)
    chat_df = pl.DataFrame(messages)

    print(chat_df)

    #save to csv
    chat_df.write_csv(f"instagram_thread_{document_id}.csv")
    print(f"\nSaved to instagram_thread_{document_id}.csv")

else:
    print("No matching Instagram document found.")

Found Instagram thread with ID 'ptgm0321'

shape: (3, 8)
┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ num_partic ┆ start_date ┆ start_time ┆ end_date  ┆ end_time  ┆ user      ┆ date_time ┆ text      │
│ ipants     ┆ ---        ┆ ---        ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ ---        ┆ str        ┆ str        ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
│ str        ┆            ┆            ┆           ┆           ┆           ┆           ┆           │
╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 2          ┆ 5/29/2016  ┆ 7:17:06 AM ┆ 10/1/2016 ┆ 12:01:10  ┆ jonahrath ┆ 5/29/2016 ┆ Hi there. │
│            ┆            ┆            ┆           ┆ AM        ┆ mer       ┆ 7:17:06   ┆ Are you   │
│            ┆            ┆            ┆           ┆           ┆           ┆ AM        ┆ sending   │
│            ┆            ┆       
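
The `date_time` column comes out as plain strings. If we want to sort or filter by time, one option is to parse the values with the standard library. This is a small sketch, not part of the original pipeline: `parse_date_time` and `DATE_TIME_FORMAT` are hypothetical names, and the format string is an assumption based on the timestamps shown in the output above.

```python
from datetime import datetime

# Assumed layout of the timestamps, e.g. "5/29/2016 7:17:06 AM".
DATE_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_date_time(value):
    """Convert a timestamp string from the thread into a naive datetime."""
    return datetime.strptime(value, DATE_TIME_FORMAT)


sent_at = parse_date_time("5/29/2016 7:17:06 AM")
print(sent_at.isoformat())  # 2016-05-29T07:17:06
```

The result carries no time zone information; the export header suggests the times are in UTC, but that is not encoded in the parsed value.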