# Text Analysis with Large Language Models
## Advanced Applications and Techniques

In this notebook, we'll explore practical applications of LLMs for text analysis:
* Text extraction from unstructured documents
* Sentiment analysis and opinion mining
* Named entity recognition
* Text summarization
* Converting unstructured text to structured formats

In [None]:
%pip install openai python-dotenv

In [None]:
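import os

from dotenv import load_dotenv
from openai import OpenAI

# Load variables from a local .env file, if one exists
load_dotenv()

# Read the API key from the environment rather than hard-coding it
openai = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)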

In [None]:
import json
import pandas as pd

# Helper function from the previous notebook: send one user prompt, return the reply text
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    return response.choices[0].message.content
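
Most cells below ask the model for JSON. As a hedged sketch, the request arguments can be assembled in a small function that optionally switches on the API's JSON mode via `response_format`. The name `build_chat_request` and its `json_mode` flag are our own, not part of the OpenAI library, and we assume the chosen model version supports JSON mode.

```python
def build_chat_request(prompt, model="gpt-3.5-turbo", json_mode=False):
    """Assemble keyword arguments for a chat completion call (hypothetical helper)."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    if json_mode:
        # JSON mode asks the API to return one syntactically valid JSON object.
        # The prompt itself must still mention JSON, or the request is rejected.
        kwargs["response_format"] = {"type": "json_object"}
    return kwargs


request = build_chat_request("Return the sender of this email as JSON.", json_mode=True)
print(sorted(request.keys()))
# With a configured client, the call would be:
# response = openai.chat.completions.create(**request)
```

JSON mode only constrains the syntax of the reply. It does not guarantee that the fields we asked for are present, so the parsed result still needs checking.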

### 1. Text Extraction from Unstructured Documents

Often, we need to extract specific information from messy, unstructured text. Let's look at an example with an email:

In [None]:
email_text = """
From: john.smith@example.com
Sent: Thursday, November 28, 2024 2:30 PM
To: sarah.jones@example.com
Subject: Project Timeline Update

Hi Sarah,

Following up on our meeting yesterday. Here are the key deliverables and dates:
- Initial prototype due by Jan 15
- User testing phase: Feb 1-15
- Final release scheduled for March 1st

Budget is set at $50,000. Please let me know if you need anything else.

Best regards,
John
"""

prompt = f"""
Extract the following information from the email and return as JSON:
- sender
- recipient
- project_milestones (as an array of objects with 'task' and 'date')
- budget

Email: {email_text}
"""

response = get_completion(prompt)
print(json.dumps(json.loads(response), indent=2))
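
The cell above passes the reply straight to `json.loads`, which raises an error if the model wraps its JSON in a Markdown code fence or adds a sentence around it. The following is a minimal sketch of a more tolerant parser. `parse_llm_json` is a hypothetical helper name, and the fallback only handles replies whose payload is a single JSON object.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_llm_json(text):
    """Parse JSON from a model reply, tolerating a code fence or surrounding prose."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (the marker plus an optional language tag).
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if there is one.
        if cleaned.rstrip().endswith(FENCE):
            cleaned = cleaned.rstrip()[:-len(FENCE)]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span when extra text surrounds the object.
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])


# Simulated replies, so this cell runs without an API call
fenced_reply = FENCE + 'json\n{"sender": "john.smith@example.com"}\n' + FENCE
chatty_reply = 'Here is the result: {"budget": 50000} Let me know if you need more.'
print(parse_llm_json(fenced_reply))
print(parse_llm_json(chatty_reply))
```

In the cells of this notebook, `parse_llm_json(response)` could stand in wherever `json.loads(response)` is called.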

### 2. Advanced Sentiment Analysis

Let's go beyond basic positive/negative classification to extract detailed sentiment information:

In [None]:
reviews = [
    "The new interface is clean but it takes me forever to find basic features now.",
    "Customer service responded quickly, though they couldn't solve my problem.",
    "Great product overall, especially the battery life, but a bit expensive."
]

prompt = f"""
Analyze each review and provide:
1. Overall sentiment (positive/negative/mixed)
2. Specific aspects mentioned and their sentiment
3. Any suggestions or improvements mentioned

Format as JSON with these fields for each review.

Reviews: {reviews}
"""

response = get_completion(prompt)
print(json.dumps(json.loads(response), indent=2))
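
Interpolating a Python list into the f-string puts its raw representation (brackets and quotes) into the prompt. One alternative, sketched here, is to number the reviews so each result can be matched back to its input. `format_numbered` and the `review_id` field are illustrative names, not something the model or the API requires.

```python
def format_numbered(items):
    """Render a list of strings as a numbered block, one item per line."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, 1))


sample_reviews = [
    "The new interface is clean but it takes me forever to find basic features now.",
    "Customer service responded quickly, though they couldn't solve my problem.",
]

numbered_prompt = f"""
Analyze each review. Return a JSON array with one object per review,
and include the review number as 'review_id'.

Reviews:
{format_numbered(sample_reviews)}
"""
print(numbered_prompt)
```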

### 3. Named Entity Recognition

We can use LLMs to identify and categorize named entities in text:

In [None]:
article = """
Apple CEO Tim Cook announced today at their Cupertino headquarters that the company
is partnering with Microsoft to enhance AI capabilities in iOS. The project, set to
launch in Silicon Valley next month, has already attracted $50M in investment from
Google Ventures. The announcement caused Apple's stock to rise 3% on NASDAQ.
"""

prompt = f"""
Extract and categorize all named entities from the text into these categories:
- People (with their roles)
- Organizations
- Locations
- Products
- Financial figures

Return as JSON. Include context where relevant.

Text: {article}
"""

response = get_completion(prompt)
print(json.dumps(json.loads(response), indent=2))
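
Models sometimes return entities that are not in the text at all. A simple sanity check, sketched below, is to confirm that each extracted name appears verbatim in the source. The entity list here is a made-up stand-in for a parsed model reply, and `unsupported_entities` is a hypothetical helper.

```python
def unsupported_entities(entities, source_text):
    """Return the entity names that never appear verbatim in the source text."""
    lowered = source_text.lower()
    return [name for name in entities if name.lower() not in lowered]


sample_text = "Apple CEO Tim Cook announced a partnership with Microsoft."
# Hypothetical model output, flattened to a list of entity names
sample_entities = ["Tim Cook", "Apple", "Microsoft", "Satya Nadella"]

flagged = unsupported_entities(sample_entities, sample_text)
print(flagged)
```

A substring match is deliberately strict: it also flags entities the model has normalized (for example, expanding an abbreviation), so flagged names are candidates for review rather than confirmed errors.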

### 4. Text Summarization

The same text can be summarized in different ways by changing only the instruction:

In [None]:
long_text = """
The Internet of Things (IoT) represents a major transformation in the way everyday
objects interact with us and each other. By embedding sensors and connectivity into
previously 'dumb' devices, IoT enables real-time data collection and automated responses.
However, this connectivity also raises significant security and privacy concerns. Recent
studies have shown that many IoT devices lack basic security features, making them
vulnerable to attacks. Additionally, the massive amount of personal data collected by
these devices has led to growing privacy concerns among consumers and regulators alike.
Despite these challenges, the IoT market continues to grow rapidly, with analysts
projecting over 25 billion connected devices by 2025. Companies are investing heavily
in IoT solutions for industrial automation, smart homes, and urban infrastructure.
"""

# Generate different types of summaries
prompts = [
    "Provide a one-sentence summary of the main point.",
    "Summarize the key benefits and challenges mentioned.",
    "Create a bullet-point summary of the most important facts and figures."
]

for i, prompt in enumerate(prompts, 1):
    full_prompt = f"{prompt}\n\nText: {long_text}"
    print(f"\nSummary Type {i}:")
    print(get_completion(full_prompt))
    print("-" * 80)
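
The passage above fits comfortably in a single prompt. For documents that do not, one common approach is to split the text into chunks, summarize each chunk, and then summarize the partial summaries. The sketch below covers only the splitting step. `chunk_sentences` and its `max_words` limit are assumptions chosen for illustration, and the sentence splitter is a simple regular expression rather than a full tokenizer.

```python
import re


def chunk_sentences(text, max_words=60):
    """Group whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = chunk_sentences("One two three. Four five six. Seven eight.", max_words=6)
print(demo_chunks)
```

Each chunk could then be passed to `get_completion` with one of the summary prompts, and the resulting partial summaries joined and summarized once more.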

### 5. Converting Unstructured Text to Structured Data

Let's convert a complex text document into a structured format suitable for analysis:

In [None]:
medical_notes = """
Patient Visit Notes - 11/28/2024

Patient: Bilal Khan
Age: 45
Chief Complaint: Persistent headache and fatigue

History: Patient reports experiencing headaches for the past 2 weeks, describes pain
as "throbbing" and primarily on the right side. Associated symptoms include fatigue
and mild nausea. No previous history of migraines. Currently taking ibuprofen 400mg
PRN with minimal relief.

Vitals:
BP: 128/82
HR: 76
Temp: 98.6F

Assessment: Probable tension headache with possible migraine component.

Plan:
1. Prescribed sumatriptan 50mg PRN
2. Recommended stress management techniques
3. Follow-up in 2 weeks if symptoms persist
"""

prompt = f"""
Convert these medical notes into a structured format with the following elements:
1. Patient demographics
2. Symptoms (current and associated)
3. Vital signs
4. Current medications
5. Treatment plan

Return as JSON with appropriate nested structures.

Notes: {medical_notes}
"""

response = get_completion(prompt)
structured_data = json.loads(response)

# Convert to a pandas DataFrame for easy viewing
# Note: json_normalize flattens nested dicts into dotted column names; lists stay in a single cell
df = pd.json_normalize(structured_data)
print("\nStructured Data as DataFrame:")
display(df)
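
To see which column names to expect from `pd.json_normalize`, it helps to look at the flattening on its own. The plain-Python sketch below mimics the default behavior for nested dictionaries (keys joined with a dot, lists left untouched). The `sample` record is a hypothetical shape, since the real keys depend on what the model returns.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys; lists are left as-is."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical structure; the model may choose different key names
sample = {
    "patient": {"name": "Bilal Khan", "age": 45},
    "vital_signs": {"bp": "128/82", "hr": 76},
    "symptoms": ["headache", "fatigue"],
}

flat_record = flatten(sample)
for column, value in flat_record.items():
    print(f"{column}: {value}")
```

Because key names can vary between runs, stating the exact field names in the prompt makes the resulting columns more predictable.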

### Practice Exercise

Try combining multiple techniques we've learned to analyze this customer feedback:

In [None]:
feedback = """
Customer Feedback Summary - Q4 2024

Mobile App Reviews:
- "Love the new dark mode! Much easier on the eyes at night" - @user123
- "App crashes whenever I try to upload photos" - @tech_savvy
- "Great updates overall but please add landscape mode" - @mobile_user

Email Feedback:
Sarah Johnson (sarah.j@email.com) reported slow loading times on Android devices.
Mark Wilson suggested adding a search feature in the settings menu.

Support Tickets:
- Ticket #1234: Payment processing error on Chrome browser
- Ticket #1235: Password reset emails not being received
- Ticket #1236: Account sync issues between desktop and mobile
"""

# Your task:
# 1. Extract all issues mentioned
# 2. Categorize them (bug, feature request, performance issue, etc.)
# 3. Identify mentioned platforms/devices
# 4. Analyze sentiment of user feedback
# 5. Generate a structured summary

# Write your prompt here