# Python String Operations for GenAI

Learn essential string operations with a focus on GenAI applications!

## Table of Contents
1. [String Basics](#basics)
2. [String Length](#length)
3. [Token Counting for GenAI](#tokens)
4. [String Manipulation](#manipulation)
5. [String Searching](#searching)
6. [String Formatting](#formatting)
7. [GenAI Practical Examples](#genai-examples)

## 1. String Basics <a id='basics'></a>

Strings are sequences of characters used to store text.

In [None]:
# Creating strings
text1 = "Hello, World!"
text2 = 'Python is awesome'
text3 = """This is a
multi-line
string"""



In [None]:
print(text1)

In [None]:
text2

In [None]:
print(text3)

In [None]:
# Accessing characters (indexing)
msg = "GenAI"



In [None]:
msg

In [None]:
msg[0:3:2]
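
The slice `msg[0:3:2]` above uses the `start:stop:step` form. A short sketch of the common indexing and slicing patterns on the same string (all variable names other than `msg` are illustrative):

```python
msg = "GenAI"

first_char = msg[0]       # indexing starts at 0
last_char = msg[-1]       # negative indices count from the end
prefix = msg[0:3]         # characters at indices 0, 1, 2 (stop is exclusive)
every_other = msg[0:3:2]  # indices 0 and 2 only (step of 2)
reversed_msg = msg[::-1]  # a step of -1 walks the string backwards

print(first_char, last_char, prefix, every_other, reversed_msg)
```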

## 2. String Length <a id='length'></a>

**Why it matters in GenAI:** LLMs have input limits measured in tokens, so string length is the starting point for estimating whether a prompt will fit.

In [41]:
input_prompt = "Explain machine learning in simple terms!"

gpt_output = "Machine learning is a subset of AI, which helps to learn patterns through different training algorithms."


In [42]:
len(input_prompt)

41

In [43]:
len(gpt_output)

104

In [44]:
chars_per_token = 4  # rough rule of thumb for English text

In [45]:
input_tokens = len(input_prompt) // chars_per_token
input_tokens

10

In [46]:
output_tokens = len(gpt_output) // chars_per_token
output_tokens

26

In [47]:
# Counting words
text = "Generative AI is transforming technology"

text

'Generative AI is transforming technology'

In [48]:
words = text.split()
words

['Generative', 'AI', 'is', 'transforming', 'technology']

In [49]:
len(words)

5

In [None]:
len(text)

## 3. Token Counting for GenAI <a id='tokens'></a>

### 3.1 Simple Token Estimation

In [None]:

prompt = "What is artificial intelligence and how does it work?"

len(prompt) // 4
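
Sections 7.1, 7.6, and 7.7 call a helper named `estimate_tokens_by_words`, whose definition does not appear in this notebook. Below is a minimal sketch of what it might look like, assuming the 1.3-tokens-per-word rule of thumb from the Key Takeaways. `estimate_tokens_by_chars` is a hypothetical companion for the 4-characters-per-token heuristic used above. Both return rough estimates, not real tokenizer counts.

```python
def estimate_tokens_by_chars(text, chars_per_token=4):
    """Rough estimate: about one token per 4 characters (heuristic)."""
    return len(text) // chars_per_token


def estimate_tokens_by_words(text, tokens_per_word=1.3):
    """Rough estimate: about 1.3 tokens per whitespace-separated word (heuristic)."""
    return round(len(text.split()) * tokens_per_word)


prompt = "What is artificial intelligence and how does it work?"
print(f"By characters: ~{estimate_tokens_by_chars(prompt)} tokens")
print(f"By words: ~{estimate_tokens_by_words(prompt)} tokens")
```

The two heuristics usually disagree by a few tokens, which is a useful reminder that neither is exact.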


## 4. String Manipulation <a id='manipulation'></a>

### 4.1 Case Conversion

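In [57]:
text = "Generative AI machine"
text


'Generative AI machine'

In [52]:
a = text.lower()
a

'generative ai machine'

In [54]:
b = text.upper()
b

'GENERATIVE AI MACHINE'

In [58]:
# title() capitalizes the first letter of each word and lowercases the rest,
# so the acronym "AI" becomes "Ai"
c = text.title()
c

'Generative Ai Machine'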
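
Case conversion is mostly useful for case-insensitive comparison, for example checking whether a user's message mentions a keyword regardless of capitalization. A small sketch follows. The `mentions` helper is hypothetical, and it uses `str.casefold()`, a more aggressive form of `lower()` intended for caseless matching:

```python
def mentions(text, keyword):
    """Return True if keyword occurs in text, ignoring case."""
    return keyword.casefold() in text.casefold()


print(mentions("Generative AI machine", "ai"))   # substring match, ignoring case
print(mentions("Generative AI machine", "GPU"))
```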

### 4.2 Trimming Whitespace

In [59]:
# Important for cleaning user input!
user_input = "  Hello, AI!  \n"

#print(f"Original: '{user_input}'")
#print(f"Stripped: '{user_input.strip()}'")
#print(f"Left strip: '{user_input.lstrip()}'")
#print(f"Right strip: '{user_input.rstrip()}'")

In [60]:
user_input

'  Hello, AI!  \n'

In [61]:
a=user_input.strip()
a

'Hello, AI!'

In [62]:
b=user_input.lstrip()
b

'Hello, AI!  \n'

In [63]:
c=user_input.rstrip()
c

'  Hello, AI!'
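
`strip()`, `lstrip()`, and `rstrip()` also accept a string of characters to remove instead of whitespace. A sketch of using that to clean a model reply wrapped in quotes (the `raw_reply` value is made up for illustration):

```python
raw_reply = '  "Paris"\n'

# First drop surrounding whitespace, then drop surrounding double quotes.
# The argument is a set of characters to strip, not a prefix or suffix.
cleaned = raw_reply.strip().strip('"')

print(repr(cleaned))
```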

### 4.3 Splitting and Joining

In [64]:
# Splitting strings
sentence = "Python is great for AI development"
words = sentence.split()


In [65]:
words

['Python', 'is', 'great', 'for', 'AI', 'development']

In [66]:
type(words)

list

In [69]:
a=['Python', 'is', 'great', 'for', 'AI', 'development', "34", "76", 'and', "mf"]


In [78]:
sentence= " ".join(a)
sentence

'Python is great for AI development 34 76 and mf'

In [None]:
sentence = " ".join(words)


In [None]:

print(f"Original: {sentence}")
print(f"Words: {words}")
print(f"Word count: {len(words)}")

In [None]:
# Joining strings
words = ["Machine", "Learning", "is", "awesome"]
sentence = " ".join(words)

print(f"Words: {words}")
print(f"Sentence: {sentence}")

In [None]:
# Split by custom delimiter
csv_data = "name,age,city"
fields = csv_data.split(",")

print(f"CSV: {csv_data}")
print(f"Fields: {fields}")
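
Splitting is often paired with `strip()` when parsing loosely formatted text, such as a comma-separated list in a model reply. A sketch with made-up `reply` and `line` strings, where `maxsplit` limits how many splits are performed:

```python
reply = "python, machine learning ,  AI,embeddings "

# Split on commas, then trim the whitespace around each item
tags = [item.strip() for item in reply.split(",")]
print(tags)

# maxsplit=1 splits only at the first delimiter, keeping the rest intact
line = "Answer: Tokens are chunks of text: words or word pieces"
label, value = line.split(":", maxsplit=1)
print(label, "->", value.strip())
```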

### 4.4 Replacing Text

In [88]:
text = "I love Delhi"

new_text = text.replace("Delhi", "Jabalpur")


In [89]:
new_text

'I love Jabalpur'
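
`replace()` returns a new string and leaves the original unchanged, since Python strings are immutable. It replaces every occurrence unless a count is passed as the third argument. A short sketch with an illustrative sentence:

```python
original = "AI is great. AI is powerful."

replaced_all = original.replace("AI", "ML")       # every occurrence
replaced_first = original.replace("AI", "ML", 1)  # only the first occurrence

print(original)        # the original string is unchanged
print(replaced_all)
print(replaced_first)
```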

In [None]:
msg = "Hello my name is Bipin and my mailid is abc@gmail.com and my phone number is 1234567890"

In [93]:
# `in` checks whether a substring occurs anywhere in the string
has_name = "name" in msg
has_name

True

In [94]:
has_phone = "phone" in msg
has_phone

True

In [95]:
has_address = "address" in msg
has_address

False

In [96]:
msg

'Hello my name is Bipin and my mailid is abc@gmail.com and my phone number is 1234567890'

In [97]:
# find() returns the index of the first match, or -1 if there is none
mailid_index = msg.find("mailid")
mailid_index

30

In [98]:
# count() counts non-overlapping substring matches
is_count = msg.count("is")
is_count

3
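
Note that `in`, `find()`, and `count()` all match substrings, not whole words, so counting "is" also counts the "is" inside words such as "this". A sketch of the difference, using an illustrative sentence and a whole-word count built on `split()`:

```python
sample = "this is what an analysis is"

substring_count = sample.count("is")            # matches inside other words too
whole_word_count = sample.split().count("is")   # list.count compares whole words

print(substring_count, whole_word_count)
```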

In [111]:
name = "Hitesh34"            # letters plus digits, so isalpha() is False
phone_number = "123456789-"  # the trailing "-" makes isdigit() False
password = "abcdefg34@"      # "@" is not alphanumeric, so isalnum() is False



In [103]:
name.isalpha()

False

In [107]:
phone_number.isdigit()

False

In [112]:
password.isalnum()

False
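
All three checks above return `False` because each string contains at least one character outside the tested class. A common pattern is to remove the separators you expect before validating. A sketch follows; the helper name `is_valid_phone`, the allowed separators, and the minimum length are assumptions for illustration:

```python
def is_valid_phone(number, min_digits=10):
    """Return True if number is all digits once spaces and hyphens are removed."""
    digits = number.replace("-", "").replace(" ", "")
    return digits.isdigit() and len(digits) >= min_digits


print(is_valid_phone("123-456-7890"))
print(is_valid_phone("123456789-"))   # only 9 digits remain
print(is_valid_phone("12345abcde"))   # contains letters
```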

In [None]:
print(f"{text1}.isalnum(): {text1.isalnum()}")  # Letters and numbers
print(f"{text2}.isdigit(): {text2.isdigit()}")  # Only digits
print(f"{text3}.isalpha(): {text3.isalpha()}")  # Only letters

## 5. String Searching <a id='searching'></a>

In [None]:
# Check if substring exists
text = "Generative AI is transforming technology"

if "AI" in text:
    print("Found 'AI' in text")

if "blockchain" not in text:
    print("'blockchain' not found")

In [None]:
# Find position of substring
text = "Machine Learning and Deep Learning"

position = text.find("Learning")
print(f"First 'Learning' at position: {position}")

# Find returns -1 if not found
position = text.find("Python")
print(f"'Python' position: {position}")

In [None]:
# Count occurrences
text = "AI is great. AI is powerful. AI is the future."
count = text.count("AI")

print(f"Text: {text}")
print(f"'AI' appears {count} times")
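
`find()` reports only the first match, but it accepts an optional start index, which makes it possible to walk through every occurrence. A sketch follows; `find_all` is a hypothetical helper, not a built-in string method:

```python
def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:
        return []  # an empty substring would otherwise loop forever
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Resume searching just past the current match
        start = text.find(sub, start + len(sub))
    return positions


print(find_all("Machine Learning and Deep Learning", "Learning"))
```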


## 6. String Formatting <a id='formatting'></a>

In [113]:
# f-strings (modern and recommended)
name = "Alice"
age = 25

message = f"My name is {name} and I am {age} years old"
message

'My name is Alice and I am 25 years old'

In [None]:
# Format with expressions
tokens = 150
cost_per_token = 0.0001

print(f"Tokens: {tokens}")
print(f"Cost: ${tokens * cost_per_token:.4f}")

In [None]:
# .format() method
template = "Hello, {}! Welcome to {}."
result = template.format("User", "GenAI Tutorial")
print(result)
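
Format specifiers after the colon control how a value is rendered, and `.format()` with named placeholders works well for reusable prompt templates. A sketch with illustrative values:

```python
tokens = 12500
usage_ratio = 0.3052

print(f"Tokens: {tokens:,}")               # thousands separator
print(f"Context used: {usage_ratio:.1%}")  # percentage with one decimal
print(f"|{'model':<10}|{'cost':>8}|")      # left- and right-aligned columns

# Named placeholders make a template reusable and self-documenting
template = "Translate the following text to {language}:\n{text}"
filled = template.format(language="French", text="Good morning")
print(filled)
```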

## 7. GenAI Practical Examples <a id='genai-examples'></a>

### 7.1 Prompt Template Builder

In [None]:
def create_prompt(role, task, context=""):
    """Build a structured prompt for LLMs"""
    prompt = f"""You are a {role}.

Task: {task}
"""
    
    if context:
        prompt += f"\nContext: {context}"
    
    # Show token info
    tokens = estimate_tokens_by_words(prompt)
    print(f"📊 Prompt length: {len(prompt)} chars, ~{tokens} tokens\n")
    
    return prompt

# Example usage
prompt = create_prompt(
    role="helpful Python tutor",
    task="Explain string operations in simple terms",
    context="Student is a beginner learning Python"
)

print(prompt)

### 7.2 Text Preprocessing for LLMs

In [None]:
import re

def preprocess_text(text):
    """Clean text before sending to LLM"""
    # Collapse runs of whitespace; split() also drops leading/trailing whitespace
    text = " ".join(text.split())

    # Collapse repeated punctuation ("???" -> "?", "!!!" -> "!", "..." -> ".")
    text = re.sub(r"([.!?])\1+", r"\1", text)

    return text

# Test
raw_input = "  What   is    AI???   Tell me!!!  "
clean_input = preprocess_text(raw_input)

print(f"Raw: '{raw_input}'")
print(f"Clean: '{clean_input}'")
print(f"Saved {len(raw_input) - len(clean_input)} characters")

### 7.3 Smart Text Truncation

In [None]:
def smart_truncate(text, max_tokens=50):
    """Truncate text to fit token limit without cutting words"""
    words = text.split()
    kept = []
    current_tokens = 0
    was_truncated = False

    for word in words:
        # Rough per-word estimate: one token per 4 characters, plus one
        word_tokens = len(word) // 4 + 1

        if current_tokens + word_tokens > max_tokens:
            was_truncated = True
            break

        kept.append(word)
        current_tokens += word_tokens

    # Append an ellipsis marker only when something was cut;
    # the marker is not counted as a word
    result = " ".join(kept + ["..."] if was_truncated else kept)
    print(f"Original: {len(words)} words")
    print(f"Truncated: {len(kept)} words")
    print(f"Estimated tokens: {current_tokens}")

    return result

long_text = """Generative AI is a type of artificial intelligence that can create 
new content including text, images, music, and code. It uses machine learning models 
trained on large datasets to understand patterns and generate human-like outputs."""

result = smart_truncate(long_text, max_tokens=30)
print(f"\nResult: {result}")

### 7.4 Keyword Extraction

In [None]:
def extract_keywords(text, min_length=4):
    """Extract important keywords from text"""
    # Common stop words to ignore
    stop_words = {'the', 'is', 'are', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'a', 'an'}
    
    # Clean and split
    words = text.lower().replace(',', '').replace('.', '').split()
    
    # Filter keywords
    keywords = []
    for word in words:
        if word not in stop_words and len(word) >= min_length:
            keywords.append(word)
    
    # Count frequency
    keyword_freq = {}
    for keyword in keywords:
        keyword_freq[keyword] = keyword_freq.get(keyword, 0) + 1
    
    # Sort by frequency
    sorted_keywords = sorted(keyword_freq.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_keywords

text = """Machine learning is a subset of artificial intelligence. 
Machine learning algorithms learn from data to make predictions."""

keywords = extract_keywords(text)
print("Top Keywords:")
for word, count in keywords[:5]:
    print(f"  {word}: {count} times")
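
The manual frequency dictionary and sort above can also be expressed with `collections.Counter` from the standard library. A sketch of the same idea follows; `extract_keywords_counter` is a hypothetical name, and the stop-word list is shortened for brevity:

```python
from collections import Counter


def extract_keywords_counter(text, min_length=4, top_n=5):
    """Count keyword frequency with Counter instead of a manual dict."""
    stop_words = {"the", "is", "are", "and", "or", "from", "with", "to", "of", "a"}
    words = text.lower().replace(",", "").replace(".", "").split()
    keywords = [w for w in words if w not in stop_words and len(w) >= min_length]
    # most_common returns (word, count) pairs, highest count first
    return Counter(keywords).most_common(top_n)


sample = (
    "Machine learning is a subset of artificial intelligence. "
    "Machine learning algorithms learn from data."
)
top_keywords = extract_keywords_counter(sample)
print(top_keywords)
```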

### 7.5 Response Chunking for Streaming

In [None]:
import time

def stream_response(text, chunk_size=5):
    """Simulate streaming LLM response"""
    words = text.split()
    
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        print(chunk, end=" ", flush=True)
        time.sleep(0.3)  # Simulate delay
    
    print()  # New line

response = "Generative AI models can understand context and generate human-like text by learning patterns from vast amounts of training data"

print("Streaming response:")
stream_response(response)
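
Separating the chunking from the printing makes the logic reusable and easier to test. A sketch using a generator follows; `iter_chunks` is a hypothetical helper name:

```python
def iter_chunks(text, chunk_size=5):
    """Yield successive chunks of chunk_size words from text."""
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])


chunks = list(iter_chunks("one two three four five six seven", chunk_size=3))
print(chunks)
```

A caller can then decide what to do with each chunk: print it with a delay, send it over a socket, or collect it into a list.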

### 7.6 Cost Calculator for API Calls

In [None]:
def calculate_api_cost(prompt, response="", model="gpt-3.5-turbo"):
    """Calculate estimated cost for API call"""
    # Pricing (example rates per 1K tokens)
    pricing = {
        "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
        "gpt-4": {"input": 0.03, "output": 0.06}
    }
    
    # Estimate tokens
    input_tokens = estimate_tokens_by_words(prompt)
    output_tokens = estimate_tokens_by_words(response) if response else 0
    
    # Calculate cost
    rates = pricing.get(model, pricing["gpt-3.5-turbo"])
    input_cost = (input_tokens / 1000) * rates["input"]
    output_cost = (output_tokens / 1000) * rates["output"]
    total_cost = input_cost + output_cost
    
    print(f"Model: {model}")
    print(f"Input tokens: {input_tokens} (~${input_cost:.6f})")
    print(f"Output tokens: {output_tokens} (~${output_cost:.6f})")
    print(f"Total cost: ~${total_cost:.6f}")
    
    return total_cost

# Example
prompt = "Explain how transformers work in machine learning"
response = "Transformers are neural network architectures that use self-attention mechanisms to process sequential data efficiently."

calculate_api_cost(prompt, response, "gpt-3.5-turbo")

### 7.7 Text Analysis

In [None]:
def analyze_text(text):
    """Comprehensive text analysis for GenAI"""
    print("=" * 50)
    print("TEXT ANALYSIS")
    print("=" * 50)
    
    # Basic stats
    char_count = len(text)
    word_count = len(text.split())
    sentence_count = text.count('.') + text.count('!') + text.count('?')
    
    print(f"\n📝 Basic Statistics:")
    print(f"  Characters: {char_count}")
    print(f"  Words: {word_count}")
    print(f"  Sentences: {sentence_count}")
    
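    # Token estimation
    tokens = estimate_tokens_by_words(text)
    avg_tokens_per_word = tokens / word_count if word_count else 0.0  # avoid dividing by zero on empty text
    print(f"\n🎯 Token Information:")
    print(f"  Estimated tokens: {tokens}")
    print(f"  Avg tokens/word: {avg_tokens_per_word:.2f}")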
    
    # Check limits
    print(f"\n✅ Limit Checks:")
    print(f"  GPT-3.5 (4K): {'OK' if tokens < 4000 else 'EXCEEDS'}")
    print(f"  GPT-4 (8K): {'OK' if tokens < 8000 else 'EXCEEDS'}")
    
    return {
        'chars': char_count,
        'words': word_count,
        'sentences': sentence_count,
        'tokens': tokens
    }

# Test
sample_text = """Generative AI is revolutionizing technology. 
It can create text, images, and code. This technology is becoming 
increasingly important in various industries."""

stats = analyze_text(sample_text)

## 🎯 Key Takeaways

### String Length:
- Use `len()` to get character count
- Count words with `len(text.split())`
- Essential for managing LLM input limits

### Token Counting:
- **Rule of thumb**: 1 token ≈ 4 characters or 1.3 tokens per word
- Use `tiktoken` for accurate counting
- Critical for API costs and context limits
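
For reference, here is a sketch of counting tokens with `tiktoken`, a third-party package (`pip install tiktoken`). It assumes the `cl100k_base` encoding, which may not match the model you use. It falls back to the 4-characters-per-token heuristic when the package or its encoding data is unavailable. `count_tokens` is a hypothetical helper name.

```python
def count_tokens(text, encoding_name="cl100k_base"):
    """Count tokens with tiktoken if available, else estimate from characters."""
    try:
        import tiktoken  # third-party; not part of the standard library

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Package missing or encoding data not downloadable: use the heuristic
        return len(text) // 4


token_count = count_tokens("Explain machine learning in simple terms!")
print(token_count)
```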

### String Operations:
- **Clean**: `strip()`, `lower()`, `upper()`
- **Split/Join**: Process text into words and back
- **Search**: `in`, `find()`, `count()`
- **Format**: f-strings for dynamic text

### GenAI Applications:
- Prompt template building
- Text preprocessing and cleaning
- Smart truncation for token limits
- Cost calculation for API calls
- Keyword extraction and analysis

## 📚 Practice Exercises

1. Create a function to validate prompt length
2. Build a text summarizer that fits token limits
3. Write a keyword extractor for document indexing
4. Implement a cost estimator for different models

Happy Coding! 🐍✨