# 🎯 Polars Exercises for Agency Operators

**Learn by doing!** These exercises are designed for real-world agency scenarios.

---

## 🎓 What You'll Learn

By completing these exercises, you'll be able to:

1. ✅ **Segment contacts** by industry, title, or company size
2. ✅ **Create custom reports** showing data quality metrics
3. ✅ **Flag high-value contacts** (executives, specific companies)
4. ✅ **Build completeness scores** for prioritizing outreach
5. ✅ **Extract insights** from messy data (domains, seniority levels)
6. ✅ **Validate data** against your own business rules
7. ✅ **Enrich contacts** with calculated fields

---

## 📝 How This Works

Each exercise has:
- 🎯 **Scenario** - A real agency problem
- 💡 **Your Task** - What you need to build
- 🔑 **Hints** - Tips to get you started
- ✅ **Solution** - Hidden in collapsed cells (try first!)
- 🚀 **Extension** - Take it further

---

## ⏱️ Time Commitment

- **Quick path:** 30 minutes (do exercises 1-3)
- **Full path:** 60 minutes (do all 8 exercises)

---

**Ready? Let's go!** 🚀

## Setup: Install Polars & Load Data

In [None]:
# Install Polars
!pip install polars -q

import polars as pl
import re

print("✅ Setup complete!")

In [None]:
# Load sample contacts data
url = "https://raw.githubusercontent.com/billiondottech/agency-data-onboarding-kit/main/samples/contacts_messy.csv"
df = pl.read_csv(url)

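# Basic cleaning (from the main notebook)
df = df.rename({col: col.strip().lower().replace(" ", "_") for col in df.columns})

# Clean the email first, then derive the domain from the cleaned value.
# (Expressions inside a single with_columns call all read the original columns.)
df = df.with_columns([
    pl.col("email").str.strip_chars().str.to_lowercase()
])
df = df.with_columns([
    # null_on_oob=True (Polars 1.0+) returns null instead of raising when an email has no "@"
    pl.col("email").str.split("@").list.get(1, null_on_oob=True).alias("email_domain")
])

print(f"📊 Loaded {len(df)} contacts")
print(f"📋 Columns: {df.columns}\n")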
df.head(3)

---

# Exercise 1: Flag Executive-Level Contacts 🎯

## Scenario
Your agency only wants to reach out to **C-level executives and VPs**. You need to identify these contacts from the title field.

## Your Task
Create a new column called `is_executive` that is `True` for titles containing:
- CEO, CTO, CFO, COO, CMO (any C-level)
- VP, Vice President
- President, Managing Director

## Hints
- Use `pl.col("title").str.to_lowercase()` to handle case variations
- Use `str.contains()` with the `|` operator for "or" matching
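- Example: `str.contains("ceo|cto|cfo")`
- Watch out for substring matches: "cto" also appears inside "director". Wrap the alternatives in `\b(...)\b` to match whole words only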

## Try It Yourself First!
Write your code in the cell below:

In [None]:
# Your code here




### ✅ Solution (Click to Expand)

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
# \b marks a word boundary, so "cto" no longer matches inside "director"
# and "coo" no longer matches inside "coordinator". [se]?vp also covers SVP/EVP.
executive_pattern = r"\b(ceo|cto|cfo|coo|cmo|chief|[se]?vp|vice president|president|managing director)\b"

df = df.with_columns([
    pl.col("title")
      .str.to_lowercase()
      .str.contains(executive_pattern)
      .fill_null(False)
      .alias("is_executive")
])

# Show results
executives = df.filter(pl.col("is_executive"))
print(f"🎯 Found {len(executives)} executives out of {len(df)} total contacts")
print(f"📊 That's {(len(executives)/len(df)*100):.1f}% of your database\n")

print("Executive contacts:")
executives.select(["full_name", "title", "company_name"]).head(10)

</details>

### 🚀 Extension Challenge

Add a `seniority_level` column with values:
- "C-Level" for CEOs, CTOs, etc.
- "VP" for Vice Presidents
- "Director" for Directors
- "Manager" for Managers
- "Individual Contributor" for everyone else

---

# Exercise 2: Calculate Contact Completeness Score 📊

## Scenario
You want to prioritize outreach to contacts with **complete profiles**. A contact with phone + LinkedIn + title is more valuable than just an email.

## Your Task
Create a `completeness_percentage` column that shows what % of fields are filled:
- Check these fields: `full_name`, `email`, `title`, `phone`, `linkedin`
- Calculate: (filled fields / total fields) * 100

## Hints
- Use `.is_not_null()` to check if a field has data
- Cast to integer: `.cast(pl.Int32)`
- Add them up, divide by 5, multiply by 100

## Try It Yourself:

In [None]:
# Your code here




### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
df = df.with_columns([
    (
        (
            pl.col("full_name").is_not_null().cast(pl.Int32) +
            pl.col("email").is_not_null().cast(pl.Int32) +
            pl.col("title").is_not_null().cast(pl.Int32) +
            pl.col("phone").is_not_null().cast(pl.Int32) +
            pl.col("linkedin").is_not_null().cast(pl.Int32)
        ) / 5.0 * 100
    ).round(0).cast(pl.Int32).alias("completeness_percentage")
])

# Show distribution
print("📊 Completeness Distribution:\n")
# .len() replaces the deprecated group_by(...).count()
print(df.group_by("completeness_percentage").len().sort("completeness_percentage", descending=True))

print("\n🏆 Most complete profiles:")
print(df.sort("completeness_percentage", descending=True).select([
    "full_name", "email", "title", "phone", "linkedin", "completeness_percentage"
]).head(5))

print("\n⚠️ Least complete profiles:")
print(df.sort("completeness_percentage").select([
    "full_name", "email", "title", "phone", "linkedin", "completeness_percentage"
]).head(5))

</details>

### 🚀 Extension Challenge

Add a `priority_tier` column:
- "A" for completeness >= 80%
- "B" for completeness >= 60%
- "C" for completeness < 60%

Then count how many contacts are in each tier.

---

# Exercise 3: Identify Target Industries 🏢

## Scenario
Your agency specializes in **SaaS, Technology, and Software companies**. You need to filter your contact list to only show contacts from these industries.

## Your Task
1. Create a list of target industries: `["Software", "Technology", "SaaS", "Tech"]`
2. Flag contacts whose company matches a target industry
3. Count how many target vs non-target contacts you have

**Note:** This dataset has no industry field, so infer the industry from `company_name`: treat a company as tech if its name contains a keyword such as "tech", "digital", "software", "solutions", or "labs".

## Hints
- Use `str.to_lowercase()` first
- Use `str.contains()` with multiple keywords
- Create an `is_target_industry` boolean column

## Try It:

In [None]:
# Your code here




### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
df = df.with_columns([
    pl.col("company_name")
      .str.to_lowercase()
      # \bai\b matches "ai" only as a whole word, not inside names like "Retail" or "Main"
      .str.contains(r"tech|software|digital|solutions|labs|systems|data|cloud|saas|analytics|\bai\b")
      .fill_null(False)
      .alias("is_target_industry")
])

# Count breakdown
target_count = df.filter(pl.col("is_target_industry")).height
total_count = len(df)

print(f"🎯 Target Industry Contacts: {target_count}")
print(f"📊 Other Industries: {total_count - target_count}")
print(f"📈 Target %: {(target_count/total_count*100):.1f}%\n")

print("Sample target companies:")
print(df.filter(pl.col("is_target_industry")).select([
    "company_name", "full_name", "title"
]).head(10))

</details>

### 🚀 Extension Challenge

Create an `industry_category` column that classifies companies into:
- "Tech/Software"
- "Consulting"
- "Marketing/Digital"
- "Finance"
- "Other"

Based on keywords in the company name.

---

# Exercise 4: Extract First Names for Personalization 👋

## Scenario
For email personalization, you need **first names**. But your data only has `full_name` like "Sarah Johnson" or "Michael Chen".

## Your Task
Extract the first name from `full_name` into a new `first_name` column.

## Hints
- Use `str.split(" ")` to split name by space
- Use `.list.get(0)` to get the first element
- Handle cases where full_name might be null

## Try It:

In [None]:
# Your code here




### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
df = df.with_columns([
    pl.col("full_name")
      .str.strip_chars()   # drop stray leading/trailing spaces first
      .str.split(" ")
      .list.get(0)         # a null full_name stays null
      .alias("first_name")
])

print("✅ First names extracted!\n")
print(df.select(["full_name", "first_name", "email"]).head(10))

# Most common first names
print("\n📊 Most common first names:")
# .len() replaces the deprecated .count() and names its column "len"
print(df.group_by("first_name").len().sort("len", descending=True).head(10))

</details>

### 🚀 Extension Challenge

Also extract `last_name` (hint: use `.list.get(-1)` for last element).

Then create an `email_greeting` column that formats as:
- "Hi Sarah," if first_name exists
- "Hi there," if first_name is null

---

# Exercise 5: Flag VIP Companies 🌟

## Scenario
Your agency has a **VIP target list** of dream clients. You want to flag any contacts from these companies for special handling.

## Your Task
1. Create a list of VIP domains: `["acme-corp.com", "globalventures.com", "quantum-sol.com"]`
2. Create an `is_vip` column that checks if `email_domain` is in the VIP list
3. Show how many VIP contacts you have

## Hints
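- Use `pl.col("email_domain").is_in(vip_domains)`, where `vip_domains` is your list of domains
- This returns a boolean (True/False), or null where `email_domain` is missing, so finish with `.fill_null(False)`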

## Try It:

In [None]:
# Your code here




### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
# The three domains from the task plus one more; edit to match your own target list
vip_domains = ["acme-corp.com", "globalventures.com", "quantum-sol.com", "momentumgroup.com"]

df = df.with_columns([
    pl.col("email_domain")
      .is_in(vip_domains)
      .fill_null(False)   # contacts without an email domain are not VIPs
      .alias("is_vip")
])

vip_contacts = df.filter(pl.col("is_vip"))

print(f"🌟 VIP Contacts: {len(vip_contacts)}")
print(f"📊 Regular Contacts: {len(df) - len(vip_contacts)}\n")

print("VIP Contact List:")
print(vip_contacts.select([
    "full_name", "title", "company_name", "email"
]))

</details>

### 🚀 Extension Challenge

Create a `company_tier` column:
- "Tier 1" for VIP companies
- "Tier 2" for companies with 3+ contacts in your database
- "Tier 3" for everyone else

Hint: Count contacts per `email_domain` first!

---

# Exercise 6: Create a Data Quality Report 📈

## Scenario
Your boss asks: "How clean is this data?" You need to generate a **quality report** showing:
- Total contacts
- % with phone numbers
- % with LinkedIn profiles
- % with job titles
- % that are executives

## Your Task
Calculate these metrics and print a formatted report.

## Hints
- Use `.filter(pl.col("field").is_not_null()).height` to count non-null
- Divide by total count and multiply by 100 for percentage
- Use f-strings for formatting: `f"{value:.1f}%"`

## Try It:

In [None]:
# Your code here




### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
total = len(df)
with_phone = df.filter(pl.col("phone").is_not_null()).height
with_linkedin = df.filter(pl.col("linkedin").is_not_null()).height
with_title = df.filter(pl.col("title").is_not_null()).height
executives = df.filter(pl.col("is_executive")).height if "is_executive" in df.columns else 0
vip = df.filter(pl.col("is_vip")).height if "is_vip" in df.columns else 0

print("="*60)
print("📊 DATA QUALITY REPORT")
print("="*60)
print(f"\n📈 Database Overview:")
print(f"  Total Contacts: {total}")
print(f"\n📞 Contact Information Completeness:")
print(f"  With Phone Numbers:     {with_phone:4} ({with_phone/total*100:5.1f}%)")
print(f"  With LinkedIn Profiles: {with_linkedin:4} ({with_linkedin/total*100:5.1f}%)")
print(f"  With Job Titles:        {with_title:4} ({with_title/total*100:5.1f}%)")
print(f"\n🎯 Strategic Segments:")
print(f"  Executive Level:        {executives:4} ({executives/total*100:5.1f}%)")
print(f"  VIP Companies:          {vip:4} ({vip/total*100:5.1f}%)")
print(f"\n💡 Quality Grade: ", end="")

avg_completeness = (with_phone + with_linkedin + with_title) / (total * 3) * 100
if avg_completeness >= 70:
    print("A (Excellent) ✅")
elif avg_completeness >= 50:
    print("B (Good) 👍")
elif avg_completeness >= 30:
    print("C (Fair) ⚠️")
else:
    print("D (Needs Work) ⛔")

print(f"  Average Field Completeness: {avg_completeness:.1f}%")
print("="*60)

</details>

### 🚀 Extension Challenge

Add sections to the report:
- Geographic distribution (% by country)
- Top 5 companies by contact count
- Data quality trends (if you had a `date_added` field)

---

# Exercise 7: Build a Custom Outreach List 🚀

## Scenario
You're launching a campaign and need a **high-priority outreach list**:
- Executive level contacts
- From tech companies
- With completeness >= 60%
- Sorted by completeness (best first)

## Your Task
Filter the data using all the columns you've created, then export to CSV.

## Hints
- Use multiple `.filter()` statements or combine with `&`
- Use `.sort()` to order results
- Use `.write_csv()` to export

## Try It:

In [None]:
# Your code here




### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
outreach_list = (
    df
    .filter(
        pl.col("is_executive") &
        pl.col("is_target_industry") &
        (pl.col("completeness_percentage") >= 60)
    )
    .sort("completeness_percentage", descending=True)
    .select([
        "first_name",
        "full_name",
        "title",
        "email",
        "phone",
        "linkedin",
        "company_name",
        "completeness_percentage",
        "is_vip"
    ])
)

print(f"🎯 High-Priority Outreach List: {len(outreach_list)} contacts\n")
print(outreach_list.head(10))

# Export to CSV
outreach_list.write_csv("high_priority_outreach.csv")
print("\n✅ Exported to: high_priority_outreach.csv")
print("💾 Download it from the files panel on the left →")

</details>

### 🚀 Extension Challenge

Create 3 separate lists:
1. **Tier A:** VIP executives with 100% completeness
2. **Tier B:** Non-VIP executives with 80%+ completeness
3. **Tier C:** All other target industry contacts

Export each to a separate CSV.

---

# Exercise 8: Create a Contact Scoring System 🏆

## Scenario
Score every contact based on:
- Executive: +10 points
- VIP company: +15 points
- Target industry: +5 points
- Has phone: +3 points
- Has LinkedIn: +2 points
- +1 point per 10% completeness

## Your Task
Create a `contact_score` column, then segment contacts into Hot (30+), Warm (20-29), and Cold (<20).

## Try It:

In [None]:
# Your code here



### ✅ Solution

<details>
<summary>Show Solution</summary>

In [None]:
# Solution
df = df.with_columns([
    (
        # fill_null(False) keeps one missing flag from turning the whole score into null
        (pl.col("is_executive").fill_null(False).cast(pl.Int32) * 10) +
        (pl.col("is_vip").fill_null(False).cast(pl.Int32) * 15) +
        (pl.col("is_target_industry").fill_null(False).cast(pl.Int32) * 5) +
        (pl.col("phone").is_not_null().cast(pl.Int32) * 3) +
        (pl.col("linkedin").is_not_null().cast(pl.Int32) * 2) +
        (pl.col("completeness_percentage") / 10).cast(pl.Int32)
    ).alias("contact_score")
])

df = df.with_columns([
    pl.when(pl.col("contact_score") >= 30)
      .then(pl.lit("Hot 🔥"))
      .when(pl.col("contact_score") >= 20)
      .then(pl.lit("Warm 🌡️"))
      .otherwise(pl.lit("Cold ❄️"))
      .alias("lead_temperature")
])

print("🏆 Contact Scoring Results:\n")
# .len() replaces the deprecated .count() and names its column "len"
print(df.group_by("lead_temperature").len().sort("len", descending=True))

print("\n🔥 Top 10 Hottest Leads:")
print(df.sort("contact_score", descending=True).select([
    "full_name", "title", "company_name", "contact_score", "lead_temperature"
]).head(10))

</details>

---

## 🎓 Congratulations!

You can now segment, score, filter, and export contact lists like a pro.

**Next:** Try with your own data, or build the full pipeline at [GitHub](https://github.com/billiondottech/agency-data-onboarding-kit)

**Questions?** Join the [Billion community](https://billion-blog.com)