
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [4]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
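
To see why step 2 asks for full matching, here is a minimal sketch that compares `re.search` (partial) with `re.fullmatch` on the pattern from step 2. The helper name `match_modes` and the sample strings are made up for illustration and are not part of the assignment.

```python
import re

# Pattern copied from step 2 (it has no ^ or $ anchors of its own).
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def match_modes(raw):
    """Hypothetical helper: return (partial_match, full_match) after strip()."""
    candidate = raw.strip()
    partial = re.search(EMAIL_PATTERN, candidate) is not None
    full = re.fullmatch(EMAIL_PATTERN, candidate) is not None
    return partial, full


# Illustrative inputs, made up for this sketch.
for raw in ["sara@mail.co", "  sara@mail.co  ", "<sara@mail.co>", "bob[at]example.com"]:
    partial, full = match_modes(raw)
    print(f"{raw!r:<24} partial={partial!s:<5} full={full}")
```

The bracketed string passes the partial check because `re.search` finds a valid address inside it, while `re.fullmatch` rejects it because the whole string must match.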

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [7]:
# Write your answer here
import re
import csv

from pathlib import Path

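# Email pattern from the assignment. No ^/$ anchors are needed because
# validate_email() uses re.fullmatch, which must match the whole string.
EMAIL_REGEX = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"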

def validate_email(email: str) -> bool:
    """Return True if email matches regex, else False."""
    return re.fullmatch(EMAIL_REGEX, email.strip()) is not None

def normalize_phone(phone: str) -> str:
    """Keep only the last 10 digits if possible, else return empty string."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:] if len(digits) >= 10 else ""

def parse_line(line: str) -> tuple[str, str, str]:
    """Extract (name, email, phone) from one raw comma-separated line."""
    parts = [p.strip() for p in line.split(",")]
    name, email, phone = "", "", ""

    for part in parts:
        if "<" in part and ">" in part:
            # "Name <email>" form: name is before "<", email is inside the brackets
            name = part.split("<")[0].strip('" ').strip()
            email = part.split("<")[1].split(">")[0].strip()
        elif validate_email(part):
            # Standalone email field
            email = part.strip()
        elif re.search(r"\d", part):
            # Any other field containing a digit is treated as the phone
            phone = part.strip()
        elif not name:
            # First leftover field is the name; drop surrounding quotes
            name = part.strip('" ').strip()

    return name, email, phone

def main():
    input_file = Path("contacts_raw.txt")
    output_file = Path("contacts_clean.csv")

    try:
        lines = input_file.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print(f"Error: {input_file} not found.")
        return

    seen_emails = set()
    cleaned_rows = []

    for line in lines:
        if not line.strip():
            continue
        name, email, phone = parse_line(line)
        if not validate_email(email):
            continue
        email_key = email.casefold()  # case-insensitive key for de-duplication
        if email_key in seen_emails:
            continue
        seen_emails.add(email_key)
        phone_norm = normalize_phone(phone)
        cleaned_rows.append([name, email, phone_norm])

    # Write CSV
    with output_file.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "phone"])
        writer.writerows(cleaned_rows)

    print(f"Cleaned contacts written to {output_file}")

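if __name__ == "__main__":
    main()

    # Show the result only if the CSV exists, so a missing input file does not crash here
    clean_path = Path("contacts_clean.csv")
    if clean_path.exists():
        print(clean_path.read_text(encoding="utf-8"))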

Cleaned contacts written to contacts_clean.csv
name,email,phone
Alice Johnson,alice@example.com,4695551234
Sara M.,sara@mail.co,2145558888
Mehdi A.,mehdi.ay@example.org,4695559999
Delaram,delaram@example.io,9727772121
Nima,NIMA@example.io,9727772121



## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
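
Item 4 is easier to test when the de-duplication logic lives in its own function instead of inside `main()`. The following is a minimal sketch, assuming a hypothetical helper named `dedupe_by_email` (not required by the assignment) that takes already-parsed `(name, email, phone)` tuples. The rows reuse values from the sample data.

```python
import unittest


def dedupe_by_email(rows):
    """Hypothetical helper: keep the first row per email, compared case-insensitively."""
    seen = set()
    unique = []
    for name, email, phone in rows:
        key = email.casefold()
        if key in seen:
            continue  # later duplicate, drop it
        seen.add(key)
        unique.append((name, email, phone))
    return unique


class TestDedupeByEmail(unittest.TestCase):
    def test_case_variant_duplicate_is_dropped(self):
        rows = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("duplicate", "Alice@Example.com", "4695551234"),
            ("Nima", "NIMA@example.io", "9727772121"),
        ]
        expected = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Nima", "NIMA@example.io", "9727772121"),
        ]
        self.assertEqual(dedupe_by_email(rows), expected)


# Load and run only this test case; this works in a notebook cell or a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDedupeByEmail)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

With a helper like this, the test exercises the same function the cleanup script would call, rather than a copy of the loop written inside the test.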




In [11]:
# Write your answer here
import unittest

class TestCRMFunctions(unittest.TestCase):

    # Email validation tests
    def test_valid_emails(self):
        valid = ["alice@example.com", "bob.smith_123@example.co.uk", "user+tag@example.io"]
        for email in valid:
            with self.subTest(email=email):
                self.assertTrue(validate_email(email))

    def test_invalid_emails(self):
        invalid = ["bob[at]example.com", "no-at-symbol.com", "user@@example.com", "user@.com"]
        for email in invalid:
            with self.subTest(email=email):
                self.assertFalse(validate_email(email))

    # Phone normalization tests
    def test_phone_normalization(self):
        cases = [
            ("+1 (469) 555-1234", "4695551234"),
            ("972-777-2121", "9727772121"),
            ("(214) 555 8888", "2145558888"),
            ("972.777.2121", "9727772121"),
            ("12345", ""),  # too short
            ("+44 20 7946 0958", "2079460958")
        ]
        for raw, expected in cases:
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)

    # Parsing tests
    def test_parse_line(self):
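        # Parse from a small multi-line string, as the task asks (not from a file)
        raw_text = (
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234\n'
            'duplicate <Alice@Example.com> , 469 555 1234\n'
            'Sara M. , sara@mail.co , 214 555 8888\n'
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999'
        )
        lines = raw_text.splitlines()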
        expected = [
            ("Alice Johnson", "alice@example.com", "+1 (469) 555-1234"),
            ("duplicate", "Alice@Example.com", "469 555 1234"),
            ("Sara M.", "sara@mail.co", "214 555 8888"),
            ("Mehdi A.", "mehdi.ay@example.org", "(469)555-9999")
        ]
        for line, exp in zip(lines, expected):
            with self.subTest(line=line):
                self.assertEqual(parse_line(line), exp)

    # Deduplication test
    def test_deduplication(self):
        raw_lines = [
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234',
            'duplicate <Alice@Example.com> , 469 555 1234',
            'Nima <NIMA@example.io> , 972.777.2121'
        ]
        seen = set()
        cleaned = []
        for line in raw_lines:
            name, email, phone = parse_line(line)
            email_lower = email.casefold()
            if email_lower in seen:
                continue
            seen.add(email_lower)
            phone_norm = normalize_phone(phone)
            cleaned.append([name, email, phone_norm])
        expected_cleaned = [
            ["Alice Johnson", "alice@example.com", "4695551234"],
            ["Nima", "NIMA@example.io", "9727772121"]
        ]
        self.assertEqual(cleaned, expected_cleaned)

# Run tests in Colab
unittest.main(argv=[''], exit=False, verbosity=2)

test_deduplication (__main__.TestCRMFunctions.test_deduplication) ... ok
test_invalid_emails (__main__.TestCRMFunctions.test_invalid_emails) ... ok
test_parse_line (__main__.TestCRMFunctions.test_parse_line) ... ok
test_phone_normalization (__main__.TestCRMFunctions.test_phone_normalization) ... ok
test_valid_emails (__main__.TestCRMFunctions.test_valid_emails) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.009s

OK


<unittest.main.TestProgram at 0x7f12b8273d40>

## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
