
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [1]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
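
To see why step 2 asks for full matching, here is a minimal sketch that contrasts `re.search` with `re.fullmatch` on one made-up input and applies the phone rule from step 3. The `candidate` string and the variable names are illustrative assumptions, not part of the provided data.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical input with trailing junk after the address.
candidate = "bob@example.com (work)"

# search() accepts any matching substring, so the junk goes unnoticed.
partial_ok = EMAIL_RE.search(candidate) is not None

# fullmatch() requires the entire (stripped) string to match the pattern.
full_ok = EMAIL_RE.fullmatch(candidate.strip()) is not None

# Phone rule: drop non-digits, keep the last 10 digits if at least 10 remain.
digits = re.sub(r"\D", "", "+1 (469) 555-1234")
phone = digits[-10:] if len(digits) >= 10 else ""

print(partial_ok, full_ok, phone)  # True False 4695551234
```

With partial matching the junk-laden string would be accepted as a valid email, which is the case the full-match requirement is meant to rule out.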

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [26]:
# Write your answer here

import re
import csv
from pathlib import Path

INPUT_FILE = Path("contacts_raw.txt")
OUTPUT_FILE = Path("contacts_clean.csv")

EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

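def validate_email(email_str: str) -> bool:
    """Return True if the stripped string is, in its entirety, a valid email."""
    if not email_str:
        return False
    return EMAIL_REGEX.fullmatch(email_str.strip()) is not None

def normalize_phone(phone_str: str) -> str:
    """Strip non-digits; return the last 10 digits, or an empty string if fewer."""
    digits = re.sub(r"\D", "", phone_str)
    return digits[-10:] if len(digits) >= 10 else ""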

def parse_contact_line(line: str) -> tuple[str, str, str]:
    """Split one raw line into (name, email, raw phone); empty strings if no comma."""
    name_str, email_str = "", ""

    try:
        main_part, phone_part = line.rsplit(',', 1)
        phone_str = phone_part.strip()
        main_part = main_part.strip()

        email_match = re.search(r"<(.+?)>", main_part)
        if email_match:
            email_str = email_match.group(1).strip()

            name_str = main_part.replace(email_match.group(0), "").strip().strip('"')
        else:
            name_email_parts = main_part.rsplit(',', 1)
            if len(name_email_parts) == 2:
                name_str = name_email_parts[0].strip().strip('"')
                email_str = name_email_parts[1].strip()
            elif '@' in main_part:
                email_str = main_part
            else:
                name_str = main_part

    except ValueError:
        return "", "", ""

    return name_str, email_str, phone_str

def main():
    """Read the raw contacts, clean and de-duplicate them, and write the CSV."""
    print(f"Starting CRM cleanup. Reading from {INPUT_FILE}...")

    clean_contacts = []
    seen_emails = set()

    try:
        with INPUT_FILE.open("r", encoding="utf-8") as f:
            lines = f.readlines()
    except FileNotFoundError:
        print(f"Error: Input file not found at {INPUT_FILE}")
        print("Please create 'contacts_raw.txt' and try again.")
        return
    except Exception as e:
        print(f"An unexpected error occurred while reading the file: {e}")
        return

    for line in lines:
        line = line.strip()
        if not line:
            continue

        name, email, phone_raw = parse_contact_line(line)

        if not validate_email(email):
            print(f"Skipping row: Invalid email format -> {email}")
            continue

        email_folded = email.casefold()
        if email_folded in seen_emails:
            print(f"Skipping row: Duplicate email -> {email}")
            continue

        seen_emails.add(email_folded)

        phone_clean = normalize_phone(phone_raw)

        clean_contacts.append({
            'name': name,
            'email': email,
            'phone': phone_clean
        })

    print(f"Writing {len(clean_contacts)} clean contacts to {OUTPUT_FILE}...")
    try:
        with OUTPUT_FILE.open("w", encoding="utf-8", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
            writer.writeheader()
            writer.writerows(clean_contacts)
    except OSError as e:
        print(f"Error: Could not write to output file {OUTPUT_FILE}: {e}")
        return
    except Exception as e:
        print(f"An unexpected error occurred while writing the file: {e}")
        return

    print("Cleanup complete.")

if __name__ == "__main__":
    main()

Starting CRM cleanup. Reading from contacts_raw.txt...
Skipping row: Invalid email format -> bob[at]example.com
Skipping row: Duplicate email -> Alice@Example.com
Writing 5 clean contacts to contacts_clean.csv...
Cleanup complete.
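
One way to confirm the output contract (header exactly `name,email,phone`, rows in input order) is to read the CSV back with `csv.DictReader`. The sketch below is a minimal, self-contained illustration: it writes two cleaned rows to a temporary directory instead of touching the real `contacts_clean.csv`, and the `rows` list and temporary path are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path

# Two rows in the order they first appear in the sample input.
rows = [
    {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
    {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "contacts_clean.csv"

    # Write with the exact column order required by the assignment.
    with out_path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(rows)

    # Read the file back and capture the header and the rows.
    with out_path.open("r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        read_back = list(reader)

print(header)
print(read_back == rows)
```

The same read-back pattern could be pointed at the real output file after running `main()`.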


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
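
Item 3 asks for parsing from a multi-line string rather than a file. A minimal sketch of what such a test might look like follows. To keep it self-contained it uses a simplified stand-in parser: `parse_line` and `parse_text` are hypothetical helpers written for this illustration, and a real test would import the parser from `q1_crm_cleanup` instead.

```python
import re
import unittest

def parse_line(line):
    """Stand-in parser (hypothetical): return (name, email, raw phone)."""
    main, _, phone = line.rpartition(",")
    match = re.search(r"<(.+?)>", main)
    if match:
        # "Name <email>" form: the email sits inside angle brackets.
        name = main.replace(match.group(0), "")
        email = match.group(1)
    else:
        # "Name , email" form: split once more on the last comma.
        name, _, email = main.rpartition(",")
    return name.strip().strip('"'), email.strip(), phone.strip()

def parse_text(text):
    """Parse every non-blank line of a multi-line string, preserving order."""
    return [parse_line(ln) for ln in text.splitlines() if ln.strip()]

SAMPLE = """\
Alice Johnson <alice@example.com> , +1 (469) 555-1234
Sara M. , sara@mail.co , 214 555 8888

"Mehdi A." <mehdi.ay@example.org> , (469)555-9999
"""

EXPECTED = [
    ("Alice Johnson", "alice@example.com", "+1 (469) 555-1234"),
    ("Sara M.", "sara@mail.co", "214 555 8888"),
    ("Mehdi A.", "mehdi.ay@example.org", "(469)555-9999"),
]

class TestParseText(unittest.TestCase):
    def test_rows_from_multiline_string(self):
        # Blank lines are skipped and row order follows the input.
        self.assertEqual(parse_text(SAMPLE), EXPECTED)

if __name__ == "__main__":
    unittest.main(argv=["first-arg-is-ignored"], exit=False)
```

Asserting against the whole list checks both the field values and the row order in a single comparison.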




In [33]:
# Write your answer here

import unittest
from q1_crm_cleanup import (
    validate_email,
    normalize_phone,
    parse_contact_line
)

class TestCrmCleanup(unittest.TestCase):
    def test_validate_email(self):
        # Valid cases (surrounding whitespace is stripped before matching)
        self.assertTrue(validate_email("alice@example.com"))
        self.assertTrue(validate_email("first.last@example.co.uk"))
        self.assertTrue(validate_email("user123@sub.example-domain.org"))
        self.assertTrue(validate_email("user_%+-@example.com"))
        self.assertTrue(validate_email("  alice@example.com  "))
        self.assertTrue(validate_email("sara@mail.co"))

        # Invalid cases
        self.assertFalse(validate_email("bob[at]example.com"))
        self.assertFalse(validate_email("invalid-email.com"))
        self.assertFalse(validate_email("invalid@domain"))
        self.assertFalse(validate_email("invalid@domain.c"))
        self.assertFalse(validate_email("@example.com"))
        self.assertFalse(validate_email(""))
        self.assertFalse(validate_email("   "))

    def test_normalize_phone(self):
        # Parentheses, dashes, spaces, dots, and a leading country code
        self.assertEqual(normalize_phone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("972-555-7777"), "9725557777")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")
        self.assertEqual(normalize_phone("(469)555-9999"), "4695559999")
        self.assertEqual(normalize_phone("+1-972-777-2121"), "9727772121")
        self.assertEqual(normalize_phone("972.777.2121"), "9727772121")
        self.assertEqual(normalize_phone("469 555 1234"), "4695551234")

        # More than 10 digits: only the last 10 are kept
        self.assertEqual(normalize_phone("14695551234"), "4695551234")

        # Fewer than 10 digits (or none at all) yields an empty string
        self.assertEqual(normalize_phone("972-555-777"), "")
        self.assertEqual(normalize_phone("1234567"), "")
        self.assertEqual(normalize_phone(""), "")
        self.assertEqual(normalize_phone("(abc) def-ghij"), "")

    def test_parse_contact_line(self):
        line1 = 'Alice Johnson <alice@example.com> , +1 (469) 555-1234'
        self.assertEqual(
            parse_contact_line(line1),
            ("Alice Johnson", "alice@example.com", "+1 (469) 555-1234")
        )
        line2 = 'Sara M. , sara@mail.co , 214 555 8888'
        self.assertEqual(
            parse_contact_line(line2),
            ("Sara M.", "sara@mail.co", "214 555 8888")
        )
        line3 = '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999'
        self.assertEqual(
            parse_contact_line(line3),
            ("Mehdi A.", "mehdi.ay@example.org", "(469)555-9999")
        )
        line4 = 'Delaram <delaram@example.io>, +1-972-777-2121'
        self.assertEqual(
            parse_contact_line(line4),
            ("Delaram", "delaram@example.io", "+1-972-777-2121")
        )

    def test_deduplication_logic(self):
        raw_contacts = [
            ("Alice", "alice@example.com"),
            ("Bob", "bob@example.com"),
            ("Duplicate Alice", "Alice@Example.com"),
            ("Carol", "carol@example.com"),
            ("Second Bob", "bob@example.com")
        ]

        clean_contacts = []
        seen_emails = set()

        for name, email in raw_contacts:
            email_folded = email.casefold()
            if email_folded not in seen_emails:
                seen_emails.add(email_folded)
                clean_contacts.append((name, email))

        self.assertEqual(len(clean_contacts), 3)

        self.assertEqual(clean_contacts[0], ("Alice", "alice@example.com"))

        self.assertEqual(clean_contacts[1], ("Bob", "bob@example.com"))

        self.assertEqual(clean_contacts[2], ("Carol", "carol@example.com"))

if __name__ == '__main__':
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

....
----------------------------------------------------------------------
Ran 4 tests in 0.005s

OK
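
The de-duplication test above re-implements the loop inside the test body, so it would keep passing even if the logic in `main()` changed. One possible refinement, sketched here under the assumption that the loop is pulled out into a helper, is to test that helper directly. The name `dedupe_by_email` and the dict-based row shape are hypothetical and do not exist in `q1_crm_cleanup.py` as written.

```python
def dedupe_by_email(rows):
    """Hypothetical helper: drop later rows whose email repeats, ignoring case."""
    seen = set()
    unique = []
    for row in rows:
        key = row["email"].strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(row)  # first occurrence wins, input order preserved
    return unique

rows = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Duplicate Alice", "email": "Alice@Example.com"},
    {"name": "Carol", "email": "carol@example.com"},
]

kept = dedupe_by_email(rows)
print([r["name"] for r in kept])  # ['Alice', 'Bob', 'Carol']
```

A test can then call the helper and assert on its return value, so the assertion exercises the same code path the cleanup script uses.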


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
