
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [10]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
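
For illustration only, here is a minimal sketch of the ideas behind steps 2, 3, and 5: full versus partial matching, the last-10-digits rule, and case-insensitive comparison. The helper names `looks_like_email` and `last_ten_digits` and the `noisy` sample string are hypothetical and are not the names your solution has to use.

```python
import re

EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def looks_like_email(text):
    """Hypothetical helper: True only if the whole stripped string matches."""
    return re.fullmatch(EMAIL_PATTERN, text.strip()) is not None


def last_ten_digits(raw):
    """Hypothetical helper: drop non-digits, keep the last 10, else return ""."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


# Partial matching accepts text around an address; full matching does not.
noisy = "contact: alice@example.com (work)"
print(re.search(EMAIL_PATTERN, noisy) is not None)  # True
print(looks_like_email(noisy))                      # False
print(looks_like_email("  alice@example.com  "))    # True

# A leading country code is discarded by keeping only the last 10 digits.
print(last_ten_digits("+1 (469) 555-1234"))  # 4695551234
print(repr(last_ten_digits("972-555-777")))  # ''

# casefold() makes the duplicate check ignore letter case.
print("Alice@Example.com".casefold() == "alice@example.com".casefold())  # True
```

Using `re.search` here would let a row through whenever an address appears anywhere in the field, which is why step 2 asks for full matching.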

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [11]:
# Write your answer here

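import csv
import re
from pathlib import Path

EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def is_valid_email(email):
    """Return True if the whole stripped string is a valid email."""
    return bool(re.fullmatch(EMAIL_PATTERN, email.strip()))


def normalize_phone(phone):
    """Keep the last 10 digits of a phone number, or "" if there are fewer."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:] if len(digits) >= 10 else ""


def split_line(line):
    """Split one raw line into (name, email, phone) text fields."""
    # The phone follows the last comma; a line without a comma has no phone.
    left, sep, phone = line.rpartition(",")
    if not sep:
        left, phone = line, ""
    # The email is either wrapped in <...> or is the second comma field.
    bracketed = re.search(r"<([^<>]*)>", left)
    if bracketed:
        name, email = left[:bracketed.start()], bracketed.group(1)
    else:
        name, _, email = left.partition(",")
    return name.strip().strip('"').strip(), email.strip(), phone.strip()


def parse_contacts(lines):
    """Return (name, email, phone) tuples for lines that hold a valid email."""
    contacts = []
    for line in lines:
        name, email, phone = split_line(line)
        if is_valid_email(email):
            contacts.append((name, email, normalize_phone(phone)))
    return contacts


def dedup_contacts(contacts):
    """Drop later rows whose email matches an earlier one, ignoring case."""
    seen = set()
    result = []
    for name, email, phone in contacts:
        key = email.casefold()
        if key not in seen:
            seen.add(key)
            result.append((name, email, phone))
    return result


def main():
    """Clean contacts_raw.txt and write the result to contacts_clean.csv."""
    path = Path("contacts_raw.txt")
    try:
        lines = path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print(f"Could not find {path}. Run the data cell first to create it.")
        return
    contacts = dedup_contacts(parse_contacts(lines))
    out = Path("contacts_clean.csv")
    # csv.writer quotes any field containing a comma; newline="" avoids
    # blank rows on Windows.
    with out.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "phone"])
        writer.writerows(contacts)


if __name__ == "__main__":
    main()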


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
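
One way to cover many valid/invalid variations without repeating assertions is a table of cases run under `subTest`. The sketch below is only an illustration: the class name `TestEmailTable` and the case list are hypothetical, and `is_valid_email` is redefined here as a stand-in so the block runs on its own.

```python
import re
import unittest

EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def is_valid_email(email):
    """Stand-in for the Q1 helper so that this sketch is self-contained."""
    return bool(re.fullmatch(EMAIL_PATTERN, email.strip()))


class TestEmailTable(unittest.TestCase):
    """Hypothetical table-driven test; the cases are illustrative only."""

    CASES = [
        ("alice@example.com", True),
        ("  sara@mail.co  ", True),          # surrounding whitespace is stripped
        ("bob[at]example.com", False),       # no @ sign
        ("bad.com", False),                  # no local part or @ sign
        ("user@domain", False),              # no top-level domain
        ("alice@example.com extra", False),  # trailing text fails a full match
    ]

    def test_email_cases(self):
        for text, expected in self.CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first failure.
            with self.subTest(email=text):
                self.assertEqual(is_valid_email(text), expected)


if __name__ == "__main__":
    # argv and exit=False keep unittest from reading notebook arguments
    # or shutting down the kernel.
    unittest.main(argv=["ignored"], exit=False)
```

The same pattern works for phone normalization: pair each raw string with its expected 10-digit result (or `""`) and compare with `assertEqual`.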




In [12]:
import unittest
import sys
import re
from pathlib import Path

def is_valid_email(email):
    return bool(re.fullmatch(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", email.strip()))

def normalize_phone(phone):
    digits = re.sub(r"\D", "", phone)
    return digits[-10:] if len(digits) >= 10 else ""

def parse_contacts(lines):
    contacts = []
    for line in lines:
        match = re.match(r'([^<>,"]*?)?\s*(<([^<>]+)>)?\s*[,]?\s*"?([^"]*)"?\s*[,]?\s*(.*)', line)
        if match:
            name_part, _, email_in_angle_brackets, name_in_quotes, rest = match.groups()

            name = name_in_quotes.strip() if name_in_quotes else name_part.strip()

            email = email_in_angle_brackets.strip() if email_in_angle_brackets else ""
            if not email:
                email_match = re.search(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", rest)
                email = email_match.group(0).strip() if email_match else ""

            phone_part = re.sub(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", "", rest).strip()
            phone = phone_part

            if is_valid_email(email):
                contacts.append((name, email, normalize_phone(phone)))

    return contacts


def dedup_contacts(contacts):
    seen = set()
    result = []
    for name, email, phone in contacts:
        key = email.casefold()
        if key not in seen:
            seen.add(key)
            result.append((name, email, phone))
    return result


class TestCRMCleanup(unittest.TestCase):
    def test_valid_emails(self):
        self.assertTrue(is_valid_email("test@example.com"))
        self.assertFalse(is_valid_email("test[at]example.com"))
        self.assertFalse(is_valid_email("bad.com"))
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("sara@mail.co"))
        self.assertTrue(is_valid_email("mehdi.ay@example.org"))
        self.assertTrue(is_valid_email("delaram@example.io"))
        self.assertTrue(is_valid_email("NIMA@example.io"))


    def test_phone_normalization(self):
        self.assertEqual(normalize_phone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("972-555-777"), "")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")
        self.assertEqual(normalize_phone("(469)555-9999"), "4695559999")
        self.assertEqual(normalize_phone("+1-972-777-2121"), "9727772121")
        self.assertEqual(normalize_phone("972.777.2121"), "9727772121")
        self.assertEqual(normalize_phone("469 555 1234"), "4695551234")
        self.assertEqual(normalize_phone("123"), "")
        self.assertEqual(normalize_phone("1234567890123"), "4567890123")


    def test_parse_contacts(self):
        lines = [
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234',
            'Bob Roberts <bob[at]example.com> , 972-555-777',
            'Sara M. , sara@mail.co , 214 555 8888',
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999',
            'Delaram <delaram@example.io>, +1-972-777-2121',
            'Nima <NIMA@example.io> , 972.777.2121',
            'duplicate <Alice@Example.com> , 469 555 1234',
            'Name Only, bademail',
            'Name and Email <valid@example.com>',
            'Name, valid2@example.com, 1234567890'
        ]
        result = parse_contacts(lines)
        expected = [
            ("Alice Johnson", "alice@example.com", "4695551234"),
            ("Sara M.", "sara@mail.co", "2145558888"),
            ("Mehdi A.", "mehdi.ay@example.org", "4695559999"),
            ("Delaram", "delaram@example.io", "9727772121"),
            ("Nima", "NIMA@example.io", "9727772121"),
            ("duplicate", "Alice@Example.com", "4695551234"),
            ("Name and Email", "valid@example.com", ""),
            ("Name", "valid2@example.com", "1234567890")
        ]
        self.assertEqual(result, expected)


    def test_dedup_contacts(self):
        data = [
            ("Alice", "alice@example.com", "123"),
            ("Dup", "ALICE@example.com", "456"),
            ("Bob", "bob@example.com", "789"),
            ("Another Alice", "alice@example.com", "987"),
            ("Charlie", "charlie@example.com", "654")
        ]
        result = dedup_contacts(data)
        expected = [
            ("Alice", "alice@example.com", "123"),
            ("Bob", "bob@example.com", "789"),
            ("Charlie", "charlie@example.com", "654")
        ]
        self.assertEqual(result, expected)

if __name__ == "__main__":
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

.F..
FAIL: test_parse_contacts (__main__.TestCRMCleanup.test_parse_contacts)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/ipython-input-284979908.py", line 95, in test_parse_contacts
    self.assertEqual(result, expected)
AssertionError: Lists differ: [('Mehdi A.', 'mehdi.ay@example.org', '4695559999')] != [('Alice Johnson', 'alice@example.com', '469[333 chars]90')]

First differing element 0:
('Mehdi A.', 'mehdi.ay@example.org', '4695559999')
('Alice Johnson', 'alice@example.com', '4695551234')

Second list contains 7 additional elements.
First extra element 1:
('Sara M.', 'sara@mail.co', '2145558888')

+ [('Alice Johnson', 'alice@example.com', '4695551234'),
+  ('Sara M.', 'sara@mail.co', '2145558888'),
- [('Mehdi A.', 'mehdi.ay@example.org', '4695559999')]
? ^                                                  ^

+  ('Mehdi A.', 'mehdi.ay@example.org', '4695559999'),
? ^                                          
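
The traceback above shows `parse_contacts` returning only the `Mehdi A.` row. A likely explanation, sketched below, is that the line-matching regex in this cell never reaches its `<...>` group: the first group is lazy and can match nothing, so the later `([^"]*)` group is free to consume everything up to the next double quote. The helper `show_groups` and the variable names are hypothetical; the pattern is copied from the cell above.

```python
import re

# Pattern copied from parse_contacts in the test cell.
PATTERN = r'([^<>,"]*?)?\s*(<([^<>]+)>)?\s*[,]?\s*"?([^"]*)"?\s*[,]?\s*(.*)'


def show_groups(line):
    """Hypothetical helper: return the five captured groups for one line."""
    return re.match(PATTERN, line).groups()


plain = 'Alice Johnson <alice@example.com> , +1 (469) 555-1234'
quoted = '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999'

for line in (plain, quoted):
    groups = show_groups(line)
    print("line:             ", repr(line))
    print("bracketed email:  ", repr(groups[2]))
    print("quoted-name group:", repr(groups[3]))
    print("rest:             ", repr(groups[4]))
```

For the unquoted line, the quoted-name group holds the entire line and `rest` is empty, so no email is found and the row is dropped. The quoted line only works because its double quotes stop `([^"]*)` early and leave the email in `rest`. One possible fix, among others, is to avoid a single all-purpose regex: take the phone from the text after the last comma, then look for `<...>` in what remains and fall back to the second comma-separated field.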

## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
