# Chapter 5: File I/O and Exception Handling

---

## The CRAWL → WALK → RUN Framework

This textbook uses a structured approach to learning Python while developing effective AI collaboration skills. Each chapter follows three distinct phases:

| Mode | Icon | AI Policy | Purpose |
|------|------|-----------|--------|
| **CRAWL** | üêõ | No AI assistance | Build foundational skills you can demonstrate independently |
| **WALK** | üö∂ | AI for understanding only | Use AI to explain concepts and errors, but write your own code |
| **RUN** | üöÄ | Full AI collaboration | Partner with AI on complex tasks while documenting your process |

**Why This Matters:** Your exams will test CRAWL and WALK material with no AI assistance. If you skip the foundational work and rely entirely on AI, you won't pass. The progression ensures you build genuine competence before leveraging AI as a professional tool.

## 📊 Case Study: From Clean to Messy Data

This is the chapter where theory meets reality. You've been working with data structures in memory. Now you'll learn to:

- **Read data from files** on disk into Python
- **Write results back** to files for sharing
- **Handle errors gracefully** when files don't exist or data is corrupted
- **Process the messy Lehigh dataset** with all its real-world problems

**The Big Picture:**

| File | Records | Columns | Quality | Use |
|------|---------|---------|---------|-----|
| `lehigh_students_clean.csv` | 600 | 7 | Perfect | Learning file I/O basics |
| `lehigh_students_messy.csv` | 605 | 8 | Intentionally degraded | Data cleaning project |

The messy dataset has 28+ data quality issues including:
- Inconsistent college names ("COB", "Business", "college of business")
- Missing GPA values
- Duplicate student records
- Invalid values (GPA > 4.0)
- Multiple date formats

By the end of this chapter, you'll have the tools to detect and handle all of these problems.

## Learning Objectives

By the end of this chapter, you will:

- 🐛 Open, read, and close files using `open()` and file modes
- 🐛 Read entire files, read line by line, and write to files
- 🐛 Use `with` statements for safe file handling
- 🐛 Parse CSV files using the `csv` module
- 🐛 Handle common exceptions with `try/except` blocks
- 🚶 Work with different file encodings (UTF-8, Latin-1)
- 🚶 Raise your own exceptions for data validation
- 🚶 Use `finally` for cleanup operations
- 🚀 Build a complete data cleaning pipeline for the messy dataset

---

# 🐛 CRAWL: Reading and Writing Files

**Rules for this section:**
- Close all AI tools (ChatGPT, Claude, Copilot, etc.)
- Work through examples by typing them yourself
- Use only this notebook, Python documentation, or your instructor for help
- This material will appear on exams without AI assistance

---

## 📚 DataCamp Resources for Chapter 5

**[Introduction to Importing Data in Python](https://www.datacamp.com/courses/introduction-to-importing-data-in-python)** - Complete these:

| Chapter | Topics Covered | Alignment |
|---------|---------------|------------|
| Chapter 1: Introduction and flat files | Reading text files, CSV files | Sections 5.1-5.4 |

**[Writing Functions in Python](https://www.datacamp.com/courses/writing-functions-in-python)** - Complete these:

| Chapter | Topics Covered | Alignment |
|---------|---------------|------------|
| Chapter 4: More on Decorators (Error Handling section) | Exception handling | Sections 5.5-5.7 |

**Estimated time:** 2-3 hours total

---

## 5.1 Opening Files with `open()`

The `open()` function creates a connection between your Python code and a file on disk. You must specify:

1. **File path** - where the file is located
2. **Mode** - what you want to do with the file

| Mode | Meaning | Creates file? | Erases existing? |
|------|---------|---------------|------------------|
| `'r'` | Read (default) | No | No |
| `'w'` | Write | Yes | **Yes!** |
| `'a'` | Append | Yes | No |
| `'r+'` | Read and write | No | No |

**Warning:** Mode `'w'` will destroy any existing file content without warning. Be careful.
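To make the mode table concrete, here is a minimal sketch showing that `'a'` preserves existing content while `'w'` erases it. The file name `mode_demo.txt` and the temporary folder are assumptions made only for this illustration, so that no real file is touched.

```python
import os
import tempfile

# Work in a throwaway folder so no real files are affected
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'mode_demo.txt')  # hypothetical file name

# 'w' creates the file (or erases it if it already exists)
file = open(demo_path, 'w')
file.write('first\n')
file.close()

# 'a' keeps the existing content and adds to the end
file = open(demo_path, 'a')
file.write('second\n')
file.close()

file = open(demo_path, 'r')
after_append = file.read()
file.close()

# Opening with 'w' again silently wipes everything written so far
file = open(demo_path, 'w')
file.write('third\n')
file.close()

file = open(demo_path, 'r')
after_rewrite = file.read()
file.close()

print(repr(after_append))   # 'first\nsecond\n'
print(repr(after_rewrite))  # 'third\n'
```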

In [None]:
# First, let's create a simple text file to work with
# We'll use write mode to create it

file = open('sample_students.txt', 'w')
file.write('Student_ID,GPA,Major\n')
file.write('LU100001,3.41,Finance\n')
file.write('LU100002,3.55,Computer Science\n')
file.write('LU100003,2.85,Biology\n')
file.close()  # Always close files when done!

print("File created!")

In [None]:
# Now read the file we just created
file = open('sample_students.txt', 'r')
content = file.read()  # Read entire file as one string
file.close()

print(content)

In [None]:
# Read line by line
file = open('sample_students.txt', 'r')

line1 = file.readline()  # Reads first line
line2 = file.readline()  # Reads second line

print(f"Line 1: {line1}")
print(f"Line 2: {line2}")

file.close()

Notice the extra blank lines? That's because each line in the file ends with `\n` (newline), and `print()` adds another newline. You'll often want to strip these.

In [None]:
# Read all lines into a list
file = open('sample_students.txt', 'r')
lines = file.readlines()  # Returns a list of strings
file.close()

print(f"Number of lines: {len(lines)}")
print(f"Lines: {lines}")

In [None]:
# Strip the newlines
file = open('sample_students.txt', 'r')
lines = file.readlines()
file.close()

for line in lines:
    clean_line = line.strip()  # Remove leading/trailing whitespace including \n
    print(clean_line)
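One detail worth seeing for yourself: `strip()` removes whitespace from both ends of a string, not just the trailing newline. The short sketch below uses an invented sample line and `repr()` to make the hidden characters visible. If leading or trailing spaces are meaningful in your data, `rstrip('\n')` is the narrower tool.

```python
# Invented sample line with extra spaces and a trailing newline
line = '  LU100001,3.41,Finance \n'

print(repr(line))               # repr() makes the hidden \n visible
print(repr(line.strip()))       # removes whitespace on both ends
print(repr(line.rstrip('\n')))  # removes only the trailing newline
```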

## 5.2 The `with` Statement (Context Manager)

Forgetting to close files is a common bug that can cause data loss or resource leaks. The `with` statement automatically closes the file when you're done, even if an error occurs.

**Always use `with` for file operations.** It's cleaner and safer.

In [None]:
# The proper way to read files
with open('sample_students.txt', 'r') as file:
    content = file.read()
    print(content)

# File is automatically closed here, even if an error occurred
print("File is now closed:", file.closed)
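The claim that `with` closes the file "even if an error occurs" can be checked directly. The sketch below raises a deliberate error inside the `with` block; the `try/except` wrapper (covered in Section 5.6) is there only so the cell keeps running. The file name `with_demo.txt` is an assumption for this example.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical file

with open(demo_path, 'w') as file:
    file.write('Student_ID,GPA\n')

try:
    with open(demo_path, 'r') as file:
        header = file.readline()
        raise ValueError("simulated problem while processing")  # deliberate error
except ValueError as e:
    print(f"Caught: {e}")

# The with block closed the file before the error reached the except clause
print("Closed after the error?", file.closed)  # True
```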

In [None]:
# Read and process line by line (memory efficient for large files)
with open('sample_students.txt', 'r') as file:
    for line in file:  # Iterate directly over the file object
        print(line.strip())

In [None]:
# Writing with 'with'
students = [
    ("LU100004", 3.92, "Psychology"),
    ("LU100005", 2.10, "Marketing"),
    ("LU100006", 3.78, "Economics")
]

with open('new_students.txt', 'w') as file:
    file.write("Student_ID,GPA,Major\n")  # Header
    for student_id, gpa, major in students:
        file.write(f"{student_id},{gpa},{major}\n")

print("File written!")

# Verify it worked
with open('new_students.txt', 'r') as file:
    print(file.read())

In [None]:
# Append mode adds to the end without erasing
with open('new_students.txt', 'a') as file:
    file.write("LU100007,3.45,History\n")

# Check the result
with open('new_students.txt', 'r') as file:
    print(file.read())

## 5.3 Working with File Paths

File paths can be:
- **Relative:** Relative to your current working directory (`'data/students.csv'`)
- **Absolute:** Full path from root (`'/home/user/data/students.csv'` or `'C:\\Users\\data\\students.csv'`)

Use forward slashes `/` even on Windows, or use raw strings `r'C:\path\to\file'`.

In [None]:
import os

# What's the current working directory?
print(f"Current directory: {os.getcwd()}")

# List files in current directory
print(f"Files here: {os.listdir('.')}")

In [None]:
# Check if a file exists before trying to open it
filename = 'sample_students.txt'

if os.path.exists(filename):
    print(f"{filename} exists!")
    with open(filename, 'r') as file:
        print(file.read())
else:
    print(f"{filename} not found!")

In [None]:
# Build paths safely using os.path.join
# This handles path separators correctly on any operating system

folder = 'data'
filename = 'students.csv'

full_path = os.path.join(folder, filename)
print(f"Full path: {full_path}")
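The advice about forward slashes and raw strings matters because a backslash starts an escape sequence in a normal Python string. The sketch below uses a made-up Windows path whose folder names happen to begin with `n` and `t`, so `\n` and `\t` are silently turned into a newline and a tab.

```python
# '\n' and '\t' inside a normal string are escape sequences, not folder names
normal = 'C:\new_data\table.csv'    # hypothetical Windows path, written carelessly
raw = r'C:\new_data\table.csv'      # raw string: backslashes are kept literally
forward = 'C:/new_data/table.csv'   # forward slashes avoid the problem entirely

print(repr(normal))  # shows the newline and tab that crept in
print(repr(raw))
print(len(normal), len(raw))  # the normal string is two characters shorter
```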

## 5.4 Reading CSV Files with the `csv` Module

CSV (Comma-Separated Values) is the most common format for tabular data. While you could split lines by commas yourself, the `csv` module handles edge cases like:
- Values containing commas (enclosed in quotes)
- Values containing newlines
- Different delimiters (tabs, semicolons)

This is what you'll use to load the Lehigh student datasets.

In [None]:
import csv

# Create a CSV file with some edge cases
with open('tricky_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'College', 'Notes'])
    writer.writerow(['Alice Smith', 'College of Business', 'Dean\'s List'])
    writer.writerow(['Bob Jones', 'Engineering', 'Transferred from "State U"'])  # Quotes in value
    writer.writerow(['Charlie Brown', 'Arts, Sciences', 'Double major'])  # Comma in value

print("CSV written!")

# Look at the raw file
with open('tricky_data.csv', 'r') as file:
    print("Raw file contents:")
    print(file.read())

In [None]:
# Read with csv.reader
# newline='' is recommended so the csv module handles line breaks inside quoted values
with open('tricky_data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    
    for row in reader:
        print(row)  # Each row is a list
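To see why splitting on commas yourself falls short, compare it with `csv.reader` on a single line that has a comma inside a quoted value. This sketch uses `io.StringIO` to treat a string as if it were a file, which is a convenience for the demonstration and not something the chapter relies on elsewhere.

```python
import csv
import io

# One CSV line where the second field contains a comma inside quotes
raw_line = 'Charlie Brown,"Arts, Sciences",Double major\n'

naive = raw_line.strip().split(',')
parsed = next(csv.reader(io.StringIO(raw_line)))

print(naive)   # 4 pieces: the quoted value was split in two
print(parsed)  # 3 fields, as intended
```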

In [None]:
# csv.DictReader gives you dictionaries instead of lists
# This is usually more convenient

with open('tricky_data.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    
    for row in reader:
        print(row)
        print(f"  Name: {row['Name']}")
        print(f"  College: {row['College']}")
        print()

In [None]:
# Writing with csv.DictWriter
students = [
    {'id': 'LU100001', 'gpa': 3.41, 'college': 'Business'},
    {'id': 'LU100002', 'gpa': 3.55, 'college': 'Engineering'},
    {'id': 'LU100003', 'gpa': 2.85, 'college': 'Health'},
]

with open('output_students.csv', 'w', newline='') as file:
    fieldnames = ['id', 'gpa', 'college']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    
    writer.writeheader()  # Write the header row
    writer.writerows(students)  # Write all data rows

# Verify
with open('output_students.csv', 'r') as file:
    print(file.read())
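This section mentioned that the `csv` module also copes with other delimiters. A minimal sketch, assuming a semicolon-separated export (the sample text is invented), passes the `delimiter` argument to `csv.DictReader`:

```python
import csv
import io

# Hypothetical semicolon-separated data
text = 'Student_ID;GPA;Major\nLU100001;3.41;Finance\n'

reader = csv.DictReader(io.StringIO(text), delimiter=';')
rows = list(reader)

print(rows[0]['Major'])  # Finance
```

For tab-separated files, the same approach works with `delimiter='\t'`.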

### Loading the Lehigh Student Dataset

Now let's load the actual clean dataset. The file should be in your course materials.

In [None]:
# Load the clean Lehigh dataset
# Adjust the path based on where your file is located

students = []

with open('lehigh_students_clean.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    
    for row in reader:
        students.append(row)

print(f"Loaded {len(students)} students")
print(f"\nFirst student: {students[0]}")
print(f"\nColumn names: {list(students[0].keys())}")

In [None]:
# Notice that everything is a string!
first_student = students[0]
print(f"GPA value: {first_student['GPA']}")
print(f"GPA type: {type(first_student['GPA'])}")

# You need to convert types manually
gpa_float = float(first_student['GPA'])
print(f"GPA as float: {gpa_float}")

In [None]:
# Process the data with proper type conversion
students = []

with open('lehigh_students_clean.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    
    for row in reader:
        student = {
            'id': row['Student_ID'],
            'college': row['College'],
            'major': row['Major'],
            'class_year': row['Class_Year'],
            'gpa': float(row['GPA']),
            'credits_attempted': int(row['Credits_Attempted']),
            'credits_earned': int(row['Credits_Earned'])
        }
        students.append(student)

print(f"Loaded {len(students)} students with proper types")
print(f"\nFirst student: {students[0]}")
print(f"GPA type: {type(students[0]['gpa'])}")

In [None]:
# Now we can do calculations
gpas = [s['gpa'] for s in students]
avg_gpa = sum(gpas) / len(gpas)
print(f"Average GPA: {avg_gpa:.2f}")

# Find Dean's List students
deans_list = [s for s in students if s['gpa'] >= 3.5]
print(f"Dean's List students: {len(deans_list)}")

---

# 🐛 CRAWL: Exception Handling

---

## 5.5 What Are Exceptions?

When Python encounters an error during execution, it raises an **exception**. If you don't handle it, your program crashes.

Common exceptions you'll encounter:

| Exception | When It Occurs |
|-----------|----------------|
| `FileNotFoundError` | File doesn't exist |
| `ValueError` | Can't convert value (e.g., `int('abc')`) |
| `KeyError` | Dictionary key doesn't exist |
| `IndexError` | List index out of range |
| `TypeError` | Wrong type for operation |
| `ZeroDivisionError` | Division by zero |

In [None]:
# This will crash - uncomment to see
# file = open('nonexistent_file.csv', 'r')

In [None]:
# This will crash - uncomment to see
# gpa = float('N/A')

In [None]:
# This will crash - uncomment to see
# student = {'id': 'LU001'}
# print(student['gpa'])
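If you would rather see the exception names without crashing the notebook, the sketch below triggers five of the exceptions from the table in this section and reports the type of each. It previews the `try/except` syntax explained in Section 5.6. The helper function names are made up for this illustration.

```python
# Each small function deliberately triggers one exception from the table
def bad_conversion():
    return int('abc')

def missing_key():
    return {'id': 'LU001'}['gpa']

def bad_index():
    return [1, 2, 3][10]

def wrong_type():
    return 'GPA: ' + 3.5

def divide_by_zero():
    return 10 / 0

seen = []
for risky in [bad_conversion, missing_key, bad_index, wrong_type, divide_by_zero]:
    try:
        risky()
    except Exception as e:  # broad catch is acceptable here: we only report the type
        seen.append(type(e).__name__)
        print(f"{risky.__name__}: {type(e).__name__}")
```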

## 5.6 The `try/except` Block

Use `try/except` to catch exceptions and handle them gracefully instead of crashing.

```python
try:
    # Code that might raise an exception
except ExceptionType:
    # Code to run if that exception occurs
```

In [None]:
# Handle file not found
try:
    with open('nonexistent_file.csv', 'r') as file:
        content = file.read()
except FileNotFoundError:
    print("Error: The file was not found.")
    print("Please check the file path and try again.")

In [None]:
# Handle conversion errors
gpa_values = ['3.41', '3.55', 'N/A', '3.90', '', '2.85']

valid_gpas = []
invalid_count = 0

for value in gpa_values:
    try:
        gpa = float(value)
        valid_gpas.append(gpa)
    except ValueError:
        invalid_count += 1
        print(f"Could not convert '{value}' to float")

print(f"\nValid GPAs: {valid_gpas}")
print(f"Invalid values: {invalid_count}")

In [None]:
# Catch multiple exception types
def get_student_gpa(students, student_id):
    """Look up a student's GPA by ID."""
    try:
        # Find the student
        for student in students:
            if student['id'] == student_id:
                return float(student['gpa'])
        raise KeyError(f"Student {student_id} not found")
    except KeyError as e:
        print(f"Error: {e}")
        return None
    except ValueError as e:
        print(f"Error converting GPA: {e}")
        return None

# Test with sample data
sample_students = [
    {'id': 'LU001', 'gpa': '3.41'},
    {'id': 'LU002', 'gpa': 'N/A'},
]

print(get_student_gpa(sample_students, 'LU001'))
print(get_student_gpa(sample_students, 'LU002'))
print(get_student_gpa(sample_students, 'LU999'))
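When two exception types should be handled the same way, they can share one `except` clause by listing them in a tuple. The function below is a hypothetical example (its name and the idea of a completion rate are not from the dataset documentation) that treats a bad number and a zero denominator identically.

```python
def completion_rate(earned, attempted):
    """Return earned/attempted as a fraction, or None if it cannot be computed."""
    try:
        return int(earned) / int(attempted)
    except (ValueError, ZeroDivisionError) as e:  # one handler for both types
        print(f"Cannot compute rate: {type(e).__name__}")
        return None

results = [
    completion_rate('95', '100'),  # valid input
    completion_rate('95', ''),     # int('') raises ValueError
    completion_rate('95', '0'),    # division by zero
]
print(results)  # [0.95, None, None]
```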

In [None]:
# Access the exception details
try:
    result = 10 / 0
except ZeroDivisionError as e:
    print(f"Exception type: {type(e).__name__}")
    print(f"Exception message: {e}")

In [None]:
# Catch any exception (use sparingly - you usually want specific types)
try:
    # Some risky operation
    value = int('abc')
except Exception as e:
    print(f"Something went wrong: {type(e).__name__}: {e}")

## 5.7 The `else` and `finally` Clauses

The full `try` statement can have four parts:

```python
try:
    # Code that might fail
except SomeError:
    # Handle the error
else:
    # Runs only if NO exception occurred
finally:
    # ALWAYS runs, whether exception occurred or not
```

In [None]:
def safe_divide(a, b):
    """Safely divide two numbers."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("Cannot divide by zero!")
        return None
    else:
        print(f"Division successful: {a}/{b} = {result}")
        return result
    finally:
        print("Division attempt complete.\n")

safe_divide(10, 2)
safe_divide(10, 0)

In [None]:
# finally is useful for cleanup
def process_file(filename):
    """Process a file and always report when done."""
    file = None
    try:
        file = open(filename, 'r')
        content = file.read()
        # Process the content...
        print(f"Processed {len(content)} characters")
    except FileNotFoundError:
        print(f"File {filename} not found")
    finally:
        if file:
            file.close()
            print("File closed")
        print("Processing complete\n")

process_file('sample_students.txt')
process_file('nonexistent.txt')

**Note:** When using `with` statements, you rarely need `finally` for file cleanup because `with` handles that automatically. But `finally` is still useful for other cleanup tasks.
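"Always runs" includes the case where no `except` clause matches and the exception travels up to the caller. The sketch below, with an invented `parse_credits` function, records events in a list so the order is easy to inspect.

```python
log = []

def parse_credits(text):
    """Convert text to an int; there is deliberately no except clause here."""
    try:
        return int(text)
    finally:
        # Runs on success and on failure, before control leaves the function
        log.append(f"finished parsing {text!r}")

try:
    parse_credits('15')
    parse_credits('fifteen')  # ValueError is not handled inside the function
except ValueError:
    log.append('caller handled ValueError')

for entry in log:
    print(entry)
```

The `finally` entry for `'fifteen'` appears before the caller's entry, which shows that cleanup happens first and the exception continues afterwards.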

---

## üêõ CRAWL Practice Problems

Complete these problems without any AI assistance.

---

### Problem 5.1: Basic File I/O
1. Create a file called `colleges.txt` containing the five Lehigh colleges, one per line
2. Read the file back and print each college with a number (1. College of Business, etc.)
3. Append "Graduate Programs" to the file
4. Read and display the updated file

In [None]:
# Your code here


### Problem 5.2: CSV Reading
Load the clean Lehigh student dataset and:
1. Count how many students are in each college
2. Find the average GPA (remember to convert from string to float)
3. List all unique majors

In [None]:
# Your code here


### Problem 5.3: CSV Writing
Filter the Lehigh dataset to include only students with GPA >= 3.5 and write them to a new file called `deans_list.csv`.

In [None]:
# Your code here


### Problem 5.4: Exception Handling
Write a function `safe_gpa_convert(value)` that:
- Takes a string value
- Returns the float if conversion succeeds
- Returns None if conversion fails
- Prints an informative message for failures

Test with: '3.41', '', 'N/A', '3.5 ', '-1.0', '5.0'

In [None]:
# Your code here


### Problem 5.5: Predict the Output
What happens in each case? Predict first, then run.

```python
# a)
try:
    x = int('hello')
    print('Success')
except ValueError:
    print('Failed')

# b)
try:
    x = 10 / 2
except ZeroDivisionError:
    print('Division error')
else:
    print(f'Result: {x}')

# c)
try:
    x = int('5')
except ValueError:
    x = 0
finally:
    print(f'x = {x}')
```

In [None]:
# Check your predictions


---

# 🚶 WALK: Advanced File Handling and Data Validation

**Rules for this section:**
- You may use AI tools to **explain** concepts and errors
- You must **write all code yourself**
- Good prompts: "What does this encoding error mean?" or "How do I handle different date formats?"
- Bad prompts: "Write code to clean my CSV file"

---

## 5.8 File Encodings

Text files use **encodings** to convert bytes to characters. The most common encodings are:

| Encoding | Description | When to Use |
|----------|-------------|-------------|
| UTF-8 | Universal, handles all languages | Default choice, most modern files |
| Latin-1 (ISO-8859-1) | Western European characters | Older Windows files |
| cp1252 | Windows Western | Excel exports |

When you get a `UnicodeDecodeError`, the file probably uses a different encoding than Python expects.

In [None]:
# Specify encoding explicitly (best practice)
with open('sample_students.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)
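To see a `UnicodeDecodeError` in action, the sketch below simulates an older export by saving text as cp1252 and then reading it back as UTF-8. The file name `legacy_export.txt` and the sample names are invented for the demonstration.

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'legacy_export.txt')  # hypothetical file

# Simulate an older export: the text is saved as cp1252, not UTF-8
with open(demo_path, 'w', encoding='cp1252') as file:
    file.write('Name,Major\nRenée,Français\n')

utf8_failed = False
try:
    with open(demo_path, 'r', encoding='utf-8') as file:
        file.read()
except UnicodeDecodeError as e:
    utf8_failed = True
    print(f"UTF-8 failed: {e.reason}")

# Reading with the encoding the file was actually saved in works
with open(demo_path, 'r', encoding='cp1252') as file:
    content = file.read()
print(content)
```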

In [None]:
# Function to try multiple encodings
def read_file_with_fallback(filename):
    """Try to read a file with multiple encodings."""
    # Order matters: Latin-1 accepts every possible byte, so it must come last
    encodings = ['utf-8', 'cp1252', 'latin-1']
    
    for encoding in encodings:
        try:
            with open(filename, 'r', encoding=encoding) as file:
                content = file.read()
            print(f"Successfully read with {encoding}")
            return content
        except UnicodeDecodeError:
            print(f"{encoding} failed, trying next...")
    
    raise ValueError(f"Could not read {filename} with any known encoding")

# Test it
content = read_file_with_fallback('sample_students.txt')
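A caution on fallback strategies: Latin-1 assigns a character to every possible byte, so decoding with it never raises `UnicodeDecodeError`. A read that "succeeds" therefore does not prove the encoding was right. The sketch below decodes the same UTF-8 bytes both ways, using an invented sample name.

```python
data = 'Zoë Müller'.encode('utf-8')  # bytes of a UTF-8 file (invented sample name)

as_utf8 = data.decode('utf-8')
as_latin1 = data.decode('latin-1')  # no error, but the wrong characters

print(as_utf8)
print(as_latin1)             # garbled: each accented letter becomes two characters
print(as_utf8 == as_latin1)  # False
```

If a file loads without errors but names look garbled like this, the encoding guess is the first thing to check.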

## 5.9 Raising Exceptions for Data Validation

You can raise your own exceptions when data doesn't meet your requirements. This is crucial for data quality.

```python
raise ExceptionType("Error message")
```

In [None]:
def validate_gpa(gpa):
    """Validate that GPA is within valid range."""
    if gpa < 0:
        raise ValueError(f"GPA cannot be negative: {gpa}")
    if gpa > 4.0:
        raise ValueError(f"GPA cannot exceed 4.0: {gpa}")
    return True

# Test
print(validate_gpa(3.5))  # OK

try:
    validate_gpa(5.0)  # Invalid
except ValueError as e:
    print(f"Validation error: {e}")

In [None]:
def validate_student_record(record):
    """Validate a student record dictionary."""
    required_fields = ['id', 'gpa', 'credits_attempted', 'credits_earned']
    
    # Check for missing fields
    for field in required_fields:
        if field not in record:
            raise KeyError(f"Missing required field: {field}")
    
    # Validate GPA
    gpa = record['gpa']
    if not (0 <= gpa <= 4.0):
        raise ValueError(f"Invalid GPA {gpa} for student {record['id']}")
    
    # Validate credits relationship
    if record['credits_earned'] > record['credits_attempted']:
        raise ValueError(f"Credits earned ({record['credits_earned']}) > attempted ({record['credits_attempted']})")
    
    return True

# Test with valid record
good_record = {'id': 'LU001', 'gpa': 3.5, 'credits_attempted': 100, 'credits_earned': 95}
print(f"Valid record: {validate_student_record(good_record)}")

# Test with invalid record
bad_record = {'id': 'LU002', 'gpa': 5.0, 'credits_attempted': 100, 'credits_earned': 95}
try:
    validate_student_record(bad_record)
except ValueError as e:
    print(f"Invalid record: {e}")

## 5.10 Building a Data Loading Function

Let's combine everything into a robust function that can handle the messy dataset.

In [None]:
def load_student_data(filename, validate=True, skip_invalid=True):
    """
    Load student data from a CSV file.
    
    Parameters:
        filename: Path to the CSV file
        validate: Whether to validate each record
        skip_invalid: If True, skip invalid records; if False, raise exception
    
    Returns:
        Tuple of (valid_records, error_log)
    """
    valid_records = []
    error_log = []
    
    try:
        with open(filename, 'r', encoding='utf-8', newline='') as file:
            reader = csv.DictReader(file)
            
            for row_num, row in enumerate(reader, start=2):  # Start at 2 (1 is header)
                try:
                    # Convert types
                    record = {
                        'id': row['Student_ID'],
                        'college': row['College'].strip(),  # Strip whitespace
                        'major': row['Major'].strip(),
                        'class_year': row['Class_Year'].strip(),
                        'gpa': float(row['GPA']) if row['GPA'].strip() else None,
                        'credits_attempted': int(row['Credits_Attempted']) if row['Credits_Attempted'].strip() else None,
                        'credits_earned': int(row['Credits_Earned']) if row['Credits_Earned'].strip() else None
                    }
                    
                    if validate and record['gpa'] is not None:
                        if not (0 <= record['gpa'] <= 4.0):
                            raise ValueError(f"Invalid GPA: {record['gpa']}")
                    
                    valid_records.append(record)
                    
                except (ValueError, KeyError) as e:
                    error_msg = f"Row {row_num}: {e}"
                    error_log.append(error_msg)
                    
                    if not skip_invalid:
                        raise
    
    except FileNotFoundError:
        raise FileNotFoundError(f"Could not find file: {filename}")
    
    return valid_records, error_log

# Test with the clean dataset
students, errors = load_student_data('lehigh_students_clean.csv')
print(f"Loaded {len(students)} valid records")
print(f"Errors encountered: {len(errors)}")
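The `enumerate(reader, start=2)` detail is easy to overlook. Because line 1 of the file is the header, starting the count at 2 makes each reported row number match the line number you would see in a text editor. A small sketch with an invented three-line file:

```python
import csv
import io

# Invented three-line file: a header plus two data rows
text = 'Student_ID,GPA\nLU100001,3.41\nLU100002,N/A\n'

problems = []
for row_num, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
    try:
        float(row['GPA'])
    except ValueError:
        problems.append(row_num)  # matches the line number in a text editor

print(problems)  # [3]
```

Note that this correspondence assumes one record per line. If quoted fields contain line breaks, row numbers and file line numbers drift apart.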

---

## 🚶 WALK Practice Problems

Use AI to help you understand concepts and errors, but write all code yourself.

---

### Problem 5.6: Robust Data Loader
Enhance the `load_student_data` function to also:
1. Track which specific fields had errors (missing, invalid type, out of range)
2. Return statistics about the loading process (total rows, valid rows, rows with each error type)

In [None]:
# Your code here


### Problem 5.7: Data Validation Suite
Write a set of validation functions for the Lehigh dataset:
1. `validate_college(college)` - Check if college is one of the 5 valid options
2. `validate_class_year(year)` - Check if class year is valid
3. `validate_credits(attempted, earned)` - Check if credits relationship is valid
4. `validate_record(record)` - Run all validations on a record

Each should raise a descriptive exception on failure.

In [None]:
# Your code here


### Problem 5.8: Load and Analyze the Messy Dataset
Load `lehigh_students_messy.csv` and report:
1. How many records failed to load?
2. What were the most common errors?
3. How many unique college names exist (before standardization)?

If you get stuck on specific errors, ask AI to explain them.

In [None]:
# Your code here


### Problem 5.9: Debug These Errors
Fix these code snippets. Use AI to understand the errors if needed.

In [None]:
# Error 1: Fix this file reading code
with open('sample_students.txt') as file:
    for line in file:
        print(line)
print(file.read())  # Why does this fail?

In [None]:
# Error 2: This doesn't catch the error. Why?
try:
    x = int('3.14')  # Should work, right?
except TypeError:
    print("Type error caught")

In [None]:
# Error 3: Fix the CSV writing
data = [{'name': 'Alice', 'gpa': 3.5}]
with open('output.csv', 'w') as file:
    writer = csv.DictWriter(file)
    writer.writerows(data)

---

# 🚀 RUN: Data Cleaning Pipeline

**Rules for this section:**
- Full AI collaboration is encouraged
- Document your process
- You must understand and be able to explain every line

---

## Chapter Project: Clean the Messy Lehigh Dataset

This is the culminating project for Weeks 1-3. You'll apply everything you've learned to transform the messy dataset into a clean, analyzable version.

### The Messy Dataset Issues

The file `lehigh_students_messy.csv` contains 605 records with these problems:

| Issue | Count | Example |
|-------|-------|--------|
| Inconsistent college names | 28 variations | "COB", "Business", "college of business" |
| Inconsistent class years | 23 variations | "Fr", "Freshman", "first year" |
| Missing GPA | ~35 records | Empty cells |
| Missing credits | ~36 records | Empty cells |
| Extra whitespace | ~151 records | "  Engineering  " |
| Typos in majors | ~30 records | "Computer Sceince", "Finace" |
| Invalid GPA values | ~8 records | GPA > 4.0 |
| Duplicate records | 5 records | Same Student_ID twice |
| Inconsistent date formats | 3+ formats | Mixed date styles |

### Requirements

Build a data cleaning pipeline that:

1. **Loads the messy data** with proper error handling
2. **Standardizes college names** to the 5 official names
3. **Standardizes class years** to consistent format
4. **Strips whitespace** from all text fields
5. **Handles missing values** (document your strategy)
6. **Fixes typos** in major names
7. **Validates GPA** and flags/corrects invalid values
8. **Detects and removes duplicates** (keep the first occurrence)
9. **Writes the cleaned data** to `lehigh_students_cleaned.csv`
10. **Generates a cleaning report** showing what was fixed
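As one illustration of these requirements, requirement 8 (keep the first occurrence of each duplicate) can be sketched with a set of IDs already seen. The sample rows below are invented, and this is only one possible approach, not the required implementation.

```python
# Invented sample rows; the real data would come from csv.DictReader
rows = [
    {'Student_ID': 'LU100001', 'GPA': '3.41'},
    {'Student_ID': 'LU100002', 'GPA': '3.55'},
    {'Student_ID': 'LU100001', 'GPA': '3.41'},  # duplicate ID
]

seen_ids = set()
unique_rows = []
duplicates = []

for row in rows:
    student_id = row['Student_ID'].strip()
    if student_id in seen_ids:
        duplicates.append(row)   # keep for the cleaning report
    else:
        seen_ids.add(student_id)
        unique_rows.append(row)  # first occurrence wins

print(len(unique_rows), len(duplicates))  # 2 1
```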

### Deliverables

1. A cleaned CSV file with 600 valid, standardized records
2. A cleaning report showing:
   - How many records were affected by each issue
   - What standardization rules you applied
   - How you handled missing values and why
   - Any records that were dropped and why
3. Documented code explaining your process

### AI Collaboration Tips

Good prompts:
- "What's the best way to create a mapping dictionary for standardizing names?"
- "How can I detect near-duplicate strings that might be typos?"
- "Explain fuzzy string matching for finding similar major names"

Avoid:
- "Clean this dataset for me"
- "Write a data cleaning script"
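As a starting point for the fuzzy matching prompt, the standard library's `difflib.get_close_matches` can suggest the nearest known spelling. This is a hedged sketch: the `KNOWN_MAJORS` list, the `suggest_major` helper, and the `cutoff=0.8` threshold are all assumptions you would need to adjust for the real dataset.

```python
import difflib

# Hypothetical list of correct major names (replace with the real ones)
KNOWN_MAJORS = ['Computer Science', 'Finance', 'Biology', 'Psychology']

def suggest_major(raw):
    """Return the closest known major, or None if nothing is similar enough."""
    candidate = raw.strip().title()  # normalize spacing and capitalization first
    matches = difflib.get_close_matches(candidate, KNOWN_MAJORS, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(suggest_major('Computer Sceince'))           # Computer Science
print(suggest_major('finace'))                     # Finance
print(suggest_major('Underwater Basket Weaving'))  # None
```

Automatic suggestions can be wrong, so it is worth logging every correction in your cleaning report and reviewing them by hand.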

In [None]:
# DATA CLEANING PIPELINE
#
# AI Collaboration Log:
# - Prompts used:
# - Key insights:
# - My modifications:

import csv
from collections import Counter

# ============================================
# CONFIGURATION: Standardization Mappings
# ============================================

# Official college names
VALID_COLLEGES = {
    "College of Business",
    "P.C. Rossin College of Engineering",
    "College of Arts and Sciences",
    "College of Health",
    "College of Education"
}

# Mapping from variations to standard names
COLLEGE_MAP = {
    # Business variations
    'cob': 'College of Business',
    'business': 'College of Business',
    'college of business': 'College of Business',
    'buisness': 'College of Business',
    # Add more mappings as you discover them...
}

# Add your class year mappings
CLASS_YEAR_MAP = {
    # First Year variations
    'fr': 'First Year',
    'freshman': 'First Year',
    'first year': 'First Year',
    '1st year': 'First Year',
    # Add more...
}

# Add your major typo fixes
MAJOR_FIXES = {
    'computer sceince': 'Computer Science',
    'finace': 'Finance',
    # Add more...
}

# ============================================
# CLEANING FUNCTIONS
# ============================================

def standardize_college(college):
    """Standardize college name to official format."""
    # Your implementation here
    pass

def standardize_class_year(year):
    """Standardize class year to official format."""
    # Your implementation here
    pass

def fix_major_typo(major):
    """Fix known typos in major names."""
    # Your implementation here
    pass

def clean_gpa(gpa_str):
    """Convert GPA string to float, handling errors."""
    # Your implementation here
    pass

def clean_record(row):
    """
    Clean a single student record.
    Returns (cleaned_record, issues_found)
    """
    # Your implementation here
    pass

# ============================================
# MAIN PIPELINE
# ============================================

def run_cleaning_pipeline(input_file, output_file):
    """
    Run the complete data cleaning pipeline.
    
    Returns a report dictionary with statistics.
    """
    # Your implementation here
    pass

# ============================================
# RUN THE PIPELINE
# ============================================

# report = run_cleaning_pipeline('lehigh_students_messy.csv', 'lehigh_students_cleaned.csv')
# print_cleaning_report(report)  # write this helper yourself; it is not defined above

### Cleaning Report Template

After running your pipeline, fill in this report:

```
DATA CLEANING REPORT
====================

Input file: lehigh_students_messy.csv
Output file: lehigh_students_cleaned.csv

SUMMARY
-------
Records in input: ___
Records in output: ___
Records dropped: ___

ISSUES FIXED
------------
College names standardized: ___ records
Class years standardized: ___ records  
Whitespace stripped: ___ records
Major typos fixed: ___ records
Duplicates removed: ___ records

MISSING VALUES
--------------
Records with missing GPA: ___
Records with missing credits: ___
Strategy used: [describe your approach]

INVALID VALUES
--------------
Invalid GPA values found: ___
Strategy used: [describe your approach]

STANDARDIZATION MAPPINGS
------------------------
College variations found: [list them]
Class year variations found: [list them]
Major typos found: [list them]

QUALITY VERIFICATION
--------------------
Unique colleges in output: ___ (should be 5)
Unique class years in output: ___ (should be 5)
GPA range in output: ___ to ___ (should be 0.0-4.0)
All Student_IDs unique: Yes/No
```

### Project Reflection

1. Which data quality issue was hardest to handle? Why?
2. What decisions did you make about missing values? Defend your choice.
3. How would you handle this differently with pandas (which you'll learn in Week 5)?
4. What additional validations would you add for production use?
5. How did AI help (or not help) with this project?

*Your reflection here:*



---

# Midterm Preparation

The midterm exam covers Chapters 1-5. Here's what you need to know **without AI assistance**:

## From Chapter 5 (File I/O and Exceptions)

### Must Know (CRAWL)
- [ ] Open files with `open()` and different modes ('r', 'w', 'a')
- [ ] Use `with` statements for safe file handling
- [ ] Read files: `read()`, `readline()`, `readlines()`
- [ ] Write to files with `write()`
- [ ] Use `csv.reader` and `csv.DictReader`
- [ ] Write `try/except` blocks for common exceptions
- [ ] Know when `FileNotFoundError`, `ValueError`, `KeyError` occur

### Should Know (WALK)
- [ ] Handle multiple exception types
- [ ] Use `else` and `finally` clauses
- [ ] Raise your own exceptions with `raise`
- [ ] Work with file encodings

## Common Exam Question Types

1. **Code output prediction:** Given code with file operations or try/except, predict what happens
2. **Error identification:** Identify what's wrong with file handling code
3. **Code completion:** Fill in blanks to make file I/O code work
4. **Short answer:** Explain the difference between 'r', 'w', 'a' modes
5. **Practical:** Write code to load a CSV and calculate something

## Practice Questions

In [None]:
# Practice 1: What does this print?

try:
    x = int('hello')
    print('A')
except ValueError:
    print('B')
else:
    print('C')
finally:
    print('D')

# Your prediction: ___

In [None]:
# Practice 2: What does this print?

try:
    x = int('5')
    print('A')
except ValueError:
    print('B')
else:
    print('C')
finally:
    print('D')

# Your prediction: ___

In [None]:
# Practice 3: What's wrong with this code?

file = open('data.txt', 'r')
data = file.read()
# process data...
# What's missing?

In [None]:
# Practice 4: Fill in the blanks

# Read a CSV file and print each row
import ___

___ open('students.csv', 'r') as file:
    reader = csv.___(file)
    for ___ in reader:
        print(row)

---

# Accountability Check

## 🐛 CRAWL (Must do without AI)
- [ ] Open, read, and close files with `open()`
- [ ] Use `with` statements for file handling
- [ ] Explain the difference between 'r', 'w', 'a' modes
- [ ] Read files with `read()`, `readline()`, `readlines()`
- [ ] Parse CSV files with `csv.reader` and `csv.DictReader`
- [ ] Write `try/except` blocks to handle specific exceptions
- [ ] Know which exceptions occur in common situations

## 🚶 WALK (AI to learn, write code yourself)
- [ ] Handle multiple exception types in one block
- [ ] Use `else` and `finally` appropriately
- [ ] Raise exceptions for data validation
- [ ] Work with file encodings (UTF-8, Latin-1)
- [ ] Build robust data loading functions

## 🚀 RUN (AI-assisted, must understand)
- [ ] Build a complete data cleaning pipeline
- [ ] Handle multiple data quality issues systematically
- [ ] Document cleaning decisions and their rationale
- [ ] Generate cleaning reports with statistics

**Review CRAWL material before the midterm. You cannot use AI on the exam.**

---

## What's Next?

After the midterm, you'll enter the data analysis phase:

**Week 4: NumPy**
- Numerical arrays for fast computation
- Statistical operations
- The foundation of pandas

**Week 5: Pandas**
- DataFrames for tabular data (think: Excel on steroids)
- Everything you did in this chapter's project, but in far fewer lines of code
- Group by, merge, pivot operations

**Week 6: Visualization**
- Matplotlib and Seaborn
- Turning data into insights
- Final project

The cleaned dataset you produced in this chapter will be your data for the rest of the course. Good luck on the midterm!

---