# Module 5: File Handling

## Topics Covered
1. Reading Text Files
2. Writing to Files
3. Working with CSV Files (without pandas)
4. Working with JSON Files
5. Context Managers (with statement)
6. File Paths and OS Module

## Learning Objectives

By the end of this module, you will be able to:
- Read and write text files using Python's built-in functions
- Process CSV data without external libraries
- Parse and create JSON files for configuration and data exchange
- Use context managers for safe file handling
- Navigate and manipulate file paths programmatically
- Build data pipelines that read, process, and write files

---

---
# Section 1: Reading Text Files
---

## What is File Handling?

File handling allows your Python programs to interact with files stored on your computer. This is fundamental to data science because real-world data comes from files - log files, data exports, configuration files, and more.

### Why This Matters in Data Science

As a data analyst or data scientist, you'll constantly work with files:
- **Reading data** from CSV exports, JSON APIs, or text logs
- **Writing results** to reports, processed datasets, or output files
- **Parsing configuration** files for data pipelines
- **Processing log files** to extract insights

Understanding how files work at a fundamental level helps you debug issues and work with any file format, even when pandas isn't available or appropriate.

## Syntax

```python
# Opening a file for reading
file = open(filename, mode)

# Common modes:
# 'r'  - Read (default) - file must exist
# 'w'  - Write - creates new file or overwrites existing
# 'a'  - Append - adds to end of file
# 'r+' - Read and write - file must exist
```

**Parameters:**
- `filename`: Path to the file (string)
- `mode`: How to open the file ('r', 'w', 'a', etc.)

**Returns:** A file object that you can read from or write to
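
The sketch below is one way to see the modes in action without touching your real data. It works inside a temporary directory, and the file name `demo.txt` is made up for illustration. (It uses a `with` block for the temporary directory; context managers are covered in Section 5.)

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # 'r' requires the file to exist, so this raises FileNotFoundError
    missing_file_error = False
    try:
        open(demo_path, "r")
    except FileNotFoundError:
        missing_file_error = True

    # 'w' creates the file (or overwrites an existing one)
    f = open(demo_path, "w")
    f.write("first line\n")
    f.close()

    # 'a' keeps the existing content and adds to the end
    f = open(demo_path, "a")
    f.write("second line\n")
    f.close()

    # 'r' reads the combined result back
    f = open(demo_path, "r")
    final_content = f.read()
    f.close()

print("Missing file raised an error:", missing_file_error)
print(repr(final_content))
```

Swapping the `'a'` for `'w'` in the second write would leave only `second line` in the file, which is the overwrite behavior described above.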

## Reading Methods

Python provides several methods to read file contents:

| Method | Description |
|--------|-------------|
| `read()` | Read entire file as a single string |
| `readline()` | Read one line at a time |
| `readlines()` | Read all lines into a list |
| Iteration | Loop through lines directly |

In [None]:
# Example: Reading an entire file with read()
# Let's read a simple text file

file = open('assets/datasets/meeting_notes.txt', 'r')
content = file.read()
file.close()  # Always close the file when done!

print(content)

In [None]:
# Example: Reading line by line with readline()
# Useful when you only need specific lines

file = open('assets/datasets/meeting_notes.txt', 'r')

# Read just the first three lines
line1 = file.readline()
line2 = file.readline()
line3 = file.readline()

file.close()

print("First line:", line1)
print("Second line:", line2)
print("Third line:", line3)

In [None]:
# Example: Reading all lines into a list with readlines()
# Each line becomes an element in the list

file = open('assets/datasets/meeting_notes.txt', 'r')
lines = file.readlines()
file.close()

print(f"Total lines: {len(lines)}")
print(f"Type: {type(lines)}")
print("\nFirst 5 lines:")
for i, line in enumerate(lines[:5]):
    print(f"  {i}: {repr(line)}")

In [None]:
# Example: Iterating through a file directly
# Most memory-efficient way to process large files

file = open('assets/datasets/server_log.txt', 'r')

error_count = 0
warning_count = 0

for line in file:
    if 'ERROR' in line:
        error_count += 1
        print(f"Error found: {line.strip()}")
    elif 'WARNING' in line:
        warning_count += 1

file.close()

print(f"\nSummary: {error_count} errors, {warning_count} warnings")

## Common Issue: Newline Characters

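When you read a file, each line keeps its trailing newline character `\n` (the last line may lack one if the file does not end with a newline). Use `strip()` to remove it. Note that `strip()` also removes any other leading and trailing whitespace, so use `rstrip('\n')` when indentation matters: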

In [None]:
# Example: Handling newline characters

file = open('assets/datasets/meeting_notes.txt', 'r')
lines = file.readlines()
file.close()

# With newline (notice the extra blank line between prints)
print("With newline:")
print(lines[0])
print(lines[1])

print("---")

# Using strip() to remove newline
print("Without newline (using strip):")
print(lines[0].strip())
print(lines[1].strip())
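
To compare the two ways of removing a newline, here is a small sketch on a made-up string (not taken from the dataset). `strip()` trims whitespace on both sides, while `rstrip("\n")` removes only the trailing newline and keeps any indentation.

```python
# A made-up line with leading spaces and a trailing newline
raw_line = "  - Sarah: Finalize budget\n"

stripped = raw_line.strip()              # whitespace removed on both sides
newline_removed = raw_line.rstrip("\n")  # only the trailing newline removed

print(repr(stripped))
print(repr(newline_removed))
```

Which one to use depends on whether the leading whitespace carries meaning in your file.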

## Practice Exercise 1.1

**Task:** Read the file `assets/datasets/server_log.txt` and count how many log entries there are for each log level (INFO, DEBUG, WARNING, ERROR).

**Expected Output:**
```
Log Level Counts:
INFO: 12
DEBUG: 2
WARNING: 2
ERROR: 2
```

In [None]:
# Your code here


In [None]:
# Solution 1.1

file = open('assets/datasets/server_log.txt', 'r')

# Initialize counters
counts = {'INFO': 0, 'DEBUG': 0, 'WARNING': 0, 'ERROR': 0}

for line in file:
    for level in counts:
        if level in line:
            counts[level] += 1
            break  # Each line has only one level

file.close()

print("Log Level Counts:")
for level, count in counts.items():
    print(f"{level}: {count}")

## Practice Exercise 1.2

**Task:** Read `assets/datasets/meeting_notes.txt` and extract all the action items (lines that start with "- "). Store them in a list and print each action item.

**Expected Output:**
```
Action Items Found:
1. Sarah: Finalize budget allocation by Jan 15
2. Mike: Complete hiring pipeline by Jan 20
3. Jennifer: Prepare product launch materials
4. David: Set up analytics dashboard
```

In [None]:
# Your code here


In [None]:
# Solution 1.2

file = open('assets/datasets/meeting_notes.txt', 'r')
action_items = []

for line in file:
    line = line.strip()
    if line.startswith('- '):
        # Remove the '- ' prefix
        action_items.append(line[2:])

file.close()

print("Action Items Found:")
for i, item in enumerate(action_items, 1):
    print(f"{i}. {item}")

---
# Section 2: Writing to Files
---

## Writing Modes

Python provides different modes for writing files:

| Mode | Description |
|------|-------------|
| `'w'` | Write mode - creates new file or **overwrites** existing |
| `'a'` | Append mode - adds to the end of existing file |
| `'x'` | Exclusive creation - fails if file already exists |

**Warning:** Using `'w'` mode on an existing file will erase all its contents!
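
If overwriting would be costly, `'x'` mode is a simple safeguard. The sketch below assumes a throwaway temporary directory and a hypothetical file name `report.txt`; the second attempt to create the file raises `FileExistsError` and the original content survives.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = os.path.join(tmp_dir, "report.txt")  # hypothetical file name

    # First creation succeeds because the file does not exist yet
    f = open(report_path, "x")
    f.write("original results\n")
    f.close()

    # Second creation fails instead of silently erasing the file
    already_exists = False
    try:
        f = open(report_path, "x")
        f.write("this would have replaced the original\n")
        f.close()
    except FileExistsError:
        already_exists = True

    # The original content is untouched
    f = open(report_path, "r")
    preserved = f.read()
    f.close()

print("Second open('x') was refused:", already_exists)
print(repr(preserved))
```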

## Syntax

```python
# Writing to a file
file = open(filename, 'w')  # or 'a' for append
file.write(string)          # Write a string
file.writelines(lines)      # Write a list of strings (no newlines are added)
file.close()
```

**Note:** `write()` does not automatically add newlines - you must include `\n` yourself.

In [None]:
# Example: Writing a new file with write()

file = open('assets/datasets/output_example.txt', 'w')

file.write("Data Analysis Report\n")
file.write("=" * 20 + "\n")
file.write("\n")
file.write("Key Findings:\n")
file.write("1. Sales increased by 15%\n")
file.write("2. Customer satisfaction is at 4.2/5\n")
file.write("3. Top region: Central\n")

file.close()

# Verify by reading it back
file = open('assets/datasets/output_example.txt', 'r')
print(file.read())
file.close()

In [None]:
# Example: Writing multiple lines with writelines()

lines = [
    "Product Inventory Report\n",
    "Generated: 2024-01-15\n",
    "\n",
    "Low Stock Items:\n",
    "- Standing Desk: 15 units\n",
    "- Ergonomic Chair: 28 units\n",
    "- 4K Monitor: 32 units\n"
]

file = open('assets/datasets/inventory_report.txt', 'w')
file.writelines(lines)  # Note: each string must include \n
file.close()

# Verify
file = open('assets/datasets/inventory_report.txt', 'r')
print(file.read())
file.close()

In [None]:
# Example: Appending to an existing file

# First, let's see what's in the file
file = open('assets/datasets/inventory_report.txt', 'r')
print("Before append:")
print(file.read())
file.close()

# Now append new content
file = open('assets/datasets/inventory_report.txt', 'a')
file.write("\n--- Update ---\n")
file.write("- USB-C Dock: Restocked to 150 units\n")
file.close()

# Check the result
file = open('assets/datasets/inventory_report.txt', 'r')
print("After append:")
print(file.read())
file.close()

## Practical Example: Processing Log Data

A common data task: read a log file, filter for specific entries, and write results to a new file.

In [None]:
# Example: Extract errors from log and save to separate file

# Read the source file
source = open('assets/datasets/server_log.txt', 'r')
output = open('assets/datasets/errors_only.txt', 'w')

# Write header
output.write("Error Log Extract\n")
output.write("=" * 40 + "\n\n")

error_count = 0
for line in source:
    if 'ERROR' in line:
        output.write(line)
        error_count += 1

output.write(f"\nTotal errors: {error_count}\n")

source.close()
output.close()

# Verify output
file = open('assets/datasets/errors_only.txt', 'r')
print(file.read())
file.close()

## Practice Exercise 2.1

**Task:** Read the server log file and create a summary report that shows:
1. Total number of log entries
2. Count of each log level
3. Timestamp of first and last entry

Write this summary to `assets/datasets/log_summary.txt`.

**Expected Output (in log_summary.txt):**
```
Server Log Summary
==================

Total entries: 20

Log Levels:
  INFO: 12
  DEBUG: 2
  WARNING: 2
  ERROR: 2

Time Range:
  First: 2024-01-15 08:23:45
  Last: 2024-01-15 09:00:00
```

In [None]:
# Your code here


In [None]:
# Solution 2.1

# Read and analyze the log
source = open('assets/datasets/server_log.txt', 'r')
lines = source.readlines()
source.close()

# Count entries and log levels
counts = {'INFO': 0, 'DEBUG': 0, 'WARNING': 0, 'ERROR': 0}
first_timestamp = None
last_timestamp = None
total_entries = 0  # Counts only non-blank lines

for line in lines:
    line = line.strip()
    if not line:
        continue
    total_entries += 1
    
    # Extract timestamp (first 19 characters: YYYY-MM-DD HH:MM:SS)
    timestamp = line[:19]
    if first_timestamp is None:
        first_timestamp = timestamp
    last_timestamp = timestamp
    
    # Count log levels
    for level in counts:
        if level in line:
            counts[level] += 1
            break

# Write summary
output = open('assets/datasets/log_summary.txt', 'w')
output.write("Server Log Summary\n")
output.write("==================\n\n")
output.write(f"Total entries: {total_entries}\n\n")
output.write("Log Levels:\n")
for level, count in counts.items():
    output.write(f"  {level}: {count}\n")
output.write("\nTime Range:\n")
output.write(f"  First: {first_timestamp}\n")
output.write(f"  Last: {last_timestamp}\n")
output.close()

# Verify
file = open('assets/datasets/log_summary.txt', 'r')
print(file.read())
file.close()

---
# Section 3: Working with CSV Files (without pandas)
---

## What is CSV?

CSV (Comma-Separated Values) is one of the most common formats for tabular data. Each line represents a row, and values are separated by commas (or other delimiters).

Example CSV:
```
name,age,city
Alice,30,New York
Bob,25,Los Angeles
```

### Why Learn CSV Without pandas?

While pandas is the go-to tool for CSV in data science, understanding the `csv` module helps you:
- Work in environments where pandas isn't available
- Control exactly how each row is parsed when the data is messy or irregular
- Process very large files one row at a time, without loading the whole file into memory
- Understand what pandas does under the hood

## The csv Module

Python's built-in `csv` module handles the complexities of CSV parsing:
- Properly handles quoted fields
- Deals with commas inside values
- Supports different delimiters

## Syntax

```python
import csv

# Reading CSV
with open('file.csv', 'r') as file:
    reader = csv.reader(file)      # Returns rows as lists
    # or
    reader = csv.DictReader(file)  # Returns rows as dictionaries

# Writing CSV
with open('file.csv', 'w', newline='') as file:
    writer = csv.writer(file)           # Write lists as rows
    # or
    writer = csv.DictWriter(file, fieldnames=['col1', 'col2'])
```

**Note:** The `csv` documentation recommends opening files with `newline=''` for both reading and writing. It matters most when writing, where omitting it can produce extra blank rows on Windows.

In [None]:
import csv

# Example: Reading CSV with csv.reader()
# Returns each row as a list

file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.reader(file)

# Get the header row
headers = next(reader)
print("Headers:", headers)
print()

# Read first 5 data rows
print("First 5 rows:")
for i, row in enumerate(reader):
    if i >= 5:
        break
    print(row)

file.close()

In [None]:
# Example: Reading CSV with csv.DictReader()
# Returns each row as a dictionary with column names as keys

file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.DictReader(file)

print("First 3 transactions:\n")
for i, row in enumerate(reader):
    if i >= 3:
        break
    print(f"Transaction: {row['transaction_id']}")
    print(f"  Product: {row['product']}")
    print(f"  Amount: ${row['total_amount']}")
    print(f"  Region: {row['region']}")
    print()

file.close()

In [None]:
# Example: Analyzing CSV data - Calculate total sales by region

file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.DictReader(file)

# Aggregate sales by region
sales_by_region = {}

for row in reader:
    region = row['region']
    amount = float(row['total_amount'])
    
    if region in sales_by_region:
        sales_by_region[region] += amount
    else:
        sales_by_region[region] = amount

file.close()

print("Total Sales by Region:")
print("-" * 30)
for region, total in sorted(sales_by_region.items(), key=lambda x: x[1], reverse=True):
    print(f"{region:12} ${total:>12,.2f}")

In [None]:
# Example: Handling missing values in CSV

file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.DictReader(file)

# Find rows with missing data
missing_price = 0
missing_rep = 0
missing_rating = 0

for row in reader:
    if row['unit_price'] == '':
        missing_price += 1
    if row['sales_rep'] == '':
        missing_rep += 1
    if row['customer_rating'] == '':
        missing_rating += 1

file.close()

print("Missing Value Analysis:")
print(f"  unit_price: {missing_price} missing")
print(f"  sales_rep: {missing_rep} missing")
print(f"  customer_rating: {missing_rating} missing")
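
Counting missing values is only the first step: calling `float('')` on an empty field raises `ValueError`, so numeric columns need a guarded conversion. The helper below, `to_float`, is a hypothetical name introduced for this sketch, and the sample rows are made up rather than taken from `sales_data.csv`.

```python
import csv
import io

# Made-up sample with one missing unit_price
sample_csv = io.StringIO(
    "transaction_id,unit_price\n"
    "T001,19.99\n"
    "T002,\n"
    "T003,5.00\n"
)

def to_float(value, default=None):
    """Convert a CSV string to float, returning default for empty strings."""
    value = value.strip()
    return float(value) if value else default

prices = [to_float(row["unit_price"]) for row in csv.DictReader(sample_csv)]

# Skip the missing entries when computing statistics
known_prices = [p for p in prices if p is not None]
average_price = sum(known_prices) / len(known_prices)

print(prices)
print(f"Average of known prices: {average_price:.2f}")
```

Whether to skip missing values, as done here, or replace them with a default depends on the analysis.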

In [None]:
# Example: Writing CSV with csv.writer()

# Create a summary report as CSV
file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.DictReader(file)

# Calculate category summaries
category_stats = {}
for row in reader:
    category = row['category']
    amount = float(row['total_amount'])
    
    if category not in category_stats:
        category_stats[category] = {'total': 0, 'count': 0}
    
    category_stats[category]['total'] += amount
    category_stats[category]['count'] += 1

file.close()

# Write summary to new CSV
output = open('assets/datasets/category_summary.csv', 'w', newline='')
writer = csv.writer(output)

# Write header
writer.writerow(['Category', 'Transaction Count', 'Total Sales', 'Average Sale'])

# Write data rows
for category, stats in category_stats.items():
    avg = stats['total'] / stats['count']
    writer.writerow([category, stats['count'], round(stats['total'], 2), round(avg, 2)])

output.close()

# Verify by reading back
print("Category Summary CSV:")
file = open('assets/datasets/category_summary.csv', 'r')
print(file.read())
file.close()

In [None]:
# Example: Writing CSV with csv.DictWriter()
# More readable when you have dictionaries

# Sample employee data to write
employees = [
    {'id': 'E001', 'name': 'Alice Johnson', 'department': 'Engineering', 'salary': 95000},
    {'id': 'E002', 'name': 'Bob Smith', 'department': 'Marketing', 'salary': 75000},
    {'id': 'E003', 'name': 'Carol Davis', 'department': 'Engineering', 'salary': 88000}
]

# Define column order
fieldnames = ['id', 'name', 'department', 'salary']

output = open('assets/datasets/employees_sample.csv', 'w', newline='')
writer = csv.DictWriter(output, fieldnames=fieldnames)

writer.writeheader()  # Writes the column names
writer.writerows(employees)  # Writes all rows at once

output.close()

# Verify
print("Employees Sample CSV:")
file = open('assets/datasets/employees_sample.csv', 'r')
print(file.read())
file.close()

## Practice Exercise 3.1

**Task:** Read the `assets/datasets/employees.csv` file and create a report showing:
1. Number of employees per department
2. Average salary per department

Write the results to `assets/datasets/department_report.csv`.

**Expected Output (printed and in CSV):**
```
Department,Employee Count,Average Salary
Engineering,XX,XXXXX.XX
Sales,XX,XXXXX.XX
...
```

In [None]:
# Your code here


In [None]:
# Solution 3.1

import csv

# Read employees data
file = open('assets/datasets/employees.csv', 'r')
reader = csv.DictReader(file)

# Aggregate by department
dept_stats = {}

for row in reader:
    dept = row['department']
    salary = int(row['salary'])
    
    if dept not in dept_stats:
        dept_stats[dept] = {'count': 0, 'total_salary': 0}
    
    dept_stats[dept]['count'] += 1
    dept_stats[dept]['total_salary'] += salary

file.close()

# Write report CSV
output = open('assets/datasets/department_report.csv', 'w', newline='')
writer = csv.writer(output)

writer.writerow(['Department', 'Employee Count', 'Average Salary'])

for dept, stats in sorted(dept_stats.items()):
    avg_salary = stats['total_salary'] / stats['count']
    writer.writerow([dept, stats['count'], round(avg_salary, 2)])

output.close()

# Display results
print("Department Report:")
file = open('assets/datasets/department_report.csv', 'r')
print(file.read())
file.close()

## Practice Exercise 3.2

**Task:** Find the top 10 highest-value transactions from `assets/datasets/sales_data.csv`. Write these to a new file `assets/datasets/top_transactions.csv` with just the columns: transaction_id, date, product, total_amount.

**Hint:** You'll need to read all data, sort it, then write the top 10.

In [None]:
# Your code here


In [None]:
# Solution 3.2

import csv

# Read all transactions
file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.DictReader(file)

transactions = []
for row in reader:
    transactions.append({
        'transaction_id': row['transaction_id'],
        'date': row['date'],
        'product': row['product'],
        'total_amount': float(row['total_amount'])
    })

file.close()

# Sort by total_amount descending and get top 10
top_10 = sorted(transactions, key=lambda x: x['total_amount'], reverse=True)[:10]

# Write to new CSV
output = open('assets/datasets/top_transactions.csv', 'w', newline='')
writer = csv.DictWriter(output, fieldnames=['transaction_id', 'date', 'product', 'total_amount'])

writer.writeheader()
writer.writerows(top_10)

output.close()

# Display results
print("Top 10 Highest-Value Transactions:")
print("-" * 60)
file = open('assets/datasets/top_transactions.csv', 'r')
print(file.read())
file.close()

---
# Section 4: Working with JSON Files
---

## What is JSON?

JSON (JavaScript Object Notation) is a lightweight data format that's easy for humans to read and machines to parse. It's widely used for:

- **API responses** - Most web APIs return JSON
- **Configuration files** - Application settings
- **Data exchange** - Sharing structured data between systems
- **NoSQL databases** - MongoDB, CouchDB store JSON documents

### JSON Data Types

| JSON Type | Python Type |
|-----------|-------------|
| object | dict |
| array | list |
| string | str |
| number | int or float |
| true/false | True/False |
| null | None |
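
The mapping in the table can be checked directly. The sketch below parses a made-up JSON string with `json.loads` and records the Python type of each value.

```python
import json

# A made-up JSON document covering each type in the table
json_text = (
    '{"name": "Widget", "tags": ["a", "b"], "price": 9.5, '
    '"stock": 3, "active": true, "notes": null}'
)

record = json.loads(json_text)

# Map each key to the name of the Python type it was parsed into
type_names = {key: type(value).__name__ for key, value in record.items()}

for key, type_name in type_names.items():
    print(f"{key:8} -> {type_name}")
```

Note that a JSON number becomes an `int` when it has no fractional part and a `float` otherwise.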

## Syntax

```python
import json

# Reading JSON from file
with open('file.json', 'r') as f:
    data = json.load(f)  # Parses file into Python object

# Reading JSON from string
data = json.loads(json_string)

# Writing JSON to file
with open('file.json', 'w') as f:
    json.dump(data, f)  # Writes Python object as JSON

# Converting to JSON string
json_string = json.dumps(data)
```

**Common parameters:**
- `indent`: Number of spaces for pretty-printing
- `sort_keys`: Sort dictionary keys alphabetically
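
As a quick illustration of these two parameters, the sketch below serializes a made-up dictionary with and without them, using `json.dumps` so the result can be printed directly.

```python
import json

# Made-up settings dictionary for illustration
settings = {"version": 2, "debug": False, "name": "demo"}

compact = json.dumps(settings)                           # single line, insertion order
pretty = json.dumps(settings, indent=2, sort_keys=True)  # indented, keys sorted

print(compact)
print(pretty)
```

The same keyword arguments work with `json.dump()` when writing to a file.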

In [None]:
import json

# Example: Reading a JSON configuration file

file = open('assets/datasets/config.json', 'r')
config = json.load(file)
file.close()

print(f"Application: {config['application']['name']}")
print(f"Version: {config['application']['version']}")
print(f"Database Host: {config['database']['host']}")
print(f"Debug Mode: {config['application']['debug_mode']}")

In [None]:
# Example: Navigating nested JSON structure

file = open('assets/datasets/products.json', 'r')
catalog = json.load(file)
file.close()

print(f"Catalog Version: {catalog['catalog_version']}")
print(f"Last Updated: {catalog['last_updated']}")
print(f"Currency: {catalog['currency']}")
print()

# Access nested categories
for category_name, category_data in catalog['categories'].items():
    product_count = len(category_data['products'])
    tax_rate = category_data['tax_rate']
    print(f"{category_name}: {product_count} products (Tax: {tax_rate:.1%})")

In [None]:
# Example: Extracting specific data from nested JSON

file = open('assets/datasets/products.json', 'r')
catalog = json.load(file)
file.close()

# Find all products with low stock (less than 50 units)
print("Low Stock Alert (< 50 units):")
print("-" * 50)

for category_name, category_data in catalog['categories'].items():
    for product in category_data['products']:
        stock = product['inventory']['stock']
        if stock < 50:
            print(f"{product['name']:30} Stock: {stock:4} (Reorder: {product['inventory']['reorder_point']})")

In [None]:
# Example: Writing JSON with pretty printing

# Create a data summary to save
file = open('assets/datasets/products.json', 'r')
catalog = json.load(file)
file.close()

# Calculate statistics
summary = {
    "report_type": "Product Catalog Summary",
    "generated": "2024-01-15",
    "categories": {}
}

total_products = 0
total_inventory = 0

for cat_name, cat_data in catalog['categories'].items():
    products = cat_data['products']
    product_count = len(products)
    total_stock = sum(p['inventory']['stock'] for p in products)
    avg_rating = sum(p['ratings']['average'] for p in products) / product_count
    
    summary['categories'][cat_name] = {
        "product_count": product_count,
        "total_stock": total_stock,
        "average_rating": round(avg_rating, 2)
    }
    
    total_products += product_count
    total_inventory += total_stock

summary['totals'] = {
    "total_products": total_products,
    "total_inventory": total_inventory
}

# Write with pretty printing
output = open('assets/datasets/catalog_summary.json', 'w')
json.dump(summary, output, indent=2)
output.close()

# Display result
print("Written to catalog_summary.json:")
file = open('assets/datasets/catalog_summary.json', 'r')
print(file.read())
file.close()

In [None]:
# Example: Modifying and saving JSON

# Read existing config
file = open('assets/datasets/config.json', 'r')
config = json.load(file)
file.close()

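# Create a modified copy for testing
# deepcopy is required here: dict.copy() is shallow, so the nested
# dictionaries would be shared and the original config would change too
import copy
test_config = copy.deepcopy(config)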
test_config['application']['environment'] = 'testing'
test_config['application']['debug_mode'] = True
test_config['database']['host'] = 'localhost'
test_config['database']['name'] = 'analytics_test'
test_config['logging']['level'] = 'DEBUG'

# Save the test configuration
output = open('assets/datasets/config_test.json', 'w')
json.dump(test_config, output, indent=2)
output.close()

print("Test configuration saved!")
print("\nKey differences:")
print(f"  Environment: {config['application']['environment']} -> {test_config['application']['environment']}")
print(f"  Debug Mode: {config['application']['debug_mode']} -> {test_config['application']['debug_mode']}")
print(f"  Database: {config['database']['name']} -> {test_config['database']['name']}")

## Practice Exercise 4.1

**Task:** Read `assets/datasets/products.json` and find:
1. The product with the highest rating
2. The product with the most reviews
3. The most expensive product

Print the results in a formatted way.

In [None]:
# Your code here


In [None]:
# Solution 4.1

import json

file = open('assets/datasets/products.json', 'r')
catalog = json.load(file)
file.close()

# Initialize trackers
highest_rated = {'name': '', 'rating': 0, 'category': ''}
most_reviewed = {'name': '', 'reviews': 0, 'category': ''}
most_expensive = {'name': '', 'price': 0, 'category': ''}

# Search through all products
for cat_name, cat_data in catalog['categories'].items():
    for product in cat_data['products']:
        # Check rating
        if product['ratings']['average'] > highest_rated['rating']:
            highest_rated = {
                'name': product['name'],
                'rating': product['ratings']['average'],
                'category': cat_name
            }
        
        # Check reviews
        if product['ratings']['count'] > most_reviewed['reviews']:
            most_reviewed = {
                'name': product['name'],
                'reviews': product['ratings']['count'],
                'category': cat_name
            }
        
        # Check price
        if product['price'] > most_expensive['price']:
            most_expensive = {
                'name': product['name'],
                'price': product['price'],
                'category': cat_name
            }

print("Product Analysis Results")
print("=" * 50)
print()
print("Highest Rated Product:")
print(f"  {highest_rated['name']} ({highest_rated['category']})")
print(f"  Rating: {highest_rated['rating']} / 5.0")
print()
print("Most Reviewed Product:")
print(f"  {most_reviewed['name']} ({most_reviewed['category']})")
print(f"  Reviews: {most_reviewed['reviews']}")
print()
print("Most Expensive Product:")
print(f"  {most_expensive['name']} ({most_expensive['category']})")
print(f"  Price: ${most_expensive['price']}")

## Practice Exercise 4.2

**Task:** Create a JSON file called `inventory_alerts.json` that contains:
1. A list of all products with stock below their reorder point
2. For each product, include: sku, name, current_stock, reorder_point, shortage (how many units below reorder)

The output should look like:
```json
{
  "alert_type": "Low Inventory",
  "products": [
    {
      "sku": "XXX",
      "name": "Product Name",
      "current_stock": 10,
      "reorder_point": 15,
      "shortage": 5
    }
  ]
}
```

In [None]:
# Your code here


In [None]:
# Solution 4.2

import json

# Read product catalog
file = open('assets/datasets/products.json', 'r')
catalog = json.load(file)
file.close()

# Find products below reorder point
low_stock_products = []

for cat_name, cat_data in catalog['categories'].items():
    for product in cat_data['products']:
        stock = product['inventory']['stock']
        reorder = product['inventory']['reorder_point']
        
        if stock < reorder:
            low_stock_products.append({
                'sku': product['sku'],
                'name': product['name'],
                'current_stock': stock,
                'reorder_point': reorder,
                'shortage': reorder - stock
            })

# Create alert document
alerts = {
    'alert_type': 'Low Inventory',
    'generated': '2024-01-15',
    'total_alerts': len(low_stock_products),
    'products': sorted(low_stock_products, key=lambda x: x['shortage'], reverse=True)
}

# Write to file
output = open('assets/datasets/inventory_alerts.json', 'w')
json.dump(alerts, output, indent=2)
output.close()

# Display
print("Inventory Alerts:")
file = open('assets/datasets/inventory_alerts.json', 'r')
print(file.read())
file.close()

---
# Section 5: Context Managers (with statement)
---

## The Problem with Manual File Handling

So far, we've been manually opening and closing files. This approach has problems:

```python
file = open('data.txt', 'r')
content = file.read()
# What if an error occurs here?
file.close()  # This might never run!
```

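If an error occurs before `close()`, the file remains open, which can:
- Keep the file locked (on Windows, other programs may be unable to rename or delete it)
- Consume system resources (the number of open file handles is limited)
- Lose data when writing, because buffered output may never be flushed to disk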

## The `with` Statement

Context managers solve this problem elegantly:

```python
with open('data.txt', 'r') as file:
    content = file.read()
    # Process content
# File is automatically closed here, even if errors occur!
```
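
For file objects, the `with` statement behaves roughly like a `try`/`finally` that always calls `close()`. The sketch below spells that out by hand; the temporary directory and the file name `data.txt` are assumptions made for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("not a number\n")

    error_seen = False
    try:
        file = open(path, "r")
        try:
            value = int(file.read())  # raises ValueError for this content
        finally:
            file.close()              # runs whether or not an error occurred
    except ValueError:
        error_seen = True

    closed_after_error = file.closed

print("Error occurred:", error_seen)
print("File closed anyway:", closed_after_error)
```

Writing `with open(path, "r") as file:` gives the same guarantee in one line, which is why it is preferred.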

### Why This Matters

Using `with` is the **recommended** way to handle files in Python:
- **Automatic cleanup** - Files are always closed properly
- **Cleaner code** - No need to remember `close()`
- **Error handling** - Works correctly even when exceptions occur
- **Professional standard** - Expected in production code

In [None]:
# Example: Reading with context manager

with open('assets/datasets/meeting_notes.txt', 'r') as file:
    content = file.read()
    print(f"Read {len(content)} characters")
    print(f"File is open inside with block: {not file.closed}")

# File is automatically closed after the with block
print(f"File is closed after with block: {file.closed}")

In [None]:
# Example: Writing with context manager

data_to_write = [
    "Monthly Sales Report",
    "=" * 20,
    "",
    "Total Sales: $125,430",
    "Top Product: Laptop Pro 15",
    "Best Region: Central"
]

with open('assets/datasets/monthly_report.txt', 'w') as file:
    for line in data_to_write:
        file.write(line + '\n')

# Verify by reading back
with open('assets/datasets/monthly_report.txt', 'r') as file:
    print(file.read())

In [None]:
# Example: Error handling with context managers
# The file is still closed even when an error occurs

try:
    with open('assets/datasets/server_log.txt', 'r') as file:
        lines = file.readlines()
        # Simulate an error during processing
        result = 10 / 0  # This will raise ZeroDivisionError
except ZeroDivisionError:
    print("An error occurred during processing!")

# Even though an error occurred, the file was properly closed
print(f"File is closed: {file.closed}")

In [None]:
# Example: Working with multiple files simultaneously

with open('assets/datasets/server_log.txt', 'r') as source:
    with open('assets/datasets/errors_log.txt', 'w') as errors:
        with open('assets/datasets/warnings_log.txt', 'w') as warnings:
            for line in source:
                if 'ERROR' in line:
                    errors.write(line)
                elif 'WARNING' in line:
                    warnings.write(line)

# Verify
print("Errors file:")
with open('assets/datasets/errors_log.txt', 'r') as f:
    print(f.read())

print("Warnings file:")
with open('assets/datasets/warnings_log.txt', 'r') as f:
    print(f.read())

In [None]:
# Example: Modern Python syntax for multiple files (officially supported from Python 3.10)
# Cleaner way to handle multiple files

import csv

with (
    open('assets/datasets/sales_data.csv', 'r') as sales_file,
    open('assets/datasets/sales_analysis.txt', 'w') as report_file
):
    reader = csv.DictReader(sales_file)
    
    total_sales = 0
    count = 0
    
    for row in reader:
        total_sales += float(row['total_amount'])
        count += 1
    
    report_file.write("Sales Analysis\n")
    report_file.write("=" * 20 + "\n")
    report_file.write(f"Total Transactions: {count}\n")
    report_file.write(f"Total Revenue: ${total_sales:,.2f}\n")
    report_file.write(f"Average Transaction: ${total_sales/count:,.2f}\n")

# Verify
with open('assets/datasets/sales_analysis.txt', 'r') as f:
    print(f.read())

In [None]:
# Example: Using context managers with CSV module

import csv

# Read and process CSV properly
with open('assets/datasets/employees.csv', 'r') as file:
    reader = csv.DictReader(file)
    
    engineering_employees = []
    for row in reader:
        if row['department'] == 'Engineering':
            engineering_employees.append({
                'name': f"{row['first_name']} {row['last_name']}",
                'title': row['title'],
                'salary': int(row['salary'])
            })

print(f"Found {len(engineering_employees)} engineers")
print("\nTop 5 by salary:")
for emp in sorted(engineering_employees, key=lambda x: x['salary'], reverse=True)[:5]:
    print(f"  {emp['name']:25} {emp['title']:25} ${emp['salary']:,}")

In [None]:
# Example: Using context managers with JSON module

import json

# Read JSON properly
with open('assets/datasets/products.json', 'r') as file:
    catalog = json.load(file)

# Process and create new JSON
electronics = catalog['categories']['Electronics']['products']
top_electronics = sorted(electronics, key=lambda x: x['ratings']['average'], reverse=True)[:3]

# Write new JSON file
with open('assets/datasets/top_electronics.json', 'w') as file:
    json.dump({
        'report': 'Top Rated Electronics',
        'products': top_electronics
    }, file, indent=2)

# Verify
with open('assets/datasets/top_electronics.json', 'r') as file:
    print(file.read())

## Practice Exercise 5.1

**Task:** Rewrite this code to use context managers properly:

```python
import csv

file = open('assets/datasets/sales_data.csv', 'r')
reader = csv.DictReader(file)
regions = {}
for row in reader:
    region = row['region']
    if region not in regions:
        regions[region] = 0
    regions[region] += 1
file.close()

output = open('assets/datasets/region_counts.txt', 'w')
for region, count in regions.items():
    output.write(f"{region}: {count}\n")
output.close()
```

In [None]:
# Your code here


In [None]:
# Solution 5.1

import csv

# Read with context manager
with open('assets/datasets/sales_data.csv', 'r') as file:
    reader = csv.DictReader(file)
    regions = {}
    for row in reader:
        region = row['region']
        if region not in regions:
            regions[region] = 0
        regions[region] += 1

# Write with context manager
with open('assets/datasets/region_counts.txt', 'w') as output:
    for region, count in sorted(regions.items()):
        output.write(f"{region}: {count}\n")

# Verify
with open('assets/datasets/region_counts.txt', 'r') as f:
    print(f.read())

---
# Section 6: File Paths and OS Module
---

## Why File Paths Matter

When working with files, you need to correctly specify their locations. Different operating systems use different path formats:

- **Windows:** `C:\Users\data\sales.csv`
- **Mac/Linux:** `/home/user/data/sales.csv`

Python's `os` and `os.path` modules help you work with paths in a platform-independent way.
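
To see both separator styles regardless of the machine you are on, the standard library exposes the platform-specific implementations of `os.path` as `ntpath` (Windows) and `posixpath` (Mac/Linux). The following is a minimal sketch; the path components are the illustrative ones from the list above, not real files.

```python
import ntpath      # Windows flavour of os.path
import os
import posixpath   # Mac/Linux flavour of os.path

# Join the same kind of components with each platform's rules
windows_path = ntpath.join('C:\\Users', 'data', 'sales.csv')
posix_path = posixpath.join('/home/user', 'data', 'sales.csv')

print(f"Windows style: {windows_path}")
print(f"Mac/Linux style: {posix_path}")

# os.path picks the right flavour for the machine running the code
print(f"Separator on this machine: {os.sep!r}")
print(f"os.path.join here: {os.path.join('data', 'sales.csv')}")
```

In everyday code you would normally call `os.path` directly and let Python choose the flavour; importing `ntpath` or `posixpath` is mainly useful for demonstrations like this one.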

### Why This Matters in Data Science

Data pipelines must work across different systems. Hardcoding paths breaks when:
- Moving code to a server
- Sharing with team members
- Deploying to different environments
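
One common way to avoid hardcoded locations is to build every path from a single base directory that can change per environment. The sketch below assumes a hypothetical environment variable named `DATA_DIR` and a hypothetical helper `data_path`; neither is part of this module's code, and the fallback `assets/datasets` simply mirrors the folder used in the examples.

```python
import os

def data_path(filename, base_dir=None):
    """Build a path to a data file from a configurable base directory.

    DATA_DIR is a hypothetical environment variable used for illustration.
    """
    if base_dir is None:
        default_dir = os.path.join('assets', 'datasets')
        base_dir = os.environ.get('DATA_DIR', default_dir)
    return os.path.join(base_dir, filename)

# Same code, different environments: only the base directory changes
print(data_path('sales_data.csv'))
print(data_path('sales_data.csv', base_dir=os.path.join('srv', 'pipeline', 'data')))
```

With this approach, moving the code to a server or a teammate's machine only requires changing one setting rather than editing every path.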

In [None]:
import os

# Example: Basic os module functions

# Get current working directory
print(f"Current directory: {os.getcwd()}")

# List files in a directory
print(f"\nFiles in datasets folder:")
for item in os.listdir('assets/datasets'):
    print(f"  {item}")

In [None]:
# Example: os.path for path manipulation

import os.path

file_path = 'assets/datasets/sales_data.csv'

# Get different parts of the path
print(f"Full path: {file_path}")
print(f"Directory: {os.path.dirname(file_path)}")
print(f"Filename: {os.path.basename(file_path)}")
print(f"File extension: {os.path.splitext(file_path)[1]}")
print(f"Name without extension: {os.path.splitext(os.path.basename(file_path))[0]}")

In [None]:
# Example: Checking if files/directories exist

import os.path

paths_to_check = [
    'assets/datasets/sales_data.csv',
    'assets/datasets/nonexistent.csv',
    'assets/datasets',
    'assets/images'
]

for path in paths_to_check:
    exists = os.path.exists(path)
    is_file = os.path.isfile(path)
    is_dir = os.path.isdir(path)
    
    if exists:
        path_type = "file" if is_file else "directory"
        print(f"{path}: exists ({path_type})")
    else:
        print(f"{path}: does not exist")

In [None]:
# Example: Building paths safely with os.path.join()

import os.path

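# FRAGILE way - hardcodes '/' as the separator (Windows usually accepts it,
# but the result mixes styles with paths returned by os functions)
bad_path = 'assets' + '/' + 'datasets' + '/' + 'sales_data.csv'

# RIGHT way - uses the correct separator for the current operating system
good_path = os.path.join('assets', 'datasets', 'sales_data.csv')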

print(f"Manual path: {bad_path}")
print(f"os.path.join: {good_path}")

# Using variables
base_dir = 'assets'
data_dir = 'datasets'
filename = 'employees.csv'

full_path = os.path.join(base_dir, data_dir, filename)
print(f"\nConstructed path: {full_path}")
print(f"File exists: {os.path.exists(full_path)}")

In [None]:
# Example: Getting file information

import os
import os.path
from datetime import datetime

file_path = 'assets/datasets/sales_data.csv'

if os.path.exists(file_path):
    # Get file size
    size_bytes = os.path.getsize(file_path)
    size_kb = size_bytes / 1024
    
    # Get modification time
    mod_time = os.path.getmtime(file_path)
    mod_datetime = datetime.fromtimestamp(mod_time)
    
    print(f"File: {file_path}")
    print(f"Size: {size_bytes:,} bytes ({size_kb:.2f} KB)")
    print(f"Last modified: {mod_datetime}")
else:
    print(f"File not found: {file_path}")

In [None]:
# Example: Creating directories

import os

# Create a single directory (os.mkdir fails if the parent directory is missing)
new_dir = 'assets/datasets/reports'

if not os.path.exists(new_dir):
    os.mkdir(new_dir)
    print(f"Created directory: {new_dir}")
else:
    print(f"Directory already exists: {new_dir}")

# Create nested directories (including any missing parent directories)
nested_dir = 'assets/datasets/archive/2024/january'

if not os.path.exists(nested_dir):
    os.makedirs(nested_dir)
    print(f"Created nested directories: {nested_dir}")
else:
    print(f"Directory already exists: {nested_dir}")

In [None]:
# Example: Listing files with filtering

import os
import os.path

data_dir = 'assets/datasets'

# List only CSV files
print("CSV files:")
for filename in os.listdir(data_dir):
    if filename.endswith('.csv'):
        full_path = os.path.join(data_dir, filename)
        size = os.path.getsize(full_path)
        print(f"  {filename}: {size:,} bytes")

print("\nJSON files:")
for filename in os.listdir(data_dir):
    if filename.endswith('.json'):
        full_path = os.path.join(data_dir, filename)
        size = os.path.getsize(full_path)
        print(f"  {filename}: {size:,} bytes")

In [None]:
# Example: Walking through directories recursively

import os

print("All files in assets folder:")
print("-" * 50)

for root, dirs, files in os.walk('assets'):
    # Calculate depth relative to the starting folder for indentation
    rel_path = os.path.relpath(root, 'assets')
    level = 0 if rel_path == '.' else rel_path.count(os.sep) + 1
    indent = '  ' * level
    
    print(f"{indent}{os.path.basename(root)}/")
    
    # Print files
    subindent = '  ' * (level + 1)
    for file in files:
        print(f"{subindent}{file}")

In [None]:
# Example: Practical use case - processing all CSV files in a directory

import os
import os.path
import csv

data_dir = 'assets/datasets'

print("CSV File Analysis")
print("=" * 60)

for filename in os.listdir(data_dir):
    if not filename.endswith('.csv'):
        continue
    
    full_path = os.path.join(data_dir, filename)
    
    with open(full_path, 'r', newline='') as file:
        reader = csv.reader(file)
        header = next(reader, [])  # Empty list if the file has no header row
        row_count = sum(1 for row in reader)
    
    file_size = os.path.getsize(full_path)
    
    print(f"\n{filename}")
    print(f"  Size: {file_size:,} bytes")
    print(f"  Columns: {len(header)}")
    print(f"  Rows: {row_count}")
    print(f"  Headers: {', '.join(header[:5])}{'...' if len(header) > 5 else ''}")

## The pathlib Module (Modern Alternative)

Python 3.4 introduced `pathlib`, which provides an object-oriented approach to file paths. It is often more readable than `os.path`.
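
As a rough side-by-side, the sketch below builds the same relative path with `os.path` and with `pathlib`, then prints matching pieces from each. It only manipulates path strings, so it runs even if the file does not exist.

```python
import os.path
from pathlib import Path

# Build the same relative path with both approaches
os_style = os.path.join('assets', 'datasets', 'sales_data.csv')
path_obj = Path('assets') / 'datasets' / 'sales_data.csv'

# Each os.path function has an attribute counterpart on the Path object
print(os.path.basename(os_style), '|', path_obj.name)
print(os.path.dirname(os_style), '|', path_obj.parent)
print(os.path.splitext(os_style)[1], '|', path_obj.suffix)

# A Path converts to a plain string when another API needs one
print(str(path_obj) == os_style)
```

The `/` operator on `Path` objects plays the role of `os.path.join()`, which is a large part of why `pathlib` code tends to read more naturally.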

In [None]:
from pathlib import Path

# Example: Basic pathlib usage

# Create a Path object
data_path = Path('assets') / 'datasets' / 'sales_data.csv'

print(f"Path: {data_path}")
print(f"Exists: {data_path.exists()}")
print(f"Is file: {data_path.is_file()}")
print(f"Parent: {data_path.parent}")
print(f"Name: {data_path.name}")
print(f"Stem: {data_path.stem}")
print(f"Suffix: {data_path.suffix}")

In [None]:
from pathlib import Path

# Example: Finding files with glob patterns

data_dir = Path('assets/datasets')

# Find all CSV files
print("CSV files:")
for csv_file in data_dir.glob('*.csv'):
    print(f"  {csv_file.name}")

# Find all JSON files
print("\nJSON files:")
for json_file in data_dir.glob('*.json'):
    print(f"  {json_file.name}")

In [None]:
from pathlib import Path

# Example: Reading and writing with pathlib

# Read a file
notes_path = Path('assets/datasets/meeting_notes.txt')
content = notes_path.read_text()
print(f"Read {len(content)} characters from {notes_path.name}")

# Write a file
output_path = Path('assets/datasets/pathlib_example.txt')
output_path.write_text("This file was created using pathlib!\n")
print(f"Wrote to {output_path.name}")

# Verify
print(f"Content: {output_path.read_text()}")

## Practice Exercise 6.1

**Task:** Write a function called `summarize_directory` that:
1. Takes a directory path as input
2. Counts files by extension
3. Calculates total size by extension
4. Returns a dictionary with the summary

Test it on the `assets/datasets` directory.
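
Before grouping by extension, it may help to see how `os.path.splitext` treats a few edge cases. The filenames below are made up for illustration and are not part of the course datasets.

```python
import os.path

# (root, extension) pairs returned by os.path.splitext
print(os.path.splitext('sales_data.csv'))   # ('sales_data', '.csv')
print(os.path.splitext('archive.tar.gz'))   # ('archive.tar', '.gz')
print(os.path.splitext('README'))           # ('README', '')
print(os.path.splitext('.gitignore'))       # ('.gitignore', '')
```

Only the last extension is split off, and files with no extension (or only a leading dot) return an empty string, so your function should decide how to label that case.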

In [None]:
# Your code here


In [None]:
# Solution 6.1

import os
import os.path

def summarize_directory(dir_path):
    """Summarize files in a directory by extension."""
    summary = {}
    
    for filename in os.listdir(dir_path):
        full_path = os.path.join(dir_path, filename)
        
        # Skip directories
        if not os.path.isfile(full_path):
            continue
        
        # Get extension
        ext = os.path.splitext(filename)[1]
        if ext == '':
            ext = '(no extension)'
        
        # Get file size
        size = os.path.getsize(full_path)
        
        # Add to summary
        if ext not in summary:
            summary[ext] = {'count': 0, 'total_size': 0}
        
        summary[ext]['count'] += 1
        summary[ext]['total_size'] += size
    
    return summary

# Test the function
result = summarize_directory('assets/datasets')

print("Directory Summary: assets/datasets")
print("=" * 50)
print(f"{'Extension':<15} {'Count':>8} {'Total Size':>15}")
print("-" * 50)

for ext, stats in sorted(result.items()):
    size_str = f"{stats['total_size']:,} bytes"
    print(f"{ext:<15} {stats['count']:>8} {size_str:>15}")

## Practice Exercise 6.2

**Task:** Create a data processing pipeline that:
1. Reads all CSV files in `assets/datasets/`
2. Creates a summary JSON file in `assets/datasets/reports/data_inventory.json`
3. The JSON should contain info about each CSV: filename, columns, row_count, file_size

Use `os.makedirs` to create the reports directory if it doesn't exist.
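
As a hint, `os.makedirs` accepts an `exist_ok=True` argument, which removes the need for a separate existence check. The sketch below runs inside a temporary directory (an assumption made so the example leaves no files behind); in the exercise you would point it at the real reports folder instead.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    reports_dir = os.path.join(tmp_dir, 'datasets', 'reports')

    # Creates missing parent folders too; exist_ok=True makes a repeat call harmless
    os.makedirs(reports_dir, exist_ok=True)
    os.makedirs(reports_dir, exist_ok=True)

    created = os.path.isdir(reports_dir)
    print(f"Reports directory ready: {created}")
```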

In [None]:
# Your code here


In [None]:
# Solution 6.2

import os
import os.path
import csv
import json
from datetime import datetime

# Define paths
data_dir = 'assets/datasets'
reports_dir = os.path.join(data_dir, 'reports')
output_file = os.path.join(reports_dir, 'data_inventory.json')

# Create reports directory if needed
if not os.path.exists(reports_dir):
    os.makedirs(reports_dir)
    print(f"Created directory: {reports_dir}")

# Process all CSV files
inventory = {
    'generated': datetime.now().isoformat(),
    'source_directory': data_dir,
    'files': []
}

for filename in sorted(os.listdir(data_dir)):
    if not filename.endswith('.csv'):
        continue
    
    full_path = os.path.join(data_dir, filename)
    
    # Get file info
    file_size = os.path.getsize(full_path)
    
    # Read CSV to get column and row info
    with open(full_path, 'r', newline='') as f:
        reader = csv.reader(f)
        columns = next(reader, [])  # Empty list if the file has no header row
        row_count = sum(1 for row in reader)
    
    # Add to inventory
    inventory['files'].append({
        'filename': filename,
        'columns': columns,
        'column_count': len(columns),
        'row_count': row_count,
        'file_size_bytes': file_size
    })

# Add summary statistics
inventory['summary'] = {
    'total_files': len(inventory['files']),
    'total_rows': sum(f['row_count'] for f in inventory['files']),
    'total_size_bytes': sum(f['file_size_bytes'] for f in inventory['files'])
}

# Write to JSON
with open(output_file, 'w') as f:
    json.dump(inventory, f, indent=2)

print(f"Created: {output_file}")
print(f"\nInventory Summary:")
print(f"  Files: {inventory['summary']['total_files']}")
print(f"  Total Rows: {inventory['summary']['total_rows']:,}")
print(f"  Total Size: {inventory['summary']['total_size_bytes']:,} bytes")

---
# Module Summary

## Key Takeaways

1. **Text Files**: Use `open()` with modes 'r', 'w', 'a' for reading, writing, and appending
2. **CSV Files**: The `csv` module handles comma-separated data properly, including quoted fields
3. **JSON Files**: The `json` module converts between Python objects and JSON format
4. **Context Managers**: Always use `with` statements so files are closed automatically, even when an error occurs
5. **File Paths**: Use `os.path.join()` or `pathlib.Path` for cross-platform compatibility
6. **Directory Operations**: Use `os.listdir()`, `os.makedirs()`, and `os.walk()` for file system navigation

## Best Practices

- **Always use context managers** (`with` statement) for file operations
- **Handle missing values** when reading data files
- **Use `os.path.join()`** to build file paths instead of hardcoded slashes
- **Check if files exist** before attempting to read them
- **Use appropriate modes** - 'w' overwrites, 'a' appends
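
Several of these practices can be combined in one small function. The following is a minimal sketch, assuming a hypothetical helper named `load_salaries` and a made-up demo file with a `salary` column; it is one possible way to handle missing files and missing values, not the only one.

```python
import csv
import os
import tempfile

def load_salaries(csv_path):
    """Return salaries as ints, skipping rows where the value is missing."""
    if not os.path.exists(csv_path):
        print(f"File not found: {csv_path}")
        return []

    salaries = []
    with open(csv_path, 'r', newline='') as f:
        for row in csv.DictReader(f):
            value = (row.get('salary') or '').strip()
            if value == '':
                continue  # Missing value: skip instead of crashing on int('')
            salaries.append(int(value))
    return salaries

# Demo on a small temporary file (made-up rows, not the course dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', newline='') as f:
        f.write("name,salary\nAna,50000\nBen,\nCy,62000\n")

    found = load_salaries(demo_path)
    missing = load_salaries(os.path.join(tmp_dir, 'nope.csv'))

print(found)
print(missing)
```

Skipping incomplete rows is only one policy; depending on the analysis, you might instead substitute a default value or record the bad rows in a separate report.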

## Common Patterns

```python
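import csv
import json
import os

# Placeholders: process, process_csv, data, and directory stand in for your own code

# Reading CSV
with open('data.csv', 'r', newline='') as f: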
    reader = csv.DictReader(f)
    for row in reader:
        process(row)

# Writing JSON
with open('output.json', 'w') as f:
    json.dump(data, f, indent=2)

# Processing files in directory
for filename in os.listdir(directory):
    full_path = os.path.join(directory, filename)
    if filename.endswith('.csv'):
        process_csv(full_path)
```

## Next Module

In the next module, we'll cover **NumPy Fundamentals** - the foundation for numerical computing in Python. You'll learn to work with arrays, perform mathematical operations, and prepare for data analysis with pandas.

## Additional Practice

For extra practice, try these challenges:

1. **Log Analyzer**: Read a log file, parse timestamps, and create an hourly activity report as JSON
2. **CSV Merger**: Write a function that combines multiple CSV files with the same structure into one
3. **Data Validator**: Create a script that reads a CSV, checks for missing values and data type issues, and writes a validation report
4. **Backup Tool**: Write a function that copies all files of a specific type from one directory to another, adding timestamps to filenames