# Python Data Structures

**Course:** MLM-101 - Machine Learning Mastery  
**Phase 2:** Python Programming (Lectures 13-15)  
**Topics:** Lists, Tuples, Sets, Dictionaries

---

## 📚 Learning Objectives

By the end of this notebook, you will be able to:

✅ Create and manipulate lists (mutable sequences)  
✅ Use tuples for immutable data  
✅ Work with sets for unique elements  
✅ Use dictionaries for key-value pairs  
✅ Choose the right data structure for ML tasks

---

## 1️⃣ Lists - Ordered, Mutable Collections

Lists store multiple items in a fixed order, may mix types, and can be modified after creation.

In [None]:
# Creating lists
features = ["age", "income", "credit_score", "loan_amount"]
accuracies = [0.92, 0.88, 0.95, 0.91]
mixed = ["Model", 100, 0.95, True]

print("Features:", features)
print("Accuracies:", accuracies)
print("Mixed types:", mixed)

In [None]:
# Accessing elements (0-indexed)
models = ["Linear Regression", "Decision Tree", "Random Forest", "SVM", "Neural Network"]

print("First model:", models[0])
print("Third model:", models[2])
print("Last model:", models[-1])
print("Second to last:", models[-2])

In [None]:
# Slicing lists
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

print("Full list:", data)
print("First 3:", data[:3])           # [10, 20, 30]
print("Last 3:", data[-3:])           # [70, 80, 90]
print("Middle (index 2-5):", data[2:6])  # [30, 40, 50, 60]
print("Every 2nd element:", data[::2])    # [10, 30, 50, 70, 90]
print("Reversed:", data[::-1])        # [90, 80, 70, ...]
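
One detail the slicing cell above does not show is that a slice is a new list, while plain assignment only creates a second name for the same list. A minimal sketch, using a shortened hypothetical `data` list:

```python
data = [10, 20, 30, 40, 50]

# A slice builds a new list, so editing it leaves the original untouched
head = data[:3]
head[0] = -1
print("Slice after edit:", head)
print("Original after slice edit:", data)

# Plain assignment does NOT copy: both names refer to the same list
alias = data
alias[0] = 99
print("Original after alias edit:", data)
```

This is why the train/test split later in this notebook can slice one list into two without the halves interfering with each other.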

In [None]:
# Modifying lists
epochs = [1, 2, 3, 4, 5]
print("Original:", epochs)

# Append (add to end)
epochs.append(6)
print("After append(6):", epochs)

# Insert at position
epochs.insert(0, 0)  # Insert 0 at index 0
print("After insert(0, 0):", epochs)

# Remove by value
epochs.remove(3)
print("After remove(3):", epochs)

# Pop (remove and return last or at index)
last = epochs.pop()
print(f"Popped: {last}, Remaining:", epochs)

# Extend (add multiple items)
epochs.extend([10, 11, 12])
print("After extend([10, 11, 12]):", epochs)
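
The cell above pops only the last item. As a small sketch of the other removal options, the example below uses a hypothetical `checkpoints` list (not part of the original notebook) to show `pop()` with an index, `del`, and a guarded `remove()`:

```python
# Hypothetical list for illustration
checkpoints = [10, 20, 30, 40, 50]

# pop(i) removes and returns the item at index i
first = checkpoints.pop(0)
print(f"Popped index 0: {first}, remaining: {checkpoints}")

# del removes by index without returning anything
del checkpoints[1]
print("After del checkpoints[1]:", checkpoints)

# remove() raises ValueError when the value is absent, so check first
value = 99
if value in checkpoints:
    checkpoints.remove(value)
else:
    print(f"{value} is not in the list, nothing removed")
```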

In [None]:
# List operations
losses = [2.5, 1.8, 1.2, 0.9, 0.7]

print("Losses:", losses)
print("Length:", len(losses))
print("Min:", min(losses))
print("Max:", max(losses))
print("Sum:", sum(losses))
print("Average:", sum(losses) / len(losses))
print("Sorted:", sorted(losses))
print("Sorted (descending):", sorted(losses, reverse=True))

### 🎯 ML Example: Training/Test Split

In [None]:
# Simple train/test split
data = list(range(1, 101))  # 100 samples
train_size = int(0.8 * len(data))

train_data = data[:train_size]
test_data = data[train_size:]

print(f"Total samples: {len(data)}")
print(f"Training: {len(train_data)} samples (first: {train_data[0]}, last: {train_data[-1]})")
print(f"Testing: {len(test_data)} samples (first: {test_data[0]}, last: {test_data[-1]})")
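
The split above keeps the samples in their original order, so the test set is always the last 20%. If the data happens to be sorted, shuffling first is a common precaution. A minimal sketch using only the standard library; the seed value 42 and the variable names are assumptions made for illustration:

```python
import random

samples = list(range(1, 101))  # same 100 samples as above

# A seeded generator keeps the shuffle reproducible; 42 is an arbitrary choice
rng = random.Random(42)
shuffled = samples[:]          # copy so the original order is preserved
rng.shuffle(shuffled)

split_index = int(0.8 * len(shuffled))
train_shuffled = shuffled[:split_index]
test_shuffled = shuffled[split_index:]

print(f"Training: {len(train_shuffled)} samples, Testing: {len(test_shuffled)} samples")
print("Overlap between splits:", set(train_shuffled) & set(test_shuffled))
```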

---

## 2️⃣ Tuples - Ordered, Immutable Collections

Tuples are ordered sequences like lists, but they are immutable: once created, their elements cannot be reassigned, added, or removed.

In [None]:
# Creating tuples
model_info = ("Random Forest", 0.92, 100)  # (name, accuracy, n_estimators)
coordinates = (45.5, -122.6)  # (latitude, longitude)

print("Model info:", model_info)
print("Coordinates:", coordinates)

# Accessing elements
print(f"\nModel: {model_info[0]}")
print(f"Accuracy: {model_info[1]}")
print(f"N estimators: {model_info[2]}")

In [None]:
# Tuple unpacking
model_name, accuracy, n_trees = model_info

print(f"Name: {model_name}")
print(f"Accuracy: {accuracy}")
print(f"Trees: {n_trees}")

In [None]:
# Multiple return values (using tuples)
def train_model():
    accuracy = 0.92
    loss = 0.15
    epochs = 50
    return accuracy, loss, epochs

# Unpack returned tuple
acc, loss, ep = train_model()
print(f"Training completed: Accuracy={acc}, Loss={loss}, Epochs={ep}")

### 💡 Why Use Tuples?

- **Slightly faster and smaller** than lists, because their fixed size needs no spare room to grow
- **Safer** for data that shouldn't change
- Can be used as dictionary keys (lists cannot)
- Good for returning multiple values from functions
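
Two of the points above can be checked directly. The sketch below uses a hypothetical `results` dictionary keyed by `(learning_rate, batch_size)` tuples, then shows the errors Python raises when a tuple is modified or a list is used as a key:

```python
# Hypothetical grid-search results keyed by (learning_rate, batch_size)
results = {
    (0.01, 32): 0.91,
    (0.001, 64): 0.94,
}
print("Accuracy for (0.01, 32):", results[(0.01, 32)])

# Tuples reject item assignment
point = (45.5, -122.6)
try:
    point[0] = 0.0
except TypeError as err:
    print("Cannot modify tuple:", err)

# Lists are unhashable, so they cannot be dictionary keys
try:
    results[[0.1, 16]] = 0.80
except TypeError as err:
    print("Cannot use list as key:", err)
```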

---

## 3️⃣ Sets - Unordered, Unique Collections

Sets store unique elements: duplicates are dropped automatically, and because sets are unordered they cannot be indexed or sliced.

In [None]:
# Creating sets
categories = {"cat", "dog", "bird", "cat", "dog"}  # Duplicates removed
print("Categories:", categories)

# Convert list to set (removes duplicates)
labels = [1, 2, 2, 3, 1, 4, 3, 2, 1]
unique_labels = set(labels)
print(f"\nOriginal labels: {labels}")
print(f"Unique labels: {unique_labels}")
print(f"Count: {len(labels)} → {len(unique_labels)}")

In [None]:
# Set operations
train_classes = {0, 1, 2, 3, 4}
test_classes = {2, 3, 4, 5, 6}

print("Train classes:", train_classes)
print("Test classes:", test_classes)

# Union (all unique elements)
all_classes = train_classes | test_classes
print(f"\nUnion (|): {all_classes}")

# Intersection (common elements)
common = train_classes & test_classes
print(f"Intersection (&): {common}")

# Difference (in train but not in test)
only_train = train_classes - test_classes
print(f"Difference (-): {only_train}")

# Symmetric difference (in either but not both)
exclusive = train_classes ^ test_classes
print(f"Symmetric Diff (^): {exclusive}")
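
The operators above all build new sets. Sets can also be changed in place and tested for membership, which the cells so far do not show. A brief sketch with a hypothetical `seen_classes` set:

```python
seen_classes = {0, 1, 2}

seen_classes.add(3)       # add a new element
seen_classes.add(1)       # adding an existing element changes nothing
seen_classes.discard(0)   # remove the element if it is present
seen_classes.discard(99)  # discard() stays silent when the element is missing

print("Seen classes:", sorted(seen_classes))
print("Is class 2 present?", 2 in seen_classes)
print("Subset of {1, 2, 3, 4}?", seen_classes <= {1, 2, 3, 4})
```

`remove()` is the stricter alternative to `discard()`: it raises `KeyError` when the element is missing.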

### 🎯 ML Example: Compare Classes Across Datasets

In [None]:
# Dataset 1 and Dataset 2 labels
dataset1_labels = [0, 1, 2, 0, 1, 2, 0, 1]
dataset2_labels = [1, 2, 3, 1, 2, 3, 1, 2]

classes_d1 = set(dataset1_labels)
classes_d2 = set(dataset2_labels)

print(f"Dataset 1 classes: {sorted(classes_d1)}")
print(f"Dataset 2 classes: {sorted(classes_d2)}")
print(f"\nCommon classes: {sorted(classes_d1 & classes_d2)}")
print(f"Unique to Dataset 1: {sorted(classes_d1 - classes_d2)}")
print(f"Unique to Dataset 2: {sorted(classes_d2 - classes_d1)}")

---

## 4️⃣ Dictionaries - Key-Value Pairs

Dictionaries store data as key-value pairs: each unique key maps to one value.

In [None]:
# Creating dictionaries
model_config = {
    "name": "Random Forest",
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42,
    "accuracy": 0.92
}

print("Model Configuration:")
print(model_config)

In [None]:
# Accessing values
print(f"Model name: {model_config['name']}")
print(f"Trees: {model_config['n_estimators']}")
print(f"Accuracy: {model_config['accuracy']}")

# Safe access with .get(): returns the given default (or None if no default is passed) when the key doesn't exist
print(f"\nLearning rate: {model_config.get('learning_rate', 'Not specified')}")

In [None]:
# Modifying dictionaries
model_config["max_depth"] = 15  # Update existing
model_config["min_samples_split"] = 2  # Add new key

print("Updated config:")
for key, value in model_config.items():
    print(f"  {key}: {value}")
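
Besides updating and adding keys, dictionaries support removing entries and merging in several changes at once. A minimal sketch on a hypothetical `config` dictionary (a cut-down stand-in for `model_config`; the `n_jobs` key is an assumption made for illustration):

```python
# Hypothetical config for illustration
config = {"name": "Random Forest", "max_depth": 15, "accuracy": 0.92}

# pop() removes a key and returns its value; the second argument is a
# fallback that avoids KeyError when the key is missing
accuracy = config.pop("accuracy")
missing = config.pop("learning_rate", None)

# update() merges another mapping, overwriting keys that already exist
config.update({"max_depth": 20, "n_jobs": -1})

print("Popped accuracy:", accuracy)
print("Popped missing key:", missing)
print("Config now:", config)
```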

In [None]:
# Dictionary methods
hyperparameters = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50,
    "dropout": 0.2
}

print("Keys:", list(hyperparameters.keys()))
print("Values:", list(hyperparameters.values()))
print("Items:", list(hyperparameters.items()))

# Check if key exists
if "learning_rate" in hyperparameters:
    print(f"\nLR found: {hyperparameters['learning_rate']}")

### 🎯 ML Example: Model Comparison

In [None]:
# Store performance metrics for different models
model_metrics = {
    "Logistic Regression": {"accuracy": 0.85, "training_time": 2.3},
    "Decision Tree": {"accuracy": 0.88, "training_time": 1.5},
    "Random Forest": {"accuracy": 0.92, "training_time": 12.8},
    "Neural Network": {"accuracy": 0.94, "training_time": 45.2}
}

print("Model Performance Comparison:\n")
for model_name, metrics in model_metrics.items():
    print(f"{model_name}:")
    print(f"  Accuracy: {metrics['accuracy']:.1%}")
    print(f"  Training Time: {metrics['training_time']:.1f}s")
    print()

# Find best accuracy
best_model = max(model_metrics.items(), key=lambda x: x[1]['accuracy'])
print(f"🏆 Best Model: {best_model[0]} ({best_model[1]['accuracy']:.1%})")
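
The same `key=` idea extends from picking one winner to ranking every model. The sketch below reuses three of the entries above in a hypothetical `metrics` dictionary, sorts them by accuracy, and finds the fastest model with `min()`:

```python
# Same shape as model_metrics above, with three of its entries
metrics = {
    "Logistic Regression": {"accuracy": 0.85, "training_time": 2.3},
    "Decision Tree": {"accuracy": 0.88, "training_time": 1.5},
    "Random Forest": {"accuracy": 0.92, "training_time": 12.8},
}

# sorted() returns a list of (name, metrics) pairs, highest accuracy first
ranked = sorted(metrics.items(), key=lambda item: item[1]["accuracy"], reverse=True)
for rank, (name, m) in enumerate(ranked, 1):
    print(f"{rank}. {name}: {m['accuracy']:.1%} in {m['training_time']:.1f}s")

# Iterating a dict yields its keys, so min() returns a model name here
fastest = min(metrics, key=lambda name: metrics[name]["training_time"])
print("Fastest to train:", fastest)
```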

### 💡 Nested Dictionaries

In [None]:
# ML pipeline configuration
pipeline_config = {
    "preprocessing": {
        "scaler": "StandardScaler",
        "handle_missing": "mean",
        "encode_categorical": True
    },
    "model": {
        "type": "RandomForest",
        "n_estimators": 100,
        "max_depth": 10
    },
    "training": {
        "test_size": 0.2,
        "cv_folds": 5,
        "random_state": 42
    }
}

print("Pipeline Configuration:\n")
for stage, params in pipeline_config.items():
    print(f"{stage.upper()}:")
    for param, value in params.items():
        print(f"  {param}: {value}")
    print()
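
Reading and updating a single nested value works by chaining keys. The sketch below uses a trimmed-down, hypothetical `cfg` dictionary with the same shape as `pipeline_config`; the `"none"` fallback is an assumption made for illustration:

```python
# Trimmed-down stand-in for pipeline_config (the "preprocessing" stage is omitted on purpose)
cfg = {
    "model": {"type": "RandomForest", "n_estimators": 100},
    "training": {"test_size": 0.2},
}

# Chained indexing reads a nested value; it raises KeyError if any key is missing
n_estimators = cfg["model"]["n_estimators"]

# Chained .get() with an empty-dict fallback tolerates a missing stage
scaler = cfg.get("preprocessing", {}).get("scaler", "none")

# Assigning through the outer key updates the inner dictionary in place
cfg["training"]["cv_folds"] = 5

print("n_estimators:", n_estimators)
print("scaler:", scaler)
print("training:", cfg["training"])
```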

---

## 5️⃣ Choosing the Right Data Structure

| Structure | Ordered | Mutable | Duplicates | Use Case |
|-----------|---------|---------|------------|----------|
| **List** | ✅ | ✅ | ✅ | General sequences, features, predictions |
| **Tuple** | ✅ | ❌ | ✅ | Fixed data, function returns, coordinates |
| **Set** | ❌ | ✅ | ❌ | Unique items, class labels, fast lookups |
| **Dict** | ✅ (3.7+) | ✅ | Keys: ❌ | Configurations, metrics, mappings |
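
The rows of the table can be verified with a short experiment. This is only an illustrative sketch: it loads the same hypothetical `items` into each structure and compares how duplicates and mutation are handled.

```python
items = ["cat", "dog", "cat"]

as_list = list(items)
as_tuple = tuple(items)
as_set = set(items)
as_dict = dict.fromkeys(items)  # keys come from items, every value is None

# Duplicates: lists and tuples keep them, sets and dict keys do not
print("Sizes:", len(as_list), len(as_tuple), len(as_set), len(as_dict))

# Dict keys keep insertion order (Python 3.7+)
print("Dict keys:", list(as_dict))

# Mutability: list, set and dict accept new items, a tuple does not
as_list.append("bird")
as_set.add("bird")
as_dict["bird"] = None
try:
    as_tuple.append("bird")
except AttributeError as err:
    print("Tuple cannot grow:", err)
```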

---

## 🎯 Practice Exercises

### Exercise 1: Feature Engineering

In [None]:
# Given features, create new features
ages = [25, 30, 35, 40, 45]

# Create a list of age groups: "young" (<30), "middle" (30-40), "senior" (>40)
age_groups = ["young" if age < 30 else "middle" if age <= 40 else "senior" for age in ages]

print("Ages:", ages)
print("Groups:", age_groups)

### Exercise 2: Class Distribution

In [None]:
# Count occurrences of each class
predictions = [0, 1, 1, 0, 1, 2, 0, 1, 2, 1, 0, 2]

# Use a dictionary to count
class_counts = {}
for pred in predictions:
    class_counts[pred] = class_counts.get(pred, 0) + 1

print("Class Distribution:")
for class_label, count in sorted(class_counts.items()):
    print(f"  Class {class_label}: {count} samples")
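
The manual counting loop above is worth knowing, but the standard library offers a shortcut. As an alternative sketch, `collections.Counter` produces the same counts from the same `predictions` list:

```python
from collections import Counter

predictions = [0, 1, 1, 0, 1, 2, 0, 1, 2, 1, 0, 2]

# Counter is a dict subclass that maps each item to its number of occurrences
counts = Counter(predictions)

print("Class Distribution:")
for class_label, count in sorted(counts.items()):
    print(f"  Class {class_label}: {count} samples")

# most_common(1) returns a list holding the single most frequent (item, count) pair
top_label, top_count = counts.most_common(1)[0]
print(f"Most frequent class: {top_label} ({top_count} samples)")
```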

### Exercise 3: Hyperparameter Combinations

In [None]:
# Generate all combinations of hyperparameters
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32]

# Create list of tuples with all combinations
combinations = [(lr, bs) for lr in learning_rates for bs in batch_sizes]

print("Hyperparameter Combinations:")
for i, (lr, bs) in enumerate(combinations, 1):
    print(f"{i}. LR={lr}, Batch={bs}")

print(f"\nTotal combinations: {len(combinations)}")
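
The nested comprehension above works well for two hyperparameters but gains one `for` clause per extra parameter. One alternative is `itertools.product`, sketched below; the `dropouts` list is a hypothetical third hyperparameter added only for illustration:

```python
from itertools import product

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32]
dropouts = [0.2, 0.5]  # hypothetical third hyperparameter

# product() yields one tuple per combination, varying the last argument fastest
grid = list(product(learning_rates, batch_sizes, dropouts))

print("First combination:", grid[0])
print("Total combinations:", len(grid))

# With two inputs it matches the nested comprehension exactly
pairs = [(lr, bs) for lr in learning_rates for bs in batch_sizes]
print("Same as comprehension:", list(product(learning_rates, batch_sizes)) == pairs)
```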

---

## 🎓 Summary

In this notebook, you learned:

✅ **Lists**: Ordered, mutable collections - perfect for features, predictions  
✅ **Tuples**: Ordered, immutable collections - ideal for fixed data  
✅ **Sets**: Unordered, unique collections - great for unique labels  
✅ **Dictionaries**: Key-value mappings - excellent for configurations, metrics  
✅ **Data Structure Selection**: Choose based on mutability, order, uniqueness needs

### 🚀 Next Steps

Continue to:
- **`python_functions_oop.ipynb`** - Functions and object-oriented programming
- **`numpy_arrays_basics.ipynb`** - NumPy for numerical computing

---

**Website:** [https://flowdiary.com.ng/course/MLM-101](https://flowdiary.com.ng/course/MLM-101)