# Analogy-Based Estimation Demo

**Disclaimer:** This notebook uses *fictional, AI-generated data* purely for demonstration purposes. It is not based on real projects and should only be used for educational exercises within this course.

## Step 1: Sample Dataset

We will create a toy dataset of past projects with basic attributes:
- **Size (story points)**
- **Complexity (1–5 scale)**
- **Team Experience (years)**
- **Duration (weeks)** (actual outcome)

We'll then try to estimate a new project's duration by comparing it with similar past projects.

In [None]:
import pandas as pd
import numpy as np

# Fictional dataset
data = [
    {"Project": "Alpha",   "Size": 50,  "Complexity": 3, "Experience": 5, "Duration": 12},
    {"Project": "Beta",    "Size": 80,  "Complexity": 4, "Experience": 4, "Duration": 18},
    {"Project": "Gamma",   "Size": 40,  "Complexity": 2, "Experience": 6, "Duration": 10},
    {"Project": "Delta",   "Size": 100, "Complexity": 5, "Experience": 3, "Duration": 22},
    {"Project": "Epsilon", "Size": 70,  "Complexity": 3, "Experience": 5, "Duration": 15}
]

df = pd.DataFrame(data)
df

## Step 2: Define Similarity Function

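We'll use a simple distance metric (Euclidean distance) based on **Size**, **Complexity**, and **Experience**. Because these attributes are on very different scales, each one is standardized (zero mean, unit variance) first, so that Size does not dominate the distance.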

The closest past project(s) will guide our estimate.
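
To make the metric concrete, here is a minimal NumPy-only sketch of the same calculation done by hand: standardize each attribute, then compute $d = \sqrt{\sum_i (z_i - z'_i)^2}$ between a candidate and each past project. The names `past`, `durations`, `candidate`, and `manual_distances` are illustrative, and the candidate `[60, 3, 4]` is assumed for the example; the notebook itself relies on scikit-learn for this step.

```python
import numpy as np

# Same fictional projects as in Step 1: [Size, Complexity, Experience]
past = np.array([
    [50, 3, 5],   # Alpha
    [80, 4, 4],   # Beta
    [40, 2, 6],   # Gamma
    [100, 5, 3],  # Delta
    [70, 3, 5],   # Epsilon
], dtype=float)
durations = np.array([12, 18, 10, 22, 15], dtype=float)
candidate = np.array([60, 3, 4], dtype=float)  # assumed new project

# Standardize each column: subtract the mean, divide by the population
# standard deviation (np.std defaults to ddof=0, as StandardScaler does).
col_mean = past.mean(axis=0)
col_std = past.std(axis=0)
past_z = (past - col_mean) / col_std
candidate_z = (candidate - col_mean) / col_std

# Euclidean distance from the candidate to every past project.
manual_distances = np.sqrt(((past_z - candidate_z) ** 2).sum(axis=1))

# Average the durations of the two nearest projects.
two_nearest = np.argsort(manual_distances, kind="stable")[:2]
manual_estimate = durations[two_nearest].mean()
print(manual_distances.round(3), manual_estimate)
```

In this toy dataset Alpha and Epsilon sit at essentially the same distance from the candidate, so with `k=1` the chosen analogy can hinge on floating-point rounding or row order.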

In [None]:
# Install scikit-learn into the notebook kernel's environment (run once)
%pip install scikit-learn

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

# Features used for similarity
features = ["Size", "Complexity", "Experience"]
scaler = StandardScaler()
# Use .values to get numpy array instead of DataFrame with feature names
X = scaler.fit_transform(df[features].values)

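def estimate_new_project(new_project, k=1):
    """Estimate duration as the mean of the k most similar past projects.

    new_project is [Size, Complexity, Experience]; returns (estimate, nearest rows).
    """
    # Reshape the single project to a 1 x 3 array, since the scaler expects 2-D input
    new_array = np.array(new_project).reshape(1, -1)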
    new_scaled = scaler.transform(new_array)
    distances = euclidean_distances(new_scaled, X)[0]
    df_copy = df.copy()  # Work with a copy to avoid modifying original DataFrame
    df_copy["Distance"] = distances
    nearest = df_copy.nsmallest(k, "Distance")
    estimate = nearest["Duration"].mean()
    return estimate, nearest[["Project", "Duration", "Distance"]]

## Step 3: Example Estimation

Suppose we have a **new project**:
- Size: 60 story points
- Complexity: 3
- Team Experience: 4 years

We want to estimate its duration.

In [None]:
new_project = [60, 3, 4]
estimate, nearest = estimate_new_project(new_project, k=2)

print("Estimated Duration (weeks):", round(estimate, 1))
nearest

## Step 4: Reflection

- The estimate is the mean duration of the *two closest past projects*, measured by standardized Euclidean distance over the three attributes.
- This mirrors real-world practice: find projects that are similar in scope, team, and complexity.
- **Limitations:**
  - Requires a good historical dataset.
  - Sensitive to which analogies are selected, which in turn depends on the chosen attributes, their scaling, and the value of `k`.
  - Works best when projects are truly comparable.
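
One way to probe these limitations is a leave-one-out check: hide each past project in turn, estimate it from the remaining ones, and compare with its actual duration. The sketch below is NumPy-only and makes two assumptions: the helper `knn_estimate` is a hypothetical stand-in for `estimate_new_project`, and the scaling is refit on the four remaining projects in each round. With only five fictional projects the resulting error is illustrative, not a real accuracy figure.

```python
import numpy as np

# Same fictional projects as in Step 1: [Size, Complexity, Experience]
loo_features = np.array([
    [50, 3, 5], [80, 4, 4], [40, 2, 6], [100, 5, 3], [70, 3, 5],
], dtype=float)
loo_durations = np.array([12, 18, 10, 22, 15], dtype=float)

def knn_estimate(train_x, train_y, query, k):
    """Mean duration of the k nearest training projects (standardized Euclidean)."""
    mean, std = train_x.mean(axis=0), train_x.std(axis=0)
    std[std == 0] = 1.0  # guard against a constant column
    dists = np.linalg.norm((train_x - mean) / std - (query - mean) / std, axis=1)
    return train_y[np.argsort(dists, kind="stable")[:k]].mean()

abs_errors = []
for i in range(len(loo_durations)):
    keep = np.arange(len(loo_durations)) != i  # hold out project i
    predicted = knn_estimate(loo_features[keep], loo_durations[keep],
                             loo_features[i], k=2)
    abs_errors.append(abs(predicted - loo_durations[i]))

print("Mean absolute error (weeks):", round(float(np.mean(abs_errors)), 1))
```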

## Step 5: Try Your Own Estimations

Now it's your turn! Try estimating different project scenarios and see how the results change.

**Exercise Ideas:**
1. **Different project sizes**: Try a very small project (20 story points) vs a very large one (150 story points)
2. **Vary complexity**: How does changing complexity from 1 to 5 affect the estimate?
3. **Team experience impact**: Compare estimates for inexperienced (1-2 years) vs experienced (7+ years) teams
4. **Change k value**: Try using k=1 (closest match) vs k=3 (average of 3 closest) vs k=5 (all projects, which always returns the plain average of every past duration)

**Example scenarios to try:**
- Small, simple project with experienced team: [25, 1, 7]
- Large, complex project with new team: [120, 5, 2]
- Medium project with average specs: [75, 3, 4]

In [None]:
# Try your own project estimations here!
# Format: [Size, Complexity, Experience]

# Example 1: Small, simple project with experienced team
test_project_1 = [25, 1, 7]
estimate_1, nearest_1 = estimate_new_project(test_project_1, k=2)
print("Project 1 - Small & Simple with Experienced Team:")
print(f"Estimated Duration: {round(estimate_1, 1)} weeks")
print("Closest matches:")
print(nearest_1)
print()

# Example 2: Large, complex project with new team
test_project_2 = [120, 5, 2]
estimate_2, nearest_2 = estimate_new_project(test_project_2, k=2)
print("Project 2 - Large & Complex with New Team:")
print(f"Estimated Duration: {round(estimate_2, 1)} weeks")
print("Closest matches:")
print(nearest_2)
print()

# Your turn! Modify the values below:
your_project = [75, 3, 4]  # Change these values!
your_estimate, your_nearest = estimate_new_project(your_project, k=2)
print("Your Project Estimate:")
print(f"Size: {your_project[0]}, Complexity: {your_project[1]}, Experience: {your_project[2]} years")
print(f"Estimated Duration: {round(your_estimate, 1)} weeks")
print("Closest matches:")
print(your_nearest)

## Step 6: Advanced Exploration (Optional)

For those wanting to dig deeper, here are some advanced exercises:

### A. Sensitivity Analysis
Try the same project with different `k` values to see how the number of analogies affects the estimate:
```python
# Compare k=1 through k=5 for the same project
project = [60, 3, 4]
for k_val in [1, 2, 3, 4, 5]:
    est, _ = estimate_new_project(project, k=k_val)
    print(f"k={k_val}: {round(est, 1)} weeks")
```

### B. Add Your Own Historical Data
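What if you had more projects in your dataset? Try adding a few more fictional projects and see how they affect the estimates. Because `estimate_new_project` reads the global `df`, `scaler`, and `X`, the DataFrame must be extended and the scaler refit:
```python
# Example: add more fictional projects, then refit the scaler on the larger dataset
new_projects = [
    {"Project": "Zeta", "Size": 45, "Complexity": 2, "Experience": 4, "Duration": 11},
    {"Project": "Eta", "Size": 90, "Complexity": 4, "Experience": 6, "Duration": 17},
]
df = pd.concat([df, pd.DataFrame(new_projects)], ignore_index=True)
X = scaler.fit_transform(df[features].values)  # rebuild the scaled feature matrix
```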

### C. Feature Importance
Which attributes matter most? Try estimating projects where you vary only one attribute at a time to see its impact.
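
A minimal sketch of such a one-at-a-time sweep follows. It is NumPy-only and self-contained; the baseline `[60, 3, 4]`, the sweep values, and the helper `oat_estimate` are assumptions chosen for illustration. Each run changes a single attribute and keeps the other two at the baseline.

```python
import numpy as np

oat_names = ["Size", "Complexity", "Experience"]
# Same fictional projects as in Step 1
oat_x = np.array([
    [50, 3, 5], [80, 4, 4], [40, 2, 6], [100, 5, 3], [70, 3, 5],
], dtype=float)
oat_y = np.array([12, 18, 10, 22, 15], dtype=float)

def oat_estimate(query, k=2):
    """Mean duration of the k nearest projects under standardized Euclidean distance."""
    std = oat_x.std(axis=0)
    # The column means cancel when two standardized points are subtracted,
    # so dividing the raw difference by the standard deviation is enough.
    dists = np.linalg.norm((oat_x - query) / std, axis=1)
    return oat_y[np.argsort(dists, kind="stable")[:k]].mean()

baseline = [60, 3, 4]
sweeps = {
    "Size": [20, 60, 100, 150],
    "Complexity": [1, 2, 3, 4, 5],
    "Experience": [1, 4, 7],
}

results = {}
for name, values in sweeps.items():
    column = oat_names.index(name)
    for value in values:
        query = np.array(baseline, dtype=float)
        query[column] = value  # vary one attribute, hold the others fixed
        results[(name, value)] = oat_estimate(query)
        print(f"{name}={value}: {results[(name, value)]:.1f} weeks")
```

Because the estimate is an average over a handful of neighbours, it changes in steps as the neighbour set changes, not smoothly; with five projects, several sweep values may produce the same estimate.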