# A - What Is Data Science?

## A.1 From Raw Data &rarr; Insight &rarr; Decision

This lab practices the simplest complete loop in data science:
* Take a small table of raw data
* Compute a useful metric **(ROI)**
* Summarize it into a short insight **(average vs recent ROI)**
* Turn that into a concrete action using a **decision rule** (a threshold)

Experiment by editing the last week or two of `revenue` and `ad_spend`, changing the `threshold`, or widening the "recent" window from the last 2 weeks to the last 3-4 weeks. Watch how small changes in assumptions can flip the recommended action.

<div style="border: 1px solid #ccc; padding: 10px; border-radius: 5px;">
    <b>Rule of thumb</b> <br> 
    <span style="color:blue">Value of analysis $\approx$ (decision improvement) - (time/compute/org cost)</span>
</div>

In [1]:
import pandas as pd

df = pd.DataFrame({
    "week": [1,2,3,4,5,6],
    "ad_spend": [200, 220, 240, 260, 260, 280],
    "revenue":  [900, 930, 950, 980, 970, 1030],
})

df["roi"] = (df["revenue"] - df["ad_spend"]) / df["ad_spend"]

print("Data:")
print(df)

avg_roi = df["roi"].mean()
recent_roi = df.tail(2)["roi"].mean()

threshold = 2.5
decision = "INCREASE ad spend" if recent_roi >= threshold else "HOLD ad spend"

print("\nAverage ROI:", round(avg_roi, 3))
print("Recent ROI (last 2 weeks):", round(recent_roi, 3))
print("Threshold:", threshold)
print("Decision:", decision)

Data:
   week  ad_spend  revenue       roi
0     1       200      900  3.500000
1     2       220      930  3.227273
2     3       240      950  2.958333
3     4       260      980  2.769231
4     5       260      970  2.730769
5     6       280     1030  2.678571

Average ROI: 2.977
Recent ROI (last 2 weeks): 2.705
Threshold: 2.5
Decision: INCREASE ad spend
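
The experiment suggested earlier (changing the "recent" window and the threshold) can be sketched as a small sweep. This is a minimal sketch, not part of the original lab: the helper `decide` and the alternative threshold of 2.75 are illustrative assumptions, while the data and the 2.5 threshold come from the cell above.

```python
import pandas as pd

# Same data as the lab cell above.
df = pd.DataFrame({
    "week": [1, 2, 3, 4, 5, 6],
    "ad_spend": [200, 220, 240, 260, 260, 280],
    "revenue": [900, 930, 950, 980, 970, 1030],
})
df["roi"] = (df["revenue"] - df["ad_spend"]) / df["ad_spend"]


def decide(roi, window, threshold):
    """Hypothetical helper: apply the lab's rule to the last `window` weeks."""
    recent_roi = roi.tail(window).mean()
    action = "INCREASE ad spend" if recent_roi >= threshold else "HOLD ad spend"
    return recent_roi, action


results = {}
for threshold in (2.5, 2.75):  # 2.75 is an assumed alternative threshold
    for window in (2, 3, 4):
        recent_roi, action = decide(df["roi"], window, threshold)
        results[(window, threshold)] = action
        print(f"threshold={threshold}, window={window}: "
              f"recent ROI={recent_roi:.3f} -> {action}")
```

With these numbers, every window clears the 2.5 threshold, while at 2.75 only the 4-week window does, so the recommendation depends on both choices.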


### Questions


* What is the **decision** (what could change)?  
    <span style="color:blue">The decision is whether to increase or hold ad spend. It depends on how the recent ROI compares with the `threshold`.</span>
  
* What is the **metric** that will be used as evidence?  
    <span style="color:blue">For the evidence, I will look at the Return on Investment (ROI). Specifically `recent_roi`, which is the value the code compares with the threshold; `avg_roi` only provides context.</span>

* What is the **rule** that would trigger a different action?  
    <span style="color:blue">The rule is the threshold comparison that sets `decision`: if the recent ROI is greater than or equal to the threshold, increase ad spend; otherwise, hold ad spend.</span>

## A.2 Exploratory vs Predictive vs Prescriptive

| Mode | Core Question | Typical Outputs | Common Pitfalls |
| :--- | :--- | :--- | :--- |  
| Exploratory | "What's going on in the data?" | Summaries, plots, anomalies, hypotheses | Seeing patterns that aren't real; correlation $\neq$ causation |
| Predictive | "What will happen next?" | Forecasts, classifications, risk scores | Leakage, overfitting, evaluation on the wrong data |
| Prescriptive | "What should we do?" | Policies, optimizations, recommended actions | Optimizing the wrong objective; ignoring constraints/incentives | 
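
The "evaluation on the wrong data" pitfall in the table can be illustrated with a hold-out check. The following is a rough sketch under stated assumptions: it uses the `spend` and `signups` values from this section's lab cell, fits a least-squares line with `np.polyfit` (a stand-in for the lab's `LinearRegression`), and the split into 7 training days and 3 held-out days is an arbitrary choice made for illustration.

```python
import numpy as np

# Data copied from this section's lab cell.
spend = np.array([50, 60, 65, 70, 60, 80, 85, 90, 95, 100], dtype=float)
signups = np.array([20, 22, 24, 25, 23, 28, 30, 32, 33, 35], dtype=float)

# Assumed split: fit on the first 7 days, hold out the last 3.
n_train = 7
slope, intercept = np.polyfit(spend[:n_train], signups[:n_train], 1)


def mean_abs_error(x, y):
    """Mean absolute error of the fitted line on the points (x, y)."""
    return float(np.mean(np.abs(y - (slope * x + intercept))))


train_mae = mean_abs_error(spend[:n_train], signups[:n_train])
test_mae = mean_abs_error(spend[n_train:], signups[n_train:])

print(f"In-sample MAE (days 1-7): {train_mae:.2f}")
print(f"Held-out MAE (days 8-10): {test_mae:.2f}")
```

On this split the held-out error is larger than the in-sample error, in part because the held-out days sit above the spend range the line was fitted on. Reporting only the in-sample figure would overstate how well the model forecasts.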


This section shows how the same dataset can lead to three different **next steps** depending on the question, following this path:
* **Exploratory**: Describe what is happening
* **Predictive**: Fit a simple model and forecast signups for new spend levels
* **Prescriptive**: Choose a spend level that maximizes a defined objective

**Experiment:**  
* Make the relationship noisier by changing a few `signups` values
* Add an outlier day with unusually high spend
* Change `value_per_signup` to see how the "best" action shifts

In [14]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "day": np.arange(1, 11),
    "spend": [50, 60, 65, 70, 60, 80, 85, 90, 95, 100],
    "signups": [20, 22, 24, 25, 23, 28, 30, 32, 33, 35],
})

print("=== exploratory (describe what is going on) ===")
print(df.describe())

X = df[["spend"]].values
y = df["signups"].values
model = LinearRegression().fit(X, y)

print("\n=== Predictive (estimate what will happen) ===")
print("Estimated signups = {:.3f} * spend + {:.3f}".format(model.coef_[0], model.intercept_))
for test_spend in [70, 110]:
    pred = model.predict([[test_spend]])[0]
    print(f"Predicted signups at spend={test_spend}: {pred:.2f}")

print("\n=== Prescriptive (choose an action) ===")
value_per_signup = 3.0
candidate_spend = np.arange(40, 141, 5)
pred_signups = model.predict(candidate_spend.reshape(-1,1))
profit = value_per_signup * pred_signups - candidate_spend

best_idx = int(np.argmax(profit))
best_spend = int(candidate_spend[best_idx])
best_profit = float(profit[best_idx])

print("Assuming value_per_signup =", value_per_signup)
print("Best spend (grid search):", best_spend)
print("Expected profit at best spend:", round(best_profit, 2))


=== exploratory (describe what is going on) ===
            day       spend    signups
count  10.00000   10.000000  10.000000
mean    5.50000   75.500000  27.200000
std     3.02765   16.906606   5.138093
min     1.00000   50.000000  20.000000
25%     3.25000   61.250000  23.250000
50%     5.50000   75.000000  26.500000
75%     7.75000   88.750000  31.500000
max    10.00000  100.000000  35.000000

=== Predictive (estimate what will happen) ===
Estimated signups = 0.303 * spend + 4.337
Predicted signups at spend=70: 25.53
Predicted signups at spend=110: 37.65

=== Prescriptive (choose an action) ===
Assuming value_per_signup = 3.0
Best spend (grid search): 40
Expected profit at best spend: 9.35


### Analysis

* What happens when I add an outlier:  
    <span style="color:blue">The expected profit at best spend changes significantly. I believe this is because the dataset is quite small.</span>

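* What happens when I change the value per signup:  
    <span style="color:blue">It can move the best spend from one end of the candidate range to the other. With a fitted slope of about 0.303 signups per unit of spend, spending more only pays off once `value_per_signup` exceeds roughly 1 / 0.303 $\approx$ 3.3; below that, the lowest candidate spend (40) wins.</span>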
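
The effect of `value_per_signup` follows from the model being linear. If predicted signups are $a \cdot s + b$ for spend $s$, then profit is $v(a s + b) - s = (v a - 1)s + v b$, which is itself linear in $s$, so a grid search picks the lowest candidate when $v a < 1$ and the highest when $v a > 1$. Here is a minimal sketch of that break-even point, assuming the lab's data and candidate grid; it uses `np.polyfit` in place of `LinearRegression`, and the helper `best_spend` and the test value 3.5 are illustrative assumptions.

```python
import numpy as np

# Data and candidate grid copied from the lab cell above.
spend = np.array([50, 60, 65, 70, 60, 80, 85, 90, 95, 100], dtype=float)
signups = np.array([20, 22, 24, 25, 23, 28, 30, 32, 33, 35], dtype=float)
candidate_spend = np.arange(40, 141, 5)

# Least-squares line: signups ~ slope * spend + intercept.
slope, intercept = np.polyfit(spend, signups, 1)

# Profit slope is (value_per_signup * slope - 1), so it changes sign here.
break_even_value = 1.0 / slope


def best_spend(value_per_signup):
    """Hypothetical helper: grid-search the profit-maximizing spend."""
    pred_signups = slope * candidate_spend + intercept
    profit = value_per_signup * pred_signups - candidate_spend
    return int(candidate_spend[np.argmax(profit)])


print(f"slope={slope:.3f}, break-even value_per_signup={break_even_value:.2f}")
for v in (3.0, 3.5):  # 3.5 is an assumed value above the break-even point
    print(f"value_per_signup={v}: best spend = {best_spend(v)}")
```

Because the optimum always sits on a boundary of the grid, the "best spend" here says more about the chosen candidate range than about the data; a model with diminishing returns would be needed to get an interior optimum.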

# B - The Data Science Lifecycle

There are five recurring phases in the data science lifecycle:
* Problem Framing
* Data Collection
* Modeling
* Evaluation
* Deployment & Iteration

## B.1 - Problem Framing

We need to determine what "better" means and how the decision will be made.  
This framing produces:
* **Decision**: What action could change?
* **Target**: What are we trying to predict / estimate / optimize?
* **Metric**: How will we measure success?
* **Constraints**: What limits are real (budget, latency, fairness, policy)?

<div style="border: 1px solid #ccc; padding: 10px; border-radius: 5px;">
    <b>Rule of thumb</b> <br> 
    <span style="color:blue">Bad framing &rarr; perfect model on the wrong question </span>
</div>
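
The four framing outputs listed above can be written down as a small record so that gaps are easy to spot. This is only a sketch: the `ProblemFrame` class and its `missing_fields` method are hypothetical names, and the example values restate the ad-spend lab from A.1 rather than any real project.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemFrame:
    """Hypothetical container for the four framing outputs."""
    decision: str
    target: str
    metric: str
    constraints: List[str] = field(default_factory=list)

    def missing_fields(self):
        """Return the names of any framing outputs left blank."""
        return [name for name, value in vars(self).items() if not value]


frame = ProblemFrame(
    decision="Increase or hold weekly ad spend",
    target="Return on ad spend (ROI)",
    metric="Mean ROI over the last 2 weeks compared with a threshold of 2.5",
    constraints=[],  # left empty on purpose: A.1 states no constraints
)
print("Still to define:", frame.missing_fields())
```

An empty `constraints` list is flagged as missing, which is a reminder that the A.1 rule ignores limits such as budget.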

### Micro-lab - Turn a vague goal into a measurable plan

This lab takes a fuzzy business goal (like "improve retention") and turns it into a decision, a measurable target, and a success metric, then sanity-checks whether an intervention is worth doing.

In [None]:
# Fill these in like you are writing a one-paragraph project brief.

goal = "Improve weekly retention"
decision = "Offer a discount to "