<a href="https://colab.research.google.com/github/badineniharshith/AIML-Course/blob/main/AIML_DAY_14_FRAMING_A_ML_PROJECT_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: https://www.youtube.com/watch?v=ZftI2fEz0Fw&list=PLKnIA16_Rmvbr7zKYQuBfsVkjoLcJgxHH

Effectively planning a data science project involves translating a business need into a structured, solvable machine learning problem. This process ensures the final model delivers real value.

Here's how to structure your plan, using the example of reducing customer churn at Netflix.

***

### From Business Problem to ML Problem: The Netflix Example 🏢➡️🤖

The most critical step is reframing a business goal into a specific question that machine learning can answer.

* **Business Problem:** Netflix is losing money and market share because too many customers are canceling their subscriptions. The leadership wants to **reduce customer churn**. This is a broad business objective.
* **Machine Learning Problem:** To tackle this, we need a predictive task. We can reframe the problem as: **Can we predict which active subscribers are most likely to cancel their subscription in the next month?** This is a specific, actionable, and measurable ML problem.

By solving the ML problem, we provide the business with a list of at-risk customers, allowing them to take targeted action (like offering a discount or a content recommendation) to solve the original business problem.
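The reframed question ("will this subscriber cancel in the next month?") can be turned into a concrete label. The sketch below is one possible way to do that, assuming a hypothetical `snapshot` date (the day the prediction is made), an optional `cancel_date` per user, and a 30-day horizon standing in for "next month". None of these names come from a real Netflix dataset.

```python
from datetime import date, timedelta
from typing import Optional


def label_churn(snapshot: date, cancel_date: Optional[date], horizon_days: int = 30) -> int:
    """Return 1 if the user cancels within `horizon_days` after `snapshot`, else 0.

    `snapshot`, `cancel_date`, and `horizon_days` are illustrative names (assumptions).
    """
    if cancel_date is None:
        # The user has never cancelled, so they did not churn in the window.
        return 0
    window_end = snapshot + timedelta(days=horizon_days)
    return int(snapshot < cancel_date <= window_end)


# Hypothetical examples
snapshot = date(2024, 1, 1)
print(label_churn(snapshot, date(2024, 1, 15)))  # cancels inside the window
print(label_churn(snapshot, date(2024, 3, 1)))   # cancels after the window
print(label_churn(snapshot, None))               # still subscribed
```

Fixing the horizon up front keeps the ML question measurable: every historical user gets an unambiguous 0 or 1.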

***

### Defining the ML Problem Type 🧐

Once the problem is framed, you must classify it. This determines which algorithms and evaluation methods you'll use.

For the Netflix churn problem, you are predicting one of two outcomes for each user: `will_churn` or `will_not_churn`. This is a **binary classification** problem. The model's output will be a probability score (e.g., there's a 92% chance this user will churn).
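To connect the probability score to the two outcomes, a decision threshold is usually applied. The snippet below is a minimal sketch; the function name `classify` and the default threshold of 0.5 are assumptions chosen for illustration, and in practice the threshold would be tuned to the business cost of each kind of error.

```python
def classify(churn_probability: float, threshold: float = 0.5) -> str:
    """Map a churn probability to one of the two class labels.

    The 0.5 default is an illustrative assumption, not a recommended value.
    """
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    return "will_churn" if churn_probability >= threshold else "will_not_churn"


print(classify(0.92))                 # high score, flagged as churn
print(classify(0.30))                 # low score, not flagged
print(classify(0.60, threshold=0.8))  # a stricter threshold flags fewer users
```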

Other common problem types include:
* **Regression:** Predicting a continuous value (e.g., predicting how many hours a user will watch next week).
* **Clustering:** Grouping similar items without predefined labels (e.g., segmenting users into "action movie fans" and "documentary lovers").

***

### Analyzing the Current Solution (Baseline) 📈

Before building a complex model, understand and measure the current process. What does Netflix do *now* to prevent churn?

Perhaps the current solution is a simple rule-based system (e.g., email anyone who hasn't logged in for 30 days) or no targeted solution at all. Establishing this **baseline** is crucial. If your new, complex ML model can't perform significantly better than the simple existing solution, it's not worth the investment.
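A rule-based baseline like the 30-day inactivity rule can be measured before any model is built. The sketch below uses made-up numbers: `days_inactive` and `churned` are hypothetical lists, and `baseline_flags` is an illustrative helper, not an existing system.

```python
def baseline_flags(days_since_last_login, max_inactive_days=30):
    """Flag a user as at-risk if they have been inactive for 30 days or more."""
    return [days >= max_inactive_days for days in days_since_last_login]


# Hypothetical historical data: days since last login and whether the user churned.
days_inactive = [2, 45, 31, 0, 60, 10]
churned = [0, 1, 0, 0, 1, 1]

flags = baseline_flags(days_inactive)

# How many flagged users actually churned, and how many churners were flagged?
true_positives = sum(1 for f, c in zip(flags, churned) if f and c == 1)
flagged = sum(flags)
actual_churners = sum(churned)

baseline_precision = true_positives / flagged if flagged else 0.0
baseline_recall = true_positives / actual_churners if actual_churners else 0.0

print(f"baseline precision: {baseline_precision:.2f}")
print(f"baseline recall:    {baseline_recall:.2f}")
```

Whatever numbers the rule achieves on real data become the bar the ML model has to clear.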

***

### Sourcing and Gathering Data 💾

You can't build a model without data. For the churn problem, you would need historical data containing features that might indicate a user's intent to leave.

**Required Data:**
* **Features (Input):**
    * `User Activity`: Login frequency, hours watched per week, content genres viewed, devices used.
    * `Subscription Information`: Plan tier, account tenure, payment history, recent price changes.
    * `Customer Interactions`: Number of customer support tickets, survey responses.
* **Target Label (Output):** A historical record of who has actually churned. This is a binary flag (`churned = 1` for users who left, `churned = 0` for those who stayed). Without this label, you can't train a supervised model.
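The feature and label layout described above might look like the following in code. This is only a sketch: the `UserRecord` fields are a small, hypothetical subset of the listed features, and the values are invented.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    # Features (inputs); field names are illustrative assumptions.
    hours_watched_per_week: float
    tenure_months: int
    support_tickets: int
    # Target label (output): 1 = churned, 0 = stayed.
    churned: int


records = [
    UserRecord(hours_watched_per_week=12.5, tenure_months=24, support_tickets=0, churned=0),
    UserRecord(hours_watched_per_week=0.5, tenure_months=3, support_tickets=2, churned=1),
    UserRecord(hours_watched_per_week=4.0, tenure_months=10, support_tickets=1, churned=0),
]

# Split each record into a feature vector (X) and its label (y),
# the shape most supervised-learning code expects.
X = [[r.hours_watched_per_week, r.tenure_months, r.support_tickets] for r in records]
y = [r.churned for r in records]

print(X)
print(y)
```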

***

### Choosing the Right Metrics 📏

How will you measure success? You need both business and ML metrics.

* **Business Metric:** The ultimate goal. For Netflix, this would be a **reduction in the monthly churn rate** (e.g., a 2% decrease).
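* **ML Metrics:** These evaluate the quality of the model's predictions. For a classification problem like churn, accuracy alone is misleading: churners are usually a small minority, so a model that predicts "no churn" for everyone can still score a high accuracy while catching nobody. Key metrics include: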
    * **Precision:** Of all the users we *predict* will churn, what percentage actually do? High precision is vital to avoid giving discounts to happy customers.
        $$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
    * **Recall (Sensitivity):** Of all the users who *actually* churned, what percentage did our model catch? High recall is essential to identify as many at-risk users as possible.
        $$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
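    * **F1-Score:** The harmonic mean of Precision and Recall, providing a single score that balances both.
        $$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$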
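The metrics above can be computed directly from the four confusion-matrix counts. The sketch below uses invented counts for 1,000 users, 50 of whom churned, to show why accuracy alone is misleading; the function names are illustrative, not taken from a specific library.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical model A: predicts "no churn" for everyone.
acc_a = accuracy(tp=0, fp=0, fn=50, tn=950)
rec_a = recall(tp=0, fn=50)

# Hypothetical model B: catches 30 of the 50 churners, with 20 false alarms.
acc_b = accuracy(tp=30, fp=20, fn=20, tn=930)
prec_b = precision(tp=30, fp=20)
rec_b = recall(tp=30, fn=20)
f1_b = f1_score(prec_b, rec_b)

print(f"A: accuracy={acc_a:.2f}, recall={rec_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, precision={prec_b:.2f}, recall={rec_b:.2f}, f1={f1_b:.2f}")
```

Both models have similar accuracy, but only model B identifies any at-risk users, which is what precision, recall, and F1 reveal.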

***

### Deciding Between Online vs. Batch Learning 🔄

This choice depends on how fresh the data needs to be for your predictions.

* **Batch Learning:** The model is trained offline periodically (e.g., once a week) on a large batch of historical data. The trained model is then deployed to make predictions. For churn prediction, a user's behavior over weeks or months is more important than their behavior in the last five minutes, so **batch learning is usually sufficient and simpler to implement.**
* **Online Learning:** The model is updated continuously or in small mini-batches as new data arrives. This is useful for highly dynamic environments like ad bidding or stock trading, but likely overkill for a churn prediction project.
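The difference between the two modes can be sketched with a tiny logistic regression written in plain Python. This is an illustration under stated assumptions, not a production recipe: the single feature (a normalized inactivity score between 0 and 1), the data, the learning rates, and the function names are all hypothetical.

```python
import math


def predict_proba(weights, bias, x):
    """Logistic regression: probability that the label is 1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def batch_train(examples, n_features, epochs=500, lr=0.5):
    """Batch learning: fit from scratch on the full, fixed dataset."""
    weights, bias = [0.0] * n_features, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in examples:
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # One gradient step using the average gradient over all examples.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def online_update(weights, bias, x, y, lr=0.1):
    """Online learning: adjust the existing model using one new example."""
    error = predict_proba(weights, bias, x) - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * error


# Hypothetical history: (inactivity score, churned)
history = [([0.1], 0), ([0.2], 0), ([0.8], 1), ([0.9], 1)]
weights, bias = batch_train(history, n_features=1)

# A new observation arrives: a moderately inactive user who churned.
before = predict_proba(weights, bias, [0.5])
weights_new, bias_new = online_update(weights, bias, [0.5], 1)
after = predict_proba(weights_new, bias_new, [0.5])
print(f"churn probability for x=0.5 before update: {before:.3f}, after: {after:.3f}")
```

In the batch setup, `batch_train` would simply be rerun on a schedule with the latest history; `online_update` is the extra machinery that online learning requires.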

***

### Validating Assumptions ✅

Every project is built on assumptions. List them and question them.

* **Assumption 1:** Past behavior is predictive of future churn.
    * *Check:* Is this always true? A major external event (like a competitor launching a new service or a big price hike) could make historical data less relevant.
* **Assumption 2:** We have access to sufficient, high-quality data.
    * *Check:* Are there gaps in the data? Are there known data quality issues? Is there enough historical data on churned users to learn from?
* **Assumption 3:** The business can and will act on the model's output.
    * *Check:* Is there a system in place to email targeted offers to users flagged by the model? A perfect prediction is useless without an action plan. If the business can't use the predictions, the project has no value.
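Some of these checks, particularly the data-quality questions under Assumption 2, can be automated. The sketch below counts missing values and churned examples; the helper `audit_training_data`, the field names, and the `min_churned` cutoff are all hypothetical and would need to be adapted to the real dataset.

```python
def audit_training_data(rows, required_fields, min_churned=100):
    """Report missing values per field and whether enough churned users exist.

    `min_churned` is an arbitrary illustrative cutoff, not an established rule.
    """
    missing = {
        field: sum(1 for row in rows if row.get(field) is None)
        for field in required_fields
    }
    churned_count = sum(1 for row in rows if row.get("churned") == 1)
    return {
        "missing": missing,
        "churned_count": churned_count,
        "enough_churned": churned_count >= min_churned,
    }


# Hypothetical rows with some gaps
rows = [
    {"hours_watched_per_week": 5.0, "plan_tier": "basic", "churned": 0},
    {"hours_watched_per_week": None, "plan_tier": "premium", "churned": 1},
    {"hours_watched_per_week": 1.5, "plan_tier": None, "churned": 1},
]

report = audit_training_data(rows, ["hours_watched_per_week", "plan_tier"], min_churned=2)
print(report)
```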

Here are simple definitions for some of the key terms used in the project plan.

***

### Churn
In business, **churn** means customers stopping their use of a service, and the **churn rate** is the share of customers who do so in a given period. For Netflix, churn simply means a user canceling their subscription. The goal is to keep the churn rate as low as possible.
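As a quick illustration, a monthly churn rate can be computed as cancellations divided by the number of subscribers at the start of the month. This is one common convention among several, and the function name and numbers below are made up.

```python
def monthly_churn_rate(subscribers_at_start: int, cancellations: int) -> float:
    """Share of the month's starting subscribers who cancelled during the month."""
    if subscribers_at_start <= 0:
        raise ValueError("subscribers_at_start must be positive")
    return cancellations / subscribers_at_start


# Hypothetical month: 1,000 subscribers at the start, 25 cancellations.
rate = monthly_churn_rate(1000, 25)
print(f"monthly churn rate: {rate:.1%}")
```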

***

### Binary Classification
This is a type of machine learning task where the goal is to predict one of two possible outcomes. It answers a "yes" or "no" question.
* **Example:** Will this user *churn* (yes) or *not churn* (no)? Will an email be marked as *spam* (yes) or *not spam* (no)?

***

### Baseline
A **baseline** is a simple, often non-ML, model or rule that you compare your new, complex model against. It's the standard to beat. If your advanced model doesn't perform better than the simple baseline, it's not providing value.
* **Example:** A baseline for Netflix could be: "Any user who hasn't logged in for 30 days is at risk of churning." Your ML model must outperform this simple rule on the metrics you have chosen.

***

### Features and Target Label
* **Features:** These are the **inputs** for your model—the pieces of information it uses to make a decision. Think of them as the columns in a spreadsheet. For the churn problem, features would be `account_age`, `last_login_date`, `hours_watched_this_month`, etc.
* **Target Label:** This is the **output** you are trying to predict. It's the "answer key" in your historical data that the model learns from. For the churn problem, the label is a single column indicating if a user `churned (1)` or `did not churn (0)`.

***

### Precision and Recall
These two metrics are crucial for classification problems because they give a more complete picture than simple accuracy. To understand them, consider the four possible outcomes of a prediction:
* **True Positive:** You predicted a user would churn, and they did. ✅ **Correct!**
* **False Positive:** You predicted a user would churn, but they didn't. ❌ **Mistake!** (You might annoy a happy customer).
* **True Negative:** You predicted a user would *not* churn, and they didn't. ✅ **Correct!**
* **False Negative:** You predicted a user would *not* churn, but they did. ❌ **Mistake!** (You missed a chance to save them).

With that in mind:
* **Precision:** Measures how trustworthy your "yes" predictions are. Of all the users you *predicted* would churn, how many actually did? **High precision minimizes False Positives.**
* **Recall:** Measures how well you find all the actual "yes" cases. Of all the users who *truly churned*, how many did you find? **High recall minimizes False Negatives.**
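The four outcomes can be tallied by comparing predictions with what actually happened. The lists below are invented for illustration, and `outcome` is a hypothetical helper.

```python
from collections import Counter


def outcome(predicted: int, actual: int) -> str:
    """Name the outcome for one prediction (1 = churn, 0 = no churn)."""
    if predicted == 1 and actual == 1:
        return "TP"  # predicted churn, and the user churned
    if predicted == 1 and actual == 0:
        return "FP"  # predicted churn, but the user stayed
    if predicted == 0 and actual == 0:
        return "TN"  # predicted no churn, and the user stayed
    return "FN"      # predicted no churn, but the user churned


# Hypothetical predictions and ground truth for six users
predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 1, 1, 0]

counts = Counter(outcome(p, a) for p, a in zip(predicted, actual))
precision_value = counts["TP"] / (counts["TP"] + counts["FP"])
recall_value = counts["TP"] / (counts["TP"] + counts["FN"])

print(dict(counts))
print(f"precision={precision_value:.2f}, recall={recall_value:.2f}")
```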

***

### F1-Score
The **F1-Score** is a single number that combines and balances precision and recall. It's useful when you care about both minimizing false positives and minimizing false negatives, and you want a single metric to compare models.

***

### Batch vs. Online Learning
This describes how a model gets trained.
* **Batch Learning (Offline Learning):** The model is trained all at once on a large, fixed dataset and then stays unchanged while it serves predictions. To incorporate new data, you retrain it from scratch, for example every week or month.
* **Online Learning:** The model is updated incrementally as new data points arrive one by one or in small groups. It learns continuously and adapts on the fly. This is more complex and only needed for problems where predictions must reflect real-time information.