# Session 51: The Machine Learning Workflow in Python

**Unit 5: Basics of Predictive Analytics**
**Hour: 51**
**Mode: Practical Lab**

---

### 1. Objective

This lab introduces the standard workflow for preparing data for machine learning in Python using the **Scikit-learn** library. Before we can train a model, we must:
1.  Separate our data into features (X) and the target (y).
2.  Convert categorical features into a numerical format.
3.  Split our data into a training set and a testing set.

### 2. Setup

We will reload our Telco dataset from its source CSV and apply a light cleaning pass. For this lab, we'll focus on a smaller subset of columns for simplicity.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

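# Simple cleaning: remove rows where TotalCharges is blank (a single space in the raw CSV).
# This is done on the full DataFrame because TotalCharges is not part of the subset below.
df = df[df['TotalCharges'] != ' ']

# Let's use a smaller set of features for this example
df_subset = df[['tenure', 'MonthlyCharges', 'Contract', 'Churn']].copy()  # .copy() avoids SettingWithCopyWarning

# Drop any rows with missing values in the selected columns
df_subset = df_subset.dropna()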
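
Note that `dropna()` removes only true missing values, which is why blank text entries need a separate check. The sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative, not taken from the Telco file) to show the difference. It also shows `pd.to_numeric(..., errors="coerce")` as one possible alternative for catching text that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical toy data: one blank-string entry and one true missing value
toy = pd.DataFrame({
    "tenure": [1, 0, 24, 12],
    "TotalCharges": ["29.85", " ", "1889.5", None],
})

# dropna() alone removes the None row but keeps the blank-string row
after_dropna = toy.dropna()

# Coercing to numeric turns unparseable text (such as " ") into NaN,
# so a following dropna() removes those rows as well
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
cleaned = toy.dropna()

print("Rows after dropna() only:", len(after_dropna))
print("Rows after coercion + dropna():", len(cleaned))
```

Whether to filter on the blank string or coerce the column is a judgment call. Coercion has the side effect of making the column numeric, which is usually what a model needs.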

### 3. Step 1: Separate Features (X) and Target (y)

*   The **target (y)** is the single column we are trying to predict. In our case, this is `Churn`.
*   The **features (X)** are all the other columns we will use to make the prediction.

By convention, `X` is capitalized because it's a matrix (a DataFrame), and `y` is lowercase because it's a vector (a Series).

In [None]:
X = df_subset.drop('Churn', axis=1)
y = df_subset['Churn']

print("--- Features (X) ---")
print(X.head())
print("\n--- Target (y) ---")
print(y.head())
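
To make the matrix/vector convention concrete, here is a minimal sketch on a made-up three-row DataFrame (the names `toy`, `X_toy`, and `y_toy` are illustrative only). It checks that the features form a two-dimensional DataFrame while the target is a one-dimensional Series.

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the lab subset
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

X_toy = toy.drop(columns="Churn")  # same effect as toy.drop('Churn', axis=1)
y_toy = toy["Churn"]

# X is 2-D (rows, columns); y is 1-D (rows,)
print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```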

### 4. Step 2: Convert Categorical Features to Numbers

Most machine learning models operate on numbers, so they cannot work directly with text values like 'Month-to-month'. We need to convert such values into numbers. A widely used method is **One-Hot Encoding**.

One-Hot Encoding takes a categorical column and creates a new binary (0 or 1) column for each category. Pandas has a simple function for this called `pd.get_dummies()`.

In [None]:
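X_encoded = pd.get_dummies(X, columns=['Contract'], drop_first=True, dtype=int)
# drop_first=True drops the first category's column. That column is redundant because it is
# implied whenever all remaining columns are 0, and keeping it would cause multicollinearity.
# dtype=int produces 0/1 columns; recent pandas versions otherwise default to True/False.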

X_encoded.head()

**Interpretation:** The `Contract` column has been replaced by two new columns. For a customer with a 'One year' contract, the `Contract_One year` column is 1 and `Contract_Two year` is 0. If both are 0, it implies the contract was 'Month-to-month'.
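
The effect of `drop_first` is easier to see side by side. The following sketch encodes a made-up four-row `Contract` column (the names `contracts`, `full`, and `reduced` are illustrative) with and without the option.

```python
import pandas as pd

# Hypothetical toy column containing all three contract types
contracts = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# One column per category
full = pd.get_dummies(contracts, columns=["Contract"], dtype=int)

# The first category in sorted order ('Month-to-month') is dropped
reduced = pd.get_dummies(contracts, columns=["Contract"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In `reduced`, the first row contains only zeros, which is how a 'Month-to-month' customer is represented once its own column has been dropped.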

### 5. Step 3: Split Data into Training and Testing Sets

This is a crucial step. We need to hold back some of our data to evaluate our model's performance on data it has **never seen before**.

*   **Training Set:** The data we use to teach the model (usually 70-80% of the data).
*   **Testing Set:** The data we use to test how well the model learned (usually 20-30% of the data).

Scikit-learn provides a handy function, `train_test_split`, to do this for us.

In [None]:
from sklearn.model_selection import train_test_split

# test_size=0.2 means we'll hold back 20% of the data for testing.
# random_state=42 ensures that we get the same split every time we run the code, for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
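
As a sanity check on what `test_size` and `random_state` do, here is a small sketch on made-up data (100 rows; all variable names are illustrative). It splits the same data twice with the same seed and confirms that the sizes follow the 80/20 ratio, that both runs select the same test rows, and that no row appears in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, a Yes/No target
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series(["Yes"] * 25 + ["No"] * 75, name="Churn")

# Two independent calls with the same random_state
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("Train/test sizes:", a_train.shape, a_test.shape)
print("Same test rows on both runs:", a_test.index.equals(b_test.index))

# Rows that appear in both the training and testing sets (should be none)
overlap = set(a_train.index) & set(a_test.index)
print("Overlapping rows:", len(overlap))
```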

### 6. Conclusion

In this session, you learned the essential preprocessing steps for any supervised machine learning project:
1.  **Separate** your data into features `X` and a target `y`.
2.  **Encode** categorical text data into a numerical format using `pd.get_dummies()`.
3.  **Split** your data into training and testing sets using `train_test_split` to ensure a fair evaluation.

Our data is now fully prepared. We are ready to train our first model in Python.

**Next Session:** We will build a Linear Regression model in Python to predict `TotalCharges`.