In [1]:
import pandas as pd

# Load datasets
data_df = pd.read_csv('data.csv')
# Parse all date columns with the same explicit format
for col in ["date_activ", "date_end", "date_modif_prod", "date_renewal"]:
    data_df[col] = pd.to_datetime(data_df[col], format='%Y-%m-%d')

price_df = pd.read_csv('price_data.csv')
price_df["price_date"] = pd.to_datetime(price_df["price_date"], format='%Y-%m-%d')

# Remove irrelevant columns
data_df = data_df.drop(columns=['Unnamed: 0'], errors='ignore')
price_df = price_df.drop(columns=['Unnamed: 0'], errors='ignore')

# Extract new temporal features
price_df['year'] = price_df['price_date'].dt.year
price_df['month'] = price_df['price_date'].dt.month
price_df['day_of_week'] = price_df['price_date'].dt.dayofweek

# Calculate durations
data_df['duration_activ_to_end'] = (data_df['date_end'] - data_df['date_activ']).dt.days
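# Note: measured from the run date, so this value changes on every run
# and is negative for renewal dates that are already in the past
data_df['days_to_renewal'] = (data_df['date_renewal'] - pd.Timestamp.now()).dt.days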

# Combine columns to create derived features
price_df['price_var_ratio'] = price_df['price_off_peak_var'] / (price_df['price_peak_var'] + 1e-6)
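# Sort chronologically within each customer so pct_change compares consecutive dates;
# fill_method=None avoids the deprecated implicit forward-fill of missing prices
price_df = price_df.sort_values(['id', 'price_date'])
price_df['price_change_percent'] = price_df.groupby('id')['price_off_peak_var'].pct_change(fill_method=None)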

# Merge datasets
merged_df = pd.merge(data_df, price_df, on='id', how='inner')

# Feature engineering on merged dataset
merged_df['avg_price'] = (merged_df['price_off_peak_var'] + merged_df['price_peak_var'] + merged_df['price_mid_peak_var']) / 3
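# include_lowest=True keeps rows with an average price of exactly 0 in the 'Low' bin
merged_df['price_category'] = pd.cut(
    merged_df['avg_price'], bins=[0, 0.1, 0.2, float('inf')],
    labels=['Low', 'Medium', 'High'], include_lowest=True
)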

# Save the transformed dataset
merged_df.to_csv('engineered_features.csv', index=False)

print("Feature engineering completed and saved to 'engineered_features.csv'.")

Feature engineering completed and saved to 'engineered_features.csv'.


In [None]:
1. Why Did You Choose the Evaluation Metrics?
Evaluation metrics should align with the business objective and the nature of the data problem. For churn prediction, typical metrics include:

Accuracy: Measures the share of correct predictions, but it is misleading on imbalanced data: a model that always predicts "no churn" can score high accuracy while identifying no churners.
Precision and Recall: Capture the trade-off between false positives (predicting churn when the customer doesn’t churn) and false negatives (missing customers who churn).
F1-Score: The harmonic mean of precision and recall, useful as a single summary on imbalanced datasets.
ROC-AUC: Evaluates the model's ability to rank churners above non-churners across all classification thresholds.

Justification:
If the churn data is imbalanced (e.g., far fewer churners than non-churners), focus on precision, recall, F1-score, and ROC-AUC rather than accuracy, so the evaluation reflects how well churners are actually identified.
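
To make the imbalance point concrete, here is a minimal sketch on made-up labels (100 customers, 10 of them churners; these numbers are illustrative and not taken from the dataset above). The helper names `confusion_counts` and `basic_metrics` are hypothetical; the functions in `sklearn.metrics` compute the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Fall back to 0.0 when a denominator is zero (no predicted or no actual churners)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced sample: 10 churners followed by 90 non-churners
y_true = [1] * 10 + [0] * 90

# Baseline that never predicts churn
never_churn = [0] * 100

# Hypothetical model: finds 7 of the 10 churners and raises 5 false alarms
model_pred = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 85

baseline_metrics = basic_metrics(y_true, never_churn)
model_metrics = basic_metrics(y_true, model_pred)

for name, metrics in [("baseline", baseline_metrics), ("model", model_metrics)]:
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

In this toy case the baseline reaches 0.90 accuracy with a recall of 0, while the hypothetical model scores only slightly higher on accuracy (0.92) but recovers 70% of the churners. Accuracy alone would hide that difference.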

2. Is the Model Performance Satisfactory?
To determine whether performance is satisfactory:

Compare metrics against a baseline model (e.g., always predicting the majority class).
Prioritize recall if the cost of missing a churner is high.
Consider business implications: Do the churn predictions support concrete interventions, such as targeted retention offers?
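
When missing a churner is costly, one option is to lower the decision threshold applied to the predicted probabilities. The sketch below uses made-up probabilities to show the effect; `precision_recall_at` is a hypothetical helper, and the thresholds 0.5 and 0.3 are arbitrary choices for illustration.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Return (precision, recall) when predicting churn for probabilities >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = churn) and predicted churn probabilities
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]

default_precision, default_recall = precision_recall_at(y_true, y_prob, 0.5)
lowered_precision, lowered_recall = precision_recall_at(y_true, y_prob, 0.3)

print(f"threshold 0.5: precision={default_precision:.2f}, recall={default_recall:.2f}")
print(f"threshold 0.3: precision={lowered_precision:.2f}, recall={lowered_recall:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 0.75 to 1.00 and drops precision from 0.75 to about 0.57. Whether that trade is worthwhile depends on the cost of a missed churner relative to the cost of an unnecessary intervention.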

Example Explanation:
A high recall means we’re capturing most churners, reducing the risk of losing valuable customers.
A high F1-score indicates that neither precision nor recall is being sacrificed for the other.
A ROC-AUC of 0.5 corresponds to random guessing, so a value well above it (e.g., 0.8 or higher) shows the model ranks churners above non-churners far more often than chance.
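
The 0.5 reference point follows from one way of reading ROC-AUC: the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it by direct pairwise comparison on made-up scores; `roc_auc_pairwise` is a hypothetical helper written for illustration, and `sklearn.metrics.roc_auc_score` is the practical choice on real data.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the fraction of (churner, non-churner) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churn) and model scores
y_true = [0, 0, 1, 1]
informative_scores = [0.1, 0.4, 0.35, 0.8]
constant_scores = [0.5, 0.5, 0.5, 0.5]  # a model that cannot tell customers apart

auc_informative = roc_auc_pairwise(y_true, informative_scores)
auc_constant = roc_auc_pairwise(y_true, constant_scores)

print(f"informative scores: {auc_informative:.2f}")
print(f"constant scores:    {auc_constant:.2f}")
```

The pairwise loop is quadratic in the number of customers, so it is only suitable for small illustrations like this one.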

3. Present Your Work Clearly
Code Example for Model Evaluation in Colab:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# Example: Predictions and ground truth
y_true = test_data['churn']  # Replace with actual test labels
y_pred = model.predict(X_test)  # Replace with model predictions
y_prob = model.predict_proba(X_test)[:, 1]  # Predicted probabilities for ROC-AUC

# Evaluation Metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_prob)

# Print and Explain Metrics
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.2f} - Share of correct predictions.")
print(f"Precision: {precision:.2f} - Share of predicted churn cases that actually churned.")
print(f"Recall: {recall:.2f} - Share of actual churn cases identified by the model.")
print(f"F1-Score: {f1:.2f} - Harmonic mean of precision and recall.")
print(f"ROC-AUC: {roc_auc:.2f} - Ability to rank churners above non-churners.")

# Confusion Matrix for Clear Presentation
cm = confusion_matrix(y_true, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize Confusion Matrix (Optional)
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Churn', 'Churn'], yticklabels=['No Churn', 'Churn'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

Example for the Report:
Metrics Explanation:
"We used recall to ensure most churners are captured, as missing them could result in significant business losses. Precision was used to evaluate the model's ability to avoid false alarms. The ROC-AUC score provides an overall, threshold-independent measure of classification performance."

Performance Justification:
"The model achieves a recall of 0.85, meaning 85% of churn cases are identified. A precision of 0.75 means three out of four customers flagged as churners actually churn, so interventions are targeted effectively. A ROC-AUC of 0.88 indicates strong ability to separate churners from non-churners."

Clear Presentation:
Include visualizations (confusion matrix, ROC curve) to support explanations.
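
For the ROC curve, the points to plot are the (false positive rate, true positive rate) pairs obtained by sweeping the threshold over the predicted probabilities. The sketch below derives them by hand on made-up data so the construction is visible; `roc_points` and `trapezoid_area` are hypothetical helpers, and `sklearn.metrics.roc_curve` returns equivalent arrays for real predictions.

```python
def roc_points(y_true, y_prob):
    """Return (fpr, tpr) pairs from (0, 0) to (1, 1), one per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("A ROC curve needs at least one example of each class")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step flags more customers as churners
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = churn) and predicted churn probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, y_prob)
area = trapezoid_area(curve)

for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"Area under the curve: {area:.2f}")
```

The x and y values in `curve` can be passed to `plt.plot` to draw the curve, ideally together with the diagonal from (0, 0) to (1, 1) as the random-guessing reference.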