# Purchase Propensity Model

This notebook develops a machine learning model to predict customers' propensity to make purchases.

## Objectives:
- Analyze customer purchasing behavior
- Build a predictive purchase propensity model
- Identify high-propensity customers
- Provide insights for targeted marketing

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
import snowflake.connector

plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Connect to Snowflake
conn_params = {
    'user': 'workshop_user',
    'password': 'VotreMotDePasse123!',  # placeholder: replace with your own password
    'account': 'dnb65599',
    'warehouse': 'ANYCOMPANY_WH',
    'database': 'ANYCOMPANY_LAB',
    'schema': 'ANALYTICS'
}

conn = snowflake.connector.connect(**conn_params)
print("Connected to Snowflake!")
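Hard-coding the password in the notebook, as the cell above does, is convenient for a workshop but easy to leak. A minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names (`SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`, `SNOWFLAKE_ACCOUNT`, and the optional ones) and the helper `load_conn_params` are hypothetical and not part of the workshop setup:

```python
import os


def load_conn_params(env=os.environ):
    """Build Snowflake connection parameters from environment variables.

    The variable names are assumptions made for this illustration.
    """
    required = ["SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD", "SNOWFLAKE_ACCOUNT"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
        "account": env["SNOWFLAKE_ACCOUNT"],
        # Non-secret settings fall back to the workshop values.
        "warehouse": env.get("SNOWFLAKE_WAREHOUSE", "ANYCOMPANY_WH"),
        "database": env.get("SNOWFLAKE_DATABASE", "ANYCOMPANY_LAB"),
        "schema": env.get("SNOWFLAKE_SCHEMA", "ANALYTICS"),
    }
```

The returned dictionary can be passed to `snowflake.connector.connect(**load_conn_params())` in place of the literal `conn_params`.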

In [None]:
# Data preprocessing
# Note: `df` must already hold the customer feature table before this cell runs
# Features for prediction
features = ['age', 'income_category', 'region', 'total_transactions',
            'total_amount', 'avg_transaction_amount', 'days_since_last_purchase',
            'purchase_frequency', 'avg_days_between_purchases']

# Target: purchase propensity (based on frequency and recent amount)
target = 'high_propensity_customer'

# Prepare the data
X = df[features].fillna(0)
y = df[target]

# Encode categorical variables
X = pd.get_dummies(X, columns=['income_category', 'region'], drop_first=True)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Features: {list(X.columns)}")
print(f"Target: {target}")
print(f"Class distribution: {y.value_counts(normalize=True)}")

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
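The cells above impute, encode, and scale the full dataset before splitting, so the scaler's statistics include the test rows. A sketch of one alternative is to wrap preprocessing and the classifier in a scikit-learn `Pipeline`, so that every transformer is fit on the training rows only. The toy DataFrame below is synthetic and stands in for `df`, using a subset of the notebook's column names. Median imputation is an illustrative choice here, not the notebook's `fillna(0)`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the customer table (assumption, not real data).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "total_amount": rng.gamma(2.0, 500.0, n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})
toy.loc[::25, "total_amount"] = np.nan  # inject a few missing values
toy["high_propensity_customer"] = (toy["age"] < 45).astype(int)

numeric = ["age", "total_amount"]
categorical = ["region"]

# Numeric columns: impute then scale. Categorical columns: one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy[numeric + categorical],
    toy["high_propensity_customer"],
    test_size=0.3,
    random_state=42,
    stratify=toy["high_propensity_customer"],
)

# fit() learns imputation, scaling, and encoding from the training rows only.
clf.fit(X_train, y_train)
test_proba = clf.predict_proba(X_test)[:, 1]
```

Because the pipeline carries its own preprocessing, the same `clf` object can later score raw rows without a separate `scaler.transform` call.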

In [None]:
# Model 1: Logistic Regression
lr_model = LogisticRegression(random_state=42, class_weight='balanced')
lr_model.fit(X_train, y_train)

# Model 2: Random Forest
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    class_weight='balanced'
)
rf_model.fit(X_train, y_train)

print("Models trained successfully!")

In [None]:
# Model evaluation
models = {'Logistic Regression': lr_model, 'Random Forest': rf_model}

for name, model in models.items():
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    print(f"\n=== {name} ===")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Low Propensity', 'High Propensity'],
                yticklabels=['Low Propensity', 'High Propensity'])
    plt.title(f'Confusion Matrix - {name}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

In [None]:
# ROC curves
plt.figure(figsize=(8, 6))

for name, model in models.items():
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend()
plt.show()

In [None]:
# Feature importance (Random Forest)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
plt.title('Top 10 Most Important Features (Random Forest)')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

print("Top 10 most important features:")
print(feature_importance.head(10))
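The `feature_importances_` attribute of a random forest is impurity-based: it is computed from the training data and tends to favor features with many distinct values. As a complementary check, permutation importance on the held-out set can be sketched as below. The data is synthetic (`make_classification`) and the names `feature_0`, `feature_1`, and so on are placeholders rather than the notebook's columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the scaled customer features (assumption).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature in turn on the test set and record the drop in ROC AUC.
result = permutation_importance(
    rf, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=42
)

order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(
        f"feature_{i}: {result.importances_mean[i]:.3f} "
        f"+/- {result.importances_std[i]:.3f}"
    )
```

Features that rank high in both views are more credible drivers than features that rank high in only one.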

## Business Insights and Recommendations

### Key Findings:
1. **Propensity Drivers**: [Analysis of important features]
2. **Model Performance**: [Comparison of scores]
3. **Customer Segments**: [Identification of high-propensity groups]

### Recommendations:
1. **Marketing Targeting**: Focus on customers identified as high propensity
2. **Retention Strategies**: Programs for at-risk customers
3. **Personalized Campaigns**: Messages tailored to customer profiles
4. **Optimal Timing**: Strategic moments for communications

### Next Steps:
- Deploy the selected model to production
- Integrate predictions into the CRM system
- Measure the impact of targeted campaigns
- Update the model with new data
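To check whether targeting the highest-scoring customers pays off, one common measure is lift: the conversion rate in the top-scored fraction divided by the overall conversion rate. A minimal sketch, assuming observed outcomes are available for the scored customers. The helper `lift_at_top_fraction` is hypothetical and the arrays are made-up illustration data:

```python
import numpy as np


def lift_at_top_fraction(y_true, scores, fraction=0.1):
    """Conversion rate among the top-scored fraction, relative to the base rate."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n_top = max(1, int(round(fraction * len(scores))))
    # Indices of the n_top highest scores.
    top_idx = np.argsort(scores)[::-1][:n_top]
    base_rate = y_true.mean()
    if base_rate == 0:
        raise ValueError("No positive outcomes: lift is undefined")
    return y_true[top_idx].mean() / base_rate


# Illustration data: 10 customers, 3 of whom purchased.
outcomes = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift = lift_at_top_fraction(outcomes, scores, fraction=0.2)
print(f"Lift in the top 20%: {lift:.2f}")
```

A lift above 1 means the top-scored group converts more often than the customer base as a whole.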

In [None]:
# Close the connection
conn.close()
print("Analysis complete and connection closed!")

# Purchase Propensity Model

This notebook develops a model to predict customer purchase propensity based on demographic and behavioral features.

## Objectives:
- Predict likelihood of future purchases
- Identify key drivers of purchase behavior
- Segment customers by purchase potential
- Provide targeting recommendations

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler
import snowflake.connector

plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Connect to Snowflake
conn_params = {
    'user': 'workshop_user',
    'password': 'VotreMotDePasse123!',  # placeholder: replace with your own password
    'account': 'dnb65599',
    'warehouse': 'ANYCOMPANY_WH',
    'database': 'ANYCOMPANY_LAB',
    'schema': 'ANALYTICS'
}

conn = snowflake.connector.connect(**conn_params)

In [None]:
# Load customer data with synthetic target
# Note: In a real scenario, the target would be based on actual purchase history
query = """
SELECT
    *,
    -- Synthetic target: high propensity if income > 60000 and age < 50
    CASE WHEN annual_income > 60000 AND age < 50 THEN 1 ELSE 0 END AS purchase_propensity
FROM ANALYTICS.customer_ml_features
"""

df = pd.read_sql(query, conn)
# Snowflake returns unquoted column names in upper case; lower-case them
# so that lookups such as df['purchase_propensity'] resolve.
df.columns = df.columns.str.lower()
print(f"Loaded {len(df)} customer records")
print(f"Target distribution: {df['purchase_propensity'].value_counts(normalize=True)}")
print(df.head())
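The note in the cell above says that a real target would come from actual purchase history. One way this could look, as a sketch only: pick a cutoff date, then label each customer known before the cutoff by whether they purchased within a fixed window after it. The `transactions` table, its column names, the cutoff date, and the 90-day horizon are all hypothetical:

```python
import pandas as pd

# Hypothetical transaction log (illustration data, not from Snowflake).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 4],
    "purchase_date": pd.to_datetime([
        "2024-01-10", "2024-04-02", "2024-02-15",
        "2024-03-01", "2024-05-20", "2024-01-05",
    ]),
})

cutoff = pd.Timestamp("2024-03-31")   # features may only use data up to here
horizon = pd.Timedelta(days=90)       # label window after the cutoff

# Customers with at least one purchase on or before the cutoff.
history = transactions[transactions["purchase_date"] <= cutoff]

# Purchases that fall inside the label window.
future = transactions[
    (transactions["purchase_date"] > cutoff)
    & (transactions["purchase_date"] <= cutoff + horizon)
]

labels = pd.DataFrame({"customer_id": sorted(history["customer_id"].unique())})
labels["purchase_propensity"] = (
    labels["customer_id"].isin(future["customer_id"]).astype(int)
)
print(labels)
```

Keeping features strictly before the cutoff and labels strictly after it avoids training on information that would not be available at scoring time.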

In [None]:
# Feature selection
features = ['age', 'annual_income', 'age_group_encoded', 'income_segment_encoded',
           'region_north', 'region_south', 'region_east', 'region_west']

X = df[features]
y = df['purchase_propensity']

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Selected features: {features}")
print(f"Feature matrix shape: {X_scaled.shape}")

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# Train Logistic Regression model
lr_model = LogisticRegression(random_state=42, class_weight='balanced')
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_proba_lr = lr_model.predict_proba(X_test)[:, 1]

print("Logistic Regression trained!")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba_lr):.3f}")

In [None]:
# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

print("Random Forest trained!")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba_rf):.3f}")

In [None]:
# Model comparison
models = ['Logistic Regression', 'Random Forest']
predictions = [y_pred_lr, y_pred_rf]
probabilities = [y_pred_proba_lr, y_pred_proba_rf]

for name, pred, proba in zip(models, predictions, probabilities):
    print(f"\n{name}:")
    print(classification_report(y_test, pred))
    print(f"ROC AUC: {roc_auc_score(y_test, proba):.3f}")

In [None]:
# ROC Curves
plt.figure(figsize=(8, 6))

for name, proba in zip(models, probabilities):
    fpr, tpr, _ = roc_curve(y_test, proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Purchase Propensity Models')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Feature importance (Random Forest)
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance)
plt.title('Feature Importance - Purchase Propensity')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

print("Top features driving purchase propensity:")
print(feature_importance.head())

In [None]:
# Generate propensity scores for all customers
df['propensity_score'] = rf_model.predict_proba(scaler.transform(df[features]))[:, 1]

# Create propensity segments
df['propensity_segment'] = pd.qcut(df['propensity_score'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

print("Propensity score distribution:")
print(df['propensity_segment'].value_counts().sort_index())

# Visualize propensity distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='propensity_score', hue='propensity_segment', multiple='stack')
plt.title('Customer Purchase Propensity Distribution')
plt.xlabel('Propensity Score')
plt.ylabel('Number of Customers')
plt.show()
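One caveat about the `pd.qcut` call above: when many customers share the same score (which is likely here, since the synthetic target is a deterministic rule the forest can learn almost exactly), the quartile edges can coincide and `pd.qcut` raises a `ValueError`. A possible workaround, shown on made-up scores, is to rank the scores first so every value is distinct:

```python
import pandas as pd

# Made-up scores with heavy ties, as a near-perfect classifier might produce.
scores = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
labels = ["Low", "Medium", "High", "Very High"]

# Direct quartile binning fails because several quantile edges are equal.
try:
    pd.qcut(scores, q=4, labels=labels)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Ranking with method="first" breaks ties by order of appearance,
# which guarantees unique bin edges and equal-sized segments.
segments = pd.qcut(scores.rank(method="first"), q=4, labels=labels)
print(segments.value_counts().sort_index())
```

The trade-off is that tied customers are split between segments arbitrarily, so segment boundaries should not be over-interpreted when ties are common.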

## Business Recommendations

### Key Insights:
1. **Top Drivers**: Income and age are the primary drivers of purchase propensity, as expected given that the synthetic target is defined from `annual_income` and `age`
2. **High Propensity Segments**: Focus marketing efforts on high-propensity customers
3. **Model Performance**: [ROC AUC scores]

### Actionable Recommendations:
1. **Targeted Marketing**: Prioritize high-propensity customers for campaigns
2. **Personalization**: Tailor offers based on propensity scores
3. **Retention Focus**: Develop retention strategies for high-propensity segments
4. **Acquisition Strategy**: Target similar profiles for customer acquisition

### Implementation:
- Integrate propensity scores into CRM
- Use scores for campaign targeting
- Monitor score changes over time
- Update model with new data quarterly
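For the "monitor score changes over time" step, one widely used summary is the Population Stability Index (PSI), which compares the score distribution at training time with a later one. A minimal sketch, assuming scores are probabilities in [0, 1]. The function name, the bin count, and the simulated score samples are illustrative choices, not part of the notebook:

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples of scores in [0, 1], using fixed-width bins."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip empty bins so the logarithm stays finite.
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


# Simulated score samples (assumption): a baseline and a clearly shifted one.
rng = np.random.default_rng(0)
baseline_scores = rng.beta(2, 5, size=5000)
shifted_scores = rng.beta(5, 2, size=5000)

psi_same = population_stability_index(baseline_scores, baseline_scores)
psi_shifted = population_stability_index(baseline_scores, shifted_scores)
print(f"PSI vs. itself: {psi_same:.3f}, PSI vs. shifted: {psi_shifted:.3f}")
```

A PSI near zero means the distributions match; what counts as a worrying value is a convention to agree on with the business, and a large value would be one signal to bring the quarterly retraining forward.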

In [None]:
# Assemble results (kept in memory; nothing is written to Snowflake or disk here)
results_df = df[['customer_id', 'propensity_score', 'propensity_segment']]
print(f"Results ready: {len(results_df)} customer propensity scores")

# Close connection
conn.close()
print("Purchase propensity analysis completed!")