{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sales Forecasting Model Improvement Challenge\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Welcome to this professional challenge! You are a data science intern at an e-commerce company, and you have been tasked with improving the existing sales forecasting system.\n",
    "\n",
    "The current model does not produce satisfactory results, which causes inventory management and planning problems. Your goal is to analyze this model, identify its weaknesses, and propose concrete improvements."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and library imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n",
    "import tensorflow as tf\n",
    "from tensorflow.keras.models import Sequential\n",
    "from tensorflow.keras.layers import Dense, Dropout, LSTM, Conv1D, MaxPooling1D, Flatten\n",
    "from tensorflow.keras.callbacks import EarlyStopping\n",
    "import sys\n",
    "import warnings\n",
    "\n",
    "# Import the custom utilities\n",
    "sys.path.append('./utils')\n",
    "from visualization import plot_actual_vs_predicted, plot_error_distribution, plot_category_performance\n",
    "from model_evaluation import evaluate_model, calculate_metrics_by_category\n",
    "from data_preprocessing import create_time_features, prepare_sequences\n",
    "\n",
    "# Configuration for reproducibility\n",
    "np.random.seed(42)\n",
    "tf.random.set_seed(42)\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Plot style\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "sns.set_palette('colorblind')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Loading and exploring the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the data\n",
    "data = pd.read_csv('./data/sales_data.csv')\n",
    "\n",
    "# Convert the date column to datetime\n",
    "data['date'] = pd.to_datetime(data['date'])\n",
    "\n",
    "# Show the first rows\n",
    "print(\"Data preview:\")\n",
    "display(data.head())\n",
    "\n",
    "# Dataset information\n",
    "print(\"\\nDataset information:\")\n",
    "print(f\"Number of records: {data.shape[0]}\")\n",
    "print(f\"Period covered: {data['date'].min().date()} to {data['date'].max().date()}\")\n",
    "print(f\"Product categories: {', '.join(data['product_category'].unique())}\")\n",
    "\n",
    "# Descriptive statistics\n",
    "print(\"\\nDescriptive statistics:\")\n",
    "display(data.describe())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot total sales over time\n",
    "plt.figure(figsize=(15, 6))\n",
    "data_grouped = data.groupby('date')['total_sales'].sum().reset_index()\n",
    "plt.plot(data_grouped['date'], data_grouped['total_sales'])\n",
    "plt.title('Total sales over time')\n",
    "plt.xlabel('Date')\n",
    "plt.ylabel('Total sales')\n",
    "plt.grid(True)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Plot total sales by product category\n",
    "plt.figure(figsize=(12, 6))\n",
    "category_sales = data.groupby('product_category')['total_sales'].sum().sort_values(ascending=False)\n",
    "ax = category_sales.plot(kind='bar', color='skyblue')\n",
    "plt.title('Total sales by product category')\n",
    "plt.xlabel('Product category')\n",
    "plt.ylabel('Total sales')\n",
    "plt.xticks(rotation=45)\n",
    "\n",
    "# Add the values above the bars\n",
    "for i, v in enumerate(category_sales):\n",
    "    ax.text(i, v + 0.1, f'{v:.1f}', ha='center')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Correlation between the numeric variables\n",
    "plt.figure(figsize=(10, 8))\n",
    "numeric_data = data.select_dtypes(include=[np.number])\n",
    "correlation_matrix = numeric_data.corr()\n",
    "# Boolean mask that hides the upper triangle (diagonal included) to avoid duplicate cells\n",
    "mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))\n",
    "sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', cbar_kws={'shrink': .8})\n",
    "plt.title('Correlation matrix of the numeric variables')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create time-based features\n",
    "data = create_time_features(data)\n",
    "\n",
    "# Separate the explanatory variables from the target\n",
    "X = data.drop(['date', 'total_sales'], axis=1)\n",
    "y = data['total_sales']\n",
    "\n",
    "# Split into training and test sets (80-20)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "# Identify the categorical and numerical columns\n",
    "categorical_cols = ['product_category', 'weekday']\n",
    "numerical_cols = [col for col in X.columns if col not in categorical_cols]\n",
    "\n",
    "# Preprocessor: scale the numeric columns and one-hot encode the categorical ones.\n",
    "# sparse_threshold=0 forces a dense NumPy output, which Keras accepts directly.\n",
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', StandardScaler(), numerical_cols),\n",
    "        ('cat', OneHotEncoder(drop='first'), categorical_cols)\n",
    "    ],\n",
    "    sparse_threshold=0\n",
    ")\n",
    "\n",
    "# Apply the preprocessor (fitted on the training set only)\n",
    "X_train_processed = preprocessor.fit_transform(X_train)\n",
    "X_test_processed = preprocessor.transform(X_test)\n",
    "\n",
    "print(f\"Training data shape after preprocessing: {X_train_processed.shape}\")\n",
    "print(f\"Test data shape after preprocessing: {X_test_processed.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Baseline model (suboptimal)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_baseline_model(input_shape):\n",
    "    \"\"\"Build the suboptimal baseline model.\"\"\"\n",
    "    model = Sequential([\n",
    "        Dense(64, activation='relu', input_shape=(input_shape,)),\n",
    "        Dense(32, activation='relu'),\n",
    "        Dense(1)  # Regression: no activation on the output layer\n",
    "    ])\n",
    "    \n",
    "    model.compile(optimizer='sgd', loss='mse', metrics=['mae'])\n",
    "    return model\n",
    "\n",
    "# Create the baseline model\n",
    "baseline_model = create_baseline_model(X_train_processed.shape[1])\n",
    "baseline_model.summary()\n",
    "\n",
    "# Configure the callbacks\n",
    "early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)\n",
    "\n",
    "# Train the model\n",
    "baseline_history = baseline_model.fit(\n",
    "    X_train_processed, y_train,\n",
    "    epochs=50,\n",
    "    batch_size=32,\n",
    "    validation_split=0.2,\n",
    "    callbacks=[early_stopping],\n",
    "    verbose=1\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate the baseline model\n",
    "y_pred_baseline = baseline_model.predict(X_test_processed).flatten()\n",
    "\n",
    "# Overall metrics\n",
    "baseline_metrics = evaluate_model(y_test, y_pred_baseline)\n",
    "print(\"\\nBaseline model metrics:\")\n",
    "print(f\"RMSE: {baseline_metrics['rmse']:.2f}\")\n",
    "print(f\"MAE: {baseline_metrics['mae']:.2f}\")\n",
    "print(f\"R²: {baseline_metrics['r2']:.4f}\")\n",
    "print(f\"MAPE: {baseline_metrics['mape']:.2f}%\")\n",
    "\n",
    "# Plot predictions against actual values\n",
    "plot_actual_vs_predicted(y_test, y_pred_baseline, 'Baseline model')\n",
    "\n",
    "# Error distribution\n",
    "plot_error_distribution(y_test, y_pred_baseline, 'Baseline model')\n",
    "\n",
    "# Performance by category\n",
    "X_test_with_categories = X_test.reset_index(drop=True)\n",
    "category_performance = calculate_metrics_by_category(X_test_with_categories, y_test, y_pred_baseline)\n",
    "plot_category_performance(category_performance, 'Baseline model')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Diagnosing the baseline model\n",
    "\n",
    "Analyze the results above and identify the weaknesses of the baseline model. Record your observations in this cell.\n",
    "\n",
    "### Problems identified\n",
    "1. ...\n",
    "2. ...\n",
    "3. ...\n",
    "\n",
    "### Possible improvements\n",
    "1. ...\n",
    "2. ...\n",
    "3. ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Improving the model\n",
    "\n",
    "Implement your improvements here. For each change, document your hypothesis, the change you made, and the results obtained."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Improvement 1: [Name of the change]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hypothesis**: [Explain why you think this change will improve performance]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_improved_model_1(input_shape):\n",
    "    \"\"\"Build an improved model (version 1).\"\"\"\n",
    "    # Implement your improved model here\n",
    "    model = Sequential([\n",
    "        # Adapt this architecture to your hypothesis\n",
    "        Dense(128, activation='relu', input_shape=(input_shape,)),\n",
    "        Dropout(0.3),  # Example: dropout added to reduce overfitting\n",
    "        Dense(64, activation='relu'),\n",
    "        Dropout(0.2),\n",
    "        Dense(32, activation='relu'),\n",
    "        Dense(1)\n",
    "    ])\n",
    "    \n",
    "    # Change the optimizer and its parameters as needed\n",
    "    model.compile(\n",
    "        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),\n",
    "        loss='mse',\n",
    "        metrics=['mae']\n",
    "    )\n",
    "    \n",
    "    return model\n",
    "\n",
    "# Create the improved model\n",
    "improved_model_1 = create_improved_model_1(X_train_processed.shape[1])\n",
    "improved_model_1.summary()\n",
    "\n",
    "# Train the model\n",
    "improved_history_1 = improved_model_1.fit(\n",
    "    X_train_processed, y_train,\n",
    "    epochs=50,\n",
    "    batch_size=32,  # You can change the batch size\n",
    "    validation_split=0.2,\n",
    "    callbacks=[early_stopping],\n",
    "    verbose=1\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate improved model 1\n",
    "y_pred_improved_1 = improved_model_1.predict(X_test_processed).flatten()\n",
    "\n",
    "# Overall metrics\n",
    "improved_metrics_1 = evaluate_model(y_test, y_pred_improved_1)\n",
    "print(\"\\nImproved model 1 metrics:\")\n",
    "print(f\"RMSE: {improved_metrics_1['rmse']:.2f}\")\n",
    "print(f\"MAE: {improved_metrics_1['mae']:.2f}\")\n",
    "print(f\"R²: {improved_metrics_1['r2']:.4f}\")\n",
    "print(f\"MAPE: {improved_metrics_1['mape']:.2f}%\")\n",
    "\n",
    "# Relative improvement over the baseline model\n",
    "rmse_improvement = (baseline_metrics['rmse'] - improved_metrics_1['rmse']) / baseline_metrics['rmse'] * 100\n",
    "print(f\"RMSE improvement: {rmse_improvement:.2f}%\")\n",
    "\n",
    "# Plot predictions against actual values\n",
    "plot_actual_vs_predicted(y_test, y_pred_improved_1, 'Improved model 1')\n",
    "\n",
    "# Performance by category\n",
    "category_performance_improved_1 = calculate_metrics_by_category(X_test_with_categories, y_test, y_pred_improved_1)\n",
    "plot_category_performance(category_performance_improved_1, 'Improved model 1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Results and analysis**: [Comment on the results obtained. Is the improvement significant? Is your hypothesis confirmed?]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Improvement 2: Using an RNN (LSTM) model to capture temporal trends"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hypothesis**: Sales data is sequential by nature and probably contains complex temporal patterns. An LSTM model should capture these temporal dependencies better and improve the predictions, particularly during periods of fluctuating sales."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare the data for the LSTM (sequences)\n",
    "# We use sequences of 7 days to predict the following day\n",
    "sequence_length = 7\n",
    "\n",
    "# Build the training and test sequences\n",
    "X_train_seq, y_train_seq, X_test_seq, y_test_seq = prepare_sequences(\n",
    "    data, sequence_length, train_size=0.8, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Training sequences shape: {X_train_seq.shape}\")\n",
    "print(f\"Test sequences shape: {X_test_seq.shape}\")\n",
    "\n",
    "def create_lstm_model(input_shape):\n",
    "    \"\"\"Build an LSTM model for the time sequences.\"\"\"\n",
    "    model = Sequential([\n",
    "        LSTM(64, return_sequences=True, input_shape=input_shape),\n",
    "        Dropout(0.2),\n",
    "        LSTM(32),\n",
    "        Dropout(0.2),\n",
    "        Dense(16, activation='relu'),\n",
    "        Dense(1)\n",
    "    ])\n",
    "    \n",
    "    model.compile(\n",
    "        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),\n",
    "        loss='mse',\n",
    "        metrics=['mae']\n",
    "    )\n",
    "    \n",
    "    return model\n",
    "\n",
    "# Create the LSTM model; input shape is (time steps, features per step)\n",
    "lstm_model = create_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]))\n",
    "lstm_model.summary()\n",
    "\n",
    "# Train the model\n",
    "lstm_history = lstm_model.fit(\n",
    "    X_train_seq, y_train_seq,\n",
    "    epochs=50,\n",
    "    batch_size=32,\n",
    "    validation_split=0.2,\n",
    "    callbacks=[early_stopping],\n",
    "    verbose=1\n",
    ")"
   ]
  },
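  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate the LSTM on the test sequences (assumes y_test_seq is on the same scale as y_test)\n",
    "y_pred_lstm = lstm_model.predict(X_test_seq).flatten()\n",
    "lstm_metrics = evaluate_model(y_test_seq, y_pred_lstm)\n",
    "print(f\"LSTM RMSE: {lstm_metrics['rmse']:.2f} | R²: {lstm_metrics['r2']:.4f}\")"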
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Improvement 3: [Another approach of your choice]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hypothesis**: [Your hypothesis for this third approach]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Implement your third approach here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Model comparison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Performance summary table\n",
    "models = ['Baseline', 'Improved 1', 'LSTM']\n",
    "rmse_values = [baseline_metrics['rmse'], improved_metrics_1['rmse'], 0]  # Replace the 0 with the LSTM result\n",
    "r2_values = [baseline_metrics['r2'], improved_metrics_1['r2'], 0]  # Replace the 0 with the LSTM result\n",
    "\n",
    "# Side-by-side comparison plots\n",
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))\n",
    "\n",
    "# RMSE (lower is better)\n",
    "ax1.bar(models, rmse_values, color=['lightcoral', 'lightgreen', 'lightblue'])\n",
    "ax1.set_title('RMSE comparison (lower is better)')\n",
    "ax1.set_ylabel('RMSE')\n",
    "for i, v in enumerate(rmse_values):\n",
    "    if v > 0:\n",
    "        ax1.text(i, v + 0.5, f\"{v:.2f}\", ha='center')\n",
    "\n",
    "# R² (higher is better)\n",
    "ax2.bar(models, r2_values, color=['lightcoral', 'lightgreen', 'lightblue'])\n",
    "ax2.set_title('R² comparison (higher is better)')\n",
    "ax2.set_ylabel('R²')\n",
    "for i, v in enumerate(r2_values):\n",
    "    if v > 0:\n",
    "        ax2.text(i, v + 0.01, f\"{v:.4f}\", ha='center')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Business impact\n",
    "\n",
    "Analyze the business impact of the improvements made to the model. Quantify the benefits for the company.\n",
    "\n",
    "### Impact estimate\n",
    "\n",
    "1. **Reduction in forecast errors**:\n",
    "   - Baseline model: RMSE of X\n",
    "   - Best model: RMSE of Y\n",
    "   - Improvement: Z%\n",
    "\n",
    "2. **Estimated operational benefits**:\n",
    "   - Potential reduction in excess inventory: ...\n",
    "   - Improvement in product availability: ...\n",
    "   - Impact on customer satisfaction: ...\n",
    "   - Estimated annual savings: ...\n",
    "\n",
    "3. **Most affected product categories**:\n",
    "   - ...\n",
    "   - ...\n",
    "   - ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Conclusions and recommendations\n",
    "\n",
    "### Summary of improvements\n",
    "1. ...\n",
    "2. ...\n",
    "3. ...\n",
    "\n",
    "### Technical recommendations\n",
    "1. ...\n",
    "2. ...\n",
    "3. ...\n",
    "\n",
    "### Business recommendations\n",
    "1. ...\n",
    "2. ...\n",
    "3. ...\n",
    "\n",
    "### Next steps\n",
    "1. ...\n",
    "2. ...\n",
    "3. ..."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}