# 📊 GetAround - Delay Analysis

## Objective
Analyze checkout delays and determine the optimal minimum delay threshold between two rentals.

## Key questions to answer
1. What share of owners' revenue would be affected by the minimum delay?
2. How many rentals would be impacted, depending on the chosen threshold?
3. How often are drivers late?
4. How many problematic cases would be resolved, depending on the threshold?

## 1. Loading libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configuration
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Default figure size
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries loaded successfully")

## 2. Loading the data

In [None]:
# Load the Excel file
df_delays = pd.read_excel('../data/get_around_delay_analysis.xlsx')

print(f"📊 Data loaded: {len(df_delays):,} rows and {len(df_delays.columns)} columns")
# Report a period only when the index is a named index (e.g. a date index)
if df_delays.index.name:
    print(f"\n🗓️ Data period: {df_delays.index.min()} to {df_delays.index.max()}")

## 3. Initial exploration

In [None]:
# Data preview
print("="*80)
print("DATA PREVIEW")
print("="*80)
df_delays.head(10)

In [None]:
# Column information
print("="*80)
print("COLUMN INFORMATION")
print("="*80)
df_delays.info()

In [None]:
# Descriptive statistics
print("="*80)
print("DESCRIPTIVE STATISTICS")
print("="*80)
df_delays.describe()

In [None]:
# Missing values
print("="*80)
print("MISSING VALUES")
print("="*80)
missing = df_delays.isnull().sum()
missing_pct = (missing / len(df_delays) * 100).round(2)
missing_df = pd.DataFrame({
    'Count': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Count'] > 0].sort_values('Count', ascending=False))

## 4. Delay analysis

### 4.1 Overall distribution of delays

In [None]:
# Global delay metrics
print("="*80)
print("📊 GLOBAL METRICS - DELAYS")
print("="*80)

delays = df_delays['delay_at_checkout_in_minutes']
total_rentals = len(df_delays)
# Rows with no recorded delay are excluded from the rates: NaN > 0 is False,
# so they would otherwise be silently counted as on time.
known_rentals = delays.notna().sum()
late_rentals = (delays > 0).sum()
on_time_rentals = known_rentals - late_rentals
late_pct = (late_rentals / known_rentals * 100) if known_rentals > 0 else 0.0

print(f"\n📍 Total rentals: {total_rentals:,}")
print(f"   Rentals with a recorded delay: {known_rentals:,}")
print(f"✅ On-time rentals: {on_time_rentals:,} ({100-late_pct:.1f}%)")
print(f"⏰ Late rentals: {late_rentals:,} ({late_pct:.1f}%)")

if late_rentals > 0:
    late_delays = delays[delays > 0]
    avg_delay = late_delays.mean()
    median_delay = late_delays.median()
    max_delay = delays.max()

    print("\n📈 Delay statistics:")
    print(f"   - Average delay: {avg_delay:.1f} minutes ({avg_delay/60:.1f}h)")
    print(f"   - Median delay: {median_delay:.1f} minutes ({median_delay/60:.1f}h)")
    print(f"   - Maximum delay: {max_delay:.1f} minutes ({max_delay/60:.1f}h)")
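
The share of late rentals depends on how rows without a recorded delay are treated. The sketch below uses made-up values (not taken from the GetAround file) and assumes that a missing `delay_at_checkout_in_minutes` means the delay is unknown, so those rows are left out of the denominator. The names `toy`, `late_share_all` and `late_share_known` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; NaN stands for "delay not recorded"
toy = pd.DataFrame({
    "delay_at_checkout_in_minutes": [-15.0, 30.0, np.nan, 120.0, 0.0, np.nan]
})

delays = toy["delay_at_checkout_in_minutes"]
known = delays.notna().sum()   # rentals with a recorded delay
late = (delays > 0).sum()      # NaN > 0 evaluates to False

late_share_all = late / len(toy) * 100   # missing rows counted as on time
late_share_known = late / known * 100    # missing rows excluded

print(f"Late share over all rows:     {late_share_all:.1f}%")
print(f"Late share over known delays: {late_share_known:.1f}%")
```

On this toy sample the two conventions give noticeably different rates, which is why the choice of denominator is worth stating next to the metric.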

In [None]:
# Chart: distribution of delays
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Distribution of delays', 'Late vs on-time share'),
    specs=[[{'type': 'histogram'}, {'type': 'pie'}]]
)

# Histogram
fig.add_trace(
    go.Histogram(
        x=df_delays['delay_at_checkout_in_minutes'],
        nbinsx=50,
        name='Delays',
        marker_color='indianred'
    ),
    row=1, col=1
)

# Pie chart
fig.add_trace(
    go.Pie(
        labels=['On time', 'Late'],
        values=[on_time_rentals, late_rentals],
        marker_colors=['lightgreen', 'indianred']
    ),
    row=1, col=2
)

fig.update_layout(height=400, showlegend=False, title_text="Overview of delays")
fig.show()

### 4.2 Delays by check-in type

In [None]:
# Analysis by check-in type
print("="*80)
print("📱 DELAYS BY CHECK-IN TYPE")
print("="*80)

checkin_analysis = df_delays.groupby('checkin_type').agg({
    'rental_id': 'count',
    'delay_at_checkout_in_minutes': ['mean', 'median']
})

# Compute the share of late rentals for each type.
# Loop variables are prefixed with `type_` so they do not overwrite the
# global `late_pct` that is reused in the recommendations.
for checkin_type in df_delays['checkin_type'].unique():
    if pd.notna(checkin_type):
        subset = df_delays[df_delays['checkin_type'] == checkin_type]
        type_late_count = (subset['delay_at_checkout_in_minutes'] > 0).sum()
        type_late_pct = (type_late_count / len(subset) * 100)
        print(f"\n{checkin_type}:")
        print(f"  Total rentals: {len(subset):,}")
        print(f"  Late: {type_late_count:,} ({type_late_pct:.1f}%)")
        if type_late_count > 0:
            avg = subset[subset['delay_at_checkout_in_minutes'] > 0]['delay_at_checkout_in_minutes'].mean()
            print(f"  Average delay: {avg:.1f} min")
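
The per-type loop can also be written as a single `groupby` with named aggregation. This is a minimal sketch on made-up data. The `checkin_type` values and delays are hypothetical, and rows with a missing delay are dropped first, on the assumption that a missing value means "unknown" rather than "on time".

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical check-in types and delays
toy = pd.DataFrame({
    "checkin_type": ["mobile", "mobile", "mobile", "mobile", "connect", "connect"],
    "delay_at_checkout_in_minutes": [10.0, -5.0, 30.0, np.nan, 0.0, 45.0],
})

# Keep only rentals with a recorded delay, then flag the late ones
known = toy.dropna(subset=["delay_at_checkout_in_minutes"])
known = known.assign(is_late=known["delay_at_checkout_in_minutes"] > 0)

summary = known.groupby("checkin_type").agg(
    total=("is_late", "size"),     # rentals with a recorded delay
    late=("is_late", "sum"),       # number of late rentals
    late_pct=("is_late", "mean"),  # share of late rentals (0..1)
)
summary["late_pct"] = summary["late_pct"] * 100

print(summary)
```

The result is one row per check-in type, which can be passed directly to a bar chart instead of building a list of dictionaries by hand.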

In [None]:
# Comparative charts by check-in type
checkin_stats = []
for checkin_type in df_delays['checkin_type'].dropna().unique():
    subset = df_delays[df_delays['checkin_type'] == checkin_type]
    # `type_` prefix: avoid overwriting the global `late_pct` and `avg_delay`
    type_late_count = (subset['delay_at_checkout_in_minutes'] > 0).sum()
    type_late_pct = (type_late_count / len(subset) * 100)
    type_avg_delay = subset[subset['delay_at_checkout_in_minutes'] > 0]['delay_at_checkout_in_minutes'].mean() if type_late_count > 0 else 0

    checkin_stats.append({
        'Type': checkin_type,
        'Total': len(subset),
        'Retards': type_late_count,
        'Pct_retards': type_late_pct,
        'Retard_moyen': type_avg_delay
    })

df_checkin = pd.DataFrame(checkin_stats)

# Charts
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('% of late rentals by type', 'Average delay by type (minutes)')
)

fig.add_trace(
    go.Bar(x=df_checkin['Type'], y=df_checkin['Pct_retards'], name='% late', marker_color='coral'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(x=df_checkin['Type'], y=df_checkin['Retard_moyen'], name='Average delay', marker_color='skyblue'),
    row=1, col=2
)

fig.update_layout(height=400, showlegend=False)
fig.show()

### 4.3 Impact on subsequent rentals

In [None]:
# Keep rentals that follow a previous rental
# (a non-null time delta means the rental has a predecessor)
df_with_next = df_delays[df_delays['time_delta_with_previous_rental_in_minutes'].notna()].copy()

print("="*80)
print("🔗 IMPACT ON SUBSEQUENT RENTALS")
print("="*80)
print(f"\nRentals following a previous rental: {len(df_with_next):,}")
print(f"Share of total: {len(df_with_next)/len(df_delays)*100:.1f}%")

# Flag problematic cases.
# Note: this compares each rental's own checkout delay with the gap to the
# rental before it, used here as a proxy for the delay the next driver faces.
df_with_next['is_problematic'] = (
    (df_with_next['delay_at_checkout_in_minutes'] > 0) & 
    (df_with_next['delay_at_checkout_in_minutes'] > df_with_next['time_delta_with_previous_rental_in_minutes'])
)

total_problems = df_with_next['is_problematic'].sum()
problem_pct = (total_problems / len(df_with_next) * 100) if len(df_with_next) > 0 else 0.0

print("\n⚠️ Problematic cases (delay > gap between rentals):")
print(f"   Count: {total_problems:,}")
print(f"   Share: {problem_pct:.2f}%")
print("\n💡 These cases are situations where the next customer was impacted")
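
The delay that affects a rental is the checkout delay of the rental before it, not its own. If the data links each rental to its predecessor, that delay can be looked up with a self-merge. The sketch below runs on made-up data and assumes a linking column named `previous_ended_rental_id`. That column name and `previous_delay_in_minutes` are assumptions and should be checked against the actual file.

```python
import numpy as np
import pandas as pd

# Toy data; `previous_ended_rental_id` is an ASSUMED column that points to
# the rental that ended just before this one.
toy = pd.DataFrame({
    "rental_id": [1, 2, 3, 4],
    "delay_at_checkout_in_minutes": [90.0, -10.0, 20.0, 5.0],
    "previous_ended_rental_id": [np.nan, 1, np.nan, 3],
    "time_delta_with_previous_rental_in_minutes": [np.nan, 60.0, np.nan, 30.0],
})

# Lookup table: delay of each rental, keyed the way followers refer to it
previous = toy[["rental_id", "delay_at_checkout_in_minutes"]].rename(columns={
    "rental_id": "previous_ended_rental_id",
    "delay_at_checkout_in_minutes": "previous_delay_in_minutes",
})

# Keep rentals that have a predecessor and attach the predecessor's delay
followers = toy.dropna(subset=["previous_ended_rental_id"]).astype(
    {"previous_ended_rental_id": "int64"}
)
linked = followers.merge(previous, on="previous_ended_rental_id", how="left")

# Problematic: the previous driver returned the car after the gap had elapsed
linked["is_problematic"] = (
    linked["previous_delay_in_minutes"]
    > linked["time_delta_with_previous_rental_in_minutes"]
)

print(linked[["rental_id", "previous_delay_in_minutes", "is_problematic"]])
```

In this toy sample, rental 2 is flagged because its predecessor was 90 minutes late for a 60-minute gap, while rental 4 is not (20 minutes late for a 30-minute gap).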

## 5. Threshold simulation

### 5.1 Impact of different thresholds

In [None]:
# Thresholds to test
thresholds = [0, 30, 60, 120, 180, 240, 360, 480, 720]  # in minutes

print("="*80)
print("🎯 SIMULATION OF DIFFERENT THRESHOLDS")
print("="*80)

results = []

for threshold in thresholds:
    # Rentals that would be blocked (gap shorter than the threshold)
    blocked = df_with_next[df_with_next['time_delta_with_previous_rental_in_minutes'] < threshold]
    blocked_count = len(blocked)
    blocked_pct = (blocked_count / len(df_with_next) * 100)
    
    # Problems solved: problematic cases among the blocked rentals
    problems_solved = df_with_next[
        (df_with_next['is_problematic']) & 
        (df_with_next['time_delta_with_previous_rental_in_minutes'] < threshold)
    ]
    solved_count = len(problems_solved)
    solved_pct = (solved_count / total_problems * 100) if total_problems > 0 else 0
    
    results.append({
        'Seuil_min': threshold,
        'Seuil_h': threshold / 60,
        'Locations_bloquees': blocked_count,
        'Pct_bloquees': blocked_pct,
        'Problemes_resolus': solved_count,
        'Pct_resolus': solved_pct
    })
    
    print(f"\nThreshold: {threshold:3d} min ({threshold/60:5.1f}h)")
    print(f"  Rentals blocked : {blocked_count:5,} ({blocked_pct:5.1f}%)")
    print(f"  Problems solved : {solved_count:5,} ({solved_pct:5.1f}%)")

df_results = pd.DataFrame(results)
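
The simulation logic can be wrapped in a small helper so it is easy to rerun on a subset of the data. This is a sketch on made-up data. `simulate_thresholds` and its output column names are hypothetical, and the input is assumed to carry the same `time_delta_with_previous_rental_in_minutes` and `is_problematic` columns as `df_with_next`.

```python
import pandas as pd


def simulate_thresholds(df, thresholds):
    """Count blocked rentals and solved problems for each threshold (minutes)."""
    gap = df["time_delta_with_previous_rental_in_minutes"]
    rows = []
    for t in thresholds:
        blocked = gap < t  # rentals whose gap is shorter than the threshold
        rows.append({
            "threshold_min": t,
            "blocked": int(blocked.sum()),
            "solved": int((blocked & df["is_problematic"]).sum()),
        })
    return pd.DataFrame(rows)


# Toy data with hypothetical gaps and flags
toy = pd.DataFrame({
    "time_delta_with_previous_rental_in_minutes": [0, 30, 60, 120, 300],
    "is_problematic": [True, False, True, False, False],
})

summary = simulate_thresholds(toy, [0, 60, 180])
print(summary)
```

Because the comparison is strict (`gap < t`), a threshold of 0 blocks nothing and a rental with a 60-minute gap is only blocked by thresholds above 60 minutes.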

In [None]:
# Display the full table
print("\n" + "="*80)
print("📊 SUMMARY TABLE")
print("="*80)
display(df_results.style.format({
    'Seuil_h': '{:.1f}h',
    'Locations_bloquees': '{:,.0f}',
    'Pct_bloquees': '{:.1f}%',
    'Problemes_resolus': '{:,.0f}',
    'Pct_resolus': '{:.1f}%'
}))

### 5.2 Trade-off chart

In [None]:
# Interactive trade-off chart
fig = go.Figure()

# Curve 1: rentals blocked (negative impact)
fig.add_trace(go.Scatter(
    x=df_results['Seuil_h'],
    y=df_results['Pct_bloquees'],
    mode='lines+markers',
    name='Rentals blocked (%)',
    line=dict(color='red', width=3),
    marker=dict(size=10),
    hovertemplate='<b>Threshold</b>: %{x:.1f}h<br><b>Blocked</b>: %{y:.1f}%<extra></extra>'
))

# Curve 2: problems solved (positive impact)
fig.add_trace(go.Scatter(
    x=df_results['Seuil_h'],
    y=df_results['Pct_resolus'],
    mode='lines+markers',
    name='Problems solved (%)',
    line=dict(color='green', width=3),
    marker=dict(size=10),
    hovertemplate='<b>Threshold</b>: %{x:.1f}h<br><b>Solved</b>: %{y:.1f}%<extra></extra>'
))

fig.update_layout(
    title='🎯 Trade-off: rentals blocked vs problems solved',
    xaxis_title='Minimum threshold (hours)',
    yaxis_title='Percentage (%)',
    hovermode='x unified',
    height=500,
    template='plotly_white',
    legend=dict(x=0.7, y=0.5)
)

fig.show()

## 6. Recommendations

In [None]:
# Identify the optimal threshold (example: best ratio of problems solved to rentals blocked)
df_results['ratio'] = df_results['Pct_resolus'] / (df_results['Pct_bloquees'] + 1)  # +1 to avoid division by 0
optimal_idx = df_results['ratio'].idxmax()
optimal_threshold = df_results.loc[optimal_idx]

print("="*80)
print("💡 RECOMMENDATIONS")
print("="*80)

print(f"\n🎯 Suggested optimal threshold: {optimal_threshold['Seuil_min']:.0f} minutes ({optimal_threshold['Seuil_h']:.1f} hours)")
print("\n📊 Expected impact with this threshold:")
print(f"   ✅ Problems solved: {optimal_threshold['Problemes_resolus']:.0f} ({optimal_threshold['Pct_resolus']:.1f}%)")
print(f"   ⚠️  Rentals blocked: {optimal_threshold['Locations_bloquees']:.0f} ({optimal_threshold['Pct_bloquees']:.1f}%)")

print("\n🔍 Key insights:")
print(f"   - {late_pct:.1f}% of rentals are late")
print(f"   - {problem_pct:.2f}% of consecutive rentals are problematic")
print(f"   - The '{df_checkin.loc[df_checkin['Pct_retards'].idxmax(), 'Type']}' type has the highest share of late rentals")

print("\n💼 Scope recommendation:")
print("   - Start with 'Connect' cars only")
print("   - Extend gradually depending on the results")
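
The `ratio` heuristic above divides the share of problems solved by the share of rentals blocked (plus 1), so it tends to favour small thresholds that block few rentals. The sketch below applies it to made-up percentages and, for comparison, shows one possible alternative rule: the smallest threshold that solves at least a target share of problems. The `target` value and the alternative rule are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy simulation results with hypothetical percentages
toy_results = pd.DataFrame({
    "Seuil_min": [0, 30, 60, 120],
    "Pct_bloquees": [0.0, 10.0, 25.0, 50.0],
    "Pct_resolus": [0.0, 40.0, 70.0, 90.0],
})

# Rule 1: same ratio as in the notebook (+1 avoids division by zero)
toy_results["ratio"] = toy_results["Pct_resolus"] / (toy_results["Pct_bloquees"] + 1)
best_by_ratio = toy_results.loc[toy_results["ratio"].idxmax(), "Seuil_min"]

# Rule 2 (assumed alternative): smallest threshold reaching a target share
target = 60.0  # hypothetical business target, in % of problems solved
meets_target = toy_results[toy_results["Pct_resolus"] >= target]
best_by_target = meets_target["Seuil_min"].min()

print(f"Best threshold by ratio:  {best_by_ratio} min")
print(f"Best threshold by target: {best_by_target} min")
```

On these toy numbers the two rules disagree, which suggests reading the ratio as one input to the decision rather than as the answer.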

## 7. Saving the insights

In [None]:
# Save the simulation results for the dashboard
df_results.to_csv('../data/threshold_simulation_results.csv', index=False)
print("✅ Results saved to '../data/threshold_simulation_results.csv'")

## 📝 Conclusions

### Key takeaways:
1. **Delay frequency**: [To be completed after execution]
2. **Impact on subsequent customers**: [To be completed after execution]
3. **Optimal threshold**: [To be completed after execution]
4. **Trade-off**: balance between protecting customers and lost revenue

### Next steps:
- Build the Streamlit dashboard with these insights
- Implement the interactive simulator
- Present the recommendations to the Product Manager