### 1. Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

clients = pd.read_csv('../data/processed/clients.csv')
orders = pd.read_csv('../data/processed/orders.csv')
clients_monthly = pd.read_csv('../data/processed/clients_monthly.csv')

### 2. Data Overview

In [2]:
print("Clients DataFrame:")
clients.info()  # info() prints its summary directly and returns None
print("\nOrders DataFrame:")
orders.info()
print("\nClients Monthly DataFrame:")
clients_monthly.info()

### 3. Target Variable Analysis

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='class', data=clients)
plt.title('Distribution of Client Classes')
plt.show()

The `class` column is the target variable for the classification models. It is derived from each client's `median_ticket` and `efficiency`. The class distribution is imbalanced: most clients fall into the "HighTicket_Efficient" and "LowTicket_Efficient" classes.
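To put numbers on the imbalance rather than reading it off the bar chart, the class shares can be tabulated. The following is a minimal sketch on a toy frame, not the real `clients` data: the helper `class_shares`, the row counts, and the label "LowTicket_Inefficient" are hypothetical and only serve to make the example runnable.

```python
import pandas as pd


def class_shares(df: pd.DataFrame, target: str = "class") -> pd.DataFrame:
    """Return the count and relative share of each class, largest first."""
    counts = df[target].value_counts()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares.round(3)})


# Toy stand-in for `clients` (assumption: the real file has one row per client).
toy_clients = pd.DataFrame(
    {
        "class": ["HighTicket_Efficient"] * 5
        + ["LowTicket_Efficient"] * 4
        + ["LowTicket_Inefficient"] * 1  # hypothetical minority label
    }
)

summary = class_shares(toy_clients)
print(summary)
```

On the real data, `class_shares(clients)` would give the same table; a small share for any class is a hint to consider stratified splits or class weights when modelling.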

### 4. Efficiency Analysis

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(clients['efficiency'], bins=30, kde=True)
plt.title('Distribution of Efficiency')
plt.show()

`efficiency` is a key feature used to create the `class` target variable. It is calculated as the ratio of total orders to total promoter visits. Its distribution is right-skewed, with a long tail of highly efficient clients.
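The ratio described above can be sketched as follows. This assumes the inputs are the `total_orders` and `total_promotor_visits` columns, which is a guess based on the column names used elsewhere in this notebook; the helper `add_efficiency`, the toy values, and the choice to treat zero visits as missing are all hypothetical.

```python
import numpy as np
import pandas as pd


def add_efficiency(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with efficiency = total_orders / total_promotor_visits."""
    out = df.copy()
    # Assumption: clients with zero visits get NaN instead of an infinite ratio.
    visits = out["total_promotor_visits"].replace(0, np.nan)
    out["efficiency"] = out["total_orders"] / visits
    return out


# Toy values, not taken from the real data.
toy = pd.DataFrame(
    {
        "total_orders": [2, 4, 6, 8, 40, 3],
        "total_promotor_visits": [4, 4, 4, 4, 4, 0],
    }
)
toy = add_efficiency(toy)

# Sample skewness; a positive value indicates a longer right tail.
skewness = toy["efficiency"].skew()
print(toy)
print("skewness:", skewness)
```

A positive skewness on the real `efficiency` column would back up the visual impression from the histogram; a log transform is one common option if the tail causes trouble for downstream models.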

### 5. Correlation Analysis

In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(clients.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Client Features')
plt.show()

The correlation matrix shows the relationships between the different features in the `clients` dataframe. There are strong positive correlations between `total_orders`, `total_volume`, `total_income`, and `total_profit`. There is also a strong positive correlation between `total_promotor_visits` and `total_cost`.
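With many columns the heatmap gets hard to read, so it can help to list the strongly correlated pairs explicitly. The sketch below is one possible way to do that; the helper `strong_pairs`, the 0.8 threshold, and the toy values (including the `unrelated` column) are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def strong_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal of 1.0 values is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison is False for NaN, so masked cells are dropped here as well.
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy values, not taken from the real data.
toy = pd.DataFrame(
    {
        "total_orders": [1, 2, 3, 4, 5],
        "total_income": [10, 21, 29, 41, 50],
        "unrelated": [3, 1, 4, 1, 5],  # hypothetical weakly related column
    }
)

pairs = strong_pairs(toy)
print(pairs)
```

Pairs that show up here are candidates for dropping or combining before fitting linear models, since highly collinear features carry largely redundant information.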

### 6. Numerical Feature Distribution

In [None]:
clients.hist(figsize=(20, 15), bins=30, edgecolor='black')
plt.suptitle('Distribution of Numerical Features')
plt.show()

### 7. Categorical Feature Analysis

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(y='city', data=clients, order=clients['city'].value_counts().index)
plt.title('Distribution of Clients by City')
plt.show()

plt.figure(figsize=(10, 6))
sns.countplot(x='channel', data=clients)
plt.title('Distribution of Clients by Channel')
plt.show()

### 8. Relationship between Features and Target Variable

In [None]:
plt.figure(figsize=(12, 8))
sns.boxplot(x='class', y='median_ticket', data=clients)
plt.title('Median Ticket by Client Class')
plt.show()

plt.figure(figsize=(12, 8))
sns.boxplot(x='class', y='efficiency', data=clients)
plt.title('Efficiency by Client Class')
plt.show()

### 9. Time Series Analysis

In [None]:
orders['date'] = pd.to_datetime(orders['date'])
orders.set_index('date', inplace=True)

plt.figure(figsize=(15, 8))
# 'M' resamples to month-end; pandas >= 2.2 deprecates this alias in favor of 'ME'
orders['number_of_orders'].resample('M').sum().plot()
plt.title('Total Orders per Month')
plt.show()

plt.figure(figsize=(15, 8))
orders['income'].resample('M').sum().plot()
plt.title('Total Income per Month')
plt.show()