# Customer Personality Analysis for Marketing Optimization
## 1. Context
In a competitive environment, companies need to understand their customers to optimize their marketing strategies and offer products and services that truly respond to their needs. Customer personality analysis allows you to identify different customer segments and their behaviors, which helps improve product personalization and the effectiveness of marketing campaigns.

This project is aimed at a company that sells products through various channels (physical store, website, catalog), using detailed data on customer purchases, their responses to campaigns and their demographic characteristics.

## Content: Attributes

### People

| Attribute                |Description                                   |
|-------------------------|--------------------------------------------------|
| ID                      |Unique customer identifier                  |
| Year_Birth              | Client's year of birth                    |
| Education               | Client's educational level                       |
| Marital_Status          | Client's Marital Status                          |
| Income                  | Client's annual household income              |
| Kidhome                 | Number of children in client's household          |
| Teenhome                | Number of teenagers in client's home    |
| Dt_Customer             | Customer registration date with the company  |
| Recency                 | Number of days since the customer's last purchase |
| Complain                | 1 if the customer complained in the last 2 years, 0 otherwise |

### Products

| Attribute | Description |
|-------------------------|---------------------------------------------------|
| MntWines | Amount spent on wine in the last 2 years |
| MntFruits | Amount spent on fruits in the last 2 years |
| MntMeatProducts | Amount spent on meat in the last 2 years |
| MntFishProducts | Amount spent on fish in the last 2 years |
| MntSweetProducts | Amount spent on sweets in the last 2 years |
| MntGoldProds | Amount spent on gold products in the last 2 years |

### Promotion

| Attribute | Description |
|-------------------------|---------------------------------------------------|
| NumDealsPurchases | Number of purchases made with discount |
| AcceptedCmp1 | 1 if the customer accepted the offer in the 1st campaign, 0 otherwise |
| AcceptedCmp2 | 1 if the customer accepted the offer in the 2nd campaign, 0 otherwise |
| AcceptedCmp3 | 1 if the customer accepted the offer in the 3rd campaign, 0 otherwise |
| AcceptedCmp4 | 1 if the customer accepted the offer in the 4th campaign, 0 otherwise |
| AcceptedCmp5 | 1 if the customer accepted the offer in the 5th campaign, 0 otherwise |
| Response | 1 if the customer accepted the offer in the last campaign, 0 otherwise |

### Place

| Attribute | Description |
|-------------------------|---------------------------------------------------|
| NumWebPurchases | Number of purchases made through the company website |
| NumCatalogPurchases | Number of purchases made using a catalog |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebVisitsMonth | Number of visits to the company website in the last month |

## 2. Project Objectives

1. **Segment customers**  
   Segment customers into different groups based on their purchasing behavior and demographic characteristics.

2. **Identify popular products**  
   Identify the most popular products and the segments that spend the most on them.

3. **Evaluate campaign effectiveness**  
   Evaluate the effectiveness of marketing campaigns based on customer characteristics.

4. **Predict response to campaigns**  
   Predict the probability that a customer will respond to future marketing campaigns, using predictive models.

5. **Optimize marketing strategies**  
   Optimize marketing strategies by focusing resources on the most valuable and receptive segments.


In [2]:
# Import the required libraries
import pandas as pd
import plotly.express as px
from google.colab import files
import plotly.graph_objects as go
from plotly.subplots import make_subplots

"""Data Loading and Preparation"""
# Load the dataset (tab-separated file uploaded through Colab)
uploaded = files.upload()
df = pd.read_csv('/content/marketing_campaign.csv', sep='\t')


Saving marketing_campaign.csv to marketing_campaign.csv


In [None]:
"""Data cleaning"""
# (Missing values are not significant, so no cleaning is applied here;
# the few that exist are imputed later, before clustering)

# Create new variables
# Compute age
current_year = pd.Timestamp.now().year
df['Age'] = current_year - df['Year_Birth']

# Compute total spending
df['Total_Spent'] = df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)

In [None]:
"""Distribution of key variables"""

# Age
fig_age = px.histogram(df, x='Age', title='Customer Age Distribution', nbins=30)
fig_age.update_layout(xaxis_title='Age', yaxis_title='Frequency')
fig_age.show()

# Income
fig_income = px.histogram(df, x='Income', title='Customer Income Distribution', nbins=30)
fig_income.update_layout(xaxis_title='Income', yaxis_title='Frequency')
fig_income.show()

# Total spending
fig_total_spent = px.histogram(df, x='Total_Spent', title='Distribution of Total Spending on Products', nbins=30)
fig_total_spent.update_layout(xaxis_title='Total Spending', yaxis_title='Frequency')
fig_total_spent.show()

# Marital status
fig_marital = px.histogram(df, x='Marital_Status', title='Customer Marital Status Distribution',
                            color='Marital_Status', text_auto=True)
fig_marital.update_layout(xaxis_title='Marital Status', yaxis_title='Frequency')
fig_marital.show()

# Education level
fig_education = px.histogram(df, x='Education', title='Customer Education Level Distribution',
                              color='Education', text_auto=True)
fig_education.update_layout(xaxis_title='Education Level', yaxis_title='Frequency')
fig_education.show()

# Correlation analysis between variables
correlation_matrix = df[['Income', 'Total_Spent', 'Age', 'Recency', 'Kidhome', 'Teenhome']].corr()
fig_correlation = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix')
fig_correlation.update_layout(xaxis_title='Variables', yaxis_title='Variables')
fig_correlation.show()

# Spending patterns by demographic segment
# Analysis by marital status
marital_analysis = df.groupby('Marital_Status').agg(
    Average_Income=('Income', 'mean'),
    Average_Spent=('Total_Spent', 'mean')
).reset_index()

fig_marital_comparison = px.bar(marital_analysis,
                                 x='Marital_Status',
                                 y=['Average_Income', 'Average_Spent'],
                                 title='Comparative Analysis of Income and Total Spending by Marital Status',
                                 labels={'value': 'Average', 'variable': 'Category'},
                                 barmode='group')
fig_marital_comparison.update_layout(xaxis_title='Marital Status', yaxis_title='Average')
fig_marital_comparison.show()

# Analysis by education level
education_analysis = df.groupby('Education').agg(
    Average_Income=('Income', 'mean'),
    Average_Spent=('Total_Spent', 'mean')
).reset_index()

fig_education_comparison = px.bar(education_analysis,
                                   x='Education',
                                   y=['Average_Income', 'Average_Spent'],
                                   title='Comparative Analysis of Income and Total Spending by Education Level',
                                   labels={'value': 'Average', 'variable': 'Category'},
                                   barmode='group')
fig_education_comparison.update_layout(xaxis_title='Education Level', yaxis_title='Average')
fig_education_comparison.show()

# Distribution of product spending by education level
product_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
product_names = ['Wines', 'Fruits', 'Meat', 'Fish', 'Sweets', 'Gold']

for education in df['Education'].unique():
    # Total spent per product category by customers with this education level
    spend_by_product = df.loc[df['Education'] == education, product_columns].sum(axis=0)
    fig = px.pie(
        values=spend_by_product.values,
        names=product_names,
        title=f'Distribution of Product Spending - Education Level: {education}'
    )
    fig.show()

In [None]:
"""Customer Segmentation"""

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans


# Select the numeric columns used for segmentation
X = df[['Income', 'Age', 'Total_Spent', 'NumWebPurchases', 'NumStorePurchases']]  # Add further numeric columns here

# Scale the data (StandardScaler ignores NaN when fitting and keeps them in the output)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Impute missing values with the column mean
imputer = SimpleImputer(strategy='mean')
X_scaled_imputed = imputer.fit_transform(X_scaled)

# Compute the inertia for different numbers of clusters
inertia = []
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X_scaled_imputed)
    inertia.append(kmeans.inertia_)

# Hand-written descriptive names for the clusters
# (illustrative labels; they are not derived from the fitted clusters)
cluster_labels = [
    "Clients with high income and high spending",
    "Young clients with low spending",
    "Frequent buyers in store",
    "Clients with moderate web usage",
    "Older clients with medium income",
    "Clients with little or no spending",
    "Additional segment 1",
    "Additional segment 2",
    "Additional segment 3"
]

# Plot the inertia curve
fig = go.Figure()

# Add the inertia trace
fig.add_trace(go.Scatter(
    x=list(range(1, 10)),
    y=inertia,
    mode='lines+markers',
    marker=dict(size=10, color='red'),
    line=dict(color='royalblue', width=3),
    name='Inertia'
))

# Annotate the point where the elbow appears (here: k=3)
fig.add_annotation(x=3, y=inertia[2],
                   text="Possible elbow at k=3",
                   showarrow=True,
                   arrowhead=2, ax=40, ay=-40,
                   font=dict(color='black', size=12),
                   bgcolor="yellow", opacity=0.8)

# Configure the layout with clearer labels
fig.update_layout(
    title='Elbow Method: Identifying the Optimal Number of Clusters',
    xaxis_title='Number of Clusters (k)',
    yaxis_title='Inertia (Compactness of Clusters)',
    template='plotly_white',
    xaxis=dict(
        tickmode='array',
        tickvals=list(range(1, 10)),
        ticktext=cluster_labels  # Descriptive labels shown on the X axis in place of k
    ),
    yaxis=dict(
        showgrid=True,
        zeroline=False
    )
)

# Add an explanatory note to the chart (Plotly uses <br> for line breaks in annotation text)
fig.add_annotation(x=8, y=11000,
                   text="Inertia drops quickly <br>for the first clusters, then <br>levels off, indicating the elbow.",
                   showarrow=False,
                   font=dict(size=12),
                   align="left",
                   bordercolor="black",
                   borderwidth=2,
                   borderpad=4,
                   bgcolor="lightgray",
                   opacity=0.7)

# Show the chart with all details
fig.show()

# Chart Analysis: Elbow Method

The graph above applies the elbow method, a common heuristic for choosing the number of clusters when segmenting data. Its key points are described below.

## 1. Chart Axes

- **Y Axis ("Inertia (Compactness of Clusters)")**:
  - Represents inertia: the sum of the squared distances from each point to the closest centroid.
  - Inertia reflects the compactness of the clusters; the lower the inertia, the more tightly the points within each cluster are grouped.

- **X axis ("Number of Clusters (k)")**:
  - Represents the number of clusters that are being evaluated, from 1 to 9 in this case.
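
To make the definition of inertia concrete, the following minimal sketch computes it by hand with NumPy for four made-up points and two made-up centroids. The data and the variable names (`points`, `centroids`, `inertia_value`) are illustrative assumptions, not values from the project dataset.

```python
import numpy as np

# Hypothetical toy data: four points and two centroids (not from the dataset)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from every point to every centroid, shape (n_points, n_centroids)
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Inertia: each point contributes its squared distance to the closest centroid
inertia_value = sq_dist.min(axis=1).sum()
print(inertia_value)  # 4.0
```

Each point lies at distance 1 from its nearest centroid, so the four squared distances add up to 4.0. This is the same quantity that `KMeans` exposes as `inertia_` after fitting.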

## 2. Shape of the Curve

- **Descending Curve**:
  - Inertia decreases as the number of clusters increases. This is expected, since, as the number of clusters increases, the points are grouped into smaller groups, which reduces the distance between the points and their centroid.

- **Fast Decrease at Start**:
  - In the early stages (when the number of clusters is low, as at k=2 or k=3), the reduction in inertia is significant. This suggests that dividing the data into more clusters markedly improves the compactness of the groups. Adding more clusters in this phase results in better data segmentation.

- **Stabilization**:
  - From k=3 or k=4, the curve begins to stabilize. Adding more clusters no longer improves the compactness of the groups substantially; although inertia continues to decrease, it does so at a much slower rate. This is where the **"elbow"** appears.

## 3. Meaning of the Elbow

- **Elbow**:
  - The elbow is the point where the reduction in inertia slows down significantly. In this graph, it appears to be around k=3, where the drop in inertia is no longer as pronounced.
  - This point indicates that, although adding more clusters after this point still reduces inertia, it does not do so as effectively as in the first clusters. Therefore, it is suggested that **3 clusters** is the optimal number to segment the data, striking a good balance between compactness (low inertia) and simplicity (fewer clusters).
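
Reading the elbow off a chart is subjective, so it can help to cross-check it numerically. The sketch below is one simple heuristic (not the only one): it picks the k where the drop in inertia shrinks the most, i.e. the largest second difference. The function name `elbow_by_second_difference` and the inertia values are hypothetical and chosen only for illustration; in the notebook, the `inertia` list computed in the loop would take their place.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k where the inertia curve bends most sharply."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


ks = list(range(1, 10))
# Made-up inertia values with a visible bend, for illustration only
hypothetical_inertia = [11000, 7000, 4500, 4000, 3600, 3300, 3050, 2850, 2700]
print(elbow_by_second_difference(ks, hypothetical_inertia))  # 3
```

A heuristic like this should be treated as a sanity check; when the curve is smooth, several values of k may be equally defensible.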

## 4. Analysis of Cluster Names

The descriptive names shown as x-axis tick labels in the graph were written by hand and attached to the values of k; they are not derived from fitted clusters. They are best read as hypotheses about the customer segments that clustering might reveal, to be checked against the actual cluster profiles:

| **Cluster Name** | **Description** |
|------------------------------------------------|------------------------------------------------------------|
| **Clients with high income and high spending** | Clients with high purchasing power who tend to spend considerably. |
| **Young clients with low spending** | Young customers who do not yet have high incomes or are more conservative in their purchases. |
| **Frequent buyers in store** | Customers who prefer to make their purchases directly in physical stores. |
| **Clients with moderate web usage** | Customers who interact with the online store in a moderate way. |
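
Descriptive names like these are easier to justify once each cluster has been profiled. The following sketch shows one possible way to do that: fit `KMeans` with a chosen k, assign a label to every customer, and compare the mean of each feature per cluster. It runs on synthetic data (the `toy` DataFrame and its values are assumptions made for illustration); with the project data, `df` and the segmentation columns would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer table: three well-separated groups
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Income": np.concatenate([
        rng.normal(30000, 2000, 20),
        rng.normal(60000, 2000, 20),
        rng.normal(90000, 2000, 20),
    ]),
    "Total_Spent": np.concatenate([
        rng.normal(100, 20, 20),
        rng.normal(600, 50, 20),
        rng.normal(1500, 100, 20),
    ]),
})

features = ["Income", "Total_Spent"]
X_toy = StandardScaler().fit_transform(toy[features])

# Assign each row to one of k=3 clusters
toy["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)

# Profile: average feature values and size of each cluster
profile = toy.groupby("Cluster")[features].mean().round(1)
profile["Size"] = toy["Cluster"].value_counts().sort_index()
print(profile)
```

Names such as "high income and high spending" can then be assigned by inspecting the rows of `profile`, rather than being fixed in advance.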

## 5. Conclusion

The graph suggests that the optimal number of clusters is probably **3**, since this is where the elbow is located and where the inertia begins to decrease at a much slower rate. Splitting the data into more than 3 clusters would probably not substantially improve customer segmentation, but could complicate the interpretation of the results.

This analysis allows us to conclude that, in this case, the customer segmentation model would be more effective if **3 clusters** are used, where the groups formed have sufficient cohesion without the need to divide them into smaller segments.

In [7]:

""" 5. Campaign Effectiveness Analysis"""

# 1. Overall response rate
response_rate = df['Response'].mean()
print(f"The overall response rate is {response_rate:.2%}")

# 1.1 Count responders and non-responders
total_responses = df['Response'].value_counts()
labels = ['Responded', 'Did Not Respond']
values = [total_responses.get(1, 0), total_responses.get(0, 0)]

# 1.2 Create the donut chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3,
                               marker=dict(colors=['#4CAF50', '#FF5733']),
                               textinfo='percent+label',
                               hoverinfo='label+percent')])

# 1.3 Configure the layout
fig.update_layout(title='Overall Response Rate',
                  annotations=[dict(text=f'Response Rate: {response_rate:.2%}',
                                    x=0.5, y=1.16, font_size=20, showarrow=False)],
                  template='plotly_white')

# 1.4 Show the chart
fig.show()

# 2. Response rate per campaign
# 2.1 Define the campaign columns
campaign_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']
campaign_response_rates = df[campaign_columns].mean()

# 2.2 Create the bar chart
fig = px.bar(x=campaign_response_rates.index, y=campaign_response_rates.values,
             labels={'x': 'Campaign', 'y': 'Response Rate'},
             title='Response Rate per Campaign')

# 2.3 Add an annotation about the 0 to 0.07 range
fig.add_annotation(
    text="Response rates are proportions, ranging from 0 to 0.07 (0% to 7%) per campaign.",
    xref="paper", yref="paper",
    x=0.5, y=1.1,  # Position of the annotation on the chart
    showarrow=False,
    font=dict(size=12),
    align="center"
)

# 2.4 Customize the chart layout
fig.update_layout(
    yaxis=dict(tickvals=[0, 0.02, 0.04, 0.06, 0.08, 0.10], ticktext=['0%', '2%', '4%', '6%', '8%', '10%']),
    yaxis_title="Response Rate",
    xaxis_title="Campaign",
    template="plotly_white"  # Switch to a cleaner theme
)

# 2.5 Show the chart
fig.show()

# 3. Campaign effectiveness by demographic segment
def campaign_effectiveness_by_segment(df, segment_column):
    # Mean acceptance (response rate) of each campaign within each segment
    segment_effectiveness = df.groupby(segment_column)[campaign_columns].mean()

    fig = go.Figure()
    for campaign in campaign_columns:
        fig.add_trace(go.Bar(x=segment_effectiveness.index, y=segment_effectiveness[campaign],
                             name=campaign))

    fig.update_layout(barmode='group',
                      title=f'Campaign Effectiveness by {segment_column}',
                      xaxis_title=segment_column,
                      yaxis_title='Response Rate')
    fig.show()

# Analyze effectiveness by education level
campaign_effectiveness_by_segment(df, 'Education')

# Analyze effectiveness by marital status
campaign_effectiveness_by_segment(df, 'Marital_Status')

# 4. Relationship between total spending and campaign response

# 4.1 Create a new column that adds up all accepted campaigns
df['TotalAcceptedCampaigns'] = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']].sum(axis=1)

# 4.2 Create the scatter plot

df['Total_Spent'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']

fig = px.scatter(df, x='Total_Spent', y='TotalAcceptedCampaigns',
                 title='Relationship between Total Spending and Accepted Campaigns',
                 labels={'Total_Spent': 'Total Spending', 'TotalAcceptedCampaigns': 'Number of Accepted Campaigns'})

# 4.3 Add a reference line for the maximum number of campaigns
fig.add_trace(go.Scatter(x=df['Total_Spent'], y=[5] * len(df), mode='lines',
                         name='Maximum Accepted Campaigns (5)',
                         line=dict(color='red', dash='dash')))

# 4.4 Set the Y-axis range
fig.update_layout(yaxis=dict(range=[0, 5]))

# 4.5 Show the chart
fig.show()

# 5. Analysis of the last campaign

# Recompute 'Age' from the year of birth, using 2024 as a fixed reference year
df['Age'] = 2024 - df['Year_Birth']


last_campaign_effectiveness = df.groupby('Response')[['Income', 'Total_Spent', 'Age']].mean()

# 5.1 Create the bar chart
fig = go.Figure(data=[
    go.Bar(
        name='Responded',
        x=['Average Income', 'Average Total Spending', 'Average Age'],
        y=last_campaign_effectiveness.loc[1],
        marker_color='royalblue'  # Custom color for responders
    ),
    go.Bar(
        name='Did Not Respond',
        x=['Average Income', 'Average Total Spending', 'Average Age'],
        y=last_campaign_effectiveness.loc[0],
        marker_color='lightcoral'  # Custom color for non-responders
    )
])

# 5.2 Update the chart layout
fig.update_layout(
    barmode='group',
    title='Comparison of Customers Who Responded vs. Did Not Respond in the Last Campaign',
    xaxis_title='Categories',
    yaxis_title='Average Value',
    template='plotly_white',  # Apply the white-background template
    showlegend=True  # Show the legend
)

# 5.3 Add the values on the bars, centered
for data in fig.data:
    for i, value in enumerate(data.y):
        fig.add_annotation(
            x=data.x[i],
            y=value,
            text=f"{value:.2f}",  # Value format
            showarrow=True,
            arrowhead=1,
            ax=0,
            ay=10,  # Text offset adjusted for centering
            font=dict(size=12),  # Font size
        )

# 5.4 Show the chart
fig.show()

# 6. Hypothetical return on investment (ROI) calculation

# 6.1 List of columns for each campaign
campaign_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']

# 6.2 Create an empty DataFrame to store the ROI results
roi_results = pd.DataFrame(columns=['Campaign', 'Total_Revenue', 'Total_Cost', 'ROI'])

for campaign in campaign_columns:
    # 6.2.1 Compute revenue and cost for each campaign
    df['Revenue'] = df[campaign] * df['Total_Spent']  # Assumes the Total_Spent of customers who accepted the campaign counts as its revenue
    total_revenue = df['Revenue'].sum()
    total_cost = len(df) * 10  # Assumed total cost of 10 per contacted customer

    # 6.2.2 Compute ROI
    roi = (total_revenue - total_cost) / total_cost

    # 6.2.3 Create a temporary DataFrame holding the results
    temp_df = pd.DataFrame({'Campaign': [campaign],
                            'Total_Revenue': [total_revenue],
                            'Total_Cost': [total_cost],
                            'ROI': [roi]})

    # 6.2.4 Concatenate the temporary DataFrame with the results DataFrame
    roi_results = pd.concat([roi_results, temp_df], ignore_index=True)

# 6.3 Show the results DataFrame
print(roi_results)


# 6.4 Create a bar chart to visualize the ROI of each campaign
fig = px.bar(roi_results, x='Campaign', y='ROI',
             title='ROI of Each Campaign',
             labels={'ROI': 'Return on Investment (ROI)', 'Campaign': 'Campaigns'},
             color='ROI',
             color_continuous_scale='Blues')

# 6.4.1 Customize the chart
fig.update_traces(texttemplate='%{y:.2%}', textposition='outside')
fig.update_layout(yaxis_tickformat='%')
fig.show()


# 7. Recommendations based on the analysis
print("\nRecommendations based on the analysis:")
recomendaciones = [
    "1. Focus efforts on the campaigns with the highest response rate.",
    "2. Personalize campaigns according to the educational level and marital status of the clients.",
    "3. Consider total customer spending when designing future campaigns.",
    "4. Investigate the reasons why certain segments respond better to campaigns.",
    "5. Perform a more detailed cost-benefit analysis to optimize the ROI of the campaigns."
]

for recomendacion in recomendaciones:
    print(recomendacion)


The overall response rate is 14.91%


       Campaign Total_Revenue Total_Cost        ROI
0  AcceptedCmp1        213440      22400   8.528571
1  AcceptedCmp2         39230      22400   0.751339
2  AcceptedCmp3        117448      22400   4.243214
3  AcceptedCmp4        190902      22400   7.522411
4  AcceptedCmp5        263426      22400  10.760089



The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.




Recommendations based on the analysis:
1. Focus efforts on the campaigns with the highest response rate.
2. Personalize campaigns according to the educational level and marital status of the clients.
3. Consider total customer spending when designing future campaigns.
4. Investigate the reasons why certain segments respond better to campaigns.
5. Perform a more detailed cost-benefit analysis to optimize the ROI of the campaigns.


# Campaign Effectiveness Analysis

The campaign effectiveness analysis is divided into several sections, each addressing different metrics and visualizations to understand the performance of marketing campaigns. The main components of the analysis are described below.

## 1. Overall response rate

- **Response rate calculation**:
  - Calculated as the average of the `Response` column of the DataFrame `df`.
  - Presented as a percentage, showing the overall response rate.

- **Donut Chart**:
  - A donut chart shows the proportion of customers who responded and did not respond to the last campaign (the `Response` column).
  - The chart is color coded:
    - Green represents those who responded.
    - Red represents those who did not respond.

## 2. Response rate per campaign

- **Calculation of rates per campaign**:
  - Campaign columns are defined (`AcceptedCmp1`, `AcceptedCmp2`, etc.) and the average response rate is calculated for each campaign.

- **Bar Chart**:
  - A bar graph is created that shows the response rate for each campaign.
  - An annotation is included indicating that response rates are proportions ranging between 0 and 0.07 (0% to 7%).

## 3. Effectiveness of campaigns by demographic segment

- **Analysis by segment**:
  - A function `campaign_effectiveness_by_segment` is defined that groups the data by a specific demographic segment and calculates the average response rate for each campaign.

- **Display**:
  - Bar graphs are generated to visualize the effectiveness of campaigns segmented by educational level and marital status.
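
The grouping step behind these charts can be sketched on a tiny, made-up table. The example below assumes hypothetical values for `Education` and two campaign columns; it also reports the number of customers per segment, since a response rate computed on very few customers is noisy and easy to over-interpret.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Master", "Master"],
    "AcceptedCmp1": [1, 0, 0, 0, 1],
    "AcceptedCmp2": [0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of customers who accepted
rates = toy.groupby("Education")[["AcceptedCmp1", "AcceptedCmp2"]].mean()

# Segment sizes, to judge how reliable each rate is
sizes = toy.groupby("Education").size().rename("n_customers")

summary = rates.join(sizes)
print(summary)
```

Here the "PhD" row shows a 50% acceptance rate for the first campaign, but it rests on only two customers, which is the kind of caveat the `n_customers` column makes visible.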

## 4. Relationship between total spending and response to campaigns

- **Scatter Plot**:
  - A scatter plot is created that relates total spending (`Total_Spent`) to the number of accepted campaigns (`TotalAcceptedCampaigns`).
  - A reference line is included that indicates the maximum number of accepted campaigns (5), facilitating interpretation.

## 5. Analysis of the last campaign

- **Comparison**:
  - The average income, total spending and age are calculated for customers who responded and for those who did not respond to the last campaign.
  
- **Bar Chart**:
  - A bar graph is constructed that compares these averages, using different colors for each group (responded vs. did not respond).

## 6. Hypothetical Return on Investment (ROI) Calculation

- **ROI calculation**:
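  - An empty DataFrame is created to store the ROI results of each campaign.
  - Revenue and cost are estimated for each campaign under two simplifying assumptions: revenue is the total spending (`Total_Spent`) of the customers who accepted the campaign, and every customer in the dataset was contacted at a cost of 10. ROI is then calculated as (revenue - cost) / cost.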

- **Display**:
  - A bar graph is generated showing the ROI for each campaign, making it easy to compare their financial effectiveness.
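
As a worked check of the ROI arithmetic, the sketch below recomputes two of the values printed in the results table above. The helper name `roi` is introduced here for illustration; the revenue and cost figures are the ones shown in the notebook output, and the customer count of 2240 follows from the total cost of 22400 at 10 per customer.

```python
def roi(total_revenue, total_cost):
    """Return on investment: net gain divided by cost."""
    return (total_revenue - total_cost) / total_cost


cost_per_contact = 10      # assumed cost per contacted customer (from the notebook)
n_customers = 2240         # 22400 / 10, from the Total_Cost column
total_cost = n_customers * cost_per_contact

print(round(roi(213440, total_cost), 6))  # AcceptedCmp1 -> 8.528571
print(round(roi(39230, total_cost), 6))   # AcceptedCmp2 -> 0.751339
```

Because revenue here is the customers' total spending rather than spending caused by the campaign, these figures are an upper-bound style estimate and should not be read as measured returns.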

## 7. Recommendations based on analysis

- Strategic recommendations are presented based on the results of the analysis:
  1. Focus efforts on the campaigns with the highest response rate.
  2. Customize campaigns according to the educational level and marital status of the clients.
  3. Consider total customer spending when designing future campaigns.
  4. Investigate the reasons why certain segments respond better to campaigns.
  5. Perform a more detailed cost-benefit analysis to optimize the ROI of the campaigns.


In [None]:
"""Predictive Modeling"""

# 1. Import the required libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# 2. Prepare the data
# 'X' holds the customer's demographic and behavioral features
# 'y' holds the column indicating whether the customer responded to the campaign (0 or 1)

# Columns in X (income, total spending, store purchases) and 'Response' as the target variable
X = df[['Income', 'Total_Spent', 'NumStorePurchases']]  # Selected feature columns
y = df['Response']

# 3. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 4. Train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# Note: roc_auc_score receives hard class labels here; passing
# model.predict_proba(X_test)[:, 1] instead would score the ranking of probabilities
roc_auc = roc_auc_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print(f"ROC AUC: {roc_auc:.2f}")

# 7. Visualize the confusion matrix with Plotly
conf_matrix_fig = go.Figure(data=go.Heatmap(
    z=conf_matrix,
    x=['Does Not Respond', 'Responds'],
    y=['Does Not Respond', 'Responds'],
    hoverongaps=False,
    colorscale='Blues'
))

conf_matrix_fig.update_layout(
    title="Confusion Matrix",
    xaxis_title="Predicted",
    yaxis_title="Actual",
    margin=dict(l=100, r=100, t=50, b=50)
)

conf_matrix_fig.show()

# 8. Feature importance of the Random Forest model
importances = model.feature_importances_

# 8.1. Create a DataFrame with the feature importances
importances_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# 8.2. Visualize the feature importances with Plotly
importance_fig = px.bar(importances_df, x='Feature', y='Importance', title='Feature Importance')

importance_fig.update_layout(
    xaxis_title="Features",
    yaxis_title="Importance",
    margin=dict(l=100, r=100, t=50, b=50)
)

importance_fig.show()

Accuracy: 0.83
ROC AUC: 0.57


# Predictive Modeling Analysis

## Introduction
In this analysis, a classification model has been implemented using a **Random Forest Classifier**. The main objective of the model is to predict whether a customer will respond to a marketing campaign, based on characteristics such as **Income**, **Total Spent (Total_Spent)** and **Number of Store Purchases (NumStorePurchases)**. The target variable used is **Response**, which indicates whether the customer has responded to the campaign (1) or not (0).

The sections below analyze each component of the code and the results obtained, covering feature importance, model performance, the graphical displays, and the interpretation of the results in the context of the project.

## 1. Data Preparation

Three key variables were selected as predictors:

- **Income**: Indicates the client's income level.
- **Total_Spent**: The customer's total spending across all product categories over the last 2 years.
- **NumStorePurchases**: Represents the number of purchases made in the store.

These variables were selected because income and spending behavior were expected to indicate a customer's ability and willingness to respond to a marketing campaign.

### Purpose
This step is crucial to ensure that the model has an input data set that reflects the most relevant factors in the customer's response.

## 2. Data Division

The data was split into **training (70%)** and **test (30%)** sets with the `train_test_split` function. This makes it possible to evaluate the model on data not used during training, giving a more objective measure of performance.

### Purpose
Holding out a test set does not prevent overfitting by itself, but it makes overfitting visible: a model that only memorized the training data will score noticeably worse on the held-out data.
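
Because responders are a minority (about 15% in this dataset), a plain random split can leave the training and test sets with slightly different response rates. One common refinement, shown below as a sketch and not as the method used in this notebook, is to pass `stratify=y` to `train_test_split` so both sets keep the same class proportions. The arrays `X_toy` and `y_toy` are made-up stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 30 responders out of 200 customers (15%)
y_toy = np.array([1] * 30 + [0] * 170)
X_toy = np.arange(200).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both splits keep the 15% response rate
print(y_tr.mean(), y_te.mean())
```

With stratification, the metrics computed on the test set are less sensitive to how the few responders happened to be distributed between the two sets.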

## 3. Random Forest Model Training

The selected model is a **Random Forest Classifier**, trained with 100 decision trees. The Random Forest algorithm is robust, adapts well to data with many variables, and tolerates correlation between features.

### Justification
Random Forest is a reasonable choice for this type of classification problem: averaging many trees makes it less prone to overfitting than a single decision tree, and it provides a built-in measure of feature importance.

## 4. Model Predictions

Once trained, the model is used to make predictions on the test data. This allows you to evaluate the performance of the model and adjust parameters if necessary.

### Use
This step shows how the model behaves on customers it has not seen, which approximates how it would behave on new campaign data.
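
The prediction step can be sketched as below, again on synthetic data. It obtains both hard labels with `predict` and class-1 probabilities with `predict_proba`; keeping the probabilities is useful because ranking metrics such as ROC AUC are computed from scores rather than from 0/1 labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))                           # synthetic predictors
y = (X[:, 0] + rng.normal(size=400) > 1).astype(int)    # synthetic 0/1 target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)                # hard 0/1 labels
y_proba = model.predict_proba(X_test)[:, 1]   # estimated probability of class 1

print(y_pred[:5], y_proba[:5].round(2))
```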

## 5. Model Evaluation

### 5.1. **Accuracy**
The **accuracy** is 0.84, which indicates that the model correctly classifies 84% of the cases. Although it is a good general indicator, care must be taken: in an unbalanced data set (with more non-responding customers than responding ones), this value can be misleading.
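
To illustrate why accuracy can mislead on unbalanced data, the sketch below scores a "model" that never predicts a response. The 85/15 class ratio is invented for the example and is not taken from the project data.

```python
# Invented labels: 85 non-responders and 15 responders.
y_true = [0] * 85 + [1] * 15

# A trivial baseline that always predicts "no response".
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy, recall)
```

The baseline reaches high accuracy while finding none of the responders, so a reported accuracy is best compared against the share of the majority class before it is read as a sign of quality.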

### 5.2. **ROC AUC**
The **ROC AUC** is 0.55, only slightly above chance (0.5). This low value suggests that the model is not separating the "Responds" and "Does not respond" classes well. Hyperparameters may need to be adjusted or more features added.
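
ROC AUC measures how well the model ranks responders above non-responders, so it should be computed from predicted probabilities rather than from hard 0/1 labels. The text does not show which of the two was used here, so this is only something worth checking. The sketch below uses a small hand-written `roc_auc` helper (hypothetical, not a library function) and invented scores to show how much the two can differ.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the chance that a random positive is scored above a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    # A positive scored higher counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 1]
y_proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45]   # invented scores, all below 0.5
y_label = [int(p >= 0.5) for p in y_proba]        # thresholding turns every case into 0

auc_from_scores = roc_auc(y_true, y_proba)
auc_from_labels = roc_auc(y_true, y_label)
print(auc_from_scores, auc_from_labels)
```

In this toy case the scores rank every responder above every non-responder, yet the thresholded labels carry no ranking information at all.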

### Comparative Analysis

| Metric | Value |
|----------|-------|
| Accuracy | 0.84 |
| ROC AUC | 0.55 |

This table shows that although the model is accurate overall, it struggles to differentiate between the two classes effectively. This may indicate the need to tune hyperparameters or explore other models.

### Purpose
These metrics allow measuring the performance of the model and provide an initial evaluation of the quality of the predictions. In this case, it would be ideal to explore more features or tweaks to improve the ROC AUC.

## 6. Confusion Matrix

The **confusion matrix** helps understand the model predictions in more detail. Each category is listed as actual class / predicted class:

- **True Positives (Responds/Responds)**
- **False Positives (Does Not Respond/Responds)**
- **True Negatives (Does Not Respond/Does Not Respond)**
- **False Negatives (Responds/Does Not Respond)**

### Display
**Plotly** was used to create a heatmap that visualizes the different categories of predictions. This allows you to more clearly identify where the model makes errors.

### Justification
The confusion matrix provides crucial information about the types of errors the model is making. For example, a high false negative rate would indicate that the model is underestimating the customers who would respond to the campaign, which is costly in a marketing context.
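
The four categories can be counted directly from the actual and predicted labels. The ten labels below are invented; the sketch also derives recall and precision, which make the cost of false negatives and false positives explicit.

```python
from collections import Counter

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]   # invented actual responses
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]   # invented model predictions

# Count each (actual, predicted) pair once.
counts = Counter(zip(y_true, y_pred))
tp = counts[(1, 1)]   # responders correctly identified
fp = counts[(0, 1)]   # non-responders flagged as responders
tn = counts[(0, 0)]   # non-responders correctly identified
fn = counts[(1, 0)]   # responders the model missed

recall = tp / (tp + fn)      # share of actual responders that were found
precision = tp / (tp + fp)   # share of predicted responders that were right

print(tp, fp, tn, fn, recall, round(precision, 2))
```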

## 7. Feature Importance

**Random Forest** was used to calculate feature importance. The results are presented in a bar graph with **Plotly**. The results show that:

- **Total_Spent** and **Income** are the most important variables.
- **NumStorePurchases** is less relevant.

| Feature            | Importance  |
|--------------------|-------------|
| Total_Spent        | 0.39        |
| Income             | 0.38        |
| NumStorePurchases  | 0.23        |

### Analysis
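This analysis indicates that spending behavior (`Total_Spent`) and income (`Income`) contribute most to the model's predictions of the customer's response to the marketing campaign. Importance scores show how much the model relies on a feature, not the direction of its effect, so they should be read together with the exploratory plots. With that caveat, the result is useful for defining future strategies, since campaigns can be focused on clients with higher income or accumulated spending.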

### Purpose
This analysis helps marketing teams identify the key factors that influence customer response, allowing them to adjust segmentation and communication strategies.
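
A sketch of how the importance table could be produced from a fitted forest. The data is synthetic, so the printed values will not match the table above. `feature_importances_` holds scikit-learn's impurity-based importances, which are normalized to sum to 1 (consistent with 0.39 + 0.38 + 0.23 in the table).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["Income", "Total_Spent", "NumStorePurchases"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=features)    # synthetic values
y = (X["Total_Spent"] + rng.normal(size=300) > 1).astype(int)    # synthetic target

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importances = (
    pd.Series(model.feature_importances_, index=features)
    .sort_values(ascending=False)
)
print(importances.round(2))
```

Impurity-based importances are computed on the training data; `sklearn.inspection.permutation_importance` on the test set is a common cross-check, mentioned here as a suggestion rather than as part of the original analysis.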

## General Conclusion

This model provides a first baseline for predicting customer response to marketing campaigns and highlights total spending and income as the most influential variables. However, the low **ROC AUC** value suggests that other configurations or features should be explored to improve its predictive ability. The generated visualizations provide clear and concise information that can help in making business decisions.

## Potential Applications

- **Marketing Campaign Optimization**: Direct campaigns towards clients with high Total_Spent and Income values.
- **Customer Segmentation**: Use important features to segment customers more efficiently.
- **Model Tweaks**: Implement advanced hyperparameter tuning techniques or explore other classification algorithms to improve performance.
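
One way the hyperparameter tuning mentioned above could look, sketched with `GridSearchCV` on synthetic data. The grid values, the use of `class_weight="balanced"`, and the choice of ROC AUC as the selection metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                           # synthetic predictors
y = (X[:, 0] + rng.normal(size=300) > 1).astype(int)    # synthetic 0/1 target

# Assumption: a deliberately small grid so the example runs quickly.
param_grid = {
    "max_depth": [3, None],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    scoring="roc_auc",   # select by ranking quality rather than accuracy
    cv=3,                # 3-fold cross-validation on the training data
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Because the search uses cross-validation on the training data, the held-out test set stays untouched for the final evaluation.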


In [None]:


"""Additional visualizations"""

import plotly.express as px  # only needed if not already imported earlier in the notebook

# 1. 3D scatter plot of income, total spending and in-store purchases
fig_3d = px.scatter_3d(df, x='Income', y='Total_Spent', z='NumStorePurchases',
                       title='Income vs. Total Spending vs. In-Store Purchases',
                       labels={'Income': 'Income', 'Total_Spent': 'Total Spending', 'NumStorePurchases': 'In-Store Purchases'},
                       color='Response',  # color each point by campaign response
                       opacity=0.7)

# 1.1. Layout configuration
fig_3d.update_layout(
    scene=dict(
        xaxis_title='Income',
        yaxis_title='Total Spending',
        zaxis_title='In-Store Purchases'
    ),
    margin=dict(l=0, r=0, b=0, t=50)
)

# 1.2. Show the figure
fig_3d.show()



# 2. Scatter plot of income vs. total spending with an OLS trend line
#    (trendline='ols' requires the statsmodels package to be installed)
fig_scatter = px.scatter(df, x='Income', y='Total_Spent', title='Income vs. Total Spending', trendline='ols')
fig_scatter.update_layout(xaxis_title='Income', yaxis_title='Total Spending')
fig_scatter.show()

# 3. Correlation matrix between numeric variables
correlation_matrix_full = df[['Income', 'Total_Spent', 'MntWines', 'MntFruits', 'MntMeatProducts',
                         'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Kidhome', 'Teenhome']].corr()
fig_correlation_full = px.imshow(correlation_matrix_full, text_auto=True, title='Correlation Matrix of Numeric Variables')
fig_correlation_full.show()



### Visualization Analysis

#### 1. 3D Scatter Chart: **Income vs. Total Spending vs. In-Store Purchases**

**Purpose**: This three-dimensional chart visualizes the relationships between three key customer-behavior variables: **Income**, **Total Spending** and **In-Store Purchases**. Coloring the points by `Response` adds a visual layer that identifies which customers responded to the marketing campaign, which is valuable for segmentation.

**Interpretation**:
- **Complex relationships**: The chart shows whether a higher level of income is associated with higher total spending and more purchases in the store.
- **Customer Segmentation**: The point colors distinguish customers who responded to the campaign from those who did not, providing a first step toward customer segmentation based on purchasing behavior.
- **Application in the project**: This chart helps identify which customer characteristics (in terms of income and purchasing habits) are related to campaign response, which is useful for designing more personalized and effective marketing strategies.

#### 2. Scatter plot: **Income vs. Total Spending** with trend line

**Purpose**: This 2D graph specifically explores the relationship between customers' income and their total spending, allowing the general trend between these two variables to be visualized by incorporating a regression line.

**Interpretation**:
- **Visual correlation**: If the trend line is positive, it suggests that customers with higher incomes tend to spend more, which may be an indication of spending behavior aligned with economic capacity.
- **Behavioral prediction**: The trend line gives a rough estimate of the expected spending of a new customer from their income, although a single-variable linear fit should be treated as indicative only.
- **Application in the project**: This visualization provides valuable information for income-based segmentation and helps assess how well marketing campaigns align with the spending habits of the most affluent customers.
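
The trend line in this plot is an ordinary least squares fit of total spending on income. A minimal sketch of the same calculation with `numpy.polyfit`, using invented and deliberately exactly linear numbers:

```python
import numpy as np

# Invented values chosen to lie exactly on a straight line.
income = np.array([20000, 35000, 50000, 65000, 80000], dtype=float)
total_spent = np.array([100, 400, 700, 1000, 1300], dtype=float)

# Degree-1 least squares fit: total_spent is approximated by slope * income + intercept.
slope, intercept = np.polyfit(income, total_spent, deg=1)

# Estimated spending for a hypothetical new customer with an income of 60,000.
predicted = slope * 60000 + intercept
print(round(slope, 4), round(intercept, 1), round(predicted, 1))
```

With these numbers the fitted intercept is negative, so extrapolating to an income of 10,000 would give a negative spend. This is one reason to use the line only within the observed income range.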

#### 3. Correlation matrix: **Numeric variables**

**Purpose**: The correlation matrix presents the statistical relationships between the main numerical variables in the data set. Correlations are important for understanding which variables have strong linear relationships with each other, which can help reduce dimensionality and identify key variables for analysis.

**Interpretation**:
- **Identification of key relationships**: This matrix shows which variables are strongly correlated with each other, such as income and the amounts spent on certain products (for example, wines or meat). Correlation values close to 1 or -1 indicate strong linear relationships.
- **Reduction of model complexity**: If some variables are highly correlated, they may be redundant, which could simplify predictive models by reducing the number of variables to be considered.
- **Application in the project**: The correlation matrix will be useful in the process of building predictive models to select the most relevant variables. This helps improve the accuracy and efficiency of the models by focusing on the variables that really contribute to the analysis of customer behavior.
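
To turn the matrix into a list of potentially redundant variables, each pair can be extracted once and filtered by absolute correlation. The data below is synthetic and the 0.8 threshold is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(50000, 15000, size=200)
df = pd.DataFrame({
    "Income": income,
    "Total_Spent": income * 0.02 + rng.normal(0, 50, size=200),   # built to track Income
    "Kidhome": rng.integers(0, 3, size=200),                      # unrelated by construction
})

corr = df.corr()

# Keep only the upper triangle (above the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Assumption: |r| > 0.8 marks a pair as strongly correlated.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```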

### Conclusion

These visualizations provide a clearer picture of the relationships between key variables that affect customer behavior. In the context of the project, these tools allow us to delve deeper into customer segmentation, predict purchasing behavior, and optimize marketing strategies, facilitating informed decision-making about how and to whom to direct future advertising campaigns.


# Conclusions and Recommendations on Customer Personality Analysis for Marketing Optimization

## Conclusions

1. **Effective Customer Segmentation**: Using the elbow method, the optimal number of clusters for customer segmentation was identified as 3. This segmentation clearly distinguishes groups of customers with different characteristics and purchasing behaviors. The key segments identified were customers with high income and high spending, young customers with low spending, and frequent in-store shoppers, which facilitates the design of specific strategies for each group.

2. **Effectiveness of Marketing Campaigns**: The analysis of the overall response rate and by campaign showed that the campaigns have response rates that range between 0% and 7%. Additionally, some demographic segments, such as educational level and marital status, were observed to have significantly different response rates, suggesting that personalizing campaigns could increase their effectiveness.

3. **Relationship between Spending and Response to Campaigns**: The scatter-plot analysis revealed a positive correlation between a client's total spending and the response to marketing campaigns. This indicates that customers who spend more tend to be more receptive to offers, which should be considered in future marketing strategies.

4. **Predictive Modeling as a Baseline**: The Random Forest Classifier reached an accuracy of 0.84 but a ROC AUC of only 0.55, so it offers a first baseline rather than a finished tool for predicting customer response to marketing campaigns. Among the selected characteristics (income, total spending, and number of in-store purchases), total spending and income carried the most weight in the model.
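
The elbow method mentioned in the first conclusion can be sketched as follows. The three synthetic groups, the use of `KMeans`, and the standardization step are assumptions made for illustration; the clustering code itself is not shown in this section.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic, well-separated groups in (Income, Total_Spent) space.
centers = np.array([[25000, 100], [55000, 600], [85000, 1500]])
X = np.vstack([c + rng.normal(0, [3000, 50], size=(100, 2)) for c in centers])

# Assumption: features are standardized so Income does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# Within-cluster sum of squares (inertia) for k = 1..6.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

# The "elbow" is the k after which adding clusters stops reducing inertia much.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print([round(v, 1) for v in inertias])
print([round(v, 1) for v in drops])
```

On this synthetic data the reduction in inertia flattens after three clusters; on real data the bend is usually less sharp, so the choice of k often also draws on how interpretable the resulting segments are.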

## Recommendations

1. **Focus on Segmented Customers**: It is suggested that the company direct its marketing efforts towards the identified segments, especially those with high income and high spending, considering that they are the most receptive to offers. Customizing campaigns for these groups could improve response rates.

2. **Campaign Personalization**: It is advisable to adapt marketing campaigns according to the demographic characteristics of the customers. This includes taking into account educational level and marital status, which could result in an increase in the effectiveness of campaigns.

3. **Take Advantage of Spending as an Indicator**: Since customers who spend more were found to respond better to campaigns, it is recommended to focus campaigns on these customers and develop strategies that encourage greater spending, such as exclusive promotions.

4. **Continuous ROI Analysis**: It is essential to perform a detailed cost-benefit analysis of each campaign. This will allow you to optimize the return on investment (ROI) and adjust strategies based on the financial performance of the campaigns.

5. **Continuous Improvement of Predictive Modeling**: It is recommended to make periodic adjustments to the predictive model to include new variables and recent data. This will maintain the relevance of the model and improve the accuracy of the predictions.
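
To support the personalization recommendation, response rates can be compared across demographic groups with a simple group-by. The six rows below are invented; in practice the same aggregation would run on the full customer table, and groups with few customers should be interpreted with caution.

```python
import pandas as pd

# Invented sample: education level and campaign response for six customers.
df = pd.DataFrame({
    "Education": ["PhD", "PhD", "Graduation", "Graduation", "Graduation", "Master"],
    "Response": [1, 0, 0, 0, 1, 0],
})

# Number of customers and share of responders per education level.
rates = (
    df.groupby("Education")["Response"]
    .agg(customers="count", response_rate="mean")
    .sort_values("response_rate", ascending=False)
)
print(rates)
```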

These conclusions and recommendations offer a strategic framework for marketing optimization based on an in-depth analysis of customer behavior and characteristics, supporting a more effective response to market needs.