# Exploratory Analysis
The goal of this phase is to examine our data so that we understand the problem in more detail, identify promising modeling strategies, and decide which transformations could yield features that improve our results.


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from ydata_profiling import ProfileReport
import seaborn as sns

CATEGORICAL_COLUMNS = [
    "country_name",
    "device_os",
    "device_lang",
]
NUMERICAL_COLUMNS = [
    "cnt_user_engagement",
    "cnt_level_start_quickplay",
    "cnt_level_end_quickplay",
    "cnt_level_complete_quickplay",
    "cnt_level_reset_quickplay",
    "cnt_post_score",
    "cnt_spend_virtual_currency",
    "cnt_ad_reward",
    "cnt_challenge_a_friend",
    "cnt_completed_5_levels",
    "cnt_use_extra_steps",
]
IGNORE_COLUMNS = [
    "user_first_engagement",
    "user_pseudo_id",
    "is_enable",
    "bounced",
    "device_lang",
]
LABEL_COLUMN = "churned"

In [2]:
user_dataset_path = "../data/users_train.csv"
user_dataset = pd.read_csv(
    user_dataset_path,
    index_col=0,
    parse_dates=["user_first_engagement"],
)

# Data Validation

In [3]:
user_dataset.dtypes

user_pseudo_id                  object
is_enable                        int64
bounced                          int64
country_name                    object
device_os                       object
device_lang                     object
cnt_user_engagement              int64
cnt_level_start_quickplay        int64
cnt_level_end_quickplay          int64
cnt_level_complete_quickplay     int64
cnt_level_reset_quickplay        int64
cnt_post_score                   int64
cnt_spend_virtual_currency       int64
cnt_ad_reward                    int64
cnt_challenge_a_friend           int64
cnt_completed_5_levels           int64
cnt_use_extra_steps              int64
churned                          int64
dtype: object

In [4]:
user_dataset.isna().sum()

user_pseudo_id                    0
is_enable                         0
bounced                           0
country_name                      3
device_os                       231
device_lang                       0
cnt_user_engagement               0
cnt_level_start_quickplay         0
cnt_level_end_quickplay           0
cnt_level_complete_quickplay      0
cnt_level_reset_quickplay         0
cnt_post_score                    0
cnt_spend_virtual_currency        0
cnt_ad_reward                     0
cnt_challenge_a_friend            0
cnt_completed_5_levels            0
cnt_use_extra_steps               0
churned                           0
dtype: int64
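
The absolute counts above are easier to judge as a share of all rows. A minimal sketch, assuming a small made-up frame (`toy`) standing in for `user_dataset`; only the column names come from this notebook:

```python
import pandas as pd

# Toy frame standing in for user_dataset; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "country_name": ["Spain", None, "Chile", "Peru"],
        "device_os": [None, None, "ANDROID", "IOS"],
        "churned": [0, 1, 0, 1],
    }
)

# isna().mean() gives the fraction of missing rows per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Keep only the columns that actually have gaps.
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

Running the same two lines on `user_dataset` would show whether the gaps in `country_name` and `device_os` are small enough to impute or drop.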

In [5]:
user_dataset.columns

Index(['user_pseudo_id', 'is_enable', 'bounced', 'country_name', 'device_os',
       'device_lang', 'cnt_user_engagement', 'cnt_level_start_quickplay',
       'cnt_level_end_quickplay', 'cnt_level_complete_quickplay',
       'cnt_level_reset_quickplay', 'cnt_post_score',
       'cnt_spend_virtual_currency', 'cnt_ad_reward', 'cnt_challenge_a_friend',
       'cnt_completed_5_levels', 'cnt_use_extra_steps', 'churned'],
      dtype='object')

# Pandas Profiling

In [None]:
profile = ProfileReport(user_dataset, title="Pandas Profiling Report", explorative=True)

profile.to_widgets()


# EDA
## Evolution over time

In [None]:
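# The DataFrame index is user_first_engagement (parsed as dates on load), so
# resample("D") counts first engagements per day within each label class.
data_plot = (
    user_dataset.groupby(LABEL_COLUMN)[LABEL_COLUMN]
    .resample("D")
    .count()
    .rename("count")  # avoid a name clash with the label level on reset_index
    .reset_index()
)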
sns.relplot(
    data=data_plot,
    x="user_first_engagement",
    y="count",
    hue=LABEL_COLUMN,
    kind="line",
    height=5,
    aspect=2
)

## Pair plot of the numerical variables
How do the numerical variables relate to one another, and how do they relate to the variable we want to predict? To find out, we can use the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot) chart that seaborn provides.

In [None]:
sns.pairplot(
    data=user_dataset[NUMERICAL_COLUMNS + [LABEL_COLUMN]],
    hue=LABEL_COLUMN,
)
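
A pair plot over this many columns can be hard to read, so a numeric summary may help as a companion. The sketch below ranks columns by their Spearman rank correlation with the label. Spearman is my choice here, not something the notebook prescribes, and the `toy` frame is invented; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data with two of the notebook's numeric columns; the values are made up.
toy = pd.DataFrame(
    {
        "cnt_user_engagement": [1, 2, 3, 4, 5, 6],
        "cnt_post_score": [6, 5, 4, 3, 2, 1],
        "churned": [1, 1, 1, 0, 0, 0],
    }
)
numeric_cols = ["cnt_user_engagement", "cnt_post_score"]

# Rank correlation of every numeric column with the label, sorted ascending.
label_corr = (
    toy[numeric_cols + ["churned"]]
    .corr(method="spearman")["churned"]
    .drop("churned")  # the label's correlation with itself is not informative
    .sort_values()
)
print(label_corr)
```

With the real data, the same expression would use `NUMERICAL_COLUMNS` and `LABEL_COLUMN` in place of the hard-coded names.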

It would be interesting to see whether the distributions of the numerical variables differ with respect to the target variable. For this we can use a box plot (`boxplot`) or a letter-value plot (`boxenplot`), which give a general view of how the distributions differ.

In [None]:
# Template: box plot of one numeric column, split by the label.
# ax = sns.boxplot(
#     x="label",
#     y="column1",
#     data=dataset,
# )

In [None]:
ax = sns.boxplot(
    x=LABEL_COLUMN,
    y=NUMERICAL_COLUMNS[0],
    data=user_dataset,
    hue=LABEL_COLUMN,
)
ax.set(ylim=(0, 200))

In [None]:
ax = sns.boxplot(
    x=LABEL_COLUMN,
    y=NUMERICAL_COLUMNS[1],
    data=user_dataset,
    hue=LABEL_COLUMN,
)
ax.set(ylim=(0, 50))

In [None]:
ax = sns.boxplot(
    x=LABEL_COLUMN,
    y=NUMERICAL_COLUMNS[2],
    data=user_dataset,
    hue=LABEL_COLUMN,
)
ax.set(ylim=(0, 20))

In [None]:
ax = sns.boxplot(
    x=LABEL_COLUMN,
    y=NUMERICAL_COLUMNS[3],
    data=user_dataset,
    hue=LABEL_COLUMN,
)
ax.set(ylim=(0, 20))

In [None]:
ax = sns.boxplot(
    x=LABEL_COLUMN,
    y=NUMERICAL_COLUMNS[-2],
    data=user_dataset,
    hue=LABEL_COLUMN,
)

In [None]:
ax = sns.boxplot(
    x=LABEL_COLUMN,
    y=NUMERICAL_COLUMNS[-1],
    data=user_dataset,
    hue=LABEL_COLUMN,
)
ax.set(ylim=(0, 20))
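
The box plots above draw the quartiles of each class. As a rough sketch, the same numbers can be tabulated directly, which makes the classes easier to compare when the axes are clipped. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: one numeric column and the label; the values are invented.
toy = pd.DataFrame(
    {
        "churned": [0, 0, 0, 0, 1, 1, 1, 1],
        "cnt_user_engagement": [10, 20, 30, 40, 1, 2, 3, 4],
    }
)

# One row per class, one column per quartile (the edges and centre of each box).
quartiles = (
    toy.groupby("churned")["cnt_user_engagement"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# The interquartile range is the height of the box.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

Looping the same expression over `NUMERICAL_COLUMNS` would give one small table per feature.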

Are the categorical variables homogeneous? Or are there different categories that share the same meaning but differ in upper or lower case or in special characters?

In [None]:
user_dataset['country_name'].str.lower().nunique()

In [None]:
user_dataset['country_name'].nunique()

In [None]:
user_dataset['device_lang'].nunique()

In [None]:
user_dataset['device_lang'].str.lower().nunique()

In [None]:
user_dataset['device_lang'].value_counts()

In [None]:
user_dataset['device_lang'].str[:2].value_counts()
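
The checks above compare the number of unique values before and after lowercasing. A minimal sketch of the same idea follows, with whitespace stripping added as a further assumption of mine. The `normalize` helper and the `toy` values are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy data with deliberately inconsistent spellings; the values are invented.
toy = pd.DataFrame(
    {
        "country_name": ["Spain", "spain ", "SPAIN", "Chile"],
        "device_lang": ["es-es", "es-ES", "es-cl", "en-us"],
    }
)


def normalize(series: pd.Series) -> pd.Series:
    """Hypothetical helper: trim surrounding whitespace and lowercase."""
    return series.str.strip().str.lower()


# A drop in the unique count after normalizing signals duplicated categories.
raw_countries = toy["country_name"].nunique()
clean_countries = normalize(toy["country_name"]).nunique()

# Keeping only the two-letter prefix groups regional variants of a language.
lang_prefix = normalize(toy["device_lang"]).str[:2]

print(raw_countries, clean_countries)
print(lang_prefix.value_counts())
```

If the two counts match on the real columns, the categories are already consistent and no cleaning step is needed.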

Which are the top countries and languages, and how do they relate to the target variable?

In [None]:
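for cat_col in CATEGORICAL_COLUMNS:
    # Keep only the rows whose value is among the 10 most frequent categories.
    top_values = user_dataset[cat_col].value_counts().iloc[:10].index.to_list()
    selector = user_dataset[cat_col].isin(top_values)
    sns.catplot(
        x=cat_col,
        hue=LABEL_COLUMN,
        # Drop rows only when the plotted column is missing, so that gaps in
        # other columns do not remove users from this plot.
        data=user_dataset.loc[selector, :].dropna(subset=[cat_col]),
        kind="count",
        height=5,
        aspect=2,
    )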
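
The count plots show absolute numbers, which makes churn hard to compare between large and small categories. As a sketch under assumed data, the churn rate per category can be tabulated next to its user count. The `toy` frame is invented and `top_n` is a hypothetical parameter.

```python
import pandas as pd

# Toy data: one categorical column and the label; the values are made up.
toy = pd.DataFrame(
    {
        "country_name": ["Spain", "Spain", "Spain", "Chile", "Chile", "Peru"],
        "churned": [1, 0, 0, 1, 1, 0],
    }
)

top_n = 2  # hypothetical cut-off; the notebook's loop uses the top 10
top_values = toy["country_name"].value_counts().iloc[:top_n].index

# Users and mean of the 0/1 label (the churn rate) for each top category.
churn_by_country = (
    toy[toy["country_name"].isin(top_values)]
    .groupby("country_name")["churned"]
    .agg(users="count", churn_rate="mean")
    .sort_values("users", ascending=False)
)
print(churn_by_country)
```

Because `churned` is a 0/1 column, its mean within a group is the fraction of churned users in that group.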