**Hypotheses:**
1. Age, job, marital status, education, debt situation, and contact method may influence the probability that a customer accepts the offer.
2. The number of times a customer has been contacted (field campaign), the number of days since the last contact (field pdays), and the outcome of the previous campaign (field poutcome) may affect the customer's response to a new offer.
3. Economic variables, such as the consumer price index (cons.price.idx) and the employment variation rate (emp.var.rate), may influence the probability that a customer accepts the offer.
4. Customers who already have a mortgage (housing) or a loan (loan) may be less likely to accept a new offer, since they could be financially constrained.
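
A hypothesis such as number 4 can be checked, as a first pass, by comparing acceptance rates across groups. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) and assumes that `y` holds "yes"/"no" and that `housing` has already been relabelled.

```python
import pandas as pd

# Illustrative data only; these rows are not from the bank dataset
toy = pd.DataFrame({
    "housing": ["sí", "sí", "no", "no", "no"],
    "y": ["no", "yes", "yes", "yes", "no"],
})

# Share of customers who accepted the offer, per housing group
acceptance_rate = toy["y"].eq("yes").groupby(toy["housing"]).mean()
print(acceptance_rate)
```

A difference between groups in such a table is only descriptive; it does not by itself show that having a mortgage causes lower acceptance.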

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_columns = None

In [2]:
df = pd.read_csv("data/pair1_bank_additional_full.csv", index_col = 0)
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,duration,campaign,pdays,previous,poutcome,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed",y,month_day_week
0,56,housemaid,married,basic.4y,0.0,0.0,0.0,telephone,261,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']"
1,57,services,married,high.school,,0.0,0.0,telephone,149,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']"
2,37,services,married,high.school,0.0,1.0,0.0,telephone,226,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']"
3,40,admin.,married,basic.6y,0.0,0.0,0.0,telephone,151,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']"
4,56,services,married,high.school,0.0,0.0,1.0,telephone,307,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,0.0,1.0,0.0,cellular,334,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']"
41184,46,blue-collar,married,professional.course,0.0,0.0,0.0,cellular,383,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']"
41185,56,retired,married,university.degree,0.0,1.0,0.0,cellular,189,2,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']"
41186,44,technician,married,professional.course,0.0,0.0,0.0,cellular,442,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']"


1. Columns loan, housing, and default: these columns contain only the values 0 and 1. This can be unintuitive when drawing conclusions and in visualizations. The goal of this exercise is to replace the numeric values with "Sí" and "No". What each value corresponds to is given in the Limpieza I pair.


In [3]:
df["loan"].unique()

array([ 0.,  1., nan])

In [4]:


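# pd.cut with 2 equal-width bins: 0.0 falls in the first bin ("sí") and 1.0 in the second ("no"); NaN stays NaN.
# Confirm which label each value should get against the Limpieza I pair.
df["loan_modificado"] = pd.cut(df["loan"], 2, labels=["sí", "no"])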

df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,duration,campaign,pdays,previous,poutcome,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed",y,month_day_week,loan_modificado
0,56,housemaid,married,basic.4y,0.0,0.0,0.0,telephone,261,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí
1,57,services,married,high.school,,0.0,0.0,telephone,149,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí
2,37,services,married,high.school,0.0,1.0,0.0,telephone,226,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí
3,40,admin.,married,basic.6y,0.0,0.0,0.0,telephone,151,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí
4,56,services,married,high.school,0.0,0.0,1.0,telephone,307,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,0.0,1.0,0.0,cellular,334,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí
41184,46,blue-collar,married,professional.course,0.0,0.0,0.0,cellular,383,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí
41185,56,retired,married,university.degree,0.0,1.0,0.0,cellular,189,2,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí
41186,44,technician,married,professional.course,0.0,0.0,0.0,cellular,442,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí


In [5]:
df["housing_modificado"] = pd.cut(df["housing"], 2, labels = ["sí", "no"])

df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,duration,campaign,pdays,previous,poutcome,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed",y,month_day_week,loan_modificado,housing_modificado
0,56,housemaid,married,basic.4y,0.0,0.0,0.0,telephone,261,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí
1,57,services,married,high.school,,0.0,0.0,telephone,149,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí
2,37,services,married,high.school,0.0,1.0,0.0,telephone,226,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,no
3,40,admin.,married,basic.6y,0.0,0.0,0.0,telephone,151,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí
4,56,services,married,high.school,0.0,0.0,1.0,telephone,307,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",no,sí
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,0.0,1.0,0.0,cellular,334,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí,no
41184,46,blue-collar,married,professional.course,0.0,0.0,0.0,cellular,383,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí,sí
41185,56,retired,married,university.degree,0.0,1.0,0.0,cellular,189,2,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí,no
41186,44,technician,married,professional.course,0.0,0.0,0.0,cellular,442,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí,sí


In [6]:
df["default_modificado"] = pd.cut(df["default"], 2, labels = ["sí", "no"])

df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,duration,campaign,pdays,previous,poutcome,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed",y,month_day_week,loan_modificado,housing_modificado,default_modificado
0,56,housemaid,married,basic.4y,0.0,0.0,0.0,telephone,261,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí,sí
1,57,services,married,high.school,,0.0,0.0,telephone,149,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí,
2,37,services,married,high.school,0.0,1.0,0.0,telephone,226,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,no,sí
3,40,admin.,married,basic.6y,0.0,0.0,0.0,telephone,151,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí,sí
4,56,services,married,high.school,0.0,0.0,1.0,telephone,307,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",no,sí,sí
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,0.0,1.0,0.0,cellular,334,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí,no,sí
41184,46,blue-collar,married,professional.course,0.0,0.0,0.0,cellular,383,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí,sí,sí
41185,56,retired,married,university.degree,0.0,1.0,0.0,cellular,189,2,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí,no,sí
41186,44,technician,married,professional.course,0.0,0.0,0.0,cellular,442,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí,sí,sí
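
The three `pd.cut` calls above can also be written as an explicit dictionary lookup, which keeps the value-to-label correspondence visible. A minimal sketch on a toy frame follows; the mapping `{0.0: "sí", 1.0: "no"}` only mirrors the label order used above and is an assumption that should be checked against the Limpieza I pair.

```python
import numpy as np
import pandas as pd

# Assumption: 0 -> "sí", 1 -> "no", mirroring the label order used with pd.cut
mapping = {0.0: "sí", 1.0: "no"}

# Illustrative data only
toy = pd.DataFrame({
    "loan": [0.0, 1.0, np.nan],
    "housing": [1.0, 0.0, 0.0],
    "default": [0.0, np.nan, 0.0],
})

# Values missing from the mapping (here NaN) stay NaN
for col in ["loan", "housing", "default"]:
    toy[f"{col}_modificado"] = toy[col].map(mapping)

print(toy)
```

Unlike `pd.cut`, `map` produces plain string values rather than a categorical column, and swapping the labels only requires editing the dictionary.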


2. For the education column, its unique values contain dots. The goal of this exercise is to remove the dots from those values and replace them with spaces.


In [7]:
df["education"].unique()

array(['basic.4y', 'high.school', 'basic.6y', 'basic.9y',
       'professional.course', nan, 'university.degree', 'illiterate'],
      dtype=object)

In [8]:
# regex=False replaces only literal dots; with regex=True, '.' matches every character and blanks out the column
df['education'] = df['education'].str.replace('.', ' ', regex=False)
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,duration,campaign,pdays,previous,poutcome,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed",y,month_day_week,loan_modificado,housing_modificado,default_modificado
0,56,housemaid,married,basic 4y,0.0,0.0,0.0,telephone,261,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí,sí
1,57,services,married,high school,,0.0,0.0,telephone,149,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí,
2,37,services,married,high school,0.0,1.0,0.0,telephone,226,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,no,sí
3,40,admin.,married,basic 6y,0.0,0.0,0.0,telephone,151,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",sí,sí,sí
4,56,services,married,high school,0.0,0.0,1.0,telephone,307,1,999,0,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",no,sí,sí
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional course,0.0,1.0,0.0,cellular,334,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí,no,sí
41184,46,blue-collar,married,professional course,0.0,0.0,0.0,cellular,383,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí,sí,sí
41185,56,retired,married,university degree,0.0,1.0,0.0,cellular,189,2,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,no,"['nov', 'fri']",sí,no,sí
41186,44,technician,married,professional course,0.0,0.0,0.0,cellular,442,1,999,0,NONEXISTENT,-1.1,94.767,-50.8,1.028,4963.6,yes,"['nov', 'fri']",sí,sí,sí
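
Because `.` is a regular-expression wildcard, the `regex` argument matters when replacing dots. The sketch below contrasts the options on a toy Series (`education` here is illustrative, not the real column). The wildcard behaviour of a single-character pattern is what recent pandas versions (2.x) do; older versions may treat it differently, so that line carries no expected output.

```python
import pandas as pd

# Illustrative values only
education = pd.Series(["basic.4y", "high.school", None])

# regex=True: in pandas 2.x the '.' pattern matches any character,
# so whole values turn into runs of spaces
as_regex = education.str.replace(".", " ", regex=True)

# regex=False: only literal dots are replaced
as_literal = education.str.replace(".", " ", regex=False)

# Equivalent regex form with the dot escaped
escaped = education.str.replace(r"\.", " ", regex=True)

print(as_literal.tolist())
print(escaped.tolist())
```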


3. For the job column, one of the unique values is abbreviated (admin.); replace the abbreviation with the full name.


In [9]:
df["job"].unique()

array(['housemaid', 'services', 'admin.', 'blue-collar', 'technician',
       'retired', 'management', 'unemployed', 'self-employed', nan,
       'entrepreneur', 'student'], dtype=object)

In [10]:
# regex=False so the trailing dot in 'admin.' is matched literally instead of as a wildcard
df['job'] = df['job'].str.replace('admin.', 'administrator', regex=False)


In [11]:

df["job"].unique()

array(['housemaid', 'services', 'administrator', 'blue-collar',
       'technician', 'retired', 'management', 'unemployed',
       'self-employed', nan, 'entrepreneur', 'student'], dtype=object)
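
When a whole value has to be swapped, as with `admin.` here, `Series.replace` with a dictionary is an alternative that matches complete values rather than substrings. A small sketch on a toy Series (`jobs` is illustrative):

```python
import pandas as pd

# Illustrative values only
jobs = pd.Series(["admin.", "services", None, "management"])

# Exact-value replacement: only cells equal to "admin." change
jobs_full = jobs.replace({"admin.": "administrator"})

print(jobs_full.tolist())
```

This avoids regular expressions altogether, so the dot needs no special handling.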

4. The month_day_week column holds a list that mixes two very different pieces of information. The goal of this exercise is to split this column into two new ones, with the months in one column and the days of the week in the other.


In [12]:

df["month_day_week"].unique()

array(["['may', 'mon']", "['may', 'tue']", "['may', 'wed']",
       "['may', 'thu']", "['may', 'fri']", "['jun', 'mon']",
       "['jun', 'tue']", "['jun', 'wed']", "['jun', 'thu']",
       "['jun', 'fri']", "['jul', 'tue']", "['jul', 'wed']",
       "['jul', 'thu']", "['jul', 'fri']", "['jul', 'mon']",
       "['aug', 'mon']", "['aug', 'tue']", "['aug', 'wed']",
       "['aug', 'thu']", "['aug', 'fri']", "['oct', 'fri']",
       "['oct', 'mon']", "['oct', 'tue']", "['oct', 'wed']",
       "['oct', 'thu']", "['nov', 'mon']", "['nov', 'tue']",
       "['nov', 'wed']", "['nov', 'thu']", "['nov', 'fri']",
       "['dec', 'mon']", "['dec', 'wed']", "['dec', 'thu']",
       "['dec', 'fri']", "['dec', 'tue']", "['mar', 'mon']",
       "['mar', 'tue']", "['mar', 'wed']", "['mar', 'thu']",
       "['mar', 'fri']", "['apr', 'wed']", "['apr', 'thu']",
       "['apr', 'fri']", "['apr', 'mon']", "['apr', 'tue']",
       "['sep', 'tue']", "['sep', 'wed']", "['sep', 'thu']",
       "['sep', 'fri']",

['may', 'mon']

[   :   0
'   :   1
m   :   2
a   :   3
y   :   4
'   :   5
,   :   6
    :   7
'   :   8
m   :   9
o   :   10
n   :   11
'   :   12
]   :   13
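
The character-by-position listing above is what motivates the slices `[2:5]` and `[9:12]` used in the next cells. The cell that printed it is not shown, so the following is only a guess at how such a listing can be produced (`sample`, `lines`, `month`, and `day` are hypothetical names):

```python
# One value of month_day_week, stored as a string rather than a real list
sample = "['may', 'mon']"

# Pair every character with its position in the string
lines = [f"{char}   :   {position}" for position, char in enumerate(sample)]
print("\n".join(lines))

# Positions 2-4 hold the month and positions 9-11 hold the day of the week
month = sample[2:5]
day = sample[9:12]
print(month, day)
```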

In [13]:
def obtain_month(row):
    """Extract the month from a string such as "['may', 'mon']" by keeping the characters 'may'.

    Args:
        row (str): Month and day-of-week pair, stored as a string

    Returns:
        str: Month
    """
    return row[2:5]

In [14]:
df["month"] = df["month_day_week"].apply(obtain_month)

In [15]:
def obtain_day(row):
    """Extract the day of the week from a string such as "['may', 'mon']" by keeping the characters 'mon'.

    Args:
        row (str): Month and day-of-week pair, stored as a string

    Returns:
        str: Day of the week
    """
    return row[9:12]

In [16]:
df["day"] = df["month_day_week"].apply(obtain_day)

In [17]:
df[["month_day_week","month","day"]].head()

Unnamed: 0,month_day_week,month,day
0,"['may', 'mon']",may,mon
1,"['may', 'mon']",may,mon
2,"['may', 'mon']",may,mon
3,"['may', 'mon']",may,mon
4,"['may', 'mon']",may,mon


In [18]:
df["month_day_week"][0][2:5]

'may'

In [19]:
df["month_day_week"][0][9:12]

'mon'

In [20]:
#df[["month_day_week"]] = df["month_day_week"].str.split(',', expand=True, n=1)
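
Fixed-position slicing works because every month and day abbreviation here has three letters. A sketch of an alternative that does not depend on character positions is to parse each string into a real list with `ast.literal_eval` and expand it into two columns. The frame `df_demo` is a hypothetical stand-in for the real data, and this is not the approach used in the cells above.

```python
import ast

import pandas as pd

# Illustrative stand-in for the month_day_week column
df_demo = pd.DataFrame({"month_day_week": ["['may', 'mon']", "['nov', 'fri']"]})

# Turn each string into an actual Python list, e.g. ['may', 'mon']
parsed = df_demo["month_day_week"].apply(ast.literal_eval)

# Expand the two-element lists into separate columns
df_demo[["month", "day"]] = pd.DataFrame(parsed.tolist(), index=df_demo.index)

print(df_demo)
```

`ast.literal_eval` only evaluates Python literals, so it is safer than `eval` for strings read from a file.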



5. Save the csv with the cleaned columns so you can keep working with this clean dataframe.
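
No solution cell for this last exercise appears in the notebook. One possible sketch uses `DataFrame.to_csv`; the file name `bank_additional_clean.csv` and the temporary directory are hypothetical choices, and the toy frame `df_clean` stands in for the cleaned `df`.

```python
import os
import tempfile

import pandas as pd

# Illustrative stand-in for the cleaned dataframe
df_clean = pd.DataFrame({
    "job": ["administrator", "services"],
    "month": ["may", "nov"],
    "day": ["mon", "fri"],
})

# Hypothetical output location; in the notebook this would likely sit under data/
output_path = os.path.join(tempfile.gettempdir(), "bank_additional_clean.csv")

# The index is written as the first column, so it is read back with index_col=0
df_clean.to_csv(output_path)
reloaded = pd.read_csv(output_path, index_col=0)

print(reloaded)
```

Reading with `index_col=0` matches how the original file is loaded at the top of the notebook and avoids an extra `Unnamed: 0` column.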