<p align="center">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQe50pBmDfPWmuHmgaJxOFGmbks2QMDJmovCN43cpNO0Q&s">
</p>


<p align="justify">
👀 The goal is to predict whether a customer will upgrade their Disney subscription to Disney +.
</p>



 # **<font color="DarkBlue">Disney + subscription status</font>**

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/cristiandarioortegayubro/BDS/main/datasets/EstadoSuscripcionDisney.csv")

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id                     1000 non-null   int64 
 1   first_name             1000 non-null   object
 2   last_name              1000 non-null   object
 3   email                  1000 non-null   object
 4   gender                 1000 non-null   object
 5   ip_address             1000 non-null   object
 6   address                1000 non-null   object
 7   country                1000 non-null   object
 8   country_code           1000 non-null   object
 9   city                   1000 non-null   object
 10  latitud                1000 non-null   object
 11  longitud               1000 non-null   object
 12  income_currency        1000 non-null   object
 13  income                 1000 non-null   object
 14  subscription           1000 non-null   object
 15  previous_subscription 

In [None]:
data.head()

Unnamed: 0,id,first_name,last_name,email,gender,ip_address,address,country,country_code,city,latitud,longitud,income_currency,income,subscription,previous_subscription
0,1,Dulcie,Dyerson,ddyerson0@istockphoto.com,Female,206.153.232.168,PO Box 22469,United States,US,Fort Lauderdale,260.576.497,-803.101.684,USD,$79095.08,Disney +,Disney
1,2,Dedra,Valler,dvaller1@kickstarter.com,Female,204.239.98.223,17th Floor,United States,US,Columbia,340.067.522,-810.330.246,USD,$66205.54,Disney +,Disney
2,3,Lorinda,Inderwick,linderwick2@goo.gl,Female,133.203.234.234,Room 845,United States,US,Glendale,34.144.801,-1.182.563.169,USD,$71724.29,Disney +,Disney +
3,4,Lucie,Noorwood,lnoorwood3@istockphoto.com,Female,62.255.251.39,Apt 246,United States,US,Mesa,334.335.164,-1.117.256.936,USD,$76316.55,Disney +,Disney +
4,5,Blisse,MacAloren,bmacaloren4@fastcompany.com,Female,28.15.55.123,PO Box 87287,United States,US,Irving,32.886.855,-96.967.936,USD,$45027.26,Disney,Disney


In [None]:
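# Strip the literal "$" (regex=False keeps it from being read as a regex anchor) and convert to float
data['income'] = data['income'].str.replace('$', '', regex=False).astype(float)
# Target: 1 if the customer moved from "Disney" to "Disney +", 0 otherwise
data['upgrade_subscription'] = 0
data.loc[(data['subscription'] == 'Disney +') & (data['previous_subscription'] == 'Disney'), 'upgrade_subscription'] = 1
# Drop identifiers and the two columns used to build the target
data.drop(columns=["id","first_name","last_name","email","ip_address","address","country","country_code","income_currency","subscription","previous_subscription"], inplace=True)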

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   gender                1000 non-null   object 
 1   city                  1000 non-null   object 
 2   latitud               1000 non-null   object 
 3   longitud              1000 non-null   object 
 4   income                1000 non-null   float64
 5   upgrade_subscription  1000 non-null   int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 47.0+ KB


In [None]:
target_name = "upgrade_subscription"
y = data[target_name]
X = data.drop(columns=[target_name])

In [None]:
X.head()

Unnamed: 0,gender,city,latitud,longitud,income
0,Female,Fort Lauderdale,260.576.497,-803.101.684,79095.08
1,Female,Columbia,340.067.522,-810.330.246,66205.54
2,Female,Glendale,34.144.801,-1.182.563.169,71724.29
3,Female,Mesa,334.335.164,-1.117.256.936,76316.55
4,Female,Irving,32.886.855,-96.967.936,45027.26


In [None]:
y.head()

0    1
1    1
2    0
3    0
4    0
Name: upgrade_subscription, dtype: int64
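
<p align="justify">
👀 As a quick sanity check of the target rule, here is a minimal sketch on a hand-made frame. The <code>toy</code> data is hypothetical and only mirrors the two subscription columns of the original dataset. It covers the four possible combinations, so we can confirm that only the move from Disney to Disney + is flagged as an upgrade.
</p>

```python
import pandas as pd

# Hypothetical mini-sample with the same two columns as the original data.
toy = pd.DataFrame({
    "previous_subscription": ["Disney", "Disney", "Disney +", "Disney +"],
    "subscription": ["Disney +", "Disney", "Disney +", "Disney"],
})

# An upgrade is a move from "Disney" to "Disney +"; every other case is 0.
is_upgrade = (toy["subscription"] == "Disney +") & (toy["previous_subscription"] == "Disney")
toy["upgrade_subscription"] = is_upgrade.astype(int)

upgrade_flags = toy["upgrade_subscription"].tolist()
upgrade_rate = toy["upgrade_subscription"].mean()  # share of positive cases
print(upgrade_flags, upgrade_rate)
```

<p align="justify">
On the real data, <code>y.value_counts(normalize=True)</code> gives the same kind of class-balance summary.
</p>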

 # **<font color="DarkBlue">Selection based on data types</font>**

<p align="justify">
👀 We split the categorical and numerical variables by using their data types to identify them: as we saw earlier, the <code>object</code> dtype corresponds to the categorical columns (strings). We use <code>make_column_selector</code> to select the corresponding columns.
</p>


In [None]:
from sklearn.compose import make_column_selector as selector

<p align="justify">
👀 In the numerical column selector we exclude the <code>object</code> dtype rather than including a single numeric type, because numerical columns may hold either integers or floats.
</p>


In [None]:
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

In [None]:
numerical_columns = numerical_columns_selector(X)
categorical_columns = categorical_columns_selector(X)

In [None]:
numerical_columns

['income']

In [None]:
categorical_columns

['gender', 'city', 'latitud', 'longitud']
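
<p align="justify">
👀 The selectors look only at the dtype, not at what a column means. That is why <code>latitud</code> and <code>longitud</code>, which are stored as text, end up in the categorical list. The sketch below illustrates this on a hypothetical mini-frame (the <code>visits</code> column is invented for the example) and shows how the result changes once a text column is converted to numbers.
</p>

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical mini-frame; "visits" is an invented column used only to show an integer dtype.
toy = pd.DataFrame({
    "gender": pd.Series(["Female", "Male"], dtype=object),
    "latitud": pd.Series(["38.35", "38.36"], dtype=object),  # numbers stored as text
    "income": [71631.13, 60700.39],                           # float64
    "visits": [3, 7],                                         # int64
})

numerical = selector(dtype_exclude=object)(toy)
categorical = selector(dtype_include=object)(toy)
print(numerical)    # ['income', 'visits']
print(categorical)  # ['gender', 'latitud']

# Once the text is converted to numbers, the same selector treats the column as numerical.
toy["latitud"] = pd.to_numeric(toy["latitud"])
numerical_after = selector(dtype_exclude=object)(toy)
print(numerical_after)  # ['latitud', 'income', 'visits']
```

<p align="justify">
Values such as <code>260.576.497</code> in the original columns would not parse this way without extra cleaning, which this sketch does not attempt.
</p>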

 # **<font color="DarkBlue">Dispatching columns to a specific processor</font>**

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

<p align="justify">
👀 Now we create the transformer and associate each of these preprocessors with its respective columns.
</p>

https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)])

In [None]:
preprocessor
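
<p align="justify">
👀 To see what such a transformer produces, here is a minimal sketch on hypothetical data (the <code>toy</code> and <code>unseen</code> frames and the <code>to_dense</code> helper are illustrative, not part of the original notebook). Each distinct category becomes one column, the numerical column is standardized, and a category not seen during fitting is encoded as all zeros thanks to <code>handle_unknown="ignore"</code>.
</p>

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dense(matrix):
    """Return a NumPy array whether the transformer output is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


# Hypothetical data: three distinct gender values and one numerical column.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Polygender"],
    "income": [40000.0, 50000.0, 60000.0, 70000.0],
})

toy_preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("standard_scaler", StandardScaler(), ["income"]),
])

# 3 one-hot columns (one per gender value) followed by 1 scaled income column.
transformed = to_dense(toy_preprocessor.fit_transform(toy))
print(transformed.shape)  # (4, 4)

# A category unseen during fit gets zeros in every one-hot column instead of raising.
unseen = pd.DataFrame({"gender": ["Agender"], "income": [55000.0]})
unseen_row = to_dense(toy_preprocessor.transform(unseen))
print(unseen_row[0, :3])
```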

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

In [None]:
model = make_pipeline(preprocessor, LogisticRegression())
model

In [None]:
model.named_steps

{'columntransformer': ColumnTransformer(transformers=[('one-hot-encoder',
                                  OneHotEncoder(handle_unknown='ignore'),
                                  ['gender', 'city', 'latitud', 'longitud']),
                                 ('standard_scaler', StandardScaler(),
                                  ['income'])]),
 'logisticregression': LogisticRegression()}

 # **<font color="DarkBlue">Train-test split of the dataset</font>**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
_ = model.fit(X_train, y_train)

 # **<font color="DarkBlue">Fitting and prediction</font>**

In [None]:
X_test.head()

Unnamed: 0,gender,city,latitud,longitud,income
521,Female,Charleston,38.35,-81.63,71631.13
737,Female,Charleston,38.36,-81.65,60700.39
740,Polygender,Fairbanks,648.377.778,-1.477.163.888,86230.75
660,Polygender,Charleston,327.830.575,-799.365.839,63951.78
411,Female,Bakersfield,351.268.513,-1.191.855.785,56267.92


In [None]:
model.predict(X_test)[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [None]:
y_test[:10]

521    0
737    0
740    0
660    0
411    0
678    0
626    0
513    0
859    0
136    1
Name: upgrade_subscription, dtype: int64

In [None]:
model.score(X_test, y_test).round(4)

0.68
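
<p align="justify">
👀 An accuracy value is easier to judge next to a trivial baseline. The sketch below uses synthetic data with an assumed 70/30 class split (not the real balance of this dataset) and scikit-learn's <code>DummyClassifier</code>, which always predicts the most frequent class. It is a possible sanity check, not part of the original analysis.
</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the 70/30 class ratio is an assumption for illustration.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 1))
y_toy = np.array([0] * 140 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # close to the majority share
```

<p align="justify">
If a model's accuracy is not clearly above this kind of baseline, the features are adding little information.
</p>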

 # **<font color="DarkBlue">Model evaluation with cross-validation</font>**

<p align="justify">
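👀 A predictive model can be evaluated with cross-validation: the data is split into several folds, and the model is repeatedly trained on all folds but one and scored on the fold that was held out. This gives one score per fold instead of a single score that depends on one particular train-test split.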
</p>
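
<p align="justify">
👀 The loop behind cross-validation can be sketched by hand. The example below uses synthetic data (<code>X_toy</code> and <code>y_toy</code> are illustrative) and assumes the stratified, unshuffled 5-fold split that scikit-learn applies to classifiers when <code>cv</code> is an integer, then compares the manual scores with those returned by <code>cross_validate</code>.
</p>

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic binary classification data, used only for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

estimator = LogisticRegression()
splitter = StratifiedKFold(n_splits=5)

# Manual loop: fit a fresh copy of the model on each training part,
# then score it on the held-out fold.
manual_scores = []
for train_idx, test_idx in splitter.split(X_toy, y_toy):
    fold_model = clone(estimator).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))
manual_scores = np.array(manual_scores)

# Same evaluation through the helper used in this notebook.
auto_scores = cross_validate(estimator, X_toy, y_toy, cv=5)["test_score"]
print(manual_scores, auto_scores)
```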


In [None]:
from sklearn.model_selection import cross_validate

In [None]:
cv_results = cross_validate(model, X, y, cv=5)
cv_results

{'fit_time': array([0.06251693, 0.05361748, 0.05506682, 0.05700827, 0.04860902]),
 'score_time': array([0.0140028 , 0.01362681, 0.01359749, 0.01326466, 0.0084424 ]),
 'test_score': array([0.705, 0.715, 0.76 , 0.73 , 0.7  ])}

In [None]:
scores = cv_results["test_score"]
print("")
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} ± {scores.std():.3f}")


The mean cross-validation accuracy is: 0.722 ± 0.022


 # **<font color="DarkBlue">Conclusions?...</font>**

<p align="justify">
👀 In this colab we:<br>
<br>
✅ Loaded the data from a <code>CSV</code> file using <code>Pandas</code>.
<br>
✅ Framed the problem. Is it framed correctly?
<br>
✅ Created a target variable. Is it built correctly?
<br>
✅ Used a <code>ColumnTransformer</code> to preprocess the categorical and numerical variables.
<br>
✅ Used a Pipeline to chain the <code>ColumnTransformer</code> preprocessing with the model.
<br>
</p>




<br>
<br>
<p align="center"><b>
💗
<font color="DarkBlue">
We have reached the end of our colab, keep on coding...
</font>
</b></p>
