<a href="https://colab.research.google.com/github/YaninaTesta/Proyectos/blob/main/Notebook_11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ingenias+ Data Science Program

As discussed earlier, a data science project has several stages:

1. Data collection
2. Data exploration and processing
3. Modeling
4. Deployment to production

In the previous class we carried out the exploratory data analysis and looked at the types of data we had. We asked questions and spotted some patterns. All of that knowledge is useful for the next part of this stage: data processing.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

<font size=5>  🚀 Data processing 👩🏽‍💻</font>

Having visualized and explored the dataset, we now have an idea of what our data looks like. The next step is to prepare the data for the following stages, guided by what we learned and the questions we raised.

In [2]:
blackfriday = pd.read_csv('/content/BlackFriday.csv')

#### 1) VARIABLE TRANSFORMATION

- Most machine learning algorithms do not accept `strings` as input and require numeric variables. Categorical variables therefore have to be converted to a numeric representation. There are several ways to do this, which we will go through one by one.

**`LabelEncoder()`**

In [3]:
from sklearn.preprocessing import LabelEncoder

In [4]:
test_encoder = LabelEncoder()

In [5]:
blackfriday.loc[:, 'City_Category'] = test_encoder.fit_transform(blackfriday['City_Category'])

In [6]:
blackfriday.head(10)

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,0,2,0.0,3.0,,,8370.0
1,1000001,P00248942,F,0-17,10,0,2,0.0,1.0,6.0,14.0,15200.0
2,1000001,P00087842,F,0-17,10,0,2,0.0,12.0,,,1422.0
3,1000001,P00085442,F,0-17,10,0,2,0.0,12.0,14.0,,1057.0
4,1000002,P00285442,M,55+,16,2,4+,0.0,8.0,,,7969.0
5,1000003,P00193542,M,26-35,15,0,3,0.0,1.0,2.0,,15227.0
6,1000004,P00184942,M,46-50,7,1,2,1.0,1.0,8.0,17.0,19215.0
7,1000004,P00346142,M,46-50,7,1,2,1.0,1.0,15.0,,15854.0
8,1000004,P0097242,M,46-50,7,1,2,1.0,1.0,16.0,,15686.0
9,1000005,P00274942,M,26-35,20,0,1,1.0,8.0,,,7871.0
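
To make the integer codes above easier to interpret, here is a minimal sketch on toy data (the `cities` series is hypothetical and only stands in for `blackfriday['City_Category']`). It shows how the fitted encoder's `classes_` attribute reveals which label received which code, and how `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical), standing in for blackfriday['City_Category']
cities = pd.Series(["A", "C", "A", "B", "C"])

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the sorted original labels; each label's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'A': 0, 'B': 1, 'C': 2}
print(codes.tolist())  # [0, 2, 0, 1, 2]

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
```

Note that integer codes imply an order (0 < 1 < 2) that a model may treat as meaningful. The scikit-learn documentation describes `LabelEncoder` as intended for target labels, so for input features one of the encodings below may be a better fit.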


**`get_dummies()`**

In [7]:
pd.get_dummies(blackfriday, columns=["Gender"])

Unnamed: 0,User_ID,Product_ID,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,Gender_F,Gender_M
0,1000001,P00069042,0-17,10,0,2,0.0,3.0,,,8370.0,True,False
1,1000001,P00248942,0-17,10,0,2,0.0,1.0,6.0,14.0,15200.0,True,False
2,1000001,P00087842,0-17,10,0,2,0.0,12.0,,,1422.0,True,False
3,1000001,P00085442,0-17,10,0,2,0.0,12.0,14.0,,1057.0,True,False
4,1000002,P00285442,55+,16,2,4+,0.0,8.0,,,7969.0,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
383975,1005077,P00150342,26-35,2,0,0,0.0,8.0,14.0,,2230.0,False,True
383976,1005077,P00110542,26-35,2,0,0,0.0,8.0,,,9768.0,False,True
383977,1005077,P00111942,26-35,2,0,0,0.0,8.0,17.0,,9903.0,False,True
383978,1005077,P00019142,26-35,2,0,0,0.0,11.0,15.0,,3067.0,False,True


In [8]:
pd.get_dummies(blackfriday["Gender"])

Unnamed: 0,F,M
0,True,False
1,True,False
2,True,False
3,True,False
4,False,True
...,...,...
383975,False,True
383976,False,True
383977,False,True
383978,False,True


In [9]:
blackfriday[['female', 'male']] = pd.get_dummies(blackfriday["Gender"])

In [10]:
blackfriday.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,female,male
0,1000001,P00069042,F,0-17,10,0,2,0.0,3.0,,,8370.0,True,False
1,1000001,P00248942,F,0-17,10,0,2,0.0,1.0,6.0,14.0,15200.0,True,False
2,1000001,P00087842,F,0-17,10,0,2,0.0,12.0,,,1422.0,True,False
3,1000001,P00085442,F,0-17,10,0,2,0.0,12.0,14.0,,1057.0,True,False
4,1000002,P00285442,M,55+,16,2,4+,0.0,8.0,,,7969.0,False,True
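
As a small illustration of two `get_dummies` options not used above, the sketch below works on a toy frame (the `toy` data is hypothetical). `dtype=int` produces 0/1 columns instead of booleans, and `drop_first=True` keeps only k-1 columns for a variable with k levels, which avoids a redundant column when there are only two categories.

```python
import pandas as pd

# Toy data (hypothetical), with the same column names as the dataset
toy = pd.DataFrame({"Gender": ["F", "M", "M"], "Purchase": [8370, 7969, 15227]})

# dtype=int yields 0/1 integers instead of True/False
dummies = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(list(dummies.columns))  # ['Purchase', 'Gender_F', 'Gender_M']

# drop_first=True drops the first level (Gender_F); Gender_M alone carries the information
reduced = pd.get_dummies(toy, columns=["Gender"], dtype=int, drop_first=True)
print(list(reduced.columns))  # ['Purchase', 'Gender_M']
```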


**`OneHotEncoder()`**

In [11]:
from sklearn.preprocessing import OneHotEncoder

In [12]:
blackfriday['Gender'].unique()

array(['F', 'M'], dtype=object)

In [13]:
gender_encoder = OneHotEncoder()

In [14]:
gender_encoder.fit_transform(blackfriday[['Gender']]).toarray()

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [15]:
gender_encoder.categories_

[array(['F', 'M'], dtype=object)]

In [16]:
niveles = gender_encoder.categories_[0].tolist()

In [17]:
one_hot_gender = pd.DataFrame(gender_encoder.fit_transform(blackfriday[['Gender']]).toarray(), columns=niveles)

In [18]:
one_hot_gender

Unnamed: 0,F,M
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,0.0,1.0
...,...,...
383975,0.0,1.0
383976,0.0,1.0
383977,0.0,1.0
383978,0.0,1.0


In [19]:
new_df = pd.concat([blackfriday, one_hot_gender], axis=1)

In [20]:
new_df.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,female,male,F,M
0,1000001,P00069042,F,0-17,10,0,2,0.0,3.0,,,8370.0,True,False,1.0,0.0
1,1000001,P00248942,F,0-17,10,0,2,0.0,1.0,6.0,14.0,15200.0,True,False,1.0,0.0
2,1000001,P00087842,F,0-17,10,0,2,0.0,12.0,,,1422.0,True,False,1.0,0.0
3,1000001,P00085442,F,0-17,10,0,2,0.0,12.0,14.0,,1057.0,True,False,1.0,0.0
4,1000002,P00285442,M,55+,16,2,4+,0.0,8.0,,,7969.0,False,True,0.0,1.0


**`binning`**

In [21]:
print(blackfriday.Age.min())
print(blackfriday.Age.max())

0-17
55+


In [23]:
# Age holds range labels such as '0-17' and '55+', so errors='coerce' turns every value into NaN
blackfriday["Age"] = pd.to_numeric(blackfriday["Age"], errors='coerce')

In [24]:
bin_age = [10, 17, 70, 80]
labels = ["Teenager", "Adult", "Senior"]

# pd.cut leaves NaN inputs as NaN, so every row of the result below is empty
age_categories = pd.cut(blackfriday["Age"], bins=bin_age, labels=labels)

In [25]:
age_categories

Unnamed: 0,Age
0,
1,
2,
3,
4,
...,...
383975,
383976,
383977,
383978,
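
The empty result above follows from `Age` being stored as range labels ('0-17', '55+') and not as numbers. The sketch below shows, on hypothetical values, how `pd.cut` behaves on truly numeric ages, and one possible way to regroup range labels by extracting their lower bound first. The bin edges and the English group names are assumptions chosen for illustration.

```python
import pandas as pd

labels = ["Teenager", "Adult", "Senior"]

# Part 1: pd.cut on numeric ages (hypothetical values)
ages = pd.Series([12, 17, 18, 45, 70, 75])
# Intervals are right-closed by default: (10, 17], (17, 70], (70, 80]
age_groups = pd.cut(ages, bins=[10, 17, 70, 80], labels=labels)
print(age_groups.tolist())
# ['Teenager', 'Teenager', 'Adult', 'Adult', 'Adult', 'Senior']

# Part 2: range labels like those in the Age column
age_ranges = pd.Series(["0-17", "55+", "26-35", "46-50"])
# Extract the first run of digits (the lower bound of each range) as an integer
lower_bound = age_ranges.str.extract(r"(\d+)")[0].astype(int)
# Hypothetical edges: lower bound <= 17 is Teenager, 18 to 54 is Adult, 55 and up is Senior
range_groups = pd.cut(lower_bound, bins=[-1, 17, 54, 120], labels=labels)
print(range_groups.tolist())
# ['Teenager', 'Senior', 'Adult', 'Adult']
```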
