# DATA PREPROCESSING

In this notebook we look at preprocessing: the transformations applied to the data either to remove or replace information that is not useful, or to make classification algorithms work correctly. For example, algorithms such as kNN, logistic regression, and support vector machines need the features to be on the same scale.

Scikit-learn has a module, `preprocessing`, that contains numerous tools for carrying out these operations.
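To see why scale matters for a distance-based method such as kNN, here is a minimal sketch with three made-up applicants (the values and variable names are assumptions for illustration, not part of the datasets used below). Before scaling, the income column dominates the Euclidean distance. After min-max scaling, age counts just as much.

```python
import numpy as np

# Hypothetical applicants: [income in dollars, age in years]
a = np.array([50_000.0, 25.0])
b = np.array([50_500.0, 60.0])   # similar income, very different age
c = np.array([51_000.0, 26.0])   # similar age, somewhat different income

# Raw Euclidean distances: income dominates because of its units
d_ab = np.linalg.norm(a - b)
d_ac = np.linalg.norm(a - c)
print("raw:   ", d_ab, d_ac)

# Min-max scale each column over these three points, then measure again
X_toy = np.vstack([a, b, c])
X_toy_scaled = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))
s_ab = np.linalg.norm(X_toy_scaled[0] - X_toy_scaled[1])
s_ac = np.linalg.norm(X_toy_scaled[0] - X_toy_scaled[2])
print("scaled:", s_ab, s_ac)
```

On the raw values `b` is the nearest neighbour of `a`. After scaling, `c` is.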

In [None]:
import numpy as np
import pandas as pd

In [None]:
dataframe = pd.read_csv('../../datasets/pima-indians-diabetes.csv')

### Working with NaNs

In [None]:
dataframe.info()

In [None]:
dataframe_clean = dataframe.dropna()
print(dataframe_clean.shape)

In [None]:
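# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
mat_clean = imp.fit_transform(dataframe.values)
print(mat_clean)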

In [None]:
# Using pandas
dataframe.fillna(dataframe.mean())
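As a quick check that the two approaches agree, the sketch below fills a tiny made-up frame (the column names and values are assumptions) with both pandas and scikit-learn. It assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing entry per column (made-up values)
toy = pd.DataFrame({"glucose": [100.0, np.nan, 140.0],
                    "bmi": [22.0, 30.0, np.nan]})

# pandas: column means skip NaN by default
filled_pd = toy.fillna(toy.mean())

# scikit-learn: strategy='mean' fills each column with its own mean
filled_sk = SimpleImputer(strategy="mean").fit_transform(toy)

print(filled_pd)
print(filled_sk)
```

Note that `fillna` returns a new frame and leaves the original untouched, so the result has to be assigned to a variable to be kept.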

### Rescaling the data

In [None]:
X = mat_clean[:,0:8]
y = mat_clean[:,8]

In [None]:
# Rescale the data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])

In [None]:
# Standardize the data (mean 0, standard deviation 1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
rescaledX = scaler.fit_transform(X)
print(rescaledX[0:5,:])
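The two scalers above apply simple per-column formulas. The following sketch reproduces them by hand on a made-up column (the array `X_toy` is an assumption for illustration) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single column with one large value
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])

# Min-max scaling: (x - min) / (max - min)
manual_minmax = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))

# Standardization: (x - mean) / std, using the population std (ddof=0)
manual_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

same_minmax = np.allclose(manual_minmax, MinMaxScaler().fit_transform(X_toy))
same_std = np.allclose(manual_std, StandardScaler().fit_transform(X_toy))
print(same_minmax, same_std)
```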

In [None]:
# Normalize the data (scale each row to unit norm)
from sklearn.preprocessing import Normalizer
scaler = Normalizer()
normalizedX = scaler.fit_transform(X)
np.set_printoptions(precision=3)
print(normalizedX[0:5,:])

In [None]:
# Binarize the data (values above the threshold become 1, the rest 0)
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0)
binaryX = binarizer.fit_transform(X)
np.set_printoptions(precision=3)
print(binaryX[0:5,:])
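Unlike the scalers, `Normalizer` works row by row and `Binarizer` element by element. A small sketch on a made-up matrix (the values are assumptions chosen so the arithmetic is easy to follow) makes the difference visible.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, Normalizer

X_toy = np.array([[3.0, 4.0],
                  [0.0, 5.0],
                  [1.0, 0.0]])

# Normalizer (default norm='l2') divides each row by its Euclidean length
normalized = Normalizer().fit_transform(X_toy)
print(normalized)
print(np.linalg.norm(normalized, axis=1))  # every row now has length 1

# Binarizer maps values strictly above the threshold to 1 and the rest to 0
binary = Binarizer(threshold=0.0).fit_transform(X_toy)
print(binary)
```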

### EFFECTS OF PREPROCESSING

Handling different types of data

- There are three types of data:
    - Numeric, e.g. income, age
    - Categorical or nominal, e.g. gender, nationality
    - Ordinal, e.g. low/medium/high
- Scikit-learn works only with numeric features.
- Categorical and ordinal variables must therefore be converted to numeric ones:
    - Create dummy features.
    - Transform a categorical feature into a set of dummy features, each representing a unique category.
    - In the set of dummy features, 1 indicates that the observation belongs to that category.
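The conversion described above can be sketched on a made-up frame (the column names, values, and the `order` mapping are assumptions for illustration). The nominal column becomes a set of dummy columns, while the ordinal column is mapped to integers that preserve its order.

```python
import pandas as pd

toy = pd.DataFrame({"nationality": ["ES", "FR", "ES", "IT"],
                    "level": ["low", "high", "medium", "low"]})

# Nominal feature: one dummy column per unique category
dummies = pd.get_dummies(toy["nationality"], prefix="nationality").astype(int)
print(dummies)

# Ordinal feature: map categories to integers that keep their order
order = {"low": 0, "medium": 1, "high": 2}
toy["level_num"] = toy["level"].map(order)
print(toy)
```

Each row of `dummies` contains exactly one 1, in the column of the category that row belongs to.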



In [None]:
X_train=pd.read_csv('../../datasets/loan_train.csv')
y_train=pd.read_csv('../../datasets/loan_target_train.csv')
X_test=pd.read_csv('../../datasets/loan_test.csv')
y_test=pd.read_csv('../../datasets/loan_target_test.csv')

In [None]:
print (X_train.head())

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 
                   'Loan_Amount_Term', 'Credit_History']],y_train)
print(knn.score(X_test[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 
                   'Loan_Amount_Term', 'Credit_History']],y_test))

In [None]:
y_test['Target'].value_counts()/y_test['Target'].count()

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
# Scale the train and test sets; the scaler is fitted on the training data only
X_train_minmax=scaler.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_minmax=scaler.transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])

In [None]:
knn.fit(X_train_minmax,y_train)
print(knn.score(X_test_minmax,y_test))
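One point worth keeping in mind when scaling train and test sets is that a scaler is normally fitted on the training data and then only applied to the test data. The sketch below uses two made-up arrays (assumptions for illustration) to show how refitting on the test set changes the meaning of the scaled values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[50.0], [75.0]])

# Fit on train only: the scaler learns min=0 and max=100
scaler_demo = MinMaxScaler().fit(train)
test_ok = scaler_demo.transform(test)
print(test_ok.ravel())      # 50 and 75 keep their position within the train range

# Refitting on test: min and max now come from the test set itself
test_refit = MinMaxScaler().fit_transform(test)
print(test_refit.ravel())   # the same two values are stretched to the full range
```

With the refitted scaler, a test value of 50 is mapped to 0 even though the model learned that 50 corresponds to 0.5.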

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log= LogisticRegression(penalty='l2',C=.01)
log.fit(X_train[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 
                   'Loan_Amount_Term', 'Credit_History']], y_train)
print("Logistic regression before preprocessing: ", 
      log.score(X_test[['ApplicantIncome', 'CoapplicantIncome','LoanAmount', 
                   'Loan_Amount_Term', 'Credit_History']],y_test))

log.fit(X_train_minmax, y_train)
print("Logistic regression after min-max scaling: ", log.score(X_test_minmax,y_test))

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_scale=ss.fit_transform(X_train[['ApplicantIncome', 'CoapplicantIncome',
                'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
X_test_scale=ss.transform(X_test[['ApplicantIncome', 'CoapplicantIncome',
               'LoanAmount', 'Loan_Amount_Term', 'Credit_History']])
# Fitting logistic regression on our standardized data set

log.fit(X_train_scale,y_train)
print("Logistic regression after standardizing: ", log.score(X_test_scale, y_test))

In [None]:
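# Fitting on the raw frame is expected to fail: it still contains
# string-valued (object) columns, which must be encoded first
log.fit(X_train,y_train)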

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

In [None]:
for col in X_test.columns:
    if X_test[col].dtype == 'object':
        # Fit on train and test together so the encoder knows every label
        data = pd.concat([X_train[col], X_test[col]])
        le.fit(data.values)
        # Replace the string column with its integer codes
        X_train[col] = le.transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
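To make clear what `LabelEncoder` does to a single column, here is a minimal sketch on a made-up list of categories (the values and the name `le_toy` are assumptions for illustration).

```python
from sklearn.preprocessing import LabelEncoder

le_toy = LabelEncoder()
areas = ["Urban", "Rural", "Semiurban", "Urban"]

# Classes are sorted, then each label is replaced by its position in that order
codes = le_toy.fit_transform(areas)
print(list(le_toy.classes_))
print(list(codes))

# The mapping can be reversed to recover the original labels
restored = list(le_toy.inverse_transform(codes))
print(restored)
```

The integer codes imply an order (here Rural < Semiurban < Urban) that the original categories may not have, which is one reason to consider one-hot encoding for nominal features.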

In [None]:
X_train_scale=ss.fit_transform(X_train)
X_test_scale=ss.transform(X_test)

log.fit(X_train_scale,y_train)
log.score(X_test_scale,y_test)

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Default (sparse) output is converted with .toarray() below
enc = OneHotEncoder()
X_train_ohe = X_train.copy()
X_test_ohe = X_test.copy()
columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
           'Credit_History', 'Property_Area']
for col in columns:
    # Fit on train and test together so the encoder sees every category
    data = pd.concat([X_train[[col]], X_test[[col]]])
    enc.fit(data)
    # Name the dummy columns in the encoder's own category order
    names = [col + "_" + str(c) for c in enc.categories_[0]]
    # One-hot encode each split, keeping the original row index
    temp = pd.DataFrame(enc.transform(X_train[[col]]).toarray(),
                        columns=names, index=X_train.index)
    X_train_ohe = pd.concat([X_train_ohe, temp], axis=1)
    temp = pd.DataFrame(enc.transform(X_test[[col]]).toarray(),
                        columns=names, index=X_test.index)
    X_test_ohe = pd.concat([X_test_ohe, temp], axis=1)
    # Drop the original column now that its dummies are in place
    X_train_ohe.drop(columns=[col], inplace=True)
    X_test_ohe.drop(columns=[col], inplace=True)

In [None]:
X_train_scale=ss.fit_transform(X_train_ohe)
X_test_scale=ss.transform(X_test_ohe)

log.fit(X_train_scale,y_train)

log.score(X_test_scale,y_test)

For these cases I would rather recommend pandas, which makes one-hot encoding easy with the `get_dummies` function.

In [None]:
X_train=pd.read_csv('X_train.csv')
y_train=pd.read_csv('Y_train.csv')
# Importing testing data set
X_test=pd.read_csv('X_test.csv')
y_test=pd.read_csv('Y_test.csv')

In [None]:
pd.get_dummies(X_train['Married'])

In [None]:
X_train.shape

In [None]:
new_df =pd.concat([X_train,X_test], axis=0)

In [None]:
columns=['Gender', 'Married', 'Dependents', 'Education','Self_Employed',
          'Credit_History', 'Property_Area']
for col in columns:
    # Build the dummy columns, drop the original column, and append the dummies
    dummies = pd.get_dummies(new_df.loc[:,col], prefix=col, dummy_na=False)
    new_df = new_df.drop(columns=col)
    new_df = pd.concat([new_df, dummies], axis=1)

In [None]:
new_df.shape

In [None]:
np.concatenate((X_train_ohe, X_test_ohe)).shape
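After encoding the stacked frame, it still has to be split back into its train and test parts before fitting a model. A minimal sketch of that step, using two made-up frames (the values and the names `train`, `test`, `stacked` are assumptions for illustration):

```python
import pandas as pd

train = pd.DataFrame({"Married": ["Yes", "No", "Yes"], "LoanAmount": [120, 80, 150]})
test = pd.DataFrame({"Married": ["No", "Yes"], "LoanAmount": [100, 90]})

# Stack both splits so they end up with identical dummy columns
stacked = pd.concat([train, test], axis=0)
stacked = pd.get_dummies(stacked, columns=["Married"])

# Split back by position: the first len(train) rows are the training set
train_enc = stacked.iloc[:len(train)]
test_enc = stacked.iloc[len(train):]
print(train_enc.shape, test_enc.shape)
```

Splitting by position rather than by index label avoids problems when both frames carry the same default index values.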

We have seen how a classifier's performance can change depending on how we handle the data. There is no single right approach, and it is sometimes hard to know which preprocessing to adopt. Some models, such as decision trees and random forests, need hardly any preprocessing. Others, such as support vector machines, logistic regression, and kNN, require encoding the categorical data and putting all features on the same scale. For those, whether to standardize or simply rescale to between 0 and 1 depends on the nature of the data itself. At the start, the best approach is to try the different options, compare the performance of each, and keep the one that works best.
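One possible way to run such a comparison is to wrap each preprocessing option in a pipeline and score it with cross-validation. The sketch below is only an illustration: it uses scikit-learn's built-in wine dataset rather than the loan data, and the names `candidates` and `results` are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Built-in dataset, used here only as a stand-in for real data
X_demo, y_demo = load_wine(return_X_y=True)

# Same classifier, three preprocessing options
candidates = {
    "no scaling": make_pipeline(KNeighborsClassifier(n_neighbors=5)),
    "min-max": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5)),
    "standardized": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# 5-fold cross-validated accuracy for each option
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

Because the scaler sits inside the pipeline, it is refitted on the training folds of each split, so no information from the held-out fold reaches the preprocessing step.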

References:

- http://scikit-learn.org/stable/modules/preprocessing.html