In [1]:
import pandas as pd

volunteer = pd.read_csv('volunteer_opportunities.csv')
wine = pd.read_csv('wine_types.csv')

Data preprocessing comes after we've explored and cleaned our dataset, so we already understand its contents, structure, and quality.

Preprocessing is a prerequisite for modeling.

Good practices:
* Display the DataFrame
* Check `.head()`
* Check `.info()`
* Check `.describe()`

# Preprocessing

## 1. Handling missing data

1. Count how many values are missing

    * `df.isna().sum()`

2. Drop data where necessary

    * `df.dropna()` drops every row with at least one missing value; a good option if only a small number of rows contain missing data

    * `df.drop([1, 2, 3])` drops rows by index label

    * `df.drop('A', axis=1)` drops columns

    * `df.dropna(subset=['B'])` drops rows with missing values in a subset of columns

    * `df.dropna(thresh=2)` keeps only rows that have at least that many non-missing values
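
The calls listed above can be tried on a small table first. This is a minimal sketch on hypothetical toy data (the column names `A`, `B`, `C` and their values are made up for illustration), not the volunteer dataset itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls listed above
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 4.0],
    'B': [np.nan, np.nan, 6.0, 7.0],
    'C': ['x', None, 'z', 'w'],
})

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
complete_rows = df.dropna()

# Drop rows only when column B is missing
b_present = df.dropna(subset=['B'])

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Drop a column by name
without_a = df.drop('A', axis=1)

print(complete_rows.shape, b_present.shape, at_least_two.shape, without_a.shape)
```

None of these calls modify `df` in place; each returns a new DataFrame, which is why the results are assigned to new names.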

In [2]:
# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(['Latitude','Longitude'], axis=1)

# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset=['category_desc'])

# Print out the shape of the subset
print(volunteer_subset.shape)

(617, 33)


## 2. Converting data to the appropriate types

* `df.dtypes.value_counts()` shows how many columns there are of each type
  
* `df.info()` gives general information
  
* `df.dtypes` shows only the type of each column
  
* `df['C'] = df['C'].astype('float')` converts a column to the appropriate type; note that the type is passed in quotes
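
As a minimal sketch of the conversion step, assuming a hypothetical toy DataFrame in which a numeric column was read in as strings (the `messy` series and the use of `pd.to_numeric` are additions for illustration, not part of the original notes):

```python
import pandas as pd

# Hypothetical toy data: 'hits' was read in as strings
df = pd.DataFrame({'hits': ['737', '22', '62'], 'title': ['a', 'b', 'c']})

print(df.dtypes)                 # dtype of each column
print(df.dtypes.value_counts())  # how many columns there are of each dtype

# Convert to integers; the target type is passed as a string
df['hits'] = df['hits'].astype('int')

# pd.to_numeric is an alternative when some values may not parse:
# errors='coerce' turns unparseable entries into NaN instead of raising
messy = pd.Series(['1', '2', 'n/a'])
cleaned = pd.to_numeric(messy, errors='coerce')
print(cleaned)
```

`astype` raises an error if any value cannot be converted, so the `errors='coerce'` route may be the more forgiving choice for columns that still contain placeholders.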

In [3]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer['hits'].astype('int')

# Look at the dtypes of the dataset
print(volunteer.dtypes.value_counts())

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
object     14
float64    13
int64       8
Name: count, dtype: int64


## 3. Splitting the DataFrame

* `train_test_split()` (with `stratify` if necessary)
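
To see what `stratify` does, here is a small sketch on hypothetical imbalanced labels (80 rows of class `a`, 20 of class `b`; the data is invented for illustration). With stratification, both splits should keep the same 80/20 class proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class 'a', 20 of class 'b'
X = pd.DataFrame({'feature': range(100)})
y = pd.Series(['a'] * 80 + ['b'] * 20, name='label')

# Default test_size is 0.25, so 75 rows go to train and 25 to test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Class proportions in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without `stratify`, a random split of a small or imbalanced dataset can leave a rare class under-represented in one of the splits.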

In [6]:
from sklearn.model_selection import train_test_split

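# The first version of this cell raised "ValueError: Input contains NaN":
# the result of volunteer.dropna() was overwritten on the next line, and
# y still contained missing labels.
# Drop rows with a missing category_desc before building X and y
volunteer_clean = volunteer.dropna(subset=['category_desc'])

# Create a DataFrame with all columns except category_desc
X = volunteer_clean.drop('category_desc', axis=1)

# Create a category_desc labels dataset
y = volunteer_clean[['category_desc']]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train['category_desc'].value_counts())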

In [None]:
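from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
# (X must contain only numeric features before fitting the model)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()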

# Fit the knn model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

## 4. Standardization

Standardization "normalizes" distributions that are not standard, which matters especially for scikit-learn models. It applies to continuous numeric data, NOT to categorical data. The better practice is to perform standardization (or any feature scaling/normalization) after splitting the dataset into training and test sets. This prevents data leakage because the test data stays completely separate and unseen while the scaler is fitted.

* When is it used?

    * If we're working with a model that uses a linear distance metric or operates in a linear space, like k-nearest neighbors, linear regression, or k-means clustering, the model assumes that the data and features we give it are related in a linear fashion, or can be measured with a linear distance metric, which may not always be the case.
    * Standardization should also be used when dataset features have a high variance, which is also related to distance metrics. **If a feature in our dataset has a variance that's an order of magnitude or more greater than the other features** (that is, roughly ten times larger or more), this could impact the model's ability to learn from other features in the dataset. In the wine data used later in these notes, for example, `Proline` has a variance of about 99,167 while `Magnesium` has about 204.
    * Modeling a dataset that contains continuous features on different scales is another standardization scenario. For example, consider predicting house prices using two features: the number of bedrooms and the last sale price.
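
The house price scenario can be made concrete with a short sketch. The three houses below are hypothetical values chosen for illustration; the point is only to show how a large-scale column dominates a Euclidean distance until the columns are standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [bedrooms, last sale price]
houses = np.array([
    [2, 300_000.0],
    [5, 301_000.0],
    [2, 310_000.0],
])

# Variance of each column: price is many orders of magnitude larger
print(houses.var(axis=0))

# Raw Euclidean distance from house 0 to the other two houses.
# The 3-bedroom difference barely changes the first distance.
raw_dist = np.linalg.norm(houses[1:] - houses[0], axis=1)
print(raw_dist)

# Same distances after standardizing each column to mean 0, variance 1
scaled = StandardScaler().fit_transform(houses)
scaled_dist = np.linalg.norm(scaled[1:] - scaled[0], axis=1)
print(scaled_dist)
```

On the raw data the distance is driven almost entirely by price, so a distance-based model such as k-nearest neighbors would effectively ignore the number of bedrooms. After scaling, both columns contribute on a comparable scale.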

   

* Log normalization: useful when we have features with high variance. Log normalization is a good strategy when you care about relative changes in a linear model but still want to capture the magnitude of change, and when we want to keep everything in the positive space. It's a nice way to minimize the variance of a column and make it comparable to other columns for modeling. Note that `np.log` is only defined for strictly positive values.

    * Check the variance of the columns:
      `df.var()`
    * Apply the NumPy function:
      `df['log_2'] = np.log(df['col2'])` (it is good practice to create a new column whose name indicates that it holds the normalized data)
      
   

In [4]:
import numpy as np

# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012


* Scaling: a method of standardization that's most useful when we're working with a dataset that contains continuous features on different scales, and we're using a model that operates in some sort of linear space (like linear regression or k-nearest neighbors). Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one. It fits the case where the numbers have consistent scales within each column but not across columns, even if the variance of each column is relatively low.

    * Scaling:

        from sklearn.preprocessing import StandardScaler  # import
        scaler = StandardScaler()  # create the scaler object
        df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)  # build a new DataFrame; fit_transform fits and transforms in one step

In [13]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Subset the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to wine_subset
wine_subset_scaled = scaler.fit_transform(wine_subset)

print(wine_subset.var())
print(pd.DataFrame(wine_subset_scaled).var())

Ash                    0.075265
Alcalinity of ash     11.152686
Magnesium            203.989335
dtype: float64
0    1.00565
1    1.00565
2    1.00565
dtype: float64
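
The scaled variances printed above are 1.00565 rather than exactly 1. A likely explanation is a difference in conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.var()` defaults to the sample variance (`ddof=1`). The ratio between the two is n/(n-1), and 178/177 is about 1.00565, which would match if `wine` has 178 rows (an assumption, since the row count is not shown in these notes). A sketch on hypothetical random data with 178 rows:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical random data with 178 rows (assumed row count of the wine data)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(loc=50, scale=10, size=(178, 3)),
    columns=['a', 'b', 'c'],
)

scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

sample_var = scaled.var()            # pandas default: ddof=1
population_var = scaled.var(ddof=0)  # same convention as StandardScaler

print(sample_var)
print(population_var)
print(178 / 177)
```

Passing `ddof=0` to `.var()` is a quick way to confirm that the scaler did produce unit variance under its own convention.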


In [None]:
### EXAMPLE

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) 

# Calling transform (rather than fit_transform) on the test set means the
# test features are not used to fit the scaler, which avoids data leakage.

knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)

TODO: try applying this to the wine dataset.
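
One possible way to try this on wine data is sketched below. It assumes that scikit-learn's bundled wine dataset (`load_wine`) is an acceptable stand-in for `wine_types.csv`; the column names and the label column of the CSV are not shown in these notes, so the bundled data is used to keep the example self-contained:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for wine_types.csv
X, y = load_wine(return_X_y=True, as_frame=True)

# Stratified split so each wine class keeps its proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training mean and std

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
score = knn.score(X_test_scaled, y_test)
print(score)
```

To adapt this to the CSV, `X` would be the numeric feature columns of `wine` and `y` its class label column, whatever that column is named in the file.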