<a href="https://colab.research.google.com/github/financieras/ai/blob/main/logistic_regression/jupyter/normalizar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Normalize the dataset
Starting from a CSV file, we clean the data, select columns, and normalize the numeric features.

The source files are stored in Google Drive:
- 'datasets/dataset_train.csv'
- 'datasets/dataset_test.csv'

Because the files live in Google Drive, we mount the drive, read the training file, and build a DataFrame from it.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd

# Path to the file in Google Drive
input_file = '/content/drive/My Drive/datasets/dataset_train.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(input_file)

# Show column names, non-null counts, and dtypes
# (info() prints directly and returns None, so no print() is needed)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600 entries, 0 to 1599
Data columns (total 19 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Index                          1600 non-null   int64  
 1   Hogwarts House                 1600 non-null   object 
 2   First Name                     1600 non-null   object 
 3   Last Name                      1600 non-null   object 
 4   Birthday                       1600 non-null   object 
 5   Best Hand                      1600 non-null   object 
 6   Arithmancy                     1566 non-null   float64
 7   Astronomy                      1568 non-null   float64
 8   Herbology                      1567 non-null   float64
 9   Defense Against the Dark Arts  1569 non-null   float64
 10  Divination                     1561 non-null   float64
 11  Muggle Studies                 1565 non-null   float64
 12  Ancient Runes                  1565 non-null   f

In [3]:
# Drop the 'Astronomy' column
df = df.drop(columns=['Astronomy'])

In [4]:
# Convert 'Birthday' to datetime format
df['Birthday'] = pd.to_datetime(df['Birthday'])

# Convert 'Best Hand' to a binary variable (0 for Left, 1 for Right)
df['Best Hand'] = df['Best Hand'].map({'Left': 0, 'Right': 1})

# Reset the index to a continuous sequence
df = df.reset_index(drop=True)

# Remove the original 'Index' column if it exists
if 'Index' in df.columns:
    df = df.drop('Index', axis=1)

# info() prints directly and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600 entries, 0 to 1599
Data columns (total 17 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   Hogwarts House                 1600 non-null   object        
 1   First Name                     1600 non-null   object        
 2   Last Name                      1600 non-null   object        
 3   Birthday                       1600 non-null   datetime64[ns]
 4   Best Hand                      1600 non-null   int64         
 5   Arithmancy                     1566 non-null   float64       
 6   Herbology                      1567 non-null   float64       
 7   Defense Against the Dark Arts  1569 non-null   float64       
 8   Divination                     1561 non-null   float64       
 9   Muggle Studies                 1565 non-null   float64       
 10  Ancient Runes                  1565 non-null   float64       
 11  History of Magic 
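
`Series.map` with a dict returns NaN for any value that is not a key in the dict, so an unexpected 'Best Hand' label would silently become a missing value. The output above shows 1600 non-null values, so that does not happen in this dataset. The sketch below uses a made-up Series (the 'Ambidextrous' label is hypothetical, not taken from the data) to show the behavior and one simple way to check for it:

```python
import pandas as pd

# Hypothetical sample; 'Ambidextrous' is not a key in the mapping
hands = pd.Series(['Left', 'Right', 'Ambidextrous'])
mapped = hands.map({'Left': 0, 'Right': 1})

# Labels that are absent from the dict come back as NaN
unmapped = hands[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every label was covered by the mapping.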

In [5]:
# Compute age in years relative to the latest birthday in the dataset,
# so the youngest student gets age 0
df['Birthday'] = pd.to_datetime(df['Birthday'])  # no-op if already datetime
reference_date = df['Birthday'].max()
df['Age'] = (reference_date - df['Birthday']).dt.days / 365.25

# Remove unnecessary columns
columns_to_drop = ['First Name', 'Last Name', 'Birthday']
df = df.drop(columns=columns_to_drop)

# Select the float64 columns to normalize (this includes the new 'Age' column;
# 'Best Hand' is still int64 here, so it keeps its 0/1 values)
columns_to_normalize = df.select_dtypes(include=['float64']).columns.tolist()

# Function to normalize using mean and standard deviation
def normalize(column):
    mean = column.mean()
    std = column.std()
    return (column - mean) / std

# Apply normalization to the selected columns
df[columns_to_normalize] = df[columns_to_normalize].apply(normalize)

# Apply one-hot encoding for Hogwarts House
df = pd.get_dummies(df, columns=['Hogwarts House'], prefix='House', dtype=float)

# Convert int values in 'Best Hand' column to float64 values
df['Best Hand'] = df['Best Hand'].astype(float)


# info() prints directly and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600 entries, 0 to 1599
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Best Hand                      1600 non-null   float64
 1   Arithmancy                     1566 non-null   float64
 2   Herbology                      1567 non-null   float64
 3   Defense Against the Dark Arts  1569 non-null   float64
 4   Divination                     1561 non-null   float64
 5   Muggle Studies                 1565 non-null   float64
 6   Ancient Runes                  1565 non-null   float64
 7   History of Magic               1557 non-null   float64
 8   Transfiguration                1566 non-null   float64
 9   Potions                        1570 non-null   float64
 10  Care of Magical Creatures      1560 non-null   float64
 11  Charms                         1600 non-null   float64
 12  Flying                         1600 non-null   f
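
The `normalize` function above is a z-score: each value is shifted by the column mean and divided by the column standard deviation. In pandas, `mean()` and `std()` skip NaN by default, and `std()` uses the sample standard deviation (`ddof=1`). A minimal sketch on made-up scores (the values are illustrative, not from the dataset) showing that missing values stay missing and the result has mean 0 and standard deviation 1:

```python
import pandas as pd

def normalize(column):
    # Same z-score as in the cell above: (x - mean) / std
    return (column - column.mean()) / column.std()

# Hypothetical scores with one missing value
scores = pd.Series([2.0, 4.0, None, 6.0])
z = normalize(scores)

print(z.tolist())
print(z.mean(), z.std())
```

Here the mean of the non-missing values is 4 and their sample standard deviation is 2, so the three scores map to -1, 0, and 1, while the missing entry remains NaN.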

In [6]:
# Drop rows with missing data
df = df.dropna()

# Remove duplicate rows
df = df.drop_duplicates()
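
To see how many rows each cleaning step removes, the row counts can be compared before and after. The sketch below runs on a small made-up frame (the column values are hypothetical) rather than on `df`. It also resets the index at the end, since `dropna` and `drop_duplicates` leave gaps in it; whether a contiguous index is needed depends on how the data is used afterwards.

```python
import pandas as pd

# Hypothetical frame: one row with a missing value and one exact duplicate
toy = pd.DataFrame({
    'Charms': [0.5, 0.5, None, -1.2],
    'Flying': [1.0, 1.0, 0.3, 0.7],
})

rows_before = len(toy)
no_missing = toy.dropna()                 # drops the row containing NaN
rows_after_dropna = len(no_missing)
deduplicated = no_missing.drop_duplicates().reset_index(drop=True)

print(rows_before, rows_after_dropna, len(deduplicated))
```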