# Machine Learning Preprocessing Notebook Template

## Preprocessing Best Practices

1. Inspect data
2. [Summary statistics](</Users/jesus/Library/Mobile Documents/iCloud~md~obsidian/Documents/DataScience/Data Analysis/EDA.md>)
3. Handle [missing data](</Users/jesus/Library/Mobile Documents/iCloud~md~obsidian/Documents/DataScience/Data Analysis/Cleaning Data/Missing Data.md>)
4. [Transform dataset](</Users/jesus/Library/Mobile Documents/iCloud~md~obsidian/Documents/DataScience/MachineLearning/Preprocessing/Transforming Data.md>)
5. [Data standardization](</Users/jesus/Library/Mobile Documents/iCloud~md~obsidian/Documents/DataScience/MachineLearning/Preprocessing/Transforming Data.md>)
    - [Data in numeric format](</Users/jesus/Library/Mobile Documents/iCloud~md~obsidian/Documents/DataScience/MachineLearning/Preprocessing/Transforming Data.md>)
    - Data stored in a `pandas` DataFrame or `numpy` array
    - Perform [EDA](</Users/jesus/Library/Mobile Documents/iCloud~md~obsidian/Documents/DataScience/Data Analysis/EDA.md>) first to ensure data is in the correct format
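
The first steps of the checklist above can be sketched on a small frame before touching a real dataset. This is a minimal sketch, not a full EDA: the `toy` DataFrame and its column names are hypothetical stand-ins, and only standard `pandas` calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
toy = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 41],
        "income": [40000.0, 52000.0, 61000.0, np.nan],
        "city": ["Lima", "Quito", None, "Lima"],
    }
)

# Inspect structure and dtypes
toy.info()

# Summary statistics for the numeric columns
summary = toy.describe()
print(summary)

# Count missing values per column before deciding how to handle them
missing_counts = toy.isna().sum()
print(missing_counts)

# List columns that still need converting to a numeric format
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Columns reported in `non_numeric_cols` are the ones that would need encoding before standardization.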

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

## Load the dataset

In [None]:
# Replace 'your_dataset.csv' with your actual dataset file
data = pd.read_csv("your_dataset.csv")

In [None]:
# Display the first few rows of the dataset
print(data.head())

## Handle missing values

In [None]:
# Example: Fill missing values in numeric columns with the column mean
# (numeric_only=True skips non-numeric columns such as 'Category')
data.fillna(data.mean(numeric_only=True), inplace=True)
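
A mean fill only makes sense for numeric columns, so non-numeric columns need a different strategy. The sketch below is one possible approach, assuming a most-frequent-value fill is acceptable for categorical data; the `toy` frame and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column
toy = pd.DataFrame(
    {
        "height": [1.60, np.nan, 1.80],
        "color": ["red", None, "red"],
    }
)

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# Numeric columns: fill with the column mean
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Non-numeric columns: fill with the most frequent value (assumed strategy)
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Whether the mean, the median, or dropping rows is the better choice depends on the dataset, so treat the strategies here as placeholders.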

## Encode categorical variables

In [None]:
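# Example: Encode a categorical column 'Category' as integer codes
# Note: scikit-learn documents LabelEncoder as intended for target labels;
# the integer codes imply an order that the categories may not have
label_encoder = LabelEncoder()
data["Category"] = label_encoder.fit_transform(data["Category"])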
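
For a nominal feature with no natural order, one-hot encoding is a common alternative to integer codes. The following is a minimal sketch using `pd.get_dummies`; the `toy` frame and the `value` column are hypothetical, and this is a swapped-in technique rather than what the template itself does.

```python
import pandas as pd

# Hypothetical toy data with one nominal feature
toy = pd.DataFrame({"Category": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Category"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so no artificial ordering is introduced, at the cost of a wider feature matrix.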

## Split the dataset and standardize the features

In [None]:
# Replace 'target_column' with your actual target column name
X = data.drop("target_column", axis=1)
y = data["target_column"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

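# Standardize the features: fit the scaler on the training set only,
# then apply the same scaling to the test set to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)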

# Display the shapes of the training and testing sets
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
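
A quick sanity check on the scaled output can catch mistakes such as fitting the scaler on the wrong split. The sketch below repeats the split and scaling on synthetic data; the `_demo` names and the random inputs are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and binary target (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training split only, then reuse the fitted scaler on the test split
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Training columns should have mean ~0 and standard deviation ~1
train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
print(train_means, train_stds)
```

The test split will generally not have exactly zero mean or unit variance, which is expected because it is scaled with the training statistics.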

## Save the preprocessed data

In [None]:
# Replace 'preprocessed_data.csv' with your desired output file name
# Note: only the scaled training set is saved here
preprocessed_data = pd.DataFrame(X_train, columns=X.columns)
preprocessed_data["target_column"] = y_train.values
preprocessed_data.to_csv("preprocessed_data.csv", index=False)

print("Preprocessing complete. Preprocessed data saved to preprocessed_data.csv")
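
The cell above writes only the training split. If the test split is also needed later, it can be saved the same way; the sketch below assumes a hypothetical file name `preprocessed_test.csv` and made-up feature names, and writes to a temporary directory so it leaves no files behind.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical scaled test features and target
feature_names = ["f1", "f2"]
X_te_demo = np.array([[0.1, -0.2], [1.5, 0.3]])
y_te_demo = pd.Series([0, 1], name="target_column")

# Rebuild a DataFrame so the column names survive the round trip
test_frame = pd.DataFrame(X_te_demo, columns=feature_names)
test_frame["target_column"] = y_te_demo.values

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "preprocessed_test.csv"  # hypothetical name
    test_frame.to_csv(out_path, index=False)
    # Read the file back to confirm it was written as expected
    reloaded = pd.read_csv(out_path)

print(reloaded)
```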
