# Machine Learning Data Preprocessing

This notebook walks through the basic steps of preprocessing a machine learning dataset with Python and pandas, using a housing-price dataset as the running example.

## Step 1: Import Necessary Libraries

We will start by importing the necessary libraries.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Step 2: Load the Dataset

Next, we will load the dataset using pandas.

In [None]:
# Load the dataset
file_path = '/kaggle/input/house-prices-advanced-regression-techniques/train.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()

## Step 3: Separate Features and Target Variable

We need to separate the features (input variables) and the target variable (output variable).

In [None]:
# Separate features and target variable
X = data.drop(columns=['SalePrice', 'Id'])
y = data['SalePrice']

## Step 4: Identify Numerical and Categorical Columns

We need to identify which columns are numerical and which are categorical.

In [None]:
# Identify numerical and categorical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object']).columns

print(f"Numerical columns: {numerical_cols}")
print(f"Categorical columns: {categorical_cols}")
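
As a small illustration of the dtype-based selection above, the sketch below uses a made-up three-column frame (the values are invented for the example, not read from the housing file). It passes `include="number"`, which matches every numeric dtype rather than only `int64` and `float64`, and treats everything else as categorical. This is one possible variant, not a replacement for the cell above.

```python
import pandas as pd

# Toy frame; the values are illustrative assumptions, not real housing rows.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    "LotConfig": ["Inside", "FR2", "Inside"],
})

# "number" matches every numeric dtype, not only int64/float64.
toy_numerical = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here.
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_numerical)
print(toy_categorical)
```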

In [None]:
# Inspect the most frequent value of one categorical column
X['LotConfig'].mode()[0]

## Step 5: Handle Missing Values

For simplicity, we will fill missing values in numerical columns with the median value and in categorical columns with the most frequent value.

In [None]:
# Fill missing values in numerical columns with the median.
# Assign the result back: inplace=True on a column selection is unreliable in recent pandas.
for col in numerical_cols:
    X[col] = X[col].fillna(X[col].median())

# Fill missing values in categorical columns with the most frequent value
for col in categorical_cols:
    X[col] = X[col].fillna(X[col].mode()[0])
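
To make the two fill rules concrete, here is a sketch on a four-row toy frame (values invented for illustration). It assigns each filled column back to the frame instead of using `inplace=True`, since recent pandas versions warn about in-place fills on a column selection.

```python
import pandas as pd

# Toy frame with one missing value per column (illustrative values only).
toy = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0, 70.0],
    "GarageType": ["Attchd", None, "Attchd", "Detchd"],
})

# The median of the observed values 60, 70, 80 is 70.
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())
# mode() returns a Series because ties are possible, so take its first entry.
toy["GarageType"] = toy["GarageType"].fillna(toy["GarageType"].mode()[0])

print(toy)
```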

## Step 6: Encode Categorical Variables

We need to convert categorical variables into numerical values using one-hot encoding.

In [None]:
# One-hot encode categorical variables
X = pd.get_dummies(X, columns=categorical_cols)

In [None]:
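# Alternative (left commented out): drop_first=True drops one level per feature,
# and dtype='int' yields 0/1 columns instead of booleans.
# pd.get_dummies(X[categorical_cols], dtype='int', drop_first=True)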

In [None]:
X.head()
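
The effect of one-hot encoding is easiest to see on a tiny frame. The sketch below (toy values, assumed for illustration) compares the default output of `pd.get_dummies` with the `drop_first=True` variant, which drops the first category level so that it acts as the baseline.

```python
import pandas as pd

# Toy frame; the values are illustrative assumptions.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotConfig": ["Inside", "Corner", "Inside"],
})

# Default behaviour: one indicator column per category.
full = pd.get_dummies(toy, columns=["LotConfig"], dtype=int)
# drop_first=True removes the first level ("Corner" here, by sorted order).
reduced = pd.get_dummies(toy, columns=["LotConfig"], dtype=int, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level avoids perfectly collinear indicator columns, which mostly matters for linear models; tree-based models are usually indifferent to it.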

## Step 7: Split the Data into Training and Test Sets

We will split the dataset into training and test sets to evaluate the performance of our machine learning model.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

With these steps, we have preprocessed our dataset and it's now ready for training a machine learning model.
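
One caveat about the order used here: the medians and modes in Step 5 are computed on the full dataset before the split in Step 7, so test rows influence the fill values. A minimal sketch of a leakage-aware alternative follows, assuming scikit-learn's `ColumnTransformer`, `SimpleImputer`, and `OneHotEncoder` and a made-up toy frame. It is an alternative arrangement, not the procedure this notebook follows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the housing data; the values are assumptions.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550, 14260, 14115, 10084, 10382],
    "LotConfig": ["Inside", "FR2", "Inside", "Corner", np.nan, "Inside", "Inside", "Corner"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000],
})
X_toy = toy.drop(columns=["SalePrice"])
y_toy = toy["SalePrice"]

# Split first, so that no statistic is computed from test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["LotArea"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["LotConfig"]),
])

# The median, mode, and category list are learned from the training rows only.
X_tr_prepared = preprocess.fit_transform(X_tr)
# The test rows are transformed with those same learned values.
X_te_prepared = preprocess.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```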

## Model Building

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
# R^2 on the training set
model.score(X_train, y_train)

In [None]:
# R^2 on the held-out test set
model.score(X_test, y_test)

In [None]:
from sklearn.metrics import mean_squared_error

# Mean squared error on the test set (in squared price units)
y_pred = model.predict(X_test)
mean_squared_error(y_test, y_pred)
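
The raw MSE is hard to read because it is in squared price units. The sketch below, on four made-up prices and predictions, shows how MSE, its square root (RMSE), and the R^2 value returned by `score()` relate to each other; `r2_score` computes the same quantity that a regressor's `score()` method reports.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 290000.0, 260000.0])

mse = mean_squared_error(y_true, y_hat)  # mean of the squared errors
rmse = np.sqrt(mse)                      # back in the units of the target
r2 = r2_score(y_true, y_hat)             # 1 - SS_res / SS_tot

print(mse, rmse, r2)
```

Every prediction here is off by 10,000, so the RMSE is 10,000 as well, which is easier to interpret than an MSE of 1e8.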

### Decision Tree

In [None]:
# Fit an unrestricted decision tree and plot it
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

model = DecisionTreeRegressor()
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # training R^2
print(model.score(X_test, y_test))    # test R^2
plt.figure(figsize=(20, 10))
# Pass a plain list of names; some scikit-learn versions reject a pandas Index here
plot_tree(model, feature_names=list(X.columns), filled=True)
plt.show()


In [None]:
# Limit the tree to a depth of 3 and plot it
# (DecisionTreeRegressor, plot_tree, and plt were imported in the previous cell)
model = DecisionTreeRegressor(max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # training R^2
print(model.score(X_test, y_test))    # test R^2
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=list(X.columns), filled=True)
plt.show()


In [None]:
# Require at least 20 samples in every leaf and plot the tree
# (DecisionTreeRegressor, plot_tree, and plt were imported in an earlier cell)
model = DecisionTreeRegressor(min_samples_leaf=20)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # training R^2
print(model.score(X_test, y_test))    # test R^2
plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=list(X.columns), filled=True)
plt.show()
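
The three decision tree cells differ only in how far the tree is allowed to grow. A hedged sketch on synthetic data (generated here, not the housing set) puts the three settings side by side. An unrestricted tree typically scores close to 1.0 on its own training data, which is why the test score is the one worth comparing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear signal, invented for this illustration.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = 5.0 * X_syn[:, 0] + rng.normal(0, 5.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# The same three settings used in the cells above.
settings = {
    "unrestricted": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
}

scores = {}
for name, params in settings.items():
    tree = DecisionTreeRegressor(random_state=0, **params)
    tree.fit(X_tr, y_tr)
    # Store (training R^2, test R^2) for each setting.
    scores[name] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train={train_r2:.3f}, test={test_r2:.3f}")
```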


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

model_rf = RandomForestRegressor()
model_rf.fit(X_train, y_train)
model_rf.score(X_train, y_train)

In [None]:
model_rf.score(X_test, y_test)

In [None]:
y_pred = model_rf.predict(X_test)
mean_squared_error(y_test, y_pred)

In [None]:
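# A smaller, shallower forest: 50 trees, depth at most 3, at least 5 samples per leaf
model_rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, max_depth=3, random_state=42)
model_rf.fit(X_train, y_train)
# Score the forest itself (not the earlier decision tree stored in `model`)
print(model_rf.score(X_train, y_train))
model_rf.score(X_test, y_test)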
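
Beyond the R^2 scores, a fitted forest exposes `feature_importances_`, which can help sanity-check what the model relies on. The sketch below uses synthetic data with hypothetical column names (`signal`, `noise_a`, `noise_b`) where only one column drives the target; it is an illustration of the attribute, not an analysis of the housing features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only "signal" influences the target (names are hypothetical).
rng = np.random.default_rng(0)
X_syn = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=200),
    "noise_a": rng.uniform(0, 10, size=200),
    "noise_b": rng.uniform(0, 10, size=200),
})
y_syn = 3.0 * X_syn["signal"] + rng.normal(0, 1.0, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_syn, y_syn)

# Impurity-based importances, one per column, normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_syn.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```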

In [None]:
# Alternative encoding: map each category to an integer with LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Reload the dataset so that we start again from the raw columns
file_path = 'house_train.csv'
data = pd.read_csv(file_path)

# Separate features and target variable
X = data.drop(columns=['SalePrice', 'Id'])
y = data['SalePrice']

# Fill missing values in numerical columns with the median
for col in numerical_cols:
    X[col] = X[col].fillna(X[col].median())

# Fill missing categorical values as in Step 5 (mixed NaN/str input may raise
# in older scikit-learn versions), then integer-encode each column
label_encoder = LabelEncoder()
for col in categorical_cols:
    X[col] = X[col].fillna(X[col].mode()[0])
    X[col] = label_encoder.fit_transform(X[col])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")

model_rf = RandomForestRegressor(n_estimators=20, min_samples_leaf=3, random_state=42)
model_rf.fit(X_train, y_train)
print(model_rf.score(X_train, y_train))
model_rf.score(X_test, y_test)
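
`LabelEncoder` is documented by scikit-learn as an encoder for target labels; for feature columns the counterpart is `OrdinalEncoder`, which handles several columns at once and can map categories unseen during fitting to a fixed value. The following is a minimal sketch on toy data (the category values are assumptions), not a change to the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames; "CulDSac" does not appear in the training data.
train_toy = pd.DataFrame({"LotConfig": ["Inside", "Corner", "Inside"]})
test_toy = pd.DataFrame({"LotConfig": ["Corner", "CulDSac"]})

# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_toy)
test_codes = encoder.transform(test_toy)

print(train_codes.ravel())
print(test_codes.ravel())
```

Fitting the encoder on the training frame and reusing it on the test frame keeps the integer codes consistent between the two.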