# House Prices Prediction Project using CRISP-DM Methodology
This notebook follows the CRISP-DM methodology for predicting house prices using the Kaggle house prices dataset.
We will cover the following phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

## Phase 1: Business Understanding
The goal of this project is to predict house prices based on various features in the dataset. Our key objectives are:
- Predict house prices
- Identify the most important features
- Improve decision-making for real estate professionals

## Phase 2: Data Understanding
Let's start by loading and inspecting the dataset.

In [None]:
# Import necessary libraries
import pandas as pd
# Load the dataset
file_path = '/content/house prices dataset.csv'
df = pd.read_csv(file_path)
# Check the size of the dataset
print('Dataset shape:', df.shape)
# Display the first few rows of the dataset
df.head()

### Data Quality Assessment
Now, let's check for missing values, duplicated rows, and outliers.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print('Missing values per column:\n', missing_values)
# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print('Number of duplicate rows:', duplicate_rows)
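
The cell above covers missing values and duplicates but not outliers. One common way to flag them is the 1.5 × IQR rule, sketched below. The helper `count_iqr_outliers`, the multiplier `k`, and the toy data are assumptions made for illustration; they are not part of the original notebook, and in practice you would pass `df` instead of the toy frame.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN are False, so missing values are never counted
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Tiny made-up frame (hypothetical values); replace with df in the notebook
toy = pd.DataFrame({
    'LotArea': [8000, 8500, 9000, 9500, 10000, 200000],
    'Rooms': [5, 6, 6, 7, 7, 8],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

A flagged value is not necessarily an error: very large lots or very expensive houses may be genuine, so it is usually worth inspecting the flagged rows before dropping or capping them.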

## Phase 3: Data Preparation
We'll clean the data, handle missing values, select features, and preprocess the data.

In [None]:
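# Drop columns with more than 30% missing values
# (dropna's thresh counts NON-missing values, so filtering on the missing fraction is clearer)
max_missing_fraction = 0.3
missing_fraction = df.isnull().mean()
# .copy() avoids pandas' SettingWithCopyWarning when assigning below
df_cleaned = df.loc[:, missing_fraction <= max_missing_fraction].copy()
# Fill missing numerical values with the column median
numerical_columns = df_cleaned.select_dtypes(include='number').columns
df_cleaned[numerical_columns] = df_cleaned[numerical_columns].fillna(df_cleaned[numerical_columns].median())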
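
When dropping sparse columns, it helps to remember that the `thresh` argument of `DataFrame.dropna` is the minimum number of non-missing values a column must have to be kept, not the number of missing values allowed. The sketch below illustrates this on a made-up 10-row frame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame with 10 rows: 'a' has 0% missing, 'b' 40%, 'c' 80%
toy = pd.DataFrame({
    'a': range(10),
    'b': [1.0] * 6 + [np.nan] * 4,
    'c': [1.0] * 2 + [np.nan] * 8,
})

# thresh=3 (30% of the rows) keeps every column with at least 3 non-missing
# values, so only columns that are more than 70% missing are dropped
kept_loose = list(toy.dropna(axis=1, thresh=3).columns)

# thresh=7 (70% of the rows) drops columns that are more than 30% missing
kept_strict = list(toy.dropna(axis=1, thresh=7).columns)

# Filtering on the missing fraction directly expresses the same rule
kept_by_fraction = list(toy.loc[:, toy.isnull().mean() <= 0.3].columns)

print(kept_loose, kept_strict, kept_by_fraction)
```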

## Phase 4: Modeling
We start with a baseline Linear Regression model. More advanced models such as Random Forest and XGBoost are natural follow-ups once the baseline is in place.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

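# Split data into features (X) and target (y)
# LinearRegression needs numeric input, so keep only the numeric columns here;
# categorical columns would have to be encoded (e.g. with pd.get_dummies) before use
X = df_cleaned.drop(columns='SalePrice').select_dtypes(include='number')
y = df_cleaned['SalePrice']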

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
lin_reg = LinearRegression()

# Train the model
lin_reg.fit(X_train, y_train)

# Make predictions
y_pred = lin_reg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f'Baseline Linear Regression RMSE: {rmse:.2f}')
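
As a rough sketch of how a Random Forest could be compared against the baseline, the cell below fits both models on the same split and reports RMSE. It uses synthetic data from `make_regression` as a stand-in (an assumption, so the block runs on its own); in the notebook you would reuse `X_train`, `X_test`, `y_train`, and `y_test`. The hyperparameters are illustrative defaults, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target (not the Kaggle data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

rmse_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    # RMSE is the square root of the mean squared error
    rmse_by_model[name] = mean_squared_error(y_te, preds) ** 0.5
    print(f'{name} RMSE: {rmse_by_model[name]:.2f}')
```

The ranking on this synthetic, purely linear data says nothing about the house prices dataset; the point is only the shared fit/predict/score loop. XGBoost is left out because it lives in a separate package, but its `XGBRegressor` follows the same scikit-learn style interface and could be added to the `models` dictionary.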

## Phase 5: Evaluation
Let's evaluate the baseline model using additional metrics such as MAE and R².

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score
# Calculate MAE and R²
mae_lin = mean_absolute_error(y_test, y_pred)
r2_lin = r2_score(y_test, y_pred)
print(f'Linear Regression - MAE: {mae_lin:.2f}, R²: {r2_lin:.2f}')
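
For reference, the three metrics used so far have the standard definitions below, where $y_i$ is the true price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean of the true prices, and $n$ the number of test samples (the notation is ours, not the notebook's).

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE and MAE are in the same units as `SalePrice`, with RMSE penalizing large errors more heavily. R² is unitless: 1 is a perfect fit, 0 matches always predicting the mean, and it can be negative for a model that does worse than that.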

## Phase 6: Deployment
We summarize our findings and discuss potential next steps.