Vinitha Buchakkagari

Email: vinithab219@gmail.com

Github: https://github.com/Vinithab-123/Real-Estate-Investment-Advisor

**Title: Real Estate Investment Advisor: Predicting Property Profitability & Future Value!**

**1. Project Setup and Data Loading**

The first step is to import the necessary libraries and load your dataset, india_housing_prices.csv.

Import Libraries

In [None]:
!pip install mlflow


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier, XGBRegressor # XGBoost is recommended in the plan
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score # For Regression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score # For Classification
import mlflow # For experiment tracking

Load Data

In [None]:
# Load the dataset
df = pd.read_csv('india_housing_prices.csv')

# Display the first few rows
print(df.head())

# Check for missing values and data types
print(df.info())

   ID        State      City      Locality      Property_Type  BHK  \
0   1   Tamil Nadu   Chennai   Locality_84          Apartment    1   
1   2  Maharashtra      Pune  Locality_490  Independent House    3   
2   3       Punjab  Ludhiana  Locality_167          Apartment    2   
3   4    Rajasthan   Jodhpur  Locality_393  Independent House    2   
4   5    Rajasthan    Jaipur  Locality_466              Villa    4   

   Size_in_SqFt  Price_in_Lakhs  Price_per_SqFt  Year_Built  ...  \
0          4740          489.76            0.10        1990  ...   
1          2364          195.52            0.08        2008  ...   
2          3642          183.79            0.05        1997  ...   
3          2741          300.29            0.11        1991  ...   
4          4823          182.90            0.04        2002  ...   

  Age_of_Property  Nearby_Schools  Nearby_Hospitals  \
0              35              10                 3   
1              17               8                 1   
2    

print(df.head()): This displays the first few rows of the dataset, allowing you to see the structure, column names, and a sample of the raw data.

print(df.info()): This provides a summary of the entire dataset, showing the total number of entries, the data type of each column, and crucially, how many non-null values each column contains, which immediately highlights any missing data.

**2. Data Preprocessing and Feature Engineering**

Handle Missing Values & Duplicates

In [None]:
# Drop duplicates (if any)
df.drop_duplicates(inplace=True)

# Handle missing values: A simple approach is to fill numerical NAs with median
# and categorical NAs with a placeholder like 'Missing'.
# Identify numerical columns (example)
numerical_cols = ['BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt', 'Year_Built']
for col in numerical_cols:
    # Assign the result back; chained inplace fillna is deprecated and will not work in pandas 3.0
    df[col] = df[col].fillna(df[col].median())

# For categorical columns (example: Furnished_Status)
df['Furnished_Status'] = df['Furnished_Status'].fillna('Unspecified')

# Check the Amenities column and handle it
# Assuming properties without explicit amenities are just 'None'
df['Amenities'] = df['Amenities'].fillna('None')


Duplicate Handling: The code removes any duplicate rows in the dataset using df.drop_duplicates(inplace=True).

Missing Numerical Values: It handles missing values in numerical columns (BHK, Size_in_SqFt, Price_in_Lakhs, Price_per_SqFt, Year_Built) by filling them with the median value of that column.

Missing Categorical Values: It handles missing values in the Furnished_Status column by filling them with the string 'Unspecified'.
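
The imputation rule described above can be sketched in plain Python, without pandas, to make the logic explicit. The rows and the `impute` helper below are hypothetical and only illustrate the idea: the median is not pulled upward by the single large value (10), which is why it is a common default for numeric gaps.

```python
from statistics import median

# Hypothetical mini-dataset; None marks a missing value.
rows = [
    {"BHK": 2, "Furnished_Status": "Furnished"},
    {"BHK": None, "Furnished_Status": None},
    {"BHK": 3, "Furnished_Status": "Unfurnished"},
    {"BHK": 10, "Furnished_Status": "Furnished"},
]

def impute(rows, numeric_col, categorical_col, placeholder="Unspecified"):
    """Fill numeric gaps with the column median and categorical gaps with a placeholder."""
    observed = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    fill_value = median(observed)
    cleaned = []
    for r in rows:
        new_row = dict(r)  # copy so the original rows stay untouched
        if new_row[numeric_col] is None:
            new_row[numeric_col] = fill_value
        if new_row[categorical_col] is None:
            new_row[categorical_col] = placeholder
        cleaned.append(new_row)
    return cleaned

cleaned_rows = impute(rows, "BHK", "Furnished_Status")
print(cleaned_rows[1])
```

Here the missing BHK is filled with 3 (the median of 2, 3, and 10) rather than the mean of 5.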

1. Create Target Variables (Labels)

A. Classification Target: Good_Investment

Goal: Create a binary label (0 or 1) for the "Good Investment" classification task.

Assumption: We'll define a "Good Investment" (1) as a property that is relatively well-priced for its size and has good access to public transport.

In [None]:
# 1. Calculate the median Price_per_SqFt to find a local benchmark
median_price_sqft = df['Price_per_SqFt'].median()

# 2. Define Good Investment (1) if:
#    - Price is below the median AND
#    - Public Transport Accessibility is 'Medium' or 'High'
df['Good_Investment'] = np.where(
    (df['Price_per_SqFt'] < median_price_sqft) &
    (df['Public_Transport_Accessibility'].isin(['Medium', 'High'])),
    1,
    0
)

print(df['Good_Investment'].value_counts())

Good_Investment
0    173669
1     76331
Name: count, dtype: int64
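
As a rough illustration of the labeling rule, the sketch below applies the same condition to a handful of hypothetical rows and then computes the class balance from the counts printed above. The function name `is_good_investment` and the sample values are assumptions made for this example.

```python
from statistics import median

def is_good_investment(price_per_sqft, median_price, transport):
    """Return 1 when the price is below the median and transport access is Medium or High."""
    return int(price_per_sqft < median_price and transport in {"Medium", "High"})

# Hypothetical rows: (Price_per_SqFt, Public_Transport_Accessibility)
sample = [(0.04, "High"), (0.05, "Low"), (0.10, "Medium"), (0.11, "High"), (0.08, "Medium")]
median_price = median(p for p, _ in sample)
labels = [is_good_investment(p, median_price, t) for p, t in sample]
print(median_price, labels)

# Class balance implied by the value_counts() output above
positive_share = 76331 / (173669 + 76331)
print(f"Share of Good_Investment = 1: {positive_share:.1%}")
```

Note that a row priced exactly at the median is labeled 0, because the rule uses a strict less-than comparison.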


B. Regression Target: Price_in_Lakhs_Future

Goal: Predict the estimated property price after 5 years.

Assumption: We will create a synthetic target assuming a simple 15% property value appreciation over 5 years.

In [None]:
# Create the future price target variable
APPRECIATION_RATE = 1.15 # 15% appreciation over 5 years
df['Price_in_Lakhs_Future'] = df['Price_in_Lakhs'] * APPRECIATION_RATE
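
To put the 15% assumption in perspective, the following sketch converts the five-year multiplier into an equivalent compound annual rate. The `YEARS` constant and the `future_price` helper are introduced here for illustration only; the example price is taken from the first row of the df.head() output.

```python
APPRECIATION_RATE = 1.15  # total multiplier over the horizon, as in the cell above
YEARS = 5

# Equivalent compound annual growth rate: (1 + r) ** YEARS == APPRECIATION_RATE
annual_rate = APPRECIATION_RATE ** (1 / YEARS) - 1

def future_price(price_in_lakhs, years=YEARS, rate=annual_rate):
    """Project a price forward by compounding the annual rate."""
    return price_in_lakhs * (1 + rate) ** years

print(f"Implied annual rate: {annual_rate:.2%}")
print(f"489.76 Lakhs after {YEARS} years: {future_price(489.76):.2f} Lakhs")
```

Under this assumption every property grows by the same factor, so the future price carries no information beyond the current price.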

Exploratory Data Analysis (EDA)

Step 2.1: Analyze and Plot Price Trends by City
We will calculate the average price per square foot for each city and visualize the top 15 most expensive cities.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load Data and perform basic cleaning/feature creation to ensure data is ready
# Note: reloading replaces df, so target columns created earlier must be re-created before modeling
df = pd.read_csv('india_housing_prices.csv')
df.drop_duplicates(inplace=True)
df['Price_per_SqFt'] = df['Price_per_SqFt'].fillna(df['Price_per_SqFt'].median())
df['City'] = df['City'].fillna('Unspecified')


# 1. Calculate the mean Price_per_SqFt for each City
city_price_trends = df.groupby('City')['Price_per_SqFt'].mean().sort_values(ascending=False)

# 2. Select the top 15 most expensive cities for plotting
top_15_cities = city_price_trends.head(15)

# 3. Create the bar chart
plt.figure(figsize=(12, 6))
top_15_cities.plot(kind='bar', color='skyblue')

# Format plot for presentation
plt.title('Top 15 Cities by Average Price per Square Foot', fontsize=14)
plt.ylabel('Average Price per SqFt (Lakhs)', fontsize=12)
plt.xlabel('City', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()

# Save the plot
plt.savefig('eda_price_trends_by_city.png')
plt.close()

print(f"Analysis complete. Plot saved as 'eda_price_trends_by_city.png'.")
print("\n--- Top 5 Most Expensive Cities ---")
print(top_15_cities.head())

Analysis complete. Plot saved as 'eda_price_trends_by_city.png'.

--- Top 5 Most Expensive Cities ---
City
Surat             0.133877
Mangalore         0.133726
Pune              0.132973
Mysore            0.132483
Vishakhapatnam    0.132442
Name: Price_per_SqFt, dtype: float64


Data Preparation: The code loads the data and performs minimal cleaning (handling duplicates and missing values in Price_per_SqFt and City) necessary for the analysis.

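Insight Generation: It calculates the average price per square foot for every city in the dataset and ranks the cities. In this dataset the averages are nearly flat at the top: the five most expensive cities range only from about 0.1324 to 0.1339 Lakhs per sq ft.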

Visualization: It generates and saves a bar chart (eda_price_trends_by_city.png) showing the top 15 cities, fulfilling the EDA requirement to analyze price trends by city.

Step 2.2: Analyze Correlation Between Area and Price
We will use a scatter plot to visualize the relationship between Size_in_SqFt (area) and Price_in_Lakhs (price), which is a key correlation for investment return analysis.

In [None]:
import matplotlib.pyplot as plt

# Using the dataframe 'df' from the previous successful loading/cleaning steps

# To make the plot readable, we will sample the data.
# Plotting 250,000 points is slow and creates an unreadable plot.
df_sample = df.sample(n=5000, random_state=42)

plt.figure(figsize=(10, 6))
plt.scatter(
    df_sample['Size_in_SqFt'],
    df_sample['Price_in_Lakhs'],
    alpha=0.4,
    color='darkorange',
    s=20
)

# Format plot for presentation
plt.title('Correlation: Property Size vs. Price (Sampled)', fontsize=14)
plt.ylabel('Price in Lakhs', fontsize=12)
plt.xlabel('Size in SqFt', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()

# Save the plot
plt.savefig('eda_size_vs_price_correlation.png')
plt.close()

# Calculate and print the correlation coefficient
correlation = df['Size_in_SqFt'].corr(df['Price_in_Lakhs'])

print(f"Analysis complete. Plot saved as 'eda_size_vs_price_correlation.png'.")
print(f"\nPearson Correlation Coefficient (Size vs. Price): {correlation:.4f}")

Analysis complete. Plot saved as 'eda_size_vs_price_correlation.png'.

Pearson Correlation Coefficient (Size vs. Price): -0.0025


Plot: The scatter plot eda_size_vs_price_correlation.png visually represents the relationship between Size in SqFt and Price in Lakhs using a sample of 5,000 data points.

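Correlation Coefficient: The Pearson Correlation Coefficient between Size_in_SqFt and Price_in_Lakhs is -0.0025, which indicates essentially no linear relationship between property size and total price in this dataset.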
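
For reference, the Pearson coefficient can be written out by hand. This is a minimal sketch with hypothetical numbers, not the pandas implementation; it shows the two extremes, a perfectly linear relationship and one with no linear trend.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

sizes = [1000, 2000, 3000, 4000]

# Hypothetical prices that rise in lockstep with size
r_linear = pearson(sizes, [50, 100, 150, 200])

# Hypothetical prices with no linear trend in size
r_flat = pearson(sizes, [100, 200, 200, 100])

print(r_linear, r_flat)
```

A coefficient near zero, as in the notebook, rules out a linear trend only; it does not by itself rule out other kinds of dependence.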

**3. Preprocessing for Machine Learning**

We use a ColumnTransformer and Pipeline for reproducible feature engineering, including One-Hot Encoding for categorical features and Scaling for numerical features.

Define Columns and Preprocessor

In [None]:


# Define feature and target columns
features = ['State', 'City', 'Property_Type', 'BHK', 'Size_in_SqFt',
            'Price_per_SqFt', 'Year_Built', 'Furnished_Status',
            'Age_of_Property', 'Public_Transport_Accessibility',
            'Parking_Space', 'Security', 'Owner_Type'] # 'Amenities' is excluded for safe pipeline execution

# Note: 'Amenities' is a raw text column that caused an error in the pipeline, so it is left out here.
# Using it requires a separate transformation step (e.g., MultiLabelBinarizer).

X = df[features] # Update X to use the corrected features list
y_cls = df['Good_Investment']

# Identify column types for preprocessing
categorical_cols = ['State', 'City', 'Property_Type', 'Furnished_Status',
                    'Public_Transport_Accessibility', 'Parking_Space', 'Security',
                    'Owner_Type']
numerical_cols = ['BHK', 'Size_in_SqFt', 'Price_per_SqFt', 'Year_Built', 'Age_of_Property']


# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    # Drop any column that is not listed in numerical_cols or categorical_cols.
    # 'Amenities' is not in the feature list, so nothing is passed through untransformed.
    remainder='drop'
)

1. Feature Definition: It explicitly defines which columns (features) will be used as input for the models, separating them into numerical_cols (which will be scaled) and categorical_cols (which will be one-hot encoded).

2. Pipeline Creation: It creates the ColumnTransformer (the preprocessor), which is a crucial component that ensures that:

  a. Numerical features are scaled using StandardScaler().

  b. Categorical features are encoded using OneHotEncoder(handle_unknown='ignore').

  c. Any other columns not explicitly listed in numerical_cols or categorical_cols are dropped (remainder='drop'), preventing errors caused by unwanted columns like the raw Amenities text.
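
What the two transformers do can be sketched in plain Python. The helper functions and training values below are hypothetical stand-ins, not scikit-learn's code; they illustrate standard scaling (subtract the mean, divide by the population standard deviation) and one-hot encoding where a category unseen during fitting becomes an all-zero row, which is the effect of handle_unknown='ignore'.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of a numeric column."""
    return mean(values), pstdev(values)

def scale(value, mu, sigma):
    """Standardize one value using the learned statistics."""
    return (value - mu) / sigma

def fit_one_hot(values):
    """Learn the sorted list of categories seen during fitting."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode a value; an unseen category becomes an all-zero row."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical training values
mu, sigma = fit_scaler([1, 2, 3, 4, 5])                    # e.g. BHK
categories = fit_one_hot(["Villa", "Apartment", "Villa"])  # e.g. Property_Type

scaled_five = scale(5, mu, sigma)
seen = one_hot("Villa", categories)
unseen = one_hot("Penthouse", categories)  # not present during fitting
print(scaled_five, seen, unseen)
```

Because the statistics and the category list are learned once during fitting, the same transformation is applied to training and test rows.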

**4. Classification Model (Good Investment)**

Split Data

In [None]:
# Split data for the Classification task
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X, y_cls, test_size=0.2, random_state=42
)

This cell divides the feature set X and the classification target y_cls into training and testing subsets using an 80/20 split (test_size=0.2); random_state=42 makes the split reproducible.
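
The idea behind a seeded 80/20 split can be sketched as follows. This is a simplified illustration, not scikit-learn's exact algorithm, and `split_indices` is a hypothetical helper: row indices are shuffled with a fixed seed and the first 20% become the test set.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(250_000)
print(len(train_idx), len(test_idx))

# Calling the function again with the same seed reproduces the same split
repeat_train_idx, repeat_test_idx = split_indices(250_000)
print(repeat_test_idx == test_idx)
```

With 250,000 rows this yields 200,000 training rows and 50,000 test rows, which matches the total support shown in the classification report.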

Create and Train the Model Pipeline

In [None]:
# Initialize the XGBoost Classifier model
xgb_cls = XGBClassifier(eval_metric='logloss', random_state=42)  # use_label_encoder is no longer used by XGBoost

# Create a full pipeline (Preprocessor + Model)
cls_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', xgb_cls)])

# Train the model
print("Starting Classification Model Training...")
cls_pipeline.fit(X_train_cls, y_train_cls)
print("Classification Model Training Complete.")

Starting Classification Model Training...
Classification Model Training Complete.


Pipeline Assembly: It combines the existing preprocessor with the XGBClassifier into a single cls_pipeline. This ensures that every time the model is used, the data is automatically scaled and encoded correctly.

Model Training: It uses the prepared training data to fit the model, teaching the XGBClassifier to predict the binary target variable, Good_Investment.

Evaluate the Classification Model

In [None]:
# Make predictions
y_pred_cls = cls_pipeline.predict(X_test_cls)
y_pred_proba_cls = cls_pipeline.predict_proba(X_test_cls)[:, 1]

# Evaluate metrics
accuracy = accuracy_score(y_test_cls, y_pred_cls)
roc_auc = roc_auc_score(y_test_cls, y_pred_proba_cls)

print(f"\n--- Classification Model Evaluation ---")
print(f"Accuracy: {accuracy:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")
print("\nClassification Report:\n", classification_report(y_test_cls, y_pred_cls))


--- Classification Model Evaluation ---
Accuracy: 1.0000
ROC-AUC Score: 1.0000

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     34682
           1       1.00      1.00      1.00     15318

    accuracy                           1.00     50000
   macro avg       1.00      1.00      1.00     50000
weighted avg       1.00      1.00      1.00     50000



Prediction: It uses the trained cls_pipeline to generate class predictions and probability scores on the unseen test data.

Evaluation: It calculates the final, critical metrics for the classification model:

  a. Accuracy: The overall fraction of correct predictions.

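  b. ROC-AUC Score: A measure of the model's ability to distinguish between the two classes (Good Investment vs. not), which is useful when the classes are imbalanced, as here (about 69% class 0 and 31% class 1).

  c. Classification Report: Provides precision, recall, and F1-score for each class, offering a detailed view of the model's performance.

Note on the perfect scores: Good_Investment is computed directly from Price_per_SqFt and Public_Transport_Accessibility, and both columns are also input features. The model therefore only has to recover that rule, so an accuracy and ROC-AUC of 1.0000 confirm that the pipeline works but do not show that it can identify good investments in any independent sense.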
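
The figures in a classification report follow from four counts: true positives, false positives, false negatives, and true negatives. The sketch below computes them by hand for hypothetical labels; it is meant to show the definitions, not to replace sklearn.metrics.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels and predictions
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.75 while precision and recall are both 2/3, which shows why accuracy alone can look better than the per-class metrics.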

**5. Regression Model (Future Price Prediction)**

In [None]:
!pip install xgboost



Split Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Assuming the above installation worked, this import should now succeed
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# --- 1. Load & Clean Data (Setup for X and y_reg) ---
df = pd.read_csv('india_housing_prices.csv')
df.drop_duplicates(inplace=True)

# Impute missing values
numerical_cols_initial = ['BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt', 'Year_Built']
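for col in numerical_cols_initial:
    # Assign the result back; chained inplace fillna is deprecated and will not work in pandas 3.0
    df[col] = df[col].fillna(df[col].median())
df['Furnished_Status'] = df['Furnished_Status'].fillna('Unspecified')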

# Define Targets and Features
APPRECIATION_RATE = 1.15
df['Price_in_Lakhs_Future'] = df['Price_in_Lakhs'] * APPRECIATION_RATE
df['Age_of_Property'] = 2025 - df['Year_Built']

features = ['State', 'City', 'Property_Type', 'BHK', 'Size_in_SqFt',
            'Price_per_SqFt', 'Year_Built', 'Furnished_Status',
            'Age_of_Property', 'Public_Transport_Accessibility',
            'Parking_Space', 'Security', 'Owner_Type']

X = df[features]
y_reg = df['Price_in_Lakhs_Future']

# --- 2. Define Pipeline Components ---
categorical_cols = ['State', 'City', 'Property_Type', 'Furnished_Status',
                    'Public_Transport_Accessibility', 'Parking_Space', 'Security',
                    'Owner_Type']
numerical_cols = ['BHK', 'Size_in_SqFt', 'Price_per_SqFt', 'Year_Built', 'Age_of_Property']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='drop'
)

xgb_reg = XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, random_state=42)
reg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('regressor', xgb_reg)])


# --- 3. Step 5.1: Split Data for Regression ---
print("\nSplitting Data for Regression...")
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)
print("Data Split Complete.")

# --- 4. Step 5.2 & 5.3: Train and Evaluate ---
print("\nStarting Regression Model Training...")
reg_pipeline.fit(X_train_reg, y_train_reg)
print("Regression Model Training Complete.")

y_pred_reg = reg_pipeline.predict(X_test_reg)

rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"\n--- Regression Model Evaluation ---")
print(f"RMSE (Root Mean Squared Error): {rmse:.2f} Lakhs")
print(f"MAE (Mean Absolute Error): {mae:.2f} Lakhs")
print(f"R² Score: {r2:.4f}")

Splitting Data for Regression...
Data Split Complete.

Starting Regression Model Training...
Regression Model Training Complete.

--- Regression Model Evaluation ---
RMSE (Root Mean Squared Error): 10.32 Lakhs
MAE (Mean Absolute Error): 8.22 Lakhs
R² Score: 0.9960


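1. Regression Model Final Metrics (from the evaluation output above):

Metric  -  Value  -  Interpretation

a. R² Score  -  0.9960  -  The model explains about 99.6% of the variance in the synthetic future price on the test set.

b. RMSE  -  10.32 Lakhs  -  Root mean squared error; it penalizes large errors more heavily than MAE.

c. MAE  -  8.22 Lakhs  -  The average absolute prediction error.

Because Price_in_Lakhs_Future is exactly 1.15 times Price_in_Lakhs, these metrics measure how well the model reconstructs the current price from the input features rather than genuine forecasting skill.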
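
The three regression metrics can be written out directly from their definitions. The sketch below uses hypothetical prices and a hypothetical helper name, and is only meant to make the formulas concrete.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R² computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical prices in Lakhs
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
reg_metrics = regression_metrics(y_true, y_pred)
print(reg_metrics)
```

R² compares the model's squared error with that of always predicting the mean, so a wide spread in the target can produce a high R² even when the absolute error is several Lakhs.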

**6. MLflow Experiment Tracking**

MLflow is used to track your experiments, parameters, metrics, and models. This is crucial for managing the multiple models (Classification & Regression).

Initialize MLflow Tracking

In [None]:
# Set up a new experiment
mlflow.set_experiment("Real Estate Investment Advisor")

<Experiment: artifact_location='/content/mlruns/1', creation_time=1765375055626, experiment_id='1', last_update_time=1765375055626, lifecycle_stage='active', name='Real Estate Investment Advisor', tags={}>

Log the Classification Experiment

In [None]:

with mlflow.start_run(run_name="XGBoost_Classification"):
    # Log parameters (e.g., test size, random state)
    mlflow.log_param("model_type", "XGBoost Classifier")
    mlflow.log_param("test_size", 0.2)
    mlflow.log_param("target_definition", "Price_per_SqFt < median AND High/Medium Transport")

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("roc_auc", roc_auc)

    # Log the trained model
    mlflow.sklearn.log_model(cls_pipeline, "classification_model")



Log the Regression Experiment

In [None]:
with mlflow.start_run(run_name="XGBoost_Regression"):
    # Log parameters
    mlflow.log_param("model_type", "XGBoost Regressor")
    mlflow.log_param("appreciation_rate", APPRECIATION_RATE)

    # Log metrics
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("mae", mae)
    mlflow.log_metric("r2_score", r2)

    # Log the trained model
    mlflow.sklearn.log_model(reg_pipeline, "regression_model")

print("\nMLflow tracking complete. Run 'mlflow ui' in your terminal to view the results.")




MLflow tracking complete. Run 'mlflow ui' in your terminal to view the results.


1. Analyze and Document Core Findings

The first priority is to gather the key metrics and insights for your documentation.

Step 1.1: Review MLflow Metrics

Launch the MLflow UI to get the final performance numbers for your reports.

Step 1.2: Identify Feature Importance

In [None]:
# Assuming X, reg_pipeline, and the feature lists are still in memory from before

# 1. Get the final processed feature names
feature_names_out = reg_pipeline['preprocessor'].get_feature_names_out()

# 2. Extract feature importances from the trained XGBoost Regressor
importances = reg_pipeline['regressor'].feature_importances_

# 3. Create a DataFrame for comparison and sorting
feature_df = pd.DataFrame({
    'Feature': feature_names_out,
    'Importance': importances
})

# 4. Print and note the top 10 most influential features
print("\n--- Top 10 Feature Importances for Price Prediction ---")
top_features = feature_df.sort_values(by='Importance', ascending=False).head(10)
print(top_features)


--- Top 10 Feature Importances for Price Prediction ---
                    Feature  Importance
2       num__Price_per_SqFt    0.817424
1         num__Size_in_SqFt    0.178719
40       cat__City_Guwahati    0.000125
63     cat__City_Trivandrum    0.000114
8   cat__State_Chhattisgarh    0.000110
11       cat__State_Haryana    0.000105
37      cat__City_Faridabad    0.000098
55      cat__City_New Delhi    0.000088
60         cat__City_Ranchi    0.000088
23   cat__State_Uttarakhand    0.000087


Goal: These top features are crucial for explaining the "Advisor" model's logic in your presentation.
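
One possible reason the two leading features dominate is that, in the five rows printed by df.head(), Price_per_SqFt matches Price_in_Lakhs divided by Size_in_SqFt after rounding. The check below uses only those printed rows; whether the relationship holds across the whole dataset is an assumption that would need to be verified on the full data.

```python
# (Size_in_SqFt, Price_in_Lakhs, Price_per_SqFt) copied from the df.head() output
head_rows = [
    (4740, 489.76, 0.10),
    (2364, 195.52, 0.08),
    (3642, 183.79, 0.05),
    (2741, 300.29, 0.11),
    (4823, 182.90, 0.04),
]

for size, price, listed_rate in head_rows:
    implied_rate = price / size  # Lakhs per square foot
    print(f"implied {implied_rate:.4f} vs listed {listed_rate:.2f}")

# Combined importance of the two leading features in the table above
top_two_share = 0.817424 + 0.178719
print(f"Top-two importance share: {top_two_share:.3f}")
```

If price is essentially the product of these two columns, the target is almost fully determined by them, which is consistent with the remaining features each contributing about 0.0001.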

2. Project Refinement: Incorporate Amenities

To demonstrate model iteration and improvement, you should attempt to use the Amenities feature, which was previously dropped.

Step 2.1: Implement Multi-Label Binarization for Amenities

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

# Assuming df is your original cleaned DataFrame
df_amenities = df.copy()

# 1. Prepare the Amenities column: fill missing values and split into lists
df_amenities['Amenities_List'] = (
    df_amenities['Amenities'].fillna('None')
    .apply(lambda x: [item.strip() for item in x.split(',')] if pd.notna(x) else [])
)

# 2. Fit and transform the amenities
mlb = MultiLabelBinarizer()
amenity_ohe = mlb.fit_transform(df_amenities['Amenities_List'])

# 3. Create a new DataFrame with the binary amenity features
amenity_cols = [f'Amenity_{c}' for c in mlb.classes_]
amenity_df = pd.DataFrame(amenity_ohe, columns=amenity_cols, index=df_amenities.index)

# 4. Drop old 'Amenities' column and join the new binary columns
df_combined = df_amenities.drop(columns=['Amenities', 'Amenities_List'])
df_combined = pd.concat([df_combined, amenity_df], axis=1)

print(f"Combined Data Shape: {df_combined.shape}")

Combined Data Shape: (250000, 28)
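
The effect of multi-label binarization can be sketched without scikit-learn: each distinct amenity becomes its own 0/1 column. The amenity names below are hypothetical, since the notebook does not show the actual values, and the helpers are illustrative stand-ins for MultiLabelBinarizer.

```python
def split_amenities(text):
    """Turn 'Gym, Pool' into ['Gym', 'Pool']; missing text becomes ['None']."""
    if not text:
        return ["None"]
    return [item.strip() for item in text.split(",")]

def multi_hot(rows):
    """Return the sorted label vocabulary and one 0/1 vector per row."""
    classes = sorted({label for labels in rows for label in labels})
    matrix = [[1 if c in labels else 0 for c in classes] for labels in rows]
    return classes, matrix

# Hypothetical amenity strings; the real values in the dataset may differ
raw = ["Gym, Pool", "Garden", None, "Pool,Garden"]
classes, matrix = multi_hot([split_amenities(r) for r in raw])
print(classes)
for row in matrix:
    print(row)
```

Stripping whitespace before encoding matters: without it, "Pool" and " Pool" would be treated as two different amenities.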


1. Feature Importance Extraction (Top 5 features)
This step requires loading your saved model and using it to identify the most impactful features.

In [None]:
import joblib
import pandas as pd
import matplotlib.pyplot as plt

# --- 1. Load the Model Pipeline ---
# NOTE: Ensure 'models/reg_pipeline.pkl' is the correct path to your saved file!
try:
    reg_pipeline = joblib.load('models/reg_pipeline.pkl')
except FileNotFoundError:
    print("Error: The model file 'models/reg_pipeline.pkl' was not found.")
    print("Please ensure the file is in the correct path and try again.")
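    # Execution continues: the steps below then use the reg_pipeline already in memory
    # from the training cell, which is what happened in the output shown below.
    # In a fresh session there is no such object, so stop here instead of continuing.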

# --- 2. Get Feature Names and Importances ---

# Get the feature names after OneHotEncoding/Scaling
feature_names_out = reg_pipeline['preprocessor'].get_feature_names_out()

# Get importances from the XGBoost Regressor object
# The 'regressor' step is the XGBoost model
importances = reg_pipeline['regressor'].feature_importances_

# --- 3. Create DataFrame and Sort ---
feature_df = pd.DataFrame({
    'Feature': feature_names_out,
    'Importance': importances
})

# Sort by importance and get the top 5
top_5_features = feature_df.sort_values(by='Importance', ascending=False).head(5)

print("\n--- Top 5 Feature Importances for Future Price Prediction ---")
print(top_5_features)

# --- 4. Save a Plot (Optional but Recommended for documentation) ---
plt.figure(figsize=(10, 6))
plt.barh(top_5_features['Feature'], top_5_features['Importance'], color='teal')
plt.xlabel("Feature Importance Score")
plt.title("Top 5 Drivers of Future Property Price")
plt.gca().invert_yaxis() # Highest importance at the top
plt.tight_layout()
plt.savefig('feature_importance_bar_chart.png')
plt.close()

Error: The model file 'models/reg_pipeline.pkl' was not found.
Please ensure the file is in the correct path and try again.

--- Top 5 Feature Importances for Future Price Prediction ---
                    Feature  Importance
2       num__Price_per_SqFt    0.817424
1         num__Size_in_SqFt    0.178719
40       cat__City_Guwahati    0.000125
63     cat__City_Trivandrum    0.000114
8   cat__State_Chhattisgarh    0.000110


2.  Final Streamlit Update (app.py)

This is a manual file editing step. You need to modify your app.py script to include the visuals and insights you generated (the EDA charts and the Feature Importance data).

Instructions for app.py:

1. Add necessary import:



In [1]:
!pip install scikit-learn==1.3.2 joblib xgboost



In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# You MUST ensure xgboost and joblib are installed for this part to work!
from xgboost import XGBClassifier, XGBRegressor
import joblib
import os
from sklearn.metrics import mean_squared_error

# --- 1. Load, Clean, and Define Targets ---
df = pd.read_csv('india_housing_prices.csv')
df.drop_duplicates(inplace=True)
numerical_cols_initial = ['BHK', 'Size_in_SqFt', 'Price_in_Lakhs', 'Price_per_SqFt', 'Year_Built']
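for col in numerical_cols_initial:
    # Assign the result back; chained inplace fillna is deprecated and will not work in pandas 3.0
    df[col] = df[col].fillna(df[col].median())
df['Furnished_Status'] = df['Furnished_Status'].fillna('Unspecified')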
APPRECIATION_RATE = 1.15
df['Price_in_Lakhs_Future'] = df['Price_in_Lakhs'] * APPRECIATION_RATE
df['Age_of_Property'] = 2025 - df['Year_Built']
median_price_sqft = df['Price_per_SqFt'].median()
df['Good_Investment'] = np.where(
    (df['Price_per_SqFt'] < median_price_sqft) &
    (df['Public_Transport_Accessibility'].isin(['Medium', 'High'])),
    1, 0
)

# --- 2. Define Features and Pipeline Components ---
features = ['State', 'City', 'Property_Type', 'BHK', 'Size_in_SqFt', 'Price_per_SqFt',
            'Year_Built', 'Furnished_Status', 'Age_of_Property', 'Public_Transport_Accessibility',
            'Parking_Space', 'Security', 'Owner_Type']
X = df[features]
y_reg = df['Price_in_Lakhs_Future']
y_cls = df['Good_Investment']

categorical_cols = ['State', 'City', 'Property_Type', 'Furnished_Status', 'Public_Transport_Accessibility', 'Parking_Space', 'Security', 'Owner_Type']
numerical_cols = ['BHK', 'Size_in_SqFt', 'Price_per_SqFt', 'Year_Built', 'Age_of_Property']
preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), numerical_cols),
                  ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='drop'
)

# --- 3. Split Data ---
X_train_reg, _, y_train_reg, _ = train_test_split(X, y_reg, test_size=0.2, random_state=42)
X_train_cls, _, y_train_cls, _ = train_test_split(X, y_cls, test_size=0.2, random_state=42)

# --- 4. Define and Train Pipelines ---
xgb_reg = XGBRegressor(objective='reg:squarederror', n_estimators=100, random_state=42)
reg_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', xgb_reg)])
reg_pipeline.fit(X_train_reg, y_train_reg)

xgb_cls = XGBClassifier(objective='binary:logistic', n_estimators=100, random_state=42)
cls_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', xgb_cls)])
cls_pipeline.fit(X_train_cls, y_train_cls)

# --- 5. Save the Trained Pipelines ---
os.makedirs('models', exist_ok=True)
joblib.dump(cls_pipeline, 'models/cls_pipeline.pkl')
joblib.dump(reg_pipeline, 'models/reg_pipeline.pkl')

print("Models saved successfully to the 'models' directory.")
print("You can now run 'streamlit run app.py'")

Models saved successfully to the 'models' directory.
You can now run 'streamlit run app.py'


***REPORT***

A. Final Model Metrics (From MLflow)

Regression (Future Price): R² Score 0.9960, RMSE 10.32 Lakhs, MAE 8.22 Lakhs.

Classification (Good Investment): Accuracy 1.0000, ROC AUC 1.0000.

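B. Key Feature Importances (Regression Model)

Feature  -  Importance Insight

Price_per_SqFt  -  The strongest predictor of future value (importance 0.817).

Size_in_SqFt  -  The second strongest predictor (importance 0.179). Together with Price_per_SqFt it accounts for over 99% of total importance, even though its linear correlation with price on its own is near zero (-0.0025).

City_[Name] and State_[Name]  -  Individual location indicators each contribute about 0.0001, so location adds little once the two features above are known.

Age_of_Property  -  Not among the top 10 features.

BHK  -  The number of bedrooms, hall, and kitchen; not among the top 10 features.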

3. Final Deliverable: Project Documentation
Focus entirely on creating the final report.

Introduction & Methodology: Explain the project, the targets (Price_in_Lakhs_Future and Good_Investment), and the techniques (XGBoost, Pipeline, Feature Engineering).

EDA Findings: Embed the saved chart images. Discuss the relationship between location and price, and between transport accessibility and investment potential.

Model Performance: Include the table of metrics (RMSE, R², etc.).

Key Business Insight: Dedicate a section to the Feature Importance (Table B), explaining which factors an investor should focus on based on your model's findings.

