1. Exploratory Data Analysis (EDA):

   Start by exploring the dataset to understand its structure and contents. Check the number of rows and columns, data types, and missing values.
   Plot histograms or box plots to visualize the distributions of each numerical feature (e.g., Avg. Area Income, Avg. Area House Age, etc.).
   Use scatter plots to explore relationships between different features and the target variable (Price).

2. Data Preprocessing:

    Handle any missing or null values in the dataset by either filling them with appropriate values or removing the affected rows.
    If needed, convert categorical variables (e.g., Address) into numerical representations using techniques like one-hot encoding or label encoding.

3. Feature Engineering:

    Create new features that could improve the predictive model. For example, you could calculate the average price per room, the ratio of bedrooms to rooms, or the distance of each property from important landmarks.

4. Data Visualization:

   Visualize the data using plots and charts to gain insights and identify patterns. For instance, you can create scatter plots or heatmaps to understand correlations between features.

5. Predictive Modeling (Regression):

   Since the target variable (Price) is continuous, you can perform regression analysis to predict house prices based on the given features.
   Split the dataset into training and testing sets.
   Utilize regression algorithms such as Linear Regression, Random Forest Regression, or Gradient Boosting Regression to build predictive models.

6. Model Evaluation:

   Evaluate each model on the held-out test set using Mean Squared Error (MSE) and R-squared (R2 score), and compare the results across models.
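
Step 3 of the outline mentions ratio features such as bedrooms per room. A minimal sketch of how such a column could be derived with pandas is shown here on a tiny made-up frame rather than the real dataset; the column name `Bedrooms per Room` is hypothetical and not part of the original data.

```python
import pandas as pd

# Made-up illustrative rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Avg. Area Number of Rooms": [6.0, 8.0, 5.0],
    "Avg. Area Number of Bedrooms": [3.0, 4.0, 2.0],
})

# Hypothetical engineered feature: share of rooms that are bedrooms.
toy["Bedrooms per Room"] = (
    toy["Avg. Area Number of Bedrooms"] / toy["Avg. Area Number of Rooms"]
)

print(toy)
```

Whether such a ratio actually helps would need to be checked by comparing model scores with and without it.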

![](https://i.imgur.com/vl7xtxF.png)

<div style = "color: White; display: fill;
              border-radius: 5px;
              background-color: #3AB4F2;
              font-size: 100%;
              font-family: Verdana">
<p style = "padding: 7px;
            color: Black;">
         📌 <b>Income</b> <br>
         📌 <b>House Age</b> <br>
         📌 <b>Number of Rooms</b> <br>
         📌 <b>Number of Bath</b> <br>
         📌 <b>Population</b> <br>
         📌 <b>Address</b> <br>
         📌 <b>Price Prediction</b> <br>
</p>
</div>

![](https://i.imgur.com/WPXJm4c.png)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

![](https://i.imgur.com/BiP8Vkx.png)

In [None]:
df = pd.read_csv("/kaggle/input/usa-housingcsv/USA_Housing.csv")
df.head()

![](https://i.imgur.com/8nbAftN.png)

In [None]:
## Check the shape of dataset
df.shape

In [None]:
## Check for null values in each column
df.isnull().sum()
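
If the check above reported missing values, they would need handling before modeling. The following is a minimal sketch of two common options, dropping rows or filling with the median, on a small made-up frame; the values are illustrative and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column, for illustration only.
toy = pd.DataFrame({
    "Income": [60000.0, np.nan, 80000.0, 70000.0],
    "Price": [1.0e6, 1.2e6, np.nan, 1.1e6],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: drop rows where the target is missing, then fill the
# remaining gaps in the feature with that feature's median.
filled = toy.dropna(subset=["Price"]).copy()
filled["Income"] = filled["Income"].fillna(filled["Income"].median())

print(len(dropped), len(filled))
```

Filling keeps more rows than dropping, at the cost of introducing estimated values, so the choice depends on how much data is missing.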

In [None]:
# Check for duplicate rows in the entire DataFrame
df.duplicated().sum()


In [None]:
## Shorten the column names. 'Avg. Area Number of Bedrooms' is relabeled as 'Number of Bath' on the assumption that this smaller room count represents washrooms rather than bedrooms.
df = df.rename(columns ={'Avg. Area Income':'Income','Avg. Area House Age':'House Age','Avg. Area Number of Rooms':'Number of Rooms',
                          'Avg. Area Number of Bedrooms':'Number of Bath','Area Population':'Population'})

In [None]:
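# Cast to string so that any thousands separators can be stripped before converting back to numbers
df['Income']  = df['Income'].astype(str)
df['Price']   = df['Price'].astype(str)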

In [None]:
df['Address'].unique()

In [None]:
df['Address'].nunique()

In [None]:
## The Address column is not used as a predictor here, so drop it
df = df.drop(columns ='Address')

In [None]:
df['Income']  = df['Income'].str.replace(',','').astype(float)
df['Price']   = df['Price'].str.replace(',','').astype(float)
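
The cleaning step above removes commas from string values before converting them to floats. A small sketch of the same idea on made-up values, using `pd.to_numeric` for the conversion (an alternative to `astype(float)` that is not used in the original cells):

```python
import pandas as pd

# Made-up values, some formatted with thousands separators.
raw = pd.Series(["12,345.5", "1,000,000", "250"])

# Remove the separators as plain text (regex=False), then convert.
# pd.to_numeric raises by default if a value still is not numeric.
clean = pd.to_numeric(raw.str.replace(",", "", regex=False))

print(clean.dtype)
print(clean.tolist())
```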

In [None]:
# Convert every column to whole numbers (the fractional part is truncated)
int_columns = ['Income', 'House Age', 'Number of Rooms',
               'Number of Bath', 'Population', 'Price']
df[int_columns] = df[int_columns].astype(int)


In [None]:
df.head()

In [None]:
## Summary statistics (count, mean, std, min, quartiles, max)
df.describe()

In [None]:
df.info()

![](https://i.imgur.com/NoGC4qm.png)

In [None]:
# Plot histograms for numerical features
numerical_features = ['Income', 'House Age', 'Number of Rooms',
                      'Number of Bath', 'Population', 'Price']

df[numerical_features].hist(bins=30, figsize=(15, 10))
plt.suptitle("Histograms of Numerical Features")
plt.show()

In [None]:
# Use a scatter plot to explore the relationship between Income and the target variable (Price)
plt.figure(figsize=(10, 6), dpi=100)
sns.scatterplot(x='Income', y='Price', color='green', data=df)
plt.title("Scatter Plot: Income vs. Price")
plt.show()

In [None]:
df.head()

## House Age vs Price 

In [None]:
# Create a bar plot showing the mean Price for each unique House Age
plt.figure(figsize=(10, 6), dpi=100)
sns.barplot(x='House Age', y='Price', data=df, ci=None, palette='cool')
plt.title('Price by House Age')
plt.xlabel('House Age')
plt.ylabel('Mean Price')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()



In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x='Number of Rooms',y= 'Price',data = df,ci = None)
plt.title('Price by Number of Rooms')
plt.xlabel('Number of Rooms')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


In [None]:
plt.figure(figsize=(10,6),dpi =100)
sns.boxplot(x = 'Number of Bath',y = 'Price',data= df)
plt.show()

## Heat Map

1. Income and Price: moderately strong positive correlation (0.64)
2. House Age and Price: moderate positive correlation (0.45)
3. Number of Rooms and Price: weak positive correlation (0.33)
4. Number of Bath and Price: very weak positive correlation (0.17)
5. Population and Price: moderate positive correlation (0.41)

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(),annot=True,cmap='rainbow')
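
The heat map shows the full correlation matrix. To read off only each feature's correlation with the target, the `Price` column of that matrix can be sorted. The sketch below uses synthetic stand-in data (not the housing data); with the real frame, `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: Price depends strongly on Income, weakly on Population.
rng = np.random.default_rng(0)
income = rng.normal(size=500)
population = rng.normal(size=500)
noise = rng.normal(size=500)
toy = pd.DataFrame({
    "Income": income,
    "Population": population,
    "Price": 3.0 * income + 1.0 * population + noise,
})

# Pearson correlation of every feature with the target, strongest first.
price_corr = toy.corr()["Price"].drop("Price").sort_values(ascending=False)
print(price_corr)
```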

## Target Variable Selection
Price is the target variable, and the remaining columns are the input features. Their linear association with Price differs considerably, from Income (correlation 0.64) down to Number of Bath (0.17).

In [None]:
df.head(5)

In [None]:
X = df.drop(columns ='Price')
y = df[['Price']]

In [None]:
X.head(1)

In [None]:
y.head()

![](https://i.imgur.com/bZn8N2J.png)

## Scale the Input Features by using StandardScaler

In [None]:
# Data Preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
X

In [None]:
scaler_y = StandardScaler()  # separate scaler, so the feature scaler keeps its fitted statistics
y = scaler_y.fit_transform(y)

In [None]:
y
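
The cells above fit the scaler on every row before the train/test split, which lets test-set statistics influence preprocessing. One common alternative is to fit the scalers on the training rows only. The sketch below shows that pattern on synthetic data; `make_pipeline` and `TransformedTargetRegressor` are scikit-learn utilities that the original notebook does not use, and the data and coefficients are made up.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the price target.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The feature scaler lives inside the pipeline and the target scaler inside
# TransformedTargetRegressor, so both are fitted on the training rows only.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), LinearRegression()),
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)

# Predictions are mapped back to the original units of y automatically.
y_pred = model.predict(X_test)
print("R-squared:", model.score(X_test, y_test))
```

A side benefit of this arrangement is that error metrics come out in the target's original units rather than in standardized units.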

![](https://i.imgur.com/lUO8zIo.png)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
print("X_train.shape :",X_train.shape)
print("y_train.shape :",y_train.shape)
print("X_test.shape  :",X_test.shape)
print("y_test.shape  :",y_test.shape)

## 1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Choose a model (e.g., Linear Regression)
model  = LinearRegression()
# Train the model using the training data
model.fit(X_train,y_train)
# Make predictions on the testing data
y_pred = model.predict(X_test)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Calculate R-squared (R2 score)
r2 = r2_score(y_test, y_pred)
print("R-squared:         ", r2)
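
For reference, both metrics can be computed directly from their definitions: MSE is the mean of the squared residuals, and R-squared is one minus the ratio of residual to total sum of squares. A minimal sketch with made-up numbers (the helper name `mse_and_r2` is hypothetical):

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R-squared) for two equal-length 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_true - y_pred) ** 2)       # unexplained variation
    total_ss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return residual_ss / len(y_true), 1.0 - residual_ss / total_ss

# Made-up numbers, chosen so the arithmetic is easy to follow:
# residual sum of squares = 4, total sum of squares = 5.
mse_value, r2_value = mse_and_r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(mse_value, r2_value)  # MSE = 1.0, R-squared ≈ 0.2
```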

## 2. Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor(max_depth =100,random_state=42)
dtr.fit(X_train,y_train)
y_pred = dtr.predict(X_test)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Calculate R-squared (R2 score)
r2 = r2_score(y_test, y_pred)
print("R-squared:         ", r2)


## 3. Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators = 1000,max_depth =1000,random_state=42)
rf_reg.fit(X_train,y_train)
y_pred = rf_reg.predict(X_test)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:      ", mse)

# Calculate R-squared (R2 score)
r2 = r2_score(y_test, y_pred)
print("R-squared:               ", r2)


## 4. Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train,y_train)
y_pred = gbr.predict(X_test)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Calculate R-squared (R2 score)
r2 = r2_score(y_test, y_pred)
print("R-squared:         ", r2)
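
Because the target was standardized before the split, the MSE values printed above are in units of squared standard deviations of Price, not squared price units. A sketch of how an error on the standardized scale could be converted back, using made-up prices and a dedicated target scaler (the name `scaler_y` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices and predictions in original units.
prices = np.array([[100.0], [200.0], [300.0], [400.0]])
preds = np.array([[110.0], [190.0], [320.0], [380.0]])

scaler_y = StandardScaler().fit(prices)   # hypothetical dedicated target scaler
prices_std = scaler_y.transform(prices)
preds_std = scaler_y.transform(preds)

# RMSE measured on the standardized target ...
rmse_std = np.sqrt(np.mean((prices_std - preds_std) ** 2))
# ... times the target's standard deviation gives RMSE in price units.
rmse_price = rmse_std * scaler_y.scale_[0]

# The same quantity computed directly on the original scale, as a check.
rmse_direct = np.sqrt(np.mean((prices - preds) ** 2))
print(rmse_price, rmse_direct)
```

R-squared needs no such conversion, since it is unchanged when the target is shifted and rescaled.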

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Baseline: no additional scaling in this cell (note that X_train and y_train were already standardized before the split)
model_no_scaling = GradientBoostingRegressor()
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
mse_no_scaling = mean_squared_error(y_test, y_pred_no_scaling)
print("MSE without feature scaling:", mse_no_scaling)

# With Only Feature Scaling (Without Target Variable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_feature_scaling = GradientBoostingRegressor()
model_feature_scaling.fit(X_train_scaled, y_train)
y_pred_feature_scaling = model_feature_scaling.predict(X_test_scaled)
mse_feature_scaling = mean_squared_error(y_test, y_pred_feature_scaling)
print("MSE with only feature scaling:", mse_feature_scaling)

# With Feature and Target Variable Scaling
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train)
y_test_scaled = scaler_y.transform(y_test)

model_both_scaling = GradientBoostingRegressor()
model_both_scaling.fit(X_train_scaled, y_train_scaled)
y_pred_both_scaling = model_both_scaling.predict(X_test_scaled)
mse_both_scaling = mean_squared_error(y_test_scaled, y_pred_both_scaling)
print("MSE with both feature and target variable scaling:", mse_both_scaling)

# Display the Ranking
mse_values = [mse_no_scaling, mse_feature_scaling, mse_both_scaling]
models = ["Without Scaling", "With Only Feature Scaling", "With Both Scaling"]

ranking = sorted(zip(mse_values, models), key=lambda x: x[0])
for rank, (mse, model) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: MSE = {mse:.2f}")


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Baseline: no additional scaling in this cell (note that X_train and y_train were already standardized before the split)
model_no_scaling = LinearRegression()
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
mse_no_scaling = mean_squared_error(y_test, y_pred_no_scaling)
print("MSE without feature scaling:", mse_no_scaling)

# With Only Feature Scaling (Without Target Variable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_feature_scaling = LinearRegression()
model_feature_scaling.fit(X_train_scaled, y_train)
y_pred_feature_scaling = model_feature_scaling.predict(X_test_scaled)
mse_feature_scaling = mean_squared_error(y_test, y_pred_feature_scaling)
print("MSE with only feature scaling:", mse_feature_scaling)

# With Feature and Target Variable Scaling
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train)
y_test_scaled = scaler_y.transform(y_test)
model_both_scaling = LinearRegression()
model_both_scaling.fit(X_train_scaled, y_train_scaled)
y_pred_both_scaling = model_both_scaling.predict(X_test_scaled)
mse_both_scaling = mean_squared_error(y_test_scaled, y_pred_both_scaling)
print("MSE with both feature and target variable scaling:", mse_both_scaling)

# Display the Ranking
mse_values = [mse_no_scaling, mse_feature_scaling, mse_both_scaling]
models = ["Without Scaling", "With Only Feature Scaling", "With Both Scaling"]

ranking = sorted(zip(mse_values, models), key=lambda x: x[0])
for rank, (mse, model) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: MSE = {mse:.2f}")
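
For ordinary least squares with an intercept, standardizing the features changes the coefficients but not the fitted predictions, so the first two MSE values in a comparison like the one above should agree up to floating-point error. A minimal sketch on synthetic data with deliberately mismatched feature scales:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0]) + np.array([0.0, 5000.0])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = X[:80], X[80:], y[:80], y[80:]

# Fit once on raw features and once on standardized features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
scaler = StandardScaler().fit(X_train)
scaled_pred = (
    LinearRegression()
    .fit(scaler.transform(X_train), y_train)
    .predict(scaler.transform(X_test))
)

# The two sets of predictions should match up to rounding error.
print(np.allclose(raw_pred, scaled_pred))
```

Scaling matters more for regularized linear models (such as Ridge or Lasso) and for distance-based methods, where the penalty or distance depends on feature magnitudes.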


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Baseline: no additional scaling in this cell (note that X_train and y_train were already standardized before the split)
model_no_scaling = DecisionTreeRegressor()
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
mse_no_scaling = mean_squared_error(y_test, y_pred_no_scaling)
print("MSE without feature scaling:", mse_no_scaling)

# With Only Feature Scaling (Without Target Variable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_feature_scaling = DecisionTreeRegressor()
model_feature_scaling.fit(X_train_scaled, y_train)
y_pred_feature_scaling = model_feature_scaling.predict(X_test_scaled)
mse_feature_scaling = mean_squared_error(y_test, y_pred_feature_scaling)
print("MSE with only feature scaling:", mse_feature_scaling)

# With Feature and Target Variable Scaling
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train)
y_test_scaled = scaler_y.transform(y_test)

model_both_scaling = DecisionTreeRegressor()
model_both_scaling.fit(X_train_scaled, y_train_scaled)
y_pred_both_scaling = model_both_scaling.predict(X_test_scaled)
mse_both_scaling = mean_squared_error(y_test_scaled, y_pred_both_scaling)
print("MSE with both feature and target variable scaling:", mse_both_scaling)

# Display the Ranking
mse_values = [mse_no_scaling, mse_feature_scaling, mse_both_scaling]
models = ["Without Scaling", "With Only Feature Scaling", "With Both Scaling"]

ranking = sorted(zip(mse_values, models), key=lambda x: x[0])
for rank, (mse, model) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: MSE = {mse:.2f}")
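
Tree splits depend only on the ordering of feature values, so standardizing the features is generally not expected to change a decision tree's predictions, apart from floating-point edge cases. The sketch below illustrates this on synthetic data; the integer-valued features and the 0.25 offset on the test points are assumptions made to keep test values away from split thresholds.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; integer-valued features keep split thresholds well separated.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 1000, size=(80, 3)).astype(float)
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + rng.normal(size=80)
# Thresholds are midpoints between integers, so an offset of 0.25 avoids them.
X_test = rng.integers(0, 1000, size=(20, 3)) + 0.25

scaler = StandardScaler().fit(X_train)

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=42)
tree_raw.fit(X_train, y_train)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=42)
tree_scaled.fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))
print(np.allclose(pred_raw, pred_scaled))
```

The same reasoning applies to tree ensembles such as random forests and gradient boosting, although their internal randomness can introduce small differences between runs unless `random_state` is fixed.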
