### House Price Prediction Using Machine Learning

**1. Preprocessing**:
The preprocessing step is crucial in any machine learning project as it helps to clean and prepare the data for analysis. For this project, we follow these steps:

- **Data Collection**: Gather historical data on house prices along with various features such as location, size, number of rooms, age of the property, and more.
- **Data Cleaning**: Handle missing values by either imputing them with the mean/median or removing records with excessive missing data. Remove duplicates and irrelevant features.
- **Feature Engineering**: Create new features that can provide additional insights. For instance, calculate the age of the house from the year built, or derive the price per square foot.
- **Data Transformation**: Standardize numerical features (e.g., size, price) to have a consistent scale. Encode categorical variables (e.g., location, type of house) using techniques like one-hot encoding or label encoding.
- **Splitting the Dataset**: Divide the dataset into training and testing sets. Typically, 70-80% of the data is used for training, and the remaining 20-30% is used for testing.
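
The cleaning, feature-engineering, and splitting steps above can be sketched on a small made-up table. The values, the `HouseAge` column name, and the reference year `2010` are illustrative assumptions, not taken from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values) with one missing numeric entry
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000, 9000, 14000, 7000, 10000, 12000],
    "TotalBsmtSF": [850.0, 1250.0, np.nan, 750.0, 1150.0, 900.0, 1000.0, 1100.0],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1990, 1985, 1960],
    "SalePrice": [200000, 180000, 220000, 140000, 250000, 150000, 190000, 210000],
})

# Data cleaning: fill the missing basement area with the column median
toy["TotalBsmtSF"] = toy["TotalBsmtSF"].fillna(toy["TotalBsmtSF"].median())

# Feature engineering: house age relative to an assumed reference year
REFERENCE_YEAR = 2010  # assumption for illustration only
toy["HouseAge"] = REFERENCE_YEAR - toy["YearBuilt"]

# Splitting: hold out 25% of the rows for testing
X = toy.drop(columns="SalePrice")
y = toy["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models later.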
  
**2. Model Building**:
Once the data is preprocessed, we can proceed to build and train our machine learning model:

- **Selecting Models**: Choose among machine learning algorithms such as Linear Regression, Decision Trees, Random Forests, or Gradient Boosting Machines. This project uses Linear Regression.
- **Training the Models**: Train the model on the training set, optimizing for performance by tuning hyperparameters.
- **Model Evaluation**: Evaluate the models using the testing set, checking metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R² score. Compare the performance to select the best model.
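
To make the three evaluation metrics concrete, here is a small sketch that computes each one with scikit-learn and again by hand with NumPy. The two arrays are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Library versions
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same metrics written out from their definitions
errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))            # average absolute error
mse_manual = np.mean(errors ** 2)               # average squared error
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot               # share of variance explained

print(mae, mse, r2)
```

MAE and MSE are in the units of the target (MSE in squared units), while R² is unitless, so R² is the easier one to compare across differently scaled targets.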

**3. Model Deployment**:
Now, we deploy the model to predict house prices for new data:

- **Validation**: Validate the model with real-world data to ensure it generalizes well and produces accurate predictions.
- **Integration**: Integrate the model into a user-friendly application or API where users can input features to get price predictions.
- **Monitoring and Maintenance**: Continuously monitor the model's performance and update it as needed to maintain accuracy over time.
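
As a rough illustration of the integration step, the sketch below serializes a fitted model and wraps it in a prediction function. The stand-in model, its made-up training data, and the `predict_price` helper are all hypothetical; a real deployment would load the trained house-price model and apply the same preprocessing to incoming features.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on made-up data (two features, exactly linear target)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 1.0
model = LinearRegression().fit(X, y)

# Serialize the fitted model to bytes, as one might before shipping it to a service
payload = pickle.dumps(model)


def predict_price(features, serialized_model=payload):
    """Hypothetical helper: load the model and predict for a single feature row."""
    loaded = pickle.loads(serialized_model)
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return float(loaded.predict(row)[0])


print(predict_price([5.0, 5.0]))
```

Only unpickle payloads from a trusted source, since `pickle.loads` can execute arbitrary code.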

**4. Visualization**:
Finally, visualize the results and insights:

- **Feature Importance**: Display which features have the most significant impact on house prices.
- **Price Distribution**: Plot the distribution of predicted vs. actual house prices.
- **Comparative Analysis**: Show a comparison of different regions or types of houses based on predicted prices.
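
For a linear model, one common way to approximate feature importance is the absolute size of each coefficient, provided the features are on a comparable scale. The sketch below uses made-up, already standardized features (`feat_a`, `feat_b`, `feat_c` are hypothetical names) and draws a simple bar chart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up standardized features: the target depends strongly on feat_a,
# weakly on feat_b, and not at all on feat_c
X = pd.DataFrame(rng.standard_normal((200, 3)), columns=["feat_a", "feat_b", "feat_c"])
y = 3.0 * X["feat_a"] - 0.5 * X["feat_b"] + 0.1 * rng.standard_normal(200)

model = LinearRegression().fit(X, y)

# Rank features by the magnitude of their coefficients
importance = pd.Series(np.abs(model.coef_), index=X.columns).sort_values(ascending=False)

fig, ax = plt.subplots()
ax.bar(importance.index, importance.values)
ax.set_ylabel("|coefficient|")
ax.set_title("Feature importance (linear model)")
fig.tight_layout()
plt.close(fig)

print(importance)
```

Coefficient magnitudes are only comparable when the features share a scale, which is one reason to standardize before fitting.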

This project provides valuable insights into the factors influencing house prices and helps stakeholders make informed decisions in the real estate market.


### Feature Explanation
Here's a brief description of each feature:

1. **Id**: A unique identifier for each property.
2. **MSSubClass**: The type of dwelling involved in the sale (e.g., 20 for a 1-story 1946 & newer, all styles; 30 for a 1-story 1945 & older, all styles).
3. **MSZoning**: The general zoning classification (e.g., RL for residential low density, RM for residential medium density).
4. **LotArea**: The lot size of the property in square feet.
5. **LotConfig**: The lot configuration (e.g., Inside, Corner, CulDSac).
6. **BldgType**: The type of building (e.g., 1Fam for single-family detached, TwnhsE for townhouse end unit).
7. **OverallCond**: The overall condition rating of the house (on a scale from 1 to 10, where 1 is very poor and 10 is excellent).
8. **YearBuilt**: The year the house was originally constructed.
9. **YearRemodAdd**: The year the house was remodeled or added to (the same as YearBuilt if there was no remodeling or addition).
10. **Exterior1st**: The primary exterior material used on the house (e.g., VinylSd for Vinyl Siding, BrkFace for Brick Face).
11. **BsmtFinSF2**: The area of the second (Type 2) finished section of the basement, in square feet.
12. **TotalBsmtSF**: The total area of the basement, in square feet.
13. **SalePrice**: The sale price of the house, which is the target variable we aim to predict.

In [64]:
import pandas as pd  # data loading and analysis
import numpy as np  # mathematical operations on arrays

import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical data visualization

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


In [29]:
file_path = '/content/HousePricePrediction.xlsx'
df = pd.read_excel(file_path)
df_Tour = df.copy()
print(f'there are {df_Tour.shape[0]} rows , and {df_Tour.shape[1]} columns')

there are 2919 rows , and 13 columns


In [30]:
df_Tour.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,RL,8450,Inside,1Fam,5,2003,2003,VinylSd,0.0,856.0,208500.0
1,1,20,RL,9600,FR2,1Fam,8,1976,1976,MetalSd,0.0,1262.0,181500.0
2,2,60,RL,11250,Inside,1Fam,5,2001,2002,VinylSd,0.0,920.0,223500.0
3,3,70,RL,9550,Corner,1Fam,5,1915,1970,Wd Sdng,0.0,756.0,140000.0
4,4,60,RL,14260,FR2,1Fam,5,2000,2000,VinylSd,0.0,1145.0,250000.0


In [31]:
df_Tour.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
2914,2914,160,RM,1936,Inside,Twnhs,7,1970,1970,CemntBd,0.0,546.0,
2915,2915,160,RM,1894,Inside,TwnhsE,5,1970,1970,CemntBd,0.0,546.0,
2916,2916,20,RL,20000,Inside,1Fam,7,1960,1996,VinylSd,0.0,1224.0,
2917,2917,85,RL,10441,Inside,1Fam,5,1992,1992,HdBoard,0.0,912.0,
2918,2918,60,RL,9627,Inside,1Fam,5,1993,1994,HdBoard,0.0,996.0,


In [32]:
print('number of rows: ', df_Tour.shape[0])
print('number of columns: ', df_Tour.shape[1])
print('features are: ' , df_Tour.columns.tolist())
print('\n')

number of rows:  2919
number of columns:  13
features are:  ['Id', 'MSSubClass', 'MSZoning', 'LotArea', 'LotConfig', 'BldgType', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'Exterior1st', 'BsmtFinSF2', 'TotalBsmtSF', 'SalePrice']




In [33]:
print('Duplicated values are: \n', df_Tour.duplicated().sum())

Duplicated values are: 
 0


In [34]:
print('Missing values are: \n', df_Tour.isnull().sum().sort_values(ascending=False))

Missing values are: 
 SalePrice       1459
MSZoning           4
Exterior1st        1
BsmtFinSF2         1
TotalBsmtSF        1
Id                 0
MSSubClass         0
LotArea            0
LotConfig          0
BldgType           0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
dtype: int64


In [35]:
df_Tour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            2919 non-null   int64  
 1   MSSubClass    2919 non-null   int64  
 2   MSZoning      2915 non-null   object 
 3   LotArea       2919 non-null   int64  
 4   LotConfig     2919 non-null   object 
 5   BldgType      2919 non-null   object 
 6   OverallCond   2919 non-null   int64  
 7   YearBuilt     2919 non-null   int64  
 8   YearRemodAdd  2919 non-null   int64  
 9   Exterior1st   2918 non-null   object 
 10  BsmtFinSF2    2918 non-null   float64
 11  TotalBsmtSF   2918 non-null   float64
 12  SalePrice     1460 non-null   float64
dtypes: float64(3), int64(6), object(4)
memory usage: 296.6+ KB


In [36]:
# Drop the few rows that have missing feature values; missing SalePrice values are handled in a later cell
df_cleaned = df_Tour.dropna(subset=['MSZoning','Exterior1st','BsmtFinSF2','TotalBsmtSF'])
df_cleaned.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,RL,8450,Inside,1Fam,5,2003,2003,VinylSd,0.0,856.0,208500.0
1,1,20,RL,9600,FR2,1Fam,8,1976,1976,MetalSd,0.0,1262.0,181500.0
2,2,60,RL,11250,Inside,1Fam,5,2001,2002,VinylSd,0.0,920.0,223500.0
3,3,70,RL,9550,Corner,1Fam,5,1915,1970,Wd Sdng,0.0,756.0,140000.0
4,4,60,RL,14260,FR2,1Fam,5,2000,2000,VinylSd,0.0,1145.0,250000.0


In [37]:
print('Missing values are: \n', df_cleaned.isnull().sum().sort_values(ascending=False))

Missing values are: 
 SalePrice       1453
Id                 0
MSSubClass         0
MSZoning           0
LotArea            0
LotConfig          0
BldgType           0
OverallCond        0
YearBuilt          0
YearRemodAdd       0
Exterior1st        0
BsmtFinSF2         0
TotalBsmtSF        0
dtype: int64


In [38]:
print(df_cleaned['MSZoning'].unique())
print(df_cleaned['LotConfig'].unique())
print(df_cleaned['BldgType'].unique())
print(df_cleaned['Exterior1st'].unique())


['RL' 'RM' 'C (all)' 'FV' 'RH']
['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc'
 'CBlock']


In [39]:
# Mapping from zoning category to an integer code
mapping_MSZoning = {'RL': 1, 'RM': 2, 'C (all)': 3, 'FV': 4, 'RH': 5}

# Replace the category labels with their integer codes.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_cleaned = df_cleaned.assign(MSZoning=df_cleaned['MSZoning'].map(mapping_MSZoning))

df_cleaned.head()


Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,1,8450,Inside,1Fam,5,2003,2003,VinylSd,0.0,856.0,208500.0
1,1,20,1,9600,FR2,1Fam,8,1976,1976,MetalSd,0.0,1262.0,181500.0
2,2,60,1,11250,Inside,1Fam,5,2001,2002,VinylSd,0.0,920.0,223500.0
3,3,70,1,9550,Corner,1Fam,5,1915,1970,Wd Sdng,0.0,756.0,140000.0
4,4,60,1,14260,FR2,1Fam,5,2000,2000,VinylSd,0.0,1145.0,250000.0


In [40]:
# Mapping from lot configuration to an integer code (each key appears once; code 3 is unused)
mapping_LotConfig = {'Inside': 1, 'FR2': 2, 'Corner': 4, 'CulDSac': 5, 'FR3': 6}

# Replace the category labels with their integer codes.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_cleaned = df_cleaned.assign(LotConfig=df_cleaned['LotConfig'].map(mapping_LotConfig))

df_cleaned.head()


Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,1,8450,1,1Fam,5,2003,2003,VinylSd,0.0,856.0,208500.0
1,1,20,1,9600,2,1Fam,8,1976,1976,MetalSd,0.0,1262.0,181500.0
2,2,60,1,11250,1,1Fam,5,2001,2002,VinylSd,0.0,920.0,223500.0
3,3,70,1,9550,4,1Fam,5,1915,1970,Wd Sdng,0.0,756.0,140000.0
4,4,60,1,14260,2,1Fam,5,2000,2000,VinylSd,0.0,1145.0,250000.0


In [41]:
# Mapping from building type to an integer code
mapping_BldgType = {'1Fam': 1, '2fmCon': 2, 'Duplex': 3, 'TwnhsE': 4, 'Twnhs': 5}

# Replace the category labels with their integer codes.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_cleaned = df_cleaned.assign(BldgType=df_cleaned['BldgType'].map(mapping_BldgType))

df_cleaned.head()


Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,1,8450,1,1,5,2003,2003,VinylSd,0.0,856.0,208500.0
1,1,20,1,9600,2,1,8,1976,1976,MetalSd,0.0,1262.0,181500.0
2,2,60,1,11250,1,1,5,2001,2002,VinylSd,0.0,920.0,223500.0
3,3,70,1,9550,4,1,5,1915,1970,Wd Sdng,0.0,756.0,140000.0
4,4,60,1,14260,2,1,5,2000,2000,VinylSd,0.0,1145.0,250000.0


In [42]:
# Mapping from exterior material to an integer code
mapping_Exterior1st = {'VinylSd': 1, 'MetalSd': 2, 'Wd Sdng': 3, 'HdBoard': 4, 'BrkFace': 5, 'WdShing': 6, 'CemntBd': 7, 'Plywood': 8, 'AsbShng': 9, 'Stucco': 10, 'BrkComm': 11, 'AsphShn': 12, 'Stone': 13, 'ImStucc': 14, 'CBlock': 15}

# Replace the category labels with their integer codes.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_cleaned = df_cleaned.assign(Exterior1st=df_cleaned['Exterior1st'].map(mapping_Exterior1st))

df_cleaned.head()


Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0,60,1,8450,1,1,5,2003,2003,1,0.0,856.0,208500.0
1,1,20,1,9600,2,1,8,1976,1976,2,0.0,1262.0,181500.0
2,2,60,1,11250,1,1,5,2001,2002,1,0.0,920.0,223500.0
3,3,70,1,9550,4,1,5,1915,1970,3,0.0,756.0,140000.0
4,4,60,1,14260,2,1,5,2000,2000,1,0.0,1145.0,250000.0
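
The integer codes assigned by the mappings above impose an order on categories that have no natural ranking (for example, `'RM'` = 2 sits between `'RL'` = 1 and `'C (all)'` = 3), and a linear model treats that order as meaningful. One-hot encoding, mentioned in the preprocessing overview, avoids this. The following is a minimal sketch of that alternative using `pd.get_dummies` on a made-up frame; it is not what this notebook does.

```python
import pandas as pd

# Small made-up frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "LotConfig": ["Inside", "Corner", "FR2", "Inside"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["MSZoning", "LotConfig"], dtype=int)

print(encoded.columns.tolist())
```

The trade-off is width: a column with many categories, such as `Exterior1st`, expands into one indicator column per category.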


In [43]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2913 entries, 0 to 2918
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            2913 non-null   int64  
 1   MSSubClass    2913 non-null   int64  
 2   MSZoning      2913 non-null   int64  
 3   LotArea       2913 non-null   int64  
 4   LotConfig     2913 non-null   int64  
 5   BldgType      2913 non-null   int64  
 6   OverallCond   2913 non-null   int64  
 7   YearBuilt     2913 non-null   int64  
 8   YearRemodAdd  2913 non-null   int64  
 9   Exterior1st   2913 non-null   int64  
 10  BsmtFinSF2    2913 non-null   float64
 11  TotalBsmtSF   2913 non-null   float64
 12  SalePrice     1460 non-null   float64
dtypes: float64(3), int64(10)
memory usage: 318.6 KB


In [44]:
df_cleaned = df_cleaned.drop(columns = 'Id')
df_cleaned.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,60,1,8450,1,1,5,2003,2003,1,0.0,856.0,208500.0
1,20,1,9600,2,1,8,1976,1976,2,0.0,1262.0,181500.0
2,60,1,11250,1,1,5,2001,2002,1,0.0,920.0,223500.0
3,70,1,9550,4,1,5,1915,1970,3,0.0,756.0,140000.0
4,60,1,14260,2,1,5,2000,2000,1,0.0,1145.0,250000.0


In [47]:
# Fill missing SalePrice values by linear interpolation.
# Trailing gaps have no later value to interpolate toward, so they receive the last known price.
df_cleaned['SalePrice'] = df_cleaned['SalePrice'].interpolate(method='linear')
df_cleaned.tail()


Unnamed: 0,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
2914,160,2,1936,1,5,7,1970,1970,7,0.0,546.0,147500.0
2915,160,2,1894,1,4,5,1970,1970,7,0.0,546.0,147500.0
2916,20,1,20000,1,1,7,1960,1996,1,0.0,1224.0,147500.0
2917,85,1,10441,1,1,5,1992,1992,4,0.0,912.0,147500.0
2918,60,1,9627,1,1,5,1993,1994,4,0.0,996.0,147500.0
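
Every row in the tail shown above has the same `SalePrice` (147500.0). That is how linear interpolation treats trailing gaps: with no later value to interpolate toward, pandas carries the last known value forward by default. The sketch below shows this on a made-up series, along with one possible alternative (an assumption, not what this notebook does) of keeping only the rows whose target is known.

```python
import numpy as np
import pandas as pd

# Made-up prices: one interior gap and two trailing gaps
prices = pd.Series([100.0, np.nan, 300.0, np.nan, np.nan])

# Interior gap is filled with the midpoint; trailing gaps repeat the last known value
filled = prices.interpolate(method="linear")

# Alternative: keep only the entries with a known value
labeled_only = prices.dropna()

print(filled.tolist())
print(labeled_only.tolist())
```

Rows filled this way carry a constant target unrelated to their features, which is worth keeping in mind when reading the evaluation metrics.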


In [48]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2913 entries, 0 to 2918
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    2913 non-null   int64  
 1   MSZoning      2913 non-null   int64  
 2   LotArea       2913 non-null   int64  
 3   LotConfig     2913 non-null   int64  
 4   BldgType      2913 non-null   int64  
 5   OverallCond   2913 non-null   int64  
 6   YearBuilt     2913 non-null   int64  
 7   YearRemodAdd  2913 non-null   int64  
 8   Exterior1st   2913 non-null   int64  
 9   BsmtFinSF2    2913 non-null   float64
 10  TotalBsmtSF   2913 non-null   float64
 11  SalePrice     2913 non-null   float64
dtypes: float64(3), int64(9)
memory usage: 295.9 KB


In [49]:
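# Standardize every column (z-score): subtract the column mean and divide by its standard deviation.
# This includes the target, so SalePrice is expressed in standard-deviation units from here on.
mean = df_cleaned.mean().iloc[0]    # mean of the first column (MSSubClass)
stddev = df_cleaned.std().iloc[0]   # standard deviation of the first column (MSSubClass)
df_cleaned = (df_cleaned - df_cleaned.mean())/df_cleaned.std()

df_cleaned.head()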


Unnamed: 0,MSSubClass,MSZoning,LotArea,LotConfig,BldgType,OverallCond,YearBuilt,YearRemodAdd,Exterior1st,BsmtFinSF2,TotalBsmtSF,SalePrice
0,0.066054,-0.442207,-0.215467,-0.577198,-0.42383,-0.511074,1.044612,0.895547,-0.883466,-0.293302,-0.447601,0.75429
1,-0.87428,-0.442207,-0.068645,0.127192,-0.42383,2.194122,0.152308,-0.398489,-0.480596,-0.293302,0.47578,0.294038
2,0.066054,-0.442207,0.142013,-0.577198,-0.42383,-0.511074,0.978515,0.84762,-0.883466,-0.293302,-0.302043,1.009986
3,0.301138,-0.442207,-0.075028,1.535971,-0.42383,-0.511074,-1.863638,-0.686053,-0.077725,-0.293302,-0.675035,-0.413388
4,0.066054,-0.442207,0.526305,0.127192,-0.42383,-0.511074,0.945467,0.751765,-0.883466,-0.293302,0.209683,1.461716
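
Because the target column is standardized together with the features, predictions and any error metric computed on them are in standard-deviation units rather than in the original price units. A minimal sketch of keeping the target's mean and standard deviation so values can be mapped back, using made-up numbers (`to_price` is a hypothetical helper):

```python
import pandas as pd

# Made-up data with one feature and the target
toy = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, 12000.0],
    "SalePrice": [150000.0, 200000.0, 250000.0],
})

# Remember the target's statistics before standardizing
price_mean = toy["SalePrice"].mean()
price_std = toy["SalePrice"].std()  # pandas uses the sample standard deviation (ddof=1)

# Same z-score transformation as in the cell above
standardized = (toy - toy.mean()) / toy.std()


def to_price(z):
    """Hypothetical helper: map standardized SalePrice values back to original units."""
    return z * price_std + price_mean


recovered = to_price(standardized["SalePrice"])
print(recovered.tolist())
```

An error measured in standardized units can be rescaled the same way: multiplying a root-mean-squared error by `price_std` expresses it in price units.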


In [62]:
#Separate out the Feature and Target matrices
df_cleaned=np.asarray(df_cleaned)

#Y is assigned the last column of data (target values).
Y=df_cleaned[:,-1]

#X is assigned all columns except the last one (feature values).
X=df_cleaned[:,:-1]

print(df_cleaned)
print('\n','Y (target values):')
print(Y)
print('\n','X (features):')
print(X)



[[ 0.0660542  -0.44220671 -0.21546692 ... -0.29330249 -0.44760099
   0.75429037]
 [-0.87428001 -0.44220671 -0.06864465 ... -0.29330249  0.47577997
   0.29403756]
 [ 0.0660542  -0.44220671  0.1420134  ... -0.29330249 -0.3020434
   1.00998637]
 ...
 [-0.87428001 -0.44220671  1.25913943 ... -0.29330249  0.38935515
  -0.28554006]
 [ 0.65376308 -0.44220671  0.03872712 ... -0.29330249 -0.3202381
  -0.28554006]
 [ 0.0660542  -0.44220671 -0.06519751 ... -0.29330249 -0.12919376
  -0.28554006]]

 Y (target values):
[ 0.75429037  0.29403756  1.00998637 ... -0.28554006 -0.28554006
 -0.28554006]

 X (features):
[[ 0.0660542  -0.44220671 -0.21546692 ... -0.88346599 -0.29330249
  -0.44760099]
 [-0.87428001 -0.44220671 -0.06864465 ... -0.48059554 -0.29330249
   0.47577997]
 [ 0.0660542  -0.44220671  0.1420134  ... -0.88346599 -0.29330249
  -0.3020434 ]
 ...
 [-0.87428001 -0.44220671  1.25913943 ... -0.88346599 -0.29330249
   0.38935515]
 [ 0.65376308 -0.44220671  0.03872712 ...  0.32514536 -0.29330249

In [63]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Display the splits
print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("Y_train:\n", Y_train)
print("Y_test:\n", Y_test)

X_train:
 [[-0.87428001 -0.44220671 -0.34428751 ... -0.88346599 -0.29330249
   0.92837311]
 [-0.63919646 -0.44220671  0.08456121 ... -0.88346599 -0.29330249
  -0.96160123]
 [-0.87428001  3.30968109 -0.3596081  ...  1.5337567  -0.29330249
   0.65090395]
 ...
 [-0.16902935 -0.44220671 -0.29794274 ...  1.13088625 -0.29330249
   0.15737275]
 [-0.87428001 -0.44220671 -0.25095961 ...  1.13088625 -0.29330249
  -0.42940629]
 [-0.16902935 -0.44220671 -0.31862553 ... -0.07772509 -0.29330249
  -0.3202381 ]]
X_test:
 [[-0.87428001 -0.44220671  0.93012985 ... -0.88346599 -0.29330249
   2.6454978 ]
 [-0.87428001 -0.44220671  0.2178503  ... -0.88346599 -0.29330249
   2.25886045]
 [-0.16902935 -0.44220671  0.00336211 ... -0.88346599 -0.29330249
  -0.13829111]
 ...
 [-0.87428001 -0.44220671 -0.3116036  ... -0.88346599 -0.29330249
   0.43939058]
 [-0.87428001 -0.44220671 -0.09762609 ...  0.32514536  0.69879422
   1.29226708]
 [-0.63919646 -0.44220671 -0.67470147 ... -0.48059554 -0.29330249
   0.12325769

In [66]:
# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, Y_train)

# Make predictions on the test set
Y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print("Mean Squared Error (MSE):", mse)
print('\n')
print("R-squared (R2) Score:", r2)
print('\n')

print("Actual values:", Y_test)
print('\n')
print("Predicted values:", Y_pred)
print('\n')

Mean Squared Error (MSE): 0.7841137690300221


R-squared (R2) Score: 0.322852538218814


Actual values: [ 5.75463033e+00  3.54990004e+00 -2.85540058e-01 -2.85540058e-01
 -2.85540058e-01 -2.85540058e-01 -1.04410488e+00 -9.24780074e-01
 -2.85540058e-01 -2.85540058e-01  2.17328755e-01 -2.85540058e-01
 -1.77710010e+00  1.81969040e+00 -6.86130468e-01  1.20601998e+00
 -2.85540058e-01 -2.85540058e-01 -4.30434461e-01 -2.85540058e-01
 -3.36679259e-01 -1.08257493e-01 -2.85540058e-01  5.58256763e-01
 -2.85540058e-01 -2.85540058e-01  2.26289681e+00 -2.85540058e-01
 -8.95064529e-02 -3.62248860e-01  1.65774959e+00 -2.85540058e-01
  2.09396698e+00 -4.13388061e-01 -2.85540058e-01  1.63217999e+00
 -9.58872875e-01 -2.08831256e-01 -2.25877656e-01 -2.85540058e-01
 -2.85540058e-01 -2.85540058e-01  2.61649396e-01  8.48045571e-01
  3.70746359e-01 -2.85540058e-01 -2.85540058e-01 -2.85540058e-01
 -2.85540058e-01 -2.85540058e-01 -2.85540058e-01 -8.48071272e-01
 -2.85540058e-01 -2.85540058e-01 -4.73050463e-01  4