In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

We will be using house price data from Kaggle: https://www.kaggle.com/datasets/ashydv/housing-dataset/code
Kaggle is a popular site for machine learning, with user-posted datasets, models, and competitions.


In [2]:
houses = pd.read_csv("Housing.csv")

In [13]:
print(type(houses))
print(houses.head())


<class 'pandas.core.frame.DataFrame'>
      price  area  bedrooms  bathrooms  stories mainroad guestroom basement  \
0  13300000  7420         4          2        3      yes        no       no   
1  12250000  8960         4          4        4      yes        no       no   
2  12250000  9960         3          2        2      yes        no      yes   
3  12215000  7500         4          2        2      yes        no      yes   
4  11410000  7420         4          1        2      yes       yes      yes   

  hotwaterheating airconditioning  parking prefarea furnishingstatus  
0              no             yes        2      yes        furnished  
1              no             yes        3       no        furnished  
2              no              no        2      yes   semi-furnished  
3              no             yes        3      yes        furnished  
4              no             yes        2       no        furnished  


It is good practice to split the data into a training set and a test set.
Scoring the model on data it has not seen measures how well it captures real trends rather than individual points.
If we tested on the same data we trained on, a model that simply memorized every point would score best, even though it would be useless on new houses.
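As a rough illustration of the memorization problem, the sketch below uses synthetic data (not the Kaggle file) and a 1-nearest-neighbor regressor. That model is an assumption chosen here only because it effectively memorizes its training points; the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic "area -> price" data with noise (made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(1000, 10000, size=(200, 1))
y = 500 * X[:, 0] + rng.normal(0, 500_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With one neighbor, each training point's nearest neighbor is itself,
# so the model reproduces the training prices exactly.
memorizer = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

train_mae = np.abs(memorizer.predict(X_train) - y_train).mean()
test_mae = np.abs(memorizer.predict(X_test) - y_test).mean()
print("error on training data:", train_mae)
print("error on unseen data:  ", test_mae)
```

The training error looks perfect, while the error on unseen rows shows how the model would actually perform. That gap is what the test set is for.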

In [14]:
train, test = train_test_split(houses)

In [18]:
print(len(train))
print(len(test))
print(len(houses))

408
137
545
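The 408/137 split above matches the default behavior of train_test_split, which holds out 25% of the rows. The sketch below reproduces those sizes on a stand-in DataFrame with 545 rows (not the real housing data) and shows how the optional random_state argument makes the shuffle repeatable; the value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with the same number of rows as the housing data.
demo = pd.DataFrame({"area": range(545), "price": range(545)})

# Default: 25% of the rows go to the test set, chosen at random.
train_a, test_a = train_test_split(demo)
print(len(train_a), len(test_a))

# Fixing random_state gives the same split on every run.
train_b, test_b = train_test_split(demo, test_size=0.25, random_state=42)
train_c, test_c = train_test_split(demo, test_size=0.25, random_state=42)
print(test_b.index.equals(test_c.index))
```

Without random_state, every run produces a different split, so error scores will differ slightly from run to run.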


In the modeling process, we first create an empty model, then fit it on training data.
Once fit, we can use it to make predictions.

In [4]:
m1 = LinearRegression()

In [5]:
m1

Fitting a model takes the X columns (the features we make predictions from) and a y column (the value we are trying to predict).

In [6]:
m1.fit(train[["area"]], train["price"])

In [7]:
m1

In [8]:
# The X columns must be a 2D array, such as a pandas DataFrame, even if you use only one column.
# Indexing a DataFrame with [[xcol1, xcol2, ...]] returns a DataFrame containing only the listed columns.
# The y column must be a 1D array, such as a pandas Series.

train[["area"]]  # a DataFrame; train["area"] would instead be a Series

       area
359    3600
526    3180
340    5300
211   12900
423    3750
...     ...
114    6800
177    6050
471    3750
344    3850


In [20]:
print(type(train["area"]))

<class 'pandas.core.series.Series'>
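The difference between the two indexing styles is easiest to see in the shapes. This is a small sketch on a made-up three-row DataFrame (the values are illustrative, not taken from Housing.csv):

```python
import pandas as pd

demo = pd.DataFrame({
    "area": [3600, 3180, 5300],
    "price": [3_700_000, 2_345_000, 3_850_000],
})

X = demo[["area"]]  # double brackets: 2D DataFrame with one column
y = demo["price"]   # single brackets: 1D Series

print(type(X).__name__, X.shape)  # DataFrame (3, 1)
print(type(y).__name__, y.shape)  # Series (3,)
```

The (3, 1) shape is what fit expects for the X columns, and the (3,) shape is what it expects for y.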


When using the model to predict, we must pass the x columns in the same format as the fitting process.
The model's job is to use the trends learned when fitted to predict a y column, which it does not have access to.
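To see what "the same format" means in practice, the sketch below fits a model on two columns of a made-up DataFrame and then tries to predict with only one of them. The data and names are hypothetical; the point is that scikit-learn rejects input whose columns do not match those used in fit.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

demo = pd.DataFrame({
    "area": [3000, 4000, 5000, 6000],
    "bedrooms": [2, 3, 3, 4],
    "price": [3_000_000, 4_100_000, 4_900_000, 6_200_000],
})
model = LinearRegression().fit(demo[["area", "bedrooms"]], demo["price"])

# Same columns as in fit: works and returns one prediction per row.
ok = model.predict(demo[["area", "bedrooms"]])
print(len(ok))

# A different set of columns: predict raises a ValueError.
try:
    model.predict(demo[["area"]])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("predict rejected the input:", err)
```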

In [9]:
predictions = m1.predict(test[["area"]])

In [24]:
print(len(predictions))
print(type(predictions))
print(type(predictions-test["price"]))

137
<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>


In [10]:
# Comparing these predictions with the actual house prices gives an error score we can use to compare different models.
abs(predictions - test["price"]).mean()

np.float64(1128224.8880968918)
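The quantity computed above is the mean absolute error (MAE). As a cross-check, the sketch below computes it by hand on three made-up values and compares the result with scikit-learn's mean_absolute_error, which should agree. Note that subtracting a Series from a NumPy array is done by position, so the Series index labels do not matter here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up prices; the index is deliberately not 0, 1, 2.
actual = pd.Series([100.0, 200.0, 300.0], index=[7, 2, 5])
predicted = np.array([110.0, 190.0, 330.0])

# Same expression as in the cell above: average of the absolute errors.
manual_mae = abs(predicted - actual).mean()

# Library equivalent.
library_mae = mean_absolute_error(actual, predicted)

print(manual_mae, library_mae)
```

Here the absolute errors are 10, 10, and 30, so both values come out to 50/3.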

In [None]:
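# On average, the estimate is off by about $1,128,225 in this base model (the output above).
# The exact value changes between runs because train_test_split shuffles the rows randomly.

# Try adding more columns to your X columns to improve prediction accuracy.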

In [11]:
m2 = LinearRegression()

In [25]:
m2.fit(train[["area","bedrooms","bathrooms","stories","parking"]],train["price"])

In [27]:
# m2.predict(test[["area"]]) would raise an error: m2 was fit on five columns,
# so predict needs exactly those five columns, in the same order.
predictions_m2 = m2.predict(test[["area","bedrooms","bathrooms","stories","parking"]])

In [28]:
# Score m2 with its own predictions; reusing `predictions` here would re-score m1 instead.
abs(predictions_m2 - test["price"]).mean()

In [29]:
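feature_cols = ["area","bedrooms","bathrooms","stories","parking"]
# Each coefficient is the predicted price change for a one-unit increase in that feature,
# holding the others fixed. The features use different units, so a larger coefficient
# does not by itself mean a more important feature.
coefs = pd.Series(m2.coef_, index=feature_cols)
print("Intercept:", m2.intercept_)
print("\nCoefficients:")
print(coefs.sort_values(ascending=False))

Intercept: -129815.32501989603

Coefficients: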
bathrooms    1.139359e+06
stories      5.267558e+05
parking      3.457122e+05
bedrooms     1.931476e+05
area         3.198092e+02
dtype: float64
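A fitted LinearRegression predicts with intercept + coef_1 * x_1 + coef_2 * x_2 + ..., so the printed numbers are enough to reproduce any prediction by hand. The sketch below checks this on a small made-up DataFrame with two features (hypothetical values, not the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

demo = pd.DataFrame({
    "area": [3000, 4000, 5000, 6000, 7000],
    "bathrooms": [1, 2, 1, 2, 3],
    "price": [3_100_000, 4_600_000, 4_900_000, 6_300_000, 7_900_000],
})
cols = ["area", "bathrooms"]
model = LinearRegression().fit(demo[cols], demo["price"])

# Rebuild the predictions from the intercept and coefficients.
manual = model.intercept_ + demo[cols].to_numpy() @ model.coef_

print(np.allclose(manual, model.predict(demo[cols])))
```

This also shows why the coefficient for area is small compared with the others: it is multiplied by values in the thousands, while bathrooms or stories are multiplied by single digits.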
