In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Preprocessing3.csv')

What are the unique values present in the Locality feature of the dataset?

In [6]:
df.Locality.unique()

array(['Greenwich', 'Norwalk', 'Waterbury', nan, 'Bridgeport',
       'Fairfield', 'West Hartford', 'Stamford'], dtype=object)

In [7]:
df['Locality'].nunique()

7
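The array above lists eight entries while `nunique()` reports 7, because `unique()` keeps NaN as a value and `nunique()` drops it by default. A minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['Locality']; values are illustrative only
s = pd.Series(['Greenwich', 'Norwalk', np.nan, 'Greenwich', 'Stamford'])

n_without_nan = s.nunique()            # NaN excluded (default dropna=True)
n_with_nan = s.nunique(dropna=False)   # NaN counted as its own value
print(n_without_nan, n_with_nan)
```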

Which of the following columns have categorical data?

In [8]:
df.dtypes

Date                  object
Year                   int64
Locality              object
Estimated Value      float64
Sale Price           float64
Property              object
Residential           object
num_rooms              int64
num_bathrooms          int64
carpet_area          float64
property_tax_rate    float64
Face                  object
dtype: object

In [9]:
cat = df.select_dtypes(include=['category', 'object']).columns
print(cat)

Index(['Date', 'Locality', 'Property', 'Residential', 'Face'], dtype='object')


Which of the following features have missing (NaN) or unknown ("?") values in the dataset?

In [10]:
df.isna().sum()

Date                    0
Year                    0
Locality             1285
Estimated Value      1281
Sale Price              0
Property                0
Residential             0
num_rooms               0
num_bathrooms           0
carpet_area          1282
property_tax_rate       0
Face                    0
dtype: int64

In [11]:
df.isin(['?']).sum()

Date                    0
Year                    0
Locality                0
Estimated Value         0
Sale Price              0
Property             1873
Residential             0
num_rooms               0
num_bathrooms           0
carpet_area             0
property_tax_rate       0
Face                    0
dtype: int64
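The two counts above can also be read side by side. The sketch below builds one summary table on a made-up frame; `toy`, `summary` and `flagged` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN and two "?" placeholders (illustrative values)
toy = pd.DataFrame({
    'Locality': ['Greenwich', np.nan, 'Norwalk'],
    'Property': ['Single Family', '?', '?'],
    'num_rooms': [3, 4, 2],
})

# One row per column: number of NaN values and number of "?" values
summary = pd.DataFrame({
    'n_nan': toy.isna().sum(),
    'n_unknown': toy.eq('?').sum(),
})

# Columns that have at least one missing or unknown value
flagged = summary[(summary > 0).any(axis=1)].index.tolist()
print(summary)
print(flagged)
```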

In the year 2022, how many houses (rows) in the Greenwich locality have 3 or more rooms (num_rooms >= 3) and face either North or East?

In [15]:
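# Note: the 0 recorded below came from a run with the locality misspelled as
# 'Grennwich', which matches no rows. Re-run this cell for the actual count.
cond1 = df[
    (df['Year'] == 2022) &
    (df['Locality'] == 'Greenwich') &
    (df['num_rooms'] >= 3) &
    (df['Face'].isin(['North', 'East']))
]
count = cond1.shape[0]
print(count)

0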
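A boolean filter like the one above returns zero rows without any error when a label is misspelled, so it can help to check the label against the column's values first. A small sketch on made-up rows (the `toy` frame and the `target` variable are assumptions for illustration):

```python
import pandas as pd

# Illustrative rows only; not taken from the dataset
toy = pd.DataFrame({
    'Year': [2022, 2022, 2021, 2022],
    'Locality': ['Greenwich', 'Greenwich', 'Greenwich', 'Norwalk'],
    'num_rooms': [3, 2, 4, 5],
    'Face': ['North', 'East', 'North', 'West'],
})

target = 'Greenwich'
# Fail loudly if the label does not occur in the column at all
assert target in set(toy['Locality'].dropna()), f"unknown locality: {target}"

mask = (
    (toy['Year'] == 2022)
    & (toy['Locality'] == target)
    & (toy['num_rooms'] >= 3)
    & (toy['Face'].isin(['North', 'East']))
)
count = int(mask.sum())  # True counts as 1, so the sum is the row count
print(count)
```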


Split the dataset into a train dataset and a test dataset in the following manner:
rows collected before 2021 [2009-2020] form the train dataset
rows collected in 2021 and 2022 (both inclusive) form the test dataset
all columns except the label form the feature matrix (X_train or X_test)
the label vector (y_train or y_test) contains only the values of the target feature.
How many rows are in the feature matrix of the test dataset?

In [16]:
X_train = df[(df.Year>=2009)&(df.Year<=2020)]
X_test = df[df.Year>=2021]
y_train = X_train['Sale Price']
y_test = X_test['Sale Price']
X_train = X_train.drop('Sale Price', axis=1)
X_test = X_test.drop('Sale Price', axis=1)

In [17]:
X_test.shape[0]

1728
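The year-based split above can be wrapped in a small helper. This is only a sketch: `split_by_year` is a hypothetical function, and the toy values are made up.

```python
import pandas as pd

def split_by_year(frame, target, first_test_year):
    """Rows before first_test_year go to train, the rest to test."""
    train = frame[frame['Year'] < first_test_year]
    test = frame[frame['Year'] >= first_test_year]
    return (train.drop(columns=target), test.drop(columns=target),
            train[target], test[target])

# Illustrative data only
toy = pd.DataFrame({
    'Year': [2019, 2020, 2021, 2022],
    'carpet_area': [900.0, 1000.0, 1100.0, 1200.0],
    'Sale Price': [200000.0, 210000.0, 250000.0, 260000.0],
})

X_tr, X_te, y_tr, y_te = split_by_year(toy, 'Sale Price', 2021)
print(X_tr.shape, X_te.shape)
```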

Compute the instructed statistical values for the different columns using the train dataset only, to replace missing (NaN) and unknown ("?") values respectively.

Replace the missing(NaN) and unknown("?") values from the train and test dataset with the instructed statistical values computed using the train dataset only.

Ignore the missing and unknown values while calculating the statistical values.

Replace missing values(NaN) with the MOST FREQUENT value of the Locality feature

Replace missing values(NaN) with the MEDIAN value of the Estimated Value feature

Replace missing values(NaN) with the MEAN value of the carpet_area feature

Replace Unknown values("?") with the MOST FREQUENT value of the Property.

In [18]:
(df == '?').any()

Date                 False
Year                 False
Locality             False
Estimated Value      False
Sale Price           False
Property              True
Residential          False
num_rooms            False
num_bathrooms        False
carpet_area          False
property_tax_rate    False
Face                 False
dtype: bool

In [19]:
# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which pandas warns will stop working in pandas 3.0.
X_train = X_train.assign(Property=X_train['Property'].replace('?', np.nan))
X_test = X_test.assign(Property=X_test['Property'].replace('?', np.nan))


In [20]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
ct = ColumnTransformer([
    ('mode', SimpleImputer(strategy='most_frequent'), ['Locality', 'Property']),
    ('median', SimpleImputer(strategy='median'), ['Estimated Value']),
    ('mean', SimpleImputer(strategy='mean'), ['carpet_area'])
], remainder='passthrough', verbose_feature_names_out=False).set_output(transform='pandas')
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

Write the MEAN value of the carpet_area column of the train dataset you found to replace all the Missing Values (NaN).

In [22]:
X_train.carpet_area.mean().round(2)

np.float64(1113.4)

What value you used to replace the unknown value ("?") in the Property column of the train dataset?

In [23]:
most_freq_value = X_train['Property'].mode()[0]
print("Value used to replace '?':", most_freq_value)


Value used to replace '?': Single Family
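The fill values can also be read directly from the fitted imputers instead of being recomputed from the imputed columns. A minimal sketch on made-up data, assuming the same transformer names ('mode', 'mean') as in the cell above; `toy_train` and `toy_ct` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Illustrative training data with one missing value per column
toy_train = pd.DataFrame({
    'Property': pd.Series(['Single Family', np.nan, 'Single Family', 'Condo'],
                          dtype=object),
    'carpet_area': [1000.0, np.nan, 1200.0, 1100.0],
})

toy_ct = ColumnTransformer([
    ('mode', SimpleImputer(strategy='most_frequent'), ['Property']),
    ('mean', SimpleImputer(strategy='mean'), ['carpet_area']),
]).fit(toy_train)

# statistics_ holds one learned fill value per column of each imputer
fill_property = toy_ct.named_transformers_['mode'].statistics_[0]
fill_area = toy_ct.named_transformers_['mean'].statistics_[0]
print(fill_property, fill_area)
```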


Apply preprocessing on features of train and test datasets.
Drop the 'Date' Column before the preprocessing steps.

before applying any preprocessing there should not be any missing or unknown values present in the train and test dataset.

fitting (learning) should be done only on train dataset.

transform the test dataset using the fitting (learning) of train dataset

For Numerical Features
Scale the numerical feature of the feature matrix using the Min-Max Scale

For Categorical Features
One-Hot Encode all categorical features(object columns) in the feature matrix

In [24]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

In [25]:
X_train = X_train.drop('Date', axis=1)
X_test = X_test.drop('Date', axis=1)
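# The original cell selected the 'category' dtype for the encoder. The
# categorical columns here are not of that dtype, so the encoder was skipped and
# those columns were dropped; the (8272, 6) shape recorded below is from that run.
# Pandas output requires a dense encoder, hence sparse_output=False.
ct = ColumnTransformer([
    ('num', MinMaxScaler(), X_train.select_dtypes('number').columns),
    ('ohe', OneHotEncoder(sparse_output=False), X_train.select_dtypes(exclude='number').columns)
], verbose_feature_names_out=False).set_output(transform='pandas')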

X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

In [26]:
X_train.shape

(8272, 6)

In [27]:
df = pd.read_csv('Model_Building_1.csv')

Split the dataset into train dataset and test dataset in the following manner :

data (rows) index [0, 8271] should be the train dataset
data (rows) index from 8272 till last row should be the test dataset
all columns except the label (Sale Price) form the feature matrix (X_train or X_test)
the label vector (y_train or y_test) contains only the values of the target feature.

In [28]:
X_train = df.iloc[:8272]
X_test = df.iloc[8272:]

In [29]:
y_train = X_train['Sale Price']
y_test = X_test['Sale Price']

X_train = X_train.drop('Sale Price', axis=1)
X_test = X_test.drop('Sale Price', axis=1)

Apply LinearRegression on the train dataset (X_train and y_train). What is the R2 score on the test dataset (X_test and y_test)?

In [30]:
from sklearn.linear_model import Ridge, Lasso, LinearRegression
lr=LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.6492749995107592
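For a regressor, `score` returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below checks this by hand on synthetic data; the arrays are generated only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X[:40], y[:40])
y_true, y_hat = y[40:], model.predict(X[40:])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = model.score(X[40:], y[40:])
print(r2_manual, r2_sklearn)
```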

Enter the maximum R2 score obtained with the LinearRegression model under 5-fold cross-validation on the training dataset (X_train and y_train) using cross_val_score (up to 4 digits after the decimal point).

In [34]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
cross_val_score(lr, X_train, y_train, cv=5).max()

np.float64(0.8173506206254761)
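`cross_val_score` returns one score per fold, and for a regressor the default score is R2. A short sketch on synthetic data (generated with `make_regression`, not the notebook's dataset) showing the per-fold array and the rounded maximum:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem for illustration
X_toy, y_toy = make_regression(n_samples=100, n_features=3,
                               noise=5.0, random_state=0)

scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5)
best = round(float(scores.max()), 4)  # highest fold score, 4 decimals
print(scores)
print(best)
```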

Train Ridge and Lasso with random_state=27 and keep other parameters as default using train dataset. Which one has the least "mean squared error" for the test dataset.

In [35]:
ridge = Ridge(random_state=27)
lasso = Lasso(random_state=27)

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

print(f"Ridge loss: {mean_squared_error(y_test, ridge.predict(X_test))}")
print(f"Lasso loss: {mean_squared_error(y_test, lasso.predict(X_test))}")
print(mean_squared_error(y_test, lasso.predict(X_test)))


Ridge loss: 301994665976.23914
Lasso loss: 305853086563.8274
305853086563.8274
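The comparison above can also be written as a loop over a dictionary of models, which avoids repeating the fit and predict calls. A sketch on synthetic data; `models`, `losses` and `best_name` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data split into train and test by position
X_toy, y_toy = make_regression(n_samples=120, n_features=4,
                               noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = X_toy[:90], X_toy[90:], y_toy[:90], y_toy[90:]

models = {'Ridge': Ridge(random_state=27), 'Lasso': Lasso(random_state=27)}

# Fit each model on train and record its test-set mean squared error
losses = {
    name: mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}
best_name = min(losses, key=losses.get)  # model with the smallest MSE
print(losses)
print(best_name)
```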




Train SGDRegressor(random_state=27, warm_start=True) with at most 100 passes over the train dataset. Write the R2 score for this estimator on the test dataset.

In [36]:
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score


In [38]:
# Create the SGDRegressor
sgd_model = SGDRegressor(random_state=27, warm_start=True, max_iter=100)

# Train on the training dataset
sgd_model.fit(X_train, y_train)
y_pred = sgd_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R² score on test dataset:", r2)


R² score on test dataset: 0.5267937468412169
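`max_iter=100` is only an upper bound on the number of passes; the fitted estimator records how many it actually ran in `n_iter_`. A minimal sketch on synthetic data (not the notebook's dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic regression data for illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=3,
                               noise=1.0, random_state=0)

sgd = SGDRegressor(random_state=27, warm_start=True, max_iter=100)
sgd.fit(X_toy, y_toy)

epochs_used = sgd.n_iter_  # passes run before the stopping criterion was met
print(epochs_used)
```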




create a pipeline of the PolynomialFeatures(interaction_only=True) as transformer and Lasso as an estimator.

Use GridSearchCV for tuning the hyperparameters of the created pipeline on training dataset.

Keep polynomial degree as : [1,2]

lasso alpha value to be taken as : [10,100,1000,10000]

scoring : neg_mean_absolute_error

cv = 5

n_jobs = -1 (negative one) [it helps in using all the computational power to run this job]

(Note: Kindly ignore the warning.)

In [46]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('poly', PolynomialFeatures(interaction_only=True)),
    ('lasso', Lasso())
])
param_grid = {
    'poly__degree': [1,2],
    'lasso__alpha': [10,100,1000,10000]
}

gscv = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)

gscv.fit(X_train, y_train)


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('poly', PolynomialFeatures(interaction_only=True)),
                                       ('lasso', Lasso())]),
             n_jobs=-1,
             param_grid={'lasso__alpha': [10, 100, 1000, 10000],
                         'poly__degree': [1, 2]},
             scoring='neg_mean_absolute_error')

In [None]:
gscv.best_params_

{'lasso__alpha': 1000, 'poly__degree': 1}
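With `scoring='neg_mean_absolute_error'` the search maximises the negated error, so `best_score_` is negative and flipping its sign gives the cross-validated MAE. The sketch below uses synthetic data and a smaller alpha grid than the task, both chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data for illustration
X_toy, y_toy = make_regression(n_samples=80, n_features=3,
                               noise=5.0, random_state=0)

toy_pipe = Pipeline([
    ('poly', PolynomialFeatures(interaction_only=True)),
    ('lasso', Lasso()),
])
# Step name + double underscore + parameter name addresses a pipeline step
toy_grid = {'poly__degree': [1, 2], 'lasso__alpha': [0.1, 1.0]}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=5,
                          scoring='neg_mean_absolute_error')
toy_search.fit(X_toy, y_toy)

best_mae = -toy_search.best_score_  # undo the sign flip to get the MAE
print(toy_search.best_params_, best_mae)
```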

In [56]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5, svd_solver='full', whiten=True, random_state=42)
X_train_pca = pca.fit_transform(X_train)  # transformed training data
X_test_pca  = pca.transform(X_test)       # transformed test data

# Check type of transformed data
print(type(X_test_pca))  # <class 'numpy.ndarray'>

# You can convert back to DataFrame if you want 'head()'
import pandas as pd
X_test_pca_df = pd.DataFrame(X_test_pca)
print(X_test_pca_df.head())


TypeError: float() argument must be a string or a real number, not 'PCA'
