# Objective: Feature Subset Selection to Improve Software Cost Estimation

## Dataset
This dataset comes from the PROMISE Software Engineering Repository, which makes data publicly available to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. The main objective is to estimate software cost using feature subset selection techniques.

## Attributes
1.	RELY {Nominal,Very_High,High,Low} 
2.	DATA {High,Low,Nominal,Very_High} 
3.	CPLX {Very_High,High,Nominal,Extra_High,Low} 
4.	TIME {Nominal,Very_High,High,Extra_High} 
5.	STOR {Nominal,Very_High,High,Extra_High} 
6.	VIRT {Low,Nominal,High}
7.	TURN {Nominal,High,Low}
8.	ACAP {High,Very_High,Nominal} 
9.	AEXP {Nominal,Very_High,High} 
10.	PCAP {Very_High,High,Nominal}
11.	VEXP {Low,Nominal,High}
12.	LEXP {Nominal,High,Very_Low,Low} 
13.	MODP {High,Nominal,Very_High,Low}
14.	TOOL {Nominal,High,Very_High,Very_Low,Low} 
15.	SCED {Low,Nominal,High}
16.	LOC numeric 

## Target Class
17.	ACT_EFFORT numeric

### Source: http://promise.site.uottawa.ca/SERepository/datasets/cocomonasa_v1.arff

Tasks:
1.	Obtain the software cost estimation dataset
2.	Apply pre-processing techniques (if needed)
3.	Apply feature subset selection techniques such as correlation analysis, forward selection, backward elimination, and recursive feature elimination. Find the best possible subset of features from each method.
4.	Divide the dataset into training and testing sets.
5.	Implement support vector regression (SVR), linear regression, and a decision tree.
6.	Ensemble SVR, linear regression, and the decision tree.
7.	Evaluate the coefficient of determination and the root mean square error for all models, including the ensemble.
8.	Conclude the results
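
The wrapper-style methods named in task 3 (forward selection, backward elimination, recursive feature elimination) can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the notebook's own method: the 60 x 16 array from `make_regression`, the subset size of 5, and the use of `LinearRegression` as the scoring model are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for 16 predictors (assumption: this is NOT the COCOMO data).
X, y = make_regression(n_samples=60, n_features=16, n_informative=5,
                       noise=10.0, random_state=0)

estimator = LinearRegression()

# Forward selection: start empty, greedily add the feature that most
# improves the cross-validated score.
forward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)

# Backward elimination: start with all features, greedily remove the
# feature whose removal hurts the cross-validated score least.
backward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

# Recursive feature elimination: refit the model and drop the feature
# with the smallest coefficient magnitude until 5 remain.
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)

# Collect the selected column indices from each method.
subsets = {
    "forward": np.flatnonzero(forward.get_support()),
    "backward": np.flatnonzero(backward.get_support()),
    "rfe": np.flatnonzero(rfe.get_support()),
}
for name, idx in subsets.items():
    print(name, idx.tolist())
```

The three methods need not agree on the same subset, which is why the task asks for the best subset from each method separately.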

Helpful links:

https://scikit-learn.org/stable/modules/ensemble.html
https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/
https://medium.com/pursuitnotes/support-vector-regression-in-6-steps-with-python-c4569acd062d
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html


## Task 1: Implementation of regression models 

In [1]:
# Load the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.preprocessing import LabelEncoder,MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from scipy.io import arff
from math import sqrt

In [2]:
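# Load the dataset (loadarff returns a (records, metadata) tuple)
d=arff.loadarff("cocomonasa_v1.arff")
data=pd.DataFrame(d[0])
# Nominal ARFF attributes are read as bytes; decode them to str
for col in data.columns:
    if data[col].dtype==object:
        data[col]=data[col].str.decode('utf-8')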
data

index,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,Nominal,High,Very_High,Nominal,Nominal,Low,Nominal,High,Nominal,Very_High,Low,Nominal,High,Nominal,Low,70.0,278.0
1,Very_High,High,High,Very_High,Very_High,Nominal,Nominal,Very_High,Very_High,Very_High,Nominal,High,High,High,Low,227.0,1181.0
2,Nominal,High,High,Very_High,High,Low,High,High,Nominal,High,Low,High,High,Nominal,Low,177.9,1248.0
3,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,115.8,480.0
4,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,29.5,120.0
5,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,19.7,60.0
6,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,66.6,300.0
7,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,5.5,18.0
8,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,10.4,50.0
9,High,Low,High,Nominal,Nominal,Low,Low,Nominal,Nominal,Nominal,Nominal,High,High,Nominal,Low,14.0,60.0


In [3]:
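# Preprocessing
# 1. Encode the categorical variables as integers
# 2. Scale the features to [0, 1]
# Missing values are not handled in this cell.
le=LabelEncoder()
minmax=MinMaxScaler()
# LabelEncoder assigns codes in alphabetical order (e.g. High=0, Low=1, Nominal=2),
# so the codes do not follow the Low < Nominal < High rating order.
for col in data.columns:
    if data[col].dtype==object:
        data[col]=le.fit_transform(data[col])
# Scale every column except the target (last column), then reattach the target
features=data.columns[:-1]
target=data["ACT_EFFORT"]
data=pd.DataFrame(minmax.fit_transform(data[features]),columns=features)
data["ACT_EFFORT"]=target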
data

index,RELY,DATA,CPLX,TIME,STOR,VIRT,TURN,ACAP,AEXP,PCAP,VEXP,LEXP,MODP,TOOL,SCED,LOC,ACT_EFFORT
0,0.666667,0.0,1.0,0.666667,0.666667,0.5,1.0,0.0,0.5,1.0,0.5,0.666667,0.0,0.5,0.5,0.161122,278.0
1,1.0,0.0,0.25,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.5,0.534221,1181.0
2,0.666667,0.0,0.25,1.0,0.333333,0.5,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.5,0.5,0.417538,1248.0
3,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.269962,480.0
4,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.064876,120.0
5,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.041587,60.0
6,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.153042,300.0
7,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.007842,18.0
8,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.019487,50.0
9,0.0,0.333333,0.25,0.666667,0.666667,0.5,0.5,0.5,0.5,0.5,1.0,0.0,0.0,0.5,0.5,0.028042,60.0
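
In the table above, a `Nominal` RELY rating maps to 0.666667 and `High` maps to 0.0, because `LabelEncoder` sorts labels alphabetically. Since the ratings are ordered (`Very_Low` up to `Extra_High`), an encoding that preserves that order may suit regression models better. The sketch below shows one way to do this with `OrdinalEncoder`. The two-column frame `toy` is a hypothetical stand-in for the notebook's nominal columns, and applying one shared level list to every column is an assumption of this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Rating scale from lowest to highest.
levels = ["Very_Low", "Low", "Nominal", "High", "Very_High", "Extra_High"]

# Tiny hypothetical frame; the notebook would pass its own nominal columns.
toy = pd.DataFrame({
    "RELY": ["Nominal", "Very_High", "High", "Low"],
    "CPLX": ["Very_High", "High", "Extra_High", "Low"],
})

# One category list per column, so the same rating gets the same code everywhere.
encoder = OrdinalEncoder(categories=[levels] * toy.shape[1])
encoded = pd.DataFrame(encoder.fit_transform(toy), columns=toy.columns)
print(encoded)
```

With this ordering, `Low` is 1, `Nominal` is 2, `High` is 3, and so on, so a larger code always means a higher rating.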


In [4]:
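# Apply feature subset selection techniques
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
x=data.iloc[:,:-1]
y=data.iloc[:,-1]
# Keep the 10 highest-scoring features.
# Note: f_classif is an ANOVA F-test that treats each distinct target value as a class;
# for a continuous target such as ACT_EFFORT, f_regression is the scorer built for that case.
x_new = SelectKBest(f_classif,k=10).fit_transform(x,y)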
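
The cell above scores features with `f_classif`, which is designed for class labels. For a numeric target, scikit-learn provides `f_regression`, a univariate test based on the linear correlation between each feature and the target. The sketch below shows how it could be used and how to recover the names of the selected columns. It runs on synthetic data with hypothetical column names `f0` to `f15`, so the selected subset says nothing about the COCOMO attributes.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in with hypothetical column names (not the real data).
X_arr, y_demo = make_regression(n_samples=60, n_features=16, n_informative=4,
                                noise=5.0, random_state=1)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(16)])

# f_regression scores each feature by its univariate linear relation to y.
selector = SelectKBest(score_func=f_regression, k=10).fit(X_demo, y_demo)

# get_support() returns a boolean mask, so the chosen column names stay visible.
selected = X_demo.columns[selector.get_support()].tolist()
X_selected = selector.transform(X_demo)
print(selected)
print(X_selected.shape)
```

Keeping the fitted `selector` object, rather than calling `fit_transform` directly, makes it possible to report which features were kept.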

In [5]:
# Divide the dataset into training and testing sets (default split: 75% train, 25% test)
x_train,x_test,y_train,y_test=train_test_split(x_new,np.array(y),random_state=42)
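
In the cells above, the scaler and the feature selector are fitted on all rows before the split, so information from the test rows reaches the preprocessing steps. One common alternative is to split first and wrap the steps in a `Pipeline`, which fits each step on the training rows only. The following is a minimal sketch under stated assumptions: the data is synthetic, and `f_regression` with `k=10` is chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 16 predictors (assumption: not the real data).
X_demo, y_demo = make_regression(n_samples=60, n_features=16, n_informative=4,
                                 noise=5.0, random_state=2)

# Split first, so the test rows never influence scaling or selection.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                    # fitted on training rows only
    ("select", SelectKBest(f_regression, k=10)),  # fitted on training rows only
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler and selector to the test rows.
test_r2 = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, test_r2)
```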

In [6]:
# Build the regression models (DecisionTreeRegressor has no random_state set, so its scores can vary between runs)
dtr=DecisionTreeRegressor()
dtr.fit(x_train,y_train)
lr=LinearRegression()
lr.fit(x_train,y_train)
svr=SVR(kernel="linear")
svr.fit(x_train,y_train)

SVR(kernel='linear')
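
`SVR` is fitted here with its default `C=1.0` and `epsilon=0.1`, while ACT_EFFORT in the rows shown earlier ranges from 18 to 1248. With a target on that scale, the default regularization can keep a linear SVR from reaching the needed coefficient sizes. One option is to standardize the target during fitting with `TransformedTargetRegressor`. The sketch below compares both variants on synthetic data whose target is deliberately stretched. Whether this helps on the COCOMO data is an assumption that would need to be checked.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in; the target is stretched so it spans a wide numeric range.
X_demo, y_demo = make_regression(n_samples=60, n_features=10, noise=5.0,
                                 random_state=3)
y_demo = 10.0 * y_demo
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Plain linear SVR on the raw target, as in the cell above.
plain = SVR(kernel="linear").fit(X_tr, y_tr)

# Same SVR, but the target is standardized for fitting and the
# predictions are transformed back to the original units.
scaled = TransformedTargetRegressor(
    regressor=SVR(kernel="linear"),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

r2_plain = r2_score(y_te, plain.predict(X_te))
r2_scaled = r2_score(y_te, scaled.predict(X_te))
print("plain SVR R^2:", r2_plain)
print("target-scaled SVR R^2:", r2_scaled)
```

Tuning `C` and `epsilon` directly, for example with a grid search, is another route to the same goal.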

In [7]:
# Predict on the test set with the fitted models
dtr_pred=dtr.predict(x_test)
lr_pred=lr.predict(x_test)
svr_pred=svr.predict(x_test)

In [8]:
# Evaluate the coefficient of determination (R^2) and root mean square error (RMSE) on the test set
print("Decision Tree",r2_score(y_test,dtr_pred),sqrt(mean_squared_error(y_test,dtr_pred)))
print("Linear Regression",r2_score(y_test,lr_pred),sqrt(mean_squared_error(y_test,lr_pred)))
print("SVR",r2_score(y_test,svr_pred),sqrt(mean_squared_error(y_test,svr_pred)))

Decision Tree 0.6227524529746309 550.3178838937849
Linear Regression 0.8578181559726794 337.84888149489274
SVR -0.30065195900250274 1021.8354465796518
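
In each printed line, the first number is R² and the second is RMSE. The sketch below computes both by hand to show what they measure, and checks them against scikit-learn. The `y_true` values are the first five ACT_EFFORT entries from the table above. The `y_pred` values are hypothetical predictions made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([278.0, 1181.0, 1248.0, 480.0, 120.0])  # from the table above
y_pred = np.array([300.0, 1000.0, 1300.0, 500.0, 100.0])  # hypothetical predictions

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared residual, in the units of the target
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Always predicting the mean gives R^2 = 0; anything worse goes negative.
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_mean_baseline)
```

A negative R², as reported for SVR above, therefore means the model predicts the test set worse than a constant equal to the test-set mean would.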


## Task 2: Ensemble regression models


In [9]:
# Ensemble the regression models: boost each base model separately with AdaBoost.
# The base model is passed positionally because the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2 (base_estimator was removed in 1.4).
ada_dtr=AdaBoostRegressor(dtr)
ada_dtr.fit(x_train,y_train)
ada_dtr_pred=ada_dtr.predict(x_test)
ada_lr=AdaBoostRegressor(lr)
ada_lr.fit(x_train,y_train)
ada_lr_pred=ada_lr.predict(x_test)
ada_svr=AdaBoostRegressor(svr)
ada_svr.fit(x_train,y_train)
ada_svr_pred=ada_svr.predict(x_test)

In [10]:
# Evaluate Coefficient of determination and Root mean square error 
print("Decision Tree",r2_score(y_test,ada_dtr_pred),sqrt(mean_squared_error(y_test,ada_dtr_pred)))
print("Linear Regression",r2_score(y_test,ada_lr_pred),sqrt(mean_squared_error(y_test,ada_lr_pred)))
print("SVR",r2_score(y_test,ada_svr_pred),sqrt(mean_squared_error(y_test,ada_svr_pred)))

Decision Tree 0.5140361217539982 624.6006169278627
Linear Regression 0.8970042663784344 287.54778853207915
SVR -0.2501416252564541 1001.7976812490925
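
Cell 9 boosts each model on its own, which yields three separate ensembles. Task 6 can also be read as combining SVR, linear regression, and the decision tree into a single model. One way to do that is scikit-learn's `VotingRegressor`, which averages the predictions of its members. The sketch below is an illustration on synthetic data, not a result for the COCOMO dataset. The equal weighting of the three models is an assumption.

```python
from math import sqrt

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the selected features (assumption: not the real data).
X_demo, y_demo = make_regression(n_samples=60, n_features=10, noise=5.0,
                                 random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# One ensemble that contains all three model types, equally weighted.
voter = VotingRegressor([
    ("dtr", DecisionTreeRegressor(random_state=0)),
    ("lr", LinearRegression()),
    ("svr", SVR(kernel="linear")),
])
voter.fit(X_tr, y_tr)
vote_pred = voter.predict(X_te)

# The ensemble prediction is the plain average of the members' predictions.
manual_avg = np.mean([m.predict(X_te) for m in voter.estimators_], axis=0)

print("Voting ensemble", r2_score(y_te, vote_pred),
      sqrt(mean_squared_error(y_te, vote_pred)))
```

Because the average gives every member equal weight, one poorly fitted member can pull the ensemble down. The `weights` parameter of `VotingRegressor` allows down-weighting such a member.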
