The goal of this project is to design an intelligent model that predicts the development effort of a software project.


# DATA IMPORT

First, we load the data.

In [1]:
import pandas as pd
df=pd.read_csv("zia.csv",sep = ';')
df

Unnamed: 0,storyPoint,velocity,Effort
0,156,2.7,63
1,202,2.5,92
2,173,3.3,56
3,331,3.8,86
4,124,4.2,32
5,339,3.6,91
6,97,3.4,35
7,257,3.0,93
8,84,2.4,36
9,211,3.2,62


In [2]:
Team_Size = [5, 5,5,5,5, 7, 5, 5,5,5,5,5,5,7,5,5,5,5,5,5,5]

df['Team Size'] = Team_Size

# Observe the result
print(df)


    storyPoint  velocity  Effort  Team Size
0          156       2.7      63          5
1          202       2.5      92          5
2          173       3.3      56          5
3          331       3.8      86          5
4          124       4.2      32          5
5          339       3.6      91          7
6           97       3.4      35          5
7          257       3.0      93          5
8           84       2.4      36          5
9          211       3.2      62          5
10         131       3.2      45          5
11         112       2.9      37          5
12         101       2.9      32          5
13          74       2.9      30          7
14          62       2.9      21          5
15         289       2.8     112          5
16         113       2.8      39          5
17         141       2.8      52          5
18         213       2.8      80          5
19         137       2.7      56          5
20          91       2.7      35          5


Our dataset consists of four columns: storyPoint, the project size in story points; velocity, the number of story points the team can develop during a period; Team Size, the size of the development team; and Effort, the target whose values we aim to predict.
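The file zia.csv is not reproduced here. As a convenience, the sketch below rebuilds the same table from the 21 rows printed above, so the following cells can be tried without the CSV. The variable name `rows` is only illustrative and is not part of the original notebook.

```python
import pandas as pd

# Rows copied from the printed DataFrame: storyPoint, velocity, Effort, Team Size
rows = [
    (156, 2.7, 63, 5), (202, 2.5, 92, 5), (173, 3.3, 56, 5),
    (331, 3.8, 86, 5), (124, 4.2, 32, 5), (339, 3.6, 91, 7),
    (97, 3.4, 35, 5), (257, 3.0, 93, 5), (84, 2.4, 36, 5),
    (211, 3.2, 62, 5), (131, 3.2, 45, 5), (112, 2.9, 37, 5),
    (101, 2.9, 32, 5), (74, 2.9, 30, 7), (62, 2.9, 21, 5),
    (289, 2.8, 112, 5), (113, 2.8, 39, 5), (141, 2.8, 52, 5),
    (213, 2.8, 80, 5), (137, 2.7, 56, 5), (91, 2.7, 35, 5),
]
df = pd.DataFrame(rows, columns=["storyPoint", "velocity", "Effort", "Team Size"])

print(df.shape)
print(df.head())
```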


# DATA PREPROCESSING

Before moving to the modeling phase, we need to clean our data. This preprocessing phase is crucial, since the quality of the data largely determines how well any algorithm can perform. First, we check whether there are any null values in our table.




In [3]:
print(df.isnull().sum())

storyPoint    0
velocity      0
Effort        0
Team Size     0
dtype: int64


We can see that there are no null values in our table. Now, we can split our data into a training set and a test set to evaluate the quality of our model. Then, we can directly train the model and make predictions on the test set.


In [4]:
x=df.drop(columns=["Effort"])
y=df["Effort"]

In [5]:
x

Unnamed: 0,storyPoint,velocity,Team Size
0,156,2.7,5
1,202,2.5,5
2,173,3.3,5
3,331,3.8,5
4,124,4.2,5
5,339,3.6,7
6,97,3.4,5
7,257,3.0,5
8,84,2.4,5
9,211,3.2,5


In [6]:
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Hold out 30% of the projects for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

# The three regressors compared below
model = KNeighborsRegressor(n_neighbors=3)
model2 = tree.DecisionTreeRegressor()
model3 = LinearRegression()

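Now, we will evaluate three regressors: KNN, Decision Tree, and Linear Regression. For each one we report the R² score returned by score() on the training and test sets, together with two indicators commonly used in effort estimation: MMRE, the mean magnitude of relative error (lower is better), and Pred(25%), the share of test projects whose relative error is at most 25% (higher is better). Comparing these values lets us determine the best regressor.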
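The two indicators can be written as a small helper. For each test project the relative error is MRE = |actual - predicted| / actual; MMRE is the mean of these values and Pred(25%) is the fraction of them that are at most 0.25. The sketch below is only an illustration: the function name `mmre_and_pred` and the sample values are hypothetical and do not come from the notebook, which computes the same quantities inline in each cell.

```python
import numpy as np

def mmre_and_pred(y_true, y_pred, level=0.25):
    """Return (MMRE, Pred(level)) as fractions between 0 and 1, not percentages."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mre = np.abs(y_true - y_pred) / y_true   # relative error of each project
    mmre = float(mre.mean())                 # mean magnitude of relative error
    pred = float((mre <= level).mean())      # share of projects within the level
    return mmre, pred

# Hypothetical efforts, for illustration only
actual = [100, 50, 80, 40]
estimated = [110, 45, 40, 41]
mmre, pred = mmre_and_pred(actual, estimated)
print("MMRE:", round(mmre, 5))
print("Pred(25%):", pred)
```

Both values are fractions, so an MMRE of 0.18 reads as 18%.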


# MODEL TRAINING

TECHNOLOGY 1: KNeighborsRegressor


In [7]:
# KNeighborsRegressor
import statistics

print("RESULTS GIVEN BY THE KNeighborsRegressor MODEL: ")
model.fit(x_train, y_train)
r2_train = model.score(x_train, y_train)  # score() returns R^2, a fraction and not a percentage
print("R^2 on train:", r2_train)
r2_test = model.score(x_test, y_test)
print("R^2 on test:", r2_test)
y_pred = model.predict(x_test)
# Compute the MMRE and Pred(25%) indicators
ABE = abs(y_test - y_pred)  # absolute error of each test project
MRE = ABE / y_test          # magnitude of relative error of each test project
MMRE = statistics.mean(MRE)
# Pred(25%): share of test projects whose MRE is at most 0.25
Pred = sum(1 for i in MRE if i <= 0.25) / len(MRE)
print("the MMRE indicator of this model is:", MMRE)
print("the PRED(25%) indicator of this model is:", Pred)


RESULTS GIVEN BY THE KNeighborsRegressor MODEL: 
R^2 on train: 0.8464483418697295
R^2 on test: 0.7968316716411592
the MMRE indicator of this model is: 0.18363155240231652
the PRED(25%) indicator of this model is: 0.7142857142857143


TECHNOLOGY 2: DecisionTreeRegressor

In [8]:
# DecisionTreeRegressor
import statistics

print("RESULTS GIVEN BY THE DecisionTreeRegressor MODEL: ")
model2.fit(x_train, y_train)
r2_train = model2.score(x_train, y_train)  # score() returns R^2, a fraction and not a percentage
print("R^2 on train:", r2_train)
r2_test = model2.score(x_test, y_test)
print("R^2 on test:", r2_test)
y_pred = model2.predict(x_test)
# Compute the MMRE and Pred(25%) indicators
ABE = abs(y_test - y_pred)  # absolute error of each test project
MRE = ABE / y_test          # magnitude of relative error of each test project
MMRE = statistics.mean(MRE)
# Pred(25%): share of test projects whose MRE is at most 0.25
Pred = sum(1 for i in MRE if i <= 0.25) / len(MRE)
print("the MMRE indicator of this model is:", MMRE)
print("the PRED(25%) indicator of this model is:", Pred)


RESULTS GIVEN BY THE DecisionTreeRegressor MODEL: 
R^2 on train: 1.0
R^2 on test: 0.7587873880739189
the MMRE indicator of this model is: 0.2408068445028578
the PRED(25%) indicator of this model is: 0.7142857142857143


TECHNOLOGY 3: Linear Regression

In [None]:
# Linear Regression
import statistics

print("RESULTS GIVEN BY THE Linear Regression MODEL: ")
model3.fit(x_train, y_train)
r2_train = model3.score(x_train, y_train)  # score() returns R^2, a fraction and not a percentage
print("R^2 on train:", r2_train)
r2_test = model3.score(x_test, y_test)
print("R^2 on test:", r2_test)
y_pred = model3.predict(x_test)
# Compute the MMRE and Pred(25%) indicators
ABE = abs(y_test - y_pred)  # absolute error of each test project
MRE = ABE / y_test          # magnitude of relative error of each test project
MMRE = statistics.mean(MRE)
# Pred(25%): share of test projects whose MRE is at most 0.25
Pred = sum(1 for i in MRE if i <= 0.25) / len(MRE)
print("the MMRE indicator of this model is:", MMRE)
print("the PRED(25%) indicator of this model is:", Pred)



RESULTS GIVEN BY THE Linear Regression MODEL: 
R^2 on train: 0.9218128883532224
R^2 on test: 0.9374300534761124
the MMRE indicator of this model is: 0.11229893045123487
the PRED(25%) indicator of this model is: 0.8571428571428571


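Next, we try feature selection, one of the usual techniques for improving a model's performance.


The "find_important_data" function (defined again further down under the name "trouver_donnees_importantes") returns the indices of the features to keep. It relies on the VarianceThreshold feature selector, which drops every feature whose training-set variance does not exceed the given threshold. This filter is unsupervised: it looks only at the spread of each feature and never at the Effort target, so it keeps the features that vary most rather than those that contribute most to the prediction.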
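To see the selector at work in isolation, here is a minimal sketch on the first six rows of the table shown earlier, using the same threshold of 0.4 as the cells below. The array `X` and the variable names are illustrative only. The selector compares the population variance (divisor n) of each column with the threshold, whereas pandas `var()` reports the sample variance (divisor n - 1) by default, so the two can differ slightly on small samples.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# First six rows of the dataset: storyPoint, velocity, Team Size
X = np.array([
    [156, 2.7, 5],
    [202, 2.5, 5],
    [173, 3.3, 5],
    [331, 3.8, 5],
    [124, 4.2, 5],
    [339, 3.6, 7],
])

selector = VarianceThreshold(threshold=0.4)
X_selected = selector.fit_transform(X)

# Population variance of each column, as computed by the selector
print(selector.variances_)
# Indices of the columns that were kept
kept = selector.get_support(indices=True)
print(kept)
print(X_selected.shape)
```

On these six rows the velocity column falls below the threshold and is removed, while storyPoint and Team Size are kept. Once fitted on the training data, the same selector can be applied to the test data with `selector.transform(...)`, which avoids indexing the columns by hand.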


In [9]:
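# Feature selection helper based on VarianceThreshold
from sklearn.feature_selection import VarianceThreshold

def find_important_data(x_train, TR):
    """Return the indices of the columns of x_train whose variance exceeds TR."""
    Filter = VarianceThreshold(threshold=TR)
    Filter.fit(x_train)
    FS = list(Filter.get_support())  # boolean mask, one entry per feature
    X = []
    for i in range(len(FS)):
        if FS[i]:
            X += [i]
    return tuple(X)

# Sample variance (pandas default, ddof=1) of each feature in the training set
print(x_train.var(axis=0))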

storyPoint    5771.670330
velocity         0.229890
Team Size        0.527473
dtype: float64


In [10]:
x_test.var(axis=0)

storyPoint    9674.809524
velocity         0.141429
Team Size        0.000000
dtype: float64

In [11]:
x_train.shape

(14, 3)

In [14]:
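from sklearn.feature_selection import VarianceThreshold

# Same helper as find_important_data above, under the name used in the next cell
def trouver_donnees_importantes(x_train, TR):
    """Return the indices of the columns of x_train whose variance exceeds TR."""
    Filter = VarianceThreshold(threshold=TR)
    Filter.fit(x_train)
    FS = list(Filter.get_support())  # boolean mask, one entry per feature
    X = []
    for i in range(len(FS)):
        if FS[i]:
            X += [i]
    return tuple(X)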


In [15]:
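# With a threshold of 0.4, storyPoint and Team Size are kept and velocity is dropped
x2 = trouver_donnees_importantes(x_train, 0.4)
x1_train = x_train.to_numpy()
x3_train = x1_train[:, x2]  # select the retained columns by index
x3_train.shape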

(14, 2)

In [16]:
y_train.shape


(14,)

# Results after using the feature selection technique

TECHNOLOGY 1: KNeighborsRegressor

In [17]:
# KNeighborsRegressor
import statistics

model.fit(x3_train, y_train)
r2_train = model.score(x3_train, y_train)  # score() returns R^2, a fraction and not a percentage
print("R^2 on train:", r2_train)
# Keep the same columns in the test set as in the training set
x1_test = x_test.to_numpy()
x3_test = x1_test[:, x2]
r2_test = model.score(x3_test, y_test)
print("R^2 on test:", r2_test)
y_pred = model.predict(x3_test)
# Compute the MMRE and Pred(25%) indicators
ABE = abs(y_test - y_pred)  # absolute error of each test project
MRE = ABE / y_test          # magnitude of relative error of each test project
MMRE = statistics.mean(MRE)
# Pred(25%): share of test projects whose MRE is at most 0.25
Pred = sum(1 for i in MRE if i <= 0.25) / len(MRE)
print("the MMRE indicator of this model is:", MMRE)
print("the PRED(25%) indicator of this model is:", Pred)

R^2 on train: 0.8464483418697295
R^2 on test: 0.7968316716411592
the MMRE indicator of this model is: 0.18363155240231652
the PRED(25%) indicator of this model is: 0.7142857142857143


TECHNOLOGY 2: DecisionTreeRegressor

In [None]:
# DecisionTreeRegressor
import statistics

model2.fit(x3_train, y_train)
r2_train = model2.score(x3_train, y_train)  # score() returns R^2, a fraction and not a percentage
print("R^2 on train:", r2_train)
# Keep the same columns in the test set as in the training set
x1_test = x_test.to_numpy()
x3_test = x1_test[:, x2]
r2_test = model2.score(x3_test, y_test)
print("R^2 on test:", r2_test)
y_pred = model2.predict(x3_test)
# Compute the MMRE and Pred(25%) indicators
ABE = abs(y_test - y_pred)  # absolute error of each test project
MRE = ABE / y_test          # magnitude of relative error of each test project
MMRE = statistics.mean(MRE)
# Pred(25%): share of test projects whose MRE is at most 0.25
Pred = sum(1 for i in MRE if i <= 0.25) / len(MRE)
print("the MMRE indicator of this model is:", MMRE)
print("the PRED(25%) indicator of this model is:", Pred)


R^2 on train: 1.0
R^2 on test: 0.8446370737283292
the MMRE indicator of this model is: 0.17914334189251133
the PRED(25%) indicator of this model is: 0.8571428571428571


TECHNOLOGY 3: Linear Regression

In [18]:
# Linear Regression
import statistics

model3.fit(x3_train, y_train)
r2_train = model3.score(x3_train, y_train)  # score() returns R^2, a fraction and not a percentage
print("R^2 on train:", r2_train)
# Keep the same columns in the test set as in the training set
x1_test = x_test.to_numpy()
x3_test = x1_test[:, x2]
r2_test = model3.score(x3_test, y_test)
print("R^2 on test:", r2_test)
y_pred = model3.predict(x3_test)
# Compute the MMRE and Pred(25%) indicators
ABE = abs(y_test - y_pred)  # absolute error of each test project
MRE = ABE / y_test          # magnitude of relative error of each test project
MMRE = statistics.mean(MRE)
# Pred(25%): share of test projects whose MRE is at most 0.25
Pred = sum(1 for i in MRE if i <= 0.25) / len(MRE)
print("the MMRE indicator of this model is:", MMRE)
print("the PRED(25%) indicator of this model is:", Pred)


R^2 on train: 0.832078959502604
R^2 on test: 0.854037115102271
the MMRE indicator of this model is: 0.14295503251606848
the PRED(25%) indicator of this model is: 0.8571428571428571


# Conclusion


The evaluation of three regression techniques for predicting software development effort yields some useful insights. After feature selection, linear regression and the decision tree perform comparably on the test set (R² of 0.85 and 0.84, with the same Pred(25%) of 0.86). Before feature selection, linear regression reaches a clearly higher test R² (0.94) than the decision tree (0.76) and KNN (0.80). Linear regression also has the lowest Mean Magnitude of Relative Error (MMRE) in both settings: 11% before feature selection and 14% after, against 24% and 18% for the decision tree and 18% in both cases for KNN. Feature selection itself did not help linear regression: dropping the velocity column lowered its test R² from 0.94 to 0.85, while it improved the decision tree and left KNN unchanged. Since linear regression outperforms the other techniques both before and after feature selection, we conclude that it is the best regressor for this software development effort prediction task.
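To make the comparison easier to check, the test-set figures printed in the cells above can be gathered in one table and ranked. This is only a convenience sketch: the column names and the `results` and `ranked` variables are illustrative, and the values are the recorded outputs rounded to four decimals.

```python
import pandas as pd

# Test-set figures copied from the cell outputs above, rounded to four decimals
results = pd.DataFrame(
    [
        ("KNN", "all", 0.7968, 0.1836, 0.7143),
        ("Decision Tree", "all", 0.7588, 0.2408, 0.7143),
        ("Linear Regression", "all", 0.9374, 0.1123, 0.8571),
        ("KNN", "selected", 0.7968, 0.1836, 0.7143),
        ("Decision Tree", "selected", 0.8446, 0.1791, 0.8571),
        ("Linear Regression", "selected", 0.8540, 0.1430, 0.8571),
    ],
    columns=["model", "features", "r2_test", "mmre", "pred25"],
)

# Rank from the lowest to the highest relative error
ranked = results.sort_values("mmre").reset_index(drop=True)
print(ranked)
```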