## STEPDISC CANDISC - oliveoil dataset

In [1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")

### oliveoil dataset

In [2]:
#oliveoil dataset
from discrimintools.datasets import load_oliveoil
D = load_oliveoil("train")
print(D.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CLASSE       569 non-null    object
 1   palmitic     569 non-null    int64 
 2   palmitoleic  569 non-null    int64 
 3   stearic      569 non-null    int64 
 4   oleic        569 non-null    int64 
 5   linoleic     569 non-null    int64 
 6   linolenic    569 non-null    int64 
 7   arachidic    569 non-null    int64 
 8   eicosenoic   569 non-null    int64 
dtypes: int64(8), object(1)
memory usage: 40.1+ KB
None


### Forward selection

In [3]:
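from discrimintools import CANDISC, STEPDISC
#split into y (target) and X (features)
y, X = D["CLASSE"], D.drop(columns=["CLASSE"])
#fit the canonical discriminant analysis on all the variables
clf = CANDISC(n_components=2).fit(X,y)
#forward selection applied to the fitted CANDISC model
clf2 = STEPDISC(method="forward",alpha=0.01,verbose=True)
clf2.fit(clf)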


             Wilks' Lambda  Partial R-Square      F Value  Num DF  Den DF  \
palmitic          0.538509          0.461491   242.524854       2     566   
palmitoleic       0.604905          0.395095   184.841942       2     566   
stearic           0.998272          0.001728     0.489942       2     566   
oleic             0.473479          0.526521   314.703134       2     566   
linoleic          0.550371          0.449629   231.198312       2     566   
linolenic         0.687722          0.312278   128.503464       2     566   
arachidic         0.662890          0.337110   143.918675       2     566   
eicosenoic        0.202071          0.797929  1117.498522       2     566   

                      Pr>F  
palmitic      8.465810e-77  
palmitoleic   1.650063e-62  
stearic       6.129213e-01  
oleic         1.288711e-92  
linoleic      4.032628e-74  
linolenic     9.724383e-47  
arachidic     2.936859e-51  
eicosenoic   2.867939e-197  

Variable eicosenoic will enter
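
The columns of the table above are linked by simple identities, which the short sketch below checks for `eicosenoic`. It assumes the usual stepwise discriminant conventions (Partial R-Square = 1 - Wilks' Lambda, Num DF = classes - 1, Den DF = samples - classes - variables already in the model); that discrimintools computes the statistics exactly this way is an assumption, and the helper `partial_f` is a hypothetical name, not part of the library.

```python
# Minimal sketch: rebuild Partial R-Square and F Value from Wilks' Lambda.
# `partial_f` is a hypothetical helper, not a discrimintools function.

def partial_f(wilks_lambda, num_df, den_df):
    """Return (partial R-square, F value) for a single candidate variable."""
    partial_r2 = 1.0 - wilks_lambda
    f_value = (partial_r2 / wilks_lambda) * (den_df / num_df)
    return partial_r2, f_value

n_samples, n_classes = 569, 3   # from the training data (569 rows, 3 classes)
n_already_selected = 0          # first forward step: no variable in the model yet

num_df = n_classes - 1                                  # 2
den_df = n_samples - n_classes - n_already_selected     # 566

# Wilks' Lambda of eicosenoic, as printed (rounded) in the table above
r2, f = partial_f(0.202071, num_df, den_df)
print(f"Partial R-Square = {r2:.6f}, F Value = {f:.2f}")
```

Because the input lambda is rounded to six decimals, the recomputed F value agrees with the printed one only to about two decimal places.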


          

STEPDISC parameters:
method       'forward'
alpha        0.01
lambda_init
verbose      True


#### Selected variables

In [4]:
#selected variables
print(clf2.summary_.selected)

['eicosenoic', 'linoleic', 'palmitoleic', 'arachidic', 'linolenic', 'palmitic', 'oleic']


#### Summary

In [5]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2)

                     Stepwise Discriminant Analysis - Results                     


                     Canonical Discriminant Analysis - Results                     

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      8   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
palmitic        1094.8333  1112.0619  1332.3696
palmitoleic       83.8933    96.3505   154.8882
stearic          231.0400   226.3505   228.7081
oleic           7791.9733  7266.9072  7099.5311
linoleic         727.8800  1197.3608  1034.0093
linolenic       

#### Evaluation of prediction on testing dataset

In [6]:
#testing data
DTest = load_oliveoil("test")
#split into X and y
yTest, XTest = DTest["CLASSE"], DTest.drop(columns=["CLASSE"])
#evaluation of prediction on testing dataset
eval_test = clf2.eval_predict(XTest,yTest,verbose=True)

Observation Profile:
                        Read  Used
Number of Observations     3     3

Number of Observations Classified into CLASSE:
prediction    Centre_North  Sardinia  South  Total
CLASSE                                            
Centre_North             1         0      0      1
Sardinia                 0         1      0      1
South                    0         0      1      1
Total                    1         1      1      3

Percent Classified into CLASSE:
prediction    Centre_North    Sardinia       South  Total
CLASSE                                                   
Centre_North    100.000000    0.000000    0.000000  100.0
Sardinia          0.000000  100.000000    0.000000  100.0
South             0.000000    0.000000  100.000000  100.0
Total            33.333333   33.333333   33.333333  100.0
Priors            0.263620    0.170475    0.565905    NaN

Error Count Estimates for CLASSE:
        Centre_North  Sardinia     South  Total
Rate         0.00000  0.000000  0
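
The error rates follow directly from the classification counts shown above. The sketch below recomputes them in plain Python: the per-class rate is the share of misclassified observations in each row, and the priors are the training class proportions. Weighting the total by the priors follows the SAS DISCRIM convention; that `eval_predict` uses the same weighting is an assumption, and the names used here (`counts`, `class_error_rates`) are illustrative only.

```python
# Minimal sketch: error rates from a table of classification counts.
# Rows are the actual classes, columns the predicted classes.
counts = {
    "Centre_North": {"Centre_North": 1, "Sardinia": 0, "South": 0},
    "Sardinia":     {"Centre_North": 0, "Sardinia": 1, "South": 0},
    "South":        {"Centre_North": 0, "Sardinia": 0, "South": 1},
}

# Priors taken as the class proportions of the training data (150, 97, 322 of 569)
train_freq = {"Centre_North": 150, "Sardinia": 97, "South": 322}
n_train = sum(train_freq.values())
priors = {label: freq / n_train for label, freq in train_freq.items()}

def class_error_rates(table):
    """Share of misclassified observations for each actual class."""
    rates = {}
    for actual, row in table.items():
        row_total = sum(row.values())
        rates[actual] = 1.0 - row[actual] / row_total
    return rates

rates = class_error_rates(counts)
# Assumed convention: total error rate = sum of prior * class error rate
total_rate = sum(priors[label] * rates[label] for label in rates)

for label in rates:
    print(f"{label:<13} rate={rates[label]:.4f} prior={priors[label]:.6f}")
print(f"Total rate = {total_rate:.4f}")
```

With only three test observations, one per class, these rates say little about how the model generalizes.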

### Backward selection

In [7]:
#backward selection
clf2 = STEPDISC(method="backward",alpha=0.01,verbose=True)
clf2.fit(clf)


         Wilks' Lambda  Partial R-Square      F Value  Num DF  Den DF  Pr>F
stearic            1.0           0.96804  8465.862059       2     559   0.0

No variable can be removed


Since only one feature is selected, CANDISC procedure cannot be updated.


STEPDISC parameters:
method       'backward'
alpha        0.01
lambda_init
verbose      True


#### Selected variables

In [8]:
#selected variables
print(clf2.summary_.selected)

['stearic']


#### Summary

In [9]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2)

                     Stepwise Discriminant Analysis - Results                     


                     Canonical Discriminant Analysis - Results                     

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      8   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
palmitic        1094.8333  1112.0619  1332.3696
palmitoleic       83.8933    96.3505   154.8882
stearic          231.0400   226.3505   228.7081
oleic           7791.9733  7266.9072  7099.5311
linoleic         727.8800  1197.3608  1034.0093
linolenic       