diff --git a/DESCRIPTION b/DESCRIPTION
new file mode 100644
index 0000000..0ee9308
--- /dev/null
+++ b/DESCRIPTION
@@ -0,0 +1,36 @@
+Package: practicalSigni
+Type: Package
+Title: Practical Significance Ranking of Regressors
+Version: 0.1.0
+Date: 2023-02-16
+Authors@R: c(person("Hrishikesh", "Vinod", role = c("aut", "cre"),
+        email= "vinod@fordham.edu"))
+Encoding: UTF-8
+Depends: R (>= 4.2.0), np (>= 0.60), xtable (>= 1.8), generalCorr (>=
+        1.2), NNS (>= 0.9), randomForest (>= 4.7)
+Suggests: R.rsp
+VignetteBuilder: R.rsp
+Description: Consider a possibly nonlinear nonparametric regression
+        with p regressors. We provide 13 methods, including machine
+        learning tools, to rank the regressors by their practical
+        significance or importance. Comprehensive
+        methods are as follows.
+        m6= Generalized partial correlation coefficient or
+        GPCC by Vinod (2021) and Vinod (2022).
+        m7= a generalization of psychologists' effect size incorporating
+        nonlinearity and many variables.
+        m8= local linear partial (dy/dxi) using the 'np' package for kernel
+        regressions.
+        m9= partial (dy/dxi) using the 'NNS' package.
+        m10= importance measure using the 'NNS' boost function.
+        m11= Shapley Value measure of importance (cooperative game theory).
+        m12 and m13= two versions of the random forest algorithm.
+License: GPL (>= 2)
+RoxygenNote: 7.2.3
+NeedsCompilation: no
+Packaged: 2023-02-16 20:40:15 UTC; vinod
+Author: Hrishikesh Vinod [aut, cre]
+Maintainer: Hrishikesh Vinod <vinod@fordham.edu>
+Repository: CRAN
+Date/Publication: 2023-02-17 10:00:09 UTC
diff --git a/MD5 b/MD5
new file mode 100644
index 0000000..28c0781
--- /dev/null
+++ b/MD5
@@ -0,0 +1,17 @@
+92c1f695ab61b5e80990cdd319bfc34d *DESCRIPTION
+a32fe68dcc5cbc8ce3fca00bd5233ceb *NAMESPACE
+af0760a4e6931ee6d4eba7dfada61307 *R/effSizCut.R
+88191dcd47435a1b93d49d308d7380b4 *R/fncut.R
+bb793c2f927ef0f65325c1f5bb34cff6 *R/pracSig13.R
+4d3098a8fc36f18af32bbfa8a01e0651 *R/reportRank.R
+9d749b214937d8705a43ebf337b5b90d *R/shapleyvalue.R
+cb4924f3573383fca290a0a8e2ca5294 *build/partial.rdb
+413069f2e65b506d9ce5f353d3dde1b2 *build/vignette.rds
+9444c4cba73807f33dfd5c504eae8bd5 *inst/doc/practicalSigni-vignette.pdf
+d245490b43d80a01ad113320d14427c7 *inst/doc/practicalSigni-vignette.pdf.asis
+3c34399a2a413e3326a47930a7194bf7 *man/effSizCut.Rd
+800d600966d85d4a64388adf63dae6a4 *man/fncut.Rd
+7a69afc9d5d2cd34d022dff7ba99ab1a *man/pracSig13.Rd
+52f100cc0eb342987faa543ec8e403f0 *man/reportRank.Rd
+cd2e8d19432223a68fb2cb9eddcf6949 *man/shapleyvalue.Rd
+d245490b43d80a01ad113320d14427c7 *vignettes/practicalSigni-vignette.pdf.asis
diff --git a/NAMESPACE b/NAMESPACE
new file mode 100644
index 0000000..e105aad
--- /dev/null
+++ b/NAMESPACE
@@ -0,0 +1,25 @@
+# Generated by roxygen2: do not edit by hand
+
+export(effSizCut)
+export(fncut)
+export(pracSig13)
+export(reportRank)
+export(shapleyvalue)
+importFrom(NNS,NNS.boost)
+importFrom(NNS,dy.d_)
+importFrom(generalCorr,depMeas)
+importFrom(generalCorr,kern)
+importFrom(generalCorr,kern2)
+importFrom(generalCorr,parcorVec)
+importFrom(np,npreg)
+importFrom(np,npregbw)
+importFrom(randomForest,importance)
+importFrom(randomForest,randomForest)
+importFrom(stats,coef)
+importFrom(stats,cor)
+importFrom(stats,lm)
+importFrom(stats,median)
+importFrom(stats,residuals)
+importFrom(stats,var)
+importFrom(utils,combn)
+importFrom(xtable,xtable)
diff --git a/R/effSizCut.R b/R/effSizCut.R
new file mode 100644
index 0000000..7f3021f
--- /dev/null
+++ b/R/effSizCut.R
@@ -0,0 +1,84 @@
+#' Compute Effect Sizes for continuous or categorical data
+#'
+#' Psychologists' so-called "effect size" reveals
+#' the practical significance of only one
+#' regressor. This function generalizes their algorithm
+#' to two or more regressors (p >= 2). The generalization first
+#' converts the xi regressor into a categorical treatment variable
+#' with only two categories. One imagines that observations
+#' larger than the
+#' median (xit > median(xi)) are "treated," and those
+#' below the median are "untreated."
+#' The aim is to measure the size of the
+#' (treatment) effect of (xi) on y. Denote other variables
+#' with postscript "o" as (xo). Since we have p regressors in
+#' our multiple regression, we need to remove the nonlinear
+#' kernel regression effect of
+#' other variables (xo) on y while focusing on the effect of xi.
+#' There are two options for treating (xo): (i) leaving xo
+#' as it is in the data, or (ii) converting xo to binary
+#' at the median. One chooses the first option (i) by setting the
+#' logical argument ane=TRUE in calling the function.
+#' ane=TRUE is the default. Set ane=FALSE for the second option.
+#'
+#' @param y { (T x 1) vector of dependent variable data values}
+#' @param bigx { (T x p) data matrix of xi regressor variables associated
+#' with the regression}
+#' @param ane {logical variable controlling the treatment of other regressors.
+#' If ane=TRUE (default), other regressors are used in kernel regression
+#' without forcing them to be binary variables. When ane=FALSE,
+#' the kernel regression removes the effect of other regressors
+#' when other regressors are also binary type categorical variables.}
+#' @return out vector with p values of t-statistics for p regressors
+#' @note The aim is to answer the following question.
+#' Which regressor has the largest
+#' effect on the dependent variable? We assume that the signs
+#' of regressors are already adjusted such that a numerically
+#' larger effect size suggests that the corresponding regressor
+#' is most important, having the largest effect size in explaining
+#' y the dependent variable.
+#' @author Prof. H. D. Vinod, Economics Dept., Fordham University, NY
+#' @seealso \code{\link{pracSig13}}
+#' @importFrom generalCorr kern
+#' @examples
+#' set.seed(9)
+#' y=sample(1:15,replace = TRUE)
+#' x1=sample(2:16, replace = TRUE)
+#' x2=sample(3:17, replace = TRUE)
+#' effSizCut(y,bigx=cbind(x1,x2),ane=TRUE)
+#'
+#' @export
+
+
+effSizCut=function(y,bigx, ane=TRUE){ #get t-stats
+  p=NCOL(bigx)
+  out=rep(NA,p)
+  bigx2=apply(bigx,2,fncut) #each column becomes 0/1 split at its median
+  logi=apply(bigx2,2,as.logical)
+  for ( i in 1:p){
+  kxi=generalCorr::kern(dep.y=y, reg.x=bigx2[,i],residuals=TRUE) #r=resid
+  rxi=residuals(kxi) #resid= (y - yhat) so yhat=(y-resid)
+  xihat=y-rxi
+  if(!ane){kother=generalCorr::kern(dep.y=y, reg.x=bigx[,-i],residuals=TRUE)
+  rother=residuals(kother)
+  xotherhat=y-rother }
+  if(ane) xotherhat=rep(mean(y,na.rm=TRUE),length(y))
+  mylogi=logi[,i]
+  myxi=xihat[mylogi]
+  xibar=mean(myxi,na.rm=TRUE)
+  xivar=var(myxi,na.rm=TRUE)
+  myxo=xotherhat[mylogi]
+  xobar=mean(myxo,na.rm=TRUE) #o=other
+  xovar=var(myxo,na.rm=TRUE)
+  tim1=length(myxi)-1 #Ti-1, where ti=Ti, m1=minus 1
+# print(c("i,Ti-1= ",i,tim1))
+  denom=0 #initialize; reset to 1 when a variance is near zero
+  if(xivar<1e-8) denom=1 #var=variance
+  if(xovar<1e-8) denom=1
+# if(denom==1) print(c("t-stat=NA, zero var. col.No.",i))
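+# Welch-type standard error of the difference between the two kernel-fitted
+# means; tim1 = (Ti - 1) divides each variance below.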
col.No.",i)) + if(denom !=1) denom=sqrt(xivar/tim1+xovar/tim1) + #print(xivar,xovar) + + out[i]=(xibar-xobar)/denom }#end for loop + return(out) +} + diff --git a/R/fncut.R b/R/fncut.R new file mode 100644 index 0000000..cbed837 --- /dev/null +++ b/R/fncut.R @@ -0,0 +1,23 @@ +#' fncut auxiliary converts continuous data into two categories +#' +#' +#' This is an internal function of the R package practicalSigni +#' Psychologists use effect size to evaluate the practical +#' importance of a treatment on a dependent variable using +#' a binary [0,1] variable. Assuming numerical data, we +#' can always compute the median and regard values < or = the +#' median as zero and other values as unity. +#' +#' @param x {numerical vector of data values} +#' @return x {vector of zeros and ones split at the median} +#' @importFrom stats median +#' @author Prof. H. D. Vinod, Fordham University, NY +#' +#' +#' @export + +fncut=function(x){ + mix=median(x) + x[x<=mix]=0 + x[x>mix]=1 + return(x)} diff --git a/R/pracSig13.R b/R/pracSig13.R new file mode 100644 index 0000000..7f6195d --- /dev/null +++ b/R/pracSig13.R @@ -0,0 +1,240 @@ +#' Compute thirteen measures of practical significance +#' +#' Thirteen methods are denoted m1 to m13. Each +#' yields p numbers when there are p regressors denoted xi. +#' m1=OLS coefficient slopes. m2= t-stat of each slope. +#' m3= beta coefficients OLS after all variables have +#' mean zero and sd=1. +#' m4= Pearson correlation coefficient between y and xi (only two +#' variables at a time, assuming linearity). +#' Let r*(y|xi) denote the generalized correlation coefficient allowing +#' for nonlinearity from Vinod (2021, 2022). It does not equal +#' analogous r*(xi|y). The larger of the two, +#' max(r*(y|xi), r*(xi|y)), is given by +#' the function depMeas() from the 'generalCorr' package. +#' m5= depMeas, which allows nonlinearity. m5 is not comprehensive +#' because it measures only two variables, y and xi, at a time. +#' m6= generalized partial correlation coefficient or +#' GPCC. This is the first comprehensive measure +#' of practical significance. +#' m7=a generalization of psychologists' "effect size" after incorporating +#' the nonlinear effect of other variables. +#' m8= local linear partial (dy/dxi) using the 'np' package +#' for kernel regressions and local linear derivatives. +#' m9= partial derivative (dy/dxi) using the 'NNS' package. +#' m10=importance measure using NNS.boost() function of 'NNS.' +#' m11=Shapley Value measure of importance (cooperative game theory). +#' m12 and m13= two versions of the random forest algorithm +#' measuring the importance of regressors. +#' +#' If m6, m10 slow down computations, we recommend setting +#' yes13[6]=0=yes13[10] to turn of computation of m6 and m10 +#' . +#' @param y {input dependent variable data as a vector} +#' @param bigx {input matrix of p regressor variables} +#' @param yes13 {default vector of ones to compute all 13 measures. +#' e.g., yes13[10]=0 means do not compute the m10 method.} +#' @param verbo {logical to print results along the way default=FALSE} +#' @return output matrix (p x 13) containing m1 to m13 +#' criteria (numerical measures of practical significance) +#' along columns +#' and a row for each regressor (excluding the intercept). +#' @note needs the function kern(), which requires package 'np'. +#' also needs 'NNS', 'randomForest', packages. +#' @author Prof. H. D. 
+#' @seealso \code{\link{effSizCut}},
+#' @seealso \code{\link{reportRank}}
+#' @importFrom stats coef cor lm median
+#' @importFrom stats residuals var
+#' @importFrom utils combn
+#' @importFrom generalCorr depMeas parcorVec kern kern2
+#' @importFrom NNS dy.d_ NNS.boost
+#' @importFrom randomForest randomForest importance
+#' @importFrom np npreg npregbw
+#' @importFrom xtable xtable
+#' @references Vinod, H. D., "Generalized Correlation and Kernel Causality with
+#' Applications in Development Economics," Communications in
+#' Statistics - Simulation and Computation, 2015,
+#' \doi{10.1080/03610918.2015.1122048}
+#' @keywords GPCC
+#' @keywords Shapley Value
+#' @keywords Random Forest
+#' @note The machine learning methods are subject to random seeds.
+#' For some seed values, m10 values from NNS.boost() become
+#' degenerate and are reported as NA or missing. In that case
+#' the average ranking output r613 from reportRank() needs manual adjustments.
+#'
+#' @references Vinod, H. D., "Generalized Correlations and Instantaneous Causality
+#' for Data Pairs Benchmark," (March 8, 2015).
+#' \url{https://www.ssrn.com/abstract=2574891}
+#' @references Vinod, H. D., "Generalized, Partial and Canonical
+#' Correlation Coefficients," Computational Economics
+#' (2021) SpringerLink vol. 59, pp. 1-28.
+#' \url{https://link.springer.com/article/10.1007/s10614-021-10190-x}
+#' @references Vinod, H. D., "Kernel regression coefficients for practical
+#' significance," Journal of Risk and Financial Management
+#' 15(1), 2022, pp. 1-13. \doi{10.3390/jrfm15010032}
+#' @references Vinod, H. D., "Hands-On Intermediate Econometrics
+#' Using R" (2022) World Scientific Publishers: Hackensack, NJ.
+#' \url{https://www.worldscientific.com/worldscibooks/10.1142/12831}
+#'
+#' @export
+
+
+
+pracSig13=function(y,bigx, yes13=rep(1,13),verbo=FALSE){
+  if (verbo){
+  print("m1=ols coeff values")
+  print("Linear regression focusing on statistical significance")}
+  importance=NULL
+  p=NCOL(bigx)
+  mall=rep(NA,13)
+  for(i in 1:13) mall[i]=paste("m",i,sep="")
+  for(i in 1:13) assign(mall[i],rep(NA,p))
+  if(verbo) print(c("initialize typical mi, m4=", m4))
+  myp=1:p
+  nam=colnames(bigx)
+  if(length(nam)==0) nam=paste("x",myp,sep="")
+  if(verbo) print(nam)
+  reg1=lm(y ~ bigx)
+  if (verbo){
+  print("compute m1=OLS coeff. values using original data")}
values using original data")} + if (yes13[1]==1){ + m1=coef(reg1)[2:(p+1)]} #omit intercept coeff magnitudes + if (verbo){ + print(m1)} + + if (verbo){ + print("compute m2=t-stat values")} + stu1x=summary(reg1)$coef[,3] #third col=t-values + if (yes13[2]==1){ + m2=stu1x[2:(p+1)] + if (verbo) print(m2)} + + if (verbo){ + print("m3=OLS regression coefficients on standardized data")} + if (yes13[3]==1){ + sy=scale(y) + sbigx=apply(bigx,2,scale) #data are standardized + if (verbo){ + print("regression forced through origin, so NO intercept")} + reg2=lm(sy~sbigx-1); m3=coef(reg2) + if (verbo){ + print("standardized regr coefficients=practical importance beta") + print(m3)} + } #endif yes13[3] + + if (yes13[4]==1){ + m4=rep(NA,p) + if (verbo){ + print("Pearson corr is m4")} + for (i in 1:p) m4[i]=cor(y,bigx[,i]) + if (verbo){ + print(m4)} + } # endif yes13[5] + + if (yes13[5]==1){ + m5=rep(NA,p) + if (verbo){ + print("generalized corr max(r*ij, r*ji) is m5")} + for (i in 1:p) m5[i]=generalCorr::depMeas(y,bigx[,i]) + if (verbo){ + print(m5)} + } + + if (yes13[6]==1){ + if (verbo){ + print("m6=generalized partial correl. coeff")} + m6=as.vector(generalCorr::parcorVec(cbind(y,bigx))) + if (verbo){ + print(m6)}} + + if (yes13[7]==1){ + if (verbo){ + print("m7=effect size Psychology, force xi=(0,1) at median")} + m7=effSizCut(y,bigx,ane=TRUE) + if (verbo){ + print(m7)} } + + if (yes13[8]==1){ + if (verbo){ + print("now m8 for local linear partials dy/dxi using np package")} + options(np.messages=FALSE) + scaly=as.vector(scale(y)) + sbigx=apply(bigx,2,scale) + m8=rep(NA,p) #place to store + for ( i in 1:p){ + kxi=generalCorr::kern2(dep.y=y, reg.x=sbigx[,i],gradients=TRUE) + m8[i]=apply(kxi$grad,2,mean) } + if (verbo){ + print("np estimate of mean of derivatives standardized units") + print(round(m8,3))} +} #endif m8 calculations + + if (yes13[9]==1){ + if (verbo){ + print("now m9 dy/dxi using NNS package")} + scaly=scale(y) + sbigx=apply(bigx,2,scale) + nnsout=NNS::dy.d_(x=sbigx, y=scaly[,1], wrt=1:p, + messages=FALSE, eval.points = "obs") + lapply(nnsout["First",], mean) + m9x= lapply(nnsout["First",], mean) + m9=as.numeric(m9x) + if (verbo){ + print(m9)} +} #endif m9 calculations + + if (yes13[10]==1 & verbo) { + print("now m10 boost importance using NNS package")} + + if (yes13[10]==1){ + NNSb=NNS::NNS.boost(IVs.train = bigx, DV.train = y,status=FALSE, + feature.importance = FALSE) + m10=as.numeric(NNSb$feature.weights) + if(verbo)print(m10) + }# endif m10 + + if (yes13[11]==1){ + if (verbo){ + print("now m11 Shapley measure of importance")} + bigxx=data.frame(bigx) + m11=as.numeric(shapleyvalue(y,bigxx)[2,]) + if (verbo) print(m11) + } # endif m11 + + if (yes13[12]==1 | yes13[13]==1){ + if (verbo){ + print("now m12, m13 random forest measures of importance")} + y.rf = randomForest(x=bigx,y=y, importance=TRUE) + imp12=round(importance(y.rf), 3) + if(yes13[12]==1) m12=as.numeric(imp12[,1]) + if (verbo){ + print(m12)} + if(yes13[13]==1) m13=as.numeric(imp12[,2]) + if (verbo){ + print(m13)} +} #endif m12 m13 + + out=matrix(NA,nrow=p, ncol=13) + + for (j in 1:13){ + if (j==1 & yes13[j]==1) out[,j]=m1 + if (j==2 & yes13[j]==1) out[,j]=m2 + if (j==3 & yes13[j]==1) out[,j]=m3 + if (j==4 & yes13[j]==1) out[,j]=m4 + if (j==5 & yes13[j]==1) out[,j]=m5 + if (j==6 & yes13[j]==1) out[,j]=m6 + if (j==7 & yes13[j]==1) out[,j]=m7 + if (j==8 & yes13[j]==1) out[,j]=m8 + if (j==9 & yes13[j]==1) out[,j]=m9 + if(length(m10)==p){ + if (j==10 & yes13[j]==1) out[,j]=m10} + if (j==11 & yes13[j]==1) out[,j]=m11 + if (j==12 
& yes13[j]==1) out[,j]=m12 + if (j==13 & yes13[j]==1) out[,j]=m13 + } + rownames(out)=nam + return(out) +} diff --git a/R/reportRank.R b/R/reportRank.R new file mode 100644 index 0000000..32c5e80 --- /dev/null +++ b/R/reportRank.R @@ -0,0 +1,134 @@ +#' Function to report ranks of 13 criteria for practical significance +#' +#' This function generates a report based on the +#' regression of y on bigx. +#' It acknowledges that some methods for evaluating the importance +#' of regressor in explaining y may give the importance value +#' with a wrong (unrealistic) sign. For example, m2 reports t-values. Imagine +#' that due to collinearity m2 value is negative when the correct +#' sign from prior knowledge of the subject matter is that the +#' coefficient should be positive, and hence the t-stat should be positive. +#' The wrong sign means the importance of regressor in explaining y +#' should be regarded as relatively less important. The larger the +#' absolute size of the t-stat, the less its true importance in +#' measuring y. The ranking of coefficients computed here +#' suitably deprecates the importance of the regressor +#' when its coefficient has the wrong sign (perverse +#' direction). +#' +#' +#' +#' @param y { (T x 1) vector of dependent variable data values} +#' @param bigx { (T x p) data marix of xi regressor variables associated +#' with the regression} +#' @param yesLatex {default 1 means print Latex-ready Tables} +#' @param yes13 {default vector of ones to compute all 13 measures.} +#' @param bsign {A (p x 1) vector of right signs of regression +#' coefficients. Default is bsign=0 means the right sign is the same +#' as the sign of the covariance, cov(y, xi) } +#' @param dig {digits to be printed in latex tables, default, dig=d33} +#' @param verbo {logical to print results by pracSig13, default=FALSE} +#' @author Prof. H. D. Vinod, Economics Dept., Fordham University, NY +#' @importFrom stats coef cor lm median residuals var +#' @seealso \code{\link{pracSig13}} +#' @note The machine learning methods are subject to random seeds. +#' For some seed values, m10 values from NNS.boost() rarely become +#' degenerate and are reported as NA or missing. In that case +#' the average ranking output r613 here needs adjustment. 
+#' @return
+#' \item{v15}{practical significance index values (sign adjusted)
+#' for m1 to m5 using older linear and/or bivariate methods}
+#' \item{v613}{practical significance index values for m6 to m13,
+#' the newer comprehensive and nonlinear methods}
+#' \item{r15}{ranks and average rank for m1 to m5
+#' using older linear and/or bivariate methods}
+#' \item{r613}{ranks and average rank for m6 to m13,
+#' the newer comprehensive and nonlinear methods}
+#'
+#'
+#' @examples
+#' \donttest{
+#' set.seed(9)
+#' y=sample(1:15,replace = TRUE)
+#' x1=sample(2:16, replace = TRUE)
+#' x2=sample(3:17, replace = TRUE)
+#' x3=sample(4:18,replace = TRUE)
+#' reportRank(y,bigx=cbind(x1,x2,x3))
+#' }
+#'
+#' @export
+
+
+reportRank=function(y, bigx, yesLatex=1, yes13=rep(1,13),
+                    bsign=0,dig=3,verbo=FALSE) {
+  d3=dig
+  p=NCOL(bigx)
+  m1=NULL;m2=NULL;m3=NULL;m4=NULL;m5=NULL;m6=NULL;m7=NULL
+  m8=NULL;m9=NULL;m10=NULL;m11=NULL;m12=NULL;m13=NULL
+  nam=colnames(bigx)
+  if (length(bsign)==1 && bsign[1]==0) { #scalar 0 means signs not supplied
+  bsign=rep(1,p)
+  for (i in 1:p) bsign[i]=sign(cor(y,bigx[,i])) }
+  p13=pracSig13(y,bigx,verbo=verbo,yes13=yes13)
+  if(verbo){
+  print("values before sign adjustment for m1 to m9")
+  print("Sign always positive for importance indexes m10 to m13")
+  print(p13[,1:6],digits=d3)
+  print(p13[,7:13],digits=d3)}
+  if(verbo){
+  if(yesLatex==1){
+  print("values before sign adjustment for m1 to m9")
+  print(xtable::xtable(p13[,1:6],digits=d3))
+  print(xtable::xtable(p13[,7:13],digits=d3)) } }
+  sg19=diag(bsign)%*%p13[,1:9]
+  PSM=cbind(sg19,p13[,10:13])
+  colnames(PSM)=c("m1","m2","m3","m4","m5","m6",
+   "m7","m8","m9","m10","m11","m12","m13")
+  if(verbo){
+  print("values AFTER sign adjustment for m1 to m9")
+  print(PSM[,1:6],digits=d3)
+  print(PSM[,7:13],digits=d3)
+  if(yesLatex==1){
+  print("values AFTER sign adjustment for m1 to m9")
+  print(xtable::xtable(PSM[,1:6],digits=d3))
+  print(xtable::xtable(PSM[,7:13],digits=d3)) }}
+n13=NCOL(PSM)
+for(j in 1:13) assign(paste("m",j,sep=""),PSM[,j])
+if(verbo) print(nam)
+v15=cbind(m1,m2,m3,m4,m5) #v=values, 15=m1 to m5
+rownames(v15)=nam
+if(verbo){
+print("values of m1 to m5 after sign adjustments if any")
+print(v15)
+if(yesLatex==1) print(xtable::xtable(v15,digits=4))
+print("above v15 values of m1 to m5")}
+#now compute rank numbers for m1 to m5
+nr=NROW(v15)
+rank15=((nr+1)-apply(v15,2,rank,na.last=TRUE, ties.method="average"))
+if(verbo) print("rank values as they are")
+avrank15=round(apply(rank15,1,mean),1)
+r15=cbind(rank15,avrank15)
+if(verbo) {print(r15)
+if(yesLatex==1) print(xtable::xtable(r15,digits=2))}
+v613=cbind(m6,m7,m8,m9,m10,m11,m12,m13)
+rownames(v613)=nam
+if(verbo) {print(v613)
+if(yesLatex==1) print(xtable::xtable(v613,digits=4))}
+nr=NROW(v613)
+rank613=((nr+1)-apply(v613,2,rank,na.last=TRUE,
+   ties.method="average"))
+if(verbo){
+print("sign adjusted (if necessary) ranks of p regressors")}
+avrank613=round(apply(rank613,1,mean),1)
+r613=cbind(rank613,avrank613)
+if(verbo){
+print(r613)
+if(yesLatex==1) print(xtable::xtable(r613,digits=2))
+}
+list(v15=v15, v613=v613, r15=r15, r613=r613)
+}
+
diff --git a/R/shapleyvalue.R b/R/shapleyvalue.R
new file mode 100644
index 0000000..c1d67bc
--- /dev/null
+++ b/R/shapleyvalue.R
@@ -0,0 +1,140 @@
+#' Shapley Value Regression importance of attributes in linear regression
+#'
+#' The 'ShapleyValue' package function shapleyvalue()
+#' calculates the relative importance
+#' of independent variables in linear regression while avoiding
+#' collinearity.
+#' This function is a minor modification of the function
+#' in the package ShapleyValue. The original function fails when
+#' a variable is named x1. Since Jingyi Liang,
+#' the maintainer of 'ShapleyValue,' has ignored several requests
+#' to fix the error, this function
+#' is reluctantly included here as a separate
+#' function.
+#' @param y {input dependent variable data as a vector}
+#' @param x {input matrix of p regressor variables; must be a data frame}
+#' @return The output is a data frame with two rows:
+#' the unstandardized and standardized relative importance of each
+#' attribute using the Shapley value regression method.
+#' @author H. D. Vinod, Fordham University, NY
+#' @importFrom stats lm
+#' @references Liang J (2021). ShapleyValue: Shapley Value Regression for
+#' Relative Importance of Attributes. R package version 0.2.0,
+#' \url{https://CRAN.R-project.org/package=ShapleyValue}.
+#'
+#' @export
+
+
+shapleyvalue=
+  function (y, x)
+  {
+    k <- ncol(x)
+    times_wt <- NULL
+    for (i in 1:(k - 1)) { #note 1:(k-1); unparenthesized 1:k - 1 starts at 0
+      times_wt[i] <- choose(k - 1, i)
+    }
+    data <- data.frame(matrix(ncol = k, nrow = 2^(k - 1)))
+    name <- colnames(x)
+    if (k == 1) {
+      stop(simpleError("No need for using shapley regression!"))
+    }
+    else if (k == 2) {
+      for (i in 1:k) {
+        n <- name[i]
+        lm <- lm(y ~ x[, n])
+        lm2 <- lm(y ~ x[, -i])
+        lm_all <- lm(y ~ ., data = x)
+        r1 <- summary(lm)$r.squared
+        r2 <- summary(lm2)$r.squared
+        r3 <- summary(lm_all)$r.squared
+        data[1, i] <- r1
+        data[2, i] <- r3 - r2
+      }
+      weight <- c(1/2, 1/2)
+      data <- as.matrix(data)
+      dat <- weight %*% data
+      dat <- as.data.frame(dat)
+      sum <- rowSums(dat)
+      uni_value <- dat/sum
+      dat <- rbind(round(dat, 4), round(uni_value, 4))
+      colnames(dat) <- name
+      rownames(dat) <- c("Shapley Value", "Standardized Shapley Value")
+      return(dat)
+    }
+    else {
+      for (i in 1:k) {
+        num <- 1
+        for (j in 1:k) {
+          if (j == 1) {
+            n <- name[i]
+            lm <- lm(y ~ x[, n])
+            data[1, i] <- summary(lm)$r.squared
+          }
+          else if (j == k) {
+            xx <- combn(k, k)
+            x_lab <- NULL #was the typo "xlab" in the original
+            for (l in 1:j) {
+              x_lab[l] <- xx[l]
+              if (x_lab[l] == i) {
+                x1dummy <- x[, x_lab[l]] #rename xi to dodge the x1 name clash
+                x_reslab <- xx[-l]
+                x_res <- x[x_reslab]
+                x_res <- cbind(x_res, x1dummy)
+                lm <- lm(y ~ . - x1dummy, data = x_res)
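+                # lm fits the coalition without regressor i; lm2 adds it back,
+                # so r2 - r measures i's marginal R-squared contribution.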
+                lm2 <- lm(y ~ ., data = x_res)
+                r <- summary(lm)$r.squared
+                r2 <- summary(lm2)$r.squared
+              }
+            }
+            data[2^(k - 1), i] <- r2 - r
+          }
+          else {
+            times <- choose(k - 1, j - 1)
+            times2 <- choose(k, j)
+            combine <- combn(k, j)
+            n <- NULL
+            nn <- combine
+            for (m in 1:j) {
+              nn[m, ] <- combine[m, ] == i
+            }
+            for (m in 1:times2) {
+              n[m] <- 1 %in% nn[, m]
+            }
+            xx <- combine[, which(n)]
+            for (o in 1:times) {
+              num <- num + 1
+              x_lab <- NULL
+              for (l in 1:j) {
+                x_lab[l] <- xx[, o][l]
+                if (x_lab[l] == i) {
+                  x1dummy <- x[, x_lab[l]]
+                  x_reslab <- xx[, o][-l]
+                  x_res <- x[x_reslab]
+                  x_res <- cbind(x_res, x1dummy)
+                  lm <- lm(y ~ . - x1dummy, data = x_res)
+                  lm2 <- lm(y ~ ., data = x_res)
+                  r <- summary(lm)$r.squared
+                  r2 <- summary(lm2)$r.squared
+                }
+              }
+              data[num, i] <- r2 - r
+            }
+          }
+        }
+      }
+      weight <- 1/k
+      for (q in 1:(k - 1)) { #again 1:(k-1), not 1:k - 1
+        weight <- c(weight, rep(1/k/times_wt[q], times_wt[q]))
+      }
+      data <- as.matrix(data)
+      weight <- as.matrix(weight)
+      weight <- t(weight)
+      dat2 <- weight %*% data
+      dat2 <- as.data.frame(dat2)
+      sum <- rowSums(dat2)
+      uni_value <- dat2/sum
+      dat2 <- rbind(round(dat2, 4), round(uni_value, 4))
+      colnames(dat2) <- name
+      rownames(dat2) <- c("Shapley Value", "Standardized Shapley Value")
+      return(dat2)
+    }
+  }
diff --git a/build/partial.rdb b/build/partial.rdb
new file mode 100644
index 0000000..46a8ca1
Binary files /dev/null and b/build/partial.rdb differ
diff --git a/build/vignette.rds b/build/vignette.rds
new file mode 100644
index 0000000..ab0dda2
Binary files /dev/null and b/build/vignette.rds differ
diff --git a/inst/doc/practicalSigni-vignette.pdf b/inst/doc/practicalSigni-vignette.pdf
new file mode 100644
index 0000000..2bd7cfa
Binary files /dev/null and b/inst/doc/practicalSigni-vignette.pdf differ
diff --git a/inst/doc/practicalSigni-vignette.pdf.asis b/inst/doc/practicalSigni-vignette.pdf.asis
new file mode 100644
index 0000000..a099b2a
--- /dev/null
+++ b/inst/doc/practicalSigni-vignette.pdf.asis
@@ -0,0 +1,6 @@
+%\VignetteIndexEntry{practicalSigni-vignette}
+%\VignetteEngine{R.rsp::asis}
+%\VignetteKeyword{PDF}
+%\VignetteKeyword{HTML}
+%\VignetteKeyword{vignette}
+%\VignetteKeyword{package}
diff --git a/man/effSizCut.Rd b/man/effSizCut.Rd
new file mode 100644
index 0000000..19b7dc0
--- /dev/null
+++ b/man/effSizCut.Rd
@@ -0,0 +1,68 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/effSizCut.R
+\name{effSizCut}
+\alias{effSizCut}
+\title{Compute Effect Sizes for continuous or categorical data}
+\usage{
+effSizCut(y, bigx, ane = TRUE)
+}
+\arguments{
+\item{y}{{ (T x 1) vector of dependent variable data values}}
+
+\item{bigx}{{ (T x p) data matrix of xi regressor variables associated
+with the regression}}
+
+\item{ane}{{logical variable controlling the treatment of other regressors.
+If ane=TRUE (default), other regressors are used in kernel regression
+without forcing them to be binary variables. When ane=FALSE,
+the kernel regression removes the effect of other regressors
+when other regressors are also binary type categorical variables.}}
+}
+\value{
+out vector with p values of t-statistics for p regressors
+}
+\description{
+Psychologists' so-called "effect size" reveals
+the practical significance of only one
+regressor. This function generalizes their algorithm
+to two or more regressors (p >= 2). The generalization first
+converts the xi regressor into a categorical treatment variable
+with only two categories. One imagines that observations
+larger than the
+median (xit > median(xi)) are "treated," and those
+below the median are "untreated."
+The aim is to measure the size of the
+(treatment) effect of (xi) on y. Denote other variables
+with postscript "o" as (xo). Since we have p regressors in
+our multiple regression, we need to remove the nonlinear
+kernel regression effect of
+other variables (xo) on y while focusing on the effect of xi.
+There are two options for treating (xo): (i) leaving xo
+as it is in the data, or (ii) converting xo to binary
+at the median. One chooses the first option (i) by setting the
+logical argument ane=TRUE in calling the function.
+ane=TRUE is the default. Set ane=FALSE for the second option.
+}
+\note{
+The aim is to answer the following question.
+Which regressor has the largest
+effect on the dependent variable? We assume that the signs
+of regressors are already adjusted such that a numerically
+larger effect size suggests that the corresponding regressor
+is most important, having the largest effect size in explaining
+y the dependent variable.
+}
+\examples{
+set.seed(9)
+y=sample(1:15,replace = TRUE)
+x1=sample(2:16, replace = TRUE)
+x2=sample(3:17, replace = TRUE)
+effSizCut(y,bigx=cbind(x1,x2),ane=TRUE)
+
+}
+\seealso{
+\code{\link{pracSig13}}
+}
+\author{
+Prof. H. D. Vinod, Economics Dept., Fordham University, NY
+}
diff --git a/man/fncut.Rd b/man/fncut.Rd
new file mode 100644
index 0000000..3712f3e
--- /dev/null
+++ b/man/fncut.Rd
@@ -0,0 +1,25 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/fncut.R
+\name{fncut}
+\alias{fncut}
+\title{fncut auxiliary converts continuous data into two categories}
+\usage{
+fncut(x)
+}
+\arguments{
+\item{x}{{numerical vector of data values}}
+}
+\value{
+x {vector of zeros and ones split at the median}
+}
+\description{
+This is an internal function of the R package practicalSigni.
+Psychologists use effect size to evaluate the practical
+importance of a treatment on a dependent variable using
+a binary [0,1] variable. Assuming numerical data, we
+can always compute the median and regard values < or = the
+median as zero and other values as unity.
+}
+\author{
+Prof. H. D. Vinod, Fordham University, NY
+}
diff --git a/man/pracSig13.Rd b/man/pracSig13.Rd
new file mode 100644
index 0000000..f21f0a2
--- /dev/null
+++ b/man/pracSig13.Rd
@@ -0,0 +1,102 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/pracSig13.R
+\name{pracSig13}
+\alias{pracSig13}
+\title{Compute thirteen measures of practical significance}
+\usage{
+pracSig13(y, bigx, yes13 = rep(1, 13), verbo = FALSE)
+}
+\arguments{
+\item{y}{{input dependent variable data as a vector}}
+
+\item{bigx}{{input matrix of p regressor variables}}
+
+\item{yes13}{{default vector of ones to compute all 13 measures.
+e.g., yes13[10]=0 means do not compute the m10 method.}}
+
+\item{verbo}{{logical to print results along the way default=FALSE}}
+}
+\value{
+output matrix (p x 13) containing m1 to m13
+criteria (numerical measures of practical significance)
+along columns
+and a row for each regressor (excluding the intercept).
+}
+\description{
+Thirteen methods are denoted m1 to m13. Each
+yields p numbers when there are p regressors denoted xi.
+m1=OLS coefficient slopes. m2= t-stat of each slope.
+m3= beta coefficients OLS after all variables have
+mean zero and sd=1.
+m4= Pearson correlation coefficient between y and xi (only two
+variables at a time, assuming linearity).
+Let r*(y|xi) denote the generalized correlation coefficient allowing
+for nonlinearity from Vinod (2021, 2022). It does not equal
+analogous r*(xi|y). The larger of the two,
+max(r*(y|xi), r*(xi|y)), is given by
+the function depMeas() from the 'generalCorr' package.
+m5= depMeas, which allows nonlinearity. m5 is not comprehensive
+because it measures only two variables, y and xi, at a time.
+m6= generalized partial correlation coefficient or
+GPCC. This is the first comprehensive measure
+of practical significance.
+m7=a generalization of psychologists' "effect size" after incorporating
+the nonlinear effect of other variables.
+m8= local linear partial (dy/dxi) using the 'np' package
+for kernel regressions and local linear derivatives.
+m9= partial derivative (dy/dxi) using the 'NNS' package.
+m10=importance measure using the NNS.boost() function of 'NNS'.
+m11=Shapley Value measure of importance (cooperative game theory).
+m12 and m13= two versions of the random forest algorithm
+measuring the importance of regressors.
+}
+\details{
+If m6 and m10 slow down computations, we recommend setting
+yes13[6]=0 and yes13[10]=0 to turn off computation of m6 and m10.
+}
+\note{
+Needs the function kern(), which requires the package 'np'.
+Also needs the 'NNS' and 'randomForest' packages.
+
+The machine learning methods are subject to random seeds.
+For some seed values, m10 values from NNS.boost() become
+degenerate and are reported as NA or missing. In that case
+the average ranking output r613 from reportRank() needs manual adjustments.
+}
+\references{
+Vinod, H. D., "Generalized Correlation and Kernel Causality with
+Applications in Development Economics," Communications in
+Statistics - Simulation and Computation, 2015,
+\doi{10.1080/03610918.2015.1122048}
+
+Vinod, H. D., "Generalized Correlations and Instantaneous Causality
+for Data Pairs Benchmark," (March 8, 2015).
+\url{https://www.ssrn.com/abstract=2574891}
+
+Vinod, H. D., "Generalized, Partial and Canonical
+Correlation Coefficients," Computational Economics
+(2021) SpringerLink vol. 59, pp. 1-28.
+\url{https://link.springer.com/article/10.1007/s10614-021-10190-x}
+
+Vinod, H. D., "Kernel regression coefficients for practical
+significance," Journal of Risk and Financial Management
+15(1), 2022, pp. 1-13. \doi{10.3390/jrfm15010032}
+
+Vinod, H. D., "Hands-On Intermediate Econometrics
+Using R" (2022) World Scientific Publishers: Hackensack, NJ.
+\url{https://www.worldscientific.com/worldscibooks/10.1142/12831}
+}
+\seealso{
+\code{\link{effSizCut}},
+
+\code{\link{reportRank}}
+}
+\author{
+Prof. H. D. Vinod, Economics Dept., Fordham University, NY
+}
+\keyword{Forest}
+\keyword{GPCC}
+\keyword{Random}
+\keyword{Shapley}
+\keyword{Value}
diff --git a/man/reportRank.Rd b/man/reportRank.Rd
new file mode 100644
index 0000000..e9af3d9
--- /dev/null
+++ b/man/reportRank.Rd
@@ -0,0 +1,85 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/reportRank.R
+\name{reportRank}
+\alias{reportRank}
+\title{Function to report ranks of 13 criteria for practical significance}
+\usage{
+reportRank(
+  y,
+  bigx,
+  yesLatex = 1,
+  yes13 = rep(1, 13),
+  bsign = 0,
+  dig = 3,
+  verbo = FALSE
+)
+}
+\arguments{
+\item{y}{{ (T x 1) vector of dependent variable data values}}
+
+\item{bigx}{{ (T x p) data matrix of xi regressor variables associated
+with the regression}}
+
+\item{yesLatex}{{default 1 means print LaTeX-ready tables}}
+
+\item{yes13}{{default vector of ones to compute all 13 measures.}}
+
+\item{bsign}{{A (p x 1) vector of right signs of regression
+coefficients. The default bsign=0 means the right sign is the same
+as the sign of the covariance, cov(y, xi)}}
+
+\item{dig}{{digits to be printed in latex tables, default dig=3}}
+
+\item{verbo}{{logical to print results by pracSig13, default=FALSE}}
+}
+\value{
+\item{v15}{practical significance index values (sign adjusted)
+for m1 to m5 using older linear and/or bivariate methods}
+\item{v613}{practical significance index values for m6 to m13,
+the newer comprehensive and nonlinear methods}
+\item{r15}{ranks and average rank for m1 to m5
+using older linear and/or bivariate methods}
+\item{r613}{ranks and average rank for m6 to m13,
+the newer comprehensive and nonlinear methods}
+}
+\description{
+This function generates a report based on the
+regression of y on bigx.
+It acknowledges that some methods for evaluating the importance
+of a regressor in explaining y may give the importance value
+with a wrong (unrealistic) sign. For example, m2 reports t-values. Imagine
+that due to collinearity the m2 value is negative when the correct
+sign from prior knowledge of the subject matter is that the
+coefficient should be positive, and hence the t-stat should be positive.
+A wrong sign means the regressor should be regarded as relatively
+less important in explaining y. When the sign is wrong, the larger
+the absolute size of the t-stat, the smaller its true importance in
+explaining y. The ranking of coefficients computed here
+suitably deprecates the importance of a regressor
+when its coefficient has the wrong sign (perverse
+direction).
+}
+\note{
+The machine learning methods are subject to random seeds.
+For some seed values, m10 values from NNS.boost() may become
+degenerate and are reported as NA or missing. In that case
+the average ranking output r613 here needs manual adjustment.
+}
+\examples{
+\donttest{
+set.seed(9)
+y=sample(1:15,replace = TRUE)
+x1=sample(2:16, replace = TRUE)
+x2=sample(3:17, replace = TRUE)
+x3=sample(4:18,replace = TRUE)
+reportRank(y,bigx=cbind(x1,x2,x3))
+}
+
+}
+\seealso{
+\code{\link{pracSig13}}
+}
+\author{
+Prof. H. D. Vinod, Economics Dept., Fordham University, NY
+}
diff --git a/man/shapleyvalue.Rd b/man/shapleyvalue.Rd
new file mode 100644
index 0000000..f76d3e8
--- /dev/null
+++ b/man/shapleyvalue.Rd
@@ -0,0 +1,38 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/shapleyvalue.R
+\name{shapleyvalue}
+\alias{shapleyvalue}
+\title{Shapley Value Regression importance of attributes in linear regression}
+\usage{
+shapleyvalue(y, x)
+}
+\arguments{
+\item{y}{{input dependent variable data as a vector}}
+
+\item{x}{{input matrix of p regressor variables; must be a data frame}}
+}
+\value{
+The output is a data frame with two rows:
+the unstandardized and standardized relative importance of each
+attribute using the Shapley value regression method.
+}
+\description{
+The 'ShapleyValue' package function shapleyvalue()
+calculates the relative importance
+of independent variables in linear regression while avoiding
+collinearity. This function is a minor modification of the function
+in the package ShapleyValue. The original function fails when
+a variable is named x1. Since Jingyi Liang,
+the maintainer of 'ShapleyValue,' has ignored several requests
+to fix the error, this function
+is reluctantly included here as a separate
+function.
+}
+\references{
+Liang J (2021). ShapleyValue: Shapley Value Regression for
+Relative Importance of Attributes. R package version 0.2.0,
+\url{https://CRAN.R-project.org/package=ShapleyValue}.
+} +\author{ +H. D. Vinod, Fordham University, NY +} diff --git a/vignettes/practicalSigni-vignette.pdf.asis b/vignettes/practicalSigni-vignette.pdf.asis new file mode 100644 index 0000000..a099b2a --- /dev/null +++ b/vignettes/practicalSigni-vignette.pdf.asis @@ -0,0 +1,6 @@ +%\VignetteIndexEntry{practicalSigni-vignette} +%\VignetteEngine{R.rsp::asis} +%\VignetteKeyword{PDF} +%\VignetteKeyword{HTML} +%\VignetteKeyword{vignette} +%\VignetteKeyword{package}
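
A minimal usage sketch (not part of the package sources), assuming
'practicalSigni' and its dependencies are installed; the simulated data and
the choice to skip the slower methods m6 (GPCC) and m10 (NNS boost) are
illustrative only.

library(practicalSigni)
set.seed(9)
n <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                  # pure noise regressor
y <- 1 + 2*x1 + x2^2 + rnorm(n) # y is nonlinear in x2

yes13 <- rep(1, 13)
yes13[c(6, 10)] <- 0            # turn off m6 and m10, as the docs suggest

p13 <- pracSig13(y, bigx = cbind(x1, x2, x3), yes13 = yes13) # p x 13 matrix
rr <- reportRank(y, bigx = cbind(x1, x2, x3), yes13 = yes13)
rr$r613                         # average ranks, newer methods m6 to m13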