Context: A customer wants to get their source code under control. In this analysis, we identify existing concepts in the source code based on naming conventions. The goal is to find commonly used naming conventions and document them in the architecture documentation, so that every developer can understand these concepts when they come across them in the source code.

In [1]:
import glob
path = "../../OpenClinica/"
java_filelist = glob.glob(path + "**/*.java", recursive=True)
java_filelist[:5]

['../../OpenClinica/core/src/main/java/org/akaza/openclinica/bean/admin/AuditBean.java',
 '../../OpenClinica/core/src/main/java/org/akaza/openclinica/bean/admin/AuditEventBean.java',
 '../../OpenClinica/core/src/main/java/org/akaza/openclinica/bean/admin/CRFBean.java',
 '../../OpenClinica/core/src/main/java/org/akaza/openclinica/bean/admin/DeletedEventCRFBean.java',
 '../../OpenClinica/core/src/main/java/org/akaza/openclinica/bean/admin/DisplayStudyBean.java']

In [2]:
import pandas as pd

code = pd.DataFrame(java_filelist, columns=["filepath"])
code["filepath"] = code["filepath"].str.replace(path, "", regex=False)
code.head()

Unnamed: 0,filepath
0,core/src/main/java/org/akaza/openclinica/bean/...
1,core/src/main/java/org/akaza/openclinica/bean/...
2,core/src/main/java/org/akaza/openclinica/bean/...
3,core/src/main/java/org/akaza/openclinica/bean/...
4,core/src/main/java/org/akaza/openclinica/bean/...


In [3]:
code["type"] = code["filepath"].str.rsplit("/", n=1).str[-1].str.replace(".java", "", regex=False)
code.head()

Unnamed: 0,filepath,type
0,core/src/main/java/org/akaza/openclinica/bean/...,AuditBean
1,core/src/main/java/org/akaza/openclinica/bean/...,AuditEventBean
2,core/src/main/java/org/akaza/openclinica/bean/...,CRFBean
3,core/src/main/java/org/akaza/openclinica/bean/...,DeletedEventCRFBean
4,core/src/main/java/org/akaza/openclinica/bean/...,DisplayStudyBean
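
As a hedged alternative to the string operations above, the relative path and the type name could also be derived with `pathlib`, which avoids depending on `/` as the path separator. The helper name `list_java_types` and the temporary directory layout are assumptions made for illustration; they are not part of the original analysis.

```python
import tempfile
from pathlib import Path


def list_java_types(root):
    """Return sorted (relative POSIX path, type name) pairs for all .java files below root."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stem)
        for p in root.rglob("*.java")
    )


# Hypothetical mini project, created only to demonstrate the helper.
with tempfile.TemporaryDirectory() as tmp:
    bean_dir = Path(tmp) / "core" / "bean"
    bean_dir.mkdir(parents=True)
    (bean_dir / "AuditBean.java").touch()
    (bean_dir / "CRFBean.java").touch()
    java_types = list_java_types(tmp)

print(java_types)
```

The resulting pairs could be loaded with `pd.DataFrame(java_types, columns=["filepath", "type"])` to get the same two columns as above.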


In [4]:
import re
 
def split_camel_case_split(name):
    # Split a camel-case type name into its parts; acronyms such as "CRF" stay together.
    return re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', name)

code["splitted"] = code["type"].apply(split_camel_case_split)
code.head()

Unnamed: 0,filepath,type,splitted
0,core/src/main/java/org/akaza/openclinica/bean/...,AuditBean,"[Audit, Bean]"
1,core/src/main/java/org/akaza/openclinica/bean/...,AuditEventBean,"[Audit, Event, Bean]"
2,core/src/main/java/org/akaza/openclinica/bean/...,CRFBean,"[CRF, Bean]"
3,core/src/main/java/org/akaza/openclinica/bean/...,DeletedEventCRFBean,"[Deleted, Event, CRF, Bean]"
4,core/src/main/java/org/akaza/openclinica/bean/...,DisplayStudyBean,"[Display, Study, Bean]"
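
To see how the pattern behaves on less regular names, the following sketch applies it to a few inputs. `UserAccountDAO` and `EnketoAPI` appear in this code base; `Base64Encoder` and `package-info` are hypothetical names chosen to show two edge cases: digits are dropped, and names without uppercase letters yield an empty list. The helper name `split_parts` is an assumption.

```python
import re

# Same pattern as in the cell above.
CAMEL_CASE_PART = re.compile(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))')


def split_parts(name):
    """Return the camel-case parts of a type name."""
    return CAMEL_CASE_PART.findall(name)


examples = {
    name: split_parts(name)
    for name in ["UserAccountDAO", "EnketoAPI", "Base64Encoder", "package-info"]
}

for name, parts in examples.items():
    print(name, "->", parts)
```

An empty result is the reason why the next step fills missing name parts with an empty string.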


In [5]:
# Take the last, second-to-last and third-to-last name part; names with fewer parts get "".
code["name_-1"] = code['splitted'].str[-1].fillna("")
code["name_-2"] = code['splitted'].str[-2].fillna("")
code["name_-3"] = code['splitted'].str[-3].fillna("")
code["name_-2_-1"] = code["name_-2"] + code["name_-1"]
code["name_-3_-2_-1"] = code["name_-3"] + code["name_-2"] + code["name_-1"]
code.iloc[:,-5:].head()

Unnamed: 0,name_-1,name_-2,name_-3,name_-2_-1,name_-3_-2_-1
0,Bean,Audit,,AuditBean,AuditBean
1,Bean,Event,Audit,EventBean,AuditEventBean
2,Bean,CRF,,CRFBean,CRFBean
3,Bean,CRF,Event,CRFBean,EventCRFBean
4,Bean,Study,Display,StudyBean,DisplayStudyBean
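
The level columns can be read as "join the last n name parts". A minimal plain-Python sketch of that idea, with the hypothetical helper `name_suffix`, shows why short names such as `AuditBean` produce the same value at levels -2 and -3:

```python
def name_suffix(parts, level):
    """Join the last `level` name parts (level >= 1); shorter names are returned whole."""
    return "".join(parts[-level:])


parts = ["Deleted", "Event", "CRF", "Bean"]
suffixes = [name_suffix(parts, level) for level in (1, 2, 3)]
print(suffixes)

# A two-part name has no third-to-last part, so level 3 falls back to the full name.
short = name_suffix(["Audit", "Bean"], 3)
print(short)
```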


In [6]:
pd.DataFrame(code['name_-1'].value_counts()).head()

Unnamed: 0,name_-1
Servlet,205
Bean,187
Dao,60
Service,41
DAO,38


In [7]:
pd.DataFrame(code['name_-2_-1'].value_counts()).head()

Unnamed: 0,name_-2_-1
CRFServlet,16
SubjectServlet,15
StudyServlet,15
TableFactory,15
DataBean,14


Taking level -3 into consideration makes it clear that it is probably not the best choice, because those stereotypes consist partly of domain names. Level -2 therefore seems to be a good candidate for analyzing the corresponding stereotypes in more detail.

In [8]:
pd.DataFrame(code['name_-3_-2_-1'].value_counts()).head()

Unnamed: 0,name_-3_-2_-1
CRFVersionServlet,9
StudyEventServlet,8
EventDefinitionServlet,8
EventCRFServlet,7
StudySubjectServlet,7
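
The trade-off between the levels can be made visible on a handful of names. The sketch below takes six type names that occur in this notebook's outputs and counts, per level, how many distinct groups result and which group is the largest. It uses plain Python instead of pandas, and the helper names are assumptions made for illustration.

```python
import re
from collections import Counter

CAMEL_CASE_PART = re.compile(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))')

# Type names taken from the outputs shown in this notebook.
type_names = [
    "CreateUserAccountServlet",
    "EditUserAccountServlet",
    "ViewUserAccountServlet",
    "UpdateStudyServlet",
    "UserAccountBean",
    "AuditEventBean",
]


def suffix_counts(names, level):
    """Count how often each suffix made of the last `level` name parts occurs."""
    return Counter("".join(CAMEL_CASE_PART.findall(n)[-level:]) for n in names)


summary = {}
for level in (1, 2, 3):
    counts = suffix_counts(type_names, level)
    summary[level] = (len(counts), counts.most_common(1)[0])
    print(level, summary[level])
```

At level 1 the groups carry purely technical names (`Servlet`, `Bean`). From level 2 on, domain words such as `Account` or `Study` become part of the group name, and at level 3 some groups are already complete class names.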


Getting a list of the source code files that belong to one concept at level -1.

In [9]:
code_stereotype_per_file = code.groupby(['name_-1', 'filepath'])[['type']].count()
code_stereotype_per_file.head()

name_-1,filepath,type
API,web/src/main/java/org/akaza/openclinica/web/pform/EnketoAPI.java,1
AUTH,core/src/main/java/org/akaza/openclinica/log/LogFilterFacilityAUTH.java,1
AUTHPRIV,core/src/main/java/org/akaza/openclinica/log/LogFilterFacilityAUTHPRIV.java,1
Access,web/src/main/java/org/akaza/openclinica/control/SpringServletAccess.java,1
Access,ws/src/main/java/org/akaza/openclinica/control/form/SpringServletAccess.java,1


In [10]:
code_stereotypes = code_stereotype_per_file.groupby(['name_-1']).transform("sum").sort_values(by="type", ascending=False)
code_stereotypes

name_-1,filepath,type
Servlet,web/src/main/java/org/akaza/openclinica/control/managestudy/UpdateStudyServlet.java,205
Servlet,web/src/main/java/org/akaza/openclinica/control/managestudy/RestoreSiteServlet.java,205
Servlet,web/src/main/java/org/akaza/openclinica/control/extract/DiscrepancyNoteOutputServlet.java,205
Servlet,web/src/main/java/org/akaza/openclinica/control/extract/EditDatasetServlet.java,205
Servlet,web/src/main/java/org/akaza/openclinica/control/extract/EditFilterServlet.java,205
...,...,...
Itext,core/src/main/java/org/akaza/openclinica/domain/xform/dto/Itext.java,1
Sender,core/src/main/java/org/akaza/openclinica/core/OpenClinicaMailSender.java,1
Text,core/src/main/java/org/akaza/openclinica/domain/xform/dto/Text.java,1
Thread,core/src/main/java/org/akaza/openclinica/service/DiscrepancyNoteThread.java,1
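
The two-step grouping above attaches the size of each stereotype group to every file of that group. Assuming each file path occurs only once, a single `groupby(...).transform("count")` could give the same per-file number. The sketch below shows this on a toy DataFrame; the file paths and the variable `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per file with its level -1 stereotype.
toy = pd.DataFrame({
    "name_-1": ["Bean", "Servlet", "Bean", "Dao"],
    "filepath": [
        "a/AuditBean.java",
        "b/EditServlet.java",
        "a/CRFBean.java",
        "c/UserDao.java",
    ],
})

# transform() returns one value per input row, here the size of the row's group.
toy["type"] = toy.groupby("name_-1")["filepath"].transform("count")

# Largest groups first, so that the most common stereotypes end up at the top.
toy = toy.sort_values(by=["type", "name_-1", "filepath"], ascending=False)
print(toy)
```

Unlike an aggregation, `transform` keeps the original rows, which is what makes the exported list usable as a file-by-file overview.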


In [11]:
code_stereotypes.to_excel("output/openclinica_stereotypes_-1.xlsx")

The same for level -2.

In [12]:
code_stereotype_per_file_2_1 = code.groupby(['name_-2_-1', 'filepath'])[['type']].count()
code_stereotype_per_file_2_1.head(20)

name_-2_-1,filepath,type
AbstractFunction,core/src/main/java/org/akaza/openclinica/logic/score/function/AbstractFunction.java,1
AccountBean,core/src/main/java/org/akaza/openclinica/bean/login/UserAccountBean.java,1
AccountController,web/src/main/java/org/akaza/openclinica/controller/AccountController.java,1
AccountController,web/src/main/java/org/akaza/openclinica/controller/UserAccountController.java,1
AccountDAO,core/src/main/java/org/akaza/openclinica/dao/login/UserAccountDAO.java,1
AccountDao,core/src/main/java/org/akaza/openclinica/dao/hibernate/UserAccountDao.java,1
AccountRow,web/src/main/java/org/akaza/openclinica/web/bean/UserAccountRow.java,1
AccountServlet,web/src/main/java/org/akaza/openclinica/control/admin/CreateUserAccountServlet.java,1
AccountServlet,web/src/main/java/org/akaza/openclinica/control/admin/EditUserAccountServlet.java,1
AccountServlet,web/src/main/java/org/akaza/openclinica/control/admin/ViewUserAccountServlet.java,1


In [13]:
code_stereotypes_2_1 = code_stereotype_per_file_2_1 \
    .groupby(['name_-2_-1']) \
    .transform("sum") \
    .sort_values(by=["type", "name_-2_-1", "filepath"], ascending=False) \
    .reset_index()
code_stereotypes_2_1.head(20)

Unnamed: 0,name_-2_-1,filepath,type
0,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
1,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
2,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
3,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
4,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
5,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
6,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
7,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
8,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16
9,CRFServlet,web/src/main/java/org/akaza/openclinica/contro...,16


In [14]:
code_stereotypes_2_1.to_excel("output/openclinica_stereotypes_-2_-1.xlsx", index=False)