Commit: paper version

angeloschatzimparmpas committed Feb 3, 2023
1 parent 91b9f74 commit 9c4d632
Showing 62 changed files with 155 additions and 3 deletions.
20 changes: 20 additions & 0 deletions README.md
@@ -60,5 +60,25 @@ FLASK_APP=run.py flask run

Then, open your browser and point it to `localhost:8080`. We recommend using an up-to-date version of Google Chrome.

# Reproducibility of the Results #
The following instructions describe how to reproduce the results presented in Figure 1 of the article. This figure corresponds to Section 5.2 (*Use case: explorative sampling for better classification*), the main use case described in the paper.

**Note:** We used OSX and Google Chrome in all our tests, so we cannot guarantee that everything works in other operating systems or browsers. However, since HardVis is written in JS and Python, it should work on all common platforms.

**Tip:** A red loading bar appears at the very top of your browser whenever something is processing.

**Tip:** Our [demonstration video](https://vimeo.com/772796696) also presents the following steps, using the same data set (from 02:04 until 08:00).

- Step 1: Make sure the "Vehicle Silhouette" data set is selected (top-left corner), then reload/refresh the `localhost:8080` page open in your browser. **Please note** that the first time you execute the analysis and, consequently, run the hyperparameter search, it might take a few minutes for the XGBoost classifier's hyperparameters to be tuned with Bayesian optimization. After the first run, the results are cached and re-used to make the process faster.
- Step 2: When the *Data Space* is populated with the data points, click on the stacked bar chart with value *13* for the *Number of Neighbors*, as shown in Figure 6(a).
- Step 3: We continue by selecting *Undersampling (US)* from the *Data Sets and Sampling Techniques* panel, and then clicking on the *OSS* option to activate this undersampling algorithm.
- Step 4: After the loading process is over, we set the *Seeds* value to *250* (see Figure 6(c)). Afterward, we choose the value *125* for the same parameter (cf. Figure 6(d)).
- Step 5: At this point, we click on *Rare* under the *Types* parameter to deactivate the algorithm's application to these instance types. In Figure 6(f), we can observe the result of this action. After everything reloads, we click on the *Outlier* type to deactivate this particular type, too (visible through the removal of the *tick* symbol).
- Step 6: Next, we select all data points in the *Data Space* view by holding down the left mouse button and moving the mouse to surround them. This uses the lasso functionality implemented in HardVis, with dashed lines appearing in the *Data Space* view. After waiting until the dashed lines disappear, we press the *Execute Undersample* button in the same view.
- Step 7: Afterward, we try another undersampling phase. Thus, we click the *OSS* button again to repeat the process. Since the results become worse, we completely deactivate this undersampling algorithm by clicking the *Disabled* option. Please wait until the red loading bar at the very top is no longer visible.
- Step 8: To reach the view shown in Figure 1, we switch to *Oversampling (OS)* and click the *SMOTE* option to activate this oversampling algorithm, as illustrated in Figure 1(a). Please wait until everything loads. Finally, we deactivate the *Outlier* option from the *Types* parameter.

**Outcome:** By following the process above, you should be able to reproduce precisely the results presented in Figure 1 of the paper. Thank you for your time!

# Corresponding Author #
For any questions regarding the implementation or the paper, feel free to contact [Angelos Chatzimparmpas](mailto:angelos.chatzimparmpas@lnu.se).
Binary file added __pycache__/DBCV.cpython-38.pyc
Binary file added __pycache__/__init__.cpython-38.pyc
Binary file added __pycache__/run.cpython-38.pyc
Binary file added __pycache__/run.cpython-39.pyc
@@ -0,0 +1 @@
{"duration": 151.95722007751465, "input_args": {}}
Binary file not shown.
92 changes: 92 additions & 0 deletions cachedir/joblib/run/callKSearch/func_code.py
@@ -0,0 +1,92 @@
# first line: 480
@memory.cache
def callKSearch ():
    print('findKValueNow!!!')
    global countPercentageList
    countPercentageList = []
    global storeAllMetricsList
    storeAllMetricsList = []
    global sortShepCorrList
    sortShepCorrList = []
    global GatherSafe
    GatherSafe = []
    global GatherBorder
    GatherBorder = []
    global GatherRare
    GatherRare = []
    global GatherOut
    GatherOut = []
    global UMAPModalStore
    UMAPModalStore = []
    global MaxValue
    MaxValue = []
    global MaxIndex
    MaxIndex = []

    kValuesAll = [5,6,7,8,9,10,11,12,13]
    mDistanceAll = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
    dataNP = XData.to_numpy()
    D_highSpace = distance.squareform(distance.pdist(dataNP))

    for val in kValuesAll:
        safeIndCounter = []
        borderlineIndCounter = []
        rareIndCounter = []
        outIndCounter = []
        countPercentage = []

        nbrs = NearestNeighbors(n_neighbors=val, metric="euclidean", n_jobs = -1).fit(XData)
        distances, indices = nbrs.kneighbors(XData)

        # count, for each instance, how many of its k neighbors share its label
        # (start at -1 because the instance itself is among its own neighbors)
        summarizePerc = []
        for idx, el in enumerate(indices):
            computePer = -1
            for each in el:
                if (yData[idx] == yData[each]):
                    computePer = computePer + 1
            summarizePerc.append(computePer)

        # categorize each instance as safe / borderline / rare / outlier
        for i, el in enumerate(summarizePerc):
            if (el >= (0.8 * val)):
                safeIndCounter.append(i)
            elif (el >= (0.5 * val)):
                borderlineIndCounter.append(i)
            elif (el >= (0.2 * val)):
                rareIndCounter.append(i)
            else:
                outIndCounter.append(i)

        percsafeIndCounter = len(safeIndCounter) / (len(safeIndCounter)+len(borderlineIndCounter)+len(rareIndCounter)+len(outIndCounter))
        percborderlineIndCounter = len(borderlineIndCounter) / (len(safeIndCounter)+len(borderlineIndCounter)+len(rareIndCounter)+len(outIndCounter))
        percrareIndCounter = len(rareIndCounter) / (len(safeIndCounter)+len(borderlineIndCounter)+len(rareIndCounter)+len(outIndCounter))
        percoutIndCounter = len(outIndCounter) / (len(safeIndCounter)+len(borderlineIndCounter)+len(rareIndCounter)+len(outIndCounter))

        countPercentage.append(percsafeIndCounter*100)
        countPercentage.append(percborderlineIndCounter*100)
        countPercentage.append(percrareIndCounter*100)
        countPercentage.append(percoutIndCounter*100)

        countPercentageList.append(countPercentage)

        # pick the UMAP min_dist with the best Shepard diagram correlation
        metricShepCorr = []
        for dis in mDistanceAll:
            SearchUMAP = FunUMAPAll(XData, val, dis)
            D_lowSpace = distance.squareform(distance.pdist(SearchUMAP))
            resultShep = shepard_diagram_correlation(D_highSpace, D_lowSpace)
            metricShepCorr.append(resultShep*100)
        storeAllMetricsList.append(metricShepCorr)
        sortShepCorr = sorted(range(len(metricShepCorr)), key=lambda k: metricShepCorr[k], reverse=True)[0]
        sortShepCorrList.append(sortShepCorr)

        max_value = max(metricShepCorr)
        max_index = metricShepCorr.index(max_value)
        UMAPModal = FunUMAP(XData, val, mDistanceAll[max_index])
        UMAPModalStore.append(UMAPModal)
        GatherSafe.append(safeIndCounter)
        GatherBorder.append(borderlineIndCounter)
        GatherRare.append(rareIndCounter)
        GatherOut.append(outIndCounter)
        MaxValue.append(max_value)
        MaxIndex.append(max_index)

    return [countPercentageList,sortShepCorrList,storeAllMetricsList,UMAPModalStore,GatherSafe,GatherBorder,GatherRare,GatherOut,MaxValue,MaxIndex]
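The core of `callKSearch` — labeling each instance safe, borderline, rare, or outlier from the fraction of its k nearest neighbors that share its label (thresholds 0.8, 0.5, 0.2) — can be exercised standalone. The helper below is our own illustration (function name and toy data are hypothetical), not code from the repository:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def categorize_instances(X, y, k):
    """Label each sample by the share of its k nearest neighbors
    (the query set includes the point itself) that agree on the class."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    _, indices = nbrs.kneighbors(X)
    labels = []
    for i, neigh in enumerate(indices):
        # subtract 1 so the point is not counted as its own neighbor
        same = sum(y[i] == y[j] for j in neigh) - 1
        if same >= 0.8 * k:
            labels.append('safe')
        elif same >= 0.5 * k:
            labels.append('borderline')
        elif same >= 0.2 * k:
            labels.append('rare')
        else:
            labels.append('outlier')
    return labels
```

Because scikit-learn returns the query point among its own neighbors at distance zero, one match is subtracted, mirroring the `computePer = -1` initialization in the cached function above.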
@@ -0,0 +1 @@
{"duration": 3.770894765853882, "input_args": {"n_estimators": "122.21742728992572", "eta": "0.06452090304204987", "max_depth": "11.197056874649611", "subsample": "0.941614515559209", "colsample_bytree": "0.8311989040672406"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 3.1827189922332764, "input_args": {"n_estimators": "121.73840441842214", "eta": "0.28767857660247903", "max_depth": "10.39196365086843", "subsample": "0.8312037280884873", "colsample_bytree": "0.8749080237694725"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.5318150520324707, "input_args": {"n_estimators": "114.44905352605177", "eta": "0.22831119680574874", "max_depth": "10.564710291701385", "subsample": "0.9541934359909122", "colsample_bytree": "0.8239188491876603"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.116663694381714, "input_args": {"n_estimators": "72.47197879739036", "eta": "0.10844857557015576", "max_depth": "6.3014333988327635", "subsample": "0.9021891613488681", "colsample_bytree": "0.9944126787690761"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 1.6168720722198486, "input_args": {"n_estimators": "46.40612658226385", "eta": "0.2924774630404986", "max_depth": "10.99465584480253", "subsample": "0.8363649934414201", "colsample_bytree": "0.8041168988591605"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 3.2985379695892334, "input_args": {"n_estimators": "164.01497854869265", "eta": "0.0996789203835431", "max_depth": "6.033132702741614", "subsample": "0.9413714687695234", "colsample_bytree": "0.9544489538593315"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.4228010177612305, "input_args": {"n_estimators": "112.07270186497003", "eta": "0.14082774572514967", "max_depth": "11.929172318453045", "subsample": "0.9543434280122952", "colsample_bytree": "0.9863189678041957"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 1.3942599296569824, "input_args": {"n_estimators": "74.90081706613316", "eta": "0.24281758667148645", "max_depth": "6.444267910404542", "subsample": "0.823173811905026", "colsample_bytree": "0.9458014336081975"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.0340428352355957, "input_args": {"n_estimators": "123.31233076168427", "eta": "0.3", "max_depth": "8.868729253728564", "subsample": "0.8", "colsample_bytree": "0.8478619789553219"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 0.4379770755767822, "input_args": {"n_estimators": "13.819321337554923", "eta": "0.07212312551297988", "max_depth": "7.175897174514871", "subsample": "0.8650660661526529", "colsample_bytree": "0.9843748470046234"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.51269793510437, "input_args": {"n_estimators": "179.94824742138775", "eta": "0.26054377863611405", "max_depth": "7.776976381564504", "subsample": "0.9240763899531264", "colsample_bytree": "0.9165355749433655"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.693844795227051, "input_args": {"n_estimators": "138.4254401698706", "eta": "0.1261534422933427", "max_depth": "6.586032684038303", "subsample": "0.8880304987479203", "colsample_bytree": "0.9616794696232922"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 3.362119197845459, "input_args": {"n_estimators": "179.4913333333915", "eta": "0.24378320584027863", "max_depth": "11.636993649385134", "subsample": "0.919579995762217", "colsample_bytree": "0.9939169255529117"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 1.0949859619140625, "input_args": {"n_estimators": "17.393878305774606", "eta": "0.20582453170688947", "max_depth": "7.985388149115895", "subsample": "0.8621964643431325", "colsample_bytree": "0.9726206851751187"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 3.5751378536224365, "input_args": {"n_estimators": "120.52084092809828", "eta": "0.09991844553958994", "max_depth": "9.08540663048167", "subsample": "0.8092900825439996", "colsample_bytree": "0.9570351922786027"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.3303470611572266, "input_args": {"n_estimators": "89.22927863521258", "eta": "0.12606056073988442", "max_depth": "9.148538589793427", "subsample": "0.8582458280396084", "colsample_bytree": "0.8366809019706868"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 0.18697500228881836, "input_args": {"n_estimators": "9.956729715098561", "eta": "0.18068320734549853", "max_depth": "8.565246110151298", "subsample": "0.821578285398661", "colsample_bytree": "0.8987591192728782"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 3.523658275604248, "input_args": {"n_estimators": "76.44055944226989", "eta": "0.08487346516301046", "max_depth": "7.752867891211309", "subsample": "0.8912139968434072", "colsample_bytree": "0.9223705789444759"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 3.6108059883117676, "input_args": {"n_estimators": "74.56689870524991", "eta": "0.11783725794347398", "max_depth": "10.972425054911575", "subsample": "0.8561869019374762", "colsample_bytree": "0.8777354579378964"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 1.841581106185913, "input_args": {"n_estimators": "114.22795618793002", "eta": "0.1910107263433683", "max_depth": "8.16101270435952", "subsample": "0.8926456235313323", "colsample_bytree": "0.8"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.5729548931121826, "input_args": {"n_estimators": "182.3174784053625", "eta": "0.17379422752781754", "max_depth": "6.20633112669131", "subsample": "0.8517559963200034", "colsample_bytree": "0.8244076469689559"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.7057480812072754, "input_args": {"n_estimators": "111.60850447193953", "eta": "0.12792776902235276", "max_depth": "9.120408127066865", "subsample": "0.8369708911051055", "colsample_bytree": "0.9325044568707964"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.622046947479248, "input_args": {"n_estimators": "190.03267976439997", "eta": "0.09263103092182289", "max_depth": "6.390309557911677", "subsample": "0.9931264066149119", "colsample_bytree": "0.9215089703802877"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 2.799971103668213, "input_args": {"n_estimators": "178.00648480238368", "eta": "0.232401544584516", "max_depth": "9.825344828131279", "subsample": "0.8944429850323898", "colsample_bytree": "0.8650366644053494"}}
Binary file not shown.
@@ -0,0 +1 @@
{"duration": 1.4785480499267578, "input_args": {"n_estimators": "19.53737551755531", "eta": "0.08523105624369066", "max_depth": "10.813181884524237", "subsample": "0.9973773873201035", "colsample_bytree": "0.9085392166316497"}}
Binary file not shown.
12 changes: 12 additions & 0 deletions cachedir/joblib/run/estimator/func_code.py
@@ -0,0 +1,12 @@
# first line: 464
@memory.cache
def estimator(n_estimators, eta, max_depth, subsample, colsample_bytree):
    # initialize model
    print('modelsCompNow!!!!!')
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    model = XGBClassifier(n_estimators=n_estimators, eta=eta, max_depth=max_depth, subsample=subsample, colsample_bytree=colsample_bytree, n_jobs=-1, random_state=RANDOM_SEED, seed=RANDOM_SEED, silent=True, verbosity = 0, use_label_encoder=False)
    # set in cross-validation
    result = cross_validate(model, XData, yData, cv=crossValidation, scoring='accuracy')
    # result is mean of test_score
    return np.mean(result['test_score'])
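Because Bayesian optimization (the `bayesian-optimization` package in `requirements.txt`) proposes points in a continuous box, `estimator` truncates `n_estimators` and `max_depth` to integers before building the model. A stdlib-only stand-in — random search over hypothetical bounds, with a dummy objective in place of the cross-validated XGBoost score — illustrates the same cast-then-evaluate pattern; bounds, function names, and the objective are our assumptions, not values from the repository:

```python
import random

PBOUNDS = {  # hypothetical bounds mirroring the tuned hyperparameters
    'n_estimators': (5, 200),
    'eta': (0.05, 0.3),
    'max_depth': (6, 12),
    'subsample': (0.8, 1.0),
    'colsample_bytree': (0.8, 1.0),
}

def dummy_objective(n_estimators, eta, max_depth, subsample, colsample_bytree):
    # stand-in for the cross-validated accuracy; integer-valued parameters
    # are truncated exactly as in the cached estimator() above
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    return 1.0 - abs(n_estimators - 100) / 200 - abs(max_depth - 9) / 12

def random_search(objective, pbounds, n_iter=50, seed=42):
    """Sample the continuous box uniformly and keep the best score."""
    rng = random.Random(seed)
    best_score, best_params = float('-inf'), None
    for _ in range(n_iter):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in pbounds.items()}
        score = objective(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

A Bayesian optimizer replaces the uniform sampling with a surrogate-guided proposal, but the objective side — casting the continuous proposals to integers before evaluation — stays the same.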
4 changes: 2 additions & 2 deletions frontend/src/components/DataSetSlider.vue
@@ -2,9 +2,9 @@
 <div>
 <label id="data" for="param-dataset" data-toggle="tooltip" data-placement="right" title="Tip: use one of the data sets already provided or upload a new file.">{{ dataset }}</label>
 <select id="selectFile" @change="selectDataSet()">
-<option value="VehicleC.csv" >Vehicle Silhouette</option>
+<option value="VehicleC.csv" selected>Vehicle Silhouette</option>
 <option value="breastC.csv" >Breast Cancer</option>
-<option value="IrisC.csv" selected>Iris Flower</option>
+<option value="IrisC.csv" >Iris Flower</option>
 </select>
 <button style="float: right;" class="btn-outline-dark"
 id="know"
4 changes: 3 additions & 1 deletion requirements.txt
@@ -2,10 +2,12 @@ pymongo~=3.11.0
 Flask~=1.1.2
 Flask-PyMongo~=2.3.0
 Flask-Cors~=3.0.9
-scipy~=1.5.2
+numpy~=1.21.4
 pandas~=1.1.2
+joblib~=1.1.0
 scikit-learn~=0.23.2
+scipy~=1.5.2
 xgboost~=1.3.3
 bayesian-optimization~=1.2.0
 umap-learn~=0.5.3
 imblearn~=0.0
Binary file added thumbnail_representative.png
