# 💻 DataQAHelper Model Comparison and Interpretation Tutorial

DataQAHelper is a Python framework that supports low-code data-to-text application development. It aims to automate machine learning workflows and convert unintuitive analysis results into easy-to-understand FAQ-like text reports.

Unlike most data science tools, DataQAHelper focuses on the often-overlooked data interpretation stage, using FAQ-like text reports to help users understand analysis results faster and act on them.


# 💻 Before Starting

DataQAHelper is tested and supported on Python 3.10.

The framework is built on three core components:

- DataScienceComponents.py contains many data science algorithms.
- NLGComponents.py contains many default natural language generation algorithms for editing templates in template files.
- IntegratedPipelines.py contains many predefined pipelines, which connect DataScienceComponents.py and NLGComponents.py in the form of questions and answers.

To get a quick understanding of a dataset, calling the corresponding pipeline in IntegratedPipelines.py is enough. Users can also customize any of the components above so that the generated FAQ-like reports better fit their requirements.

You can download DataQAHelper from GitHub.
If you are working on Colab, the following example commands install the required packages and add the framework to the Python path. Note the folder name: the one used here is DataQAHelperWithoutLLM.

In [1]:
from google.colab import drive
drive.mount('/content/drive')
!pip install -r /content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/requirements.txt
import sys
sys.path.append('/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
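Before importing the framework, it can help to confirm that the folder added to `sys.path` actually contains the three component files listed earlier. The sketch below is a hypothetical helper (`missing_components` is not part of DataQAHelper) that uses only the standard library. The folder path is simply the one passed to `sys.path.append` above.

```python
from pathlib import Path

# File names taken from the component list in this tutorial.
COMPONENT_FILES = (
    "DataScienceComponents.py",
    "NLGComponents.py",
    "IntegratedPipelines.py",
)


def missing_components(folder):
    """Return the component files that are not present in `folder`."""
    root = Path(folder)
    return [name for name in COMPONENT_FILES if not (root / name).is_file()]


# Example with the Colab path used above; an empty list means all files were found.
framework_dir = "/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/"
print(missing_components(framework_dir))
```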


In [2]:
# Colab's preinstalled packages change over time, so some packages must be pinned to specific versions. pip may report dependency conflicts, but they do not affect the use of the framework in Colab. This issue does not occur when the framework is downloaded and used locally.
!pip install pandas==2.0.0
!pip install scipy==1.8.0

Collecting pandas==2.0.0
  Using cached pandas-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Using cached pandas-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
arviz 0.18.0 requires scipy>=1.9.0, but you have scipy 1.8.0 which is incompatible.
google-colab 1.0.0 requires pandas==2.1.4, but you have pandas 2.0.0 which is incompatible.
plotnine 0.12.4 requires statsmodels>=0.14.0, but you have statsmodels 0.13.5 which is incompatible.
pycaret 3.0.0 requires pandas<1.6.0,>=1.3.0, but you have pandas 2.0.0 which is incompatible.
Successfully installed
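To confirm which versions ended up installed after the pinning step, the standard-library `importlib.metadata` module can be queried. This is a small sketch rather than part of the framework: the helper name `installed_versions` is made up for illustration, and the pins simply mirror the two `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands in the cell above.
PINS = {"pandas": "2.0.0", "scipy": "1.8.0"}


def installed_versions(pins):
    """Map each package to (wanted, installed); installed is None if absent."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in installed_versions(PINS).items():
    status = "ok" if found == wanted else "differs"
    print(f"{name}: wanted {wanted}, installed {found} ({status})")
```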

# 🚀 Quick Start

The integrated pipelines in DataQAHelper support most supervised machine learning models, which estimate the relationship between a dependent variable (outcome or target) and one or more independent variables (features, predictors, or covariates). When the user is unsure which model suits a dataset, the pipelines can also recommend one by comparing the performance of different models.

## Example 1

The following example shows how to quickly build an application based on the framework to recommend a classifier for users.

In [3]:
## Example: the user has a dataset suited to classification
## but does not know where to start the analysis,
## so the framework searches automatically for the most suitable classifier.

# Read dataset
from pandas import read_csv
irisdata = read_csv('/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/data/Iris.csv',header=0)
# Setting independent and dependent variables
Xcol=['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']
ycol='Species'
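A common early failure is a misspelled column name. The following sketch, which assumes nothing about DataQAHelper itself, checks a CSV header against the chosen feature and target names before any pipeline is run. `check_columns` is a hypothetical helper, and the inline header row is a stand-in built from the column names in the cell above, not the contents of the real Iris.csv.

```python
import csv
import io


def check_columns(csv_text, feature_cols, target_col):
    """Return the requested column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    wanted = list(feature_cols) + [target_col]
    return [name for name in wanted if name not in header]


# Stand-in header row (assumption) using the column names from the cell above.
sample = "SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
Xcol = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
ycol = "Species"

print(check_columns(sample, Xcol, ycol))             # []
print(check_columns(sample, ["SepalLength"], ycol))  # ['SepalLength']
```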

In [4]:
# Import IntegratedPipelines from the framework
import IntegratedPipelines as IP
# List the models to exclude from the comparison.
# This argument can be omitted, in which case the predefined pipeline uses its default value.
exclude = ['qda','knn','nb','dummy']
pipeline = IP.find_best_mode_pipeline()
pipeline.FindBestClassifierPipeline(irisdata,Xcol,ycol,exclude=exclude)

Index,Description,Value
0,Session id,7164
1,Target,Species
2,Target type,Multiclass
3,Target mapping,"Iris-setosa: 0, Iris-versicolor: 1, Iris-virginica: 2"
4,Original data shape,"(120, 5)"
5,Transformed data shape,"(120, 5)"
6,Transformed train set shape,"(84, 5)"
7,Transformed test set shape,"(36, 5)"
8,Numeric features,4
9,Preprocess,True


ID,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.9639,1.0,0.9639,0.975,0.9636,0.9461,0.9516,0.121
lr,Logistic Regression,0.9528,1.0,0.9528,0.9667,0.9521,0.9295,0.9365,0.686
gbc,Gradient Boosting Classifier,0.9417,0.9773,0.9417,0.9583,0.9407,0.9128,0.9214,0.799
rf,Random Forest Classifier,0.9403,1.0,0.9403,0.9573,0.9385,0.9099,0.9191,0.799
et,Extra Trees Classifier,0.9403,0.9975,0.9403,0.9573,0.9385,0.9099,0.9191,0.579
lightgbm,Light Gradient Boosting Machine,0.9389,0.9715,0.9389,0.9573,0.9374,0.908,0.9175,0.206
ada,Ada Boost Classifier,0.9292,0.9899,0.9292,0.949,0.927,0.8933,0.904,0.272
xgboost,Extreme Gradient Boosting,0.9278,0.9733,0.9278,0.9479,0.9218,0.8913,0.9034,0.136
dt,Decision Tree Classifier,0.9042,0.93,0.9042,0.9312,0.9017,0.8556,0.8703,0.079
ridge,Ridge Classifier,0.8333,0.0,0.8333,0.8535,0.8312,0.748,0.7592,0.092
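The comparison table is sorted by accuracy, and the recommended model is the top row. As a rough illustration of that selection step (not the framework's implementation), the sketch below parses three rows copied from the table and picks the model with the highest accuracy. The helper name `best_by` is hypothetical.

```python
import csv
import io

# Three rows copied from the comparison table above (ID, model name, accuracy, AUC).
TABLE = """ID,Model,Accuracy,AUC
lda,Linear Discriminant Analysis,0.9639,1.0
lr,Logistic Regression,0.9528,1.0
ridge,Ridge Classifier,0.8333,0.0
"""


def best_by(rows, metric):
    """Return the row with the highest value for `metric` (first row wins ties)."""
    return max(rows, key=lambda row: float(row[metric]))


rows = list(csv.DictReader(io.StringIO(TABLE)))
best = best_by(rows, "Accuracy")
print(best["ID"], best["Model"], best["Accuracy"])
```

When reading the table, note that the Ridge Classifier's AUC of 0.0 most likely means the metric could not be computed for that model, since a ridge classifier does not produce class probabilities. It should not be read as a genuinely poor AUC.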



Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,0.8889,1.0,0.8889,0.9167,0.8857,0.8333,0.8492
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0
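Each row of the fold table is one cross-validation fold, and the accuracy listed for a model in the comparison table is the mean over its folds. The short check below reproduces that arithmetic for the ten accuracy values above with the standard library. The mean comes out at 0.9639, the value listed for Linear Discriminant Analysis. Using the population standard deviation for the spread is an assumption made for illustration.

```python
from statistics import fmean, pstdev

# Accuracy per fold, copied from the table above.
fold_accuracy = [1.0, 1.0, 1.0, 0.8889, 1.0, 0.875, 1.0, 0.875, 1.0, 1.0]

mean_accuracy = fmean(fold_accuracy)
spread = pstdev(fold_accuracy)  # population standard deviation (assumed convention)

print(f"mean accuracy over {len(fold_accuracy)} folds: {mean_accuracy:.4f}")
print(f"standard deviation across folds: {spread:.4f}")
```

A mean close to 1.0 with a visible spread, as here, is a reminder that on a small dataset a single fold can shift accuracy by more than ten percentage points.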



Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0



Fitting 10 folds for each of 10 candidates, totalling 100 fits


lda


Index,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,1.0,1.0,0,0,0,1.0,1.0


The accuracy of the tuned Linear Discriminant Analysis is 1.0, indicating a high accuracy. The classifier's performance is excellent, and the predictions are very reliable.

Additionally, the AUC of the tuned model is 1.0, which indicates excellent discriminatory ability. The classifier's predictions are very reliable.
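Sentences like the two above are the kind of output a template-based NLG step produces: a metric is mapped to a qualitative label, and the label selects the wording. The sketch below illustrates the idea only. The function name `describe_accuracy`, the cut-offs (0.9 and 0.7), and the wording of the lower bands are assumptions, not values taken from NLGComponents.py.

```python
def describe_accuracy(model_name, accuracy):
    """Turn an accuracy score into a short FAQ-style sentence.

    The cut-offs 0.9 and 0.7 are illustrative assumptions.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if accuracy >= 0.9:
        level, verdict = "high", "the predictions are very reliable"
    elif accuracy >= 0.7:
        level, verdict = "moderate", "the predictions are fairly reliable"
    else:
        level, verdict = "low", "the predictions should be treated with caution"
    return (
        f"The accuracy of the tuned {model_name} is {accuracy}, "
        f"indicating a {level} accuracy; {verdict}."
    )


print(describe_accuracy("Linear Discriminant Analysis", 1.0))
```

Customizing a report then amounts to changing the cut-offs or the wording in one place, without touching the modelling code.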

The table below shows information about the results of fitting each untuned model.
Dash is running on http://127.0.0.1:8050/




## Example 2

The following example shows how to quickly build an application based on the framework to recommend a regression model for users.

In [5]:
## Example: the user has a dataset suited to regression
## but does not know where to start the analysis,
## so the framework searches automatically for the most suitable regression model.
data = read_csv('/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/data/CarPrice_Assignment.csv',header=0)
Xcol=['carlength', 'carwidth', 'carheight','curbweight','enginesize','boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','highwaympg']
ycol='price'
pipeline = IP.find_best_mode_pipeline()

In [6]:
# List the models to exclude from the comparison.
# This argument can be omitted, in which case the predefined pipeline uses its default value.
exclude = ['dt','knn','dummy','xgboost','et','rf','lightgbm','gbr','ada','lasso','en']
pipeline.FindBestRegressionPipeline(data,Xcol,ycol,exclude=exclude)

Index,Description,Value
0,Session id,4834
1,Target,price
2,Target type,Regression
3,Original data shape,"(164, 13)"
4,Transformed data shape,"(164, 13)"
5,Transformed train set shape,"(114, 13)"
6,Transformed test set shape,"(50, 13)"
7,Numeric features,12
8,Preprocess,True
9,Imputation type,simple


ID,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
ridge,Ridge Regression,2497.937,10994906.7329,3214.5932,0.6796,0.2263,0.1935,0.094
llar,Lasso Least Angle Regression,2496.8741,11083908.436,3225.8767,0.6789,0.2259,0.193,0.119
lr,Linear Regression,2497.8232,11095766.6942,3227.6177,0.6787,0.226,0.1932,0.081
lar,Least Angle Regression,2533.9931,11328599.2647,3264.0936,0.6727,0.2315,0.1964,0.102
huber,Huber Regressor,2404.9149,12811889.3153,3456.4602,0.6628,0.2077,0.1646,0.098
br,Bayesian Ridge,2433.8398,11839636.5666,3358.1741,0.6407,0.2066,0.1724,0.064
omp,Orthogonal Matching Pursuit,2771.4979,19240990.9767,4079.5812,0.4632,0.2402,0.1953,0.105
par,Passive Aggressive Regressor,4442.8463,38378157.0009,5778.8454,0.0481,0.4854,0.3626,0.051



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2093.3578,6391567.1534,2528.1549,0.6577,0.2156,0.1787
1,2221.0328,7736238.5367,2781.4095,0.8424,0.2846,0.2012
2,2395.1858,10747155.0296,3278.2854,0.9043,0.1779,0.1663
3,2404.0381,10597741.7644,3255.4173,0.521,0.2381,0.2119
4,2548.7544,9493247.4481,3081.1114,0.1404,0.2151,0.21
5,2276.5061,9172945.39,3028.6871,0.8315,0.2161,0.1848
6,3208.5196,16811244.9785,4100.1518,0.8122,0.2082,0.1788
7,2108.2264,7542981.9585,2746.449,0.5352,0.2801,0.2369
8,1715.1598,4742743.2113,2177.784,0.7603,0.1868,0.1495
9,4008.5898,26713201.8589,5168.4816,0.7906,0.2399,0.2167
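The Ridge Regression row in the comparison table is the mean of these per-fold values. One consequence worth knowing is that the reported RMSE is the average of the per-fold RMSEs, which is not the same as the square root of the reported MSE. The check below reproduces both numbers from the fold table above using only the standard library.

```python
from math import sqrt
from statistics import fmean

# Per-fold MSE and RMSE, copied from the fold table above.
fold_mse = [
    6391567.1534, 7736238.5367, 10747155.0296, 10597741.7644, 9493247.4481,
    9172945.39, 16811244.9785, 7542981.9585, 4742743.2113, 26713201.8589,
]
fold_rmse = [
    2528.1549, 2781.4095, 3278.2854, 3255.4173, 3081.1114,
    3028.6871, 4100.1518, 2746.449, 2177.784, 5168.4816,
]

mean_mse = fmean(fold_mse)    # corresponds to the MSE column of the comparison table
mean_rmse = fmean(fold_rmse)  # corresponds to the RMSE column of the comparison table

print(f"mean of fold MSE:  {mean_mse:.4f}")
print(f"mean of fold RMSE: {mean_rmse:.4f}")
print(f"sqrt of mean MSE:  {sqrt(mean_mse):.4f}")
```

The gap between the last two numbers comes from averaging after taking the square root: folds with unusually large errors, such as fold 9 here, pull the mean MSE up more than they pull the mean RMSE.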



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2025.3183,6089367.7084,2467.6644,0.6739,0.2119,0.173
1,2251.7904,7817234.5764,2795.9318,0.8407,0.2875,0.2034
2,2409.9434,10776365.0593,3282.7374,0.9041,0.179,0.1678
3,2407.0386,10634563.6408,3261.0679,0.5193,0.2403,0.2127
4,2597.0903,9702280.0449,3114.8483,0.1215,0.2181,0.2145
5,2247.8904,8990889.9415,2998.4813,0.8349,0.2149,0.1835
6,3197.3022,16853608.0795,4105.3146,0.8117,0.208,0.1788
7,2122.3046,7581250.9516,2753.4072,0.5328,0.2827,0.2388
8,1732.1407,4790649.09,2188.7551,0.7579,0.1869,0.1506
9,4026.7544,26104015.1868,5109.2089,0.7953,0.2396,0.219



Fitting 10 folds for each of 10 candidates, totalling 100 fits


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
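The message above reflects a simple rule: after tuning, the cross-validated score of the tuned model is compared with that of the original, and the better model is kept. The sketch below mimics that rule with the R2 columns of the two fold tables above, taking the first table as the original model and the second as the tuned one. It assumes the comparison uses mean R2 across folds, which is a guess about the underlying default rather than something stated in this tutorial, and `keep_better` is a hypothetical helper.

```python
from statistics import fmean

# R2 per fold, copied from the two fold tables above.
original_r2 = [0.6577, 0.8424, 0.9043, 0.521, 0.1404, 0.8315, 0.8122, 0.5352, 0.7603, 0.7906]
tuned_r2 = [0.6739, 0.8407, 0.9041, 0.5193, 0.1215, 0.8349, 0.8117, 0.5328, 0.7579, 0.7953]


def keep_better(original_scores, tuned_scores):
    """Return which model to keep; ties go to the original (higher is better)."""
    original_mean = fmean(original_scores)
    tuned_mean = fmean(tuned_scores)
    label = "tuned" if tuned_mean > original_mean else "original"
    return label, original_mean, tuned_mean


label, original_mean, tuned_mean = keep_better(original_r2, tuned_r2)
print(f"original mean R2: {original_mean:.4f}")
print(f"tuned mean R2:    {tuned_mean:.4f}")
print(f"kept: {label}")
```

The difference between the two means is tiny, so the tuned and original models are practically equivalent on this dataset; the rule only guarantees that tuning never makes the returned model worse on the chosen metric.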


ridge


Index,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Ridge Regression,2390.152,10348904.9376,3216.9714,0.8285,0.2458,0.1921


Dash is running on http://127.0.0.1:8050/


