# 💻 DataQAHelper Model Fitting and Interpretation Tutorial

DataQAHelper is a Python framework that supports low-code data-to-text application development. It aims to automate machine learning workflows and convert unintuitive analysis results into easy-to-understand FAQ-like text reports.

Compared with other data science tools, DataQAHelper focuses on the easily overlooked data interpretation stage, using FAQ-like textual reports to help users understand data analysis results faster and take necessary actions.


# 💻 Before Starting

DataQAHelper is tested and supported on Python 3.10.

The framework is built on three core components:

- DataScienceComponents.py contains many data science algorithms.
- NLGComponents.py contains many default natural language generation algorithms for editing templates in template files.
- IntegratedPipelines.py contains many predefined pipelines, which connect DataScienceComponents.py and NLGComponents.py in the form of questions and answers.

To explore a dataset quickly, calling the corresponding function in IntegratedPipelines.py is enough. Users can also customize any of the three components so that the generated FAQ-like reports better match their requirements.
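
The question-and-answer pairing of the components can be pictured with a toy sketch. It is not DataQAHelper's actual code: `fit_summary`, `ANSWER_TEMPLATE`, and `answer_question` are hypothetical names chosen for illustration, standing in for a data science step, a template, and a pipeline respectively.

```python
from statistics import mean
from string import Template

# Hypothetical template, standing in for an entry in a template file.
ANSWER_TEMPLATE = Template(
    "Q: What is the typical value of $name?\n"
    "A: The average $name is $avg across $n records."
)

def fit_summary(values):
    """Hypothetical data science step: reduce raw values to a few statistics."""
    return {"avg": round(mean(values), 2), "n": len(values)}

def answer_question(name, values):
    """Hypothetical pipeline: run the analysis, then fill the template."""
    stats = fit_summary(values)
    return ANSWER_TEMPLATE.substitute(name=name, **stats)

# Illustrative values only.
print(answer_question("glucose level", [85, 110, 148, 97]))
```

Keeping the analysis and the wording in separate functions mirrors the idea that either side can be customized without touching the other.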

You can download DataQAHelper from GitHub.
If you are working on Colab, here are some example commands to install the required packages and set the path to the framework:

In [1]:
from google.colab import drive
drive.mount('/content/drive')
!pip install -r /content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/requirements.txt
import sys
sys.path.append('/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Colab's preinstalled package versions change over time, so some packages must be reinstalled at the specific versions below. pip may report dependency conflicts, but they do not affect the use of the framework in Colab. This issue does not occur when the framework is downloaded and used locally.
!pip install pandas==2.0.0
!pip install scipy==1.8.0

Collecting pandas==2.0.0
  Using cached pandas-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Using cached pandas-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
arviz 0.18.0 requires scipy>=1.9.0, but you have scipy 1.8.0 which is incompatible.
google-colab 1.0.0 requires pandas==2.1.4, but you have pandas 2.0.0 which is incompatible.
plotnine 0.12.4 requires statsmodels>=0.14.0, but you have statsmodels 0.13.5 which is incompatible.
pycaret 3.0.0 requires pandas<1.6.0,>=1.3.0, but you have pandas 2.0.0 which is incompatible.
Successfully installed
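
Since pip reports conflicts but still installs the requested versions, it can help to confirm which versions are actually present. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an illustrative choice, not part of DataQAHelper. In Colab, a runtime restart may be needed before a newly installed version is the one that gets imported.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Print the versions of the two packages pinned above.
for name in ("pandas", "scipy"):
    print(name, installed_version(name))
```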

# 🚀 Quick start

The integrated pipelines in DataQAHelper support most supervised machine learning models, which estimate the relationship between a dependent variable (outcome or target) and one or more independent variables (features, predictors, or covariates). This includes many models used for regression and classification. Regression predicts continuous values such as sales amounts, quantities, or temperatures. Classification predicts categorical class labels, which can be multiclass, such as species types, or binary, such as whether a disease is present.

A typical workflow in DataQAHelper application development consists of the following three steps in order:

## **Rapid Prototyping** ➡️ **Reviewing Analysis Results** ➡️ **Iterative Prototype Updates**

## Example 1

The following example shows how to build an application on top of the framework that uses a logistic regression model to explore a classic dataset, the diabetes dataset.

The raw output of a logistic regression model is not intuitive. There are many criteria for judging the quality of the model, and the coefficients of the independent variables can be interpreted in different ways. Depending on the requirements, the P-values may also need to be considered.
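
To make the points about coefficients and P-values concrete: in a logistic regression, a coefficient is a change in log-odds, so exponentiating it gives an odds ratio. The sketch below uses only the standard library. The coefficient and P-value are made-up numbers for illustration (not results from the diabetes dataset), `describe_effect` is a hypothetical helper, and the 0.05 threshold is a common convention rather than something the framework prescribes.

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)

def describe_effect(name, coefficient, p_value, alpha=0.05):
    """Turn one coefficient into a plain-language sentence."""
    if p_value >= alpha:
        return f"{name} shows no statistically significant effect (p = {p_value:.3f})."
    # Percentage change in the odds for a one-unit increase in the variable.
    change = (odds_ratio(coefficient) - 1.0) * 100.0
    direction = "increase" if change >= 0 else "decrease"
    return (f"A one-unit increase in {name} is associated with a "
            f"{abs(change):.1f}% {direction} in the odds of the positive class "
            f"(p = {p_value:.3f}).")

# Illustrative values only.
print(describe_effect("Glucose", 0.035, 0.001))
```

This is the kind of translation from numbers to sentences that an FAQ-like report has to perform for every independent variable.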

In [3]:
# Read dataset
from pandas import read_csv
data=read_csv('/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/data/diabetesWithhead.csv',header=0)
# Setting independent and dependent variables
Xcol=['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']
ycol='Classification'

In [4]:
# Import IntegratedPipelines
import IntegratedPipelines as IP
## Run the pipeline as the prototype and review the analysis results.
pipeline = IP.general_datastory_pipeline()
pipeline.LogisticFit(data, Xcol, ycol)

Optimization terminated successfully.
         Current function value: 0.622858
         Iterations 5
Dash is running on http://127.0.0.1:8050/




In [5]:
## After reviewing the FAQ-like report above, make any necessary updates to the prototype.
## The example below shows some minor updates.
## Users can also modify DataScienceComponents.py or NLGComponents.py to add functionality or refine the templates directly.

# Make the independent variable names more readable.
Xnewname = ['Number of Pregnancies', 'Glucose Level', 'Blood Pressure', 'Skin Thickness', 'Insulin Level', 'Body Mass Index (BMI)', 'Diabetes Pedigree Function', 'Age']
# Describe what a positive classification result means.
pos_class_mean='the individual has been diagnosed with diabetes'
# Run the pipeline again to see the improved report.
pipeline.LogisticFit(data, Xcol, ycol,Xnewname=Xnewname,pos_class_mean=pos_class_mean)

Optimization terminated successfully.
         Current function value: 0.622858
         Iterations 5
Dash is running on http://127.0.0.1:8050/



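
Because `Xnewname` is a plain list, a reasonable assumption is that its labels are matched to `Xcol` by position, so the two lists need the same length and order. A small check like the following can catch a mismatch before the pipeline runs. The helper `pair_labels` is hypothetical and not part of DataQAHelper, and the lists are shortened for illustration.

```python
def pair_labels(columns, labels):
    """Pair each column name with its readable label, failing fast on a mismatch."""
    if len(columns) != len(labels):
        raise ValueError(
            f"Expected {len(columns)} labels but got {len(labels)}."
        )
    return dict(zip(columns, labels))

# Shortened example lists.
Xcol = ['Pregnancies', 'Glucose', 'BMI']
Xnewname = ['Number of Pregnancies', 'Glucose Level', 'Body Mass Index (BMI)']
for column, label in pair_labels(Xcol, Xnewname).items():
    print(f"{column} -> {label}")
```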

## Example 2

This second example shows the domain independence, generality, and reusability of the predefined pipelines. It also demonstrates functions in the data science component that assist with data preprocessing.

The dataset is Lending Club loan data. The example uses the same logistic regression model and the same predefined pipeline as Example 1 to explore how different factors affect whether a loan is likely to default. The dataset has nearly 400,000 rows, so running on all of it takes more time.

In [6]:
# Read dataset
data = read_csv('/content/drive/MyDrive/ColabNotebooks/DataQAHelperWithoutLLM/data/lending_club_loan_two.csv',header=0)
# To keep the example fast, only the first 1000 rows are processed.
# To process all the data, comment out the following line.
data = data.head(1000)

In [7]:
# Import functions from the data science component to assist with data preprocessing.
import DataScienceComponents as DC
import numpy as np
data=DC.DataEngineering().set_date_columns_to_datetime(data,['earliest_cr_line', 'issue_d'])
data=DC.DataEngineering().set_values_to_0_and_1(data,'loan_status',['Fully Paid', 'Current', 'Does not meet the credit policy. Status:Fully Paid'])
data=DC.DataEngineering().set_values_to_0_and_1(data,'term',' 36 months')
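# Length of credit history in months at the time the loan was issued.
data['credit_hist_in_months'] = ((data['issue_d'] - data['earliest_cr_line'])/np.timedelta64(1, 'M')).astype(int)
# Flag borrowers with at least one public-record bankruptcy.
data['cb_person_bk_on_file_Y'] = data['pub_rec_bankruptcies'].apply(lambda x: 1 if x >= 1 else 0)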
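
The `credit_hist_in_months` line divides a time difference by `np.timedelta64(1, 'M')`. A month is not a fixed-length unit, so one alternative is to count calendar months directly. The sketch below does this with the standard library for a single pair of dates. `months_between` is a hypothetical helper, the dates are illustrative, and the result can differ slightly from the division-based approach in the cell above.

```python
from datetime import date

def months_between(start, end):
    """Count whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # If the day-of-month has not been reached yet, the last month is incomplete.
    if end.day < start.day:
        months -= 1
    return months

# Illustrative dates only.
earliest_credit_line = date(1990, 6, 1)
issue_date = date(2015, 1, 1)
print(months_between(earliest_credit_line, issue_date))
```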

In [8]:
# Setting independent and dependent variables
Xcol=['credit_hist_in_months','cb_person_bk_on_file_Y','annual_inc','dti','loan_amnt','int_rate','term']
ycol='loan_status'
# Optional: Make the independent variables more readable. Give meaning to positive classification results.
Xnewname=['length of credit history','personal bankruptcy','annual income','debt-to-income ratio','loan amount','interest rate','loan term']
pos_class_mean='the loan has problems or is in default'

In [9]:
# Running the pipeline.
pipeline = IP.general_datastory_pipeline()
pipeline.LogisticFit(data, Xcol, ycol,Xnewname=Xnewname,pos_class_mean=pos_class_mean)

Optimization terminated successfully.
         Current function value: 0.468862
         Iterations 6
Dash is running on http://127.0.0.1:8050/


