0.1.0.post1 #2

Merged
merged 9 commits on Jun 15, 2020
24 changes: 13 additions & 11 deletions README.md
@@ -1,33 +1,35 @@
# MLQA <img src="docs/_static/mlqa.png" align="right" width="120"/>

-A package to perform QA for Machine Learning Models.
+A package to perform QA on data flows for Machine Learning.

## Introduction

-MLQA is a Python package that is created to help data scientists, analysts and developers to perform quality assurance (i.e. QA) on [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and 1d arrays, especially for machine learning modeling data flows. It's designed to work with [logging](https://docs.python.org/3/library/logging.html) library to log and notify QA steps in a descriptive way.
+MLQA is a Python package created to help data scientists, analysts, and developers perform quality assurance (i.e. QA) on [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and 1d arrays, especially for machine learning modeling data flows. It's designed to work with the [logging](https://docs.python.org/3/library/logging.html) library to log and notify QA steps in a descriptive way. It includes standalone functions (i.e. [checkers](mlqa/checkers.py)) for different QA activities and the [DiffChecker](mlqa/identifiers.py) class for integrated QA capabilities on data.

## Installation

You can install MLQA with pip.

`pip install mlqa`

MLQA depends on Pandas and Numpy and works in Python 3.5+.

## Quickstart

-You can easily initiate the object and fit a pd.DataFrame.
+[DiffChecker](mlqa/identifiers.py) is designed to perform QA on data flows for ML. You can easily save statistics from the origin data, such as missing value rate, mean, min/max, percentiles, outliers, etc., and later compare the new data against them. This is especially important if you want to keep the prediction data under the same assumptions as the training data.

+Below is a quick example of how it works: just initiate the object and save statistics from the input data.
```python
>>> from mlqa.identifiers import DiffChecker
+>>> import pandas as pd
>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
```

-Then, you can check on new data if it's okay for given criteria. Below, you can see data with increased NA count in column `na_col`. The default threshold is 0.5 which means it should be okay if NA rate is 50% more than the fitted data. NA rate is 50% in the fitted data so up to 75% (i.e. 50*(1+0.5)) should be okay. NA rate is 70% in the new data and, as expected, the QA passes.
+Then, you can check whether new data is okay for the given criteria. Below, you can see some data that is very similar in column `mean_col` but has an increased NA count in column `na_col`. The default threshold is 0.5, which means the check passes as long as the NA rate is at most 50% higher than in the origin data. The NA rate is 50% in the origin data, so anything up to 75% (i.e. 50%*(1+0.5)) is okay. The NA rate is 70% in the new data and, as expected, the QA passes.

```python
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
```
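
If the default threshold is too loose for your case, you can tighten it and repeat the check; mirroring the quickstart docs, the stricter threshold makes the same comparison fail:

```python
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False
```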

Binary file added docs/_static/favicon/android-chrome-192x192.png
Binary file added docs/_static/favicon/android-chrome-512x512.png
Binary file added docs/_static/favicon/apple-touch-icon.png
Binary file added docs/_static/favicon/favicon-16x16.png
Binary file added docs/_static/favicon/favicon-32x32.png
Binary file added docs/_static/favicon/favicon.ico
1 change: 1 addition & 0 deletions docs/_static/favicon/site.webmanifest
@@ -0,0 +1 @@
{"name":"","short_name":"","icons":[{"src":"/android-chrome-192x192.png","sizes":"192x192","type":"image/png"},{"src":"/android-chrome-512x512.png","sizes":"512x512","type":"image/png"}],"theme_color":"#ffffff","background_color":"#ffffff","display":"standalone"}
3 changes: 2 additions & 1 deletion docs/conf.py
@@ -25,7 +25,7 @@
author = 'Dogan Askan'

# The full version, including alpha/beta/rc tags
-release = '0.1.0'
+release = '0.1.0.post1'


# -- General configuration ---------------------------------------------------
@@ -58,6 +58,7 @@
# html_theme = 'alabaster'
html_theme = "pydata_sphinx_theme"
html_logo = "_static/mlqa.png"
+html_favicon = "_static/favicon/favicon.ico"
html_theme_options = {
# 'github_user': 'ddaskan',
# 'github_repo': 'mlqa',
32 changes: 23 additions & 9 deletions docs/source/quickstart.rst
@@ -6,29 +6,30 @@ Here, you can see some quick examples on how to utilize the package. For more de
DiffChecker Basics
------------------

-`DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is designed to perform QA in an integrated way on pd.DataFrame.
+`DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is designed to perform QA on data flows for ML. You can easily save statistics from the origin data, such as missing value rate, mean, min/max, percentiles, outliers, etc., and later compare the new data against them. This is especially important if you want to keep the prediction data under the same assumptions as the training data.

-You can easily initiate the object and fit a pd.DataFrame.
+Below is a quick example of how it works: just initiate the object and save statistics from the input data.

.. code-block:: python

->>> from mlqa.identifiers import DiffChecker
->>> dc = DiffChecker()
->>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
+>>> from mlqa.identifiers import DiffChecker
+>>> import pandas as pd
+>>> dc = DiffChecker()
+>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))

-Then, you can check on new data if it's okay for given criteria. Below, you can see data with increased NA count in column `na_col`. The default threshold is 0.5 which means it should be okay if NA rate is 50% more than the fitted data. NA rate is 50% in the fitted data so up to 75% (i.e. 50*(1+0.5)) should be okay. NA rate is 70% in the new data and, as expected, the QA passes.
+Then, you can check whether new data is okay for the given criteria. Below, you can see some data that is very similar in column `mean_col` but has an increased NA count in column `na_col`. The default threshold is 0.5, which means the check passes as long as the NA rate is at most 50% higher than in the origin data. The NA rate is 50% in the origin data, so anything up to 75% (i.e. 50%*(1+0.5)) is okay. The NA rate is 70% in the new data and, as expected, the QA passes.

.. code-block:: python

->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True

If you think the `threshold <identifiers.html#identifiers.DiffChecker.threshold>`_ is too loose, you can adjust it as you wish with the `set_threshold <identifiers.html#identifiers.DiffChecker.set_threshold>`_ method. Now the same check returns `False`, indicating the QA has failed.

.. code-block:: python

>>> dc.set_threshold(0.1)
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False

DiffChecker Details
@@ -53,6 +54,8 @@ To be more precise, you can set both `threshold <identifiers.html#identifiers.Di

.. code-block:: python

+>>> import pandas as pd
+>>> import numpy as np
>>> dc = DiffChecker()
>>> dc.set_threshold(0.2)
>>> dc.set_stats(['mean', 'max', np.sum])
@@ -110,6 +113,7 @@ Just initiate the class with `logger='<your-logger-name>.log'` argument.
.. code-block:: python

>>> from mlqa.identifiers import DiffChecker
+>>> import pandas as pd
>>> dc = DiffChecker(logger='mylog.log')
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.set_threshold(0.1)
@@ -125,6 +129,10 @@ If you open `mylog.log`, you'll see something like below.

If you also initiate the class with the `log_info=True` argument, then the other class steps (e.g. `set_threshold <identifiers.html#identifiers.DiffChecker.set_threshold>`_, `check <identifiers.html#identifiers.DiffChecker.check>`_) would be logged, too.
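
For example, a minimal sketch (the file name and data are only for illustration):

.. code-block:: python

>>> dc = DiffChecker(logger='mylog.log', log_info=True)
>>> dc.set_threshold(0.1)
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))

With `log_info=True`, the `set_threshold` and `fit` calls above are logged along with any QA warnings.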

.. note::

Although `DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is able to create a `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object by just passing a file name (i.e. `logger='mylog.log'`), creating the `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object externally and then passing it in (i.e. `logger=<mylogger>`) is highly recommended.
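
A rough sketch of that recommended setup (the handler and level choices below are assumptions, adjust them to your needs):

.. code-block:: python

>>> import logging
>>> mylogger = logging.getLogger('qa')
>>> mylogger.setLevel(logging.INFO)
>>> mylogger.addHandler(logging.FileHandler('mylog.log'))
>>> dc = DiffChecker(logger=mylogger)

An external `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ gives you full control over handlers, formatting and levels, which a bare file name cannot offer.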

Checkers with Logging
---------------------

@@ -198,7 +206,13 @@ This should log something like below.
WARNING|2020-05-31 18:21:20,019|Gender distribution looks wrong, check Weight for Gender=Male. Expected=0.5, Actual=0.6666666666666666
WARNING|2020-05-31 18:21:20,019|Gender distribution looks wrong, check Weight for Gender=Female. Expected=0.5, Actual=0.3333333333333333
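
The log format above can be reproduced with a standard `logging <https://docs.python.org/3/library/logging.html>`_ configuration. Below is a minimal sketch; the checker name and signature are assumptions for illustration, see the checkers module for the real API:

.. code-block:: python

>>> import logging
>>> import pandas as pd
>>> import mlqa.checkers as ch
>>> logging.basicConfig(filename='mylog.log', format='%(levelname)s|%(asctime)s|%(message)s')
>>> df = pd.DataFrame({'Gender':['Male']*4+['Female']*2, 'Weight':[80, 85, 90, 95, 60, 65]})
>>> # hypothetical checker call: QA the Gender split on Weight records against a 50/50 expectation
>>> ch.qa_category_distribution_on_value(df, 'Gender', {'Male':0.5, 'Female':0.5}, 'Weight', logger=logging.getLogger())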

-NOTE: sorry for the long lines, I had to write like that because of a `bug <https://github.com/executablebooks/sphinx-copybutton/issues/65>`_ in `sphinx-copybutton` extension.
+.. note::

+Although `DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is able to create a `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object by just passing a file name (i.e. `logger='mylog.log'`), creating the `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object externally and then passing it in (i.e. `logger=<mylogger>`) is highly recommended.

+.. note::

+Sorry for the long lines; I had to write them like that because of a `bug <https://github.com/executablebooks/sphinx-copybutton/issues/65>`_ in the `sphinx-copybutton` extension.



12 changes: 10 additions & 2 deletions mlqa/identifiers.py
@@ -25,15 +25,23 @@ class DiffChecker():
log_info (bool): `True` if method calls or arguments also need to be
logged

+Notes:
+Although `DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is
+able to create a `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_
+object by just passing a file name (i.e. `logger='mylog.log'`), creating
+the `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_
+object externally and then passing it in (i.e. `logger=<mylogger>`)
+is highly recommended.

Example:
Basic usage:

>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
>>> dc.set_threshold(0.1)
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False

Quick set for `qa_level`:
12 changes: 10 additions & 2 deletions setup.py
@@ -5,13 +5,21 @@

setuptools.setup(
name="mlqa",
version="0.1.0",
version="0.1.0.post1",
author="Dogan Askan",
author_email="doganaskan@gmail.com",
description=" A package to perform QA for Machine Learning Models.",
description="A Package to perform QA on data flows for Machine Learning.",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/ddaskan/mlqa",
+download_url="https://pypi.python.org/pypi/mlqa",
+project_urls={
+    "Bug Tracker": "https://github.com/ddaskan/mlqa/issues",
+    "Documentation": "http://www.doganaskan.com/mlqa/",
+    "Source Code": "https://github.com/ddaskan/mlqa",
+},
+license='MIT',
+keywords='qa ml ai data analysis machine learning quality assurance',
packages=setuptools.find_packages(),
classifiers=[
"Programming Language :: Python :: 3",