0.1.0.post1 #2

Merged
merged 9 commits on Jun 15, 2020
24 changes: 13 additions & 11 deletions README.md
@@ -1,33 +1,35 @@
# MLQA <img src="docs/_static/mlqa.png" align="right" width="120"/>

-A package to perform QA for Machine Learning Models.
+A package to perform QA on data flows for Machine Learning.

## Introduction

-MLQA is a Python package that is created to help data scientists, analysts and developers to perform quality assurance (i.e. QA) on [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and 1d arrays, especially for machine learning modeling data flows. It's designed to work with [logging](https://docs.python.org/3/library/logging.html) library to log and notify QA steps in a descriptive way.
+MLQA is a Python package created to help data scientists, analysts, and developers perform quality assurance (i.e. QA) on [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and 1d arrays, especially for machine learning modeling data flows. It's designed to work with the [logging](https://docs.python.org/3/library/logging.html) library to log and notify QA steps in a descriptive way. It includes standalone functions (i.e. [checkers](mlqa/checkers.py)) for different QA activities and the [DiffChecker](mlqa/identifiers.py) class for integrated QA capabilities on data.

## Installation

You can install MLQA with pip.

`pip install mlqa`

MLQA depends on Pandas and Numpy and works in Python 3.5+.

## Quickstart

-You can easily initiate the object and fit a pd.DataFrame.
+[DiffChecker](mlqa/identifiers.py) is designed to perform QA on data flows for ML. You can easily save statistics from the origin data, such as missing value rate, mean, min/max, percentiles, outliers, etc., and later compare the new data against them. This is especially important if you want to keep the prediction data under the same assumptions as the training data.

+Below is a quick example of how it works: just initiate the object and save statistics from the input data.
```python
>>> from mlqa.identifiers import DiffChecker
+>>> import pandas as pd
>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
```

-Then, you can check on new data if it's okay for given criteria. Below, you can see data with increased NA count in column `na_col`. The default threshold is 0.5 which means it should be okay if NA rate is 50% more than the fitted data. NA rate is 50% in the fitted data so up to 75% (i.e. 50*(1+0.5)) should be okay. NA rate is 70% in the new data and, as expected, the QA passes.
+Then, you can check whether new data is okay for the given criteria. Below, you can see some data that is very similar in column `mean_col` but has an increased NA count in column `na_col`. The default threshold is 0.5, which means the check passes as long as the NA rate is at most 50% higher than in the origin data. The NA rate is 50% in the origin data, so anything up to 75% (i.e. 50%*(1+0.5)) is okay. The NA rate is 70% in the new data and, as expected, the QA passes.

```python
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
```
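
If the default threshold is too loose for your case, you can tighten it and repeat the check; mirroring the quickstart docs, the stricter threshold makes the same comparison fail:

```python
>>> dc.set_threshold(0.1)
>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False
```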

Binary file added docs/_static/favicon/android-chrome-192x192.png
Binary file added docs/_static/favicon/android-chrome-512x512.png
Binary file added docs/_static/favicon/apple-touch-icon.png
Binary file added docs/_static/favicon/favicon-16x16.png
Binary file added docs/_static/favicon/favicon-32x32.png
Binary file added docs/_static/favicon/favicon.ico
1 change: 1 addition & 0 deletions docs/_static/favicon/site.webmanifest
@@ -0,0 +1 @@
{"name":"","short_name":"","icons":[{"src":"/android-chrome-192x192.png","sizes":"192x192","type":"image/png"},{"src":"/android-chrome-512x512.png","sizes":"512x512","type":"image/png"}],"theme_color":"#ffffff","background_color":"#ffffff","display":"standalone"}
3 changes: 2 additions & 1 deletion docs/conf.py
@@ -25,7 +25,7 @@
author = 'Dogan Askan'

# The full version, including alpha/beta/rc tags
-release = '0.1.0'
+release = '0.1.0.post1'


# -- General configuration ---------------------------------------------------
@@ -58,6 +58,7 @@
# html_theme = 'alabaster'
html_theme = "pydata_sphinx_theme"
html_logo = "_static/mlqa.png"
+html_favicon = "_static/favicon/favicon.ico"
html_theme_options = {
# 'github_user': 'ddaskan',
# 'github_repo': 'mlqa',
32 changes: 23 additions & 9 deletions docs/source/quickstart.rst
@@ -6,29 +6,30 @@ Here, you can see some quick examples on how to utilize the package. For more de
DiffChecker Basics
------------------

-`DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is designed to perform QA in an integrated way on pd.DataFrame.
+`DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is designed to perform QA on data flows for ML. You can easily save statistics from the origin data, such as missing value rate, mean, min/max, percentiles, outliers, etc., and later compare the new data against them. This is especially important if you want to keep the prediction data under the same assumptions as the training data.

-You can easily initiate the object and fit a pd.DataFrame.
+Below is a quick example of how it works: just initiate the object and save statistics from the input data.

.. code-block:: python

->>> from mlqa.identifiers import DiffChecker
->>> dc = DiffChecker()
->>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
+>>> from mlqa.identifiers import DiffChecker
+>>> import pandas as pd
+>>> dc = DiffChecker()
+>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))

-Then, you can check on new data if it's okay for given criteria. Below, you can see data with increased NA count in column `na_col`. The default threshold is 0.5 which means it should be okay if NA rate is 50% more than the fitted data. NA rate is 50% in the fitted data so up to 75% (i.e. 50*(1+0.5)) should be okay. NA rate is 70% in the new data and, as expected, the QA passes.
+Then, you can check whether new data is okay for the given criteria. Below, you can see some data that is very similar in column `mean_col` but has an increased NA count in column `na_col`. The default threshold is 0.5, which means the check passes as long as the NA rate is at most 50% higher than in the origin data. The NA rate is 50% in the origin data, so anything up to 75% (i.e. 50%*(1+0.5)) is okay. The NA rate is 70% in the new data and, as expected, the QA passes.

.. code-block:: python

->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True

If you think the `threshold <identifiers.html#identifiers.DiffChecker.threshold>`_ is too loose, you can adjust it as you wish with the `set_threshold <identifiers.html#identifiers.DiffChecker.set_threshold>`_ method. Now the same check returns `False`, indicating the QA has failed.

.. code-block:: python

>>> dc.set_threshold(0.1)
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False

DiffChecker Details
@@ -53,6 +54,8 @@ To be more precise, you can set both `threshold <identifiers.html#identifiers.Di

.. code-block:: python

+>>> import pandas as pd
+>>> import numpy as np
>>> dc = DiffChecker()
>>> dc.set_threshold(0.2)
>>> dc.set_stats(['mean', 'max', np.sum])
@@ -110,6 +113,7 @@ Just initiate the class with `logger='<your-logger-name>.log'` argument.
.. code-block:: python

>>> from mlqa.identifiers import DiffChecker
+>>> import pandas as pd
>>> dc = DiffChecker(logger='mylog.log')
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
>>> dc.set_threshold(0.1)
@@ -125,6 +129,10 @@ If you open `mylog.log`, you'll see something like below.

If you also initiate the class with the `log_info=True` argument, then the other class steps (e.g. `set_threshold <identifiers.html#identifiers.DiffChecker.set_threshold>`_, `check <identifiers.html#identifiers.DiffChecker.check>`_) would be logged, too.
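
For example, a minimal sketch (the file name and data are only for illustration):

.. code-block:: python

>>> dc = DiffChecker(logger='mylog.log', log_info=True)
>>> dc.set_threshold(0.1)
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))

With `log_info=True`, the `set_threshold` and `fit` calls above are logged along with any QA warnings.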

.. note::

Although `DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is able to create a `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object by just passing a file name (i.e. `logger='mylog.log'`), creating the `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object externally and then passing it in (i.e. `logger=<mylogger>`) is highly recommended.
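
A rough sketch of that recommended setup (the handler and level choices below are assumptions, adjust them to your needs):

.. code-block:: python

>>> import logging
>>> mylogger = logging.getLogger('qa')
>>> mylogger.setLevel(logging.INFO)
>>> mylogger.addHandler(logging.FileHandler('mylog.log'))
>>> dc = DiffChecker(logger=mylogger)

An external `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ gives you full control over handlers, formatting and levels, which a bare file name cannot offer.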

Checkers with Logging
---------------------

@@ -198,7 +206,13 @@ This should log something like below.
WARNING|2020-05-31 18:21:20,019|Gender distribution looks wrong, check Weight for Gender=Male. Expected=0.5, Actual=0.6666666666666666
WARNING|2020-05-31 18:21:20,019|Gender distribution looks wrong, check Weight for Gender=Female. Expected=0.5, Actual=0.3333333333333333
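
The log format above can be reproduced with a standard `logging <https://docs.python.org/3/library/logging.html>`_ configuration. Below is a minimal sketch; the checker name and signature are assumptions for illustration, see the checkers module for the real API:

.. code-block:: python

>>> import logging
>>> import pandas as pd
>>> import mlqa.checkers as ch
>>> logging.basicConfig(filename='mylog.log', format='%(levelname)s|%(asctime)s|%(message)s')
>>> df = pd.DataFrame({'Gender':['Male']*4+['Female']*2, 'Weight':[80, 85, 90, 95, 60, 65]})
>>> # hypothetical checker call: QA the Gender split on Weight records against a 50/50 expectation
>>> ch.qa_category_distribution_on_value(df, 'Gender', {'Male':0.5, 'Female':0.5}, 'Weight', logger=logging.getLogger())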

-NOTE: sorry for the long lines, I had to write like that because of a `bug <https://github.com/executablebooks/sphinx-copybutton/issues/65>`_ in `sphinx-copybutton` extension.
+.. note::

+Although `DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is able to create a `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object by just passing a file name (i.e. `logger='mylog.log'`), creating the `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_ object externally and then passing it in (i.e. `logger=<mylogger>`) is highly recommended.

+.. note::

+Sorry for the long lines; I had to write them like that because of a `bug <https://github.com/executablebooks/sphinx-copybutton/issues/65>`_ in the `sphinx-copybutton` extension.



12 changes: 10 additions & 2 deletions mlqa/identifiers.py
@@ -25,15 +25,23 @@ class DiffChecker():
log_info (bool): `True` if method calls or arguments also need to be
logged

+Notes:
+Although `DiffChecker <identifiers.html#identifiers.DiffChecker>`_ is
+able to create a `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_
+object by just passing a file name (i.e. `logger='mylog.log'`), creating
+the `Logger <https://docs.python.org/3/library/logging.html#logging.Logger>`_
+object externally and then passing it in (i.e. `logger=<mylogger>`)
+is highly recommended.

Example:
Basic usage:

>>> dc = DiffChecker()
>>> dc.fit(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*50+[1]*50}))
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
True
>>> dc.set_threshold(0.1)
->>> dc.check(pd.DataFrame({'mean_col':[1, 2]*50, 'na_col':[None]*70+[1]*30}))
+>>> dc.check(pd.DataFrame({'mean_col':[.99, 2.1]*50, 'na_col':[None]*70+[1]*30}))
False

Quick set for `qa_level`:
12 changes: 10 additions & 2 deletions setup.py
@@ -5,13 +5,21 @@

setuptools.setup(
name="mlqa",
version="0.1.0",
version="0.1.0.post1",
author="Dogan Askan",
author_email="doganaskan@gmail.com",
description=" A package to perform QA for Machine Learning Models.",
description="A Package to perform QA on data flows for Machine Learning.",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/ddaskan/mlqa",
+download_url="https://pypi.python.org/pypi/mlqa",
+project_urls={
+    "Bug Tracker": "https://github.com/ddaskan/mlqa/issues",
+    "Documentation": "http://www.doganaskan.com/mlqa/",
+    "Source Code": "https://github.com/ddaskan/mlqa",
+},
+license='MIT',
+keywords='qa ml ai data analysis machine learning quality assurance',
packages=setuptools.find_packages(),
classifiers=[
"Programming Language :: Python :: 3",