# 1.0 An end-to-end classification problem (Data Check)



## 1.1 Dataset description

The data come from **bank-full.csv**, which contains all examples and 17 columns, ordered by date (from May 2008 to November 2010). This is an older version of the dataset with fewer input variables.

The input variables describe the **bank customers**, the **last contact** of the current campaign, and the **previous campaign**.

The classification goal is to predict whether the client will **subscribe** (yes/no) to a **term deposit** (variable y).

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

Let's take the following steps:

1. ETL (done!!!)
4. Data Checks

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Install, load libraries and setup wandb

In [1]:
!pip install wandb

Collecting wandb
  Downloading wandb-0.12.16-py2.py3-none-any.whl (1.8 MB)
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting shortuuid>=0.5.0
  Downloading shortuuid-1.0.9-py3-none-any.whl (9.4 kB)
Collecting setproctitle
  Downloading setproctitle-1.2.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29 kB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.5.12-py2.py3-none-any.whl (145 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
Collecting smm

In [2]:
!pip install pytest pytest-sugar

Collecting pytest-sugar
  Downloading pytest-sugar-0.9.4.tar.gz (12 kB)
Building wheels for collected packages: pytest-sugar
  Building wheel for pytest-sugar (setup.py) ... done
  Created wheel for pytest-sugar: filename=pytest_sugar-0.9.4-py2.py3-none-any.whl size=8985 sha256=40a6df67ee0398c67004edb5adfa7dfd4465934ccaa2cb5174da0c7fe1aba9d3
  Stored in directory: /root/.cache/pip/wheels/9c/e8/b6/5009ec756a2f40eed690a8b0a95549cd788c1ec968832876df
Successfully built pytest-sugar
Installing collected packages: pytest-sugar
Successfully installed pytest-sugar-0.9.4


In [3]:
import wandb

In [4]:
# Login to Weights & Biases
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1.2 Pytest


### 1.2.1 How pytest discovers tests



pytest uses the following [conventions](https://docs.pytest.org/en/latest/goodpractices.html#conventions-for-python-test-discovery) to automatically discover tests:
  1. files with tests should be called `test_*.py` or `*_test.py`
  2. test function names should start with `test_`
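
To make the two naming rules concrete, here is a minimal sketch that applies them to a few candidate names. It only mimics the default naming patterns with the standard library and is not how pytest itself collects tests; the helper names `looks_like_test_file` and `looks_like_test_function` and the candidate file names are hypothetical.

```python
from fnmatch import fnmatchcase

# File-name patterns taken from the conventions listed above.
TEST_FILE_PATTERNS = ("test_*.py", "*_test.py")


def looks_like_test_file(filename: str) -> bool:
    """Return True if the file name matches one of the two patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in TEST_FILE_PATTERNS)


def looks_like_test_function(name: str) -> bool:
    """Return True if the function name starts with the `test_` prefix."""
    return name.startswith("test_")


# Hypothetical file names, used only to illustrate the rules.
for filename in ["test_data.py", "data_test.py", "data_checks.py", "tests.py"]:
    print(f"{filename}: {looks_like_test_file(filename)}")

# Function names from a test module: only the prefixed one would be picked up.
for function_name in ["test_data_length", "data"]:
    print(f"{function_name}: {looks_like_test_function(function_name)}")
```

Under these rules, a file such as `data_checks.py` would be skipped, and so would a helper function named `data`, which is why a fixture can live in the same file as the tests without being run as a test.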




### 1.2.2 Fixture


An important aspect of using ``pytest`` is understanding how a fixture's scope works. 

The scope of a fixture can take a few legal values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We are going to consider only **session** and **function**: with the former, the fixture is executed only once in a pytest session and the value it returns is used for all the tests that need it; with the latter, every test function gets a fresh copy of the data. The latter is useful if, for example, a test modifies the input in a way that would make other tests fail.
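
The difference between the two scopes can be seen in a small experiment. The sketch below stores a hypothetical test module in a string, writes it to a temporary directory and runs pytest on it, much like the `%%file` cell used later in this notebook. The fixture names `shared_rows` and `fresh_rows`, the file name `test_scope_demo.py` and the helper `run_scope_demo` are illustrative assumptions. The second test relies on pytest's default behaviour of running the tests of a file in definition order.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Contents of a hypothetical test module (all names are illustrative only).
TEST_MODULE = textwrap.dedent(
    '''
    import pytest

    @pytest.fixture(scope="session")
    def shared_rows():
        # Built once; every test receives the same list object.
        return [1, 2, 3]

    @pytest.fixture(scope="function")
    def fresh_rows():
        # Rebuilt for every test function.
        return [1, 2, 3]

    def test_mutates_both(shared_rows, fresh_rows):
        shared_rows.append(4)
        fresh_rows.append(4)
        assert shared_rows == [1, 2, 3, 4]
        assert fresh_rows == [1, 2, 3, 4]

    def test_runs_second(shared_rows, fresh_rows):
        # The session-scoped list still carries the earlier mutation...
        assert shared_rows == [1, 2, 3, 4]
        # ...while the function-scoped list was rebuilt from scratch.
        assert fresh_rows == [1, 2, 3]
    '''
)


def run_scope_demo() -> int:
    """Write the module to a temporary directory, run pytest on it and return the exit code."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        test_file = Path(tmp_dir) / "test_scope_demo.py"
        test_file.write_text(TEST_MODULE)
        completed = subprocess.run(
            [sys.executable, "-m", "pytest", str(test_file), "-q"]
        )
        return completed.returncode
```

With pytest installed, calling `run_scope_demo()` should return exit code 0, since both tests are expected to pass. The session-scoped fixture is a reasonable choice for read-only data that is expensive to load, such as a downloaded artifact, as long as no test modifies it in place.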

### 1.2.3 Create and run a test file


In [5]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="decision_tree_bank", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("decision_tree_bank/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that the dataset has the expected number of columns
    """
    assert data.shape[1] == 17

def test_column_presence_and_type(data):

    required_columns = {
        "age": pd.api.types.is_int64_dtype,
        "job": pd.api.types.is_object_dtype,
        "marital": pd.api.types.is_object_dtype,
        "education": pd.api.types.is_object_dtype,
        "default": pd.api.types.is_object_dtype,
        "balance": pd.api.types.is_int64_dtype,
        "housing": pd.api.types.is_object_dtype,
        "loan": pd.api.types.is_object_dtype,
        "contact": pd.api.types.is_object_dtype,
        "day": pd.api.types.is_int64_dtype,
        "month": pd.api.types.is_object_dtype,
        "duration": pd.api.types.is_int64_dtype,  
        "campaign": pd.api.types.is_int64_dtype,
        "pdays": pd.api.types.is_int64_dtype,
        "previous": pd.api.types.is_int64_dtype,
        "poutcome": pd.api.types.is_object_dtype,
        "y": pd.api.types.is_object_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        "no",
        "yes"
    ]

    assert data["y"].isin(known_classes).all()


def test_column_ranges(data):

    ranges = {                          # see the criteria used to select these ranges
        "age": (18, 99),
        "balance": (-20000,200000),
        "day": (1, 31),
        "duration": (0, 9999),
        "campaign": (0, 999),
        "pdays": (-1, 999),
        "previous": (0, 999)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

Writing test_data.py


Now let's run pytest.

In [6]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.4

 test_data.py::test_data_length ✓                                 20% ██
 test_data.py::test_number_of_columns ✓                           40% ████
 test_data.py::test_column_presence_and_type ✓                    60% ██████
 test_data.py::test_class_names ✓                                 80% ████████
 test_data.py::test_column_ranges ✓

In [None]:
# close the run
# wait a little after the previous cell finishes before executing this
run.finish()