## 1.2 Install, load libraries and set up wandb

In [23]:
!pip install wandb
!pip install pytest pytest-sugar
import wandb
import pandas as pd

# Login to Weights & Biases
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1.3 Pytest

### 1.3.1 How pytest discovers tests


pytest uses the following [conventions](https://docs.pytest.org/en/latest/goodpractices.html#conventions-for-python-test-discovery) to automatically discover tests:
  1. files containing tests should be named `test_*.py` or `*_test.py`
  2. test function names should start with `test_`
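
As a minimal sketch of these conventions, consider a file saved as `test_example.py` (a hypothetical name chosen to match the `test_*.py` pattern). The `add` helper and both function names are illustrative assumptions, not part of this notebook's project.

```python
# Contents of a hypothetical file named test_example.py


def add(a, b):
    """Toy function under test (illustrative only)."""
    return a + b


def test_add_integers():
    # Collected by pytest: the function name starts with "test_"
    assert add(2, 3) == 5


def check_add_strings():
    # Not collected by pytest: the name does not start with "test_"
    assert add("a", "b") == "ab"
```

Running `pytest` in the directory containing this file should collect only `test_add_integers`.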

### 1.3.2 Fixture

An important aspect of using ``pytest`` is understanding how a fixture's scope works.

The scope of a fixture can take a few legal values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We are going to consider only **session** and **function**: with the former, the fixture is executed only once per pytest session and the value it returns is shared by all the tests that request it; with the latter, every test function gets a fresh copy of the data. Function scope is useful, for example, when a test modifies its input in a way that would make other tests fail.
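
The difference between the two scopes can be sketched with two fixtures that return the same toy list. The fixture and test names here are hypothetical. The last test deliberately relies on pytest's default behaviour of running tests in file order, to show how a mutation leaks through a session-scoped fixture.

```python
import pytest


@pytest.fixture(scope="session")
def shared_items():
    # Built once per pytest session; every test receives this same list object
    return ["a", "b"]


@pytest.fixture(scope="function")
def fresh_items():
    # Rebuilt for every test that requests it
    return ["a", "b"]


def test_append_to_fresh(fresh_items):
    fresh_items.append("c")
    assert fresh_items == ["a", "b", "c"]


def test_fresh_is_untouched(fresh_items):
    # Passes: the append in the previous test acted on a different list
    assert fresh_items == ["a", "b"]


def test_append_to_shared(shared_items):
    shared_items.append("c")
    assert "c" in shared_items


def test_shared_keeps_mutation(shared_items):
    # When run after test_append_to_shared, the same list object is passed in,
    # so the earlier append is still visible
    assert shared_items == ["a", "b", "c"]
```

A session-scoped fixture is a reasonable choice when loading the data is expensive and the tests only read it, which is the situation in the test file of the next section.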

### 1.3.3 Create and run a test file


In [24]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="decision_tree", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("decision_tree/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that the dataset has the expected number of columns
    """
    assert data.shape[1] == 7

def test_column_presence_and_type(data):

    required_columns = {
        "buying": pd.api.types.is_object_dtype,
        "maint": pd.api.types.is_object_dtype,
        "doors": pd.api.types.is_object_dtype,
        "persons": pd.api.types.is_object_dtype,
        "lug_boot": pd.api.types.is_object_dtype,
        "safety": pd.api.types.is_object_dtype,
        "assessment": pd.api.types.is_object_dtype,
       
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        "unacc",
        "acc",
        "good",
        "vgood"
    ]

    assert data["assessment"].isin(known_classes).all()


# def test_column_ranges(data):
#
#     ranges = {
#         "age": (17, 90),
#         "fnlwgt": (1.228500e+04, 1.484705e+06),
#         "education_num": (1, 16),
#         "capital_gain": (0, 99999),
#         "capital_loss": (0, 4356),
#         "hours_per_week": (1, 99)
#     }
#
#     for col_name, (minimum, maximum) in ranges.items():
#
#         assert data[col_name].dropna().between(minimum, maximum).all(), (
#             f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
#             f"instead min={data[col_name].min()} and max={data[col_name].max()}"
#         )




Overwriting test_data.py
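
The presence, type and class checks in the test file can be tried in isolation on a small DataFrame. The sketch below uses made-up rows (including a deliberately invalid `"excellent"` label) and only two of the seven columns. The columns are created with an explicit `object` dtype, which is an assumption made so that `is_object_dtype` behaves the same regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Toy data standing in for the real artifact (hypothetical values)
toy = pd.DataFrame(
    {
        "safety": pd.Series(["low", "high", "med"], dtype=object),
        "assessment": pd.Series(["unacc", "acc", "excellent"], dtype=object),
    }
)

known_classes = ["unacc", "acc", "good", "vgood"]

# Column presence: every required name must appear in the DataFrame
required = {"safety", "assessment"}
columns_present = set(toy.columns).issuperset(required)

# Type check: the same pandas helper used in test_column_presence_and_type
safety_is_object = pd.api.types.is_object_dtype(toy["safety"])

# Class check: isin() yields one boolean per row; all() requires every row to pass
mask = toy["assessment"].isin(known_classes)
unexpected = toy.loc[~mask, "assessment"].tolist()
```

Here `mask.all()` is false because of the invalid label, and `unexpected` lists the offending values, which can make an assertion message more informative than a bare failure.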


Now let's run pytest:

In [25]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.4

 test_data.py::test_data_length ✓                                 25% ██▌       
 test_data.py::test_number_of_columns ✓                           50% █████     
 test_data.py::test_column_presence_and_type ✓                    75% ███████▌  
 test_data.py::test_class_names ✓                                100% ██████████

Results (7.62s):
       4 passed


In [26]:
# Close the run.
# Wait a while after running the previous cell before executing this one.
run.finish()

NameError: ignored
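
The `NameError` most likely arises because `run` is created inside `test_data.py`, which pytest executes in a separate process, so the name never exists in the notebook kernel. One possible alternative, sketched here under assumptions, is to create the run in a session-scoped fixture and finish it in the fixture's teardown. `FakeRun` is a hypothetical stand-in so the sketch runs without a W&B account; in the real test file the fixture would call `wandb.init(...)` instead.

```python
import pytest


class FakeRun:
    """Hypothetical stand-in for the object returned by wandb.init()."""

    def __init__(self):
        self.finished = False

    def finish(self):
        self.finished = True


@pytest.fixture(scope="session")
def run():
    # In the real test file this would be:
    # wandb.init(project="decision_tree", job_type="data_checks")
    active_run = FakeRun()
    yield active_run      # tests execute while the fixture is suspended here
    active_run.finish()   # teardown: runs once, after the last test of the session


def test_run_is_open(run):
    # While tests are running, the run has not been finished yet
    assert not run.finished
```

With this layout the `data` fixture would request `run` as an argument, and no separate notebook cell would be needed to close the run.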