# testing goals

- Understandable: At its core, a test is just a step-by-step procedure. It exercises a behavior and verifies the outcome. In a sense, tests are living specifications – they detail exactly how a feature should function. Everyone should be able to intuitively understand how a test works. Follow conventions like Arrange-Act-Assert or Given-When-Then. Seek conciseness without vagueness. Avoid walls of text. If you find yourself struggling to write a test in plain language, then you should review the design for the feature under test. If you can’t explain it, then how will others know how to use it?

- Unique: Each test case in a suite should cover a unique behavior. Don’t Repeat Yourself – repetitive tests with few differences bear a heavy cost to maintain and execute without delivering much additional value. If a test can cover multiple inputs, then focus on one variation per equivalence class.

- Individual: Test one thing at a time. Tests that each focus on one main behavior are easier to formulate and automate. They naturally become understandable and maintainable. When a test covering only one behavior fails, then its failure reason is straightforward to deduce. Any time you want to combine multiple behaviors into one test, consider separating them into different tests. Make a clear distinction between “arrange” and “act” steps. Write atomic tests as much as possible. Avoid writing “world tours,” too. I’ve seen repositories where tests are a hundred steps long and meander through an application like Mr. Toad’s Wild Ride.

- Independent: Each test should be independent of all other tests. That means testers should be able to run each test as a standalone unit. Each test should have appropriate setup and cleanup routines to do no harm and leave no trace. Set up new resources for each test. Automated tests should use patterns like dependency injection instead of global variables. If one test fails, others should still run successfully. Test case independence is the cornerstone for scalable, parallelizable tests. Modern test automation frameworks strongly support test independence. However, folks who are new to automation frequently presume interdependence – they think the end of one test is the starting point for the next one in the source code file. Don’t write tests like that! Write your tests as if each one could run on its own, or as if the suite’s test order could be randomized.

- Repeatable: Testing tends to be a repetitive activity. Test suites need to run continuously to provide fast feedback as development progresses. Every time they run, they must yield deterministic results because teams expect consistency. Unfortunately, manual tests are not very repeatable. They require lots of time to run, and human testers may not run them exactly the same way each iteration. Test automation enables tests to be truly repeatable. Tests can be automated once and run repeatedly and continuously. Automated scripts always run the same way, too.

- Reliable: Tests must run successfully to completion, whether they return PASS or FAIL results. “Flaky” tests – tests that occasionally fail for arbitrary reasons – waste time and create doubt. If a test cannot run reliably, then how can its results be trusted? And why would a team invest so much time developing tests if they don’t run well? You shouldn’t need to rerun tests to get good results. If tests fail intermittently, find out why. Correct any automation errors. Tune automation timeouts. Scale infrastructure to the appropriate sizes. Prioritize test stability over speed. And don’t overlook any wonky bugs that could be lurking in the product under test!

- Efficient: Providing fast feedback is testing’s main purpose. Fast feedback helps teams catch issues early and keep developing safely. Fast tests enable fast feedback. Slow tests cause slow feedback. They force teams to limit coverage. They waste time and money, and they increase the risk that bugs do more damage. Optimize tests to be as efficient as possible without jeopardizing stability. Don’t include unnecessary steps. Use smart waits instead of hard sleeps. Write atomic tests that cover individual behaviors. For example, use APIs instead of UIs to prep data. Set up tests to run in parallel. Run tests as part of Continuous Integration pipelines so that they deliver results immediately.

- Organized: An effective test has a clear identity:
    - Purpose: Why run this test?
    - Coverage: What behavior or feature does this test cover?
    - Level: Should this test be a unit, integration, or end-to-end test?
    - Identity informs placement and type. Make sure tests belong to appropriate suites. For example, tests that interact with Web UIs via Selenium WebDriver do not belong in unit test suites. Group related tests together using subdirectories and/or tags.

- Reportable: Functional tests yield PASS or FAIL results with logs, screenshots, and other artifacts. Large suites yield lots of results. Reports should present results in a readable, searchable format. They should make failures stand out with colors and error messages. They should also include other helpful information like duration times and pass rates. Unit test reports should include code coverage, too. Publish test reports to public dashboards so everyone can see them. Most Continuous Integration servers like Jenkins include some sort of test reporting mechanism. Furthermore, capture metrics like test result histories and duration times in data formats instead of textual reports so they can be analyzed for trends.

- Maintainable: Tests are inherently fragile because they depend upon the features they cover. If features change, then tests probably break. Furthermore, automated tests are susceptible to code duplication because they frequently repeat similar steps. Code duplication is code cancer – it copies problems throughout a code base. Fragility and duplication cause a nightmare for maintainability. To mitigate the maintenance burden, develop tests using the same practices as developing products. Don’t Repeat Yourself. Simple is better than complex. Do test reviews. For automation, follow good design principles like separating concerns and building solution layers. Make tests easy to update in the future!

- Trustworthy: A test is “successful” if it runs to completion and yields a correct PASS or FAIL result. The veracity of the outcome matters. Tests that report false failures make teams waste time doing unnecessary triage. Tests that report false passing results give a false sense of security and let bugs go undetected. Both ways ultimately cause teams to mistrust the tests. Unfortunately, I’ve seen quite a few untrustworthy tests before. Sometimes, test assertions don’t check the right things, or they might be missing entirely! I’ve also seen tests for which the title does not match the behavior under test. These problems tend to go unnoticed in large test suites, too. Make sure every single test is trustworthy. Review new tests carefully, and take time to improve existing tests whenever problems are discovered.

- Valuable: Testing takes a lot of work. It takes time away from developing new things. Therefore, testing must be worth the effort. Since covering every single behavior is impossible, teams should apply a risk-based strategy to determine which behaviors pose the most risk if they fail and then prioritize testing for those behaviors. If you are unsure if a test is genuinely valuable, ask this question: If the test fails, will the team take action to fix the defect? If the answer is yes, then the test is very valuable. If the answer is no, then look for other, more important behaviors to cover with tests.

- https://automationpanda.com/2020/07/09/12-traits-of-highly-effective-tests/
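As an illustration of the Understandable, Individual, and Independent traits above, here is a minimal sketch in Python. `ShoppingCart`, `make_cart`, and the test names are hypothetical, invented for this example. Each test builds its own fresh cart, follows Arrange-Act-Assert, and checks a single behavior.

```python
class ShoppingCart:
    """Hypothetical class under test (an assumption for this sketch)."""

    def __init__(self):
        self._items = {}  # name -> (unit price, quantity)

    def add(self, name, price, quantity=1):
        if quantity < 1:
            raise ValueError("quantity must be at least 1")
        _, current_quantity = self._items.get(name, (price, 0))
        self._items[name] = (price, current_quantity + quantity)

    def total(self):
        return sum(price * quantity for price, quantity in self._items.values())


def make_cart():
    """Create a fresh resource per test so no state leaks between tests."""
    return ShoppingCart()


def test_total_of_empty_cart_is_zero():
    # Arrange
    cart = make_cart()
    # Act
    total = cart.total()
    # Assert
    assert total == 0


def test_adding_an_item_increases_the_total():
    # Arrange
    cart = make_cart()
    # Act
    cart.add("book", price=12, quantity=2)
    # Assert
    assert cart.total() == 24
```

Because neither test relies on the other, they can run alone, in any order, or in parallel. With pytest, a fixture could take the place of `make_cart`.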

- atomic: when creating functions and classes, we need to ensure that they have a single responsibility so that we can easily test them. If not, we'll need to split them into more granular components.
- compose: when we create new components, we want to compose tests to validate their functionality. It's a great way to ensure reliability and catch errors early on.
- reuse: we should maintain central repositories where core functionality is tested at the source and reused across many projects. This significantly reduces testing efforts for each new project's code base.
- regression: we want to account for new errors we come across with a regression test so we can ensure we don't reintroduce the same errors in the future.
- coverage: we want to ensure 100% coverage for our codebase. This doesn't mean writing a test for every single line of code but rather accounting for every single line.
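The "atomic" and "compose" points can be sketched as follows. The record layout and function names are assumptions made up for this example. Cleaning and filtering are kept as separate single-responsibility functions, so each can be tested on its own before testing their composition.

```python
def clean_record(record):
    """Single responsibility: normalize one raw record."""
    return {"name": record["name"].strip().lower(), "age": int(record["age"])}


def is_adult(record, minimum_age=18):
    """Single responsibility: decide whether a cleaned record passes the age filter."""
    return record["age"] >= minimum_age


def select_adults(raw_records):
    """Composition of the two small units above."""
    return [record for record in map(clean_record, raw_records) if is_adult(record)]


def test_clean_record_normalizes_name_and_age():
    assert clean_record({"name": "  Ada ", "age": "36"}) == {"name": "ada", "age": 36}


def test_is_adult_uses_inclusive_boundary():
    assert is_adult({"name": "x", "age": 18})
    assert not is_adult({"name": "x", "age": 17})


def test_select_adults_composes_cleaning_and_filtering():
    raw = [{"name": " Ada ", "age": "36"}, {"name": "Tim", "age": "9"}]
    assert select_adults(raw) == [{"name": "ada", "age": 36}]
```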

# Is test coverage a useful metric?
- No. Yes. Sometimes. It depends. “Test coverage” is shorthand for “code covered by running automated tests”. The multiple quality dimensions thing applies here too. Which stakeholders are we providing evidence for through this particular test? How does this inform their confidence? The fact that some code exercised some other code tells me that at best we gained some evidence for at least one stakeholder.
- One thing test coverage can reveal is code that has no automated tests at all! A lack of coverage tells us that code is not being checked by automated tests. But even that is not necessarily a concern if we know that we are verifying the code in other ways. For instance, a user interface that developers and testers see many times each day is unlikely to have a glaring layout error on it, even though there is no automated test to confirm this.
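To make the "code exercised some other code" point concrete, the sketch below compares a check that only executes a function with one that verifies its result. All names are hypothetical, and the buggy variant is deliberately wrong for the sake of the illustration. Both checks would produce full line coverage of the function they call, but only the second one notices the bug.

```python
def apply_discount(price, percent):
    """Intended behavior: subtract the discount from the price."""
    return price - price * percent / 100


def apply_discount_buggy(price, percent):
    """Deliberately wrong variant: adds the discount instead of subtracting it."""
    return price + price * percent / 100


def smoke_check(func):
    """Runs every line of func, so coverage is full, but verifies nothing."""
    func(200, 10)
    return True


def behavior_check(func):
    """Verifies the outcome, so it can tell the two variants apart."""
    return func(200, 10) == 180


print(smoke_check(apply_discount), smoke_check(apply_discount_buggy))
print(behavior_check(apply_discount), behavior_check(apply_discount_buggy))
```

The first line prints `True True` and the second `True False`: coverage alone says nothing about whether the covered code was actually checked.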


# testing mindset
- Empathy: the ability to get inside a stakeholder’s head, see the world from their perspective, understand their causes for concern.
- Scepticism: the ability to doubt the work you are doing even while you are doing it. This is especially hard for a programmer: our ego and confirmation bias are always there. This scepticism aligns with the Scientific Method, in which we try to falsify a hypothesis, not to “prove” it.
- Ingenuity: the ability and determination to do whatever it takes to give that peace of mind to your stakeholder, or to discover they were right to be sceptical in the first place! Testing is non-linear, non-obvious, and often emergent. Poking around in a database; sniffing packets on a network; injecting a service proxy to record interactions; tracking eye movements; hacking DNS; writing code that breaks other code; nothing is off limits to a good tester.

# The many faces of testing
- Testing is a wide area, and as such it's filled with quite a lot of buzzwords when it comes to types of tests 🐝. 
- Each one answers a different question:
    - What is the scope of the test? Unit testing, integration testing, E2E testing.
    - How is the test run? Reference testing, property testing, chaos testing.
    - Where is the test run? Production testing.
    - What is the purpose of the test? Functional testing, performance testing, load testing, document testing.

## Testing Strategies 🎮
- Unit tests: assess the fitness of a single code unit.
 - Unit tests are great, but they have costs. They fit best for logical code units (branching and looping logic such as `if`/`for`).
- Property testing
 - or even generating mocked datasets using hypothesis https://www.hillelwayne.com/post/property-testing-complex-inputs/ 
 - https://www.youtube.com/watch?v=UfZ26kwwLx0&list=WL&index=9&ab_channel=PyConIsrael
- Component tests
- Integration testing: check the collaboration of units. Integration tests are broader than unit tests but more cost-efficient.
 - Because "The unclouded eye was better, no matter what it saw." Frank Herbert.
 - These tests aim to determine whether modules that have been developed separately work as expected when brought together. They are typically longer-running tests that observe higher-level behaviors that leverage multiple components in the codebase. In terms of a data pipeline, these can check that:
   - The data cleaning process results in a dataset appropriate for the model.
   - The model training can handle the data provided to it and outputs results (ensuring that code can be refactored in the future).

 - One of the core concepts behind Integration Testing is the System Under Test. In Unit Testing, the SUT is the unit (i.e. the class or the module, as described above). In Integration Testing, the SUT needs to be defined per test: it can be as small as two units collaborating, and as big as the whole system.
 - There's a raging debate about unit testing vs. integration testing. Some consider only the former, while others consider only the latter. If you want to know my personal stance on this, please read this previous post.
 - In the context of integration testing, the "testability" aspect is not defined by how well the code can be isolated, but instead by how well the actual infrastructure accommodates and facilitates testing. This puts a certain prerequisite on the responsible person and the team in general in terms of technical expertise.
 - It may also take some time to set up and configure the testing environment, as it includes creating fixtures, wiring fake implementations, adding custom initialization and cleanup behavior, and so on. All these things need to be maintained as the project scales and becomes more complicated.
 - Another common concern is that high-level tests often suffer from a lack of locality. If a test fails, either due to unmet expectations or because of an unhandled exception, it's usually unclear what exactly caused the error.
 - Although there are ways to mitigate this issue, ultimately it's always going to be a trade-off: isolated tests are better at indicating the cause of an error, while integrated tests are better at highlighting the impact. Both are equally useful, so it comes down to what you consider to be more important.
- Regression tests:
 - Whenever developers change or modify a feature, there is a significant chance that the update causes unexpected behavior. Regression testing is performed to make sure that a change or addition hasn't broken any existing functionality.
- Data tests: check dataset dimensions and integrity, and check that the data is correct.
- Chaos: verify the resiliency of a system.
- Penetration: assess the vulnerability of a system.
- Mutation: check the quality of the tests.
- End-to-end: validate the flow from the GUI to the datastore and back.
- Stress testing: test the system until it breaks. The load increases gradually.
- Load testing is in general what is implicitly meant when people talk about performance testing. In this context, one sets different parameters for the tests. Those parameters model a representative load from the production environment. In reality, most organizations are unable to provide a quality sample of the production load, and the load is inferred.
 - At Google, unit tests are intended to be small and fast because they need to fit into our standard test execution infrastructure and also be run many times as part of a frictionless developer workflow. But performance, load, and stress testing often require sending large volumes of traffic to a given binary. These volumes become difficult to test in the model of a typical unit test. And our large volumes are big: often thousands or millions of queries per second (in the case of ads, real-time bidding).
 - https://github.com/locustio/locust
 - https://github.com/great-expectations/great_expectations
- Performance testing: most of the previous approaches focus on testing functional requirements. It is easy to forget that the fitness of a software component encompasses both functional and non-functional requirements.
 - Sometimes it is worth monitoring things only in production. When done well it can be cheap; otherwise it can be very expensive.
- contract
 - https://github.com/deadpixi/contract
 - https://www.hillelwayne.com/post/pbt-contracts/
 - https://github.com/ksindi/implements
 - https://sethmlarson.dev/blog/2021-10-18/tests-arent-enough-case-study-after-adding-types-to-urllib3
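The property testing entry above can be sketched without any third-party dependency. This is a hand-rolled illustration using only the standard library, not the Hypothesis API. Libraries such as Hypothesis add smarter input generation and shrinking of failing cases. The run-length encoding functions are hypothetical examples chosen because they have a clear round-trip property: decoding an encoded string returns the original string.

```python
import random


def run_length_encode(text):
    """Encode a string as a list of (character, count) pairs."""
    encoded = []
    for char in text:
        if encoded and encoded[-1][0] == char:
            encoded[-1] = (char, encoded[-1][1] + 1)
        else:
            encoded.append((char, 1))
    return encoded


def run_length_decode(pairs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(char * count for char, count in pairs)


def check_round_trip_property(trials=200, seed=0):
    """Property: decode(encode(text)) == text for many random inputs."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    for _ in range(trials):
        length = rng.randint(0, 20)
        text = "".join(rng.choice("ab") for _ in range(length))
        assert run_length_decode(run_length_encode(text)) == text, text
    return trials


check_round_trip_property()
```

A two-letter alphabet is used on purpose: it makes long runs likely, which is where an encoder of this kind is most likely to go wrong.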
 
 
 

# When to test

- Always write tests for newly introduced logic when contributing code.
- When contributing a bug fix, be sure to write a test to capture the bug and prevent future regressions.
- Any activity that changes a system incurs risk (the possibility of Bad Things) along many dimensions.
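A regression test that captures a bug fix might look like the sketch below. The scenario is invented for illustration: assume an earlier implementation compared version strings directly, so `"1.10"` sorted before `"1.9"`. The hypothetical `parse_version` is the fix, and the test pins the behavior so the bug cannot quietly return.

```python
def parse_version(text):
    """Hypothetical fix: parse "1.10" into (1, 10) so comparison is numeric."""
    return tuple(int(part) for part in text.strip().split("."))


def test_regression_two_digit_minor_version_sorts_after_single_digit():
    # As plain strings, "1.10" < "1.9"; as integer tuples, (1, 10) > (1, 9).
    assert "1.10" < "1.9"
    assert parse_version("1.10") > parse_version("1.9")


test_regression_two_digit_minor_version_sorts_after_single_digit()
```

Naming the test after the failure it guards against makes the purpose obvious when it breaks months later.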

- Why don’t we just automate all the testing? 
    

- When we think about “enough” testing, it can lead to constructive discussions about alternatives with different implications. These are almost always trade-offs; it is rare that one solution is objectively “better” along all dimensions than another. Although it is fair to say that a simpler, smaller change is usually less risky than a larger, more complicated one.


``` sql
-- File: special_query.sql
-- Average of each measurement per species.
SELECT species,
       AVG(sepal_length) AS avg_sepal_length,
       AVG(sepal_width) AS avg_sepal_width,
       AVG(petal_length) AS avg_petal_length,
       AVG(petal_width) AS avg_petal_width
FROM iris
GROUP BY species;
```
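One way to test a query like this is to run it against a tiny in-memory table whose averages are known in advance. The sketch below is an assumption-laden illustration: it uses SQLite through the standard `sqlite3` module (the real target database may use a different dialect), a shortened variant of the query with two of the four averages, and made-up rows. `average_by_species` is a hypothetical helper.

```python
import sqlite3

# Shortened variant of the query; ORDER BY makes the result order deterministic.
QUERY = """
SELECT species,
       AVG(sepal_length) AS avg_sepal_length,
       AVG(petal_length) AS avg_petal_length
FROM iris
GROUP BY species
ORDER BY species;
"""


def average_by_species(rows):
    """Load rows into a throwaway in-memory database and run the query."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE iris (species TEXT, sepal_length REAL, petal_length REAL)"
        )
        connection.executemany("INSERT INTO iris VALUES (?, ?, ?)", rows)
        return connection.execute(QUERY).fetchall()
    finally:
        connection.close()


def test_average_by_species():
    # Values are chosen so the averages are exact in floating point.
    rows = [("setosa", 5.0, 1.5), ("setosa", 5.5, 1.0), ("virginica", 6.5, 5.5)]
    assert average_by_species(rows) == [
        ("setosa", 5.25, 1.25),
        ("virginica", 6.5, 5.5),
    ]


test_average_by_species()
```

Because the database lives in memory and is created inside the helper, the test leaves no trace and can run in parallel with others.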

``` json
{
 "dataset": {"artifacts_path": "<DATASET_ARTIFACTS_PATH>"},
 "train": {"script_path": "<TRAIN_SCRIPTS_PATH>", "artifacts_path": "<TRAIN_ARTIFACTS_PATH>"},
 "evaluate": {"metric": "AUC"}
}
```


    
- logging
- monitoring
- regression tests
- non-functional testing
- computer and I/O
- test only what matters


# Testing vs. monitoring
We'll conclude by talking about the similarities and distinctions between testing and monitoring. They're both integral parts of the ML development pipeline and depend on each other for iteration. Testing assures that our system (code, data and models) passes the expectations that we've established offline. Monitoring, in contrast, ensures that these expectations continue to pass online on live production data, while also ensuring that the live data distributions are comparable to the reference window (typically a subset of the training data). When these conditions no longer hold true, we need to inspect more closely (retraining may not always fix the root problem).

With monitoring, there are quite a few distinct concerns that we didn't have to consider during testing, since it involves (live) data we have yet to see:

- feature and prediction distributions (drift), typing, schema mismatches, etc.
- determining model performance (rolling and window metrics on overall and slices of data) using indirect signals (since labels may not be readily available).
- in situations with large data, knowing which data points to label and upsample for training.
- identifying anomalies and outliers.

We'll cover all of these concepts in much more depth (and code) in our monitoring lesson.
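As a rough illustration of comparing live data against a reference window, the sketch below flags drift when the live mean moves too far from the reference mean, measured in reference standard deviations. This is a deliberately simple heuristic, not the method used in the lesson referred to above. The function names and the threshold of 2.0 are arbitrary assumptions, and real monitoring would typically rely on proper statistical tests.

```python
from statistics import mean, pstdev


def mean_shift_in_std_units(reference, live):
    """Distance between the two means, scaled by the reference spread."""
    spread = pstdev(reference)
    if spread == 0:
        raise ValueError("reference window has no variance")
    return abs(mean(live) - mean(reference)) / spread


def has_drifted(reference, live, threshold=2.0):
    """Flag drift when the live mean is more than `threshold` spreads away."""
    return mean_shift_in_std_units(reference, live) > threshold


reference_window = [1, 2, 3, 4, 5]  # stand-in for a subset of training data
print(has_drifted(reference_window, [2, 3, 4]))    # similar distribution
print(has_drifted(reference_window, [9, 10, 11]))  # clearly shifted
```

The first call prints `False` and the second `True`.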


- pytest [1](https://towardsdatascience.com/pytest-features-that-you-need-in-your-testing-life-31488dc7d9eb), [2](https://blog.daftcode.pl/the-cleaning-hand-of-pytest-28f434f4b684), [3](https://www.youtube.com/watch?v=fv259R38gqc&ab_channel=MattLayman)
- [property](https://medium.com/clarityai-engineering/property-based-testing-a-practical-approach-in-python-with-hypothesis-and-pandas-6082d737c3ee)
- [entire flow](https://www.jeremyjordan.me/testing-ml/)



- Ways to mock inputs:
    - Write hard-coded mocks.
    - Inject hard-coded mocks using [fixtures](https://docs.pytest.org/en/6.2.x/fixture.html).
    - Use generators like [mimesis](https://github.com/lk-geimfari/mimesis) and [faker](https://github.com/joke2k/faker).
    - [Record the required inputs](https://vcrpy.readthedocs.io/en/latest/).
    - Mock function calls using [fixtures](https://docs.pytest.org/en/6.2.x/fixture.html).
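A minimal sketch of mocking a function call, using the standard library's `unittest.mock` and plain dependency injection rather than a pytest fixture. `fetch_temperature` and `describe_weather` are hypothetical names, and the external call is simulated by a function that always fails so the example cannot touch the network.

```python
from unittest import mock


def fetch_temperature(city):
    """Hypothetical external call; a real version might query an HTTP API."""
    raise RuntimeError("network access is not available in tests")


def describe_weather(city, fetch=fetch_temperature):
    """Code under test: the dependency is injected so tests can replace it."""
    temperature = fetch(city)
    label = "warm" if temperature >= 20 else "cold"
    return f"{city}: {label}"


def test_describe_weather_with_hard_coded_mock():
    fake_fetch = mock.Mock(return_value=25)  # hard-coded mocked input
    assert describe_weather("Lisbon", fetch=fake_fetch) == "Lisbon: warm"
    fake_fetch.assert_called_once_with("Lisbon")


test_describe_weather_with_hard_coded_mock()
```

In a pytest suite, the `mock.Mock` object could be created in a fixture and injected into several tests.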
    



- static data prep
    - advantages:
        - manual configuration: easy
        - automated configuration: creates fresh data anytime
        - cloned database: easy to copy, all data at once
        - mocked endpoint: avoids dependencies, controls all data values
    - disadvantages:
        - manual configuration: slow and not scalable, may fall into disrepair
        - automated configuration: tools and scripts must be maintained
        - cloned database: might be too much data, might need extra refinement
        - mocked endpoint: difficult to set up and maintain
    - properties
        - created before the test run
        - good for slow or complicated data
        - may make tests run faster
        - may make tests brittle as data changes
        - may turn stale over time
    - factors
        - size
        - freshness
        - update frequency
        - difficulty
        - bureaucracy
        - cost
        - skill
    - tools
        - https://github.com/spulec/freezegun
        - https://github.com/lk-geimfari/mimesis
        - https://github.com/joke2k/faker
- dynamic data prep
    - properties
        - created when tests run
        - avoids brittle references
        - exclusive use
    - tools
        - https://github.com/obspy/vcr, https://medium.com/@light_khan/mock-testing-in-python-using-vcrpy-ff3eb05ae5ec
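Dynamic data prep can be sketched with the standard library alone: the test creates its own data when it runs, uses it exclusively, and cleans up afterwards. `count_rows` and the CSV contents are assumptions invented for this example.

```python
import csv
import tempfile
from pathlib import Path


def count_rows(csv_path):
    """Hypothetical code under test: count the data rows in a CSV file."""
    with open(csv_path, newline="") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def test_count_rows_with_dynamic_data():
    # The data is created when the test runs, in a directory only this test
    # uses, and the directory is removed as soon as the block exits.
    with tempfile.TemporaryDirectory() as workdir:
        path = Path(workdir, "users.csv")
        path.write_text("name,age\nada,36\nlinus,54\n")
        assert count_rows(path) == 2


test_count_rows_with_dynamic_data()
```

Compared with a static fixture file checked into the repository, this avoids brittle references to shared data at the cost of a little setup work per test.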



``` python
texts = [
    "CNNs for text classification.",  # CNNs are typically seen in computer-vision projects
    "This should not produce any relevant topics.",  # should predict `other` label
]
predict.predict(texts=texts, artifacts=artifacts)
# ['natural-language-processing', 'other']
```
                
                
                
``` python
# tests/model/test_behavioral.py
from pathlib import Path

import pytest

from config import config
from tagifai import main, predict


@pytest.fixture(scope="module")
def artifacts():
    """Load the trained artifacts once per test module."""
    # read_text() closes the file, unlike a bare open(...).read()
    run_id = Path(config.CONFIG_DIR, "run_id.txt").read_text()
    return main.load_artifacts(run_id=run_id)


@pytest.mark.parametrize(
    "text_a, text_b, tag",
    [
        (
            "Transformers applied to NLP have revolutionized machine learning.",
            "Transformers applied to NLP have disrupted machine learning.",
            "natural-language-processing",
        ),
    ],
)
def test_inv(text_a, text_b, tag, artifacts):
    """INVariance via verb injection (changes should not affect outputs)."""
    tag_a = predict.predict(texts=[text_a], artifacts=artifacts)[0]["predicted_tag"]
    tag_b = predict.predict(texts=[text_b], artifacts=artifacts)[0]["predicted_tag"]
    assert tag_a == tag_b == tag
```
    
    

``` python
# Minimum Functionality Tests (simple input/output pairs)
tokens = ["natural language processing", "mlops"]
texts = [f"{token} is the next big wave in machine learning." for token in tokens]
predict.predict(texts=texts, artifacts=artifacts)

# DIRectional expectations (changes with known outputs)
tokens = ["text classification", "image classification"]
texts = [f"ML applied to {token}." for token in tokens]
predict.predict(texts=texts, artifacts=artifacts)

# INVariance via verb injection (changes should not affect outputs)
tokens = ["revolutionized", "disrupted"]
texts = [f"Transformers applied to NLP have {token} the ML field." for token in tokens]
predict.predict(texts=texts, artifacts=artifacts)
```
                 
                 
- https://madewithml.com/courses/mlops/testing/
- https://slack.engineering/continuous-load-testing/
- https://www.hillelwayne.com/post/metamorphic-testing/
- https://blog.twitter.com/engineering/en_us/a/2015/diffy-testing-services-without-writing-tests
- https://www.hillelwayne.com/post/a-bunch-of-tests/
- https://www.hillelwayne.com/post/cross-branch-testing/
- https://www.hillelwayne.com/post/pbt-contracts/
- https://sre.google/sre-book/testing-reliability/
- https://www.sqlite.org/testing.html
- https://blog.acolyer.org/2016/11/29/early-detection-of-configuration-errors-to-reduce-failure-damage/

- To catch errors early, whenever new components are created, accompanying tests should be created to validate their functionality. This helps ensure code reliability and makes problems easy to trace.
- Pro Tip #1❗ Consider using tests as a documentation aid. In cases where you want to show the invocation code and the result it returns, consider using doctests.

- Pro Tip #2❗ Even for integration tests, small is better.
- Pro Tip #3❗ Isolate the test environments:
    - restrict access while testing
    - deploy new containers
    - make database clones?

- Pro Tip #4❗ Testing in production introduces a change to the production environment, and the people operating the service don't have much of an inkling as to whether the test will succeed or fail. It becomes important for such tests to be idempotent.
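The doctest tip can be sketched as follows. `slugify` is a hypothetical function invented for this example. The docstring shows the invocation and the returned value, and the standard `doctest` module checks that the two still agree.

```python
import doctest


def slugify(title):
    """Turn a title into a URL-friendly slug.

    >>> slugify("Testing in Production")
    'testing-in-production'
    >>> slugify("  Hello   World ")
    'hello-world'
    """
    return "-".join(title.lower().split())


# Runs the examples in the docstring; prints a report only if one of them fails.
doctest.run_docstring_examples(slugify, {"slugify": slugify})
```

For a whole module, running `python -m doctest your_module.py` (with `your_module.py` standing in for the real file name) executes every docstring example in it.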