# Notes:

## Module 5.1: Intro to ML Monitoring

We have learnt how to run and track ML experiments, and deploy the chosen models into production. The prediction services are now up and running, and generating predictions for given data.

So are we done now?

No, not yet. Over time, both the underlying business concept and the data change. We need to keep checking model performance so that we can take corrective action in time.

Monitoring ML models mostly comes down to monitoring four areas:

- Service Health: General Software health check
- Model Performance: Measured with metrics that suit the problem
- Data Quality and integrity
- Data Drift & Concept Drift

Over time, ML models may degrade. This is due to one of two effects:

- Data Drift: The new input data is no longer well represented by the model's training dataset. Example: three popular venues opened in the last month, and our taxi duration model has no samples of this new data in its training dataset.

- Concept Drift: The concept changes, i.e. the relationship between inputs and outputs has changed, although the input data itself may not have. As the name implies, this drift is caused by changing "concepts" (hidden variables, underlying assumptions, etc.). Example: taxis have been replaced by newer, faster, nimbler cars, so our model can no longer accurately predict trip durations.
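
One simple way to check a single numeric feature for data drift is to compare its distribution in the training data with its distribution in recent data. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) in plain Python. The trip distances and the `DRIFT_THRESHOLD` value are made up for illustration, and this only looks at inputs: detecting concept drift additionally needs the true outcomes.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the empirical CDFs of the two samples:
    0.0 means the CDFs coincide, 1.0 means the samples do not overlap at all.
    """
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for value in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, value) / len(ref)
        cur_cdf = bisect_right(cur, value) / len(cur)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Hypothetical trip distances (km): training data vs. two later months.
training_distances = [1.2, 2.5, 3.1, 4.0, 5.5, 6.3, 7.8, 8.1]
similar_month = [1.0, 2.7, 3.3, 4.2, 5.1, 6.0, 7.5, 8.4]
shifted_month = [9.5, 10.2, 11.8, 12.4, 13.0, 14.6, 15.1, 16.9]

DRIFT_THRESHOLD = 0.5  # assumed cut-off for this example, tune per feature

for name, sample in [("similar", similar_month), ("shifted", shifted_month)]:
    score = ks_statistic(training_distances, sample)
    print(f"{name}: KS={score:.2f} drift={score > DRIFT_THRESHOLD}")
```

The second sample stands in for the "new venues" case: every trip is longer than anything seen in training, so the statistic reaches its maximum.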

### Types of monitoring:

Depending on the requirements, we can choose **online** monitoring, where we continuously read input and output data to find inconsistencies, or **batch** monitoring, where we periodically read the stored input and output data to check the status.

Batch monitoring:

- It is what most production scenarios use.
- Pipelines are orchestrated with tools like Prefect or Airflow. After certain steps in the pipeline, monitoring calculations generate metrics that show whether the data and the model behave as expected.
- Metrics can be calculated from the stored data and saved in a SQL or NoSQL database, and visualizations can be built with tools such as Tableau or Power BI.
- Alternatively, Evidently can create the required metrics out of the box and produce the visualizations automatically as well.
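
The "calculate metrics from stored data and save them in a SQL database" step can be sketched with the standard library alone. This is a minimal illustration, not the course's pipeline: the `batch_metrics` table, the record layout, and the sample values are all assumptions, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3
from datetime import date


def compute_batch_metrics(records):
    """Summarise one batch of stored predictions.

    Each record is a dict with "prediction" and "actual" (None when missing).
    """
    total = len(records)
    if total == 0:
        raise ValueError("batch is empty")
    labelled = [r for r in records if r["actual"] is not None]
    missing_share = (total - len(labelled)) / total
    if labelled:
        mae = sum(abs(r["prediction"] - r["actual"]) for r in labelled) / len(labelled)
    else:
        mae = None  # no ground truth available for this batch
    return {"n_rows": total, "missing_share": missing_share, "mae": mae}


def store_metrics(conn, batch_date, metrics):
    """Append one row of metrics to a (hypothetical) monitoring table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batch_metrics "
        "(batch_date TEXT, n_rows INTEGER, missing_share REAL, mae REAL)"
    )
    conn.execute(
        "INSERT INTO batch_metrics VALUES (?, ?, ?, ?)",
        (batch_date.isoformat(), metrics["n_rows"], metrics["missing_share"], metrics["mae"]),
    )
    conn.commit()


# Made-up batch of trip-duration predictions (minutes).
batch = [
    {"prediction": 12.0, "actual": 10.0},
    {"prediction": 20.0, "actual": 23.0},
    {"prediction": 8.0, "actual": None},
    {"prediction": 15.0, "actual": 16.0},
]

conn = sqlite3.connect(":memory:")  # swap for a real database in practice
metrics = compute_batch_metrics(batch)
store_metrics(conn, date(2023, 1, 31), metrics)
print(conn.execute("SELECT * FROM batch_metrics").fetchall())
```

In an orchestrated pipeline, a function like `compute_batch_metrics` would run as one task after the prediction step, and a dashboard tool would read from the metrics table.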

For online/real-time monitoring, the metrics generated by Evidently can be stored in a Prometheus database, which integrates well with Grafana for visualization.
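
To make the idea of online monitoring concrete, here is a small sketch that keeps a rolling window over the most recent predictions and raises a flag when the error in that window gets too high. It does not use the Evidently or Prometheus APIs. The `RollingMonitor` class, the window size, and the threshold are hypothetical, and it assumes the actual value arrives together with the prediction, which in practice is often delayed.

```python
from collections import deque


class RollingMonitor:
    """Track the mean absolute error over the most recent predictions."""

    def __init__(self, window_size, alert_threshold):
        self.errors = deque(maxlen=window_size)  # old errors fall off automatically
        self.alert_threshold = alert_threshold

    def observe(self, prediction, actual):
        """Record one prediction/actual pair and return the current rolling MAE."""
        self.errors.append(abs(prediction - actual))
        return sum(self.errors) / len(self.errors)

    def is_alerting(self):
        """True once the window is full and its MAE exceeds the threshold."""
        if len(self.errors) < self.errors.maxlen:
            return False
        return sum(self.errors) / len(self.errors) > self.alert_threshold


# Made-up stream of (prediction, actual) trip durations in minutes.
monitor = RollingMonitor(window_size=3, alert_threshold=5.0)
stream = [(10.0, 11.0), (12.0, 13.0), (9.0, 10.0), (10.0, 20.0), (11.0, 22.0)]
for prediction, actual in stream:
    rolling_mae = monitor.observe(prediction, actual)
    print(f"rolling MAE={rolling_mae:.2f} alert={monitor.is_alerting()}")
```

In a real setup, the rolling value would be exported as a metric and the alerting rule would live in the monitoring stack rather than in application code.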

### Environment Setup:

#### Creating and activating a virtual environment:

1. Go to the correct directory: `cd "mlops-zoomcamp/05-monitoring"`
2. Activate any existing conda virtual environment: `. /opt/homebrew/anaconda3/bin/activate && conda activate /opt/homebrew/anaconda3/envs/mlops-zoomcamp-venv;`
3. Create a new virtual environment: `conda create -n py11 python=3.11`
4. Activate it: `conda activate py11`
5. Install the required packages: `pip install -r requirements.txt`

#### Creating config files:

A Docker Compose configuration file is a YAML file that lists all the services. You can use it to build and run every service defined in the file.

Here is the [link](docker-compose.yml) to the completed Docker Compose file.

Here is the [link](/mlops-zoomcamp/05-monitoring/config/grafana_datasources.yaml) to the completed Grafana data sources file.

Once you have created the two files above, open the Docker app and run `docker-compose up --build` from the command line (inside your conda environment).

Once the build command prints green messages like the ones below, check whether you can access the services:

<img src="notes-images/docker-compose build.png" width="700"/>

- To access Grafana, open localhost:3000 in a browser and log in with `admin` as both the username and the password.
- To access Adminer, open localhost:8080 in a browser.
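
If you prefer to check from the terminal first, the following sketch tests whether something is listening on the two ports mentioned above. It only confirms that a TCP connection can be opened, not that Grafana or Adminer is fully ready. The helper name `is_port_open` and the `SERVICES` mapping are illustrative.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


# Ports taken from the notes above: Grafana on 3000, Adminer on 8080.
SERVICES = {"Grafana": 3000, "Adminer": 8080}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (localhost:{port}): {status}")
```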