# WATER QUALITY MONITORING

### Attributes
- **Sensor ID:** Unique identifier for the sensor collecting the data.
- **Location:** The geographical location where the data was collected.
- **Timestamp:** The date and time when the data was recorded.
- **Temperature:** The water temperature measured by the sensor.
- **pH:** The pH level of the water.
- **Conductivity:** The conductivity of the water.
- **Additional Attributes (optional):** Attributes like turbidity, dissolved oxygen, etc., depending on the available sensors and their capabilities.

### 1. POST
- **Purpose:** Used to create a new water quality record on the server.
- **URL Format:** Typically targets the base URL of the resource collection.
- **Example:** Creating a new water quality record.
- **Endpoint:** `/api/water-quality`
- **Method:** POST
- **Request Body:**
    ```json
    {
      "sensor_id": "sensor123",
      "temperature": 22.5,
      "pH": 7.2,
      "conductivity": 500,
      "location": "Lake A",
      "timestamp": "2024-07-31T10:00:00Z"
    }
    ```
- **Responses:**
  - `201:` Record successfully created.
  - `400:` Invalid data format.
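
The choice between `201` and `400` depends on validating the request body. The following is a minimal sketch of such a check, not the service's actual implementation. The names `REQUIRED_FIELDS`, `validate_record`, and `status_for` are hypothetical, and the rules (all six fields required, pH between 0 and 14, ISO 8601 timestamp) are assumptions based on the example body above.

```python
from datetime import datetime

# Hypothetical validation rules: field names follow the example request body,
# but the required set and accepted types are assumptions.
REQUIRED_FIELDS = {
    "sensor_id": str,
    "temperature": (int, float),
    "pH": (int, float),
    "conductivity": (int, float),
    "location": str,
    "timestamp": str,
}


def validate_record(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the payload is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected_type):
            # bool is a subclass of int in Python, so it is rejected explicitly
            errors.append(f"wrong type for field: {field}")

    # Assumed range check: readings outside 0-14 are treated as invalid.
    ph = payload.get("pH")
    if isinstance(ph, (int, float)) and not isinstance(ph, bool) and not 0 <= ph <= 14:
        errors.append("pH must be between 0 and 14")

    # Timestamps are assumed to be ISO 8601, e.g. "2024-07-31T10:00:00Z".
    ts = payload.get("timestamp")
    if isinstance(ts, str):
        try:
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            errors.append("timestamp must be ISO 8601")
    return errors


def status_for(payload: dict) -> int:
    """Map the validation outcome to the documented status codes."""
    return 400 if validate_record(payload) else 201
```

Returning the full list of errors, rather than stopping at the first one, lets the `400` response tell the client everything that needs fixing in one round trip.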

### 2. GET
- **Purpose:** Used to retrieve water quality records.
- **URL Format:** Typically targets a specific resource or the base URL of the resource collection.
- **Example:** Retrieving water quality records for a specific location.
- **Endpoint:** `/api/water-quality`
- **Method:** GET
- **Request Parameters:**
  - `start_date` (string): The start date for the records to retrieve.
  - `end_date` (string): The end date for the records to retrieve.
  - `location` (string): The location for which to retrieve water quality records.
- **Response Body:**
    ```json
    [
      {
        "id": 1,
        "sensor_id": "sensor123",
        "temperature": 22.5,
        "pH": 7.2,
        "conductivity": 500,
        "location": "Lake A",
        "timestamp": "2024-07-31T10:00:00Z"
      }
    ]
    ```
- **Responses:**
  - `200:` OK.
  - `404:` No records found.
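
How the three query parameters narrow the result can be sketched in plain Python. This is an illustration under assumptions, not the server's code: `parse_ts` and `filter_records` are hypothetical names, the date range is treated as inclusive, and timestamps without a timezone are assumed to be UTC.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 date or timestamp; a trailing 'Z' means UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_records(records, location=None, start_date=None, end_date=None):
    """Return the records matching the optional location and inclusive date range."""
    start = parse_ts(start_date) if start_date else None
    end = parse_ts(end_date) if end_date else None
    matches = []
    for record in records:
        ts = parse_ts(record["timestamp"])
        if location is not None and record["location"] != location:
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        matches.append(record)
    return matches
```

Note that a date-only `end_date` such as `2024-07-31` is parsed as midnight at the start of that day, so a client that wants the whole day would pass a full timestamp. An empty result list would correspond to the `404` response listed above.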

### 3. PUT
- **Purpose:** Used to update an existing water quality record on the server.
- **URL Format:** Typically targets a specific resource.
- **Example:** Updating the temperature of a water quality record.
- **Endpoint:** `/api/water-quality/{record_id}`
- **Method:** PUT
- **Request Parameters:**
  - `record_id` (integer): The ID of the water quality record to update.
- **Request Body:**
    ```json
    {
      "temperature": 23.5
    }
    ```
- **Responses:**
  - `200:` Record successfully updated.
  - `404:` Record not found.
  - `400:` Invalid data format.

### 4. DELETE
- **Purpose:** Used to delete an existing water quality record from the server.
- **URL Format:** Typically targets a specific resource.
- **Example:** Deleting a water quality record.
- **Endpoint:** `/api/water-quality/{record_id}`
- **Method:** DELETE
- **Request Parameters:**
  - `record_id` (integer): The ID of the water quality record to delete.
- **Responses:**
  - `200:` Record successfully deleted.
  - `404:` Record not found.
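
The status codes for PUT and DELETE can be illustrated with a small in-memory sketch. It is only a stand-in for a real database layer: `UPDATABLE_FIELDS`, `update_record`, and `delete_record` are hypothetical names, and the rule that `id` and `sensor_id` cannot be changed is an assumption. The update merges the supplied fields into the record, as in the PUT example above that sends only `temperature`.

```python
# Hypothetical set of fields a client may change; "id" and "sensor_id" are
# treated as immutable here, which is an assumption.
UPDATABLE_FIELDS = {"temperature", "pH", "conductivity", "location", "timestamp"}


def update_record(store: dict, record_id: int, changes: dict) -> int:
    """Merge `changes` into an existing record and return an HTTP status code."""
    if record_id not in store:
        return 404  # Record not found
    if not isinstance(changes, dict) or not changes or not set(changes) <= UPDATABLE_FIELDS:
        return 400  # Invalid data format
    store[record_id].update(changes)
    return 200  # Record successfully updated


def delete_record(store: dict, record_id: int) -> int:
    """Remove a record from the store and return an HTTP status code."""
    if record_id not in store:
        return 404  # Record not found
    del store[record_id]
    return 200  # Record successfully deleted
```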


## 2. Machine Learning

### Indicators of Data Quality and Measures to Mitigate the Effects of Poor Data Quality

- **Missing Values:**
  - **Detection:** Check each column for missing values and measure what share of the records they affect.
  - **Mitigation:** If only a few records are affected, we can drop them; otherwise we can impute the missing values (for example, with the median of the column).

- **Outliers:**
  - **Detection:** Identify data points that differ significantly from the rest, for example with the IQR rule or Z-scores. Boxplots and scatter plots help to spot them visually.
  - **Mitigation:** Depending on the data and the field, we can remove them, or keep them and stay aware of their effect (an extreme reading may be a real event rather than a sensor error).

- **Data Distribution:**
  - **Analysis:** Inspect the distribution of each column with histograms and summary statistics (mean, median, standard deviation), and check categorical columns for imbalanced classes.
  - **Mitigation:** For example, if there are many categories and some account for only 0.5% or 1% of the records, we can merge them into an "other" category so that later models perform better. We can also oversample with synthetic data generation, apply a log transformation to reduce skew, or adjust class weights.

- **Consistency and Validity:**
  - **Consistency:** Ensure that the same information is always recorded in the same format. For instance, in an age field, some records might say 'avi' and others 'grandpa' for the same category.
  - **Validity:** Apply constraints (e.g., enforce a valid range for age values).

- **Handle Duplicates:**
  - **Detection:** Check for duplicates in the records.
  - **Mitigation:** Remove them if found.

- **Correlation:**
  - **Analysis:** Check the correlation between features.
  - **Mitigation:** Highly correlated features carry largely the same (redundant) information, so we can drop one of them or combine them.
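
Several of the checks above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed pipeline: the function names are hypothetical, records are assumed to be dictionaries with `None` marking a missing value, and the IQR multiplier of 1.5 is the common convention rather than a requirement. In practice the same steps are usually done with pandas.

```python
from statistics import median, quantiles


def count_missing(rows, field):
    """Count records where `field` is absent or None."""
    return sum(1 for row in rows if row.get(field) is None)


def impute_median(rows, field):
    """Return copies of the rows with missing `field` values set to the median."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    fill = median(observed)
    return [{**row, field: fill} if row.get(field) is None else dict(row) for row in rows]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def drop_duplicates(rows, key_fields):
    """Keep the first record for each combination of `key_fields`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(name) for name in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

For sensor data, a natural duplicate key would be the pair `("sensor_id", "timestamp")`, since one sensor should not report two readings for the same instant.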

### Machine Learning Model

- **Regression Model:**
  - **Use Case:** If we want a simple baseline, we could use a linear regression model.
  - **Pros:** Simple and quick.
  - **Cons:** Assumes linear relationship between features and target, which might not hold in complex problems.

- **Deep Learning:**
  - **Use Case:** Can capture complex non-linear relationships in the data.
  - **Pros:** Powerful and versatile.
  - **Cons:** Prone to overfitting, especially with small datasets. Techniques to mitigate overfitting include early stopping, dropout, and regularization.

- **Transformers:**
  - **Use Case:** Suitable for capturing long-term dependencies and relationships in the data.
  - **Pros:** 
    - Can handle large amounts of data and capture intricate patterns.
    - Excellent at understanding contextual relationships due to self-attention mechanisms.
  - **Cons:**
    - Requires substantial computational resources.
    - Needs a large amount of data for effective training.
    - Data needs to be formatted into sequences, which can add complexity to preprocessing.

- **Time Series Forecasting Models:**
  - **Use Case:** Specifically designed to capture temporal dependencies and trends in time-series data.
  - **Pros:**
    - Tailored for time-dependent data, making them highly effective for forecasting.
    - Can model trends, seasonality, and cyclic behaviors in data.
  - **Cons:**
    - May require extensive historical data to capture long-term patterns.
    - Complex models can be prone to overfitting, especially with limited data.
    - Need careful handling of non-stationary data and potential external influencing factors.
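
As a reference point for the simplest option in the list, a linear trend can be fitted by ordinary least squares and extrapolated a few steps ahead. This is only a baseline sketch under assumptions: readings are taken to be equally spaced in time, and `fit_linear_trend` and `forecast` are hypothetical names. It captures no seasonality, which is exactly the limitation the more specialized models address.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(values, steps):
    """Extrapolate the fitted trend for the next `steps` time points."""
    intercept, slope = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + i) for i in range(steps)]
```

A baseline like this is useful mainly as a yardstick: a more complex model is worth its cost only if it clearly beats it on held-out data.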


### Other Solutions Based on This Data

1. **Water Quality Forecasting Dashboard:**
   - **Description:** We could create a comprehensive dashboard that not only predicts future water quality but also visualizes historical data, trends, and potential future scenarios.

2. **Real-Time Water Quality Mapping:**
   - **Description:** We could develop a real-time water quality mapping system to visualize water quality across different regions.

3. **Real-Time Public Health Alerts:**
   - **Description:** Develop a system that issues real-time alerts to the public about water quality issues that could impact health, such as contamination, unsafe swimming conditions, or potential increases in waterborne illnesses. This is crucial because many people drink tap water.


## 3. Machine Learning Life Cycle and Operations

### Tools
I especially like Jupyter Notebooks and Visual Studio Code. I think the debugging works pretty well and allows you to work efficiently. I also used IntelliJ for backend frameworks in the past, but not for writing Python.

For libraries, there are lots of them:

- **Data Manipulation:** pandas, numpy
- **Data Visualization:** matplotlib, seaborn
- **Machine Learning:** scikit-learn, XGBoost, LightGBM, TensorFlow, Keras, PyTorch
- **Monitoring and Logging:** MLflow, TensorBoard, Prometheus, Grafana

### Deployment

First, the CI/CD part with GitHub and Google Cloud Platform: we can build a pipeline around a Docker image that contains the model and its dependencies, plus YAML files for the deployment configuration. When a push is made, the pipeline can train the model in the cloud or deploy it.

Beyond that, there are two ways to trigger the process, and I think both have their advantages and disadvantages:

- **Cron Jobs:** 
  - **Purpose:** If we have a routine for new data refresh or new models from time to time, we can use cron jobs to automate the process. 
  - **Process:** Get the data again, preprocess it, and then train the model again. This helps in getting better models in the future.

- **HTTP Requests:** 
  - **Purpose:** If we don't have a routine for regular updates, we could trigger the process with HTTP requests, similar to the git push explained before.
  - **Implementation:** Although I have never done it with HTTP, the service could expose an endpoint like the following:

#### 1. POST
- **Purpose:** Used to send data to the server and receive predictions based on the provided data.
- **URL Format:** Targets the endpoint responsible for making predictions.
- **Example:** Sending data to get water quality predictions.
- **Endpoint:** `/api/predict`
- **Method:** POST
- **Request Body:**
    - **Description:** The request body should contain the data for which predictions are needed.
    - **Example:**
    
      ```json
      {
        "feature1": 22.5,
        "feature2": 7.2,
        "feature3": 5
      }
      ```
- **Response Body:**
    - **Description:** The response will contain the predictions for the provided data.
    - **Example:**
      ```json
      [0.5]
      ```
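
The request handling behind such an endpoint can be sketched independently of any web framework. This is an assumption-laden illustration: `FEATURES`, `dummy_model`, and `handle_predict` are hypothetical names, the placeholder model is not a trained model, and the `200`/`400` status codes are an assumption since none are specified for this endpoint.

```python
import json

# Feature names taken from the example request body; the order is an assumption.
FEATURES = ("feature1", "feature2", "feature3")


def dummy_model(row):
    """Placeholder for a trained model: returns the mean of the inputs."""
    return sum(row) / len(row)


def handle_predict(body: str, model=dummy_model):
    """Return (status_code, response_body) for a POST /api/predict request."""
    try:
        payload = json.loads(body)
        row = [float(payload[name]) for name in FEATURES]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing feature, or a non-numeric value
        return 400, json.dumps({"error": "invalid input"})
    # The response is a JSON list of predictions, as in the example above.
    return 200, json.dumps([model(row)])
```

Passing the model in as a parameter keeps the handler testable and makes it straightforward to swap in a newly trained model without touching the request parsing.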

### Monitor

I haven't worked much on monitoring models in the cloud, but most platforms provide metrics about performance and predictions. I would compare the training, validation, and test metrics to check performance. The F1 score, which combines precision and recall, is also useful for imbalanced datasets because it takes true positives, false positives, and false negatives into account.
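
To make the relationship between these quantities concrete, here is a small sketch that computes precision, recall, and F1 from true and predicted labels. The function name `precision_recall_f1` is hypothetical, and returning `0.0` when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

True negatives do not appear anywhere in the formula, which is why F1 stays informative when the negative class dominates the dataset.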

There are also tools like Prometheus and Grafana for monitoring, and MLflow for tracking and managing models.


## 4. Database
### MongoDB:
- **Pros:**
  - Schema flexibility
  - Horizontal scalability
  - High write throughput
  - Document-oriented storage
  - Powerful aggregation framework
- **Cons:**
  - Eventual consistency when reading from secondary replicas
  - Multi-document ACID transactions are available but add overhead
  - Less efficient for complex relational queries
- **Uses:**
  - CMS
  - IoT
  - Real-time analytics
  - E-commerce
  - Big data applications

### PostgreSQL:
- **Pros:**
  - Strong ACID compliance
  - Advanced querying capabilities
  - Data integrity
  - Full-text search
  - Extensibility
- **Cons:**
  - Schema rigidity
  - Primarily vertical scalability
  - Complex horizontal scaling
- **Uses:**
  - Transactional systems
  - Data warehousing
  - Enterprise applications
  - Geospatial data
  - Business intelligence

### Choose MongoDB:

#### High Data Ingestion Rates:
- **Requirement:** We need to continuously ingest large volumes of data to keep our visualizations up to date.
- **Advantage:** MongoDB is optimized for high write throughput, making it ideal for real-time data ingestion from various sensors.

#### Frequent Data Reads:
- **Requirement:** Our project involves reading large datasets to feed our machine learning models.
- **Advantage:** MongoDB's efficient read performance, especially when properly indexed, ensures that we can quickly access the data needed for model training and analytics.

#### Schema Flexibility:
- **Requirement:** The nature of our data from various sensors means that the structure can evolve over time.
- **Advantage:** MongoDB's schema-less design allows us to handle this variability seamlessly, without the need for complex schema migrations.

#### Scalability:
- **Requirement:** As our project grows, we will need to scale our database.
- **Advantage:** MongoDB's horizontal scalability allows us to distribute the data across multiple nodes easily, ensuring that we can handle increasing data volumes without performance degradation.

#### Reduced Need for Joins:
- **Requirement:** Given that we are not a bank or large enterprise, our use case does not require complex transactions or joins.
- **Advantage:** MongoDB's document-oriented approach fits well with our data structure and access patterns, reducing the overhead of managing complex relationships.

#### Eventual Consistency:
- **Requirement:** While strong consistency is crucial for financial institutions, our project can tolerate eventual consistency.
- **Advantage:** This trade-off allows us to benefit from MongoDB's performance and flexibility.

### Conclusion:
MongoDB is well-suited for our machine learning and analytics project due to its high write throughput, efficient read performance, schema flexibility, and scalability. These features align well with our need for real-time data ingestion, frequent data reads for model training, and the ability to handle evolving data structures.


### Collections

## Embedded vs. Referenced in MongoDB

### Embedded is Nice When:
- **One-to-One Relationship:** For instance, a sensor can only be in one location. We should consider whether each location (e.g., Lake A) has one sensor or multiple sensors.
- **Combined Information Needs:** When we need information about the sensor along with its location (eliminating the need for joins).
- **Static Location:** If the sensor's location doesn't change often.
- **Minimal Information:** If there's not too much information to store.

### Referenced is Nice When:
- **Dynamic Location:** The sensor's location changes often.
- **Shared Location:** If many sensors share the same location (reducing redundancy).
- **Multiple Sensors per Location:** If one location has many sensors.

### Decision:
We assume that each location has only one sensor and that sensors are static and do not change places, so we will use embedded documents.

### Example Collections

#### Sensors Collection (with Embedded Location):
```json
{
  "_id": "sensor123",
  "sensor_name": "Temperature Sensor",
  "sensor_type": "Temperature",
  "location": {
    "name": "Lake A",
    "latitude": 40.7128,
    "longitude": -74.0060,
    "description": "A large freshwater lake"
  }
}
```

#### Water Quality Records Collection (Referencing Sensor):
```json
{
  "_id": "record123",
  "sensor_id": "sensor123",
  "timestamp": "2024-07-31T10:00:00Z",
  "temperature": 22.5,
  "pH": 7.2,
  "conductivity": 500,
  "additional_attributes": {
    "turbidity": 5.0,
    "dissolved_oxygen": 8.0
  }
}
```
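
Because a record stores only `sensor_id`, reading it together with the sensor's location requires one extra lookup. The sketch below shows that resolution step in plain Python, with a dictionary standing in for the sensors collection. `attach_sensor` is a hypothetical helper; in MongoDB itself the equivalent could be a second query or a `$lookup` aggregation stage.

```python
def attach_sensor(record, sensors_by_id):
    """Return a copy of the record with its sensor document attached.

    `sensors_by_id` maps a sensor `_id` to its document (including the
    embedded location). The value is None if the sensor is unknown.
    """
    sensor = sensors_by_id.get(record["sensor_id"])
    return {**record, "sensor": sensor}


# Example documents shaped like the two collections above.
sensors = {
    "sensor123": {
        "_id": "sensor123",
        "sensor_type": "Temperature",
        "location": {"name": "Lake A", "latitude": 40.7128, "longitude": -74.0060},
    }
}
record = {"_id": "record123", "sensor_id": "sensor123", "temperature": 22.5}
enriched = attach_sensor(record, sensors)
```

Here `enriched["sensor"]["location"]["name"]` gives the location of the reading, while the original record stays small because the sensor details are stored only once.
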
### Configuration Database

The decision to use embedding or referencing is important during the design phase. We should also implement:

- **Replication:** Ensures high availability and protection against data loss.
- **Sharding:** Distributes data across multiple machines for very large datasets. Each shard contains a subset of the data, providing horizontal scalability and load balancing, which increases performance.
- **Indexes:** Creating indexes on frequently queried fields can speed up read operations.


## 5. Deployment

For deployment, I would use Kubernetes because it is an effective way to deploy both the database and the models. (I used Heroku because it was free at the time.)

I think there are two main approaches:

#### Use Server of Your Own (On-Premises):

- **Pros:**
  - Full control over the infrastructure.
  - Customization according to specific needs.
  - Enhanced data privacy.

- **Cons:**
  - Expensive due to infrastructure and maintenance costs.
  - Requires a dedicated IT team for maintenance.
  - Scaling can be difficult.

#### Use Cloud Providers (AWS, GCP, Azure):
AWS is the most widely used cloud provider (wide range of services, enterprise-grade applications). GCP is great for data analytics and ML, so it may be a good fit for this case. Azure is great for companies that already use a range of Microsoft products.

- **Pros:**
  - Scalability is easily managed.
  - Kubernetes is easy to use with cloud providers.
  - High availability of services.

- **Cons:**
  - Generally less expensive than owning your own server, but pay-as-you-go costs still add up.
  - Dependency on another company.
  - More challenges in terms of data privacy.

## 6. Testing, Documentation, and Version Control

### Testing

Here, I would distinguish different kinds of tests:

- **Unit Testing:**
  - Useful for testing individual components or functions to ensure they work as expected.
  - Tools: `unittest`, `pytest`, and plain `assert` statements.

- **Integration Testing:**
  - Tests interactions between different components, such as between the API and the database.

- **End-to-End Testing:**
  - Verifies the entire application flow from start to finish, such as retrieving data, training the model, and deploying it.
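
A unit test at the first level could look like the sketch below, written with `unittest`. The function under test, `is_valid_ph`, is a hypothetical helper invented for this example, and the 0 to 14 range is an assumption.

```python
import unittest


def is_valid_ph(value):
    """Hypothetical helper under test: accept numeric pH readings from 0 to 14."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return 0 <= value <= 14


class TestIsValidPh(unittest.TestCase):
    def test_accepts_typical_reading(self):
        self.assertTrue(is_valid_ph(7.2))

    def test_accepts_boundaries(self):
        self.assertTrue(is_valid_ph(0))
        self.assertTrue(is_valid_ph(14))

    def test_rejects_out_of_range(self):
        self.assertFalse(is_valid_ph(15))

    def test_rejects_non_numeric(self):
        self.assertFalse(is_valid_ph("7.2"))


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIsValidPh)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a project layout, the test class would normally live in its own file and be discovered with `python -m unittest` or `pytest`, both of which can run `unittest.TestCase` classes.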

### Documentation

For documentation, I think using docstrings and commenting on the code functions is really useful. For example:

```python
def add_record(data):
    """
    Add a new water quality record to the database.

    Parameters:
        data (dict): A dictionary containing water quality information.
            - sensor_id (str): The ID of the sensor.
            - temperature (float): The temperature reading.
            - pH (float): The pH level.
            - conductivity (int): The conductivity value.
            - location (str): The location name.
            - timestamp (str): The time of the reading.

    Returns:
        dict: Response message and status.
    """
```

There are also technologies that help you create documentation, such as:

- **Sphinx:** Generates documentation for Python projects in formats such as HTML and PDF.
- **Swagger:** Used for API documentation.

### Version Control

We will create a Git repository and link it to GitHub.

**Locally:**

```bash
git init
git add .
git commit -m "Initial commit: Set up project structure"
```

**On GitHub:**

- Create a repository named `water_quality_monitoring`.
- Link the local repository to GitHub:
```bash
git remote add origin https://github.com/eduardpuga/water_quality_monitoring.git
git push -u origin master
```

Create a development branch:

```bash
git checkout -b develop
```