# Valve Position Prediction Project
This project uses data from a `Matlab` simulation of a `PID regulator` based control system
to predict the opening level of the actuating mechanism (i.e. the valve) with regressors such as `RandomForestRegressor` and `CatBoostRegressor`.
## 1. Problem definition
> How well can a valve's position be predicted using the data obtained from a Matlab simulation of a PID-based control system?
## 2. Data
> The data come from Matlab-Simulink models of a PID-based liquid level control system, stored in files such as `training_data_from_matlab2.csv`, `training_data_from_matlab3.csv`, etc.
* Timestamps in `seconds` format.
* The desired (input) levels of the liquid in meters.
* The output liquid levels in meters.
* The feedback: the difference between the desired level (input) and the actual output.
* Opening levels of the valve in % (0-100).
  
### 2.1. How to format the data
* Check the format of the data under the `Time` label, then convert them to either datetime.seconds or int64 format.
* Convert `string` datatypes to float using `pd.to_numeric(downcast = 'float')`.
* Flag missing data by generating `is_missing` boolean data for each column of the input data.
* Check for missing values with `isna().sum()` and `_is_missing.value_counts()`, then replace them with the column `medians`.
* Eliminate erroneous data, i.e. liquid levels above `1.0` in the `input` and `liquid_level` columns and valve opening levels above `100` in `valve_positions`, by replacing them with medians.

> The data input and processing will be done using a function `preprocess_data(FILE_PATH)` fulfilling all the steps specified above. 
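
The formatting steps above can be sketched on a tiny hand-made table. This is only an illustration under stated assumptions: the sample values are invented, and the column names and the `" sec"` suffix are taken from the preprocessing code later in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample in the layout described above; the values are made up.
raw_csv = io.StringIO(
    "Time,Input,Liquid_levels,Feedback,Valve_Positions\n"
    "0 sec,0.5,0.00,0.50,100.0\n"
    "1 sec,0.5,0.20,0.30,60.0\n"
    "2 sec,0.5,,0.10,250.0\n"
    "3 sec,0.5,0.45,0.05,10.0\n"
)
df = pd.read_csv(raw_csv)

# Strip the unit suffix, then convert the column to a numeric dtype.
df["Time"] = pd.to_numeric(df["Time"].str.replace(" sec", "", regex=False))

# Flag missing values before filling them with the column median.
df["Liquid_levels_is_missing"] = df["Liquid_levels"].isna()
df["Liquid_levels"] = df["Liquid_levels"].fillna(df["Liquid_levels"].median())

# Replace out-of-range valve openings with the median of the valid rows.
too_high = df["Valve_Positions"] > 100
df.loc[too_high, "Valve_Positions"] = df.loc[~too_high, "Valve_Positions"].median()

print(df)
```

Here the median is computed from the valid rows only, so that the erroneous values do not influence their own replacement; whether that is preferable to the median of the whole column is a design choice.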
## 3. Modelling
* Use `train_test_split()` from `sklearn` to divide the dataset into training and validation datasets.
* Use `RandomForestRegressor` to train the model using the training dataset.
* Use `CatBoostRegressor` to train the model using the training dataset.
* Use `RandomizedSearchCV` to find the best hyperparameters for `RandomForestRegressor` and so optimize the model.
* Use `catboostcv` to obtain the best hyperparameters for `CatBoostRegressor`.
* Evaluate the optimized model on the validation datasets.
* Use test datasets to check the predictions of the model against true labels.
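
As a rough sketch of the `RandomForestRegressor` part of this plan, the block below splits a synthetic dataset, runs a small `RandomizedSearchCV` and scores the result on the validation split. The data are an assumption made for illustration (a proportional-only controller whose valve opening is a clipped multiple of the feedback), and the search ranges are placeholders rather than tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the simulation data (assumption, not the real dataset).
rng = np.random.default_rng(42)
feedback = rng.uniform(-0.5, 0.5, size=400)
df = pd.DataFrame({
    "Feedback": feedback,
    "Valve_Positions": np.clip(200.0 * feedback, 0.0, 100.0),
})

X = df.drop("Valve_Positions", axis=1)
y = df["Valve_Positions"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative search space; the ranges are placeholders.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out validation split.
valid_mae = mean_absolute_error(y_valid, search.predict(X_valid))
print("Best parameters:", search.best_params_)
print("Validation MAE:", valid_mae)
```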
## 4. Saving the model for export and reuse
At this step, we will save the model. 
* Use the `pickle` module to save the models
* Or the `joblib` module to save the models.
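
A minimal sketch of both options, assuming a small placeholder model and hypothetical file names written to a temporary directory:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Tiny placeholder model; in the project this would be the tuned regressor.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # The file names are hypothetical.
    pickle_path = Path(tmp_dir) / "valve_model.pkl"
    joblib_path = Path(tmp_dir) / "valve_model.joblib"

    # Option 1: the standard-library pickle module.
    with open(pickle_path, "wb") as file:
        pickle.dump(model, file)
    with open(pickle_path, "rb") as file:
        model_from_pickle = pickle.load(file)

    # Option 2: joblib.
    dump(model, joblib_path)
    model_from_joblib = load(joblib_path)

# Both reloaded models should reproduce the original predictions.
same_pickle = np.allclose(model.predict(X), model_from_pickle.predict(X))
same_joblib = np.allclose(model.predict(X), model_from_joblib.predict(X))
print(same_pickle, same_joblib)
```

Both formats can execute arbitrary code on loading, so saved models should only be loaded from trusted sources.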
  
## 5. Applying the output of the model on the transfer function of the control object
* This purpose can be achieved using the `control` library.
* Use the `control` library to simulate the output liquid level due to the valve opening levels predicted by the models in step 3

## 2. Input and preprocessing of data
Define a function named `preprocess_data(FILE_PATH)` which takes the liquid level control system modelling data obtained from simulations in MATLAB and preprocesses it into a format that can be used for modelling.

### Import the libraries

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import control as ct
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from catboost import CatBoostRegressor
import pickle
from joblib import load, dump
%matplotlib inline

### 2.1. Preparing the function to preprocess the data
This function will: 
* Convert data in the dataframe that are presented in `string` format (such as `Time`) to numerical
* Eliminate erroneous data (`Valve_Positions > 100`) and (`Liquid_levels > 1.5`)
* Eliminate the missing values

In [2]:
#FILE_PATH = "training_data_from_matlab5.csv" or anything like that....

def preprocess_data(PROVIDED_FILE_PATH):
    """
    This function prepares the data used to predict the opening level of a valve on the basis of input data, feedback data and timestamps
    The data is to be imported from the datasheets as specified in "FILE_PATH"
    The file will be imported to a pandas dataframe
    The time data is to be converted from string to numeric
    A check will be performed to remove string type data, which is to be converted to float type
    Then a check will be performed for any missing data
    Then erroneous data (liquid levels > 1.5 and valve positions > 100) will be replaced with the median values of the respective columns
    The dataframe will be ready for further action!
    """
    # Step-1 read data from the file path
    print("Reading data....")
    df_data = pd.read_csv(PROVIDED_FILE_PATH)
    # Step-2 change the datatype of time column from string to numeric
    # In this case, we need to remove the phrase 'sec' from the column
    print("Changing the Time data from string to numeric....")
    df_data["Time"] = df_data["Time"].str.replace(" sec","",regex=True)
    # Then we will have to change the datatype from object to numeric
    df_data["Time"] = pd.to_numeric(df_data["Time"])
    # Step-3: Check for string type data
    print("Checking for string type data....")
    for label, content in df_data.items():
        if pd.api.types.is_string_dtype(content):
            print("There is string type data in: ",label)
            print("Converting strings to float in: ", label)
            df_data[label] = pd.to_numeric(df_data[label],errors = "coerce")
        else:
            print("There are no strings in: ", label)
    # Step-4: Check for missing data
    print("Checking for missing data....")
    for label, content in df_data.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                print("There are missing values in: ", label)
                df_data[label + "_is_missing"] = pd.isnull(content)
                # Now fill the missing data with the column median
                df_data[label] = content.fillna(content.median())
            else:
                print("There are no missing values in: ",label)
    # Step-5: Now check for erroneous data
    print("Checking for erroneous data....")
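    # Replace out-of-range values with the median of the respective column
    # (assigning to a loop variable would leave the dataframe unchanged)
    liquid_level_errors = df_data["Liquid_levels"] > 1.5
    if liquid_level_errors.any():
        print("Erroneous data found in: Liquid_levels")
        df_data.loc[liquid_level_errors, "Liquid_levels"] = df_data["Liquid_levels"].median()
    valve_position_errors = df_data["Valve_Positions"] > 100.0
    if valve_position_errors.any():
        print("Erroneous data found in: Valve_Positions")
        df_data.loc[valve_position_errors, "Valve_Positions"] = df_data["Valve_Positions"].median()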


    
    print("Preprocessing complete....")
    print(df_data.info())
    return df_data
            

In [3]:
# Test the preprocessing data function
FILE_PATH = "training_data_from_matlab7.csv"
df_training_dataset1 = preprocess_data(FILE_PATH)

Reading data....
Changing the Time data from string to numeric....
Checking for string type data....
There are no strings in:  Time
There are no strings in:  Input
There are no strings in:  Liquid_levels
There are no strings in:  Feedback
There are no strings in:  Valve_Positions
Checking for missing data....
There are no missing values in:  Time
There are no missing values in:  Input
There are no missing values in:  Liquid_levels
There are no missing values in:  Feedback
There are no missing values in:  Valve_Positions
Checking for erroneous data....
Preprocessing complete....
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25474 entries, 0 to 25473
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Time             25474 non-null  float64
 1   Input            25474 non-null  float64
 2   Liquid_levels    25474 non-null  float64
 3   Feedback         25474 non-null  float64
 4   Valve_Positions  25474 non

## 3. Modelling using RandomForestRegressor
In this step, the data is to be modelled using `RandomForestRegressor`. 
Three steps are required to prepare and fit a model.
### 3.1. Preparing the DataFrame for modelling
A function prepares the data for modelling by dividing it into features and labels dataframes:
   * The function `prepare_for_modelling(DataFrame)` will take a DataFrame as input
   * Separate the `Valve_Positions` column as the label (target variable)
   * Use the other columns to set up the features dataframe (independent variables)
   

In [4]:
def prepare_for_modelling(DataFrame):
    """
    Function to prepare the data for fitting into the model
    Takes a dataframe as input. The Valve_Positions column is dropped to form X and is used as y
    X and y are returned
    """
    X = DataFrame.drop("Valve_Positions", axis = 1)
    y = DataFrame["Valve_Positions"]
    return X,y
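
A quick check of the split on a toy DataFrame is sketched below. The function body is repeated so that the block runs on its own, and the sample values are invented.

```python
import pandas as pd

def prepare_for_modelling(DataFrame):
    """Split a dataframe into features X and the Valve_Positions label y."""
    X = DataFrame.drop("Valve_Positions", axis=1)
    y = DataFrame["Valve_Positions"]
    return X, y

# Made-up rows in the same column layout as the preprocessed data.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Input": [0.5, 0.5, 0.5],
    "Liquid_levels": [0.0, 0.2, 0.4],
    "Feedback": [0.5, 0.3, 0.1],
    "Valve_Positions": [100.0, 60.0, 20.0],
})

X, y = prepare_for_modelling(toy)
print(list(X.columns))  # feature columns only
print(y.name)           # name of the label column
```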

### 3.2. Setup the dataframes for modelling 

At this stage, dataframes are to be prepared with different features. 
1. `df_training_data_2`: Contains all the features 
2. `df_training_data_3`: Contains only the `Feedback` feature and `Valve_Positions` (labels)
3. `df_training_data_4`: Contains `Input`, `Liquid_levels` and `Feedback` features besides the labels
4. `df_training_data_5`: Contains the `Input` and `Feedback` features besides the labels
5. `df_training_data_6`: Contains the `Input` and `Liquid_levels` features besides the labels

In [5]:
# Dataframe with all features
df_training_data_2 = df_training_dataset1.copy()
#df_training_data_2.info()

In [6]:
# Dataframe with only Feedback and Valve_Positions
df_training_data_3 = df_training_dataset1.drop(["Time", "Input", "Liquid_levels"], axis = 1)
#df_training_data_3.info()

In [7]:
# Dataframe with Input, Liquid_levels and Feedback
df_training_data_4 = df_training_dataset1.drop("Time", axis=1)
#df_training_data_4.info()

In [8]:
# Dataframe with Input and Feedback 
df_training_data_5 = df_training_dataset1.drop(["Time", "Liquid_levels"], axis = 1)
#df_training_data_5.info()

In [9]:
# Dataframe with Input and Liquid_levels
df_training_data_6 = df_training_dataset1.drop(["Time", "Feedback"], axis = 1)
#df_training_data_6.info()

Define a function named `create_modified_dataframe(DataFrame, dataframetype)` that creates the modified dataframes.

In [10]:
def create_modified_dataframe(DataFrame, dataframetype):
    """
    A function to create modified dataframes from the input DataFrame containing data about the Liquid level control system,
    the variable "dataframetype"'s value determines what type of dataframe will be returned. 
    The prediction label 'Valve_Positions' is to remain in all dataframes since it will be removed during preparation for modelling
    dataframetype = 1: Returns the dataframe with all columns
    dataframetype = 2: Returns the dataframe with only "Feedback" 
    dataframetype = 3: Returns the dataframe with "Input", "Liquid_levels" and "Feedback" 
    dataframetype = 4: Returns the dataframe with "Input" and "Feedback"
    dataframetype = 5: Returns the dataframe with "Input" and "Liquid_levels"
    """
    print("Creating modified dataframe")
    match dataframetype:
        case 1:
            print("Dataframe with all the features")
            df_training_data = DataFrame.copy()
            return df_training_data
        case 2: 
            print("Dataframe with only Feedback feature and label")
            df_training_data = DataFrame.drop(["Time", "Input", "Liquid_levels"], axis = 1)
            return df_training_data
        case 3:
            print("Dataframe with Input, Liquid_levels and Feedback features apart from label")
            df_training_data = DataFrame.drop("Time", axis=1)
            return df_training_data
        case 4: 
            print("Dataframe with Input and Feedback features as well as label")
            df_training_data = DataFrame.drop(["Time", "Liquid_levels"], axis = 1)
            return df_training_data
        case 5: 
            print("Dataframe with Input and Liquid_levels features as well as label")
            df_training_data = DataFrame.drop(["Time", "Feedback"], axis = 1)
            return df_training_data
        case _:
            print("Invalid input, try using only numbers between 1 and 5")
            return None
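
The `match` statement needs Python 3.10 or newer. As a possible alternative, the same selection can be written as a dictionary lookup. In the sketch below, `select_feature_set` and `COLUMNS_TO_DROP` are hypothetical names, and raising `ValueError` instead of returning `None` is a design choice, not the notebook's behaviour.

```python
import pandas as pd

# Columns to drop for each dataframetype; the label is always kept.
COLUMNS_TO_DROP = {
    1: [],
    2: ["Time", "Input", "Liquid_levels"],
    3: ["Time"],
    4: ["Time", "Liquid_levels"],
    5: ["Time", "Feedback"],
}

def select_feature_set(data_frame, dataframetype):
    """Hypothetical dictionary-based variant of create_modified_dataframe."""
    if dataframetype not in COLUMNS_TO_DROP:
        raise ValueError("dataframetype must be an integer between 1 and 5")
    # drop() returns a new dataframe, so the input is left untouched.
    return data_frame.drop(columns=COLUMNS_TO_DROP[dataframetype])

# Made-up single-row sample in the notebook's column layout.
sample = pd.DataFrame({
    "Time": [0.0],
    "Input": [0.5],
    "Liquid_levels": [0.1],
    "Feedback": [0.4],
    "Valve_Positions": [80.0],
})
print(list(select_feature_set(sample, 5).columns))
```

Keeping the mapping in one place means a new feature combination only needs a new dictionary entry.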

            
            
    

In [13]:
df_training_data_1 = create_modified_dataframe(df_training_dataset1, 1)
df_training_data_3 = create_modified_dataframe(df_training_dataset1, 2)

Creating modified dataframe
Dataframe with all the features
Creating modified dataframe
Dataframe with only Feedback feature and label
