# Homework 7 MLOps and Git

## Objectives

The goal of this assignment is to reinforce your understanding of MLOps,
Git commands for source control, the common structure of a machine
learning repository, and a typical structure of a py file designed for
running in a command line environment (including in a cloud
environment).



## A. Short Answers:



If you consult generative AI tools, I prefer that you refine and
synthesize their answers rather than copy and paste them directly. The
usual referencing requirements still apply.



1.  Can you explain the concept of MLOps in your own words and its
    importance in the industry?


Answer:


2.  Can you describe in your own words, how we monitor and troubleshoot
    machine learning models in production, in reference to our MLOps
    framework?


Answer:



3.  What is CI/CD in MLOps? What problems do they solve? What tools are
    commonly used for CI/CD?


Answer:


4.  What is a **feature store** in MLOps? What are its benefits?


Answer:

## B. Git Source Control




### 1. Set Up

We use the GitHub Classroom feature, which allows me to set up one
common assignment for everyone on GitHub and provides a convenient view
of everybody’s work (git commits). To get a sense of how GitHub
Classroom works and how to work on a GitHub assignment, please watch
these videos:

-   [GitHub Classroom Get Started](https://classroom.github.com/videos)
-   [GitHub Classroom: How students complete
    assignments](https://www.youtube.com/watch?v=ObaFRGp_Eko)

We will use your name as an identifier. Once you accept the invitation,
you will have a new repository created for you with a starting template
(some data files and some Jupyter notebooks).

The video illustrates making commits through the GitHub web interface,
but I expect you to **practice git commands** through a terminal (e.g.,
accessed through VSCode).



### 2. Update the `README.md` file

README.md is a markdown file for other developers who use your
repository. It typically includes information such as project
description, requirements, installation instructions, and usage
documentation.

Please update this file by adding your name and date.

Then `commit` and `push` the changes to the remote repository (using the
commit message “update readme”).


> Enter the commands used, plus any note you may have for them.



### 3. Add a `.gitignore` file

`.gitignore` is a special file that tells Git to ignore certain files
and not track them. You can add it to your project directory or
subdirectories. Read more here about [how to use
.gitignore](https://www.pluralsight.com/guides/how-to-use-gitignore-file).
Files that are commonly ignored include:

-   Log files
-   Files with API keys/secrets, credentials, or sensitive information
-   Useless system-generated files like `.DS_Store` on macOS or
    `.ipynb_checkpoints` for Jupyter notebooks
-   Dependencies that can be downloaded from a package manager.

Create a `.gitignore` file with the following content:

``` python
# Byte-compiled / optimized / DLL files
__pycache__/

# Jupyter Notebook
.ipynb_checkpoints

# Environments
env

# log file
*.log
```

Then `commit` and `push` the changes to the remote repository (using the
commit message “add .gitignore”).


> Enter the commands used, plus any note you may have for them.



### 4. Review the two Jupyter Notebooks

I took two (slightly edited) user-contributed notebooks from Kaggle as a
starting point for your project. They are located in the `notebooks`
folder. Please note that they are not fully consistent with each other.

-   **houseprice-eda.ipynb**: we will re-use the decision tree portion
    of this notebook
-   **random-forests.ipynb**: we will re-use the train_test_split and
    random forest portions of this notebook

If you’re converting `ipynb` files to `py` files, please note the
following:

-   The paths may need updating. We will run the script from the project
    folder using `python scripts/[myfile].py`, so the relative path of
    your data files is `data/[datafile].csv`.
-   Some plotting/display commands may be removed, since our output will
    be a text-based console.
-   Some markdown cells may be removed.
-   It is common for `py` files to follow a defined structure such as:

``` python
"""
Brief description of the script.

@author: [author name]
"""

# import statements go here

# module-level variables
# helper functions


def main():
    # code for the main function
    pass


if __name__ == "__main__":
    # entry point for command-line execution
    main()
```


> Enter the commands used, plus any note you may have for them.



### 5. Create a “decision_tree” branch

Use git commands to create and switch to a `decision_tree` branch for
developing the decision tree model.


> Enter the commands used, plus any note you may have for them.



### 6. Create `data_prep.py` in the `decision_tree` branch

Please base the data preparation code on `random-forests.ipynb`, with
one change:

-   Save the train and test dataframes (with `label`) as `train.csv` and
    `test.csv` respectively in the `data` folder, with a header row but
    no index.
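The saving step above can be sketched as follows. This is only a minimal illustration: the helper name `save_splits` and its `data_dir` parameter are hypothetical, and it assumes you already hold the two dataframes produced by the split in the notebook.

``` python
from pathlib import Path

import pandas as pd


def save_splits(train_df, test_df, data_dir="data"):
    """Write the train/test dataframes as CSV with a header row and no index column."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The header row is written by default; index=False leaves out the row index.
    train_df.to_csv(out_dir / "train.csv", index=False)
    test_df.to_csv(out_dir / "test.csv", index=False)
    return out_dir / "train.csv", out_dir / "test.csv"
```

Without `index=False`, pandas writes the row index as an extra unnamed first column, which would then show up as a spurious feature when the file is read back.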

Test your script using:

``` bash
python scripts/data_prep.py
```

Then `commit` and `push` the changes to the remote repository.


> Enter the commands used, plus any note you may have for them.



### 7. Create `train.py` in the `decision_tree` branch

`train.py` should read from `data/train.csv` and train a DecisionTree
model (with `random_state` set to 0 for reproducibility).

-   It should display the mean absolute error for the model (when
    applied to the training data).

Save the resulting model as `models/trained_model_dt.pkl`:

``` python
import pickle
with open('models/[file].pkl', 'wb') as file:
    pickle.dump(trained_model, file)
```
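Putting the pieces of this step together, one possible shape for the training logic is sketched below. It rests on several assumptions: the function name `train_decision_tree` and the `target_col` parameter are hypothetical, every non-target column is treated as a feature, and a `DecisionTreeRegressor` is used because the task is predicting a price.

``` python
import pickle
from pathlib import Path

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor


def train_decision_tree(train_csv, target_col, model_path):
    """Fit a decision tree, print the training MAE, and pickle the model."""
    df = pd.read_csv(train_csv)
    X = df.drop(columns=[target_col])  # assumption: all other columns are features
    y = df[target_col]

    model = DecisionTreeRegressor(random_state=0)  # fixed seed for reproducibility
    model.fit(X, y)

    # Error measured on the same data the model was trained on
    mae = mean_absolute_error(y, model.predict(X))
    print(f"Training MAE: {mae:.2f}")

    Path(model_path).parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as file:
        pickle.dump(model, file)
    return model, mae
```

An unpruned tree can memorize its training data, so a training MAE at or near zero is expected here and says little about how the model performs on unseen houses.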

Test your script and then `commit` and `push` the changes to the remote
repository.

> Enter the commands used, plus any note you may have for them.



### 8. Create `test.py` in the `decision_tree` branch

`test.py` should read one line of input as a command line argument
(`--inputdata`), and use the model saved in
`models/trained_model_dt.pkl` to make a prediction. It should print the
predicted housing price.

``` bash
python scripts/test.py --inputdata "710000.0,3,1.0,594.0,-37.7385,145.0409" 
```

To set up the command line argument, you may use the `argparse` package:

``` python
    import argparse
    parser = argparse.ArgumentParser(description='[description of the module]')
    # The name of the argument is inputdata, -i is a shorthand, the argument is required.
    parser.add_argument('-i', '--inputdata', required=True, help='CSV data as a string')
    # parse the command for arguments
    args = parser.parse_args()
    # You may subsequently access the input argument using args.inputdata
```

Please note that input data is a single line from the CSV without a
column header.

You can use pandas’s `.read_csv(buffer, names=[list_of_col_names])`
where buffer could be created from the string you obtain from the
command argument:

``` python
from io import StringIO
buffer = StringIO(your_string)
```
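As a rough sketch of how the string and the buffer fit together, the hypothetical helper `parse_input_row` below turns one headerless CSV line into a single-row dataframe. The column names passed in are placeholders for illustration; in your script they should match the columns your model was trained on.

``` python
from io import StringIO

import pandas as pd


def parse_input_row(inputdata, column_names):
    """Turn one headerless CSV line into a single-row dataframe."""
    buffer = StringIO(inputdata)
    # Passing names explicitly makes pandas treat the first line as data, not a header.
    return pd.read_csv(buffer, names=column_names)


# Placeholder column names, used here only to show the call
row = parse_input_row(
    "710000.0,3,1.0,594.0,-37.7385,145.0409",
    ["c0", "c1", "c2", "c3", "c4", "c5"],
)
print(row.shape)
```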

To load the trained model from the pickle file:

``` python
import pickle
with open(pickle_file_path, 'rb') as file:
    model = pickle.load(file)
```

Test your script and then `commit` and `push` the changes to the remote
repository.


> Enter the commands used, plus any note you may have for them.



### 9. Merge the `decision_tree` branch into the `main` branch

Having tested the `decision_tree` branch, you should now merge the
branch into `main`.


> Enter the commands used, plus any note you may have for them.



### 10. Clone the github repo to a different folder `hw7copy`

Check out the `main` branch.

Then create a `randomforest` branch for developing a random forest
model.



### 11. Develop the training code for the random forest model in the `randomforest` branch

While on the `randomforest` branch, add a new script `train_rf.py` that
follows the same steps as `train.py` except:

-   It should train a random forest model.
-   It should save the resulting model as `trained_model_rf.pkl` in the
    `models` folder.

Test your script and then `commit` and `push` the changes to the remote
repository.


> Enter the commands used, plus any note you may have for them.



### 12. Develop the testing code for the random forest model in the `randomforest` branch

Instead of writing a new script, it may be useful to extend the existing
`test.py` so that users can choose the model they want to apply. This
can be done by adding an argument, e.g.:

``` bash
python scripts/test.py --inputdata "710000.0,3,1.0,594.0,-37.7385,145.0409" --model dt
```

where `--model` can take a value of `dt` or `rf`.

Modify the `test.py` to:

-   add another required argument `--model`.
-   depending on the value of args.model, load different pickle files.

The rest remains the same.
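One way to wire up the model choice is sketched below, under a few assumptions: the `MODEL_FILES` dictionary and the `build_parser` helper are hypothetical names, and `choices` is used so that `argparse` itself rejects any value other than `dt` or `rf`. The explicit argument list stands in for what a real run would read from the command line.

``` python
import argparse

# Maps each --model value to the pickle file named in this assignment
MODEL_FILES = {
    "dt": "models/trained_model_dt.pkl",
    "rf": "models/trained_model_rf.pkl",
}


def build_parser():
    parser = argparse.ArgumentParser(description="Predict a house price with a saved model")
    parser.add_argument("-i", "--inputdata", required=True, help="CSV data as a string")
    # choices restricts the accepted values to the keys of MODEL_FILES
    parser.add_argument("--model", required=True, choices=sorted(MODEL_FILES),
                        help="which trained model to load: dt or rf")
    return parser


# An explicit list replaces sys.argv[1:] so the sketch runs without a command line
args = build_parser().parse_args(
    ["--inputdata", "710000.0,3,1.0,594.0,-37.7385,145.0409", "--model", "rf"]
)
pickle_file_path = MODEL_FILES[args.model]
print(pickle_file_path)
```

In `test.py` itself you would call `parse_args()` with no arguments, then open `pickle_file_path` exactly as in the existing loading code.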

Test your script and then `commit` and `push` the changes to the remote
repository.


> Enter the commands used, plus any note you may have for them.



### 13. Merge the `randomforest` branch into the main branch

Now that you have tested the random forest model training code as well
as the inference code, you should proceed to merge the branch into the
main branch.


> Enter the commands used, plus any note you may have for them.



### 14. Create conflicting changes

In the original folder, check out the `decision_tree` branch, add a
`usage` section to the `README.md`:

    ## usage
    - `python scripts/data_prep.py`: to prepare the training/test data files
    - `python scripts/train.py`: to train a decision tree model on the training data 
    - `python scripts/test.py --inputdata "349000.0,1,1.0,1958.0,38.0,0.0,-37.8369,145.0077" --model dt`: to predict the house price for a single house using the trained decision tree model

Commit the changes, merge them into the `main` branch, and push to the
remote repository.

In the `hw7copy` folder, check out the `randomforest` branch, and add a
`usage` section to the `README.md`:

    ## usage
    - `python scripts/data_prep.py`: to prepare the training/test data files
    - `python scripts/train_rf.py`: to train a random forest model on the training data 
    - `python scripts/test.py --inputdata "349000.0,1,1.0,1958.0,38.0,0.0,-37.8369,145.0077" --model rf`: to predict the house price for a single house using a trained random forest model

Commit the changes, merge them into the `main` branch, and push to the
remote repository. Resolve any issues/conflicts in the process. The
merged `usage` section could then contain two lines for training (one
for each model) and two lines for testing.


> Enter the commands used, plus any note you may have for them.
