# 1. Steps executed so far:
(Refer to `"Data Versioning using DVC"` document)

### **Initialize DVC repository:**

- Create a new GitHub repo
- Start a codespace
- Upload data (data/bike-sharing-dataset.csv)
- Create and activate venv
- Install packages: <br>`pip install dvc==3.55.2 dvc-ssh==4.1.1 asyncssh==2.18.0`

- Initialize the DVC repository: <br>`dvc init`


### **Configure Remote Storage:**

- Add remote storage: <br>`dvc remote add -d myremote ssh://<username>@<vm-ip>:22/<path-to-dvc-storage-folder>`

- Provide the VM password for the SSH remote: (stored locally in `.dvc/config.local`, which is not committed to Git)
<br>`dvc remote modify --local myremote password <your-vm-password>`

- Commit remote-storage config: <br>`git add .dvc/config`
<br>`git commit -m "remote storage configured"`
<br>`git push`




### **Data Versioning:**

- Add data for dvc tracking: <br>`dvc add data/bike-sharing-dataset.csv`

- Add metadata files for git tracking: <br>`git add data/.gitignore data/bike-sharing-dataset.csv.dvc`

- Commit changes in git: <br>`git commit -m "Initial data"`

- Tag the data:  <br>`git tag -a v1.1 -m "Dataset v1.1"`
<br> `git tag` creates a named reference to a specific commit in your repository's history, typically to mark that commit as important. Tags are often used to denote releases (e.g., version 1.0, 2.0); here the tag marks a version of the dataset.

- Push data to remote-storage: <br>`dvc push`

- Push metadata file to git repo: <br>`git push`

- Push tag: <br>`git push origin v1.1`


### **Whenever new data comes in, use the below commands in sequence:**

- `dvc status`
- `dvc add data/bike-sharing-dataset.csv`
- `git add data/bike-sharing-dataset.csv.dvc`
- `git commit -m "dataset updated"`
- `git tag -a v1.x -m "Dataset v1.x"`
- `dvc push`
- `git push`
- `git push origin v1.x`
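
The sequence above can also be scripted. The following is a minimal sketch, not part of the original workflow: the function names `dataset_update_commands` and `run_dataset_update` are hypothetical, and it assumes DVC's default naming of the metadata file as `<data file>.dvc`. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def dataset_update_commands(version, data_path="data/bike-sharing-dataset.csv"):
    """Return the update sequence as argv lists, in the order listed above."""
    tag = f"v{version}"
    return [
        ["dvc", "status"],
        ["dvc", "add", data_path],
        # Assumption: DVC writes the metadata file next to the data as <data file>.dvc
        ["git", "add", f"{data_path}.dvc"],
        ["git", "commit", "-m", "dataset updated"],
        ["git", "tag", "-a", tag, "-m", f"Dataset {tag}"],
        ["dvc", "push"],
        ["git", "push"],
        ["git", "push", "origin", tag],
    ]


def run_dataset_update(version, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in dataset_update_commands(version):
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(cmd, check=True)


# Dry run: prints the eight commands for a hypothetical version 1.2
run_dataset_update("1.2")
```

Calling `run_dataset_update("1.2", dry_run=False)` from the repository root would execute the commands; stopping at the first failure avoids pushing a tag for data that was never committed.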

# 2. Use DVC-API to load the data:

In [None]:
%%capture
# Install dvc
!pip -q install dvc==3.55.2
!pip -q install dvc-ssh==4.1.1
!pip -q install asyncssh==2.18.0

In [None]:
!dvc --version

Save the below credentials in Colab secrets:

- your GitHub Username (save with key name: `GH_USERNAME`)
- your GitHub Access token (save with key name: `GH_ACCESS_TOKEN`)
- your VM Password (save with key name: `VM_PASSWORD`)

Later, access your secrets in Python via:

```python
from google.colab import userdata
userdata.get('GH_USERNAME')
userdata.get('GH_ACCESS_TOKEN')
userdata.get('VM_PASSWORD')
```

In [None]:
# Run this cell only after adding your credentials in Colab Secrets.
# Also, click on `Grant access` when prompted while running this cell.

import os
from google.colab import userdata

# Read the secrets from Colab and save them as environment variables. From here on, access the credentials only through the environment variables.
#todo
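
One possible way to complete the cell above is sketched here. The helper `export_secrets` and the constant `SECRET_KEYS` are hypothetical names, not part of Colab; the helper takes the lookup function as an argument so that it can be tried outside Colab as well.

```python
import os

# Key names as saved in Colab secrets (see the list above)
SECRET_KEYS = ("GH_USERNAME", "GH_ACCESS_TOKEN", "VM_PASSWORD")


def export_secrets(get_secret, keys=SECRET_KEYS):
    """Copy each named secret into os.environ; raise if one is missing or empty."""
    for key in keys:
        value = get_secret(key)
        if not value:
            raise KeyError(f"Secret {key!r} is not set")
        os.environ[key] = value


# In Colab (assumption: the three secrets exist and notebook access was granted):
#   from google.colab import userdata
#   export_secrets(userdata.get)
```

After this runs, later cells can read `os.environ["VM_PASSWORD"]` and the other keys without touching `userdata` again.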

In [None]:
import dvc.api
import pandas as pd

# Repo format:  "https://<github-username>:<github-token>@github.com/<github-username>/<repo-name>"
# For example: "https://yrajm1997:ghp_abcdefxxxx@github.com/yrajm1997/titanic-data-repo"

#todo

# Data version to retrieve
#todo

# Configurations to access remote storage
#todo

# Open data file using dvc-api and load the dataset
#todo
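
A possible completion of the cell above is sketched below, under some assumptions: the helper names `build_repo_url` and `load_dataset` are hypothetical, the credentials are already available as environment variables, and the SSH password is passed through the `remote_config` argument of `dvc.api.open()`. Check the linked reference for the options supported by your DVC version.

```python
import os
from urllib.parse import quote


def build_repo_url(username, token, repo_name):
    """Build the authenticated GitHub URL in the repo format shown above."""
    # quote() leaves typical usernames and tokens unchanged but escapes
    # characters that would otherwise break the URL
    user = quote(username, safe="")
    return f"https://{user}:{quote(token, safe='')}@github.com/{user}/{repo_name}"


def load_dataset(repo_url, path, rev, vm_password):
    """Open a DVC-tracked CSV at a given Git revision and load it with pandas."""
    # Imported here so build_repo_url can be used without dvc or pandas installed
    import dvc.api
    import pandas as pd

    with dvc.api.open(
        path,
        repo=repo_url,
        rev=rev,  # Git tag, branch, or commit, e.g. "v1.1"
        remote_config={"password": vm_password},  # assumption: password-based SSH remote
    ) as f:
        return pd.read_csv(f)


# Usage (replace <repo-name> with your repository name):
#   repo = build_repo_url(os.environ["GH_USERNAME"],
#                         os.environ["GH_ACCESS_TOKEN"], "<repo-name>")
#   df = load_dataset(repo, "data/bike-sharing-dataset.csv", "v1.1",
#                     os.environ["VM_PASSWORD"])
```

Passing the tag as `rev` is what ties the loaded data to a specific dataset version: changing `"v1.1"` to another pushed tag retrieves the data as it was at that tag.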

### References:

- [dvc.api.open()](https://dvc.org/doc/api-reference/open)