# Mastering MLOps: Hands-on Exercise 3
## Versioning a dataset with DVC

In this third hands-on exercise, we download a dataset, modify it, and version it with DVC.

The steps are:

### 1. Download the raw data

In [7]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/srees1988/predict-churn-py/main/customer_churn_data.csv")
df.head()

| Unnamed: 0 | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [4]:
df.to_csv("customer_churn_data_raw.csv")
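Note that `to_csv` writes the DataFrame index as an extra column with an empty header by default, which is how a column such as `Unnamed: 0` appears when a file is read back. The sketch below illustrates this on a toy frame; the `toy` data and variable names are illustrative assumptions, not part of the exercise, and whether to pass `index=False` here is left as a choice.

```python
import io

import pandas as pd

# Toy frame standing in for the churn data (illustrative values only).
toy = pd.DataFrame({"customerID": ["a", "b"], "tenure": [1, 34]})

# Default behaviour: the index is written as a first column with no header.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'customerID', 'tenure']
print(cols_without_index)  # ['customerID', 'tenure']
```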

### 2. Preprocess and store data

The preprocessing is the same as in exercise 1:
- Remove `customerID`, `MonthlyCharges` and `MultipleLines`.
- Convert `TotalCharges` to a numeric variable.

In [5]:
df = df.drop(columns=["customerID", "MonthlyCharges", "MultipleLines"])

In [8]:
# Blank strings become NaN so the column can be cast to a numeric dtype.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"].replace(" ", np.nan))

In [9]:
df.to_csv("customer_churn_data_preprocessed.csv")
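The `TotalCharges` conversion can be sanity-checked on a small example before trusting it on the full dataset. This is a minimal sketch, assuming the column arrives as text with a single-space string for missing values; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column mimicking TotalCharges: numbers stored as text, one blank entry.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Replace the blank string with NaN, then cast the column to a numeric dtype.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"].replace(" ", np.nan))

print(toy["TotalCharges"].dtype)         # float64
print(toy["TotalCharges"].isna().sum())  # 1
```

Checking the dtype and the number of missing values after the conversion is a quick way to confirm that the column is really numeric and that only the blank entries were lost.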

### 3. Move the data to the repo

Move both `customer_churn_data_raw.csv` and `customer_churn_data_preprocessed.csv` to the local DagsHub repo. In this case, the local repo is the `exercise_2` folder.
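If you prefer to do the move from Python rather than by hand, something like the following sketch could work. The helper name `move_to_repo` is hypothetical, and placing the files in a `data` subfolder is an assumption made here because `data` is one of the folders later tracked with DVC; adjust the target to your own repo layout.

```python
import shutil
from pathlib import Path


def move_to_repo(filenames, repo_dir, subfolder="data"):
    """Move each file into repo_dir/subfolder and return the new paths.

    `subfolder="data"` is an assumption, not something the exercise fixes.
    """
    target = Path(repo_dir) / subfolder
    target.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    moved = []
    for name in filenames:
        src = Path(name)
        dst = target / src.name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved


# Example call (run it from the folder that contains the two CSV files):
# move_to_repo(
#     ["customer_churn_data_raw.csv", "customer_churn_data_preprocessed.csv"],
#     "exercise_2",
# )
```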

### 4. Initialize DVC

From a terminal in the local DagsHub repo (`exercise_2` in this case), run the following command:

```shell
dvc init
```

Then add the `data` and `outputs` folders to DVC tracking:
```shell
dvc add data
dvc add outputs
```

Commit the generated `.dvc` files and push them to Git:
```shell
git add data.dvc outputs.dvc
git commit -m "Added data and outputs to DVC"
git push -u origin master
```

To add the DVC remote, run the following command with your repo's URL (found in DagsHub under Repo -> Remote -> Data -> DVC -> HTTP):
```shell
dvc remote add origin https://dagshub.com/dgcanalesr/churn-classification.dvc
```

To configure the DVC credentials with your private DagsHub _token_, run the following commands, replacing `<your_token>` with the token:
```shell
dvc remote modify origin --local auth basic
dvc remote modify origin --local user dgcanalesr
dvc remote modify origin --local password <your_token>
```

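Commit the DVC remote configuration and push it to Git (the `--local` credentials are stored in `.dvc/config.local`, which is not tracked by Git):
```shell
git add .dvc/config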
git commit -m "Added DVC config"
git push -u origin master
```

Finally, push the DVC-tracked data to the remote:
```shell
dvc push -r origin
```