### Part 01 - DVC

- Import the data

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
import csv

# Load the data
df = pd.read_csv(
    'data/SMSSpamCollection', 
    sep='\t', 
    names=['label', 'sms'],
    quoting=csv.QUOTE_NONE
)

# Save the raw data
df.to_csv('data/raw_data.csv', index=False)
print("Raw data saved as data/raw_data.csv")

Raw data saved as data/raw_data.csv
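The `quoting=csv.QUOTE_NONE` argument in the cell above tells the parser to treat double quotes as ordinary characters. The notebook does not say why, so the following is an assumption: SMS text can contain unbalanced quotes, which the default quoting mode may read as the start of a multi-line quoted field. A minimal sketch with the standard `csv` module on two made-up lines (the sample text is hypothetical, not taken from the dataset):

```python
import csv
import io

# Two hypothetical tab-separated records; each line has one unbalanced quote.
sample = 'ham\t"Hello there\nspam\tWin now"\n'

def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting))

rows_default = parse(sample, csv.QUOTE_MINIMAL)  # default: quotes delimit fields
rows_literal = parse(sample, csv.QUOTE_NONE)     # quotes kept as plain text

# The default mode merges both lines into one record; QUOTE_NONE keeps two.
print(len(rows_default), len(rows_literal))
```

With quoting disabled, each physical line stays one record, which is usually what is wanted for a one-message-per-line file.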


- Preview the data

In [2]:
df

| | label | sms |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5569 | spam | This is the 2nd time we have tried 2 contact u... |
| 5570 | ham | Will ü b going to esplanade fr home? |
| 5571 | ham | Pity, * was in mood for that. So...any other s... |
| 5572 | ham | The guy did some bitching but I acted like i'd... |


- Split the data

In [25]:
# Split the data: 70% train, 15% validation, 15% test
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=0)
validation_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=0)

# Save the splits
train_df.to_csv('data/train.csv', index=False)
validation_df.to_csv('data/validation.csv', index=False)
test_df.to_csv('data/test.csv', index=False)

print(f"Train size: {len(train_df)}, Validation size: {len(validation_df)}, Test size: {len(test_df)}")

Train size: 3901, Validation size: 836, Test size: 837
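The uneven validation and test sizes (836 vs. 837) come from rounding. The sketch below assumes that `train_test_split` rounds the fractional test part up and gives the remainder to the train part. The helper `split_sizes` is hypothetical and only reproduces that arithmetic:

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 3901 + 836 + 837                       # rows reported by the cell above
n_train, n_temp = split_sizes(n_total, 0.3)      # first split: 70% / 30%
n_validation, n_test = split_sizes(n_temp, 0.5)  # second split: halve the 30%
print(n_train, n_validation, n_test)
```

Because the 30% remainder has an odd number of rows, the second split cannot be exactly even, and the extra row goes to the test set.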


- Initialize git and dvc

In [26]:
!git init
!dvc init

Reinitialized existing Git repository in D:/AML/Assignment02/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


- Configure the Google Drive remote with a service account

In [27]:
!dvc remote add -d storage gdrive://14Lr78YM5bApS192CqHCWWDd4K7faGkK-
!dvc remote modify storage gdrive_use_service_account true
!dvc remote modify storage gdrive_service_account_json_file_path dvc.json

Setting 'storage' as a default remote.


- Pushing raw_data.csv

In [28]:
!dvc add data/raw_data.csv
!git add data/raw_data.csv.dvc data/.gitignore
!git commit -m "Track data/raw_data.csv using DVC"
!dvc push


To track the changes with git, run:

	git add 'data\.gitignore' 'data\raw_data.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


[detached HEAD f4fb9e8] Track data/raw_data.csv using DVC
 1 file changed, 3 deletions(-)
Everything is up to date.


- Track and push the train, validation, and test splits with DVC (1st time)

In [29]:
!dvc add data/train.csv
!dvc add data/validation.csv
!dvc add data/test.csv

!git add data/train.csv.dvc data/validation.csv.dvc data/test.csv.dvc data/.gitignore
!git commit -m "Track data/train, data/validation, and data/test splits using DVC"

!dvc push


To track the changes with git, run:

	git add 'data\.gitignore' 'data\train.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


To track the changes with git, run:

	git add 'data\validation.csv.dvc' 'data\.gitignore'

To enable auto staging, run:

	dvc config core.autostage true


To track the changes with git, run:

	git add 'data\.gitignore' 'data\test.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


[detached HEAD 8d57845] Track data/train, data/validation, and data/test splits using DVC
 1 file changed, 3 insertions(+)
Everything is up to date.


- Inspect the label distribution of the first version

In [30]:
train_df = pd.read_csv('data/train.csv')
validation_df = pd.read_csv('data/validation.csv')
test_df = pd.read_csv('data/test.csv')

print("Train distribution:")
print(train_df['label'].value_counts())

print("Validation distribution:")
print(validation_df['label'].value_counts())

print("Test distribution:")

print(test_df['label'].value_counts())

Train distribution:
label
ham     3396
spam     505
Name: count, dtype: int64
Validation distribution:
label
ham     731
spam    105
Name: count, dtype: int64
Test distribution:
label
ham     700
spam    137
Name: count, dtype: int64
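Raw counts are hard to compare across splits of different sizes, so it can help to convert them to proportions. A small sketch using the counts printed above (the helper `spam_share` is hypothetical; in pandas, `value_counts(normalize=True)` should give the same fractions directly):

```python
# Label counts copied from the output above.
counts = {
    "train": {"ham": 3396, "spam": 505},
    "validation": {"ham": 731, "spam": 105},
    "test": {"ham": 700, "spam": 137},
}

def spam_share(split_counts):
    """Fraction of messages labelled spam in one split."""
    return split_counts["spam"] / sum(split_counts.values())

shares = {name: spam_share(c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} spam")
```

The split is not stratified, so the spam share differs between splits; here the test set has a noticeably higher share than the train and validation sets.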


- Let’s create a new split with a different seed and save them

In [31]:
from sklearn.model_selection import train_test_split

# New split with a different random seed
train_df, temp_df = train_test_split(df, test_size=0.3, random_state=100)
validation_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=100)

# Save the new splits
train_df.to_csv('data/train.csv', index=False)
validation_df.to_csv('data/validation.csv', index=False)
test_df.to_csv('data/test.csv', index=False)

print(f"New Train size: {len(train_df)}, Validation size: {len(validation_df)}, Test size: {len(test_df)}")

New Train size: 3901, Validation size: 836, Test size: 837


- Track the updated splits with DVC (2nd time)

In [32]:
!dvc add data/train.csv
!dvc add data/validation.csv
!dvc add data/test.csv

!git add data/train.csv.dvc data/validation.csv.dvc data/test.csv.dvc data/.gitignore
!git commit -m "Update data/train, data/validation, and data/test splits with new random seed"

!dvc push


To track the changes with git, run:

	git add 'data\train.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


To track the changes with git, run:

	git add 'data\validation.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


To track the changes with git, run:

	git add 'data\test.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true


[detached HEAD 061220e] Update data/train, data/validation, and data/test splits with new random seed
 3 files changed, 6 insertions(+), 6 deletions(-)
Everything is up to date.


- Check out the first version and show the target distribution:

In [33]:
!git log --oneline

061220e Update data/train, data/validation, and data/test splits with new random seed
8d57845 Track data/train, data/validation, and data/test splits using DVC
f4fb9e8 Track data/raw_data.csv using DVC
037ecab Track data/train, data/validation, and data/test splits using DVC
ed2ee80 Track raw_data.csv using DVC
dff5dd5 Update train, validation, and test splits with new random seed
9605fbb Track train, validation, and test splits using DVC
518af85 Update train, validation, and test splits with new random seed
67a9574 Track train, validation, and test splits using DVC
f098d5e Update train, validation, and test splits with new random seed
1bdff2d Track train, validation, and test splits using DVC
2ff0bab Track raw_data.csv using DVC


In [34]:
# Check out the commit hash of the first version of the splits
!git checkout 8d57845
!dvc checkout

M	.dvc/config
D	.gitignore
D	raw_data.csv.dvc
D	test.csv.dvc
D	train.csv.dvc
D	validation.csv.dvc


any of your branches:

  061220e Update data/train, data/validation, and data/test splits with new random seed

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> 061220e

HEAD is now at 8d57845 Track data/train, data/validation, and data/test splits using DVC


M       data\validation.csv
M       data\train.csv
M       data\test.csv
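Restoring a data version always takes the same two steps: `git checkout` brings back the `.dvc` pointer files, and `dvc checkout` then replaces the data files to match them. A minimal sketch that wraps both steps in one function (the name `restore_version` and the injectable `run` parameter are hypothetical, added here only for illustration and testing):

```python
import subprocess

def restore_version(rev, run=subprocess.run):
    """Check out a Git revision, then sync DVC-tracked files to match it."""
    commands = [
        ["git", "checkout", rev],  # restores the .dvc pointer files
        ["dvc", "checkout"],       # restores the data files they point to
    ]
    for cmd in commands:
        run(cmd, check=True)       # check=True raises if a command fails
    return commands
```

Calling `restore_version("8d57845")` from the repository root would run the same two commands as the cell above.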


- Print the label distribution of the first version

In [35]:
# This matches the target distribution of the 1st version shown above
train_df = pd.read_csv('data/train.csv')
validation_df = pd.read_csv('data/validation.csv')
test_df = pd.read_csv('data/test.csv')

print("Train distribution:")
print(train_df['label'].value_counts())

print("Validation distribution:")
print(validation_df['label'].value_counts())

print("Test distribution:")
print(test_df['label'].value_counts())

Train distribution:
label
ham     3396
spam     505
Name: count, dtype: int64
Validation distribution:
label
ham     731
spam    105
Name: count, dtype: int64
Test distribution:
label
ham     700
spam    137
Name: count, dtype: int64


- Check out the updated version

In [36]:
# Check out the commit hash of the updated version of the splits
!git checkout 061220e
!dvc checkout

M	.dvc/config
D	.gitignore
D	raw_data.csv.dvc
D	test.csv.dvc
D	train.csv.dvc
D	validation.csv.dvc


Previous HEAD position was 8d57845 Track data/train, data/validation, and data/test splits using DVC
HEAD is now at 061220e Update data/train, data/validation, and data/test splits with new random seed


M       data\validation.csv
M       data\train.csv
M       data\test.csv


- Print the updated distribution

In [37]:
train_df = pd.read_csv('data/train.csv')
validation_df = pd.read_csv('data/validation.csv')
test_df = pd.read_csv('data/test.csv')

print("Updated Train distribution:")
print(train_df['label'].value_counts())

print("Updated Validation distribution:")
print(validation_df['label'].value_counts())

print("Updated Test distribution:")
print(test_df['label'].value_counts())

Updated Train distribution:
label
ham     3374
spam     527
Name: count, dtype: int64
Updated Validation distribution:
label
ham     732
spam    104
Name: count, dtype: int64
Updated Test distribution:
label
ham     721
spam    116
Name: count, dtype: int64
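Both versions are splits of the same raw data, so the per-label totals should be identical even though the per-split counts differ. A quick sanity check on the counts printed in this notebook (the helper `totals` is hypothetical):

```python
# (ham, spam) counts copied from the two distribution outputs in this notebook.
first = {"train": (3396, 505), "validation": (731, 105), "test": (700, 137)}
updated = {"train": (3374, 527), "validation": (732, 104), "test": (721, 116)}

def totals(version):
    """Sum (ham, spam) counts over all splits of one version."""
    ham = sum(h for h, _ in version.values())
    spam = sum(s for _, s in version.values())
    return ham, spam

print(totals(first), totals(updated))
```

Matching totals suggest that the two checkouts restored different partitions of the same dataset rather than different data.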


- Confirm Google Drive storage works

In [38]:
!dvc status -r storage

Cache and remote 'storage' are in sync.
