# Loading the dataset and first preprocessing step
For this project I chose the dataset ncbi-virus-complete-dna-v230722, available on Hugging Face. The original dataset contains over 2 million rows; because of the computational cost and the limited capacity of my PC, I work with a subset of 200 000 rows.

To get the complete dataset, run the cells below: they check whether the dataset is already in the local folder and download it otherwise.

In [1]:
import json
import glob

from datasets import load_dataset
import pandas as pd
import os
from imblearn.over_sampling import SMOTEN
from tqdm.notebook import tqdm


path_dataframe = '/Volumes/Seagate Bas/Vito/ML/Dataset/'
path_multiclass = path_dataframe + 'MulticlassDatasets/'
df_name = 'ncbi-virus-all-complete-nucleotides-2023-07-22'

# True: load the saved subset from parquet; False: load it from jsonl
parquet = False

# number of rows selected
n = 200000

To reduce the dataset size, we take the first `n` rows of the dataset (200 000 with the setting above).

In [None]:
from itertools import islice

full_jsonl = path_dataframe + f'{df_name}.jsonl'
subset_parquet = path_dataframe + f'ncbi-virus-{n}-dna.parquet'
subset_jsonl = path_dataframe + f'ncbi-virus-{n}-dna.jsonl'


def read_jsonl(path, limit=None):
    """Read up to `limit` JSON Lines records into a DataFrame in a single pass."""
    with open(path) as f:
        rows = [json.loads(line) for line in tqdm(islice(f, limit))]
    return pd.DataFrame(rows)


if not os.path.exists(full_jsonl):
    print('Downloading dataset...')
    # Load the dataset from Hugging Face
    dataset = load_dataset("LKarlo/ncbi-virus-complete-dna-v230722")['train']
    # Save the full dataset as parquet and as jsonl, so that the check above finds it on the next run
    dataset.to_parquet(path_dataframe + f'{df_name}.parquet')
    dataset.to_json(full_jsonl)
    # Keep the first n rows as a pandas DataFrame
    df = dataset.select(range(n)).to_pandas()
else:
    print('Dataset already exists')
    if parquet and os.path.exists(subset_parquet):
        print(f'Loading the dataset consisting of {n} rows in parquet format')
        df = pd.read_parquet(subset_parquet)
    elif os.path.exists(subset_jsonl):
        print(f'Loading the dataset consisting of {n} rows in jsonl format')
        df = read_jsonl(subset_jsonl)
    else:
        print(f'Loading {n} rows from the entire dataset from jsonl...')
        df = read_jsonl(full_jsonl, limit=n)
df.head()

Dataset already exists
Loading 200000 rows from the entire dataset from jsonl...


Save the subset of `n` rows, unless it has already been saved

In [None]:
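# os.path.exists does not expand wildcards, so glob is used to look for any saved subset file
if not glob.glob(path_dataframe + f'ncbi-virus-{n}-dna.*'):
    # Save the subset locally, both as jsonl and as parquet
    print('Saving subset of dataset...')
    # lines=True is only valid together with orient='records'
    df.to_json(path_dataframe + f'ncbi-virus-{n}-dna.jsonl', orient='records', lines=True)
    df.to_parquet(path_dataframe + f'ncbi-virus-{n}-dna.parquet')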

## For the multiclass classification task we analyze the distribution of the target feature

In [None]:
df['Family'].value_counts()

Let’s drop the rows whose family is missing (stored as the string `nan`)

In [None]:
index = df[df['Family'] == 'nan'].index
df.drop(index, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

In [None]:
df['Family'].value_counts()
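Depending on how the file was read, a missing family may appear either as the literal string `nan` or as a real missing value. The sketch below, on a toy frame with invented values and a hypothetical `Accession` column, shows one way to remove both kinds in a single mask. It is only an illustration and not necessarily needed for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values): one literal 'nan' string and one real missing value.
toy = pd.DataFrame({
    'Accession': ['A1', 'A2', 'A3', 'A4'],   # hypothetical column
    'Family': ['Circoviridae', 'nan', np.nan, 'Geminiviridae'],
})

# Keep rows whose family is neither missing nor the string 'nan'.
has_family = toy['Family'].notna() & (toy['Family'] != 'nan')
toy_clean = toy[has_family].reset_index(drop=True)

print(toy_clean['Family'].tolist())
```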

To limit the variability in sequence length, we keep only the sequences shorter than 4000 nucleotides

In [None]:
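max_len = 4000

# Make sure the lengths are integers before comparing them
df['Length'] = df['Length'].astype(int)
# Keep only the rows below the limit (DataFrame.where would keep every row and fill the others with NaN)
df = df[df['Length'] < max_len].reset_index(drop=True)
df['Family'].value_counts()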
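A length filter can be written in pandas in two ways that behave differently: `DataFrame.where` keeps the shape of the frame and replaces the rows that fail the condition with NaN, while boolean indexing removes them. The sketch below compares the two on a toy frame; the values are invented, and storing the lengths as strings is an assumption made for the illustration.

```python
import pandas as pd

# Toy frame (invented values); lengths are assumed to arrive as strings.
toy = pd.DataFrame({
    'Family': ['Circoviridae', 'Geminiviridae', 'Phenuiviridae'],
    'Length': ['1800', '2700', '6400'],
})
toy['Length'] = toy['Length'].astype(int)
max_len = 4000

# where: same number of rows, the row that fails the condition is filled with NaN.
masked = toy.where(toy['Length'] < max_len)
# Boolean indexing: the row that fails the condition is removed.
filtered = toy[toy['Length'] < max_len]

print(len(masked), int(masked['Family'].isna().sum()))
print(len(filtered))
```

Since `value_counts()` ignores NaN by default, both variants report the same class counts, but only boolean indexing actually shrinks the frame.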

### We select the 4 most frequent classes
We select Geminiviridae, Spinareoviridae, Phenuiviridae and Circoviridae. The resulting subset is saved as it is for the unbalanced classification task, and it is also used as the input of the oversampling step.

In [None]:
selected_df = df.query(
    'Family == "Geminiviridae" or '
    'Family == "Spinareoviridae" or '
    'Family == "Phenuiviridae" or '
    'Family == "Circoviridae"'
)

In [None]:
selected_df['Family'].value_counts()
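The four family names above are written by hand after looking at the counts. As an alternative, the most frequent classes can be derived from the data itself. The sketch below does this on a toy frame with invented counts; `k` and `top_families` are names introduced only for this illustration.

```python
import pandas as pd

# Toy frame with invented class counts: 4, 3 and 1 rows.
toy = pd.DataFrame({
    'Family': ['Geminiviridae'] * 4 + ['Circoviridae'] * 3 + ['Retroviridae'],
})

k = 2  # number of classes to keep (4 in the notebook)

# value_counts sorts by frequency in descending order, so the first k labels are the most frequent.
top_families = toy['Family'].value_counts().head(k).index.tolist()
selected = toy[toy['Family'].isin(top_families)]

print(top_families)
print(len(selected))
```

With `isin` the selection does not need to be edited if the subset size `n` or the length limit changes and the ranking of the families changes with it.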

We check whether the dataset has already been saved, and save it if it has not

In [None]:
if not os.path.exists(path_multiclass+'/UnbalancedDataset.parquet'):
    print('Saving the dataset...')
    selected_df.to_parquet(path_multiclass + '/UnbalancedDataset.parquet')
    print('Done!')
else:
    print('Dataset already exists')

We perform the oversampling with the SMOTEN technique.

SMOTEN is a variant of SMOTE that oversamples the minority classes of a dataset made of categorical features. We use the implementation provided by the imblearn library. The reference paper describes the SMOTE technique in general.

Reference: [SMOTE: Synthetic Minority Over-sampling Technique](https://doi.org/10.48550/arXiv.1106.1813)
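To give an intuition of how oversampling on categorical features can work, here is a simplified sketch in plain Python. It is not the imblearn implementation: the SMOTE paper describes the nominal variant with a Value Difference Metric, while this sketch uses the Hamming distance as a simplifying assumption. Each synthetic record takes, for every feature, the most common value among the nearest minority neighbours. The feature values and the function names are invented for the illustration.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def synthesize(sample, minority, k=3):
    """Build one synthetic record from the k nearest minority neighbours of `sample`."""
    others = [m for m in minority if m is not sample]
    neighbours = sorted(others, key=lambda m: hamming(sample, m))[:k]
    # For each feature, take the most common value among the neighbours.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*neighbours))


def oversample(minority, target_size, k=3, seed=42):
    """Add synthetic records until the minority class has `target_size` rows."""
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        synthetic.append(synthesize(rng.choice(minority), minority, k))
    return minority + synthetic


# Toy minority class (invented categorical features).
minority = [
    ('ssDNA', 'plant', 'circular'),
    ('ssDNA', 'plant', 'linear'),
    ('ssDNA', 'animal', 'circular'),
    ('dsDNA', 'plant', 'circular'),
]
balanced = oversample(minority, target_size=10)
print(len(balanced))
```

Because every synthetic value is copied from an existing record, the new rows never contain categories that were not observed in the minority class.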

In [None]:
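# Separate the features from the target column
X = selected_df.drop(columns='Family')
y = selected_df['Family']

# With the default settings, every class except the largest one is oversampled up to the size of the largest one
resampler = SMOTEN(random_state=42)
X, y = resampler.fit_resample(X, y)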

In [None]:
X['Family'] = y

X['Family'].value_counts()

In [None]:
X.tail()

In [None]:
if not os.path.exists(path_multiclass+'/BalancedDataset.parquet'):
    print('Saving the dataset...')
    X.to_parquet(path_multiclass + '/BalancedDataset.parquet')
    print('Done!')
else:
    print('Dataset already exists')