## Loading the necessary libraries and mounting Google Drive, where the datasets are stored
<br>
We have a French-to-Fongbe JW300 dataset, from which we extract relevant data points to augment the training data. We then train a French-Fongbe model on the extracted data.
<br>
We also have a French-to-Ewe JW300 dataset, from which we extract relevant data points to augment the French-Ewe data. The model is then trained on the augmented data.

In [2]:
# !pip install -U sentence-transformers
# !pip install --upgrade transformers

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading the libraries needed to extract relevant data from the collected JW300 data

In [4]:
import pandas as pd
import numpy as np
import gc
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

## Augmented Data Extraction 

## We start with French To Fongbe


## Method 1
Here we take only the French sentences from the test data and look for similar sentences in the JW300 French-Fongbe dataset.
<br>
Cosine similarity between sentence embeddings measures how close each test sentence is to every French sentence in the JW300 data.
<br>
Only the records with the highest and second-highest similarities are kept.
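
As a minimal sketch of the idea, the snippet below computes cosine similarities between a few query vectors and candidate vectors and keeps the two best candidates per query. The helper name `top2_cosine` and the toy 2-d vectors are illustrative assumptions; in the notebook the vectors come from the sentence-embedding model.

```python
import numpy as np

def top2_cosine(queries, candidates):
    """Return indices and scores of the two most similar candidates per query."""
    # Normalise rows so that a dot product equals cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T                              # shape: (n_queries, n_candidates)
    order = np.argsort(-sims, axis=1)[:, :2]    # best first, then second best
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores

# Toy 2-d "embeddings" standing in for real sentence vectors (hypothetical data)
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0], [-1.0, 0.0]])

idx, scores = top2_cosine(queries, candidates)
print(idx)     # row i lists the best and second-best candidate for query i
print(scores)
```

Because the rows are normalised first, the length of a vector does not affect the ranking, only its direction.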

### Model used for sentence similarity - XLM-ROBERTA-LARGE

In [7]:
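jwdf = pd.read_csv('/content/drive/MyDrive/takwimu_translations/french_fongbe.csv')

# Drop missing rows before casting: astype(str) can turn NaN into the string 'nan',
# which dropna would then no longer remove. Resetting the index keeps row labels contiguous.
jwdf = jwdf.dropna(subset=['French', 'Fongbe']).reset_index(drop=True)
jwdf['French'] = jwdf['French'].astype(str)
jwdf['Fongbe'] = jwdf['Fongbe'].astype(str)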


availdf  = pd.read_csv('Test.csv')
availdf = availdf[availdf['Target_Language']=='Fon']

In [9]:
model = SentenceTransformer('xlm-roberta-large')

Exception when trying to download http://sbert.net/models/xlm-roberta-large.zip. Response 404
SentenceTransformer-Model http://sbert.net/models/xlm-roberta-large.zip not found. Try to create it from scratch
Try to create Transformer Model xlm-roberta-large with mean pooling


Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
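
The log above says the sentence model was built from the plain `xlm-roberta-large` checkpoint "with mean pooling". A minimal sketch of mean pooling, under the assumption that token embeddings and an attention mask are available as NumPy arrays (the function name `mean_pool` and the toy values are illustrative, not the library's internals):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # sum of real tokens
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# One sentence with three token slots; the last slot is padding (hypothetical values)
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)
print(pooled)
```

Since the checkpoint was not fine-tuned for sentence similarity, embeddings pooled this way may be a weaker similarity signal than those of a model trained for that purpose.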


In [8]:
highest_indices_df           = pd.DataFrame(columns=['French'])
highest_indices_df['French'] = availdf['French'].values.tolist()

highest_similarity_df           = pd.DataFrame(columns=['French'])
highest_similarity_df['French'] = availdf['French'].values.tolist()


In [10]:
batch       = 1
chunk_size  = 50000
rows        = availdf['French'].values.tolist()
row_enc     = model.encode(rows,batch_size=64,show_progress_bar=True)
for start in range(0,jwdf.shape[0],chunk_size):
  print('batch:- ',batch)

  # Slice by position so the last, shorter chunk needs no special handling
  chunk       = jwdf['French'].iloc[start:start+chunk_size]
  columns_enc = model.encode(chunk.values.tolist(),batch_size=128,show_progress_bar=True)

  # Label every similarity column with the jwdf index of its sentence (kept as integers
  # so that the later jwdf.index.isin(...) lookup compares like with like)
  simdf = pd.DataFrame(cosine_similarity(row_enc,columns_enc),columns=chunk.index)

  # Best match inside this chunk for every test sentence, and its similarity
  highest_label = simdf.idxmax(axis=1).values
  highest_match = simdf.max(axis=1).values

  highest_indices_df[f'batch_{batch}_indices']       = highest_label
  highest_similarity_df[f'batch_{batch}_similarity'] = highest_match

  # Free the similarity matrix before encoding the next chunk
  del simdf
  gc.collect()
  batch = batch+1

batch:-  1

In [11]:
sim = highest_similarity_df.drop(['French'],axis=1)
highest_indices_df['second_most_sim'] = sim.T.apply(lambda x: x.nlargest(2).idxmin())
highest_indices_df['first_most_sim']  = sim.T.apply(lambda x: x.nlargest(1).idxmin())

In [12]:
def get_index(text):
  return text.replace('similarity','indices')

highest_indices_df['second_most_sim'] = highest_indices_df.apply(lambda z: z[get_index(z['second_most_sim'])],axis=1) 
highest_indices_df['first_most_sim']  = highest_indices_df.apply(lambda z: z[get_index(z['first_most_sim'])],axis=1) 
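
To make the selection logic of the two cells above easier to follow, here is a small sketch on toy data. The column names mirror the notebook; the numbers and the helper `pick` are hypothetical. Note what is being ranked: each batch contributes only its single best match, so the "second most similar" record is the best match of the second-best batch, not necessarily the second most similar sentence overall. When there is only one batch, the two selections resolve to the same column.

```python
import pandas as pd

# Toy per-batch results for two test sentences and two JW300 batches (hypothetical values)
similarity = pd.DataFrame({
    'batch_1_similarity': [0.91, 0.40],
    'batch_2_similarity': [0.75, 0.88],
})
indices = pd.DataFrame({
    'batch_1_indices': [12, 7],
    'batch_2_indices': [50123, 60001],
})

def pick(rank):
    """JW300 index of the rank-th best batch winner per test sentence (rank 0 = best)."""
    picked = []
    for row in range(len(similarity)):
        ordered = similarity.iloc[row].sort_values(ascending=False)
        # With a single batch there is no runner-up, so fall back to the best one
        col = ordered.index[min(rank, len(ordered) - 1)]
        picked.append(indices.loc[row, col.replace('similarity', 'indices')])
    return picked

first_most_sim = pick(0)
second_most_sim = pick(1)
print(first_most_sim, second_most_sim)
```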

In [13]:
indices = highest_indices_df['first_most_sim'].values.tolist()
data = jwdf[jwdf.index.isin(indices)]
data.shape

(1980, 2)

In [14]:
indices1 = highest_indices_df['second_most_sim'].values.tolist()
data1 = jwdf[jwdf.index.isin(indices1)]
data1.shape

(1980, 2)

In [15]:
data.to_csv('french_fongbe_train_xlm_roberta.csv',index=False)
data1.to_csv('french_fongbe_valid_xlm_roberta.csv',index=False)

## Method 2
Here we repeat the procedure above, but use Language-agnostic BERT Sentence Embedding (LaBSE) to compute the vectors from which sentence similarities are calculated.
<br>
We again pick the French sentences with the highest and second-highest similarity and augment the training data with them.
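
A side note on the similarity measure: when embeddings are L2-normalised (this is an assumption about the model's output here, worth checking with `np.linalg.norm` on a few encoded sentences), cosine similarity reduces to a plain dot product. The sketch below checks that identity on random vectors; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 8))   # stand-ins for test-sentence embeddings
b = rng.normal(size=(5, 8))   # stand-ins for JW300 sentence embeddings

# Cosine similarity computed from the raw vectors
norms = np.linalg.norm(a, axis=1)[:, None] * np.linalg.norm(b, axis=1)[None, :]
cosine = (a @ b.T) / norms

# Dot product of the unit-length versions of the same vectors
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
dot_of_units = a_unit @ b_unit.T

print(np.allclose(cosine, dot_of_units))
```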

In [16]:
model = SentenceTransformer('LaBSE')
highest_indices_df           = pd.DataFrame(columns=['French'])
highest_indices_df['French'] = availdf['French'].values.tolist()

highest_similarity_df           = pd.DataFrame(columns=['French'])
highest_similarity_df['French'] = availdf['French'].values.tolist()


batch       = 1
chunk_size  = 50000
rows        = availdf['French'].values.tolist()
row_enc     = model.encode(rows,batch_size=64,show_progress_bar=True)
for start in range(0,jwdf.shape[0],chunk_size):
  print('batch:- ',batch)

  # Slice by position so the last, shorter chunk needs no special handling
  chunk       = jwdf['French'].iloc[start:start+chunk_size]
  columns_enc = model.encode(chunk.values.tolist(),batch_size=128,show_progress_bar=True)

  # Label every similarity column with the jwdf index of its sentence (kept as integers
  # so that the later jwdf.index.isin(...) lookup compares like with like)
  simdf = pd.DataFrame(cosine_similarity(row_enc,columns_enc),columns=chunk.index)

  # Best match inside this chunk for every test sentence, and its similarity
  highest_label = simdf.idxmax(axis=1).values
  highest_match = simdf.max(axis=1).values

  highest_indices_df[f'batch_{batch}_indices']       = highest_label
  highest_similarity_df[f'batch_{batch}_similarity'] = highest_match

  # Free the similarity matrix before encoding the next chunk
  del simdf
  gc.collect()
  batch = batch+1


sim = highest_similarity_df.drop(['French'],axis=1)
highest_indices_df['second_most_sim'] = sim.T.apply(lambda x: x.nlargest(2).idxmin())
highest_indices_df['first_most_sim']  = sim.T.apply(lambda x: x.nlargest(1).idxmin())

def get_index(text):
  return text.replace('similarity','indices')

highest_indices_df['second_most_sim'] = highest_indices_df.apply(lambda z: z[get_index(z['second_most_sim'])],axis=1) 
highest_indices_df['first_most_sim']  = highest_indices_df.apply(lambda z: z[get_index(z['first_most_sim'])],axis=1) 


indices = highest_indices_df['first_most_sim'].values.tolist()
data = jwdf[jwdf.index.isin(indices)]
data.shape

indices1 = highest_indices_df['second_most_sim'].values.tolist()
data1 = jwdf[jwdf.index.isin(indices1)]
data1.shape


data.to_csv('french_fongbe_train_labse.csv',index=False)
data1.to_csv('french_fongbe_valid_labse.csv',index=False)

batch:-  1

## Extracting statements from French-Ewe JW300

We follow the same procedure for Ewe. Because we already have a pretrained seq2seq model for French-Ewe, we use only one method, XLM-ROBERTA-LARGE, to extract the data points that will augment the training data.
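
Since the same batched search is repeated for each dataset, it can help to see it as one helper. The sketch below is an alternative formulation, not the notebook's code: it scans pre-computed candidate vectors chunk by chunk and keeps a running top two per query across all chunks, so the runner-up is the second most similar candidate overall. The function name `chunked_top2` and the toy vectors are assumptions; in practice each chunk would be encoded by the model inside the loop.

```python
import numpy as np

def chunked_top2(query_vecs, candidate_vecs, chunk_size):
    """Scan candidates chunk by chunk, keeping each query's two best matches overall."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    n_queries = len(q)
    best_scores = np.full((n_queries, 2), -np.inf)   # running best similarities
    best_index = np.full((n_queries, 2), -1)         # running best candidate positions
    for start in range(0, len(candidate_vecs), chunk_size):
        chunk = candidate_vecs[start:start + chunk_size]
        c = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        sims = q @ c.T                               # cosine similarities for this chunk
        ids = np.broadcast_to(np.arange(start, start + len(chunk)), sims.shape)
        # Merge the running top two with this chunk, then keep the best two again
        all_scores = np.hstack([best_scores, sims])
        all_index = np.hstack([best_index, ids])
        keep = np.argsort(-all_scores, axis=1)[:, :2]
        best_scores = np.take_along_axis(all_scores, keep, axis=1)
        best_index = np.take_along_axis(all_index, keep, axis=1)
    return best_index, best_scores

# Toy vectors standing in for sentence embeddings (hypothetical data)
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0], [-1.0, 0.0]])

top_index, top_scores = chunked_top2(queries, candidates, chunk_size=3)
print(top_index)
```

Only one chunk's similarity matrix is held in memory at a time, which is the same reason the notebook processes the JW300 data in slices of 50000 rows.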


In [19]:
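jwdf = pd.read_csv('/content/drive/MyDrive/takwimu_translations/french_ewe.csv')

# Drop missing rows before casting: astype(str) can turn NaN into the string 'nan',
# which dropna would then no longer remove. Resetting the index keeps row labels contiguous.
jwdf = jwdf.dropna(subset=['French', 'Ewe']).reset_index(drop=True)
jwdf['French'] = jwdf['French'].astype(str)
jwdf['Ewe'] = jwdf['Ewe'].astype(str)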


availdf  = pd.read_csv('Test.csv')
availdf = availdf[availdf['Target_Language']=='Ewe']

In [20]:
model = SentenceTransformer('xlm-roberta-large')
highest_indices_df           = pd.DataFrame(columns=['French'])
highest_indices_df['French'] = availdf['French'].values.tolist()

highest_similarity_df           = pd.DataFrame(columns=['French'])
highest_similarity_df['French'] = availdf['French'].values.tolist()


batch       = 1
chunk_size  = 50000
rows        = availdf['French'].values.tolist()
row_enc     = model.encode(rows,batch_size=64,show_progress_bar=True)
for start in range(0,jwdf.shape[0],chunk_size):
  print('batch:- ',batch)

  # Slice by position so the last, shorter chunk needs no special handling
  chunk       = jwdf['French'].iloc[start:start+chunk_size]
  columns_enc = model.encode(chunk.values.tolist(),batch_size=128,show_progress_bar=True)

  # Label every similarity column with the jwdf index of its sentence (kept as integers
  # so that the later jwdf.index.isin(...) lookup compares like with like)
  simdf = pd.DataFrame(cosine_similarity(row_enc,columns_enc),columns=chunk.index)

  # Best match inside this chunk for every test sentence, and its similarity
  highest_label = simdf.idxmax(axis=1).values
  highest_match = simdf.max(axis=1).values

  highest_indices_df[f'batch_{batch}_indices']       = highest_label
  highest_similarity_df[f'batch_{batch}_similarity'] = highest_match

  # Free the similarity matrix before encoding the next chunk
  del simdf
  gc.collect()
  batch = batch+1


sim = highest_similarity_df.drop(['French'],axis=1)
highest_indices_df['second_most_sim'] = sim.T.apply(lambda x: x.nlargest(2).idxmin())
highest_indices_df['first_most_sim']  = sim.T.apply(lambda x: x.nlargest(1).idxmin())

def get_index(text):
  return text.replace('similarity','indices')

highest_indices_df['second_most_sim'] = highest_indices_df.apply(lambda z: z[get_index(z['second_most_sim'])],axis=1) 
highest_indices_df['first_most_sim']  = highest_indices_df.apply(lambda z: z[get_index(z['first_most_sim'])],axis=1) 


indices = highest_indices_df['first_most_sim'].values.tolist()
data = jwdf[jwdf.index.isin(indices)]
data.shape

indices1 = highest_indices_df['second_most_sim'].values.tolist()
data1 = jwdf[jwdf.index.isin(indices1)]
data1.shape


data.to_csv('french_ewe_train_xlm_roberta.csv',index=False)
data1.to_csv('french_ewe_valid_xlm_roberta.csv',index=False)

batch:-  1
batch:-  2
batch:-  3
batch:-  4
batch:-  5
batch:-  6
batch:-  7
batch:-  8
batch:-  9
batch:-  10
batch:-  11
batch:-  12