<a href="https://colab.research.google.com/github/coegoke/NLP_Project/blob/main/Plagiarism_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
from tqdm import tqdm

In [None]:
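def preprocess_data(data_path, sample_size):
    """Load the CSV, drop rows without an abstract, and return a random sample of abstracts."""
    data = pd.read_csv(data_path, low_memory=False)

    # Rows with a missing abstract cannot be embedded, so drop them first
    data = data.dropna(subset=['abstract']).reset_index(drop=True)

    # Random sample without a fixed seed: the selected rows differ between runs
    data = data.sample(sample_size)['abstract']

    return data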


In [None]:
data_path = '/content/drive/MyDrive/NLP/data.csv'
source_data = preprocess_data(data_path, 100)

In [None]:
# !pip -q install transformers
# !pip -q install keras

In [None]:
import torch
from keras.preprocessing.sequence import pad_sequences
from transformers import BertTokenizer, AutoModelForSequenceClassification

In [None]:
model_path = "bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(model_path,
                                          do_lower_case = True
)

model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           output_attentions = False,
                                                           output_hidden_states = True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
source_data = pd.DataFrame(source_data)
source_data['apa'] = 'iyaiyaiya'
source_data

Unnamed: 0,abstract,apa
2188,BACKGROUND Due to demand on UK memory clinic s...,iyaiyaiya
7490,From the Executive Summary: The global effort ...,iyaiyaiya
246,BACKGROUND: Studies evaluating strategies for ...,iyaiyaiya
5332,Telemedicine has rapidly expanded in many aspe...,iyaiyaiya
7023,"BACKGROUND: In COVID-19 patients, undetected c...",iyaiyaiya
...,...,...
4977,[Image: see text] The COVID-19 pandemic is inc...,iyaiyaiya
5572,BACKGROUND/AIMS Patients who develop acute kid...,iyaiyaiya
8631,Background and Objectives: The aim of this ret...,iyaiyaiya
1875,We investigated severe acute respiratory syndr...,iyaiyaiya


In [None]:
def create_vector_from_text(tokenizer, model, text, MAX_LEN=510):
    # Tokenize the text; truncate explicitly so long abstracts fit in MAX_LEN
    input_ids = tokenizer.encode(
        text,
        add_special_tokens=True,
        max_length=MAX_LEN,
        truncation=True
    )

    # Pad the token IDs up to the maximum length
    results = pad_sequences(
        [input_ids],
        maxlen=MAX_LEN,
        dtype="long",
        truncating="post",
        padding="post"
    )
    input_ids = results[0]

    # Build the attention mask: 1 for real tokens, 0 for padding (pad ID is 0)
    attention_mask = [int(i > 0) for i in input_ids]

    # Convert the token IDs and the attention mask to PyTorch tensors
    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    # Add the batch dimension (batch_size=1)
    input_ids = input_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)

    # Put the model in evaluation mode
    model.eval()

    # Run the model to get the logits and the hidden states of every layer
    with torch.no_grad():
        logits, encoded_layers = model(
            input_ids=input_ids,
            token_type_ids=None,
            attention_mask=attention_mask,
            return_dict=False
        )

    # Select the layer, batch item, and token to use:
    # the last hidden layer (index 12) and the first token ([CLS])
    layer_i = 12
    batch_i = 0
    token_i = 0

    # Take the vector from the hidden states
    vector = encoded_layers[layer_i][batch_i][token_i]

    # Convert the vector to a NumPy array
    vector = vector.detach().cpu().numpy()

    return vector
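
The padding and attention-mask step inside `create_vector_from_text` can be illustrated on its own. The following is a minimal sketch in plain Python, not the notebook's actual code path: `pad_and_mask` is a hypothetical helper, the token IDs are illustrative values, and it assumes the padding ID is 0 (as the `i > 0` test above does).

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the matching attention mask."""
    ids = list(token_ids)[:max_len]       # "post" truncation: keep the first max_len IDs
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding


# Illustrative IDs only; real IDs come from the tokenizer
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Building the mask from the sequence length rather than from `id > 0` gives the same result here and would still be correct for a tokenizer whose padding ID is not 0.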


# Create Vector Database

In [None]:
import numpy as np

def create_vector_database(data):

    # The list of all the vectors
    vectors = []

    # Get overall text data
    source_data = data.abstract.values

    # Loop over all the abstracts and get their embeddings
    for text in tqdm(source_data):

        # Get the embedding
        vector = create_vector_from_text(tokenizer, model, text)

        # Add it to the list
        vectors.append(vector)

    data["vectors"] = vectors
    data["vectors"] = data["vectors"].apply(lambda emb: np.array(emb))
    data["vectors"] = data["vectors"].apply(lambda emb: emb.reshape(1, -1))

    return data


In [None]:
vector_database = create_vector_database(source_data)

  0%|          | 0/100 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 100/100 [04:06<00:00,  2.47s/it]


In [None]:
vector_database.sample(5)

Unnamed: 0,abstract,apa,vectors
9482,Of the various adverse reactions to COVID-19 v...,iyaiyaiya,"[[-0.6010775, -0.6136122, -0.5267856, -0.49154..."
3892,Rhinoviruses (RV’s) are common human pathogens...,iyaiyaiya,"[[-0.40913472, -0.1338518, -0.7423564, -0.3902..."
116,"In the last 20 years, accumulating evidence in...",iyaiyaiya,"[[-0.5469991, -0.30055866, -0.55782557, -0.099..."
2809,OBJECTIVES: To review the response to the coro...,iyaiyaiya,"[[-0.8285695, -0.9449627, -0.39063674, 0.09564..."
6228,BACKGROUND: Since the World Health Organizatio...,iyaiyaiya,"[[-0.82029444, -0.5826201, -0.09296857, -0.053..."


# Language detector and translation

In [None]:
#!pip -q install sentencepiece
from transformers import MarianMTModel, MarianTokenizer

In [None]:
"""
Candidate languages (ISO 639-1 codes)
de: German      fr: French
el: Greek       ja: Japanese
ru: Russian
"""
language_list = ['de', 'fr', 'el', 'ja', 'ru']

In [None]:
# Install the library
!pip -q install langdetect
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0

In [None]:
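def translate_text(text, text_lang, target_lang='en'):

  # Build the MarianMT checkpoint name for this language pair
  model_name = f"Helsinki-NLP/opus-mt-{text_lang}-{target_lang}"

  tokenizer = MarianTokenizer.from_pretrained(model_name)

  model = MarianMTModel.from_pretrained(model_name)

  # Prefix the text with a language token.
  # Note: Marian documents the >>lang<< prefix for multi-target checkpoints,
  # where it names the target language; it may be unnecessary for single-pair models.
  formated_text = ">>{}<< {}".format(text_lang, text)

  translation = model.generate(**tokenizer([formated_text],
                                           return_tensors = "pt",
                                           padding = True))

  # Decode the generated token IDs back to text
  translated_text = [tokenizer.decode(t, skip_special_tokens = True) for t in translation][0]

  return translated_text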
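
How a detected language maps to a translation checkpoint can be sketched separately from the model download. This is a minimal sketch under stated assumptions: `resolve_translation_model` is a hypothetical helper, it reuses the `Helsinki-NLP/opus-mt-{source}-{target}` naming pattern from `translate_text`, and it does not verify that a checkpoint with the resulting name actually exists on the Hub.

```python
SUPPORTED_LANGUAGES = ['de', 'fr', 'el', 'ja', 'ru']


def resolve_translation_model(text_lang, target_lang='en', supported=SUPPORTED_LANGUAGES):
    """Return the checkpoint name to use, or None if no translation is needed.

    Raises ValueError for a language outside the supported list.
    """
    if text_lang == target_lang:
        return None  # already in the target language, nothing to translate
    if text_lang not in supported:
        raise ValueError(f"Unsupported language: {text_lang}")
    # Name built from the pattern only; existence of the checkpoint is not checked
    return f"Helsinki-NLP/opus-mt-{text_lang}-{target_lang}"


print(resolve_translation_model('fr'))  # Helsinki-NLP/opus-mt-fr-en
print(resolve_translation_model('en'))  # None
```

Raising an exception for unsupported languages, instead of returning a sentinel, lets the caller decide how to report the problem.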

# Implement Plagiarism Analysis

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
def process_document(text):
    """
    Create a vector for given text and adjust it for cosine similarity search
    """
    text_vect = create_vector_from_text(tokenizer, model, text)
    text_vect = np.array(text_vect)
    text_vect = text_vect.reshape(1, -1)

    return text_vect


def is_plagiarism(similarity_score, plagiarism_threshold):
  """Return True when the similarity score reaches the plagiarism threshold."""
  return bool(similarity_score >= plagiarism_threshold)


def check_incoming_document(incoming_document):

  text_lang = detect(incoming_document)
  language_list = ['de', 'fr', 'el', 'ja', 'ru']

  final_result = ""

  if(text_lang == 'en'):
    final_result = incoming_document

  elif(text_lang not in language_list):
    final_result = None

  else:
    # Translate in English
    final_result = translate_text(incoming_document, text_lang)

  return final_result


def run_plagiarism_analysis(query_text, data, plagiarism_threshold=0.8):

    top_N=3

    # Check the language of the query/incoming text and translate if required.
    document_translation = check_incoming_document(query_text)

    if(document_translation is None):
      print("Only the following languages are supported: English, French, Russian, German, Greek and Japanese")
      # Return instead of calling exit(), which would stop the notebook kernel
      return None

    else:
      # Preprocess the document to get the required vector for similarity analysis
      query_vect = process_document(document_translation)

      # Run similarity Search
      data["similarity"] = data["vectors"].apply(lambda x: cosine_similarity(query_vect, x))
      data["similarity"] = data["similarity"].apply(lambda x: x[0][0])

      similar_articles = data.sort_values(by='similarity', ascending=False)[0:top_N+1]
      formated_result = similar_articles[["abstract", "similarity"]].reset_index(drop = True)

      similarity_score = formated_result.iloc[0]["similarity"]
      most_similar_article = formated_result.iloc[0]["abstract"]
      is_plagiarism_bool = is_plagiarism(similarity_score, plagiarism_threshold)

      plagiarism_decision = {'similarity_score': similarity_score,
                             'is_plagiarism': is_plagiarism_bool,
                             'most_similar_article': most_similar_article,
                             'article_submitted': query_text
                            }

      return plagiarism_decision
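
The ranking and threshold logic in `run_plagiarism_analysis` can be checked by hand on a toy example. The following is a minimal sketch in plain Python rather than the `sklearn` call used above: `cosine` and `rank_matches` are hypothetical helpers, the 2-dimensional vectors are made up for illustration, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|). Assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_matches(query, vectors, top_n=3, threshold=0.8):
    """Return (index, score, flagged) for the top_n most similar vectors."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i, s, s >= threshold) for i, s in scored[:top_n]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matches = rank_matches([1.0, 0.0], toy_vectors, top_n=2)
for index, score, flagged in matches:
    print(index, round(score, 4), flagged)
# 0 1.0 True
# 2 0.7071 False
```

An identical vector scores 1.0 and is flagged, while the vector at a 45-degree angle scores about 0.71 and stays below the 0.8 threshold.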

In [None]:
# Select an existing article from the database
new_incoming_text = source_data.iloc[0]['abstract']

# Run the plagiarism detection
analysis_result = run_plagiarism_analysis(new_incoming_text, vector_database, plagiarism_threshold=0.8)

In [None]:
analysis_result

{'similarity_score': 0.99999976,
 'is_plagiarism': True,
 'most_similar_article': "BACKGROUND Due to demand on UK memory clinic services, most patients have limited consultant interaction before diagnosis/discharge. Technology offers an opportunity for remote assessment, from telephone/video-based consultations to fully digitised cognitive assessments with potential to track disease progression. Whilst many acute services utilise remote assessment, there are perceived barriers in memory clinic populations. However, COVID-19 and related national restrictions may have altered patients' attitudes towards and experience with remote assessment tools. We aimed to investigate attitudes including confidence and perceived challenges towards remote assessment as well as access and experience with technology amongst Oxfordshire memory clinic patients. METHOD Between June and September 2020, all patients awaiting initial memory clinic assessment were asked to participate in a standardised semi-qua

In [None]:
french_article_to_check = """
The Innovation and Agricultural Transfer Networks (RITA) were created in 2011 to better connect agricultural research and development,
intra and inter-DOM, with the objective of supporting the diversification of local production. The CGAAER was tasked with analyzing this system and
to propose courses of action to improve the Research – Training – Innovation – Development – Transfer chain in the overseas territories in a context
  of sustainable agriculture, for the benefit of increasing food autonomy.
"""

In [None]:
!pip install sentencepiece


In [None]:
analysis_result = run_plagiarism_analysis(french_article_to_check, vector_database, plagiarism_threshold=0.8)
analysis_result

{'similarity_score': 0.7841553,
 'is_plagiarism': False,
 'most_similar_article': 'The present paper is a review of the main challenges faced by the management of a tertiary specialty hospital during the COVID-19 pandemic in the northern Italian region of Lombardy, an area of extremely high epidemic impact. The article focuses on the management of patient flows, access to the hospital, maintaining and reallocating staffing levels, and managing urgent referrals, information, and communications from the point of view of the hospital managers over a seven-week period. The objective of the article is to provide beneficial insights and solutions to other hospital managers and medical directors who should find themselves in the same or a similar situation. In such an epidemic emergency, in the authors’ opinion, the most important factors influencing the capability of the hospital to maintain operations are (1) sustaining the strict triage of patients, (2) the differentiation of flows and pat

In [None]:
from transformers import MarianMTModel, MarianTokenizer

# Standalone MarianMT demo: the >>lang<< prefix selects the target language
src_text = [
    ">>fr<< this is a sentence in english that we want to translate to french",
    ">>pt<< This should go to portuguese",
    ">>es<< And this to Spanish",
]

model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"

# Use separate names so the BERT `tokenizer` and `model` defined earlier are not overwritten
mt_tokenizer = MarianTokenizer.from_pretrained(model_name)
mt_model = MarianMTModel.from_pretrained(model_name)

translated = mt_model.generate(**mt_tokenizer(src_text, return_tensors="pt", padding=True))
tgt_text = [mt_tokenizer.decode(t, skip_special_tokens=True) for t in translated]
tgt_text

