### Import Libraries

In [12]:
import os

from plagiarism.adapters.elasticsearch_db import ElasticsearchConn
from plagiarism.adapters.name_detector import NameDetector
from plagiarism.domain.models.bert_model import BertModelWrapper
from plagiarism.settings.elasticsearch_settings import ElasticsearchSettings

### Start Elasticsearch Connection & Instance of BERT Model

In [13]:
bert_model = BertModelWrapper()
name_detector = NameDetector()

# Initialize Elasticsearch client
es_con = ElasticsearchConn(ElasticsearchSettings())
es = es_con.conn
print(es.info())

Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'name': '27ab35b5ca16', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'XNNmpQcgSGGXKjjcn62UOQ', 'version': {'number': '8.13.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '16cc90cd2d08a3147ce02b07e50894bc060a4cbf', 'build_date': '2024-04-05T14:45:26.420424304Z', 'build_snapshot': False, 'lucene_version': '9.10.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}
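
The warning above concerns only the pooler head (`bert.pooler.dense.*`), which is missing from this checkpoint; the encoder's token-level hidden states are loaded normally. How `BertModelWrapper.get_embedding` turns those states into a single document vector is not shown in this notebook. One common approach that does not depend on the uninitialized pooler is masked mean pooling over the last hidden states. The sketch below illustrates the idea with NumPy on dummy arrays; `mean_pool` and its inputs are hypothetical names, not part of the project.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions.

    token_embeddings: (seq_len, hidden_size) array of per-token vectors.
    attention_mask:   (seq_len,) array with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask.astype(token_embeddings.dtype)[:, None]  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # guard against an all-padding input
    return summed / count


# Dummy data: two real tokens and one padding token that must be ignored.
tokens = np.array([[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])
pooled = mean_pool(tokens, mask)
print(pooled)
```

Only the two unmasked rows contribute to the average, so the padding row has no effect on the result.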


In [14]:
landing_txt_path = "../../../data_txt/files/"
index_name = "documents_testing"

### Create Index

In [15]:
index_status = es_con.create_index_if_not_exists(index_name)
print(index_status)

Index already exists
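
The body of `create_index_if_not_exists` is not shown here. For similarity search over BERT embeddings, the index needs a vector field alongside the text. Below is a minimal sketch of what such a helper might look like, assuming the Elasticsearch 8.x Python client and a 768-dimensional `dense_vector` (the hidden size of BERT-base models). The field names `text` and `embedding` and the message `"Index created"` are assumptions; `student_name` appears in the search output later in this notebook.

```python
def build_mappings(dims: int = 768) -> dict:
    """Index mappings with a text field, an author field, and a vector field.

    Field names "text" and "embedding" are hypothetical.
    """
    return {
        "properties": {
            "text": {"type": "text"},
            "student_name": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


def create_index_if_not_exists(es, index_name: str) -> str:
    """Create the index only when it is missing.

    `es` is expected to be an elasticsearch.Elasticsearch client (8.x).
    """
    if es.indices.exists(index=index_name):
        return "Index already exists"
    es.indices.create(index=index_name, mappings=build_mappings())
    return "Index created"  # hypothetical status message
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no cluster is running.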


### Index Documents

In [16]:
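for filename in os.listdir(landing_txt_path):
    # Only plain-text files are indexed; anything else in the folder is skipped.
    if filename.endswith(".txt"):
        print(f"File: {filename}")
        file_path = os.path.join(landing_txt_path, filename)
        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()

        # The filename serves as the document ID; the BERT embedding and the
        # detected author are stored together with the text.
        indexing_status = es_con.index_document(index_name, filename, text,
                                                bert_model.get_embedding(text),
                                                name_detector.get_author(file_path, text))
        print(indexing_status)

# Refresh the index so the newly indexed documents become searchable.
print(es_con.refresh(index_name))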

File: TP_2_Weiss_Gonzalo.txt
Document indexed successfully
File: TP6-Gariglio.txt
Document indexed successfully
File: UTNMKT2016-MoraLeandro-TP4.txt
Document indexed successfully
File: TP4 - Comercio Electronico - Marina Pross.txt
Document indexed successfully
File: TP N°1 - Wikinomics (1).txt
Document indexed successfully
File: TP4 - Difusión y adopción TIC - Ramirez Fernando 2017.txt
Document indexed successfully
File: TP N°04 (1).txt
Document indexed successfully
File: TP6-Sistemasemergentes-RamirezFernando2017.txt
Document indexed successfully
File: TP 4 - Wikinomía.txt
Document indexed successfully
File: TP7 Dominio de la informacion - Juan Facundo Obregon (1).txt
Document indexed successfully
File: TP 1 - Marketing (1).txt
Document indexed successfully
File: TP 3 (1).txt
Document indexed successfully
File: TP2PabloPallocchi (1).txt
Document indexed successfully
File: TP MKT - Grupo 3 - Interfaces Humanizadas.txt
Document indexed successfully
File: TP Larga Cola.txt
Document ind

## Test the plagiarism detection with examples:

### Test 1
#### NOT PLAGIARIZED

In [17]:
filename_test = "exmaple.txt"
# Example query
query = ("Este texto es un texto de ejemplo para probar el modelo BERT. Vamos a hablar sobre las criptomonedas."
         "Bitcoin es la criptomoneda más popular. Ethereum es otra criptomoneda muy conocida.")
similar_docs = es_con.search_similar_documents(index_name, query)

print(f"\nSimilar documents to '{filename_test}':")
for doc in similar_docs:
    print(f"\nScore: {doc['_score']}, Document ID: {doc['_id']}, Student Name: {doc['_source']['student_name']}")


Similar documents to 'exmaple.txt':

Score: 0.91521233, Document ID: TP N° 2 – La larga cola - Melanie Blejter.txt, Student Name: Melanie Blejter

Score: 0.91408896, Document ID: TP2 (1).txt, Student Name: Tomás Duhourq

Score: 0.9139896, Document ID: TP1.txt, Student Name: UNKNOWN_AUTHOR

Score: 0.91380066, Document ID: PREGUNTAS TP Machine, Platform, Crowd Gabriela Gonzalez.txt, Student Name: Erik Brynjolfsson

Score: 0.9135382, Document ID: TP Rifkin.txt, Student Name: UNKNOWN_AUTHOR
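
Each score is a similarity between the query embedding and a stored document embedding. The query issued by `search_similar_documents` is not shown. Assuming it runs an Elasticsearch kNN search over a `dense_vector` field with `cosine` similarity, Elasticsearch reports `_score = (1 + cosine) / 2`, so values fall in [0, 1]. Under that assumption, the computation can be sketched as follows (both function names are hypothetical):

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def knn_cosine_score(a, b) -> float:
    """Rescale cosine from [-1, 1] to [0, 1], as Elasticsearch kNN does for `cosine`."""
    return (1.0 + cosine_similarity(a, b)) / 2.0


print(knn_cosine_score([1.0, 0.0], [1.0, 0.0]))  # same direction
print(knn_cosine_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

If this scoring convention applies, a score of 0.97 corresponds to a cosine similarity of 0.94.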


### Test 2
#### PLAGIARIZED

In [18]:
query_2 = ('1)	¿Cómo define Anderson a "La larga cola"?  ¿Por qué asegura que es el presente y futuro de la economía minorista? Grafique. '
           'El concepto de La larga cola explica que además de los productos populares/hits existe una enorme cantidad de productos no tan consumidos/de nicho que totalizados representan una porción muy grande del mercado que puede incluso competir con los hits.'
           'Asegura que es el presente y futuro de la economía minorista porque ahora quienes fabrican pueden pones a disposición de minorista en el mercado no sólo productos para el común denominador de la gente sino también para los nichos.'
           'El consumidor puede encontrar todo en la larga cola.'
           'Llevado a gráfico puede representarse genéricamente de la siguiente manera:')

similar_docs = es_con.search_similar_documents(index_name, query_2)

print("\nSimilar documents to La larga cola test:")
for doc in similar_docs:
    print(f"\nScore: {doc['_score']}, Document ID: {doc['_id']}, Student Name: {doc['_source']['student_name']}")


Similar documents to La larga cola test:

Score: 0.97487867, Document ID: TP Larga Cola.txt, Student Name: Hernán Suzuki Son

Score: 0.9740537, Document ID: Tp2 Filannino marketing en internet.txt, Student Name: UNKNOWN_AUTHOR

Score: 0.9731908, Document ID: TP 2 - La economía Long Tail.txt, Student Name: Ventas

Score: 0.97317445, Document ID: TP1 - Larga cola.txt, Student Name: Ley De Pareto

Score: 0.9727902, Document ID: UTN - TP 2 - Matias Sas .txt, Student Name: Matias Sas


## Conclusion
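### With a threshold of 0.97, a query text has a good chance of being plagiarized: the plagiarized example (Test 2) scored between 0.9728 and 0.9749 against its top five matches, while the original example (Test 1) stayed below 0.916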
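
The threshold rule can be applied directly to the hits returned by the search. Below is a minimal sketch, assuming hits are dictionaries with `_id` and `_score` keys as in the outputs above. `flag_suspects` and `PLAGIARISM_THRESHOLD` are hypothetical names, and the sample scores are copied from Tests 1 and 2.

```python
PLAGIARISM_THRESHOLD = 0.97  # value taken from the conclusion above


def flag_suspects(hits, threshold=PLAGIARISM_THRESHOLD):
    """Return (document ID, score) pairs whose score reaches the threshold."""
    return [(hit["_id"], hit["_score"]) for hit in hits if hit["_score"] >= threshold]


# Top two hits from each test, copied from the notebook output.
test_1_hits = [
    {"_id": "TP2 (1).txt", "_score": 0.91408896},
    {"_id": "TP1.txt", "_score": 0.9139896},
]
test_2_hits = [
    {"_id": "TP Larga Cola.txt", "_score": 0.97487867},
    {"_id": "Tp2 Filannino marketing en internet.txt", "_score": 0.9740537},
]

print(flag_suspects(test_1_hits))
print(flag_suspects(test_2_hits))
```

All five hits in Test 2 exceed 0.97, so the rule flags the query as a whole rather than singling out one source document.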