<a href="https://colab.research.google.com/github/arindamkeswani/RePlicate/blob/main/RePlicate_(HPC_Project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Observations and assumptions so far:
1. Since an online plagiarism checker would run on a cloud platform, we abandoned local testing in favour of a platform like Colab, as that gives more realistic results.
2. The major tests compare the execution time taken by:
  1. Serial implementation (baseline time)
  2. Multiprocessing library (baseline data-parallel time)
  3. Numba library (potential built-in optimum time)
  4. If time permits, other options such as CUDA will be considered for implementation

3. Other libraries, such as ipyparallel, performed worse than expected, and worse than the serial implementation, so they were left out of the final analysis.
4. Text files will be fed as input to the program (to be converted beforehand, serially or manually, since this project focuses on parallel plagiarism detection, not conversion).

5. Final output will consist of two parts, from the perspective of:
 1. Product: a DataFrame/spreadsheet-type structure showing the level of plagiarism between files
 2. Research: a time-based comparison between the aforementioned methodologies

 The original expectation was that data parallelism would perform better, but that is not the case, hence a black implementation had to be adopted for successful completion.
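
The time-based comparison mentioned in item 5 could be collected with a small helper along these lines. This is only a sketch: `time_call` and the two toy workloads are hypothetical names that do not appear in the notebook, and `time.perf_counter` is used instead of `time.time` because it is intended for measuring short intervals.

```python
import time

def time_call(func, *args, repeats=5):
    """Run func(*args) `repeats` times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result

# Hypothetical stand-ins for the serial / multiprocessing / Numba variants
def sum_squares_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def sum_squares_builtin(n):
    return sum(i * i for i in range(n))

timings = {}
for name, func in [("loop", sum_squares_loop), ("builtin", sum_squares_builtin)]:
    seconds, value = time_call(func, 100_000)
    timings[name] = seconds
    print(f"{name}: {seconds:.6f} s (result {value})")
```

Taking the best of several repeats reduces the influence of one-off noise, which matters when individual runs take only a few milliseconds.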


---

Implementation:
1. The first part involves building the plagiarism checker and applying the various devised methodologies.
2. The next step is building a PDF-to-text converter. The goal is to build a simple converter, but if time permits, it will be implemented in parallel.
3. The aim of the project is to create these two modules. Future work will involve integrating them.

In [1]:
from google.colab import files 
uploaded = files.upload()

Saving fatma.txt to fatma.txt
Saving john.txt to john.txt
Saving juma.txt to juma.txt


In [55]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import time
import pandas as pd
from numba import jit
from numba import njit
from numba.typed import List

In [56]:
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')] #names of all text files in the working directory
student_notes =[open(File).read() for File in  student_files] #full text of each file, in the same order as student_files

In [57]:
vectorize = lambda Text: TfidfVectorizer().fit_transform(Text).toarray()  #to vectorize the words of text files
# similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2]) #to store similarity of two documents

In [58]:
vectors = vectorize(student_notes) #store vectorized values
s_vectors = list(zip(student_files, vectors)) #store it with file names
# plagiarism_results = set() #to store results in a set
plagiarism_results =[]
# s_vectors

In [121]:
def similarity(doc1, doc2):
  return cosine_similarity([doc1, doc2])
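
For reference, `cosine_similarity([doc1, doc2])` returns a 2x2 matrix, and the `[0][1]` entry used later is the cosine of the angle between the two TF-IDF vectors. A minimal pure-Python sketch of that formula follows; `cosine` is a hypothetical helper for illustration only, and the notebook itself keeps using scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal: no shared terms
print(cosine([1.0, 1.0], [1.0, 0.0]))  # partial overlap
```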

In [122]:
def check_plagiarism(s_vectors_partial):
    # similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2]) #to store similarity of two documents
    plagiarism_results =[]
    print("Starting process...")
    global s_vectors
    for student_a, text_vector_a in s_vectors_partial:  #traverse through students and their vectors (for first document)
        # print(f"Started testing:{student_a}")
        print("Started testing:",student_a)
        new_vectors = s_vectors.copy() 
        
        # current_index = new_vectors.index((student_a, text_vector_a))
        # del new_vectors[current_index]
        

        for student_b , text_vector_b in new_vectors: #traverse through students and their vectors (for second document)
            # print(f"Testing {student_a} against {student_b}")
            print("Testing",student_a,"against",student_b)
            sim_score = similarity(text_vector_a, text_vector_b)[0][1] #calculate similarity of both documents
            # student_pair = sorted((student_a, student_b)) 
            student_pair = (student_a, student_b) 
            # score = (student_pair[0], student_pair[1],sim_score)
            score = [student_pair[0], student_pair[1],sim_score]
            # plagiarism_results.add(score) #add score with file names into the set
            plagiarism_results.append(score)
            print("Finished testing",student_a,"against",student_b)
        print()
    return plagiarism_results  
    # return createTable(plagiarism_results)

def createTable(ans):
    # Square table of pairwise scores, labelled by file name on both axes
    df=pd.DataFrame(np.zeros((len(student_files),len(student_files))),index=student_files,columns=student_files)

    for data in ans:  # data is [file_a, file_b, score]
      for rowName in range(len(student_files)):
        if df.index[rowName]==data[0]:
          r=rowName
          for colName in range(len(student_files)):
            if df.columns[colName]==data[1]:  # match against the column labels, not the row labels
              c=colName

              df.iloc[r,c] = data[2]
    return df
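
Because cosine similarity is symmetric and every file scores 1.0 against itself, the nested loop in `check_plagiarism` does more comparisons than strictly needed. One possible refinement, sketched here with a hypothetical `unique_pairs` helper that is not part of the notebook, is to enumerate each unordered pair once with `itertools.combinations`; the score for (b, a) can then be copied from (a, b).

```python
from itertools import combinations

def unique_pairs(names):
    """Return each unordered pair of distinct names exactly once."""
    return list(combinations(names, 2))

files = ["juma.txt", "fatma.txt", "john.txt"]
pairs = unique_pairs(files)
print(pairs)
print(len(pairs), "comparisons instead of", len(files) ** 2)
```

For n files this gives n(n-1)/2 comparisons rather than n², which also changes how the work should be divided among workers, since the pairs rather than the rows become the unit of work.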

In [123]:
%%time
#Serial (the %%time cell magic must be the first line of the cell)
start=time.time()
ans=check_plagiarism(s_vectors)
df=createTable(ans)

end=time.time()
print()
print("Time taken:", end-start)

Starting process...
Started testing: juma.txt
Testing juma.txt against juma.txt
Finished testing juma.txt against juma.txt
Testing juma.txt against fatma.txt
Finished testing juma.txt against fatma.txt
Testing juma.txt against john.txt
Finished testing juma.txt against john.txt

Started testing: fatma.txt
Testing fatma.txt against juma.txt
Finished testing fatma.txt against juma.txt
Testing fatma.txt against fatma.txt
Finished testing fatma.txt against fatma.txt
Testing fatma.txt against john.txt
Finished testing fatma.txt against john.txt

Started testing: john.txt
Testing john.txt against juma.txt
Finished testing john.txt against juma.txt
Testing john.txt against fatma.txt
Finished testing john.txt against fatma.txt
Testing john.txt against john.txt
Finished testing john.txt against john.txt


Time taken: 0.018192291259765625
CPU times: user 21.1 ms, sys: 1.31 ms, total: 22.4 ms
Wall time: 18.3 ms


In [124]:
df

           juma.txt  fatma.txt  john.txt
juma.txt   1.000000   0.186434  0.546597
fatma.txt  0.186434   1.000000  0.148069
john.txt   0.546597   0.148069  1.000000
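
One way to read the table is to list the distinct pairs from most to least similar. The sketch below hard-codes the three off-diagonal scores shown above; the `threshold` value of 0.5 is an arbitrary assumption for illustration, not a cut-off used by the project.

```python
# Off-diagonal scores copied from the table above
scores = {
    ("juma.txt", "fatma.txt"): 0.186434,
    ("juma.txt", "john.txt"): 0.546597,
    ("fatma.txt", "john.txt"): 0.148069,
}

threshold = 0.5  # hypothetical cut-off, chosen only for this example

# Sort pairs by score, highest first
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
flagged = [pair for pair, score in ranked if score >= threshold]

for pair, score in ranked:
    print(pair, score)
print("Pairs at or above threshold:", flagged)
```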




---

Parallel [Manual] approach

Ways to achieve data parallelism:
1. Divide s_vectors in parts (more likely)
2. Divide s_vectors[0][1] in parts (potential)
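
The first option, dividing `s_vectors` into parts, generalises beyond two halves. A small sketch, assuming a hypothetical `split_into_chunks` helper (not in the notebook) that produces near-equal contiguous slices:

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

files = ["juma.txt", "fatma.txt", "john.txt"]
print(split_into_chunks(files, 2))
```

Unlike a `len(s_vectors)//2` split, this helper puts any leftover items in the earlier chunks; either choice is fine as long as every item lands in exactly one chunk.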

In [116]:
#Parallel [Manual] Part 1
start=time.time()
ans=check_plagiarism(s_vectors[:len(s_vectors)//2])

# for data in ans:
#     print(data)
print(createTable(ans))
end=time.time()
print()
print("Time taken:", end-start)

Starting process...
Started testing: juma.txt
Testing juma.txt against juma.txt
Finished testing juma.txt against juma.txt
Testing juma.txt against fatma.txt
Finished testing juma.txt against fatma.txt
Testing juma.txt against john.txt
Finished testing juma.txt against john.txt

           juma.txt  fatma.txt  john.txt
juma.txt        1.0   0.186434  0.546597
fatma.txt       0.0   0.000000  0.000000
john.txt        0.0   0.000000  0.000000

Time taken: 0.017520666122436523


In [117]:
#Parallel [Manual] Part 2
start=time.time()
ans=check_plagiarism(s_vectors[len(s_vectors)//2:])

# for data in ans:
#     print(data)
print(createTable(ans))
end=time.time()
print()
print("Time taken:", end-start)

Starting process...
Started testing: fatma.txt
Testing fatma.txt against juma.txt
Finished testing fatma.txt against juma.txt
Testing fatma.txt against fatma.txt
Finished testing fatma.txt against fatma.txt
Testing fatma.txt against john.txt
Finished testing fatma.txt against john.txt

Started testing: john.txt
Testing john.txt against juma.txt
Finished testing john.txt against juma.txt
Testing john.txt against fatma.txt
Finished testing john.txt against fatma.txt
Testing john.txt against john.txt
Finished testing john.txt against john.txt

           juma.txt  fatma.txt  john.txt
juma.txt   0.000000   0.000000  0.000000
fatma.txt  0.186434   1.000000  0.148069
john.txt   0.546597   0.148069  1.000000

Time taken: 0.0187532901763916
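
Each part above fills only its own rows, so the full table comes from combining the two. A sketch of that merge step follows, using plain dictionaries and a hypothetical `merge_partial_results` helper; the triples mirror the `[file_a, file_b, score]` lists that `check_plagiarism` returns, with scores taken from the printed tables.

```python
def merge_partial_results(*parts):
    """Combine lists of [file_a, file_b, score] triples into one nested dict."""
    table = {}
    for part in parts:
        for file_a, file_b, score in part:
            table.setdefault(file_a, {})[file_b] = score
    return table

part_1 = [
    ["juma.txt", "juma.txt", 1.0],
    ["juma.txt", "fatma.txt", 0.186434],
    ["juma.txt", "john.txt", 0.546597],
]
part_2 = [
    ["fatma.txt", "juma.txt", 0.186434],
    ["fatma.txt", "fatma.txt", 1.0],
    ["fatma.txt", "john.txt", 0.148069],
    ["john.txt", "juma.txt", 0.546597],
    ["john.txt", "fatma.txt", 0.148069],
    ["john.txt", "john.txt", 1.0],
]

full = merge_partial_results(part_1, part_2)
for name, row in full.items():
    print(name, row)
```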




---

Multiprocessing approach

In [None]:
s_vectors[:len(s_vectors)//2]

In [118]:
#Multiprocessing approach
import multiprocessing

# Two worker processes, one per half of the data
pool = multiprocessing.Pool(processes=2)

# Split the (file name, vector) pairs into two halves
l1 = s_vectors[:len(s_vectors)//2]
l2 = s_vectors[len(s_vectors)//2:]

start=time.time()  # timing excludes pool start-up

result = pool.map(check_plagiarism, [l1,l2])

for i in result:
  print(i)

# pool.map returns one list of scores per worker; flatten them before building the table
print(createTable([score for part in result for score in part]))

end=time.time()

# Release the worker processes
pool.close()
pool.join()

print("Time taken: ",end-start)

Starting process...
Starting process...
Started testing: juma.txt
Testing juma.txt against juma.txt
Finished testing juma.txt against juma.txt
Testing juma.txt against fatma.txt
Started testing: fatma.txt
Testing fatma.txt against juma.txt
Finished testing juma.txt against fatma.txt
Testing juma.txt against john.txt
Finished testing fatma.txt against juma.txt
Finished testing juma.txt against john.txt
Testing fatma.txt against fatma.txt

Finished testing fatma.txt against fatma.txt
Testing fatma.txt against john.txt
Finished testing fatma.txt against john.txt

Started testing: john.txt
Testing john.txt against juma.txt
Finished testing john.txt against juma.txt
Testing john.txt against fatma.txt
Finished testing john.txt against fatma.txt
Testing john.txt against john.txt
Finished testing john.txt against john.txt

[['juma.txt', 'juma.txt', 1.0000000000000004], ['juma.txt', 'fatma.txt', 0.18643448370323362], ['juma.txt', 'john.txt', 0.5465972177348937]]
[['fatma.txt', 'juma.txt', 0.186



---

Numba approach

In [125]:
#Numba approach
from numba.core.errors import NumbaDeprecationWarning, NumbaPendingDeprecationWarning
import warnings

warnings.simplefilter('ignore', category=NumbaDeprecationWarning)
warnings.simplefilter('ignore', category=NumbaPendingDeprecationWarning)

In [127]:
%%time
start=time.time()
# try:

num_res= jit(parallel=True)(check_plagiarism)
a=num_res(s_vectors)

end=time.time()


Compilation is falling back to object mode WITH looplifting enabled because Function "check_plagiarism" failed type inference due to: Untyped global name 'similarity': cannot determine Numba type of <class 'function'>

File "<ipython-input-122-6caef6bd4f15>", line 18:
def check_plagiarism(s_vectors_partial):
    <source elided>
            print("Testing",student_a,"against",student_b)
            sim_score = similarity(text_vector_a, text_vector_b)[0][1] #calculate similarity of both documents
            ^

  def check_plagiarism(s_vectors_partial):
Compilation is falling back to object mode WITHOUT looplifting enabled because Function "check_plagiarism" failed type inference due to: cannot determine Numba type of <class 'numba.core.dispatcher.LiftedLoop'>

File "<ipython-input-122-6caef6bd4f15>", line 6:
def check_plagiarism(s_vectors_partial):
    <source elided>
    global s_vectors
    for student_a, text_vector_a in s_vectors_partial:  #traverse through students and their vector

Starting process...
Started testing: juma.txt
Testing juma.txt against juma.txt
Finished testing juma.txt against juma.txt
Testing juma.txt against fatma.txt
Finished testing juma.txt against fatma.txt
Testing juma.txt against john.txt
Finished testing juma.txt against john.txt

Started testing: fatma.txt
Testing fatma.txt against juma.txt
Finished testing fatma.txt against juma.txt
Testing fatma.txt against fatma.txt
Finished testing fatma.txt against fatma.txt
Testing fatma.txt against john.txt
Finished testing fatma.txt against john.txt

Started testing: john.txt
Testing john.txt against juma.txt
Finished testing john.txt against juma.txt
Testing john.txt against fatma.txt
Finished testing john.txt against fatma.txt
Testing john.txt against john.txt
Finished testing john.txt against john.txt

CPU times: user 690 ms, sys: 39.8 ms, total: 730 ms
Wall time: 728 ms



File "<ipython-input-122-6caef6bd4f15>", line 6:
def check_plagiarism(s_vectors_partial):
    <source elided>
    global s_vectors
    for student_a, text_vector_a in s_vectors_partial:  #traverse through students and their vectors (for first document)
    ^

  state.func_ir.loc))
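
The warnings above report that Numba could not type the call to `similarity`, which wraps scikit-learn's `cosine_similarity`, so it fell back to object mode, which gives little or no speed-up over the interpreter. Numba's compiled mode works on NumPy arrays and simple loops, so one option is to move only the numeric part into a compiled kernel. The following is a sketch under that assumption, not the project's implementation: `pairwise_cosine` is a hypothetical function, it expects a 2-D float array with no all-zero rows, and the `try`/`except` exists only so the example still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fall back to plain Python so the sketch still runs
    def njit(*args, **kwargs):
        return lambda func: func
    prange = range

@njit(parallel=True)
def pairwise_cosine(vectors):
    """Cosine similarity between every pair of rows (rows must be non-zero)."""
    n, d = vectors.shape
    norms = np.zeros(n)
    for i in prange(n):
        s = 0.0
        for k in range(d):
            s += vectors[i, k] * vectors[i, k]
        norms[i] = np.sqrt(s)

    out = np.zeros((n, n))
    for i in prange(n):  # each iteration writes only row i
        for j in range(n):
            dot = 0.0
            for k in range(d):
                dot += vectors[i, k] * vectors[j, k]
            out[i, j] = dot / (norms[i] * norms[j])
    return out

# Hypothetical toy input standing in for the TF-IDF matrix
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(pairwise_cosine(vectors))
```

The file names and the DataFrame construction would stay in ordinary Python; only the array arithmetic goes through the compiled function. The first call includes compilation time, so it should be timed separately from later calls.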


In [106]:
#Elapsed wall-clock time between the start and end timestamps
print("Time taken: ", end-start)

Time taken:  0.6218178272247314


In [128]:
df2=createTable(a)
df2

           juma.txt  fatma.txt  john.txt
juma.txt   1.000000   0.186434  0.546597
fatma.txt  0.186434   1.000000  0.148069
john.txt   0.546597   0.148069  1.000000




---


Rough space

In [None]:
for data in ans:
  for rowName in range(len(student_files)):
    if df.index[rowName]==data[0]:
      r=rowName
      for colName in range(len(student_files)):
        if df.index[colName]==data[1]:
          c=colName

          df.iloc[r,c] = data[2]
df

In [None]:
df=pd.DataFrame(np.zeros((len(student_files),len(student_files))),index=student_files,columns=student_files)
df

In [None]:
df2=pd.DataFrame(np.zeros((len(student_files),len(student_files))),index=student_files,columns=student_files)
df2

           juma.txt  fatma.txt  john.txt
juma.txt        0.0        0.0       0.0
fatma.txt       0.0        0.0       0.0
john.txt        0.0        0.0       0.0


In [None]:
for data in a:
  for rowName in range(len(student_files)):
    if df2.index[rowName]==data[0]:
      r=rowName
      for colName in range(len(student_files)):
        if df2.index[colName]==data[1]:
          c=colName

          df2.iloc[r,c] = data[2]
df2

           juma.txt  fatma.txt  john.txt
juma.txt   1.000000   0.186434  0.546597
fatma.txt  0.186434   1.000000  0.148069
john.txt   0.546597   0.148069  1.000000


In [108]:
from numba import jit


def sq(n):
  s=0
  for i in range(n):
    s+=i**2
  print(s)


In [109]:
%%time
sq(100000)

333328333350000
CPU times: user 30.7 ms, sys: 0 ns, total: 30.7 ms
Wall time: 31.5 ms


In [111]:
from numba import jit

# @jit(nopython=True)
def sq2(n):
  s=0
  for i in range(n):
    s+=i**2
  print(s)


In [112]:
%%time
ans=jit(nopython=True)(sq2)  # only wraps sq2; compilation happens on the first call

CPU times: user 238 µs, sys: 28 µs, total: 266 µs
Wall time: 270 µs


In [None]:
%%time 
ans(100000)

333328333350000
CPU times: user 138 µs, sys: 4 µs, total: 142 µs
Wall time: 101 µs
