<a href="https://colab.research.google.com/github/arindamkeswani/RePlicator/blob/main/RePlicator_%5BTool_Notebook%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Research Summary:**
- 4 methodologies were implemented, tested, and compared, with time as the deciding metric:
  1. Serial implementation (for base time)
  2. Multiprocessing library (for base data-parallelism time)
  3. Psuedo data-parallelism (purely for research purposes)
  4. Numba library (for potential in-built optimum time)
- 5 Tests done on each methodology:
  1. Miniature Data [3 mini-files]
  2. Actual data [10 files]
  3. Actual data [26 files] [In-built cosine similarity]
  4. Actual data [26 files] [Manual cosine similarity]
  5. Actual data [50 files] [Manual cosine similarity]
- Mid-size dataset (20-30 files) saw a speed-up of `19.42%`, as compared to Serial time
- Large-size dataset (50+ files) saw a speed-up of `28.74%`, as compared to Serial time


---



#**Description:**
RePlicator (Relative Plagiarism Indicator) is a plagiarism indication tool meant to reduce the time taken to calculate level of plagiarism of a document with all the others in a dataset by using concepts of data-parallelism and libraries that work on similar principles.

Application targets can include research facilities, universities, etc. who need a comparative analysis of a certain set of documents.

Initially meant to be a research project, it was converted into an open-source tool with colab as its supporting platform, which enables users to use the tool **without worrying about licenses, hosting fees, and wasting memory on heavy applications.**


Currently supported file types: 
- .pdf
- .txt

**All the user needs to do is run the code cells step-by-step (guide below).**


---





#Steps to use the tool:
To run a code cell, simply press `[Shift + Enter]`

Follow the below steps in sequence
1. Click `Connect` on top right of the screen to **connect to the server**. You should see a green tick in a few seconds
2. Run cell in `Section 0` to **Import necessary libraries** for the tool to work
3. Run cell in `Section 1` to **Upload files**, when prompted
4. Run cell in `Section 2` for **PDF file processing**
5. Run cell in `Section 3` to **Check plagiarism**
6. Run cell in `Section 4` to **Display full table**
7. Run cell in `Section 5` to **Download full table**
8. Run cell in `Section 6` to **Download summarized table**


#**Section 0 : Import libraries**


In [6]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import time
import pandas as pd
from numba import jit
from numba import njit

from os import system
import sys
from numpy import dot #Alt for cosine similarity
from numpy.linalg import norm #Alt for cosine similarity

import multiprocessing

!pip install PyPDF2
import PyPDF2 

# !pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from google.colab import files 


print("__________________________________________\n\nALL LIBRARIES IMPORTED")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
__________________________________________

ALL LIBRARIES IMPORTED


#**Section 1 : Upload files**

In [7]:
uploaded = files.upload()

print("__________________________________________\n\nALL FILES UPLOADED")

Saving Foucault5.pdf to Foucault5.pdf
Saving Foucault6.pdf to Foucault6.pdf
Saving Foucault7.pdf to Foucault7.pdf
Saving Foucault8.pdf to Foucault8.pdf
Saving Gender1.pdf to Gender1.pdf
Saving john.txt to john.txt
Saving juma.txt to juma.txt
Saving ML1.pdf to ML1.pdf
Saving ML2.pdf to ML2.pdf
Saving PatFem1.txt to PatFem1.txt
Saving PatFem3.pdf to PatFem3.pdf
__________________________________________

ALL FILES UPLOADED


#**Section 2 : PDF File Processing**

In [8]:
student_files_pdf = [doc for doc in os.listdir() if doc.endswith('.pdf')]

# %%time
#Multiprocessing approach
# creating a pdf file object 
def convert2(student_files_pdf):
  for i in student_files_pdf:
    try:
      path=i
      pin='/content/'+path
      print(f"Converting {pin.split('/')[-1]}...")
      pout=pin[:-4]+".txt"
      print(pout)
      pdfFileObj = open(pin, 'rb') 
          
      # creating a pdf reader object 
      pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 
          
      # printing number of pages in pdf file 
      print(f"Number of pages: {pdfReader.numPages}") 
          
      # creating a page object 
      s=""
      for i in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(i) 
          
        # extracting text from page 
        
        s+=pageObj.extractText()
      print(f"Writing contents of {pin} to {pout}")
      myText = open(pout,'w')

      stop_words = set(stopwords.words('english'))
 
      word_tokens = word_tokenize(s)
      filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
 
      filtered_sentence = []
      
      for w in word_tokens:
          if w not in stop_words:
              filtered_sentence.append(w)
      
      sp = " "
  
      # joins elements of list1 by '-'
      # and stores in sting s

      filter_joined = sp.join(filtered_sentence)
      myText.write(filter_joined)
      myText.close()
      pdfFileObj.close()
      print('_'*100)
    except:
      print("Cannot convert",i)
      print('_'*100)
  
pool = multiprocessing.Pool(processes=2) 



l1 = student_files_pdf[:len(student_files_pdf)//2]
l2 = student_files_pdf[len(student_files_pdf)//2:]

start=time.time()

result = pool.map(convert2, [l1,l2])

for i in result:
  print(i)

Converting Cloud2.pdf...
/content/Cloud2.txt
Converting Foucault7.pdf...
/content/Foucault7.txt
Number of pages: 20
Number of pages: 133
Writing contents of /content/Foucault7.pdf to /content/Foucault7.txt
____________________________________________________________________________________________________
Converting Foucault6.pdf...
/content/Foucault6.txt
Number of pages: 3
Writing contents of /content/Foucault6.pdf to /content/Foucault6.txt
____________________________________________________________________________________________________
Converting Foucault1.pdf...
/content/Foucault1.txt
Number of pages: 4
Writing contents of /content/Foucault1.pdf to /content/Foucault1.txt
____________________________________________________________________________________________________
Converting ML1.pdf...
/content/ML1.txt




Number of pages: 18
Writing contents of /content/ML1.pdf to /content/ML1.txt
____________________________________________________________________________________________________
Converting Capitalism.pdf...
/content/Capitalism.txt
Number of pages: 10
Writing contents of /content/Capitalism.pdf to /content/Capitalism.txt
____________________________________________________________________________________________________
Converting ML2.pdf...
/content/ML2.txt
Number of pages: 30
Writing contents of /content/ML2.pdf to /content/ML2.txt
____________________________________________________________________________________________________
Converting Foucault5.pdf...
/content/Foucault5.txt
Number of pages: 15
Writing contents of /content/Foucault5.pdf to /content/Foucault5.txt
____________________________________________________________________________________________________
Writing contents of /content/Cloud2.pdf to /content/Cloud2.txt
________________________________________________________



Number of pages: 19
Writing contents of /content/Foucault2.pdf to /content/Foucault2.txt
____________________________________________________________________________________________________
None
None


#**Section 3 : Check plagiarism**

In [17]:
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')] #store all text files
student_notes =[open(File).read() for File in  student_files] #stores all lines of all files

vectorize = lambda Text: TfidfVectorizer().fit_transform(Text).toarray()  #to vectorize the words of text files

vectors = vectorize(student_notes) #store vectorized values
s_vectors = list(zip(student_files, vectors)) #store it with file names
plagiarism_results =[]

def check_plagiarism(s_vectors_partial):
    # similarity = lambda doc1, doc2: cosine_similarity([doc1, doc2]) #to store similarity of two documents
    plagiarism_results =[]
    
    sys.stdout.write("\r"+"Starting process...")
    global s_vectors
    for student_a, text_vector_a in s_vectors_partial:  #traverse through students and their vectors (for first document)
        # print(f"Started testing:{student_a}")
        # print("Started testing:",student_a)
        sys.stdout.write("\r"+"Started testing:"+student_a) # Cursor up one line
        # time.sleep(1)
        new_vectors = s_vectors.copy() 
        
        # current_index = new_vectors.index((student_a, text_vector_a))
        # del new_vectors[current_index]
        

        for student_b , text_vector_b in new_vectors: #traverse through students and their vectors (for second document)
            # print(f"Testing {student_a} against {student_b}")
            # print("Testing",student_a,"against",student_b)
            sys.stdout.write("\r"+"Testing: "+student_a+" | Against: "+student_b) # Cursor up one line
            # time.sleep(1)
            # sim_score = similarity(text_vector_a, text_vector_b)[0][1] #calculate similarity of both documents
            sim_score = dot(text_vector_a, text_vector_b)/(norm(text_vector_a)*norm(text_vector_b))
            # sim_score = cosine_similarity([text_vector_a, text_vector_b])[0][1]#########################Uncomment it later
            # student_pair = sorted((student_a, student_b)) 
            student_pair = (student_a, student_b) 
            # score = (student_pair[0], student_pair[1],sim_score)
            score = [student_pair[0], student_pair[1],float("%.2f" % round(sim_score*100, 2))]
            # plagiarism_results.add(score) #add score with file names into the set
            plagiarism_results.append(score)
            # print("Finished testing",student_a,"against",student_b)
        sys.stdout.write("\r"+"Finished testing: "+student_a)
        sys.stdout.write("\r")
        # print()
    sys.stdout.write("\r"+"TESTING COMPLETE!")
    return plagiarism_results  
    # return createTable(plagiarism_results)

def createTable(ans):
    df=pd.DataFrame(np.zeros((len(student_files),len(student_files))),index=student_files,columns=student_files)

    for data in ans:
      for rowName in range(len(student_files)):
        if df.index[rowName]==data[0]:
          r=rowName
          for colName in range(len(student_files)):
            if df.index[colName]==data[1]:
              c=colName

              df.iloc[r,c] = data[2]
    return df
num_res= jit(parallel=True,forceobj=True)(check_plagiarism)
ans=num_res(s_vectors)

TESTING COMPLETE!

#**Section 4 : Full table**

In [18]:
df=createTable(ans)
df

Unnamed: 0,ML2.txt,Cloud2.txt,Capitalism.txt,PatFem3.txt,PatFem1.txt,Cloud1.txt,john.txt,Foucault5.txt,Foucault1.txt,Foucault2.txt,juma.txt,Foucault8.txt,Gender1.txt,Foucault7.txt,ML1.txt,Foucault6.txt
ML2.txt,100.0,1.33,1.38,0.02,1.85,0.28,0.0,2.36,0.54,1.13,0.9,0.88,0.7,0.95,1.84,0.81
Cloud2.txt,1.33,100.0,3.13,0.09,11.68,3.27,8.04,7.67,4.85,8.84,2.95,5.24,6.39,7.28,16.59,4.92
Capitalism.txt,1.38,3.13,100.0,0.06,3.82,0.41,1.8,9.25,7.26,10.83,2.1,4.94,6.45,9.57,2.83,5.92
PatFem3.txt,0.02,0.09,0.06,100.0,0.6,0.0,0.0,0.27,0.51,0.9,0.0,0.49,0.24,0.99,0.33,0.37
PatFem1.txt,1.85,11.68,3.82,0.6,100.0,0.89,10.82,14.23,9.03,15.49,17.76,8.35,20.96,16.59,10.38,9.53
Cloud1.txt,0.28,3.27,0.41,0.0,0.89,100.0,0.09,0.6,0.29,0.79,0.16,0.39,0.42,0.47,0.48,0.33
john.txt,0.0,8.04,1.8,0.0,10.82,0.09,100.0,1.91,1.35,2.19,63.87,2.54,0.93,1.66,0.61,3.52
Foucault5.txt,2.36,7.67,9.25,0.27,14.23,0.6,1.91,100.0,12.35,25.43,3.3,18.8,6.85,15.25,7.19,14.76
Foucault1.txt,0.54,4.85,7.26,0.51,9.03,0.29,1.35,12.35,100.0,18.76,3.13,11.59,5.1,10.93,5.03,16.78
Foucault2.txt,1.13,8.84,10.83,0.9,15.49,0.79,2.19,25.43,18.76,100.0,4.2,38.67,13.71,28.23,8.98,37.05


#**Section 5 : Summary table**

In [19]:
df_res=pd.DataFrame()
df_res["Max plag value"]= df.apply(lambda row: row.nlargest(2).values[-1],axis=1)
df_res["Max plag doc"]= df.T.apply(lambda x: x.nlargest(2).idxmin())
df_res["Average plag"] = (df.sum(axis=1)-1) / (len(df)-1)
df_res

Unnamed: 0,Max plag value,Max plag doc,Average plag
ML2.txt,2.36,Foucault5.txt,7.598
Cloud2.txt,16.59,ML1.txt,12.751333
Capitalism.txt,10.83,Foucault2.txt,11.25
PatFem3.txt,0.99,Foucault7.txt,6.924667
PatFem1.txt,20.96,Gender1.txt,16.732
Cloud1.txt,3.27,Cloud2.txt,7.191333
john.txt,63.87,juma.txt,13.222
Foucault5.txt,25.43,Foucault2.txt,15.948
Foucault1.txt,18.76,Foucault2.txt,13.766667
Foucault2.txt,38.67,Foucault8.txt,20.946667


#**Section 6 : Download full table**

In [25]:
df.to_csv('FullTable.csv')
files.download('FullTable.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

#**Section 7 : Download summary table**

In [27]:
df_res.to_csv('Summary.csv')
files.download('Summary.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>



---

