In [75]:
%matplotlib inline
from nltk.tokenize import sent_tokenize
from google import search
from collections import Counter
import editdistance
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context('notebook')
sns.set_style('whitegrid')

# Plagiarism Check

While reading [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672989/#!po=1.82927), which I had been sent, I googled the following paragraph:


`Vacuum extraction was first described in 1705 by Dr. James Yonge, an English surgeon, several decades before the invention of the obstetric forceps. However, it did not gain widespread use until the 1950s, when it was popularized in a series of studies by the Swedish obstetrician Dr. Tage Malmström.5 By the 1970s, the vacuum extractor had almost completely replaced forceps for assisted vaginal deliveries in most northern European countries, but its popularity in many English-speaking countries, including the United States and the United Kingdom, was limited. By 1992, however, the number of vacuum assisted deliveries surpassed the number of forceps deliveries in the United States, and by the year 2000 approximately 66% of operative vaginal deliveries were by vacuum`

and found that in [this textbook chapter 15 10th edition e-book version  ](https://books.google.ca/books?id=zRNBDwAAQBAJ&pg=PA161&dq=.+Vacuum+extraction+was+first+described+in+1705+by+Dr.+James+Yonge,+an+English+surgeon,+several+decades+before+the+invention+of+the+obstetric+forceps.+However,+it+did+not+gain+widespread+use+until+the+1950s,+when+it+was+popularized+in+a+series+of+studies+by+the+Swedish+obstetrician+Dr.+Tage+Malmstr%C3%B6m.5+By+the+1970s,+the+vacuum+extractor+had+almost+completely+replaced+forceps+for+assisted+vaginal+deliveries+in+most+northern+European+countries,+but+its+popularity+in+many+English-speaking+countries,+including+the+United+States+and+the+United+Kingdom,+was+limited.+By+1992,+however,+the+number+of+vacuum+assisted+deliveries+surpassed+the+number+of+forceps+deliveries+in+the+United+States,+and+by+the+year+2000+approximately+66%25+of+operative+vaginal+deliveries+were+by+vacuu&hl=en&sa=X&ved=0ahUKEwjJ-e7RgqHYAhUCGt8KHRe2AL0Q6AEIKzAA#v=onepage&q=.%20Vacuum%20extraction%20was%20first%20described%20in%201705%20by%20Dr.%20James%20Yonge%2C%20an%20English%20surgeon%2C%20several%20decades%20before%20the%20invention%20of%20the%20obstetric%20forceps.%20However%2C%20it%20did%20not%20gain%20widespread%20use%20until%20the%201950s%2C%20when%20it%20was%20popularized%20in%20a%20series%20of%20studies%20by%20the%20Swedish%20obstetrician%20Dr.%20Tage%20Malmstr%C3%B6m.5%20By%20the%201970s%2C%20the%20vacuum%20extractor%20had%20almost%20completely%20replaced%20forceps%20for%20assisted%20vaginal%20deliveries%20in%20most%20northern%20European%20countries%2C%20but%20its%20popularity%20in%20many%20English-speaking%20countries%2C%20including%20the%20United%20States%20and%20the%20United%20Kingdom%2C%20was%20limited.%20By%201992%2C%20however%2C%20the%20number%20of%20vacuum%20assisted%20deliveries%20surpassed%20the%20number%20of%20forceps%20deliveries%20in%20the%20United%20States%2C%20and%20by%20the%20year%202000%20approximately%2066%25%20of%20operative%20vaginal%20del
iveries%20were%20by%20vacuu&f=false)  the paragraph was reproduced verbatim. 

While the chapter does cite the original paper, I suspect that this might be a pattern for this author.

Therefore I downloaded the text of the [9th edition of this book](https://www.sciencedirect.com/science/book/9781437701340#ancp4) (all that was legally available) and extracted the text of the same chapter.

In [61]:
# Load the 9th-edition chapter text and split it into sentences.
with open('chapter15_9ed.txt') as fh:
    text = fh.read()

source_tokenise = sent_tokenize(text)

In [11]:
# Google each chapter sentence and keep the first result returned.
top_hits = []
for sent in source_tokenise:
    for hit in search(sent, tld="com", num=5, stop=1, pause=3):
        top_hits.append(hit)

In [25]:
tally = Counter(top_hits)
tally.most_common(10)

[('https://www.sciencedirect.com/topics/nursing-and-health-professions/shoulder-dystocia',
  47),
 ('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672989/', 27),
 ('https://www.sciencedirect.com/topics/medicine-and-dentistry/cesarean-section',
  23),
 ('https://www.sciencedirect.com/topics/medicine-and-dentistry/forceps-in-childbirth',
  14),
 ('https://www.sciencedirect.com/topics/medicine-and-dentistry/maternal-fetal-medicine',
  12),
 ('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2633252/', 10),
 ('https://www.ncbi.nlm.nih.gov/pubmed/19646324', 9),
 ('https://www.scribd.com/document/345130206/Averys-Diseases-of-the-Newborn-9th-2012-pdf',
  8),
 ('http://medreviews.com/sites/default/files/2016-11/RIOG_21_5_0.pdf', 8),
 ('http://emedicine.medscape.com/article/1602970-overview', 8)]

Of the top 10 hits, the sciencedirect links are the text in question, as is the scribd link.

The medreviews link is the pdf of the original paper.
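
Because several of these hits are different pages on the same site, tallying by domain rather than by full URL may make the pattern easier to read. This is a minimal sketch, assuming the hits are plain URL strings like those in `top_hits`; `tally_by_domain` is a hypothetical helper and the sample list is illustrative only.

```python
from collections import Counter
from urllib.parse import urlparse


def tally_by_domain(urls):
    """Count hits per host name (hypothetical helper, not part of the original analysis)."""
    return Counter(urlparse(url).netloc for url in urls)


# Illustrative sample only; in the notebook this would be `top_hits`.
sample_hits = [
    'https://www.sciencedirect.com/topics/nursing-and-health-professions/shoulder-dystocia',
    'https://www.sciencedirect.com/topics/medicine-and-dentistry/cesarean-section',
    'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672989/',
]
domain_tally = tally_by_domain(sample_hits)
print(domain_tally.most_common())
```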

So let's just focus on the three PubMed/PMC hits that point to papers:

- [Ali & Norwitz, 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672989/) 

- [Kotaska et al., 2009](https://www.ncbi.nlm.nih.gov/pubmed/19646324)

- [Prapas et al., 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2633252/)

All three papers are from 2009, so I'm starting to think that whichever edition was released after 2009, and whoever wrote this chapter for it, is where the potential plagiarism came in.

Let's grab the text of these papers and find the most similar sentences between each paper and the chapter.

I'm just going to do this manually, copying and pasting into text files, out of laziness.


In [44]:
# Read each candidate paper, join its lines into one string and split it into sentences.
plagiarism = {}
for possible in ['ali.txt', 'kotaska.txt', 'prapas.txt']:
    with open(possible) as fh:
        text = " ".join(line.strip() for line in fh)
    plagiarism[possible] = sent_tokenize(text)

Let's try a quick and dirty edit distance 
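
For reference, the edit (Levenshtein) distance between two strings is the minimum number of single-character insertions, deletions, or substitutions needed to turn one into the other. The pure-Python sketch below is only meant to illustrate what is being measured; the `levenshtein` function here is my own illustrative helper, and the analysis itself uses the `editdistance` package.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


# "2" -> "two" costs one substitution plus two insertions.
print(levenshtein("There are 2 main types", "There are two main types"))
```

So an identical sentence scores 0, and a sentence that differs only in how a number is written scores just a few edits.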

In [124]:
distances = {'source_sent':[], 'plag_source':[], 'plag_sent': [], 'levenshtein':[]}

for sent in source_tokenise:
    for source in plagiarism:
        for plag_sent in plagiarism[source]:
            distances['source_sent'].append(sent)
            distances['plag_source'].append(source)
            distances['plag_sent'].append(plag_sent)
            distances['levenshtein'].append(editdistance.eval(sent, plag_sent))
            
df = pd.DataFrame(distances)
df = df.sort_values(by=['levenshtein'])
df[df['levenshtein'] == 0]

| | levenshtein | plag_sent | plag_source | source_sent |
|---|---|---|---|---|
| 43148 | 0 | The rate of neonatal trauma and respiratory di... | prapas.txt | The rate of neonatal trauma and respiratory di... |


So this actually works and reveals another potential case of plagiarism: a whole sentence (edit distance 0) is lifted (and cited) in the book chapter from the abstract of the Prapas paper:
    
    `The rate of neonatal trauma and respiratory distress syndrome did not differ significantly between the two groups.`
    
I need to remove a lot of crud from the df (such as very short sentence fragments) before I can identify anything else, remembering that differing citation styles will lead to edit distances greater than 0.

In [125]:
df = df[df['source_sent'].str.len() > 20]
df = df[df['plag_sent'].str.len() > 20]
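
A raw edit distance also grows with sentence length, so a length-normalised score might be easier to threshold. The sketch below swaps in a different measure, the standard library's `difflib.SequenceMatcher` ratio (0 to 1), rather than Levenshtein. It also strips bracketed numeric citation markers first. Both the citation pattern and the helper names `normalise` and `similarity` are assumptions for illustration, and the pattern would also remove bracketed years such as "(2000)".

```python
import re
from difflib import SequenceMatcher


def normalise(sentence):
    """Lowercase, drop bracketed numeric citations like [5] or (5, 6), collapse whitespace."""
    # Assumed citation style: digits in square or round brackets.
    sentence = re.sub(r'\s*[\[(]\d+(?:\s*[,-]\s*\d+)*[\])]', '', sentence)
    return re.sub(r'\s+', ' ', sentence).strip().lower()


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b), autojunk=False).ratio()


# Made-up sentences, differing only by a citation marker.
cited = "The two groups did not differ significantly [12]."
plain = "The two groups did not differ significantly."
print(similarity(cited, plain))
```

Applied to the `source_sent` and `plag_sent` columns, a score like this could replace the fixed character-count cut-off with a proportion.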

Having cleaned up, we've found a few other fairly obvious cases of questionable lifting from both Prapas and Ali, e.g. all the sentence pairs with an edit distance under 24:

In [127]:
df[df['levenshtein'] < 24]

| | levenshtein | plag_sent | plag_source | source_sent |
|---|---|---|---|---|
| 43148 | 0 | The rate of neonatal trauma and respiratory di... | prapas.txt | The rate of neonatal trauma and respiratory di... |
| 42726 | 5 | The rate of neonates with Apgar scores ≤ 4 at ... | prapas.txt | The rate of neonates with Apgar scores ≤6 at 1... |
| 41040 | 10 | The aim of the present study is to compare the... | prapas.txt | The aim of their study was to compare the shor... |
| 61819 | 16 | There are 2 main types of disposable cups... | ali.txt | There are two main types of disposable cups, w... |
| 55040 | 17 | Vacuum extraction was first described in 17... | ali.txt | Vacuum extraction was first described in 1705 ... |
| 41884 | 19 | Out of 7098 deliveries, 374 were instrument as... | prapas.txt | Of 7098 deliveries, 374 were instrument assist... |
| 60977 | 20 | The original vacuum device developed in the 19... | ali.txt | The original vacuum device developed in the 19... |
| 62240 | 21 | The soft cup is a pliable funnel- or bell-s... | ali.txt | The soft cup is a pliable, funnel- or bell-sha... |
| 42305 | 23 | Results: The incidence of 3rd degree laceratio... | prapas.txt | The incidence of third-degree lacerations and ... |
