<h1 align="center"> $\color{#800000}{\text{Text Analysis 2}}$ </h1> 
<img src="index.PNG">


### $\color{green}{\text{Task}}$ 


* Compare sample1_compare.pdf and sample2_compare.pdf. There are two paragraphs in each of these PDFs. 
* Find the differences between them: spelling differences, word differences, and/or paragraph differences.

### $\color{red}{\text{1. Import required libraries}}$ 


In [1]:
import textract
from nltk.tokenize import sent_tokenize, word_tokenize
from spellchecker import SpellChecker

### $\color{brown}{\text{2. Importance of required libraries}}$ 
 
  

#### Textract 
As undesirable as it might be, more often than not there is extremely useful information embedded in Word documents, PowerPoint presentations, PDFs, and similar files (so-called “dark data”) that would be valuable for further textual analysis and visualization. While several packages exist for extracting content from each of these formats on their own, textract provides a single interface for extracting content from any type of file, without any irrelevant markup.

#### NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

#### Spell Checker

pyspellchecker uses a Levenshtein distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Words that occur more often in the frequency list are more likely to be the correct result.

pyspellchecker supports multiple languages including English, Spanish, German, French, and Portuguese. Dictionaries were generated using the WordFrequency project on GitHub.
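The edit-distance idea described above can be sketched in plain Python. This is a simplified illustration, not pyspellchecker's actual implementation: the helper names (`edits1`, `candidates`, `correction`) and the counts in `TOY_FREQ` are assumptions made up for this example.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, word_freq):
    """Known words at distance 0, else 1, else 2; fall back to the word itself."""
    if word in word_freq:
        return {word}
    one_edit = edits1(word)
    known = {w for w in one_edit if w in word_freq}
    if known:
        return known
    known = {e2 for e1 in one_edit for e2 in edits1(e1) if e2 in word_freq}
    return known or {word}

def correction(word, word_freq):
    """Pick the candidate with the highest frequency."""
    return max(candidates(word, word_freq), key=lambda w: word_freq.get(w, 0))

# Toy frequency list: the counts are hypothetical, for illustration only
TOY_FREQ = {"treatment": 50, "when": 900, "what": 1200, "than": 800}

print(correction("treetment", TOY_FREQ))     # treatment
print(sorted(candidates("whan", TOY_FREQ)))  # ['than', 'what', 'when']
print(correction("whan", TOY_FREQ))          # what
```

With these toy counts, "whan" is corrected to "what" rather than "when" simply because "what" has the higher frequency, which shows that the most likely correction is not always the intended word.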

### $\color{#40826D}{\text{3. Read two sample documents for comparative text analysis}}$  


In [2]:
# The sample PDF files were converted to .docx files because the PDF format was not supported
# without installing additional libraries. The Errors.ipynb notebook shows the reason for the conversion.
sample_1 = textract.process('sample1_compare.docx')
sample_2 = textract.process('sample2_compare.docx')

In [3]:
# Let's check the type of the data.
print(type(sample_1),type(sample_2))

<class 'bytes'> <class 'bytes'>


In [5]:
sample_1

b'Epilepsy PREGABALINE Karomi is used to treat a particular type of epilepsy (partial epileptic seizures with or without secondary generalization) in adults. You must take PREGABALINE Karomi in combination with your current treetment. Your doctor will prescribe PREGABALINE Karomi to help treat your epilepsy whan your current treatment does not completely control your seizures. PREGABALINE Karomi should not be used alone, but should always be used in combination with another antiepileptic drug.\n\n\n\n\n\n\n\n\n\nPeripheral and central neuropathic pain PREGABALINE Karomi is used to treat persistent pain caused by nerve damage. Different pathologies such as diabetes or shingles can induce peripheral pain pain. Painful manifestations can be described as sensations of heat, burning, throbbing pain, slenderness, stabbing, shooting pain, cramping, soreness, tingling, numbness, sprains and strokes. needle. Peripheral and central neuropathic pain may also be associated with changes in mood, sl

#### Observation

* The data is of type `bytes`, so it must be decoded to a string before tokenization.

In [6]:
# The code below decodes the bytes into strings
sample_1 = sample_1.decode("utf-8")
sample_2 = sample_2.decode("utf-8")

In [7]:
# Check the data type after decoding
print(type(sample_1),type(sample_2))

<class 'str'> <class 'str'>


In [8]:
sample_1

'Epilepsy PREGABALINE Karomi is used to treat a particular type of epilepsy (partial epileptic seizures with or without secondary generalization) in adults. You must take PREGABALINE Karomi in combination with your current treetment. Your doctor will prescribe PREGABALINE Karomi to help treat your epilepsy whan your current treatment does not completely control your seizures. PREGABALINE Karomi should not be used alone, but should always be used in combination with another antiepileptic drug.\n\n\n\n\n\n\n\n\n\nPeripheral and central neuropathic pain PREGABALINE Karomi is used to treat persistent pain caused by nerve damage. Different pathologies such as diabetes or shingles can induce peripheral pain pain. Painful manifestations can be described as sensations of heat, burning, throbbing pain, slenderness, stabbing, shooting pain, cramping, soreness, tingling, numbness, sprains and strokes. needle. Peripheral and central neuropathic pain may also be associated with changes in mood, sle

#### Observation
The data is now in `str` format, so let's proceed to find misspelled words and sentence position differences.
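Because the task also mentions paragraph differences, the decoded string can be split into paragraphs before any tokenization. The following is a minimal sketch, assuming paragraphs are separated by runs of blank lines as in the output above; `split_paragraphs` and the short sample bytes are hypothetical and serve only as an illustration.

```python
import re

def split_paragraphs(text):
    """Split text on runs of blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Hypothetical stand-in for the bytes returned by textract.process(...)
raw = b"Epilepsy paragraph.\n\n\n\n\nNeuropathic pain paragraph."
text = raw.decode("utf-8")

paragraphs = split_paragraphs(text)
print(len(paragraphs))  # 2
print(paragraphs[0])    # Epilepsy paragraph.
```

Comparing the paragraph lists of the two documents by position would then show whether whole paragraphs were swapped.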

<h1 align="center"> $\color{#800080}{\text{Word difference}}$ </h1> 


In [9]:
# Return the words that occur exactly once across both documents combined
def word_diff(sample1, sample2):
    count = {}
    # count each whitespace-separated word of sample1 in the hash map
    for word in sample1.split():
        count[word] = count.get(word, 0) + 1
    # add the words of sample2 to the same counts
    for word in sample2.split():
        count[word] = count.get(word, 0) + 1
    # a word with a total count of 1 appears in only one of the documents
    return [word for word in count if count[word] == 1]

print(word_diff(sample_1, sample_2)) 



['treetment.', 'whan', 'fatigue', 'fatigue,', 'when', 'treatment.']


#### Observation 
* The words above differ between the two documents because of misspellings and punctuation.
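Punctuation-only differences such as 'fatigue' versus 'fatigue,' can be filtered out by normalizing the tokens before comparing them. The sketch below is one possible variant, not the method used in this notebook: `normalized_words` and `word_diff_normalized` are hypothetical helpers, and the two short strings are toy inputs rather than the real documents.

```python
import string
from collections import Counter

def normalized_words(text):
    """Lowercase the text and strip leading/trailing punctuation from each token."""
    words = (token.strip(string.punctuation).lower() for token in text.split())
    return [w for w in words if w]

def word_diff_normalized(text1, text2):
    """Return (words only in text1, words only in text2), respecting word counts."""
    c1 = Counter(normalized_words(text1))
    c2 = Counter(normalized_words(text2))
    # Counter subtraction keeps only positive counts
    only_in_1 = sorted((c1 - c2).elements())
    only_in_2 = sorted((c2 - c1).elements())
    return only_in_1, only_in_2

# Toy inputs for illustration
a = "You must take it with your current treetment. Sleep disturbance, fatigue and pain."
b = "You must take it with your current treatment. Sleep disturbance, fatigue, and pain."
print(word_diff_normalized(a, b))  # (['treetment'], ['treatment'])
```

Using `Counter` instead of a single shared dictionary also reports which document each word came from, and it catches words that occur several times in one document but never in the other.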

<h1 align="center"> $\color{green}{\text{Word Tokenization}}$ </h1> 


In [10]:
# The code below tokenizes the paragraphs into words
sample_1_token = word_tokenize(sample_1)
sample_2_token = word_tokenize(sample_2)

### $\color{#00AF6F}{\text{4. Spell check in sample1_compare document}}$  


In [11]:
spell = SpellChecker()

# Find the words that may be misspelled
misspelled = spell.unknown(sample_1_token)
# Print the set of possibly misspelled words
print('The misspelled words are', misspelled)
print()

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

The misspelled words are {'pregabaline', 'karomi', 'whan', 'antiepileptic', 'treetment'}

pregabaline
{'pregabaline'}
karami
{'karoui', 'karami'}
what
{'wean', 'than', 'khan', 'han', 'when', 'hwan', 'whin', 'wan', 'wuhan', 'wha', 'what', 'woan', 'wham', 'shan', 'bhan', 'chan', 'phan'}
antiepileptic
{'antiepileptic'}
treatment
{'treatment'}


#### Observation 
* The words 'pregabaline', 'antiepileptic', and 'karomi' are names or specialized medical terms that the dictionary does not contain, so they are also reported as misspelled.
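One simple way to keep such terms out of the report is to filter the flagged set against an allow-list. This is only a sketch: `DOMAIN_TERMS` and `drop_known_terms` are hypothetical names introduced here, and the allow-list would have to be maintained by hand.

```python
# Hypothetical allow-list of domain terms (drug, brand, and medical names)
DOMAIN_TERMS = {"pregabaline", "karomi", "antiepileptic"}

def drop_known_terms(misspelled, known_terms=DOMAIN_TERMS):
    """Return the flagged words that are not on the allow-list (case-insensitive)."""
    known = {term.lower() for term in known_terms}
    return {word for word in misspelled if word.lower() not in known}

# The set reported for sample 1 above
flagged = {'pregabaline', 'karomi', 'whan', 'antiepileptic', 'treetment'}
print(sorted(drop_known_terms(flagged)))  # ['treetment', 'whan']
```

Applied to the set reported for sample 1, only the two genuine typos remain.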

### $\color{#00006F}{\text{5. Spell check in sample2_compare document}}$  


In [12]:
# find those words that may be misspelled
misspelled = spell.unknown(sample_2_token)

print('The misspelled words are' , misspelled)
print()

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

The misspelled words are {'karomi', 'pregabaline', 'antiepileptic'}

karami
{'karoui', 'karami'}
pregabaline
{'pregabaline'}
antiepileptic
{'antiepileptic'}


<h1 align="center"> $\color{brown}{\text{Sentence Tokenization}}$ </h1> 


In [13]:
# The code below tokenizes the paragraphs into sentences
sample_1_sent_token = sent_tokenize(sample_1)
sample_2_sent_token = sent_tokenize(sample_2)

In [14]:
# Print the tokenized sentences for sample 1
sample_1_sent_token

['Epilepsy PREGABALINE Karomi is used to treat a particular type of epilepsy (partial epileptic seizures with or without secondary generalization) in adults.',
 'You must take PREGABALINE Karomi in combination with your current treetment.',
 'Your doctor will prescribe PREGABALINE Karomi to help treat your epilepsy whan your current treatment does not completely control your seizures.',
 'PREGABALINE Karomi should not be used alone, but should always be used in combination with another antiepileptic drug.',
 'Peripheral and central neuropathic pain PREGABALINE Karomi is used to treat persistent pain caused by nerve damage.',
 'Different pathologies such as diabetes or shingles can induce peripheral pain pain.',
 'Painful manifestations can be described as sensations of heat, burning, throbbing pain, slenderness, stabbing, shooting pain, cramping, soreness, tingling, numbness, sprains and strokes.',
 'needle.',
 'Peripheral and central neuropathic pain may also be associated with change

In [15]:
# Print the tokenized sentences for sample 2
sample_2_sent_token

['Peripheral and central neuropathic pain PREGABALINE Karomi is used to treat persistent pain caused by nerve damage.',
 'Different pathologies such as diabetes or shingles can induce peripheral pain pain.',
 'Painful manifestations can be described as sensations of heat, burning, throbbing pain, slenderness, stabbing, shooting pain, cramping, soreness, tingling, numbness, sprains and strokes.',
 'needle.',
 'Peripheral and central neuropathic pain may also be associated with changes in mood, sleep disturbance, fatigue, and may impact physical and social functioning, and overall quality of life.',
 'Epilepsy PREGABALINE Karomi is used to treat a particular type of epilepsy (partial epileptic seizures with or without secondary generalization) in adults.',
 'Your doctor will prescribe PREGABALINE Karomi to help treat your epilepsy when your current treatment does not completely control your seizures.',
 'You must take PREGABALINE Karomi in combination with your current treatment.',
 'PRE

In [16]:
# Check the type of the tokenized sentences
print(type(sample_2_sent_token), type(sample_1_sent_token))

<class 'list'> <class 'list'>


#### Observation
* The tokenized sentences are stored in lists.

In [17]:
# Check the number of sentences in the two documents
print(len(sample_2_sent_token), len(sample_1_sent_token))

9 9


#### Observation
* Both tokenized documents have the same length: 9 sentences.

In [18]:
# Let's find the identical sentences in the two documents and print their indices

for sent1 in sample_1_sent_token:
    for sent2 in sample_2_sent_token:
        if sent1 == sent2:
            print(sample_2_sent_token.index(sent2), sent2)
            print(sample_1_sent_token.index(sent1), sent1)
            print()
            print('-----------------------------------------------------------------------------')
            print()


5 Epilepsy PREGABALINE Karomi is used to treat a particular type of epilepsy (partial epileptic seizures with or without secondary generalization) in adults.
0 Epilepsy PREGABALINE Karomi is used to treat a particular type of epilepsy (partial epileptic seizures with or without secondary generalization) in adults.

-----------------------------------------------------------------------------

8 PREGABALINE Karomi should not be used alone, but should always be used in combination with another antiepileptic drug.
3 PREGABALINE Karomi should not be used alone, but should always be used in combination with another antiepileptic drug.

-----------------------------------------------------------------------------

0 Peripheral and central neuropathic pain PREGABALINE Karomi is used to treat persistent pain caused by nerve damage.
4 Peripheral and central neuropathic pain PREGABALINE Karomi is used to treat persistent pain caused by nerve damage.

---------------------------------------------

#### Observation 
* Six sentences are identical in both documents but appear at different positions, as shown in the output above.
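The exact-match loop finds sentences that moved, but it skips the sentences that differ only by a typo. A possible extension, sketched here with the standard-library `difflib` module, pairs each unmatched sentence with its closest counterpart. The helper names, the `cutoff` value of 0.8, and the toy sentence lists are assumptions for illustration.

```python
import difflib

def moved_sentences(sents1, sents2):
    """Return (index_in_1, index_in_2, sentence) for identical sentences at different positions."""
    position_in_2 = {}
    for j, sent in enumerate(sents2):
        position_in_2.setdefault(sent, j)  # keep the first occurrence
    return [(i, position_in_2[s], s) for i, s in enumerate(sents1)
            if s in position_in_2 and position_in_2[s] != i]

def near_matches(sents1, sents2, cutoff=0.8):
    """Pair each sentence of sents1 that has no exact match with its closest sentence in sents2."""
    exact = set(sents2)
    pairs = []
    for sent in sents1:
        if sent in exact:
            continue
        close = difflib.get_close_matches(sent, sents2, n=1, cutoff=cutoff)
        if close:
            pairs.append((sent, close[0]))
    return pairs

# Toy sentence lists for illustration
doc1 = ["Epilepsy is treated.", "Take it with your current treetment.", "Pain is treated."]
doc2 = ["Pain is treated.", "Epilepsy is treated.", "Take it with your current treatment."]

print(moved_sentences(doc1, doc2))
print(near_matches(doc1, doc2))
```

The dictionary lookup avoids the nested loop and the repeated `list.index` calls, and the near-match pairs point directly at the sentences that contain spelling differences.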