## TEXT PRE-PROCESSING USING SPACY

### TASK 1

In [21]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")  # small English pipeline

In [5]:
spacy_stopwords = STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
print('First 15 stop words: %s' % list(spacy_stopwords)[:15])

Number of stop words: 326
First 15 stop words: ['but', "'s", '‘ll', 'nor', 'often', "n't", 'through', 'ourselves', 'upon', 'everyone', 'us', 'themselves', 'say', 'whither', 'whereafter']
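
`STOP_WORDS` is a Python set, so `list(spacy_stopwords)[:15]` returns an arbitrary slice that can differ between runs. A minimal sketch of a reproducible preview follows. It uses `sample_stopwords`, a hypothetical hand-copied subset of the output above, not the full spaCy list:

```python
# Hypothetical sample copied from the output above; the real set has 326 entries.
sample_stopwords = {'but', "'s", 'nor', 'often', "n't", 'through', 'ourselves', 'upon'}

# Sets are unordered, so sort first to get a stable, repeatable preview.
preview = sorted(sample_stopwords)[:3]
print('First 3 stop words (sorted):', preview)
```

With the real list, the same idea would be `sorted(spacy_stopwords)[:15]`.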


### TASK 2

In [11]:
tokens=['cries','this','lied','computing','organizing','matches']

In [15]:
#lemmatization
doc = nlp(" ".join(tokens))  # same six words as the tokens list
for word in doc:                                                                            
    print(word.text,'-->',word.lemma_) 

cries --> cry
this --> this
lied --> lie
computing --> computing
organizing --> organizing
matches --> match


In [13]:
#stemming
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

cries --> cri
this --> this
lied --> lie
computing --> comput
organizing --> organ
matches --> match


In [16]:
#comparing stemming and lemmatization
for token in tokens:                                                                        
    print('After stemming:',token + ' --> ' + stemmer.stem(token))                                  
print('\n')                                                                                  
for word in doc:                                                                             
    print('After lemmatization:',word.text,'-->',word.lemma_) 

After stemming: cries --> cri
After stemming: this --> this
After stemming: lied --> lie
After stemming: computing --> comput
After stemming: organizing --> organ
After stemming: matches --> match


After lemmatization: cries --> cry
After lemmatization: this --> this
After lemmatization: lied --> lie
After lemmatization: computing --> computing
After lemmatization: organizing --> organizing
After lemmatization: matches --> match


In [18]:
print('Lemmatization returns dictionary forms (cries --> cry), while stemming can leave truncated non-words (cri, comput, organ).')

Lemmatization returns dictionary forms (cries --> cry), while stemming can leave truncated non-words (cri, comput, organ).
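
To make the comparison concrete, the sketch below lines up the two results side by side and flags the tokens where they disagree. The `stems` and `lemmas` lists are copied by hand from the outputs above rather than recomputed, so this block needs neither spaCy nor NLTK:

```python
# Values copied from the stemming and lemmatization outputs above.
tokens = ['cries', 'this', 'lied', 'computing', 'organizing', 'matches']
stems = ['cri', 'this', 'lie', 'comput', 'organ', 'match']
lemmas = ['cry', 'this', 'lie', 'computing', 'organizing', 'match']

# Print an aligned table and mark rows where stem and lemma differ.
print(f"{'token':<12}{'stem':<10}{'lemma':<12}")
for token, stem, lemma in zip(tokens, stems, lemmas):
    marker = '' if stem == lemma else '  <-- differ'
    print(f"{token:<12}{stem:<10}{lemma:<12}{marker}")

# Tokens for which the two techniques give different results.
disagreements = [t for t, s, l in zip(tokens, stems, lemmas) if s != l]
print(disagreements)
```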


### TASK 3

In [20]:
#A) scifiscripts_intro.txt
with open('scifiscripts_intro.txt') as f:  # close the file once it is read
    data = f.read()
data1=nlp(data)
print(data1)

Note from poster to Kubrick newsgroup:

I found this on a bbs a while ago and I thought I'd pass it along to all 
of you Kubrick freaks out there.

02/23/89
Transcriber's note:

For all you Clarke/Kubrick/2001 fans,

I found the original paper copy of this screenplay a while back and felt 
compelled to transcribe it to disk and upload it to various bulletin 
boards for the enjoyment of all.

The final movie deviates from this screenplay in a number of interesting 
ways. I've tried to maintain the format of the original document except 
the number of lines per page of the original. In order to reduce the 
length of this file I've used a bar of "------" to delimit the pages as 
there was a lot of whitespace per original screenplay page.


In [23]:
#removal of stop words for (A)
filtered_data = [word for word in data1 if not word.is_stop]
print("Sentence after removal of stop words:",filtered_data)

Sentence after removal of stop words: [Note, poster, Kubrick, newsgroup, :, 

, found, bbs, ago, thought, pass, 
, Kubrick, freaks, ., 

, 02/23/89, 
, Transcriber, note, :, 

, Clarke, /, Kubrick/2001, fans, ,, 

, found, original, paper, copy, screenplay, felt, 
, compelled, transcribe, disk, upload, bulletin, 
, boards, enjoyment, ., 

, final, movie, deviates, screenplay, number, interesting, 
, ways, ., tried, maintain, format, original, document, 
, number, lines, page, original, ., order, reduce, 
, length, file, bar, ", ------, ", delimit, pages, 
, lot, whitespace, original, screenplay, page, .]
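
The filtered list above still contains punctuation and newline tokens, because `is_stop` only checks stop-word membership. A minimal sketch that drops those as well, assuming spaCy is installed: it uses `spacy.blank("en")` (tokenizer only, so no model download is needed), and `sample_text` is a shortened stand-in for the file contents, not the real file.

```python
import spacy

# A blank English pipeline has only the tokenizer, which is enough for
# the lexical flags is_stop, is_punct and is_space.
nlp_blank = spacy.blank("en")

# Shortened stand-in for the contents of scifiscripts_intro.txt.
sample_text = "I found this on a bbs a while ago.\nI thought I'd pass it along."

# Keep only tokens that are not stop words, punctuation or whitespace.
content_words = [
    tok.text
    for tok in nlp_blank(sample_text)
    if not (tok.is_stop or tok.is_punct or tok.is_space)
]
print(content_words)
```

The same condition can be applied to `data1` in place of `nlp_blank(sample_text)`.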


In [24]:
#B) Five sentences of my own
d="This is the first sentence.And followed by first, is the second sentence.With the help of second, we get third sentence.After succeeded by third sentence, we get fourth sentence.Finally, we get fifth sentence."
d1=nlp(d)
print(d1)

This is the first sentence.And followed by first, is the second sentence.With the help of second, we get third sentence.After succeeded by third sentence, we get fourth sentence.Finally, we get fifth sentence.


In [25]:
#removal of stop words for (B)
filtered_data1 = [word for word in d1 if not word.is_stop]
print("Sentence after removal of stop words:",filtered_data1)

Sentence after removal of stop words: [sentence, ., followed, ,, second, sentence, ., help, second, ,, sentence, ., succeeded, sentence, ,, fourth, sentence, ., Finally, ,, fifth, sentence, .]


### TASK 4

In [32]:
#coarse POS (pos_) and fine-grained tag (tag_) for each token
Text=nlp('Bryan visited his friend for a while and then went home at 10 pm.')
for word in Text:                                       
    print(word.text,'\t',word.pos_,'\t',word.tag_) 


Bryan 	 PROPN 	 NNP
visited 	 VERB 	 VBD
his 	 PRON 	 PRP$
friend 	 NOUN 	 NN
for 	 ADP 	 IN
a 	 DET 	 DT
while 	 NOUN 	 NN
and 	 CCONJ 	 CC
then 	 ADV 	 RB
went 	 VERB 	 VBD
home 	 ADV 	 RB
at 	 ADP 	 IN
10 	 NUM 	 CD
pm 	 NOUN 	 NN
. 	 PUNCT 	 .
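
The table above shows the tag codes but not what they mean. spaCy provides `spacy.explain`, which looks up a label in its built-in glossary. A short sketch, assuming spaCy is installed (no model is needed for the lookup); the `labels` list is a hand-picked subset of the tags printed above:

```python
import spacy

# Tags taken from the output above; spacy.explain looks each one up in
# spaCy's built-in glossary and returns None for labels it does not know.
labels = ["PROPN", "NNP", "VERB", "VBD", "ADP", "IN", "CD"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```

Inside the loop over `Text`, `spacy.explain(word.tag_)` could be printed as a fourth column.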


### TASK 5

In [33]:
#find proper nouns and numbers from the file
with open('Random.txt') as f:  # close the file once it is read
    file = f.read()
File=nlp(file)
print(File)

PADUA HIGH SCHOOL - DAY
Revision November 12, 1997



In [41]:
#collect proper nouns (PROPN) and numbers (NUM) while printing each tag
noun=[]
num=[]
for word in File:
    print(word.text,'\t',word.pos_,'\t',word.tag_)
    if word.pos_=='PROPN':
        noun.append(word.text)
    elif word.pos_=='NUM':
        num.append(word.text)

PADUA 	 PROPN 	 NNP
HIGH 	 PROPN 	 NNP
SCHOOL 	 PROPN 	 NNP
- 	 PUNCT 	 HYPH
DAY 	 PROPN 	 NNP

 	 SPACE 	 _SP
Revision 	 PROPN 	 NNP
November 	 PROPN 	 NNP
12 	 NUM 	 CD
, 	 PUNCT 	 ,
1997 	 NUM 	 CD

 	 SPACE 	 _SP


In [43]:
print('The proper nouns are:',noun)
print('The numbers are:',num)

The proper nouns are: ['PADUA', 'HIGH', 'SCHOOL', 'DAY', 'Revision', 'November']
The numbers are: ['12', '1997']
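
The list above treats every proper-noun token separately, so a multi-word name such as PADUA HIGH SCHOOL comes out as three entries. One possible way to merge consecutive `PROPN` tokens into phrases is sketched below. The `tagged` pairs are copied by hand from the tagging output above, so the block runs without spaCy:

```python
from itertools import groupby

# (text, pos) pairs copied from the tagging output above.
tagged = [
    ('PADUA', 'PROPN'), ('HIGH', 'PROPN'), ('SCHOOL', 'PROPN'), ('-', 'PUNCT'),
    ('DAY', 'PROPN'), ('\n', 'SPACE'), ('Revision', 'PROPN'), ('November', 'PROPN'),
    ('12', 'NUM'), (',', 'PUNCT'), ('1997', 'NUM'), ('\n', 'SPACE'),
]

# Join each run of consecutive PROPN tokens into one phrase.
proper_noun_phrases = [
    ' '.join(text for text, _ in run)
    for is_propn, run in groupby(tagged, key=lambda pair: pair[1] == 'PROPN')
    if is_propn
]
print(proper_noun_phrases)
```

With a real `Doc`, the pairs would come from `[(word.text, word.pos_) for word in File]`.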


### TASK 6

In [45]:
#adding 5 stop words to the default list
nlp.Defaults.stop_words |= {'Word1','Word2','Word3','Word4','Word5'}         
print(nlp.Defaults.stop_words)   

{'but', "'s", '‘ll', 'nor', 'often', "n't", 'through', 'ourselves', 'upon', 'everyone', 'us', 'themselves', 'say', 'whither', 'whereafter', 'former', 'n‘t', 'nothing', 'have', 'give', 'whoever', 'has', 'under', 'except', "'d", 'regarding', 'whose', 'hereby', 'full', 'least', 'take', 'beyond', 'thru', 'made', 'at', 'everything', 'became', 'whereupon', 'move', 'did', 'whatever', 'else', 'own', 'any', 'used', '‘re', 'although', 'formerly', 'wherein', 'next', 'name', 'down', 'five', 'out', 'using', 'would', 'who', "'re", 'elsewhere', 'before', 'last', 'that', 'top', 'twelve', 'until', 'seems', 'eight', 'alone', 'together', 'not', 'therefore', 'now', 'since', 'otherwise', 'less', 'put', 'all', 'seemed', 'n’t', 'your', 'most', 'they', 'amongst', 'few', 'Word5', 'perhaps', 'off', 'she', 'whereas', 'noone', 'hers', 'anyone', 'such', 'yourself', 'really', 'get', 'mine', 'part', 'many', "'m", 'well', 'our', 're', 'side', 'several', 'six', 'Word1', 'had', 'yourselves', 'ours', '‘m', 'being', 'you

In [46]:
#removing always,never,between,becomes stop words
nlp.Defaults.stop_words -= {'always','never','between','becomes'}          
print(nlp.Defaults.stop_words) 

{'but', "'s", '‘ll', 'nor', 'often', "n't", 'through', 'ourselves', 'upon', 'everyone', 'us', 'themselves', 'say', 'whither', 'whereafter', 'former', 'n‘t', 'nothing', 'have', 'give', 'whoever', 'has', 'under', 'except', "'d", 'regarding', 'whose', 'hereby', 'full', 'least', 'take', 'beyond', 'thru', 'made', 'at', 'everything', 'became', 'whereupon', 'move', 'did', 'whatever', 'else', 'own', 'any', 'used', '‘re', 'although', 'formerly', 'wherein', 'next', 'name', 'down', 'five', 'out', 'using', 'would', 'who', "'re", 'elsewhere', 'before', 'last', 'that', 'top', 'twelve', 'until', 'seems', 'eight', 'alone', 'together', 'not', 'therefore', 'now', 'since', 'otherwise', 'less', 'put', 'all', 'seemed', 'n’t', 'your', 'most', 'they', 'amongst', 'few', 'Word5', 'perhaps', 'off', 'she', 'whereas', 'noone', 'hers', 'anyone', 'such', 'yourself', 'really', 'get', 'mine', 'part', 'many', "'m", 'well', 'our', 're', 'side', 'several', 'six', 'Word1', 'had', 'yourselves', 'ours', '‘m', 'being', 'you
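
Editing `nlp.Defaults.stop_words` changes the set itself, but whether `token.is_stop` reflects the change can depend on the spaCy version and on whether the word is already cached in the vocabulary. A commonly used alternative is to set the flag on the vocabulary entry directly, sketched below under the assumption that spaCy is installed. The words `kubrick` and `screenplay` are illustrative choices, not part of the original task. Each spelling is a separate vocabulary entry, so other casings would need their own flag.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; no model download needed

# Flag illustrative words as stop words on their vocabulary entries.
for word in ("kubrick", "screenplay"):
    nlp_blank.vocab[word].is_stop = True

# Clear the flag for words that should count as content again.
for word in ("always", "never", "between", "becomes"):
    nlp_blank.vocab[word].is_stop = False

doc = nlp_blank("kubrick never finished the screenplay")
flags = {tok.text: tok.is_stop for tok in doc}
print(flags)
```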

### TASK 7

In [47]:
#opening the file and reading the data
with open('Raw_data_for_analysis.txt') as f:  # close the file once it is read
    raw = f.read()
Raw=nlp(raw)
print(Raw)

PADUA HIGH SCHOOL - DAY
Revision November 12, 1997
I hope dinner's ready because I only have ten minutes before Mrs. Johnson squirts out a screamer.
He grabs the mail and rifles through it, as he bends down to kiss Sharon on the cheek.
MICHAEL- C'mon. I'm supposed to give you the tour. They head out of the office
MICHAEL (continuing)- So -- which Dakota you from?
          
                                 



In [64]:
#TOKENIZATION
print ([token.text for token in Raw]) 
print('\nNo.of tokens=',len(Raw))  # count the tokens in the Doc

['PADUA', 'HIGH', 'SCHOOL', '-', 'DAY', '\n', 'Revision', 'November', '12', ',', '1997', '\n', 'I', 'hope', 'dinner', "'s", 'ready', 'because', 'I', 'only', 'have', 'ten', 'minutes', 'before', 'Mrs.', 'Johnson', 'squirts', 'out', 'a', 'screamer', '.', '\n', 'He', 'grabs', 'the', 'mail', 'and', 'rifles', 'through', 'it', ',', 'as', 'he', 'bends', 'down', 'to', 'kiss', 'Sharon', 'on', 'the', 'cheek', '.', '\n', 'MICHAEL-', "C'm", 'on', '.', 'I', "'m", 'supposed', 'to', 'give', 'you', 'the', 'tour', '.', 'They', 'head', 'out', 'of', 'the', 'office', '\n', 'MICHAEL', '(', 'continuing)-', 'So', '--', 'which', 'Dakota', 'you', 'from', '?', '\n          \n                                 \n']

No.of tokens= 84


In [65]:
#Remove stop words from file
filtered_data2 = [word for word in Raw if not word.is_stop]
print("Sentence after removal of stop words:\n",filtered_data2)

Sentence after removal of stop words:
 [PADUA, HIGH, SCHOOL, -, DAY, 
, Revision, November, 12, ,, 1997, 
, hope, dinner, ready, minutes, Mrs., Johnson, squirts, screamer, ., 
, grabs, mail, rifles, ,, bends, kiss, Sharon, cheek, ., 
, MICHAEL-, C'm, ., supposed, tour, ., head, office, 
, MICHAEL, (, continuing)-, --, Dakota, ?, 
          
                                 
]


In [62]:
#lemmatization after removal of stop words      
for word in filtered_data2:                                                      
    print(word.text,'-> ',word.lemma_)  

PADUA ->  PADUA
HIGH ->  HIGH
SCHOOL ->  SCHOOL
- ->  -
DAY ->  DAY

 ->  

Revision ->  Revision
November ->  November
12 ->  12
, ->  ,
1997 ->  1997

 ->  

hope ->  hope
dinner ->  dinner
ready ->  ready
minutes ->  minute
Mrs. ->  Mrs.
Johnson ->  Johnson
squirts ->  squirt
screamer ->  screamer
. ->  .

 ->  

grabs ->  grab
mail ->  mail
rifles ->  rifle
, ->  ,
bends ->  bend
kiss ->  kiss
Sharon ->  Sharon
cheek ->  cheek
. ->  .

 ->  

MICHAEL- ->  MICHAEL-
C'm ->  come
. ->  .
supposed ->  suppose
tour ->  tour
. ->  .
head ->  head
office ->  office

 ->  

MICHAEL ->  MICHAEL
( ->  (
continuing)- ->  continuing)-
-- ->  --
Dakota ->  Dakota
? ->  ?

          
                                 
 ->  
          
                                 



In [66]:
#POS tagging after removal of stop words.
for word in filtered_data2:
    print(word.text,'\t',word.pos_,'\t',word.tag_)

PADUA 	 PROPN 	 NNP
HIGH 	 PROPN 	 NNP
SCHOOL 	 PROPN 	 NNP
- 	 PUNCT 	 HYPH
DAY 	 PROPN 	 NNP

 	 SPACE 	 _SP
Revision 	 PROPN 	 NNP
November 	 PROPN 	 NNP
12 	 NUM 	 CD
, 	 PUNCT 	 ,
1997 	 NUM 	 CD

 	 SPACE 	 _SP
hope 	 VERB 	 VBP
dinner 	 NOUN 	 NN
ready 	 ADJ 	 JJ
minutes 	 NOUN 	 NNS
Mrs. 	 PROPN 	 NNP
Johnson 	 PROPN 	 NNP
squirts 	 VERB 	 VBZ
screamer 	 NOUN 	 NN
. 	 PUNCT 	 .

 	 SPACE 	 _SP
grabs 	 VERB 	 VBZ
mail 	 NOUN 	 NN
rifles 	 NOUN 	 NNS
, 	 PUNCT 	 ,
bends 	 VERB 	 VBZ
kiss 	 VERB 	 VB
Sharon 	 PROPN 	 NNP
cheek 	 NOUN 	 NN
. 	 PUNCT 	 .

 	 SPACE 	 _SP
MICHAEL- 	 PROPN 	 NNP
C'm 	 VERB 	 VBZ
. 	 PUNCT 	 .
supposed 	 VERB 	 VBN
tour 	 NOUN 	 NN
. 	 PUNCT 	 .
head 	 VERB 	 VBP
office 	 NOUN 	 NN

 	 SPACE 	 _SP
MICHAEL 	 PROPN 	 NNP
( 	 PUNCT 	 -LRB-
continuing)- 	 NOUN 	 NNS
-- 	 PUNCT 	 :
Dakota 	 PROPN 	 NNP
? 	 PUNCT 	 .

          
                                 
 	 SPACE 	 _SP
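
A quick way to summarise a tagging run like the one above is to count how often each coarse POS tag occurs. The sketch below uses `collections.Counter` on `pos_tags`, a hand-copied list of the tags from the first twelve rows of the output, so it runs without spaCy:

```python
from collections import Counter

# Coarse POS tags of the first twelve rows of the output above.
pos_tags = ['PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'SPACE',
            'PROPN', 'PROPN', 'NUM', 'PUNCT', 'NUM', 'SPACE']

# Tally the tags and list them from most to least frequent.
pos_counts = Counter(pos_tags)
for pos, count in pos_counts.most_common():
    print(f"{pos:<6} {count}")
```

On the real data the tally would be `Counter(word.pos_ for word in filtered_data2)`.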
