
# NLP Basics - Text PreProcessing

### Bag of Words Pipeline

*   Get the Data/Corpus
*   Tokenisation & Stopwords Removal
*   Stemming
*   Building a Vocab
*   Vectorisation
*   Classification
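
The steps above can be chained into one small end-to-end sketch. The four training sentences and their `pos`/`neg` labels are made up for illustration, `MultinomialNB` is only one reasonable choice of classifier (the notebook does not name one), and stemming is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels, invented for this sketch
train_texts = [
    "an excellent movie",
    "an amazing movie",
    "not a good movie",
    "a boring bad movie",
]
train_labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer tokenises, builds the vocab and vectorises;
# MultinomialNB then classifies the resulting count vectors
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Words that were never seen in training ("acting", "plot") are simply ignored
predictions = model.predict(["amazing excellent acting", "boring bad plot"]).tolist()
print(predictions)  # ['pos', 'neg']
```

With so little data the prediction rests entirely on which class the known words appeared in, which is exactly what a bag-of-words model captures.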




In [52]:
import nltk #natural language toolkit
import sklearn

In [3]:
from nltk.corpus import brown

In [4]:
nltk.download("brown")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [5]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [6]:
data = brown.sents(categories = 'science_fiction')

In [7]:
data[6]

['he',
 'studied',
 'them',
 ',',
 'compared',
 'them',
 'with',
 'what',
 'he',
 'had',
 'been',
 'taught',
 'as',
 'a',
 'nestling',
 ',',
 'struggling',
 'to',
 'bridge',
 'between',
 'languages',
 ',',
 'the',
 'one',
 'he',
 'thought',
 'with',
 'and',
 'the',
 'one',
 'he',
 'was',
 'learning',
 'to',
 'think',
 'in',
 '.']

In [8]:
' '.join(data[6])

'he studied them , compared them with what he had been taught as a nestling , struggling to bridge between languages , the one he thought with and the one he was learning to think in .'

Tokenization - Splitting a document into sentences, or a sentence into words

In [144]:
document = """I introduce myself as Anushka Bhave. I am a computer engineering student. I love to code."""
sentence = """I am passionate about research in the fields of Deep Learning and NLP"""

In [10]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [11]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
sentences = sent_tokenize(document)
print(sentences)

['I introduce myself as Anushka Bhave.', 'I am a computer engineering student.', 'I love to code.']


In [13]:
sentences[0]

'I introduce myself as Anushka Bhave.'

In [14]:
words = word_tokenize(sentence)
print(words)

['I', 'am', 'passionate', 'about', 'research', 'in', 'the', 'fields', 'of', 'Deep', 'Learning', 'and', 'NLP']


Stopwords Removal - Stopwords are very common words (such as "a", "the", "is") that carry little weight in the analysis of a sentence, so they are dropped before further processing

In [15]:
from nltk.corpus import stopwords

In [16]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [17]:
sw = set(stopwords.words('english'))

In [18]:
print(sw)

{'than', 'after', 'what', 'about', 'be', 'ourselves', 'which', 'to', 'against', 'this', 'from', 'are', 'more', 'for', "don't", 'any', 'above', 'of', 'did', 'further', 'then', 'our', 'him', 'in', "hasn't", 'll', "won't", "weren't", 'out', 'hers', 'when', 'between', 're', 'that', 'haven', 'these', 'he', 'am', 'where', 've', 'm', 'those', 'your', 'who', "that'll", 'has', 'can', 'at', 'there', 'most', "aren't", 'you', 'by', 'whom', 'up', 'yours', 'yourself', 'aren', 'so', 'why', 'now', 'we', 'should', 'them', 's', 'but', 'shan', 'on', 'because', 'themselves', 'wouldn', "you've", 'both', 'own', 'not', 'just', 'my', 'mustn', "shouldn't", 'himself', 'before', 'couldn', 'been', 'd', 'hadn', 'mightn', 'until', 'off', 'won', 'while', 'ours', 'an', "it's", "couldn't", 'isn', 'i', 'o', 'was', 'having', 'few', 'no', 'a', 'each', 'myself', 'down', 'nor', 'only', "wasn't", 'such', "haven't", 'or', 'same', "hadn't", 'didn', 'again', 'through', "you're", 'shouldn', 'other', 'weren', 'were', "you'll", "

In [19]:
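def RemoveStopwords(text, stopword_set):
  # Keep only the tokens that are not in the stopword set.
  # The lookup is case-sensitive, so the uppercase 'I' is kept below.
  useful_words = [w for w in text if w not in stopword_set]
  return useful_words

sent = "I don't respect him a lot".split()
RemoveStopwords(sent,sw)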

['I', 'respect', 'lot']
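
In the output above, `'I'` survives because the NLTK stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. The name `remove_stopwords_ci` is hypothetical, and `sw_subset` is a hand-picked subset of the printed stopword set so that the example runs without downloading anything.

```python
def remove_stopwords_ci(tokens, stopword_set):
    """Drop stopwords, comparing each token in lowercase."""
    return [t for t in tokens if t.lower() not in stopword_set]

# Small subset of the NLTK English stopwords, copied by hand for this sketch
sw_subset = {"i", "don't", "him", "a"}

tokens = "I don't respect him a lot".split()
filtered = remove_stopwords_ci(tokens, sw_subset)
print(filtered)  # ['respect', 'lot']
```

Note that removing `"don't"` flips the meaning of the sentence, which is one reason stopword removal is not always appropriate for sentiment tasks.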

In [20]:
from nltk.tokenize import RegexpTokenizer

In [21]:
text = "My email id is bhaveanushka19@gmail.com and contact number is 9657088983"
tokenizer = RegexpTokenizer('[a-zA-Z@.]+') # regular-expression tokenisation: keeps letters, '@' and '.', so the digits are removed
useful_text = tokenizer.tokenize(text)
print(useful_text)

['My', 'email', 'id', 'is', 'bhaveanushka', '@gmail.com', 'and', 'contact', 'number', 'is']
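
The pattern above treats every digit as a separator, which is why the address is split in two. The sketch below shows the same behaviour with the standard `re` module (for a simple pattern like this one, `re.findall` should give the same tokens as `RegexpTokenizer`), and then an alternative pattern that keeps an email-shaped token whole. The sample text and the second pattern are assumptions made for illustration, not part of the original notebook.

```python
import re

# Hypothetical sample text with a placeholder address and number
text = "My email id is user19@example.com and contact number is 1234567890"

# Same character class as above: digits are not in the class, so they split tokens
letters_only = re.findall(r"[a-zA-Z@.]+", text)
print(letters_only)

# Try an email-shaped token first, then fall back to plain alphabetic words
email_aware = re.findall(r"[a-zA-Z0-9._]+@[a-zA-Z0-9.]+|[a-zA-Z]+", text)
print(email_aware)
```

The second pattern is deliberately loose and is not a full email validator; it only illustrates how alternation order decides which token shape wins.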


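Stemming/Lemmatization - Reducing words to their root form, so that jumps, jumping and jumped all become jump. A stemmer strips suffixes using rules and may produce non-words, while a lemmatizer looks the word up in a dictionary (WordNet here) and returns a valid word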
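
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. `naive_stem` is a hypothetical helper written for this example; it is not the Porter algorithm, which applies many more rules and conditions.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumps", "jumping", "jumped", "jump"]]
print(stems)                   # ['jump', 'jump', 'jump', 'jump']
print(naive_stem("swimming"))  # 'swimm' (the doubled consonant is not handled)
```

The `swimming` case shows why real stemmers need extra rules, and why stems are not guaranteed to be dictionary words.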

In [22]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [23]:
from nltk.stem import WordNetLemmatizer

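wn = WordNetLemmatizer()
wn.lemmatize('swimming') # treated as a noun by default (pos='n'), so it comes back unchanged; pass pos='v' to lemmatize it as a verb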

'swimming'

In [24]:
from nltk.stem.snowball import SnowballStemmer, PorterStemmer
ps = PorterStemmer()
ps.stem('jumping')

'jump'

In [25]:
ps.stem('jumps')

'jump'

In [26]:
ps.stem('loving')

'love'

In [27]:
ps.stem('lovely')

'love'

Building a Vocab & Vectorisation - The vocab is the list of unique words present in the corpus. Vectorisation converts each sentence into numeric form: every unique word gets an index, and the vector stores how many times that word occurs in the sentence
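
The two ideas can be sketched by hand with the standard library before handing the job to `CountVectorizer`. The two-sentence corpus and the helper name `vectorise` are hypothetical, and tokens are split on whitespace only to keep the example short.

```python
from collections import Counter

# Hypothetical two-sentence corpus, tokenised by whitespace for simplicity
toy_corpus = ["the cat sat", "the cat saw the dog"]
tokenised = [doc.split() for doc in toy_corpus]

# Vocab: every unique word, sorted, mapped to a column index
unique_words = sorted({word for doc in tokenised for word in doc})
vocab = {word: idx for idx, word in enumerate(unique_words)}

def vectorise(tokens, vocab):
    """Return a count vector with one slot per vocab word."""
    vector = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:  # words outside the vocab are ignored
            vector[vocab[word]] = count
    return vector

vectors = [vectorise(doc, vocab) for doc in tokenised]
print(vocab)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is lost in this representation: only the counts per vocab index remain, which is where the name "bag of words" comes from.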

In [28]:
corpus = [ 
          'The first Test between India and Sri Lanka which ended in three days in Mohali was a special occasion for former India captain Virat Kohli. He became the 12th Indian cricketer to turn out in 100 Tests for India and also surpassed 8,000 runs in Test cricket on Day 1 of the Test',
          'On one side, a feminist movement that led the Me Too movement in Asia; on the other, young men whose resistance to the modest gains made by South Korean women has been exploited by the two main candidates.',
          'Food prices have also been pushed up by the war, and are a very real consideration and problem for people in poor countries.',
          '‘Gangubai Kathiawadi’ chronicles Ganga rises to power and fame from a demure small-town girl in Gujarat, to the undisputed queen of kamathipura in Mumbai.'
          ]

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [31]:
cv = CountVectorizer()

In [132]:
vectorised_corpus = cv.fit_transform(corpus)

In [53]:
save = vectorised_corpus.toarray()

In [130]:
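count = cv.vocabulary_ # fitted word -> column index mapping; cv.vocabulary (no underscore) is the constructor argument and is None here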


In [64]:
save_first = save[0] # count vector of the first document
save_first

array([1, 1, 1, 1, 2, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 3, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 3, 1, 0, 3, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0])

In [95]:
len(save[0])

98

In [66]:
cv.inverse_transform?

In [67]:
save_first = save_first.reshape(1,-1)

In [69]:
sentence = cv.inverse_transform(save_first)
print(sentence)

[array(['000', '100', '12th', 'also', 'and', 'became', 'between',
       'captain', 'cricket', 'cricketer', 'day', 'days', 'ended', 'first',
       'for', 'former', 'he', 'in', 'india', 'indian', 'kohli', 'lanka',
       'mohali', 'occasion', 'of', 'on', 'out', 'runs', 'special', 'sri',
       'surpassed', 'test', 'tests', 'the', 'three', 'to', 'turn',
       'virat', 'was', 'which'], dtype='<U13')]


Vectorisation with Stopword Removal

In [82]:
def VectorisationwithStopword(sometext):
  words = tokenizer.tokenize(sometext.lower())
  words = RemoveStopwords(words,sw)
  return words

In [83]:
VectorisationwithStopword("Hi, my name is Anushka Bhave and I am a computer engineering student at VIT Pune")

['hi',
 'name',
 'anushka',
 'bhave',
 'computer',
 'engineering',
 'student',
 'vit',
 'pune']

In [109]:
cv = CountVectorizer(tokenizer=VectorisationwithStopword) # use the custom tokeniser so stopwords never enter the vocab

In [87]:
new_corp = cv.fit_transform(corpus) # refit with the new vectoriser

In [92]:
save1 = new_corp.toarray()
saveok = save1[0]

In [93]:
len(saveok) # stopword removal shrinks the vocab from 98 to 72 unique words

72

Bigrams, Trigrams and N-grams

In [145]:
sentence1 = "Gangubai Kathiawadi is an excellent movie."
sentence2 = "Gangubai Kathiawadi is not a good movie."
sentence3 = "Gangubai Kathiawadi is an amazing movie. Beautifully acted, emotion and drama are at the core."

In [155]:
docs = [sentence1, sentence2, sentence3]

In [156]:
print(docs)

['Gangubai Kathiawadi is an excellent movie.', 'Gangubai Kathiawadi is not a good movie.', 'Gangubai Kathiawadi is an amazing movie. Beautifully acted, emotion and drama are at the core.']


In [157]:
cv1 = CountVectorizer(ngram_range=(2,2))
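
A short sketch of what a bigram vectoriser like the one above produces once it is fitted. Only the first two sentences are used to keep the output small; sorting `vocabulary_` by column index is just one way to list the learned features.

```python
from sklearn.feature_extraction.text import CountVectorizer

two_docs = [
    "Gangubai Kathiawadi is an excellent movie.",
    "Gangubai Kathiawadi is not a good movie.",
]

# ngram_range=(2, 2) keeps only pairs of consecutive tokens
bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigram_counts = bigram_cv.fit_transform(two_docs).toarray()

# List the learned bigrams in column order
bigrams = sorted(bigram_cv.vocabulary_, key=bigram_cv.vocabulary_.get)
print(bigrams)
print(bigram_counts)
```

The bigram `not good` becomes a feature of its own, so the negation that single-word counts would lose is preserved. The single-character token `a` is dropped by the default token pattern before the pairs are formed, which is why `not good` appears instead of `not a`.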

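TF-IDF Normalization : Assigning a weight to each word from its frequency in the document (TF) and its rarity across the corpus (IDF). A word that appears in many documents carries less distinguishing information, so it gets a lower weight, while a word that is frequent in one document but rare elsewhere gets a higher weight.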

In [147]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [150]:
tfidf = TfidfVectorizer()

In [166]:
vc = tfidf.fit_transform(docs).toarray()

In [167]:
print(vc)

[[0.         0.         0.44102652 0.         0.         0.
  0.         0.         0.         0.         0.57989687 0.34249643
  0.         0.34249643 0.34249643 0.34249643 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.32052772
  0.54270061 0.32052772 0.32052772 0.32052772 0.54270061 0.        ]
 [0.28899189 0.28899189 0.21978578 0.28899189 0.28899189 0.28899189
  0.28899189 0.28899189 0.28899189 0.28899189 0.         0.17068326
  0.         0.17068326 0.17068326 0.17068326 0.         0.28899189]]


In [171]:
tfidf.vocabulary_

{'acted': 0,
 'amazing': 1,
 'an': 2,
 'and': 3,
 'are': 4,
 'at': 5,
 'beautifully': 6,
 'core': 7,
 'drama': 8,
 'emotion': 9,
 'excellent': 10,
 'gangubai': 11,
 'good': 12,
 'is': 13,
 'kathiawadi': 14,
 'movie': 15,
 'not': 16,
 'the': 17}
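
The first row of the matrix printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, raw term counts, L2 row normalisation, tokens of two or more word characters); the helper names `idf` and `tfidf_row` are hypothetical.

```python
import math
import re

docs = [
    "Gangubai Kathiawadi is an excellent movie.",
    "Gangubai Kathiawadi is not a good movie.",
    "Gangubai Kathiawadi is an amazing movie. Beautifully acted, emotion and drama are at the core.",
]

# Lowercase and keep tokens of two or more word characters
tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
n_docs = len(tokenised)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in tokenised)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw term count times idf, then scale the row to unit L2 length
    raw = {term: tokens.count(term) * idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

first = tfidf_row(tokenised[0])
for term in sorted(first):
    print(term, round(first[term], 6))
```

Words shared by all three documents (`gangubai`, `kathiawadi`, `is`, `movie`) end up with the lowest weight in the row, `an` (two documents) sits in the middle, and `excellent` (one document) gets the highest.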