## NLP Session-1 PreClass
- NLTK, short for Natural Language Toolkit, is a library written in Python for symbolic and statistical Natural Language Processing.
- NLTK provides a module called `nltk.tokenize`. Two of its functions are used here:
    - **Word Tokenize:** The `word_tokenize()` function splits a sentence into tokens (words and punctuation marks).
    - **Sentence Tokenize:** The `sent_tokenize()` function splits a document or paragraph into sentences.

In [1]:
from nltk.tokenize import word_tokenize

In [2]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

In [3]:
word_tokenize(text)
# NLTK treats each punctuation mark as a separate token, so punctuation has to be removed from the token list before later steps.

['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 '’',
 's',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars',
 '.',
 'In',
 '2008',
 ',',
 'SpaceX',
 '’',
 's',
 'Falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid-fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth',
 '.']
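
The comment above says the punctuation tokens need to be removed. A minimal sketch of one way to do that, using a shortened copy of the token list above: `string.punctuation` only covers ASCII characters, so it misses the curly apostrophe `’`, while a check on the Unicode category catches both. The helper name `is_punctuation` is made up for this illustration.

```python
import string
import unicodedata

# A shortened copy of the word_tokenize() output shown above
tokens = ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', '.']

def is_punctuation(token):
    """Return True if every character in the token is Unicode punctuation."""
    return all(unicodedata.category(ch).startswith('P') for ch in token)

# ASCII-only check: the curly apostrophe survives
ascii_filtered = [t for t in tokens if t not in string.punctuation]

# Unicode-aware check: the curly apostrophe is removed as well
unicode_filtered = [t for t in tokens if not is_punctuation(t)]

print(ascii_filtered)
print(unicode_filtered)
```

Tokens such as `multi-planet` are kept by both checks, because only tokens made up entirely of punctuation are dropped.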

In [4]:
from nltk.tokenize import sent_tokenize

In [5]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

In [6]:
sent_tokenize(text)

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
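
The sentences above still contain the line breaks (`\n`) from the triple-quoted string. If that is unwanted, one possible way to normalize the whitespace is sketched below; the variable names are illustrative.

```python
# The two sentences returned by sent_tokenize() above, including the embedded "\n"
sentences = [
    'Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.',
    'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.',
]

# str.split() with no arguments splits on any run of whitespace (spaces, newlines, tabs),
# so joining the pieces with a single space collapses " \n" into " "
clean_sentences = [" ".join(sentence.split()) for sentence in sentences]

for sentence in clean_sentences:
    print(sentence)
```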

## Text Cleaning in Python

In [7]:
import warnings
warnings.filterwarnings("ignore")

In [8]:
raw_docs = ["I am writing some very basic english sentences", 
            "I'm just writing it for the demo PURPOSE to make audience understand the basics .",
            "The point is to _learn HOW it works_ on #simple # data."]

In [9]:
import nltk

In [10]:
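# Download the NLTK data (tokenizer models, stopwords, WordNet) once before running the cells below:
# nltk.download()                  # from a Jupyter cell
# python -m nltk.downloader all    # or from the command line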

### Step 1 - Convert to lower case

In [11]:
import string
raw_docs = [doc.lower() for doc in raw_docs]
raw_docs

['i am writing some very basic english sentences',
 "i'm just writing it for the demo purpose to make audience understand the basics .",
 'the point is to _learn how it works_ on #simple # data.']

### Step 2 - Tokenization

In [12]:
# Word tokenization
from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
tokenized_docs

[['i', 'am', 'writing', 'some', 'very', 'basic', 'english', 'sentences'],
 ['i',
  "'m",
  'just',
  'writing',
  'it',
  'for',
  'the',
  'demo',
  'purpose',
  'to',
  'make',
  'audience',
  'understand',
  'the',
  'basics',
  '.'],
 ['the',
  'point',
  'is',
  'to',
  '_learn',
  'how',
  'it',
  'works_',
  'on',
  '#',
  'simple',
  '#',
  'data',
  '.']]

In [13]:
# Sentence tokenization
from nltk.tokenize import sent_tokenize
sent_token = [sent_tokenize(doc) for doc in raw_docs]
sent_token

[['i am writing some very basic english sentences'],
 ["i'm just writing it for the demo purpose to make audience understand the basics ."],
 ['the point is to _learn how it works_ on #simple # data.']]

### Step 3 - Punctuation Removal

In [14]:
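import re
import string

# Character class matching any single ASCII punctuation character
regex = re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = []

for review in tokenized_docs:
    new_review = []
    for token in review:
        # Strip punctuation characters from the token
        new_token = regex.sub("", token)
        # Keep the token only if something is left after stripping
        if new_token:
            new_review.append(new_token)

    tokenized_docs_no_punctuation.append(new_review)

print(tokenized_docs_no_punctuation)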

[['i', 'am', 'writing', 'some', 'very', 'basic', 'english', 'sentences'], ['i', 'm', 'just', 'writing', 'it', 'for', 'the', 'demo', 'purpose', 'to', 'make', 'audience', 'understand', 'the', 'basics'], ['the', 'point', 'is', 'to', 'learn', 'how', 'it', 'works', 'on', 'simple', 'data']]
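
The same cleaning can also be written without a regular expression. The sketch below is an alternative, not the approach used in the cell above: it uses `str.translate` with a deletion table built by `str.maketrans`, applied to the third tokenized document from Step 2.

```python
import string

# Third document from Step 2, after word tokenization
tokens = ['the', 'point', 'is', 'to', '_learn', 'how', 'it', 'works_',
          'on', '#', 'simple', '#', 'data', '.']

# Translation table that deletes every ASCII punctuation character
delete_punctuation = str.maketrans("", "", string.punctuation)

# Strip punctuation from each token, then drop tokens that end up empty
cleaned = [t.translate(delete_punctuation) for t in tokens]
cleaned = [t for t in cleaned if t]

print(cleaned)
# ['the', 'point', 'is', 'to', 'learn', 'how', 'it', 'works', 'on', 'simple', 'data']
```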


### Step 4 - Removing Stopwords 

In [15]:
from nltk.corpus import stopwords

# Build the stopword set once; a set lookup is much faster than rebuilding the list for every word
stop_words = set(stopwords.words("english"))

tokenized_docs_no_stopwords = []

for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        if word not in stop_words:
            new_term_vector.append(word)

    tokenized_docs_no_stopwords.append(new_term_vector)

print(tokenized_docs_no_stopwords)

[['writing', 'basic', 'english', 'sentences'], ['writing', 'demo', 'purpose', 'make', 'audience', 'understand', 'basics'], ['point', 'learn', 'works', 'simple', 'data']]
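
A compact variant of the same step, as a sketch. The small `stop_words` set below is a hand-picked stand-in for `stopwords.words("english")` so that the snippet runs without downloading NLTK data; it is an assumption for illustration and not the full NLTK list.

```python
# Hand-picked stand-in for the NLTK English stopword list (assumption, not the full list)
stop_words = {'i', 'm', 'just', 'it', 'for', 'the', 'to'}

# Second document from Step 3, after punctuation removal
doc = ['i', 'm', 'just', 'writing', 'it', 'for', 'the', 'demo', 'purpose',
       'to', 'make', 'audience', 'understand', 'the', 'basics']

# Keep only the words that are not stopwords
filtered = [word for word in doc if word not in stop_words]

print(filtered)
# ['writing', 'demo', 'purpose', 'make', 'audience', 'understand', 'basics']
```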


### Step 5 - Stemming and Lemmatization

In [16]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
wordnet = WordNetLemmatizer()

preprocessed_docs = []

for doc in tokenized_docs_no_stopwords:
    final_doc = []
    for word in doc:
        #final_doc.append(porter.stem(word)) # Stemming
        final_doc.append(wordnet.lemmatize(word)) # Lemmatization
    preprocessed_docs.append(final_doc)

print(preprocessed_docs)

# Stemming strips suffixes by rule, so the result may not be a real word.
# Lemmatization looks each word up in a dictionary (WordNet) and returns its meaningful base form.

[['writing', 'basic', 'english', 'sentence'], ['writing', 'demo', 'purpose', 'make', 'audience', 'understand', 'basic'], ['point', 'learn', 'work', 'simple', 'data']]
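
To make the difference concrete, here is a deliberately simplified sketch. `toy_stem` and `toy_lemmatize` are made-up helpers for illustration only: they are neither the Porter algorithm nor WordNet, and the three-entry dictionary is hand-written.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule, without checking that the result is a word."""
    for suffix in ('ing', 'es', 's'):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hand-written lookup table standing in for a real dictionary such as WordNet
toy_dictionary = {'sentences': 'sentence', 'basics': 'basic', 'works': 'work'}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word itself."""
    return toy_dictionary.get(word, word)

for word in ['sentences', 'writing', 'works']:
    print(word, '->', toy_stem(word), '|', toy_lemmatize(word))
```

The rule-based helper turns `sentences` into `sentenc`, which is not a word, while the lookup returns `sentence`.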


## CountVectorizer 

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["hello, my name is Aman and I am  a data scientist."]
text1 = ["You are watching unfold data science"]

In [5]:
vectorizer = CountVectorizer()
vectorizer.fit(text)

In [6]:
vectorizer.vocabulary_
# Each word is assigned an index number (in alphabetical order).
# The vectorizer also cleans the text: it lowercases it and drops punctuation and single-character tokens, so "I" and "a" are missing.

{'hello': 4,
 'my': 6,
 'name': 7,
 'is': 5,
 'aman': 1,
 'and': 2,
 'am': 0,
 'data': 3,
 'scientist': 8}
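
The mapping above can be reproduced by hand. The sketch below assumes `CountVectorizer`'s default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters). It mimics that behaviour with the standard library and is not scikit-learn's actual implementation.

```python
import re

text = ["hello, my name is Aman and I am  a data scientist."]

# Default CountVectorizer token pattern: tokens of at least two word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Lowercase, tokenize, and collect the distinct tokens over all documents
tokens = set()
for doc in text:
    tokens.update(token_pattern.findall(doc.lower()))

# Indices are assigned in sorted (alphabetical) order
vocabulary = {token: index for index, token in enumerate(sorted(tokens))}

print(vocabulary)
```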

In [7]:
newvector = vectorizer.transform(text1)
newvector.toarray()
# Only "data" from text1 is in the fitted vocabulary, so only index 3 is non-zero.

array([[0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)

In [8]:
text2 = ["You are watching unfold data science. Aman Aman"]

In [9]:
newvector2 = vectorizer.transform(text2)
newvector2.toarray()
# Index 1 holds the value 2 because index 1 represents "aman", which occurs twice in text2.

array([[0, 2, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)
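
The row above is a plain word count against the fitted vocabulary. A standard-library sketch of that counting step, assuming the same default tokenization (lowercase, tokens of two or more word characters):

```python
from collections import Counter
import re

# Vocabulary learned from the first text (copied from vectorizer.vocabulary_ above)
vocabulary = {'am': 0, 'aman': 1, 'and': 2, 'data': 3, 'hello': 4,
              'is': 5, 'my': 6, 'name': 7, 'scientist': 8}

text2 = "You are watching unfold data science. Aman Aman"

# Tokenize the same way as the default CountVectorizer
counts = Counter(re.findall(r"(?u)\b\w\w+\b", text2.lower()))

# One slot per vocabulary word; words outside the vocabulary are ignored
row = [0] * len(vocabulary)
for token, index in vocabulary.items():
    row[index] = counts[token]

print(row)
# [0, 2, 0, 1, 0, 0, 0, 0, 0]
```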

## TF - IDF (Term Frequency - Inverse Document Frequency)
- The purpose of TF-IDF is to highlight words that are frequent within a document but rare across the collection of documents.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["Aman is a data scientist in India", "This is unfold data science", "Data Science is a promising career."]

In [11]:
vectorizer = TfidfVectorizer()
vectorizer.fit(text)

In [14]:
vectorizer.vocabulary_

{'aman': 0,
 'is': 5,
 'data': 2,
 'scientist': 8,
 'in': 3,
 'india': 4,
 'this': 9,
 'unfold': 10,
 'science': 7,
 'promising': 6,
 'career': 1}

In [15]:
vectorizer.idf_
# "data" (index 2) occurs in every sentence of text, so it gets the minimum IDF weight of 1: it does not help tell the documents apart.
# "is" (index 5) also occurs in every sentence, so it gets a weight of 1 as well.

array([1.69314718, 1.69314718, 1.        , 1.69314718, 1.69314718,
       1.        , 1.69314718, 1.28768207, 1.69314718, 1.69314718,
       1.69314718])
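
These weights can be checked by hand. The sketch below assumes `TfidfVectorizer`'s default smoothed formula, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The helper names are illustrative.

```python
import math
import re

text = ["Aman is a data scientist in India",
        "This is unfold data science",
        "Data Science is a promising career."]

# Distinct lowercase tokens per document (default tokenization: 2+ word characters)
doc_tokens = [set(re.findall(r"(?u)\b\w\w+\b", doc.lower())) for doc in text]
n_docs = len(text)

def document_frequency(term):
    """Number of documents that contain the term."""
    return sum(term in tokens for tokens in doc_tokens)

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, assumed to match TfidfVectorizer's default."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

# "data" is in all 3 documents, "science" in 2, "aman" in 1
for term in ["data", "science", "aman"]:
    print(term, document_frequency(term), round(smoothed_idf(term), 8))
```

A term found in every document gets `ln(1) + 1 = 1`, the lowest possible weight, and rarer terms get larger weights.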

In [16]:
text_as_input = text[0]
text_as_input

'Aman is a data scientist in India'

In [18]:
vector = vectorizer.transform([text_as_input])
vector.toarray()
# The indices of vocabulary words that do not occur in "Aman is a data scientist in India" are 0.

array([[0.46138073, 0.        , 0.27249889, 0.46138073, 0.46138073,
        0.27249889, 0.        , 0.        , 0.46138073, 0.        ,
        0.        ]])
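
The non-zero values above can also be checked by hand. The sketch below assumes `TfidfVectorizer`'s defaults (raw term counts multiplied by IDF, then each row scaled to unit Euclidean length) and reuses the `idf_` values printed earlier.

```python
import math

# IDF weights copied from vectorizer.idf_ above, in vocabulary order
idf = [1.69314718, 1.69314718, 1.0, 1.69314718, 1.69314718,
       1.0, 1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718]

# Term counts for "Aman is a data scientist in India":
# aman(0), data(2), in(3), india(4), is(5), scientist(8) each occur once
counts = [1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0]

# Raw TF-IDF is count * idf; the row is then scaled to unit Euclidean (L2) length
raw = [c * w for c, w in zip(counts, idf)]
norm = math.sqrt(sum(value ** 2 for value in raw))
tfidf = [value / norm for value in raw]

print([round(value, 8) for value in tfidf])
```

Words that occur in every document ("data", "is") end up with the smaller value, and the four rarer words share the larger one.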

In [19]:
text_as_input = text[2]
text_as_input

'Data Science is a promising career.'

In [20]:
vector = vectorizer.transform([text_as_input])

In [21]:
vector.toarray()

array([[0.        , 0.55249005, 0.32630952, 0.        , 0.        ,
        0.32630952, 0.55249005, 0.42018292, 0.        , 0.        ,
        0.        ]])