# **01 - NLP Tokenization**

Now that we’ve covered the basics, let’s dive deeper. Suppose we’ve already gathered and cleaned our corpus. The first question to ask is: **How can we represent text so that a computer can understand it?** This is where **tokenization** comes in.

Tokenization, also known as text segmentation, is the process of breaking text into smaller units called tokens, such as words or sentences. It's the first step in turning raw text into something usable for NLP tasks.
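To make the two granularities concrete, here is a minimal sketch that uses only Python's built-in `re` module. The sample text and both regular expressions are illustrative assumptions, not the rules NLTK applies, and they will mishandle cases such as abbreviations.

```python
import re

text = "Hi there! How are you?"

# Sentence-level tokens: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word-level tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)  # ['Hi there!', 'How are you?']
print(words)      # ['Hi', 'there', '!', 'How', 'are', 'you', '?']
```

The same text yields two sentence tokens or seven word tokens, depending on the unit we choose.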

In [3]:
import nltk
print(nltk.data.path)
print(f'\n{nltk.__version__}')

['/usr/share/nltk_data', '/root/nltk_data', '/usr/local/nltk_data', '/usr/local/share/nltk_data', '/usr/local/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']

3.8.1


In [4]:
import nltk

# Explicitly set the NLTK data path
#nltk.data.path.append('/usr/share/nltk_data')

# Verify the path
print(nltk.data.path)

# Now, run the tokenizer
from nltk.tokenize import word_tokenize
sentence = "It works!"
tokens = word_tokenize(sentence)
print(tokens)

['/usr/share/nltk_data', '/root/nltk_data', '/usr/local/nltk_data', '/usr/local/share/nltk_data', '/usr/local/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
['It', 'works', '!']


In [3]:
import nltk
nltk.download('punkt')  # Should say already downloaded
from nltk.tokenize import word_tokenize
print(word_tokenize("It works!"))


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/usr/share/nltk_data'
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
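The traceback asks for `punkt_tab`, while the cell only downloaded `punkt`. In this notebook, tokenizing apparently worked under NLTK 3.8.1 and failed under 3.9.1, so the required data package seems to depend on the installed version. The sketch below picks the package name from a version string. `punkt_package_for` is a hypothetical helper, and the 3.9 cut-over is an assumption drawn from the two versions seen here, not a confirmed release boundary.

```python
import re

def punkt_package_for(version: str) -> str:
    """Guess which Punkt data package a given NLTK version needs.

    Assumption: versions at or above 3.9 load 'punkt_tab' and older
    versions load the pickled 'punkt' models.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return "punkt_tab" if (major, minor) >= (3, 9) else "punkt"

print(punkt_package_for("3.8.1"))  # punkt
print(punkt_package_for("3.9.1"))  # punkt_tab
```

In a notebook cell, `nltk.download(punkt_package_for(nltk.__version__))` would then request the matching package. Calling `nltk.download('punkt_tab')` directly, as the error message suggests, is the simpler route when the version is known.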


In [4]:
print(nltk.__version__)

3.9.1


In [1]:
import nltk
import pandas as pd

# Download the necessary NLTK data files
nltk.download('punkt')

# Sample sentence
sentence = "Hi there! How are you doing today? Hope you're having a good time at the NLP Bootcamp."

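# Tokenizers (each variable holds the resulting list of tokens)
whitespace_tokenizer = sentence.split()
# gaps=True makes the pattern a separator; without it, regexp_tokenize would
# return the matched whitespace and punctuation instead of the words
punctuation_tokenizer = nltk.regexp_tokenize(sentence, pattern=r'\s|[\.,!?;"]', gaps=True)
treebank_tokenizer = nltk.word_tokenize(sentence)
tweet_tokenizer = nltk.tokenize.TweetTokenizer().tokenize(sentence)
mwe_tokenizer = nltk.word_tokenize(sentence)  # Placeholder for an MWE tokenizer: no expressions are merged, so the output matches Treebank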

# Create a DataFrame
data = {
    'Word Tokenizer': ['Whitespace-based Tokenization', 'Punctuation-based Tokenization', 
                       'Default/Treebank Tokenizer', 'Tweet Tokenizer', 'MWE Tokenizer'],
    'Tokens': [
        str(whitespace_tokenizer),
        str(punctuation_tokenizer),
        str(treebank_tokenizer),
        str(tweet_tokenizer),
        str(mwe_tokenizer)
    ]
}

df = pd.DataFrame(data)
df

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/usr/share/nltk_data'
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
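Because the comparison cell stops at the missing Punkt data, it can help to see the differences between the strategies without any NLTK resources. The following is a rough sketch using only `re`: the regular expression and the `merge_mwes` helper are illustrative assumptions, not NLTK's implementations, although `merge_mwes` follows the idea behind NLTK's `MWETokenizer` of joining listed multi-word expressions with a separator.

```python
import re

sentence = "Hi there! How are you doing today? Hope you're having a good time at the NLP Bootcamp."

# Whitespace-based: punctuation stays attached to the neighbouring word
whitespace_tokens = sentence.split()

# Punctuation-aware: words (with inner apostrophes) or single punctuation marks
punct_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

def merge_mwes(tokens, mwes, sep="_"):
    """Join each listed multi-word expression into a single token."""
    # Try longer expressions first so the longest match wins
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(sep.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here: keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged

mwe_tokens = merge_mwes(punct_tokens, [("NLP", "Bootcamp")])

print(whitespace_tokens[:2])  # ['Hi', 'there!']
print(punct_tokens[:3])       # ['Hi', 'there', '!']
print(mwe_tokens[-2:])        # ['NLP_Bootcamp', '.']
```

Whitespace splitting keeps `there!` as one token, the punctuation-aware pattern separates the `!`, and the MWE step turns `NLP` and `Bootcamp` into the single token `NLP_Bootcamp`.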


In [7]:
lang = "english"  # language whose punkt_tab model we want to locate
lang_dir = nltk.find(f"tokenizers/punkt_tab/{lang}/")


In [4]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

In [10]:
pst = PunktSentenceTokenizer()
pst.tokenize(sentence)

['Hi there!',
 'How are you doing today?',
 "Hope you're having a good time at the NLP Bootcamp."]


In [5]:
import os

# Check if the punkt folder exists where it should
punkt_path = "/usr/share/nltk_data/tokenizers/punkt"
print("Exists:", os.path.exists(punkt_path))
print("Contents:", os.listdir(punkt_path) if os.path.exists(punkt_path) else "Not found")

Exists: True
Contents: ['slovene.pickle', 'danish.pickle', 'portuguese.pickle', 'dutch.pickle', 'swedish.pickle', '.DS_Store', 'estonian.pickle', 'french.pickle', 'README', 'turkish.pickle', 'german.pickle', 'malayalam.pickle', 'english.pickle', 'italian.pickle', 'polish.pickle', 'russian.pickle', 'PY3', 'czech.pickle', 'spanish.pickle', 'finnish.pickle', 'norwegian.pickle', 'greek.pickle']
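The listing confirms that the pickled `punkt` models are installed, yet the errors in this notebook report an attempt to load `tokenizers/punkt_tab/english/`. A small check over all search paths can show which of the two packages is actually present. This is a sketch under assumptions: `locate_tokenizer_data` is a hypothetical helper, and the demo runs against a temporary directory that mimics the layout above rather than the real `/usr/share/nltk_data`.

```python
import os
import tempfile

def locate_tokenizer_data(search_paths, packages=("punkt", "punkt_tab")):
    """Map each package name to the directories where it is installed."""
    found = {name: [] for name in packages}
    # dict.fromkeys drops duplicate search paths while keeping their order
    for base in dict.fromkeys(search_paths):
        for name in packages:
            candidate = os.path.join(base, "tokenizers", name)
            if os.path.isdir(candidate):
                found[name].append(candidate)
    return found

# Demo: a throwaway data directory that contains only tokenizers/punkt
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "tokenizers", "punkt"))
    report = locate_tokenizer_data([root, root])
    print({name: len(dirs) for name, dirs in report.items()})
    # {'punkt': 1, 'punkt_tab': 0}
```

In the notebook itself, passing `nltk.data.path` as `search_paths` would run the same check against the real data directories.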


In [6]:
import nltk

print(nltk.data.path)

['/usr/share/nltk_data', '/root/nltk_data', '/usr/local/nltk_data', '/usr/local/share/nltk_data', '/usr/local/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


In [7]:
from nltk.tokenize import word_tokenize

sentence = "This is a test."
tokens = word_tokenize(sentence)
print(tokens)

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/usr/share/nltk_data'
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
