
In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
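# Download the NLTK data used below: the Punkt tokenizer models and the stop-word lists.
# Recent NLTK releases load the tokenizer from 'punkt_tab'; if word_tokenize raises
# a LookupError, also run nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')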

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Sample text documents (3 docs)
text_data = [
    """
    Dhoni finishes off in style, A magnificient strike into the crowd. Dhoni and co doest it after 28 years.
    """,
    """
    Indian cricket team had its golden reign during the 2007-2013 time period as India won all 3 major ICC trophies in which Dhoni was the captain.
    """,
    """
    Dhoni does it again and again, it doesnt matter how old he gets, he just lives finishing.
    """
]

In [None]:
# Load the English stop words into a set for fast membership tests
stop_words = set(stopwords.words('english'))

In [None]:
# Vocabulary: the set of unique tokens seen across all documents (filled in the loop below)
vocab = set()


In [None]:
# Bag-of-words model: one word-count dictionary per document, in document order
bow_model = []

In [None]:
for text in text_data:
    # Create a dictionary to store the word counts for each document
    word_counts = {}

    # Tokenize the text
    tokens = word_tokenize(text.lower())  # Convert to lowercase for uniformity

    # Keep only alphanumeric tokens that are not stop words.
    # str.isalnum() also drops punctuation and hyphenated tokens such as "2007-2013".
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

    # Update the vocabulary
    vocab.update(tokens)

    # Count the occurrences of each word (a missing word starts at 0)
    for word in tokens:
        word_counts[word] = word_counts.get(word, 0) + 1

    # Add the word counts to the bag-of-words model
    bow_model.append(word_counts)
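
The per-document counting in the loop above can also be written with `collections.Counter`. The following is a minimal sketch, not the notebook's method: to stay runnable without NLTK data it swaps in a regex tokenizer for `word_tokenize` and a tiny hand-written stop-word set for the NLTK list. The names `DEMO_STOP_WORDS` and `count_words` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-ins so the sketch runs without NLTK data:
# a tiny stop-word set and a regex tokenizer instead of word_tokenize.
DEMO_STOP_WORDS = {"and", "it", "again", "how", "he", "just", "does"}


def count_words(text, stop_words=DEMO_STOP_WORDS):
    """Return a Counter of lowercase alphanumeric tokens, minus stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


demo_counts = count_words("Dhoni does it again and again, Dhoni just finishes.")
print(dict(demo_counts))
```

A `Counter` is a `dict` subclass, so it could be appended to `bow_model` in place of `word_counts`; the only visible difference would be its printed representation.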


In [None]:
# Print the vocabulary
print("Vocabulary:", vocab)

Vocabulary: {'crowd', 'team', 'period', 'old', 'dhoni', 'reign', 'indian', 'captain', 'style', 'time', 'trophies', 'doesnt', 'major', 'india', 'gets', 'doest', '3', 'co', '28', 'strike', 'magnificient', 'years', 'lives', 'cricket', 'finishes', 'icc', 'golden', 'matter', 'finishing'}


In [None]:

print("\nWord counts for the first document:", bow_model[0])
print("\nWord counts for the second document:", bow_model[1])
print("\nWord counts for the third document:", bow_model[2])


Word counts for the first document: {'dhoni': 2, 'finishes': 1, 'style': 1, 'magnificient': 1, 'strike': 1, 'crowd': 1, 'co': 1, 'doest': 1, '28': 1, 'years': 1}

Word counts for the second document: {'indian': 1, 'cricket': 1, 'team': 1, 'golden': 1, 'reign': 1, 'time': 1, 'period': 1, 'india': 1, '3': 1, 'major': 1, 'icc': 1, 'trophies': 1, 'dhoni': 1, 'captain': 1}

Word counts for the third document: {'dhoni': 1, 'doesnt': 1, 'matter': 1, 'old': 1, 'gets': 1, 'lives': 1, 'finishing': 1}
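
The per-document dictionaries can be turned into a document-term matrix, which is the form most models expect. This is a sketch under stated assumptions: the counts are copied from the printed output above so the block runs on its own, and the names `bow_counts`, `vocab_sorted`, and `doc_term_matrix` are illustrative. The vocabulary is sorted first because a Python `set` has no guaranteed order, so the column order would otherwise vary between runs.

```python
# Word counts copied from the printed output above, so the sketch is self-contained.
bow_counts = [
    {'dhoni': 2, 'finishes': 1, 'style': 1, 'magnificient': 1, 'strike': 1,
     'crowd': 1, 'co': 1, 'doest': 1, '28': 1, 'years': 1},
    {'indian': 1, 'cricket': 1, 'team': 1, 'golden': 1, 'reign': 1, 'time': 1,
     'period': 1, 'india': 1, '3': 1, 'major': 1, 'icc': 1, 'trophies': 1,
     'dhoni': 1, 'captain': 1},
    {'dhoni': 1, 'doesnt': 1, 'matter': 1, 'old': 1, 'gets': 1, 'lives': 1,
     'finishing': 1},
]

# Fix a column order: the sorted union of all words in all documents.
vocab_sorted = sorted(set().union(*bow_counts))

# One row per document, one column per vocabulary word; absent words count as 0.
doc_term_matrix = [
    [counts.get(word, 0) for word in vocab_sorted]
    for counts in bow_counts
]

# Inspect the column for a single word across the three documents.
dhoni_col = vocab_sorted.index('dhoni')
print([row[dhoni_col] for row in doc_term_matrix])  # → [2, 1, 1]
```

Inside the notebook, the same two comprehensions could be applied to the existing `vocab` and `bow_model` variables instead of the copied literals.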
