# NLP. Lab 1. Tokenization.


## What is tokenization?


Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller chunks, such as words or sentences, called tokens. If the text is split into words, it is called word tokenization; if it is split into sentences, it is called sentence tokenization. Word tokenization generally splits on spaces, while sentence tokenization splits on characters such as periods, exclamation points and newlines. The appropriate method depends on the task at hand. During tokenization, some characters, such as spaces and punctuation, may be dropped and will not appear in the final list of tokens.

![NLP_Tokenization](https://raw.githubusercontent.com/satishgunjal/images/master/NLP_Tokenization.png)


### Purpose


A sentence gets its meaning from the words in it, so analyzing those words helps us interpret the text. Once we have a list of words, we can also apply statistical tools and methods to gain more insight into the text. For example, word counts and word frequencies can indicate how important a word is in a sentence or document.
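
As a rough illustration of the word-frequency idea, the sketch below counts tokens with `collections.Counter` from the standard library. The sample sentence and the lowercasing step are assumptions made for this example, not part of the lab data.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
sample = "the cat sat on the mat and the dog sat too"

# Naive whitespace tokenization, lowercased so counts are case-insensitive.
words = sample.lower().split()

# Counter maps each token to the number of times it occurs.
frequencies = Counter(words)

print(frequencies.most_common(2))
# [('the', 3), ('sat', 2)]
```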


## Tokenization in Python


In [1]:
text = "Tokenization is one of the first step in any NLP pipeline. Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens. If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'. Generally 'space' is used to perform the word tokenization and characters like 'periods, exclamation point and newline char are used for Sentence Tokenization.  We have to choose the appropriate method as per the task in hand. While performing the tokenization few characters like spaces, punctuations are ignored and will not be the part of final list of tokens."

### Built-in methods


We can use the **split()** method to split a string into a list, where each word is a list item.


#### Word tokenization


In [2]:
tokens = text.split()
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']
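
One limitation worth noting: `split()` separates only on whitespace, so punctuation stays attached to the neighbouring word. The minimal sketch below uses a short sample string (an assumption for illustration) to show the effect, along with one possible cleanup based on `str.strip` and `string.punctuation`.

```python
import string

# Hypothetical sample text, chosen only for illustration.
sample = "Hello, world! Tokenization is fun."

raw_tokens = sample.split()
print(raw_tokens)
# ['Hello,', 'world!', 'Tokenization', 'is', 'fun.']

# Strip leading/trailing punctuation from each token and drop tokens that become empty.
clean_tokens = [t.strip(string.punctuation) for t in raw_tokens]
clean_tokens = [t for t in clean_tokens if t]
print(clean_tokens)
# ['Hello', 'world', 'Tokenization', 'is', 'fun']
```

This only removes punctuation at the edges of a token; characters inside a token, such as the apostrophe in `it's`, are left untouched.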


#### Sentence tokenization


In [3]:
tokens = text.split(".")
print(tokens[:3])

['Tokenization is one of the first step in any NLP pipeline', ' Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens', " If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'"]
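
Note that the pieces returned by `split(".")` lose their final period and keep a leading space, and the last piece is an empty string when the text ends with a period. The sketch below, on a short sample string assumed for illustration, tidies up the result.

```python
# Hypothetical sample text, chosen only for illustration.
sample = "First sentence. Second sentence. Third one."

pieces = sample.split(".")
print(pieces)
# ['First sentence', ' Second sentence', ' Third one', '']

# Remove surrounding whitespace and discard empty pieces.
sentences = [p.strip() for p in pieces if p.strip()]
print(sentences)
# ['First sentence', 'Second sentence', 'Third one']
```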


### RegEx tokenization

- Using RegEx, we can match character combinations in a string and perform word or sentence tokenization.
- You can check your regular expressions at [regex101](https://regex101.com/).


#### Word tokenization


In [4]:
import re

# a raw string avoids the invalid escape sequence warning for "\w"
tokens = re.findall(r"\w+", text)
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']
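

#### Sentence tokenization


Regular expressions can also handle sentence tokenization. The sketch below is one possible approach, not the only one: it splits on whitespace that follows a sentence terminator, using a lookbehind so the terminator stays with its sentence. The sample string is an assumption for illustration, and the pattern would wrongly split after abbreviations such as "e.g.".

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Is this a sentence? Yes, it is! And here is another one."

# Split at whitespace that follows '.', '!' or '?'.
# The lookbehind keeps the terminator attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", sample)
print(sentences)
# ['Is this a sentence?', 'Yes, it is!', 'And here is another one.']
```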


### NLTK library


#### Word tokenization


In [5]:
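# !pip install nltk
# import nltk; nltk.download('punkt')  # tokenizer models; newer NLTK releases may need 'punkt_tab' instead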

In [6]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']


#### Sentence tokenization


In [7]:
from nltk.tokenize import sent_tokenize

tokens = sent_tokenize(text)
print(tokens[:3])

['Tokenization is one of the first step in any NLP pipeline.', 'Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.', "If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'."]


### spaCy library


#### Word tokenization


In [8]:
# !pip install spacy
# !python -m spacy download en_core_web_sm  # optional: the blank English() pipeline below works without a downloaded model

In [9]:
from spacy.lang.en import English

english_tokenizer = English()

doc = english_tokenizer(text)
tokens = [token.text for token in doc]
print(tokens[:5])

['Tokenization', 'is', 'one', 'of', 'the']


#### Sentence tokenization


In [10]:
english_tokenizer = English()
english_tokenizer.add_pipe("sentencizer")


doc = english_tokenizer(text)
tokens = [sent.text for sent in doc.sents]  # doc.sents yields sentence spans; .text gives plain strings
print(tokens[:3])

['Tokenization is one of the first step in any NLP pipeline.', 'Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.', "If the text is split into words, then its called as 'Word Tokenization' and if it's split into sentences then its called as 'Sentence Tokenization'."]


## Task


Your goal is to solve the tokenization task and count the number of numeric tokens.

Submit your solution to the [competition](https://www.kaggle.com/t/50b3669520ce4a0e892900406bbc1f2f).


### Grade distribution

- Your solution ranks the same as or above the benchmark solution on the leaderboard - 1 point
- Your solution ranks below the benchmark solution - 0.5 points
- No submission, late submission or no appearance on the leaderboard - 0 points


In [13]:
import re

counts = []

with open('./data.txt', 'r') as f:
    for line in f:
        # each run of consecutive digits counts as one numeric token
        counts.append(len(re.findall(r'\d+', line)))

counts

[2, 1, 1, 3, 6, 2, 1, 3, 2, 1, 1, 2, 4]
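
How "numeric token" is defined affects the count. The pattern `\d+` treats every run of digits as a token, so a decimal such as `3.14` contributes two matches. The sketch below compares it with a hypothetical alternative pattern that keeps decimals together. Which definition the competition expects is not stated here, so this is only an illustration on a made-up sample string.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Pi is about 3.14 and 42 people came in 2021."

# Every maximal run of digits is one match.
digit_runs = re.findall(r"\d+", sample)
print(digit_runs)
# ['3', '14', '42', '2021']

# Hypothetical alternative: digits optionally followed by a decimal part.
decimals = re.findall(r"\d+(?:\.\d+)?", sample)
print(decimals)
# ['3.14', '42', '2021']
```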

In [14]:
with open("submission.csv", "w") as f:
    f.write("id,count\n")
    for idx, count in enumerate(counts):  # idx avoids shadowing the built-in id()
        f.write(f"{idx},{count}\n")
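
The same file could also be produced with the standard-library `csv` module, which takes care of delimiters and quoting. The sketch below is an alternative under stated assumptions: the `counts` values are made up, and an in-memory `io.StringIO` buffer stands in for the real `submission.csv` file.

```python
import csv
import io

counts = [2, 1, 3]  # hypothetical values for illustration

# In-memory buffer standing in for open("submission.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

writer.writerow(["id", "count"])      # header row
writer.writerows(enumerate(counts))   # one (id, count) row per input line

print(buffer.getvalue())
```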