In [1]:
# Text cleaning/Preprocessing

# Part 1 : Display punctuation characters
import string
print("Punctuation characters: ", string.punctuation)

Punctuation characters:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
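
`string.punctuation` holds only the 32 ASCII punctuation characters, so a symbol such as `€` is not covered by it. As a minimal sketch, the check below uses the standard `unicodedata` module to recognise any Unicode punctuation or symbol character. The helper name `is_punct_or_symbol` is hypothetical and not part of the original notebook.

```python
import string
import unicodedata


def is_punct_or_symbol(ch):
    """Return True if ch is in a Unicode punctuation (P*) or symbol (S*) category."""
    return unicodedata.category(ch)[0] in ("P", "S")


print(len(string.punctuation))     # number of ASCII punctuation characters
print("€" in string.punctuation)   # the euro sign is not an ASCII character
print(is_punct_or_symbol("€"))     # but it is a Unicode currency symbol
```

Whether currency symbols should be stripped depends on the task, so this is an option to consider rather than a required step.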


In [2]:
# Part 2 : Remove newlines and punctuation characters
import nltk

# Sample text
eu_definition = '''
The European Union (EU) is a supranational political and economic union of 27 member states that are located primarily in Europe.[9][10] 
The union has a total area of 4,233,255 km2 (1,634,469 sq mi) and an estimated population of over 449 million as of 2024. 
The EU is often described as a sui generis political entity combining characteristics of both a federation and a confederation.[11][12]

Containing 5.5% of the world population in 2023,[13] EU member states generated a nominal gross domestic product (GDP) of around €17.935 trillion in 2024,
accounting for approximately one sixth of global economic output.[14] 
Its cornerstone, the Customs Union, paved the way to establishing an internal single market based on standardised legal framework and legislation that applies in all member states in
those matters, and only those matters, where the states have agreed to act as one. EU policies aim to ensure the free movement of people, goods, services and capital within
the internal market;[15] enact legislation in justice and home affairs; and maintain common policies on trade,[16] agriculture,[17] fisheries and regional development.[18] 
Passport controls have been abolished for travel within the Schengen Area.[19] The eurozone is a group composed of the 20 EU member states that have fully implemented the EU's 
economic and monetary union and use the euro currency. Through the Common Foreign and Security Policy, the union has developed a role in external relations and defence.
It maintains permanent diplomatic missions throughout the world and represents itself at the United Nations, the World Trade Organization, the G7 and the G20.
Due to its global influence, the European Union has been described by some scholars as an emerging superpower.[20][21][22][needs update]
'''

eu_definition = eu_definition.replace('\n', ' ')  # Remove newlines

# First, print each punctuation character on its own line
print("\npunctuation list: ")
for punct in string.punctuation:
    print(punct)


punctuation list: 
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
:
;
<
=
>
?
@
[
\
]
^
_
`
{
|
}
~


In [3]:
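# Now strip every punctuation character from the text

for punct in string.punctuation:
    eu_definition = eu_definition.replace(punct, '')
print("\nText without punctuation: ")
# Side effect: citation markers fuse with words ("Europe910") and decimals lose their point ("5.5" becomes "55")
print(eu_definition)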


Text without punctuation: 
 The European Union EU is a supranational political and economic union of 27 member states that are located primarily in Europe910  The union has a total area of 4233255 km2 1634469 sq mi and an estimated population of over 449 million as of 2024  The EU is often described as a sui generis political entity combining characteristics of both a federation and a confederation1112  Containing 55 of the world population in 202313 EU member states generated a nominal gross domestic product GDP of around €17935 trillion in 2024 accounting for approximately one sixth of global economic output14  Its cornerstone the Customs Union paved the way to establishing an internal single market based on standardised legal framework and legislation that applies in all member states in those matters and only those matters where the states have agreed to act as one EU policies aim to ensure the free movement of people goods services and capital within the internal market15 enact l
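
The output above shows citation digits glued onto words ("Europe910", "202313"). One possible way to avoid this, sketched here under assumptions, is to delete the bracketed citation markers with a regular expression first and then map punctuation to spaces instead of removing it outright. The variable names and the short `sample` string are illustrative only, and the pattern assumes citations always appear as `[...]`.

```python
import re
import string

# Short excerpt of the sample text, used only for illustration
sample = ("located primarily in Europe.[9][10] Containing 5.5% of the world "
          "population in 2023,[13] EU member states")

# 1. Drop bracketed markers such as [9] or [needs update]
no_citations = re.sub(r"\[[^\]]*\]", "", sample)

# 2. Map every ASCII punctuation character to a space in a single pass
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# 3. Collapse the resulting runs of whitespace
cleaned = " ".join(no_citations.translate(table).split())
print(cleaned)
```

With this approach "Europe" and "2023" stay clean, but "5.5" is still split into "5 5", so numbers with decimal points would need separate handling if they matter for the analysis.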

In [4]:
# Another common preprocessing step: lowercase the text
eu_definition = eu_definition.lower()
print("\nLowercased text: ")
print(eu_definition)


Lowercased text: 
 the european union eu is a supranational political and economic union of 27 member states that are located primarily in europe910  the union has a total area of 4233255 km2 1634469 sq mi and an estimated population of over 449 million as of 2024  the eu is often described as a sui generis political entity combining characteristics of both a federation and a confederation1112  containing 55 of the world population in 202313 eu member states generated a nominal gross domestic product gdp of around €17935 trillion in 2024 accounting for approximately one sixth of global economic output14  its cornerstone the customs union paved the way to establishing an internal single market based on standardised legal framework and legislation that applies in all member states in those matters and only those matters where the states have agreed to act as one eu policies aim to ensure the free movement of people goods services and capital within the internal market15 enact legislatio

In [5]:
# Now tokenize (word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand)
tokenized_eu_definition = nltk.tokenize.word_tokenize(eu_definition)
print("\nTokenized text : ")
print(tokenized_eu_definition)


Tokenized text : 
['the', 'european', 'union', 'eu', 'is', 'a', 'supranational', 'political', 'and', 'economic', 'union', 'of', '27', 'member', 'states', 'that', 'are', 'located', 'primarily', 'in', 'europe910', 'the', 'union', 'has', 'a', 'total', 'area', 'of', '4233255', 'km2', '1634469', 'sq', 'mi', 'and', 'an', 'estimated', 'population', 'of', 'over', '449', 'million', 'as', 'of', '2024', 'the', 'eu', 'is', 'often', 'described', 'as', 'a', 'sui', 'generis', 'political', 'entity', 'combining', 'characteristics', 'of', 'both', 'a', 'federation', 'and', 'a', 'confederation1112', 'containing', '55', 'of', 'the', 'world', 'population', 'in', '202313', 'eu', 'member', 'states', 'generated', 'a', 'nominal', 'gross', 'domestic', 'product', 'gdp', 'of', 'around', '€17935', 'trillion', 'in', '2024', 'accounting', 'for', 'approximately', 'one', 'sixth', 'of', 'global', 'economic', 'output14', 'its', 'cornerstone', 'the', 'customs', 'union', 'paved', 'the', 'way', 'to', 'establishing', 'an', 
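
Because the text has already been lowercased and stripped of punctuation, the tokens here are simply whitespace-separated words. As a dependency-free stand-in (an assumption, not a replacement for `word_tokenize` on raw text), plain `str.split()` can be sketched on a short excerpt:

```python
# Excerpt of the cleaned, lowercased text (illustrative only)
cleaned = " the european union eu is a supranational political and economic union of 27 member states "

# split() with no arguments splits on any run of whitespace and drops empty strings
tokens = cleaned.split()
print(tokens[:10])
```

On unprocessed text the two approaches differ, since `word_tokenize` also separates punctuation and contractions into their own tokens.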

In [6]:
# Print first 10 tokens
print("\nFirst 10 tokens: ")
print(tokenized_eu_definition[0:10])


First 10 tokens: 
['the', 'european', 'union', 'eu', 'is', 'a', 'supranational', 'political', 'and', 'economic']


In [7]:
# Top 20 tokens
from pprint import pprint

from nltk.probability import FreqDist
print("\nTop 20 most common tokens : ")
fdist = FreqDist(tokenized_eu_definition)
top_tokens = fdist.most_common(20)
pprint(top_tokens)


Top 20 most common tokens : 
[('the', 22),
 ('and', 15),
 ('of', 10),
 ('a', 8),
 ('union', 7),
 ('in', 7),
 ('eu', 5),
 ('states', 5),
 ('member', 4),
 ('as', 4),
 ('to', 4),
 ('is', 3),
 ('economic', 3),
 ('that', 3),
 ('has', 3),
 ('an', 3),
 ('world', 3),
 ('have', 3),
 ('european', 2),
 ('political', 2)]
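
The top of the list is dominated by function words ("the", "and", "of"). A common follow-up, sketched here with the standard library only, is to filter such stopwords before counting. The small `STOPWORDS` set is a hand-picked assumption for illustration, not NLTK's stopword list, and `collections.Counter` is used because `FreqDist` is a subclass of it and shares `most_common`.

```python
from collections import Counter

# Hand-picked stopwords for illustration; a real list would be much longer
STOPWORDS = {"the", "and", "of", "a", "in", "as", "to", "is", "that", "has", "an", "have"}

# First 15 tokens from the tokenized text above
tokens = ["the", "european", "union", "eu", "is", "a", "supranational", "political",
          "and", "economic", "union", "of", "27", "member", "states"]

# Count only the tokens that are not stopwords
content_counts = Counter(t for t in tokens if t not in STOPWORDS)
print(content_counts.most_common(3))
```

The same filter can be applied to `tokenized_eu_definition` before building the `FreqDist`.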
