
#Normalizing Textual Data

#Textual data:
Textual data is systematically collected material consisting of written, printed, or electronically published words, typically either purposefully written or transcribed from speech.

#Text normalization:
Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since the input is guaranteed to be consistent before any operations are performed on it. Text normalization requires knowing what kind of text is being normalized and how it will be processed afterwards; there is no all-purpose normalization procedure.

## Steps for Text Normalization

Text normalization involves several steps to transform raw text into a consistent and standardized format suitable for processing. Here are the common steps:

1.  **Lowercasing**: Converting all text to lowercase to treat words like "The" and "the" as the same.

2.  **Removing Punctuation**: Eliminating punctuation marks (e.g., periods, commas, question marks) that might not be relevant for analysis.

3.  **Tokenization**: Breaking down the text into smaller units, usually words or subwords. This is a fundamental step for most NLP tasks.

4.  **Removing Stop Words**: Filtering out common words (e.g., "a", "an", "the", "is", "are") that carry little meaning and can clutter analysis.

5.  **Stemming/Lemmatization**: Reducing words to their base or root form.
    *   **Stemming** is a more aggressive, rule-based process that chops off the end of words (e.g., "running" -> "run", "jumps" -> "jump") and might not result in a valid word (e.g., the Porter stemmer reduces "studies" to "studi").
    *   **Lemmatization** is a more sophisticated process that uses vocabulary and morphological analysis to return the base or dictionary form of a word (e.g., "better" -> "good", "ran" -> "run").

6.  **Handling Numbers**: Deciding how to treat numerical data. This might involve removing them, converting them to a standard format, or representing them symbolically.

7.  **Removing Extra Whitespace**: Eliminating multiple spaces, tabs, and newlines to ensure consistent spacing.

8.  **Handling Special Characters and Emojis**: Deciding whether to remove, convert, or preserve special characters and emojis based on the specific task.

9.  **Expanding Abbreviations and Acronyms**: Replacing common abbreviations or acronyms with their full forms (e.g., "Dr." -> "Doctor", "ASAP" -> "As Soon As Possible").

10. **Spell Correction**: Correcting misspelled words, though this can be a complex step and depends on the specific use case.

_The choice and order of these steps can vary greatly depending on the specific application and the nature of the textual data._
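
As an illustration of step 9, and of why the order of steps matters, here is a minimal sketch of abbreviation expansion with a lookup table. The `ABBREVIATIONS` mapping and the `expand_abbreviations` helper are hypothetical names introduced for this example, not part of any library; a real table would have to be built for the domain at hand.

```python
# Hypothetical lookup table (an assumption for this sketch); extend it as needed.
ABBREVIATIONS = {
    "dr.": "Doctor",
    "asap": "as soon as possible",
    "e.g.": "for example",
}

def expand_abbreviations(text):
    """Replace whitespace-separated tokens found in ABBREVIATIONS (case-insensitive)."""
    expanded = []
    for token in text.split():
        # fall back to the original token when there is no entry for it
        expanded.append(ABBREVIATIONS.get(token.lower(), token))
    return " ".join(expanded)

print(expand_abbreviations("Dr. Smith will reply ASAP"))
# Doctor Smith will reply as soon as possible
```

Because keys such as "dr." and "e.g." rely on their periods, a lookup like this would have to run before punctuation is removed.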

#Text String

In [1]:
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)

       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows).


#Case Conversion (Lower Case)



In [2]:
# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()
print(lower_string)

       python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows).
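
For text that is not purely ASCII, Python also provides `str.casefold()`, a more aggressive variant of `lower()` intended for caseless matching. The short sketch below uses a German word chosen only for illustration; it is not part of the example text in this notebook.

```python
word = "Straße"

# lower() keeps the sharp s, casefold() expands it to "ss"
print(word.lower())     # straße
print(word.casefold())  # strasse

# casefold() makes the two spellings compare equal
print("STRASSE".casefold() == word.casefold())  # True
```

For English-only input such as the string used here, `lower()` and `casefold()` give the same result.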


#Removing Numbers



In [3]:
# import regex
import re

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)

       python ., released in , was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows  (and old installers not restricted to -bit windows).
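
The output above shows a side effect of deleting digits: "3.0" becomes a stray "." and "3.6.x[30]" becomes "..x[]". One alternative mentioned in the steps is to represent numbers symbolically. The sketch below replaces each number, including dotted ones, with a placeholder; the `NUM` token and the shortened sample sentence are assumptions made for this example.

```python
import re

text = "python 3.0, released in 2008, runs on 64-bit windows"

# match an integer or a dotted number such as 3.0 or 3.6.1 as a single unit
NUMBER = re.compile(r'\d+(?:\.\d+)*')

# "NUM" is an arbitrary placeholder; it consists of word characters only,
# so a later punctuation-removal step would leave it intact
with_placeholder = NUMBER.sub('NUM', text)
print(with_placeholder)
# python NUM, released in NUM, runs on NUM-bit windows
```

Whether to delete numbers or keep a placeholder depends on whether the downstream task needs to know that a number was present.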


#Removing Punctuation



In [4]:
# import regex
import re

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove every character that is not a word character (letter, digit, underscore) or whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)

       python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows
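
Note that deleting punctuation glues hyphenated words together ("end-of-life" becomes "endoflife") and turns "e.g." into "eg". The sketch below compares deleting punctuation with replacing it by a space, and also shows the standard-library `str.translate` approach. The shortened sample sentence is an assumption made for this example.

```python
import re
import string

text = "python's end-of-life, e.g. windows (and old installers)"

# deleting punctuation merges the parts of hyphenated words
deleted = re.sub(r'[^\w\s]', '', text)
print(deleted)
# pythons endoflife eg windows and old installers

# replacing punctuation with a space keeps the parts as separate tokens
spaced = re.sub(r'[^\w\s]', ' ', text)
print(spaced.split())

# str.translate removes exactly the ASCII characters in string.punctuation;
# unlike [^\w\s], that set also includes the underscore
table = str.maketrans('', '', string.punctuation)
translated = text.translate(table)
print(translated)
```

Replacing with a space leaves extra whitespace behind, so it is usually followed by a whitespace clean-up step.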


#Removing Whitespace



In [5]:
# import regex
import re

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove every character that is not a word character (letter, digit, underscore) or whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# strip leading and trailing whitespace (runs of spaces inside the text remain)
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)

python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows
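
As the output shows, `strip()` only trims the ends of the string; the double spaces left behind by the removed numbers are still there. A minimal sketch of collapsing internal whitespace as well, using a short sample string chosen for this example:

```python
import re

text = "   python  released in  was a\tmajor\n revision   "

# strip() alone only trims the ends; internal runs of whitespace remain
print(repr(text.strip()))

# collapse every run of spaces, tabs, and newlines into one space, then trim
collapsed = re.sub(r'\s+', ' ', text).strip()
print(repr(collapsed))
# 'python released in was a major revision'

# equivalent without regex: split on any whitespace and re-join
same = ' '.join(text.split())
print(collapsed == same)  # True
```

If the text is split into words right afterwards, `split()` with no arguments already ignores the extra whitespace, which is why the stop-word step that follows still works.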


#Removing Stop Words



In [6]:
# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'

# convert string to list of words (split() with no arguments splits on any run of whitespace)
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and re-join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
print(no_stpwords_string)

{'between', 'no', 'their', 'so', 'if', 'both', 'was', 'am', 'him', 'shan', "they're", 'has', 'we', "they'll", 'by', 'more', 'any', "mustn't", 'through', 'a', 'be', 'each', 'yours', 'isn', 's', 'again', 'having', 'that', "aren't", 'in', 'needn', "should've", 'weren', 'few', 'or', "she'll", 'these', 'only', "that'll", "needn't", 'from', "didn't", 've', 'can', 're', 'o', 'off', "it's", 'under', 'yourself', "hadn't", 'they', 'into', "i've", 'how', 'than', 'now', 'which', 'because', 'i', "couldn't", "you'll", 'out', 'me', 'this', 'did', "you've", "he's", 'up', "isn't", "you'd", 'all', 'but', 'he', "haven't", 'about', 'should', 'hasn', 'same', 'before', 'it', 'mustn', 'itself', "he'd", "she'd", 'just', 'then', 'other', 'll', 'down', 'ain', 'some', 'where', 'further', 'ours', "we're", 'against', 'had', "mightn't", 'once', 'themselves', 'wouldn', 'the', "it'll", 'do', 'such', 't', 'what', 'being', 'didn', 'm', 'will', 'your', 'when', 'below', 'own', "we'd", 'not', 'its', 'hadn', 'my', 'and', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
# import regex
import re

# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove every character that is not a word character (letter, digit, underscore) or whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# strip leading and trailing whitespace (runs of spaces inside the text remain)
no_wspace_string = no_punc_string.strip()

# convert string to list of words (split() with no arguments splits on any run of whitespace)
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and re-join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)

# output
print(no_stpwords_string)

['python', 'released', 'in', 'was', 'a', 'major', 'revision', 'of', 'the', 'language', 'that', 'is', 'not', 'completely', 'backward', 'compatible', 'and', 'much', 'python', 'code', 'does', 'not', 'run', 'unmodified', 'on', 'python', 'with', 'python', 's', 'endoflife', 'only', 'python', 'x', 'and', 'later', 'are', 'supported', 'with', 'older', 'versions', 'still', 'supporting', 'eg', 'windows', 'and', 'old', 'installers', 'not', 'restricted', 'to', 'bit', 'windows']
python released major revision language completely backward compatible much python code run unmodified python python endoflife python x later supported older versions still supporting eg windows old installers restricted bit windows


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
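
The same steps can be wrapped in a single reusable function. The sketch below is one possible arrangement, not the only one: the `normalize` name is hypothetical, and `DEMO_STOP_WORDS` is a tiny stand-in set used so the example runs without downloads. In this notebook the set would instead come from `stopwords.words('english')`.

```python
import re

# Small illustrative stop-word set (an assumption for this sketch);
# replace it with set(stopwords.words('english')) from NLTK in practice.
DEMO_STOP_WORDS = frozenset({"a", "in", "is", "not", "of", "that", "the", "was"})

def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, drop digits and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)       # remove digits
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    tokens = text.split()                 # also discards extra whitespace
    return ' '.join(t for t in tokens if t not in stop_words)

print(normalize("   Python 3.0, released in 2008, was a major revision of the language."))
# python released major revision language
```

Passing the stop-word set as a parameter makes it easy to swap in a different list for another language or task.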
