<a href="https://colab.research.google.com/github/azganushpoghosyan/nlp/blob/master/text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook walks through the core steps for converting raw text into a clean, standardized form suitable for natural language processing tasks.

In [2]:
# Importing libraries
import pandas as pd  # For working with structured data in a tabular form
import numpy as np  # Essential for numerical operations and working with arrays
import nltk  # Natural Language Toolkit for NLP tasks
import spacy  # An advanced NLP library
from nltk.stem import WordNetLemmatizer  # Lemmatization
from textblob import Word, TextBlob  # TextBlob for text processing
from nltk.corpus import stopwords  # Stopwords for filtering common words
from nltk.stem import PorterStemmer  # Stemming
import warnings  # For handling warnings
warnings.filterwarnings('ignore')  # Ignore warnings to improve readability
import re  # Regular expressions for text pattern matching and manipulation

In [3]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('vader_lexicon', quiet=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [4]:
# Create a DataFrame with a column named 'original_text' and populate it with the sample text
sample_text = 'Overview:\nThe PBNA Insights Reporting Analyst’s role will work primarily with Sparkling team where his/her role will be focused on enhancing and automating business reporting to fuel stronger, faster business performance insight for the PBNA Marketing and Insights teams. This includes connecting multiple data sources through curated metrics and developing calculated metrics to focus on the key outcome and diagnostic measures. A critical element of this role is to be able to deliver the strategic presentation focused around future Growth for PepsiCo.\nResponsibilities:\nExecute against team charter for Reporting vertical within SSC\no Execute market, portfolio, and brand level reporting of marketing KPI performance (utilizing dashboards, templated decks, and reporting tools)\no Leverage business performance explanations from teams around the world to incorporate considerations beyond data into reporting\no Explain business performance, drivers, and optimization opportunities\no Monitor key channel, customer, competitor (incl. PL) and emerging player performance and execute reporting at required intervals\no Deliver against needs of stakeholders, requestors and sector/functional leaders\no Support processes for output adherence and delivery to agreed scope – in line with the agreed timelines, aligned templates and content management\no Monitor and act upon regular feedback inputs from deliverables end-users and Business Partners\no Flag and monitor any business risks related to delivering the operational output (facilities, IT resources, recruitment efforts)\no Primary executor responsible for flawless support process and structure, including knowledge management and transfer\no Support communication processes with Reporting vertical leaders and Business Partners (project planning, workflow monitoring, quality checks, on-going changes)\no Help Reporting vertical leadership develop and finetune internal COE processes (work-flow mapping, pain-points and bottlenecks management) both related to service delivery and internal center operations\no Improve existing processes based on frequent end-user and Business Partner feedback loop Qualifications:'
df = pd.DataFrame({'original_text': [sample_text]})

### Preprocessing steps

*   Standardize text: Convert all characters in the text to lowercase so that variants such as "Reporting" and "reporting" are treated as the same word in later steps. Newline characters are also replaced with spaces.

*   Remove punctuation: Eliminate punctuation marks from the text. This step helps focus on the actual words and improves the consistency of tokenization.

*   Remove numbers: Exclude numerical digits from the text. Removing numbers can be beneficial in tasks where numeric values are not essential for analysis, such as text classification or sentiment analysis.

*   Remove English stopwords: Filter out common English stopwords (e.g., "the," "and," "is") that typically do not contribute much meaning to the text. Removing stopwords reduces noise in the data.

*   Tokenize the text: Tokenization breaks the text into individual words, or tokens. Stemming and lemmatization both operate on these tokens rather than on the raw string.

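*   Stem the tokens: Apply stemming to reduce words to a root form by stripping suffixes with fixed rules. Stemming collapses inflected variants such as "reporting" and "reports" into one form, although the resulting stem is not always a dictionary word (the Porter stemmer turns "studies" into "studi").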

*   Lemmatize the tokens: Lemmatization is another technique for reducing words to their base or dictionary form (lemma). Unlike stemming, lemmatization considers the context and meaning of words, providing a more accurate representation.
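
To make the first four steps concrete, here is a minimal sketch that uses only the standard library. The sentence and the short `STOP_WORDS` set are illustrative assumptions (the notebook itself uses NLTK's full English list); the two regular expressions are the ones used in `text_preprocessing`.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook uses nltk's stopwords.words('english') instead.
STOP_WORDS = {"the", "and", "is", "will", "on", "his", "her"}

raw = "The Analyst will focus on his/her 3 KPIs, and the Reporting is automated."

# Step 1: lowercase
standardized = raw.lower()
# Step 2: delete every character that is neither a word character nor whitespace
no_punctuation = re.sub(r"[^\w\s]", "", standardized)
# Step 3: delete digits
no_numbers = re.sub(r"\d", "", no_punctuation)
# Step 4: drop stopwords (split() also collapses repeated spaces)
no_stopwords = " ".join(w for w in no_numbers.split() if w not in STOP_WORDS)

print(no_punctuation)  # the analyst will focus on hisher 3 kpis and the reporting is automated
print(no_stopwords)    # analyst focus hisher kpis reporting automated

# Variant: replace punctuation with a space instead of deleting it
spaced = re.sub(r"\d", "", re.sub(r"[^\w\s]", " ", standardized))
alt = " ".join(w for w in spaced.split() if w not in STOP_WORDS)
print(alt)             # analyst focus kpis reporting automated
```

Deleting punctuation outright fuses "his/her" into "hisher", which then slips past the stopword filter. Substituting a space keeps the two words apart so both are removed. Which behavior is preferable depends on the data, so treat the variant as an option to weigh rather than a required change.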



In [5]:
# Load the spaCy model once; reloading it on every call is slow.
# The model must be installed (python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')

VALID_RESULTS = {'standardized_text', 'no_punctuation', 'no_numbers', 'no_stopwords',
                 'tokens', 'stemmed_tokens', 'lemmatized_tokens', 'clean_tokens'}

def text_preprocessing(text, result='clean_tokens'):
    """
    Preprocess raw text: standardize, remove punctuation, numbers and stopwords, tokenize, stem and lemmatize.

    Parameters:
    - text (str): The raw text to be preprocessed.
    - result (str): The step whose output is returned. Possible values: 'standardized_text', 'no_punctuation', 'no_numbers', 'no_stopwords', 'tokens', 'stemmed_tokens', 'lemmatized_tokens', 'clean_tokens'. Default is 'clean_tokens'.

    Returns:
    - str or list: A string for the text-level steps, a list of tokens for the token-level steps.

    Raises:
    - ValueError: If result is not one of the values listed above.
    """
    if result not in VALID_RESULTS:
        raise ValueError(f"Unknown result {result!r}; expected one of {sorted(VALID_RESULTS)}")

    text = str(text)
    # replace newline characters with spaces
    combined_text = text.replace('\n', ' ')

    # standardization of letters (make lowercase)
    standardized_text = combined_text.lower()
    if result == 'standardized_text':
        return standardized_text

    # remove punctuation (everything that is not a word character or whitespace)
    no_punctuation = re.sub(r'[^\w\s]', '', standardized_text)
    if result == 'no_punctuation':
        return no_punctuation

    # remove numbers
    no_numbers = re.sub(r'\d', '', no_punctuation)
    if result == 'no_numbers':
        return no_numbers

    # remove stopwords
    stop_words = set(stopwords.words('english'))
    no_stopwords = " ".join([word for word in no_numbers.split() if word not in stop_words])
    if result == 'no_stopwords':
        return no_stopwords

    # spaCy tokenization
    doc = nlp(no_stopwords)
    tokens = [token.text for token in doc]
    if result == 'tokens':
        return tokens

    # stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    if result == 'stemmed_tokens':
        return stemmed_tokens

    # lemmatization (reuses the document parsed during tokenization)
    lemmatized_tokens = [token.lemma_ for token in doc]
    if result == 'lemmatized_tokens':
        return lemmatized_tokens

    # final cleaning: remove empty strings, single letters and duplicates
    # (set() does not preserve the original token order)
    clean_tokens = [token for token in set(lemmatized_tokens) if token.strip() != '' and len(token) > 1]
    return clean_tokens
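
The final cleaning step deduplicates with `set`, which discards the order in which tokens appeared. If order matters downstream, one possible alternative is `dict.fromkeys`, which keeps the first occurrence of each token in place. The helper name `dedupe_keep_order` and the sample list are hypothetical, introduced only for this sketch.

```python
def dedupe_keep_order(tokens):
    """Drop blank tokens, single characters and repeats, keeping first-seen order."""
    kept = [t for t in tokens if t.strip() != '' and len(t) > 1]
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(kept))

# Hypothetical lemmatized tokens with repeats, a blank and a single letter
lemmas = ["report", "team", "a", "report", " ", "insight", "team"]
print(dedupe_keep_order(lemmas))  # ['report', 'team', 'insight']
```

This applies the same filter conditions as the function above, so only the ordering of the result differs.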

In [6]:
# Apply preprocessing function to the sample data
standardized_text = text_preprocessing(sample_text, 'standardized_text')
no_punctuation = text_preprocessing(sample_text, 'no_punctuation')
no_numbers = text_preprocessing(sample_text, 'no_numbers')
no_stopwords = text_preprocessing(sample_text, 'no_stopwords')
stemmed_tokens = text_preprocessing(sample_text, 'stemmed_tokens')
lemmatized_tokens = text_preprocessing(sample_text, 'lemmatized_tokens')
clean_tokens = text_preprocessing(sample_text, 'clean_tokens')
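
Each call above reruns every earlier step, so the early stages are computed seven times. As a rough sketch of an alternative, the steps can be chained once while every intermediate result is recorded. The helper `run_stages` and the three stand-in steps are assumptions for illustration, not part of the notebook; the spaCy and NLTK steps could be appended as further `(name, function)` pairs.

```python
import re

def run_stages(text, stages):
    """Apply each (name, function) pair in order and record every intermediate result."""
    results = {}
    current = text
    for name, step in stages:
        current = step(current)
        results[name] = current
    return results

# Stand-in steps mirroring the first three stages of text_preprocessing
stages = [
    ("standardized_text", lambda s: s.replace("\n", " ").lower()),
    ("no_punctuation", lambda s: re.sub(r"[^\w\s]", "", s)),
    ("no_numbers", lambda s: re.sub(r"\d", "", s)),
]

results = run_stages("Overview:\nTop 3 KPIs!", stages)
print(results["standardized_text"])  # overview: top 3 kpis!
print(results["no_punctuation"])     # overview top 3 kpis
print(results["no_numbers"])         # overview top  kpis
```

The returned dictionary maps each stage name to its output, which lines up with the column names used for the DataFrame.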

In [7]:
# Add the result of each step as a separate column in the dataframe
df['standardized_text'] = [standardized_text]
df['no_punctuation'] = [no_punctuation]
df['no_numbers'] = [no_numbers]
df['no_stopwords'] = [no_stopwords]
df['stemmed_tokens'] = [" ".join(stemmed_tokens)]
df['lemmatized_tokens'] = [" ".join(lemmatized_tokens)]
df['clean_tokens'] = [" ".join(clean_tokens)]
df.head()

|  | original_text | standardized_text | no_punctuation | no_numbers | no_stopwords | stemmed_tokens | lemmatized_tokens | clean_tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | Overview:\nThe PBNA Insights Reporting Analyst... | overview: the pbna insights reporting analyst’... | overview the pbna insights reporting analysts ... | overview the pbna insights reporting analysts ... | overview pbna insights reporting analysts role... | overview pbna insight report analyst role work... | overview pbna insight report analyst role work... | utilize incl charter support customer focus au... |
