# Natural Language Processing with Python

Natural Language Processing (NLP) is a field of computer science that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to enable computers to understand, interpret, and generate human language in ways that are useful to people.

NLP covers a wide range of techniques and applications. Some of the most common text preprocessing tasks are:

1. **Tokenization**: Breaking down text into smaller units, such as words or sentences.
2. **Lowercasing**: Converting all characters in the text to lowercase to ensure uniformity.
3. **Lemmatization**: Reducing words to their base or root form:
   - Example: "running" becomes "run"
   - Example: "tasks" becomes "task"
4. **Special Character Removal**: Stripping out punctuation, numbers, and other non-alphabetic characters from the text.
5. **Stopword Removal**: Eliminating common words (e.g., "the", "is", "and") that do not contribute significantly to the meaning of the text.
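
To make the five steps concrete before any library is introduced, here is a minimal standard-library sketch. The stopword set and lemma dictionary are tiny hand-written stand-ins invented for illustration (they are not spaCy's data), and `toy_preprocess` is a hypothetical helper name. A real lemmatizer handles far more cases than a dictionary lookup.

```python
import re

# Toy resources for illustration only (assumptions, not spaCy's data).
TOY_STOPWORDS = {"the", "is", "and", "a", "an", "with"}
TOY_LEMMAS = {"running": "run", "tasks": "task", "dogs": "dog"}


def toy_preprocess(text):
    """Apply the five steps above using only the standard library."""
    # 1. Tokenization: runs of word characters, or single non-space symbols
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # 2. Lowercasing
    tokens = [t.lower() for t in tokens]
    # 3. Lemmatization via dictionary lookup (unknown words pass through)
    tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    # 4. Special character removal: keep purely alphabetic tokens
    tokens = [t for t in tokens if t.isalpha()]
    # 5. Stopword removal
    return [t for t in tokens if t not in TOY_STOPWORDS]


print(toy_preprocess("The dogs running with 3 tasks!"))
# ['dog', 'run', 'task']
```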


## Import Packages

`spaCy` is an NLP library for Python that provides tools for tokenization, lemmatization, and more. You may have used `nltk` or `textblob` before, but `spaCy` is known for its speed and efficiency. For small tasks like this one, you will not notice much difference, but for larger datasets, `spaCy` can be significantly faster.


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import spacy
from collections import Counter

## NLP with `spaCy` using a String


In [2]:
# Load spaCy English model
# Make sure you've run: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Here is a sample text that we will process using `spaCy`:


In [3]:
original_text = "Celebrating 10 years with Mastercard has been an incredible journey - great benefits, flexible hours, and amazing colleagues!"

An `nlp` object is created using the `spacy.load()` function, which loads a pre-trained language model. In this case, we are using the English model `en_core_web_sm`. The text is then processed by calling `nlp()` on it, which creates a `Doc` object containing tokens and their linguistic features.


In [4]:
# Create a spaCy Doc
doc = nlp(original_text)

# Check the type of doc
type(doc)

spacy.tokens.doc.Doc

We can tokenize the text, convert it to lowercase, lemmatize the tokens, remove special characters, and eliminate stopwords. Although this tutorial intentionally breaks down each step for clarity, in practice, these steps can be combined into a single processing pipeline for efficiency.


### Tokenization

Tokenization is the process of breaking down text into smaller units called tokens, which can be words, phrases, or symbols. The text was already tokenized when we called `nlp()`: iterating over the `Doc` object yields `Token` objects, and the `text` attribute of each one gives the token's original string.


In [5]:
# Tokenization
tokens = [token.text for token in doc]

tokens[:5]

['Celebrating', '10', 'years', 'with', 'Mastercard']

### Lowercasing

We convert all tokens to lowercase to ensure uniformity. This helps in reducing the number of unique tokens, as "The" and "the" will be treated as the same token. Note that we are using Python's built-in `lower()` method for strings.


In [6]:
# Lowercasing
lower_tokens = [t.lower() for t in tokens]

lower_tokens[:5]

['celebrating', '10', 'years', 'with', 'mastercard']
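
The claim that lowercasing reduces the number of unique tokens can be checked on a small list. The tokens below are made up for illustration and do not come from the dataset.

```python
# Made-up tokens with mixed casing (illustrative only)
tokens = ["The", "pay", "is", "good", "and", "the", "Pay", "is", "fair"]

unique_before = set(tokens)
unique_after = {t.lower() for t in tokens}

# "The"/"the" and "pay"/"Pay" collapse into one entry each
print(len(unique_before), len(unique_after))
# 8 6
```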

### Lemmatization

Lemmatization is the process of reducing words to their base or root form, known as the lemma. This helps in normalizing words and reducing the number of unique tokens. For example, "running" becomes "run", and "tasks" becomes "task". In this case, we are using spaCy's built-in lemmatization capabilities to obtain the lemmas of the tokens.

The `lemma_` attribute of each token in the `Doc` object provides the lemmatized form of the token.


In [7]:
# Lemmatization
lemmas = [token.lemma_ for token in doc]

lemmas[:5]

['celebrate', '10', 'year', 'with', 'Mastercard']

:::{warning} The separate lowercasing step is redundant!

The stopword removal step that comes later also converts tokens to lowercase, so the separate lowercasing step shown earlier is redundant. It is included only for educational purposes, to illustrate lowercasing on its own.

:::


### Stopword Removal

Stopwords are common words that do not contribute significantly to the meaning of the text. Examples include "the", "is", "and", etc. Removing stopwords helps in reducing noise and focusing on the more meaningful words in the text.

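`spaCy` provides a built-in attribute `is_stop` for each token, which indicates whether the token is a stopword. We can use this attribute to filter out stopwords from our list of tokens. The code below also uses the `is_punct` and `is_space` attributes to drop punctuation and whitespace tokens, and it lowercases the lemma of each token that remains.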


In [8]:
# Stopword & punctuation removal (lemmatized + lowercased)
clean_tokens = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
]
clean_tokens[:5]

['celebrate', '10', 'year', 'mastercard', 'incredible']
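
The `is_stop`, `is_punct`, and `is_space` flags depend only on the token's string, not on the trained model. The sketch below inspects them with a blank English pipeline created by `spacy.blank("en")`, which needs no model download. Because a blank pipeline has no lemmatizer, this sketch keeps the lowercased token text (`lower_`) instead of the lemma; the names `blank_nlp`, `sample`, and `kept` are chosen for illustration.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only,
# so no trained model is required for this check.
blank_nlp = spacy.blank("en")
sample = blank_nlp("great benefits, flexible hours, and amazing colleagues!")

# Show the flags used by the filter for every token
for token in sample:
    print(f"{token.text!r:14} stop={token.is_stop!s:5} punct={token.is_punct}")

# Same filter as above, but without lemmatization
kept = [
    t.lower_
    for t in sample
    if not t.is_stop and not t.is_punct and not t.is_space
]
print(kept)
```

Only `and` and the punctuation marks are filtered out here, which matches the tail of the clean token list produced by the full model, apart from lemmatization.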

In [9]:
print("Original:", original_text)
print("Tokens:", tokens)
print("Lower tokens:", lower_tokens)
print("Lemmas:", lemmas)
print("Clean tokens (no stopwords/punct, lemmatized, lowercased):", clean_tokens)

Original: Celebrating 10 years with Mastercard has been an incredible journey - great benefits, flexible hours, and amazing colleagues!
Tokens: ['Celebrating', '10', 'years', 'with', 'Mastercard', 'has', 'been', 'an', 'incredible', 'journey', '-', 'great', 'benefits', ',', 'flexible', 'hours', ',', 'and', 'amazing', 'colleagues', '!']
Lower tokens: ['celebrating', '10', 'years', 'with', 'mastercard', 'has', 'been', 'an', 'incredible', 'journey', '-', 'great', 'benefits', ',', 'flexible', 'hours', ',', 'and', 'amazing', 'colleagues', '!']
Lemmas: ['celebrate', '10', 'year', 'with', 'Mastercard', 'have', 'be', 'an', 'incredible', 'journey', '-', 'great', 'benefit', ',', 'flexible', 'hour', ',', 'and', 'amazing', 'colleague', '!']
Clean tokens (no stopwords/punct, lemmatized, lowercased): ['celebrate', '10', 'year', 'mastercard', 'incredible', 'journey', 'great', 'benefit', 'flexible', 'hour', 'amazing', 'colleague']


## NLP with `spaCy` using a `DataFrame`

We can apply the same NLP techniques to a pandas `DataFrame` containing multiple reviews.


### Dataset

The dataset contains Glassdoor employee reviews for Mastercard. Each review has a unique `review_id` along with multiple ratings and text fields. We will focus on the text fields, which contain what employees liked or disliked about working at Mastercard.

Some reviews may contain special characters, mixed casing, and stopwords, which we will clean using the NLP techniques mentioned above.


In [10]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/bdi475/datasets/refs/heads/main/mastercard-glassdoor-reviews.csv"
)
df.head(3)

Unnamed: 0,review_id,review_date,review_text,review_liked_text,review_disliked_text,count_helpful,count_not_helpful,employer_responses,is_current_job,length_of_employment,...,rating_ceo,rating_compensation_and_benefits,rating_culture_and_values,rating_diversity_and_inclusion,rating_overall,rating_recommend_to_friend,rating_senior_leadership,rating_work_life_balance,job_title_text,location_name
0,101108831,2025-11-06,no,surroundings is very awesome and good,only the people who is master,0,0,,True,2,...,APPROVE,5.0,5,5,5,POSITIVE,5.0,5.0,Human Resources,"New York, NY"
1,101220670,2025-11-11,Continue nurturing the supportive culture and ...,Celebrating 10 years with Mastercard has been ...,"Like any fast-growing company, there have been...",0,0,,True,20,...,APPROVE,3.0,5,5,4,POSITIVE,5.0,4.0,Principal Engineer,"New York, NY"
2,100991760,2025-10-30,,Working in Business experimentation is extreme...,The pay is not competitive compared with other...,0,0,,True,0,...,APPROVE,4.0,5,5,4,POSITIVE,3.0,3.0,Consultant,"Arlington, VA"


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   review_id                         1000 non-null   int64  
 1   review_date                       1000 non-null   object 
 2   review_text                       389 non-null    object 
 3   review_liked_text                 1000 non-null   object 
 4   review_disliked_text              1000 non-null   object 
 5   count_helpful                     1000 non-null   int64  
 6   count_not_helpful                 1000 non-null   int64  
 7   employer_responses                38 non-null     object 
 8   is_current_job                    1000 non-null   bool   
 9   length_of_employment              1000 non-null   int64  
 10  rating_business_outlook           661 non-null    object 
 11  rating_career_opportunities       1000 non-null   float64
 12  rating_

There are 1000 reviews in total.


In [12]:
df.shape

(1000, 22)

### What did employees like about working at Mastercard?


In [13]:
token_lists = []

# Skip pipeline components we do not need, which speeds up processing:
# parser is the dependency parser
# ner is the named entity recognizer
for doc in nlp.pipe(df["review_liked_text"], disable=["parser", "ner"]):
    token_lists.append(
        [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct and not token.is_space
        ]
    )

df["review_liked_tokens"] = token_lists

df[["review_id", "review_liked_text", "review_liked_tokens"]].head()

Unnamed: 0,review_id,review_liked_text,review_liked_tokens
0,101108831,surroundings is very awesome and good,"[surrounding, awesome, good]"
1,101220670,Celebrating 10 years with Mastercard has been ...,"[celebrate, 10, year, mastercard, incredible, ..."
2,100991760,Working in Business experimentation is extreme...,"[work, business, experimentation, extremely, r..."
3,100092071,"Benefits such as 401k, sick leave, etc","[benefit, 401k, sick, leave, etc]"
4,100155384,The logo. The stock price. The ability to tell...,"[logo, stock, price, ability, tell, people, pa..."


In [14]:
# Explode liked_tokens so each token becomes its own row
df_liked_exploded = df[["review_id", "review_liked_tokens"]].explode("review_liked_tokens").reset_index(drop=True)
df_liked_exploded = df_liked_exploded.rename(columns={"review_liked_tokens": "token"})

# Inspect result
df_liked_exploded.head()

Unnamed: 0,review_id,token
0,101108831,surrounding
1,101108831,awesome
2,101108831,good
3,101220670,celebrate
4,101220670,10


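After processing the "liked" column, we explode the list of tokens so that each token becomes its own row in the DataFrame, and we rename the exploded column to `token`. Note that the code above does not drop missing or empty tokens: a review whose token list is empty still produces one row, with a missing value (`NaN`) in the `token` column.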
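
How `explode` treats an empty token list is easy to overlook, so here is a small sketch on a toy DataFrame (the data and the column name `tokens` are invented for illustration). It also shows one possible way to drop the resulting placeholder rows with `dropna`, which is an optional extra step and not part of the code above.

```python
import pandas as pd

# Toy frame; the second review has no tokens left after cleaning.
toy = pd.DataFrame(
    {
        "review_id": [1, 2, 3],
        "tokens": [["good", "pay"], [], ["great"]],
    }
)

# Each list element becomes its own row; the empty list becomes one NaN row
exploded = toy.explode("tokens").reset_index(drop=True)
print(exploded)

# Optionally drop the placeholder rows produced by empty lists
exploded_clean = exploded.dropna(subset=["tokens"]).reset_index(drop=True)
print(len(exploded), len(exploded_clean))
# 4 3
```

Since `value_counts()` ignores missing values by default, such rows affect the row count of the exploded DataFrame but not the token frequencies.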


In [15]:
df_liked_exploded.shape

(12919, 2)

The most common tokens in the "liked" reviews can be identified by counting the occurrences of each token in the exploded DataFrame.


In [16]:
df_liked_common_tokens = df_liked_exploded["token"].value_counts().to_frame().reset_index().head(30)
df_liked_common_tokens

Unnamed: 0,token,count
0,work,653
1,great,460
2,good,391
3,benefit,324
4,company,260
5,people,229
6,culture,211
7,life,184
8,balance,172
9,pay,159
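
As an alternative to exploding the DataFrame, the same kind of frequency table could be built directly from the token lists with `collections.Counter`. The sketch below uses a toy list of token lists as a stand-in for the `review_liked_tokens` column, so the numbers are illustrative only.

```python
from collections import Counter

# Toy stand-in for the `review_liked_tokens` column
toy_token_lists = [
    ["work", "great", "benefit"],
    ["great", "people", "work"],
    ["work", "culture"],
]

# Flatten the lists and count every token in one pass
counts = Counter(token for tokens in toy_token_lists for token in tokens)
print(counts.most_common(2))
# [('work', 3), ('great', 2)]
```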


### What did employees dislike about working at Mastercard?

We can repeat the same process for the disliked column to identify the most common tokens in that column as well.


In [17]:
token_lists = []

# Skip pipeline components we do not need, which speeds up processing:
# parser is the dependency parser
# ner is the named entity recognizer
for doc in nlp.pipe(df["review_disliked_text"], disable=["parser", "ner"]):
    token_lists.append(
        [
            token.lemma_.lower()
            for token in doc
            if not token.is_stop and not token.is_punct and not token.is_space
        ]
    )

df["review_disliked_tokens"] = token_lists

df[["review_id", "review_disliked_text", "review_disliked_tokens"]].head()

Unnamed: 0,review_id,review_disliked_text,review_disliked_tokens
0,101108831,only the people who is master,"[people, master]"
1,101220670,"Like any fast-growing company, there have been...","[like, fast, grow, company, grow, pain, adapt,..."
2,100991760,The pay is not competitive compared with other...,"[pay, competitive, compare, consulting, firm, ..."
3,100092071,Some of the upper management is bias. For exam...,"[upper, management, bias, example, 8, promotio..."
4,100155384,Where to begin? Mastercard is the poster child...,"[begin, mastercard, poster, child, corporate, ..."


In [18]:
# Explode disliked_tokens so each token becomes its own row
df_disliked_exploded = df[["review_id", "review_disliked_tokens"]].explode("review_disliked_tokens").reset_index(drop=True)
df_disliked_exploded = df_disliked_exploded.rename(columns={"review_disliked_tokens": "token"})

# Inspect result
df_disliked_exploded.head()

Unnamed: 0,review_id,token
0,101108831,people
1,101108831,master
2,101220670,like
3,101220670,fast
4,101220670,grow


In [19]:
df_disliked_common_tokens = df_disliked_exploded["token"].value_counts().to_frame().reset_index().head(30)
df_disliked_common_tokens

Unnamed: 0,token,count
0,work,383
1,company,254
2,management,176
3,team,166
4,people,165
5,time,136
6,mastercard,121
7,employee,117
8,x000d,115
9,office,113
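
The `x000d` token in the table above is probably not a real word. It looks like residue of `_x000D_`, an escape sequence that some spreadsheet exports use for a carriage return, although this is an assumption that has not been checked against the raw data. If that is the cause, one option is to strip the sequence from the text before it reaches `spaCy`. The helper name `strip_cr_escapes` is hypothetical.

```python
import re

# Assumption: the raw text contains literal "_x000D_" escape sequences
CR_ESCAPE = re.compile(r"_x000D_", flags=re.IGNORECASE)


def strip_cr_escapes(text):
    """Replace spreadsheet-style carriage-return escapes with a space."""
    without_escapes = CR_ESCAPE.sub(" ", text)
    # Collapse the whitespace runs left behind by the replacement
    return re.sub(r"\s+", " ", without_escapes).strip()


print(strip_cr_escapes("Long hours_x000D_\nLow pay"))
# Long hours Low pay
```

Under the same assumption, the helper could be applied to the column with `df["review_disliked_text"].map(strip_cr_escapes)` before the text is passed to `nlp.pipe`.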
