# Text Preprocessing

 - This tutorial demonstrates how to preprocess text data and perform exploratory data analysis (EDA) on a dataset of over 60,000 Spotify app reviews scraped from the Google Play Store. You can download the dataset [here](https://www.kaggle.com/datasets/mfaaris/spotify-app-reviews-2022).

 - Our goal is to clean the review text, explore it for key patterns, and prepare it for downstream NLP tasks such as sentiment analysis or topic modeling.

## 0. Objectives:

1. Import the dataset
2. Exploratory Data Analysis (EDA) for text data
3. Understand and apply basic text preprocessing techniques: cleaning, tokenization, stopword removal, and lemmatization

## 1. Import Data

In [1]:
import pandas as pd

In [2]:
# import data to pandas dataframe

raw_reviews = pd.read_csv("reviews.csv")

raw_reviews.head()

,Time_submitted,Review,Rating,Total_thumbsup,Reply
0,2022-07-09 15:00:00,"Great music service, the audio is high quality...",5,2,
1,2022-07-09 14:21:22,Please ignore previous negative rating. This a...,5,1,
2,2022-07-09 13:27:32,"This pop-up ""Get the best Spotify experience o...",4,0,
3,2022-07-09 13:26:45,Really buggy and terrible to use as of recently,1,1,
4,2022-07-09 13:20:49,Dear Spotify why do I get songs that I didn't ...,1,1,


## 2. EDA 

In [3]:
# the info method gives you an overview of the data

raw_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61594 entries, 0 to 61593
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Time_submitted  61594 non-null  object
 1   Review          61594 non-null  object
 2   Rating          61594 non-null  int64 
 3   Total_thumbsup  61594 non-null  int64 
 4   Reply           216 non-null    object
dtypes: int64(2), object(3)
memory usage: 2.3+ MB
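
The `info()` output above shows that `Reply` is filled in for only 216 of the 61,594 rows. As a minimal sketch of how to quantify missing values per column, the snippet below uses a small invented DataFrame (`toy` is a stand-in, not the real dataset):

```python
import pandas as pd

# Toy stand-in for raw_reviews; the values are invented for illustration
toy = pd.DataFrame({
    "Review": ["Great app", "Keeps crashing", "Love the playlists", "Too many ads"],
    "Rating": [5, 1, 5, 2],
    "Reply": [None, "Sorry to hear that!", None, None],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of missing values in each column (the mean of a boolean mask)
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Applied to `raw_reviews`, the same two calls would show at a glance whether a column such as `Reply` is too sparse to be useful for analysis.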


In [4]:
raw_reviews.columns

Index(['Time_submitted', 'Review', 'Rating', 'Total_thumbsup', 'Reply'], dtype='object')

In [6]:
# lowercase the column names

# .str is the string accessor: it exposes vectorized string methods on an Index or Series

raw_reviews.columns = raw_reviews.columns.str.lower()

raw_reviews.columns

Index(['time_submitted', 'review', 'rating', 'total_thumbsup', 'reply'], dtype='object')
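
String-accessor methods can be chained, which helps when column names need more than lowercasing. The sketch below uses hypothetical messy column names (not the ones in this dataset) to show trimming, lowercasing, and unifying separators in one pass:

```python
import pandas as pd

# Hypothetical messy column names, invented for illustration
toy = pd.DataFrame(columns=[" Time Submitted", "Review ", "Total-Thumbsup"])

# Chain string-accessor methods: trim, lowercase, then turn spaces/hyphens into underscores
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(r"[\s\-]+", "_", regex=True)
)

print(list(toy.columns))
```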

In [10]:
# checking the distribution of the reviews in terms of rating

raw_reviews['rating'].value_counts(normalize=True).sort_values(ascending=False)

rating
5    0.358720
1    0.286603
4    0.127318
2    0.115563
3    0.111797
Name: proportion, dtype: float64
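
`value_counts()` already returns its result sorted from most to least frequent. If you would rather read the distribution in star order (1 to 5), sorting by index is one option. A small sketch with made-up ratings:

```python
import pandas as pd

# Made-up ratings, for illustration only
ratings = pd.Series([5, 1, 5, 4, 1, 5, 3, 2, 5, 1], name="rating")

# Default ordering: most frequent rating first
by_frequency = ratings.value_counts(normalize=True)

# Ordered by the rating value itself
by_star = ratings.value_counts(normalize=True).sort_index()

print(by_frequency)
print(by_star)
```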

In [11]:
# mean rating of spotify for this dataset

raw_reviews['rating'].mean()

3.1559892197291943

In [12]:
# checking the average review length in characters

raw_reviews['review_length'] = raw_reviews['review'].apply(len)  

print(f"Average review length: {raw_reviews['review_length'].mean():.2f} characters")

Average review length: 163.32 characters


In [13]:
# we can also check the distribution of the review lengths

raw_reviews['review_length'].describe()

count    61594.000000
mean       163.323457
std        119.940997
min         10.000000
25%         72.000000
50%        130.000000
75%        221.000000
max       3753.000000
Name: review_length, dtype: float64
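
Character counts are only one way to measure length. For text analysis, a word count is often easier to interpret. The sketch below computes both on a few invented reviews, using a naive whitespace split as the definition of a word (an assumption; a tokenizer would count slightly differently):

```python
import pandas as pd

# Invented reviews, for illustration only
toy = pd.DataFrame({"review": ["Great app", "Too many ads lately", "Love it so much!"]})

# Length in characters (equivalent to .apply(len) on string data)
toy["review_length"] = toy["review"].str.len()

# Length in words: split on whitespace, then count the pieces
toy["word_count"] = toy["review"].str.split().str.len()

print(toy)
```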

In [15]:
# longest review!!!

# loc[row/index_label, column_name(s)]

raw_reviews.loc[raw_reviews['review_length']==3753, 'review'].values

array(["I very rarely leave reviews and if I do, they're almost never negative but I feel compelled in this instance, as the Spotify app has got to be the worst app I have ever used. I'm on pay as you go with no Internet access unless I'm on Wi-Fi and all I use this app for is to listen to podcasts which I download in full before my work commute. It's obvious it simply has not been designed with any consideration for a user like myself. I would go as far as to state that it's terrible for podcasts in general, regardless of how you use your phone. To give you some examples: 1. When I'm at work, my phone will connect to the work Wi-Fi. There's an internal authorisation process to let devices use Internet when on the company network. What happens is that I can't play any podcasts if my phone's Wi-Fi is on - I press the play button, but nothing happens. I have two phones, one authorised to use the network and one not and this happens on both of them. I think what is taking place in the fir
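
The lookup above hardcodes the maximum length taken from `describe()`. A sketch of an alternative that avoids the hardcoded number is `idxmax()`, which returns the index label of the first row holding the maximum. The toy data is invented:

```python
import pandas as pd

# Invented reviews, for illustration only
toy = pd.DataFrame({"review": ["ok", "a much longer review text", "short one"]})
toy["review_length"] = toy["review"].str.len()

# Index label of the longest review (the first one, if several tie)
longest_label = toy["review_length"].idxmax()

# loc[row_label, column_name] retrieves the text itself
longest_review = toy.loc[longest_label, "review"]

print(longest_label, longest_review)
```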

In [None]:
# examine the span of time for the reviews

raw_reviews['time_submitted'] = pd.to_datetime(raw_reviews['time_submitted'])

raw_reviews.info()

In [None]:
raw_reviews.sort_values('time_submitted', inplace=True, ignore_index=True)

raw_reviews.head()

In [None]:
raw_reviews.tail()
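
Sorting and then inspecting `head()` and `tail()` reveals the first and last reviews. If only the time span is needed, `min()` and `max()` on the datetime column give it directly. A minimal sketch with invented timestamps:

```python
import pandas as pd

# Invented timestamps, for illustration only
toy = pd.DataFrame({
    "time_submitted": ["2022-07-09 15:00:00", "2022-01-01 08:30:00", "2022-03-15 12:00:00"]
})
toy["time_submitted"] = pd.to_datetime(toy["time_submitted"])

first = toy["time_submitted"].min()
last = toy["time_submitted"].max()

# Subtracting two timestamps yields a Timedelta
span = last - first

print(f"{first} to {last} ({span.days} days)")
```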

## 3. Text Preprocessing

Text preprocessing transforms raw text into clean, structured data suitable for analysis.
This includes steps like lowercasing, removing punctuation and stopwords, tokenizing the text, and lemmatizing words.

#### Steps in Preprocessing
1. **Lowercasing:** Treat words like "Music" and "music" as the same
2. **Removing Punctuation/Numbers:** Remove symbols and digits that don't add meaning to the analysis
3. **Tokenization:** Split text into individual words (tokens)
4. **Stopword Removal:** Remove common words like "the" and "is" that don't carry significant meaning
5. **Lemmatization:** Normalize words to their base form (e.g., "children" → "child")


In [None]:
# install NLTK package for NLP

!pip install nltk

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

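# Punkt is a pre-trained, unsupervised sentence tokenizer in the NLTK library; word_tokenize relies on it
nltk.download('punkt')
# Newer NLTK releases look up the 'punkt_tab' resource instead, so fetch it as well
nltk.download('punkt_tab')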

# List of stopwords in many spoken languages
nltk.download('stopwords')

# WordNet is a large, lexical database of the English language. Here it will handle the lemmatization
nltk.download('wordnet')

In [None]:
# let's examine a single review

sample = raw_reviews['review'][600]

sample

In [None]:
# lowercasing

sample = sample.lower()

sample

 - `re` is short for regular expressions
 - It is a module in Python's standard library that lets us find patterns in text and act on the matches (extract, replace, split)
 - You can find the documentation [here](https://docs.python.org/3/library/re.html)
 - For writing text search patterns, a.k.a. regex, take a look at this detailed [cheatsheet](https://media.datacamp.com/legacy/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf)
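
As a quick orientation before the cleaning steps, here is a small sketch of three commonly used `re` functions applied to an invented review string:

```python
import re

# Invented review text, for illustration only
text = "Update 8.4.2 broke shuffle!!! Please fix it :("

# re.findall returns every non-overlapping match of the pattern
words = re.findall(r"[A-Za-z]+", text)

# re.sub replaces every match with the given replacement string
no_repeats = re.sub(r"!+", "!", text)

# re.search returns a match object for the first match, or None if nothing matches
version = re.search(r"\d+(\.\d+)+", text)

print(words)
print(no_repeats)
print(version.group())
```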


In [None]:
# removing the punctuation 

# ^ inside []: Negates the set, meaning "anything NOT matching this set"
# \w: Matches any word character (letters, digits, or underscores)
# \s: Matches any whitespace character (spaces, tabs, etc.)
# so, [^\w\s] matches any character that is not a word character or whitespace

sample = re.sub(r'[^\w\s]', '', sample)

sample
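
One side effect of deleting punctuation outright is worth knowing: contractions and hyphenated words are fused into single tokens, and underscores survive because `\w` includes them. Fused contractions may no longer match the entries of a stopword list. The sketch below contrasts deletion with replacing punctuation by a space, on an invented string. Which variant is preferable depends on the downstream task.

```python
import re

# Invented text, for illustration only
text = "don't like the ads, user_name says it's a 5-star app!"

# Variant used in this tutorial: delete every non-word, non-whitespace character
cleaned = re.sub(r"[^\w\s]", "", text)

# Alternative: replace punctuation with a space, then collapse repeated whitespace
spaced = " ".join(re.sub(r"[^\w\s]", " ", text).split())

print(cleaned)
print(spaced)
```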

In [None]:
# remove digits

# \d: Matches any digit (0-9)
# +: Matches one or more of the preceding token

sample = re.sub(r'\d+', '', sample)

sample
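
Removing digits leaves gaps behind and also trims digits that are part of a word. The tokenizer used in the next step splits on whitespace anyway, but if you want a tidy string at this stage, collapsing whitespace is one option. A sketch on an invented string:

```python
import re

# Invented text, for illustration only
text = "the mp3 player crashed 3 times since version 87"

# Delete every run of digits; note that "mp3" becomes "mp"
no_digits = re.sub(r"\d+", "", text)

# Collapse runs of whitespace into one space and trim the ends
normalized = re.sub(r"\s+", " ", no_digits).strip()

print(repr(no_digits))
print(repr(normalized))
```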

In [None]:
# Tokenize the text

sample_tokens = word_tokenize(sample)

print(sample_tokens)

In [None]:
# remove stopwords

stop_words = set(stopwords.words('english'))

sample_tokens = [word for word in sample_tokens if word not in stop_words]

print(sample_tokens)
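
A caveat if the cleaned text is meant for sentiment analysis: NLTK's English stopword list contains negations such as "not" and "no", so removing stopwords can flip the apparent meaning of a review. The sketch below uses a small hand-written stopword set as a stand-in (an assumption made to keep the example self-contained) and shows one way to keep negations:

```python
# Hand-written stand-in for a stopword list (an assumption for illustration);
# the real NLTK list is much longer
stop_words = {"the", "is", "a", "this", "not", "no", "and"}

tokens = ["this", "app", "is", "not", "good"]

# Plain stopword removal also drops the negation
filtered = [t for t in tokens if t not in stop_words]

# Excluding negations from the stopword set preserves them
negations = {"not", "no"}
custom_stop_words = stop_words - negations
filtered_keep_neg = [t for t in tokens if t not in custom_stop_words]

print(filtered)
print(filtered_keep_neg)
```

With the real list, the same set difference can be applied to `set(stopwords.words('english'))`.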

In [None]:
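# lemmatize the resulting tokens
# note: lemmatize() treats every word as a noun unless a pos argument is passed (e.g. pos='v' for verbs)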

lemmatizer = WordNetLemmatizer()

sample_tokens = [lemmatizer.lemmatize(word) for word in sample_tokens]

print(sample_tokens)

In [None]:
# regroup the tokens as a single text

sample_cleaned = ' '.join(sample_tokens)

sample_cleaned

In [None]:
# now we can build a cleaning function that has all these steps and apply it to the entire dataset


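# Build the stopword set and the lemmatizer once, not on every call:
# recreating them for each of the ~61,000 reviews is needlessly slow
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stopwords, and lemmatize."""
    # Lowercase the text
    text = text.lower()
    # Remove punctuation, then digits
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Lemmatize the tokens (each token is treated as a noun by default)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    return ' '.join(tokens)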

# Apply preprocessing
raw_reviews['cleaned_review'] = raw_reviews['review'].apply(preprocess_text)
raw_reviews.head()


In [None]:
# adding a new character count after preprocessing

raw_reviews['cleaned_review_length'] = raw_reviews['cleaned_review'].apply(len)

print(f"Average cleaned review length: {raw_reviews['cleaned_review_length'].mean():.2f} characters")
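
Beyond the average, it can help to check how much text each review kept and whether any review ended up empty, which can happen when a review consists only of stopwords and punctuation. The sketch below uses hand-written `cleaned_review` values as stand-ins for the output of `preprocess_text` (they are assumptions, not actual results):

```python
import pandas as pd

# Invented reviews with hand-written "cleaned" versions, for illustration only
toy = pd.DataFrame({
    "review": ["This is it!!!", "Great music service", "Ads, ads and more ads"],
    "cleaned_review": ["", "great music service", "ad ad ad"],
})

toy["review_length"] = toy["review"].str.len()
toy["cleaned_review_length"] = toy["cleaned_review"].str.len()

# Fraction of the original characters that survived preprocessing
toy["kept_ratio"] = toy["cleaned_review_length"] / toy["review_length"]

# Flag and drop reviews that became empty after cleaning
empty_mask = toy["cleaned_review"].str.strip() == ""
non_empty = toy[~empty_mask].reset_index(drop=True)

print(toy[["review_length", "cleaned_review_length", "kept_ratio"]])
print(f"Empty after cleaning: {empty_mask.sum()}")
```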