# **Text Preprocessing**

---
### Import libraries
- **`pandas`**: Used to manage and manipulate tabular data structures like DataFrames.
- **`re`**: Python’s built-in regular expressions library. Useful for pattern matching and cleaning text (e.g., removing punctuation or special characters).
- **`nltk.corpus.stopwords`**: Provides a list of common stopwords (e.g., "the", "and") that are typically removed from text before analysis.
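- **`PorterStemmer`**: A rule-based stemming algorithm from NLTK that strips suffixes to reduce words to a stem (e.g., "running" → "run"). The stem is not always a dictionary word (e.g., "studies" → "studi").
- **`WordNetLemmatizer`**: Reduces words to their lemma (dictionary base form) by looking them up in WordNet instead of stripping suffixes. It treats every word as a noun unless a part of speech is passed, so "better" → "good" only when called with `pos="a"`.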


In [1]:
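import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# One-time corpus downloads; only needed if the NLTK data is not installed yet.
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)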

## 📂 Load the Saved Coursera Dataset

In [2]:
data = pd.read_csv('coursera_course_dataset.csv')
data

| Unnamed: 0.1 | Unnamed: 0 | Title | Organization | Skills | Metadata |
|---|---|---|---|---|---|
| 0 | 0 | Machine Learning | Multiple educators | Unsupervised Learning, Supervised Learning, M... | Beginner · Specialization · 1 - 3 Months |
| 1 | 1 | Machine Learning with Python | IBM | Unsupervised Learning, Supervised Learning, R... | Intermediate · Course · 1 - 3 Months |
| 2 | 2 | Mathematics for Machine Learning and Data Science | DeepLearning.AI | Descriptive Statistics, Bayesian Statistics, ... | Intermediate · Specialization · 1 - 3 Months |
| 3 | 3 | IBM Machine Learning | IBM | Exploratory Data Analysis, Feature Engineerin... | Intermediate · Professional Certificate · 3 - ... |
| 4 | 4 | Python for Data Science, AI & Development | IBM | Jupyter, Python Programming, Data Structures,... | Beginner · Course · 1 - 3 Months |
| ... | ... | ... | ... | ... | ... |
| 583 | 583 | Introduction to Artificial Intelligence (AI) | IBM | Generative AI, ChatGPT, Natural Language Proc... | Beginner · Course · 1 - 4 Weeks |
| 584 | 584 | Mathematics for Machine Learning | Imperial College London | Linear Algebra, Dimensionality Reduction, Num... | Beginner · Specialization · 3 - 6 Months |
| 585 | 585 | Fundamentals of Machine Learning and Artificia... | Amazon Web Services | Artificial Intelligence and Machine Learning ... | Mixed · Course · 1 - 4 Weeks |
| 586 | 586 | Supervised Machine Learning: Regression and Cl... | DeepLearning.AI | Supervised Learning, Jupyter, Scikit Learn (M... | Beginner · Course · 1 - 4 Weeks |


## 🧼 Text Preprocessing Pipeline

This section defines several functions to clean and prepare raw text for analysis, typically used in Natural Language Processing (NLP) tasks.

---

### 🔧 Preprocessing Setup

- **`PorterStemmer`**: Used for stemming (reducing words to their base/root form).
- **`WordNetLemmatizer`**: Used for lemmatization (reducing words to their proper dictionary form).
- **`stop_words`**: A set of common English stopwords to be removed during cleaning.

In [3]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
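The NLTK English stopword list is all lowercase, and the set lookup is case-sensitive, so text needs to be lowercased before stopwords are filtered out. The sketch below illustrates this with a small hypothetical stand-in set (`DEMO_STOP_WORDS`) and a hypothetical helper (`drop_stopwords`) so that it runs without the NLTK corpus; neither name is part of the notebook.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real NLTK
# list is much longer but is likewise all lowercase.
DEMO_STOP_WORDS = {"the", "and", "for", "to", "of"}

def drop_stopwords(text: str, stop_words: set) -> str:
    """Remove whitespace-separated tokens that appear in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

title = "Introduction To The Mathematics For Machine Learning"

# Without lowercasing, capitalized stopwords slip through the membership test.
not_lowered = drop_stopwords(title, DEMO_STOP_WORDS)

# Lowercasing first lets the set lookup match every stopword.
lowered = drop_stopwords(title.lower(), DEMO_STOP_WORDS)

print(not_lowered)
print(lowered)
```

Using a `set` rather than a list also keeps each membership test constant-time, which helps when the filter runs over every word of every row.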

### 🧱 Text Cleaning Functions

In [4]:
def lowering(text: str) -> str:
    text = text.lower()
    return text

def remove_punctuation_and_symbol(text: str) -> str:
    text = re.sub(r'[^\w\s]', '', text)
    return text

def stopword_removal(text: str) -> str:
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

def stemming(text: str) -> str:
    text = " ".join([stemmer.stem(word) for word in text.split()])
    return text

def lemmatization(text: str) -> str:
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    return text

def preprocessing(text: str) -> str:

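    # Normalize case first so the lowercase stopword list can match.
    text = lowering(text)
    text = remove_punctuation_and_symbol(text)
    text = stopword_removal(text)
    # Stem, then lemmatize; the lemmatizer therefore sees stems, not the original words.
    text = stemming(text)
    text = lemmatization(text)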

    return text
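To see what the `[^\w\s]` pattern does to course text, here is a minimal, standalone sketch that reuses `remove_punctuation_and_symbol` on a few made-up strings (the samples are illustrative, not rows from the dataset). It deletes every character that is not a letter, digit, underscore, or whitespace, without inserting a space in its place.

```python
import re

def remove_punctuation_and_symbol(text: str) -> str:
    # Same pattern as the notebook cell: drop everything that is not a word
    # character (letters, digits, underscore) or whitespace.
    return re.sub(r'[^\w\s]', '', text)

# Illustrative inputs, already lowercased as they would be in the pipeline.
samples = [
    "python for data science, ai & development",
    "scikit-learn (machine learning library)",
    "c++ programming",
]

results = []
for sample in samples:
    cleaned = remove_punctuation_and_symbol(sample)
    # split()/join collapses the double spaces left behind by removed symbols,
    # which is what the later split-based steps in the pipeline effectively do.
    results.append(" ".join(cleaned.split()))

for result in results:
    print(repr(result))
```

Two side effects are worth keeping in mind: hyphenated terms are fused into one token ("scikit-learn" becomes "scikitlearn"), and symbol-bearing names lose information ("c++" becomes "c"). If that matters for the `Skills` column, replacing punctuation with a space instead of an empty string is one possible alternative.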

## 🧪 Apply Preprocessing to Dataset Columns

The following lines apply the full preprocessing pipeline to the `Skills` and `Title` columns and store the results in two new columns, `Skills_processed` and `Title_processed`.

---

In [6]:
# Apply preprocessing to the 'Skills' column
data['Skills_processed'] = data['Skills'].apply(preprocessing)
# Apply preprocessing to the 'Title' column
data['Title_processed'] = data['Title'].apply(preprocessing)
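`apply(preprocessing)` calls `text.lower()` on every cell, so a missing value would raise an `AttributeError`, because pandas reads empty CSV cells as the float `NaN`. Whether this dataset contains missing values cannot be told from the output shown here. As a precaution, a guard along the following lines could be used; `safe_preprocessing` is a hypothetical wrapper, and `preprocessing` is replaced by a lowercase-only stand-in so the sketch runs without NLTK.

```python
def preprocessing(text: str) -> str:
    # Stand-in for the notebook's full pipeline; it only lowercases here so
    # the sketch runs on its own.
    return text.lower()

def safe_preprocessing(value) -> str:
    """Hypothetical wrapper: return '' for missing or non-string cells."""
    if not isinstance(value, str):
        return ""
    return preprocessing(value)

# A string cell, a NaN cell (as pandas would produce), and a None cell.
cells = ["Machine Learning", float("nan"), None]
processed = [safe_preprocessing(cell) for cell in cells]
print(processed)
```

In the notebook, the wrapper would be passed in place of the original function, for example `data['Skills'].apply(safe_preprocessing)`.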

In [7]:
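# Drop the raw text columns and one leftover index column ('Unnamed: 0.1' is kept).
data.drop(['Unnamed: 0', 'Title', 'Skills'], axis=1, inplace=True)
# index=False avoids writing the DataFrame index as another unnamed column.
data.to_csv('processed_coursera_course_dataset.csv', index=False)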