# Thuto Wesley Sephai
## Umuzi Experience Gig XPL2
## Check-in Feedback Analysis - 11 November 2025


### Project Description
NLP Trend Identification

Conduct a thorough Natural Language Processing (NLP) analysis of the student check-in data. The primary goal is to transform the unstructured text in the "Win", "Loss", and "Blocker" columns into quantifiable, actionable insights. Identifying and quantifying the most frequent topics and sentiment trends will uncover the core experiences of the student cohort. This analysis will directly inform program management about what is working well and where immediate improvements are needed.


### What are the expected outcomes?

1. Code Repository: A dedicated Git repository containing a self-contained Jupyter Notebook or Python script.

2. Top Trends Summary: A concise report summarizing the:
- Top 5 most frequent Win themes
- Top 5 most frequent Loss themes
- Top 5 most frequent Blocker themes

3. Visual Dashboard: Data visualisations that illustrate the frequency and distribution of the identified themes.

4. Recommendations: A concluding section that provides 3-5 concrete recommendations for the program team based on your findings.


In [2]:
# import libraries
import pandas as pd
import re
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
import seaborn as sns



In [3]:
# load dataset
df = pd.read_csv('Copy of Umuzi XB1 Check in (Responses) - Form Responses 1 - Copy of Umuzi XB1 Check in (Responses) - Form Responses 1.csv')

In [4]:
# print the first few rows of the dataframe
print(df.head())

           Timestamp  Column 2  Full name Please enter the date today  \
0  7/9/2025 14:34:49       NaN  Student 1                    7/9/2025   
1  7/9/2025 14:43:15       NaN  Student 2                    7/9/2025   
2  7/9/2025 14:49:40       NaN  Student 3                    7/9/2025   
3  7/9/2025 14:50:41       NaN  Student 4                    7/9/2025   
4  7/9/2025 15:14:46       NaN  Student 5                    7/9/2025   

  Share a win from the last week (what went well, something you enjoyed)  \
0  Completing my first week with Umuzi gave me co...                       
1  I enjoyed introspecting myself on the basis of...                       
2  Submitting all my work in time and completing ...                       
3       I submitted most of the assigned assignments                       
4  I enjoyed the Life Lifeline activity. I got to...                       

  Share a loss (something that was challenging or did not go well)  \
0  I didn’t get opportunities from

In [5]:
# check columns
df.columns

Index(['Timestamp', 'Column 2', 'Full name', 'Please enter the date today',
       'Share a win from the last week (what went well, something you enjoyed)',
       'Share a loss (something that was challenging or did not go well)',
       'Share a blocker, if any (anything that stopped you from doing what you needed to do)',
       'Anything else you would like to share or ask'],
      dtype='object')

In [7]:
# check duplicates
df.duplicated().sum()

np.int64(0)

In [6]:
# missing values
df.isnull().sum()

Timestamp                                                                                 0
Column 2                                                                                372
Full name                                                                                 0
Please enter the date today                                                               0
Share a win from the last week (what went well, something you enjoyed)                   12
Share a loss (something that was challenging or did not go well)                         69
Share a blocker, if any (anything that stopped you from doing what you needed to do)     95
Anything else you would like to share or ask                                            154
dtype: int64

In [10]:
# focus on relevant columns
Win = "Share a win from the last week (what went well, something you enjoyed)"
Loss = "Share a loss (something that was challenging or did not go well)"
Blocker = "Share a blocker, if any (anything that stopped you from doing what you needed to do)"

In [11]:
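# handle missing values: replace NaN with an empty string
# (assign the result back; calling fillna(..., inplace=True) on a selected column
# raises a FutureWarning in pandas 2.x and does not update df under copy-on-write)
df[Win] = df[Win].fillna("")
df[Loss] = df[Loss].fillna("")
df[Blocker] = df[Blocker].fillna("")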
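
Free-text check-in forms often contain answers such as "None" or "N/A" that are not truly missing but carry no information. The sketch below assumes such answers exist in this dataset (this has not been verified against the data). The `PLACEHOLDERS` set and the `blank_placeholder` helper are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical list of "nothing to report" answers; adjust after inspecting the real data.
PLACEHOLDERS = {"", "none", "no", "nothing", "n/a", "na", "nil", "-"}

def blank_placeholder(text):
    """Return "" when a response is only a placeholder, otherwise the original text."""
    # Normalise case, surrounding whitespace, and trailing punctuation before comparing.
    normalised = re.sub(r"[.!\s]+$", "", str(text).strip().lower())
    return "" if normalised in PLACEHOLDERS else text

print(repr(blank_placeholder("None.")))
print(repr(blank_placeholder("No internet at home")))
```

If the assumption holds, applying it with something like `df[Blocker].map(blank_placeholder)` would keep words such as "none" from dominating the Blocker themes.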

In [12]:
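# download NLTK resources, then set up stopwords and lemmatizer
# (on NLTK 3.9+, word_tokenize also needs nltk.download('punkt_tab'))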
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...


In [13]:

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    words = word_tokenize(text)
    # Remove stopwords and lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

In [None]:
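# handle spelling variations
def correct_spelling(text):
    # Placeholder: currently returns the text unchanged.
    # In practice, you might use libraries like TextBlob or SymSpell.
    return text

# contraction handling
# Specific contractions come first so they are replaced before the generic suffixes.
# Note: "'s" is also possessive ("Umuzi's" becomes "umuzi is"); the extra "is" is
# dropped later as a stopword.
contraction_mapping = { "can't": "cannot", "won't": "will not", "n't": " not", "'re": " are", "'s": " is", "'d": " would", "'ll": " will", "'t": " not", "'ve": " have", "'m": " am" }

def expand_contractions(text):
    # Lowercase and convert curly apostrophes (’) to straight ones so that
    # responses such as "Didn’t" match the mapping keys.
    text = text.lower().replace("’", "'")
    for contraction, full_form in contraction_mapping.items():
        text = text.replace(contraction, full_form)
    return text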
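
One way the spelling placeholder above could be filled in without extra dependencies is the standard library's `difflib`. This is a minimal sketch, not TextBlob or SymSpell. The `VOCABULARY` list, the `correct_spelling_difflib` name, and the `cutoff` value are assumptions chosen for illustration.

```python
import difflib

# Hypothetical domain vocabulary; in practice it could be built from the most frequent tokens.
VOCABULARY = ["assignment", "deadline", "internet", "laptop", "submission"]

def correct_spelling_difflib(text, vocabulary=VOCABULARY, cutoff=0.85):
    """Replace each token with its closest vocabulary entry when one is similar enough."""
    corrected = []
    for token in text.split():
        # get_close_matches returns at most n candidates whose similarity ratio >= cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

print(correct_spelling_difflib("my assignmnet missed the dedline"))
```

A high cutoff keeps the function conservative: tokens with no close vocabulary match pass through unchanged, which matters for names and program-specific terms.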

In [None]:
# feature engineering
# Convert text into numerical features using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
def vectorize_text(corpus):
    # fit_transform refits the shared vectorizer on every call, so afterwards
    # its vocabulary reflects only the most recent corpus
    X = tfidf_vectorizer.fit_transform(corpus)
    return X
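
To make the TF-IDF weighting concrete, here is a small pure-Python sketch on a toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows) and splits on whitespace rather than using the library's tokenizer. The `tfidf` function and `toy_corpus` are illustrative names, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Scale each document vector to unit length; empty documents stay empty.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

toy_corpus = ["submitted assignment time", "internet problem assignment", "internet problem"]
rows = tfidf(toy_corpus)
for doc, row in zip(toy_corpus, rows):
    print(doc, "->", {t: round(w, 3) for t, w in row.items()})
```

In the first toy document, "submitted" receives a higher weight than "assignment" because it appears in fewer documents, which is the property that makes TF-IDF useful for surfacing distinctive terms.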


In [None]:

# Clean each column, then vectorize it (this overwrites the raw text in df)
df[Win] = df[Win].apply(expand_contractions).apply(correct_spelling).apply(preprocess_text)
X_win = vectorize_text(df[Win])
df[Loss] = df[Loss].apply(expand_contractions).apply(correct_spelling).apply(preprocess_text)
X_loss = vectorize_text(df[Loss])
df[Blocker] = df[Blocker].apply(expand_contractions).apply(correct_spelling).apply(preprocess_text)
X_blocker = vectorize_text(df[Blocker])


In [None]:
# print shapes of the resulting matrices
print(X_win.shape, X_loss.shape, X_blocker.shape)
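
The expected outcomes call for the top 5 most frequent themes per column. As a first step toward that, the sketch below counts the most common words or word pairs in already-cleaned text. It uses simple frequency counts rather than the TF-IDF matrices, and `top_terms` and `sample` are hypothetical names with made-up example strings, not real responses.

```python
from collections import Counter

def top_terms(cleaned_texts, n=5, ngram=1):
    """Count n-grams across cleaned responses and return the n most common."""
    counts = Counter()
    for text in cleaned_texts:
        tokens = text.split()
        grams = zip(*(tokens[i:] for i in range(ngram)))
        # Count each n-gram once per response so one long answer cannot dominate.
        counts.update({" ".join(gram) for gram in grams})
    return counts.most_common(n)

sample = ["internet connection problem", "slow internet connection", "laptop problem", ""]
print(top_terms(sample, n=3))
print(top_terms(sample, n=1, ngram=2))
```

Applied to the notebook's data, this would be called as `top_terms(df[Win])` after preprocessing. Frequent terms are only candidates for themes; grouping related terms into named themes still needs manual review.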
