# Thuto Wesley Sephai
## Umuzi Experience Gig XPL2
## Check-in Feedback Analysis - 11 November 2025


### Project Description
NLP Trend Identification

Conduct a thorough Natural Language Processing (NLP) analysis of the student check-in data. Primary goal is to transform the unstructured text in the "Win", "Loss", and "Blocker" columns into quantifiable, actionable insights. By identifying and quantifying the most frequent topics and sentiment trends uncover the core experiences of the student cohort. This analysis will directly inform program management about what is working well and where immediate improvements are needed.


### What are the expected outcomes?

1. Code Repository: A dedicated Git repository containing a self-contained Jupyter Notebook or Python script

2. Top Trends Summary: A concise report, summarizing the:
- Top 5 most frequent Wins themes
- Top 5 most frequent Losses themes
- Top 5 most frequent Blockers themes

3. Visual Dashboard: Data visualisations that illustrate the frequency and distribution of the identified themes.

4. Recommendations: A concluding section that provides 3-5 concrete recommendations for the program team based on your findings.


In [49]:
# import libraries
import pandas as pd
import re
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
import seaborn as sns



In [50]:
# load dataset
df = pd.read_csv('Copy of Umuzi XB1 Check in (Responses) - Form Responses 1 - Copy of Umuzi XB1 Check in (Responses) - Form Responses 1.csv')

In [51]:
# show first few rows
df.head()

Unnamed: 0,Timestamp,Column 2,Full name,Please enter the date today,"Share a win from the last week (what went well, something you enjoyed)",Share a loss (something that was challenging or did not go well),"Share a blocker, if any (anything that stopped you from doing what you needed to do)",Anything else you would like to share or ask
0,7/9/2025 14:34:49,,Student 1,7/9/2025,Completing my first week with Umuzi gave me co...,I didn’t get opportunities from two companies ...,Being financially unstable has been draining m...,I appreciate Umuzi for this opportunity to sho...
1,7/9/2025 14:43:15,,Student 2,7/9/2025,I enjoyed introspecting myself on the basis of...,Except for being sick and experiencing challen...,"None, only temporary set backs (reception and ...",Nothing for now.
2,7/9/2025 14:49:40,,Student 3,7/9/2025,Submitting all my work in time and completing ...,I don’t have any,"Data , I couldn’t join some meetings because I...",No thank you
3,7/9/2025 14:50:41,,Student 4,7/9/2025,I submitted most of the assigned assignments,I did not understand some assignments s well a...,Spending most time in class leading to having ...,"In overall, I am doing well and trying to do a..."
4,7/9/2025 15:14:46,,Student 5,7/9/2025,I enjoyed the Life Lifeline activity. I got to...,,"I forgot to login to Google classroom, until I...",No.


In [52]:
# check columns
df.columns

Index(['Timestamp', 'Column 2', 'Full name', 'Please enter the date today',
       'Share a win from the last week (what went well, something you enjoyed)',
       'Share a loss (something that was challenging or did not go well)',
       'Share a blocker, if any (anything that stopped you from doing what you needed to do)',
       'Anything else you would like to share or ask'],
      dtype='object')

In [53]:
# check duplicates
df.duplicated().sum()

np.int64(0)

In [54]:
# missing values
df.isnull().sum()

Timestamp                                                                                 0
Column 2                                                                                372
Full name                                                                                 0
Please enter the date today                                                               0
Share a win from the last week (what went well, something you enjoyed)                   12
Share a loss (something that was challenging or did not go well)                         69
Share a blocker, if any (anything that stopped you from doing what you needed to do)     95
Anything else you would like to share or ask                                            154
dtype: int64

In [55]:
# focus on relevant columns
Win = "Share a win from the last week (what went well, something you enjoyed)"
Loss = "Share a loss (something that was challenging or did not go well)"
Blocker = "Share a blocker, if any (anything that stopped you from doing what you needed to do)"

In [56]:
# handle missing values
df[Win].fillna("", inplace=True)
df[Loss].fillna("", inplace=True)
df[Blocker].fillna("", inplace=True)

In [57]:
# stopwords and lemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Geeks2_PC12\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
#  
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


In [59]:

# text preprocessing function
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization
    words = word_tokenize(text)
    # Remove stopwords and lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

In [None]:

# contraction handling
contraction_mapping = { "can't": "cannot", "won't": "will not", "n't": " not", "'re": " are", "'s": " is", "'d": " would", "'ll": " will", "'t": " not", "'ve": " have", "'m": " am" }

# incosistency phrases handling
def expand_contractions(text):
    for contraction, full_form in contraction_mapping.items():
        text = text.replace(contraction, full_form)
    return text

In [61]:
# feature engineering
# Converting text into numerical features using techniques ising TF-IDF or Bag of Words
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_win = vectorizer.fit_transform(df[Win].apply(expand_contractions).apply(correct_spelling).apply(preprocess_text))
X_loss = vectorizer.fit_transform(df[Loss].apply(expand_contractions).apply(correct_spelling).apply(preprocess_text))
X_blocker = vectorizer.fit_transform(df[Blocker].apply(expand_contractions).apply(correct_spelling).apply(preprocess_text))


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'C:\\Users\\Geeks2_PC12/nltk_data'
    - 'c:\\Users\\Geeks2_PC12\\AppData\\Local\\Programs\\Python\\Python314\\nltk_data'
    - 'c:\\Users\\Geeks2_PC12\\AppData\\Local\\Programs\\Python\\Python314\\share\\nltk_data'
    - 'c:\\Users\\Geeks2_PC12\\AppData\\Local\\Programs\\Python\\Python314\\lib\\nltk_data'
    - 'C:\\Users\\Geeks2_PC12\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [None]:

# print shapes of the resulting matrices
print(X_win.shape, X_loss.shape, X_blocker.shape)
