# Chit-Chat: Making Sense of My Class Conversations  

## Project Overview  
This project is me turning my own class’s **Google Meet chats** into a full-blown NLP playground.  
We talk a lot; sometimes about data, sometimes about life, sometimes about memes that shouldn’t exist.  
I figured: *why not analyze it?*  

The idea is simple: **take raw chat transcripts → turn them into structured data → apply NLP + analytics → uncover patterns about how we talk, what we talk about, and when we talk the most.**  

Think of it as: *quantifying chaos*.  

## Data  
- Source: Class standup session chat exports (Tuesday → Thursday).  
- Format: `.sbv` subtitle-style files → parsed into a dataframe with:  
  - `timestamp`  
  - `name`  
  - `text`  
  - `day`  
  - `start_time` / `end_time`  
  - `hours_in_class` (so I can track talkativeness across the session)  

## What I’m Doing  

### 1. **Exploratory Stuff**  
- Word frequencies & n-grams → who says what, and how often.  
- TF-IDF → unique vocab per student or day.  
- Engagement timelines → when does the energy peak? Are we more talkative in hour 1 vs hour 2?  
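
A minimal sketch of the word-frequency and n-gram counting described above, using only the standard library. The messages, speaker names, and the `tokenize` / `ngrams` helpers are made up for illustration; the real analysis would run over the `text` and `name` columns of the parsed dataframe.

```python
from collections import Counter
import re

# Toy (speaker, text) pairs; illustrative only, not real chat data
messages = [
    ("Ann", "the car dataset is messy"),
    ("Ben", "the car dataset needs cleaning"),
    ("Ann", "cleaning the car dataset now"),
]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

word_counts = Counter()      # overall word frequencies
bigram_counts = Counter()    # overall bigram frequencies
words_by_speaker = {}        # speaker -> Counter of their words

for speaker, text in messages:
    tokens = tokenize(text)
    word_counts.update(tokens)
    bigram_counts.update(ngrams(tokens, 2))
    words_by_speaker.setdefault(speaker, Counter()).update(tokens)

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
print(words_by_speaker["Ann"].most_common(3))
```

Counting per speaker as well as overall is what makes the "who says what" question answerable later.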

### 2. **Topic Modeling**  
- Using LDA / BERTopic to see the hidden themes.  
- Expect clusters like:  
  - *“dataset troubleshooting”*  
  - *“presentation anxiety”*  
  - *“inside jokes no one outside class will get”*  

### 3. **Clustering & Embeddings**  
- Sentence embeddings + KMeans to group similar chats.  
- See which convos naturally cluster together (questions, banter, actual work 👀).  
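
A minimal sketch of the clustering step. To keep it light, TF-IDF vectors stand in for sentence embeddings here (that swap is an assumption, not the final plan); with a real embedding model only the line that builds `X` would change. The texts and `n_clusters=2` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy chat lines; illustrative only
texts = [
    "how do I merge two dataframes",
    "how do I drop missing values",
    "lol that meme again",
    "that meme is cursed lol",
]

# Stand-in for sentence embeddings: one TF-IDF vector per message
X = TfidfVectorizer().fit_transform(texts)

# n_clusters is a guess; the silhouette score can help choose it
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

for text, label in zip(texts, labels):
    print(label, text)
```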

### 4. **Sentiment & Emotion**  
- Track the mood of the class: are we positive, stressed, neutral?  
- Emotion tagging (joy, frustration, confusion) to see how feelings flow during the session.  
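
A toy sketch of lexicon-based mood scoring. The `LEXICON` below is a hypothetical hand-made word list, not a real sentiment resource, and it knows nothing about Swahili or Sheng; it only shows the shape of the approach (score each message, then label it).

```python
import re

# Hypothetical mini-lexicon: word -> polarity (illustrative only)
LEXICON = {
    "love": 1, "great": 1, "nice": 1,
    "damn": -1, "stuck": -1, "confused": -1,
}

def score_message(text):
    """Sum the polarity of every known word in the message."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)

def label(score):
    """Map a numeric score to a coarse mood label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for msg in ["great presentation, love it", "damn, I am stuck", "google search"]:
    print(msg, "->", label(score_message(msg)))
```

Averaging these scores per time bin would give the "mood over the session" curve.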

### 5. **Speaker Analysis**  
- Who talks the most?  
- Who introduces new topics vs who mostly reacts?  
- Basically: who drives the vibe of the class.  
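
A small sketch of two speaker metrics: message volume, and how often someone breaks a silence. Treating "first message after a quiet gap" as "introduces a new topic" is a heuristic I am assuming here, and the 120-second `GAP_SECONDS` threshold and the events are made up.

```python
from collections import Counter

# (seconds since session start, speaker); toy data
events = [
    (10, "Ann"), (15, "Ben"), (20, "Ann"),
    (400, "Cat"), (405, "Ann"),
    (900, "Cat"), (905, "Ben"),
]

GAP_SECONDS = 120  # assumed threshold for "the chat went quiet"

# Who talks the most?
message_counts = Counter(speaker for _, speaker in events)

# Who restarts the chat after a quiet stretch?
starter_counts = Counter()
previous_time = None
for timestamp, speaker in events:
    if previous_time is None or timestamp - previous_time > GAP_SECONDS:
        starter_counts[speaker] += 1
    previous_time = timestamp

print(message_counts.most_common())
print(starter_counts.most_common())
```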

### 6. **Recommender Angle**  
- Build a “study buddy” recommender system based on similarity of chats.  
- If you sound like me in class, the algo might just recommend we team up.  
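
One way this could work, sketched with plain bag-of-words vectors and cosine similarity. The speakers, their text, and the `recommend` helper are hypothetical; a fuller version would probably use TF-IDF or embeddings per speaker instead of raw counts.

```python
from collections import Counter
import math

# All of a speaker's messages joined into one string; toy data
speaker_text = {
    "Ann": "pandas groupby merge dataset cleaning",
    "Ben": "pandas merge dataset plots",
    "Cat": "memes football weekend",
}

vectors = {name: Counter(text.split()) for name, text in speaker_text.items()}

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(name):
    """Return the other speaker whose vocabulary is most similar."""
    scored = [(cosine(vectors[name], vectors[other]), other)
              for other in vectors if other != name]
    return max(scored)[1]

print(recommend("Ann"))
```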

### 7. **Advanced Fun**  
- **Dialogue Act Classification** → label lines as *Question / Answer / Joke / Instruction / Off-topic*.  
- **Keyword Extraction** → daily highlights without re-reading everything.  
- **Summarization** → generate auto “standup recaps.”  
- **Network Analysis** → build a social graph of who responds to whom (aka, our class dynamics in one picture).  
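
A sketch of how the reply graph's edges could be built. The chat export has no explicit reply links, so this assumes that a message "responds to" whoever spoke immediately before; that is a crude heuristic, and the speaker sequence is made up.

```python
from collections import Counter

# Speakers in message order; toy data
speakers = ["Ann", "Ben", "Ann", "Ben", "Cat", "Cat", "Ann"]

# Directed edge (responder, previous speaker) -> number of times it happened
edges = Counter()
for previous, current in zip(speakers, speakers[1:]):
    if previous != current:  # ignore someone following up on themselves
        edges[(current, previous)] += 1

for (responder, target), weight in edges.most_common():
    print(f"{responder} -> {target}: {weight}")
```

The weighted `(source, target)` pairs can then be loaded into a NetworkX `DiGraph` for plotting and centrality measures.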

## Tech Stack  
- **Pandas** for wrangling  
- **NLTK / SpaCy** for preprocessing  
- **Scikit-learn** for TF-IDF, clustering, and classification  
- **BERTopic / Gensim** for topic modeling  
- **NetworkX** for social graph analysis  
- **Matplotlib / Seaborn / Plotly** for making the chaos look pretty  

## Expected Insights  
- Are we more talkative at the start, midway, or near the end?  
- What do we actually talk about (vs what we *think* we talk about)?  
- Who dominates conversations, and who stays quiet?  
- Can we auto-summarize our standups without losing context?  

Import all relevant libraries

In [30]:
# Utilities
import warnings
warnings.filterwarnings('ignore')

# Mathematical Operations
import numpy as np

# Data Manipulation
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
import seaborn as sns

# String manipulation
import re

from IPython.display import display

# NLP
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Clustering
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.neighbors import NearestNeighbors

# SVD
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Clustering metric scores
from sklearn.metrics import silhouette_score
from sklearn import metrics

# Statistics & Scientific Computing
from scipy.stats import randint

# Machine Learning - Preprocessing
from sklearn.preprocessing import StandardScaler, normalize

# Machine Learning - Model Selection & Evaluation
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score
)
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    auc,
    roc_auc_score,
    balanced_accuracy_score,
    f1_score
)

# Show entire column contents
pd.set_option('display.max_colwidth', None)

In [31]:
# Get stopwords from NLTK
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [32]:
# Check the english stopwords
print(stopwords.words('english'))

# Check Swahili stopwords
#print(stopwords.words('kiswahili'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she
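
The chats mix English with Swahili and Sheng, so an English-only stopword list will leave plenty of filler words behind. Below is a minimal sketch of extending a stopword set by hand. Both word sets are small illustrative assumptions: in the notebook the English set would come from `stopwords.words('english')`, and the Swahili words are a hand-picked sample, not an NLTK resource.

```python
import re

# Stand-in for set(stopwords.words('english')); kept tiny on purpose
english_sample = {"the", "is", "on", "i", "a", "in"}

# Hand-picked Swahili function words (illustrative, far from complete)
swahili_extra = {"na", "ya", "wa", "ni", "kwa", "za"}

combined_stopwords = english_sample | swahili_extra

def remove_stopwords(text, stop_set):
    """Lowercase, tokenize, and drop any token found in stop_set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_set]

print(remove_stopwords("watu wa car dataset tuliona vumbi", combined_stopwords))
```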

In [33]:
# Load Tuesday's chats
tuesday_path = r"C:\Users\lenovo\OneDrive\Desktop\DS\PROJECTS\chit-chat\Chatter box\DSF-FT13R_P4_Standup - 2025_09_30 08_33 EAT - Chat 2.sbv"

with open(tuesday_path, "r", encoding="utf-8-sig") as f:
    tuesday_lines = f.readlines()

tuesday_lines[:20]  # peek at structure

['00:01:15.122,00:01:18.122\n',
 'William Arasirwa: lakini i need lessons on presentation\n',
 '\n',
 '00:10:39.341,00:10:42.341\n',
 'Stanley Njihia: Mnaskia reverb wakubwa\n',
 '\n',
 '00:10:47.810,00:10:50.810\n',
 'William Arasirwa: eeeeh\n',
 '\n',
 '00:14:14.814,00:14:17.814\n',
 'Stanley Njihia: Peris, cheza na massgrave io watermark iko apo bottom right iache kukusumbua.\n',
 '\n',
 '00:15:11.823,00:15:14.823\n',
 'William Arasirwa: address the elephant in the room\n',
 '\n',
 '00:15:43.275,00:15:46.275\n',
 'William Arasirwa: watu wa car dataset tuliona vumb\n',
 '\n',
 '00:16:02.430,00:16:05.430\n',
 'Stanley Njihia: ptsd\n']

In [34]:
# Load Wednesday's chats
wenno_path = r"C:\Users\lenovo\OneDrive\Desktop\DS\PROJECTS\chit-chat\Chatter box\DSF-FT13R_P4_Standup - 2025_10_01 08_31 EAT - Chat.sbv"

with open(wenno_path, "r", encoding="utf-8-sig") as f:
    wenno_lines = f.readlines()

wenno_lines[:20]  # peek at structure

['00:00:06.928,00:00:09.928\n',
 'Ann-Felicity Mureithi: mnijibu\n',
 '\n',
 '00:00:10.928,00:00:13.928\n',
 'Norman Mwapea: Skia sentiments za gen alpha\n',
 '\n',
 '00:00:41.309,00:00:44.309\n',
 'Norman Mwapea: Kuhustle ka sjakosea. Ama ni aje Judith\n',
 '\n',
 '00:01:05.563,00:01:08.563\n',
 'Ann-Felicity Mureithi: @ stan wewe ebu confrim\n',
 '\n',
 '00:02:00.671,00:02:03.671\n',
 'Ann-Felicity Mureithi: aws\n',
 '\n',
 '00:02:22.962,00:02:25.962\n',
 'Norman Mwapea: Ati what tool did you use apart from gpt. That was just nasty\n',
 '\n',
 '00:10:30.379,00:10:33.379\n',
 'Ann-Felicity Mureithi: google search\n']

In [35]:
# Load Thursday's chats
thur_path = r"C:\Users\lenovo\OneDrive\Desktop\DS\PROJECTS\chit-chat\Chatter box\DSF-FT13R_P4_Standup - 2025_10_02 08_32 EAT - Chat.sbv"

with open(thur_path, "r", encoding="utf-8-sig") as f:
    thur_lines = f.readlines()

thur_lines[:20]  # peek at structure

['00:00:00.075,00:00:03.075\n',
 'Norman Mwapea: Story inabamba sindio\n',
 '\n',
 '00:00:17.119,00:00:20.119\n',
 'Norman Mwapea: Mucene ya asubuhi. Morning gloru\n',
 '\n',
 '00:01:36.531,00:01:39.531\n',
 'Huldah Rotich: Kitts wa part time take a longer time so hatuwezi graduate na hao\n',
 '\n',
 '00:02:09.736,00:02:12.736\n',
 'Stanley Njihia: Vile ii mambo inaendelea itabidi Kitts ametolewa frontlines\n',
 '\n',
 '00:02:40.130,00:02:43.130\n',
 'William Arasirwa: ken ?\n',
 '\n',
 '00:02:43.832,00:02:46.832\n',
 'Jeff Kandie: Ken walibor\n',
 '\n',
 '00:02:48.896,00:02:51.896\n',
 'Jeff Kandie: a\n']

Good. The files are loaded and ready for manipulation. Since I want to analyze them as structured data, the next step is to parse each file into a dataframe.

In [36]:
# Parse each .sbv file and convert it into a dataframe
def sbv_to_dataframe(file_path):
    """Convert one .sbv chat export into a dataframe with timestamp, name, and text columns."""

    with open(file_path, "r", encoding="utf-8-sig") as f:
        content = f.read().strip()
    
    # Split by double newlines (each block = 1 subtitle)
    blocks = content.split("\n\n")
    
    data = []
    for block in blocks:
        lines = block.strip().split("\n")
        if len(lines) >= 2:
            timestamp = lines[0].strip()
            line_text = lines[1].strip()
            
            # Split into name + text if ":" exists
            if ":" in line_text:
                name, text = line_text.split(":", 1)
                name, text = name.strip(), text.strip()
            else:
                name, text = None, line_text  # fallback if no name
            
            data.append([timestamp, name, text])
    
    return pd.DataFrame(data, columns=["timestamp", "name", "text"])

In [37]:
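# Load the corpus as dataframes
tuesday_df = sbv_to_dataframe(tuesday_path)
wenno_df = sbv_to_dataframe(wenno_path)
thur_df = sbv_to_dataframe(thur_path)
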
print("Tuesday:")
display(tuesday_df.sample(7))

print("Wednesday:")
display(wenno_df.sample(7))

print("Thursday:")
display(thur_df.sample(7))

Tuesday:


| | timestamp | name | text | date |
|---|---|---|---|---|
| 137 | 01:24:44.677,01:24:47.677 | Jeff Mogaka | eishh | 2025-09-30 |
| 123 | 01:23:08.205,01:23:11.205 | Jeff Mogaka | i give up | 2025-09-30 |
| 114 | 01:21:52.562,01:21:55.562 | Jeff Mogaka | si wewe Fridah ulikua unalia ju ya JONTE phase 1 | 2025-09-30 |
| 172 | 01:35:00.840,01:35:03.840 | Jeff Mogaka | nitatokwa na uwazimu | 2025-09-30 |
| 78 | 01:15:49.048,01:15:52.048 | William Arasirwa | cleanliness is key ya lef ama right? | 2025-09-30 |
| 215 | 01:43:03.178,01:43:06.178 | Alvin kipleting | @ jeff ndio aura irudi naona itabidi umetumia OSINT on some people here | 2025-09-30 |
| 55 | 01:13:41.196,01:13:44.196 | Stacy Mogeni | self love muhimu | 2025-09-30 |


Wednesday:


| | timestamp | name | text | date |
|---|---|---|---|---|
| 225 | 02:32:36.868,02:32:39.868 | Alvin kipleting | get ready get ready | 2025-10-01 |
| 243 | 02:41:17.568,02:41:20.568 | Jeff Mogaka | damn | 2025-10-01 |
| 115 | 01:54:09.703,01:54:12.703 | Huldah Rotich | Hehehehe ati wananifukuza. Yho....I am right at home. | 2025-10-01 |
| 76 | 01:44:19.641,01:44:22.641 | Alvin kipleting | watu wa Tz wanaumia | 2025-10-01 |
| 7 | 00:11:22.853,00:11:25.853 | William Arasirwa | chatgpt recieving shembeteng | 2025-10-01 |
| 101 | 01:50:00.110,01:50:03.110 | Jeff Mogaka | \*Judith | 2025-10-01 |
| 59 | 01:33:25.295,01:33:28.295 | William Arasirwa | cha mkufuu mwana hu haa | 2025-10-01 |


Thursday:


| | timestamp | name | text | date |
|---|---|---|---|---|
| 107 | 01:02:18.406,01:02:21.406 | William Arasirwa | why cmd n | 2025-10-02 |
| 114 | 01:09:26.801,01:09:29.801 | Kitts Kikumu | unajipiga own goal | 2025-10-02 |
| 26 | 00:29:08.550,00:29:11.550 | Maureen Ngaire | Hulda siuseme ama nikuseme | 2025-10-02 |
| 122 | 01:22:22.812,01:22:25.812 | Nesphory Mwadime | I might be wrong but I think the sample method generated different samples for everyone. Sikumbuki tukitumia random_state for reproducibilitty | 2025-10-02 |
| 154 | 01:45:16.237,01:45:19.237 | Stanley Njihia | Pace ilichange haraka upesi | 2025-10-02 |
| 146 | 01:34:46.177,01:34:49.177 | Alvin kipleting | vitu DS na MANU inanifanyia acha tu, Arasirwa hiyo shamba unabuy when | 2025-10-02 |
| 103 | 01:01:52.579,01:01:55.579 | Ann-Felicity Mureithi | @norman and others | 2025-10-02 |


Combine all the files into one dataset.

In [38]:
# --- 0. assign actual class dates (adjust if your dates differ) ---
tuesday_df["date"] = pd.to_datetime("2025-09-30")
wenno_df["date"]   = pd.to_datetime("2025-10-01")
thur_df["date"]    = pd.to_datetime("2025-10-02")

# --- 1. combine ---
chat_df = pd.concat([tuesday_df, wenno_df, thur_df], ignore_index=True)

# --- 2. extract start & end from timestamp ---
def _extract_start_end(ts):
    """Split a 'start,end' timestamp string into a (start, end) tuple of strings."""

    if pd.isna(ts):
        return (None, None)

    s = str(ts).strip()

    if "," in s: 
        left, right = [p.strip() for p in s.split(",", 1)]
        return (left or None, right or None)
    matches = re.findall(r"\d{1,2}:\d{2}:\d{2}(?:\.\d+)?", s)

    if len(matches) >= 2:
        return (matches[0], matches[1])

    if len(matches) == 1:
        return (matches[0], None)

    return (None, None)

times = chat_df["timestamp"].apply(_extract_start_end)
chat_df[["start_time_str", "end_time_str"]] = pd.DataFrame(times.tolist(), index=chat_df.index)

# --- 3. convert to Timedelta and compute hours in class ---
chat_df["start_time"] = pd.to_timedelta(chat_df["start_time_str"], errors="coerce")
chat_df["end_time"]   = pd.to_timedelta(chat_df["end_time_str"], errors="coerce")
chat_df["hours_in_class"] = chat_df["start_time"].dt.total_seconds() / 3600.0

# --- 4. add day column ---
chat_df["day"] = chat_df["date"].dt.day_name()

# --- 5. final dataframe ---
final_df = chat_df[["timestamp", "name", "text", "day", "start_time", "end_time", "hours_in_class"]].copy()

# Preview
display(final_df)

| | timestamp | name | text | day | start_time | end_time | hours_in_class |
|---|---|---|---|---|---|---|---|
| 0 | 00:01:15.122,00:01:18.122 | William Arasirwa | lakini i need lessons on presentation | Tuesday | 0 days 00:01:15.122000 | 0 days 00:01:18.122000 | 0.020867 |
| 1 | 00:10:39.341,00:10:42.341 | Stanley Njihia | Mnaskia reverb wakubwa | Tuesday | 0 days 00:10:39.341000 | 0 days 00:10:42.341000 | 0.177595 |
| 2 | 00:10:47.810,00:10:50.810 | William Arasirwa | eeeeh | Tuesday | 0 days 00:10:47.810000 | 0 days 00:10:50.810000 | 0.179947 |
| 3 | 00:14:14.814,00:14:17.814 | Stanley Njihia | Peris, cheza na massgrave io watermark iko apo bottom right iache kukusumbua. | Tuesday | 0 days 00:14:14.814000 | 0 days 00:14:17.814000 | 0.237448 |
| 4 | 00:15:11.823,00:15:14.823 | William Arasirwa | address the elephant in the room | Tuesday | 0 days 00:15:11.823000 | 0 days 00:15:14.823000 | 0.253284 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 741 | 02:16:58.724,02:17:01.724 | Alvin kipleting | inakata tu naisha | Thursday | 0 days 02:16:58.724000 | 0 days 02:17:01.724000 | 2.282979 |
| 742 | 02:18:27.329,02:18:30.329 | Kitts Kikumu | two months in two days | Thursday | 0 days 02:18:27.329000 | 0 days 02:18:30.329000 | 2.307591 |
| 743 | 02:18:30.920,02:18:33.920 | Nesphory Mwadime | Enyewe Moringa ni business | Thursday | 0 days 02:18:30.920000 | 0 days 02:18:33.920000 | 2.308589 |
| 744 | 02:18:43.093,02:18:46.093 | Stanley Njihia | eeiyy | Thursday | 0 days 02:18:43.093000 | 0 days 02:18:46.093000 | 2.311970 |
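
With `hours_in_class` in place, the "hour 1 vs hour 2" question becomes a groupby. A minimal sketch on a toy frame that mimics the `day` and `hours_in_class` columns above; the values and the `hour_bin` column name are illustrative, and on the real data `toy_df` would be replaced by `final_df`.

```python
import pandas as pd

# Toy frame with the same two columns as final_df; values are made up
toy_df = pd.DataFrame({
    "day": ["Tuesday", "Tuesday", "Tuesday", "Wednesday", "Wednesday"],
    "hours_in_class": [0.02, 0.18, 1.40, 0.50, 2.28],
})

# Hour 1 covers [0, 1), hour 2 covers [1, 2), and so on
toy_df["hour_bin"] = toy_df["hours_in_class"].astype(int) + 1

# Rows = day, columns = hour of the session, values = message count
activity = toy_df.groupby(["day", "hour_bin"]).size().unstack(fill_value=0)
print(activity)
```

Passing the resulting table to a heatmap or a line plot gives the engagement timeline per day.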
