# Twitch-Review-Audit: Data Labeling Consistency & Noise Detection
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Status: High Quality](https://img.shields.io/badge/Data_Quality-High-green.svg)](#)

## Project Overview
This project builds a rule-based NLP pipeline to audit the label quality of Twitch mobile app reviews. In **Data-Centric AI**, the reliability of labels is checked before any model is trained.

We implement a **Rule-Based Annotation** system to cross-check user-provided star ratings (`score`) against the sentiment expressed in the review text (`content`).

### 🛠 Tech Stack
* **NLP Pipeline:** NLTK (tokenization) and Python `re` (regex normalization)
* **Metric:** Cohen’s Kappa Coefficient ($\kappa$), computed with scikit-learn
* **Objective:** Label Noise Detection & Dataset Sanitization

In [40]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Downloading necessary resource for tokenization
nltk.download('punkt')

print("Libraries imported and NLTK resources ready.")

Libraries imported and NLTK resources ready.


[nltk_data] Downloading package punkt to C:\Users\Marcell/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Data Acquisition
We load the raw Twitch review dataset. The primary features of interest are `content` (textual feedback) and `score` (numerical rating 1-5).

In [41]:
# Load the dataset
df = pd.read_csv('twitch_reviews.csv')

print(f"Dataset Shape: {df.shape}")
# Displaying the first few rows for initial data inspection
df[['content', 'score']].head()

Dataset Shape: (97548, 8)


| | content | score |
|---|---|---|
| 0 | Get away with that UI. Uninstalled. | 1 |
| 1 | New layout is absolute rubbish. Dont try to be... | 1 |
| 2 | Please revert to the previous UI | 1 |
| 3 | Absolute hideous UI update since today, random... | 1 |
| 4 | The UI is soooooo bad now | 1 |


## 2. Text Normalization & Tokenization
To keep lexicon matches precise, we normalize the text before tokenizing. **Lemmatization is omitted** to preserve the raw form of gaming slang, so the pipeline relies on clean tokenization instead.

**Strategy:**
1. **Case Folding:** Lowercasing all text.
2. **Noise Removal:** Eliminating URLs and non-alphabetic characters using Regex.
3. **Word Segmentation:** Utilizing `nltk.word_tokenize` for discrete token mapping.

In [42]:
def clean_and_tokenize(text):
    # Convert to string and lowercase
    text = str(text).lower()
    # Remove URLs (hashtags and mentions are not removed as whole tokens;
    # only their '#' and '@' symbols are stripped by the next step)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove everything except lowercase letters and whitespace
    # (digits, punctuation, and apostrophes go, so "don't" becomes "dont")
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    return tokens

df['tokens'] = df['content'].apply(clean_and_tokenize)
print("Preprocessing complete: Text normalized and tokenized.")
df[['content', 'tokens']].head()

Preprocessing complete: Text normalized and tokenized.


| | content | tokens |
|---|---|---|
| 0 | Get away with that UI. Uninstalled. | [get, away, with, that, ui, uninstalled] |
| 1 | New layout is absolute rubbish. Dont try to be... | [new, layout, is, absolute, rubbish, dont, try... |
| 2 | Please revert to the previous UI | [please, revert, to, the, previous, ui] |
| 3 | Absolute hideous UI update since today, random... | [absolute, hideous, ui, update, since, today, ... |
| 4 | The UI is soooooo bad now | [the, ui, is, soooooo, bad, now] |


## 3. Rule-Based Annotation (Heuristic Model)

We employ a **curated rule-based lexical heuristic** tailored specifically to the Twitch ecosystem. The lexicon incorporates domain-specific expressions and accounts for common morphological variations (e.g., *lag, lagging, lags*) to maintain interpretability while compensating for the absence of lemmatization.


In [43]:
# Expanded Lexicon curated for Twitch-specific feedback
pos_lexicon = [
    'awesome', 'love', 'loved', 'perfect', 'best', 'good', 'nice', 'fun', 
    'cool', 'great', 'amazing', 'excellent', 'smooth', 'helpful', 'pog', 'poggers'
]

neg_lexicon = [
    'worst', 'garbage', 'trash', 'rubbish', 'horrible', 'hideous', 'hate', 'hated',
    'terrible', 'uninstall', 'uninstalled', 'revert', 'reverted', 'previous', 'old',
    'broken', 'lag', 'lagging', 'lags', 'slow', 'slows', 'buggy', 'bugs', 
    'tiktok', 'bad', 'mess', 'cluttered', 'ads', 'advertise', 'advertising'
]

def rule_based_label(tokens):
    """
    Annotator 1 Logic:
    Calculates sentiment based on exact matches within the token list.
    """
    pos_count = sum(1 for word in tokens if word in pos_lexicon)
    neg_count = sum(1 for word in tokens if word in neg_lexicon)
    
    if neg_count > pos_count:
        return 0  # Negative
    elif pos_count > neg_count:
        return 1  # Positive
    else:
        return 2  # Neutral / Ambiguous

df['label_lexicon'] = df['tokens'].apply(rule_based_label)
print("Lexicon-based labeling complete.")

Lexicon-based labeling complete.
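To make the decision rule concrete, the sketch below applies the same counting logic to a few hand-written token lists. The reduced lexicons are illustrative subsets of the lists above, stored as sets (an assumption made here for faster membership tests), and `count_label` is a hypothetical name, not part of the notebook.

```python
# Illustrative subsets of the lexicons above, stored as sets (assumption).
POS = {"love", "good", "great", "pog"}
NEG = {"bad", "rubbish", "uninstalled", "lag"}

def count_label(tokens):
    """Same rule as rule_based_label: majority of exact lexicon hits."""
    pos = sum(1 for t in tokens if t in POS)
    neg = sum(1 for t in tokens if t in NEG)
    if neg > pos:
        return 0  # Negative
    if pos > neg:
        return 1  # Positive
    return 2      # Neutral / Ambiguous (tie, including zero hits)

# Hand-written token lists, one per case of interest
examples = {
    "negative": ["the", "ui", "is", "soooooo", "bad", "now"],
    "positive": ["love", "this", "app", "great", "streams"],
    "tie": ["good", "streams", "but", "bad", "ui"],
    "no_hits": ["please", "add", "dark", "mode"],
    "negated": ["not", "good"],
}
labels = {name: count_label(toks) for name, toks in examples.items()}
print(labels)
```

Two behaviours are worth noting: a review with no lexicon hits receives the same label (2) as a review with balanced hits, and the negated phrase is labeled positive because the rule counts tokens in isolation.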


## 4. Agreement Analysis Between Labeling Sources

The **original user-provided rating** and the **rule-based lexical heuristic output** are treated as two independent labeling sources. To evaluate their consistency, we compute the following metrics:

1. **Agreement Rate (Accuracy):** The proportion of samples where both labeling sources assign the same sentiment label.
2. **Cohen’s Kappa ($\kappa$):** A chance-corrected agreement statistic that measures consistency beyond random agreement, where $p_o$ is the observed agreement rate and $p_e$ is the agreement expected by chance:

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$
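As a worked illustration of the formula, the sketch below computes $\kappa$ by hand on a small set of made-up labels. The function name `cohen_kappa` and the toy data are assumptions for this example; for unweighted labels the result should agree with scikit-learn's `cohen_kappa_score`, which the notebook uses.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    # p_o: observed proportion of items on which both sources agree
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance, from each source's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined when p_e == 1 (both sources always emit one identical label)
    return (p_o - p_e) / (1 - p_e)

# Toy data (made up for illustration): 10 reviews, 0 = negative, 1 = positive
ratings = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
lexicon = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohen_kappa(ratings, lexicon)
print(round(kappa, 4))
```

On this toy data the two sources agree on 8 of 10 items ($p_o = 0.8$), while $p_e = 0.6 \cdot 0.6 + 0.4 \cdot 0.4 = 0.52$, so $\kappa = 0.28 / 0.48 \approx 0.583$. The gap between 80% raw agreement and a kappa of about 0.58 is the reason both metrics are reported.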


In [44]:
def map_score_to_label(score):
    if score <= 2:
        return 0  # Negative
    elif score >= 4:
        return 1  # Positive
    else:
        return 2  # Neutral

df['label_original'] = df['score'].apply(map_score_to_label)

In [45]:
# Filtering out 'Neutral' samples to focus on clear Sentiment Agreement
df_eval = df[(df['label_lexicon'] != 2) & (df['label_original'] != 2)].copy()

accuracy = accuracy_score(df_eval['label_original'], df_eval['label_lexicon'])
kappa = cohen_kappa_score(df_eval['label_original'], df_eval['label_lexicon'])

print("=== QUALITY AUDIT RESULTS ===")
print(f"Total Evaluated Samples: {len(df_eval)}")
print(f"Agreement Rate (Accuracy): {accuracy:.2%}")
print(f"Cohen's Kappa Score: {kappa:.4f}")

# Final Interpretation
if kappa > 0.60:
    print("Interpretation: Substantial Agreement - Lexicon rules are highly consistent.")
elif kappa > 0.40:
    print("Interpretation: Moderate Agreement - Room for lexicon expansion.")
else:
    print("Interpretation: Low Agreement - High risk of sarcasm or misaligned ratings.")

=== QUALITY AUDIT RESULTS ===
Total Evaluated Samples: 55989
Agreement Rate (Accuracy): 94.90%
Cohen's Kappa Score: 0.8211
Interpretation: Substantial Agreement - Lexicon rules are highly consistent.


## 5. Label Inconsistency Detection & Remediation

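We identified **4,909 samples** (about 5.0% of the 97,548 reviews, or 8.8% relative to the 55,989 polarized samples evaluated in Section 4) exhibiting **potential label inconsistency** between user-provided ratings and the textual sentiment inferred by the rule-based lexical heuristic. This count is higher than the roughly 5.1% disagreement measured in Section 4 because the flagging mask also includes 3-star reviews (`label_original == 2`) to which the lexicon assigned a polar label.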

**Operational Definition:**  
Samples in which the numerical rating (e.g., 5 stars) diverges from the dominant linguistic sentiment expressed in the review text (e.g., explicit negative expressions such as “this app is rubbish”).

**Recommended Actions for AI Training:**
- Flag these samples for **manual review or re-annotation** within a Human-in-the-Loop (HITL) framework.
- Excluding or correcting highly inconsistent samples may reduce noisy supervision and improve the robustness of downstream sentiment classification models (e.g., BERT or SVM).


In [46]:
# Identifying 'Label Noise': the lexicon gives a polar label that differs from the rating.
# Note: this also flags 3-star reviews (label_original == 2) with a polar lexicon label.
df['is_noise'] = (df['label_original'] != df['label_lexicon']) & (df['label_lexicon'] != 2)
noise_data = df[df['is_noise']]

print(f"Detected Label Noise: {len(noise_data)} samples.")

# Saving the audited dataset for further review
df.to_csv('twitch_reviews_audited.csv', index=False)

# Keep the preview as the last expression so the notebook renders it
print("\nTop samples requiring manual re-annotation:")
noise_data[['content', 'score', 'label_lexicon']].head(10)

Detected Label Noise: 4909 samples.

Top samples requiring manual re-annotation:

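The flagging mask in the cell above marks two different situations as noise: a polar rating that contradicts a polar lexicon label, and a 3-star rating paired with a polar lexicon label. A reviewer may want to treat these separately, so the sketch below splits them. The function `flag_type` and its category names are hypothetical and not part of the notebook.

```python
def flag_type(label_original, label_lexicon):
    """Categorise one review; 0 = negative, 1 = positive, 2 = neutral."""
    if label_lexicon == 2:
        return "no_lexicon_signal"   # lexicon tied or found no hits
    if label_original == 2:
        return "neutral_rating"      # 3-star rating with polar text
    if label_original != label_lexicon:
        return "polar_conflict"      # e.g. 5 stars with negative text
    return "consistent"

# (label_original, label_lexicon) pairs covering every branch
pairs = [(1, 0), (0, 1), (2, 0), (2, 1), (1, 1), (0, 2)]
kinds = [flag_type(o, l) for o, l in pairs]
print(kinds)
```

In the notebook's terms, `is_noise` is true for both `polar_conflict` and `neutral_rating`. Assuming the same column names, the function could be applied row-wise with `df.apply(lambda r: flag_type(r['label_original'], r['label_lexicon']), axis=1)` to prioritise `polar_conflict` rows for manual review.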

## 🏁 6. Final Conclusions & Strategic Recommendations

### 6.1 Key Performance Insights
Based on the quality audit of **55,989 samples** (the reviews for which both the rating and the lexicon produced a polar label, about 57% of the 97,548 reviews loaded), the following conclusions were drawn:

* **Strong Heuristic Consistency:** A **Cohen’s Kappa score of 0.8211** indicates **substantial to near-perfect agreement** between the rule-based lexical heuristic and user-provided ratings. This suggests strong alignment under clearly polarized sentiment conditions and indicates that the selected domain-specific keywords (e.g., *“pog”*, *“revert”*, *“tiktok”*) capture dominant sentiment signals commonly expressed in Twitch reviews.
* **High Label Consistency:** The observed **Agreement Rate of 94.90%** reflects a high level of consistency between numerical ratings and textual feedback. Because it is measured only on the polarized subset, it does not cover reviews where either source was neutral.
* **Effective Inconsistency Identification:** The pipeline identified **4,909 samples** (about 5.0% of all reviews, or 8.8% relative to the 55,989 evaluated samples) exhibiting **potential label inconsistency**, where the sentiment expressed in text diverges from the assigned star rating. These cases may correspond to sarcasm, mixed sentiment, negation that the lexicon does not model, or rating bias.

### 6.2 Impact on AI Training
Identifying potentially inconsistent samples is critical for improving downstream machine learning workflows:
1. **Reduced Noisy Supervision:** Excluding or correcting highly inconsistent samples may prevent models from learning misleading or spurious correlations.
2. **Improved Model Robustness:** Training on higher-consistency data can help models better distinguish dominant patterns in complaints versus praise.

### 6.3 Future Recommendations
To further enhance this NLP pipeline, the following steps are recommended:
* **N-gram Analysis:** Incorporate **bigrams** to better capture negation and contextual polarity (e.g., distinguishing *“not good”* from *“good”*).
* **Human-in-the-Loop Validation:** Implement a **Human-in-the-Loop (HITL)** process where samples flagged as potentially inconsistent are manually reviewed or re-annotated.
* **Model Benchmarking:** Following dataset auditing, future work may involve training and benchmarking **transformer-based models (e.g., BERT)** on the high-consistency subset to evaluate potential performance improvements over traditional approaches.
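The bigram recommendation can be sketched as a small extension of the counting rule: when a lexicon word is directly preceded by a negator, its polarity is flipped. This is a minimal illustration under stated assumptions, not the notebook's method. The negator list, the reduced lexicons, and the name `negation_aware_label` are all hypothetical. Negators are written without apostrophes because the preprocessing step strips non-letter characters.

```python
POS = {"good", "great", "love"}          # illustrative subset
NEG = {"bad", "rubbish", "lag"}          # illustrative subset
NEGATORS = {"not", "no", "never", "dont", "isnt", "cant"}  # assumed list

def negation_aware_label(tokens):
    """Count lexicon hits, flipping polarity when the previous token negates."""
    pos = neg = 0
    # Pair each token with its predecessor (None for the first token)
    for prev, word in zip([None] + tokens[:-1], tokens):
        is_pos, is_neg = word in POS, word in NEG
        if prev in NEGATORS:
            is_pos, is_neg = is_neg, is_pos
        pos += is_pos
        neg += is_neg
    if neg > pos:
        return 0  # Negative
    if pos > neg:
        return 1  # Positive
    return 2      # Neutral / Ambiguous

results = [
    negation_aware_label(["not", "good"]),
    negation_aware_label(["not", "bad", "at", "all"]),
    negation_aware_label(["good", "app"]),
    negation_aware_label(["dont", "love", "the", "lag"]),
]
print(results)
```

Only the immediately preceding token is inspected, so longer-range negation (for example, "not really that good") is still missed. A wider window or a dependency parse would be needed for that.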
