| Property | Value |
|---|---|
| Name | Ahsan Umar |
| Roll No | 232552 |
| Course | Machine Learning |
| Assignment No | 2 |
| GitHub Repository | codewithdark-git/USA-MLAssignment |
| Project Directory | urdu-sentiment-analysis |
Urdu Movie Reviews — Sentiment Analysis
This project implements a complete NLP + Machine Learning pipeline for classifying Urdu movie reviews as positive or negative. It covers data loading and exploratory analysis, WordCloud visualization, text preprocessing, n-gram and TF-IDF feature extraction, five classifiers evaluated with cross-validation, a comparative analysis of their metrics, and a voting ensemble.
Dataset: IMDB 50K Movie Reviews — Translated Urdu (Kaggle)
urdu-sentiment-analysis/
│
├── README.md ← Assignment document
├── requirements.txt ← All dependencies
├── run_all.py ← Execute all questions
│
├── data/
│ ├── imdb_urdu_reviews.csv ← Dataset (50K reviews)
│ └── README_data.md ← Instructions for dataset
│
├── utils/ ← Shared utilities
│ ├── __init__.py
│ ├── urdu_stopwords.py ← Urdu stopword list
│ └── text_utils.py ← Reusable preprocessing functions
│
├── Q1_Data_Loading/
│ ├── README.md
│ └── q1_data_loading.py ← Data Loading & EDA
│
├── Q2_Visualization/
│ ├── README.md
│ └── q2_wordcloud.py ← WordCloud Visualization
│
├── Q3_Preprocessing/
│ ├── README.md
│ └── q3_preprocessing.py ← Data Preprocessing
│
├── Q4_Feature_Extraction/
│ ├── README.md
│ └── q4_feature_extraction.py ← N-grams & TF-IDF
│
├── Q5_Classification/
│ ├── README.md
│ └── q5_classification.py ← ML Classification Models
│
├── Q6_Comparative_Analysis/
│ ├── README.md
│ ├── q6_comparative_analysis.py ← Model Comparison
│ └── q6_comparison.py
│
├── Q7_Ensemble/
│ ├── README.md
│ └── q7_ensemble.py ← Ensemble Methods
│
└── outputs/
  ├── q5_cv_results.csv ← Cross-validation results
  └── q7_ensemble_results.csv ← Ensemble results
- Python 3.8+
- Git
- pip package manager
git clone https://github.com/codewithdark-git/USA-MLAssignment.git
cd USA-MLAssignment/urdu-sentiment-analysis
pip install -r requirements.txt
- Download from Kaggle — IMDB 50K Urdu Reviews
- Extract and place the CSV file inside the data/ folder
- Verify: data/imdb_urdu_reviews.csv
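The verification step can also be scripted. The following is a minimal sketch that only assumes the file is a UTF-8 CSV with a header row; the helper name `check_dataset` is hypothetical and not part of the repository.

```python
import csv
from pathlib import Path


def check_dataset(path="data/imdb_urdu_reviews.csv"):
    """Return (header, row_count) for the CSV, or raise if it is missing.

    Hypothetical helper for illustration; the project may verify the
    dataset differently.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at {csv_path}")
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading BOM;
    # newline="" lets the csv module handle line breaks inside quoted reviews.
    with csv_path.open(encoding="utf-8-sig", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count
```

Calling `check_dataset()` from the project directory should return the column names and a row count; given the dataset description, a count far below 50,000 would suggest an incomplete download.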
Option A: Run all questions sequentially
python run_all.py
Option B: Run individual questions
python Q1_Data_Loading/q1_data_loading.py
python Q2_Visualization/q2_wordcloud.py
python Q3_Preprocessing/q3_preprocessing.py
python Q4_Feature_Extraction/q4_feature_extraction.py
python Q5_Classification/q5_classification.py
python Q6_Comparative_Analysis/q6_comparative_analysis.py
python Q7_Ensemble/q7_ensemble.py

| # | Topic | Key Techniques | Files |
|---|---|---|---|
| Q1 | Data Loading & Preliminary Analysis | Pandas, EDA, class distribution | Q1_Data_Loading |
| Q2 | Data Visualization | WordCloud, Matplotlib, Text analysis | Q2_Visualization |
| Q3 | Data Preprocessing | Normalization, Tokenization, Stopwords, Stemming | Q3_Preprocessing |
| Q4 | Feature Extraction | Unigrams, Bigrams, Trigrams, TF-IDF | Q4_Feature_Extraction |
| Q5 | ML Classification | Naïve Bayes, SVM, Decision Tree, RF, k-NN | Q5_Classification |
| Q6 | Comparative Analysis | Performance metrics comparison, visualizations | Q6_Comparative_Analysis |
| Q7 | Ensemble Methods | Voting Classifier (NB + SVM + RF) | Q7_Ensemble |
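To make the Q3 row more concrete, here is a rough sketch of normalization, tokenization, and stopword removal for Urdu text using only the standard library. It is not the code in `utils/text_utils.py`: the function names, the character mapping, and the tiny stopword sample are assumptions for illustration, and stemming is left out.

```python
import re
import unicodedata

# Map Arabic code points that are often typed in place of Urdu letters:
# Arabic yeh (U+064A) -> Farsi yeh (U+06CC), Arabic kaf (U+0643) -> keheh (U+06A9).
ARABIC_TO_URDU = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# A token is a run of Unicode letters (no digits, underscores, or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")


def normalize(text):
    """Unify letter variants and drop combining marks (diacritics)."""
    text = unicodedata.normalize("NFC", text).translate(ARABIC_TO_URDU)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def tokenize(text):
    """Split text into letter-only tokens."""
    return WORD_RE.findall(text)


# Illustrative sample only; the project's full list is in utils/urdu_stopwords.py.
SAMPLE_STOPWORDS = frozenset(
    normalize(word)
    for word in ("کی", "کا", "کے", "ہے", "اور", "میں", "یہ", "سے", "کو")
)


def preprocess(text, stopwords=SAMPLE_STOPWORDS):
    """Normalize, tokenize, and remove stopwords."""
    return [tok for tok in tokenize(normalize(text)) if tok not in stopwords]
```

Normalizing before the stopword lookup matters because the same word can be typed with different code points; without it, visually identical stopwords may fail to match.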
- Naïve Bayes — Fast probabilistic baseline
- Support Vector Machine (SVM) — Strong for high-dimensional text
- Decision Tree — Interpretable, rule-based
- Random Forest — Ensemble of trees, robust
- k-Nearest Neighbors (k-NN) — Distance-based classifier
- Voting Ensemble — Combines Naïve Bayes, SVM, and Random Forest
- TF-IDF transforms raw word counts into weighted relevance scores
- Stratified K-Fold keeps the positive/negative class ratio roughly the same in every fold
- N-grams capture both individual words and multi-word phrases
- Ensemble methods can reduce variance and improve generalization by combining the predictions of several models
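The concepts above can be combined in a few lines of scikit-learn. This is a sketch under stated assumptions, not the code in `q5_classification.py` or `q7_ensemble.py`: the specific estimators (`MultinomialNB`, `LinearSVC`), hard voting, the 1-3 n-gram range, five folds, and the helper names are all illustrative choices. It expects already preprocessed review strings.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_pipeline(random_state=42):
    """TF-IDF over unigrams, bigrams, and trigrams feeding a voting ensemble."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 3))
    ensemble = VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("svm", LinearSVC(random_state=random_state)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=random_state)),
        ],
        voting="hard",  # majority vote on predicted labels (assumed setting)
    )
    return make_pipeline(vectorizer, ensemble)


def cross_validate(texts, labels, n_splits=5, random_state=42):
    """Return one accuracy score per stratified fold."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return cross_val_score(
        build_pipeline(random_state), texts, labels, cv=folds, scoring="accuracy"
    )
```

Placing the vectorizer inside the pipeline means TF-IDF weights are fitted on each training fold only, so the held-out fold does not leak into the features.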
Each model is evaluated using Accuracy, Precision, Recall, and F1 Score.
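For reference, the four metrics can be computed by hand from the confusion counts. The sketch below is an illustrative pure-Python version (the function name and the `"positive"` label string are assumptions); the project itself may rely on a library for this.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Precision answers "of the reviews predicted positive, how many were positive", while recall answers "of the positive reviews, how many were found"; F1 is their harmonic mean.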
This assignment demonstrates proficiency in:
✅ Data Preprocessing — Handling multilingual text (Urdu)
✅ Feature Engineering — TF-IDF, N-grams, and vectorization
✅ Classification Models — Implementing and comparing multiple algorithms
✅ Model Evaluation — Cross-validation, metrics (Accuracy, Precision, Recall, F1)
✅ Ensemble Methods — Combining multiple classifiers through voting for improved performance
✅ Visualization — WordCloud, comparison charts, confusion matrices
| Link | Description |
|---|---|
| GitHub Repository | Main assignment repository |
| Project Source | Urdu Sentiment Analysis project |
| Dataset (Kaggle) | IMDB 50K Urdu Reviews |
| Utility Functions | Shared preprocessing utilities |
- Submission Type: GitHub Repository
- Repository Owner: codewithdark-git
- Assignment Type: Machine Learning (NLP + Classification)
- Status: Complete with all 7 questions implemented
Ahsan Umar
Roll No: 232552
Course: Machine Learning
Assignment: 2/2
Dataset Credit: Kaggle — akkefa
Created as part of Machine Learning coursework