📝 Machine Learning Assignment No. 2

📋 Assignment Details

Name: Ahsan Umar
Roll No: 232552
Course: Machine Learning
Assignment No: 2
GitHub Repository: codewithdark-git/USA-MLAssignment
Project Directory: urdu-sentiment-analysis

🎬 Project Title

Urdu Movie Reviews — Sentiment Analysis

Project Description

This project implements an NLP and machine learning pipeline that classifies Urdu movie reviews as positive or negative. It covers data loading and exploratory analysis, WordCloud visualization, preprocessing, N-gram and TF-IDF feature extraction, five classifiers, a comparative analysis of their results, and a voting ensemble.

Dataset: IMDB 50K Movie Reviews — Translated Urdu (Kaggle)


🗂️ Repository Structure

urdu-sentiment-analysis/
│
├── README.md                        ← Assignment document
├── requirements.txt                 ← All dependencies
├── run_all.py                       ← Execute all questions
│
├── data/
│   ├── imdb_urdu_reviews.csv        ← Dataset (50K reviews)
│   └── README_data.md               ← Instructions for dataset
│
├── utils/                           ← Shared utilities
│   ├── __init__.py
│   ├── urdu_stopwords.py            ← Urdu stopword list
│   └── text_utils.py                ← Reusable preprocessing functions
│
├── Q1_Data_Loading/
│   ├── README.md
│   └── q1_data_loading.py           ← Data Loading & EDA
│
├── Q2_Visualization/
│   ├── README.md
│   └── q2_wordcloud.py              ← WordCloud Visualization
│
├── Q3_Preprocessing/
│   ├── README.md
│   └── q3_preprocessing.py          ← Data Preprocessing
│
├── Q4_Feature_Extraction/
│   ├── README.md
│   └── q4_feature_extraction.py     ← N-grams & TF-IDF
│
├── Q5_Classification/
│   ├── README.md
│   └── q5_classification.py         ← ML Classification Models
│
├── Q6_Comparative_Analysis/
│   ├── README.md
│   ├── q6_comparative_analysis.py   ← Model Comparison
│   └── q6_comparison.py
│
├── Q7_Ensemble/
│   ├── README.md
│   └── q7_ensemble.py               ← Ensemble Methods
│
└── outputs/
    ├── q5_cv_results.csv            ← Cross-validation results
    └── q7_ensemble_results.csv      ← Ensemble results

⚡ Installation & Usage

Prerequisites

  • Python 3.8+
  • Git
  • pip package manager

Step 1: Clone the Repository

git clone https://github.com/codewithdark-git/USA-MLAssignment.git
cd USA-MLAssignment/urdu-sentiment-analysis

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Prepare the Dataset

  1. Download the IMDB 50K Urdu Reviews dataset from Kaggle
  2. Extract the archive and place the CSV file inside the data/ folder
  3. Verify that the file is at data/imdb_urdu_reviews.csv
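
A quick sanity check after Step 3 might look like the sketch below. It is not taken from the repository: the column name `sentiment` and the sample rows are assumptions made for illustration, and the in-memory sample stands in for the real CSV so the snippet runs without the dataset.

```python
import csv
import io
from collections import Counter

def label_distribution(csv_file, label_column="sentiment"):
    """Count how many rows carry each label in an open CSV file.

    The default column name "sentiment" is an assumption; adjust it to
    match the header of the downloaded file.
    """
    reader = csv.DictReader(csv_file)
    if label_column not in (reader.fieldnames or []):
        raise KeyError(f"column {label_column!r} not found in {reader.fieldnames}")
    return Counter(row[label_column] for row in reader)

# Tiny in-memory stand-in for data/imdb_urdu_reviews.csv (hypothetical rows).
sample = io.StringIO(
    "review,sentiment\n"
    "فلم بہت اچھی تھی,positive\n"
    "کہانی بالکل بیکار تھی,negative\n"
    "اداکاری شاندار تھی,positive\n"
)
counts = label_distribution(sample)
print(counts)
```

For the real file, open it with `open("data/imdb_urdu_reviews.csv", encoding="utf-8", newline="")` and pass the file object to `label_distribution`.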

Step 4: Execute Questions

Option A: Run all questions sequentially

python run_all.py

Option B: Run individual questions

python Q1_Data_Loading/q1_data_loading.py
python Q2_Visualization/q2_wordcloud.py
python Q3_Preprocessing/q3_preprocessing.py
python Q4_Feature_Extraction/q4_feature_extraction.py
python Q5_Classification/q5_classification.py
python Q6_Comparative_Analysis/q6_comparative_analysis.py
python Q7_Ensemble/q7_ensemble.py

📦 Questions Overview

# | Topic | Key Techniques | Files
Q1 | Data Loading & Preliminary Analysis | Pandas, EDA, class distribution | Q1_Data_Loading
Q2 | Data Visualization | WordCloud, Matplotlib, Text analysis | Q2_Visualization
Q3 | Data Preprocessing | Normalization, Tokenization, Stopwords, Stemming | Q3_Preprocessing
Q4 | Feature Extraction | Unigrams, Bigrams, Trigrams, TF-IDF | Q4_Feature_Extraction
Q5 | ML Classification | Naïve Bayes, SVM, Decision Tree, RF, k-NN | Q5_Classification
Q6 | Comparative Analysis | Performance metrics comparison, visualizations | Q6_Comparative_Analysis
Q7 | Ensemble Methods | Voting Classifier (NB + SVM + RF) | Q7_Ensemble
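
To make the Q3 and Q4 techniques concrete, here is a minimal standard-library sketch of normalization, tokenization, stopword removal, and N-gram generation. It is an illustration, not the code in `utils/text_utils.py`: the function names and the short stopword set are hypothetical, and stemming is left out.

```python
import unicodedata

# Hypothetical mini stopword set; the project keeps its own list in
# utils/urdu_stopwords.py.
STOPWORDS = {"کی", "کا", "کے", "ہے", "تھی", "یہ", "اور"}

def normalize(text):
    """NFC-compose the text, drop combining marks, turn non-letters into spaces."""
    text = unicodedata.normalize("NFC", text)
    chars = []
    for ch in text:
        if unicodedata.category(ch) == "Mn":  # diacritics and other combining marks
            continue
        chars.append(ch if ch.isalpha() else " ")
    return "".join(chars)

def preprocess(text, stopwords=STOPWORDS):
    """Normalize, split on whitespace, and remove stopwords."""
    return [tok for tok in normalize(text).split() if tok not in stopwords]

def ngrams(tokens, n):
    """Return all contiguous n-token phrases, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess("یہ فلم بہت اچھی تھی۔")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Unigrams are the tokens themselves (`n=1`); trigrams use `n=3`.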

📊 ML Algorithms Used

  • Naïve Bayes — Fast probabilistic baseline
  • Support Vector Machine (SVM) — Strong for high-dimensional text
  • Decision Tree — Interpretable, rule-based
  • Random Forest — Ensemble of trees, robust
  • k-Nearest Neighbors (k-NN) — Distance-based classifier
  • Voting Ensemble — Combines three models (NB + SVM + RF)
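
The voting idea can be sketched in a few lines of plain Python. This shows hard (majority) voting over made-up predictions. Whether the project uses hard or soft voting is not stated here, so treat this as an assumption, and the three prediction lists are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Hard voting: for each sample, return the label most models predicted."""
    voted = []
    for sample_preds in zip(*predictions_per_model):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

# Hypothetical predictions from three models on four reviews.
nb = ["pos", "neg", "pos", "neg"]
svm = ["pos", "pos", "pos", "neg"]
rf = ["neg", "pos", "pos", "neg"]

ensemble = majority_vote([nb, svm, rf])
print(ensemble)  # ['pos', 'pos', 'pos', 'neg']
```

With three voters and two classes a tie cannot occur, which is one reason an odd number of models is convenient for binary sentiment classification.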

🔑 Key Concepts

  • TF-IDF weights each term by how often it appears in a review, discounted by how many reviews contain it
  • Stratified K-Fold preserves the positive/negative class ratio in every fold
  • N-grams capture both individual words (unigrams) and multi-word phrases (bigrams, trigrams)
  • Ensemble methods can reduce variance and improve generalization
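
As a worked illustration of the TF-IDF idea, the sketch below computes the textbook form, tf × log(N / df), in plain Python. Library implementations such as scikit-learn's use a smoothed and normalized variant, so the numbers will differ. The English placeholder tokens are for readability only.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["film", "good", "story"],
    ["film", "bad", "story"],
    ["film", "good", "acting"],
]
weights = tfidf(docs)
print(weights[1])
```

"film" occurs in every document, so its weight is 0. "bad" occurs in only one document and gets the highest weight in the second review.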

📈 Evaluation Metrics

Each model is evaluated using Accuracy, Precision, Recall, and F1 Score.
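
For reference, the four metrics can be computed from the confusion counts as in the sketch below. The labels and predictions are made up for illustration, and the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Accuracy, precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions for six reviews.
y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative", "negative"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here two of three positive reviews are found (recall 2/3) and two of three positive predictions are correct (precision 2/3), so F1 is also 2/3.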


📚 Learning Outcomes

This assignment demonstrates proficiency in:

  • Data Preprocessing — Handling Urdu text
  • Feature Engineering — TF-IDF, N-grams, and vectorization
  • Classification Models — Implementing and comparing multiple algorithms
  • Model Evaluation — Cross-validation, metrics (Accuracy, Precision, Recall, F1)
  • Ensemble Methods — Combining multiple classifiers for improved performance
  • Visualization — WordCloud, comparison charts, confusion matrices
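
The cross-validation item relies on stratified splitting. A minimal sketch of the idea follows. It deals each class's samples round-robin into folds and does no shuffling, so it is a simplification of what a library splitter does. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's indices round-robin so every fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = ["positive"] * 6 + ["negative"] * 6
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Each fold ends up with two positive and two negative samples, mirroring the 50/50 split of the toy labels.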


🔗 Important Links

GitHub Repository: Main assignment repository
Project Source: Urdu Sentiment Analysis project
Dataset (Kaggle): IMDB 50K Urdu Reviews
Utility Functions: Shared preprocessing utilities

📝 Submission Information

  • Submission Type: GitHub Repository
  • Repository Owner: codewithdark-git
  • Assignment Type: Machine Learning (NLP + Classification)
  • Status: Complete with all 7 questions implemented

👤 Author

Ahsan Umar
Roll No: 232552
Course: Machine Learning
Assignment: 2/2


📄 License & Attribution

Dataset Credit: Kaggle — akkefa
Created as part of Machine Learning coursework
