A collection of Natural Language Processing projects demonstrating expertise in Text Generation, Sequence Modeling, and Language Understanding using TensorFlow, NLTK, and modern NLP techniques.
| # | Project | Task | Notebook | Technique |
|---|---|---|---|---|
| 1 | Text Generator | Language Modeling | 01_text_generator.ipynb | RNN/LSTM Sequence Generation |
| 2 | NLP Final Project | Comprehensive NLP | 02_nlp_final_project.ipynb | Multiple NLP Tasks |
- TensorFlow/Keras - Deep learning for NLP
- NLTK - Natural Language Toolkit
- spaCy - Industrial-strength NLP
- Transformers - State-of-the-art models (optional)
- Tokenization - Word and sentence splitting
- Lemmatization & Stemming - Word normalization
- Stop Words Removal - Text cleaning
- Word Embeddings - Word2Vec, GloVe
- RNNs - Recurrent Neural Networks
- LSTMs - Long Short-Term Memory
- GRUs - Gated Recurrent Units
- Attention Mechanisms - Focus on relevant parts
- Python 3.8 or higher
1. Clone the repository

       git clone https://github.com/uzi-gpu/nlp-projects.git
       cd nlp-projects

2. Create a virtual environment

       python -m venv venv
       source venv/bin/activate   # On Windows: venv\Scripts\activate

3. Install dependencies

       pip install -r requirements.txt

4. Download NLTK data (if needed)

       import nltk
       nltk.download('punkt')
       nltk.download('stopwords')
       nltk.download('wordnet')

5. Launch Jupyter Notebook

       jupyter notebook
File: 01_text_generator.ipynb
Objective: Build a character-level or word-level text generator using Recurrent Neural Networks
Task: Language Modeling & Text Generation
Architecture:
- Input: Sequences of characters/words
- Model: LSTM/GRU layers
- Output: Next character/word prediction
Implementation (illustrative code sketches follow this list):
1. Data Preprocessing:
   - ✅ Text corpus loading
   - ✅ Tokenization (character or word-level)
   - ✅ Sequence creation
   - ✅ Vocabulary building
   - ✅ One-hot encoding or embeddings
2. Model Architecture:

       Model: Sequential
       ├── Embedding Layer (word-level) OR Input Layer (char-level)
       ├── LSTM/GRU Layers (stacked)
       ├── Dropout (regularization)
       ├── Dense Layer
       └── Softmax (probability distribution)

3. Training:
   - ✅ Teacher forcing
   - ✅ Cross-entropy loss
   - ✅ Adam optimizer
   - ✅ Perplexity tracking
4. Text Generation:
   - ✅ Seed text input
   - ✅ Sampling strategies (greedy, temperature, top-k)
   - ✅ Beam search (optional)
   - ✅ Diverse output generation
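A minimal word-level preprocessing sketch for step 1, using Keras' `Tokenizer`; the corpus path `corpus.txt` and `SEQ_LEN` are placeholder assumptions, not values from the notebook:

```python
# Word-level preprocessing sketch (step 1): tokenize, build a vocabulary, and
# slide a fixed-length window over the corpus to create (input, target) pairs.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

SEQ_LEN = 20  # words per input sequence (placeholder)

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts([text])
ids = tokenizer.texts_to_sequences([text])[0]
vocab_size = len(tokenizer.word_index) + 1

# Each SEQ_LEN-word window predicts the word that follows it
X = np.array([ids[i:i + SEQ_LEN] for i in range(len(ids) - SEQ_LEN)])
y = np.array([ids[i + SEQ_LEN] for i in range(len(ids) - SEQ_LEN)])
```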
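A possible stacked-LSTM model and training setup for steps 2 and 3, continuing from the preprocessing sketch; layer sizes, dropout rate, and epoch count are illustrative choices rather than the notebook's settings:

```python
# Stacked LSTM language model sketch (steps 2 and 3).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128),   # word embeddings
    LSTM(256, return_sequences=True),                  # first recurrent layer
    Dropout(0.2),                                      # regularization
    LSTM(256),                                         # second recurrent layer
    Dense(vocab_size, activation="softmax"),           # next-word distribution
])

# Sparse categorical cross-entropy keeps integer targets (no one-hot needed)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, batch_size=128, epochs=20)
```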
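And a temperature-controlled sampling loop for step 4, reusing `tokenizer`, `model`, and `SEQ_LEN` from the sketches above; lower temperatures give greedier output, higher ones give more diverse output (greedy decoding and top-k are simple variants of the same idea):

```python
# Text generation sketch (step 4): rescale the predicted distribution by a
# temperature, sample the next word, and append it to the running context.
import numpy as np

def sample(probs, temperature=1.0):
    logits = np.log(probs + 1e-9) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=probs)

def generate(seed_text, n_words=50, temperature=0.8):
    words = seed_text.lower().split()
    for _ in range(n_words):
        ids = tokenizer.texts_to_sequences([" ".join(words[-SEQ_LEN:])])[0]
        ids = np.pad(ids, (SEQ_LEN - len(ids), 0))        # left-pad short seeds
        probs = model.predict(np.array([ids]), verbose=0)[0]
        next_id = sample(probs, temperature)
        words.append(tokenizer.index_word.get(next_id, "<unk>"))
    return " ".join(words)

print(generate("once upon a time", temperature=0.7))
```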
Key Features:
- Character-level generation for creative text
- Word-level generation for coherent sentences
- Temperature-controlled creativity
- Sequence padding and batching
Applications:
- Creative writing assistance
- Code generation
- Poetry/story generation
- Chatbot responses
File: 02_nlp_final_project.ipynb
Objective: Comprehensive NLP project covering multiple language processing tasks
Tasks Covered (illustrative code sketches follow this list):
1. Text Preprocessing Pipeline:
   - ✅ Tokenization
   - ✅ Lowercasing
   - ✅ Stop words removal
   - ✅ Punctuation handling
   - ✅ Lemmatization/Stemming
   - ✅ Text normalization
2. Feature Extraction:
   - ✅ Bag of Words (BoW)
   - ✅ TF-IDF (Term Frequency-Inverse Document Frequency)
   - ✅ N-grams
   - ✅ Word embeddings (Word2Vec, GloVe)
3. NLP Tasks:
   - Text Classification
   - Sentiment Analysis
   - Named Entity Recognition (NER)
   - Part-of-Speech (POS) Tagging
   - Text Summarization
   - Language Translation (if applicable)
4. Advanced Techniques:
   - ✅ Sequence-to-Sequence models
   - ✅ Attention mechanisms
   - ✅ Transfer learning with pre-trained models
   - ✅ Fine-tuning BERT/GPT (optional)
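A minimal NLTK preprocessing pipeline for step 1, a hedged sketch rather than the notebook's code; it relies on the punkt, stopwords, and wordnet data downloaded during setup:

```python
# Preprocessing pipeline sketch (step 1): tokenize, lowercase, drop stop words
# and punctuation, then lemmatize.
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The cats were sitting on the mats, watching the birds."))
# -> ['cat', 'sitting', 'mat', 'watching', 'bird']
```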
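Feature extraction for step 2, shown with scikit-learn's vectorizers; scikit-learn is not in the tool list above, so treat it as an assumed extra dependency:

```python
# Bag-of-Words vs. TF-IDF feature extraction sketch (step 2).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the movie was great and the acting was great",
    "the movie was terrible",
    "great acting, terrible plot",
]

bow = CountVectorizer()                      # raw term counts
tfidf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams, idf-weighted

X_bow = bow.fit_transform(docs)
X_tfidf = tfidf.fit_transform(docs)
print(X_bow.shape, X_tfidf.shape)            # (n_docs, vocabulary size)
```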
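POS tagging and NER from step 3, sketched with spaCy; this assumes the small English model has been installed via `python -m spacy download en_core_web_sm`:

```python
# POS tagging and Named Entity Recognition sketch (step 3).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Berlin in March 2024.")

for token in doc:
    print(token.text, token.pos_)       # coarse part-of-speech tag per token

for ent in doc.ents:
    print(ent.text, ent.label_)         # e.g. Apple/ORG, Berlin/GPE, March 2024/DATE
```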
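Finally, a taste of the transfer learning in step 4 via the optional Transformers dependency; the pre-trained checkpoint named here is a common public sentiment model chosen for illustration, not one taken from the notebook:

```python
# Transfer learning sketch (step 4): reuse a pre-trained sentiment classifier.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("The final project turned out better than expected!"))
# -> [{'label': 'POSITIVE', 'score': ...}]
```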
Pipeline:
Raw Text → Preprocessing → Feature Extraction → Model Training → Evaluation → Deployment
Evaluation Metrics:
- Classification: Accuracy, Precision, Recall, F1-Score
- Generation: BLEU, ROUGE, Perplexity
- NER: Entity-level F1
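For reference, perplexity is just the exponentiated average cross-entropy, and the classification metrics can be computed with scikit-learn (again an assumed extra dependency); a toy sketch:

```python
# Metric sketch: perplexity from cross-entropy plus classification scores.
# All numbers are toy values, not results from the notebooks.
import math
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

avg_cross_entropy = 1.9                      # nats per token (toy value)
perplexity = math.exp(avg_cross_entropy)     # perplexity = exp(cross-entropy)
print(round(perplexity, 2))

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(accuracy_score(y_true, y_pred), precision, recall, f1)
```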
- Tokenization - Breaking text into words/sentences
- Normalization - Lowercasing, stemming, lemmatization
- Stop Words - Removing common words
- Special Characters - Cleaning punctuation
- Bag of Words - Simple word frequency
- TF-IDF - Term importance weighting
- Word Embeddings - Dense vector representations
- Contextual Embeddings - BERT, ELMo
- RNNs - Recurrent architectures
- LSTMs - Long-term dependencies
- GRUs - Gated mechanisms
- Bidirectional RNNs - Context from both directions
- Attention Mechanisms - Focus on relevant parts
- Transformer Architecture - Self-attention
- Transfer Learning - Pre-trained models
- Fine-tuning - Task-specific adaptation
- Perplexity: Achieved low perplexity, indicating good language modeling
- Coherence: Generated text shows grammatical structure
- Creativity: The temperature parameter controls output diversity
- Quality: Context is maintained across longer generated sequences
- Classification Accuracy: High performance on text classification tasks
- Feature Engineering: TF-IDF outperforms BoW
- Model Comparison: Deep learning models excel on complex tasks
- Pipeline: End-to-end NLP workflow successfully implemented
Through these projects, I have demonstrated proficiency in:
1. NLP Fundamentals
   - Text preprocessing and cleaning
   - Tokenization strategies
   - Feature extraction techniques
   - Vocabulary management
2. Deep Learning for NLP
   - Recurrent architectures (RNN, LSTM, GRU)
   - Sequence-to-sequence models
   - Attention mechanisms
   - Loss functions for language tasks
3. Practical NLP
   - Data pipeline creation
   - Model training and evaluation
   - Text generation strategies
   - Real-world application development
4. Advanced Topics
   - Transfer learning in NLP
   - Word embeddings
   - Language modeling
   - Evaluation metrics (BLEU, perplexity)
Uzair Mubasher - BSAI Graduate
This project is licensed under the MIT License - see the LICENSE file for details.
- NLTK and spaCy communities
- TensorFlow/Keras documentation
- NLP course instructors and resources
⭐ If you found this repository helpful, please consider giving it a star!