## Building a Natural Language Processing (NLP) Model that Rates the Sentiment of Tweets about Apple and Google Products as Positive, Negative or Neutral.

+ **Student:** Wambui Munene
+ **Student pace:** DSPT08
+ **Scheduled project review date/time:** 12/02/2025 23.59 Hours
+ **Instructor name:** Samuel Karu

## Project Summary

### Business and Data Understanding

The objective of this project is to build a Natural Language Processing (NLP) model that rates the sentiment of tweets about Apple and Google products as positive, negative or neutral. The dataset used to build the model is sourced from CrowdFlower via data.world https://data.world/crowdflower/brands-and-product-emotions. This dataset consists of over 8,000 human-rated tweets.

Sentiment analysis is a powerful tool that gives businesses deep insight into public perception of their products and services. With it, companies can gauge customer sentiment and understand the emotional tone behind customer interactions. This lets them identify areas of concern in real time, address customer needs proactively and improve their offerings.

Social media is a dynamic and widespread platform where customers freely express their thoughts and feelings about products, services, and brands. Using social media platforms like X (formerly Twitter) to gauge sentiment is immensely valuable for businesses because it provides real-time, unfiltered insight into customer opinions and experiences.

By analyzing these sentiments, companies can tap into a wealth of authentic feedback that traditional surveys or feedback forms might miss. This immediate access to customer sentiment enables businesses to swiftly identify trends, preferences, and potential issues, allowing for proactive engagement and timely adjustments to strategies.

Additionally, sentiment analysis helps businesses understand the broader market landscape, see how competitors are faring, and tailor products to match or exceed market expectations.

### Data Preparation
Data preparation involved the following key steps, each critical for preparing text data for modeling:
1. **Data Cleaning:** 
- Used regular expressions (regex) to remove irrelevant information such as URLs, mentions (@) and hashtags (#).
- Converted all text to lowercase to ensure uniformity.
- Applied lemmatization to reduce words to their base forms, which keeps the analysis consistent and reduces the size of the vocabulary.
- Removed stop words (common words that typically do not carry significant meaning, such as "the", "is", "in" and "and"). This focuses the models on the more meaningful words in the text.
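
The cleaning steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than the project's actual code: the function name `clean_tweet`, the regex patterns and the tiny `STOP_WORDS` set are assumptions made for the example. The project itself used NLTK's stop-word list and `WordNetLemmatizer`; lemmatization is left out here so the sketch runs without downloading NLTK corpora. The sketch also assumes that a hashtag is dropped as a whole token, which the summary does not specify.

```python
import re

# Illustrative stop-word subset (assumption); the project used NLTK's English list.
STOP_WORDS = {"the", "is", "in", "and", "a", "an", "to", "of", "at", "for", "on"}

# Assumed patterns for the three kinds of noise named in the cleaning step.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Return lowercase tokens with URLs, mentions, hashtags and stop words removed."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.lower()
    # Replace punctuation and digits with spaces so only letters remain.
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(clean_tweet("@apple The new iPad is amazing! #SXSW http://bit.ly/abc"))
# → ['new', 'ipad', 'amazing']
```

In the full workflow, a lemmatization step would sit between tokenization and stop-word removal.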

2. **Feature Engineering:**
- Transformed the cleaned text data into a numerical representation (vectors) using Term Frequency-Inverse Document Frequency (TF-IDF). This technique evaluates the importance of a word in a document relative to the whole corpus.
- Adjusted the ngram_range parameter of the TF-IDF vectorizer to include both unigrams (single words) and bigrams (pairs of consecutive words) to capture context, enriching the feature set and improving model performance.


3. **Exploratory Data Analysis (EDA):**
- Analyzed the distribution of sentiment labels (positive, negative, neutral) using bar charts and value counts to understand class balance.
- Visualized the top 10 most common words in the data set.
- Created word clouds for positive, negative and neutral tweets to visualize the most common words in each sentiment class.
- Visualized bigrams using bar charts to identify common word pairs in the data set as a whole and in each sentiment class.
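
The counts behind these charts can be sketched with the standard library alone. The example below is only an illustration under assumed data: the four tokenized tweets and their labels are invented, and the project itself worked with pandas value counts plus Matplotlib, Seaborn and WordCloud for the plots.

```python
from collections import Counter

# Hypothetical cleaned, tokenized tweets paired with their sentiment labels.
tweets = [
    (["new", "ipad", "amazing"], "positive"),
    (["new", "ipad", "launch"], "neutral"),
    (["iphone", "battery", "terrible"], "negative"),
    (["love", "new", "ipad"], "positive"),
]

# Class balance: number of tweets per sentiment label.
label_counts = Counter(label for _, label in tweets)

# Word frequencies across the whole data set.
word_counts = Counter(tok for tokens, _ in tweets for tok in tokens)

# Bigram frequencies: pairs of consecutive tokens within each tweet.
bigram_counts = Counter(
    pair for tokens, _ in tweets for pair in zip(tokens, tokens[1:])
)

print(label_counts.most_common())
print(word_counts.most_common(10))
print(bigram_counts.most_common(3))
```

The `most_common` lists can be passed to a bar-chart function. Filtering `tweets` by label before counting gives the per-class versions.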
 
4. **Data Splitting:**
- Split the data into training, validation and test sets: 70% for training and 15% each for validation and test.
- The validation set was used to tune the hyperparameters and choose the best model configuration without overfitting to the test data.
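
One common way to obtain a 70/15/15 split is to call scikit-learn's `train_test_split` twice, as sketched below. The placeholder data, the `stratify` arguments and `random_state=42` are assumptions for this example. The summary does not say whether the project stratified its split or which seed it used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the cleaned tweets and their labels.
texts = [f"tweet {i}" for i in range(100)]
labels = ["positive", "negative", "neutral", "neutral"] * 25

# First hold out 30% of the data, then halve that portion into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
# → 70 15 15
```

Stratifying keeps the class proportions similar across the three sets, which helps when the sentiment classes are imbalanced.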
 
 

In [2]:
# A list of libraries used in the Data Preparation Process: 

# Regular Expressions (re): For cleaning text data
import re 

# NLTK (Natural Language Toolkit): For tokenization, stop words removal, and lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


# Scikit-learn: For TF-IDF vectorization
from sklearn.feature_extraction.text import TfidfVectorizer


# Pandas, Matplotlib and Seaborn for data manipulation and analysis and visualizations
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# WordCloud: For generating word clouds.
from wordcloud import WordCloud


### Modeling
The project utilized a combination of baseline models and advanced neural networks on the cleaned and vectorized data.

- Created pipelines to streamline data preprocessing, model training and evaluation. Preprocessing included scaling the TF-IDF vectors with StandardScaler from Scikit-learn so that features were on a comparable scale. The pipelines ensured a reproducible and efficient workflow and minimized the risk of data leakage.
- Initial models included Logistic Regression and Naive Bayes. These models were tuned using GridSearchCV to find the best hyperparameters, with cross-validation to guard against overfitting. The best hyperparameters were then checked on the validation set to fine-tune model performance.
- For advanced modeling, Convolutional Neural Networks (CNNs) were implemented to capture local patterns within the text data.
- The accuracy and ROC-AUC scores of the CNN models were compared to those of the baseline models to evaluate their performance improvements.
- After identifying the best-performing model, it was evaluated on the test set to provide an unbiased assessment of its generalization capability. The final test confirmed the model's robustness and accuracy in predicting sentiment on unseen data.
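
A baseline pipeline of the kind described above might look like the following sketch. It is an illustration under stated assumptions, not the project's code: the toy tweets, the step names (`tfidf`, `scaler`, `clf`), the parameter grid and `cv=2` are all made up for the example. `StandardScaler` is given `with_mean=False` because it cannot center sparse input such as a TF-IDF matrix; whether the project used that setting is not stated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical cleaned tweets; the real project used the CrowdFlower data.
texts = [
    "love new ipad", "great iphone app", "amazing google maps", "awesome android phone",
    "hate iphone battery", "terrible ipad screen", "awful google update", "bad android app",
    "ipad launch today", "google event sxsw", "iphone store line", "android talk today",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # with_mean=False keeps the matrix sparse; centering is not supported on sparse input.
    ("scaler", StandardScaler(with_mean=False)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed grid over the regularization strength of the classifier.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# The whole pipeline is refit inside each fold, so the vectorizer and scaler
# only ever see that fold's training portion.
grid = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.predict(["love google maps"]))
```

Keeping the vectorizer and scaler inside the pipeline is what limits data leakage: statistics such as the vocabulary and feature variances are learned from training folds only.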



In [5]:
# Scikit-learn for normalization.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Scikit-learn for creating pipelines, training models, hyperparameter tuning, and evaluation.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# TensorFlow/Keras for building and training Convolutional Neural Networks (CNNs).
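import tensorflow as tf
from tensorflow import keras
# Import from tensorflow.keras so the layers match the installed TensorFlow version.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Embedding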

