## BRANDS AND PRODUCTS EMOTIONS NLP PROJECT


**Business Problem:**

In an era dominated by social media, brands must continuously track customer sentiments expressed online. Twitter, in particular, has become a critical platform where users voice their opinions about products and brands. However, the vast volume and rapid pace of tweets make it impractical for businesses to manually analyze these opinions for insights. To address this, a Natural Language Processing (NLP) model needs to be developed to automatically classify the sentiment of tweets and determine which brand or product is the target of those sentiments.

The dataset from CrowdFlower includes over 9,000 tweets that have been evaluated for sentiment (positive, negative, or neutral) and tagged with the associated brand or product. The goal is to build an NLP model that can accurately and efficiently:

1. **Classify Sentiments**: Identify whether a tweet expresses positive, negative, or no emotion.
2. **Identify Brand/Product**: Recognize which brand or product is being referred to in the tweet.
3. **Handle Ambiguity**: Deal with tweets that might reference multiple brands or unclear sentiments.

Key challenges include:

- **Textual Variations**: Dealing with informal language, abbreviations, emojis, and slang used on social media.
- **Context Understanding**: Ensuring the model understands subtle and implicit expressions of sentiment.
- **Real-Time Processing**: Building a scalable solution that can process large volumes of data in real time for timely insights.

Solving this problem will help brands enhance their reputation management, respond promptly to consumer feedback, and optimize their marketing strategies based on real-time sentiment analysis.



## Overview:


**Data Understanding:**

The dataset provided contains 9,093 rows of tweet evaluations, focusing on tweets mentioning various brands and products. The data was crowdsourced, with contributors being asked to assess whether a tweet conveyed positive, negative, or no emotion towards a brand or product. If an emotion was expressed, the contributors were further asked to identify the specific brand or product being referred to.

### Key Features of the Data:
1. **Tweet Text** (`tweet_text`): The actual text of the tweet, which is the main input for sentiment analysis and brand/product identification.
2. **Emotion Label** (`is_there_an_emotion_directed_at_a_brand_or_product`): Indicates whether the tweet expresses:
   - Positive sentiment
   - Negative sentiment
   - No sentiment (neutral)
3. **Brand/Product Label** (`emotion_in_tweet_is_directed_at`): Identifies the brand or product that is the subject of the expressed emotion (if applicable). Many tweets have no value here, especially those classified as neutral.
4. **Contributor Annotations**: The labels come from crowd contributors, so they reflect subjective judgments. The CSV loaded in this notebook contains only the three columns above, so per-contributor votes are not available in it.

### Data Considerations:
- **Textual Noise**: As the dataset is composed of tweets, the text may contain slang, abbreviations, misspellings, and non-standard grammar typical of social media posts. Tweets may also include emojis and special characters, which need to be handled properly during data preprocessing.
- **Imbalanced Sentiment Distribution**: There may be a natural imbalance in the data, with more tweets being neutral or expressing positive emotions than negative ones. This imbalance could affect the performance of a sentiment classifier and should be considered during model training.
- **Ambiguous Sentiments**: Some tweets may contain conflicting signals, such as sarcasm or mixed emotions, making it difficult to assign a clear sentiment. The crowdsource annotations may reflect this ambiguity in some cases.
- **Multilabel Outputs**: A single tweet could reference multiple brands or products. The model needs to account for the possibility of multiple correct labels for brand/product identification.
- **Contributor Discrepancies**: Since sentiment is subjective, different contributors may have classified the same tweet differently. Understanding and resolving these discrepancies will be crucial to ensure the accuracy of the model’s training data.
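
The textual-noise point above can be illustrated with a small normalization helper. This is a minimal sketch, not the preprocessing this project necessarily uses: `clean_tweet` and the regular expressions are hypothetical, and the choice to drop mentions, links, punctuation, and emojis (rather than, say, mapping emojis to tokens) is an assumption made for brevity.

```python
import re

# Hypothetical patterns for illustration; tune them to the actual data.
URL_RE = re.compile(r"https?://\S+|\{link\}")   # raw URLs and the dataset's {link} placeholder
MENTION_RE = re.compile(r"@\w+")                # user mentions such as @sxsw
HASHTAG_RE = re.compile(r"#(\w+)")              # keep the hashtag word, drop the '#'
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # punctuation, emojis, other symbols
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lowercase a tweet and strip links, mentions, and non-alphanumeric noise."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("Ipad everywhere. #SXSW {link}"))  # ipad everywhere sxsw
```

Dropping emojis and punctuation discards some sentiment signal, so a tweet-aware tokenizer may be a better fit once modeling starts.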

### Data Exploration Goals:
1. **Sentiment Distribution**: Examine the distribution of positive, negative, and neutral tweets to understand the balance of sentiments in the data.
2. **Brand/Product Mentions**: Explore how many distinct brands or products are mentioned and the frequency of mentions for each brand or product.
3. **Textual Characteristics**: Investigate the average length of tweets, frequency of common words, use of emojis, and presence of hashtags or mentions. This will help tailor preprocessing steps.
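4. **Annotator Agreement**: Where per-contributor judgments are available, check how often annotators agreed on the sentiment or brand/product for each tweet, as this might highlight instances of ambiguity or disagreement. The three-column CSV used in this notebook does not include them.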
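
The first three exploration goals can be sketched with a few pandas calls. The frame below is a made-up toy sample, not real data; only the column names are taken from the dataset, and the variable names (`sample`, `sentiment_share`, `brand_counts`, `text_stats`) are illustrative.

```python
import pandas as pd

SENTIMENT_COL = "is_there_an_emotion_directed_at_a_brand_or_product"
BRAND_COL = "emotion_in_tweet_is_directed_at"

# Toy stand-in for the real dataset (rows are invented for illustration).
sample = pd.DataFrame({
    "tweet_text": [
        "Ipad everywhere. #SXSW {link}",
        "@mention my iPhone battery died again #fail",
        "Google party tonight at #SXSW",
        "Heading to #SXSW",
    ],
    BRAND_COL: ["iPad", "iPhone", "Google", None],
    SENTIMENT_COL: [
        "Positive emotion",
        "Negative emotion",
        "Positive emotion",
        "No emotion toward brand or product",
    ],
})

# 1. Sentiment distribution as proportions.
sentiment_share = sample[SENTIMENT_COL].value_counts(normalize=True)

# 2. Brand/product mentions, keeping missing labels visible.
brand_counts = sample[BRAND_COL].value_counts(dropna=False)

# 3. Simple textual characteristics per tweet.
text = sample["tweet_text"].fillna("")
text_stats = pd.DataFrame({
    "n_chars": text.str.len(),
    "n_hashtags": text.str.count(r"#\w+"),
    "n_mentions": text.str.count(r"@\w+"),
})

print(sentiment_share)
print(brand_counts)
print(text_stats.describe())
```

On the full dataset the same calls would run against `df` instead of `sample`.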

By understanding the structure and challenges of the data, we can ensure that the subsequent steps in the NLP model development process, such as preprocessing, feature engineering, and model selection, are tailored to the characteristics of the dataset.

Data source: https://data.world/crowdflower/brands-and-product-emotions

In [18]:
#import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.tokenize import RegexpTokenizer, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

from sklearn.model_selection import train_test_split, cross_validate
from numpy import array
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TensorFlow and Keras imports using tensorflow.keras namespace
from tensorflow import keras
from tensorflow.keras import regularizers, layers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Dense, Flatten, Embedding


In [20]:
#load the dataset
df = pd.read_csv('judge-1377884607_tweet_product_company.csv', encoding = 'unicode_escape')
df

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
| ... | ... | ... | ... |
| 9088 | Ipad everywhere. #SXSW {link} | iPad | Positive emotion |
| 9089 | Wave, buzz... RT @mention We interrupt your re... | NaN | No emotion toward brand or product |
| 9090 | Google's Zeiger, a physician never reported po... | NaN | No emotion toward brand or product |
| 9091 | Some Verizon iPhone customers complained their... | NaN | No emotion toward brand or product |


In [21]:
df.shape

(9093, 3)

In [22]:
# Column dtypes and non-null counts (df.info() prints directly and returns None)
print(df.info())
print("-" * 20)
print('Total duplicated rows')
print(df.duplicated().sum())
print("-" * 20)
print('Total null values')
print(df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
None
--------------------
Total duplicated rows
22
--------------------
Total null values
tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64
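
The output above shows one missing `tweet_text`, 5,802 missing brand/product labels, and 22 duplicated rows. One possible way to handle them is sketched below on a toy frame. The choices here are assumptions rather than the notebook's confirmed approach: dropping the row with no text, dropping exact duplicates, and filling missing brand labels with a hypothetical `"Unknown"` placeholder instead of discarding those rows.

```python
import pandas as pd

BRAND_COL = "emotion_in_tweet_is_directed_at"
SENTIMENT_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy frame mimicking the issues reported above (rows are invented).
toy = pd.DataFrame({
    "tweet_text": [
        "Ipad everywhere. #SXSW {link}",
        None,                               # missing tweet text
        "Ipad everywhere. #SXSW {link}",    # exact duplicate of row 0
        "Google party at #SXSW",
    ],
    BRAND_COL: ["iPad", None, "iPad", None],
    SENTIMENT_COL: [
        "Positive emotion",
        "No emotion toward brand or product",
        "Positive emotion",
        "No emotion toward brand or product",
    ],
})

cleaned = (
    toy.dropna(subset=["tweet_text"])   # a tweet with no text cannot be classified
       .drop_duplicates()               # remove fully identical rows
       .reset_index(drop=True)
)

# Keep rows without a brand label, but mark them explicitly (placeholder is an assumption).
cleaned[BRAND_COL] = cleaned[BRAND_COL].fillna("Unknown")

print(cleaned.shape)  # (2, 3)
```

Keeping the unlabeled rows preserves most of the data for sentiment classification, since the brand label is missing for the majority of tweets.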
