## How Do Words Feel? Exploring Sentiment Analysis and Emotion Detection

##### In this project, we'll delve into the fascinating world of sentiment analysis and emotion detection. We'll also tackle essential tasks like text preprocessing and feature engineering. Along the way, we'll explore a variety of machine learning techniques to create models that can classify text data. To assess our models, we'll use a tool called a confusion matrix. Let's start by reviewing a few terms.

##### Sentiment analysis and emotion detection are two essential techniques in natural language processing (NLP). Sentiment analysis assesses the overall sentiment of a sentence, categorizing it as positive, negative, or neutral, which offers insight into users' reactions to products or brands. However, it has limitations, such as its inability to capture the full spectrum of emotions. This is where emotion detection comes into play.

##### Emotion detection identifies specific emotions such as sadness, anger, and happiness in text data. This gives businesses a more comprehensive understanding of how people respond, which makes informed decision-making easier.


## Building a Custom Classifier

##### While there are many libraries available for predicting sentiment in text, the same doesn't hold true for detecting emotions, which is a more complex task. To handle this problem, we are going to take matters into our own hands and create a custom classifier. This classifier will classify emotions, and we will use it alongside a sentiment prediction library to assess both the emotional and sentiment aspects of text.

### About the Dataset

##### The GoEmotions dataset comprises 58,000 carefully selected Reddit comments annotated with 27 distinct emotion categories plus a neutral class. The categories span a broad spectrum of emotional responses, including admiration, amusement, anger, and more. Each comment is a data point showing how people express emotions within online communities, which makes the dataset a solid resource for both academic and professional work. For access to the dataset, please follow this link: https://github.com/google-research/google-research/blob/master/goemotions/README.md

In [None]:
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv

In [None]:
import pandas as pd
import os

f1 = pd.read_csv('/content/data/full_dataset/goemotions_1.csv')
f2 = pd.read_csv('/content/data/full_dataset/goemotions_2.csv')
f3 = pd.read_csv('/content/data/full_dataset/goemotions_3.csv')

#data = pd.concat([f1, f2, f3], ignore_index= True)
data = pd.concat([f1], ignore_index= True)
#data.columns = ['text', '']
data.head()

In [None]:
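# 'data' is the DataFrame loaded in the previous cell.
# The emotion columns come after 'example_very_unclear', so select them relative
# to that column rather than by a hard-coded position.
emotion_cols = data.columns[data.columns.get_loc('example_very_unclear') + 1:]

# Melt the DataFrame to combine the emotion columns into a single 'emotion' column
melted = data.melt(id_vars=['text', 'example_very_unclear'], value_vars=emotion_cols, var_name='emotion', value_name='emotion_value')

# Filtering to get rows where emotion_value is 1
melted = melted[melted['emotion_value'] == 1]

# Selecting only relevant columns (copy so later assignments don't touch 'melted')
result = melted[['text', 'emotion']].copy()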

# Displaying the resulting DataFrame
print(result)
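
To make the reshaping step easier to follow, here is a minimal sketch of the same wide-to-long idea on a toy DataFrame. The three example sentences and the three emotion columns are made up for illustration; the real dataset has many more columns.

```python
import pandas as pd

# Toy stand-in for the GoEmotions layout: one text column plus one 0/1 column per emotion.
toy = pd.DataFrame({
    "text": ["great job!", "why though?", "ugh, not again"],
    "admiration": [1, 0, 0],
    "curiosity": [0, 1, 0],
    "annoyance": [0, 0, 1],
})

# Wide -> long: every (text, emotion) pair becomes its own row.
long_form = toy.melt(id_vars=["text"], var_name="emotion", value_name="emotion_value")

# Keep only the pairs that were actually annotated (value 1).
labelled = long_form.loc[long_form["emotion_value"] == 1, ["text", "emotion"]].reset_index(drop=True)
print(labelled)
```

Each comment ends up with one row per emotion it was tagged with, so a comment with two emotions would appear twice.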


In [None]:
result.columns = ['text', 'emotion']
print(result)


In [None]:
# Dictionary mapping emotion column names to emotion names
emotion_mapping = {
    'admiration': 'admiration',
    'amusement': 'amusement',
    'anger': 'anger',
    'annoyance': 'annoyance',
    'approval': 'approval',
    'caring': 'caring',
    'confusion': 'confusion',
    'curiosity': 'curiosity',
    'desire': 'desire',
    'disappointment': 'disappointment',
    'disapproval': 'disapproval'
}

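# Keep only the emotions listed above: labels missing from the mapping become NaN and are dropped
result = result.assign(emotion=result['emotion'].map(emotion_mapping)).dropna(subset=['emotion'])


# Reassigning result to our main dataframe
data = result.reset_index(drop=True)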

# Displaying the DataFrame with emotion names
print(data)


In [None]:
data.head()

## Data Cleaning and Preprocessing

##### At this point we've done a small amount of processing to make the data clear and readable. Instead of using the binary values to retrieve the emotions, we're going to use the emotion labels. Data cleaning is important because it leads to better features and better accuracy. Common text preprocessing steps include changing case, correcting spelling, removing special characters, punctuation, and stop words, and normalization.

##### In order to do this we're going to use the following libraries to preprocess the text.

In [None]:
import re
import string

import matplotlib.pyplot as plt
import nltk
import numpy
import pandas
import pandas as pd
import textblob
import xgboost
import sklearn.feature_extraction.text as text
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from sklearn import (decomposition, ensemble, linear_model, metrics,
                     model_selection, naive_bayes, preprocessing, svm)
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, mean_absolute_error)
from textblob import TextBlob, Word
from textblob.classifiers import NaiveBayesClassifier

In [None]:
! pip install pandas nltk textblob

In [None]:
import nltk

nltk.download('stopwords')

In [None]:

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
from textblob import TextBlob
import re
from nltk.tokenize import word_tokenize



In [None]:
# Converting uppercase to lowercase (and collapsing repeated whitespace)

data['text'] = data['text'].apply(lambda text: " ".join(text.lower().split()))

In [None]:
# Removing punctuation and special characters
# (regex=True is required; a plain string replace would look for the literal pattern text)

data['text'] = data['text'].str.replace(r'[^\w\s]', '', regex=True)
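
A detail worth seeing in isolation: Python's built-in `str.replace` treats its first argument as literal text, while `re.sub` interprets it as a regular expression. The sample sentence below is made up for illustration.

```python
import re

sample = "wow!!! that's great, isn't it?"

# str.replace searches for the literal text "[^\w\s]", which never occurs, so nothing changes.
literal = sample.replace(r"[^\w\s]", "")

# re.sub interprets the pattern: drop every character that is neither a word character nor whitespace.
cleaned = re.sub(r"[^\w\s]", "", sample)

print(literal)
print(cleaned)
```

Only the second call actually strips the punctuation, which is why pattern-based cleaning needs a regex-aware function.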

***What are stopwords and why remove them?***

*Stopwords are common words like "the," "and," "of," etc., which occur frequently in language but often carry little specific meaning in a sentence. Removing them from text analysis helps focus on more important, content-bearing words, streamlining the process by filtering out ubiquitous but less informative terms. This aids in better identifying the core, meaningful words for tasks like sentiment analysis, text classification, or information retrieval, enhancing the accuracy and relevance of the analysis.*
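
As a small illustration of the idea, the sketch below filters a sentence against a tiny hand-picked stopword set. The set and the helper name `remove_stopwords` are assumptions made for this example; NLTK's English list is much longer.

```python
# Tiny hand-picked stopword set for illustration only.
STOPWORDS = {"the", "and", "of", "a", "is", "to", "in", "i", "this"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Return the text with stopwords dropped, preserving word order."""
    return " ".join(word for word in text.split() if word not in stopwords)

filtered = remove_stopwords("i love the ending of this movie")
print(filtered)
```

One caveat: NLTK's English list also contains negations such as "not" and "no", so removing stopwords can occasionally change the meaning of a sentence.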

In [None]:
# Removing stopwords

stop = set(stopwords.words('english'))  # a set makes the membership test fast
data['text'] = data['text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))

In [None]:
!pip install pyspellchecker

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spellings(text):
    """Replace words unknown to the spell checker with their most likely correction."""
    words = text.split()
    corrections = {}
    for word in spell.unknown(words):
        # correction() can return None when no candidate is found; keep the original word then
        corrections[word] = spell.correction(word) or word

    return " ".join(corrections.get(word, word) for word in words)

data['text'] = data['text'].apply(correct_spellings)

In [None]:
# Alternative: correcting misspelled words with TextBlob (disabled)

#data['text'] = data['text'].apply(lambda a: str(TextBlob(a).correct()))

***What is Stemming?***

*Stemming reduces words to their base or root form by stripping affixes; the Porter stemmer used here removes common suffixes. This normalizes words to their core meaning so that similar variations are treated as a single word, which simplifies text analysis and improves tasks like search and language processing.*

*For instance, stemming converts words like **"running"** and **"runs"** to their common root **"run"**. An irregular form such as **"ran"** is left unchanged by a stemmer; mapping it to **"run"** requires lemmatization.*
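
To show the mechanics of suffix stripping, here is a deliberately crude toy stemmer. It is not the Porter algorithm, just a sketch with three hard-coded suffixes and a made-up helper name `toy_stem`.

```python
def toy_stem(word):
    """Strip one of a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

stems = [toy_stem(w) for w in ["running", "runs", "ran", "jumped"]]
print(stems)
```

Rule-based stripping handles regular endings, but an irregular form like "ran" passes through untouched.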


In [None]:
# Normalizing: stemming each word with the Porter stemmer

stem = PorterStemmer()
data['text'] = data['text'].apply(lambda text: " ".join(stem.stem(word) for word in text.split()))

***Numeric Transformation of Categorical Data***

*Converting categorical values to numerical values is valuable for this analysis because many machine learning algorithms and statistical models require numerical inputs. scikit-learn's `LabelEncoder` translates each category into an integer, enabling the algorithms to interpret the data effectively. In our analysis we will use it to encode the emotion labels.*
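
Conceptually, label encoding is just a lookup table from category names to integers. The sketch below builds that table by hand in plain Python, using made-up labels, and numbers the categories in sorted order (the same ordering scikit-learn's `LabelEncoder` uses).

```python
labels = ["anger", "admiration", "curiosity", "anger", "admiration"]

# Assign integers in sorted order of the distinct labels.
classes = sorted(set(labels))
to_id = {name: idx for idx, name in enumerate(classes)}

# Encode to integers, then decode back to names using the same table.
encoded = [to_id[name] for name in labels]
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
```

Keeping the table around matters: it is what lets us turn a model's numeric predictions back into emotion names later. With a fitted `LabelEncoder`, the `classes_` attribute and the `inverse_transform` method serve that purpose.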

In [None]:
data['emotion'].value_counts()

In [None]:
# Transforming emotion categories to numerical categories

labelE = preprocessing.LabelEncoder()
data['emotion'] = labelE.fit_transform(data['emotion'])

In [None]:
data['emotion'].value_counts()

In [None]:
# Checking data after preprocessing

data.head()

## Train and Test Split

In [None]:
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(data['text'], data['emotion'],stratify= data['emotion'])
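
The `stratify` argument keeps the class proportions the same in the training and test sets, and `train_test_split` holds out 25% of the rows when no size is given. The following is a simplified pure-Python sketch of that idea, not scikit-learn's implementation; the function name and toy labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with each class split at the same rate."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

toy_labels = ["joy"] * 8 + ["anger"] * 4
train_idx, test_idx = stratified_split(toy_labels)
print(Counter(toy_labels[i] for i in train_idx))
print(Counter(toy_labels[i] for i in test_idx))
```

Both splits keep the 2:1 ratio of the toy labels, which matters for emotion data where some categories are much rarer than others.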

***What is feature engineering?***

*Feature engineering involves shaping and refining data to improve predictive models. Our focus here is to create or modify features that better capture the essence of the data, especially in textual content. Leveraging methods like count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency), we convert text into numerical representations, highlighting important patterns within the data. Count vectorization quantifies the occurrence of words in text, while TF-IDF reflects the significance of words in a document compared to their occurrence in a broader collection of documents. These techniques are instrumental in converting unstructured text into structured, numeric form, empowering machine learning models to extract meaningful insights from the text.*
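
To make the two representations concrete, here is a small pure-Python sketch that computes count vectors and TF-IDF weights for three made-up documents. It uses the smoothed IDF formula that scikit-learn documents for its defaults, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row; treat it as an illustration rather than a replacement for the vectorizers.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Count vectors: one row per document, one column per vocabulary word.
counts = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

# Document frequency: in how many documents each word appears.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed IDF, then L2-normalise each weighted row.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    tfidf.append([v / norm for v in weighted])

print(vocab)
print(counts)
```

In the second document, "bad" receives a higher TF-IDF weight than "movie" even though both occur once, because "movie" appears in more documents and is therefore less distinctive.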

In [None]:
# Instantiate the CountVectorizer
countV = CountVectorizer()

# Fit the CountVectorizer with the 'text' data from the entire dataset
countV.fit(data['text'])

# Transform the text data of the training set (Xtrain) into a document-term matrix
cv_xtrain = countV.transform(Xtrain)

# Transform the text data of the testing set (Xtest) into a document-term matrix using the same CountVectorizer
cv_xtest = countV.transform(Xtest)


In [None]:
# Create a TF-IDF Vectorizer instance
tVect = TfidfVectorizer()

# Fit the TF-IDF Vectorizer with the 'text' data from the entire dataset
tVect.fit(data['text'])

# Transform the text data of the training set (Xtrain) using the TF-IDF Vectorizer
tv_xtrain = tVect.transform(Xtrain)

# Transform the text data of the testing set (Xtest) using the same TF-IDF Vectorizer
tv_xtest = tVect.transform(Xtest)


In [None]:


def build(model, X_train, target, X_test):
  """Fit the model and return its accuracy against the global test labels Ytest."""
  # Train the model
  model.fit(X_train, target)

  # Predict using the trained model
  predictions = model.predict(X_test)

  # Calculate and return accuracy (arguments are y_true, then y_pred)
  return metrics.accuracy_score(Ytest, predictions)

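***What is Multinomial Naive Bayes?***

*Multinomial naive Bayes is a probabilistic classifier suited to word counts. It applies Bayes' theorem to compute the probability of each category given the words in a document, under the "naive" assumption that words occur independently of one another within a category. The predicted category is the one with the highest probability.*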
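
For intuition, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing. The function names and the four training documents are made up for illustration, and the code is far simpler than scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised documents."""
    vocab = {word for doc in docs for word in doc}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_likelihood": {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {
        label: params["log_prior"] + sum(params["log_likelihood"].get(w, 0.0) for w in doc)
        for label, params in model.items()
    }
    return max(scores, key=scores.get)

train_docs = [["love", "this"], ["great", "love"], ["hate", "this"], ["awful", "hate"]]
train_labels = ["joy", "joy", "anger", "anger"]
nb_model = train_nb(train_docs, train_labels)
prediction = predict_nb(nb_model, ["love", "great"])
print(prediction)
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on longer documents.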

In [None]:
# Naive Bayes Model with count vectors

cv_NBresult = build(naive_bayes.MultinomialNB(), cv_xtrain, Ytrain, cv_xtest)

print(cv_NBresult)

In [None]:
# Naive Bayes Model with TF-IDF vectors

tv_NBresult = build(naive_bayes.MultinomialNB(), tv_xtrain, Ytrain, tv_xtest)

print(tv_NBresult)

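***What is Random Forest?***

*A random forest is an ensemble of decision trees. Each tree is trained on a random bootstrap sample of the training data and considers a random subset of features at each split, and the forest combines the predictions of all its trees to choose the final category. Combining many varied trees generally reduces the overfitting that a single decision tree is prone to.*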
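
The two ingredients of a forest, bootstrap sampling and combining the trees' answers, can be sketched in a few lines. The three "trees" below are hand-written rules standing in for trained decision trees, purely for illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree in a forest would see."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(trees, features):
    """Ask every tree for a label and return the most common answer."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hand-written stand-ins for trained decision trees (hypothetical rules).
trees = [
    lambda f: "joy" if f["love"] > 0 else "anger",
    lambda f: "anger" if f["hate"] > 0 else "joy",
    lambda f: "joy" if f["love"] >= f["hate"] else "anger",
]

rows = ["doc0", "doc1", "doc2", "doc3"]
sample = bootstrap_sample(rows, random.Random(0))
label = majority_vote(trees, {"love": 1, "hate": 1})
print(len(sample), label)
```

The trees disagree on this input, and the vote settles the outcome: individual trees can be wrong as long as most of them are right.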

In [None]:
cv_RFresult = build(ensemble.RandomForestClassifier(), cv_xtrain, Ytrain, cv_xtest)

print(cv_RFresult)

In [None]:
tv_RFresult = build(ensemble.RandomForestClassifier(), tv_xtrain, Ytrain, tv_xtest)

print(tv_RFresult)

## Confusion Matrix

In [None]:
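# A higher max_iter gives the solver room to converge on sparse text features
classifier = linear_model.LogisticRegression(max_iter=1000).fit(tv_xtrain, Ytrain)
val_predictions = classifier.predict(tv_xtest)

# Precision, recall, F1-score and support for each class
y_true, y_pred = Ytest, val_predictions
print(classification_report(y_true, y_pred))

# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))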
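
To see what a confusion matrix contains, here is a small hand-rolled version on made-up labels. The helper name is an assumption for this sketch; rows correspond to true labels and columns to predicted labels.

```python
from collections import Counter

def confusion_matrix_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

toy_classes = ["anger", "joy"]
toy_true = ["joy", "joy", "anger", "anger", "joy"]
toy_pred = ["joy", "anger", "anger", "joy", "joy"]

matrix = confusion_matrix_counts(toy_true, toy_pred, toy_classes)
for name, row in zip(toy_classes, matrix):
    print(name, row)

# Precision and recall for "joy", read straight off the matrix.
joy = toy_classes.index("joy")
precision = matrix[joy][joy] / sum(row[joy] for row in matrix)
recall = matrix[joy][joy] / sum(matrix[joy])
```

The diagonal holds the correct predictions. Precision divides a diagonal entry by its column total, and recall divides it by its row total, which is how the per-class numbers in a classification report are derived.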

## Connecting to Twitter API

In [None]:
import requests
import pandas as pd

In [None]:
tData = []

In [None]:
payload = {
    'api_key':'ENTER YOUR SCRAPER API KEY',
    'query':'Meta',
    'num':'500'

}

res = requests.get(
    'https://api.scraperapi.com/structured/twitter/search', params=payload
)
res.raise_for_status()  # stop early on an HTTP error (for example, an invalid API key)

data = res.json()

In [None]:
data.keys()

In [None]:
allTweets = data['organic_results']
for tweet in allTweets:
  tData.append(tweet)

In [None]:
df = pd.DataFrame(tData)
df.to_json('tweets.json', orient='records', lines=True)  # JSON Lines, one tweet per line
print("exported")

In [None]:
df

In [None]:
twt = pd.read_json('tweets.json', lines = True, orient = 'records')

In [None]:
twt.to_csv('twt.csv', index=False)  # to_csv returns None when given a path, so don't reassign twt

In [None]:
twt = pd.read_csv('twt.csv')

In [None]:
twt = df[['snippet']].copy()  # copy so new columns can be added without a SettingWithCopyWarning

In [None]:
twt.tail()

In [None]:
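Xpredict = twt['snippet']

# Note: the model was trained on cleaned, stemmed text, so the tweets should ideally
# go through the same preprocessing steps before being vectorized.
pred_tfidf = tVect.transform(Xpredict)

# Convert the predicted class ids back to emotion names
twt['Emotion'] = labelE.inverse_transform(classifier.predict(pred_tfidf))
twt.tail()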

In [None]:
# Polarity ranges from -1 (negative) to 1 (positive)
twt['sentiment'] = twt['snippet'].apply(lambda a: TextBlob(a).sentiment.polarity)

def label_sentiment(row):
    if row['sentiment'] < 0:
        return 'Negative'
    if row['sentiment'] > 0:
        return 'Positive'
    return 'Neutral'

twt['Sentiment_label'] = twt.apply(label_sentiment, axis=1)
twt.tail()

In [None]:
! pip install chart_studio cufflinks

In [None]:
import chart_studio.plotly as py
import plotly as ply
import cufflinks as cf
from plotly.graph_objs import *
from plotly.offline import *
from IPython.display import display, HTML

init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True, theme='white')

Sentiment_df = pd.DataFrame(twt.Sentiment_label.value_counts().reset_index())
Sentiment_df.columns = ['sentiment', 'Count']
Sentiment_df = pd.DataFrame(Sentiment_df)
Sentiment_df['Percentage'] = 100 * Sentiment_df['Count']/ Sentiment_df['Count'].sum()
Sentiment_Max = Sentiment_df.iloc[0,0]


Sentiment_percent = str(round(Sentiment_df.iloc[0,2],2))
fig1 = Sentiment_df.iplot(kind='pie',labels='sentiment',values='Count',textinfo='label+percent', title= 'Sentiment Analysis', world_readable=True,
                    asFigure=True)
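# Write the chart to the same file name that is displayed below (file names are case-sensitive on Linux)
ply.offline.plot(fig1, filename="Sentiment.html", auto_open=False)

# Use IPython's display() function to read and display the HTML file
display(HTML(filename='Sentiment.html'))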

In [None]:
import chart_studio.plotly as py
import plotly as ply
import cufflinks as cf
from plotly.graph_objs import *
from plotly.offline import *
from IPython.display import display, HTML

init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True, theme='white')
Emotion_df = pd.DataFrame(twt.Emotion.value_counts().reset_index())
Emotion_df.columns = ['Emotion', 'Count']
Emotion_df = pd.DataFrame (Emotion_df)

# Convert 'Emotion' column to string type
#Emotion_df['Emotion'] = Emotion_df['Emotion'].astype(str)

Emotion_df['Percentage'] = 100 * Emotion_df['Count']/ Emotion_df['Count'].sum()
Emotion_Max = Emotion_df.iloc[0,0]
Emotion_percent = str(round(Emotion_df.iloc[0,2],2))
fig = Emotion_df.iplot(kind='pie', labels = 'Emotion', values = 'Count',pull= .2, hole=.2 , colorscale = 'reds', textposition='outside',colors=['red','green','purple','orange','blue','yellow','pink'],textinfo='label+percent', title= 'Emotion Analysis', world_readable=True,asFigure=True)
ply.offline.plot(fig, filename="Emotion.html", auto_open=False)


# Use IPython's display() function to read and display the HTML file
display(HTML(filename='Emotion.html'))