### BUILDING A MODEL THAT CAN RATE THE SENTIMENT OF A TWEET BASED ON ITS CONTENT

### Introduction 
Technology represents the application of knowledge to achieve practical goals in a reproducible manner. In today's rapidly evolving world, over 80 percent of the global population embraces technology as an integral part of their lives. Among the numerous tech giants, Google and Apple stand out as dominant players with an extensive worldwide market presence. These industry leaders have left an indelible mark through their innovative devices, cutting-edge software, and indispensable services. Their influence has been pivotal in fostering connectivity among individuals, effectively transforming our planet into a global village through seamless communication.

### Problem Statement:
 In a world where technology has become an integral part of our daily lives, there is an increasing need to understand the sentiments expressed towards tech giants like Google and Apple through social media platforms, particularly Twitter. To address this, the challenge is to develop a robust sentiment analysis model capable of accurately rating the sentiment of tweets based on their content. This model will play a crucial role in assessing public perception and sentiment towards the products and services offered by Google and Apple, allowing businesses and decision-makers to gain valuable insights into customer opinions and preferences in the ever-evolving tech landscape.

### Metrics of Success:
Effective classification of tweet sentiment towards different products.

Classification Accuracy: To measure the percentage of correctly classified tweets out of the total tweets in the dataset. A high classification accuracy indicates that the model effectively rates sentiments based on tweet content.

Precision and Recall: Precision and recall are essential metrics, especially in sentiment analysis where imbalanced classes are common. Precision measures the accuracy of positive sentiment predictions, while recall measures the ability of the model to identify all positive sentiments correctly. A balance between precision and recall is crucial.

F1-Score: The harmonic mean of precision and recall. It provides a single score that balances the trade-off between the two.

Confusion Matrix: Shows where the model is making mistakes and which sentiment class is more challenging to predict.
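
To make these definitions concrete, here is a minimal sketch that derives each metric from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts passed to it are illustrative assumptions, not values from this dataset.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts chosen only for illustration
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, which is why it is more informative than accuracy on imbalanced classes.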

### Research questions:
1. How do brand mentions in tweet text influence the sentiment classification results, particularly for brands like Apple and Google?
2. What are the comparative performances of Logistic Regression and Multinomial Naive Bayes models in classifying sentiment in tweets?
3. To what extent can the sentiment analysis models developed on this dataset be generalized to other similar datasets or social media platforms?


### Objectives:

1. **Continuous Text Preprocessing:** Tokenize the tweet text, remove stopwords, eliminate punctuation, and lemmatize words so that the text data is consistently clean and ready for analysis.

2. **Persistent Model Implementation:** Implement and compare sentiment classification models, namely Logistic Regression and Multinomial Naive Bayes, with a thorough evaluation of model performance.

3. **Sustainable Practical Applications:** Evaluate how the insights gained from sentiment analysis can be applied in real-world scenarios, such as marketing strategies and customer service enhancements for the brands and products.



### Data understanding


   - The dataset is sourced from CrowdFlower via data.world.
   - It contains over 9,000 tweets that have been rated by human raters for sentiment.
   - The sentiment labels are categorized as positive, negative, or neither (neutral).


   - The dataset consists of several columns, each serving a specific purpose.
   - Key columns in the dataset include:
     - `tweet_text`: Contains the text of the tweets.
     - `emotion_in_tweet_is_directed_at`: Names the brand or product that the emotion in the tweet is directed at (for example `iPhone`, `iPad`, or `Google`).
     - `is_there_an_emotion_directed_at_a_brand_or_product`: Provides the sentiment label (for example `Positive emotion` or `Negative emotion`).

### Importing necessary libraries

In [64]:
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC  # used in the SVM section

### Loading the data

In [65]:
data = pd.read_csv("judge-1377884607_tweet_product_company.csv", encoding='ISO-8859-1')

In [66]:
# checking the shape of the data
data.shape

(9093, 3)

The data has 9093 rows and 3 columns

In [67]:
# Checking the datatypes of different columns
data.dtypes

tweet_text                                            object
emotion_in_tweet_is_directed_at                       object
is_there_an_emotion_directed_at_a_brand_or_product    object
dtype: object

All three columns are strings (pandas `object` dtype)

### Preprocessing

Removing Stopwords: Removing stop words from the dataset.

Removing Punctuation: Removing punctuation marks, such as periods, commas, and exclamation points, from the remaining words.

In [68]:

# Building the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

# Function to remove stop words and punctuation from a text column
def preprocess_text(text):
    if isinstance(text, str):
        words = text.split()

        # Removing stopwords, then stripping punctuation characters from each word
        words = [word for word in words if word.lower() not in stop_words]
        words = [''.join(c for c in word if c not in string.punctuation) for word in words]

        # Joining the processed words back into a text
        return ' '.join(words)
    else:
        # Leaving missing (non-string) values unchanged
        return text

columns_to_preprocess = ['tweet_text', 'emotion_in_tweet_is_directed_at']
for column in columns_to_preprocess:
    data[column] = data[column].apply(preprocess_text)



Digits were not removed during preprocessing because they carry meaning in this data. For example, the 3 in 3G cannot be dropped because it identifies the network the product is using.
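
The preprocessing cell filters stop words before stripping punctuation, so a token such as `The,` does not match the stop word `the` and survives as `The`. The following is a minimal sketch of the reverse order (strip punctuation first, then filter), with digits kept as described. The function name `clean_tweet` and the tiny `STOP_WORDS` set are assumptions made so the example runs without NLTK data; the notebook itself uses the NLTK English stop word list.

```python
import string

# Tiny stand-in stop word list so the sketch runs without NLTK data (assumption)
STOP_WORDS = {"the", "is", "a", "on", "my", "and"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Strip punctuation from each word first, then drop stop words."""
    cleaned = []
    for word in text.split():
        word = word.translate(PUNCT_TABLE)
        # Skip words that were pure punctuation and words in the stop list
        if word and word.lower() not in STOP_WORDS:
            cleaned.append(word)
    return " ".join(cleaned)

example = clean_tweet("The, 3G on my iPhone is slow!")
print(example)
```

Digits pass through untouched because `string.punctuation` contains no digits, so `3G` is preserved.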

Tokenizing the `tweet_text` and `emotion_in_tweet_is_directed_at` columns

In [69]:

# Tokenizing 'tweet_text' column
data['tweet_text_tokens'] = data['tweet_text'].apply(lambda x: word_tokenize(x) if isinstance(x, str) else [])

# Tokenizing 'emotion_in_tweet_is_directed_at' column
data['emotion_tokens'] = data['emotion_in_tweet_is_directed_at'].apply(lambda x: word_tokenize(x) if isinstance(x, str) else [])


In [70]:
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens,emotion_tokens
0,wesley83 3G iPhone 3 hrs tweeting RISEAustin d...,iPhone,Negative emotion,"[wesley83, 3G, iPhone, 3, hrs, tweeting, RISEA...",[iPhone]
1,jessedee Know fludapp Awesome iPadiPhone app ...,iPad iPhone App,Positive emotion,"[jessedee, Know, fludapp, Awesome, iPadiPhone,...","[iPad, iPhone, App]"
2,swonderlin wait iPad 2 also sale SXSW,iPad,Positive emotion,"[swonderlin, wait, iPad, 2, also, sale, SXSW]",[iPad]
3,sxsw hope years festival crashy years iPhone a...,iPad iPhone App,Negative emotion,"[sxsw, hope, years, festival, crashy, years, i...","[iPad, iPhone, App]"
4,sxtxstate great stuff Fri SXSW Marissa Mayer G...,Google,Positive emotion,"[sxtxstate, great, stuff, Fri, SXSW, Marissa, ...",[Google]


Dropping the "tweet_text" and "emotion_in_tweet_is_directed_at" columns so that only the tokenized columns remain

In [71]:
data.drop(columns=["tweet_text","emotion_in_tweet_is_directed_at"],axis=1,inplace=True)

In [72]:
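# Lowercasing string-valued cells: this affects the sentiment label column only,
# because the token columns hold lists (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
data = data.applymap(lambda x: x.lower() if isinstance(x, str) else x)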


Lemmatizing the text data

In [73]:
# Initializing the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatizing 'tweet_text_tokens' column
data['tweet_text_tokens_lemmatized'] = data['tweet_text_tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)

# Lemmatizing 'emotion_tokens' column
data['emotion_tokens_lemmatized'] = data['emotion_tokens'].apply(
    lambda tokens: [lemmatizer.lemmatize(token) for token in tokens]
)


In [74]:
data.head()

Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens,emotion_tokens,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
0,negative emotion,"[wesley83, 3G, iPhone, 3, hrs, tweeting, RISEA...",[iPhone],"[wesley83, 3G, iPhone, 3, hr, tweeting, RISEAu...",[iPhone]
1,positive emotion,"[jessedee, Know, fludapp, Awesome, iPadiPhone,...","[iPad, iPhone, App]","[jessedee, Know, fludapp, Awesome, iPadiPhone,...","[iPad, iPhone, App]"
2,positive emotion,"[swonderlin, wait, iPad, 2, also, sale, SXSW]",[iPad],"[swonderlin, wait, iPad, 2, also, sale, SXSW]",[iPad]
3,negative emotion,"[sxsw, hope, years, festival, crashy, years, i...","[iPad, iPhone, App]","[sxsw, hope, year, festival, crashy, year, iPh...","[iPad, iPhone, App]"
4,positive emotion,"[sxtxstate, great, stuff, Fri, SXSW, Marissa, ...",[Google],"[sxtxstate, great, stuff, Fri, SXSW, Marissa, ...",[Google]


Dropping "tweet_text_tokens" and "emotion_tokens" so that only the updated columns remain

In [75]:
# Dropping columns
data.drop(columns=["tweet_text_tokens", "emotion_tokens"], inplace=True)

In [76]:
data.head()




Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
0,negative emotion,"[wesley83, 3G, iPhone, 3, hr, tweeting, RISEAu...",[iPhone]
1,positive emotion,"[jessedee, Know, fludapp, Awesome, iPadiPhone,...","[iPad, iPhone, App]"
2,positive emotion,"[swonderlin, wait, iPad, 2, also, sale, SXSW]",[iPad]
3,negative emotion,"[sxsw, hope, year, festival, crashy, year, iPh...","[iPad, iPhone, App]"
4,positive emotion,"[sxtxstate, great, stuff, Fri, SXSW, Marissa, ...",[Google]


Converting text data to lower case

In [27]:
# Convert text to lowercase in the lemmatized columns
data['tweet_text_tokens_lemmatized'] = data['tweet_text_tokens_lemmatized'].apply(
    lambda tokens: [token.lower() for token in tokens]
)

data['emotion_tokens_lemmatized'] = data['emotion_tokens_lemmatized'].apply(
    lambda tokens: [token.lower() for token in tokens]
)



In [77]:
data.head()

Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
0,negative emotion,"[wesley83, 3G, iPhone, 3, hr, tweeting, RISEAu...",[iPhone]
1,positive emotion,"[jessedee, Know, fludapp, Awesome, iPadiPhone,...","[iPad, iPhone, App]"
2,positive emotion,"[swonderlin, wait, iPad, 2, also, sale, SXSW]",[iPad]
3,negative emotion,"[sxsw, hope, year, festival, crashy, year, iPh...","[iPad, iPhone, App]"
4,positive emotion,"[sxtxstate, great, stuff, Fri, SXSW, Marissa, ...",[Google]


The data is now tokenized and all text is in lower case.

I will focus on binary classification of the text, so I will select only the rows with positive or negative feedback.

In [29]:
# Selecting rows which only have positive or negative feedback
positive_negative_emotion = data[
    (data["is_there_an_emotion_directed_at_a_brand_or_product"] == "negative emotion") |
    (data["is_there_an_emotion_directed_at_a_brand_or_product"] == "positive emotion")
]


In [30]:
positive_negative_emotion.head(10)

Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
0,negative emotion,"[wesley83, 3g, iphone, 3, hr, tweeting, riseau...",[iphone]
1,positive emotion,"[jessedee, know, fludapp, awesome, ipadiphone,...","[ipad, iphone, app]"
2,positive emotion,"[swonderlin, wait, ipad, 2, also, sale, sxsw]",[ipad]
3,negative emotion,"[sxsw, hope, year, festival, crashy, year, iph...","[ipad, iphone, app]"
4,positive emotion,"[sxtxstate, great, stuff, fri, sxsw, marissa, ...",[google]
7,positive emotion,"[sxsw, starting, ctia, around, corner, googlei...",[android]
8,positive emotion,"[beautifully, smart, simple, idea, rt, madebym...","[ipad, iphone, app]"
9,positive emotion,"[counting, day, sxsw, plus, strong, canadian, ...",[apple]
10,positive emotion,"[excited, meet, samsungmobileus, sxsw, show, s...",[android]
11,positive emotion,"[find, amp, start, impromptu, parties, sxsw, h...","[android, app]"


In [31]:
positive_negative_emotion.shape

(3548, 3)

The data has 3548 rows and 3 columns 
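
Before modelling a binary problem it is worth checking how the two classes are balanced, since this affects how accuracy should be read later. The sketch below does this with `value_counts(normalize=True)` on a toy stand-in frame; the rows are illustrative only, and in the notebook the same call would be made on `positive_negative_emotion`.

```python
import pandas as pd

label_col = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy stand-in for the filtered DataFrame (values are illustrative only)
toy = pd.DataFrame({
    label_col: [
        "positive emotion", "positive emotion", "positive emotion",
        "negative emotion",
    ]
})

# Share of each class among all rows
class_share = toy[label_col].value_counts(normalize=True)
print(class_share)
```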

##### Splitting the data into complete rows and rows with an empty brand list

The purpose is to handle the two groups separately.

Selecting the complete rows, which have a non-empty `emotion_tokens_lemmatized` list

In [32]:
#complete rows 
complete_rows = positive_negative_emotion[positive_negative_emotion['emotion_tokens_lemmatized'].apply(len) > 0]


Data with incomplete rows

In [33]:
# Incomplete rows (.copy() so that the later column assignment does not trigger a SettingWithCopyWarning)
missing = positive_negative_emotion[positive_negative_emotion["emotion_tokens_lemmatized"].apply(lambda x: len(x) == 0)].copy()


In [34]:
missing.head()


Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
46,positive emotion,"[handheld, û÷hoboûª, drafthouse, launch, û÷...",[]
64,negative emotion,"[again, rt, mention, line, apple, store, insan...",[]
68,negative emotion,"[boooo, rt, mention, flipboard, developing, ip...",[]
103,negative emotion,"[know, quotdatavizquot, translates, quotsatani...",[]
112,positive emotion,"[spark, android, teamandroid, award, sxsw, rea...",[]


Filling the empty lists based on the corresponding tweet text. If the tweet contains words such as 'apple', 'ipad', or 'iphone', there is a high probability that it is directed at an Apple product; if it contains 'google' or 'android', there is a high probability that it is directed at a Google product.

In [78]:
#creating a function to replace the empty list
def replace_empty_lists(row):
    if not row['emotion_tokens_lemmatized']:
        keywords = row['tweet_text_tokens_lemmatized']
        if any(word in keywords for word in ['apple', 'ipad', 'iphone']):
            return ['apple']
        elif any(word in keywords for word in ['google', 'android']):
            return ['google']
    return row['emotion_tokens_lemmatized']

# Applying the function to replace empty lists
missing['emotion_tokens_lemmatized'] = missing.apply(replace_empty_lists, axis=1)
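
The keyword rule can be exercised on plain token lists to see how it behaves in edge cases. The sketch below is a standalone restatement of the same logic; the name `infer_brand` and the sample token lists are hypothetical.

```python
APPLE_KEYWORDS = {"apple", "ipad", "iphone"}
GOOGLE_KEYWORDS = {"google", "android"}

def infer_brand(tokens):
    """Return ['apple'], ['google'] or [] from lowercased tweet tokens."""
    token_set = set(tokens)
    if token_set & APPLE_KEYWORDS:      # checked first, so Apple wins ties
        return ["apple"]
    if token_set & GOOGLE_KEYWORDS:
        return ["google"]
    return []                           # no brand keyword found

print(infer_brand(["line", "apple", "store"]))    # Apple keyword present
print(infer_brand(["android", "award", "sxsw"]))  # Google keyword present
print(infer_brand(["google", "map", "iphone"]))   # both brands mentioned
print(infer_brand(["great", "party", "sxsw"]))    # no brand keyword
```

Two properties are worth noting. Apple keywords are checked first, so a tweet that mentions both brands is labelled `apple`. Matching is on whole tokens, so a fused token such as `ipadiphone` does not match any keyword.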



Dropping the rows whose empty list could not be classified as addressing either Google or Apple

In [80]:
# Keeping only the rows whose brand list is no longer empty
non_missing = missing[missing['emotion_tokens_lemmatized'].apply(lambda x: bool(x))]

In [81]:
non_missing.head(10)

Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
46,positive emotion,"[handheld, û÷hoboûª, drafthouse, launch, û÷...",[apple]
64,negative emotion,"[again, rt, mention, line, apple, store, insan...",[apple]
68,negative emotion,"[boooo, rt, mention, flipboard, developing, ip...",[apple]
103,negative emotion,"[know, quotdatavizquot, translates, quotsatani...",[apple]
112,positive emotion,"[spark, android, teamandroid, award, sxsw, rea...",[google]
131,positive emotion,"[smallbiz, need, review, play, google, placesw...",[google]
157,positive emotion,"[mention, sxsw, lonelyplanet, austin, guide, i...",[apple]
337,positive emotion,"[first, day, sxsw, fun, final, presentation, g...",[google]
386,positive emotion,"[quotyou, google, canadian, tuxedo, lose, hour...",[google]
417,negative emotion,"[shipments, daily, follow, mention, appleatxdt...",[apple]


Concatenating the datasets to have one comprehensive dataset

In [38]:
# Concatenating both data frames
updated_data = pd.concat([complete_rows, non_missing])


In [82]:
updated_data.shape

(3515, 3)

3515 rows remain, so only 33 of the 3548 rows have been lost

In [83]:
updated_data

Unnamed: 0,is_there_an_emotion_directed_at_a_brand_or_product,tweet_text_tokens_lemmatized,emotion_tokens_lemmatized
0,negative emotion,"[wesley83, 3g, iphone, 3, hr, tweeting, riseau...",[iphone]
1,positive emotion,"[jessedee, know, fludapp, awesome, ipadiphone,...","[ipad, iphone, app]"
2,positive emotion,"[swonderlin, wait, ipad, 2, also, sale, sxsw]",[ipad]
3,negative emotion,"[sxsw, hope, year, festival, crashy, year, iph...","[ipad, iphone, app]"
4,positive emotion,"[sxtxstate, great, stuff, fri, sxsw, marissa, ...",[google]
...,...,...,...
9011,positive emotion,"[apparently, line, get, ipad, sxsw, store, gre...",[apple]
9043,negative emotion,"[hey, anyone, sxsw, signing, group, texting, a...",[apple]
9049,positive emotion,"[mention, buy, used, ipad, ill, pick, one, tom...",[apple]
9052,positive emotion,"[mention, could, buy, new, ipad, 2, tmrw, appl...",[apple]


### Modeling

Dividing the data into two: X, the predictor (the tweet text), and y, the target (the sentiment label)


In [84]:
X = updated_data['tweet_text_tokens_lemmatized'].apply(lambda x: ' '.join(x))
y = updated_data['is_there_an_emotion_directed_at_a_brand_or_product']


Splitting the data into training and testing sets

In [85]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
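
Because the two sentiment classes are imbalanced, one option is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below shows the `stratify` parameter of `train_test_split` on toy data; it is a possible variant, not what this notebook ran, and the results reported later come from the unstratified split above.

```python
from sklearn.model_selection import train_test_split

# Toy data: 80 positive and 20 negative labels (illustrative only)
texts = [f"tweet {i}" for i in range(100)]
labels = ["positive emotion"] * 80 + ["negative emotion"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, the 20-row test set keeps the 80/20 class ratio
print(y_te.count("negative emotion"), y_te.count("positive emotion"))
```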


In [86]:
#TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=10000)  
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


Modelling using Logistic Regression

In [87]:

# Logistic Regression Model
logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train_tfidf, y_train)


In [88]:
# Predictions on the test set
y_pred = logistic_regression.predict(X_test_tfidf)


In [89]:
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(confusion)
print('Classification Report:')
print(report)

Accuracy: 0.85
Confusion Matrix:
[[  7 101]
 [  2 593]]
Classification Report:
                  precision    recall  f1-score   support

negative emotion       0.78      0.06      0.12       108
positive emotion       0.85      1.00      0.92       595

        accuracy                           0.85       703
       macro avg       0.82      0.53      0.52       703
    weighted avg       0.84      0.85      0.80       703



The model reaches an accuracy of 85%, but this is barely above the 84.6% (595 of 703) that would be obtained by always predicting positive emotion, so accuracy alone is misleading here.
From the confusion matrix and the classification report, we observe that the model performs much better at identifying positive emotion than negative emotion. Recall for negative emotion is only 0.06 (7 of 108 negative tweets), indicating that the model struggles to identify negative sentiment.

The macro-average and weighted-average metrics provide an overall assessment of model performance across both classes. The macro-average F1-score is 0.52, while the weighted-average F1-score is 0.80. The gap reflects the class imbalance, since the weighted average is dominated by the positive class.

In short, the model identifies positive sentiment well but detects negative sentiment poorly.
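
As a worked check, the following sketch recomputes the majority-class baseline and the two averaged F1-scores from the support and per-class F1 values printed in the Logistic Regression report. The variable names are illustrative; the numbers are copied from that output.

```python
# Test-set support and per-class F1 taken from the Logistic Regression report
support = {"negative emotion": 108, "positive emotion": 595}
f1 = {"negative emotion": 0.12, "positive emotion": 0.92}

total = sum(support.values())

# Accuracy obtained by always predicting the majority class
baseline_accuracy = max(support.values()) / total

# Macro average: plain mean over classes; weighted: mean weighted by support
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

The macro average treats both classes equally, so it exposes the weak negative class, while the weighted average is pulled towards the positive class that makes up most of the test set.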

##### Modelling using Naive Bayes

In [90]:
# Initializing and training the Multinomial Naive Bayes model
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_tfidf, y_train)




In [91]:
# Predictions on the test set
y_pred = naive_bayes.predict(X_test_tfidf)



In [92]:
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(confusion)
print('Classification Report:')
print(report)

Accuracy: 0.85
Confusion Matrix:
[[  3 105]
 [  0 595]]
Classification Report:
                  precision    recall  f1-score   support

negative emotion       1.00      0.03      0.05       108
positive emotion       0.85      1.00      0.92       595

        accuracy                           0.85       703
       macro avg       0.93      0.51      0.49       703
    weighted avg       0.87      0.85      0.79       703



Accuracy: The model has an accuracy of 85%, which means it correctly classifies 85% of the tweets into their respective sentiment categories.

Confusion Matrix (treating "positive emotion" as the positive class):

- The model correctly identifies all 595 positive sentiment tweets (True Positives).
- It incorrectly classifies 105 negative sentiment tweets as positive (False Positives).
- It correctly identifies 3 negative sentiment tweets (True Negatives).
- There are no False Negatives (positive sentiment tweets incorrectly classified as negative).

Classification Report:

- For the "negative emotion" class, the model has perfect precision (1.00) but a very low recall (0.03), resulting in an F1-score of 0.05. This indicates that the few negative predictions it makes are correct, but it misses almost all negative sentiment tweets (105 of 108).
- For the "positive emotion" class, the model has high precision (0.85), recall (1.00), and F1-score (0.92), indicating that it performs very well in identifying positive sentiment tweets.
- The macro-average F1-score is 0.49, while the weighted-average F1-score is 0.79.
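
The cell labels can be verified by unpacking the matrix directly. scikit-learn's `confusion_matrix` places true classes in rows and predicted classes in columns, in sorted label order, so index 0 is "negative emotion" and index 1 is "positive emotion". The sketch below assumes "positive emotion" is treated as the positive class and uses the Naive Bayes matrix printed in the output.

```python
# Naive Bayes confusion matrix copied from the evaluation output:
# rows = true class, columns = predicted class,
# index 0 = "negative emotion", index 1 = "positive emotion"
matrix = [[3, 105],
          [0, 595]]

(tn, fp), (fn, tp) = matrix  # "positive emotion" treated as the positive class

negative_recall = tn / (tn + fp)     # share of negative tweets that were caught
negative_precision = tn / (tn + fn)  # share of "negative" predictions that were right
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"negative recall={negative_recall:.2f}, "
      f"negative precision={negative_precision:.2f}, accuracy={accuracy:.2f}")
```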






##### Modelling using SVM

In [93]:

# Initializing and training the Support Vector Machine (linear kernel)
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_tfidf, y_train)



In [94]:
# Predictions on the test set
y_pred = svm_classifier.predict(X_test_tfidf)



In [95]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(confusion)
print('Classification Report:')
print(report)

Accuracy: 0.89
Confusion Matrix:
[[ 35  73]
 [  7 588]]
Classification Report:
                  precision    recall  f1-score   support

negative emotion       0.83      0.32      0.47       108
positive emotion       0.89      0.99      0.94       595

        accuracy                           0.89       703
       macro avg       0.86      0.66      0.70       703
    weighted avg       0.88      0.89      0.86       703



Accuracy: The model has an accuracy of 89%, which means it correctly classifies 89% of the tweets into their respective sentiment categories.

Confusion Matrix (treating "positive emotion" as the positive class):

- The model correctly identifies 588 positive sentiment tweets (True Positives).
- It incorrectly classifies 73 negative sentiment tweets as positive (False Positives).
- It correctly identifies 35 negative sentiment tweets (True Negatives).
- There are 7 False Negatives (positive sentiment tweets incorrectly classified as negative).

Classification Report:

- For the "negative emotion" class, the model has moderate precision (0.83) but a relatively low recall (0.32), resulting in an F1-score of 0.47. This indicates that the model is reasonably precise when it predicts negative sentiment, but it still misses about two thirds of the negative tweets (73 of 108).
- For the "positive emotion" class, the model performs very well, with high precision (0.89), recall (0.99), and F1-score (0.94), indicating that it excels in identifying positive sentiment tweets.
- The macro-average F1-score is 0.70, while the weighted-average F1-score is 0.86.









**Conclusions:**

1. **Logistic Regression Model:**
   - Accuracy: 85%
   - Performance on Positive Emotion:
     - High precision (0.85), recall (1.00), and F1-score (0.92).
   - Performance on Negative Emotion:
     - Low recall (0.06) and F1-score (0.12).

2. **Multinomial Naive Bayes Model:**
   - Accuracy: 85%
   - Performance on Positive Emotion:
     - High precision (0.85), recall (1.00), and F1-score (0.92).
   - Performance on Negative Emotion:
     - Perfect precision (1.00) but very low recall (0.03) and F1-score (0.05).

3. **Support Vector Machine (SVM) Model:**
   - Accuracy: 89%
   - Performance on Positive Emotion:
     - High precision (0.89), recall (0.99), and F1-score (0.94).
   - Performance on Negative Emotion:
     - Moderate precision (0.83) and recall (0.32), resulting in an F1-score of 0.47.
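
The three models can also be compared side by side from their confusion matrices. The sketch below recomputes accuracy and negative-class recall for each one; the matrices are copied from the three evaluation outputs, and the dictionary layout is an illustrative choice.

```python
# Confusion matrices copied from the three evaluation outputs
# (rows = true class, columns = predicted; index 0 = negative, 1 = positive)
matrices = {
    "Logistic Regression": [[7, 101], [2, 593]],
    "Multinomial Naive Bayes": [[3, 105], [0, 595]],
    "SVM (linear kernel)": [[35, 73], [7, 588]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "negative_recall": tn / (tn + fp),
    }

for name, row in summary.items():
    print(f"{name:<24} accuracy={row['accuracy']:.2f} "
          f"negative recall={row['negative_recall']:.2f}")
```

Accuracy differs by only a few points across the models, while negative-class recall differs several-fold, which makes it the more telling figure for this imbalanced dataset.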

**Recommendations:**

1. **Best Model**: Among the three models evaluated, the Support Vector Machine (SVM) model outperforms the others, with an accuracy of 89% and the best recall on negative sentiment (0.32, against 0.06 for Logistic Regression and 0.03 for Multinomial Naive Bayes). It still misses about two thirds of the negative tweets, but of the three it is the recommended choice.


2. **Regular Updates**: If this sentiment analysis model is used for real-time monitoring of Twitter sentiment, it should be regularly updated with new data to adapt to changing trends and language usage.



