# Sentiment Analysis
Sentiment analysis is one of the most widely studied applications of natural language processing, and it remains a significant challenge. In this scenario, the objective is to determine whether tweets that customers post about technology companies that make and sell mobile phones, computers, laptops, and similar products express positive or negative sentiment. The goal is to build a system that accurately classifies the sentiment of new tweets. You can split the data into training and test sets. The evaluation metric to use is accuracy.

## Load and explore data


In [1]:
import pandas as pd

df = pd.read_csv('/content/sample_data/tweets.csv')
display(df.head())
df.info()  # info() prints its summary and returns None, so it needs no display() wrapper

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7920 entries, 0 to 7919
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      7920 non-null   int64 
 1   label   7920 non-null   int64 
 2   tweet   7920 non-null   object
dtypes: int64(2), object(1)
memory usage: 185.8+ KB
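Because accuracy is the evaluation metric, it can help to check how the two labels are distributed before modeling: a classifier that always predicts the most frequent class already reaches an accuracy equal to that class's share. The sketch below shows the check on a small made-up label column (`toy` is a hypothetical stand-in, not the tweet data); on the real DataFrame the equivalent call would be `df['label'].value_counts(normalize=True)`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these labels are made up for illustration.
toy = pd.DataFrame({"label": [0, 0, 0, 1, 0, 1, 0, 0]})

# Share of each class; normalize=True returns proportions instead of counts.
class_share = toy["label"].value_counts(normalize=True)

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any model accuracy reported later is easier to interpret next to this baseline.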

## Preprocessing


In [2]:
import re
import nltk
from nltk.tokenize import word_tokenize
import string

nltk.download('punkt')
nltk.download('punkt_tab')


def clean_tweet(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'http\S+|www\S+', '', text) # Remove URLs (http\S+ also matches https links)
    text = re.sub(r'@\w+', '', text) # Remove mentions
    text = re.sub(r'#\w+', '', text) # Remove hashtags, including the tag word itself
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

df['cleaned_tweet'] = df['tweet'].apply(clean_tweet)
# Tokens are kept for inspection; the model below is trained on cleaned_tweet
df['tokens'] = df['cleaned_tweet'].apply(word_tokenize)

display(df.head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Unnamed: 0,id,label,tweet,cleaned_tweet,tokens
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...,test,[test]
1,2,0,Finally a transparant silicon case ^^ Thanks t...,finally a transparant silicon case thanks to m...,"[finally, a, transparant, silicon, case, thank..."
2,3,0,We love this! Would you go? #talk #makememorie...,we love this would you go,"[we, love, this, would, you, go]"
3,4,0,I'm wired I know I'm George I was made that wa...,im wired i know im george i was made that way,"[im, wired, i, know, im, george, i, was, made,..."
4,5,1,What amazing service! Apple won't even talk to...,what amazing service apple wont even talk to m...,"[what, amazing, service, apple, wont, even, ta..."
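The output above shows that removing hashtags wholesale can discard most of a tweet: the first row is reduced to the single word `test`. One possible alternative, sketched here under the assumption that hashtag words carry sentiment, is to strip only the `#` symbol and keep the word. The function name `clean_tweet_keep_hashtags` and the sample tweet are hypothetical and not part of the original notebook.

```python
import re
import string

def clean_tweet_keep_hashtags(text):
    """Hypothetical variant of clean_tweet that keeps the word after '#'."""
    text = text.lower()                                   # convert to lowercase
    text = re.sub(r'http\S+|www\S+', '', text)            # remove URLs
    text = re.sub(r'@\w+', '', text)                      # remove mentions
    text = re.sub(r'#(\w+)', r'\1', text)                 # drop '#', keep the tag word
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)                       # remove numbers
    text = re.sub(r'\s+', ' ', text).strip()              # collapse extra whitespace
    return text

# Made-up sample tweet, used only to illustrate the behaviour.
sample = "Loving my new #iPhone! Thanks @apple https://example.com/abc 100%"
print(clean_tweet_keep_hashtags(sample))  # -> loving my new iphone thanks
```

Whether this helps accuracy on the tweet data is an empirical question; it would need to be compared against the original cleaning on the same split.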


## Split data


In [3]:
from sklearn.model_selection import train_test_split

X = df['cleaned_tweet']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)

print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (6336,) (6336,)
Testing set shape: (1584,) (1584,)
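The split above is purely random. If the two labels turn out to be imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both sets, which tends to make the test accuracy less dependent on the random seed. The sketch below demonstrates this on made-up data (`toy_X` and `toy_y` are hypothetical); for the notebook's own variables the call would add `stratify=y`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (made up): 80 samples of class 0 and 20 of class 1.
toy_X = pd.Series([f"tweet {i}" for i in range(100)])
toy_y = pd.Series([0] * 80 + [1] * 20)

# stratify=toy_y preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=24, stratify=toy_y
)

print("Class 1 share (train):", y_tr.mean())
print("Class 1 share (test): ", y_te.mean())
```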


## Model training and hyperparameter tuning


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

param_grid = {'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]}

grid_search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train_tfidf, y_train)

print("Best parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated accuracy for the best alpha
print("Best accuracy:", grid_search.best_score_)

best_model = grid_search.best_estimator_

Best parameters: {'alpha': 0.5}
Best accuracy: 0.8590600016432667
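In the cell above, the TF-IDF vectorizer is fitted on the whole training set before cross-validation, so every validation fold has already contributed to the vocabulary and IDF weights. The effect is often small, but wrapping both steps in a scikit-learn `Pipeline` lets `GridSearchCV` refit the vectorizer inside each fold. The following is a minimal sketch on a made-up corpus (`texts` and `labels` are hypothetical, and the labels are arbitrary); note that the parameter name becomes `nb__alpha` because it is addressed through the pipeline step named `nb`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (made up); five examples per class so that cv=5 can stratify.
texts = [
    "love this phone", "great battery life", "amazing screen",
    "really happy with it", "best laptop ever",
    "hate this phone", "terrible battery", "screen broke again",
    "worst service ever", "really angry about it",
]
labels = [0] * 5 + [1] * 5

# Vectorizer and classifier are fitted together, fold by fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("nb", MultinomialNB()),
])

# Step name + double underscore + parameter name selects the NB smoothing term.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0, 1.5, 2.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
```

With this layout, `search.best_estimator_` accepts raw cleaned text directly, so the test set would not need a separate `transform` call.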


## Evaluation


In [5]:
from sklearn.metrics import accuracy_score

X_test_tfidf = tfidf_vectorizer.transform(X_test)
y_pred = best_model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy on the test data: {accuracy:.4f}")

Model accuracy on the test data: 0.8567
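Accuracy alone does not show which class the errors fall on. A confusion matrix and a per-class report can complement it. The sketch below uses made-up label vectors (`y_true_toy` and `y_pred_toy` are hypothetical); with the notebook's variables the same functions would be called on `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up ground truth and predictions, for illustration only.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(f"Accuracy: {toy_accuracy:.4f}")
print("Confusion matrix:")
print(cm)
print(classification_report(y_true_toy, y_pred_toy))
```

In this toy case the accuracy is 0.7, yet half of the class 1 examples are misclassified, which the single accuracy figure would hide.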
