Blog: https://medium.com/@swayampatil7918/getting-started-with-sentiment-analysis-a-step-by-step-guide-1a16085688a7

Data: https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset/data

In [1]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

In [3]:
df = pd.read_csv('data/Tweets.csv')
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


## 2. Preprocessing

In [4]:
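# Fill missing values and cast to string first, so NaN does not turn into the literal text 'nan'
df['text'] = df['text'].fillna('').astype(str)

# Convert text to lowercase
df['text'] = df['text'].str.lower()

# Tokenization (needs the NLTK tokenizer data, e.g. nltk.download('punkt'))
df['tokens'] = df['text'].apply(nltk.word_tokenize)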

In [5]:
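# Remove stopwords (needs nltk.download('stopwords')); a set makes membership checks fast
stopwords = set(nltk.corpus.stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stopwords])
# Note: the model below is trained on df['text'], so these tokens are not used downstream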
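
The later cells pass `df['text']` to the vectorizer, so the stopword-filtered `tokens` column never reaches the model. If it should, one option is to join the tokens back into a string. The sketch below shows this on a toy frame. The small `STOPWORDS` set and the `clean_text` column are hypothetical stand-ins, and whitespace splitting replaces `nltk.word_tokenize` so the snippet runs without NLTK data.

```python
import pandas as pd

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"my", "is", "me"}

# Toy data: one tweet and one missing value
df = pd.DataFrame({"text": ["My boss is bullying me", None]})

# Same order as the preprocessing cell: fill, cast, lowercase
df["text"] = df["text"].fillna("").astype(str).str.lower()

# Whitespace tokenization (assumption: a simple substitute for word_tokenize)
df["tokens"] = df["text"].str.split()

# Drop stopwords, then rebuild a string the vectorizer can consume
df["tokens"] = df["tokens"].apply(lambda toks: [t for t in toks if t not in STOPWORDS])
df["clean_text"] = df["tokens"].str.join(" ")

print(df[["text", "clean_text"]])
```

With a column like this, `X = df['clean_text']` would replace `X = df['text']` in the split step.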

## 3. Split the data

X_train: The training portion of the features X. With test_size=0.2, 20% of the rows are held out for testing, so X_train holds the remaining 80%. The model is fit on these features.

X_test: The testing portion of the features X: the held-out 20%, used to evaluate the model after training.

y_train: The training portion of the target labels y: the 80% of labels that correspond row by row to X_train. The model learns to predict these labels.

y_test: The testing portion of the target labels y: the 20% of labels that correspond to X_test. The model's predictions on X_test are compared against them.

In [6]:
X = df['text']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
X_train

11293                              doctor who has finished
11299                                          you should.
18204    back at school again. almost weekend. oh wait,...
22728    my computer is so slooowww this morning.  i th...
1231                             on my way to dazzle bar!!
                               ...                        
21575    star trek was pure awesome! love it!!! <3333  ...
5390     will be going to indiana baptist sunday, pray ...
860      is sitting thru the boring bits in titanic wai...
15795                                      missed the play
23654    oh i`m really tired of these migraines! #endom...
Name: text, Length: 21984, dtype: object

In [19]:
y_train

11293     neutral
11299     neutral
18204     neutral
22728     neutral
1231      neutral
           ...   
21575    positive
5390      neutral
860       neutral
15795    negative
23654    negative
Name: sentiment, Length: 21984, dtype: object
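
As a quick check on the 80/20 split, the sketch below runs `train_test_split` on ten made-up tweets. The `stratify=labels` argument is an addition that the cell above does not use. It keeps the class proportions the same in both portions, which may help when the classes are uneven.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Ten made-up tweets with two evenly sized classes
texts = [f"tweet {i}" for i in range(10)]
labels = ["negative"] * 5 + ["positive"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # hold out 20% for testing
    random_state=42,    # fixed seed so the split is reproducible
    stratify=labels,    # assumption: not used in the original cell
)

print(len(X_train), len(X_test))
print(Counter(y_train), Counter(y_test))
```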

## 4. Feature extraction

Convert the text into numerical features that a machine learning algorithm can use. One common approach is the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, which weights each term by how often it appears in a tweet and down-weights terms that appear in many tweets.

In [8]:
vectorizer = TfidfVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

In [23]:
feature_names = vectorizer.get_feature_names_out()
print("feature names: ", feature_names)
print("First 5 rows of X_train_vectors: ", X_train_vectors[:5].toarray())

feature names:  ['00' '000' '000th' ... '½you' '½z' '½ï']
First 5 rows of X_train_vectors:  [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## 5. Build and Train a Sentiment Analysis Model

Choose a classification algorithm (an SVM in this case) and use it to build a model.

In [24]:
model = SVC()
model.fit(X_train_vectors, y_train)
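
`SVC()` with no arguments uses an RBF kernel. The sketch below shows one way to bundle the vectorizer and classifier into a single scikit-learn pipeline, so that raw text can be passed to `fit` and `predict` directly. The toy data and the `kernel="linear"` setting are assumptions made for illustration, not the configuration used in the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training data
train_texts = ["love this", "great day", "hate this", "awful day"]
train_labels = ["positive", "positive", "negative", "negative"]

# Vectorizer and classifier are fit together on the raw text
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
clf.fit(train_texts, train_labels)

# The pipeline vectorizes new text with the fitted vocabulary before predicting
preds = clf.predict(["love great", "hate awful"])
print(preds)
```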

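## 6. Evaluate the Model

The formulas below are stated for a binary problem in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). For the three sentiment classes used here, the per-class figures in the classification report treat one class as positive and the other two as negative.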

### Accuracy
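Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total number of predictions made. It is most informative when the classes are roughly balanced, because on imbalanced data a model can score well by always predicting the majority class.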

Accuracy = (TP+TN)/(TP+TN+FP+FN)

### Precision
Precision is the ratio of correctly predicted positive instances to the total predicted positives. It is a good metric when the cost of false positives is high.

Precision = TP/(TP+FP)
 
### Recall (Sensitivity or True Positive Rate)
Recall is the ratio of correctly predicted positive instances to the total actual positives. It is important when the cost of false negatives is high (e.g., in medical diagnoses).

Recall = TP/(TP+FN)
 
### F1-Score
The F1-score is the harmonic mean of precision and recall. It provides a balance between the two metrics and is useful when you want to maintain both precision and recall in a model.

F1-Score = 2 × ((Precision×Recall)/(Precision+Recall))

### Specificity (True Negative Rate)

Specificity is the ratio of correctly predicted negative instances to the total actual negatives. It is useful when it is important to correctly identify negatives.

Specificity = TN/(TN+FP)
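
The formulas so far all derive from the four confusion-matrix counts. A minimal sketch on made-up binary labels (1 = positive, 0 = negative) computes each metric by hand from those counts. scikit-learn has no dedicated specificity function, so it is taken directly from TN and FP.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} specificity={specificity:.3f}")
```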
### ROC Curve (Receiver Operating Characteristic Curve)
The ROC curve is a plot of the true positive rate (recall) against the false positive rate (FPR) at different classification thresholds. It shows the performance of a classifier across all threshold values.

False Positive Rate (FPR):

FPR = FP/(FP+TN)
### AUC (Area Under the ROC Curve)
The AUC measures the area under the ROC curve. A model with an AUC of 1 is a perfect classifier, while a model with an AUC of 0.5 is no better than random guessing. It summarizes how well the model's scores separate the classes across all thresholds.
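
ROC and AUC are computed from continuous scores rather than hard class predictions. The sketch below uses four made-up binary examples with hypothetical scores. For the three-class model in this notebook, something similar would need per-class scores (for example from the classifier's `decision_function`) evaluated one class against the rest, which is not shown here.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up binary labels and hypothetical classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f} TPR={t:.2f}")
print("AUC:", auc)
```

Three of the four positive/negative pairs are ranked correctly by these scores, which is why the AUC comes out at 0.75.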

In [25]:
y_pred = model.predict(X_test_vectors)

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Accuracy Score:", accuracy_score(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

    negative       0.77      0.57      0.66      1562
     neutral       0.62      0.81      0.70      2230
    positive       0.80      0.68      0.73      1705

    accuracy                           0.70      5497
   macro avg       0.73      0.68      0.70      5497
weighted avg       0.72      0.70      0.70      5497

Accuracy Score: 0.698744769874477
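
The report's last two rows average the per-class scores in different ways: `macro avg` gives every class equal weight, while `weighted avg` weights each class by its support. A minimal sketch on six made-up labels reproduces both averages for recall by hand.

```python
from sklearn.metrics import recall_score

# Made-up three-class labels
y_true = ["negative", "negative", "neutral", "neutral", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "positive"]

labels = ["negative", "neutral", "positive"]

# Recall for each class, treating that class as positive and the rest as negative
per_class = recall_score(y_true, y_pred, labels=labels, average=None)

# Support = number of true instances of each class
support = [y_true.count(label) for label in labels]

macro = sum(per_class) / len(labels)
weighted = sum(r * s for r, s in zip(per_class, support)) / sum(support)

print("per-class recall:", dict(zip(labels, per_class)))
print("macro:", macro, "weighted:", weighted)
```

Support-weighted recall always equals overall accuracy, which matches the 0.70 shown in both places in the report above.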
