<a href="https://colab.research.google.com/github/faniyonm/Twitter-Sentiment-Analysis/blob/main/Twitter_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Twitter Sentiment Analysis


The Sentiment140 dataset is a collection of 1.6 million tweets labeled for sentiment analysis. Its label scheme distinguishes negative, neutral, and positive tweets, although the training file used in this notebook turns out to contain only negative and positive examples (800,000 of each). Researchers and businesses use it to study public opinion, brand perception, and social trends at scale. With Python libraries such as Pandas and scikit-learn, we can load, clean, and analyze this dataset and build sentiment classification models.

In [15]:
#Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

## Loading the Dataset
We load the Sentiment140 CSV file directly into a DataFrame with Pandas. The file has no header row, so columns are addressed by position: we keep only column 0, the polarity label (0 for negative, 2 for neutral, 4 for positive), and column 5, the tweet text.

In [16]:
# The file has no header row; rows that fail to parse are skipped silently
df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1',
                 header=None, engine='python', on_bad_lines='skip')
df = df[[0, 5]]
df.columns = ['polarity', 'text']
print(df.head())

   polarity                                               text
0         0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1         0  is upset that he can't update his Facebook by ...
2         0  @Kenichan I dived many times for the ball. Man...
3         0    my whole body feels itchy and like its on fire 
4         0  @nationwideclass no, it's not behaving at all....


## Positive and Negative Sentiments
Here we remove any neutral tweets (polarity 2) and remap the remaining labels to a binary target: 0 stays 0 for negative, and 4 becomes 1 for positive. We then print how many negative and positive tweets remain.

In [17]:
# .copy() avoids a possible SettingWithCopyWarning on the assignment below
df = df[df.polarity != 2].copy()

df['polarity'] = df['polarity'].map({0: 0, 4: 1})

print(df['polarity'].value_counts())

polarity
0    800000
1    800000
Name: count, dtype: int64


## Cleaning the Tweets
We create a simple function to convert all text to lowercase for consistency, apply it to every tweet in the dataset, and then display the original and cleaned versions of the first few tweets.

In [18]:
def clean_text(text):
    return text.lower()

df['clean_text'] = df['text'].apply(clean_text)

print(df[['text', 'clean_text']].head())

                                                text  \
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...   
1  is upset that he can't update his Facebook by ...   
2  @Kenichan I dived many times for the ball. Man...   
3    my whole body feels itchy and like its on fire    
4  @nationwideclass no, it's not behaving at all....   

                                          clean_text  
0  @switchfoot http://twitpic.com/2y1zl - awww, t...  
1  is upset that he can't update his facebook by ...  
2  @kenichan i dived many times for the ball. man...  
3    my whole body feels itchy and like its on fire   
4  @nationwideclass no, it's not behaving at all....  
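
The printed sample shows that lowercasing alone leaves URLs and @mentions in the text. As an optional extension (not what this notebook does, and not what produced the results reported below), a slightly richer cleaner could be sketched as follows. The function name, the regular expressions, and the example tweet are all hypothetical choices made for illustration.

```python
import re

# Hypothetical patterns for illustration; the notebook itself only lowercases.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text_extended(text: str) -> str:
    """Lowercase, drop URLs and @mentions, then collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Made-up tweet, not taken from the dataset
example = "@user http://example.com/abc - Awww, that is SAD"
cleaned = clean_text_extended(example)
print(cleaned)
```

Whether removing these tokens helps accuracy is an empirical question; it would have to be checked by re-running the models with the alternative cleaner.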


## Preparing the Data for Modeling

### Train Test Split
We split the clean_text and polarity columns into training and testing sets, holding out 20% of the tweets for testing. Fixing random_state makes the split reproducible.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'],
    df['polarity'],
    test_size=0.2,
    random_state=42
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))

Train size: 1280000
Test size: 320000
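
The split above is purely random, so the two classes need not come out exactly even in the test set (the classification reports further down show supports of 159,494 and 160,506). If an exact class ratio is wanted, `train_test_split` accepts a `stratify` argument. The sketch below illustrates this on a small made-up DataFrame; the toy data and variable names are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data standing in for the tweets (hypothetical values)
toy = pd.DataFrame({
    "clean_text": [f"tweet {i}" for i in range(100)],
    "polarity": [0, 1] * 50,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["clean_text"],
    toy["polarity"],
    test_size=0.2,
    random_state=42,
    stratify=toy["polarity"],  # keep the 50/50 class ratio in both splits
)

print(y_te.value_counts().sort_index())
```

With an imbalance as small as the one seen here, stratification is unlikely to change the reported metrics much; it matters more when classes are skewed.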


### Perform Vectorization
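We create a TF-IDF vectorizer that converts text into numerical features using unigrams and bigrams. Setting max_features=5000 keeps only the 5000 terms with the highest frequency across the training corpus. The vectorizer is fitted on the training data only and then applied unchanged to the test data, so no vocabulary or document-frequency information leaks from the test set. Finally, we print the shapes of the resulting TF-IDF matrices.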

In [20]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("TF-IDF shape (train):", X_train_tfidf.shape)
print("TF-IDF shape (test):", X_test_tfidf.shape)

TF-IDF shape (train): (1280000, 5000)
TF-IDF shape (test): (320000, 5000)


## Fitting Different Models
We train three classifiers on the TF-IDF features, Logistic Regression, Bernoulli Naive Bayes, and a linear Support Vector Machine (SVM), and compare their performance on the sentiment classification task.

### 1. Logistic Regression
We train a Logistic Regression model with up to 100 iterations on the TF-IDF features. The model then predicts sentiment labels for the test data, and we print the accuracy along with a detailed classification report for evaluation.

In [21]:
#Logistic Regression
logreg = LogisticRegression(max_iter=100)
logreg.fit(X_train_tfidf, y_train)

logreg_pred = logreg.predict(X_test_tfidf)

print("Logistic Regression Accuracy:", accuracy_score(y_test, logreg_pred))
print("\nLogistic Regression Classification Report:\n", classification_report(y_test, logreg_pred))

Logistic Regression Accuracy: 0.79539375

Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000



The Logistic Regression model achieved an accuracy of 79.5%. Precision, recall, and F1-scores lie between 0.78 and 0.81 for both the negative and positive classes, so the model does not noticeably favor one class over the other.

### 2. Bernoulli Naive Bayes
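We train a Bernoulli Naive Bayes classifier on the TF-IDF features from the training data. Because this model works on binary features, it only uses whether a term occurs in a tweet, not how large its TF-IDF weight is. The model then predicts sentiments for the test data, and we print the accuracy along with a detailed classification report for evaluation.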

In [22]:
bnb = BernoulliNB()
bnb.fit(X_train_tfidf, y_train)

bnb_pred = bnb.predict(X_test_tfidf)

print("Bernoulli Naive Bayes Accuracy:", accuracy_score(y_test, bnb_pred))
print("\nBernoulliNB Classification Report:\n", classification_report(y_test, bnb_pred))

Bernoulli Naive Bayes Accuracy: 0.766478125

BernoulliNB Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.75      0.76    159494
           1       0.76      0.78      0.77    160506

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000



The Bernoulli Naive Bayes model achieved an accuracy of 76.6%. The classification report shows fairly balanced performance, with precision, recall, and F1-scores between 0.75 and 0.78 for both classes. The model performs reasonably well but sits about three percentage points below Logistic Regression.
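
One detail worth checking is how `BernoulliNB` treats TF-IDF input. Its `binarize` parameter defaults to 0.0, so any positive weight is mapped to 1 before fitting. The sketch below, on a made-up four-document corpus, compares a model fitted on TF-IDF features with one fitted on binary presence/absence features. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up corpus and labels, not taken from the dataset
toy_texts = ["love this", "great day love", "hate this", "awful day hate"]
toy_labels = [1, 1, 0, 0]

# Same tokenization and vocabulary order, different cell values
X_tfidf = TfidfVectorizer().fit_transform(toy_texts)
X_binary = CountVectorizer(binary=True).fit_transform(toy_texts)

nb_tfidf = BernoulliNB().fit(X_tfidf, toy_labels)
nb_binary = BernoulliNB().fit(X_binary, toy_labels)

# Both models learn identical per-feature log probabilities
same = np.allclose(nb_tfidf.feature_log_prob_, nb_binary.feature_log_prob_)
print(same)
```

If the TF-IDF weights themselves should influence a Naive Bayes model, `MultinomialNB` would be the variant to try, though that is a different model from the one evaluated here.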

### 3. Support Vector Machine (SVM)
We train a linear Support Vector Machine (scikit-learn's LinearSVC) with a maximum of 1000 iterations on the TF-IDF features. The model then predicts sentiment labels for the test data, and we print the accuracy along with a detailed classification report to evaluate its performance.

In [23]:
svm = LinearSVC(max_iter=1000)
svm.fit(X_train_tfidf, y_train)

svm_pred = svm.predict(X_test_tfidf)

print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("\nSVM Classification Report:\n", classification_report(y_test, svm_pred))

SVM Accuracy: 0.79528125

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.78      0.79    159494
           1       0.79      0.81      0.80    160506

    accuracy                           0.80    320000
   macro avg       0.80      0.80      0.80    320000
weighted avg       0.80      0.80      0.80    320000



The Support Vector Machine (SVM) model achieved an accuracy of 79.5%, within about 0.01 percentage points of Logistic Regression (0.79528 versus 0.79539). Precision, recall, and F1-scores lie between 0.78 and 0.81 for both the negative and positive classes, so the two linear models are effectively tied on this feature set.
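
The three model sections repeat the same fit, predict, and score steps. One way to express that pattern more compactly is a loop over a dictionary of estimators, sketched here on a tiny made-up dataset so that it runs on its own. The data, the dictionary keys, and the variable names are illustrative assumptions, and the accuracies it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Made-up data standing in for the tweet splits
train_texts = ["i love this", "what a great day", "i hate this", "what an awful day"]
train_labels = [1, 1, 0, 0]
test_texts = ["love this day", "hate this day"]
test_labels = [1, 0]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)  # fit on training text only
X_te = vec.transform(test_texts)

# Same estimators and settings as in the sections above
models = {
    "Logistic Regression": LogisticRegression(max_iter=100),
    "BernoulliNB": BernoulliNB(),
    "LinearSVC": LinearSVC(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(X_te))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```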

## Predictions on Sample Tweets
Three sample tweets are transformed into TF-IDF features using the same fitted vectorizer. These features are then passed to the trained Logistic Regression, Bernoulli Naive Bayes, and SVM models to predict sentiment. The predictions are printed for each classifier, where 1 represents Positive and 0 represents Negative.

In [24]:
sample_tweets = ["I love this!", "I hate that!", "It was okay, not great."]
sample_vec = vectorizer.transform(sample_tweets)

print("\nSample Predictions:")
print("Logistic Regression:", logreg.predict(sample_vec))
print("BernoulliNB:", bnb.predict(sample_vec))
print("SVM:", svm.predict(sample_vec))


Sample Predictions:
Logistic Regression: [1 0 1]
BernoulliNB: [1 0 1]
SVM: [1 0 1]


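All three models, Logistic Regression, Bernoulli Naive Bayes, and SVM, predicted the same results for the sample tweets: [1 0 1], meaning the first and third tweets were classified as Positive and the second tweet as Negative. The third tweet, "It was okay, not great.", is lukewarm rather than clearly positive, but the models were trained on two classes only and must assign one of them. The agreement shows that the three classifiers behave consistently on these inputs despite their different approaches; it does not by itself confirm that each prediction is correct.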
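
Hard 0/1 labels hide how confident a model is. For Logistic Regression, `predict_proba` returns one probability per class, which can help flag borderline tweets. The sketch below trains a throwaway model on four made-up sentences so that it runs on its own; the training data and variable names are assumptions, and the probabilities it prints are not those of the notebook's model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data, not taken from the dataset
toy_texts = ["i love this", "this is great", "i hate that", "this is awful"]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression().fit(toy_vec.fit_transform(toy_texts), toy_labels)

samples = ["I love this!", "I hate that!", "It was okay, not great."]
proba = toy_model.predict_proba(toy_vec.transform(samples))

# Columns follow toy_model.classes_, i.e. [0, 1] = [negative, positive]
for tweet, (p_neg, p_pos) in zip(samples, proba):
    print(f"{tweet!r}: P(negative)={p_neg:.2f}, P(positive)={p_pos:.2f}")
```

In the notebook itself, the equivalent call would be `logreg.predict_proba(sample_vec)`. `LinearSVC` has no `predict_proba`; its `decision_function` gives signed margins instead.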