**Introduction to Sentiment Analysis**:
Sentiment analysis, also known as opinion mining, is a subfield of natural language processing (NLP) that aims to determine the sentiment or subjective attitude expressed in a piece of text. It involves analyzing and categorizing text as positive, negative, or neutral based on the underlying sentiment conveyed by the author.

Sentiment analysis has various applications, including understanding customer opinions, social media monitoring, brand reputation management, market research, and more. By automating the process of sentiment analysis, businesses and organizations can gain valuable insights from large volumes of text data.

**Introduction to the "IMDB Movie Review" Dataset**:
The "IMDB Movie Review" dataset is a widely used dataset for sentiment analysis tasks. It contains a collection of movie reviews from the Internet Movie Database (IMDB) website, where each review is labeled with a sentiment (positive or negative) based on the overall sentiment expressed in the review. The dataset is balanced, meaning it contains an equal number of positive and negative reviews.

For this project, however, the dataset I have chosen is the "Twitter Sentiment Analysis Dataset." This dataset focuses on sentiment analysis of tweets, providing a valuable resource for understanding and predicting public sentiment on various topics. It is particularly useful for analyzing real-time opinions and sentiments shared on Twitter, a popular social media platform.

The Twitter Sentiment Analysis Dataset comprises a collection of tweets labeled with sentiment values such as positive, negative, or neutral. The dataset captures a wide range of subjects, including politics, entertainment, technology, and more, making it versatile for sentiment analysis tasks across different domains.

Repository: Sentiment140
Link: https://www.kaggle.com/kazanova/sentiment140

The Sentiment140 repository on Kaggle provides a dataset containing 1.6 million tweets labeled with sentiment values (0 for negative sentiment and 4 for positive sentiment). This dataset is commonly used for sentiment analysis tasks and can be easily accessed and downloaded from the Kaggle website.

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# create the .kaggle directory in Colab
!mkdir -p ~/.kaggle


In [3]:
!cp '/content/kaggle.json' ~/.kaggle/


In [4]:
# Create a new folder named "Datasets" in my drive
!mkdir '/content/drive/MyDrive/Datasets'


mkdir: cannot create directory ‘/content/drive/MyDrive/Datasets’: File exists


In [5]:
# Navigate to the newly created folder
%cd /content/drive/MyDrive/Datasets


/content/drive/MyDrive/Datasets


In [6]:
# Use the !kaggle command to download the dataset directly from the Kaggle repository.
!kaggle datasets download -d kazanova/sentiment140


sentiment140.zip: Skipping, found more recently modified local copy (use --force to force download)


In [7]:
# Extract the contents of the downloaded archive
!unzip sentiment140.zip


Archive:  sentiment140.zip
replace training.1600000.processed.noemoticon.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: training.1600000.processed.noemoticon.csv  


In [8]:
import pandas as pd


Load the CSV file into a pandas DataFrame using the read_csv() function:

In [9]:
data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)


In [10]:
print(data)


         0           1                             2         3  \
0        0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY   
1        0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY   
2        0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY   
3        0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
4        0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY   
...     ..         ...                           ...       ...   
1599995  4  2193601966  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599996  4  2193601969  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599997  4  2193601991  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599998  4  2193602064  Tue Jun 16 08:40:49 PDT 2009  NO_QUERY   
1599999  4  2193602129  Tue Jun 16 08:40:50 PDT 2009  NO_QUERY   

                       4                                                  5  
0        _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...  
1          scotthamilton  is upset that he can't up

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   0       1600000 non-null  int64 
 1   1       1600000 non-null  int64 
 2   2       1600000 non-null  object
 3   3       1600000 non-null  object
 4   4       1600000 non-null  object
 5   5       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [12]:
column3_data = data[3]
column3_data

0          NO_QUERY
1          NO_QUERY
2          NO_QUERY
3          NO_QUERY
4          NO_QUERY
             ...   
1599995    NO_QUERY
1599996    NO_QUERY
1599997    NO_QUERY
1599998    NO_QUERY
1599999    NO_QUERY
Name: 3, Length: 1600000, dtype: object

In [13]:
unique_values = data[3].unique()
print(unique_values)


['NO_QUERY']


Column 3 holds the single value `NO_QUERY` in every row, so it does not contain any useful information for sentiment analysis and can be dropped.

In [14]:
data = data.drop(3, axis=1)
data

Unnamed: 0,0,1,2,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


- Column 0: This column contains the sentiment label, where 0 represents a negative sentiment and 4 represents a positive sentiment.
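- Column 1: It contains the numeric ID of each tweet. It is a unique identifier rather than a time value, even though this notebook later renames the column to `Timestamp`.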
- Column 2: This column contains the timestamp of the tweet in the format "Day Month DD HH:MM:SS PDT YYYY."
- Column 4: It represents the username or handle of the Twitter account that posted the tweet.
- Column 5: This column contains the actual text of the tweet.

By analyzing the tweet text and the associated sentiment labels, we can perform sentiment analysis tasks to predict the sentiment of the given text.
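As a small illustration of the column layout described above, the sketch below parses two made-up rows in the raw file's six-column format (including the `NO_QUERY` column that was dropped) and maps the numeric label to a readable name. It uses only Python's `csv` module; the names `COLUMNS`, `LABEL_NAMES`, and `parse_rows`, as well as the sample rows, are illustrative assumptions and not part of the dataset.

```python
import csv
import io

# Column meanings by position in the raw file (names are illustrative assumptions).
COLUMNS = ["sentiment", "tweet_id", "date", "query", "username", "text"]
LABEL_NAMES = {0: "negative", 4: "positive"}

# Two made-up rows in the same six-column layout as the raw CSV.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","user_a","my whole body feels itchy"\n'
    '"4","2","Tue Jun 16 08:40:49 PDT 2009","NO_QUERY","user_b","what a lovely morning"\n'
)

def parse_rows(raw_text):
    """Turn raw CSV text into dictionaries keyed by column name."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = dict(zip(COLUMNS, record))
        # Replace the numeric label (0 or 4) with a readable name
        row["sentiment"] = LABEL_NAMES[int(row["sentiment"])]
        rows.append(row)
    return rows

rows = parse_rows(SAMPLE)
for row in rows:
    print(row["sentiment"], "|", row["text"])
```

Only the `sentiment` and `text` fields are needed for the modeling steps in this notebook.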

## Text Cleaning:
- Remove special characters, punctuation, URLs, and mentions from the tweet text. This step helps eliminate noise and irrelevant information that may not contribute to sentiment analysis.

- Convert the text to lowercase so that the same word is not treated as different tokens because of its case.

The cells below implement both steps.

In [15]:
import re  # provides regular expression matching operations

def clean_text(text):
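    # Remove any characters that are not letters, digits, or whitespace.
    # Note: this also strips "@", ":", "/" and ".", which affects the two steps below.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)

    # Remove URLs (after the step above they look like "httptwitpiccom2y1zl")
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)

    # Remove mentions. Because "@" was already stripped above, this pattern
    # no longer matches, so mention names (e.g. "switchfoot") stay in the text.
    text = re.sub(r"@[^\s]+", "", text)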

    # Convert text to lowercase
    text = text.lower()

    return text


In [16]:
# apply the 'clean_text()' function to both columns 4 and 5
data[4] = data[4].apply(clean_text)  # Clean column 4
data[5] = data[5].apply(clean_text)  # Clean column 5

#data = data.drop(5, axis=1)


In [17]:
data.head(3)

Unnamed: 0,0,1,2,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,thespecialone,switchfoot awww thats a bummer you shoulda ...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he cant update his facebook by t...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,kenichan i dived many times for the ball manag...
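The output above still contains mention names such as `switchfoot` and `kenichan`, because `clean_text` strips the `@` character before the mention pattern runs. A possible variant, shown here only as a sketch, removes URLs and mentions first and punctuation afterwards. The function name `clean_text_reordered` and the sample string (a shortened tweet modeled on the first row) are assumptions for illustration; the rest of this notebook keeps using the original `clean_text`.

```python
import re

def clean_text_reordered(text):
    """Variant of clean_text that removes URLs and mentions before punctuation."""
    # Remove URLs while "://" and "." are still present to delimit them
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Remove mentions while the "@" prefix is still present
    text = re.sub(r"@\w+", "", text)
    # Now strip everything that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Lowercase and collapse the extra spaces left by the removals
    return " ".join(text.lower().split())

sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(clean_text_reordered(sample))  # awww thats a bummer
```

Applying a variant like this would change the vocabulary and therefore the scores reported later, so those numbers would need to be recomputed.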


In [18]:
data = data.rename(columns={0: 'Sentiment', 1: 'Timestamp', 2: 'Date', 4: 'Username', 5: 'Text'})


In [19]:
data.head(3)

Unnamed: 0,Sentiment,Timestamp,Date,Username,Text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,thespecialone,switchfoot awww thats a bummer you shoulda ...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,scotthamilton,is upset that he cant update his facebook by t...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,mattycus,kenichan i dived many times for the ball manag...


## Stopword Removal:
Stopwords are common words that appear frequently but carry little sentiment-related information (e.g., "the," "is," "and").
Removing them helps reduce noise and focus on more meaningful words.


In [20]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # Download stopwords data (needed once)

stop_words = set(stopwords.words('english'))  # Set of English stopwords

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

data['Text'] = data['Text'].apply(remove_stopwords)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The function takes a `text` string as input, splits it into individual `words`, drops every word found in `stop_words` using a list comprehension, and then joins the `filtered_words` back into a string.
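One point worth checking for sentiment tasks is that the NLTK English stopword list includes negation words such as "not" and "no", which can flip the meaning of a sentence. The sketch below shows one way to keep them; it uses a tiny hand-written stopword set rather than the NLTK list, and the names `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords_keep_negations` are assumptions made for this example. It is not what the notebook does above.

```python
# A tiny hand-written stopword set for illustration (not the NLTK list).
STOP_WORDS = {"the", "is", "and", "a", "this", "not", "no"}
# Negation words we choose to keep because they can flip the sentiment.
NEGATIONS = {"not", "no"}

def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stopwords from text, except those listed in keep."""
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w.lower() not in effective)

print(remove_stopwords_keep_negations("this movie is not good"))  # movie not good
```

Whether keeping negations actually helps on this dataset would have to be tested; it is offered as an option, not a measured improvement.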

## Stemming or Lemmatization:
- Perform stemming or lemmatization to reduce words to their base or root forms.
- Stemming reduces words to their stems, which may not always be actual English words but are consistent representations (e.g., "running" becomes "run").
- Lemmatization maps words to their dictionary form or lemma (e.g., "running" becomes "run" when it is treated as a verb). Lemmatization retains actual English words, but it usually merges fewer word forms than stemming, so the resulting vocabulary (number of dimensions) tends to be larger.

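Advantages and considerations:
- Dimensionality: both techniques help reduce the dimensionality of the data and simplify the analysis.
- Contextual understanding: lemmatization can use a word's part of speech to choose the right base form, whereas stemming applies suffix rules without looking at context.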
- Domain-Specific Terms: Consider whether your dataset contains domain-specific terms or jargon that may not be handled well by stemming or lemmatization. In such cases, it may be better to skip these steps to preserve the specific vocabulary of the domain.
- Time and Resources: Stemming and lemmatization can add computational overhead, especially for large datasets. If you have limited computational resources or time constraints, you might choose to skip these steps.
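The difference between the two techniques can be sketched with toy versions of each. The code below is not the Porter stemmer or the WordNet lemmatizer; `SUFFIX_RULES`, `LEMMA_LOOKUP`, `toy_stem`, and `toy_lemmatize` are hypothetical stand-ins meant only to show that rule-based stems may not be real words, while a dictionary lookup returns real words but only for entries it knows.

```python
# Toy suffix rules, applied in order (a real stemmer has many more rules).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]
# Toy lemma dictionary (a real lemmatizer consults a lexical database).
LEMMA_LOOKUP = {"studies": "study", "feet": "foot", "movies": "movie"}

def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMA_LOOKUP.get(word, word)

for word in ["studies", "feet", "movies"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

Here "studies" stems to "studi" but lemmatizes to "study", and the irregular plural "feet" is only handled by the lookup.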

In [21]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [22]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def perform_stemming(text):
    tokens = word_tokenize(text)
    stemmed_words = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_words)

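def perform_lemmatization(text):
    tokens = word_tokenize(text)
    # lemmatize() treats every token as a noun unless a part-of-speech tag is passed,
    # so verb forms such as "running" are returned unchanged here
    lemmatized_words = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_words)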

data['Stemmed_Text'] = data['Text'].apply(perform_stemming)
data['Lemmatized_Text'] = data['Text'].apply(perform_lemmatization)


## Feature Extraction:
Convert the preprocessed text into numerical representations that machine learning models can process.

**Bag-of-Words (BoW)**: In the bag-of-words approach, each word is treated as a separate feature, and the presence or absence of words is used to represent the text. This technique does not consider the order or context of the words, but it can be effective in capturing important keywords or themes in the text.

To apply the **Bag-of-Words (BoW)** technique to the dataset, you can use the `CountVectorizer` class from the `sklearn.feature_extraction.text` module in `scikit-learn`. This class allows you to convert a collection of text documents into a matrix of token counts.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to your text data
bow_matrix = vectorizer.fit_transform(data['Text'])

# Print the shape of the BoW matrix
print("Shape of BoW matrix:", bow_matrix.shape)


Shape of BoW matrix: (1600000, 772770)


The matrix has 1,600,000 rows (one per tweet) and 772,770 columns (one per unique token in the vocabulary). Each entry in the matrix represents the count of a specific token in a particular tweet.

The large number of unique tokens suggests that the dataset contains a wide vocabulary, which can provide rich information for analysis. However, it's important to consider the computational resources required to work with such a large matrix. `CountVectorizer` returns it as a sparse matrix, which stores only the non-zero counts.
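To make the matrix layout concrete, here is a toy bag-of-words built by hand for two short documents. It is a simplified sketch of the idea, not how `CountVectorizer` is implemented; the documents and the helper name `to_count_vector` are made up for illustration.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]

# Vocabulary: every unique token, sorted so each one gets a stable column index
vocabulary = sorted({token for doc in docs for token in doc.split()})

def to_count_vector(doc, vocab):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for row in matrix:
    print(row)     # [0, 2, 1, 1] then [1, 0, 1, 0]
```

Each row is one document and each column one token, which is the same shape convention as the (1600000, 772770) matrix above, only much smaller and stored densely.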

### Train-Test Split:
- Split the preprocessed dataset into training and testing subsets.
- The training subset is used to train the sentiment analysis model, while the testing subset is used to evaluate its performance.
- A common split is to allocate around 80% of the data for training and the remaining 20% for testing.


In [24]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(bow_matrix, data['Sentiment'], test_size=0.2, random_state=42)

# Print the shapes of the training and testing data
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (1280000, 772770)
Shape of X_test: (320000, 772770)
Shape of y_train: (1280000,)
Shape of y_test: (320000,)
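The shapes above follow directly from the 80/20 arithmetic: 20% of 1,600,000 rows is 320,000, leaving 1,280,000 for training. The sketch below shows the idea of a shuffled split on row indices using only the standard library. The function `split_indices` is a hypothetical helper, and it does not reproduce the exact shuffle order of `train_test_split`.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2

# The same arithmetic for the full dataset
n_test_full = round(1_600_000 * 0.2)
print(1_600_000 - n_test_full, n_test_full)  # 1280000 320000
```

Fixing the seed plays the same role as `random_state=42` in the cell above: rerunning the split gives the same partition.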


### Model Training and Evaluation
1- Train and evaluate a logistic regression model with scikit-learn, using `data['Stemmed_Text']`

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(data['Stemmed_Text'], data['Sentiment'], test_size=0.2, random_state=42)

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()
# Fit the vectorizer to your text data
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Create an instance of the logistic regression model
model = LogisticRegression(max_iter=1000)  # Increase max_iter for convergence

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Update the sentiment labels to 0 and 1
y_test[y_test == 4] = 1
y_pred[y_pred == 4] = 1

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
f1 = f1_score(y_test, y_pred, pos_label=1)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 0.781025
Precision: 0.7729769867909493
Recall: 0.7977209574719948
F1 Score: 0.7851540702130921


2- Train and evaluate a logistic regression model with scikit-learn, using `data['Lemmatized_Text']`

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(data['Lemmatized_Text'], data['Sentiment'], test_size=0.2, random_state=42)

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()
# Fit the vectorizer to your text data
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Create an instance of the logistic regression model
model = LogisticRegression(max_iter=1000)  # Increase max_iter for convergence

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Update the sentiment labels to 0 and 1
y_test[y_test == 4] = 1
y_pred[y_pred == 4] = 1

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
f1 = f1_score(y_test, y_pred, pos_label=1)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 0.784525
Precision: 0.7761476744887494
Recall: 0.8016024323078265
F1 Score: 0.7886697152104353


3- Train and evaluate a **Naive Bayes classifier**, using the `data['Lemmatized_Text']` column

In [63]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split the dataset into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(data['Lemmatized_Text'], data['Sentiment'], test_size=0.2, random_state=42)

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()
# Fit the vectorizer to your text data
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Create an instance of the Naive Bayes classifier
model = MultinomialNB()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Update the sentiment labels to 0 and 1
y_test[y_test == 4] = 1
y_pred[y_pred == 4] = 1

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
f1 = f1_score(y_test, y_pred, pos_label=1)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)



Accuracy: 0.77321875
Precision: 0.7923209892959244
Recall: 0.7424831470474624
F1 Score: 0.7665929061225539


The `accuracy` represents the overall correctness of the predictions, while `precision` measures the proportion of correctly predicted positive sentiment instances out of all predicted positive instances. `Recall` indicates the proportion of correctly predicted positive sentiment instances out of all actual positive instances. The `F1 score` is the harmonic mean of precision and recall, providing a balanced measure of the model's performance.
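These definitions can be written out directly from the counts of true positives, false positives, and false negatives. The sketch below computes them by hand on a small made-up label list; `binary_metrics` is a hypothetical helper and is not the scikit-learn implementation. The last lines check that the F1 score reported for the first logistic regression run is the harmonic mean of its reported precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)

# Harmonic mean of the precision and recall reported for the stemmed-text run
p, r = 0.7729769867909493, 0.7977209574719948
reported_f1_check = 2 * p * r / (p + r)
print(reported_f1_check)
```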

Based on these results, Logistic Regression (on lemmatized text) and Naive Bayes achieve broadly similar accuracy (about 0.785 vs. 0.773) and F1 score (about 0.789 vs. 0.767). Naive Bayes shows slightly higher precision (about 0.792 vs. 0.776), while Logistic Regression has higher recall (about 0.802 vs. 0.742).

The lower accuracy, recall, and F1 score of Naive Bayes indicate that it has somewhat more difficulty classifying the sentiment of the tweets. This could be due to the limitations of the Naive Bayes algorithm, which assumes independence between features and may not capture complex relationships in the text data.

## Model Optimization and Fine-tuning
To get a more reliable estimate of the model's performance and to guide fine-tuning, I use cross-validation. I utilize a technique called **"memory-friendly cross-validation"** or **"incremental cross-validation."** This approach aims to perform cross-validation without loading the entire dataset into memory at once.

### Optimize and fine-tune the Naive Bayes model using cross-validation

In [32]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB


In [33]:
# Create an instance of the Naive Bayes classifier
naive_bayes = MultinomialNB()

In [35]:
#Perform cross-validation. Here X_train represents the training data containing the lemmatized text, and y_train contains the corresponding sentiment labels
scores = cross_val_score(naive_bayes, X_train, y_train, cv=5, scoring='accuracy')



In [None]:
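# Compute the mean accuracy and other evaluation metrics.
# y_train still uses the labels 0 and 4, and the string scorers 'precision', 'recall'
# and 'f1' assume the positive class is labeled 1, so pass pos_label=4 explicitly.
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score

mean_accuracy = scores.mean()
precision = cross_val_score(naive_bayes, X_train, y_train, cv=5, scoring=make_scorer(precision_score, pos_label=4)).mean()
recall = cross_val_score(naive_bayes, X_train, y_train, cv=5, scoring=make_scorer(recall_score, pos_label=4)).mean()
f1 = cross_val_score(naive_bayes, X_train, y_train, cv=5, scoring=make_scorer(f1_score, pos_label=4)).mean()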


In [47]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Create a pipeline with CountVectorizer and Naive Bayes classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Perform cross-validation
cv_scores = cross_val_score(pipeline, data['Lemmatized_Text'], data['Sentiment'], cv=5, scoring='accuracy')

# Compute the mean accuracy and print the results
mean_accuracy = cv_scores.mean()
print("Mean Accuracy:", mean_accuracy)


Mean Accuracy: 0.7620150000000001
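With `cv=5`, the data is divided into five folds; each fold serves once as the validation set while the other four are used for training, and the five scores are averaged. The sketch below shows that index bookkeeping in plain Python. It is a simplification: `k_fold_indices` is a hypothetical helper that uses contiguous folds, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, k=5))
for train, validation in folds:
    print("validate on", validation, "train on", len(train), "rows")
```

Because the pipeline in the cell above is passed to `cross_val_score`, the vectorizer is refitted on the training part of each fold, so the validation tweets do not influence the vocabulary.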


## Sentiment Prediction:
- Once we have a trained and optimized model, we can use it to predict the sentiment of new, unseen tweets or text data.
- Pass the preprocessed input text through the trained model to obtain sentiment predictions (e.g., positive or negative sentiment).


In [51]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    filtered_words = [word for word in tokens if word.lower() not in stop_words]

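    # Perform stemming (if the model was trained on 'Lemmatized_Text',
    # lemmatize here instead so the features match the training data)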
    stemmed_words = [stemmer.stem(word) for word in filtered_words]

    # Join the stemmed words back into a single string
    preprocessed_text = ' '.join(stemmed_words)

    return preprocessed_text


In [66]:
# Train the model on the training data
model.fit(X_train, y_train)

# Specify the new text for sentiment prediction
new_text = "This is a great movie. I loved it!"

# Preprocess the input text (e.g., remove stopwords, perform stemming/lemmatization)
preprocessed_text = preprocess_text(new_text)

# Vectorize the preprocessed text using the same vectorizer used during training
text_vector = vectorizer.transform([preprocessed_text])

# Make sentiment prediction using the trained model
sentiment_prediction = model.predict(text_vector)

# Print the sentiment prediction (predict() returns an array, so take its first element)
if sentiment_prediction[0] == 0:
    print("Negative sentiment")
else:
    print("Positive sentiment")



Positive sentiment


## Conclusion:
To predict the sentiment of a new text using the trained model, follow these steps:

1- Preprocess the new text in a similar manner as the training data, including steps like removing stopwords, performing stemming/lemmatization, and any other necessary text normalization techniques.

2- Vectorize the preprocessed text using the same vectorizer used during training. This step converts the text into a numerical representation that can be understood by the model.

3- Use the trained model to make a sentiment prediction on the vectorized text. This can be done by calling the predict method of the model and passing the vectorized text as input.

In [67]:
new_text = "I hate this movie, it has a derogatory scene!"

In [69]:
# Preprocess the input text (e.g., remove stopwords, perform stemming/lemmatization)
preprocessed_text = preprocess_text(new_text)

# Vectorize the preprocessed text using the same vectorizer used during training
text_vector = vectorizer.transform([preprocessed_text])

# Make sentiment prediction using the trained model
sentiment_prediction = model.predict(text_vector)

# Print the sentiment prediction (predict() returns an array, so take its first element)
if sentiment_prediction[0] == 0:
    print("Negative sentiment")
else:
    print("Positive sentiment")

Negative sentiment
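Since the two prediction cells above repeat the same three steps, they could be wrapped in a small helper. The sketch below is one possible shape for it; `predict_sentiment` is a hypothetical function, and `StubVectorizer` and `StubModel` are minimal stand-ins that exist only so the example runs on its own. In the notebook you would pass `preprocess_text`, the fitted `vectorizer`, and the trained `model` instead.

```python
def predict_sentiment(text, preprocess, vectorizer, model):
    """Run preprocessing, vectorization, and prediction for one text."""
    features = vectorizer.transform([preprocess(text)])
    label = model.predict(features)[0]  # predict() returns one label per input
    return "Negative sentiment" if label == 0 else "Positive sentiment"

# Stand-ins for illustration only (not scikit-learn classes)
class StubVectorizer:
    def transform(self, texts):
        # One feature per text: 1 if the word "hate" occurs, else 0
        return [[int("hate" in t)] for t in texts]

class StubModel:
    def predict(self, features):
        # Label 0 (negative) when the feature is set, otherwise 4 (positive)
        return [0 if row[0] else 4 for row in features]

print(predict_sentiment("I hate this movie", str.lower, StubVectorizer(), StubModel()))
print(predict_sentiment("I loved it", str.lower, StubVectorizer(), StubModel()))
```

Passing the preprocessing function, vectorizer, and model as arguments makes it harder to accidentally mix a model with a vectorizer it was not trained with.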
