<a href="https://colab.research.google.com/github/ashishmission93/ML-PTOJECTS/blob/main/sentiment_analysis_and_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The example below shows how to perform sentiment analysis and text classification on the NLTK movie reviews dataset:

In [1]:
import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')

# Load movie reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
import random
random.shuffle(documents)

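# Build the feature vocabulary: the 2,000 most frequent words in the corpus.
# FreqDist is a Counter subclass in NLTK 3, so list(all_words)[:2000] would
# return the first 2,000 words encountered rather than the most frequent ones.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for word, _ in all_words.most_common(2000)]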

# Define feature extractor function
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Extract features from documents
featuresets = [(document_features(d), c) for (d, c) in documents]

# Split dataset into training and testing sets
train_set, test_set = train_test_split(featuresets, test_size=0.2, random_state=42)

# Train a classifier (e.g., Support Vector Machine)
classifier = nltk.classify.SklearnClassifier(SVC(kernel='linear'))
classifier.train(train_set)

# Test the classifier
y_true = [category for (features, category) in test_set]
y_pred = [classifier.classify(features) for (features, category) in test_set]

# Evaluate classifier accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Classifier Accuracy:", accuracy)


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Classifier Accuracy: 0.8175
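
The accuracy printed above is the fraction of test reviews whose predicted label matches the true label. As a rough illustration, the sketch below recomputes that metric by hand and adds a per-class breakdown. The label lists are made-up toy values, not output from the classifier above, and `accuracy` and `per_class_counts` are hypothetical helper names introduced only for this example.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs, i.e. a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))

# Toy labels for illustration only (not results from the notebook).
toy_true = ["pos", "pos", "neg", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_counts = per_class_counts(toy_true, toy_pred)
print("Accuracy:", toy_accuracy)
print("pos predicted as neg:", toy_counts[("pos", "neg")])
```

A single accuracy number hides which class the errors fall in, so the pair counts can be a useful companion when the two classes matter differently.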


This code snippet demonstrates:

Downloading and loading the movie reviews dataset from NLTK.
Representing each pre-tokenized review as binary features that record whether each of 2,000 vocabulary words appears in it.
Splitting the dataset into training and testing sets.
Training a Support Vector Machine (SVM) classifier using the training set.
Testing the classifier on the testing set and evaluating its accuracy.
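
To make the feature-extraction step more concrete, here is a minimal sketch of the same "contains(word)" idea on a tiny made-up corpus, using only the standard library. The toy reviews, the vocabulary size of 3, and the helper name `contains_features` are assumptions for illustration and are not part of the code above.

```python
from collections import Counter

# Toy tokenized reviews (made up for illustration).
toy_documents = [
    (["a", "great", "film", "with", "a", "great", "cast"], "pos"),
    (["a", "dull", "film"], "neg"),
]

# Vocabulary: the most frequent words across all documents.
word_counts = Counter(w.lower() for words, _ in toy_documents for w in words)
vocabulary = [word for word, _ in word_counts.most_common(3)]

def contains_features(tokens, vocab):
    """Map each vocabulary word to True/False depending on presence in tokens."""
    present = set(tokens)
    return {f"contains({word})": (word in present) for word in vocab}

# One (feature dict, label) pair per document, the shape NLTK classifiers expect.
toy_featuresets = [(contains_features(words, vocabulary), label)
                   for words, label in toy_documents]
print(vocabulary)
print(toy_featuresets[1])
```

Each review becomes a fixed-size dictionary of booleans, so word order and word counts are discarded and only presence is kept.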

In [2]:
import nltk
from nltk.corpus import movie_reviews, stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.decomposition import LatentDirichletAllocation

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('punkt')  # Newer NLTK releases may also need nltk.download('punkt_tab') for word_tokenize

# Load movie reviews dataset
documents = [(movie_reviews.raw(fileid), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
import random
random.shuffle(documents)

# Preprocess the documents
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenization and lowercase conversion
    tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    return ' '.join(tokens)

# Apply preprocessing to documents
documents = [(preprocess_text(text), category) for text, category in documents]

# Split dataset into training and testing sets
X = [text for text, _ in documents]
y = [category for _, category in documents]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=2000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a classifier (e.g., Support Vector Machine)
classifier = SVC(kernel='linear')
classifier.fit(X_train_tfidf, y_train)

# Test the classifier
y_pred = classifier.predict(X_test_tfidf)

# Evaluate classifier accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Classifier Accuracy:", accuracy)

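# Perform topic modeling using Latent Dirichlet Allocation (LDA).
# LDA models word counts, so it is more commonly fit on CountVectorizer output;
# here it reuses the TF-IDF matrix built above.
lda = LatentDirichletAllocation(n_components=5, random_state=42)
X_lda = lda.fit_transform(X_train_tfidf)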

# Display top words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # Indices of the 10 highest-weighted words, largest weight first
    top_words_idx = topic.argsort()[:-10 - 1:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx+1}: {', '.join(top_words)}")


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Classifier Accuracy: 0.84
Topic 1: mulan, disney, army, animated, serve, murphy, chinese, eddie, animation, spectacular
Topic 2: mulan, wrestling, larry, flynt, spawn, shrek, vampires, truman, carpenter, spice
Topic 3: truman, flynt, wrestling, spawn, larry, carrey, norton, speech, freedom, court
Topic 4: nbsp, television, carter, meaning, fbi, appearance, series, culture, truth, independence
Topic 5: film, movie, one, like, even, good, story, time, would, characters
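
The slice `topic.argsort()[:-10 - 1:-1]` used to pick the top words is compact but easy to misread: it walks the ascending index order backwards from the end and stops after ten items. The pure-Python sketch below shows the same idea with `k = 3`; the weights, the vocabulary, and the helper name `top_k_indices` are made up for illustration.

```python
def top_k_indices(weights, k):
    """Return indices of the k largest weights, largest first."""
    # Indices sorted by ascending weight (a pure-Python analogue of argsort).
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Step backwards from the end and stop before position -(k + 1).
    return ascending[:-k - 1:-1]

# Made-up topic weights and vocabulary, for illustration only.
toy_weights = [0.1, 0.7, 0.05, 0.4, 0.3]
toy_vocab = ["plot", "film", "the", "actor", "scene"]

top = top_k_indices(toy_weights, 3)
print([toy_vocab[i] for i in top])
```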


In this expanded code:

We preprocess the text data by tokenizing, converting to lowercase, removing non-alphabetic tokens, and filtering out stopwords.
We split the dataset into training and testing sets.
We vectorize the text data using a TF-IDF (Term Frequency-Inverse Document Frequency) representation, keeping the 2,000 most frequent terms and fitting the vectorizer on the training set only.
We train a Support Vector Machine (SVM) classifier on the TF-IDF vectors.
We evaluate the classifier's accuracy on the testing set.
We perform topic modeling using Latent Dirichlet Allocation (LDA) with five topics, fit on the training documents, to identify topics in the reviews.
We display the top words for each topic discovered by LDA.
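
As a rough illustration of what the TF-IDF step computes, the sketch below builds TF-IDF vectors by hand for two made-up documents. It assumes raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which matches scikit-learn's documented defaults; the whitespace tokenization and the helper name `tfidf_vectors` are simplifications introduced for this example, so the numbers are not meant to reproduce `TfidfVectorizer` output on real text.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute L2-normalized TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        # Guard against an all-zero row (empty document).
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocab, vectors

# Made-up documents for illustration only.
toy_docs = ["good film good cast", "bad film"]
toy_vocab, toy_vectors = tfidf_vectors(toy_docs)
for term, weight in zip(toy_vocab, toy_vectors[1]):
    print(f"{term}: {weight:.3f}")
```

In the second toy document, "bad" receives a higher weight than "film" because "film" appears in both documents and is therefore less informative.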