# **Text Classification using Naive Bayes and Sentiment Analysis on Blog Posts**

**Overview:**

In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).


**Dataset:**

The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:

Text: The content of the blog post. Column name: Data

Category: The category to which the blog post belongs. Column name: Labels

**Tasks:**

Task 1: Data Exploration and Preprocessing

1) Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

2) Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

3) Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.

Task 2: Naive Bayes Model for Text Classification

1) Split the data into training and test sets.

2) Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

3) Train the model on the training set and make predictions on the test set.

Task 3: Sentiment Analysis

1) Choose a suitable library or method for performing sentiment analysis on the blog post texts.

2) Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.

3) Examine the distribution of sentiments across different categories and summarize your findings.

Task 4: Evaluation

1) Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

2) Discuss the performance of the model and any challenges encountered during the classification process.

3) Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.


**Task 1: Data Exploration and Preprocessing**

1) Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.

In [3]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/blogs_categories.csv')
data

,Unnamed: 0,Data,Labels
0,0,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...,alt.atheism
1,1,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
2,2,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
3,3,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
4,4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...,alt.atheism
...,...,...,...
19992,19992,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19993,19993,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:54...,talk.religion.misc
19994,19994,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
19995,19995,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [5]:
# Display basic information about the dataset
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19997 entries, 0 to 19996
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  19997 non-null  int64 
 1   Data        19997 non-null  object
 2   Labels      19997 non-null  object
dtypes: int64(1), object(2)
memory usage: 468.8+ KB
None


In [6]:
# Display first 5 rows of the dataset
print(data.head())

   Unnamed: 0                                               Data       Labels
0           0  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...  alt.atheism
1           1  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...  alt.atheism
2           2  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
3           3  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...  alt.atheism
4           4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...  alt.atheism


In [7]:
# Check for missing values
print(data.isnull().sum())

Unnamed: 0    0
Data          0
Labels        0
dtype: int64
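
Beyond `info()` and the null check, two quick summaries are often useful before modeling: how many posts each label has and how long the posts are. The sketch below is one possible way to get both. It uses a small hypothetical stand-in frame (`toy`) with the same `Data` and `Labels` columns so that it runs on its own; `summarize` is an illustrative helper name, not part of the assignment.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-label post count and median character length of the text."""
    lengths = df["Data"].str.len()
    return (
        df.assign(n_chars=lengths)
          .groupby("Labels")["n_chars"]
          .agg(posts="count", median_chars="median")
          .sort_values("posts", ascending=False)
    )

# Hypothetical stand-in rows; pass the notebook's `data` frame instead of `toy`.
toy = pd.DataFrame({
    "Data": ["short post", "a somewhat longer post", "mid post"],
    "Labels": ["sci.med", "sci.med", "rec.autos"],
})
summary = summarize(toy)
print(summary)
```

On the real data, `summarize(data)` would show whether the categories are of similar size and whether some contain unusually long or short posts.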


2) Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

In [8]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import download

# Download necessary NLTK data
download('punkt')
download('stopwords')

# Initialize stopwords and stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Define a function for text preprocessing
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    words = word_tokenize(text)  # Tokenize text
    words = [stemmer.stem(word) for word in words if word not in stop_words]  # Remove stopwords and apply stemming
    return ' '.join(words)

# Apply preprocessing
data['Processed_Text'] = data['Data'].apply(preprocess_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
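
The sample rows shown earlier start with Usenet header lines such as `Xref:` and `Newsgroups:`, and the `Newsgroups:` header names the category itself. If those lines stay in the text, the classifier can read the label almost directly. Below is a minimal sketch of removing the header block, assuming (this is an assumption about the file, not something verified here) that each post separates its headers from the body with the first blank line. `strip_headers` and `HEADER_LINE` are hypothetical names.

```python
import re

# A header line looks like "Key: value" at the very start of the post.
HEADER_LINE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:")

def strip_headers(post: str) -> str:
    """Return the body of a Usenet-style post.

    Assumes headers end at the first blank line. If the post does not
    start with a header-like line, or has no blank line, it is returned
    unchanged.
    """
    head, sep, body = post.partition("\n\n")
    if sep and HEADER_LINE.match(head):
        return body
    return post

sample = "Newsgroups: alt.atheism\nPath: cantaloupe.srv.cs.cmu.edu\n\nIs this the body?"
print(strip_headers(sample))
```

In the notebook this could be applied as `data['Data'].apply(strip_headers)` before `preprocess_text`; the accuracy with and without headers can then be compared.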


3) Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the data
X = vectorizer.fit_transform(data['Processed_Text'])
y = data['Labels']
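
For reference, this is the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the raw count of $t$ in document $d$; the symbols $w$ and $\hat{w}$ are notation introduced for this note.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t), \qquad
\hat{w}(t, d) = \frac{w(t, d)}{\sqrt{\sum_{t'} w(t', d)^{2}}}
```

Terms that appear in almost every post get an IDF close to 1, rare terms get a larger one, and the final division scales each post's vector to unit length so that long and short posts are comparable.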


**Task 2: Naive Bayes Model for Text Classification**

1) Split the data into training and test sets.

In [10]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
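
In the cells above, the vectorizer is fitted on all rows before the split, so IDF statistics from the test posts influence the training features. One way to avoid that is to split the text first and wrap the vectorizer and classifier in a `Pipeline`, so that `fit` only sees training text. The sketch below uses a tiny hypothetical corpus so it runs on its own; in the notebook the inputs would be `data['Processed_Text']` and `data['Labels']`. The `stratify` argument keeps the label proportions the same in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for data['Processed_Text'] / data['Labels'].
texts = [
    "engine car brake wheel", "car tire engine oil",
    "drive car road engine", "wheel brake car drive",
    "doctor patient medicine dose", "patient symptom doctor",
    "medicine dose treatment patient", "treatment doctor symptom medicine",
]
labels = ["rec.autos"] * 4 + ["sci.med"] * 4

# Split the text first, keeping label proportions equal in both parts.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains Naive Bayes.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(text_train, label_train)
predictions = clf.predict(text_test)
print(list(zip(label_test, predictions)))
```

At prediction time the pipeline applies the vocabulary and IDF weights learned from the training text, which is the situation the model would face on genuinely unseen posts.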


2) Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.

In [11]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)
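
As a reminder of what `MultinomialNB` fits, the decision rule and the smoothed per-class term probabilities can be written as below. The notation is introduced for this note: $x_t$ is the value of feature $t$ in the post being classified, $N_{c,t}$ is the sum of feature $t$ over the training posts of class $c$, $N_c = \sum_t N_{c,t}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (`alpha`, 1.0 by default).

```latex
\hat{c} = \arg\max_{c}\Big[\log P(c) + \sum_{t} x_{t}\,\log\theta_{c,t}\Big], \qquad
\theta_{c,t} = \frac{N_{c,t} + \alpha}{N_{c} + \alpha\,|V|}
```

With TF-IDF input, $x_t$ and $N_{c,t}$ are fractional weights rather than integer counts; the multinomial model is then used as a practical approximation.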


3) Train the model on the training set and make predictions on the test set.

In [12]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)


Accuracy: 0.903
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.77      0.77      0.77       173
           comp.graphics       0.88      0.91      0.89       179
 comp.os.ms-windows.misc       0.95      0.88      0.91       226
comp.sys.ibm.pc.hardware       0.87      0.85      0.86       204
   comp.sys.mac.hardware       0.90      0.96      0.93       205
          comp.windows.x       0.96      0.94      0.95       186
            misc.forsale       0.92      0.81      0.86       190
               rec.autos       0.91      0.96      0.93       203
         rec.motorcycles       1.00      0.96      0.98       218
      rec.sport.baseball       0.99      0.98      0.99       192
        rec.sport.hockey       0.98      0.99      0.98       203
               sci.crypt       0.89      0.99      0.94       200
         sci.electronics       0.95      0.89      0.92       227
                 sci.med       1.00 

**Task 3: Sentiment Analysis**

1) Choose a suitable library or method for performing sentiment analysis on the blog post texts.

In [13]:
from textblob import TextBlob


2) Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.

In [14]:
# Define a function to get sentiment
def get_sentiment(text):
    analysis = TextBlob(text)
    # Classify the sentiment as positive, negative, or neutral
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity < 0:
        return 'negative'
    else:
        return 'neutral'

# Apply sentiment analysis
data['Sentiment'] = data['Data'].apply(get_sentiment)
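
Because the rule above labels a post neutral only when its polarity is exactly 0, long texts almost never land in that class (the distribution table in step 3 shows 0 to 2 neutral posts per category). A common alternative is a small neutral band around zero. The sketch below shows the idea; the width of 0.05 is an arbitrary illustrative choice and `label_polarity` is a hypothetical helper name.

```python
def label_polarity(polarity: float, band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a label, with a neutral band around 0."""
    if polarity > band:
        return "positive"
    if polarity < -band:
        return "negative"
    return "neutral"

for score in (0.30, 0.02, -0.04, -0.20):
    print(score, label_polarity(score))
```

In the notebook this could be called as `label_polarity(TextBlob(text).sentiment.polarity)`; a suitable band width would have to be chosen by inspecting a sample of posts.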

3) Examine the distribution of sentiments across different categories and summarize your findings.

In [15]:
# Examine the distribution of sentiments across categories
sentiment_distribution = data.groupby('Labels')['Sentiment'].value_counts().unstack().fillna(0)
print(sentiment_distribution)

Sentiment                 negative  neutral  positive
Labels                                               
alt.atheism                  199.0      0.0     801.0
comp.graphics                250.0      1.0     749.0
comp.os.ms-windows.misc      236.0      0.0     764.0
comp.sys.ibm.pc.hardware     238.0      1.0     761.0
comp.sys.mac.hardware        242.0      0.0     758.0
comp.windows.x               290.0      2.0     708.0
misc.forsale                 229.0      0.0     771.0
rec.autos                    201.0      0.0     799.0
rec.motorcycles              262.0      0.0     738.0
rec.sport.baseball           249.0      0.0     751.0
rec.sport.hockey             297.0      0.0     703.0
sci.crypt                    209.0      0.0     791.0
sci.electronics              211.0      0.0     789.0
sci.med                      219.0      0.0     781.0
sci.space                    235.0      1.0     764.0
soc.religion.christian       171.0      0.0     826.0
talk.politics.guns          
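
Raw counts are easier to compare across categories when expressed as row shares. One way to do that is `pd.crosstab` with `normalize='index'`. The sketch below uses a small hypothetical frame (`toy`) with the same `Labels` and `Sentiment` columns so it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    "Labels": ["sci.med", "sci.med", "sci.med", "sci.med", "rec.autos", "rec.autos"],
    "Sentiment": ["positive", "positive", "positive", "negative", "positive", "negative"],
})

# Each row sums to 1.0: the share of each sentiment within a category.
shares = pd.crosstab(toy["Labels"], toy["Sentiment"], normalize="index")
print(shares.round(2))
```

With the real data, `pd.crosstab(data['Labels'], data['Sentiment'], normalize='index')` would give the same table as above in proportions.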

**Task 4: Evaluation**

1) Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.

In [16]:
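from sklearn.metrics import classification_report, accuracy_score

# y_test holds the true labels and y_pred the predicted labels.
# The labels are strings, so classification_report names the rows itself;
# passing target_names=y.unique() could mislabel rows whenever the order
# of first appearance differs from the sorted label order.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)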

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)


Accuracy: 0.903
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.77      0.77      0.77       173
           comp.graphics       0.88      0.91      0.89       179
 comp.os.ms-windows.misc       0.95      0.88      0.91       226
comp.sys.ibm.pc.hardware       0.87      0.85      0.86       204
   comp.sys.mac.hardware       0.90      0.96      0.93       205
          comp.windows.x       0.96      0.94      0.95       186
            misc.forsale       0.92      0.81      0.86       190
               rec.autos       0.91      0.96      0.93       203
         rec.motorcycles       1.00      0.96      0.98       218
      rec.sport.baseball       0.99      0.98      0.99       192
        rec.sport.hockey       0.98      0.99      0.98       203
               sci.crypt       0.89      0.99      0.94       200
         sci.electronics       0.95      0.89      0.92       227
                 sci.med       1.00 

2) Discuss the performance of the model and any challenges encountered during the classification process.

**Model Performance:**

Accuracy:

The classifier reached an accuracy of 0.903 on the held-out test set, meaning roughly 90 of every 100 test posts were assigned the correct category. The per-category support values in the report (about 170 to 230 posts each) show that this figure is not driven by one dominant class.

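Precision, Recall, and F1-Score: Precision is the share of posts predicted as a category that truly belong to it; recall is the share of a category's posts that the model finds; F1 is their harmonic mean. High precision with low recall means the model's predictions for a category are usually right but it misses many of that category's posts. High recall with low precision means it finds most of them but also pulls in posts from other categories. In the report above, misc.forsale shows the first pattern (precision 0.92, recall 0.81) and sci.crypt the second (precision 0.89, recall 0.99), while alt.atheism is the weakest category shown at 0.77 on all three metrics.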
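
To make the three metrics concrete, here is a small worked example that computes them from raw counts for a single category. The counts are hypothetical and `precision_recall_f1` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts for one category: 80 correct hits, 20 false alarms, 40 misses.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints: precision=0.80 recall=0.67 f1=0.73
```

The F1 value sits closer to the lower of the two inputs, which is why a category with one weak metric also shows a weak F1 in the report.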

**Challenges:**

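Imbalanced Data: Class imbalance can make a classifier favor majority classes. It is unlikely to be the main issue here: the sentiment table in Task 3 shows about 1,000 posts in each category listed, and the test-set supports are of similar size.

Feature Extraction: TF-IDF treats each post as a bag of words, so word order and context are lost, and categories that share vocabulary are hard to separate. In addition, the vectorizer above was fitted on the full dataset before the train/test split, so the IDF weights include information from the test posts; fitting it on the training set only gives a cleaner estimate.

Text Preprocessing: The raw posts begin with header lines such as Xref: and Newsgroups:, and the Newsgroups: header names the category. The preprocessing step keeps those words, which may inflate the measured accuracy. Slang, quoted replies, and signatures are also left in and add noise.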

Overfitting/Underfitting: If the model is too complex or too simple, it might not generalize well. In Naive Bayes, overfitting is less common, but feature selection and preprocessing can still influence performance.
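
To see which categories the model actually mixes up, one option is to count the (true, predicted) pairs that disagree. The sketch below uses only the standard library; `most_confused` is a hypothetical helper and the label lists are made-up examples. In the notebook, `y_test` and `y_pred` would be passed instead.

```python
from collections import Counter

def most_confused(y_true, y_pred, n=3):
    """Return the n most frequent (true, predicted) pairs where the two differ."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)

# Made-up labels for illustration only.
true_labels = ["alt.atheism", "alt.atheism", "talk.religion.misc", "sci.med", "alt.atheism"]
pred_labels = ["talk.religion.misc", "talk.religion.misc", "alt.atheism", "sci.med", "alt.atheism"]
print(most_confused(true_labels, pred_labels))
```

Pairs that dominate this list point to categories with overlapping vocabulary, which is more informative than the overall accuracy when deciding what to improve.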


3) Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.

In [17]:
# Sentiment distribution across categories
sentiment_distribution = data.groupby('Labels')['Sentiment'].value_counts().unstack().fillna(0)
print(sentiment_distribution)


Sentiment                 negative  neutral  positive
Labels                                               
alt.atheism                  199.0      0.0     801.0
comp.graphics                250.0      1.0     749.0
comp.os.ms-windows.misc      236.0      0.0     764.0
comp.sys.ibm.pc.hardware     238.0      1.0     761.0
comp.sys.mac.hardware        242.0      0.0     758.0
comp.windows.x               290.0      2.0     708.0
misc.forsale                 229.0      0.0     771.0
rec.autos                    201.0      0.0     799.0
rec.motorcycles              262.0      0.0     738.0
rec.sport.baseball           249.0      0.0     751.0
rec.sport.hockey             297.0      0.0     703.0
sci.crypt                    209.0      0.0     791.0
sci.electronics              211.0      0.0     789.0
sci.med                      219.0      0.0     781.0
sci.space                    235.0      1.0     764.0
soc.religion.christian       171.0      0.0     826.0
talk.politics.guns          

**Reflect on Sentiment Analysis Results**

Sentiment Analysis:

Sentiment Distribution: Examine how sentiments are distributed across the categories. TextBlob polarity measures the tone of the words in a post, so a predominance of positive labels in a category says that the posts there use more positive than negative wording; it does not measure how readers received them.

**Implications:**

Tone of Content: A category with a higher share of negative labels may contain more criticism, complaints, or argument, while a higher positive share suggests more favorable wording. Because any polarity above 0 counts as positive, even mildly positive posts fall in that class, which helps explain why positive dominates every category shown and neutral is almost empty.

Category Insights: Among the categories shown, soc.religion.christian has the largest positive share (826 of 997 posts), while rec.sport.hockey (297 negative) and comp.windows.x (290 negative) have the most negative posts. The spread is modest, roughly 70% to 83% positive, so the categories differ in degree rather than in overall tone.

**Reflection:**

After evaluating the Naive Bayes classifier, we observed an accuracy of about 90%, with precision, recall, and F1-scores varying across categories. Among the categories shown, the sports and motorcycle groups score highest (F1 of 0.98 to 0.99) and alt.atheism lowest (F1 of 0.77). Since the categories are close to equal in size, the weaker scores more likely reflect overlapping vocabulary between related groups than class imbalance; the sample rows, for example, show talk.religion.misc posts cross-referenced to alt.atheism.

The sentiment analysis labelled most posts in every category shown as positive (roughly 70% to 83%), with at most two neutral posts per category. soc.religion.christian had the highest positive share and rec.sport.hockey the highest negative share. These labels were computed on the raw Data column, headers included, with a threshold of exactly 0, so they describe word-level tone only and should be read as a rough indicator.

Overall, the Naive Bayes model classified the posts well, and the sentiment analysis added a coarse view of tone across categories rather than a strong distinction between them.
