# TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS
Overview
In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).
Dataset
The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:
•	Text: The content of the blog post. Column name: Data
•	Category: The category to which the blog post belongs. Column name: Labels
Tasks
1. Data Exploration and Preprocessing
•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.
•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.
•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.
2. Naive Bayes Model for Text Classification
•	Split the data into training and test sets.
•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.
•	Train the model on the training set and make predictions on the test set.
3. Sentiment Analysis
•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.
•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.
•	Examine the distribution of sentiments across different categories and summarize your findings.
4. Evaluation
•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.
•	Discuss the performance of the model and any challenges encountered during the classification process.
•	Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.
Submission Guidelines
•	Your submission should include a comprehensive report and the complete codebase.
•	Your code should be well-documented and include comments explaining the major steps.
Evaluation Criteria
•	Correct implementation of data preprocessing and feature extraction.
•	Accuracy and robustness of the Naive Bayes classification model.
•	Depth and insightfulness of the sentiment analysis.
•	Clarity and thoroughness of the evaluation and discussion sections.
•	Overall quality and organization of the report and code.
Good luck, and we look forward to your insightful analysis of the blog posts dataset!


In [8]:
### Importing basic libraries
# and Data Preprocessing

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("blogs_categories.csv")

# Display the first few rows of the dataset
print(df.head())

# Check the structure of the dataset
print(df.info())

# Check for missing values
print(df.isnull().sum())

# Check the distribution of categories (labels)
print(df['Labels'].value_counts())


   Unnamed: 0                                               Data       Labels
0           0  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49...  alt.atheism
1           1  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...  alt.atheism
2           2  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
3           3  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...  alt.atheism
4           4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:51...  alt.atheism
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19997 entries, 0 to 19996
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  19997 non-null  int64 
 1   Data        19997 non-null  object
 2   Labels      19997 non-null  object
dtypes: int64(1), object(2)
memory usage: 468.8+ KB
None
Unnamed: 0    0
Data          0
Labels        0
dtype: int64
alt.atheism                 1000
comp.graphics               1000
talk.politics.misc          1000
talk.po

### Text Preprocessing
Preprocessing involves cleaning the text by removing punctuation, converting text to lowercase, removing stopwords, and tokenizing.

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

# Build the stopword set once instead of rebuilding it on every call
stop_words = set(stopwords.words('english'))

# Function to preprocess text
def preprocess_text(text):
    # Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize on whitespace and remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)
    
    return text

# Apply preprocessing to the text column
df['Data'] = df['Data'].apply(preprocess_text)

# Display a sample of the preprocessed text
print(df['Data'].head())


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    xref cantaloupesrvcscmuedu altatheism altathei...
1    xref cantaloupesrvcscmuedu altatheism altathei...
2    newsgroups altatheism path cantaloupesrvcscmue...
3    xref cantaloupesrvcscmuedu altatheism altpolit...
4    xref cantaloupesrvcscmuedu altatheism socmotss...
Name: Data, dtype: object
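
The sample above still contains routing headers (`xref`, `newsgroups`, `path`), and the raw `Newsgroups:` header names the category itself, so a classifier can partly read the label off the header. One option is to drop the header block before cleaning. The following is a minimal sketch, assuming each post follows the usual Usenet layout of a header block, one blank line, then the body; `strip_headers` and the sample post are hypothetical and not part of the original notebook.

```python
def strip_headers(raw_post: str) -> str:
    """Return the body of a Usenet-style post (the text after the first blank line).

    Assumption: headers and body are separated by a single blank line.
    If no blank line is found, the post is returned unchanged.
    """
    normalized = raw_post.replace("\r\n", "\n")
    _header, separator, body = normalized.partition("\n\n")
    return body if separator else normalized


# Hypothetical post, for illustration only
sample = (
    "Newsgroups: alt.atheism\n"
    "Path: cantaloupe.srv.cs.cmu.edu\n"
    "Subject: Re: example\n"
    "\n"
    "This is the body of the post.\n"
)
print(strip_headers(sample))
```

If used, this would be applied to the raw `Data` column before `preprocess_text`; the accuracy reported later in this notebook would likely change as a result.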


### Feature Extraction
We’ll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text data into numerical features that can be used for classification.
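
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It uses the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The three documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
docs = [
    "hockey game tonight",
    "hockey puck",
    "graphics card",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_vector(tokens: list) -> dict:
    """Raw term counts times idf, scaled to unit L2 length."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "hockey" receives a lower weight than "game" or "tonight" because it also appears in a second document and is therefore less distinctive.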

In [3]:
# Split data into features and labels
X = df['Data']
y = df['Labels']

# Convert text data to TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Check shapes of training and testing data
print(X_train.shape, X_test.shape)


(15997, 5000) (4000, 5000)
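
Note that the cell above fits the vectorizer on all posts before splitting, so the vocabulary and IDF statistics are computed partly from test documents. An alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vectorizer is fitted on the training split only. The sketch below uses a tiny made-up corpus in place of `df['Data']` and `df['Labels']`; the `stratify` argument is an addition that the original cell does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus standing in for df['Data'] and df['Labels']
texts = [
    "the team won the hockey game", "goalie saved the puck in overtime",
    "hockey playoffs start tonight", "the puck hit the goal post",
    "new graphics card renders images fast", "image rendering uses the gpu",
    "polygon shading in computer graphics", "the gpu draws each pixel",
]
labels = ["hockey"] * 4 + ["graphics"] * 4

# Split the raw text first so the test posts stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training split only, then trains Naive Bayes
model = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(len(predictions), "test posts classified")
```

On the real dataset the resulting accuracy may differ slightly from the figure reported below, since the feature space is then built from the training posts alone.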


### Naive Bayes Model for Text Classification

In [4]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Instantiate the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier
nb_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.8885
Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       0.73      0.77      0.75       173
           comp.graphics       0.78      0.89      0.84       179
 comp.os.ms-windows.misc       0.88      0.91      0.89       226
comp.sys.ibm.pc.hardware       0.84      0.78      0.81       204
   comp.sys.mac.hardware       0.89      0.94      0.91       205
          comp.windows.x       0.92      0.91      0.92       186
            misc.forsale       0.85      0.88      0.87       190
               rec.autos       0.90      0.92      0.91       203
         rec.motorcycles       0.98      0.93      0.96       218
      rec.sport.baseball       0.97      0.97      0.97       192
        rec.sport.hockey       0.99      0.98      0.98       203
               sci.crypt       0.96      0.96      0.96       200
         sci.electronics       0.94      0.90      0.92       227
                 sci.med       0.9

# Sentiment Analysis
We'll use TextBlob for sentiment analysis of the blog posts; VADER is a common alternative.

In [5]:
!pip install textblob




### Perform Sentiment Analysis

In [6]:
from textblob import TextBlob

# Function to classify sentiment
def get_sentiment(text):
    sentiment_score = TextBlob(text).sentiment.polarity
    if sentiment_score > 0:
        return 'Positive'
    elif sentiment_score < 0:
        return 'Negative'
    else:
        return 'Neutral'

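# Apply sentiment analysis to the Data column.
# Note: 'Data' was overwritten with the preprocessed text above, so stopwords
# (including negations such as "not") are already removed before scoring.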
df['Sentiment'] = df['Data'].apply(get_sentiment)

# Display the sentiment distribution
print(df['Sentiment'].value_counts())


Positive    14272
Negative     5707
Neutral        18
Name: Sentiment, dtype: int64
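
Because a post is labelled Neutral only when its polarity is exactly 0, just 18 posts fall into that class. One option is to treat a small band around zero as neutral. The sketch below shows the idea; `label_polarity` is a hypothetical helper, and the threshold of 0.05 is an illustrative assumption that should be tuned rather than taken as a recommended value.

```python
def label_polarity(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a label, with a neutral band around 0.

    Scores within +/- threshold (inclusive) count as Neutral.
    """
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


# Example scores (made up) showing how the band behaves
for score in (0.30, 0.02, 0.0, -0.01, -0.40):
    print(score, label_polarity(score))
```

In the notebook this could replace the comparison inside `get_sentiment`, with `TextBlob(text).sentiment.polarity` passed in as `score`.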


### Analyze Sentiment by Category

In [7]:
# Analyze sentiment distribution within each category
sentiment_by_category = pd.crosstab(df['Labels'], df['Sentiment'], normalize='index')
print(sentiment_by_category)


Sentiment                 Negative  Neutral  Positive
Labels                                               
alt.atheism               0.286000    0.000  0.714000
comp.graphics             0.262000    0.001  0.737000
comp.os.ms-windows.misc   0.247000    0.000  0.753000
comp.sys.ibm.pc.hardware  0.252000    0.002  0.746000
comp.sys.mac.hardware     0.274000    0.000  0.726000
comp.windows.x            0.284000    0.005  0.711000
misc.forsale              0.229000    0.000  0.771000
rec.autos                 0.256000    0.002  0.742000
rec.motorcycles           0.346000    0.000  0.654000
rec.sport.baseball        0.308000    0.001  0.691000
rec.sport.hockey          0.347000    0.001  0.652000
sci.crypt                 0.262000    0.000  0.738000
sci.electronics           0.251000    0.000  0.749000
sci.med                   0.282000    0.002  0.716000
sci.space                 0.277000    0.001  0.722000
soc.religion.christian    0.232698    0.000  0.767302
talk.politics.guns        0.
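
To summarize a table like this, it helps to rank the categories by their negative share. The sketch below does so for a handful of rows copied from the output above. With the full DataFrame, `sentiment_by_category.sort_values('Negative', ascending=False)` should give the same ordering for all categories.

```python
# Negative shares copied from a few rows of the crosstab output above
negative_share = {
    "alt.atheism": 0.286,
    "misc.forsale": 0.229,
    "rec.motorcycles": 0.346,
    "rec.sport.hockey": 0.347,
    "sci.med": 0.282,
}

# Sort from most negative to least negative
ranked = sorted(negative_share.items(), key=lambda item: item[1], reverse=True)

for category, share in ranked:
    print(f"{category:<20} {share:.1%}")
```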

# Evaluation and Reflection
### Naive Bayes Model Evaluation
We already evaluated the model using accuracy, precision, recall, and F1-score in the classification report. Here we summarize the findings and discuss the model’s performance.

1) Accuracy: The fraction of test posts that the model assigns to the correct category.

2) Precision and Recall: Precision is the fraction of posts predicted as a category that truly belong to it; recall is the fraction of a category's posts that the model finds.

3) F1-Score: Harmonic mean of precision and recall, giving a balanced metric.
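
The three definitions above can be worked through by hand on a small example. The sketch below computes them in plain Python for one category; the label lists are made up for illustration, and `per_class_scores` is a hypothetical helper. scikit-learn's `classification_report` reports the same per-class quantities.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up true and predicted labels for six posts
y_true = ["sci.med", "sci.med", "sci.med", "rec.autos", "rec.autos", "sci.space"]
y_pred = ["sci.med", "sci.med", "rec.autos", "rec.autos", "sci.med", "sci.space"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_scores(y_true, y_pred, "sci.med")

print(f"accuracy={accuracy:.3f}")
print(f"sci.med precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three posts predicted as sci.med are correct (precision 2/3), and two of the three true sci.med posts are found (recall 2/3), so the F1-score is also 2/3.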

### Sentiment Analysis Evaluation
We will summarize the sentiment analysis findings, discussing general trends in sentiment across the different blog categories.


### Naive Bayes Performance:
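Based on our results, Naive Bayes reached an accuracy of 88.85% on the test set. The dataset appears nearly balanced (the category counts shown are 1,000 posts each), so class imbalance does not explain the weaker categories. Among the rows visible in the classification report, alt.atheism (F1 0.75) and comp.sys.ibm.pc.hardware (F1 0.81) score lowest, likely because they share vocabulary with closely related newsgroups.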

### Sentiment Analysis: 
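Sentiment analysis labelled 14,272 posts (about 71%) as positive, 5,707 (about 29%) as negative, and only 18 as neutral. Every category in the visible part of the table is predominantly positive (roughly 65% to 77%). rec.sport.hockey and rec.motorcycles have the largest negative shares (about 35%), while misc.forsale and soc.religion.christian have the smallest (about 23%). Because TextBlob's lexicon-based polarity is applied here to long, already preprocessed posts, these labels are a rough indication of tone rather than a precise measure.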