***Akanksha C. Khandare***

In [None]:
***Assignment_19_

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

***Load dataset***

In [3]:
df = pd.read_csv("blogs.csv")   # load the blog posts dataset
print("Dataset shape:", df.shape)
print("Columns:", df.columns.tolist())
print(df.head())


Dataset shape: (2000, 2)
Columns: ['Data', 'Labels']
                                                Data       Labels
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....  alt.atheism
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...  alt.atheism
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...  alt.atheism
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...  alt.atheism


In [4]:
# Check for missing values
print("\nMissing values:\n", df.isnull().sum())



Missing values:
 Data      0
Labels    0
dtype: int64


In [5]:
# Drop missing rows if any
df.dropna(subset=['Data', 'Labels'], inplace=True)
df.reset_index(drop=True, inplace=True)

***Text preprocessing***

In [6]:
def clean_text(text):
    """Lowercase the text and strip URLs, e-mail addresses, digits, and punctuation."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # remove links
    text = re.sub(r'\S+@\S+', ' ', text)             # remove emails
    text = re.sub(r'[^a-z\s]', ' ', text)            # keep only letters
    text = re.sub(r'\s+', ' ', text).strip()         # remove extra spaces
    return text

In [7]:
df["cleaned"] = df["Data"].apply(clean_text)
print("\nSample cleaned text:\n", df[["Data", "cleaned"]].head())



Sample cleaned text:
                                                 Data  \
0  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...   
1  Newsgroups: alt.atheism\nPath: cantaloupe.srv....   
2  Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...   
3  Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...   
4  Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...   

                                             cleaned  
0  path cantaloupe srv cs cmu edu magnesium club ...  
1  newsgroups alt atheism path cantaloupe srv cs ...  
2  path cantaloupe srv cs cmu edu das news harvar...  
3  path cantaloupe srv cs cmu edu magnesium club ...  
4  xref cantaloupe srv cs cmu edu alt atheism tal...  


***Feature extraction using TF-IDF***

In [8]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1,2))


***Train-test split***

In [10]:
X = df["cleaned"]
y = df["Labels"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


***Model training (Naive Bayes)***

In [12]:
model = Pipeline([
    ('tfidf', vectorizer),
    ('nb', MultinomialNB())])

model.fit(X_train, y_train)



***Predictions and evaluation***

In [13]:
y_pred = model.predict(X_test)

print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))



Accuracy: 0.9175

Classification Report:
                           precision    recall  f1-score   support

             alt.atheism       1.00      0.65      0.79        20
           comp.graphics       0.86      0.95      0.90        20
 comp.os.ms-windows.misc       0.95      1.00      0.98        20
comp.sys.ibm.pc.hardware       0.83      1.00      0.91        20
   comp.sys.mac.hardware       1.00      0.95      0.97        20
          comp.windows.x       1.00      0.85      0.92        20
            misc.forsale       0.95      1.00      0.98        20
               rec.autos       0.91      1.00      0.95        20
         rec.motorcycles       1.00      0.95      0.97        20
      rec.sport.baseball       1.00      1.00      1.00        20
        rec.sport.hockey       1.00      1.00      1.00        20
               sci.crypt       0.95      1.00      0.98        20
         sci.electronics       1.00      0.85      0.92        20
                 sci.med       1

In [14]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)



Confusion Matrix:
 [[13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  6]
 [ 0 19  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  1 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  2  0 17  0  0  0  0  0  1  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  1 19  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0 20  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  1  1  0  0  0  0 17  0  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  0  0  0  0  0  0  0 19  0  0  0  0  0  0]
 [ 0  1  0  0  0  0  0  0  0  0  0  0  0  0 19  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0 

***Simple Sentiment Analysis***

In [15]:
positive_words = set("good great excellent amazing awesome wonderful happy love liked positive best".split())
negative_words = set("bad terrible awful horrible sad hate disliked worst problem negative fail".split())


In [16]:
def simple_sentiment(text):
    """Basic lexicon-based sentiment."""
    tokens = clean_text(text).split()
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    score = pos - neg
    if score > 1:
        return "positive"
    elif score < -1:
        return "negative"
    else:
        return "neutral"

df["Sentiment"] = df["Data"].apply(simple_sentiment)

print("\nSentiment distribution:\n", df["Sentiment"].value_counts())


Sentiment distribution:
 Sentiment
neutral     1675
positive     238
negative      87
Name: count, dtype: int64


In [17]:
# Sentiment per category
print("\nSentiment by Category:\n", df.groupby("Labels")["Sentiment"].value_counts())


Sentiment by Category:
 Labels                    Sentiment
alt.atheism               neutral      81
                          positive     16
                          negative      3
comp.graphics             neutral      89
                          positive      6
                          negative      5
comp.os.ms-windows.misc   neutral      83
                          negative     10
                          positive      7
comp.sys.ibm.pc.hardware  neutral      82
                          negative      9
                          positive      9
comp.sys.mac.hardware     neutral      86
                          negative      8
                          positive      6
comp.windows.x            neutral      88
                          positive      7
                          negative      5
misc.forsale              neutral      83
                          positive     17
rec.autos                 neutral      81
                          positive     13
               

***Summary***

In [18]:
print("\nSummary:")
print("Total posts:", len(df))
print("Categories:", df['Labels'].nunique())
print("Model Accuracy:", round(accuracy_score(y_test, y_pred)*100, 2), "%")


Summary:
Total posts: 2000
Categories: 20
Model Accuracy: 91.75 %


In this assignment, I performed Text Classification using the Naive Bayes algorithm and Sentiment Analysis on blog posts.

***Data Loading and Exploration***
I loaded the dataset blogs.csv, which contains 2,000 blog posts (Data) and their categories (Labels). I checked the dataset's shape and columns, confirmed that neither column has missing values, and dropped incomplete rows as a safeguard.

***Data Preprocessing***
I cleaned the text by converting it to lowercase and removing links, e-mail addresses, punctuation, numbers, and extra spaces. Stopwords are removed later, by the TF-IDF vectorizer (stop_words='english'), rather than in the cleaning function.
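
As a quick illustration of the cleaning steps, the sketch below applies the same regular expressions as `clean_text` to a made-up sentence. The sample string is an assumption for demonstration and does not come from blogs.csv.

```python
import re

def clean_text(text):
    """Lowercase, then strip links, e-mail addresses, and non-letters."""
    text = str(text).lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # remove links
    text = re.sub(r'\S+@\S+', ' ', text)            # remove e-mail addresses
    text = re.sub(r'[^a-z\s]', ' ', text)           # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()        # collapse repeated whitespace
    return text

# Hypothetical input, chosen to exercise every rule above
sample = "Visit http://example.com or mail Bob@Example.org, 3 TIMES!!"
cleaned = clean_text(sample)
print(cleaned)   # visit or mail times
```

Note that the link and the e-mail address disappear entirely, while the digit and the punctuation are replaced by spaces that the last step collapses.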

***Feature Extraction (TF-IDF)***
I used TF-IDF (Term Frequency–Inverse Document Frequency) to convert the cleaned text into numerical features for the model. The vectorizer keeps at most 5,000 features drawn from unigrams and bigrams, with English stopwords removed.
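
To show what a TF-IDF weight looks like, here is a minimal pure-Python sketch on a three-document toy corpus. It assumes the smoothed IDF that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the L2 normalisation that `TfidfVectorizer` also applies. The corpus and the helper names (`df_counts`, `idf`, `tfidf`) are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from blogs.csv)
docs = [
    "graphics card driver",
    "hockey game tonight",
    "graphics driver update",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF (assumed to mirror scikit-learn's default smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf(tokens):
    # Raw term count times IDF; L2 normalisation is omitted in this sketch
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items()):
    print(f"{term:10s} {w:.3f}")
```

In the first document, "card" occurs in only one document and so gets a higher weight than "graphics" or "driver", which each occur in two.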

***Model Building (Naive Bayes Classifier)***
I split the data into training (80%) and test (20%) sets, stratified by label so that each of the 20 categories contributes 20 test posts. Then I trained a Multinomial Naive Bayes classifier, wrapped in a pipeline with the TF-IDF vectorizer, and made predictions on the test data.
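
The same split-then-fit pattern can be sketched on a tiny two-class corpus. The texts and labels below are invented for illustration; only the scikit-learn calls match the ones used in the notebook. Placing the vectorizer inside the pipeline means its vocabulary is learned from the training posts only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 5 posts per class
texts = [
    "new graphics card installed", "driver update for the card",
    "motherboard and memory upgrade", "hard disk controller failed",
    "monitor cable and video card",
    "hockey team won the game", "baseball season starts tonight",
    "the goalie saved the game", "pitcher threw a great game",
    "playoff game went to overtime",
]
labels = ["hardware"] * 5 + ["sport"] * 5

# Stratified 80/20 split: one post of each class lands in the test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),   # fitted on X_train only
    ("nb", MultinomialNB()),
])
toy_model.fit(X_train, y_train)
toy_pred = toy_model.predict(X_test)

print(list(zip(y_test, toy_pred)))
```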

***Model Evaluation***
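I evaluated the model on the held-out test set using accuracy, per-class precision, recall, and F1-score, and a confusion matrix. Overall accuracy was 91.75%. In the part of the report shown, alt.atheism had the lowest recall (0.65): the first row of the confusion matrix shows that 13 of its 20 test posts were classified correctly.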
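
The per-class metrics can be read directly off a confusion matrix. The sketch below uses a hypothetical 3-class matrix (rows are true classes, columns are predicted classes); its first class is built to reproduce the 1.00 / 0.65 / 0.79 pattern reported for alt.atheism, but the matrix itself is not the notebook's output.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
toy_cm = [
    [13, 1, 6],
    [0, 19, 1],
    [0, 0, 20],
]
n = len(toy_cm)

# Accuracy: correctly classified posts (the diagonal) over all posts
total = sum(sum(row) for row in toy_cm)
accuracy = sum(toy_cm[i][i] for i in range(n)) / total

metrics = []
for i in range(n):
    true_count = sum(toy_cm[i])                        # posts that really belong to class i
    pred_count = sum(toy_cm[r][i] for r in range(n))   # posts predicted as class i
    precision = toy_cm[i][i] / pred_count
    recall = toy_cm[i][i] / true_count
    f1 = 2 * precision * recall / (precision + recall)
    metrics.append((precision, recall, f1))
    print(f"class {i}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.4f}")
```

Class 0 is never predicted for a post from another class, so its precision is 1.00, yet 7 of its 20 posts are assigned elsewhere, which pulls recall down to 0.65.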

***Sentiment Analysis***
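I performed a simple lexicon-based sentiment analysis: each post is scored as the number of positive-lexicon words minus the number of negative-lexicon words, and is labelled positive if the score is above 1, negative if it is below -1, and neutral otherwise. Most posts came out neutral (1,675 of 2,000), with 238 positive and 87 negative. I also analyzed how sentiments were distributed across the blog categories.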
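
The effect of the score thresholds is easiest to see on short examples. This sketch uses shortened, hypothetical lexicons and plain whitespace tokenisation instead of `clean_text`, so it is an illustration of the scoring rule rather than the notebook's exact function.

```python
# Shortened lexicons for illustration (the notebook uses longer lists)
positive_words = {"good", "great", "love"}
negative_words = {"bad", "awful", "hate"}

def simple_sentiment(text):
    """Label text by (positive hits - negative hits) with a +/-1 neutral band."""
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    score = pos - neg
    if score > 1:
        return "positive"
    elif score < -1:
        return "negative"
    return "neutral"

examples = ["good great product", "good product", "bad awful hate it"]
results = [simple_sentiment(e) for e in examples]
for text, label in zip(examples, results):
    print(f"{text!r} -> {label}")
```

A single sentiment word is not enough to leave the neutral band: "good product" scores 1 and stays neutral, which helps explain why most posts in the dataset are labelled neutral.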

***Conclusion***
Through this assignment, I learned how to preprocess text data, extract features using TF-IDF, and build a text classification model using Naive Bayes. I also understood how to perform sentiment analysis and interpret model performance metrics.