
TEXT CLASSIFICATION USING NAIVE BAYES AND SENTIMENT ANALYSIS ON BLOG POSTS
Overview
In this assignment, you will work on the "blogs_categories.csv" dataset, which contains blog posts categorized into various themes. Your task will be to build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, you will perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. This assignment will enhance your understanding of text classification, sentiment analysis, and the practical application of the Naive Bayes algorithm in Natural Language Processing (NLP).
Dataset
The provided dataset, "blogs_categories.csv", consists of blog posts along with their associated categories. Each row represents a blog post with the following columns:
•	Text: The content of the blog post. Column name: Data
•	Category: The category to which the blog post belongs. Column name: Labels
Tasks
1. Data Exploration and Preprocessing
•	Load the "blogs_categories.csv" dataset and perform an exploratory data analysis to understand its structure and content.
•	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.
•	Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model, using techniques such as TF-IDF.
2. Naive Bayes Model for Text Classification
•	Split the data into training and test sets.
•	Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. You can use libraries like scikit-learn for this purpose.
•	Train the model on the training set and make predictions on the test set.
3. Sentiment Analysis
•	Choose a suitable library or method for performing sentiment analysis on the blog post texts.
•	Analyze the sentiments expressed in the blog posts and categorize them as positive, negative, or neutral. Consider only the Data column and get the sentiment for each blog.
•	Examine the distribution of sentiments across different categories and summarize your findings.
4. Evaluation
•	Evaluate the performance of your Naive Bayes classifier using metrics such as accuracy, precision, recall, and F1-score.
•	Discuss the performance of the model and any challenges encountered during the classification process.
•	Reflect on the sentiment analysis results and their implications regarding the content of the blog posts.
Submission Guidelines
•	Your submission should include a comprehensive report and the complete codebase.
•	Your code should be well-documented and include comments explaining the major steps.
Evaluation Criteria
•	Correct implementation of data preprocessing and feature extraction.
•	Accuracy and robustness of the Naive Bayes classification model.
•	Depth and insightfulness of the sentiment analysis.
•	Clarity and thoroughness of the evaluation and discussion sections.
•	Overall quality and organization of the report and code.
Good luck, and we look forward to your insightful analysis of the blog posts dataset!



In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('/content/blogs.csv')
#df=pd.read_csv('/content/blogs.csv',sep='\t',header=None,names=['Data',])
df.head(1)

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism


In [None]:
df.Data.iloc[0]

'Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!howland.reston.ans.net!agate!doc.ic.ac.uk!uknet!mcsun!Germany.EU.net!thoth.mchp.sni.de!horus.ap.mchp.sni.de!D012S658!frank\nFrom: frank@D012S658.uucp (Frank O\'Dwyer)\nNewsgroups: alt.atheism\nSubject: Re: islamic genocide\nDate: 23 Apr 1993 23:51:47 GMT\nOrganization: Siemens-Nixdorf AG\nLines: 110\nDistribution: world\nMessage-ID: <1r9vej$5k5@horus.ap.mchp.sni.de>\nReferences: <1r4o8a$6qe@fido.asd.sgi.com> <1r5ubl$bd6@horus.ap.mchp.sni.de> <1r76ek$7uo@fido.asd.sgi.com>\nNNTP-Posting-Host: d012s658.ap.mchp.sni.de\n\nIn article <1r76ek$7uo@fido.asd.sgi.com> livesey@solntze.wpd.sgi.com (Jon Livesey) writes:\n#In article <1r5ubl$bd6@horus.ap.mchp.sni.de>, frank@D012S658.uucp (Frank O\'Dwyer) writes:\n#|> In article <1r4o8a$6qe@fido.asd.sgi.com> livesey@solntze.wpd.sgi.com (Jon Livesey) writes:\n#|> #\n#|> #Noting that a particular society, in this case the mainland UK,

In [None]:
df.shape

(2000, 2)

In [None]:
import re
import spacy
nlp=spacy.load('en_core_web_sm')

In [None]:
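def data_clean(text_data):
  """Tokenise a post and return its lemmas, without stop words, punctuation or numbers."""
  # Keep only runs of word characters; the raw string avoids an invalid-escape warning for \w.
  text_data = ' '.join(re.findall(r'\w+', text_data))
  doc = nlp(text_data)
  clean_text = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct
                and not token.is_digit and not token.is_bracket and not token.is_currency]
  return clean_text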
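
To see what the `re.findall` step in `data_clean` does before spaCy runs, here is a small standard-library sketch. The sample string is adapted from the header of the first post, and the helper name `word_runs` is hypothetical.

```python
import re

def word_runs(text):
    """Return the runs of word characters (letters, digits, underscore) in text."""
    return re.findall(r'\w+', text)

sample = "From: frank@D012S658.uucp (Frank O'Dwyer)\nLines: 110"
tokens = word_runs(sample)
print(tokens)
# ['From', 'frank', 'D012S658', 'uucp', 'Frank', 'O', 'Dwyer', 'Lines', '110']
```

Punctuation is dropped and the e-mail address is split into pieces. Pure numbers such as `110` survive this step and are only removed later by the `token.is_digit` filter, while mixed tokens such as `D012S658` are kept.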

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# A callable analyzer replaces scikit-learn's own preprocessing, tokenizing and n-gram steps.
count=CountVectorizer(analyzer=data_clean)


In [None]:
x=count.fit_transform(df.Data)

In [None]:
x.shape

(2000, 46583)

In [None]:
x.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
tfidf=TfidfTransformer()

In [None]:
y=tfidf.fit_transform(x)

In [None]:
y.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
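
The matrix above is mostly zeros, so it says little about what the transformer computes. As a rough sketch, the following pure-Python function mirrors the default `TfidfTransformer` settings (smoothed IDF, then L2 normalisation of each row) on a tiny hand-made count matrix. The function name `tfidf_rows` and the toy counts are made up for illustration.

```python
import math

def tfidf_rows(counts):
    """TF-IDF for a list of equal-length count rows (documents x terms)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in counts]
    # L2-normalise each row; an all-zero row stays all zeros.
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

toy_counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
toy_tfidf = tfidf_rows(toy_counts)
for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

The first term occurs in every toy document, so its IDF is exactly 1, while the rarer terms are weighted up before each row is scaled to unit length.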

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
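# Fit on the full dataset first; this only gives a training-set baseline.
# Note: `y` here is the TF-IDF feature matrix, not the target labels.
multi=MultinomialNB()
multi.fit(y,df.Labels)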
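
For intuition about what a multinomial Naive Bayes model estimates, here is a minimal pure-Python sketch with add-one (Laplace) smoothing on a made-up toy corpus. It works on raw token counts, whereas the notebook feeds TF-IDF weights to scikit-learn. The helper names `fit_nb` and `predict_nb` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenised docs; return log-priors and log-likelihoods."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        # Smoothed denominator: all tokens seen in class c plus alpha per vocabulary entry.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Return the class with the highest log-posterior for a tokenised doc."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
    return max(scores, key=scores.get)

toy_docs = [
    ["goal", "ice", "team"],
    ["puck", "ice", "goal"],
    ["orbit", "launch", "moon"],
    ["moon", "rocket", "orbit"],
]
toy_labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.space", "sci.space"]
model = fit_nb(toy_docs, toy_labels)
prediction = predict_nb(model, ["ice", "goal", "rocket"])
print(prediction)  # rec.sport.hockey
```

Smoothing keeps the unseen word `rocket` from zeroing out the hockey score, so the two hockey words still decide the prediction.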

In [None]:
y_pred=multi.predict(y)
y_pred

array(['alt.atheism', 'alt.atheism', 'alt.atheism', ...,
       'talk.religion.misc', 'talk.religion.misc', 'talk.religion.misc'],
      dtype='<U24')

In [None]:
# Accuracy on the same rows the model was trained on, so this is an optimistic estimate.
accuracy_score(df.Labels,y_pred)

0.984

In [None]:
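# Hold out 20% of the posts for testing. No random_state is set, so the split and the scores below change between runs.
x_train,x_test,y_train,y_test=train_test_split(df.Data,df.Labels,test_size=0.2)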

In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(1600,)
(400,)
(1600,)
(400,)


In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipeline=Pipeline([('count',CountVectorizer(analyzer=data_clean)),
                   ('tfidf',TfidfTransformer()),
                   ('multi',MultinomialNB())])

In [None]:
pipeline.fit(x_train,y_train)

In [None]:
y_pred=pipeline.predict(x_test)

In [None]:
# Accuracy on the held-out test set.
accuracy_score(y_test,y_pred)

0.805
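
A single accuracy figure hides how individual categories behave. As a sketch of how per-class precision, recall and F1 are derived from counts, the following uses hypothetical counts that are consistent with the `rec.sport.hockey` row of the classification report further down. The function name is made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for one class: 15 posts found, 5 false alarms, none missed.
p, r, f = precision_recall_f1(tp=15, fp=5, fn=0)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.75 1.0 0.86
```

Perfect recall with lower precision means the class absorbs posts that belong to other categories.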

In [None]:
# sentiment analysis

In [None]:
'''def sent_value(text:str=None):
  sent_ct=0
  if text:
    doc=nlp(text)
    for token in doc:

        sent_ct+=df_dic.get(token.lemma_,0)
  return sent_ct'''

In [None]:
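from textblob import TextBlob

def get_sentiment(text):
    """Label a text as positive, negative or neutral from its TextBlob polarity."""
    polarity = TextBlob(text).sentiment.polarity  # float in [-1.0, 1.0]
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    else:
        return "neutral"

# Score the raw post text, Usenet headers included.
df['Sentiment'] = df['Data'].apply(get_sentiment)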

In [None]:
df

Unnamed: 0,Data,Labels,Sentiment
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,positive
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,negative
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,positive
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,positive
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,positive
...,...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc,positive
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,positive
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc,positive
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,positive


In [None]:
# Share of each sentiment label within every category (each row sums to 1).
sentiment_summary = df.groupby('Labels')['Sentiment'].value_counts(normalize=True).unstack()
print(sentiment_summary)

Sentiment                 negative  positive
Labels                                      
alt.atheism                   0.23      0.77
comp.graphics                 0.24      0.76
comp.os.ms-windows.misc       0.22      0.78
comp.sys.ibm.pc.hardware      0.20      0.80
comp.sys.mac.hardware         0.24      0.76
comp.windows.x                0.27      0.73
misc.forsale                  0.16      0.84
rec.autos                     0.17      0.83
rec.motorcycles               0.26      0.74
rec.sport.baseball            0.29      0.71
rec.sport.hockey              0.34      0.66
sci.crypt                     0.19      0.81
sci.electronics               0.19      0.81
sci.med                       0.29      0.71
sci.space                     0.27      0.73
soc.religion.christian        0.13      0.87
talk.politics.guns            0.30      0.70
talk.politics.mideast         0.22      0.78
talk.politics.misc            0.22      0.78
talk.religion.misc            0.14      0.86
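
The table above has no neutral column: `get_sentiment` returns "neutral" only when the polarity is exactly 0, which apparently never happened for these long posts. One possible adjustment, shown here only as a sketch, is to treat small polarity magnitudes as neutral. The function name `label_polarity` and the threshold of 0.05 are assumptions, not values taken from the assignment.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a label, treating small magnitudes as neutral."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

for score in (0.30, 0.02, 0.0, -0.04, -0.25):
    print(score, label_polarity(score))
```

With a dead zone like this, the positive and negative shares per category would shrink and a neutral share would appear. The threshold would need tuning against posts whose sentiment has been checked by hand.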


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

                          precision    recall  f1-score   support

             alt.atheism       0.50      0.56      0.53        18
           comp.graphics       0.92      0.63      0.75        19
 comp.os.ms-windows.misc       0.96      0.92      0.94        25
comp.sys.ibm.pc.hardware       0.82      0.86      0.84        21
   comp.sys.mac.hardware       0.80      0.80      0.80        20
          comp.windows.x       0.68      0.93      0.79        14
            misc.forsale       0.85      0.89      0.87        19
               rec.autos       1.00      0.78      0.88        27
         rec.motorcycles       0.85      0.94      0.89        18
      rec.sport.baseball       0.89      0.80      0.84        20
        rec.sport.hockey       0.75      1.00      0.86        15
               sci.crypt       0.79      1.00      0.88        15
         sci.electronics       0.93      0.59      0.72        22
                 sci.med       1.00      0.80      0.89        20
         