<h1>Sentiment Analysis Classifier</h1>
<p>We will build a sentiment analysis model that classifies a review as positive or negative. We will use the first dataset and the multinomial naive Bayes classifier, with NLTK for text preprocessing.</p>

<h2>Importing the required libraries</h2>
<p>Pandas is used to handle the data. NLTK is used to convert the reviews into a format suitable for training, while scikit-learn is used to vectorize the reviews, train the model, and classify reviews as positive or negative. Pickle is used to save the model.</p>

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import pickle

<h2>Downloading the 'stopwords' package</h2>
<p>The stop words are common words (such as "the", "is", and "and") that will be removed from the reviews before analysis.</p>

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<h2>Importing the data</h2>

In [3]:
df = pd.read_csv('review1.csv')
df.head()

Unnamed: 0,Values,Review
0,1,Despite the fact that I have only played a sma...
1,0,I bought this charger in Jul 2003 and it worke...
2,1,Check out Maha Energy's website. Their Powerex...
3,1,Reviewed quite a bit of the combo players and ...
4,0,I also began having the incorrect disc problem...


<h2>Storing the stop words in a variable</h2>

In [4]:
stop_words = set(stopwords.words('english'))

<h2>Creating a method for preprocessing the text</h2>
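<p>Before analysis, the text must be put into a consistent format. It is converted to lowercase so that the same word is not treated differently depending on capitalization, punctuation and special characters are stripped, and common stop words are removed. Note that <code>nltk.word_tokenize</code> relies on NLTK's Punkt tokenizer data, which must be installed in addition to the stop-word list.</p>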

In [5]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = ''.join([c for c in text if c.isalnum() or c.isspace()])
    # Tokenize and remove stopwords
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)
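
<p>As a quick illustration of what the function does to a single review, the sketch below applies the same three steps to a made-up sentence. To stay runnable without any NLTK downloads, it assumes a tiny hand-picked stop-word set (<code>demo_stop_words</code>) and uses <code>str.split</code> in place of <code>nltk.word_tokenize</code>; both are stand-ins, not the notebook's actual setup.</p>

```python
# Hypothetical stand-in for the NLTK English stop-word list (an assumption for this demo).
demo_stop_words = {"i", "this", "and", "it", "the", "a", "for", "in"}

def preprocess_text_demo(text):
    # Convert to lowercase
    text = text.lower()
    # Keep only letters, digits and whitespace
    text = ''.join(c for c in text if c.isalnum() or c.isspace())
    # Split on whitespace (stand-in for nltk.word_tokenize) and drop stop words
    tokens = [word for word in text.split() if word not in demo_stop_words]
    return ' '.join(tokens)

cleaned = preprocess_text_demo("I bought this charger in Jul 2003, and it worked OK!")
print(cleaned)
```

<p>The cleaned string keeps only the content-bearing words, which is the form the vectorizer receives in the next step.</p>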

<h2>Preprocessing all reviews</h2>

In [6]:
df['Review'] = df['Review'].apply(preprocess_text)

<h2>Converting the reviews into a vectorized format for training and classification</h2>

In [7]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['Review'])
y = df['Values']
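
<p>To make the weighting concrete, here is a minimal pure-Python sketch of the formula that scikit-learn documents for <code>TfidfVectorizer</code> with its default settings (smoothed IDF followed by L2 normalization). The three-document corpus is made up and the helper names <code>idf</code> and <code>tfidf_row</code> are hypothetical; the notebook itself relies on the library implementation.</p>

```python
import math
from collections import Counter

# Made-up, already preprocessed corpus (an assumption for illustration only).
corpus = ["great charger works great", "charger stopped working", "great player"]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_row(tokens):
    # Raw term counts times IDF, then scale the row to unit Euclidean length.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row = tfidf_row(docs[0])
for term in sorted(row):
    print(f"{term}: {row[term]:.3f}")
```

<p>In the first toy document, "great" gets the largest weight because it occurs twice, while "works" outranks "charger" because it appears in fewer documents.</p>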

<h2>Dividing the data into training and testing data</h2>

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
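
<p>One caveat with the cells above: the vectorizer is fitted on every review before the split, so the vocabulary and IDF weights have already seen the test reviews. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn <code>Pipeline</code>, which fits TF-IDF on the training portion only. The sketch below shows the idea on a made-up eight-review dataset; the data and the <code>toy_</code> variable names are assumptions, and the scores reported later in this notebook were not produced this way.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already preprocessed reviews and labels (1 = positive, 0 = negative).
toy_texts = [
    "great product works well", "love great quality", "excellent value great buy",
    "works perfectly love", "terrible product broke", "waste money broke quickly",
    "awful quality terrible", "stopped working waste",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test reviews stay unseen during fitting.
toy_text_train, toy_text_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("nb", MultinomialNB()),
])
# The TF-IDF vocabulary is learned from toy_text_train only.
toy_pipeline.fit(toy_text_train, toy_y_train)

toy_predictions = toy_pipeline.predict(toy_text_test)
print(len(toy_text_train), len(toy_text_test), toy_predictions.shape)
```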

<h2>Using the Multinomial Naive Bayes Classifier</h2>
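<p><code>MultinomialNB</code> is scikit-learn's implementation of the multinomial naive Bayes algorithm, which is commonly used for text classification. It treats the features of each review as independent given the class and picks the class with the highest resulting probability.</p>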

In [9]:
model = MultinomialNB()
model.fit(X_train, y_train)
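
<p>For intuition, the sketch below implements the core of multinomial naive Bayes in plain Python on word counts: each class gets a log prior plus a sum of log word likelihoods with add-one (Laplace) smoothing, and the class with the highest score wins. The training sentences and function names are made up; <code>MultinomialNB</code> applies the same idea to the TF-IDF features, with a default smoothing parameter of <code>alpha=1.0</code>.</p>

```python
import math
from collections import Counter, defaultdict

# Made-up training data (1 = positive, 0 = negative), for illustration only.
toy_train = [
    ("great product love", 1),
    ("love great quality", 1),
    ("terrible product broke", 0),
    ("broke terrible waste", 0),
]

class_counts = Counter(label for _, label in toy_train)
word_counts = defaultdict(Counter)
for text, label in toy_train:
    word_counts[label].update(text.split())
vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(tokens, label):
    # log P(class) + sum of log P(word | class) with add-one smoothing
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(toy_train))
    for word in tokens:
        if word in vocabulary:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
    return score

def predict_toy(text):
    tokens = text.split()
    return max(class_counts, key=lambda label: log_score(tokens, label))

print(predict_toy("great quality"), predict_toy("product broke"))
```

<p>Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long reviews.</p>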

<h2>Using the testing data to evaluate the accuracy of the model</h2>

In [10]:
y_pred = model.predict(X_test)

In [11]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

In [12]:
print(accuracy)
print(report)

0.8181875
              precision    recall  f1-score   support

           0       0.82      0.82      0.82     40086
           1       0.82      0.82      0.82     39914

    accuracy                           0.82     80000
   macro avg       0.82      0.82      0.82     80000
weighted avg       0.82      0.82      0.82     80000
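
<p>The columns of the report can be reproduced by hand from the counts of correct and incorrect predictions. The sketch below does this for the positive class on a made-up set of ten labels; the values are illustrative and unrelated to the scores above.</p>

```python
# Made-up true labels and predictions (1 = positive, 0 = negative).
y_true_demo = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives correctly found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives flagged as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives that were missed

accuracy_demo = sum(1 for t, p in pairs if t == p) / len(pairs)
precision_demo = tp / (tp + fp)  # share of predicted positives that are truly positive
recall_demo = tp / (tp + fn)     # share of true positives that were found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(accuracy_demo, precision_demo, recall_demo, f1_demo)
```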



<h2>Saving the model using the pickle package</h2>

In [13]:
with open('model.pkl','wb') as f:
    pickle.dump(model,f)
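
<p>Note that <code>model.pkl</code> contains only the classifier. To score new text later, the fitted <code>tfidf_vectorizer</code> is needed as well, since the model expects the same feature columns it was trained on. One possible approach, sketched here on a made-up toy dataset with an in-memory buffer standing in for a file, is to pickle both objects together and reload them as a pair. All <code>toy_</code> names are hypothetical.</p>

```python
import io
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already preprocessed reviews (1 = positive, 0 = negative).
toy_reviews = ["great product love", "love great quality",
               "terrible product broke", "broke terrible waste"]
toy_targets = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_reviews), toy_targets)

# Save the vectorizer and the model together; a BytesIO buffer stands in for a file.
buffer = io.BytesIO()
pickle.dump({"vectorizer": toy_vectorizer, "model": toy_model}, buffer)

# Later: load both and classify new (already preprocessed) text.
buffer.seek(0)
saved = pickle.load(buffer)
new_features = saved["vectorizer"].transform(["great quality"])
new_prediction = saved["model"].predict(new_features)
print(new_prediction)
```

<p>New reviews should go through the same <code>preprocess_text</code> function before being passed to the loaded vectorizer, so that they match the text the model was trained on.</p>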