## TF-IDF: Term Frequency - Inverse Document Frequency (tf-idf = tf * idf) ##
TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is to the document.
`
t is the term (word)
d is a document
D is the set of all documents (the corpus)
N = |D| is the total number of documents
|{ d ∈ D : t ∈ d }| denotes the number of documents in which t occurs

tf-idf(t, d, D) = tf(t, d) * idf(t, D)

Term frequency = count(t, d), i.e. the count of term t in document d
Normalized term frequency = count(t, d) / total number of terms in d
Logarithmic term frequency = 1 + log10(count(t, d)) if count(t, d) > 0, else 0
idf(t, D) = log( N / |{ d ∈ D : t ∈ d }| )
`
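
The definitions above can be checked by hand on a small example. The following is a minimal sketch in plain Python, assuming a made-up three-document corpus, whitespace tokenization, the normalized term frequency, and a base-10 logarithm for idf (the formula above does not fix the base). The function names are illustrative and not part of any library.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only); each document is a list of tokens.
corpus = [
    "the movie was good".split(),
    "the movie was bad".split(),
    "good acting good plot".split(),
]

def term_frequency(term, doc):
    """Normalized term frequency: count(t, d) / total number of terms in d."""
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    """idf(t, D) = log10(N / number of documents containing t). Base 10 is an assumption."""
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log10(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """tf-idf(t, d, D) = tf(t, d) * idf(t, D)."""
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# Score a few terms against the third document.
for term in ("the", "good", "plot"):
    print(term, round(tf_idf(term, corpus[2], corpus), 4))
```

In this toy corpus "plot" scores higher than "good" for the third document even though "good" occurs twice there, because "good" also appears in another document and so has a lower idf.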

In [13]:
# import modules
import pandas as pd  # pandas to deal with tabular data
import numpy as np  # numpy for number crunching
from sklearn import metrics  # evaluation utilities such as accuracy_score and classification_report
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib


In [14]:
# load data from csv files
# Dataset used: IMDb Movie Reviews Dataset https://www.kaggle.com/mantri7/imdb-movie-reviews-dataset?select=train_data+%281%29.csv
# Columns:
# '0' is the user review text, '1' is the label (0 = bad, 1 = good)
train_data_from_csv = pd.read_csv('train_data.csv')
test_data_from_csv = pd.read_csv('test_data.csv')

      
X_train = train_data_from_csv['0']
y_train = train_data_from_csv['1']
X_test = test_data_from_csv['0']
y_test = test_data_from_csv['1']


In [15]:
# using CountVectorizer
tf_vectorizer = CountVectorizer()
# fit extracts all the unique words from the data, i.e. the vocabulary
# transform builds the term frequency matrix of the data over that vocabulary
# we could call fit and then transform, but fit_transform does both steps in a single statement
X_train_tf = tf_vectorizer.fit_transform(X_train)
# transform the test data into a term frequency matrix
# note: do not call fit on X_test, because we do not want to create a new vocabulary;
# instead we reuse the vocabulary extracted from the training data
X_test_tf = tf_vectorizer.transform(X_test)
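
The cell above builds raw term counts. A TF-IDF weighted matrix can be built the same way with `TfidfVectorizer`, which is already imported in the first cell. The following is a minimal sketch on made-up toy sentences rather than the IMDb data; the variable names `train_docs`, `test_docs`, and `tfidf_vectorizer` are assumptions for illustration. Note that scikit-learn's defaults use a smoothed, natural-log idf and L2-normalize each row, so the values will not match the plain formula given at the top.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical); in the notebook these would be X_train and X_test.
train_docs = ["the movie was good", "the movie was bad", "good acting good plot"]
test_docs = ["a surprisingly good movie"]

tfidf_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and idf weights from the training documents
train_tfidf = tfidf_vectorizer.fit_transform(train_docs)
# transform reuses them; words not seen during fit ("surprisingly") are ignored
test_tfidf = tfidf_vectorizer.transform(test_docs)

print(sorted(tfidf_vectorizer.vocabulary_))
print(train_tfidf.shape, test_tfidf.shape)
```

Both matrices have the same number of columns, one per vocabulary term learned from the training documents, which is what lets a classifier trained on one be applied to the other.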

In [16]:
# build naive bayes classification model
nbclf = MultinomialNB()
# train model 
nbclf.fit(X_train_tf, y_train)

MultinomialNB()

In [17]:
# predict the labels for the test data (unseen data)
y_pred = nbclf.predict(X_test_tf)
# find the accuracy of the model
score = metrics.accuracy_score(y_test, y_pred)
print('------------------------------')
print("accuracy:   %0.3f" % score)
print('------------------------------')
print(metrics.classification_report(y_test, y_pred, target_names=['Bad', 'Good']))
print('------------------------------')
print(nbclf.predict(tf_vectorizer.transform(["bad"])))


------------------------------
accuracy:   0.814
------------------------------
              precision    recall  f1-score   support

         Bad       0.78      0.88      0.82     12500
        Good       0.86      0.75      0.80     12500

    accuracy                           0.81     25000
   macro avg       0.82      0.81      0.81     25000
weighted avg       0.82      0.81      0.81     25000

------------------------------
[0]


In [18]:
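# save model to file
joblib.dump(nbclf, 'nbclf.joblib')
# load model from file
model_loaded_from_file = joblib.load('nbclf.joblib')
# note: only the classifier is saved here; the fitted tf_vectorizer must still be
# in memory (or be saved separately) to vectorize new text
print(model_loaded_from_file.predict(tf_vectorizer.transform(["good"])))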


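
A saved classifier is only usable together with the vectorizer it was trained with. One way to keep the two in sync is to bundle them in a scikit-learn `Pipeline` and persist that single object. The following is a minimal sketch under stated assumptions: the training sentences and labels are made up, and the file name `review_pipeline.joblib` is hypothetical and written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical); in the notebook these would be X_train and y_train.
docs = ["good movie", "great acting", "bad movie", "terrible plot"]
labels = [1, 1, 0, 0]

# The pipeline bundles the vectorizer and the classifier into one estimator,
# so the learned vocabulary is saved together with the model.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(docs, labels)

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "review_pipeline.joblib")
joblib.dump(pipeline, path)

# The loaded pipeline accepts raw text directly; no separate transform call is needed.
loaded = joblib.load(path)
print(loaded.predict(["good", "bad"]))
```

With this layout a fresh session only needs `joblib.load` and raw strings to make predictions.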