## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. On their own, these vectors may not be very helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate a term frequency is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in a large document can be compared to that of a smaller document.
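As a minimal sketch of that calculation, the snippet below computes relative term frequencies for one short document using only the standard library. The sample sentence and the helper name `term_frequencies` are made up for illustration.

```python
from collections import Counter

def term_frequencies(text):
    """Return {word: occurrences / total words} for a whitespace-split text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

tf = term_frequencies("the cat sat on the mat")
print(tf["the"])  # "the" accounts for 2 of the 6 words
print(tf["cat"])  # "cat" accounts for 1 of the 6 words
```

Because each value is a share of the document's length, the frequencies of one document always sum to 1, whatever its size.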

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).
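A rough sketch of inverse document frequency on a toy corpus, using the plain `log(N / df)` form. The four documents and the function name are hypothetical, and libraries often use smoothed variants of this formula.

```python
import math

# Hypothetical toy corpus; each document is already reduced to a set of words
docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"a", "bird", "sang"},
]

def inverse_document_frequency(word, documents):
    """log(N / df), where df is the number of documents containing the word."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df)

idf_the = inverse_document_frequency("the", docs)    # in 3 of 4 documents
idf_bird = inverse_document_frequency("bird", docs)  # in 1 of 4 documents
print(idf_the, idf_bird)
```

The rarer word gets the larger value. Note that this bare version divides by zero for a word that appears in no document, which is one reason smoothed formulas exist.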

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
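One way to put the two quantities together is to multiply them. The sketch below does this for a made-up three-document corpus; the helper `tf_idf` is illustrative and uses the unsmoothed `log(N / df)` form.

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]

def tf_idf(word, doc_tokens, all_docs):
    """Term frequency in one document times log(N / document frequency)."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for tokens in all_docs if word in tokens)
    idf = math.log(len(all_docs) / df)
    return tf * idf

score_the = tf_idf("the", tokenized[0], tokenized)  # word found in every document
score_mat = tf_idf("mat", tokenized[0], tokenized)  # word found only in the first document
print(score_the, score_mat)
```

A word that occurs in every document scores zero even though it is the most frequent word in the first document, while the rarer "mat" gets a positive score.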

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `cat` in place of both `cat` and `cats`. This will shrink our vocab array and improve performance.
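To illustrate how both ideas shrink the vocabulary, here is a deliberately naive sketch. The stop-word set is a tiny made-up list, and `crude_stem` only strips a trailing "s"; real stemmers and lemmatizers are far more careful.

```python
# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "and", "a", "on"}

def crude_stem(word):
    """Strip a trailing 's' from words longer than three letters (illustration only)."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

def build_vocab(text):
    """Lowercase, drop stop words, stem, and return the sorted unique terms."""
    words = text.lower().split()
    return sorted({crude_stem(w) for w in words if w not in STOP_WORDS})

text = "the cat and the cats sat on a mat"
raw_vocab = sorted(set(text.split()))
vocab = build_vocab(text)
print(len(raw_vocab), raw_vocab)
print(len(vocab), vocab)
```

Here `cat` and `cats` collapse into a single entry and the stop words disappear, so the vocabulary drops from eight entries to three.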

## Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.
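The difference between whitespace splitting and a slightly smarter tokenizer can be seen in a small sketch. The regular expression below is only one simple choice, not what any particular NLP library does.

```python
import re

text = "Ok lar... Joking wif u oni..."

# Whitespace splitting leaves punctuation attached to the words
split_tokens = text.split()

# Runs of word characters, or single punctuation marks, become separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(split_tokens)
print(regex_tokens)
```

With `.split()` the token `lar...` carries its dots along, so it would not match a plain `lar` elsewhere in the corpus; the regex version separates the word from the punctuation.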

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.

<br>
____________________________________________________________________________________________________________________________


# Feature Extraction from Text
In the **Scikit-learn Primer** lecture we applied a simple SVC classification model to the SMSSpamCollection dataset. We tried to predict the ham/spam label based on message length and punctuation counts. In this section we'll actually look at the text of each message and try to perform a classification based on content. We'll take advantage of some of scikit-learn's [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) tools.

## Load a dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv(r"C:\Users\HARDIK\NLP END TO END\NLP_COURSE_HELP\TextFiles\smsspamcollection.tsv", sep='\t')
df.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [3]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [4]:
# Check for messages that consist of a single space (a crude test for blank entries)

(df['message'] == ' ').sum()

0

In [5]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
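The label counts show that the classes are imbalanced, which matters when reading accuracy figures later. A quick back-of-the-envelope check, using only the two counts printed above:

```python
# Counts copied from the value_counts() output above
ham, spam = 4825, 747
total = ham + spam

spam_share = spam / total
baseline_accuracy = ham / total  # accuracy of a model that always predicts "ham"

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any classifier should therefore be judged against the accuracy of roughly 87% that a trivial "always ham" rule would reach.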

In [6]:
X = df['message']     # uppercase X: by convention, the features (here, the raw message text)
y = df['label']       # lowercase y: by convention, the 1-D array of target labels

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
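The sizes of the two splits can be worked out by hand. The sketch below assumes that, for a fractional `test_size`, the test count is rounded up and the remainder goes to training; the total row count is taken from the label counts shown earlier.

```python
import math

n_samples = 4825 + 747   # ham + spam rows in the dataset
test_size = 0.2

# Assumption: a float test_size is converted to a row count by rounding up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

These figures agree with the shapes and support counts reported further down in the notebook.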

<br>
____________________________________________________________________________________________________________________________

### CountVectorization

- Text preprocessing
- Tokenizing
- Ability to filter out Stop words.

All of this is built into `CountVectorizer`, which builds a dictionary of features and transforms documents into feature vectors.
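To see what that dictionary and those vectors look like, here is a small sketch on a made-up two-message corpus with the default `CountVectorizer` settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ["Free entry to win a prize", "Win a free prize today"]

vectorizer = CountVectorizer()            # lowercases and tokenizes by default
counts = vectorizer.fit_transform(docs)   # sparse matrix: documents x vocabulary

vocab_terms = sorted(vectorizer.vocabulary_)
print(vocab_terms)        # learned terms, in column order
print(counts.toarray())   # a dense view is fine for a toy corpus
```

With the default token pattern, single-character tokens such as "a" are dropped, so they never reach the vocabulary. Passing `stop_words='english'` would additionally filter common English words.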

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

First we fit the vectorizer to the data (build a vocabulary and count the words), then we transform the original text messages into vectors. `fit_transform` performs both steps in a single call.

In [8]:
X_train_counts = count_vect.fit_transform(X_train)

X_train_counts

<4457x7702 sparse matrix of type '<class 'numpy.int64'>'
	with 59296 stored elements in Compressed Sparse Row format>

We can't view it directly because it is a very large sparse matrix: almost all of its entries are zero.
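How sparse the matrix is can be estimated from the summary printed above. This is plain arithmetic on those three numbers:

```python
# Numbers copied from the sparse-matrix summary above
n_rows, n_cols = 4457, 7702
stored = 59296

density = stored / (n_rows * n_cols)
avg_terms_per_message = stored / n_rows

print(f"density: {density:.4%}")
print(f"average distinct terms per message: {avg_terms_per_message:.1f}")
```

Fewer than 0.2% of the cells hold a value, which is why a compressed sparse format is used instead of a dense array.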

In [9]:
X_train.shape

(4457,)

In [10]:
X_train_counts.shape

(4457, 7702)

### Transform the counts to frequencies with TF-IDF

Next we convert the raw counts to tf-idf weights with `TfidfTransformer`. After that we will combine both steps with `TfidfVectorizer`, train a classifier, and build a pipeline.

Tf-idf weighting gives the more informative words more weight.

***TfidfTransformer***

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [12]:
X_train_tfidf.shape

(4457, 7702)

The shape is the same, but the values are no longer just counts. Instead, each term frequency has been multiplied by its inverse document frequency.
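For readers who want to see the arithmetic, the sketch below reproduces by hand the weighting that scikit-learn documents for `TfidfTransformer` with its default settings: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The small count matrix is made up.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, as documented for the default smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(vec):
    """Scale a row to unit Euclidean length (rows of zeros are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

tfidf = [l2_normalize([row[j] * idf[j] for j in range(n_terms)]) for row in counts]

for row in tfidf:
    print([round(v, 3) for v in row])
```

The first term appears in every document and so receives the minimum idf of 1, while the two rarer terms are weighted up before each row is scaled to unit length.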

<br>
_____________________________________________________________________________________________________________________________


<b>Running `CountVectorizer` followed by `TfidfTransformer` is so common that scikit-learn provides `TfidfVectorizer`, which combines the two previous steps into a single step.</b>

In [13]:
# CountVectorizer + TfidfTransformer = TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_Vectorizer = TfidfVectorizer()

X_train_tfidf_Vectorizer = tfidf_Vectorizer.fit_transform(X_train)

In [14]:
X_train_tfidf_Vectorizer.shape

(4457, 7702)
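A matching shape does not by itself show that the two routes give the same numbers. A quick sanity check on a made-up toy corpus, with default parameters everywhere, might look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
docs = ["free prize win now", "are you free today", "win a prize today"]

# Route 1: counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: both steps in one object
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The comparison only holds when both routes use the same settings; changing an option on one side (for example `stop_words`) would need the same change on the other.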

### Creating a classifier

In [15]:
from sklearn.svm import LinearSVC

clf = LinearSVC()

clf.fit(X_train_tfidf_Vectorizer, y_train)

LinearSVC()

### Creating a PIPELINE

This combines everything together, which means we don't have to repeat all of the above steps on the test set (`X_test`).

In [16]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [17]:
text_clf.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
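To make the saving concrete, the sketch below compares the manual route with the pipeline on a made-up toy dataset. The texts, labels, and the fixed `random_state` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data
train_texts = [
    "win a free prize now",
    "free cash prize",
    "see you at lunch",
    "are we meeting today",
]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["free prize waiting", "lunch today"]

# Manual route: the fitted vectorizer must be reused with transform(),
# never refitted on the new texts
vec = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(new_texts))

# Pipeline route: the same two steps behind a single fit/predict interface
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

print(list(manual_pred) == list(pipe_pred))
```

Besides being shorter, the pipeline removes the risk of accidentally calling `fit_transform` on the test data, which would build a different vocabulary.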

In [18]:
pred = text_clf.predict(X_test)

In [19]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(accuracy_score(y_test,pred))
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))

0.9919282511210762
[[964   2]
 [  7 142]]
              precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       0.99      0.95      0.97       149

    accuracy                           0.99      1115
   macro avg       0.99      0.98      0.98      1115
weighted avg       0.99      0.99      0.99      1115
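The figures in the report can be recomputed from the confusion matrix alone. The sketch below assumes the usual scikit-learn layout, with rows as true labels, columns as predicted labels, and labels in sorted order (ham, then spam).

```python
# Confusion matrix copied from the output above
# Rows = true label, columns = predicted label, order = (ham, spam)
cm = [[964, 2],
      [7, 142]]

tn, fp = cm[0]   # true ham predicted as ham / predicted as spam
fn, tp = cm[1]   # true spam predicted as ham / predicted as spam

accuracy = (tn + tp) / (tn + fp + fn + tp)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.4f}")
print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall: {spam_recall:.2f}")
```

Read this way, the model wrongly flagged 2 legitimate messages and let 7 spam messages through, which is why spam recall is the weakest number in the report.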



In [20]:
text_clf.predict(["Hey how are you doing today?"])

array(['ham'], dtype=object)

In [21]:
text_clf.predict(["Dear Walmart shopper, your purchase last month won a %1000 Walmart Gift Card, go to www.xyz.com within 24 hours to claim. (NO2cancel)"])

array(['spam'], dtype=object)

In [26]:
import pickle

with open('sms_classification_nlp', 'wb') as f:
    pickle.dump(text_clf, f)
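A saved pipeline can be restored with the matching `pickle` call. The sketch below does the round trip in memory on a small stand-in pipeline; `toy_clf` and its training texts are hypothetical and only keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the trained text_clf pipeline
toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
toy_clf.fit(
    ["win a free prize now", "free cash prize", "see you at lunch", "are we meeting today"],
    ["spam", "spam", "ham", "ham"],
)

# Serialize to bytes and restore; pickle.dump / pickle.load do the same via a file
restored = pickle.loads(pickle.dumps(toy_clf))

sample = ["free prize today"]
same_prediction = list(restored.predict(sample)) == list(toy_clf.predict(sample))
print(same_prediction)
```

For the file written above, the equivalent would be opening `'sms_classification_nlp'` in `'rb'` mode and calling `pickle.load(f)`. Only unpickle files from a trusted source, and ideally with the same scikit-learn version that created them.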