## Text Feature Extraction

Steps in a basic natural language processing workflow:

    1. Build a vocabulary - assign a unique id to each word

    2. Feature extraction - count how often each vocabulary word occurs in each document and store the counts in a sparse matrix*

    3. Bag of Words and TF-IDF

    4. Stopword removal

    5. Tagging

*Since each document vector consists mostly of 0s, the resulting matrix is called a sparse matrix.
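
Steps 1 and 2 can be sketched on a tiny corpus. The three sentences below are made up for illustration and are not taken from the SMS dataset; `CountVectorizer` with default settings is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not from the SMS dataset)
docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse document-term count matrix

# Step 1: the vocabulary maps each word to a unique column id
vocab = vect.vocabulary_
print(sorted(vocab, key=vocab.get))  # words listed in column order

# Step 2: one row per document, one column per vocabulary word
print(counts.shape)
print(counts.toarray())

# Fraction of entries that are zero
sparsity = 1.0 - counts.nnz / (counts.shape[0] * counts.shape[1])
print(f"sparsity: {sparsity:.2f}")
```

Even with three short documents, more than half of the entries are zero. With thousands of messages and thousands of words the proportion is far higher, which is why the counts are stored in a sparse format.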

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the tab-separated SMS dataset and count missing values per column
df = pd.read_csv('smsspamcollection.tsv', sep='\t')
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [5]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

### Feature Extraction Code Along

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

X = df['message']
y = df['label']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=42)
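
As a quick check on the split sizes, the sketch below applies the same `test_size` to a stand-in list with as many items as the dataset has messages (4825 ham + 747 spam). The list of integers is only a placeholder for the real messages.

```python
from sklearn.model_selection import train_test_split

n_messages = 4825 + 747  # ham + spam counts reported by value_counts()
indices = list(range(n_messages))  # placeholder for the real messages

train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=42)

# The test share is rounded up, and the rest goes to training
print(len(train_idx), len(test_idx))
```

The resulting sizes should agree with the 3733 training rows and the 1839 test rows that appear in the cell outputs of this notebook.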

In [9]:
count_vect = CountVectorizer()

# Option 1: fit the vectorizer to the training text (build the vocabulary, count words),
# then transform the text into a count matrix in a second step:
#   count_vect.fit(X_train)
#   X_train_counts = count_vect.transform(X_train)

# Option 2: fit_transform does both steps in one call:
X_train_counts = count_vect.fit_transform(X_train)
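
The difference between `fit`, `transform`, and `fit_transform` can be sketched on a toy example. The sentences are invented for illustration. The point is that the vocabulary is learned from the training text only, so unseen words in later text are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data (not from the SMS dataset)
train_docs = ["win a free prize", "are you free tonight"]
test_docs = ["free pizza tonight"]

# Two-step version: learn the vocabulary, then count
vect = CountVectorizer()
vect.fit(train_docs)
train_counts = vect.transform(train_docs)

# One-step version on a fresh vectorizer gives the same matrix
one_step_counts = CountVectorizer().fit_transform(train_docs)
print((train_counts != one_step_counts).nnz == 0)

# New text is only transformed, never fitted:
# "pizza" is not in the learned vocabulary, so it is dropped
test_counts = vect.transform(test_docs)
print(test_counts.shape)
print(test_counts.toarray())
```

This is why the test set should be passed to `transform` only: refitting on it would build a different vocabulary, and the columns would no longer line up with the training matrix.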


In [10]:
print(X_train_counts)
print(X_train.shape)

# X_train_counts is a sparse 3733 x 7082 matrix (messages x vocabulary words); its 3733 rows match the length of X_train

  (0, 7069)	1
  (0, 4415)	1
  (0, 1736)	1
  (1, 5512)	1
  (1, 4270)	1
  (1, 5443)	1
  (1, 6330)	2
  (1, 5791)	1
  (1, 878)	1
  (1, 3008)	1
  (1, 2156)	2
  (1, 5468)	1
  (1, 935)	1
  (1, 2677)	1
  (1, 2566)	1
  (1, 5437)	1
  (1, 6247)	1
  (1, 3280)	1
  (1, 7048)	1
  (1, 957)	1
  (1, 6250)	1
  (1, 4470)	1
  (1, 938)	1
  (1, 5243)	1
  (1, 4513)	1
  :	:
  (3728, 6913)	1
  (3728, 2287)	1
  (3728, 3769)	1
  (3729, 1454)	1
  (3729, 5795)	1
  (3729, 3794)	1
  (3729, 3674)	1
  (3730, 5141)	1
  (3730, 3085)	1
  (3730, 2743)	1
  (3730, 4902)	1
  (3730, 5800)	1
  (3730, 5799)	1
  (3731, 6345)	1
  (3731, 4429)	1
  (3731, 5520)	1
  (3731, 3505)	1
  (3732, 3416)	1
  (3732, 3532)	1
  (3732, 2090)	1
  (3732, 5423)	1
  (3732, 3073)	1
  (3732, 6119)	1
  (3732, 5763)	1
  (3732, 4285)	1
(3733,)
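
Each printed entry has the form `(row, column)  count`: the row is the message index, the column is the word's id in the vocabulary, and the value is how often that word occurs in that message. A minimal sketch of turning the column ids back into words, using two made-up sentences rather than the SMS data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not from the SMS dataset)
docs = ["call call now", "see you now"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Invert the vocabulary: column id -> word
index_to_word = {idx: word for word, idx in vect.vocabulary_.items()}

# COO format exposes the (row, column, value) triplets directly
coo = counts.tocoo()
entries = [
    (int(r), index_to_word[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]

for row, word, count in sorted(entries):
    print(f"document {row}: {word!r} x {count}")
```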


In [11]:
# TF-IDF transformation applied to the count-vectorized data
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()
X_train_tfidf = tfidf_trans.fit_transform(X_train_counts)  # Input is X_train_counts because TfidfTransformer only reweights existing counts

# TfidfVectorizer combines count vectorization and the TF-IDF transformation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# Pass the raw text in X_train, since TfidfVectorizer does both the counting and the TF-IDF weighting
X_train_tfidf = vectorizer.fit_transform(X_train)
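
The two routes should produce the same matrix when both use their default settings. A minimal check on a made-up corpus (not the SMS data):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus (not from the SMS dataset)
docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]

# Route 1: counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: both steps in a single object
one_step = TfidfVectorizer().fit_transform(docs)

same = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(same)

# With the default norm='l2', every non-empty row has unit length
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(row_norms)
```

The two-step route is useful when the raw counts are also needed. Otherwise `TfidfVectorizer` is the shorter option.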


In [13]:
# Linear support vector classifier (LinearSVC)
from sklearn.svm import LinearSVC
clf = LinearSVC()

clf.fit(X_train_tfidf,y_train)

# Combine vectorization and classification into a single Pipeline
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_clf.fit(X_train, y_train)  # The pipeline is fit directly on the raw text
predictions = text_clf.predict(X_test)
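
Because the pipeline contains the vectorizer, it accepts raw strings at prediction time as well. The sketch below trains the same two-step pipeline on six made-up messages, so the data and the resulting predictions are purely illustrative and say nothing about performance on the SMS dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data (not from the SMS dataset)
train_msgs = [
    "win a free prize now",
    "free cash call now",
    "claim your free prize",
    "are you coming to lunch",
    "see you at home tonight",
    "call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_clf.fit(train_msgs, train_labels)

# Raw text goes in; the pipeline vectorizes it with the fitted vocabulary
new_msgs = ["free prize waiting, call now", "see you at lunch"]
new_preds = toy_clf.predict(new_msgs)
print(list(zip(new_msgs, new_preds)))

# Individual steps remain accessible by the names given in the Pipeline
vocab_size = len(toy_clf.named_steps["tfidf"].vocabulary_)
print(vocab_size)
```

Keeping the vectorizer inside the pipeline also guards against accidentally fitting it on test data, since `predict` only ever calls `transform` on the vectorizer step.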

In [14]:
# Performance of the LinearSVC pipeline on the test set
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

[[1586    7]
 [  12  234]]
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839
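
The figures in the report can be recomputed from the confusion matrix by hand. In scikit-learn's layout the rows are the true classes and the columns the predicted classes, here in the order ham, spam. The sketch below uses the four numbers printed above and treats spam as the positive class.

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted
cm = np.array([[1586, 7],
               [12, 234]])

# ham predicted ham, ham predicted spam, spam predicted ham, spam predicted spam
tn, fp, fn, tp = cm.ravel()

spam_precision = float(tp / (tp + fp))  # of messages flagged as spam, share that are spam
spam_recall = float(tp / (tp + fn))     # of actual spam messages, share that were caught
accuracy = float((tn + tp) / cm.sum())  # share of all messages classified correctly

print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
print(f"accuracy:       {accuracy:.2f}")
```

In words: 7 ham messages were wrongly flagged as spam and 12 spam messages slipped through as ham, out of 1839 test messages.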

