# Feature Extraction from text
Most classic machine learning algorithms can't take raw text as input. Instead, we need to perform feature extraction on the raw text so that we can pass numerical features to the algorithm. For example, we can count the occurrences of each word to map a text to a vector of numbers.



**Feature Extraction using Count Vectorization along with term frequency and Inverse Document frequency**

Let's see how to import these tools from the scikit-learn library.

### Count Vectorizer 

In [48]:
message=["Hey, let's watch the game today of lakers, bring some snacks with you!"]

In [61]:
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer()
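
The cell above only creates the vectorizer. As a minimal sketch of the next step, assuming scikit-learn is installed, `fit_transform` learns the vocabulary from the message and returns a document-term matrix of raw counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

message = ["Hey, let's watch the game today of lakers, bring some snacks with you!"]

vect = CountVectorizer()
dtm = vect.fit_transform(message)   # sparse matrix: 1 document x vocabulary size

# vocabulary_ maps each token to its column index in the matrix
print(sorted(vect.vocabulary_))
print(dtm.toarray())
```

Each column corresponds to one token of the learned vocabulary, and each cell holds how many times that token occurs in the document.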

An alternative to `CountVectorizer` is `TfidfVectorizer`. It also creates a document-term matrix (DTM) from our messages.

However, instead of filling the DTM with token counts, it calculates a term frequency-inverse document frequency (tf-idf) value for each word.

**Term Frequency tf(t,d):** is the raw count of a term in a document, i.e. the number of times that term t occurs in document d.
    
However, term frequency alone is not enough for a thorough feature analysis of the text. Consider very common terms like "a" or "the". Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents that happen to use the word "the" more frequently, without giving enough weight to more meaningful words like "red" or "clouds".

For this reason we use inverse document frequency.

**An inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.**
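
For reference, one common textbook formulation is sketched below, where $N$ is the number of documents in the corpus $D$. The symbols $N$ and $D$ are notation introduced here, and scikit-learn uses a smoothed variant by default, so its values differ slightly:

```latex
\mathrm{idf}(t, D) = \log\frac{N}{\left|\{\, d \in D : t \in d \,\}\right|},
\qquad
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t, D)
```

As a hypothetical illustration, in a corpus of 1000 documents a word like "the" that appears in all of them gets $\log(1000/1000) = 0$, while a word like "clouds" that appears in only 10 documents gets $\log(1000/10) \approx 4.6$ (natural logarithm).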

### Import Term Frequency-Inverse Document Frequency factor

In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer()
dtm=vect.fit_transform(message)

In [63]:
dtm

<1x13 sparse matrix of type '<class 'numpy.float64'>'
	with 13 stored elements in Compressed Sparse Row format>

### Building a Natural Language Processor from Scratch

In [68]:
%%writefile 1.txt
This is a story about cats
Cats are furry animals
Cats are super mean pets

Writing 1.txt


In [69]:
%%writefile 2.txt
This story is about dogs
Dogs are super energetic and like to play all the time
Dogs are very faithful to humans

Writing 2.txt


In [71]:
# Build a vocabulary from 1.txt: give each new word the next integer index
vocab={}
i=1
with open('1.txt') as f:
    x=f.read().lower().split()
    
for word in x:
    # skip words that already have an index
    if word not in vocab:
        vocab[word]=i
        i+=1
vocab    

{'this': 1,
 'is': 2,
 'a': 3,
 'story': 4,
 'about': 5,
 'cats': 6,
 'are': 7,
 'furry': 8,
 'animals': 9,
 'super': 10,
 'mean': 11,
 'pets': 12}

In [72]:
# Extend the same vocabulary with the words from 2.txt
with open('2.txt') as f:
    y=f.read().lower().split()
    
for word in y:
    # only words not seen in 1.txt get a new index
    if word not in vocab:
        vocab[word]=i
        i+=1
vocab

{'this': 1,
 'is': 2,
 'a': 3,
 'story': 4,
 'about': 5,
 'cats': 6,
 'are': 7,
 'furry': 8,
 'animals': 9,
 'super': 10,
 'mean': 11,
 'pets': 12,
 'dogs': 13,
 'energetic': 14,
 'and': 15,
 'like': 16,
 'to': 17,
 'play': 18,
 'all': 19,
 'the': 20,
 'time': 21,
 'very': 22,
 'faithful': 23,
 'humans': 24}

### Feature Extraction
Now that we've encapsulated our "entire language" in a dictionary, let's perform feature extraction on each of our original documents:

In [85]:
# Create an empty vector with space for each word in the vocab
vect=["1.txt"]+[0]*len(vocab)
vect

['1.txt',
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [86]:
# map the frequencies of each word in 1.txt to our vector
with open("1.txt") as f:
    x=f.read().lower().split()
    
for word in x:
    vect[vocab[word]] +=1
        
vect

['1.txt',
 1,
 1,
 1,
 1,
 1,
 3,
 2,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [84]:
# map the frequencies of each word in 2.txt to our vector
vect2=['2.txt']+[0]*len(vocab)

with open('2.txt') as f:
    x=f.read().lower().split()
for word in x:
    vect2[vocab[word]]+=1
    
vect2

['2.txt',
 1,
 1,
 0,
 1,
 1,
 0,
 2,
 0,
 0,
 1,
 0,
 0,
 3,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1]

In [88]:
# compare two vectors
print(f"{vect}\n{vect2}")

['1.txt', 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
['2.txt', 1, 1, 0, 1, 1, 0, 2, 0, 0, 1, 0, 0, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1]
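
The steps above can be bundled into two small functions. This is a minimal sketch: the function names `build_vocab` and `count_vector` are made up for illustration, and the documents are passed in as strings instead of being read from files:

```python
from collections import Counter

def build_vocab(documents):
    """Give each distinct lowercase, whitespace-separated word an index starting at 1."""
    vocab = {}
    for text in documents:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def count_vector(text, vocab):
    """Return the word counts of one document, ordered by vocabulary index."""
    counts = Counter(text.lower().split())
    # dicts preserve insertion order (Python 3.7+), so this follows the indices
    return [counts[word] for word in vocab]

doc1 = "This is a story about cats\nCats are furry animals\nCats are super mean pets"
doc2 = ("This story is about dogs\n"
        "Dogs are super energetic and like to play all the time\n"
        "Dogs are very faithful to humans")

vocab = build_vocab([doc1, doc2])
vec1 = count_vector(doc1, vocab)
vec2 = count_vector(doc2, vocab)
print(vec1)
print(vec2)
```

The two printed lists match the vectors above without the leading file name.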


By comparing the vectors we see that some words are common to both, some appear only in 1.txt, others only in 2.txt. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Vectors would contain mostly zero values, making them sparse matrices.
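
To illustrate why sparsity matters, here is a minimal sketch of one way to store such a vector compactly, keeping only the non-zero positions. It only illustrates the idea; scikit-learn relies on SciPy's Compressed Sparse Row format, as the matrix outputs in this notebook show.

```python
# Dense count vector for 1.txt from the example above (file name dropped)
dense = [1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 1, 1] + [0] * 12

# Sparse form: keep only the non-zero entries as {position: count}
sparse = {i: count for i, count in enumerate(dense) if count != 0}

def to_dense(sparse_vec, length):
    """Rebuild the full vector from the sparse mapping."""
    return [sparse_vec.get(i, 0) for i in range(length)]

print(sparse)
print(f"stored {len(sparse)} of {len(dense)} entries")
```

With only two short documents half of the entries are already zero. With a vocabulary of hundreds of thousands of words, nearly all of them would be.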

## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves these may not be helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
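
As a worked example, the sketch below computes tf, idf and tf-idf for the two toy documents used earlier. It follows the plain definitions given above (occurrences divided by document length, and the natural log of the document ratio); the helper names `tf`, `idf` and `tfidf` are chosen here for illustration.

```python
import math
from collections import Counter

docs = {
    "1.txt": "this is a story about cats cats are furry animals cats are super mean pets",
    "2.txt": ("this story is about dogs dogs are super energetic and like to play "
              "all the time dogs are very faithful to humans"),
}
doc_words = {name: text.split() for name, text in docs.items()}

def tf(term, name):
    """Occurrences of the term divided by the total number of words in the document."""
    return Counter(doc_words[name])[term] / len(doc_words[name])

def idf(term):
    """log(total documents / documents containing the term); assumes the term occurs somewhere."""
    containing = sum(1 for words in doc_words.values() if term in words)
    return math.log(len(doc_words) / containing)

def tfidf(term, name):
    """Product of term frequency and inverse document frequency."""
    return tf(term, name) * idf(term)

for term in ("cats", "are"):
    print(term, round(tf(term, "1.txt"), 3), round(idf(term), 3), round(tfidf(term, "1.txt"), 3))
```

"are" occurs in both documents, so its idf is log(2/2) = 0 and its tf-idf weight vanishes. "cats" occurs only in 1.txt and keeps a positive weight.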

## Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.
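
To make the limitation of `.split()` concrete, here is a minimal sketch comparing it with a simple regular-expression tokenizer. The pattern keeps runs of two or more word characters. It is only a slightly less crude tokenizer, not the morphological analysis mentioned above.

```python
import re

text = "Hey, let's watch the game today of lakers, bring some snacks with you!"

# Whitespace splitting leaves punctuation attached to the words
split_tokens = text.lower().split()
print(split_tokens)

# Keep runs of two or more word characters, which also drops the punctuation
regex_tokens = re.findall(r"\b\w\w+\b", text.lower())
print(regex_tokens)
```

With `.split()`, "lakers," and "lakers" would count as two different words. The regular expression avoids that, at the cost of breaking "let's" apart.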

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.

## Feature Extraction from Text (Spam Detector Model)
In the Scikit-Learn-Primer notebook, our spam detector model performed poorly. This time we will use the `message` text itself as the input feature and see how the model performs.

In [89]:
# Import all the tools
import numpy as np
import pandas as pd



### Load Data

In [90]:
df=pd.read_csv('../TextFiles/smsspamcollection.tsv',sep="\t")

In [91]:
df.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [92]:
df.describe()

| | length | punct |
|---|---|---|
| count | 5572.0 | 5572.0 |
| mean | 80.48995 | 4.177495 |
| std | 59.942907 | 4.623919 |
| min | 2.0 | 0.0 |
| 25% | 36.0 | 2.0 |
| 50% | 62.0 | 3.0 |
| 75% | 122.0 | 6.0 |
| max | 910.0 | 133.0 |


In [93]:
## Check for missing values in the data
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

## Split the data into X and y


In [95]:
X=df['message']
y=df['label']

## Split the data into training and test datasets

In [96]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)

In [97]:
X_train.shape

(4179,)

In [98]:
X_test.shape

(1393,)

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [99]:
# let's convert our text to numerical vectors using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer()
X_train_counts=count_vect.fit_transform(X_train)
X_train_counts

<4179x7490 sparse matrix of type '<class 'numpy.int64'>'
	with 55879 stored elements in Compressed Sparse Row format>

In [100]:
X_train_counts.shape

(4179, 7490)
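
A quick back-of-the-envelope check, using only the numbers reported in the output above, shows how sparse this matrix is:

```python
n_docs, n_terms = 4179, 7490      # shape of X_train_counts reported above
stored = 55879                    # stored (non-zero) elements reported above

total_cells = n_docs * n_terms
density = stored / total_cells

print(f"{total_cells:,} cells, {density:.4%} non-zero")
print(f"about {stored / n_docs:.1f} distinct tokens per message")
```

Well under one percent of the cells hold a value, which is why scikit-learn returns a sparse matrix here.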

## Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

In [101]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer=TfidfTransformer()
X_train_trans=transformer.fit_transform(X_train_counts)
X_train_trans.shape

(4179, 7490)
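
The cell above computes tf-idf with the default settings. For the plain **tf** features described earlier, a minimal sketch (assuming scikit-learn and NumPy are installed, and using a small hypothetical count matrix) switches the idf weighting off and normalizes each row by its sum:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents x 2 terms
counts = np.array([[3, 1],
                   [0, 2]])

# use_idf=False skips the idf weighting; norm="l1" divides each row by its sum
tf_only = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts)
print(tf_only.toarray())
```

Note that the default `TfidfTransformer()` applies idf weighting and L2 normalization, so its rows have unit length and do not sum to one.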

## Combine Steps with TfidfVectorizer
We can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [102]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer()
X_train_tfidf=vect.fit_transform(X_train)
X_train_tfidf.shape

(4179, 7490)

## Choose a model to fit this data on

In [103]:
from sklearn.svm import LinearSVC
model=LinearSVC()

model.fit(X_train_tfidf,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

## Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier.

In [107]:
from sklearn.pipeline import Pipeline

text_clf=Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())])

text_clf.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [108]:
# Make predictions using model

predictions=text_clf.predict(X_test)
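
To show what the pipeline does for us, the sketch below performs the same two steps by hand and compares the result with a `Pipeline`. The toy messages and labels are hypothetical and only make the example self-contained, and `random_state=0` is set so both routes behave identically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, only to make the sketch self-contained
train_msgs = ["win a free prize now", "free entry to win cash",
              "are we meeting for lunch today", "see you at the game tonight"]
train_labels = ["spam", "spam", "ham", "ham"]
test_msgs = ["free cash prize", "lunch at the game"]

# Manual route: fit the vectorizer on training text only, then reuse it on test text
vectorizer = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vectorizer.fit_transform(train_msgs), train_labels)
manual = clf.predict(vectorizer.transform(test_msgs))   # transform, not fit_transform

# Pipeline route: the same two steps handled by one object
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
pipe.fit(train_msgs, train_labels)
piped = pipe.predict(test_msgs)

print(list(manual), list(piped))
```

The key detail is that the test text is only transformed with the vocabulary learned from the training text. The pipeline takes care of this automatically when `predict` is called.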

## Evaluate the model

In [110]:
text_clf.score(X_test,y_test)

0.990667623833453

In [111]:
from sklearn.metrics import classification_report,confusion_matrix

confusion_matrix(y_test,predictions)

array([[1205,    2],
       [  11,  175]], dtype=int64)

In [112]:
# Label the confusion matrix; a separate name keeps the original df of messages intact
cm_df=pd.DataFrame(confusion_matrix(y_test,predictions),index=["ham","spam"],columns=["ham","spam"])
cm_df

| | ham | spam |
|---|---|---|
| ham | 1205 | 2 |
| spam | 11 | 175 |


In [114]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1207
        spam       0.99      0.94      0.96       186

   micro avg       0.99      0.99      0.99      1393
   macro avg       0.99      0.97      0.98      1393
weighted avg       0.99      0.99      0.99      1393
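
The figures in this report can be reproduced by hand from the confusion matrix above. A minimal sketch, treating "spam" as the positive class (the variable names are chosen here for illustration):

```python
# Counts from the confusion matrix above (rows = true label, columns = predicted label)
ham_as_ham, ham_as_spam = 1205, 2
spam_as_ham, spam_as_spam = 11, 175

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

# Treating "spam" as the positive class
precision = spam_as_spam / (spam_as_spam + ham_as_spam)   # predicted spam that was really spam
recall = spam_as_spam / (spam_as_spam + spam_as_ham)      # real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recall of 0.94 reflects the 11 spam messages that slipped through as ham, which is the main weakness left in this model.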



## Let's try our model on custom messages

In [117]:
text_clf.predict(["Congratulations, You have won a brand new car in a jackpot click on the link to claim the car, Hurry Up!"])

array(['spam'], dtype=object)