## Text Classification - 1

Supervised learning - We know what the target labels are.  
Unsupervised learning - We do not know what the target labels are.  
Semi-supervised learning - Uses a combination of labeled and unlabeled data.

### Classification blueprint. 
Steps:  
    1. Preparation of the train and test data.  
    2. Text normalization  
    3. Feature extraction  
    4. Model training  
    5. Model prediction  
    6. Model evaluation  
    7. Model deployment  

#### 1. Preparation of train and test data. 
If the data has 10 lines, use 8 lines to train the model and 2 lines to test the trained model.  
So the data has to be split into a training part and a test part, for example with scikit-learn:  
`from sklearn.model_selection import train_test_split`
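
To make the 8/2 split concrete, here is a minimal standard-library sketch of what a shuffled 80/20 split does. The helper name `simple_split` and the sample lines are hypothetical and only for illustration; scikit-learn's `train_test_split` does the same job with more options.

```python
import random

def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

lines = [f"line {i}" for i in range(10)]  # hypothetical 10-line data set
train, test = simple_split(lines)
print(len(train), len(test))  # 8 lines for training, 2 for testing
```

Shuffling before cutting matters when the file is ordered (for example, all spam first), otherwise the test part may not look like the training part.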

#### 2. Text Normalization. 
1. Stemming - Splits a word into prefix, stem and suffix, and returns the stem.

| Form | Suffix | Stem          |
| ------------- | ------------- | ------------- |
| Studies  | -es  | studi       |
| Studying  | -ing  | study     |

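This is not the best approach because the stem is not always a valid English word. <br/>
2. Lemmatization - Better than stemming: it looks the word up in a dictionary and returns a proper English word (the lemma). NLTK's WordNetLemmatizer treats every word as a noun by default, so verbs are only reduced when `pos='v'` is passed. <br/>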

| Form | Lemma |
| ------------- | ------------- |
| Studies  | Study  |
| Studying  | Study  |

<br/>
3. Removal of stop words - common words such as articles. <br/>
4. Removal of special characters - such as punctuation.

In [1]:
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/varshithavasireddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Stemming and Lemmatization
sentence = "Hello how are you doing? I am done with the example for this class"
ps = PorterStemmer()
_words = word_tokenize(sentence)
for word in _words:
    print(word,ps.stem(word))

Hello hello
how how
are are
you you
doing do
? ?
I i
am am
done done
with with
the the
example exampl
for for
this thi
class class


In [3]:
lemma = WordNetLemmatizer()
_words = word_tokenize(sentence)
for word in _words:
    print(word,lemma.lemmatize(word))

Hello Hello
how how
are are
you you
doing doing
? ?
I I
am am
done done
with with
the the
example example
for for
this this
class class


In [4]:
# Removal of Stopwords
stopword = set(stopwords.words('english'))
sentence = "If the Easter Bunny and the Tooth fairy had babies would they take your teeth and leave chocolate for you?"
_words = word_tokenize(sentence)
filtered_sentence = [w for w in _words if w not in stopword]
filtered_sentence

['If',
 'Easter',
 'Bunny',
 'Tooth',
 'fairy',
 'babies',
 'would',
 'take',
 'teeth',
 'leave',
 'chocolate',
 '?']
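
In the output above, `'If'` and `'?'` survive: the comparison is case-sensitive against a lowercase stop-word list, and punctuation is not a stop word. A minimal sketch of one way to handle both is shown here. It uses only the standard library, and `STOP_WORDS` is a small hand-written set (an assumption standing in for NLTK's English list, not a copy of it).

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"if", "the", "and", "had", "they", "your", "for", "you"}

def clean_tokens(text):
    """Lowercase, keep only word tokens (drops punctuation), drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

sentence = ("If the Easter Bunny and the Tooth fairy had babies would they "
            "take your teeth and leave chocolate for you?")
print(clean_tokens(sentence))
```

Lowercasing before the lookup is the key step; the regular expression is a simplification of what `word_tokenize` does.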

#### 3. Feature Extraction
- Features are unique, measurable attributes or properties of each observation or data point in a data set.  
- Features are usually numeric, while the target labels that a classifier predicts are categorical.  
- What is vectorization? Converting textual data into numerical values based on the frequency or occurrence of each word.  

Two types of vectorization:
1. Bag of words, aka CountVectorizer
2. TF-IDF model: Term Frequency - Inverse Document Frequency

##### Count Vectorizer 

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

In [7]:
vec = CountVectorizer() # Model created
vec.fit(corpus) # Model's vocabulary created
vec.vocabulary_ # Getting Vocab

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [8]:
vec.transform(corpus).toarray() # Vectorizing the corpus

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

In [9]:
counts = vec.fit_transform(corpus)
print(vec.get_feature_names_out())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [10]:
print(counts.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
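
The count matrix above can be reproduced by hand, which shows what a bag of words really is: one column per vocabulary word, one row per document. The sketch below uses only the standard library. The `tokenize` helper is a hypothetical name, and its regular expression is meant to mirror CountVectorizer's default behaviour (lowercase, tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, in alphabetical order (one column each).
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, holding the count of each vocabulary word.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```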


##### Tf - Idf Vectorizer

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
tfIdf = TfidfVectorizer()
frequency = tfIdf.fit_transform(corpus)
# fit_transform creates the TF-IDF weighted document-term matrix
print(tfIdf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [19]:
import pprint
pp = pprint.PrettyPrinter(indent = 1)
pp.pprint(frequency.toarray())

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
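
To see where these weights come from, the sketch below recomputes them with the standard library only. It assumes scikit-learn's default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(d)) for d in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed inverse document frequency: rarer words get a larger weight.
idf = {}
for term in vocabulary:
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(counts):
    """Term count times IDF, scaled so the row has unit Euclidean length."""
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_row(d) for d in docs]
print([round(w, 4) for w in tfidf[0]])
```

In the first document, "first" appears in fewer documents than "is", so it receives the larger weight even though both occur once.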


#### 6. Model Evaluation
- True Positive - TP
- True Negative - TN
- False Positive - FP
- False Negative - FN
<br/>
- Accuracy: Proportion of all predictions that the model got correct  
    Accuracy = (TP + TN)/(TP + TN + FN + FP)
- Precision: Proportion of positive predictions that actually belong to the positive class  
    Precision = TP / (TP + FP)
- Recall: Proportion of actual positive instances that were correctly predicted  
    Recall = TP / (TP + FN)
- F1 Score: Another accuracy measure, the harmonic mean of precision and recall  
    F1 Score = (2 x Precision x Recall)/(Precision + Recall)
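
The four formulas can be checked with a small worked example. The function name `classification_metrics` and the confusion counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for 100 test messages.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here precision (40/45) is higher than recall (40/50): the model is rarely wrong when it predicts the positive class, but it misses some positives.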

## Text Classification - 2

### Fit and Transform. 
- The fit() function, applied to the training dataset, learns the model parameters (for a vectorizer, the vocabulary).
- The transform() function applies those learned parameters and is used on both the train and test datasets.  
- On the training dataset, fit and transform can be done in one step using fit_transform(). <br/>
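
The difference between the three calls is easiest to see on a toy vectorizer. `TinyCountVectorizer` below is a hypothetical, minimal class written for illustration only. It follows the same fit/transform pattern but is not scikit-learn's implementation.

```python
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical minimal vectorizer, for illustration only."""

    def fit(self, docs):
        # Learn the vocabulary from the training documents.
        words = {w for doc in docs for w in doc.lower().split()}
        self.vocabulary_ = sorted(words)
        return self

    def transform(self, docs):
        # Encode documents using the vocabulary learned in fit().
        rows = []
        for doc in docs:
            counts = Counter(doc.lower().split())
            rows.append([counts[w] for w in self.vocabulary_])
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

train_docs = ["free prize now", "see you at lunch"]
test_docs = ["free lunch tomorrow"]

vec = TinyCountVectorizer()
X_train = vec.fit_transform(train_docs)  # learn vocabulary and encode
X_test = vec.transform(test_docs)        # reuse vocabulary, no refit
print(vec.vocabulary_)
print(X_test)
```

The word "tomorrow" never appeared in the training documents, so it has no column and is ignored in `X_test`. Calling fit on the test data instead would change the columns and break the trained model.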

### Naive Bayes Classifier
- One of the most widely used text classification algorithms, and very fast compared with other classifiers.  
- It is called naive (simple) because it assumes the attributes are independent of each other given the class. It applies Bayes' theorem of probability to predict the class of unknown data.  
- Bayes' theorem is as follows:  

$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$ <br/>
where $c$ is the class and $x$ is the attribute.
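
A short worked example of the theorem, using made-up probabilities (every number below is an assumption chosen for easy arithmetic, not taken from any data set): the class c is "spam" and the attribute x is "the message contains the word free".

```python
# Hypothetical probabilities, for illustration only.
p_spam = 0.2               # P(c): prior probability of spam
p_ham = 1 - p_spam
p_free_given_spam = 0.6    # P(x|c): "free" appears in a spam message
p_free_given_ham = 0.05    # "free" appears in a ham message

# P(x): total probability of seeing "free" in any message.
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))
```

Seeing the word raises the probability of spam from the prior of 0.2 to 0.75, because "free" is far more common in spam than in ham.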

In [31]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
#MultinomialNB is one of the naive_bayes algorithm

In [30]:
data = pd.read_csv('SMSSpamCollection.csv',
                  sep = '\t',
                  header = None,
                  names = ['label','message'])

In [34]:
# Text Normalization
lemma = WordNetLemmatizer()
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
# Rows with a missing message are read as NaN (a float, which has no .lower()),
# so drop incomplete rows before normalizing the text
data = data.dropna(subset=['label', 'message']).copy()
data['label'] = data['label'].astype(int)
data['message'] = data['message'].str.lower()
data['message'] = data['message'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
data['message'] = data['message'].apply(word_tokenize)
data['message'] = data['message'].apply(lambda x: [lemma.lemmatize(y) for y in x])
data['message'] = data['message'].apply(lambda x: ' '.join(x))

In [35]:
# Build the vocabulary from the messages and count word occurrences
CountVec = CountVectorizer()
counts = CountVec.fit_transform(data['message'])

In [36]:
# counts and data['label'] now have the same number of rows
X_train, X_test, y_train, y_test = train_test_split(counts, data['label'],
                                                    test_size=0.2)

In [37]:
model = MultinomialNB().fit(X_train, y_train)

In [38]:
predict = model.predict(X_test)

In [39]:
predict

In [40]:
y_test

In [41]:
# Mean accuracy on the held-out test set
print(model.score(X_test, y_test))

### Decision Tree Classifier
- A decision tree is a type of supervised learning algorithm that is mostly used in classification problems.
- Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Grade (IX/X) and Height (5 to 6 ft). 15 of these 30 play soccer in their leisure time.
- Using this information, we want to separate the students who play soccer from those who do not, splitting on the most significant of the three input variables.
<br/>
- One method is to choose the split criterion whose outcome gives the most homogeneous groups.
- For example, does splitting by gender give more homogeneous nodes than splitting by height or by grade?
<br/>
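
One common way to score how homogeneous a split is uses Gini impurity. This is an assumption here, since the text above does not name a criterion. The sketch below compares the three candidate splits for the 30 students. The per-group counts are hypothetical (they only respect the totals of 30 students and 15 players), and the function names are illustrative.

```python
def gini(played, total):
    """Gini impurity of one group: 0 means perfectly homogeneous."""
    p = played / total
    return 1 - p**2 - (1 - p)**2

def weighted_gini(groups):
    """Average impurity of a split, weighted by group size."""
    n = sum(total for _, total in groups)
    return sum(total / n * gini(played, total) for played, total in groups)

# Hypothetical (players, group size) pairs for each candidate split.
splits = {
    "Gender": [(2, 10), (13, 20)],
    "Grade": [(6, 14), (9, 16)],
    "Height": [(5, 12), (10, 18)],
}

scores = {name: weighted_gini(groups) for name, groups in splits.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best split:", best)
```

With these counts, Gender gives the lowest weighted impurity, so the tree would split on it first and then repeat the same comparison inside each resulting group.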

#### Disadvantage
A decision tree can fit the training data so closely that it overfits: accuracy is high on the training data but drops when test data is given.

In [42]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

In [43]:
data = pd.read_table('SMSSpamCollection.csv',
                    sep = '\t',
                    header = None,
                    names=['label', 'message'])

In [44]:
lemma = WordNetLemmatizer()
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
# Rows with a missing message are read as NaN (a float, which has no .lower()),
# so drop incomplete rows before normalizing the text
data = data.dropna(subset=['label', 'message']).copy()
data['label'] = data['label'].astype(int)
data['message'] = data['message'].str.lower()
data['message'] = data['message'].str.replace(r'[^\w\s]', '', regex=True)  # strip punctuation
data['message'] = data['message'].apply(word_tokenize)
data['message'] = data['message'].apply(lambda x: [lemma.lemmatize(y) for y in x])
data['message'] = data['message'].apply(lambda x: ' '.join(x))
df = pd.DataFrame()
df['message'] = data['message']

In [45]:
# First normalized message, kept as a one-row Series
df['message'].iloc[[0]]

In [46]:
CountVec = CountVectorizer()
counts = CountVec.fit_transform(data['message'])
In [None]:
counts_df = CountVec.transform(df['message'].iloc[[0]])

In [None]:
X_train, X_test , y_train, y_test = train_test_split(counts, data['label'],
                                                   test_size = 0.2)

In [None]:
model = DecisionTreeClassifier().fit(X_train, y_train)

In [None]:
predict = model.predict(counts_df)

In [None]:
model

In [None]:
predict

In [None]:
print(model.score(X_test, y_test))

### Random Forests
- A random forest classifier is an ensemble of decision trees.
- It builds a set of decision trees, each from a randomly selected subset of the training set.
- It then aggregates the votes of the individual decision trees to decide the final class of the test object.
- A random forest is usually more accurate than a single decision tree classifier.
<br/>
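
The two ideas above, random subsets and vote aggregation, can be sketched with the standard library. This is a simplified illustration, not a full random forest: sampling with replacement (bootstrap) is an assumption about how the subsets are drawn, and the row indices and per-tree predictions are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one tree's training subset)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the class predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
training_rows = list(range(10))  # hypothetical row indices
subsets = [bootstrap_sample(training_rows, rng) for _ in range(3)]
for subset in subsets:
    print(sorted(subset))        # each tree sees a different mix of rows

tree_votes = [1, 0, 1, 1, 0]     # hypothetical predictions from 5 trees
print(majority_vote(tree_votes))
```

Because each tree is trained on a different subset, the trees make different mistakes, and the vote tends to cancel them out.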

#### Disadvantage
- It is slower than the Decision Tree Classifier and the Naive Bayes Classifier.