## <b style="color:green">Text Classification</b>
- **Two Types of Text Classification**
  - ___ML : Machine Learning___
  - ___DL : Deep Learning___
- Pipeline of Text Classification.

- __Classification__ : Assigning an input to one of a fixed set of output labels.
- Types of Classification
  - Binary Classification : Two possible outputs
  - Multi-class Classification : More than two possible outputs, exactly one per input
  - Multi-label Classification : One input can have several outputs at once
- __Application__ :    
  1. Email Classification = Spam or Not Spam    
  2. Email Classification = Sales or Support     
  3. Customer Support Chatbot                    
  4. Sentiment Analysis = Happy, Sad, Normal
  5. Language Detection and Translation
  6. Fake News Detection
- __Pipeline__ :
  <pre>
                         __________________________
                        |    Data Acquisition      |
                        |__________________________|
                                     |
                                    \|/
                                     '
                         __________________________
                        |   Text Preprocessing     |
                        |__________________________|
                                     |
                                    \|/
                                     '
                         __________________________
                        |  Text Vectorization      |
                        |__________________________|
                                     |
                                    \|/
                                     '
                         __________________________
                        |        Modelling         |   ML: Naive Bayes, Random Forest
                        |__________________________|   DL: RNN->LSTM, CNN, BERT
                                     |
                                    \|/
                                     '
                         __________________________
                        |        Evaluation        |  Metrics: Accuracy Score, Precision and Recall
                        |__________________________|           Confusion Matrix, ROC Curve and AUC
                                     |
                                    \|/
                                     '
                         __________________________
                        |          Deploy          |
                        |__________________________|


  </pre>
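
The stages in the diagram can be sketched end to end with scikit-learn's `Pipeline`. This is only a minimal illustration: the four-review corpus, the step names and the choice of `TfidfVectorizer` with `RandomForestClassifier` are assumptions made for the example, not the setup used later in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Data acquisition: a tiny hypothetical corpus stands in for a real dataset.
texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "awful acting boring story",
]
labels = ["positive", "positive", "negative", "negative"]

# Preprocessing, vectorization and modelling chained in one object.
text_clf = Pipeline([
    ("vectorize", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
text_clf.fit(texts, labels)

# Evaluation would normally use a held-out test set; here we only
# predict the label of one unseen text.
new_pred = text_clf.predict(["a wonderful story with great acting"])
print(new_pred)
```

Keeping the vectorizer inside the pipeline means it is fitted on the training texts only, which is the same discipline as calling `fit_transform` on the training split and `transform` on the test split.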

- Different Approaches :
  1. Heuristic Approaches
  2. Using API Approaches
  3. Using ML Approaches = BoW, n-grams, TF-IDF, Naive Bayes, RF, SVM
  4. Using DL Approaches = RNN(LSTM), CNN, BERT


### **1. Heuristic Approaches**
- Use hand-written rules (keyword lists, regular expressions) when you don't have enough labelled data to train a model.
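
A heuristic classifier can be as small as a keyword count. The sketch below is an assumption about what such a rule might look like; the word lists and the function name `heuristic_sentiment` are hypothetical and would normally come from domain knowledge.

```python
import re

# Hypothetical keyword lists chosen for illustration only.
POSITIVE_WORDS = {"good", "great", "wonderful", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "boring", "terrible", "awful", "hate"}

def heuristic_sentiment(text):
    """Label text by counting hand-picked positive and negative keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"  # no keyword matched, or the counts cancel out

print(heuristic_sentiment("A wonderful little production"))
```

Rules like this need no training data, but they miss negation ("not good") and any word outside the lists.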

### **2. Using API**
- Use a ready-made API (Application Programming Interface) instead of training your own model.
- Some providers : AWS, GCP, NLP Cloud

### **3. Using BoW, n-grams and TF-IDF**
- Bag of Words
- <pre>
                                     _______ Naive Bayes   \
                                    /                       \
                                   /                         \
     Text                         /                           \  Compare Both Results
    Preprocessing >---> BoW >--->                             /
                      n-gram      \                          /
                      TF-IDF       \                        /
                                    \_______ Random Forest /
  </pre>
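
To make the three vectorization options in the diagram concrete, here is a small sketch on a toy corpus. The three documents are made up for illustration; the vocabulary is read from `vocabulary_`, whose sorted keys match the column order of the matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["good movie", "bad movie", "good good plot"]

# Bag of Words: one column per word, values are raw counts.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
bow_vocab = sorted(bow.vocabulary_)
print(bow_vocab)
print(bow_matrix.toarray())

# n-grams: unigrams plus bigrams, so word order is partly kept.
bigram = CountVectorizer(ngram_range=(1, 2))
bigram.fit(corpus)
bigram_vocab = sorted(bigram.vocabulary_)
print(bigram_vocab)

# TF-IDF: same columns as BoW, but counts are reweighted so that
# words appearing in many documents get smaller values.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.shape)
```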

### **4. Use Word2Vec**
- Two options:
  1. Use Pre-trained Word2Vec
  2. Train Your Word2Vec on Your Data (Make sure you have sufficient data)
- Sentence >---> average of its word vectors (`avg word2vec`)

In [1]:
import numpy as np
import pandas as pd

In [2]:
temp_df = pd.read_csv("../data/IMDB_Dataset.csv")
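# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df = temp_df.iloc[:10000].copy()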
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
df.shape

(10000, 2)

In [4]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [5]:
# Check class balance: the two classes should be roughly equal in size,
# otherwise the larger class will dominate training.
df['sentiment'].value_counts()

positive    5028
negative    4972
Name: sentiment, dtype: int64

In [6]:
# check null value
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [7]:
# check duplicated review (rows)
df.duplicated().sum()

17

In [8]:
# drop duplicated reviews
df = df.drop_duplicates()

df.duplicated().sum()

0

### **Basic Text Preprocessing**

In [9]:
# Basic Preprocessing
# 1. Remove HTML tags
# 2. Lowercase
# 3. Remove URLs
# 4. Remove punctuation
# 5. Remove stopwords
# 6. Spelling correction (not applied in the cells below)
# 7. Stemming (not applied in the cells below)

In [10]:
# remove HTML tags such as <br />
import re

# compile once; the non-greedy .*? stops at the first closing '>'
tag_pattern = re.compile(r'<.*?>')

def remove_tags(text):
    return tag_pattern.sub('', text)

df['review'] = df['review'].apply(remove_tags)

In [11]:
# lowercasing
df['review'] = df['review'].str.lower()

In [12]:
# remove url
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

df['review'] = df['review'].apply(remove_url)

In [13]:
# remove punctuation
import string

# translation table that deletes every ASCII punctuation character
punc_table = str.maketrans('', '', string.punctuation)

def remove_punc(text):
    return text.translate(punc_table)

df['review'] = df['review'].apply(remove_punc)

In [14]:
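# remove stopwords
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

# a set makes each membership test fast compared with a list
sw = set(stopwords.words('english'))

def remove_stopwords(text):
    # Each stopword is replaced by an empty string, so the joined text keeps
    # extra spaces where stopwords were. The vectorizers used later ignore them.
    new_text = []
    for word in text.split():
        if word in sw:
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)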

df['review'] = df['review'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

In [15]:
df['review'][1]

' wonderful little production  filming technique   unassuming  oldtimebbc fashion  gives  comforting  sometimes discomforting sense  realism   entire piece  actors  extremely well chosen michael sheen    got   polari      voices  pat    truly see  seamless editing guided   references  williams diary entries     well worth  watching     terrificly written  performed piece  masterful production  one   great masters  comedy   life  realism really comes home   little things  fantasy   guard  rather  use  traditional dream techniques remains solid  disappears  plays   knowledge   senses particularly   scenes concerning orton  halliwell   sets particularly   flat  halliwells murals decorating every surface  terribly well done'
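
The separate cleaning steps above can also be chained in one function. The sketch below is one possible arrangement, not the notebook's own code: `clean_text` is a hypothetical helper, and the small `STOPWORDS` set is an assumption standing in for NLTK's English list so that the example runs without downloads. Unlike the cell above, it drops stopwords outright, so no extra spaces are left behind.

```python
import re
import string

TAG_RE = re.compile(r"<.*?>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Tiny illustrative stopword set (assumption); the notebook uses NLTK's list.
STOPWORDS = {"a", "the", "is", "and", "of", "to", "this"}

def clean_text(text):
    """Apply tag removal, lowercasing, URL removal, punctuation removal
    and stopword removal, in that order."""
    text = TAG_RE.sub("", text)
    text = text.lower()
    text = URL_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_text("A wonderful little production. <br /><br />The filming is great!"))
```

With a DataFrame this could be applied as `df['review'].apply(clean_text)`. URLs are removed before punctuation because stripping `:`, `/` and `.` first would leave the URL pattern nothing to match.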

In [16]:
X = df.iloc[:, 0:1]
y = df['sentiment']
X

| | review |
|---|---|
| 0 | one reviewers mentioned watching 1 oz e... |
| 1 | wonderful little production filming techniqu... |
| 2 | thought wonderful way spend time hot s... |
| 3 | basically theres family little boy jake thi... |
| 4 | petter matteis love time money visually s... |
| ... | ... |
| 9995 | fun entertaining movie wwii german spy julie ... |
| 9996 | give break anyone say good hockey movi... |
| 9997 | movie bad movie watching endless series ... |
| 9998 | movie probably made entertain middle sc... |


In [17]:
y

0       positive
1       positive
2       positive
3       negative
4       positive
          ...   
9995    positive
9996    negative
9997    negative
9998    negative
9999    positive
Name: sentiment, Length: 9983, dtype: object

In [18]:
# Label-encode the sentiment column. LabelEncoder sorts the classes
# alphabetically, so negative -> 0 and positive -> 1.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)
y

array([1, 1, 1, ..., 0, 0, 1])
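
To check which integer stands for which class, the fitted encoder can be inspected. The following is a small standalone sketch on three made-up labels; the variable names `toy_encoder`, `mapping` and `decoded` are chosen for the example.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(["positive", "negative", "positive"])

# classes_ is sorted, and a label's position in it is its integer code.
mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform turns predicted integers back into label strings.
decoded = toy_encoder.inverse_transform([1, 0])
print(decoded)
```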

In [19]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape

(7986, 1)

### **Use BoW**

In [20]:
# Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()
X_train_bow.shape

(7986, 72190)

`X_train_bow.shape = (7986, 72190)`  >---------> (number of training reviews, number of words in the vocabulary)
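
A matrix of this shape is almost entirely zeros. Stored densely as 64-bit integers, 7986 x 72190 cells take roughly 4.6 GB, whereas `fit_transform` returns a SciPy sparse matrix that stores only the non-zero counts. The toy sketch below (made-up documents, hypothetical variable names) shows the difference; models such as `RandomForestClassifier` accept the sparse form directly, while `GaussianNB` needs a dense array.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good good plot"]

cv_sparse = CountVectorizer()
X_sparse = cv_sparse.fit_transform(docs)  # SciPy sparse matrix
X_dense = X_sparse.toarray()              # full NumPy array, zeros included

n_rows, vocab_size = X_sparse.shape
stored = X_sparse.nnz   # number of non-zero entries kept in memory
total = X_dense.size    # number of cells in the dense array

print(n_rows, vocab_size, stored, total)
```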

In [21]:
# Use gaussian naive bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

gnb = GaussianNB()
gnb.fit(X_train_bow, y_train)

y_pred = gnb.predict(X_test_bow)
# accuracy
accuracy_score(y_test, y_pred)*100

62.59389083625438

In [22]:
confusion_matrix(y_test, y_pred)

array([[672, 313],
       [434, 578]], dtype=int64)
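
`GaussianNB` models every feature as a continuous, normally distributed value, which is a poor match for sparse word counts and may partly explain the low score. A common alternative for count features is `MultinomialNB`, which also works on the sparse matrix without `.toarray()`. The sketch below uses a made-up four-document corpus and is not a result from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = positive, 0 = negative.
train_docs = [
    "great film loved acting",
    "wonderful story great cast",
    "terrible film hated acting",
    "boring story awful cast",
]
train_y = [1, 1, 0, 0]

vec = CountVectorizer()
X_counts = vec.fit_transform(train_docs)  # stays sparse

mnb = MultinomialNB()
mnb.fit(X_counts, train_y)

toy_pred = mnb.predict(vec.transform(["great wonderful story", "awful boring film"]))
print(toy_pred)
```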

In [23]:
# Use Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_bow, y_train)

y_pred = rf.predict(X_test_bow)
#accuracy
accuracy_score(y_test, y_pred)*100

84.07611417125689

In [24]:
confusion_matrix(y_test, y_pred)

array([[838, 147],
       [171, 841]], dtype=int64)
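
Accuracy, precision and recall can all be read off the confusion matrix. In scikit-learn the rows are the true classes and the columns the predicted classes, in sorted label order (0 = negative, 1 = positive). The sketch below recomputes the metrics from the Random Forest matrix printed above, treating positive as the class of interest.

```python
import numpy as np

# Values copied from the Random Forest confusion matrix above.
cm = np.array([[838, 147],
               [171, 841]])

# ravel() flattens row by row: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the reviews predicted positive, how many were
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The accuracy computed this way, 1679 / 1997, matches the `accuracy_score` value reported in the previous cell.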

In [25]:
# One thing to try: limit the vocabulary
# keep only the 5000 most frequent words (max_features)
cv = CountVectorizer(max_features=5000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()
rf.fit(X_train_bow, y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test, y_pred)*100

83.97596394591888

### **Use n-gram**

In [26]:
# use unigrams, bigrams and trigrams, keeping the 50000 most frequent n-grams
cv = CountVectorizer(ngram_range=(1, 3), max_features=50000)

X_train_ngram = cv.fit_transform(X_train['review']).toarray()
X_test_ngram = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_ngram, y_train)
y_pred = rf.predict(X_test_ngram)

# accuracy
accuracy_score(y_test, y_pred)*100

85.678517776665

### **Use TF-IDF**
- Commonly used in information retrieval systems

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_tfidf, y_train)
y_pred = rf.predict(X_test_tfidf)

# accuracy
accuracy_score(y_test, y_pred)*100

85.728592889334
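
For reference, with its default settings `TfidfVectorizer` uses the raw count as term frequency, a smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below recomputes this by hand on a made-up three-document corpus and compares it with the library output; the hand-written count matrix assumes the alphabetical column order `bad, good, movie, plot`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good good plot"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Raw counts, columns in alphabetical order: bad, good, movie, plot.
counts = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 0],
                   [0, 2, 0, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)             # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf                            # tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

print(np.allclose(X, manual))
```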

### **Use Word2Vec**
- Two options:
  1. Use Pre-trained Word2Vec
  2. Train Your Word2Vec on Your Data (Make sure you have sufficient data)
- Sentence >---> average of its word vectors (`avg word2vec`)
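
The averaging step can be sketched without any trained model. In the example below the 3-dimensional vectors in `toy_vectors` are invented for illustration (real Word2Vec embeddings usually have 100 to 300 dimensions and would come from a pre-trained or self-trained model), and `average_vector` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical word embeddings, not real Word2Vec output.
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
}

def average_vector(sentence, vectors, dim=3):
    """Represent a sentence as the mean of the vectors of its known words."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        # No word of the sentence is in the vocabulary.
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(average_vector("good movie", toy_vectors))
```

Every sentence becomes a fixed-length vector regardless of how many words it has, so the result can be fed to the same classifiers used with BoW and TF-IDF. Words missing from the embedding vocabulary are skipped.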