## Text Classification

##### nlpcloud.io

In [2]:
import numpy as np
import pandas as pd

In [78]:
temp_df = pd.read_csv("data/IMDB Dataset.csv")
df = temp_df.iloc[:10000]
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [79]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [80]:
print(f"Total number of reviews -> {df['sentiment'].count()}")
print(f"Reviews per sentiment class -> {df['sentiment'].value_counts()}")

Total number of reviews -> 10000
Reviews per sentiment class -> sentiment
positive    5028
negative    4972
Name: count, dtype: int64


In [81]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [82]:
df.duplicated().sum()

17

In [83]:
df.drop_duplicates(inplace=True)   # inplace=True modifies df directly instead of returning a new DataFrame

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop_duplicates(inplace=True)   # inplace=True modifies df directly instead of returning a new DataFrame
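
The warning above appears because `df` was created as a slice of `temp_df` (`temp_df.iloc[:10000]`), so pandas cannot tell whether in-place changes are meant to reach `temp_df` as well. One common way to avoid it is to take an explicit copy of the slice. This is a minimal sketch, not the notebook's own code, and the tiny DataFrame is made up for illustration:

```python
import pandas as pd

# Made-up stand-in for the IMDB data; the second and third rows are duplicates.
temp_df = pd.DataFrame({
    "review": ["great film", "awful plot", "awful plot", "loved it"],
    "sentiment": ["positive", "negative", "negative", "positive"],
})

# .copy() makes df an independent DataFrame rather than a slice of temp_df,
# so later in-place edits are unambiguous.
df = temp_df.iloc[:3].copy()
df.drop_duplicates(inplace=True)

print(len(df))       # rows left in the copy after de-duplication
print(len(temp_df))  # the source DataFrame is untouched
```

The same `.copy()` call should also quiet the later warnings raised by assignments of the form `df['review'] = ...`.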


In [84]:
df.duplicated().sum()

0

In [85]:
# removing tags
import re
def remove_tags(raw_text):
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text

### How the `remove_tags` function works

- **Input:** Takes `raw_text` as input.
- **Purpose:** Removes HTML tags from the input text.
- **How:**  
  - Uses Python’s `re` (regular expression) module.
  - `re.compile('<.*?>')` creates a pattern that matches a `<`, then as few characters as possible (`.*?` is non-greedy), then the next `>`, so each HTML tag is matched on its own.
  - `re.sub(pattern, '', raw_text)` replaces all matched HTML tags in `raw_text` with an empty string (`''`), effectively removing them.
- **Output:** Returns the cleaned text without HTML tags.
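
To see why the pattern uses the non-greedy `.*?` rather than `.*`, here is a small comparison. The sample string is made up for illustration:

```python
import re

sample = "Great film.<br /><br />Loved the <b>ending</b>!"

# Non-greedy: each match stops at the first '>', so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first '<' to the last '>',
# swallowing the text between the tags as well.
greedy = re.sub(r"<.*>", "", sample)

print(non_greedy)  # Great film.Loved the ending!
print(greedy)      # Great film.!
```

Note that replacing a tag with an empty string can glue two sentences together when there is no space around the tag (`film.Loved`). Replacing with a single space instead is one possible variation.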

In [86]:
df['review'] = df['review'].apply(remove_tags)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'].apply(remove_tags)


In [87]:
df['review'][1]

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'

In [88]:
# to lowercase
df['review'] = df['review'].apply(lambda x:x.lower())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'].apply(lambda x:x.lower())


In [89]:
df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'

In [90]:
# remove stop words

In [91]:
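from nltk.corpus import stopwords

# First attempt: correct but slow, because stopwords.words('english') is
# re-read for every word; appending '' also leaves double spaces behind.
def remove_stopwords(text):
    new_text = []
    for word in text.split():
        if word in stopwords.words('english'):
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)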

In [92]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once; needs nltk.download('stopwords') to have been run

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])
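
A quick way to check the set-based version on a single sentence without downloading the NLTK corpus. The tiny stop-word set and the function name below are hypothetical stand-ins for `stopwords.words('english')` and `remove_stopwords`, not the real list:

```python
# Hypothetical mini stop-word list, used only so the example runs offline.
demo_stop_words = {"a", "the", "is", "and", "of", "to"}

def remove_stopwords_demo(text, stop_words=demo_stop_words):
    # Set membership is O(1) on average, so each word is checked in constant time.
    return " ".join(word for word in text.split() if word not in stop_words)

result = remove_stopwords_demo("the plot is thin and the acting is a joy to watch")
print(result)  # plot thin acting joy watch
```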


In [93]:
df['review'] = df['review'].apply(remove_stopwords)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['review'] = df['review'].apply(remove_stopwords)


In [94]:
df

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. filming technique...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive
...,...,...
9995,"fun, entertaining movie wwii german spy (julie...",positive
9996,"give break. anyone say ""good hockey movie""? kn...",negative
9997,movie bad movie. watching endless series bad h...,negative
9998,"movie probably made entertain middle school, e...",negative


In [95]:
x = df.iloc[:, 0:1]   # the slice 0:1 keeps 'review' as a one-column DataFrame (2D), the tabular shape most ML/DL models expect
y = df['sentiment']   # selecting a single column returns a Series (1D)
y


0       positive
1       positive
2       positive
3       negative
4       positive
          ...   
9995    positive
9996    negative
9997    negative
9998    negative
9999    positive
Name: sentiment, Length: 9983, dtype: object

In [96]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(y)
y

array([1, 1, 1, ..., 0, 0, 1])
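
To check which integer stands for which sentiment, the fitted encoder can be inspected. A minimal sketch on three made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "positive"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(labels)

# classes_ is sorted, and each class is encoded as its index in that array.
print(demo_encoder.classes_.tolist())                      # ['negative', 'positive']
print(encoded.tolist())                                    # [1, 0, 1]
print(demo_encoder.inverse_transform([0, 1]).tolist())     # ['negative', 'positive']
```

So in the output above, `1` means positive and `0` means negative.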

In [97]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)  # an integer seed makes the split reproducible
len(x_train), len(y_train), len(x_test), len(y_test)

(7986, 7986, 1997, 1997)

In [98]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
x_train_bow = cv.fit_transform(x_train['review']).toarray()
x_test_bow = cv.transform(x_test['review']).toarray()

x_train_bow[1].shape

(48282,)

In [99]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(x_train_bow, y_train)

Parameter,Value
priors,None
var_smoothing,1e-09


In [100]:
y_pred = gnb.predict(x_test_bow)

In [101]:
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, y_pred)

0.6324486730095142

In [102]:
confusion_matrix(y_test, y_pred)

array([[717, 235],
       [499, 546]])
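
The confusion matrix can be read by hand to see where the errors come from. scikit-learn puts true labels on the rows and predicted labels on the columns, and the encoding above maps negative to 0 and positive to 1. A short worked calculation using the counts printed above:

```python
# Counts copied from the confusion matrix above.
# Row = true label, column = predicted label; 0 = negative, 1 = positive.
cm = [[717, 235],
      [499, 546]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_pos = tp / (tp + fp)   # of reviews predicted positive, share that are truly positive
recall_pos = tp / (tp + fn)      # of truly positive reviews, share that were found

print(round(accuracy, 4))        # 0.6324
print(round(precision_pos, 4))
print(round(recall_pos, 4))
```

The accuracy matches the `accuracy_score` result, and the 499 false negatives show that this model misses nearly half of the positive reviews.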

In [104]:
# using random forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(x_train_bow, y_train)

Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [105]:
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test, y_pred)

0.8467701552328493

In [106]:
# using unigrams + bigrams (ngram_range=(1,2)), limited to the 500 most frequent features
cv = CountVectorizer(ngram_range=(1,2), max_features=500)

x_train_bow = cv.fit_transform(x_train['review']).toarray()
x_test_bow = cv.transform(x_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(x_train_bow, y_train)


Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [107]:
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test, y_pred)

0.8062093139709564

In [108]:
# using TF-IDF: a feature-extraction step that weights each term by how rare it is across reviews; the weighted vectors are then fed to the classifier
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

rf = RandomForestClassifier()

x_train_bow = tfidf.fit_transform(x_train['review']).toarray()
x_test_bow = tfidf.transform(x_test['review']).toarray()

rf.fit(x_train_bow, y_train)

Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [111]:
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test, y_pred)

0.8467701552328493