Warning: several cells below are expensive in time and memory (see the `%%time` outputs).

# Text Cleaning

Text data preprocessing steps:
1. remove punctuation
2. remove stopwords
3. lowercase the text
4. tokenization
5. stemming and lemmatization
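
As a rough illustration of how these steps fit together, here is a minimal standard-library sketch. The tiny `STOPWORDS` set and the `clean_text` helper are hypothetical stand-ins (the notebook itself uses NLTK's stopword list), and stemming/lemmatization is left out because it needs NLTK. Lowercasing is done first here so that stopword matching is case-insensitive.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"the", "is", "a", "of", "and", "in", "it"}

def clean_text(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                     # lowercase first so matching is case-insensitive
    text = re.sub(r"[^\w\s]", " ", text)    # replace punctuation with spaces
    tokens = text.split()                   # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_text("It is the best movie, and the ending is a surprise!"))
```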

Word Embedding:

Transform words (text) into vectors, which are then used as the input training data for a model.

Reference:

https://github.com/SpencerPao/Natural-Language-Processing/tree/main/Text_Preprocessing

## Import data

In [1]:
import numpy as np
import pandas as pd

In [2]:
# google drive connection (for coding on google colab)
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
folder_path = 'C:/Users/26011/Downloads/IMDB/' #change it for your personal use
# input_path = folder_path + 'combined_dat_feature_engineered.csv'
movie_dat_path = folder_path + 'IMDB_movie_details.json'
review_dat_path = folder_path + 'IMDB_reviews.json'
# combined_dat = pd.read_csv(input_path)
movie_dat = pd.read_json(movie_dat_path,lines=True)
review_dat = pd.read_json(review_dat_path, lines=True)

In [4]:
movie_dat.head()

Unnamed: 0,movie_id,plot_summary,duration,genre,rating,release_date,plot_synopsis
0,tt0105112,"Former CIA analyst, Jack Ryan is in England wi...",1h 57min,"[Action, Thriller]",6.9,1992-06-05,"Jack Ryan (Ford) is on a ""working vacation"" in..."
1,tt1204975,"Billy (Michael Douglas), Paddy (Robert De Niro...",1h 45min,[Comedy],6.6,2013-11-01,Four boys around the age of 10 are friends in ...
2,tt0243655,"The setting is Camp Firewood, the year 1981. I...",1h 37min,"[Comedy, Romance]",6.7,2002-04-11,
3,tt0040897,"Fred C. Dobbs and Bob Curtin, both down on the...",2h 6min,"[Adventure, Drama, Western]",8.3,1948-01-24,Fred Dobbs (Humphrey Bogart) and Bob Curtin (T...
4,tt0126886,Tracy Flick is running unopposed for this year...,1h 43min,"[Comedy, Drama, Romance]",7.3,1999-05-07,Jim McAllister (Matthew Broderick) is a much-a...


In [5]:
movie_dat['plot_summary'][0]

"Former CIA analyst, Jack Ryan is in England with his family on vacation when he suddenly witnesses an explosion outside Buckingham Palace. It is revealed that some people are trying to abduct a member of the Royal Family but Jack intervenes, killing one of them and capturing the other, and stops the plan in its tracks. Afterwards, he learns that they're Irish revolutionaries and the two men are brothers. During his court hearing the one that's still alive vows to get back at Jack but is sentenced and that seems to be the end of it. However, whilst the man is being transported, he is broken out. Jack learns of this but doesn't think there's anything to worry about. But, when he is at the Naval Academy someone tries to kill him. He learns that they are also going after his family and so he rushes to find them, safe but having also been the victims of a failed assassination. That's when Jack decides to rejoin the CIA, and they try to find the man before he makes another attempt.         

In [6]:
review_dat['review_text'][0]

'In its Oscar year, Shawshank Redemption (written and directed by Frank Darabont, after the novella Rita Hayworth and the Shawshank Redemption, by Stephen King) was nominated for seven Academy Awards, and walked away with zero. Best Picture went to Forrest Gump, while Shawshank and Pulp Fiction were "just happy to be nominated." Of course hindsight is 20/20, but while history looks back on Gump as a good film, Pulp and Redemption are remembered as some of the all-time best. Pulp, however, was a success from the word "go," making a huge splash at Cannes and making its writer-director an American master after only two films. For Andy Dufresne and Co., success didn\'t come easy. Fortunately, failure wasn\'t a life sentence.After opening on 33 screens with take of $727,327, the $25M film fell fast from theatres and finished with a mere $28.3M. The reasons for failure are many. Firstly, the title is a clunker. While iconic to fans today, in 1994, people knew not and cared not what a \'Shaws

After an initial check, we choose to use "plot_summary" from the movie data and "review_text" from the reviews data for our text processing and analysis.

"plot_synopsis" is too long, and "review_summary" does not capture the key information needed to determine whether a review is a spoiler.

## Remove Punctuations

Don't do this part if we apply sentence embedding!
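
If punctuation removal is used (that is, for word-level rather than sentence-level embedding), the character class in the pattern decides what survives. A minimal sketch with the standard `re` module, where the sample string is made up: `[^\w\s.]` keeps periods only, while `[^\w\s.,]` keeps both commas and periods.

```python
import re

sample = "Wait, what?! It's over... (really)."

# Keep word characters, whitespace, and periods; commas are removed too.
keep_period = re.sub(r"[^\w\s.]", "", sample)

# Keep commas as well by adding them to the character class.
keep_comma_period = re.sub(r"[^\w\s.,]", "", sample)

print(keep_period)
print(keep_comma_period)
```

With pandas, the same patterns can be passed to `Series.str.replace`, but `regex=True` has to be given explicitly in pandas 2.0 and later, where the default treats the pattern as a literal string.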



In [None]:
# remove all punctuation except periods (add ',' inside the brackets to keep commas as well)
# def remove_punctuation(text):
#     return text.str.replace(r"[^\w\s.]", '', regex=True)
# movie_dat['plot_summary'] = movie_dat['plot_summary'].str.replace(r"[^\w\s.]", '', regex=True)
# review_dat['review_text'] = review_dat['review_text'].str.replace(r"[^\w\s.]", '', regex=True)

In [None]:
# movie_dat['plot_summary'][0]

In [None]:
# combined_dat = review_dat.merge(movie_dat, how='left',on='movie_id')

In [None]:
# movie_dat['plot_summary'][0]

In [None]:
# review_dat['review_text'][0]

In [None]:
# save the data temporarily
# output_path_v1 = folder_path + 'combined_data_preprocessed.csv'
# combined_dat.to_csv(output_path_v1)

## To Lowercase

In [7]:
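# NOTE: .str.lower() returns a new Series; the result is not assigned here, so the columns
# stay unchanged. Use e.g. review_dat['review_text'] = review_dat['review_text'].str.lower()
movie_dat['plot_summary'].str.lower()
review_dat['review_text'].str.lower()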

0         in its oscar year, shawshank redemption (writt...
1         the shawshank redemption is without a doubt on...
2         i believe that this film is the best story eve...
3         **yes, there are spoilers here**this film has ...
4         at the heart of this extraordinary movie is a ...
                                ...                        
573908    go is wise, fast and pure entertainment. assem...
573909    well, what shall i say. this one´s fun at any ...
573910    go is the best movie i have ever seen, and i'v...
573911    call this 1999 teenage version of pulp fiction...
573912    why was this movie made? no doubt to sucker in...
Name: review_text, Length: 573913, dtype: object

## Remove Stopwords and Stemming

In [8]:
# stopword list from NLTK (the commented-out lines below load one from https://www.ranks.nl/stopwords instead)
# PorterStemmer is set up here, but stemming is not applied in the function below
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_lst = set(stopwords.words('english'))
stemmer = PorterStemmer()
# stopword_path = folder_path + 'stopwords.txt'
# stopwords = pd.read_csv(stopword_path,header = None)
# stopwords.rename(columns={0:'Stopwords'}, inplace = True)
# stopwords_lst = list(stopwords['Stopwords'])
def stem_and_remove_stopwords(text, stopwords_lst):
    # despite the name, this only removes stopwords; the match is case-sensitive
    word_lst = text.split(' ')
    return ' '.join([x for x in word_lst if x not in stopwords_lst])
# test = movie_dat.head(5)
# test['plot_summary'] = test['plot_summary'].apply(lambda x: stem_and_remove_stopwords(x, stopwords_lst))
# test['plot_summary'][0]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\26011\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
%%time
# combined_dat[text_cols] = combined_dat[text_cols].apply(lambda x: remove_stopwords(x, stopwords_lst))
movie_dat['plot_summary'] = movie_dat['plot_summary'].apply(lambda x: stem_and_remove_stopwords(x, stopwords_lst))
review_dat['review_text'] = review_dat['review_text'].apply(lambda x: stem_and_remove_stopwords(x, stopwords_lst))

CPU times: total: 21.5 s
Wall time: 21.8 s
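
Because `x not in stopwords_lst` compares raw tokens against NLTK's lowercase list, capitalized stopwords such as "It", "He", and "The" survive in the cleaned text, as do tokens with attached punctuation. A minimal sketch of a case-insensitive variant follows; the small `stop_set` and the helper name `remove_stopwords_ci` are hypothetical, and punctuation is stripped only for the comparison, not from the kept tokens.

```python
import string

# Hypothetical small stopword set; the notebook uses NLTK's English list.
stop_set = {"it", "is", "he", "the", "that", "but", "and"}

def remove_stopwords_ci(text, stop_set):
    """Drop tokens whose lowercase, punctuation-stripped form is a stopword."""
    kept = []
    for token in text.split():
        key = token.lower().strip(string.punctuation)
        if key not in stop_set:
            kept.append(token)  # keep the original token unchanged
    return " ".join(kept)

print(remove_stopwords_ci("It is revealed that he stops the plan. But, he learns more.", stop_set))
```

Using `text.split()` rather than `text.split(' ')` also avoids empty tokens when the text contains runs of spaces or newlines.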


In [10]:
review_dat['review_text'][0]

'In Oscar year, Shawshank Redemption (written directed Frank Darabont, novella Rita Hayworth Shawshank Redemption, Stephen King) nominated seven Academy Awards, walked away zero. Best Picture went Forrest Gump, Shawshank Pulp Fiction "just happy nominated." Of course hindsight 20/20, history looks back Gump good film, Pulp Redemption remembered all-time best. Pulp, however, success word "go," making huge splash Cannes making writer-director American master two films. For Andy Dufresne Co., success come easy. Fortunately, failure life sentence.After opening 33 screens take $727,327, $25M film fell fast theatres finished mere $28.3M. The reasons failure many. Firstly, title clunker. While iconic fans today, 1994, people knew cared \'Shawshank\' was. On DVD, Tim Robbins laughs recounting fans congratulating "that \'Rickshaw\' movie." Marketing-wise, film\'s nightmare, \'prison drama\' tough sell women, story love two best friends spell winner men. Worst all, movie slow molasses. As Desson

In [11]:
movie_dat['plot_summary'][0]

"Former CIA analyst, Jack Ryan England family vacation suddenly witnesses explosion outside Buckingham Palace. It revealed people trying abduct member Royal Family Jack intervenes, killing one capturing other, stops plan tracks. Afterwards, learns they're Irish revolutionaries two men brothers. During court hearing one that's still alive vows get back Jack sentenced seems end it. However, whilst man transported, broken out. Jack learns think there's anything worry about. But, Naval Academy someone tries kill him. He learns also going family rushes find them, safe also victims failed assassination. That's Jack decides rejoin CIA, try find man makes another attempt.                Written by\nrcs0411@yahoo.com"

In [None]:
# import nltk
# from nltk.corpus import stopwords
# nltk.data.path.append("/path/to/nltk_data")
# nltk.download('stopwords')
# stop_words = set(stopwords.words('english'))
# def remove_stopwords(text,stop_words):
#   token_lst = text.split()
#   filtered_tokens = [word for word in token_lst if word.lower() not in stop_words]
#   filtered_text = ' '.join(filtered_tokens)
#   return filtered_text
# ## apply to combined dataframe
# combined_dat[text_cols] = combined_dat[text_cols].apply(lambda x: remove_stopwords(x,stop_words))

# Basic Embedding

Two ways to embed (vectorize) the text:

1. word tokenization/embedding (Word2Vec, GloVe)
2. sentence tokenization/embedding (BERT)


## 1. Tokenization and Embedding
Since we are dealing with a large dataset, we don't use TF-IDF as the main method, but the **following** options are worth trying. (Personally, I don't think analyzing each word separately is a good idea; we should analyze whole sentences instead, which is sentence embedding.)


1. CountVectorizer and TF-IDF:

The simplest vectorizing methods; they can serve as the baseline, combined with logistic regression and SVM models.

2. Word2Vec:

A pre-trained model provided by Google in which each word is represented by a vector of length 300. It has a high computational cost (but gives sufficient data for deep learning).
(PCA can probably be used for dimensionality reduction.)

3. Transformer (BERT) -> code is at the end:

BERT is another pre-trained model, used via transfer learning, with higher accuracy and better interpretive ability. Similar to Word2Vec, it has a high computational cost but gives sufficient data for deep learning.
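
To make the baseline in item 1 concrete, here is a minimal pure-Python sketch of count vectors and TF-IDF weights on a made-up three-document corpus. It uses the textbook weighting tf × log(N / df); scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "great movie great cast",
    "boring movie",
    "great ending",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Count vectors: one row per document, one column per vocabulary word.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: number of documents containing each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
n_docs = len(docs)

# Textbook TF-IDF: raw term count times log(N / df).
tfidf = [
    [c * math.log(n_docs / df[w]) for c, w in zip(row, vocab)]
    for row in counts
]

print(vocab)
print(counts[0])
```

Words that appear in fewer documents ("boring", "cast", "ending") receive a larger IDF factor than words shared across documents ("great", "movie").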


## BERT Embedding:
There is no need to remove punctuation; apply sentence tokenization before passing the text to the BERT tokenizer.
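
Sentence tokenization can be sketched with a naive regular expression, as below. This is only a stand-in under simple assumptions (the helper `split_sentences` is hypothetical); a trained tokenizer such as NLTK's `sent_tokenize` handles abbreviations better.

```python
import re

def split_sentences(paragraph):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

text = "The film opens slowly. Then everything changes! Is it worth it? Yes."
for sentence in split_sentences(text):
    print(sentence)
```

Note that this pattern needs whitespace after the punctuation mark, so review text such as "life sentence.After opening" (no space after the period) is not split.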


In [12]:
%%time
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def word_lemmatization(paragraph, lemmatizer):
    word_lst = paragraph.split(' ')
    lemmatized_words = [lemmatizer.lemmatize(x) for x in word_lst]
    return ' '.join(lemmatized_words)
movie_dat['plot_summary_cleaned'] = movie_dat['plot_summary'].apply(lambda x: word_lemmatization(x,lemmatizer))
review_dat['review_text_cleaned'] = review_dat['review_text'].apply(lambda x: word_lemmatization(x,lemmatizer))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\26011\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


CPU times: total: 5min 46s
Wall time: 5min 50s


In [13]:
from transformers import BertTokenizer, BertModel
import torch
    
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
## detect whether a GPU is available for training

if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"GPU ({torch.cuda.get_device_name(0)}) is available.")
else:
    device = torch.device('cpu')
    print("train with CPU")

train with CPU


In [17]:
def bert_para_embedding(paragraph, tokenizer, model):
    # choose the device at call time; is_available() must be called with parentheses,
    # otherwise the bare function object is always truthy
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    tokens = tokenizer(paragraph, return_tensors='pt', padding=True, truncation=True)
    tokens = {k: v.to(device) for k, v in tokens.items()}
    model.to(device)
    with torch.no_grad():
        outputs = model(**tokens)
    # embedding of the [CLS] token, moved back to the CPU so that .numpy() works
    cls_embedding = outputs.last_hidden_state[:, 0, :].cpu()
    return cls_embedding

In [26]:
%%time
test = review_dat['review_text_cleaned'][0]
arr = bert_para_embedding(test,bert_tokenizer,bert_model).numpy()
arr_new = arr.reshape(arr.shape[0],-1)
print(arr_new.shape)
# each paragraph is expected to be represented by a vector of 768 elements

(1, 768)
CPU times: total: 8.47 s
Wall time: 1.45 s
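
A back-of-the-envelope estimate shows why the full run is expensive. The sketch below simply scales the numbers reported in this notebook (573,913 reviews, about 1.45 s wall time for one review on CPU, 768 values per vector). It assumes the per-review time stays constant and that vectors are stored as 32-bit floats, both of which are rough assumptions. The `batched` helper is a hypothetical illustration of feeding texts in chunks rather than one at a time.

```python
n_reviews = 573_913       # length of review_dat shown earlier
secs_per_review = 1.45    # wall time of the single-review test
dim = 768                 # size of one BERT [CLS] vector
bytes_per_float = 4       # assumes float32 storage

total_hours = n_reviews * secs_per_review / 3600
total_days = total_hours / 24
memory_gb = n_reviews * dim * bytes_per_float / 1e9

print(f"estimated CPU time: {total_hours:.0f} h (~{total_days:.1f} days)")
print(f"embedding matrix:   {memory_gb:.2f} GB")

def batched(items, batch_size):
    """Yield consecutive chunks of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: 5 texts in batches of 2 gives chunks of sizes 2, 2, 1.
print([len(b) for b in batched(["a", "b", "c", "d", "e"], 2)])
```

The Hugging Face tokenizer also accepts a list of strings, so a chunk of reviews could in principle be embedded in one forward pass; whether that helps enough on CPU would need to be measured.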


In [None]:
review_dat['bert_embedded_vectors'] = review_dat['review_text_cleaned'].apply(lambda x: bert_para_embedding(x,bert_tokenizer, bert_model).numpy())

In [None]:
%%time
movie_dat['bert_embedded_vectors'] = movie_dat['plot_summary_cleaned'].apply(lambda x: bert_para_embedding(x,bert_tokenizer,bert_model).numpy())
review_dat['bert_embedded_vectors'] = review_dat['review_text_cleaned'].apply(lambda x: bert_para_embedding(x,bert_tokenizer,bert_model).numpy())

## Baseline Model: Logistic Regression 

### Compare BERT with Word2Vec (only)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report

In [None]:
# apply countvectorizer
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction.text import TfidfVectorizer

# count_vec_model = CountVectorizer()
# tf_model = TfidfVectorizer()

# x_count = count_vec_model.fit_transform(review_dat['review_text_word'])
# x_tf = tf_model.fit_transform(review_dat['review_text_word'])

In [None]:
review_dat['is_spoiler'] = review_dat['is_spoiler'].apply(lambda x: 1 if x else 0)

In [None]:
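# NOTE: 'embedded_vectors_final' is not created above (the BERT cells write to 'bert_embedded_vectors')
X = np.vstack(review_dat['embedded_vectors_final'])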

In [None]:
Y = review_dat['is_spoiler'].values

In [None]:
%%time
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=42)
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(x_train,y_train)

In [None]:
y_pred = lr_model.predict(x_test)

In [None]:
accuracy = accuracy_score(y_test,y_pred)
print(f'accuracy of logistic regression model: {accuracy}')

In [None]:
print(classification_report(y_test, y_pred))
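
For reference, the per-class numbers in `classification_report` are precision, recall, and F1 (plus support). A minimal pure-Python sketch for the positive (spoiler) class, using made-up labels rather than the notebook's data; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 1 = spoiler, 0 = not a spoiler.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_recall_f1(y_true_demo, y_pred_demo))
```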

### Combined with numerical features (engineered) from movie data

### Combine the plot summary text from movie data

## Other work to do:
1. Add more features, including:

(1) movie plot summary (embedded vectors)

(2) movie information features (in notebook IMDB EDA)

2. Models:

(1) Tree Models: RandomForest, XGBoost Classifier

(2) Deep Learning Models: CNN, RNN, LSTM

(3) (optional) Transfer Learning Models, Advanced Deep Learning Models