# Preprocessing and exploring the dataset

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

## General information about the dataset
- records = 573913
- users = 263407
- movies = 1572
- spoiler reviews = 150924
- users with at least one spoiler review = 79039
- items with at least one spoiler review = 1570

## Importing the movie reviews dataset
We import the reviews dataset and show its first five rows.
More information about this dataset can be found on [this page](https://www.kaggle.com/rmisra/imdb-spoiler-dataset/).

- **review_date:** Date the review was written.
- **movie_id:** Unique id for the item.
- **user_id:** Unique id for the review author.
- **is_spoiler:** Indication whether review contains a spoiler.
- **review_text:** Text review about the item.
- **rating:** Rating given by the user to the item.
- **review_summary:** Short summary of the review.
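
The `lines=True` argument used in the next cell tells pandas that the file holds one JSON object per line. A minimal sketch of that behaviour, using two made-up records (the ids and values below are illustrative and not taken from the dataset):

```python
import io
import pandas as pd

# Two made-up records in JSON Lines format: one JSON object per line.
raw = (
    '{"movie_id": "tt0000001", "is_spoiler": true, "rating": 8}\n'
    '{"movie_id": "tt0000002", "is_spoiler": false, "rating": 3}\n'
)

# lines=True makes pandas parse each line as a separate record.
sample = pd.read_json(io.StringIO(raw), lines=True)
print(sample.dtypes)
print(sample)
```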

In [4]:
df = pd.read_json('IMDB_reviews.json', lines=True)

ValueError: ignored

In [0]:
df.head()

In [0]:
print('Columns of reviews dataset:', df.columns)
print('\nUser reviews shape: ', df.shape)
print('Unique films in reviews dataset:', df['movie_id'].nunique())

In [0]:
# We check if the dataframe contains null values
df.isnull().values.any()

Since the dataset does not contain null values, we don't have to handle missing data. We keep only the columns that we are going to use for classification (the review text and the target variable).
<br>

We also convert the ```is_spoiler``` label from boolean to integer (0 = no spoiler, 1 = spoiler).
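
A minimal sketch of the label conversion on a made-up three-row frame: casting a boolean column with `astype(int)` yields 0 for `False` and 1 for `True`.

```python
import pandas as pd

# Made-up labels, only to illustrate the cast.
toy = pd.DataFrame({'is_spoiler': [True, False, True]})
toy['is_spoiler'] = toy['is_spoiler'].astype(int)
print(toy['is_spoiler'].tolist())  # [1, 0, 1]
```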

In [0]:
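# .copy() gives an independent dataframe, so later assignments don't trigger SettingWithCopyWarning
df = df[['is_spoiler', 'review_text']].copy()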

In [0]:
# Boolean label -> integer label (False -> 0, True -> 1)
df['is_spoiler'] = df['is_spoiler'].astype(int)

In [0]:
df.head()

## Checking if the dataset is balanced or not
We check if the data is balanced. The results indicate that it is not, so **we will have to balance the training set** (the test set must never be balanced, since it should keep the real class distribution).
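
One way to balance only the training data is to hold out the test set first and undersample afterwards. The sketch below uses made-up data and arbitrary choices (a 20% test fraction and `random_state=0`); it is an illustration, not the exact procedure used in this notebook.

```python
import pandas as pd

# Made-up data: 20 reviews, 5 of them spoilers (25%).
toy = pd.DataFrame({
    'review_text': [f'review {i}' for i in range(20)],
    'is_spoiler': [1] * 5 + [0] * 15,
})

# 1) Hold out a test set first; it is left untouched by the balancing step.
test = toy.sample(frac=0.2, random_state=0)
train = toy.drop(test.index)

# 2) Undersample only the training set down to its minority-class size.
n_min = train['is_spoiler'].value_counts().min()
train_balanced = pd.concat([
    train[train['is_spoiler'] == label].sample(n_min, random_state=0)
    for label in (0, 1)
])

print(train_balanced['is_spoiler'].value_counts())
print(test['is_spoiler'].value_counts())
```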

In [0]:
# is_spoiler holds 0/1 integers at this point, so we compare against integers
spoiler_length = len(df.loc[df['is_spoiler'] == 1])
not_spoiler_length = len(df.loc[df['is_spoiler'] == 0])

total = spoiler_length + not_spoiler_length
spoiler_percentage = spoiler_length * 100 / total
not_spoiler_percentage = not_spoiler_length * 100 / total

print(f'Number of reviews with spoilers: {spoiler_length} ({round(spoiler_percentage)}%)')
print(f'Number of reviews without spoilers: {not_spoiler_length} ({round(not_spoiler_percentage)}%)')

# Graphical Representation
labels = 'Spoiler', 'Not spoiler'
explode = (0.1, 0)
plt.pie([spoiler_length, not_spoiler_length], explode=explode, labels = labels,autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Pie chart Spoiler vs Not-Spoiler')
plt.show()

## Balancing the dataset
In the current dataset, 26% of the reviews contain spoilers and 74% do not, so we have to use some technique to balance the data (otherwise, the resulting model would have good accuracy on majority-class samples but almost zero accuracy on minority-class samples). <br>
We considered using the techniques provided by the imbalanced-learn library (imblearn), but we had to discard the idea due to memory problems (the team members' computers ran out of memory when trying to compute the matrices). <br>
We are going to use resampling to balance the dataset. There are two possible ways of applying resampling:
- Oversampling: Duplicate random records of the minority class (problem: possible overfitting)
- Undersampling: Eliminate samples from the majority class (problem: possible loss of information)

<br>
The dataset is relatively big (422,989 reviews without spoilers and 150,924 reviews with spoilers), so we can eliminate samples from the majority class without a significant loss of information. That's why we decided that undersampling is the best option.
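
The two resampling strategies can be compared on a small made-up frame. This is a sketch with plain pandas; the class sizes and `random_state` are arbitrary.

```python
import pandas as pd

# Made-up labels: 3 spoilers (minority) and 9 non-spoilers (majority).
toy = pd.DataFrame({'is_spoiler': [1] * 3 + [0] * 9})
minority = toy[toy['is_spoiler'] == 1]
majority = toy[toy['is_spoiler'] == 0]

# Oversampling: draw minority rows with replacement until both classes match.
oversampled = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])

# Undersampling: keep only a random subset of the majority class.
undersampled = pd.concat([
    majority.sample(len(minority), random_state=0),
    minority,
])

print(len(oversampled), len(undersampled))  # 18 6
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows and produces a smaller frame.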

In [0]:
# Class count
count_not_spoiler, count_spoiler = df.is_spoiler.value_counts()

# Divide by class
df_spoiler = df[ df['is_spoiler'] == 1 ]
df_not_spoiler = df[ df['is_spoiler'] == 0 ]

# Random undersampling
# We sample half the number of spoiler reviews from each class, so both classes end up with the same size
# We use floor division (//) so that the sample size is an integer
df_not_spoiler_under = df_not_spoiler.sample(int(count_spoiler//2))
df_spoiler_under = df_spoiler.sample(int(count_spoiler//2))
df_test_under = pd.concat([df_not_spoiler_under, df_spoiler_under], axis=0)


# The resulting dataset is balanced (75,462 reviews with spoilers and 75,462 without spoilers)
print('Random Undersampling')
df_test_under.is_spoiler.value_counts().plot(kind='bar', title='Count')
plt.show()
print(df_test_under.is_spoiler.value_counts())

After balancing the dataset, we obtain a subset with 75,462 reviews containing spoilers and the same number of reviews without spoilers. Since there is no majority class anymore, we can use an accuracy of 0.5 as the baseline.
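
The baseline mentioned above is the accuracy of a classifier that always predicts the most frequent class. A small sketch with made-up label vectors; `majority_baseline` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def majority_baseline(labels: pd.Series) -> float:
    """Accuracy of a classifier that always predicts the most frequent class."""
    return labels.value_counts(normalize=True).max()

# Made-up labels mimicking the 26% / 74% split and a balanced split.
imbalanced = pd.Series([1] * 26 + [0] * 74)
balanced = pd.Series([1] * 50 + [0] * 50)

print(majority_baseline(imbalanced))  # 0.74
print(majority_baseline(balanced))    # 0.5
```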

## Preprocessing

The preprocessing consists of the following steps:
1. **Lowercase:** We convert the text to lowercase so that the same word is not counted twice because of capitalization
2. **Remove html tags, email addresses and urls**
3. **Remove punctuation symbols**
4. **POS Tagging:** Necessary for lemmatization, since the lemmatizer takes the category of the word as a parameter; otherwise it treats every word as a noun.
5. **Lemmatization:** Receives the type of word as a parameter (obtained from POS tagging). We also tried stemming (both Lancaster and Porter), but it gave worse results.
6. **Stop Words Removal:** We don't apply this earlier because the stop words are needed for POS tagging to work correctly

While exploring the dataset, we also found some words in foreign languages, such as Japanese, Korean, Chinese and Russian. Nevertheless, there were only 30 Japanese words in total, and fewer than 10 in each of the other languages, so we can treat these words as noise. Each method will handle them in a different way (for example, with min_df filtering).
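
The idea behind min_df filtering is to drop tokens that occur in too few documents. scikit-learn's vectorizers expose this as the `min_df` parameter; the pure-Python sketch below only illustrates the principle on made-up token lists, and `filter_rare_tokens` is a hypothetical helper.

```python
from collections import Counter

def filter_rare_tokens(docs, min_df=2):
    """Drop tokens that appear in fewer than `min_df` documents."""
    # Document frequency: each token counts once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_df] for doc in docs]

docs = [
    ['great', 'movie', '映画'],
    ['great', 'ending'],
    ['movie', 'ending', 'twist'],
]
print(filter_rare_tokens(docs))
# [['great', 'movie'], ['great', 'ending'], ['movie', 'ending']]
```

Rare foreign words such as `'映画'` disappear because they occur in a single document, without any language detection.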

In [0]:
# We create a map that converts the output tags of POS tagging into the pos values expected by the lemmatizer
# We use noun ('n') as the default value
pos_map = {
'CC': 'n','CD': 'n', 'DT': 'n','EX': 'n', 'FW': 'n','IN': 'n', 'JJ': 'a','JJR': 'a', 'JJS': 'a','LS': 'n', 'MD': 'v','NN': 'n',
'NNS': 'n','NNP': 'n', 'NNPS': 'n','PDT': 'n', 'POS': 'n','PRP': 'n', 'PRP$': 'r','RB': 'r', 'RBR': 'r','RBS': 'r', 'RP': 'n','TO': 'n',
'UH': 'n','VB': 'v', 'VBD': 'v','VBG': 'v', 'VBN': 'v','VBP': 'v', 'VBZ': 'v','WDT': 'n', 'WP': 'n','WP$': 'n', 'WRB': 'r'
}

In [0]:
# Importing stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
len(stop_words)
stop_words

In [0]:
from nltk.stem import WordNetLemmatizer

example_sent = "<HTML>This <p>is.a</p> ! jua@email.com sentences, showing off the <br> stop words filtration. http://www.youtube.com"
# example_sent = "Hi, it's me"
# Since all the stopwords are in lower case, we have to convert the string to lowercase first
example_sent = example_sent.lower()

# This was a simple tokenizer that kept the punctuation symbols
# word_tokens = word_tokenize(example_sent)

# Removing url, emails and html tags
# HTML TAGS
from bs4 import BeautifulSoup
example_sent = BeautifulSoup(example_sent, 'lxml').text

# EMAIL ADDRESSES
import re
example_sent = re.sub(r'[\w\.-]+@[\w\.-]+', ' ', example_sent)

# URLs
example_sent = re.sub(r'http\S+', '', example_sent)

# Removing punctuation symbols
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
word_tokens = tokenizer.tokenize(example_sent)
# Now we have obtained the tokenized words without punctuation symbols and with stopwords

# POS Tagging the data (the stopwords improve the accuracy of the pos tagging, so we'll remove them later)
# This method returns a list of tuples: (word, classification)
tags = nltk.pos_tag(word_tokens)
print('tags', tags)

lemmatizer = WordNetLemmatizer()
# We lemmatize all the words in the text by their category
for i, word in enumerate(word_tokens):
    # Returns the lemmatized word given its category (if the key is not part of the map, the word is considered a noun)
    word_tokens[i] = lemmatizer.lemmatize(word, pos=pos_map.get(tags[i][1] , 'n'))

# Removing stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Japanese words are kept as a single token, so we can remove them easily, but urls, emails and html tags are split
# into several tokens, so we have to remove them before tokenizing

# In html:  <br>  -->  br
# In email:  jua@email.com  --> jua, email, com
# In url: http://www.youtube.com  --> http, www, youtube, com
filtered_sentence

### Return values of POS Tagging

- **CC:**	coordinating conjunction
- **CD:**	cardinal number
- **DT:**	determiner
- **EX:**	existential there (like: "there is" ... think of it like "there exists")
- **FW:**	foreign word
- **IN:**	preposition/subordinating conjunction
- **JJ:**	adjective	'big'
- **JJR:**	adjective, comparative	'bigger'
- **JJS:**	adjective, superlative	'biggest'
- **LS:**	list marker	1)
- **MD:**	modal	could, will
- **NN:**	noun, singular 'desk'
- **NNS:**	noun plural	'desks'
- **NNP:**	proper noun, singular	'Harrison'
- **NNPS:**	proper noun, plural	'Americans'
- **PDT:**	predeterminer	'all the kids'
- **POS:**	possessive ending	parent's
- **PRP:**	personal pronoun	I, he, she
- **PRP\$:**	possessive pronoun	my, his, hers
- **RB:**	adverb	very, silently
- **RBR:**	adverb, comparative	better
- **RBS:**	adverb, superlative	best
- **RP:**	particle	give up
- **TO:**	to	go 'to' the store.
- **UH:**	interjection	errrrrrrrm
- **VB:**	verb, base form	take
- **VBD:**	verb, past tense	took
- **VBG:**	verb, gerund/present participle	taking
- **VBN:**	verb, past participle	taken
- **VBP:**	verb, non-3rd person singular present	take
- **VBZ:**	verb, 3rd person sing. present	takes
- **WDT:**	wh-determiner	which
- **WP:**	wh-pronoun	who, what
- **WP\$:**	possessive wh-pronoun	whose
- **WRB:**	wh-adverb	where, when

### Possible values of pos in lemmatization
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
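
As a compact alternative to an explicit tag table, the Penn Treebank tags can be mapped to these letters by prefix. This is a simplified sketch and not the `pos_map` defined earlier: `penn_to_wordnet` is a hypothetical helper, and it sends every tag that is not an adjective, verb or `RB*` adverb to the noun default, so a few tags (for example `MD` and `WRB`) are handled differently.

```python
def penn_to_wordnet(tag, default='n'):
    """Map a Penn Treebank tag to a WordNet POS letter by its prefix."""
    if tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS (the particle tag RP is excluded)
    return default  # everything else is treated as a noun

for tag in ['JJR', 'VBD', 'RBS', 'NN', 'RP']:
    print(tag, '->', penn_to_wordnet(tag))
```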

We have to download the stopwords: <br>
```>>> import nltk ``` <br>
```>>> nltk.download('stopwords') ``` <br><br>
We have to download wordnet for lemmatization<br>
```>>> import nltk ``` <br>
```>>> nltk.download('wordnet')``` <br><br>
We have to download the tagger model for POS tagging <br>
```>>> import nltk``` <br>
```>>> nltk.download('averaged_perceptron_tagger')```

# Converting the previous preprocessing into a function
This is the function that we are going to apply in the next notebooks for preprocessing

In [0]:
import re

import nltk  # needed for nltk.pos_tag inside the function
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

# We initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

def tokenizer(example_sent):
    # example_sent = "Hi, it's me"
    # Since all the stopwords are in lower case, we have to convert the string to lowercase first
    example_sent = example_sent.lower()

    # This was a simple tokenizer that kept the punctuation symbols
    # word_tokens = word_tokenize(example_sent)
    
    # Japanese words are kept as a single token, so we can remove them easily, but urls, emails and html tags are split
    # into several tokens, so we have to remove them before tokenizing
    
    # Removing url, emails and html tags
    # HTML TAGS
    example_sent = BeautifulSoup(example_sent, 'lxml').text

    # EMAIL ADDRESSES
    example_sent = re.sub(r'[\w\.-]+@[\w\.-]+', ' ', example_sent)

    # URLs
    example_sent = re.sub(r'http\S+', '', example_sent)

    # Removing punctuation symbols
    # (the local name differs from the function name to avoid shadowing it)
    regexp_tokenizer = RegexpTokenizer(r'\w+')
    word_tokens = regexp_tokenizer.tokenize(example_sent)
    # Now we have obtained the tokenized words without punctuation symbols and with stopwords

    # POS Tagging the data (the stopwords improve the accuracy of the pos tagging, so we'll remove them later)
    # This method returns a list of tuples: (word, classification)
    tags = nltk.pos_tag(word_tokens)

    # We lemmatize all the words in the text by their category
    for i, word in enumerate(word_tokens):
        # Returns the lemmatized word given its category (if the key is not part of the map, the word is considered a noun)
        word_tokens[i] = lemmatizer.lemmatize(word, pos=pos_map.get(tags[i][1] , 'n'))

    # Removing stop words
    filtered_sentence = [w for w in word_tokens if w not in stop_words]

    # In html:  <br>  -->  br
    # In email:  jua@email.com  --> jua, email, com
    # In url: http://www.youtube.com  --> http, www, youtube, com
    return filtered_sentence

In [0]:
test = "<HTML>This <p>is.a</p> ! jua@email.com sentences, showing off the <br> stop words filtration. http://www.youtube.com"
tokenizer(test)

Applying the function to the dataset

In [0]:
%%time
df['review_text'] = df['review_text'].apply(tokenizer)

In [0]:
df.head()

In [0]:
reviews_pie = pd.DataFrame()
reviews_pie['is_spoiler'] = df['is_spoiler']
reviews_pie['has_word_twist'] = df['review_text'].apply(lambda text: 1 if 'twist' in text else 0)
reviews_pie['has_word_end'] = df['review_text'].apply(lambda text: 1 if 'end' in text else 0)
reviews_pie['has_word_spoiler'] = df['review_text'].apply(lambda text: 1 if 'spoiler' in text else 0)
reviews_pie['has_word_die'] = df['review_text'].apply(lambda text: 1 if 'die' in text else 0)
reviews_pie['has_word_death'] = df['review_text'].apply(lambda text: 1 if 'death' in text else 0)

In [0]:
reviews_pie.head()

In [0]:
# Class counts sorted by label (0, 1) and stored in a column named 'is_spoiler'.
# sort_index() + to_frame() avoids relying on the column names produced by reset_index(), which differ between pandas versions
pie1 = reviews_pie['is_spoiler'].value_counts().sort_index().to_frame('is_spoiler')
pie2 = reviews_pie[reviews_pie['has_word_twist'] == 1]['is_spoiler'].value_counts().sort_index().to_frame('is_spoiler')
pie3 = reviews_pie[reviews_pie['has_word_end'] == 1]['is_spoiler'].value_counts().sort_index().to_frame('is_spoiler')
pie4 = reviews_pie[reviews_pie['has_word_spoiler'] == 1]['is_spoiler'].value_counts().sort_index().to_frame('is_spoiler')
pie5 = reviews_pie[reviews_pie['has_word_die'] == 1]['is_spoiler'].value_counts().sort_index().to_frame('is_spoiler')
pie6 = reviews_pie[reviews_pie['has_word_death'] == 1]['is_spoiler'].value_counts().sort_index().to_frame('is_spoiler')

with plt.style.context('seaborn-talk'):
    fig = plt.figure(figsize=(16, 16))

    ax1 = fig.add_subplot(3, 2, 1)
    ax2 = fig.add_subplot(3, 2, 2)
    ax3 = fig.add_subplot(3, 2, 3)
    ax4 = fig.add_subplot(3, 2, 4)
    ax5 = fig.add_subplot(3, 2, 5)
    ax6 = fig.add_subplot(3, 2, 6)

    ax1.pie(pie1['is_spoiler'])
    ax1.set_title('All reviews')

    ax2.pie(pie2['is_spoiler'])
    ax2.set_title('Reviews containing the word \'twist\'')

    ax3.pie(pie3['is_spoiler'])
    ax3.set_title('Reviews containing the word \'end\'')

    ax4.pie(pie4['is_spoiler'])
    ax4.set_title('Reviews containing the word \'spoiler\'')
    
    ax5.pie(pie5['is_spoiler'])
    ax5.set_title('Reviews containing the word \'die\'')
    
    ax6.pie(pie6['is_spoiler'])
    ax6.set_title('Reviews containing the word \'death\'')

    plt.suptitle('Spoiler distribution within the reviews', fontsize=20)
    fig.legend(labels=['Without spoilers', 'With spoilers'], loc='center')

    plt.show()

These charts suggest that it is possible to find patterns in the data.
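
The same comparison can be expressed as numbers by computing the share of spoiler reviews with and without a given word. The sketch below uses a made-up six-row frame with the same column names as `reviews_pie`; the values are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Made-up flags: 1 means the review is a spoiler / contains the word 'twist'.
toy = pd.DataFrame({
    'is_spoiler':     [1, 1, 0, 0, 1, 0],
    'has_word_twist': [1, 1, 1, 0, 0, 0],
})

# Overall spoiler rate versus the rate conditioned on the word flag.
overall_rate = toy['is_spoiler'].mean()
rate_by_flag = toy.groupby('has_word_twist')['is_spoiler'].mean()

print(overall_rate)
print(rate_by_flag)
```

A conditional rate that differs clearly from the overall rate indicates that the word carries some signal about the label.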