# Explore and Process
In this notebook we explore the dataset and preprocess the text through the following steps:
* Tokenization
* Delete the Stop Words
* Remove Punctuation
* Lemmatization

## Import Libraries
We use several libraries: pandas to read the dataset, pandarallel to parallelize operations on it, nltk for tokenization, stop-word removal and lemmatization, and the `re` module to remove punctuation.


In [None]:
import pandas as pd
from pandarallel import pandarallel
import ast
import matplotlib.pyplot as plt
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

## Explore the dataset

### Read the datasets

In [None]:
dataRew=pd.read_json('../Dataset/IMDB_reviews.json',lines=True)
dataMovie=pd.read_json('../Dataset/IMDB_movie_details.json',lines=True)

In [None]:
dataRew.info()

There are no null fields

In [None]:
dataRew[dataRew['review_text']=='']

In [None]:
dataMovie.info()

### Some plot synopses are missing
We exclude movies whose synopsis is missing, because the plot summary alone does not contain the plot twists useful for detecting spoilers.

In [None]:
dataMovie[dataMovie["plot_synopsis"]=='']

In [None]:
dataMovie=dataMovie[dataMovie["plot_synopsis"]!='']

In [None]:
dataMovie[dataMovie["plot_synopsis"]=='']

In [None]:
dataMovie.head()

In [None]:
dataRew.head()

### Compute the average, minimum and maximum length of the reviews and the plot synopses

In [None]:
ColonnaL=dataRew["review_text"].apply(len)

In [None]:
ColonnaL.mean()

The average review length is 1460 characters

In [None]:
ColonnaL.min()

Minimum review length: 18 

In [None]:
ColonnaL.max()

Maximum review length: 14963
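
Note that `len` applied to a string counts characters, so the figures above are character counts rather than word counts. A minimal sketch of the difference, using two made-up reviews (not taken from the dataset):

```python
# Hypothetical example reviews, used only to illustrate the two measures
reviews = ["Great movie!", "The twist at the end surprised me."]

# Character count: what Series.apply(len) computes on a text column
char_lengths = [len(review) for review in reviews]

# Word count: split on whitespace first, then count the pieces
word_lengths = [len(review.split()) for review in reviews]

print(char_lengths)
print(word_lengths)
```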

### Plot synopsis

In [None]:
tempMovie=dataMovie["plot_synopsis"].apply(len)

In [None]:
tempMovie.mean()

The average plot synopsis length is 9644 characters

In [None]:
tempMovie.min()

Minimum plot synopsis length: 45 

In [None]:
tempMovie.max()

Maximum plot synopsis length: 63904

### Compute the earliest and latest review dates

In [None]:
unique_data=dataRew["review_date"].unique()

In [None]:
data = pd.to_datetime(unique_data)

In [None]:
unique_data

In [None]:
data

In [None]:
data.min()

In [None]:
data.max()
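
The conversion to datetime matters because the minimum and maximum of raw date strings are computed lexicographically. A minimal sketch with the standard library, assuming the dates look like `10 February 2006` (the format and the sample values are assumptions made for illustration):

```python
from datetime import datetime

# Hypothetical review dates in an assumed "day month-name year" format
raw_dates = ["10 February 2006", "2 January 1999", "25 December 2017"]

# Comparing strings picks the value whose first character sorts lowest
string_min = min(raw_dates)

# Parsing first gives a true chronological comparison
parsed = [datetime.strptime(d, "%d %B %Y") for d in raw_dates]
earliest, latest = min(parsed), max(parsed)

print(string_min)
print(earliest.date(), latest.date())
```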

### Distribution of spoilers
The dataset is imbalanced

In [None]:
values=dataRew["is_spoiler"].value_counts()

In [None]:
values.plot(kind='bar',color=['blue','red'])
plt.xlabel('Spoiler')
plt.ylabel('Number of reviews')
plt.title('Distribution of spoilers in reviews')
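
Besides the bar chart, the imbalance can be summarized as class shares. A minimal sketch with the standard library, using a made-up list of labels rather than the real `is_spoiler` column:

```python
from collections import Counter

# Hypothetical labels: three non-spoiler reviews and one spoiler review
labels = [False, False, False, True]

counts = Counter(labels)
# Share of each class over the total number of reviews
shares = {label: n / len(labels) for label, n in counts.items()}

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(shares.values())

print(shares)
print(majority_baseline)
```

The majority baseline is a useful reference point: on an imbalanced dataset, a model has to beat it before its accuracy means anything.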

### Estimation of pre-announced spoilers (reviews containing spoiler alert)

In [None]:
only_text = dataRew[['movie_id', 'review_text','is_spoiler']]

In [None]:
# Estimate: select reviews containing the word "spoiler" and exclude negations such as "spoiler free", "not a spoiler", etc.
filtered_reviews = only_text[
    only_text['review_text'].str.contains('spoiler', case=False) &  # contains "spoiler"
    # doesn't contain exclusion phrases; (?<!not ) skips "spoiler free" only when it directly follows "not "
    ~only_text['review_text'].str.contains(r'no spoiler|not a spoiler|non spoiler|not to spoiler|not spoiler|any spoiler|(?<!not )spoilers? free', case=False)
]
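
When excluding "spoiler free" unless it follows "not", keep in mind that a character class such as `[^not]` matches one single character that is not `n`, `o` or `t`; it does not mean "not preceded by the word not". A minimal sketch comparing it with a negative lookbehind, on made-up sentences:

```python
import re

# Character class: one character other than 'n', 'o', 't', then " spoiler(s) free"
char_class = re.compile(r"[^not] spoilers* free", re.IGNORECASE)
# Negative lookbehind: "spoiler(s) free" not immediately preceded by "not "
lookbehind = re.compile(r"(?<!not )spoilers? free", re.IGNORECASE)

# Hypothetical sentences, not taken from the dataset
samples = [
    "This is a spoiler free review",  # preceded by "a "
    "This is not spoiler free",       # preceded by "not "
    "Spoiler free review",            # phrase at the very start
    "A short spoiler free review",    # preceded by a word ending in 't'
]

matches_char_class = [bool(char_class.search(s)) for s in samples]
matches_lookbehind = [bool(lookbehind.search(s)) for s in samples]

print(matches_char_class)
print(matches_lookbehind)
```

The character class misses the phrase at the start of a text and after any word ending in `n`, `o` or `t`, while the lookbehind only skips the phrase when it directly follows "not ".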


In [None]:
spoilers = only_text[only_text['is_spoiler'] == True]
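labels_3 = ['Only marked as spoiler', 'Also contain a textual spoiler alert']
# Pie slices must not overlap: count the announced spoilers separately from the remaining ones
n_announced = len(filtered_reviews[filtered_reviews['is_spoiler'] == True])
values_2 = [len(spoilers) - n_announced, n_announced]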
colors_2 = ['#ACBED8','#41608b']

plt.pie(values_2, labels=labels_3, colors=colors_2, autopct='%1.1f%%', startangle=140)

# Show the plot
plt.axis('equal')
plt.show()


### More exploration

In [None]:
#HOW MANY OF THE "ANNOUNCED SPOILERS" ARE ACTUALLY SPOILERS?
labels_2 = ['Are spoilers', 'Are not']
values_2 = [len(filtered_reviews[filtered_reviews['is_spoiler'] == True]),len(filtered_reviews[filtered_reviews['is_spoiler'] == False]) ]
colors_2 = ['#D52941', '#65B891']
plt.bar(labels_2, values_2, color=colors_2, width=0.9)

In [None]:
#Number of reviews per user
user_total_comments = dataRew.groupby('user_id').size().reset_index(name='count_total')
user_total_comments.hist(bins=100, range=(0,8))

In [None]:
user_total_comments.count_total.mean()

## Processing Text
In this part of the notebook we tokenize the text, remove stop words and punctuation, and lemmatize the remaining tokens.

## Review Cleaning

Initialize pandarallel using all available CPU cores on the machine

In [None]:
# initialize pandarallel to work in parallel on the dataset

pandarallel.initialize(progress_bar=True)

In [None]:
# This function reads back a partial dataset, in case we save the dataset after each operation.
def takeDataset(x):
    # When the CSV is read back, the last column (clean_review) contains strings, not lists
    dataRew=pd.read_csv(x)
    # Convert each string back into a list before removing stop words
    dataRew["clean_review"]=dataRew.loc[:,"clean_review"].parallel_apply(ast.literal_eval)
    # Return the restored DataFrame; otherwise the result is lost when the function ends
    return dataRew
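
Lists stored in a column are written to CSV as their string representation, which is why `ast.literal_eval` is needed after reading the file back. A minimal sketch of the same round trip using only the standard `csv` module (the movie id and the tokens are made up for illustration):

```python
import ast
import csv
import io

# Write one row whose second field is a list of tokens
buffer = io.StringIO()
csv.writer(buffer).writerow(["tt0111161", ["great", "movie"]])

# Read the row back: every field is now plain text
buffer.seek(0)
movie_id, tokens_as_text = next(csv.reader(buffer))

# literal_eval safely parses the text back into a Python list
tokens = ast.literal_eval(tokens_as_text)

print(repr(tokens_as_text))
print(tokens)
```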

### Tokenization

In [None]:
nltk.download('punkt')

In [None]:
def tokenize_text(text):
    from nltk.tokenize import WordPunctTokenizer
    tokenizer=WordPunctTokenizer()
    return tokenizer.tokenize(text)
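
`WordPunctTokenizer` splits text into runs of word characters and runs of punctuation, based on the regular expression `\w+|[^\w\s]+`. A minimal sketch that reproduces this behaviour with the standard `re` module (the function name `tokenize_like_wordpunct` and the sample sentence are ours, for illustration only):

```python
import re

# Same pattern that WordPunctTokenizer is built on:
# either a run of word characters or a run of non-word, non-space characters
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def tokenize_like_wordpunct(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text)

tokens = tokenize_like_wordpunct("It's a spoiler-free review!!")
print(tokens)
```

Contractions and hyphenated words are split around the punctuation, so the punctuation ends up in tokens of its own.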


In [None]:
dataRew['clean_review'] = dataRew.loc[:,"review_text"].parallel_apply(tokenize_text)

### Delete StopWords

In [None]:
nltk.download('stopwords')

In [None]:
def remove_Stop(x):
  from nltk.corpus import stopwords
  stop_words = set(stopwords.words('english'))
  filtered_sentence = [word for word in x if word.lower() not in stop_words]
  return filtered_sentence
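
The filter compares the lowercased token with the stop-word list but keeps the surviving tokens in their original casing. A minimal sketch of the same logic with a tiny hand-written stop-word set (a hypothetical subset, not the full NLTK list):

```python
# Hypothetical subset of English stop words, for illustration only
STOP_WORDS = frozenset({"the", "is", "a", "not"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

filtered = remove_stop_words(["The", "ending", "is", "not", "a", "surprise"])
print(filtered)
```

The NLTK English list also contains negations such as "not", so phrases like "not a spoiler" lose their negation after this step.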

In [None]:
dataRew['clean_review']=dataRew.loc[:,'clean_review'].parallel_apply(remove_Stop)

### Remove punctuation with a regex

In [None]:
def remove_punct2(words_list):
    import re
    pattern=re.compile(r'[^\w\s]')
    clean_words=[word for word in words_list if not pattern.search(word)]
    return clean_words
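
This function discards every token that contains at least one punctuation character, rather than stripping the punctuation from inside the token. A minimal sketch of the effect on a made-up token list (the function name `drop_tokens_with_punct` is ours):

```python
import re

# Any character that is neither a word character nor whitespace
PUNCT = re.compile(r"[^\w\s]")

def drop_tokens_with_punct(words_list):
    """Keep only the tokens that contain no punctuation character."""
    return [word for word in words_list if not PUNCT.search(word)]

# Hypothetical tokens: the underscore counts as a word character in \w
kept = drop_tokens_with_punct(["spoiler", "-", "free", "!!", "don't", "movie_2"])
print(kept)
```

After `WordPunctTokenizer` the punctuation is already isolated in separate tokens, so in practice only those punctuation tokens are removed.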

In [None]:
dataRew['clean_review']=dataRew.loc[:,'clean_review'].parallel_apply(remove_punct2)

In [None]:
dataRew["clean_review"]

In [None]:
dataRew.loc[:,"clean_review"]

### Lemmatization

In [None]:
nltk.download('punkt')
nltk.download('wordnet')

In [None]:
def lemmatize(text):
    from nltk.stem import WordNetLemmatizer
    lemmatizer=WordNetLemmatizer()
    lem_token=[lemmatizer.lemmatize(token.lower()) for token in text]
    return lem_token


In [None]:
dataRew['clean_review']=dataRew.loc[:,'clean_review'].parallel_apply(lemmatize)

In [None]:
dataRew.loc[:,"clean_review"]

Let's save the cleaned dataset for later use

In [None]:
dataRew.to_csv('../Dataset/datiClean.csv', index=False)

### Stemming
We also tried stemming in addition to lemmatization, but we opted for the latter since stemming appears too "aggressive".

In [None]:
def stem_text(x):
  from nltk.stem import SnowballStemmer
  stemmer=SnowballStemmer('english')
  s_words=[stemmer.stem(word) for word in x]
  return s_words

In [None]:
def ApplyStem():
    dataRew['clean_review']=dataRew.loc[:,'clean_review'].parallel_apply(stem_text)

## Processing synopsis of the plot
In this part we apply to the plot synopses the same steps used for the review text.

### Tokenize

In [None]:
def tokenize_text(text):
   from nltk.tokenize import WordPunctTokenizer
   tokenizer=WordPunctTokenizer()
   return tokenizer.tokenize(text)

In [None]:
dataMovie["plot_clean"]=dataMovie.loc[:,"plot_synopsis"].parallel_apply(tokenize_text)

In [None]:
dataMovie["plot_clean"]

### Remove Stop Words

In [None]:
nltk.download('stopwords')
def remove_Stop(x):
   from nltk.corpus import stopwords
   stop_words=set(stopwords.words('english'))
   filtered_sentence=[word for word in x if word.lower() not in stop_words]
   return filtered_sentence

In [None]:
dataMovie["plot_clean"]=dataMovie.loc[:,"plot_clean"].parallel_apply(remove_Stop)

In [None]:
dataMovie["plot_clean"]

### Remove punctuation

Let's use the same function used for the review text

In [None]:
dataMovie['plot_clean']=dataMovie.loc[:,'plot_clean'].parallel_apply(remove_punct2)

In [None]:
dataMovie["plot_clean"]

### Lemmatization

In [None]:
def lemmatize(text):
    from nltk.stem import WordNetLemmatizer
    lemmatizer=WordNetLemmatizer()
    lem_token=[lemmatizer.lemmatize(token.lower()) for token in text]
    return lem_token


In [None]:
dataMovie['plot_clean']=dataMovie.loc[:,'plot_clean'].parallel_apply(lemmatize)

### Stemming
Not used: we also tried stemming in addition to lemmatization, but we opted for the latter since stemming appears too "aggressive".

In [None]:
def Stem(x):
   from nltk.stem import PorterStemmer
   stemmer=PorterStemmer()
   words=[stemmer.stem(word) for word in x]
   return words

### Save the Dataset

In [None]:
dataMovie.to_csv("../Dataset/movieclean.csv")

In [None]:
dataMovie