# Exploring The Book Thief (Movie) Sentiment Analysis using Binary Logistic Regression
This notebook covers the following steps:
* Obtain labelled training data from Kaggle
* Create a cleaning tool
* Choose a model (binary logistic regression) and split the data into train and test sets
* Build and train the model
* Scrape The Book Thief reviews from IMDb
* Predict the polarity of the reviews

## Sentiment Analysis
Sentiment analysis is an NLP process that identifies the tone expressed in a text: positive, negative or neutral.
Also known as opinion mining, it is especially useful for gathering feedback on how customers feel about a brand or product, which helps a business better understand customer needs.

## NLP: Natural Language Processing
This is a branch of artificial intelligence that allows a computer to read, understand and draw meaning from human-readable text. Humans understand words, whereas computers work with numbers, so human-readable text has to undergo several transformations before machines can make sense of it. These include:
* Segmentation - Breaking a document down into its constituent sentences
* Tokenization - Breaking sentences down into words
* Removal of stopwords - Stopwords are words filtered out of the document because they are so common that they carry little meaning for natural language processing. They include prepositions, pronouns, conjunctions etc.
* Lemmatization - Grouping together the inflected forms of a word so they can be analysed as a single word. Example: is, are and am are all forms of the verb 'to be' and share the lemma 'be'.
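
The four steps above can be sketched in plain Python. This is only an illustration: the stopword set and lemma table below are tiny, hypothetical stand-ins for the NLTK resources used later in the notebook, and the function names are made up for this example.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use a library such as NLTK.
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "am", "was", "it", "and", "of", "i"}
TOY_LEMMAS = {"is": "be", "are": "be", "am": "be", "movies": "movie", "loved": "love"}


def segment(document):
    """Segmentation: split a document into sentences after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def tokenize(sentence):
    """Tokenization: lowercase a sentence and keep runs of letters as tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def lemmatize(token):
    """Lemmatization: look the token up in the toy lemma table."""
    return TOY_LEMMAS.get(token, token)


def preprocess(document):
    """Run segmentation, tokenization, stopword removal and lemmatization."""
    result = []
    for sentence in segment(document):
        tokens = tokenize(sentence)
        kept = [t for t in tokens if t not in TOY_STOPWORDS]  # stopword removal
        result.append([lemmatize(t) for t in kept])
    return result


print(preprocess("The movies are great. I loved it!"))
# [['movie', 'great'], ['love']]
```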

## Binary Logistic Regression to get the tone
A variety of models can be used to identify the tone of a text, including support vector machines (SVMs), random forests and Naive Bayes. For this project, we will identify one of two tones: negative or positive.

Logistic regression predicts the probability of an event or outcome. The possible outcomes are categorical, meaning they are identified by the names or labels given to them. In our case, we have prelabelled data where positive = 1 and negative = 0.
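
As a rough illustration of how a probability becomes a label, logistic regression passes a real-valued score through the sigmoid function and compares the result with a threshold. In the sketch below, `z` stands for the weighted sum of the features plus an intercept; the function names and the 0.5 threshold are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, else 0 (negative)."""
    return 1 if sigmoid(z) >= threshold else 0


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
# -2.0 0.119 0
# 0.0 0.5 1
# 2.0 0.881 1
```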

In [4]:
#importing the pandas library, reading and exploring the dataset
import pandas as pd
dataset = pd.read_csv("labeledTrainData.tsv", sep='\t', index_col=None)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [6]:
#exploring the dataset 
dataset.head()



Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [15]:
# counting the distinct sentiment labels
dataset['sentiment'].nunique()


2

In [16]:
# checking the class balance: number of reviews per sentiment label
dataset['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

In [17]:
# dropping rows with missing values, then dropping the id column
dataset = dataset.dropna()
dataset = dataset.drop(dataset.columns[0], axis=1)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  25000 non-null  int64 
 1   review     25000 non-null  object
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [18]:
#importing libraries that will aid in text cleaning and declaring regexes of items for removal
import re
import string

html_tags = re.compile(r'<[^<]+?>')
emojis = re.compile(
    "(["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "])"
)

In [11]:
# import libraries that will allow for stopword removal, tokenization and lemmatization
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer().lemmatize
wordpunct_tokenize = WordPunctTokenizer().tokenize
en_stop = set(stopwords.words('english'))
print(en_stop)


{"you'll", 'theirs', 'herself', 'mustn', 'if', 'who', 'have', 'o', "it's", 'wasn', 'wouldn', 'whom', 'about', 'weren', 'only', 'she', 'on', "you've", 'before', "wasn't", 'most', 'their', 'had', 'with', 're', 'while', 's', 'now', "mightn't", 'has', "shouldn't", 'do', 'our', 'an', 'down', 'him', "you're", 'why', 'doesn', "needn't", "don't", 'how', 'won', 'was', 'ours', 'having', 'when', 'didn', "won't", 'through', 'after', 'don', 'yourself', 'this', 'isn', 'here', 'each', 've', 'and', 'out', 'off', 'there', 'myself', "she's", 'he', 'which', 'being', 'such', "couldn't", 'then', 'mightn', 'of', 'between', 'i', 'in', 'hasn', 'we', 'yours', 'm', 'doing', 'himself', 'above', 'by', 'other', 'yourselves', 'a', 'against', 'them', 'at', 'shouldn', "wouldn't", "you'd", 'once', "shan't", 'from', 'been', "haven't", "isn't", 'themselves', 'during', 'so', 'for', 'haven', 'itself', 'too', 'couldn', 'your', 'not', 'my', 'you', "hadn't", 'same', 'until', 'again', 'am', 'just', 'can', 'own', 'that', 'unde

In [19]:
# creating a function that cleans the review text; it takes a dataframe as an argument
def cleaning_tool(dataframe):
    '''
    Cleans the 'review' column of the dataframe in place and returns the dataframe.
    '''
    # removes html tags
    dataframe['review'] = dataframe['review'].apply(lambda words: re.sub(html_tags, ' ', words))
    # removes emojis
    dataframe['review'] = dataframe['review'].apply(lambda words: re.sub(emojis, ' ', words))
    # removes digits
    dataframe['review'] = dataframe['review'].apply(lambda words: ''.join([x for x in words if not x.isdigit()]))
    # removes non-ASCII characters
    dataframe['review'] = dataframe['review'].apply(lambda words: ''.join([x for x in words if x.isascii()]))
    # converts text into lowercase (the result is wrapped in a one-element list)
    dataframe['review'] = dataframe['review'].apply(lambda words: [str(words).lower()])
    # does word tokenization on the string form of that list
    dataframe['review'] = dataframe['review'].apply( lambda words: wordpunct_tokenize(str(words)))
    # removes punctuation marks and turns the token list back into a single string
    dataframe['review'] = dataframe['review'].apply(lambda words: ''.join([str(words).translate( str.maketrans('', '', string.punctuation))]))
    # lemmatization: the lemmatizer receives the whole review string as one word,
    # so individual words pass through unchanged
    dataframe['review'] = dataframe['review'].apply(lambda words: lemmatizer(str(words)))
    # removes stopwords
    dataframe['review'] = dataframe['review'].apply( lambda words: ' '.join([x for x in str(words).split() if x not in (en_stop)]))

    return dataframe

# applying the cleaning tool to the labelled data dataframe
data = cleaning_tool(dataset)
data.head()

Unnamed: 0,sentiment,review
0,1,stuff going moment mj started listening music ...
1,1,classic war worlds timothy hines entertaining ...
2,0,film starts manager nicholas bell giving welco...
3,0,must assumed praised film greatest filmed oper...
4,1,superbly trashy wondrously unpretentious explo...
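
The cleaned output still contains inflected forms such as "worlds" and "starts", because `cleaning_tool` passes each whole review string to the lemmatizer. A token-level variant might look like the sketch below. It is an illustration rather than the notebook's method: `clean_text` is a hypothetical helper, the stopword set and lemma table are toy stand-ins, and using it would change the vocabulary and the scores reported later.

```python
import re
import string

# Translation table that replaces every punctuation character with a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stop_words=frozenset(), lemmatize=lambda word: word):
    """Clean one review and lemmatize it word by word."""
    text = re.sub(r"<[^<]+?>", " ", text)  # strip HTML tags
    text = "".join(ch for ch in text if ch.isascii() and not ch.isdigit())
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    # Drop stopwords first, then lemmatize each remaining token on its own.
    return " ".join(lemmatize(t) for t in tokens if t not in stop_words)


# Toy resources, assumed for this example only.
toy_stop_words = {"the", "was", "a"}
toy_lemmas = {"worlds": "world", "starts": "start"}

print(clean_text("The <br />War of the Worlds was a 10/10!",
                 toy_stop_words, lambda w: toy_lemmas.get(w, w)))
# war of world
```

With the objects already defined in this notebook, the call would be `dataset['review'].apply(lambda r: clean_text(r, en_stop, lemmatizer))`.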


In [20]:
#Defining the dependent and independent variables
Y = data['sentiment']
X = data['review']

In [21]:
#Text vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
reviews_vec = TfidfVectorizer()
X_data = reviews_vec.fit_transform(X)
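
To make the vectorization step less opaque, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper name `tfidf` is made up, and the sketch simply splits on whitespace, whereas scikit-learn's default tokenizer also lowercases the text and ignores single-character tokens.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["great film", "bad film"])
print(vocab)
print([round(w, 4) for w in rows[0]])
# ['bad', 'film', 'great']
# [0.0, 0.5797, 0.8148]
```

The word "film" appears in both documents, so it receives a lower weight than "great", which appears in only one.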

In [22]:
# creating a dataframe for use in feature comparison
X_data_array = X_data.toarray()
vocab = reviews_vec.get_feature_names_out()
X_data_df = pd.DataFrame(X_data_array, columns=vocab)
X_data_df.info()

<bound method DataFrame.info of         aa  aaa  aaaaaaah  aaaaah  aaaaatch  aaaahhhhhhh  aaaand  aaaarrgh  \
0      0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
1      0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
2      0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
3      0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
4      0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
...    ...  ...       ...     ...       ...          ...     ...       ...   
24995  0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
24996  0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
24997  0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
24998  0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   
24999  0.0  0.0       0.0     0.0       0.0          0.0     0.0       0.0   

       aaah  aaargh  ...  zyura

In [24]:
# splitting the dataset into train and test data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X_data, Y, test_size=0.20, random_state=12)


In [25]:
# building the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)


In [None]:
#plot the sigmoid curve

In [28]:
# obtaining the model accuracy
lr.score(x_test, y_test)


0.887
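
For a classifier, `score` returns the mean accuracy, that is, the fraction of test reviews whose predicted label matches the true label. The sketch below shows that calculation on made-up labels, together with the four confusion counts behind it; the helper names and the toy labels are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, true negatives, false positives, false negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


# Toy labels, not taken from the notebook's data.
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
print(confusion_counts(toy_true, toy_pred))  # (2, 1, 1, 1)
print(accuracy(toy_true, toy_pred))          # 0.6
```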

## Section 2 : Scraping the Movie Reviews from IMDb and Feeding them into our Model for Sentiment Analysis


In [29]:
#importing beautifulsoup and requests library for webscraping
from bs4 import BeautifulSoup
import requests

In [46]:
# declaring necessary urls and setting up to web-scrape
start_url = 'https://www.imdb.com/title/tt0816442/reviews?ref_=tt_urv'


In [47]:
# fetching the first page of reviews and extracting the text of each review
page = requests.get(start_url)
soup = BeautifulSoup(page.text, "html.parser")
list_of_reviews = [review.get_text() for review in soup.find_all(
            'div', class_="text show-more__control")]

# collecting the reviews into a dataframe
book_thief = pd.DataFrame({'review': list_of_reviews})
book_thief.head()

Unnamed: 0,review
0,Those familiar with the 2005 award winning and...
1,No extended fight scenes. No unnecessary pyrot...
2,"I do not know the book. But the film, for its ..."
3,Many books have been deemed 'unfilmable' - but...
4,"""The Book Thief"" is certainly a rare kind of f..."


In [48]:
# cleaning the dataset
book_thief = cleaning_tool(book_thief)
book_thief.head()


Unnamed: 0,review
0,familiar award winning best selling novel aust...
1,extended fight scenes unnecessary pyrotechnics...
2,know book film beautiful simplicity useful hig...
3,many books deemed unfilmable anyone read marku...
4,book thief certainly rare kind film day gleams...


In [49]:
# vectorizing the reviews
bookthief_revs = book_thief['review']
reviews_vec = TfidfVectorizer()
bookthief_data = reviews_vec.fit_transform(bookthief_revs)

In [50]:
# creating a dataframe for feature comparison
bookthief_data_array = bookthief_data.toarray()
bookthiefvocab = reviews_vec.get_feature_names_out()
bookthief_data_df = pd.DataFrame(bookthief_data_array, columns=bookthiefvocab)
bookthief_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Columns: 1926 entries, abandoned to zusak
dtypes: float64(1926)
memory usage: 376.3 KB


In [51]:
# accounting for missing vocabulary: align the columns with the training vocabulary,
# filling words absent from the scraped reviews with 0 and dropping words the model never saw
bookthief_data_df = bookthief_data_df.reindex(
    columns=X_data_df.columns, fill_value=0.0)
bookthief_data_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Columns: 73616 entries, aa to zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
dtypes: float64(73616)
memory usage: 14.0 MB


In [52]:
# converting back to sparse matrix
from scipy import sparse
bookthief_data_array2 = bookthief_data_df.to_numpy()
bookthief_data = sparse.csr_matrix(bookthief_data_array2)
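
Aligning the columns by hand gives the new matrix the right shape, but its weights come from a vectorizer fitted on the scraped reviews, so the IDF values differ from those the model was trained on. A common alternative is to keep the vectorizer fitted on the training data and call `transform` on the new text. The sketch below shows that pattern on a made-up corpus; the variable names and toy reviews are assumptions, and applying the same idea here would mean not reassigning `reviews_vec` before transforming the scraped reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, assumed for illustration only.
train_reviews = ["great moving film", "wonderful acting great story",
                 "boring bad film", "terrible acting bad story"]
train_labels = [1, 1, 0, 0]
new_reviews = ["great story wonderful film", "bad boring unseenword"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_reviews)  # learns vocabulary and IDF
model = LogisticRegression().fit(train_matrix, train_labels)

# transform() reuses the learned vocabulary and IDF weights;
# words the vectorizer never saw (here "unseenword") are ignored.
new_matrix = vectorizer.transform(new_reviews)
predictions = model.predict(new_matrix)

print(train_matrix.shape[1] == new_matrix.shape[1])  # True
print(predictions)
```

Because `transform` returns a sparse matrix with the training columns already in place, the dense dataframe and the reindexing step are not needed in this variant.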

In [53]:
# predicting the sentiment score
predictedvalues = lr.predict(bookthief_data)

In [54]:
# Listing the number of reviews as classified by polarity
print(f'Number of Positive Reviews : {list(predictedvalues).count(1)}')
print(f'Number of Negative Reviews : {list(predictedvalues).count(0)}')


Number of Positive Reviews : 21
Number of Negative Reviews : 4
