# Introduction

- Movie Review Sentiment Analysis
This notebook explores the task of sentiment analysis on a movie review dataset. The goal is to build a machine learning model that can classify a movie review as positive or negative based on its text content.

- Dataset
The dataset comes from the Sentiment Polarity Data Set v2.0 from [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/) by Pang, Lee and Vaithyanathan. It consists of a collection of movie reviews along with their corresponding sentiment labels (positive or negative). The dataset is divided into a training set and a test set.

The training set contains labeled movie reviews (the Content column) that are used to train the sentiment analysis model. The test set is labeled as well: its labels are not shown to the model and are used only to check how well the trained model predicts the sentiment of reviews it has not seen.

- Approach
To perform sentiment analysis on the movie reviews, we will follow these steps:

 1. Load the data: We load movie_reviews_train and movie_reviews_test into pandas dataframes.

 2. Import the necessary libraries: pandas plus the scikit-learn classes used for vectorization, modeling, and evaluation.

 3. Explore the training data: This involves examining the structure of the data, checking for missing values, and gaining insight into the distribution of positive and negative reviews.

 4. Explore the test data: The same checks are applied to the test dataset.

 5. Pre-process the data: Before training a machine learning model, we need to convert the text data into a form the model can use.

 6. Train the model: Fit a sentiment analysis model on the training dataset.

 7. Predict and score the test results: Run the test dataset through the model and print the model accuracy.

 8. Validate the model: Pass new text to the model to see the predicted result.

 9. Evaluate the model: Try different parameters to see whether they improve model accuracy.
 


## Step 1: Load the Data

The first step in our sentiment analysis is to import the datasets using the pandas library.

In [1]:
import pandas as pd

reviews_training = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_train.csv")
reviews_test = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_test.csv")


Using the pandas read_csv function, we have imported the movie review training dataset and the movie review test dataset into two pandas dataframes.

In [2]:
print(reviews_training)

                                                Content Label
0     every once in a while you see a film that is s...   pos
1     the love for family is one of the strongest dr...   pos
2     after the terminally bleak reservoir dogs and ...   pos
4     having not seen , " who framed roger rabbit " ...   pos
...                                                 ...   ...
1795   " holy man " boasts a sweet , gentle , comic ...   neg
1796  alexander dumas' the three musketeers is one o...   neg
1797   " have you ever heard the one about a movie s...   neg
1798  this is the first film in what would become th...   neg
1799  first impressions : critically , a close-to-aw...   neg

[1800 rows x 2 columns]


Printing reviews_training confirms that the dataset has 1800 rows and only 2 columns: Content and Label. It also gives us a basic understanding of what the training dataset looks like.

In [3]:
print(reviews_test)

                                               Content Label
0    hedwig ( john cameron mitchell ) was born a bo...   pos
1    one of the more unusual and suggestively viole...   pos
2    what do you get when you combine clueless and ...   pos
3    >from the man who presented us with henry : th...   pos
4    tibet has entered the american consciousness s...   pos
..                                                 ...   ...
195  my inner flag was at half-mast last year when ...   neg
196  if anything , " stigmata " should be taken as ...   neg
197  woof ! too bad that leap of faith was the titl...   neg
198  the plot of big momma's house is martin lawren...   neg
199  in the year 2029 , captain leo davidson ( mark...   neg

[200 rows x 2 columns]


Printing reviews_test confirms that it has the same 2 columns as reviews_training, Content and Label, and that it has 200 rows. It also gives us a basic understanding of what the test dataset looks like.

We have loaded reviews_training and reviews_test into pandas dataframes and printed them. We will explore the training and test datasets in the next steps.

## Step 2: Import the necessary libraries

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report



## Step 3: Exploring the Training Data

Prior to performing any analysis, we need to carefully examine the data set to ensure we understand what we are analyzing.

In [5]:
reviews_training.shape

(1800, 2)

We see that there are 1800 data objects, and 2 attributes in our data set.

In [6]:
reviews_training.describe()

                                                  Content Label
count                                                1800  1800
unique                                               1800     2
top     every once in a while you see a film that is s...   pos
freq                                                    1   900


The result of reviews_training.describe() summarizes the reviews and their labels. Count shows there are 1800 reviews and 1800 labels in the dataset. Unique shows there are 1800 unique reviews and 2 unique labels, "pos" and "neg". Top shows the most frequent value in each column; because every review is unique (freq is 1), the review listed there, "every once in a while you see a film that is s...", is no more frequent than any other. For the labels, "pos" has a frequency of 900, so with 1800 rows and only 2 labels the remaining 900 reviews are "neg". The dataset is evenly split between positive and negative sentiment.
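The approach list mentions checking for missing values and the label distribution, which describe() only shows indirectly. A minimal sketch of more direct checks follows. It uses a tiny made-up dataframe (`sample`, hypothetical) in place of reviews_training so that it runs on its own; the same calls should work on the real dataframe.

```python
import pandas as pd

# Tiny made-up stand-in for reviews_training (the real one has 1800 rows).
sample = pd.DataFrame(
    {
        "Content": ["a moving film", "a dull plot", None, "great cast"],
        "Label": ["pos", "neg", "neg", "pos"],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of reviews for each sentiment label.
label_counts = sample["Label"].value_counts()
print(label_counts)

# Number of reviews whose text repeats an earlier review.
duplicate_reviews = sample["Content"].duplicated().sum()
print(duplicate_reviews)
```

On the real data, a missing count of zero in both columns and equal label counts would confirm what describe() suggests.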

## Step 4: Exploring the Test Data

Now let's explore the test dataset in the same way.

In [7]:
reviews_test.shape

(200, 2)

We see that there are 200 data objects, and 2 attributes in our test dataset.

In [8]:
reviews_test.describe()

                                                  Content Label
count                                                 200   200
unique                                                200     2
top     hedwig ( john cameron mitchell ) was born a bo...   pos
freq                                                    1   100


The result of reviews_test.describe() summarizes the test reviews and their labels. Count shows there are 200 reviews and 200 labels in the dataset. Unique shows there are 200 unique reviews and 2 unique labels, "pos" and "neg". Top lists "hedwig ( john cameron mitchell ) was born a bo...", but since every review is unique (freq is 1) it is no more frequent than any other. For the labels, "pos" has a frequency of 100, so the remaining 100 reviews are "neg". The test dataset is also evenly split between positive and negative sentiment.

## Step 5: Split the training and test dataset into content and label

Let's split the training data into content and label before fitting our model.

In [9]:
X_train = reviews_training["Content"]
y_train = reviews_training["Label"]
print(X_train)

0       every once in a while you see a film that is s...
1       the love for family is one of the strongest dr...
2       after the terminally bleak reservoir dogs and ...
4       having not seen , " who framed roger rabbit " ...
                              ...                        
1795     " holy man " boasts a sweet , gentle , comic ...
1796    alexander dumas' the three musketeers is one o...
1797     " have you ever heard the one about a movie s...
1798    this is the first film in what would become th...
1799    first impressions : critically , a close-to-aw...
Name: Content, Length: 1800, dtype: object


In [10]:
print(y_train)

0       pos
1       pos
2       pos
3       pos
4       pos
       ... 
1795    neg
1796    neg
1797    neg
1798    neg
1799    neg
Name: Label, Length: 1800, dtype: object


I split the training dataset into content and label successfully, and the printed result is what I expected: X_train holds the Content column and y_train holds the Label column.

Next, let's split the test data into content and label in the same way.

In [11]:
X_test = reviews_test["Content"]
y_test = reviews_test["Label"]
print(X_test)

0      hedwig ( john cameron mitchell ) was born a bo...
1      one of the more unusual and suggestively viole...
2      what do you get when you combine clueless and ...
3      >from the man who presented us with henry : th...
4      tibet has entered the american consciousness s...
                             ...                        
195    my inner flag was at half-mast last year when ...
196    if anything , " stigmata " should be taken as ...
197    woof ! too bad that leap of faith was the titl...
198    the plot of big momma's house is martin lawren...
199    in the year 2029 , captain leo davidson ( mark...
Name: Content, Length: 200, dtype: object


In [12]:
print(y_test)

0      pos
1      pos
2      pos
3      pos
4      pos
      ... 
195    neg
196    neg
197    neg
198    neg
199    neg
Name: Label, Length: 200, dtype: object


I split the test dataset into content and label successfully, and the printed result is what I expected: X_test holds the Content column and y_test holds the Label column.

## Step 6: Pre-Processing text data in training and test dataset

Our training and test datasets contain text data, so we need to convert the text into numerical features before the model can use it.

In [13]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [14]:
print(X_train)

  (0, 37728)	0.019557322533372945
  (0, 3591)	0.018613017612739185
  (0, 12297)	0.03690729640884355
  (0, 17656)	0.03646139735698193
  (0, 12346)	0.04860691140872605
  (0, 27323)	0.03573101423242628
  (0, 15767)	0.034256965318728114
  (0, 29509)	0.03090024505659402
  (0, 22330)	0.036036739822766194
  (0, 29250)	0.06957206198519013
  (0, 10870)	0.04860691140872605
  (0, 8393)	0.03895421032567651
  (0, 28411)	0.01972213541121538
  (0, 9454)	0.0269234257561756
  (0, 35929)	0.0420119478454016
  (0, 6599)	0.03235271241041815
  (0, 7739)	0.045426753803619664
  (0, 10361)	0.0597766998936176
  (0, 9320)	0.030557274972506234
  (0, 29172)	0.05288284389099513
  (0, 36072)	0.03487211423424892
  (0, 14950)	0.05103150581345296
  (0, 14344)	0.01920511390031357
  (0, 33376)	0.030727111018264492
  (0, 37056)	0.024216623086966205
  :	:
  (1799, 27189)	0.021610419271034357
  (1799, 15413)	0.013757250742321862
  (1799, 2440)	0.026978876380204273
  (1799, 26764)	0.04367002153433109
  (1799, 25356)	0.128851

In [15]:
print(X_test)

  (0, 37810)	0.049233302497535825
  (0, 37809)	0.006722119146968519
  (0, 37801)	0.01783853441303375
  (0, 37728)	0.011269008059395832
  (0, 37554)	0.008063419859195647
  (0, 37426)	0.02731475123347245
  (0, 37372)	0.036934013963421386
  (0, 37368)	0.04424874854344269
  (0, 37329)	0.040087702490054514
  (0, 37167)	0.017575014810606034
  (0, 37156)	0.005845986417172121
  (0, 37076)	0.018148478731249307
  (0, 37056)	0.02093057372291351
  (0, 36818)	0.02543141450152861
  (0, 36797)	0.006603127886297591
  (0, 36117)	0.05019527629175949
  (0, 35952)	0.010879548081688246
  (0, 35862)	0.006741572051307036
  (0, 35348)	0.028258157994876743
  (0, 34962)	0.02568542199981158
  (0, 34908)	0.012919199431329811
  (0, 34650)	0.12026310747016354
  (0, 34531)	0.04141024184007774
  (0, 34462)	0.036526654761964275
  (0, 34335)	0.017941877802180164
  :	:
  (199, 2499)	0.05402996851424845
  (199, 2497)	0.06581649435927639
  (199, 2440)	0.009684470387721593
  (199, 2275)	0.035076806787243134
  (199, 2132)	0

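I use TF-IDF vectorization to convert the text data into numerical features. The vectorizer is fitted on the training text only and then applied unchanged to the test text, so both matrices share the same columns. Each printed line has the form (row, column) value: the row is the review, the column is a term in the vocabulary, and the value is the TF-IDF weight of that term in that review.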
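To make the vectorization step easier to follow, here is a small sketch on a made-up corpus (`train_docs` and `test_docs` are hypothetical and not part of the dataset). It shows that the vocabulary is learned from the training text only and that words unseen during fitting are ignored when the test text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
train_docs = ["a great movie", "a boring movie", "great acting"]
test_docs = ["great movie with unseen words"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training text.
train_matrix = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; it does not learn anything new.
test_matrix = vectorizer.transform(test_docs)

# The default tokenizer drops one-character tokens such as "a".
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# Both matrices have one column per vocabulary term.
print(train_matrix.shape, test_matrix.shape)

# Only "great" and "movie" are known terms, so the test row has 2 non-zero entries.
print(test_matrix.nnz)
```

In the notebook the same pattern yields matrices with tens of thousands of columns, which is why the printed column indices are so large.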

## Step 7: Training our SVM model

Fit an SVM model on the training dataset.

In [16]:
model = SVC()
model.fit(X_train, y_train)

I successfully fitted the SVM model on the training dataset. Next, I will predict labels for the test dataset using the trained model.

## Step 8: Predict the sentiment labels for the test data using the trained model

I pass the vectorized X_test data to the trained model to predict a label for each test review.

In [17]:
y_pred = model.predict(X_test)

## Step 9: Print out the model accuracy result

Our code then calculates the accuracy, precision, and recall scores using the scikit-learn metrics module. I decided to use the classification_report function, which is part of the sklearn.metrics module and generates a text report of classification metrics for a model's performance. It gives an overview of the model's precision, recall, F1-score, and support for each class.

In [18]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         neg       0.84      0.86      0.85       100
         pos       0.86      0.84      0.85       100

    accuracy                           0.85       200
   macro avg       0.85      0.85      0.85       200
weighted avg       0.85      0.85      0.85       200



Let's break down the different metrics in the report:

Precision: For a given class, precision is the ratio of correctly predicted instances of that class to all instances predicted as that class. In this case, the precision for the "neg" (negative) class is 0.84, which means that 84% of the instances predicted as negative were actually negative. The precision for the "pos" (positive) class is 0.86, indicating that 86% of the instances predicted as positive were actually positive.

Recall: For a given class, recall (also known as sensitivity or true positive rate) is the ratio of correctly predicted instances of that class to all actual instances of that class. The recall for the "neg" class is 0.86, meaning that 86% of the actual negative instances were correctly identified as negative. The recall for the "pos" class is 0.84, indicating that 84% of the actual positive instances were correctly identified as positive.

F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of both metrics. In this case, the F1-score is 0.85 for both the "neg" and "pos" classes, indicating a good balance between precision and recall for both classes.

Support: Support represents the number of actual occurrences of each class in the test set. In this case, both the "neg" and "pos" classes have a support of 100, meaning there are 100 instances of each class in the test set.

Accuracy: Accuracy is the overall percentage of correctly predicted instances out of the total instances. The accuracy of the model is 0.85, meaning that it correctly predicted the sentiment for 85% of the instances in the test set.

Macro Average: The macro average is the unweighted mean of each metric (precision, recall, and F1-score) across the classes. In this case it is 0.85 for all three, indicating balanced performance across both classes.

Weighted Average: The weighted average is the mean of each metric across the classes, weighted by the support (number of instances) of each class. Because both classes have 100 instances, it equals the macro average here: 0.85 for precision, recall, and F1-score.

Overall, the classification report shows that the model performs well, with similar precision, recall, and F1-score for the positive and negative classes. The accuracy of 85% suggests that the model predicts the sentiment of the movie reviews reasonably well, but there is still room to improve it.
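The definitions above can be checked by hand. The sketch below recomputes the metrics from four counts. These counts are an assumption: they are one set of values consistent with the rounded report (86 of 100 negative reviews and 84 of 100 positive reviews classified correctly), not output produced by the notebook.

```python
# Assumed counts, inferred from the rounded report; not notebook output.
neg_predicted_neg = 86  # actual "neg", predicted "neg"
neg_predicted_pos = 14  # actual "neg", predicted "pos"
pos_predicted_neg = 16  # actual "pos", predicted "neg"
pos_predicted_pos = 84  # actual "pos", predicted "pos"


def precision_recall_f1(correct, predicted_total, actual_total):
    """Return (precision, recall, F1) for one class."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


neg_scores = precision_recall_f1(
    neg_predicted_neg,
    neg_predicted_neg + pos_predicted_neg,  # everything predicted "neg"
    neg_predicted_neg + neg_predicted_pos,  # everything actually "neg"
)
pos_scores = precision_recall_f1(
    pos_predicted_pos,
    pos_predicted_pos + neg_predicted_pos,  # everything predicted "pos"
    pos_predicted_pos + pos_predicted_neg,  # everything actually "pos"
)

total = neg_predicted_neg + neg_predicted_pos + pos_predicted_neg + pos_predicted_pos
accuracy = (neg_predicted_neg + pos_predicted_pos) / total
# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = (neg_scores[2] + pos_scores[2]) / 2

print([round(score, 2) for score in neg_scores])
print([round(score, 2) for score in pos_scores])
print(round(accuracy, 2), round(macro_f1, 2))
```

With these assumed counts, the rounded values match the rows of the report.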

## Step 10: Demonstrate Making Predictions

I decided to make up my own sentence to check whether the model gives a sensible result.

In [19]:
X_new = 'I really enjoy the movie and will watch it again with my friends.'
X_new_vector = vectorizer.transform([X_new])
print(model.predict(X_new_vector))

['pos']


The model predicts the label I expected. However, a single sentence is not enough to draw conclusions; I would need more data to test the model properly.
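A sketch of how several new sentences could be scored at once follows. It trains on a handful of made-up sentences (`train_texts`, hypothetical) so that it runs on its own, which means its predictions say nothing about the real model. The point is the calling pattern: transform expects a list (or other iterable) of strings, and in the scikit-learn versions I am aware of a bare string is rejected with a ValueError.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up training sentences, for illustration only.
train_texts = [
    "a wonderful and moving film",
    "i really enjoy this great movie",
    "brilliant acting and a great story",
    "a dull and boring film",
    "i really hate this awful movie",
    "terrible acting and a weak story",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

vectorizer = TfidfVectorizer()
model = SVC()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Several new reviews are passed as a list of strings.
new_reviews = [
    "I really enjoy the movie and will watch it again with my friends.",
    "The story was boring and the acting was terrible.",
]
predictions = model.predict(vectorizer.transform(new_reviews))
for review, label in zip(new_reviews, predictions):
    print(label, "|", review)

# A single review still has to be wrapped in a list.
rejected_bare_string = False
try:
    vectorizer.transform("a single review passed without a list")
except ValueError as error:
    rejected_bare_string = True
    print("ValueError:", error)
```

This is why the cell above wraps X_new in square brackets before calling transform.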

## Step 11: Model Experimentation

Experiment 1:

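I decided to set some basic parameters of the TfidfVectorizer class (min_df, max_df, sublinear_tf, and use_idf) to see whether I can improve the model accuracy. Note that min_df=1, max_df=1.0, and use_idf=True are the scikit-learn defaults, so sublinear_tf=True is the only setting that actually changes the vectorizer's behavior in this experiment.

X_train and X_test were overwritten with vectorized matrices in Step 6, so I first reload the raw text from the dataframes before vectorizing again. I change only the parameters of TfidfVectorizer() and keep SVC() at its defaults.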
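To illustrate what sublinear_tf does, the sketch below compares the term-frequency part of the weighting with and without it on a one-sentence made-up document. Setting use_idf=False and norm=None is my choice for the illustration only, so that the raw term-frequency values are visible; the experiment itself keeps IDF weighting and L2 normalization.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# One made-up document in which "good" occurs four times and "bad" once.
docs = ["good good good good bad"]

# use_idf=False and norm=None expose the term-frequency part on its own.
raw_tf = TfidfVectorizer(use_idf=False, norm=None)
log_tf = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)

raw_values = raw_tf.fit_transform(docs).toarray()[0]
log_values = log_tf.fit_transform(docs).toarray()[0]

good = raw_tf.vocabulary_["good"]
bad = raw_tf.vocabulary_["bad"]

# Without sublinear_tf the weight is the raw count.
print(raw_values[good], raw_values[bad])
# With sublinear_tf the count tf is replaced by 1 + ln(tf).
print(log_values[good], log_values[bad])
print(1 + math.log(4))
```

The repeated word is weighted about 2.39 instead of 4, so a term that appears many times in a long review dominates the vector less.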

In [20]:
X_train = reviews_training["Content"]
y_train = reviews_training["Label"]
X_test = reviews_test["Content"]
y_test = reviews_test["Label"]

In [21]:
vectorizer = TfidfVectorizer(min_df=1, max_df=1.0, sublinear_tf=True, use_idf=True)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)


In [22]:
model = SVC()
model.fit(X_train, y_train)

In [23]:
y_pred = model.predict(X_test)

In [24]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         neg       0.90      0.90      0.90       100
         pos       0.90      0.90      0.90       100

    accuracy                           0.90       200
   macro avg       0.90      0.90      0.90       200
weighted avg       0.90      0.90      0.90       200



The metrics are defined as in Step 9; here is how the values changed:

Precision: 0.90 for both the "neg" and "pos" classes, so 90% of the instances predicted as each class actually belong to it.

Recall: 0.90 for both classes, so 90% of the actual instances of each class were correctly identified.

F1-score: 0.90 for both classes, indicating a good balance between precision and recall.

Support: 100 instances of each class in the test set, as before.

Accuracy: 0.90, meaning the model correctly predicted the sentiment for 90% of the instances in the test set.

Macro Average and Weighted Average: 0.90 for precision, recall, and F1-score, indicating balanced performance across both classes.

Overall, accuracy improves from 85% in Step 9 to 90%, with identical scores for the positive and negative classes. There is still room to improve the model accuracy.

Experiment 2: 

I apply more TfidfVectorizer parameters to see whether I can improve the model accuracy:

- max_features: the maximum number of features (terms) to keep from the text data. The most frequent terms across the corpus are kept.
- vocabulary: lets you provide a custom vocabulary (a set of terms) for vectorization. It is left as None here, so the vocabulary is learned from the training text.
- binary: setting binary=True sets all non-zero term counts to 1. Only the term-frequency part becomes binary; IDF weighting and normalization are still applied, so the resulting values are not limited to 0 and 1.
- norm: the normalization method for the TF-IDF matrix. The default value is 'l2', which scales each document's TF-IDF vector to have a Euclidean norm of 1.

As in Experiment 1, I reload the raw text from the dataframes before vectorizing again. I change more parameters of TfidfVectorizer() and keep SVC() at its defaults.
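The effect of max_features and binary can be seen on a small made-up corpus. In this sketch `docs` is hypothetical, and use_idf=False is my choice so that the binary term counts are easy to read; the experiment itself keeps IDF weighting switched on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "good" occurs 3 times, "film" twice, "bad" and "plot" once.
docs = ["good good film", "good film", "bad plot"]

# max_features keeps only the most frequent terms across the corpus.
full = TfidfVectorizer().fit(docs)
capped = TfidfVectorizer(max_features=2).fit(docs)
full_terms = sorted(full.vocabulary_)
capped_terms = sorted(capped.vocabulary_)
print(full_terms)
print(capped_terms)

# binary=True turns every non-zero term count into 1 ...
binary_raw = TfidfVectorizer(binary=True, use_idf=False, norm=None)
raw_row = binary_raw.fit_transform(docs).toarray()[0]
print(raw_row)

# ... but the default L2 normalization still rescales the row afterwards.
binary_l2 = TfidfVectorizer(binary=True, use_idf=False, norm="l2")
l2_row = binary_l2.fit_transform(docs).toarray()[0]
print(l2_row)
```

The first review contains "good" twice, yet its binary row holds 1 for both "good" and "film"; after L2 normalization both become about 0.71, so the final features are not 0/1 values.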

In [25]:
X_train = reviews_training["Content"]
y_train = reviews_training["Label"]
X_test = reviews_test["Content"]
y_test = reviews_test["Label"]

In [26]:
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=10000, vocabulary=None, binary=True, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [27]:
model = SVC()
model.fit(X_train, y_train)

In [28]:
y_pred = model.predict(X_test)

In [29]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         neg       0.90      0.93      0.92       100
         pos       0.93      0.90      0.91       100

    accuracy                           0.92       200
   macro avg       0.92      0.92      0.91       200
weighted avg       0.92      0.92      0.91       200



The metrics are defined as in Step 9; here is how the values changed:

Precision: 0.90 for the "neg" class and 0.93 for the "pos" class, so 90% of the instances predicted as negative and 93% of the instances predicted as positive were correct.

Recall: 0.93 for the "neg" class and 0.90 for the "pos" class, so 93% of the actual negative instances and 90% of the actual positive instances were correctly identified.

F1-score: 0.92 for the "neg" class and 0.91 for the "pos" class, indicating a good balance between precision and recall for both classes.

Support: 100 instances of each class in the test set, as before.

Accuracy: 0.92, meaning the model correctly predicted the sentiment for 92% of the instances in the test set.

Macro Average and Weighted Average: 0.92 for precision, 0.92 for recall, and 0.91 for F1-score. The two averages are identical because both classes have the same support.

Overall, accuracy improves from 90% in Experiment 1 to 92%, with similar scores for the positive and negative classes. There is still room to improve the model accuracy.

Experiment 3: 

I will keep the parameters in TfidfVectorizer the same as in Experiment 2 and change the parameters of the SVM model to see whether I can improve the model accuracy.

As before, I reload the raw text from the dataframes before vectorizing again. This time I keep the TfidfVectorizer() parameters from Experiment 2 and set parameters in SVC() as well.

In [30]:
X_train = reviews_training["Content"]
y_train = reviews_training["Label"]
X_test = reviews_test["Content"]
y_test = reviews_test["Label"]

In [31]:
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, max_features=10000, vocabulary=None, binary=True, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [32]:
model = SVC(kernel='poly', C=1.0, gamma='scale', degree=4, coef0=1.0)
model.fit(X_train, y_train)

In [33]:
y_pred = model.predict(X_test)

In [34]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

         neg       0.91      0.95      0.93       100
         pos       0.95      0.91      0.93       100

    accuracy                           0.93       200
   macro avg       0.93      0.93      0.93       200
weighted avg       0.93      0.93      0.93       200



The metrics are defined as in Step 9; here is how the values changed:

Precision: 0.91 for the "neg" class and 0.95 for the "pos" class, so 91% of the instances predicted as negative and 95% of the instances predicted as positive were correct.

Recall: 0.95 for the "neg" class and 0.91 for the "pos" class, so 95% of the actual negative instances and 91% of the actual positive instances were correctly identified.

F1-score: 0.93 for both the "neg" and "pos" classes, indicating a good balance between precision and recall for both classes.

Support: 100 instances of each class in the test set, as before.

Accuracy: 0.93, meaning the model correctly predicted the sentiment for 93% of the instances in the test set.

Macro Average and Weighted Average: 0.93 for precision, recall, and F1-score, indicating balanced performance across both classes.

Overall, accuracy improves from 92% in Experiment 2 to 93%, with similar scores for the positive and negative classes. This is the best experiment result I have so far.
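The three experiments change parameters by hand and compare scores on the test set. One alternative, which the notebook does not use, is to search parameter combinations with cross-validation on the training data only and keep the test set for a final check. A minimal sketch with scikit-learn's Pipeline and GridSearchCV follows. The toy `texts`, the parameter grid, and cv=2 are assumptions chosen so that the example runs on its own; they are not tuned for the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up training sentences, for illustration only.
texts = [
    "a wonderful and moving film",
    "i really enjoy this great movie",
    "brilliant acting and a great story",
    "a great film with a wonderful cast",
    "a dull and boring film",
    "i really hate this awful movie",
    "terrible acting and a boring story",
    "an awful film with a dull cast",
]
labels = ["pos"] * 4 + ["neg"] * 4

# The pipeline refits the vectorizer inside every cross-validation fold.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Parameter names are "<step name>__<parameter>".
param_grid = {
    "tfidf__sublinear_tf": [False, True],
    "svc__kernel": ["rbf", "poly"],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, search.fit(reviews_training["Content"], reviews_training["Label"]) would follow the same pattern, and search.predict could then be applied to the test reviews once at the end.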

## Conclusion

In this notebook, I applied sentiment analysis to movie review data. It helped me develop a good understanding of how to build a model that turns text into numerical variables. There are three notable things I want to summarize:
1. Understand the dataset: there are 1800 rows in the training dataset and 200 rows in the test dataset, and each dataset has 2 columns (Content and Label). The dataset is small for training and evaluating a model, but it makes a good exercise. One of the most important things I learned is that I need to convert the text data into numerical data by using vectorizer = TfidfVectorizer(). Once the data was vectorized, I could fit a support vector machine to it.

2. Experiment to get better accuracy: first, I ran vectorizer = TfidfVectorizer() and model = SVC() without any parameters. I then applied a few parameters in TfidfVectorizer(), which improved the accuracy a little, and I kept trying more parameters. The results kept getting better, which made me more confident. Finally, changing the configuration of both the vectorizer and the support vector machine gave the best result.

3. Demonstrate making predictions: I first tried passing the first row of the training dataset to the trained model to check the predicted result. However, I kept getting errors like 'lower not found' until I wrote a for loop to break the input down into strings.

In summary, this was a good experience for learning sentiment analysis, and it gives me more confidence to apply it in my future career.