### Overview

In this workbook, we'll use two movie reviews, one positive and one negative, to train a machine learning model to assess sentiment. This isn't enough data for a meaningful model, and we should not expect predictable or consistent results. However, it can be much easier to see what's happening with a very small dataset.

First, we'll create a couple of reviews. The sentiment list assigns each review one of two categories: 1 for positive, 0 for negative.

In [1]:
reviews = ['excellent film, excellent acting, well written screenplay, coherent plot',
    'mediocre film, unconvincing acting, stilted dialog, incoherent plot']
sentiments = [1, 0]

In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame({'review': reviews, 'sentiment': sentiments})

In [4]:
df

Unnamed: 0,review,sentiment
0,"excellent film, excellent acting, well written...",1
1,"mediocre film, unconvincing acting, stilted di...",0


### Bag-Of-Words

To create document vectors for each review, we will first need a full list of all the words used in the collection (in this case, the two documents).

We'll do this using the CountVectorizer class from scikit-learn.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
vectorizer = CountVectorizer()

We provide *all* the rows of the reviews column 

In [7]:
X_train = vectorizer.fit_transform(df['review'])

...to extract all terms in the entire document collection

In [9]:
vectorizer.get_feature_names_out()

array(['acting', 'coherent', 'dialog', 'excellent', 'film', 'incoherent',
       'mediocre', 'plot', 'screenplay', 'stilted', 'unconvincing',
       'well', 'written'], dtype=object)
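To see where this vocabulary comes from, here is a minimal sketch that rebuilds it by hand. It assumes CountVectorizer's default settings (lowercasing, and a token pattern that keeps runs of two or more word characters). `manual_vocabulary` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import re

reviews = ['excellent film, excellent acting, well written screenplay, coherent plot',
           'mediocre film, unconvincing acting, stilted dialog, incoherent plot']

# Assumed to match CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def manual_vocabulary(documents):
    """Return the sorted set of lowercased tokens across all documents (hypothetical helper)."""
    tokens = set()
    for doc in documents:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(tokens)


vocabulary = manual_vocabulary(reviews)
print(vocabulary)
```

The position of each word in this sorted list is the column index used in the term frequency matrix.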

...and create a matrix (in sparse matrix form) showing the frequency of each term in the two documents

In [10]:
print(X_train)

  (0, 3)	2
  (0, 4)	1
  (0, 0)	1
  (0, 11)	1
  (0, 12)	1
  (0, 8)	1
  (0, 1)	1
  (0, 7)	1
  (1, 4)	1
  (1, 0)	1
  (1, 7)	1
  (1, 6)	1
  (1, 10)	1
  (1, 9)	1
  (1, 2)	1
  (1, 5)	1


Normally, you'd want to keep term frequency matrices in sparse form (imagine a document collection with tens or hundreds of thousands of terms, where only a few show up in each document!). For our small collection, we can take a look at the full matrix.

In [11]:
X_train.todense()

matrix([[1, 1, 0, 2, 1, 0, 0, 1, 1, 0, 0, 1, 1],
        [1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]])
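As a rough illustration of why the sparse form matters, the sketch below compares the number of entries the sparse matrix actually stores with the number a dense matrix would hold. It rebuilds the matrix from the two reviews so that it runs on its own; the variable names `stored` and `total` are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['excellent film, excellent acting, well written screenplay, coherent plot',
           'mediocre film, unconvincing acting, stilted dialog, incoherent plot']

X = CountVectorizer().fit_transform(reviews)

n_docs, n_terms = X.shape
stored = X.nnz            # non-zero entries kept by the sparse matrix
total = n_docs * n_terms  # entries a dense matrix would have to hold

print(f"shape: {X.shape}, stored entries: {stored} of {total}")
```

With only two short documents the saving is small, but the gap widens quickly as the vocabulary grows and each document uses only a small fraction of it.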

A dataframe may make it easier to visualize the relationship between the bag of words (the document vocabulary) and the document term vector for each individual record.

In [13]:
df_model = pd.DataFrame({'word': vectorizer.get_feature_names_out(), 
              'positive': X_train.todense().tolist()[0],
              'negative': X_train.todense().tolist()[1]})

In [14]:
df_model

Unnamed: 0,word,positive,negative
0,acting,1,1
1,coherent,1,0
2,dialog,0,1
3,excellent,2,0
4,film,1,1
5,incoherent,0,1
6,mediocre,0,1
7,plot,1,1
8,screenplay,1,0
9,stilted,0,1


### Train an ML model

Now that we have a vector for each record, and the classification for that vector (positive or negative), we can use these records and their classifications to train an ML model to determine sentiment based on term frequency.

In [15]:
from sklearn.ensemble import RandomForestClassifier

We'll start with a random forest classifier. We provide the document term vectors (X_train), and the classifications (sentiment scores). 

Note - once you've gotten your data into document term vectors, scikit-learn makes it easy to swap out different algorithms. 

In [16]:
clf = RandomForestClassifier().fit(X_train, df['sentiment'])
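As a sketch of how little changes when swapping algorithms, the loop below fits two other scikit-learn classifiers on the same document term vectors. The choice of `MultinomialNB` and `LogisticRegression` is an assumption made for illustration; any classifier exposing `fit` and `predict` could be substituted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

reviews = ['excellent film, excellent acting, well written screenplay, coherent plot',
           'mediocre film, unconvincing acting, stilted dialog, incoherent plot']
sentiments = [1, 0]

X_train = CountVectorizer().fit_transform(reviews)

# Only the estimator changes; the vectors and labels stay the same.
predictions = {}
for model in (MultinomialNB(), LogisticRegression()):
    model.fit(X_train, sentiments)
    predictions[type(model).__name__] = model.predict(X_train).tolist()

print(predictions)
```

Predicting on the training reviews themselves only confirms that each model fit the data; it says nothing about how well the model generalizes.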

### Feature importances

Some algorithms provide valuable analysis of your model, even if you don't use it as a predictive tool. Scikit-learn's random forest implementation reports feature importance: the relative contribution of each term to the model's predictions (see the scikit-learn documentation for more info).

Note - because our training data set is small, there will be a heavy element of randomness in the feature importances here. In fact, if you run this workbook repeatedly, you should see substantial variance in these numbers. This tends to be smoothed out with much larger datasets and a larger number of trees.

In [17]:
clf.feature_importances_

array([0.  , 0.1 , 0.16, 0.1 , 0.  , 0.1 , 0.04, 0.  , 0.12, 0.14, 0.08,
       0.06, 0.1 ])
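One way to explore this randomness is to fix the `random_state` parameter of `RandomForestClassifier` and compare a few seeds. The sketch below is illustrative only: the seeds 0, 1 and 2 are arbitrary, and the exact numbers will depend on your scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['excellent film, excellent acting, well written screenplay, coherent plot',
           'mediocre film, unconvincing acting, stilted dialog, incoherent plot']
sentiments = [1, 0]

X_train = CountVectorizer().fit_transform(reviews)

# Fit one forest per seed and keep its feature importances.
importances_by_seed = {
    seed: RandomForestClassifier(random_state=seed).fit(X_train, sentiments).feature_importances_
    for seed in (0, 1, 2)
}

for seed, importances in importances_by_seed.items():
    print(seed, np.round(importances, 2))

# Refitting with the same seed should reproduce the same importances.
repeat = RandomForestClassifier(random_state=0).fit(X_train, sentiments).feature_importances_
print(np.allclose(repeat, importances_by_seed[0]))
```

Words that appear equally often in both reviews (acting, film, plot) cannot separate the two classes, so their importance stays at zero regardless of the seed.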

In [18]:
df_model['feature_importance'] = clf.feature_importances_

In [35]:
df_model

Unnamed: 0,word,positive,negative,feature_importance
0,acting,1,1,0.0
1,coherent,1,0,0.1
2,dialog,0,1,0.16
3,excellent,2,0,0.1
4,film,1,1,0.0
5,incoherent,0,1,0.1
6,mediocre,0,1,0.04
7,plot,1,1,0.0
8,screenplay,1,0,0.12
9,stilted,0,1,0.14


### Make predictions

Let's create two new reviews, positive and negative, and see how our model predicts their sentiment scores.

In [20]:
positive_review = "excellent film, acting was so so but the plot was well thought out"

In [21]:
negative_review = "mediocre acting, everything about this was unconvincing, save your money"

As before, we need to transform these sentences into term frequency vectors, using the bag of words we created for our training set.

In [22]:
positive_vector = vectorizer.transform([positive_review])

In [23]:
positive_vector.todense()

matrix([[1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0]])

In [24]:
negative_vector = vectorizer.transform([negative_review])

In [25]:
negative_vector.todense()

matrix([[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]])

In [28]:
df_results = pd.DataFrame({'word': vectorizer.get_feature_names_out(), 
              'positive': positive_vector.todense().tolist()[0],
              'negative': negative_vector.todense().tolist()[0]})

In [29]:
df_results

Unnamed: 0,word,positive,negative
0,acting,1,1
1,coherent,0,0
2,dialog,0,0
3,excellent,1,0
4,film,1,0
5,incoherent,0,0
6,mediocre,0,1
7,plot,1,0
8,screenplay,0,0
9,stilted,0,0


Scikit-learn's random forest provides two kinds of output: `predict` returns a discrete category (1 or 0), and `predict_proba` returns a probability for each category.

In [30]:
clf.predict(positive_vector)

array([1])

In [31]:
clf.predict(negative_vector)

array([0])

In [32]:
clf.predict_proba(positive_vector)

array([[0.49, 0.51]])

In [33]:
clf.predict_proba(negative_vector)

array([[0.58, 0.42]])
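The two outputs are closely related. In the sketch below, the columns of `predict_proba` follow the order of `clf.classes_`, and picking the most probable column for each row gives the same labels as `predict`. The example refits the model with an arbitrary `random_state=0` so that it runs on its own; the probabilities you see may differ from those above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['excellent film, excellent acting, well written screenplay, coherent plot',
           'mediocre film, unconvincing acting, stilted dialog, incoherent plot']
sentiments = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(reviews)
clf = RandomForestClassifier(random_state=0).fit(X_train, sentiments)

new_reviews = ["excellent film, acting was so so but the plot was well thought out",
               "mediocre acting, everything about this was unconvincing, save your money"]
X_new = vectorizer.transform(new_reviews)

proba = clf.predict_proba(X_new)

# Column i of proba holds the probability of class clf.classes_[i].
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print("classes:", clf.classes_)
print("probabilities:", proba)
print("labels from probabilities:", labels_from_proba)
print("labels from predict:", clf.predict(X_new))
```

Probabilities close to 0.5, like those shown above, indicate that the model is barely more confident in one class than the other.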

In [34]:
# exercise 1 - change the training and test phrases
# what effect does this have on feature importance and prediction?

# exercise 2 - try different ML classification algorithms. 
# from sklearn.neural_network import MLPClassifier
# from sklearn.naive_bayes import MultinomialNB