# Semantic Analysis: Categorization of Sentences #

These notes follow the presentation in the first two-thirds of Chapter 8 of the textbook, with several deviations (given below).  For instance, the textbook shows how to unpack the dataset from scratch, but a cleaner version of the text corpus than the one used in the book can be found on [kaggle.com](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).  It is also available as imdb-dataset.csv in your CNotesCS323_1.

In [2]:
import numpy as np
import pandas as pd

In [3]:
mydata = pd.read_csv("/home/courses/CS323_1/CNotes/imdb-dataset.csv")

In [4]:
mydata.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [5]:
mydata.loc[0,'review']

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [6]:
mydata.shape

(50000, 2)

In [7]:
mydata.dtypes

review       object
sentiment    object
dtype: object

## Goal: Predict 'positive' or 'negative' based on the text with the 'Bag of Words' model ##

**Idea:** For each entry, we must *vectorize* the text--that is, create a (fixed-length) 'feature vector' from the text--a numerical vector we can input into the usual models--SVM, logistic regression, etc.

+ There are many ways to do this.
+ We will use something called a 'Bag of words' model.
+ This is not state-of-the-art ("SOTA") today, but it is still very useful to understand.

# Text Processing with Python #

## Point 1:  Each review is a string ##

In [8]:
mystring = mydata.loc[0,'review']

In [9]:
mystring

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## Point 2: Strings are just sequences of Unicode characters ##

In [11]:
mystring[:50]

'One of the other reviewers has mentioned that afte'

In [12]:
isinstance(mystring, str)

True
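
Because a `str` is a sequence, the usual sequence operations (indexing, slicing, `len`) work on it. One difference from a NumPy array is that strings are immutable. The sketch below uses a made-up sentence rather than the review itself:

```python
# A made-up sentence (not from the dataset) to illustrate sequence behavior.
text = "Oz is not for the faint hearted"

first_char = text[0]     # indexing works like a list
first_word = text[:2]    # slicing returns a new string
n_chars = len(text)      # number of characters

# Strings are immutable: assigning to an index raises TypeError.
try:
    text[0] = "o"
    mutated = True
except TypeError:
    mutated = False

print(first_char, first_word, n_chars, mutated)
```

This is why every string method used below returns a new string instead of changing the original.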

## Point 3: There are lots of Python methods for dealing with strings ##

For instance, see [here](https://www.w3schools.com/python/python_ref_string.asp) or [here](https://www.pythoncheatsheet.org/cheatsheet/manipulating-strings) for a lengthy list.

In [13]:
mystring.title()

"One Of The Other Reviewers Has Mentioned That After Watching Just 1 Oz Episode You'Ll Be Hooked. They Are Right, As This Is Exactly What Happened With Me.<Br /><Br />The First Thing That Struck Me About Oz Was Its Brutality And Unflinching Scenes Of Violence, Which Set In Right From The Word Go. Trust Me, This Is Not A Show For The Faint Hearted Or Timid. This Show Pulls No Punches With Regards To Drugs, Sex Or Violence. Its Is Hardcore, In The Classic Use Of The Word.<Br /><Br />It Is Called Oz As That Is The Nickname Given To The Oswald Maximum Security State Penitentary. It Focuses Mainly On Emerald City, An Experimental Section Of The Prison Where All The Cells Have Glass Fronts And Face Inwards, So Privacy Is Not High On The Agenda. Em City Is Home To Many..Aryans, Muslims, Gangstas, Latinos, Christians, Italians, Irish And More....So Scuffles, Death Stares, Dodgy Dealings And Shady Agreements Are Never Far Away.<Br /><Br />I Would Say The Main Appeal Of The Show Is Due To The Fa

In [14]:
mystring.upper()

"ONE OF THE OTHER REVIEWERS HAS MENTIONED THAT AFTER WATCHING JUST 1 OZ EPISODE YOU'LL BE HOOKED. THEY ARE RIGHT, AS THIS IS EXACTLY WHAT HAPPENED WITH ME.<BR /><BR />THE FIRST THING THAT STRUCK ME ABOUT OZ WAS ITS BRUTALITY AND UNFLINCHING SCENES OF VIOLENCE, WHICH SET IN RIGHT FROM THE WORD GO. TRUST ME, THIS IS NOT A SHOW FOR THE FAINT HEARTED OR TIMID. THIS SHOW PULLS NO PUNCHES WITH REGARDS TO DRUGS, SEX OR VIOLENCE. ITS IS HARDCORE, IN THE CLASSIC USE OF THE WORD.<BR /><BR />IT IS CALLED OZ AS THAT IS THE NICKNAME GIVEN TO THE OSWALD MAXIMUM SECURITY STATE PENITENTARY. IT FOCUSES MAINLY ON EMERALD CITY, AN EXPERIMENTAL SECTION OF THE PRISON WHERE ALL THE CELLS HAVE GLASS FRONTS AND FACE INWARDS, SO PRIVACY IS NOT HIGH ON THE AGENDA. EM CITY IS HOME TO MANY..ARYANS, MUSLIMS, GANGSTAS, LATINOS, CHRISTIANS, ITALIANS, IRISH AND MORE....SO SCUFFLES, DEATH STARES, DODGY DEALINGS AND SHADY AGREEMENTS ARE NEVER FAR AWAY.<BR /><BR />I WOULD SAY THE MAIN APPEAL OF THE SHOW IS DUE TO THE FA

In [15]:
mystring.split('.')

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked",
 ' They are right, as this is exactly what happened with me',
 '<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO',
 ' Trust me, this is not a show for the faint hearted or timid',
 ' This show pulls no punches with regards to drugs, sex or violence',
 ' Its is hardcore, in the classic use of the word',
 '<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary',
 ' It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda',
 ' Em City is home to many',
 '',
 'Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more',
 '',
 '',
 '',
 'so scuffles, death stares, dodgy dealings and shady agreements are never far away',
 "<br /><

## Point 4: We are primarily interested in splitting a sentence into words ##

**Idea:** The list of words will be converted into a *numerical* list of *counts* of each word, and the differences in these counts will be used to distinguish one review from another.  Hence "bag of words".   

**Example:** (From the book)

In [16]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [17]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [18]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
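
To see what the vectorizer is doing, the same counts can be reproduced by hand. This is only a sketch: the helper names `tokenize` and `vectorize` are made up for illustration, and the tokenization imitates `CountVectorizer`'s defaults (lowercase, then keep runs of two or more word characters).

```python
import re

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (this mirrors CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for doc in docs for token in tokenize(doc)}
vocabulary = {word: i for i, word in enumerate(sorted(all_tokens))}

def vectorize(text):
    # One slot per vocabulary word; add 1 each time the word appears.
    counts = [0] * len(vocabulary)
    for token in tokenize(text):
        counts[vocabulary[token]] += 1
    return counts

bag_by_hand = [vectorize(doc) for doc in docs]
for row in bag_by_hand:
    print(row)
```

Each column corresponds to one vocabulary word, so the first column counts 'and', the second counts 'is', and so on.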


**Problem:** Real sentences are messy.

In [19]:
mystring.split(' ')

['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'Oz',
 'episode',
 "you'll",
 'be',
 'hooked.',
 'They',
 'are',
 'right,',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me.<br',
 '/><br',
 '/>The',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'Oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence,',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'GO.',
 'Trust',
 'me,',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid.',
 'This',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs,',
 'sex',
 'or',
 'violence.',
 'Its',
 'is',
 'hardcore,',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word.<br',
 '/><br',
 '/>It',
 'is',
 'called',
 'OZ',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentar

## Point 5: There are lots of other ways to clean up text ##

### Some General Ideas for Cleaning Data ###

+ Lowercasing
+ Removing Punctuation & Special Characters
+ Stop-Words Removal  (Next step)
+ Removal of URLs
+ Removal of HTML Tags

### Capitalization/Lowercasing ###

In [1]:
mystring.lower()

**Note:**  `lower()` returns a new string and leaves `mystring` unchanged, so you need to save the result--it won't be converted automatically.

In [34]:
mystring2 = mystring.lower()

### Punctuation and HTML Tags with Regular Expressions ###

The Python [re module](https://docs.python.org/3/library/re.html) is a way to manually clean up characters using regular expressions. Unfortunately, reading 're' syntax is like reading Greek.

In [21]:
import re

In [44]:
re.sub('<[^>]*>',' ',mystring2)

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.  the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.  it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.  i would say the main appeal of the show is due to the fact that it goes where other sh

In [45]:
mystring3 = re.sub('<[^>]*>','',mystring2)
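
The tag pattern is easier to read when it is spelled out piece by piece. The sketch below writes the same pattern with `re.VERBOSE`, which allows whitespace and comments inside the pattern; the sample string is made up in the style of the reviews.

```python
import re

# The same pattern as above, spelled out so each piece can be commented.
TAG = re.compile(r"""
    <        # a literal opening angle bracket
    [^>]*    # any run of characters that are not a closing bracket
    >        # the closing angle bracket
""", re.VERBOSE)

sample = "hooked.<br /><br />The first thing"   # made-up snippet
no_tags = TAG.sub(" ", sample)                  # each tag becomes one space
print(no_tags)
```

Replacing each tag with a space (rather than an empty string) keeps the words on either side of a tag from being glued together.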

In [48]:
re.sub(r'\.|,|\(|\)',' ',mystring3)  # raw string, so the backslashes reach 're' untouched

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked  they are right  as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence  which set in right from the word go  trust me  this is not a show for the faint hearted or timid  this show pulls no punches with regards to drugs  sex or violence  its is hardcore  in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wo

In [43]:
mystring4 = re.sub(r'\.|,|\(|\)',' ',mystring3)

You can also remove digits.

In [49]:
re.sub(r'\d+','',mystring4)  # inside [...] a '+' is a literal plus sign, so '[\d+]' would also strip '+'

"one of the other reviewers has mentioned that after watching just  oz episode you'll be hooked  they are right  as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence  which set in right from the word go  trust me  this is not a show for the faint hearted or timid  this show pulls no punches with regards to drugs  sex or violence  its is hardcore  in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wou

In [50]:
mystring5 = re.sub(r'\d+','',mystring4)
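
The individual steps above can be collected into one function. This is a sketch, not the textbook's preprocessor: `clean_text` is a hypothetical helper name, tags are replaced by a space rather than removed outright, and a final step collapses repeated whitespace.

```python
import re

def clean_text(text):
    """Lowercase, then strip HTML tags, punctuation, and digits (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)       # HTML tags -> space
    text = re.sub(r"[.,()]", " ", text)        # the same four punctuation marks as above
    text = re.sub(r"\d+", "", text)            # runs of digits
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

# Made-up sample in the style of the reviews.
sample = "Just 1 Oz episode.<br /><br />Trust me, you'll be hooked (really)."
cleaned = clean_text(sample)
print(cleaned)
```

A function like this can be applied to every review at once, for example with `mydata['review'].apply(clean_text)`.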

**Note:** Another way to remove HTML is with the "Beautiful Soup" package.

In [47]:
from bs4 import BeautifulSoup  # Remember to 'pip install beautifulsoup4' !!

soup = BeautifulSoup(mystring, 'html.parser')  # name the parser explicitly to avoid a warning
soup.get_text()

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

## Point 6: The next step is to tokenize ##

In [52]:
mystring5

"one of the other reviewers has mentioned that after watching just  oz episode you'll be hooked  they are right  as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence  which set in right from the word go  trust me  this is not a show for the faint hearted or timid  this show pulls no punches with regards to drugs  sex or violence  its is hardcore  in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wou

In [54]:
print(mystring5.split(' '))

['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '', 'oz', 'episode', "you'll", 'be', 'hooked', '', 'they', 'are', 'right', '', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'me', 'the', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', '', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'go', '', 'trust', 'me', '', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', '', 'this', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', '', 'sex', 'or', 'violence', '', 'its', 'is', 'hardcore', '', 'in', 'the', 'classic', 'use', 'of', 'the', 'word', 'it', 'is', 'called', 'oz', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'oswald', 'maximum', 'security', 'state', 'penitentary', '', 'it', 'focuses', 'mainly', 'on', 'emerald', 'city', '', 'an', 'experimental', 'section', 'of', 'the',
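
The empty strings appear because `split(' ')` splits on every single space, and the cleaning steps left doubled spaces behind. Calling `split()` with no argument splits on runs of whitespace instead. A short illustration on a made-up fragment:

```python
text = "just  oz episode  they are right"   # note the doubled spaces

with_empties = text.split(" ")     # splits on every single space
without_empties = text.split()     # splits on runs of whitespace

print(with_empties)
print(without_empties)
```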

### The Natural Language Toolkit ###
The [NLTK](https://www.nltk.org/) is a well-known package for tokenizing natural language sentences.  We will use [nltk.tokenize](https://www.nltk.org/api/nltk.tokenize.html).

And they have a [nice book](https://www.nltk.org/book/) to go along with it.

In [56]:
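import nltk  # Remember to 'pip install nltk'
# If word_tokenize raises a LookupError, run the nltk.download(...) call that the error message suggests (e.g. 'punkt').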

In [58]:
from nltk.tokenize import word_tokenize

In [59]:
word_tokenize(mystring5)

['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 'oz',
 'episode',
 'you',
 "'ll",
 'be',
 'hooked',
 'they',
 'are',
 'right',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me',
 'the',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'go',
 'trust',
 'me',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 'this',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs',
 'sex',
 'or',
 'violence',
 'its',
 'is',
 'hardcore',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word',
 'it',
 'is',
 'called',
 'oz',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'it',
 'focuses',
 'mainly',
 'on',
 

In [60]:
mylist1 = word_tokenize(mystring5)

In [62]:
len(mylist1)

313

## Point 7: Remove Stop Words ##

Some words are so common that they add almost nothing by being included in the list of words. These are called *stop words*.

The NLTK package (short for 'Natural Language Toolkit') is a very powerful package for dealing with a lot of these issues.


In [66]:
import nltk  # Remember to 'pip install nltk'
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /home/aleahy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [27]:
len(stopwords.words('english'))

179

**Question:** How do you remove these stop words?  Python has a compact way to construct a list with a for loop, called a *list comprehension*:

In [63]:
[word for word in mylist1 if word not in stopwords.words('english')]

['one',
 'reviewers',
 'mentioned',
 'watching',
 'oz',
 'episode',
 "'ll",
 'hooked',
 'right',
 'exactly',
 'happened',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'word',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'many',
 'aryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'away',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'goes',
 'shows',
 '

**Note:**  I didn't deal with contractions very well . . . . 
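
The stray `'ll` survives because the tokenizer split "you'll" into two pieces, and only the whole contraction is on the stop-word list. One possible workaround (an assumption on my part, not the textbook's approach) is to tokenize with a regular expression that keeps an apostrophe between letters. The stop set below is a small hand-picked subset of the NLTK list, just for the demonstration.

```python
import re

# A small subset of the NLTK English stop words (enough for this demo).
stop = {"you", "you'll", "be", "the", "of", "that", "after", "just", "has", "other"}

text = ("one of the other reviewers has mentioned that after watching "
        "just oz episode you'll be hooked")

# Keep an apostrophe that sits between letters, so "you'll" stays one token.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
kept = [t for t in tokens if t not in stop]
print(kept)
```

With the contraction kept whole, the stop-word filter removes it and no fragment is left behind.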

In [64]:
stop = set(stopwords.words('english'))  # build the set once; set lookups are fast
mylist2 = [word for word in mylist1 if word not in stop]

In [65]:
len(mylist2)

170

In [67]:
print(mylist2)

['one', 'reviewers', 'mentioned', 'watching', 'oz', 'episode', "'ll", 'hooked', 'right', 'exactly', 'happened', 'first', 'thing', 'struck', 'oz', 'brutality', 'unflinching', 'scenes', 'violence', 'set', 'right', 'word', 'go', 'trust', 'show', 'faint', 'hearted', 'timid', 'show', 'pulls', 'punches', 'regards', 'drugs', 'sex', 'violence', 'hardcore', 'classic', 'use', 'word', 'called', 'oz', 'nickname', 'given', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'focuses', 'mainly', 'emerald', 'city', 'experimental', 'section', 'prison', 'cells', 'glass', 'fronts', 'face', 'inwards', 'privacy', 'high', 'agenda', 'em', 'city', 'home', 'many', 'aryans', 'muslims', 'gangstas', 'latinos', 'christians', 'italians', 'irish', 'scuffles', 'death', 'stares', 'dodgy', 'dealings', 'shady', 'agreements', 'never', 'far', 'away', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'would', "n't", 'dare', 'forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 
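
For anyone new to list comprehensions, the sketch below shows the comprehension next to the ordinary loop it abbreviates, on a toy token list. It also stores the stop words in a `set`, since `stopwords.words('english')` returns a list and checking membership in a list is slower.

```python
tokens = ["this", "show", "pulls", "no", "punches"]
stop = {"this", "no"}   # toy stop-word set; a set makes `in` checks fast

# The comprehension ...
kept = [word for word in tokens if word not in stop]

# ... is shorthand for this loop.
kept_loop = []
for word in tokens:
    if word not in stop:
        kept_loop.append(word)

print(kept)
```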

## Some Additional Steps:  Stemming or Lemmatization ##

See [here](https://www.ibm.com/topics/stemming-lemmatization): "Literature generally defines stemming as the process of stripping affixes from words to obtain stemmed word strings, and lemmatization as the larger enterprise of reducing morphological variants to one dictionary base form."
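
The difference can be illustrated with a deliberately crude toy. Neither function below is what NLTK does: `toy_stem` is a made-up suffix stripper (not the Porter algorithm) and `toy_lemmatize` is a made-up three-word dictionary (not WordNet). The point is only that a stemmer applies rules to the string, while a lemmatizer looks the word up.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "es", "s")   # hypothetical rule list

def toy_stem(word):
    # Strip the first matching suffix, as long as at least 3 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary of base forms.
TOY_LEMMAS = {"goes": "go", "cities": "city", "watching": "watch"}

def toy_lemmatize(word):
    # Look the word up; leave it alone if it is not in the dictionary.
    return TOY_LEMMAS.get(word, word)

for w in ["watching", "goes", "cities", "show"]:
    print(f"{w:10s} {toy_stem(w):8s} {toy_lemmatize(w)}")
```

Rule-based stripping can produce strings that are not words, while a dictionary lookup always returns a real word but only for entries it knows about.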

In [68]:
from nltk.stem.porter import PorterStemmer

In [71]:
porter = PorterStemmer()

In [72]:
[ porter.stem(word) for word in mylist2 ]

['one',
 'review',
 'mention',
 'watch',
 'oz',
 'episod',
 "'ll",
 'hook',
 'right',
 'exactli',
 'happen',
 'first',
 'thing',
 'struck',
 'oz',
 'brutal',
 'unflinch',
 'scene',
 'violenc',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'heart',
 'timid',
 'show',
 'pull',
 'punch',
 'regard',
 'drug',
 'sex',
 'violenc',
 'hardcor',
 'classic',
 'use',
 'word',
 'call',
 'oz',
 'nicknam',
 'given',
 'oswald',
 'maximum',
 'secur',
 'state',
 'penitentari',
 'focus',
 'mainli',
 'emerald',
 'citi',
 'experiment',
 'section',
 'prison',
 'cell',
 'glass',
 'front',
 'face',
 'inward',
 'privaci',
 'high',
 'agenda',
 'em',
 'citi',
 'home',
 'mani',
 'aryan',
 'muslim',
 'gangsta',
 'latino',
 'christian',
 'italian',
 'irish',
 'scuffl',
 'death',
 'stare',
 'dodgi',
 'deal',
 'shadi',
 'agreement',
 'never',
 'far',
 'away',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'goe',
 'show',
 'would',
 "n't",
 'dare',
 'forget',
 'pretti',
 'pictur',
 

In [73]:
mylist3stemmed = [ porter.stem(word) for word in mylist2 ]

In [74]:
from nltk.stem import WordNetLemmatizer

In [75]:
lemmatizer = WordNetLemmatizer()

In [77]:
nltk.download('wordnet')  # The next line gave an error and suggested I run this

[nltk_data] Downloading package wordnet to /home/aleahy/nltk_data...


True

In [78]:
[ lemmatizer.lemmatize(word) for word in mylist2 ]

['one',
 'reviewer',
 'mentioned',
 'watching',
 'oz',
 'episode',
 "'ll",
 'hooked',
 'right',
 'exactly',
 'happened',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scene',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pull',
 'punch',
 'regard',
 'drug',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'word',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focus',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cell',
 'glass',
 'front',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'many',
 'aryan',
 'muslim',
 'gangsta',
 'latino',
 'christian',
 'italian',
 'irish',
 'scuffle',
 'death',
 'stare',
 'dodgy',
 'dealing',
 'shady',
 'agreement',
 'never',
 'far',
 'away',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'go',
 'show',
 'would',
 "n't",
 'dare',

In [79]:
mylist3lemma = [ lemmatizer.lemmatize(word) for word in mylist2 ]

In [80]:
word_count = {}
for word in mylist3lemma:
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1
print(word_count)


{'one': 1, 'reviewer': 1, 'mentioned': 1, 'watching': 2, 'oz': 6, 'episode': 2, "'ll": 3, 'hooked': 1, 'right': 2, 'exactly': 1, 'happened': 1, 'first': 2, 'thing': 1, 'struck': 2, 'brutality': 1, 'unflinching': 1, 'scene': 1, 'violence': 4, 'set': 1, 'word': 2, 'go': 2, 'trust': 1, 'show': 4, 'faint': 1, 'hearted': 1, 'timid': 1, 'pull': 1, 'punch': 1, 'regard': 1, 'drug': 1, 'sex': 1, 'hardcore': 1, 'classic': 1, 'use': 1, 'called': 1, 'nickname': 1, 'given': 1, 'oswald': 1, 'maximum': 1, 'security': 1, 'state': 1, 'penitentary': 1, 'focus': 1, 'mainly': 1, 'emerald': 1, 'city': 2, 'experimental': 1, 'section': 1, 'prison': 3, 'cell': 1, 'glass': 1, 'front': 1, 'face': 1, 'inwards': 1, 'privacy': 1, 'high': 2, 'agenda': 1, 'em': 1, 'home': 1, 'many': 1, 'aryan': 1, 'muslim': 1, 'gangsta': 1, 'latino': 1, 'christian': 1, 'italian': 1, 'irish': 1, 'scuffle': 1, 'death': 1, 'stare': 1, 'dodgy': 1, 'dealing': 1, 'shady': 1, 'agreement': 1, 'never': 1, 'far': 1, 'away': 2, 'would': 2, 'sa
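
The standard library has a ready-made tool for this kind of counting. The sketch below uses `collections.Counter` on a toy list standing in for `mylist3lemma`:

```python
from collections import Counter

# Toy list standing in for mylist3lemma.
tokens = ["oz", "show", "violence", "oz", "show", "oz"]

word_count = Counter(tokens)          # counts every distinct token in one pass
top_two = word_count.most_common(2)   # (token, count) pairs, most frequent first

print(top_two)
```

A `Counter` behaves like a dictionary, so `word_count['oz']` works the same way as with the hand-built version.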

# SKLearn and the 'Bag of Words' Model #

You now have to do this with *every sentence* in the corpus . . . 

**Key point:** This process of *vectorization* can be automated, and SKLearn does this with its [Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) module.

## CountVectorizer ##

This [class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) builds its vocabulary from the *entire* corpus and counts the occurrences of each word in each document.

In [81]:
from sklearn.feature_extraction.text import CountVectorizer

In [83]:
mydata['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [84]:
mycount = CountVectorizer()

In [85]:
bag = mycount.fit_transform(mydata['review'])

**Note:** This is a new type of data structure:  A scipy [sparse matrix](https://docs.scipy.org/doc/scipy/reference/sparse.html)

In [89]:
bag

<50000x101895 sparse matrix of type '<class 'numpy.int64'>'
	with 6826529 stored elements in Compressed Sparse Row format>

In [88]:
bag.shape

(50000, 101895)
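
A little arithmetic shows why a sparse format is used here. The numbers below are copied from the outputs above; the 8 bytes per cell assumes the `int64` type that the matrix reports.

```python
n_docs, n_terms = 50000, 101895   # bag.shape from above
stored = 6826529                  # stored (nonzero) elements reported for bag

total_cells = n_docs * n_terms
density = stored / total_cells        # fraction of cells that are nonzero
dense_gb = total_cells * 8 / 1e9      # int64 = 8 bytes per cell, in decimal GB
avg_terms = stored / n_docs           # distinct terms per review, on average

print(f"{density:.4%} nonzero, about {dense_gb:.1f} GB if stored densely, "
      f"{avg_terms:.0f} distinct terms per review on average")
```

Almost every cell is zero, so storing only the nonzero entries saves a great deal of memory.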

In [87]:
bag[0]

<1x101895 sparse matrix of type '<class 'numpy.int64'>'
	with 186 stored elements in Compressed Sparse Row format>

In [90]:
bag[0].nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32),
 array([ 64131,  63757,  90160,  64776,  75511,  40745,  57558,  90137,
          2970,  98226,  48473,  65469,  30113, 101096,  53223,   8758,
         42850,  90347,   5707,  75915,   6166,  90455,  46765,  30975,
         98847,  40445,  99740,  56982,  12041,  33526,  90399,  86597,
          1866,  98149,  46954,  12974,   4541,  94620,  78746,  97

In [94]:
len(bag[0].data)

186

In [95]:
bag[bag[0].nonzero()]

matrix([[ 1,  7, 16,  2,  1,  1,  1,  4,  1,  2,  2,  6,  2,  3,  3,  2,
          1,  1,  2,  2,  4,  3,  9,  1,  2,  1,  5,  4,  6,  2,  1,  2,
          1,  3,  2,  1,  6,  1,  1,  4,  1,  1,  3,  1,  2,  1,  1,  3,
          3,  5,  1,  1,  3,  1,  1,  1,  1,  1,  6,  1,  1,  1,  1,  1,
          6,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  3,  1,  2,  1,  1,
          1,  3,  2,  1,  1,  1,  1,  1,  1,  1,  3,  1,  2,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  2,  1,  2,  1,  1,  2,  1,  1,  1,  1,  1,  3,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  2,  1,  1,
          2,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1]])

In [97]:
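# Careful: nonzero() on the 1-row bag[3] returns row indices of 0, so this reads review 0's counts at review 3's columns (hence the zeros); use bag[3].data for review 3's own counts.
bag[bag[3].nonzero()]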

matrix([[ 7, 16,  2,  3,  2,  4,  3,  9,  5,  6,  2,  2,  6,  1,  3,  5,
          3,  6,  2,  1,  1,  1,  1,  1,  1,  1,  1,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0]])

## TfidfTransformer ##

The CountVectorizer gives a (fixed-length) numerical vector of raw counts for each sentence in our corpus.  But this isn't really a good vector to use.  Why?

Instead of the **term frequency**, we will compute TF-IDF:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t)$$

where

$$\text{idf}(t) = \log\frac{n}{1+\text{df}(t)}$$

though there are variations on this, such as 

$$\text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)}$$

where $n$ is the total number of documents in the corpus and $\text{df}(t)$ is the number of documents that contain the term $t$.  (The effect of this is to make rarely occurring words have a higher weight.  How??)  The tf-idf equation implemented in scikit-learn is slightly different: by default it uses the second (smoothed) form of the idf and adds 1 to it:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t)+1)$$

Moreover, the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) normalizes each document's tf-idf vector directly (L2 normalization by default):

$$v_{\text{norm}} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_{1}^{2} + v_{2}^{2} + \dots + v_{n}^{2}}} = \frac{v}{\big (\sum_{i=1}^{n} v_{i}^{2}\big)^\frac{1}{2}}$$
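
As a worked example, here is the calculation done by hand for the third toy document from the CountVectorizer example ("The sun is shining, the weather is sweet, and one and one is two"). It is a sketch that follows the smoothed idf, the tf × (idf + 1) form, and the L2 normalization given above, using the natural logarithm.

```python
import math

# Raw counts for the third toy document, and the number of documents
# (out of n = 3) that contain each term.
terms = ["and", "is", "one", "shining", "sun", "sweet", "the", "two", "weather"]
tf = [2, 3, 2, 1, 1, 1, 2, 1, 1]
df = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n = 3

# Smoothed idf, then tf * (idf + 1).
idf = [math.log((1 + n) / (1 + d)) for d in df]
raw = [t * (i + 1) for t, i in zip(tf, idf)]

# L2 normalization: divide by the Euclidean length of the vector.
norm = math.sqrt(sum(v * v for v in raw))
tfidf = [v / norm for v in raw]

for term, value in zip(terms, tfidf):
    print(f"{term:8s} {value:.2f}")
```

The word 'is' appears three times but occurs in every document, so its idf is 0 and it ends up with a lower weight than 'and', which appears only twice but in a single document.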

In [98]:
from sklearn.feature_extraction.text import TfidfTransformer

In [102]:
mytfidf = TfidfTransformer()

In [104]:
mynewbag = mytfidf.fit_transform(bag)

In [105]:
mynewbag.shape

(50000, 101895)

In [108]:
mynewbag[mynewbag[0].nonzero()]

matrix([[0.02911322, 0.05068733, 0.04479186, 0.02244943, 0.09255941,
         0.07188521, 0.03798399, 0.02331236, 0.05603874, 0.0405969 ,
         0.02324466, 0.06081672, 0.03718882, 0.04558484, 0.18469651,
         0.04837916, 0.03966464, 0.08812306, 0.06598397, 0.04594793,
         0.05750647, 0.05318728, 0.06723554, 0.08362849, 0.03477363,
         0.03054229, 0.01938486, 0.02408115, 0.17026481, 0.05739846,
         0.05110907, 0.0539727 , 0.06140152, 0.13109702, 0.0508575 ,
         0.05208189, 0.08048678, 0.06472807, 0.05576775, 0.05991473,
         0.04336909, 0.03752364, 0.09527006, 0.07956894, 0.04243153,
         0.03678788, 0.06383417, 0.0626505 , 0.1100581 , 0.03002716,
         0.05797293, 0.03388801, 0.05017671, 0.06765681, 0.05991473,
         0.07293804, 0.05626202, 0.07358243, 0.06244624, 0.09031335,
         0.16991551, 0.03334867, 0.054682  , 0.11309272, 0.06889804,
         0.42048426, 0.01953166, 0.04889158, 0.08991524, 0.0460941 ,
         0.05966131, 0.01649938, 0

# Building the Model? #

In [109]:
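# Note: .loc slices include the endpoint, so row 25000 ends up in both the train and test sets here.
X_train = mydata.loc[:25000, 'review'].values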
y_train = mydata.loc[:25000, 'sentiment'].values
X_test = mydata.loc[25000:, 'review'].values
y_test = mydata.loc[25000:, 'sentiment'].values

In [111]:
y_test[:50]

array(['negative', 'negative', 'positive', 'positive', 'negative',
       'positive', 'negative', 'positive', 'positive', 'negative',
       'positive', 'positive', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'negative', 'positive',
       'positive', 'negative', 'positive', 'negative', 'positive',
       'positive', 'negative', 'positive', 'negative', 'negative',
       'positive', 'positive', 'positive', 'positive', 'positive',
       'positive', 'negative', 'negative', 'positive', 'negative',
       'positive', 'negative', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'negative', 'positive'],
      dtype=object)

In [113]:
temp = mycount.fit_transform(X_train)

In [114]:
bagX = mytfidf.fit_transform(temp)

In [115]:
from sklearn.linear_model import LogisticRegression

In [117]:
mylogreg = LogisticRegression()

In [118]:
mymodel = mylogreg.fit(bagX, y_train)

In [119]:
temptest = mycount.fit_transform(X_test)

In [120]:
bagXtest = mytfidf.fit_transform(temptest)

In [121]:
mypredictedy = mymodel.predict(bagXtest)

ValueError: X has 76885 features, but LogisticRegression is expecting 76497 features as input.

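**Problem:** The dictionaries for the test and train sets had different lengths: calling `fit_transform` on the test set built a new vocabulary (76885 words) instead of reusing the one from the training set (76497 words).  I will try this again with one dictionary.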
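
The usual scikit-learn convention is to call `fit_transform` on the training text and plain `transform` on the test text, so that both matrices share the training vocabulary. Below is a minimal sketch of that convention on made-up reviews (not the IMDb data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-ins for X_train and X_test (made-up reviews).
train_docs = ["a wonderful little film", "a dull and boring plot"]
test_docs = ["a wonderful plot with surprising depth"]

count = CountVectorizer()
tfidf = TfidfTransformer()

# fit_transform learns the vocabulary and idf weights from the training text ...
train_vecs = tfidf.fit_transform(count.fit_transform(train_docs))
# ... and transform reuses them, so the test matrix has the same columns.
test_vecs = tfidf.transform(count.transform(test_docs))

print(train_vecs.shape, test_vecs.shape)
```

Words that appear only in the test text are simply ignored, which is also what would happen with genuinely new reviews.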

# Another Attempt at Building the Model #

In [122]:
temp = mycount.fit_transform(mydata['review'].values)

In [156]:
bagX = mytfidf.fit_transform(temp)

In [157]:
bagX.shape

(50000, 101895)

In [158]:
X_train = bagX[:25000]
X_test = bagX[25000:]

In [159]:
X_train.shape

(25000, 101895)

In [160]:
X_test.shape

(25000, 101895)

In [161]:
y_train = mydata.loc[:24999, 'sentiment']

In [162]:
y_test = mydata.loc[25000:, 'sentiment']

In [163]:
y_train.shape

(25000,)

In [164]:
y_test.shape

(25000,)

In [166]:
mylogreg = LogisticRegression()

In [167]:
mymodel = mylogreg.fit(X_train, y_train)

In [169]:
mypredictions = mymodel.predict(X_test)

In [170]:
mypredictions.shape

(25000,)

In [171]:
from sklearn.metrics import accuracy_score

In [172]:
accuracy_score(y_test, mypredictions)  # signature is (y_true, y_pred); accuracy is the same either way

0.89084

### Comments ###

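I don't think 89% accuracy on a simple logistic regression model is all that bad.  Note that I didn't even include my cleaned-up data when I vectorized my text, although `CountVectorizer` lowercases the text and drops punctuation by default, which does part of that work.  Note also that in this second attempt the vocabulary and idf weights were computed from all 50,000 reviews, including the test half.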
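
One way to bring the cleaning steps into the model is to hand a cleaning function to the vectorizer and chain everything in a `Pipeline`. The sketch below makes several assumptions: the reviews are made up, `strip_tags` is a hypothetical helper, and `TfidfVectorizer` is used as a one-step replacement for `CountVectorizer` followed by `TfidfTransformer`.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def strip_tags(text):
    # Hypothetical preprocessor: lowercase and replace HTML tags with a space.
    # A custom preprocessor replaces the vectorizer's built-in lowercasing step.
    return re.sub(r"<[^>]*>", " ", text.lower())

# Made-up reviews standing in for the IMDb data.
train_docs = ["A wonderful film.<br />Loved it", "Bad plot, bad acting",
              "Wonderful acting and a great plot", "Boring and bad.<br />Awful"]
train_labels = ["positive", "negative", "positive", "negative"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=strip_tags)),  # counts + tf-idf in one step
    ("clf", LogisticRegression()),
])
model.fit(train_docs, train_labels)   # vocabulary is learned from the training text only

predictions = model.predict(["A great film<br />wonderful"])
print(predictions)
```

Because the pipeline fits the vectorizer only on the text passed to `fit`, the test reviews never influence the vocabulary or the idf weights.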