# Live Coding - NLP Basics

The ultimate goal of this live coding is to build a model that predicts whether or not a user gave the movie "The Matrix" the best possible rating. The prediction should be based on the review the user has written. Pretty nice task, isn't it?

The steps to get to this goal are the following:

1. Read text data and store in data frame
2. Extract metadata from text (e.g. rating)
3. Clean text (e.g. remove symbols)
4. Remove stopwords
5. Stemming / Lemmatizing
6. Create train / test set
7. Create features (bag-of-words / tf-idf)
8. Run classification model
9. Identify positive / negative words

# 1. Read text data

Let's start by reading the text from the .txt file.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
path="/content/drive/My Drive/Colab Notebooks/WCS/the_matrix_reviews.txt"
# read all content
# open() takes a filename and a mode as its arguments; "r" opens the file in read-only mode.
# To write data to a file, pass "w" as the mode instead.
with open(path, "r") as file:
  content = file.read()

In [None]:
type(content)

str

In [None]:
#content

In [None]:
# show first 200 characters including spaces
print(content[:200])

Author: mambubukid / Rating: 10/10 / Date: 19 September 2000
The story of a reluctant Christ-like protagonist set against a baroque, MTV backdrop, The Matrix is the definitive hybrid of technical wiza


Let's see what we get by splitting the lines by line breaks (i.e. \n)

In [None]:
# show first 10 lines
# The splitlines() method splits a string into a list.
# The splitting is done at line breaks, i.e. \n
content.splitlines()[:10]

['Author: mambubukid / Rating: 10/10 / Date: 19 September 2000',
 "The story of a reluctant Christ-like protagonist set against a baroque, MTV backdrop, The Matrix is the definitive hybrid of technical wizardry and contextual excellence that should be the benchmark for all sci-fi films to come. Hollywood has had some problems combining form and matter in the sci-fi genre.  There have been a lot of visually stunning works but nobody cared about the hero. (Or nobody simply cared about anything.)  There a few, though, which aroused interest and intellect but nobody 'ooh'-ed or 'aah'-ed at the special effects.  With The Matrix, both elements are perfectly en sync.  Not only did we want to cheer on the heroes to victory, we wanted them to bludgeon the opposition.  Not only did we sit in awe as Neo evaded those bullets in limbo-rock fashion, we salivated. But what makes The Matrix several cuts above the rest of the films in its genre is that there are simply no loopholes.  The script, writte

Great, it seems as if we can separate different reviews by -----------

In [None]:
a='hallo ------\n.asgdgajhg------ahgshgd'
a.split('------')

['hallo ', '\n.asgdgajhg', 'ahgshgd']

In [None]:
reviews_text = content.split("--------------------------------------------------\n")
len(reviews_text) # number of reviews after splitting

4257
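The count is one higher than the number of actual reviews, because the file ends with a separator and the last chunk is therefore empty. A minimal sketch of filtering out such empty chunks right after splitting, using a made-up `demo_content` string instead of the real file (the separator length of fifty hyphens is an assumption for this demo):

```python
separator = "-" * 50 + "\n"  # assumed separator for this demo

# hypothetical miniature stand-in for the real file content
demo_content = (
    "Author: a / Rating: 10/10\nGreat film.\n" + separator
    + "Author: b / Rating: 3/10\nNot for me.\n" + separator
)

chunks = demo_content.split(separator)

# keep only chunks that contain something besides whitespace
demo_reviews = [chunk for chunk in chunks if chunk.strip()]

print(len(chunks), len(demo_reviews))
```

With this filter the empty trailing entry never reaches the later processing steps.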

In [None]:
reviews_text[:2]

["Author: mambubukid / Rating: 10/10 / Date: 19 September 2000\nThe story of a reluctant Christ-like protagonist set against a baroque, MTV backdrop, The Matrix is the definitive hybrid of technical wizardry and contextual excellence that should be the benchmark for all sci-fi films to come. Hollywood has had some problems combining form and matter in the sci-fi genre.  There have been a lot of visually stunning works but nobody cared about the hero. (Or nobody simply cared about anything.)  There a few, though, which aroused interest and intellect but nobody 'ooh'-ed or 'aah'-ed at the special effects.  With The Matrix, both elements are perfectly en sync.  Not only did we want to cheer on the heroes to victory, we wanted them to bludgeon the opposition.  Not only did we sit in awe as Neo evaded those bullets in limbo-rock fashion, we salivated. But what makes The Matrix several cuts above the rest of the films in its genre is that there are simply no loopholes.  The script, written b

In [None]:
# have a look at review 1 and 2
print(reviews_text[0])
print(reviews_text[1])

Author: mambubukid / Rating: 10/10 / Date: 19 September 2000
The story of a reluctant Christ-like protagonist set against a baroque, MTV backdrop, The Matrix is the definitive hybrid of technical wizardry and contextual excellence that should be the benchmark for all sci-fi films to come. Hollywood has had some problems combining form and matter in the sci-fi genre.  There have been a lot of visually stunning works but nobody cared about the hero. (Or nobody simply cared about anything.)  There a few, though, which aroused interest and intellect but nobody 'ooh'-ed or 'aah'-ed at the special effects.  With The Matrix, both elements are perfectly en sync.  Not only did we want to cheer on the heroes to victory, we wanted them to bludgeon the opposition.  Not only did we sit in awe as Neo evaded those bullets in limbo-rock fashion, we salivated. But what makes The Matrix several cuts above the rest of the films in its genre is that there are simply no loopholes.  The script, written by t

Hm, each review always seems to consist of two lines: the first line contains metadata and the second line is the actual comment. Let's separate them.


In [None]:
reviews_text[1]

'Rating: 10/10 / Date: 26 July 2014 / Helpful: 208/257 / Author: gogoschka-1\n** May contain spoilers ** There aren\'t many movies I watched in the theatre twice \x96 let alone on the same day - but immediately after the credits had rolled (and still pumped up by \'Rage against the Machine\'), I queued up for the next screening of \'The Matrix\'. I was so blown away by that film, I feared - and probably rightly so - that I hadn\'t caught every detail of what I\'d just seen. I later found out that many of my friends had had a similar reaction to the film, and I know virtually no one who liked the film and didn\'t watch it at least twice. It\'s simply one of those rare films that are so rich you just have to watch them several times. In structure, style and concept, \'The Matrix\' was ground-breaking; it marked the first time the visual style of Manga comic books and Anime such as \'Akira\' or \'Ghost in the Shell\' had been successfully translated to a live-action film. Apart from \'Bla

In [None]:
meta = reviews_text[1].splitlines()[0]
print(meta)

Rating: 10/10 / Date: 26 July 2014 / Helpful: 208/257 / Author: gogoschka-1


In [None]:
text = reviews_text[1].splitlines()[1]
text

'** May contain spoilers ** There aren\'t many movies I watched in the theatre twice \x96 let alone on the same day - but immediately after the credits had rolled (and still pumped up by \'Rage against the Machine\'), I queued up for the next screening of \'The Matrix\'. I was so blown away by that film, I feared - and probably rightly so - that I hadn\'t caught every detail of what I\'d just seen. I later found out that many of my friends had had a similar reaction to the film, and I know virtually no one who liked the film and didn\'t watch it at least twice. It\'s simply one of those rare films that are so rich you just have to watch them several times. In structure, style and concept, \'The Matrix\' was ground-breaking; it marked the first time the visual style of Manga comic books and Anime such as \'Akira\' or \'Ghost in the Shell\' had been successfully translated to a live-action film. Apart from \'Blade Runner\', which has a totally different mood and pace (but is also a maste

In [None]:
# let's build the dataframe
df_reviews = pd.DataFrame(columns = ["text", "metadata"])

# enumerate() gives you back two loop variables:
# the count of the current iteration and the value of the item at the current iteration

for i, review in enumerate(reviews_text):
  try:
    df_reviews.loc[i, "metadata"] = review.splitlines()[0]
    df_reviews.loc[i, "text"] = review.splitlines()[1]
  except IndexError:  # raised when a chunk has fewer than two lines
    print(f"Skipping entry at {i} with '{review}'.")


Skipping entry at 4256 with ''.
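Growing a data frame row by row with `.loc` works, but it tends to get slow for many rows. A possible alternative, sketched here on a small made-up list (`demo_reviews` is hypothetical and stands in for `reviews_text`), collects the rows in a plain list first and builds the data frame once:

```python
import pandas as pd

demo_reviews = [
    "Author: a / Rating: 10/10\nGreat film.\n",
    "Rating: 3/10 / Author: b\nNot for me.\n",
    "",  # empty trailing chunk, as in the real data
]

rows = []
for i, review in enumerate(demo_reviews):
    lines = review.splitlines()
    if len(lines) < 2:
        # same message as in the loop above
        print(f"Skipping entry at {i} with '{review}'.")
        continue
    rows.append({"text": lines[1], "metadata": lines[0]})

df_demo = pd.DataFrame(rows, columns=["text", "metadata"])
print(df_demo)
```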


In [None]:
df_reviews.tail()

Unnamed: 0,text,metadata
4251,Wow what a ride! Plenty of time to think about...,Author: TheDoc-3 / Date: 19 June 1999 / Helpfu...
4252,"The story was boring, even though the plot had...",Helpful: 0/2 / Author: Poe-6 / Date: 1 April 1...
4253,Having read some of the reviews of this film o...,Author: Coolflic / Rating: 9/10 / Date: 15 Jun...
4254,In a parallel universe William Gibson knocked ...,Author: Tim-85 / Helpful: 0/2 / Rating: 5/10 /...
4255,I saw the Matrix tonight in a crowded theater ...,Helpful: 0/1 / Author: knute123 / Date: 24 Mar...


In [None]:
df_reviews.iloc[4255]

text        I saw the Matrix tonight in a crowded theater ...
metadata    Helpful: 0/1 / Author: knute123 / Date: 24 Mar...
Name: 4255, dtype: object

In [None]:
# make a copy of our dataframe to test another version of the information extraction from the metadata column
df_reviews2 = df_reviews.copy()

# 2. Extract metadata from text

The metadata looks quite interesting, so it would be cool to extract the information. Unfortunately, the metadata cannot be separated easily, as the order of the fields seems to differ...

To still achieve our goal, we use regular expressions below. As tools such as [https://regex101.com/](https://regex101.com/) can come in quite handy, let's fetch a few examples that we can copy over.
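Because each field carries its own label, a pattern can anchor on the label rather than on the position of the field. A minimal sketch with `re.search` on two metadata lines from this data set (the variable names are only for illustration):

```python
import re

lines = [
    "Author: mambubukid / Rating: 10/10 / Date: 19 September 2000",
    "Date: 3 March 2001 / Helpful: 588/817 / Author: SdrolionGM",
]

rating_pattern = re.compile(r"Rating: (\d{1,2})/10")

ratings = []
for line in lines:
    # search() scans the whole string, so the field order does not matter;
    # it returns None when the label is missing
    match = rating_pattern.search(line)
    ratings.append(match.group(1) if match else None)

print(ratings)
```

The second line has no rating at all, so the result for it is `None` rather than an error.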

In [None]:
df_reviews.head(20)["metadata"]

0     Author: mambubukid / Rating: 10/10 / Date: 19 ...
1     Rating: 10/10 / Date: 26 July 2014 / Helpful: ...
2     Rating: 10/10 / Date: 2 December 2005 / Helpfu...
3     Date: 23 February 2020 / Helpful: 65/80 / Rati...
4     Date: 11 April 2018 / Author: notoriousCASK / ...
5     Date: 3 March 2001 / Helpful: 588/817 / Author...
6     Rating: 10/10 / Author: denisbabak / Date: 25 ...
7     Author: bencoops / Rating: 10/10 / Date: 31 Ja...
8     Author: emptyskies / Date: 23 April 2002 / Rat...
9     Helpful: 75/100 / Date: 30 July 2015 / Author:...
10    Helpful: 15/17 / Author: DanielStephens1988 / ...
11    Author: ozer-can2 / Date: 7 February 2019 / Ra...
12    Author: Pessimisticynic / Rating: 10/10 / Date...
13    Rating: 10/10 / Date: 22 May 2018 / Author: el...
14    Date: 24 June 2019 / Rating: 10/10 / Author: h...
15    Rating: 10/10 / Author: mail-3216 / Helpful: 7...
16    Helpful: 384/586 / Date: 29 March 2000 / Autho...
17    Helpful: 18/22 / Rating: 10/10 / Author: j

In [None]:
print("\n".join(['test1','test2', 'test3']))

test1
test2
test3


In [None]:
type(df_reviews.head(20)["metadata"])

pandas.core.series.Series

In [None]:
print("\n".join(df_reviews.head(20)["metadata"]))

Author: mambubukid / Rating: 10/10 / Date: 19 September 2000
Rating: 10/10 / Date: 26 July 2014 / Helpful: 208/257 / Author: gogoschka-1
Rating: 10/10 / Date: 2 December 2005 / Helpful: 582/763 / Author: MinorityReporter
Date: 23 February 2020 / Helpful: 65/80 / Rating: 10/10 / Author: MR_Heraclius
Date: 11 April 2018 / Author: notoriousCASK / Helpful: 194/258 / Rating: 10/10
Date: 3 March 2001 / Helpful: 588/817 / Author: SdrolionGM
Rating: 10/10 / Author: denisbabak / Date: 25 December 2018
Author: bencoops / Rating: 10/10 / Date: 31 January 2019 / Helpful: 28/34
Author: emptyskies / Date: 23 April 2002 / Rating: 10/10
Helpful: 75/100 / Date: 30 July 2015 / Author: ivo-cobra8 / Rating: 10/10
Helpful: 15/17 / Author: DanielStephens1988 / Date: 25 July 2019 / Rating: 10/10
Author: ozer-can2 / Date: 7 February 2019 / Rating: 10/10 / Helpful: 15/17
Author: Pessimisticynic / Rating: 10/10 / Date: 3 April 1999 / Helpful: 432/636
Rating: 10/10 / Date: 22 May 2018 / Author: elto-30283
Date: 

#### Extract date

In [None]:
import re
# \d matches a digit character, \w a word character.
# ( ) denotes a capturing group; {1,2} matches one or two digits.
string = "Date: 24 June 2019 / Rating: 10/10 / Author: happytoms"
pattern = r"Date: (\d{1,2} \w+ \d{2,4})"
re.findall(pattern, string)

['24 June 2019']

In [None]:
def find_date(metadata):

  pattern = r"Date: (\d{1,2} \w+ \d{2,4})"

  matches = re.findall(pattern, metadata)

  if len(matches) == 0:
    print(f"Found no dates in {metadata}")
    return None
  elif len(matches) > 1:
    raise ValueError("Multiple dates") # if we get an error, we don't get any values back.
  else:
    return matches[0]

In [None]:
# map substitutes each value in a Series with another value
# that may be derived from a function, a dict or a Series; apply does the same with a function
# https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas

df_reviews["date_string"] = df_reviews["metadata"].apply(find_date)

In [None]:
df_reviews.head()

Unnamed: 0,text,metadata,date_string
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,19 September 2000
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,26 July 2014
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2 December 2005
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,23 February 2020
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,11 April 2018


Great, apparently every entry had exactly one date! Let's convert the date strings to actual dates.
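Passing an explicit `format` to `pd.to_datetime` is an optional safeguard that makes the expected layout visible. A small sketch on hand-typed sample strings, assuming every date follows the "day, full month name, four-digit year" layout seen above:

```python
import pandas as pd

date_strings = pd.Series(["19 September 2000", "26 July 2014", "2 December 2005"])

# %d = day of month, %B = full month name, %Y = four-digit year
dates = pd.to_datetime(date_strings, format="%d %B %Y")

print(dates.dt.year.tolist())
```

If some entries used a different layout, this call would raise an error instead of silently guessing.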

In [None]:
#pd.to_datetime converts arguments (int, float str etc.) to datetime format
df_reviews["date"] = pd.to_datetime(df_reviews["date_string"])

In [None]:
df_reviews.head()

Unnamed: 0,text,metadata,date_string,date
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,19 September 2000,2000-09-19
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,26 July 2014,2014-07-26
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2 December 2005,2005-12-02
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,23 February 2020,2020-02-23
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,11 April 2018,2018-04-11


In [None]:
df_reviews.drop("date_string",  axis = 1, inplace= True)

In [None]:
df_reviews.head()

Unnamed: 0,text,metadata,date
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11


In [None]:
#import matplotlib.pyplot as plt
#import seaborn as sns

#sns.countplot(df_reviews["date"].dt.year)
#plt.xticks(rotation = 90)

In [None]:
print("\n".join(df_reviews.head(6)["metadata"]))

Author: mambubukid / Rating: 10/10 / Date: 19 September 2000
Rating: 10/10 / Date: 26 July 2014 / Helpful: 208/257 / Author: gogoschka-1
Rating: 10/10 / Date: 2 December 2005 / Helpful: 582/763 / Author: MinorityReporter
Date: 23 February 2020 / Helpful: 65/80 / Rating: 10/10 / Author: MR_Heraclius
Date: 11 April 2018 / Author: notoriousCASK / Helpful: 194/258 / Rating: 10/10
Date: 3 March 2001 / Helpful: 588/817 / Author: SdrolionGM


#### Extract rating

In [None]:
def find_rating(metadata):

  pattern = r"Rating: (\d{1,2})/10"

  matches = re.findall(pattern, metadata)

  if len(matches) == 0:
    #print(f"Found no ratings in {metadata}")
    return None
  elif len(matches) > 1:
    raise ValueError("Multiple ratings")
  else:
    return matches[0]

In [None]:
df_reviews["rating_string"] = df_reviews["metadata"].map(find_rating) # map alternatively to the apply function

In [None]:
df_reviews.head(10)

Unnamed: 0,text,metadata,date,rating_string
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0
5,"The Matrix...when I first heard about it, I ex...",Date: 3 March 2001 / Helpful: 588/817 / Author...,2001-03-03,
6,The Matrix is one of the best science fiction ...,Rating: 10/10 / Author: denisbabak / Date: 25 ...,2018-12-25,10.0
7,"The first time i watched this, i was absolutel...",Author: bencoops / Rating: 10/10 / Date: 31 Ja...,2019-01-31,10.0
8,The Wachowski brothers really did excel themse...,Author: emptyskies / Date: 23 April 2002 / Rat...,2002-04-23,10.0
9,My review of the best epic Science Fiction Act...,Helpful: 75/100 / Date: 30 July 2015 / Author:...,2015-07-30,10.0


In [None]:
df_reviews["rating_string"].value_counts()

10    1749
9      670
8      347
7      199
1      136
6      111
5       77
4       62
3       59
2       54
Name: rating_string, dtype: int64

In [None]:
a='10'
int(a)

10

In [None]:
# if errors="coerce", invalid parsing will be set to NaN
df_reviews["rating"] = pd.to_numeric(df_reviews["rating_string"], errors= "coerce")
df_reviews.drop("rating_string", axis = 1, inplace= True)

In [None]:
df_reviews

Unnamed: 0,text,metadata,date,rating
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0
...,...,...,...,...
4251,Wow what a ride! Plenty of time to think about...,Author: TheDoc-3 / Date: 19 June 1999 / Helpfu...,1999-06-19,
4252,"The story was boring, even though the plot had...",Helpful: 0/2 / Author: Poe-6 / Date: 1 April 1...,1999-04-01,6.0
4253,Having read some of the reviews of this film o...,Author: Coolflic / Rating: 9/10 / Date: 15 Jun...,1999-06-15,9.0
4254,In a parallel universe William Gibson knocked ...,Author: Tim-85 / Helpful: 0/2 / Rating: 5/10 /...,1999-06-13,5.0


In [None]:
# mean rating
np.mean(df_reviews["rating"])

8.489896073903003
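The stated goal is to predict whether a review gave the best possible rating, so the numeric rating eventually has to become a binary label. One way this could look, sketched on a hypothetical mini data frame (the column name `is_top_rating` is an assumption, not something defined in this notebook):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"rating": [10.0, 9.0, np.nan, 10.0, 1.0]})

# reviews without a rating cannot be labeled, so drop them first
df_labeled = df_demo.dropna(subset=["rating"]).copy()

# 1 if the review gave the top rating of 10, otherwise 0
df_labeled["is_top_rating"] = (df_labeled["rating"] == 10).astype(int)

print(df_labeled["is_top_rating"].tolist())
```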

In [None]:
print("\n".join(df_reviews.head(5)["metadata"]))

Author: mambubukid / Rating: 10/10 / Date: 19 September 2000
Rating: 10/10 / Date: 26 July 2014 / Helpful: 208/257 / Author: gogoschka-1
Rating: 10/10 / Date: 2 December 2005 / Helpful: 582/763 / Author: MinorityReporter
Date: 23 February 2020 / Helpful: 65/80 / Rating: 10/10 / Author: MR_Heraclius
Date: 11 April 2018 / Author: notoriousCASK / Helpful: 194/258 / Rating: 10/10


#### Extract helpfulness

In [None]:
string = "Date: 11 April 2018 / Author: notoriousCASK / Helpful: 194/258 / Rating: 10/10"
string2 = "Date: 23 February 2020 / Helpful: 65/80 / Rating: 10/10 / Author: MR_Heraclius"
pattern = r"Helpful: (\d+)/(\d+)"

matches = re.findall(pattern, string + string2)

In [None]:
matches

[('194', '258'), ('65', '80')]

In [None]:
def find_helpfulness(metadata):

  pattern = r"Helpful: (\d+)/(\d+)"

  matches = re.findall(pattern, metadata)

  if len(matches) == 0:
    return None
  elif len(matches) > 1:
    raise ValueError("Multiple helpfulness")
  else:
    #convert string to int:
    nb_help = int(matches[0][0])
    nb_all = int(matches[0][1])
    if nb_all <= 0:
      return None
    else:
      return nb_help / nb_all

In [None]:
df_reviews["helpful"] = df_reviews["metadata"].map(find_helpfulness)
df_reviews.head()

Unnamed: 0,text,metadata,date,rating,helpful
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938
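The same ratio can also be computed without a Python-level function by using the pandas string accessor. A sketch under the assumption that each metadata string contains at most one `Helpful:` field (`df_demo` and its values are made up for the demo):

```python
import pandas as pd

df_demo = pd.DataFrame({"metadata": [
    "Rating: 10/10 / Helpful: 208/257 / Author: x",
    "Author: y / Rating: 10/10",
    "Helpful: 0/0 / Author: z",
]})

# two capture groups -> two columns (0 and 1); rows without a match become NaN
parts = df_demo["metadata"].str.extract(r"Helpful: (\d+)/(\d+)").astype(float)

# avoid dividing by zero: a zero denominator is treated as missing
denominator = parts[1].where(parts[1] > 0)
df_demo["helpful"] = parts[0] / denominator

print(df_demo["helpful"])
```

Unlike `find_helpfulness`, this version does not raise an error on multiple matches; `str.extract` simply takes the first one.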


## Alternative: simplified matcher function

In [None]:
test=df_reviews['metadata'][0]
test

'Author: mambubukid / Rating: 10/10 / Date: 19 September 2000'

In [None]:
# re.match searches for a match only at the beginning of the string
z = re.match("Author: (.*) / Rating: (.*) / Date: (.*)", test)

z.groups()

('mambubukid', '10/10', '19 September 2000')

In [None]:
import re

mystr = "Author: asdasdlj / Rating: 10/10 / Date: 26 July 2014 / Helpful: Yes"

def matcher(mystr):
  # Note: splitting on "/" also splits values such as "10/10",
  # so only the part before the slash is captured.
  mylist = mystr.split(sep='/')
  mydict = {'Author': "", 'Rating': "", 'Date': "", 'Helpful': ""}
  for key in mydict:
    for part in mylist:
      # .* skips the leading space; (.*) captures everything after "key: "
      m = re.match(f".*{key}: (.*)", part)
      if m:
        mydict[key] = m.group(1)

  return mydict


In [None]:
matcher(mystr)

{'Author': 'asdasdlj ',
 'Rating': '10',
 'Date': '26 July 2014 ',
 'Helpful': 'Yes'}
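Note that splitting on a bare `/` also cuts `10/10` apart, which is why only `10` survives above, and that some values keep a trailing space. A possible variant, sketched here under the assumption that fields are always separated by `' / '` and labeled with `': '` (`parse_metadata` is a hypothetical helper, not part of the original notebook):

```python
def parse_metadata(metadata):
    """Split a metadata line into a dict of label -> value."""
    fields = {"Author": "", "Rating": "", "Date": "", "Helpful": ""}
    for part in metadata.split(" / "):
        # partition() splits at the first ": " only
        label, sep, value = part.partition(": ")
        if sep and label in fields:
            fields[label] = value.strip()
    return fields


parsed = parse_metadata(
    "Rating: 10/10 / Date: 26 July 2014 / Helpful: 208/257 / Author: gogoschka-1"
)
print(parsed)
```

With this variant the `Rating` value keeps its `/10` suffix, so it would have to be stripped before calling `pd.to_numeric`.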

In [None]:
df_reviews2["metadata"].apply(matcher).apply(pd.Series)

Unnamed: 0,Author,Rating,Date,Helpful
0,mambubukid,10,19 September 2000,
1,gogoschka-1,10,26 July 2014,208
2,MinorityReporter,10,2 December 2005,582
3,MR_Heraclius,10,23 February 2020,65
4,notoriousCASK,10,11 April 2018,194
...,...,...,...,...
4251,TheDoc-3,,19 June 1999,0
4252,Poe-6,6,1 April 1999,0
4253,Coolflic,9,15 June 1999,
4254,Tim-85,5,13 June 1999,0


In [None]:
df_reviews2_new = pd.concat([df_reviews2, df_reviews2["metadata"].apply(matcher).apply(pd.Series)], axis = 1)
df_reviews2_new

Unnamed: 0,text,metadata,Author,Rating,Date,Helpful
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,mambubukid,10,19 September 2000,
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,gogoschka-1,10,26 July 2014,208
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,MinorityReporter,10,2 December 2005,582
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,MR_Heraclius,10,23 February 2020,65
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,notoriousCASK,10,11 April 2018,194
...,...,...,...,...,...,...
4251,Wow what a ride! Plenty of time to think about...,Author: TheDoc-3 / Date: 19 June 1999 / Helpfu...,TheDoc-3,,19 June 1999,0
4252,"The story was boring, even though the plot had...",Helpful: 0/2 / Author: Poe-6 / Date: 1 April 1...,Poe-6,6,1 April 1999,0
4253,Having read some of the reviews of this film o...,Author: Coolflic / Rating: 9/10 / Date: 15 Jun...,Coolflic,9,15 June 1999,
4254,In a parallel universe William Gibson knocked ...,Author: Tim-85 / Helpful: 0/2 / Rating: 5/10 /...,Tim-85,5,13 June 1999,0


In [None]:
df_reviews2_new['Date'] = pd.to_datetime(df_reviews2_new['Date'])


In [None]:
df_reviews2_new['Rating'] = pd.to_numeric(df_reviews2_new['Rating'])

In [None]:
df_reviews2_new

Unnamed: 0,text,metadata,Author,Rating,Date,Helpful
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,mambubukid,10.0,2000-09-19,
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,gogoschka-1,10.0,2014-07-26,208
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,MinorityReporter,10.0,2005-12-02,582
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,MR_Heraclius,10.0,2020-02-23,65
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,notoriousCASK,10.0,2018-04-11,194
...,...,...,...,...,...,...
4251,Wow what a ride! Plenty of time to think about...,Author: TheDoc-3 / Date: 19 June 1999 / Helpfu...,TheDoc-3,,1999-06-19,0
4252,"The story was boring, even though the plot had...",Helpful: 0/2 / Author: Poe-6 / Date: 1 April 1...,Poe-6,6.0,1999-04-01,0
4253,Having read some of the reviews of this film o...,Author: Coolflic / Rating: 9/10 / Date: 15 Jun...,Coolflic,9.0,1999-06-15,
4254,In a parallel universe William Gibson knocked ...,Author: Tim-85 / Helpful: 0/2 / Rating: 5/10 /...,Tim-85,5.0,1999-06-13,0


# 3. Clean text

In the following we start looking into the text itself. Before we can apply more sophisticated methods such as stopword removal and stemming, we first need to clean the text.

Let's take one review as a sample text.

In [None]:
# second review in dataframe
sample_text = df_reviews.iloc[1]["text"]
sample_text

'** May contain spoilers ** There aren\'t many movies I watched in the theatre twice \x96 let alone on the same day - but immediately after the credits had rolled (and still pumped up by \'Rage against the Machine\'), I queued up for the next screening of \'The Matrix\'. I was so blown away by that film, I feared - and probably rightly so - that I hadn\'t caught every detail of what I\'d just seen. I later found out that many of my friends had had a similar reaction to the film, and I know virtually no one who liked the film and didn\'t watch it at least twice. It\'s simply one of those rare films that are so rich you just have to watch them several times. In structure, style and concept, \'The Matrix\' was ground-breaking; it marked the first time the visual style of Manga comic books and Anime such as \'Akira\' or \'Ghost in the Shell\' had been successfully translated to a live-action film. Apart from \'Blade Runner\', which has a totally different mood and pace (but is also a maste

First, we want to put everything into lower case.

In [None]:
text = sample_text.lower()
text

'** may contain spoilers ** there aren\'t many movies i watched in the theatre twice \x96 let alone on the same day - but immediately after the credits had rolled (and still pumped up by \'rage against the machine\'), i queued up for the next screening of \'the matrix\'. i was so blown away by that film, i feared - and probably rightly so - that i hadn\'t caught every detail of what i\'d just seen. i later found out that many of my friends had had a similar reaction to the film, and i know virtually no one who liked the film and didn\'t watch it at least twice. it\'s simply one of those rare films that are so rich you just have to watch them several times. in structure, style and concept, \'the matrix\' was ground-breaking; it marked the first time the visual style of manga comic books and anime such as \'akira\' or \'ghost in the shell\' had been successfully translated to a live-action film. apart from \'blade runner\', which has a totally different mood and pace (but is also a maste

Notice that there are contractions such as "aren't" and "hadn't" in the text. Let's get rid of them. Luckily, there is a nice little Python library that can do this for us ;-)

In [None]:
!pip install contractions
import contractions



Test the library on some examples

In [None]:
contractions.fix("you'll")

'you will'

In [None]:
text = contractions.fix(text)
text

'** may contain spoilers ** there are not many movies i watched in the theatre twice \x96 let alone on the same day - but immediately after the credits had rolled (and still pumped up by \'rage against the machine\'), i queued up for the next screening of \'the matrix\'. i was so blown away by that film, i feared - and probably rightly so - that i had not caught every detail of what i would just seen. i later found out that many of my friends had had a similar reaction to the film, and i know virtually no one who liked the film and did not watch it at least twice. it is simply one of those rare films that are so rich you just have to watch them several times. in structure, style and concept, \'the matrix\' was ground-breaking; it marked the first time the visual style of manga comic books and anime such as \'akira\' or \'ghost in the she will\' had been successfully translated to a live-action film. apart from \'blade runner\', which has a totally different mood and pace (but is also a

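Great! Note, however, that the expansion is not perfect: "i'd just seen" became "i would just seen", and "shell" followed by a closing quote was turned into "she will". We accept this here. There are also still quite a few special characters in the text (such as . or : or ,). Let's get rid of them using a regular expression.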

In [None]:
digi_punct = "[^a-z]" # all non-alphabet characters
text = re.sub(digi_punct, " ", text) # only keep characters of the alphabet
text

'   may contain spoilers    there are not many movies i watched in the theatre twice   let alone on the same day   but immediately after the credits had rolled  and still pumped up by  rage against the machine    i queued up for the next screening of  the matrix   i was so blown away by that film  i feared   and probably rightly so   that i had not caught every detail of what i would just seen  i later found out that many of my friends had had a similar reaction to the film  and i know virtually no one who liked the film and did not watch it at least twice  it is simply one of those rare films that are so rich you just have to watch them several times  in structure  style and concept   the matrix  was ground breaking  it marked the first time the visual style of manga comic books and anime such as  akira  or  ghost in the she will  had been successfully translated to a live action film  apart from  blade runner   which has a totally different mood and pace  but is also a masterpiece an
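Keep in mind that `[^a-z]` removes everything outside the 26 lowercase ASCII letters, which includes digits and accented letters. A tiny illustration on a made-up string:

```python
import re

sample = "café from 1999, rated 10/10!"

# every character that is not a-z is replaced by a space
cleaned = re.sub("[^a-z]", " ", sample)

# collapse the resulting runs of spaces for readability
words = " ".join(cleaned.split())
print(words)
```

For English movie reviews this loss is usually acceptable, but numbers such as ratings mentioned inside the text disappear as well.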

As a last step, we want to make sure that there is always exactly one space between two words.

In [None]:
len(text.split())

469

In [None]:
text = " ".join(text.split())  # first split on whitespace, then join with single spaces
text

'may contain spoilers there are not many movies i watched in the theatre twice let alone on the same day but immediately after the credits had rolled and still pumped up by rage against the machine i queued up for the next screening of the matrix i was so blown away by that film i feared and probably rightly so that i had not caught every detail of what i would just seen i later found out that many of my friends had had a similar reaction to the film and i know virtually no one who liked the film and did not watch it at least twice it is simply one of those rare films that are so rich you just have to watch them several times in structure style and concept the matrix was ground breaking it marked the first time the visual style of manga comic books and anime such as akira or ghost in the she will had been successfully translated to a live action film apart from blade runner which has a totally different mood and pace but is also a masterpiece and visionary film making there simply had 
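For ordinary text, the same result can be obtained with a regular expression that replaces every run of whitespace with a single space and then strips the ends. This is shown only as an alternative sketch on a made-up string:

```python
import re

messy = "   may contain   spoilers    there are not  many movies "

# \s+ matches one or more whitespace characters in a row
collapsed = re.sub(r"\s+", " ", messy).strip()

print(collapsed)
print(collapsed == " ".join(messy.split()))
```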

Very cool, this text looks pretty clean now, doesn't it? Let's put the steps above in a function and run it on all the reviews.

In [None]:
def clean_text(text):

  # put to lower case
  text = text.lower()

  # fix contractions
  text = contractions.fix(text)

  # remove special characters
  digi_punct = "[^a-z ]"
  text = re.sub(digi_punct, " ", text)

  # remove double whitespace
  text = " ".join(text.split())

  return text

sample_text_clean = clean_text(sample_text)
sample_text_clean

'may contain spoilers there are not many movies i watched in the theatre twice let alone on the same day but immediately after the credits had rolled and still pumped up by rage against the machine i queued up for the next screening of the matrix i was so blown away by that film i feared and probably rightly so that i had not caught every detail of what i would just seen i later found out that many of my friends had had a similar reaction to the film and i know virtually no one who liked the film and did not watch it at least twice it is simply one of those rare films that are so rich you just have to watch them several times in structure style and concept the matrix was ground breaking it marked the first time the visual style of manga comic books and anime such as akira or ghost in the she will had been successfully translated to a live action film apart from blade runner which has a totally different mood and pace but is also a masterpiece and visionary film making there simply had 
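
The order of the steps inside `clean_text` matters: contractions have to be expanded before the regular expression replaces the apostrophes with spaces. Below is a minimal sketch of that effect. It uses only `re` and a tiny hand-written lookup table; `TOY_CONTRACTIONS` and both helper functions are hypothetical stand-ins for illustration, not the data or API of the `contractions` package.

```python
import re

# Hypothetical mini lookup table standing in for contractions.fix()
TOY_CONTRACTIONS = {"aren't": "are not", "didn't": "did not"}

def strip_non_letters(text):
    # same idea as in clean_text: keep only a-z and spaces, then collapse whitespace
    return " ".join(re.sub("[^a-z ]", " ", text.lower()).split())

def expand_toy_contractions(text):
    # replace each known contraction by its long form
    for short, long_form in TOY_CONTRACTIONS.items():
        text = text.replace(short, long_form)
    return text

raw = "There aren't many movies I didn't like."

# punctuation removed first: the contractions are torn apart ("aren t")
without_fix = strip_non_letters(raw)
# contractions expanded first: the negation "not" survives as a proper word
with_fix = strip_non_letters(expand_toy_contractions(raw.lower()))

print(without_fix)
print(with_fix)
```

Without the expansion step, the stray letter `t` is all that remains of the negation, which is why `contractions.fix` runs before the regex in `clean_text`.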

In [None]:
df_reviews["text_clean"] = df_reviews["text"].apply(clean_text)
df_reviews.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,,the story of a reluctant christ like protagoni...
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339,may contain spoilers there are not many movies...
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779,writing a review of the matrix is a very hard ...
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125,the film is as well crafted as the matrix itse...
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938,without a doubt one of the best and most influ...


In [None]:
# Alternative 1: punctuation removal without regex
import string

In [None]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
a = 'test1 " ! J KKJH6 7 !!! ????'

In [None]:
"".join([c for c in a if c not in string.punctuation])

'test1   J KKJH6 7  '

In [None]:

a = '... some string with punctuation ...Please give me some time. @ sd  4 232. hallo, mein name ist Patricia. It is my sisters bday today. haha'
print("".join([w for w in a if w not in string.punctuation]))

 some string with punctuation Please give me some time  sd  4 232 hallo mein name ist Patricia It is my sisters bday today haha


In [None]:
'hallo'.isalpha()

True

In [None]:
a.split()

['...',
 'some',
 'string',
 'with',
 'punctuation',
 '...Please',
 'give',
 'me',
 'some',
 'time.',
 '@',
 'sd',
 '4',
 '232.',
 'hallo,',
 'mein',
 'name',
 'ist',
 'Patricia.',
 'It',
 'is',
 'my',
 'sisters',
 'bday',
 'today.',
 'haha']

In [None]:
[w for w in a.split() if w.isalpha()]

['some',
 'string',
 'with',
 'punctuation',
 'give',
 'me',
 'some',
 'sd',
 'mein',
 'name',
 'ist',
 'It',
 'is',
 'my',
 'sisters',
 'bday',
 'haha']

In [None]:
print(" ".join([w for w in a.split() if w.isalpha()]))

some string with punctuation give me some sd mein name ist It is my sisters bday haha
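
Note that the `isalpha()` filter works on whole tokens, so words with attached punctuation such as `time.` or `Patricia.` are dropped entirely. If that is not wanted, one possible sketch is to delete the punctuation characters first with `str.translate` and split afterwards. The example string is a shortened variant of the one above, and the variable names are chosen for illustration only.

```python
import string

a = "Please give me some time. Hallo, mein Name ist Patricia."

# translation table that deletes every character listed in string.punctuation
remove_punct = str.maketrans("", "", string.punctuation)

# delete punctuation first, then split into words
words_translate = a.translate(remove_punct).split()

# for comparison: the token-level isalpha() filter from above
words_isalpha = [w for w in a.split() if w.isalpha()]

print(words_translate)
print(words_isalpha)
```

The first list keeps `time`, `Hallo` and `Patricia`, while the second one loses them because the tokens still carried a `.` or `,`.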


In [None]:
#Alternative 2 to punctuation removal:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
a = '... some string with punctuation ...Please give me some time. @ sd  4 232 Hallo. Wi egeht es dir. Montag.'


In [None]:
sent_tokenize(a)

['... some string with punctuation ...Please give me some time.',
 '@ sd  4 232 Hallo.',
 'Wi egeht es dir.',
 'Montag.']

In [None]:
word_tokenize(a)

['...',
 'some',
 'string',
 'with',
 'punctuation',
 '...',
 'Please',
 'give',
 'me',
 'some',
 'time',
 '.',
 '@',
 'sd',
 '4',
 '232',
 'Hallo',
 '.',
 'Wi',
 'egeht',
 'es',
 'dir',
 '.',
 'Montag',
 '.']

In [None]:
sent_tokens_cleaned = []
for sent in sent_tokenize(a):
  tokens = word_tokenize(sent) # split the sentence into word tokens
  words = [word.lower() for word in tokens if word.isalpha()] # keep purely alphabetic tokens, lower-cased
  print(words)
  print(" ".join(words))
  sent_tokens_cleaned.append(" ".join(words))
print(sent_tokens_cleaned)

['some', 'string', 'with', 'punctuation', 'please', 'give', 'me', 'some', 'time']
some string with punctuation please give me some time
['sd', 'hallo']
sd hallo
['wi', 'egeht', 'es', 'dir']
wi egeht es dir
['montag']
montag
['some string with punctuation please give me some time', 'sd hallo', 'wi egeht es dir', 'montag']


In [None]:
# Alternative 3: without the sentence tokenizer
a ='... some string with punctuation ...Please give me some time. @ sd  4 232'
words = nltk.word_tokenize(a)

words=[word.lower() for word in words if word.isalpha()]
print(words)
print(" ".join(words))

['some', 'string', 'with', 'punctuation', 'please', 'give', 'me', 'some', 'time', 'sd']
some string with punctuation please give me some time sd


# 4. Remove Stop Words

In NLP it is a very common procedure to remove words that occur extremely often, such as "the", "it", "a", "to", ..., as they don't give us a lot of information.

Luckily, there is a great package called *nltk* that already contains the most common stopwords for different languages!

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
print(sorted(stopwords.words('english')))
print(len(stopwords.words('english')))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
type(stopwords.words('english'))

list

Having this list we can now easily delete the stopwords.

Notice: Stopwords also include words like "not" and "no". I delete them for now, but for our task of predicting perfect ratings one might be better off keeping them.
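
If one wanted to keep the negations, a minimal sketch could look as follows. To stay self-contained it uses a small hand-picked list, `toy_stopwords`, which is an assumption for illustration and not the full NLTK list; in the notebook one would start from `stopwords.words('english')` instead. Converting the list to a `set` also makes the membership test faster.

```python
# Small hand-picked stand-in for stopwords.words('english') (assumption, not the full NLTK list)
toy_stopwords = ["i", "the", "a", "did", "not", "no", "was", "it"]

# words we decide to keep although they are on the stopword list
negations = {"not", "no"}

# set difference: all stopwords except the negations
stopwords_keep_neg = set(toy_stopwords) - negations

review = "i did not like the ending it was no masterpiece"

# every stopword removed, including the negations
all_removed = " ".join(w for w in review.split() if w not in set(toy_stopwords))
# negations kept
neg_kept = " ".join(w for w in review.split() if w not in stopwords_keep_neg)

print(all_removed)
print(neg_kept)
```

With all stopwords removed, the toy review reads like a positive one; keeping `not` and `no` preserves the signal that matters for predicting ratings.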

In [None]:
my_stopwords = stopwords.words('english')

In [None]:
type(my_stopwords)

list

In [None]:
sample_text_clean

'may contain spoilers there are not many movies i watched in the theatre twice let alone on the same day but immediately after the credits had rolled and still pumped up by rage against the machine i queued up for the next screening of the matrix i was so blown away by that film i feared and probably rightly so that i had not caught every detail of what i would just seen i later found out that many of my friends had had a similar reaction to the film and i know virtually no one who liked the film and did not watch it at least twice it is simply one of those rare films that are so rich you just have to watch them several times in structure style and concept the matrix was ground breaking it marked the first time the visual style of manga comic books and anime such as akira or ghost in the she will had been successfully translated to a live action film apart from blade runner which has a totally different mood and pace but is also a masterpiece and visionary film making there simply had 

In [None]:
sample_text_stopwords = " ".join([word for word in sample_text_clean.split() if word not in my_stopwords])
sample_text_stopwords

'may contain spoilers many movies watched theatre twice let alone day immediately credits rolled still pumped rage machine queued next screening matrix blown away film feared probably rightly caught every detail would seen later found many friends similar reaction film know virtually one liked film watch least twice simply one rare films rich watch several times structure style concept matrix ground breaking marked first time visual style manga comic books anime akira ghost successfully translated live action film apart blade runner totally different mood pace also masterpiece visionary film making simply anything even remotely like jaw dropping action sequences raw gripping energy feel like adrenalin overdose unlike action films never overshadow story contrary enhance make complete sense within universe story think one original fascinating sci fi tales likely ever see screen clearly inspired japanese anime manga yet also authors like isaac asimov philip k dick story humanity war creat

In [None]:
df_reviews["text_stopwords"] = df_reviews["text_clean"].apply(lambda txt: " ".join([word for word in txt.split() if word not in my_stopwords]))
df_reviews.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean,text_stopwords
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,,the story of a reluctant christ like protagoni...,story reluctant christ like protagonist set ba...
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339,may contain spoilers there are not many movies...,may contain spoilers many movies watched theat...
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779,writing a review of the matrix is a very hard ...,writing review matrix hard thing film means lo...
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125,the film is as well crafted as the matrix itse...,film well crafted matrix another level entirel...
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938,without a doubt one of the best and most influ...,without doubt one best influential movies time...


Nice, we ended up with texts that no longer contain stopwords, which makes them easier to process in the following steps.

# 5. Stemming / Lemmatizing

As a next step we want to perform stemming / lemmatizing, i.e. bring each word back to its root. The idea is that e.g. different tenses are ignored, so that e.g.
love = loves = loved = loving

In [None]:
sentence = "I am loving the best movies about AI"

#### Stemming

Let's start with stemming, i.e. keeping only the root of each word. Notice that stemming is purely rule-based and does not take the meaning of a word into account.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
" ".join([stemmer.stem(word) for word in sentence.split()]) # stem every word in the sentence and join the stemmed words again

'i am love the best movi about ai'

Cool, let's run it over all texts

In [None]:
def stemming(sentence):

  stemmer = PorterStemmer()
  return " ".join([stemmer.stem(word) for word in sentence.split()])

In [None]:
df_reviews["text_stemming"] = df_reviews["text_stopwords"].apply(stemming)

In [None]:
df_reviews.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean,text_stopwords,text_stemming
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,,the story of a reluctant christ like protagoni...,story reluctant christ like protagonist set ba...,stori reluct christ like protagonist set baroq...
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339,may contain spoilers there are not many movies...,may contain spoilers many movies watched theat...,may contain spoiler mani movi watch theatr twi...
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779,writing a review of the matrix is a very hard ...,writing review matrix hard thing film means lo...,write review matrix hard thing film mean lot t...
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125,the film is as well crafted as the matrix itse...,film well crafted matrix another level entirel...,film well craft matrix anoth level entir scien...
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938,without a doubt one of the best and most influ...,without doubt one best influential movies time...,without doubt one best influenti movi time mat...


#### Lemmatizing

Let us try another way to bring words back to their roots: lemmatizing. Notice that lemmatizing does take some of a word's meaning and context into account, as we will see below.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(sentence)

doc

I am loving the best movies about AI

In [None]:
type(doc)

spacy.tokens.doc.Doc

In [None]:
sentence

'I am loving the best movies about AI'

In [None]:
" ".join([token.lemma_ for token in doc])

'I be love the good movie about AI'

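That is very interesting: it did not only reduce each word to its root but even converted irregular forms:
- am -> be
- best -> good

Note that the pronoun "I" is left unchanged in the output above (older spaCy versions returned the placeholder -PRON- for pronouns).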

Again, let's run it on the complete text. Unfortunately though, this runs for quite some time... :-(


In [None]:
def lemmatization(sentence):

  doc = nlp(sentence)
  return " ".join([token.lemma_ for token in doc])

In [None]:
df_reviews["text_lemma1"] = df_reviews["text_stopwords"].map(lemmatization)

In [None]:
df_reviews.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean,text_stopwords,text_stemming,text_lemma1
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,,the story of a reluctant christ like protagoni...,story reluctant christ like protagonist set ba...,stori reluct christ like protagonist set baroq...,story reluctant christ like protagonist set ba...
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339,may contain spoilers there are not many movies...,may contain spoilers many movies watched theat...,may contain spoiler mani movi watch theatr twi...,may contain spoiler many movie watch theatre t...
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779,writing a review of the matrix is a very hard ...,writing review matrix hard thing film means lo...,write review matrix hard thing film mean lot t...,write review matrix hard thing film mean lot t...
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125,the film is as well crafted as the matrix itse...,film well crafted matrix another level entirel...,film well craft matrix anoth level entir scien...,film well craft matrix another level entirely ...
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938,without a doubt one of the best and most influ...,without doubt one best influential movies time...,without doubt one best influenti movi time mat...,without doubt one good influential movie time ...


In [None]:
#Alternative:
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
" ".join([lemmatizer.lemmatize(word) for word in sentence.split()]) # lemmatize every word in the sentence and join the lemmatized words again

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


'I am loving the best movie about AI'

In [None]:
def lemmatization2(sentence):

  lemmatizer = WordNetLemmatizer()
  return " ".join([lemmatizer.lemmatize(word) for word in sentence.split()])

In [None]:
df_reviews["text_lemma2"] = df_reviews["text_stopwords"].map(lemmatization2)

In [None]:
df_reviews.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean,text_stopwords,text_stemming,text_lemma1,text_lemma2
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,,the story of a reluctant christ like protagoni...,story reluctant christ like protagonist set ba...,stori reluct christ like protagonist set baroq...,story reluctant christ like protagonist set ba...,story reluctant christ like protagonist set ba...
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339,may contain spoilers there are not many movies...,may contain spoilers many movies watched theat...,may contain spoiler mani movi watch theatr twi...,may contain spoiler many movie watch theatre t...,may contain spoiler many movie watched theatre...
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779,writing a review of the matrix is a very hard ...,writing review matrix hard thing film means lo...,write review matrix hard thing film mean lot t...,write review matrix hard thing film mean lot t...,writing review matrix hard thing film mean lot...
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125,the film is as well crafted as the matrix itse...,film well crafted matrix another level entirel...,film well craft matrix anoth level entir scien...,film well craft matrix another level entirely ...,film well crafted matrix another level entirel...
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938,without a doubt one of the best and most influ...,without doubt one best influential movies time...,without doubt one best influenti movi time mat...,without doubt one good influential movie time ...,without doubt one best influential movie time ...


# 6. Create train / test set

Let's create a training and test set based on all the reviews that contain a rating.

In [None]:
df_reviews["rating"]

0       10.0
1       10.0
2       10.0
3       10.0
4       10.0
        ... 
4251     NaN
4252     6.0
4253     9.0
4254     5.0
4255     NaN
Name: rating, Length: 4256, dtype: float64

In [None]:
# copy reviews that have a rating
df_reviews_with_rating = df_reviews.copy().dropna(subset=["rating"])
df_reviews_with_rating.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3464 entries, 0 to 4254
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   text            3464 non-null   object        
 1   metadata        3464 non-null   object        
 2   date            3464 non-null   datetime64[ns]
 3   rating          3464 non-null   float64       
 4   helpful         1354 non-null   float64       
 5   text_clean      3464 non-null   object        
 6   text_stopwords  3464 non-null   object        
 7   text_stemming   3464 non-null   object        
 8   text_lemma1     3464 non-null   object        
 9   text_lemma2     3464 non-null   object        
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 297.7+ KB


As mentioned above, we want to predict whether someone rated the movie with a perfect 10 or not. We therefore need to build the corresponding target.

In [None]:
df_reviews_with_rating["rating"][0] == 10

True

In [None]:
df_reviews_with_rating["perfect"] = 1 *(df_reviews_with_rating["rating"] == 10)

In [None]:
df_reviews_with_rating.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean,text_stopwords,text_stemming,text_lemma1,text_lemma2,perfect
0,The story of a reluctant Christ-like protagoni...,Author: mambubukid / Rating: 10/10 / Date: 19 ...,2000-09-19,10.0,,the story of a reluctant christ like protagoni...,story reluctant christ like protagonist set ba...,stori reluct christ like protagonist set baroq...,story reluctant christ like protagonist set ba...,story reluctant christ like protagonist set ba...,1
1,** May contain spoilers ** There aren't many m...,Rating: 10/10 / Date: 26 July 2014 / Helpful: ...,2014-07-26,10.0,0.809339,may contain spoilers there are not many movies...,may contain spoilers many movies watched theat...,may contain spoiler mani movi watch theatr twi...,may contain spoiler many movie watch theatre t...,may contain spoiler many movie watched theatre...,1
2,Writing a review of The Matrix is a very hard ...,Rating: 10/10 / Date: 2 December 2005 / Helpfu...,2005-12-02,10.0,0.762779,writing a review of the matrix is a very hard ...,writing review matrix hard thing film means lo...,write review matrix hard thing film mean lot t...,write review matrix hard thing film mean lot t...,writing review matrix hard thing film mean lot...,1
3,The film is as well crafted as the matrix itse...,Date: 23 February 2020 / Helpful: 65/80 / Rati...,2020-02-23,10.0,0.8125,the film is as well crafted as the matrix itse...,film well crafted matrix another level entirel...,film well craft matrix anoth level entir scien...,film well craft matrix another level entirely ...,film well crafted matrix another level entirel...,1
4,Without a doubt one of the best and most influ...,Date: 11 April 2018 / Author: notoriousCASK / ...,2018-04-11,10.0,0.751938,without a doubt one of the best and most influ...,without doubt one best influential movies time...,without doubt one best influenti movi time mat...,without doubt one good influential movie time ...,without doubt one best influential movie time ...,1


In [None]:
df_reviews_with_rating["perfect"].value_counts()

1    1749
0    1715
Name: perfect, dtype: int64

Perform the train/test split

In [None]:
from sklearn.model_selection import train_test_split

# stratify keeps the share of perfect / non-perfect ratings in the train and test sets
# the same as in the original dataset
df_train, df_test = train_test_split(df_reviews_with_rating,
                                     test_size = 0.2,
                                     random_state = 42,
                                     stratify = df_reviews_with_rating["perfect"]
                                     )

#df_train.reset_index(inplace=True, drop=True)
#df_test.reset_index(inplace=True, drop=True)

In [None]:
print(len(df_train))
print(len(df_test))

2771
693
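
To see what `stratify` does, here is a small sketch on made-up toy data (60 positive and 40 negative labels, numbers chosen only for illustration). After the split, both parts should show the same share of positive labels as the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy target: 60 ones and 40 zeros (made-up numbers for illustration)
y = np.array([1] * 60 + [0] * 40)
# dummy feature column, one row per label
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# share of positive labels in each part
print(y_train.mean(), y_test.mean())
```

Both shares equal the 60 % of the full toy data. Without `stratify`, the shares would only match approximately and would depend on the random seed.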


# 7. Create features

In order to convert text into numerical values, let us try 2 different strategies.

### Bag-of-words

The first thing we want to try is bag-of-words: for every word that appears in any of the reviews, we count how often it appears in each individual review.
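
Before using scikit-learn, the idea can be sketched by hand with `collections.Counter`. The two toy reviews are made up for illustration, and this sketch simply splits on whitespace, which is a simplification compared to the tokenization `CountVectorizer` performs.

```python
from collections import Counter

reviews = ["great movie great action", "boring movie"]

# vocabulary: every word that appears in any review, sorted alphabetically
vocabulary = sorted({word for review in reviews for word in review.split()})

# one row per review, one column per vocabulary word
rows = []
for review in reviews:
    counts = Counter(review.split())
    # Counter returns 0 for words that do not occur in this review
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
print(rows)
```

Most entries of such a table are zero as soon as the vocabulary grows, which is why scikit-learn stores the result as a sparse matrix.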

In [None]:
df_train["text_lemma1"]

4107    matrix action thriller sci fi profound philoso...
644     movie fantastic super interesting point might ...
1833    matrix one good picture ever make obvious peop...
1901    matrix really mesmerise film first time watch ...
3886    awe inspire mix philosophical debate spectacul...
                              ...                        
1121    hard see movie twice already enough want buy b...
420                             epic journey relevant day
273     director matrix mistake style dark depressing ...
1982    many moviegoer vote matrix one great film time...
3530    ok watch good say right good wait tell retarde...
Name: text_lemma1, Length: 2771, dtype: object

In [None]:
df_train.iloc[0]["text_lemma1"]

'matrix action thriller sci fi profound philosophical movie think want matrix system think maker history need way change author impart win appearance philosophical idea lot young people see movie action hope feel important idea soon'

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(df_train["text_lemma1"])
print(bag_of_words.shape) # the training data contains 11843 different words -> 11843 features, and 2771 training data points

(2771, 11843)


In [None]:
# 2771 is the length of the training set, i.e. the number of documents (rows of the data frame)
# 11843 is the number of unique words in our vocabulary, i.e. the number of features (columns)


In [None]:
print(bag_of_words[0])

  (0, 6405)	2
  (0, 106)	2
  (0, 10545)	1
  (0, 9053)	1
  (0, 3887)	1
  (0, 8118)	1
  (0, 7714)	2
  (0, 6811)	2
  (0, 10510)	2
  (0, 11439)	1
  (0, 10276)	1
  (0, 6268)	1
  (0, 4874)	1
  (0, 6963)	1
  (0, 11497)	1
  (0, 1581)	1
  (0, 722)	1
  (0, 5155)	1
  (0, 11633)	1
  (0, 496)	1
  (0, 5066)	2
  (0, 6161)	1
  (0, 11803)	1
  (0, 7623)	1
  (0, 9125)	1
  (0, 4940)	1
  (0, 3857)	1
  (0, 5177)	1
  (0, 9618)	1


Let's check how often each word appears.

bag_of_words.indices is an attribute of the sparse matrix bag_of_words, and it is an array of integers.

Each integer in bag_of_words.indices is the column index (feature index) of one stored, i.e. non-zero, entry of the sparse matrix.

Non-zero entries correspond to words (or tokens) that occur in the respective document. The value of a non-zero entry is the number of times that word appears in the document.

The array lists these column indices for all documents, one document (row) after the other. The indices are therefore not unique: if a word occurs in several documents, its index appears once per document.
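
A tiny hand-made matrix makes the layout easier to see. The sketch below builds a 3x3 matrix with SciPy (the values are made up) and prints the three arrays behind the Compressed Sparse Row format: `data`, `indices` and `indptr`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# toy count matrix: 3 documents (rows), 3 words (columns)
dense = np.array([[2, 0, 1],
                  [0, 0, 3],
                  [1, 0, 0]])
m = csr_matrix(dense)

print(m.data)     # stored (non-zero) values, row by row
print(m.indices)  # column index of each stored value
print(m.indptr)   # row i owns the slice indptr[i]:indptr[i+1] of data / indices

# column indices of the non-zero entries in row 0
row0_cols = m.indices[m.indptr[0]:m.indptr[1]]
print(row0_cols)
```

Column 2 shows up twice in `indices` because the corresponding word occurs in two documents, and `len(m.indices)` equals the number of stored elements.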

In [None]:
bag_of_words

<2771x11843 sparse matrix of type '<class 'numpy.int64'>'
	with 176258 stored elements in Compressed Sparse Row format>

In [None]:
bag_of_words.indices[0:20]

array([ 6405,   106, 10545,  9053,  3887,  8118,  7714,  6811, 10510,
       11439, 10276,  6268,  4874,  6963, 11497,  1581,   722,  5155,
       11633,   496], dtype=int32)

In [None]:
len(bag_of_words.indices)

176258

In [None]:
# dictionary with all words as keys and their column indices as values. Indices are assigned in alphabetical order
#count_vectorizer.vocabulary_

In [None]:
bagofwords_feat = count_vectorizer.vocabulary_

In [None]:
#.get_feature_names_out() returns an array of all words/keys from the count_vectorizer.vocabulary_, alphabetically ordered
count_vectorizer.get_feature_names_out()[0:20]

array(['aaaa', 'aah', 'aaliyah', 'aalox', 'aaron', 'abandon',
       'abbreviate', 'abbreviation', 'abduct', 'abdul', 'aberration',
       'ability', 'able', 'ably', 'abnormally', 'aboard', 'abolutely',
       'abominable', 'abomination', 'aboriginal'], dtype=object)

In [None]:
print(bag_of_words[0]) # (row index, column index of the word)  count; only words that appear in the review are listed

  (0, 6405)	2
  (0, 106)	2
  (0, 10545)	1
  (0, 9053)	1
  (0, 3887)	1
  (0, 8118)	1
  (0, 7714)	2
  (0, 6811)	2
  (0, 10510)	2
  (0, 11439)	1
  (0, 10276)	1
  (0, 6268)	1
  (0, 4874)	1
  (0, 6963)	1
  (0, 11497)	1
  (0, 1581)	1
  (0, 722)	1
  (0, 5155)	1
  (0, 11633)	1
  (0, 496)	1
  (0, 5066)	2
  (0, 6161)	1
  (0, 11803)	1
  (0, 7623)	1
  (0, 9125)	1
  (0, 4940)	1
  (0, 3857)	1
  (0, 5177)	1
  (0, 9618)	1


Hm, we only get back the indices...

Let's make it more readable and actually print the words together with their counts.

In [None]:
count_vectorizer.get_feature_names_out()[9618]

'soon'

In [None]:
feature_names = count_vectorizer.get_feature_names_out() # look up the vocabulary once instead of in every iteration
for index in bag_of_words[0].indices:
  print(f"{feature_names[index]}: {bag_of_words[0, index]}")

matrix: 2
action: 2
thriller: 1
sci: 1
fi: 1
profound: 1
philosophical: 2
movie: 2
think: 2
want: 1
system: 1
maker: 1
history: 1
need: 1
way: 1
change: 1
author: 1
impart: 1
win: 1
appearance: 1
idea: 2
lot: 1
young: 1
people: 1
see: 1
hope: 1
feel: 1
important: 1
soon: 1


Great! :-)

### TF-IDF

In the lecture we talked about the TF-IDF. Let's implement it below.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df_train["text_lemma1"])


In [None]:
# Later, for training: fit_transform on the training text df_train["text_lemma1"] returns our X_train,
# and transform (without fitting again) on the test text returns our X_test
#X_train = tfidf_vectorizer.fit_transform(df_train["text_lemma1"])
#X_test = tfidf_vectorizer.transform(df_test["text_lemma1"])
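
The difference between `fit_transform` and `transform` can be sketched on toy data (the texts are made up for illustration). The vocabulary is learned from the training texts only, so words that appear only in the test texts get no column and are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great movie", "boring movie"]
# "unexpected" and "ending" never occur in the training texts
test_texts = ["great unexpected ending"]

vectorizer = TfidfVectorizer()
# learns the vocabulary and the idf weights from the training texts only
X_train_toy = vectorizer.fit_transform(train_texts)
# reuses vocabulary and idf weights; unseen words are dropped
X_test_toy = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
```

Both matrices have the same number of columns, which is what a classifier trained on the training matrix needs. Calling `fit_transform` on the test texts instead would produce a different vocabulary and leak information from the test set.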

In [None]:
print(tfidf_matrix[0])

  (0, 9618)	0.1898752440779455
  (0, 5177)	0.18824312609231755
  (0, 3857)	0.12664949888895954
  (0, 4940)	0.16085671372920773
  (0, 9125)	0.06439119257217343
  (0, 7623)	0.10559874975865711
  (0, 11803)	0.19848837492520308
  (0, 6161)	0.11976119830842037
  (0, 5066)	0.24176733241453705
  (0, 496)	0.29305454162348415
  (0, 11633)	0.2121704722170167
  (0, 5155)	0.3097068169814539
  (0, 722)	0.25825638760421793
  (0, 1581)	0.162819455474006
  (0, 11497)	0.10617315862375835
  (0, 6963)	0.14312359798320493
  (0, 4874)	0.17629502195576488
  (0, 6268)	0.2121704722170167
  (0, 10276)	0.18048943582806434
  (0, 11439)	0.11866859406933869
  (0, 10510)	0.1702719380523902
  (0, 6811)	0.10354739158570456
  (0, 7714)	0.3124047079856391
  (0, 8118)	0.225608963937177
  (0, 3887)	0.10178824438416487
  (0, 9053)	0.10172331225305536
  (0, 10545)	0.1898752440779455
  (0, 106)	0.15419172361685327
  (0, 6405)	0.1258045156065948


Ok, again the same thing. We could now print each word. However, let us do something else instead. Let's quickly check the TF-IDFs for "matrix" and "action"

In [None]:
feat_dict = tfidf_vectorizer.vocabulary_

In [None]:
tfidf_vectorizer.get_feature_names_out()[:10]

array(['aaaa', 'aah', 'aaliyah', 'aalox', 'aaron', 'abandon',
       'abbreviate', 'abbreviation', 'abduct', 'abdul'], dtype=object)

In [None]:
feat_dict['matrix']

6405

In [None]:
feat_dict['action']

106

In [None]:
#same index for both CountVectorizer and TFidfVectorizer
bagofwords_feat['matrix']

6405

In [None]:
bagofwords_feat['action']

106

In [None]:
# bag of words
print(f"'matrix': {bag_of_words[0, 6405]}")
# tfidf
print(f"'matrix': {tfidf_matrix[0, 6405]}")

'matrix': 2
'matrix': 0.1258045156065948


In [None]:
# bag of words
print(f"'action': {bag_of_words[0, 106]}")
# tfidf
print(f"'action': {tfidf_matrix[0, 106]}")

'action': 2
'action': 0.15419172361685327


Interesting: while both words appear twice in the first review, "action" has a higher TF-IDF than "matrix". This is because "matrix" appears in many more documents, which lowers its inverse document frequency.
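
The effect can be reproduced by hand. The sketch below assumes the default settings of `TfidfVectorizer` as I understand them: raw counts as term frequency, the smoothed idf $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and L2 normalisation of each row. The toy corpus and the helper names `idf` and `tfidf_row` are made up for illustration.

```python
import math

# toy corpus (made up): "matrix" occurs in every document, "action" only in one
docs = [
    "matrix action matrix action",
    "matrix film",
    "matrix story",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def idf(term):
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # raw count * idf, then scale the row to unit Euclidean length
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(tokenized[0])
print(row0)
```

Both words occur twice in the first toy document, but "action" ends up with the larger weight because it appears in fewer documents than "matrix".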

Hm, it would be nice to understand which word has the highest TF-IDF, wouldn't it?

In [None]:
type(tfidf_matrix)

scipy.sparse._csr.csr_matrix

In [None]:
tfidf_matrix

<2771x11843 sparse matrix of type '<class 'numpy.float64'>'
	with 176258 stored elements in Compressed Sparse Row format>

In [None]:
tfidf_matrix.argmax() # position of the maximum in the flattened matrix (row * number of columns + column)

5218855

In [None]:
tfidf_matrix.shape

(2771, 11843)

In [None]:
index_maxtfidf = tfidf_matrix.argmax()% tfidf_matrix.shape[1]
index_maxtfidf

7935

In [None]:
tfidf_vectorizer.get_feature_names_out()[index_maxtfidf]

'positive'

In [None]:
feat_dict['positive']

7935

Ok, *positive* is the word with the highest TF-IDF. Let's look at the corresponding text...

Hm, as the output below shows, the review consists of this single word, so this is not a very helpful text, is it? :-D

In [None]:
tfidf_matrix.argmax() // tfidf_matrix.shape[1] # integer division gives the row index, i.e. the review that contains the maximum

440
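
Called without an axis, `argmax()` returns a position in the flattened (row-major) matrix, so the row is the integer quotient and the column is the remainder when dividing by the number of columns. A short sketch with the numbers from above, which also shows NumPy's `np.unravel_index` as an alternative:

```python
import numpy as np

shape = (2771, 11843)   # (number of reviews, vocabulary size) from above
flat_index = 5218855    # value returned by tfidf_matrix.argmax() above

# quotient = row index, remainder = column index
row, col = divmod(flat_index, shape[1])
print(row, col)

# NumPy offers the same conversion as a helper
print(np.unravel_index(flat_index, shape))
```

This yields row 440 and column 7935, matching the two values computed separately with `%` and the division.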

In [None]:
df_train.iloc[440]

text                                                     Positives:
metadata          Author: Cirene404 / Rating: 8/10 / Date: 6 Jul...
date                                            2019-07-06 00:00:00
rating                                                          8.0
helpful                                                         NaN
text_clean                                                positives
text_stopwords                                            positives
text_stemming                                                 posit
text_lemma1                                                positive
text_lemma2                                                positive
perfect                                                           0
Name: 2827, dtype: object

In [None]:
print(df_train.iloc[tfidf_matrix.argmax() // tfidf_matrix.shape[1]]["text"])

Positives:


In [None]:
# bag of words
print(f"'positives': {bag_of_words[440, 7935]}")
# tfidf
print(f"'positives': {tfidf_matrix[440, 7935]}")

'positives': 1
'positives': 1.0
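
A TF-IDF of 1.0 is what L2 normalization (the `norm="l2"` default of `TfidfVectorizer`) produces for a review that consists of a single term: whatever the raw weight is, dividing a one-element vector by its own length gives 1. A small sketch with made-up raw weights; the helper name `l2_normalize` is hypothetical:

```python
import numpy as np

def l2_normalize(weights):
    """Scale a vector of raw tf-idf weights to unit Euclidean length."""
    weights = np.asarray(weights, dtype=float)
    return weights / np.linalg.norm(weights)

# One-term review: the (arbitrary) raw weight 5.3 is scaled to 1
print(l2_normalize([5.3]))
# Longer review: the unit length is shared, so every entry is below 1
print(l2_normalize([5.3, 2.0, 1.0]))
```

This also explains why the maximum found above is not very informative: any one-word review reaches it.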


# 8. Classification Model

In [None]:
df_train.head()

Unnamed: 0,text,metadata,date,rating,helpful,text_clean,text_stopwords,text_stemming,text_lemma1,text_lemma2,perfect
4107,"The Matrix is not only action, thriller and sc...",Rating: 10/10 / Helpful: 0/1 / Date: 30 May 20...,2001-05-30,10.0,0.0,the matrix is not only action thriller and sci...,matrix action thriller sci fi profound philoso...,matrix action thriller sci fi profound philoso...,matrix action thriller sci fi profound philoso...,matrix action thriller sci fi profound philoso...,1
644,The movie is fantastic and super interesting t...,Author: kvngkesh / Helpful: 0/0 / Date: 24 May...,2020-05-24,6.0,,the movie is fantastic and super interesting t...,movie fantastic super interesting point might ...,movi fantast super interest point might forget...,movie fantastic super interesting point might ...,movie fantastic super interesting point might ...,0
1833,Matrix is one of the best pictures ever made. ...,Rating: 10/10 / Date: 10 August 2001 / Author:...,2001-08-10,10.0,,matrix is one of the best pictures ever made t...,matrix one best pictures ever made obvious peo...,matrix one best pictur ever made obviou peopl ...,matrix one good picture ever make obvious peop...,matrix one best picture ever made obvious peop...,1
1901,The Matrix really is a mesmerising film- the f...,Date: 7 February 2001 / Rating: 7/10 / Author:...,2001-02-07,7.0,,the matrix really is a mesmerising film the fi...,matrix really mesmerising film first time watc...,matrix realli mesmeris film first time watch m...,matrix really mesmerise film first time watch ...,matrix really mesmerising film first time watc...,0
3886,An awe inspiring mix of philosophical debate a...,Author: eamon-hennedy / Date: 15 May 2003 / Ra...,2003-05-15,10.0,,an awe inspiring mix of philosophical debate a...,awe inspiring mix philosophical debate spectac...,awe inspir mix philosoph debat spectacular act...,awe inspire mix philosophical debate spectacul...,awe inspiring mix philosophical debate spectac...,1


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

col_type = "text_stemming"

# TF-IDF features, limited to the 5000 most frequent terms
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
# plain bag-of-words counts over the full vocabulary
count_vectorizer = CountVectorizer()

# fit each vocabulary on the training set only, then apply it to the test set
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train[col_type])
X_test_tfidf = tfidf_vectorizer.transform(df_test[col_type])

X_train_count = count_vectorizer.fit_transform(df_train[col_type])
X_test_count = count_vectorizer.transform(df_test[col_type])

y_train = df_train["perfect"]
y_test = df_test["perfect"]

# one logistic regression per feature type
lr_tfidf = LogisticRegression()
lr_tfidf.fit(X_train_tfidf, y_train)

lr_count = LogisticRegression()
lr_count.fit(X_train_count, y_train)

# TF-IDF model: training accuracy, then test accuracy
print("Accuracy")
print(accuracy_score(y_train, lr_tfidf.predict(X_train_tfidf)))
print(accuracy_score(y_test, lr_tfidf.predict(X_test_tfidf)))

# bag-of-words model: training accuracy, then test accuracy
print("Accuracy")
print(accuracy_score(y_train, lr_count.predict(X_train_count)))
print(accuracy_score(y_test, lr_count.predict(X_test_count)))


Accuracy
0.858534824972934
0.6681096681096681
Accuracy
0.9750992421508481
0.6392496392496393


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
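
The warning above means that the solver stopped at its iteration limit before it converged. As the message itself suggests, one option is to raise `max_iter` (the default of `LogisticRegression` is 100). A minimal sketch on synthetic data; the value 1000 is an arbitrary assumption, not a tuned setting, and the data set is generated only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the count features (illustration only)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Allow the solver more iterations than the default of 100
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning disappears.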


# 8.1 Positive/Negative Words

In [None]:
lr_tfidf.coef_[0]

array([ 0.0860943 ,  0.05691675, -0.03444926, ...,  0.28344726,
       -0.10122485, -0.0719828 ])

In [None]:
coefs = [(coef, i) for i, coef in enumerate(lr_tfidf.coef_[0])]

In [None]:
# let's look at the 10 features with the lowest LR coefficients, i.e. those that are negatively correlated with our target variable
sorted(coefs)[:10]

[(-2.184458913780049, 301),
 (-1.7056838232123575, 445),
 (-1.6832872432979025, 3486),
 (-1.6021852744577616, 1820),
 (-1.5039324551187812, 1130),
 (-1.449129368168427, 4448),
 (-1.4488427862226698, 4963),
 (-1.4223221092495453, 1732),
 (-1.409007365064243, 3872),
 (-1.3868115510214525, 2277)]

In [None]:
# let's look at the 9 features with the highest LR coefficients, i.e. those that are positively correlated with our target variable
sorted(coefs)[-9:]

[(1.5882042415906283, 2739),
 (1.704496608306673, 1456),
 (1.8049452884515333, 3875),
 (1.841815748788472, 4515),
 (1.860382912268572, 2552),
 (2.2044030843615694, 364),
 (2.2406239874745215, 288),
 (2.2917646092106283, 129),
 (2.650740293511976, 1450)]

In [None]:
tfidf_vectorizer.get_feature_names_out()[301]  # index of the most negative coefficient above

'bad'

In [None]:
for _, index in sorted(coefs)[:10]:
  print(tfidf_vectorizer.get_feature_names_out()[index])

bad
bore
rate
good
dialogu
terribl
worst
fun
seem
kind


Pretty interesting, isn't it? And overall, most of the words actually make sense...

Let's look at the positive ones!

In [None]:
for _, index in sorted(coefs)[-9:]:
  print(tfidf_vectorizer.get_feature_names_out()[index])

movi
everyth
seen
time
masterpiec
best
awesom
amaz
ever
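
Sorting (coefficient, index) tuples works, but the same lookup can be written more compactly with `np.argsort`. A sketch with made-up values; `coef` and `feature_names` are hypothetical stand-ins for `lr_tfidf.coef_[0]` and `tfidf_vectorizer.get_feature_names_out()`:

```python
import numpy as np

# Hypothetical stand-ins for the model coefficients and the vocabulary
coef = np.array([0.4, -2.1, 1.7, -0.3, 2.6])
feature_names = np.array(["time", "bad", "best", "kind", "ever"])

order = np.argsort(coef)                        # ascending: most negative first
top_negative = feature_names[order[:2]]         # two lowest coefficients
top_positive = feature_names[order[::-1][:2]]   # two highest, largest first

print(top_negative, top_positive)
```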


# 9. Pipeline Classification model

Great, now that we have done all the pre-processing it is time to fit our model! We will be using a logistic regression below as a baseline model.

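Notice that we use a pipeline in order to vectorize the words and then train the algorithm. Using this is highly encouraged: if the vectorizer were fitted on the whole training set before cross-validation, its vocabulary and IDF weights would already contain information from the validation folds, i.e. we would have leakage in our cross-validation.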
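
To illustrate what would leak, here is a small sketch on made-up documents. It compares a vectorizer fitted on all the text with one fitted on the training fold only, which is what a pipeline does inside cross-validation; the fold contents are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (made up): the last list plays the role of a validation fold
train_fold = ["great movi", "bore movi", "best film ever"]
valid_fold = ["terribl dialogu"]

# Leaky variant: the vectorizer sees the validation text while building
# its vocabulary and document frequencies
leaky = TfidfVectorizer().fit(train_fold + valid_fold)

# Pipeline-style variant: the vectorizer is fitted on the training fold only
clean = TfidfVectorizer().fit(train_fold)

leaked_terms = set(leaky.vocabulary_) - set(clean.vocabulary_)
print(sorted(leaked_terms))
```

The terms printed at the end are known to the leaky vectorizer only because it has already seen the validation fold.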

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

col_type = "text_stemming"

y_train = df_train["perfect"]
y_test = df_test["perfect"]

# vectorizer and classifier in one estimator, so the vectorizer is re-fitted on every CV training fold
pipe = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                       ("lr", LogisticRegression(random_state=42))])

pipe.fit(df_train[col_type], y_train)

# each block prints three values: training set, 5-fold cross-validation mean, test set
print("Accuracy")
print(accuracy_score(y_train, pipe.predict(df_train[col_type])))
print(cross_val_score(pipe, df_train[col_type], y_train, cv=5).mean())
print(accuracy_score(y_test, pipe.predict(df_test[col_type])))


print("\nLogloss")
print(log_loss(y_train, pipe.predict_proba(df_train[col_type])[:,1]))
# scikit-learn reports the negated log loss, so flip the sign
print(-1*cross_val_score(pipe, df_train[col_type], y_train, cv=5, scoring="neg_log_loss").mean())
print(log_loss(y_test, pipe.predict_proba(df_test[col_type])[:,1]))

Accuracy
0.8639480332010104
0.7170663804598824
0.6695526695526696

Logloss
0.4636857909946523
0.5847677007701256
0.6001425449535787


#### Hyperparameter optimization

Notice that there is quite a lot of overfitting (training accuracy of 0.86 versus 0.72 in cross-validation and 0.67 on the test set) - let's see if hyperparameter optimization can help us in any sense...

In [None]:
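param_grid = {
    # keep only the top max_features terms, ordered by term frequency across the corpus
    'tfidf__max_features': [1000, 2000, 3000, 5000, None],
    # ignore terms whose document frequency is strictly higher than this threshold (corpus-specific stop words)
    # note: a float is a proportion of documents, whereas the integer 1 is read as an absolute count of 1 document
    'tfidf__max_df': [0.3, 0.5, 0.7, 1],
    # inverse regularization strength of the logistic regression (smaller = stronger regularization)
    'lr__C': [0.5, 0.7, 1, 2]
}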
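
One detail of `max_df` is easy to miss: a float is interpreted as a proportion of documents, while an integer is an absolute document count. So `1` and `1.0` are not the same setting. A small sketch on a made-up corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["matrix good", "matrix bad", "matrix great"]  # made-up toy corpus

# Integer 1: drop every term that occurs in more than one document
vocab_int = set(TfidfVectorizer(max_df=1).fit(docs).vocabulary_)
# Float 1.0: keep terms that occur in up to 100 % of the documents
vocab_float = set(TfidfVectorizer(max_df=1.0).fit(docs).vocabulary_)

print(sorted(vocab_int))
print(sorted(vocab_float))
```

With the integer, "matrix" is removed because it appears in all three documents; with the float it is kept.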

In [None]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(pipe, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=5, verbose=5)
gs.fit(df_train[col_type], y_train)

Fitting 5 folds for each of 80 candidates, totalling 400 fits


In [None]:
gs.best_params_

{'lr__C': 2, 'tfidf__max_df': 0.7, 'tfidf__max_features': None}
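
Instead of copying the best parameters into a new pipeline by hand, the fitted search object can be used directly: with the default `refit=True`, `GridSearchCV` retrains the best configuration on all the data it was given and exposes it as `best_estimator_`. A minimal sketch on made-up toy data; `toy_pipe`, `toy_gs`, the texts and the reduced grid are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for df_train[col_type] and y_train
texts = ["best movi ever", "amaz masterpiec", "awesom film", "best film",
         "bad movi", "bore dialogu", "terribl film", "worst movi ever"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                           ("lr", LogisticRegression(random_state=42))])
toy_gs = GridSearchCV(toy_pipe, {"lr__C": [0.5, 2]},
                      scoring="neg_log_loss", cv=2)
toy_gs.fit(texts, labels)

# The refitted pipeline with the best parameters, ready for prediction
best_pipe = toy_gs.best_estimator_
print(toy_gs.best_params_)
print(best_pipe.predict(["best masterpiec ever"]))
```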

In [None]:
# rebuild the pipeline with the best parameters found by the grid search
pipe = Pipeline(steps=[("tfidf", TfidfVectorizer(max_features=None, max_df=0.7)),
                       ("lr", LogisticRegression(random_state=42, C=2))])

pipe.fit(df_train[col_type], y_train)

# each block prints three values: training set, 5-fold cross-validation mean, test set
print("Accuracy")
print(accuracy_score(y_train, pipe.predict(df_train[col_type])))
print(cross_val_score(pipe, df_train[col_type], y_train, cv=5).mean())
print(accuracy_score(y_test, pipe.predict(df_test[col_type])))


print("\nLogloss")
print(log_loss(y_train, pipe.predict_proba(df_train[col_type])[:,1]))
print(-1*cross_val_score(pipe, df_train[col_type], y_train, cv=5, scoring="neg_log_loss").mean())
print(log_loss(y_test, pipe.predict_proba(df_test[col_type])[:,1]))

Accuracy
0.901840490797546
0.7076813998113637
0.6753246753246753

Logloss
0.3961330074220713
0.5790324245708401
0.602240648573212


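Ok, the results barely changed: the cross-validated log loss improved slightly (0.585 to 0.579), but the test log loss did not (0.600 to 0.602), and the cross-validated accuracy even dropped a little (0.717 to 0.708). Unfortunately, this is quite common when dealing with text data the way we did, as there are many, many (sparse) features compared to the number of observations.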

Out of interest, let's try to understand which words are identified as "positive" words and which ones as "negative" words according to the model.

# 9.1 Positive/Negative Words

In [None]:
coefs = [(coef, i) for i, coef in enumerate(pipe.named_steps["lr"].coef_[0])]

In [None]:
# let's look at the 10 features with the lowest LR coefficients, i.e. those that are negatively correlated with our target variable
sorted(coefs)[:10]

[(-2.8990088185072707, 620),
 (-2.2218090372392285, 943),
 (-2.158440273336379, 6821),
 (-2.1493711497036454, 9664),
 (-2.1477857455878397, 2244),
 (-2.0690311308513505, 8576),
 (-2.0600474117287293, 3411),
 (-1.9773345456337235, 4695),
 (-1.912598606137006, 7451),
 (-1.9121503979063141, 6054)]

In [None]:
# let's look at the 9 features with the highest LR coefficients, i.e. those that are positively correlated with our target variable
sorted(coefs)[-9:]

[(2.0371888391505606, 7447),
 (2.1824224828620897, 7454),
 (2.229011971140846, 2865),
 (2.3441027267695422, 8716),
 (2.716721958216159, 5175),
 (2.839895026844542, 768),
 (2.8539563062269817, 253),
 (3.01315683728349, 585),
 (3.4179503824718758, 2857)]

In [None]:
pipe.named_steps["tfidf"].get_feature_names_out()[620]

'bad'

In [None]:
for _, index in sorted(coefs)[:9]:
  print(pipe.named_steps["tfidf"].get_feature_names_out()[index])

bad
bore
rate
worst
dialogu
terribl
fun
kind
seem


Pretty interesting, isn't it? And overall, most of the words actually make sense...

Let's look at the positive ones!

In [None]:
for _, index in sorted(coefs)[-9:]:
  print(pipe.named_steps["tfidf"].get_feature_names_out()[index])

see
seen
everyth
time
masterpiec
best
amaz
awesom
ever
