## 1. Data Collection and Cleaning

### 1.1 Import Libraries and Packages

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score, precision_score
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob, Word
from imblearn.over_sampling import ADASYN
%matplotlib inline

### 1.2 Load Metacritic Album Review Data

__This data set consists of user reviews that were web scraped from the albums listed on www.metacritic.com. Please refer to the "Metacritic Scraper" notebook in this respository for the code used to collect this data.__

The following are explanations of the features in this dataset

In [2]:
df = pd.read_csv('user_album_reviews.csv', engine='python')

In [3]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,rating,review
0,0,0,\n\nAgricultural Tragic\n\n,Corb Lund,New West,"Jun 26, 2020",80.0,\ntbd\n,Country,The latest full-length release for the Canadia...,\nKrishnaKniar\n,"Jul 28, 2020",7.0,\nThis album has a very good vibe and you woul...
1,1,1,\n\nCloser Than Together\n\n,The Avett Brothers,Universal,"Oct 4, 2019",56.0,\ntbd\n,Folk,the 10th full-length studio release for the fo...,\ndjbrate\n,"Nov 12, 2019",1.0,\nIncredibly disappointed with the political r...
2,2,2,\n\nIII\n\n,The Lumineers,Dualtone Music,"Sep 13, 2019",72.0,\n8.6\n,Country,The third full-length release for the Colorado...,\nDididi\n,"Oct 9, 2019",10.0,"\nA literal masterpiece, its so good, very goo..."
3,3,3,\n\nIII\n\n,The Lumineers,Dualtone Music,"Sep 13, 2019",72.0,\n8.6\n,Country,The third full-length release for the Colorado...,\nbrrunosouzza\n,"Sep 25, 2019",10.0,'III' é um dos poucos álbuns que fiquei ansios...
4,4,4,\n\nIII\n\n,The Lumineers,Dualtone Music,"Sep 13, 2019",72.0,\n8.6\n,Country,The third full-length release for the Colorado...,\ngollygee93\n,"Sep 26, 2019",9.0,Rather than continuing down the path of sample...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84536 entries, 0 to 84535
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    84536 non-null  int64  
 1   Unnamed: 0.1  84536 non-null  object 
 2   title         84054 non-null  object 
 3   artist        84054 non-null  object 
 4   label         83678 non-null  object 
 5   release_date  84054 non-null  object 
 6   metascore     84054 non-null  float64
 7   user_score    84054 non-null  object 
 8   genre         84054 non-null  object 
 9   summary       82742 non-null  object 
 10  name          84054 non-null  object 
 11  date          84054 non-null  object 
 12  rating        84054 non-null  float64
 13  review        84054 non-null  object 
dtypes: float64(2), int64(1), object(11)
memory usage: 9.0+ MB


### 1.3 Data Cleaning

Due to the webscraping process, there are about 500 rows where reviews that had spaces in between paragraphs were made into there own row for each additional paragraph and are located in the 'Unnamed' columns. The remaining columns for those rows were all null so they will be dropped. 

In [5]:
df = df[df['title'].notna()]

In [6]:
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1'])

In [7]:
def dropn(x):
    return x.replace('\n', '')

In [8]:
df['title'] = df['title'].apply(dropn)
df['user_score'] = df['user_score'].apply(dropn)
df['name'] = df['name'].apply(dropn)
df['review'] = df['review'].apply(dropn)

In [9]:
df['release_date'] = pd.to_datetime(df['release_date'])
df['date'] = pd.to_datetime(df['date'])

Albums that have 'tbd' as a user_score indicates that they have too few user_rating to aggregate the overall score. (Go back to see how you wanna do this

In [10]:
#df['user_score'] = np.where(df['user_score'] == 'tbd', '0.0', df['user_score'] )

In [11]:
#df['user_score'] = df['user_score'].astype('float')

In [12]:
df = df.sort_values(['genre', 'release_date'], ascending = (True,False)).reset_index(drop = True)

In [13]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,rating,review
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,10.0,This is John Mayer in the zone. This is where...
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,1.0,"I give Little, Good John kudos for at least t..."
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,3.0,John Mayer... oh John Mayer. A talented blues...
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,8.0,John Mayer brings a great sounding album as a ...
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,10.0,It is great to have John Mayer back. This alb...
...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,10.0,Wonderful compilation. Very impressed.
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,10.0,Inspirational
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,9.0,Sensacional
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,8.0,"8.5......pretty ingratiating. somehow, after ..."


In [14]:
def clean_rev(text):
    text = text.str.replace("<br/>", "")
    text = text.str.replace("'", '')
    text = text.str.replace("-", '')
    text = text.str.replace('(<a).*(>).*(</a>)', '')
    text = text.str.replace('&amp', '')
    text = text.str.replace('&gt', '')
    text = text.str.replace('&lt', '')
    text = text.str.replace('\xa0', ' ')
    text = text.str.replace('[^\w\s]', ' ')
    text = text.str.replace('[0-9]', ' ')
    text = text.str.lower() 
    return text
df['clean_review'] = clean_rev(df['review'])

In [15]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,rating,review,clean_review
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,10.0,This is John Mayer in the zone. This is where...,this is john mayer in the zone this is where...
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,1.0,"I give Little, Good John kudos for at least t...",i give little good john kudos for at least t...
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,3.0,John Mayer... oh John Mayer. A talented blues...,john mayer oh john mayer a talented blues...
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,8.0,John Mayer brings a great sounding album as a ...,john mayer brings a great sounding album as a ...
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,10.0,It is great to have John Mayer back. This alb...,it is great to have john mayer back this alb...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,10.0,Wonderful compilation. Very impressed.,wonderful compilation very impressed
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,10.0,Inspirational,inspirational
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,9.0,Sensacional,sensacional
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,8.0,"8.5......pretty ingratiating. somehow, after ...",pretty ingratiating somehow after ...


In [16]:
df['length'] = df['review'].astype(str).apply(len)
df['word_count'] = df['review'].apply(lambda x: len(str(x).split()))

In [17]:
analyzer = SentimentIntensityAnalyzer()


In [18]:
df['sentiment'] = [analyzer.polarity_scores(x)['compound'] for x in df['clean_review']]
df['negative'] = [analyzer.polarity_scores(x)['neg'] for x in df['clean_review']]
df['neutral'] = [analyzer.polarity_scores(x)['neu'] for x in df['clean_review']]
df['positive'] = [analyzer.polarity_scores(x)['pos'] for x in df['clean_review']]

In [25]:
df['sent_class'] = np.where(df['sentiment'] >= 0.05, 1, df['sentiment'])
df['sent_class'] = np.where((df['sentiment'] > -0.05) & (df['sentiment'] < 0.05), 0, df['sent_class'])
df['sent_class'] = np.where(df['sentiment'] <= -0.05, -1, df['sent_class'])

In [20]:
tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')

#df['clean_review'] = df['clean_review'].apply(tokenizer.tokenize)

In [19]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,rating,review,clean_review,length,word_count,sentiment,negative,neutral,positive
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,10.0,This is John Mayer in the zone. This is where...,this is john mayer in the zone this is where...,441,83,0.7227,0.026,0.890,0.083
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,1.0,"I give Little, Good John kudos for at least t...",i give little good john kudos for at least t...,575,102,0.1878,0.113,0.767,0.120
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,3.0,John Mayer... oh John Mayer. A talented blues...,john mayer oh john mayer a talented blues...,653,117,0.9601,0.039,0.789,0.172
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,8.0,John Mayer brings a great sounding album as a ...,john mayer brings a great sounding album as a ...,108,20,0.7964,0.000,0.677,0.323
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,10.0,It is great to have John Mayer back. This alb...,it is great to have john mayer back this alb...,123,22,0.9001,0.000,0.633,0.367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,10.0,Wonderful compilation. Very impressed.,wonderful compilation very impressed,39,4,0.7960,0.000,0.220,0.780
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,10.0,Inspirational,inspirational,14,1,0.5106,0.000,0.000,1.000
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,9.0,Sensacional,sensacional,15,1,0.0000,0.000,1.000,0.000
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,8.0,"8.5......pretty ingratiating. somehow, after ...",pretty ingratiating somehow after ...,187,30,0.7351,0.000,0.789,0.211


In [21]:
stopwords_list=stopwords.words('english')+list(string.punctuation)

In [22]:
#df['clean_review'] = df['clean_review'].apply(lambda text_list: [x for x in text_list if x not in stopwords_list])

In [24]:
def reduce(text):
    tokens = tokenizer.tokenize(text) # tokenize every review
    removed = [token.lower() for token in tokens if token.lower() not in stopwords_list]
    return removed

In [27]:
clean_reviews = df['clean_review']
target = df['sent_class']

In [29]:
# remove all stopwords, punctuations & unimportant words from the reviews and make a list
processed_data = list(map(reduce, clean_reviews))

In [28]:
lemmatizer = WordNetLemmatizer()

In [22]:
text = df['clean_review'].iloc[1]

In [24]:
df['clean_review'].iloc[1]

['give',
 'little',
 'good',
 'john',
 'kudos',
 'least',
 'turning',
 'lights',
 'studio',
 'said',
 'wish',
 'would',
 'crawl',
 'back',
 'primordial',
 'adultcontemporay',
 'ooze',
 'david',
 'gray',
 'sprung',
 'poor',
 'mans',
 'mark',
 'knopfler',
 'neither',
 'chops',
 'writing',
 'ability',
 'former',
 'dire',
 'straits',
 'frontman',
 'im',
 'ashamed',
 'admit',
 'room',
 'squares',
 'sucked',
 'miasma',
 'realized',
 'jig',
 'decidedly',
 'produced',
 'live',
 'albums',
 'three',
 'studio',
 'records',
 'since',
 'touchstone',
 'sadly',
 'best',
 'gig',
 'since',
 'backing',
 'dave',
 'chappelle']

In [31]:
lem_review = []
for j in processed_data:
    lem = ' '.join([lemmatizer.lemmatize(w) for w in j])
    lem_review.append(lem)

In [21]:
#def lem_function(text):
    #lem_review = []
    #for j in text:
        #lem = lemmatizer.lemmatize(j)
        #lem_review.append(lem)
    #return lem_review


In [35]:
lemmatizer.lemmatize('impressed')

'impressed'

In [31]:
test = ['fights', 'geese','trucks']

In [32]:
lem_function(test)

['fight', 'goose', 'truck']

In [33]:
df['lemmatized'] = df['clean_review'].apply(lem_function)

In [34]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,rating,review,clean_review,length,word_count,sentiment,negative,neutral,positive,lemmatized
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,10.0,This is John Mayer in the zone. This is where...,"[john, mayer, zone, lives, kind, music, making...",441,83,0.7227,0.026,0.890,0.083,"[john, mayer, zone, life, kind, music, making,..."
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,1.0,"I give Little, Good John kudos for at least t...","[give, little, good, john, kudos, least, turni...",575,102,0.1878,0.113,0.767,0.120,"[give, little, good, john, kudos, least, turni..."
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,3.0,John Mayer... oh John Mayer. A talented blues...,"[john, mayer, oh, john, mayer, talented, blues...",653,117,0.9601,0.039,0.789,0.172,"[john, mayer, oh, john, mayer, talented, blues..."
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,8.0,John Mayer brings a great sounding album as a ...,"[john, mayer, brings, great, sounding, album, ...",108,20,0.7964,0.000,0.677,0.323,"[john, mayer, brings, great, sounding, album, ..."
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,10.0,It is great to have John Mayer back. This alb...,"[great, john, mayer, back, album, definitely, ...",123,22,0.9001,0.000,0.633,0.367,"[great, john, mayer, back, album, definitely, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,10.0,Wonderful compilation. Very impressed.,"[wonderful, compilation, impressed]",39,4,0.7960,0.000,0.220,0.780,"[wonderful, compilation, impressed]"
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,10.0,Inspirational,[inspirational],14,1,0.5106,0.000,0.000,1.000,[inspirational]
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,9.0,Sensacional,[sensacional],15,1,0.0000,0.000,1.000,0.000,[sensacional]
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,8.0,"8.5......pretty ingratiating. somehow, after ...","[pretty, ingratiating, somehow, listens, forei...",187,30,0.7351,0.000,0.789,0.211,"[pretty, ingratiating, somehow, listens, forei..."


In [39]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,...,review,clean_review,length,word_count,sentiment,negative,neutral,positive,lemmatized,sent_class
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,...,This is John Mayer in the zone. This is where...,"[john, mayer, zone, lives, kind, music, making...",441,83,0.7227,0.026,0.890,0.083,"[john, mayer, zone, life, kind, music, making,...",1.0
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,...,"I give Little, Good John kudos for at least t...","[give, little, good, john, kudos, least, turni...",575,102,0.1878,0.113,0.767,0.120,"[give, little, good, john, kudos, least, turni...",1.0
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,...,John Mayer... oh John Mayer. A talented blues...,"[john, mayer, oh, john, mayer, talented, blues...",653,117,0.9601,0.039,0.789,0.172,"[john, mayer, oh, john, mayer, talented, blues...",1.0
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,...,John Mayer brings a great sounding album as a ...,"[john, mayer, brings, great, sounding, album, ...",108,20,0.7964,0.000,0.677,0.323,"[john, mayer, brings, great, sounding, album, ...",1.0
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,...,It is great to have John Mayer back. This alb...,"[great, john, mayer, back, album, definitely, ...",123,22,0.9001,0.000,0.633,0.367,"[great, john, mayer, back, album, definitely, ...",1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,...,Wonderful compilation. Very impressed.,"[wonderful, compilation, impressed]",39,4,0.7960,0.000,0.220,0.780,"[wonderful, compilation, impressed]",1.0
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,...,Inspirational,[inspirational],14,1,0.5106,0.000,0.000,1.000,[inspirational],1.0
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,...,Sensacional,[sensacional],15,1,0.0000,0.000,1.000,0.000,[sensacional],0.0
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,...,"8.5......pretty ingratiating. somehow, after ...","[pretty, ingratiating, somehow, listens, forei...",187,30,0.7351,0.000,0.789,0.211,"[pretty, ingratiating, somehow, listens, forei...",1.0


In [None]:
test

In [None]:
df['lemmatized'].iloc[76]

In [None]:
df['lemmatized'] = df['lemmatized'].apply

In [None]:
df.review.iloc[84022]

In [None]:
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity

In [None]:
df['blob_sentiment'] = df.review.apply(detect_sentiment)

In [None]:
df

In [None]:
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

In [None]:
df.genre.value_counts()

In [None]:
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

#### TFIDF

In [35]:
XL = lem_review
yL = target

In [36]:
XL_train, XL_test, yL_train, yL_test = train_test_split(XL, yL, test_size=0.2, random_state=1)
tfVectorizer = TfidfVectorizer()

XL_train_tf = tfVectorizer.fit_transform(XL_train)
XL_test_tf = tfVectorizer.transform(XL_test)

Baseline Models

In [75]:
#Fitting & predicting the Dummy Classifier (Baseline Model)
from sklearn.dummy import DummyClassifier
dclf = DummyClassifier() 

In [76]:

dclf.fit(XL_train_tf, yL_train)
y_preds_b = dclf.predict(XL_test_tf)
print('dummy accuracy:',accuracy_score(yL_test, yL_preds),
      'dummy forest f1:',f1_score(yL_test, yL_preds, average = 'weighted'))

dummy accuracy: 0.7775266194753435 dummy forest f1: 0.6802121975210639




In [37]:
rf_classifier = RandomForestClassifier(n_estimators=250)

In [42]:
from datetime import datetime
startTime = datetime.now()

#do something

#Python 3: 
print(datetime.now() - startTime)

2020-08-11 11:35:29.050818


In [43]:
startTime = datetime.now()

rf_classifier.fit(XL_train_tf, yL_train)
yL_preds = rf_classifier.predict(XL_test_tf)
print('random forest accuracy:',accuracy_score(yL_test, yL_preds),
      'random forest f1:',f1_score(yL_test, yL_preds, average = 'weighted'))
print(datetime.now() - startTime)

random forest accuracy: 0.8454583308547975 random forest f1: 0.8139312640202593
0:07:46.302236


In [45]:
nb_classifier = MultinomialNB()

In [46]:
startTime = datetime.now()

nb_classifier.fit(XL_train_tf, yL_train)
yL_preds = nb_classifier.predict(XL_test_tf)
print('naive bayes accuracy:',accuracy_score(yL_test, yL_preds),
      'naive bayes f1:',f1_score(yL_test, yL_preds, average = 'weighted'))
print(datetime.now() - startTime)

naive bayes accuracy: 0.7889477128070906 naive bayes f1: 0.7177447226071049
0:00:00.094633


In [48]:
from sklearn.svm import SVC
svc_classifier = SVC(kernel='linear')

In [49]:
startTime = datetime.now()

svc_classifier.fit(XL_train_tf, yL_train)
yL_preds = svc_classifier.predict(XL_test_tf)
print('support vector machine accuracy:',accuracy_score(yL_test, yL_preds),
      'support vector machine f1:',f1_score(yL_test, yL_preds, average = 'weighted'))
print(datetime.now() - startTime)

naive bayes accuracy: 0.8840045208494438 naive bayes f1: 0.8774995467780754
0:16:43.757172


Grid Search

In [50]:
from sklearn.model_selection import GridSearchCV

In [54]:
nb_params = {'alpha': [0.01,0.03,0.05,0.07,0.09,0.11,0.13,0.15,0.17,0.19],
              'fit_prior': [True, False],
              'class_prior': [[-1,0,1],[1,0,-1]]}

In [55]:
grid_nb = GridSearchCV(nb_classifier, param_grid=nb_params, cv=7, scoring='accuracy', verbose =1, n_jobs=-1)
grid_nb.fit(XL_train_tf, yL_train)

Fitting 7 folds for each of 40 candidates, totalling 280 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed:    3.6s finished
  self.class_log_prior_ = np.log(class_prior)
  self.class_log_prior_ = np.log(class_prior)


GridSearchCV(cv=7, estimator=MultinomialNB(), n_jobs=-1,
             param_grid={'alpha': [0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13,
                                   0.15, 0.17, 0.19],
                         'class_prior': [[-1, 0, 1], [1, 0, -1]],
                         'fit_prior': [True, False]},
             scoring='accuracy', verbose=1)

In [56]:
# examine the best model
print(grid_nb.best_score_)
# Dictionary containing the parameters (min_samples_split) used to generate that score
print(grid_nb.best_params_)
# Shows default parameters that we did not specify
print(grid_nb.best_estimator_)
#Identify the best score during fitting with cross-validation

0.7745936388061276
{'alpha': 0.01, 'class_prior': [1, 0, -1], 'fit_prior': True}
MultinomialNB(alpha=0.01, class_prior=[1, 0, -1])


In [57]:
yL_preds = grid_nb.best_estimator_.predict(XL_test_tf)
print('naive bayes accuracy:',accuracy_score(yL_test, yL_preds),
      'naive bayes f1:',f1_score(yL_test, yL_preds, average = 'weighted'))

naive bayes accuracy: 0.7775266194753435 naive bayes f1: 0.6802121975210639


In [72]:
#svm_params = {'C': [1, 10, 100, 1000],'kernel': ['rbf'], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

In [73]:
#svm_classifier = SVC()

In [74]:
#grid_svm = GridSearchCV(svm_classifier, param_grid=svm_params, cv=7, scoring='accuracy', verbose =1, n_jobs=-1)
#grid_svm.fit(XL_train_tf, yL_train)

Fitting 7 folds for each of 20 candidates, totalling 140 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


KeyboardInterrupt: 

In [56]:
# examine the best model
print(grid_nb.best_score_)
# Dictionary containing the parameters (min_samples_split) used to generate that score
print(grid_nb.best_params_)
# Shows default parameters that we did not specify
print(grid_nb.best_estimator_)
#Identify the best score during fitting with cross-validation

0.7745936388061276
{'alpha': 0.01, 'class_prior': [1, 0, -1], 'fit_prior': True}
MultinomialNB(alpha=0.01, class_prior=[1, 0, -1])


In [57]:
yL_preds = grid_nb.best_estimator_.predict(XL_test_tf)
print('naive bayes accuracy:',accuracy_score(yL_test, yL_preds),
      'naive bayes f1:',f1_score(yL_test, yL_preds, average = 'weighted'))

naive bayes accuracy: 0.7775266194753435 naive bayes f1: 0.6802121975210639


Spacy

In [79]:
import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups


In [81]:
sp = spacy.load('en_core_web_md')
lookups = Lookups()
lemm = Lemmatizer(lookups)

In [82]:
def lem_function(text):
    dummy = []
    #this is just a test to see if it works
    for word in sp(text):
        dummy.append(word.lemma_)
    return ' '.join(dummy)


In [87]:
from tqdm import tqdm

In [88]:
tqdm.pandas()

  from pandas import Panel


In [93]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,...,review,clean_review,length,word_count,sentiment,negative,neutral,positive,sent_class,sp_lm
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,...,This is John Mayer in the zone. This is where...,this is john mayer in the zone this is where...,441,83,0.7227,0.026,0.890,0.083,1.0,this be John Mayer in the zone . this be whe...
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,...,"I give Little, Good John kudos for at least t...",i give little good john kudos for at least t...,575,102,0.1878,0.113,0.767,0.120,1.0,"give little , Good John kudos for at least..."
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,...,John Mayer... oh John Mayer. A talented blues...,john mayer oh john mayer a talented blues...,653,117,0.9601,0.039,0.789,0.172,1.0,John Mayer ... oh John Mayer . a talented bl...
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,...,John Mayer brings a great sounding album as a ...,john mayer brings a great sounding album as a ...,108,20,0.7964,0.000,0.677,0.323,1.0,John Mayer bring a great sounding album as a m...
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,...,It is great to have John Mayer back. This alb...,it is great to have john mayer back this alb...,123,22,0.9001,0.000,0.633,0.367,1.0,be great to have John Mayer back . this al...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,...,Wonderful compilation. Very impressed.,wonderful compilation very impressed,39,4,0.7960,0.000,0.220,0.780,1.0,wonderful compilation . very impressed .
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,...,Inspirational,inspirational,14,1,0.5106,0.000,0.000,1.000,1.0,Inspirational
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,...,Sensacional,sensacional,15,1,0.0000,0.000,1.000,0.000,0.0,sensacional
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,...,"8.5......pretty ingratiating. somehow, after ...",pretty ingratiating somehow after ...,187,30,0.7351,0.000,0.789,0.211,1.0,"8.5 ...... pretty ingratiating . somehow , a..."


In [95]:
df['sp_clean'] = clean_rev(df['review'])

In [96]:
df['sp_lm'] = df['sp_clean'].progress_apply(lambda x: lem_function(x))

100%|██████████| 84054/84054 [20:37<00:00, 67.91it/s] 


In [97]:
df['sp_lm'] = df['sp_lm'].progress_apply(lambda x: x.replace('-PRON-', ' '))

100%|██████████| 84054/84054 [00:00<00:00, 465946.31it/s]


In [98]:
df

Unnamed: 0,title,artist,label,release_date,metascore,user_score,genre,summary,name,date,...,clean_review,length,word_count,sentiment,negative,neutral,positive,sent_class,sp_lm,sp_clean
0,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ibadukefan,2014-02-02,...,this is john mayer in the zone this is where...,441,83,0.7227,0.026,0.890,0.083,1.0,this be john mayer in the zone this be wher...,this is john mayer in the zone this is where...
1,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ToddW,2006-09-27,...,i give little good john kudos for at least t...,575,102,0.1878,0.113,0.767,0.120,1.0,i give little good john kudos for at least...,i give little good john kudos for at least t...
2,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ChristopherG.,2007-08-01,...,john mayer oh john mayer a talented blues...,653,117,0.9601,0.039,0.789,0.172,1.0,john mayer oh john mayer a talented bl...,john mayer oh john mayer a talented blues...
3,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,jfrotylpe532,2012-12-21,...,john mayer brings a great sounding album as a ...,108,20,0.7964,0.000,0.677,0.323,1.0,john mayer bring a great sounding album as a m...,john mayer brings a great sounding album as a ...
4,Continuum,John Mayer,Sony,2006-09-12,67.0,8.9,Adult Alternative,The singer-songwriter's first album in three y...,ErinY,2006-09-12,...,it is great to have john mayer back this alb...,123,22,0.9001,0.000,0.633,0.367,1.0,be great to have john mayer back this al...,it is great to have john mayer back this alb...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84049,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,billye,2006-09-18,...,wonderful compilation very impressed,39,4,0.7960,0.000,0.220,0.780,1.0,wonderful compilation very impressed,wonderful compilation very impressed
84050,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,JohnO,2006-11-12,...,inspirational,14,1,0.5106,0.000,0.000,1.000,1.0,inspirational,inspirational
84051,Tropicalia: A Brazilian Revolution In Sound,Various Artists,Soul Jazz,2006-02-13,93.0,8.3,World,This 19-song compiliation from the esteemed So...,Iky009,2014-01-06,...,sensacional,15,1,0.0000,0.000,1.000,0.000,0.0,sensacional,sensacional
84052,Congotronics,Konono No. 1,Crammed Discs,2005-09-27,87.0,7.2,World,"The first installment of a series of ""Congotro...",larryl,2006-04-12,...,pretty ingratiating somehow after ...,187,30,0.7351,0.000,0.789,0.211,1.0,pretty ingratiating somehow aft...,pretty ingratiating somehow after ...


In [100]:
XS = df['sp_lm']
yS = target

In [101]:
XS_train, XS_test, yS_train, yS_test = train_test_split(XS, yS, test_size=0.2, random_state=1)
tfVectorizer = TfidfVectorizer()

XS_train_tf = tfVectorizer.fit_transform(XS_train)
XS_test_tf = tfVectorizer.transform(XS_test)

In [37]:
rf_classifier = RandomForestClassifier(n_estimators=250)

In [102]:
startTime = datetime.now()

rf_classifier.fit(XS_train_tf, yS_train)
yS_preds = rf_classifier.predict(XS_test_tf)
print('random forest accuracy:',accuracy_score(yS_test, yS_preds),
      'random forest f1:',f1_score(yS_test, yS_preds, average = 'weighted'))
print(datetime.now() - startTime)

random forest accuracy: 0.8331449646065077 random forest f1: 0.7921355768688797
0:08:19.702632


In [45]:
nb_classifier = MultinomialNB()

In [103]:
startTime = datetime.now()

nb_classifier.fit(XS_train_tf, yS_train)
yS_preds = nb_classifier.predict(XS_test_tf)
print('naive bayes accuracy:',accuracy_score(yS_test, yS_preds),
      'naive bayes f1:',f1_score(yS_test, yS_preds, average = 'weighted'))
print(datetime.now() - startTime)

naive bayes accuracy: 0.7901968948902505 naive bayes f1: 0.7198077804243568
0:00:00.128877


In [48]:
from sklearn.svm import SVC
svc_classifier = SVC(kernel='linear')

In [104]:
startTime = datetime.now()

svc_classifier.fit(XS_train_tf, yS_train)
yS_preds = svc_classifier.predict(XS_test_tf)
print('support vector machine accuracy:',accuracy_score(yS_test, yL_preds),
      'support vector machine f1:',f1_score(yS_test, yS_preds, average = 'weighted'))
print(datetime.now() - startTime)

support vector machine accuracy: 0.7775266194753435 support vector machine f1: 0.8666692261184341
0:51:32.721307


In [None]:
df['sp'] = df['clean_review'].apply(tokenizer.tokenize)

In [None]:
df