# <center>Machine Learning on Textual Data</center>

---

In this exercise, we will preprocess the data with spaCy, based on the techniques we learnt in the spaCy introduction.

Then, we will see a few applications on the data.

- Text Classification
- Text Clustering
- Sentiment Analysis

### Dataset

The dataset contains about 1000 online reviews from each of three sources:

- Amazon
- Yelp
- IMDB

For each source, about 500 reviews are labelled positive and about 500 negative.

Each source is provided as a separate text file, which needs to be loaded into a dataframe.

<a href="https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences">Dataset Link</a>

## <center>Text Classification</center>

---

#### Load Libraries

In [16]:
import pandas as pd

#### Import Data

I. Yelp Data

In [17]:
data_yelp = pd.read_table('sentiment_data/yelp_labelled.txt', header = None)

print(data_yelp.shape)

data_yelp[:3]

(1000, 2)


| | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |


II. Amazon Data

In [19]:
data_amazon = pd.read_table('sentiment_data/amazon_cells_labelled.txt', header = None)

print(data_amazon.shape)

data_amazon[:3]

(1000, 2)


| | 0 | 1 |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |


III. IMDB Data

In [20]:
data_imdb = pd.read_table('sentiment_data/imdb_labelled.txt', header = None)

print(data_imdb.shape)

data_imdb[:3]

(748, 2)


| | 0 | 1 |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
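
The IMDB file yields 748 rows, not the roughly 1000 reviews described in the dataset summary. One possible explanation, assumed here rather than checked against the file itself, is that `pd.read_table` treats a double quote at the start of a field as the opening of a quoted string, so several lines are merged into one row until a closing quote is found. The sketch below reproduces that effect on a small made-up sample and shows that passing `quoting=csv.QUOTE_NONE` keeps one row per line.

```python
import csv
import io

import pandas as pd

# Made-up sample in the same "sentence<TAB>label" layout; two lines start with a double quote
sample = (
    '"Loved it.\t1\n'
    'Not good.\t0\n'
    '"Awful" acting.\t0\n'
    'Fine.\t1\n'
)

# Default behaviour: a leading double quote opens a quoted field that can span lines
df_default = pd.read_table(io.StringIO(sample), header=None)

# QUOTE_NONE tells the parser to treat double quotes as ordinary characters
df_no_quote = pd.read_table(io.StringIO(sample), header=None, quoting=csv.QUOTE_NONE)

print(df_default.shape)
print(df_no_quote.shape)
```

If the same cause applies to the real file, adding `quoting=csv.QUOTE_NONE` to the `read_table` call would change the row counts and every result computed from them, so the outputs shown in this notebook would no longer match.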


#### Combine the datasets

In [21]:
# Stack the 3 dataframes on top of one another
reviews = pd.concat([data_amazon, data_imdb, data_yelp], axis = 0, keys = ["Amazon", "imdb", "yelp"])

# Set column names
reviews.columns = ['review', 'label']

reviews.iloc[[500, 501, 502, 1500, 1501, 1502, 2500, 2501, 2502]]

| | | review | label |
|---|---|---|---|
| Amazon | 500 | The bose noise cancelling is amazing, which is... | 1 |
| Amazon | 501 | This battery is an excellent bargain! | 1 |
| Amazon | 502 | Defective crap. | 0 |
| imdb | 500 | It's a case of 'so bad it is laughable'. | 0 |
| imdb | 501 | ) very bad performance plays Angela Bennett, a... | 0 |
| imdb | 502 | It is a film about nothing, just a pretext to ... | 0 |
| yelp | 752 | Level 5 spicy was perfect, where spice didn't ... | 1 |
| yelp | 753 | We were sat right on time and our server from ... | 1 |
| yelp | 754 | Main thing I didn't enjoy is that the crowd is... | 0 |
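
Because `keys` was passed to `pd.concat`, the combined frame has a two-level index of (source, original row number). A small sketch with invented stand-in frames shows how that index lets one source be pulled back out with `.loc` and how the rows per source can be counted.

```python
import pandas as pd

# Invented stand-ins for the source frames
a = pd.DataFrame({0: ["good case", "bad cable"], 1: [1, 0]})
b = pd.DataFrame({0: ["slow movie"], 1: [0]})

combined = pd.concat([a, b], axis=0, keys=["Amazon", "imdb"])
combined.columns = ["review", "label"]

# Select every row from one source via the outer index level
amazon_only = combined.loc["Amazon"]

# Count rows per source
per_source = combined.groupby(level=0).size()

print(amazon_only)
print(per_source)
```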


#### Check the shape

In [22]:
reviews.shape

(2748, 2)

### Text Preprocessing

We will implement the following steps in our preprocessing pipeline.

- Tokenisation
- Lemmatization (left as a placeholder in the function below)
- Stop Words Removal
- Punctuation Removal
- Vectorisation

#### Create Stop Words list

In [23]:
from spacy.lang.en import English

In [24]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
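# The 'en' shortcut is not available in spaCy 3 and later; load the model by its full name
nlp = spacy.load('en_core_web_sm')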

# To build a list of stop words for filtering

In [25]:
stopwords = list(STOP_WORDS)

print(stopwords[:100])

['thereby', 'beforehand', 'amount', 'well', 'twenty', 'such', 'regarding', 'out', 'unless', 'again', 'fifty', 'than', 'something', 'therefore', '‘ve', 'others', 'only', 'in', 'three', "'re", 'but', 'cannot', 're', 'become', 'where', '’m', '‘ll', 'ever', 'everyone', 'never', 'becoming', 'make', '’s', 'now', 'has', 'not', 'their', 'next', 'no', "'m", 'whereas', 'show', 'very', 'might', 'per', 'until', 'whenever', 'although', 'as', 'name', 'everywhere', 'there', '‘re', 'throughout', 'some', 'hers', 'with', 'always', 'perhaps', 'elsewhere', 'somehow', 'empty', 'sometime', 'all', 'due', 'do', 'either', 'further', 'third', 'none', 'seemed', 'whole', 'must', 'me', 'call', 'i', 'indeed', 'thus', 'here', 'had', 'whatever', 'from', 'five', 'when', 'nobody', 'alone', 'latterly', 'why', '’d', 'side', 'whether', 'nothing', 'done', 'my', 'several', 'each', 'serious', 'nevertheless', 'wherever', 'whereby']


Check whether a word is in the list

In [26]:
print('which' in stopwords)
print('thing' in stopwords)

True
False


Let's add the word `thing` to the stop words list

In [27]:
stopwords = ['thing'] + list(STOP_WORDS)

print(stopwords[:5])

['thing', 'thereby', 'beforehand', 'amount', 'well']


#### Create Punctuation list

We will use the default punctuation list.

In [28]:
import string

punctuations = string.punctuation

print(punctuations)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


#### Create preprocessing Pipeline

The `clean_text` function defined below lowercases the text, removes punctuation, splits the result into tokens and filters out stop words. Lemmatization and stemming are marked as placeholders and are not applied.

In [32]:
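import re
# Define a function to handle all data cleaning and tokenisation
def clean_text(text):
    # Lowercase the text and drop punctuation characters
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (can yield empty strings at the edges)
    tokens = re.split(r'\W+', text)
    # Remove stop words
    text = [word for word in tokens if word not in stopwords]
    # Lemmatization and stemming are not implemented here
    return text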

Let's see how the `clean_text` function works on a simple example

In [33]:
sent = "He was saying some thing but he says a lot of things so I just wasn't paying attention." # :D 

clean_text(sent) 

# After preprocessing, only 7 of the original 18 tokens are left, which saves computation later.
# Try disabling one of the cleaning steps in clean_text and compare the results

['saying', 'says', 'lot', 'things', 'wasnt', 'paying', 'attention']
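
In the output above, `wasnt` survives because punctuation is stripped before the stop-word check, so the contraction can no longer match a stop list that stores it with its apostrophe. A minimal sketch of one alternative ordering follows: trim punctuation at the token edges, filter stop words, then remove any remaining punctuation and drop empty tokens. The tiny `DEMO_STOPWORDS` set and the name `clean_text_ordered` are illustrative assumptions, not part of the notebook, and the set is not spaCy's list.

```python
import string

# Hypothetical miniature stop-word list, used only for this illustration
DEMO_STOPWORDS = {"he", "was", "some", "thing", "but", "a", "of", "so", "i", "just", "wasn't"}

def clean_text_ordered(text):
    """Lowercase, trim edge punctuation, drop stop words, then remove inner punctuation."""
    tokens = text.lower().split()
    # Strip punctuation only at the token edges so "wasn't" keeps its apostrophe for the lookup
    tokens = [tok.strip(string.punctuation) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in DEMO_STOPWORDS]
    # Remove remaining punctuation inside tokens and discard anything left empty
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table) for tok in tokens]
    return [tok for tok in tokens if tok]

sent = "He was saying some thing but he says a lot of things so I just wasn't paying attention."
print(clean_text_ordered(sent))
```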

#### Vectorisation

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a vectorizer object and pass the clean_text function we created to the tokenizer argument
tfvectorizer = TfidfVectorizer(tokenizer = clean_text)
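
To make the weighting concrete, here is a hedged sketch of the arithmetic behind `TfidfVectorizer` with its default settings as documented by scikit-learn: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three-document corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

# Made-up, already tokenised corpus
docs = [["good", "phone"], ["bad", "phone"], ["good", "food", "good"]]

def tfidf_rows(tokenised_docs):
    """Return one {term: weight} dict per document using smoothed IDF and L2 normalisation."""
    n_docs = len(tokenised_docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in tokenised_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in tokenised_docs:
        tf = Counter(doc)
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

for row in tfidf_rows(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

Within a single document, a term that occurs in fewer documents of the corpus receives a larger weight than an equally frequent term that occurs in many.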

#### Build a Naive Bayes Classifier Object

In [38]:
from sklearn.naive_bayes import MultinomialNB
# Initialise Model Object
classifier = MultinomialNB()


#### Train - Test Split

The data is split into training and test datasets prior to feeding into the machine learning pipeline. 

In [39]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( reviews['review'], reviews['label'], 
                                                    test_size = 0.2, random_state = 42)


Check the shape

In [40]:
print('Train Shape : ', X_train.shape)

print('Test Shape : ', X_test.shape)

Train Shape :  (2198,)
Test Shape :  (550,)
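
A purely random split does not guarantee the same positive/negative ratio in both subsets. As a sketch, `train_test_split` accepts a `stratify` argument that preserves class proportions; the toy texts and labels below are invented for the demonstration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented toy data: 20 texts with a 50/50 label balance
texts = [f"review {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratification both subsets keep the 50/50 class balance
print(Counter(y_tr), Counter(y_te))
```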


#### Create Machine Learning Pipeline

In [41]:
from sklearn.pipeline import Pipeline

# Create the pipeline to clean, tokenize and vectorize with the TF-IDF vectorizer, then classify with Naive Bayes
# Multiple steps (transformers followed by a final estimator) can be chained in a Pipeline and are executed in sequence.
model_pipe = Pipeline( [ ('vectorizer', tfvectorizer), 
                         ('classifier', classifier) ] )
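
As a rough illustration of what the pipeline does internally, fitting calls `fit_transform` on the vectorizer and then `fit` on the classifier, while predicting calls `transform` followed by `predict`. The sketch below checks this on a tiny invented corpus; it uses the default tokenizer rather than `clean_text` to stay self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus
train_texts = ["great phone", "loved the food", "terrible service", "bad battery"]
train_labels = [1, 1, 0, 0]
new_texts = ["great food", "bad service"]

# Pipeline version
pipe = Pipeline([("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_preds = pipe.predict(new_texts)

# Equivalent manual version: vectorize, then fit and predict by hand
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_preds = clf.predict(vec.transform(new_texts))

print(pipe_preds, manual_preds)
```

Keeping the vectorizer inside the pipeline means it is fitted on the training texts only, so no vocabulary or IDF statistics leak in from the test set.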

#### Fit the Model

In [42]:
model_pipe.fit(X_train,y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function clean_text at 0x00000191D72BB948>)),
                ('classifier', MultinomialNB())])

#### Predict on test data

In [43]:
preds = model_pipe.predict(X_test)
preds[:10]

array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0], dtype=int64)

In [51]:
X_test[:10][2]

"Let's start with all the problems\x97the acting, especially from the lead professor, was very, very bad.  "

In [21]:
type(X_test[:10])

pandas.core.series.Series


#### Compute the accuracy

In [52]:
from sklearn.metrics import accuracy_score 

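# Pipeline.score returns the mean accuracy of the final classifier
# Accuracy on the training data
print("Train Accuracy: ", model_pipe.score(X_train, y_train))

# Accuracy on the test data
print("Test Accuracy: ", model_pipe.score(X_test, y_test))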

Train Accuracy:  0.9499545040946314
Test Accuracy:  0.7963636363636364
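
Accuracy alone hides which class the errors fall on, and the gap between roughly 0.95 on the training data and 0.80 on the test data suggests some overfitting. A sketch of a per-class breakdown with `sklearn.metrics` follows. The predictions reuse the ten values printed earlier, while `y_true_demo` is invented because the matching true labels are not shown.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten predictions as printed above; the true labels here are hypothetical
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
y_true_demo = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)
# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

print(acc)
print(cm)
```

On the real data, the same two calls could be applied to `y_test` and `preds`.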


---

## <center>Text Clustering</center>

---


#### Compute the tfidf matrix

In [53]:
tfidf = tfvectorizer.fit_transform(reviews['review'])

#### View the feature names

In [54]:
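# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
tfvectorizer.get_feature_names()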

['',
 '0',
 '010',
 '1',
 '10',
 '100',
 '1010',
 '11',
 '110',
 '1199',
 '12',
 '13',
 '15',
 '15lb',
 '17',
 '18',
 '18th',
 '1928',
 '1947',
 '1948',
 '1949',
 '1971',
 '1973',
 '1979',
 '1980s',
 '1986',
 '1995',
 '1998',
 '2',
 '20',
 '2000',
 '2005',
 '2006',
 '2007',
 '20th',
 '20the',
 '2160',
 '23',
 '24',
 '25',
 '2mp',
 '3',
 '30',
 '30s',
 '325',
 '34ths',
 '35',
 '350',
 '375',
 '3o',
 '4',
 '40',
 '400',
 '40min',
 '42',
 '45',
 '4s',
 '5',
 '50',
 '5020',
 '510',
 '5320',
 '54',
 '5of',
 '5year',
 '6',
 '680',
 '7',
 '70',
 '70000',
 '700w',
 '70s',
 '744',
 '750',
 '785',
 '8',
 '80',
 '80s',
 '810',
 '8125',
 '815pm',
 '8525',
 '8530',
 '8pm',
 '9',
 '90',
 '90s',
 '910',
 '95',
 'aailiyah',
 'abandoned',
 'abhor',
 'ability',
 'able',
 'abound',
 'abovepretty',
 'abroad',
 'absolute',
 'absolutel',
 'absolutely',
 'absolutley',
 'abstruse',
 'abysmal',
 'ac',
 'academy',
 'accents',
 'accept',
 'acceptable',
 'access',
 'accessable',
 'accessible',
 'accessing',
 ...

#### Convert tfidf matrix to dense form

In [55]:
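# Note: todense() returns a numpy.matrix; tfidf.toarray() gives a plain ndarray, which recent scikit-learn versions may require
dense = tfidf.todense()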
print(dense.shape)
print(type(dense))

(2748, 5146)
<class 'numpy.matrix'>


#### Build the kmeans model and retrieve labels

In [56]:
# Create the clusters using k-means
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, init = 'k-means++')

model = kmeans.fit(dense)
labels = model.labels_

print(labels)

[0 1 2 ... 0 0 0]


#### View the frequency count for each value in the label

The counts below show that the model has not done a good job: it assigns most reviews to a single cluster. Although the reviews come from three different sources, their vocabulary may be too similar for k-means to separate them. The model might simply be grouping sentences of similar length together.

In [57]:
pd.DataFrame(labels)[0].value_counts()

0    2426
1     171
2     151
Name: 0, dtype: int64