## Applying ML to an NLP Problem

In this questionnaire, you will use the built-in machine learning (ML) model to predict whether a review is positive or negative.

## Business Scenario

You work for an online retail store that wants to enhance customer engagement by addressing negative reviews. The company aims to detect negative reviews and assign them to a customer service agent for resolution.

Your task is to build an ML model that detects negative reviews. You have access to a dataset that contains reviews classified as either positive or negative. You will use this dataset to train an ML model that predicts the sentiment of new reviews.

## About the Dataset

The [AMAZON-REVIEW-DATA-CLASSIFICATION.csv](https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp/tree/master/data/examples) file contains actual product reviews. Each review includes a sentiment classification: positive or negative. The dataset serves as training data for the ML model you will develop.

## Assignment Structure

- Read the instructions carefully.
- Follow the guided steps to complete the ML pipeline.
- Submit your completed work as per the given submission guidelines.

### Instructions

- Ensure your code is syntactically correct before submission.
- Answer all required questions concisely and clearly.
- Document your approach where necessary.

By following this structure, you will gain practical experience in applying ML to sentiment analysis problems. Good luck!

Start by installing/upgrading pip, sagemaker, and scikit-learn.

[scikit-learn](https://scikit-learn.org/stable/) is an open source machine learning library. It provides tools for model fitting, data preprocessing, model selection, and model evaluation, along with many other utilities.

( provide your answer or code in the box below.)

In [131]:
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
import xgboost as xgb

## 1. Reading the dataset

( provide your answer or code in the box below.)

In [132]:
import pandas as pd
df = pd.read_csv('AMAZON-REVIEW-DATA-CLASSIFICATION.csv')
df.shape

(70000, 6)

Look at the first five rows in the dataset.

( provide your answer or code in the box below.)

In [133]:
df.head(5)

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO sMALL FOR ME""\nLAPTOP. IDEAL FOR LEARNING A\nFUTURE GOOD SKILL. HER CHOICE\nOF BOOKS IS A PLUS AS WAS THIS BOOK!",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,"I attempted to install this OS on two different PCs. it will not complete the install.\nWhen it gets to the page to select the language, and country the mouse and keyboard become non-functional.",I attempted to install this OS on two different PCs ...,True,1518912000,0.0,0.0
4,"I've spent 14 fruitless hours over the past two days fruitlessly attempting to install this software on my computer and nothing I've found has worked. I need the software to type proficiently due to disability, and it will not install. The download itself seems to be a corrupted file, I have a fair amount of computer skills and no amount of tinkering has made the program work, so, judging by other reviews, it must be the program itself.",Do NOT Download.,True,1441929600,1.098612,0.0


Change the options in the notebook to display more of the text data.

( provide your answer or code in the box below.)

In [140]:
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)
df.head()


Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO sMALL FOR ME""\nLAPTOP. IDEAL FOR LEARNING A\nFUTURE GOOD SKILL. HER CHOICE\nOF BOOKS IS A PLUS AS WAS THIS BOOK!",IDEAL FOR BEGINNER!,True,1361836800,0.0,0
1,unable to open or use,Two Stars,True,1452643200,0.0,1
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,1
3,"I attempted to install this OS on two different PCs. it will not complete the install.\nWhen it gets to the page to select the language, and country the mouse and keyboard become non-functional.",I attempted to install this OS on two different PCs ...,True,1518912000,0.0,1
4,"I've spent 14 fruitless hours over the past two days fruitlessly attempting to install this software on my computer and nothing I've found has worked. I need the software to type proficiently due to disability, and it will not install. The download itself seems to be a corrupted file, I have a fair amount of computer skills and no amount of tinkering has made the program work, so, judging by other reviews, it must be the program itself.",Do NOT Download.,True,1441929600,1.098612,1
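As a small illustration of what the `display.max_colwidth` option does, the sketch below compares the printed form of a toy frame with a 50-character column limit and with no limit. The frame `toy` and its contents are made up for the example and are not part of the dataset.

```python
import pandas as pd

# Toy frame with one long text value (illustrative data, not from the dataset)
toy = pd.DataFrame({"reviewText": ["x" * 120]})

# With a 50-character column limit, long strings are cut off with "..."
with pd.option_context("display.max_colwidth", 50):
    truncated = repr(toy)

# With max_colwidth set to None, the full string is shown
with pd.option_context("display.max_colwidth", None):
    full = repr(toy)

print(truncated)
print(full)
```

Using `pd.option_context` keeps the change local to the `with` block, whereas `pd.set_option` changes the setting for the rest of the session.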


Find the content of the entry at index 580.

( provide your answer or code in the box below.)

In [141]:
df.loc[580]

Unnamed: 0,580
reviewText,"i had no intention of using this product; like others, i just got it as part of a bundle that gave me a substantial rebate. since so many other people have written about their problems actually getting the rebate, i figured i should chime in with my experience. if you follow the direction precisely you will get your rebate. just don't ever actually install the software. and put a reminder on your calendar to go and cancel your automatic subscription renewal. their rebate tracking is actually pretty good. it took about two weeks from when i sent it (snail mail) to when their system said they received it. then it took another two weeks for them to approve it. when i filed the rebate i opted for the expedited version which cost me 10% of the rebate (deducted directly from the rebate-no money up front). after approval, they mailed it within four days and a received it a couple of days later. It comes in a completely nondescript envelope so it's easy to miss. But it's a real, live AmEX prepaid gift card. I used it last night in fact. And today I remembered to go cancel my automatic renewal service. Contrary to what some others have said, it took on the first time and wasn't a big deal. I used a throw away email address when I signed up (though I haven't received any email from them since) so it took me longer to figure out what my username was to sign in than it did for me to actually cancel."
summary,worth the rebate if you can follow instructions. i wouldn't install the software though....
verified,False
time,1404172800
log_votes,0.0
isPositive,0


Find what data types you are dealing with. You can use `dtypes` on the dataframe to display the types.

( provide your answer or code in the box below.)

In [142]:
df.dtypes

Unnamed: 0,0
reviewText,object
summary,object
verified,bool
time,int64
log_votes,float64
isPositive,int64


## 2. Performing exploratory data analysis
([Go to top](#Lab-2.1:-Applying-ML-to-an-NLP-Problem))

Find the target distribution for your dataset.

( provide your answer or code in the box below.)

In [143]:
df['isPositive'].value_counts()

Unnamed: 0_level_0,count
isPositive,Unnamed: 1_level_1
0,43692
1,26308
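The counts above correspond to roughly 62 percent of reviews in class 0 and 38 percent in class 1, so the classes are moderately imbalanced. A minimal sketch of how proportions can be read directly, using a toy label series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy labels standing in for df['isPositive'] (illustrative values)
labels = pd.Series([0, 0, 0, 1, 1], name="isPositive")

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```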


The business problem is concerned with finding the negative reviews (_0_). However, the model tuning for linear learner defaults to finding positive values (_1_). You can make this process run more smoothly by switching the negative values (_0_) and positive values (_1_). By doing so, you can tune the model more easily.

( provide your answer or code in the box below.)

In [144]:
df['isPositive'] = df['isPositive'].map({0: 1, 1: 0})
df['isPositive'].value_counts()

Unnamed: 0_level_0,count
isPositive,Unnamed: 1_level_1
1,43692
0,26308
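The label swap can be checked on a toy series before applying it to the full column. The sketch below assumes the labels contain only 0 and 1; with `map`, any value missing from the dictionary would become `NaN`, so it is worth confirming that no new missing values appear afterward.

```python
import pandas as pd

# Toy labels (illustrative values, not from the dataset)
labels = pd.Series([0, 1, 1, 0], name="isPositive")

# Dictionary-based swap, as used in the cell above
swapped_map = labels.map({0: 1, 1: 0})

# Equivalent arithmetic swap for strictly binary 0/1 labels
swapped_arith = 1 - labels

print(swapped_map.tolist())
print(swapped_arith.tolist())
print("Missing after swap:", int(swapped_map.isnull().sum()))
```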


Check the number of missing values:

( provide your answer or code in the box below.)

In [146]:
df.isnull().sum()

Unnamed: 0,0
reviewText,12
summary,15
verified,0
time,0
log_votes,0
isPositive,0


The text fields have missing values. Typically, you would decide what to do with these missing values. You could remove the data or fill it with some standard text.

( provide your answer or code in the box below.)

In [121]:
df = df.dropna()
df.isnull().sum()

Unnamed: 0,0
reviewText,0
summary,0
verified,0
time,0
log_votes,0
isPositive,0
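The two options mentioned above (removing rows or filling them with standard text) can be compared on a toy frame. This is a sketch with made-up rows; the choice of an empty string as the fill value is an assumption for illustration.

```python
import pandas as pd

# Toy frame with missing text values (illustrative data)
toy = pd.DataFrame({
    "reviewText": ["works fine", None, "broken"],
    "summary": [None, "ok", "bad"],
    "isPositive": [1, 1, 0],
})

# Option 1: drop every row that is missing either text field
dropped = toy.dropna(subset=["reviewText", "summary"])

# Option 2: keep all rows and fill missing text with an empty string
filled = toy.fillna({"reviewText": "", "summary": ""})

print(len(dropped), "rows after dropna")
print(len(filled), "rows after fillna")
```

Dropping rows loses the labels attached to them, while filling keeps every label at the cost of some empty text features. With only a handful of missing values in 70,000 rows, either choice should have little effect here.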


## 3. Text processing: Removing stopwords and stemming
([Go to top](#Lab-2.1:-Applying-ML-to-an-NLP-Problem))

In this step, you will remove some of the stopwords, and perform stemming on the text data. You are normalizing the data to reduce the amount of different information you have to deal with.

[nltk](https://www.nltk.org/) is a popular platform for working with human language data. It provides interfaces and functions for processing text for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

Once imported, you can download only the functionality you need. In this example, you will use:

- **punkt** is a sentence tokenizer
- **stopwords** provides a list of stopwords you can use.

( provide your answer or code in the box below.)

In [147]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Create the processes for removing stopwords and cleaning the text in the following section. The Natural Language Toolkit (NLTK) library provides a list of common stopwords. You will use the list, but you will first remove some of the words from that list. The stopwords that you keep in the text are useful for determining sentiment.

( provide your answer or code in the box below.)

In [148]:
import nltk, re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

stop = stopwords.words('english')

# Negations and auxiliaries carry sentiment, so they are kept in the text
exclude = ["against", "not", "don", "don't", "ain", "are", "aren't", "could", "couldn't",
           "did", "didn't", "does", "doesn't", "had", "hadn't", "has", "hasn't",
           "have", "haven't", "is", "isn't", "might", "mightn't", "must", "mustn't",
           "weren't", "won't", "would", "wouldn't"]

# Note: this rebinds the name `stopwords` from the NLTK corpus reader to the filtered list
stopwords = [word for word in stop if word not in exclude]
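To see why negations are worth keeping, the sketch below filters a short sentence with and without them. The stopword list here is a small hypothetical stand-in, not the NLTK list, and `drop_stopwords` is a helper introduced only for this example.

```python
# Hypothetical miniature stopword list (NOT the NLTK list)
base_stopwords = ["the", "is", "not", "a", "this", "did"]

# Words to keep in the text because they signal sentiment
keep = {"not", "did"}
filtered_stopwords = [w for w in base_stopwords if w not in keep]


def drop_stopwords(tokens, stop_list):
    """Return the tokens that are not in the given stopword list."""
    return [t for t in tokens if t not in stop_list]


tokens = "this did not work".split()

print(drop_stopwords(tokens, base_stopwords))      # negation removed
print(drop_stopwords(tokens, filtered_stopwords))  # negation preserved
```

With the full list, "this did not work" reduces to a single neutral-looking token; with the filtered list, the negation survives and remains available as a feature.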

You can use the Snowball stemmer to stem words.

( provide your answer or code in the box below.)

In [149]:
snow = SnowballStemmer("english")
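As a quick illustration of what stemming does, the sketch below runs a few sample words through the same stemmer class. The word list is made up for the example; the point is that different inflections of a word collapse to one shared stem, which reduces the vocabulary the model has to learn.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Sample words chosen for illustration
words = ["running", "reviews", "installed", "installing"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Stems are not always dictionary words. They only need to be consistent so that related forms map to the same feature.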

You must perform a few other normalization tasks on the data. The following function will:

- Replace any missing values with an empty string
- Convert the text to lowercase
- Remove any leading or trailing whitespace
- Remove any extra space and tabs
- Remove any HTML markup

In the `for` loop, any words that are __NOT__ numeric, longer than 2 characters, and not part of the list of stop words will be kept and returned.

( provide your answer or code in the box below.)

In [150]:
def process_text(text):
  final_text_list = []
  for sent in text:
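    # Replace missing (non-string) values with an empty string
    if not isinstance(sent, str):
      sent = ''

    filter_sequence = []
    sent = sent.lower()                       # convert to lowercase
    sent = sent.strip()                       # remove leading and trailing whitespace
    sent = re.sub(r'\s+', ' ', sent)          # collapse runs of spaces and tabs into one space
    sent = re.compile('<.*?>').sub('', sent)  # remove HTML markup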

    for w in word_tokenize(sent):
      if(not w.isnumeric()) and (len(w)>2) and (w not in stopwords):
        filter_sequence.append(snow.stem(w))
    final_string = " ".join(filter_sequence)

    final_text_list.append(final_string)

  return final_text_list
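The normalization steps listed above can be tried in isolation, without tokenization or stemming. The sketch below uses only the standard library; `normalize` is a hypothetical helper written for this example and is not part of the lab code.

```python
import re


def normalize(text):
    """Apply the basic cleaning steps to a single string."""
    if not isinstance(text, str):      # missing value -> empty string
        text = ""
    text = text.lower()                # lowercase
    text = text.strip()                # trim leading/trailing whitespace
    text = re.sub(r"\s+", " ", text)   # collapse spaces, tabs, and newlines to one space
    text = re.sub(r"<.*?>", "", text)  # strip HTML tags
    return text


print(normalize("  <b>Great</b>\t product  "))
print(repr(normalize(None)))
```

Whitespace runs are replaced with a single space rather than an empty string. Removing whitespace entirely would glue the words together, and the tokenizer would then see each review as one long token.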

## 4. Splitting training, validation, and test data
([Go to top](#Lab-2.1:-Applying-ML-to-an-NLP-Problem))

In this step, you will split the dataset into training (80 percent), validation (10 percent), and test (10 percent) by using the sklearn [__train_test_split()__](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

The training data will be used to train the model, which is then tested with the test data. The validation set is used once the model has been trained to give you metrics on how the model might perform on real data.

( provide your answer or code in the box below.)

In [151]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[['reviewText', 'summary', 'time', 'log_votes']],
                                                  df['isPositive'],
                                                  test_size=0.2,
                                                  shuffle=True,
                                                  random_state=324
                                                  )

X_val, X_test, y_val, y_test = train_test_split(X_val, y_val,
                                                  test_size=0.5,
                                                  shuffle=True,
                                                  random_state=324
                                                  )
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

Training set size: 56000
Validation set size: 7000
Test set size: 7000
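The two-stage split can be checked on synthetic data. The sketch below also passes `stratify`, which is an addition not used in the cell above; it keeps the class ratio the same in every subset, which can help when the classes are imbalanced. The array contents and the name `X_hold` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows with a 60/40 class balance (illustrative)
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 600 + [0] * 400)

# Stage 1: hold out 20 percent of the rows
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=324
)

# Stage 2: split the held-out rows evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=324
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```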


With the dataset split, you can now run the `process_text` function defined above on each of the text features in the training, test, and validation sets.

( provide your answer or code in the box below.)

In [152]:
nltk.download('punkt_tab')
# Processing reviewText
print('Processing reviewText...')
X_train['reviewText'] = process_text(X_train['reviewText'].tolist())
X_val['reviewText'] = process_text(X_val['reviewText'].tolist())
X_test['reviewText'] = process_text(X_test['reviewText'].tolist())

# Processing summary
print('Processing summary...')
X_train['summary'] = process_text(X_train['summary'].tolist())
X_val['summary'] = process_text(X_val['summary'].tolist())
X_test['summary'] = process_text(X_test['summary'].tolist())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Processing reviewText...
Processing summary...


## 5. Processing data with pipelines and a ColumnTransformer
([Go to top](#Lab-2.1:-Applying-ML-to-an-NLP-Problem))

You will often perform many tasks on data before you use it to train a model. These steps must also be done on any data that's used for inference after the model is deployed. A good way of organizing these steps is to define a _pipeline_. A pipeline is a collection of processing tasks that will be performed on the data. Different pipelines can be created to process different fields. Because you are working with both text and numeric data, you can define the following pipelines:

   * For the numerical features pipeline, the __numerical_processor__ uses a MinMaxScaler. (You don't need to scale features when you use decision trees, but it's a good idea to see how to use more data transforms.) If you want to perform different types of processing on different numerical features, you should build different pipelines, like the ones that are shown for the two text features.
   * For the text features pipeline, the __text_processor__ uses `CountVectorizer()` for the text fields.
   
The selective preparations of the dataset features are then put together into a collective ColumnTransformer, which will be used in a pipeline along with an estimator. This process ensures that the transforms are performed automatically on the raw data when you fit the model or make predictions. (For example, when you evaluate the model on a validation dataset via cross-validation, or when you make predictions on a test dataset in the future.)

( provide your answer or code in the box below.)

In [153]:
numerical_features = ['time', 'log_votes']
text_features = ['summary', 'reviewText']

model_features = numerical_features + text_features
model_target = 'isPositive'

In [154]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

numerical_processor = Pipeline([
    ('num_imputer', SimpleImputer(strategy='mean')),
    ('num_scaler', MinMaxScaler())
])

text_processor_0 = Pipeline([
    ('text_vect_0', CountVectorizer(binary=True, max_features=50))
])

text_processor_1 = Pipeline([
    ('text_vect_1', CountVectorizer(binary=True, max_features=150))
])

data_preprocessor = ColumnTransformer([
    ('numerical_pre', numerical_processor, numerical_features),
    ('text_pre_0', text_processor_0, text_features[0]),
    ('text_pre_1', text_processor_1, text_features[1])
                                       ])

print("Data shapes before processing:", X_train.shape, X_val.shape, X_test.shape)

X_train = data_preprocessor.fit_transform(X_train).toarray()
X_val = data_preprocessor.transform(X_val).toarray()
X_test = data_preprocessor.transform(X_test).toarray()

print("Data shapes after processing:", X_train.shape, X_val.shape, X_test.shape)

Data shapes before processing: (56000, 4) (7000, 4) (7000, 4)
Data shapes after processing: (56000, 202) (7000, 202) (7000, 202)
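The 202 output columns are the 2 scaled numerical features plus up to 50 vocabulary terms from `summary` and up to 150 from `reviewText`. The sketch below reproduces that arithmetic on a toy frame. The data is made up, and `sparse_threshold=0` is an assumption added here so that the small example always returns a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy frame with the same column names as the lab (illustrative values)
toy = pd.DataFrame({
    "time": [10, 20, 30],
    "log_votes": [0.0, 1.0, 2.0],
    "summary": ["great buy", "bad buy", "great value"],
    "reviewText": ["work well", "not work", "work great value"],
})

preprocessor = ColumnTransformer(
    [
        ("numerical_pre", MinMaxScaler(), ["time", "log_votes"]),
        ("text_pre_0", CountVectorizer(binary=True), "summary"),
        ("text_pre_1", CountVectorizer(binary=True), "reviewText"),
    ],
    sparse_threshold=0,  # always return a dense array in this small example
)

features = preprocessor.fit_transform(toy)

# 2 numerical columns + 4 summary terms + 5 reviewText terms
print(features.shape)
print(features[:, 0])  # 'time' scaled to the range [0, 1]
```

Each text column is passed as a plain string, not a list, so that `CountVectorizer` receives a one-dimensional sequence of documents.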


In [155]:
print(X_train[0])

[0.7561223 0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.        0.        0.        0.
 0.        0.        0.        0.       