# Model Training
We are going to train a sentiment analysis model to determine whether a Yelp review is positive or negative.

This training session also serves as a case study of how accurate the XGBoost framework can be in this type of scenario. XGBoost is not typically used for NLP tasks.

This session covers the importance of preprocessing data as well as how to train a model. We will use the popular nltk project for preprocessing.

In [None]:
# Install additional libraries
!pip install nltk

In [None]:
### RESTART KERNEL FOR NEW LIBRARY TO TAKE EFFECT ####

# Preparing our Data
In Session 2 we created a dataset and saved it as 'training_data.csv'. We will use it to train our model, but first we need to clean up and preprocess the text. We will use nltk for this.

In [None]:
# Load our dataset into a pandas DataFrame
import pandas as pd

data = pd.read_csv('training_data.csv', index_col=0)
data

# Preprocess with nltk
We will use stopwords and WordNetLemmatizer.

**Stop words** - Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.

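**Word Lemmatizer** - Reduces a word to its root form, also called a lemma. For example, the verb "running" would be identified as "run." Lemmatization relies on morphological (structural) and contextual analysis of words. Note that nltk's `WordNetLemmatizer.lemmatize` treats each token as a noun unless a part-of-speech tag is passed (for example `pos='v'`), so with the default settings used in this session plural nouns such as "reviews" become "review" while "running" is left unchanged.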
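
To make the two ideas concrete, here is a toy sketch in plain Python. The names `TOY_STOP_WORDS`, `TOY_LEMMAS` and `toy_preprocess` are made up for illustration and stand in for nltk's much larger resources; the real `WordNetLemmatizer` looks words up in the WordNet database rather than in a fixed dictionary like this one.

```python
# Toy stand-ins (hypothetical, NOT nltk data)
TOY_STOP_WORDS = {"a", "the", "is", "are", "and"}
TOY_LEMMAS = {"tacos": "taco", "burritos": "burrito", "running": "run"}


def toy_preprocess(text):
    """Lowercase, map each token to its lemma, then drop stop words."""
    tokens = text.lower().split()
    lemmas = [TOY_LEMMAS.get(token, token) for token in tokens]
    return [token for token in lemmas if token not in TOY_STOP_WORDS]


print(toy_preprocess("The tacos and the burritos are great"))
```

Only the words that carry meaning for the review survive, and plural forms collapse into a single token, which keeps the vocabulary smaller.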


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download corpora (large collections of text)
nltk.download('stopwords')
nltk.download('wordnet')

# English stop words here
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
# Function that cleans up the text. Remove things like punctuation, convert to lowercase, lemmatize and remove stop words

def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)
    # convert to lowercase
    text = text.lower()
    # Lemmatize (split() handles any whitespace, including newlines)
    text = [lemmatizer.lemmatize(token) for token in text.split()]
    # remove stop words
    text = [word for word in text if word not in stop_words]
    # Bring the list back into a string
    text = " ".join(text)
    
    return text
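
A quick sanity check of the punctuation-stripping step on its own, as a minimal sketch with a made-up sample string. One detail worth knowing: the fourth positional parameter of `re.sub` is `count`, not `flags`, so the flag is passed by keyword here.

```python
import re

sample = "Wow!!! Best tacos, ever... 10/10, would eat again!"

# [^\w\s] matches anything that is neither a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", sample, flags=re.UNICODE)
print(no_punct.lower())
```

Notice that "10/10" collapses into "1010". Stripping punctuation is simple, but it can merge tokens in ways you may not want.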

In [None]:
# Apply the text cleaning function above to each row of our dataset
# Here we add an additional column just to demonstrate.
# This is the column we will train our model on!
data['cleaned_text'] = data.text.apply(clean_text)

In [None]:
# Read it and notice the differences between the text and cleaned_text columns
data

# Feature Extraction
In natural language processing (NLP), feature extraction is a fundamental task that involves converting raw text data into a format that can be easily processed by machine learning algorithms.

Machines need numerical data!
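
To see what "converting text to numbers" means, here is a simplified TF-IDF calculation in plain Python. It is a sketch under stated assumptions: the three-document corpus is made up, tokens are split on whitespace only, and the weighting (raw term counts, smoothed inverse document frequency, L2 normalisation) is meant to mirror the default behaviour of scikit-learn's `TfidfVectorizer` as we understand it, not to replace it.

```python
import math
from collections import Counter

docs = ["great food great service", "terrible food", "great place"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: number of documents that contain each term
df = Counter(token for tokens in tokenized for token in set(tokens))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Return one L2-normalised TF-IDF vector ordered like `vocab`."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw)) or 1.0
    return [value / norm for value in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(value, 2) for value in row])
```

Each review becomes one row of numbers, one column per vocabulary word. Words that appear in fewer documents (such as "terrible") get a higher weight than words that appear in many (such as "food").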

In [None]:
# Use Sklearn feature extractor
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

# Vectorize (convert to numbers) on the cleaned_text column and make it an array to pass to our model for training
vectorizer = TfidfVectorizer(max_features=5000)
tfid_obj = vectorizer.fit_transform(data['cleaned_text'])

# Save our fitted vectorizer using pickle (the with-block closes the file for us)
vec_file = 'vectorizer.pickle'
with open(vec_file, 'wb') as f:
    pickle.dump(vectorizer, f)
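
Why save the vectorizer at all? At inference time every word must map to the same column it had during training, so the fitted state has to be reused rather than refit. The sketch below shows the save-and-restore round trip with a made-up `toy_vocabulary` dictionary standing in for a fitted vectorizer; the file name and temporary directory are likewise just for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted state of a vectorizer
toy_vocabulary = {"food": 0, "great": 1, "terrible": 2}

path = os.path.join(tempfile.mkdtemp(), "toy_vocabulary.pickle")

# The context manager closes the file even if dump() raises
with open(path, "wb") as handle:
    pickle.dump(toy_vocabulary, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == toy_vocabulary)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.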


In [None]:
# Convert the features to a dense array and display it just to see
data_features = tfid_obj.toarray()
data_features

In [None]:
# We also need to make our sentiments a numerical value. To keep this simple we will make positive a 1 and negative a 0.
def encode_sentiment(sentiment):
    if sentiment == 'positive':
        sentiment_value = 1
    else:
        sentiment_value = 0

    return int(sentiment_value)



In [None]:
# Add new column with encodings
data['encoded_sentiment'] = data.sentiment.apply(encode_sentiment)

In [None]:
# Take a look at dataset
data

# Split the data
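The train-validation-test split is a strategy that divides a dataset into three subsets: the training set, the validation set, and the test set. Each subset serves a distinct purpose: the model learns from the training set, the validation set is used to tune parameters, and the test set gives a final estimate of performance on unseen data. To keep things simple, this session uses only two of them: a training set (80% of the rows) and a test set (20%).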
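
For reference, a full three-way split could be sketched in plain Python as follows. The helper `three_way_split` and its 10%/10% fractions are hypothetical choices for illustration; the cell that follows uses scikit-learn's `train_test_split` with a single train/test cut instead.

```python
import random


def three_way_split(rows, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle rows and cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible, which is the same reason the notebook passes `random_state=42`.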

In [None]:
from sklearn.model_selection import train_test_split

# X_train, X_test come from the feature array above
# y_train, y_test come from encoded_sentiment. y is the result (the prediction you are going for)
X_train, X_test, y_train, y_test = train_test_split(data_features, data['encoded_sentiment'], test_size=0.2, random_state=42)

In [None]:
X_train, y_train

# Train the Model
Now the fun part! Let's train a model!

We are going to train an XGBoost model on our training data (X_train, y_train).

In [None]:
import xgboost as xgb

# Use the Classifier. We can adjust the parameters here to try to get better results.
model = xgb.XGBClassifier(max_depth=10, n_estimators=1000, learning_rate=0.01)
model.fit(X_train, y_train)

In [None]:
# Save our model locally
model.save_model("model.json")

In [None]:
# Now we can do model evaluations using our Test Dataset
# Use the model to run predictions across the held-out test rows (20% of the dataset)
predictions = model.predict(X_test)

In [None]:
# This will output the encoded sentiments
predictions

In [None]:
# There are several ways we can view the results, using different metrics

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
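
The report lists precision, recall and F1 score for each class. As a minimal sketch of what those numbers mean, they can be computed by hand from the counts of true positives, false positives and false negatives. The helper `binary_metrics` and the toy label lists are made up for illustration and are not results from our model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 1]
print(binary_metrics(toy_true, toy_pred))
```

Precision answers "of the reviews we called positive, how many really were?", while recall answers "of the truly positive reviews, how many did we find?". F1 is the harmonic mean of the two.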

# Inference (Real Tests)
Now let's run through the process of using some real text to get predictions. To get valid results, we must apply the same preprocessing steps at inference time that we used for training.

**Inference** - Applying a machine learning model to a dataset and generating an output or “prediction”. This output might be a numerical score, a string of text, an image, or any other structured or unstructured data.

In [None]:
# Load the model if needed
# import xgboost as xgb

# model = xgb.XGBClassifier(max_depth=10, n_estimators=1000, learning_rate=0.01)
# model.load_model("model.json")

In [None]:
# Follow the same steps as the data preprocessing used for training
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

# Download corpora (large collections of text)
nltk.download('stopwords')
nltk.download('wordnet')

# English stop words here
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
# Load our vectorizer (pickle is imported again in case the kernel was restarted)
import pickle

with open('vectorizer.pickle', 'rb') as f:
    loaded_vectorizer = pickle.load(f)

def clean_and_vectorize(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)
    # convert to lowercase
    text = text.lower()
    # Lemmatize (split() handles any whitespace, including newlines)
    text = [lemmatizer.lemmatize(token) for token in text.split()]
    # remove stop words
    text = [word for word in text if word not in stop_words]
    # Bring the list back into a string
    text = " ".join(text)

    # Vectorize from our vectorizer created above
    data_features = loaded_vectorizer.transform([text])
    # Create an array as it expects
    data_features = data_features.toarray()

    # Get the prediction
    prediction = model.predict(data_features)[0]

    # 1 is positive 0 is negative
    if prediction == 1:
        sentiment = 'positive'
    else: 
        sentiment = 'negative'

    return sentiment

In [None]:
text = 'This place is the greatest on earth!'
prediction = clean_and_vectorize(text)
prediction

In [None]:
text = """
Alqueria surpassed my expectations ten fold. You can tell that their food is authentic farm-to-table and is just incredibly fresh.
We ordered the shrimp and drunken goat cheese as our appetizers and they proved we made the right decision in choosing Alqueria for dinner. The shrimp, with the oil that it is in, is unrivaled. I would come back just to eat more of this! The lamb and spaghetti squash were also very good. The lamb fell right off the bone.
Our server was very sweet to us, offered her suggestions, and always checked in.
I do love the intimate feel of the restaurant, however, reservations are necessary since it is a smaller place.
The menu changes often which I think is a fun concept, but I am really hoping they don't ever take the shrimp off the menu!
Braised Lamb Shank & Local Spaghetti Squash
"""

prediction = clean_and_vectorize(text)
prediction

In [None]:
text = """
Wow! Downright awful! Some background: I have avoided eating Panera basically my entire life because it's terrible. However! In Orlando a group of friends wanted to go to Panera, so I went. It was awesome! So good that I told several people about it and was like we have to go when we're all back in Columbus because it was so good. Which was so surprising. Well. Here we are. And it was awful. Terrible. Put together terribly and tasted like feet and rot. Stay away. Better to not eat than to waste money on this garbage.
"""
prediction = clean_and_vectorize(text)
prediction