### Demonstration Sentiment Analysis

This notebook is for training purposes and as such has been simplified to highlight the core steps in training an ML model.

#### Step 1: Load, Clean, Prepare the data

In [None]:
import pandas as pd

df = pd.read_csv('https://s3.eu-west-1.amazonaws.com/neueda.conygre.com/pydata/ml_fc/Restaurant_Reviews.tsv', sep='\t')

print(df.shape)
df.head()

Note: this preparation function is deliberately minimal. A fuller pipeline could include further steps such as:
* stemming / lemmatization
* removing stop words
* identification of the most semantically valuable words
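
As a rough illustration of one of those extra steps, the sketch below removes stop words after the same basic clean-up. The `STOP_WORDS` set and the `prepare_review_no_stop_words` name are hypothetical and exist only for this example; a real project would more likely take a curated list from a library.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
# "not" is left out on purpose because negation carries sentiment.
STOP_WORDS = {"a", "an", "and", "the", "was", "is", "it", "this", "just"}


def prepare_review_no_stop_words(review):
    # Same basic clean-up as the notebook's prepare_review: letters only, lower case
    review = re.sub(r'[^a-zA-Z]', ' ', review).lower()
    # Split on whitespace and drop any word found in the stop-word list
    words = [word for word in review.split() if word not in STOP_WORDS]
    return ' '.join(words)


cleaned = prepare_review_no_stop_words('Not tasty and the texture was just nasty.')
print(cleaned)
```

Keeping "not" in the text is a design choice for sentiment work: dropping it would make "not tasty" look the same as "tasty".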

In [None]:
import re

def prepare_review(review):
    # A VERY simple "clean up": replace every non-alphabetic character with a space, then lower-case
    review = re.sub(r'[^a-zA-Z]', ' ', review)
    return review.lower()


In [None]:
prepare_review('Not tasty and the texture was just nasty.')

In [None]:
# apply the prepare_review function to each review (Series.apply)
cleaned_reviews = df['Review'].apply(prepare_review).tolist()

# print the first few "cleaned" reviews
cleaned_reviews[:5]

##### This step encodes the reviews as a "Bag of Words"; the CountVectorizer utility does this for us.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500)

#### Step 2: Split into independent and dependent variables

In [None]:
# remember to pass reviews through the count vectorizer first
X = cv.fit_transform(cleaned_reviews).toarray()  # Independent variables

y = df['Liked'].values  # Dependent variable

#### Step 3: Split our data into training and test sets 

In [None]:
from sklearn.model_selection import train_test_split

# TODO: split our X and y data into 80% for training the model and 20% kept for testing
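
One possible shape for this step, shown on toy data so that it runs on its own. `X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`, and `random_state=0` is an assumption that only serves to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's X and y (10 samples, 3 features each)
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 keeps 20% of the rows for testing;
# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 10 samples this leaves 8 rows for training and 2 for testing. In the notebook, the same call would take `X` and `y` instead of the toy arrays.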

#### Step 4: Choose and train the model

In [None]:
from sklearn.linear_model import LogisticRegression

# TODO: create a LogisticRegression model and "train" it
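
A minimal sketch of creating and training the model, assuming default hyperparameters. The toy word-count matrix `X_toy`, the labels `y_toy` and the name `model_demo` are made up for illustration; in the notebook the training split would be passed to `fit` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up word counts: column 0 = count of "good", column 1 = count of "bad"
X_toy = np.array([[2, 0], [1, 0], [3, 0], [0, 1], [0, 2], [0, 3]])
# Made-up labels: 1 = positive review, 0 = negative review
y_toy = np.array([1, 1, 1, 0, 0, 0])

# "Training" is a single call to fit with the features and the labels
model_demo = LogisticRegression()
model_demo.fit(X_toy, y_toy)

# The classes the model learned, and the class probabilities for the first row
print(model_demo.classes_)
probabilities = model_demo.predict_proba(X_toy[:1])
print(probabilities)
```

`predict_proba` returns one probability per class, which can be more informative than the hard 0/1 answer from `predict`.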

#### Step 5: Examine & Measure the model

Because this is a classification problem, we will use a confusion matrix. From its four counts we can derive further metrics for the model, such as accuracy, precision and recall.
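
As an illustration of the further calculations, the sketch below derives some common metrics from the four cells of a binary confusion matrix. The counts are invented for the example and are not results from this notebook.

```python
# Made-up counts for a binary classifier evaluated on 200 test reviews
tn, fp, fn, tp = 80, 20, 30, 70

# Share of all predictions that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Of the reviews predicted positive, the share that really were positive
precision = tp / (tp + fp)
# Of the reviews that really were positive, the share the model found
recall = tp / (tp + fn)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```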


#### Here we are demonstrating inference / prediction

In [None]:
# simulating a new review
new_review = "Service was awful, tasted revolting!" # should be predicted as negative!

# pass the new review through our "prepare_review" function

# pass the cleaned new review to the count vectorizer to create the "number list"

# ask the model to predict the sentiment
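
The three steps above could look roughly like the following self-contained sketch. The training reviews, labels and the `demo_cv` / `demo_model` names are invented so that the example runs on its own; in the notebook the already fitted `cv` and the trained model would be used instead.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def prepare_review(review):
    # Same simple clean-up as earlier in the notebook
    return re.sub(r'[^a-zA-Z]', ' ', review).lower()


# Tiny made-up training set (1 = liked, 0 = not liked)
train_reviews = [
    "Great food and friendly service",
    "Loved it, really tasty",
    "Awful food, never again",
    "Service was slow and the food was bad",
]
train_labels = [1, 1, 0, 0]

demo_cv = CountVectorizer()
X_demo = demo_cv.fit_transform([prepare_review(r) for r in train_reviews]).toarray()
demo_model = LogisticRegression().fit(X_demo, train_labels)

new_review = "Service was awful, tasted revolting!"
cleaned = prepare_review(new_review)

# transform (not fit_transform) reuses the vocabulary learned during training,
# so the new row has the same columns the model was trained on
new_X = demo_cv.transform([cleaned]).toarray()

prediction = demo_model.predict(new_X)
print(prediction)
```

Words that never appeared in the training data, such as "revolting" here, are simply ignored by `transform`, so the model can only react to vocabulary it has already seen.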


For a more formal measurement we create a confusion matrix for the test data.

It shows how many test reviews of each class the model classified correctly and incorrectly.

In [None]:
from sklearn.metrics import confusion_matrix

# TODO: Measure accuracy with a confusion matrix and the 20% of data kept for testing
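
A short sketch of how `confusion_matrix` behaves, using invented labels rather than this notebook's results. `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for the test labels and the model's predictions on the test features.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: what the reviews really were vs. what a model predicted
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(accuracy)
```

The diagonal holds the correct predictions. The off-diagonal cells show which kind of mistake the model makes: false positives in the top right, false negatives in the bottom left.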

