# Section A: An Intro to Machine Learning and NLP

<img src="pic7.png" width="700">

https://www.javatpoint.com/machine-learning

## Supervised Machine Learning

<img src="pic1.png" width="800">

https://karthikvegeta.medium.com/welcome-to-the-hood-of-machine-learning-199dd31f39e6

## Classification vs. Regression: Weather Forecasting

<img src="pic2.jpeg" width="700">

https://medium.com/@ali_88273/regression-vs-classification-87c224350d69

## Text Classification

▪ Logistic Regression

▪ Naive Bayes

▪ Comparing Methods: Classification Metrics

## How Does Logistic Regression Work?

▪ A binary classification task: To predict whether an animal is a cat or not.

<img src="pic8.png" width="600">

https://towardsdatascience.com/analytics-building-blocks-binary-classification-d205890314fc

![](https://i163.photobucket.com/albums/t281/kyin_album/m6.png)
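
As a rough illustration of the idea, logistic regression computes a weighted sum of the input features and passes it through the sigmoid function to obtain a probability between 0 and 1. A threshold (commonly 0.5) then turns that probability into a class. The feature names, weights, and bias below are hypothetical values chosen only for illustration; they are not learned from any data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_is_cat(features, weights, bias, threshold=0.5):
    """Return (probability, decision) for a single example."""
    # Weighted sum of the features plus the bias term
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, probability >= threshold

# Hypothetical features: [has_whiskers, has_pointed_ears]
weights = [2.0, 1.5]   # assumed values, not learned from data
bias = -2.5            # assumed value

prob, is_cat = predict_is_cat([1, 1], weights, bias)
print(round(prob, 3), is_cat)
```

In a real model, the weights and bias are estimated from labeled training data rather than set by hand.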

## Fruits Classifier?

▪ A multiclass classification task: To predict whether a fruit is an apple, an orange, or grapes.

![](https://i163.photobucket.com/albums/t281/kyin_album/m1_1.png)
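
One common way to extend the binary case to several classes is to compute one score per class and convert the scores into probabilities with the softmax function; the class with the highest probability becomes the prediction. The sketch below assumes this approach, and the scores are made-up numbers for a single fruit.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["apple", "orange", "grapes"]
scores = [2.0, 1.0, 0.1]   # hypothetical scores, not from a trained model

probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]

for name, p in zip(classes, probabilities):
    print(f"{name}: {p:.3f}")
print("Predicted class:", predicted)
```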

# Section B: Classification with Logistic Regression

## Classification Task: Detection of spam emails

<img src="pic4.png" width="500">

https://developers.google.com/machine-learning/guides/text-classification

## Step 1: Preparing Data 

▪ Make sure the data in **spam.txt** is labeled.

### Loading Data into a DataFrame

In [None]:
import pandas as pd

data = pd.read_table('spam.txt', encoding = 'windows-1252', header = None)
data

### Adding Column Names to the Data

In [None]:
data.columns = ['label', 'text']
data

### Warm Up Exercise

▪ **Question**: Swap the "label" and "text" columns.

In [None]:
df_2 = data.reindex(columns = ['text', 'label'])
df_2

## Walk-through Examples Before Data Preprocessing

### Example 1A: Normal Function

In [None]:
# Create a normal function that identifies an even number
def even(num):
    return num % 2 == 0

nums = [5, 10]

print(nums[0], "is even:", even(nums[0]))
print(nums[1], "is even:", even(nums[1]))

In [None]:
print(type(even))

### Python Lambda

▪ Python lambda functions are **small, anonymous functions** defined using the **lambda** keyword.

<img src="pic9.png" width="450">

https://www.scaler.com/topics/how-to-use-lambda-functions-in-python/

### Example 1B: Lambda Function

In [None]:
# Create a lambda function that identifies an even number
result = lambda num: num % 2 == 0

nums = [5, 10]

print(nums[0], "is even:", result(nums[0]))
print(nums[1], "is even:", result(nums[1]))

In [None]:
print(type(result))

### Warm Up Exercise

▪ **Question**: Create and use a lambda function that multiplies two numbers.

In [None]:
# Code?

### Invoking Functions with and without Parentheses

▪ When we call a function with parentheses, the function is executed and its result is returned to the caller.

▪ When we refer to a function without parentheses, a reference to the function is passed along instead; the function itself is not executed.

https://www.geeksforgeeks.org/python-invoking-functions-with-and-without-parentheses/

### Example 1

In [None]:
def func_A(num = 5):
    print("Function A")

def func_B(num = 10):
    print("Function B")

In [None]:
func_B(func_A())

In [None]:
func_B(func_A)

### Example 2

In [None]:
import re

text = "I love AVENGER movies, especially The Endgame."

def lowercase(match_obj):
    return match_obj.group(0).lower()

clean_text = re.sub(r"\b[A-Z]+\b", lowercase, text)
clean_text

### Pandas Series map()

▪ The map() function substitutes each value in a Series with another value, which may be derived from a function, a dict, or another Series.

https://www.w3resource.com/pandas/series/series-map.php
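
The three kinds of mapping can be sketched on a tiny Series. The "ham" and "spam" values and the variable names are sample data assumed for illustration.

```python
import pandas as pd

labels = pd.Series(["ham", "spam", "ham"])

# 1. Map with a function: applied to each value in turn
upper = labels.map(lambda value: value.upper())

# 2. Map with a dict: each value is looked up as a key
#    (values missing from the dict would become NaN)
codes = labels.map({"ham": 0, "spam": 1})

# 3. Map with a Series: each value is looked up in the Series index
lookup = pd.Series({"ham": "not spam", "spam": "spam"})
described = labels.map(lookup)

print(upper.tolist())
print(codes.tolist())
print(described.tolist())
```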

## Step 2: Preprocessing Data 

In [None]:
# Before preprocessing data... 
data.head()

In [None]:
import re
import string

# Remove any token that contains a digit
x_number = lambda x: re.sub(r"\w*\d\w*", '', x)

# Convert to lowercase and remove punctuation
x_punc_upper = lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x.lower())

In [None]:
data['text'] = data.text.map(x_number).map(x_punc_upper)

# After preprocessing data...
data.head()
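
To see what the two cleaning steps do to a single message, the sketch below applies them to a made-up sample string (not taken from spam.txt). `re.escape` is used here so that every punctuation character is treated literally inside the character class.

```python
import re
import string

# Remove any token that contains a digit
x_number = lambda x: re.sub(r"\w*\d\w*", '', x)

# Convert to lowercase and remove punctuation
x_punc_upper = lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x.lower())

sample = "WINNER!! Call 09061701461 now, claim 2nd prize."   # hypothetical message

step1 = x_number(sample)
cleaned = x_punc_upper(step1)

print(repr(step1))
print(repr(cleaned))

# Removed tokens leave double spaces behind; collapse them if needed
normalized = " ".join(cleaned.split())
print(repr(normalized))
```

Note that "2nd" disappears along with the phone number, because the pattern removes every token that contains at least one digit.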

## Step 3: Splitting Data into Input and Output

▪ **Input**: Features, Predictors, Independent Variables, X's 
    
▪ **Output**: Label, Outcome, Dependent Variable, Y
    
![](https://i163.photobucket.com/albums/t281/kyin_album/m2.png)

In [None]:
# Inputs to be fed into the model
X = data.text

# Output of the model
y = data.label 

In [None]:
X.head()

In [None]:
y.head()

### Overfitting

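▪ Overfitting occurs when a model fits the training data too closely, learning its noise as well as its patterns, so it performs well on the training data but poorly on new, unseen data.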

![](pic5.png)

https://www.ibm.com/cloud/learn/overfitting
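
A small pure-Python sketch can make this concrete. It assumes synthetic data in which the true label is `x > 0.5` but 20% of the labels are randomly flipped, and it compares a model that memorizes the training set (1-nearest-neighbour) with a hand-written threshold rule. All names and numbers here are illustrative assumptions.

```python
import random

rng = random.Random(0)

def make_data(n):
    """x in [0, 1); the true label is x > 0.5, with 20% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < 0.2:
            label = not label
        data.append((x, label))
    return data

train, test = make_data(200), make_data(200)

def memorizer(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def simple_rule(x):
    """A simple threshold rule that ignores the noise."""
    return x > 0.5

def accuracy(model, data):
    """Fraction of examples for which the model's output matches the label."""
    return sum(model(x) == label for x, label in data) / len(data)

for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, "| train:", accuracy(model, train), "| test:", accuracy(model, test))
```

The memorizer reproduces every training label, including the noisy ones, so its training accuracy is perfect; the gap between its training and test accuracy is the symptom of overfitting to look for.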

### How to Avoid Overfitting?

![](https://i163.photobucket.com/albums/t281/kyin_album/m4.png)

## Step 4: Splitting Data into Training Data and Test Data

### train_test_split()

▪ X: independent variable(s)

▪ y: dependent variable

▪ test size = 30% of observations, which means training size = 70% of observations

▪ random state = 42, so we all get the same random train / test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

![](pic6.png)

In [None]:
X_train.shape

In [None]:
X_train.head()

In [None]:
y_train.shape

In [None]:
y_train.head()

In [None]:
X_test.shape

In [None]:
y_test.shape
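
To see why a fixed random state gives everyone the same split, here is a simplified pure-Python sketch of the idea: shuffle with a fixed seed, then slice. The helper `simple_train_test_split` is hypothetical and is not how scikit-learn implements `train_test_split()`.

```python
import random

def simple_train_test_split(items, test_size=0.3, seed=42):
    """Shuffle a copy of the items with a fixed seed, then slice off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

messages = [f"message {i}" for i in range(10)]   # hypothetical data

train_a, test_a = simple_train_test_split(messages)
train_b, test_b = simple_train_test_split(messages)

print(len(train_a), len(test_a))
# The same seed produces the same split every time
print(train_a == train_b and test_a == test_b)
```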

## Step 5: Numerically Encoding the Input Data

![](pic10.png)

https://www.educative.io/answers/countvectorizer-in-python

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english')

X_train_cv = cv.fit_transform(X_train).toarray()

print(X_train_cv.shape)

In [None]:
print(X_train_cv)

In [None]:
# Transform the test data using the vocabulary learned from the training data
X_test_cv = cv.transform(X_test).toarray() 

print(X_test_cv.shape)
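
The counting idea behind this encoding can be sketched in plain Python. This is a simplified illustration with hypothetical helper names and sample sentences; the real CountVectorizer also lowercases text, tokenizes with a regular expression, and removes stop words when asked to.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(document.split())
    return [counts.get(word, 0) for word in vocabulary]

train_docs = ["free prize now", "call me now now"]   # hypothetical messages
test_docs = ["free call tomorrow"]

# The vocabulary is built from the training documents only
vocabulary = build_vocabulary(train_docs)
train_vectors = [to_count_vector(doc, vocabulary) for doc in train_docs]
test_vectors = [to_count_vector(doc, vocabulary) for doc in test_docs]

print(vocabulary)
print(train_vectors)
print(test_vectors)
```

The word "tomorrow" never appears in the training documents, so it has no column and is dropped from the test vector. This is why the test data is transformed with the vocabulary fitted on the training data instead of being fitted again.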

## Step 6: Fitting The Model and Predicting Outcomes

In [None]:
# Use a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(X_train_cv, y_train)

In [None]:
# Apply the model trained on X_train_cv to X_test_cv
y_pred_cv = lr.predict(X_test_cv)

# The output is the predicted label for each test message
y_pred_cv 

## Step 7: Evaluating The Model

![](https://i163.photobucket.com/albums/t281/kyin_album/m5.png)

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_cv)
cm

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'], 
            annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu")

In [None]:
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]

accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg), 3)
precision = round((true_pos) / (true_pos + false_pos), 3)
recall = round((true_pos) / (true_pos + false_neg), 3)
f1 = round(2 * (precision * recall) / (precision + recall), 3)

print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))
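
As a worked example of the same formulas, the sketch below plugs in hypothetical confusion-matrix counts. These numbers are made up for illustration and are not results from spam.txt.

```python
def classification_metrics(true_pos, true_neg, false_pos, false_neg):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = true_pos + true_neg + false_pos + false_neg
    accuracy = (true_pos + true_neg) / total
    precision = true_pos / (true_pos + false_pos)   # of predicted spam, how much is spam
    recall = true_pos / (true_pos + false_neg)      # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts
accuracy, precision, recall, f1 = classification_metrics(
    true_pos=40, true_neg=50, false_pos=2, false_neg=8
)

print("Accuracy:", round(accuracy, 3))    # 90 / 100
print("Precision:", round(precision, 3))  # 40 / 42
print("Recall:", round(recall, 3))        # 40 / 48
print("F1 Score:", round(f1, 3))
```

Rounding only when printing keeps the F1 score from inheriting rounding error from precision and recall.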

# Section C: Classification with Naive Bayes

## Training the Model

In [None]:
# Use a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

# Create a Naive Bayes prediction model object
nb = MultinomialNB()

# Train the model
nb.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_cv_nb = nb.predict(X_test_cv)

# The output is all of the predictions
y_pred_cv_nb 

## Evaluating the Model

In [None]:
cm = confusion_matrix(y_test, y_pred_cv_nb)

sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'],
annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu")

true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]

accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg), 3)
precision = round((true_pos) / (true_pos + false_pos), 3)
recall = round((true_pos) / (true_pos + false_neg), 3)
f1 = round(2 * (precision * recall) / (precision + recall), 3)

print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))

# Section D: NLP Showcase

## Name Gender Classifier

▪ To create a classifier that automatically classifies a given name as either male or female.

In [None]:
from nltk.corpus import names

male_names = names.words("male.txt")
female_names = names.words("female.txt")

In [None]:
names_list = [(name, 'male') for name in male_names]
names_list += [(name, 'female') for name in female_names]

In [None]:
import random

random.shuffle(names_list)

In [None]:
def extract_gender_features(name):
    
    # Convert all names to lowercase
    name = name.lower()
    
    # Create an empty dictionary
    features = {}
    
    # Extract different lengths of suffixes from names as features
    features["suffix1"] = name[-1:]
    features["suffix2"] = name[-2:] if len(name) > 1 else name[0]
    features["suffix3"] = name[-3:] if len(name) > 2 else name[0]
    #features["suffix4"] = name[-4:] if len(name) > 3 else name[0]
    #features["suffix5"] = name[-5:] if len(name) > 4 else name[0]
    #features["suffix6"] = name[-6:] if len(name) > 5 else name[0]
    
    # Extract different lengths of prefixes from names as features
    features["prefix1"] = name[:1]
    features["prefix2"] = name[:2] if len(name) > 1 else name[0]
    features["prefix3"] = name[:3] if len(name) > 2 else name[0]
    #features["prefix4"] = name[:4] if len(name) > 3 else name[0]
    #features["prefix5"] = name[:5] if len(name) > 4 else name[0]
    #features["wordLen"] = len(name)
   
    return features

data = [(extract_gender_features(name), gender) for (name, gender) in names_list]

In [None]:
# Set a limit for splitting training and testing data
train_count = int(.8 * len(data))
train_count

In [None]:
# Use the first 80% of the dataset (up to train_count) as the training data
train_data = data[:train_count]

# Make the remaining dataset as the test data
test_data = data[train_count:]

In [None]:
import nltk

# Train Naive Bayes classifier
bayes = nltk.NaiveBayesClassifier.train(train_data)

In [None]:
# Test the classifier
# Code?

In [None]:
# Use classify() to do gender prediction
prediction = [(bayes.classify(features), bayes.classify(features) == label) for features, label in test_data]

In [None]:
names_test = names_list[train_count:]

# Create an empty list to store name, gender, prediction, true/false
result = []

# Use sum() to combine two tuples into a new tuple
for index in range(len(prediction)):
    result.append(sum((names_test[index], prediction[index]), ()))
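
The `sum()` call above joins two tuples because `sum()` starts from the empty tuple `()` and adds each tuple in turn. The sketch below shows this on one hypothetical pair of entries and compares it with the `+` operator, which gives the same result more directly.

```python
name_and_gender = ("Alice", "female")      # hypothetical entry from names_test
prediction_and_match = ("female", True)    # hypothetical entry from prediction

# sum() starts from the empty tuple () and adds each tuple in turn
combined_with_sum = sum((name_and_gender, prediction_and_match), ())

# The + operator concatenates two tuples directly
combined_with_plus = name_and_gender + prediction_and_match

print(combined_with_sum)
print(combined_with_sum == combined_with_plus)
```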

In [None]:
import pandas as pd

df = pd.DataFrame(result, columns = ['Name', 'Gender', 'Prediction', 'T/F'])
df[:20]

In [None]:
# Evaluate the performance in terms of accuracy
print("Test data accuracy =", nltk.classify.accuracy(bayes, test_data))

In [None]:
# Show the 25 most informative features that our model used
bayes.show_most_informative_features(25)

In [None]:
# Collect incorrect predictions across the full name list (training and test data), then show the first 20
errors = []

for (name, label) in names_list:
    if bayes.classify(extract_gender_features(name)) != label:
        errors.append({"name": name, "label": label})

errors[0:20]

# Section E: Machine Learning and NLP Exercises

## Question 1

### Instructions

▪ We will be using a review dataset from Kaggle (e.g., coffee.csv) for this exercise. 

▪ The product we'll be focusing on this time is a cappuccino cup.

▪ Later on, split your dataset based on the ratio of 80% training data + 20% testing data.

### Step 1: Preparing Data 

### Step 2: Preprocessing Data 

### Step 3: Splitting Data into Input and Output

### Step 4: Splitting Data into Training Data and Test Data

### Step 5: Numerically Encoding the Input Data

### Step 6: Fitting Different Models and Predicting Outcomes

### I. Logistic Regression

### II. Naive Bayes

### III. SVM

### IV. Decision Tree

### V. Random Forest

### VI. KNN

### Step 7: Evaluating Predictive Models

**Example Output**:

Accuracy score for LR  = 0.1651<br>
Accuracy score for NB  = 0.6514<br>
Accuracy score for SVM = 0.5413<br>
Accuracy score for DT  = 0.5505<br>
Accuracy score for RF  = 0.5872<br>
Accuracy score for KNN = 0.5963<br>

## Question 2

Predict the rating of this review: **"I dislike this coffee, terrible taste and very greasy."** using Logistic Regression, SVM, Decision Tree, Random Forest, KNN, and Naive Bayes.