---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Pandas (self-study)

### 🔗 **Link**: https://bit.ly/WA_LEC9_TEXT

### 🛢️ **Data**: http://bit.ly/someBookReviews 

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 🚪 1. Introduction

As always, we'll import the packages we'll need for this notebook.

In [None]:
import nltk
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Data
The file that I've placed in `./files/books.csv` contains 2,000 Amazon book reviews. 

The data set contains two columns:
- the first column (contained in quotes) is the review text. 
- the second column is a binary label indicating if the review is positive or negative.

Let's read the data into a pandas data frame. 
- You'll notice two new parameters in `pd.read_csv()` that we haven't seen before. The first, `quotechar`, tells pandas which character is used to "encapsulate" the text fields. Since our review text is surrounded by double quotes, we let pandas know. In the code we write it as `"\""`: the backslash is needed because the double quote is also what delimits the Python string itself. This backslash is known as an escape character. The second parameter, `escapechar`, lets pandas know that the file uses a backslash as its escape character too.


In [None]:
data = pd.read_csv("files/books.csv", quotechar="\"", escapechar="\\")

data.head(20)
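
To see what `quotechar` and `escapechar` do, here is a minimal sketch on a tiny CSV held in memory. The two rows are invented for illustration and are not taken from `books.csv`; I'm also assuming that a quote inside a field is written as `\"`.

```python
import io

import pandas as pd

# Two invented rows: the first contains escaped double quotes,
# the second contains commas inside the quoted field.
raw_csv = (
    'review_text,positive\n'
    '"She said \\"wow\\" and kept reading",1\n'
    '"Dull, slow, and far too long",0\n'
)

toy = pd.read_csv(io.StringIO(raw_csv), quotechar='"', escapechar='\\')

print(toy.shape)
print(toy.loc[0, 'review_text'])
print(toy.loc[1, 'review_text'])
```

The commas inside the second review do not split it into extra columns, because everything between the quote characters is read as one field.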

In [None]:
data.iloc[50]['review_text']

# 📝🏷️ 2. Text classification
We are going to look at some Amazon reviews and classify them as positive or negative.

## From text to numbers
Going from text to numeric data is very easy. Let's take a look at how we can do this. We'll start by separating out our X and Y data.

In [None]:
X_text = data['review_text']
Y = data['positive']

Next, we will turn `X_text` into just `X` -- a numeric representation!

In [None]:
# Create a vectorizer that will track text as binary features
binary_vectorizer = CountVectorizer(binary=True)

# Let the vectorizer learn what tokens exist in the text data
binary_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = binary_vectorizer.transform(X_text)

In [None]:
X

## Modeling
We have a ton of features, so let's put them to use in a model.

In [None]:
# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=5)

# Print out the average accuracy rounded to three decimal places
print(f"The accuracy of our classifier is {np.mean(accs):.3f}")

Let's try using full counts instead of a binary representation. I've just copied and pasted the code above and removed `binary=True` from the vectorizer.

In [None]:
# Create a vectorizer that will track text as counted features
count_vectorizer = CountVectorizer()

# Let the vectorizer learn what tokens exist in the text data
count_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = count_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression(max_iter=100000)

# Use this model and our data to get 5-fold cross validation accuracy
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=5)

# Print out the average accuracy rounded to three decimal places
print(f"The accuracy of our classifier is {np.mean(accs):.3f}")

Let's try using TF-IDF.

In [None]:
# Create a vectorizer that will track text as TF-IDF weighted features
tfidf_vectorizer = TfidfVectorizer()

# Let the vectorizer learn what tokens exist in the text data
tfidf_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = tfidf_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression(max_iter=100000)

# Use this model and our data to get 5-fold cross validation accuracy
accs = cross_val_score(logistic_regression, X, Y, scoring="accuracy", cv=5)

# Print out the average accuracy rounded to three decimal places
print(f"The accuracy of our classifier is {np.mean(accs):.3f}")

# 🔍🔧 3. Feature Engineering

At the start of this class, we explored two ways of dealing with categorical data: binarizing and numerical scaling. I would like to show how to do these two things in Python. We will use the same simple five-record data set from class.

Go here to get the data: http://bit.ly/someCategoricalData


In [None]:
data_dict = {
    'Minutes': [100, 220, 500, 335, 450],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Marital': ['Single', 'Married', 'Divorced', 'Single', 'Married'],
    'Satisfaction': ['Low', 'Very Low', 'High', 'Neutral', 'Very High'],
    'Churn': [0, 0, 1, 0, 1]
}

data = pd.DataFrame(data_dict)

data.head()

## Binarizing
Get a list of the features you want to binarize, then create a new indicator column for each level of each feature.

In [None]:
features_to_binarize = ["Gender", "Marital"]

# Get dummies for the desired columns and drop the first category for each
data_dummies = pd.get_dummies(data[features_to_binarize], drop_first=True)

# Drop the original columns from the data
data = data.drop(features_to_binarize, axis=1)

# Concatenate the original data and the dummies
data = pd.concat([data, data_dummies], axis=1)

data.head()
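
If you are wondering what `drop_first=True` changes, here is a small sketch using only the `Marital` values from the table above. A feature with three levels needs only two indicator columns: the dropped level becomes the baseline, represented by zeros in both remaining columns.

```python
import pandas as pd

toy = pd.DataFrame({'Marital': ['Single', 'Married', 'Divorced', 'Single', 'Married']})

# One indicator column per level
all_levels = pd.get_dummies(toy, columns=['Marital'])

# Same thing, but the first level (alphabetically) is dropped
dropped_first = pd.get_dummies(toy, columns=['Marital'], drop_first=True)

print(list(all_levels.columns))
print(list(dropped_first.columns))
```

Here `Divorced` is the level that gets dropped, so a divorced customer is the row where both `Marital_Married` and `Marital_Single` are 0 (or `False`, depending on your pandas version).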

## Numeric scaling
We can also replace text levels with a numeric mapping that we create.

In [None]:
# Map each satisfaction level to an ordered numeric score
satisfaction_scale = {'Very Low': -2, 'Low': -1, 'Neutral': 0, 'High': 1, 'Very High': 2}
data['Satisfaction'] = data['Satisfaction'].map(satisfaction_scale)

data.head()

## Creating new features

When modeling certain datasets, linear relationships between the features and the target variable might not capture the underlying patterns effectively. By squaring a feature, we introduce a form of non-linearity to our model.
- The `Minutes_squared` feature, for instance, can help in situations where the effect of minutes on the target variable accelerates or decelerates. Think of it as a way to capture the idea that "every additional minute has a larger (or smaller) effect than the previous one."
- For example, in scenarios like battery discharge, the first few minutes might not have a significant effect, but as time progresses, every additional minute might have a more pronounced effect. By squaring the `Minutes` feature, we can help our model learn such patterns.
- Remember that it's crucial to validate the performance of the model with this new feature. It's possible that for some datasets, introducing such non-linearity might not be beneficial and could even overcomplicate the model. Always rely on cross-validation or a test set to evaluate the impact of introducing new features.

In [None]:
data['Minutes_squared'] = data['Minutes'] ** 2
data.head()
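
The advice above about validating a new feature can be sketched as follows. Our five-record table is far too small for cross-validation, so this uses synthetic data that I invented for illustration, and a plain linear regression rather than the classifier from earlier. The target is deliberately U-shaped in `minutes`, which a straight line cannot capture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (invented for illustration): the target is U-shaped in minutes
minutes = rng.uniform(0, 10, size=200)
target = (minutes - 5) ** 2 + rng.normal(0, 1, size=200)

# Feature matrix without and with the squared feature
X_linear = minutes.reshape(-1, 1)
X_squared = np.column_stack([minutes, minutes ** 2])

# 5-fold cross-validated R^2 for each feature set
r2_linear = cross_val_score(LinearRegression(), X_linear, target, scoring="r2", cv=5).mean()
r2_squared = cross_val_score(LinearRegression(), X_squared, target, scoring="r2", cv=5).mean()

print(f"Mean R^2 without the squared feature: {r2_linear:.3f}")
print(f"Mean R^2 with the squared feature:    {r2_squared:.3f}")
```

On real data the gap will usually be much smaller, and it can go the other way, which is exactly why we check with cross-validation instead of assuming the new feature helps.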

Feature engineering is an important part of creating a machine learning model. It's a process that requires a lot of creativity and domain knowledge. It's also a process that can be automated to some extent.

# 4. More text preprocessing

## N-grams

N-grams are contiguous sequences of n items (words, letters) from a given sample of text. Beyond just individual words (1-grams or unigrams), using 2-grams, 3-grams, etc., helps in capturing phrases and the context in which words appear. For instance, "not good" as a 2-gram has a different sentiment than the individual words "not" and "good".

In [None]:
data = pd.read_csv("files/books.csv", quotechar="\"", escapechar="\\")
text_data = data['review_text']

# Using 1-grams
unigram_vectorizer = CountVectorizer()
X_unigrams = unigram_vectorizer.fit_transform(text_data)

# Using 1-grams and 2-grams together
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vectorizer.fit_transform(text_data)

# notice that the number of columns greatly increased
print(X_unigrams.shape)
print(X_bigrams.shape)


## Part-of-Speech (POS) Tagging

POS tagging classifies words into their parts of speech (like nouns, verbs, adjectives). Understanding the grammatical structure can be beneficial in various NLP tasks.

In [None]:
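# download these two first
# (newer NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; follow the error message if so)
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')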

sample_text = text_data.iloc[0]
tokens = nltk.word_tokenize(sample_text)
pos_tags = nltk.pos_tag(tokens)

print(sample_text)
for token, pos in pos_tags:
    print(f"{token} - {pos}")


## Text Complexity & Readability
Readability scores indicate how difficult a reading passage is to understand. For instance, the Flesch Reading Ease score can be used to gauge the complexity of a text: higher scores indicate text that is easier to read.

In [None]:
# %pip install textstat
from textstat import flesch_reading_ease

readability_score = flesch_reading_ease(sample_text)
print(f"Text: {sample_text} \nReadability score: {readability_score}")
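
For reference, the Flesch Reading Ease score that `flesch_reading_ease` is named after is usually defined as below. Treat this as background on the standard formula; I have not checked here exactly how `textstat` counts sentences and syllables, so its output may differ slightly from a hand calculation.

$$
\text{Reading Ease} = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
$$

Longer sentences and longer words both pull the score down, which is why a lower score means a harder text.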


## Stemming and Lemmatization
Both techniques reduce words to their base or root form.
- Stemming is the cruder of the two: it chops off suffixes using fixed rules, so the result is not always a real word.
- Lemmatization uses a dictionary to make sure the resulting root word is meaningful.

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

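sample_words = ["running", "ran", "runs", "runner"]
for word in sample_words:
    stemmed_word = stemmer.stem(word)
    # The lemmatizer treats every word as a noun unless told otherwise,
    # so we pass pos="v" to lemmatize these words as verbs
    lemmatized_word = lemmatizer.lemmatize(word, pos="v")
    print(f"The word {word} stemmed is {stemmed_word} and lemmatized is {lemmatized_word}")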

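## Stop Words
Stop words are very common words (such as "the", "is", and "and") that carry little meaning on their own, so they are often removed before modeling.
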
In [None]:
from nltk.corpus import stopwords
#nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(f"Original tokens: {tokens}")
print(f"Filtered tokens: {filtered_tokens}")


# 5. Challenge

Go back to the data set from the beginning of this lecture.

The challenge is to build a better predictive model.
Some things you can try:
- use n-grams instead of only 1-grams
- remove stop words
- lowercase the text
- use a stemmer
- remove punctuation
- use a different machine learning model
- ...anything else you can imagine.