<a href="https://colab.research.google.com/github/christinabrnn/Python-Study/blob/main/BA820/text_analysis_basics_unsolved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Course: BA820 - Unsupervised and Unstructured ML**

**Notebook created by: Mohannad Elhamod**

## 0. Prerequisite: Text Cleaning and Regex

We will start with some interesting sentences. Each sentence in this context can be considered *a document*.

In [None]:
corpus = [
    "In 1945, the US dropped two nuclear bombs on Japan. Japan surrendered afterwards.",
    "Japan is located in Asia. Tokyo is its capital.",
    "The capital of the USA is Washington D.C., which is located on the eastern seaboard.",
    "I like eating apples! I eat 2.3 pounds everyday.",
    "The capitol of Canada is Ottawa. My aunt's number there is (613)-554-2121. I enjoy visiting here.",
    "       5/2 = 2.5.",
    "The professor was very kind to us when creating the midterm exam.",
    "An apple a day keeps the doctor away!",
    "I love Apple products",
    "@jason We won the game! #WeAreTheChampions.",
    "My phone number in Canada is (613)-224-2311        ",
    "Eat this apple."
]

import pandas as pd
df = pd.DataFrame({'text':corpus})
df

| | text |
|---|---|
| 0 | In 1945, the US dropped two nuclear bombs on J... |
| 1 | Japan is located in Asia. Tokyo is its capital. |
| 2 | The capital of the USA is Washington D.C., whi... |
| 3 | I like eating apples! I eat 2.3 pounds everyday. |
| 4 | The capitol of Canada is Ottawa. My aunt's num... |
| 5 | 5/2 = 2.5. |
| 6 | The professor was very kind to us when creatin... |
| 7 | An apple a day keeps the doctor away! |
| 8 | I love Apple products |
| 9 | @jason We won the game! #WeAreTheChampions. |


Some pre-processing you might want to consider:

- Lowercasing or uppercasing. *Is the effect positive or negative?*
- Removing leading and trailing spaces.
- Removing punctuation. *Is the effect positive or negative?*
- Replacing synonyms.

In [None]:
df_modified = df.copy()
df_modified = pd.DataFrame(df_modified.text.str.lower()) # to lower case.
# df_modified = pd.DataFrame(df_modified.text.str.strip()) # removing leading and trailing spaces

# df_modified = pd.DataFrame(df_modified.text.str.replace("like", "enjoy")) # synonym replacement
# df_modified = pd.DataFrame(df_modified.text.str.replace(r'[^\w\s]', '', regex=True)) # remove punctuation (raw string avoids invalid-escape warnings)

### returning matches
# df_modified = pd.DataFrame(df_modified.text.str.findall(r'@\S+|#\S+'))  # list of matches for each row
# df_modified = df_modified[df_modified.text.str.contains(r'@\S+|#\S+', regex=True)] # rows that contain matches
# df_modified = df_modified[df_modified.text.str.contains("us")] # rows containing the substring "us" (the mask is True or False per row)

df_modified

## 1. Tokenization: Text to Tokens

In order to represent text, we need to first break it down into units.

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize, WhitespaceTokenizer, RegexpTokenizer
from nltk.tokenize.casual import TweetTokenizer
import nltk


tokenized = [word_tokenize(t) for t in corpus] # word tokenization
# tokenized = [WhitespaceTokenizer().tokenize(t) for t in corpus] # split on white space only
# tokenized = [TweetTokenizer().tokenize(t) for t in corpus] # tweet-aware tokenization (keeps @mentions and #hashtags together)
# tokenized = [RegexpTokenizer(r'\d{4}|\d{3}', gaps=False).tokenize(t) for t in corpus] # regex tokenization: keeps only runs of 3 or 4 digits
# tokenized = [RegexpTokenizer(r'\([0-9]{3}\)-[0-9]{3}-[0-9]{4}', gaps=False).tokenize(t) for t in corpus] # regex tokenization: keeps full phone numbers only
tokenized

We may want to remove stop words.

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) # Get the set of stop words

tokenized_filtered =


tokenized_filtered
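One possible way to filter stop words is a nested list comprehension over the tokenized documents. The sketch below is self-contained and uses a tiny, made-up stop list (`stop_words_demo`) and hand-written tokens (`tokenized_demo`) as stand-ins; in the notebook, `stop_words` from NLTK and `tokenized` would play these roles.

```python
# Tiny illustrative stop list (an assumption); NLTK's
# stopwords.words('english') plays this role in the notebook.
stop_words_demo = {"i", "the", "a", "an", "is", "this"}

# Hand-written tokens standing in for the output of a tokenizer.
tokenized_demo = [
    ["An", "apple", "a", "day", "keeps", "the", "doctor", "away", "!"],
    ["Eat", "this", "apple", "."],
]

# Compare in lower case, because stop lists are usually lower-cased.
filtered_demo = [
    [tok for tok in sentence if tok.lower() not in stop_words_demo]
    for sentence in tokenized_demo
]
print(filtered_demo)
```

Note that punctuation tokens survive this step, since they are not in the stop list.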

Tokenization does not have to be at the word level... What matters is that we split the text into units in some way.

In [None]:
nltk.download('words')
from nltk.corpus import words
from nltk.tokenize import SyllableTokenizer

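ST = SyllableTokenizer() # Splits a single word into syllables, so we feed it one word at a time.

tokenized_filtered_syllables = []
for sentence in df_modified["text"]:
  # split on white space first, then break each word into syllables
  syllables = [syl for word in sentence.split() for syl in ST.tokenize(word)]
  tokenized_filtered_syllables.append(syllables)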

tokenized_filtered_syllables

**Questions:**

- Are there any interesting sentences you want to try?
- What if I want to collect only phone numbers?
- What if I want to collect only email addresses?
- Is it better to tokenize by word, sentence, character, or "sub-word"?
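For the phone number and email questions, one possible approach is a regular expression per pattern. The sketch below uses Python's built-in `re` module. The phone pattern assumes the `(613)-554-2121` format seen in the corpus; the email pattern is deliberately simplified, and the third sentence is a hypothetical example that is not part of the corpus.

```python
import re

docs = [
    "The capitol of Canada is Ottawa. My aunt's number there is (613)-554-2121. I enjoy visiting here.",
    "My phone number in Canada is (613)-224-2311        ",
    "Write to jane.doe@example.com for details.",  # hypothetical sentence, not in the corpus
]

# Assumes the (ddd)-ddd-dddd format used in the corpus.
PHONE_PATTERN = re.compile(r"\(\d{3}\)-\d{3}-\d{4}")
# Simplified email pattern; real-world addresses can be more varied.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

phones = [PHONE_PATTERN.findall(d) for d in docs]
emails = [EMAIL_PATTERN.findall(d) for d in docs]
print(phones)
print(emails)
```

The same patterns could be passed to `RegexpTokenizer` or to `Series.str.findall` instead of `re`.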


### Stemming

Alternatively, we can stem words and hopefully keep the essence of the meaning of each sentence. This might make text comparison easier.

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

tokenized_filtered_stemmed =

tokenized_filtered_stemmed
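As a rough illustration of what stemming does, the sketch below applies NLTK's `PorterStemmer` to a couple of hand-written token lists (`tokenized_demo` is a stand-in for `tokenized_filtered`). It is one way to fill in the cell above, not necessarily the intended one.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-written tokens standing in for the filtered tokens of the notebook.
tokenized_demo = [
    ["I", "like", "eating", "apples"],
    ["Eat", "this", "apple"],
]

# Stem every token of every document.
stemmed_demo = [[stemmer.stem(tok) for tok in sentence] for sentence in tokenized_demo]
print(stemmed_demo)
```

Here "apples" and "apple" end up with the same stem, and "eating" is reduced to "eat". Stems are not guaranteed to be dictionary words.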

You can always put the tokens back together into one string.

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

for tokenized_sentence in tokenized_filtered_stemmed:
  print(TreebankWordDetokenizer().detokenize(tokenized_sentence))

## 2. Frequency-based Vectorization

### BoW

Now that we have tokenized the sentences, we can vectorize them. Let's use **Bag of Words**, the simplest way we know.

Let's create and fit the model.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def tokenize_lemmatize(sentence):
  tokens = word_tokenize(sentence) # tokenize
  # Note: despite the function's name, this applies the Porter stemmer, not a lemmatizer.
  stemmed_tokens = [ps.stem(word) for word in tokens] # stem
  return stemmed_tokens

#model
cv = #tokenizer= word_tokenize, tokenizer=tokenize_lemmatize stop_words='english'

# fit
cv.

print('number of `tokens`', len(cv.vocabulary_))
cv.vocabulary_

You can print the list of stop words that was used.

In [None]:
cv.get_stop_words()

Now, let's transform the documents into BoW format.

In [None]:
dtm = cv.
bow = pd.DataFrame(, columns=)
bow

**Questions:**

- What does this table remind you of? Is there any relevance?
- What does the TF-IDF representation look like?





We can see which tokens were extracted for a sentence using `cv.inverse_transform`.

In [None]:
recognized_tokens_sentence0 = cv.
recognized_tokens_sentence0

**Questions:**

- Why did it not return the sentence back?
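A small sketch can help with this question. Under the default `CountVectorizer` settings (an assumption; the notebook's `cv` may be configured differently), `inverse_transform` only reports which vocabulary terms have a non-zero count in each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["An apple a day keeps the doctor away!", "Eat this apple."]

cv_demo = CountVectorizer()
dtm_demo = cv_demo.fit_transform(docs)

# Per document: the vocabulary terms whose count is non-zero.
recovered = cv_demo.inverse_transform(dtm_demo)
recovered_first = sorted(str(tok) for tok in recovered[0])
print(recovered_first)
```

Word order, repetition counts, casing, punctuation, and any token dropped by the tokenizer (here the single-character "a") are not stored in the matrix, so they cannot be recovered.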

### Document Similarity

Let's compare cosine similarity vs. Euclidean distance. We will calculate the *similarity matrix*.

First, cosine similarity:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

# Cosine sim
cos_sim = pd.DataFrame(  )
cos_sim
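As a sketch of what the similarity matrix looks like, the example below vectorizes three corpus sentences with a default `CountVectorizer` (an assumption about settings) and passes the result to `cosine_similarity`. The names `dtm_demo` and `cos_sim_demo` are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Japan is located in Asia. Tokyo is its capital.",
    "I love Apple products",
    "Eat this apple.",
]

dtm_demo = CountVectorizer().fit_transform(docs)

# Square matrix: entry (i, j) is the cosine similarity between documents i and j.
cos_sim_demo = pd.DataFrame(cosine_similarity(dtm_demo))
print(cos_sim_demo.round(2))
```

Documents 1 and 2 share only the token "apple" out of three tokens each, giving a similarity of 1/3, while document 0 shares no tokens with the others and scores 0.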

We can try to answer a question by finding the documents most similar to it:

In [None]:
q = "What' is my aunt's number'?"

q_vector = cv.transform(   ) # Get the question's representation in terms of BoW

pd.DataFrame(   ) # How similar is each sentence to that question?
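One way to set up this kind of lookup is sketched below: transform the question with the already-fitted vectorizer, then compute its cosine similarity to every document. The three documents, the cleaned-up question text, and the names `cv_demo`, `q_demo`, and `scores` are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "My aunt's number there is (613)-554-2121.",
    "I love Apple products",
    "My phone number in Canada is (613)-224-2311",
]

cv_demo = CountVectorizer()
dtm_demo = cv_demo.fit_transform(docs)

q_demo = "What is my aunt's number?"
# transform expects an iterable of documents, hence the list.
q_vec = cv_demo.transform([q_demo])

# One row per document: its similarity to the question.
scores = pd.DataFrame(cosine_similarity(dtm_demo, q_vec), columns=["similarity"])
print(scores.sort_values("similarity", ascending=False))
```

Words in the question that are not in the fitted vocabulary (here "what") are silently ignored.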

We can see that the top two matches are the ones with phone numbers (mine and my aunt's).

Now, Euclidean *similarity*:

In [None]:
uclidean_distances =

uclidean_similarity

In [None]:
dist_matrix =
sim_matrix =

sim_matrix

Notice that when using Euclidean distance, documents that are not related still have a non-zero similarity, which is not ideal.
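The effect can be seen on two toy count vectors with no terms in common. The notebook does not fix how distance is turned into similarity; the sketch below assumes the common conversion `1 / (1 + distance)`, so treat that formula as an illustration rather than the notebook's definition.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Two toy documents that share no terms at all.
X_demo = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

dist_demo = euclidean_distances(X_demo)
# Assumed conversion from distance to similarity: identical documents get 1.
sim_demo = 1 / (1 + dist_demo)
cos_demo = cosine_similarity(X_demo)

print("Euclidean similarity:", sim_demo[0, 1])
print("Cosine similarity:   ", cos_demo[0, 1])
```

The distance between the two documents is 2, so the Euclidean similarity is 1/3 even though they have nothing in common, whereas the cosine similarity is exactly 0.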

### TF-IDF

Let's rerun the same experiment with TF-IDF.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf_model =

tfidf_model.

df_tfidf_transformed = tfidf_model.
tfidf_vectors =
tfidf_vectors

### Application: Spam Detection

Let's apply what we have learned to a dataset of messages. Each message is labeled as either ham (legitimate) or spam.


In [None]:
url = "https://raw.githubusercontent.com/elhamod/BA820/main/Hands-on/04-text-mining/hamspam.csv"
df_sms = pd.read_csv(url, names = ['type', 'text'], index_col='type')

X = df_sms['text']
y = df_sms.index

# X = X.str.lower() # We can try lower-casing.

df_sms

**Unsupervised step: Vectorize!**

In [None]:
from sklearn.model_selection import train_test_split

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
vectorizer = CountVectorizer() #lowercase=False

# fit the vectorizer on the training set and vectorize it.
X_train_counts = vectorizer.

# vectorize the test set
X_test_counts = vectorizer.
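The usual pattern here is to learn the vocabulary from the training set only and then reuse it for the test set. The sketch below shows that pattern on made-up messages (`train_docs` and `test_docs` are hypothetical, not taken from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages, for illustration only.
train_docs = ["win a free prize now", "see you at lunch"]
test_docs = ["free lunch tomorrow"]

vec_demo = CountVectorizer()
X_train_demo = vec_demo.fit_transform(train_docs)  # learn the vocabulary from train, then vectorize
X_test_demo = vec_demo.transform(test_docs)        # reuse that vocabulary; unseen words are ignored

print(X_train_demo.shape, X_test_demo.shape)
```

Both matrices have the same number of columns, which is what the classifier needs. The word "tomorrow" never appeared in training, so it contributes nothing to the test vector.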

In [None]:
X_train_counts.toarray().shape

**Question:** How many unique tokens do we have in this dataset?

**Supervised learning: Let's train a classifier and look at the results.**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix

# train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_counts, y_train)

# Predict on the test data
y_pred = model.predict(X_test_counts)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
pd.DataFrame(confusion_matrix(y_test, y_pred, normalize='true'), columns=model.classes_, index=model.classes_ )

**Questions:**

- Would making tokens lower case help?
- Explore whether using a purely supervised approach leads to better results (e.g., by feature engineering the text. Think text length, number of exclamation marks, etc.)
- Which tokens does logistic regression weigh most heavily in its classification decision?

### Let's explore n-grams

In [None]:
import numpy as np

# Some parameters we could play with.

lowercase= True
n_gram_range = (1,3)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 1: Let's vectorize.

In [None]:
import string

def tokenize(doc):
  tokens = word_tokenize(doc)

  # Remove punctuation
  tokens = [word for word in tokens if word not in string.punctuation]

  return tokens


In [None]:
import sklearn
vectorizer_ngram = CountVectorizer(lowercase=lowercase, ngram_range=n_gram_range, tokenizer=tokenize, stop_words='english')

In [None]:
# Fit on the training data. Also transform it.
X_train_ngram = vectorizer_ngram.fit_transform(X_train)

# Transform the test data.
X_test_ngram = vectorizer_ngram.transform(X_test)

X_train_ngram_df = pd.DataFrame(X_train_ngram.toarray(), columns=vectorizer_ngram.get_feature_names_out())

In [None]:
X_train_ngram_df

How many "1"s are there here? What does that mean?

In [None]:
X_train_ngram_df.astype(int).sum().sum()

Step 2: Predict!

In [None]:
# train the model
model_ngram = LogisticRegression(max_iter=1000)
model_ngram.fit(X_train_ngram, y_train)

# Predict on the test data
y_pred_ngram = model_ngram.predict(X_test_ngram)

# Evaluate the model
accuracy_ngram = accuracy_score(y_test, y_pred_ngram)
f1_score_ngram = sklearn.metrics.f1_score(y_test, y_pred_ngram, pos_label="spam")
print(f"Accuracy: {accuracy_ngram}")
print(f"f1_score: {f1_score_ngram}")
print(sklearn.metrics.classification_report(y_test,y_pred_ngram))
pd.DataFrame(confusion_matrix(y_test, y_pred_ngram, normalize='true'), columns=model_ngram.classes_, index=model_ngram.classes_ )

**Questions:**

- What would happen if I use a large context (i.e., large n)?
- What would happen if I use a large range of "n"s (i.e., a mixed n-gram model)?
- Would the results change if we balance the dataset?
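To get a feel for the first two questions, it helps to watch the vocabulary size as `ngram_range` changes. The sketch below uses two corpus sentences and a default `CountVectorizer` (no custom tokenizer or stop words, which is an assumption that differs from the cell above).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love Apple products", "Eat this apple."]

vocab_sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3), (3, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(docs)
    # Number of distinct n-grams found for this range.
    vocab_sizes[ngram_range] = len(vec.vocabulary_)

print(vocab_sizes)
```

Wider ranges add columns quickly, and long n-grams such as "love apple products" rarely repeat across documents, so the matrix becomes larger and sparser.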