# Natural Language Processing + Logistic Regression

---

## What to expect:

1. Natural Language Processing
2. Introduction to NLTK
3. Dataset: Spam Emails
4. Text cleaning
5. Text Feature Extraction
6. Example of using a Sklearn Vectorizer
7. Implementing a Logistic Regression Model

---

In [1]:
# Import the usual...
import numpy as np
import pandas as pd

# Regex
import re

## 1. Natural Language Processing

![picture](https://miro.medium.com/v2/resize:fit:1358/0*ZAqYrctJClczGHps)

## 3. Dataset: Spam Bam Thank You Ma'am!

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,572 English SMS messages, each tagged as either ham (legitimate) or spam.

<img src='https://64.media.tumblr.com/d49d6e69d3c5a8d086959acf3a2dba2d/tumblr_npzsif8c7H1tjmfrio1_1280.jpg' width=400>

In [2]:
# Load the spam dataset
spam = pd.read_csv('spam.csv', encoding = "ISO-8859-1")
# Rename the columns in the dataset
spam.rename(columns={'v1': 'Email Type', 'v2': 'Email'}, inplace=True)

In [3]:
# Let's take a look at our row and column count
spam.shape

(5572, 5)

In [4]:
spam.head()

Unnamed: 0,Email Type,Email,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
# Let us take a look at null values and empty columns
spam.isnull().sum()

Email Type       0
Email            0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [6]:
# Some cleaning: drop the three mostly empty columns
spam = spam.drop(columns =['Unnamed: 2','Unnamed: 3', 'Unnamed: 4'])

In [7]:
# That looks much better
spam.head()

Unnamed: 0,Email Type,Email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


---

## 2. Introduction to NLTK

The **Natural Language Toolkit (NLTK)** is a powerful and versatile library for natural language processing (NLP) in Python.

It provides tools and resources for tasks such as:

- **Text Processing**
- **Linguistic Analysis**
- **Machine learning**

With a comprehensive collection of datasets and algorithms, NLTK facilitates tasks like:
- Tokenization
- Part-of-speech tagging,
- Sentiment analysis and more.

Its user-friendly interface and extensive documentation make it accessible for beginners, while its scalability and functionality appeal to researchers and professionals. NLTK plays a pivotal role in advancing NLP research and applications, serving as a foundational tool in academia, industry, and the development of cutting-edge language models.

```python
# Make sure that you have installed the library on your local machine
!pip install nltk

# Import the package
import nltk

# Option 1: open the interactive NLTK downloader (may be expensive for certain laptops)
nltk.download()

# Option 2: skip the downloader and download the required resources directly
nltk.download(['punkt', 'stopwords'])
```

###  Natural Language Toolkit (NLTK) downloader tool:

<img src="https://github.com/Explore-AI/Pictures/blob/master/nltk_downloader.png?raw=true" width=50%/>

Use it to navigate to the item we need to download:
- stopwords corpus (Corpora tab)
- punkt tokenizer models (Models tab)

Navigate to these, click the download button, and exit the downloader when finished.

## 4. Text Cleaning

Text cleaning is a crucial preprocessing step in Natural Language Processing (NLP) that involves transforming raw text data into a format suitable for analysis. Here are key steps in text cleaning:

![gif](https://media2.giphy.com/media/FHEjBpiqMwSuA/200.webp?cid=ecf05e47g0sxxtxnpwujotm7zj2dqex8is0jyahlqzwh6kvl&ep=v1_gifs_search&rid=200.webp&ct=g)

### Removing Noise

Removing noise, such as **special characters**, is a crucial step in text preprocessing for Natural Language Processing (NLP). Special characters, symbols, and punctuation may introduce **unnecessary complexity** and hinder analysis. By systematically **eliminating these elements**, the text becomes cleaner and **more focused on the essential information**. Striking a balance between maintaining meaningful content and eliminating distracting noise is essential for extracting valuable insights from textual data in diverse domains, from social media analytics to scientific literature mining.

### <font color='maroon'>Regular Expression</font>


Regular expressions (regex or regexp) are powerful tools for pattern matching and string manipulation. The `re` module provides functions and methods to work with regular expressions in Python. There are quite a few useful tools in the library and I encourage you to check it out!
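As a quick orientation, the sketch below tries three commonly used functions from the `re` module on a made-up message. The sample string and the patterns are illustrative assumptions and are not taken from the spam dataset.

```python
import re

# Hypothetical sample message, made up for illustration
message = "WIN a FREE prize!!! Call 08001234567 now :)"

# re.findall returns every non-overlapping match of the pattern (runs of digits here)
digits = re.findall(r"\d+", message)

# re.sub replaces every match; here anything that is not a letter, digit or whitespace
no_special = re.sub(r"[^A-Za-z0-9\s]", "", message)

# re.split breaks the string wherever the pattern matches (runs of whitespace)
pieces = re.split(r"\s+", no_special.strip())

print(digits)
print(no_special)
print(pieces)
```

Note that `re.sub` leaves the surrounding whitespace in place, which is why the cleaned string is stripped before splitting.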

### Removing Emails

```python
# Importing the Regular Expressions Library
import re
```

In [8]:
# Example text containing email addresses
text_with_emails = """
    John Doe's email is john.doe@example.com,
    and Jane Smith can be reached at jane.smith@gmail.com.
"""

In [9]:
# Define a regex pattern for matching email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

In [10]:
# Use re.sub() to replace email addresses with an empty string
text_without_emails = re.sub(email_pattern, '', text_with_emails)

In [11]:
# Print the text without emails
print(text_without_emails)


    John Doe's email is ,
    and Jane Smith can be reached at .



### Removing Punctuation

In [12]:
# Create a new column
spam['Email_Lower'] = spam['Email'].copy().str.lower()

In [13]:
spam.head()

Unnamed: 0,Email Type,Email,Email_Lower
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro..."


In [14]:
spam['Email_Lower'].iloc[2]

"free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's"

In [15]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [16]:
def punctuation_remover(email):
    return ''.join([l for l in email if l not in string.punctuation])

In [17]:
spam['Email_no_punc'] = spam['Email_Lower'].apply(punctuation_remover)
spam['Email_no_punc'].iloc[2]

'free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s'
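Notice that deleting punctuation outright glues neighbouring words together (`question(std` becomes `questionstd`). One possible alternative, sketched below, replaces each punctuation character with a space instead. The helper name `punctuation_to_space` is hypothetical and not part of the notebook.

```python
import string

def punctuation_to_space(text):
    """Replace each punctuation character with a space, then collapse repeated spaces."""
    # Map every punctuation character to a single space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() followed by join() collapses any runs of whitespace
    return " ".join(text.translate(table).split())

sample = "receive entry question(std txt rate)t&c's apply"
print(punctuation_to_space(sample))
```

The trade-off is that contractions are split as well (`t&c's` becomes `t c s`), so which approach works better depends on the text.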

In [18]:
spam.head()

Unnamed: 0,Email Type,Email,Email_Lower,Email_no_punc
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...


### Tokenization

Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of **converting a sequence of text into smaller parts**, known as **tokens**. These tokens can be as small as characters or as long as words.

![gif](https://media0.giphy.com/media/VGVwLultLZjrrssAak/200w.webp?cid=ecf05e47i3nnw7s6mxydile8vewidzswfodj9j6yoji9msjk&ep=v1_gifs_search&rid=200w.webp&ct=g)
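To make the "characters or words" distinction concrete, here is a minimal sketch using plain Python only. The sample string is made up, and splitting on whitespace is a simplification of what a real tokenizer does.

```python
text = "Free entry now"

# Word-level tokens: split on whitespace
word_tokens = text.split()

# Character-level tokens: every character, spaces included
char_tokens = list(text)

print(word_tokens)
print(len(char_tokens))
```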

In [19]:
# Import library tools for tokenization
from nltk.tokenize import word_tokenize
from nltk.tokenize import TreebankWordTokenizer
text = "Tokenization is an important step in NLP."
tokens = word_tokenize(text)
print(tokens)

['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', '.']


#### Let's see how we can tokenize the entire email corpus

In [20]:
# Let's work with a more efficient tokenizer
tokeniser = TreebankWordTokenizer()
spam['Email_tokanized'] = spam['Email_no_punc'].apply(tokeniser.tokenize)

In [21]:
spam.head()

Unnamed: 0,Email Type,Email,Email_Lower,Email_no_punc,Email_tokanized
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l..."


In [22]:
tokanized = spam['Email_tokanized'].iloc[2]
print(tokanized)

['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'questionstd', 'txt', 'ratetcs', 'apply', '08452810075over18s']


### Stemming

Stemming in Natural Language Processing (NLP) is a linguistic normalization technique that involves **reducing words** to their base or **root** form. It simplifies word variations, allowing algorithms to treat related words as equivalent. Stemming **enhances text analysis by standardizing vocabulary**, aiding in tasks like text classification and sentiment analysis.

![gif](https://i.imgur.com/BAfaGBL.gif)

In [23]:
# Stemming packages from nltk
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer

In [24]:
# string example 1
red = "Redness Red Redden"

In [25]:
# string example 2
snore = "sleep sleeping sleepiest"

In [26]:
# Find the stem of each word in snore; notice the word sleepiest!
stemmer = SnowballStemmer('english')
for word in snore.split(): # This will create a list of the strings...
    print(stemmer.stem(word))

sleep
sleep
sleepiest
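Stems do not have to be dictionary words. The short sketch below runs the same `SnowballStemmer` on a few words that also occur in the spam messages; the word list itself is chosen only for illustration.

```python
from nltk import SnowballStemmer

stemmer = SnowballStemmer('english')

# A few illustrative words; the stemmer processes one word at a time
words = ["joking", "available", "crazy", "goes"]
stems = {word: stemmer.stem(word) for word in words}

print(stems)
```

Results such as `avail` and `crazi` are normal: the stemmer strips suffixes by rule rather than looking words up in a dictionary.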


In [27]:
spam['Email'].iloc[4]

"Nah I don't think he goes to usf, he lives around here though"

In [28]:
# Stem the individual words of one tokenized email
# (the stemmer expects a single word, so loop over the tokens of one row)
for word in spam['Email_tokanized'].iloc[4]:
    print(stemmer.stem(word))

In [29]:
spam['Email_stemmed'] = spam['Email_tokanized'].apply(lambda row: [stemmer.stem(word) for word in row])

In [30]:
spam.head()

Unnamed: 0,Email Type,Email,Email_Lower,Email_no_punc,Email_tokanized,Email_stemmed
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, until, jurong, point, crazi, avail, onli,..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, so, earli, hor, u, c, alreadi, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, i, dont, think, he, goe, to, usf, he, li..."


---

### Lemmatization

Lemmatization in Natural Language Processing (NLP) involves extracting the linguistic **root or lemma** of a word, providing its canonical form.

![gif](https://media1.giphy.com/media/12XDYvMJNcmLgQ/200.webp?cid=ecf05e478vd7gtq6jep6wrs9dle004apxt49yvcykoinkcvm&ep=v1_gifs_search&rid=200.webp&ct=g)

In [31]:
# Lemmatizer imports
from nltk.stem import WordNetLemmatizer
import nltk

# Download WordNet data
nltk.download('wordnet')

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# Lemmatize different words
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("larvae"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("tulips"))
print(lemmatizer.lemmatize("pythons"))
print(lemmatizer.lemmatize("greater", pos="a"))  # "a" indicates adjective
print(lemmatizer.lemmatize("greatest", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("ran", pos='v'))  # "v" indicates verb


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/claudiaelliotwilson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


cat
larva
goose
tulip
python
great
great
run
run


In [32]:
from nltk.stem import WordNetLemmatizer
import nltk

# Download the WordNet dataset
nltk.download('wordnet')


# Create WordNetLemmatizer instance
lemmatizer = WordNetLemmatizer()

# Lemmatize each word in the list
spam['Email_lemmatized'] = spam['Email_tokanized'].apply(lambda row: [lemmatizer.lemmatize(word) for word in row])

# Preview the DataFrame with the new lemmatized column
spam.head()


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/claudiaelliotwilson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Email Type,Email,Email_Lower,Email_no_punc,Email_tokanized,Email_stemmed,Email_lemmatized
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, until, jurong, point, crazi, avail, onli,...","[go, until, jurong, point, crazy, available, o..."
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, so, earli, hor, u, c, alreadi, t...","[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, i, dont, think, he, goe, to, usf, he, li...","[nah, i, dont, think, he, go, to, usf, he, lif..."


---

### Stop words

![gif](https://media3.giphy.com/media/2WNQ0N41BK5Us/200w.webp?cid=ecf05e47n7g6qrvr4lbv5ni8u2twz2zuxtudsnct5hiwpovl&ep=v1_gifs_search&rid=200w.webp&ct=g)

Stop words in Natural Language Processing (NLP) are common words, such as "the," "and," and "is," that are frequently used but often add little semantic meaning to a text.

In [33]:
# Import the stop words corpus
from nltk.corpus import stopwords

In [34]:
# Let's take a look!
stopwords_list = stopwords.words('english')
print(stopwords_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [35]:
# Download the NLTK stop words dataset
nltk.download('stopwords')

# Sample text
text = "This is an example sentence with some stop words."

# Tokenize the text
words = word_tokenize(text)

# Remove stop words (build the set once so each lookup is fast)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

# Print the result
print("Original Text:", text)
print("After Removing Stop Words:", ' '.join(filtered_words))

Original Text: This is an example sentence with some stop words.
After Removing Stop Words: example sentence stop words .


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/claudiaelliotwilson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
# Download the NLTK stop words dataset
nltk.download('stopwords')


# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words('english'))
spam['no_stop_words'] = spam['Email_lemmatized'].apply(lambda words: [word for word in words if word not in stop_words])

# Print the result
print(spam['Email_lemmatized'].iloc[4])
print(spam['no_stop_words'].iloc[4])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/claudiaelliotwilson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['nah', 'i', 'dont', 'think', 'he', 'go', 'to', 'usf', 'he', 'life', 'around', 'here', 'though']
['nah', 'dont', 'think', 'go', 'usf', 'life', 'around', 'though']


---

## 5. Text Feature Extraction

Text feature extraction in Natural Language Processing (NLP) involves **isolating and retrieving relevant information** from *unstructured* text data and representing it as numerical features that a model can use.

### Bag of Words

A Bag of Words representation focuses on the frequency of words in a document. It creates a sparse matrix or a dictionary in which each unique word is a feature and its frequency is the corresponding value.

![gif](https://media0.giphy.com/media/26tkl6oesuCt5akbC/giphy.webp?cid=ecf05e47e3t1n6o72gxowf3cea64f8vc3venhga5mfxy5lbq&ep=v1_gifs_search&rid=giphy.webp&ct=g)

In [37]:
# Sample document
document = "This is a simple example of a Bag of Words representation."

# Tokenize the document into words
words = document.lower().split()

# Create a Bag of Words representation using a dictionary
bow_representation = {}
for word in words:
    bow_representation[word] = bow_representation.get(word, 0) + 1

# Print the resulting dictionary
print(bow_representation)

{'this': 1, 'is': 1, 'a': 2, 'simple': 1, 'example': 1, 'of': 2, 'bag': 1, 'words': 1, 'representation.': 1}


In [38]:
spam['no_stop_words'].iloc[2]

['free',
 'entry',
 '2',
 'wkly',
 'comp',
 'win',
 'fa',
 'cup',
 'final',
 'tkts',
 '21st',
 'may',
 '2005',
 'text',
 'fa',
 '87121',
 'receive',
 'entry',
 'questionstd',
 'txt',
 'ratetcs',
 'apply',
 '08452810075over18s']

In [39]:
# Take the already tokenized email (a list of words)
words = spam['no_stop_words'].iloc[2]

# Create a Bag of Words representation using a dictionary
bow_representation = {}
for word in words:
    bow_representation[word] = bow_representation.get(word, 0) + 1

# Print the resulting dictionary
print(bow_representation)

{'free': 1, 'entry': 2, '2': 1, 'wkly': 1, 'comp': 1, 'win': 1, 'fa': 2, 'cup': 1, 'final': 1, 'tkts': 1, '21st': 1, 'may': 1, '2005': 1, 'text': 1, '87121': 1, 'receive': 1, 'questionstd': 1, 'txt': 1, 'ratetcs': 1, 'apply': 1, '08452810075over18s': 1}
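The same counts can also be produced with `collections.Counter` from the standard library. This is a minimal sketch on a shortened token list, picked here only to keep the example small.

```python
from collections import Counter

# A shortened, illustrative token list
tokens = ['free', 'entry', 'win', 'fa', 'cup', 'fa', 'entry']

# Counter builds the word -> frequency mapping in one step
bow = Counter(tokens)

print(bow)
print(bow.most_common(2))
```

A `Counter` also returns 0 for words it has never seen, which avoids the need for `.get(word, 0)`.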


---

### N-grams

N-grams in Natural Language Processing (NLP) are sequential **word combinations** of 'n' length, where 'n' represents the number of words in each **grouping**.

In [40]:
from nltk.util import ngrams

def word_grams(words, min_n=1, max_n=4):
    list_1 = []
    for n in range(min_n, max_n + 1):  # include max_n itself in the n-gram sizes
        for ngram in ngrams(words, n):
            list_1.append(' '.join(str(i) for i in ngram))
    return list_1

# Example use case from spam dataset:
words = spam['no_stop_words'].iloc[4]


result = word_grams(words, min_n=2, max_n=3)
print(result)

['nah dont', 'dont think', 'think go', 'go usf', 'usf life', 'life around', 'around though', 'nah dont think', 'dont think go', 'think go usf', 'go usf life', 'usf life around', 'life around though']
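To see what happens under the hood, n-grams can be built with a simple sliding window. The sketch below uses plain Python on the same tokens; the helper name `simple_ngrams` is hypothetical and is not how NLTK implements `ngrams`.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    # A list of k tokens contains k - n + 1 windows of length n
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['nah', 'dont', 'think', 'go', 'usf', 'life', 'around', 'though']

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(len(bigrams), len(trigrams))
print(trigrams[0])
```

With 8 tokens this gives 7 bigrams and 6 trigrams, matching the output above.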


---

## 6. Sklearn Vectorizers

Luckily for us we have tools that make it much easier to process text!

Scikit-learn provides several text vectorizers for converting text data into numerical vectors, commonly used in Natural Language Processing (NLP) tasks. The example below uses one of the most popular, `CountVectorizer`:

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Get the feature names (unique words in the corpus)
feature_names = vectorizer.get_feature_names_out()

# Print the CountVectorizer matrix
print("CountVectorizer Matrix:")
print(X.toarray())


CountVectorizer Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [42]:
# Let's convert to a dataframe!
matrix = pd.DataFrame(X.toarray())
#Add column names
matrix.columns = feature_names
# Let's have a look
matrix

Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0,1,1,1,0,0,1,0,1
1,0,2,0,1,0,1,1,0,1
2,1,0,0,1,1,0,1,1,1
3,0,1,1,1,0,0,1,0,1


In [43]:
spam['new_text_cleaned'] = spam['no_stop_words'].apply(lambda words: ' '.join(words) )

In [44]:
spam.head()

Unnamed: 0,Email Type,Email,Email_Lower,Email_no_punc,Email_tokanized,Email_stemmed,Email_lemmatized,no_stop_words,new_text_cleaned
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, until, jurong, point, crazi, avail, onli,...","[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...",free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, so, earli, hor, u, c, alreadi, t...","[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]",u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, i, dont, think, he, goe, to, usf, he, li...","[nah, i, dont, think, he, go, to, usf, he, lif...","[nah, dont, think, go, usf, life, around, though]",nah dont think go usf life around though


In [45]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents
X_spam = vectorizer.fit_transform(spam['new_text_cleaned'])

# Get the feature names (unique words in the corpus)
feature_names_spam = vectorizer.get_feature_names_out()

# Print the CountVectorizer matrix
print("CountVectorizer Matrix:")
print(X_spam.toarray())

CountVectorizer Matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [46]:
# Let's convert to a dataframe!
spam_matrix = pd.DataFrame(X_spam.toarray())
#Add column names
spam_matrix.columns = feature_names_spam
# Let's have a look
# With all the words available in the dataset can we do more feature engineering?
spam_matrix

Unnamed: 0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,0207,02070836089,...,ìï,ìïll,ûthanks,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---

## 7. Implementing a Logistic Regression Model

First things first: we need to encode our y-variable into 1s and 0s so that our model can use it. **It is important to note, however, that sklearn can do this for you, so if you forget to do it, your model will still work!** For the sake of demonstration, we'll use the `LabelEncoder` to do this (the cell below was not run in this notebook, which is why the outputs that follow still show the `ham`/`spam` text labels):

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

spam['Email Type'] = encoder.fit_transform(spam['Email Type'])
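To see what the encoder actually does, here is a small self-contained sketch on five labels (the labels of the first five rows of the dataset). It assumes the default `LabelEncoder` behaviour of numbering the classes in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Labels of the first five messages, used as a tiny example
labels = ['ham', 'ham', 'spam', 'ham', 'ham']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels; the position of each label is its code
print(encoder.classes_.tolist())
print(encoded.tolist())

# inverse_transform maps codes back to the original labels
print(encoder.inverse_transform([0, 1]).tolist())
```

Here `ham` is mapped to 0 and `spam` to 1, so a prediction of 1 means "spam".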

In [47]:
spam.head()

Unnamed: 0,Email Type,Email,Email_Lower,Email_no_punc,Email_tokanized,Email_stemmed,Email_lemmatized,no_stop_words,new_text_cleaned
0,ham,"Go until jurong point, crazy.. Available only ...","go until jurong point, crazy.. available only ...",go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, until, jurong, point, crazi, avail, onli,...","[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar... joking wif u oni...,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entri, in, 2, a, wkli, comp, to, win, f...","[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...",free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor... u c already then say...,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, so, earli, hor, u, c, alreadi, t...","[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]",u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...","nah i don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, i, dont, think, he, goe, to, usf, he, li...","[nah, i, dont, think, he, go, to, usf, he, lif...","[nah, dont, think, go, usf, life, around, though]",nah dont think go usf life around though


In [48]:
X = X_spam.toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [49]:
y = np.array(spam['Email Type'])
y

array(['ham', 'ham', 'spam', ..., 'ham', 'ham', 'ham'], dtype=object)

In [50]:
# split up our data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [51]:
# fit a Logistic Regression model

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

In [52]:
# make predictions!

pred_lr = lr.predict(X_test)

In [53]:
pred_lr

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [54]:
pred_df = pd.DataFrame(pred_lr)
pred_df.head()

Unnamed: 0,0
0,ham
1,ham
2,ham
3,ham
4,spam


In [55]:
y_test_df = pd.DataFrame(y_test)
y_test_df.head()

Unnamed: 0,0
0,ham
1,ham
2,spam
3,ham
4,spam
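Comparing the two tables by eye only goes so far. A more systematic check uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics`. The sketch below applies them to just the five rows shown above, so the numbers illustrate the mechanics and say nothing about the model's performance on the full test set.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# The first five true labels and predictions shown above (illustration only)
y_true = ['ham', 'ham', 'spam', 'ham', 'spam']
y_pred = ['ham', 'ham', 'ham', 'ham', 'spam']

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])

print(acc)
print(cm)
```

On these five rows, one spam message is predicted as ham, which shows up as the 1 in the bottom-left cell of the matrix. For the full evaluation, the same functions can be called as `accuracy_score(y_test, pred_lr)` and `confusion_matrix(y_test, pred_lr)`.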


# Time to check our understanding 
### Binary Classification + Logistic Regression

**1. What method is used to best fit the curve in Logistic Regression?**

    a. Least Squares Method
    
    b. Euclidean Distance
    
    c. Maximum Likelihood Estimation
    
    d. Classification

**2. T/F: Binary classification refers to situations where the outcome variable has only two potential values**

**3. T/F: When trying to create a classification model, we *don't* need to encode our target variable when it is in a text/object format** (ignoring sklearn's default encoding)

**4. Complete the following:**

```python

from sklearn.preprocessing import ...

encoder = ...

df['outcome_variable'] = ....fit_transform(...)

```

**5. What is the shape of a Logistic Regression curve?**

**6. What is wrong with the following code?**

```python

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X_train, y_train, test_size = 0.2, random_state = 42)

```

**7. Why can we not fit a straight line (like linear regression) to a classification problem?**

**8. Which of the following are true regarding disadvantages of Logistic Regression?**

    i. Doesn't handle a large number of categorical variables well

    ii. Does not work well with non-linearly separable data

    iii. Cases of collinearity are worse for Logistic Regression than for Linear Regression


- ii only
- i & ii
- all of the above

# NLP Exercises

### Solve the following:

#### Question 1: Which adjective form has the lemma "cool"? 😎

In [None]:
# Lemmatizer imports
from nltk.stem import WordNetLemmatizer
import nltk

# Download WordNet data
nltk.download('wordnet')

# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# Lemmatize the adjective for "cool"
print(lemmatizer.lemmatize("...", pos="a"))


#### Question 2: Tokenize the following sentence 🪙

In [None]:
# Import library tools for tokenization
from nltk.tokenize import word_tokenize
text = "How will you tokenize the following?"
tokens = ...
print(...)

#### Question 3: Create a matrix using the following code

![gif](https://media2.giphy.com/media/1yvoDVJQsTfHi/200w.webp?cid=ecf05e47rz1ww0wvlzk9x0ak6fniv8curhgxj45gqs9bejl0&ep=v1_gifs_search&rid=200w.webp&ct=g)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the documents
X = ...fit_transform(spam['Email'])

# Get the feature names (unique words in the corpus)
feature_names = ...get_feature_names_out()

# Print the CountVectorizer matrix
print("CountVectorizer Matrix:")
print(X.toarray())

## Theory: Questions

Q1: **What is the difference between stemming and lemmatization?**
1. Stemming converts the words to lowercase while lemmatization reduces the word to its root
2. Both reduce the word to a root form, but lemmatization uses the context of the word to return a meaningful base form
3. Lemmatization splits sentences into individual words, stemming splits words into individual characters


Q2: **In the text cleaning process, which comes first: tokenisation or stemming?**

Q3: **How many trigrams can we make from this question?**
- (hint: ‘tri’ = 3)


Q4: **True or False: Bag of words only counts one occurrence of each word in the text?**

Q5: **What package(s) can we import to help us with removing punctuation?**

---

## Theory: Answers

Q1: **What is the difference between stemming and lemmatization?**
1. Stemming converts the words to lowercase while lemmatization reduces the word to its root
2. **Both reduce the word to a root form, but lemmatization uses the context of the word to return a meaningful base form**
3. Lemmatization splits sentences into individual words, stemming splits words into individual characters

Q2: **In the text cleaning process, which comes first: tokenisation or stemming?**
- Tokenisation. Sentences need to be split into their individual words before stemming can work on each of those words


Q3: **How many trigrams can we make from this question?**
- 7
  - ‘How many trigrams’
  - ‘many trigrams can’
  - ‘trigrams can we’
  - ‘can we make’
  - ‘we make from’
  - ‘make from this’
  - ‘from this question?’


Q4: **True or False: Bag of words only counts one occurrence of each word in the text?**
- False - Bag of Words counts the total number of times each word appears in the text


Q5: **What package(s) can we import to help us with removing punctuation?**
- import string, re


---