# Sample Text

In [None]:
text = """Urban green spaces have become essential parts of city planning.
Studies show that having parks and community gardens helps improve mental and
physical health. In 2023, over 68% of city residents in the U.S. lived within a
10-minute walk of a public green space. A recent report from the National Urban
Landscape Association was shared via email (contact_us@green2030.org) and
highlighted the need for biodiversity in city environments. Another contributor,
michael92@ecochange.net, stated that urban ecosystems should be included in
infrastructure planning to lessen the impact of climate change. As cities keep
growing, sustainable urban design will be more important."""

# 1. What are regular expressions? Provide a pattern to extract emails containing both numbers and alphabets. Discuss the advantages and limitations of using regular expressions.

Regular expressions are sequences of characters that define a search pattern for text. They are primarily used for matching and manipulating text, which makes them a powerful tool to find, validate and transform text data.

**Pattern to extract emails containing both numbers and alphabets.**

In [None]:
email_regex = r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}"

# Lookbehind/lookahead assertions: the match must not be directly adjacent to word characters, hyphens or periods
pattern = r"(?<![\w\-\.])" + email_regex + r"(?![\w\-\.])"

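Note: The pattern above does not match email addresses containing periods, underscores or hyphens, which is why contact_us@green2030.org is excluded. It accepts any alphanumeric local part and does not by itself require that both letters and digits be present.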

In [None]:
import re

emails = re.findall(pattern, text)

for email in emails:
  print(email)

michael92@ecochange.net
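
The pattern above accepts any alphanumeric local part, so an address made up only of letters or only of digits would also match. A minimal sketch of a stricter variant follows, assuming that "both numbers and alphabets" refers to the part before the `@`. The name `strict_regex` and the sample addresses are hypothetical and used only for illustration.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
sample = "Write to michael92@ecochange.net, anna@example.org or 12345@example.org today."

# Two lookaheads require at least one digit and at least one letter
# in the local part before the usual alphanumeric match is attempted.
strict_regex = (
    r"(?<![\w.\-])"              # not preceded by a word character, period or hyphen
    r"(?=[a-zA-Z0-9]*[0-9])"     # local part contains a digit
    r"(?=[a-zA-Z0-9]*[a-zA-Z])"  # local part contains a letter
    r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}"
    r"(?![\w.\-])"               # not followed by a word character, period or hyphen
)

matches = re.findall(strict_regex, sample)
print(matches)
```

Only the first address satisfies both lookaheads; the other two fail because their local parts lack a digit or a letter, respectively.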


**Advantages**


*   Regular expressions efficiently match complex string patterns with minimal code.
*   Regular expressions are powerful for text parsing and data validation tasks.
*   Regex is supported across many programming languages and platforms.



**Limitations**

*   Regex patterns in raw text form are difficult to read and maintain.
*   Regex can perform poorly on very large inputs, and patterns with nested quantifiers may backtrack excessively.
*   Regex is not suitable for parsing nested or hierarchical data structures like XML.
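
The readability limitation can be eased somewhat with Python's `re.VERBOSE` flag, which lets a pattern be spread over several lines with comments. The sketch below rewrites the core email pattern from question 1 in that style; the name `readable_email` and the test sentence are assumptions made for illustration.

```python
import re

# Same core pattern as email_regex, written with re.VERBOSE so that
# whitespace is ignored and each part can carry a comment.
readable_email = re.compile(
    r"""
    [a-zA-Z0-9]+    # local part: letters and digits only
    @               # separator
    [a-zA-Z0-9]+    # domain name
    \.              # literal dot
    [a-zA-Z]{2,}    # top-level domain, at least two letters
    """,
    re.VERBOSE,
)

found = readable_email.findall("Contact michael92@ecochange.net for details.")
print(found)
```

The behaviour matches the one-line pattern; only the layout changes, which tends to make later maintenance easier.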

# 2. What is the Bag of Words (BoW) technique? Explain how it differs from regular expressions and describe its limitations.

The Bag of Words model is a text representation technique that converts a document or a string of text into an unordered collection, or 'bag', of words by counting word occurrences (frequencies). It therefore disregards word order and grammar.
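
To make the loss of word order concrete, here is a minimal sketch that uses only the standard library rather than scikit-learn. The helper `bag_of_words` and the two example sentences are hypothetical. Two sentences with opposite meanings end up with identical bags, which is one of the main limitations of the technique.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase, whitespace-separated tokens; order is discarded."""
    return Counter(sentence.lower().split())

bag_a = bag_of_words("dog bites man")
bag_b = bag_of_words("man bites dog")

print(bag_a)
print(bag_a == bag_b)  # the two bags are identical even though the meanings differ
```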

The code to apply the BoW model in Python is below:

In [None]:
#Importing necessary libraries and packages
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
#Downloads
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Data Preprocessing:

In [None]:
#Pre-processing function that returns a list of lowercased sentences with stopwords removed
def preprocess(text):
  stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
  sentences = sent_tokenize(text)
  processed = list()

  for sentence in sentences:
    words = word_tokenize(sentence.lower())
    words = [word for word in words if word not in stop_words]
    processed.append(" ".join(words))

  return processed

processed = preprocess(text)

Vectorize:

In [None]:
# Vectorizer object
vectorizer = CountVectorizer()

# Bag of Words array
bow = vectorizer.fit_transform(processed).toarray()
print(bow)

[[0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0
  1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1]
 [0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
  0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 0 0]
 [0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0
  0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0]]


**Difference between Regular Expressions and Bag of Words**

Both Regular Expressions and Bag of Words are used for text processing, but serve different purposes and operate on different principles.

*   Regular Expressions match text patterns, whereas Bag of Words represents word frequencies in text numerically.
*   Regular Expressions use symbols for pattern matching, whereas BoW uses token frequency vectors.
*   Regex is rule-based and task-specific. Bag of Words, on the other hand, is general-purpose and statistical.

# 3. What is TF-IDF (Term Frequency-Inverse Document Frequency)? Explain the advantages of this approach and how it differs from regular expressions and the Bag of Words technique.

TF-IDF or Term Frequency-Inverse Document Frequency is a statistical measure used in information retrieval to evaluate the importance of a word in a document relative to a corpus. It combines Term Frequency (TF) with inverse document frequency (IDF) to reduce the weight of common terms and highlight distinctive words.
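
As a rough illustration of how the two factors combine, the sketch below computes a textbook variant, tf(t, d) × ln(N / df(t)), on a tiny hand-made corpus. The function `tf_idf` and the three toy documents are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus.
docs = [
    ["urban", "green", "spaces"],
    ["urban", "design"],
    ["green", "parks", "green"],
]

def tf_idf(docs):
    """Textbook TF-IDF: relative term frequency times ln(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In the first document, "spaces" appears in only one of the three documents and therefore receives a higher weight than "urban" or "green", which each appear in two.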

Below is the implementation in Python:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

processed = preprocess(text)

In [None]:
tf_vec = TfidfVectorizer() # TF-IDF vectorizer object

tfidf = tf_vec.fit_transform(processed).toarray() # TF-IDF array
print(tfidf)

[[0.         0.         0.         0.         0.         0.40238599
  0.         0.         0.         0.27857682 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.40238599 0.         0.32996226 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.40238599 0.         0.32996226 0.
  0.         0.         0.         0.         0.         0.
  0.40238599 0.         0.         0.         0.23871917 0.
  0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.31622777
  0.         0.         0.         0.         0.         0.
  0.         0.         0.31622777 0.         0.         0.
  0.31622777 0.31622777 0.         0.         0.         0.31622777
  0.         0.         0.         0.         0.   

**Advantages**
*   TF-IDF emphasises important terms by reducing the weight of frequent, non-informative words.
*   It is simple and efficient for text mining and retrieval tasks.
*   TF-IDF improves relevance in document ranking, search engines, etc.

**Differences among Regular Expressions, Bag of Words and TF-IDF**

**Regular Expressions**:
*   Pattern matching for identifying text structures like emails, dates, etc.
*   Used for text preprocessing rather than feature extraction.
*   No statistical or frequency-based representation of terms.

**Bag of Words**:
*   Represents text as a frequency count of words.
*   Ignores grammar, word order or context.
*   Simple and fast, but treats every word as equally important, so it cannot distinguish informative terms from common ones.

**TF-IDF**:
*   Takes BoW one step further by weighting terms by rarity across documents
*   Reduces the influence of less-informative but common words
*   Provides more meaningful features for classification and retrieval

# 4. What are word embeddings? Describe the advantages of using word embeddings.


Word embeddings in Natural Language Processing represent words as real-valued vectors that encode semantic and syntactic relationships based on context. Words learnt from corpora are mapped to a continuous vector space, enabling algorithms to capture similarities, analogies and syntactic structure. Popular models include Word2Vec, GloVe and FastText.

In [None]:
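#!pip install gensim
import gensim.models

data = word_tokenize(text)  # word_tokenize was imported from nltk.tokenize in an earlier cell
# Word2Vec model trained on the whole text as one token list; min_count=1 keeps every token
model = gensim.models.Word2Vec([data], min_count=1)
# Ten nearest neighbours of 'parks'; with such a small corpus they are not semantically meaningful
model.wv.most_similar(['parks'], topn=10)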

[('need', 0.23269957304000854),
 (')', 0.21507105231285095),
 ('city', 0.1953369677066803),
 ('should', 0.13609836995601654),
 ('2023', 0.120979905128479),
 ('for', 0.10494939237833023),
 ('health', 0.09511755406856537),
 ('of', 0.09415967017412186),
 ('have', 0.09364578872919083),
 ('green2030.org', 0.09035404026508331)]
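
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that measure on made-up three-dimensional vectors. The dictionary `toy_vectors` and its values are purely hypothetical and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings chosen by hand for illustration only.
toy_vectors = {
    "park":   [0.9, 0.1, 0.0],
    "garden": [0.8, 0.2, 0.1],
    "email":  [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["park"], toy_vectors["garden"])
sim_far = cosine_similarity(toy_vectors["park"], toy_vectors["email"])
print(round(sim_close, 3), round(sim_far, 3))
```

Vectors pointing in similar directions score close to 1 and unrelated ones score close to 0. This is the property a well-trained embedding model exploits to place related words near each other.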

# 5. What are stop words and how to remove them using the NLTK library?

Stop words are common words with minimal semantic value. They are often treated as irrelevant for NLP tasks because they contribute little to the content's meaning. Filtering them out during text preprocessing for tasks like text classification, sentiment analysis or information retrieval reduces computational load and can improve the efficiency of NLP models.

Examples of stop words are 'the', 'is', 'in', 'and', etc.

Stop word removal in Python is carried out using the `stopwords` corpus from the `nltk.corpus` package.

In [24]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('stopwords')

#Tokenizing function: splits the text into sentences, then into lowercase word tokens
def tokenize(text):
  sentences = sent_tokenize(text)
  tokens = list()

  for sentence in sentences:
    words = word_tokenize(sentence.lower())
    tokens.extend(words)

  return tokens


# Set of stopwords
stop_words = set(stopwords.words('english'))

# Tokenized text
tokens = tokenize(text)

#List of non-stopwords (tokens are already lowercase)
filtered = [word for word in tokens if word not in stop_words]

print(filtered)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 6. What are sentence tokenization and word tokenization?

**Sentence Tokenization**: The process of splitting a text into individual sentences. It involves identifying sentence boundaries, which can be challenging because punctuation is ambiguous (a period may end a sentence or belong to an abbreviation). It is a critical preprocessing step for tasks like parsing, machine translation, and sentiment analysis.
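
The ambiguity can be seen with a deliberately naive splitter. The sketch below, using only the standard library, splits on whitespace that follows a period. The sample string and the name `naive_sentences` are assumptions for illustration. The abbreviation "U.S." is wrongly treated as a sentence end, which is the kind of case a trained tokenizer is meant to handle.

```python
import re

# Hypothetical two-sentence sample containing an abbreviation.
sample = "In 2023, residents in the U.S. lived near parks. Cities keep growing."

# Naive rule: a sentence ends wherever a period is followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", sample)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule yields three pieces for two sentences, because it cannot tell the period in "U.S." from a real sentence boundary.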

In Python, this is done with the `sent_tokenize` function from the `nltk.tokenize` package of the NLTK library.

In [25]:
import nltk
from nltk.tokenize import sent_tokenize

#Pre-trained tokenizer models to identify sentence boundaries correctly
nltk.download("punkt")

sentences = sent_tokenize(text) # Tokenizer
print(sentences)

['Urban green spaces have become essential parts of city planning.', 'Studies show that having parks and community gardens helps improve mental and\nphysical health.', 'In 2023, over 68% of city residents in the U.S. lived within a\n10-minute walk of a public green space.', 'A recent report from the National Urban\nLandscape Association was shared via email (contact_us@green2030.org) and\nhighlighted the need for biodiversity in city environments.', 'Another contributor,\nmichael92@ecochange.net, stated that urban ecosystems should be included in\ninfrastructure planning to lessen the impact of climate change.', 'As cities keep\ngrowing, sustainable urban design will be more important.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Word Tokenization**: Process of splitting text into individual words or terms, which are the basic units of analysis. It enables text classification, sentiment analysis, and language modelling.
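
As a rough, library-free illustration of why a dedicated tokenizer is useful, the sketch below compares plain whitespace splitting with a simple regular-expression tokenizer. The sample string and the pattern are assumptions for illustration and are not how NLTK's `word_tokenize` is implemented.

```python
import re

# Hypothetical sample sentence.
sample = "Parks improve health, studies show."

# Whitespace splitting leaves punctuation attached to the neighbouring word.
whitespace_tokens = sample.split()

# Simple rule: runs of word characters, or any single non-space punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "health," and "show." keep their punctuation, so they would be counted as different terms from "health" and "show". Separating punctuation into its own tokens avoids that.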

In [28]:
import nltk
from nltk.tokenize import word_tokenize

#word_tokenize also relies on the Punkt models, since it splits the text into sentences before tokenizing words
nltk.download("punkt")

words = word_tokenize(text) # Tokenizer
print(words)

['Urban', 'green', 'spaces', 'have', 'become', 'essential', 'parts', 'of', 'city', 'planning', '.', 'Studies', 'show', 'that', 'having', 'parks', 'and', 'community', 'gardens', 'helps', 'improve', 'mental', 'and', 'physical', 'health', '.', 'In', '2023', ',', 'over', '68', '%', 'of', 'city', 'residents', 'in', 'the', 'U.S.', 'lived', 'within', 'a', '10-minute', 'walk', 'of', 'a', 'public', 'green', 'space', '.', 'A', 'recent', 'report', 'from', 'the', 'National', 'Urban', 'Landscape', 'Association', 'was', 'shared', 'via', 'email', '(', 'contact_us', '@', 'green2030.org', ')', 'and', 'highlighted', 'the', 'need', 'for', 'biodiversity', 'in', 'city', 'environments', '.', 'Another', 'contributor', ',', 'michael92', '@', 'ecochange.net', ',', 'stated', 'that', 'urban', 'ecosystems', 'should', 'be', 'included', 'in', 'infrastructure', 'planning', 'to', 'lessen', 'the', 'impact', 'of', 'climate', 'change', '.', 'As', 'cities', 'keep', 'growing', ',', 'sustainable', 'urban', 'design', 'will'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
