
## SpaCy and NLTK

| Feature | spaCy | NLTK |
|---------|-------|------|
| Programming paradigm | Object-oriented | String processing |
| Ease of use | More user-friendly, intuitive API | User-friendly but requires more manual setup |
| Algorithm selection | Automatically selects the best algorithm | Allows the user to customize and select algorithms |
| Performance | Efficient out of the box | Powerful but requires tweaking |
| Community | Very active | Older, less active |
| Target users | App developers | Researchers and advanced users |

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Doctor Strange loves power. Budget Mumbai and Hull clause chart.")
for sent in doc.sents:
    print(sent.text)  # Prints the sentence
    for token in sent:
        print(token.text)  # Prints each token in the sentence
```

Output:

```text
Doctor Strange loves power.
Doctor
Strange
loves
power
.
Budget Mumbai and Hull clause chart.
Budget
Mumbai
and
Hull
clause
chart
.
```


Here's a clear explanation of the key code snippets and concepts from the video on SpaCy tokenization:

---

### 1. **Installing SpaCy**
```bash
pip install spacy
```
- Installs the SpaCy library for NLP tasks.

---

### 2. **Creating a Blank Language Object**
```python
import spacy
nlp = spacy.blank("en")  # "en" for English
```
- Creates an **NLP pipeline object** for English.
- This object contains a **tokenizer** by default but no other components like tagger or parser.
- You can also create for other languages by changing `"en"` to `"de"` (German), `"fr"` (French), `"hi"` (Hindi), etc.

---

### 3. **Creating a Document**
```python
text = "Doctor Strange's visit in Mumbai."
doc = nlp(text)
```
- The `doc` object represents the processed text.
- SpaCy automatically tokenizes the text into tokens (words, punctuation, symbols).

---

### 4. **Iterating Over Tokens**
```python
for token in doc:
    print(token.text)
```
- Prints each token in the text.
- Tokens include words, punctuation, currency symbols, etc.
- SpaCy handles special cases (e.g., splits "Doctor Strange's" into tokens like `"Doctor"`, `"Strange"`, `"'s"`).

---

### 5. **Accessing Token Attributes**
```python
token = doc[0]
print(token.text)       # Text of the token
print(token.is_alpha)   # Is the token alphabetic?
print(token.is_digit)   # Is the token a digit?
print(token.is_currency) # Is the token a currency symbol?
```
- Tokens have many useful attributes for analysis:
  - `is_alpha`: True if token is alphabetic.
  - `is_digit`: True if token is a number.
  - `is_currency`: True if token is a currency symbol.
- These help in filtering or classifying tokens.

---

### 6. **Customizing Tokenization (Special Cases)**
```python
from spacy.symbols import ORTH
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
doc = nlp("gimme the book")
print([token.text for token in doc])
```
- You can customize how SpaCy tokenizes specific words or slang.
- Here, "gimme" is split into two tokens: `"gim"` and `"me"`.

---

### 7. **Sentence Tokenization**
```python
# A blank pipeline has no sentence splitter, so add the rule-based one by name.
nlp.add_pipe("sentencizer")

text = "Hulk and Strange both are loving their trip. Enjoying the food."
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
```
- Adds a **sentence boundary detector** to the pipeline.
- Splits paragraphs into sentences.
- Useful for sentence-level analysis.

---

### 8. **Extracting Emails from Text File**
```python
with open("student.txt") as f:
    text = " ".join(f.readlines())

doc = nlp(text)
emails = [token.text for token in doc if token.like_email]
print(emails)
```
- Reads a text file and joins lines into one string.
- Extracts tokens recognized as emails using `token.like_email`.
- Shows spaCy's ability to identify special token types.

---

### 9. **Working with Other Languages (Hindi Example)**
```python
nlp_hi = spacy.blank("hi")
text_hi = "उसने मुझसे पैसे उधार लिए हैं।"
doc_hi = nlp_hi(text_hi)

for token in doc_hi:
    print(token.text, token.is_currency, token.like_num)
```
- Creates a blank Hindi tokenizer.
- Tokenizes Hindi text.
- Checks token attributes like currency and number.

---

### Summary Table of Key Concepts

| Code Concept                  | Purpose                                    |
|------------------------------|--------------------------------------------|
| `spacy.blank("en")`           | Create blank English NLP pipeline          |
| `doc = nlp(text)`             | Process text and tokenize                   |
| `for token in doc:`           | Iterate over tokens                         |
| `token.is_alpha`              | Check if token is alphabetic                |
| `token.is_currency`           | Check if token is currency symbol           |
| `add_special_case`            | Customize tokenization rules                 |
| `add_pipe("sentencizer")`    | Add sentence segmentation to pipeline       |
| `token.like_email`            | Detect email tokens                          |

---

## Tokenization Concepts and Importance in NLP

- **Tokenization** is the process of splitting text into meaningful segments: *sentence tokenization* splits paragraphs into sentences, and *word tokenization* splits sentences into words. This is a key pre-processing step in NLP pipelines.
- Simple rules like splitting sentences by a period (`.`) are insufficient due to language complexities (e.g., the period in "Dr." does not mark the end of a sentence). Tokenization requires language-specific rules to handle such exceptions correctly.
- spaCy provides built-in tokenizers that intelligently handle prefixes, suffixes, exceptions, punctuation, and currency symbols, making tokenization more accurate than naive splitting on spaces.
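
To see why a period-based rule falls short, here is a minimal sketch in plain Python (no spaCy involved). The sample sentence is made up for illustration.

```python
# Naive sentence splitting: break the text wherever ". " occurs.
text = "Dr. Strange loves pav bhaji. Hulk loves chaat."

naive_sentences = text.split(". ")
print(naive_sentences)

# The abbreviation "Dr." is wrongly treated as a sentence end, so the text
# is cut into three pieces instead of two sentences.
print(len(naive_sentences))
```

A tokenizer with language-specific exceptions is meant to avoid exactly this kind of false boundary.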

## Using SpaCy for Tokenization

- To use spaCy, install it via `pip install spacy`, then import it and create a language object with `spacy.blank('en')` for English or another language code (e.g., `de` for German, `hi` for Hindi).
- Passing text to an `nlp` object creates a `Doc`, which is tokenized automatically; tokens are accessible via iteration or indexing, e.g., `for token in doc` or `doc[0]`.
- Tokens in spaCy are objects with many useful attributes such as `.text`, `.is_alpha`, `.is_digit`, and `.is_currency`, allowing detailed text analysis beyond simple token text extraction.
- spaCy treats symbols and punctuation as separate tokens, enabling precise token boundaries (e.g., in `$2`, `$` and `2` are distinct tokens).

## Token and Span Objects

- spaCy represents tokens as `Token` objects and slices of tokens as `Span` objects, enabling extraction of token subsequences with Python-like slice syntax (e.g., `doc[1:5]`).
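
A small sketch of the difference between indexing and slicing a `Doc`, using a blank English pipeline so no trained model is needed. The sample sentence is an assumption chosen to include a currency symbol.

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("Give me $2 for the book.")

tokens = [token.text for token in doc]
print(tokens)

# Indexing returns a single Token; slicing returns a Span over several tokens.
dollar = doc[2]
amount = doc[2:4]
print(type(dollar).__name__, dollar.text, dollar.is_currency)
print(type(amount).__name__, amount.text)
```

The `Span` keeps a reference to its parent `Doc`, so attributes of the underlying tokens remain available through it.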

## Practical Use Case: Extracting Emails

- spaCy can process large text files by joining the lines into a single string and creating a `Doc` from it. Token attributes such as `like_email` then let you extract items such as email addresses, sometimes more conveniently than with a regex.

## Tokenization in Different Languages

- spaCy supports multiple languages, and tokenization respects language-specific rules. For example, Hindi tokenization can detect currency symbols and numbers correctly, though some languages lack full pipeline support (e.g., Hindi currently has no trained pipeline).

## Customizing the Tokenizer

- spaCy's tokenizer can be customized to handle special cases or slang by adding rules that split tokens differently (e.g., splitting "gimme" into "gim" and "me") using `spacy.symbols` and `nlp.tokenizer.add_special_case`.
- Customization respects that the original text should not be altered, only the tokenization behavior changes.

## Sentence Tokenization with SpaCy Pipelines

- Blank spaCy pipelines include only tokenization. To enable sentence tokenization, a `sentencizer` component must be added manually with `nlp.add_pipe('sentencizer')`.
- Full pipelines loaded from spaCy models include multiple components (tagger, parser, named entity recognizer, etc.) for richer language understanding and better sentence splitting.
- The basic sentencizer may not fully resolve complex sentence boundary cases (e.g., splitting after "Dr."), but full models improve this accuracy.
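
The following sketch checks what a blank pipeline contains before and after the `sentencizer` is added. It assumes spaCy v3, where `Doc.has_annotation` is available; the sentence is the one used earlier in these notes.

```python
import spacy

text = "Hulk and Strange both are loving their trip. Enjoying the food."

nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)  # a blank pipeline lists no components

# Without a sentence splitter, no sentence boundaries are set on the Doc.
has_sents_before = nlp(text).has_annotation("SENT_START")

nlp.add_pipe("sentencizer")  # rule-based splitter driven by punctuation
names_after = list(nlp.pipe_names)

sentences = [sent.text for sent in nlp(text).sents]

print(names_before, has_sents_before)
print(names_after, sentences)
```

The tokenizer itself is not listed in `pipe_names`; only the components that run after it are.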

## Exercises and Practical Advice

- Exercises include extracting URLs and monetary transactions from given paragraphs using spaCy tokenization and token attributes, reinforcing practical NLP skills.
- Emphasis on hands-on practice: watching tutorials alone is insufficient; active coding and problem solving are crucial for mastering NLP and becoming a successful NLP engineer.

---

> **💡 Key Insight:** Tokenization is foundational for all NLP tasks, but requires nuanced language-specific rules beyond simple splitting. spaCy's tokenizer and pipeline components provide powerful, customizable tools for accurate token and sentence segmentation, enabling advanced text analysis and extraction tasks.

## Language Processing Pipeline in spaCy

- The **language processing pipeline** in spaCy is a sequence of components that process text after tokenization. A blank pipeline contains only the tokenizer by default and no additional components.
- Components include:
  - **Tagger:** assigns part-of-speech tags to tokens.
  - **Parser:** analyzes syntactic dependencies.
  - **NER (Named Entity Recognition):** identifies entities like persons, organizations, and monetary values in text.
  - **Lemmatizer:** finds the base form of words (lemmas).

## Creating and Using Pipelines

- Creating a blank pipeline (`spacy.blank()`) provides only tokenization without additional processing components.
- Pre-trained pipelines can be downloaded for different languages using commands like `python -m spacy download en_core_web_sm`, which installs the small English pipeline with multiple components included.
- Loading a pre-trained pipeline with `spacy.load()` automatically includes components such as tagger, parser, NER, and lemmatizer, enabling rich linguistic annotations.

## Components and Their Functions

| Component     | Function                                                                                   |
|---------------|--------------------------------------------------------------------------------------------|
| Tokenizer     | Splits text into tokens                                                                    |
| Tagger        | Assigns part-of-speech (POS) tags (e.g., noun, verb, proper noun)                          |
| Parser        | Analyzes syntactic structure (not detailed in this video)                                 |
| Lemmatizer   | Extracts lemma (base form) of words                                                        |
| Named Entity Recognition (NER) | Detects named entities such as persons, organizations, and monetary values         |

- **Part-of-Speech (POS)** tagging assigns grammatical categories to words, e.g., "double" can be a noun or verb, "person" is a noun, and "Tesla" is a proper noun.
- **Lemmatization** converts inflected forms to their base form, e.g., "said" → "say".
- **NER** identifies entities in text and classifies them into categories like organizations (ORG), monetary values (MONEY), or persons (PERSON).

## Visualizing Named Entities

- spaCy provides a module (`spacy.displacy`) to render entities visually, highlighting the different entity types in the text.

## Customizing Pipelines

- You can customize blank pipelines by adding specific components from pre-trained pipelines; for example, adding only the NER component from an English pipeline to a blank pipeline.
- This allows solving specific problems without loading the full pipeline, improving efficiency and control over processing.

## Multilingual Support

- spaCy supports multiple languages, and pre-trained pipelines can be downloaded for languages like French, Chinese, etc. Loading a pipeline that has not been downloaded raises an error, so it must be installed first.
- Tokenization is available for languages even if a full pipeline is not present, e.g., Hindi has tokenization but no full pipeline yet.
- Using language-appropriate pipelines improves accuracy of components like NER, as shown by better entity recognition when using an English pipeline for English sentences compared to mismatched languages.

## Practical Notes and Tools

- spaCy pipelines can be run locally with the required compute resources or accessed via cloud APIs such as firstlanguage.in, which simplifies NLP tasks through API calls without heavy local computation.
- Understanding each pipeline component is essential for building NLP applications; upcoming videos promise detailed explanations of POS tagging, NER, and other components.

> **💡 Key Insight:** The spaCy language processing pipeline is modular and customizable, allowing users to tailor NLP workflows for specific tasks by adding or removing components, supporting multiple languages, and leveraging pre-trained models for efficient and accurate analysis.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.pipe_names
```

```text
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
```

```python
doc = nlp("Tesla is going to acquire Twitter for 45 billion dollars.")
for ent in doc.ents:
    # ent.label is the integer ID of the label; ent.label_ is its string form
    print(ent.text, "|", ent.label, "|", spacy.explain(ent.label_))
```

```text
Tesla | 383 | Companies, agencies, institutions, etc.
Twitter | 380 | People, including fictional
45 billion dollars | 394 | Monetary values, including unit
```

Note that in this run the small model labels "Twitter" as a person rather than an organization.


## Stemming and Lemmatization in NLP Preprocessing

- **Purpose:** Both stemming and lemmatization reduce words to their base or root form, which helps in NLP tasks like text classification and search (e.g., mapping "talking," "talked," and "talk" to the base word "talk").
- **Benefit:** Mapping different word forms to a base form improves consistency and accuracy in NLP applications such as sentiment analysis by treating variations of a word as the same token.

## Stemming: Rule-Based Reduction

- **Definition:** Stemming applies a fixed set of simple heuristic rules to strip suffixes or prefixes from words to obtain a root form (e.g., removing "ing" from "talking" to get "talk") without language understanding.
- **Examples of suffix removal:**
  - "ing": talking, walking, running → talk, walk, run
  - "able": adjustable, removable → adjust, remove
- **Limitations:** Stemming can produce non-words (e.g., "ability" reduced to "abil") because it does not consider linguistic context or meaning.
- **Tools:** NLTK supports multiple stemmers like the Porter and Snowball stemmers and is often used because it is fast and simple, though less accurate than lemmatization.
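
To make the "fixed rules, no language understanding" idea concrete, here is a toy suffix stripper in plain Python. It is a hypothetical illustration only and is not the Porter or Snowball algorithm; the suffix list and the minimum stem length are assumptions.

```python
# Toy rule-based stemmer: strip the first matching suffix from a fixed list.
SUFFIXES = ("able", "ing", "ity", "ed")
MIN_STEM_LENGTH = 3  # assumed guard so short words like "sing" are left alone


def toy_stem(word):
    """Return the word with one known suffix removed, if enough letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


words = ["talking", "walked", "adjustable", "ability", "sing"]
stems = {word: toy_stem(word) for word in words}
for word, stem in stems.items():
    print(word, "|", stem)
```

Because the rules only look at the letters, "ability" still ends up as the non-word "abil", which is the limitation described above.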

## Lemmatization: Linguistically Informed Reduction

- **Definition:** Lemmatization reduces words to their lemma (base form) using linguistic knowledge and rules of the language, resulting in meaningful base forms (e.g., "ate" → "eat").
- **Advantages:** Produces valid base words that consider vocabulary and language context, avoiding nonsensical roots produced by stemming (e.g., "ability" remains "ability").
- **Implementation:** Libraries like spaCy provide lemmatization based on pre-trained language models that map words to lemmas using learned rules and vocabulary.

## Comparison of Stemming and Lemmatization

| Aspect           | Stemming                           | Lemmatization                           |
|------------------|----------------------------------|---------------------------------------|
| Approach         | Rule-based heuristic stripping   | Linguistic knowledge and vocabulary   |
| Output          | May be non-words (e.g., "abil")  | Valid dictionary base forms            |
| Complexity      | Simple, fast                     | More complex, slower                   |
| Tools           | NLTK stemmers                   | spaCy lemmatizer                      |
| Use cases       | When speed is critical, approximate results suffice | When accuracy and meaningful roots are important |

     

## Demonstrations and Practical Usage

- **NLTK Stemming Example:**  
  Creating a Porter stemmer object and applying `.stem()` to words removes suffixes based on fixed rules, e.g., "eating" → "eat," "adjustable" → "adjust" (though sometimes resulting in invalid roots).
- **spaCy Lemmatization Example:**  
  Loading an English model and iterating over tokens gives access to `token.lemma_`, which returns the lemma, e.g., "ate" → "eat," "adjustable" → "adjustable," preserving meaningful forms based on the model's vocabulary and rules.

## Customizing Lemmatization Behavior in spaCy

- **Problem:** Default language models may not recognize slang or custom words, e.g., "bro" keeps "bro" as its lemma.
- **Solution:** Customize the attribute ruler in spaCy's pipeline to assign specific lemmas to custom tokens (e.g., mapping "bro" and "brah" to the lemma "brother") by adding custom rules to the attribute ruler component.
- **Effect:** After customization, lemma lookups return the assigned base word (e.g., "bro" → "brother"), making the model more adaptable to domain-specific vocabulary.
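
A minimal sketch of such a rule, assuming spaCy v3. It adds an `attribute_ruler` to a blank pipeline so that no trained model is required; with a loaded model you would typically fetch the existing component via `nlp.get_pipe("attribute_ruler")` instead. The slang spellings and the sample sentence are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Each pattern is a list of token descriptions; both map to the same lemma.
patterns = [[{"LOWER": "bro"}], [{"LOWER": "brah"}]]
ruler.add(patterns=patterns, attrs={"LEMMA": "brother"})

doc = nlp("Bro, are you coming? Brah, do not say no!")
custom_lemmas = {
    token.text: token.lemma_ for token in doc if token.lower_ in ("bro", "brah")
}
print(custom_lemmas)
```

Only the lemma attribute changes; the token text in the document stays exactly as written.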

## Additional Notes

- **API and Cloud Services:** Platforms like firstlanguage.in offer cloud-based NLP APIs for tasks like text classification without needing local compute resources, making NLP accessible without deep knowledge or hardware.
- **Resource Recommendations:** NLTK for stemming and spaCy for lemmatization are recommended tools; spaCy's authors prefer lemmatization due to its linguistic accuracy, hence spaCy lacks stemming support.
- **Learning Tips:** Check video descriptions for exercises, corrections, and additional resources to deepen understanding and practice stemming and lemmatization.

---

> **❗ Important:** Stemming is faster and simpler but can produce invalid roots, whereas lemmatization is more accurate but requires linguistic knowledge and more computation, making both useful depending on the NLP application context.


```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["talking", "walked", "eating", "adjustable", "ability"]

for word in words:
    print(word, "|", stemmer.stem(word))
```

```text
talking | talk
walked | walk
eating | eat
adjustable | adjust
ability | abil
```

```python
# `nlp` is the en_core_web_sm pipeline loaded earlier.
doc = nlp("talking walked eating adjustable ability")
for token in doc:
    # lemma_ is the lemma string; lemma is its integer hash ID
    print(token, "|", token.lemma_, "|", token.lemma)
```

```text
talking | talk | 13939146775466599234
walked | walk | 1674876016505392235
eating | eat | 9837207709914848172
adjustable | adjustable | 6033511944150694480
ability | ability | 11565809527369121409
```


## NLP Platforms and API Usage

- NLP tasks can be performed using libraries such as spaCy, which can be run locally or via cloud platforms offering HTTP API calls, eliminating the need for high local compute resources like GPUs.
- Cloud platforms provide free tiers where users can sign up, obtain API keys, and use SDKs in Python or TypeScript for easy NLP integration without deep expertise.

## Parts of Speech Overview

- **Adjectives** add meaning to nouns; **adverbs** modify verbs, adjectives, or other adverbs, providing more detail about the action (e.g., "quickly," "slowly").
- The eight basic parts of speech are pronouns, adverbs, verbs, adjectives, nouns, interjections, conjunctions, and prepositions.
- **Interjections** express strong emotions or reactions (e.g., "alas," "wow").
- **Conjunctions** (e.g., "and," "but," "or") connect groups of words or phrases.
- **Prepositions** link nouns to other words, affecting sentence meaning based on the preposition used (e.g., "in," "on," "at").

## Detailed POS Tagging with spaCy

- spaCy's POS tags extend beyond the basic eight parts of speech to include finer categories such as numerals, articles, and determiners, reflecting subcategories in English grammar.
- The library provides explanations for POS tags (e.g., "PROPN" for proper noun, "VERB" for verb) using `spacy.explain()` for better understanding.
- Proper nouns refer to specific entities (e.g., "Elon"), while common nouns refer to general items (e.g., "person").
- spaCy categorizes tokens beyond just POS tags, including special tags like "X" for unknown or miscellaneous tokens.

## POS Tagging and Verb Tense Identification

- spaCy supports detailed tagging that identifies verb tenses, such as past tense ("VBD") or third person singular present tense ("VBZ"), accessible via `token.tag_`.
- This capability helps in NLP applications requiring tense recognition, improving text understanding and processing accuracy.

## Practical Text Processing Examples

- Real-world text (e.g., Microsoft's earnings report) can be processed with spaCy to remove unwanted tokens like punctuation, spaces, or miscellaneous characters ("X") based on their POS tags for cleaner analysis.
- Filtering tokens by POS tags allows extraction of meaningful words while discarding noise, which is critical for building effective NLP pipelines.

## Counting POS Tags in Text

- spaCy provides a `Doc.count_by` API to count tokens by attributes such as POS or tag, facilitating analysis of text composition (e.g., number of nouns, proper nouns, punctuation).
- This counting helps understand the structure of texts, useful in applications like text summarization or content analysis.
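
The filtering and counting steps can be sketched without a trained model by working on hard-coded `(token, POS)` pairs. The pairs below are assumptions standing in for `token.text` and `token.pos_` on a processed `Doc`; with spaCy itself, `doc.count_by(spacy.attrs.POS)` returns a similar tally keyed by integer IDs.

```python
from collections import Counter

# Hypothetical tagger output: (token text, coarse POS tag) pairs.
tagged = [
    ("Elon", "PROPN"), ("went", "VERB"), ("to", "ADP"), ("Mars", "PROPN"),
    ("and", "CCONJ"), ("bought", "VERB"), ("two", "NUM"), ("rockets", "NOUN"),
    (".", "PUNCT"),
]

# Drop noise tokens by POS tag, as described for punctuation, spaces, and "X".
NOISE_TAGS = {"PUNCT", "SPACE", "X"}
content_words = [text for text, pos in tagged if pos not in NOISE_TAGS]

# Tally how often each POS tag occurs.
pos_counts = Counter(pos for _, pos in tagged)

print(content_words)
print(pos_counts.most_common())
```

Working from string tags keeps the sketch readable; spaCy's integer keys can be mapped back to strings through the vocabulary.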

## Learning and Practice Recommendations

- Exercises involving POS tagging, such as extracting nouns and numbers from news stories, are recommended to reinforce learning.
- Contributions to exercise repositories on platforms like GitHub are encouraged to expand learning resources, but students should attempt solutions independently before viewing provided answers.

> **💡 Key Insight:** Understanding and utilizing detailed POS tagging, including verb tense and subcategories, enhances NLP applications by enabling nuanced text analysis beyond basic grammar rules.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon went to Mars arin was great in sports or in coding, arin papa hai ")
for token in doc:
    print(token, "|", token.pos_, "|", token.tag_, "|", spacy.explain(token.pos_))
```

```text
Elon | PROPN | NNP | proper noun
went | VERB | VBD | verb
to | ADP | IN | adposition
Mars | PROPN | NNP | proper noun
arin | NOUN | NN | noun
was | AUX | VBD | auxiliary
great | ADJ | JJ | adjective
in | ADP | IN | adposition
sports | NOUN | NNS | noun
or | CCONJ | CC | coordinating conjunction
in | ADP | IN | adposition
coding | NOUN | NN | noun
, | PUNCT | , | punctuation
arin | PROPN | NNP | proper noun
papa | PROPN | NNP | proper noun
hai | PROPN | NNP | proper noun
```


## Named Entity Recognition (NER) Overview and Use Cases
- **NER** extracts and classifies entities (e.g., persons, companies, locations, products, money) from text, distinguishing between ambiguous terms like "Tesla" the company vs. "Tesla" the person. This helps identify the precise entity type within text data.    
- **Search**: NER enables news aggregators (e.g., Google News) to tag entities in articles automatically, improving search relevance by recognizing companies, products, or persons mentioned in the text.    
- **Recommendation Systems**: By extracting entities from user-read articles (e.g., persons, locations, production houses), recommendation engines suggest related content based on user preferences for these entities, similar to personalized movie or news suggestions.   
- **Customer Care**: NER can automatically identify entities like course names from free-text customer queries, enabling routing to specialized support teams without requiring manual dropdown selections.    

## Using spaCy for NER
- spaCy's English model includes an NER component that identifies entities such as organizations, money, products, etc., from text using pre-trained statistical models and rule-based patterns.    
- Entities can be accessed via `doc.ents` and their labels printed. For example, "Tesla" may be labeled as an organization and "45 billion" as money. spaCy provides explanations for entity labels (e.g., ORG = organization).    
- Capitalization and suffixes like "Inc" affect recognition accuracy; spaCy's default NER may miss or misclassify some entity mentions (e.g., "Twitter" is not recognized unless capitalized). This reflects the limitations of out-of-the-box models.
- Different pre-trained models (e.g., Hugging Face transformers) may support a different set of entity types and vary in accuracy, highlighting dependence on training data and model architecture.    

## Customizing and Extending NER in spaCy
- **Span Class**: Used to define slices of tokens as entities. For example, `Span(doc, start, end, label)` creates a custom entity span. This allows manual tagging of entities such as "Tesla" or "Twitter" as organizations.    
- The `doc.set_ents()` method updates the document's entity annotations, enabling addition or correction of entity labels while optionally preserving existing ones.
- Custom entities improve recognition when the default model misses domain-specific or new entities.  

## Approaches to Building Your Own NER System
| Approach         | Description                                                                                  | Use Case/Notes                                      |
|------------------|----------------------------------------------------------------------------------------------|----------------------------------------------------|
| Lookup-based     | Use a database of known entities (e.g., companies, drugs) and match tokens against it.        | Simple, naive, but practical for controlled domains.|
| Rule-based       | Define linguistic or pattern-based rules (e.g., regex, POS tags) to identify entities.        | spaCy's EntityRuler supports this; good for phone numbers, dates, etc.|
| Machine Learning | Train models (e.g., Conditional Random Fields, BERT) to learn entity patterns from annotated data.| More flexible and accurate but requires annotated data and training.|

- Lookup and rule-based methods can be effective for specific, controlled vocabularies or patterns without requiring complex ML models.     
- Machine learning approaches like CRF or BERT offer better generalization but involve more complexity and resources.  
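
The first two approaches in the table can be combined in a few lines of plain Python. This is a rough sketch under stated assumptions: the gazetteer entries, the labels, and the money pattern are all hypothetical, and it does not use spaCy's `EntityRuler`.

```python
import re

# Hypothetical lookup table; a real system would load this from a database.
KNOWN_ENTITIES = {
    "tesla": "ORG",
    "twitter": "ORG",
    "mumbai": "GPE",
}

# Hypothetical rule: a dollar amount such as "$45 billion".
MONEY_PATTERN = re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?")


def find_entities(text):
    """Return (surface text, label) pairs found by lookup and by regex rule."""
    entities = []
    # Lookup-based: match each word against the known-entity table.
    for word in re.findall(r"[A-Za-z]+", text):
        label = KNOWN_ENTITIES.get(word.lower())
        if label:
            entities.append((word, label))
    # Rule-based: match monetary amounts with a pattern.
    for match in MONEY_PATTERN.finditer(text):
        entities.append((match.group(), "MONEY"))
    return entities


found = find_entities("Tesla is going to acquire Twitter for $45 billion.")
print(found)
```

The lookup cannot tell "Tesla" the company from "Tesla" the person, which is where a trained model earns its extra cost.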

## Practical Tips and Insights
> **ℹ️ Note:** Out-of-the-box NER models are not perfect; they make errors stemming from training data or heuristic rules. Customization or retraining is often needed for domain-specific accuracy.
>
> **💡 Key Insight:** Combining multiple approaches (lookup, rules, ML) can yield practical and efficient NER systems tailored to specific applications.

## Example spaCy NER Code Snippet
```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla is going to acquire Twitter.")

for ent in doc.ents:
    print(ent.text, "|", ent.label_)

# Custom entity addition
s1 = Span(doc, 0, 1, label="ORG")  # Tesla
s2 = Span(doc, 5, 6, label="ORG")  # Twitter
doc.set_ents([s1, s2], default="unmodified")

for ent in doc.ents:
    print(ent.text, "|", ent.label_)
```
- This snippet shows basic extraction and manual addition of entities using spaCy's Span and set_ents methods.    

---

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
text = "Tesla is going to acquire Twitter for 45 billion dollars"

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, "|", ent.label_)
```

```text
Tesla | ORG
Twitter | PERSON
45 billion dollars | MONEY
```

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
text = "Tesla is going to acquire Twitter"

doc = nlp(text)

s1 = Span(doc, 0, 1, label="ORG")  # 'Tesla'
s2 = Span(doc, 5, 6, label="ORG")  # 'Twitter'

doc.set_ents([s1, s2], default="unmodified")

for ent in doc.ents:
    print(ent.text, "|", ent.label_)
```

```text
Tesla | ORG
Twitter | ORG
```


## Feature Engineering in Machine Learning

- **Feature engineering** is a crucial step in the machine learning pipeline where data scientists spend considerable time extracting meaningful features from raw data to improve model performance.
- A **feature** is an individual measurable property or characteristic of a phenomenon being observed. For example, in a property price prediction problem, features include area, facilities, age of the home, and location.
- In image classification, features correspond to distinguishable parts such as eyes, nose, ears, and whiskers that help identify objects like cats or dogs.
- Neural networks identify features by assigning specific tasks to individual neurons, such as detecting cat ears or noses; if all relevant features are detected, the network confirms the presence of a cat.
- Human intuition works similarly when recognizing images, sometimes rejecting images if features like ears or nose do not match expectations, illustrating how feature detection influences classification.

## Text Representation as Feature Engineering in NLP

- Text data must be converted into numerical features because machine learning models cannot directly process raw text; this conversion is a core part of **feature engineering** in NLP, often called **text representation**.
- Unlike images or tabular data, text features are less obvious; one approach is to define handcrafted features such as whether a word represents a person, location, or other categories, assigning binary values (e.g., "Dhoni" as person = 1, location = 0).
- Words can be represented as **vectors** (sets of numbers) rather than single numbers, enabling mathematical operations like **cosine similarity** to measure semantic similarity between words.
- For example, vectors for "Dhoni" and "Cummins" (both cricketers) may have a high cosine similarity, indicating semantic closeness, while "Dhoni" and "Australia" may be less similar; this helps NLP models understand relationships between words beyond exact matches.
- This vector representation is also known as the **Vector Space Model**, which can be applied to various text units such as words, phrases, sentences, or paragraphs.
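
As a worked illustration of cosine similarity on handcrafted word vectors, consider the sketch below. The feature names and all numeric values are made up for the example; real word vectors are learned from data and have many more dimensions.

```python
import math

# Hypothetical handcrafted features: [is_person, is_location, plays_cricket]
vectors = {
    "Dhoni": [1.0, 0.0, 1.0],
    "Cummins": [1.0, 0.0, 0.9],
    "Australia": [0.0, 1.0, 0.6],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_players = cosine_similarity(vectors["Dhoni"], vectors["Cummins"])
sim_country = cosine_similarity(vectors["Dhoni"], vectors["Australia"])

print(round(sim_players, 3), round(sim_country, 3))
```

The two player vectors point in nearly the same direction, so their similarity is close to 1, while the player and country vectors share only one feature and score much lower.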

## Importance and Approaches to Text Representation

- Effective text representation is fundamental to NLP success; better feature extraction from text often leads to better model results than using sophisticated algorithms on poor representations.
- Common approaches to text representation include **one-hot encoding**, **bag of words**, and **TF-IDF**; one-hot encoding is less popular due to sparsity issues, while bag of words is widely used in classic NLP tasks like spam detection.
- Feature engineering in NLP is about transforming raw text into meaningful numerical vectors that capture semantic and syntactic properties, enabling machine learning algorithms to perform tasks like classification, sentiment analysis, and more.
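
Of the approaches listed above, bag of words is simple enough to sketch in plain Python. The three "emails" are invented for illustration, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

# Tiny made-up corpus; a real spam filter would train on many more emails.
emails = [
    "win a free prize now",
    "meeting agenda for monday",
    "free free prize inside",
]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for email in emails for word in email.split()})


def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


bow_vectors = [bag_of_words(email) for email in emails]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Every text maps to a vector of the same length (the vocabulary size), regardless of how many words it contains, and word order is discarded.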

> **💡 Key Insight:** "Feeding a good representation to an ordinary algorithm will get you much farther than applying a top-notch algorithm to an ordinary text representation." This highlights the critical role of feature engineering over model complexity in NLP tasks.

## Introduction to Spam Detection and Text Classification

- Spam detection is a classical text classification problem within the domain of Natural Language Processing (NLP). Gmail uses machine learning to accurately classify emails as spam or ham (non-spam) by analyzing keywords and urgency cues in the text.   
- Machine learning models only understand numerical data, so text must be converted into numerical vectors, a process known as text representation or feature engineering. This vectorization enables models like Naive Bayes classifiers to distinguish spam from ham.   

## Basic Text Representation Approaches: Label Encoding and One Hot Encoding

- **Label Encoding:** Create a vocabulary from all unique words in the training emails, assign each word a unique numeric index, and represent a text by a list (vector) of these indices. This is a primitive way to convert text to numbers but is simple and intuitive.     
- **One Hot Encoding:** Using the vocabulary, represent each word as a vector where the position corresponding to the word's index is set to 1 and all other positions are 0. This creates sparse vectors indicating presence or absence of words.
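
The two encodings can be sketched in a few lines. This is a minimal illustration on a made-up three-email corpus; the helper names `label_encode` and `one_hot_encode` are ours, and whitespace splitting stands in for a real tokenizer.

```python
import numpy as np

emails = ["free prize waiting", "call me tomorrow", "free call"]  # toy corpus (made up)

# Build the vocabulary: one index per unique word, in sorted order.
vocabulary = sorted({word for email in emails for word in email.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def label_encode(text):
    """Represent a text as the list of its word indices."""
    return [word_to_index[word] for word in text.split()]

def one_hot_encode(text):
    """Represent a text as one row per word, with a single 1 in each row."""
    indices = label_encode(text)
    matrix = np.zeros((len(indices), len(vocabulary)), dtype=int)
    matrix[np.arange(len(indices)), indices] = 1
    return matrix

print(vocabulary)
print(label_encode("free call"))
print(one_hot_encode("free call"))
```

A word that is not in `word_to_index` raises a `KeyError` here, which is the out-of-vocabulary problem in its simplest form.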

## Disadvantages of Label Encoding and One Hot Encoding

| Disadvantage | Explanation |
|--------------|-------------|
| **1. Lack of semantic similarity** | One hot vectors treat all words as independent; similar words like "help" and "assistance" have completely different vectors, failing to capture meaning or similarity.    |
| **2. High memory consumption** | Large vocabularies (e.g., 100,000 words) create very large vectors for each word, leading to excessive memory usage, especially for long texts with many words.    |
| **3. Out-of-vocabulary (OOV) problem** | Words not in the vocabulary (e.g., new or rare words like "bahubali") cannot be accurately represented, often lumped into a generic "unknown" token, losing distinctiveness.    |
| **4. Variable input size for models** | Different texts have different lengths, so flattening one hot vectors leads to inconsistent input sizes, which is problematic for fixed-size input requirements in machine learning models like neural networks.    |

- Label encoding avoids the large vector size issue but still does not capture semantic relationships or solve the OOV problem. Both approaches are considered "dumb" or primitive for text representation in modern NLP.    

## Summary and Next Steps

- Label encoding and one hot encoding are foundational but outdated methods for text representation due to their inability to encode meaning, inefficiency, and inflexibility.  
- Modern NLP typically moves beyond these encodings: TF-IDF weights terms by how informative they are, and word embeddings additionally capture meaning.
- The next topic to explore is the "Bag of Words" model, which will be accompanied by coding exercises to deepen practical understanding. 

## Bag of Words (BOW) Technique in NLP

- Bag of Words is a text representation technique where a vocabulary of unique words is created from a corpus, and each document is represented as a vector of word counts based on that vocabulary. This vector is often called a **count vector**; scikit-learn's `CountVectorizer` is the class that produces it.
- Example: For news articles about companies like Tesla and Apple, key terms such as "Elon Musk," "Model 3," and "iPhone" help identify the company. Counting occurrences of these words in articles allows classification of the document by company    .
- Vocabulary is built by collecting all unique words from documents (after stemming or lemmatization), which can be very large (e.g., 10,000+ words), leading to high-dimensional vectors   .

## Processing and Limitations of Bag of Words

- Each document is transformed into a vector where each element is the frequency of a vocabulary word in that document. This results in a **sparse vector** because most words do not appear in every document, leading to many zero values   .
- BOW does not capture word meanings or semantic similarity. For example, "help" and "assistance" are treated as different words despite similar meanings, resulting in different vector representations  .
- BOW produces one vector per document rather than one vector per word, so it is generally more compact than one-hot encoding. The vectors are still high-dimensional and mostly zeros, so storing them densely wastes memory and compute.
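
To make the counting step concrete, here is a minimal hand-rolled sketch of Bag of Words. The two short documents are invented, and the function name `bag_of_words` is ours; it is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

documents = [
    "musk unveils new tesla roadster",
    "apple unveils new iphone",
]  # toy corpus (made up)

# Vocabulary: every unique word in the corpus, in a fixed (sorted) order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    # Words outside the vocabulary are simply never looked up, so they are ignored.
    return [counts[word] for word in vocabulary]

print(vocabulary)
print(bag_of_words("apple unveils new iphone"))
print(bag_of_words("new iphone new price"))
```

The last call shows both properties discussed above: repeated words raise the count, and the unseen word "price" leaves no trace in the vector.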

## Practical Application: Spam Email Classification

- The tutorial uses BOW with a dataset of over 5,000 emails labeled as spam or non-spam (ham). The dataset is imbalanced with more non-spam emails   .
- Labels are converted to numeric form (spam = 1, ham = 0) using pandas `apply` method with a lambda function or a custom function for transformation    .
- Data is split into training (80%) and test (20%) sets using scikit-learn's `train_test_split` to avoid biased model evaluation    .

## Using CountVectorizer in Scikit-learn

- `CountVectorizer` converts text documents into count vectors representing word frequencies. It builds the vocabulary and transforms emails into sparse matrices    .
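- The sparse matrix can be converted to a dense NumPy array for inspection. The vocabulary size in the video example is 7,675 unique words; the exact number depends on the random train/test split (the notebook run below yields 7,753).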
- Vocabulary words can be accessed and indexed to understand which word corresponds to which position in the vector, aiding interpretation of the feature vectors    .
- Non-zero elements in vectors correspond to words present in the email; their indices map back to vocabulary words, showing how the vector represents the text content      .

## Building and Evaluating a Naive Bayes Classifier

- A multinomial Naive Bayes classifier is used because it suits discrete count data like word frequencies. The model is trained by calling the `fit` method on training vectors and labels    .
- The test emails are transformed using the same `CountVectorizer` and predictions are made. Performance is evaluated using `classification_report` from scikit-learn, which provides precision, recall, and F1-score, important metrics especially for imbalanced datasets    .
- Example: Emails with phrases like "55 million dollars" and "exclusive offer" are typically predicted as spam, demonstrating the model's practical utility   .

## Simplifying Workflow with Scikit-learn Pipeline

- Scikit-learn's `Pipeline` allows chaining of preprocessing (`CountVectorizer`) and classification (`MultinomialNB`) into a single object, simplifying code and avoiding manual vector transformations during training and prediction    .
- Pipeline usage improves code readability and reduces errors by automating the sequence of transformations and model fitting/prediction steps    .
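
A minimal sketch of such a pipeline is shown below. The four training sentences and their labels are invented for illustration; the step names `"vectorizer"` and `"nb"` are arbitrary labels chosen here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set: 1 = spam, 0 = ham.
train_texts = [
    "win a free prize now",
    "exclusive offer claim your free cash",
    "are we still meeting for lunch",
    "please send me the report tomorrow",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and feeds the counts to the classifier.
clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)

# Raw strings go in directly; no manual transform() call is needed.
predictions = clf.predict(["free prize offer", "see you at lunch tomorrow"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, it is fitted only on the training texts and reused unchanged at prediction time.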

---

> **ℹ️ Note:** Bag of Words is simple and interpretable but has limitations in semantic understanding and high dimensionality. Despite this, it can achieve good accuracy on classical tasks like spam detection when combined with suitable classifiers like Naive Bayes.

In [4]:
import pandas as pd
import numpy as np

data=pd.read_csv("C:/Users/arink/Downloads/spam.csv")
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
data.Category.value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

In [6]:
data["spam"]=data["Category"].apply(lambda x: 1 if x=="spam" else 0)
data.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.Message, data.spam, test_size=0.2)

In [17]:
X_train[:4]

5517    Miles and smiles r made frm same letters but d...
3120                             Stop knowing me so well!
1683    HI BABE U R MOST LIKELY TO BE IN BED BUT IM SO...
3261    I'm always looking for an excuse to be in the ...
Name: Message, dtype: object

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_cv=v.fit_transform(X_train.values)
X_train_cv

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 59312 stored elements and shape (4457, 7753)>

In [22]:
X_train_cv.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [23]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [28]:
from sklearn.metrics import classification_report

X_test_cv = v.transform(X_test)
y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       944
           1       0.99      0.89      0.94       171

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



In [29]:
import spacy

# Load English model
nlp = spacy.load('en_core_web_sm')

def remove_stopwords(text):
    doc = nlp(text)
    filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return ' '.join(filtered_tokens)

# Example usage
chat_text = "Can you help me find a yoga mat on your website?"
clean_text = remove_stopwords(chat_text)
print(clean_text)

help find yoga mat website


## Bag of n-grams and Vocabulary Challenges
- Traditional bag of words models do not solve the out-of-vocabulary (OOV) problem: words that appear at prediction time but were never seen during training have no position in the vocabulary and are silently dropped. Large vocabularies also keep memory usage high.
- The `CountVectorizer` class from sklearn supports an `ngram_range` parameter, allowing generation of n-grams beyond unigrams (single words). Setting `ngram_range=(2,2)` creates only bigrams; `(1,3)` includes unigrams, bigrams, and trigrams in the vocabulary.
- N-grams capture sequences of tokens, e.g., bigrams are pairs of consecutive words, which can provide richer contextual information than single tokens  .

## Text Preprocessing for Vectorization
- Preprocessing includes removing stop words and lemmatization (reducing words to their base form), implemented using the spaCy library. Tokens identified as stop words or punctuation are filtered out, and lemmas are joined back into a string for vectorization     .
- Example: "Loki is eating pizza" preprocesses to "Loki eat pizza," removing "is" (stop word) and converting "eating" to its lemma "eat"  .
- Preprocessing improves the quality of input text before applying n-gram vectorization, making vectors more representative of the core content  .

## Vectorization and Representation
- After preprocessing, applying `CountVectorizer` with `ngram_range=(1,2)` generates a vocabulary including unigrams and bigrams, with each token or n-gram assigned an index for vector representation.
- Text is converted to numeric vectors using the vector space model, essential because machine learning models require numeric input rather than raw text  .
- Example: A sentence like "Thor eat pizza" is transformed into a sparse vector indicating presence (1) or absence (0) of vocabulary tokens or n-grams     .
- OOV words (e.g., "Hulk" not in training vocabulary) cannot be represented, illustrating the OOV problem in practice  .
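
The OOV behaviour can be observed directly. The sketch below fits a unigram-plus-bigram vectorizer on two made-up, already preprocessed sentences and then transforms a sentence containing an unseen word.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_corpus = ["thor eat pizza", "loki eat pizza"]  # toy preprocessed texts (made up)

v = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
v.fit(train_corpus)
print(sorted(v.vocabulary_))

# "hulk" was never seen during fit, so it has no column in the vector.
new_vector = v.transform(["hulk eat pizza"]).toarray()[0]
present = sorted(term for term, idx in v.vocabulary_.items() if new_vector[idx] > 0)
print(present)
```

Only "eat", "pizza", and the bigram "eat pizza" survive; both "hulk" and the bigram "hulk eat" are dropped without any warning.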

## News Category Classification Dataset and Handling Imbalance
- The dataset used contains news articles labeled into six categories (e.g., business, sports, science), loaded into a pandas DataFrame from JSON format   .
- Exploratory analysis shows class imbalance; some categories (like science) have fewer samples than others (business, sports)  .
- To address imbalance, under-sampling is applied by randomly selecting the minimum number of samples present among classes (138 in this case) to create a balanced dataset, acknowledging that discarding data is generally not ideal in real scenarios    .
- Balanced subsets for each category are concatenated row-wise using `pd.concat` to form a balanced DataFrame for training    .
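
The under-sampling step might look roughly like this. The small DataFrame is a made-up stand-in for the news data, and the column names `text` and `category` are assumptions rather than the dataset's actual schema.

```python
import pandas as pd

# Made-up stand-in for the news DataFrame (column names are assumptions).
df = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "category": ["BUSINESS"] * 5 + ["SPORTS"] * 3 + ["SCIENCE"] * 2,
})

# Size of the smallest class decides how many rows every class keeps.
min_samples = int(df["category"].value_counts().min())

# Randomly keep min_samples rows per category, then stack the pieces row-wise.
balanced_parts = [
    df[df["category"] == category].sample(n=min_samples, random_state=42)
    for category in df["category"].unique()
]
df_balanced = pd.concat(balanced_parts, axis=0)

print(df_balanced["category"].value_counts())
```

Fixing `random_state` makes the random selection repeatable between runs.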

## Preparing Data for Machine Learning
- Categories (strings) are mapped to numeric labels using pandas `.map()` to allow model training, as models expect numeric targets    .
- The balanced dataset is split into training and testing sets with an 80-20 ratio, using a `random_state` for reproducibility and `stratify` to maintain class distribution in both sets  .
- Stratification keeps the class proportions the same in the train and test sets, so no category ends up under-represented in either split.

## Model Training and Evaluation
- A pipeline is created combining the vectorizer (bag of words or n-grams) and a classifier (Multinomial Naive Bayes recommended for text classification, though other classifiers like KNN, Random Forest, Decision Trees can be compared)   .
- The model is trained on training data and predictions are made on the test set; performance is evaluated with a classification report showing metrics like precision, recall, and F1-score   .
- Experimentation shows bag of words (unigrams) often performs better than bigrams or trigrams on this dataset, but results may vary based on problem specifics and require trial and error   .
- Example predictions demonstrate correct classification of news categories, showing the model's practical application   .

## Effect of Preprocessing on Model Performance
- Training the same model with preprocessed text (stop words removed, lemmatized) improves classification metrics compared to using raw text, as shown by higher F1-scores across most classes      .
- Preprocessing is generally recommended for NLP tasks, though some cases may not benefit; it depends on the problem context  .

## Practice and Further Learning
- Exercises and additional resources (like confusion matrix plotting code) are provided in the video description and notebook to encourage hands-on practice, emphasizing that active coding is essential for mastering NLP and machine learning   .

---

> **ℹ️ Note:** The bag of n-grams approach extends bag of words by capturing contiguous sequences of tokens, improving contextual representation but increasing vocabulary size and computational cost. Preprocessing (stop word removal, lemmatization) enhances vector quality and model performance, especially in text classification tasks with imbalanced data handled via sampling techniques.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

v=CountVectorizer(ngram_range=(1,3))
v.fit(["Thor ate pizza or eat samosa"])
v.vocabulary_



{'thor': 12,
 'ate': 0,
 'pizza': 8,
 'or': 5,
 'eat': 3,
 'samosa': 11,
 'thor ate': 13,
 'ate pizza': 1,
 'pizza or': 9,
 'or eat': 6,
 'eat samosa': 4,
 'thor ate pizza': 14,
 'ate pizza or': 2,
 'pizza or eat': 10,
 'or eat samosa': 7}

In [None]:
corpus = [
    "Loki is eating pizza",
    "Thor ate pizza",
    "Hulk likes pizza"
]

In [5]:
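import spacy

# Load the small English pipeline (tokenizer, tagger, lemmatizer)
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Drop stop words and punctuation; return the remaining lemmas as one string."""
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct:
            filtered_tokens.append(token.lemma_)
    return " ".join(filtered_tokens)

# The misspellings in this sample sentence are kept as in the original run
preprocess("arin was a greet person and he doing greate in coding")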

'arin greet person greate code'

## TF-IDF Representation in NLP

- **Background:** The video builds on the Bag of Words (BoW) and Bag of n-grams models used for text classification, focusing on classifying news articles by company mentions such as Tesla or Apple   .
- **Issue with BoW:** Generic terms like "price," "market," "investor" appear frequently across documents and suppress meaningful, relevant terms (e.g., "iphone," "musk"), misleading the model into thinking unrelated articles are similar due to these common words    .
- **Stop Words Limitation:** Removing stop words helps but does not fully solve the problem because many generic but domain-relevant terms still affect the model's accuracy   .

## Document Frequency and Inverse Document Frequency (IDF)

- **Document Frequency (DF):** Measures in how many documents a term appears, not how many times per document. For example, "gigafactory" appears in only one document, while "price" appears in three out of four documents  .
- **Inverse Document Frequency (IDF):** To reduce the influence of common terms, IDF is calculated as the logarithm of the ratio of total documents to the number of documents containing the term:
  
  $$
  \text{IDF}(t) = \log \frac{N}{n_t}
  $$
  
  where \(N\) is total documents and \(n_t\) is documents containing term \(t\). Terms appearing in fewer documents get higher IDF scores, emphasizing their importance     .
- **Use of Logarithm:** The raw ratio \(N / n_t\) grows very large for rare terms in a big corpus. Taking the logarithm dampens this growth, so rare terms do not completely dominate the score; the log curve flattens as the ratio increases.

## Term Frequency (TF) and Combining with IDF

- **Term Frequency (TF):** Normalizes word count by the total number of tokens in a document to account for document length differences:
  
  $$
  \text{TF}(t,d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}
  $$
  
  This prevents bias toward longer documents   .
- **TF-IDF Score:** Calculated as the product of TF and IDF, balancing term importance within a document and across the corpus. Relevant terms have higher TF-IDF scores, while common terms get lower scores   .
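
The two formulas above can be worked through by hand. The sketch below uses a made-up, pre-tokenized four-document corpus in which "price" occurs in three documents and "gigafactory" in one; the natural logarithm is an assumption, since the formula does not fix the base.

```python
import math

# Toy tokenized corpus (made up): "price" is in 3 of 4 documents, "gigafactory" in 1.
documents = [
    ["tesla", "gigafactory", "price"],
    ["apple", "iphone", "price"],
    ["apple", "market", "price"],
    ["tesla", "market", "investor"],
]

def term_frequency(term, doc):
    """Count of the term divided by the number of tokens in the document."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log(N / n_t); assumes the term occurs in at least one document."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_t)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

first_doc = documents[0]
for term in ["gigafactory", "price"]:
    print(term, round(tf_idf(term, first_doc, documents), 4))
```

Both terms occur once in the first document, so their TF is identical; the rarer "gigafactory" ends up with the higher score purely because of its IDF.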

## Limitations of TF-IDF

- **Sparsity:** As vocabulary size grows, TF-IDF vectors become sparse, which can affect model performance  .
- **Lack of Semantic Relationship:** TF-IDF does not capture relationships between words; unlike word or sentence embeddings, it only counts term frequency and document distribution  .
- **Out-of-Vocabulary Problem:** Words not in the training vocabulary cannot be represented, limiting model generalization  .

## Practical TF-IDF Implementation (Using scikit-learn)

- **TF-IDF Vectorizer:** Create an instance, then fit and transform a corpus (collection of documents) to generate TF-IDF vectors   .
- **Vocabulary and IDF Scores:** Vocabulary maps terms to indices; IDF scores can be accessed via the vectorizer object to inspect term importance (e.g., common terms like "is" have low IDF, rare terms like "apple" have high IDF)      .
- **TF-IDF Vectors:** Transformed output is a sparse matrix; converting it to an array allows inspection of TF-IDF scores per document and term      .
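
The IDF values that scikit-learn reports differ from the plain \(\log(N / n_t)\) formula. With its default `smooth_idf=True`, `TfidfVectorizer` computes \(\ln\frac{1 + N}{1 + n_t} + 1\). The short sketch below recomputes two values for a seven-document corpus; the function name `sklearn_smooth_idf` is ours.

```python
import math

def sklearn_smooth_idf(n_documents, document_frequency):
    """IDF as computed by TfidfVectorizer with its default smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

# A term found in only 1 of 7 documents gets a high IDF ...
print(sklearn_smooth_idf(7, 1))
# ... while a term found in 6 of 7 documents gets a value close to 1.
print(sklearn_smooth_idf(7, 6))
```

The added 1 means that even a term present in every document keeps a non-zero weight instead of being zeroed out.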

## Application: E-Commerce Text Classification

- **Dataset:** Uses Amazon item descriptions labeled into four categories (e.g., electronics, household, books, clothing) with balanced classes (6000 items each)     .
- **Label Encoding:** Converts categorical labels to numerical form for machine learning compatibility using pandas mapping functions   .
- **Train-Test Split:** Splits data into 80% training and 20% testing sets using stratified sampling to preserve class balance   .

## Model Training and Evaluation

- **Classifiers Used:** K-Nearest Neighbors (KNN), Random Forest, and Multinomial Naive Bayes classifiers are trained to compare performance   .
- **Pipeline:** Utilizes sklearn pipelines to chain TF-IDF vectorization with classifiers for streamlined training and prediction   .
- **Performance Metrics:** Classification reports show precision, recall, and F1 scores around 95-98%, indicating strong model performance across categories    .
- **Prediction Inspection:** Individual predictions align well with true labels, showing the model's effectiveness in classifying e-commerce product descriptions    .
- **Classifier Comparison:** Random Forest often yields the best performance, but choice depends on data and problem context; Naive Bayes is a common starting point for text classification    .

## Text Preprocessing Impact

- **Preprocessing Steps:** Removal of stop words, punctuation, and lemmatization to normalize text improves model accuracy   .
- **Implementation:** Applies preprocessing function to text column using pandas `.apply()` method, creating a new preprocessed text column   .
- **Retraining:** Models trained on preprocessed text show slightly better F1 scores (up to 99%), confirming the value of preprocessing in text classification tasks   .
- **General Guideline:** Preprocessing is recommended but effectiveness may vary depending on dataset and task  .

## Learning and Practice Advice

> **üí° Key Insight:** Mastery of machine learning and NLP requires consistent practice beyond watching tutorials; coding along and completing exercises is essential for skill development.
  

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [7]:
v = TfidfVectorizer()
v.fit(corpus)
transform_output = v.transform(corpus)

In [8]:
print(v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [9]:
all_feature_names = v.get_feature_names_out()

for word in all_feature_names:
    
    #let's get the index in the vocabulary
    indx = v.vocabulary_.get(word)
    
    #get the score
    idf_score = v.idf_[indx]
    
    print(f"{word} : {idf_score}")

already : 2.386294361119891
am : 2.386294361119891
amazon : 2.386294361119891
and : 2.386294361119891
announcing : 1.2876820724517808
apple : 2.386294361119891
are : 2.386294361119891
ate : 2.386294361119891
biryani : 2.386294361119891
dot : 2.386294361119891
eating : 1.9808292530117262
eco : 2.386294361119891
google : 2.386294361119891
grapes : 2.386294361119891
iphone : 2.386294361119891
ironman : 2.386294361119891
is : 1.1335313926245225
loki : 2.386294361119891
microsoft : 2.386294361119891
model : 2.386294361119891
new : 1.2876820724517808
pixel : 2.386294361119891
pizza : 2.386294361119891
surface : 2.386294361119891
tesla : 2.386294361119891
thor : 2.386294361119891
tomorrow : 1.2876820724517808
you : 2.386294361119891


## Shortcomings of Traditional Text Vectorization Techniques
- Traditional text vectorization methods often produce high-dimensional sparse vectors with many zeros and fail to capture the semantic meaning of words properly. For example, two similar sentences might have very different vectors under these methods, limiting their usefulness in understanding language meaning.  

## Word Embeddings: Concept and Advantages
- **Word embeddings** are dense, lower-dimensional vector representations of words (commonly 50, 100, or 300 dimensions) where similar words have similar vectors, effectively capturing semantic relationships. For instance, the vectors for "good" and "great" are close but not identical, reflecting their similarity.   
- These dense vectors contrast with sparse vectors from older techniques, which had many zero values, improving efficiency and expressiveness.  

## Popular Word Embedding Techniques
- Common word embedding methods include **Word2Vec**, **GloVe**, and **FastText**, which rely on approaches like Continuous Bag of Words (CBOW) and Skip-Gram models to learn word representations from large corpora.   
- More recent transformer-based embeddings, such as **BERT** and **GPT**, represent advanced NLP models that capture context more effectively and are used in applications like Google Search.  

## Additional Embedding Models and Their Foundations
- Models like **ELMo** use LSTM-based architectures to generate embeddings, showing the diversity of techniques for capturing word meanings.  
- These models convert words or sentences into vectors that not only represent meaning but also allow for arithmetic operations on vectors, revealing semantic relationships (e.g., "king" - "man" + "woman" ≈ "queen"). This arithmetic property is a powerful demonstration of embedding capabilities.

## Training Variations Based on Data Corpora
- Word embedding models can be trained on different datasets (corpora), resulting in variations tailored to specific domains or language styles. For example, Word2Vec trained on Google News differs from one trained on Twitter data, the latter better capturing slang and informal language.  
- Similarly, transformer-based models like BERT have domain-specific versions such as **BioBERT** for biomedical text and **FinBERT** for financial data, illustrating how training data influences embeddings.  

## Overview of Jargon and Further Learning Recommendations
- Terms such as Word2Vec, GloVe, FastText, BERT, BioBERT, FinBERT, ALBERT, RoBERTa, and Virtuac name various embedding techniques or model variants, often reflecting differences in architecture or training data.
- For a deeper understanding, especially of Word2Vec, dedicated tutorials are recommended as these foundational techniques underpin many NLP applications.   

## Summary of Purpose and Scope of Word Embeddings
- The primary goal of word embedding techniques is to convert text (whether a single word, sentence, paragraph, or entire document) into numerical vectors that machines can process. This enables machine learning models to understand and work with textual data effectively.

## Accessing Word Vectors in SpaCy

- To use word vectors in SpaCy, you must load either the **medium** (`md`) or **large** (`lg`) English model, as the **small** model has no vectors (zero keys, zero unique vectors)   .
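- The medium model maps about 514k keys to roughly 20k unique vectors of 300 dimensions; the large model keeps a separate 300-dimensional vector for far more words (about 514k unique vectors in recent releases). Exact counts vary by SpaCy version and are listed on each model card.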
- Install the large model via command line with `python -m spacy download en_core_web_lg`; expect a large download size and some time for installation  .

## Working with Tokens and Vectors

- You can iterate over tokens in a document and check properties like `token.has_vector` (boolean if the token has a vector) and `token.is_oov` (out-of-vocabulary flag)  .
- Common words like "dog", "cat", and "banana" have vectors, while random or regional words (e.g., "kem's") do not, because they were not seen in the training corpus   .
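- The video describes SpaCy's vectors as **GloVe embeddings** trained on large English datasets (e.g., Wikipedia, news articles) to capture general English language knowledge; the exact source of the vectors differs between model releases and is documented on each model card.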

## Vector Representation and Dimensions

- Each word vector is a 300-dimensional numpy array accessible by `token.vector` with shape `(300,)`   .
- Sentence vectors in SpaCy are computed as the average of the individual word vectors in the sentence, so a single-word sentence vector equals that word's vector.

## Measuring Similarity Between Words

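- Similarity between two tokens can be computed using `token1.similarity(token2)`, which returns the cosine similarity of their vectors: 1 means the vectors point in the same direction, values near 0 mean little relation, and slightly negative values are possible.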
- Words appearing in similar contexts tend to have higher similarity scores. For example, "bread" and "sandwich" have a higher similarity than "bread" and "car" because they co-occur in related contexts   .
- Similarity reflects **contextual closeness**, not semantic equivalence. Antonyms like "profit" and "loss" may have high similarity because they appear in similar contexts despite opposite meanings   .

## Practical Function to Print Similarities

- A reusable function can be created to compare a base word against a list of words, printing their similarity scores, facilitating quick exploration of vector similarities in SpaCy  .
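
One possible shape for such a helper is sketched below. It is not the code from the video: the function name and signature are our own, and `nlp` is assumed to be a loaded SpaCy pipeline that ships word vectors (for example `en_core_web_lg`), passed in so the function does not depend on which model is installed.

```python
def print_similarity(nlp, base_word, words_to_compare):
    """Print and return the similarity of each word to base_word.

    nlp              -- a loaded spaCy pipeline with word vectors (assumption)
    base_word        -- the reference word, e.g. "bread"
    words_to_compare -- space-separated words, e.g. "sandwich burger car"
    """
    base_doc = nlp(base_word)
    scores = []
    for token in nlp(words_to_compare):
        score = token.similarity(base_doc)
        print(f"{token.text} <-> {base_doc.text}: {score:.2f}")
        scores.append((token.text, score))
    return scores

# Usage, assuming the large English model is installed:
# import spacy
# nlp = spacy.load("en_core_web_lg")
# print_similarity(nlp, "bread", "sandwich burger car")
```

Returning the scores as well as printing them makes it easy to sort or plot the results afterwards.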

## Examples of Similarity Observations

| Word Pair          | Similarity Score | Explanation                                   |
|--------------------|------------------|-----------------------------------------------|
| bread - sandwich   | ~0.6             | High contextual similarity                    |
| bread - burger     | ~0.4             | Moderate similarity                           |
| bread - car        | ~0.06            | Very low similarity, unrelated contexts      |
| iPhone - Samsung   | ~0.67            | Often compared together in news/Wikipedia    |
| iPhone - apple     | ~0.43            | Less similarity due to context in training data |
| dog - iPhone       | ~0.08            | Minimal similarity, unrelated concepts       |

- Similarity scores depend heavily on the training corpus and context frequency, not just conceptual relationships    .

## Out-of-Vocabulary (OOV) Words

- Words not present in the training corpus (OOV) have no vector and `token.has_vector` is `False`. For example, regional language words like Gujarati terms will be OOV in the English SpaCy model   .

## Vector Arithmetic and Analogies

- Vector arithmetic can capture semantic relationships, e.g., `king - man + woman ≈ queen`.
- Cosine similarity is used to measure how close the resulting vector is to the target vector (e.g., queen) with values above 0.5 considered reasonably similar   .
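
The analogy arithmetic can be illustrated without downloading a model. The 3-dimensional vectors below are invented so that the relationship holds by construction; real 300-dimensional embeddings only approximate it.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (made up): [royalty, masculinity, femininity]
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

# king - man + woman should land close to queen.
result = vectors["king"] - vectors["man"] + vectors["woman"]

similarities = {word: cosine_similarity(result, vec) for word, vec in vectors.items()}
for word, score in similarities.items():
    print(word, round(score, 3))

closest = max(similarities, key=similarities.get)
print("closest:", closest)
```

With a real SpaCy model, the same arithmetic can be applied to vectors looked up via `nlp.vocab["king"].vector`, assuming a model with vectors is loaded.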

> **❗ Important:** Word vector similarity measures contextual similarity based on co-occurrence patterns in training data, not dictionary definitions or antonymy. Interpret results accordingly.

## Summary of Key Concepts

| Concept                       | Description                                                                                       |
|------------------------------|---------------------------------------------------------------------------------------------------|
| Word vectors                 | Dense 300-dimensional vectors representing words, capturing semantic and contextual info          |
| Medium vs Large SpaCy models | Medium: ~514k keys mapped to ~20k unique vectors; Large: far more unique vectors; both 300-dimensional |
| Token properties             | `has_vector` (bool), `is_oov` (bool)                                                             |
| Similarity measure           | `token1.similarity(token2)` returns a context-based cosine similarity (usually 0 to 1; negative values are possible) |
| Sentence vector              | Average of constituent word vectors                                                               |
| OOV words                   | Tokens without vectors due to absence in training corpus                                          |
| Vector arithmetic            | Enables semantic analogies, e.g., king - man + woman ≈ queen                                      |

## Recommended Further Topics

- Explore cosine similarity in detail to understand vector comparison metrics better  .
- Future tutorials will cover word vectors in Gensim and practical text classification applications using embeddings  .

## Label Encoding for Classification
- Convert categorical labels (e.g., "fake" or "real") into numerical format using a mapping function: fake → 0, real → 1, creating a new label number column for model training.

## Using SpaCy Word Vectors
- Load a large SpaCy model that includes pre-trained 300-dimensional word vectors to convert text data into dense vector representations.   
- Convert each news article text into a SpaCy document object, then extract its `.vector` attribute to get the 300-element word vector.  
- Apply this vectorization to every row in the pandas DataFrame using `apply()` with a lambda function, creating a new column storing vectors for each text.  
- This vectorization process is computationally expensive and may take several minutes to complete for large datasets.  

## Preparing Data for Model Training
- Use `train_test_split` to divide data into training and testing sets, specifying features (X as vectors) and labels (Y as label numbers), with a typical test size of 20%. Set a random state for reproducibility.  
- Extract numpy arrays of vectors with `.values` but note that the result is an array of arrays, which is not suitable for scikit-learn classifiers expecting a 2D array.  
- Convert the nested array structure into a proper 2D numpy array using `numpy.stack()`, ensuring compatibility with classifiers.   

## Model Training and Challenges
- Import and train a multinomial Naive Bayes classifier, commonly used for NLP tasks due to its effectiveness with text data.  
- Multinomial Naive Bayes requires non-negative feature values; SpaCy vectors contain negative values causing errors.  
- Apply Min-Max scaling to transform vectors into a positive range using `MinMaxScaler` from scikit-learn. Use `fit_transform` on training data and `transform` on test data to maintain consistency.   
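
To make the "fit on train, transform on test" rule concrete, here is a small pure-Python sketch of Min-Max scaling. The helper names `fit_min_max` and `transform_min_max` and the toy rows are hypothetical; in the notebook this job is done by scikit-learn's `MinMaxScaler`.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def transform_min_max(rows, params):
    """Rescale rows with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, params)
        ])
    return scaled

train = [[-0.2, 0.1], [0.0, 0.3], [0.2, 0.5]]
test = [[0.1, 0.2], [-0.3, 0.6]]  # second row lies outside the training range

params = fit_min_max(train)
train_scaled = transform_min_max(train, params)
test_scaled = transform_min_max(test, params)
print(train_scaled)  # every value falls in [0, 1]
print(test_scaled)   # values outside the training range fall outside [0, 1]
```

Note the second test row: a value below the training minimum becomes negative after scaling. With real embeddings this can occasionally happen too, so it may be worth checking the scaled test matrix (or using the scaler's clipping option, if your scikit-learn version has one) before passing it to multinomial Naive Bayes.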

## Model Evaluation
- After training, predict labels on the test set and generate a classification report showing precision, recall, and F1 score to evaluate model performance.  
- The multinomial Naive Bayes model achieves robust results with over 90% in key metrics, demonstrating effectiveness using SpaCy embeddings.   

## Alternative Classifier: K-Nearest Neighbors (KNN)
- Train a KNN classifier (with k=5 neighbors) on the same SpaCy vector features; KNN typically struggles with high-dimensional sparse data but performs well with dense 300-dimensional vectors.  
- KNN achieves near-perfect precision and recall (~98% in the classification report), outperforming previous TF-IDF or bag-of-ngrams approaches where KNN struggled due to high dimensionality.
- Dense embeddings reduce dimensionality issues, allowing KNN to excel compared to sparse vector representations.   
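
For intuition about what the KNN classifier does with these vectors, here is a minimal from-scratch sketch using Euclidean distance and majority vote. The function `knn_predict` and the 2-D toy points are assumptions for illustration; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=5):
    """Label a query point by majority vote among its k nearest neighbours."""
    distances = [
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings": class 0 clusters near the origin, class 1 near (1, 1)
train_vectors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9), (1.1, 1.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_vectors, train_labels, (0.1, 0.1), k=3))
print(knn_predict(train_vectors, train_labels, (1.0, 1.0), k=3))
```

Because every prediction compares the query against all training points, distances need to be meaningful; dense embeddings place similar texts close together, which is what this method relies on.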

## Summary and Practical Notes
| Step                     | Description                                                                                  |
|--------------------------|----------------------------------------------------------------------------------------------|
| Label Encoding           | Map text labels to numeric values for classification                                        |
| Vectorization            | Convert texts to 300-dimensional SpaCy word vectors                                         |
| Data Preparation         | Use `numpy.stack()` to create 2D arrays compatible with scikit-learn                         |
| Scaling                  | Apply Min-Max scaling to ensure non-negative features for multinomial Naive Bayes            |
| Model Training           | Train multinomial Naive Bayes and KNN classifiers                                           |
| Evaluation               | Evaluate with precision, recall, F1 score; KNN shows superior performance with dense vectors |

- Using pre-trained SpaCy embeddings simplifies feature engineering compared to manual TF-IDF or bag-of-ngrams approaches, speeding up the NLP classification pipeline.  
- Exercises related to this tutorial are often posted separately; reviewing them is recommended for deeper learning and practice.  

> **💡 Key Insight:** Dense vector representations from SpaCy enable simpler and more effective classification, especially benefiting models like KNN that struggle with high-dimensional sparse data.

In [4]:
import pandas as pd

df =pd.read_csv(r"C:\Users\arink\Downloads\Fake_Real_Data.csv")
df.head()

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


In [8]:
df.label.value_counts()

label
Fake    5000
Real    4900
Name: count, dtype: int64

In [26]:
df['label_num']=df['label'].map({'Fake':0,'Real':1})
df.head()

Unnamed: 0,Text,label,label_num
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0
1,U.S. conservative leader optimistic of common ...,Real,1
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0
4,Democrats say Trump agrees to work on immigrat...,Real,1


In [27]:
!python -m spacy download en_core_web_lg --timeout 100


Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)


In [28]:
import spacy

nlp=spacy.load("en_core_web_lg")

In [29]:
doc= nlp("arin kumar is greate and he one of the best code in wordl")
doc.vector.shape

(300,)

In [30]:
df['vector']=df['Text'].apply(lambda x:nlp(x).vector)

In [31]:
df.head()

Unnamed: 0,Text,label,label_num,vector
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,0,"[-0.103623025, 0.17802684, -0.11873861, -0.034..."
1,U.S. conservative leader optimistic of common ...,Real,1,"[-0.0063406364, 0.16712041, -0.06661373, 0.017..."
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,1,"[-0.122753024, 0.17192385, -0.024732638, -0.06..."
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,0,"[-0.027337318, 0.12501417, -0.0073965387, -0.0..."
4,Democrats say Trump agrees to work on immigrat...,Real,1,"[-0.032708026, 0.093958504, -0.03287002, -0.00..."


In [None]:
from sklearn.model_selection import train_test_split




X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values, df.label_num, test_size=0.2, random_state=2022
)

In [36]:
import numpy as np 

X_train_2d=np.stack(X_train)
X_test_2d=np.stack(X_test)

In [37]:
X_train_2d

array([[-0.02370346,  0.14819953, -0.05906299, ..., -0.06582212,
        -0.05378761,  0.08668853],
       [-0.01595326,  0.15394837, -0.10800642, ..., -0.03003666,
        -0.04334445,  0.03076661],
       [-0.04449651,  0.11169833, -0.04756551, ..., -0.10499363,
        -0.00837316,  0.06351685],
       ...,
       [ 0.02167883,  0.12635042, -0.01003216, ..., -0.08063941,
        -0.06881595,  0.04882506],
       [-0.07091133,  0.08315557, -0.06580248, ..., -0.06301989,
         0.02095402,  0.09888683],
       [-0.08993341,  0.14425951, -0.14141384, ..., -0.03444797,
         0.02387965,  0.06281336]], dtype=float32)

In [39]:
from sklearn.preprocessing import MinMaxScaler 

scaler=MinMaxScaler()
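# Learn the min/max of each dimension from the training data only
scale_train_embed=scaler.fit_transform(X_train_2d)
# Reuse the training min/max on the test data; do not refit here
scale_test_embed=scaler.transform(X_test_2d)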

In [40]:
from sklearn.naive_bayes import MultinomialNB

clf=MultinomialNB()

clf.fit(scale_train_embed,y_train)

In [42]:
y_predict=clf.predict(scale_test_embed)
from sklearn.metrics import classification_report

print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95      1024
           1       0.97      0.93      0.95       956

    accuracy                           0.95      1980
   macro avg       0.95      0.95      0.95      1980
weighted avg       0.95      0.95      0.95      1980



In [47]:
from sklearn.neighbors import KNeighborsClassifier

clf=KNeighborsClassifier(n_neighbors=5,metric='euclidean')
clf.fit(X_train_2d,y_train)
y_predt=clf.predict(X_test_2d)

print(classification_report(y_test,y_predt))


              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1024
           1       0.97      0.99      0.98       956

    accuracy                           0.98      1980
   macro avg       0.98      0.98      0.98      1980
weighted avg       0.98      0.98      0.98      1980



## Overview of Gensim and Word Embeddings

- **Gensim** is a Python NLP library best known for topic modeling. Like SpaCy it supports word vectors, and its convenient API makes it a good choice for working with word embeddings.
- Word embeddings can be loaded with `api.load()` from `gensim.downloader`, passing the name of the pretrained model, such as the Google News Word2Vec model (`word2vec-google-news-300`).
- The **Google News Word2Vec model** is large (~1.6 GB), trained on 100 billion words, and contains about 3 million word vectors, making it suitable for large-scale NLP tasks.
- Smaller models are available too, such as those trained on Twitter data (~199 MB, 1.1 million vectors) using different algorithms like GloVe or Word2Vec, useful for lightweight or domain-specific analysis.

## Understanding Word Vector Similarity in Gensim

- The `similarity()` function measures the similarity between two words based on their contexts, not strict synonyms; for example, "great" and "good" have a similarity score around 0.7.
- Similarity reflects how often words appear in similar contexts, not semantic equivalence, which explains why antonyms like "good" and "bad" can have high similarity if they share surrounding words in training corpora.
- This contextual similarity is a key feature of word embeddings, distinguishing them from traditional models like TF-IDF or bag-of-words that lack semantic understanding.

## Practical Examples of Word Embedding Use

- Words like "dog," "puppy," and "Golden Retriever" cluster closely because they appear in similar contexts; similarly, "cat" and "dog" are similar due to shared usage contexts in training data.
- Vector arithmetic on embeddings can reveal meaningful relationships, e.g., $$\text{France} - \text{Paris} + \text{Berlin} = \text{Germany}$$, illustrating how embeddings capture semantic structure beyond word co-occurrence.
- Gensim's `most_similar()` method supports positive and negative examples to perform such vector arithmetic and returns results ranked by similarity, e.g., Queen is the result of $$\text{King} - \text{Man} + \text{Woman}$$.
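
The positive/negative arithmetic can be sketched on hand-made vectors. Everything below is an illustrative assumption: the 3-D `toy_vectors` are invented so that the analogy works exactly, and `analogy` is a simplified stand-in, not Gensim's implementation (which, among other things, works on normalized vectors).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up axes: (royalty, gender, fruit)
toy_vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

print(analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))
```

With real embeddings the result is only approximately equal to the target vector, so the answer is the nearest neighbour rather than an exact match.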

## Additional Gensim APIs for Semantic Tasks

- The `doesnt_match()` method identifies the odd word out in a list based on semantic fit, e.g., in a list of company names with "cat" included, "cat" would be identified as not matching.
- This demonstrates Gensim's ability to capture language context and semantic categories beyond simple word matching.
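
One plausible way to find the odd word out is to compare each word with the average of the group and pick the one that agrees least. The sketch below does this on invented 2-D vectors; `odd_one_out` and `toy_vectors` are hypothetical names, and the code is an approximation of the idea rather than Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector agrees least with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: the three animals point one way, the company another
toy_vectors = {
    "dog": [1.0, 0.1],
    "cat": [0.9, 0.2],
    "lion": [0.8, 0.3],
    "Microsoft": [0.1, 1.0],
}

print(odd_one_out(["dog", "cat", "lion", "Microsoft"], toy_vectors))
```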

## Comparing Models and Their Contextual Influence

- Loading different pretrained models (e.g., the Twitter 25 model trained on 2 billion tweets) can yield different similarity results because domain-specific training data influences word contexts.
- For example, similarity results for "good" differ significantly between the Google News and Twitter models, reflecting the different language use in news articles vs. tweets.
- Semantic tasks such as `most_similar()` and `doesnt_match()` work across models, but results depend on the underlying training corpus and embedding technique (Word2Vec vs. GloVe).

## Important Concepts and Clarifications

> **Similarity in word embeddings refers to contextual similarity, not synonymy or antonymy.** Words appearing in similar contexts have high similarity scores, regardless of their dictionary meanings.  
   

- Word embeddings are trained using self-supervised learning on large text corpora, generating training samples automatically from context windows without labeled data.
- Using pretrained embeddings in NLP models helps capture semantic relationships and generalize better to unseen but semantically related words, unlike traditional sparse vector models.
- Multiple libraries (Gensim, SpaCy, PyTorch) support loading pretrained embeddings, but Gensim offers a convenient API for vector operations and semantic queries.
- Different embedding algorithms (Word2Vec, GloVe) and datasets (Google News, Twitter) affect the size, vocabulary, and quality of the embeddings, so choosing the right model depends on the application domain.

---

| Concept                  | Description                                                                                      | Example/Note                                         |
|--------------------------|------------------------------------------------------------------------------------------------|-----------------------------------------------------|
| Word2Vec                 | An algorithm for training word embeddings based on predicting context words                      | Google News Word2Vec model (3M vectors, 1.6 GB)     |
| GloVe                    | Another embedding algorithm based on matrix factorization of word co-occurrence statistics      | Twitter GloVe model (smaller, domain-specific)       |
| Similarity               | Measures how often two words share similar contexts, not dictionary meaning                     | "good" vs. "bad" similarity ~0.7 due to similar context usage |
| Vector Arithmetic        | Adding and subtracting word vectors to find semantic relations                                  | $$\text{King} - \text{Man} + \text{Woman} = \text{Queen}$$ |
| `most_similar()`         | Finds words closest in vector space to given words or vector expressions                        | Find words similar to "good" or "France - Paris + Berlin" |
| `doesnt_match()`         | Identifies the word that semantically doesn't fit in a list                                    | In ["dog", "cat", "lion", "Microsoft"], "Microsoft" is the odd one out |

                    

In [None]:
import gensim
import gensim.downloader as api

wv=api.load("word2vec-google-news-300")



## Overview of News Classification Task Using Gensim Word Vectors
- The goal is to classify news articles into "fake" or "real" categories using a dataset sourced from Kaggle, containing approximately 9900 samples with balanced classes, avoiding class imbalance issues.   
- Gensim's pre-trained Word2Vec model on Google News is used to obtain 300-dimensional word embeddings, which can be accessed like a Python dictionary to get vectors for individual words.   

## Data Preparation and Label Encoding
- The dataset is loaded into a pandas DataFrame, and the target labels ("fake", "real") are converted into numeric format: fake = 0, real = 1, enabling machine learning compatibility.   

## Text Preprocessing and Vectorization
- A combined preprocessing and vectorization function is created to:
  - Remove stopwords and punctuation.
  - Lemmatize tokens to their base forms.
  - Convert processed tokens into a sentence embedding by averaging their word vectors.    
- Example: The sentence embedding is computed by averaging vectors of individual words (e.g., vectors for "worry" and "understand" averaged to form a 300-dimensional sentence vector). This is a common NLP practice to create sentence-level representations from word embeddings.   

## Using Gensim's get_mean_vector Method
- Gensim's `get_mean_vector` method simplifies averaging word vectors from a list of tokens to produce the sentence embedding, with an option for normalization to improve machine learning performance.    
- Normalization helps by scaling the vectors consistently, which typically enhances model results.   
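
Conceptually, the averaging step looks like the pure-Python sketch below. The helper `mean_vector`, its `normalize` flag, and the 2-D `toy_vectors` are hypothetical and only mirror the idea; Gensim's `get_mean_vector` has its own parameters and defaults, so check its documentation before relying on specific behavior.

```python
import math

def unit(v):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def mean_vector(tokens, vectors, normalize=True):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a vector")
    if normalize:
        known = [unit(v) for v in known]
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

toy_vectors = {"worry": [3.0, 4.0], "understand": [0.0, 2.0]}

# "zzz" has no vector and is ignored
print(mean_vector(["worry", "understand", "zzz"], toy_vectors))
print(mean_vector(["worry", "understand"], toy_vectors, normalize=False))
```

Normalizing each word vector first keeps words with unusually long vectors from dominating the average.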

## Applying Vectorization to the Dataset
- The preprocessing and vectorization function is applied to each news article, creating a new column in the DataFrame with the 300-dimensional sentence vectors. This operation is computationally intensive and may take significant time to complete.   

## Preparing Data for Model Training
- Data is split into training (80%) and testing (20%) sets.
- Sentence vectors are converted from arrays of arrays into a native 2D NumPy array using `np.stack` to allow compatibility with machine learning classifiers.   

## Model Training and Evaluation
- A Gradient Boosting Classifier is trained on the vectorized news data, a method frequently used in the author's machine learning tutorials.  
- The model achieves approximately 98% accuracy, with high precision, recall, and F1 scores in both classes (fake and real), indicating strong classification performance.  

## Prediction and Confusion Matrix Interpretation
- The model is tested on new news samples, correctly predicting their categories.
- The confusion matrix shows most predictions on the diagonal (correct classifications):
  - 965 real news correctly predicted as real.
  - 972 fake news correctly predicted as fake.
- Misclassifications include 28 fake news predicted as real, and 15 real news predicted as fake, indicating some classification errors but overall strong performance.   
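
The headline metrics can be recomputed directly from the four counts quoted above. This is plain arithmetic on those numbers; the variable names are chosen here for readability.

```python
# Counts quoted in the notes above
real_as_real = 965  # real news predicted as real
fake_as_fake = 972  # fake news predicted as fake
fake_as_real = 28   # fake news predicted as real
real_as_fake = 15   # real news predicted as fake

total = real_as_real + fake_as_fake + fake_as_real + real_as_fake
accuracy = (real_as_real + fake_as_fake) / total

precision_real = real_as_real / (real_as_real + fake_as_real)
recall_real = real_as_real / (real_as_real + real_as_fake)
precision_fake = fake_as_fake / (fake_as_fake + real_as_fake)
recall_fake = fake_as_fake / (fake_as_fake + fake_as_real)

print(f"samples={total} accuracy={accuracy:.3f}")
print(f"real: precision={precision_real:.3f} recall={recall_real:.3f}")
print(f"fake: precision={precision_fake:.3f} recall={recall_fake:.3f}")
```

The accuracy works out to roughly 0.978, consistent with the "approximately 98%" figure reported for the model.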

> **💡 Key Insight:** Averaging word embeddings to create sentence vectors is an effective and widely used technique for text classification tasks, especially when combined with powerful classifiers like Gradient Boosting.

In [None]:
import pandas as pd

df= pd.read_csv(r"C:\Users\arink\Downloads\fake_and_real_news.csv")

df.head()

In [None]:
df.label.value_counts()

In [None]:
df["label_num"]=df.label.map({'Fake':0,'Real':1})  # fake = 0, real = 1, as described in the notes
df.head()

In [None]:
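import spacy

nlp=spacy.load("en_core_web_lg")

def preprocessing_and_vectorizing(text):
    # Drop stop words and punctuation, lemmatize, then average the word vectors
    doc=nlp(text)
    filter_token=[]
    
    for token in doc:
        if token.is_punct or token.is_stop:
            continue
        filter_token.append(token.lemma_)

    # wv is the Gensim Word2Vec model loaded earlier with api.load()
    return wv.get_mean_vector(filter_token)

# Sanity check on a single string (the function expects text, not a DataFrame)
preprocessing_and_vectorizing("Don't worry if you don't understand").shape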

In [None]:
df['vector']=df['text'].apply(lambda text:preprocessing_and_vectorizing(text))

In [None]:
df.head()

In [None]:
from sklearn.model_selection import train_test_split



X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values, 
    df.label_num, 
    test_size=0.2, 
    random_state=2022,
    stratify=df.label_num
)

In [None]:
import numpy as np

print("Shape of X_train before reshaping: ", X_train.shape)
print("Shape of X_test before reshaping: ", X_test.shape)


X_train_2d = np.stack(X_train)
X_test_2d =  np.stack(X_test)

print("Shape of X_train after reshaping: ", X_train_2d.shape)
print("Shape of X_test after reshaping: ", X_test_2d.shape)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report


clf = GradientBoostingClassifier()


clf.fit(X_train_2d, y_train)


y_pred = clf.predict(X_test_2d)


print(classification_report(y_test, y_pred))

In [None]:
test_news = [
    "Michigan governor denies misleading U.S. House on Flint water (Reuters) - Michigan Governor Rick Snyder denied Thursday that he had misled a U.S. House of Representatives committee last year over testimony on Flint’s water crisis after lawmakers asked if his testimony had been contradicted by a witness in a court hearing. The House Oversight and Government Reform Committee wrote Snyder earlier Thursday asking him about published reports that one of his aides, Harvey Hollins, testified in a court hearing last week in Michigan that he had notified Snyder of an outbreak of Legionnaires’ disease linked to the Flint water crisis in December 2015, rather than 2016 as Snyder had testified. “My testimony was truthful and I stand by it,” Snyder told the committee in a letter, adding that his office has provided tens of thousands of pages of records to the committee and would continue to cooperate fully.  Last week, prosecutors in Michigan said Dr. Eden Wells, the state’s chief medical executive who already faced lesser charges, would become the sixth current or former official to face involuntary manslaughter charges in connection with the crisis. The charges stem from more than 80 cases of Legionnaires’ disease and at least 12 deaths that were believed to be linked to the water in Flint after the city switched its source from Lake Huron to the Flint River in April 2014. Wells was among six current and former Michigan and Flint officials charged in June. The other five, including Michigan Health and Human Services Director Nick Lyon, were charged at the time with involuntary manslaughter",
    " WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia For Hacking Our Election Is Unpatriotic This woman is insane.In an incredibly disrespectful rant against President Obama and anyone else who supports investigating Russian interference in our election, Fox News host Jeanine Pirro said that anybody who is against Donald Trump is anti-American. Look, it s time to take sides,  she began.",
    " Sarah Palin Celebrates After White Man Who Pulled Gun On Black Protesters Goes Unpunished (VIDEO) Sarah Palin, one of the nigh-innumerable  deplorables  in Donald Trump s  basket,  almost outdid herself in terms of horribleness on Friday."
]

# Use the same function that was defined and applied to the DataFrame above
test_news_vectors = [preprocessing_and_vectorizing(n) for n in test_news]
clf.predict(test_news_vectors)

In [None]:


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm


from matplotlib import pyplot as plt
import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Prediction')
plt.ylabel('Truth')

## Overview of Word Embedding Techniques

- Word2Vec uses two main architectures for word embeddings: Continuous Bag of Words (CBOW) predicts a target word from its context, while Skip-gram predicts context words from a target word. Both train a neural network whose weights serve as word embeddings.
- Word2Vec treats the **word as the smallest unit** for training the neural network, which can cause problems with out-of-vocabulary (OOV) words if they do not appear in the training corpus.
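
The difference between the two architectures is easiest to see in the training examples they generate from a context window. The sketch below only builds those examples (no neural network is trained); the function names and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: skip-gram predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: CBOW predicts the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

No labels are needed: the text itself supplies both inputs and targets, which is what makes this kind of training self-supervised.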

## Introduction to FastText

- FastText is similar to Word2Vec but differs by training on **character n-grams** instead of whole words, capturing subword information (e.g., for "capable" with n=3, n-grams include "cap", "apa", "pab", etc.).
- This subword modeling enables FastText to handle OOV words effectively because it can infer embeddings from known character n-grams even if the full word was unseen during training.
- FastText is often the **first choice for training custom embeddings in specialized domains** due to its efficiency and ability to handle domain-specific vocabulary better than Word2Vec or BERT.
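
A minimal sketch of the n-gram extraction for a single value of n follows. The `<` and `>` word-boundary markers follow the convention described for fastText; the function name `char_ngrams` is an assumption, and the real library uses a range of n-gram lengths plus hashing, which this sketch leaves out.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("capable"))

# An unseen word still shares n-grams with known words
shared = set(char_ngrams("capable")) & set(char_ngrams("capability"))
print(sorted(shared))
```

The shared n-grams in the second print are why an out-of-vocabulary word can still receive a sensible vector: it is composed from pieces the model has already seen.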

## FastText: Technique and Library

- FastText is both a **technique and a library** developed by Facebook Research, providing pre-trained models and Python modules for easy use.
- Pre-trained FastText models are available for many languages and are trained on large corpora such as Wikipedia and Common Crawl, enabling them to capture general language properties.

## Using Pre-trained FastText Models

- The FastText Python API allows loading pre-trained models and accessing word vectors, nearest neighbors, and analogies.
- Nearest neighbors in FastText embeddings reflect **contextual similarity, not synonyms or antonyms**; e.g., "good" and "bad" appear close because they occur in similar contexts.
- FastText vectors have a typical dimension of 300, and methods like `get_word_vector` and `get_analogies` demonstrate semantic relationships (e.g., Berlin:Germany :: Delhi:India).

## Domain-Specific Training Example: Indian Food Recipes

- Pre-trained general models may perform poorly on domain-specific terms (e.g., Indian food items like "chutney" or regional terms like "saragua") because such terms are rare or absent in general corpora.
- Training a FastText model on a specialized dataset (e.g., Indian food recipes) involves cleaning the text data (removing special characters, extra spaces, converting to lowercase) using regular expressions for better quality input.
- Exporting the cleaned recipe text into a plain text file with one recipe per line prepares the data for unsupervised FastText training.
- Training uses FastText’s `train_unsupervised` method, which uses skip-gram by default and generates word embeddings tailored to the domain-specific language.

## Benefits of Custom FastText Models

- Custom-trained FastText models provide **more meaningful nearest neighbor results** for domain-specific terms compared to general models, reflecting better understanding of the domain vocabulary and relationships.
- This approach helps improve NLP tasks in specialized fields by producing embeddings that capture relevant semantic nuances not present in general language models.

## FastText Model Hyperparameters and Further Exploration

- FastText training supports tuning of hyperparameters such as model type (`cbow` or `skipgram`), embedding dimension, number of epochs, learning rate, and number of threads to optimize performance for specific datasets.
- Users are encouraged to experiment with these parameters to improve model quality when training on custom data.
- The FastText library website provides tutorials and resources for deeper exploration, including supervised tasks like text classification.

---

> **💡 Key Insight:** FastText’s use of character n-grams significantly reduces the out-of-vocabulary problem inherent in word-level embeddings, making it highly suitable for custom domain-specific NLP applications.

## Text Preprocessing Using Regular Expressions

- To prepare text for fastText modeling, punctuation and special characters are removed, and the text is converted to lowercase without extra whitespace. This cleaning is done using regular expressions (regex).
- In regex, `\s` matches any whitespace character (spaces, tabs, newlines), and `\w` matches any word character (letters, digits, underscore). Characters that are neither word characters nor whitespace can be targeted for removal or substitution.
- Using `re.sub()`, all special characters can be substituted with a space. For example, the pattern `[^\w\s]` matches any character that is not a word character or whitespace, enabling their removal to clean the text.
- To reduce multiple consecutive spaces to a single space, the pattern `\s+` (one or more whitespace characters) is replaced with a single space. The `strip()` method removes leading/trailing spaces, and `lower()` converts text to lowercase for normalization.
- This preprocessing step is encapsulated in a function that can be applied to an entire pandas DataFrame column using `map()` or `apply()`, enabling batch cleaning of text data.

## Preparing Data for fastText Training

- After preprocessing, the dataset is split into training and testing sets, typically using an 80/20 split, resulting in two DataFrames with separate labeled samples for model training and evaluation.
- fastText’s supervised learning function, `train_supervised()`, requires input data in a specific text file format where each line contains a label and the corresponding item description. The training and test DataFrames are saved as CSV files without headers or indices to meet this format.
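
The expected file layout can be sketched with the standard library alone. The helper `to_fasttext_line`, the sample rows, and the file name `toy.train` are hypothetical; the only part taken from the notes is the `__label__` prefix followed by the text on the same line.

```python
import os
import tempfile

def to_fasttext_line(label, text):
    """Format one sample as '__label__<label> <text>' on a single line."""
    clean_label = label.replace(" ", "_")  # keep the label a single token
    single_line = " ".join(text.split())   # collapse newlines and repeated spaces
    return f"__label__{clean_label} {single_line}"

samples = [
    ("Household", "wooden wall shelf\nset of 3"),
    ("Clothing Accessories", "men's cotton   t shirt"),
]

path = os.path.join(tempfile.mkdtemp(), "toy.train")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    for label, text in samples:
        f.write(to_fasttext_line(label, text) + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Keeping each sample on exactly one line matters because the file is read line by line; a stray newline inside a description would split one sample into two.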

## Training and Evaluating the fastText Model

- The `train_supervised()` method in fastText is used for text classification. Unlike the earlier unsupervised training of word embeddings, it combines embedding generation with classification training.
- Model evaluation is done on the test set using precision and recall metrics. Precision indicates the percentage of correct predictions out of all predictions made, with a typical example showing 96% precision, indicating strong model performance.
- The model’s prediction API is simple: given an input string, it predicts the most likely category label (e.g., Electronics, Books, Clothing) based on the learned embeddings and classification model.

## Additional Model Capabilities and Practical Advice

- The trained fastText model retains word embeddings, allowing tasks like finding nearest neighbors to a word (similar words) using `model.get_nearest_neighbors()`. This can reveal semantically related terms learned during training.
- Practical learning requires active coding practice alongside video tutorials to reinforce understanding and skills in text classification with fastText.

---

> **💡 Key Insight:** Preprocessing text by removing punctuation, normalizing case, and reducing whitespace is essential to prepare clean input data that improves fastText classification accuracy.
>
> **💡 Key Insight:** fastText integrates word embedding learning and supervised classification, enabling efficient and accurate text categorization from raw text data.

In [3]:
import pandas as pd

data=pd.read_csv((r"C:\Users\arink\Downloads\ecommerce_dataset.csv"),names=["catogry","description"],header=None)
print(data.shape)
data.head(3)

(50425, 2)


Unnamed: 0,catogry,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


In [24]:
data.catogry.value_counts()

catogry
__label__Household               19313
__label__Books                   11820
__label__Electronics             10621
__label__Clothing_Accessories     8670
Name: count, dtype: int64

In [10]:
data.dropna(inplace=True)
data.shape

(50424, 2)

In [13]:
data.catogry.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [15]:
# Assign the result back instead of using inplace=True on a column, which newer pandas versions warn about
data['catogry'] = data['catogry'].replace("Clothing & Accessories", "Clothing_Accessories")


In [16]:
data.catogry.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

In [17]:
data['catogry'] = '__label__' + data['catogry'].astype(str)
data.head(5)

Unnamed: 0,catogry,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [18]:
data['catogry_description'] = data['catogry'] + ' ' + data['description']
data.head(3)

Unnamed: 0,catogry,description,catogry_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...


In [19]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [20]:
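def preprocess(text):
    # Replace everything except word characters, whitespace, and apostrophes with a space
    text = re.sub(r'[^\w\s\']',' ', text)
    # Collapse runs of spaces into a single space
    text = re.sub(' +', ' ', text)
    # Trim leading/trailing whitespace and lowercase the result
    return text.strip().lower()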

In [21]:
data['catogry_description'] = data['catogry_description'].map(preprocess)
data.head()

Unnamed: 0,catogry,description,catogry_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


In [23]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)
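The `__label__` prefix built above matches the plain-text layout that fastText's supervised mode reads: one example per line, with the label token first. As a minimal sketch of how such lines could be produced and saved, assuming that is the intended next step, the block below rebuilds the same formatting in plain Python. The helper names `to_fasttext_line` and `write_fasttext_file` and the sample rows are hypothetical, not part of the notebook.

```python
import re

def preprocess(text):
    # Same cleaning as the notebook: keep word characters, whitespace, and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip().lower()

def to_fasttext_line(label, description):
    # Hypothetical helper: prefix the label, append the description, then clean.
    # Assumes the label has no spaces (e.g. "Clothing_Accessories").
    return preprocess(f"__label__{label} {description}")

def write_fasttext_file(lines, path):
    # Hypothetical helper: write one example per line and return the line count
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
            count += 1
    return count

# Made-up sample rows, for illustration only
rows = [
    ("Clothing_Accessories", "Men's Cotton T-Shirt (Blue)"),
    ("Books", "A Short Novel | Paperback"),
]
lines = [to_fasttext_line(label, desc) for label, desc in rows]
```

With the DataFrames from the split, the same writer could be called as `write_fasttext_file(train["catogry_description"], "ecommerce.train")`, where the file name is only an example.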

## Types of Chatbots

- **Flow-based (Rule-based) Chatbots:** These operate via fixed menu options or decision trees where users select predefined choices step-by-step, requiring no machine learning, only simple if-else programming. Examples include Verizon Wireless and PNC Bank chatbots that guide users through hierarchical options like plan management or payments.
- **NLP-based Flow Chatbots:** These understand free-form English queries within a specific domain, parsing user intent and responding accordingly. Examples include Capital One's Eno chatbot, which understands varied phrasings about credit and transactions, and Domino's chatbot, which processes natural language orders and validates inputs like addresses.
- **Open-ended Chatbots:** These support unrestricted, multi-topic conversations without fixed domains. ChatGPT exemplifies this, capable of handling diverse questions and changing topics fluidly, unlike rule-based bots with limited scope.
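
The flow-based style described in the first bullet can be sketched as a small decision tree walked by plain dictionary lookups, with no machine learning involved. The menu contents, node names, and the functions `next_state` and `run_flow` below are hypothetical and only illustrate the idea; they do not reproduce any real bank or carrier bot.

```python
# Hypothetical menu tree: each node has a prompt and a map from user choice to next node
MENU = {
    "start": {
        "prompt": "How can I help? 1) Manage plan 2) Make a payment",
        "options": {"1": "plan", "2": "payment"},
    },
    "plan": {
        "prompt": "Plan options: 1) View plan 2) Change plan",
        "options": {"1": "view_plan", "2": "change_plan"},
    },
    "payment": {
        "prompt": "Payment options: 1) Pay bill 2) Payment history",
        "options": {"1": "pay_bill", "2": "history"},
    },
    # Leaf nodes have no further options
    "view_plan": {"prompt": "Here is your current plan.", "options": {}},
    "change_plan": {"prompt": "Your plan change request has been recorded.", "options": {}},
    "pay_bill": {"prompt": "Redirecting you to bill payment.", "options": {}},
    "history": {"prompt": "Here is your payment history.", "options": {}},
}

def next_state(state, choice):
    # Follow the chosen option; stay on the same node if the choice is not offered
    return MENU[state]["options"].get(choice.strip(), state)

def run_flow(choices, start="start"):
    # Walk the tree with a list of user choices and collect the prompts shown
    state = start
    transcript = [MENU[state]["prompt"]]
    for choice in choices:
        state = next_state(state, choice)
        transcript.append(MENU[state]["prompt"])
    return transcript

for prompt in run_flow(["1", "2"]):
    print(prompt)
```

Because every path is enumerated up front, anything outside the menu simply repeats the current prompt, which is the limited scope the bullets contrast with NLP-based and open-ended bots.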

## Implementation Approaches

| Aspect                    | Flow-based Chatbots                       | NLP-based Chatbots (e.g., using OpenAI API)              |
|---------------------------|------------------------------------------|----------------------------------------------------------|
| Setup and Configuration   | Relatively easy, no ML needed             | More technical knowledge required, involves ML components|
| Natural Language Understanding (NLU) | Handled natively through rule-based logic | Requires custom intent/entity extraction and training    |
| Training Data             | Not needed                               | Fine-tuning or custom training often necessary           |
| Context Management        | Built-in context handling                 | Needs extensive customization                             |
| Integration               | Easier integration with platforms like Slack | More complex integration, custom development needed      |
| Cost                      | Relatively cheaper                       | Can be costly due to usage-based pricing                  |

This comparison highlights why frameworks like Google Dialogflow remain relevant despite the rise of LLMs like ChatGPT.

## Frameworks and Tools for NLP Chatbots

- Popular chatbot frameworks include Google Dialogflow, Rasa, IBM Watson Assistant, Amazon Lex, and Microsoft Azure Bot Service.
- Dialogflow is favored for its ease of use, built-in NLU features, context management, and integration capabilities.
- Custom implementations can leverage APIs from OpenAI (GPT models), open-source LLMs (Hugging Face, Bloom), or cloud foundation models (AWS Bedrock) for flexibility and advanced capabilities.

## Advantages of Chatbots over Human Customer Support

- **Scalability:** Easily scale to handle growing customer bases by deploying cloud resources, unlike human staffing which has practical limits.
- **24/7 Availability:** Chatbots operate continuously without breaks or shift changes.
- **Cost Efficiency:** Cheaper to maintain than large human support teams.
- **Improved Customer Experience:** Instant responses prevent delays common in human call centers during peak loads.

## Summary of Chatbot Use Cases and Next Steps

- Chatbots are widely used in customer service, ordering systems, banking, and travel booking.
- Flow-based chatbots suit structured tasks with limited scope.
- NLP and open-ended chatbots handle more complex, natural language interactions.
- Upcoming tutorials will build an end-to-end Dialogflow-based chatbot for a real business use case, covering data collection, cleaning, and deployment.

---

> **💡 Key Insight:** Despite the excitement around large language models like ChatGPT, traditional chatbot frameworks remain vital due to their ease of setup, cost-effectiveness, and integration strengths in many real-world applications. Custom AI solutions require more technical expertise and investment, making them complementary rather than replacement technologies.