# Text Tagging

### Summary

This video introduces two text tagging methods: Parts of Speech (POS) tagging, which labels tokens as verbs, nouns, etc., and Named Entity Recognition (NER), which identifies entities like people and places. These techniques aid in text exploration and feature creation for machine learning.

### Highlights

- 🏷️ Parts of Speech (POS) tagging labels each token with its grammatical role (verb, noun, adjective).
- 📍 Named Entity Recognition (NER) identifies and extracts named entities like people, places, and organizations.
- 🔍 These tagging methods are useful for exploring and understanding text data.
- 📊 Tagging can also be used to create additional features for machine learning algorithms.
- 📈 Both methods can also serve as standalone analyses, not only as preprocessing steps.
- 🐍 The next lessons will cover how to implement these methods in Python.
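
As a rough illustration of how tags can become machine learning features, the sketch below turns a list of (token, tag) pairs into per-tag proportions. The tagged pairs are hand-written for illustration rather than produced by a tagger, and `pos_proportions` is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def pos_proportions(tagged_tokens):
    """Return the share of each POS tag in a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Hand-tagged example (illustrative only, not real tagger output)
tagged = [
    ("Emma", "PROPN"),
    ("was", "AUX"),
    ("handsome", "ADJ"),
    ("and", "CCONJ"),
    ("clever", "ADJ"),
]

features = pos_proportions(tagged)
print(features)
```

Each document would yield one such dictionary, which could then be placed alongside other features as extra columns.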

# Parts of Speech (POS) Tagging

### Summary

This lesson demonstrates Parts of Speech (POS) tagging in Python using spaCy and pandas. It loads a spaCy model, processes text (Jane Austen's *Emma*), extracts each token and its POS tag into a DataFrame, and analyzes the frequency of tokens and tags.

### Highlights

- 🐍 Uses the spaCy and pandas libraries for POS tagging.
- 📖 Processes text from Jane Austen's *Emma*.
- 📊 Creates a pandas DataFrame to store tokens and POS tags.
- 🔢 Calculates and displays the most common tokens and their POS tags.
- 📈 Analyzes the frequency of different POS tags in the text.
- 🔍 Filters and displays the most common nouns and adjectives.
- 👩‍💻 Demonstrates how to extract and analyze POS tag information for text analysis.

### Code Examples

```python
import spacy
import pandas as pd

# Load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the text (Jane Austen's Emma)
text = "emma text here"  # Replace with actual text

# Create a spaCy document
doc = nlp(text)

# Extract tokens and POS tags into a DataFrame in one pass
# (building the rows first avoids calling pd.concat once per token)
pos_df = pd.DataFrame(
    [(token.text, token.pos_) for token in doc],
    columns=['token', 'pos_tag'],
)

# Analyze most common tokens and tags
pos_df_counts = pos_df.groupby(['token', 'pos_tag']).size().reset_index(name='counts').sort_values('counts', ascending=False)
print(pos_df_counts.head(10))

# Analyze POS tag frequency
pos_df_pos_counts = pos_df.groupby('pos_tag')['token'].count().sort_values(ascending=False)
print(pos_df_pos_counts.head())

# Filter and display most common nouns
nouns = pos_df_counts[pos_df_counts['pos_tag'] == 'NOUN'].head(10)
print(nouns)

# Filter and display most common adjectives
adjectives = pos_df_counts[pos_df_counts['pos_tag'] == 'ADJ'].head(10)
print(adjectives)
```
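
In a full novel, the raw token counts above are likely to be dominated by punctuation and function words. One possible refinement, offered here as an assumption rather than something the lesson shows, is to skip those tokens before counting. The helper names `count_content_tokens` and `fake_token` are hypothetical. The filter relies on spaCy's `is_punct`, `is_space`, and `is_stop` token attributes, and the demo uses stand-in token objects so the block runs without a model installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_content_tokens(tokens, pos=None):
    """Count lowercased token texts, skipping punctuation, whitespace, and stop words.

    `tokens` can be a spaCy Doc or any iterable of objects exposing
    text, pos_, is_punct, is_space, and is_stop. If `pos` is given,
    only tokens with that POS tag are counted.
    """
    counts = Counter()
    for token in tokens:
        if token.is_punct or token.is_space or token.is_stop:
            continue
        if pos is not None and token.pos_ != pos:
            continue
        counts[token.text.lower()] += 1
    return counts


def fake_token(text, pos, punct=False, stop=False):
    """Build a stand-in object with the attributes the counter reads."""
    return SimpleNamespace(
        text=text, pos_=pos, is_punct=punct, is_space=False, is_stop=stop
    )


# Stand-in tokens (illustrative only, not real spaCy output)
demo_tokens = [
    fake_token("The", "DET", stop=True),
    fake_token("friend", "NOUN"),
    fake_token(",", "PUNCT", punct=True),
    fake_token("Friend", "NOUN"),
    fake_token("house", "NOUN"),
    fake_token("good", "ADJ"),
]

noun_counts = count_content_tokens(demo_tokens, pos="NOUN")
print(noun_counts.most_common(2))
```

With the document from the lesson, `count_content_tokens(doc, pos="NOUN").most_common(10)` would play the same role as the noun filter above.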

# Named Entity Recognition (NER)

### Summary

This lesson covers Named Entity Recognition (NER) using spaCy in Python. It demonstrates how to identify and visualize named entities in text and shows how text cleaning changes NER results, so it matters whether cleaning happens before or after NER.

### Highlights

- 📦 Uses spaCy and its built-in displaCy visualizer for NER and visualization.
- 📄 Processes text from the Wikipedia page for Google.
- 🏷️ Identifies and labels named entities like dates, people, and organizations.
- 🎨 Visualizes named entities using displaCy's `render` function.
- 🧹 Shows how text cleaning (removing punctuation, lowercasing) affects NER results.
- ⏰ Emphasizes deciding whether to clean text before or after running NER.
- 🧐 Encourages experimentation with pre-processing to optimize NER results.

### Code Examples

```python
import spacy
from spacy import displacy
import re

# Load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

# Example text from Google Wikipedia page
google_text = "Google, LLC is an American multinational technology company that focuses on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. It was founded on September 4, 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."

# Create a spaCy document
doc = nlp(google_text)

# Print entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

# Visualize entities using displaCy (jupyter=True assumes a notebook environment)
displacy.render(doc, style="ent", jupyter=True)

# Clean the text (remove punctuation, lowercase)
google_text_clean = re.sub(r'[^\w\s]', '', google_text).lower()

# Create a Spacy document from the cleaned text
doc_clean = nlp(google_text_clean)

# Print entities and their labels from the cleaned text
for ent in doc_clean.ents:
    print(ent.text, ent.label_)

# Visualize entities from the cleaned text
displacy.render(doc_clean, style="ent", jupyter=True)
```
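
To make the effect of cleaning easier to see than by comparing two printouts, the two entity lists can be diffed. The sketch below is one way to do that. `normalize` and `entities_lost` are hypothetical helpers, and the example pairs are hand-written stand-ins rather than real model output, since the entities spaCy finds depend on the model and its version.

```python
import re


def normalize(text):
    """Apply the same cleaning as the lesson: drop punctuation, then lowercase."""
    return re.sub(r"[^\w\s]", "", text).lower()


def entities_lost(raw_entities, clean_entities):
    """Return (text, label) pairs found in the raw text but missing after cleaning.

    Raw entity texts are normalized before comparison so that, for example,
    "September 4, 1998" matches "september 4 1998".
    """
    clean_keys = {(normalize(text), label) for text, label in clean_entities}
    return [
        (text, label)
        for text, label in raw_entities
        if (normalize(text), label) not in clean_keys
    ]


# Hand-written stand-ins (illustrative only, not real spaCy output)
raw_entities = [
    ("Google", "ORG"),
    ("September 4, 1998", "DATE"),
    ("Larry Page", "PERSON"),
]
clean_entities = [("september 4 1998", "DATE")]

lost = entities_lost(raw_entities, clean_entities)
print(lost)
```

With the documents from the lesson, the inputs could be built as `[(ent.text, ent.label_) for ent in doc.ents]` and the same expression over `doc_clean.ents`.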