---
execute:
  eval: true
  echo: true
  output: true
title: "text tokenization"
---

- code examples [nltk](https://www.nltk.org/howto.html)
- code examples [spacy](https://spacy.io/usage/spacy-101)
- download [jupyter notebook](pyws03-2-text-analysis.ipynb)

In [None]:
# run inside google colab
#!git clone https://github.com/cca-cce/osm-cca-nlp.git

## natural language processing

In [None]:
import spacy
from spacy import displacy

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

# Example sentence
sentence = "Apple is looking at buying U.K. startup for $1 billion"

# Process the sentence with spaCy
doc = nlp(sentence)

# Print parts of speech tags
print("Parts of Speech:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# Print named entities
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

# Render the dependency parse tree in Jupyter Notebook
displacy.render(doc, style='dep', jupyter=True)

# Render the named entity recognition visualization in Jupyter Notebook
displacy.render(doc, style='ent', jupyter=True)

**Explanation:**

- **Imports:**
  - `spacy` is imported for natural language processing tasks.
  - `displacy` from `spacy` is imported for visualizations.

- **Load Model:**
  - The English language model `en_core_web_sm` is loaded using `spacy.load()`.

- **Process Sentence:**
  - The example sentence is processed to create a `Doc` object.

- **Parts of Speech (POS) Tags:**
  - Iterating over `doc`, each token's text, coarse-grained POS tag (`pos_`), and fine-grained tag (`tag_`) are printed.

- **Named Entities:**
  - Iterating over `doc.ents`, each entity's text and label are printed.

- **Visualizations:**
  - The dependency parse tree and named entity recognition (NER) are rendered using `displacy.render()` with `jupyter=True` to display within a Jupyter notebook.

## spacy sentence processing

### Import Libraries

*In this step, we import the necessary libraries: pandas for data manipulation, spaCy for natural language processing tasks, and os for interacting with the operating system.*

In [None]:
#| eval: true
#| echo: true
#| output: false

import pandas as pd
import spacy
import os

!python -m spacy download en_core_web_sm

---

### Load Data from TSV File

*We load the DataFrame `text_df` from a TSV (tab-separated values) file using pandas' `read_csv` function with the separator set to tab (`\t`). The `input_file_path` variable is assigned twice: first the Google Colab path, then a local path. The second assignment overwrites the first, so only the last path is used; comment out the line that does not match your environment.*

In [None]:
# Load text_df from the TSV file
input_file_path = '/content/osm-cca-nlp/csv/text_data.tsv'
input_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/text_data.tsv'
text_df = pd.read_csv(input_file_path, sep='\t')

*Here, pandas reads the TSV file into a DataFrame, which allows for efficient data manipulation and analysis.*
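
Because the path is assigned twice, one line has to be edited for each environment. As a minimal sketch, the `os` module imported earlier could pick whichever candidate exists. The helper name `first_existing_path` is hypothetical and not part of the notebook:

```python
import os

def first_existing_path(candidates):
    """Return the first path in `candidates` that exists on disk.

    Hypothetical helper for illustration; raises FileNotFoundError if none exist.
    """
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"None of these paths exist: {candidates}")

# Candidate locations taken from the notebook: Google Colab first, then a local checkout
candidate_paths = [
    '/content/osm-cca-nlp/csv/text_data.tsv',
    '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/text_data.tsv',
]

# Usage (uncomment in an environment where one of the paths exists):
# input_file_path = first_existing_path(candidate_paths)
# text_df = pd.read_csv(input_file_path, sep='\t')
```

Failing loudly when no candidate exists avoids silently reading from an unexpected location.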

---

### Load the spaCy Model

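*We load the spaCy English language model using `spacy.load`. The model `en_core_web_sm` is a small English pipeline that includes a vocabulary, a part-of-speech tagger, a dependency parser, and a named entity recognizer. Unlike the larger `en_core_web_md` and `en_core_web_lg` models, it does not include static word vectors.*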

In [None]:
# Load the spaCy model (small English model is used here)
nlp = spacy.load("en_core_web_sm")

*This model provides the necessary tools for tokenization, part-of-speech tagging, named entity recognition, and sentence segmentation, which are essential for processing and analyzing text data.*

---

### Initialize List to Store Sentence Data

*We initialize an empty list `sentence_data` to store information about each sentence extracted from the texts. This list will be populated with dictionaries containing sentence-level data.*

In [None]:
# Initialize an empty list to store sentence data
sentence_data = []

*By using a list, we can dynamically append sentence data as we process each text, which will later be converted into a pandas DataFrame for further analysis.*

---

### Extract Sentences Using spaCy

*In this step, we iterate over each row in the `text_df` DataFrame using `iterrows()`. For each text, we process the 'cleaned_text' column with the spaCy NLP pipeline to obtain a `Doc` object. We then iterate over the sentences in the `Doc` using `doc.sents` and collect the original text ID, sentence number, and the sentence text. Each sentence's data is appended to the `sentence_data` list as a dictionary.*

In [None]:
# Iterate over the cleaned text in the DataFrame
for index, row in text_df.iterrows():
    doc = nlp(row['cleaned_text'])  # Process the cleaned text with spaCy

    # Iterate over the sentences in the document
    for i, sentence in enumerate(doc.sents):
        sentence_data.append({
            'id': row['id'],           # Original text ID
            'sentence_number': i + 1,  # Sentence number (starting from 1)
            'sentence_text': sentence.text.strip()  # Sentence text
        })

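*`doc.sents` relies on sentence boundaries set by the pipeline. With `en_core_web_sm`, these come from the statistical dependency parser by default, so segmentation can be less reliable on text that differs from the model's training data, such as headlines or heavily cleaned text.*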

---

### Create DataFrame with Sentence Data

*We create a new pandas DataFrame `sentence_df` from the `sentence_data` list. This DataFrame contains all the sentences extracted from the texts along with their corresponding IDs and sentence numbers.*

In [None]:
# Create a new DataFrame with the sentence data
sentence_df = pd.DataFrame(sentence_data)

*Using pandas allows us to efficiently organize and manipulate the sentence-level data for analysis or storage.*

---

### Save Sentence Data to TSV File

*We save the `sentence_df` DataFrame to a TSV file using the `to_csv` method, specifying the tab separator (`\t`) and setting `index=False` to exclude the DataFrame's index from the output file. Again, two output file paths are provided, with the second one overwriting the first.*

In [None]:
# Save the sentence_df DataFrame as a TSV file
output_file_path = '/content/osm-cca-nlp/csv/sentence_data.tsv'
output_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/sentence_data.tsv'
sentence_df.to_csv(output_file_path, sep='\t', index=False)

*This step ensures that the extracted sentence data is saved in a structured format, which can be easily shared or used in subsequent analyses.*
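
`to_csv` raises an error if the target directory does not exist. A hedged sketch of one way to guard against that with the `os` module follows. The helper name `ensure_parent_dir` is hypothetical:

```python
import os

def ensure_parent_dir(file_path):
    """Create the parent directory of `file_path` if it is missing; return the path unchanged."""
    parent = os.path.dirname(file_path)
    if parent:  # an empty string means the current directory, which already exists
        os.makedirs(parent, exist_ok=True)
    return file_path

# Usage with the notebook's DataFrame (shown as a comment because it needs `sentence_df`):
# sentence_df.to_csv(ensure_parent_dir(output_file_path), sep='\t', index=False)
```

Returning the path unchanged lets the helper wrap the argument directly in the `to_csv` call.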

---

### Display the Sentence DataFrame

*Finally, we display the `sentence_df` DataFrame by printing it to the console. This allows us to verify the contents and ensure that the sentence extraction was successful.*

In [None]:
# Display the sentence DataFrame
print(sentence_df)

*Viewing the DataFrame helps in quick validation of the data processing steps and provides an immediate look at the results of our sentence extraction.*
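
Beyond reading the printout, a small check can confirm that sentence numbers run 1, 2, 3 and so on without gaps for every text. The sketch below works on a list of dictionaries shaped like `sentence_data`. The function name `ids_with_numbering_gaps` and the sample records are illustrative assumptions:

```python
from collections import defaultdict

def ids_with_numbering_gaps(records):
    """Return the sorted ids whose sentence numbers are not exactly 1..n."""
    numbers_by_id = defaultdict(list)
    for record in records:
        numbers_by_id[record['id']].append(record['sentence_number'])
    bad_ids = []
    for text_id, numbers in numbers_by_id.items():
        if sorted(numbers) != list(range(1, len(numbers) + 1)):
            bad_ids.append(text_id)
    return sorted(bad_ids)

# Made-up records in the same layout as `sentence_data`
sample_records = [
    {'id': 1, 'sentence_number': 1, 'sentence_text': 'First sentence.'},
    {'id': 1, 'sentence_number': 2, 'sentence_text': 'Second sentence.'},
    {'id': 2, 'sentence_number': 2, 'sentence_text': 'Numbering starts too late.'},
]
print(ids_with_numbering_gaps(sample_records))  # prints [2]
```

With the notebook's data, `sentence_df.to_dict('records')` yields a list in this layout.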

---

**Summary:**

- **pandas (`pd`):** Used for data manipulation and storage in DataFrames, providing efficient methods to read from and write to various file formats.
- **spaCy (`spacy`):** Utilized for advanced natural language processing tasks, particularly for sentence segmentation in this code.
- **os module:** Although imported, it isn't actively used in this code snippet. Typically, it would be used for file path manipulations or checking file existence.
- **Data Processing Steps:**
  - Loading text data from a TSV file.
  - Processing text with spaCy to extract sentences.
  - Storing the sentences along with metadata in a DataFrame.
  - Saving the processed data back to a TSV file for future use.

This modular approach breaks down the task into manageable steps, making the code easier to understand and maintain.

## spacy token processing

### Import Libraries

*In this step, we import the necessary libraries: pandas for data manipulation, spaCy for natural language processing tasks, and os for interacting with the operating system.*

In [None]:
import pandas as pd
import spacy
import os

---

### Load Sentence Data from TSV File

*We load the `sentence_df` DataFrame from a TSV (Tab-Separated Values) file using pandas' `read_csv` function with the separator set to tab (`\t`). The `input_file_path` variable specifies the path to the input file. Note that the second assignment of `input_file_path` overwrites the first, so only the last path is used.*

In [None]:
# Load sentence_df from the TSV file
input_file_path = '/content/osm-cca-nlp/csv/sentence_data.tsv'
input_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/sentence_data.tsv'
sentence_df = pd.read_csv(input_file_path, sep='\t')

*By using pandas, we efficiently read the TSV file into a DataFrame for further processing.*

---

### Load the spaCy Model

*We load the spaCy English language model `en_core_web_sm` using `spacy.load()`. This model provides tools for tokenization, part-of-speech tagging, named entity recognition, and more.*

In [None]:
# Load the spaCy model (small English model is used here)
nlp = spacy.load("en_core_web_sm")

*The `nlp` object now contains the language model, which we'll use to process text data.*

---

### Initialize List to Store Token Data

*We initialize an empty list `token_data` to store information about each token extracted from the sentences. This list will accumulate dictionaries containing token-level data.*

In [None]:
# Initialize an empty list to store token data
token_data = []

*This prepares us to collect detailed linguistic information from each sentence.*

---

### Extract Tokens from Sentences Using spaCy

*We iterate over each sentence in `sentence_df` using `iterrows()`. For each sentence, we process it with spaCy to obtain a `Doc` object, which contains the tokens and their linguistic annotations. We then iterate over each token in the `Doc`, extracting attributes such as the token's text, lemma, part-of-speech tag, and named entity type. This information is stored in the `token_data` list as dictionaries.*

In [None]:
# Iterate over the sentences in the sentence_df DataFrame
for index, row in sentence_df.iterrows():
    doc = nlp(row['sentence_text'])  # Process the sentence text with spaCy

    # Iterate over the tokens in the sentence
    for j, token in enumerate(doc):
        token_data.append({
            'id': row['id'],                            # Original text ID
            'sentence_number': row['sentence_number'],  # Sentence number
            'token_number': j + 1,                      # Token number (starting from 1)
            'token_text': token.text,                   # Token text
            'token_lemma': token.lemma_,                # Token lemma
            'token_pos': token.pos_,                    # Token part of speech
            'token_entity': token.ent_type_             # Token entity type (if any)
        })

*Each token record carries the lemma (`lemma_`), the coarse-grained part-of-speech tag (`pos_`), and the entity type (`ent_type_`). Tokens that are not part of a named entity have an empty string in `token_entity`.*

---

### Create a DataFrame with Token Data

*We convert the `token_data` list into a pandas DataFrame called `token_df`. This DataFrame organizes the token-level information in a tabular format, making it easier to analyze and manipulate.*

In [None]:
# Create a new DataFrame with the token data
token_df = pd.DataFrame(token_data)

*Using pandas allows us to handle large amounts of data efficiently and provides tools for data analysis.*

---

### Save Token Data to TSV File

*We save the `token_df` DataFrame to a TSV file using pandas' `to_csv()` method. We set the separator to tab (`\t`) and `index=False` to exclude the DataFrame's index from the output file. Similar to before, the second assignment of `output_file_path` overwrites the first.*

In [None]:
# Save the token_df DataFrame as a TSV file
output_file_path = '/content/osm-cca-nlp/csv/token_data.tsv'
output_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/token_data.tsv'
token_df.to_csv(output_file_path, sep='\t', index=False)

*This saves our token-level data in a structured format that can be shared or used in future analyses.*

---

### Display the Token DataFrame

*Finally, we print the `token_df` DataFrame to display the collected token data. This allows us to verify that the token extraction and annotation processes were successful.*

In [None]:
# Display the token DataFrame
print(token_df)

*Viewing the DataFrame provides immediate feedback on the results of our processing pipeline.*

---

**Summary:**

- **pandas (`pd`):** Used for reading and writing TSV files and handling DataFrames.
- **spaCy (`spacy`):** Utilized for processing text to extract tokens, lemmas, parts of speech, and named entities.
- **Data Processing Steps:**
  - Loading sentence-level data from a TSV file.
  - Processing sentences with spaCy to extract token-level information.
  - Storing and organizing token data in a pandas DataFrame.
  - Saving the token data to a TSV file for future use or analysis.

This modular approach ensures that each part of the code is focused on a specific task, making it easier to understand and maintain.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# Load token_df from the TSV file
input_file_path = '/content/osm-cca-nlp/csv/token_data.tsv'
input_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/token_data.tsv'
token_df = pd.read_csv(input_file_path, sep='\t')

# Filter the DataFrame to keep only rows where the part of speech is 'NOUN'
noun_df = token_df[token_df['token_pos'] == 'NOUN']

# Group by the lemma and count the occurrences of each lemma
lemma_counts = noun_df['token_lemma'].value_counts().reset_index()

# Rename the columns for clarity
lemma_counts.columns = ['lemma', 'count']

# Get the 20 most frequent lemmas
top_lemmas = lemma_counts.head(20)

# Plot the 20 most frequent nouns using Seaborn
plt.figure(figsize=(10, 8))
sns.barplot(x='count', y='lemma', data=top_lemmas, palette='viridis')
plt.title('Top 20 Most Frequent Nouns')
plt.xlabel('Count')
plt.ylabel('Lemma')

# Save the figure to a PNG file
output_file_path = '/content/osm-cca-nlp/fig/token_noun.png'
output_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/fig/token_noun.png'
plt.savefig(output_file_path)

# Display the plot
plt.show()
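
The filter-then-count logic in the cell above can also be traced without pandas. This sketch uses made-up token records (not the notebook's data) in the `token_data` layout, with `collections.Counter` standing in for `value_counts()`. The function name `top_noun_lemmas` is hypothetical:

```python
from collections import Counter

# Made-up token records in the same layout as `token_data`
sample_tokens = [
    {'token_lemma': 'car', 'token_pos': 'NOUN'},
    {'token_lemma': 'drive', 'token_pos': 'VERB'},
    {'token_lemma': 'car', 'token_pos': 'NOUN'},
    {'token_lemma': 'market', 'token_pos': 'NOUN'},
]

def top_noun_lemmas(tokens, n=20):
    """Count lemmas of tokens tagged NOUN and return the n most frequent as (lemma, count) pairs."""
    noun_lemmas = [t['token_lemma'] for t in tokens if t['token_pos'] == 'NOUN']
    return Counter(noun_lemmas).most_common(n)

print(top_noun_lemmas(sample_tokens))  # prints [('car', 2), ('market', 1)]
```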

## text vectorization

In [None]:
import spacy

# Load a spaCy model: en_core_web_md includes word vectors; en_core_web_sm suffices if you don't need word embeddings
#nlp = spacy.load("en_core_web_md")
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Remove stopwords, punctuation, and non-alphabetic tokens
        if not token.is_stop and not token.is_punct and token.is_alpha:
            tokens.append(token.lemma_.lower())  # Append lemmatized form and lowercase
    return tokens

# Define example news headlines in which some words occur twice in the same headline
headlines = [
    "Apple launches new iPhone iPhone in September",
    "Google announces AI advancements with AI in health sector",
    "Tesla's electric cars revolutionize the electric car industry",
    "Amazon announces new grocery delivery for grocery stores",
    "Netflix announces new series based on new AI-based technology",
    "Microsoft launches cloud services and cloud infrastructure",
    "Facebook unveils privacy controls with enhanced privacy features",
    "Pfizer launches vaccine trials for new vaccine prevention",
    "Nike launches new eco-friendly shoe and shoe design",
    "BMW announces electric car breakthrough in the electric vehicle market"
]

# Preprocess the list of headlines
preprocessed_headlines = [" ".join(preprocess_text(headline)) for headline in headlines]
print(preprocessed_headlines)  # Output preprocessed headlines

In [None]:
from gensim import corpora
from gensim.models import TfidfModel

# Tokenize preprocessed headlines
tokenized_headlines = [headline.split() for headline in preprocessed_headlines]

# Create a dictionary of words
dictionary = corpora.Dictionary(tokenized_headlines)

# Create a Bag of Words corpus
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_headlines]

# Or, alternatively, create a TF-IDF corpus
tfidf = TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
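
To make the two representations concrete, the following pure-Python sketch computes bag-of-words counts and TF-IDF weights by hand. It assumes one common formulation: raw term frequency times log2(N / document frequency), followed by L2 normalization per document. gensim's `TfidfModel` is configurable, so its results may differ in detail. The toy documents and the function name `tfidf_weights` are illustrative:

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Return one {term: weight} dict per document.

    weight = term frequency * log2(N / document frequency), L2-normalized per document.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted_docs = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)  # bag-of-words counts for this document
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted_docs.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted_docs

# Toy documents (assumed for illustration, not the notebook's headlines)
toy_docs = [['apple', 'launch', 'iphone', 'iphone'], ['google', 'launch', 'ai', 'ai']]
for weights in tfidf_weights(toy_docs):
    print(weights)
```

A term that appears in every document, such as `launch` here, gets weight 0 under this formulation because log2(N / N) = 0.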

In [None]:
import pandas as pd
from gensim import corpora
from gensim.models import TfidfModel

# Assuming bow_corpus is already defined and dictionary is available
# dictionary = corpora.Dictionary(tokenized_headlines)

# Create a list of terms (vocabulary) from the dictionary
terms = [dictionary[i] for i in range(len(dictionary))]

# Create a count matrix for the Bag of Words (BoW) corpus
# Note: rows are terms and columns are documents (a term-document layout);
# use bow_doc_term_matrix.T for one row per document
bow_doc_term_matrix = pd.DataFrame([[dict(doc).get(i, 0) for doc in bow_corpus] for i in range(len(dictionary))], index=terms)

# Display the BoW matrix
print(bow_doc_term_matrix)
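
Whether terms or documents form the rows is a matter of convention, so it helps to be explicit about orientation. The pure-Python sketch below builds a count matrix with one row per document and one column per term. The toy documents and the function name `document_term_matrix` are illustrative assumptions, not part of the notebook:

```python
from collections import Counter

def document_term_matrix(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in document d."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Toy documents (assumed for illustration)
toy_docs = [['cloud', 'service', 'cloud'], ['privacy', 'cloud']]
vocabulary, matrix = document_term_matrix(toy_docs)
print(vocabulary)  # prints ['cloud', 'privacy', 'service']
for row in matrix:
    print(row)
```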