---
execute:
  eval: true
  echo: true
  output: true
title: "reading text content"
---

- code examples [nltk](https://www.nltk.org/howto.html)
- code examples [spacy](https://spacy.io/usage/spacy-101)
- download [jupyter notebook](pyws03-1-text-analysis.ipynb)

In [None]:
# run inside google colab
#!git clone https://github.com/cca-cce/osm-cca-nlp.git

## recap reading data files

In [None]:
#| eval: false
#| echo: true

import pandas as pd

# comma separated
df = pd.read_csv('users.csv', sep=',', quotechar='"', header=0)
#df = pd.read_csv('users.csv', sep=',', quotechar='"', header=None)
#df = pd.read_csv('users.csv', sep=',', quotechar="'", header=0)
#df = pd.read_csv('users.csv', sep=',', quotechar="'", header=None)
#df = pd.read_csv('users.csv', sep='\t', quotechar='"', header=0)
#df = pd.read_csv('users.csv', sep='\t', quotechar='"', header=None)
#df = pd.read_csv('users.csv', sep='\t', quotechar="'", header=0)
#df = pd.read_csv('users.csv', sep='\t', quotechar="'", header=None)

# tab separated 
#df = pd.read_csv('users.tsv', sep=',', quotechar='"', header=0)
#df = pd.read_csv('users.tsv', sep=',', quotechar='"', header=None)
#df = pd.read_csv('users.tsv', sep=',', quotechar="'", header=0)
#df = pd.read_csv('users.tsv', sep=',', quotechar="'", header=None)
df = pd.read_csv('users.tsv', sep='\t', quotechar='"', header=0)
#df = pd.read_csv('users.tsv', sep='\t', quotechar='"', header=None)
#df = pd.read_csv('users.tsv', sep='\t', quotechar="'", header=0)
#df = pd.read_csv('users.tsv', sep='\t', quotechar="'", header=None)

# excel
#df = pd.read_excel('users.xlsx', header=0, sheet_name=1)
#df = pd.read_excel('users.xlsx', header=None, sheet_name=1)

df.head()
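
*The effect of `sep`, `quotechar`, and `header` can be checked without any files on disk. The following is a minimal sketch that parses small in-memory samples; the sample rows and column names are made up for illustration and are not the contents of `users.csv` or `users.tsv`.*

```python
import io
import pandas as pd

# Hypothetical sample data (not the contents of users.csv or users.tsv)
csv_text = 'name,city\n"Doe, Jane",Lund\n"Roe, Rick",Malmo\n'
tsv_text = 'name\tcity\nDoe, Jane\tLund\nRoe, Rick\tMalmo\n'

# Comma separated: the quote character protects the comma inside the name field
df_csv = pd.read_csv(io.StringIO(csv_text), sep=',', quotechar='"', header=0)

# Tab separated: commas are ordinary characters here, so no quoting is needed
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t', quotechar='"', header=0)

# header=None treats the first line as data and numbers the columns 0, 1, ...
df_raw = pd.read_csv(io.StringIO(csv_text), sep=',', quotechar='"', header=None)

print(df_csv)
print(df_raw.shape)
```

*If `sep` does not match the file, pandas typically reads each line into a single column, which is a quick way to spot a wrong separator.*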

## nltk and text corpora

### Import Libraries and Download NLTK Data

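*In this step, we import the necessary libraries and download the required NLTK data packages. NLTK's `download` function makes the 'gutenberg' corpus and the 'punkt' tokenizer models available locally. Punkt is a sentence tokenizer; `word_tokenize()` also relies on it, because it splits the text into sentences before splitting each sentence into words. The 'punkt_tab' package is downloaded as well, because recent NLTK releases load the Punkt models from it.*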

In [None]:
import nltk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
from IPython.display import display

# Download necessary NLTK data files
nltk.download('gutenberg')
nltk.download('punkt')
nltk.download('punkt_tab')

### Load the Gutenberg Corpus

*Here, we import the Gutenberg corpus from NLTK's corpus module. The Gutenberg corpus is a collection of literary texts that we will analyze. We retrieve the list of file IDs available in the corpus using `gutenberg.fileids()`, which provides us with the filenames of the texts in the corpus.*

In [None]:
from nltk.corpus import gutenberg

# Get list of file IDs from the Gutenberg corpus
file_ids = gutenberg.fileids()

### Analyze Each Text in the Corpus

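*In this section, we iterate over each text in the Gutenberg corpus to compute several statistics. We use NLTK's `raw()` method to get the raw text, `word_tokenize()` to split the text into words, and `sent_tokenize()` to split the text into sentences. For each text we record the number of tokens, the number of sentences, the average token length, the vocabulary size (number of distinct tokens), and the lexical diversity (vocabulary size divided by the number of tokens). Note that `word_tokenize()` returns punctuation marks as separate tokens, so they are included in these counts.*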

In [None]:
# Initialize a list to store statistics
stats_list = []

# Analyze each text in the corpus
for file_id in file_ids:
    raw_text = gutenberg.raw(file_id)
    words = nltk.word_tokenize(raw_text)
    sentences = nltk.sent_tokenize(raw_text)
    num_words = len(words)
    num_sentences = len(sentences)
    avg_word_length = sum(len(word) for word in words) / num_words
    vocab_size = len(set(words))
    lexical_diversity = vocab_size / num_words
    stats_list.append({
        'Title': file_id,
        'Num_Words': num_words,
        'Num_Sentences': num_sentences,
        'Avg_Word_Length': avg_word_length,
        'Vocab_Size': vocab_size,
        'Lexical_Diversity': lexical_diversity
    })
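
*The statistics computed in the loop above can be sketched on a single short string. This is a minimal sketch, not the NLTK pipeline: the helper `text_stats` is hypothetical, and the regular expressions are simple stand-ins for `word_tokenize()` and `sent_tokenize()`, so the counts on real texts will differ from NLTK's.*

```python
import re

def text_stats(text):
    """Compute the same statistics as the loop above for one string."""
    # Stand-in tokenizer (assumption): runs of word characters, plus each
    # punctuation mark as its own token
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Stand-in sentence splitter (assumption): split after ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    num_words = len(tokens)
    vocab_size = len(set(tokens))
    return {
        "Num_Words": num_words,
        "Num_Sentences": len(sentences),
        # Guard against empty input to avoid division by zero
        "Avg_Word_Length": sum(len(t) for t in tokens) / num_words if num_words else 0.0,
        "Vocab_Size": vocab_size,
        "Lexical_Diversity": vocab_size / num_words if num_words else 0.0,
    }

stats = text_stats("The cat sat. The dog sat too!")
print(stats)
```

*The sample has 9 tokens, 7 of them distinct, so its lexical diversity is 7/9. Repeated tokens such as "The" and "sat" lower the ratio, and the set is case-sensitive, so "The" and "the" would count as two vocabulary entries.*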

### Create and Display the DataFrame

*We create a pandas DataFrame from the collected statistics for easier analysis and display it within the notebook using `display()`. This allows us to view the computed statistics in a structured tabular format.*

In [None]:
# Create a DataFrame to hold the statistics
stats_df = pd.DataFrame(stats_list)

# Display the statistics table
display(stats_df)

### Set Up the Output Directory

*Here, we define the output path where we'll save the text files and figures. We use `os.makedirs()` with `exist_ok=True` to create the directory if it doesn't already exist, ensuring that our output files have a designated location.*

In [None]:
# Define the output path for saving text files and figures
output_path = "/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/tmp"

# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)
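
*The path above is specific to one machine. As a minimal sketch of a more portable setup, the directory can be placed under the system temp directory; the folder name `osm-cca-nlp-tmp` is a hypothetical choice, not a path used elsewhere in this notebook.*

```python
import os
import tempfile

# Hypothetical location: a subfolder of the system temp directory
output_path = os.path.join(tempfile.gettempdir(), 'osm-cca-nlp-tmp')

# exist_ok=True makes the call safe to repeat
os.makedirs(output_path, exist_ok=True)
os.makedirs(output_path, exist_ok=True)  # no error on the second call

print(output_path, os.path.isdir(output_path))
```

*Without `exist_ok=True`, the second call would raise `FileExistsError`, which matters when a notebook cell is run more than once.*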

### Generate and Display Plots

*In this step, we create plots to visualize the text statistics using Seaborn and Matplotlib. We save each plot as a PNG file in the output directory with `plt.savefig()` and display it inline in the notebook with `plt.show()`. The plots include:*

- *A bar plot of the number of words per text.*
- *A bar plot of the average word length per text.*
- *A scatter plot of vocabulary size versus the number of words.*

*The plotted values come from the `stats_df` DataFrame, which was computed from NLTK's tokenization output.*

In [None]:
# Set up seaborn style
sns.set(style='whitegrid')

# Bar plot of number of words per text
plt.figure(figsize=(10, 6))
sns.barplot(x='Title', y='Num_Words', data=stats_df)
plt.xticks(rotation=45)
plt.title('Number of Words per Text')
plt.tight_layout()
plt.savefig(os.path.join(output_path, 'num_words_per_text.png'))
plt.show()

# Bar plot of average word length per text
plt.figure(figsize=(10, 6))
sns.barplot(x='Title', y='Avg_Word_Length', data=stats_df)
plt.xticks(rotation=45)
plt.title('Average Word Length per Text')
plt.tight_layout()
plt.savefig(os.path.join(output_path, 'avg_word_length_per_text.png'))
plt.show()

# Scatter plot of vocabulary size vs. number of words
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Num_Words', y='Vocab_Size', data=stats_df, hue='Title')
plt.title('Vocabulary Size vs. Number of Words')
plt.legend(loc='best')
plt.tight_layout()
plt.savefig(os.path.join(output_path, 'vocab_size_vs_num_words.png'))
plt.show()

### Save Texts to Disk

*Next, we save each text from the Gutenberg corpus as a plain text file to the specified output directory. We use NLTK's `raw()` method again to retrieve the full text of each file and write it to disk using standard file I/O operations. The Gutenberg file IDs already end in `.txt`, so they can be used directly as file names.*

In [None]:
# Save each text as a plain text file to the output path
for file_id in file_ids:
    raw_text = gutenberg.raw(file_id)
    output_file_path = os.path.join(output_path, file_id)
    with open(output_file_path, 'w', encoding='utf-8') as f:
        f.write(raw_text)

### Load Saved Texts into an NLTK Corpus

*In this final step, we read the saved texts from the output directory back into an NLTK corpus using `PlaintextCorpusReader`. This allows us to treat the collection of saved texts as a corpus for further analysis. `PlaintextCorpusReader` is an NLTK class designed to read plain text files from a directory and create a corpus object.*

In [None]:
from nltk.corpus import PlaintextCorpusReader

# Define the corpus root directory
corpus_root = output_path

# Define the pattern to match the text files. The output directory also
# contains the saved PNG figures, so match only files ending with .txt
# (use '.*' instead to match all files)
file_pattern = r'.*\.txt'

# Create a PlaintextCorpusReader object
new_corpus = PlaintextCorpusReader(corpus_root, file_pattern)

# Access the file IDs in the new corpus
new_file_ids = new_corpus.fileids()
print("Files in the new corpus:", new_file_ids)

# Example: Read words from a specific file
words_in_file = new_corpus.words(new_file_ids[0])
print("First 20 words in", new_file_ids[0], ":", words_in_file[:20])

*By using `PlaintextCorpusReader`, we can load all the saved text files into a new NLTK corpus. The `fileids()` method lists all the files in the corpus, and methods like `words()`, `sents()`, and `paras()` allow us to access words, sentences, and paragraphs, respectively. This demonstrates NLTK's capability to handle custom corpora built from local text files, enabling further text processing and analysis on the newly created corpus.*
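
*The role of the file pattern can be sketched on a throwaway directory. This is a minimal sketch assuming NLTK is installed; the file names and contents are made up for illustration. It only calls `fileids()` and `words()`, which need no downloaded NLTK data, whereas `sents()` and `paras()` would need the Punkt models.*

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Build a small directory with two text files and one non-text file
# (hypothetical names and contents)
demo_root = tempfile.mkdtemp()
samples = {
    'a.txt': 'Hello world. This is a test.',
    'b.txt': 'Another short text.',
    'figure.png': 'not really an image',
}
for name, content in samples.items():
    with open(os.path.join(demo_root, name), 'w', encoding='utf-8') as f:
        f.write(content)

# Only file names matching the regular expression become part of the corpus
demo_corpus = PlaintextCorpusReader(demo_root, r'.*\.txt')
demo_file_ids = demo_corpus.fileids()
demo_words = list(demo_corpus.words('a.txt'))

print(demo_file_ids)
print(demo_words)
```

*The `.png` file is left out of the corpus because its name does not match the pattern. With the pattern `'.*'`, the reader would try to treat it as text as well.*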

## text to pandas dataframe

- example [sustainability communication](https://www.lunduniversity.lu.se/about-university/university-glance/mission-vision-and-values/sustainability)

### Import Libraries and Define Text Cleaning Function

*In this step, we import the necessary libraries and define a function to clean text data. We use the `os` module for file and directory operations, `re` for regular expressions, `pandas` for data manipulation, and `spacy` for natural language processing tasks.*

In [None]:
#| eval: true
#| echo: true
#| output: false

import os
import re
import pandas as pd
import spacy

!python -m spacy download en_core_web_sm

# Function to clean text; non-ASCII removal is disabled, so the text is returned unchanged
def clean_text(text):
    # Remove non-ASCII characters (commented out to preserve UTF-8 text)
    # cleaned_text = re.sub(r'[^\x00-\x7F]+', '', text)
    cleaned_text = text
    return cleaned_text

*The `clean_text` function was written to remove non-ASCII characters using `re.sub`. Because the texts may contain non-ASCII characters that we want to keep (e.g., the Swedish letters å, ä, and ö), the removal line is commented out and the function currently returns the text unchanged.*
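
*To see why the removal line is disabled, here is a minimal sketch of what the commented-out regular expression would do to a Swedish sentence. The sample sentence is made up for illustration.*

```python
import re

# Hypothetical Swedish sample sentence
text = 'Hållbarhet är viktigt för Lunds universitet.'

# The pattern from clean_text: delete every run of non-ASCII characters
ascii_only = re.sub(r'[^\x00-\x7F]+', '', text)

print(text)
print(ascii_only)
```

*The letters å, ä, and ö are deleted rather than replaced, which turns "Hållbarhet" into "Hllbarhet" and "är" into "r". Any later word count or vocabulary statistic would then be computed on damaged words.*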

---

### Set Directory Paths and Initialize Data Structures

*Here, we specify the directory that contains the text files and initialize the data structures for storing the text data. The `directory_path` variable holds the path to that directory. We also initialize an empty list `data` to store the text information and a counter `unique_id` for assigning a unique identifier to each text.*

In [None]:
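# Directory containing text files
# Google Colab path (after cloning the repository)
directory_path = '/content/osm-cca-nlp/res'
# Local path; this assignment overrides the Colab path above, so comment it out when running in Colab
directory_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/res'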

# Initialize an empty list to store the data
data = []

# Initialize a unique ID counter
unique_id = 1

*The `os` module functions will later use `directory_path` to access the files. The `unique_id` will increment for each file, ensuring each text has a unique identifier.*

---

### Read and Clean Text Files

*In this section, we iterate over the text files in the specified directory, read their contents, clean the text using the `clean_text` function, and store the data in the `data` list. The `os.listdir` function lists all files in the directory, and `os.path.join` constructs the full file path.*

In [None]:
# Iterate over the files in the directory (sorted so that the IDs are reproducible)
for filename in sorted(os.listdir(directory_path)):
    # Consider only plain text and Markdown files
    if filename.endswith((".txt", ".md")):
        file_path = os.path.join(directory_path, filename)

        # Read the file content
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Clean the text
        cleaned_text = clean_text(text)

        # Append the data as a dictionary with a unique ID
        data.append({
            'id': unique_id,
            'filename': filename,
            'original_text': text,
            'cleaned_text': cleaned_text
        })

        # Increment the unique ID
        unique_id += 1

*We use `open` with `encoding='utf-8'` to read the files, ensuring that UTF-8 characters are handled correctly. The cleaned text and metadata are stored as dictionaries in the `data` list.*
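
*The same reading pattern can be tried on a throwaway directory. This is a minimal sketch with made-up file names and contents. It sorts the file names, because `os.listdir` returns them in arbitrary order, and it uses `enumerate` in place of a manual counter.*

```python
import os
import tempfile

# Create a temporary directory with three small files (hypothetical content)
demo_dir = tempfile.mkdtemp()
for name, content in [('b.txt', 'second'), ('a.md', 'first'), ('c.csv', 'skipped')]:
    with open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
        f.write(content)

# Keep only .txt and .md files; endswith accepts a tuple of suffixes
matching = [n for n in sorted(os.listdir(demo_dir)) if n.endswith(('.txt', '.md'))]

records = []
# enumerate(start=1) assigns IDs 1, 2, ... in sorted file-name order
for unique_id, filename in enumerate(matching, start=1):
    with open(os.path.join(demo_dir, filename), 'r', encoding='utf-8') as f:
        records.append({'id': unique_id, 'filename': filename, 'original_text': f.read()})

print(records)
```

*Because the names are sorted first, rerunning the code on the same directory gives each file the same ID.*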

---

### Create and Save DataFrame

*We convert the collected data into a Pandas DataFrame for easier manipulation and analysis. We then save this DataFrame as a TSV (Tab-Separated Values) file using the `to_csv` method with `sep='\t'`. The `index=False` parameter ensures that the DataFrame index is not included in the output file.*

In [None]:
# Create a Pandas DataFrame
text_df = pd.DataFrame(data)

# Path of the output TSV file in the 'csv' subdirectory
# Google Colab path
output_file_path = '/content/osm-cca-nlp/csv/text_data.tsv'
# Local path; this assignment overrides the Colab path above
output_file_path = '/home/sol-nhl/rnd/d/quarto/osm-cca-nlp/csv/text_data.tsv'

# Save the DataFrame to a TSV file
text_df.to_csv(output_file_path, sep='\t', index=False)

# Display the DataFrame
print(text_df)

*This step utilizes Pandas' data handling capabilities to structure our text data effectively and save it for future use.*
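
*Full texts usually contain line breaks and sometimes tabs, so it is worth checking that they survive a TSV round trip. The following is a minimal sketch that writes to an in-memory buffer instead of a file; the rows are made up for illustration.*

```python
import io
import pandas as pd

# Made-up rows; the second text contains a line break and a tab
demo_df = pd.DataFrame({
    'id': [1, 2],
    'cleaned_text': ['plain text', 'first line\nsecond\tline'],
})

# Write as TSV without the index, as in the cell above
buffer = io.StringIO()
demo_df.to_csv(buffer, sep='\t', index=False)

# Read the TSV back with the matching separator and quote character
buffer.seek(0)
restored_df = pd.read_csv(buffer, sep='\t', quotechar='"', header=0)

print(restored_df)
```

*With its default settings, `to_csv` wraps fields that contain the separator or a line break in double quotes, and `read_csv` with `quotechar='"'` restores them.*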

---

### Load spaCy Model

*We load a spaCy language model to perform natural language processing tasks. The `spacy.load` function loads the specified model into memory. In this case, we use the small English model `en_core_web_sm`.*

In [None]:
# Load the spaCy model (small English model is used here)
nlp = spacy.load("en_core_web_sm")

*The loaded `nlp` object provides access to spaCy's powerful NLP features, including tokenization, part-of-speech tagging, and sentence segmentation.*
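
*Tokenization and sentence segmentation can be tried without downloading a model. This is a minimal sketch assuming spaCy v3 is installed. It swaps in a blank English pipeline with the rule-based `sentencizer` component instead of `en_core_web_sm`, so the sentence boundaries it finds can differ from those of the trained model. The names `demo_nlp` and `doc` and the sample sentence are made up for illustration.*

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
demo_nlp = spacy.blank("en")
# Rule-based sentence splitter (stand-in for the trained model's segmentation)
demo_nlp.add_pipe("sentencizer")

doc = demo_nlp("Hello world. How are you?")

tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]

print(tokens)
print(sentences)
```

*Calling the pipeline on a string returns a `Doc` object. Iterating over it yields tokens, and `doc.sents` yields sentence spans.*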

---

### Compute Text Statistics

*We calculate word counts, character counts, and sentence counts for each cleaned text in the DataFrame. The pandas `apply` method applies a lambda function to each value in the `cleaned_text` column. For sentence counting, we use spaCy's sentence segmentation by processing the text with `nlp` and accessing the `.sents` attribute.*

In [None]:
# Perform word count and character count on each cleaned text in the DataFrame
text_df['word_count'] = text_df['cleaned_text'].apply(lambda x: len(x.split()))
text_df['character_count'] = text_df['cleaned_text'].apply(lambda x: len(x))

# Perform sentence count using spaCy
text_df['sentence_count'] = text_df['cleaned_text'].apply(lambda x: len(list(nlp(x).sents)))

*The expression `len(x.split())` counts whitespace-separated tokens, so punctuation attached to a word is counted as part of that word. The character count is obtained with `len(x)` and includes spaces and line breaks. For the sentence count, we process the text with the spaCy model and convert the `sents` generator to a list to count the sentences.*
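
*The word and character counts can be checked on a tiny DataFrame. The following is a minimal sketch with made-up texts; it also shows the equivalent pandas `.str` accessor calls, which give the same numbers without a lambda.*

```python
import pandas as pd

# Made-up texts; the second has a double space and a line break
demo_df = pd.DataFrame({'cleaned_text': ['Lund is a city.', 'Two  spaces\nand a newline']})

# Same approach as above: apply a function to each value
demo_df['word_count'] = demo_df['cleaned_text'].apply(lambda x: len(x.split()))
demo_df['character_count'] = demo_df['cleaned_text'].apply(len)

# Equivalent calls using the .str accessor
demo_df['word_count_str'] = demo_df['cleaned_text'].str.split().str.len()
demo_df['character_count_str'] = demo_df['cleaned_text'].str.len()

print(demo_df.drop(columns='cleaned_text'))
```

*`split()` without arguments treats any run of whitespace as one separator, so the double space and the line break do not create empty words.*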

---

### Display DataFrame with Selected Columns

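*Finally, we display the DataFrame without the 'original_text' and 'cleaned_text' columns for brevity. The `columns.difference` method returns the column labels that are not in the given list, and we use the result to select the columns to display. Note that it returns the labels in sorted order, so the columns appear alphabetically rather than in their original order.*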

In [None]:
# Select and print all columns except 'original_text' and 'cleaned_text'
columns_to_display = text_df.columns.difference(['original_text', 'cleaned_text'])
print(text_df[columns_to_display])

*This step showcases the metadata and statistical information we've gathered, such as the unique ID, filename, word count, character count, and sentence count, without displaying the potentially lengthy text content.*
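
*The column order produced by `columns.difference` can be compared with `drop(columns=...)`, which is an alternative way to leave out columns. This is a minimal sketch on a one-row DataFrame with made-up values.*

```python
import pandas as pd

# One made-up row with a subset of the columns used above
demo_df = pd.DataFrame({
    'id': [1],
    'filename': ['a.txt'],
    'original_text': ['Some text.'],
    'cleaned_text': ['Some text.'],
    'word_count': [2],
})

excluded = ['original_text', 'cleaned_text']

# Index.difference returns the remaining labels in sorted order
by_difference = list(demo_df.columns.difference(excluded))

# drop(columns=...) keeps the remaining columns in their original order
by_drop = list(demo_df.drop(columns=excluded).columns)

print(by_difference)
print(by_drop)
```

*Both approaches select the same columns. They differ only in order: `difference` sorts alphabetically, while `drop` keeps `id` first.*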

---

**Summary:**

- **os module:** Used for interacting with the operating system, listing directory contents, and constructing file paths.
- **re module:** Provides regular expression matching operations for text cleaning (though in this code, the regex is commented out).
- **pandas:** Used for creating and manipulating the DataFrame to store text data and computed statistics.
- **spacy:** Provides advanced NLP capabilities; we load a language model to perform sentence segmentation for counting sentences.
- **apply and lambda functions in pandas:** Used to apply functions to DataFrame columns for calculating word counts, character counts, and sentence counts.

This modular approach allows for easy understanding and maintenance of the code, with each section handling a specific part of the text processing pipeline.
