# Wordwright Tutorial

Text is everywhere in today's digital world and serves as more than just a form of communication. From brief tweets to in-depth blog posts, and from academic papers to business emails, written content is abundant, so the ability to read, analyze, and derive meaning from it is crucial.

In this vignette, we demonstrate how `wordwright` can be used to analyze customer reviews similar to those on rental platforms like Airbnb. Such reviews, often submitted post-visit, are useful in helping hosts enhance their accommodations and services. These reviews often reveal aspects not covered in descriptions, like bedding comfort or local tips that make a stay special. With `wordwright`, users can efficiently extract key features, gaining insights that drive informed decision-making and service improvement.

## Imports

In [1]:
from wordwright.preprocessing import load_text
from wordwright.preprocessing import clean_text
from wordwright.language_detection import language_detection
from wordwright.count_sentences import count_sentences
from wordwright.word_frequency import frequent_words
from wordwright.count_keywords import count_keywords

## Getting Started with Text Files
We compose two customer reviews of varying lengths and details and save them as text files.

In [2]:
# Long positive review of a city apartment
review1 = """Our stay at Skyline Plaza was truly exceptional. The apartment offered 
breathtaking views of the city skyline, making every morning a luxurious experience. 
Inside, modern amenities like a fully-equipped kitchen, high-speed internet, 
and a smart TV provided exceptional comfort. The spacious bedrooms, each with 
ultra-comfortable beds, ensured restful nights after exploring the city's 
breathtaking sights. Located at the heart of the city, the apartment's proximity 
to public transport, diverse restaurants, and shops was incredibly convenient. 
Our host was wonderfully accommodating, providing a detailed city guide and 
personal recommendations for dining with views of the skyline. For anyone seeking 
a mix of modern luxury and convenient city living, this apartment is a 
perfect choice. Its luxurious amenities and prime location make it a 
standout choice for an exceptional city experience.
"""

# Short positive review of a cozy cottage
review2 = """Cozy Cottage was a quiet, charming escape from the busy city life. 
The garden was delightful, and the location was quiet yet convenient. 
A perfect weekend getaway! Highly recommend!
"""

# Writing the reviews to text files
filenames = ["longer_review.txt",  
             "shorter_review.txt"]
reviews = [review1, review2]

for filename, review in zip(filenames, reviews):
    with open(filename, "w") as file:
        file.write(review)

## Function 1: Loading Text Data
The `load_text` function reads the content of a text file without modification. 

In [3]:
raw_longer_review = load_text("longer_review.txt")
print(raw_longer_review)

Our stay at Skyline Plaza was truly exceptional. The apartment offered 
breathtaking views of the city skyline, making every morning a luxurious experience. 
Inside, modern amenities like a fully-equipped kitchen, high-speed internet, 
and a smart TV provided exceptional comfort. The spacious bedrooms, each with 
ultra-comfortable beds, ensured restful nights after exploring the city's 
breathtaking sights. Located at the heart of the city, the apartment's proximity 
to public transport, diverse restaurants, and shops was incredibly convenient. 
Our host was wonderfully accommodating, providing a detailed city guide and 
personal recommendations for dining with views of the skyline. For anyone seeking 
a mix of modern luxury and convenient city living, this apartment is a 
perfect choice. Its luxurious amenities and prime location make it a 
standout choice for an exceptional city experience.



In [4]:
raw_shorter_review = load_text("shorter_review.txt")
print(raw_shorter_review)

Cozy Cottage was a quiet, charming escape from the busy city life. 
The garden was delightful, and the location was quiet yet convenient. 
A perfect weekend getaway! Highly recommend!



## Function 2: Cleaning the Text
The `clean_text` function removes punctuation, collapses whitespace (including line breaks) into single spaces, and converts all words to lowercase. This removes formatting inconsistencies so that the text is uniform for the analyses in later sections.

In [5]:
cleaned_shorter_review = clean_text(raw_shorter_review)
print(cleaned_shorter_review)

cozy cottage was a quiet charming escape from the busy city life the garden was delightful and the location was quiet yet convenient a perfect weekend getaway highly recommend


Our `clean_text` function preserves apostrophes that are part of contractions or possessive forms, such as "city's", as shown below. This keeps such words intact, so the analysis retains the original meaning of the text. Hyphens, by contrast, are removed: "fully-equipped" becomes "fullyequipped".

In [6]:
cleaned_longer_review = clean_text(raw_longer_review)
print(cleaned_longer_review)

our stay at skyline plaza was truly exceptional the apartment offered breathtaking views of the city skyline making every morning a luxurious experience inside modern amenities like a fullyequipped kitchen highspeed internet and a smart tv provided exceptional comfort the spacious bedrooms each with ultracomfortable beds ensured restful nights after exploring the city's breathtaking sights located at the heart of the city the apartment's proximity to public transport diverse restaurants and shops was incredibly convenient our host was wonderfully accommodating providing a detailed city guide and personal recommendations for dining with views of the skyline for anyone seeking a mix of modern luxury and convenient city living this apartment is a perfect choice its luxurious amenities and prime location make it a standout choice for an exceptional city experience
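
The cleaning behavior shown in the outputs above can be approximated with Python's standard `re` module. The sketch below is an illustration only: `clean_text_sketch` is a hypothetical name, and the actual `wordwright` implementation may differ, for example in how it treats digits or quotation marks.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Approximate the cleaning shown above (an assumption, not wordwright's code)."""
    lowered = text.lower()
    # Remove every character that is not a word character, whitespace, or apostrophe
    without_punctuation = re.sub(r"[^\w\s']", "", lowered)
    # Collapse runs of spaces and line breaks into single spaces
    return " ".join(without_punctuation.split())


print(clean_text_sketch("The city's fully-equipped kitchen\nwas exceptional!"))
# the city's fullyequipped kitchen was exceptional
```

Because the apostrophe is simply excluded from the removal pattern, this sketch would also keep apostrophes used as single quotation marks, which a production cleaner might want to handle separately.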


## Function 3: Detecting Language

The `language_detection` function determines whether text is in English. Language detection is a vital feature for global platforms like Airbnb, which receive reviews in multiple languages. Since text analysis tools are typically language-specific, applying an English-focused tool to non-English text can lead to inaccurate results.

This function takes a string as input, returning "English" for English text, "Not English" for other languages, and "Language detection error" in case of detection issues. Upon detecting English, we can proceed with the subsequent text analysis.

In [7]:
language_detection(raw_shorter_review)

'English'

For non-English text, such as the Spanish review below, the function returns "Not English".

In [8]:
review_in_spanish = """
La estancia fue encantadora, con un servicio impecable, 
un ambiente acogedor y una ubicación privilegiada que 
superó todas las expectativas. ¡Una experiencia 
verdaderamente memorable!
"""
language_detection(review_in_spanish)

'Not English'
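
The three return values described above suggest a thin wrapper around some underlying detector. The sketch below shows one way such a wrapper could look. Everything in it is hypothetical: `label_language` is not part of `wordwright`, and `toy_detect` is a deliberately naive stand-in (it only checks the share of a few English function words) so that the example runs without third-party packages.

```python
ENGLISH_HINTS = {"the", "and", "was", "a", "of", "is", "to"}


def toy_detect(text):
    """Naive stand-in detector (assumption): returns "en" or "other"."""
    words = text.lower().split()
    if not words:
        raise ValueError("no text to analyze")
    share = sum(word in ENGLISH_HINTS for word in words) / len(words)
    return "en" if share >= 0.2 else "other"


def label_language(text, detect=toy_detect):
    """Map a detector's result onto the three labels described above."""
    try:
        code = detect(text)
    except Exception:
        # Any failure inside the detector becomes the error label
        return "Language detection error"
    return "English" if code == "en" else "Not English"


reviews = [
    "The garden was delightful and the location was quiet",
    "La estancia fue encantadora",
    "",
]
# Keep only the reviews that are safe to pass to English-specific tools
english_reviews = [r for r in reviews if label_language(r) == "English"]
print(len(english_reviews))  # 1
```

Passing the detector in as an argument keeps the labeling logic independent of any particular detection library.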

## Function 4: Counting Sentences

The `count_sentences` function determines the number of sentences in a given text. Total sentence count can serve as a useful measure of customer engagement, with a greater number of sentences typically reflecting deeper interaction. It also helps estimate information density, with higher sentence counts indicating more comprehensive feedback.

The function lets users specify which punctuation marks (such as periods, exclamation points, and question marks) indicate sentence breaks, and it counts sentences based on those marks.

In [9]:
# Separate sentences at periods and exclamation marks
customer_1 = count_sentences(raw_longer_review, [".", "!"]) 

There are 8 sentence(s), which is splited by ['.', '!'].


In [10]:
# Separate sentences at periods and exclamation marks
customer_2 = count_sentences(raw_shorter_review, [".", "!"]) 

There are 4 sentence(s), which is splited by ['.', '!'].
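
Delimiter-based sentence counting of this kind can be sketched with `re.split`. The function below is a hypothetical illustration, not the `wordwright` implementation: it assumes each delimiter is a plain string, and it returns the count instead of printing a message.

```python
import re


def count_sentences_sketch(text, delimiters):
    """Count non-empty pieces of text separated by the given delimiters."""
    if not delimiters:
        # Without delimiters, any non-blank text counts as one sentence
        return 1 if text.strip() else 0
    # Build an alternation such as "\.|!" from the delimiter list
    pattern = "|".join(re.escape(d) for d in delimiters)
    pieces = re.split(pattern, text)
    # Ignore empty pieces, e.g. after the final delimiter or between "!!"
    return sum(1 for piece in pieces if piece.strip())


sample = "A perfect weekend getaway! Highly recommend! Would we return? Yes."
print(count_sentences_sketch(sample, [".", "!"]))       # 3
print(count_sentences_sketch(sample, [".", "!", "?"]))  # 4
```

The choice of delimiters changes the result, and purely punctuation-based splitting will overcount text that contains abbreviations such as "e.g." or decimal numbers.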


## Function 5: Analyzing Word Frequency

### Integrated Text Cleaning
The `frequent_words` function cleans the text as part of its processing, so users do not need to call `clean_text` before analysis; `frequent_words` handles both cleaning and word frequency analysis in one step. It then counts the occurrences of each unique word.

In [11]:
frequent_words(raw_shorter_review)

Counter({'was': 3,
         'the': 3,
         'a': 2,
         'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'from': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'and': 1,
         'location': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

### Stopwords Filtering
In text analysis, word frequencies can reveal a lot about the content. However, not all words carry the same weight in conveying meaning. The `frequent_words` function therefore accepts a list of stopwords to exclude from the count.

**User-defined Stopwords List**  
Stopwords are commonly used words, like "the," "is," and "in," which may not add significant meaning. Passing a user-defined list through the `stopwords` argument removes them, so the counts focus on more meaningful words.

In [12]:
user_stopwords = ["the", "a", "is", "was", "and", "yet", "but", 
                  "from", "to", "in", "at", "on"]
frequent_words(raw_shorter_review, stopwords=user_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'location': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})
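
The combination of cleaning, counting, and stopword removal described above can be sketched with `collections.Counter`. This is a hypothetical approximation (`frequent_words_sketch` is not the `wordwright` function), and it assumes stopwords are matched exactly against lowercased words.

```python
import re
from collections import Counter


def frequent_words_sketch(text, stopwords=None):
    """Count words after basic cleaning, skipping any stopwords."""
    # Lowercase and drop punctuation other than apostrophes
    cleaned = re.sub(r"[^\w\s']", "", text.lower())
    # A set gives fast membership checks and ignores duplicate stopwords
    excluded = set(stopwords or [])
    return Counter(word for word in cleaned.split() if word not in excluded)


text = "The garden was delightful, and the location was quiet."
print(frequent_words_sketch(text)["the"])  # 2
print(frequent_words_sketch(text, stopwords=["the", "was", "and"]))
```

One pitfall when typing stopword lists by hand: Python joins adjacent string literals, so a missing comma in `"in" "at"` silently creates the single stopword `"inat"` instead of raising an error.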

**Leveraging NLTK's Comprehensive Stopwords List**  
For a more thorough cleaning of our text data, we can use the Natural Language Toolkit (NLTK) (Loper & Bird, 2002), a leading package for working with human language data (see details [here](https://www.nltk.org/)). NLTK provides an extensive list of stopwords commonly used in English.

In [13]:
# Imports NLTK stopwords and creates a list of English stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
nltk_stopwords = stopwords.words('english')

In [14]:
frequent_words(raw_shorter_review, stopwords=nltk_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'location': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

**Extending and Customizing Stopwords List**   
In scenarios where the default NLTK stopwords may not be sufficient, users can extend the NLTK stopwords list with their own set of words.

In [15]:
# Start with the default NLTK stopwords
custom_stopwords = set(nltk_stopwords)

# Define additional stopwords that are irrelevant to the analysis
additional_stopwords = {'weekend', 'location', 'life', 'city', 'yet'}

# Combine the default stopwords with the additional ones
custom_stopwords.update(additional_stopwords)

# Convert the set (which already contains no duplicates) back to a list
custom_stopwords = list(custom_stopwords)

# Display the most common words excluding the customized stopwords
frequent_words(raw_shorter_review, stopwords=custom_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'garden': 1,
         'delightful': 1,
         'convenient': 1,
         'perfect': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

### Displaying the Most Frequent Words
The `frequent_words` function returns a `Counter` object (see details [here](https://docs.python.org/3/library/collections.html#collections.Counter)), a specialized dictionary from Python's `collections` module.
One useful method of a `Counter` object is `most_common()` (see details [here](https://docs.python.org/3/library/collections.html#collections.Counter.most_common)). It returns a list of the N most frequent items and their counts, where N is specified by the user. This method is particularly handy when you want to focus on the most relevant words.

In [16]:
top_10_long_pos_review = frequent_words(raw_longer_review, 
                                        stopwords=nltk_stopwords
                                       ).most_common(10)
top_10_long_pos_review

[('city', 5),
 ('skyline', 3),
 ('exceptional', 3),
 ('apartment', 2),
 ('breathtaking', 2),
 ('views', 2),
 ('luxurious', 2),
 ('experience', 2),
 ('modern', 2),
 ('amenities', 2)]
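
The behavior of `most_common()` can be checked on a small, hand-built `Counter` using only the standard library. The word list below is made up for illustration.

```python
from collections import Counter

word_counts = Counter(["city", "skyline", "city", "views", "skyline", "city"])

# Ask for the two most frequent words
print(word_counts.most_common(2))  # [('city', 3), ('skyline', 2)]

# With no argument, every item is returned, most frequent first
print(word_counts.most_common())   # [('city', 3), ('skyline', 2), ('views', 1)]

# Items with equal counts keep the order in which they were first seen
tied = Counter(["quiet", "cozy", "quiet", "cozy", "garden"])
print(tied.most_common(1))         # [('quiet', 2)]
```

The tie-breaking rule matters when reading a top-N list: a word that just misses the cut may be as frequent as the last word included.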

## Function 6: Counting Keywords

The `count_keywords` function searches for specific words in a text, serving as a practical tool for filtering content. For instance, by identifying the presence and frequency of particular keywords, users can select relevant texts for deeper analysis or exclude those that do not meet their criteria. The `count_keywords` function has a built-in text cleaning feature, eliminating the need for users to pre-clean their text. 

In [18]:
count_keywords(raw_longer_review, 
               ['breathtaking', 'exceptional', 'city\'s']
              )

{'breathtaking': 2, 'exceptional': 3, "city's": 1}

**Critical Remark**: Because `count_keywords` applies the same rules as `clean_text`, it forces lowercase and strips punctuation such as hyphens, so "fully-equipped" is treated as "fullyequipped". These rules apply to both the text and the specified keywords. As a result, the keyword "fullyequipped" has a count of 1 even though it never appears in the text as a literal word, and the keyword "fully-equipped" is reported under its cleaned form "fullyequipped".

In [19]:
count_keywords(raw_longer_review, ['fullyequipped'])

{'fullyequipped': 1}

In [20]:
count_keywords(raw_longer_review, ['fully-equipped'])

{'fullyequipped': 1}

**Robustness**: The `count_keywords` function is designed to tolerate messy keyword lists. If a keyword is accidentally repeated, capitalized, or entered with extra punctuation, the function normalizes these variants and reports a single count, which makes the analysis more resilient to input errors.

In [21]:
count_keywords(raw_shorter_review, ['cozy'])

{'cozy': 1}

In [22]:
count_keywords(raw_shorter_review, ['cozy', 'cozy', 'Cozy', 'cozy!'])

{'cozy': 1}
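
The normalization behavior shown above, where keywords are cleaned the same way as the text and duplicates collapse into one entry, can be sketched as follows. `count_keywords_sketch` and `_normalize` are hypothetical names, and skipping keywords that do not reduce to a single word is an assumption of this sketch, not documented `wordwright` behavior.

```python
import re
from collections import Counter


def _normalize(text):
    """Lowercase, drop punctuation other than apostrophes, split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def count_keywords_sketch(text, keywords):
    """Count each keyword in the text after cleaning both in the same way."""
    counts = Counter(_normalize(text))
    result = {}
    for keyword in keywords:
        tokens = _normalize(keyword)
        if len(tokens) != 1:
            # Assumption: skip empty or multi-word keywords
            continue
        # Variants such as "Cozy" and "cozy!" all map to the same dictionary key
        result[tokens[0]] = counts[tokens[0]]
    return result


print(count_keywords_sketch("Cozy Cottage was a cozy escape!", ["cozy", "Cozy", "cozy!"]))
# {'cozy': 2}
print(count_keywords_sketch("A fully-equipped kitchen.", ["fully-equipped"]))
# {'fullyequipped': 1}
```

Using the cleaned keyword as the dictionary key is what makes repeated or differently punctuated variants collapse into a single result.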

## Reference

Loper, E., & Bird, S. (2002). NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028.