## Introduction
The `wordwright` package is a tool for textual data analysis, suited to extracting insights from reviews, feedback, or other written content. In this vignette, we demonstrate how `wordwright` can be used to analyze Airbnb reviews, walking through loading, cleaning, and word frequency analysis. Whether you are a data scientist, a student, or simply curious about text analytics, its functions help turn raw text into data you can act on.

## Getting Started with Text Files
Before our analysis, we need to compose several customer reviews and save them as text files for processing.

In [28]:
# Long positive review of a city apartment
review1 = "Our stay at Skyline Plaza was truly exceptional. The apartment offered breathtaking views of the city skyline, making every morning a luxurious experience. Inside, modern amenities like a fully-equipped kitchen, high-speed internet, and a smart TV provided exceptional comfort. The spacious bedrooms, each with ultra-comfortable beds, ensured restful nights after exploring the city's breathtaking sights. Located at the heart of the city, the apartment's proximity to public transport, diverse restaurants, and shops was incredibly convenient. Our host was wonderfully accommodating, providing a detailed city guide and personal recommendations for dining with views of the skyline. For anyone seeking a mix of modern luxury and convenient city living, this apartment is a perfect choice. Its luxurious amenities and prime location make it a standout choice for an exceptional city experience."

# Short positive review of a cozy cottage
review2 = "Cozy Cottage was a quiet, charming escape from the busy city life. The garden was delightful, and the location was quiet yet convenient. A perfect weekend getaway! Highly recommend!"

# Medium-length negative review
review3 = "Although the location nestled right in the heart of the city, our stay was marred by disappointment. The first sign of trouble was the linens; they had a strange, unpleasant smell, as if they hadn't been washed. The bathroom, too, has a strange odor that made it feel unclean and unpleasant. Overall, the lack of cleanliness cast a shadow over our visit, turning our stay into an unpleasant experience. The location, unfortunately, couldn't make up for these  disappointing aspects."

# Writing the reviews to text files
filenames = ["long_pos_review.txt", 
             "short_pos_review.txt", 
             "medium_neg_review.txt"]
reviews = [review1, review2, review3]

for filename, review in zip(filenames, reviews):
    with open(filename, "w", encoding="utf-8") as file:
        file.write(review)

Now that we have our text files, let's dive into each function and see how it contributes to our text analysis workflow.

## Loading Text Data
The `load_text` function reads the content of a text file without altering it. For our Airbnb reviews, we have three different text files containing reviews of varying lengths and details. We'll load each one and prepare them for cleaning.

In [43]:
from wordwright.preprocessing import load_text
raw_long_pos_review = load_text("long_pos_review.txt")
print(raw_long_pos_review)

Our stay at Skyline Plaza was truly exceptional. The apartment offered breathtaking views of the city skyline, making every morning a luxurious experience. Inside, modern amenities like a fully-equipped kitchen, high-speed internet, and a smart TV provided exceptional comfort. The spacious bedrooms, each with ultra-comfortable beds, ensured restful nights after exploring the city's breathtaking sights. Located at the heart of the city, the apartment's proximity to public transport, diverse restaurants, and shops was incredibly convenient. Our host was wonderfully accommodating, providing a detailed city guide and personal recommendations for dining with views of the skyline. For anyone seeking a mix of modern luxury and convenient city living, this apartment is a perfect choice. Its luxurious amenities and prime location make it a standout choice for an exceptional city experience.


In [44]:
raw_short_pos_review = load_text("short_pos_review.txt")
print(raw_short_pos_review)

Cozy Cottage was a quiet, charming escape from the busy city life. The garden was delightful, and the location was quiet yet convenient. A perfect weekend getaway! Highly recommend!


In [45]:
raw_medium_neg_review = load_text("medium_neg_review.txt")
print(raw_medium_neg_review)

Although the location nestled right in the heart of the city, our stay was marred by disappointment. The first sign of trouble was the linens; they had a strange, unpleasant smell, as if they hadn't been washed. The bathroom, too, has a strange odor that made it feel unclean and unpleasant. Overall, the lack of cleanliness cast a shadow over our visit, turning our stay into an unpleasant experience. The location, unfortunately, couldn't make up for these  disappointing aspects.
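
To make the idea of loading text "without altering it" concrete, here is a minimal sketch of what such a loader might look like. The name `load_text_sketch` and its `encoding` parameter are hypothetical and are not part of `wordwright`; the actual implementation of `load_text` may differ.

```python
import os
import tempfile
from pathlib import Path


def load_text_sketch(path: str, encoding: str = "utf-8") -> str:
    """Hypothetical stand-in for load_text: return the file content as a string."""
    return Path(path).read_text(encoding=encoding)


# Round-trip check: write a short sample to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_review.txt")
    Path(sample_path).write_text("A perfect weekend getaway!", encoding="utf-8")
    loaded = load_text_sketch(sample_path)

print(loaded)
```

The sample comes back with its capitalization and punctuation intact, which is the behavior the outputs above suggest for `load_text`.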


## Cleaning the Text
The `clean_text` function removes punctuation and extra whitespace and converts all words to lowercase, which makes the text uniform and easier to analyze, especially for word counting and searching (as shown in later sections). Cleaning first ensures that our analysis is not skewed by formatting inconsistencies.

In [46]:
from wordwright.preprocessing import clean_text

In [48]:
cleaned_short_pos_review = clean_text(raw_short_pos_review)
print(cleaned_short_pos_review)

cozy cottage was a quiet charming escape from the busy city life the garden was delightful and the location was quiet yet convenient a perfect weekend getaway highly recommend


Now let's clean the other two reviews. Note that cleaning preserves some nuances of English: apostrophes that are part of contractions or possessive forms, as in "hadn't" or "city's", are kept, so the text retains its original meaning. Hyphens, by contrast, are removed, so "fully-equipped" becomes "fullyequipped".

In [49]:
cleaned_long_pos_review = clean_text(raw_long_pos_review)
print(cleaned_long_pos_review)

our stay at skyline plaza was truly exceptional the apartment offered breathtaking views of the city skyline making every morning a luxurious experience inside modern amenities like a fullyequipped kitchen highspeed internet and a smart tv provided exceptional comfort the spacious bedrooms each with ultracomfortable beds ensured restful nights after exploring the city's breathtaking sights located at the heart of the city the apartment's proximity to public transport diverse restaurants and shops was incredibly convenient our host was wonderfully accommodating providing a detailed city guide and personal recommendations for dining with views of the skyline for anyone seeking a mix of modern luxury and convenient city living this apartment is a perfect choice its luxurious amenities and prime location make it a standout choice for an exceptional city experience


In [50]:
cleaned_medium_neg_review = clean_text(raw_medium_neg_review)
print(cleaned_medium_neg_review)

although the location nestled right in the heart of the city our stay was marred by disappointment the first sign of trouble was the linens they had a strange unpleasant smell as if they hadn't been washed the bathroom too has a strange odor that made it feel unclean and unpleasant overall the lack of cleanliness cast a shadow over our visit turning our stay into an unpleasant experience the location unfortunately couldn't make up for these disappointing aspects
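
The behavior visible in these outputs (lowercasing, dropping punctuation other than apostrophes, collapsing repeated spaces) can be approximated with a short regular expression. This is only a sketch under that assumption; `clean_text_sketch` is a hypothetical name, and `wordwright`'s own `clean_text` may be implemented differently.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Hypothetical approximation of the cleaning behavior shown above."""
    lowered = text.lower()
    # Drop every character that is not a word character, whitespace, or apostrophe.
    without_punctuation = re.sub(r"[^\w\s']", "", lowered)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(without_punctuation.split())


sample = "Highly  recommend! The city's high-speed internet."
cleaned_sample = clean_text_sketch(sample)
print(cleaned_sample)
# highly recommend the city's highspeed internet
```

Because the hyphen is removed rather than replaced with a space, hyphenated compounds merge into one token, which matches "highspeed" and "ultracomfortable" in the cleaned long review.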


## Analyzing Word Frequency

### Integrated Text Cleaning
The `frequent_words` function incorporates text cleaning into its process, so users don't need to clean their text separately with `clean_text` before analysis; `frequent_words` handles both cleaning and word frequency analysis in one step. Specifically, it counts the occurrences of each unique word.

In [52]:
from wordwright.word_frequency import frequent_words
frequent_words(raw_short_pos_review)

Counter({'was': 3,
         'the': 3,
         'a': 2,
         'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'from': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'and': 1,
         'location': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})
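
As a rough illustration of how cleaning and counting can be combined in one step, the sketch below cleans the text inline and then feeds the words to `collections.Counter`. The function name `count_words_sketch` and the cleaning rule are assumptions made for illustration, not the package's actual code.

```python
import re
from collections import Counter


def count_words_sketch(raw_text: str) -> Counter:
    """Hypothetical one-step clean-and-count, not wordwright's implementation."""
    # Assumed cleaning rule: lowercase, then keep only word characters,
    # whitespace, and apostrophes.
    cleaned = re.sub(r"[^\w\s']", "", raw_text.lower())
    # split() with no arguments also absorbs repeated whitespace.
    return Counter(cleaned.split())


word_counts = count_words_sketch("The garden was delightful, and the location was quiet.")
print(word_counts["the"], word_counts["was"], word_counts["garden"])
# 2 2 1
print(word_counts["pool"])  # a Counter returns 0 for words it has not seen
# 0
```

Returning a `Counter` rather than a plain dictionary means lookups for absent words give 0 instead of raising `KeyError`.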

### Stopwords Filtering
In text analysis, the frequency of words can reveal a lot about the content. However, not all words carry the same weight in conveying meaning. This is where the stopword filtering offered by the `frequent_words` function of the `wordwright` package becomes particularly valuable.

**User-defined Stopwords List**  
Stopwords are commonly used words in a language, like "the," "is," and "in," which may not add significant meaning. The `frequent_words` function accepts a user-defined stopwords list and removes those words, so the counts focus on the more meaningful words that could provide insights into the text.

In [62]:
user_stopwords = ["the", "a", "is", "was", "and", "yet", "but", 
                  "from", "to", "in", "at", "on"]
frequent_words(raw_short_pos_review, stopwords=user_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'location': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})
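
Stopword filtering itself is simple to picture: skip any word that appears in the stopwords collection before counting. The sketch below assumes the words are already cleaned and lowercase; `filter_counts_sketch` is a hypothetical helper, not a `wordwright` function.

```python
from collections import Counter


def filter_counts_sketch(words, stopwords=None):
    """Hypothetical helper: count words, skipping any that are stopwords."""
    # A set gives fast membership checks; None means "no stopwords".
    stopword_set = set(stopwords or [])
    return Counter(word for word in words if word not in stopword_set)


sample_words = "the location was quiet yet convenient".split()
filtered = filter_counts_sketch(sample_words, stopwords=["the", "was", "yet"])
print(filtered)
# Counter({'location': 1, 'quiet': 1, 'convenient': 1})
```

Since the comparison is an exact string match, stopwords should be written in lowercase to match the cleaned text.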

**Leveraging NLTK's Comprehensive Stopwords List**  
For a more thorough cleaning of our text data, we can use the Natural Language Toolkit (NLTK), a leading package for working with human language data (see details [here](https://www.nltk.org/)). NLTK provides an extensive list of stopwords commonly used in English.

In [60]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
nltk_stopwords = stopwords.words('english')

In [61]:
frequent_words(raw_short_pos_review, stopwords=nltk_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'location': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

**Extending and Customizing Stopwords List**   
In scenarios where the default NLTK stopwords are not sufficient for in-depth analysis, users can extend the NLTK stopwords list with their own set of words.

In [67]:
# Start with the default NLTK stopwords
custom_stopwords = set(nltk_stopwords)

# Define additional stopwords that are irrelevant to the analysis
additional_stopwords = {'weekend', 'location', 'life', 'city'}

# Combine the default stopwords with the additional ones
custom_stopwords.update(additional_stopwords)

# Convert the set back to a list (a set already holds only unique words)
custom_stopwords = list(custom_stopwords)

# Display the most common words excluding the customized stopwords
frequent_words(raw_short_pos_review, stopwords=custom_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'garden': 1,
         'delightful': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})
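
When extending a stopwords list, it can help to check exactly which words the extra entries removed. The sketch below does this with plain sets and `Counter`; the small `base_stopwords` set is a stand-in chosen for illustration, not the NLTK list, and none of the names come from `wordwright`.

```python
from collections import Counter

# Stand-in for a default stopwords list (assumption: much shorter than NLTK's).
base_stopwords = {"the", "a", "was", "and"}
extra_stopwords = {"city", "life"}

# Set union merges both collections and drops duplicates automatically.
combined_stopwords = base_stopwords | extra_stopwords

words = "the garden was a quiet escape from city life".split()
counts_before = Counter(w for w in words if w not in base_stopwords)
counts_after = Counter(w for w in words if w not in combined_stopwords)

# Words present before but absent after were removed by the extra stopwords.
removed_by_extras = sorted(set(counts_before) - set(counts_after))
print(removed_by_extras)
# ['city', 'life']
```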

### Displaying Most Frequent Words
The output of the `frequent_words` function is a `Counter` object (see details of `Counter` [here](https://docs.python.org/3/library/collections.html#collections.Counter)), a specialized dictionary from Python's `collections` module designed for counting hashable objects.
One useful method available on a `Counter` object is `most_common()` (see details of the method [here](https://docs.python.org/3/library/collections.html#collections.Counter.most_common)). This method returns the items and their counts as a `list` of `(word, count)` tuples, ordered from most to least frequent. You can also pass a number N to return only the top N most frequent words.

This method is particularly handy when you are dealing with large texts and want to focus on the most relevant words, such as key themes in customer reviews or the most frequently mentioned terms.

In [73]:
top_10_long_pos_review = frequent_words(raw_long_pos_review, 
                                        stopwords=nltk_stopwords
                                       ).most_common(10)
top_10_long_pos_review

[('city', 5),
 ('skyline', 3),
 ('exceptional', 3),
 ('apartment', 2),
 ('breathtaking', 2),
 ('views', 2),
 ('luxurious', 2),
 ('experience', 2),
 ('modern', 2),
 ('amenities', 2)]

In [75]:
top_3_short_pos_review = frequent_words(raw_short_pos_review, 
                                        stopwords=nltk_stopwords
                                       ).most_common(3)
top_3_short_pos_review

[('quiet', 2), ('cozy', 1), ('cottage', 1)]

In [77]:
top_5_medium_neg_review = frequent_words(raw_medium_neg_review, 
                                         stopwords=nltk_stopwords
                                        ).most_common(5)
top_5_medium_neg_review

[('unpleasant', 3),
 ('location', 2),
 ('stay', 2),
 ('strange', 2),
 ('although', 1)]
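
The ordering in these results follows from how `Counter.most_common()` works, which can be seen on a tiny example that uses only the standard library. The sample words are made up for illustration.

```python
from collections import Counter

sample_counts = Counter("quiet cozy cottage quiet garden cozy quiet".split())

# Top 2 items as (word, count) tuples, highest count first.
top_two = sample_counts.most_common(2)
print(top_two)
# [('quiet', 3), ('cozy', 2)]

# With no argument, every item is returned in descending order of count.
all_items = sample_counts.most_common()
print(len(all_items))
# 4

# The list of tuples converts directly to a dictionary if that is more convenient.
top_two_as_dict = dict(top_two)
print(top_two_as_dict["quiet"])
# 3
```

Words with equal counts are returned in the order they were first encountered, which is why a word such as "although" (the first word of the negative review) can appear in a top-N list despite occurring only once.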