# Wordwright Tutorial

## Motivation
Text is everywhere, and it is more than a means of communication. From brief tweets to in-depth blog posts, from academic papers to business emails, the digital landscape is full of textual content. In such a text-saturated environment, the ability to read, analyze, and derive meaning from written content is essential. This is where our text analysis package `wordwright` comes in. It is not only a text-processing tool but also a bridge to understanding an overwhelming volume of text. The package supports a range of professionals by removing irrelevant punctuation, counting keywords and sentences, detecting whether a text is written in English, and ranking words by frequency. Marketers, journalists, educators, and healthcare professionals are among those who could benefit from it.

## Introduction
In this vignette, we demonstrate how `wordwright` can be used to analyze customer reviews similar to those typically seen on rental platforms like Airbnb. After a visit, guests often take the time to submit a review, which helps hosts improve their accommodations and service and forms a valuable part of the community ecosystem. These reviews can highlight features of a stay that may not be apparent from the listing description, such as the comfort of the bedding or local tips that made the stay special. Whether you are a data scientist, a student, or just someone curious about text analytics, `wordwright` provides intuitive functions to transform raw text into actionable data.

## Imports

In [12]:
from wordwright.preprocessing import load_text
from wordwright.preprocessing import clean_text
from wordwright.language_detection import language_detection
from wordwright.count_sentences import count_sentences
from wordwright.word_frequency import frequent_words
from wordwright.count_keywords import count_keywords

## Getting Started with Text Files
Before our analysis, we need to compose several customer reviews and save them as text files for processing.

In [13]:
# Long positive review of a city apartment
review1 = """Our stay at Skyline Plaza was truly exceptional. The apartment offered 
breathtaking views of the city skyline, making every morning a luxurious experience. 
Inside, modern amenities like a fully-equipped kitchen, high-speed internet, 
and a smart TV provided exceptional comfort. The spacious bedrooms, each with 
ultra-comfortable beds, ensured restful nights after exploring the city's 
breathtaking sights. Located at the heart of the city, the apartment's proximity 
to public transport, diverse restaurants, and shops was incredibly convenient. 
Our host was wonderfully accommodating, providing a detailed city guide and 
personal recommendations for dining with views of the skyline. For anyone seeking 
a mix of modern luxury and convenient city living, this apartment is a 
perfect choice. Its luxurious amenities and prime location make it a 
standout choice for an exceptional city experience.
"""

# Short positive review of a cozy cottage
review2 = """Cozy Cottage was a quiet, charming escape from the busy city life. 
The garden was delightful, and the location was quiet yet convenient. 
A perfect weekend getaway! Highly recommend!
"""

# A medium length negative review 
review3 = """Although the location nestled right in the heart of the city, 
our stay was marred by disappointment. The first sign of trouble was the linens; 
they had a strange, unpleasant smell, as if they hadn't been washed. 
The bathroom, too, has a strange odor that made it feel unclean and unpleasant. 
Overall, the lack of cleanliness cast a shadow over our visit, 
turning our stay into an unpleasant experience. The location, unfortunately, 
couldn't make up for these  disappointing aspects.
"""

# Writing the reviews to text files
filenames = ["long_pos_review.txt", 
             "short_pos_review.txt", 
             "medium_neg_review.txt"]
reviews = [review1, review2, review3]

for filename, review in zip(filenames, reviews):
    with open(filename, "w") as file:
        file.write(review)

Now that we have our text files, let's dive into each function and see how it contributes to our text analysis workflow.

## Function 1: Loading Text Data
The `load_text` function reads the content of a text file without altering it. For our Airbnb reviews, we have three different text files containing reviews of varying lengths and details. We'll load each one and prepare them for cleaning.

In [14]:
raw_long_pos_review = load_text("long_pos_review.txt")
print(raw_long_pos_review)

Our stay at Skyline Plaza was truly exceptional. The apartment offered 
breathtaking views of the city skyline, making every morning a luxurious experience. 
Inside, modern amenities like a fully-equipped kitchen, high-speed internet, 
and a smart TV provided exceptional comfort. The spacious bedrooms, each with 
ultra-comfortable beds, ensured restful nights after exploring the city's 
breathtaking sights. Located at the heart of the city, the apartment's proximity 
to public transport, diverse restaurants, and shops was incredibly convenient. 
Our host was wonderfully accommodating, providing a detailed city guide and 
personal recommendations for dining with views of the skyline. For anyone seeking 
a mix of modern luxury and convenient city living, this apartment is a 
perfect choice. Its luxurious amenities and prime location make it a 
standout choice for an exceptional city experience.



In [15]:
raw_short_pos_review = load_text("short_pos_review.txt")
print(raw_short_pos_review)

Cozy Cottage was a quiet, charming escape from the busy city life. 
The garden was delightful, and the location was quiet yet convenient. 
A perfect weekend getaway! Highly recommend!



In [16]:
raw_medium_neg_review = load_text("medium_neg_review.txt")
print(raw_medium_neg_review)

Although the location nestled right in the heart of the city, 
our stay was marred by disappointment. The first sign of trouble was the linens; 
they had a strange, unpleasant smell, as if they hadn't been washed. 
The bathroom, too, has a strange odor that made it feel unclean and unpleasant. 
Overall, the lack of cleanliness cast a shadow over our visit, 
turning our stay into an unpleasant experience. The location, unfortunately, 
couldn't make up for these  disappointing aspects.
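
For readers curious about what "reading without altering" can look like in plain Python, here is a minimal sketch using only the standard library. The name `sketch_load_text` and the UTF-8 default are assumptions made for illustration; this is not the actual implementation of `load_text`.

```python
import tempfile
from pathlib import Path


def sketch_load_text(path, encoding="utf-8"):
    """Hypothetical stand-in: return the file's contents as one string, unmodified."""
    return Path(path).read_text(encoding=encoding)


# Round trip: write a short review to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    review_path = Path(tmp_dir) / "sketch_review.txt"
    sample = "A perfect weekend getaway! Highly recommend!\n"
    review_path.write_text(sample, encoding="utf-8")
    loaded = sketch_load_text(review_path)

print(loaded == sample)  # True: punctuation, case, and the trailing newline survive
```

Stating the encoding explicitly avoids surprises when reviews contain accented characters, such as the Spanish example later in this tutorial.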



## Function 2: Cleaning the Text
The `clean_text` function removes punctuation, collapses line breaks and extra whitespace into single spaces, and converts all words to lowercase. This makes the text uniform and easier to analyze, especially for the word counting and keyword searching shown in later sections. Cleaning the text first ensures that our analysis is not skewed by formatting inconsistencies.

In [17]:
cleaned_short_pos_review = clean_text(raw_short_pos_review)
print(cleaned_short_pos_review)

cozy cottage was a quiet charming escape from the busy city life the garden was delightful and the location was quiet yet convenient a perfect weekend getaway highly recommend


Now let's clean the other two reviews. Note that the cleaned reviews shown below preserve the nuances of the English language. For example, apostrophes that are part of contractions or possessive forms, as in "hadn't" or "city's", are kept. This retains the original meaning of the text.

In [18]:
cleaned_long_pos_review = clean_text(raw_long_pos_review)
print(cleaned_long_pos_review)

our stay at skyline plaza was truly exceptional the apartment offered breathtaking views of the city skyline making every morning a luxurious experience inside modern amenities like a fullyequipped kitchen highspeed internet and a smart tv provided exceptional comfort the spacious bedrooms each with ultracomfortable beds ensured restful nights after exploring the city's breathtaking sights located at the heart of the city the apartment's proximity to public transport diverse restaurants and shops was incredibly convenient our host was wonderfully accommodating providing a detailed city guide and personal recommendations for dining with views of the skyline for anyone seeking a mix of modern luxury and convenient city living this apartment is a perfect choice its luxurious amenities and prime location make it a standout choice for an exceptional city experience


In [19]:
cleaned_medium_neg_review = clean_text(raw_medium_neg_review)
print(cleaned_medium_neg_review)

although the location nestled right in the heart of the city our stay was marred by disappointment the first sign of trouble was the linens they had a strange unpleasant smell as if they hadn't been washed the bathroom too has a strange odor that made it feel unclean and unpleasant overall the lack of cleanliness cast a shadow over our visit turning our stay into an unpleasant experience the location unfortunately couldn't make up for these disappointing aspects
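
The cleaning rules described in this section (lowercase everything, drop punctuation except apostrophes, collapse whitespace) can be approximated with a short regular-expression sketch. The function name `sketch_clean_text` is hypothetical, and the pattern assumes English text made of ASCII letters; it is an illustration of the idea, not the package's own code.

```python
import re


def sketch_clean_text(text):
    """Hypothetical approximation of the cleaning rules described above."""
    lowered = text.lower()
    # Drop every character that is not a lowercase ASCII letter, a digit,
    # an apostrophe, or whitespace. Hyphens are removed, so "fully-equipped"
    # becomes "fullyequipped".
    no_punct = re.sub(r"[^a-z0-9'\s]", "", lowered)
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return " ".join(no_punct.split())


print(sketch_clean_text("The city's fully-equipped kitchen\nwas GREAT!"))
# the city's fullyequipped kitchen was great
```

Because the pattern keeps only ASCII letters, accented characters would be stripped as well, so this sketch is only suitable for English text.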


## Function 3: Detecting Language

The `language_detection` function identifies whether a given text is written in English. Language detection is crucial for platforms like Airbnb that operate globally and handle reviews in many languages. Analytical tools are often language-specific, and running an English-language analysis tool on non-English text would yield invalid results. Separating reviews by language before analysis keeps the downstream results meaningful.

This function accepts a string as input. It returns "English" if it detects that the text is in English, "Not English" for any other language, and "Language detection error" if it encounters an error during the detection process. When English is detected, the text is ready for the analysis steps that follow.

In [40]:
language_detection(raw_short_pos_review)

'English'

For a text that is not in English, such as the Spanish review below, the function returns "Not English".

In [21]:
review_in_spanish = """
La estancia fue encantadora, con un servicio impecable, 
un ambiente acogedor y una ubicación privilegiada que 
superó todas las expectativas. ¡Una experiencia 
verdaderamente memorable!
"""
language_detection(review_in_spanish)

'Not English'
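
Since the function returns one of three labels, a natural next step is to group a batch of reviews by label before analysis. The sketch below shows one way to do that. `partition_by_language` is a hypothetical helper, and `fake_detect` is a deliberately crude stand-in used only so the example runs on its own; in practice you would pass `language_detection` as the `detect` argument.

```python
def partition_by_language(reviews, detect):
    """Group reviews by the label that `detect` returns for each one."""
    groups = {"English": [], "Not English": [], "Language detection error": []}
    for review in reviews:
        label = detect(review)
        # Unexpected labels are filed under errors rather than silently dropped.
        groups.get(label, groups["Language detection error"]).append(review)
    return groups


def fake_detect(text):
    """Toy stand-in for a real detector; NOT how wordwright detects language."""
    if not text.strip():
        return "Language detection error"
    return "English" if " the " in f" {text.lower()} " else "Not English"


groups = partition_by_language(
    ["The garden was delightful.", "La estancia fue encantadora.", "   "],
    fake_detect,
)
print({label: len(items) for label, items in groups.items()})
# {'English': 1, 'Not English': 1, 'Language detection error': 1}
```

Only the reviews under the "English" key would then be passed on to the English-specific steps in the rest of this tutorial.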

## Function 4: Counting Sentences

The `count_sentences` function counts the number of sentences in a given text. This is useful for understanding text structure in readability assessments, summarization tasks, or content analysis, and as preparation for a more in-depth text analysis. Sentence boundaries can be marked by periods, exclamation marks, question marks, or other punctuation, so the user specifies which punctuation marks separate sentences, and the function counts the sentences accordingly.

Counting the sentences in customer reviews gives a quick, quantifiable signal of how much detail a reviewer provided, which can support data-driven insights and the presentation of textual review data.

In [22]:
# Separate sentences using periods and exclamation marks
customer_1 = count_sentences(raw_long_pos_review, [".", "!"]) 

There are 8 sentence(s), which is splited by ['.', '!'].


In [23]:
# Separate sentences using periods and exclamation marks
customer_2 = count_sentences(raw_short_pos_review, [".", "!"]) 

There are 4 sentence(s), which is splited by ['.', '!'].


In [24]:
# Separate sentences using periods only
customer_3 = count_sentences(raw_medium_neg_review, ["."]) 

There are 5 sentence(s), which is splited by ['.'].
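
To make the effect of the delimiter list concrete, here is a minimal sketch of delimiter-based sentence counting. The name `sketch_count_sentences` is hypothetical, and the approach (split on the delimiters, then count the non-empty pieces) is an assumption about one reasonable way to do this, not a description of the package internals.

```python
import re


def sketch_count_sentences(text, delimiters):
    """Hypothetical sketch: count non-empty pieces after splitting on delimiters."""
    if not delimiters:
        raise ValueError("At least one delimiter is required.")
    # Escape each delimiter so that "." matches a literal period.
    pattern = "|".join(re.escape(d) for d in delimiters)
    pieces = re.split(pattern, text)
    # Ignore empty or whitespace-only pieces, such as the one after the final "!".
    return sum(1 for piece in pieces if piece.strip())


short_review = (
    "Cozy Cottage was a quiet, charming escape from the busy city life. "
    "The garden was delightful, and the location was quiet yet convenient. "
    "A perfect weekend getaway! Highly recommend!"
)
print(sketch_count_sentences(short_review, [".", "!"]))  # 4
print(sketch_count_sentences(short_review, ["."]))       # 3
```

The second call shows why the delimiter choice matters: without "!", the two exclamations at the end are merged into a single sentence. A simple splitter like this would also treat the periods in abbreviations ("Dr.") or decimals ("4.5") as sentence boundaries.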


## Function 5: Analyzing Word Frequency

### Integrated Text Cleaning
The `frequent_words` function incorporates text cleaning into its process. This means that users don't need to clean their text separately with the `clean_text` function before analysis; `frequent_words` handles both cleaning and word frequency analysis in one step. Specifically, `frequent_words` counts the occurrences of each unique word.

In [25]:
frequent_words(raw_short_pos_review)

Counter({'was': 3,
         'the': 3,
         'a': 2,
         'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'from': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'and': 1,
         'location': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

### Stopwords Filtering
In text analysis, the frequency of words can reveal a lot about the content. However, not all words carry the same weight in conveying meaning. This is where the `frequent_words` function of the wordwright package becomes particularly valuable.

**User-defined Stopwords List**  
Stopwords are commonly used words in any language, like "the," "is," and "in," which may not add significant meaning. By removing user-defined stopwords, the `frequent_words` function lets the analysis focus on the more meaningful words that could provide insights into the text.

In [26]:
user_stopwords = ["the", "a", "is", "was", "and", "yet", "but", 
                  "from", "to", "in", "at", "on"]
frequent_words(raw_short_pos_review, stopwords=user_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'location': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})
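
The combination of cleaning, stopword removal, and counting can be sketched with `collections.Counter` from the standard library. The name `sketch_frequent_words` is hypothetical and the cleaning pattern is an assumption for illustration; the sketch shows the general technique rather than the package's implementation.

```python
import re
from collections import Counter


def sketch_frequent_words(text, stopwords=None):
    """Hypothetical sketch: clean the text, drop stopwords, count the rest."""
    # Lowercase and remove punctuation except apostrophes (assumed cleaning rule).
    cleaned = re.sub(r"[^a-z0-9'\s]", "", text.lower())
    # A set makes each stopword lookup fast, even for long stopword lists.
    blocked = set(stopwords or [])
    return Counter(word for word in cleaned.split() if word not in blocked)


counts = sketch_frequent_words(
    "The garden was quiet. The cottage was quiet, too!",
    stopwords=["the", "was"],
)
print(counts)
# Counter({'quiet': 2, 'garden': 1, 'cottage': 1, 'too': 1})
```

Note that stopwords are compared against the cleaned, lowercase words, so a stopword list should be written in lowercase as well.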

**Leveraging NLTK's Comprehensive Stopwords List**  
For a more thorough cleaning of our text data, we can use the Natural Language Toolkit (NLTK), a leading Python package for working with human language data (see details [here](https://www.nltk.org/)). NLTK provides an extensive list of stopwords commonly used in the English language.

In [27]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)
nltk_stopwords = stopwords.words('english')

In [28]:
frequent_words(raw_short_pos_review, stopwords=nltk_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'city': 1,
         'life': 1,
         'garden': 1,
         'delightful': 1,
         'location': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'weekend': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

**Extending and Customizing Stopwords List**   
In scenarios where the default NLTK stopwords are not sufficient for an in-depth analysis, users can extend the NLTK stopwords list with their own set of words.

In [29]:
# Start with the default NLTK stopwords
custom_stopwords = set(nltk_stopwords)

# Define additional stopwords that are irrelevant to the analysis
additional_stopwords = {'weekend', 'location', 'life', 'city'}

# Combine the default stopwords with the additional ones
custom_stopwords.update(additional_stopwords)

# Convert back to a list; the set has already removed any duplicates
custom_stopwords = list(custom_stopwords)

# Display the most common words excluding the customized stopwords
frequent_words(raw_short_pos_review, stopwords=custom_stopwords)

Counter({'quiet': 2,
         'cozy': 1,
         'cottage': 1,
         'charming': 1,
         'escape': 1,
         'busy': 1,
         'garden': 1,
         'delightful': 1,
         'yet': 1,
         'convenient': 1,
         'perfect': 1,
         'getaway': 1,
         'highly': 1,
         'recommend': 1})

### Displaying Most Frequent Words
The output of the `frequent_words` function is a `Counter` object (see details of `Counter` [here](https://docs.python.org/3/library/collections.html#collections.Counter)), a specialized dictionary from Python's `collections` module designed for counting hashable objects.
One useful method available on a `Counter` object is `most_common()` (see details of the method [here](https://docs.python.org/3/library/collections.html#collections.Counter.most_common)). It returns the items and their counts as a `list` of `(word, count)` tuples, ordered from most to least frequent. You can also pass a number N to return only the top N most frequent words.

This method is particularly handy when you are dealing with large texts and want to focus on the most relevant words, such as the key themes in customer reviews or the most mentioned terms.

In [30]:
top_10_long_pos_review = frequent_words(raw_long_pos_review, 
                                        stopwords=nltk_stopwords
                                       ).most_common(10)
top_10_long_pos_review

[('city', 5),
 ('skyline', 3),
 ('exceptional', 3),
 ('apartment', 2),
 ('breathtaking', 2),
 ('views', 2),
 ('luxurious', 2),
 ('experience', 2),
 ('modern', 2),
 ('amenities', 2)]

In [31]:
top_3_short_pos_review = frequent_words(raw_short_pos_review, 
                                        stopwords=nltk_stopwords
                                       ).most_common(3)
top_3_short_pos_review

[('quiet', 2), ('cozy', 1), ('cottage', 1)]

In [32]:
top_5_medium_neg_review = frequent_words(raw_medium_neg_review, 
                                         stopwords=nltk_stopwords
                                        ).most_common(5)
top_5_medium_neg_review

[('unpleasant', 3),
 ('location', 2),
 ('stay', 2),
 ('strange', 2),
 ('although', 1)]
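
The behaviour of `most_common()` can be explored with the standard library alone, without any review data. The short example below uses a made-up word list and also shows how the result can be turned into relative shares; the variable names are illustrative only.

```python
from collections import Counter

counts = Counter("quiet cozy quiet cottage charming".split())

# Top 2 items as (word, count) tuples. Words with equal counts keep the
# order in which they were first encountered.
print(counts.most_common(2))  # [('quiet', 2), ('cozy', 1)]

# Convert the top counts into shares of all counted words.
total = sum(counts.values())
shares = {word: count / total for word, count in counts.most_common(2)}
print(shares)  # {'quiet': 0.4, 'cozy': 0.2}
```

The tie-breaking rule explains why, in the outputs above, words that occur once appear in the same order as they do in the review.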

## Function 6: Counting Keywords

The `count_keywords` function counts how many times each keyword occurs in a text. The user specifies the text and the list of keywords to look for.

In [36]:
count_keywords(raw_long_pos_review, ['breathtaking', 'exceptional', 'city\'s'])

{'breathtaking': 2, 'exceptional': 3, "city's": 1}

The function counts these words correctly. Notice that it converts everything to lowercase and ignores punctuation, with the exception of apostrophes used in contractions and possessives. This is because the function first preprocesses both the text and the keywords, forcing lowercase and removing all punctuation except apostrophes, before it evaluates the word counts. Hyphens that join two words are removed as well (for example, "fully-equipped" is treated as "fullyequipped"). These rules apply to both the text and the keywords:

In [37]:
count_keywords(raw_long_pos_review, ['fullyequipped'])

{'fullyequipped': 1}

Here we see that the word "fullyequipped" has a count of 1, even though the text contains no literal word "fullyequipped". This is the result of the preprocessing applied before the word counts are evaluated, which removes the "-" character connecting the words "fully" and "equipped".

Sometimes a user may include the same word in the keyword list more than once, by accident or on purpose. In the output, the count for that word appears only once.

In [39]:
count_keywords(raw_short_pos_review, ['cozy'])

{'cozy': 1}

In [38]:
count_keywords(raw_short_pos_review, ['cozy', 'cozy', 'Cozy', 'cozy!'])

{'cozy': 1}

Notice that punctuation and case do not create a new keyword. The preprocessing applied to the keywords maps all of them onto the same normalized form of the word.
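
The normalization behaviour described in this section can be illustrated with a small sketch in which the same cleaning step is applied to the text and to each keyword, and the results are stored in a dictionary keyed by the normalized keyword. The names `_normalize` and `sketch_count_keywords` are hypothetical, and reporting a count of 0 for absent keywords is an assumption of this sketch, not confirmed behaviour of `count_keywords`.

```python
import re
from collections import Counter


def _normalize(text):
    """Lowercase and remove punctuation except apostrophes (assumed rule)."""
    return re.sub(r"[^a-z0-9'\s]", "", text.lower())


def sketch_count_keywords(text, keywords):
    """Hypothetical sketch: count single-word keywords after normalization."""
    word_counts = Counter(_normalize(text).split())
    result = {}
    for keyword in keywords:
        key = _normalize(keyword).strip()
        if key:
            # Duplicates such as "Cozy" and "cozy!" map to the same key,
            # so each keyword appears once in the result.
            result[key] = word_counts[key]
    return result


print(sketch_count_keywords(
    "Cozy Cottage was a cozy, fully-equipped escape!",
    ["cozy", "Cozy", "cozy!", "fullyequipped", "garden"],
))
# {'cozy': 2, 'fullyequipped': 1, 'garden': 0}
```

Because the text is split into individual words, this sketch only handles single-word keywords; multi-word phrases would need a different matching strategy.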