<p align="center">
<img src="https://github.com/datacamp/python-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
</p>
<br><br>

# **Working with Text Data in Python**

Welcome to this hands-on training where you will begin to learn how to work with text data in Python! In this session you will learn:

- How to explore and visualize your text data.
- How to manipulate and clean text data for further analysis.
- The basics of regex, and how to use it to filter a DataFrame.
- How to prepare a template that is easily reusable.

## **The Dataset**

The dataset to be used in this webinar is a CSV file named `wine_reviews.csv`, which contains data on wine reviews. In particular, it contains the following columns:

### Columns:

`country`: The country that the wine is from

`description`: The review.

`designation`: The vineyard within the winery.

`points`: The number of points awarded to the wine on a scale from 1-100.

`price`: The cost of the wine.

`province`: The province or state where the wine originated from.

`region`: The wine growing area within the province or state.

`variety`: The type of grapes used to make the wine.

In [None]:
# Load resources
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
from collections import Counter

# Set pandas columns to display at max width
pd.set_option('display.max_colwidth', None)

# Set seaborn aesthetic features to pretty up our plots
sns.set()

## **Import wine data, and look at the first five rows**
Let's first import the data, which is stored in the CSV file `wine_reviews.csv`, and inspect it. We will use:
- `read_csv()` to read the csv file as a DataFrame.
- `.head()` to view the first five rows.

In [None]:
# Read the csv and assign to the DataFrame 'wine_df'
wine_df = pd.read_csv('https://github.com/datacamp/working-with-text-data-in-python-live-training/blob/master/data/wine_reviews.csv?raw=true')

# View the first five rows


**Observation:** It looks as though there are a variety of different cases in the `variety` column, which we will need to address later.

**Further inspection of our data**
---
Now that we have an idea of how our data is structured, and a little bit about what it contains, let's dig into the details a bit more. To do so, we will use:
- `.info()` method on the DataFrame to learn about the data types and missing values.
- `.sort_values()` to reorder the DataFrame by a given series.
- `.unique()` to access the unique values.
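
As a rough sketch of these three methods, the example below runs them on a tiny DataFrame. The values in `toy_df` are invented for illustration and are not taken from `wine_reviews.csv`.

```python
import pandas as pd

# Toy data invented for illustration; not taken from wine_reviews.csv
toy_df = pd.DataFrame({
    "variety": ["Merlot", "chardonnay", " Merlot", "Chardonnay"],
    "price": [20.0, None, 35.0, 18.0],
})

# Summarize the data types and non-null counts of each column
toy_df.info()

# Sort the variety column, then keep only the unique values
unique_varieties = toy_df["variety"].sort_values().unique()
print(unique_varieties)
```

Sorting before calling `.unique()` places near-duplicates such as `'Merlot'` and `' Merlot'` in a predictable order, which makes inconsistencies easier to spot.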

In [None]:
# Display key information about the DataFrame


**Observation:** The two numeric columns, `points` and `price`, are correctly specified as integers and floats. However, it appears as though there are a lot of missing values for the `designation` column, which specifies the vineyard within the winery that the wine originated from.

In [None]:
# Access the variety column, sort them alphabetically, and select only the unique values


**Observation:** Due to extra spaces and inconsistencies in case, there are many duplicate entries of wine varieties that we will need to address.

## **Cleaning the data**
As observed when exploring the data, the `designation` column has missing values. One strategy could be to fill each missing value with the name of the winery followed by `'- unknown'`. To fill in missing values, we can use the `.fillna()` method, which takes the value to fill with as its argument:
- Here we will pass the `winery` series of our DataFrame, and then concatenate `- unknown`.
- We will also use the `inplace` argument to apply this operation directly to our DataFrame.

But first, let's go over string **concatenation**. In the example below, we join the word `"the"` with our `x` variable containing `"winery"`, and separate the two words by also adding a space (`" "`).
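
A minimal sketch of both ideas on invented data: plain string concatenation first, then the same `+` operator applied to a whole pandas Series to build the fill values. The winery and designation values are hypothetical, and this sketch assigns the filled column back to the DataFrame, which is one alternative to the `inplace` argument.

```python
import pandas as pd

# Plain string concatenation with a separating space
x = "winery"
phrase = "the" + " " + x
print(phrase)

# Toy data invented for illustration
toy_df = pd.DataFrame({
    "winery": ["Heitz", "Rainstorm"],
    "designation": ["Martha's Vineyard", None],
})

# Concatenation works element-wise on a Series, so each winery gets the suffix
fallback = toy_df["winery"] + " - unknown"

# Fill only the missing designations and assign the result back to the column
toy_df["designation"] = toy_df["designation"].fillna(fallback)
print(toy_df)
```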

In [None]:
# Create two variables storing the words of interest




# Combine the words with a space


In [None]:
# Fill our designation column with the name of the winery and 'unknown'


# Sample the DataFrame to see the result


In [None]:
# Use .info() again to ensure that we have no more missing values


**Observation:** Currently, there are a number of location attributes scattered across columns. While it is useful to have them separately, let's also make a column that combines this data into a useful `location` column. 

In particular, let's take the first three characters of the `country` name, and then combine them with the `region`. To do so, we will use:
- String concatenation, which we used earlier.
- The `.str.upper()` method, which returns an uppercase version of the string.
- String indexing (which we will go over below):
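
The sketch below shows string indexing and slicing on the word `'winery'`, and then combines slicing, `.upper()`, and concatenation on a single pair of plain strings. The `country` and `region` values are hypothetical stand-ins for one row of the DataFrame.

```python
x = "winery"

first_letter = x[0]    # index 0 is the first character
first_four = x[:4]     # characters at positions 0 through 3
from_fourth = x[3:]    # from the fourth character to the end

# Hypothetical values standing in for one row of the DataFrame
country = "France"
region = "Alsace"

# Region, a hyphen, and the first three characters of the country in upper case
location = region + " - " + country[:3].upper()
print(first_letter, first_four, from_fourth, location)
```

On a DataFrame column, the same slice and upper-casing are reached through the `.str` accessor rather than directly on the value.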

In [None]:
# Print out the first letter of x (which is storing 'winery')


In [None]:
# Print out the first four letters


In [None]:
# Print from the fourth letter until the end


In [None]:
# Create a location column, and assign to it the region, 
# a hyphen, and the first three characters of the country column in upper case


# Check our data with a random sample


**Observation:** Our `variety` column is a bit of a mess! Let's use a variety of string functions provided by pandas to set the varieties to lowercase, strip leading and trailing spaces, and replace any double spaces with single spaces!

To do this, we will use (in order):
- `str.strip()`: remove leading and trailing spaces.
- `str.lower()`: convert the string to lowercase.
- `str.replace()`: replace a given pattern or string with another.
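
As an illustration of chaining these three methods, here is a small sketch on an invented Series. The variety names are made up to mimic the kinds of inconsistencies described above.

```python
import pandas as pd

# Toy values invented to mimic inconsistent casing and spacing
varieties = pd.Series(["  Pinot  Noir", "pinot noir ", "PINOT NOIR"])

cleaned = (
    varieties
    .str.strip()                           # drop leading and trailing spaces
    .str.lower()                           # convert to lowercase
    .str.replace("  ", " ", regex=False)   # replace double spaces with single spaces
)
print(cleaned.unique())
```

Passing `regex=False` tells pandas to treat the double space as a literal string rather than a pattern.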

In [None]:
# Remove leading and trailing spaces from the variety column


# Set the variety names to lower case


# Replace all double spaces with single spaces


In [None]:
# Double check that we have a sensible list of wine varieties


---
## Q&A 1
---

## **Let's build a wordcloud!**
---
Okay, we have some cleaned data, so it's time to start exploring these reviews and learning what we can about wines. One common way of getting a visualization of text data is through a word cloud. Here, we will use:
- `STOPWORDS`: a set of common words to eliminate from our wordcloud.
- `.join()`: a method by which we can join together all of the reviews in our dataset so that we have one set of text for the wordcloud.
- `WordCloud()`: a class used to generate a wordcloud from a given set of text.
- Some `matplotlib.pyplot` functions to show our wordcloud, turn off the axis, and call the plot.

In [None]:
# Assign the built-in set of STOPWORDS to a variable stopwords, and preview it



**Observation:** As you can see, this set contains the type of words that are common and not particularly informative for our purposes (we want to learn about wine!).

In [None]:
# Join all the reviews by a space and lowercase


# Preview first 2000 characters to see whether reviews have been joined


Initial arguments for our first `WordCloud()`:
- `collocations`: whether we include bigrams of words (e.g. "the wine") or just unigrams (e.g. "wine").
    - In this wordcloud, our stopwords are only built for individual words, rather than bigrams. So let's turn this off.
- `width` and `height`: width and height of the wordcloud canvas
- `background_color`: the color of the background
- `stopwords`: words that will be eliminated from the wordcloud (in this case, the common ones we loaded in earlier)

We then use the `.generate()` method to generate our wordcloud from our `text` variable.

In [None]:
# Initialize wordcloud





# Generate wordcloud from our text variable    


# Render the wordcloud as an image, turn off the axis, and show




1. We can update our set of `stopwords` by calling `.update()`, passing in a list of words that we don't want to appear in the wordcloud.

2. We can update the background color by updating the `background_color` argument.

3. Lastly, we can update the size of the wordcloud by specifying the figure size in `plt.figure()`. We display the wordcloud with `matplotlib`, so we can adjust figure characteristics by using our `plt` alias for `matplotlib.pyplot`.

In [None]:
# Less interesting words


# Update our stopwords with words that aren't too interesting to us


# Generate our wordcloud, but change the background color




# Display the wordcloud, but with updated size


## **Simple word count**
While a full run-through of preprocessing for advanced topics such as natural language processing and machine learning is beyond the scope of this live training, let's do a quick preview of how one might get a frequency count of words outside of a wordcloud! To do this we will use:
- `word_tokenize()`: a function we imported earlier from `nltk.tokenize` that will 'tokenize' or split a string into substrings based on a set of criteria. In this case, it splits on white space and punctuation, thereby collecting words.
- `Counter()`: a type of dictionary from the `collections` module that stores elements as keys and their counts as values.
- `.most_common()`: a method on a counter object that will return the _n_ most common elements.
- `.isalpha()`: Returns True if all characters are alphabetic.
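
The counting steps can be sketched with the standard library alone. To keep the example runnable without NLTK data, `str.split()` stands in for `word_tokenize()` here, so punctuation is only separated because the invented text already has spaces around it. The text and the stopword set are both made up for illustration.

```python
from collections import Counter

# Toy text and stopword set invented for illustration
text = "this wine is ripe and oaky , with ripe fruit and 20 % new oak"
stopwords = {"this", "is", "and", "with"}

# str.split() stands in for word_tokenize(); keep alphabetic tokens only
words = [token for token in text.split() if token.isalpha()]

# Filter out the stopwords
final_words = [word for word in words if word not in stopwords]

# Count the remaining words and print the most frequent ones
word_count = Counter(final_words)
for word, count in word_count.most_common(3):
    print(f"{word}: {count} mentions")
```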

In [None]:
# Tokenize our existing set of text if the word is alphabetic



# Filter our list for words that aren't in our existing set of stopwords


# Preview our tokenized list


In [None]:
# Use counter to create a count of each word in our filtered list final_words


# Use a loop to print the 10 most common tokens
for word, count in word_count.most_common(10):
    print(f"{word}: {count} mentions")

In [None]:
# Check word count for 'tannin' and 'tannins'


**Observation:** It looks like our count could be more accurate, as there are nearly identical words included!

## **Stemming and Lemmatization**
### **Stemming**
There are a few ways we can go about fixing this. First, there is stemming, which is an algorithmic way of reducing words to their root form. However, in doing so, it runs the risk of producing non-words as a result. Let's take a look by importing `PorterStemmer()` from `nltk.stem`.
- `PorterStemmer()`: a popular stemmer available in the `nltk.stem` package, based on the Porter stemming algorithm.
- `.stem()`: the method to which we pass the token we want to stem.

In [None]:
# Initialize our stemmer


# Generate a list of test words
test_words = ['bike', 'bikes', 'biking']

# View the stemmer in action


In [None]:
# Generate a new list of test words
new_words = ['lease', 'leasing', 'leases']

# View the stemmer in action


**Observation:** It looks like we are already encountering limits of our stemmer! 

### **Lemmatization**
An approach that is slower, but relies upon linguistic rules, is the process of _lemmatization_, which attempts to reduce input to its root word, or _lemma_. Unlike the algorithmic approach of stemming, lemmatization uses a corpus to ensure that the root word is an actual word. Let's use `WordNetLemmatizer()` and its corresponding method `.lemmatize()` to try this out!

In [None]:
# Initialize our lemmatizer


# View the lemmatizer in action


## **A final pass through our word frequencies**
Okay, now that we have experimented a bit, let's do a final run through our words with a lemmatizer and plot the counts.

In [None]:
# Lemmatize our list with our lemmatizer


# Use counter to create a count of each word in our filtered list 


# Check the success of our lemmatization
print("Old count of tannin: " + str(word_count.get('tannin')))
print("New count of tannin: " + str(new_word_count.get('tannin')))
print("New count of tannins: " + str(new_word_count.get('tannins')))

In [None]:
# Convert our most common words to a DataFrame for easy plotting


# View first five rows


In [None]:
# Set figure size
plt.figure(figsize=(12, 8))

# Create a barplot with our values
plt.barh(y=word_freq.Word, width=word_freq.Count)

# Invert the y axis
plt.gca().invert_yaxis()

# Set the title and show
plt.title('Most common words')
plt.show()

## **Searching for specific strings**
Wordclouds and frequency counts are great starting steps, but to really explore our data we will probably want to be able to search it. Let's do some basic searches for mentions of oak in the `description` column of the DataFrame.
- `str.contains()` returns a Boolean Series indicating whether a given pattern or string is found in each string of the series. It is based on `re.search`.
    - For now, we will use the optional parameters `case` and `regex` to ensure that our search is _not_ case sensitive, and to specify that we are not using a regular expression pattern, and instead simply a string.
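
A minimal sketch of a literal, case-insensitive search on an invented Series of reviews. The review texts are made up, and one of them deliberately mentions Oakville rather than oak.

```python
import pandas as pd

# Toy reviews invented for illustration
reviews = pd.Series([
    "Aged over three years in oak.",
    "Hails from Oakville.",
    "Bright cherry fruit.",
])

# Case-insensitive search for the literal string 'oak' (not a regex pattern)
oak_filter = reviews.str.contains("oak", case=False, regex=False)

# Boolean indexing keeps only the rows where the filter is True
print(reviews[oak_filter])
```

Because this is a plain substring search, any review containing the letters `oak` passes the filter, whatever word they appear in.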

In [None]:
# Create our Boolean filter


# Filter our DataFrame using our oak_filter and look at the first five rows


**Observation:** Take a look at the first result. Although the review does make reference to being aged in `oak`, it also references the wine coming from `Oakville`. Thus, there may be a risk that our query is grabbing descriptions that contain Oakville, and not oak.

**Introduction to Regular Expressions**
---
We could add a space after `'oak'` to ensure that we don't get Oakville, but what about when the word ends a sentence, or `'oakiness'` and `'oaky'`? Enter regular expressions, which allow us to define patterns to find and extract text.

Regular expressions are strings that make use of normal and special characters to help define a pattern which we can then compare to our text of interest. Here, we will make use of a few special characters to write a pattern for `oak` and related adjectives. But first, let's try out some simple examples by using the digit special character.

`\d`: Matches any digit character (i.e. 0-9)

`{}`: Specifies how many times the preceding element must repeat.
- `{1,5}` will match between 1 and 5 repetitions.
- `{2,}` will match at least two repetitions.

There are many more special characters we can use to write complex regular expression (or regex) patterns, but let's see what we can do with what we have learned so far. We will make use of the `re` package:
- `re.findall()` will return all matches of our patterns in a given string.
- We also prefix our pattern with `r` to denote it as a raw string, so Python interprets our backslashes correctly.

In [None]:
# Create a very contrived test string
test_string = """
This tremendous 100% varietal wine hails from Oakville 
and was aged over three years in oak. Juicy red-cherry fruit 
and a compelling hint of caramel greet the palate, framed by elegant, 
fine tannins and a subtle 20% minty tone in the background. 
Balanced and rewarding from start to finish, 
it has years ahead of it to develop further nuance. There are absolutely no
bad tannins in this wine. But there is a tasty tannin.
Enjoy 2022–2030."""

# Let's find all digits in this review
re.findall(r"\d", test_string)

In [None]:
# Let's find all years represented in the format XXXX in this review
re.findall(r"\d{4}", test_string)

In [None]:
# Let's find all percentages (with between 1 and 3 digits, followed by a percentage sign)
re.findall(r"\d{1,3}%", test_string)

### **Let's expand a bit on some other special characters before diving back into our dataset!**

`\w`: Matches any alphanumeric character or underscore (i.e. A-Z, a-z, 0-9, \_)

`+`: Matches 1 or more of the preceding character.

`\s`: Matches any white space character, such as spaces, tabs, and line breaks.

`?`: Matches 0 or 1 of the preceding character.
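
A small sketch of how these special characters behave, using short strings invented for illustration:

```python
import re

# Invented sample text
sample = "fine tannins, one tannin, red-cherry fruit"

# '?' makes the preceding 's' optional, so both spellings match
tannin_hits = re.findall(r"tannins?", sample)

# '\w+' grabs runs of one or more word characters
words = re.findall(r"\w+", "red-cherry fruit")

# '\s+' matches runs of white space (spaces, tabs, line breaks)
pieces = re.split(r"\s+", "oak\tand  spice")

print(tannin_hits, words, pieces)
```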

In [None]:
# Find all mentions of tannin (or tannins), as well as the word that precedes it for context
re.findall(r"\w+\stannins?", test_string)

---
## Q&A 2
---

## **Let's learn about oak!**

1. First, we want to ensure that `oak` is at the beginning of the string, or preceded by a space, because there may be cases where a word contains the same characters (e.g. `'cloak'`). We do this by using two new techniques: alternation and the caret.
- `^`: matches the position before the first character of a string.
- `|`: acts as an or.
    - Using `()` groups characters together.

```
(\s|^)
```


- `re.search()`: returns a match object if it finds an occurrence of the pattern, and `None` if it doesn't.

In [None]:
# Test out the pattern
print(re.search(r"(\s|^)oak", "oaky flavors"))
print(re.search(r"(\s|^)oak", "flavored oak"))
print(re.search(r"(\s|^)oak", "cloak"))

2. We then want to search for a capitalized or non-capitalized `'o'`. To do this, we can use a character set `[]`, which matches any character in the square brackets. We then add `'a'` and `'k'` as these characters will be in every variant of oak we want to search for.

```
[Oo]ak
```

In [None]:
# Test out character sets
print(re.search(r"[Oo]ak", "Oak"))
print(re.search(r"[Oo]ak", "oak"))

3. Next, we want to allow for variants such as oakiness. Here, we use one large capturing group, and then match any of the alternatives inside it using `|`.
- We look for `'iness'` **or** `'s'` **or** `'y'` (again using alternation).
    - We make this group optional with `?`.

```
((iness)|s|y)?
```
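
To see the optional suffix group in action, the sketch below joins it to `[Oo]ak` from the previous step and checks a handful of invented words. `re.fullmatch()` is used here (rather than `re.search()`) so that the whole word has to fit the pattern.

```python
import re

# [Oo]ak followed by an optional suffix: 'iness', 's', or 'y'
suffix_pattern = r"[Oo]ak((iness)|s|y)?"

# fullmatch() requires the entire word to match the pattern
words = ["oak", "Oaks", "oaky", "oakiness", "Oakville"]
matches = {word: bool(re.fullmatch(suffix_pattern, word)) for word in words}
print(matches)
```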


4. Finally, we use a character set that allows for a period, a comma, or white space.
- _Note: we also introduce a `\` before the `.` to escape the special character and treat it as a literal period. (Inside a character set a `.` is already literal, so the backslash is optional here.)_

```
[\.,\s]
```

Putting the pieces together, we now have a pattern that will capture all references to oak that we expect to find, while also not matching with words like `'Oakville'`. Let's add these together, and then assign them to the variable `oak_pattern`.

In [None]:
# Search for the start of the string or a white space,
    # followed by any case 'o', 'ak',
    # an optional 'iness', 'y', or 's',
    # and ending in a period, comma, or whitespace.
oak_pattern = r"(\s|^)[Oo]ak((iness)|y|s)?[\.,\s]"

### **Testing our pattern**
**Note:** It is always a good idea to test that your regular expression is returning the results that you would expect, especially when you are working with longer and more complex patterns. There are many resources online where you can test patterns, such as www.regexr.com


In [None]:
# Use our pattern to search a string that contains Oakville but not explicitly oak
x = re.search(oak_pattern, "This tremendous 100% varietal wine hails from Oakville.")

# If a match is found, let us know!
if x:
    print("Yes, an oak match!")
else:
    print("No, oak isn't here!")

In [None]:
# Use our pattern to search a string that contains a reference to something being 'oaky'
x = re.search(oak_pattern, "This tremendous wine is oaky.")

# If a match is found, let us know!
if x:
    print("Yes, an oak match!")
else:
    print("No, oak isn't here!")

In [None]:
# View the first (and only) match


**Filtering the DataFrame using regex to find interesting patterns**
---
Now that we have a functioning pattern, we can use it to gain insights about the oakiness of wines. First, we use our pattern to filter the DataFrame, again using `str.contains()`, but this time using a regular expression.

We can assign this filtered DataFrame to a new one, titled `oak_wines`, which we can use for further analyses. Let's also call `.head()` to do a sense check on the new DataFrame.
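
A hedged sketch of regex filtering on a toy DataFrame with invented reviews. The pattern below is a variant of the oak pattern written with non-capturing groups `(?:...)`, which is an assumption of this sketch: pandas emits a warning when a pattern passed to `str.contains()` has capture groups, and the non-capturing form matches the same text without it.

```python
import pandas as pd

# Toy reviews invented for illustration
toy_df = pd.DataFrame({
    "variety": ["cabernet sauvignon", "cabernet sauvignon", "chardonnay"],
    "description": [
        "Aged in oak, with a hint of spice.",
        "Hails from Oakville.",
        "Rich and very oaky.",
    ],
})

# Variant of the oak pattern using non-capturing groups (?:...)
pattern = r"(?:\s|^)[Oo]ak(?:iness|y|s)?[.,\s]"

# str.contains() treats the pattern as a regular expression by default
oak_wines = toy_df[toy_df["description"].str.contains(pattern)]
print(oak_wines)
```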

In [None]:
# Filter our DataFrame for descriptions matching our pattern, and assign to `oak_wines`


# View the first five rows of our new DataFrame


### **Performing some comparisons**
Wonderful! Now, let's start to get an idea of the ratios of oakiness between varieties of wines. Let's start by grouping our two DataFrames, `oak_wines` and `wine_df` by the variety and counting the number of references.

To do so, we will use:
- `.groupby()` to group by the `variety`.
- `.count()` to aggregate by the count of different wines.
- `.sort_index()` so that we have an alphabetically sorted index to use for plotting.
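
As a sketch of this chain on invented data: selecting a single column (here `description`) before counting is an assumption of this example, chosen so that the result is a Series that can be passed straight to a bar plot.

```python
import pandas as pd

# Toy data invented for illustration
toy_df = pd.DataFrame({
    "variety": ["merlot", "chardonnay", "merlot", "riesling"],
    "description": ["review a", "review b", "review c", "review d"],
})

# Count reviews per variety, then sort the index in descending alphabetical order
wine_grouped = (
    toy_df.groupby("variety")["description"]
    .count()
    .sort_index(ascending=False)
)
print(wine_grouped)
```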

In [None]:
# Group our oak DataFrame by variety, count it, and sort the index in descending alphabetical order
oak_grouped = 

# Group our original DataFrame by variety, count it, and sort the index in descending alphabetical order
wine_grouped = 

# Display the two grouped DataFrames



### **Plot our ratios**
Let's start by doing a simple plot of the grouped DataFrames. By overlaying the `oak_grouped` data over the `wine_grouped` data, we can get a rough visualization of the ratios of oaky wines. To do this, we will use `matplotlib.pyplot.barh` to make two horizontal barplots.

In [None]:
# Plot a horizontal bar chart of the count of varieties in 'lightsalmon'
plt.barh(wine_grouped.index, wine_grouped, color='lightsalmon')

# Plot a horizontal bar chart of the count of oaky varieties in 'firebrick'
plt.barh(oak_grouped.index, oak_grouped, color='firebrick')

# Give the plot a title
plt.title("Oak Wine Counts by Variety")

# Show the plot
plt.show()

### **Calculate ratios**
Clearly there are some varieties that are more likely to be described as oaky than others. Let's create a Boolean (True/False) column called `'oaky'` using our pattern, and then use a `groupby()` to calculate the percentage of each variety that is described as oaky.

**Note:** When calculating the `.mean()` of a Boolean column, `pandas` treats True as 1 and False as 0.
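
A short sketch of that behaviour, using an invented `'oaky'` column that has already been filled with Boolean values:

```python
import pandas as pd

# Toy data invented for illustration
toy_df = pd.DataFrame({
    "variety": ["chardonnay", "chardonnay", "riesling", "riesling"],
    "oaky": [True, True, False, True],
})

# The mean of a Boolean column per group is the share of True values
oak_freq = toy_df.groupby("variety")["oaky"].mean()
print(oak_freq)
```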

In [None]:
# Generate a new Boolean column based on whether the review contains 'oak'


# Group the DataFrame by variety, take the mean, and select the 'oaky' column


# View the resulting percentages


In [None]:
# Create another horizontal bar plot of the frequencies
plt.barh(oak_freq.index, oak_freq)

# Title the plot
plt.title("Wines by level of Oakiness")

# Show the plot
plt.show()

**Finally, let's produce some replicable output!**
---
Let's create some variables from our filtered datasets, and use these to write a replicable expression that adapts based on new data. We can generate some simple summaries using:
- `len()` to find the length of the two DataFrames.
- `.idxmax()` to return the index label of the row with the highest value (i.e. the wine with the highest oaky percentage).
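
A minimal sketch of these summaries, assuming small invented stand-ins for the filtered DataFrame, the full DataFrame, and the per-variety percentages. It uses the pandas `Series.idxmax()` method to find the label of the largest value.

```python
import pandas as pd

# Hypothetical share of oaky reviews per variety
oak_freq = pd.Series({"chardonnay": 0.4, "merlot": 0.25, "riesling": 0.05})

# Hypothetical DataFrames standing in for oak_wines and wine_df
oak_wines = pd.DataFrame({"description": ["a", "b"]})
wine_df = pd.DataFrame({"description": ["a", "b", "c", "d", "e"]})

# len() on a DataFrame returns its number of rows
oak_num = len(oak_wines)
wine_num = len(wine_df)

# idxmax() returns the index label of the largest value
oak_wine = oak_freq.idxmax()
print(oak_num, wine_num, oak_wine)
```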

In [None]:
# Count the number of rows in the two DataFrames
oak_num = 
wine_num = 

# Select the wine with the highest percentage of oakiness
oak_wine = 

### **Using f-strings**
Okay, now let's embed our three variables into an f-string. F-strings allow us to insert our variables into strings using curly brackets `{}`. We simply need to add an `f` as a prefix to our string and call it inside a print function.

We now have a simple and dynamic summary of our data!
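
As a sketch of the idea, the example below fills an f-string from three variables. The values are hypothetical; in the notebook they would come from the DataFrames.

```python
# Hypothetical values; in the notebook these come from the DataFrames
oak_num = 120
wine_num = 1000
oak_wine = "chardonnay"

# Variables inside curly brackets are replaced by their values
summary = (
    f"There were {oak_num} mentions of 'oak' from amongst {wine_num} reviews. "
    f"The most oaky wine was {oak_wine}."
)
print(summary)

# Expressions and format specifiers are also allowed inside the brackets
share = f"{oak_num / wine_num:.1%} of reviews mention oak."
print(share)
```

Re-running the same lines after the variables change produces an updated sentence without editing the string itself.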

In [None]:
# Use an f-string to print a written description of our data


In [None]:
# Update our variables and print again
oak_num = 
wine_num = 
oak_wine = 

print(f"There were {oak_num} mentions of 'oak' from amongst {wine_num} reviews. The most oaky wine was {oak_wine}.")

## **What's next?**
Now that you have started your Pythonic-text journey, there are a variety of more advanced topics for you to tackle!

<a href = "https://learn.datacamp.com/courses/regular-expressions-in-python"><img src = "https://assets.datacamp.com/production/course_17118/shields/original/shield_image_course_17118_20200109-1-1b7rdip?1578597449" width=100pt align=left></a><br><br>&nbsp;&nbsp;&nbsp;**Further Experience with Regular Expressions**

<a href = "https://learn.datacamp.com/courses/sentiment-analysis-in-python"><img src = "https://assets.datacamp.com/production/course_16852/shields/original/shield_image_course_16852_20190816-1-1drnml4?1565953559" width=100pt align=left></a><br><br>&nbsp;&nbsp;&nbsp;**Sentiment Analysis**

<a href = "https://learn.datacamp.com/courses/feature-engineering-for-machine-learning-in-python"><img src = "https://assets.datacamp.com/production/course_14336/shields/original/shield_image_course_14336_20190428-1-1s1qt3h?1556485075" width=100pt align=left></a><br><br>&nbsp;&nbsp;&nbsp;**Feature Engineering for Machine Learning**

<a href = "https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-python"><img src = "https://assets.datacamp.com/production/course_3629/shields/original/shield_image_course_3629_20200424-1-1jg2tak?1587716990" width=100pt align=left></a><br><br>&nbsp;&nbsp;&nbsp;**Natural Language Processing**