# Workbook 1: Preprocessing Text Data
### Summary:
- Reading in text data 
    - txt
    - pdf
    - images
    - word
- Cleaning strings
    - Clean est_price
    - Regular expressions
    - Clean desc_1
- Tokenization
- Removing stop words
- Stemming
- Lemmatization
- Creating a document-term matrix
- Analyzing word counts and sentiment








In Workbook 0b we focused on wrangling quantitative data. In this workbook we'll transition to text data.

The parallel of wrangling for qualitative data is **pre-processing**. This includes parsing, cleaning, tokenizing, removing stop words, stemming, and lemmatizing the data for analysis. 



In [2]:
# import packages
import os 

import re

import pandas as pd
import numpy as np
import statsmodels.api as sm

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/bah17005/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Reading in text data
Workbook 0b covered loading data formatted as a .csv file. This is a common format for any data that lean quantitative. As we begin working working with more qualitative-leaning data, there are a variety of formats they may come in (e.g., .txt, .pdf, .docx). While the example in this workbook will continue using the coffee.csv file from Workbook 0b, in the first section, we will also load some other types of data. 

In [3]:
# change to your file path:
# option 1: if coffee.csv is saved in the same location as this workbook you can use path + file
# option 2: remove parent + "/Data/Coffee" + file and paste a direct path encased in quotes (" " or ' ')
# option 3: diy!
path = os.getcwd() 
parent = os.path.abspath(os.path.join(path, os.pardir)) # this returns the parent folder of the cwd
file_coffee = "/coffee.csv"
full_path = parent + "/Data/Coffee" + file_coffee

In [4]:
# for importing .csv files
coffee = pd.read_csv(full_path)
coffee[0:5]

Unnamed: 0,slug,all_text,rating,roaster,name,region_africa_arabia,region_caribbean,region_central_america,region_hawaii,region_asia_pacific,...,agtron,aroma,acid,body,flavor,aftertaste,with_milk,desc_1,desc_2,desc_3
0,https://www.coffeereview.com/review/wilton-ben...,\n\n\n95\n\n\nJBC Coffee Roasters\nWilton Ben...,95,JBC Coffee Roasters,Wilton Benitez Geisha,0,0,0,0,0,...,59/81,9.0,9.0,9.0,9.0,9.0,,"Richly floral-toned, exceptionally sweet. Dist...",Produced by Wilton Benitez of Macarena Farm en...,"A nuanced, complex experimentally processed Co..."
1,https://www.coffeereview.com/review/colombia-c...,\n\n\n95\n\n\nBird Rock Coffee Roasters\nColo...,95,Bird Rock Coffee Roasters,Colombia Cerro Azul Geisha,0,0,0,0,0,...,62/80,9.0,9.0,9.0,9.0,9.0,,"Richly aromatic, chocolaty, fruit-toned. Dark ...",Produced by Rigoberto Herrera of Granja La Esp...,"A trifecta of fruit, chocolate and flowers, bo..."
2,https://www.coffeereview.com/review/yirgacheff...,\n\n\n94\n\n\nRegent Coffee\nYirgacheffe Meng...,94,Regent Coffee,Yirgacheffe Mengesha Natural,1,0,0,0,0,...,60/77,9.0,9.0,9.0,9.0,8.0,,"High-toned, fruit-driven. Boysenberry, pear, c...",Produced at Mengesha Farm from selections of i...,A fruit medley in a cup — think boysenberry an...
3,https://www.coffeereview.com/review/colombia-t...,\n\n\n93\n\n\nRegent Coffee\nColombia Tolima ...,93,Regent Coffee,Colombia Tolima Finca El Mirador Washed Anaerobic,0,0,0,0,0,...,59/79,9.0,9.0,8.0,9.0,8.0,,"Delicately fruit-toned. Guava, ginger blossom,...",Produced by Victor Gutiérrez of Finca Mirador ...,"An appealing washed anaerobic cup: deep-toned,..."
4,https://www.coffeereview.com/review/panama-gei...,\n\n\n94\n\n\nTheory Coffee Roasters\nPanama ...,94,Theory Coffee Roasters,Panama Geisha Finca Debra Symbiosis,0,0,1,0,0,...,62/80,9.0,9.0,9.0,9.0,8.0,,"Richly fruit-forward, floral-toned. Lychee, te...",Produced by Jamison Savage of Finca Debra enti...,A floral- and fruit-driven anaerobic natural P...


### Importing plain text data

In [5]:
# for importing plain text (.txt) or binary files 
# open the file 1
# read it in as a file 2
# we can read in the .csv as a plain text file 
with open(full_path) as file: # 1
   coffee2 = file.read() # 2

# this is an equivelant code ^
coffee2 = open(full_path).read()

# ^ reads the data in as one large string so coffee[0:1000] gives the first  characters
coffee2[0:1000]


'slug,all_text,rating,roaster,name,region_africa_arabia,region_caribbean,region_central_america,region_hawaii,region_asia_pacific,region_south_america,type_espresso,type_organic,type_fair_trade,type_decaffeinated,type_best_value,type_pod_capsule,type_blend,type_estate,type_peaberry,type_barrel_aged,type_aged,location,origin,roast,est_price,review_date,agtron,aroma,acid,body,flavor,aftertaste,with_milk,desc_1,desc_2,desc_3\nhttps://www.coffeereview.com/review/wilton-benitez-geisha/," \n\n\n95\n\n\nJBC Coffee Roasters\nWilton Benitez Geisha\n\n\n \n\n\n\n\n\nRoaster Location:\nMadison, Wisconsin\n\n\nCoffee Origin:\nPiendamó, Cauca Department, Colombia\n\n\nRoast Level:\nMedium-Light\n\n\nAgtron:\n59/81\n\n\nEst. Price:\n$25.00/8 ounces\n\n\n\n\n\n\nReview Date:\nNovember 2022\n\n\nAroma:\n9\n\n\nAcidity/Structure: 9\n\n\nBody:\n9\t\t\t\t\t\t\n\nFlavor:\n9\n\n\nAftertaste:\n9\n\n\n\n\nBlind Assessment: Richly floral-toned, exceptionally sweet. Distinct narcissus, cocoa nib, myrrh, blackb

### Importing PDFs

In [6]:
# for importing pdf files
# install the package in terminal:
# pip3 install PyPDF2

# import the package
import PyPDF2 as pdf

# new path - pdf article on coffee flavor profiles
pdf_path = parent + "/Data/Coffee" + "/coffee_flavor.pdf"

# open the file, "rb" = read, binary 1
# call the pdf reader for pdf1 2
# initiate a new string object to save the pdf textin 3
# for each page of the pages the reader detected in the pdf 4
# add to coffee_pdf the extracted text from the page + a new line 5
pdf1 = open(pdf_path, "rb") # 1
reader = pdf.PdfReader(pdf1) # 2

coffee_pdf = "" # 3
for page in reader.pages: # 4
    coffee_pdf = coffee_pdf + page.extract_text() + "\n" # 5

# ^ reads the data in as one large string so coffee2[0:1000] gives the first 1000 characters
coffee_pdf[0:1000]

'Review\nComplexity of coffee ﬂavor: A compositional and sensory perspective\nWenny B. Sunarharuma,b, David J. Williamsc, Heather E. Smytha,⁎\naQueensland Alliance for Agriculture and Food Innovation (QAAFI), The University of Queensland, PO Box 156 Archer ﬁeld BC, Queensland 4108, Australia\nbDepartment of Food Science and Technology, Faculty of Agricultural Technology, University of Brawijaya, JL. Veteran Malang 65145, Indonesia\ncAgri-Science Queensland, Department of Agriculture, Fisheries and Forestry (DAFF), PO Box 156, Archer ﬁeld BC, Queensland 4108, Australia\nabstract article info\nArticle history:\nReceived 30 November 2013Accepted 23 February 2014Available online 1 March 2014\nKeywords:\nCoffee\nFlavorCoffea arabicaAromaSensoryReviewFor the consumer, ﬂavor is arguably the most important aspect of a good coffee. Coffee ﬂavor is extremely\ncomplex and arises from numerous chemical, biological and physical in ﬂuences of cultivar, coffee cherry maturity,\ngeographical growing l

In [7]:
# for importing pdf files
# install the package in terminal:
# pip3 install tika

# import the package
import tika
tika.initVM()
from tika import parser

# parse the pdf into metadata, content, and status
parsed = parser.from_file(pdf_path)
# view the metadata
parsed["metadata"]
# view the status
parsed["status"]

# save the content into a new object
coffee_pdf2 = parsed["content"]

# ^ reads the data in as one large string so coffee2[0:1000] gives the first 1000 characters
coffee_pdf2[0:1000]

2023-04-05 15:24:02,610 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /var/folders/jb/__4n695j15169dhq5xqgt4l40000gp/T/tika-server.jar.
2023-04-05 15:24:07,483 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /var/folders/jb/__4n695j15169dhq5xqgt4l40000gp/T/tika-server.jar.md5.
2023-04-05 15:24:07,913 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nComplexity of coffee flavor: A compositional and sensory perspective\n\n\nFood Research International 62 (2014) 315–325\n\nContents lists available at ScienceDirect\n\nFood Research International\n\nj ourna l homepage: www.e lsev ie r .com/ locate / foodres\nReview\nComplexity of coffee flavor: A compositional and sensory perspective\nWenny B. Sunarharum a,b, David J. Williams c, Heather E. Smyth a,⁎\na Queensland Alliance for Agriculture and Food Innovation (QAAFI), The University of Queensland, PO Box 156 Archerfield BC, Queensland 4108, Australia\nb Department of Food Science and Technology, Faculty of Agricultural Technology, University of Brawijaya, JL. Veteran Malang 65145, Indonesia\nc Agri-Science Queensland, Department of Agriculture, Fisheries and Forestry (DAFF), PO Box 156, Archerfield BC, Queensland 4108, Australia\n⁎ Corresponding author. Tel.: +61 7 3276

Note the many \n at the beginning of the PDF when we use tika. \n is a character code for a new line which is a type of white space. To remove this use .strip() which will remove any leading or trailing white space. 

In [8]:
coffee_pdf2[0:1000].strip()

'Complexity of coffee flavor: A compositional and sensory perspective\n\n\nFood Research International 62 (2014) 315–325\n\nContents lists available at ScienceDirect\n\nFood Research International\n\nj ourna l homepage: www.e lsev ie r .com/ locate / foodres\nReview\nComplexity of coffee flavor: A compositional and sensory perspective\nWenny B. Sunarharum a,b, David J. Williams c, Heather E. Smyth a,⁎\na Queensland Alliance for Agriculture and Food Innovation (QAAFI), The University of Queensland, PO Box 156 Archerfield BC, Queensland 4108, Australia\nb Department of Food Science and Technology, Faculty of Agricultural Technology, University of Brawijaya, JL. Veteran Malang 65145, Indonesia\nc Agri-Science Queensland, Department of Agriculture, Fisheries and Forestry (DAFF), PO Box 156, Archerfield BC, Queensland 4108, Australia\n⁎ Corresponding author. Tel.: +61 7 32766035.\nE-mail address: h.smyth@uq.edu.au (H.E. Smyth).\n\nhttp://dx.doi.org/10.'

The tika package can also integrate with Tesseract Optimal Character Recognition (OCR) to extract content from images, older PDFs (which are rendered as images), or even webpages. 

In [9]:
# for importing webpages, images, or older PDFs 
# install tesseract ocr in terminal:
# pip3 install tesseract tesseract-lang
import requests

link = "http://dosenashville.com/menu"

# get the link
response = requests.get(link) 
# instead of parser.from_file use parser.from_buffer
parsed = parser.from_buffer(response.content) 

coffee_menu = parsed["content"]

coffee_menu[0:1000].strip()

'Menu — Dose Coffee\n\n\n\n    \n\n\n    \n  \n\n    \n      \n        \n          \n            \n          \n        \n      \n    \n\n    \n\n    \n\n  \n\n  \n    \n      \n        \n          \n          \n        \n        \n          \n          \n        \n        \n          \n          \n        \n        \n          \n          \n        \n        \n          \n          \n        \n        \n          \n          \n        \n        \n          \n          \n        \n      \n    \n\n    \n      \n  \n    \n      Cart\n\n      \n        \n        \n      \n      \n        \n        \n      \n      \n        \n        \n      \n      \n        \n        \n      \n\n      0\n    \n  \n\n    \n\n    \n      \n      \n        \n          \n        \n      \n    \n\n  \n\n\n  \n    \n      \n        \n          \n  \n    \n      \n        \n          \n            \n              Home\n            \n          \n        \n      \n    \n    \n  \n    \n      \n        \n          

### Importing Word documents

In [10]:
# for importing word files
# install docx in terminal:
# pip3 install python-docx
import docx

# new path - word doc on coffee flavor terminology
doc_path = parent + "/Data/Coffee" + "/coffee_flavor.docx"

# call the document reader to the .docx file 1
# initiate a new string object to save the doc text in 2
# for each paragraph of the paragraphs the reader detected in the doc 3
# add to coffee_doc the extracted text from the paragraph + a new line 4
doc = docx.Document(doc_path) # 1

coffee_doc = "" # 2
for paragraph in doc.paragraphs: # 3
    coffee_doc = coffee_doc + paragraph.text + "\n" # 4

# ^ reads the data in as one large string so coffee2[0:1000] gives the first 1000 characters
coffee_doc[0:1000]


'Coffee Flavor Terminology – A Coffee Taste Dictionary for the Noob\nAugust 28, 2022\xa0by\xa0\nCoffee Tasting Words\nBeing able to name the flavors in coffee isn’t just another method for coffee professionals to display their knowledge. If you want to master coffee brewing, knowing how to taste coffee and having the proper vocabulary to express the flavors you distinguish, is an important instrument.\nIt doesn’t matter if you find the coffee you just tasted appealing or not. Improving your ability to discern a coffee’s unique features, will help you discover more about about your coffee taste. As you progress, you will begin to observe what changes in your brewing method result in a better cup.\nWe said it before, espresso is not the best brewing method to explore coffee flavors, since the heavy body masks many of the more delicate notes in coffee. However, you will still be able to detect hints of the origins, varietals, and processing method of your coffee beans. But let’s dive in a

## Cleaning strings
### Cleaning est_price

When using data from .txt, .pdf, or .docx files, these data are loaded as a single string value and can require a bit of cleaning to produce a matrix or data frame that we can begin to process. While this workbook will focus on cleaning the est_price and desc_1 columns from from coffee.csv, the same principles apply will apply to other types of string. 

Expect some trial and error. Especially when working with an unfamiliar set of data, you do not know what patterns exist in the data. What I find most helpful is printing the data somewhere I can easily reference so I can look for patterns, code based that a pattern, and then check if it worked. (We learned how to do this in Workbook 0b!)

In the PDF, webpage, and Word document we uploaded, there are character codes (e.g., \xa0, \n, \nb) that you will see in the string that you don't see when viewing the document or website. You may encounter different types of encoding depending on where the document is coming from, and how the reader (e.g., tika or PyPDF2) is set to translate them. These encoders include ASCII, Unicode, UTF-8, UTF-16, and HTML. Each encoder has different codes for each character in a document including different types of whitespace, lists, etc. If a document has UTF-8 code but the reader is trying to interpret ASCII code then it will attempt to find the closest equivelent in ASCII and if it can't will return an error (this happened in HW 0 with the scores data). 

You'll see these codes from the document appear as part of the string. Note that tika and PyPDF2 return slightly different strings because of how the interpreters are set to work. There are packages that you can use to remove these codes or you may want to remove these manually since they can serve as markers in your data that you can use to isolate the text you are trying to extract.

Now back to the coffee data frame... it includes a column for the estimated price of each coffee, if we print the first 10 cases we find these prices are set in different units. One thing we may want to do is compare the price across the same units. 

In [11]:
coffee["est_price"][0:10]

0     $25.00/8 ounces
1     $59.00/8 ounces
2    $20.50/12 ounces
3    $20.50/12 ounces
4     $45.00/4 ounces
5    $40.00/200 grams
6    $43.00/200 grams
7    $25.00/12 ounces
8     $20.00/6 ounces
9    $40.00/12 ounces
Name: est_price, dtype: object

In [12]:
# isolate just the column we are working with
price = pd.DataFrame(coffee["est_price"])
price = price.rename(columns = {0:"est_price"})

price["est_price"] = price["est_price"].astype("string")
#print(price["est_price"])

From looking at the column, we can note a few things:
- Some cases have a an abbreviation before the $ indicating the type of currency (e.g., NT, HKD) or anoter indicator of currency (e.g., £)
- Some cases have multiple prices listed, these are seperated by ;
- Some cases list the price per bottle, these cases use a - as a seperator instead of whitespace (e.g., 12-ounce bottle). 
- Some cases have extra information encased in ()

In [13]:
# print the first case
print(price["est_price"][45])
print(price["est_price"][1400])
print(price["est_price"][1402])
print(price["est_price"][1427])
print(price["est_price"][1535])

NT $1,500/250 grams
$18.00/4-12-ounce bottles; 32 ounces/$12; 64-ounces/$20.00
$15.00/20 ounces (2 types)
£50.00/10 capsules
€29.95/1 kilo (35.3 ounces)


When approaching cleaning text data, I like to start my tasks with the largest unit. For example, here we are interested in using a single price for comparison, but some cases have multiple prices, therefore this is the largest unit. This will keep us from performing unnecessary cleaning tasks before we narrow our cleaning procedures. So here's how I'm going to approach cleaning the things I noticed above:

1. Isolate a single price for comparison
2. Remove extra information in ()
3. Isolate currency information before the numbers
4. Isolate the numbers
5. Isolate the unit

We can take a few approaches here. One approach is to split the string at the ; since this seperates one price from another. When we do this, cases with multiple prices are split into a list. Each price therefore becomes a string in a list 

In [14]:
# create a new variable in price with the split prices
price["split_prices"] = price["est_price"].str.split(";")

# this is a case with multiple prices that when split, turns into a list
print(price["split_prices"][1400])

# initiate a new list 1
# for each case in split_prices 2
# if the case is missing, 3
# append a blank value to the list 4
# if the case is not missing, 5
# append just the first string in the list 6
# if there is just one price, this will keep it
# if there's more than one price then it will only retain the first one

est_price_split = [] # 1

for each_price in price["split_prices"]: # 2
    if pd.isna(each_price) is True: # 3
        est_price_split.append(None) # 4
    elif pd.isna(each_price) is not True: # 5
        est_price_split.append(each_price[0]) # 6

# use this list to create a new column in our data frame
price["first_price"] = est_price_split

['$18.00/4-12-ounce bottles', ' 32 ounces/$12', ' 64-ounces/$20.00']


#### Remove extra information in ()
Now we have isolated a single price for comparison. The next step is to remove information contained in the () which is additional information that we do not need. Let's see what these cases look like...

In [15]:
# create a new list to capture cases with parentheses 1
# for each price in the variable where we isolated the first price 2
# if the case is not missing 3
# if there is an open parentheses found in each_price 4
# find() will return the location in the substring and if the substring is not found it will return -1 (so any value greater than 0 means an open paren was detected)
# append cases with open parentheses to the list 5

in_paren = [] # 1

for each_price in price["first_price"]: # 2
    if pd.isna(each_price) is not True: # 3
        if each_price.find("(") >= 0: # 4
            in_paren.append(each_price) # 5

print(in_paren)

['NA (available in store only)', '$3.00/sachet (plus one donated)', '$15.00/20 ounces (2 types)', '$45.95/8 ounces (currently on sale for $36.76)', '$13.99/12 ounces ($79.00/5 pounds)', '$13.99/12 ounces ($79.00/5 pounds)', '$21.00/12 ounces (includes shipping)', '€29.95/1 kilo (35.3 ounces)', '$39.95/8 ounces (packaged as a "duo" with Bourbon Rey Guatemala)', '$39.95/8 ounces (packaged as a "duo" with the Bourbon Rey Jamaica)', '$8.99/8 ounces (226 grams)', '$9.99/7 ounces (198 grams)', '$16.98/45 grams (approx. 9 servings)', '$12.00/25.4-ounce bottle (seasonal)', '$65.99/12 10.5-ounce bottles (shipping within California only)']


We see that each case has different information contained within parentheses. We could use str.replace() to remove each individual case but this wouldn't be efficient or scalable. Instead we can use regular expressions to tackle this more efficiently.

#### Regular expressions
**Regular expressions (regex)** are a sequence of characters or groupings that specifies a pattern in a string. 

Characters
|regex|Character|
|---|---|
|'\'|Escape character|
|.|Any character except a new line|
|^|Start of a string|
|$|End of a string|
|*|0 or more repetitions (ab* will match a, ab, abb, abbb, etc.; multiply b by anything >= 0)|
|+|1 or more repetitions (ab+ will match ab, abb, abbb, etc.; multiply b by anything >= 1)|
|?|0 or 1 repetitions (ab? will match a or ab; multiply b by 0 or 1)
|{}|An exact number of copies (a{6} will match exactly 6 a's)|
|{min, max}|A range of copies (a{3, 6} will match exactly 3-5 a's and a{3, } will match 3+ a's)|
|\d|Decimal digit|
|\D|Not a decimal digit|
|\w|Word character|
|\W|Not a word character|
|\s|Whitespace|
|\S|Not whitespace|
|\b|Word boundary|
|\B|Not a word boundary|

Groupings
|regex|Grouping|
|---|---|
|[]|Set of characters|
|[^]|Characters not in brackets|
||Either or|
|()|Any regex in the parentheses|


Now let's use regex search for any data with parentheses (similar to what we did to create in_paren). Instead of searching just for an open parenthesis, we can search for parentheses with some string patterns in them.

Note that parentheses are a grouping regex. To look for parentheses we need to use \( and \) to tell the re package that we don't want to use () as regex but instead want to literally look for ( and ).

In [16]:
# look for parentheses and not grouping regex use \( and \)
# look for an open parenthesis \(
# after the open parenthesis, look for any regex in ()
# any character except a new line .
# with 0 or more repetitions of any character *
# finally, look for the close parenthesis \)
price["first_price"].str.findall(r"\((.*)\)").value_counts()

[]                                                    2262
[$79.00/5 pounds]                                        2
[available in store only]                                1
[plus one donated]                                       1
[2 types]                                                1
[currently on sale for $36.76]                           1
[includes shipping]                                      1
[35.3 ounces]                                            1
[packaged as a "duo" with Bourbon Rey Guatemala]         1
[packaged as a "duo" with the Bourbon Rey Jamaica]       1
[226 grams]                                              1
[198 grams]                                              1
[approx. 9 servings]                                     1
[seasonal]                                               1
[shipping within California only]                        1
Name: first_price, dtype: int64

In [17]:
# print just one test case we know has parentheses
print(price["first_price"][1402])

$15.00/20 ounces (2 types)


In [18]:
# replace all the things we searched for above with a blank value "" 
price["first_price"] = price["first_price"].str.replace(r"\((.*)\)", "")

  price["first_price"] = price["first_price"].str.replace(r"\((.*)\)", "")


In [19]:
# print the test case to see that it worked
print(price["first_price"][1402])

$15.00/20 ounces 


#### Isolate currency information before the numbers
We can also use regular expressions to look for patterns related to currency.

For example, when we search for the exact substring "NT" in our data, we find 547 cases. Where as when we use regular expressions to search just for occurances of NT appearing as the first part of the string, it only returns 544 cases. This means there are 3 cases where the substring NT is found in the string but not as the first part of the string. 

In [20]:
price["first_price"].str.findall("NT").value_counts()

[]      1730
[NT]     547
Name: first_price, dtype: int64

In [21]:
# ^ = start of string
# search for NT at the start of the string
# followed by any 2 characters ..
price["first_price"].str.findall(r"^NT..").value_counts()

[]        1733
[NT $]     528
[NT$3]       3
[NTD ]       3
[NT$5]       2
[NT$6]       2
[NT$4]       2
[NT$2]       1
[NT$8]       1
[NT$7]       1
[NT 4]       1
Name: first_price, dtype: int64

Here we start to pick up on some of the inconsistencies in how currency was input. This indicates that we'll need to do some cleaning before we can isolate it. 

In [22]:
# look for some values at the start of the string ^
# any character .
# with 0 or more repetitions *
# lastly look for a literal $ and don't use it as a regex i.e., \$
price["first_price"].str.findall(r"^.*\$").value_counts()

[$]        1654
[NT $]      528
[]           34
[CAD $]      20
[NT$]        12
[HKD $]       7
[HK $]        5
[AED $]       3
[NTD $]       3
[$NT$]        2
[USD $]       2
[IDR $]       2
[KRW $]       2
[US $]        1
[AUD $]       1
[THB $]       1
Name: first_price, dtype: int64

First, note that most currencies end in a dollar sign, though not all. From the last chunk and this one, we see NT 4 which indicates the start of a price but no \$ before it. We also see 34 cases where no $ was detected. These could be cases where the entire price is missing or typos just missing \$ like NT 4.

Next, note the typos and inconsistencies. \$NT\$ and NT\$ each of which appear in our value counts as cases different from NT \$. And \$, US \$, and USD $ all refer to US dollars. 

Finally, note the missing cases [ ]. If we were to split using \$, this symbol would not be retained. Since the most of the USD cases do not have anything before the \$, if we split on it these USD cases would appear the same as any missing cases (i.e., they would all be empty). 

Let's correct the typos and inconsistencies in currency. 

In [23]:
# replace cases where the substring $NT$ or NT$ are found at the beginning of the string
price["first_price"] = price["first_price"].str.replace(r"^\$NT\$", "NT $")
price["first_price"] = price["first_price"].str.replace(r"^NT\$", "NT $")
price["first_price"] = price["first_price"].str.replace(r"^NTD \$", "NT $")

# replace cases where the substring US $ or just $ are found at the beginning of the string
price["first_price"] = price["first_price"].str.replace(r"^US \$", "USD $")
price["first_price"] = price["first_price"].str.replace(r"^\$", "USD $")

# replace HK with HKD
price["first_price"] = price["first_price"].str.replace(r"^HK \$", "HKD $")

# replace NA with blank
price["first_price"] = price["first_price"].str.replace(r"^NA", "")


# check that it worked
price["first_price"].str.findall(r"^.*\$").value_counts()

  price["first_price"] = price["first_price"].str.replace(r"^\$NT\$", "NT $")
  price["first_price"] = price["first_price"].str.replace(r"^NT\$", "NT $")
  price["first_price"] = price["first_price"].str.replace(r"^NTD \$", "NT $")
  price["first_price"] = price["first_price"].str.replace(r"^US \$", "USD $")
  price["first_price"] = price["first_price"].str.replace(r"^\$", "USD $")
  price["first_price"] = price["first_price"].str.replace(r"^HK \$", "HKD $")
  price["first_price"] = price["first_price"].str.replace(r"^NA", "")


[USD $]    1657
[NT $]      545
[]           34
[CAD $]      20
[HKD $]      12
[AED $]       3
[IDR $]       2
[KRW $]       2
[AUD $]       1
[THB $]       1
Name: first_price, dtype: int64

Now let's dig into the 34 missing cases...

In [24]:
# the longest currency abbreviation is 5 characters: 3 letters, a space, followed by $ 
# look for cases where there is no $ found in the first 5 [^]
price["first_price"].str.findall(r"^[^\$][^\$][^\$][^\$][^\$]").value_counts()


[]         2244
[¥1280]       4
[£50/1]       3
[RM 12]       2
[¥ 2,4]       2
[¥78/1]       2
[£45/1]       2
[#23.0]       1
[18.00]       1
[9.9 E]       1
[£40.5]       1
[See w]       1
[€29.9]       1
[£50.0]       1
[¥ 1,5]       1
[¥ 2,6]       1
[RM 13]       1
[500 p]       1
[¥1,05]       1
[¥2980]       1
[£25.0]       1
[¥88/1]       1
[¥1680]       1
[£100/]       1
[NT 40]       1
Name: first_price, dtype: int64

We see additional symbols for currency (£, ¥, €), some typos (#), and further inconsistencies. For example, one case lists the price as "See website for more information." Because these cases are not consistent, I want to print them completely, to see if there is any additional information in the case that can help determine how they should be recoded.

In [25]:
# check the case that begins with See w
print(price.loc[price["first_price"].str.contains(r"^See w") == True])

# probably $, but check the unit since might give an indication of the currency
print(price.loc[price["first_price"].str.contains(r"^9.9") == True])
print(price.loc[price["first_price"].str.contains(r"^18") == True])

# check if this is supposed to be $ or £
print(price.loc[price["first_price"].str.contains(r"^#") == True])

# check what currency this is supposed to be - maybe 500 pounds
print(price.loc[price["first_price"].str.contains(r"^500 p") == True])

                             est_price                        split_prices  \
1685  See website for more information  [See website for more information]   

                           first_price  
1685  See website for more information  
                est_price           split_prices          first_price
1764  9.9 Euros/225 grams  [9.9 Euros/225 grams]  9.9 Euros/225 grams
            est_price       split_prices      first_price
1773  18.00/12 ounces  [18.00/12 ounces]  18.00/12 ounces
             est_price        split_prices       first_price
1945  #23.00/12 ounces  [#23.00/12 ounces]  #23.00/12 ounces
               est_price           split_prices          first_price
902  500 pesos/200 grams  [500 pesos/200 grams]  500 pesos/200 grams


Recode the cases.

In [26]:
# change to NA if price isn't listed
# in a real world scenario we could look these up and input them
price["first_price"] = price["first_price"].str.replace("See website for more information", "")

# add $ to typos
price["first_price"] = price["first_price"].str.replace(r"^NT [^\$]", "NT $")
price["first_price"] = price["first_price"].str.replace(r"^RM [^\$]", "NT $")
price["first_price"] = price["first_price"].str.replace(r"^18", "USD $18")
price["first_price"] = price["first_price"].str.replace(r"^#", "USD $")

# change to euro symbol
price["first_price"] = price["first_price"].str.replace(r"^9.9 Euros", "€9.9")

# remove space after Yen
price["first_price"] = price["first_price"].str.replace(r"^¥ ", "¥")

# standardize how currency is reported
price["first_price"] = price["first_price"].str.replace(r"^500 pesos", "MXN $500")
price["first_price"] = price["first_price"].str.replace(r"^¥", "JPY $")
price["first_price"] = price["first_price"].str.replace(r"^£", "GBP $")
price["first_price"] = price["first_price"].str.replace(r"^€", "EUR $")

  price["first_price"] = price["first_price"].str.replace(r"^NT [^\$]", "NT $")
  price["first_price"] = price["first_price"].str.replace(r"^RM [^\$]", "NT $")
  price["first_price"] = price["first_price"].str.replace(r"^18", "USD $18")
  price["first_price"] = price["first_price"].str.replace(r"^#", "USD $")
  price["first_price"] = price["first_price"].str.replace(r"^9.9 Euros", "€9.9")
  price["first_price"] = price["first_price"].str.replace(r"^¥ ", "¥")
  price["first_price"] = price["first_price"].str.replace(r"^500 pesos", "MXN $500")
  price["first_price"] = price["first_price"].str.replace(r"^¥", "JPY $")
  price["first_price"] = price["first_price"].str.replace(r"^£", "GBP $")
  price["first_price"] = price["first_price"].str.replace(r"^€", "EUR $")


In [27]:
# check a case to make sure our replacements worked
price["first_price"][1764]

'EUR $9.9/225 grams'

Now that all the cases have a standardized method for reporting currency, we should not have any cases where there are no \$ detected.

In [28]:
price["first_price"].str.findall(r"^[^\$][^\$][^\$][^\$][^\$]").value_counts()

[]    2277
Name: first_price, dtype: int64

Now that all cases are standardized, we can use the \$ as a value to split on to create a new variable for currency.

In [29]:
# create a new list to store the currency 1
# create a new list to store the price and units 2
# for each_price in first_price 3
# if the case is missing 4
# append a blank string 5 & 6
# if it's not missing 7
# split the price by " $" this creates a list of things before and after the split 8
# append currency... this info is before the split so index [0] for the currency 9
# append the remaining price and unit... this info is after the split 10

currency = [] # 1
price_unit = [] # 2

for each_price in price["first_price"]: # 3
    if pd.isna(each_price) is True: # 4
        currency.append("") # 5
        price_unit.append("") # 6
    elif pd.isna(each_price) is False: # 7
        split_price = each_price.split(" $") # 8
        currency.append(split_price[0]) # 9
        price_unit.append(split_price[len(split_price)-1]) # 10

# save the list as a new variable in our data frame
price["currency"] = currency

#### Split price_unit
With the remaining information in price_unit it is formatted as price/unit. We can use the / to split in the same way we did the $.

In [30]:
# format of price_unit is price/unit
print(price_unit[0])

25.00/8 ounces


In [31]:
# this loop follows the same format as the one above
just_price = []
unit = []

for each_case in price_unit:
    if pd.isna(each_case) is True: 
        currency.append("") 
        price_unit.append("") 
    elif pd.isna(each_case) is False: 
        split_price = each_case.split("/") 
        just_price.append(split_price[0]) 
        unit.append(split_price[len(split_price)-1]) 

price["price"] = just_price

price["per_unit"] = unit

In [32]:
# remove any commas, spaces, or letters
price["price"] = price["price"].str.replace(",", "")
price["price"] = price["price"].str.replace(r"\s", "")
price["price"] = price["price"].str.replace(r"[a-zA-Z]", "")

# cast the price to a float
price["price"] = pd.to_numeric(price["price"])

  price["price"] = price["price"].str.replace(r"\s", "")
  price["price"] = price["price"].str.replace(r"[a-zA-Z]", "")


In [33]:
price["price"].value_counts()

18.00      98
20.00      88
19.00      73
25.00      59
16.00      57
           ..
13.25       1
3600.00     1
65.95       1
498.00      1
18.40       1
Name: price, Length: 351, dtype: int64

#### Isolate the unit
The unit contains a number, space, and a string. Here we can use regex to grab the digits and words seperately. When looking at the unique per-unit values, there are some bottles and cans where the format is not uniform. When there are multiple bottles or cans, the quantity is listed first. So, here we use regex to list all the cases that don't begin with a digit. Some of these cases begin with whitespace, so first we'll get rid of these cases. 

In [34]:
price["per_unit"].value_counts()

12 ounces                1092
8 ounces                  369
4 ounces                  127
227 grams                  86
16 ounces                  86
                         ... 
50 ounces                   1
Four 12-ounce bottles       1
10.5-ounce bottle           1
Four 8.4-ounce cans         1
12 ounces online            1
Name: per_unit, Length: 111, dtype: int64

In [35]:
# for cases that begin with whitespace, remove that whitespace
price["per_unit"] = price["per_unit"].str.replace(r"^\s", "")

# print all unique cases that don't begin with a digit
print(price.loc[price["per_unit"].str.contains(r"^\D") == True]["per_unit"].value_counts())

five 5-gram single-serve packets        4
eight 3.3 gram packets                  2
six 8-ounce cans                        1
four 8-ounce cans                       1
six 12-ounce cans                       1
twelve 6-ounce cans                     1
seven single-serve pouches              1
can                                     1
Four 12-ounce bottles                   1
Four 8.4-ounce cans                     1
sachet                                  1
thirty 1.6-gram single-serve packets    1
six 5-gram packets                      1
six 5-gram single-serve packets         1
eight 5-gram tubes                      1
Name: per_unit, dtype: int64


  price["per_unit"] = price["per_unit"].str.replace(r"^\s", "")


In [36]:
price["per_unit"] = price["per_unit"].str.replace("five 5-gram single-serve packets", "25 grams")
price["per_unit"] = price["per_unit"].str.replace("eight 3.3 gram packets", "26.4 grams")
price["per_unit"] = price["per_unit"].str.replace("six 8-ounce cans", "48 ounces")
price["per_unit"] = price["per_unit"].str.replace("four 8-ounce cans", "32 ounces")
price["per_unit"] = price["per_unit"].str.replace("six 12-ounce cans", "72 ounces")
price["per_unit"] = price["per_unit"].str.replace("twelve 6-ounce cans", "72 ounces")
price["per_unit"] = price["per_unit"].str.replace("seven single-serve pouches", "")
price["per_unit"] = price["per_unit"].str.replace("can", "")
price["per_unit"] = price["per_unit"].str.replace("Four 12-ounce bottles", "48 ounces")
price["per_unit"] = price["per_unit"].str.replace("Four 8.4-ounce cans", "33.6 ounces")
price["per_unit"] = price["per_unit"].str.replace("sachet", "")
price["per_unit"] = price["per_unit"].str.replace("thirty 1.6-gram single-serve packets", "48 grams")
price["per_unit"] = price["per_unit"].str.replace("six 5-gram packets", "30 grams")
price["per_unit"] = price["per_unit"].str.replace("six 5-gram single-serve packets", "30 grams")
price["per_unit"] = price["per_unit"].str.replace("eight 5-gram tubes", "40 grams")
price["per_unit"] = price["per_unit"].str.replace("Four 8.4-ounce s", "33.6 ounces")
price["per_unit"] = price["per_unit"].str.replace("NA", "")

  price["per_unit"] = price["per_unit"].str.replace("eight 3.3 gram packets", "26.4 grams")
  price["per_unit"] = price["per_unit"].str.replace("Four 8.4-ounce cans", "33.6 ounces")
  price["per_unit"] = price["per_unit"].str.replace("thirty 1.6-gram single-serve packets", "48 grams")
  price["per_unit"] = price["per_unit"].str.replace("Four 8.4-ounce s", "33.6 ounces")


In [37]:
# check that the recodes worked
# print all unique cases that don't begin with a digit
print(price.loc[price["per_unit"].str.contains(r"^\D") == True]["per_unit"].value_counts())

     1
Name: per_unit, dtype: int64


In [38]:
# get 0 or more repetitions of digits (\d*) with 0 or 1 decimals or commas [.,]? in between
price["per"] = price["per_unit"].str.extract(r"(\d*[.,]?\d*)")

# cast to numeric
price["per"] = pd.to_numeric(price["per"])

# get anything that is lower or uppercase a-z for 1 or more repetitions
price["unit"] = price["per_unit"].str.extract(r"([a-zA-Z]+)")

In [39]:
print(price["per"].value_counts())

12.0     1110
8.0       380
4.0       128
227.0      88
16.0       86
         ... 
275.0       1
460.0       1
70.0        1
453.0       1
25.4        1
Name: per, Length: 64, dtype: int64


In [40]:
print(price["unit"].value_counts())

ounces      1816
grams        405
ounce         19
capsules      10
g              6
ml             6
gram           4
pounds         3
kilo           1
sticks         1
single         1
Name: unit, dtype: int64


There is not too much additional cleaning to do except making sure all the units that are on the same scale are reported in the same way.

In [41]:
# recodes
# grams
# if we didn't use regex here it would replace the "gram" in "grams" with "grams" giving us "gramss"
price["unit"] = price["unit"].str.replace(r"gram$", "grams")
price["unit"] = price["unit"].str.replace(r"g$", "grams")

# kilograms

price["unit"] = price["unit"].str.replace("kilo", "kilograms")

# ounces
price["unit"] = price["unit"].str.replace(r"ounce$", "ounces")

# milliliters
price["unit"] = price["unit"].str.replace("ml", "milliliters")

# unit unknown
price["unit"] = price["unit"].str.replace("capsules", "")
price["unit"] = price["unit"].str.replace("sticks", "")
price["unit"] = price["unit"].str.replace("single", "")


  price["unit"] = price["unit"].str.replace(r"gram$", "grams")
  price["unit"] = price["unit"].str.replace(r"g$", "grams")
  price["unit"] = price["unit"].str.replace(r"ounce$", "ounces")


Our cases are now consistent! Though, they are not yet on the same scale. 

In [42]:
# value counts
price["unit"].value_counts()

ounces         1835
grams           415
                 12
milliliters       6
pounds            3
kilograms         1
Name: unit, dtype: int64

In [43]:
price["currency"].value_counts()

USD    1659
NT      549
CAD      20
JPY      14
HKD      12
GBP       9
          6
AED       3
EUR       2
IDR       2
KRW       2
          1
MXN       1
AUD       1
THB       1
Name: currency, dtype: int64

To get them all on the same scale, we can create a new column for exchange rate which will help us convert all our prices to the same currency. We can do the same with the units to conver everything to the same unit.

In [44]:
# initate the variable for exchange rate
price["price_multiplier"] = None

# create a dictinary for converting to USD
exchange_rate = {"USD" : 1, 
                 "NT" : 0.033,
                 "CAD" : 0.73,
                 "JPY" : 0.0075,
                 "GBP" : 1.22,
                 "HKD" : 0.13,
                 "AED" : 0.27,
                 "KRW" : 0.00077,
                 "EUR" : 1.07,
                 "IDR" : 0.000065,
                 "MXN" : 0.053,
                 "AUD" : 0.67,
                 "THB" : 0.029}

# recode the variable
price = price.assign(price_multiplier = price["currency"].map(exchange_rate))

In [45]:
# initate the unit multiplier
price["unit_multiplier"] = None

# create a dictionary for converting to ounces
unit_conversion = {"ounces" : 1,
                   "grams" :.035274,
                   "pounds" : 0.00220462,
                   "kilograms" : 0.001,
                   "milliliters" : 0.033814}

# recode the variable
price = price.assign(unit_multiplier = price["unit"].map(unit_conversion))

Finally, we can create a new variable that represents the price in USD per ounce.

In [46]:
# create the new variable
price["usd_per_ounce"] = (price["price"] * price["price_multiplier"]) / (price["per"] * price["unit_multiplier"])

price[20:40]

Unnamed: 0,est_price,split_prices,first_price,currency,price,per_unit,per,unit,price_multiplier,unit_multiplier,usd_per_ounce
20,$24.50/12 ounces,[$24.50/12 ounces],USD $24.50/12 ounces,USD,24.5,12 ounces,12.0,ounces,1.0,1.0,2.041667
21,$23.00/12 ounces,[$23.00/12 ounces],USD $23.00/12 ounces,USD,23.0,12 ounces,12.0,ounces,1.0,1.0,1.916667
22,$18.99/8 ounces,[$18.99/8 ounces],USD $18.99/8 ounces,USD,18.99,8 ounces,8.0,ounces,1.0,1.0,2.37375
23,$22.50/12 ounces,[$22.50/12 ounces],USD $22.50/12 ounces,USD,22.5,12 ounces,12.0,ounces,1.0,1.0,1.875
24,$20.95/12 ounces,[$20.95/12 ounces],USD $20.95/12 ounces,USD,20.95,12 ounces,12.0,ounces,1.0,1.0,1.745833
25,$16.95/12 ounces,[$16.95/12 ounces],USD $16.95/12 ounces,USD,16.95,12 ounces,12.0,ounces,1.0,1.0,1.4125
26,$35.00/200 grams,[$35.00/200 grams],USD $35.00/200 grams,USD,35.0,200 grams,200.0,grams,1.0,0.035274,4.961161
27,$17.95/12 ounces,[$17.95/12 ounces],USD $17.95/12 ounces,USD,17.95,12 ounces,12.0,ounces,1.0,1.0,1.495833
28,$19.25/12 ounces,[$19.25/12 ounces],USD $19.25/12 ounces,USD,19.25,12 ounces,12.0,ounces,1.0,1.0,1.604167
29,NT $375/8 ounces,[NT $375/8 ounces],NT $375/8 ounces,NT,375.0,8 ounces,8.0,ounces,0.033,1.0,1.546875


Now we'll return to the coffee data frame. Save the new usd_per_ounce column to coffee.

In [47]:
coffee["usd_per_ounce"] = price["usd_per_ounce"]

### Cleaning coffee descriptions (desc_1)
There are 3 variables in coffee that contain descriptions of each coffee. We'll use the first description. Let's prepare these data by removing any punctuation and converting all words to lowercase.


In [48]:
coffee[["desc_1", "desc_2", "desc_3"]][0:20]

Unnamed: 0,desc_1,desc_2,desc_3
0,"Richly floral-toned, exceptionally sweet. Dist...",Produced by Wilton Benitez of Macarena Farm en...,"A nuanced, complex experimentally processed Co..."
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...",Produced by Rigoberto Herrera of Granja La Esp...,"A trifecta of fruit, chocolate and flowers, bo..."
2,"High-toned, fruit-driven. Boysenberry, pear, c...",Produced at Mengesha Farm from selections of i...,A fruit medley in a cup — think boysenberry an...
3,"Delicately fruit-toned. Guava, ginger blossom,...",Produced by Victor Gutiérrez of Finca Mirador ...,"An appealing washed anaerobic cup: deep-toned,..."
4,"Richly fruit-forward, floral-toned. Lychee, te...",Produced by Jamison Savage of Finca Debra enti...,A floral- and fruit-driven anaerobic natural P...
5,"High-toned, richly bittersweet. Pomelo, raspbe...",Produced by Jamison Savage of Finca Debra enti...,"A complex, multi-layered experimentally proces..."
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...",Produced by Jamison Savage of Finca Debra enti...,"A balanced, richly sweet Panama Geisha cup, pr..."
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...",Produced by smallholding farmers from trees of...,"An invitingly elegant washed Ethiopia cup, def..."
8,"Richly spice-toned, floral-driven. Bergamot, l...",Produced by small-holding farmers largely from...,A lyrically composed Ethiopia anaerobic cup wi...
9,"High-toned, crisply sweet-tart. Lemongrass, co...",Produced by Gibran Leonardo Cervantes Covarrub...,A particularly fine Mexico Geisha: elegantly s...


In [49]:
# make lower case
def make_lower(text):
    new_text = text.lower()
    return new_text

coffee["lower_desc_1"] = coffee["desc_1"].apply(make_lower)

In [50]:
# replace anything that is not a lower or upper case letter, white space, apostrophe or hyphen with a blank
coffee["clean_desc_1"] = coffee["lower_desc_1"].str.replace(r"([^a-zA-Z\s\-'])", "")

# if we wanted to replace hyphens we could run the following
#coffee["clean_desc_1"] = coffee["clean_desc_1"].str.replace("-", " ")

  coffee["clean_desc_1"] = coffee["lower_desc_1"].str.replace(r"([^a-zA-Z\s\-'])", "")


## Tokenization
**Tokenization** is the process for breaking up text into smaller units. This example uses 1-word tokens though tokens can be made up of any unit smaller than the complete text, e.g., 2, 3, etc. words, phrases, or sentences. 

In this section, we'll clean and prep the coffee description text and then sperarate it into 1-word tokens. 

In [51]:
# create a function to split the string into tokens
def split_string(text):
    tokens = text.split(" ")
    return tokens

# apply this function to the cleaned description
coffee["desc_1_tokens"] = coffee["clean_desc_1"].apply(split_string)

# view the original desc_1 column, cleaned, and split side-by-side for the first 10 cases
coffee[["desc_1", "lower_desc_1", "clean_desc_1", "desc_1_tokens"]][0:10]

Unnamed: 0,desc_1,lower_desc_1,clean_desc_1,desc_1_tokens
0,"Richly floral-toned, exceptionally sweet. Dist...","richly floral-toned, exceptionally sweet. dist...",richly floral-toned exceptionally sweet distin...,"[richly, floral-toned, exceptionally, sweet, d..."
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...","richly aromatic, chocolaty, fruit-toned. dark ...",richly aromatic chocolaty fruit-toned dark cho...,"[richly, aromatic, chocolaty, fruit-toned, dar..."
2,"High-toned, fruit-driven. Boysenberry, pear, c...","high-toned, fruit-driven. boysenberry, pear, c...",high-toned fruit-driven boysenberry pear cocoa...,"[high-toned, fruit-driven, boysenberry, pear, ..."
3,"Delicately fruit-toned. Guava, ginger blossom,...","delicately fruit-toned. guava, ginger blossom,...",delicately fruit-toned guava ginger blossom co...,"[delicately, fruit-toned, guava, ginger, bloss..."
4,"Richly fruit-forward, floral-toned. Lychee, te...","richly fruit-forward, floral-toned. lychee, te...",richly fruit-forward floral-toned lychee tea r...,"[richly, fruit-forward, floral-toned, lychee, ..."
5,"High-toned, richly bittersweet. Pomelo, raspbe...","high-toned, richly bittersweet. pomelo, raspbe...",high-toned richly bittersweet pomelo raspberry...,"[high-toned, richly, bittersweet, pomelo, rasp..."
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...","crisply sweet-tart. apricot, cocoa nib, agave ...",crisply sweet-tart apricot cocoa nib agave syr...,"[crisply, sweet-tart, apricot, cocoa, nib, aga..."
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...","high-toned, juicy-sweet. mango, cocoa nib, mag...",high-toned juicy-sweet mango cocoa nib magnoli...,"[high-toned, juicy-sweet, mango, cocoa, nib, m..."
8,"Richly spice-toned, floral-driven. Bergamot, l...","richly spice-toned, floral-driven. bergamot, l...",richly spice-toned floral-driven bergamot lila...,"[richly, spice-toned, floral-driven, bergamot,..."
9,"High-toned, crisply sweet-tart. Lemongrass, co...","high-toned, crisply sweet-tart. lemongrass, co...",high-toned crisply sweet-tart lemongrass cocoa...,"[high-toned, crisply, sweet-tart, lemongrass, ..."


Natural Language Toolkit (NLTK) also has a tokenizer which tokenizes based on white space and punctuation. Note that the NLTK tokenizer retains punctuation as a token. 

In [52]:
# create a function to tokenize using nltk's tokenizer
def tokenize(text):
    new_text = nltk.tokenize.word_tokenize(text)
    return new_text

coffee["desc_1_tokens_nltk"] = coffee["lower_desc_1"].apply(tokenize)

# view the original desc_1 column, cleaned, and split side-by-side for the first 10 cases
coffee[["desc_1", "lower_desc_1", "clean_desc_1", "desc_1_tokens", "desc_1_tokens_nltk"]][0:10]

Unnamed: 0,desc_1,lower_desc_1,clean_desc_1,desc_1_tokens,desc_1_tokens_nltk
0,"Richly floral-toned, exceptionally sweet. Dist...","richly floral-toned, exceptionally sweet. dist...",richly floral-toned exceptionally sweet distin...,"[richly, floral-toned, exceptionally, sweet, d...","[richly, floral-toned, ,, exceptionally, sweet..."
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...","richly aromatic, chocolaty, fruit-toned. dark ...",richly aromatic chocolaty fruit-toned dark cho...,"[richly, aromatic, chocolaty, fruit-toned, dar...","[richly, aromatic, ,, chocolaty, ,, fruit-tone..."
2,"High-toned, fruit-driven. Boysenberry, pear, c...","high-toned, fruit-driven. boysenberry, pear, c...",high-toned fruit-driven boysenberry pear cocoa...,"[high-toned, fruit-driven, boysenberry, pear, ...","[high-toned, ,, fruit-driven, ., boysenberry, ..."
3,"Delicately fruit-toned. Guava, ginger blossom,...","delicately fruit-toned. guava, ginger blossom,...",delicately fruit-toned guava ginger blossom co...,"[delicately, fruit-toned, guava, ginger, bloss...","[delicately, fruit-toned, ., guava, ,, ginger,..."
4,"Richly fruit-forward, floral-toned. Lychee, te...","richly fruit-forward, floral-toned. lychee, te...",richly fruit-forward floral-toned lychee tea r...,"[richly, fruit-forward, floral-toned, lychee, ...","[richly, fruit-forward, ,, floral-toned, ., ly..."
5,"High-toned, richly bittersweet. Pomelo, raspbe...","high-toned, richly bittersweet. pomelo, raspbe...",high-toned richly bittersweet pomelo raspberry...,"[high-toned, richly, bittersweet, pomelo, rasp...","[high-toned, ,, richly, bittersweet, ., pomelo..."
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...","crisply sweet-tart. apricot, cocoa nib, agave ...",crisply sweet-tart apricot cocoa nib agave syr...,"[crisply, sweet-tart, apricot, cocoa, nib, aga...","[crisply, sweet-tart, ., apricot, ,, cocoa, ni..."
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...","high-toned, juicy-sweet. mango, cocoa nib, mag...",high-toned juicy-sweet mango cocoa nib magnoli...,"[high-toned, juicy-sweet, mango, cocoa, nib, m...","[high-toned, ,, juicy-sweet, ., mango, ,, coco..."
8,"Richly spice-toned, floral-driven. Bergamot, l...","richly spice-toned, floral-driven. bergamot, l...",richly spice-toned floral-driven bergamot lila...,"[richly, spice-toned, floral-driven, bergamot,...","[richly, spice-toned, ,, floral-driven, ., ber..."
9,"High-toned, crisply sweet-tart. Lemongrass, co...","high-toned, crisply sweet-tart. lemongrass, co...",high-toned crisply sweet-tart lemongrass cocoa...,"[high-toned, crisply, sweet-tart, lemongrass, ...","[high-toned, ,, crisply, sweet-tart, ., lemong..."


## Stop words
**Stop words** are common words or phrases that do not contribute much information to the analysis. Thus reducing the size of data set and training time (since we have fewer tokens to train). In this section we'll create our own list of stop words and NLTK's and work through how to remove them from our data.

The size of the list of stop words will depend on the analysis and research questions. For example, perhaps we have a book review where Reader 1 says, "The book was so good" while Reader 2 says, "The book was not good at all." After removing NLTK's stop words, both reviews would be reduced to "book" and "good." This might be okay if the purpose of our analysis.

For example, if the purpose of the analysis was analysing the themes that students talked about when reflecting on a course. From this we could determine that students gave feedback about the course resources. However, if we remove the stop words, we can't determine their sentiment towards these resources.

Another example of why stop words might not be helpful is if we were analysing data from a publishing company where the reviewers were agents and editors who had read a submission. The company wants to use the data as an initial indicator for whether the company should invest in the book. For the purpose of this analysis, removing the complete list of NLTK's stop words would not be useful. 

In [53]:
review1 = ["the", "book", "was", "not", "good", "at", "all"]
review2 = ["the", "book", "was", "so", "good"]

# create a set of nltk's stop words
nltk_stop_words = set(nltk.corpus.stopwords.words("english"))

# create a function to remove stop words that requires 1 input, a list of tokens 1 
# create a new list to save the output to 2
# for each token in the list 3
# if the token isn't in the list of stopwords 4
# append the token to clean_tokens list 5 
# return the list of clean tokens

def remove_stop_words(token_list):  # 1
    clean_tokens = [] # 2
    for token in token_list: # 3
        if token not in nltk_stop_words: # 4 
            clean_tokens.append(token) # 5
    return clean_tokens

print(remove_stop_words(review1))
print(remove_stop_words(review2))

['book', 'good']
['book', 'good']


NLTK's list of stop words includes 179 frequently used words which include articles, prepositions, pronouns, conjunctions, etc. Another package with stop words is spaCy which has an even longer list of stop words (326). 

In [54]:
print(len(nltk_stop_words))
print(nltk_stop_words)

179
{'my', 'your', 'as', 'further', 'now', 'so', 'during', 'ours', 'yours', 'while', 'here', 'hers', 'didn', 'at', 'y', "didn't", "couldn't", 'own', 'herself', 'aren', 'i', 'be', "don't", 'myself', 'have', 'm', 'any', 'd', 'her', 'by', "weren't", 'how', "you'd", 'again', 'or', 'off', 'theirs', 'with', 'too', "you've", 'yourselves', 'needn', 'same', 'shouldn', 'are', 'of', 'you', 'o', 'hasn', 'over', 'up', 'each', 'been', 'a', 'and', 'our', 'most', 'if', 'nor', 've', "haven't", 'whom', 'did', "won't", 'more', 'than', 'such', 'had', 'were', 's', 'until', 'before', "mustn't", 'mustn', 'their', 'that', 'ourselves', "should've", 'don', "aren't", 'them', 'very', 'its', 'against', 'who', 'does', "you'll", 'into', 'being', 'below', 'in', "it's", "needn't", 're', "you're", 'ain', 'to', 'mightn', "shan't", 'is', "doesn't", 'we', 'his', 'then', 'isn', 'was', 'once', 'these', 'on', "wouldn't", 'himself', 'the', 'am', "she's", 'll', "mightn't", 'can', 'itself', 'shan', 'just', 'should', 'doesn', 't

In [55]:
# pip3 install spacy
import spacy
from spacy.lang.en import English

#loading the english language small model of spacy
nlp = English()

# get th
spacy_stop_words = nlp.Defaults.stop_words
print(len(spacy_stop_words))
print(spacy_stop_words)

326
{'as', 'further', 'now', 'so', 'during', 'while', 'former', 'be', '’ll', 'her', 'well', 'almost', '‘s', 'again', 'whole', "'m", 'twenty', 'without', 'however', 'whereby', 'empty', 'twelve', 'been', 'elsewhere', 'many', 'than', 'except', 'such', 'noone', 'onto', 'whereas', 'their', 'even', 'back', 'who', 'thus', 'anywhere', 'last', 'serious', "'s", 'his', 'we', 'somewhere', 'on', 'whereupon', 'am', 'either', 'hundred', 'already', 'just', 'behind', 'may', 'doing', 'name', 'after', 'some', 'amongst', 'do', 'see', 'what', 'when', 'anyone', 'several', 'anything', 'has', 'they', 'whether', "'ve", 'call', 'else', 'sixty', 'whose', 'my', 'your', '’s', 'here', 'yet', 'herself', 'anyhow', 'somehow', 'how', 'or', "'d", 'toward', 'same', 'are', 'of', 'up', '‘m', 'most', 'moreover', 'nor', 'thereby', 'otherwise', 'much', 'via', 'nothing', 'perhaps', 'were', 'rather', 'before', 'since', 'whenever', 'very', 'against', 'does', 'namely', 'being', 'beside', 'something', 'seemed', 'to', 'first', 'wha

We can also customize a list of stop words based on our own knowledge of the data. For example, in this case, the word coffee will not provide much information about the description since all the descriptions are talking about coffee.

In [56]:
# find cases that contain the word coffee
coffee.loc[coffee["desc_1"].str.contains("coffee") == True]["desc_1"]

10      Crisply sweet-tart, richly fruit-toned Costa R...
301     Evaluated as espresso. Extremely sweet, intent...
531     Richly fruit-toned, complex. Dried mango, coco...
819     A ready-to-drink black coffee, tested cold. Dr...
820     A ready-to-drink black coffee, tested cold. Dr...
                              ...                        
2105    A ready-to-drink bottled black coffee tested o...
2106    A ready-to-drink bottled black coffee tested o...
2107    A ready-to-drink bottled black coffee tested o...
2108    A ready-to-drink bottled black coffee tested o...
2266    Evaluated as espresso. Bright, zesty-sweet. Ba...
Name: desc_1, Length: 64, dtype: object

In [57]:
stop_words = {"coffee"}

To combine our list of stop words to the list created by NLTK, use | which gives the union of two sets. 

In [58]:
combined_stop_words = stop_words | nltk_stop_words
print(combined_stop_words)

{'my', 'your', 'as', 'further', 'now', 'so', 'during', 'ours', 'yours', 'while', 'here', 'hers', 'didn', 'at', 'y', "didn't", "couldn't", 'own', 'herself', 'aren', 'i', 'be', "don't", 'myself', 'have', 'm', 'any', 'd', 'her', 'by', "weren't", 'how', "you'd", 'again', 'or', 'off', 'theirs', 'with', 'too', "you've", 'yourselves', 'needn', 'same', 'shouldn', 'are', 'of', 'you', 'o', 'hasn', 'over', 'up', 'each', 'been', 'a', 'and', 'our', 'most', 'if', 'nor', 've', "haven't", 'whom', 'did', "won't", 'more', 'than', 'such', 'had', 'were', 's', 'until', 'before', "mustn't", 'mustn', 'their', 'that', 'ourselves', "should've", 'don', "aren't", 'them', 'very', 'its', 'against', 'who', 'does', "you'll", 'into', 'being', 'below', 'in', "it's", "needn't", 're', "you're", 'ain', 'to', 'mightn', "shan't", 'is', "doesn't", 'we', 'his', 'then', 'isn', 'was', 'once', 'these', 'on', "wouldn't", 'himself', 'the', 'am', "she's", 'll', "mightn't", 'can', 'itself', 'shan', 'just', 'should', 'doesn', 'this'

We can also use this to see how much overlap there is between spaCy's (326) and NLTK's (179) stop words. There are 382 unique words from combining both sets meaning, there's a 56 stop words in NLTK that don't appear in spaCy.

In [59]:
len(spacy_stop_words|nltk_stop_words)

382

Once we have a final list of stop words, we'll want to remove them from our tokens. 

In [60]:
# create a function with that requires 1 input, a list of tokens 1
# initate a new list 2
# for each token in the list 3
# if the token is not in our list of stop words 4
# append the word to the initated list 5
# return the list of non-stop words 6

def remove_stop_words(token_list): # 1
    new_tokens = [] # 2
    for token in token_list: # 3
        if token not in combined_stop_words: # 4
            new_tokens.append(token) # 5
    return new_tokens # 6

# create a new column in our data set of clean tokens with the stop words removed
coffee["desc_1_tokens_no_stop"] = coffee["desc_1_tokens"].apply(remove_stop_words)

# view the original desc_1 column, cleaned, and split side-by-side for the first 10 cases
coffee[["desc_1", "lower_desc_1", "clean_desc_1", "desc_1_tokens", "desc_1_tokens_nltk", "desc_1_tokens_no_stop"]][0:10]

Unnamed: 0,desc_1,lower_desc_1,clean_desc_1,desc_1_tokens,desc_1_tokens_nltk,desc_1_tokens_no_stop
0,"Richly floral-toned, exceptionally sweet. Dist...","richly floral-toned, exceptionally sweet. dist...",richly floral-toned exceptionally sweet distin...,"[richly, floral-toned, exceptionally, sweet, d...","[richly, floral-toned, ,, exceptionally, sweet...","[richly, floral-toned, exceptionally, sweet, d..."
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...","richly aromatic, chocolaty, fruit-toned. dark ...",richly aromatic chocolaty fruit-toned dark cho...,"[richly, aromatic, chocolaty, fruit-toned, dar...","[richly, aromatic, ,, chocolaty, ,, fruit-tone...","[richly, aromatic, chocolaty, fruit-toned, dar..."
2,"High-toned, fruit-driven. Boysenberry, pear, c...","high-toned, fruit-driven. boysenberry, pear, c...",high-toned fruit-driven boysenberry pear cocoa...,"[high-toned, fruit-driven, boysenberry, pear, ...","[high-toned, ,, fruit-driven, ., boysenberry, ...","[high-toned, fruit-driven, boysenberry, pear, ..."
3,"Delicately fruit-toned. Guava, ginger blossom,...","delicately fruit-toned. guava, ginger blossom,...",delicately fruit-toned guava ginger blossom co...,"[delicately, fruit-toned, guava, ginger, bloss...","[delicately, fruit-toned, ., guava, ,, ginger,...","[delicately, fruit-toned, guava, ginger, bloss..."
4,"Richly fruit-forward, floral-toned. Lychee, te...","richly fruit-forward, floral-toned. lychee, te...",richly fruit-forward floral-toned lychee tea r...,"[richly, fruit-forward, floral-toned, lychee, ...","[richly, fruit-forward, ,, floral-toned, ., ly...","[richly, fruit-forward, floral-toned, lychee, ..."
5,"High-toned, richly bittersweet. Pomelo, raspbe...","high-toned, richly bittersweet. pomelo, raspbe...",high-toned richly bittersweet pomelo raspberry...,"[high-toned, richly, bittersweet, pomelo, rasp...","[high-toned, ,, richly, bittersweet, ., pomelo...","[high-toned, richly, bittersweet, pomelo, rasp..."
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...","crisply sweet-tart. apricot, cocoa nib, agave ...",crisply sweet-tart apricot cocoa nib agave syr...,"[crisply, sweet-tart, apricot, cocoa, nib, aga...","[crisply, sweet-tart, ., apricot, ,, cocoa, ni...","[crisply, sweet-tart, apricot, cocoa, nib, aga..."
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...","high-toned, juicy-sweet. mango, cocoa nib, mag...",high-toned juicy-sweet mango cocoa nib magnoli...,"[high-toned, juicy-sweet, mango, cocoa, nib, m...","[high-toned, ,, juicy-sweet, ., mango, ,, coco...","[high-toned, juicy-sweet, mango, cocoa, nib, m..."
8,"Richly spice-toned, floral-driven. Bergamot, l...","richly spice-toned, floral-driven. bergamot, l...",richly spice-toned floral-driven bergamot lila...,"[richly, spice-toned, floral-driven, bergamot,...","[richly, spice-toned, ,, floral-driven, ., ber...","[richly, spice-toned, floral-driven, bergamot,..."
9,"High-toned, crisply sweet-tart. Lemongrass, co...","high-toned, crisply sweet-tart. lemongrass, co...",high-toned crisply sweet-tart lemongrass cocoa...,"[high-toned, crisply, sweet-tart, lemongrass, ...","[high-toned, ,, crisply, sweet-tart, ., lemong...","[high-toned, crisply, sweet-tart, lemongrass, ..."


Let's see how many stop words were removed from each row...

In [61]:
# number of stop words removed from each row
stop_words_removed = coffee["desc_1_tokens"].apply(len) - coffee["desc_1_tokens_no_stop"].apply(len)
print(stop_words_removed[0:10])

# average number of stop words removed
print("on avg there were", round(stop_words_removed.mean(), 3), "stop words removed")

0    10
1     6
2     9
3     6
4     4
5    12
6     5
7     6
8     9
9     9
dtype: int64
on avg there were 8.123 stop words removed


## Stemming
**Stemming** is the process of reducing a word to its root form so that a group of related words are captured in the same stem. For example, if we refer back to the course feedback example, "book" and "books" would be captured as separate tokens without stemming. After stemming, the root of both these words is "book." Stemming applies a rule based approach for slicing the prefix and/or suffix from a word. 

However, stemming can result in incorrect grouping of words from over- and under-stemming.

**Over-stemming** is when two words are stemmed to the same root but should be different roots. For example, if "university" and "universe" are both stemmed to "univers."

**Under-stemming** is when two words are stemmed to different roots but they should be the same root. For example, if "fair" and "farily" are separate roots. 

When stemming, the root may not neccesarily have an appropriate meaning, for example "univers," but can still be useful for the purpose of the analysis.


One commonly used stemmer is the Porter stemmer which was written and is mantained by Martin Porter. The Porter stemmer removes commoner morphological and inflexional endings from words in English (Porter, 1980).
- **Morphological:** the structure of words such as stems, root words, prefixes, and suffixes
- **Inflectional:** changes in the form of a word to distinguish tense, person, number, gender, mood, voice, or case.

There is also the Snowball stemmer which is a revised version of the Porter stemmer. And the Lancaster stemmer which is the most agressive of the three stemmers and often leads to over stemming. 

Regardless of the stemmer used, stemming does not consider how a word is being used in the context of the sentence e.g., if it's being used as a noun or a verb.

In [62]:
# create the stemmers from NLTK
porter_stemmer = nltk.stem.porter.PorterStemmer()
snowball_stemmer = nltk.stem.snowball.SnowballStemmer("english")
lancaster_stemmer = nltk.stem.lancaster.LancasterStemmer()

Even though each of the stemmers produce different stems, it is fine so long as they aren't over- or under-stemming. In the case below, the Snowball and Lancaster stemmers perform well while the Porter stem under-stems exceed and exceedingly. 

In [63]:
# note differences in the stems from different stemmers
print("porter stem of exceed:", porter_stemmer.stem("exceed"))
print("porter stem of exceedingly:", porter_stemmer.stem("exceedingly"), "\n")

print("snowball stem of exceed:", snowball_stemmer.stem("exceed"))
print("snowball stem of exceedingly:", snowball_stemmer.stem("exceedingly"), "\n")

print("lancaster stem of exceed:", lancaster_stemmer.stem("exceed"))
print("lancaster stem of exceedingly:", lancaster_stemmer.stem("exceedingly"))

porter stem of exceed: exceed
porter stem of exceedingly: exceedingli 

snowball stem of exceed: exceed
snowball stem of exceedingly: exceed 

lancaster stem of exceed: excess
lancaster stem of exceedingly: excess


The Lancaster stemmer is the most agressive stemmer and can lead to over-stemming. We see this with "mat" and "matter" which should be two different stems like they are in the Porter and Snowball stemmers.

In [64]:
# over-stemming
print("porter stem of mat:", porter_stemmer.stem("mat"))
print("porter stem of matter:", porter_stemmer.stem("matter"), "\n")

print("snowball stem of mat:", snowball_stemmer.stem("mat"))
print("snowball stem of matter:", snowball_stemmer.stem("matter"), "\n")

print("lancaster stem of mat:", lancaster_stemmer.stem("mat"))
print("lancaster stem of matter:", lancaster_stemmer.stem("matter"))

porter stem of mat: mat
porter stem of matter: matter 

snowball stem of mat: mat
snowball stem of matter: matter 

lancaster stem of mat: mat
lancaster stem of matter: mat


Now we'll use the Snowball stemmer on our data.

In [65]:
# create a function with that requires 1 input, a list of tokens 1
# initate a new list 2
# for each token in the list 3
# use the stemmer to create a stemmed word 4
# append the word to the initated list 5
# return the list of stemmed tokens 6

def stem_tokens(token_list): # 1
    stemmed_tokens = [] # 2
    for token in token_list: # 3
        stem = snowball_stemmer.stem(token) # 4
        stemmed_tokens.append(stem) # 5
    return stemmed_tokens # 6

# create a new column in our data set of clean tokens with the stop words removed
coffee["desc_1_token_stems"] = coffee["desc_1_tokens_no_stop"].apply(stem_tokens)

# view the original desc_1 column, cleaned, and split side-by-side for the first 10 cases
coffee[["desc_1", "lower_desc_1", "clean_desc_1", "desc_1_tokens","desc_1_tokens_no_stop", "desc_1_token_stems"]][0:10]

Unnamed: 0,desc_1,lower_desc_1,clean_desc_1,desc_1_tokens,desc_1_tokens_no_stop,desc_1_token_stems
0,"Richly floral-toned, exceptionally sweet. Dist...","richly floral-toned, exceptionally sweet. dist...",richly floral-toned exceptionally sweet distin...,"[richly, floral-toned, exceptionally, sweet, d...","[richly, floral-toned, exceptionally, sweet, d...","[rich, floral-ton, except, sweet, distinct, na..."
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...","richly aromatic, chocolaty, fruit-toned. dark ...",richly aromatic chocolaty fruit-toned dark cho...,"[richly, aromatic, chocolaty, fruit-toned, dar...","[richly, aromatic, chocolaty, fruit-toned, dar...","[rich, aromat, chocolati, fruit-ton, dark, cho..."
2,"High-toned, fruit-driven. Boysenberry, pear, c...","high-toned, fruit-driven. boysenberry, pear, c...",high-toned fruit-driven boysenberry pear cocoa...,"[high-toned, fruit-driven, boysenberry, pear, ...","[high-toned, fruit-driven, boysenberry, pear, ...","[high-ton, fruit-driven, boysenberri, pear, co..."
3,"Delicately fruit-toned. Guava, ginger blossom,...","delicately fruit-toned. guava, ginger blossom,...",delicately fruit-toned guava ginger blossom co...,"[delicately, fruit-toned, guava, ginger, bloss...","[delicately, fruit-toned, guava, ginger, bloss...","[delic, fruit-ton, guava, ginger, blossom, coc..."
4,"Richly fruit-forward, floral-toned. Lychee, te...","richly fruit-forward, floral-toned. lychee, te...",richly fruit-forward floral-toned lychee tea r...,"[richly, fruit-forward, floral-toned, lychee, ...","[richly, fruit-forward, floral-toned, lychee, ...","[rich, fruit-forward, floral-ton, lyche, tea, ..."
5,"High-toned, richly bittersweet. Pomelo, raspbe...","high-toned, richly bittersweet. pomelo, raspbe...",high-toned richly bittersweet pomelo raspberry...,"[high-toned, richly, bittersweet, pomelo, rasp...","[high-toned, richly, bittersweet, pomelo, rasp...","[high-ton, rich, bittersweet, pomelo, raspberr..."
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...","crisply sweet-tart. apricot, cocoa nib, agave ...",crisply sweet-tart apricot cocoa nib agave syr...,"[crisply, sweet-tart, apricot, cocoa, nib, aga...","[crisply, sweet-tart, apricot, cocoa, nib, aga...","[crispli, sweet-tart, apricot, cocoa, nib, aga..."
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...","high-toned, juicy-sweet. mango, cocoa nib, mag...",high-toned juicy-sweet mango cocoa nib magnoli...,"[high-toned, juicy-sweet, mango, cocoa, nib, m...","[high-toned, juicy-sweet, mango, cocoa, nib, m...","[high-ton, juicy-sweet, mango, cocoa, nib, mag..."
8,"Richly spice-toned, floral-driven. Bergamot, l...","richly spice-toned, floral-driven. bergamot, l...",richly spice-toned floral-driven bergamot lila...,"[richly, spice-toned, floral-driven, bergamot,...","[richly, spice-toned, floral-driven, bergamot,...","[rich, spice-ton, floral-driven, bergamot, lil..."
9,"High-toned, crisply sweet-tart. Lemongrass, co...","high-toned, crisply sweet-tart. lemongrass, co...",high-toned crisply sweet-tart lemongrass cocoa...,"[high-toned, crisply, sweet-tart, lemongrass, ...","[high-toned, crisply, sweet-tart, lemongrass, ...","[high-ton, crispli, sweet-tart, lemongrass, co..."


## Lemmatization
**Lemmatization** attempts to find the lemma or dictionary form of a word. This takes into consideration the way it was used in the sentence which typically means it removes inflectional endings only. 

Though more nuanced than stemming, lemmatizers are also more computationally intensive and are more laborious to develop because they require a deeper understanding of the language. 

NLTK's lemmatizer is based off of the WordNet lexical database (https://wordnet.princeton.edu).

In [66]:
# initalize the lemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()

In [67]:
# compare the stemmer and lemmatizer
print("snowball stem of study:", snowball_stemmer.stem("study"))
print("snowball stem of studies:", snowball_stemmer.stem("studies"))
print("snowball stem of studying:", snowball_stemmer.stem("studying"))
print("snowball stem of studied:", snowball_stemmer.stem("studied"), "\n")

print("lemma of study:", lemmatizer.lemmatize("study"))
print("lemma of studies:", lemmatizer.lemmatize("studies"))
print("lemma of studying:", lemmatizer.lemmatize("studying"))
print("lemma of studied:", lemmatizer.lemmatize("studied"))

snowball stem of study: studi
snowball stem of studies: studi
snowball stem of studying: studi
snowball stem of studied: studi 

lemma of study: study
lemma of studies: study
lemma of studying: studying
lemma of studied: studied


In [68]:
# create a function that requires 1 input, a list of tokens 1
# initate a new list 2
# for each token in the list 3
# use the lemmatizer to create lemmas from tokens 4
# append the lemma to the initated list 5
# return the list of lemmatized tokens 6

def lemmatize_tokens(token_list): # 1
    lemmatized_tokens = [] # 2
    for token in token_list: # 3
        lemma = lemmatizer.lemmatize(token) # 4
        lemmatized_tokens.append(lemma) # 5
    return lemmatized_tokens # 6

# create a new column in our data set of clean tokens with the stop words removed
coffee["desc_1_token_lemmas"] = coffee["desc_1_tokens_no_stop"].apply(lemmatize_tokens)

# view the original desc_1 column, cleaned, and split side-by-side for the first 10 cases
coffee[["desc_1", "lower_desc_1", "clean_desc_1", "desc_1_tokens","desc_1_tokens_no_stop", "desc_1_token_stems", "desc_1_token_lemmas"]][0:10]

Unnamed: 0,desc_1,lower_desc_1,clean_desc_1,desc_1_tokens,desc_1_tokens_no_stop,desc_1_token_stems,desc_1_token_lemmas
0,"Richly floral-toned, exceptionally sweet. Dist...","richly floral-toned, exceptionally sweet. dist...",richly floral-toned exceptionally sweet distin...,"[richly, floral-toned, exceptionally, sweet, d...","[richly, floral-toned, exceptionally, sweet, d...","[rich, floral-ton, except, sweet, distinct, na...","[richly, floral-toned, exceptionally, sweet, d..."
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...","richly aromatic, chocolaty, fruit-toned. dark ...",richly aromatic chocolaty fruit-toned dark cho...,"[richly, aromatic, chocolaty, fruit-toned, dar...","[richly, aromatic, chocolaty, fruit-toned, dar...","[rich, aromat, chocolati, fruit-ton, dark, cho...","[richly, aromatic, chocolaty, fruit-toned, dar..."
2,"High-toned, fruit-driven. Boysenberry, pear, c...","high-toned, fruit-driven. boysenberry, pear, c...",high-toned fruit-driven boysenberry pear cocoa...,"[high-toned, fruit-driven, boysenberry, pear, ...","[high-toned, fruit-driven, boysenberry, pear, ...","[high-ton, fruit-driven, boysenberri, pear, co...","[high-toned, fruit-driven, boysenberry, pear, ..."
3,"Delicately fruit-toned. Guava, ginger blossom,...","delicately fruit-toned. guava, ginger blossom,...",delicately fruit-toned guava ginger blossom co...,"[delicately, fruit-toned, guava, ginger, bloss...","[delicately, fruit-toned, guava, ginger, bloss...","[delic, fruit-ton, guava, ginger, blossom, coc...","[delicately, fruit-toned, guava, ginger, bloss..."
4,"Richly fruit-forward, floral-toned. Lychee, te...","richly fruit-forward, floral-toned. lychee, te...",richly fruit-forward floral-toned lychee tea r...,"[richly, fruit-forward, floral-toned, lychee, ...","[richly, fruit-forward, floral-toned, lychee, ...","[rich, fruit-forward, floral-ton, lyche, tea, ...","[richly, fruit-forward, floral-toned, lychee, ..."
5,"High-toned, richly bittersweet. Pomelo, raspbe...","high-toned, richly bittersweet. pomelo, raspbe...",high-toned richly bittersweet pomelo raspberry...,"[high-toned, richly, bittersweet, pomelo, rasp...","[high-toned, richly, bittersweet, pomelo, rasp...","[high-ton, rich, bittersweet, pomelo, raspberr...","[high-toned, richly, bittersweet, pomelo, rasp..."
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...","crisply sweet-tart. apricot, cocoa nib, agave ...",crisply sweet-tart apricot cocoa nib agave syr...,"[crisply, sweet-tart, apricot, cocoa, nib, aga...","[crisply, sweet-tart, apricot, cocoa, nib, aga...","[crispli, sweet-tart, apricot, cocoa, nib, aga...","[crisply, sweet-tart, apricot, cocoa, nib, aga..."
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...","high-toned, juicy-sweet. mango, cocoa nib, mag...",high-toned juicy-sweet mango cocoa nib magnoli...,"[high-toned, juicy-sweet, mango, cocoa, nib, m...","[high-toned, juicy-sweet, mango, cocoa, nib, m...","[high-ton, juicy-sweet, mango, cocoa, nib, mag...","[high-toned, juicy-sweet, mango, cocoa, nib, m..."
8,"Richly spice-toned, floral-driven. Bergamot, l...","richly spice-toned, floral-driven. bergamot, l...",richly spice-toned floral-driven bergamot lila...,"[richly, spice-toned, floral-driven, bergamot,...","[richly, spice-toned, floral-driven, bergamot,...","[rich, spice-ton, floral-driven, bergamot, lil...","[richly, spice-toned, floral-driven, bergamot,..."
9,"High-toned, crisply sweet-tart. Lemongrass, co...","high-toned, crisply sweet-tart. lemongrass, co...",high-toned crisply sweet-tart lemongrass cocoa...,"[high-toned, crisply, sweet-tart, lemongrass, ...","[high-toned, crisply, sweet-tart, lemongrass, ...","[high-ton, crispli, sweet-tart, lemongrass, co...","[high-toned, crisply, sweet-tart, lemongrass, ..."


For this analysis, we're not concerned with differentiating parts of speech. Especially, given that these text are meant to be reviews of coffee, the part of speech is not relevant to the flavor profile. So, we'll move forward with the Snowball stems instead of lemmas. 

Now let's save our processed text to a new column. 

In [69]:
# create a function with that requires 1 input, a list of tokens 1
# combine the list of tokens into a single string 2
# return the string 3

def combine_tokens(token_list):
    new_string = " ".join(token_list)
    return new_string

# create a new column in our data set of combined tokens
coffee["desc_1_processed"] = coffee["desc_1_token_stems"].apply(combine_tokens)

# view the columns side-by-side for the first 10 cases
coffee[["desc_1", "clean_desc_1", "desc_1_token_stems", "desc_1_processed"]][0:10]

Unnamed: 0,desc_1,clean_desc_1,desc_1_token_stems,desc_1_processed
0,"Richly floral-toned, exceptionally sweet. Dist...",richly floral-toned exceptionally sweet distin...,"[rich, floral-ton, except, sweet, distinct, na...",rich floral-ton except sweet distinct narcissu...
1,"Richly aromatic, chocolaty, fruit-toned. Dark ...",richly aromatic chocolaty fruit-toned dark cho...,"[rich, aromat, chocolati, fruit-ton, dark, cho...",rich aromat chocolati fruit-ton dark chocol dr...
2,"High-toned, fruit-driven. Boysenberry, pear, c...",high-toned fruit-driven boysenberry pear cocoa...,"[high-ton, fruit-driven, boysenberri, pear, co...",high-ton fruit-driven boysenberri pear cocoa n...
3,"Delicately fruit-toned. Guava, ginger blossom,...",delicately fruit-toned guava ginger blossom co...,"[delic, fruit-ton, guava, ginger, blossom, coc...",delic fruit-ton guava ginger blossom cocoa nib...
4,"Richly fruit-forward, floral-toned. Lychee, te...",richly fruit-forward floral-toned lychee tea r...,"[rich, fruit-forward, floral-ton, lyche, tea, ...",rich fruit-forward floral-ton lyche tea rose d...
5,"High-toned, richly bittersweet. Pomelo, raspbe...",high-toned richly bittersweet pomelo raspberry...,"[high-ton, rich, bittersweet, pomelo, raspberr...",high-ton rich bittersweet pomelo raspberri coc...
6,"Crisply sweet-tart. Apricot, cocoa nib, agave ...",crisply sweet-tart apricot cocoa nib agave syr...,"[crispli, sweet-tart, apricot, cocoa, nib, aga...",crispli sweet-tart apricot cocoa nib agav syru...
7,"High-toned, juicy-sweet. Mango, cocoa nib, mag...",high-toned juicy-sweet mango cocoa nib magnoli...,"[high-ton, juicy-sweet, mango, cocoa, nib, mag...",high-ton juicy-sweet mango cocoa nib magnolia ...
8,"Richly spice-toned, floral-driven. Bergamot, l...",richly spice-toned floral-driven bergamot lila...,"[rich, spice-ton, floral-driven, bergamot, lil...",rich spice-ton floral-driven bergamot lilac co...
9,"High-toned, crisply sweet-tart. Lemongrass, co...",high-toned crisply sweet-tart lemongrass cocoa...,"[high-ton, crispli, sweet-tart, lemongrass, co...",high-ton crispli sweet-tart lemongrass cocoa n...


### Create a document-term matrix
In the **document-term matrix (DTM)**, rows are comprised of documents, columns are comprised of terms (i.e., tokens), and cells become a count of each token in the document. We'll first use the CountVectorizer which converts a column of text into a document-term matrix.

In [70]:
# initaialize the vectorizer
vec = CountVectorizer()

# use the vectorizer to create a dtm called X
X = vec.fit_transform(coffee["desc_1_processed"])
X

<2282x1281 sparse matrix of type '<class 'numpy.int64'>'
	with 64406 stored elements in Compressed Sparse Row format>

In [71]:
# this code extracts the index (i.e., row/document names) from coffee and applies them to our matrix, X
# we also convert this matrix to a DataFrame
df = pd.DataFrame(X.toarray(), columns = vec.get_feature_names_out(), index = coffee.index)
df

Unnamed: 0,access,accompani,acditi,acid,acidi,acidti,acorn,acrid,ad,add,...,yeasti,yellow,yet,yogurt,yogurti,young,yuzu,zest,zesti,zesty
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2277,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2278,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2279,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2280,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Analyzing word counts and sentiment

There are some flavor defects that can be detected in coffee which result from over- or under-extracting, over- or under-roasting, etc. The list below is a collection of key words that are associated with flavor defects. 

In [72]:
# create a list of flavor defect key words
flavor_defects = [
    "rotten",
    "must",
    "mold",
    "potato",
    "bitter",
    "cappy",
    "baggy",
    "oat",
    "grain",
    "grass",
    "hay",
    "ash",
    "carbon",
    "burnt",
    "scorched",
    "sour"]

# use the function we created to stem these key words
flavor_defect_stems = stem_tokens(flavor_defects)

Were any of these key words detected in our document?

In [73]:
# initate a new list 1
# for each word in our list of flavor defects 2
# if the word appears in one of the columns in our dtm 3
# then add the word to our initated list 4

flavor_defects_detected = [] # 1

for word in flavor_defects: # 2
    if word in df.columns: # 3
        flavor_defects_detected.append(word) # 4

flavor_defects_detected

['rotten', 'bitter', 'grain', 'grass', 'sour']

Now let's create a subset of our DTM data frame where the columns are just the flavor defect key words that were detected in our DTM. 

In [74]:
df_flavor_defects = df[flavor_defects_detected]
df_flavor_defects

Unnamed: 0,rotten,bitter,grain,grass,sour
0,0,0,0,0,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0
...,...,...,...,...,...
2277,0,0,0,0,0
2278,0,0,0,0,0
2279,0,0,0,0,0
2280,0,0,0,0,0


Create a count of the total number of flavor defect key words that were detected

In [75]:
df_flavor_defects["n_flavor_defects"] = df_flavor_defects[flavor_defects_detected].sum(axis = 1)
df_flavor_defects

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_flavor_defects["n_flavor_defects"] = df_flavor_defects[flavor_defects_detected].sum(axis = 1)


Unnamed: 0,rotten,bitter,grain,grass,sour,n_flavor_defects
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
2277,0,0,0,0,0,0
2278,0,0,0,0,0,0
2279,0,0,0,0,0,0
2280,0,0,0,0,0,0


Lastly, lets add the new variable to our coffee data frame.

In [76]:
coffee["n_flavor_defects"] = df_flavor_defects["n_flavor_defects"]

I reckon that the appearance of these key words may be associated with lower ratings of the coffee. Specifically, the more times these flavor defect key words appear in the description, the lower the score will be.

In [77]:
coffee["n_flavor_defects"].value_counts().sort_index(ascending = True)

0    2263
1      17
2       1
3       1
Name: n_flavor_defects, dtype: int64

In [78]:
print(coffee.groupby(["n_flavor_defects"])["rating"].mean())

n_flavor_defects
0    93.041096
1    88.000000
2    68.000000
3    80.000000
Name: rating, dtype: float64


Interestingly the case with 3 flavor defect key words was rated higher than the case with only 2 key words. Let's see what those words were...

In [79]:
pd.set_option('display.max_colwidth', None)
print(coffee.loc[coffee["n_flavor_defects"] == 2]["desc_1"])
print(coffee.loc[coffee["n_flavor_defects"] == 2]["desc_1_token_stems"])

2040    Evaluated at proportions of 5 grams of instant coffee powder mixed with 8.5 ounces (250 ml) of hot water. Not attractive. Rotten suggestions (composted orange, lily) and acrid wood dominate, along with a metallic note. On the positive side, hints of prune, salty caramel, cardamom. Bittersweet and acrid in structure; lean but smooth in mouthfeel. Continued bitter and acrid in the finish.
Name: desc_1, dtype: object
2040    [evalu, proport, , gram, instant, powder, mix, , ounc, , ml, hot, water, attract, rotten, suggest, compost, orang, lili, acrid, wood, domin, along, metal, note, posit, side, hint, prune, salti, caramel, cardamom, bittersweet, acrid, structur, lean, smooth, mouthfeel, continu, bitter, acrid, finish]
Name: desc_1_token_stems, dtype: object


In [80]:
print(coffee.loc[coffee["n_flavor_defects"] == 3]["desc_1"])
print(coffee.loc[coffee["n_flavor_defects"] == 3]["desc_1_token_stems"])

957    Evaluated at a steeping time of 6 minutes. This beverage is composed of coffee that has essentially been toasted but not roasted. Toasted grain, cocoa nib, hints of raw cashew and limelike citrus in aroma and cup. Sweet, woody/grainy structure with a hint of bitterness but no acidy sensation whatsoever. The mouthfeel is thin and tea-like but silky in texture. Grain and nut fade in the finish, though a woody sweetness lasts.
Name: desc_1, dtype: object
957    [evalu, steep, time, , minut, beverag, compos, essenti, toast, roast, toast, grain, cocoa, nib, hint, raw, cashew, limelik, citrus, aroma, cup, sweet, woodygraini, structur, hint, bitter, acidi, sensat, whatsoev, mouthfeel, thin, tea-lik, silki, textur, grain, nut, fade, finish, though, woodi, sweet, last]
Name: desc_1_token_stems, dtype: object


The inclusion of modifiers (like not) in our list of stop-words impacted this result. The case with 3 flavor defects had one occurance that was modified - "a hint of bitterness." Further, some key words (like rotten) might hold more weight than others. Perhaps a sentiment analysis can better capture if the rating was good or bad. 

In [81]:
sentiment_analyzer = SentimentIntensityAnalyzer()

# apply the sentiment analyzer to all the columns
coffee["polarity_scores"] = coffee["desc_1"].apply(sentiment_analyzer.polarity_scores)
coffee["polarity_scores"]

0       {'neg': 0.028, 'neu': 0.725, 'pos': 0.247, 'compound': 0.8832}
1         {'neg': 0.0, 'neu': 0.799, 'pos': 0.201, 'compound': 0.8316}
2         {'neg': 0.0, 'neu': 0.923, 'pos': 0.077, 'compound': 0.4767}
3         {'neg': 0.0, 'neu': 0.871, 'pos': 0.129, 'compound': 0.5106}
4         {'neg': 0.0, 'neu': 0.761, 'pos': 0.239, 'compound': 0.8176}
                                     ...                              
2277      {'neg': 0.0, 'neu': 0.842, 'pos': 0.158, 'compound': 0.7346}
2278    {'neg': 0.032, 'neu': 0.891, 'pos': 0.077, 'compound': 0.4215}
2279      {'neg': 0.0, 'neu': 0.774, 'pos': 0.226, 'compound': 0.9313}
2280       {'neg': 0.0, 'neu': 0.838, 'pos': 0.162, 'compound': 0.836}
2281    {'neg': 0.029, 'neu': 0.692, 'pos': 0.279, 'compound': 0.9042}
Name: polarity_scores, Length: 2282, dtype: object

The result gives us a dictionary in each row. We can therefore supply a key and it will return a value from the dictionary. For example, if we want the negative score, we would use the "neg" key and it would return the negative value. 

In [82]:
coffee["polarity_scores"][0]["neg"]

0.028

In [83]:
# initate new lists 1 - 2
# for each row in the column polarity scores 3
# save that row as a new variable called dictionary 4
# append the dictionary value for pos to the pos list 5
# append the dictionary value for neg to the pos list 5

pos = [] # 1
neg = [] # 2
compound = [] # 2

for each_row in coffee["polarity_scores"]: # 3
    dictionary = each_row # 4
    pos.append(dictionary["pos"]) # 5
    neg.append(dictionary["neg"]) # 6
    compound.append(dictionary["compound"]) # 6

# use these lists to create new columns in our data frame
coffee["positive"] = pos
coffee["negative"] = neg
coffee["compound"] = compound

coffee[["polarity_scores", "positive", "negative", "compound"]]

Unnamed: 0,polarity_scores,positive,negative,compound
0,"{'neg': 0.028, 'neu': 0.725, 'pos': 0.247, 'compound': 0.8832}",0.247,0.028,0.8832
1,"{'neg': 0.0, 'neu': 0.799, 'pos': 0.201, 'compound': 0.8316}",0.201,0.000,0.8316
2,"{'neg': 0.0, 'neu': 0.923, 'pos': 0.077, 'compound': 0.4767}",0.077,0.000,0.4767
3,"{'neg': 0.0, 'neu': 0.871, 'pos': 0.129, 'compound': 0.5106}",0.129,0.000,0.5106
4,"{'neg': 0.0, 'neu': 0.761, 'pos': 0.239, 'compound': 0.8176}",0.239,0.000,0.8176
...,...,...,...,...
2277,"{'neg': 0.0, 'neu': 0.842, 'pos': 0.158, 'compound': 0.7346}",0.158,0.000,0.7346
2278,"{'neg': 0.032, 'neu': 0.891, 'pos': 0.077, 'compound': 0.4215}",0.077,0.032,0.4215
2279,"{'neg': 0.0, 'neu': 0.774, 'pos': 0.226, 'compound': 0.9313}",0.226,0.000,0.9313
2280,"{'neg': 0.0, 'neu': 0.838, 'pos': 0.162, 'compound': 0.836}",0.162,0.000,0.8360


What's the average positive sentiment score for each rating? What about negative sentiment?

In [84]:
print(coffee.groupby(["rating"])["positive"].mean().sort_index(ascending = False))
print("correlation: ", coffee["rating"].corr(coffee["positive"]))

rating
98    0.261000
97    0.215571
96    0.213833
95    0.207279
94    0.197660
93    0.195565
92    0.190475
91    0.200472
90    0.190642
89    0.167200
88    0.166000
87    0.150000
86    0.130100
85    0.155500
84    0.095250
83    0.082333
80    0.106000
72    0.081000
68    0.063000
67    0.077000
63    0.000000
Name: positive, dtype: float64
correlation:  0.13182100980038508


In [85]:
print(coffee.groupby(["rating"])["negative"].mean().sort_index(ascending = True))
print("correlation: ", coffee["rating"].corr(coffee["negative"]))

rating
63    0.000000
67    0.045000
68    0.118000
72    0.039000
80    0.081000
83    0.017667
84    0.009000
85    0.042000
86    0.017900
87    0.035444
88    0.028714
89    0.018500
90    0.008908
91    0.004433
92    0.005614
93    0.003620
94    0.002708
95    0.002554
96    0.003458
97    0.009357
98    0.000000
Name: negative, dtype: float64
correlation:  -0.24143965565822414


In [86]:
print(coffee.groupby(["rating"])["compound"].mean().sort_index(ascending = True))
print("correlation: ", coffee["rating"].corr(coffee["compound"]))

rating
63    0.000000
67    0.250000
68   -0.392300
72    0.458800
80    0.449700
83    0.507433
84    0.450700
85    0.644600
86    0.541480
87    0.490067
88    0.658543
89    0.665360
90    0.713469
91    0.745744
92    0.714472
93    0.745590
94    0.759546
95    0.763565
96    0.799199
97    0.840950
98    0.907433
Name: compound, dtype: float64
correlation:  0.20268729975090735


It appears that negative sentiment is the polarity score that has the strongest association with the coffee ratings, but let's test it and compare it to the key word frequency. 

In [87]:
# regression
x = coffee[["negative", "n_flavor_defects"]]
y = coffee["rating"]
x = sm.add_constant(x)

model = sm.OLS(y, x, missing = "drop")
res = model.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                 rating   R-squared:                       0.128
Model:                            OLS   Adj. R-squared:                  0.127
Method:                 Least Squares   F-statistic:                     166.8
Date:                Wed, 05 Apr 2023   Prob (F-statistic):           2.49e-68
Time:                        15:24:36   Log-Likelihood:                -4693.0
No. Observations:                2282   AIC:                             9392.
Df Residuals:                    2279   BIC:                             9409.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               93.1329      0.042  

The average rating for a coffee with a negative sentiment score of 0 and no flavor defect key words detected is 93.13. 

Both negative sentiment scores and the number of flavor defect key words detected were significant predictors of coffee rating (p < .001). A 1-unit increase in negative sentiment is associated with a 21.68 drop in coffee rating. While each additional key word detected is associated with a 4.95 drop in coffee rating. COOL!