# Wine Clustering 
### As a part-time job during my last semester at university, I worked at the Wine Rack. Most of my job was recommending wines to customers, which often meant suggesting a wine similar to one they already liked.
### So, I thought I could use ML to make my life easier!

- ### First, I scraped the Wine Rack website to grab wine names, tasting notes, and prices.
- ### Then I cleaned the text to make it easier to cluster.
- ### Finally, I created clusters based on similar tasting notes.



**1. Scraping the Wine Rack Website**

In [2]:

# import libraries
import requests
from bs4 import BeautifulSoup
import csv

# need to scrape red and white wine separately
MASTER_URL = "https://www.winerack.com/"

def getProductLinks(url, file):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    wine = soup.find_all("div", class_="product-tile-grid")
    
    # create the csv writer
    writer = csv.writer(file)
    
    # write a row to the csv file
    row = ["wine", "tastingNotes", "price"]
    writer.writerow(row)
    
    # loop through all product links and get details for each product
    for item in wine:
        wineLink = item.find("a", href=True)
        productURL = MASTER_URL + wineLink['href']
        details = getWineDetails(productURL)
        writer.writerow(details)  
    
    # close the file that was passed in (not the global f)
    file.close()

# product details
def getWineDetails(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    # wine name
    details = soup.find("div", class_ = "awc-product-detail__details")
    name = details.find("h1").getText()
    # price
    price = soup.find("div", class_="col-6 col-md-12 col-sm-6 awc-product-detail__price")
    #tasting notes
    tasting = soup.find("div", class_="awc-pdp-tasting-notes__main-text")
    tastingNotes = tasting.getText()
    priceFinal = price.find("p").getText()
    return [name, tastingNotes, priceFinal]

# red wine
url = "https://www.winerack.com/products/red/?page=4"
# newline='' is what the csv module recommends, to avoid blank rows on some platforms
f = open('/Users/bridgetmoynihan/redWines.csv', 'w', newline='')
getProductLinks(url, f)

# white wine
url = "https://www.winerack.com/products/white/?page=4"
f = open('/Users/bridgetmoynihan/whiteWines.csv', 'w', newline='')
getProductLinks(url, f)
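
One fragile spot in the scraper is `MASTER_URL + wineLink['href']`: if an href starts with a slash, plain concatenation produces a doubled slash. The following is a minimal sketch of a safer join using the standard library's `urllib.parse.urljoin`. The helper name `build_product_url` and the sample paths are hypothetical, and the actual form of the site's hrefs is not shown in this notebook.

```python
from urllib.parse import urljoin

MASTER_URL = "https://www.winerack.com/"

def build_product_url(href):
    # urljoin resolves href against the base URL, so a leading slash does not double up
    return urljoin(MASTER_URL, href)

# hypothetical hrefs, for illustration only
with_slash = build_product_url("/products/red/example-wine")
without_slash = build_product_url("products/red/example-wine")
print(with_slash)
print(without_slash)
```

Both calls give the same URL, so the scraper would not need to care which form the page uses.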



**2. Text Cleaning**

In [3]:
# load necessary libraries
import pandas as pd
import numpy as np
# for text pre-processing
# (the first run may need nltk.download(...) for the tokenizer, stopword, WordNet and POS tagger data)
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# bag of words
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.feature_extraction.text import CountVectorizer


#load csv file as dataframe
dfRedWine = pd.read_csv("/Users/bridgetmoynihan/redWines.csv")
dfWhiteWine = pd.read_csv("/Users/bridgetmoynihan/whiteWines.csv")


In [4]:
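# convert to lowercase, strip whitespace, and remove punctuation
def preprocess(text):
    text = text.lower().strip()
    text = re.sub(r'<.*?>', '', text)  # drop HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # bracketed numbers such as [1]; never matches once punctuation is gone
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())  # drop any remaining non-word characters
    text = re.sub(r'\d', ' ', text)  # digits -> space
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace again
    return text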

 
# remove stopwords
STOPWORDS = set(stopwords.words('english'))  # build the set once instead of once per word

def stopword(text):
    a = [i for i in text.split() if i not in STOPWORDS]
    return ' '.join(a)

# Lemmatization
# Initialize the lemmatizer
wl = WordNetLemmatizer()
 
# This is a helper function to map NLTK part-of-speech tags to WordNet tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
# Tokenize the sentence, tag each token, and lemmatize it
def lemmatizer(text):
    word_pos_tags = nltk.pos_tag(word_tokenize(text))  # get part-of-speech tags
    a = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]  # map each POS tag and lemmatize the token
    return " ".join(a)

# call all the cleaning functions
def finalpreprocess(text):
    return lemmatizer(stopword(preprocess(text)))

# clean tasting notes column of both red and white wines 
dfRedWine['clean_tastingNotes'] = dfRedWine['tastingNotes'].apply(lambda x: finalpreprocess(x))
dfWhiteWine['clean_tastingNotes'] = dfWhiteWine['tastingNotes'].apply(lambda x: finalpreprocess(x))
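
To see what the cleaning does to a single tasting note without installing NLTK, here is a dependency-free sketch of the lowercase, punctuation and stopword steps. It is an approximation: `simple_clean` and `DEMO_STOPWORDS` are hypothetical names, the stopword list is a tiny hand-picked stand-in for NLTK's English list, and lemmatization is skipped, so words like "bodied" and "aromas" are not reduced to "body" and "aroma".

```python
import re
import string

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"a", "an", "and", "of", "the", "with", "to", "in"}

def simple_clean(text):
    text = text.lower().strip()
    text = re.sub(r"<.*?>", "", text)  # drop HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # punctuation -> space
    text = re.sub(r"\d", " ", text)  # digits -> space
    tokens = [t for t in text.split() if t not in DEMO_STOPWORDS]  # split() also collapses whitespace
    return " ".join(tokens)

note = "Bold, rich & juicy. Medium-bodied red with aromas of blackberry."
cleaned = simple_clean(note)
print(cleaned)
```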

In [8]:
# looking at the difference between clean tasting notes and original
dfRedWine.head()

| Unnamed: 0 | wine | tastingNotes | price | clean_tastingNotes |
|---|---|---|---|---|
| 0 | Bodacious Smooth Red | Bold, rich & juicy. Medium-bodied red with aro... | $46.95 | bold rich juicy medium body red aroma blackber... |
| 1 | Imperial Fortified Wine | Medium amber colour; aromas of walnuts, carame... | $10.95 | medium amber colour aroma walnut caramel figs ... |
| 2 | Jackson-Triggs Grand Reserve Red Meritage VQA | Fruit forward, with generous notes of juicy re... | $25.95 | fruit forward generous note juicy red fruit bl... |
| 3 | Sandbanks Sleeping Giant Foch-Baco VQA | This luscious full-bodied wine offers a fruit-... | $17.95 | luscious full body wine offer fruit forward pa... |
| 4 | Wallaroo Trail - 2 Origins Cabernet Sauvignon | An Australian and Canadian blend with aromas o... | $13.95 | australian canadian blend aroma blackberry che... |


In [7]:
# looking at the difference between clean tasting notes and original
dfWhiteWine.head()

| Unnamed: 0 | wine | tastingNotes | price | clean_tastingNotes |
|---|---|---|---|---|
| 0 | Audacity Of Thomas G. Bright Orange Wine VQA | Orange wine is a trendy new wine where extende... | $18.95 | orange wine trendy new wine extend grape skin ... |
| 1 | Caleta - 2 Origins Sauvignon Blanc | A Chilean and Canadian blend with aromas of me... | $10.95 | chilean canadian blend aroma melon pear apple ... |
| 2 | Inniskillin Vidal VQA | The Vidal Icewine has intense aromas of mango,... | $7.95 | vidal icewine intense aroma mango apricot hone... |
| 3 | Inniskillin Vidal Icewine VQA | The Vidal Pearl Icewine has intense aromas of ... | $24.95 | vidal pearl icewine intense aroma mango aprico... |
| 4 | Jackson Triggs PS Pinot Grigio | A classic! Our Jackson-Triggs Proprietors' Sel... | $49.95 | classic jackson triggs proprietor selection pi... |


**3. Model Fitting**


### Red Wine Model Building

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')

### Red Wine
X = vectorizer.fit_transform(dfRedWine['clean_tastingNotes'])
modelRedWine = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=1)
modelRedWine.fit(X)
# sort each centroid's term weights from largest to smallest
order_centroids = modelRedWine.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# looking at Red Wine cluster feature names
print("RED WINE CLUSTER FEATURE NAMES")
for i in range(4):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(" %s" % terms[ind])
# adding red wine clusters to csv
labels = modelRedWine.labels_
dfRedWine["Cluster"] = labels
dfRedWine.to_csv("/Users/bridgetmoynihan/redWines.csv")


RED WINE CLUSTER FEATURE NAMES
Cluster 0:
 wine
 bask
 cherry
 hint
 cabernet
 blackcurrant
 rounded
 spice
 finish
 smooth
Cluster 1:
 fruit
 vanilla
 palate
 spice
 red
 plum
 dark
 note
 cherry
 hint
Cluster 2:
 canadian
 blend
 blackberry
 smooth
 flavour
 bold
 aroma
 medium
 body
 australian
Cluster 3:
 dry
 light
 colour
 fresh
 medium
 ruby
 crisp
 berry
 wine
 note
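
The lists above come from sorting each cluster centroid's TF-IDF weights and printing the terms with the largest weights. A small pure-Python sketch of that ranking step follows. The vocabulary and weights are made up for illustration and are not taken from the fitted model, and `top_terms` is a hypothetical helper.

```python
# Hypothetical vocabulary and centroid weights, for illustration only.
terms = ["cherry", "vanilla", "crisp", "plum"]
centroids = [
    [0.40, 0.05, 0.00, 0.30],  # cluster 0
    [0.02, 0.10, 0.55, 0.01],  # cluster 1
]

def top_terms(centroid, terms, n=2):
    # Rank term indices by weight, largest first, as argsort()[:, ::-1] does for each row.
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:n]]

top = [top_terms(c, terms) for c in centroids]
for i, words in enumerate(top):
    print("Cluster %d: %s" % (i, ", ".join(words)))
```

A term with a high weight in a centroid is one that appears often, and distinctively, in the tasting notes assigned to that cluster.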


#### Now, I am going to pull a red wine description from the LCBO website to see which cluster it is assigned to
##### I am pulling a sweeter California Cabernet Sauvignon

In [19]:
textTest1 = "California has a world-class reputation for great cabernet sauvignon, and this rich example shows why. It brims with aromas of toast, nuts, dark fruit and chocolate that lead to flavours of ripe dark fruit and chocolate-covered plums and ends in a smooth finish with a hint of spice. Enjoy with grilled steak or on its own."
cleanTextTest1 = finalpreprocess(textTest1)
X = vectorizer.transform([cleanTextTest1])
predicted = modelRedWine.predict(X)
print(predicted)


[1]


##### I reviewed the other red wines in cluster 1 and, from my tasting experience, the Cabernet Sauvignon I picked is similar to them.
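
Reviewing the other wines in a cluster amounts to filtering the labelled data by the predicted cluster id. With the dataframe from this notebook that would be something like `dfRedWine[dfRedWine["Cluster"] == predicted[0]]`. The sketch below shows the same idea in plain Python. The rows are placeholders rather than real wines or real labels, and `wines_in_cluster` is a hypothetical helper.

```python
# Placeholder rows standing in for dfRedWine[["wine", "Cluster"]]; names and labels are made up.
rows = [
    {"wine": "Wine A", "Cluster": 1},
    {"wine": "Wine B", "Cluster": 0},
    {"wine": "Wine C", "Cluster": 1},
]

def wines_in_cluster(rows, cluster_id):
    # Keep only the wines whose cluster label matches the requested id.
    return [r["wine"] for r in rows if r["Cluster"] == cluster_id]

predicted_cluster = 1  # e.g. int(predicted[0]) from the prediction cell
suggestions = wines_in_cluster(rows, predicted_cluster)
print(suggestions)
```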

### White Wine Model Building

In [21]:
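### White Wine
# note: refitting the shared vectorizer replaces the red wine vocabulary,
# so any red wine prediction has to be run before this cell
X = vectorizer.fit_transform(dfWhiteWine['clean_tastingNotes'])
modelWhiteWine = KMeans(n_clusters=4, init='k-means++', max_iter=100, n_init=1)
modelWhiteWine.fit(X)
# sort each centroid's term weights from largest to smallest
order_centroids = modelWhiteWine.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# looking at White Wine cluster feature names
print("WHITE WINE CLUSTER FEATURE NAMES")
for i in range(4):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(" %s" % terms[ind])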
# adding white wine clusters to csv
labels = modelWhiteWine.labels_
dfWhiteWine["Cluster"] = labels
dfWhiteWine.to_csv("/Users/bridgetmoynihan/whiteWines.csv")

WHITE WINE CLUSTER FEATURE NAMES
Cluster 0:
 riesling
 flavour
 honey
 icewine
 peach
 semi
 brown
 hold
 integrated
 candy
Cluster 1:
 note
 finish
 wine
 citrus
 light
 crisp
 fruit
 floral
 body
 tropical
Cluster 2:
 apple
 lemon
 vanilla
 note
 pear
 aromas
 green
 touch
 fresh
 honey
Cluster 3:
 blend
 canadian
 aromas
 fruit
 gooseberry
 pear
 white
 aromatic
 pineapple
 bask


#### Now, I am going to pull a white wine description from the LCBO website to see which cluster it is assigned to
##### I am pulling a sweeter Riesling

In [22]:

textTest2="The Riesling for this easy-on-the-wallet 2017 was allowed to hang longer than usual, intensifying the fruit flavour. Riesling holds exceptionally well on the vine, maintaining its bright acidity through such extended ripening. This exceptional natural acidity is also why Riesling is so ageable. Expect this 2017 to reflect peach, apple and pear, with an emerging mineral oil tone. Try it with smoked meat."
cleanTextTest2 = finalpreprocess(textTest2)
X = vectorizer.transform([cleanTextTest2])
predicted = modelWhiteWine.predict(X)
print(predicted)

[0]


##### From my experience comparing this Riesling to the other wines in my list, it landed in the same cluster as similar wines.
