### **Unstructured Data: Text**

<font color="red">File access required:</font> In Colab this notebook requires first uploading files **Wines200.csv** and **Wines10K.csv** using the *Files* feature in the left toolbar. If running the notebook on a local computer, simply ensure these files are in the same workspace as the notebook.

In [None]:
# Set-up
import pandas as pd
import re
import nltk
import string
from nltk.corpus import stopwords
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('tagsets')
from collections import Counter

### Dataset of wine descriptions

*Read wines into dataframe, show sample*

In [None]:
# Passing the filename lets pandas open and close the file itself
wines = pd.read_csv('Wines200.csv')
print(len(wines), 'wines')
wines.head(5)

*Show sample descriptions*

In [None]:
text = wines.loc[1].description
print(text, '\n')
text = wines.loc[3].description
print(text)

### Search: string-contains and regular expressions

*Find wines with description containing 'chocolate'*

In [None]:
wines[wines.description.str.contains('chocolate')][['country', 'variety', 'description']]

*Find wines with description containing 'chocolate' and 'fruit'*

In [None]:
wines[wines.description.str.contains('chocolate') & wines.description.str.contains('fruit')]\
[['country', 'variety','description']]

*Find wines with description where 'chocolate' precedes 'fruit'; then try the reverse order*

In [None]:
for i in range(len(wines)):
    text = wines.loc[i].description
    s = re.search('chocolate(.*)fruit', text)
    if s:
        print(wines.loc[i].country, wines.loc[i].variety, '-', text, '\n')
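
*Sketch: checking word order in both directions.* The cell above only covers 'chocolate' before 'fruit'; swapping the two words in the pattern gives the reverse. The sketch below is one possible way to wrap that in a helper. The helper name `matches_in_order` and the sample descriptions are made up for illustration and are not taken from Wines200.csv.

```python
import re

# Made-up descriptions for illustration only (not taken from Wines200.csv)
samples = [
    "Dark chocolate notes lead into ripe black fruit.",
    "Bright red fruit with a hint of milk chocolate on the finish.",
    "Crisp and clean, with citrus and green apple.",
]

def matches_in_order(first, second, texts):
    """Return the texts in which `first` appears somewhere before `second`."""
    # re.escape keeps the search words literal even if they contain regex symbols
    pattern = re.compile(re.escape(first) + r'.*' + re.escape(second))
    return [t for t in texts if pattern.search(t)]

chocolate_then_fruit = matches_in_order('chocolate', 'fruit', samples)
fruit_then_chocolate = matches_in_order('fruit', 'chocolate', samples)
print(chocolate_then_fruit)
print(fruit_then_chocolate)
```

Note that `.` does not match a newline by default, so this sketch assumes each description is a single line of text.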

*Find wines to be drunk through 2020 or later*

In [None]:
# Find wines to be drunk through 2020 or later
for i in range(len(wines)):
    text = wines.loc[i].description
    # Raw string so the backslashes reach the regex engine; \d matches the final digit of the year
    s = re.search(r'Drink(.*) through 20(2|3)\d\.', text)
    if s:
        print(wines.loc[i].variety, '-', text[s.start():s.end()])
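
*Sketch: pulling the year out of the match.* The pattern above reads as: the word 'Drink', any text, ' through ', a year in the 2020s or 2030s, then a literal period. A capture group around the year makes it available as a number. The sentences below are made up for illustration, and the non-greedy `(.*?)` is a choice made for this sketch rather than something the notebook requires.

```python
import re

# Raw string so that \d and \. reach the regex engine unchanged
pattern = re.compile(r'Drink(.*?) through (20[23]\d)\.')

# Made-up sentences for illustration only
samples = [
    "Firm tannins. Drink now through 2025.",
    "Drink from 2018 through 2032.",
    "Drink through 2019.",
    "Best enjoyed young.",
]

years = []
for text in samples:
    m = pattern.search(text)
    if m:
        # group(0) is the whole matched phrase, group(2) is the captured year
        years.append(int(m.group(2)))
        print(m.group(0), '->', m.group(2))
```

Only the first two sentences match: the third names a year before 2020, and the fourth has no 'Drink ... through' phrase.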

### Language processing: tokenizing, removing punctuation, parts of speech

*Process one description; first separate into list of tokens*

In [None]:
text = wines.loc[1].description
tokens = nltk.wordpunct_tokenize(text)
print(tokens)

*Remove punctuation*

In [None]:
punct = list(string.punctuation)
# print(punct)
tokens_nopunct = []
for word in tokens:
    if word not in punct:
        tokens_nopunct.append(word)
print(tokens_nopunct)

In [None]:
# more compact code for same thing
punct = list(string.punctuation)
tokens_nopunct = [word for word in tokens if word not in punct]
print(tokens_nopunct)
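
*Sketch: punctuation that survives the filter.* The filter above removes tokens that are a single punctuation character. Assuming `wordpunct_tokenize` splits text with the pattern `\w+|[^\w\s]+` (my understanding of NLTK's `WordPunctTokenizer`), adjacent punctuation marks such as `),` or `...` come back as one token and are therefore kept. The sketch imitates that tokenizer with `re.findall` so it runs without NLTK; the sentence is made up for illustration.

```python
import re
import string

# Made-up sentence for illustration only
text = 'Ripe, juicy fruit (mostly cherry), with a "velvety" finish...'

# Assumed equivalent of nltk.wordpunct_tokenize: runs of word characters,
# or runs of characters that are neither word characters nor whitespace
tokens = re.findall(r'\w+|[^\w\s]+', text)

punct = set(string.punctuation)

# Filter used in the cells above: drops single-character punctuation tokens only
single_only = [t for t in tokens if t not in punct]

# Stricter filter: drops any token made up entirely of punctuation characters
strict = [t for t in tokens if not all(ch in punct for ch in t)]

print(single_only)
print(strict)
```

Whether the stricter filter is worth using depends on how often such clusters appear in the descriptions.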

*Tag with parts of speech*

In [None]:
tagged = nltk.pos_tag(tokens_nopunct)
print(tagged)

*Demystify tags*

In [None]:
done = []
for word in tagged:
    if word[1] not in done:
        done.append(word[1])
        nltk.help.upenn_tagset(word[1])

### Entire corpus as list of words

In [None]:
punct = list(string.punctuation)
allwords = []
for i in range(len(wines)):
    text = wines.loc[i].description
    tokens = nltk.wordpunct_tokenize(text)
    tokens = [word.lower() for word in tokens if word not in punct]
    allwords.extend(tokens)  # extends in place instead of copying the whole list each time
# print(allwords)

*Most common words in corpus*

In [None]:
counts = Counter(allwords)
# print(counts)
counts.most_common(20)

*Now without stopwords*

In [None]:
stop = set(stopwords.words('english'))  # a set makes 'word not in stop' fast
# print(stop)
allwords_nostops = [word for word in allwords if word not in stop]
counts = Counter(allwords_nostops)
counts.most_common(20)
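
*Sketch: effect of removing stopwords.* A toy example showing why the top of the list changes once stopwords are dropped. The word list and the four-word stopword set are made up for illustration; the set stands in for `stopwords.words('english')`.

```python
from collections import Counter

# Made-up corpus and stopword set, for illustration only
words = "the wine has the nose of cherry and the palate has cherry and spice".split()
stop = {'the', 'of', 'and', 'has'}   # stand-in for stopwords.words('english')

# Most frequent word with and without stopwords
with_stops = Counter(words).most_common(1)
without_stops = Counter(w for w in words if w not in stop).most_common(1)

print(with_stops)
print(without_stops)
```

Membership tests against a set take roughly constant time, whereas against a list they scan the elements one by one, which matters once the corpus grows to thousands of descriptions.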

#### N-grams

*N-grams on one description*

In [None]:
# Recreate list of tokens from one description
text = wines.loc[1].description
print(text, '\n')
tokens = nltk.wordpunct_tokenize(text)
punct = list(string.punctuation)
tokens = [word for word in tokens if word not in punct]
# Find 2-grams
bg = nltk.ngrams(tokens,2)
for b in bg: print(b)
# Change to ngrams(tokens,4)
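
*Sketch: what an n-gram is, without NLTK.* An n-gram is a tuple of n consecutive tokens. The helper below, named `simple_ngrams` for this sketch only, builds them by zipping the token list with shifted copies of itself. It should give the same tuples as `list(nltk.ngrams(tokens, n))` with default arguments, though that equivalence is an assumption rather than something checked here.

```python
def simple_ngrams(tokens, n):
    """Return a list of n-grams (tuples of n consecutive tokens)."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] zipped together;
    # zip stops at the shortest slice, so no partial n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up tokens for illustration only
tokens = ['ripe', 'black', 'cherry', 'and', 'spice']
bigrams = simple_ngrams(tokens, 2)
fourgrams = simple_ngrams(tokens, 4)

print(bigrams)
print(fourgrams)
```

A list of k tokens yields k - n + 1 n-grams, so five tokens give four 2-grams and two 4-grams.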

*Most common word triples in corpus*

In [None]:
grams = nltk.ngrams(allwords, 3)
counts = Counter(grams)
counts.most_common(20)
# try longer n-grams
# change to allwords_nostops

*On entire corpus find all pairs of words that follow 'citrus'*

In [None]:
grams = nltk.ngrams(allwords, 3)
for g in grams:
    if g[0] == 'citrus': print(g)
# change to allwords_nostops
# change to pairs of words around citrus (g[1]), back to allwords

In [None]:
# same functionality without n-grams
for i in range(len(allwords)-2):
    if allwords[i] == 'citrus':
        print(allwords[i], allwords[i+1], allwords[i+2])

### <font color="green">**Your Turn**</font>

*Read bigger wine dataset*

In [None]:
# Passing the filename lets pandas open and close the file itself
wines = pd.read_csv('Wines10K.csv')
print(len(wines), 'wines')

*REDUNDANCY: Find all wines where the description includes a sequence of words that starts with the wine's winery and ends with the wine's variety. Print the winery, variety, and the winery-to-variety word sequence from the description.*

In [None]:
# Hint: Refer to chocolate-precedes-fruit and 2020 examples earlier
# Hint: Use string concatenation operator '+' to create your re.search string
# YOUR CODE HERE
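
*Sketch: building a search pattern by concatenation.* Winery names can contain characters that mean something special in a regular expression (for example `.` or parentheses). Wrapping each piece in `re.escape` is one way to make them match literally. The winery, variety, and description below are made up for illustration; the loop over the dataframe is left as the exercise.

```python
import re

# Made-up values; in the exercise these would come from one row of the dataframe
winery = 'Chateau St. Example'
variety = 'Pinot Noir'
description = 'The Chateau St. Example 2012 estate Pinot Noir shows bright cherry.'

# re.escape makes characters such as '.' in the names match literally
pattern = re.escape(winery) + '(.*)' + re.escape(variety)

m = re.search(pattern, description)
# Slice out the winery-to-variety word sequence, as in the 2020 example
span = description[m.start():m.end()] if m else None
print(span)
```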

*SEARCH SUGGESTIONS: Loop where user enters two words, system returns suggested additional words. Specifically, return the five triples occurring most often that contain the two entered words (in any position). Use non-stopwords only. Text preparation may take several minutes, so it's in a separate cell from the user loop for convenience in debugging and experimenting.*

In [None]:
# STEP 1: Turn corpus into one list of words without punctuation or stopwords
# Hint: Combine code from earlier examples for all-words, no-punctuation, no-stopwords
# Preparation will take a while, so first print a user-pacification message
print('Preparing the text corpus...')
# YOUR CODE HERE
# STEP 2: Create a list of 3-grams from the text corpus
# HINT: The ngrams() function returns a one-time iterator; applying function list()
#   to the result creates a list that can be iterated through many times
# YOUR CODE HERE
print('Done')
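
*Sketch: why a one-time iterator needs `list()`.* The hint in Step 2 matters because the user loop scans the 3-grams once per query. The example uses `zip`, which also returns a one-time iterator, as a stand-in for the object returned by `ngrams()`; the tokens are made up for illustration.

```python
# Made-up tokens for illustration only
tokens = ['ripe', 'black', 'cherry', 'and', 'spice']

# zip() returns a one-time iterator, used here as a stand-in for nltk.ngrams()
one_time = zip(tokens, tokens[1:], tokens[2:])
first_pass = [g for g in one_time]
second_pass = [g for g in one_time]   # the iterator is already exhausted

# Wrapping the iterator in list() stores the triples so they can be reused
reusable = list(zip(tokens, tokens[1:], tokens[2:]))
pass_a = [g for g in reusable]
pass_b = [g for g in reusable]

print(len(first_pass), len(second_pass))
print(len(pass_a), len(pass_b))
```

Without the `list()` call, the first query in the user loop would consume the triples and every later query would find nothing.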

In [None]:
# STEP 3: Loop asking for words and returning suggestions
# Code provided for loop of user entering words, putting them into a list
while True:
    text = input("Enter two words (or 'quit' to quit): ")
    if text == 'quit': break
    words = text.split()
    if len(words) != 2:
        print('Enter exactly two words')
    else:
        # Replace print statement below with code to find the five most common
        # triples containing both words
        # Hint: Make a list of all the 3-grams containing the two entered words
        # Use the 'Counter' feature as above to find the 3-grams occurring most
        # often in your list - no need for beautiful output
        # If there are no 3-grams containing the two entered words, print 'No suggestions'
        print(words[0], words[1]) # REPLACE THIS STATEMENT WITH YOUR CODE
print('goodbye')