# Chapter 04 Logistic Regression: Assignment



The following files need to be uploaded to the Colab environment first:
- logreg_nouns.pkl (Logistic Regression Model)
- train.csv (training data)
- test.csv (test data)

We first load the logistic regression model that we trained in class, which uses nouns as features.

In [2]:
import pickle
# Load from file
with open("logreg_nouns.pkl", 'rb') as file:
    lr_nouns = pickle.load(file)
    
# Show the labels the model can predict (and the order in which they are stored)
print(lr_nouns.classes_)

# Show all the parameters used by the LogReg model
print(lr_nouns.get_params())

['Red' 'Rose' 'White' 'unk']
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


Also, we create the list of nouns (unique, lemmatised) by processing the training data.

In [5]:
import pandas

# Load the training and test data (the files are tab-separated despite the .csv extension)
train_set = pandas.read_csv('./train.csv', sep='\t', encoding='utf-8')
test_set = pandas.read_csv('./test.csv', sep='\t', encoding='utf-8')
test_set

Unnamed: 0,File name,URL,Color,Misc1,Misc2,Brand,Year,Price v1,Price v2,Grape type v1,...,Alcohol,Region,Country,Misc4,Score v1,Score v2,Reviewer,Review,Review date,Misc5
0,agroargento-2004-timoleonte-red-sicilia.txt,http://buyingguide.winemag.com/catalog/agroarg...,Red,no,no,AgroArgento 2004 Timoleonte Red (Sicilia),2004,unk,unk,red_blend,...,13.5%,"Sicilia, Sicily & Sardinia",Italy,unk,90,90,M.L.,"Leather, spice, tobacco and tea emerge from th...",3/1/2009,unk
1,ceretto-2000-blange-italian-white-arneis-piedm...,http://buyingguide.winemag.com/catalog/ceretto...,White,no,no,Ceretto 2000 Blangé Arneis (Piedmont),2000,30,$30,arneis_italian_white,...,12%,Piedmont,Italy,Moët Hennessy USA,87,87,D.T.,"So pale that it’s almost colorless, the Blangé...",9/1/2001,unk
2,edna-valley-vineyard-1998-paragon-pinot-noir-c...,http://buyingguide.winemag.com/catalog/edna-va...,Red,no,no,Edna Valley Vineyard 1998 Paragon Pinot Noir (...,1998,19,$19,pinot_noir,...,14.2%,"Edna Valley, Central Coast, California",US,unk,86,86,S.H.,"Comes across on the earthy, herbal side, altho...",10/1/2000,unk
3,arnaldo-caprai-2011-grecante-italian-white-gre...,http://buyingguide.winemag.com/catalog/arnaldo...,White,no,no,Arnaldo Caprai 2011 Grecante Grechetto (Colli ...,2011,20,$20,grechetto_italian_white,...,13.5%,"Colli Martani, Central Italy",Italy,Folio Fine Wine Partners,88,88,M.L.,What a wonderful wine to pair with spaghetti a...,10/1/2012,unk
4,cline-2003-ancient-vines-mourvedre-central-coa...,http://buyingguide.winemag.com/catalog/cline-2...,Red,no,no,Cline 2003 Ancient Vines Mourvèdre (Contra Cos...,2003,18,$18,mourvèdre,...,unk,"Contra Costa County, Central Coast, California",US,unk,84,84,S.H.,"What’s puzzling about this wine is why, given ...",4/1/2005,unk
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,lucien-albrecht-2005-cuvee-romanus-pinot-grigi...,http://buyingguide.winemag.com/catalog/lucien-...,White,no,no,Lucien Albrecht 2005 Cuvée Romanus Pinot Gris ...,2005,19,$19,pinot_gris,...,13%,Alsace,France,Pasternak Wine Imports,90,90,Joe Czerwinski,"This medium-bodied, off-dry Pinot Gris feature...",7/1/2007,unk
996,brander-2005-au-naturel-sauvignon-blanc-centra...,http://buyingguide.winemag.com/catalog/brander...,White,no,no,Brander 2005 Au Naturel Sauvignon Blanc (Santa...,2005,30,$30,sauvignon_blanc,...,13.9%,"Santa Ynez Valley, Central Coast, California",US,unk,93,93,S.H.,Brander’s stainless steel fermented Sauvignon ...,11/15/2006,unk
997,domaine-francois-schmitt-2010-pinot-blanc-alsa...,http://buyingguide.winemag.com/catalog/domaine...,White,no,no,Domaine François Schmitt 2010 Pinot Blanc (Als...,2010,unk,unk,pinot_blanc,...,12.5%,Alsace,France,"Fruit of the Vines, Inc",86,86,Roger Voss,"This ripe, honeyed wine is very rich and smoot...",12/31/2012,unk
998,sunce-vineyard-winery-2007-clone-337-bevill-ma...,http://buyingguide.winemag.com/catalog/sunce-v...,Red,no,no,Suncé Vineyard & Winery 2007 Clone 337 Bevill-...,2007,50,$50,cabernet_sauvignon,...,14%,"Alexander Valley, Sonoma, California",US,unk,86,86,S.H.,"Soft and rather one-dimensional, this Cabernet...",11/1/2011,unk


In [6]:
# Keep only the review text and the color label from each data set
train_reviews = train_set['Review'].to_list()
train_colors = train_set['Color'].to_list()

test_reviews = test_set['Review'].to_list()
test_colors = test_set['Color'].to_list()


In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Process all training reviews (nlp.pipe returns a generator, so it can only be iterated once)
train_doc_reviews = nlp.pipe(train_reviews)

list_noun_lemma = []

for review in train_doc_reviews:
    for token in review:
        if token.pos_ == "NOUN":
            # In this case, we will add the lemma of the noun to our list and not the full word
            list_noun_lemma.append(token.lemma_)
            

# We are only interested in the list of unique nouns
list_noun_lemma_unique = list(set(list_noun_lemma))

# Let's print and see how many unique nouns we have
print(len(list_noun_lemma_unique))
print(list_noun_lemma_unique)

4098
['bouillon', 'blackberry', 'earth', 'ash', 'Grapey', 'punchless', 'whiskey', 'abv', 'awkwardness', 'imagination', 'depth', 'feel', 'edge', 'stiffer', 'campore', 'starring', 'avalanche', 'curd', 'pour', 'plot', 'note', 'load', 'presence', 'side', 'buttercream', 'melony', 'underscoring', 'pit', 'goldkap', 'affordability', 'dal', 'envelope', 'grapefuit', 'labeling', 'lifeless', 'roof', 'moleche', 'peach', 'father', 'swagger', 'numbing', 'dustiness', 'suave', 'refreshing', 'spit', 'cheer', 'transplant', 'increase', 'ctirus', 'garland', 'carafe', 'frame', 'ribbon', 'green', 'plantain', 'quality', 'smell', 'stripe', 'pastas', 'd’Alter', 'harmony', 'nuttiness', 'comment', 'boost', 'bois', 'ooze', 'parfait', 'meritage', 'standard', 'coating', 'cooking', 'opulence', 'reverberate', 'traditionalist', 'drama', 'volatility', 'percent', 'garni', 'ambition', 'reserve', 'crispy', 'chest', 'vise', 'soil', 'dunk', 'mouthcoating', 'screwcap', 'hammer', 'briefly', 'brunt', 'sorbet', 'breeze', 'famila
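
Later cells pair this noun list with the model's coefficients by position, so the list must have the same length and order as the features used when the model was trained. Note that `list(set(...))` does not guarantee the same order across Python sessions. Below is a minimal sanity check on a toy model; the feature names, data, and the name `toy_model` are hypothetical stand-ins, not the class model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins (hypothetical): 3 features, 3 labels
feature_names = ["tannin", "grapefruit", "strawberry"]
X = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
y = ["Red", "Red", "White", "White", "Rose", "Rose"]

toy_model = LogisticRegression().fit(X, y)

# coef_ has one row per label and one column per feature,
# so the name list must have exactly as many entries as there are columns
n_labels, n_features = toy_model.coef_.shape
assert n_features == len(feature_names)
print(n_labels, n_features)
```

The same length check can be run against `lr_nouns.coef_` and `list_noun_lemma_unique`. It catches a wrong list length but not a wrong order.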

## Assignment 1. Feature weights ##
Print the 5 terms (nouns in this case) that have the largest weights (coefficients) for predicting the label "Red". Use the logistic regression model that was trained with nouns.
- Rank the terms by absolute weight, since weights can be positive or negative.


In [None]:
# Each row of coef_ holds the weights for one label, in the order of classes_
for label, coefs, intercept in zip(lr_nouns.classes_, lr_nouns.coef_, lr_nouns.intercept_):
    if label == "Red":
        print(label)
        # Pair each noun with its weight; this assumes list_noun_lemma_unique
        # is in the same order as the features the model was trained on
        noun_coef_list = [[t, c] for t, c in zip(list_noun_lemma_unique, coefs)]

        # Sort by absolute weight, largest first
        coefs_sorted = sorted(noun_coef_list, key=lambda x: abs(x[1]), reverse=True)
        print(coefs_sorted[:5])

# How can we interpret the coefficients and the intercept?


Red
[['rosé', -1.928311270568853], ['dessert', -1.8101951727516834], ['tannin', 1.756719856332193], ['course', -1.6717013023044425], ['grapefruit', -1.6668051079209778]]
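
One way to approach the interpretation question: for each label, the model adds the intercept to the weights of the features present in a review, then normalises the resulting scores into probabilities. A positive weight pushes the score for that label up and a negative weight pushes it down. The sketch below checks this on toy data. It assumes the multinomial (softmax) formulation that scikit-learn uses for multi-class problems with the `lbfgs` solver; the data and the name `toy_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): 3 binary features, 3 labels
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
y = np.array(["Red", "Red", "White", "White", "Rose", "Rose"])
toy_model = LogisticRegression().fit(X, y)

x = X[0]
# One score per label: the intercept plus the weights of the features that are present
scores = toy_model.intercept_ + toy_model.coef_ @ x
# Softmax turns the scores into probabilities that sum to 1
exp_scores = np.exp(scores - scores.max())
manual_proba = exp_scores / exp_scores.sum()

# Compare with what scikit-learn reports for the same input
sklearn_proba = toy_model.predict_proba([x])[0]
print(np.allclose(manual_proba, sklearn_proba))
```

Read this way, the negative weight of 'rosé' for "Red" means that seeing the noun lowers the score for "Red", and the intercept is the score a label gets when none of the features is present.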


## Assignment 2. Feature weights ##
Print the 10 terms (nouns) that have the largest weights (coefficients) for each label separately: 'Red', 'Rose', 'White' and 'unk'. Use the logistic regression model that was trained with nouns.
- Rank the terms by absolute weight, since weights can be positive or negative.

In [None]:
top_n = 5  # the task asks for 10 terms; the output below shows the top 5

for label, coefs, intercept in zip(lr_nouns.classes_, lr_nouns.coef_, lr_nouns.intercept_):
    print()
    print(label)
    # Pair each noun with its weight for this label
    noun_coef_list = [[t, c] for t, c in zip(list_noun_lemma_unique, coefs)]
    #print("INTERCEPT:", intercept)

    # Sort by absolute weight, largest first
    coefs_sorted = sorted(noun_coef_list, key=lambda x: abs(x[1]), reverse=True)
    print(coefs_sorted[:top_n])
    



Red
[['rosé', -1.928311270568853], ['dessert', -1.8101951727516834], ['tannin', 1.756719856332193], ['course', -1.6717013023044425], ['grapefruit', -1.6668051079209778]]

Rose
[['rosé', 2.924601091734564], ['strawberry', 1.6303522768035823], ['blush', 1.5446374554499815], ['watermelon', 1.4723306418242699], ['cherry', 1.2806005823467308]]

White
[['cherry', -2.1119416275229526], ['berry', -2.106642556169743], ['raspberry', -2.040957707044851], ['strawberry', -1.9915161081742416], ['red', -1.71249136746508]]

unk
[['sparkler', 3.3040960232853007], ['dessert', 2.4104827342290323], ['sparkle', 2.248750173330566], ['bubbly', 2.080444573504521], ['botrytis', 1.9128409715710066]]
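
The same ranking can also be done with NumPy instead of `sorted` with a lambda. The sketch below is one possible alternative, not the method used in class. The helper name `top_k_by_abs` is hypothetical, and the weights are toy values loosely rounded from the output above.

```python
import numpy as np

def top_k_by_abs(terms, coefs, k):
    """Return the k (term, weight) pairs with the largest absolute weight."""
    coefs = np.asarray(coefs)
    # argsort on the negated absolute values gives indices from largest to smallest
    order = np.argsort(-np.abs(coefs))[:k]
    return [(terms[i], float(coefs[i])) for i in order]

terms = ["tannin", "rosé", "oak", "dessert"]
coefs = [1.76, -1.93, 0.10, -1.81]
print(top_k_by_abs(terms, coefs, 2))
# → [('rosé', -1.93), ('dessert', -1.81)]
```

With the real model, each row of `lr_nouns.coef_` would be passed as `coefs`, together with the noun list.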


## DIY3. Adjectives ##
Train a new model that uses only the adjectives in the test set as features (lowercased, unique, lemmatized).
Remove the adjectives "red", "white" and "rose" from the unique list of adjectives (we don't want to predict the labels from the same words).

Save your model! We will evaluate the models next week.

Process the test set to get a list of unique adjectives. (Note to Pranay: we extracted these terms from the test set during class, but it is also fine to extract them from the training set in this exercise.)

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

# Process a text
test_doc_reviews = nlp.pipe(test_reviews)

list_adj = []

# Fill in the rest of the code
# Feel free to use additional cells so that you don't have to repeat slow steps (such as generating features)

In [14]:
import spacy

nlp = spacy.load('en_core_web_sm')

# Process a text
train_doc_reviews = nlp.pipe(train_reviews)

list_adj = []

for review in train_doc_reviews:
    for token in review:
        if token.pos_ == "ADJ":
            list_adj.append(token.lemma_)

list_adj_unique = list(set(list_adj))

print(len(list_adj_unique))
print(list_adj_unique)

2297
['lock', 'blackberry', 'cambodian', 'appetizing', 'outgoing', 'brilliant', 'physical', 'canadian', 'feel', 'caribbean', 'überb', 'buttercream', 'conventional', 'side', 'heavilyoaked', 'underscoring', 'confected', 'tense', 'lifeless', 'needed', 'undergrowth', 'basque', 'peach', 'desiccated', 'refreshing', 'awry', 'extended', 'daytime', 'notorious', 'lucky', 'telltale', 'green', 'plantain', 'vegetarian', 'standard', 'vertical', 'joyful', 'plus', 'casual', 'efficient', 'vivacious', 'flavorful', 'untamed', 'drinkable', 'reserve', 'invisible', 'least', 'provacative', 'erratic', 'brutal', 'unleashed', 'needs', 'lipsmacking', 'australian', 'bountiful', 'ominous', 'minus', 'patented', 'passito', 'enthusiastic', 'attuned', 'patchy', 'svelte', 'scant', 'german', 'oaky', 'eucalypt', 'mouthfeel', 'marvelous', 'mannered', 'consistent', 'vinous', 'lightweight', 'albana', 'gulpable', 'broad', 'dainty', 'afforable', 'roasted', 'chiseled', 'fecund', 'alcoholic', 'keen', 'local', 'invasive', 'prist

Remove the adjectives "red", "white" and "rose" from the list.

In [15]:
print(len(list_adj_unique))
print(list_adj_unique)

list_adj_unique.remove('red')
list_adj_unique.remove('white')
list_adj_unique.remove('rose')


print(len(list_adj_unique))
print(list_adj_unique)

2297
['lock', 'blackberry', 'cambodian', 'appetizing', 'outgoing', 'brilliant', 'physical', 'canadian', 'feel', 'caribbean', 'überb', 'buttercream', 'conventional', 'side', 'heavilyoaked', 'underscoring', 'confected', 'tense', 'lifeless', 'needed', 'undergrowth', 'basque', 'peach', 'desiccated', 'refreshing', 'awry', 'extended', 'daytime', 'notorious', 'lucky', 'telltale', 'green', 'plantain', 'vegetarian', 'standard', 'vertical', 'joyful', 'plus', 'casual', 'efficient', 'vivacious', 'flavorful', 'untamed', 'drinkable', 'reserve', 'invisible', 'least', 'provacative', 'erratic', 'brutal', 'unleashed', 'needs', 'lipsmacking', 'australian', 'bountiful', 'ominous', 'minus', 'patented', 'passito', 'enthusiastic', 'attuned', 'patchy', 'svelte', 'scant', 'german', 'oaky', 'eucalypt', 'mouthfeel', 'marvelous', 'mannered', 'consistent', 'vinous', 'lightweight', 'albana', 'gulpable', 'broad', 'dainty', 'afforable', 'roasted', 'chiseled', 'fecund', 'alcoholic', 'keen', 'local', 'invasive', 'prist
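
Note that `list.remove` raises a `ValueError` when the value is not in the list, so the cell above fails if one of the three words never occurs as an adjective, or if the cell is run a second time. A minimal sketch of a more forgiving alternative follows; the helper name `drop_terms` and the toy list are hypothetical.

```python
def drop_terms(terms, excluded):
    """Return a new list without the excluded terms, keeping the original order."""
    excluded = set(excluded)
    return [t for t in terms if t not in excluded]

adjectives = ["ripe", "red", "green", "white", "vibrant"]
# "rose" is not in the toy list, and that is fine here
filtered = drop_terms(adjectives, ["red", "white", "rose"])
print(filtered)
# → ['ripe', 'green', 'vibrant']
```

Because the function returns a new list, running it twice gives the same result.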

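Create the feature vectors: one row per review and one column per adjective, set to 1 when the adjective occurs in the review.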

In [16]:
import numpy 

train_features = numpy.zeros((len(train_reviews), len(list_adj_unique)))
test_features = numpy.zeros((len(test_reviews), len(list_adj_unique)))

print(train_features.shape)
print(test_features.shape)

(10000, 2294)
(1000, 2294)


In [17]:
import spacy

nlp = spacy.load('en_core_web_sm')

train_doc_reviews = nlp.pipe(train_reviews)

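# Map each adjective to its column once, instead of calling list.index() repeatedly
adj_index = {term: i for i, term in enumerate(list_adj_unique)}

for review, f in zip(train_doc_reviews, train_features):
    # Every token lemma is checked here, whatever its part-of-speech tag in this review
    for token in review:
        term_id = adj_index.get(token.lemma_)
        if term_id is not None:
            f[term_id] = 1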

In [18]:
import spacy

nlp = spacy.load('en_core_web_sm')

test_doc_reviews = nlp.pipe(test_reviews)

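# Map each adjective to its column once, instead of calling list.index() repeatedly
adj_index = {term: i for i, term in enumerate(list_adj_unique)}

for review, f in zip(test_doc_reviews, test_features):
    # Every token lemma is checked here, whatever its part-of-speech tag in this review
    for token in review:
        term_id = adj_index.get(token.lemma_)
        if term_id is not None:
            f[term_id] = 1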
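
The two loops above match on the lemma alone. A term that was tagged as an adjective somewhere in the training reviews is therefore also counted when it appears with a different tag, which may be why words such as "clove" or "wine" show up among the active features later on. The sketch below shows a stricter variant that also checks the tag. It stands in for spaCy tokens with plain `(lemma, pos)` pairs so that it runs without a language model; the function name and toy data are hypothetical.

```python
import numpy as np

def adjective_features(docs, vocabulary):
    """docs: list of reviews, each given as a list of (lemma, pos) pairs."""
    index = {term: i for i, term in enumerate(vocabulary)}
    features = np.zeros((len(docs), len(vocabulary)))
    for row, tokens in enumerate(docs):
        for lemma, pos in tokens:
            col = index.get(lemma)
            # Only count the term when it is tagged as an adjective in this review
            if pos == "ADJ" and col is not None:
                features[row, col] = 1
    return features

vocab = ["green", "clove", "ripe"]
docs = [
    [("clove", "NOUN"), ("vibrant", "ADJ")],
    [("green", "ADJ"), ("ripe", "ADJ"), ("wine", "NOUN")],
]
print(adjective_features(docs, vocab))
```

With spaCy, the pairs would come from `(token.lemma_, token.pos_)`. Whether the stricter variant improves the model is something to test, not a given.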

In [19]:
# In sklearn, all machine learning models are implemented as Python classes
from sklearn.linear_model import LogisticRegression

# Make an instance of the model from the LogisticRegression class;
# all parameters not specified are set to their defaults
lr_adj = LogisticRegression()

# Train the model on the data, storing the information learned from the data.
# The model learns the relationship between the adjective features (train_features)
# and the labels (train_colors)
lr_adj.fit(train_features, train_colors)

# Show the labels the model can predict (and the order in which they are stored)
print(lr_adj.classes_)

# Show all the parameters used by the LogReg model
print(lr_adj.get_params())

['Red' 'Rose' 'White' 'unk']
{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
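
The warning above means that the solver stopped at the default limit of 100 iterations before it converged. The model can still be used, but its weights may not be fully optimised. As the message suggests, one option is a higher `max_iter`. Here is a minimal sketch on synthetic stand-in data (not the wine reviews); the value 1000 and the name `lr_more_iter` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), not the wine reviews
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Allow more optimisation steps than the default of 100
lr_more_iter = LogisticRegression(max_iter=1000)
lr_more_iter.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
print(lr_more_iter.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged and not because it hit the limit.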


Define a "predict" function that prints the review, its active features, the correct label, and the model's prediction with the probability of each label.

In [20]:
def predict(index):
    # Print the review at this index
    print(test_reviews[index])
    # Print the feature vector at this index
    print(test_features[index])
    # Print every term whose feature is set for this review
    for i, value in enumerate(test_features[index]):
        if value == 1:
            print(list_adj_unique[i])
    print()
    # Print the correct label for this index
    print(test_colors[index])

    print()
    print("Predictions:")
    # Print the predicted label for these features
    print(lr_adj.predict([test_features[index]]))
    # Print the predicted probability of each label
    print(lr_adj.predict_proba([test_features[index]]))
    # Print all possible labels, in the same order as the probabilities
    print(lr_adj.classes_)
    print()

In [21]:
predict(0)
predict(10)

Leather, spice, tobacco and tea emerge from the nose of this Sicilian blend of Nero d’Avola, Syrah, Merlot, Cabernet and Petit Verdot. You’ll get aromas of clove, allspice and vanilla behind vibrant blueberry and raspberry.
[0. 0. 0. ... 0. 0. 0.]
clove
vibrant
sicilian
emerge

Red

Predictions:
['Red']
[[0.77069234 0.04934914 0.16585378 0.01410474]]
['Red' 'Rose' 'White' 'unk']

I haven’t been a fan of Santa Ynez Cabs for the simple reason that they’re so seldom ripe. You get this green, herb and mint streak that’s not flattering to Cab’s tannins. This wine is in that vein. 
[0. 0. 0. ... 0. 0. 0.]
green
ripe
simple
’
wine
flattering

Red

Predictions:
['White']
[[0.22309658 0.00216217 0.76830962 0.00643163]]
['Red' 'Rose' 'White' 'unk']
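
The probabilities are listed in the same order as `classes_`, so the two arrays can be zipped together to make the output easier to read. The sketch below uses the numbers printed for the second review; the variable names are hypothetical.

```python
import numpy as np

# Values copied from the output of predict(10) above
classes = np.array(["Red", "Rose", "White", "unk"])
proba = np.array([0.22309658, 0.00216217, 0.76830962, 0.00643163])

# Pair each label with its probability
by_label = {str(label): float(p) for label, p in zip(classes, proba)}
# The predicted label is the one with the highest probability
predicted = str(classes[np.argmax(proba)])

print(by_label)
print(predicted)
# → White
```

Here the model assigns about 0.77 to "White" and about 0.22 to "Red", although the correct label is "Red".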



In [22]:
import pickle

# Save to file in the current working directory
pkl_filename = "logreg_adj_Key.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(lr_adj, file)
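
The pickled model does not contain the adjective list, but the model is only usable with features in exactly the same order as during training. One option is to store the model and the feature list together. Below is a minimal sketch with a toy model and a temporary file; the dictionary keys, file name, and data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins (hypothetical) for the adjective list and the trained model
features = ["ripe", "green", "vibrant"]
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]])
y = ["Red", "White", "Red", "Red"]
toy_model = LogisticRegression().fit(X, y)

# Save the model and its feature order in one file
path = os.path.join(tempfile.mkdtemp(), "toy_model_with_features.pkl")
with open(path, "wb") as file:
    pickle.dump({"model": toy_model, "features": features}, file)

# Load both back and check that the predictions are unchanged
with open(path, "rb") as file:
    bundle = pickle.load(file)

print(bundle["features"])
print((bundle["model"].predict(X) == toy_model.predict(X)).all())
```

For the assignment this would mean saving `list_adj_unique` next to `lr_adj`, so that the evaluation can rebuild the test features in the same column order.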