# Logistic Regression assignment


This is the **Code help Notebook** for the Logistic Regression assignment.

The goal of this assignment is to introduce you to the idea of a **Maximum Entropy Language Model**,
that is, a Logistic Regression (LR) model that tries to predict the next word on the basis of features of the word's linguistic context.  We will look at two versions of the model.  One is a simple bigram model formulated as an LR model: the only context feature is the word immediately preceding the word we are predicting
(which we will call the **target**).  The other is a bigram model augmented with **trigger word features**, words that are known triggers for other words, which may be found arbitrarily far from the target.  The idea of building an LR language model is due to [Rosenfeld (1994)](https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/roni/papers/me-csl-revised.pdf) and we follow his method for finding trigger words.

The code in the section entitled **Preparing the filtered corpus** all needs to run in order for
this notebook to work.

The code in the section entitled  **Finding triggers  using Mutual Information (Rosenfeld 1994)**
does not need to run (it takes a while).  You can circumvent it by using the value assigned to
`triggers` at the end of that section. The correct value for `triggers` (a set of 150 words) is spelled out at the end of that section.

## Preparing the filtered corpus

To restrict the number of parameters in our model and yet demonstrate the ideas on realistic data, we are going to build an LR language model on a **filtered** data set.  We will model only noun co-occurrences, and we will limit our model to nouns occurring over 100 times in the Brown corpus. That is, given that Brown is about 1.2 million words,
we will look at nouns with a relative frequency greater than about:

$$
\frac{100}{1{,}200{,}000} \approx 8.3 \times 10^{-5}
$$

This gives us a vocabulary of between 600 and 700 words.

In [8]:
from sklearn.metrics.cluster import contingency_matrix
from sklearn.metrics import mutual_info_score
import numpy as np

Corpus prep: vocab filter (frequency threshold and noun tag)

$$
\begin{array}{rcrrr}
k & pos & \# sents & V & N\\
\hline
200 & \text{N} & 28,468  & 258 & 21,709\\
\mathbf{100} &  \text{N}  & \mathbf{19,243} & \mathbf{615} & \mathbf{55,792} \\
 1 &        & 57,340  & 49,815 & 1,161,192\\
\end{array}
$$

In [9]:
from nltk import FreqDist
from nltk.corpus import brown

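def relevant(w, tag, fd=None, k=100, pos_chars=None):
    """
    Return True if word w with Brown tag `tag` passes the vocabulary filter.

    pos_chars is usually 'N', 'V', or 'NV'; it selects the nominal and/or verbal
    POS tag sets of Brown by the first character of the tag.
    fd, if given, is a frequency distribution; w must occur more than k times in it.
    """
    if pos_chars is not None and tag[0] not in pos_chars:
        return False
    if fd is not None:
        return fd[w] > k
    return True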


bw = [w.lower() for (w,t) in brown.tagged_words()]
print(len(bw))
fd = FreqDist(bw)
tagged_sents = brown.tagged_sents()

###############   Filtered vocab #####################################################################

#k=200,pos_chars="N"
#k=100, pos_chars = "N"
k, pos_chars, filtered_vocab = 100, "N", True

if filtered_vocab:
    f_bw = [w.lower() for (w,t) in brown.tagged_words() if relevant(w,t,fd,k=k,pos_chars=pos_chars)]
    f_fd = FreqDist(f_bw)
    #print(len(bw))
    f_sents = [[w.lower() for (w,t) in sent if relevant(w, t,f_fd,k=k,pos_chars=pos_chars)] for sent in tagged_sents]
    f_sents = [s for s in f_sents if len(s)>1]
    f_vocab=set(f_fd.keys())
    f_V = len(f_vocab)
    print("Filtered info: ","Sents", len(f_sents),"Vocab", f_V, "N", sum(len(s) for s in f_sents))

###############   End filtered vocab ###################################################################

sents = [[w.lower() for (w,t) in sent] for sent in tagged_sents]
vocab=set(fd.keys())
sents = [s for s in sents if len(s)>1]
num_events = len(sents)
V = len(vocab)
num_tokens = sum(len(s) for s in sents)
#print(num_events,V)
print("Base info: ","Sents", num_events,"Vocab", V, "N", num_tokens)

1161192
Filtered info:  Sents 19243 Vocab 615 N 55792
Base info:  Sents 57013 Vocab 49815 N 1160865


In [10]:
len(fd)

49815

In [11]:
len(tagged_sents)

57340

In [12]:
fd["time"]

1598

In [13]:
fd["language"]

109

These are the words to try to predict:

In [14]:
class_list = f_fd.most_common(100)
class_set = {w for (w,ct) in class_list}
print(len(class_set))

100


In [15]:
f_sents[:5]

[['evidence', 'place'],
 ['charge', 'manner'],
 ['interest', 'number', 'size', 'city'],
 ['number', 'interest'],
 ['cost', 'administration']]

In [16]:
#wd="work"
#sent = class_sents[wd][0]
#if not wd== sent[0]:
#   print(sent[:sent.index(wd)])


Make a dictionary in which each key is a class word and
the corresponding value is a list of histories: for each filtered Brown "sentence" containing that class word, the words that precede its first occurrence.

In [17]:
from collections import defaultdict

def make_class_sents(f_sents,class_set):
    class_sents0 = defaultdict(list)
    for sent in f_sents:
        wds = class_set.intersection(sent)
        for wd in wds:
            if not wd == sent[0]:
                class_sents0[wd].append(sent[:sent.index(wd)])
    return {wd:sents for (wd,sents) in class_sents0.items() if len(sents)>100}

#{wd:sents for (wd,sents) in class_sents0.items() if len(sents)>100}
class_sents = make_class_sents(f_sents,class_set) 
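
To see what the history extraction does, here is a minimal sketch on three of the filtered sentences shown earlier. The helper name `toy_class_histories` and the two-word class set are illustrative assumptions, and the sketch leaves out the `len(sents)>100` filter that `make_class_sents` applies at the end.

```python
from collections import defaultdict

def toy_class_histories(sents, class_set):
    """Map each class word to the prefixes (histories) preceding its first occurrence."""
    histories = defaultdict(list)
    for sent in sents:
        # Only class words that actually occur in this sentence matter.
        for wd in class_set.intersection(sent):
            first = sent.index(wd)   # position of the first occurrence
            if first > 0:            # a sentence-initial word has an empty history
                histories[wd].append(sent[:first])
    return dict(histories)

toy_sents = [
    ["interest", "number", "size", "city"],
    ["number", "interest"],
    ["cost", "administration"],
]
toy_histories = toy_class_histories(toy_sents, {"interest", "city"})
print(toy_histories)
```

Note that the first sentence contributes a history for "city" but not for "interest", because "interest" is the first word there and has nothing before it.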

In [18]:
#len(class_sents["work"])  276
# len(class_sents) 91  
# so we actually only have 91 classes to predict
# total_num_class_sents = sum(1 for sents in class_sents.values() for sent in sents)
# total_num_class_sents 16026
# so we have 16K  histories to split into 

We have 91 target words.

In [19]:
len(class_sents)

91

In [20]:
# number of sentences/wds in the filtered corpus

print("sents", sum(1 for sents in class_sents.values() for s in sents))
print("words", sum(1 for sents in class_sents.values() for s in sents for wd in s))

sents 16026
words 28714


For ease of computation, the corpus has been filtered to include only nouns with frequency greater than 100 in the Brown corpus:

Freq threshold for f_vocab, part of speech for f_vocab.

In [21]:
k,pos_chars

(100, 'N')

In [22]:
#for s in sents[:25]:
#    print(s)

## Finding triggers  using Mutual Information (Rosenfeld 1994)

## Mutual information calculation

This code takes a while to run.  You can run it if you like, or you can just use the value
of `triggers`, the set of words being computed in this section, which is assigned to the right set at the end of the **Trigger Vocab calculation** section.

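The concept of a **trigger word**: a word whose occurrence in a sentence increases the likelihood that a partner word occurs in the same sentence.

For the MI calculation, each word is assigned a vector recording which sentences it has occurred in: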

In [23]:

f_encoder = {wd:i for (i,wd) in enumerate(f_vocab)}


def get_mi_word_vecs (sents,V,encoder):
    vecs = np.zeros((V,len(sents)),dtype=int)
    for (j,sent) in enumerate(sents):
        cts = FreqDist(sent)
        for (wd,ct) in cts.items():
            vecs[encoder[wd],j] = ct
    return vecs
            
f_vecs = get_mi_word_vecs (f_sents,f_V,f_encoder)

Shape is V x len(f_sents).

In [24]:
f_vecs.shape

(615, 19243)

Now we can get the mutual information of two words by taking the mutual information score
of their two word vectors.  So we compute all the pairwise MI scores for the filtered
vocab.
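
As a rough illustration of what the MI score of two word vectors measures, the sketch below computes mutual information by hand for binary occurrence vectors, using the natural logarithm (which is also what `mutual_info_score` uses). The function name `binary_mutual_info` and the four-sentence toy vectors are assumptions made for this example only.

```python
from collections import Counter
from math import log

def binary_mutual_info(x, y):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(x)
    joint = Counter(zip(x, y))   # counts of each (x, y) value pair
    px = Counter(x)              # marginal counts for x
    py = Counter(y)              # marginal counts for y
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        mi += p_ab * log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# One entry per sentence; True means the word occurs in that sentence.
word_a = [True, True, False, False]
word_b = [True, True, False, False]   # always co-occurs with word_a
word_c = [True, False, True, False]   # occurs independently of word_a

mi_same = binary_mutual_info(word_a, word_b)
mi_indep = binary_mutual_info(word_a, word_c)
print(mi_same, mi_indep)
```

Words that always appear in the same sentences get a high score, while words whose occurrences are independent get a score of zero.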

In [9]:
import time

f_enc_pairs = list(f_encoder.items())
#mis0 = np.zeros((V,V))
#mis = Triangular2DArray(mis0)
#vecs[0]

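def get_mis (V,enc_pairs,vecs,batch_sz=50):
    # mis[i,j] holds the MI score for the word pair (i,j)
    mis = np.zeros((V,V))
    # Binarize: we only care whether a word occurs in a sentence, not how often
    vecs_b = (vecs > 0)
    for (idx,(wd,i)) in enumerate(enc_pairs):
        if idx%batch_sz == 0:
            print(f"Processing word {idx} {time.ctime()}")
        # MI is symmetric, so each pair is scored only once;
        # only the entries mis[i,j] for pairs at or after position idx are filled in
        for (wd2,j) in enc_pairs[idx:]:
            mis[i,j] = mutual_info_score(vecs_b[i],vecs_b[j])
    return mis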

print(time.ctime())
mis = get_mis (f_V,f_enc_pairs,f_vecs)
print(time.ctime())
#word_vecs = vecs.sum(axis=0)
#del vecs

Fri Jan 30 17:48:24 2026
Processing word 0 Fri Jan 30 17:48:24 2026
Processing word 50 Fri Jan 30 17:48:59 2026
Processing word 100 Fri Jan 30 17:49:31 2026
Processing word 150 Fri Jan 30 17:50:00 2026
Processing word 200 Fri Jan 30 17:50:27 2026
Processing word 250 Fri Jan 30 17:50:51 2026
Processing word 300 Fri Jan 30 17:51:11 2026
Processing word 350 Fri Jan 30 17:51:29 2026
Processing word 400 Fri Jan 30 17:51:43 2026
Processing word 450 Fri Jan 30 17:51:55 2026
Processing word 500 Fri Jan 30 17:52:03 2026
Processing word 550 Fri Jan 30 17:52:09 2026
Processing word 600 Fri Jan 30 17:52:11 2026
Fri Jan 30 17:52:11 2026


In [29]:
mis[:,0].max()

0.04332267751869658

In [31]:
mi_max_vals = mis.max(axis=0)
#mi_max_vals

In [481]:
f_encoder["man"]

579

##  End MI calculations

## Trigger vocab calculation

Now for each vocab word find exactly one "trigger", another word strongly associated according to
mutual information:

In [33]:
def print_triggers (trigger_pairs, top_n=None):
    if top_n is not None:
        trigger_pairs = trigger_pairs[:top_n]
    for (wd,trig,score) in trigger_pairs:
        print(f"{wd} {trig} {score:.5f}")

def get_triggers (mis, decoder, threshhold=.0002,verbose=False):
    # Zero out self triggers for now
    for i in range(mis.shape[0]):
        mis[i,i] = 0
    # Find the best trigger of word in V
    trigger_pairs = [(decoder[idx],decoder[mis[idx].argmax()]) for idx in range(mis.shape[0])]
    # as well as the MI scores
    trigger_scores = [(decoder[idx],mis[idx].max()) for idx in range(mis.shape[0])]
    # Apply threshhold
    filtered_trigger_pairs= []
    for (i,(wd, trig)) in enumerate(trigger_pairs):
        score = trigger_scores[i][1]
        if score > threshhold:
            filtered_trigger_pairs.append((wd,trig,score))
    # Sort by score
    filtered_trigger_pairs.sort(key=lambda x: x[2])
    if verbose:
        print_triggers (filtered_trigger_pairs)
    return filtered_trigger_pairs

threshhold,verbose = .0002,False
f_decoder = {i:wd for (wd,i) in f_encoder.items()}
filtered_trigger_pairs = get_triggers (mis, f_decoder,threshhold=threshhold,verbose=verbose)
(wds,triggers,scores) = zip(*filtered_trigger_pairs)
triggers = set(triggers)
print(len(triggers))

150
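
The selection logic can be sketched on a toy matrix. Everything here is an illustrative assumption: the function name `pick_triggers`, the four words, and the made-up scores (which are not real MI values). Unlike `get_triggers`, this sketch does not modify the matrix in place and sorts strongest-first.

```python
def pick_triggers(mi, words, threshold):
    """For each word, keep its highest-scoring partner if the score exceeds threshold."""
    pairs = []
    for i, wd in enumerate(words):
        # Ignore the diagonal so a word cannot be chosen as its own trigger.
        scores = [s if j != i else 0.0 for j, s in enumerate(mi[i])]
        best = max(range(len(words)), key=scores.__getitem__)
        if scores[best] > threshold:
            pairs.append((wd, words[best], scores[best]))
    # Strongest associations first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs

toy_words = ["men", "women", "tax", "door"]
toy_mi = [
    [0.9, 0.6, 0.1, 0.0],
    [0.6, 0.9, 0.0, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]
toy_pairs = pick_triggers(toy_mi, toy_words, threshold=0.05)
toy_triggers = {trig for (_, trig, _) in toy_pairs}
print(toy_pairs)
print(toy_triggers)
```

The set of triggers can be smaller than the number of words that have a trigger, because several words may share the same trigger ("men" serves both "women" and "tax" here) and words with no score above the threshold ("door") contribute nothing.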


In [493]:
#(wds00,triggers00,scores00) = zip(*filtered_trigger_pairs)
for (w,t,s) in filtered_trigger_pairs[::-1]:
    print(f"{w:<25} {t:<25} {s:.5f}")

men                       women                     0.00613
girls                     boys                      0.00565
wife                      husband                   0.00388
view                      point                     0.00249
eyes                      face                      0.00235
year                      tax                       0.00211
father                    mother                    0.00195
costs                     cost                      0.00193
stock                     market                    0.00188
aid                       countries                 0.00187
development               research                  0.00183
sales                     tax                       0.00181
room                      door                      0.00170
students                  college                   0.00170
children                  school                    0.00167
college                   school                    0.00163
ball                      game          

An important limitation of **mutual information**:  These words have been
discovered because the occurrence of one of them in a sentence increases the likelihood of its partner occurring in a sentence.  So they're here because they occurred in the **same** sentence often,
not because they occurred in **similar** sentences often.  We will return to this issue and offer a solution when we consider embedding models of word meaning.

You can run the code to find the correct value for `triggers`, or you can just use the value for `triggers` set in the next cell:

In [25]:
triggers = {'action',
 'aid',
 'amount',
 'analysis',
 'answer',
 'areas',
 'attention',
 'bed',
 'blood',
 'board',
 'boy',
 'boys',
 'business',
 'car',
 'case',
 'cent',
 'child',
 'college',
 'community',
 'company',
 'corner',
 'cost',
 'countries',
 'couple',
 'court',
 'day',
 'defense',
 'distance',
 'door',
 'effect',
 'equipment',
 'extent',
 'face',
 'fact',
 'factors',
 'faith',
 'floor',
 'forces',
 'form',
 'forms',
 'freedom',
 'friends',
 'front',
 'function',
 'future',
 'game',
 'government',
 'growth',
 'hair',
 'hands',
 'head',
 'heart',
 'history',
 'home',
 'hours',
 'husband',
 'ideas',
 'image',
 'increase',
 'industry',
 'influence',
 'issue',
 'labor',
 'land',
 'language',
 'law',
 'leaders',
 'length',
 'letter',
 'level',
 'life',
 'line',
 'literature',
 'man',
 'market',
 'material',
 'meaning',
 'means',
 'meeting',
 'member',
 'members',
 'method',
 'mind',
 'money',
 'month',
 'months',
 'morning',
 'mother',
 'mouth',
 'nations',
 'number',
 'numbers',
 'others',
 'paper',
 'parts',
 'party',
 'persons',
 'piece',
 'plane',
 'point',
 'policy',
 'pool',
 'population',
 'pressure',
 'principle',
 'problem',
 'production',
 'program',
 'programs',
 'progress',
 'property',
 'range',
 'reaction',
 'religion',
 'research',
 'respect',
 'sales',
 'school',
 'schools',
 'science',
 'season',
 'situation',
 'society',
 'son',
 'sound',
 'spirit',
 'state',
 'statement',
 'street',
 'student',
 'summer',
 'sun',
 'surface',
 'systems',
 'tax',
 'temperature',
 'terms',
 'thing',
 'town',
 'treatment',
 'truth',
 'value',
 'values',
 'war',
 'water',
 'way',
 'ways',
 'women',
 'world',
 'years'}

This should evaluate to 150 to get the results I want you to reproduce:

In [290]:
len(triggers)

150

## Make the Korpus

To make a corpus you will call the function `prepare_korpus`, defined and called in the next few code cells.

The first thing it will do is convert the filtered corpus into a set of (history, predicted word)
pairs.  This is done in the function `make_class_sents`, which takes as its arguments
the filtered corpus and the set of class words (`class_set`),
which are all frequent words
selected to guarantee there would be enough examples of each predicted word to make reasonable training possible, even in this small dataset.

The function `make_class_sents` returns a dictionary `class_sents` which has the following structure:

```python
class_wd |-> histories
```

where each `class_wd` is a word to be predicted and `histories` is a list
of (filtered) histories for which `class_wd` is the next word.

Here's an example of the contents:

In [26]:
class_sents = make_class_sents(f_sents,class_set)
# Histories followed by the word "time"
class_sents["time"][:20]

[['law'],
 ['police', 'trial'],
 ['today', 'business'],
 ['administration', 'policy'],
 ['city'],
 ['night', 'study', 'changes'],
 ['group', 'mind'],
 ['sales', 'state', 'tax'],
 ['right'],
 ['scene'],
 ['defense'],
 ['game'],
 ['sun'],
 ['points', 'years', 'school'],
 ['boy'],
 ['efforts'],
 ['party'],
 ['home'],
 ['evening'],
 ['market', 'years']]

If we pool the histories associated with all the target words, there are 16,026 histories.  That's how many histories we will train on.

In [458]:
sum(len(sents) for sents in class_sents.values())

16026

The `class_sents` dictionary is then used to create `korpus`, the array representation of all 16,026
histories, and `Y`, the corresponding 16,026 words to predict:

In [27]:
def prepare_korpus (f_sents, class_set, active_triggers):
    # Create class_sents here because this code destructively modifies the history lists.
    # class_sents is a dictionary:  class_wd |-> histories
    # where each class_wd is a word to be predicted and histories is a list of
    # (filtered) histories for which class_wd is the next word.
    class_sents = make_class_sents(f_sents,class_set)
    #active_triggers = triggers
    #active_triggers = set()
    korpus0 = []
    final_vocab0 = set()
    for cls_wd,sents in class_sents.items():
        for sent in sents:
            sent[-1] = sent[-1] + "_b"
            final_vocab0.add(sent[-1])
            korpus0.append((sent,cls_wd))

    final_vocab = final_vocab0 | active_triggers

    #########   FINAL  PASS: Vectorize; Create korpus and Y   ########################
    final_sample_sz, final_V = (len(korpus0),len(final_vocab))
    korpus = np.zeros((final_sample_sz, final_V))
    #final_dim = final_V + len(active_triggers)
    final_encoder = {wd:i for (i,wd) in enumerate(final_vocab)}

    trigger_ct = 0
    Y = []    #np.array((final_sample_sz,))
    for (i,(sent,cls_wd)) in enumerate(korpus0):
        bigram_wd = sent[-1]
        #print(bigram_wd,final_encoder[bigram_wd])
        korpus[i,final_encoder[bigram_wd]] += 1
        these_triggers = active_triggers.intersection(sent)
        for trig in these_triggers:
            trigger_ct += 1
            korpus[i,final_encoder[trig]] += 1
        Y.append(cls_wd)
    Y=np.array(Y)
    ##########  END FINAL  PASS  #######################
    print(f"Kropus created:: korpus shape: {korpus.shape}  Y shape: {Y.shape} triggers used: {trigger_ct}")
    return korpus, korpus0, Y
    

In [28]:
# Make the corpus.  Must execute this cell.
korpus, korpus0, Y = prepare_korpus (f_sents, class_set, triggers)

Kropus created:: korpus shape: (16026, 508)  Y shape: (16026,) triggers used: 5511


##  Number of features for the base model (with triggers)

The number of features for this model is 508.

The `prepare_korpus` function returns three things:

1. `korpus`:  16,026 histories from the Brown corpus encoded as a 16,026x508 array, where 508 is the number of dimensions in a history vector, the encoded representation of a history.
2. `Y`:  the 16,026 target words for those history vectors. These are the words our language model will try to predict. A target word is always a word that occurred later in the same sentence as the words in its history in the original Brown corpus.
3. `korpus0`: a sequence of (history, target) pairs represented as words.

The histories in `korpus0` differ from the histories in `class_sents` only in that they're in a flat list
and the last words have been modified to have "_b" on them.  This is because one of the best trigger words for any word is itself:  once a content word with unigram probability $p$ occurs in any document, the likelihood of its occurring in the rest of the document is higher than $p$. This property is sometimes called **burstiness**.
To allow for the possibility of the same word occurring in
the history both as the last word (the bigram prefix) and earlier (as a trigger), trigger words
and words in final position in a history have different features.  `korpus0` was a convenience
in building the training data, but will play no role in training.  It does, however,
help when we have questions about how a particular history gave rise to its feature representation in
`korpus`.
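
The feature scheme described above can be sketched as a function from a history to its set of active feature names. The name `encode_history` and the two-word trigger set are assumptions for illustration; `prepare_korpus` itself builds numeric rows rather than sets of names.

```python
def encode_history(history, triggers):
    """Return the set of active feature names for one history (a list of words)."""
    *earlier, last = history
    # Bigram feature: the immediately preceding word, marked with "_b".
    features = {last + "_b"}
    # Trigger features: known trigger words seen earlier in the history.
    features |= triggers.intersection(earlier)
    return features

toy_trigger_set = {"state", "tax"}   # hypothetical, much smaller than the real set
feats = encode_history(["tax", "state", "tax"], toy_trigger_set)
print(feats)
print(encode_history(["food", "family"], toy_trigger_set))
```

In the first history "tax" is active twice, once as the bigram feature `tax_b` and once as a trigger feature `tax`, which is exactly the case the "_b" suffix is meant to allow.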

In [29]:
korpus0[12]

(['food', 'family_b'], 'place')

Here is part of a row from `korpus`.

It is sparse (mostly 0s), although it is stored as a dense NumPy array.

In [30]:
korpus[12,:25]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0.])

But each row will have at least one non-zero value in there because the bigram word (with a "_b" at
the end) will always be an encoded feature of the history.

In [277]:
korpus[12].sum()

1.0

Consider how the word sequence in `korpus0[12]` is related to the 1D array `korpus[12]`.

In [460]:
korpus0[12]

(['food', 'family_b'], 'place')

There is only one active feature in row 12, because "food" is not a trigger word. Since it's not in bigram position
and it's not a trigger, we ignore it.

In [278]:
"food"  in triggers

False

Let's find a more interesting example.  The row with the greatest number of active features:

In [279]:
korpus.sum(axis=1).argmax()

1484

When there is no trigger word, the history for the
word prediction has only one nonzero feature, the feature
for the immediately preceding word (the bigram feature).
When there is also a single trigger word, there are two nonzero
features.  History `1484` has 6 trigger words in addition to the bigram word.

In [280]:
korpus[1484].sum()

7.0

The history and target word (word-to-predict, or class) of sample 1484:

In [281]:
korpus0[1484]

(['case',
  'pressure',
  'principle',
  'use',
  'means',
  'years',
  'state',
  'state',
  'program_b'],
 'man')

In [463]:
ts_1484 = triggers.intersection(korpus0[1484][0])
ts_1484

{'case', 'means', 'pressure', 'principle', 'state', 'years'}
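
The row sum of 7 for this history can be checked by hand without `korpus`. In this sketch `trigger_subset` is a hand-picked subset of `triggers`, chosen only to keep the example short; the history is copied from the output above.

```python
history_1484 = ['case', 'pressure', 'principle', 'use', 'means',
                'years', 'state', 'state', 'program_b']

# A small subset of the real trigger set, enough to cover this history.
trigger_subset = {'case', 'means', 'pressure', 'principle',
                  'state', 'years', 'program'}

# Set intersection counts each trigger once, so the repeated 'state' adds 1, not 2.
trigger_hits = trigger_subset.intersection(history_1484)

# One bigram feature ('program_b') plus one feature per distinct trigger found.
active_features = 1 + len(trigger_hits)
print(sorted(trigger_hits), active_features)
```

Although "program" is a trigger word, it does not count as a trigger here: in this history it appears only in bigram position, as `program_b`.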

Alas none of these trigger words has much of an association  with the given target word:

In [484]:
#  Attn:  Re-evaluate this cell only if you have computed the MI scores in this notebook session.
man_vec = mis[f_encoder["man"]]
for t in ts_1484:
    print(f"{t} {man_vec[f_encoder[t]]}")

state 0.0
pressure 0.0
years 0.0
principle 5.23414303311813e-06
means 0.0
case 0.00013002590661195888


In [None]:
sents = [[w.lower() for (w,t) in sent] for sent in tagged_sents]

Here's the original sentence from the corpus.  The target word `man` is the last word.

In [470]:
print(" ".join([w for s in sents for w in s if len(ts_1484.intersection(s))==6]))

in any case , anyone who fails to make significant distinction between primary and secondary applications of economic pressure would in principle already have justified that use of economic boycott as a means which broke out a few years ago or was skillfully organized by white citizens' councils in the entire state of mississippi against every local philco dealer in that state , in protest against a philco-sponsored program over a national tv network on which was presented a drama showing , it seemed , a `` high yellow gal '' smooching with a white man .


The average row sum is approximately 1.34,
which means there are a significant number of trigger words in `korpus`.

In [284]:
Ss = korpus.sum(axis=1)
Ss.mean()

1.3438786971171846

##  Training the Logistic Regression classifier (with triggers)

This takes a little time, see the wall time printouts below from my Mac.  Your mileage may vary.

In [34]:
from sklearn.linear_model import LogisticRegression
import time
    
#     LogisticRegression(penalty='deprecated', 
#                        C=1.0, l1_ratio=0.0, dual=False, 
#                         tol=0.0001, fit_intercept=True, 
#                         intercept_scaling=1, class_weight=None, 
#                         random_state=None, solver='lbfgs', 
#                         max_iter=100, verbose=0, warm_start=False, n_jobs=None)

# We want solver='saga' ('sag' will also work; both are good for large datasets)
# and l1_ratio = 0.  It's also a good multiclass algorithm.
lrc = LogisticRegression(solver="saga")
print(time.ctime())
lrc.fit(korpus,Y)
print(time.ctime())

Tue Feb  3 23:07:24 2026
Tue Feb  3 23:08:31 2026


##  Training the Logistic Regression classifier (without triggers: bigrams only model)

In [39]:
big_korpus, big_korpus0, Y = prepare_korpus (f_sents, class_set, set())

Kropus created:: korpus shape: (16026, 358)  Y shape: (16026,) triggers used: 0


####  The number of features for the bigrams model

The number of features for this model is 358.

In [40]:
# Training the no-trigger words (or bigram) model
big_lrc = LogisticRegression(solver="saga")
print(time.ctime())
big_lrc.fit(big_korpus,Y)
print(time.ctime())

Tue Feb  3 23:14:56 2026
Tue Feb  3 23:15:21 2026


##  Perplexity computation

Your task is to compute the perplexity that the trained language model `lrc` assigns to the given corpus (or `korpus`), using the formula for perplexity given in the ngram slides.  For the questions
on this assignment you will find the `lrc` methods `.predict()`, `.predict_proba()` and `.predict_log_proba()`
useful.
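
For reference, and assuming the slides use the standard definition, the perplexity of a model over $N$ target words $w_1, \ldots, w_N$ with histories $h_1, \ldots, h_N$ can be written in two equivalent ways. The symbols $w_i$, $h_i$ and $PP$ are notation introduced here, not taken from the slides.

$$
PP = \left( \prod_{i=1}^{N} P(w_i \mid h_i) \right)^{-1/N}
   = 2^{\,-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid h_i)}
$$

The second form is the one that is convenient to compute, since summing logs avoids multiplying many small probabilities together.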


In [43]:
def get_perplexity(lrc,korpus,Y):
    # probs[i,j] is the probability the model assigns to class j for history i
    probs = lrc.predict_proba(korpus)
    nsamples = korpus.shape[0]
    row_selection = list(range(nsamples))
    #  Col selection
    # Y is a sequence of words.  We need to convert that
    # to a sequence of their indices.
    class_encoder = {w:i for (i,w) in enumerate(lrc.classes_)}
    Y_idxs  = np.array([class_encoder[y] for y in Y])
    # End col selection
    probs_for_target_words = probs[row_selection,Y_idxs]
    return probs_to_perplexity (probs_for_target_words)

def probs_to_perplexity (probs):
    """
    probs is an array of probabilities.

    Let H = -(1/N) sum log_{2} probs
    [the negative log of the Nth root of the product]
    return 2**H
    (map back off the log scale)
    """
    nsamples = probs.shape[0]
    #log_two_probs = np.log(probs)/np.log(2)
    log_two_probs = np.log2(probs)
    return 2**((-log_two_probs.sum())/nsamples)

def perplexity_to_prob (perp):
    """
    perp is a perplexity number (an average branching score)

    Convert it to the corresponding (geometric mean) probability.
    Since probs_to_perplexity returns 2**H, the inverse is 1/perp = 2**(-H).
    """
    return 1/perp
    

Get the perplexity for the trigger word model:

In [36]:
# Cross-entropy in bits (log2 of the perplexity):
# with triggers: 5.217463600316301
# w/o triggers:  5.464946456044086
# w no info: log2(91) = 6.507794640198696
get_perplexity(lrc,korpus,Y)

np.float64(37.206003735660275)

In [44]:
import numpy as np
probs_to_perplexity(np.array([1/32]))

np.float64(32.0)
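
A small worked example may help to check your understanding of the computation. This is a standard-library sketch under the usual definition of perplexity; the function name `perplexity` and the toy probabilities are assumptions for illustration.

```python
from math import log2

def perplexity(probs):
    """2 ** (average negative log2 probability) of the correct targets."""
    avg_neg_log2 = -sum(log2(p) for p in probs) / len(probs)
    return 2 ** avg_neg_log2

# log2 values are -1 and -3, so the average is -2 and the perplexity is 2**2.
pp_mixed = perplexity([1/2, 1/8])

# Every target gets probability 1/32, so the perplexity is 32 however many targets there are.
pp_uniform = perplexity([1/32] * 10)

print(pp_mixed, pp_uniform)
```

This is why a single-element array is enough to get the perplexity of a model that gives every target the same probability: the average of identical log probabilities does not depend on how many there are.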

### Perplexity for the bigrams only model

Get the perplexity for the model with no triggers (the bigram model).

In [41]:
get_perplexity(big_lrc,big_korpus,Y)

np.float64(44.168512432090985)

### Perplexity for uniform probability model

Get the perplexity for the model that assigns equal probability to each of the 91 target words:

In [49]:
num_targets = len(set(Y))
num_targets

91

In [50]:
probs_to_perplexity (np.array([1/num_targets]))

np.float64(91.0)

In [51]:
probs_to_perplexity (np.array([1/91]))

np.float64(91.0)

###  Maximum precision word

In [141]:
from sklearn.metrics import precision_score,recall_score

The predicted target words compared to the actual target words using `sklearn.metrics.precision_score`.

In [479]:
predictions = lrc.predict(korpus)
# True labels come first, predicted labels second
lrc.classes_[precision_score(Y,predictions,average=None,labels=lrc.classes_,zero_division=0).argmax()]

'wife'

In [480]:
"time" in predictions

True

In [242]:
lrc.classes_[recall_score(Y,predictions,average=None,labels=lrc.classes_,zero_division=0).argmax()]

'time'


## Example of using classifiers in scikit learn

Sprinkled liberally with hints for the assignment.

In [342]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

### The data

In [319]:
newsgroups = fetch_20newsgroups()
newsgroups['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [320]:
# We want a multiclass problem, so pick three of the 20 categories
categories = ['alt.atheism', 'sci.space','comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train',
                                     categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)

In [321]:
print(newsgroups_train.data[0])

From: degroff@netcom.com (21012d)
Subject: Re: Venus Lander for Venus Conditions.
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Lines: 8


  I doubt there are good prospects for  a self armoring system
for venus surface conditions (several hundred degrees, very high
pressure of CO2, possibly sulfuric and nitric acids or oxides
but it is a notion to consider for outer planets rs where you might
pick up ices under less extream upper atmosphere conditions buying
deeper penetration.  A nice creative idea, unlikly but worthy of
thinking about.



In [323]:
newsgroups_train['target_names']

['alt.atheism', 'comp.graphics', 'sci.space']

The class labels we will pass to the classifier in training are integers.
The integers are aligned with the class names in the order given in `target_names`.
Therefore we can set up a simple decoder array that maps from class indices to class names:

In [453]:
first_class = newsgroups_train.target[0]
print(first_class)
decoder = np.array(newsgroups_train.target_names)
decoder[2]

2


'sci.space'

###  Training

Train a logistic regression classifier on this multiclass problem.
Also test it:

In [450]:
# Not usually prize-winning with language data
# vectorizer = CountVectorizer()

##########  Mapping from a sequence of texts to a feature representation of the data
vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
# (1657, 29663)
# 1,657 documents. 29663 features.  Why so many features?  That's how many
# distinct vocab items cropped up in these 1,657 documents.  We use words
# as features in our language model as well but there are way fewer
# features because our training data consists of filtered document texts and therefore a filtered vocab.
print(vectors_train.shape)
########## End of feature  mapping #####################################

clf = LogisticRegression(solver="saga")
# targets are [0,1,2]  aligned with newsgroups_train.target_names
clf.fit(vectors_train, newsgroups_train.target)
#  The usual thing we do with trained classifiers
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average="micro")
#  f1 score average = "micro"  0.9407548825982005

(1657, 29663)


0.941016333938294

##  Predicting probabilities

For this assignment we're more interested in having the classifier produce probabilities:

We classify a fresh example using `predict_proba`.

Notice there are three probabilities.  That's because there are three classes:

In [445]:
space_text = ["Space is the final frontier."]
probs = clf.predict_proba(vectorizer.transform(space_text))
probs

array([[0.0606451 , 0.17867197, 0.76068294]])

In [378]:
probs.sum()

1.0

Class with the highest prob:

In [379]:
probs.argmax()

2

In [362]:
decoder = np.array(newsgroups_train.target_names)
decoder[probs.argmax()]

'sci.space'

Which is the same answer I could have gotten through `predict`:

In [448]:
decoder[clf.predict(vectorizer.transform(space_text))[0]]

'sci.space'

Note  that in our language modeling example you didn't need to call `vectorizer.transform`.
I wrote `prepare_korpus` to do that job, because I wanted some custom "vectorizing".
But the output was still a 2D array with the same number of rows as there were histories
to classify and the same number of columns as there were features to classify with.

In [375]:
# For a sequence of n inputs predict_proba will produce a  nx3 array.  Here n=2
texts = ["Space is the final frontier.", "Rasters are common in computer images."]
probs = clf.predict_proba(vectorizer.transform(texts))
print("Probs shape: ", probs.shape)
cls_names = newsgroups_train.target_names
cls_idxs = probs.argmax(axis=1)
print("Predicted class indices: ",cls_idxs)
decoder = np.array(cls_names)
decoder[cls_idxs]

Probs shape:  (2, 3)
Predicted class indices:  [2 1]


array(['sci.space', 'comp.graphics'], dtype='<U13')

##  Retrieving the predicted probabilities for a sequence of classes

Hint:  This discussion relates to computing the perplexity of the data:

Now suppose in the interest of finding the **hard** examples in the test set,  I want to know the probability 
my trained classifier assigns to the **correct** class for each example.

In [387]:
pred_probs = clf.predict_proba(vectors_test)

In [388]:
pred_probs.shape

(1102, 3)

I need to retrieve a different column index from each row of `pred_probs`, as dictated by the correct classes
for the test data (`newsgroups_test.target`).

In [384]:
newsgroups_test.target

array([2, 1, 1, ..., 0, 1, 2])

This can be done via **fancy indexing** of the probs array. For a 1D array we pass a list containing
a sequence of the indices we want. 

In [432]:
a = np.arange(4,62,3)
print(a)
print(a.shape)
a[[2,8,11,15]]

[ 4  7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61]
(20,)


array([10, 28, 37, 49])

For a 2D array, we pass two sequences, one for the row indices we want, the other for the column indices:

In [433]:
aa = a.reshape((5,4))
print(aa)
# The second element retrieved is aa[2,3]
aa[[1,2,4],[2,3,1]]

[[ 4  7 10 13]
 [16 19 22 25]
 [28 31 34 37]
 [40 43 46 49]
 [52 55 58 61]]


array([22, 37, 55])

Back to our original problem.  We want an array consisting of one element from each row,
the element corresponding to the correct class for that row:

In [454]:
num_rows = len(newsgroups_test.target)
all_rows_idxs = list(range(num_rows))
column_idxs = list(newsgroups_test.target)

prediction_probs_for_correct_classes = pred_probs[all_rows_idxs,column_idxs]
prediction_probs_for_correct_classes

array([0.88644291, 0.91605077, 0.68185055, ..., 0.90199975, 0.68808646,
       0.8170256 ])
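
If the paired fancy indexing feels opaque, the same selection can be written as a plain loop over rows. This is a sketch on made-up numbers; `toy_probs` and `toy_targets` are hypothetical stand-ins for `pred_probs` and `newsgroups_test.target`.

```python
# Three examples, three classes; each row sums to 1.
toy_probs = [
    [0.1, 0.2, 0.7],
    [0.3, 0.6, 0.1],
    [0.8, 0.1, 0.1],
]
toy_targets = [2, 0, 0]   # the correct class index for each row

# From row i, take the column named by toy_targets[i].
correct_probs = [row[c] for row, c in zip(toy_probs, toy_targets)]

# Index of the example whose correct class got the lowest probability.
hardest = min(range(len(correct_probs)), key=correct_probs.__getitem__)
print(correct_probs, hardest)
```

The second example is the hard one: the classifier put most of its probability on class 1, while the correct class 0 received only 0.3.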

To find the lowest prob assigned to a correct class on the test set we first find its index:

In [396]:
example_idx = prediction_probs_for_correct_classes.argmin()
example_idx 

548

Here's the probability assigned to the correct class:

In [455]:
prediction_probs_for_correct_classes[example_idx]

0.1416095974675527

And here is that example:

In [456]:
doc548 = newsgroups_test.data[548]
print(doc548)

From: gkm@wampyr.cc.uow.edu.au (Glen K Moore)
Subject: Fax/email wanted for Louis Friedman/Planetary Society
Organization: University of Wollongong, NSW, Australia.
Lines: 7
NNTP-Posting-Host: wampyr.cc.uow.edu.au
Summary: Want to obtain fax/email address for Planetary Society
Keywords: Planetary Friedman

If available please send to
Glen Moore
Director
Science Centre
Wollongong, Australia
fax: 61 42 213151   email: gkm@cc.uow.edu.au




Confirming the probs:

In [399]:
probs = clf.predict_proba(vectorizer.transform([doc548]))
probs

array([[0.06956561, 0.78882479, 0.1416096 ]])

##  Precision by class

Here `clf` is the classifier trained above. Note the use of `average=None`.  This gets the class
by class precision results:

In [419]:
pred = clf.predict(vectors_test)
metrics.precision_score(newsgroups_test.target, pred, average=None)

array([0.98006645, 0.90510949, 0.95128205])

Note the order of the arguments matters for precision.  Swapping predicted and true labelings changes the scores.

In [412]:
metrics.precision_score(pred,newsgroups_test.target, average=None)

array([0.92476489, 0.9562982 , 0.94162437])
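
To see why the order matters, here is a hand-rolled per-class precision on toy labels. The function `per_class_precision` is an illustrative assumption, not the scikit-learn implementation, though it follows the same definition: of the items predicted as a class, the fraction that truly belong to it.

```python
def per_class_precision(y_true, y_pred, labels):
    """Of the items predicted as each label, the fraction that truly have that label."""
    scores = []
    for c in labels:
        # True labels of all the items that were predicted to be class c.
        predicted_c = [t for (t, p) in zip(y_true, y_pred) if p == c]
        hits = sum(1 for t in predicted_c if t == c)
        scores.append(hits / len(predicted_c) if predicted_c else 0.0)
    return scores

toy_true = [0, 0, 0, 1]
toy_pred = [0, 1, 1, 1]

right_order = per_class_precision(toy_true, toy_pred, labels=[0, 1])
swapped = per_class_precision(toy_pred, toy_true, labels=[0, 1])
print(right_order, swapped)
```

With the arguments swapped, the denominator becomes the number of items that truly belong to each class, so the swapped call is really computing per-class recall.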

The first order given (`y_true`, then `y_pred`) is correct, as the documentation shows (`f1_score` takes its arguments in the same order as `precision_score`):

In [409]:
print(metrics.f1_score.__doc__)

Compute the F1 score, also known as balanced F-score or F-measure.

    The F1 score can be interpreted as a harmonic mean of the precision and
    recall, where an F1 score reaches its best value at 1 and worst score at 0.
    The relative contribution of precision and recall to the F1 score are
    equal. The formula for the F1 score is::

        F1 = 2 * (precision * recall) / (precision + recall)

    In the multi-class and multi-label case, this is the average of
    the F1 score of each class with weighting depending on the ``average``
    parameter.

    Read more in the :ref:`User Guide <precision_recall_f_measure_metrics>`.

    Parameters
    ----------
    y_true : 1d array-like, or label indicator array / sparse matrix
        Ground truth (correct) target values.

    y_pred : 1d array-like, or label indicator array / sparse matrix
        Estimated targets as returned by a classifier.

    labels : array-like, default=None
        The set of labels to include when ``aver