Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now, you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [12]:
NAME = "Aymane Hachcham" 

---

# Setup our working context and load the data

In this assignment, we will work with a database of inaugural speeches of US presidents.

In [13]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, scipy.sparse
import gzip, json
inaugural = json.load(gzip.open("/data/datasets/inaugural.json.gz","rt"))
labels = [t[0] for t in inaugural]
speeches = [t[1] for t in inaugural]

In [14]:
# Print Speeches:
speeches[0]

'Fellow citizens, in the presence of this vast assemblage of my countrymen I am about to supplement and seal by the oath which I shall take the manifestation of the will of a great and free people. In the exercise of their power and right of self-government they have committed to one of their fellow-citizens a supreme and sacred trust, and he here consecrates himself to their service.\n\nThis impressive ceremony adds little to the solemn sense of responsibility with which I contemplate the duty I owe to all the people of the land. Nothing can relieve me from anxiety lest by any act of mine their interests may suffer, and nothing is needed to strengthen my resolution to engage every faculty and effort in the promotion of their welfare.\n\nAmid the din of party strife the people\'s choice was made, but its attendant circumstances have demonstrated anew the strength and safety of a government by the people. In each succeeding year it more clearly appears that our democratic principle need

# Build a Sparse Document-Term Matrix

Build a document-term matrix for the inaugural speeches.

Use sparse data structures, a minimum document frequency of 5, remove english stopwords.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer # Please use this
vocab = None # Your vocabulary
dtm = None # Your sparse document term matrix

input_text = ['Fellow citizens, in the presence of this vast assemblage of my countrymen I am about to supplement and seal by the oath which I shall take the manifestation of the will of a great and free people.']

vectorizer = CountVectorizer(min_df=1, stop_words='english')
# dtm = vectorizer.fit_transform(input_text)
vocab = vectorizer.get_feature_names_out()
    
print(vocab)
print(dtm.toarray())
print("Document term matrix has shape", dtm.shape)

NotFittedError: Vocabulary not fitted or provided

In [16]:
assert dtm.shape[0] == len(speeches), "Wrong number of speeches"
assert dtm.shape[1] > 2000, "You have too few words"
assert dtm.shape[1] < 3000, "You have too many words"
assert "president" in vocab, "You lost the president"
assert not "President" in vocab, "Please lowercase"
assert not "the" in vocab, "You did not remove stopwords"
assert isinstance(dtm, scipy.sparse.csr_matrix), "Generate a sparse matrix to conserve memory!"
assert dtm.sum(axis=0).min() == 5, "Minimum document frequency not OK"

In [5]:
# Additional tests

In [6]:
# Pretty display the data with pandas:
pd.DataFrame.sparse.from_spmatrix(dtm,index=labels,columns=vocab).head()

Unnamed: 0,000,abandon,abandoned,abiding,abilities,ability,able,aboriginal,abroad,absence,...,written,wrong,year,years,yes,yield,young,zeal,zealous,zealously
1885-Cleveland,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
1969-Nixon,0,0,0,0,0,0,0,0,1,0,...,0,0,1,5,0,0,0,0,0,0
1941-Roosevelt,0,0,0,0,0,0,0,0,0,0,...,2,0,1,6,0,0,0,0,0,0
1937-Roosevelt,0,1,0,1,0,1,0,0,0,1,...,0,0,1,4,0,0,0,0,0,0
1965-Johnson,0,1,0,1,0,0,0,0,0,0,...,1,0,0,3,0,0,1,0,0,0


# Most Frequent Words for Each Speech

Compute the most frequent word (except for the stopwords already removed) for each speech.

In [7]:
# Build a dictionary speech label to most frequent word
most_frequent = dict()

from collections import Counter
import re

_sentences = re.compile(r'[.!\n]\s', re.U) 
_words = re.compile(r"[\w']+", re.U)

# Splitting the speech documents into sentences and words:
dmt_frame = pd.DataFrame.sparse.from_spmatrix(dtm,index=labels,columns=vocab)
count_per_doc = dmt_frame.T.sort_values(by='1885-Cleveland', ascending=True)

for label_speech in inaugural:
#     print(label_speech[0])
#     print(count_per_doc[label_speech[0]])
#     print(count_per_doc.index[ count_per_doc[label_speech[0]] == max(count_per_doc[label_speech[0]]) ].tolist())
    vals = count_per_doc.index[ count_per_doc[label_speech[0]] == max(count_per_doc[label_speech[0]]) ]
    most_frequent[label_speech[0]] = vals.tolist()[0]
    

print(most_frequent)

{'1885-Cleveland': 'people', '1969-Nixon': 'people', '1941-Roosevelt': 'nation', '1937-Roosevelt': 'government', '1965-Johnson': 'change', '2001-Bush': 'america', '1881-Garfield': 'government', '1801-Jefferson': 'government', '1985-Reagan': 'government', '1789-Washington': 'government', '1829-Jackson': 'public', '1869-Grant': 'country', '1997-Clinton': 'new', '2013-Obama': 'people', '1893-Cleveland': 'people', '1913-Wilson': 'great', '1949-Truman': 'world', '1861-Lincoln': 'constitution', '2005-Bush': 'freedom', '1953-Eisenhower': 'free', '1805-Jefferson': 'public', '1945-Roosevelt': 'shall', '1817-Monroe': 'government', '1933-Roosevelt': 'national', '1901-McKinley': 'government', '1853-Pierce': 'power', '1977-Carter': 'nation', '1961-Kennedy': 'let', '1921-Harding': 'world', '1821-Monroe': 'great', '1809-Madison': 'nations', '1897-McKinley': 'people', '1833-Jackson': 'government', '1973-Nixon': 'let', '2017-Trump': 'america', '1929-Hoover': 'government', '1793-Washington': 'shall', '1

In [8]:
assert len(most_frequent) == len(labels), "You are missing some speeches"
assert set(most_frequent.keys()) == set(labels), "You are missing some speeches"
assert not "the" in most_frequent.values(), "Stopwords not removed"
assert "america" in most_frequent.values(), "Someone talked about america"

# TF-IDF

From the document-term matrix, compute the TF-IDF matrix. Implement the standard version of TF-IDF (`ltc`).

Be careful with 0 values, ensure that your matrix remains *sparse*. Do *not* rely on Wikipedia, it has errors.

Perform the transformation in three steps, named `tf`, `idf`, `tfidf`. First implement term frequency.

In [17]:
import numpy as np
def tf(dtm):
    """Compute the "l" step of standard TF-IDF"""
#     # HINT: use dtm.astype(np.float32) to get a *sparse floating point copy* of the dtm matrix.
    
    tf_matrix = dtm.astype(np.float32)
    tf_matrix.data = 1 + np.log(tf_matrix.data)
    
    return tf_matrix
    
    
print("Old sum:", dtm.sum(), "new sum:", tf(dtm).sum(), "(must be less and float)")

Old sum: 44859 new sum: 33943.453 (must be less and float)


In [64]:
# Inspect your matrix
pd.DataFrame.sparse.from_spmatrix(tf(dtm),index=labels,columns=vocab).head()

Unnamed: 0,000,abandon,abandoned,abiding,abilities,ability,able,aboriginal,abroad,absence,...,written,wrong,year,years,yes,yield,young,zeal,zealous,zealously
1885-Cleveland,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,2.609438,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.693147,0.0,1.0,2.791759,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,2.386294,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,2.098612,0.0,0.0,1.0,0.0,0.0,0.0


In [65]:
matrix = tf(dtm).toarray()
matrix[matrix > 0]

array([1.       , 1.       , 1.       , ..., 1.6931472, 1.       ,
       1.       ], dtype=float32)

In [66]:
# Automatic tests
_tf = tf(dtm)
assert _tf.sum() < dtm.sum(), "Weight sum has not decreased."
assert (_tf > 0).sum() == (dtm > 0).sum(), "Number of zeros must not change."
assert (_tf[_tf > 0].min()) == 1, "Scaling incorrect."
assert _tf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."
assert isinstance(_tf, scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
# It is not allowed to only restore sparsity if it was lost.
from unittest.mock import patch
with patch('scipy.sparse.csr.csr_matrix') as mock_csr: tf(dtm)
mock_csr.assert_not_called()
del _tf

Implement the `idf` function.

In [18]:
def idf(dtm):
    """ Compute the "t" step inverse document frequency """
    idf_matrix = np.log(dtm.shape[0] / (dtm.getnnz(0) + 1))
    return idf_matrix

In [20]:
b=(np.ones((dtm.shape[0],)) @ dtm)
print(np.log(dtm.shape[0] / b))
print(idf(dtm).shape)

[1.42138568 1.86321843 2.4510051  ... 1.57553636 2.4510051  2.26868354]
(2158,)


In [68]:
# Automatic tests
_idf = idf(dtm)
assert len(_idf.shape) == 1, "Result must be one-dimensional."
assert _idf.flatten().shape[0] == dtm.shape[1], "The IDF dimension is not okay."
assert _idf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."
assert not isinstance(_idf, scipy.sparse.csr_matrix), "IDF must not be sparse."
assert _idf.min() > 0, "IDF must not be zero here."
assert _idf.max() < 3, "Too large idf values."
from unittest.mock import patch
with patch('scipy.sparse.csr_matrix.todense') as mock_todense, patch('scipy.sparse.csr_matrix.toarray') as mock_toarray:
    type(idf(dtm))
mock_todense.assert_not_called()
mock_toarray.assert_not_called()
del _idf

Now implement the full `tfidf` function, using above implementations of `df` and `idf`.

Hint: you may find `scipy.sparse.spdiags` useful to keep the computations *sparse*.

You are **not allowed** to use sklearns `TfidfVectorizer`!

In [21]:
def tfidf(dtm):
    """Finish the computation of standard TF-IDF with the c step"""
    _tf, _idf = tf(dtm), idf(dtm) # Must use above functions.
    # YOUR CODE HERE
    
    tf_idf = np.zeros(_tf.shape)
    for row_index in range(0,_tf.shape[0]):
        for col_index in range(0,_tf.shape[1]):
            tf_idf[row_index, col_index] = _tf[row_index, col_index] * _idf[row_index]
    
    tf_idf = scipy.sparse.csr_matrix(tf_idf)
    _norm = 1 / scipy.sparse.linalg.norm(tf_idf, ord=2, axis=1)
    
    for row_index in range(0,_tf.shape[0]):
        tf_idf[row_index] = tf_idf[row_index] * _norm[row_index]
    
    
    return tf_idf

In [70]:
# Inspect your matrix
pd.DataFrame.sparse.from_spmatrix(tfidf(dtm),index=labels,columns=vocab).head()

Unnamed: 0,000,abandon,abandoned,abiding,abilities,ability,able,aboriginal,abroad,absence,...,written,wrong,year,years,yes,yield,young,zeal,zealous,zealously
1885-Cleveland,0.0,0.037491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.037491,0.0,0.0,0.0,0.0,0.037491,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034816,0.0,...,0.0,0.0,0.034816,0.090851,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.074865,0.0,0.044217,0.123442,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,0.036023,0.0,0.036023,0.0,0.036023,0.0,0.0,0.0,0.036023,...,0.0,0.0,0.036023,0.085963,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,0.043708,0.0,0.043708,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043708,0.0,0.0,0.091725,0.0,0.0,0.043708,0.0,0.0,0.0


In [71]:
_tfidf = tfidf(dtm)
assert _tfidf.sum() < dtm.sum(), "Weight sum has not decreased."
assert (_tfidf > 0).sum() == (dtm > 0).sum(), "Number of zeros must not change."
assert abs(_tfidf.power(2).sum() - _tfidf.shape[0]) < 1e-5, "Vectors are not 'c'."
assert _tfidf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."

In [72]:
from unittest.mock import patch
with patch('scipy.sparse.csr_matrix.todense') as mock_todense, patch('scipy.sparse.csr_matrix.toarray') as mock_toarray:
    assert isinstance(tfidf(dtm), scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
mock_todense.assert_not_called()
mock_toarray.assert_not_called()

# Compare to sklearn

Now you are allowed to use `TfidfVectorizer`!

Use sklearns `TfidfVectorizer` (make sure to choose parameters appropriately). Compare the results.

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvect = TfidfVectorizer(stop_words='english',min_df=5) # set appropriate parameters!
sktfidf = None # Store the TF-IDF result obtained via sklearn
skvocab = None # The vocabulary
# YOUR CODE HERE
sktfidf = tvect.fit_transform(speeches)
skvocab = tvect.get_feature_names_out()

#raise NotImplementedError()

In [41]:
# Pretty display the data with pandas:
pd.DataFrame.sparse.from_spmatrix(sktfidf,index=labels,columns=skvocab).head()

Unnamed: 0,000,abandon,abandoned,abiding,abilities,ability,able,aboriginal,abroad,absence,...,written,wrong,year,years,yes,yield,young,zeal,zealous,zealously
1885-Cleveland,0.0,0.040356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.026141,0.0,0.0,0.0,0.0,0.038771,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024069,0.0,...,0.0,0.0,0.021913,0.070461,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.085117,0.0,0.030844,0.11901,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,0.037874,0.0,0.03174,0.0,0.026946,0.0,0.0,0.0,0.041508,...,0.0,0.0,0.024533,0.063107,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,0.046402,0.0,0.038888,0.0,0.0,0.0,0.0,0.0,0.0,...,0.041473,0.0,0.0,0.057988,0.0,0.0,0.034736,0.0,0.0,0.0


In [42]:
assert all(skvocab == vocab), "You did not use use same parameters as above."
assert sktfidf.shape == dtm.shape, "Matrix shapes do not agree."
assert (sktfidf > 0).sum() == (dtm > 0).sum(), "Sparsity must not change."
assert abs(sktfidf.power(2).sum() - sktfidf.shape[0]) < 1e-7, "Vectors are not 'c'."
assert isinstance(sktfidf, scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
assert sktfidf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."

# Understand the difference

By visual inspection of the two matrixes, you will notice that they do *not* agree.

Check the [bug reports of scikit-learn](https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aissue+tf-idf+is%3Aopen) for related bug reports, and check the scikit-learn documentation *carefully* to figure out the difference.

Is it better or worse? We don't know. But scikit-learn does not implement the standard approach!

But: we can easily "hack" sklearn to produce the desired result.

Hint: Use `fit`, adjust the vectorizer, and `tranform` separately.

In [43]:
# Work around this issue in scikit-learn
tvect2 = TfidfVectorizer(stop_words='english',min_df=5, smooth_idf=False, sublinear_tf=True, use_idf=False) # set appropriate parameters!
sktfidf2 = None # Store the TF-IDF result obtained via sklearn
skvocab2 = None # The vocabulary
# Use fit(), adjust as necessary, transform() to get the desired result!
# YOUR CODE HERE
sktfidf2 = tvect2.fit(speeches)
sktfidf2 = sktfidf2.transform(speeches)
skvocab2 = tvect.get_feature_names_out()

#raise NotImplementedError()

In [44]:
# Pretty display the data with pandas:
pd.DataFrame.sparse.from_spmatrix(sktfidf2,index=labels,columns=skvocab2).head()

Unnamed: 0,000,abandon,abandoned,abiding,abilities,ability,able,aboriginal,abroad,absence,...,written,wrong,year,years,yes,yield,young,zeal,zealous,zealously
1885-Cleveland,0.0,0.037491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.037491,0.0,0.0,0.0,0.0,0.037491,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034816,0.0,...,0.0,0.0,0.034816,0.090851,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.074865,0.0,0.044217,0.123442,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,0.036023,0.0,0.036023,0.0,0.036023,0.0,0.0,0.0,0.036023,...,0.0,0.0,0.036023,0.085963,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,0.043708,0.0,0.043708,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043708,0.0,0.0,0.091725,0.0,0.0,0.043708,0.0,0.0,0.0


In [45]:
assert all(skvocab2 == vocab), "You did not use use same parameters as above."
assert sktfidf2.shape == dtm.shape, "Matrix shapes do not agree."
assert (sktfidf2 > 0).sum() == (dtm > 0).sum(), "Sparsity must not change."
assert abs(sktfidf2.power(2).sum() - sktfidf2.shape[0]) < 1e-7, "Vectors are not 'c'."
assert isinstance(sktfidf2, scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
assert sktfidf2.dtype in [np.float32, np.float64, np.float16], "Not using floating point."
assert abs((sktfidf2 - sktfidf).sum()) > 1, "Results are not different."

In [46]:
assert np.abs(sktfidf2 - tfidf(dtm)).sum() < 1e-3, "Results are not similar."

# Compute the Cosine Similarity Matrix

Compute the cosine similarity matrix of the speeches above.

You are not allowed to use sklearn for this.

In [47]:
X = tfidf(dtm) # use your own tfidf results
sim = None # Compute cosine similarities

# YOUR CODE HERE
sim = X.dot(X.transpose())
del X # free memory again.
print("Matrix of shape %d x %d" % sim.shape)

Matrix of shape 58 x 58


In [48]:
assert sim.shape[0] == dtm.shape[0] and sim.shape[1] == dtm.shape[0], "Matrix size incorrect"
assert sim.max() < 1+1e07, "Invalid values"
assert sim.min() > -1e07, "Invalid values"
assert np.abs(sim.diagonal().mean() - 1) < 1e-8, "Diagonal is not valid."

## Find the two most similar speeches

Given the similarity matrix, find the two most similar (different) speeches.

In [49]:
most_similar = (None, None, None) # Store a pair of document *labels* and their similarity
# YOUR CODE HERE
most_similar = (0,0,0)
for i in range(sim.shape[0]):
    for j in range(sim.shape[0]):
        if i != j and sim[i,j] >= most_similar[2]:
            most_similar = (labels[j],labels[i],sim[i,j])


print("%s\t%s\t%g" % most_similar)

1817-Monroe	1821-Monroe	0.625993


In [50]:
assert isinstance(most_similar[0], str), "Not a label"
assert isinstance(most_similar[1], str), "Not a label"
assert isinstance(most_similar[2], float) or isinstance(most_similar[2], np.floating), "Not a similarity"
assert most_similar[0] != most_similar[1]
assert most_similar[2] > 0, "There is definitely something similar."
assert most_similar[2] < 1, "There were no duplicate inaugural speeches yet."

In [None]:
# Hidden tests