Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now, you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [None]:
NAME = "AVISHA BHIRYANI"

---

# Setup our working context and load the data

In this assignment, we will work with a database of inaugural speeches of US presidents.

In [2]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, scipy.sparse
import gzip, json
inaugural = json.load(gzip.open("/data/datasets/inaugural.json.gz","rt"))
labels = [t[0] for t in inaugural]
speeches = [t[1] for t in inaugural]

# Build a Sparse Document-Term Matrix

Build a document-term matrix for the inaugural speeches.

Use sparse data structures, a minimum document frequency of 5, remove english stopwords.

In [154]:
from sklearn.feature_extraction.text import CountVectorizer # Please use this
vocab = None # Your vocabulary
dtm = None # Your sparse document term matrix
# YOUR CODE HERE
coun_vect = CountVectorizer(min_df=5, stop_words='english')
dtm = coun_vect.fit_transform(speeches)
vocab = coun_vect.vocabulary_
print("Document term matrix has shape", dtm.shape)

Document term matrix has shape (58, 2158)


In [4]:
assert dtm.shape[0] == len(speeches), "Wrong number of speeches"
assert dtm.shape[1] > 2000, "You have too few words"
assert dtm.shape[1] < 3000, "You have too many words"
assert "president" in vocab, "You lost the president"
assert not "President" in vocab, "Please lowercase"
assert not "the" in vocab, "You did not remove stopwords"
assert isinstance(dtm, scipy.sparse.csr_matrix), "Generate a sparse matrix to conserve memory!"
assert dtm.sum(axis=0).min() == 5, "Minimum document frequency not OK"

In [None]:
# Additional tests

In [5]:
# Pretty display the data with pandas:
pd.DataFrame.sparse.from_spmatrix(dtm,index=labels,columns=vocab).head()

Unnamed: 0,fellow,citizens,presence,vast,countrymen,oath,shall,great,free,people,...,prominent,beneficial,weakened,glorious,feature,judicial,watching,influences,admission,supposed
1885-Cleveland,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
1969-Nixon,0,0,0,0,0,0,0,0,1,0,...,0,0,1,5,0,0,0,0,0,0
1941-Roosevelt,0,0,0,0,0,0,0,0,0,0,...,2,0,1,6,0,0,0,0,0,0
1937-Roosevelt,0,1,0,1,0,1,0,0,0,1,...,0,0,1,4,0,0,0,0,0,0
1965-Johnson,0,1,0,1,0,0,0,0,0,0,...,1,0,0,3,0,0,1,0,0,0


In [6]:
doc_term_matrix = pd.DataFrame.sparse.from_spmatrix(dtm,index=labels,columns=vocab)

# Most Frequent Words for Each Speech

Compute the most frequent word (except for the stopwords already removed) for each speech.

In [7]:
# Build a dictionary speech label to most frequent word
most_frequent = dict()
# YOUR CODE HERE

for index, row in doc_term_matrix.iterrows():
    most_frequent[index] = row.sort_values(ascending=False).index[0]

for sp, w in sorted(most_frequent.items()): print(sp, w, sep="\t")

1789-Washington	blessed
1793-Washington	meantime
1797-Adams	decisions
1801-Jefferson	blessed
1805-Jefferson	pledges
1809-Madison	pledges
1813-Madison	profound
1817-Monroe	blessed
1821-Monroe	standards
1825-Adams	contributed
1829-Jackson	pledges
1833-Jackson	blessed
1837-VanBuren	decisions
1841-Harrison	presidential
1845-Polk	blessed
1849-Taylor	meantime
1853-Pierce	presidential
1857-Buchanan	afforded
1861-Lincoln	years
1865-Lincoln	profound
1869-Grant	modern
1873-Grant	modern
1877-Hayes	modern
1881-Garfield	blessed
1885-Cleveland	decisions
1889-Harrison	decisions
1893-Cleveland	decisions
1897-McKinley	decisions
1901-McKinley	blessed
1905-Roosevelt	bitter
1909-Taft	blessed
1913-Wilson	standards
1917-Wilson	meantime
1921-Harding	slightest
1925-Coolidge	modern
1929-Hoover	blessed
1933-Roosevelt	supervision
1937-Roosevelt	blessed
1941-Roosevelt	assert
1945-Roosevelt	meantime
1949-Truman	slightest
1953-Eisenhower	recognize
1957-Eisenhower	slightest
1961-Kennedy	facing
1965-Johnson	assert
19

In [8]:
assert len(most_frequent) == len(labels), "You are missing some speeches"
assert set(most_frequent.keys()) == set(labels), "You are missing some speeches"
assert not "the" in most_frequent.values(), "Stopwords not removed"
assert "america" in most_frequent.values(), "Someone talked about america"

AssertionError: Someone talked about america

# TF-IDF

From the document-term matrix, compute the TF-IDF matrix. Implement the standard version of TF-IDF (`ltc`).

Be careful with 0 values, ensure that your matrix remains *sparse*. Do *not* rely on Wikipedia, it has errors.

Perform the transformation in three steps, named `tf`, `idf`, `tfidf`. First implement term frequency.

In [56]:
def tf(dtm):
    """Compute the "l" step of standard TF-IDF"""
    # HINT: use dtm.astype(np.float32) to get a *sparse floating point copy* of the dtm matrix.
    dtm_float = dtm.astype(np.float32)
    #dtm_float_sum = dtm_float.sum(axis = 1)
    dtm_float_coo = scipy.sparse.coo_matrix(dtm_float)
    dtm_float_formean = scipy.sparse.coo_matrix(dtm_float)
    dtm_float_coo.data = np.log(dtm_float_coo.data)
    dtm_float_formean.data = dtm_float_coo.data
    dtm_float_coo.data = dtm_float_coo.data + 1
    
    dtm_float_formean_sparse =  scipy.sparse.csr_matrix(dtm_float_formean.todense())
    dtm_float_mean = dtm_float_formean_sparse / dtm_float_formean_sparse.sum(axis=1)
    
    dtm_float_log_f = scipy.sparse.csr_matrix(dtm_float_coo.todense())
    dtm_final = (dtm_float_log_f) / (1 + dtm_float_mean)
    return scipy.sparse.csr_matrix(dtm_final)

print("Old sum:", dtm.sum(), "new sum:", tf(dtm).sum(), "(must be less and float)")

Old sum: 44859 new sum: 33817.035 (must be less and float)


In [57]:
# Inspect your matrix
pd.DataFrame.sparse.from_spmatrix(tf(dtm),index=labels,columns=vocab).head()

Unnamed: 0,fellow,citizens,presence,vast,countrymen,oath,shall,great,free,people,...,prominent,beneficial,weakened,glorious,feature,judicial,watching,influences,admission,supposed
1885-Cleveland,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,2.578279,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.677022,0.0,1.0,2.724052,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,2.359126,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,2.069033,0.0,0.0,1.0,0.0,0.0,0.0


In [58]:
# Automatic tests
_tf = tf(dtm)
assert _tf.sum() < dtm.sum(), "Weight sum has not decreased."
assert (_tf > 0).sum() == (dtm > 0).sum(), "Number of zeros must not change."
assert (_tf[_tf > 0].min()) == 1, "Scaling incorrect."
assert _tf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."
assert isinstance(_tf, scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
# It is not allowed to only restore sparsity if it was lost.
from unittest.mock import patch
with patch('scipy.sparse.csr.csr_matrix') as mock_csr: tf(dtm)
mock_csr.assert_not_called()
del _tf

Implement the `idf` function.

In [98]:
import math
def idf(dtm):
    """ Compute the "t" step inverse document frequency """
    # YOUR CODE HERE
    dtm_float = dtm.astype(np.float32)
    dtm_float = scipy.sparse.csr_matrix(dtm_float)
    idfDict = {}
    i = 0
    for word in vocab:
        idfDict[word] = dtm_float.getcol(i).count_nonzero()
        i = i+1
    N = len(labels)
    arr = []
    #idfDict = dict.fromkeys(docList[0].keys(), 0)
    for word, val in idfDict.items():
        arr.append(math.log10(N / (float(val))))
    numpy_arr = np.array(arr)   
    return numpy_arr

In [97]:
idf(dtm)

(2158,)


array([0.98527674, 0.91832995, 1.06445799, ..., 0.86033801, 1.06445799,
       0.98527674])

In [99]:
b=(np.ones((dtm.shape[0],)) @ dtm)
print(np.log(dtm.shape[0] / b))
print(idf(dtm))

[1.42138568 1.86321843 2.4510051  ... 1.57553636 2.4510051  2.26868354]
[0.98527674 0.91832995 1.06445799 ... 0.86033801 1.06445799 0.98527674]


In [100]:
# Automatic tests
_idf = idf(dtm)
assert len(_idf.shape) == 1, "Result must be one-dimensional."
assert _idf.flatten().shape[0] == dtm.shape[1], "The IDF dimension is not okay."
assert _idf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."
assert not isinstance(_idf, scipy.sparse.csr_matrix), "IDF must not be sparse."
assert _idf.min() > 0, "IDF must not be zero here."
assert _idf.max() < 3, "Too large idf values."
from unittest.mock import patch
with patch('scipy.sparse.csr_matrix.todense') as mock_todense, patch('scipy.sparse.csr_matrix.toarray') as mock_toarray:
    type(idf(dtm))
mock_todense.assert_not_called()
mock_toarray.assert_not_called()
del _idf

Now implement the full `tfidf` function, using above implementations of `df` and `idf`.

Hint: you may find `scipy.sparse.spdiags` useful to keep the computations *sparse*.

You are **not allowed** to use sklearns `TfidfVectorizer`!

In [134]:
def tfidf(dtm):
    """Finish the computation of standard TF-IDF with the c step"""
    _tf, _idf = tf(dtm), idf(dtm) # Must use above functions.
    # YOUR CODE HERE
    df = pd.DataFrame.sparse.from_spmatrix(_tf,index=labels,columns=vocab)
    i = 0
    for column in df.columns:
        df[column] = df[column] * _idf[i]
        df[column] = df[column].astype(np.float32)
        i = i+1
    csr_mat = scipy.sparse.csr_matrix(df.values)
    return csr_mat

In [135]:
tfidf(dtm)

<58x2158 sparse matrix of type '<class 'numpy.float32'>'
	with 25164 stored elements in Compressed Sparse Row format>

In [136]:
# Inspect your matrix
pd.DataFrame.sparse.from_spmatrix(tfidf(dtm),index=labels,columns=vocab).head()

Unnamed: 0,fellow,citizens,presence,vast,countrymen,oath,shall,great,free,people,...,prominent,beneficial,weakened,glorious,feature,judicial,watching,influences,admission,supposed
1885-Cleveland,0.0,0.91833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.421005,0.0,0.0,0.0,0.0,0.860338,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.508155,0.0,...,0.0,0.0,0.421005,0.284166,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.280286,0.0,0.421005,0.300233,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,0.91833,0.0,0.684247,0.0,0.508155,0.0,0.0,0.0,1.064458,...,0.0,0.0,0.421005,0.260012,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,0.91833,0.0,0.684247,0.0,0.0,0.0,0.0,0.0,0.0,...,0.763428,0.0,0.0,0.228039,0.0,0.0,0.559308,0.0,0.0,0.0


In [137]:
_tfidf = tfidf(dtm)
assert _tfidf.sum() < dtm.sum(), "Weight sum has not decreased."
assert (_tfidf > 0).sum() == (dtm > 0).sum(), "Number of zeros must not change."
assert abs(_tfidf.power(2).sum() - _tfidf.shape[0]) < 1e-5, "Vectors are not 'c'."
assert _tfidf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."

AssertionError: Vectors are not 'c'.

In [138]:
from unittest.mock import patch
with patch('scipy.sparse.csr_matrix.todense') as mock_todense, patch('scipy.sparse.csr_matrix.toarray') as mock_toarray:
    assert isinstance(tfidf(dtm), scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
mock_todense.assert_not_called()
mock_toarray.assert_not_called()

ValueError: Column length mismatch: 2158 vs. 0

# Compare to sklearn

Now you are allowed to use `TfidfVectorizer`!

Use sklearns `TfidfVectorizer` (make sure to choose parameters appropriately). Compare the results.

In [156]:
from sklearn.feature_extraction.text import TfidfVectorizer
# set appropriate parameters!
sktfidf = None # Store the TF-IDF result obtained via sklearn
skvocab = None # The vocabulary
# YOUR CODE HERE
tvect=TfidfVectorizer(stop_words='english', min_df=5)
sktfidf=tvect.fit_transform(speeches)
skvocab = tvect.vocabulary_

In [157]:
# Pretty display the data with pandas:
pd.DataFrame.sparse.from_spmatrix(sktfidf,index=labels,columns=skvocab).head()

Unnamed: 0,fellow,citizens,presence,vast,countrymen,oath,shall,great,free,people,...,prominent,beneficial,weakened,glorious,feature,judicial,watching,influences,admission,supposed
1885-Cleveland,0.0,0.040356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.026141,0.0,0.0,0.0,0.0,0.038771,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024069,0.0,...,0.0,0.0,0.021913,0.070461,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.085117,0.0,0.030844,0.11901,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,0.037874,0.0,0.03174,0.0,0.026946,0.0,0.0,0.0,0.041508,...,0.0,0.0,0.024533,0.063107,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,0.046402,0.0,0.038888,0.0,0.0,0.0,0.0,0.0,0.0,...,0.041473,0.0,0.0,0.057988,0.0,0.0,0.034736,0.0,0.0,0.0


In [169]:
skvocab == vocab

True

In [168]:
assert all(skvocab == vocab), "You did not use use same parameters as above."
assert sktfidf.shape == dtm.shape, "Matrix shapes do not agree."
assert (sktfidf > 0).sum() == (dtm > 0).sum(), "Sparsity must not change."
assert abs(sktfidf.power(2).sum() - sktfidf.shape[0]) < 1e-7, "Vectors are not 'c'."
assert isinstance(sktfidf, scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
assert sktfidf.dtype in [np.float32, np.float64, np.float16], "Not using floating point."

TypeError: 'bool' object is not iterable

# Understand the difference

By visual inspection of the two matrixes, you will notice that they do *not* agree.

Check the [bug reports of scikit-learn](https://github.com/scikit-learn/scikit-learn/issues?q=is%3Aissue+tf-idf+is%3Aopen) for related bug reports, and check the scikit-learn documentation *carefully* to figure out the difference.

Is it better or worse? We don't know. But scikit-learn does not implement the standard approach!

But: we can easily "hack" sklearn to produce the desired result.

Hint: Use `fit`, adjust the vectorizer, and `tranform` separately.

In [171]:
# Work around this issue in scikit-learn
# set appropriate parameters!
sktfidf2 = None # Store the TF-IDF result obtained via sklearn
skvocab2 = None # The vocabulary
# Use fit(), adjust as necessary, transform() to get the desired result!
# YOUR CODE HERE
tvect2=TfidfVectorizer(stop_words='english', min_df=5)
sktfidf1=tvect.fit(speeches)
sktfidf2 = sktfidf1.transform(speeches)
skvocab = sktfidf1.vocabulary_

In [172]:
# Pretty display the data with pandas:
pd.DataFrame.sparse.from_spmatrix(sktfidf2,index=labels,columns=skvocab2).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2148,2149,2150,2151,2152,2153,2154,2155,2156,2157
1885-Cleveland,0.0,0.040356,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.026141,0.0,0.0,0.0,0.0,0.038771,0.0,0.0
1969-Nixon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024069,0.0,...,0.0,0.0,0.021913,0.070461,0.0,0.0,0.0,0.0,0.0,0.0
1941-Roosevelt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.085117,0.0,0.030844,0.11901,0.0,0.0,0.0,0.0,0.0,0.0
1937-Roosevelt,0.0,0.037874,0.0,0.03174,0.0,0.026946,0.0,0.0,0.0,0.041508,...,0.0,0.0,0.024533,0.063107,0.0,0.0,0.0,0.0,0.0,0.0
1965-Johnson,0.0,0.046402,0.0,0.038888,0.0,0.0,0.0,0.0,0.0,0.0,...,0.041473,0.0,0.0,0.057988,0.0,0.0,0.034736,0.0,0.0,0.0


In [173]:
assert all(skvocab2 == vocab), "You did not use use same parameters as above."
assert sktfidf2.shape == dtm.shape, "Matrix shapes do not agree."
assert (sktfidf2 > 0).sum() == (dtm > 0).sum(), "Sparsity must not change."
assert abs(sktfidf2.power(2).sum() - sktfidf2.shape[0]) < 1e-7, "Vectors are not 'c'."
assert isinstance(sktfidf2, scipy.sparse.csr_matrix), "Not a sparse matrix anymore!"
assert sktfidf2.dtype in [np.float32, np.float64, np.float16], "Not using floating point."
assert abs((sktfidf2 - sktfidf).sum()) > 1, "Results are not different."

TypeError: 'bool' object is not iterable

In [175]:
assert np.abs(sktfidf2 - tfidf(dtm)).sum() < 1e-3, "Results are not similar."

AssertionError: Results are not similar.

# Compute the Cosine Similarity Matrix

Compute the cosine similarity matrix of the speeches above.

You are not allowed to use sklearn for this.

In [None]:
X = tfidf(dtm) # use your own tfidf results
sim = None # Compute cosine similarities
# YOUR CODE HERE
raise NotImplementedError()
del X # free memory again.
print("Matrix of shape %d x %d" % sim.shape)

In [None]:
assert sim.shape[0] == dtm.shape[0] and sim.shape[1] == dtm.shape[0], "Matrix size incorrect"
assert sim.max() < 1+1e07, "Invalid values"
assert sim.min() > -1e07, "Invalid values"
assert np.abs(sim.diagonal().mean() - 1) < 1e-8, "Diagonal is not valid."

## Find the two most similar speeches

Given the similarity matrix, find the two most similar (different) speeches.

In [None]:
most_similar = (None, None, None) # Store a pair of document *labels* and their similarity
# YOUR CODE HERE
raise NotImplementedError()
print("%s\t%s\t%g" % most_similar)

In [None]:
assert isinstance(most_similar[0], str), "Not a label"
assert isinstance(most_similar[1], str), "Not a label"
assert isinstance(most_similar[2], float) or isinstance(most_similar[2], np.floating), "Not a similarity"
assert most_similar[0] != most_similar[1]
assert most_similar[2] > 0, "There is definitely something similar."
assert most_similar[2] < 1, "There were no duplicate inaugural speeches yet."

In [None]:
# Hidden tests