# Non-negative matrix factorization (NMF) for Documents

NMF is a dimension reduction technique.

NMF models are interpretable and easy to explain (unlike PCA).

All sample features MUST be non-negative.
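The non-negativity requirement can be checked on a toy matrix. The sketch below is only an illustration: the array `X`, the choice of 2 components, and the `init`, `random_state`, and `max_iter` settings are assumptions made for the example and are not taken from the notebook. It factorizes `X` into non-negative features `W` and components `H`, then shows that scikit-learn rejects an input containing a negative entry.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy data: 4 samples, 3 non-negative features.
X = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.5, 0.0],
])

# Settings chosen for the illustration only.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # NMF features, one row per sample
H = model.components_        # NMF components, one row per component

print(W.shape, H.shape)
print("All entries non-negative:", bool((W >= 0).all() and (H >= 0).all()))

# A single negative entry makes the input invalid for NMF.
X_bad = X.copy()
X_bad[0, 0] = -1.0
try:
    NMF(n_components=2).fit(X_bad)
    rejected = False
except ValueError as err:
    rejected = True
    print(f"Rejected: {err}")
```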

## Imports

In [None]:
import pandas as pd

In [2]:
# To preprocess `wikipedia-vectors.csv` into the format in which we'll use it in the exercises,
# we have to take its transpose:

from scipy.sparse import csr_matrix

filePath = "../datasets/Wikipedia articles/wikipedia-vectors.csv"
df = pd.read_csv(filePath, index_col=0)
articles = csr_matrix(df.transpose())
titles = list(df.columns)

# The reason for taking this transpose is that without it, there would be 13,000 columns 
# (corresponding to the 13,000 words in the file), which is a lot of columns for a CSV to have.

In [11]:
df.head(5)

| | 0 | 1 |
|---|---|---|
| HTTP 404 | 0.035435 | 0.039641 |
| Alexa Internet | 0.064925 | 0.052661 |
| Internet Explorer | 0.050883 | 0.038062 |
| HTTP cookie | 0.033883 | 0.03717 |
| Google Search | 0.037928 | 0.034032 |


In [8]:
df.shape

(60, 2)

In [9]:
articles.shape

(60, 13125)
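The transpose in the preprocessing cell turns a words-by-articles table into an articles-by-words sparse matrix, which is why `articles.shape` has one row per article. A minimal sketch of the same step on a made-up table follows; the names `toy`, `toy_articles`, and `toy_titles` and all the values are hypothetical stand-ins for the real CSV.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the CSV: rows are words, columns are article titles.
toy = pd.DataFrame(
    {"Article A": [0.0, 0.5, 0.0], "Article B": [0.3, 0.0, 0.0]},
    index=["film", "music", "football"],
)

# After the transpose, each row is an article and each column is a word.
toy_articles = csr_matrix(toy.transpose().to_numpy())
toy_titles = list(toy.columns)

print(toy_articles.shape)  # (2, 3): 2 articles, 3 words
print(toy_titles)          # ['Article A', 'Article B']
print(toy_articles.nnz)    # 2 non-zero entries are stored
```

Storing the result as a `csr_matrix` keeps only the non-zero entries, which suits tf-idf data where most word counts are zero.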

### NMF applied to Wikipedia articles
Here we apply NMF to the tf-idf word-frequency array of the Wikipedia articles, given as the csr_matrix `articles`: we fit the model and then transform the articles.

In [6]:
# Import NMF
from sklearn.decomposition import NMF
# Create an NMF instance: model
model = NMF(n_components=6)
# Fit the model to articles
model.fit(articles)
# Transform the articles: nmf_features
nmf_features = model.transform(articles)
# Print the NMF features
print(nmf_features.round(2))


[[0.   0.   0.   0.   0.   0.44]
 [0.   0.   0.   0.   0.   0.56]
 [0.   0.   0.   0.   0.   0.4 ]
 [0.   0.   0.   0.   0.   0.38]
 [0.   0.   0.   0.   0.   0.48]
 [0.01 0.01 0.01 0.03 0.   0.33]
 [0.   0.   0.02 0.   0.01 0.36]
 [0.   0.   0.   0.   0.   0.49]
 [0.02 0.01 0.   0.02 0.03 0.48]
 [0.01 0.03 0.03 0.07 0.02 0.34]
 [0.   0.   0.53 0.   0.03 0.  ]
 [0.   0.   0.35 0.   0.   0.  ]
 [0.01 0.01 0.31 0.06 0.01 0.02]
 [0.   0.01 0.34 0.01 0.   0.  ]
 [0.   0.   0.43 0.   0.04 0.  ]
 [0.   0.   0.48 0.   0.   0.  ]
 [0.01 0.02 0.37 0.03 0.   0.01]
 [0.   0.   0.48 0.   0.   0.  ]
 [0.   0.01 0.55 0.   0.   0.  ]
 [0.   0.   0.46 0.   0.   0.  ]
 [0.   0.01 0.02 0.51 0.06 0.01]
 [0.   0.   0.   0.51 0.   0.  ]
 [0.   0.01 0.   0.42 0.   0.  ]
 [0.   0.   0.   0.43 0.   0.  ]
 [0.   0.   0.   0.49 0.   0.  ]
 [0.1  0.09 0.   0.38 0.   0.01]
 [0.   0.   0.   0.57 0.   0.01]
 [0.01 0.01 0.   0.47 0.   0.01]
 [0.   0.   0.   0.57 0.   0.  ]
 [0.   0.   0.   0.52 0.01 0.01]
 [0.   0.4

### NMF features of the Wikipedia articles
Now we explore the NMF features we created above.

When investigating the features, notice that for both actors, NMF feature 3 (zero-based index, so the fourth column) has by far the highest value. This means that both articles are reconstructed mainly from the NMF component with index 3. Why? NMF components represent topics (for instance, acting!).

In [13]:
# Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)
# Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])
# Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])

0    0.003815
1    0.000000
2    0.000000
3    0.571887
4    0.000000
5    0.000000
Name: Anne Hathaway, dtype: float64
0    0.000000
1    0.005575
2    0.000000
3    0.419579
4    0.000000
5    0.000000
Name: Denzel Washington, dtype: float64
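To make "by far the highest value" concrete, the sketch below takes the feature values printed above and reports the index of the largest feature and its share of the row's total weight. The helper name `dominant_feature` is hypothetical and exists only for this illustration.

```python
import numpy as np

# NMF feature values copied from the printed rows above.
anne_hathaway = np.array([0.003815, 0.0, 0.0, 0.571887, 0.0, 0.0])
denzel_washington = np.array([0.0, 0.005575, 0.0, 0.419579, 0.0, 0.0])

def dominant_feature(row):
    """Return (index of the largest feature, its share of the row total)."""
    index = int(row.argmax())
    share = float(row[index] / row.sum())
    return index, share

anne_index, anne_share = dominant_feature(anne_hathaway)
denzel_index, denzel_share = dominant_feature(denzel_washington)

print(f"Anne Hathaway: feature {anne_index}, share {anne_share:.3f}")
print(f"Denzel Washington: feature {denzel_index}, share {denzel_share:.3f}")
```

With the notebook's own DataFrame, `df.idxmax(axis=1)` would give the dominant feature index for every article at once.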


In [15]:
def read_txt_to_array(filepath):
    """
    Reads a text file and returns its contents as an array of strings,
    where each element is a line from the file.

    Args:
        filepath: The path to the text file.

    Returns:
        A list of strings, where each string is a line from the file.
        Returns an empty list if the file does not exist or if there's an error.
    """
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            lines = file.read().splitlines()
        return lines
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        return []
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

# Example usage with the provided file path:
filepath = "../datasets/Wikipedia articles/wikipedia-vocabulary-utf8.txt"
words = read_txt_to_array(filepath)

# Print the array (optional - for demonstration)
# print(words)

# Print the number of words in the array
print(f"Number of words in the array: {len(words)}")

# Print the first 10 words in the array (optional - for demonstration)
print(f"First 10 words: {words[:10]}")


Number of words in the array: 13125
First 10 words: ['aaron', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abbas', 'abbey', 'abbreviated', 'abbreviation', 'abc']


### NMF learns topics of documents
When NMF is applied to documents, the components correspond to topics of documents, and the NMF features reconstruct the documents from the topics. 

Previously, we saw that NMF feature 3 (zero-based index) was high for the articles about the actors Anne Hathaway and Denzel Washington. Now we identify the topic of the corresponding NMF component.

Here, we can recognize the topic that the articles about Anne Hathaway and Denzel Washington have in common!

In [16]:
# Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns=words)

# Print the shape of the DataFrame
print(components_df.shape)

# Select row 3 (zero-based index): component
component = components_df.iloc[3]

# Print the five largest entries (nlargest defaults to n=5)
print(component.nlargest())


(6, 13125)
film       0.632082
award      0.254825
starred    0.246927
role       0.212867
actress    0.187646
Name: 3, dtype: float64
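The statement that NMF features reconstruct a document from the topics can be sketched as a weighted sum of component rows. Everything in the block below is made up for illustration: the 4-word vocabulary, the two components, and the feature values are assumptions, not values from the fitted model.

```python
import numpy as np

# Hypothetical toy values: 2 topics over a 4-word vocabulary.
vocab = ["film", "award", "music", "album"]
components = np.array([
    [0.6, 0.3, 0.0, 0.0],   # topic 0: film-related words
    [0.0, 0.0, 0.5, 0.4],   # topic 1: music-related words
])

# NMF features of one hypothetical document: mostly topic 0.
features = np.array([0.5, 0.1])

# The reconstruction is the feature-weighted sum of the component rows.
reconstruction = features @ components
for word, weight in zip(vocab, reconstruction.round(2).tolist()):
    print(word, weight)
```

For the model fitted in this notebook, the analogous product is `nmf_features @ model.components_`, which approximates the tf-idf rows of `articles`.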
