# Non-negative Matrix Factorization (NMF) for Recommendation Systems
This notebook applies NMF to a set of Wikipedia articles to build a recommendation system that, given one article, suggests the articles most similar to it.
<br>

### Mathematical Formulas
Non-negative Matrix Factorization (NMF)
<br>
<img src="/Users/alexandergursky/Local_Repository/Python_Repo/ML_AI/Study_Projects/Unsupervised Learning in Python/Discovering interpretable features/Screen Shot 2023-03-17 at 4.53.56 PM.png" width="500" height="200">
<br>
<br>
Normalization
<br>
<img src="/Users/alexandergursky/Local_Repository/Python_Repo/ML_AI/Study_Projects/Unsupervised Learning in Python/Discovering interpretable features/Screen Shot 2023-03-17 at 5.37.30 PM.png" width="500" height="120">
<br>
<br>
Cosine Similarity
<br>
<img src="/Users/alexandergursky/Local_Repository/Python_Repo/ML_AI/Study_Projects/Unsupervised Learning in Python/Discovering interpretable features/Screen Shot 2023-03-17 at 5.25.34 PM.png" width="500" height="100">
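
For reference, the three formulas named above are commonly written as follows. The notation here ($V$, $W$, $H$, $x$, $a$, $b$) is an assumption and may differ from the symbols used in the screenshots.

$$
V \approx W H, \qquad V \in \mathbb{R}_{\ge 0}^{n \times m},\; W \in \mathbb{R}_{\ge 0}^{n \times k},\; H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

$$
\hat{x} = \frac{x}{\lVert x \rVert_2}
$$

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}
$$

With articles as the rows of $V$, row $i$ of $W$ holds the $k$ NMF features of article $i$, and the rows of $H$ are the components. When $a$ and $b$ already have unit length, the cosine similarity reduces to the dot product $a \cdot b$.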

### Additional Information
A common issue with text data is that documents on the same topic can differ in phrasing or length, so their feature values differ even though the topic is the same; this is demonstrated below. Cosine similarity overcomes this problem: it compares the angle between article vectors rather than their magnitudes, so articles that point in the same direction are treated as similar.
<br>
<img src="/Users/alexandergursky/Local_Repository/Python_Repo/ML_AI/Study_Projects/Unsupervised Learning in Python/Discovering interpretable features/Screen Shot 2023-03-17 at 4.27.56 PM.png" width="500" height="200">
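
The point about angles versus magnitudes can be checked with a small sketch. The word counts below are made up for illustration, and `cosine_similarity` and `unit` are hypothetical helper names rather than anything from the notebook. The sketch also shows that, once vectors are rescaled to unit length, a plain dot product gives the same value as the cosine formula.

```python
import numpy as np

# Hypothetical word counts for three documents over a 3-word vocabulary.
short_doc = np.array([2.0, 1.0, 0.0])    # short article about one topic
long_doc = np.array([20.0, 10.0, 0.0])   # same word proportions, ten times longer
other_doc = np.array([0.0, 1.0, 5.0])    # different topic


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v):
    """Rescale v to unit L2 length."""
    return v / np.linalg.norm(v)


same_topic = cosine_similarity(short_doc, long_doc)
different_topic = cosine_similarity(short_doc, other_doc)

# After normalization, a plain dot product gives the same value.
same_topic_via_dot = float(np.dot(unit(short_doc), unit(long_doc)))

print(same_topic, different_topic, same_topic_via_dot)
```

The two documents with the same proportions score (up to floating-point error) the maximum similarity despite their different lengths, while the document on a different topic scores much lower.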


In [1]:
# pip3 install pandas
# pip3 install scikit-learn
# pip3 install scipy
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix


# Loading df of the articles
main_df = pd.read_csv('/Users/alexandergursky/Local_Repository/Datasets/Dataset_Package/Wikipedia articles/wikipedia-vectors.csv', index_col=0)

# Getting all of the words from the articles
words_df = pd.read_csv('/Users/alexandergursky/Local_Repository/Datasets/Dataset_Package/Wikipedia articles/wikipedia-vocabulary-utf8.txt',header=None)

# csr_matrix is a sparse format that stores only the non-zero entries, which saves memory.
# Word-frequency matrices in NLP are mostly zeros. Transposing makes rows articles and columns words.
articles = csr_matrix(main_df.transpose())
titles = list(main_df.columns)  # extracting the titles of articles from the df

# Extracting first column, the values, then turning them into a list.
words_ls = words_df.iloc[:,0].values.tolist()

# Creating an NMF instance
model = NMF(n_components=6)

# Fitting the model to articles (our data)
model.fit(articles)

# Transform the articles into NMF features: one row per article, one column per component
nmf_features = model.transform(articles)
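
A fitted scikit-learn `NMF` model also exposes its components through the `components_` attribute, with one row per component and one column per input feature (here, per word). The sketch below shows one way to inspect them on made-up toy data; `toy_words`, `toy_counts`, `toy_model`, and `components_df` are hypothetical names. With the notebook's data, the column labels could come from `words_ls`, assuming its order matches the columns of `articles`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Hypothetical toy data: 4 documents (rows) x 4 words (columns).
toy_words = ['goal', 'league', 'album', 'guitar']
toy_counts = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

# init and random_state are fixed only to make the sketch repeatable.
toy_model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
toy_features = toy_model.fit_transform(toy_counts)  # shape: (4 documents, 2 components)

# components_ has one row per component and one column per word.
components_df = pd.DataFrame(toy_model.components_, columns=toy_words)

# Print the two highest-weighted words of each component.
for i in range(len(components_df)):
    print(i, components_df.iloc[i].nlargest(2).index.tolist())
```

Both the features and the components are non-negative, which is what makes the components readable as groups of words that tend to occur together.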



In [2]:
# Normalizing the NMF features: norm_features
# normalize() rescales each row (article) to unit length (L2 norm of 1)
norm_features = normalize(nmf_features)

# Storing the normalized features in a DataFrame indexed by article title.
# The original df had words as rows and articles as columns. After transposing it into a csr_matrix,
# NMF reduced each article's word vector to 6 non-negative features, and normalization gave
# every row unit length.
df = pd.DataFrame(norm_features, index=titles)

# Selecting the row (article) for 'Cristiano Ronaldo' to test the model.
article = df.loc['Cristiano Ronaldo']

# Compute the dot product of every article's row with the selected article.
similarities = df.dot(article)  # Every row has unit length, so each dot product equals the cosine
                                # similarity. NMF features are non-negative, so values fall in [0, 1],
                                # with 1 meaning the two vectors point in the same direction.

# Display those with the largest cosine similarity
print(similarities.nlargest()) # If no int is passed to the method then it returns 5, similar to .head()

Cristiano Ronaldo                1.000000
Franck Ribéry                    0.999972
Radamel Falcao                   0.999942
Zlatan Ibrahimović               0.999942
France national football team    0.999923
dtype: float64
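
The top result above is the query article itself, with a similarity of 1. A minimal sketch of how the steps could be wrapped into a reusable function that leaves the query out is shown below. The function name `recommend`, its parameters, and the toy feature values are all hypothetical, and the sketch normalizes with NumPy instead of `sklearn.preprocessing.normalize` to stay self-contained.

```python
import numpy as np
import pandas as pd


def recommend(title, features_df, n=5):
    """Return the n rows most similar to `title`, excluding `title` itself."""
    # Rescale every row to unit L2 length so dot products are cosine similarities.
    norms = np.linalg.norm(features_df.values, axis=1, keepdims=True)
    unit_df = pd.DataFrame(features_df.values / norms, index=features_df.index)
    similarities = unit_df.dot(unit_df.loc[title])
    # Drop the query article, then keep the n largest similarities.
    return similarities.drop(title).nlargest(n)


# Hypothetical 2-feature representation of three articles.
toy_df = pd.DataFrame(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
    index=['Article A', 'Article B', 'Article C'],
)
suggestions = recommend('Article A', toy_df, n=2)
print(suggestions)
```

With the notebook's data, the call would look like `recommend('Cristiano Ronaldo', pd.DataFrame(nmf_features, index=titles))`. Rows whose features are all zero would have a zero norm and need separate handling, which this sketch does not attempt.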
