### Limitation(s) of sklearn’s non-negative matrix factorization library

In [20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

### Non-Negative Matrix Factorization (NMF) 

#### Step 1: Load and view the Data

In [4]:
MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

In [5]:
MV_movies.head(1)

Unnamed: 0,mID,title,year,Doc,Com,Hor,Adv,Wes,Dra,Ani,...,Chi,Cri,Thr,Sci,Mys,Rom,Fil,Fan,Act,Mus
0,1,Toy Story,1995,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0


In [6]:
MV_users.head(1)

Unnamed: 0,uID,gender,age,accupation,zip
0,1,F,1,10,48067


In [7]:
train.head()

Unnamed: 0,uID,mID,rating
0,744,1210,5
1,3040,1584,4
2,1451,1293,5
3,5455,3176,2
4,2507,3074,5


#### Step 2: Preprocess the Data

Create a user-item matrix from the training data and fill missing ratings with zeros, because sklearn's NMF does not accept missing values.

In [8]:
# Create the user-item matrix for the training data
user_item_matrix = train.pivot(index='uID', columns='mID', values='rating')

# Fill missing ratings with zeros, since sklearn's NMF cannot handle NaN
user_item_matrix.fillna(0, inplace=True)

In [9]:
user_item_matrix.head()

uID \ mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
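
Because 0 is not a valid rating, it can help to record which cells were actually observed before filling them. The following is a minimal sketch on made-up data; `toy_train`, `toy_matrix`, `observed_mask`, and `toy_filled` are illustrative names, not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same long format as train.csv (values are made up).
toy_train = pd.DataFrame({
    "uID": [1, 1, 2, 3],
    "mID": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})

toy_matrix = toy_train.pivot(index="uID", columns="mID", values="rating")

# Record which cells hold a real rating before filling, since 0 is not a rating.
observed_mask = toy_matrix.notna()
sparsity = 1.0 - observed_mask.to_numpy().mean()

toy_filled = toy_matrix.fillna(0)
print(f"observed cells: {int(observed_mask.to_numpy().sum())}, sparsity: {sparsity:.2f}")
```

Applying the same two lines to `user_item_matrix` before the `fillna` call would show how much of the matrix NMF is asked to fit as zeros.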


#### Step 3: Apply NMF

Use NMF from sklearn to factorize the user-item matrix.

In [21]:
# Perform NMF
nmf = NMF(n_components=20, init='random', random_state=42)
W = nmf.fit_transform(user_item_matrix)
H = nmf.components_



In [23]:
H

array([[0.        , 0.        , 0.00302483, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.1657167 , 0.02167047, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.05395629, 0.        , ..., 0.02225336, 0.        ,
        0.        ],
       ...,
       [0.02426404, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.31531265, ..., 0.35171306, 0.22207527,
        2.83829515],
       [0.67379271, 0.04242959, 0.        , ..., 0.        , 0.        ,
        0.14617458]])
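
To make the roles of the two factors concrete, here is a small sketch on a made-up matrix. It assumes the same `NMF` settings as above apart from a larger `max_iter`; `X`, `W_toy`, and `H_toy` are illustrative names. `W_toy` holds one row of latent weights per user, `H_toy` holds one column per item, and both are non-negative.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(8, 6)).astype(float)  # made-up non-negative matrix

k = 3
model = NMF(n_components=k, init="random", random_state=42, max_iter=500)
W_toy = model.fit_transform(X)  # shape (n_users, k)
H_toy = model.components_       # shape (k, n_items)

# The product of the factors approximates the input matrix.
X_hat = W_toy @ H_toy
print(W_toy.shape, H_toy.shape, X_hat.shape)
print("Frobenius error:", np.linalg.norm(X - X_hat))
```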

#### Step 4: Predict Missing Ratings

Reconstruct the user-item matrix and predict the ratings for the test set.

In [24]:
# Reconstruct the user-item matrix
reconstructed_matrix = np.dot(W, H)

# Convert reconstructed matrix back to DataFrame to match user-item format
reconstructed_df = pd.DataFrame(reconstructed_matrix, index=user_item_matrix.index, columns=user_item_matrix.columns)

In [25]:
reconstructed_df

uID \ mID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,1.709725,0.549405,0.014191,0.023687,0.015364,0.000337,0.014362,0.095774,0.000384,0.124354,...,0.011539,0.000558,0.016267,0.007902,0.004085,0.131152,0.045178,0.008632,0.005079,0.096387
2,1.594666,0.335222,0.099675,0.035014,0.059429,0.760426,0.183936,0.007312,0.129325,0.985202,...,0.000031,0.000922,0.000203,0.029749,0.021880,0.011091,0.000196,0.020045,0.000000,0.045648
3,0.840486,0.146931,0.014632,0.000931,0.000000,0.174155,0.017593,0.012350,0.059059,0.399550,...,0.001667,0.000000,0.001940,0.011613,0.000941,0.019272,0.000002,0.000000,0.000000,0.000000
4,0.000000,0.000066,0.003995,0.000000,0.000000,0.074023,0.007745,0.000000,0.000000,0.143437,...,0.000754,0.000000,0.000595,0.000009,0.007156,0.000007,0.000004,0.000006,0.000000,0.000000
5,0.658992,0.038826,0.013240,0.073820,0.000000,1.568153,0.002662,0.006715,0.000000,0.100182,...,0.112859,0.000053,0.000319,0.001209,0.035264,0.130977,0.584095,0.040987,0.079055,0.330120
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,2.519037,0.861391,0.153206,0.207802,0.101368,1.170719,0.278494,0.061067,0.032209,0.514240,...,0.208253,0.005127,0.007948,0.046612,0.112903,0.281347,0.997915,0.176131,0.171307,0.750589
6037,1.249398,0.130613,0.008357,0.007953,0.019008,0.053457,0.008318,0.000000,0.000000,0.063120,...,0.009964,0.005552,0.000157,0.004432,0.048756,0.017855,0.001590,0.045320,0.000000,0.015039
6038,0.648476,0.019328,0.004957,0.003400,0.003069,0.000283,0.017999,0.003242,0.000000,0.027164,...,0.002878,0.000123,0.000613,0.000128,0.000532,0.016019,0.007326,0.001163,0.001411,0.005892
6039,1.323802,0.282639,0.121611,0.047811,0.065399,0.000000,0.447001,0.049251,0.000000,0.100186,...,0.013212,0.010183,0.007971,0.000000,0.044403,0.055133,0.000122,0.039914,0.000000,0.000000


#### Step 5: Calculate RMSE

Calculate the Root Mean Squared Error to evaluate the model's performance.

In [26]:

# Get the predictions for the test set
test_predictions = []
test_actuals = []

for index, row in test.iterrows():
    user = row['uID']
    item = row['mID']
    actual_rating = row['rating']
    
    if user in reconstructed_df.index and item in reconstructed_df.columns:
        predicted_rating = reconstructed_df.loc[user, item]
        test_predictions.append(predicted_rating)
        test_actuals.append(actual_rating)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test_actuals, test_predictions))
print(f'RMSE: {rmse}')

RMSE: 2.853698227091948
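
The row-by-row loop above silently skips test pairs whose user or movie is missing from the training matrix. A vectorized variant, sketched below on made-up data, does the same lookup in one NumPy indexing step and also reports how many test rows were covered. `toy_recon` and `toy_test` are hypothetical stand-ins for `reconstructed_df` and `test`.

```python
import numpy as np
import pandas as pd

toy_recon = pd.DataFrame(
    [[4.0, 1.0], [2.0, 3.0]],
    index=pd.Index([1, 2], name="uID"),
    columns=pd.Index([10, 20], name="mID"),
)
toy_test = pd.DataFrame({
    "uID": [1, 2, 3],
    "mID": [10, 20, 10],  # user 3 never appears in the toy training data
    "rating": [5, 3, 4],
})

# Keep only test rows whose user and movie both exist in the reconstruction.
known = toy_test["uID"].isin(toy_recon.index) & toy_test["mID"].isin(toy_recon.columns)
covered = toy_test[known]

# Map labels to positional indices and read all predictions in one lookup.
row_pos = toy_recon.index.get_indexer(covered["uID"])
col_pos = toy_recon.columns.get_indexer(covered["mID"])
preds = toy_recon.to_numpy()[row_pos, col_pos]

toy_rmse = np.sqrt(np.mean((covered["rating"].to_numpy() - preds) ** 2))
coverage = known.mean()
print(f"RMSE: {toy_rmse:.4f}, coverage: {coverage:.2%}")
```

Reporting coverage alongside RMSE makes it clear when the score is computed on only part of the test set.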


### Discussion on Results
#### Results Comparison

The RMSE for Non-Negative Matrix Factorization (NMF) is 2.853698227091948, more than double the RMSE of the baseline and collaborative filtering models from Module 3:

* Baseline model (predicting a rating of 3 for every user-movie pair): RMSE = 1.2585510334053043
* Collaborative filtering: RMSE = 1.0327092415275554

#### Analysis of Why NMF Performed Poorly

- **Non-negativity Constraint:** NMF restricts both factor matrices to non-negative values, so a user's dislike of a trait cannot be expressed as a negative weight. This can limit the patterns the model captures. The collaborative filtering methods from Module 3 have no such constraint.
- **Scalability:** NMF can be computationally intensive and may not scale well to larger datasets. If the solver stops at its iteration limit before converging, the resulting factorization is suboptimal.
- **Initialization Sensitivity:** NMF results depend on the initial values of the factors, and the model above uses `init='random'`. Poor initialization can lead to convergence to a local minimum and a suboptimal solution.
- **Model Complexity:** Simple baselines and collaborative filtering can outperform a latent-factor model such as NMF when the data, as presented to the model, does not expose strong latent factors.
- **Sparsity of Data:** Most cells of the user-item matrix are unrated, and the preprocessing step fills every one of them with 0. sklearn's NMF has no way to mask those cells, so it fits them as if they were real ratings of 0. This pulls the reconstruction toward zero: most values in `reconstructed_df` above are below 1, far from the ratings seen in the test set. Collaborative filtering handles sparsity better because it uses only observed ratings through user-user or item-item similarities.
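
The effect of zero-filling a sparse matrix can be illustrated with a small synthetic experiment. This is a sketch under stated assumptions, not a reproduction of the notebook's data: the "true" ratings are generated from made-up low-rank factors, about 20% of the cells are observed, and all variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Synthetic "true" ratings in [1, 5] built from made-up low-rank factors.
n_users, n_items, k = 60, 40, 3
true = rng.random((n_users, k)) @ rng.random((k, n_items))
true = 1 + 4 * (true - true.min()) / (true.max() - true.min())

# Observe only about 20% of the cells; the rest become zeros, as with fillna(0).
observed = rng.random(true.shape) < 0.2
zero_filled = np.where(observed, true, 0.0)

model = NMF(n_components=k, init="random", random_state=42, max_iter=1000)
recon = model.fit_transform(zero_filled) @ model.components_

# Compare predictions with the truth on the cells the model never saw.
hidden = ~observed
print("mean true rating (hidden cells):     ", true[hidden].mean())
print("mean predicted rating (hidden cells):", recon[hidden].mean())
```

Since the hidden cells were presented to the model as zeros, the predictions there tend to sit well below the true ratings, which is the same pattern visible in the reconstructed matrix above.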

#### Suggestions for Improvement

- **Hybrid Methods:** Combine NMF with collaborative filtering to leverage the strengths of both approaches. For example, use NMF to learn latent factors and collaborative filtering to refine predictions based on similarities.
- **Regularization:** Add regularization terms to the NMF loss function to prevent overfitting and improve generalization. Regularization helps control the complexity of the model and can lead to better performance.
- **Better Initialization:** Use a structured initialization such as non-negative double singular value decomposition (NNDSVD), available in sklearn as `init='nndsvd'` or `init='nndsvda'`, which can provide a better starting point for the NMF algorithm and improve convergence.
- **Parameter Tuning:** Experiment with different numbers of components (n_components) and other hyperparameters to find the optimal configuration for the dataset.
- **Incorporate Additional Data:** Use auxiliary information (e.g., user demographics, item metadata) to enhance the factorization process. This can help NMF learn more accurate latent factors by providing more context.
- **Alternating Least Squares (ALS):** Consider using ALS-based matrix factorization, which can sometimes perform better than NMF for recommendation tasks. ALS alternates between fixing one factor matrix and solving for the other, which can lead to better performance on sparse data.
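
As a rough illustration of the ALS and regularization suggestions, the sketch below fits the factors only on observed cells and adds an L2 penalty. It is a minimal NumPy version written for this example, not sklearn's algorithm, and unlike NMF it does not enforce non-negativity. The function name `masked_als`, its parameters, and the toy matrix `R` are all assumptions made for the illustration.

```python
import numpy as np

def masked_als(R, mask, k=2, lam=0.1, n_iters=20, seed=0):
    """ALS that fits only observed cells (mask == True) with L2 regularization."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    for _ in range(n_iters):
        # Fix V and solve a small ridge regression per user over rated items.
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Vo = V[obs]
                U[u] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ R[u, obs])
        # Fix U and solve per item over the users who rated it.
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Uo = U[obs]
                V[i] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ R[obs, i])
    return U, V

# Made-up ratings; 0 marks a missing rating and is excluded by the mask.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U, V = masked_als(R, mask)
pred = U @ V.T
train_rmse = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
print(f"RMSE on observed cells: {train_rmse:.3f}")
```

The key design choice is the mask: missing cells never enter the least-squares problems, so they are not treated as ratings of 0.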

#### Conclusion

The higher RMSE for NMF compared to baseline and collaborative filtering methods highlights the limitations of NMF in handling the specific characteristics of the dataset used in Module 3. By addressing these limitations through hybrid methods, regularization, better initialization, and parameter tuning, the performance of NMF can be improved, potentially leading to more accurate predictions and better overall recommendations.