# The Issues with Fitting Then Transforming for NMF Recommendation Systems
In scikit-learn, the fit() method learns a model's parameters from a given dataset, while the transform() method applies the fitted model to a dataset and returns the transformed result.

On the other hand, the fit_transform() method is used to both train the model and transform the dataset in a single step.

While fit() and transform() can be called separately, some estimators implement fit_transform() more efficiently than the two separate calls. For some models, NMF among them, the two routes may also return slightly different numbers: fit_transform() returns the factor found while fitting, whereas transform() re-solves for it with the learned components held fixed.
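The difference between the two routes can be checked directly. The following is a minimal sketch on a small random matrix (an assumption standing in for real play counts; the shapes, `n_components=4`, and the `init` and `max_iter` settings are illustrative choices, not taken from the notebook). It only measures the gap between the two results and does not claim how large that gap will be on other data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Small synthetic non-negative matrix (illustrative stand-in for play counts)
rng = np.random.default_rng(0)
X = rng.random((30, 12))

# Route 1: a single call; W is the factor produced while fitting
W_joint = NMF(
    n_components=4, init="nndsvda", random_state=0, max_iter=500
).fit_transform(X)

# Route 2: fit() learns the components, then transform() solves for W
# again with those components held fixed
model = NMF(n_components=4, init="nndsvda", random_state=0, max_iter=500).fit(X)
W_two_step = model.transform(X)

# Compare the two results element by element
max_abs_diff = float(np.abs(W_joint - W_two_step).max())
print(W_joint.shape, W_two_step.shape)
print("largest absolute difference:", max_abs_diff)
```

If the printed difference is small relative to the entries of W, the choice between the two routes is unlikely to explain large changes in similarity scores on its own.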

However, only estimators that implement transform() offer fit_transform(), and in some cases using fit_transform() is not appropriate. For example, when dealing with large datasets or with data that arrives later, it may be more efficient to call fit() once to train the model and then apply transform() to multiple batches of data.
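The fit-once, transform-many pattern can be sketched as follows. The data here is random and the batch sizes are arbitrary (both are assumptions for illustration); MaxAbsScaler is used only because it is the first step of the pipeline later in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Illustrative data: one training set and three later batches
rng = np.random.default_rng(0)
train = rng.random((100, 5)) * 50
new_batches = [rng.random((10, 5)) * 50 for _ in range(3)]

# Learn the per-column scaling factors once, from the training data only
scaler = MaxAbsScaler().fit(train)

# Reuse the same fitted scaler on every batch; nothing is re-learned here
scaled_batches = [scaler.transform(batch) for batch in new_batches]

print(scaler.max_abs_)          # scaling factors learned from `train`
print(scaled_batches[0].shape)  # each batch keeps its shape
```

Calling fit_transform() on each batch instead would re-learn the scaling factors per batch, so the batches would no longer share a common scale.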

Therefore, it is important to read the documentation for each model and method to understand when to use each one appropriately, and avoid using them interchangeably.

In [1]:
# Something is wrong with this program in comparison to the NMF_Music_Recomendation_System.ipynb
# Problems: mismatched names; the recommendations do not appear to be correct, nor do they mirror the other script.
# Potential causes: differences between .fit_transform() and .fit() followed by .transform(), or the data was not appended correctly.
# Ex: For Foo Fighters, Dr. Dre should NOT score 0.86 while The Killers scores only 0.27

# fit_transform() fits the model and transforms the input data in a single call.
# Calling fit() and then transform() separately on the same data is less efficient,
# so fit_transform() is preferred when both steps are needed.

# Imports
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MaxAbsScaler, Normalizer
from sklearn.pipeline import make_pipeline

artist_df = pd.read_csv('/Users/alexandergursky/Local_Repository/Datasets/Dataset_Package/Musical artists/artists.csv', header=None)
samples_df = pd.read_csv('/Users/alexandergursky/Local_Repository/Datasets/Dataset_Package/Musical artists/scrobbler-small-sample.csv')

In [2]:
# Mapping 
artist_df['artist_key'] = artist_df.index   # Getting the index and creating a new column

merged_df = pd.merge(
    artist_df,
    samples_df,
    left_on= 'artist_key',
    right_on= 'artist_offset'
)
# Dropping columns that I don't need
merged_df = merged_df.drop(columns=['artist_key','artist_offset'])

In [3]:
# Renaming column 0 (the artist names)
merged_df = merged_df.rename(columns={0: 'artist_names'})

# Creating the artist-by-user play-count matrix (mostly zeros)
sparse_df = merged_df.pivot_table(
    index= 'artist_names',
    columns= 'user_offset',
    values= 'playcount',
    fill_value= 0
)


In [4]:
# Creating the model

scaler = MaxAbsScaler()     # Scale each column by its maximum absolute value
nmf = NMF(n_components= 20) # The NMF model, 20 components (latent "genres")
norm = Normalizer()         # Scale each row to unit length so dot products act as cosine similarities

pipeline = make_pipeline(scaler, nmf, norm)

In [5]:
################### The issue happened here

# Fitting and transforming the model to the data
pipeline.fit(sparse_df)

piped_data = pipeline.transform(sparse_df)


In [8]:

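# Use the artist names as row labels for the new df
# Note: these names are in file order, while the rows of sparse_df follow the pivot table's sorted index
artist_names_list = artist_df[0].values.tolist()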

final_data = pd.DataFrame(
    piped_data,
    index= artist_names_list
)
# Select observation
selected = final_data.loc['Foo Fighters']

# Dot product of every row with the selected artist
recomendation = final_data.dot(selected)

# Print the top recommendations
print(recomendation.nlargest())

Foo Fighters                   1.000000
Nick Cave and the Bad Seeds    0.868229
Dr. Dre                        0.826468
The Flaming Lips               0.823686
The White Stripes              0.817639
dtype: float64
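
One plausible cause of the mismatched names, offered as a hypothesis rather than a confirmed diagnosis, is row ordering: by default `pivot_table` returns its index sorted, so the rows of the pivoted matrix need not follow the order of the names in the original file. The sketch below uses made-up artist names and play counts (all hypothetical) to show the effect and one way to keep labels aligned.

```python
import pandas as pd

# Hypothetical toy data: names in file order, which is not alphabetical
names_in_file_order = ["Zebra Band", "Alpha Group", "Mid Trio"]
plays = pd.DataFrame({
    "artist_names": ["Zebra Band", "Zebra Band", "Alpha Group", "Mid Trio"],
    "user_offset": [0, 1, 0, 1],
    "playcount": [5, 3, 7, 2],
})

pivot = plays.pivot_table(
    index="artist_names",
    columns="user_offset",
    values="playcount",
    fill_value=0,
)

# The pivot's rows follow its own sorted index, not the file order
print(list(pivot.index))
rows_match_file_order = list(pivot.index) == names_in_file_order
print("rows follow file order:", rows_match_file_order)

# Labelling with the pivot's own index keeps each row with its artist
features = pd.DataFrame(pivot.to_numpy(), index=pivot.index)
print(features.loc["Zebra Band"].tolist())
```

In the notebook above, the equivalent change would be to build `final_data` with `index=sparse_df.index` instead of the list taken from `artist_df`, and then compare the resulting scores with those of the other script.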
