<h1 align="center">
    NSDC Data Science Projects
</h1>
  
<h2 align="center">
    Project: Movie Recommendation System
</h2>

<h3 align="center">
    Name: Mohd Atif Khan
</h3>


# **Singular Value Decomposition**

---

# Crash Course on Singular Value Decomposition (SVD)

## Introduction

Singular Value Decomposition, or SVD, is a mathematical technique used in many fields such as signal processing, statistics, and machine learning, particularly in the context of recommendation systems. It's a method for decomposing a matrix into three other matrices that reveal its underlying structure.

## Basic Concepts

### Matrices
- **Matrix**: A rectangular array of numbers.
- **Dimension of a Matrix**: Given in the form of rows × columns.

### Decomposition
- **Decomposition**: Breaking down a complex matrix into simpler, understandable parts.

## What is SVD?

```
SVD breaks down any given matrix A into three separate matrices named U, Σ and V*,
i.e., A = UΣV*
```
Where the components are:
```
- A: Original matrix.
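- U: Matrix whose columns are the left singular vectors (orthogonal for a real A, unitary for a complex A).
- Σ: Diagonal matrix of non-negative singular values, conventionally sorted from largest to smallest.
- V*: Conjugate transpose of V, whose columns are the right singular vectors (for a real A, V is orthogonal and V* is simply its transpose).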
```
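To make the decomposition concrete, here is a minimal sketch using NumPy's `np.linalg.svd`. The matrix `A` is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# A small real-valued matrix chosen only for illustration.
A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])

# full_matrices=False returns the "economy" decomposition:
# U is 4x3, s holds the 3 singular values, Vt is V* (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from its three factors: A = U Σ V*
A_rebuilt = U @ np.diag(s) @ Vt

print("Singular values:", s)
print("Reconstruction matches A:", np.allclose(A, A_rebuilt))
```

NumPy returns the singular values as a 1-D array rather than as a full diagonal matrix, which is why `np.diag(s)` is needed when multiplying the factors back together.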




## Where do we use SVDs?

### Applications in Recommendation Systems

In recommendation systems, SVD is used to predict unknown preferences by decomposing a large matrix of user-item interactions into factors representing latent features. It helps in capturing the underlying patterns in the data.

### Process

1. **Matrix Creation**: Start with a matrix where rows represent users, columns represent items, and entries represent user ratings.
2. **Apply SVD**: Decompose this matrix using SVD.
3. **Latent Features**: The decomposition reveals latent features that explain observed ratings.
4. **Prediction**: Use the decomposed matrices to predict missing ratings.
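The four steps above can be sketched on a toy matrix. This is only an illustration under simplifying assumptions: the ratings are invented, missing entries are filled with each user's mean before decomposing, and the number of latent features `k` is picked by hand. Libraries such as Surprise use a different, iterative fitting procedure.

```python
import numpy as np

# Step 1: hypothetical 4 users x 4 movies ratings; np.nan marks a missing rating.
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, 5.0, 1.0, np.nan],
              [1.0, np.nan, 5.0, 4.0],
              [np.nan, 1.0, 4.0, 5.0]])

# Plain SVD needs a complete matrix, so fill gaps with each user's mean rating.
user_means = np.nanmean(R, axis=1, keepdims=True)
R_filled = np.where(np.isnan(R), user_means, R)

# Step 2: decompose the mean-centered matrix.
U, s, Vt = np.linalg.svd(R_filled - user_means, full_matrices=False)

# Step 3: keep only the k strongest latent features.
k = 2
R_hat = user_means + U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Step 4: read off predictions for the originally missing entries.
missing = np.isnan(R)
print(np.round(R_hat, 2))
print("Predicted missing ratings:", np.round(R_hat[missing], 2))
```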

### Advantages of an SVD
- Effective at uncovering latent features in the data.
- Reduces dimensionality, making computations more manageable.

### Limitations of an SVD
- Assumes linear relationships in data.
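- Sensitive to missing data and outliers: classical SVD is only defined for a complete matrix, so sparse rating data must be imputed or fitted with an approximate factorization, and extreme values can distort the learned factors.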

#### Through this project, we will learn how to build a movie recommendation system using an SVD


#### Dataset being used: **MovieLens 100k dataset**

- This specific dataset, often referred to as "ml-100k," contains 100,000 ratings from 943 users on 1,682 movies. The data was collected through the MovieLens website during the seven-month period from September 19th, 1997 to April 22nd, 1998.

- **Data Structure**: The dataset includes user ratings that range from 1 to 5. Additionally, it provides demographic information about the users (age, gender, occupation, etc.) and details about the movies (titles, genres).

- **Usage**: It's a standard dataset used for implementing and testing recommender systems. Its size is manageable, making it a popular choice for educational purposes and for initial experimentation with recommendation algorithms.

- **Significance**: The diversity in the dataset, both in terms of users and movie genres, provides a rich ground for analyzing different recommendation strategies, testing algorithms like SVD, and understanding user preferences and behavioral patterns.

This dataset is an excellent starting point for anyone looking to delve into the world of recommender systems and practice with real-world data.
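A quick calculation with the counts quoted above shows how sparse the user-movie matrix is. This is a back-of-the-envelope sketch; the variable names are illustrative.

```python
n_ratings = 100_000
n_users = 943
n_movies = 1_682

# Every user could in principle rate every movie.
possible = n_users * n_movies
density = n_ratings / possible

print(f"Possible user-movie pairs: {possible:,}")
print(f"Filled: {density:.2%}, missing: {1 - density:.2%}")
```

Only about 6.3% of the possible ratings are present, which is why the model has to infer the rest from latent patterns.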


Now, we will write some code to understand and explore the dataset.

In [1]:
!pip install surprise
!pip install numpy==1.26.4


In [1]:
#Importing necessary modules for this project
import pandas as pd
from surprise import Dataset
from surprise.model_selection import train_test_split

In [2]:
# Install pandas and scikit-surprise. Run this only once; you can comment out this line afterwards.
!pip install pandas scikit-surprise



### How do the predictions work?

1. **Model Training**:
   - The SVD algorithm is first trained on a portion of the dataset, which includes user ratings for various movies.
   - During training, the model learns to associate certain patterns and characteristics of users and movies with specific rating behaviors.

2. **Latent Features Extraction**:
   - SVD decomposes the rating matrix into matrices representing latent features of users and movies.
   - These latent features capture underlying aspects that affect rating behavior but are not explicitly available in the data (like user preferences or movie characteristics).

3. **Making Predictions**:
   - Once the model is trained, it can predict ratings for user-movie pairs where the actual rating is unknown.
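   - The core of the prediction is the dot product of the latent feature vectors of the user and the movie; Surprise's `SVD` additionally adds the global mean rating and learned user and item bias terms. The result represents the estimated preference of the user for that particular movie based on the learned patterns.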

4. **Example of a Prediction**:
   - Suppose we want to predict how user `U` would rate movie `M`.
   - The model uses the latent features it has learned for user `U` and movie `M` to compute a predicted rating.
   - This prediction is a numerical value, typically on the same scale as the original ratings (e.g., 1 to 5).

5. **Application**:
   - These predictions are used to recommend movies to users.
   - For example, the system can recommend movies that have the highest predicted ratings for a particular user.

6. **Handling New Users or Movies (Cold Start Problem)**:
   - One challenge is predicting ratings for new users or movies that have little to no rating history. This is known as the cold start problem.
   - Solutions might involve using content-based approaches or hybrid models that don't rely solely on historical rating data.
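As a rough sketch of the arithmetic behind a single prediction, assume the biased matrix-factorization form that Surprise documents for its `SVD` class: global mean, plus user bias, plus item bias, plus the dot product of the two factor vectors. All numbers below are invented; a trained model would learn them from the data.

```python
import numpy as np

# Made-up values for illustration only.
global_mean = 3.53                          # average rating in the training data
user_bias = 0.20                            # this user rates slightly above average
item_bias = -0.10                           # this movie is rated slightly below average
user_factors = np.array([0.3, -0.5, 0.1])   # latent features of the user
item_factors = np.array([0.4, -0.2, 0.6])   # latent features of the movie

# Baseline plus the dot product of the two latent-feature vectors.
raw = global_mean + user_bias + item_bias + user_factors @ item_factors

# Keep the estimate on the 1-5 rating scale.
estimate = float(np.clip(raw, 1.0, 5.0))
print(round(estimate, 3))
```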

In [6]:
data = Dataset.load_builtin('ml-100k')
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rating", "timestamp"])

print(df.head())

  user item  rating  timestamp
0  196  242     3.0  881250949
1  186  302     3.0  891717742
2   22  377     1.0  878887116
3  244   51     2.0  880606923
4  166  346     1.0  886397596


We see the following columns:

* **User ID**: A unique identifier for the user who provided the rating.

* **Item ID (Movie ID)**: A unique identifier for the movie that was rated.

* **Rating:** The rating given to the movie by the user. In the MovieLens 100k dataset, ratings are whole numbers from 1 to 5.

* **Timestamp:** The time at which the rating was provided, stored as Unix time: the number of seconds since the Unix epoch (January 1, 1970, 00:00 UTC).
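As a small illustration of what a Unix timestamp encodes, the standard library can decode the timestamp from the first row printed above. This sketch assumes the value should be interpreted in UTC.

```python
from datetime import datetime, timezone

ts = 881250949  # timestamp of the first row in the output above

# Interpret the seconds-since-epoch value as a UTC date and time.
readable = datetime.fromtimestamp(ts, tz=timezone.utc)
print(readable.strftime("%Y-%m-%d %H:%M:%S"))
```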



In [8]:
#TODO - Describe the statistics of this dataset.
# Hint: Use the describe() function
df.describe()

|       | rating   |
|-------|----------|
| count | 100000.0 |
| mean  | 3.52986  |
| std   | 1.125674 |
| min   | 1.0      |
| 25%   | 3.0      |
| 50%   | 4.0      |
| 75%   | 4.0      |
| max   | 5.0      |






Now, we will do some data preprocessing.

This will include:
*   Checking for missing values
*   Converting timestamps to a readable format
*   Splitting the data into testing and training subsets



In [9]:
print(df.isnull().sum())

user         0
item         0
rating       0
timestamp    0
dtype: int64


We see that there are no missing values.

In [10]:
# Convert timestamp to a readable format
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
print(df.head())



  user item  rating           timestamp
0  196  242     3.0 1997-12-04 15:55:49
1  186  302     3.0 1998-04-04 19:22:22
2   22  377     1.0 1997-11-07 07:18:36
3  244   51     2.0 1997-11-27 05:02:03
4  166  346     1.0 1998-02-02 05:33:16


In [11]:
# Split the data into a training set and a test set
trainset, testset = train_test_split(data, test_size=0.20)

# Display the number of users and items in the training set
print(f"Number of users: {trainset.n_users}")
print(f"Number of items: {trainset.n_items}")

# Display the first few elements of the test set
print(testset[:5])

Number of users: 943
Number of items: 1656
[('339', '94', 2.0), ('321', '614', 3.0), ('456', '380', 3.0), ('933', '175', 4.0), ('758', '303', 4.0)]


### Hyperparameter Tuning in SVD
Hyperparameter tuning is a critical step in optimizing the performance of an SVD model. The goal is to find the best combination of parameters that results in the most accurate predictions or lowest error rates.

#### Hyperparameters we will be tuning in this project

1. **`n_factors`**:
   - Represents the number of latent factors (or features) to extract from the dataset.
   - The values `[50, 100, 150]` are chosen to test the model's performance with a varying number of factors. A higher number of factors can capture more complex patterns but may lead to overfitting and increased computation time.

2. **`n_epochs`**:
   - Refers to the number of iterations over the entire dataset during training.
   - The values `[20, 30]` provide a range to evaluate whether more iterations improve model performance or lead to overtraining.

3. **`lr_all`** (Learning Rate):
   - Determines the step size at each iteration while moving toward a minimum of the loss function.
   - The values `[0.005, 0.010]` are chosen to test how fast the model learns. A smaller learning rate may lead to more precise convergence but requires more epochs.

4. **`reg_all`** (Regularization Term):
   - Helps prevent overfitting by penalizing larger model parameters.
   - The values `[0.02, 0.1]` offer a range to assess the impact of regularization on model performance. Higher regularization can reduce overfitting but may lead to underfitting.
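To see how quickly a grid grows, the combinations can be enumerated with `itertools.product`. This is a sketch for counting only and does not train any model; each combination would still have to be fitted once per cross-validation fold.

```python
from itertools import product

param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1],
}

# Build one dict per combination of hyperparameter values.
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

print(f"Number of combinations: {len(combos)}")
print("First combination:", combos[0])
```

With 3 × 2 × 2 × 2 = 24 combinations, adding even one more value to a list noticeably increases the search time.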

In [12]:
# Define a grid of SVD hyperparameters explained above for tuning
param_grid = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1]
}


Now, we will train the model with the following parameters:

1. **`SVD`**:
   - This is the recommendation algorithm being tuned. SVD is a popular algorithm used in recommendation systems, particularly for matrix factorization.

2. **`param_grid`**:
   - This is a dictionary where keys are hyperparameter names, and values are lists of parameter settings to try as values. It defines the grid of parameters that will be tested.
   - Example: If `param_grid` is `{'n_factors': [50, 100], 'lr_all': [0.005, 0.01]}`, GridSearchCV will evaluate the SVD algorithm for all combinations of `n_factors` and `lr_all` from these lists.

3. **`measures=['RMSE', 'MAE']`**:
   - These are the performance metrics used to evaluate the algorithm.
   - `RMSE` stands for Root Mean Square Error, and `MAE` stands for Mean Absolute Error. Both are common metrics for evaluating the accuracy of prediction algorithms, with lower values indicating better performance.

4. **`cv=3`**:
   - This specifies the number of folds for cross-validation.
   - In this context, `cv=3` means that a 3-fold cross-validation will be used. The dataset will be split into three parts: in each iteration, two parts will be used for training, and one part will be used for testing. This process repeats three times, each time with a different part used for testing.
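The 3-fold idea can be sketched with plain NumPy indices. This is not Surprise's own implementation; the helper `k_fold_indices` and its `seed` argument are hypothetical names used only to show how each sample lands in the test part exactly once.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        # All remaining folds together form the training part.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=9, k=3))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {fold}: train={sorted(train_idx.tolist())}, "
          f"test={sorted(test_idx.tolist())}")
```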

In [13]:
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise import SVD, Dataset, Reader, accuracy

# Perform grid search with cross-validation to find the best hyperparameters for our model
gs = GridSearchCV(SVD, param_grid, measures=['RMSE', 'MAE'], cv=3)
gs.fit(data)

In [17]:
# Best score and parameters
print(f"Best RMSE: {gs.best_score['rmse']}")
print(f"Best parameters: {gs.best_params['rmse']}")

Best RMSE: 0.9207675147239538
Best parameters: {'n_factors': 100, 'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.1}


In [18]:
# TODO - Use the best model. Index the best_estimator attribute of gs by the measure name
algo = gs.best_estimator['rmse']

In [19]:

# TODO - Train and test split. Make sure test_size is 0.25
trainset, testset = train_test_split(data, test_size=0.25)

# TODO - Fit the trainset to train the model
algo.fit(trainset)

# TODO - Make predictions on the testset
predictions = algo.test(testset)

# TODO - Calculate and print RMSE on the predictions made
accuracy.rmse(predictions)


RMSE: 0.9155


0.9155236978867809
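For intuition about what this number measures, RMSE and MAE can be computed by hand on a few ratings. The values below are made up and are not the notebook's predictions.

```python
import math

# Hypothetical true and predicted ratings.
true_ratings = [3.0, 5.0, 4.0, 5.0]
predicted = [3.5, 4.0, 4.0, 4.5]

errors = [p - t for p, t in zip(predicted, true_ratings)]

# MAE: average size of the error, ignoring its sign.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error; large misses weigh more.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Because errors are squared before averaging, RMSE is never smaller than MAE on the same predictions.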

In [20]:
#Predict rating for a user and item
user_id = '196'  # replace with a specific user ID
item_id = '302'  # replace with a specific item (movie) ID
predicted_rating = algo.predict(user_id, item_id)
print(f"Predicted rating for user {user_id} and item {item_id}: {predicted_rating.est}")

Predicted rating for user 196 and item 302: 4.12849488887813


In [21]:
# To inspect the predictions in detail, let's print the first 10 predictions made by the model
for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {prediction.est}')


Prediction 0: User 643 and item 716 has true rating 3.0, and the predicted rating is 3.2285984078126706
Prediction 1: User 249 and item 248 has true rating 5.0, and the predicted rating is 4.173906097923602
Prediction 2: User 488 and item 176 has true rating 4.0, and the predicted rating is 3.6959666929934003
Prediction 3: User 10 and item 488 has true rating 5.0, and the predicted rating is 4.4765849434840606
Prediction 4: User 655 and item 293 has true rating 4.0, and the predicted rating is 3.2198532790829266
Prediction 5: User 320 and item 1 has true rating 3.0, and the predicted rating is 4.250431024517389
Prediction 6: User 455 and item 293 has true rating 4.0, and the predicted rating is 3.4822138720590403
Prediction 7: User 549 and item 225 has true rating 3.0, and the predicted rating is 3.332980234064579
Prediction 8: User 629 and item 699 has true rating 3.0, and the predicted rating is 4.131851607398521
Prediction 9: User 234 and item 152 has true rating 4.0, and the predic

#### Rounding Numbers
Rounding values is a technique used to simplify numbers, but its appropriateness depends on the context:

**When to Round**
1. **Simplification**: For estimations.
2. **Reporting**: When exact figures aren't necessary (e.g., in everyday language).
3. **Data Analysis**: To focus on significant trends by ignoring minor variations.
4. **Financial Transactions**: Rounding to the smallest currency unit.
5. **Display Purposes**: For clarity in graphs or tables.

**When NOT to Round**
1. **Intermediate Calculations**: Early rounding can lead to significant final errors.
2. **Legal/Regulatory Documents**: Require exact figures.
3. **Scientific/Engineering Work**: Precision is crucial.
4. **Critical Calculations**: In health, safety, or finance, precision is essential.

To summarize,
- Rounding depends on the purpose and context of the calculation.
- It is useful for simplification and clarity but should be avoided when precision is critical.
- We must be aware of potential cumulative errors in sequential calculations.
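One detail worth knowing before rounding in Python: the built-in `round()` rounds exact halves to the nearest even integer rather than always up. The helper `round_half_up` below is a hypothetical alternative shown only for comparison.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, with exact halves going up."""
    return math.floor(x + 0.5)

# Compare the two behaviours on values that sit exactly between two integers.
for value in (2.5, 3.5, 4.5):
    print(f"{value}: round() -> {round(value)}, half-up -> {round_half_up(value)}")
```

Predicted ratings rarely land exactly on a half, so the difference seldom matters here, but it can surprise readers who expect 2.5 to round to 3.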


Let us round the predicted values so that they fall within the rating categories [1.0, 2.0, 3.0, 4.0, 5.0].

In [23]:
#TODO - Round the prediction.est variable being printed. Use python's default rounding function to achieve this

for idx, prediction in enumerate(predictions[:10]):
    # Using round() instead of math.ceil
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {round(prediction.est)}')


Prediction 0: User 643 and item 716 has true rating 3.0, and the predicted rating is 3
Prediction 1: User 249 and item 248 has true rating 5.0, and the predicted rating is 4
Prediction 2: User 488 and item 176 has true rating 4.0, and the predicted rating is 4
Prediction 3: User 10 and item 488 has true rating 5.0, and the predicted rating is 4
Prediction 4: User 655 and item 293 has true rating 4.0, and the predicted rating is 3
Prediction 5: User 320 and item 1 has true rating 3.0, and the predicted rating is 4
Prediction 6: User 455 and item 293 has true rating 4.0, and the predicted rating is 3
Prediction 7: User 549 and item 225 has true rating 3.0, and the predicted rating is 3
Prediction 8: User 629 and item 699 has true rating 3.0, and the predicted rating is 4
Prediction 9: User 234 and item 152 has true rating 4.0, and the predicted rating is 3


<h3 align = 'center' >
Thank you for completing the project!
</h3>

Please submit all materials to the NSDC HQ team at nsdc@nebigdatahub.org in order to receive a virtual certificate of completion. Do reach out to us if you have any questions or concerns. We are here to help you learn and grow.
