<div style="font-family: 'Cinzel', serif; background-color: #A6CEE3; padding: 30px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title -->
  <div style="font-size: 36px; font-weight: bold; color: #ffffff; text-align: center; letter-spacing: 1px;">
    🎓 Coursera Course Recommender System
  </div>
</div>

<h2 style="font-family: 'Cinzel', serif;color: #5C7D8C;">🎯 Purpose</h2>
<p>This notebook demonstrates how to build a Coursera course recommendation system using machine learning techniques. It recommends relevant courses by comparing text fields (course names, descriptions, and skills) and then uses course ratings to adjust the ranking.</p>

<hr>

<h2 style="font-family: 'Cinzel', serif;color: #5C7D8C;">🔗 Useful Links</h2>
<ul>
    <li><a href="https://github.com/bushraqurban/LearnStream/" target="_blank">GitHub Repository</a></li>
    <li><a href="https://mechanical-oralia-bushra-e3bf072d.koyeb.app" target="_blank">Live App</a></li>
</ul>

<hr>

<h2 style="font-family: 'Cinzel', serif;color: #5C7D8C;">🗂️ Dataset Overview</h2>
<ul>
    <li>The dataset is from the <a href="https://www.kaggle.com/khusheekapoor/coursera-courses-dataset-2021" target="_blank">Coursera Courses Dataset</a> and contains over 3,000 courses.</li>
    <li>We use course details such as the name, description, and skills to build the recommendation model by calculating semantic similarities. Additionally, we incorporate course ratings to adjust the recommendations, prioritizing highly rated, relevant courses.</li>
</ul>

<hr>

<h2 style="font-family: 'Cinzel', serif;color: #5C7D8C;">🎓 Walkthrough of This Notebook</h2>
<p>This notebook is a step-by-step guide to building a recommendation system. Here's the high-level process:</p>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">1. Import Essential Libraries</h3>
<p>We start by importing libraries such as <code>pandas</code>, <code>scikit-learn</code>, and <code>nltk</code> to process the data and perform the machine learning tasks.</p>
<pre><code>import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import TruncatedSVD
import pickle
import nltk
import re
from nltk.corpus import wordnet</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">2. Load the Dataset</h3>
<p>We load the dataset containing 3,522 Coursera course rows into a <code>pandas</code> DataFrame and inspect it for structure, missing values, and duplicates.</p>
<pre><code>data = pd.read_csv("/kaggle/input/coursera-courses-dataset-2021/Coursera.csv", encoding='utf-8')
data.head()</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">3. Data Preprocessing</h3>
<p>We clean the course name, description, and skills columns by removing non-ASCII and non-letter characters, converting text to lowercase, and lemmatizing each word.</p>
<pre><code>def clean_for_tags(text):
    text = re.sub(r'��+', '', text) 
    text = re.sub(r'[^\x00-\x7F]+', '', text) 
    text = re.sub(r'[^a-zA-Z\s]', '', text) 
    text = text.lower()  
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])  
    return text</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">4. Text Vectorization with TF-IDF</h3>
<p>We convert the text into numerical vectors using the TF-IDF vectorizer. TF-IDF weights each term by how often it appears in a course's text and how rare it is across all courses, so distinctive terms count for more than common ones.</p>
<pre><code>vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(training_data['tags'])</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">5. Dimensionality Reduction with SVD</h3>
<p>To make the data more manageable, we reduce the TF-IDF matrix from 5,000 features to 100 components using truncated Singular Value Decomposition (SVD), which makes the similarity computation cheaper.</p>
<pre><code>svd = TruncatedSVD(n_components=100, random_state=42)
tfidf_matrix = svd.fit_transform(tfidf_matrix)</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">6. Cosine Similarity</h3>
<p>We calculate the cosine similarity between courses based on their SVD-reduced TF-IDF vectors. This helps us identify courses that are most similar to each other.</p>
<pre><code>similarity_matrix = cosine_similarity(tfidf_matrix)</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">7. Get Recommendations</h3>
<p>We create a function that returns the top N courses most similar to a selected course. Course ratings are blended into a final score, weighted by <code>rating_weight</code>, which determines the order of those N results.</p>
<pre><code>def get_recommendations(course_name, data, similarity_matrix, top_n=3, rating_weight=0.05):</code></pre>

<h3 style="font-family: 'Cinzel', serif;color: #5C7D8C;">8. Save the Model</h3>
<p>Finally, we save the similarity matrix with <code>pickle</code> so it does not have to be recomputed each time.</p>
<pre><code>with open('similarity_matrix.pkl', 'wb') as f:
    pickle.dump(similarity_matrix, f)</code></pre>


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Importing Essential Tools
  </div>
</div>



In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import TruncatedSVD
import pickle
import nltk
import re
from nltk.corpus import wordnet

# Download wordnet once (if needed)
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')


print('Dependencies Imported')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Dependencies Imported


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Load Dataset
  </div>
</div>



In [2]:
data = pd.read_csv("/kaggle/input/coursera-courses-dataset-2021/Coursera.csv", encoding='utf-8')

In [3]:
data.head()

,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,�cole Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Basic Data Inspection
  </div>
</div>



In [4]:
data.shape

(3522, 7)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


In [6]:
data.isnull().sum()

Course Name           0
University            0
Difficulty Level      0
Course Rating         0
Course URL            0
Course Description    0
Skills                0
dtype: int64

In [7]:
data.nunique()

Course Name           3416
University             184
Difficulty Level         5
Course Rating           31
Course URL            3424
Course Description    3397
Skills                3424
dtype: int64

In [8]:
data.duplicated().sum()

98

In [9]:
# Remove duplicates based on specific columns
data = data.drop_duplicates(subset=['Course Name', 'University', 'Difficulty Level', 'Course Rating',
       'Course URL', 'Course Description'])
data.shape

(3424, 7)

<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Text Preprocessing on Training Data
  </div>
</div>



In [10]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('wordnet2022')

! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet # temp fix for lookup error.

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet2022 to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/wordnet2022.zip.


In [11]:
lemmatizer = WordNetLemmatizer()

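# Function for text cleaning (removing non-ASCII and non-letter characters, lowercasing, and lemmatization).
# Stopwords are not removed here; TfidfVectorizer(stop_words='english') handles them later.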
def clean_for_tags(text):
    text = re.sub(r'��+', '', text)  # This removes "��" or any repeated "��" characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Removes non-ASCII characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove anything that is not a letter or space
    text = text.lower()  # Convert text to lowercase
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])  # Lemmatization
    return text

training_data = data.copy()

# Apply clean_for_tags on columns to be used in tags column
training_data['Course Name'] = training_data['Course Name'].apply(clean_for_tags)
training_data['Course Description'] = training_data['Course Description'].apply(clean_for_tags)
training_data['Skills'] = training_data['Skills'].apply(clean_for_tags)

# Combine 'Course Name', 'Course Description', and 'Skills' into 'tags'
data['tags'] = training_data['Course Name'] + ' ' + training_data['Course Description'] + ' ' + training_data['Skills']

training_data = data[['Course Name', 'tags']]

In [12]:
training_data.head()

,Course Name,tags
0,Write A Feature Length Screenplay For Film Or ...,write a feature length screenplay for film or ...
1,Business Strategy: Business Model Canvas Analy...,business strategy business model canvas analys...
2,Silicon Thin Film Solar Cells,silicon thin film solar cell this course consi...
3,Finance for Managers,finance for manager when it come to number the...
4,Retrieve Data using Single-Table SQL Queries,retrieve data using singletable sql query in t...
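
To see what each cleaning step contributes, here is a minimal sketch of the same regex pipeline applied to a single string. It leaves out lemmatization so that it runs without the WordNet corpus; `clean_without_lemmatizer` and the sample text are illustrative and not part of the notebook.

```python
import re

def clean_without_lemmatizer(text):
    """Regex-only version of the cleaning steps (hypothetical helper, no lemmatization)."""
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # drop non-ASCII characters such as U+FFFD
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # drop digits and punctuation, keep letters and whitespace
    text = text.lower()
    return ' '.join(text.split())              # collapse repeated whitespace

sample = "Single-Table SQL Queries: you\ufffdll learn 3 things!"
print(clean_without_lemmatizer(sample))
# singletable sql queries youll learn things
```

Because punctuation is deleted rather than replaced with a space, hyphenated words merge into one token, which is why the tags above contain `singletable`.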


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Text Vectorization (TF-IDF)
  </div>
</div>



In [13]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(training_data['tags'])
print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (3424, 5000)
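
The weighting behind this matrix can be sketched by hand. The example below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It simplifies tokenization to a whitespace split and skips stop-word removal, so it is an illustration of the formula rather than a reimplementation of `TfidfVectorizer`. The function name `tfidf_rows` and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hand-rolled TF-IDF with smoothed IDF and L2-normalized rows (illustrative only)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid division by zero
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["finance for managers", "finance for professionals", "solar cell physics"]
vocab, rows = tfidf_rows(docs)

# "managers" occurs in one document, "finance" in two, so "managers" gets the larger weight
print(rows[0][vocab.index("managers")] > rows[0][vocab.index("finance")])
# True
```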


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Apply SVD on TF-IDF
  </div>
</div>



In [14]:
n_components = 100 # Reduce to 100 dimensions
svd = TruncatedSVD(n_components=n_components, random_state=42)
tfidf_matrix = svd.fit_transform(tfidf_matrix)

print("Reduced TF-IDF matrix shape:", tfidf_matrix.shape)


Reduced TF-IDF matrix shape: (3424, 100)


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Cosine Similarity and Recommendations
  </div>
</div>



In [15]:
similarity_matrix = cosine_similarity(tfidf_matrix)
print(similarity_matrix[0][1])

0.02311163293204528
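
For reference, the value in each cell of the similarity matrix is the cosine of the angle between two course vectors. A minimal pure-Python sketch of that formula follows; the helper name `cosine` and the example vectors are illustrative, and a zero vector is treated as having similarity 0 here by assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [2, 0]))  # same direction
# 1.0
print(cosine([1, 0], [0, 3]))  # orthogonal
# 0.0
```

Because SVD components can be negative, scores computed on the reduced matrix can fall below 0, unlike scores computed on raw TF-IDF vectors.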


<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Functions for Recommendation
  </div>
</div>



In [16]:
def normalize_rating(rating_str):
    """
    Normalize the course rating to a 0-1 scale.
    """
    try:
        return (float(rating_str) - 0) / (5 - 0)  # Normalize to 0-1
    except ValueError:
        return 0  

In [17]:
def get_recommendations(course_name, data, similarity_matrix, top_n=3, rating_weight=0.05):
    """
    Get top N course recommendations based on similarity to the given course name.
    The selected course has similarity 1 to itself, so it normally appears among the results.
    """
    # Use the row position, not the index label: rows of similarity_matrix are positional,
    # while the index labels of `data` have gaps after drop_duplicates()
    matches = np.flatnonzero((data['Course Name'] == course_name).to_numpy())
    if len(matches) == 0:
        raise ValueError(f"Course not found: {course_name!r}")
    course_idx = matches[0]  # Position of the selected course
    similarity_scores = list(enumerate(similarity_matrix[course_idx]))  # Get similarity scores for all courses
    
    recommendations = []
    for idx, similarity_score in sorted(similarity_scores, key=lambda x: x[1], reverse=True)[:top_n]:
        course_data = data.iloc[idx]  # Get course data for the current recommendation
        normalized_rating = normalize_rating(course_data.get('Course Rating', '0'))  # Normalize rating

        # Prepare recommendation dictionary with relevant course information
        recommendations.append({
            "course_name": course_data['Course Name'],
            "course_url": course_data.get('Course URL', ''),
            "rating": course_data['Course Rating'],
            "institution": course_data.get('University', 'Unknown'),
            "difficulty_level": course_data.get('Difficulty Level', 'Unknown'),
            "similarity": similarity_score,
            "final_score": similarity_score * (1 - rating_weight) + normalized_rating * rating_weight 
        })

    # Ratings only reorder the top_n courses already selected by similarity
    return sorted(recommendations, key=lambda x: x['final_score'], reverse=True)
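
The function above ranks the selected course against every course, including itself. If the selected course should not appear in its own recommendations, one option is to skip its position before taking the top N. The sketch below shows that idea with made-up numbers; `top_similar` is a hypothetical helper, not part of the notebook.

```python
def top_similar(similarity_row, query_idx, top_n=3):
    """Return (index, score) pairs for the top_n most similar items, skipping the query itself."""
    candidates = [(i, s) for i, s in enumerate(similarity_row) if i != query_idx]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]

row = [0.20, 1.00, 0.83, 0.10, 0.79]  # made-up similarities for the item at position 1
print(top_similar(row, query_idx=1, top_n=2))
# [(2, 0.83), (4, 0.79)]
```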

In [18]:
get_recommendations('Finance for Managers', data, similarity_matrix)

[{'course_name': 'Finance for Managers',
  'course_url': 'https://www.coursera.org/learn/operational-finance',
  'rating': '4.8',
  'institution': 'IESE Business School',
  'difficulty_level': 'Intermediate',
  'similarity': 1.0000000000000004,
  'final_score': 0.9980000000000004},
 {'course_name': 'Finance for Non-Financial Professionals',
  'course_url': 'https://www.coursera.org/learn/finance-for-non-finance-managers',
  'rating': '4.5',
  'institution': 'University of California, Irvine',
  'difficulty_level': 'Conversant',
  'similarity': 0.833306939311862,
  'final_score': 0.8366415923462689},
 {'course_name': 'Finance for Non-Financial Managers',
  'course_url': 'https://www.coursera.org/learn/finance-for-non-financial-managers',
  'rating': '4.2',
  'institution': 'Emory University',
  'difficulty_level': 'Beginner',
  'similarity': 0.8315951404609806,
  'final_score': 0.8320153834379316}]
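
The `final_score` values above can be checked by hand. This short sketch repeats the blending formula from `get_recommendations` as a standalone function (the name `blended_score` is illustrative) and applies it to the second result.

```python
import math

def blended_score(similarity, rating, rating_weight=0.05):
    """Weighted mix of similarity and a rating normalized from the 0-5 scale to 0-1."""
    normalized_rating = float(rating) / 5
    return similarity * (1 - rating_weight) + normalized_rating * rating_weight

score = blended_score(0.833306939311862, '4.5')
print(math.isclose(score, 0.8366415923462689))
# True
```

With `rating_weight=0.05`, even a full five-point rating difference moves the score by only 0.05, so ratings mainly separate courses whose similarity scores are already close.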

<div style="background-color: #A6CEE3; padding: 10px 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);">
  <!-- Centered Title with Larger Vertical Bar -->
  <div style="font-family: 'Cinzel', serif; font-size: 30px; font-weight: bold; color: #ffffff; text-align: left; letter-spacing: 1px;">
    <span style="font-size: 50px; margin-right: 15px;">|</span> Save the Model
  </div>
</div>



In [19]:
# Use a context manager so the file is flushed and closed after writing
with open('similarity_matrix.pkl', 'wb') as f:
    pickle.dump(similarity_matrix, f)
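
Loading the saved matrix back is the mirror image of saving it. The round trip below is a minimal sketch that uses a small nested list as a stand-in for `similarity_matrix` and a temporary directory so that it does not overwrite any real file; both are assumptions made for the example.

```python
import os
import pickle
import tempfile

matrix = [[1.0, 0.2], [0.2, 1.0]]  # stand-in for similarity_matrix

path = os.path.join(tempfile.mkdtemp(), 'similarity_matrix.pkl')
with open(path, 'wb') as f:
    pickle.dump(matrix, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == matrix)
# True
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that come from a trusted source.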
