# Course Recommendation System for the Coursera dataset

###### Content-based filtering is used: courses are recommended by the similarity of their tags to those of a course the user has watched or searched for.

###### The dataset used is the Coursera Courses Dataset, which contains over 3,000 courses!
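
As a rough illustration of the idea (not the notebook's actual pipeline, which follows below), this sketch compares three made-up tag strings using bag-of-words counts and cosine similarity. The course names, tag strings, and the helpers `cosine` and `most_similar` are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up tag strings, for illustration only
toy_tags = {
    "SQL Basics": "sql database query beginner",
    "Advanced SQL": "sql database query optimization advanced",
    "Screenwriting": "film drama script beginner",
}
bags = {name: Counter(text.split()) for name, text in toy_tags.items()}

def most_similar(name: str) -> str:
    """Return the other course whose tags are closest to `name`."""
    others = [other for other in bags if other != name]
    return max(others, key=lambda other: cosine(bags[name], bags[other]))

print(most_similar("SQL Basics"))
```

Courses that share more tag words get a higher cosine score, so they rank first.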

# Importing Necessary Libraries

In [81]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
import nltk #for stemming process
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

print('Dependencies Imported')

Dependencies Imported


In [82]:
def import_dataset():
    data = pd.read_csv("E:/Coursera.csv")
    return data

data = import_dataset()
data

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you’ll learn how to effectively...,Data Analysis select (sql) database manageme...
...,...,...,...,...,...,...,...
3517,"Capstone: Retrieving, Processing, and Visualiz...",University of Michigan,Beginner,4.6,https://www.coursera.org/learn/python-data-vis...,"In the capstone, students will build a series ...",Databases syntax analysis web Data Visuali...
3518,Patrick Henry: Forgotten Founder,University of Virginia,Intermediate,4.9,https://www.coursera.org/learn/henry,"“Give me liberty, or give me death:” Rememberi...",retirement Causality career history of the ...
3519,Business intelligence and data analytics: Gene...,Macquarie University,Advanced,4.6,https://www.coursera.org/learn/business-intell...,“Megatrends” heavily influence today’s organis...,analytics tableau software Business Intellig...
3520,Rigid Body Dynamics,Korea Advanced Institute of Science and Techno...,Beginner,4.6,https://www.coursera.org/learn/rigid-body-dyna...,"This course teaches dynamics, one of the basic...",Angular Mechanical Design fluid mechanics F...


# Basic Data Analysis

In [83]:
data.shape #3522 courses and 7 columns with different attributes


(3522, 7)

In [84]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


In [85]:
data.isnull().sum() # no nulls, but placeholder strings such as 'Not Calibrated' are not counted as missing


Course Name           0
University            0
Difficulty Level      0
Course Rating         0
Course URL            0
Course Description    0
Skills                0
dtype: int64

In [86]:
data['Difficulty Level'].value_counts()


Difficulty Level
Beginner          1444
Advanced          1005
Intermediate       837
Conversant         186
Not Calibrated      50
Name: count, dtype: int64

In [87]:
data['Course Rating'].value_counts()


Course Rating
4.7               740
4.6               623
4.8               598
4.5               389
4.4               242
4.9               180
4.3               165
4.2               121
5                  90
4.1                85
Not Calibrated     82
4                  51
3.8                24
3.9                20
3.6                18
3.7                18
3.5                17
3.4                13
3                  12
3.2                 9
3.3                 6
2.9                 6
2.6                 2
2.8                 2
2.4                 2
1                   2
2                   1
3.1                 1
2.5                 1
1.9                 1
2.3                 1
Name: count, dtype: int64

In [88]:
data['University'].value_counts()


University
Coursera Project Network                      562
University of Illinois at Urbana-Champaign    138
Johns Hopkins University                      110
University of Michigan                        101
University of Colorado Boulder                101
                                             ... 
GitLab                                          1
Yeshiva University                              1
University of Glasgow                           1
Laureate Education                              1
The World Bank Group                            1
Name: count, Length: 184, dtype: int64

In [89]:
data['Course Name']


0       Write A Feature Length Screenplay For Film Or ...
1       Business Strategy: Business Model Canvas Analy...
2                           Silicon Thin Film Solar Cells
3                                    Finance for Managers
4            Retrieve Data using Single-Table SQL Queries
                              ...                        
3517    Capstone: Retrieving, Processing, and Visualiz...
3518                     Patrick Henry: Forgotten Founder
3519    Business intelligence and data analytics: Gene...
3520                                  Rigid Body Dynamics
3521    Architecting with Google Kubernetes Engine: Pr...
Name: Course Name, Length: 3522, dtype: object

# Required Columns for System

#### Important columns to be used in the recommendation system:
- Course Name : Names of the courses
- Course Description : Similar courses may have similar course descriptions
- Skills : Users may want to see courses based on the same skills
- Difficulty Level : Similar courses as per difficulty level
#### Columns not used for the recommendation system:
- Course Rating : Mostly numeric, but stored as text because of its `Not Calibrated` entries; ratings can bias recommendations and their distribution is uneven
- University : The same university might offer multiple courses in different domains, which the user might not want to see
- Course URL : No significance in the recommendation system
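
The `Course Rating` column is stored as text because of its `Not Calibrated` entries. If ratings were ever needed as numbers, one possible approach (not part of this notebook's pipeline) is `pd.to_numeric` with `errors='coerce'`, sketched here on a made-up frame. The frame `toy` and the column `rating_num` are hypothetical.

```python
import pandas as pd

# Made-up sample that mimics the mixed content of the 'Course Rating' column
toy = pd.DataFrame({"Course Rating": ["4.8", "Not Calibrated", "5", "3.9"]})

# Entries that cannot be parsed become NaN instead of raising an error
toy["rating_num"] = pd.to_numeric(toy["Course Rating"], errors="coerce")

print(toy["rating_num"].isna().sum())   # count of unparseable entries
print(toy["rating_num"].mean())         # mean over the parseable ratings
```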

In [90]:
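def new_data(data):
    # .copy() returns an independent DataFrame, so later column assignments
    # do not trigger SettingWithCopyWarning
    data = data[['Course Name','Difficulty Level','Course Description','Skills']].copy()
    return data
data = new_data(data)
data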

Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Beginner,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Beginner,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,Advanced,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,Intermediate,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Beginner,In this course you’ll learn how to effectively...,Data Analysis select (sql) database manageme...
...,...,...,...,...
3517,"Capstone: Retrieving, Processing, and Visualiz...",Beginner,"In the capstone, students will build a series ...",Databases syntax analysis web Data Visuali...
3518,Patrick Henry: Forgotten Founder,Intermediate,"“Give me liberty, or give me death:” Rememberi...",retirement Causality career history of the ...
3519,Business intelligence and data analytics: Gene...,Advanced,“Megatrends” heavily influence today’s organis...,analytics tableau software Business Intellig...
3520,Rigid Body Dynamics,Beginner,"This course teaches dynamics, one of the basic...",Angular Mechanical Design fluid mechanics F...


# Data Pre-Processing

###### Pre-processing the data into a usable format is an important part of building the recommendation system

In [91]:
def space_by_commas(data):
    # Replace spaces with commas and strip punctuation (lambda functions could be used as well);
    # regex=False makes every pattern a literal string
    data['Course Name'] = data['Course Name'].str.replace(' ',',', regex=False)
    data['Course Name'] = data['Course Name'].str.replace(',,',',', regex=False)
    data['Course Name'] = data['Course Name'].str.replace(':','', regex=False)
    data['Course Description'] = data['Course Description'].str.replace(' ',',', regex=False)
    data['Course Description'] = data['Course Description'].str.replace(',,',',', regex=False)
    data['Course Description'] = data['Course Description'].str.replace('_','', regex=False)
    data['Course Description'] = data['Course Description'].str.replace(':','', regex=False)
    data['Course Description'] = data['Course Description'].str.replace('(','', regex=False)
    data['Course Description'] = data['Course Description'].str.replace(')','', regex=False)

    # Removing parentheses from the Skills column
    data['Skills'] = data['Skills'].str.replace('(','', regex=False)
    data['Skills'] = data['Skills'].str.replace(')','', regex=False)

    return data

data = space_by_commas(data)
data


Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills
0,"Write,A,Feature,Length,Screenplay,For,Film,Or,...",Beginner,"Write,a,Full,Length,Feature,Film,Script,In,thi...",Drama Comedy peering screenwriting film D...
1,"Business,Strategy,Business,Model,Canvas,Analys...",Beginner,"By,the,end,of,this,guided,project,you,will,be,...",Finance business plan persona user experienc...
2,"Silicon,Thin,Film,Solar,Cells",Advanced,"This,course,consists,of,a,general,presentation...",chemistry physics Solar Energy film lambda...
3,"Finance,for,Managers",Intermediate,"When,it,comes,to,numbers,there,is,always,more,...",accounts receivable dupont analysis analysis...
4,"Retrieve,Data,using,Single-Table,SQL,Queries",Beginner,"In,this,course,you’ll,learn,how,to,effectively...",Data Analysis select sql database management...
...,...,...,...,...
3517,"Capstone,Retrieving,Processing,and,Visualizing...",Beginner,"In,the,capstone,students,will,build,a,series,o...",Databases syntax analysis web Data Visuali...
3518,"Patrick,Henry,Forgotten,Founder",Intermediate,"“Give,me,liberty,or,give,me,death”,Remembering...",retirement Causality career history of the ...
3519,"Business,intelligence,and,data,analytics,Gener...",Advanced,"“Megatrends”,heavily,influence,today’s,organis...",analytics tableau software Business Intellig...
3520,"Rigid,Body,Dynamics",Beginner,"This,course,teaches,dynamics,one,of,the,basic,...",Angular Mechanical Design fluid mechanics F...


# Tags Column

###### The tags column is the concatenation of the following columns: Course Name + Difficulty Level + Course Description + Skills

# Combining Required Columns for System

In [92]:
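def combination_columns(data):
    # Plain concatenation with no separator: the last word of one column fuses
    # with the first word of the next (e.g. 'CellsAdvancedThis' in the output)
    data['tags'] = data['Course Name'] + data['Difficulty Level'] + data['Course Description'] + data['Skills']
    return data

data = combination_columns(data)
data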


Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills,tags
0,"Write,A,Feature,Length,Screenplay,For,Film,Or,...",Beginner,"Write,a,Full,Length,Feature,Film,Script,In,thi...",Drama Comedy peering screenwriting film D...,"Write,A,Feature,Length,Screenplay,For,Film,Or,..."
1,"Business,Strategy,Business,Model,Canvas,Analys...",Beginner,"By,the,end,of,this,guided,project,you,will,be,...",Finance business plan persona user experienc...,"Business,Strategy,Business,Model,Canvas,Analys..."
2,"Silicon,Thin,Film,Solar,Cells",Advanced,"This,course,consists,of,a,general,presentation...",chemistry physics Solar Energy film lambda...,"Silicon,Thin,Film,Solar,CellsAdvancedThis,cour..."
3,"Finance,for,Managers",Intermediate,"When,it,comes,to,numbers,there,is,always,more,...",accounts receivable dupont analysis analysis...,"Finance,for,ManagersIntermediateWhen,it,comes,..."
4,"Retrieve,Data,using,Single-Table,SQL,Queries",Beginner,"In,this,course,you’ll,learn,how,to,effectively...",Data Analysis select sql database management...,"Retrieve,Data,using,Single-Table,SQL,QueriesBe..."
...,...,...,...,...,...
3517,"Capstone,Retrieving,Processing,and,Visualizing...",Beginner,"In,the,capstone,students,will,build,a,series,o...",Databases syntax analysis web Data Visuali...,"Capstone,Retrieving,Processing,and,Visualizing..."
3518,"Patrick,Henry,Forgotten,Founder",Intermediate,"“Give,me,liberty,or,give,me,death”,Remembering...",retirement Causality career history of the ...,"Patrick,Henry,Forgotten,FounderIntermediate“Gi..."
3519,"Business,intelligence,and,data,analytics,Gener...",Advanced,"“Megatrends”,heavily,influence,today’s,organis...",analytics tableau software Business Intellig...,"Business,intelligence,and,data,analytics,Gener..."
3520,"Rigid,Body,Dynamics",Beginner,"This,course,teaches,dynamics,one,of,the,basic,...",Angular Mechanical Design fluid mechanics F...,"Rigid,Body,DynamicsBeginnerThis,course,teaches..."
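
In the output above, the last word of one column runs into the first word of the next (for example `CellsAdvancedThis`), because the columns are concatenated with no separator. One possible variant joins the columns with a single space so that each word stays a separate token. It is shown here on a made-up row; the frame `toy` is hypothetical.

```python
import pandas as pd

# Made-up single-row frame using the same column names as the notebook
toy = pd.DataFrame({
    "Course Name": ["Silicon,Thin,Film,Solar,Cells"],
    "Difficulty Level": ["Advanced"],
    "Course Description": ["This,course,consists,of"],
    "Skills": ["chemistry physics"],
})

# Same concatenation as before, but with a space between neighbouring columns
toy["tags"] = (
    toy["Course Name"] + " " + toy["Difficulty Level"] + " "
    + toy["Course Description"] + " " + toy["Skills"]
)

print(toy.loc[0, "tags"])
```

Changing the concatenation this way would also change the vocabulary and the similarity scores computed later, so the outputs shown in this notebook would no longer match.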


# Dataframe to be used

In [93]:
def use_df(data):
    # Copy so the assignments below modify an independent DataFrame
    new_df = data[['Course Name','tags']].copy()
    # Turn the comma separators back into spaces
    new_df['tags'] = new_df['tags'].str.replace(',',' ', regex=False)
    new_df['Course Name'] = new_df['Course Name'].str.replace(',',' ', regex=False)
    new_df = new_df.rename(columns = {'Course Name':'course_name'})
    new_df['tags'] = new_df['tags'].str.lower() # lower-casing the tags column
    return new_df

new_df = use_df(data)
new_df



Unnamed: 0,course_name,tags
0,Write A Feature Length Screenplay For Film Or ...,write a feature length screenplay for film or ...
1,Business Strategy Business Model Canvas Analys...,business strategy business model canvas analys...
2,Silicon Thin Film Solar Cells,silicon thin film solar cellsadvancedthis cour...
3,Finance for Managers,finance for managersintermediatewhen it comes ...
4,Retrieve Data using Single-Table SQL Queries,retrieve data using single-table sql queriesbe...
...,...,...
3517,Capstone Retrieving Processing and Visualizing...,capstone retrieving processing and visualizing...
3518,Patrick Henry Forgotten Founder,patrick henry forgotten founderintermediate“gi...
3519,Business intelligence and data analytics Gener...,business intelligence and data analytics gener...
3520,Rigid Body Dynamics,rigid body dynamicsbeginnerthis course teaches...


# Text Vectorization

In [94]:
def text_vectorization(new_df):
    cv = CountVectorizer(max_features=5000,stop_words='english')
    vectors = cv.fit_transform(new_df['tags']).toarray()
    return vectors

vectors = text_vectorization(new_df)
vectors

    

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
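
To make the array above easier to read: each row is one course and each column counts one vocabulary word. A minimal sketch on a made-up three-document corpus follows. The documents are hypothetical, and `CountVectorizer` is called with the same `max_features` and `stop_words` settings as in the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
docs = [
    "sql database query",
    "sql query optimization for the database",
    "film script writing",
]

cv = CountVectorizer(max_features=5000, stop_words="english")
counts = cv.fit_transform(docs).toarray()

# vocabulary_ maps each kept word to its column index;
# English stop words such as "for" and "the" are dropped
print(sorted(cv.vocabulary_))
print(counts.shape)
```

With 3,522 courses and at most 5,000 features, most entries of the real matrix are zero, which is why the truncated printout above shows only zeros.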

# Stemming Process

In [95]:
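def stem(text):
    # Reduce every whitespace-separated word to its Porter stem
    ps = PorterStemmer()
    return " ".join(ps.stem(word) for word in text.split())

# Note: `vectors` was already computed in the previous cell, so this stemming
# only affects the vectors if text_vectorization(new_df) is run again
new_df['tags'] = new_df['tags'].apply(stem)
new_df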



Unnamed: 0,course_name,tags
0,Write A Feature Length Screenplay For Film Or ...,write a featur length screenplay for film or t...
1,Business Strategy Business Model Canvas Analys...,busi strategi busi model canva analysi with mi...
2,Silicon Thin Film Solar Cells,silicon thin film solar cellsadvancedthi cours...
3,Finance for Managers,financ for managersintermediatewhen it come to...
4,Retrieve Data using Single-Table SQL Queries,retriev data use single-t sql queriesbeginneri...
...,...,...
3517,Capstone Retrieving Processing and Visualizing...,capston retriev process and visual data with p...
3518,Patrick Henry Forgotten Founder,patrick henri forgotten founderintermediate“g ...
3519,Business intelligence and data analytics Gener...,busi intellig and data analyt gener insightsad...
3520,Rigid Body Dynamics,rigid bodi dynamicsbeginnerthi cours teach dyn...
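
Because the count vectors were built in the previous cell, stemming the `tags` column afterwards only changes them if the vectorizer is run again. The sketch below shows the effect of stemming before vectorizing. To avoid an NLTK dependency, a deliberately crude suffix stripper, `toy_stem`, stands in for `PorterStemmer`, and the two documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(text: str) -> str:
    """Crude stand-in for a real stemmer: strips a trailing 'ing' or 's'."""
    out = []
    for word in text.split():
        if word.endswith("ing") and len(word) > 5:
            word = word[:-3]
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        out.append(word)
    return " ".join(out)

docs = ["visualizing databases", "visualize database"]  # hypothetical documents

raw_vocab = CountVectorizer().fit(docs).vocabulary_
stemmed_vocab = CountVectorizer().fit([toy_stem(d) for d in docs]).vocabulary_

# Without stemming the two documents share no words; after it they share one
print(sorted(raw_vocab))
print(sorted(stemmed_vocab))
```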


In [96]:
similarity = cosine_similarity(vectors)
similarity

array([[1.        , 0.03750979, 0.07877378, ..., 0.09463622, 0.06753905,
        0.10266713],
       [0.03750979, 1.        , 0.01220169, ..., 0.2976846 , 0.00502151,
        0.04697402],
       [0.07877378, 0.01220169, 1.        , ..., 0.01989156, 0.08612246,
        0.03117049],
       ...,
       [0.09463622, 0.2976846 , 0.01989156, ..., 1.        , 0.00682185,
        0.03722562],
       [0.06753905, 0.00502151, 0.08612246, ..., 0.00682185, 1.        ,
        0.01973535],
       [0.10266713, 0.04697402, 0.03117049, ..., 0.03722562, 0.01973535,
        1.        ]])
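
The matrix above is symmetric with ones on the diagonal because entry (i, j) is the cosine of the angle between count vectors i and j, and every vector has cosine 1 with itself. A small NumPy sketch of the same computation on a made-up count matrix follows. The array `X` is hypothetical, and this hand-written version is for illustration only, not a replacement for `cosine_similarity`.

```python
import numpy as np

# Made-up count matrix: 3 documents x 4 vocabulary words
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 2],
], dtype=float)

norms = np.linalg.norm(X, axis=1, keepdims=True)   # length of each row vector
unit = X / norms                                    # rows scaled to unit length
sim = unit @ unit.T                                 # pairwise dot products = cosines

print(np.round(sim, 3))
```

Documents with no words in common, such as the first and third rows here, get a similarity of exactly 0.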

In [100]:
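def recommend(course, new_df):
    # Index of the requested course (raises IndexError if the name is not found)
    course_index = new_df[new_df['course_name'] == course].index[0]
    # Cosine similarities to every course, read from the global `similarity` matrix
    distances = similarity[course_index]
    # Sort by similarity, descending; skip position 0 (normally the course itself) and keep six
    course_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:7]
    
    # Note: the return inside the loop exits on the first iteration,
    # so only the single most similar course is returned
    for i in course_list:
        return (new_df.iloc[i[0]].course_name)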

In [103]:
recommend('Business Strategy Business Model Canvas Analysis with Miro', new_df) 

'Product Development Customer Persona Development with Miro'
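
As written, `recommend` returns from inside its loop, so only the closest match comes back. A possible variant that returns all of the top matches is sketched below on made-up data. The name `recommend_top_n`, its `similarity` and `n` parameters, and the toy frame and matrix are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def recommend_top_n(course, new_df, similarity, n=6):
    """Return the names of the n courses most similar to `course`."""
    matches = np.flatnonzero((new_df["course_name"] == course).to_numpy())
    if matches.size == 0:
        return []                                  # unknown course name
    pos = matches[0]                               # row position, independent of index labels
    order = np.argsort(similarity[pos])[::-1]      # most similar first
    order = [i for i in order if i != pos][:n]     # drop the course itself
    return new_df["course_name"].iloc[order].tolist()

# Made-up data, for illustration only
toy_df = pd.DataFrame({"course_name": ["A", "B", "C", "D"]})
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(recommend_top_n("A", toy_df, toy_sim, n=2))
```

Passing the similarity matrix as an argument, rather than reading a global, makes the function easier to test. Returning an empty list for an unknown name avoids the `IndexError` that `.index[0]` would raise.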