# Recommendation system for an online course platform, by SEZINE Arielle

# System type: content-based

##### A content-based recommendation system (also called content-based filtering) suggests items to a user based on the characteristics of the items themselves.
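
As a minimal sketch of this idea, the toy example below describes three items by a short text and compares them with cosine similarity. The item texts are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions (illustration only)
items = [
    "python programming for data analysis",
    "data analysis with python and pandas",
    "history of european painting",
]

# Each item becomes a vector of word counts
vectors = CountVectorizer().fit_transform(items)

# Items that share words get a higher cosine similarity
sim = cosine_similarity(vectors)
print(sim.round(2))
```

The first two items share several words and therefore score higher with each other than with the third, which shares none.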

![image_illustrative](recomandation.png)

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')



[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False
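
The download above failed because the machine had no network access. The stopword removal later in the notebook needs NLTK's `stopwords` corpus to be present locally. A possible fallback, sketched below, uses scikit-learn's built-in English list when NLTK or its corpus is unavailable. The helper name `load_english_stopwords` is hypothetical, and the two lists are not identical, so results may differ slightly.

```python
def load_english_stopwords():
    """Return a set of English stopwords, preferring NLTK's list."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # NLTK or its 'stopwords' corpus is not available locally
        from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
        return set(ENGLISH_STOP_WORDS)

english_stopwords = load_english_stopwords()
print(len(english_stopwords) > 0)
```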

#### Loading the data

In [259]:
data=pd.read_csv("coursera.csv")

### Exploratory analysis

In [139]:
data.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


In [140]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


In [141]:
data["Course Rating"].value_counts()

Course Rating
4.7               740
4.6               623
4.8               598
4.5               389
4.4               242
4.9               180
4.3               165
4.2               121
5                  90
4.1                85
Not Calibrated     82
4                  51
3.8                24
3.9                20
3.6                18
3.7                18
3.5                17
3.4                13
3                  12
3.2                 9
3.3                 6
2.9                 6
2.6                 2
2.8                 2
2.4                 2
1                   2
2                   1
3.1                 1
2.5                 1
1.9                 1
2.3                 1
Name: count, dtype: int64

In [6]:
data["Difficulty Level"].value_counts()

Difficulty Level
Beginner          1444
Advanced          1005
Intermediate       837
Conversant         186
Not Calibrated      50
Name: count, dtype: int64
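
Both columns are stored as `object` and contain the placeholder `Not Calibrated`. The notebook drops `Course Rating` afterwards, so the conversion below is only a sketch of what one could do if the ratings were kept. The values in the toy column are made up for illustration.

```python
import pandas as pd

# Toy column mimicking the mix of numeric strings and "Not Calibrated"
ratings = pd.Series(["4.7", "Not Calibrated", "5", "3.9"], name="Course Rating")

# Non-numeric entries become NaN instead of raising an error
numeric_ratings = pd.to_numeric(ratings, errors="coerce")

print(numeric_ratings.isna().sum())   # number of uncalibrated ratings
print(numeric_ratings.mean())         # mean over the calibrated ratings only
```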

#### Given how we intend to use the recommendation system, we drop the following columns:
- University: a single university can offer many courses that have nothing to do with one another
- Course Rating: a content-based recommendation system does not use ratings
- Course URL: carries no useful information

### Preprocessing: after dropping the columns listed above, we perform the following operations
- create a new column called tags holding the likely keywords of a search: for each course it combines the course name, the skills and the course description
- remove stopwords from the tags column to cut down on uninformative words
- lemmatize to reduce each word to its base form
- vectorize to turn the text into a form the machine can process

#### Dropping the unneeded columns

In [260]:
data.drop(columns=['University','Course Rating','Course URL'], inplace=True)

In [261]:
data.columns

Index(['Course Name', 'Difficulty Level', 'Course Description', 'Skills'], dtype='object')

###### Next we define the column that will be used to compute the similarities

In [263]:
# Replace punctuation with spaces and lowercase the text
def formating_col(col_name):
    """Return the cleaned values of data[col_name] as a list of strings."""
    return [re.sub(r'[^\w\s]', ' ', text).lower() for text in data[col_name]]

In [264]:
data['Course Description']=formating_col(col_name= 'Course Description')
data['Skills']=formating_col(col_name= 'Skills')
data['Course Name']=formating_col(col_name= 'Course Name')
data['Difficulty Level']=formating_col(col_name= 'Difficulty Level')
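
The same cleaning can also be written with pandas string methods, which avoids referring to the global `data` inside the function. This is a sketch only; the helper name `clean_text_column` and the one-row `demo` frame are hypothetical.

```python
import pandas as pd

def clean_text_column(series):
    """Replace punctuation with spaces and lowercase, using pandas string methods."""
    return series.str.replace(r"[^\w\s]", " ", regex=True).str.lower()

# One-row toy frame (illustration only)
demo = pd.DataFrame({"Course Name": ["Retrieve Data using Single-Table SQL Queries"]})
demo["Course Name"] = clean_text_column(demo["Course Name"])
print(demo["Course Name"][0])
```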

In [265]:
data

Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills
0,write a feature length screenplay for film or ...,beginner,write a full length feature film script in th...,drama comedy peering screenwriting film d...
1,business strategy business model canvas analy...,beginner,by the end of this guided project you will be...,finance business plan persona user experien...
2,silicon thin film solar cells,advanced,this course consists of a general presentation...,chemistry physics solar energy film lambda...
3,finance for managers,intermediate,when it comes to numbers there is always more...,accounts receivable dupont analysis analysis...
4,retrieve data using single table sql queries,beginner,in this course you ll learn how to effectively...,data analysis select sql database manageme...
...,...,...,...,...
3517,capstone retrieving processing and visualiz...,beginner,in the capstone students will build a series ...,databases syntax analysis web data visuali...
3518,patrick henry forgotten founder,intermediate,give me liberty or give me death rememberi...,retirement causality career history of the ...
3519,business intelligence and data analytics gene...,advanced,megatrends heavily influence today s organis...,analytics tableau software business intellig...
3520,rigid body dynamics,beginner,this course teaches dynamics one of the basic...,angular mechanical design fluid mechanics f...


In [266]:
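# Concatenate the columns row by row to create a tags column
# (no separator is inserted, so the last word of one field is fused with the first word of the next, e.g. "cellschemistry")
data["tags"]=data['Course Name']+data['Skills']+data['Course Description']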
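
Because the columns are added without a separator, the outputs that follow contain fused words such as "cellschemistry". A possible variant, sketched here on a toy row, joins the fields with a space. Applying it to the real data would change the tags, and therefore possibly the recommendations shown later in the notebook.

```python
import pandas as pd

# Toy row (illustration only)
demo = pd.DataFrame({
    "Course Name": ["silicon thin film solar cells"],
    "Skills": ["chemistry physics solar energy"],
    "Course Description": ["this course consists of a general presentation"],
})

# str.cat joins the columns row by row with an explicit separator
demo["tags"] = demo["Course Name"].str.cat(
    [demo["Skills"], demo["Course Description"]], sep=" "
)
print(demo["tags"][0])
```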

In [267]:
data

Unnamed: 0,Course Name,Difficulty Level,Course Description,Skills,tags
0,write a feature length screenplay for film or ...,beginner,write a full length feature film script in th...,drama comedy peering screenwriting film d...,write a feature length screenplay for film or ...
1,business strategy business model canvas analy...,beginner,by the end of this guided project you will be...,finance business plan persona user experien...,business strategy business model canvas analy...
2,silicon thin film solar cells,advanced,this course consists of a general presentation...,chemistry physics solar energy film lambda...,silicon thin film solar cellschemistry physic...
3,finance for managers,intermediate,when it comes to numbers there is always more...,accounts receivable dupont analysis analysis...,finance for managersaccounts receivable dupon...
4,retrieve data using single table sql queries,beginner,in this course you ll learn how to effectively...,data analysis select sql database manageme...,retrieve data using single table sql queriesda...
...,...,...,...,...,...
3517,capstone retrieving processing and visualiz...,beginner,in the capstone students will build a series ...,databases syntax analysis web data visuali...,capstone retrieving processing and visualiz...
3518,patrick henry forgotten founder,intermediate,give me liberty or give me death rememberi...,retirement causality career history of the ...,patrick henry forgotten founderretirement ca...
3519,business intelligence and data analytics gene...,advanced,megatrends heavily influence today s organis...,analytics tableau software business intellig...,business intelligence and data analytics gene...
3520,rigid body dynamics,beginner,this course teaches dynamics one of the basic...,angular mechanical design fluid mechanics f...,rigid body dynamicsangular mechanical design ...


#### Removing stopwords from the tags column:
- Stopwords are common words of a language that are usually filtered out in natural language processing (NLP) and text analysis tasks. They carry little meaning on their own and can therefore be treated as irrelevant for analyses such as text classification or information extraction.

- In English, common stopwords include "the", "is", "in", "and", "of", "to"; the list is not exhaustive



In [153]:
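# Build the stopword set once: calling stopwords.words('english') for every word is very slow
english_stopwords=set(stopwords.words('english'))
tags=[]
for i in range(data.shape[0]):
    elt=[word for word in data['tags'][i].split(" ") if word not in english_stopwords]
    elt=" ".join(elt)
    tags.append(elt)
tags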

['write feature length screenplay film televisiondrama  comedy  peering  screenwriting  film  document review  dialogue  creative writing  writing  unix shells arts humanities music artwrite full length feature film script  course  write complete  feature length screenplay film television  serious drama romantic comedy anything  learn break creative process components  discover structured process allows produce polished pitch ready script end course  completing project increase confidence ideas abilities  feel prepared pitch first script get started next  course designed tap creativity based  active learning   actual learning takes place within activities    writing  learn   link trailer course  view trailer  please copy paste link browser  https   vimeo com 382067900 b78b800dc0  learner review   love approach professor wheeler takes towards course  point  easy follow  informative  would definitely recommend anyone interested taking screenplay writing course   course curriculum simple 

#### Lemmatization


Lemmatization is a natural language processing (NLP) technique that maps each word of a text to its canonical or base form, called the lemma. Unlike stemming, which simply chops off word endings to reach a common stem, lemmatization relies on word morphology and grammatical context to bring each word back to its lemma.

Examples of lemmatization:
"playing" → "play",
"better" → "good",
"went" → "go",
"cats" → "cat",

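The difference between the two approaches can be seen on the four example words. The sketch below is a toy illustration, not a real stemmer or lemmatizer: the suffix list and the lemma table are hypothetical and hand-written, whereas a real project would rely on a library such as NLTK.

```python
# Toy illustration only: both the suffix list and the lemma table are hand-written.
SUFFIXES = ("ing", "s")          # hypothetical, deliberately tiny suffix list
LEMMAS = {                       # hypothetical lemma table
    "playing": "play", "better": "good", "went": "go", "cats": "cat",
}

def naive_stem(word):
    """Strip the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["playing", "better", "went", "cats"]:
    print(w, naive_stem(w), toy_lemmatize(w))
```

Suffix stripping handles "playing" and "cats" but leaves "better" and "went" untouched, since mapping them to "good" and "go" requires knowledge of the language.
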
In [154]:

# Stemming step, left disabled: the rest of the notebook uses the unstemmed tags
"""stemmed_words=[]
stemmer = PorterStemmer()
for i in range(data.shape[0]):
    elt=[stemmer.stem(word) for word in tags[i].split()]
    elt=" ".join(elt)
    stemmed_words.append(elt)   
stemmed_words"""


KeyboardInterrupt: 

In [268]:
data['tags']=tags

#### Creating the dataframe to use

In [269]:
usefull_df=data[["Course Name", 'tags']]
usefull_df.columns=['CourseName','Tags']

In [270]:
usefull_df

Unnamed: 0,CourseName,Tags
0,write a feature length screenplay for film or ...,write feature length screenplay film televisio...
1,business strategy business model canvas analy...,business strategy business model canvas analy...
2,silicon thin film solar cells,silicon thin film solar cellschemistry physic...
3,finance for managers,finance managersaccounts receivable dupont an...
4,retrieve data using single table sql queries,retrieve data using single table sql queriesda...
...,...,...
3517,capstone retrieving processing and visualiz...,capstone retrieving processing visualizing ...
3518,patrick henry forgotten founder,patrick henry forgotten founderretirement ca...
3519,business intelligence and data analytics gene...,business intelligence data analytics generate...
3520,rigid body dynamics,rigid body dynamicsangular mechanical design ...


#### Vectorizing the text in the tags column: several vectorization methods are available; we try CountVectorizer and TfidfVectorizer

In [271]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF representation of the tags
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['tags']).toarray()

# Word counts, limited to the 3522 most frequent terms
count_vectorizer=CountVectorizer(max_features=3522,stop_words='english')
X_cv=count_vectorizer.fit_transform(data['tags']).toarray()

# Pairwise cosine similarity between all courses
similarity_tfidf = cosine_similarity(X)
similarity_cv = cosine_similarity(X_cv)

In [272]:
similarity_tfidf.shape

(3522, 3522)

In [273]:
similarity_cv.shape

(3522, 3522)
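
The cell above converts both matrices to dense arrays with `.toarray()`. `cosine_similarity` also accepts SciPy sparse matrices, so the conversion can be skipped, which may save memory on larger vocabularies. The sketch below shows this on a toy corpus standing in for `data['tags']`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for data['tags'] (illustration only)
toy_tags = [
    "silicon thin film solar cells physics",
    "introduction to solar cells",
    "finance for managers accounting",
]

# fit_transform returns a SciPy sparse matrix; no .toarray() is needed here
tfidf_matrix = TfidfVectorizer().fit_transform(toy_tags)
count_matrix = CountVectorizer(stop_words="english").fit_transform(toy_tags)

# The result is a dense (n_courses, n_courses) array in both cases
sim_tfidf = cosine_similarity(tfidf_matrix)
sim_count = cosine_similarity(count_matrix)

print(sim_tfidf.shape, sim_count.shape)
```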

### Here we define the function used for recommendation

#### CountVectorizer case

In [274]:
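def recommandation_cv(title):
    # Look up the course by its lowercased name
    course = usefull_df[usefull_df['CourseName'] == title.lower()]
    if len(course)==0:
        print("this course does not exist")
    else:
        course_index=course.index[0]
        # Cosine similarities between this course and every course
        distances = similarity_cv[course_index]
        # Sort by decreasing similarity and skip the top entry, normally the course itself
        course_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:11]

        for i in course_list:
            print(usefull_df.iloc[i[0]].CourseName)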


### TfidfVectorizer case

In [275]:
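def recommandation_tfidf(title):
    # Look up the course by its lowercased name
    course = usefull_df[usefull_df['CourseName'] == title.lower()]
    if len(course)==0:
        print("this course does not exist")
    else:
        course_index=course.index[0]
        # Cosine similarities between this course and every course
        distances = similarity_tfidf[course_index]
        # Sort by decreasing similarity and skip the top entry, normally the course itself
        course_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:11]

        for i in course_list:
            print(usefull_df.iloc[i[0]].CourseName)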
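
The two functions above differ only in the similarity matrix they read. One possible refactoring, sketched here on toy data, passes the matrix as an argument and returns the names instead of printing them. The helper `recommend` and its parameters are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def recommend(title, names, similarity, top_n=10):
    """Return up to top_n course names most similar to `title` (hypothetical helper)."""
    matches = np.flatnonzero(names.to_numpy() == title.lower())
    if len(matches) == 0:
        return []                      # unknown course
    idx = matches[0]
    # Indices sorted by decreasing similarity, excluding the query course itself
    order = np.argsort(similarity[idx])[::-1]
    order = [i for i in order if i != idx][:top_n]
    return [names.iloc[i] for i in order]

# Toy data (illustration only)
names = pd.Series(["course a", "course b", "course c"])
similarity = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.3],
                       [0.1, 0.3, 1.0]])
print(recommend("Course A", names, similarity, top_n=2))
```

Excluding the query by index rather than by position keeps the course itself out of the list even when another course ties with it at the top.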


In [276]:
recommandation_tfidf("silicon thin film solar cells")

physics of silicon solar cells
introduction to solar cells
organic solar cells   theory and practice
solar energy systems overview
solar energy basics
photovoltaic solar energy
solar energy and electrical system design
the science of the solar system
solar energy codes  permitting and zoning
origins   formation of the universe  solar system  earth and life


In [277]:
recommandation_cv("silicon thin film solar cells")

physics of silicon solar cells
introduction to solar cells
organic solar cells   theory and practice
solar energy systems overview
solar energy basics
photovoltaic solar energy
solar energy and electrical system design
the science of the solar system
solar energy codes  permitting and zoning
origins   formation of the universe  solar system  earth and life


## The two methods recommend practically the same courses (for this query the two lists are identical): from here on we keep CountVectorizer

# Deployment: in this section we save the files needed to deploy our model, namely
- a json file mapping indices to course names
- a pkl file containing the vocabulary learned during vectorization
- a pkl file containing the course tags vectorized with that vocabulary

### Saving the json file containing the indices and course names

In [278]:
import json

# Map each row index to its course name
index_course_name = dict(enumerate(usefull_df['CourseName']))

filename='index_course_name.json'
# Save the dictionary to a JSON file (the integer keys are written as strings)
with open(filename, 'w') as file:
    json.dump(index_course_name, file)
    
print(f"The dictionary was saved to the file {filename}.")


The dictionary was saved to the file index_course_name.json.
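
One detail matters when this file is read back: JSON object keys are always strings, so the integer indices come back as `"0"`, `"1"`, and so on. The toy round trip below illustrates this; the mapping is invented for the example.

```python
import json

index_to_name = {0: "course a", 1: "course b"}   # toy mapping (illustration only)

# Round trip through JSON text: object keys are always strings in JSON
restored = json.loads(json.dumps(index_to_name))
print(restored)

# Convert the keys back to integers if integer lookups are preferred
restored_int = {int(k): v for k, v in restored.items()}
print(restored_int[0])
```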


### Saving the pkl file containing the vocabulary learned when vectorizing the course titles

In [295]:
import pickle

# Fit the vocabulary on the course titles only
vectorizer=CountVectorizer(max_features=3522,stop_words='english')
cv_fited=vectorizer.fit(usefull_df['CourseName'])
with open('vectorizer_fited.pkl','wb') as file:
    pickle.dump(cv_fited,file)

print("file saved.")

file saved.


### Saving the pkl file containing the vectorized tags of the courses

In [310]:
# Vectorize the tags with the vocabulary fitted on the course titles
description_transformed=cv_fited.transform(usefull_df['Tags']).toarray()

with open('description_fit_transformed.pkl','wb') as file:
    pickle.dump(description_transformed,file)

print("file saved.")

file saved.


### Next we load these files and use them in the recommendation function used at deployment

In [306]:
import numpy as np

with open("vectorizer_fited.pkl", 'rb') as file:
    loaded_vectorizer = pickle.load(file)

with open("description_fit_transformed.pkl", 'rb') as file:
    loaded_description = pickle.load(file)
    
with open("index_course_name.json", 'r') as file:
    loaded_dictionary = json.load(file)
    

def recommandation_deploy(title):
    # Each word of the title is vectorized as a separate document
    title=title.split(" ")
    title=loaded_vectorizer.transform(title)
    similarities = cosine_similarity(title,loaded_description)
    # Only the first row is used, i.e. the similarities for the first word of the title
    indices_similaires = np.argsort(similarities[0])[::-1]
    indices_similaires=indices_similaires[:10]
    # JSON keys are strings, so convert the indices before the lookup
    indices_similaires=[str(i) for i in indices_similaires]
    # Get the names of the recommended courses
    cours_recommandes = [loaded_dictionary[i] for i in indices_similaires]
    print("Recommended courses:", cours_recommandes)

In [307]:
recommandation_deploy("machine learning en python")

Recommended courses: ['introduction to applied machine learning', 'how google does machine learning', 'optimizing machine learning performance', 'machine learning with h2o flow', 'machine translation', 'machine learning for data analysis', 'developing ai applications on azure', 'machine learning foundations  a case study approach', 'the power of machine learning  boost business  accumulate clicks  fight fraud  and deny deadbeats', 'machine learning for all']


In [308]:
recommandation_deploy("formation django pour debutant")

Recommended courses: ['oceanography  a key to better understand our world', 'russian company law  formation of legal entities ', 'analyzing the universe', 'russian for beginners 2               a1', 'the music of the rolling stones  1962 1974', 'russian for beginners 3                a1', 'origins   formation of the universe  solar system  earth and life', 'bacteria and chronic infections', 'taxation of business entities i  corporations', 'getting started in cryo em']


In [309]:
recommandation_deploy("machine learning")

Recommended courses: ['introduction to applied machine learning', 'how google does machine learning', 'optimizing machine learning performance', 'machine learning with h2o flow', 'machine translation', 'machine learning for data analysis', 'developing ai applications on azure', 'machine learning foundations  a case study approach', 'the power of machine learning  boost business  accumulate clicks  fight fraud  and deny deadbeats', 'machine learning for all']
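
In the function above, the title is split into words and only the first row of the similarity matrix is read, so the ranking depends on the first word alone. This is why "machine learning en python" and "machine learning" return the same list. A possible alternative, sketched below, vectorizes the whole title as one document and returns nothing when no query word is in the vocabulary. The toy lists replace the saved files, and the name `recommend_from_query` is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the saved files (illustration only)
course_names = ["machine learning for all", "python for everybody", "solar energy basics"]
course_tags = [
    "machine learning for all models data",
    "python for everybody programming data",
    "solar energy basics photovoltaic",
]
# As in the notebook, the vocabulary is fitted on the course names
vectorizer = CountVectorizer(stop_words="english").fit(course_names)
tag_matrix = vectorizer.transform(course_tags)

def recommend_from_query(query, top_n=10):
    """Rank courses against the whole query treated as one document."""
    query_vec = vectorizer.transform([query.lower()])
    if query_vec.nnz == 0:
        return []                       # no query word is in the vocabulary
    scores = cosine_similarity(query_vec, tag_matrix)[0]
    order = np.argsort(scores)[::-1][:top_n]
    # Keep only courses that share at least one term with the query
    return [course_names[i] for i in order if scores[i] > 0]

print(recommend_from_query("machine learning en python"))
print(recommend_from_query("formation django pour debutant"))
```

With this variant every known word of the query contributes to the score, and a query made only of unknown words yields an empty list rather than an arbitrary ranking.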
