In [1]:
with open('text', 'r') as txt:
    contenu = txt.read()
    

In [2]:
contenu

'**Morocco and Marrakech: A Tapestry of Tradition and Modernity** Morocco, located at the crossroads of Europe and Africa, is a country drenched in history, mystery, and cultural richness. A testament to the ancient civilizations that once flourished here, this North African kingdom boasts a unique blend of Arab, Berber, and European influences. At the heart of Morocco\'s rich tapestry lies Marrakech, one of its four imperial cities and a vibrant epicenter of tradition and modernity. **Geographical Significance** Morocco is bordered by the Atlantic Ocean to the west, the Mediterranean Sea to the north, Algeria to the east and southeast, and the vast Sahara desert to the south. Its strategic location has historically made it a sought-after territory and a melting pot of cultures, religions, and trade routes. **Marrakech: The Red City** Marrakech, often referred to as "The Red City" due to its distinctive red-hued buildings, stands against the backdrop of the snow-capped Atlas Mountains.

# Text Preprocessing

### Lower casing

In [3]:
text_nettoyé = contenu.lower()

### Removal of Punctuation

In [11]:
import string

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

text_nettoyé = remove_punctuation(text_nettoyé)




In [12]:
text_nettoyé

'morocco marrakech tapestry tradition modernity morocco located crossroads europe africa country drenched history mystery cultural richness testament ancient civilizations flourished north african kingdom boasts unique blend arab berber european influences heart moroccos rich tapestry lies marrakech one four imperial cities vibrant epicenter tradition modernity geographical significance morocco bordered atlantic ocean west mediterranean sea north algeria east southeast vast sahara desert south strategic location historically made soughtafter territory melting pot cultures religions trade routes marrakech red city marrakech often referred red city due distinctive redhued buildings stands backdrop snowcapped atlas mountains established 11th century remained crucial political economic cultural center morocco journey medina marrakechs old town medina unesco world heritage site labyrinthine maze narrow alleys bustling souks historical landmarks djemaa elfna square lies heart medina comes al
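
The output above contains fused tokens such as `soughtafter`, `redhued` and `moroccos`, because `str.translate` with a deletion table removes punctuation without leaving a separator. A minimal sketch of one alternative, assuming it is acceptable to split such words, maps each punctuation character to a space instead. `replace_punctuation` and the sample sentence are hypothetical, not part of the notebook.

```python
import string

# Map every punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def replace_punctuation(text):
    # split() + join() collapses the runs of spaces left behind
    return " ".join(text.translate(PUNCT_TO_SPACE).split())

# Illustrative sentence, not taken verbatim from the input file
sample = "a sought-after territory; morocco's red-hued buildings."
cleaned = replace_punctuation(sample)
print(cleaned)
```

With this variant a possessive such as `morocco's` leaves a stray `s` token, so a later filtering step would still be needed if that matters.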

### Removal of Stopwords

In [13]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
text_nettoyé = remove_stopwords(text_nettoyé)


### Removal of Frequent words

In [15]:
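from collections import Counter
# NOTE: cnt is never populated, so FREQWORDS is empty and remove_freqwords() changes nothing here.
# Filling it (e.g. cnt.update(text_nettoyé.split())) would likely also drop 'morocco' and
# 'marrakech', which the Word2Vec cells further down look up.
cnt = Counter()
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])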

In [16]:
text_nettoyé = remove_freqwords(text_nettoyé)
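
In the cell above the `Counter` is created but never updated, so the set of frequent words is empty and the text passes through unchanged. A sketch of how the counter could be filled before filtering, assuming a single-string corpus as in this notebook; the sample text and `top_n` are illustrative values only.

```python
from collections import Counter

sample = "morocco city morocco medina city morocco atlas"  # illustrative text
top_n = 2  # hypothetical cutoff; the notebook uses 10

# Count every token first, then take the top_n most frequent ones.
cnt = Counter(sample.split())
freqwords = {w for w, _ in cnt.most_common(top_n)}

def remove_freqwords(text, freqwords=freqwords):
    return " ".join(w for w in text.split() if w not in freqwords)

filtered = remove_freqwords(sample)
print(freqwords, filtered)
```

On a short, single-topic text the most frequent words tend to be the topic words themselves, so this filter is worth applying only when those words are not needed later.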


### Removal of Rare words

In [17]:
n_rare_words = 10
# NOTE: cnt is still empty at this point, so RAREWORDS is empty and nothing is removed.
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

In [18]:
text_nettoyé = remove_rarewords(text_nettoyé)
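
Slicing `most_common()` from the end picks a fixed number of least frequent words. In a short text many words occur exactly once, so which ten are selected depends only on tie order. A sketch of a threshold-based alternative, assuming the counter has been filled from the text; `min_freq` and the sample text are hypothetical.

```python
from collections import Counter

sample = "morocco city morocco medina city morocco atlas"  # illustrative text
min_freq = 2  # hypothetical threshold: keep words seen at least twice

cnt = Counter(sample.split())
# Every word below the threshold is treated as rare, regardless of tie order.
rarewords = {w for w, c in cnt.items() if c < min_freq}

def remove_rarewords(text, rarewords=rarewords):
    return " ".join(w for w in text.split() if w not in rarewords)

kept = remove_rarewords(sample)
print(rarewords, kept)
```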


### Stemming

In [19]:
from nltk.stem.porter import PorterStemmer


stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

In [20]:
text_nettoyé = stem_words(text_nettoyé)


### Lemmatization

In [21]:

import spacy
!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")




Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [22]:
def lemmatize_words(text):
    doc = nlp(text)
    lemmatized_words = [token.lemma_ for token in doc]
    return " ".join(lemmatized_words)


In [23]:
text_nettoyé = lemmatize_words(text_nettoyé)


## After Text Preprocessing

In [24]:
text_nettoyé

'morocco marrakech tapestri tradit modern morocco locat crossroad europ africa countri drench histori mysteri cultur rich testament ancient civil flourish north african kingdom boast uniqu blend arab berber european influenc heart morocco rich tapestri lie marrakech one four imperi citi vibrant epicent tradit modern geograph signific morocco border atlant ocean west mediterranean sea north algeria east southeast vast sahara desert south strateg locat histor make soughtaft territori melt pot cultur religion trade rout marrakech red citi marrakech often refer red citi due distinct redhu build stand backdrop snowcap atla mountain establish 11th centuri remain crucial polit econom cultur center morocco journey medina marrakech old town medina unesco world heritag site labyrinthin maze narrow alley bustl souk histor landmark djemaa elfna squar lie heart medina come aliv everi even storytel musician snake charmer food stall offer tantal moroccan delicaci palac garden citi also home grand pal

## Training the Word2Vec model on the text

In [25]:
from gensim.models import Word2Vec


In [26]:
# Our preprocessed text, wrapped in a list (one "sentence" holding the whole document)
preprocessed_text = [text_nettoyé]
# Tokenize the preprocessed text
tokens = [sentence.split() for sentence in preprocessed_text]
tokens

[['morocco',
  'marrakech',
  'tapestri',
  'tradit',
  'modern',
  'morocco',
  'locat',
  'crossroad',
  'europ',
  'africa',
  'countri',
  'drench',
  'histori',
  'mysteri',
  'cultur',
  'rich',
  'testament',
  'ancient',
  'civil',
  'flourish',
  'north',
  'african',
  'kingdom',
  'boast',
  'uniqu',
  'blend',
  'arab',
  'berber',
  'european',
  'influenc',
  'heart',
  'morocco',
  'rich',
  'tapestri',
  'lie',
  'marrakech',
  'one',
  'four',
  'imperi',
  'citi',
  'vibrant',
  'epicent',
  'tradit',
  'modern',
  'geograph',
  'signific',
  'morocco',
  'border',
  'atlant',
  'ocean',
  'west',
  'mediterranean',
  'sea',
  'north',
  'algeria',
  'east',
  'southeast',
  'vast',
  'sahara',
  'desert',
  'south',
  'strateg',
  'locat',
  'histor',
  'make',
  'soughtaft',
  'territori',
  'melt',
  'pot',
  'cultur',
  'religion',
  'trade',
  'rout',
  'marrakech',
  'red',
  'citi',
  'marrakech',
  'often',
  'refer',
  'red',
  'citi',
  'due',
  'distinct',


In [27]:
model = Word2Vec(tokens, min_count=1, window=5)


#### Question 1

In [28]:
# 1. Extract the vector representation of a word
word_vector = model.wv['morocco']
print("Vector representation of 'morocco': \n", word_vector)

Vector representation of 'morocco': 
 [-8.6333444e-03  3.6826942e-03  5.2006105e-03  5.7534552e-03
  7.4752048e-03 -6.2365630e-03  1.1175403e-03  6.1292686e-03
 -2.8545356e-03 -6.1877957e-03 -4.1457990e-04 -8.4325811e-03
 -5.6114108e-03  7.1458267e-03  3.3391276e-03  7.2316355e-03
  6.7997817e-03  7.4843410e-03 -3.8207218e-03 -6.3890975e-04
  2.3316382e-03 -4.5185378e-03  8.4018698e-03 -9.8875742e-03
  6.7425105e-03  2.9271226e-03 -4.9321675e-03  4.3827300e-03
 -1.7721495e-03  6.7178253e-03  9.9755768e-03 -4.3610604e-03
 -5.4817187e-04 -5.7032402e-03  3.8299377e-03  2.8479185e-03
  6.9189966e-03  6.0744123e-03  9.5273936e-03  9.2221992e-03
  7.8754714e-03 -7.0615639e-03 -9.1903750e-03 -3.6655625e-04
 -3.0622464e-03  7.8719836e-03  5.9276647e-03 -1.5690869e-03
  1.5417188e-03  1.8251807e-03  7.8276349e-03 -9.5335264e-03
 -1.6942284e-04  3.4624189e-03 -9.1766327e-04  8.3999168e-03
  9.0271374e-03  6.5124766e-03 -7.3698047e-04  7.7566472e-03
 -8.5463030e-03  3.2090212e-03 -4.6343957e

#### Question 2

In [29]:
# 2. Compute the similarity between two words
similarity = model.wv.similarity('morocco', 'marrakech')
print("Similarity between 'morocco' and 'marrakech': ", similarity)

Similarity between 'morocco' and 'marrakech':  -0.0072038155


In [35]:
# Compute the similarity between two words
similarity = model.wv.similarity('cuisin', 'marrakech')
print("Similarity between 'cuisin' and 'marrakech': ", similarity)

Similarity between 'cuisin' and 'marrakech':  -0.048494823
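
`model.wv.similarity` returns the cosine similarity between two word vectors. A plain-Python sketch of that computation follows, with short made-up vectors standing in for the model's embeddings; `cosine_similarity` is a hypothetical helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors, chosen only to illustrate the two extremes
u = [1.0, 0.0]
v = [0.0, 1.0]
w = [2.0, 0.0]

sim_orthogonal = cosine_similarity(u, v)  # vectors at right angles
sim_parallel = cosine_similarity(u, w)    # same direction, different length
print(sim_orthogonal, sim_parallel)
```

Scores close to zero, like the two printed above, mean the vectors are nearly orthogonal. That is consistent with embeddings that have moved little from their random initialization, which is plausible for a corpus this small.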


#### Question 3

In [31]:
# 3. Extract the context words for a given center word
# (similar_by_word returns the words whose vectors are closest by cosine similarity)
context_words = model.wv.similar_by_word('marrakech', topn=5)
print("Context words for 'marrakech': ", context_words)

Context words for 'marrakech':  [('uniqu', 0.292203426361084), ('permeat', 0.27848777174949646), ('maze', 0.2343234270811081), ('lie', 0.22422970831394196), ('enthral', 0.22263091802597046)]


In [34]:
# Extract the context words for a given center word
context_words = model.wv.similar_by_word('morocco', topn=5)
print("Context words for 'morocco': ", context_words)

Context words for 'morocco':  [('experi', 0.21625766158103943), ('mint', 0.18959291279315948), ('cuisin', 0.188522070646286), ('mountain', 0.18417641520500183), ('border', 0.18394117057323456)]
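
`similar_by_word` ranks the vocabulary by vector similarity; it does not list the words that actually appear around the query word. If the question asks for context words in the Word2Vec sense (tokens within `window` positions of the center word), they can be read directly from the token list. The sketch below is a simplification that uses a fixed symmetric window. `context_words_for` is a hypothetical helper, and the sample tokens are the first few entries of the notebook's token list.

```python
def context_words_for(tokens, center, window=5):
    """Return one list of neighbouring tokens per occurrence of `center`."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == center:
            left = tokens[max(0, i - window):i]      # up to `window` tokens before
            right = tokens[i + 1:i + 1 + window]     # up to `window` tokens after
            contexts.append(left + right)
    return contexts

sample_tokens = ["morocco", "marrakech", "tapestri", "tradit",
                 "modern", "morocco", "locat", "crossroad"]
result = context_words_for(sample_tokens, "marrakech", window=2)
print(result)
```

Applied to the full token list with `window=5`, the same function would give the neighbourhoods that the model was trained on for each occurrence of a word.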
