<a href="https://colab.research.google.com/github/edgardpitta/jopper/blob/main/Document_similarity_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity

You will compare your career materials (pitch, CV, LinkedIn profile, etc.) with the job advertisement of your choice by calculating their **Similarity Index**.

The Similarity Index ranges from 0 to 1, where 0 means the two documents share no keywords and 1 means an exact match. Any document scores 1 when compared with itself, because the two texts are identical.

This script only works with English text.

# Instructions


1.   In the '*Runtime/Environnement d'execution*' menu, click '*Run all*'. This can take some time while the environment is prepared for the script.
2.   When prompted, paste the job posting and then the candidate material you want to compare it with into the text fields. The script continues automatically.
3.   Analyze the Similarity Index between the two documents. The closer to 1, the better.
4.   Make adjustments to your candidate material and rerun the script from section 3 by choosing '*Run from here/Courir Après*' in the '*Runtime/Environnement d'execution*' menu.

**Print the page as a PDF and upload it to K2. This is a graded activity, and it will not be graded unless it is uploaded to K2.**




## 1. Prepare the environment. Run this only once.

In [None]:
#@title
!pip install nltk
!pip install gensim
# 'string' is part of the Python standard library and needs no install
!pip install scikit-learn

In [None]:
#@title
import nltk
import gensim
import pandas as pd
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK releases
nltk.download('wordnet')

## 2. Upload the **Job Posting** you want to compare with your candidate materials.

Paste your job posting in the text box below and press 'Enter'.
Paste only the relevant part of the posting.

In [None]:
#@title
posting = input()

In [None]:
#@title
lemmatizer = WordNetLemmatizer()

### Job Posting's Keywords (lemmas)

In [None]:
#@title
# Build the stop list: English stopwords, punctuation marks, and newlines
stop_words = set(stopwords.words('english') + list(string.punctuation) + ["\n"])

# Tokenize, keep alphabetic tokens only, and lowercase them
words = [word.lower() for word in word_tokenize(posting) if word.isalpha()]

# Drop the stopwords and reduce the remaining tokens to their lemmas
new_sentence = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

posting_clean = " ".join(new_sentence)
posting_clean

## 3. Upload your **Candidate Material** (pitch, CV, LinkedIn Profile)

Paste your candidate material in the text box below and press 'Enter'.

If you want to adjust your candidate material to increase the Similarity Index, re-run all the cells from here by pressing 'Ctrl+F10' or by choosing '*Run Selection/Courir Après*' in the '*Runtime/Environnement d'exécution*' menu.

In [None]:
#@title
pitch = input()

### Candidate Material's Key-words (lemmas)

In [None]:
#@title
# Build the stop list: English stopwords, punctuation marks, and newlines
stop_words = set(stopwords.words('english') + list(string.punctuation) + ["\n"])

# Tokenize, keep alphabetic tokens only, and lowercase them
words = [word.lower() for word in word_tokenize(pitch) if word.isalpha()]

# Drop the stopwords and reduce the remaining tokens to their lemmas
new_sentence2 = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

pitch_clean = " ".join(new_sentence2)
pitch_clean

## 4. Calculate the **Similarity Index**

Document similarity is on a scale of 0 to 1, with 0 being completely different and 1 being an exact match. The higher the score, the better, because that means the keywords that the employer searches for appear in your candidate materials.

In [None]:
#@title
sentences = [
    pitch_clean,
    posting_clean,
]

In [None]:
#@title
from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(sentences)
counts = pd.DataFrame(
    matrix.toarray(),
    index=sentences,
    columns=vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
#counts

In [None]:
#@title
from sklearn.metrics.pairwise import cosine_similarity

# Compute the pairwise cosine similarities from the binary keyword matrix
similarities = cosine_similarity(matrix)

# Make a fancy colored dataframe about it
pd.DataFrame(similarities,
             index=sentences,
             columns=sentences) \
            .style \
            .background_gradient(axis=None)



## How the script works

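To judge the similarity between the two documents, the script uses a `CountVectorizer` from scikit-learn with `binary=True`. Each document becomes a vector with one position per keyword in the combined vocabulary: 1 if the keyword appears in the document, 0 if it does not. Repeating a keyword many times therefore does not raise the score; only which keywords are present matters.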

We'll be measuring similarity via cosine similarity, a standard measure in natural language processing. Rather than the distance between two points on a graph, it measures the angle between the two document vectors: vectors that point in the same direction score 1, and vectors with no keywords in common score 0.
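
To make the measure concrete, here is a minimal sketch in plain Python (no scikit-learn) of cosine similarity on binary keyword vectors, like the ones `CountVectorizer(binary=True)` builds in section 4. The helper name `binary_cosine` and the two example strings are made up for illustration. For 0/1 vectors the formula reduces to the number of shared keywords divided by the square root of the product of the two keyword counts, so this should only be read as an approximation of what the notebook computes.

```python
import math

def binary_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts represented as 0/1 keyword vectors."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    if not words_a or not words_b:
        return 0.0  # no keywords on one side, nothing to compare
    shared = len(words_a & words_b)  # dot product of the two 0/1 vectors
    return shared / math.sqrt(len(words_a) * len(words_b))

# Hypothetical, already-cleaned keyword strings
posting_example = "python data analysis team"
pitch_example = "python data visualization"

score = binary_cosine(posting_example, pitch_example)
print(round(score, 3))  # 2 shared keywords / sqrt(4 * 3) -> 0.577
```

Adding one of the missing keywords (say, "analysis") to the pitch raises the shared count to 3 and the score to 0.75, which is the kind of adjustment the activity asks you to try.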

## Natural Language Processing - NLP Vocabulary

**Tokenization** is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or punctuation marks; this script splits the text into words and keeps only the alphabetic tokens.

The **lemma** is the form of a word found in dictionaries, sometimes called the base form. Reducing words to their lemmas makes it possible to treat different forms of a word (for example, "skills" and "skill") as the same word.

**Stopwords** are those words that do not provide any useful information to decide in which category a text should be classified. This may be either because they don't have any meaning (prepositions, conjunctions, etc.) or because they are too frequent in the classification context.
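
The three terms above can be seen together in a small sketch. It uses only the standard library, and the stopword set, the lemma table, and the function name `toy_keywords` are toy values invented for this example; the notebook itself relies on NLTK's much larger stopword list and its WordNet lemmatizer.

```python
import re

# Toy resources, invented for this example only
TOY_STOPWORDS = {"the", "is", "a", "for", "and", "we", "are", "with"}
TOY_LEMMAS = {"analysts": "analyst", "skills": "skill", "teams": "team"}

def toy_keywords(text: str) -> list[str]:
    # Tokenization: split the lowercased text into alphabetic word tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal: drop tokens that carry little information
    content = [t for t in tokens if t not in TOY_STOPWORDS]
    # Lemmatization (by table lookup here): map each token to its base form
    return [TOY_LEMMAS.get(t, t) for t in content]

keywords = toy_keywords("We are hiring analysts with strong skills for the teams.")
print(keywords)  # ['hiring', 'analyst', 'strong', 'skill', 'team']
```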