# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4B - Social Impact

---

### <b> Notebook 5: Using our chosen model to predict MBTI for the LinkedIn dataset

Structure of this notebook </b>

* Part 1: Importing the .pkl files for chosen models & vectorizers
* Part 2: Generating predictions/ MBTI Labels for LinkedIn dataset
* Part 3: Generating Recommender Systems: Profile-Based Recommendations

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import joblib
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

### Part 1: Importing .pkl files for chosen models & vectorizers

In [2]:
# Load the saved model and vectorizer .pkl files
model_sn = joblib.load("./pkl_files/model_sn.pkl")
model_tf = joblib.load("./pkl_files/model_tf.pkl")
model_jp = joblib.load("./pkl_files/model_jp.pkl")
tvec_sn = joblib.load("./pkl_files/tvec_SN.pkl")
tvec_tf = joblib.load("./pkl_files/tvec_TF.pkl")
tvec_jp = joblib.load("./pkl_files/tvec_JP.pkl")

# Read csv for test data
df_linkedin = pd.read_csv("./cleaned_data/linkedin_cleaned.csv")

# Preview the columns and the first 5 rows
df_linkedin.head()

Unnamed: 0,job_title,post,processed_posts
0,tea,collaborating with colleagues to develop inter...,collaborating with colleagues to develop inter...
1,tea,i'm committed to adapting teaching styles to a...,i'm committed to adapting teaching styles to a...
2,tea,my goal is to prepare students for success in ...,my goal is to prepare students for success in ...
3,tea,i am dedicated to preparing students for succe...,i am dedicated to preparing students for succe...
4,tea,living life by default versus living life by d...,living life by default versus living life by d...


In [3]:
# Define stemmer function which will tokenize and stem the corpus
def stemmer(row):
    '''Applies `PorterStemmer()` to each token in a document.
    '''
    # Instantiate stemmer
    stem = PorterStemmer()

    # Extracts post from each row
    document = row["processed_posts"]

    # Tokenize sentence using RegexpTokenizer
    re_tokenizer = RegexpTokenizer(pattern = r"(?u)\b(?:\w\w+|i|I)(?:[\'\'\′\ʼ](?:s|t|m|re|ve|d|ll))?\b")
    list_of_tokens = re_tokenizer.tokenize(document)

    # Applies stemmer to the list of tokens
    stemmed_document = ""
    for token in list_of_tokens:
        stemmed_document += f"{stem.stem(token)} "
    
    return stemmed_document
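
To see which tokens the pattern above keeps, here is a minimal sketch using only the standard library. It assumes that `RegexpTokenizer`, in its default mode, returns the pattern's matches in the way `re.findall` does, and it uses a simplified pattern that handles only the ASCII apostrophe. `TOKEN_PATTERN` and `tokenize` are hypothetical names introduced for illustration, and no stemming is applied here.

```python
import re

# Simplified form of the notebook's token pattern (assumption: ASCII apostrophe only).
# Keeps words of 2+ characters, the pronoun "i"/"I", and common contractions.
TOKEN_PATTERN = r"(?u)\b(?:\w\w+|i|I)(?:'(?:s|t|m|re|ve|d|ll))?\b"

def tokenize(document):
    """Hypothetical helper: return every non-overlapping match of TOKEN_PATTERN."""
    return re.findall(TOKEN_PATTERN, document)

tokens = tokenize("I'm committed to a plan, and I don't stop")
print(tokens)
```

Single-character words other than "i"/"I" (such as "a") are dropped, while contractions such as "I'm" and "don't" stay as single tokens.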

### Part 2: Generating Predictions/ MBTI Labels for LinkedIn Dataset

##### (a) Generating Vectorised Corpus

In [4]:
# Define X variable (explicit copy, so the assignment below does not raise SettingWithCopyWarning)
X = df_linkedin[["processed_posts"]].copy()

In [5]:
# Stem our corpus
X["processed_posts"] = X.apply(stemmer, axis = 1)

# Transform LinkedIn data with each of the fitted vectorizers
X_tvec_sn = pd.DataFrame(tvec_sn.transform(X["processed_posts"]).todense(),
                            columns = tvec_sn.get_feature_names_out())

X_tvec_tf = pd.DataFrame(tvec_tf.transform(X["processed_posts"]).todense(),
                            columns = tvec_tf.get_feature_names_out())

X_tvec_jp = pd.DataFrame(tvec_jp.transform(X["processed_posts"]).todense(),
                            columns = tvec_jp.get_feature_names_out())


##### (b) Generating Probability Predictions for Individual Traits Using Trained Model

In [6]:
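# Create function to determine probability predictions
def generate_prob(model, X_tvec):
    '''Returns, for each row of `X_tvec`, the predicted probability of the
    model's first class (`model.classes_[0]`).
    '''
    # Obtain predicted probabilities (one column per class)
    y_prediction = model.predict_proba(X_tvec)

    # Keep only the first column to represent the trait
    prob_list = [sublist[0] for sublist in y_prediction]

    return prob_list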
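
The value kept by `generate_prob` is the first column of `predict_proba`, which in scikit-learn corresponds to the first entry of the model's `classes_` attribute. The sketch below illustrates this with a stand-in object rather than the notebook's trained models. `StubModel`, its class labels, and its probabilities are all hypothetical.

```python
class StubModel:
    """Hypothetical stand-in for a fitted binary classifier."""
    classes_ = ["class_0", "class_1"]  # placeholder labels, not the notebook's

    def predict_proba(self, X):
        # One row per sample; columns follow the order of classes_
        return [[0.34, 0.66] for _ in X]

def generate_prob(model, X_tvec):
    # Same logic as the notebook's function: keep column 0 of each row
    y_prediction = model.predict_proba(X_tvec)
    return [row[0] for row in y_prediction]

model = StubModel()
probs = generate_prob(model, [["doc one"], ["doc two"]])
print(f"P({model.classes_[0]}) per sample: {probs}")
```

Checking `model_sn.classes_` (and likewise for the other two models) would confirm which pole of each trait the stored probability refers to.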

In [7]:
# Generating the list of probabilities of each trait

prob_sn = generate_prob(model_sn, X_tvec_sn)
prob_tf = generate_prob(model_tf, X_tvec_tf)
prob_jp = generate_prob(model_jp, X_tvec_jp)

In [8]:
# Creating a DataFrame of LinkedIn job titles and their 3 trait probabilities
job_titles = list(df_linkedin["job_title"])
linkedin_mbti = pd.DataFrame({
    "s_n": prob_sn,
    "t_f": prob_tf,
    "j_p": prob_jp,
})
linkedin_mbti.set_index(pd.Series(job_titles, name = "job_titles"), inplace=True)

In [9]:
linkedin_mbti

job_titles,s_n,t_f,j_p
tea,0.341091,0.285007,0.498420
tea,0.256834,0.191431,0.548492
tea,0.287066,0.289800,0.523861
tea,0.246035,0.218084,0.491402
tea,0.479142,0.414365,0.489610
...,...,...,...
se,0.310018,0.227274,0.500285
se,0.454863,0.379361,0.504664
se,0.267909,0.234147,0.489678
se,0.596118,0.484924,0.517964


### Part 3: Generating Recommender Systems: Profile-Based Recommendations

(a) Generating MBTI from Job Seeker's Inputs
- When a job seeker inputs their 'About' section, LinkedIn posts, or any other text, their MBTI scores (probabilities for the 3 traits) are generated.

(b) Finding Job Titles/ Profiles with Similar MBTI as Job Seeker
- Using the seeker's MBTI scores, the recommender system looks for the 20 LinkedIn profiles (each labelled with a job title) whose MBTI scores are closest to the seeker's. Similarity is measured by the Euclidean distance between the seeker's scores and each profile's scores, and profiles are ranked by distance in ascending order.

(c) Recommend Top 2 Job Titles
- To determine the top 2 suitable job titles, the distances of the top 20 profiles are summed per job title and divided by the number of profiles with that title, giving an average distance per job title. The 2 job titles with the lowest average distance are recommended.
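
Steps (b) and (c) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: `rec_top_titles` is a hypothetical helper, the profile rows are only the nine rows previewed in the Part 2 output, the seeker scores are those of the example seeker in section (a), and `k=4` replaces the top 20 because the sample is so small. Its result therefore differs from the recommendation on the full dataset.

```python
import math
from collections import defaultdict

# Seeker scores and profile rows copied from outputs shown in this notebook;
# only the previewed rows are used, so the result is illustrative only.
seeker = (0.468883, 0.286113, 0.495216)  # (s_n, t_f, j_p)
profiles = [
    ("tea", (0.341091, 0.285007, 0.498420)),
    ("tea", (0.256834, 0.191431, 0.548492)),
    ("tea", (0.287066, 0.289800, 0.523861)),
    ("tea", (0.246035, 0.218084, 0.491402)),
    ("tea", (0.479142, 0.414365, 0.489610)),
    ("se", (0.310018, 0.227274, 0.500285)),
    ("se", (0.454863, 0.379361, 0.504664)),
    ("se", (0.267909, 0.234147, 0.489678)),
    ("se", (0.596118, 0.484924, 0.517964)),
]

def rec_top_titles(seeker, profiles, k=4, n_titles=2):
    """Hypothetical helper: rank titles by mean distance among the k nearest profiles."""
    # Step (b): Euclidean distance from the seeker to every profile, ascending
    scored = sorted((math.dist(seeker, scores), title) for title, scores in profiles)
    nearest = scored[:k]

    # Step (c): average the distances per job title within the k nearest
    grouped = defaultdict(list)
    for distance, title in nearest:
        grouped[title].append(distance)
    mean_distance = {title: sum(d) / len(d) for title, d in grouped.items()}

    # Lowest mean distance first
    return sorted(mean_distance, key=mean_distance.get)[:n_titles]

top_titles = rec_top_titles(seeker, profiles)
print(top_titles)
```

Because titles are ranked by their average distance within the nearest profiles, the title of the single closest profile does not necessarily rank first.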

##### (a) Generating MBTI from Job Seeker's Inputs

In [10]:
# Define stemmer function which will tokenize and stem seeker text input
def stemmer_seeker(document):
    '''Applies `PorterStemmer()` to each token in a document.
    '''
    # Instantiate stemmer
    stem = PorterStemmer()

    # Tokenize sentence using RegexpTokenizer
    re_tokenizer = RegexpTokenizer(pattern = r"(?u)\b(?:\w\w+|i|I)(?:[\'\'\′\ʼ](?:s|t|m|re|ve|d|ll))?\b")
    list_of_tokens = re_tokenizer.tokenize(document)

    # Applies stemmer to the list of tokens
    stemmed_document = ""
    for token in list_of_tokens:
        stemmed_document += f"{stem.stem(token)} "
    
    return stemmed_document

In [11]:
# Example of input text by job seeker 
document = "I'm deeply passionate about data and using data to deliver results and achieve desired outcomes. I believe data is at the heart of what we all do, no matter what you do. I have 9+ years of professional experience - in shaping data and product strategy, driving innovation, and leading cross-functional teams to develop cutting-edge data-driven products in the advertising technology and ecommerce domain."

In [12]:
# Stem the text input once, then transform it with each vectorizer
stemmed_input = [stemmer_seeker(document)]

seeker_tvec_sn = pd.DataFrame(tvec_sn.transform(stemmed_input).todense(),
                            columns = tvec_sn.get_feature_names_out())

seeker_tvec_tf = pd.DataFrame(tvec_tf.transform(stemmed_input).todense(),
                            columns = tvec_tf.get_feature_names_out())

seeker_tvec_jp = pd.DataFrame(tvec_jp.transform(stemmed_input).todense(),
                            columns = tvec_jp.get_feature_names_out())


In [13]:
# Generate the seeker's mbti probability scores
seeker_prob_sn = generate_prob(model_sn, seeker_tvec_sn)
seeker_prob_tf = generate_prob(model_tf, seeker_tvec_tf)
seeker_prob_jp = generate_prob(model_jp, seeker_tvec_jp)

In [14]:
# Storing seeker's MBTI score into a series
trait_types = linkedin_mbti.columns
seeker = pd.Series(data=np.zeros(len(trait_types)), index=trait_types)

seeker["s_n"] = float(seeker_prob_sn[0])
seeker["t_f"] = float(seeker_prob_tf[0])
seeker["j_p"] = float(seeker_prob_jp[0])
seeker

s_n    0.468883
t_f    0.286113
j_p    0.495216
dtype: float64

##### (b) Finding Job Titles/ Profiles with Similar MBTI as Job Seeker & (c) Recommend Top 2 Job Titles

In [15]:
# Create a function to determine the top 2 job titles matching the seeker's MBTI

def rec_top2_jobs(seeker_mbti, linkedin_mbti_dataset = linkedin_mbti):
    '''Returns the 2 job titles whose profiles are closest to the seeker's MBTI scores.
    '''
    # Obtain Euclidean distance between each profile's MBTI scores and the seeker's
    recommendations = np.linalg.norm(linkedin_mbti_dataset.values - seeker_mbti.values, axis = 1)

    # Put distances in a dataframe
    recommendations_df = pd.DataFrame({
        "job_titles": linkedin_mbti_dataset.index,
        "scores": recommendations,
    })

    # Obtain the 20 LinkedIn profiles with the most similar MBTI scores
    linkedin_top20 = recommendations_df.sort_values(by = "scores", ascending = True).head(20).reset_index(drop = True)

    # Count the profiles of each job title present in the top 20
    linkedin_top20_count = linkedin_top20.groupby("job_titles")["scores"].count()

    # Sum the distances for each job title present in the top 20
    linkedin_top20_sum = linkedin_top20.groupby("job_titles")["scores"].sum()

    # Obtain the average distance of each job title
    linkedin_top20_aggregate = linkedin_top20_sum / linkedin_top20_count

    # Extract the 2 job titles with the lowest average distance
    top2_job_titles = list(linkedin_top20_aggregate.sort_values(ascending = True).index[0:2])

    return top2_job_titles

In [16]:
# matching seeker's mbti to determine top 2 job titles
rec_top2_jobs(seeker)

['cp', 'acc']