Make LinkedIn Fun - with some NLP

Canva-generated image of Big Bird using a computer

Using NLP constructs with scikit-learn, we will make our LinkedIn experience better by transforming a profile description on this social network into a Sesame Street character!

We asked ChatGPT to create synthetic LinkedIn profiles for specific professions: HR, crypto enthusiasts, business and financial gurus, and software engineers.

Examples of the CSV rows generated by OpenAI are listed below; the column layout the notebook expects is sketched after the list:

  • Lawrence Dickerson, IT professional with skills in microsoft excel microsoft word microsoft outlook microsoft powerpoint IT business partner and so on - Labelled as the hard working Grover.
  • Ana Tolley,web3 visionary,successful cryptocurrency trader with a track record of identifying profitable investments excels at technical analysis and risk management - Labelled as the Oscar the Grouch searching trash for lost digital assets.
  • Mathew Bird,money mastermind,unlock the power of your finances and take control of your wealth with the guidance of the money mastermind - Labelled as the great Big Bird and their puppet leadership.
  • Oleta Reece,the wealth creation newsletter,people manager hr recruitment manager business coach trainer - Labelled as the hustling Count.
  • Gary Otis,responsible for handling the entire recruiting cycle, such as sourcing screening contacting, scheduling interview releasing offer letter and placing the qualified resource sourcing - Labelled as the zany (and Achillean?) duo Ernie and Bert.
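
The training notebook further down reads a CSV with at least the columns titles, descriptions and class (the single-letter class codes s, o, c, f and w appear later in the code). Below is a purely hypothetical sketch of such rows, paraphrasing the examples above (the class assignments are placeholders, not the real labels):

import pandas as pd

# Hypothetical rows only: the column names match what the notebook loads,
# but the texts are paraphrased and the class codes are placeholders.
sample = pd.DataFrame([
    {"titles": "web3 visionary", "descriptions": "successful cryptocurrency trader excels at technical analysis", "class": "c"},
    {"titles": "it professional", "descriptions": "skills in microsoft excel microsoft word microsoft outlook", "class": "s"},
])
print(sample.head())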

Looks interesting? Then read on.

15th May 2023 - A follow-up article is available here

Extend your Browser

Our point of augmentation will be a Chrome extension.

We want the extension to scrape any feed we see, pull in the information on the profiles found, send it to a classifier, and map the output to a character.

Building the Extension to Scrape and Show

Either use ChatGPT to generate the Chrome extension boilerplate, or pull one from the developer quickstart Google offers on its site (see references).

Whatever the method, the extension will be made up of:

  • manifest.json
  • background.js
  • content.js

The manifest tells Chrome what rights and functionalities the extension has, in our case:

{
    "manifest_version": 3,
    "name": "AugmentedLinkedInFun",
    "version": "1.0",
    "description": "Replaces LinkedIn user names with 'NPC'.",
    "permissions": [
        "tabs",
        "scripting",
        "activeTab",
        "notifications",
        "storage"
    ],
    "host_permissions": [
        "http://www.linkedin.com/*",
        "https://www.linkedin.com/*"
    ],
    "optional_host_permissions": [
        "https://*/*",
        "http://*/*"
    ],
    "content_scripts": [
        {
            "matches": [
                "*://www.linkedin.com/*"
            ],
            "js": [
                "scripts/jquery-3.6.4.slim.min.js",
                "scripts/content.js"
            ]
        }
    ],
    "background": {
        "service_worker": "scripts/background.js",
        "type": "module"
    }
}
  • permissions tells the browser that we will be injecting a script into the active tab, in this case content.js.
  • host_permissions signals where the extension is allowed to run.
  • content_scripts lists the scripts and assets we will be using, and on which pages. We also include the slim build of the jQuery library for its element selection and manipulation functionality.
  • background defines the background script, which has access to the majority of the Chrome APIs.

Injecting the script

Within the content.js script, we want to capture all links that direct us to a profile using the function below:

/**
 * scrape all profile links.
 * @return array of links.
 */
getAllLinks() {
    const re = new RegExp("^(http|https)://", "i");
    var links = [];

    $('a[href*="/in/"]').each((index, element) => {
        let link = $(element).attr('href');
        if (!re.test(link)) {
            // LinkedIn may use relative links; we need to convert them to absolutes.
            link = `https://www.linkedin.com${link}`;
        }

        links.push(link);
    })

    return links;
}

The function returns an array of all links that lead to a profile. Such links are identified by the selector a[href*="/in/"], i.e. any anchor whose href contains /in/; relative links are converted to absolute URLs.

We will scrape every link in our feed that matches that pattern.

We then call the function below to pull in the content of each profile:

/**
   * Given a collection of profile links, we collect relevant information.
   * @param {*} links Absolute links to profiles. If a link is not for a profile (we check with jQuery selectors), it will be ignored.
   * @returns Array of Profile objects, made up of
   *            {
   *                user
   *                titles
   *                link
   *            }
   */
  async getProfilesDetailsFromLinks(links, cachedProfiles) {

    /**
     * Scraping internal function.
     */
    function _scrape(profile){
      let lnName = $("main h1", profile).text()
      let profileObj = null;
      if (lnName !== null && lnName.length > 0) {
        lnName = lnName.trim();
        console.debug(`found name: ${lnName}`);
      }

      let titles = $("main div.text-body-medium", profile).text()
      if (titles !== null && titles.length > 0) {
        titles = titles.trim();
      }

      if ((lnName !== null && lnName !== "") && (titles !== null && titles !== "")) {
        profileObj = {
          user: lnName,
          titles: titles,
          link: ''
        }
      }
      return profileObj;
    }
    
    let profiles = []
    let calls = []
    let MAX_ITERS = 5 // Limit the number of profiles we scrape.
    for (const link of links) {
        // Avoid rate limits and allow dynamic content to load:
        // randomly wait for up to 2sec.
        await new Promise(r => setTimeout(r, Math.floor(Math.random() * 2000)));

        let _link = link;
        let call = chrome.runtime.sendMessage({ link: link }).then(response => {
            let profile = response?.profile;
            let profileObj = null;
            if (profile) {
                profileObj = _scrape(profile);
                if (profileObj) {
                    profileObj.link = _link;
                }
            }
            return Promise.resolve(profileObj);
        });
        calls.push(call)

        MAX_ITERS -= 1;
        if (MAX_ITERS <= 0) {
            break; // Just to make things faster.
        }
    }
   
    profiles = profiles.concat(await Promise.all(calls));
    
    return profiles;
  }

In the main function body, we will use jQuery to scrape relevant profile info.

The selector $("main h1", profile) will be used to capture the user's name, while $("main div.text-body-medium", profile) will be used to capture the user's title.

Note that scraping a profile from a link is not that straightforward: LinkedIn generates dynamic content, and we would lose scope if we browsed away.

Therefore we call chrome.runtime.sendMessage to send a command, which is received by our global background.js script (which has the most API access); it opens a new tab to pull in all of the profile's dynamic content.

A Background script to Orchestrate Profiles

As we already revealed, LinkedIn generates dynamic content on the fly, which means that not all information is available when the page first loads.

So we will open the profile in a new tab, pull in the content, and close the tab while sending back the HTML content to the caller.

This is done in the global background.js script:

chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    console.log(`Message from ${sender}: ${JSON.stringify(message)}`)
    if (message?.link) {
        chrome.tabs
            .create({ url: message.link, active: false }) // Create an inactive tab
            .then(tab => {
                // Inject a script to pull in the full DOM
                let tabID = tab.id
                chrome.scripting.executeScript({
                    func: getProfile,
                    target : {
                        tabId: tabID
                    },
                    injectImmediately: false
                })
                .then(injectionResults => {
                    // Send the DOM back to the calling content.js
                    let profile = injectionResults[0]?.result;
                    return sendResponse({ profile: profile});
                })
                .then(() => chrome.tabs.remove(tabID)) // Remove the tab
                .catch(error => console.error(error.message))
            })
    } 

    return true;
});

The chrome.runtime.onMessage.addListener call registers a function which picks up the message sent from our content.js.

It will proceed to open a tab, and inject the script below using chrome.scripting.executeScript:

async function getProfile() {
    // Open a profile, and wait for all content to be loaded.
    const el = document.querySelector(".pvs-list");
    if (el) {
        console.log("Has activity!")
    }
    else {
        // Delay the tab close.
        await new Promise(r => setTimeout(r, 400));
    }

    return document.body.innerHTML;
}

Note the timeout; there will be a lot of these in this extension.

The timeout waits for the page to load the data we need, as this dynamic content in particular has no callback to signal us that the whole page has loaded.

Once all the HTML is available, we grab the DOM and close the tab.

An important note on timeouts and scraping: they help us play nice with LinkedIn, because if it detects abusive automation it will ban us. If you want to automate properly, apply for a LinkedIn Developer Account.

With the profile's HTML in our hands, we will return it to the content.js script from the previous section.

What's the Character

We have the profiles' data scraped into a structure; now let's associate them with a Sesame Street character, using the function below:

/**
   * Augment the LinkedIn experience by adding info or visual cues.
   * All happens asynchronously, as we call our classification server.
   * @param {*} profiles 
   */
  async augmentLinkedInExperience(profiles) {
      async function _augmentText(element,profile,data){
        if ($(element).data( "scanned" )){
          return;
        }

        let search = `^${profile.user}$`;
        let re = new RegExp(search, "g");

        $(element).data( "scanned", true );
        let text = $(element).text().trim();
        if (text.match(re)){
          if (data['proba'] && data['proba'] >= 0){
            $(element).text(`${text} [${data['proba']}% as ${data['label']}]`);
          }
        }
      }

      async function _augmentImage(element,profile,data){
        // TODO: Inefficient, will load all images for each profile.
        // Use tokens to discern if this image is related to the profile.
        let name = $(element).attr('alt')
        if (!name){
          return;
        }
        name = name.trim().toLocaleLowerCase();
        if (!name.includes("photo") && !name.includes("profile") && !name.includes(profile.user.toLocaleLowerCase())){
          return;
        }
        const tokens = profile.user.toLocaleLowerCase().split(" ");
        for (const tok of tokens){
          if (name.includes(tok)){
            const imgUrl = await chrome.runtime.getURL(`assets/${data['label']}.png`)
            $(element).attr('href', imgUrl);
            $(element).attr('src', imgUrl);
            break;
          }
        }
      }

      let promises = [];
      profiles.forEach(function( profile) {
        if (!profile)
          return

        const data = {
          'descriptions': (profile.posts?.join(' ') ?? ' ') + profile.titles,
        };

        promises.push(
          fetch('http://127.0.0.1:800/profile', {
            method: "POST", 
            headers: {
              "Content-Type": "application/json",
            },
            body: JSON.stringify(data), 
          })
          .then(response=>response.json())
          .then((data) => {
            $(`div.visually-hidden`).each(async (index, element) => {
              try{
                const wrapper = $(element).parent().parent().parent(); 
                if (!wrapper){
                  return;
                }
                wrapper.empty();

                const url = await chrome.runtime.getURL(`assets/${data['label']}.png`)

                const img = $(`<img src="${url}" alt="${profile.user}'s photo" id="ember23" class="EntityPhoto-circle-3 evi-image ember-view" href="${url}">`);
                wrapper.append(img);
              }
              catch (e){
                console.error(e);
              }
            })
            $(`img`).each(async (index, element) => {
              _augmentImage(element,profile,data)
            })
            $(`h1.text-heading-xlarge:contains("${profile.user}")`).each((index, element) => {
              _augmentText(element,profile,data);
            })
            $(`div:contains("${profile.user}")`).each((index, element) => {
              _augmentText(element,profile,data);
            })
            $(`span:contains("${profile.user}")`).each((index, element) => {
              _augmentText(element,profile,data);
            })
            $(`a:contains("${profile.user}")`).each((index, element) => {
              _augmentText(element,profile,data);
            })            
            
          }).catch ((error) => {
            console.log('Error: ', error);
          })
        );
    }); 
  }

We call our Flask server, which hosts the trained model, with fetch('http://127.0.0.1:800/profile', ...).

Once the profile has been classified, we capture the elements that link to the user with selectors such as $(`h1.text-heading-xlarge:contains("${profile.user}")`) or $(`div:contains("${profile.user}")`), and change their content to reflect the character we want.

Training our Model - A Sesame Street Lesson

Let's load all the labeled data, and clean it:

import pandas as pd
from tqdm import tqdm

DATA = "./data/anonLinkedInProfiles.csv"
data = pd.concat([chunk for chunk in tqdm(pd.read_csv(DATA, chunksize=1000), desc=f'Loading {DATA}')])
print(f'Shape: {data.shape}, does it have NAs:\n{data.isna().any()}')

data = data.dropna()
data = data.drop(data[(data['descriptions'] == '') | (data['titles'] == '')].index)

print(f'Post-drop NAs:\n{data.isna().any()}')
data['class'] = data['class'].apply(lambda x: x.lower())

# For this exercise, keep it small.
data = data.sample(800)
data = data.reset_index() # Reset index, since we will do operations on it!
print(f'Resampled Shape: {data.shape}')

data.head()

Samples

We then build a tokenizer that removes stop words, whitespace, symbols, and reduces the words to their lemmas:

import string
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import ToktokTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

nltk.download('all')

NGRAMS = (2,2) # Bigrams only
STOP_WORDS = stopwords.words('english')
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "...", "“", "”", "|", "#"]
COMMON_WORDS = [] # to be populated later in our analysis
toktok = ToktokTokenizer()
wnl = WordNetLemmatizer()

def _get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# Creating our tokenizer function. Can also use a TFIDF
def custom_tokenizer(sentence):
    # Let's use some speed here.
    tokens = [toktok.tokenize(sent) for sent in sent_tokenize(sentence)]
    tokens = [wnl.lemmatize(word, _get_wordnet_pos(word)) for word in tokens[0]]
    tokens = [word.lower().strip() for word in tokens]
    tokens = [tok for tok in tokens if (tok not in STOP_WORDS and tok not in SYMBOLS and tok not in COMMON_WORDS)]

    return tokens

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}


def clean_text(text):
    if (type(text) == str):
        text = text.strip().replace("\n", " ").replace("\r", " ")
        text = text.lower()
    else:
        text = "NA"
    return text

bow_vector = CountVectorizer(
    tokenizer=custom_tokenizer, ngram_range=NGRAMS)

We do the standard train, validation, and test split:

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Combine features for NLP.
X = data['titles'].astype(str) +  ' ' + data['descriptions'].astype(str)
ylabels = le.fit_transform(data['class'])

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=1 - train_ratio)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))

And we should be ready to go.

But first, let's do some analysis to get a sense of what descriptions our model will be ingesting, namely the ngrams that make up our selected profiles:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import chi2

def get_top_n_dependant_ngrams(corpus, corpus_labels, ngram=1, n=3):
    # use a private vectorizer.
    _vect = CountVectorizer(tokenizer=custom_tokenizer,
                            ngram_range=(ngram, ngram))
    vect = _vect.fit(tqdm(corpus, "fn:fit"))
    bow_vect = vect.transform(tqdm(corpus, "fn:transform"))
    features = bow_vect.toarray()

    labels = np.unique(corpus_labels)
    ngrams_dict = {}
    for label in tqdm(labels, "fn:labels"):
        corpus_label_filtered = corpus_labels == label
        features_chi2 = chi2(features, corpus_label_filtered)
        feature_names = np.array(_vect.get_feature_names_out())

        feature_rev_indices = np.argsort(features_chi2[0])[::-1]
        feature_rev_indices = feature_rev_indices[:n]
        ngrams = [(feature_names[idx], features_chi2[0][idx]) for idx in feature_rev_indices]
        ngrams_dict[label] = ngrams

    # While we are at it, let's also return the top and bottom N counts.
    sum_words = bow_vect.sum(axis=0)
    words_counts = [(word, sum_words[0, idx])
                    for word, idx in tqdm(_vect.vocabulary_.items())]
    words_counts = sorted(words_counts, key=lambda x: x[1], reverse=True)
    top_words_counts = words_counts[:n]
    bottom_words_counts = words_counts[-n:]
        
    return {'labels_freq': ngrams_dict,
            'top_corpus_freq': top_words_counts,
            'bottom_corpus_freq': bottom_words_counts}


TOP_N_WORDS = 10

common_bigrams_label_dict = get_top_n_dependant_ngrams(X, ylabels, ngram=1, n=TOP_N_WORDS)

fig, axes = plt.subplots(2, 3, figsize=(26, 12), sharey=False)
fig.suptitle('NGrams per Class')
fig.subplots_adjust(hspace=0.25, wspace=0.50)

x_plot = 0
y_plot = 0
labels = np.sort(np.unique(ylabels), axis=None)
for idx, label in tqdm(enumerate(labels), "Plot labels"):
    common_ngrams_df = pd.DataFrame(
        common_bigrams_label_dict['labels_freq'][label], columns=['ngram', 'chi2'])
    x1, y1 = common_ngrams_df['chi2'], common_ngrams_df['ngram']

    # Reverse it from the ordinal label we transformed it.
    axes[y_plot][x_plot].set_title(
        f'{le.inverse_transform([label])} ngram dependence', fontsize=6)
    axes[y_plot][x_plot].set_yticklabels(y1, rotation=0)
    sns.barplot(ax=axes[y_plot][x_plot], x=x1, y=y1)
    # Go to next plot.
    if idx > 0 and idx % 2 == 0:
        x_plot = 0
        y_plot += 1
    else:
        x_plot += 1

plt.show()

NGRAM Frequencies

From the chart above, we can see some common ngrams for each label. We should filter out the most overused and rarest ones:

print(common_bigrams_label_dict['top_corpus_freq'])
print(common_bigrams_label_dict['bottom_corpus_freq'])

common_label_freq = [word for label in labels for word, count in common_bigrams_label_dict['labels_freq'][label]]
print(f'Highest frequency of ngrams in labels: {common_label_freq}')

COMMON_WORDS = np.append([word for word,count in common_bigrams_label_dict['top_corpus_freq'] if word not in common_label_freq], 
                         [word for word,count in common_bigrams_label_dict['bottom_corpus_freq'] if word not in common_label_freq ])
COMMON_WORDS

plt.figure(figsize=(4,2))
sns.countplot(x=y_train)
plt.show()

NGRAM count plots

In our count plot, we notice that our training data is imbalanced, so let's either compute class weights or downsample to the smallest class (a sketch of using the undersampled frame follows the code below):

# Either use class weights
from sklearn.utils.class_weight import compute_class_weight

keys = np.unique(y_train)
values = compute_class_weight(class_weight='balanced', classes=keys, y=y_train)

class_weights = dict(zip(keys, values))
print(f'Use these weights: {class_weights}')

# Or undersample.

min_size = np.array([len(data[data['class'] == 's']), len(data[data['class'] == 'o']), len(data[data['class'] == 'c']), len(data[data['class'] == 'f']), len(data[data['class'] == 'w'])]).min()
print(f'Least sampled class of size {min_size}')

data4 = data[data['class'] == 's'].sample(n=min_size, random_state=101)
data3 = data[data['class'] == 'o'].sample(n=min_size, random_state=101)
data2 = data[data['class']=='c'].sample(n=min_size, random_state=101)
data1 = data[data['class']=='f'].sample(n=min_size, random_state=101)
data0 = data[data['class']=='w'].sample(n=min_size, random_state=101)

data_under = pd.concat([data0,data1,data2,data3,data4],axis=0)

print(f'Undersampled shapes: {data0.shape}, {data1.shape}, {data2.shape}, {data3.shape}, {data4.shape}')
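
If you go the undersampling route, a minimal sketch (assuming the same le encoder and feature combination used earlier) is to rebuild the features and labels from data_under before repeating the split:

# Rebuild features and labels from the undersampled frame (sketch only),
# then repeat the train_test_split calls from the earlier cell.
X = data_under['titles'].astype(str) + ' ' + data_under['descriptions'].astype(str)
ylabels = le.fit_transform(data_under['class'])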

Finally, we train the model:

from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV

text_clf = Pipeline([
        ("cleaner", predictors()),
        ('vect', bow_vector),
        ('tfidf', TfidfTransformer()),
        ('clf', LinearSVC()),
    ],
    verbose=False) # Add verbose to see progress, note that we run x2 for each param combination.
parameters = {
    'vect__ngram_range': [(1, 2)],
    'tfidf__use_idf': [True],
    'tfidf__sublinear_tf': [True],
    'clf__penalty': ['l2'],
    'clf__loss':  ['squared_hinge'],
    'clf__C': [1],
    'clf__class_weight': ['balanced']
}
model_clf = GridSearchCV(text_clf,
                        param_grid=parameters,
                        refit=True,
                        cv=2,
                        error_score='raise')
model = model_clf.fit(X_train, y_train)

# see: model.cv_results_ for more results
print(f'The best estimator: {model.best_estimator_}\n')
print(f'The best score: {model.best_score_}\n')
print(f'The best parameters: {model.best_params_}\n')

model = model.best_estimator_
model = CalibratedClassifierCV(model).fit(X_val, y_val)

predicted = model.predict(X_test)

Previously we used cross-validators and an ensemble of candidate classifiers to find the best classifier and hyperparameters; the code above is the result of that search.
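
For reference, here is a minimal sketch of that kind of comparison, reusing the predictors and bow_vector defined earlier; the candidate classifiers and scoring choice are illustrative assumptions, not the exact setup we originally ran:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Candidate classifiers compared on the same text pipeline (illustrative list).
candidates = {
    'LinearSVC': LinearSVC(class_weight='balanced'),
    'LogisticRegression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'MultinomialNB': MultinomialNB(),
}

for name, clf in candidates.items():
    pipe = Pipeline([
        ("cleaner", predictors()),
        ('vect', bow_vector),
        ('tfidf', TfidfTransformer()),
        ('clf', clf),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=3, scoring='f1_weighted')
    print(f'{name}: mean weighted F1 = {scores.mean():.3f} (+/- {scores.std():.3f})')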

Let's check our model's accuracy:

# Model Accuracy
print("F1:", metrics.f1_score(y_test, predicted, average='weighted'))
print("Accuracy:", metrics.accuracy_score(y_test, predicted))
print("Precision:", metrics.precision_score(
    y_test, predicted, average='weighted'))
print("Recall:", metrics.recall_score(y_test, predicted, average='weighted'))

# Plotting the confusion matrix
plt.figure(figsize=(2, 2))
cm = metrics.confusion_matrix(y_test, predicted)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=model.classes_)

disp.plot()

Scores

Above 90% is more than enough for us; the Count will be proud!

Scores

Save our model for the Flask server:

from joblib import dump, load
import sys
print(sys.executable)
print(sys.version)
print(sys.version_info)

pickled_le = dump(le, './models/labelencoder.joblib')
validate_pickled_le = load('./models/labelencoder.joblib')
pickled_model = dump(model, './models/model.joblib')
validate_pickled_model = load('./models/model.joblib')

xx_test = ["IT Consultant at Sesame Street, lord of Java Code, who likes to learn new stuff and tries some machine learning in my free engineering time."]

yy_result = validate_pickled_model.predict(xx_test)
yy_result_label = validate_pickled_le.inverse_transform(yy_result)
yy_result_proba = validate_pickled_model.predict_proba(xx_test)

print(f'Predicted: {yy_result_label} at confidence {yy_result_proba[0][yy_result]}\n \
    for features: {validate_pickled_le.inverse_transform(validate_pickled_model.classes_)}\n \
    and their probability: {yy_result_proba}')

Saved model

Serving Models with some Flask

We will use a Flask server to host the model and serve it through a REST API.

First we import or recreate all the relevant interfaces for the pickled model (note that this is the same code used in our notebook):

import string
import json
from flask import Flask, request
from flask_cors import CORS
from sklearn.base import TransformerMixin
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import sent_tokenize, ToktokTokenizer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

NGRAMS = (2,2) # Bigrams only
STOP_WORDS = stopwords.words('english')
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "...", "“", "”", "|", "#"]
COMMON_WORDS = [] # to be populated later in our analysis
toktok = ToktokTokenizer()
wnl = WordNetLemmatizer()

app = Flask(__name__)
CORS(app)

# These functions will be referred to by the unpickled object.
def _get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# Creating our tokenizer function. Can also use a TFIDF
def custom_tokenizer(sentence):
    # Let's use some speed here.
    tokens = [toktok.tokenize(sent) for sent in sent_tokenize(sentence)]
    tokens = [wnl.lemmatize(word, _get_wordnet_pos(word)) for word in tokens[0]]
    tokens = [word.lower().strip() for word in tokens]
    tokens = [tok for tok in tokens if (tok not in STOP_WORDS and tok not in SYMBOLS and tok not in COMMON_WORDS)]

    return tokens
def clean_text(text):
    if (type(text) == str):
        text = text.strip().replace("\n", " ").replace("\r", " ")
        text = text.lower()
    else:
        text = "NA"
    return text

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
    
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "*",
    "Access-Control-Allow-Headers": "*",
    "Access-Control-Max-Age": "3600",
}

Then prepare our prediction API:

def predict_profile(profile_dict):
    try:
        prediction = MODEL.predict([profile_dict["descriptions"]])
        label = ENCODER.inverse_transform(prediction)
        pp = MODEL.predict_proba([profile_dict["descriptions"]])

        position = ''
        if label[0] == "o":
            position = "Engineering"
        elif label[0] == "c":
            position =  "CFA"
        elif label[0] == "s":
            position =  "HR"
        elif label[0] == "f":
            position =  "Product Management"
        elif label[0] == "w":
            position =  "Managing Director"
        else:
            position = "Analyst"

        proba = round(pp[0][prediction][0]*100, 2)
        return {
            "label": position,
            "proba": proba
        }
    except Exception as e:
       app.logger.error(f'We got this error: {e}')
       return None

and serve it on a REST endpoint:

@app.route("/profile", methods=["POST"])
def profile():
    prop = request.get_json()

    if MODEL is None:
        raise RuntimeError("RE MODEL cannot be None!")
    if (
        hasattr(request, "headers")
        and "content-type" in request.headers
        and request.headers["content-type"] != "application/json"
    ):
        ct = request.headers["content-type"]
        return (
            json.dumps({"error": f"Unknown content type: {ct}!"}),
            400,
            CORS_HEADERS,
        )

    if prop is None:
        return (json.dumps({"error": "No features passed!"}), 400, CORS_HEADERS)

    titles = {
        "descriptions": prop["descriptions"] if "descriptions" in prop else -1,
    }
    prediction = predict_profile(titles)
    return (json.dumps(prediction), 200, CORS_HEADERS) if (prediction != None) else (
        json.dumps({"error": f"Unknown error in prediction!"}),
        503,
        CORS_HEADERS,
    )

Start the Flask server, and try it out with curl or Postman; a sample request is sketched after the snippet below:

if __name__ == "__main__":
    app.logger.info("Running from the command line")
    app.run(host="0.0.0.0", port=800)
    mock_profile()
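
A quick smoke test, sketched with Python's requests library rather than raw curl (it assumes the server above is running locally on port 800 and that requests is installed):

import requests

# POST a description to the /profile endpoint and print the prediction.
payload = {"descriptions": "successful cryptocurrency trader excels at technical analysis and risk management"}
resp = requests.post("http://127.0.0.1:800/profile", json=payload)
print(resp.status_code, resp.json())  # expect a JSON body with "label" and "proba"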

Install and Have fun

We are almost there!

To run the extension, open your Chrome browser and go to chrome://extensions. From there, click on Load Unpacked and select the chromeExtension folder:

Scores

Go to your LinkedIn feed, open the browser developer tools console, and watch the extension traverse your feed and pick up information.

It will replace the names of people in your network with their classification as Sesame Street characters, showing that an extension can create an augmented experience:

Augment LI

Such fun!

Conclusion

We learned how to create an extension and how to use NLP to make our LinkedIn more fun!

The extension scrapes LinkedIn, but only up to a maximum of 5 profiles. LinkedIn will ban your account if you abuse the site, so be a good citizen when you scrape.

References

GitHub

The article here is also available on GitHub

A Kaggle notebook is available here

Media

All media used (in the form of code or images) are either solely owned by me, acquired through licensing, or part of the Public Domain and granted use through Creative Commons License.

All PNGs used are from https://www.pngegg.com/

Sesame Street is copyrighted and all characters are owned by the company.

CC Licensing and Use

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Made with πŸ’— by Adam
