# LinkedIn Parser

### Made by Carlo Occhiena https://github.com/carloocchiena



This parser is a side project I made for fun in about an hour in March 2022.

The goal is to translate the posts of Michele Sciabarrà, the Italian legend of serverless applications.

His LinkedIn posts are often very technical, so I tried to make them a lil' bit easier to digest.

The parser cleans the text, lemmatizes the words, and looks each one up in an Italian dictionary.

For every word that is not in the dictionary, it runs a Google search.

The tool does not clean up typos, links, or anything else. It's just a quick project I made for fun.

Please note that I am a friend of Michele and this tool aims to be a tribute to him.


In [None]:
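# assumption: the googlesearch module imported below comes from the `google` package on PyPI
!pip install beautifulsoup4 lxml requests google
!pip install italian_dictionary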
!pip install -U spacy
!python -m spacy download it_core_news_sm

# restart runtime before proceeding (Runtime--> Restart Runtime)

In [None]:
from bs4 import BeautifulSoup as BS4 
import lxml
import requests

import spacy
import italian_dictionary
from googlesearch import search

# load the smallest spaCy model for the Italian language
# see https://spacy.io/models/it 

nlp = spacy.load('it_core_news_sm')

# setting the main variables 
clean_text = []
word_list = []
unknown = []
post_lemma = []
count = 0

# insert the URL of the post
# for now you have to retrieve it manually from LinkedIn,
# although this could easily be automated
url = "https://www.linkedin.com/feed/update/urn:li:activity:6909443962395049984/"

# preparing the soup
post = requests.get(url).content
soup = BS4(post, "lxml")
text = soup.select('.share-update-card__update-text')

# let's do some cleaning
for content in text:
    clean_text.append(content.text)

for sentence in clean_text:
  word_list.append(sentence.split())

# flatten the nested word lists into one space-separated string for the spaCy lemmatizer
word_string = " ".join(word for words in word_list for word in words)

# strip the commas attached to the words
clean_word_string = word_string.replace(",", "")

# lemmatize the words with SpaCy
doc = nlp(clean_word_string)
post_lemma = (" ".join(token.lemma_ for token in doc)).split(" ")
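
The flattening and cleaning step above can be tried out without any network access. The helper below is only a sketch: the name `flatten_and_clean` and the choice of punctuation to strip are assumptions, not part of the original notebook. It joins the per-sentence word lists into one string and removes punctuation stuck to the words, keeping apostrophes because Italian elisions rely on them.

```python
import string

# Hypothetical helper, not part of the original notebook.
# Assumption: apostrophes are kept because Italian elisions such as
# "l'uomo" rely on them, while other ASCII punctuation is dropped.
PUNCTUATION_TO_STRIP = string.punctuation.replace("'", "")


def flatten_and_clean(word_lists):
    """Join a list of word lists into one space-separated, cleaned string."""
    words = (word for words in word_lists for word in words)
    table = str.maketrans("", "", PUNCTUATION_TO_STRIP)
    cleaned = (word.translate(table) for word in words)
    # drop words that consisted only of punctuation
    return " ".join(word for word in cleaned if word)


sample = [["Ciao,", "mondo!"], ["l'uomo", "-", "serverless"]]
print(flatten_and_clean(sample))
```

Note that this sketch would also strip the dots and slashes from links, so URLs in a post would need separate handling.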

In [63]:
# look up each word on the Italian Dict. 
# this may take a while
for word in post_lemma:
  try:
    italian_dictionary.get_definition(word.lower(), all_data=False,  limit=1)
  except Exception:
    unknown.append(word.lower())
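
The loop above calls the dictionary once per token, so a word that appears several times is looked up several times. A minimal sketch of one way to avoid that, assuming a hypothetical `find_unknown` helper and a `lookup` callable that raises an exception for missing words (the same convention the loop above relies on):

```python
def find_unknown(words, lookup):
    """Return the unique lowercase words for which `lookup` raises.

    `lookup` is any callable that raises an exception when a word
    is not found (a hypothetical stand-in for the dictionary call).
    """
    unknown = []
    seen = set()
    for word in words:
        key = word.lower()
        if key in seen:
            continue  # already checked, skip the slow lookup
        seen.add(key)
        try:
            lookup(key)
        except Exception:
            unknown.append(key)
    return unknown


# Toy stand-in for the real dictionary, used only for illustration
toy_dictionary = {"cane": "dog", "gatto": "cat"}
print(find_unknown(["Cane", "pyrhon", "gatto", "Pyrhon"], toy_dictionary.__getitem__))
```

In the notebook, `lookup` could be something like `lambda w: italian_dictionary.get_definition(w, all_data=False, limit=1)`, which mirrors the call used in the loop.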

In [98]:
# look up the unknown words on Google
# and retrieve the first link for each of the first 15 words
count = 0
for word in unknown:
  if count < 15:
    count += 1
    search_result_list = list(search(word, tld="co.in", num=1, stop=1, pause=1))
    print(f'The best link to find info about {word} is: {search_result_list}')
  else:
    print("Max number of searches for the day!")
    break

The best link to find info about issues is: ['https://www.youtube.com/watch?v=7dqMyh4ILIg']
The best link to find info about pyrhon is: ['https://www.python.org/']
The best link to find info about piccolino is: ['https://www.piccolinobaby.com/']
The best link to find info about bit.ly/nuvambra1 is: []
The best link to find info about autoassegnata is: ['https://www.techdico.com/translation/italian-english/attivit%C3%A0+autoassegnata.html']
The best link to find info about youtube is: ['https://www.youtube.com/']
The best link to find info about ontributor is: ['https://www.lawinsider.com/dictionary/c-ontributor']
The best link to find info about reverse is: ['https://www.merriam-webster.com/dictionary/reverse']
The best link to find info about contributori is: ['https://en.wiktionary.org/wiki/contributori']
The best link to find info about tirar is: ['https://www.spanishdict.com/translate/tirar']
The best link to find info about quotidiane is: ['https://en.wiktionary.org/wiki/quotidian