# LinkedIn Profile Scraping Using BeautifulSoup & Selenium
# By Eya Kaabachi 

We're going to cover a very popular topic in Python: web scraping, which means automating the crawling of web pages and the extraction of their data. Before we begin, here are some important rules to follow and understand:


- Be respectful and ask for permission to scrape. Don't bombard a website with requests, or your IP address could get blocked!
- Understand that websites and their code change regularly, which means that code that works perfectly one day could be totally inoperative the next.
- Almost all interesting web scraping projects are unique and customized, so you will have to make an effort to generalize the skills developed here.
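
One concrete way to check what a site allows is its robots.txt file. The sketch below uses Python's standard `urllib.robotparser` on a made-up set of rules. The rules, the `example.com` URLs and the `my-scraper` user agent are illustrative assumptions, not LinkedIn's actual policy, and a site's terms of service still apply on top of robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]

def is_allowed(url, user_agent="my-scraper", rules=SAMPLE_RULES):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse rule lines directly instead of downloading them
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/public/page"))
print(is_allowed("https://example.com/private/page"))
```

For a real site, `RobotFileParser.set_url()` followed by `read()` would download the live file instead of using inlined rules.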

Are you ready? OK, let's start!

We should set a plan for our web scraping method! We have 4 steps to go through to get the desired result:

- **TASK 1** : Access the LinkedIn login page
- **TASK 2** : Search for the profile we want
- **TASK 3** : Navigate through next pages
- **TASK 4** : Scrape the data of each profile and write the data to a .csv file

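We're going to need several libraries and modules:
- Selenium: to interact with the web browser
- ChromeDriver: the driver Selenium uses to control Chrome (downloaded here by webdriver_manager)
- BeautifulSoup4: to extract content from a page's HTML
- time: to delay an action by an amount of time (part of the standard library, no install needed)
- csv: to work with CSV files (part of the standard library, no install needed)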

### Task 1:
* **Task 1.1 :**
- Open the Chrome browser
- Access the LinkedIn site
* **Task 1.2 :**
- Key in the login credentials
- Log in

(PS: you need to create a text file named id.txt with your email on the first line and your password on the second.)
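
Reading that two-line file can be sketched as a small helper. The function name `read_credentials` is hypothetical; the point it illustrates is removing the line endings, since a trailing newline sent to a form field may be typed as an Enter key press.

```python
from pathlib import Path

def read_credentials(path="id.txt"):
    """Return (email, password) from a two-line text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if len(lines) < 2:
        raise ValueError("expected the email on line 1 and the password on line 2")
    # splitlines() drops the newlines; strip() also removes stray spaces.
    return lines[0].strip(), lines[1].strip()
```

A call such as `username, password = read_credentials("id.txt")` would then give values that are safe to pass to `send_keys`.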

In [1]:
import numpy as np
import pandas as pd

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import csv
from time import sleep

In [3]:
# Step 1: log in to LinkedIn
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Open Chrome and load the LinkedIn login page
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = 'https://www.linkedin.com/login'
driver.get(url)
print('- finish initializing a driver')
sleep(2)

# Read the email (line 1) and password (line 2) from id.txt;
# strip() removes the trailing newline so it is not typed into the form
with open('id.txt') as credentials_file:
    lines = credentials_file.readlines()
username = lines[0].strip()
password = lines[1].strip()

# Key in the username
email_field = driver.find_element(By.ID, 'username')
email_field.send_keys(username)
print('-finish keying in email')

sleep(3)

# Key in the password
password_field = driver.find_element(By.NAME, 'session_password')
password_field.send_keys(password)
print('-finish keying in password')

sleep(2)

# Click the login button
login_field = driver.find_element(By.XPATH, '//*[@id="organic-div"]/form/div[3]/button')
login_field.click()
print('-finish logging in')
sleep(3)



Current google-chrome version is 98.0.4758
Get LATEST chromedriver version for 98.0.4758 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/98.0.4758.102/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\Eya Kaabachi\.wdm\drivers\chromedriver\win32\98.0.4758.102]


- finish initializing a driver
-finish keying in email
-finish keying in password
-finish logging in
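
The fixed `sleep` calls in this step either wait too long or not long enough, depending on how fast the page loads. Selenium ships `WebDriverWait` for this purpose; the library-independent sketch below only illustrates the underlying idea of polling until a condition holds. The name `wait_until` and its default values are assumptions made for the example.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause between checks instead of one long sleep
```

With a live driver, the condition could be something like `lambda: driver.find_elements(By.ID, 'username')`, which returns an empty (falsy) list until the field exists.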


### Task 2:
- Locate the search bar
- input search keyword
- search

In [4]:
# Step 2: search for the profiles we want to crawl
from selenium.webdriver.common.by import By

# Locate the search bar element
search_field = driver.find_element(By.XPATH, '//*[@id="global-nav-typeahead"]/input')
# Type the search query into the search bar
search_query = input('What profile do you want to scrape?')
search_field.send_keys(search_query)
# Submit the search
search_field.send_keys(Keys.RETURN)
print('-finish searching...')

What profile do you want to scrape?ESPRIT
-finish searching...


We also have to locate the "see all people results" button and click it.

In [5]:
# Open "see all people results"
from selenium.webdriver.common.by import By

sleep(2)
see_all_button = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[2]/div[2]')
see_all_button.click()

#sleep(2)
#login_field = driver.find_element_by_xpath('//*[@id="search-reusables__filters-bar"]/div/div/button')
#login_field.click()

#sleep(2)
#driver.execute_script('window.scrollTo(0, document.scrollHeight)')
#login_field = driver.find_element_by_xpath('//*[@id="ember889"]/ul/li[6]/fieldset/div/ul/li[1]')
#login_field.click()

#sleep(2)
#driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
#sleep(2)
#login_field = driver.find_element_by_xpath('//*[@id="ember571"]')
#login_field.click()
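
Positional XPaths like the ones in this step tend to break whenever the page layout changes. One possible alternative is to open the people-search results URL directly with `driver.get`. The sketch below builds such a URL; the base path is the one that appears in the scraped output later in this notebook, while the `keywords` query parameter is an assumption about how LinkedIn encodes the search term and may need adjusting.

```python
from urllib.parse import urlencode

PEOPLE_SEARCH_URL = "https://www.linkedin.com/search/results/people/"

def people_search_url(query):
    """Build a people-search URL for the given query string."""
    # "keywords" is an assumed parameter name, not confirmed by the notebook.
    return PEOPLE_SEARCH_URL + "?" + urlencode({"keywords": query})

print(people_search_url("ESPRIT"))
```

`urlencode` takes care of escaping spaces and special characters in the query.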


### Task 3:
* **Task 3.1 :**
- Put all the urls of one page in a list
* **Task 3.2 :**
- Crawl all the pages and put the urls in one list

In [6]:
# Step 3: scrape the URLs of the profiles
def GetURL():
    """Return the unique link URLs found on the current results page."""
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    profiles = soup.find_all('a', {'class':"app-aware-link",'target':''} )
    all_profile_url = []
    for profile in profiles:
        profile_id = profile.get('href')
        if not profile_id:
            continue  # skip anchors that have no href
        # Drop the query string so each profile appears only once
        profile_url = profile_id.rsplit('?',1)[0]
        if profile_url not in all_profile_url:
            all_profile_url.append(profile_url)
    return all_profile_url
print(GetURL())

['https://www.linkedin.com/search/results/people/', 'https://www.linkedin.com/in/ncib-lotfi-68443963', 'https://www.linkedin.com/in/amri-omar', 'https://www.linkedin.com/in/guesmi-mohamed-208309202']
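
Note that the first entry in this list is the search results page itself, not a profile. A minimal sketch of a filter that keeps only profile links, assuming profile paths always start with `/in/` as they do in the output above (`keep_profile_urls` is a hypothetical helper name):

```python
from urllib.parse import urlsplit

def keep_profile_urls(urls):
    """Keep unique URLs whose path starts with /in/, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if not url:
            continue  # ignore None or empty strings
        clean = url.split("?", 1)[0].rstrip("/")
        if urlsplit(clean).path.startswith("/in/") and clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept

scraped = [
    "https://www.linkedin.com/search/results/people/",
    "https://www.linkedin.com/in/ncib-lotfi-68443963",
    "https://www.linkedin.com/in/amri-omar",
    "https://www.linkedin.com/in/guesmi-mohamed-208309202",
]
print(keep_profile_urls(scraped))
```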


In [9]:
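from selenium.webdriver.common.by import By

input_page = int(input('How many pages you want to scrape: '))
urls_all_page = []
for page in range(input_page):
    # Collect the URLs on the current page before moving on
    url_one_page = GetURL()
    urls_all_page = urls_all_page + url_one_page
    sleep(2)
    if page < input_page - 1:
        # Scroll to the bottom so the pagination controls are rendered
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(2)
        # Move to the next results page
        next_button = driver.find_element(By.CLASS_NAME, 'artdeco-pagination__button--next')
        next_button.click()
        sleep(2)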


How many pages you want to scrape: 1
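
The paging logic can be sketched independently of the browser by passing in the two actions as functions. This is an illustration under assumptions: `collect_pages`, `fake_get` and `fake_next` are hypothetical names, and the fake pages stand in for `GetURL()` and the "next" button click. The sketch skips duplicates across pages and does not try to advance after the last requested page.

```python
def collect_pages(get_page_urls, go_to_next_page, n_pages):
    """Gather unique URLs from n_pages consecutive result pages."""
    collected = []
    for page in range(n_pages):
        for url in get_page_urls():
            if url not in collected:
                collected.append(url)
        if page < n_pages - 1:
            go_to_next_page()  # no click needed once the last page is read
    return collected

# Fake two-page result set standing in for the browser.
fake_pages = [["a", "b"], ["b", "c"]]
state = {"page": 0, "clicks": 0}

def fake_get():
    return fake_pages[state["page"]]

def fake_next():
    state["page"] += 1
    state["clicks"] += 1

result = collect_pages(fake_get, fake_next, 2)
print(result)
```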


### Task 4:
- extract data about one profile
- go through all the list of urls
- put the data into a csv file

In [None]:
# Step 4: scrape the data of each LinkedIn profile and write it to a .csv file
with open('output.csv','w', newline = '', encoding='utf-8') as file_output:
    headers = ['Name','Job Title', 'Location', 'URL']
    writer = csv.DictWriter(file_output, delimiter =',', lineterminator='\n', fieldnames=headers)
    writer.writeheader()
    for linkedin_URL in urls_all_page:
        driver.get(linkedin_URL)
        sleep(2)
        page_source = driver.page_source
        soup = BeautifulSoup(page_source, "html.parser")
        name_div = soup.find('div',{'class':'mt2 relative'})
        if name_div is None:
            # Not a profile page (e.g. the search results URL), so skip it
            continue
        name_loc = name_div.find_all('div')
        name = name_loc[0].find('h1').get_text().strip()
        print('- profile name is : ',name)
        loc = name_loc[5].find('span')
        location=loc.get_text().strip()
        print('- profile location is : ',location)
        profile_title = name_loc[2].get_text().strip()
        print('- profile title is : ', profile_title)
        print('\n')
        # headers[1] is 'Job Title' and headers[2] is 'Location'
        writer.writerow({headers[0]:name, headers[1]:profile_title, headers[2]:location, headers[3]:linkedin_URL})
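
Keying each value by its column name, rather than by position in the header list, makes it harder to put a value in the wrong column. A minimal sketch that writes to an in-memory buffer so it runs without a browser or a file; `profile_row`, `rows_to_csv` and the sample profile are hypothetical and not scraped data.

```python
import csv
import io

HEADERS = ["Name", "Job Title", "Location", "URL"]

def profile_row(name, job_title, location, url):
    """Build a row keyed by column name; missing values become empty strings."""
    return {
        "Name": name or "",
        "Job Title": job_title or "",
        "Location": location or "",
        "URL": url,
    }

def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up profile used only to show the output format.
sample = profile_row("Jane Doe", "Data Engineer", "Tunis, Tunisia",
                     "https://www.linkedin.com/in/jane-doe")
print(rows_to_csv([sample]))
```

The location contains a comma, so the `csv` module quotes that field automatically.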

And there you have it! You can customise the data you want to extract as you prefer. You're welcome :)