
Hi all! This notebook is my compiled and tidied-up version of basic web scraping in Python using Beautiful Soup.

The page I am experimenting on is the following LinkedIn Learning course search:

https://www.linkedin.com/learning/search?trk=homepage-basic_intent-module-learning&sortBy=RELEVANCE&entityType=COURSE

In [1]:
# importing the libraries
import requests # to allow http requests
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from bs4 import BeautifulSoup
import os

In [2]:
# begin with the url of the list of courses
url = 'https://www.linkedin.com/learning/search?trk=homepage-basic_intent-module-learning&sortBy=RELEVANCE&entityType=COURSE'
req = requests.get(url).text
soup = BeautifulSoup(req, 'html.parser') # name the parser explicitly so bs4 does not warn or pick one silently
courses = soup.find_all('li',{'class':'results-list__item'})

In [3]:
# empty lists, one per field, for each course seen on that page
urlList = []
durationList = []
nameList = []
byList = []
viewCountList = []
releaseDateList = []

# loop through the courses to extract information
for course in courses:
    link = course.find(href=True) # first tag carrying an href; named link so the search url above is not overwritten
    urlList.append(link['href'])
    name = course.find('h3',{'class':'base-search-card__title'}).text.strip()
    nameList.append(name)
    by = course.find('h4',{'class':'base-search-card__subtitle'}).text.strip()
    byList.append(by)
    duration = course.find('div',{'class':'search-entity-media__duration'}).text.strip()
    durationList.append(duration)
    metadataItem = course.find_all('span')
    if "Released" in metadataItem[0].text:
        viewCountList.append(0)
        releaseDateList.append(metadataItem[0].text)
    else:
        viewCountList.append(metadataItem[0].text)
        releaseDateList.append(metadataItem[1].text)
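
The branch on `metadataItem` above assumes every card has at least one or two `<span>` tags. As a rough sketch (the helper name `split_metadata` and the sample strings are my own assumptions, not part of the original notebook), the same rule can be pulled into a small function that returns `None` instead of raising `IndexError` when a span is missing:

```python
def split_metadata(span_texts):
    """Return (viewer_count, release_date) from a card's <span> texts.

    Mirrors the branch in the loop above: when the first span already
    holds the release date, the card shows no viewer count, so 0 is used.
    Missing spans yield None instead of raising IndexError.
    """
    texts = [t.strip() for t in span_texts]
    if not texts:
        return None, None
    if "Released" in texts[0]:
        return 0, texts[0]
    release = texts[1] if len(texts) > 1 else None
    return texts[0], release


# Hypothetical span texts for illustration, not taken from the live page.
print(split_metadata(["Released Jun 1, 2022"]))
print(split_metadata(["1,234 viewers", "Released May 3, 2021"]))
print(split_metadata([]))
```

Inside the loop this could be called as `split_metadata([s.text for s in course.find_all('span')])`. Unlike the loop above, the sketch also strips surrounding whitespace.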

In [4]:
# form our dataframe from the obtained lists
df = pd.DataFrame(list(zip(nameList,urlList,durationList,byList,viewCountList,releaseDateList)),
                 columns = ['course name','url','duration','by','viewer count','release date'])
df

,course name,url,duration,by,viewer count,release date
0,Microsoft Teams Essential Training,https://www.linkedin.com/learning/microsoft-te...,2h 53m,By: Nick Brazzi,0,"Released Jun 1, 2022"
1,Introduction to RedisGraph,https://www.linkedin.com/learning/introduction...,1h 13m,By: Ayaka Shinozaki,0,"Released Jun 1, 2022"
2,Outlook Quick Tips,https://www.linkedin.com/learning/outlook-quic...,27m,By: Garrick Chow,0,"Released Jun 1, 2022"
3,Blockchain Programming in JavaScript,https://www.linkedin.com/learning/blockchain-p...,1h 55m,By: Mohammad Azam,0,"Released Jun 1, 2022"
4,Azure Dapr for .NET Developers Part 1,https://www.linkedin.com/learning/azure-dapr-f...,1h 47m,By: Rodrigo Díaz Concha,0,"Released Jun 1, 2022"
5,Learning Azure Kubernetes Service (AKS),https://www.linkedin.com/learning/learning-azu...,1h 18m,By: Richard Hooper,0,"Released Jun 2, 2022"
6,Mapping to Learn with Figma,https://www.linkedin.com/learning/mapping-to-l...,31m,By: Drew Bridewell,0,"Released Jun 2, 2022"
7,Git Workflows,https://www.linkedin.com/learning/git-workflow...,1h 4m,By: Kevin Bowersox,0,"Released Jun 2, 2022"
8,Foundations of Decentralized Finance (DeFi),https://www.linkedin.com/learning/foundations-...,59m,By: Kedric Van de Carr,0,"Released May 31, 2022"
9,"Working with Staffing Agencies, Recruiters, He...",https://www.linkedin.com/learning/working-with...,37m,By: Chris Taylor,0,"Released May 31, 2022"
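
The scraped values in the table above are still raw strings such as `2h 53m`, `By: Nick Brazzi` and `Released Jun 1, 2022`. A minimal cleaning sketch follows. The helper names are my own, and it assumes the strings always follow the formats shown in this output, with English month abbreviations:

```python
import re
from datetime import datetime


def duration_to_minutes(text):
    """Convert strings like '2h 53m' or '27m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def strip_prefix(text, prefix):
    """Drop a leading label such as 'By:' and surrounding whitespace."""
    text = text.strip()
    return text[len(prefix):].strip() if text.startswith(prefix) else text


def parse_release_date(text):
    """Parse 'Released Jun 1, 2022' into a datetime.date.

    %b depends on the locale; an English/C locale is assumed here.
    """
    return datetime.strptime(strip_prefix(text, "Released"), "%b %d, %Y").date()


print(duration_to_minutes("2h 53m"))
print(strip_prefix("By: Nick Brazzi", "By:"))
print(parse_release_date("Released Jun 1, 2022"))
```

With the dataframe above, these could be applied per column, for example `df['duration'].map(duration_to_minutes)`.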


In [5]:
# empty lists for items we could find from each course's url
likesList = []
skillLevelList = []
ratingList = []
ratingMaxList = []

# loop through each url to obtain information found on each webpage. If information is not found, append None
for course_url in df['url']:
    req = requests.get(course_url).text
    soup = BeautifulSoup(req, 'html.parser')
    temp = soup.find_all('span',{'class':'top-card__headline-row-item'})
    likes = None
    skill = None
    # the headline row holds both the like count and the skill level
    for item in temp:
        if "Liked" in item.text:
            likes = item.text
        if "Skill" in item.text:
            skill = item.text
    likesList.append(likes)
    skillLevelList.append(skill)
    rating = soup.find('span',{'class':'ratings-summary__rating-average'})
    if rating is not None:
        ratingList.append(rating.text)
    else:
        ratingList.append(None)
    ratingMax = soup.find('span',{'class':'ratings-summary__rating-max'})
    # check ratingMax itself (not rating) before reading its text
    if ratingMax is not None:
        ratingMaxList.append(ratingMax.text)
    else:
        ratingMaxList.append(None)
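
The loop above sends one request per course with no pause and no error handling. Here is a hedged sketch of a gentler pattern. `fetch_all`, `fake_fetch` and the example URLs are hypothetical names I made up for illustration; the fetcher is passed in so the idea can be tried without touching the network:

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in order, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page text.
    A failed fetch is recorded as None so one bad page does not stop the run.
    """
    pages = {}
    for index, page_url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        try:
            pages[page_url] = fetch(page_url)
        except Exception:  # broad on purpose for this sketch
            pages[page_url] = None
    return pages


# Stand-in fetcher for illustration. In the notebook this could be
# lambda u: requests.get(u, timeout=10).text
def fake_fetch(page_url):
    if "broken" in page_url:
        raise RuntimeError("simulated network error")
    return "<html>" + page_url + "</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
    delay_seconds=0,
)
print(pages)
```

The pages that come back as `None` can then be skipped or retried before parsing them with Beautiful Soup.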

In [6]:
# form a dataframe from the obtained lists
tempdf = pd.DataFrame(list(zip(likesList,skillLevelList,ratingList,ratingMaxList)),
                     columns = ['likes','skill level','rating','rating max'])

# combine our two dataframes side by side
df = pd.concat([df,tempdf],axis = 1)
df

,course name,url,duration,by,viewer count,release date,likes,skill level,rating,rating max
0,Microsoft Teams Essential Training,https://www.linkedin.com/learning/microsoft-te...,2h 53m,By: Nick Brazzi,0,"Released Jun 1, 2022",Liked by 2 users,Skill level: Beginner + Intermediate,,
1,Introduction to RedisGraph,https://www.linkedin.com/learning/introduction...,1h 13m,By: Ayaka Shinozaki,0,"Released Jun 1, 2022",Liked by 1 user,Skill level: Beginner,,
2,Outlook Quick Tips,https://www.linkedin.com/learning/outlook-quic...,27m,By: Garrick Chow,0,"Released Jun 1, 2022",Liked by 1 user,Skill level: General,,
3,Blockchain Programming in JavaScript,https://www.linkedin.com/learning/blockchain-p...,1h 55m,By: Mohammad Azam,0,"Released Jun 1, 2022",,Skill level: Intermediate,,
4,Azure Dapr for .NET Developers Part 1,https://www.linkedin.com/learning/azure-dapr-f...,1h 47m,By: Rodrigo Díaz Concha,0,"Released Jun 1, 2022",,Skill level: Intermediate,,
5,Learning Azure Kubernetes Service (AKS),https://www.linkedin.com/learning/learning-azu...,1h 18m,By: Richard Hooper,0,"Released Jun 2, 2022",Liked by 4 users,Skill level: Intermediate,,
6,Mapping to Learn with Figma,https://www.linkedin.com/learning/mapping-to-l...,31m,By: Drew Bridewell,0,"Released Jun 2, 2022",,Skill level: Intermediate,,
7,Git Workflows,https://www.linkedin.com/learning/git-workflow...,1h 4m,By: Kevin Bowersox,0,"Released Jun 2, 2022",,Skill level: Intermediate,,
8,Foundations of Decentralized Finance (DeFi),https://www.linkedin.com/learning/foundations-...,59m,By: Kedric Van de Carr,0,"Released May 31, 2022",Liked by 10 users,Skill level: General,,
9,"Working with Staffing Agencies, Recruiters, He...",https://www.linkedin.com/learning/working-with...,37m,By: Chris Taylor,0,"Released May 31, 2022",Liked by 1 user,Skill level: General,,
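
The `likes` and `skill level` columns above still carry their labels (`Liked by 2 users`, `Skill level: Beginner`). A small sketch for turning them into cleaner values follows. The function names are my own, and it assumes the text always looks like the rows shown here:

```python
import re


def parse_likes(text):
    """Return the number in 'Liked by 2 users', or None if it is missing."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def parse_skill_level(text):
    """Return the part after 'Skill level:', or None if it is missing."""
    if text is None:
        return None
    return text.split(":", 1)[-1].strip()


print(parse_likes("Liked by 10 users"))
print(parse_likes(None))
print(parse_skill_level("Skill level: Beginner + Intermediate"))
```

Missing values stay `None` rather than becoming 0, since a page without a "Liked" line does not necessarily mean zero likes.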


In [7]:
# save dataframe to csv file.
os.chdir(r'/kaggle/working')

df.to_csv(r'webScrapexp.csv', index = False)

This is the end of the page. Thanks for viewing!

Links to the two more detailed notebooks, which include my thought process:

https://www.kaggle.com/code/eugenetanake/basic-web-scraping-with-python-pt-1

https://www.kaggle.com/code/eugenetanake/basic-web-scraping-with-python-pt-2