# Part 1: Web Scraping

Web scraping is as much an art as a technique: you have to study each website and adapt your approach to how that particular site is built.

The most common Python tools for web scraping are listed below.

1. requests: https://requests.readthedocs.io/en/latest/
2. Beautiful Soup: https://beautiful-soup-4.readthedocs.io/en/latest/
3. Selenium: https://selenium-python.readthedocs.io/
4. Scrapy: https://docs.scrapy.org/en/latest/

We will work with the first three; the fourth, Scrapy, is left for the homework.

We will be scraping 4 websites today:

1. GeeksforGeeks
2. MarketWatch
3. CNBC
4. Hoopshype

Static and dynamic websites call for different scraping techniques, which are discussed in the coming sections.

Some websites expose an open API, which can be used to fetch the data directly without scraping the HTML or XML pages.

In [None]:
# installing the libraries
!pip install requests
!pip install bs4
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting selenium
  Downloading selenium-4.4.3-py3-none-any.whl (985 kB)
Collecting trio~=0.17
  Downloading trio-0.21.0-py3-none-any.whl (358 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting urllib3[socks]~=1.26
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Co

In [None]:
# importing the libraries
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import json
from google.colab import drive
import sys



In [None]:
drive.mount("/content/drive/")

Mounted at /content/drive/


## GeeksforGeeks

In [None]:
# getting the first URL
# open the URL in another browser tab to check the information we are extracting
url = "https://www.geeksforgeeks.org/python-programming-language/"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
# creating a soup object from the returned html page
sp = soup(res.text, "lxml")
sp

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8"/><meta content="Data Structures,Algorithms,Python,Java,C,C++,JavaScript,Android Development,SQL,Data Science,Machine Learning,PHP,Web Development,System Design,Tutorial,Technical Blogs,Interview Experience,Interview Preparation,Programming,Competitive Programming,SDE Sheet,Job-a-thon,Coding Contests,GATE CSE,HTML,CSS,React,NodeJS,Placement,Aptitude,Quiz,Computer Science,Programming Examples,GeeksforGeeks Courses,Puzzles" name="keywords"/><meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/><link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/><meta content="#308D46" name="theme-color"/><meta content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png" name="image" property="og:image"/><meta content="image/png" property="og:image:type"/><meta content="200" property="og:i

In [None]:
# printing it in readable format
print(sp.prettify())

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="Data Structures,Algorithms,Python,Java,C,C++,JavaScript,Android Development,SQL,Data Science,Machine Learning,PHP,Web Development,System Design,Tutorial,Technical Blogs,Interview Experience,Interview Preparation,Programming,Competitive Programming,SDE Sheet,Job-a-thon,Coding Contests,GATE CSE,HTML,CSS,React,NodeJS,Placement,Aptitude,Quiz,Computer Science,Programming Examples,GeeksforGeeks Courses,Puzzles" name="keywords"/>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/>
  <link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/>
  <meta content="#308D46" name="theme-color"/>
  <meta content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png" name="image" property="og:image"/>
  <meta content="image/png" property="og:image:type"/>
  <meta co

In [None]:
# parsing title element of the page
print(sp.title)
print(sp.title.name)
print(sp.title.string)
print(sp.title.parent.name)

<title>Python Programming Language - GeeksforGeeks</title>
title
Python Programming Language - GeeksforGeeks
head


In [None]:
# extracting the title of the article
print(sp.find("h1", {"class" : "entry-title"}).text)

Python Programming Language


In [None]:
# extracting the date of the article
print(sp.find("div", {"class" : "meta"}).text)

Last Updated :
16 Jun, 2022


In [None]:
# extracting the content of the article
# this extracts everything together; a later cell shows how to extract the text paragraph by paragraph
print(sp.find("div", {"class" : "page_content"}).text)

Python is a high-level, general-purpose and a very popular programming language. Python programming language (latest Python 3) is being used in web development, Machine Learning applications, along with all cutting edge technology in Software Industry. Python Programming Language is very well suited for Beginners, also for experienced programmers with other programming languages like C++ and Java.This specially designed Python tutorial will help you learn Python Programming Language in most efficient way, with the topics from basics to advanced (like Web-scraping, Django, Deep-Learning, etc.) with examples.Below are some facts about Python Programming Language:Python is currently the most widely used multi-purpose, high-level programming language.Python allows programming in Object-Oriented and Procedural paradigms.Python programs generally are smaller than other programming languages like Java. Programmers have to type relatively less and indentation requirement of the language, makes

In [None]:
# extracting the links found at the bottom of the article for further reading
for tag in sp.find("div", {"class": "Basics"}).find_all("a", href=True):
    print(tag.text, "\n", tag["href"], "\n\n")

Python language introduction 
 https://www.geeksforgeeks.org/python-language-introduction/ 


Python 3 basics 
 https://www.geeksforgeeks.org/python-3-basics/ 


Python The new generation language 
 https://www.geeksforgeeks.org/python-the-new-generation-language/ 


Important difference between python 2.x and python 3.x with example 
 https://www.geeksforgeeks.org/important-differences-between-python-2-x-and-python-3-x-with-examples/ 


Keywords in Python | Set 1 
 https://www.geeksforgeeks.org/keywords-python-set-1/ 


Set 2 
 https://www.geeksforgeeks.org/keywords-python-set-2/ 


Namespaces and Scope in Python 
 https://www.geeksforgeeks.org/namespaces-and-scope-in-python/ 


Statement, Indentation and Comment in Python 
 https://www.geeksforgeeks.org/statement-indentation-and-comment-in-python/ 


Structuring Python Programs 
 https://www.geeksforgeeks.org/structuring-python-programs/ 


How to check if a string is a valid keyword in Python? 
 https://www.geeksforgeeks.org/check-s

In [None]:
# extracting information paragraph by paragraph
for tag in sp.find("div", {"class": "page_content"}).find_all("p"):
    print(tag.text)

Python is a high-level, general-purpose and a very popular programming language. Python programming language (latest Python 3) is being used in web development, Machine Learning applications, along with all cutting edge technology in Software Industry. Python Programming Language is very well suited for Beginners, also for experienced programmers with other programming languages like C++ and Java.
This specially designed Python tutorial will help you learn Python Programming Language in most efficient way, with the topics from basics to advanced (like Web-scraping, Django, Deep-Learning, etc.) with examples.

Below are some facts about Python Programming Language:

Recent Articles on Python !Python Programming ExamplesPython Output & Multiple Choice Questions 
Basics, Input/Output, Data Types, Variables, Operators, Control Flow, Functions, Object Oriented Concepts, Exception Handling, Python Collections, Django Framework, Data Analysis, Numpy, Pandas, Machine Learning with Python, Pyth

*Almost* all GeeksforGeeks articles share the same format, so the same code can scrape all of them. This repeatability is valuable in web scraping: a single block of code can collect a lot of information in a structured way.
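One way to take advantage of this repeatability is to wrap the fetch and parse steps in functions and apply them to a list of URLs. The sketch below is only an illustration: `scrape_many`, `fake_fetch`, `fake_parse` and the example.com pages are hypothetical names and stand-in data, not part of the demo. In practice `fetch` could wrap `requests.get(url).text` and `parse` could wrap the `sp.find(...)` calls shown above.

```python
from typing import Callable, Dict, Iterable, List


def scrape_many(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Apply the same fetch-then-parse steps to every URL and collect the rows."""
    rows = []
    for url in urls:
        html = fetch(url)   # e.g. a function that returns requests.get(url).text
        row = parse(html)   # e.g. a function wrapping the sp.find(...) calls
        row["url"] = url    # remember where each row came from
        rows.append(row)
    return rows


# Stand-in pages so the sketch runs without network access (made-up data).
fake_pages = {
    "https://example.com/a": "<h1>Article A</h1>",
    "https://example.com/b": "<h1>Article B</h1>",
}


def fake_fetch(url: str) -> str:
    return fake_pages[url]


def fake_parse(html: str) -> Dict[str, str]:
    # Hypothetical parser: strips the <h1> tags from the stand-in HTML.
    return {"title": html.replace("<h1>", "").replace("</h1>", "")}


rows = scrape_many(fake_pages, fake_fetch, fake_parse)
print(rows)
```

Keeping `fetch` and `parse` as separate arguments means the same loop can be reused for another site by swapping in a different parser.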

## MarketWatch


In [None]:
# getting the second url
# again open the url in parallel to track the extracted information
url = "https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963?mod=home-page"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
# creating a soup object from the returned html page
sp = soup(res.text, "html.parser")
sp

<!DOCTYPE html>

<html data-env="prod" data-site="marketwatch" lang="en-US">
<head>
<title>The next financial crisis may already be brewing, but not where many expect - MarketWatch</title>
<link href="https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963" rel="canonical"/>
<meta content="The next financial crisis may already be brewing --- but not where investors might expect" property="og:title"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-167x167.png" rel="apple-touch-icon" sizes="167x167"/>
<link href="https://mw4.wsj.net/mw5/cont

In [None]:
# getting the title
print(sp.find("h1", {"class" : "article__headline"}).text)


  The next financial crisis may already be brewing — but not where investors might expect



In [None]:
# getting the time
print(sp.find("time", {"class" : "timestamp--pub"}).text)


  First Published: Sept. 14, 2022 at 11:56 a.m. ET



Other information can be extracted with the same methods used for GeeksforGeeks; this is left as homework.

As with GeeksforGeeks, all MarketWatch articles share a similar structure, so the same code can be used to scrape every article on the site.
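The headline and timestamp printed earlier in this section come back padded with newlines and spaces. A small helper can tidy such strings before they are stored. This is a minimal sketch that assumes collapsing every run of whitespace into a single space is acceptable; `clean_text` is a hypothetical name, and `raw_time` only approximates the whitespace of the real output.

```python
def clean_text(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Approximation of the padded timestamp text returned by the page.
raw_time = "\n  First Published: Sept. 14, 2022 at 11:56 a.m. ET\n\n"
print(clean_text(raw_time))
```

The same helper could be applied to the result of any `.text` call before printing or saving it.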

Next, we work with a dynamic website that has an open API.

Open the CNBC website and search for any topic. If we search for SPORTS, the URL looks like this: https://www.cnbc.com/search/?query=SPORTS&qsearchterm=SPORTS and if we search for POLITICS, it looks like this: https://www.cnbc.com/search/?query=POLITICS&qsearchterm=POLITICS

There is a clear pattern here, so we can predict what the URL will look like for any other search term. This makes the code reusable and repeatable.
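The URL pattern can be captured in a small function. The sketch below assumes, based only on the two URLs above, that `query` and `qsearchterm` always carry the same term; `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.cnbc.com/search/"


def build_search_url(term: str) -> str:
    """Return the CNBC search URL for a term, following the observed pattern."""
    # Both parameters carry the same search term in the URLs shown above.
    return SEARCH_BASE + "?" + urlencode({"query": term, "qsearchterm": term})


print(build_search_url("SPORTS"))
print(build_search_url("POLITICS"))
```

`urlencode` also escapes characters such as spaces, so multi-word search terms still produce a valid URL.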

There are more than 50,000 results for the topic POLITICS, but only 10 are loaded at first. As we scroll down, the next 10 are loaded, and so on. Each time the user scrolls, the page calls an API to load the next batch. The link to that API can be found in the Network tab of the browser's inspect-element tools: https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

By varying the `endindex` and `batchsize` parameters we can fetch different slices of the results.

The response of this API call is JSON, so when an open API is available there is no need to parse a web page at all.
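To make the paging idea concrete, the sketch below builds one API URL per batch and then reads the descriptions out of a JSON payload. Two things here are assumptions rather than documented behavior: that `endindex` acts as an offset that grows by `batchsize` with each scroll, and the `page_urls` helper name. The `sample_response` string is made-up data that only mimics the top-level shape (`metadata`, `results`) of the real response.

```python
import json

# The captured API URL, with the query and the two paging values as placeholders.
API_TEMPLATE = (
    "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3"
    "&query={query}&endindex={endindex}&batchsize={batchsize}"
    "&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats"
    "&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C"
    "&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"
)


def page_urls(query, n_pages, batchsize=10):
    """Build one URL per batch, assuming endindex advances by batchsize each time."""
    return [
        API_TEMPLATE.format(query=query, endindex=page * batchsize, batchsize=batchsize)
        for page in range(1, n_pages + 1)
    ]


urls = page_urls("POLITICS", n_pages=3)
for u in urls:
    print(u)

# Made-up payload with the same top-level keys as the real response.
sample_response = (
    '{"metadata": {"totalresults": 2, "pagesize": 2},'
    ' "results": [{"description": "first item"}, {"description": "second item"}]}'
)
payload = json.loads(sample_response)
descriptions = [item["description"] for item in payload["results"]]
print(descriptions)
```

With a live response, `res.text` from `requests.get(url)` would take the place of `sample_response`.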

## CNBC

In [None]:
# getting the third url
url = "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=11&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
res.text

'{ "metadata" : { "q" : "politics", "totalresults" : 51673, "pagesize" : 11, "totalpage" : 4698, "pagerequested" : 2, "corrections" : [], "stems" : ["politics"], "suggestions" : ["politics"], "facetsuggestions" : [{ "facet" : "tags:show", "suggestions" : ["Markets and Politics Digital Original Video"] }, { "facet" : "tags:topic", "suggestions" : ["Politics"] }], "related" : [], "resultgenerationtime" : "26.0242 ms" }, "results" : [{ "_id" : 1163003, "description" : "Palantir CEO Alex Karp speaks with CNBC\'s Andrew Ross Sorkin at Aspen Ideas Festival about the evolution of the data company, how Palantir works with the U.S. military and intelligence agencies, and the balance between privacy and survellience.", "cn:lastPubDate" : "2022-06-29T18:55:12+0000", "dateModified" : "2022-06-29T18:55:12+0000", "cn:dateline" : "", "cn:branding" : "cnbc", "section" : "Aspen Ideas Festival", "cn:type" : "cnbcvideo", "author" : "", "cn:source" : [], "cn:subtype" : "clips", "duration" : "2928", "summa

In [None]:
# getting the description of each news article
for item in json.loads(res.text)["results"]:
    print(item["description"], "\n")

Palantir CEO Alex Karp speaks with CNBC's Andrew Ross Sorkin at Aspen Ideas Festival about the evolution of the data company, how Palantir works with the U.S. military and intelligence agencies, and the balance between privacy and survellience. 

Frank Slootman, Snowflake CEO, joins 'TechCheck' to discuss how Slootman likes to shape Snowflake's workplace culture, how the company's recent announcements help Snowflake's addressable market and more. 

When Finland and Sweden announced their interest in joining NATO, the two Nordic states were expected to be swiftly accepted as members of the defense alliance. But joining NATO requires consensus approval from all existing members, and Turkey – one of the group's most strategically important and mi 

Saudi Arabia's energy minister said Tuesday that OPEC+ will keep politics out of its decision-making in favor of the "common good" of stabilizing energy prices.Governments and international organizations around the world have imposed punitive s

## Hoopshype

Finally, we work with Selenium.

In [None]:
# the install cell above copied chromedriver to /usr/bin, so Selenium can find it on the PATH
# the browser runs headless here, so no visible Chrome window will open
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# passing the driver path as the first positional argument is deprecated in Selenium 4
driver = webdriver.Chrome(options=chrome_options)

In [None]:
# this loads the hoopshype salaries page in the headless browser
driver.get('https://hoopshype.com/salaries/players/')

In [None]:
# getting players name list
players = driver.find_elements("xpath", '//td[@class="name"]')

In [None]:
players_list = [p.text for p in players]
players_list

['',
 '',
 '',
 ... (the output continues with empty strings only)


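Note that every entry above is an empty string. Selenium's `.text` returns only text that is rendered as visible, so one possible cause is that the name cells are not displayed in the headless browser; reading `get_attribute("textContent")` instead is a common workaround. Other information, such as the players' salaries, can be extracted in the same way and is left as homework.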

# Part 2: PPT Scraping

In [None]:
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install python-pptx

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-pptx
  Downloading python-pptx-0.6.21.tar.gz (10.1 MB)
Collecting XlsxWriter>=0.5.7
  Downloading XlsxWriter-3.0.3-py3-none-any.whl (149 kB)
Building wheels for collected packages: python-pptx
  Building wheel for python-pptx (setup.py) ... done
  Created wheel for python-pptx: filename=python_pptx-0.6.21-py3-none-any.whl size=470951 sha256=1860d6eb93e2c80d9a09dd87c61ae0c87acbd114d86df2b05200a51331604bf5
  Stored in directory: /root/.cache/pip/wheels/a7/ab/f4/52560d0d4bd4055e9261c6df6e51c7b56c2b23cca3dee811a3
Successfully built python-pptx
Installing collected packages: XlsxWriter, python-pptx
Successfully installed XlsxWriter-3.0.3 python-pptx-0.6.21


Here we extract text from a PPT using the python-pptx library. The library is typically used to generate presentations, for example from database content, but some of its features can also be used to read text out of existing ones. This is a very basic example that can be extended as needed. Documentation for the library: https://python-pptx.readthedocs.io/en/latest/

In [None]:
# importing library
from pptx import Presentation

In [None]:
# extracting text slide by slide and shape by shape
# open the PPT in parallel to check the outcome
prs = Presentation("/content/Your big idea.pptx")
for counter_slide, slide in enumerate(prs.slides, start=1):
    print("slide:", counter_slide, "\n")
    counter_content = 1
    for shape in slide.shapes:
        # skip shapes that cannot hold text (pictures, tables, ...)
        if not shape.has_text_frame:
            continue
        print("content:", counter_content, shape.text, "\n")
        counter_content += 1
    print("\n\n")

slide: 1 

content: 1 Making Presentations That Stick 

content: 2 A guide by Chip Heath & Dan Heath 




slide: 2 

content: 1 Selling your idea 

content: 2 Created in partnership with Chip and Dan Heath, authors of the bestselling book Made To Stick, this template advises users on how to build and deliver a memorable presentation of a new product, service, or idea. 




slide: 3 

content: 1 1. Intro 

content: 2 Choose one approach to grab the audience’s attention right from the start: unexpected, emotional, or simple.
UnexpectedHighlight what’s new, unusual, or surprising.
EmotionalGive people a reason to care.
SimpleProvide a simple unifying message for what is to come 




slide: 4 

content: 1 How many languages do you need to know to communicate with the rest of the world? 




slide: 5 

content: 1 Just one! Your own.
(With a little help from your smart phone) 




slide: 6 

content: 1 The Google Translate app can repeat anything you say in up to NINETY LANGUAGES from G

Other components of the PPT can be extracted in a similar way by following the documentation.

# Part 3: PDF Scraping


We use the PyPDF2 library: https://pypi.org/project/PyPDF2/
In this demo the library is used only to extract text; tables and images require other methods.
Extracting text from PDFs is much harder than from web pages or PPTs because a PDF has no inherent structure in which calling the right element returns everything. A PDF mostly records where to draw characters on a page, so the extracted text can lose spacing, reading order, or ligatures. Scanned PDFs are effectively images and need optical character recognition (OCR) instead.

In [None]:
# installing the library
!pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading PyPDF2-2.11.0-py3-none-any.whl (220 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-2.11.0


In [None]:
# importing the library
import PyPDF2

In [None]:
# reading the file
pdfFileObj = open('/content/Evaluation_of_Sentiment_Analysis_in_Finance_From_Lexicons_to_Transformers.pdf', 'rb')


In [None]:
# passing the file to PyPDF2
# (PdfFileReader, numPages, getPage and extractText are deprecated names; the current ones are used here)
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [None]:
# getting the number of pages
print(len(pdfReader.pages))

21


In [None]:
# getting the second page (page indices start at 0)
pageObj = pdfReader.pages[1]

In [None]:
# extracting text from the second page
print(pageObj.extract_text())

K. Mishev et al.: Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers
decisions. The sentiments expressed in news and tweets inu-
ence stock prices and brand reputation, hence, constant mea-
surement and tracking of these sentiments is becoming one of
the most important activities for investors. Studies have used
sentiment analysis based on nancial news to forecast stock
prices [6][8], foreign exchange and global nancial market
trends [9], [10] as well as to predict corporate earnings [11].
Given that the nancial sector uses its own jargon, it is
not suitable to apply generic sentiment analysis in nance
because many of the words differ from their general meaning.
For example, ``liability'' is generally a negative word, but
in the nancial domain it has a neutral meaning. The term
``share'' usually has a positive meaning, but in the nancial
domain, share represents a nancial asset or a stock, which
is a neutral word. Furthermore, ``bull'' is neutral in gen
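
The extracted text keeps the line breaks and end-of-line hyphens of the PDF layout. A simple post-processing step can rejoin the split words. This is a rough sketch with a hypothetical helper name, `join_hyphenated_lines`; it assumes every hyphen at a line end marks a split word, so genuine compounds that happen to break at a hyphen would be merged too, and it cannot restore the dropped "fi" and "fl" ligatures seen in the output.

```python
import re


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words split by a hyphen at a line end, then unwrap the remaining line breaks."""
    text = re.sub(r"-\n", "", text)                 # "mea-\nsurement" becomes "measurement"
    return re.sub(r"\s*\n\s*", " ", text).strip()   # remaining newlines become single spaces


# A short excerpt in the style of the output above.
sample = (
    "hence, constant mea-\n"
    "surement and tracking of these sentiments is becoming one of\n"
    "the most important activities"
)
print(join_hyphenated_lines(sample))
```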

# Homework

1. As discussed in the demo above, using the example of geeksforgeeks, extract the information from marketwatch articles apart from the title and date that is already demonstrated.

2. As discussed in the demo above, extract the salaries of each of the players from the hoopshype website using the example of how to extract the names.

3. Apart from that choose any 2 websites of your choice and extract meaningful and structured information from there.

4. Also explore the Scrapy library for web scraping, in addition to the three tools covered in the demo.

5. Pick a website that has tabular data (it can be one of the two selected above) and try to scrape it using the tools studied during the demo.

(The datasets you will be collecting for the projects would be by text extraction so make sure to extract usable structured information)

6. Explore further the python-pptx library and check how to differentiate between texts coming from different components such as title, subtitle and paragraphs.

7. Extract table from a PPT using the same library.

8. Research and find some more libraries to extract text from PDFs and show basic implementation of any one of them.


### Answer 1

As discussed in the demo above, using the example of geeksforgeeks, extract the information from marketwatch articles apart from the title and date that is already demonstrated.


In [None]:
url = "https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963?mod=home-page"

res = requests.get(url)
#print(res.status_code)

sp = soup(res.text, "html.parser")
#print(sp)

In [None]:
#print(sp.title)
#print(sp.title.name)
print(sp.title.string)
print(sp.find("time", {"class" : "timestamp timestamp--update"}).text)
print(sp.find("time", {"class" : "timestamp timestamp--pub"}).text)

The next financial crisis may already be brewing, but not where many expect - MarketWatch

    Last Updated: Sept. 17, 2022 at 10:33 a.m. ET
  

  First Published: Sept. 14, 2022 at 11:56 a.m. ET



### Answer 2

As discussed in the demo above, extract the salaries of each of the players from the hoopshype website using the example of how to extract the names.

In [None]:
# the install cell above copied chromedriver to /usr/bin, so Selenium can find it on the PATH
# the browser runs headless here, so no visible Chrome window will open
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# passing the driver path as the first positional argument is deprecated in Selenium 4
driver = webdriver.Chrome(options=chrome_options)

In [None]:
driver.get('https://hoopshype.com/salaries/players/')
players = driver.find_elements("xpath", '//td[@class="name"]')

In [None]:
'''
players_list = []
for p in range(len(players)): players_list.append(players[p].text)
players_list
'''

In [None]:
salaries = driver.find_elements("xpath", '//td[@class="hh-salaries-sorted"]')

salaries_list = [s.text for s in salaries]
salaries_list

['2022/23',
 '$48,070,014',
 '$47,345,760',
 '$47,063,478',
 '$44,474,988',
 '$44,119,845',
 '$43,279,250',
 '$42,492,568',
 '$42,492,492',
 '$42,492,492',
 '$42,492,492',
 '$40,600,080',
 '$38,172,414',
 '$37,984,276',
 '$37,980,720',
 '$37,653,300',
 '$37,633,050',
 '$37,096,500',
 '$37,096,500',
 '$37,096,500',
 '$36,934,550',
 '$36,596,549',
 '$35,448,672',
 '$35,448,672',
 '$33,833,400',
 '$33,833,400',
 '$33,833,400',
 '$33,665,040',
 '$33,616,770',
 '$33,616,770',
 '$33,333,333',
 '$33,047,803',
 '$33,000,000',
 '$31,650,600',
 '$31,650,600',
 '$31,377,750',
 '$30,913,750',
 '$30,913,750',
 '$30,913,750',
 '$30,913,750',
 '$30,351,780',
 '$30,351,780',
 '$30,351,780',
 '$30,075,000',
 '$28,946,605',
 '$28,942,830',
 '$28,741,071',
 '$28,400,000',
 '$28,333,334',
 '$27,733,332',
 '$27,300,000',
 '$26,500,000',
 '$25,806,468',
 '$23,760,000',
 '$23,500,000',
 '$22,680,000',
 '$22,600,000',
 '$22,321,429',
 '$22,000,000',
 '$21,486,316',
 '$21,250,000',
 '$21,177,750',
 '$20,955,00
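
The scraped salaries are strings, and the first entry is the column header rather than a salary. Before any analysis they can be converted to integers. The sketch below uses a hypothetical helper, `parse_salary`, and a few values copied from the output above; it assumes every real salary starts with a dollar sign.

```python
def parse_salary(text: str) -> int:
    """Convert a string such as '$48,070,014' to the integer 48070014."""
    return int(text.replace("$", "").replace(",", ""))


# A few values copied from the output above; the first entry is the column header.
raw = ['2022/23', '$48,070,014', '$47,345,760', '$47,063,478']

# Keep only entries that look like dollar amounts, which drops the '2022/23' header.
salaries_int = [parse_salary(s) for s in raw if s.startswith("$")]
print(salaries_int)
```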

### Answer 3

Apart from that choose any 2 websites of your choice and extract meaningful and structured information from there.

In [None]:
#Apart from that choose any 2 websites of your choice and extract meaningful and structured information from there.
url = "https://www.esa.int/kids/en/learn/Our_Universe/Story_of_the_Universe/The_Universe"
res = requests.get(url)
print(res.status_code)
sp = soup(res.text, "html.parser")

print(sp.find("h1", {"class" : "entry-title"}).text)

for tag in sp.find("article", {"class":"article"}).findAll("p"):
  print(tag.text)

for tag in sp.find("ul", {"class":"rel_menu"}).findAll("a", href = True):
  print(tag.text, "\n", tag["href"], "\n\n")

200
The Universe
Access the image


The Universe is everything we can touch, feel, sense, measure or detect. It includes living things, planets, stars, galaxies, dust clouds, light, and even time. Before the birth of the Universe, time, space and matter did not exist.


The Universe contains billions of galaxies, each containing millions or billions of stars. The space between the stars and galaxies is largely empty. However, even places far from stars and planets contain scattered particles of dust or a few hydrogen atoms per cubic centimeter. Space is also filled with radiation (e.g. light and heat), magnetic fields and high energy particles (e.g. cosmic rays).


The Universe is incredibly huge. It would take a modern jet fighter more than a million years to reach the nearest star to the Sun. Travelling at the speed of light (300,000 km per second), it would take 100,000 years to cross our Milky Way galaxy alone.


No one knows the exact size of the Universe, because we cannot see th

In [None]:
#Apart from that choose any 2 websites of your choice and extract meaningful and structured information from there.
url = "https://www.w3schools.com/html/html_tables.asp"
res = requests.get(url)
print(res.status_code)
sp = soup(res.text, "html.parser")

print(sp.find("div", {"class" : "w3-code notranslate htmlHigh"}).text)

200

<table> 
<tr>    <th>Company</th>
    <th>Contact</th>     <th>Country</th>
 
</tr> 
<tr>    <td>Alfreds Futterkiste</td>
    <td>Maria 
  Anders</td>     <td>Germany</td>
 
</tr>  <tr>    <td>Centro 
  comercial Moctezuma</td>
    <td>Francisco 
  Chang</td>     <td>Mexico</td>
  </tr></table>



### Answer 4

Also explore the scrapy library to perform webscraping apart from the three discussed above in the demo

In [None]:
driver_ans4 = webdriver.Chrome(options=chrome_options)
driver_ans4.get('https://www.esa.int/kids/en/learn/Our_Universe/The_Sun/Our_nearest_star')

In [None]:
from selenium.webdriver.common.by import By

# finder API reference: https://www.selenium.dev/documentation/webdriver/elements/finders/
p_elements = driver_ans4.find_elements(By.TAG_NAME, "p")

for e in p_elements:
    print(e.text)

SOHO snaps the Sun's heat
Access the image
The Sun is our nearest star. The Sun provides us with light and heat. It also gives out dangerous ultraviolet light which causes sunburn and may cause cancer. Without the Sun there would be no daylight, and our planet would simply be a dark, frozen world, with no oceans of liquid water and no life.
This huge ball of superhot gas is 1.4 million kilometres across, equal to 109 Earths set side by side. With a mass of 2 million-trillion-trillion-trillion kilograms, it weighs as much as 330 000 Earths. About 1 300 000 Earths would fit inside the Sun!
From Earth, the Sun looks like it moves across the sky in the daytime and appears to disappear at night. This is because the Earth is spinning towards the east. The Earth spins about its axis, an imaginary line that runs through the middle of the Earth between the North and South poles. This means that to us here on the spinning Earth, the Sun appears to rise in the east in the morning, and climb highe

### Answer 5

Pick a website that has tabular data (it can be one of the two selected above) and try to scrape it using the tools studied during the demo.

(The datasets you will be collecting for the projects would be by text extraction so make sure to extract usable structured information)

In [None]:
url = "https://www.w3schools.com/html/html_tables.asp"
res = requests.get(url)
print(res.status_code)
sp = soup(res.text, "html.parser")
print(sp.find("table", {"class" : "ws-table-all"}).text)

200


Company
Contact
Country


Alfreds Futterkiste
Maria Anders
Germany


Centro comercial Moctezuma
Francisco Chang
Mexico


Ernst Handel
Roland Mendel
Austria


Island Trading
Helen Bennett
UK


Laughing Bacchus Winecellars
Yoshi Tannamuri
Canada


Magazzini Alimentari Riuniti
Giovanni Rovelli
Italy




In [None]:
table = sp.find("table", {"class" : "ws-table-all"})
rows = []

for row in table.find_all('tr'):
    # Find all data cells in the row; the header row has only <th> cells and is skipped
    columns = row.find_all('td')

    if columns:
        rows.append({
            'Company': columns[0].text.strip(),
            'Contact': columns[1].text.strip(),
            'Country': columns[2].text.strip(),
        })

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns=['Company', 'Contact', 'Country'])
df.head()

Unnamed: 0,Company,Contact,Country
0,Alfreds Futterkiste,Maria Anders,Germany
1,Centro comercial Moctezuma,Francisco Chang,Mexico
2,Ernst Handel,Roland Mendel,Austria
3,Island Trading,Helen Bennett,UK
4,Laughing Bacchus Winecellars,Yoshi Tannamuri,Canada
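
The row-by-row walk above can also be sketched without third-party packages. The example below is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser`, and both the `TableRowParser` class and the inline `SAMPLE_HTML` snippet are hypothetical stand-ins, not part of the original answer. It uses the header cells as dictionary keys, so every data row becomes a structured record.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical snippet shaped like the table scraped above
SAMPLE_HTML = """
<table class="ws-table-all">
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Ernst Handel</td><td>Roland Mendel</td><td>Austria</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)

# The first row holds the header cells; the remaining rows hold the data
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, which is the same shape of data the answer above builds by hand.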


### Answer 6



Explore the python-pptx library further and check how to differentiate between text coming from different components such as the title, subtitle and paragraphs.



In [None]:
from google.colab import drive
drive.mount('/content/drive')
from pptx import Presentation

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# extract text slide by slide and shape by shape
# open the PPT alongside to compare with the output
prs = Presentation("/content/Your big idea.pptx")

for n_slide, slide in enumerate(prs.slides, start=1):
  print("# slide:", n_slide, "\n")
  # slide.shapes.title is None when the slide has no title placeholder
  if slide.shapes.title is not None:
    print("title:", slide.shapes.title.text, "\n")

  for shape in slide.shapes:
    if not shape.has_text_frame:
      continue
    for paragraph in shape.text_frame.paragraphs:
      # the counter restarts for every paragraph, so it numbers the runs
      # (spans of uniformly formatted text) inside it, not the paragraphs
      for n_para, run in enumerate(paragraph.runs, start=1):
        print("paragraph:", n_para, run.text)

  print("\n\n")

# slide: 1 

title: Making Presentations That Stick 

paragraph: 1 Making Presentations That Stick
paragraph: 1 A guide by Chip Heath & Dan Heath



# slide: 2 

paragraph: 1 Selling your idea
paragraph: 1 Created in partnership with Chip and Dan Heath, authors of the bestselling book Made To Stick, this template advises users on how to build and deliver a memorable presentation of a new product, service, or idea.



# slide: 3 

paragraph: 1 1. Intro
paragraph: 1 Choose one approach
paragraph: 2  to grab the audience’s attention right from the start: 
paragraph: 3 unexpected, emotional, or simple.
paragraph: 1 Unexpected
paragraph: 2 Highlight what’s new, unusual, or
paragraph: 3  
paragraph: 4 surprising.
paragraph: 1 Emotional
paragraph: 2 Give people a reason to care.
paragraph: 1 Simple
paragraph: 2 Provide a simple unifying message for
paragraph: 3  
paragraph: 4 what
paragraph: 5  
paragraph: 6 is to come



# slide: 4 

title: How many languages do you need to know to communic

### Answer 7

Extract a table from a PPT using the same library.

In [None]:
# print the table row by row
# open the PPT alongside to compare with the output
prs = Presentation("/content/TablePPT.pptx")
slide = prs.slides[0]
table = slide.shapes[1].table # the table is the second shape here; the index may differ in other decks

for r in table.rows:
  s = " "
  for c in r.cells:
    s += c.text_frame.text + " , "
  print(s)

  , education , name , birth , 
 Yoonjung Choi , MS , SJSU , 88.09.06 , 
 Myungjong Kim , Ph.D , KAIST , 85.12.10 , 
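
The hand-built strings above carry a leading space and a trailing ` , `, which makes them awkward to parse later. As a hedged sketch, the standard library's `csv` module can serialise the same cells instead. The `rows` list below is copied from the printed output, not read from the PPT, and the variable names are illustrative.

```python
import csv
import io

# Cell texts copied from the output above; with python-pptx they would come from
# [[cell.text_frame.text for cell in row.cells] for row in table.rows]
rows = [
    ["", "education", "name", "birth"],
    ["Yoonjung Choi", "MS", "SJSU", "88.09.06"],
    ["Myungjong Kim", "Ph.D", "KAIST", "85.12.10"],
]

# Write the rows to an in-memory buffer; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the same cells, including the empty header cell
restored = list(csv.reader(io.StringIO(csv_text)))
print(restored == rows)
```

Note that the header row of this particular table does not line up with the data columns (its first cell is empty), which is presumably why the next cell assigns column names by hand.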


In [None]:
prs = Presentation("/content/TablePPT.pptx")
slide = prs.slides[0]
table = slide.shapes[1].table # the table is the second shape here; the index may differ in other decks

records = []

# skip the first row, which holds the header
for row in list(table.rows)[1:]:
  records.append({
    'Name': row.cells[0].text_frame.text,
    'Education': row.cells[1].text_frame.text,
    'SchoolName': row.cells[2].text_frame.text,
    'Birth': row.cells[3].text_frame.text,
  })

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected records
df = pd.DataFrame(records, columns=['Name', 'Education', 'SchoolName', 'Birth'])
df.head()

Unnamed: 0,Name,Education,SchoolName,Birth
0,Yoonjung Choi,MS,SJSU,88.09.06
1,Myungjong Kim,Ph.D,KAIST,85.12.10


### Answer 8

Research and find some more libraries that extract text from PDFs, and show a basic implementation of one of them.


In [None]:
!pip install pdfplumber -q



In [None]:
import pdfplumber

# the context manager closes the PDF once the first page has been read
with pdfplumber.open('/content/Evaluation_of_Sentiment_Analysis_in_Finance_From_Lexicons_to_Transformers.pdf') as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
display(text)

'ReceivedJune13,2020,acceptedJuly1,2020,dateofpublicationJuly16,2020,dateofcurrentversionJuly29,2020.\nDigitalObjectIdentifier10.1109/ACCESS.2020.3009626\nEvaluation of Sentiment Analysis in Finance:\nFrom Lexicons to Transformers\nKOSTADINMISHEV 1,ANAGJORGJEVIKJ 1,IRENAVODENSKA2,LUBOMIRT.CHITKUSHEV2,\nANDDIMITARTRAJANOV 1,(Member,IEEE)\n1FacultyofComputerScienceandEngineering,Ss.CyrilandMethodiusUniversity,1000Skopje,NorthMacedonia\n2FinancialInformaticsLab,MetropolitanCollege,BostonUniversity,Boston,MA02215,USA\nCorrespondingauthor:KostadinMishev(kostadin.mishev@ﬁnki.ukim.mk)\nThisworkwassupportedinpartbytheFacultyofComputerScienceandEngineering,Ss.CyrilandMethodiusUniversity,Skopje.\nABSTRACT Financial and economic news is continuously monitored by ﬁnancial market participants.\nAccordingtotheefﬁcientmarkethypothesis,allpastinformationisreﬂectedinstockpricesandnewinfor-\nmationisinstantaneouslyabsorbedindeterminingfuturestockprices.Hence,promptextractionofpositive\nor negative senti
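
The extracted text above contains typographic ligatures (such as the single character `ﬁ` in "ﬁnancial") and words split by a hyphen at the end of a line ("newinfor-" / "mation"). The sketch below shows one possible clean-up using only the standard library. The function name `clean_extracted_text` and the sample string are illustrative assumptions, and the rule for joining hyphenated words is a simple heuristic that can also merge genuinely hyphenated compounds.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Expand ligatures and rejoin words hyphenated across line breaks."""
    # NFKC maps compatibility characters such as U+FB01 (the 'fi' ligature)
    # to their plain-letter equivalents
    text = unicodedata.normalize("NFKC", raw)
    # Heuristic: a hyphen right before a newline that is followed by a
    # lowercase letter is treated as a line-break hyphen and removed
    text = re.sub(r"-\n(?=[a-z])", "", text)
    return text


# Fragment modelled on the output above; \ufb02 is the 'fl' ligature
sample = "re\ufb02ectedinstockpricesandnewinfor-\nmationisinstantaneously"
print(clean_extracted_text(sample))
```

This does not restore the missing spaces between words. In pdfplumber those depend on how characters are grouped into words, so passing a smaller `x_tolerance` to `page.extract_text()` may help, although that has not been verified on this PDF.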

In [None]:
print(text)

ReceivedJune13,2020,acceptedJuly1,2020,dateofpublicationJuly16,2020,dateofcurrentversionJuly29,2020.
DigitalObjectIdentifier10.1109/ACCESS.2020.3009626
Evaluation of Sentiment Analysis in Finance:
From Lexicons to Transformers
KOSTADINMISHEV 1,ANAGJORGJEVIKJ 1,IRENAVODENSKA2,LUBOMIRT.CHITKUSHEV2,
ANDDIMITARTRAJANOV 1,(Member,IEEE)
1FacultyofComputerScienceandEngineering,Ss.CyrilandMethodiusUniversity,1000Skopje,NorthMacedonia
2FinancialInformaticsLab,MetropolitanCollege,BostonUniversity,Boston,MA02215,USA
Correspondingauthor:KostadinMishev(kostadin.mishev@ﬁnki.ukim.mk)
ThisworkwassupportedinpartbytheFacultyofComputerScienceandEngineering,Ss.CyrilandMethodiusUniversity,Skopje.
ABSTRACT Financial and economic news is continuously monitored by ﬁnancial market participants.
Accordingtotheefﬁcientmarkethypothesis,allpastinformationisreﬂectedinstockpricesandnewinfor-
mationisinstantaneouslyabsorbedindeterminingfuturestockprices.Hence,promptextractionofpositive
or negative sentiments from new