# Part 1: Web Scraping

Web scraping is an art: you have to study each website and adapt your approach to how that particular site is built.

The most common Python tools for web scraping are listed below.

1. requests https://requests.readthedocs.io/en/latest/
2. beautiful soup https://beautiful-soup-4.readthedocs.io/en/latest/
3. Selenium https://selenium-python.readthedocs.io/
4. Scrapy https://docs.scrapy.org/en/latest/

We will work with the first three; the fourth can be explored in the homework.

We will be scraping 4 websites today:

1. GeeksforGeeks
2. MarketWatch
3. CNBC
4. Hoopshype

Scraping a dynamic website calls for different techniques than scraping a static one; both are discussed in the coming sections.

Some websites have their APIs open and those can be used to directly fetch the data without the need of scraping the HTML or XML pages.

In [1]:
# installing the libraries
!pip install requests
!pip install bs4
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1256 sha256=67634ddbe377d563904b4f3e38eb1a557e4071b147bce1655ed2e7db865ced41
  Stored in directory: /root/.cache/pip/wheels/25/42/45/b773edc52acb16cd2db4cf1a0b47117e2f69bb4eb300ed0e70
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Collecting selenium
  Downloading selenium-4.12.0-py3-none-any.whl (9.4 MB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.22.2-py3-none-any.whl (400 kB)
Collecting trio-websocket~=0.9 (from s

In [2]:
# importing the libraries
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import json
from google.colab import drive
import sys

In [3]:
drive.mount("/content/drive/")

Mounted at /content/drive/


In [4]:
# getting the first URL
# open the URL in parallel in other tab to check the information we are extracting
url = "https://www.geeksforgeeks.org/python-programming-language/"

In [5]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [6]:
# creating a soup object from the returned html page
sp = soup(res.text, "lxml")
sp

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8"/><meta content="Data Structures,Algorithms,Python,Java,C,C++,JavaScript,Android Development,SQL,Data Science,Machine Learning,PHP,Web Development,System Design,Tutorial,Technical Blogs,Interview Experience,Interview Preparation,Programming,Competitive Programming,SDE Sheet,Job-a-thon,Coding Contests,GATE CSE,HTML,CSS,React,NodeJS,Placement,Aptitude,Quiz,Computer Science,Programming Examples,GeeksforGeeks Courses,Puzzles" name="keywords"/><meta content="width=device-width,initial-scale=1,minimum-scale=.5,maximum-scale=3" name="viewport"/><link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/><link href="https://fonts.googleapis.com" rel="preconnect"/><link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><meta content="#308D46" name="theme-color"/><meta content="https://media.geeksforgeeks.org/wp-content/cdn-

In [None]:
# printing it in readable format
print(sp.prettify())

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="Data Structures,Algorithms,Python,Java,C,C++,JavaScript,Android Development,SQL,Data Science,Machine Learning,PHP,Web Development,System Design,Tutorial,Technical Blogs,Interview Experience,Interview Preparation,Programming,Competitive Programming,SDE Sheet,Job-a-thon,Coding Contests,GATE CSE,HTML,CSS,React,NodeJS,Placement,Aptitude,Quiz,Computer Science,Programming Examples,GeeksforGeeks Courses,Puzzles" name="keywords"/>
  <meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/>
  <link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/>
  <meta content="#308D46" name="theme-color"/>
  <meta content="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_200x200-min.png" name="image" property="og:image"/>
  <meta content="image/png" property="og:image:type"/>
  <meta co

In [None]:
# parsing title element of the page
print(sp.title)
print(sp.title.name)
print(sp.title.string)
print(sp.title.parent.name)

<title>Python Programming Language - GeeksforGeeks</title>
title
Python Programming Language - GeeksforGeeks
head


In [7]:
# extracting the title of the article
print(sp.find("h1", {"class" : "entry-title"}).text)

Python Tutorial


In [8]:
# extracting the date of the article
print(sp.find("div", {"class" : "meta"}).text)

Last Updated :
20 Sep, 2023
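
The scraped date arrives as free text, with a label and a line break in front of it. The following is a minimal sketch, assuming the date keeps the `DD Mon, YYYY` shape seen in the output above, of how it could be turned into a proper `date` object. The helper name `parse_last_updated` is hypothetical and not part of any library.

```python
import re
from datetime import date, datetime


def parse_last_updated(raw: str) -> date:
    """Pull a 'DD Mon, YYYY' date out of scraped text such as the 'meta' div."""
    match = re.search(r"\d{1,2} [A-Za-z]{3}, \d{4}", raw)
    if match is None:
        raise ValueError(f"no date found in {raw!r}")
    # %b matches abbreviated English month names in the default C locale
    return datetime.strptime(match.group(0), "%d %b, %Y").date()


print(parse_last_updated("Last Updated :\n20 Sep, 2023"))  # 2023-09-20
```

Storing dates in this form makes it easier to sort or filter scraped articles later, for example in a pandas DataFrame.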


In [9]:
# extracting the content of the article
# it extracts everything together; a later cell shows how to extract the information paragraph by paragraph
print(sp.find("div", {"class" : "page_content"}).text)

This Python Tutorial is very well suited for Beginners, and also for experienced programmers with other programming languages like C++ and Java. This specially designed Python tutorial will help you learn Python Programming Language in the most efficient way, with topics from basics to advanced (like Web-scraping, Django, Deep-Learning, etc.) with examples.What is Python?Python is a high-level, general-purpose, and very popular programming language. Python programming language (latest Python 3) is being used in web development, Machine Learning applications, along with all cutting-edge technology in Software Industry.Python language is being used by almost all tech-giant companies like – Google, Amazon, Facebook, Instagram, Dropbox, Uber… etc.The biggest strength of Python is huge collection of standard library which can be used for the following:Machine LearningGUI Applications (like Kivy, Tkinter, PyQt etc. )Web frameworks like Django (used by YouTube, Instagram, Dropbox)Image proces

In [10]:
# extracting the links found in the bottom of the article for further reading
for tag in sp.find("div", {"class": "Basics"}).find_all("a", href=True):
    print(tag.text, "\n", tag["href"], "\n\n")

Python language introduction 
 https://www.geeksforgeeks.org/python-language-introduction/ 


Python 3 basics 
 https://www.geeksforgeeks.org/python-3-basics/ 


Python The new generation language 
 https://www.geeksforgeeks.org/python-the-new-generation-language/ 


Important difference between python 2.x and python 3.x with example 
 https://www.geeksforgeeks.org/important-differences-between-python-2-x-and-python-3-x-with-examples/ 


Keywords in Python | Set 1 
 https://www.geeksforgeeks.org/keywords-python-set-1/ 


Set 2 
 https://www.geeksforgeeks.org/keywords-python-set-2/ 


Namespaces and Scope in Python 
 https://www.geeksforgeeks.org/namespaces-and-scope-in-python/ 


Statement, Indentation and Comment in Python 
 https://www.geeksforgeeks.org/statement-indentation-and-comment-in-python/ 


Structuring Python Programs 
 https://www.geeksforgeeks.org/structuring-python-programs/ 


How to check if a string is a valid keyword in Python? 
 https://www.geeksforgeeks.org/check-s

In [None]:
# extracting information paragraph by paragraph
for tag in sp.find("div", {"class": "page_content"}).find_all("p"):
    print(tag.text)

Python is a high-level, general-purpose and a very popular programming language. Python programming language (latest Python 3) is being used in web development, Machine Learning applications, along with all cutting edge technology in Software Industry. Python Programming Language is very well suited for Beginners, also for experienced programmers with other programming languages like C++ and Java.

This specially designed Python tutorial will help you learn Python Programming Language in most efficient way, with the topics from basics to advanced (like Web-scraping, Django, Deep-Learning, etc.) with examples.

Below are some facts about Python Programming Language:
Recent Articles on Python !Python Programming ExamplesPython Output & Multiple Choice Questions 
Basics, Input/Output, Data Types, Variables, Operators, Control Flow, Functions, Object Oriented Concepts, Exception Handling, Python Collections, Django Framework, Data Analysis, Numpy, Pandas, Machine Learning with Python, Pyth

Almost all geeksforgeeks articles share the same format, so all of them can be scraped with the same code. This repeatability is valuable in web scraping: one block of code can collect a lot of information in a structured way.

In [None]:
# getting the second url
# again open the url in parallel to track the extracted information
url = "https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963?mod=home-page"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
# creating a soup object from the returned html page
sp = soup(res.text, "html.parser")
sp

<!DOCTYPE html>

<html data-env="prod" data-site="marketwatch" lang="en-US">
<head>
<title>The next financial crisis may already be brewing, but not where many expect - MarketWatch</title>
<link href="https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963" rel="canonical"/>
<meta content="The next financial crisis may already be brewing --- but not where investors might expect" property="og:title"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-167x167.png" rel="apple-touch-icon" sizes="167x167"/>
<link href="https://mw4.wsj.net/mw5/cont

In [None]:
# getting the title
print(sp.find("h1", {"class" : "article__headline"}).text)


  The next financial crisis may already be brewing — but not where investors might expect



In [None]:
# getting the time
print(sp.find("time", {"class" : "timestamp--pub"}).text)


  First Published: Sept. 14, 2022 at 11:56 a.m. ET



Other information can be extracted using methods similar to those used for geeksforgeeks; this is left as homework.

As with geeksforgeeks, all MarketWatch articles share a similar structure, so the same code can be used to scrape all the articles on this website.

Next, we work with a dynamic website that has an open API.

Open the CNBC website and search for any topic. If we search for SPORTS the URL looks like this: https://www.cnbc.com/search/?query=SPORTS&qsearchterm=SPORTS and if we search for POLITICS the URL looks like this: https://www.cnbc.com/search/?query=POLITICS&qsearchterm=POLITICS

Here we can observe a pattern and predict what the URL would look like for any other search term. This makes the code reusable and repeatable.
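
The URL pattern observed above can be captured in a small helper. This is only a sketch: the function name `cnbc_search_url` is hypothetical, and it assumes the site keeps using the two query parameters `query` and `qsearchterm` seen in the example URLs.

```python
from urllib.parse import urlencode


def cnbc_search_url(term: str) -> str:
    """Build a CNBC search URL that follows the pattern of the two examples."""
    # both query parameters carry the same search term
    params = {"query": term, "qsearchterm": term}
    return "https://www.cnbc.com/search/?" + urlencode(params)


print(cnbc_search_url("SPORTS"))
# https://www.cnbc.com/search/?query=SPORTS&qsearchterm=SPORTS
```

Using `urlencode` rather than string concatenation also takes care of search terms that contain spaces or other special characters.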

The search for politics returns more than 50,000 results, but only 10 are loaded at first. Scrolling down loads the next 10, and so on. Each time more results are needed, the page calls an API, and the link to that API can be found in the Network tab of the browser's developer tools (Inspect): https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

By varying the endindex and batchsize parameters we can retrieve different slices of the results.

The response to this API call is JSON, so there is no need to parse an HTML page when an open API is available.
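
One way to vary these parameters programmatically is to rewrite the query string of the API URL. The sketch below uses hypothetical helper names (`with_params`, `batch_urls`) and assumes that `endindex` acts as the offset of the first result in a batch. The sample response for `endindex=10` reports page 2, which is consistent with that reading but does not confirm it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url: str, **updates) -> str:
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as 'callback='
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({key: str(value) for key, value in updates.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


def batch_urls(url: str, n_batches: int, batchsize: int = 10) -> list:
    """One URL per batch, stepping endindex by batchsize each time."""
    return [
        with_params(url, endindex=i * batchsize, batchsize=batchsize)
        for i in range(n_batches)
    ]


# shortened version of the API URL, for illustration only
example = "https://api.queryly.com/cnbc/json.aspx?query=POLITICS&endindex=10&batchsize=10&callback="
for batch_url in batch_urls(example, 3):
    print(batch_url)
```

Note that `urlencode` percent-encodes characters such as commas and `!`, so the rewritten URL may look different from the original while carrying the same parameter values.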

In [None]:
# getting the third url
url = "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
res.text

'{ "metadata" : { "q" : "politics", "totalresults" : 53379, "pagesize" : 10, "totalpage" : 5338, "pagerequested" : 2, "corrections" : [], "stems" : ["politics"], "suggestions" : ["politics"], "facetsuggestions" : [{ "facet" : "tags:show", "suggestions" : ["Markets and Politics Digital Original Video"] }, { "facet" : "tags:topic", "suggestions" : ["Politics"] }], "related" : [], "resultgenerationtime" : "46.8806 ms" }, "results" : [{ "description" : "Shareholder activism making inroads into ETF space remains a contentious topic for companies. Proponents of environmental, social and governance (ESG) products say investors are pushing corporations to pay more attention to broader social issues. Others, such as Strive Asset Management, say companie", "cn:lastPubDate" : "2022-10-06T10:49:15+0000", "dateModified" : "2022-10-06T10:49:15+0000", "cn:dateline" : "", "cn:branding" : "cnbc", "section" : "ETF Edge", "cn:type" : "cnbcnewsstory", "author" : "Kevin Schmidt", "cn:source" : [], "cn:subt

In [None]:
# getting the description of each news article
for item in json.loads(res.text)["results"]:
    print(item["description"], "\n")

Shareholder activism making inroads into ETF space remains a contentious topic for companies. Proponents of environmental, social and governance (ESG) products say investors are pushing corporations to pay more attention to broader social issues. Others, such as Strive Asset Management, say companie 

Iraq's powerful Shi'ite Muslim cleric Moqtada al-Sadr said on Monday he was quitting politics and closing his institutions in response to an intractable political deadlock, sparking protests by his followers and raising fears of more instability.Sadr's supporters, who have been staging a weeks-long  

Investment advisers say it's not wise to try to time the market, but it does make sense to periodically adjust your portfolio. So with the midterm elections now a week away but the outcome still not in focus, does it make sense to make those adjustments now? Probably not, say most financial advisors 

CNBC's Hadley Gamble discusses ties between the U.S. and Saudi Arabia after the kingdom rea
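
Beyond printing descriptions, the same JSON can be reduced to structured records. The sketch below keeps a few of the fields visible in the raw response above (`description`, `author`, `section`, `cn:lastPubDate`). The helper name `to_records` is hypothetical, and the sample payload is a made-up stand-in for `res.text`.

```python
import json


def to_records(payload: str,
               fields=("description", "author", "section", "cn:lastPubDate")) -> list:
    """Keep only the selected fields from each entry of the 'results' array."""
    results = json.loads(payload)["results"]
    # .get() yields None instead of raising when a field is absent
    return [{field: item.get(field) for field in fields} for item in results]


# tiny stand-in for res.text, with made-up values
sample = (
    '{"metadata": {"totalresults": 1}, "results": [{'
    '"description": "Example story", "author": "A. Writer", '
    '"section": "Politics", "cn:lastPubDate": "2022-10-06T10:49:15+0000", '
    '"cn:type": "cnbcnewsstory"}]}'
)
records = to_records(sample)
print(records[0]["author"])  # A. Writer
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(records)` to get a table with one row per article.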

Finally, we work with Selenium.

In [11]:
!sudo add-apt-repository ppa:saiarcot895/chromium-beta
!sudo apt remove chromium-browser
!sudo snap remove chromium
!sudo apt install chromium-browser -qq

!pip3 install selenium --quiet
!apt-get update
!apt install chromium-chromedriver -qq
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


PPA publishes dbgsym, you may need to include 'main/debug' component
Repository: 'deb https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu/ jammy main'
Description:
This PPA contains the latest Chromium Beta builds, with hardware video decoding enabled (hidden behind a flag), and support for Widevine (needed for viewing many DRM-protected videos) enabled.

== Hardware Video Decoding ==

To enable hardware video decoding, start Chromium with the --enable-features=VaapiVideoDecoder argument. To make this persistent, create a file at /etc/chromium-browser/customizations/92-vaapi-hardware-decoding with the following contents:

CHROMIUM_FLAGS="${CHROMIUM_FLAGS} --enable-features=VaapiVideoDecoder"

See also https://wiki.archlinux.org/title/Chromium#Hardware_video_acceleration for more information on VAAPI video decoding support.

=== Widevine Support ===

The packages in this PPA have support for Widevine inside Chromium enabled. However, you still need to copy some files from 

In [12]:
!pip install --upgrade selenium



In [13]:
!pip install selenium
!apt-get update
!apt-get install -y chromium-chromedriver


Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-chromedriver is already the newest versi

In [14]:
import sys
from selenium.webdriver.chrome.service import Service as ChromeService
# make the chromedriver location installed above visible to Python
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
# run Chrome headless: Colab has no display, so no browser window will open
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # You can change the log path as needed
)
driver = webdriver.Chrome(service=chrome_service,options=chrome_options)

In [15]:
# load the hoopshype salaries page in the headless browser
driver.get('https://hoopshype.com/salaries/players/')

In [15]:
# getting players name list
players = driver.find_elements("xpath", '//td[@class="name"]')

In [16]:
players_list = [p.text for p in players]
players_list

['',
 '',
 '',
 ...


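The names come back as empty strings in this run. Selenium's `.text` returns only text that is visibly rendered, so an explicit wait or `get_attribute("textContent")` may be needed to read them. Other information, such as players' salaries, can be extracted in the same way and is left as homework.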

# Part 2: PPT Scraping

In [None]:
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [16]:
!pip install python-pptx

Collecting python-pptx
  Downloading python_pptx-0.6.22-py3-none-any.whl (471 kB)
Collecting XlsxWriter>=0.5.7 (from python-pptx)
  Downloading XlsxWriter-3.1.4-py3-none-any.whl (153 kB)
Installing collected packages: XlsxWriter, python-pptx
Successfully installed XlsxWriter-3.1.4 python-pptx-0.6.22


We extract text from a PPT using the python-pptx library. It is typically used for generating PPTs from databases, but we can use some of its features to extract text from PPTs. This is a very basic example and can be extended as needed. Documentation for the library: https://python-pptx.readthedocs.io/en/latest/

In [17]:
# importing library
from pptx import Presentation

In [18]:
# extracting text slide by slide and shape by shape
# open the PPT in parallel to check the outcome
prs = Presentation("/content/drive/My Drive/Your big idea.pptx")
for counter_slide, slide in enumerate(prs.slides, start=1):
    print("slide:", counter_slide, "\n")
    counter_content = 1
    for shape in slide.shapes:
        try:
            print("content:", counter_content, shape.text, "\n")
            counter_content += 1
        except AttributeError:
            # shapes without text (e.g. pictures) have no .text attribute
            continue
    print("\n\n")

slide: 1 

content: 1 Making Presentations That Stick 

content: 2 A guide by Chip Heath & Dan Heath 




slide: 2 

content: 1 Selling your idea 

content: 2 Created in partnership with Chip and Dan Heath, authors of the bestselling book Made To Stick, this template advises users on how to build and deliver a memorable presentation of a new product, service, or idea. 




slide: 3 

content: 1 1. Intro 

content: 2 Choose one approach to grab the audience’s attention right from the start: unexpected, emotional, or simple.
UnexpectedHighlight what’s new, unusual, or surprising.
EmotionalGive people a reason to care.
SimpleProvide a simple unifying message for what is to come 




slide: 4 

content: 1 How many languages do you need to know to communicate with the rest of the world? 




slide: 5 

content: 1 Just one! Your own.
(With a little help from your smart phone) 




slide: 6 

content: 1 The Google Translate app can repeat anything you say in up to NINETY LANGUAGES from G

Other components of the PPT can be extracted in a similar way by following the documentation as needed.

# Part 3: PDF Scraping


Using the library PyPDF2: https://pypi.org/project/PyPDF2/
Here we use this library only to extract text from PDFs; tables and images require other methods.
Extracting text from PDFs is much more difficult than from web pages or PPTs, as there is no inherent structure where calling the right elements gives us everything. A PDF mainly records where characters are drawn on the page, so reading order, spacing and ligatures can get lost, as the output below shows. Scanned PDFs are effectively images and need optical character recognition (OCR), which PyPDF2 does not perform.

In [19]:
# installing the library
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [20]:
# importing the library
import PyPDF2

In [21]:
# reading the file
pdfFileObj = open('/content/drive/My Drive/Evaluation_of_Sentiment_Analysis_in_Finance_From_Lexicons_to_Transformers.pdf', 'rb')


In [22]:
# passing the file to PyPDF
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [23]:
# getting the number of pages
print(len(pdfReader.pages))

21


In [None]:
# getting the second page (pages are zero-indexed, so index 1 is page 2)
pageObj = pdfReader.pages[1]

In [None]:
# extracting text from that page
print(pageObj.extract_text())

K. Mishev et al.: Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers
decisions. The sentiments expressed in news and tweets inu-
ence stock prices and brand reputation, hence, constant mea-
surement and tracking of these sentiments is becoming one of
the most important activities for investors. Studies have used
sentiment analysis based on nancial news to forecast stock
prices [6][8], foreign exchange and global nancial market
trends [9], [10] as well as to predict corporate earnings [11].
Given that the nancial sector uses its own jargon, it is
not suitable to apply generic sentiment analysis in nance
because many of the words differ from their general meaning.
For example, ``liability'' is generally a negative word, but
in the nancial domain it has a neutral meaning. The term
``share'' usually has a positive meaning, but in the nancial
domain, share represents a nancial asset or a stock, which
is a neutral word. Furthermore, ``bull'' is neutral in gen
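
The extracted text keeps the line-break hyphens of the PDF layout (for example `mea-` followed by `surement`). A minimal post-processing sketch, with a hypothetical helper name `join_hyphenated`, rejoins such words using a regular expression. It cannot restore characters that were lost during extraction, such as the missing "fi" in "nancial", and it will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    # a hyphen directly before a newline, between two lowercase letters,
    # is treated as a line-break hyphen and removed together with the newline
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


sample = "hence, constant mea-\nsurement and tracking"
print(join_hyphenated(sample))  # hence, constant measurement and tracking
```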

# Homework

1. Following the geeksforgeeks example in the demo above, extract further information from MarketWatch articles beyond the title and date already demonstrated.

2. Following the example of extracting the names, extract the salary of each player from the Hoopshype website.

3. Choose any 2 websites of your choice and extract meaningful, structured information from them.

4. Explore the Scrapy library for web scraping, in addition to the three tools covered in the demo.

5. Pick a website that has tabular data (it can be one of the two selected above) and try to scrape it using the tools covered in the demo.

(The datasets you collect for the projects will come from text extraction, so make sure to extract usable, structured information.)

6. Explore the python-pptx library further and check how to differentiate between text coming from different components such as titles, subtitles and paragraphs.

7. Extract a table from a PPT using the same library.

8. Research some more libraries for extracting text from PDFs and show a basic implementation of any one of them.
