# Single-Page Scraping

## Scraping all players who won a medal for India at the Olympics

1. The goal is to parse a single-page website and store its content in a text file.
2. The page chosen here lists every player who has won an Olympic medal for India: https://olympics.com/en/news/india-olympics-medals
3. The details of each player are extracted from the page and stored in the file players.txt.


In [1]:
import requests
from bs4 import BeautifulSoup

In [8]:
import texttable as tt
from fake_useragent import UserAgent

# Send a browser-like User-Agent header with the request.
user_agent = UserAgent()
url = 'https://olympics.com/en/news/india-olympics-medals'
page = requests.get(url, headers={'user-agent': user_agent.chrome})
soup = BeautifulSoup(page.text, 'lxml')

# The medal table has four columns, so consume the <td> cells four at a time.
data = []
data_iterator = iter(soup.find_all('td'))

while True:
  try:
    athlete = next(data_iterator).text
    medal = next(data_iterator).text
    sport = next(data_iterator).text
    place = next(data_iterator).text

    data.append((athlete, medal, sport, place))

  except StopIteration:
    # No cells left; an incomplete trailing row is dropped.
    break

table = tt.Texttable()
table.header(('athlete', 'medal', 'sport', 'place'))
table.set_cols_align(('c', 'c', 'c', 'c'))  # 'l' = left, 'c' = center, 'r' = right
table.add_rows(data, header=False)  # data holds no header row

print(table.draw())

+-------------------------+--------+------------------------+------------------+
|         athlete         | medal  |         sport          |      place       |
+=========================+========+========================+==================+
|    Norman Pritchard     | Silver |       Men's 200m       |    Paris 1900    |
+-------------------------+--------+------------------------+------------------+
|    Norman Pritchard     | Silver |   Men's 200m hurdles   |    Paris 1900    |
+-------------------------+--------+------------------------+------------------+
|   Indian hockey team    |  Gold  |      Men's hockey      |  Amsterdam 1928  |
+-------------------------+--------+------------------------+------------------+
|   Indian hockey team    |  Gold  |      Men's hockey      | Los Angeles 1932 |
+-------------------------+--------+------------------------+------------------+
|   Indian hockey team    |  Gold  |      Men's hockey      |   Berlin 1936    |
+-------------------------+--------+------------------------+------------------+
|   Indian hockey team    | 

In [12]:
with open("players.txt", "w") as file:
  for t in data:
    # One player per line; end each line with a real newline character.
    file.write(f"{t[0]} {t[1]} {t[2]} {t[3]}\n")
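
The scraping loop above silently drops a trailing row when the number of cells is not a multiple of four, and space-separated fields are hard to split again because names and venues contain spaces themselves. The following is a minimal sketch of one alternative, assuming the cell texts are already collected in a flat list. `group_cells` and `write_rows` are hypothetical helper names, and the tab-separated layout is an assumption, not what the cell above writes.

```python
import csv
import io

def group_cells(cells, width=4):
    """Group a flat list of cell texts into rows of `width` cells (hypothetical helper)."""
    if len(cells) % width != 0:
        # Fail loudly instead of silently dropping a partial row.
        raise ValueError(f"{len(cells)} cells do not fill rows of {width}")
    return [tuple(cell.strip() for cell in cells[i:i + width])
            for i in range(0, len(cells), width)]

def write_rows(rows, file):
    """Write a header line and the rows as tab-separated text (assumed layout)."""
    writer = csv.writer(file, delimiter='\t', lineterminator='\n')
    writer.writerow(('athlete', 'medal', 'sport', 'place'))
    writer.writerows(rows)

# Sample cells taken from the first two rows of the table printed above.
cells = ['Norman Pritchard', 'Silver', "Men's 200m", 'Paris 1900',
         'Norman Pritchard', 'Silver', "Men's 200m hurdles", 'Paris 1900']
rows = group_cells(cells)

buffer = io.StringIO()  # stands in for open("players.txt", "w", newline="")
write_rows(rows, buffer)
print(buffer.getvalue())
```

With the scraped page, `cells` would be `[td.text for td in soup.find_all('td')]`, and `buffer` would be replaced by the opened file.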

# Multi-Page Web Scraping

1. This part uses the website https://subslikescript.com/, which lists movies and TV shows along with their plot summaries.
2. The plot summaries of one TV series, Planet Earth II, are extracted and stored in the file plots.txt.

In [None]:
pip install fake_useragent

Collecting fake_useragent
  Downloading fake_useragent-1.2.1-py3-none-any.whl (14 kB)
Installing collected packages: fake_useragent
Successfully installed fake_useragent-1.2.1


In [53]:
from fake_useragent import UserAgent
import re

user_agent = UserAgent()
main_url = 'https://subslikescript.com/series/Planet_Earth_II-5491994'
page = requests.get(main_url, headers={'user-agent': user_agent.chrome})
soup = BeautifulSoup(page.content, 'lxml')

base_url = 'https://subslikescript.com/'

# Collect the href of every anchor tag; the site uses root-relative links.
links = [link['href'] for link in soup.find_all('a', href=True)]

print('Printing all relative urls')
for i in links:
  print(i)

# Prefix each relative link with the base URL to make it absolute.
# base_url ends with '/' and each href starts with '/', so the result contains '//'.
print('Printing all absolute urls')
all_links = []
for i in links:
  print(base_url + i)
  all_links.append(base_url + i)

Printing all relative urls
/
/movies
/series
/
/series
/series/Planet_Earth_II-5491994/season-1/episode-1-Islands
/series/Planet_Earth_II-5491994/season-1/episode-2-Mountains
/series/Planet_Earth_II-5491994/season-1/episode-3-Jungles
/series/Planet_Earth_II-5491994/season-1/episode-4-Deserts
/series/Planet_Earth_II-5491994/season-1/episode-5-Grasslands
/series/Planet_Earth_II-5491994/season-1/episode-6-Cities
/dmca
Printing all absolute urls
https://subslikescript.com//
https://subslikescript.com//movies
https://subslikescript.com//series
https://subslikescript.com//
https://subslikescript.com//series
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-1-Islands
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-2-Mountains
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-3-Jungles
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-4-Deserts
https://subslikescript.com//series/Planet_Ear

In [54]:
pattern = r'season-1/episode-'
filtered_links = list(filter(lambda link: re.search(pattern, link), all_links))
for link in filtered_links:
    print(link)

https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-1-Islands
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-2-Mountains
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-3-Jungles
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-4-Deserts
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-5-Grasslands
https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-6-Cities
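
The absolute URLs above contain a double slash because `base_url` ends with `/` and every href starts with `/`. Here is a small sketch of one way to avoid that, using `urljoin` from the standard library. The list of hrefs is a hand-copied subset of the output above, so the example runs without a network request.

```python
from urllib.parse import urljoin

base_url = 'https://subslikescript.com/'

# Hand-copied subset of the relative links printed earlier (not fetched live).
relative_links = [
    '/',
    '/series',
    '/series/Planet_Earth_II-5491994/season-1/episode-1-Islands',
    '/series/Planet_Earth_II-5491994/season-1/episode-2-Mountains',
    '/dmca',
]

# urljoin resolves each href against the base, so no slash is duplicated.
absolute_links = [urljoin(base_url, href) for href in relative_links]

# Keep only the season 1 episode pages.
episode_links = [link for link in absolute_links if 'season-1/episode-' in link]

for link in episode_links:
    print(link)
```

A plain substring test is enough here because the pattern contains no regular-expression syntax; `re.search` would select the same links.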


In [55]:
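plots = []
# Visit only the episode pages selected above, not every link on the series page.
for link in filtered_links:
  final_page = requests.get(link)
  final_soup = BeautifulSoup(final_page.content, 'lxml')
  # Each episode page keeps its summary in a <p class="plot"> element.
  plot = final_soup.find('p', attrs={'class':'plot'})
  if plot:
    plots.append(plot.text.strip())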

In [56]:
with open("plots.txt", "w") as file:
  for p in plots:
      file.write(p + "\n")
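
As written, plots.txt holds the summaries without saying which episode each one belongs to. As a sketch, the episode number and title can be recovered from the link itself. `episode_label` is a hypothetical helper, and the `S01E03 Title` format is just one possible choice. It assumes links shaped like the ones printed earlier.

```python
import re

# Matches the tail of a link such as ".../season-1/episode-3-Jungles".
EPISODE_RE = re.compile(r'season-(\d+)/episode-(\d+)-(.+)$')

def episode_label(link):
    """Return a label like 'S01E03 Jungles', or None if the link is not an episode page."""
    match = EPISODE_RE.search(link)
    if match is None:
        return None
    season, episode, title = match.groups()
    return f"S{int(season):02d}E{int(episode):02d} {title.replace('_', ' ')}"

# Links copied from the output above; the last one is not an episode page.
links = [
    'https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-1-Islands',
    'https://subslikescript.com//series/Planet_Earth_II-5491994/season-1/episode-3-Jungles',
    'https://subslikescript.com//dmca',
]
for link in links:
    print(episode_label(link))
```

Each line of the output file could then be written as `f"{episode_label(link)}: {plot}"`, provided every link yielded a plot so that the links and the plots stay aligned.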