# Web Scraping Assignment

Each question below uses BeautifulSoup to scrape a website and collect the data that the question
asks for.


In [6]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

1) Write a Python program to display data for IMDb's top-rated 100 Indian movies from
https://www.imdb.com/list/ls056092300/ (i.e. name, rating, year of release) and build a data frame.

In [12]:
def get_100_movies(url):
    # Download the list page and parse it with the built-in HTML parser.
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    titles = []
    years = []
    ratings = []
    for i in soup.find_all('h3', class_='lister-item-header'):
        titles.append(i.find('a').text.strip())
    for i in soup.find_all('span', class_='lister-item-year'):
        # Keep only the four-digit year; the label may contain other characters.
        match = re.search(r'\d{4}', i.text)
        years.append(match.group() if match else None)
    for i in soup.find_all('div', class_='ipl-rating-star small'):
        rating = i.find('span', class_='ipl-rating-star__rating').text.strip()
        ratings.append(rating)
    # The three lists must have equal length for the DataFrame constructor.
    df = pd.DataFrame({'Name': titles, 'Rating': ratings, 'Year of Release': years})
    return df

imdb_url = 'https://www.imdb.com/list/ls056092300/'
get_100_movies(imdb_url)

Unnamed: 0,Name,Rating,Year of Release
0,Ship of Theseus,8,2012
1,Iruvar,8.4,1997
2,Kaagaz Ke Phool,7.8,1959
3,Lagaan: Once Upon a Time in India,8.1,2001
4,Pather Panchali,8.2,1955
...,...,...,...
95,Apur Sansar,8.4,1959
96,Kanchivaram,8.2,2008
97,Monsoon Wedding,7.3,2001
98,Black,8.1,2005
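
The scraped ratings and years are strings, so they need converting before any numeric work such as sorting or averaging. The following is a minimal sketch of two helpers for that step; `parse_year` and `parse_rating` are hypothetical names, and the sample label strings are made up for illustration rather than taken from the page.

```python
import re
from typing import Optional

YEAR_PATTERN = re.compile(r"\d{4}")


def parse_year(text: str) -> Optional[int]:
    """Return the first four-digit year found in text, or None if there is none."""
    match = YEAR_PATTERN.search(text)
    return int(match.group()) if match else None


def parse_rating(text: str) -> Optional[float]:
    """Convert a rating string such as '8.4' to a float, or None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


# Made-up label strings, used only to exercise the helpers.
print(parse_year("(I) (2012)"))
print(parse_year("(TBA)"))
print(parse_rating(" 8.4 "))
```

Returning `None` instead of raising keeps one malformed entry from aborting the whole scrape; the missing values can be handled later in the data frame.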


2) Write a Python program to scrape product names, prices and discounts from
https://www.meesho.com/bags-ladies/pl/3jo?page=1


In [9]:
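def meesho_products(url):
    page = requests.get(url)
    # Name the parser explicitly; otherwise bs4 picks one and emits a warning.
    soup = BeautifulSoup(page.content, 'html.parser')
    names = []
    # Prices and discounts are not collected yet; only product titles are.
    for i in soup.find_all('div', class_='product-title'):
        link = i.find('a')
        if link is not None:
            names.append(link.text.strip())
    return names

# Note: this URL points at peachmode.com, not the Meesho page named in the question.
url = "https://peachmode.com/search?q=bags"
meesho_products(url)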

['${ item.title }$']
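
The single result `'${ item.title }$'` looks like an unrendered template placeholder rather than a product name, which would suggest that the product list is filled in by JavaScript after the page loads (an assumption, not verified against the site). A minimal sketch of a check that flags such strings is shown below; `is_unrendered` is a hypothetical helper and `'Leather Tote Bag'` is a made-up product name.

```python
import re

# Matches placeholders such as '${ item.title }$' or '{{ item.title }}'.
PLACEHOLDER = re.compile(r"\$\{.*?\}\$?|\{\{.*?\}\}")


def is_unrendered(text: str) -> bool:
    """Return True if text contains something that looks like a template placeholder."""
    return PLACEHOLDER.search(text) is not None


# The first entry is the scraped output above; the second is invented for contrast.
names = ['${ item.title }$', 'Leather Tote Bag']
rendered = [n for n in names if not is_unrendered(n)]
print(rendered)
```

If every scraped name is filtered out this way, the HTML returned by `requests` probably does not contain the product data at all.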

3) Write a Python program to scrape cricket rankings from icc-cricket.com. You have to scrape:

a) Top 10 ODI teams in men’s cricket along with the records for matches, points and rating.


In [8]:
def odi_teams(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    teams = []
    matches = []
    for i in soup.find_all('span', class_='si-fname si-text'):
        teams.append(i.text)
    for i in soup.find_all('div', class_='si-table-data si-matches'):
        matches.append(i.find('span').text.strip())
    # Only the matches column is returned for now; points and ratings are not scraped.
    return matches

url = "https://www.icc-cricket.com/rankings/team-rankings/mens/odi"
odi_teams(url)

[]
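
An empty list means the selectors matched nothing in the HTML that `requests` received. One possible cause, assumed here rather than confirmed for this site, is that the rankings table is built by JavaScript in the browser. A quick way to narrow this down is to check whether the class names occur in the raw response text at all. The sketch below uses a hypothetical helper `missing_markers` and a made-up HTML string standing in for the response body.

```python
from typing import List


def missing_markers(html: str, markers: List[str]) -> List[str]:
    """Return the markers that do not occur anywhere in the raw HTML text."""
    return [m for m in markers if m not in html]


# Made-up stand-in for a response body that only ships a JavaScript shell.
sample_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(missing_markers(sample_html, ["si-fname", "si-matches"]))
```

In the notebook, the same check could be run as `missing_markers(page.text, ["si-fname", "si-matches"])`. If both names come back as missing, no BeautifulSoup selector will find them in that response.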

Please visit https://www.cnbc.com/world/?region=world and scrape:

a) headings

b) date

c) News link

In [7]:
def scrape_cnbc_news(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    headings = []
    news_links = []
    # Each headline container holds one anchor with both the text and the link.
    for i in soup.find_all('div', class_="RiverHeadline-headline RiverHeadline-hasThumbnail"):
        link = i.find('a')
        headings.append(link.text.strip())
        news_links.append(link['href'])
    # Dates are not collected here.
    df = pd.DataFrame({'Headline': headings, 'News link': news_links})
    return df


url = "https://www.cnbc.com/world/?region=world"
scrape_cnbc_news(url)


Unnamed: 0,Headline,News link
0,Ship sunk by Houthis threatens Red Sea environ...,https://www.cnbc.com/2024/03/03/ship-sunk-by-h...
1,,/pro/
2,Short positions in China stocks shrink after r...,https://www.cnbc.com/2024/03/03/short-position...
3,Nasdaq surges more than 1% to take out 2021 re...,https://www.cnbc.com/2024/02/29/stock-market-t...
4,Trump wins caucuses in Missouri and Idaho and ...,https://www.cnbc.com/2024/03/02/trump-wins-the...
5,India approves chip plants with over $15 billi...,https://www.cnbc.com/2024/03/01/india-boosts-c...
6,,/pro/
7,Ukraine's losses on the battlefield could make...,https://www.cnbc.com/2024/03/01/ukraines-losse...
8,India’s Byju’s lost more than $20 billion in v...,https://www.cnbc.com/2024/03/01/the-rise-and-f...
9,Iran vote turnout hits historic low amid disco...,https://www.cnbc.com/2024/03/03/iran-vote-turn...
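
Rows 1 and 6 above have an empty headline and the relative link `/pro/`, so the result needs a small clean-up pass. The following sketch drops rows without headline text and resolves relative links against the page URL with `urllib.parse.urljoin`. The helper name `clean_rows` and the sample rows are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cnbc.com/world/?region=world"


def clean_rows(rows):
    """Drop rows without headline text and make every link absolute."""
    cleaned = []
    for headline, link in rows:
        if not headline.strip():
            continue  # skip entries such as the '/pro/' rows with no text
        cleaned.append((headline.strip(), urljoin(BASE_URL, link)))
    return cleaned


# Invented sample rows in the same (headline, link) shape as the scraped data.
rows = [
    ("Example headline", "https://www.cnbc.com/2024/03/03/example/"),
    ("", "/pro/"),
    ("Another example", "/2024/03/01/another-example/"),
]
for headline, link in clean_rows(rows):
    print(headline, link)
```

`urljoin` leaves links that are already absolute unchanged, so the same call can be applied to every row.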


Please visit https://www.keaipublishing.com/en/journals/artificial-intelligence-in-agriculture/most-downloaded-articles/ and scrape:

a) Paper title

b) date

c) Author

In [14]:
def scrape_downloaded_articles(url):
    page=requests.get(url)
    soup=BeautifulSoup(page.content, 'html.parser')
    paper_titles=[]
    dates=[]
    authors=[]
    for i in soup.find_all('h2',class_='h5 article-title'):
        paper_titles.append(i.find('a').text.strip())
    for i in soup.find_all('p', class_='article-date'):
        dates.append(i.text.strip())
    for i in soup.find_all('p', class_='article-authors'):
        authors.append(i.text.strip())
    df=pd.DataFrame({'Paper title':paper_titles,'Date':dates,'Author':authors})
    return df

url = "https://www.keaipublishing.com/en/journals/artificial-intelligence-in-agriculture/most-downloaded-articles/"
scrape_downloaded_articles(url)


Unnamed: 0,Paper title,Date,Author
0,Implementation of artificial intelligence in a...,2020,Tanha Talaviya | Dhara Shah | Nivedita Patel...
1,Review of agricultural IoT technology,2022,Jinyuan Xu | Baoxing Gu | Guangzhao Tian
2,A comprehensive review on automation in agricu...,June 2019,Kirtan Jha | Aalap Doshi | Poojan Patel | M...
3,Automation and digitization of agriculture usi...,2021,A. Subeesh | C.R. Mehta
4,Applications of electronic nose (e-nose) and e...,2020,Juzhong Tan | Jie Xu
5,Fruit ripeness classification: A survey,March 2023,Matteo Rizzo | Matteo Marcuzzo | Alessandro ...
6,A review of imaging techniques for plant disea...,2020,Vijai Singh | Namita Sharma | Shikha Singh
7,Deep learning based computer vision approaches...,2022,V.G. Dhanya | A. Subeesh | N.L. Kushwaha | ...
8,Comparison of CNN-based deep learning architec...,September 2023,Md Taimur Ahad | Yan Li | Bo Song | Touhid ...
9,Transfer Learning for Multi-Crop Leaf Disease ...,2022,Ananda S. Paymode | Vandana B. Malode
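
The Date column mixes two formats: a bare year such as `2020` and a month plus year such as `June 2019`. A minimal sketch of a normaliser for these two formats follows. `parse_article_date` is a hypothetical helper, and it assumes English month names and no other date formats.

```python
from typing import Optional, Tuple

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def parse_article_date(text: str) -> Tuple[int, Optional[int]]:
    """Split '2020' or 'June 2019' into (year, month); month is None when absent."""
    parts = text.split()
    year = int(parts[-1])
    month = MONTHS.index(parts[0].lower()) + 1 if len(parts) == 2 else None
    return year, month


# Sample values taken from the Date column above.
for raw in ["2020", "June 2019", "September 2023"]:
    print(raw, "->", parse_article_date(raw))
```

Any string outside these two formats raises a `ValueError`, which makes unexpected date layouts visible instead of silently producing wrong values.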


Write a Python program to scrape house details from https://www.nobroker.in/ for three localities:
Indira Nagar, Jayanagar and Rajaji Nagar. The result should include house title, location, area, EMI
and price.

In [16]:
def nobrokerurl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    houses = []
    # Only house titles are attempted here; location, area, EMI and price are not scraped.
    for i in soup.find_all('div', class_='text-16 text-my-booking-color whitespace-nowrap overflow-hidden overflow-ellipsis'):
        houses.append(i.text.strip())
    return houses

url = 'https://www.nobroker.in/'
nobrokerurl(url)

[]