# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the urls below and take a look at their source code through Chrome DevTools. You'll need to identify the html tags, special class names, etc used in the html content you are expected to extract.

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries feel free to do it.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [4]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

# your code here
import requests
from bs4 import BeautifulSoup

# Extract the page content and create a BeautifulSoup object
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Find the html elements that contain the developer names
developers = soup.find_all('h1', {'class': 'h3 lh-condensed'})

# Create a list to store the cleaned names
names = []

# Iterate through the elements and clean the names
for developer in developers:
    name = developer.text.strip().replace('\n', '')
    names.append(name)

# Print the list of names
print(names)

['Emil Ernerfeldt', 'Manu MA', 'atomiks', 'Azure SDK Bot', 'Marten Seemann', 'Hadley Wickham', 'Charlie Marsh', 'Alessandro Ros', 'Earle F. Philhower, III', 'Christy Jacob', 'George Hotz', 'Marc Philipp', 'Folkert de Vries', 'Leandro von Werra', 'Adrienne Walker', 'Christopher Berner', 'josegonzalez', "Kael'thas", 'Serhii Kulykov', 'Haifeng Jin', 'boojack', 'Ryan Bates', 'Hans-Kristian Arntzen', 'David Sherret', 'Romain Beaumont']


#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

#### Display the trending Python repositories in GitHub.

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [6]:
url = 'https://github.com/trending/python?since=daily'

# your code here

# Extract the page content and create a BeautifulSoup object
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Find the html elements that contain the repository names
repositories = soup.find_all('h1', {'class': 'h3 lh-condensed'})

# Create a list to store the cleaned names
names = []

# Iterate through the elements and clean the names
for repository in repositories:
    name = repository.text.strip().replace('\n', '')
    names.append(name)

# Print the list of names
print(names)

['AzeemIdrisi /      PhoneSploit-Pro', 'LAION-AI /      Open-Assistant', 'hwchase17 /      chat-langchain', 'bregman-arie /      devops-exercises', 'mindsdb /      mindsdb', 'AUTOMATIC1111 /      stable-diffusion-webui', 'Z4nzu /      hackingtool', 'geohot /      tinygrad', 'jackfrued /      Python-100-Days', 'botallen /      repository.botallen', 'ReFirmLabs /      binwalk', '521xueweihan /      HelloGitHub', 'microsoft /      restler-fuzzer', 'donnemartin /      system-design-primer', 'approximatelabs /      sketch', 'mandiant /      capa', 'vnpy /      vnpy', 'iperov /      DeepFaceLab', 'microsoft /      BioGPT', 'rawandahmad698 /      PyChatGPT', 'facebookresearch /      CutLER', 'HackSoftware /      Django-Styleguide', 'Evil0ctal /      Douyin_TikTok_Download_API', 'microsoft /      fluentui-emoji', 'rougier /      numpy-100']


#### Display all the image links from Walt Disney wikipedia page.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [7]:
import requests
from bs4 import BeautifulSoup

# Extract the page content and create a BeautifulSoup object
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Find the image tags
images = soup.find_all('img')

# Extract the src attribute of each image tag to get the image link
image_links = [image['src'] for image in images]

print(image_links)

['', '', '', '', 'https://avatars.githubusercontent.com/u/112647789?s=40&v=4', 'https://avatars.githubusercontent.com/u/398211?s=40&v=4', 'https://avatars.githubusercontent.com/u/858040?s=40&v=4', 'https://avatars.githubusercontent.com/u/9976399?s=40&v=4', 'https://avatars.githubusercontent.com/u/24505302?s=40&v=4', 'https://avatars.githubusercontent.com/u/2178292?s=40&v=4', 'https://avatars.githubusercontent.com/u/11986836?s=40&v=4', 'https://avatars.githubusercontent.com/u/37985796?s=40&v=4', 'https://avatars.githubusercontent.com/u/26654315?s=40&v=4', 'https://avatars.githubusercontent.com/u/10349437?s=40&v=4', 'https://avatars.githubusercontent.com/u/23431146?s=40&v=4', 'https://avatars.githubusercontent.com/u/59370927?s=40&v=4', 'https://avatars.githubusercontent.com/u/4068133?s=40&v=4', 'https://avatars.githubusercontent.com/u/23587658?s=40&v=4', 'https://avatars.githubusercontent.com/u/7192539?s=40&v=4', 'https://avatars.githubusercontent.com/u/8502631?s=40&v=4', 'https://avatar

#### Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page.

In [None]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [8]:
import requests
from bs4 import BeautifulSoup

# Request the Wikipedia page
response = requests.get(url)

# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the links on the page
links = [link.get('href') for link in soup.find_all('a')]

# Print the list of links
print(links)

['#start-of-content', 'https://github.com/', '/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftrending%2Fpython&source=header', '/features/actions', '/features/packages', '/features/security', '/features/codespaces', '/features/copilot', '/features/code-review', '/features/issues', '/features/discussions', '/features', 'https://docs.github.com', 'https://skills.github.com/', 'https://github.blog', '/enterprise', '/team', '/enterprise/startups', 'https://education.github.com', '/solutions/ci-cd/', 'https://resources.github.com/devops/', 'https://resources.github.com/devops/fundamentals/devsecops/', '/customer-stories', 'https://resources.github.com/', '/sponsors', '/readme', '/topics', '/trending', '/collections', '/pricing', '', '', '', '', '/login?return_to=https%3A%2F%2Fgithub.com%2Ftrending%2Fpython%3Fsince%3Ddaily', '/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftrending%2Fpython&source=header', '/explore', '/topics', '/trending', '/collections', '/eve

#### Find the number of titles that have changed in the United States Code since its last release point.

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [None]:
# your code here

#### Find a Python list with the top ten FBI's Most Wanted names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [9]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.fbi.gov/wanted/topten'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

top_ten = soup.find_all('h3', {'class': 'title'})
most_wanted = [name.text.strip() for name in top_ten]

print(most_wanted[:10])

['ARNOLDO JIMENEZ', 'OMAR ALEXANDER CARDENAS', 'ALEXIS FLORES', 'YULAN ADONAY ARCHAGA CARIAS', 'BHADRESHKUMAR CHETANBHAI PATEL', 'ALEJANDRO ROSALES CASTILLO', 'RUJA IGNATOVA', 'JOSE RODOLFO VILLARREAL-HERNANDEZ', 'MICHAEL JAMES PRATT', 'RAFAEL CARO-QUINTERO']


####  Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.

In [11]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [12]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Extract the page content and create a BeautifulSoup object
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Find the html elements that contain the earthquake information
earthquakes = soup.find_all('tr', class_='odd') + soup.find_all('tr', class_='even')

# Extract the earthquake information and store it in a list of dictionaries
earthquakes_list = []
for earthquake in earthquakes:
    earthquake_data = earthquake.find_all('td')
    date = earthquake_data[0].text
    time = earthquake_data[1].text
    latitude = earthquake_data[2].text
    longitude = earthquake_data[3].text
    region = earthquake_data[4].text
    earthquakes_list.append({
        'Date': date,
        'Time': time,
        'Latitude': latitude,
        'Longitude': longitude,
        'Region': region
    })

# Create a pandas dataframe from the list of dictionaries
df = pd.DataFrame(earthquakes_list)

# Display the first 20 rows of the dataframe
print(df.head(20))

Empty DataFrame
Columns: []
Index: []


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [13]:
handle = input("Enter the Twitter handle (e.g. @handle): ")

# Base URL for twitter profile
url = f"https://twitter.com/{handle}"

try:
    # Use requests library to fetch the HTML content of the URL
    response = requests.get(url)
    
    # Check if the URL is valid and the profile exists
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find the number of tweets using the HTML tag and class
        tweet_container = soup.find('li', {'class': 'ProfileNav-item--tweets'})
        tweet_count = tweet_container.find('span', {'class': 'ProfileNav-value'})
        
        print(f"Number of tweets by {handle}: {tweet_count.text}")
        
    else:
        # Raise an exception if the profile doesn't exist or the URL is invalid
        raise Exception("Profile not found.")
        
except Exception as e:
    print(f"Error: {e}")

Enter the Twitter handle (e.g. @handle): @handle
Error: 'NoneType' object has no attribute 'find'


#### Number of followers of a given twitter account
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [17]:
handle = input("Enter Twitter handle (@handle): ")

# URL to scrape followers count from
url = f'https://twitter.com/{handle}'

try:
    # Use requests library to get HTML content of the page
    response = requests.get(url)
    # Use BeautifulSoup to parse HTML content
    soup = BeautifulSoup(response.content, 'html.parser')
    # Find the followers count by searching for the HTML tag and class name
    followers_count = soup.find('a', {'data-nav': 'followers'}).find('span', {'class': 'ProfileNav-value'}).get('data-count')
    print(f'Number of followers for {handle}: {followers_count}')
except:
    print(f"Twitter handle '{handle}' not found.")

Enter Twitter handle (@handle): @handle
Twitter handle '@handle' not found.


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [21]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [23]:
import requests
from bs4 import BeautifulSoup

# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
lang_tags = soup.find_all('div', {'class': 'central-featured-lang'})

langs = []
for tag in lang_tags:
    name = tag.find('div', {'class': 'language_name'}).text
    articles = tag.find('div', {'class': 'articles_count'}).text
    langs.append((name, articles))

print(langs)

AttributeError: 'NoneType' object has no attribute 'text'

#### A list with the different kind of datasets available in data.gov.uk.

In [27]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [31]:
import requests
from bs4 import BeautifulSoup

# Use requests library to get the contents of the website
response = requests.get(url)

# Parse the content of the website with BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the tag containing the list of datasets
dataset_list_tag = soup.find('ul', {'class': 'dataset-list'})

# Extract all the li tags containing the dataset information
dataset_tags = dataset_list_tag.find_all('li')

# Create a list to store the names of the datasets
datasets = []

# Loop through all the li tags to get the name of each dataset
for dataset_tag in dataset_tags:
    name = dataset_tag.find('a').text
    datasets.append(name)

# Print the list of datasets
print(datasets)

AttributeError: 'NoneType' object has no attribute 'find_all'

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [32]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Make a request to the URL
res = requests.get(url)

# Check if the request is successful
if res.status_code == 200:
    soup = BeautifulSoup(res.content, 'html.parser')

    # Find the table containing the information
    table = soup.find('table', {'class': 'wikitable sortable'})
    
    # Get the header row
    header = [th.text.strip() for th in table.find('tr').find_all('th')]
    
    # Get the data rows
    rows = []
    for tr in table.find_all('tr')[1:]:
        rows.append([td.text.strip() for td in tr.find_all('td')])
        
    # Create the dataframe
    df = pd.DataFrame(rows, columns=header)
    
    # Display the top 10 rows
    print(df.head(10))
else:
    print('Failed to fetch the data')

AttributeError: 'NoneType' object has no attribute 'find'

## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here