# Web Crawling
## Introduction
Web crawling is the automated retrieval of web pages, usually combined with scraping, the extraction of data from those pages. Common uses include data mining, online price comparison, and website change detection.

1. Tools and Libraries in Python
Python offers several libraries for web crawling. The most popular include:

    * `Requests`: A simple yet powerful HTTP library for making requests and accessing web content.
    * `Beautiful Soup`: A library that makes it easy to scrape information from web pages, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
    * `Scrapy`: An open-source and collaborative web crawling framework for Python, designed for large-scale web scraping.

2. Basic Steps in Web Crawling
    * `Sending a Request`: Use requests to send an HTTP request to the URL of the webpage you want to scrape.
    * `Parsing the Response`: Once you have the webpage's content, use Beautiful Soup or Scrapy to parse the HTML/XML content.
    * `Extracting Information`: After parsing, you can use the power of these libraries to extract the specific data you need.
    * `Storing or Processing Data`: The extracted data can be stored in a file or database, or further processed according to your needs.
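
The four steps above can be sketched as three small functions. This is a minimal illustration rather than a complete crawler: the function names `fetch_html`, `extract_links`, and `to_csv` and the inline sample page are assumptions made for this example, and the sample page stands in for a real request so the code runs offline.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Step 1: send an HTTP GET request and return the response body."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list:
    """Steps 2 and 3: parse the HTML and pull out every hyperlink."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def to_csv(rows: list) -> str:
    """Step 4: serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical inline page used instead of fetch_html(url), so no network is needed.
sample_html = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'
links = extract_links(sample_html)
print(to_csv(links))
```

With a real site, `extract_links(fetch_html(url))` would chain the first three steps.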

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Read Website

In [2]:
response = requests.get("https://en.wikipedia.org/wiki/University_of_Texas_at_Dallas")
html = response.text
# Name the parser explicitly; omitting it triggers a GuessedAtParserWarning
soup = BeautifulSoup(html, 'html.parser')
soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>University of Texas at Dallas - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-lim

1. `find` Method:
* The `find` method searches for the first tag that matches the given criteria.
    * **Syntax**: `soup.find(name, attrs, recursive, string, **kwargs)`
    * **Example**: `soup.find('div', class_='example')` finds the first `<div>` tag with the class `example`.
    * It returns a single `Tag` object representing the first matching tag, or `None` if nothing matches.
2. `find_all` Method:
* The `find_all` method retrieves all tags that match the given criteria.
    * **Syntax**: `soup.find_all(name, attrs, recursive, string, limit, **kwargs)`
    * **Example**: `soup.find_all('a')` finds all `<a>` tags (hyperlinks) in the document.
    * It returns a `ResultSet`, a list subclass holding one `Tag` object per match; the list is empty if nothing matches.
    * You can cap the number of results with the `limit` argument.
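
The difference between the two methods is easiest to see on a small document. The HTML below is a made-up snippet, used only so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, not taken from any real page.
html = """
<div class="example">first</div>
<div class="example">second</div>
<a href="/one">One</a>
<a href="/two">Two</a>
<a href="/three">Three</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div", class_="example")  # first match only
all_links = soup.find_all("a")                  # every <a> tag
two_links = soup.find_all("a", limit=2)         # stop after two matches
missing = soup.find("table")                    # no match, so None

print(first_div.text)                  # first
print([a["href"] for a in all_links])  # ['/one', '/two', '/three']
print(len(two_links))                  # 2
print(missing)                         # None
```

Because `find` can return `None`, chaining attribute access directly onto it (for example `soup.find("table").text`) raises `AttributeError` when the tag is absent.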

In [3]:
soup.find('tr')


<tr><td class="mbox-image"><div class="mbox-image-div"><span typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></span></span></div></td><td class="mbox-text"><div class="mbox-text-span">This article <b>contains content that is written like <a href="/wiki/Wikipedia:What_Wikipedia_is_not#Wikipedia_is_not_a_soapbox_or_means_of_promotion" title="Wikipedia:What Wikipedia is not">an advertisement</a></b>.<span class="hide-when-compact"> Please help <a class="external text" href="https://en.wikipedia.org/w/index.php?title=University_of_Texas_at_Dallas&amp;action=edit">improve it</a> by removing 

In [4]:
soup.find_all('tr',limit=2)

[<tr><td class="mbox-image"><div class="mbox-image-div"><span typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="40" data-file-width="40" decoding="async" height="40" src="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" width="40"/></span></span></div></td><td class="mbox-text"><div class="mbox-text-span">This article <b>contains content that is written like <a href="/wiki/Wikipedia:What_Wikipedia_is_not#Wikipedia_is_not_a_soapbox_or_means_of_promotion" title="Wikipedia:What Wikipedia is not">an advertisement</a></b>.<span class="hide-when-compact"> Please help <a class="external text" href="https://en.wikipedia.org/w/index.php?title=University_of_Texas_at_Dallas&amp;action=edit">improve it</a> by removing

## Read JSON (JavaScript Object Notation)
1. Overview
    * JSON is a text format that is completely language independent but uses conventions familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.

2. Structure
* Data Representation: JSON is built on two structures:
    * A **collection** of **name/value pairs** (often realized as an object, record, struct, dictionary, hash table, keyed list, or associative array).
    * An ordered list of values (often realized as an array, vector, list, or sequence).
* Example:
    ```
    {
        "name": "John Doe",
        "age": 30,
        "isMarried": false,
        "address": {
            "street": "123 Main St",
            "city": "Anytown"
        },
        "phoneNumbers": ["123-456-7890", "456-789-0123"]
    }
    ```
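
As a short illustration of how these two structures map onto Python types, the sketch below parses the same example with the standard `json` module. The variable names are chosen for this example only.

```python
import json

text = """
{
    "name": "John Doe",
    "age": 30,
    "isMarried": false,
    "address": {"street": "123 Main St", "city": "Anytown"},
    "phoneNumbers": ["123-456-7890", "456-789-0123"]
}
"""
person = json.loads(text)

print(type(person).__name__)                  # dict  (JSON object)
print(type(person["phoneNumbers"]).__name__)  # list  (JSON array)
print(person["isMarried"])                    # False (JSON false)
print(person["address"]["city"])              # Anytown
print(person["phoneNumbers"][1])              # 456-789-0123
```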

In [5]:
import json

In [6]:
# Open the file and read its contents
with open("./Data/test.json", 'r') as file:
    file_contents = file.read()
    
jsonObj = json.loads(file_contents)
jsonObj

{'employees': [{'firstName': 'Bill', 'lastName': 'Gates'},
  {'firstName': 'George', 'lastName': 'Bush'},
  {'firstName': 'Thomas', 'lastName': 'Carter'}]}

In [7]:
jsonObj["employees"]

[{'firstName': 'Bill', 'lastName': 'Gates'},
 {'firstName': 'George', 'lastName': 'Bush'},
 {'firstName': 'Thomas', 'lastName': 'Carter'}]

In [8]:
jsonObj["employees"][0]

{'firstName': 'Bill', 'lastName': 'Gates'}
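
A list of JSON objects with the same keys converts directly into a table. The sketch below rebuilds the parsed structure shown above as a literal (so it does not depend on `./Data/test.json`) and passes the list to `pd.DataFrame`; the name `employees_df` is an assumption for this example.

```python
import pandas as pd

# Same structure as the parsed file contents shown above.
json_obj = {
    "employees": [
        {"firstName": "Bill", "lastName": "Gates"},
        {"firstName": "George", "lastName": "Bush"},
        {"firstName": "Thomas", "lastName": "Carter"},
    ]
}

# Each dictionary becomes a row; each key becomes a column.
employees_df = pd.DataFrame(json_obj["employees"])
print(employees_df)
```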

## Parse Website
* `getText()`: The `getText()` method (an alias of `get_text()`) extracts only the text content of a `BeautifulSoup` or `Tag` object, discarding the markup.
    * **Syntax** : `tag.getText(separator='', strip=False)`
        * `separator`: Optional. The string used to join the text of different child elements.
        * `strip`: Optional. A boolean that, if set to `True`, strips leading and trailing whitespace from each piece of text before joining and drops pieces that are empty after stripping.
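
The effect of the two arguments can be checked on a small made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with whitespace inside the tags.
html = "<p> Hello <b> world </b></p>"
tag = BeautifulSoup(html, "html.parser").p

print(repr(tag.get_text()))                           # ' Hello  world '
print(repr(tag.get_text(separator="|")))              # ' Hello | world '
print(repr(tag.get_text(separator="|", strip=True)))  # 'Hello|world'
print(tag.getText() == tag.get_text())                # True
```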

## Download IMDb Top 250 Movies (Example)

In [9]:
from lxml import etree

In [10]:
# Headers to mimic browser 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# URL of the IMDb Top 250 chart
imdb_top250 = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

page = requests.get(imdb_top250, headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')

# Find all movie containers
movies = soup.find('ul', class_='ipc-metadata-list').find_all('li')

In [11]:
# Collect one dictionary per movie; the list is converted to a DataFrame below
movies_df = []

# Loop through movies and extract data
for movie in movies:
    title_rank = movie.find('div', class_='ipc-title').a.text
    link = 'http://www.imdb.com' + movie.find('div', class_='ipc-title').a['href']
    # Split "1. Title" on the first period only, so titles that contain "." stay intact
    title = str(title_rank).split(".", 1)[-1]
    rank = str(title_rank).split(".", 1)[0]
    year_length_cate = movie.find('div', class_='cli-title-metadata').find_all("span")
    year = year_length_cate[0].text
    length = year_length_cate[1].text
    try:
        category = year_length_cate[2].text
    except IndexError:
        # Some entries have no certificate; report the title and keep going
        print(title)
        category = 'N/A'
    rating_num = movie.find('div', class_='cli-ratings-container').span.text
    rating = str(rating_num).split("\xa0")[0]
    num = str(rating_num).split("\xa0")[-1]

    # Create dictionary for each movie
    movie_data = {
        'Title': title,
        'Rank': rank,
        'Length': length,
        'Rating': rating,
        'Number of Rating': num,
        'Year': year,
        'Category': category,
        'Link': link
    }
    
    # print(movie_data)
    # Add movie data to the list
    movies_df.append(movie_data)

# Convert the list to a DataFrame and show the first 5 rows
movies_df = pd.DataFrame(movies_df)
movies_df.head()

 12th Fail


| | Title | Rank | Length | Rating | Number of Rating | Year | Category | Link |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1 | 2h 22m | 9.3 | (2.8M) | 1994 | R | http://www.imdb.com/title/tt0111161/?ref_=chtt... |
| 1 | The Godfather | 2 | 2h 55m | 9.2 | (2M) | 1972 | R | http://www.imdb.com/title/tt0068646/?ref_=chtt... |
| 2 | The Dark Knight | 3 | 2h 32m | 9.0 | (2.8M) | 2008 | PG-13 | http://www.imdb.com/title/tt0468569/?ref_=chtt... |
| 3 | The Godfather Part II | 4 | 3h 22m | 9.0 | (1.3M) | 1974 | R | http://www.imdb.com/title/tt0071562/?ref_=chtt... |
| 4 | 12 Angry Men | 5 | 1h 36m | 9.0 | (850K) | 1957 | Approved | http://www.imdb.com/title/tt0050083/?ref_=chtt... |
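
The `Length` and `Number of Rating` columns are scraped as strings such as `2h 22m` and `(2.8M)`, which cannot be sorted or averaged numerically. A minimal sketch of converting them, assuming the formats shown in the table; `parse_length` and `parse_votes` are hypothetical helper names, not part of the notebook.

```python
import re

_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_length(text: str) -> int:
    """Convert a runtime such as '2h 22m' into minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized runtime: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def parse_votes(text: str) -> int:
    """Convert a vote count such as '(2.8M)' into an integer."""
    cleaned = text.strip().strip("()")
    multiplier = _SUFFIXES.get(cleaned[-1:].upper(), 1)
    number = cleaned[:-1] if multiplier != 1 else cleaned
    # round() guards against floating-point error in the multiplication
    return round(float(number) * multiplier)


print(parse_length("2h 22m"))  # 142
print(parse_votes("(2.8M)"))   # 2800000
print(parse_votes("(850K)"))   # 850000
```

On the DataFrame these could be applied column-wise, for example `movies_df['Length'].map(parse_length)`.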


### Go to subpage to get more info

In [12]:
add_df = []
# Loop through movies, one row at a time
for _, temp in movies_df.iterrows():
    # Make request to movie URL
    response = requests.get(temp['Link'], headers= headers)
    soup = BeautifulSoup(response.text, 'html.parser')
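    # Get budget and worldwide gross; pages without box-office data fall back to "N/A"
    try:
        budget = soup.find('li', {'data-testid':'title-boxoffice-budget'}).text.replace("Budget","")
        worldwide_revenue = soup.find('li', {'data-testid':'title-boxoffice-cumulativeworldwidegross'}).text.replace("Gross worldwide","")
    except AttributeError:
        # soup.find returned None because one of the elements is missing
        budget = "N/A"
        worldwide_revenue = "N/A"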
    
    # Get cast 
    dom = etree.HTML(response.text)
    cast_tags = "; ".join(dom.xpath("//*//div//a[@data-testid='title-cast-item__actor']/text()"))
    crews = dom.xpath("//section[@data-testid='title-cast']//ul//a/text()")
    director = crews[0]
    writer = "; ".join(crews[1:-2]).replace("Writers; ","").replace("Writer; ","")
    
    add_data = {
        "Budget" : budget,
        "Worldwide Revenue" : worldwide_revenue,
        "Cast" : cast_tags,
        "Director" : director,
        "Writer" : writer
    }
    add_df.append(add_data)
    
add_df = pd.DataFrame(add_df)
add_df.head()

| | Budget | Worldwide Revenue | Cast | Director | Writer |
|---|---|---|---|---|---|
| 0 | $25,000,000 (estimated) | $28,884,716 | Tim Robbins; Morgan Freeman; Bob Gunton; Willi... | Frank Darabont | Stephen King; Frank Darabont |
| 1 | $6,000,000 (estimated) | $250,341,816 | Marlon Brando; Al Pacino; James Caan; Diane Ke... | Francis Ford Coppola | Mario Puzo; Francis Ford Coppola |
| 2 | $185,000,000 (estimated) | $1,029,266,147 | Christian Bale; Heath Ledger; Aaron Eckhart; M... | Christopher Nolan | Jonathan Nolan; Christopher Nolan; David S. Goyer |
| 3 | $13,000,000 (estimated) | $47,961,919 | Al Pacino; Robert De Niro; Robert Duvall; Dian... | Francis Ford Coppola | Francis Ford Coppola; Mario Puzo |
| 4 | $350,000 (estimated) | $955 | Henry Fonda; Lee J. Cobb; Martin Balsam; John ... | Sidney Lumet | Reginald Rose |
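
The subpage loop above sends one request per movie with no pause and no handling of failed requests. A more defensive loop might look like the sketch below. `fetch_all` is a hypothetical helper written for this example; it takes the fetch function as a parameter so the demonstration can use a fake fetcher and run without a network connection.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    max_retries: int = 2,
) -> Dict[str, Optional[str]]:
    """Fetch each URL in turn, retrying on errors and pausing between pages."""
    pages: Dict[str, Optional[str]] = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            try:
                pages[url] = fetch(url)
                break  # success: stop retrying this URL
            except Exception as error:  # with requests, catch requests.RequestException instead
                print(f"attempt {attempt + 1} failed for {url}: {error}")
                pages[url] = None  # stays None if every attempt fails
        time.sleep(delay_seconds)  # pause before moving on to the next page
    return pages


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP call; fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ConnectionError("simulated failure")
    return f"<html>{url}</html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/bad"],
    fake_fetch,
    delay_seconds=0,
)
print(pages["https://example.com/a"])    # <html>https://example.com/a</html>
print(pages["https://example.com/bad"])  # None
```

In the notebook, `fetch` could be a small function that wraps `requests.get(url, headers=headers, timeout=10)`, calls `raise_for_status()`, and returns `response.text`.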


In [13]:
# Join the two frames column-wise; rows are matched on their shared index
result = pd.concat([movies_df, add_df], axis=1)
result.head()

| | Title | Rank | Length | Rating | Number of Rating | Year | Category | Link | Budget | Worldwide Revenue | Cast | Director | Writer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1 | 2h 22m | 9.3 | (2.8M) | 1994 | R | http://www.imdb.com/title/tt0111161/?ref_=chtt... | $25,000,000 (estimated) | $28,884,716 | Tim Robbins; Morgan Freeman; Bob Gunton; Willi... | Frank Darabont | Stephen King; Frank Darabont |
| 1 | The Godfather | 2 | 2h 55m | 9.2 | (2M) | 1972 | R | http://www.imdb.com/title/tt0068646/?ref_=chtt... | $6,000,000 (estimated) | $250,341,816 | Marlon Brando; Al Pacino; James Caan; Diane Ke... | Francis Ford Coppola | Mario Puzo; Francis Ford Coppola |
| 2 | The Dark Knight | 3 | 2h 32m | 9.0 | (2.8M) | 2008 | PG-13 | http://www.imdb.com/title/tt0468569/?ref_=chtt... | $185,000,000 (estimated) | $1,029,266,147 | Christian Bale; Heath Ledger; Aaron Eckhart; M... | Christopher Nolan | Jonathan Nolan; Christopher Nolan; David S. Goyer |
| 3 | The Godfather Part II | 4 | 3h 22m | 9.0 | (1.3M) | 1974 | R | http://www.imdb.com/title/tt0071562/?ref_=chtt... | $13,000,000 (estimated) | $47,961,919 | Al Pacino; Robert De Niro; Robert Duvall; Dian... | Francis Ford Coppola | Francis Ford Coppola; Mario Puzo |
| 4 | 12 Angry Men | 5 | 1h 36m | 9.0 | (850K) | 1957 | Approved | http://www.imdb.com/title/tt0050083/?ref_=chtt... | $350,000 (estimated) | $955 | Henry Fonda; Lee J. Cobb; Martin Balsam; John ... | Sidney Lumet | Reginald Rose |
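
The `Budget` and `Worldwide Revenue` columns hold strings such as `$25,000,000 (estimated)`. A minimal sketch of turning them into integers, assuming amounts are written with a leading `$` and comma separators; `parse_usd` is a hypothetical helper, and amounts in other currencies or the `N/A` placeholder come back as `None`.

```python
import re
from typing import Optional


def parse_usd(text: str) -> Optional[int]:
    """Extract a dollar amount such as '$25,000,000 (estimated)' as an integer."""
    match = re.search(r"\$([\d,]+)", text)
    if match is None:
        return None  # "N/A" or an amount not written with a dollar sign
    return int(match.group(1).replace(",", ""))


print(parse_usd("$25,000,000 (estimated)"))  # 25000000
print(parse_usd("$1,029,266,147"))           # 1029266147
print(parse_usd("N/A"))                      # None
```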


### Save the result

In [14]:
result.to_excel("./Data/imdb_top250.xlsx")
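
By default, pandas writes the row index as an extra first column, and reading the file back then produces a column named `Unnamed: 0`. The sketch below shows the effect with `to_csv` and a temporary directory, using a small made-up frame; `to_excel` accepts the same `index=False` argument.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the scraped result.
sample = pd.DataFrame({"Title": ["The Godfather", "12 Angry Men"], "Rank": [2, 5]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    sample.to_csv(with_index)                  # index written as an unnamed column
    sample.to_csv(without_index, index=False)  # index omitted

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'Title', 'Rank']
print(cols_without)  # ['Title', 'Rank']
```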