# Lab | Web Scraping

Welcome to the IMDb Web Scraping Adventure Lab!

**Objective**

In this lab, we will embark on a mission to unearth valuable insights from the vast sea of data available on IMDb, one of the largest online databases of movie, TV, and celebrity information. As budding data scientists and business analysts, you have been tasked with scraping a specific subset of data from IMDb to help film production companies understand the landscape of highly rated movies in a defined time period. Your insights could influence the making of the next Netflix movie!

**Background**

In a world where data has become the new currency, businesses are leveraging big data to make informed decisions that drive success and profitability. The entertainment industry, being no exception, utilizes data analytics to comprehend market trends, audience preferences, and the performance of films based on various parameters such as director, genre, stars involved, etc. IMDb stands as a goldmine of such data, offering intricate details of almost every movie ever made.

**Task**

Your task is to create a Python script using `BeautifulSoup` and `pandas` to scrape IMDb movie data based on user ratings and release dates. This script should be able to filter movies with ratings above a certain threshold and within a specified date range.

**Expected Outcome**

- A function named `scrape_imdb` that takes four parameters: `title_type`, `user_rating`, `start_date`, and `end_date`.
- The function should return a DataFrame with the following columns:
  - **Movie Nr**: The number representing the movie’s position in the list.
  - **Title**: The title of the movie.
  - **Year**: The year the movie was released.
  - **Rating**: The IMDb rating of the movie.
  - **Runtime (min)**: The duration of the movie in minutes.
  - **Genre**: The genre of the movie.
  - **Description**: A brief description of the movie.
  - **Director**: The director of the movie.
  - **Stars**: The main stars of the movie.
  - **Votes**: The number of votes the movie received.
  - **Gross ($M)**: The gross earnings of the movie in millions of USD.

You will execute this script to scrape data for movies with the Title Type `Feature Film` that have a user rating of `7.5 and above` and were released between `January 1, 1990, and December 31, 1992`.

Remember to experiment with different title types, dates and ratings to ensure your code is versatile and can handle various searches effectively!

**Resources**

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)
- [IMDb Advanced Search](https://www.imdb.com/search/title/)


**Hint**

Your first mission is to familiarize yourself with the IMDb advanced search page. Head over to [IMDb advanced search](https://www.imdb.com/search/title/) and input the following parameters, keeping all other fields to their default values or blank:

- **Title Type**: Feature film
- **Release date**: From 1990 to 1992 (Note: You don't need to specify the day and month)
- **User Rating**: 7.5 to -

Upon searching, you'll land on a page showcasing a list of movies, each displaying vital details such as the title, release year, and crew information. Your task is to scrape this treasure trove of data.

Carefully examine the resulting URL and construct your own URL to include all the necessary parameters for filtering the movies.
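
As a starting point, the URL construction described above could be sketched as follows. This is a minimal sketch, not a reference solution: the helper name `build_search_url` and the constant `BASE_URL` are hypothetical, and the query parameter names are copied from the example search URL used later in this lab, so they may no longer match the live site.

```python
from urllib.parse import urlencode

# Assumption: base address of the advanced title search, as linked in this lab.
BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(title_type, user_rating, start_date, end_date):
    """Build a search URL from the four filter values (hypothetical helper)."""
    params = {
        "title_type": title_type,
        # The date range is sent as "start,end".
        "release_date": f"{start_date},{end_date}",
        # A trailing comma means "this rating or higher" (open upper bound).
        "user_rating": f"{user_rating},",
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_search_url("feature", 7.5, "1990-01-01", "1992-12-31")
print(url)
```

Keeping the URL logic in one small function makes it easier to experiment with other title types, ratings, and date ranges.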


---

**Best of luck! Immerse yourself in the world of movies and may the data be with you!**

**Important note**:

In the fast-changing online world, websites often get updates and make changes. When you try this lab, the IMDb website might be different from what we expect.

If you run into problems because of these changes, like new rules or things that stop you from getting data, don't worry! Instead, get creative.

You can choose another website that interests you and is good for scraping data. Websites like Wikipedia or The New York Times are good options. The main goal is still the same: get useful data and learn how to scrape it from a website that you find interesting. It's a chance to practice your web scraping skills and explore a source of information you like.

In [1]:
# Lab | Web Scraping
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [None]:
# Insert the url and check the status code
url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,'
res = requests.get(url)
print(res.status_code)

In [None]:
res.reason

In [None]:
# This URL cannot be accessed, so I will use a different URL that points to a list of movies

In [2]:
# Insert the url and check the status code
url = 'https://www.imdb.com/list/ls025598828/'
res = requests.get(url)
print(res.status_code)

200


In [3]:
# Translate HTML code
soup = BeautifulSoup(res.content, 'html.parser')

In [4]:
# Create a Python script using BeautifulSoup and pandas to scrape IMDb movie data based on user ratings and release dates.
# Create a list gathering movie positions
movie_position = soup.find_all('span', {'class': "lister-item-index unbold text-primary"})

In [5]:
# Access to every element in the movie_position list
positions = []

for i in movie_position:
    positions.append(i.getText())


In [6]:
# Create a list gathering movie titles
movie_title = soup.find_all('h3', {'class': "lister-item-header"})


In [7]:
# Access to every element in the movie_title list
titles = []

for i in movie_title:
    titles.append(i.getText())


In [8]:
# Clean titles list


# Pattern to remove a leading number followed by a period and a trailing year in parentheses
# Note: each raw title starts with a newline, so the `^\d+\.` branch never matches (see the output below)
pattern = r'^\d+\.\s*|\s*\(\d{4}\)\s*$'

# List to store the cleaned titles
modified_titles = []

# Iterate over the original list and apply the substitutions
for title in titles:
    cleaned_title = re.sub(pattern, '', title).strip()
    modified_titles.append(cleaned_title)

# Print the list of cleaned titles
for title in modified_titles:
    print(title)
    


1.
Breaking Bad
(2008–2013)
2.
Minecraft
(2009 Video Game)
3.
Crank: Alto voltaje
4.
Django desencadenado
5.
Crank: Veneno en la sangre
6.
John Wick (Otro día para matar)
7.
John Wick: Pacto de sangre
8.
Juego de armas
9.
Los odiosos ocho
10.
A serbian film
11.
Seven
12.
Pulp Fiction
13.
Capitán Phillips
14.
El precio del poder
15.
El lobo de Wall Street
16.
Malditos bastardos
17.
Focus
(II)
18.
The human centipede (First sequence)
19.
Corazones de acero
20.
No es país para viejos
21.
El francotirador
22.
Toc Toc
(I)
23.
El show de Truman
24.
Tropa de élite
25.
Caminando entre las tumbas
26.
Leatherface
27.
Paranormal movie 2
28.
Scary Movie
29.
Scary Movie 2
30.
Scary Movie 3
31.
The Pirates of Somalia
32.
Crudo
33.
American Psycho
34.
Memento
35.
Los colegas del barrio
36.
Réquiem por un sueño
37.
Taxi Driver
38.
Salvar al soldado Ryan
39.
Espías
40.
El protector (Homefront)
(I)
41.
Spring Breakers
42.
Náufrago
43.
Deber cumplido
(I)
44.
Saw VIII
(I)
45.
We Are Your Friends
46.
The D

In [9]:
# List to store the cleaned titles
cleaned_titles = []

# Iterate over the original list and strip the leading number and period and the trailing year from each title
for title in titles:
    cleaned_title = title.split('.', 1)[-1].split('(', 1)[0].strip()  # Drop everything before the first period and everything after the first parenthesis
    cleaned_titles.append(cleaned_title)

# Print the list of cleaned titles
for title in cleaned_titles:
    print(title)



Breaking Bad
Minecraft
Crank: Alto voltaje
Django desencadenado
Crank: Veneno en la sangre
John Wick
John Wick: Pacto de sangre
Juego de armas
Los odiosos ocho
A serbian film
Seven
Pulp Fiction
Capitán Phillips
El precio del poder
El lobo de Wall Street
Malditos bastardos
Focus
The human centipede
Corazones de acero
No es país para viejos
El francotirador
Toc Toc
El show de Truman
Tropa de élite
Caminando entre las tumbas
Leatherface
Paranormal movie 2
Scary Movie
Scary Movie 2
Scary Movie 3
The Pirates of Somalia
Crudo
American Psycho
Memento
Los colegas del barrio
Réquiem por un sueño
Taxi Driver
Salvar al soldado Ryan
Espías
El protector
Spring Breakers
Náufrago
Deber cumplido
Saw VIII
We Are Your Friends
The Disaster Artist
It
Paranormal Activity 4
La cosa
Depredador
El resplandor
Donnie Darko
La naranja mecánica
12 monos
La milla verde
Dunkerque
Deadpool 2
Alien, el octavo pasajero
La chaqueta metálica
El otro guardaespaldas
El padrino
El exorcista
El protegido
El bueno, el feo y 
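
Splitting on the first parenthesis also trims titles that legitimately contain one (for example, "John Wick (Otro día para matar)" becomes "John Wick" above). A line-based alternative is sketched below. It assumes each raw title arrives as separate lines for the index, the title, and any trailing markers, as the printed output of the first attempt suggests; `clean_title` is a hypothetical helper name.

```python
import re


def clean_title(raw):
    """Return only the title line from a raw header string (hypothetical helper)."""
    # Split into non-empty, stripped lines: index, title, optional markers.
    parts = [line.strip() for line in raw.splitlines() if line.strip()]
    # Drop a leading index such as "1."
    if parts and re.fullmatch(r"\d+\.", parts[0]):
        parts = parts[1:]
    # Drop trailing lines that are entirely parenthesised, e.g. "(2014)" or "(II)".
    while len(parts) > 1 and re.fullmatch(r"\(.*\)", parts[-1]):
        parts.pop()
    return " ".join(parts)


print(clean_title("\n6.\nJohn Wick (Otro día para matar)\n(2014)\n"))
```

Because only whole lines are removed, parentheses inside the title itself are preserved.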

In [10]:
# Create a list gathering movie years
movie_year = soup.find_all('span', {'class': "lister-item-year text-muted unbold"})


In [11]:
# Access to every element in the movie_year list
years = []

for i in movie_year:
    years.append(i.getText())


In [12]:
# Clean years list
years = [i.replace('(', '') for i in years]
years = [i.replace(')', '') for i in years]


In [13]:
# Create a list gathering movie ratings
movie_rating = soup.find_all('span', {'class': "ipl-rating-star__rating"})


In [14]:
# Access to every element in the movie_rating list
ratings = []

for i in movie_rating:
    ratings.append(i.getText())




In [15]:
# Clean ratings list

while "Rate" in ratings:
    ratings.remove("Rate")

for item in ratings:
    print(item)

9.5
0
1
2
3
4
5
6
7
8
9
10
8.6
0
1
2
3
4
5
6
7
8
9
10
6.1
0
1
2
3
4
5
6
7
8
9
10
8.5
0
1
2
3
4
5
6
7
8
9
10
6.9
0
1
2
3
4
5
6
7
8
9
10
7.4
0
1
2
3
4
5
6
7
8
9
10
7.4
0
1
2
3
4
5
6
7
8
9
10
7.1
0
1
2
3
4
5
6
7
8
9
10
7.8
0
1
2
3
4
5
6
7
8
9
10
5
0
1
2
3
4
5
6
7
8
9
10
8.6
0
1
2
3
4
5
6
7
8
9
10
8.9
0
1
2
3
4
5
6
7
8
9
10
7.8
0
1
2
3
4
5
6
7
8
9
10
8.3
0
1
2
3
4
5
6
7
8
9
10
8.2
0
1
2
3
4
5
6
7
8
9
10
8.4
0
1
2
3
4
5
6
7
8
9
10
6.6
0
1
2
3
4
5
6
7
8
9
10
4.4
0
1
2
3
4
5
6
7
8
9
10
7.6
0
1
2
3
4
5
6
7
8
9
10
8.2
0
1
2
3
4
5
6
7
8
9
10
7.3
0
1
2
3
4
5
6
7
8
9
10
4.9
0
1
2
3
4
5
6
7
8
9
10
8.2
0
1
2
3
4
5
6
7
8
9
10
8
0
1
2
3
4
5
6
7
8
9
10
6.5
0
1
2
3
4
5
6
7
8
9
10
5
0
1
2
3
4
5
6
7
8
9
10
4.7
0
1
2
3
4
5
6
7
8
9
10
6.3
0
1
2
3
4
5
6
7
8
9
10
5.3
0
1
2
3
4
5
6
7
8
9
10
5.5
0
1
2
3
4
5
6
7
8
9
10
6.7
0
1
2
3
4
5
6
7
8
9
10
7
0
1
2
3
4
5
6
7
8
9
10
7.6
0
1
2
3
4
5
6
7
8
9
10
8.4
0
1
2
3
4
5
6
7
8
9
10
6.5
0
1
2
3
4
5
6
7
8
9
10
8.3
0
1
2
3
4
5
6
7
8
9
10
8.2
0
1
2
3
4
5
6
7
8
9
10
8.6
0
1
2

In [16]:
# Regular expression that matches decimal numbers
patron_decimal = re.compile(r'\d+\.\d+')

# Keep only the decimal numbers in the list
# Note: this also drops whole-number ratings such as "5", which leaves the list short
ratings_decimales = [item for item in ratings if re.match(patron_decimal, item)]

# Print the resulting list containing only decimal numbers
for item in ratings_decimales:
    print(item)

9.5
8.6
6.1
8.5
6.9
7.4
7.4
7.1
7.8
8.6
8.9
7.8
8.3
8.2
8.4
6.6
4.4
7.6
8.2
7.3
4.9
8.2
6.5
4.7
6.3
5.3
5.5
6.7
7.6
8.4
6.5
8.3
8.2
8.6
6.5
5.3
7.8
6.6
5.7
6.2
7.3
7.3
4.6
8.2
7.8
8.4
8.3
8.6
7.8
7.6
8.5
8.3
6.9
9.2
8.1
7.3
8.8
7.3
8.2
8.3
8.3
8.2
8.6
6.6
7.6
7.5
8.7
8.4
8.6
6.2
7.7
6.8
8.5
8.7
8.1
8.5
7.5
7.3
6.3
7.2
8.3
8.7
8.2
8.1
6.6
7.4
6.4
8.3
7.1
8.2
8.5
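
The decimal filter discards ratings that happen to be whole numbers, so the list ends up shorter than the number of movies. One possible alternative, sketched under the assumption that every rating is immediately followed by the eleven rating-widget values "0" to "10" (as in the printed output above), is to look for that widget block and keep the value just before it. `extract_ratings` and `WIDGET` are hypothetical names.

```python
# Assumption: the rating widget always contributes the strings "0" .. "10".
WIDGET = [str(n) for n in range(11)]


def extract_ratings(values):
    """Keep each value that is directly followed by the 0..10 widget block."""
    ratings = []
    i = 0
    while i < len(values):
        if values[i + 1:i + 12] == WIDGET:
            ratings.append(float(values[i]))
            i += 12  # skip the rating and its widget block
        else:
            i += 1
    return ratings


sample = ["9.5"] + WIDGET + ["5"] + WIDGET + ["8.6"] + WIDGET
print(extract_ratings(sample))
```

This keeps whole-number ratings such as "5" and returns floats, which are easier to filter by threshold later.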


In [17]:
# Create a list gathering movie runtime
movie_runtime = soup.find_all('span', {'class': "runtime"})


In [18]:
# Access to every element in the movie_runtime list
runtimes = []

for i in movie_runtime:
    runtimes.append(i.getText())




In [19]:
# Create a list gathering movie genres
movie_genre = soup.find_all('span', {'class': "genre"})


In [20]:
# Access to every element in the movie_genre list
genres = []

for i in movie_genre:
    genres.append(i.getText())




In [21]:
# Clean genres list
genres = [i.replace('\n', '') for i in genres]
genres = [i.strip() for i in genres]


In [22]:
# Create a list gathering movie descriptions
descriptions = []

In [23]:
# Create movie_data
movie_data = soup.find_all('div', attrs={'class': 'lister-item mode-detail'})

In [24]:
# Iterate through movie_data to append descriptions
for movie in movie_data:
    
    description = movie.find_all('p', class_ = '' )[0].text.replace('\n','')
    descriptions.append(description)
    
     


In [25]:
# Create a list gathering movie directors
# Find all <a> elements that have an href attribute
movie_directors = soup.find_all('a', href=True)
# Keep only the links whose href contains /name (this matches directors and cast alike)
directors = [element.text.strip() for element in movie_directors if '/name' in element['href']]
    
# Print the collected names
for director in directors:
    print(director)

Bryan Cranston
Aaron Paul
Anna Gunn
Betsy Brandt
Agnes Larsson
Pierre Coffin
Katie Crown
CW21
Luke Harrison
Mark Neveldine
Brian Taylor
Jason Statham
Amy Smart
Clifton Collins Jr.
Dwight Yoakam
Quentin Tarantino
Jamie Foxx
Christoph Waltz
Leonardo DiCaprio
Kerry Washington
Mark Neveldine
Brian Taylor
Jason Statham
Amy Smart
Carlos Sanz
Jose Pablo Cantillo
Chad Stahelski
Keanu Reeves
Michael Nyqvist
Alfie Allen
Willem Dafoe
Chad Stahelski
Keanu Reeves
Riccardo Scamarcio
Ian McShane
Ruby Rose
Todd Phillips
Jonah Hill
Miles Teller
Steve Lantz
Gregg Weiner
Quentin Tarantino
Samuel L. Jackson
Kurt Russell
Jennifer Jason Leigh
Walton Goggins
Srdjan Spasojevic
Srdjan 'Zika' Todorovic
Sergej Trifunovic
Jelena Gavrilovic
Slobodan Bestic
David Fincher
Morgan Freeman
Brad Pitt
Kevin Spacey
Andrew Kevin Walker
Quentin Tarantino
John Travolta
Uma Thurman
Samuel L. Jackson
Bruce Willis
Richard Phillips
Paul Greengrass
Tom Hanks
Barkhad Abdi
Barkhad Abdirahman
Catherine Keener
Brian De Palma
Al Pacin

In [26]:
# Reduce directors list
modified_directors = directors[:100]

In [27]:
# Create a list gathering movie stars
# Find all <a> elements that have an href attribute
star_links = soup.find_all('a', href=True)

# Keep only the links whose href contains /name (this matches directors and cast alike)
stars = [element.text.strip() for element in star_links if '/name' in element['href']]

# Print the collected names
for star in stars:
    print(star)

Bryan Cranston
Aaron Paul
Anna Gunn
Betsy Brandt
Agnes Larsson
Pierre Coffin
Katie Crown
CW21
Luke Harrison
Mark Neveldine
Brian Taylor
Jason Statham
Amy Smart
Clifton Collins Jr.
Dwight Yoakam
Quentin Tarantino
Jamie Foxx
Christoph Waltz
Leonardo DiCaprio
Kerry Washington
Mark Neveldine
Brian Taylor
Jason Statham
Amy Smart
Carlos Sanz
Jose Pablo Cantillo
Chad Stahelski
Keanu Reeves
Michael Nyqvist
Alfie Allen
Willem Dafoe
Chad Stahelski
Keanu Reeves
Riccardo Scamarcio
Ian McShane
Ruby Rose
Todd Phillips
Jonah Hill
Miles Teller
Steve Lantz
Gregg Weiner
Quentin Tarantino
Samuel L. Jackson
Kurt Russell
Jennifer Jason Leigh
Walton Goggins
Srdjan Spasojevic
Srdjan 'Zika' Todorovic
Sergej Trifunovic
Jelena Gavrilovic
Slobodan Bestic
David Fincher
Morgan Freeman
Brad Pitt
Kevin Spacey
Andrew Kevin Walker
Quentin Tarantino
John Travolta
Uma Thurman
Samuel L. Jackson
Bruce Willis
Richard Phillips
Paul Greengrass
Tom Hanks
Barkhad Abdi
Barkhad Abdirahman
Catherine Keener
Brian De Palma
Al Pacin

In [28]:
# Reduce stars list
modified_stars = stars[:100]
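
Both lists above contain every `/name` link on the page, so directors and stars are mixed together and the first 100 names do not line up with the 100 movies. One way to separate them is to work on the credits text of each movie instead. The sketch below assumes that text has the form "Director: ... | Stars: ..." (with "Directors:" for several directors, and no director part for series); that layout is an assumption about the page, and `split_credits` is a hypothetical helper.

```python
import re


def split_credits(text):
    """Split one movie's credits text into (directors, stars) lists."""
    # Collapse newlines and repeated spaces into single spaces.
    flat = " ".join(text.split())
    directors, stars = [], []
    match = re.search(r"Directors?:\s*(.*?)\s*(?:\||$)", flat)
    if match:
        directors = [name.strip() for name in match.group(1).split(",") if name.strip()]
    match = re.search(r"Stars?:\s*(.*)$", flat)
    if match:
        stars = [name.strip() for name in match.group(1).split(",") if name.strip()]
    return directors, stars


credits = "\n  Director:\nQuentin Tarantino\n  | \n  Stars:\nJamie Foxx, \nChristoph Waltz\n"
print(split_credits(credits))
```

Applied once per movie block, this yields one directors entry and one stars entry per row, even when a title has no director listed.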

In [29]:
# Create a list gathering movie votes
# Find all <span> elements with the attribute name="nv", which hold the vote counts
vote_spans = soup.find_all('span', {'name': 'nv'})

# Create a list to store the vote counts
votes = []

# Iterate over the elements found and extract the vote counts
for vote_span in vote_spans:
    vote_value = vote_span.get_text(strip=True)  # Get the element text without surrounding whitespace
    votes.append(vote_value)  # Add the vote count to the list
    


In [30]:
# Clean votes list

modified_votes = [valor for valor in votes if '$' not in valor and 'M' not in valor]




In [31]:
# Create a list gathering movie gross

# Find all <span> elements with the attribute name="nv", which also hold the gross values
gross_spans = soup.find_all('span', {'name': 'nv'})

# Create a list to store the gross values
gross = []

# Iterate over the elements found and extract the gross values
for gross_span in gross_spans:
    gross_value = gross_span.get_text(strip=True)  # Get the element text without surrounding whitespace
    gross.append(gross_value)  # Add the gross value to the list
    


In [32]:
# Clean gross list

# Regular expression that looks for commas in the values
patron_coma = r','

# Keep only the values that do not contain commas
modified_gross = [valor for valor in gross if not re.search(patron_coma, valor)]
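
The lists built so far still hold strings, while the expected columns "Runtime (min)", "Votes", and "Gross ($M)" are numeric. A possible conversion step is sketched below. The input formats ("152 min", "1,234,567", "$162.81M") are assumptions about how the page displays these values, and the three helper names are hypothetical.

```python
import re


def parse_runtime(text):
    """'152 min' -> 152; returns None when no number is present."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def parse_votes(text):
    """'1,234,567' -> 1234567; returns None for anything that is not a count."""
    digits = text.replace(",", "")
    return int(digits) if digits.isdigit() else None


def parse_gross(text):
    """'$162.81M' -> 162.81; returns None when the text is not a gross figure."""
    match = re.fullmatch(r"\$([\d.]+)M", text.strip())
    return float(match.group(1)) if match else None


print(parse_runtime("152 min"), parse_votes("1,234,567"), parse_gross("$162.81M"))
```

Matching on the `$...M` shape is also a stricter way to tell gross values from vote counts than checking for commas, since a vote count below 1,000 contains no comma either.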



In [33]:
# Create a dictionary with every list created
movies_dict = {}
movies_dict['Position'] = positions
movies_dict['Title'] = cleaned_titles
movies_dict['Years'] = years
movies_dict['Ratings'] = ratings_decimales
movies_dict['Runtimes'] = runtimes
movies_dict['Genres'] = genres
movies_dict['Descriptions'] = descriptions
movies_dict['Directors'] = modified_directors
movies_dict['Stars'] = modified_stars
movies_dict['Votes'] = modified_votes
movies_dict['Gross'] = modified_gross


In [34]:
len(modified_gross)

95

In [35]:
# Fill lists
# Fill value
valor_relleno = 0

# Pad the list until it has 100 values
while len(ratings_decimales) < 100:
    ratings_decimales.append(valor_relleno)

    
len(ratings_decimales)

100

In [36]:
# Fill lists
valor_relleno = 0

# Pad the list until it has 100 values
while len(runtimes) < 100:
    runtimes.append(valor_relleno)

    
len(runtimes)

100

In [37]:
# Fill lists
valor_relleno = 0

# Pad the list until it has 100 values
while len(modified_gross) < 100:
    modified_gross.append(valor_relleno)

    
len(modified_gross)

100
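
The three padding cells above repeat the same loop. A small helper could pad every column in one pass, as sketched below; `pad_columns` is a hypothetical name, and `None` is used as the fill value here so that missing entries are not mistaken for a real 0. Note that padding at the end only equalizes lengths: it does not put a missing value back on the row it belongs to.

```python
def pad_columns(columns, fill=None):
    """Return a copy of the dict with every list padded to the longest length."""
    target = max(len(values) for values in columns.values())
    return {
        name: list(values) + [fill] * (target - len(values))
        for name, values in columns.items()
    }


example = {"Title": ["A", "B", "C"], "Gross": ["$1.00M"]}
padded = pad_columns(example)
print(padded)
```

A dictionary padded this way can be passed to `pd.DataFrame` directly, since all columns then share one length.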

In [38]:
# Create a dataframe with the dictionary
df = pd.DataFrame(movies_dict)

In [39]:
# Show all rows of the dataframe
pd.set_option('display.max_rows', None)

## BONUS

The search results span multiple pages, housing a total of 631 movies in our example with each page displaying 50 movies at most. To scrape data seamlessly from all pages, you'll need to dive deep into the structure of the URLs generated with each "Next" click.

Take a close look at the following URLs:
- First page:
  ```
  https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,
  ```
- Second page:
  ```
  https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=51&ref_=adv_nxt
  ```
- Third page:
  ```
  https://www.imdb.com/search/title/?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,&start=101&ref_=adv_nxt
  ```

You should notice a pattern. There is a `start` parameter incrementing by 50 with each page, paired with a constant `ref_` parameter holding the value "adv_nxt".

Modify your script so it's capable of iterating over all available pages to fetch data on all the 631 movies (631 is the total number of movies in the proposed example).
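
The `start` pattern described above can be generated rather than typed by hand. The sketch below only builds the list of page URLs and makes no requests; `page_urls` is a hypothetical helper, and the `start` and `ref_` parameters are taken from the example URLs above, so they may differ on the current site.

```python
def page_urls(first_page_url, total_results, page_size=50):
    """Return the URL of every results page, starting with the first."""
    urls = [first_page_url]
    # Later pages start at 51, 101, ... up to the last page that holds results.
    for start in range(page_size + 1, total_results + 1, page_size):
        urls.append(f"{first_page_url}&start={start}&ref_=adv_nxt")
    return urls


first = ("https://www.imdb.com/search/title/"
         "?title_type=feature&release_date=1990-01-01,1992-12-31&user_rating=7.5,")
urls = page_urls(first, 631)
print(len(urls))
print(urls[-1])
```

For 631 results at 50 per page this gives 13 URLs; each one can then be passed through the same parsing steps used for a single page, with the per-page results concatenated at the end.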

In [None]:
# Your solution goes here