<a href="https://colab.research.google.com/github/ferkrum/web-scraper-imdb/blob/main/Web_Scraper_IMDB_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting an up-to-date list of movie releases from IMDb

How can we get an up-to-date list of the latest movie releases (including synopses) from IMDb?

We scrape the release calendar with the Beautiful Soup library and build a dataframe so the results can be viewed and exported to a CSV file.


---
*Version history:
v2: added the release date field to the dataframe*

In [4]:
# Import libraries
import requests
from bs4 import BeautifulSoup

In [5]:
from datetime import datetime
from pytz import timezone 
brasil = timezone('Brazil/East')
bs_time = datetime.now(brasil)
timestampBrasil = bs_time.strftime('%Y-%m-%d_%H-%M-%S') 
mes = bs_time.month  # current month as an integer

In [6]:
print(timestampBrasil)

2022-07-15_17-33-26
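
The same file-name-friendly timestamp can be sketched with the standard library alone. This is only an illustration: it assumes Brasília time is a fixed UTC-3 offset with no daylight saving, and `BRASILIA` and `make_timestamp` are hypothetical names that do not appear in the notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Brasília time is a fixed UTC-3 offset (no daylight saving time).
BRASILIA = timezone(timedelta(hours=-3))

def make_timestamp(moment: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DD_HH-MM-SS in Brasília time."""
    return moment.astimezone(BRASILIA).strftime('%Y-%m-%d_%H-%M-%S')

# A fixed instant keeps the example reproducible.
example = datetime(2022, 7, 15, 20, 33, 26, tzinfo=timezone.utc)
stamp = make_timestamp(example)
print(stamp)
```

Passing the moment in as an argument, rather than calling `datetime.now()` inside the function, makes the formatting easy to check against a known value.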


In [7]:
# parsing
response = requests.get('https://www.imdb.com/calendar/?ref_=nv_mv_cal').content
soup = BeautifulSoup(response, 'html.parser')

In [8]:
# inspect the parsed content
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb: Upcoming Releases for United States - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/calendar/" rel="canonical"/>
<meta content="http://www.imdb.com/calendar/" property="og:url">
<script>
    if (typeof uet == 'functio

Opening the URL in a browser and inspecting the page source, we identified the tag that holds the content we are after: div id=main

In [9]:
lancamentos = soup.find("div",id="main")

In [10]:
# check the content filtered into lancamentos
lancamentos

<div id="main">
<h4>15 July 2022</h4>
<ul>
<li>
<a href="/title/tt9411972/?ref_=rlm">Where the Crawdads Sing</a> (2022)
                        </li>
<li>
<a href="/title/tt4428398/?ref_=rlm">Paws of Fury: The Legend of Hank</a> (2022)
                        </li>
<li>
<a href="/title/tt5151570/?ref_=rlm">Mrs Harris Goes to Paris</a> (2022)
                        </li>
</ul>
<h4>22 July 2022</h4>
<ul>
<li>
<a href="/title/tt10954984/?ref_=rlm">Nope</a> (2022)
                        </li>
<li>
<a href="/title/tt10530838/?ref_=rlm">How to Please a Woman</a> (2022)
                        </li>
<li>
<a href="/title/tt8595016/?ref_=rlm">My Old School</a> (2022)
                        </li>
<li>
<a href="/title/tt14584284/?ref_=rlm">Alone Together</a> (2022)
                        </li>
</ul>
<h4>29 July 2022</h4>
<ul>
<li>
<a href="/title/tt8912936/?ref_=rlm">DC League of Super-Pets</a> (2022)
                        </li>
<li>
<a href="/title/tt12262116/?ref_=rlm">Thirteen Lives</a> 


```
Note that each h4 header is followed by an unordered list <ul>.
We will use this structure to add the date column to the dataframe.
```
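
The pairing of each `<h4>` with the `<ul>` that follows it can be sketched on a small inline sample. The HTML below copies a few entries from the output above; `sample_html` and `titles_by_date` are hypothetical names, and `find_next_sibling` is just one possible way to pair the two, not necessarily the approach the notebook takes.

```python
from bs4 import BeautifulSoup

sample_html = """
<div id="main">
<h4>15 July 2022</h4>
<ul>
<li><a href="/title/tt9411972/?ref_=rlm">Where the Crawdads Sing</a> (2022)</li>
<li><a href="/title/tt5151570/?ref_=rlm">Mrs Harris Goes to Paris</a> (2022)</li>
</ul>
<h4>22 July 2022</h4>
<ul>
<li><a href="/title/tt10954984/?ref_=rlm">Nope</a> (2022)</li>
</ul>
</div>
"""

main = BeautifulSoup(sample_html, "html.parser").find("div", id="main")

titles_by_date = {}
for heading in main.find_all("h4"):
    # The movie list for a date is the first <ul> sibling after its <h4>.
    movie_list = heading.find_next_sibling("ul")
    titles_by_date[heading.text] = [a.text for a in movie_list.find_all("a")]

print(titles_by_date)
```

Looking up the sibling from each heading ties a date to its own list directly, so the result does not depend on the two `find_all` results having the same length and order.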



In [11]:
# find the release dates
datas = lancamentos.find_all("h4")

In [12]:
# number of distinct dates
len(datas)

59

In [13]:
datas

[<h4>15 July 2022</h4>,
 <h4>22 July 2022</h4>,
 <h4>29 July 2022</h4>,
 <h4>05 August 2022</h4>,
 <h4>12 August 2022</h4>,
 <h4>18 August 2022</h4>,
 <h4>19 August 2022</h4>,
 <h4>26 August 2022</h4>,
 <h4>31 August 2022</h4>,
 <h4>02 September 2022</h4>,
 <h4>07 September 2022</h4>,
 <h4>09 September 2022</h4>,
 <h4>13 September 2022</h4>,
 <h4>16 September 2022</h4>,
 <h4>23 September 2022</h4>,
 <h4>30 September 2022</h4>,
 <h4>01 October 2022</h4>,
 <h4>07 October 2022</h4>,
 <h4>14 October 2022</h4>,
 <h4>15 October 2022</h4>,
 <h4>21 October 2022</h4>,
 <h4>28 October 2022</h4>,
 <h4>02 November 2022</h4>,
 <h4>04 November 2022</h4>,
 <h4>11 November 2022</h4>,
 <h4>18 November 2022</h4>,
 <h4>23 November 2022</h4>,
 <h4>02 December 2022</h4>,
 <h4>15 December 2022</h4>,
 <h4>16 December 2022</h4>,
 <h4>21 December 2022</h4>,
 <h4>22 December 2022</h4>,
 <h4>25 December 2022</h4>,
 <h4>06 January 2023</h4>,
 <h4>13 January 2023</h4>,
 <h4>27 January 2023</h4>,
 <h4>03 February 2

In [14]:
# movies listed under each date found
datasFilmes = lancamentos.find_all("ul")

In [15]:
# confirm this matches the number of dates seen in the previous cell
len(datasFilmes)

59

In [16]:
# store the number of movies
qtdadeFilmes = len(lancamentos.find_all("a"))

In [17]:
qtdadeFilmes

100

... testing

In [18]:
datas[0].text

'15 July 2022'

In [19]:
for j in datasFilmes[0].find_all("a"):
  print(j.text)

Where the Crawdads Sing
Paws of Fury: The Legend of Hank
Mrs Harris Goes to Paris


In [20]:
contaDatas = 0
indice = 0
dictLancamentos = {}
for i in datas:                                   # iterate over the dates
  for j in datasFilmes[contaDatas].find_all("a"): # iterate over every movie link under this date
    indice += 1
    print (indice, " de ", qtdadeFilmes, end = ' ')
    print (i.text, end = ' ')
    
    url = "https://www.imdb.com" + j["href"]        # each href points to that movie's own page
    response2 = requests.get(url).content           # fetch the movie page
    soup2 = BeautifulSoup(response2, 'html.parser') # parse the page content
    print("Dados do filme: ", j.text)
    tagSinopse = soup2.find("span",class_="sc-16ede01-2 gXUyNh")     # span that holds the synopsis
    sinopse = tagSinopse.text if tagSinopse is not None else ""      # empty string if the span is missing
  
    dictLancamentos[indice] = {"nome" : j.text, "data lancamento" : i.text, "url" : url, "sinopse" : sinopse}
  contaDatas+=1
#print(contaDatas)

1  de  100 15 July 2022 Dados do filme:  Where the Crawdads Sing
2  de  100 15 July 2022 Dados do filme:  Paws of Fury: The Legend of Hank
3  de  100 15 July 2022 Dados do filme:  Mrs Harris Goes to Paris
4  de  100 22 July 2022 Dados do filme:  Nope
5  de  100 22 July 2022 Dados do filme:  How to Please a Woman
6  de  100 22 July 2022 Dados do filme:  My Old School
7  de  100 22 July 2022 Dados do filme:  Alone Together
8  de  100 29 July 2022 Dados do filme:  DC League of Super-Pets
9  de  100 29 July 2022 Dados do filme:  Thirteen Lives
10  de  100 29 July 2022 Dados do filme:  Vengeance
11  de  100 29 July 2022 Dados do filme:  The Reef: Stalked
12  de  100 29 July 2022 Dados do filme:  Resurrection
13  de  100 29 July 2022 Dados do filme:  Ali & Ava
14  de  100 05 August 2022 Dados do filme:  Bullet Train
15  de  100 05 August 2022 Dados do filme:  Bodies Bodies Bodies
16  de  100 05 August 2022 Dados do filme:  Easter Sunday
17  de  100 05 August 2022 Dados do filme:  I Love My D
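
The synopsis lookup depends on a generated class name (`sc-16ede01-2 gXUyNh`), which may stop matching if the site's markup changes. A minimal sketch of a more defensive lookup follows. It assumes, without confirmation from the source, that a title page also carries the plot in a `<meta name="description">` tag; `extract_synopsis` and the sample HTML strings are hypothetical.

```python
from bs4 import BeautifulSoup

def extract_synopsis(html: str, default: str = "") -> str:
    """Return the plot text from a title page, or `default` if none is found."""
    page = BeautifulSoup(html, "html.parser")
    # First try the class used in the notebook (may break if the site changes).
    span = page.find("span", class_="sc-16ede01-2 gXUyNh")
    if span is not None:
        return span.get_text(strip=True)
    # Assumed fallback: a <meta name="description"> tag holding the plot.
    meta = page.find("meta", attrs={"name": "description"})
    if meta is not None and meta.get("content"):
        return meta["content"].strip()
    return default

# Hypothetical page fragments used only to exercise each branch.
with_span = '<span class="sc-16ede01-2 gXUyNh">The plot is unknown at this time.</span>'
with_meta = '<html><head><meta name="description" content="A sample plot."></head></html>'
without_plot = '<html><body></body></html>'

print(extract_synopsis(with_span))
print(extract_synopsis(with_meta))
print(repr(extract_synopsis(without_plot)))
```

Returning a default instead of raising lets a long scraping run continue when a single page has no synopsis.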

In [21]:
import pandas as pd

In [22]:
dfFilmes = pd.DataFrame.from_dict(dictLancamentos).T

In [23]:
dfFilmes

Unnamed: 0,nome,data lancamento,url,sinopse
1,Where the Crawdads Sing,15 July 2022,https://www.imdb.com/title/tt9411972/?ref_=rlm,A woman who raised herself in the marshes of t...
2,Paws of Fury: The Legend of Hank,15 July 2022,https://www.imdb.com/title/tt4428398/?ref_=rlm,"Hank, a loveable dog with a head full of dream..."
3,Mrs Harris Goes to Paris,15 July 2022,https://www.imdb.com/title/tt5151570/?ref_=rlm,A widowed cleaning lady in 1950s London falls ...
4,Nope,22 July 2022,https://www.imdb.com/title/tt10954984/?ref_=rlm,The residents of a lonely gulch in inland Cali...
5,How to Please a Woman,22 July 2022,https://www.imdb.com/title/tt10530838/?ref_=rlm,When her all-male house-cleaning business gets...
...,...,...,...,...
96,The Flash,23 June 2023,https://www.imdb.com/title/tt0439572/?ref_=rlm,The plot is unknown. Feature film based on the...
97,Indiana Jones 5,30 June 2023,https://www.imdb.com/title/tt1462764/?ref_=rlm,The plot is unknown at this time.
98,Naya Legend of the Golden Dolphin,30 June 2023,https://www.imdb.com/title/tt7890826/?ref_=rlm,This is an action film following the adventure...
99,Madame Web,07 July 2023,https://www.imdb.com/title/tt11057302/?ref_=rlm,Spin-off from Spider-Man centering on a clairv...
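
The `data lancamento` column holds strings such as `15 July 2022`. As a sketch, they can be converted to real dates so the table can be sorted or filtered by release date. The rows below are copied from the output above; `df_sample`, `july_2022`, and the new `data` column are hypothetical names, and the `%d %B %Y` format assumes English month names.

```python
import pandas as pd

df_sample = pd.DataFrame(
    {
        "nome": ["Madame Web", "Where the Crawdads Sing", "Nope"],
        "data lancamento": ["07 July 2023", "15 July 2022", "22 July 2022"],
    }
)

# Parse day, full month name, and year into datetime64 values.
df_sample["data"] = pd.to_datetime(df_sample["data lancamento"], format="%d %B %Y")

# Keep only releases up to the end of July 2022, sorted by date.
cutoff = pd.Timestamp("2022-07-31")
july_2022 = df_sample[df_sample["data"] <= cutoff].sort_values("data")
print(july_2022["nome"].tolist())
```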


In [27]:
# export to CSV
dfFilmes.to_csv('lancamentos_' + timestampBrasil + '.csv')
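
When the index is written without a name, pandas labels it `Unnamed: 0` when the file is read back. A small sketch of a round trip that names the index instead, using an in-memory buffer in place of a file. The two rows are copied from the table above; `dict_sample`, `df_sample`, and the `indice` label are assumptions for illustration.

```python
import io
import pandas as pd

dict_sample = {
    1: {"nome": "Nope", "data lancamento": "22 July 2022"},
    2: {"nome": "Madame Web", "data lancamento": "07 July 2023"},
}

# orient="index" builds one row per key, so no transpose is needed.
df_sample = pd.DataFrame.from_dict(dict_sample, orient="index")

buffer = io.StringIO()
df_sample.to_csv(buffer, index_label="indice")  # give the index column a name
buffer.seek(0)

restored = pd.read_csv(buffer, index_col="indice")
print(restored.columns.tolist())
```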