# Web Scraping IMDB Action Movies

Author: [Antonio Valbuena](https://www.linkedin.com/in/antonio-valbuena-s%C3%A1nchez-960718130/)


<table><tr>
<td> <img src="https://m.media-amazon.com/images/G/01/IMDb/BG_rectangle._CB1509060989_SY230_SX307_AL_.png" alt="Drawing" style="width: 250px;"/> </td>
<td> <img src="https://nofilmschool.com/sites/default/files/styles/article_wide/public/rambo_0.jpg?itok=K3d91BmS" alt="Drawing" style="width: 450px;"/> </td>
</tr></table>
Yours truly, unleashing his web scraping code on the IMDB site (<i>dramatization</i>)





## Introduction

<span>&#x1f1ec;&#x1f1e7;</span> I learned web scraping recently and have decided to put it into practice on the movie website [IMDB](https://www.imdb.com/). My goal is to web scrape information for as many Action movies as I can and to perform further analysis on them. The information I aim to obtain from each movie is:

+ Title of the movie
+ Year
+ Duration
+ IMDB rating (community)
+ Metascore rating (critics)
+ Director and main stars

---

<span>&#x1f1ea;&#x1f1f8;</span> Aprendí web scraping hace poco y he decidido ponerlo en práctica sobre la web de pelis [IMDB](https://www.imdb.com/). Mi objetivo será obtener información para el mayor número de películas de acción posible para después analizarlas. La información para cada película que deseo obtener es:

+ Título de la peli
+ Año
+ Duración
+ Rating Imdb (comunidad)
+ Rating Metascore (críticos)
+ Director y estrellas principales


## Importing libraries

<span>&#x1f1ec;&#x1f1e7;</span> We start by importing all the necessary libraries.

---

<span>&#x1f1ea;&#x1f1f8;</span> Empezamos importando las librerías necesarias.


In [498]:
from urllib.request import urlopen as uReq # to grab webpage
from bs4 import BeautifulSoup as soup # to parse HTML text
import pandas as pd # to clean and analyse the data retrieved
import numpy as np # to clean and analyse the data retrieved
from time import sleep # to pause the execution of the loop
from random import randint # to pause the execution of the loop
import seaborn as sns # to plot visualisations of the data
import matplotlib.pyplot as plt # to plot visualisations of the data
import datetime

## Testing content selection: one page and one movie

<span>&#x1f1ec;&#x1f1e7;</span> We first focus on one individual page, the first one, to study its architecture; where the elements we want are and how we can retrieve them.

---

<span>&#x1f1ea;&#x1f1f8;</span> Ahora, nos centramos en una página en particular, la primera, para estudiar su arquitectura; dónde están los elementos que queremos scrapear y cómo accedemos a ellos

In [None]:
# base url from which we gather the data
url_1 = "https://www.imdb.com/search/title/?title_type=feature&genres=action&start=1"

In [306]:
uClient = uReq(url_1) # grab first page
page_html = uClient.read() # read page
#sleep(randint(4,10))
uClient.close() # close the client
page_soup = soup(page_html, "html.parser") #Parse html with beautiful soup

<span>&#x1f1ec;&#x1f1e7;</span> We want to retrieve the info for the 50 movies on this page. The info for each movie is in a container with the class __"lister-item mode-advanced"__.

Finding these names is a matter of experimentation: for different websites and projects, you will want to retrieve different things, and they will have different names. In your browser, inspect the HTML of the page's elements and use trial and error until you find the name of what you want.

---

<span>&#x1f1ea;&#x1f1f8;</span> Queremos obtener información para las 50 pelis de esta página. La info para cada peli está en un container con la clase __"lister-item mode-advanced"__.

Encontrar estos nombres es una cuestión de pura experimentación; para diferentes webs y proyectos, querrás acceder a material distinto y este tendrá nombres diferentes. En tu navegador, inspecciona el HTML de los elementos de la web, brujulea y haz prueba y error hasta que encuentres el nombre de lo que buscas.

In [307]:
containers = page_soup.findAll("div", {"class" : "lister-item mode-advanced"}) 
# we retrieve all the movies in our website

len(containers)

50

<span>&#x1f1ec;&#x1f1e7;</span> Now, we take one container in particular and experiment with ways to get all the elements we want. Remember we would like to get:
+ Title of the movie
+ Year
+ IMDB rating
+ Metascore rating
+ Director and main stars
+ Duration

---

<span>&#x1f1ea;&#x1f1f8;</span> Ahora, tomamos un container (una película) y experimentamos cómo coger los elementos que queremos:

+ Título de la peli
+ Año
+ Duración
+ Rating Imdb (comunidad)
+ Rating Metascore (críticos)
+ Director y estrellas principales

In [308]:
cont = containers[23]

### title

In [309]:
# title
cont.img["alt"]

'Monster Hunter'

### year

In [21]:
# year
cont.findAll("span", {"class" : "lister-item-year text-muted unbold"})[0].text

'(2021)'

### duration

In [24]:
# duration
cont.findAll("span", {"class" : "runtime"})[0].text

'114 min'

### imdb rating

In [30]:
# imdb rating: careful, not all movies have it
cont.findAll("div", {"class" : "inline-block ratings-imdb-rating"})[0].text

'\n\n5.3\n'

### metascore rating

In [174]:
# metascore
cont.findAll("div", {"class" : "inline-block ratings-metascore"})[0].text


'\n45        \n        Metascore\n            '

### director

In [483]:
# director & stars
cont.findAll("p", {"class" : ""})[0].findAll("a")

[<a href="/name/nm0062614/">Harry Baweja</a>,
 <a href="/name/nm0222426/">Ajay Devgn</a>,
 <a href="/name/nm0007107/">Urmila Matondkar</a>,
 <a href="/name/nm0154274/">Mahima Chaudhry</a>,
 <a href="/name/nm0712546/">Paresh Rawal</a>]

In [255]:
"Director:" in cont.findAll("p", {"class" : ""})[0]

False

In [274]:
cont.findAll("p", {"class" : ""})[0].findAll("a")[1:]  # skip the first link (the director), keep the stars

[<a href="/name/nm0000170/">Milla Jovovich</a>,
 <a href="/name/nm1388074/">Tony Jaa</a>,
 <a href="/name/nm1939267/">T.I.</a>,
 <a href="/name/nm0328709/">Meagan Good</a>]

In [207]:
cont.findAll("p", {"class" : ""})[0].findAll("a")[0].text

'Paul W.S. Anderson'

### stars

In [278]:
cont.findAll("p", {"class" : ""})[0].text.split("Stars:")[1].replace("\n", "").split(",")

['Milla Jovovich', ' Tony Jaa', ' T.I.', ' Meagan Good']

In [269]:
cont.findAll("p", {"class" : ""})[0].text.split("Stars:")[0].replace("\n", "").split(",")

['    Director:Paul W.S. Anderson|     ']

In [248]:
cont.findAll("p", {"class" : ""})[0].text.split("|")[0].replace("\n", "").split(":")[1]

'Paul W.S. Anderson'

### Number of votes

In [40]:
# n votes
cont.findAll("span", {"name" : "nv"})[0].text

'20,937'

## Scraping a whole page

<span>&#x1f1ec;&#x1f1e7;</span> Once we know how to get every element we want from a single movie, a loop collects the same fields for all movies on the page. The code gets slightly more complex because the HTML for each movie is very similar but not identical, so we need some if/else conditions to keep the code from breaking. For instance, since not all movies have an IMDB rating, we add a fallback (`np.nan`) in case this field is missing. For each movie, we append the retrieved values to several lists, which will later become the columns of a dataset.

---

<span>&#x1f1ea;&#x1f1f8;</span> Una vez sabemos cómo obtener todos los elementos deseados de una película, con un simple loop podremos obtener esta información para todas las películas de la página. Sin embargo, la sintaxis se vuelve un pelín más compleja puesto que pese a que la estructura para cada película en la web es muy similar, no es idéntica, por lo que tenemos que cubrir esto con algunas condiciones if else para evitar que nuestro código se rompa. Por ejemplo, dado que no todas las películas tienen un rating de IMDB, tenemos que añadir una salvaguarda (`np.nan`) en caso de que este campo se encuentre vacío. Para cada película, añadimos la información que estamos obteniendo a una de varias listas que posteriormente serán las columnas de nuestro dataset.
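
The repeated "check the length, otherwise fall back to NaN" pattern can be folded into one small helper. The sketch below is a hypothetical `first_text` function that is not part of the original notebook. It only assumes that each match exposes a `.text` attribute, as BeautifulSoup tags do, and it uses `float("nan")` as a stand-in for `np.nan` so that it runs with the standard library alone.

```python
from types import SimpleNamespace

NAN = float("nan")  # stand-in for np.nan so the sketch needs no third-party imports


def first_text(matches, default=NAN):
    """Return the .text of the first match, or `default` when nothing matched.

    `matches` is meant to be the list returned by a call such as
    cont.findAll("span", {"class": "runtime"}). Hypothetical helper.
    """
    if len(matches) > 0:
        return matches[0].text
    return default


# Fake "tags" standing in for BeautifulSoup results, for illustration only
runtime_matches = [SimpleNamespace(text="114 min")]
no_matches = []

print(first_text(runtime_matches))  # 114 min
print(first_text(no_matches))       # nan
```

With such a helper, each field in the loop could shrink to a single line, at the cost of hiding the explicit if/else that the notebook uses for teaching purposes.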


In [310]:
titles = []
years = []
durations = []
imbd_ratings = []
met_ratings = []
nvotes = []
directors = []
stars = []

for cont in containers:
    
    # title
    tit = cont.img["alt"]
    
    # year
    yr = cont.findAll("span", {"class" : "lister-item-year text-muted unbold"})[0].text
    
    # duration
    if len(cont.findAll("span", {"class" : "runtime"})) > 0:
        dur = cont.findAll("span", {"class" : "runtime"})[0].text
    else:
        dur = np.nan
        
    # imbd rating: careful, not all movies have it
    if len(cont.findAll("div", {"class" : "inline-block ratings-imdb-rating"})) > 0:
        imbd_rat = cont.findAll("div", {"class" : "inline-block ratings-imdb-rating"})[0].text
    else:
        imbd_rat = np.nan
    
    # metascore
    if len(cont.findAll("div", {"class" : "inline-block ratings-metascore"})) > 0:
        met_rat = cont.findAll("div", {"class" : "inline-block ratings-metascore"})[0].text
    else:
        met_rat = np.nan
        
    # director: check that the <p> exists before indexing into it
    if len(cont.findAll("p", {"class" : ""})) > 0 and "Director" in str(cont.findAll("p", {"class" : ""})[0]):
        direct = cont.findAll("p", {"class" : ""})[0].text.split("|")[0].replace("\n", "").split(":")[1]
    else:
        direct = np.nan
      
    # stars: check that the <p> exists before indexing into it
    if len(cont.findAll("p", {"class" : ""})) > 0 and "Stars" in str(cont.findAll("p", {"class" : ""})[0]):
        star = cont.findAll("p", {"class" : ""})[0].text.split("Stars:")[1].replace("\n", "").split(",")
    else:
        star = np.nan
        
    # n votes
    if len(cont.findAll("span", {"name" : "nv"})) > 0:
        nvot = cont.findAll("span", {"name" : "nv"})[0].text
    else:
        nvot = np.nan
    
    # append to lists
    
    titles.append(tit)
    years.append(yr)
    durations.append(dur)
    imbd_ratings.append(imbd_rat)
    met_ratings.append(met_rat)
    nvotes.append(nvot)
    directors.append(direct)
    stars.append(star)
    

In [None]:
df_first_page = pd.DataFrame({"title" : titles,
                              "year" : years,
                              "duration" : durations,
                              "imbd_rating" : imbd_ratings,
                              "met_rating": met_ratings,
                              "number_votes" : nvotes,
                              "director" : directors,
                              "stars" : stars})

## Scraping all pages: automating iteration over several pages from the base URL
🇬🇧 One of the first things to bear in mind when scraping multiple pages is how the URLs of those pages relate to each other. IMDB has MANY action movies, spread over many pages, and we want a script that goes through these pages automatically. We check the relationship manually by moving through the pages and noting what changes from one to the next. The URLs are identical except for the `start` number at the end: as there are 50 movies per page, every time we move forward one page, this number increases by 50. Easy peasy.

🇪🇸 Una de las primeras cosas que tenemos que tener en cuenta cuando hacemos web scraping sobre múltiples páginas es la relación entre las distintas páginas que queremos scrapear. En nuestro caso, dado que Imdb tiene MUCHAS pelis de acción, estas se encuentran en multitud de páginas, por lo que queremos un script que las vaya recorriendo de forma automática para ser eficiente. Para sacar esta relación, observamos lo que varía de una página a la siguiente. Como se puede ver, las páginas son casi idénticas salvo por el número del final: como hay 50 pelis por página, por cada página que avanzamos, necesitaremos incrementar este número en 50. Fácil. 


In [None]:
# base url from which we gather the data
url_1 = "https://www.imdb.com/search/title/?title_type=feature&genres=action&start=1"

url_2 = "https://www.imdb.com/search/title/?title_type=feature&genres=action&start=51"

# only change is at the end, 51 instead of 1: for each new page, add 50 to the number at the end

for i in range(1,10000,50): # we want to take the first 10 000 movies for starters
        url_test = f"https://www.imdb.com/search/title/?title_type=feature&genres=action&start={i}"
        print(url_test)
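
As a cross-check of the arithmetic, the same series of URLs can be built from a dictionary of query parameters with the standard library's `urllib.parse.urlencode`. This is only a sketch: the names `BASE_URL`, `MOVIES_PER_PAGE` and `page_urls` are made up for illustration and do not appear elsewhere in the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"  # hypothetical constant name
MOVIES_PER_PAGE = 50


def page_urls(n_movies, per_page=MOVIES_PER_PAGE):
    """Yield one search URL per results page, covering the first n_movies titles."""
    for start in range(1, n_movies + 1, per_page):
        params = {"title_type": "feature", "genres": "action", "start": start}
        yield f"{BASE_URL}?{urlencode(params)}"


urls = list(page_urls(10_000))
print(len(urls))   # 200 pages of 50 movies each
print(urls[0])     # ends in start=1
print(urls[-1])    # ends in start=9951
```

Counting the pages up front (200 here) makes it easy to estimate how long the full scrape will take once a pause is added between requests.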

<span>&#x1f1ec;&#x1f1e7;</span> After the 10 000th movie, the URL pattern changes: the numeric `start` offset is replaced by an opaque `after` token, so the next URL can no longer be built by adding 50. Hence, this approach can only fetch the information of the first 10 000 movies.

[Url movies 10 001 - 10 050](https://www.imdb.com/search/title/?title_type=feature&genres=action&after=Wzk4NjEyLCJ0dDAyNTUwOTciLDEwMDAwXQ%3D%3D&ref_=adv_nxt)

[Url movies 10 051 - 10 100](https://www.imdb.com/search/title/?title_type=feature&genres=action&after=Wzk5MjM3LCJ0dDAwMzkxOTMiLDEwMDUwXQ%3D%3D&ref_=adv_nxt)

---

<span>&#x1f1ea;&#x1f1f8;</span> Tras la película número 10 000, el patrón de variación de las páginas diverge y se vuelve imposible de reconocer. Por tanto, nuestro scraping solo puede capturar información para las primeras 10 000 películas.

[Url movies 10 001 - 10 050](https://www.imdb.com/search/title/?title_type=feature&genres=action&after=Wzk4NjEyLCJ0dDAyNTUwOTciLDEwMDAwXQ%3D%3D&ref_=adv_nxt)

[Url movies 10 051 - 10 100](https://www.imdb.com/search/title/?title_type=feature&genres=action&after=Wzk5MjM3LCJ0dDAwMzkxOTMiLDEwMDUwXQ%3D%3D&ref_=adv_nxt)
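
For the curious, the `after` value in the links above is base64-encoded JSON. The sketch below decodes it with the standard library only. `decode_after_token` is a hypothetical helper name, and the reading of the fields is a guess: the last number matches the offset (10000), the middle value looks like an IMDB title id, and the meaning of the first number is not clear from the URL alone.

```python
import base64
import json
from urllib.parse import parse_qs, urlparse

next_url = (
    "https://www.imdb.com/search/title/?title_type=feature&genres=action"
    "&after=Wzk4NjEyLCJ0dDAyNTUwOTciLDEwMDAwXQ%3D%3D&ref_=adv_nxt"
)


def decode_after_token(url):
    """Extract the `after` query parameter and decode it as base64-encoded JSON."""
    query = parse_qs(urlparse(url).query)  # parse_qs also undoes the %3D escapes
    token = query["after"][0]
    return json.loads(base64.b64decode(token))


print(decode_after_token(next_url))  # [98612, 'tt0255097', 10000]
```

Because the token depends on the last movie of the previous page, it cannot be computed from the page number alone, which is why the simple "add 50" loop stops working at this point.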


## Scraping all pages: putting all pieces together

We know how to scrape an individual page, and we know how to change the page URL automatically. Putting these two pieces together is all we need for our final scrape: an outer loop changes the URL to select a new page on each iteration and, within it, an inner loop collects the information for all the movies on that page.

---

Ya sabemos scrappear una única página, también sabemos cambiar el nombre de la página de forma automática... Juntar estas dos cosas es todo lo que necesitamos para obtener nuestro scraping final. En un primer loop, vamos cambiando la url para seleccionar una página nueva en cada iteración y dentro de este loop, mediante otro loop, obtenemos la información de las películas.
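
The two-loop structure can be sketched independently of IMDB by passing the download and parsing steps in as functions. Everything here is hypothetical scaffolding (`scrape`, `fetch_page`, `parse_page`, and the fake functions in the dry run); it only illustrates the control flow and the pause between requests, not the notebook's actual parsing.

```python
from random import randint
from time import sleep


def scrape(offsets, fetch_page, parse_page, pause=(5, 10)):
    """Outer loop over pages, inner loop over the movies found on each page.

    fetch_page(offset) -> raw HTML for the page starting at `offset`
    parse_page(html)   -> list of dicts, one per movie
    Both callables are hypothetical placeholders supplied by the caller.
    """
    rows = []
    for page_number, offset in enumerate(offsets, start=1):
        html = fetch_page(offset)
        for movie in parse_page(html):  # inner loop: movies on this page
            rows.append(movie)
        print(f"Page {page_number} read ({len(rows)} movies so far)")
        sleep(randint(*pause))          # random wait between requests
    return rows


# Dry run with fake functions, so nothing is downloaded
def fake_fetch(offset):
    return f"<html>page starting at {offset}</html>"


def fake_parse(html):
    return [{"title": html, "year": "(2021)"}]


rows = scrape(range(1, 151, 50), fake_fetch, fake_parse, pause=(0, 0))
print(len(rows))  # 3
```

Keeping the fetch and parse steps separate makes it possible to test the loop without hitting the website, and to swap in a different parser if the page layout changes.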



In [318]:
# empty lists where we store the information for each item
titles = []
years = []
durations = []
imbd_ratings = []
met_ratings = []
nvotes = []
directors = []
stars = []

contador = 0


for i in range(1,10000,50):
    
    contador = contador + 1
    
    url = f"https://www.imdb.com/search/title/?title_type=feature&genres=action&start={i}"
    print(f"Reading page{contador}. Url is:" , url)

    uClient = uReq(url) # grab page
    page_html = uClient.read() # read page
    uClient.close() # close the client

    page_soup = soup(page_html, "html.parser") #Parse html with beautiful soup

    containers = page_soup.findAll("div", {"class" : "lister-item mode-advanced"})
    
    print("containers longitud:", len(containers))


    for cont in containers:

        # title
        tit = cont.img["alt"]

        # year
        yr = cont.findAll("span", {"class" : "lister-item-year text-muted unbold"})[0].text

        # duration
        if len(cont.findAll("span", {"class" : "runtime"})) > 0:
            dur = cont.findAll("span", {"class" : "runtime"})[0].text
        else:
            dur = np.nan

        # imbd rating: careful, not all movies have it
        if len(cont.findAll("div", {"class" : "inline-block ratings-imdb-rating"})) > 0:
            imbd_rat = cont.findAll("div", {"class" : "inline-block ratings-imdb-rating"})[0].text
        else:
            imbd_rat = np.nan

        # metascore
        if len(cont.findAll("div", {"class" : "inline-block ratings-metascore"})) > 0:
            met_rat = cont.findAll("div", {"class" : "inline-block ratings-metascore"})[0].text
        else:
            met_rat = np.nan

        # director: check that the <p> exists before indexing into it
        if len(cont.findAll("p", {"class" : ""})) > 0 and "Director" in str(cont.findAll("p", {"class" : ""})[0]):
            direct = cont.findAll("p", {"class" : ""})[0].text.split("|")[0].replace("\n", "").split(":")[1]
        else:
            direct = np.nan

        # stars: check that the <p> exists before indexing into it
        if len(cont.findAll("p", {"class" : ""})) > 0 and "Stars" in str(cont.findAll("p", {"class" : ""})[0]):
            star = cont.findAll("p", {"class" : ""})[0].text.split("Stars:")[1].replace("\n", "").split(",")
        else:
            star = np.nan

        # n votes
        if len(cont.findAll("span", {"name" : "nv"})) > 0:
            nvot = cont.findAll("span", {"name" : "nv"})[0].text
        else:
            nvot = np.nan

        # append to lists

        titles.append(tit)
        years.append(yr)
        durations.append(dur)
        imbd_ratings.append(imbd_rat)
        met_ratings.append(met_rat)
        nvotes.append(nvot)
        directors.append(direct)
        stars.append(star)

    print(f"Page {contador} read!")  
    sleep(randint(5,10))


Reading page1. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=1
containers longitud: 50
Page 1 read!
Reading page2. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=51
containers longitud: 50
Page 2 read!
Reading page3. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=101
containers longitud: 50
Page 3 read!
Reading page4. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=151
containers longitud: 50
Page 4 read!
Reading page5. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=201
containers longitud: 50
Page 5 read!
Reading page6. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=251
containers longitud: 50
Page 6 read!
Reading page7. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=301
containers longitud: 50
Page 7 read!
Reading page8. Url is: https://www.im

containers longitud: 50
Page 59 read!
Reading page60. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=2951
containers longitud: 50
Page 60 read!
Reading page61. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=3001
containers longitud: 50
Page 61 read!
Reading page62. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=3051
containers longitud: 50
Page 62 read!
Reading page63. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=3101
containers longitud: 50
Page 63 read!
Reading page64. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=3151
containers longitud: 50
Page 64 read!
Reading page65. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=3201
containers longitud: 50
Page 65 read!
Reading page66. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=3251
containers lo

containers longitud: 50
Page 117 read!
Reading page118. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=5851
containers longitud: 50
Page 118 read!
Reading page119. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=5901
containers longitud: 50
Page 119 read!
Reading page120. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=5951
containers longitud: 50
Page 120 read!
Reading page121. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=6001
containers longitud: 50
Page 121 read!
Reading page122. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=6051
containers longitud: 50
Page 122 read!
Reading page123. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=6101
containers longitud: 50
Page 123 read!
Reading page124. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=6151

containers longitud: 50
Page 175 read!
Reading page176. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=8751
containers longitud: 50
Page 176 read!
Reading page177. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=8801
containers longitud: 50
Page 177 read!
Reading page178. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=8851
containers longitud: 50
Page 178 read!
Reading page179. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=8901
containers longitud: 50
Page 179 read!
Reading page180. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=8951
containers longitud: 50
Page 180 read!
Reading page181. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=9001
containers longitud: 50
Page 181 read!
Reading page182. Url is: https://www.imdb.com/search/title/?title_type=feature&genres=action&start=9051

## Result...

<span>&#x1f1ec;&#x1f1e7;</span> We turn the lists, which hold the information for our movies in the same order, into the columns of a dataframe...

---

<span>&#x1f1ea;&#x1f1f8;</span> Convertimos las listas que contienen la información de nuestras películas en orden en las columnas de un dataframe...

In [420]:
df_orig = pd.DataFrame({"title" : titles,
                        "year" : years,
                        "duration" : durations,
                        "imbd_rating" : imbd_ratings,
                        "met_rating": met_ratings,
                        "number_votes" : nvotes,
                        "director" : directors,
                        "stars" : stars})

In [484]:
df_orig

| | title | year | duration | imbd_rating | met_rating | number_votes | director | stars |
|---|---|---|---|---|---|---|---|---|
| 0 | A descubierto | (2021) | 114 min | `\n\n5.3\n` | `\n45 \n Metascore\n` | 21416 | Mikael Håfström | [Anthony Mackie, Damson Idris, Enzo Cilenti,... |
| 1 | Noticias del gran mundo | (2020) | 118 min | `\n\n6.9\n` | `\n73 \n Metascore\n` | 10583 | Paul Greengrass | [Tom Hanks, Helena Zengel, Tom Astor, Travi... |
| 2 | Wonder Woman 1984 | (2020) | 151 min | `\n\n5.4\n` | `\n60 \n Metascore\n` | 144978 | Patty Jenkins | [Gal Gadot, Chris Pine, Kristen Wiig, Pedro... |
| 3 | Mortal Kombat | (2021) | | | | | Simon McQuoid | [Jessica McNamee, Hiroyuki Sanada, Josh Laws... |
| 4 | Tenet | (2020) | 150 min | `\n\n7.5\n` | `\n69 \n Metascore\n` | 279317 | Christopher Nolan | [John David Washington, Robert Pattinson, El... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Tarzan the Tiger | (1929) | 266 min | `\n\n6.1\n` | | 95 | Henry MacRae | [Frank Merrill, Natalie Kingston, Al Ferguso... |
| 9996 | Uppi 2 | (2015) | 135 min | `\n\n8.0\n` | | 7889 | Upendra | [Upendra, Kristina Akheeva, Parul Yadav, Sa... |
| 9997 | Chirutha | (2007) | 117 min | `\n\n5.3\n` | | 1820 | Puri Jagannadh | [Ram Charan, Neha Sharma, Prakash Raj, Ashi... |
| 9998 | Surge of Dawn | (2019) | | `\n\n7.7\n` | | 25 | Alexander Fernandez | [Victoria Amber, Vincent J. Roth, Michelle B... |


<span>&#x1f1ec;&#x1f1e7;</span>  And it worked, we have info for 10 000 movies in just a few lines of code! By the looks of it, this data needs a TON of cleaning...

---

<span>&#x1f1ea;&#x1f1f8;</span> Y funcionó: tenemos datos para 10 000 películas en solo unas líneas de código. También parece que estos datos necesitan un montón de limpieza.

## Data Cleaning

In [421]:
# work on a copy of the original df
df = df_orig.copy()

In [429]:
# there appears to be a problem with year
print(df["year"].unique()[3])
print(df["year"].unique()[list(df["year"].unique()).index('(IV)')])

(II) (2020)
(IV)


In [None]:
# convert years to float only where it makes sense
def to_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):  # non-numeric values such as "IV" become NaN
        return np.nan

In [453]:
# year cleaning:

# to deal with years formatted as "(II) (2020)", we split every value on spaces, resulting in
print(np.unique(df["year"].str.split(" ").values)[:15])

# then, since each value is now a list such as ['(II)', '(2020)'], we take its last element, strip the
# parentheses and convert it to a float, or to NaN when it is not convertible. This way, we also handle
# observations whose year is not a number, such as '(IV)'
df["year"].str.split(" ").apply(lambda x: x[-1][1:-1]).apply(to_float)[:10]

[list(['']) list(['(1914)']) list(['(1915)']) list(['(1916)'])
 list(['(1918)']) list(['(1920)']) list(['(1921)']) list(['(1923)'])
 list(['(1924)']) list(['(1925)']) list(['(1926)']) list(['(1927)'])
 list(['(1928)']) list(['(1929)']) list(['(1930)'])]


0    2021.0
1    2020.0
2    2020.0
3    2021.0
4    2020.0
5    2020.0
6    2021.0
7    1984.0
8    2021.0
9    2020.0
Name: year, dtype: float64
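
An alternative to splitting on spaces is to search each string for four digits inside parentheses with a regular expression. This is a sketch of a different technique from the one used in the notebook. `parse_year` and `YEAR_PATTERN` are hypothetical names, and `float("nan")` stands in for `np.nan` so the snippet runs without third-party imports.

```python
import re

NAN = float("nan")  # stand-in for np.nan
YEAR_PATTERN = re.compile(r"\((\d{4})\)")  # four digits wrapped in parentheses


def parse_year(raw):
    """Return the year in strings like '(2021)' or '(II) (2020)', else NaN."""
    if not isinstance(raw, str):
        return NAN
    match = YEAR_PATTERN.search(raw)
    return float(match.group(1)) if match else NAN


for raw in ["(2021)", "(II) (2020)", "(IV)", ""]:
    print(repr(raw), "->", parse_year(raw))
```

The regex version does not depend on the year being the last token, so it may be a little more tolerant of unexpected formats, though that has not been checked against the full dataset.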

In [454]:
# year: keep the last token, strip the parentheses and convert to float
df["year"] = df["year"].str.split(" ").apply(lambda x: x[-1][1:-1]).apply(to_float)

In [455]:
# duration: keep the number of minutes (the first token) and convert to float
df['duration'] = df['duration'].str.split(" ").apply(lambda x: x[0] if type(x) == list else x).astype(float)

In [456]:
# imdb rating: strip the surrounding whitespace and convert to float
df["imbd_rating"] = df["imbd_rating"].str.strip().astype(float)

In [457]:
# metascore rating: keep the score (the first token) and convert to float
df["met_rating"] = df["met_rating"].str.split().apply(lambda x: x[0] if type(x) == list else x).astype(float)
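
The duration, rating and Metascore cleaning steps all follow the same idea: take the first number in a messy string. The sketch below shows that idea as one plain-Python function, applied to the raw strings seen earlier in the notebook. `first_number` is a hypothetical helper, and the comma removal is included so that it would also handle vote counts such as '20,937'.

```python
NAN = float("nan")  # stand-in for np.nan


def first_number(raw):
    """Return the first whitespace-separated token of `raw` as a float, else NaN."""
    if not isinstance(raw, str):
        return NAN  # missing values arrive as NaN, not as strings
    tokens = raw.replace(",", "").split()
    try:
        return float(tokens[0])
    except (IndexError, ValueError):
        return NAN


print(first_number("114 min"))        # 114.0
print(first_number("\n\n5.3\n"))      # 5.3
print(first_number("\n45        \n        Metascore\n            "))  # 45.0
print(first_number("20,937"))         # 20937.0
```

A function like this could be passed to `Series.apply`, although the vectorised `.str` methods used in the notebook are usually faster on large columns.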

In [387]:
contador_nolist = 0
contador_lenno4 = 0

for i in df['stars']:
    if type(i) != list:
        contador_nolist = contador_nolist + 1
    else:
        if len(i) != 4:
            contador_lenno4 = contador_lenno4 + 1
            
print("elements not a list:", contador_nolist)
print("elements that are a list of fewer than 4 elements:", contador_lenno4)



283
125


In [465]:
# all elements in stars which aren't a list are nans
not_list = []
for i in df['stars']:
    if type(i) != list:
        not_list.append(i)
print(set(not_list))
        

{nan}


In [470]:
longitud_stars = []
for i in df['stars']:
    if type(i) != list:
        continue    
    else:
        longitud_stars.append(len(i))

In [472]:
min(longitud_stars), max(longitud_stars)
# all elements which aren't nans in Stars, which are lists, have minimum length of 2 and maximum of 4

(2, 4)

In [473]:
star1 = []
star2 = []
star3 = []
star4 = []

for i in df['stars']:
    if (type(i) == list) and len(i) == 4:
        star1.append(i[0])
        star2.append(i[1])
        star3.append(i[2])
        star4.append(i[3])
        
    elif (type(i) == list) and len(i) == 3:
        star1.append(i[0])
        star2.append(i[1])
        star3.append(i[2])
        star4.append(np.nan)
        
    elif (type(i) == list) and len(i) == 2:
        star1.append(i[0])
        star2.append(i[1])
        star3.append(np.nan)
        star4.append(np.nan)
        
    elif (type(i) == list) and len(i) == 1:
        star1.append(i[0])
        star2.append(np.nan)
        star3.append(np.nan)
        star4.append(np.nan)
        
    else:
        star1.append(np.nan)
        star2.append(np.nan)
        star3.append(np.nan)
        star4.append(np.nan)
        

In [474]:
df["star_1"], df["star_2"], df["star_3"], df["star_4"] = star1, star2, star3, star4
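
The four branches above all do the same thing: pad each list of stars to four entries. A more compact sketch of that padding is shown below. `pad_stars` is a hypothetical helper, `float("nan")` stands in for `np.nan`, and stripping the leading spaces from the names is an extra step that the notebook's loop does not perform.

```python
NAN = float("nan")  # stand-in for np.nan


def pad_stars(stars, width=4):
    """Return exactly `width` values: the names in `stars`, padded with NaN."""
    names = list(stars) if isinstance(stars, list) else []  # NaN rows become []
    names = [name.strip() for name in names[:width]]
    return names + [NAN] * (width - len(names))


print(pad_stars(["Milla Jovovich", " Tony Jaa", " T.I.", " Meagan Good"]))
print(pad_stars(["Upendra", " Kristina Akheeva"]))
print(pad_stars(NAN))
```

Each call returns a list of the same length, so the results for all rows could be stacked into four columns in one step instead of maintaining four separate lists.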

In [480]:
# number of votes: commas separate the thousands, so remove them and cast to float
df['number_votes'] = df['number_votes'].str.replace(",", "").astype(float)

In [493]:
anios = pd.to_datetime(df['year'], format='%Y')

In [513]:
df['year'] = anios.dt.year.astype('Int64')

In [None]:
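# leftover reference snippet, kept commented out: df has no "Date" column, so it would fail if run.
# It shows the same datetime-to-Int64 pattern that was applied to 'year' above.
# df.Date = pd.to_datetime(df.Date)
# df['yyyy'] = df.Date.dt.year.astype('Int64')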

## Data Analysis

At last, the data is clean. Time to pull some insights from it.

In [517]:
df

| | title | year | duration | imbd_rating | met_rating | number_votes | director | stars | star_1 | star_2 | star_3 | star_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A descubierto | 2021 | 114.0 | 5.3 | 45.0 | 21416.0 | Mikael Håfström | [Anthony Mackie, Damson Idris, Enzo Cilenti,... | Anthony Mackie | Damson Idris | Enzo Cilenti | Emily Beecham |
| 1 | Noticias del gran mundo | 2020 | 118.0 | 6.9 | 73.0 | 10583.0 | Paul Greengrass | [Tom Hanks, Helena Zengel, Tom Astor, Travi... | Tom Hanks | Helena Zengel | Tom Astor | Travis Johnson |
| 2 | Wonder Woman 1984 | 2020 | 151.0 | 5.4 | 60.0 | 144978.0 | Patty Jenkins | [Gal Gadot, Chris Pine, Kristen Wiig, Pedro... | Gal Gadot | Chris Pine | Kristen Wiig | Pedro Pascal |
| 3 | Mortal Kombat | 2021 | | | | | Simon McQuoid | [Jessica McNamee, Hiroyuki Sanada, Josh Laws... | Jessica McNamee | Hiroyuki Sanada | Josh Lawson | Joe Taslim |
| 4 | Tenet | 2020 | 150.0 | 7.5 | 69.0 | 279317.0 | Christopher Nolan | [John David Washington, Robert Pattinson, El... | John David Washington | Robert Pattinson | Elizabeth Debicki | Juhan Ulfsak |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Tarzan the Tiger | 1929 | 266.0 | 6.1 | | 95.0 | Henry MacRae | [Frank Merrill, Natalie Kingston, Al Ferguso... | Frank Merrill | Natalie Kingston | Al Ferguson | Lillian Worth |
| 9996 | Uppi 2 | 2015 | 135.0 | 8.0 | | 7889.0 | Upendra | [Upendra, Kristina Akheeva, Parul Yadav, Sa... | Upendra | Kristina Akheeva | Parul Yadav | Sayaji Shinde |
| 9997 | Chirutha | 2007 | 117.0 | 5.3 | | 1820.0 | Puri Jagannadh | [Ram Charan, Neha Sharma, Prakash Raj, Ashi... | Ram Charan | Neha Sharma | Prakash Raj | Ashish Vidyarthi |
| 9998 | Surge of Dawn | 2019 | | 7.7 | | 25.0 | Alexander Fernandez | [Victoria Amber, Vincent J. Roth, Michelle B... | Victoria Amber | Vincent J. Roth | Michelle Bernard | Joseph Culp |
