## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult part comes first: scraping the page on the BBC site and, using Beautiful Soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward. The goal after that will be to search for interesting patterns (top movies by country, critic, director, or year). This is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge that you'll be dealing with through Wednesday of this week. The HTML page on the BBC site poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes things a little bit harder, because there aren't many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for a critic, and you can isolate each group of top 10 movies as well. The harder part is using Beautiful Soup to find each critic together with the list of movies that immediately follows her/him. (Doing that is challenging. I have instructions on how to figure it out, but if you can't figure it out, just email me and I will send you the code.)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, email me and I will send you working code so you can move on to the next step.


### Getting started: Data Architecture
You can come up with your own data scheme for this, but the one I'm recommending is three separate lists.

The central challenge of this project is figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 
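As one possible answer to the questions above (a sketch, not the required design), each row could be a single vote: one critic's ranked pick of one film. The column names below are assumptions chosen for illustration, and the sample rows are copied from the first two ballots on the page.

```python
from collections import Counter

# One row per vote: a critic's ranked pick of one film.
# The column names are assumptions chosen for illustration.
rows = [
    {"critic_name": "Simon Abrams", "critic_org": "Freelance film critic", "critic_cn": "US",
     "rank": 1, "movie_name": "Mulholland Drive", "director": "David Lynch", "year": 2001},
    {"critic_name": "Simon Abrams", "critic_org": "Freelance film critic", "critic_cn": "US",
     "rank": 2, "movie_name": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
    {"critic_name": "Sam Adams", "critic_org": "Freelance film critic", "critic_cn": "US",
     "rank": 1, "movie_name": "In the Mood for Love", "director": "Wong Kar-wai", "year": 2000},
]

# Count how many ballots each film appears on.
votes_per_movie = Counter(row["movie_name"] for row in rows)
print(votes_per_movie.most_common(1))
```

With this layout, every category of analysis (movie, director, critic, country, year) is just a column you can group or filter on.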

For this project, if you're interested in recalling your knowledge of SQL, you can do the additional step of entering your transformed data into postgres. Or you can just stick with pandas.
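If you want to warm up your SQL before setting up Postgres, here is a minimal sketch that swaps in Python's built-in `sqlite3` module with an in-memory database, so no server is needed. This is a stand-in, not Postgres itself; the table name `votes` and its column names are hypothetical, and a real Postgres setup would need its own driver and connection details.

```python
import sqlite3

# In-memory SQLite database used as a lightweight stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE votes (
           critic_name TEXT, critic_cn TEXT, movie_rank INTEGER,
           movie_name TEXT, director TEXT, year INTEGER)"""
)

# Sample rows copied from the first two ballots on the page.
sample = [
    ("Simon Abrams", "US", 1, "Mulholland Drive", "David Lynch", 2001),
    ("Simon Abrams", "US", 2, "In the Mood for Love", "Wong Kar-wai", 2000),
    ("Sam Adams", "US", 1, "In the Mood for Love", "Wong Kar-wai", 2000),
]
conn.executemany("INSERT INTO votes VALUES (?, ?, ?, ?, ?, ?)", sample)

# Which film appears on the most ballots?
top = conn.execute(
    """SELECT movie_name, COUNT(*) AS n
       FROM votes
       GROUP BY movie_name
       ORDER BY n DESC, movie_name
       LIMIT 1"""
).fetchall()
print(top)
conn.close()
```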

### Interpretive Architecture
**REMEMBER: secondary source** Part of this week's work is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own. It will be hard to find a single page that lists every single director, but see what you can find. In the end, you don't have to have a complete database of every single director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point of view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point of view to this subject? You can certainly narrow your focus to a specific country, a group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.
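Joining a secondary source to the main data set can be sketched with a plain dictionary lookup. The `director_country` dictionary below is a hypothetical, hand-typed placeholder for whatever source you find, and the `director_cn` field name is an assumption. One director is left out on purpose to show how to spot gaps.

```python
# Hypothetical hand-built lookup; in practice this comes from your secondary source.
director_country = {"David Lynch": "US", "Hayao Miyazaki": "Japan"}

votes = [
    {"movie_name": "Mulholland Drive", "director": "David Lynch"},
    {"movie_name": "Spirited Away", "director": "Hayao Miyazaki"},
    {"movie_name": "Fados", "director": "Carlos Saura"},
]

for vote in votes:
    # .get() returns None when a director is missing from the lookup.
    vote["director_cn"] = director_country.get(vote["director"])

# Directors you still need to research.
missing = sorted({v["director"] for v in votes if v["director_cn"] is None})
print(missing)
```

Keeping a list of the missing directors makes it easy to see how complete your secondary data is as you fill it in.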

### Ready to code?

The first thing you need to do is import Beautiful Soup and requests like we did in the homework (the cells below use urllib's urlopen instead, which works just as well), and scrape the page: http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note: there are two inconsistencies (actual errors in the HTML) that will cause you to lose a couple of entries (which is okay but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page:

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!

**STEP 1**


In [1]:
# Import urlopen and download the raw HTML (Beautiful Soup and re are imported in later cells)
from urllib.request import urlopen
raw_html = urlopen("http://floatingmedia.com/columbia/BBC.html").read()

In [2]:
# urlopen(...).read() returns raw bytes; check the type before handing it to Beautiful Soup
print(type(raw_html))

<class 'bytes'>


In [3]:
print(raw_html)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n                <meta charset="UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n    <title>BBC - Culture - The 21st Century\xe2\x80\x99s 100 greatest films: Who voted?</title>\n        <meta name="keywords" content="story, STORY, story, image, the-100-greatest-films-of-the-21st-century, ">\n        <meta name="description" content="We polled 177 critics from around the world \xe2\x80\x93 here is how they voted.">\n        <meta property="og:title" content="The 21st Century\xe2\x80\x99s 100 greatest films: Who voted?" />\n    <meta property="og:type" content="article" />\n    <meta property="og:url" content="http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted" />\n    <meta property="og:description" content="We polled 177 critics from around the world \xe2\x80\x93 here is how they voted." />\n\n    <meta name="twitter:card" content="summary_large_image">\n    <meta name="twitter

In [4]:
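# requests is not needed here (urlopen already fetched the page), and aliasing it
# as "re" would shadow the regular-expression module imported later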
from bs4 import BeautifulSoup
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(type(soup_doc))

<class 'bs4.BeautifulSoup'>


In [5]:
print(soup_doc.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <title>
   BBC - Culture - The 21st Century’s 100 greatest films: Who voted?
  </title>
  <meta content="story, STORY, story, image, the-100-greatest-films-of-the-21st-century, " name="keywords"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="description"/>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" property="og:title">
   <meta content="article" property="og:type">
    <meta content="http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted" property="og:url">
     <meta content="We polled 177 critics from around the world – here is how they voted." property="og:description">
      <meta content="summary_large_image" name="twitter:card"/>
      <meta content="@BBC_Culture" name="twitter:site"/>
      <meta content="The 21st Century’s 100 greatest fil

In [6]:
soup_doc.title.string

'BBC - Culture - The 21st Century’s 100 greatest films: Who voted?'

In [7]:
# Just exploring: the HTML of the first div
first_div = soup_doc.find('div')
first_div

<div class="bbccom_display_none" id="bbccom_interstitial_ad"></div>

In [8]:
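# DO NOT FORGET: find_all() gives a list
# Find the element with class "body-content", which holds the entire list of critics and movies,
# then collect every <p> inside it (calling a tag like tag('p') is shorthand for tag.find_all('p'))
all_info = soup_doc.find(class_='body-content')('p')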
all_info

[<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p>,
 <p><strong>Simon Abrams – Freelance film critic (US)</strong></p>,
 <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7.

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use Beautiful Soup's find_all to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it all_p or something).


In [9]:
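# Skip the intro paragraph (index 0) and keep the next 354 paragraphs:
# 177 critics x 2 (one critic line plus one movie list each)
all_p = soup_doc.find(class_='body-content')('p')[1:355]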
all_p

[<p><strong>Simon Abrams – Freelance film critic (US)</strong></p>,
 <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7. Night Across the Street (Raoul Ruiz, 2012)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. Sparrow (Johnnie To, 2008)<br/>10. Fados (Carlos Saura, 2007)</p>,
 <p><strong>Sam Adams – Freelance film critic (US)</strong></p>,
 <p>1. In the Mood for Love (Wong Kar-wai, 2000)<br/>2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)<br/>3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)<br/>4. Spirited Away (Hayao Miyazaki, 2001)<br/>5. The Act of Killing (Joshua Oppenheimer, 2012)<br/>6. The Grand Budapest Hotel (Wes Anderson, 2014)<br/>7. The New World (Terrence Malick, 2004)<br/>8. Certified Copy (Abba

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (the list you just got from find_all()) and pull out each critic and their list of movies.

Critics should not be too hard: every critic entry is wrapped in `<strong>` tags. But in order to get the movies attached to that critic, you need to find the `<p>` tag immediately following each `<p><strong>`. You can do this using next_sibling.

So, you need to build a loop that searches through your `all_p` list:

if `p_line` has a `<strong>` tag, then
`critic_info = p_line.strong.string`
`movie_info = p_line.next_sibling`

As you go through this loop, print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by the movie line's HTML, you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page, though it might not be helpful to look at it at this point. See how you do step by step, and if you get stuck at a step, email me with your code!
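One thing to watch for with next_sibling: it returns whatever node comes next, including plain text such as a newline between tags. On this page the paragraphs appear to sit back to back, so next_sibling lands on the movie `<p>`, but on other pages you may get whitespace instead. The sketch below uses a tiny made-up HTML snippet (not the real page) to show the difference between `next_sibling` and `find_next_sibling('p')`.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page; the newline between the two <p> tags is deliberate.
sample_html = (
    "<div><p><strong>Simon Abrams – Freelance film critic (US)</strong></p>\n"
    "<p>1. Mulholland Drive (David Lynch, 2001)<br/>"
    "2. In the Mood for Love (Wong Kar-wai, 2000)</p></div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
critic_p = sample_soup.find("p")

# next_sibling returns the very next node, which here is the newline text.
print(repr(critic_p.next_sibling))

# find_next_sibling("p") skips text nodes and returns the next <p> tag.
movie_p = critic_p.find_next_sibling("p")
print(movie_p.get_text(" | "))
```

If your loop ever prints a blank line where the movies should be, this is a likely cause.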



In [10]:
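all_p = soup_doc.find(class_='body-content')('p')[1:355]
for lines in all_p:
    # Critic paragraphs are the ones that contain a <strong> tag
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        # The movie list is the node right after the critic's <p>
        movie_info = lines.next_sibling
        print(critic_info, movie_info)
        print('--------')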

Simon Abrams – Freelance film critic (US) <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7. Night Across the Street (Raoul Ruiz, 2012)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. Sparrow (Johnnie To, 2008)<br/>10. Fados (Carlos Saura, 2007)</p>
--------
Sam Adams – Freelance film critic (US) <p>1. In the Mood for Love (Wong Kar-wai, 2000)<br/>2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)<br/>3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)<br/>4. Spirited Away (Hayao Miyazaki, 2001)<br/>5. The Act of Killing (Joshua Oppenheimer, 2012)<br/>6. The Grand Budapest Hotel (Wes Anderson, 2014)<br/>7. The New World (Terrence Malick, 2004)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. The World (Jia Zhan

**STEP 4**
If your loop is successfully isolating those two lines, it's time to parse each line with regular expressions. This needs to happen inside the loop: for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop, once you have critic_info, write a regular expression that pulls out the name of the critic, and store the result in a variable called critic_name:

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn

As you go, print(critic_name), then print(critic_org), etc., to make sure you're getting the results. It might help, before you run all these regular expressions in a loop, to grab just one critic's line and test your regular expressions on it to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.
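Here is a minimal sketch of that kind of practice on a single line. It uses one pattern with three groups, which is an alternative to the three separate findall calls in the notebook, not the notebook's own approach. The sample string is copied from the page; note that the separator is an en dash (–), not a hyphen.

```python
import re

# Sample line copied from the page; the separator is an en dash, not a hyphen.
critic_info = "Simon Abrams – Freelance film critic (US)"

# Group 1: name, everything before the first " – "
# Group 2: organization, everything up to the last " ("
# Group 3: country, the text inside the final parentheses
critic_pattern = re.compile(r"^(.*?) – (.*) \((.*)\)$")

match = critic_pattern.match(critic_info)
if match:
    critic_name, critic_org, critic_cn = match.groups()
    print(critic_name, critic_org, critic_cn, sep=" | ")
else:
    print("No match, inspect this line:", critic_info)
```

Printing the lines that fail to match is a quick way to find entries on the page that do not follow the usual format.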

In [11]:
import re

In [12]:
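all_p = soup_doc.find(class_='body-content')('p')[1:355]
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        # Name: everything before the first space that is followed by a non-word character (the dash)
        critic_name = re.findall(r'\A(\b\w.*?\b) \W', critic_info)[0]
        # Country: the text inside the parentheses at the end of the line
        critic_cn = re.findall(r'\A\b\w.*?\b \W \b\w.*\b\.? [(](.*)[)]$', critic_info)[0]
        # Organization: the text between the dash and the opening parenthesis
        critic_org = re.findall(r'\A\b\w.*?\b \W (\b\w.*\b\.?) [(].*[)]$', critic_info)[0]
        print(critic_name)
        print(critic_cn)
        print(critic_org)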

Simon Abrams
US
Freelance film critic
Sam Adams
US
Freelance film critic
Thelma Adams
US
Freelance film critic
Arturo Aguilar
Mexico
Rolling Stone Mexico
Matthew Anderson
UK
BBC Culture
Tim Appelo
US
The Wrap
Adriano Aprà
Italy
Film historian
Michael Arbeiter
US
Nerdist
Ali Arikan
Turkey
Dipnot TV
Michael Atkinson
US
The Village Voice
Ana Maria Bahiana
Brazil
Freelance film critic
Cameron Bailey
Canada
Toronto Film Festival
Lindsay Baker
UK
BBC Culture
Miriam Bale
US
Freelance film critic
Nicholas Barber
UK
BBC Culture
Diego Batlle
Argentina
La Nacion
NT Binh
France
Positif
Lizelle Bisschoff
UK
University of Glasgow
Christian Blauvelt
US
BBC Culture
Mahen Bonetti
US
African Film Festival Inc
Andreas Borcholte
Germany
Spiegel Online
Utpal Borpujari
India
Freelance film critic
Richard Brody
US
The New Yorker
Hannah Brown
Israel
Jerusalem Post
Luke Buckmaster
Australia
The Guardian/BBC Culture
Luciano Castillo
Cuba
Cinemateca de Cuba
Monica Castillo
US
New York Times Watching
Samuel Castr

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which will contain a string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop that goes through each movie and prints it out.
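To see what `find_all(string=True)` returns without scraping the whole page, here is a small sketch on a made-up two-movie paragraph (the movie lines are copied from the page, the surrounding HTML is a stand-in). It also shows `stripped_strings`, a Beautiful Soup alternative that trims surrounding whitespace from each string.

```python
from bs4 import BeautifulSoup

# Stand-in for one movie paragraph: entries separated by <br/> tags.
movie_html = (
    "<p>1. Mulholland Drive (David Lynch, 2001)<br/>"
    "2. In the Mood for Love (Wong Kar-wai, 2000)</p>"
)
movie_info = BeautifulSoup(movie_html, "html.parser").p

# string=True keeps only the text nodes, so the <br/> tags drop out.
each_movie = movie_info.find_all(string=True)
for movie in each_movie:
    print(movie)

# Alternative: stripped_strings yields the same text with whitespace trimmed.
cleaned = list(movie_info.stripped_strings)
print(cleaned)
```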


In [13]:
# Take your working loop and add the find_all for each_movie
# (this version prints the whole each_movie list; the inner loop comes in the next cell)
all_p = soup_doc.find(class_='body-content')('p')[1:355]
for lines in all_p:
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        critic_name = re.findall(r'\A(\b\w.*?\b) \W', critic_info)[0]
        critic_cn = re.findall(r'\A\b\w.*?\b \W \b\w.*\b\.? [(](.*)[)]$', critic_info)[0]
        critic_org = re.findall(r'\A\b\w.*?\b \W (\b\w.*\b\.?) [(].*[)]$', critic_info)[0]
        movie_info = lines.next_sibling
        # Every text node inside the movie <p>, i.e. one string per movie
        each_movie = movie_info.find_all(string=True)
        print(critic_name)
        print(critic_cn)
        print(critic_org)
        print(each_movie)
        print('------')

Simon Abrams
US
Freelance film critic
['1. Mulholland Drive (David Lynch, 2001)', '2. In the Mood for Love (Wong Kar-wai, 2000)', '3. The Tree of Life (Terrence Malick, 2011)', '4. Yi Yi: A One and a Two (Edward Yang, 2000)', '5. Goodbye to Language (Jean-Luc Godard, 2014)', '6. The White Meadows (Mohammad Rasoulof, 2009)', '7. Night Across the Street (Raoul Ruiz, 2012)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. Sparrow (Johnnie To, 2008)', '10. Fados (Carlos Saura, 2007)']
------
Sam Adams
US
Freelance film critic
['1. In the Mood for Love (Wong Kar-wai, 2000)', '2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)', '3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)', '4. Spirited Away (Hayao Miyazaki, 2001)', '5. The Act of Killing (Joshua Oppenheimer, 2012)', '6. The Grand Budapest Hotel (Wes Anderson, 2014)', '7. The New World (Terrence Malick, 2004)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. The World (Jia Zhangke, 2004)', '10. Elephant (Gus

------
Jordan Hoffman
US
Freelance film critic
['1. A Serious Man (Joel and Ethan Coen, 2009)', '2. Inside Llewyn Davis (Joel and Ethan Coen, 2013)', '3. Star Trek (J. J. Abrams, 2009)', '4. Waking Life (Richard Linklater, 2001)', '5. Ida (Paweł Pawlikowski, 2013)', '6. Enter the Void (Gaspar Noé, 2009)', '7. There Will Be Blood (Paul Thomas Anderson, 2007)', '8. Of Gods and Men (Xavier Beauvois, 2010)', '9. Into Great Silence (Philip Gröning, 2005)', '10. Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan (Larry Charles, 2006)']
------
Michael Hogan
US
Vanity Fair
['1. The Social Network (David Fincher, 2010)', '2. Boyhood (Richard Linklater, 2014)', '3. The Descendants (Alexander Payne, 2011)', '4. Beasts of the Southern Wild (Benh Zeitlin, 2012)', '5. No Country For Old Men (Joel and Ethan Coen, 2007)', '6. Idiocracy (Mike Judge, 2006)', '7. Bridesmaids (Paul Feig, 2011)', '8. Spring Breakers (Harmony Korine, 2012)', '9. American Psycho (Mary Harron,

Mexico
Letras Libres Magazine
['1. Mulholland Drive (David Lynch, 2001)', '2. Son of Saul (László Nemes, 2015)', '3. Ida (Paweł Pawlikowski, 2013)', '4. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)', '5. The White Ribbon (Michael Haneke, 2009)', '6. Still Walking (Hirokazu Koreeda, 2008)', '7. The Act of Killing (Joshua Oppenheimer, 2012)', '8. Dogville (Lars von Trier, 2003)', '9. Birdman (Alejandro González Iñárritu, 2014)', '10. Grizzly Man (Werner Herzog, 2005)']
------
David Stratton
Australia
The Australian
['1. Amour (Michael Haneke, 2012)', '2. Distant (Nuri Bilge Ceylan, 2002)', '3. A Separation (Asghar Farhadi, 2011)', '4. Samson & Delilah (Warwick Thornton, 2009)', '5. Leviathan (Andrey Zvyagintsev, 2014)', '6. Still Walking (Hirokazu Koreeda, 2008)', '7. Talk to Her (Pedro Almodóvar, 2002)', '8. Million Dollar Baby (Clint Eastwood, 2004)', '9. No Country For Old Men (Joel and Ethan Coen, 2007)', '10. The Man Without A Past (Aki Kaurismäki, 2002)']
------
Cédr

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.
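A minimal practice sketch on single movie lines follows. It uses one anchored pattern with four groups, which is an alternative to the separate findall calls used in the notebook cells; the helper name `parse_movie` is hypothetical. The sample lines are copied from the page output.

```python
import re

# Group 1: rank, group 2: title (up to the last " ("),
# group 3: director(s) (up to the last ", "), group 4: four-digit year.
movie_pattern = re.compile(r"^(\d+)\. (.*) \((.*), (\d{4})\)$")

def parse_movie(line):
    """Return (rank, title, director, year) as strings, or None if the line does not fit."""
    match = movie_pattern.match(line)
    return match.groups() if match else None

samples = [
    "1. Zero Dark Thirty (Kathryn Bigelow, 2012)",
    "1. Synecdoche, New York (Charlie Kaufman, 2008)",
    "8. 5 Broken Cameras (Emad Burnat and Guy Davidi, 2011)",
]
for line in samples:
    print(parse_movie(line))
```

The second and third samples are worth testing because titles can contain commas or start with a digit, both of which can trip up a pattern that is too simple.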


In [14]:
all_p = soup_doc.find(class_='body-content')('p')[1:355]
for lines in all_p:
    if lines.strong is not None:
        movie_info = lines.next_sibling
        each_movie = movie_info.find_all(string=True)
        for movie in each_movie:
            # Title: the text after the rank, up to the last " (" in the line
            movie_name = re.findall(r'\d\. (\b\w.*\b\.?\!?.*) [(]', movie)[0]
            # Director(s): the text after the "(" up to the last ", "
            movie_director = re.findall(r'\d\. \b\w.*\b\.?\!?.* [(](.*), ', movie)[0]
            # Year: the last four consecutive digits after the opening parenthesis
            movie_year = re.findall(r'[(].*(\d{4})', movie)[0]
            # Rank: the first run of digits followed by a period
            movie_rank = re.findall(r'(\d+)\.', movie)[0]
            print(movie_rank)
            print(movie_name)
            print(movie_director)
            print(movie_year)
            print('------')

1
Mulholland Drive
David Lynch
2001
------
2
In the Mood for Love
Wong Kar-wai
2000
------
3
The Tree of Life
Terrence Malick
2011
------
4
Yi Yi: A One and a Two
Edward Yang
2000
------
5
Goodbye to Language
Jean-Luc Godard
2014
------
6
The White Meadows
Mohammad Rasoulof
2009
------
7
Night Across the Street
Raoul Ruiz
2012
------
8
Certified Copy
Abbas Kiarostami
2010
------
9
Sparrow
Johnnie To
2008
------
10
Fados
Carlos Saura
2007
------
1
In the Mood for Love
Wong Kar-wai
2000
------
2
Eternal Sunshine of the Spotless Mind
Michel Gondry
2004
------
3
Syndromes and a Century
Apichatpong Weerasethakul
2006
------
4
Spirited Away
Hayao Miyazaki
2001
------
5
The Act of Killing
Joshua Oppenheimer
2012
------
6
The Grand Budapest Hotel
Wes Anderson
2014
------
7
The New World
Terrence Malick
2004
------
8
Certified Copy
Abbas Kiarostami
2010
------
9
The World
Jia Zhangke
2004
------
10
Elephant
Gus Van Sant
2003
------
1
Zero Dark Thirty
Kathryn Bigelow
2012
------
2
A History of V

4
Moulin Rouge!
Baz Luhrmann
2001
------
5
Star Wars: Episode III – Revenge of the Sith
George Lucas
2005
------
6
Manakamana
Pacho Velez and Stephanie Spray
2013
------
7
The Curious Case of Benjamin Button
David Fincher
2008
------
8
You, The Living
Roy Andersson
2007
------
9
Mother
Bong Joon-ho
2009
------
10
A Serious Man
Joel and Ethan Coen
2009
------
1
Moolaadé
Ousmane Sembène
2004
------
2
Timbuktu
Abderrahmane Sissako
2014
------
3
Cuba: An African Odyssey
Jihan El-Tahri
2007
------
4
Sexe, gombo et beurre salé
Mahamat-Saleh Haroun
2008
------
5
Shoot the Messenger
Ngozi Onwurah
2006
------
6
Red Leaves
Bazi Gete
2014
------
7
The Colonial Misunderstanding
Jean-Marie Téno
2004
------
8
Restless City
Andrew Dosunmu
2011
------
9
Conversations on a Sunday Afternoon
Khalo Matabane
2005
------
10
Belle
Amma Asante
2013
------
1
Requiem for a Dream
Darren Aronofsky
2000
------
2
Antichrist
Lars von Trier
2009
------
3
Children of Men
Alfonso Cuarón
2006
------
4
The Child
Jean-Pie

1
The Stuart Hall Project
John Akomfrah
2013
------
2
Uncle Boonmee Who Can Recall His Past Lives
Apichatpong Weerasethakul
2010
------
3
A Letter to Nelson Mandela
Khalo Matabane
2013
------
4
Waiting for Happiness
Abderrahmane Sissako
2002
------
5
Something Necessary
Judy Kibinge
2013
------
6
Moolaadé
Ousmane Sembène
2004
------
7
Sembène!
Samba Gadjigo and Jason Silverman
2015
------
8
5 Broken Cameras
Emad Burnat and Guy Davidi
2011
------
9
Hooligan Sparrow
Nanfu Wang
2016
------
10
7 Letters
Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopal, Tan Pin Pin, Royston Tan and Kelvin Tong
2015
------
1
Synecdoche, New York
Charlie Kaufman
2008
------
2
Brokeback Mountain
Ang Lee
2005
------
3
Weekend
Andrew Haigh
2011
------
4
4 Months, 3 Weeks & 2 Days
Cristian Mungiu
2007
------
5
Spirited Away
Hayao Miyazaki
2001
------
6
How to Survive a Plague
David France
2012
------
7
Talk to Her
Pedro Almodóvar
2002
------
8
American Splendor
Robert Pulcini and Shari Springer Berman
2003
------
9

------
1
Saraband
Ingmar Bergman
2003
------
2
Memento
Christopher Nolan
2000
------
3
Blue Is the Warmest Color
Abdellatif Kechiche
2013
------
4
The Return
Andrey Zvyagintsev
2003
------
5
Amour
Michael Haneke
2012
------
6
Mulholland Drive
David Lynch
2001
------
7
Import Export
Ulrich Seidl
2007
------
8
Son of Saul
László Nemes
2015
------
9
Kill Bill: Vol. 1
Quentin Tarantino
2003
------
10
The Revenant
Alejandro González Iñárritu
2015
------
1
Tropical Malady
Apichatpong Weerasethakul
2004
------
2
Mulholland Drive
David Lynch
2001
------
3
The Turin Horse
Béla Tarr and Ágnes Hranitzky
2011
------
4
Tie Xi Qu: West of the Tracks
Wang Bing
2002
------
5
Le filmeur
Alain Cavalier
2005
------
6
Holy Motors
Leos Carax
2012
------
7
Elephant
Gus Van Sant
2003
------
8
A Touch of Sin
Jia Zhangke
2013
------
9
Pan's Labyrinth
Guillermo Del Toro
2006
------
10
Spirited Away
Hayao Miyazaki
2001
------
1
The New World
Terrence Malick
2005
------
2
Capitalism: Child Labor
Ken Jacobs
2007
-

2013
------
1
The Tree of Life
Terrence Malick
2011
------
2
Eden
Mia Hansen-Løve
2014
------
3
Punch-Drunk Love
Paul Thomas Anderson
2002
------
4
Ghost World
Terry Zwigoff
2001
------
5
Tabu
Miguel Gomes
2012
------
6
Wendy and Lucy
Kelly Reichardt
2008
------
7
The Headless Woman
Lucrecia Martel
2008
------
8
Everyone Else
Maren Ade
2009
------
9
Millennium Mambo
Hou Hsiao-hsien
2001
------
10
In Vanda’s Room
Pedro Costa
2000
------
1
The Master
Paul Thomas Anderson
2012
------
2
The Tree of Life
Terrence Malick
2011
------
3
Zodiac
David Fincher
2007
------
4
Toni Erdmann
Maren Ade
2016
------
5
There Will Be Blood
Paul Thomas Anderson
2007
------
6
My Golden Days
Arnaud Desplechin
2015
------
7
Carlos
Olivier Assayas
2010
------
8
A Serious Man
Joel and Ethan Coen
2009
------
9
Moonrise Kingdom
Wes Anderson
2012
------
10
The Son
Jean-Pierre and Luc Dardenne
2002
------
1
The Headless Woman
Lucrecia Martel
2008
------
2
Spirited Away
Hayao Miyazaki
2001
------
3
Mulholland Drive
D

Jacques Audiard
2009
------
1
Moon
Duncan Jones
2009
------
2
The Great Beauty
Paolo Sorrentino
2013
------
3
The Lives of Others
Florian Henckel von Donnersmarck
2006
------
4
In the Mood for Love
Wong Kar-wai
2000
------
5
The Taste of Others
Agnès Jaoui
2000
------
6
Wild Tales
Damián Szifrón
2014
------
7
Bridesmaids
Paul Feig
2011
------
8
Eternal Sunshine of the Spotless Mind
Michel Gondry
2004
------
9
O Brother, Where Art Thou?
Joel and Ethan Coen
2000
------
10
Inglourious Basterds
Quentin Tarantino
2009
------
1
Once Upon a Time in Anatolia
Nuri Bilge Ceylan
2011
------
2
The Assassination of Jesse James by the Coward Robert Ford
Andrew Dominik
2007
------
3
The Social Network
David Fincher
2010
------
4
The Grand Budapest Hotel
Wes Anderson
2014
------
5
The Lord of the Rings: The Fellowship of the Ring
Peter Jackson
2001
------
6
The Tree of Life
Terrence Malick
2011
------
7
Holy Motors
Leos Carax
2012
------
8
Timbuktu
Abderrahmane Sissako
2014
------
9
No
Pablo Larraín
2

Carol
Todd Haynes
2015
------
1
Mulholland Drive
David Lynch
2001
------
2
The Captive
Chantal Akerman
2000
------
3
Tie Xi Qu: West of the Tracks
Wang Bing
2002
------
4
The Yards
James Gray
2000
------
5
Platform
Jia Zhangke
2000
------
6
Shara
Naomi Kawase
2003
------
7
Ten
Abbas Kiarostami
2002
------
8
Twixt
Francis Ford Coppola
2011
------
9
To Die Like a Man
João Pedro Rodrigues
2009
------
10
Spring Breakers
Harmony Korine
2012
------
1
The Diving Bell and the Butterfly
Julian Schnabel
2007
------
2
Blue Is the Warmest Color
Abdellatif Kechiche
2013
------
3
Let the Right One In
Tomas Alfredson
2008
------
4
The Act of Killing
Joshua Oppenheimer
2012
------
5
Pan's Labyrinth
Guillermo Del Toro
2006
------
6
There Will Be Blood
Paul Thomas Anderson
2007
------
7
A Separation
Asghar Farhadi
2011
------
8
A Prophet
Jacques Audiard
2009
------
9
WALL-E
Andrew Stanton
2008
------
10
District 9
Neill Blomkamp
2009
------
1
Certified Copy
Abbas Kiarostami
2010
------
2
Inside Llewyn D

10
Grizzly Man
Werner Herzog
2005
------
1
Amour
Michael Haneke
2012
------
2
Distant
Nuri Bilge Ceylan
2002
------
3
A Separation
Asghar Farhadi
2011
------
4
Samson & Delilah
Warwick Thornton
2009
------
5
Leviathan
Andrey Zvyagintsev
2014
------
6
Still Walking
Hirokazu Koreeda
2008
------
7
Talk to Her
Pedro Almodóvar
2002
------
8
Million Dollar Baby
Clint Eastwood
2004
------
9
No Country For Old Men
Joel and Ethan Coen
2007
------
10
The Man Without A Past
Aki Kaurismäki
2002
------
1
Mysteries of Lisbon
Raoul Ruiz
2010
------
2
Margaret
Kenneth Lonergan
2011
------
3
The New World
Terrence Malick
2005
------
4
Secret Things
Jean-Claude Brisseau
2002
------
5
La Ciénaga
Lucrecia Martel
2001
------
6
Toni Erdmann
Maren Ade
2016
------
7
In the Family
Patrick Wang
2011
------
8
Tabu
Miguel Gomes
2012
------
9
Gerry
Gus Van Sant
2002
------
10
Tropical Malady
Apichatpong Weerasethakul
2004
------
1
The Turin Horse
Béla Tarr and Ágnes Hranitzky
2011
------
2
The Return
Andrey Zvyagi

**STEP 6**
You're almost there!!! Now that you have a working regular expression, put it in your inner loop to get the movie name.

So now the entire loop should be getting you 13 elements:
- critic_name
- critic_org
- critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
- rank (this is actually optional)
- movie_name
- director
- year

Build this loop using print() on the first one or two critic selections, just to make sure you are pulling out the right data.
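Once the printed values look right, a natural next move is to collect them into a list of rows instead of printing. The sketch below shows that structure under some assumptions: `scraped` is a hand-made stand-in for what the scraping loop yields (so the example runs without the network), the two patterns are simplified alternatives rather than the notebook's regular expressions, and the dictionary keys are illustrative.

```python
import re

# Stand-in for what the scraping loop yields: (critic line, list of movie lines).
scraped = [
    ("Simon Abrams – Freelance film critic (US)",
     ["1. Mulholland Drive (David Lynch, 2001)",
      "2. In the Mood for Love (Wong Kar-wai, 2000)"]),
    ("Sam Adams – Freelance film critic (US)",
     ["1. In the Mood for Love (Wong Kar-wai, 2000)"]),
]

critic_re = re.compile(r"^(.*?) – (.*) \((.*)\)$")
movie_re = re.compile(r"^(\d+)\. (.*) \((.*), (\d{4})\)$")

rows = []     # one dictionary per (critic, movie) vote
skipped = []  # lines that did not match, kept for inspection

for critic_info, each_movie in scraped:
    critic_match = critic_re.match(critic_info)
    if critic_match is None:
        skipped.append(critic_info)
        continue
    critic_name, critic_org, critic_cn = critic_match.groups()
    for movie in each_movie:
        movie_match = movie_re.match(movie)
        if movie_match is None:
            skipped.append(movie)
            continue
        rank, movie_name, director, year = movie_match.groups()
        rows.append({
            "critic_name": critic_name, "critic_org": critic_org, "critic_cn": critic_cn,
            "rank": int(rank), "movie_name": movie_name,
            "director": director, "year": int(year),
        })

print(len(rows), "rows,", len(skipped), "skipped")
```

A list of dictionaries like `rows` can be handed to pandas or inserted into a database table later, and `skipped` tells you which lines still need attention.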




In [15]:
# Get that loop working here
all_p = soup_doc.find(class_='body-content')('p')[1:355]
for lines in all_p:
    if lines.strong is not None:
        # Critic fields: parsed once per critic
        critic_info = lines.find('strong').string
        critic_name = re.findall(r'\A(\b\w.*?\b) \W', critic_info)[0]
        critic_cn = re.findall(r'\A\b\w.*?\b \W \b\w.*\b\.? [(](.*)[)]$', critic_info)[0]
        critic_org = re.findall(r'\A\b\w.*?\b \W (\b\w.*\b\.?) [(].*[)]$', critic_info)[0]
        movie_info = lines.next_sibling
        each_movie = movie_info.find_all(string=True)
        # Movie fields: parsed once per movie on this critic's list
        for movie in each_movie:
            movie_name = re.findall(r'\d\. (\b\w.*\b\.?\!?.*) [(]', movie)[0]
            movie_director = re.findall(r'\d\. \b\w.*\b\.?\!?.* [(](.*), ', movie)[0]
            movie_year = re.findall(r'[(].*(\d{4})', movie)[0]
            movie_rank = re.findall(r'(\d+)\.', movie)[0]
            print(critic_name)
            print(critic_cn)
            print(critic_org)
            print(movie_rank)
            print(movie_name)
            print(movie_director)
            print(movie_year)
            print('--------')

Simon Abrams
US
Freelance film critic
1
Mulholland Drive
David Lynch
2001
--------
Simon Abrams
US
Freelance film critic
2
In the Mood for Love
Wong Kar-wai
2000
--------
Simon Abrams
US
Freelance film critic
3
The Tree of Life
Terrence Malick
2011
--------
Simon Abrams
US
Freelance film critic
4
Yi Yi: A One and a Two
Edward Yang
2000
--------
Simon Abrams
US
Freelance film critic
5
Goodbye to Language
Jean-Luc Godard
2014
--------
Simon Abrams
US
Freelance film critic
6
The White Meadows
Mohammad Rasoulof
2009
--------
Simon Abrams
US
Freelance film critic
7
Night Across the Street
Raoul Ruiz
2012
--------
Simon Abrams
US
Freelance film critic
8
Certified Copy
Abbas Kiarostami
2010
--------
Simon Abrams
US
Freelance film critic
9
Sparrow
Johnnie To
2008
--------
Simon Abrams
US
Freelance film critic
10
Fados
Carlos Saura
2007
--------
Sam Adams
US
Freelance film critic
1
In the Mood for Love
Wong Kar-wai
2000
--------
Sam Adams
US
Freelance film critic
2
Eternal Sunshine of the Spotl

Miriam Bale
US
Freelance film critic
7
Almayer's Folly
Chantal Akerman
2011
--------
Miriam Bale
US
Freelance film critic
8
Inside Llewyn Davis
Joel and Ethan Coen
2013
--------
Miriam Bale
US
Freelance film critic
9
The Story of Marie and Julien
Jacques Rivette
2003
--------
Miriam Bale
US
Freelance film critic
10
Bamboozled
Spike Lee
2000
--------
Nicholas Barber
UK
BBC Culture
1
Boyhood
Richard Linklater
2014
--------
Nicholas Barber
UK
BBC Culture
2
Under the Skin
Jonathan Glazer
2013
--------
Nicholas Barber
UK
BBC Culture
3
Anomalisa
Duke Johnson and Charlie Kaufman
2015
--------
Nicholas Barber
UK
BBC Culture
4
12 Years a Slave
Steve McQueen
2013
--------
Nicholas Barber
UK
BBC Culture
5
Madagascar 3: Europe's Most Wanted
Eric Darnell, Tom McGrath and Conrad Vernon
2012
--------
Nicholas Barber
UK
BBC Culture
6
Another Year
Mike Leigh
2010
--------
Nicholas Barber
UK
BBC Culture
7
The Diving Bell and the Butterfly
Julian Schnabel
2007
--------
Nicholas Barber
UK
BBC Culture
8
Ch

--------
Monica Castillo
US
New York Times Watching
2
Persepolis
Vincent Paronnaud and Marjane Satrapi
2007
--------
Monica Castillo
US
New York Times Watching
3
Fish Tank
Andrea Arnold
2009
--------
Monica Castillo
US
New York Times Watching
4
Pan's Labyrinth
Guillermo Del Toro
2006
--------
Monica Castillo
US
New York Times Watching
5
In the Mood for Love
Wong Kar-wai
2000
--------
Monica Castillo
US
New York Times Watching
6
Spirited Away
Hayao Miyazaki
2001
--------
Monica Castillo
US
New York Times Watching
7
Mad Max: Fury Road
George Miller
2015
--------
Monica Castillo
US
New York Times Watching
8
Attack the Block
Joe Cornish
2011
--------
Monica Castillo
US
New York Times Watching
9
Eternal Sunshine of the Spotless Mind
Michel Gondry
2004
--------
Monica Castillo
US
New York Times Watching
10
The Royal Tenenbaums
Wes Anderson
2001
--------
Samuel Castro
Colombia
El Colombiano
1
The Secret in Their Eyes
Juan José Campanella
2009
--------
Samuel Castro
Colombia
El Colombiano
2
Be

Cinemateca de Cuba
9
Sympathy for Mr Vengeance
Park Chan-wook
2002
--------
Mario Espinosa
Cuba
Cinemateca de Cuba
10
Synecdoche, New York
Charlie Kaufman
2008
--------
Joseph Fahim
Egypt
Freelance film critic
1
Yi Yi: A One and a Two
Edward Yang
2000
--------
Joseph Fahim
Egypt
Freelance film critic
2
There Will Be Blood
Paul Thomas Anderson
2007
--------
Joseph Fahim
Egypt
Freelance film critic
3
Boyhood
Richard Linklater
2014
--------
Joseph Fahim
Egypt
Freelance film critic
4
Still Life
Jia Zhangke
2006
--------
Joseph Fahim
Egypt
Freelance film critic
5
Archipelago
Joanna Hogg
2010
--------
Joseph Fahim
Egypt
Freelance film critic
6
The Headless Woman
Lucrecia Martel
2008
--------
Joseph Fahim
Egypt
Freelance film critic
7
The Act of Killing
Joshua Oppenheimer
2012
--------
Joseph Fahim
Egypt
Freelance film critic
8
Caché
Michael Haneke
2005
--------
Joseph Fahim
Egypt
Freelance film critic
9
Divine Intervention
Elia Suleiman
2002
--------
Joseph Fahim
Egypt
Freelance film critic


Nuri Bilge Ceylan
2011
--------
Hauvick Habechian
Lebanon
An-Nahar Newspaper
7
In the Mood for Love
Wong Kar-wai
2000
--------
Hauvick Habechian
Lebanon
An-Nahar Newspaper
8
The Lady and the Duke
Éric Rohmer
2001
--------
Hauvick Habechian
Lebanon
An-Nahar Newspaper
9
Yi Yi: A One and a Two
Edward Yang
2000
--------
Hauvick Habechian
Lebanon
An-Nahar Newspaper
10
Silent Souls
Aleksey Fedorchenko
2010
--------
Angie Han
US
Slashfilm
1
Synecdoche, New York
Charlie Kaufman
2008
--------
Angie Han
US
Slashfilm
2
Eternal Sunshine of the Spotless Mind
Michel Gondry
2004
--------
Angie Han
US
Slashfilm
3
Boyhood
Richard Linklater
2014
--------
Angie Han
US
Slashfilm
4
Mad Max: Fury Road
George Miller
2015
--------
Angie Han
US
Slashfilm
5
Me and You and Everyone We Know
Miranda July
2005
--------
Angie Han
US
Slashfilm
6
Gone Girl
David Fincher
2014
--------
Angie Han
US
Slashfilm
7
Inside Llewyn Davis
Joel and Ethan Coen
2013
--------
Angie Han
US
Slashfilm
8
Only Lovers Left Alive
Jim Jarmu

Dominik Kamalzadeh
Austria
Der Standard
4
A History of Violence
David Cronenberg
2005
--------
Dominik Kamalzadeh
Austria
Der Standard
5
Tropical Malady
Apichatpong Weerasethakul
2004
--------
Dominik Kamalzadeh
Austria
Der Standard
6
The Death of Mr Lazarescu
Cristi Puiu
2005
--------
Dominik Kamalzadeh
Austria
Der Standard
7
Phoenix
Christian Petzold
2014
--------
Dominik Kamalzadeh
Austria
Der Standard
8
Platform
Jia Zhangke
2000
--------
Dominik Kamalzadeh
Austria
Der Standard
9
As I Was Moving Ahead Occasionally I Saw Brief Glimpses of Beauty
Jonas Mekas
2000
--------
Dominik Kamalzadeh
Austria
Der Standard
10
Secret Sunshine
Lee Chang-dong
2007
--------
Dave Karger
US
Freelance
1
Lost in Translation
Sofia Coppola
2003
--------
Dave Karger
US
Freelance
2
Silver Linings Playbook
David O. Russell
2012
--------
Dave Karger
US
Freelance
3
Brooklyn
John Crowley
2015
--------
Dave Karger
US
Freelance
4
Michael Clayton
Tony Gilroy
2007
--------
Dave Karger
US
Freelance
5
Blue Valentine
D

1
Moolaadé
Ousmane Sembène
2004
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
2
Days of Glory
Rachid Bouchareb
2006
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
3
Timbuktu
Abderrahmane Sissako
2014
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
4
Making Of
Nouri Bouzid
2006
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
5
House of Flying Daggers
Zhang Yimou
2004
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
6
City of God
Fernando Meirelles and Kátia Lund
2002
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
7
Miners Shot Down
Rehad Desai
2014
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
8
678
Mohamed Diab
2010
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
9
Hijack Stories
Oliver Schmitz
2000
--------
Hans-Christian Mahnke
Namibia
AfricAvenir.org
10
Waltz with Bashir
Ari Folman
2008
--------
Ben Mankiewicz
US
Turner Classic Movies
1
No Country For Old Men
Joel and Ethan Coen
2007
--------
Ben Mankiewicz
US
Turner Class

The New York Post/Film Comment
6
Phoenix
Christian Petzold
2014
--------
Farran Smith Nehme
US
The New York Post/Film Comment
7
Selma
Ava DuVernay
2014
--------
Farran Smith Nehme
US
The New York Post/Film Comment
8
Timbuktu
Abderrahmane Sissako
2014
--------
Farran Smith Nehme
US
The New York Post/Film Comment
9
About Elly
Asghar Farhadi
2009
--------
Farran Smith Nehme
US
The New York Post/Film Comment
10
I’m Going Home
Manoel de Oliveira
2001
--------
Joe Neumaier
US
WOR-710AM New York
1
Inside Llewyn Davis
Joel and Ethan Coen
2013
--------
Joe Neumaier
US
WOR-710AM New York
2
There Will Be Blood
Paul Thomas Anderson
2007
--------
Joe Neumaier
US
WOR-710AM New York
3
Inside Out
Pete Docter
2015
--------
Joe Neumaier
US
WOR-710AM New York
4
Carol
Todd Haynes
2015
--------
Joe Neumaier
US
WOR-710AM New York
5
Brooklyn
John Crowley
2015
--------
Joe Neumaier
US
WOR-710AM New York
6
Crouching Tiger, Hidden Dragon
Ang Lee
2000
--------
Joe Neumaier
US
WOR-710AM New York
7
Boyhood
Richard

Independent University
1
A Separation
Asghar Farhadi
2011
--------
Zakir Hossein Raju
Bangladesh
Independent University
2
Babel
Alejandro González Iñárritu
2006
--------
Zakir Hossein Raju
Bangladesh
Independent University
3
Spring, Summer, Fall, Winter…and Spring
Kim Ki-duk
2003
--------
Zakir Hossein Raju
Bangladesh
Independent University
4
In the Mood for Love
Wong Kar-wai
2000
--------
Zakir Hossein Raju
Bangladesh
Independent University
5
Spirited Away
Hayao Miyazaki
2001
--------
Zakir Hossein Raju
Bangladesh
Independent University
6
Brokeback Mountain
Ang Lee
2005
--------
Zakir Hossein Raju
Bangladesh
Independent University
7
Oldboy
Park Chan-wook
2003
--------
Zakir Hossein Raju
Bangladesh
Independent University
8
Ten
Abbas Kiarostami
2002
--------
Zakir Hossein Raju
Bangladesh
Independent University
9
Russian Ark
Aleksandr Sokurov
2002
--------
Zakir Hossein Raju
Bangladesh
Independent University
10
Once Upon a Time in Anatolia
Nuri Bilge Ceylan
2011
--------
Alberto Ramos Ru

David Fincher
2010
--------
Mike Ryan
US
Uproxx
9
12 Years a Slave
Steve McQueen
2013
--------
Mike Ryan
US
Uproxx
10
Spotlight
Tom McCarthy
2015
--------
Alaka Sahani
India
The Indian Express
1
In the Mood for Love
Wong Kar-wai
2000
--------
Alaka Sahani
India
The Indian Express
2
The Lives of Others
Florian Henckel von Donnersmarck
2006
--------
Alaka Sahani
India
The Indian Express
3
Eternal Sunshine of the Spotless Mind
Michel Gondry
2004
--------
Alaka Sahani
India
The Indian Express
4
Kill Bill: Vol. 1
Quentin Tarantino
2003
--------
Alaka Sahani
India
The Indian Express
5
Inception
Christopher Nolan
2010
--------
Alaka Sahani
India
The Indian Express
6
Talk to Her
Pedro Almodóvar
2002
--------
Alaka Sahani
India
The Indian Express
7
Amores Perros
Alejandro González Iñárritu
2000
--------
Alaka Sahani
India
The Indian Express
8
Infernal Affairs
Andrew Lau and Alan Mak
2002
--------
Alaka Sahani
India
The Indian Express
9
Lost in Translation
Sofia Coppola
2003
--------
Alaka Sahan

Ella Taylor
US
NPR.org
6
Ratatouille
Brad Bird and Jan Pinkava
2007
--------
Ella Taylor
US
NPR.org
7
Burning Bush
Agnieszka Holland
2013
--------
Ella Taylor
US
NPR.org
8
4 Months, 3 Weeks & 2 Days
Cristian Mungiu
2007
--------
Ella Taylor
US
NPR.org
9
Leviathan
Andrey Zvyagintsev
2014
--------
Ella Taylor
US
NPR.org
10
Barbara
Christian Petzold
2012
--------
Stephen Teo
Singapore
Nanyang Techological University
1
The Assassin
Hou Hsiao-hsien
2015
--------
Stephen Teo
Singapore
Nanyang Techological University
2
The Grand Budapest Hotel
Wes Anderson
2014
--------
Stephen Teo
Singapore
Nanyang Techological University
3
Inception
Christopher Nolan
2010
--------
Stephen Teo
Singapore
Nanyang Techological University
4
Uncle Boonmee Who Can Recall His Past Lives
Apichatpong Weerasethakul
2010
--------
Stephen Teo
Singapore
Nanyang Techological University
5
The Act of Killing
Joshua Oppenheimer
2012
--------
Stephen Teo
Singapore
Nanyang Techological University
6
No Country For Old Men
Joel 

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this, let me know and we can discuss what to do next. If you've made it this far just by following the instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of rows that holds all this information: one row per movie pick.

So you need to have a loop that gets everything out--but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?


In the cell below, I give you a final architecture you can use to build this most challenging list. Note that it stores each row as a dictionary rather than a list, so the result is a list of dictionaries.

In [16]:
# Collect clean information: one dictionary per (critic, movie) pair.
# Assumes `re` was imported and `soup_doc` was built in the earlier steps.
all_p = soup_doc.find(class_='body-content')('p')[1:355]
centurymovies = []
for lines in all_p:
    # Critic paragraphs are the ones that contain a <strong> tag.
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        critic_name = re.findall(r'\A(\b\w.*?\b) \W', critic_info)[0]
        critic_cn = re.findall(r'\A\b\w.*?\b \W \b\w.*\b\.? [(](.*)[)]$', critic_info)[0]
        critic_org = re.findall(r'\A\b\w.*?\b \W (\b\w.*\b\.?) [(].*[)]$', critic_info)[0]
        # The next <p> holds this critic's movies. find_next_sibling('p') skips
        # any whitespace text node that .next_sibling could return instead.
        movie_info = lines.find_next_sibling('p')
        each_movie = movie_info.find_all(string=True)
        for movie in each_movie:
            # Start a fresh dictionary for every movie so rows stay independent.
            moviesdict = {}
            moviesdict['critic_name'] = critic_name
            moviesdict['critic_cn'] = critic_cn
            moviesdict['critic_org'] = critic_org
            movie_name = re.findall(r'\d\. (\b\w.*\b\.?\!?.*) [(]', movie)[0]
            movie_director = re.findall(r'\d\. \b\w.*\b\.?\!?.* [(](.*), ', movie)[0]
            movie_year = re.findall(r'[(].*(\d{4})', movie)[0]
            movie_rank = re.findall(r'(\d+)\.', movie)[0]
            moviesdict['movie_name'] = movie_name
            moviesdict['movie_director'] = movie_director
            moviesdict['movie_year'] = movie_year
            moviesdict['movie_rank'] = movie_rank
            centurymovies.append(moviesdict)
print(centurymovies)

[{'critic_name': 'Simon Abrams', 'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'movie_name': 'Mulholland Drive', 'movie_director': 'David Lynch', 'movie_year': '2001', 'movie_rank': '1'}, {'critic_name': 'Simon Abrams', 'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'movie_name': 'In the Mood for Love', 'movie_director': 'Wong Kar-wai', 'movie_year': '2000', 'movie_rank': '2'}, {'critic_name': 'Simon Abrams', 'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'movie_name': 'The Tree of Life', 'movie_director': 'Terrence Malick', 'movie_year': '2011', 'movie_rank': '3'}, {'critic_name': 'Simon Abrams', 'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'movie_name': 'Yi Yi: A One and a Two', 'movie_director': 'Edward Yang', 'movie_year': '2000', 'movie_rank': '4'}, {'critic_name': 'Simon Abrams', 'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'movie_name': 'Goodbye to Language', 'movie_director': 'Jean-Luc Godard', 'movie_year': '2014', 'm
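
The four movie-level regexes in the loop above can also be folded into a single pattern with named groups. This is only a sketch: it assumes each movie line looks like `1. Mulholland Drive (David Lynch, 2001)`, which is what the patterns above imply, and `MOVIE_LINE` and `parse_movie_line` are hypothetical names that do not appear in the original notebook.

```python
import re

# Assumed line format: "<rank>. <title> (<director(s)>, <year>)"
MOVIE_LINE = re.compile(
    r'^\s*(?P<movie_rank>\d+)\.\s+'   # rank, e.g. "5."
    r'(?P<movie_name>.+)\s\('         # title, up to the last " (" that still lets the rest match
    r'(?P<movie_director>.+),\s*'     # director(s), may contain commas
    r'(?P<movie_year>\d{4})\)\s*$'    # four-digit year before the closing ")"
)

def parse_movie_line(line):
    """Return a dict of the four movie fields, or None if the line does not match."""
    match = MOVIE_LINE.match(line)
    return match.groupdict() if match else None

sample = "5. Madagascar 3: Europe's Most Wanted (Eric Darnell, Tom McGrath and Conrad Vernon, 2012)"
row = parse_movie_line(sample)
print(row)
```

Returning `None` for a line that does not match, instead of indexing `[0]` into an empty `findall` result, lets the loop skip or log odd lines rather than stop with an `IndexError`.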

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.
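
One possible shortcut, sketched here, is to derive both lists from the rows you already have instead of scraping the page again. The three sample rows are copied from the output above and trimmed to three keys; `unique_in_order` is a hypothetical helper, not something from the original notebook.

```python
def unique_in_order(rows, key):
    """Collect the distinct values of row[key], keeping first-seen order."""
    seen = set()
    values = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            values.append(value)
    return values

# Trimmed sample rows; in the notebook you would pass centurymovies instead.
sample_rows = [
    {'critic_name': 'Simon Abrams', 'movie_name': 'Mulholland Drive', 'movie_director': 'David Lynch'},
    {'critic_name': 'Simon Abrams', 'movie_name': 'In the Mood for Love', 'movie_director': 'Wong Kar-wai'},
    {'critic_name': 'Sam Adams', 'movie_name': 'In the Mood for Love', 'movie_director': 'Wong Kar-wai'},
]

movies = unique_in_order(sample_rows, 'movie_name')
directors = unique_in_order(sample_rows, 'movie_director')
print(movies)
print(directors)
```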

In [18]:
# Final "super loop": the same logic, with every regex given a readable name.
# The patterns never change, so they are defined once, outside the loop.
regex_for_name = r'\A(\b\w.*?\b) \W'
regex_for_org = r'\A\b\w.*?\b \W (\b\w.*\b\.?) [(].*[)]$'
regex_for_cn = r'\A\b\w.*?\b \W \b\w.*\b\.? [(](.*)[)]$'
regex_for_rank = r'(\d+)\.'
regex_for_movie_name = r'\d\. (\b\w.*\b\.?\!?.*) [(]'
regex_for_dir = r'\d\. \b\w.*\b\.?\!?.* [(](.*), '
regex_for_yr = r'[(].*(\d{4})'

list_of_movies = []
for lines in all_p:
    # Critic paragraphs are the ones that contain a <strong> tag.
    if lines.strong is not None:
        critic_info = lines.find('strong').string
        critic_name = re.findall(regex_for_name, critic_info)[0]
        critic_org = re.findall(regex_for_org, critic_info)[0]
        critic_cn = re.findall(regex_for_cn, critic_info)[0]
        # find_next_sibling is the current name for the older findNextSibling.
        movie_info = lines.find_next_sibling('p')
        each_movie = movie_info.find_all(string=True)
        for movie in each_movie:
            # One dictionary per movie pick; the key order sets the column order.
            m_data = {}
            m_data['critic_cn'] = critic_cn
            m_data['critic_org'] = critic_org
            m_data['critic_name'] = critic_name
            m_data['movie_rank'] = re.findall(regex_for_rank, movie)[0]
            m_data['movie_name'] = re.findall(regex_for_movie_name, movie)[0]
            m_data['movie_dir'] = re.findall(regex_for_dir, movie)[0]
            m_data['movie_yr'] = re.findall(regex_for_yr, movie)[0]
            list_of_movies.append(m_data)
print(list_of_movies)


[{'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'critic_name': 'Simon Abrams', 'movie_rank': '1', 'movie_name': 'Mulholland Drive', 'movie_dir': 'David Lynch', 'movie_yr': '2001'}, {'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'critic_name': 'Simon Abrams', 'movie_rank': '2', 'movie_name': 'In the Mood for Love', 'movie_dir': 'Wong Kar-wai', 'movie_yr': '2000'}, {'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'critic_name': 'Simon Abrams', 'movie_rank': '3', 'movie_name': 'The Tree of Life', 'movie_dir': 'Terrence Malick', 'movie_yr': '2011'}, {'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'critic_name': 'Simon Abrams', 'movie_rank': '4', 'movie_name': 'Yi Yi: A One and a Two', 'movie_dir': 'Edward Yang', 'movie_yr': '2000'}, {'critic_cn': 'US', 'critic_org': 'Freelance film critic', 'critic_name': 'Simon Abrams', 'movie_rank': '5', 'movie_name': 'Goodbye to Language', 'movie_dir': 'Jean-Luc Godard', 'movie_yr': '2014'}, {'critic_cn': 'US

In [19]:
import pandas as pd

df=pd.DataFrame(list_of_movies)
df

Unnamed: 0,critic_cn,critic_name,critic_org,movie_dir,movie_name,movie_rank,movie_yr
0,US,Simon Abrams,Freelance film critic,David Lynch,Mulholland Drive,1,2001
1,US,Simon Abrams,Freelance film critic,Wong Kar-wai,In the Mood for Love,2,2000
2,US,Simon Abrams,Freelance film critic,Terrence Malick,The Tree of Life,3,2011
3,US,Simon Abrams,Freelance film critic,Edward Yang,Yi Yi: A One and a Two,4,2000
4,US,Simon Abrams,Freelance film critic,Jean-Luc Godard,Goodbye to Language,5,2014
5,US,Simon Abrams,Freelance film critic,Mohammad Rasoulof,The White Meadows,6,2009
6,US,Simon Abrams,Freelance film critic,Raoul Ruiz,Night Across the Street,7,2012
7,US,Simon Abrams,Freelance film critic,Abbas Kiarostami,Certified Copy,8,2010
8,US,Simon Abrams,Freelance film critic,Johnnie To,Sparrow,9,2008
9,US,Simon Abrams,Freelance film critic,Carlos Saura,Fados,10,2007


In [20]:
df.to_csv('moviesofthecentury.csv', index=False)

In [21]:
pd.read_csv('moviesofthecentury.csv')

Unnamed: 0,critic_cn,critic_name,critic_org,movie_dir,movie_name,movie_rank,movie_yr
0,US,Simon Abrams,Freelance film critic,David Lynch,Mulholland Drive,1,2001
1,US,Simon Abrams,Freelance film critic,Wong Kar-wai,In the Mood for Love,2,2000
2,US,Simon Abrams,Freelance film critic,Terrence Malick,The Tree of Life,3,2011
3,US,Simon Abrams,Freelance film critic,Edward Yang,Yi Yi: A One and a Two,4,2000
4,US,Simon Abrams,Freelance film critic,Jean-Luc Godard,Goodbye to Language,5,2014
5,US,Simon Abrams,Freelance film critic,Mohammad Rasoulof,The White Meadows,6,2009
6,US,Simon Abrams,Freelance film critic,Raoul Ruiz,Night Across the Street,7,2012
7,US,Simon Abrams,Freelance film critic,Abbas Kiarostami,Certified Copy,8,2010
8,US,Simon Abrams,Freelance film critic,Johnnie To,Sparrow,9,2008
9,US,Simon Abrams,Freelance film critic,Carlos Saura,Fados,10,2007


# Exploring the data

In [22]:
df.shape

(1770, 7)

In [23]:
df.dtypes

critic_cn      object
critic_name    object
critic_org     object
movie_dir      object
movie_name     object
movie_rank     object
movie_yr       object
dtype: object

In [24]:
df.sort_values(by='critic_cn',ascending=False).head()

Unnamed: 0,critic_cn,critic_name,critic_org,movie_dir,movie_name,movie_rank,movie_yr
0,US,Simon Abrams,Freelance film critic,David Lynch,Mulholland Drive,1,2001
1025,US,Ben Mankiewicz,Turner Classic Movies,Asghar Farhadi,A Separation,6,2011
1071,US,Todd McCarthy,The Hollywood Reporter,David Fincher,Zodiac,2,2007
1070,US,Todd McCarthy,The Hollywood Reporter,Olivier Assayas,Carlos,1,2010
1029,US,Ben Mankiewicz,Turner Classic Movies,Darren Aronofsky,The Wrestler,10,2008


In [25]:
df.critic_cn.value_counts()

US              820
UK              180
Canada           50
Germany          50
France           50
India            50
Cuba             50
Italy            40
Australia        40
Colombia         40
Israel           40
Lebanon          30
UAE              30
South Korea      20
Argentina        20
Chile            20
Austria          20
Mexico           20
Singapore        20
Turkey           20
Senegal          10
Philippines      10
Qatar            10
Brazil           10
Egypt            10
Japan            10
Namibia          10
Bangladesh       10
South Africa     10
China            10
Kazakhstan       10
Taiwan           10
Indonesia        10
Hong Kong        10
Belgium          10
Switzerland      10
Name: critic_cn, dtype: int64

In [26]:
df.critic_cn.value_counts(ascending= False, normalize=True)*100

US              46.327684
UK              10.169492
Canada           2.824859
Germany          2.824859
France           2.824859
India            2.824859
Cuba             2.824859
Italy            2.259887
Australia        2.259887
Colombia         2.259887
Israel           2.259887
Lebanon          1.694915
UAE              1.694915
South Korea      1.129944
Argentina        1.129944
Chile            1.129944
Austria          1.129944
Mexico           1.129944
Singapore        1.129944
Turkey           1.129944
Senegal          0.564972
Philippines      0.564972
Qatar            0.564972
Brazil           0.564972
Egypt            0.564972
Japan            0.564972
Namibia          0.564972
Bangladesh       0.564972
South Africa     0.564972
China            0.564972
Kazakhstan       0.564972
Taiwan           0.564972
Indonesia        0.564972
Hong Kong        0.564972
Belgium          0.564972
Switzerland      0.564972
Name: critic_cn, dtype: float64
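
Keep in mind that `value_counts` on `critic_cn` counts rows, not people. If every critic listed ten films, the 820 US rows correspond to 82 critics. A minimal sketch of counting critics instead, using a tiny stand-in DataFrame with the notebook's column names (two picks per critic rather than ten); it assumes a critic's name identifies one person within a country.

```python
import pandas as pd

# Stand-in for the notebook's df: two US critics and one UK critic, two picks each.
toy = pd.DataFrame({
    'critic_cn':   ['US', 'US', 'US', 'US', 'UK', 'UK'],
    'critic_name': ['Simon Abrams', 'Simon Abrams', 'Sam Adams', 'Sam Adams',
                    'Nicholas Barber', 'Nicholas Barber'],
})

# Rows per country (what value_counts reports) ...
rows_per_country = toy['critic_cn'].value_counts()
# ... versus distinct critics per country.
critics_per_country = toy.groupby('critic_cn')['critic_name'].nunique()
print(critics_per_country.sort_values(ascending=False))
```

In the notebook, the same `groupby(...).nunique()` call on `df` would give the number of critics per country.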

In [27]:
df.movie_yr.astype(int)

0       2001
1       2000
2       2011
3       2000
4       2014
5       2009
6       2012
7       2010
8       2008
9       2007
10      2000
11      2004
12      2006
13      2001
14      2012
15      2014
16      2004
17      2010
18      2004
19      2003
20      2012
21      2005
22      2014
23      2012
24      2006
25      2004
26      2012
27      2012
28      2008
29      2001
        ... 
1740    2001
1741    2004
1742    2000
1743    2003
1744    2009
1745    2006
1746    2003
1747    2002
1748    2006
1749    2013
1750    2000
1751    2011
1752    2009
1753    2001
1754    2001
1755    2005
1756    2002
1757    2001
1758    2004
1759    2012
1760    2003
1761    2002
1762    2007
1763    2011
1764    2007
1765    2006
1766    2006
1767    2014
1768    2002
1769    2002
Name: movie_yr, Length: 1770, dtype: int32
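
Note that `astype` returns a new Series and leaves `df` untouched, so `df.dtypes` would still report `object` for `movie_yr` after the cell above. A sketch of making the conversion stick, shown on a small stand-in frame built from the first two rows of the table:

```python
import pandas as pd

# Stand-in frame; the scraped values start out as strings, as in df.
toy = pd.DataFrame({
    'movie_name': ['Mulholland Drive', 'In the Mood for Love'],
    'movie_rank': ['1', '2'],
    'movie_yr':   ['2001', '2000'],
})

# Assign the result back so the numeric dtypes are kept.
toy = toy.astype({'movie_rank': int, 'movie_yr': int})
print(toy.dtypes)
print(toy['movie_yr'].min(), toy['movie_yr'].max())
```

Once the columns are numeric, sorting by rank and filtering by year behave as numbers rather than as text (as text, `'10'` sorts before `'2'`).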

# The choices of Latin American critics

In [39]:
LA_critics=df[df['critic_cn'].isin(['Cuba','Colombia','Argentina','Chile','Mexico','Brazil'])]
LA_critics

Unnamed: 0,critic_cn,critic_name,critic_org,movie_dir,movie_name,movie_rank,movie_yr
30,Mexico,Arturo Aguilar,Rolling Stone Mexico,Wong Kar-wai,In the Mood for Love,1,2000
31,Mexico,Arturo Aguilar,Rolling Stone Mexico,David Lynch,Mulholland Drive,2,2001
32,Mexico,Arturo Aguilar,Rolling Stone Mexico,Christopher Nolan,Inception,3,2010
33,Mexico,Arturo Aguilar,Rolling Stone Mexico,Guillermo Del Toro,Pan's Labyrinth,4,2006
34,Mexico,Arturo Aguilar,Rolling Stone Mexico,Michael Haneke,Caché,5,2005
35,Mexico,Arturo Aguilar,Rolling Stone Mexico,Werner Herzog,Grizzly Man,6,2005
36,Mexico,Arturo Aguilar,Rolling Stone Mexico,Cristian Mungiu,"4 Months, 3 Weeks & 2 Days",7,2007
37,Mexico,Arturo Aguilar,Rolling Stone Mexico,Leos Carax,Holy Motors,8,2012
38,Mexico,Arturo Aguilar,Rolling Stone Mexico,Claude Lanzmann,The Last of the Unjust,9,2013
39,Mexico,Arturo Aguilar,Rolling Stone Mexico,Paul Thomas Anderson,There Will Be Blood,10,2007


In [40]:
LA_critics.to_csv('LAchoices.csv', index=False)

In [41]:
uniq_dir_list= LA_critics.movie_dir.unique()
uniq_dir_list

array(['Wong Kar-wai', 'David Lynch', 'Christopher Nolan',
       'Guillermo Del Toro', 'Michael Haneke', 'Werner Herzog',
       'Cristian Mungiu', 'Leos Carax', 'Claude Lanzmann',
       'Paul Thomas Anderson', 'Alfonso Cuarón',
       'Fernando Meirelles and Kátia Lund', 'Brad Bird',
       'Alejandro González Iñárritu', 'Joel and Ethan Coen', 'Ang Lee',
       'Andrew Stanton', 'Hayao Miyazaki', 'Richard Linklater',
       'Nanni Moretti', 'Mariano Llinás', 'David Fincher', 'Gus Van Sant',
       'Apichatpong Weerasethakul', 'Kim Ki-duk', 'Lars von Trier',
       'Aleksandr Sokurov', 'Jacques Audiard', 'Pete Docter',
       'Juan José Campanella', 'Michel Gondry',
       'Brad Bird and Jan Pinkava', 'Sarah Polley', 'Asghar Farhadi',
       'Steven Spielberg', 'Baz Luhrmann', 'Roy Andersson',
       'Joshua Oppenheimer', 'Yorgos Lanthimos', 'Andrzej Wajda',
       'Ari Folman', 'Park Chan-wook', 'Charlie Kaufman',
       'Clint Eastwood', 'Fabián Bielinsky', 'Hong Sang-soo',
       

In [31]:
uniq_movie_list= LA_critics.movie_name.unique()
uniq_movie_list

array(['In the Mood for Love', 'Mulholland Drive', 'Inception',
       "Pan's Labyrinth", 'Caché', 'Grizzly Man',
       '4 Months, 3 Weeks & 2 Days', 'Holy Motors',
       'The Last of the Unjust', 'There Will Be Blood', 'WALL-E',
       'Spirited Away', 'Boyhood', "The Son's Room",
       'Extraordinary Stories', 'The Social Network', 'Elephant',
       'Uncle Boonmee Who Can Recall His Past Lives', 'The White Ribbon',
       'Spring, Summer, Fall, Winter…and Spring', 'Melancholia',
       'Russian Ark', 'City of God', 'A Prophet', 'Inside Out',
       'The Secret in Their Eyes', 'Before Sunset',
       'Eternal Sunshine of the Spotless Mind', 'Ratatouille',
       'Stories We Tell', 'Amores Perros', 'A Separation',
       'Minority Report', 'Moulin Rouge!',
       'A Pigeon Sat on a Branch Reflecting on Existence',
       'The Act of Killing', 'Dogtooth', 'Katyn', '3-Iron',
       'The Congress', 'Sympathy for Mr Vengeance',
       'Synecdoche, New York', 'Jersey Boys', 'Nine Queens
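
Lists of unique names show what was picked but not how often. One way to approach the "top movies" question is to count how many critics chose each film. The sketch below uses a handful of rows copied from the outputs above (it is not the Latin American subset), with `movie_rank` already converted to integers.

```python
import pandas as pd

# A few rows copied from the outputs above, trimmed to three columns.
sample = pd.DataFrame({
    'critic_name': ['Simon Abrams', 'Simon Abrams', 'Sam Adams',
                    'Arturo Aguilar', 'Arturo Aguilar'],
    'movie_name':  ['Mulholland Drive', 'In the Mood for Love', 'In the Mood for Love',
                    'In the Mood for Love', 'Mulholland Drive'],
    'movie_rank':  [1, 2, 1, 1, 2],
})

# Number of critics who picked each film, most-picked first.
picks = sample['movie_name'].value_counts()
# Average rank per film (lower is better), usable as a tie-breaker.
mean_rank = sample.groupby('movie_name')['movie_rank'].mean()
print(picks)
print(mean_rank)
```

In the notebook, the same two calls could be run on `LA_critics` once `movie_rank` has been converted from strings to integers.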

# After finishing this first part, I found the IMDb API and tried to work with it for a little while. However, it did not give me the directors' country information, or at least I did not find it. So I gave up and decided to scrape the website instead.

In [32]:
import imdb
import urllib
import json
from imdb.Person import Person
from imdb import IMDb, IMDbError

In [33]:
Person.default_info

('main', 'filmography', 'biography')

In [34]:
ia = imdb.IMDb()

In [35]:
#def get_movies_country:
    

In [36]:
s_result = ia.search_movie('Whiplash')
Whiplash = s_result[0]
ia.update(Whiplash)
print(Whiplash['country'])
print(Whiplash['genre'])
print(Whiplash['director'])

['United States']
['Drama', 'Music']
[<Person id:3227090[http] name:_Chazelle, Damien_>]


In [37]:
person = ia.get_person('3227090')
print(person['name'])

Damien Chazelle
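
The empty `get_movies_country` stub above was never filled in. Here is a minimal sketch of what it might look like, built only from the calls already used in this notebook (`search_movie`, `update`, and the `'country'` key). `get_movie_countries` is a hypothetical name, and the function takes the client as an argument so it can be tried with a stand-in object. The `FakeIMDb` class is purely for demonstration and is not part of the imdb package.

```python
def get_movie_countries(client, title):
    """Return the country list of the first search hit for title, or [] if none."""
    results = client.search_movie(title)
    if not results:
        return []
    movie = results[0]        # assumption: the first hit is the film we want
    client.update(movie)      # fetch the full record, as in the Whiplash cell
    try:
        return list(movie['country'])
    except KeyError:
        return []

class FakeIMDb:
    """Stand-in for imdb.IMDb() so the sketch runs without network access."""

    def search_movie(self, title):
        return [{'title': title}] if title == 'Whiplash' else []

    def update(self, movie):
        movie['country'] = ['United States']

print(get_movie_countries(FakeIMDb(), 'Whiplash'))
```

In the notebook you would call `get_movie_countries(ia, 'Whiplash')` with the real client. Taking the first search hit is a shortcut: titles are not unique, so the result is worth checking against the year or director already in the data. As the note above says, this gives the film's country, not the director's.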
