## Homework 7.0: BBC Movie List Scraping and Regex

In 2016 the BBC polled 177 film critics to get their picks for the best films of the century so far. While the BBC's [aggregate poll](http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films) is interesting, the long list including everyone who voted is perhaps more revealing from the data standpoint:

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

How do I wrangle this data? That is the central challenge that you'll be dealing with this week. You need to use beautiful soup to find the critic--as well as the list of movies that immediately follow them—and then use regular expression to divide the critic information and the movie info to create the most useful possible data structure. What should the data structure be? That is up to you to figure out.



### Getting started: Data Architecture

The central challenge of this assignment it's figuring out how you are going to set up your table (list of dictionaries) from this long list of critics and movies. What will each row be? What will the columns be and each row? How can you set it up so that you have the most useful table possible. 

Some things to think about: what are the main categories of analysis: Try to design a schema that will give you a table that you can run solid aggregations in pandas. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.

### STEP 1

The first thing you need to do is scrape the page. 

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

Okay let's begin! (Note: I have set up the first few cells so that you can run requests once AND save the HTML page as a local file. And then load that local file in and do the spray thing on it. That way you only need to run requests once (ever)!)



In [387]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import requests
from bs4 import BeautifulSoup
import re


In [388]:
# # RUN THIS ONE TIME
# THEN COMMENT-OUT ALL OF THIS CODE
my_url = "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted"
raw_html = requests.get(my_url).content


In [389]:
# WRITING THE HTML FILE TO A LOCAL HTML FILE
# RUN THIS ONE TIME, THEN COMMENT-OUT ALL OF THIS CODE
with open('bbc.html', 'wb+') as f:
    f.write(raw_html)

In [390]:
#If you have run requests already--START HERE
f = open("bbc.html", "r")
local_html = f.read()
#local_html

In [391]:
# read the URL, and put the HTML page into beautiful soup

soup_doc = BeautifulSoup(local_html, "html.parser")
#print(soup_doc.prettify())

In [392]:
#Using beautiful soup find the tag that contains 
#the entire list of critics and movies
#Make a variable (like full_list) that holds all that information 
full_list = soup_doc.article.find_all('p')


In [393]:
div_list = soup_doc.article.find_all('div')
len(div_list)

508

In [394]:
div_list[9]

<div class="sc-18fde0d6-0 dlWCEZ" data-component="text-block"><p class="sc-eb7bd5f6-0 fYAfXe"><b class="sc-7dcfb11b-0 kVRnKf" id="sam-adams-–-freelance-film-critic-(us)">Sam Adams – Freelance film critic (US)<!-- --></b></p></div>

In [395]:
allb = soup_doc.find_all('b')
len(allb)
for b in allb:
    #print(b.text)

SyntaxError: incomplete input (1085289792.py, line 4)

**STEP 2** Using Beautiful Soup figure out how to separate the entries.


In [396]:
full_list[2]
full_list[-9]
len(full_list)

1957

**STEP THREE** This is where all the magic has to happen: you need to find a way to loop through all of the elements (loop through the list you just got from the find_all()) and pullout critics, and list of movies. 

Set up a loop the PRINTS critics and movies: You need to set it up so that you're getting the critic string followed their movies. 

So just print out the lines along with a print message like "CRITIC" or "MOVIE" to make sure that the loop is recognizing the two categories differently.


In [397]:
##Write your loop for STEP 3 here


for entry in full_list[2:-8]:
    if entry.b:
        print(entry.text)
    else:
        #print(entry.text)
        #print("------------")


SyntaxError: incomplete input (3224247263.py, line 9)

**STEP 4**
If your loop is successfully isolating those two categories: now it's time to parse each with regular expressions (separately). This will need to happen inside the loop--for every critic, and then (in STEP 5) for every movie. But FIRST, just **focus on getting the critics name, organization, and country** in isolation (outside of the loops).

Once you have think you have your regular expressions working then bring them into a loop (just for CRITICS) and see how well they work.

Inside the loop--once you have critic_info -- make a regular expression that pulls out the name of the critic--make a variable called critic_name

`critic_name = findall(regex,critic_info)[0]`

Do the same thing for critic_org and critic_cn

As you go print(critic_name) then print(critic_org), etc.--to make sure you're getting the results. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [398]:
#Practice/Build your regular expressions here
import re
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^([^–]+)"
regex_for_org = r"–([^(]+)\("
regex_for_cn = r"\((.+)\)$"
name = re.findall(regex_for_cn,crit_sample)
name[0]


'Mexico'

In [399]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0]
        org = re.findall(regex_for_org,entry.text)[0]
        cn = re.findall(regex_for_cn,entry.text)[0]
        #print(name + "|||" + org+ "|||"+ cn)


**STEP 5**
Now you need to get your **movie info**. You will want to use the same loop you have been working on (in STEP 6), and get the name of each movie along with the critic information.

But **FIRST**: practice your regular expressions and make sure that they're going to work before you bring them into the loop.


In [400]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"^\d{1,2}\. (.+)[(][^(]+[)]$"
regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
regex_for_year = r",\s+(\d{4})\)$"
#what else should you extract???
#set up all regexes here
movie_name = re.findall(regex_for_year,movie_harder)
movie_name[0].strip()


'2007'

**STEP 6**
You're almost there!!! Now that you have working regulars expression put those in your inner loop to get the movie name.

So now the entire loop should be getting critic information and movie information all separated as separate columns/properties.

Build this loop(s) using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [401]:
#Get that loop working here

for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0]
        org = re.findall(regex_for_org,entry.text)[0]
        cn = re.findall(regex_for_cn,entry.text)[0]
        #print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
        regex_for_year = r",\s+([^,]+)\)$"
        regex_for_multi_paren = r".+\(.+\("
        regex_for_odddate = r", .{4}\)$"
        #what else should you extract???
        movie_name = re.findall(regex_for_mname,entry.text)[0]
        movie_dir = re.findall(regex_for_dir,entry.text)[0]
        movie_year = re.findall(regex_for_year,entry.text)[0]
        print(movie_name+ "|||" + movie_dir+ "|||"+movie_year)
        paren = re.findall(regex_for_odddate,entry.text)
        if paren:
           # print(paren)


        







SyntaxError: incomplete input (3886663153.py, line 33)

**STEP 7**
This is the final step of the hardest part! 

The final step is building a list of dictionaries of all this information.

So you need have a loop that gets everything out--but you also need to figure out **how  you want to organize what you're pulling out.** What should a row look like in your table?




In [402]:
#figure out how you're going to collect your clean information
list_of_movies = []

#loop through the beautiful soup elements
#and use the regexes you developed above to get each unit of info
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0].strip()
        org = re.findall(regex_for_org,entry.text)[0].strip()
        cn = re.findall(regex_for_cn,entry.text)[0].strip()
        print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
        regex_for_year = r",\s+([^,]+)\)$"
        regex_for_multi_paren = r".+\(.+\("
        #what else should you extract???
        movie_name = re.findall(regex_for_mname,entry.text)[0].strip()
        movie_dir = re.findall(regex_for_dir,entry.text)[0].strip()
        movie_year = re.findall(regex_for_year,entry.text)[0].strip()
        new_movie_entry = [movie_name,movie_dir,movie_year,name,org,cn]
        list_of_movies.append(new_movie_entry)
        

#Try to figure out how you want to append things
#That is, how you want to organize your data

    

Simon Abrams|||Freelance film critic|||US
Sam Adams|||Freelance film critic|||US
Thelma Adams|||Freelance film critic|||US
Arturo Aguilar|||Rolling Stone Mexico|||Mexico
Matthew Anderson|||BBC Culture|||UK
Tim Appelo|||The Wrap|||US
Adriano Aprà|||Film historian|||Italy
Michael Arbeiter|||Nerdist|||US
Ali Arikan|||Dipnot TV|||Turkey
Michael Atkinson|||The Village Voice|||US
Ana Maria Bahiana|||Freelance film critic|||Brazil
Cameron Bailey|||Toronto Film Festival|||Canada
Lindsay Baker|||BBC Culture|||UK
Miriam Bale|||Freelance film critic|||US
Nicholas Barber|||BBC Culture|||UK
Diego Batlle|||La Nacion|||Argentina
NT Binh|||Positif|||France
Lizelle Bisschoff|||University of Glasgow|||UK
Christian Blauvelt|||BBC Culture|||US
Mahen Bonetti|||African Film Festival Inc|||US
Andreas Borcholte|||Spiegel Online|||Germany
Utpal Borpujari|||Freelance film critic|||India
Richard Brody|||The New Yorker|||US
Hannah Brown|||Jerusalem Post|||Israel
Luke Buckmaster|||The Guardian/BBC Culture|||Austra

In [403]:
##Take a peek at your final lists of lists
list_of_movies
len(list_of_movies)
#list_of_movies[22:44]

1770

In [404]:
# for mov in list_of_movies:
#     if mov[1].startswith("Mool"):
#         print(mov)

for mov in list_of_movies:
    if re.search(r".+\d{4}",mov[2]):
        print(mov[2])

Sembèène 2004


Could fix this here, like this, but I am going to fix in pandas

In [405]:
for mov in list_of_movies:
    if re.search(r".+\d{4}",mov[2]):
        mov[1] = mov[1]+" "+mov[2].split(" ")[0]
        print(mov[1])
        mov[2] = mov[2].split(" ")[1]
        print(mov[2])

Ousmane Sembèène
2004


If you made it this far, yay!


And now, let's bring that into PANDAS!

In [406]:
import numpy as np
import pandas as pd
col_names = ['movie', 'director', 'm_year', 'critic','crit_org','crit_cn']
df = pd.DataFrame.from_records(list_of_movies, columns=col_names)

In [407]:
df.head()

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US


In [408]:
#most popular films
df['movie'].value_counts().head(15)

movie
In the Mood for Love                     49
Mulholland Drive                         47
There Will Be Blood                      35
Spirited Away                            34
Boyhood                                  30
Eternal Sunshine of the Spotless Mind    29
A Separation                             28
The Tree of Life                         23
Yi Yi: A One and a Two                   22
No Country For Old Men                   21
Inside Llewyn Davis                      20
Children of Men                          18
4 Months, 3 Weeks & 2 Days               17
Pan's Labyrinth                          17
The Act of Killing                       16
Name: count, dtype: int64

In [409]:
m_count = df['movie'].value_counts()
m_count[m_count<10]

movie
The Dark Knight                                                                        9
Certified Copy                                                                         9
Margaret                                                                               9
Uncle Boonmee Who Can Recall His Past Lives                                            9
Timbuktu                                                                               9
A Serious Man                                                                          8
Inside Out                                                                             8
Crouching Tiger, Hidden Dragon                                                         8
Amour                                                                                  8
Tabu                                                                                   8
The New World                                                                          8
Inception      

In [410]:
#critics per country!
df.groupby('crit_cn')['critic'].nunique()

crit_cn
Argentina        2
Australia        4
Austria          2
Bangladesh       1
Belgium          1
Brazil           1
Canada           5
Chile            2
China            1
Colombia         4
Cuba             5
Egypt            1
France           5
Germany          5
Hong Kong        1
India            5
Indonesia        1
Israel           4
Italy            4
Japan            1
Kazakhstan       1
Lebanon          3
Mexico           2
Namibia          1
Philippines      1
Qatar            1
Senegal          1
Singapore        2
South Africa     1
South Korea      2
Switzerland      1
Taiwan           1
Turkey           2
UAE              3
UK              18
US              82
Name: critic, dtype: int64

In [411]:
#back up your results!!!
df.to_csv(r'backup_BBC1.csv', index = False)

In [412]:
df_new = pd.read_csv("backup_BBC1.csv")

In [413]:
d_list = list(df_new['director'].unique())
d_list.sort()
d_list.head()

AttributeError: 'list' object has no attribute 'head'

**Getting the names separated**

In [414]:
pd.set_option('display.max_rows', None)
#df_new[df_new['director'].str.contains(r"\band\b",regex=True, case=False)]

And this lambda to function, just sends each cell (director) to the dirs_names() function.

Note that here I am testing to make sure the function is working, I'm not saving this work yet.

In [415]:
df_new['director'].value_counts().head()

director
Paul Thomas Anderson    52
Joel and Ethan Coen     52
Wong Kar-wai            51
David Lynch             48
Richard Linklater       39
Name: count, dtype: int64

In [416]:
import re
def dirs_names(dirs):
    each_word = re.split(r"\s+",dirs)
    if len(each_word) > 1 and each_word[1] == "and":
        each_word[0] = each_word[0] + " " + each_word[-1]
        print(' '.join(each_word))
        return ' '.join(each_word)
    else:
        return dirs
        

In [417]:
df_new['director'].apply(lambda x: dirs_names(x))

Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Josh Safdie and Benny Safdie
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Darden

0                                             David Lynch
1                                            Wong Kar-wai
2                                         Terrence Malick
3                                             Edward Yang
4                                         Jean-Luc Godard
5                                       Mohammad Rasoulof
6                                              Raoul Ruiz
7                                        Abbas Kiarostami
8                                              Johnnie To
9                                            Carlos Saura
10                                           Wong Kar-wai
11                                          Michel Gondry
12                              Apichatpong Weerasethakul
13                                         Hayao Miyazaki
14                                     Joshua Oppenheimer
15                                           Wes Anderson
16                                        Terrence Malick
17            

Looking for directors with a single name because that function is assuming that it is always First Name "and"

In [418]:
df_new[df_new['director'].str.contains(r"^\S+$",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn


Oh-oh, problem! Let's fix!

In [419]:
df_new["director"].iat[172] = df_new["director"].iat[172] + " " + df_new["m_year"].iat[172].split(" ")[0]

AttributeError: 'numpy.int64' object has no attribute 'split'

In [420]:
df_new["director"].iat[172]

'Ousmane Sembèène'

In [421]:
df_new["m_year"].iat[172] = df_new["m_year"].iat[172].split(" ")[1]

AttributeError: 'numpy.int64' object has no attribute 'split'

In [None]:
df_new["m_year"].iat[172]

In [None]:
df_new[df_new['director'].str.contains(r"\bOusmane\b",regex=True, case=False)]


In [422]:
Whaaatttt???

SyntaxError: invalid syntax (1854124894.py, line 1)

In [423]:
df_new['director'] = df_new['director'].str.replace('Sembèène','Sembène')


In [424]:
df_new[df_new['director'].str.contains(r"\bOusmane\b",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
172,Moolaadé,Ousmane Sembène,2004,Lizelle Bisschoff,University of Glasgow,UK
190,Moolaadé,Ousmane Sembène,2004,Mahen Bonetti,African Film Festival Inc,US
395,Moolaadé,Ousmane Sembène,2004,Lindiwe Dovey,University of London,UK
1010,Moolaadé,Ousmane Sembène,2004,Hans-Christian Mahnke,AfricAvenir.org,Namibia
1536,Moolaadé,Ousmane Sembène,2004,Yael Shuv,Time Out Tel Aviv,Israel


**Okay...**

So after that tangent, I'm gonna go ahead and update the directors!!

Here I am saving the work, updating the director column.


In [425]:
df_new['director']=df_new['director'].apply(lambda x: dirs_names(x))

Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Josh Safdie and Benny Safdie
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Darden

And checking to make sure it came out, right!

In [426]:
df_new[df_new['director'].str.contains(r"\bCoen\b",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
50,No Country For Old Men,Joel Coen and Ethan Coen,2007,Tim Appelo,The Wrap,US
77,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Michael Arbeiter,Nerdist,US
86,A Serious Man,Joel Coen and Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey
108,No Country For Old Men,Joel Coen and Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil
137,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Miriam Bale,Freelance film critic,US
189,A Serious Man,Joel Coen and Ethan Coen,2009,Christian Blauvelt,BBC Culture,US
204,No Country For Old Men,Joel Coen and Ethan Coen,2007,Andreas Borcholte,Spiegel Online,Germany
232,No Country For Old Men,Joel Coen and Ethan Coen,2007,Hannah Brown,Jerusalem Post,Israel
242,"O Brother, Where Art Thou?",Joel Coen and Ethan Coen,2000,Luke Buckmaster,The Guardian/BBC Culture,Australia
260,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Monica Castillo,New York Times Watching,US


Now I need to deal with multiple directors by looking for commas.

In [427]:
df_new[df_new['director'].str.contains(r",\s+",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
144,Madagascar 3: Europe's Most Wanted,"Eric Darnell, Tom McGrath and Conrad Vernon",2012,Nicholas Barber,BBC Culture,UK
399,7 Letters,"Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopal...",2015,Lindiwe Dovey,University of London,UK
1396,"Monsters, Inc.","Pete Docter, David Silverman and Lee Unkrich",2001,Jonathan Romney,Freelance film critic,UK


In [428]:
#beware of oxford commas!!!!
df_new[df_new['director'].str.contains(r",\s+\band\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn


Replacing all the commas with ' and ' so that I have a consistent separator for every multiple director cell.

In [429]:
df_new['director']=df_new['director'].str.replace(r",\s+",' and ',regex=True)

In [430]:
df_new[df_new['director'].str.contains(r"\bUnkrich\b",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
54,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Tim Appelo,The Wrap,US
126,Toy Story 3,Lee Unkrich,2010,Lindsay Baker,BBC Culture,UK
495,Toy Story 3,Lee Unkrich,2010,Javier Porta Fouz,La Nacion,Argentina
719,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Ann Hornaday,The Washington Post,US
1343,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Sam Rigby,BBC Culture,UK
1396,"Monsters, Inc.",Pete Docter and David Silverman and Lee Unkrich,2001,Jonathan Romney,Freelance film critic,UK
1567,Toy Story 3,Lee Unkrich,2010,Eric D Snider,Freelance film critic,US


In [431]:
df_new['director'].apply(lambda x: len(re.findall(r'\band\b',x))+1)

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
30      1
31      1
32      1
33      1
34      1
35      1
36      1
37      1
38      1
39      1
40      1
41      1
42      1
43      1
44      1
45      1
46      1
47      1
48      1
49      1
50      2
51      1
52      1
53      1
54      2
55      1
56      1
57      1
58      1
59      1
60      2
61      1
62      1
63      1
64      1
65      1
66      1
67      1
68      2
69      2
70      1
71      1
72      1
73      1
74      1
75      1
76      1
77      2
78      1
79      1
80      1
81      1
82      1
83      1
84      1
85      1
86      2
87      1
88      1
89      1
90      1
91      1
92      1
93      1
94      1
95      1
96      1
97      1
98      1
99      1


Now I am transforming the Director cells into lists using split

In [432]:
df_new['director']=df_new['director'].str.split(' and ')

In [433]:
df_new.iloc[[1396]]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
1396,"Monsters, Inc.","[Pete Docter, David Silverman, Lee Unkrich]",2001,Jonathan Romney,Freelance film critic,UK


In [434]:
#df_new

This is fun...! I can use those lists to count the number of directors for a movie, like, why not?

In [435]:
#make dir numbers

df_new['nm_dir'] = df_new['director'].apply(lambda x: len(x))

In [436]:
df_new[df_new['nm_dir']>2]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
144,Madagascar 3: Europe's Most Wanted,"[Eric Darnell, Tom McGrath, Conrad Vernon]",2012,Nicholas Barber,BBC Culture,UK,3
399,7 Letters,"[Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopa...",2015,Lindiwe Dovey,University of London,UK,7
1396,"Monsters, Inc.","[Pete Docter, David Silverman, Lee Unkrich]",2001,Jonathan Romney,Freelance film critic,UK,3


But, more importantly, let's use **explode()**

This is why we put the director names into a list. explode() allows us to then take that list and make separate rows for each element in the list. This way we are "unwinding" the multiple directors.

This is making a new data frame that will have more rows.

In [437]:
df_large = df_new.explode('director')

In [438]:
df_large.shape

(1891, 7)

In [439]:
df_large[df_large['director'].str.contains(r"\bCoen\b",regex=True, case=False)].head()

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
50,No Country For Old Men,Joel Coen,2007,Tim Appelo,The Wrap,US,2
50,No Country For Old Men,Ethan Coen,2007,Tim Appelo,The Wrap,US,2
77,Inside Llewyn Davis,Joel Coen,2013,Michael Arbeiter,Nerdist,US,2
77,Inside Llewyn Davis,Ethan Coen,2013,Michael Arbeiter,Nerdist,US,2
86,A Serious Man,Joel Coen,2009,Ali Arikan,Dipnot TV,Turkey,2


Now we can get better aggregations with one director per row.

In [440]:
df_large['director'].value_counts().head(15)

director
Ethan Coen              52
Joel Coen               52
Paul Thomas Anderson    52
Wong Kar-wai            51
David Lynch             48
Richard Linklater       39
Hayao Miyazaki          35
Michael Haneke          35
Terrence Malick         32
Asghar Farhadi          31
David Fincher           31
Michel Gondry           30
Christopher Nolan       28
Wes Anderson            28
Alfonso Cuarón          24
Name: count, dtype: int64

In [441]:
df_large.groupby('movie')['critic'].nunique().sort_values(ascending=False).reset_index(name='count')

Unnamed: 0,movie,count
0,In the Mood for Love,49
1,Mulholland Drive,47
2,There Will Be Blood,35
3,Spirited Away,34
4,Boyhood,30
5,Eternal Sunshine of the Spotless Mind,29
6,A Separation,28
7,The Tree of Life,23
8,Yi Yi: A One and a Two,22
9,No Country For Old Men,21


Now I can get a much better Director List!

In [442]:
d_list = list(df_large['director'].unique())
d_list.sort()
#d_list

In [443]:
df_large

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1
5,The White Meadows,Mohammad Rasoulof,2009,Simon Abrams,Freelance film critic,US,1
6,Night Across the Street,Raoul Ruiz,2012,Simon Abrams,Freelance film critic,US,1
7,Certified Copy,Abbas Kiarostami,2010,Simon Abrams,Freelance film critic,US,1
8,Sparrow,Johnnie To,2008,Simon Abrams,Freelance film critic,US,1
9,Fados,Carlos Saura,2007,Simon Abrams,Freelance film critic,US,1


## NEXT STEP

Getting more data!!

In [444]:
url = "https://www.imdb.com/search/name/?name=David%20Lynch"
#add headers to request
raw_html = requests.get(url).content
print(raw_html)

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n</body>\r\n</html>\r\n'


In [445]:
for director in d_list[:20]:
    url = "https://en.wikipedia.org/wiki/" + director.replace(" ","_")
    print(url)

https://en.wikipedia.org/wiki/Abbas_Kiarostami
https://en.wikipedia.org/wiki/Abdellatif_Kechiche
https://en.wikipedia.org/wiki/Abderrahmane_Sissako
https://en.wikipedia.org/wiki/Adam_Curtis
https://en.wikipedia.org/wiki/Adam_McKay
https://en.wikipedia.org/wiki/Agnieszka_Holland
https://en.wikipedia.org/wiki/Agnès_Jaoui
https://en.wikipedia.org/wiki/Agnès_Varda
https://en.wikipedia.org/wiki/Aki_Kaurismäki
https://en.wikipedia.org/wiki/Alain_Cavalier
https://en.wikipedia.org/wiki/Alain_Gomis
https://en.wikipedia.org/wiki/Alain_Guiraudie
https://en.wikipedia.org/wiki/Alain_Resnais
https://en.wikipedia.org/wiki/Alan_Mak
https://en.wikipedia.org/wiki/Albert_Serra
https://en.wikipedia.org/wiki/Alejandro_González_Iñárritu
https://en.wikipedia.org/wiki/Aleksandr_Sokurov
https://en.wikipedia.org/wiki/Aleksey_Fedorchenko
https://en.wikipedia.org/wiki/Aleksey_German
https://en.wikipedia.org/wiki/Alex_Garland


In [370]:
from bs4 import BeautifulSoup
import requests
import time 
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win 64 ; x64) Apple WeKit /537.36(KHTML , like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
url = "https://www.imdb.com/name/nm0327120/" 
response = requests.get(url, headers = headers)
soup_doc = BeautifulSoup(response.content)
time.sleep(5)
page_text = soup_doc.get_text()
pattern = r"born.*?in\s+([\w\s,]+)\."

# Search for the pattern
match = re.search(pattern, page_text, re.IGNORECASE)

if match:
    print("Sentence found:", match.group())
else:
    print("No sentence starting with 'born' found.")

Sentence found: born in 1972 in Paris, France.


In [None]:
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        name = re.findall(regex_for_name,entry.text)[0]
        #print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        movie_name = re.findall(regex_for_mname,entry.text)[0]
        paren = re.findall(regex_for_odddate,entry.text)
        if paren:
           print(paren)

In [None]:
# for entry in full_list[2:-8]:
#     if entry.b:
#         regex_for_name = r"^([^–]+)"
#         regex_for_org = r"–([^(]+)\("
#         regex_for_cn = r"\((.+)\)$"
#         name = re.findall(regex_for_name,entry.text)[0]
#         birthplace_data.append(name)
#         print(name)
#     else:
#         movie_name = re.findall(regex_for_mname,entry.text)[0]
#         birthplace_data.append(movie_name)
#         print(movie_name)
            

In [446]:
df_large

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1
5,The White Meadows,Mohammad Rasoulof,2009,Simon Abrams,Freelance film critic,US,1
6,Night Across the Street,Raoul Ruiz,2012,Simon Abrams,Freelance film critic,US,1
7,Certified Copy,Abbas Kiarostami,2010,Simon Abrams,Freelance film critic,US,1
8,Sparrow,Johnnie To,2008,Simon Abrams,Freelance film critic,US,1
9,Fados,Carlos Saura,2007,Simon Abrams,Freelance film critic,US,1


In [473]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# Initialize birthplace data
birthplace_data = []

for director in d_list:
    url = "https://en.wikipedia.org/wiki/" + director.replace(" ", "_")
    response = requests.get(url)
    soup_doc = BeautifulSoup(response.content, "html.parser")

    # Check for birthplace in Wikipedia
    birthplace = soup_doc.find('div', class_="birthplace")
    if birthplace:
        birthplace_strip = birthplace.text.strip()
        birthplace_data.append(birthplace_strip)
        print(f"Found birthplace on Wikipedia: {director} - {birthplace_strip}")
    else:
        # IMDb fallback if Wikipedia doesn't have the birthplace
        # IMDb fallback if Wikipedia doesn't have the birthplace
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win 64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
        }
        imdb_url = "https://www.imdb.com/search/name/?name=" + director.replace(" ", "%20")
        response = requests.get(imdb_url, headers=headers)
        
        # Debugging: Print IMDb URL and Status Code
        print(f"Fetching IMDb URL: {imdb_url}")
        print(f"Response Status Code: {response.status_code}")
        
        soup_doc = BeautifulSoup(response.content, "html.parser")
        
        # Use get_text() to extract the entire page's text
        page_text = soup_doc.get_text()  # Extract all text from the page
        
        # Regex pattern to find "born in <location>"
        pattern = r"born.*?in\s+([\w\s,]+)\."
        match = re.search(pattern, page_text, re.IGNORECASE)
        
        if match:
            birthplace_strip = match.group(1).strip()
            birthplace_data.append(birthplace_strip)
            print(f"Found birthplace on IMDb: {director} - {birthplace_strip}")
        else:
            birthplace_data.append("Null")
            print(f"No birthplace found for {director} on IMDb.")


# Add special case for Aleksey German
aleksey_url = "https://en.wikipedia.org/wiki/Aleksei_Yuryevich_German"
response = requests.get(aleksey_url)
soup_doc = BeautifulSoup(response.content, "html.parser")
birthplace = soup_doc.find('div', class_="birthplace")
if birthplace:
    birthplace_strip = birthplace.text.strip()
    birthplace_data[0] = birthplace_strip  # Update Aleksey's entry
    print(f"Updated Aleksey German birthplace: {birthplace_strip}")

# Add special case for André Singer
andre_url = "https://en.wikipedia.org/wiki/Andr%C3%A9_Singer_(producer)"
response = requests.get(andre_url)
soup_doc = BeautifulSoup(response.content, "html.parser")
birthplace_paragraphs = soup_doc.find_all('p')
birthplace = [p.text.strip() for p in birthplace_paragraphs if p.text.strip().startswith("Born")]
if birthplace:
    birthplace_text = birthplace[0]
    match = re.search(r"Born in ([\w\s,]+?)[.,]", birthplace_text)
    if match:
        birthplace_cleaned = match.group(1).strip()
        birthplace_data[1] = birthplace_cleaned  # Update André's entry
        print(f"Updated André Singer birthplace: {birthplace_cleaned}")

# Add special case for Anthony Russo
anthony_russo_url = "https://en.wikipedia.org/wiki/Russo_brothers"
response = requests.get(anthony_russo_url)
soup_doc = BeautifulSoup(response.content, "html.parser")
birthplace = soup_doc.find('td', class_="infobox-data")
if birthplace:
    birthplace_text = birthplace.text.strip()
    print(f"Full birthplace text: {birthplace_text}")
    matches = re.findall(r"\bCleveland, Ohio, U\.S\.\b", birthplace_text)
    if matches:
        birthplace_data[2] = matches[0]  # Update Anthony's entry
        print(f"Updated Anthony Russo birthplace: {matches[0]}")
    else:
        print("No valid location found for Anthony Russo.")

# Save to CSV
df_new = pd.DataFrame({
    "director": d_list,
    "birthplace": birthplace_data,
})

df_new.to_csv("birthplaces.csv", index=False)
print("Data saved to birthplaces.csv.")


Found birthplace on Wikipedia: Abbas Kiarostami - Tehran, Imperial State of Iran
Found birthplace on Wikipedia: Abdellatif Kechiche - Tunis, Tunisia
Found birthplace on Wikipedia: Abderrahmane Sissako - Kiffa, Mauritania
Found birthplace on Wikipedia: Adam Curtis - Dartford, Kent, England, United Kingdom
Found birthplace on Wikipedia: Adam McKay - Denver, Colorado, U.S.
Found birthplace on Wikipedia: Agnieszka Holland - Warsaw, Poland
Found birthplace on Wikipedia: Agnès Jaoui - Antony, France
Found birthplace on Wikipedia: Agnès Varda - Ixelles, Brussels, Belgium
Found birthplace on Wikipedia: Aki Kaurismäki - Orimattila, Finland
Found birthplace on Wikipedia: Alain Cavalier - Vendôme, Loir-et-Cher, France
Fetching IMDb URL: https://www.imdb.com/search/name/?name=Alain%20Gomis
Response Status Code: 200
Found birthplace on IMDb: Alain Gomis - 1972 in Paris, France
Found birthplace on Wikipedia: Alain Guiraudie - Villefranche-de-Rouergue, Aveyron, France
Found birthplace on Wikipedia: A

In [470]:
print(f"Length of d_list: {len(d_list)}")
print(f"Length of birthplace_data: {len(birthplace_data)}")
print(f"Length of movie_name: {len(df_large['movie'])}")
#Debug why it wasn't scraping, index & length failure 

Length of d_list: 441
Length of birthplace_data: 441
Length of movie_name: 1891


In [474]:
df_new.head()

Unnamed: 0,director,birthplace
0,Abbas Kiarostami,"Leningrad, Russian SFSR, Soviet Union(present ..."
1,Abdellatif Kechiche,London
2,Abderrahmane Sissako,"Cleveland, Ohio, U.S."
3,Adam Curtis,"Dartford, Kent, England, United Kingdom"
4,Adam McKay,"Denver, Colorado, U.S."


In [480]:
df_large.head()

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1


In [481]:
# print(df_large[["director", "nm_dir"]].head())
# print(df_new[["director", "nm_dir"]].head())  # If df_new has these columns
# # so one is a list and one is a string that's why it's not working!!! 

In [482]:
# Merge using the correct column names
df_combined = df_large.merge(df_new, on="director", how="left")

# Save to CSV
df_combined.to_csv("final_movies_with_birthplaces.csv", index=False)

# Display the updated DataFrame
df_combined

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir,birthplace
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1,"Missoula, Montana, U.S."
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1,"Shanghai, China"
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1,"Ottawa, Illinois, U.S."
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1,"Shanghai, Republic of China"
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1,"Paris, France"
5,The White Meadows,Mohammad Rasoulof,2009,Simon Abrams,Freelance film critic,US,1,"Shiraz, Imperial State of Iran"
6,Night Across the Street,Raoul Ruiz,2012,Simon Abrams,Freelance film critic,US,1,"Puerto Montt, Chile"
7,Certified Copy,Abbas Kiarostami,2010,Simon Abrams,Freelance film critic,US,1,"Leningrad, Russian SFSR, Soviet Union(present ..."
8,Sparrow,Johnnie To,2008,Simon Abrams,Freelance film critic,US,1,"Sham Shui Po, Kowloon, British Hong Kong"
9,Fados,Carlos Saura,2007,Simon Abrams,Freelance film critic,US,1,"Huesca, Spain"


In [486]:
import re

# Function to clean up birthplace data
def clean_birthplace(entry):
    # Step 1: Remove years, phrases like "in", "on", "at", etc.
    entry = re.sub(r"\b(?:in|on|at|at the end of)\b", "", entry, flags=re.IGNORECASE)
    entry = re.sub(r"\b\d{4}\b", "", entry)  # Remove 4-digit years
    entry = re.sub(r"[()\[\]]", "", entry)   # Remove parentheses or brackets

    # Step 2: Normalize abbreviations like "U.S." or "USA"
    entry = re.sub(r"\bU\.S\.A\.|\bUSA\b", "United States", entry, flags=re.IGNORECASE)
    entry = re.sub(r"\bU\.S\.|\bUS\b", "United States", entry, flags=re.IGNORECASE)
    
    # Step 3: Standardize the format
    # Split into parts (e.g., "City, State, Country") and strip whitespace
    parts = [part.strip() for part in entry.split(",") if part.strip()]

    # Step 4: Join the cleaned parts
    return ", ".join(parts)

# Apply the cleaning function to the 'birthplace' column
df_combined['birthplace'] = df_combined['birthplace'].apply(clean_birthplace)

# Display the cleaned DataFrame
print(df_combined)
df_combined.to_csv("final_movies_with_birthplaces.csv", index=False)

                                                  movie  \
0                                      Mulholland Drive   
1                                  In the Mood for Love   
2                                      The Tree of Life   
3                                Yi Yi: A One and a Two   
4                                   Goodbye to Language   
5                                     The White Meadows   
6                               Night Across the Street   
7                                        Certified Copy   
8                                               Sparrow   
9                                                 Fados   
10                                 In the Mood for Love   
11                Eternal Sunshine of the Spotless Mind   
12                              Syndromes and a Century   
13                                        Spirited Away   
14                                   The Act of Killing   
15                             The Grand Budapest Hotel 

In [485]:
Anthony_Russo_url = "https://en.wikipedia.org/wiki/Russo_brothers"
response = requests.get(Anthony_Russo_url)
soup_doc = BeautifulSoup(response.content)
birthplace_data = []
birthplace = soup_doc.find('td', class_="infobox-data")
if birthplace: 
    birthplace_strip = birthplace.text.strip()
    birthplace_data.append(birthplace_strip)
    print(f"Found birthplace on Wikipedia: {director}{birthplace_data[-1]}")

else: 
    print(f"Not found")

Found birthplace on Wikipedia: Éric RohmerAnthony Russo (1970-02-03) February 3, 1970 (age 54)Cleveland, Ohio, U.S.Joseph Russo (1971-07-18) July 18, 1971 (age 53)Cleveland, Ohio, U.S.


In [None]:
Anthony_Russo_url = "https://en.wikipedia.org/wiki/Russo_brothers"
response = requests.get(Anthony_Russo_url)
soup_doc = BeautifulSoup(response.content, "html.parser")

birthplace_data = []
birthplace = soup_doc.find('td', class_="infobox-data")

if birthplace:
    # Extract the text and clean it
    birthplace_text = birthplace.text.strip()
    print(f"Full birthplace text: {birthplace_text}")  # Debugging: Print the full text to analyze

    # Refined regex to extract only "Cleveland, Ohio, U.S." occurrences
    matches = re.findall(r"\bCleveland, Ohio, U\.S\.\b", birthplace_text)

    if matches:
        # Append only unique matches to birthplace_data
        for match in matches:
            birthplace_data.append(match)
        print(f"Found birthplace(s) on Wikipedia: {', '.join(birthplace_data)}")


In [None]:
# Eran_Riklis = "https://www.imdb.com/name/nm0726954/bio/?ref_=nm_ov_bio_sm"
# response = requests.get(Eran_Riklis)
# soup_doc = BeautifulSoup(response.content, "html.parser")
# birthplace_data = []
# birthplace_paragraphs = soup_doc.find_all('div', class_="sc-f65f65be-0 dQVJPm")
# birthplace_paragraphs 

In [None]:
André_url = "https://en.wikipedia.org/wiki/Andr%C3%A9_Singer_(producer)"
response = requests.get(André_url)
soup_doc = BeautifulSoup(response.content, "html.parser")

birthplace_data = []
birthplace_paragraphs = soup_doc.find_all('p')

# Filter paragraphs starting with "Born"
birthplace = [p.text.strip() for p in birthplace_paragraphs if p.text.strip().startswith("Born")]

if birthplace:
    birthplace_text = birthplace[0]
    #Extract born in 
    match = re.search(r"Born in ([\w\s,]+?)[.,]", birthplace_text)
    if match:
        birthplace_cleaned = match.group(1).strip()
        birthplace_data.append(birthplace_cleaned)
        print(f"Found birthplace on Wikipedia: André Singer - {birthplace_cleaned}")


In [79]:
# Found birthplace on Wikipedia: Abbas KiarostamiTehran, Imperial State of Iran
# Found birthplace on Wikipedia: Abdellatif KechicheTunis, Tunisia
# Found birthplace on Wikipedia: Abderrahmane SissakoKiffa, Mauritania
# Found birthplace on Wikipedia: Adam CurtisDartford, Kent, England, United Kingdom
# Found birthplace on Wikipedia: Adam McKayDenver, Colorado, U.S.
# Found birthplace on Wikipedia: Agnieszka HollandWarsaw, Poland
# Found birthplace on Wikipedia: Agnès JaouiAntony, France
# Found birthplace on Wikipedia: Agnès VardaIxelles, Brussels, Belgium
# Found birthplace on Wikipedia: Aki KaurismäkiOrimattila, Finland
# Found birthplace on Wikipedia: Alain CavalierVendôme, Loir-et-Cher, France
# Found birthplace on IMDb: Alain Gomis1972 in Paris, France
# Found birthplace on Wikipedia: Alain GuiraudieVillefranche-de-Rouergue, Aveyron, France
# Found birthplace on Wikipedia: Alain ResnaisVannes, France
# Found birthplace on IMDb: Alan MakHong Kong in 1965
# Found birthplace on Wikipedia: Albert SerraBanyoles, Catalonia, Spain
# Found birthplace on Wikipedia: Alejandro González IñárrituMexico City, Mexico
# Found birthplace on Wikipedia: Aleksandr SokurovPodorvikha, Irkutsky District, Soviet Union
# Found birthplace on Wikipedia: Aleksey FedorchenkoYekaterinburg, Russian SFSR
# Found birthplace on IMDb: Aleksey GermanSt !!Try again: https://www.imdb.com/name/nm0314516/bio/?ref_=nm_ov_bio_sm Leningrad, Russian SFSR, USSR [now St. Petersburg, Russia]
# Found birthplace on Wikipedia: Alex Garland London, England
# Found birthplace on Wikipedia: Alexander Payne Omaha, Nebraska, U.S.
# Found birthplace on Wikipedia: Alfonso CuarónMexico City, Mexico
# Found birthplace on Wikipedia: Amma AsanteLambeth, London, England
# Found birthplace on Wikipedia: Ana Lily AmirpourMargate, Kent, England
# Found birthplace on Wikipedia: Andrea ArnoldDartford, Kent, England
# Found birthplace on Wikipedia: Andrew AdamsonAuckland, New Zealand
# Found birthplace on Wikipedia: Andrew DominikWellington, New Zealand
# Found birthplace on Wikipedia: Andrew DosunmuLagos, Nigeria
# Found birthplace on Wikipedia: Andrew HaighHarrogate, England
# Found birthplace on Wikipedia: Andrew LauBritish Hong Kong
# Found birthplace on Wikipedia: Andrew StantonRockport, Massachusetts, U.S.
# Found birthplace on Wikipedia: Andrey ZvyagintsevNovosibirsk, RSFSR, Soviet Union
# Found birthplace on Wikipedia: Andrzej WajdaSuwałki, Second Polish Republic
# Found birthplace on Wikipedia: Andrzej ZulawskiLviv, Ukrainian SSR, Soviet Union
# No birthplace found for André Singer!! Try again: Born in London https://en.wikipedia.org/wiki/Andr%C3%A9_Singer_(producer)
# Found birthplace on Wikipedia: Ang LeeChaozhou, Pingtung, Taiwan
# Found birthplace on IMDb: Angela Ricci LucchiMilan, Italy
# Found birthplace on IMDb: Annemarie JacirBethlehem, Palestine
# No birthplace found for Anthony Russo!!!! https://en.wikipedia.org/wiki/Russo_brothers
# No birthplace found for Antonio Di Trapani !!! No info 
# Found birthplace on Wikipedia: Anurag KashyapGorakhpur, Uttar Pradesh, India
# Found birthplace on Wikipedia: Apichatpong WeerasethakulBangkok, Thailand
# Found birthplace on Wikipedia: Ari FolmanHaifa, Israel
# Found birthplace on Wikipedia: Arnaud DesplechinRoubaix, France
# Found birthplace on Wikipedia: Asghar FarhadiHomayoon Shahr, Isfahan province, Imperial State of Iran
# Found birthplace on Wikipedia: Ashutosh GowarikerKolhapur, Maharashtra, India
# Found birthplace on Wikipedia: Asif KapadiaLondon Borough of Hackney, England
# Found birthplace on Wikipedia: Ava DuVernayLong Beach, California, U.S.
# Found birthplace on Wikipedia: Avi NesherRamat Gan, Israel[1]
# Found birthplace on Wikipedia: Bahman GhobadiBaneh, Iran
# Found birthplace on Wikipedia: Bart LaytonHammersmith, London, England
# Found birthplace on Wikipedia: Baz LuhrmannSydney, New South Wales, Australia[1]
# No birthplace found for Bazi Gete !!! No info, 
# Found birthplace on IMDb: Ben Rivers1972 in Somerset, England, UK
# Found birthplace on Wikipedia: Ben StillerNew York City, U.S.
# Found birthplace on Wikipedia: Ben WheatleyBillericay, Essex, England
# Found birthplace on Wikipedia: Benh ZeitlinNew York City, New York, U.S.
# Found birthplace on Wikipedia: Benny SafdieNew York City, U.S.
# Found birthplace on Wikipedia: Bong Joon-hoDaegu, South Korea
# Found birthplace on Wikipedia: Boo JunfengSingapore
# Found birthplace on Wikipedia: Bouli LannersMoresnet-Chapelle, in Plombières, Liège, Belgium
# Found birthplace on Wikipedia: Brad BirdKalispell, Montana, U.S.
# Found birthplace on Wikipedia: Brian De PalmaNewark, New Jersey, U.S.
# Found birthplace on IMDb: Brian Taylorthe country ##No info
# Found birthplace on Wikipedia: Béla TarrPécs, People's Republic of Hungary
# Found birthplace on Wikipedia: Cameron CrowePalm Springs, California, U.S.
# Found birthplace on Wikipedia: Carlos ReygadasMexico City, Mexico
# Found birthplace on Wikipedia: Carlos SauraHuesca, Spain
# Found birthplace on Wikipedia: Carol MorleyStockport, England
# Found birthplace on Wikipedia: Cary Joji FukunagaOakland, California, U.S.
# Found birthplace on Wikipedia: Catherine BreillatBressuire, Deux-Sèvres, France
# Found birthplace on Wikipedia: Chaitanya TamhaneMumbai, Maharashtra, India
# Found birthplace on Wikipedia: Chantal AkermanBrussels, Belgium
# Found birthplace on IMDb: Charles BurnettVicksburg, Mississippi on April 13, 1944, Charles Burnett moved with his family to the Watts area of Los Angeles at an early age
# Found birthplace on Wikipedia: Charlie KaufmanNew York City, U.S.
# Found birthplace on Wikipedia: Chris BuckWichita, Kansas, U.S.
# Found birthplace on Wikipedia: Christian MarclaySan Rafael, California, U.S.
# Found birthplace on IMDb: Christian PetzoldHilden in 1960
# Found birthplace on IMDb: Christopher GuestNew York City
# Found birthplace on IMDb: Christopher MorrisBristol, England, UK
# Found birthplace on Wikipedia: Christopher NolanLondon, England
# Found birthplace on Wikipedia: Claire DenisParis, France
# Found birthplace on Wikipedia: Claude LanzmannBois-Colombes, France
# Found birthplace on IMDb: Clint EastwoodSan Francisco, to Clinton Eastwood Sr
# Found birthplace on IMDb: Clio BarnardOtley, Leeds, West Yorkshire, England, UK
# Found birthplace on Wikipedia: Conrad VernonLubbock, Texas, U.S.
# Found birthplace on Wikipedia: Corneliu PorumboiuVaslui, Vaslui County, Romania
# Found birthplace on Wikipedia: Courtney HuntMemphis, Tennessee, United States
# Found birthplace on Wikipedia: Cristi PuiuBucharest, Romania
# Found birthplace on Wikipedia: Cristian MungiuIași, Romania
# Found birthplace on Wikipedia: Céline SciammaPontoise, Val-d'Oise, France
# Found birthplace on Wikipedia: Damien ChazelleProvidence, Rhode Island, U.S.
# Found birthplace on Wikipedia: Damián SzifrónRamos Mejía, Argentina
# Found birthplace on Wikipedia: Dan GilroySanta Monica, California, U.S.
# Found birthplace on Wikipedia: Danis TanovicZenica, SR Bosnia and Herzegovina, SFR Yugoslavia
# Found birthplace on IMDb: Danièle HuilletParis, France
# Found birthplace on Wikipedia: Darren AronofskyNew York City, U.S.
# Found birthplace on Wikipedia: David CronenbergToronto, Ontario, Canada
# Found birthplace on Wikipedia: David FincherDenver, Colorado, U.S.
# Found birthplace on IMDb: David FranceWidnes, Cheshire, England, UK
# Found birthplace on Wikipedia: David LynchMissoula, Montana, U.S.
# Found birthplace on Wikipedia: David MichôdSydney, New South Wales, Australia
# Found birthplace on IMDb: David O'ReillyKilkenny, Ireland
# Found birthplace on IMDb: David O. RussellNew York City, Russell attended public schools in Mamaroneck, NY
# Found birthplace on IMDb: David SilvermanNew York City, New York, USA
# Found birthplace on Wikipedia: David WainShaker Heights, Ohio, U.S.
# Found birthplace on Wikipedia: Debra GranikCambridge, Massachusetts, U.S.
# Found birthplace on Wikipedia: Denis VilleneuveGentilly, Quebec, Canada
# Found birthplace on Wikipedia: Derek CianfranceLakewood, Colorado, U.S.
# Found birthplace on Wikipedia: Destin Daniel CrettonHaiku, Hawaii, U.S.[1]
# Found birthplace on Wikipedia: Dibakar BanerjeeNew Delhi, India
# Found birthplace on Wikipedia: Don HertzfeldtFremont, California, U.S.
# No birthplace found for Dror Moreh !!! No info 
# Found birthplace on IMDb: Duke JohnsonSt
# Found birthplace on Wikipedia: Duncan JonesBromley, London, England
# Found birthplace on Wikipedia: Edgar WrightPoole, Dorset, England
# Found birthplace on Wikipedia: Edward YangShanghai, Republic of China
# Found birthplace on Wikipedia: Elia SuleimanNazareth, Israel
# Found birthplace on IMDb: Emad BurnatPalestine
# Found birthplace on Wikipedia: Eran KolirinHolon
# No birthplace found for Eran Riklis !!! Born in Beersheba, Israel, https://www.imdb.com/name/nm0726954/bio/?ref_=nm_ov_bio_sm
# Found birthplace on IMDb: Eric DarnellPrairie Village, Kansas, USA
# Found birthplace on Wikipedia: Eric KhooSingapore
# Found birthplace on Wikipedia: Ermanno OlmiBergamo, Italy
# Found birthplace on IMDb: Ernie GehrMilwaukee, Wisconsin, USA
# Found birthplace on Wikipedia: Ethan CoenSt. Louis Park, Minnesota, U.S. (both)
# Found birthplace on Wikipedia: Evan GoldbergVancouver, British Columbia, Canada
# Found birthplace on Wikipedia: Eytan FoxNew York City, New York, U.S.
# Found birthplace on Wikipedia: Fabián BielinskyBuenos Aires, Argentina
# Found birthplace on Wikipedia: Fatih AkinHamburg, West Germany
# Found birthplace on Wikipedia: Fernando MeirellesSão Paulo, Brazil
# Found birthplace on Wikipedia: Florian Henckel von DonnersmarckCologne, West Germany
# Found birthplace on Wikipedia: Francis Ford CoppolaDetroit, Michigan, U.S.
# Found birthplace on Wikipedia: Franco PiavoliPozzolengo, Lombardy, Italy
# Found birthplace on Wikipedia: François OzonParis, France
# Found birthplace on Wikipedia: Frederick WisemanBoston, Massachusetts, U.S.
# Found birthplace on Wikipedia: Gabriela PichlerHuddinge, Sweden
# Found birthplace on IMDb: Gareth Edwardsthe English town of Nuneaton, Warwickshire
# Found birthplace on Wikipedia: Gaspar NoéBuenos Aires, Argentina
# Found birthplace on Wikipedia: George ClooneyLexington, Kentucky, U.S.
# Found birthplace on Wikipedia: George LucasModesto, California, U.S.
# Found birthplace on IMDb: George MillerEdinburgh, Scotland, UK
# Found birthplace on IMDb: Gerhard Benedikt FriedlBad Aussee, Styria, Austria
# Found birthplace on Wikipedia: Gina Prince-BythewoodChicago, Illinois, U.S.
# Found birthplace on Wikipedia: Greg MottolaDix Hills, New York, U.S.
# Found birthplace on Wikipedia: Guillermo Del ToroGuadalajara, Jalisco, Mexico
# Found birthplace on Wikipedia: Gurinder ChadhaNairobi, Kenya
# Found birthplace on Wikipedia: Gus Van SantLouisville, Kentucky, U.S.
# Found birthplace on Wikipedia: Guy DavidiJaffa
# Found birthplace on Wikipedia: Guy MaddinWinnipeg, Manitoba, Canada
# Found birthplace on Wikipedia: Hany Abu-AssadNazareth, Israel
# Found birthplace on Wikipedia: Harmony KorineBolinas, California, U.S.
# Found birthplace on Wikipedia: Hayao MiyazakiTokyo City, Empire of Japan
# Found birthplace on Wikipedia: Hirokazu KoreedaTokyo, Japan
# Found birthplace on Wikipedia: Hong Sang-sooSeoul, South Korea
# Found birthplace on Wikipedia: Hou Hsiao-hsienMeixian County, Guangdong, Republic of China
# Found birthplace on Wikipedia: Ingmar BergmanUppsala, Sweden
# Found birthplace on Wikipedia: J. A. BayonaBarcelona, Spain
# Found birthplace on Wikipedia: J. J. AbramsNew York City, U.S.
# Found birthplace on Wikipedia: Jack NeoState of Singapore
# No birthplace found for Jacob Krupnick !! No info 
# Found birthplace on Wikipedia: Jacques AudiardParis, France
# Found birthplace on Wikipedia: Jacques RivetteRouen, France
# Found birthplace on Wikipedia: Jafar PanahiMianeh, East Azerbaijan, Imperial State of Iran
# Found birthplace on Wikipedia: James CameronKapuskasing, Ontario, Canada
# Found birthplace on IMDb: James Graya small English town called Grantham in 1985
# No birthplace found for James Marsch !! Try again:Truro, Cornwall, England, UK https://www.imdb.com/name/nm1016428/
# Found birthplace on IMDb: James MarshTruro, Cornwall, England, UK
# Found birthplace on Wikipedia: Jan PinkavaPrague, Czechoslovakia
# Found birthplace on Wikipedia: Jane CampionWellington, New Zealand
# Found birthplace on Wikipedia: Janie GeiserBaton Rouge, Louisiana
# Found birthplace on Wikipedia: Jason ReitmanMontreal, Quebec, Canada
# No birthplace found for Jason Silverman !!No info 
# Found birthplace on IMDb: Jean-Charles FitoussiTours, France
# Found birthplace on Wikipedia: Jean-Claude BrisseauParis, France[1]
# Found birthplace on Wikipedia: Jean-Gabriel PériotBellac, France
# Found birthplace on Wikipedia: Jean-Luc GodardParis, France
# Found birthplace on Wikipedia: Jean-Marc ValléeMontreal, Quebec, Canada
# Found birthplace on IMDb: Jean-Marie StraubMetz, Moselle, Lorraine, France
# No birthplace found for Jean-Marie Téno !! Famleng, Bandjoun Try again: https://en.wikipedia.org/wiki/Jean-Marie_Teno
# No birthplace found for Jean-Pierre Dardenne !! Try again: Liège, Belgium https://en.wikipedia.org/wiki/Dardenne_brothers
# Found birthplace on Wikipedia: Jean-Pierre JeunetRoanne, Loire, France
# Found birthplace on Wikipedia: Jeff NicholsLittle Rock, Arkansas, U.S.
# Found birthplace on Wikipedia: Jeff TremaineDurham, North Carolina, U.S.
# Found birthplace on Wikipedia: Jennifer KentBrisbane, Queensland, Australia
# Found birthplace on IMDb: Jennifer LeeBarrington, Rhode Island, USA
# Found birthplace on Wikipedia: Jessica HausnerVienna, Austria
# Found birthplace on Wikipedia: Jia ZhangkeFenyang, Shanxi, China
# Found birthplace on Wikipedia: Jiang WenTangshan, Hebei, China
# No birthplace found for Jihan El-Tahri !! Try again: born in Beirut, Lebanon, https://en.wikipedia.org/wiki/Jihan_El-Tahri
# Found birthplace on Wikipedia: Jim JarmuschCuyahoga Falls, Ohio, U.S.
# Found birthplace on Wikipedia: Joanna HoggLondon, England
# Found birthplace on Wikipedia: Joe CornishLondon, England
# Found birthplace on IMDb: Joe RussoAbbeville, Louisiana, USA
# Found birthplace on Wikipedia: Joel CoenSt. Louis Park, Minnesota, U.S. (both)
# Found birthplace on Wikipedia: John AkomfrahAccra, Dominion of Ghana
# Found birthplace on Wikipedia: John Cameron MitchellEl Paso, Texas, U.S.
# Found birthplace on IMDb: John Carney1972 in Dublin, Ireland
# Found birthplace on IMDb: John CrowleyCork, Ireland
# No birthplace found for John Gianvito !! No info 
# Found birthplace on Wikipedia: Johnnie ToSham Shui Po, Kowloon, British Hong Kong
# Found birthplace on Wikipedia: Jonas MekasSemeniškiai, Lithuania
# Found birthplace on Wikipedia: Jonathan GlazerLondon, England
# Found birthplace on Wikipedia: Joseph CedarNew York, United States
# No birthplace found for Joseph Kahn !! Try again: Busan, South Korea https://en.wikipedia.org/wiki/Joseph_Kahn_(director)
# Found birthplace on Wikipedia: Josephine DeckerLondon, UK
# Found birthplace on Wikipedia: Josh SafdieNew York City, U.S.
# Found birthplace on Wikipedia: Joshua OppenheimerAustin, Texas, U.S.
# Found birthplace on Wikipedia: Joss WhedonNew York City, U.S.
# Found birthplace on Wikipedia: João Pedro RodriguesLisbon, Portugal
# Found birthplace on Wikipedia: Juan José CampanellaBuenos Aires, Argentina
# Found birthplace on Wikipedia: Judy KibingeNairobi, Kenya
# Found birthplace on Wikipedia: Julian SchnabelNew York City, U.S.
# Found birthplace on Wikipedia: Jørgen LethAarhus, Denmark
# No birthplace found for K. Rajagopal !! Singapore Try again: https://en.wikipedia.org/wiki/K._Rajagopal_(director)
# Found birthplace on Wikipedia: Karim AïnouzFortaleza, Ceará, Brazil
# Found birthplace on Wikipedia: Kathryn BigelowSan Carlos, California, U.S.
# Found birthplace on Wikipedia: Kelly ReichardtMiami, Florida, U.S.
# Found birthplace on Wikipedia: Kelvin TongSingapore
# Found birthplace on Wikipedia: Ken JacobsNew York City, US[1]
# Found birthplace on Wikipedia: Ken LoachNuneaton, England
# No birthplace found for Kenneth Lonergan !! Try again: The Bronx, New York, U.S. https://en.wikipedia.org/wiki/Kenneth_Lonergan
# Found birthplace on Wikipedia: Kevin CostnerLynwood, California, U.S.
# No birthplace found for Khalo Matabane !! Try again: Limpopo, South Africa https://africanfilmny.org/directors/khalo-matabane/#:~:text=Khalo%20Matabane%20is%20an%20award,both%20national%20and%20international%20acclaim.
# Found birthplace on Wikipedia: Kim Jee-woonSeoul, South Korea
# Found birthplace on Wikipedia: Kim Ki-dukPonghwa, South Korea
# Found birthplace on Wikipedia: Kinji FukasakuMito, Ibaraki, Japan
# Found birthplace on Wikipedia: Kiyoshi KurosawaKobe, Japan
# Found birthplace on Wikipedia: Kleber Mendonça FilhoRecife, Pernambuco, Brazil
# Found birthplace on IMDb: Kátia Lund1966 in São Paulo, São Paulo, Brazil
# Found birthplace on IMDb: Larry CharlesBrooklyn, New York City, New York, USA
# Found birthplace on Wikipedia: Lars von TrierKongens Lyngby, Denmark
# Found birthplace on Wikipedia: Laurent CantetMelle, Deux-Sèvres, France
# Found birthplace on IMDb: Laurie AndersonPotsdam, New York, USA
# Found birthplace on Wikipedia: Lav DiazColumbio, Cotabato, Philippines[a][1]
# Found birthplace on Wikipedia: Lee Chang-dongDaegu, South Korea
# Found birthplace on Wikipedia: Lee UnkrichCleveland, Ohio, U.S.
# Found birthplace on Wikipedia: Leos CaraxSuresnes, Hauts-de-Seine, France
# No birthplace found for Lewis Klahr !! Try again: New York https://en.wikipedia.org/wiki/Lewis_Klahr
# Found birthplace on IMDb: Luc DardenneAwirs, Wallonia, Belgium
# Found birthplace on Wikipedia: Luca GuadagninoPalermo, Italy
# No birthplace found for Lucien Castaing-Taylor !! Try again: Liverpool, United Kingdom https://en.wikipedia.org/wiki/Lucien_Castaing-Taylor
# Found birthplace on Wikipedia: Lucrecia MartelSalta, Argentina
# No birthplace found for Luigi M. Faccini !!Try again: Lerici, Italy https://www.imdb.com/name/nm0264825/
# Found birthplace on Wikipedia: Lukas MoodyssonLund, Scania, Sweden
# No birthplace found for Luke Fowler !! Try again: Glasgow, UK https://hepworthwakefield.org/artist/luke-fowler/
# Found birthplace on Wikipedia: Lynne RamsayGlasgow, Scotland[2]
# Found birthplace on Wikipedia: László NemesBudapest, Hungary
# Found birthplace on Wikipedia: Mahamat-Saleh HarounN'Djamena, Chad
# Found birthplace on Wikipedia: Mani HaghighiTehran, Iran
# Found birthplace on Wikipedia: Manoel de OliveiraPorto, Portugal
# Found birthplace on IMDb: Marcelo GomesRecife, Pernambuco, Brazil
# Found birthplace on Wikipedia: Marco BellocchioBobbio, Italy
# No birthplace found for Marco De Angelis !! no info 
# Found birthplace on Wikipedia: Maren AdeKarlsruhe, West Germany
# Found birthplace on Wikipedia: Mariano LlinásBuenos Aires, Argentina
# Found birthplace on Wikipedia: Marjane SatrapiRasht, Imperial State of Iran
# Found birthplace on Wikipedia: Mark NeveldineWatertown, New York, United States
# Found birthplace on Wikipedia: Martin CampbellHastings, New Zealand
# Found birthplace on Wikipedia: Martin ScorseseNew York City, U.S.
# Found birthplace on Wikipedia: Mary HarronBracebridge, Ontario, Canada
# Found birthplace on Wikipedia: Mel GibsonPeekskill, New York, U.S.
# Found birthplace on Wikipedia: Mia Hansen-LøveParis, France
# Found birthplace on Wikipedia: Michael BayLos Angeles, California, U.S.
# Found birthplace on Wikipedia: Michael HanekeMunich, Germany
# Found birthplace on Wikipedia: Michael MannChicago, Illinois, U.S.
# Found birthplace on Wikipedia: Michael MooreFlint, Michigan, U.S.
# Found birthplace on Wikipedia: Michael WinterbottomBlackburn, Lancashire, England
# Found birthplace on Wikipedia: Michel GondryVersailles, France
# Found birthplace on Wikipedia: Michel HazanaviciusParis, France
# Found birthplace on Wikipedia: Michelangelo AntonioniFerrara, Kingdom of Italy
# Found birthplace on Wikipedia: Michelangelo FrammartinoMilan, Italy
# Found birthplace on Wikipedia: Miguel ArtetaSan Juan, Puerto Rico
# Found birthplace on IMDb: Miguel GomesLisbon, Portugal
# Found birthplace on Wikipedia: Mike JudgeGuayaquil, Ecuador
# Found birthplace on Wikipedia: Mike LeighWelwyn Garden City, England
# Found birthplace on IMDb: Mike Mills1966, Berkeley, California
# Found birthplace on Wikipedia: Mira NairRourkela, Orissa, India
# Found birthplace on Wikipedia: Miranda JulyBarre, Vermont, US
# Found birthplace on Wikipedia: Mohamed DiabIsmailia, Egypt
# No birthplace found for Mohamed Soueid !! Try again: Beirut, Lebanon https://www.arabculturefund.org/Programs/Jurors/239#:~:text=Mohamed%20Soueid%20is%20a%20Lebanese,at%20Al%20Arabiya%20News%20Channel.
# Found birthplace on Wikipedia: Mohammad RasoulofShiraz, Imperial State of Iran
# Found birthplace on Wikipedia: Mojtaba MirtahmasbKerman, Iran
# Found birthplace on Wikipedia: Nadine LabakiBaabdat,  Lebanon[1]
# No birthplace found for Naji Abou Nowar!! Try again: Oxford, United Kingdom https://en.wikipedia.org/wiki/Naji_Abu_Nowar
# Found birthplace on Wikipedia: Nanfu WangJiangxi, China
# Found birthplace on Wikipedia: Nanni MorettiBruneck, South Tyrol, Italy
# Found birthplace on Wikipedia: Naomi KawaseNara, Japan
# Found birthplace on Wikipedia: Neill BlomkampJohannesburg, Transvaal, South Africa
# Found birthplace on IMDb: Newton I. Aduaka !! No official info, online says: Ogidi, Nigeria
# Found birthplace on Wikipedia: Ngozi OnwurahNigeria
# Found birthplace on Wikipedia: Nick CassavetesNew York City, New York, U.S.
# Found birthplace on Wikipedia: Nicolas Winding RefnCopenhagen, Denmark
# Found birthplace on IMDb: Nikolaus Geyrhalter1972 in Vienna, Austria
# Found birthplace on IMDb: Nina PaleyChampaign, Illinois, USA
# Found birthplace on Wikipedia: Noah BaumbachNew York City, U.S.
# No birthplace found for Norbert Pfaffenbichler !! Steyr,austria Try again: https://www.themoviedb.org/person/1270259-norbert-pfaffenbichler?language=en-US
# Found birthplace on Wikipedia: Nouri BouzidSfax, Tunisia
# Found birthplace on Wikipedia: Nuri Bilge CeylanIstanbul, Turkey
# Found birthplace on Wikipedia: Oliver SchmitzCape Town, South Africa
# Found birthplace on Wikipedia: Olivier AssayasParis, France
# Found birthplace on Wikipedia: Ossama MohammedLatakia, Syria
# No birthplace found for Ousmane Sembène !! Try again: Ziguinchor, Senegal https://en.wikipedia.org/wiki/Ousmane_Semb%C3%A8ne
# Found birthplace on Wikipedia: Pablo BergerBilbao, Spain
# Found birthplace on Wikipedia: Pablo LarraínSantiago, Chile
# No birthplace found for Pacho Velez !! No info 
# Found birthplace on Wikipedia: Paolo BenvenutiPisa, Italy
# Found birthplace on Wikipedia: Paolo SorrentinoNaples, Campania, Italy
# Found birthplace on Wikipedia: Park Chan-wookSeoul, South Korea
# Found birthplace on Wikipedia: Patricio GuzmánSantiago, Chile
# No birthplace found for Patrick Wang !! Try again: houston, Texas https://clermont-filmfest.org/en/film-director-patrick-wang-in-the-international-jury-2016/#:~:text=Patrick%20Wang%20was%20born%20in%201976%20in,Intitute%20of%20Technology%20with%20a%20degree%20in
# Found birthplace on Wikipedia: Patty JenkinsVictorville, California, U.S.
# Found birthplace on Wikipedia: Paul FeigRoyal Oak, Michigan, U.S.
# Found birthplace on Wikipedia: Paul GreengrassCheam, Surrey, England
# Found birthplace on Wikipedia: Paul Thomas AndersonLos Angeles, California, U.S.
# Found birthplace on Wikipedia: Paweł PawlikowskiWarsaw, Poland
# Found birthplace on Wikipedia: Pedro AlmodóvarCalzada de Calatrava, Spain
# Found birthplace on Wikipedia: Pedro CostaLisbon, Portugal
# Found birthplace on Wikipedia: Pete DocterBloomington, Minnesota, U.S.
# Found birthplace on Wikipedia: Peter JacksonWellington, New Zealand
# Found birthplace on Wikipedia: Peter MullanPeterhead, Aberdeenshire, Scotland
# Found birthplace on IMDb: Peter Tscherkassky1958 in Vienna
# Found birthplace on Wikipedia: Peter WatkinsNorbiton, Surrey, England
# Found birthplace on Wikipedia: Peter WeirSydney, Australia
# No birthplace found for Phil Solomon !! Try again: Manhattan, New York City, US https://en.wikipedia.org/wiki/Phil_Solomon_(filmmaker)
# Found birthplace on IMDb: Philip GröningDüsseldorf, West Germany
# Found birthplace on Wikipedia: Philip KaufmanChicago, Illinois, U.S.
# Found birthplace on Wikipedia: Philippe GarrelBoulogne-Billancourt, France
# Found birthplace on IMDb: Philippe GrandrieuxBelgium
# Found birthplace on Wikipedia: Pietro MarcelloCaserta, Italy
# No birthplace found for Pippo Delbono !!No info: online says Varazze, Italy
# Found birthplace on Wikipedia: Pirjo HonkasaloHelsinki, Finland
# Found birthplace on Wikipedia: Quentin TarantinoKnoxville, Tennessee, U.S.
# Found birthplace on Wikipedia: Raam ReddyKarnataka, India
# Found birthplace on Wikipedia: Rachid BoucharebParis, France
# Found birthplace on Wikipedia: Rajat KapoorNew Delhi, India
# Found birthplace on Wikipedia: Ramin BahraniWinston-Salem, North Carolina, U.S.
# No birthplace found for Raoul Peck !! Try again: Port-au-Prince, Haiti https://en.wikipedia.org/wiki/Raoul_Peck
# Found birthplace on Wikipedia: Raoul RuizPuerto Montt, Chile
# No birthplace found for Rehad Desai !! NO info but online says Cape Town, South Africa
# Found birthplace on Wikipedia: Rian JohnsonSilver Spring, Maryland, U.S.
# Found birthplace on IMDb: Richard CurtisWellington, New Zealand
# Found birthplace on IMDb: Richard KellyNewport News, Virginia, the son of Lane and Ennis Kelly
# Found birthplace on Wikipedia: Richard LinklaterHouston, Texas, U.S.
# Found birthplace on Wikipedia: Rick AlversonSpokane, Washington, U.S.
# Found birthplace on Wikipedia: Ridley ScottSouth Shields, Tyne and Wear, England
# No birthplace found for Rita Azevedo Gomes !! No info: but online says  Lisbon, Portugal
# Found birthplace on Wikipedia: Rob MarshallMadison, Wisconsin, U.S.
# Found birthplace on Wikipedia: Robert AltmanKansas City, Missouri, U.S.
# No birthplace found for Robert Greene !!Try again: Charlotte, North Carolina https://en.wikipedia.org/wiki/Robert_Greene_(filmmaker)
# Found birthplace on IMDb: Robert PulciniNew York City, New York, USA
# Found birthplace on Wikipedia: Robert ZemeckisChicago, Illinois, U.S.
# Found birthplace on Wikipedia: Roberto BenigniCastiglion Fiorentino, Tuscany, Italy
# Found birthplace on Wikipedia: Roman PolanskiParis, France
# Found birthplace on Wikipedia: Ronit ElkabetzBeersheba, Israel
# Found birthplace on Wikipedia: Roy AnderssonGothenburg, Sweden
# Found birthplace on Wikipedia: Royston TanSingapore
# Found birthplace on Wikipedia: Ryan CooglerOakland, California, U.S.
# Found birthplace on Wikipedia: S. Craig ZahlerMiami, Florida, U.S.
# Found birthplace on Wikipedia: Sam MendesReading, Berkshire, England
# Found birthplace on Wikipedia: Sam RaimiRoyal Oak, Michigan, U.S.
# Found birthplace on Wikipedia: Samba GadjigoKidira, Senegal
# Found birthplace on IMDb: Sara Fattahi1983 in Damascus, Syria
# Found birthplace on Wikipedia: Sarah PolleyToronto, Ontario, Canada
# Found birthplace on Wikipedia: Satoshi KonSapporo, Hokkaido, Japan
# Found birthplace on Wikipedia: Scandar CoptiJaffa
# Found birthplace on IMDb: Sean Bakerthe Bronx, New York
# Found birthplace on Wikipedia: Seth RogenVancouver, British Columbia, Canada
# Found birthplace on Wikipedia: Shane BlackPittsburgh, Pennsylvania, U.S.
# Found birthplace on Wikipedia: Shane CarruthMyrtle Beach, South Carolina, U.S.
# Found birthplace on Wikipedia: Shane MeadowsUttoxeter, Staffordshire, England
# Found birthplace on IMDb: Shari Springer BermanNew York City, New York, USA
# Found birthplace on Wikipedia: Shimit AminKampala, Uganda
# Found birthplace on Wikipedia: Shlomi ElkabetzBeersheba, Israel[1]
# Found birthplace on Wikipedia: Sidney LumetPhiladelphia, Pennsylvania, U.S.
# Found birthplace on Wikipedia: Sion SonoToyokawa, Aichi, Japan
# Found birthplace on Wikipedia: Sofia CoppolaNew York City, U.S.
# Found birthplace on Wikipedia: Spike JonzeNew York City, U.S.
# Found birthplace on Wikipedia: Spike LeeAtlanta, Georgia, U.S.
# No birthplace found for Stephanie Spray !! No info but online says: Minnesota, US 
# Found birthplace on Wikipedia: Stephen ChowBritish Hong Kong
# Found birthplace on Wikipedia: Stephen DaldryDorset, England
# Found birthplace on Wikipedia: Stephen FrearsLeicester, England
# No birthplace found for Stevan Riley!! No info but online says: Northern Ireland, United Kingdom
# Found birthplace on Wikipedia: Steve McQueenBeech Grove, Indiana, U.S.
# Found birthplace on IMDb: Steven Knightthe Los Angeles area, Steve enjoyed police work
# Found birthplace on Wikipedia: Steven SoderberghAtlanta, Georgia, U.S.
# Found birthplace on Wikipedia: Steven SpielbergCincinnati, Ohio, US
# Found birthplace on Wikipedia: Sudhir MishraLucknow, Uttar Pradesh, India
# Found birthplace on Wikipedia: Susanne BierCopenhagen, Denmark
# Found birthplace on Wikipedia: Takahisa ZezeTochigi, Japan
# Found birthplace on Wikipedia: Takeshi KitanoAdachi, Tokyo, Japan
# Found birthplace on Wikipedia: Tan Pin PinSingapore
# Found birthplace on Wikipedia: Terence DaviesLiverpool, England
# Found birthplace on Wikipedia: Terrence MalickOttawa, Illinois, U.S.
# Found birthplace on Wikipedia: Terry GeorgeBelfast, Northern Ireland
# Found birthplace on Wikipedia: Terry ZwigoffAppleton, Wisconsin, U.S.
# Found birthplace on IMDb: Thom Andersen1943 in Chicago, Illinois, USA
# Found birthplace on Wikipedia: Thomas VinterbergFrederiksberg, Denmark
# Found birthplace on Wikipedia: Tigmanshu DhuliaAllahabad, Uttar Pradesh, India
# Found birthplace on Wikipedia: Todd HaynesLos Angeles, California, U.S.
# Found birthplace on Wikipedia: Tom FordAustin, Texas, U.S.
# Found birthplace on Wikipedia: Tom HooperLondon, England
# No birthplace found for Tom McCarthy!! Try again" New Providence, New Jersey, US https://en.wikipedia.org/wiki/Tom_McCarthy_(director)
# Found birthplace on IMDb: Tom McGrathRutherglen, Scotland, UK
# Found birthplace on Wikipedia: Tomas AlfredsonLidingö, Sweden
# Found birthplace on Wikipedia: Tomm MooreNewry, Northern Ireland, U.K.
# Found birthplace on Wikipedia: Tommy Lee JonesSan Saba, Texas, U.S.
# Found birthplace on Wikipedia: Tony GilroyNew York City, U.S.
# Found birthplace on Wikipedia: Tony ScottTynemouth, England
# Found birthplace on IMDb: Travis Wilkerson1969 in Denver, Colorado, USA
# Found birthplace on Wikipedia: Trey ParkerConifer, Colorado, U.S.
# Found birthplace on Wikipedia: Tsai Ming-liangKuching, Crown Colony of Sarawak (present-day Kuching, Malaysia)
# Found birthplace on Wikipedia: Uli EdelNeuenburg am Rhein, Germany
# Found birthplace on Wikipedia: Ulrich SeidlVienna,[1] Austria
# No birthplace found for Valeska Grisebach !! Try again  Bremen, West Germany [now Germany Try again: https://www.imdb.com/name/nm1007136/
# No birthplace found for Verena Paravel!! Try again: Neuchâtel, Switzerland, https://en.wikipedia.org/wiki/V%C3%A9r%C3%A9na_Paravel
# Found birthplace on Wikipedia: Vicky JensonLos Angeles, California, U.S.
# Found birthplace on Wikipedia: Vikramaditya MotwaneBombay, Maharashtra, India
# Found birthplace on Wikipedia: Vincent ParonnaudLa Rochelle, France
# Found birthplace on Wikipedia: Vishal BhardwajChandpur, Uttar Pradesh, India
# Found birthplace on Wikipedia: Víctor EriceKarrantza, Biscay, Spain
# Found birthplace on Wikipedia: Víctor GaviriaLiborina, Antioquia, Colombia
# Found birthplace on IMDb: Wang Bing1930 in Beijing, China
# Found birthplace on Wikipedia: Warwick ThorntonAlice Springs, Northern Territory, Australia
# Found birthplace on IMDb: Wayne BlairTaree, New South Wales, Australia
# Found birthplace on Wikipedia: Werner HerzogMunich, German Reich
# Found birthplace on Wikipedia: Wes AndersonHouston, Texas, U.S.
# Found birthplace on Wikipedia: Whit StillmanWashington, D.C.
# Found birthplace on IMDb: Wolfgang BeckerBerlin, Germany
# Found birthplace on Wikipedia: Wong Kar-waiShanghai, China
# Found birthplace on Wikipedia: Woody AllenNew York City, U.S.
# Found birthplace on Wikipedia: Xavier BeauvoisAuchel, France
# Found birthplace on Wikipedia: Xavier DolanMontreal, Quebec, Canada
# Found birthplace on Wikipedia: Yann Arthus-BertrandParis, France
# Found birthplace on IMDb: Yaron Shani1973 in Tel Aviv, Israel
# No birthplace found for Yervant Gianikian !!! Meran, Tyrol, Austria-Hungary [now Merano, Alto Adige, Italy] Try again: https://www.imdb.com/name/nm0316214/ 
# Found birthplace on Wikipedia: Yorgos LanthimosAthens, Greece
# Found birthplace on Wikipedia: Zhang YimouXi'an, Shaanxi, China
# Found birthplace on IMDb: Ágnes HranitzkyDerecske, Hungary
# Found birthplace on Wikipedia: Éric RohmerTulle, France

In [151]:
# from playwright.sync_api import sync_playwright
# import time
# from bs4 import BeautifulSoup

# def run():
#     with sync_playwright() as p:
#         # Launch the browser
#         browser = p.chromium.launch(headless=False)  # Set headless=False to see the browser
#         page = browser.new_page()

#         # Navigate to the webpage
#         page.goto("https://www.imdb.com/name/nm0000186/", wait_until="domcontentloaded")  # Replace with your URL

#         # Click the button (use the appropriate selector for your button)
#         page.get_by_text("Trademarks").click()
#         print("button clicked")

#         print("time start")
#         time.sleep(3)
#         print("time up")
        
#         soup = BeautifulSoup(page.content(), 'html.parser')

#         result = soup.find_all("div", class_="ipc-list-card--border-line ipc-list-card--tp-none ipc-list-card--bp-none ipc-list-card--base ipc-list-card sc-63e3f2a7-0 gnTrvA")
#         print(result)

#         # # Close the browser
#         # browser.close()

# #if __name__ == "__main__":

In [71]:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Christopher_Nolan"

response = requests.get(url)
soup_doc = BeautifulSoup(response.content)
birthplace = soup_doc.find('div', class_="birthplace")
if birthplace: 
    birthplace_strip = birthplace.text.strip()
    print(birthplace_strip)

else: 
    print(f"Null")

London, England


In [66]:
{"director":"David Lynch","link":"/name/nm0000186/?ref_=sr_t_1"}

{'director': 'David Lynch', 'link': '/name/nm0000186/?ref_=sr_t_1'}

In [67]:
#step 1: loop through director list and search for the link
#get a list of dictionaries (table) with just name and link to imdb page
for director in d_list:
    url = "https://en.wikipedia.org/wiki/" + director.replace(" ","_")
    soup_doc = BeautifulSoup(url, "html.parser")
    birthplace = soup_doc.find('div', class_="birthplace")
    birthplace.text.strip()
    print(birthplace)


  soup_doc = BeautifulSoup(url, "html.parser")


AttributeError: 'NoneType' object has no attribute 'text'

In [None]:
import requests

In [None]:
#step 2: loop through that list of dicts 
#going to each individual and page adding to the dict


In [None]:
raw_html = requests.get(url).content
raw_html

Getting more Director Info from IMDB

In [None]:
#tell imdb you are using a browser!!
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
#possible urls to use
# url ="https://www.imdb.com/search/title/?name=spirited%20away&title_type=feature"
url = "https://www.imdb.com/search/name/?name=David%20Lynch"
#add headers to request
raw_html = requests.get(url,headers=head).content
#save html file
with open('imdb.html', 'wb+') as f:
    f.write(raw_html)
soup_doc = BeautifulSoup(raw_html, "html.parser")
links = soup_doc.find_all('ul', class_="ipc-metadata-list ipc-metadata-list--dividers-none ipc-metadata-list--base")
links

# #creating links 
# base = "https://www.imdb.com/" 
# full_URL = base + link['href']
# print(full_URL)

# #fetching nationality
# full_URL = requests.get(full_URL,headers=head).content
# soup_doc = BeautifulSoup(full_URL, "html.parser")
# nationality = soup_doc.find_all('a',class_="ipc-overflowText-overlay bio-overflowtext-overlay")
# print(nationality)

# for bio in nationality: 
#     print (bio.get('href'))

# bio_URL = base + bio.get('href')
# bio_URL = requests.get(bio_URL,headers=head).content
# soup_doc = BeautifulSoup(full_URL, "html.parser")
# nationality = soup_doc.find_all('div', data-testid="sub-section-overview")
# nationality

In [None]:
# from bs4 import BeautifulSoup
# import time 
# import requests
# url = "https://www.imdb.com/name/nm0000186/?ref_=nmqu_ov"
# headers = {
#     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
# }
# response = requests.get(url, headers=headers)
# soup = BeautifulSoup(response.content, "html.parser")
# time.sleep(2)
# trademarks_section = soup.find_all("li", class_="ipc-metadata-list__item ipc-metadata-list__item--expanded ipc-metadata-list-item--expandable")
# print(trademarks_section)
# if trademarks_section:
#     # Extract all list items (<li>) from the section
#     trademarks = trademarks_section.find_all("li")
#     print("Trademarks:")
#     for trademark in trademarks:
#         print("-", trademark)
# else:
#     print("Trademarks section not found.")

In [None]:
# import asyncio
# from playwright.async_api import async_playwright
# from bs4 import BeautifulSoup
# import nest_asyncio

# # Allow nested asyncio loops (needed for Jupyter Notebook)
# nest_asyncio.apply()

# async def fetch_trademarks():
#     url = "https://www.imdb.com/name/nm0000186/?ref_=nmqu_ov"

#     async with async_playwright() as p:
#         # Launch the browser
#         browser = await p.chromium.launch(headless=False)
#         page = await browser.new_page()

#         # Navigate to the IMDb URL
#         await page.goto(url)

#         # Increase timeout and check for visibility
#         try:
#             await page.wait_for_selector('li[data-testid="name-dyk-trademarks"]', timeout=500)
#         except Exception as e:
#             print(f"Error waiting for selector: {e}")
#             print(await page.content())  # Print page content for debugging
#             await browser.close()
#             return

#         # Get the full page content after rendering
#         html = await page.content()
#         await browser.close()

#     # Parse the HTML with BeautifulSoup
#     soup = BeautifulSoup(html, "html.parser")

#     if trademarks_section:
#     # Debug: Print the raw HTML of the Trademarks section
#         print("Trademarks section HTML:")
#         print(trademarks_section.prettify())

#     # Extract all visible text, including nested elements
#     trademarks = list(trademarks_section.stripped_strings)

#     print("Trademarks:")
#     for trademark in trademarks:
#         print("-", trademark)
#     else:
#         print("Trademarks section not found.")

In [None]:
!pip install nest_asyncio


In [None]:
#tell imdb you are using a browser!!
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
#possible urls to use
# url ="https://www.imdb.com/search/title/?name=spirited%20away&title_type=feature"
url = "https://en.wikipedia.org/wiki/David_Lynch"
#add headers to request
raw_html = requests.get(url,headers=head).content
#save html file
with open('imdb.html', 'wb+') as f:
    f.write(raw_html)
soup_doc = BeautifulSoup(raw_html, "html.parser")
birthplace = soup_doc.find('div', class_="birthplace")
birthplace.text.strip()