# Pixar Web Scrape

### We will need to create a data set

To gather the Pixar movie information we will scrape the IMDB movie database. We will begin by importinging beautiful soup, pandas and requests.

In [1]:
import bs4
import pandas as pd
import requests

We will create a function that will extract the html on a webpage into a BeautifulSoup object.

In [2]:
def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return bs4.BeautifulSoup(page.text, "html.parser")

Next we will create a few more functions to scrape numeric movie data, text values and nested values.

In [3]:
def numeric_value(movie, tag, class_=None, order=None):
    if order:
        if len(movie.findAll(tag, class_)) > 1:
            to_extract = movie.findAll(tag, class_)[order]['data-value']
        else:
            to_extract = None
    else:
        to_extract = movie.find(tag, class_)['data-value']

    return to_extract

In [4]:
def text_value(movie, tag, class_=None):
    if movie.find(tag, class_):
        return movie.find(tag, class_).text
    else:
        return

In [5]:
def nested_text_value(movie, tag_1, class_1, tag_2, class_2, order=None):
    if not order:
        try:
            return movie.find(tag_1, class_1).find(tag_2, class_2).text
        except:
            return ""
    else:
        return [val.text for val in movie.find(tag_1, class_1).findAll(tag_2, class_2)[order]]

We will now create a function that will run the previous functions depending on need.

In [6]:
def extract_attribute(soup, tag_1, class_1='', tag_2='', class_2='',
                      text_attribute=True, order=None, nested=False):
    movies = soup.findAll('div', class_='lister-item-content')
    data_list = []
    for movie in movies:
        if text_attribute:
            if nested:
                data_list.append(nested_text_value(movie, tag_1, class_1, tag_2, class_2, order))
            else:
                data_list.append(text_value(movie, tag_1, class_1))
        else:
            data_list.append(numeric_value(movie, tag_1, class_1, order))

    return data_list

Our final function will create a dictionary with the information scraped as well as two functions to handle specific cases not covered by our previous functions.

In [7]:
def create_dict():
    title = extract_attribute(soup, 'a')
    release = extract_attribute(soup, 'span', 'lister-item-year text-muted unbold')
    audience_rating = extract_attribute(soup, 'span', 'certificate')
    runtime = extract_attribute(soup, 'span', 'runtime')
    genre = extract_attribute(soup, 'span', 'genre')
    imdb_rating = extract_attribute(soup, 'div', 'inline-block ratings-imdb-rating', False)
    metascore = extract_attribute(soup, 'div', 'inline-block ratings-metascore', False)
    votes = extract_attribute(soup, 'span' , {'name' : 'nv'}, False, 0)
    earnings = extract_attribute(soup, 'span' , {'name' : 'nv'}, False, 1)
    directors = extract_attribute(soup, 'p', '', 'a', '', True, 0, True)
    actors = extract_attribute(soup, 'p', '', 'a', '', True, slice(1, 5, None), True)
    
    movies = soup.findAll('div', class_='lister-item-content')
    imdb_id = []
    for movie in movies:
        imdb_id.append(soup.find('h3').a['href'].split('/')[2])
    
    movies = soup.findAll('div', class_='lister-item-content')
    description = []
    for movie in movies:
        description.append(movie.findAll('p', class_='text-muted')[-1].text.lstrip())
    
    df_dict = {'IMDB ID': imdb_id, 'Title': title, 'Year': release, 'Audience Rating': audience_rating,
           'Runtime': runtime, 'Genre': genre, 'IMDB Rating': imdb_rating,
           'Votes': votes, 'Box Office Earnings': earnings, 'Description' : description, 'Metascore': metascore, 'Director': directors,
           'Actors': actors}
    
    return df_dict

We will now run the get_page_contents function and connect to the IMDB Pixar animations studios movie search pages 1 and 2.

In [8]:
soup = get_page_contents('https://www.imdb.com/search/title/?companies=co0017902&ref_=adv_prv')

df_dict2 = {}
df_dict2.update(create_dict())

In [9]:
df = pd.DataFrame(df_dict2)
df

Unnamed: 0,IMDB ID,Title,Year,Audience Rating,Runtime,Genre,IMDB Rating,Votes,Box Office Earnings,Description,Metascore,Director,Actors
0,tt7146812,Onward,(I) (2020),PG,102 min,"\nAnimation, Adventure, Comedy",\n\n7.5\n,70107.0,70107.0,Two elven brothers embark on a quest to bring ...,\n61 \n Metascore\n,Dan Scanlon,"[Tom Holland, Chris Pratt, Julia Louis-Dreyfus..."
1,tt7146812,Baby Driver,(2017),R,113 min,"\nAction, Crime, Drama",\n\n7.6\n,416911.0,416911.0,After being coerced into working for a crime b...,\n86 \n Metascore\n,Edgar Wright,"[Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Go..."
2,tt7146812,Toy Story 4,(2019),G,100 min,"\nAnimation, Adventure, Comedy",\n\n7.8\n,184996.0,184996.0,"When a new toy called ""Forky"" joins Woody and ...",\n84 \n Metascore\n,Josh Cooley,"[Tom Hanks, Tim Allen, Annie Potts, Tony Hale]"
3,tt7146812,Coco,(I) (2017),PG,105 min,"\nAnimation, Adventure, Family",\n\n8.4\n,346051.0,346051.0,"Aspiring musician Miguel, confronted with his ...",\n81 \n Metascore\n,Lee Unkrich,"[Adrian Molina, Anthony Gonzalez, Gael García ..."
4,tt7146812,Cars,(2006),G,117 min,"\nAnimation, Comedy, Family",\n\n7.1\n,355337.0,355337.0,A hot-shot race-car named Lightning McQueen ge...,\n73 \n Metascore\n,John Lasseter,"[Joe Ranft, Owen Wilson, Bonnie Hunt, Paul New..."
5,tt7146812,Inside Out,(I) (2015),PG,95 min,"\nAnimation, Adventure, Comedy",\n\n8.1\n,587600.0,587600.0,After young Riley is uprooted from her Midwest...,\n94 \n Metascore\n,Pete Docter,"[Ronnie Del Carmen, Amy Poehler, Bill Hader, L..."
6,tt7146812,Soul,(2020),,90 min,"\nAnimation, Adventure, Comedy",,,,A musician who has lost his passion for music ...,,Pete Docter,"[Kemp Powers, Jamie Foxx, Tina Fey, Quest Love]"
7,tt7146812,Ratatouille,(2007),G,111 min,"\nAnimation, Adventure, Comedy",\n\n8.0\n,620890.0,620890.0,A rat who can cook makes an unusual alliance w...,\n96 \n Metascore\n,Brad Bird,"[Jan Pinkava, Brad Garrett, Lou Romano, Patton..."
8,tt7146812,Incredibles 2,(2018),PG,118 min,"\nAnimation, Action, Adventure",\n\n7.6\n,236615.0,236615.0,The Incredibles hero family takes on a new mis...,\n80 \n Metascore\n,Brad Bird,"[Craig T. Nelson, Holly Hunter, Sarah Vowell, ..."
9,tt7146812,The Incredibles,(2004),PG,115 min,"\nAnimation, Action, Adventure",\n\n8.0\n,639378.0,639378.0,"A family of undercover superheroes, while tryi...",\n90 \n Metascore\n,Brad Bird,"[Craig T. Nelson, Samuel L. Jackson, Holly Hun..."


In [10]:
soup = get_page_contents('https://www.imdb.com/search/title/?companies=co0017902&start=51&ref_=adv_nxt')

df_dict2 = {}
df_dict2.update(create_dict())

In [11]:
df2 = pd.DataFrame(df_dict2)
df2

Unnamed: 0,IMDB ID,Title,Year,Audience Rating,Runtime,Genre,IMDB Rating,Votes,Box Office Earnings,Description,Metascore,Director,Actors
0,tt2033372,Toy Story Toons: Small Fry,(2011),G,7 min,"\nAnimation, Short, Comedy",\n\n7.1\n,5541,5541,A fast food restaurant mini variant of Buzz fo...,,Angus MacLane,"[Dylan Brown, Tom Hanks, Tim Allen, Wallace Sh..."
1,tt2033372,Loop,(I) (2020),PG,9 min,"\nAnimation, Short, Adventure",\n\n6.8\n,559,559,"In LOOP, two kids at canoe camp find themselve...",,Erica Milsom,"[Madison Bandy, Christiano Delgado, Louis Gonz..."
2,tt2033372,Mater's Tall Tales,(2008–2014),Not Rated,36 min,"\nAnimation, Adventure, Comedy",\n\n6.9\n,2555,2555,Cruise into the crazy adventures of Tow Mater ...,,Larry the Cable Guy,"[Keith Ferguson, Elissa Knight, Lindsey Collins]"
3,tt2033372,Making Waves: The Art of Cinematic Sound,(2019),Unrated,94 min,\nDocumentary,\n\n7.5\n,864,864,An exploration of the history and emotional po...,\n80 \n Metascore\n,Midge Costin,"[Erik Aadahl, Ioan Allen, Richard L. Anderson,..."
4,tt2033372,Jimmy's Hall,(2014),PG-13,109 min,"\nBiography, Drama, History",\n\n6.7\n,5752,5752,"During the Depression, Jimmy Gralton returns h...",\n63 \n Metascore\n,Ken Loach,"[Barry Ward, Francis Magee, Aileen Henry, Simo..."
5,tt2033372,Lava,(2014),G,7 min,"\nAnimation, Short, Family",\n\n7.2\n,14171,14171,A story that takes place over millions of year...,,James Ford Murphy,"[Napua Greig, Kuana Torres Kahele]"
6,tt2033372,Buzz Lightyear of Star Command: The Adventure ...,(2000 Video),Not Rated,70 min,"\nAnimation, Action, Adventure",\n\n6.2\n,4043,4043,Buzz Lightyear must battle Emperor Zurg with t...,,Tad Stones,"[Tim Allen, Nicole Sullivan, Larry Miller, Ste..."
7,tt2033372,LEGO The Incredibles,(2018 Video Game),E10+,,"\nAction, Adventure, Comedy",\n\n7.1\n,207,207,A new game where players take control of their...,,Pete Gomer,"[Maeve Andrews, Jonathan Banks, John Eric Bent..."
8,tt2033372,Riley's First Date?,(2015 Video),G,5 min,"\nAnimation, Short, Comedy",\n\n7.5\n,5829,5829,"Riley, now 12, who is hanging out with her par...",,Josh Cooley,"[Pete Docter, Ben Cox, Kyle MacLachlan, Diane ..."
9,tt2033372,Purl,(2018),PG,8 min,"\nAnimation, Short, Comedy",\n\n6.5\n,1982,1982,An earnest ball of yarn named Purl gets a job ...,,Kristen Lester,"[Bret 'Brook' Parker, Emily Davis, Michael Dal..."


We will now combine the data from the two pages.

In [12]:
combined_data = pd.concat([df, df2], ignore_index = True, sort=True)

In [13]:
clean_df = combined_data

We will now clean the data.

In [14]:
clean_df['Year'] = clean_df['Year'].str.slice(start=-5, stop=-1)

In [15]:
clean_df['IMDB Rating'] = clean_df['IMDB Rating'].str.replace('\n',"")

In [16]:
clean_df['Metascore'] = clean_df['Metascore'].str.replace('\n',"")
clean_df['Metascore'] = clean_df['Metascore'].str.replace('Metascore',"")

In [17]:
clean_df['Runtime'] = clean_df['Runtime'].str.replace('min',"")

In [18]:
clean_df['Genre'] = clean_df['Genre'].str.slice(start=1)

In [19]:
clean_df['Studio'] = "Pixar Animation Studio"

In [20]:
clean_df.head()

Unnamed: 0,Actors,Audience Rating,Box Office Earnings,Description,Director,Genre,IMDB ID,IMDB Rating,Metascore,Runtime,Title,Votes,Year,Studio
0,"[Tom Holland, Chris Pratt, Julia Louis-Dreyfus...",PG,70107,Two elven brothers embark on a quest to bring ...,Dan Scanlon,"Animation, Adventure, Comedy",tt7146812,7.5,61,102,Onward,70107,2020,Pixar Animation Studio
1,"[Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Go...",R,416911,After being coerced into working for a crime b...,Edgar Wright,"Action, Crime, Drama",tt7146812,7.6,86,113,Baby Driver,416911,2017,Pixar Animation Studio
2,"[Tom Hanks, Tim Allen, Annie Potts, Tony Hale]",G,184996,"When a new toy called ""Forky"" joins Woody and ...",Josh Cooley,"Animation, Adventure, Comedy",tt7146812,7.8,84,100,Toy Story 4,184996,2019,Pixar Animation Studio
3,"[Adrian Molina, Anthony Gonzalez, Gael García ...",PG,346051,"Aspiring musician Miguel, confronted with his ...",Lee Unkrich,"Animation, Adventure, Family",tt7146812,8.4,81,105,Coco,346051,2017,Pixar Animation Studio
4,"[Joe Ranft, Owen Wilson, Bonnie Hunt, Paul New...",G,355337,A hot-shot race-car named Lightning McQueen ge...,John Lasseter,"Animation, Comedy, Family",tt7146812,7.1,73,117,Cars,355337,2006,Pixar Animation Studio


In [21]:
clean_df = clean_df[['IMDB ID', 'Title', 'Year', 'Genre', 'Audience Rating', 'Description', 'Studio', 'Director', 'Actors', 'Box Office Earnings', 'Metascore', 'IMDB Rating', 'Votes']]

Now that we are done we will save the file.

In [22]:
save_path = r"C:\Users\Basil\Documents\Data Science\Projects\20200521 Disney\1. Original Data\Pixar Animation Studios.csv"
clean_df.to_csv(save_path)