## Getting Started

In this web-scraping project, we will do the following:
- Scrape the list of Walt Disney movies
- Clean the data
- Use regex to filter data
- Access additional data through an API
- Save the data in multiple file formats

![Web-Scraping](/images/ws/ws1.jpg)

## Importing Libraries

In [1]:
from bs4 import BeautifulSoup as bs
import requests

## Scraping data from webpage

![List-of-WaltDisneyMovies](/images/ws/ws2.jpg)

In [103]:
# func for filtering contents of webpage
def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    elif row_data.find("br"):
        return [text for text in row_data.stripped_strings]
    else:
        return row_data.get_text(" ", strip=True).replace('\xa0', ' ')

# remove redundant tags
def clean_tags(soup):
    for tag in soup.find_all(["sup", "span"]):
        tag.decompose()

# func for extracting info box of movie page
def get_info_box(url):

    r = requests.get(url)
    soup = bs(r.content)
  
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")

    clean_tags(soup)

    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            # the first row holds the movie title; later cells read it back as movie['title']
            movie_info['title'] = row.find("th").get_text(" ", strip=True)
        else:
            header = row.find('th')
            if header:
                content_key = header.get_text(" ", strip=True)
                content_value = get_content_value(row.find("td"))
                movie_info[content_key] = content_value

    return movie_info

![Info-box](/images/ws/ws3.jpg)

In [104]:
# scraping info box for all movies using for loop
r = requests.get('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films')
soup = bs(r.content)
movies = soup.select('.wikitable.sortable i a')
base_path = 'https://en.wikipedia.org'  # each href already starts with '/wiki/...'

movie_info_list = []

for index, movie in enumerate(movies):
    if index % 10 == 0:
        print(index)
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        movie_info_list.append(get_info_box(full_path))

    except Exception as e:
        print(index, movie['title'])
        print(e)

0
10
20
30
40
43 Zorro (1957 TV series)
'NoneType' object has no attribute 'find'
48 Zorro (1957 TV series)
'NoneType' object has no attribute 'find'
50
60
70
80
90
100
110
120
124 True-Life Adventures
'NoneType' object has no attribute 'find_all'
130
140
145 The Omega Connection
'NoneType' object has no attribute 'find'
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
487 Strange World (film)
'NoneType' object has no attribute 'find_all'
490
500
505 Sister Act 3
'NoneType' object has no attribute 'find'
508 The Twilight Zone Tower of Terror
'NoneType' object has no attribute 'find_all'
509 Tron: Ares
'NoneType' object has no attribute 'find'
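
The links collected from the list page are site-relative (they start with `/wiki/`), so one option for building the full URL is `urllib.parse.urljoin` from the standard library, which takes care of doubled or missing slashes. This is a minimal sketch; the helper name `build_full_url` and the example path are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/"

def build_full_url(relative_path, base_url=BASE_URL):
    """Join a site-relative href such as '/wiki/Pinocchio' onto the base URL."""
    return urljoin(base_url, relative_path)

# hypothetical href, used only to show the joined result
print(build_full_url("/wiki/Pinocchio"))
```

With this helper, the trailing slash on the base URL no longer matters, because `urljoin` normalises the boundary between the two parts.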


In [105]:
len(movie_info_list)

502

## Saving data to JSON

In [2]:
import json

# functions for saving and loading files into json
def save_data_json(filename, data):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_data_json(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)

In [115]:
save_data_json('Walt_Disney_Movies(v1).json', movie_info_list)

In [3]:
movie_info_list = load_data_json('Walt_Disney_Movies(v1).json')

## Data Cleaning

Cleaning tasks:
- Convert 'Running time' to an integer
- Convert 'Budget' and 'Box office' to numerical values
- Convert 'Release date' to a datetime object

For each task, we add a new key that holds the converted value, so the original value is preserved.

- Convert 'Running time' to an integer

In [4]:
for movie in movie_info_list:
    print(movie.get('Running time', 'N/A'))

41 minutes (74 minutes 1966 release)
83 minutes
88 minutes
126 minutes
74 minutes
64 minutes
70 minutes
42 minutes
70 min
71 minutes
75 minutes
94 minutes
73 minutes
75 minutes
82 minutes
68 minutes
74 minutes
96 minutes
75 minutes
84 minutes
77 minutes
92 minutes
69 minutes
81 minutes
['60 minutes (VHS version)', '71 minutes (original)']
127 minutes
92 minutes
76 minutes
75 minutes
73 minutes
85 minutes
81 minutes
70 minutes
90 min.
80 minutes
75 minutes
83 minutes
83 minutes
72 minutes
97 minutes
75 minutes
104 minutes
93 minutes
105 minutes
95 minutes
97 minutes
134 minutes
69 minutes
92 minutes
131 minutes
79 minutes
97 minutes
128 minutes
73 minutes
91 minutes
105 minutes
98 minutes
130 minutes
89 min.
93 minutes
67 minutes
98 minutes
100 minutes
118 minutes
103 minutes
110 minutes
80 min.
79 minutes
91 minutes
91 minutes
97 minutes
118 minutes
139 minutes
131 mins.
92 minutes
87 minutes
116 minutes
93 minutes
110 min.
110 min.
131 minutes
101 minutes
108 minutes
84 minutes
78 min

In [7]:
# func for converting running time to an integer, stored under a new key
def convert_min2int(running_time):
    if running_time == 'N/A':
        return None

    # some movies list several running times; use the first one
    if isinstance(running_time, list):
        running_time = running_time[0]

    # '83 minutes' -> 83
    return int(running_time.split(' ')[0])


for movie in movie_info_list:
    movie['Running time(int)'] = convert_min2int(movie.get('Running time', 'N/A'))

In [8]:
movie_info_list[-65]

{'title': 'Cars 3',
 'Directed by': 'Brian Fee',
 'Screenplay by': ['Kiel Murray', 'Bob Peterson', 'Mike Rich'],
 'Story by': ['Brian Fee', 'Ben Queen', 'Eyal Podell', 'Jonathan E. Stewart'],
 'Produced by': 'Kevin Reher',
 'Starring': ['Owen Wilson',
  'Cristela Alonzo',
  'Chris Cooper',
  'Armie Hammer',
  'Larry the Cable Guy',
  'Bonnie Hunt',
  'Nathan Fillion',
  'Lea DeLaria',
  'Kerry Washington'],
 'Cinematography': ['Jeremy Lasky (camera)', 'Kim White (lighting)'],
 'Edited by': 'Jason Hudak',
 'Music by': 'Randy Newman',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['May 23, 2017 ( Kannapolis )',
  'June 16, 2017 (United States)'],
 'Running time': '102 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$175 million',
 'Box office': '$383.9 million',
 'Running time(int)': 102}
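
The running-time strings printed earlier vary in their suffix (`minutes`, `min`, `min.`, `mins.`) and sometimes arrive as a list. As an alternative to splitting on spaces, a regex can pull out the first run of digits wherever it appears. This is only a sketch: the name `running_time_to_int` is hypothetical, and it assumes the first number in the string is the running time in minutes.

```python
import re

def running_time_to_int(running_time):
    """Return the first integer found in a running-time value, or None."""
    # lists hold several versions of the film; use the first entry
    if isinstance(running_time, list):
        running_time = running_time[0] if running_time else None
    if not isinstance(running_time, str):
        return None
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None

# sample values copied from the output above
for sample in ['41 minutes (74 minutes 1966 release)',
               ['60 minutes (VHS version)', '71 minutes (original)'],
               '90 min.',
               'N/A']:
    print(sample, '->', running_time_to_int(sample))
```

Because the function returns `None` when no digits are found, an unexpected value ends up as a missing entry rather than raising an exception.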

- Convert 'Budget' and 'Box office' to numerical values

In [9]:
keys = ['Budget', 'Box office']

for key in keys:
    for movie in movie_info_list:
        print(movie.get(key))

None
$1.49 million
$2.6 million
$2.28 million
$600,000
$950,000
$858,000
None
$788,000
None
$1.35 million
$2.125 million
None
$1.5 million
$1.5 million
None
$2.2 million
$1,800,000
$3 million
None
$4 million
$2 million
$300,000
$1.8 million
None
$5 million
None
$4 million
None
None
None
None
None
None
$700,000
None
None
None
None
None
$6 million
under $1 million or $1,250,000
None
$2 million
None
None
$2.5 million
None
None
$4 million
$3.6 million
None
None
None
None
$3 million
None
$3 million
None
None
None
None
None
None
None
None
None
$3 million
None
None
None
None
$4.4–6 million
None
None
None
None
None
None
None
None
None
None
None
$4 million
None
$5 million
None
None
None
None
$5 million
None
None
None
None
None
None
$4 million
None
None
None
$6.3 million
None
None
None
None
None
None
None
None
$1.5-5 million
None
None
None
None
$8 million
None
None
None
None
None
AU$1 million
None
None
None
None
$5 million
None
None
None
$7.5 million
None
$10 million
None
None
$3.5 to 4 million


In [10]:
# using regex to filter out the numerical values
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

# word syntax: "$12.2 million", "$4.4–6 million", "$3.5 to 4 million"
word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
# value syntax: "$790,000"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    # for a range such as "$250–260 million", the first (lower) number is used
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

def money_conversion(money):
    '''
    money_conversion("$12.2 million") --> 12200000 ## Word syntax
    money_conversion("$790,000") --> 790000        ## Value syntax
    '''
    if money == "N/A":
        return None

    # some movies list several figures; use the first one
    if isinstance(money, list):
        money = money[0]

    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())

    elif value_syntax:
        return parse_value_syntax(value_syntax.group())

    else:
        return None

In [11]:
for movie in movie_info_list:
    movie['Budget(float)'] = money_conversion(movie.get('Budget', 'N/A'))
    movie['Box office(float)'] = money_conversion(movie.get('Box office', 'N/A'))
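
Some values in the data repeat the dollar sign inside a range, for example `$76.4–$83.3 million`. The sketch below is one possible alternative that folds both syntaxes into a single pattern with named groups and, like the functions above, keeps the lower bound of a range. The names `MONEY_RE`, `MULTIPLIERS` and `parse_money` are hypothetical and not part of the original notebook.

```python
import re

MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

MONEY_RE = re.compile(
    r"\$(?P<low>\d+(?:,\d{3})*(?:\.\d+)?)"                # first amount after the dollar sign
    r"(?:\s?(?:-|–|to)\s?\$?\d+(?:,\d{3})*(?:\.\d+)?)?"   # optional upper bound of a range
    r"(?:\s(?P<unit>thousand|million|billion))?",         # optional magnitude word
    flags=re.IGNORECASE,
)

def parse_money(text):
    """Return the (lower-bound) dollar amount in text as a float, or None."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    value = float(match.group("low").replace(",", ""))
    unit = match.group("unit")
    if unit:
        value *= MULTIPLIERS[unit.lower()]
    return value

# sample strings taken from the Budget / Box office values in this post
for sample in ["$12.5 million", "$790,000", "$3.5 to 4 million", "$250–260 million"]:
    print(sample, "->", parse_money(sample))
```

The magnitude word is looked up after the optional range, so `$3.5 to 4 million` is scaled by one million even though the word follows the upper bound.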

In [12]:
movie_info_list[-50]

{'title': 'The Lion King',
 'Directed by': 'Jon Favreau',
 'Screenplay by': 'Jeff Nathanson',
 'Based on': ["Disney 's The Lion King by Irene Mecchi Jonathan Roberts Linda Woolverton"],
 'Produced by': ['Jon Favreau', 'Jeffrey Silver', 'Karen Gilchrist'],
 'Starring': ['Donald Glover',
  'Seth Rogen',
  'Chiwetel Ejiofor',
  'Alfre Woodard',
  'Billy Eichner',
  'John Kani',
  'John Oliver',
  'Beyoncé Knowles-Carter',
  'James Earl Jones'],
 'Cinematography': 'Caleb Deschanel',
 'Edited by': ['Mark Livolsi', 'Adam Gerstel'],
 'Music by': 'Hans Zimmer',
 'Production companies': ['Walt Disney Pictures', 'Fairview Entertainment'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['July 9, 2019 ( Hollywood )',
  'July 19, 2019 (United States)'],
 'Running time': '118 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$250–260 million',
 'Box office': '$1.663 billion',
 'Running time(int)': 118,
 'Budget(float)': 250000000.0,
 'Box o

- Convert 'Release date' to datetime object

In [13]:
for movie in movie_info_list:
    print(movie.get('Release date'))

['May 19, 1937']
['December 21, 1937 ( Carthay Circle Theatre )']
['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)']
['November 13, 1940']
['June 27, 1941']
['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)']
['August 9, 1942 (World Premiere – London)', 'August 13, 1942 (Premiere – New York City)', 'August 21, 1942 (U.S.)']
['August 24, 1942 (World Premiere – Rio de Janeiro)', 'February 6, 1943 (U.S. Premiere – Boston)', 'February 19, 1943 (U.S.)']
['July 17, 1943']
['December 21, 1944 (Mexico City)', 'February 3, 1945 (US)']
['April 20, 1946 (New York City premiere)', 'August 15, 1946 (U.S.)']
['November 12, 1946 (Premiere: Atlanta, Georgia)', 'November 20, 1946', 'March 30, 1947 (Stanford Theatre, Palo Alto, California)']
['September 27, 1947']
May 27, 1948
['November 29, 1948 (Chicago, Illinois)', 'January 19, 1949 (Indianapolis, Indiana)']
['October 5, 1949']
['February 15, 1950 (Boston)', 'March 4, 1950 (United States)']
['June 22, 1950 (World

In [14]:
from datetime import datetime

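def clean_date(date):
    # drop the trailing location note, e.g. 'July 9, 2019 ( Hollywood )' -> 'July 9, 2019'
    return date.split('(')[0].strip()

def date_conversion(date):
    # some movies list several release dates; use the first one
    if isinstance(date, list):
        date = date[0]

    if date == 'N/A':
        return None

    date = clean_date(date)
    formats = ['%B %d, %Y', '%d %B %Y']

    for fmt in formats:
        try:
            return datetime.strptime(date, fmt)
        except ValueError:
            # the string does not match this format; try the next one
            pass
    return None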

In [16]:
for movie in movie_info_list:
    movie['Release date(datetime)'] = date_conversion(movie.get('Release date', 'N/A'))
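
To see how the two format strings behave, here is a small standalone check. `%B %d, %Y` matches dates written like `May 19, 1937`, while `%d %B %Y` matches day-first dates like `19 May 1937`. The function name `parse_release_date` is hypothetical, the day-first sample is made up for illustration, and `%B` assumes an English locale.

```python
from datetime import datetime

FORMATS = ['%B %d, %Y', '%d %B %Y']  # month-first, then day-first

def parse_release_date(text):
    """Parse one release-date string, ignoring any '(location)' suffix."""
    text = text.split('(')[0].strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # no match for this format, try the next
    return None

print(parse_release_date('December 21, 1937 ( Carthay Circle Theatre )'))
print(parse_release_date('21 December 1937'))   # made-up day-first variant
print(parse_release_date('1937'))               # matches neither format
```

A string that matches neither format, such as a bare year, comes back as `None`, so it may be worth counting the `None` values afterwards to see how many dates were not parsed.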

In [17]:
movie_info_list[-100]

{'title': 'Frozen',
 'Directed by': ['Chris Buck', 'Jennifer Lee'],
 'Screenplay by': 'Jennifer Lee',
 'Story by': ['Chris Buck', 'Jennifer Lee', 'Shane Morris'],
 'Produced by': 'Peter Del Vecho',
 'Starring': ['Kristen Bell',
  'Idina Menzel',
  'Jonathan Groff',
  'Josh Gad',
  'Santino Fontana'],
 'Cinematography': ['Scott Beattie',
  '(layout)',
  'Mohit Kallianpur',
  '(lighting)'],
 'Edited by': 'Jeff Draheim',
 'Music by': ['Christophe Beck', 'Robert Lopez', 'Kristen Anderson-Lopez'],
 'Production companies': ['Walt Disney Pictures',
  'Walt Disney Animation Studios'],
 'Distributed by': ['Walt Disney Studios', 'Motion Pictures'],
 'Release date': ['November 19, 2013 ( El Capitan Theatre )',
  'November 22, 2013 (United States)'],
 'Running time': '102 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$150 million',
 'Box office': '$1.280 billion',
 'Running time(int)': 102,
 'Budget(float)': 150000000.0,
 'Box office(float)': 1280000000.0,
 'Release 

## Save data (Cleaned)

In [18]:
# let's save in pickle file
import pickle

# func for saving and loading
def save_data_pickle(filename, data):
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

def load_data_pickle(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)    

In [19]:
save_data_pickle('Walt_Disney_Movies(v2).pickle', movie_info_list)

In [None]:
movie_info_list = load_data_pickle('Walt_Disney_Movies(v2).pickle')
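
One practical reason to switch from JSON to pickle at this stage is presumably that the records now contain `datetime` objects, which `json.dump` cannot serialise by default. The sketch below shows the difference on a single made-up record, along with one way to stay with JSON by storing ISO-format strings. The record is an illustrative assumption, not the notebook's data.

```python
import json
import pickle
from datetime import datetime

# illustrative record with a datetime value
record = {'title': 'Frozen', 'Release date(datetime)': datetime(2013, 11, 19)}

# json raises TypeError on datetime objects unless told how to convert them
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False
print('plain json works:', json_ok)

# pickle round-trips the datetime unchanged
restored = pickle.loads(pickle.dumps(record))
print('pickle round-trip equal:', restored == record)

# alternative: keep JSON by converting datetimes to ISO strings
as_json = json.dumps(record, default=lambda obj: obj.isoformat())
print(as_json)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.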

## API (Application Programming Interface)

![API](/images/ws/api.jpg)

## Attach Genre, IMDb & Rotten Tomatoes scores using API

![OMDbAPI-mainpage](/images/ws/ws4.jpg)

Instructions for obtaining an API key:
- Visit http://www.omdbapi.com/
- Sign up and verify your email
- Use the free tier (1,000 daily limit) or sign up on Patreon for more extensive usage
- You will receive the API key via email

![OMDbAPI-APIkey](/images/ws/ws5.jpg)

In [None]:
import requests
import urllib.parse

# func for fetching the OMDb record for a movie title
def get_omdb_info(title):
    base_url = 'http://www.omdbapi.com/?'
    parameters = {'apikey': '21c6e072', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

# obtaining rotten tomato scores
def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', []) 
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

If you would like to keep the API key out of the script, you can read it from an environment variable instead: `import os`, then `parameters = {'apikey': os.environ[your_environment_variable], ...}`. I haven't hidden the API key here, as I'm only using the free service.
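
A minimal sketch of the environment-variable approach follows. The variable name `OMDB_API_KEY`, the helper `build_omdb_url` and the placeholder key are assumptions for illustration; the sketch only builds the request URL and does not call the API.

```python
import os
import urllib.parse

def build_omdb_url(title, env_var="OMDB_API_KEY"):
    """Build the OMDb request URL, reading the key from an environment variable."""
    api_key = os.environ.get(env_var)
    if api_key is None:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    parameters = {'apikey': api_key, 't': title}
    return 'http://www.omdbapi.com/?' + urllib.parse.urlencode(parameters)

# placeholder value so the example runs; in practice, set the variable in your shell
os.environ.setdefault("OMDB_API_KEY", "demo-key")
print(build_omdb_url("Cars 3"))
```

Raising an error when the variable is missing makes the failure obvious at start-up instead of surfacing later as a rejected request.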

In [None]:
for index, movie in enumerate(movie_info_list):
    if index % 10 == 0:
        print(index)
    title = movie['title']
    omdb_info = get_omdb_info(title)

    # .get returns None when the field is missing from the response
    movie['Genre'] = omdb_info.get('Genre')
    movie['imdbRating'] = omdb_info.get('imdbRating')
    movie['Metascore'] = omdb_info.get('Metascore')
    movie['Rotten Tomatoes'] = get_rotten_tomato_score(omdb_info)
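
Not every title will be found, and some fields come back without a real value. As far as I know, OMDb reports a failed lookup with `"Response": "False"` and uses the string `"N/A"` for missing values. The sketch below assumes that behaviour and normalises both cases to `None`. The helper name `extract_scores` and the sample dictionaries are hypothetical.

```python
FIELDS = ('Genre', 'imdbRating', 'Metascore')

def extract_scores(omdb_info):
    """Pick the fields of interest from one OMDb response dict."""
    result = {field: None for field in FIELDS}
    result['Rotten Tomatoes'] = None

    # assumed failure marker: the lookup did not find the title
    if omdb_info.get('Response') == 'False':
        return result

    for field in FIELDS:
        value = omdb_info.get(field)
        # assumed placeholder for a missing value
        result[field] = None if value == 'N/A' else value

    for rating in omdb_info.get('Ratings', []):
        if rating.get('Source') == 'Rotten Tomatoes':
            result['Rotten Tomatoes'] = rating.get('Value')
    return result

# made-up responses, shaped like the fields used in this post
found = {'Response': 'True', 'Genre': 'Animation', 'imdbRating': '7.4', 'Metascore': 'N/A',
         'Ratings': [{'Source': 'Rotten Tomatoes', 'Value': '90%'}]}
missing = {'Response': 'False', 'Error': 'Movie not found!'}

print(extract_scores(found))
print(extract_scores(missing))
```

Converting the placeholders to `None` up front keeps the later DataFrame columns easier to cast to numeric types.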

In [31]:
movie_info_list[-38]

{'title': 'Hamilton',
 'Directed by': 'Thomas Kail',
 'Written by': 'Lin-Manuel Miranda',
 'Based on': ['Alexander Hamilton', 'by', 'Ron Chernow'],
 'Produced by': ['Thomas Kail', 'Lin-Manuel Miranda', 'Jeffrey Seller'],
 'Starring': ['Daveed Diggs',
  'Renée Elise Goldsberry',
  'Jonathan Groff',
  'Christopher Jackson',
  'Jasmine Cephas Jones',
  'Lin-Manuel Miranda',
  'Leslie Odom Jr.',
  'Okieriete Onaodowan',
  'Anthony Ramos',
  'Phillipa Soo'],
 'Cinematography': 'Declan Quinn',
 'Edited by': 'Jonah Moran',
 'Music by': 'Lin-Manuel Miranda',
 'Production companies': ['Walt Disney Pictures',
  '5000 Broadway Productions',
  'Nevis Productions',
  'Old 320 Sycamore Pictures',
  'RadicalMedia'],
 'Distributed by': 'Walt Disney Studios Motion Pictures',
 'Release date': ['July 3, 2020'],
 'Running time': '160 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$12.5 million (stage production)',
 'Running time(int)': 160,
 'Budget(float)': 12500000.0,
 'Box o

## Save data

In [33]:
# saving data as csv so let's convert it into dataframe
import pandas as pd

df = pd.DataFrame(movie_info_list)
df.to_csv('Walt_Disney_Movies_final(v3).csv')
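
List-valued fields such as 'Release date' or 'Starring' end up in the CSV as the text of a Python list. If a flatter file is preferred, one option is to join each list into a single delimited string before writing. The sketch below uses only the standard `csv` module. The helper names, the `"; "` separator and the two sample records are assumptions for illustration.

```python
import csv
import io

def flatten_value(value, sep="; "):
    """Join list values into one string; leave other values untouched."""
    if isinstance(value, list):
        return sep.join(str(item) for item in value)
    return value

def movies_to_csv(movies, out_file):
    # collect the union of keys in first-seen order, since movies have different fields
    fieldnames = []
    for movie in movies:
        for key in movie:
            if key not in fieldnames:
                fieldnames.append(key)

    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for movie in movies:
        writer.writerow({key: flatten_value(value) for key, value in movie.items()})

# two made-up records shaped like the scraped data
sample = [
    {'title': 'Frozen', 'Directed by': ['Chris Buck', 'Jennifer Lee']},
    {'title': 'Hamilton', 'Directed by': 'Thomas Kail', 'Running time(int)': 160},
]
buffer = io.StringIO()
movies_to_csv(sample, buffer)
print(buffer.getvalue())
```

Keys that a movie lacks are written as empty cells, which mirrors the sparse columns in the DataFrame.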

In [34]:
df

| Unnamed: 0 | title | Production company | Distributed by | Release date | Running time | Country | Language | Box office | Running time(int) | Budget(float) | ... | Hepburn | Adaptation by | Animation by | Traditional | Simplified | Original title | Layouts by | Created by | Original work | Owner |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Academy Award Review of | Walt Disney Productions | RKO Radio Pictures | [May 19, 1937] | 41 minutes (74 minutes 1966 release) | United States | English | $45.472 | 41.0 | | ... | | | | | | | | | | |
| 1 | Snow White and the Seven Dwarfs | Walt Disney Productions | RKO Radio Pictures | [December 21, 1937 ( Carthay Circle Theatre )] | 83 minutes | United States | English | $418 million | 83.0 | 1490000.0 | ... | | | | | | | | | | |
| 2 | Pinocchio | Walt Disney Productions | RKO Radio Pictures | [February 7, 1940 ( Center Theatre ), February... | 88 minutes | United States | English | $164 million | 88.0 | 2600000.0 | ... | | | | | | | | | | |
| 3 | Fantasia | Walt Disney Productions | RKO Radio Pictures | [November 13, 1940] | 126 minutes | United States | English | $76.4–$83.3 million (United States and Canada) | 126.0 | 2280000.0 | ... | | | | | | | | | | |
| 4 | The Reluctant Dragon | Walt Disney Productions | RKO Radio Pictures | [June 27, 1941] | 74 minutes | United States | English | $960,000 (worldwide rentals) | 74.0 | 600000.0 | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 497 | Lilo & Stitch | | Buena Vista Pictures Distribution, Inc. | [June 21, 2002] | 85 minutes | United States | English | $273.1 million | 85.0 | 80000000.0 | ... | | | | | | | | | | |
| 498 | National Treasure: Book of Secrets | | Walt Disney Studios Motion Pictures | [December 21, 2007] | 124 minutes | United States | English | $459.2 million | 124.0 | 130000000.0 | ... | | | | | | | | | | |
| 499 | Honey, I Shrunk the Kids | | | | | | | | | | ... | | | | | | | | [Stuart Gordon, Brian Yuzna, Ed Naha] | Honey, I Shrunk the Kids (1989) | The Walt Disney Company |
| 500 | Snow White and the Seven Dwarfs | Walt Disney Productions | RKO Radio Pictures | [December 21, 1937 ( Carthay Circle Theatre )] | 83 minutes | United States | English | $418 million | 83.0 | 1490000.0 | ... | | | | | | | | | | |
