## <center>__Getting Started__</center>

This is the raw file of a web scraping project, and it contains a complete walkthrough of each step. The comments can get long and a little messy at times, but they help us keep track of what we're doing. Now let's start:<br>
Here, we are going to scrape the list of all Walt Disney movies that is openly available on Wikipedia.<br>
First, we need to import the libraries required for web scraping, i.e. BeautifulSoup and requests.

## Importing Libraries

In [1]:
# importing necessary libraries
from bs4 import BeautifulSoup as bs
import requests

## Loading Webpage

In [2]:
# first, we fetch the web page of a single movie from the list to work out which data
# is useful, easy to identify and consistent across the other movies

# using requests to get the web page
r = requests.get('https://en.wikipedia.org/wiki/Toy_Story_4')

soup = bs(r.content)

print(soup.prettify())    # prettify is used for proper indentation

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Toy Story 4 - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"be44a2c8-d432-49c1-b30a-ab6a126e3ce5","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Toy_Story_4","wgTitle":"Toy Story 4","wgCurRevisionId":1062107408,"wgRevisionId":1062107408,"wgArticleId":57782491,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with non-numeric formatnum arguments","Articles with short description","Short description is different from Wikidata","Use American E

In [3]:
# these attributes give quick access to particular elements of the web page
print(soup.title)
print(soup.title.string)
print(soup.a)
print(soup.p)

<title>Toy Story 4 - Wikipedia</title>
Toy Story 4 - Wikipedia
<a id="top"></a>
<p class="mw-empty-elt">
</p>


In [4]:
# after a thorough look at the web page, it is evident that the info box at the top right of
# the page holds useful data and seems to be consistent across the other movies too
# hunting for the info box with BeautifulSoup commands alone is tedious,
# so we inspect the page in the browser to find the info box element and scrape it

info_box = soup.find(class_='infobox vevent') # class_ is used because class is a reserved keyword in Python
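`class` is a reserved word in Python, which is why BeautifulSoup accepts the keyword `class_` instead. The sketch below runs on a small made-up HTML snippet (not the real Wikipedia markup) and shows that lookup next to a CSS selector, which does not depend on the order of the class names. The name `demo_soup` is used only for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration only; the real page is far larger.
html = (
    '<table class="infobox vevent">'
    '<tr><th>Toy Story 4</th></tr>'
    '<tr><th>Directed by</th><td>Josh Cooley</td></tr>'
    '</table>'
)
demo_soup = BeautifulSoup(html, "html.parser")

# Match on the exact value of the class attribute
by_class = demo_soup.find(class_="infobox vevent")

# CSS selector: any table carrying both classes, in any order
by_selector = demo_soup.select_one("table.infobox.vevent")

row_texts = [row.get_text(" ", strip=True) for row in by_class.find_all("tr")]
print(by_class == by_selector)
print(row_texts)
```

Both lookups find the same table here; the selector form may be the more forgiving one if the class order ever differs between pages.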

In [5]:
# we have found the element that contains the info box

info_box.find_all('tr')

# it looks like the 'tr' (table row) elements contain the info that we need
# so let's assign them to a variable

info_rows = info_box.find_all('tr')

In [6]:
# let's use a for loop to iterate over the rows and check whether we are getting the right info
for row in info_rows:
    print(row.prettify())

<tr>
 <th class="infobox-above summary" colspan="2" style="font-size: 125%; font-style: italic;">
  Toy Story 4
 </th>
</tr>

<tr>
 <td class="infobox-image" colspan="2">
  <a class="image" href="/wiki/File:Toy_Story_4_poster.jpg">
   <img alt="Toy Story 4 poster.jpg" class="thumbborder" data-file-height="326" data-file-width="220" decoding="async" height="326" src="//upload.wikimedia.org/wikipedia/en/4/4c/Toy_Story_4_poster.jpg" width="220"/>
  </a>
  <div class="infobox-caption">
   Theatrical release poster
  </div>
 </td>
</tr>

<tr>
 <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;">
  Directed by
 </th>
 <td class="infobox-data">
  <a href="/wiki/Josh_Cooley" title="Josh Cooley">
   Josh Cooley
  </a>
 </td>
</tr>

<tr>
 <th class="infobox-label" scope="row" style="white-space: nowrap; padding-right: 0.65em;">
  Screenplay by
 </th>
 <td class="infobox-data">
  <div class="plainlist">
   <ul>
    <li>
     <a href="/wiki/Andrew_Stanton" tit

In [11]:
# perfect!! that's all we wanted

# some 'td' elements contain 'li' (list item) elements, which introduce special characters like '\n'
# this function collects the 'li' elements into a python list using a list comprehension
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    else:
        return row_data.get_text(' ', strip=True)

# creating a dictionary and storing the info by key value pairs
movie_info = {}

for index, row in enumerate(info_rows):
    if index == 0:       # title 
        movie_info['Title'] = row.find('th').get_text(' ', strip=True)
    elif index == 1:     # redundant so continue without it
        continue
    else:
        content_key = row.find('th').get_text(' ', strip=True)
        content_value = get_content_value(row.find('td'))
        movie_info[content_key] = content_value
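A note on the non-breaking space that Wikipedia uses inside some cells: in Python it is written `'\xa0'` (one character). Escaping the backslash, as in `'\\xa0'`, searches for four literal characters instead and leaves the text untouched. A minimal illustration on a made-up string:

```python
text = "Walt Disney\xa0Pictures"        # \xa0 is a single non-breaking space

# '\\xa0' is the four characters: backslash, x, a, 0
unchanged = text.replace('\\xa0', ' ')

# '\xa0' is the non-breaking space itself
cleaned = text.replace('\xa0', ' ')

print(len('\\xa0'), len('\xa0'))
print(unchanged == text)
print(cleaned)
```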

In [8]:
movie_info

{'Title': 'Toy Story 4',
 'Directed by': 'Josh Cooley',
 'Screenplay by': ['Andrew Stanton', 'Stephany Folsom'],
 'Story by': ['John Lasseter',
  'Andrew Stanton [1]',
  'Josh Cooley',
  'Valerie LaPointe',
  'Rashida Jones',
  'Will McCormack',
  'Martin Hynes',
  'Stephany Folsom'],
 'Produced by': ['Mark Nielsen', 'Jonas Rivera'],
 'Starring': ['Tom Hanks',
  'Tim Allen',
  'Annie Potts',
  'Tony Hale',
  'Keegan-Michael Key',
  'Jordan Peele',
  'Madeleine McGraw',
  'Christina Hendricks',
  'Keanu Reeves',
  'Ally Maki',
  'Jay Hernandez',
  'Lori Alan',
  'Joan Cusack',
  'Wallace Shawn',
  'John Ratzenberger',
  'Blake Clark',
  'Don Rickles',
  'Estelle Harris',
  'Bonnie Hunt',
  'Jeff Garlin',
  'Kristen Schaal',
  'Timothy Dalton'],
 'Cinematography': ['Patrick Lin', 'Jean-Claude Kalache'],
 'Edited by': 'Axel Geddes',
 'Music by': 'Randy Newman [2]',
 'Production companies': ['Walt Disney Pictures', 'Pixar Animation Studios'],
 'Distributed by': 'Walt Disney Studios Motion 

In [9]:
# This looks pretty good, and this info should be useful and consistent across the other movies as well
# So now let's scrape all the other movies too

In [10]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films#2020s')

soup = bs(r.content)
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of Walt Disney Pictures films - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6b97ff45-9c8a-4c41-8bac-eacb83067465","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Walt_Disney_Pictures_films","wgTitle":"List of Walt Disney Pictures films","wgCurRevisionId":1061677051,"wgRevisionId":1061677051,"wgArticleId":1970335,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Articles with short description","Sh

In [12]:
# getting the elements that contain the list of movies,
# along with the href and title of each movie for the next steps
# we see that the movie names are italicised and
# sit inside an 'a' element which has href and title, so let's select those
movies = soup.select(".wikitable.sortable i a")
movies[:10]  # let's see the first 10 movies to see if we extracted correctly

[<a href="/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons" title="Academy Award Review of Walt Disney Cartoons">Academy Award Review of Walt Disney Cartoons</a>,
 <a href="/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)" title="Snow White and the Seven Dwarfs (1937 film)">Snow White and the Seven Dwarfs</a>,
 <a href="/wiki/Pinocchio_(1940_film)" title="Pinocchio (1940 film)">Pinocchio</a>,
 <a href="/wiki/Fantasia_(1940_film)" title="Fantasia (1940 film)">Fantasia</a>,
 <a href="/wiki/The_Reluctant_Dragon_(1941_film)" title="The Reluctant Dragon (1941 film)">The Reluctant Dragon</a>,
 <a href="/wiki/Dumbo" title="Dumbo">Dumbo</a>,
 <a href="/wiki/Bambi" title="Bambi">Bambi</a>,
 <a href="/wiki/Saludos_Amigos" title="Saludos Amigos">Saludos Amigos</a>,
 <a href="/wiki/Victory_Through_Air_Power_(film)" title="Victory Through Air Power (film)">Victory Through Air Power</a>,
 <a href="/wiki/The_Three_Caballeros" title="The Three Caballeros">The Three Caballeros</a>]

In [13]:
# looking for a single instance
movies[0]['href']

'/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons'

In [29]:
movies[0]['title']

'Academy Award Review of Walt Disney Cartoons'

In [14]:
# We got the title and href
# Now we shall create a function that follows the href of each movie in the list and
# extracts the information from its info box

In [14]:
def get_content_value(row_data):
    if row_data.find('li'):
        return [li.get_text(' ', strip=True).replace('\xa0', ' ') for li in row_data.find_all('li')]
    elif row_data.find('br'):
        return [text for text in row_data.stripped_strings]  # values separated by br tags become separate strings
    else:
        return row_data.get_text(' ', strip=True).replace('\xa0', ' ')

# function for removing references (like [1]) and other inconsistent elements, i.e. sup and span tags
def clean_tags(soup):
    for tag in soup.find_all(['sup', 'span']):
        tag.decompose()

def get_info_box(url):

    r = requests.get(url)
    soup = bs(r.content)
    clean_tags(soup)
    info_box = soup.find(class_='infobox vevent')
    info_rows = info_box.find_all('tr')

    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['Title'] = row.find('th').get_text(' ', strip=True)

        else:
            header = row.find('th')
            if header:
                content_key = row.find('th').get_text(' ', strip=True)
                content_value = get_content_value(row.find('td'))
                movie_info[content_key] = content_value
            
    return movie_info

In [31]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films#2020s')
soup = bs(r.content)
movies = soup.select('.wikitable.sortable i a')

base_path = 'https://en.wikipedia.org'   # no trailing slash, the relative paths already start with '/'

movie_info_list = []
for index, movie in enumerate(movies):
    if index % 10 == 0:
        print(index)
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']

        movie_info_list.append(get_info_box(full_path))

    except Exception as e:
        print(movie.get_text())
        print(e)

0
10
20
30
40
Zorro the Avenger
'NoneType' object has no attribute 'find'
The Sign of Zorro
'NoneType' object has no attribute 'find'
50
60
70
80
90
100
110
120
True-Life Adventures
'NoneType' object has no attribute 'find_all'
130
140
The London Connection
'NoneType' object has no attribute 'find'
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
Strange World
'NoneType' object has no attribute 'find_all'
490
500
Sister Act 3
'NoneType' object has no attribute 'find'
Tower of Terror
'NoneType' object has no attribute 'find_all'
Tron: Ares
'NoneType' object has no attribute 'find'
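The links collected from the list page are relative (`/wiki/...`), so they have to be combined with the site root before they can be requested. A small sketch using the standard library's `urljoin`, which takes care of the slashes; the variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/"
relative_path = "/wiki/Toy_Story_4"

# Plain concatenation keeps both slashes
concatenated = base_url + relative_path

# urljoin resolves the path against the base, much like a browser would
joined = urljoin(base_url, relative_path)

print(concatenated)
print(joined)
```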


In [None]:
# after inspecting the movies in the exceptions through the webpage, the reason why these movies cannot be
# scraped is that their links do not lead to a page with the info box we expect (the lookups return None),
# and hence our functions don't work
# we have scraped almost all the movies, so we might as well skip those as they don't provide any info
# other than the title
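The `'NoneType' object has no attribute ...` messages in the run above are what Python reports when `find` returns `None` and a method is then called on that result. One way to make the case explicit, sketched here on made-up HTML with a hypothetical helper named `parse_info_box`, is to check for `None` before going further:

```python
from bs4 import BeautifulSoup

def parse_info_box(html):
    """Return a dict of info box rows, or None when the page has no info box."""
    soup = BeautifulSoup(html, "html.parser")
    info_box = soup.find(class_="infobox vevent")
    if info_box is None:                      # page without an info box
        return None

    info = {}
    for row in info_box.find_all("tr"):
        header, data = row.find("th"), row.find("td")
        if header is None or data is None:    # skip rows missing either cell
            continue
        info[header.get_text(" ", strip=True)] = data.get_text(" ", strip=True)
    return info

# Made-up inputs for illustration only
with_box = ('<table class="infobox vevent">'
            '<tr><th>Country</th><td>United States</td></tr></table>')
without_box = "<p>This page has no info box.</p>"

print(parse_info_box(with_box))
print(parse_info_box(without_box))
```

Returning `None` lets the calling loop decide whether to skip the movie or log it, instead of relying on a broad `except`.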

In [91]:
len(movie_info_list)

502

In [90]:
movie_info_list[0]

{'Title': 'Academy Award Review of',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['May 19, 1937'],
 'Running time': '41 minutes (74 minutes 1966 release)',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$45.472'}

## Saving scraped data into json

In [24]:
import json

# creating a function for saving data as json
def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# creating a function for loading json files
def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)

In [33]:
save_data('Walt_Disney_Movies_raw.json', movie_info_list)

In [3]:
movie_info_list = load_data('Walt_Disney_Movies_raw.json')

## Data Cleaning

Now that we have scraped the data from the webpage, it's time to clean it up.

#### Observations after looking through the data:
- ~~Remove references like [1]~~
- ~~Split up long strings which are not separated during scraping~~
- Convert running time to integer type
- Convert budget and box office to numerical values
- Convert release date to datetime object

We shall try to fix these problems by modifying the functions we created while scraping; where that is not possible, we shall fix them after scraping.

Two of the issues we found are already fixed, and they are marked with strikethrough. The others are type conversions, so they should be doable.

- Convert running time to integer type

In [19]:
print([movie.get('Running time', 'N/A') for movie in movie_info_list])

# some entries include multiple values and extra text, so we shall create a new key that holds
# only the first number of minutes

['41 minutes (74 minutes 1966 release)', '83 minutes', '88 minutes', '126 minutes', '74 minutes', '64 minutes', '70 minutes', '42 minutes', '70 min', '71 minutes', '75 minutes', '94 minutes', '73 minutes', '75 minutes', '82 minutes', '68 minutes', '74 minutes', '96 minutes', '75 minutes', '84 minutes', '77 minutes', '92 minutes', '69 minutes', '81 minutes', ['60 minutes (VHS version)', '71 minutes (original)'], '127 minutes', '92 minutes', '76 minutes', '75 minutes', '73 minutes', '85 minutes', '81 minutes', '70 minutes', '90 min.', '80 minutes', '75 minutes', '83 minutes', '83 minutes', '72 minutes', '97 minutes', '75 minutes', '104 minutes', '93 minutes', '105 minutes', '95 minutes', '97 minutes', '134 minutes', '69 minutes', '92 minutes', '131 minutes', '79 minutes', '97 minutes', '128 minutes', '73 minutes', '91 minutes', '105 minutes', '98 minutes', '130 minutes', '89 min.', '93 minutes', '67 minutes', '98 minutes', '100 minutes', '118 minutes', '103 minutes', '110 minutes', '80 m

In [4]:
# creating a new integer running time that keeps only the first numerical value, without the other text or numbers
def minute_to_integer(running_time):
    if running_time == 'N/A':
        return None

    if isinstance(running_time, list):
        first_entry = running_time[0]
        value = int(first_entry.split(" ")[0])
        return value
    else:
        value = int(running_time.split(" ")[0])
        return value

for movie in movie_info_list:
    movie['Running time(int)'] = minute_to_integer(movie.get('Running time', 'N/A'))
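`int(running_time.split(" ")[0])` relies on every entry starting with a number followed by a space. A slightly more defensive sketch, still assuming that only the first number is wanted, pulls it out with a regular expression. `first_minutes` is a hypothetical name used only here, and the samples are taken from the output above.

```python
import re

def first_minutes(running_time):
    """Return the first integer found in a running-time entry, else None."""
    if isinstance(running_time, list):
        if not running_time:
            return None
        running_time = running_time[0]      # keep only the first entry
    if not isinstance(running_time, str):
        return None
    match = re.search(r"\d+", running_time)
    return int(match.group()) if match else None

samples = [
    '41 minutes (74 minutes 1966 release)',
    ['60 minutes (VHS version)', '71 minutes (original)'],
    '90 min.',
    'N/A',
    None,
]
minutes = [first_minutes(sample) for sample in samples]
print(minutes)
```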

In [6]:
movie_info_list[0]

{'title': 'Academy Award Review of',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['May 19, 1937'],
 'Running time': '41 minutes (74 minutes 1966 release)',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$45.472',
 'Running time(int)': 41}

In [11]:
print([movie.get('Running time(int)', 'N/A') for movie in movie_info_list])

[41, 83, 88, 126, 74, 64, 70, 42, 70, 71, 75, 94, 73, 75, 82, 68, 74, 96, 75, 84, 77, 92, 69, 81, 60, 127, 92, 76, 75, 73, 85, 81, 70, 90, 80, 75, 83, 83, 72, 97, 75, 104, 93, 105, 95, 97, 134, 69, 92, 131, 79, 97, 128, 73, 91, 105, 98, 130, 89, 93, 67, 98, 100, 118, 103, 110, 80, 79, 91, 91, 97, 118, 139, 131, 92, 87, 116, 93, 110, 110, 131, 101, 108, 84, 78, 75, 164, 106, 110, 99, 113, 108, 112, 93, 91, 93, 100, 100, 79, 96, 113, 89, 118, 92, 88, 92, 87, 93, 93, 93, 90, 83, 96, 88, 89, 91, 93, 92, 97, 100, 100, 89, 91, 112, 115, 95, 91, 97, 104, 74, 48, 77, 104, 128, 101, 94, 104, 90, 100, 88, 93, 98, 112, 84, 97, 97, 114, 96, 97, 109, 83, 90, 107, 96, 103, 91, 95, 105, 113, 80, 101, 90, 74, 90, 89, 110, 74, 93, 84, 83, 74, 77, 107, 93, 88, 108, 84, 121, 89, 104, 90, 86, 84, 108, 107, 96, 98, 105, 108, 94, 106, 102, 88, 102, 102, 97, 111, 100, 96, 98, 78, 81, 108, 89, 99, 89, 81, 92, 100, 89, 79, 91, 101, 104, 103, 86, 105, 75, 93, 92, 98, 95, 93, 87, 93, 87, 128, 77, 86, 95, 114, 93

In [13]:
movie_info_list[1]

{'title': 'Snow White and the Seven Dwarfs',
 'Directed by': ['David Hand',
  'William Cottrell',
  'Wilfred Jackson',
  'Larry Morey',
  'Perce Pearce',
  'Ben Sharpsteen'],
 'Written by': ['Ted Sears',
  'Richard Creedon',
  'Otto Englander',
  'Dick Rickard',
  'Earl Hurd',
  'Merrill De Maris',
  'Dorothy Ann Blank',
  'Webb Smith'],
 'Based on': ['Snow White', 'by The', 'Brothers Grimm'],
 'Produced by': 'Walt Disney',
 'Starring': ['Adriana Caselotti',
  'Lucille La Verne',
  'Harry Stockwell',
  'Roy Atwell',
  'Pinto Colvig',
  'Otis Harlan',
  'Scotty Mattraw',
  'Billy Gilbert',
  'Eddie Collins',
  'Moroni Olsen',
  'Stuart Buchanan'],
 'Music by': ['Frank Churchill', 'Paul Smith', 'Leigh Harline'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['December 21, 1937 ( Carthay Circle Theatre )'],
 'Running time': '83 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$1.49 million',
 'Box offi

- Convert budget and box office to numerical values

In [5]:
# we shall use regular expressions (re) for this task, as they make it easy to pick out the pattern
# of characters that we want
import re

amounts = r"thousand|million|billion"
number = r"\d+(,\d{3})*\.*\d*"

word_re = rf"\${number}(-|\sto\s|–)?({number})?\s({amounts})"
value_re = rf"\${number}"

def word_to_value(word):
    value_dict = {"thousand": 1000, "million": 1000000, "billion": 1000000000}
    return value_dict[word]

def parse_word_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    word = re.search(amounts, string, flags=re.I).group().lower()
    word_value = word_to_value(word)
    return value*word_value

def parse_value_syntax(string):
    value_string = re.search(number, string).group()
    value = float(value_string.replace(",", ""))
    return value

'''
money_conversion("$12.2 million") --> 12200000 ## Word syntax
money_conversion("$790,000") --> 790000        ## Value syntax
'''
def money_conversion(money):
    if money == "N/A":
        return None

    if isinstance(money, list):
        money = money[0]
        
    word_syntax = re.search(word_re, money, flags=re.I)
    value_syntax = re.search(value_re, money)

    if word_syntax:
        return parse_word_syntax(word_syntax.group())

    elif value_syntax:
        return parse_value_syntax(value_syntax.group())

    else:
        return None
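For comparison, the same idea can be squeezed into a single pattern with named groups. This is only a compact sketch of an alternative, not a replacement for the functions above; `UNITS`, `MONEY_RE` and `to_dollars` are hypothetical names, and like the original it keeps only the first number of a range.

```python
import re

UNITS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

MONEY_RE = re.compile(
    r"\$(?P<value>\d+(?:,\d{3})*(?:\.\d+)?)"        # $1.49 or $790,000
    r"(?:\s*(?:-|–|to)\s*\$?[\d.,]+)?"              # optional end of a range, ignored
    r"(?:\s*(?P<unit>thousand|million|billion))?",  # optional unit word
    flags=re.I,
)

def to_dollars(text):
    """Convert the first dollar amount in text to a float, else None."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    value = float(match.group("value").replace(",", ""))
    unit = match.group("unit")
    return value * UNITS[unit.lower()] if unit else value

for sample in ["$12.2 million", "$790,000", "$3–4 million", "(US$270 million)", "unknown"]:
    print(sample, "->", to_dollars(sample))
```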

In [6]:
for movie in movie_info_list:
    movie['Budget(float)'] = money_conversion(movie.get('Budget', 'N/A'))
    movie['Box office(float)'] = money_conversion(movie.get('Box office', 'N/A'))

In [10]:
movie_info_list[1]

{'title': 'Snow White and the Seven Dwarfs',
 'Directed by': ['David Hand',
  'William Cottrell',
  'Wilfred Jackson',
  'Larry Morey',
  'Perce Pearce',
  'Ben Sharpsteen'],
 'Written by': ['Ted Sears',
  'Richard Creedon',
  'Otto Englander',
  'Dick Rickard',
  'Earl Hurd',
  'Merrill De Maris',
  'Dorothy Ann Blank',
  'Webb Smith'],
 'Based on': ['Snow White', 'by The', 'Brothers Grimm'],
 'Produced by': 'Walt Disney',
 'Starring': ['Adriana Caselotti',
  'Lucille La Verne',
  'Harry Stockwell',
  'Roy Atwell',
  'Pinto Colvig',
  'Otis Harlan',
  'Scotty Mattraw',
  'Billy Gilbert',
  'Eddie Collins',
  'Moroni Olsen',
  'Stuart Buchanan'],
 'Music by': ['Frank Churchill', 'Paul Smith', 'Leigh Harline'],
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['December 21, 1937 ( Carthay Circle Theatre )'],
 'Running time': '83 minutes',
 'Country': 'United States',
 'Language': 'English',
 'Budget': '$1.49 million',
 'Box offi

- Convert release date to datetime object

In [11]:
from datetime import datetime

In [12]:
dates = [movie.get('Release date', 'N/A') for movie in movie_info_list]

In [15]:
# let's have a look at the release date
for index, date in enumerate(dates):
    print(index, date)

0 ['May 19, 1937']
1 ['December 21, 1937 ( Carthay Circle Theatre )']
2 ['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)']
3 ['November 13, 1940']
4 ['June 27, 1941']
5 ['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)']
6 ['August 9, 1942 (World Premiere – London)', 'August 13, 1942 (Premiere – New York City)', 'August 21, 1942 (U.S.)']
7 ['August 24, 1942 (World Premiere – Rio de Janeiro)', 'February 6, 1943 (U.S. Premiere – Boston)', 'February 19, 1943 (U.S.)']
8 ['July 17, 1943']
9 ['December 21, 1944 (Mexico City)', 'February 3, 1945 (US)']
10 ['April 20, 1946 (New York City premiere)', 'August 15, 1946 (U.S.)']
11 ['November 12, 1946 (Premiere: Atlanta, Georgia)', 'November 20, 1946', 'March 30, 1947 (Stanford Theatre, Palo Alto, California)']
12 ['September 27, 1947']
13 May 27, 1948
14 ['November 29, 1948 (Chicago, Illinois)', 'January 19, 1949 (Indianapolis, Indiana)']
15 ['October 5, 1949']
16 ['February 15, 1950 (Boston)', 'March 4, 1950

In [22]:
# some movies have several dates, and some dates come with extra text in parentheses
# so we will try to extract just the date

# function for dropping the text in parentheses and keeping the date alone
def clean_date(date):
    return date.split('(')[0].strip()

def date_conversion(date):     # taking out only the first date even if multiple dates exists
    if isinstance(date, list):
        date = date[0]
    
    if date == 'N/A':          # for null values
        return None

    date_str = clean_date(date)
    print(date_str)

    formats = ['%B %d, %Y', '%d %B %Y']    # some dates have different format, so adding multiple formats
    for format in formats:
        try:                                           # adding try and except blocks to rectify any errors
            return datetime.strptime(date_str, format)
        except Exception as e:
            print(e)
    return None             # return None if doesn't have any values as per format

In [9]:
# testing function
date_conversion('25 April, 2019 (Bangalore, India)')

25 April, 2019
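Note that the sample string above puts a comma after the month in day-first order, which neither `'%B %d, %Y'` nor `'%d %B %Y'` accepts, so parsing it with those two formats gives `None`. A quick way to see which inputs a list of formats covers is sketched below; `try_formats` is a hypothetical helper, and the month names assume an English locale.

```python
from datetime import datetime

FORMATS = ['%B %d, %Y', '%d %B %Y']

def try_formats(date_str, formats=FORMATS):
    """Return the first successful parse, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:      # this format did not match, try the next one
            continue
    return None

for sample in ['May 19, 1937', '26 October 1953', '25 April, 2019']:
    print(sample, '->', try_formats(sample))
```

If such strings do occur in the data, adding a third format such as `'%d %B, %Y'` to the list would be one way to cover them.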


In [19]:
# the text in parentheses is stripped as intended, let's apply the function to the data
for date in dates:
    print(date_conversion(date))
    print()

May 19, 1937
1937-05-19 00:00:00

December 21, 1937
1937-12-21 00:00:00

February 7, 1940
1940-02-07 00:00:00

November 13, 1940
1940-11-13 00:00:00

June 27, 1941
1941-06-27 00:00:00

October 23, 1941
1941-10-23 00:00:00

August 9, 1942
1942-08-09 00:00:00

August 24, 1942
1942-08-24 00:00:00

July 17, 1943
1943-07-17 00:00:00

December 21, 1944
1944-12-21 00:00:00

April 20, 1946
1946-04-20 00:00:00

November 12, 1946
1946-11-12 00:00:00

September 27, 1947
1947-09-27 00:00:00

May 27, 1948
1948-05-27 00:00:00

November 29, 1948
1948-11-29 00:00:00

October 5, 1949
1949-10-05 00:00:00

February 15, 1950
1950-02-15 00:00:00

June 22, 1950
1950-06-22 00:00:00

July 26, 1951
1951-07-26 00:00:00

March 13, 1952
1952-03-13 00:00:00

February 5, 1953
1953-02-05 00:00:00

July 23, 1953
1953-07-23 00:00:00

November 10, 1953
1953-11-10 00:00:00

26 October 1953
time data '26 October 1953' does not match format '%B %d, %Y'
1953-10-26 00:00:00

August 17, 1954
1954-08-17 00:00:00

December 23,

In [26]:
# same functions as above but without print statements
def clean_date(date):
    return date.split('(')[0].strip()

def date_conversion(date):     
    if isinstance(date, list):
        date = date[0]
    
    if date == 'N/A':          
        return None

    date_str = clean_date(date)

    formats = ['%B %d, %Y', '%d %B %Y']
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:       # this format did not match, try the next one
            continue
    return None

In [27]:
for movie in movie_info_list:
    movie['Release date(datetime)'] = date_conversion(movie.get('Release date', 'N/A'))

In [28]:
movie_info_list[0]

{'title': 'Academy Award Review of',
 'Production company': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['May 19, 1937'],
 'Running time': '41 minutes (74 minutes 1966 release)',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$45.472',
 'Running time(int)': 41,
 'Budget(float)': None,
 'Box office(float)': 45.472,
 'Release date(datetime)': datetime.datetime(1937, 5, 19, 0, 0)}

## Saving cleaned data

In [29]:
## save_data('Walt_Disney_Movies(Cleaned).json', movie_info_list)

# we tried to save the data as json but got an error message stating
# 'TypeError: Object of type datetime is not JSON serializable'
# it seems we can't save to json while any of the data is a datetime object
# We will find a solution later; for now, no worries, we shall save the data to a pickle file

TypeError: Object of type datetime is not JSON serializable
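One possible way around this error, offered as a sketch rather than the route taken in this notebook, is the `default` hook of `json.dumps`: it is called for every object the encoder cannot handle, so datetimes can be turned into ISO strings on the way out and parsed back after loading. The record and the helper name `encode_datetime` are made up for the illustration.

```python
import json
from datetime import datetime

def encode_datetime(obj):
    """Fallback used by json.dumps for objects it cannot serialise itself."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Made-up record for illustration only
record = {'title': 'Example', 'Release date(datetime)': datetime(1937, 5, 19)}

as_json = json.dumps(record, default=encode_datetime)
restored = json.loads(as_json)

# The date comes back as a string and has to be parsed again
restored_date = datetime.fromisoformat(restored['Release date(datetime)'])

print(as_json)
print(restored_date)
```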

In [2]:
# creating new save and load functions using pickle
import pickle

def save_data_pickle(title, data):
    with open(title, 'wb') as f:
        pickle.dump(data, f)

def load_data_pickle(title):
    with open(title, 'rb') as f:
        return pickle.load(f)

In [31]:
save_data_pickle('Walt_Disney_Movies(Cleaned).pickle', movie_info_list)

In [3]:
movie_info_list = load_data_pickle('Walt_Disney_Movies(Cleaned).pickle')

## Attach IMDB/Rotten Tomatoes scores

After exploring a bit on the internet for ways to add IMDB and Rotten Tomatoes scores to the data, we came across an API hosting platform that lets us request movie data through its API.<br>
Steps:
- Visit 'http://www.omdbapi.com/' and get an API key by verifying your email address.
- You can either use the free service (limited to 1,000 requests per day) or subscribe to their Patreon for more access.
- I went with the free service and got the API key.
- You can then use that API key to request movie info from the API.

In [1]:
api = 'http://www.omdbapi.com/?i=tt3896198&apikey=21c6e072'

import requests
import urllib.parse            # to append parameters to the url

def get_omdb_info(title):
    base_url = 'http://www.omdbapi.com/?'
    parameters = {'apikey': '21c6e072', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    print(full_url)

In [2]:
# checking
get_omdb_info('into the woods')

http://www.omdbapi.com/?apikey=21c6e072&t=into+the+woods


It works as expected. Before adding requests to the function, a little heads-up.
API keys are sensitive: once you publish your work on GitHub or another public platform, anyone can see your key and misuse it. There is a solution, however: you can store the API key in an environment variable on your Windows machine and read it from there, so the key never appears in the code. Let's do that:<br>
(Here I'm not hiding my API key, as I'm only using the free service of that API.)

In [None]:
# import requests
# import urllib
# import os      

# def get_omdb_info(title):
#     base_url = 'http://www.omdbapi.com/?'
#     parameters = {'apikey': os.environ['OMDB_API_KEY'], 't': title}   # calling the environmental variable
#     params_encoded = urllib.parse.urlencode(parameters)                      ## which is stored in the machine
#     full_url = base_url + params_encoded
#     print(full_url)
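A runnable variant of the idea above, as a sketch: the key is looked up under `OMDB_API_KEY` and a clear error is raised when it is missing. The `environ` parameter and the name `build_omdb_url` are additions made for this illustration so that the demo can pass a placeholder key instead of touching the real environment.

```python
import os
from urllib.parse import urlencode

def build_omdb_url(title, environ=os.environ):
    """Build the request URL, reading the API key from the environment."""
    api_key = environ.get('OMDB_API_KEY')
    if not api_key:
        raise RuntimeError('Set the OMDB_API_KEY environment variable first')
    query = urlencode({'apikey': api_key, 't': title})
    return f'http://www.omdbapi.com/?{query}'

# 'demo-key' is a placeholder, not a real key
demo_url = build_omdb_url('into the woods', environ={'OMDB_API_KEY': 'demo-key'})
print(demo_url)
```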

In [4]:
# now let's add requests
import requests
import urllib.parse

def get_omdb_info(title):
    base_url = 'http://www.omdbapi.com/?'
    parameters = {'apikey': '21c6e072', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

In [5]:
get_omdb_info('into the woods')

{'Title': 'Into the Woods',
 'Year': '2014',
 'Rated': 'PG',
 'Released': '25 Dec 2014',
 'Runtime': '125 min',
 'Genre': 'Adventure, Comedy, Drama',
 'Director': 'Rob Marshall',
 'Writer': 'James Lapine',
 'Actors': 'Anna Kendrick, Meryl Streep, Chris Pine',
 'Plot': 'A witch tasks a childless baker and his wife with procuring magical items from classic fairy tales to reverse the curse put on their family tree.',
 'Language': 'English',
 'Country': 'United States',
 'Awards': 'Nominated for 3 Oscars. 10 wins & 74 nominations total',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMTY4MzQ4OTY3NF5BMl5BanBnXkFtZTgwNjM5MDI3MjE@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '5.9/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '71%'},
  {'Source': 'Metacritic', 'Value': '69/100'}],
 'Metascore': '69',
 'imdbRating': '5.9',
 'imdbVotes': '137,784',
 'imdbID': 'tt2180411',
 'Type': 'movie',
 'DVD': '24 Mar 2015',
 'BoxOffice': '$128,002,372',
 'Production': 'N

In [6]:
# Bingo, but the Rotten Tomatoes score is nested inside the 'Ratings' list, let's get it with a function
def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', [])     # get the 'Ratings' list, else an empty list
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

sample = get_omdb_info('into the woods')
get_rotten_tomato_score(sample)

# that's all we needed, now let's execute it in our data without further ado

'71%'

In [11]:
import requests
import urllib.parse

def get_omdb_info(title):
    base_url = 'http://www.omdbapi.com/?'
    parameters = {'apikey': '21c6e072', 't': title}
    params_encoded = urllib.parse.urlencode(parameters)
    full_url = base_url + params_encoded
    return requests.get(full_url).json()

def get_rotten_tomato_score(omdb_info):
    ratings = omdb_info.get('Ratings', []) 
    for rating in ratings:
        if rating['Source'] == 'Rotten Tomatoes':
            return rating['Value']
    return None

In [12]:
for index, movie in enumerate(movie_info_list):
    if index % 10 == 0:
        print(index)
    # the scraping code stores the key as 'Title', while the loaded data shows 'title'
    title = movie.get('title', movie.get('Title'))
    omdb_info = get_omdb_info(title)

    movie['Genre'] = omdb_info.get('Genre', None)   # genre is useful info as well, so let's add it too
    movie['imdbRating'] = omdb_info.get('imdbRating', None)
    movie['Metascore'] = omdb_info.get('Metascore', None)
    movie['Rotten Tomatoes'] = get_rotten_tomato_score(omdb_info)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
310
320
330
340
350
360
370
380
390
400
410
420
430
440
450
460
470
480
490
500
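The scores come back from the API as strings (for example `'5.9'`, `'69'` and `'71%'` in the response shown earlier). If numeric values are wanted later, a small converter along these lines could be applied. Treating `'N/A'` as a missing value is an assumption about how the API may report gaps, and `score_to_float` is a hypothetical helper.

```python
def score_to_float(value):
    """Convert strings like '5.9', '71%' or '69/100' to a float, else None."""
    if value is None:
        return None
    text = str(value).strip()
    if text.endswith('%'):
        text = text[:-1]                 # '71%'    -> '71'
    if '/' in text:
        text = text.split('/')[0]        # '69/100' -> '69'
    try:
        return float(text)
    except ValueError:                   # e.g. 'N/A' or an empty string
        return None

scores = [score_to_float(s) for s in ['5.9', '71%', '69/100', 'N/A', None]]
print(scores)
```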


In [13]:
movie_info_list[-70]

{'title': 'Dangal',
 'Directed by': 'Nitesh Tiwari',
 'Written by': ['Nitesh Tiwari',
  'Piyush Gupta',
  'Shreyas Jain',
  'Nikhil Meharotra'],
 'Story by': ['Curation:', 'Nitesh Tiwari', 'Concept:', 'Divya V. Rao'],
 'Based on': 'Lives of Mahavir Singh Phogat and Phogat sisters',
 'Produced by': ['Aamir Khan', 'Kiran Rao', 'Siddharth Roy Kapur'],
 'Starring': ['Aamir Khan',
  'Sakshi Tanwar',
  'Fatima Sana Shaikh',
  'Zaira Wasim',
  'Sanya Malhotra',
  'Suhani Bhatnagar',
  'Aparshakti Khurana',
  'Girish Kulkarni'],
 'Narrated by': 'Aparshakti Khurana',
 'Cinematography': 'Setu',
 'Edited by': 'Ballu Saluja',
 'Music by': 'Pritam',
 'Production companies': ['Aamir Khan Productions',
  'Walt Disney Pictures India'],
 'Distributed by': 'UTV Motion Pictures',
 'Release date': ['21 December 2016 (United States)',
  '23 December 2016 (India)'],
 'Running time': '161 minutes',
 'Country': 'India',
 'Language': 'Hindi',
 'Budget': '(US$9.3 million)',
 'Box office': '(US$270 million)',
 '

## Save data

In [14]:
# earlier, saving as json failed because of the datetime objects
# so now let's convert the datetime objects back to strings and save the data as json

# let's create a copy of the data
movie_copy = [movie.copy() for movie in movie_info_list]

In [19]:
for movie in movie_copy:
    current_date = movie['Release date(datetime)']
    if current_date:
        movie['Release date(datetime)'] = current_date.strftime('%B %d, %Y')
    else:
        movie['Release date(datetime)'] = None

In [23]:
movie_copy[10].get('Release date(datetime)')

# it looks fine, now let's save it as json

'April 20, 1946'

In [25]:
save_data('Walt_Disney_Movies_Final.json', movie_copy)

In [26]:
# let's save it as csv too
import pandas as pd

df = pd.DataFrame(movie_info_list)  # converting list of dicts to dataframe

In [29]:
df.to_csv('Walt_Disney_Movies_Final.csv')
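Several columns hold Python lists (for example `Starring`), and those are written to the CSV as the text of the list, brackets and quotes included. If plain text cells are preferred, one option is to join the lists before writing. The sketch below uses only the standard library and made-up rows; `flatten` and the `'; '` separator are choices made for the illustration.

```python
import csv
import io

# Made-up rows for illustration only
rows = [
    {'title': 'Film A', 'Starring': ['Actor One', 'Actor Two'], 'Running time(int)': 83},
    {'title': 'Film B', 'Starring': 'Actor Three', 'Budget(float)': 1490000.0},
]

def flatten(value, sep='; '):
    """Join list values into one string; leave everything else unchanged."""
    return sep.join(map(str, value)) if isinstance(value, list) else value

# Collect every key that appears, keeping first-seen order
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()                   # stands in for a real file
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
for row in rows:
    writer.writerow({key: flatten(value) for key, value in row.items()})

csv_lines = buffer.getvalue().splitlines()
print('\n'.join(csv_lines))
```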