# Take-home Test: The Big IMDB quest

### **Objective**

Your assignment is to create an application that scrapes data from [IMDB](https://www.imdb.com/chart/top/) and adjusts IMDB ratings based on some rules. You don’t have to extract the whole list; concentrate on the TOP 20 movies only.

### **Tasks**

- Implement assignment using:
    - Language: **any language**
    - Libraries: **any libraries**
- Three functions are required:
    - Scraper - See below
    - Rating Adjustment
        - Oscar Calculator - See Below
        - Review Penalizer - See Below
- Provide unit tests for all functions.
- Write the TOP 20 movies, sorted in descending order and including both the original and the adjusted ratings, to a file (JSON, CSV, txt, etc.).
- Provide detailed instructions on how to run your assignment in a separate markdown file.

### Scraper

Scrape the following properties for each movie from the [IMDB TOP 250](https://www.imdb.com/chart/top/) list. Designing the data structure for them is part of the exercise:

- Rating
- Number of ratings
- Number of Oscars
- Title of the movie

### Review Penalizer:

Ratings are good because they give us an impression of how many people think a film is good or bad. However, it does matter how many people voted. The goal of this exercise is to penalize those films where the number of reviews is low. 

Find the film with the maximum number of reviews (remember, out of the TOP 20 only). This is going to be the benchmark. Compare every movie’s number of reviews to this and penalize each of them based on the following rule: Every 100k deviation from the maximum translates to a point deduction of 0.1. 

*For example*, suppose the maximum number of reviews is 2,456,123. For a given movie with 1,258,369 ratings and an IMDB score of 9.4, the difference is 1,197,754, which contains 11 full 100k steps, so the deduction is 1.1 and the adjusted rating is 8.3.

### Oscar Calculator

The Oscars should mean something, shouldn’t they? Here are the rewards for them:

- 1 or 2 oscars → 0.3 point
- 3 to 5 oscars → 0.5 point
- 6 - 10 oscars → 1 point
- 10+ oscars → 1.5 point

*For example*, if a movie is awarded 4 Oscar titles and the original IMDB rating is 7.5, the adjusted value will increase to 8 points.

### **Evaluation Criteria**

- Programming best-practices.
- Implementation of *Scraper*, *Review Penalizer* & *Oscar Calculator.*
- Show us your work through your commit history.
- Completeness: did you complete the features? Are all the tests running?
- Correctness: does the code act in a sensible, thought-out way?
- Maintainability: is it written in a clean, supportable way?

### Imports

* pandas                 - DataFrame calculations
* bs4.BeautifulSoup      - HTML parsing
* requests               - Web communication
* html                   - HTML escape sequence parsing
* tqdm                   - Progress bar
* re                     - Regex operations for extracting data from strings
* json                   - Parsing JSON text into Python objects
* datetime               - File name postfix

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests, html
from tqdm import tqdm
import re
import json
from datetime import datetime

#### Pulling IMDB page for "Le fabuleux destin d'Amélie Poulain"

The request returns the page content ( type: bytes ).

In [2]:
l_current_link = 'https://www.imdb.com/title/tt0211915/'
l_content = requests.get(l_current_link).content

l_content

b'<!DOCTYPE html><html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta name="viewport" content="width=device-width"/><meta charSet="utf-8"/><script>if(typeof uet === \'function\'){ uet(\'bb\', \'LoadTitle\', {wb: 1}); }</script><script>window.addEventListener(\'load\', (event) => {\n        if (typeof window.csa !== \'undefined\' && typeof window.csa === \'function\') {\n            var csaLatencyPlugin = window.csa(\'Content\', {\n                element: {\n                    slotId: \'LoadTitle\',\n                    type: \'service-call\'\n                }\n            });\n            csaLatencyPlugin(\'mark\', \'clickToBodyBegin\', 1666707233452);\n        }\n    })</script><title>Am\xc3\xa9lie csod\xc3\xa1latos \xc3\xa9lete (2001) - IMDb</title><meta name="description" content="Am\xc3\xa9lie csod\xc3\xa1latos \xc3\xa9lete: Directed by Jean-Pierre Jeunet. With Audrey Tautou, Mathieu Kassovitz, Rufus, Lorella

#### Parsing HTML content using BeautifulSoup

In [3]:
soup = BeautifulSoup(l_content, 'html.parser')

soup

<!DOCTYPE html>
<html lang="en-US" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/"><head><meta content="width=device-width" name="viewport"/><meta charset="utf-8"/><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1666707233452);
        }
    })</script><title>Amélie csodálatos élete (2001) - IMDb</title><meta content="Amélie csodálatos élete: Directed by Jean-Pierre Jeunet. With Audrey Tautou, Mathieu Kassovitz, Rufus, Lorella Cravotta. Amélie is an innocent and naive girl in Paris with her own sense of justice. She de

#### Searching for application data in the parsed HTML content

-> Find the first "script" tag whose type is "application/ld+json".

( JSON-LD is a JSON-based format to serialize Linked Data. The syntax is designed to easily integrate into deployed systems that already use JSON, and provides a smooth upgrade path from JSON to JSON-LD. It is primarily intended to be a way to use Linked Data in Web-based programming environments, to build interoperable Web services, and to store Linked Data in JSON-based storage engines. )

This data contains most of the fields we are looking for ( title, publish date, votes and rating ).

In [4]:
soup_data = soup.find("script", type="application/ld+json").text

soup_data

'{"@context":"https://schema.org","@type":"Movie","url":"/title/tt0211915/","name":"Le fabuleux destin d&apos;Amélie Poulain","alternateName":"Amélie csodálatos élete","image":"https://m.media-amazon.com/images/M/MV5BNDg4NjM1YjMtYmNhZC00MjM0LWFiZmYtNGY1YjA3MzZmODc5XkEyXkFqcGdeQXVyNDk3NzU2MTQ@._V1_.jpg","description":"Amélie is an innocent and naive girl in Paris with her own sense of justice. She decides to help those around her and, along the way, discovers love.","review":{"@type":"Review","itemReviewed":{"@type":"CreativeWork","url":"/title/tt0211915/"},"author":{"@type":"Person","name":"Boyo-2"},"dateCreated":"2003-03-01","inLanguage":"English","name":"This is rated so highly cause it deserves it","reviewBody":"A slice of heaven right here on earth, &quot;Amelie&quot; is a joy to behold, and has some of the most gorgeous cinematography I&apos;ve ever seen in a movie. \\n\\n Audrey Tatou is perfection as the title character.  A combination of Audrey Hepburn, Dolly Levi and Roger Rab

#### Parse JSON data

As the search result is already in JSON format, we can easily parse it using the json library.

In [5]:
imdb_data = json.loads(soup_data)

imdb_data

{'@context': 'https://schema.org',
 '@type': 'Movie',
 'url': '/title/tt0211915/',
 'name': 'Le fabuleux destin d&apos;Amélie Poulain',
 'alternateName': 'Amélie csodálatos élete',
 'image': 'https://m.media-amazon.com/images/M/MV5BNDg4NjM1YjMtYmNhZC00MjM0LWFiZmYtNGY1YjA3MzZmODc5XkEyXkFqcGdeQXVyNDk3NzU2MTQ@._V1_.jpg',
 'description': 'Amélie is an innocent and naive girl in Paris with her own sense of justice. She decides to help those around her and, along the way, discovers love.',
 'review': {'@type': 'Review',
  'itemReviewed': {'@type': 'CreativeWork', 'url': '/title/tt0211915/'},
  'author': {'@type': 'Person', 'name': 'Boyo-2'},
  'dateCreated': '2003-03-01',
  'inLanguage': 'English',
  'name': 'This is rated so highly cause it deserves it',
  'reviewBody': 'A slice of heaven right here on earth, &quot;Amelie&quot; is a joy to behold, and has some of the most gorgeous cinematography I&apos;ve ever seen in a movie. \n\n Audrey Tatou is perfection as the title character.  A combi

#### Handle missing release dates

In some cases, the datePublished field is missing, particularly for movies from the early 1900s, when release dates were not permanently stored.

If the field is not present, a KeyError is raised, in which case the release_date will be 'N/A'.

In [6]:
try:
    release_date = imdb_data['datePublished']
except KeyError as ke:
    release_date = 'N/A'
    
release_date

'2002-02-21'

In [7]:
try:
    na_release_date = imdb_data['datePublished2']  # No such field in the JSON
except KeyError as ke:
    na_release_date = 'N/A'
    
na_release_date

'N/A'
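
The same fallback can also be written without exception handling. A minimal sketch, assuming the parsed JSON-LD is a plain dict as above; `sample_data` is a made-up stand-in for `imdb_data`:

```python
# Hypothetical stand-in for the parsed JSON-LD dictionary (imdb_data)
sample_data = {"name": "Le fabuleux destin d&apos;Amélie Poulain",
               "datePublished": "2002-02-21"}

# dict.get returns the given default instead of raising KeyError
release_date = sample_data.get("datePublished", "N/A")
na_release_date = sample_data.get("datePublished2", "N/A")  # no such field

print(release_date)     # 2002-02-21
print(na_release_date)  # N/A
```

Whether try/except or `dict.get` reads better is a matter of taste; both give the same result here.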

#### Extracting movie name

If the movie title contains HTML-escaped characters ( mostly apostrophes, e.g. &apos; ), they need to be unescaped to get a presentable title.

The unescape function from the html package does this.

In [8]:
imdb_data['name']

'Le fabuleux destin d&apos;Amélie Poulain'

In [9]:
l_movie_name = html.unescape(imdb_data['name'])

l_movie_name

"Le fabuleux destin d'Amélie Poulain"

#### Find number of Oscars won by movie

The relevant data is in one of the links on the page marked with the following class designation:

```
"class":"ipc-metadata-list-item__label ipc-metadata-list-item__label--link"
```

If a movie won Oscars, this is presented in the format "Won X Oscars", where X is the number of Academy Awards won.

By iterating through the list of relevant link descriptions and searching them with a regular expression, the number of Oscars can be extracted.

Explanation of the regular expressions:
```
'Won(.+?)Oscars'
```

'Won' -> exact string

'Oscars' -> exact string

(.+?) 

    '.' -> Any character except line break
    
    '+' -> One or more of the previous expression
    
    '?' -> After a quantifier: lazy matching, i.e. as few characters as possible
    

r'\d+'

    'r'    -> raw string prefix, so that '\d' reaches the regex engine unchanged
    '\d'   -> a decimal digit
    '+'    -> One or more of the previous expression
    
    
#### Note:
The other route would be to parse the awards page for the movie ( "awards/?ref_=tt_awd" URL suffix ).
This is more time-consuming, as another page has to be pulled from IMDB.

In [10]:
soup_oscars = soup.find_all("a", attrs={"class":"ipc-metadata-list-item__label ipc-metadata-list-item__label--link"})

soup_oscars

[<a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" href="/title/tt0211915/fullcredits/cast?ref_=tt_ov_st_sm" rel="" target="">Stars</a>,
 <a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" href="/title/tt0211915/fullcredits/cast?ref_=tt_ov_st_sm" rel="" target="">Stars</a>,
 <a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" href="/title/tt0211915/awards/?ref_=tt_awd" rel="" target="">Nominated for 5 Oscars</a>,
 <a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" href="/title/tt0211915/fullcredits/?ref_=tt_cl_sm" rel="" target="">All cast &amp; crew</a>,
 <a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" href="https://pro.imdb.com/title/tt0211915/?rf=cons_tt_btf_cc&amp;ref_=cons_tt_btf_cc" rel="" target="_blank">Production, box office &amp; more at IMDbPro</a>,
 <a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" href="/title/tt021

In [11]:
num_of_oscars = 0
for i in soup_oscars:
    if re.search('Won(.+?)Oscars',i.text):
        num_of_oscars = int(re.findall(r'\d+',i.text)[0])
        
num_of_oscars

0
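
One caveat: the pattern 'Won(.+?)Oscars' only matches the plural form, so a label such as "Won 1 Oscar" would be skipped and counted as 0. The sketch below is one possible tightening, not the notebook's method: it captures the number directly and makes the trailing "s" optional. That IMDB uses the singular wording "Won 1 Oscar" is an assumption, and `count_oscars` is a hypothetical helper name.

```python
import re

# "Won", whitespace, the count, whitespace, then "Oscar" or "Oscars"
OSCAR_PATTERN = re.compile(r'Won\s+(\d+)\s+Oscars?')

def count_oscars(labels):
    """Return the Oscar count from the first matching label, or 0 if none matches."""
    for label in labels:
        match = OSCAR_PATTERN.search(label)
        if match:
            return int(match.group(1))
    return 0

print(count_oscars(["Stars", "Nominated for 5 Oscars"]))  # 0
print(count_oscars(["Stars", "Won 11 Oscars"]))           # 11
print(count_oscars(["Won 1 Oscar"]))                      # 1 (assumed wording)
```

The function takes plain strings, so it could be fed with `[a.text for a in soup_oscars]` and tested without any network access.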

#### Parse result for a single movie

In [12]:
[l_movie_name,
 release_date,
 imdb_data['aggregateRating']['ratingValue'],
 imdb_data['aggregateRating']['ratingCount'],
 num_of_oscars]

["Le fabuleux destin d'Amélie Poulain", '2002-02-21', 8.3, 753890, 0]
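
The task leaves the data structure open. The list above works, but named fields are easier to read and to test. A minimal sketch using a dataclass; the class name `Movie` and its field names are illustrative choices, not part of the original notebook:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Movie:
    name: str
    release_date: str
    rating: float
    votes: int
    oscars: int

# Values taken from the single-movie result above
amelie = Movie("Le fabuleux destin d'Amélie Poulain", "2002-02-21", 8.3, 753890, 0)

print(amelie.votes)    # 753890
print(asdict(amelie))  # plain dict, usable for a DataFrame row or json.dumps
```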

#### Parse Top 250 URL

In [13]:
soup = BeautifulSoup(requests.get('https://www.imdb.com/list/ls068082370/').text,"html.parser")

soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 250 Movies - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/list/ls068082370/" rel="canonical"/>
<meta content="http://www.imdb.com/list/ls068082370/" property="og:url">
<script>
    if (typeof uet == 'function') {
     

#### Get movie link list from script tag

In [14]:
l_title_list_soup = soup.find('script', type="application/ld+json").text

l_title_list_soup

'{\n  "@context": "http://schema.org",\n  "@type": "CreativeWork",\n  "about": {\n    "@type": "ItemList",\n    "itemListElement": [\n      {\n        "@type": "ListItem",\n        "position": "1",\n        "url": "/title/tt0111161/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "2",\n        "url": "/title/tt0068646/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "3",\n        "url": "/title/tt0468569/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "4",\n        "url": "/title/tt0071562/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "5",\n        "url": "/title/tt0110912/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "6",\n        "url": "/title/tt0108052/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "7",\n        "url": "/title/tt0167260/"\n      },\n      {\n        "@type": "ListItem",\n        "position": "8",\n        "url": "/title/t

#### Parse link list into JSON object

In [15]:
l_title_json = json.loads(l_title_list_soup)

l_title_json

{'@context': 'http://schema.org',
 '@type': 'CreativeWork',
 'about': {'@type': 'ItemList',
  'itemListElement': [{'@type': 'ListItem',
    'position': '1',
    'url': '/title/tt0111161/'},
   {'@type': 'ListItem', 'position': '2', 'url': '/title/tt0068646/'},
   {'@type': 'ListItem', 'position': '3', 'url': '/title/tt0468569/'},
   {'@type': 'ListItem', 'position': '4', 'url': '/title/tt0071562/'},
   {'@type': 'ListItem', 'position': '5', 'url': '/title/tt0110912/'},
   {'@type': 'ListItem', 'position': '6', 'url': '/title/tt0108052/'},
   {'@type': 'ListItem', 'position': '7', 'url': '/title/tt0167260/'},
   {'@type': 'ListItem', 'position': '8', 'url': '/title/tt0050083/'},
   {'@type': 'ListItem', 'position': '9', 'url': '/title/tt0060196/'},
   {'@type': 'ListItem', 'position': '10', 'url': '/title/tt0109830/'},
   {'@type': 'ListItem', 'position': '11', 'url': '/title/tt1375666/'},
   {'@type': 'ListItem', 'position': '12', 'url': '/title/tt0120737/'},
   {'@type': 'ListItem', '

#### Extract list of links from JSON

In [16]:
l_title_list = l_title_json['about']['itemListElement']

l_title_list

[{'@type': 'ListItem', 'position': '1', 'url': '/title/tt0111161/'},
 {'@type': 'ListItem', 'position': '2', 'url': '/title/tt0068646/'},
 {'@type': 'ListItem', 'position': '3', 'url': '/title/tt0468569/'},
 {'@type': 'ListItem', 'position': '4', 'url': '/title/tt0071562/'},
 {'@type': 'ListItem', 'position': '5', 'url': '/title/tt0110912/'},
 {'@type': 'ListItem', 'position': '6', 'url': '/title/tt0108052/'},
 {'@type': 'ListItem', 'position': '7', 'url': '/title/tt0167260/'},
 {'@type': 'ListItem', 'position': '8', 'url': '/title/tt0050083/'},
 {'@type': 'ListItem', 'position': '9', 'url': '/title/tt0060196/'},
 {'@type': 'ListItem', 'position': '10', 'url': '/title/tt0109830/'},
 {'@type': 'ListItem', 'position': '11', 'url': '/title/tt1375666/'},
 {'@type': 'ListItem', 'position': '12', 'url': '/title/tt0120737/'},
 {'@type': 'ListItem', 'position': '13', 'url': '/title/tt0137523/'},
 {'@type': 'ListItem', 'position': '14', 'url': '/title/tt5074352/'},
 {'@type': 'ListItem', 'posit

#### Filter to first 20 movies in the list

In [17]:
filtered = list(filter(lambda pos: int(pos['position']) <= 20,l_title_list))

filtered

[{'@type': 'ListItem', 'position': '1', 'url': '/title/tt0111161/'},
 {'@type': 'ListItem', 'position': '2', 'url': '/title/tt0068646/'},
 {'@type': 'ListItem', 'position': '3', 'url': '/title/tt0468569/'},
 {'@type': 'ListItem', 'position': '4', 'url': '/title/tt0071562/'},
 {'@type': 'ListItem', 'position': '5', 'url': '/title/tt0110912/'},
 {'@type': 'ListItem', 'position': '6', 'url': '/title/tt0108052/'},
 {'@type': 'ListItem', 'position': '7', 'url': '/title/tt0167260/'},
 {'@type': 'ListItem', 'position': '8', 'url': '/title/tt0050083/'},
 {'@type': 'ListItem', 'position': '9', 'url': '/title/tt0060196/'},
 {'@type': 'ListItem', 'position': '10', 'url': '/title/tt0109830/'},
 {'@type': 'ListItem', 'position': '11', 'url': '/title/tt1375666/'},
 {'@type': 'ListItem', 'position': '12', 'url': '/title/tt0120737/'},
 {'@type': 'ListItem', 'position': '13', 'url': '/title/tt0137523/'},
 {'@type': 'ListItem', 'position': '14', 'url': '/title/tt5074352/'},
 {'@type': 'ListItem', 'posit

#### Extract url from the filtered list

In [18]:
link_list = [x['url'] for x in filtered]

link_list

['/title/tt0111161/',
 '/title/tt0068646/',
 '/title/tt0468569/',
 '/title/tt0071562/',
 '/title/tt0110912/',
 '/title/tt0108052/',
 '/title/tt0167260/',
 '/title/tt0050083/',
 '/title/tt0060196/',
 '/title/tt0109830/',
 '/title/tt1375666/',
 '/title/tt0120737/',
 '/title/tt0137523/',
 '/title/tt5074352/',
 '/title/tt0080684/',
 '/title/tt0076759/',
 '/title/tt0133093/',
 '/title/tt0099685/',
 '/title/tt0073486/',
 '/title/tt0317248/']
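
The four steps above (read the JSON-LD text, parse it, filter by position, extract the URLs) can be folded into one small function that is easy to unit test without network access. A sketch, assuming the JSON-LD keeps the 'about' / 'itemListElement' layout shown above; `top_n_links` and the three-item sample are hypothetical:

```python
import json

def top_n_links(json_ld_text, n=20):
    """Return the URLs of the list items with position <= n, ordered by position."""
    items = json.loads(json_ld_text)['about']['itemListElement']
    # 'position' is stored as a string in the JSON-LD, so convert before comparing
    selected = [item for item in items if int(item['position']) <= n]
    selected.sort(key=lambda item: int(item['position']))
    return [item['url'] for item in selected]

# Small hand-made sample in the same shape as the real data
sample = json.dumps({"about": {"itemListElement": [
    {"@type": "ListItem", "position": "2", "url": "/title/tt0068646/"},
    {"@type": "ListItem", "position": "1", "url": "/title/tt0111161/"},
    {"@type": "ListItem", "position": "3", "url": "/title/tt0468569/"},
]}})

print(top_n_links(sample, n=2))  # ['/title/tt0111161/', '/title/tt0068646/']
```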

#### Process links as detailed above

In [19]:
imdb_top_250_data = []
for link in tqdm(link_list):
    # Download and parse the movie page
    l_current_link = f'https://www.imdb.com{link}'
    l_content = requests.get(l_current_link).content
    soup = BeautifulSoup(l_content, 'html.parser')

    # Read and parse the JSON-LD metadata block
    soup_data = soup.find("script", type="application/ld+json").text
    imdb_data = json.loads(soup_data)

    # Unescape HTML entities in the title
    l_movie_name = html.unescape(imdb_data['name'])

    # Look for a "Won X Oscars" link label; the default is 0
    soup_oscars = soup.find_all("a", attrs={"class":"ipc-metadata-list-item__label ipc-metadata-list-item__label--link"})
    num_of_oscars = 0
    for i in soup_oscars:
        if re.search('Won(.+?)Oscars',i.text):
            num_of_oscars = int(re.findall(r'\d+',i.text)[0])

    # Fall back to 'N/A' when the release date is missing
    try:
        release_date = imdb_data['datePublished']
    except KeyError:
        release_date = 'N/A'

    l_result_list = [l_movie_name,
                     release_date,
                     imdb_data['aggregateRating']['ratingValue'],
                     imdb_data['aggregateRating']['ratingCount'],
                     num_of_oscars]

    imdb_top_250_data.append(l_result_list)
    
imdb_top_250_data

100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:33<00:00,  1.68s/it]


[['The Shawshank Redemption', '1995-05-25', 9.3, 2653780, 0],
 ['The Godfather', '1982-03-25', 9.2, 1839544, 3],
 ['The Dark Knight', '2008-08-07', 9, 2626211, 2],
 ['The Godfather Part II', '1983-04-21', 9, 1260388, 6],
 ['Pulp Fiction', '1995-04-13', 8.9, 2031337, 0],
 ["Schindler's List", '1994-03-10', 9, 1344621, 7],
 ['The Lord of the Rings: The Return of the King',
  '2004-01-08',
  9,
  1829392,
  11],
 ['12 Angry Men', '1960-01-28', 9, 783580, 0],
 ['Il buono, il brutto, il cattivo', '1979-11-29', 8.8, 756961, 0],
 ['Forrest Gump', '1994-12-08', 8.8, 2055715, 6],
 ['Inception', '2010-07-22', 8.8, 2327473, 4],
 ['The Lord of the Rings: The Fellowship of the Ring',
  '2002-01-10',
  8.8,
  1857303,
  4],
 ['Fight Club', '2000-01-27', 8.8, 2099035, 0],
 ['Dangal', '2016-12-21', 8.3, 190469, 0],
 ['The Empire Strikes Back', '1982-01-28', 8.7, 1282146, 0],
 ['Star Wars', '1979-08-16', 8.6, 1354839, 6],
 ['The Matrix', '1999-08-05', 8.7, 1897142, 4],
 ['Goodfellas', '1991-02-14', 8.7

#### Create DataFrame with headers

In [20]:
index = ["name", "release_date", "rating", "votes", "oscars"]

df = pd.DataFrame(imdb_top_250_data,columns=index)

df

| | name | release_date | rating | votes | oscars |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 |
| 1 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 |
| 2 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 |
| 3 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 |
| 4 | Pulp Fiction | 1995-04-13 | 8.9 | 2031337 | 0 |
| 5 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 |
| 6 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 |
| 7 | 12 Angry Men | 1960-01-28 | 9.0 | 783580 | 0 |
| 8 | Il buono, il brutto, il cattivo | 1979-11-29 | 8.8 | 756961 | 0 |
| 9 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 |


### Oscar Calculator

The Oscars should mean something, shouldn’t they? Here are the rewards for them:

- 1 or 2 oscars → 0.3 point
- 3 to 5 oscars → 0.5 point
- 6 - 10 oscars → 1 point
- 10+ oscars → 1.5 point

*For example*, if a movie is awarded 4 Oscar titles and the original IMDB rating is 7.5, the adjusted value will increase to 8 points.

In [21]:
def oscars_adjustment(p_num_of_oscars: int) -> float:
    if p_num_of_oscars == 0:
        return 0
    elif p_num_of_oscars > 0 and p_num_of_oscars < 3:
        return 0.3
    elif p_num_of_oscars > 2 and p_num_of_oscars < 6:
        return 0.5
    elif p_num_of_oscars > 5 and p_num_of_oscars < 11:
        return 1
    else:
        return 1.5
    
    
for oscars in range(15):
    print(f'Number of Oscars: {oscars}, Oscars Adjustment: {oscars_adjustment(oscars)}')

Number of Oscars: 0, Oscars Adjustment: 0
Number of Oscars: 1, Oscars Adjustment: 0.3
Number of Oscars: 2, Oscars Adjustment: 0.3
Number of Oscars: 3, Oscars Adjustment: 0.5
Number of Oscars: 4, Oscars Adjustment: 0.5
Number of Oscars: 5, Oscars Adjustment: 0.5
Number of Oscars: 6, Oscars Adjustment: 1
Number of Oscars: 7, Oscars Adjustment: 1
Number of Oscars: 8, Oscars Adjustment: 1
Number of Oscars: 9, Oscars Adjustment: 1
Number of Oscars: 10, Oscars Adjustment: 1
Number of Oscars: 11, Oscars Adjustment: 1.5
Number of Oscars: 12, Oscars Adjustment: 1.5
Number of Oscars: 13, Oscars Adjustment: 1.5
Number of Oscars: 14, Oscars Adjustment: 1.5
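
The task also asks for unit tests. A possible starting point with the standard unittest module, checking the boundaries of each bracket. The function is repeated here only so that the snippet is self-contained, and the test class name is arbitrary. Note that the task text lists both "6 - 10" and "10+"; this sketch follows the notebook's reading that exactly 10 Oscars earn 1 point.

```python
import unittest

def oscars_adjustment(p_num_of_oscars: int) -> float:
    # Copy of the function from the cell above
    if p_num_of_oscars == 0:
        return 0
    elif p_num_of_oscars > 0 and p_num_of_oscars < 3:
        return 0.3
    elif p_num_of_oscars > 2 and p_num_of_oscars < 6:
        return 0.5
    elif p_num_of_oscars > 5 and p_num_of_oscars < 11:
        return 1
    else:
        return 1.5

class OscarsAdjustmentTest(unittest.TestCase):
    def test_bracket_boundaries(self):
        # First and last value of every bracket
        expected = {0: 0, 1: 0.3, 2: 0.3, 3: 0.5, 5: 0.5,
                    6: 1, 10: 1, 11: 1.5, 14: 1.5}
        for oscars, bonus in expected.items():
            with self.subTest(oscars=oscars):
                self.assertEqual(oscars_adjustment(oscars), bonus)

    def test_task_example(self):
        # 4 Oscars on a 7.5 rating should give 8 points
        self.assertAlmostEqual(7.5 + oscars_adjustment(4), 8.0)
```

Saved in a file such as test_oscars.py (a hypothetical name), the tests can be run with `python -m unittest`.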


In [22]:
df['oscars_adjustment'] =  [oscars_adjustment(x) for x in df['oscars']]

df

| | name | release_date | rating | votes | oscars | oscars_adjustment |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 | 0.0 |
| 1 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 | 0.5 |
| 2 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 | 0.3 |
| 3 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 | 1.0 |
| 4 | Pulp Fiction | 1995-04-13 | 8.9 | 2031337 | 0 | 0.0 |
| 5 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 | 1.0 |
| 6 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 | 1.5 |
| 7 | 12 Angry Men | 1960-01-28 | 9.0 | 783580 | 0 | 0.0 |
| 8 | Il buono, il brutto, il cattivo | 1979-11-29 | 8.8 | 756961 | 0 | 0.0 |
| 9 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 | 1.0 |


### Review Penalizer:

Ratings are good because they give us an impression of how many people think a film is good or bad. However, it does matter how many people voted. The goal of this exercise is to penalize those films where the number of reviews is low. 

Find the film with the maximum number of reviews (remember, out of the TOP 20 only). This is going to be the benchmark. Compare every movie’s number of reviews to this and penalize each of them based on the following rule: Every 100k deviation from the maximum translates to a point deduction of 0.1. 

*For example*, suppose the maximum number of reviews is 2,456,123. For a given movie with 1,258,369 ratings and an IMDB score of 9.4, the difference is 1,197,754, which contains 11 full 100k steps, so the deduction is 1.1 and the adjusted rating is 8.3.

In [23]:
# df already holds only the TOP 20 movies, so the column maximum is the benchmark
max_votes = df['votes'].max()

max_votes

2653780

In [24]:
df['review_penalty'] = (max_votes - df['votes']) // 100000  * -0.1

df

| | name | release_date | rating | votes | oscars | oscars_adjustment | review_penalty |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 | 0.0 | -0.0 |
| 1 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 | 0.5 | -0.8 |
| 2 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 | 0.3 | -0.0 |
| 3 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 | 1.0 | -1.3 |
| 4 | Pulp Fiction | 1995-04-13 | 8.9 | 2031337 | 0 | 0.0 | -0.6 |
| 5 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 | 1.0 | -1.3 |
| 6 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 | 1.5 | -0.8 |
| 7 | 12 Angry Men | 1960-01-28 | 9.0 | 783580 | 0 | 0.0 | -1.8 |
| 8 | Il buono, il brutto, il cattivo | 1979-11-29 | 8.8 | 756961 | 0 | 0.0 | -1.8 |
| 9 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 | 1.0 | -0.5 |
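
Multiplying by -0.1 can leave small floating-point residues (for example, 3 * 0.1 evaluates to 0.30000000000000004 in Python). A hedged alternative is to count the full 100k steps as an integer and divide by 10 only at the end. The helper name `review_penalty` is hypothetical; the numbers come from the example in the task description.

```python
def review_penalty(votes: int, max_votes: int) -> float:
    """Deduction in rating points: 0.1 per full 100k votes below the maximum."""
    full_steps = (max_votes - votes) // 100_000  # integer number of 100k steps
    return full_steps / 10

# Example from the task: 2,456,123 vs 1,258,369 votes, rating 9.4
penalty = review_penalty(1_258_369, 2_456_123)
print(penalty)                  # 1.1
print(round(9.4 - penalty, 1))  # 8.3
```

Unlike the `review_penalty` column above, this helper returns the deduction as a positive number, so it has to be subtracted from the rating.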


#### Final adjustment calculation

In [25]:
df['adjusted_rating'] = df['rating'] + df['oscars_adjustment'] + df['review_penalty']

df

| | name | release_date | rating | votes | oscars | oscars_adjustment | review_penalty | adjusted_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 | 0.0 | -0.0 | 9.3 |
| 1 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 | 0.5 | -0.8 | 8.9 |
| 2 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 | 0.3 | -0.0 | 9.3 |
| 3 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 | 1.0 | -1.3 | 8.7 |
| 4 | Pulp Fiction | 1995-04-13 | 8.9 | 2031337 | 0 | 0.0 | -0.6 | 8.3 |
| 5 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 | 1.0 | -1.3 | 8.7 |
| 6 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 | 1.5 | -0.8 | 9.7 |
| 7 | 12 Angry Men | 1960-01-28 | 9.0 | 783580 | 0 | 0.0 | -1.8 | 7.2 |
| 8 | Il buono, il brutto, il cattivo | 1979-11-29 | 8.8 | 756961 | 0 | 0.0 | -1.8 | 7.0 |
| 9 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 | 1.0 | -0.5 | 9.3 |


In [26]:
# Drop both helper columns in one call
df = df.drop(columns=["oscars_adjustment", "review_penalty"])

df

| | name | release_date | rating | votes | oscars | adjusted_rating |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 | 9.3 |
| 1 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 | 8.9 |
| 2 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 | 9.3 |
| 3 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 | 8.7 |
| 4 | Pulp Fiction | 1995-04-13 | 8.9 | 2031337 | 0 | 8.3 |
| 5 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 | 8.7 |
| 6 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 | 9.7 |
| 7 | 12 Angry Men | 1960-01-28 | 9.0 | 783580 | 0 | 7.2 |
| 8 | Il buono, il brutto, il cattivo | 1979-11-29 | 8.8 | 756961 | 0 | 7.0 |
| 9 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 | 9.3 |


#### Sort DataFrame by adjusted rating

In [27]:
sorted_df = df.sort_values('adjusted_rating', ascending=False).head(20).reset_index(drop=True)

sorted_df

| | name | release_date | rating | votes | oscars | adjusted_rating |
|---|---|---|---|---|---|---|
| 0 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 | 9.7 |
| 1 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 | 9.3 |
| 2 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 | 9.3 |
| 3 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 | 9.3 |
| 4 | Inception | 2010-07-22 | 8.8 | 2327473 | 4 | 9.0 |
| 5 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 | 8.9 |
| 6 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 | 8.7 |
| 7 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 | 8.7 |
| 8 | The Lord of the Rings: The Fellowship of the Ring | 2002-01-10 | 8.8 | 1857303 | 4 | 8.6 |
| 9 | The Matrix | 1999-08-05 | 8.7 | 1897142 | 4 | 8.5 |


#### Adjust index to start at 1, turn index into "rank" column

In [28]:
sorted_df.index += 1

sorted_df.insert(loc=0, column='rank', value=sorted_df.index)

sorted_df

| | rank | name | release_date | rating | votes | oscars | adjusted_rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | The Lord of the Rings: The Return of the King | 2004-01-08 | 9.0 | 1829392 | 11 | 9.7 |
| 2 | 2 | The Shawshank Redemption | 1995-05-25 | 9.3 | 2653780 | 0 | 9.3 |
| 3 | 3 | The Dark Knight | 2008-08-07 | 9.0 | 2626211 | 2 | 9.3 |
| 4 | 4 | Forrest Gump | 1994-12-08 | 8.8 | 2055715 | 6 | 9.3 |
| 5 | 5 | Inception | 2010-07-22 | 8.8 | 2327473 | 4 | 9.0 |
| 6 | 6 | The Godfather | 1982-03-25 | 9.2 | 1839544 | 3 | 8.9 |
| 7 | 7 | Schindler's List | 1994-03-10 | 9.0 | 1344621 | 7 | 8.7 |
| 8 | 8 | The Godfather Part II | 1983-04-21 | 9.0 | 1260388 | 6 | 8.7 |
| 9 | 9 | The Lord of the Rings: The Fellowship of the Ring | 2002-01-10 | 8.8 | 1857303 | 4 | 8.6 |
| 10 | 10 | The Matrix | 1999-08-05 | 8.7 | 1897142 | 4 | 8.5 |


#### Write CSV file with a timestamp suffix, omit index

In [29]:
sorted_df.to_csv(path_or_buf=f'imdb_top_250_adjusted_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv', 
                sep=';', 
                index=False, 
                header=True)
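
The task also allows JSON output. A minimal sketch with the standard json module, writing a list of row dictionaries; `write_json` is a hypothetical helper, the two sample rows are taken from the table above, and the file name pattern mirrors the CSV one. With pandas, `sorted_df.to_dict('records')` would produce such a list.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def write_json(rows, directory="."):
    """Write rows (a list of dicts) to a timestamped JSON file and return its path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(directory) / f"imdb_top_250_adjusted_{stamp}.json"
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps titles with accented characters readable
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return path

rows = [
    {"rank": 1, "name": "The Lord of the Rings: The Return of the King",
     "rating": 9.0, "adjusted_rating": 9.7},
    {"rank": 2, "name": "The Shawshank Redemption",
     "rating": 9.3, "adjusted_rating": 9.3},
]

# Written to a temporary directory here so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_json(rows, tmp)
    print(out_path.name)
```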