# Assignment 8: Pandas

Please read the task descriptions carefully and implement **only** what the tasks ask you to implement. Follow the task descriptions closely and keep deviations to a minimum: the test cases below each input cell are the gold standard. Finally, you do not need any error handling for this assignment; you can assume that all input to your functions will be valid.

As in the other assignments, using `print` to test your implementation is encouraged but never required. Make sure not to confuse `return` and `print` statements: if your function has to **return** something, use the `return` statement.

Try to implement the tasks yourself or in a small team. If you blindly copy a solution from the Internet or from other students, you will not learn anything. Rather, make an effort to understand the solution! Furthermore, do not modify the _test cells_: if you do, you effectively cheat the system, which does not help your learning process.

Some aspects of this assignment require you to **self-study** and do some research beyond the lecture contents: use your favorite search engine to look up documentation, usage examples, and definitions of the mentioned functions. There might be tasks where you have to read and investigate the [Python Standard Library](https://docs.python.org/3/library/) to find the documentation for a function that is used or that you want to use.

This assignment will use the third-party module [pandas](https://pandas.pydata.org/).

In Google Colab and Anaconda, it is already installed. If you see an `ImportError` in the next cell, run `%pip install pandas` to install this module.

---
# Task 0: Loading the `csv` file.

We will operate on a `pd.DataFrame` from the [Pandas](https://pandas.pydata.org/) third-party module.

We will load the file with the function `load_file()` defined below. Do not modify this function.

In [None]:
import pandas as pd

Execute the following cell to check if you uploaded the file correctly.

In [3]:
from pathlib import Path

DB_FILE = Path('movies.csv')
if not DB_FILE.exists():
    print("\033[1;41m", " " * 48, '\n', "        Please upload the movies.csv file.       \n", " " * 48, "\033[0m")
else:
    print("\033[1;42m", " " * 48, '\n', "The movies.csv file was found and can be loaded. \n", " " * 48, "\033[0m")

 The movies.csv file was found and can be loaded.


In [4]:
def load_file():
    _df = pd.read_csv("movies.csv")
    _categories = ['color', 'language', 'country', 'content_rating']
    for _cat in _categories:
        _df[_cat] = _df[_cat].astype('category')
    _ints = ['duration', 'director_facebook_likes', 
             'actor_1_facebook_likes', 'actor_2_facebook_likes', 'actor_3_facebook_likes',
             'facenumber_in_poster',
             'num_critic_for_reviews', 'num_user_for_reviews', 
             'budget', 'title_year', 'movie_facebook_likes']
    for _int in _ints:
        _df[_int] = _df[_int].astype(pd.Int64Dtype())
    
    return _df

Execute the next cell to load the file. You'll see an error if it doesn't load correctly.

Afterwards, you will have the full dataframe in the variable `DF`.

If you modified this dataframe for some reason, simply call the function `load_file` again to reload the file from disk.

Please try not to modify the variable `DF`. If you want to add or remove columns, add or drop rows, or modify it in any other way, make a copy first with `DF.copy()`.
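The copy-first advice can be illustrated with a minimal sketch. The small frame `toy` and the column `duration_hours` are hypothetical stand-ins and not part of the assignment data; the point is only that changes to the copy do not reach the original.

```python
import pandas as pd

# Hypothetical toy frame standing in for DF (not the real movie data)
toy = pd.DataFrame({"movie_title": ["A", "B"], "duration": [90, 120]})

# Work on a copy so that the original frame stays untouched
work = toy.copy()
work["duration_hours"] = work["duration"] / 60

print(list(toy.columns))   # the original still has only its two columns
print(list(work.columns))  # the copy has the additional column
```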

In [7]:
DF = load_file()

We can look at the first 5 elements:

In [8]:
DF.head(5)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0,855.0,Joel David Moore,1000,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563,1000.0,Orlando Bloom,40000,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0,161.0,Rory Kinnear,11000,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000,23000.0,Christian Bale,27000,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000,8.5,2.35,164000
4,,Doug Walker,,,131,,Rob Walker,131,,Documentary,...,,,,,,,12,7.1,,0


And the columns:

In [9]:
DF.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

Or 5 random elements:

In [10]:
DF.sample(5)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
1417,Color,Mike Mitchell,127,100,31,591,Kelly Preston,947,63939454.0,Adventure|Comedy|Family|Sci-Fi,...,217,English,USA,PG,35000000.0,2005,742,6.2,2.35,0
2090,Color,Irwin Winkler,53,114,34,287,Jeremy Northam,649,50728000.0,Action|Crime|Drama|Mystery|Thriller,...,166,English,USA,PG-13,22000000.0,1995,327,5.8,1.85,0
1843,Black and White,Richard Attenborough,56,175,0,14,Dirk Bogarde,385,50800000.0,Drama|History|War,...,210,English,USA,PG,26000000.0,1977,232,7.4,2.35,0
1880,Color,Michael Winterbottom,78,101,187,127,Ava Acres,554,,Drama,...,25,English,UK,Not Rated,,2014,213,4.7,2.35,488
813,Color,Stephen Herek,56,114,65,440,Morgan Fairchild,743,12065985.0,Comedy|Drama,...,88,English,USA,PG,60000000.0,1998,730,4.9,2.35,296


---
# Task 1: Get Movie Title

Implement the function `get_movie_title(position)` that returns the *name* of the movie at the integer position `position`.

You have to find the correct column first, meaning the one that corresponds to the name of the movie.

Use only instance methods from `pd.DataFrame`.

_Hint_: It is only one line.
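As a hedged illustration of what "integer position" means, the sketch below uses a hypothetical toy frame whose index labels deliberately differ from the row positions. `iloc` selects by position, `loc` selects by label.

```python
import pandas as pd

# Hypothetical toy frame; the index labels (10, 20, 30) are not the positions (0, 1, 2)
toy = pd.DataFrame(
    {"name": ["x", "y", "z"], "score": [1, 2, 3]},
    index=[10, 20, 30],
)

# Position-based lookup: the row at integer position 1, then one column of it
value = toy.iloc[1]["name"]

# Label-based lookup for comparison: the row whose index label is 20
by_label = toy.loc[20, "name"]

print(value, by_label)
```

In this toy frame both lookups reach the same row, but only because label 20 happens to sit at position 1.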

In [26]:
# ⬇️ Add your code below this line ⬇️
### BEGIN SOLUTION

def get_movie_title(position):
    # Select the row by integer position, then return only its title
    return DF.iloc[position]['movie_title']

get_movie_title(11)
### END SOLUTION
# ⬆️ Add your code above this line ⬆️

'Superman Returns'

In [14]:
# Test Case
from unittest import TestCase
__ = TestCase()

# Sanity
__.assertTrue('get_movie_title' in locals(), msg='You have to name the function `get_movie_title`.')

# reset DF in case it was modified
DF = load_file()

# Actual Test
__.assertEqual('Spider-Man', get_movie_title(161), msg="At integer position 161 is the movie 'Spider-Man'")
__.assertEqual('Spider-Man 2', get_movie_title(31), msg="At integer position 31 is the movie 'Spider-Man 2'")
__.assertEqual('Spider-Man 3', get_movie_title(6), msg="At integer position 6 is the movie 'Spider-Man 3'")
__.assertEqual('Batman: The Movie', get_movie_title(4457), msg="At integer position 4457 is the movie 'Batman: The Movie'")
__.assertEqual('Batman Returns', get_movie_title(441), msg="At integer position 441 is the movie 'Batman Returns'")
__.assertEqual('Batman Forever', get_movie_title(309), msg="At integer position 309 is the movie 'Batman Forever'")
__.assertEqual('Batman & Robin', get_movie_title(217), msg="At integer position 217 is the movie 'Batman & Robin'")
__.assertEqual('Cloudy with a Chance of Meatballs', get_movie_title(357), msg="At integer position 357 is the movie 'Cloudy with a Chance of Meatballs'")

print("\n\033[37;42;2m  Success! Your code works as intended.  \033[0m\n")


  Success! Your code works as intended.



---
# Task 2: Extract Keywords

Implement the function `get_keywords(position)` that returns a `list` of keywords for the movie at the given integer position `position`.

If the entry is empty (use `pd.isna()` to check), return an empty list.

_Hint_: Look at the dataframe first to understand the data structure.

_Hint_: Remember the `.split()` function of strings? Use it accordingly.
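The two hints can be combined in a small sketch on plain values rather than on the dataframe. The helper name `split_tags` and the `|` separator in the sample string are assumptions made for illustration only.

```python
import pandas as pd

def split_tags(raw):
    """Hypothetical helper: split a separator-delimited string, treating missing values as empty."""
    # pd.isna is True for NaN, None and pd.NA, and False for ordinary strings
    if pd.isna(raw):
        return []
    return raw.split("|")

print(split_tags("a|b c|d"))     # a list with three entries
print(split_tags(float("nan")))  # an empty list
```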

In [None]:
# ⬇️ Add your code below this line ⬇️
### BEGIN SOLUTION

def get_keywords(position):
    keywords = DF.iloc[position]['plot_keywords']
    
    if pd.isna(keywords):
        return []
    else:
        return keywords.split("|")

### END SOLUTION
# ⬆️ Add your code above this line ⬆️

# To test:
print(get_keywords(42))  # ← ['autopsy', 'lantern', 'planet', 'ring', 'test pilot']

In [None]:
# Test Case
from unittest import TestCase
__ = TestCase()

# Sanity
__.assertTrue('get_keywords' in locals(), msg='You have to name the function `get_keywords`.')

# reset DF in case it was modified
DF = load_file()

__.assertListEqual(
    ['autopsy', 'lantern', 'planet', 'ring', 'test pilot'],
    get_keywords(42),
    msg="You have a wrong result.")
__.assertListEqual(
    ['kidnapping', 'reference to franz beckenbauer', 'scene during end credits', 'second part', 'singing in a car'],
    get_keywords(1234),
    msg="You have a wrong result.")
__.assertListEqual(
    ['cat killer', 'death of animal', 'high heels', 'kneed in the crotch', 'kneed in the groin'],
    get_keywords(4321),
    msg="You have a wrong result.")

# Test where empty
__.assertListEqual(
    [],
    get_keywords(4711),
    msg="You have a wrong result.")

print("\n\033[37;42;2m  Success! Your code works as intended.  \033[0m\n")

---
# Task 3: Average Length

Implement the function `average_movie_lengths()` that returns a `pd.Series` where the *index* is the name of the movie director and the *value* is the average length (duration) of all movies by that director.
The returned series must be sorted in *ascending* order by the index (meaning the director names).

- If the director name is empty, skip this entry.
- If the movie length is empty, skip this entry and also ignore it in the computation for the average.
- If a director has no valid movie lengths, omit that director from the result.
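The rules above can be sketched on a hypothetical toy frame. The column names `director` and `minutes` are made up for illustration; this is one possible approach under those assumptions, not necessarily the intended one.

```python
import pandas as pd

# Hypothetical toy data: one row without a director, one row without a length
toy = pd.DataFrame({
    "director": ["Zed", "Amy", "Amy", None, "Bob"],
    "minutes": [100, 90, 110, 80, None],
})

# Drop rows where either value is missing; "Bob" has no valid length and disappears
clean = toy.dropna(subset=["director", "minutes"])

# Average per director, then sort by the index (the director names)
avg = clean.groupby("director")["minutes"].mean().sort_index()

print(avg)
```

Selecting the single column before calling `mean()` yields a `pd.Series` directly instead of a one-column dataframe.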

In [None]:
# ⬇️ Add your code below this line ⬇️
### BEGIN SOLUTION

def average_movie_lengths():
    # We modify the dataframe, so we need a copy first:
    df = DF.copy()
    
    # We only want director_name and duration
    df = df[['director_name', 'duration']]
    
    # We don't care about nan values (dropping rows, NOT columns)
    df = df.dropna(axis='index')
    
    # Index should be director names
    df = df.set_index('director_name')
    
    # Now, we compute the average by grouping along the index
    df = df.groupby('director_name').mean()
    
    # And we sort by the index
    df = df.sort_index()
    
    # But we need a series:
    s = df['duration']
    
    # Done.
    return s

### END SOLUTION
# ⬆️ Add your code above this line ⬆️


# Example:
average_movie_lengths()['Harold Becker'] # ← 107.0

In [None]:
# Test Case
from unittest import TestCase
__ = TestCase()

# Sanity
__.assertTrue('average_movie_lengths' in locals(), msg='You have to name the function `average_movie_lengths`.')


# reset DF in case it was modified
DF = load_file()

# Call function

_student = average_movie_lengths()

__.assertIsInstance(_student, pd.Series, msg="Your return value is not a Series.")

__.assertEqual(2388, len(_student), msg="The returned Series must have 2388 entries.")
__.assertEqual(100.0, _student['Aaron Schneider'], msg="Aaron Schneider's average is 100.0")
__.assertEqual(103.0, _student['Greg Coolidge'], msg="Greg Coolidge's average is 103.0")
__.assertEqual(82.0, _student['Mickey Liddell'], msg="Mickey Liddell's average is 82.0")
__.assertEqual(107.0, _student['Harold Becker'], msg="Harold Becker's average is 107.0")

# Is it sorted?
__.assertListEqual(list(_student.index), sorted(_student.index), msg='Your Series is not sorted correctly.')

# Check 5 elements from the beginning
__.assertListEqual(
    [('Adam Brooks', 112.0), ('Adam Carolla', 98.0), ('Adam Goldberg', 111.0), ('Adam Green', 93.0), ('Adam Jay Epstein', 76.0)],
    list(_student.items())[5:10],
    msg="The elements 5-9 differ."
)
# Check 5 elements in the middle
__.assertListEqual(
    [('F. Gary Gray', 122.875), ('Fabián Bielinsky', 114.0), ('Fatih Akin', 99.0), ('Fede Alvarez', 96.0), ('Fedor Bondarchuk', 115.0)],
    list(_student.items())[670:675],
    msg="The elements 670-674 differ."
)
# Check another 5 elements from the middle
__.assertListEqual(
    [('William Cottrell', 83.0), ('William Dear', 106.0), ('William Eubank', 97.0), ('William Friedkin', 104.42857142857143), ('William Gazecki', 105.5)],
    list(_student.items())[2350:2355],
    msg="The elements 2350-2354 differ."
)

print("\n\033[37;42;2m  Success! Your code works as intended.  \033[0m\n")

---
# Task 4: Movie Profit Ratio

Implement the function `gross_per_budget` that computes the ratio of gross/budget for each movie and returns a `DataFrame` with that information.

The resulting dataframe must fulfill all of the following criteria:

- The columns must only be "director_name", "movie_title", "gross", "budget", and finally "gross/budget" (and the index column)
- The value in the column "gross/budget" must be the result of the row-wise division of the column "gross" by the column "budget".
- There must not be **any** NaN (empty) values in the resulting dataframe.
- The dataframe must be sorted in **descending** order by the newly created "gross/budget" column. Do not reset the index.

Do **NOT** modify the original dataframe. The first statement in your function should be: `df = DF.copy()`
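The criteria above can be sketched on a hypothetical toy frame. The columns `revenue` and `cost` and the index labels are made up for illustration; the sketch shows row-wise division, dropping incomplete rows, and a descending sort that keeps the original index labels.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels and one missing value
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "revenue": [50.0, 200.0, None],
        "cost": [100.0, 100.0, 10.0],
    },
    index=[7, 8, 9],
)

# Copy first, then drop every row that contains a missing value
work = toy.copy().dropna()

# Dividing two columns is element-wise, so each row gets its own ratio
work["revenue/cost"] = work["revenue"] / work["cost"]

# Descending sort; the index labels travel with their rows and are not reset
work = work.sort_values(by="revenue/cost", ascending=False)

print(work)
```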

In [None]:
# ⬇️ Add your code below this line ⬇️
### BEGIN SOLUTION

def gross_per_budget():
    # Copy the original dataframe
    df = DF.copy()
    
    # We only care about the required columns
    df = df[['director_name', 'movie_title', 'gross', 'budget']]
    
    # Remove any nan-row, we don't want any nan
    df = df.dropna(axis='index')
    
    # Create a new column "gross/budget" with the result of the division
    df["gross/budget"] = df['gross'] / df['budget']
    
    # Sort by this newly created column in descending order
    df = df.sort_values(by="gross/budget", ascending=False)
        
    # Done.
    return df

### END SOLUTION
# ⬆️ Add your code above this line ⬆️

# Example: (Should display: "161  Spider-Man  2.904362")
result = gross_per_budget()
print(result[result.movie_title == 'Spider-Man'][['movie_title', 'gross/budget']])

In [None]:
# Test Case
from unittest import TestCase
__ = TestCase()

# Sanity
__.assertTrue('gross_per_budget' in locals(), msg='You have to name the function `gross_per_budget`.')

# reset DF in case it was modified
DF = load_file()

# call function
_student = gross_per_budget()

__.assertIsInstance(_student, pd.DataFrame, msg="Your return value is not a DataFrame.")

__.assertEqual((3891, 5), _student.shape, msg="You do not have the requested number of rows or columns.")
__.assertEqual(
    sorted(['director_name', 'movie_title', 'gross', 'budget', 'gross/budget']),
    sorted(_student.columns),
    msg="You have different columns."
)

# for the test, we're rounding the values to eliminate rounding errors
_student["gross/budget"] = _student["gross/budget"].apply(round)

__.assertTrue(
    (pd.Series(data={
        'director_name': 'Oren Peli',
        'movie_title': 'Paranormal Activity',
        'gross': 107917283.0,
        'budget': 15000,
        'gross/budget': round(7194.4855333333335)
    }, name=4793) == _student.iloc[0]).all(),
    msg="You have a different value at the first position.\n\nLook at the test case to see the wanted values."
)
__.assertTrue(
    (pd.Series(data={
        'director_name': 'Ekachai Uekrongtham',
        'movie_title': 'Skin Trade',
        'gross': 162.0,
        'budget': 9000000,
        'gross/budget': round(0.000018)
    }, name=3330) == _student.iloc[-1]).all(),
    msg="You have a different value at the last position.\n\nLook at the test case to see the wanted values."
)
__.assertTrue(
    (pd.Series(data={
        'director_name': 'David Bowers',
        'movie_title': 'Diary of a Wimpy Kid: Rodrick Rules',
        'gross': 52691009.0,
        'budget': 21000000,
        'gross/budget': round(2.5090956666666666)
    }, name=2405) == _student.loc[2405]).all(),
    msg="You have a different value at the index 2405.\n\nLook at the test case to see the wanted values."
)

print("\n\033[37;42;2m  Success! Your code works as intended.  \033[0m\n")