<a href="https://colab.research.google.com/github/d-tomas/transform4europe/blob/main/notebooks/exercise_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1: Descriptive statistics and visualisation


Let's put into practice the central tendency, dispersion and visualisation functions we have seen so far.

## Initial setup

In [None]:
# Import the Python libraries required

import matplotlib.pyplot as plt
import pandas as pd  # Pandas allows manipulating large tables
import seaborn as sns

sns.set_style('whitegrid')  # Adds a rather cute background grid in Seaborn

# Display floating-point output with two decimal places
%precision %.2f

We are working with a dataset in CSV format containing statistics on 85,855 films from [IMDb](https://www.imdb.com/).

Each row contains the following information:

* `imdb_title_id`: IMDb identifier
* `title`: movie title
* `original_title`: original title (usually matches the `title` field, but not always)
* `year`: release year
* `date_published`: release date
* `genre`: genre
* `duration`: duration (in minutes)
* `country`: country of the film
* `language`: original language
* `director`: name of the director
* `writer`: name of the screenwriter
* `production_company`: name of the production company
* `actors`: names of the main actors, separated by commas
* `description`: brief description of the storyline
* `avg_vote`: users' rating (from 0 to 10)
* `votes`: number of votes received
* `budget`: budget of the movie
* `usa_gross_income`: US box office income
* `worldwide_gross_income`: worldwide box office income (spelled `worlwide_gross_income` in the raw file; renamed during the initial cleaning)
* `metascore`: critics' score (from 0 to 100)
* `reviews_from_users`: number of user reviews
* `reviews_from_critics`: number of reviews from critics

In [None]:
# Download and extract the CSV file with the data

!wget https://github.com/d-tomas/transform4europe/raw/main/datasets/imdb.tgz
!tar xvfz imdb.tgz
!rm imdb.tgz

In [None]:
# Load data in CSV format

data = pd.read_csv('imdb.csv', index_col='imdb_title_id')
data

## Some initial cleaning

In [None]:
# Let's look at the data types of each column

data.info()

In [None]:
# Rename column 'worlwide_gross_income' as 'worldwide_gross_income'
data.rename(columns={'worlwide_gross_income': 'worldwide_gross_income'}, inplace=True)

# Transform 'year' to numeric type
data['year'] = pd.to_numeric(data['year'], errors='coerce')

# Transform 'budget' to numeric: keep the last space-separated token (the amount) and drop any prefix before it
data['budget'] = pd.to_numeric(data['budget'].str.split(' ').str[-1], errors='coerce')

# Transform 'usa_gross_income' to numeric
data['usa_gross_income'] = pd.to_numeric(data['usa_gross_income'].str.split(' ').str[-1], errors='coerce')

# Transform 'worldwide_gross_income' to numeric
data['worldwide_gross_income'] = pd.to_numeric(data['worldwide_gross_income'].str.split(' ').str[-1], errors='coerce')
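
The conversion used for the three monetary columns can be hard to picture without seeing the raw strings. The following is a minimal sketch of the same two steps on a few made-up values; the sample strings are assumptions chosen for illustration and are not read from `imdb.csv`.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration (not read from imdb.csv)
raw = pd.Series(['$ 2250', 'ITL 3000000', None, 'unknown'])

# Keep the last space-separated token, i.e. the amount without its prefix
amount = raw.str.split(' ').str[-1]

# Convert to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(amount, errors='coerce')
print(numeric)
```

Note that this approach discards the prefix entirely. If a column mixes several currencies, the resulting numbers are not directly comparable, which is worth keeping in mind when interpreting sums, means or correlations on these columns.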

## Exercises

**1.** How long (`duration`) is the shortest film?

**2.** What is the title (`title`) of the longest film (`duration`)?

**3.** What is the average critic score (`metascore`) for all films?

**4.** What about the average user rating (`avg_vote`)?

**5.** What is the interquartile range (IQR) of the user score (`avg_vote`)?

**6.** How many films have achieved a user rating (`avg_vote`) higher than 9?

**7.** Which director (`director`) has directed the most films?

**8.** Which film has the highest user rating (`avg_vote`)?

**9.** What is the standard deviation of the critics' score (`metascore`)?

**10.** What is the standard deviation of the users' score (`avg_vote`)? Multiply it by 10 so that it is on the same scale as the critics' score (`metascore`).

**11.** How many films are there of each duration (`duration`)?

**12.** Show the correlation between number of votes received (`votes`), number of user reviews (`reviews_from_users`), number of critic reviews (`reviews_from_critics`), US income (`usa_gross_income`) and worldwide income (`worldwide_gross_income`).

**13.** Show the histograms of the users' score (`avg_vote`) and the critics' score (`metascore`) overlaid on top of each other.

**14.** Show the correlation between worldwide income (`worldwide_gross_income`) and user ratings (`avg_vote`).
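
As a reminder of the pandas calls these exercises rely on, here is a minimal sketch on a toy table. The DataFrame `toy` and its columns (`score`, `votes`, `label`) are hypothetical and unrelated to the IMDb data, so the numbers it produces are not answers to the exercises.

```python
import pandas as pd

# Hypothetical toy table; column names are illustrative, not taken from imdb.csv
toy = pd.DataFrame({
    'score': [1, 2, 3, 4, 5, 6, 7, 8],
    'votes': [10, 20, 30, 40, 50, 60, 70, 80],
    'label': ['a', 'b', 'a', 'c', 'a', 'b', 'a', 'c'],
})

# Central tendency and extremes
mean_score = toy['score'].mean()
label_of_max = toy.loc[toy['score'].idxmax(), 'label']  # value of another column in the row with the maximum

# Dispersion: the interquartile range is Q3 - Q1
iqr = toy['score'].quantile(0.75) - toy['score'].quantile(0.25)

# The standard deviation scales linearly: std(10 * x) == 10 * std(x)
std_score = toy['score'].std()
std_scaled = (toy['score'] * 10).std()

# Counting: rows that satisfy a condition, and frequency of each distinct value
n_high = (toy['score'] > 6).sum()
counts = toy['label'].value_counts()

# Pairwise Pearson correlation between the selected numeric columns
corr = toy[['score', 'votes']].corr()

print(mean_score, label_of_max, iqr, n_high)
print(counts)
print(corr)
```

Because `std` scales linearly, multiplying the result of `std()` by 10 gives the same value as rescaling the column first. A correlation matrix such as `corr` can be passed to `sns.heatmap` for a visual summary.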

# References

* [IMDb movies extensive dataset](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset)