# Solve Any Data Analysis Problem

## Chapter 5 - Project 4 - Example solution - Part 1

We need to find and extract:

- annual admissions breakdown
- breakdown of genre
- breakdown of distributors

First, figure out where the relevant tables are for a few years (2018 onwards, so we cover both pre- and "post-" COVID periods).

In [1]:
import numpy as np
import pandas as pd

import pdfplumber

from IPython.display import display

Let's try a sample snippet to extract the tables from a specific page.

In [2]:
pdf_path = "./files/2019 - BFI yearbook 2019 - 888.pdf"
page_num = 11

page_tables = []

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[page_num-1]
    # page_tables is a list of lists of lists(!)
    # extracted from special Table objects
    page_tables = [t.extract() for t in page.find_tables()]

page_tables

[[['Month', '2017 (million)', '2018 (million)', '% +/- on 2017'],
  ['January', '15.0', '16.2', '8.0'],
  ['February', '16.5', '16.1', '-2.3'],
  ['March', '16.2', '13.5', '-16.2'],
  ['April', '15.6', '15.5', '-0.9'],
  ['May', '11.3', '13.7', '21.0'],
  ['June', '9.6', '10.4', '8.6'],
  ['July', '17.8', '15.6', '-12.3'],
  ['August', '14.5', '19.2', '32.8'],
  ['September', '10.8', '10.1', '-6.0'],
  ['October', '12.1', '16.0', '32.6'],
  ['November', '14.1', '14.8', '5.4'],
  ['December', '17.2', '15.7', '-8.7'],
  ['Total', '170.6', '177.0', '3.7']]]

Convert to a DataFrame, using the first row as column headers and dropping the final "Total" row.

In [3]:
table = page_tables[0]

pd.DataFrame(table[1:-1], columns=table[0])

Unnamed: 0,Month,2017 (million),2018 (million),% +/- on 2017
0,January,15.0,16.2,8.0
1,February,16.5,16.1,-2.3
2,March,16.2,13.5,-16.2
3,April,15.6,15.5,-0.9
4,May,11.3,13.7,21.0
5,June,9.6,10.4,8.6
6,July,17.8,15.6,-12.3
7,August,14.5,19.2,32.8
8,September,10.8,10.1,-6.0
9,October,12.1,16.0,32.6
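To make the slice `table[1:-1]` concrete, here is a minimal sketch on a small hand-made list of lists. The rows are a shortened stand-in copied from the output above, not the full extracted table, and the variable names `header` and `body` are just for illustration.

```python
import pandas as pd

# A shortened stand-in for one extracted table:
# header row, two data rows, then the totals row
table = [
    ["Month", "2017 (million)", "2018 (million)", "% +/- on 2017"],
    ["January", "15.0", "16.2", "8.0"],
    ["February", "16.5", "16.1", "-2.3"],
    ["Total", "170.6", "177.0", "3.7"],
]

header = table[0]    # first row holds the column names
body = table[1:-1]   # skip the header and the final "Total" row

df = pd.DataFrame(body, columns=header)
print(df)

# The extracted values are text, not numbers, at this point
print(type(df.loc[0, "2018 (million)"]))
```

Note that the slice drops the last row unconditionally, so it only does the right thing when a table really ends in a totals row.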


Let's now create a reusable function to extract and print all the tables in a PDF

In [4]:
def extract_tables(pdf_path, pages=(), print_tables=True):
    """
    Extract all tables found in a PDF.
    
    `pdf_path`: file path pointing to the PDF
    `pages`: the page number(s) to read
    `print_tables`: whether to also print out all the tables that are found (default: True)
    
    returns: a list of pandas DataFrames
    """
    
    print(f"Reading {pdf_path}")
    
    tables = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page_num in pages:
            page = pdf.pages[page_num-1]
            
            # page_tables is a list of lists of lists(!)
            # extracted from special Table objects
            page_tables = [t.extract() for t in page.find_tables()]
            
            # in each case, table is now a list of lists
            # first list is column headers; the last list is dropped,
            # assuming it is a totals row as in the sample above
            dfs = [pd.DataFrame(table[1:-1], columns=table[0]) for table in page_tables]
            
            tables.extend(dfs)
    
    print(f"{len(tables)} tables found.")
    
    if len(tables) > 0:
        if print_tables:
            for index, df in enumerate(tables):
                print(f"\n##########################\n\tTable {index}\n##########################\n")
                display(df)
    
    return tables

Test our function

In [5]:
tables = extract_tables("./files/2019 - BFI yearbook 2019 - 888.pdf", pages=[11,34,70])

Reading ./files/2019 - BFI yearbook 2019 - 888.pdf
3 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Month,2017 (million),2018 (million),% +/- on 2017
0,January,15.0,16.2,8.0
1,February,16.5,16.1,-2.3
2,March,16.2,13.5,-16.2
3,April,15.6,15.5,-0.9
4,May,11.3,13.7,21.0
5,June,9.6,10.4,8.6
6,July,17.8,15.6,-12.3
7,August,14.5,19.2,32.8
8,September,10.8,10.1,-6.0
9,October,12.1,16.0,32.6



##########################
	Table 1
##########################



Unnamed: 0,Genre,Number of\nreleases,% of\nreleases,Gross box\noffice\n(£ million),% of total\nbox office,Top performing title
0,Action,77,9.8,361.3,27.7,Avengers: Infinity War
1,Animation,46,5.8,242.2,18.6,Incredibles 2
2,Drama,246,31.3,154.1,11.8,A Star Is Born
3,Comedy,144,18.3,88.3,6.8,Johnny English Strikes Again
4,Biopic,5,0.6,78.5,6.0,Bohemian Rhapsody
5,Musical,5,0.6,67.0,5.1,Mamma Mia! Here We Go Again
6,Family,10,1.3,60.3,4.6,Mary Poppins Returns
7,Fantasy,4,0.5,57.6,4.4,Fantastic Beasts: The Crimes of Grindelwald
8,Horror,38,4.8,57.1,4.4,A Quiet Place
9,Adventure,10,1.3,36.4,2.8,Ready Player One



##########################
	Table 2
##########################



Unnamed: 0,Distributor,Market share\n(%),Films on release\nin 2018,Box office gross\n(£ million)
0,Walt Disney,23.6,24,325.6
1,Universal,19.5,40,268.5
2,20th Century Fox,14.5,28,199.3
3,Warner Bros,13.9,31,191.4
4,Sony,10.7,33,146.8
5,Paramount,4.8,12,66.0
6,eOne Films,3.2,22,43.9
7,StudioCanal,2.8,31,38.4
8,Lionsgate,1.5,21,21.2
9,Entertainment,1.1,9,15.6


# 2018

For 2018 all necessary tables are in the same PDF so let's extract them all here.

In [6]:
tables_2018 = extract_tables("./files/2019 - BFI yearbook 2019 - 888.pdf", pages=[11,34,70])

Reading ./files/2019 - BFI yearbook 2019 - 888.pdf
3 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Month,2017 (million),2018 (million),% +/- on 2017
0,January,15.0,16.2,8.0
1,February,16.5,16.1,-2.3
2,March,16.2,13.5,-16.2
3,April,15.6,15.5,-0.9
4,May,11.3,13.7,21.0
5,June,9.6,10.4,8.6
6,July,17.8,15.6,-12.3
7,August,14.5,19.2,32.8
8,September,10.8,10.1,-6.0
9,October,12.1,16.0,32.6



##########################
	Table 1
##########################



Unnamed: 0,Genre,Number of\nreleases,% of\nreleases,Gross box\noffice\n(£ million),% of total\nbox office,Top performing title
0,Action,77,9.8,361.3,27.7,Avengers: Infinity War
1,Animation,46,5.8,242.2,18.6,Incredibles 2
2,Drama,246,31.3,154.1,11.8,A Star Is Born
3,Comedy,144,18.3,88.3,6.8,Johnny English Strikes Again
4,Biopic,5,0.6,78.5,6.0,Bohemian Rhapsody
5,Musical,5,0.6,67.0,5.1,Mamma Mia! Here We Go Again
6,Family,10,1.3,60.3,4.6,Mary Poppins Returns
7,Fantasy,4,0.5,57.6,4.4,Fantastic Beasts: The Crimes of Grindelwald
8,Horror,38,4.8,57.1,4.4,A Quiet Place
9,Adventure,10,1.3,36.4,2.8,Ready Player One



##########################
	Table 2
##########################



Unnamed: 0,Distributor,Market share\n(%),Films on release\nin 2018,Box office gross\n(£ million)
0,Walt Disney,23.6,24,325.6
1,Universal,19.5,40,268.5
2,20th Century Fox,14.5,28,199.3
3,Warner Bros,13.9,31,191.4
4,Sony,10.7,33,146.8
5,Paramount,4.8,12,66.0
6,eOne Films,3.2,22,43.9
7,StudioCanal,2.8,31,38.4
8,Lionsgate,1.5,21,21.2
9,Entertainment,1.1,9,15.6


The first table is the monthly breakdown we're looking for.

In [7]:
admissions_2018 = (
    tables_2018[0]
    .iloc[:,[0, 2]] # keep month and 2018 admissions column
)

admissions_2018.columns = ["Month", "Admissions (million)"]

admissions_2018.insert(0, "Year", 2018)

admissions_2018.head()

Unnamed: 0,Year,Month,Admissions (million)
0,2018,January,16.2
1,2018,February,16.1
2,2018,March,13.5
3,2018,April,15.5
4,2018,May,13.7


In [8]:
admissions_2018.tail()

Unnamed: 0,Year,Month,Admissions (million)
7,2018,August,19.2
8,2018,September,10.1
9,2018,October,16.0
10,2018,November,14.8
11,2018,December,15.7


#### Genres (2018)

In [9]:
tables_2018[1]

Unnamed: 0,Genre,Number of\nreleases,% of\nreleases,Gross box\noffice\n(£ million),% of total\nbox office,Top performing title
0,Action,77,9.8,361.3,27.7,Avengers: Infinity War
1,Animation,46,5.8,242.2,18.6,Incredibles 2
2,Drama,246,31.3,154.1,11.8,A Star Is Born
3,Comedy,144,18.3,88.3,6.8,Johnny English Strikes Again
4,Biopic,5,0.6,78.5,6.0,Bohemian Rhapsody
5,Musical,5,0.6,67.0,5.1,Mamma Mia! Here We Go Again
6,Family,10,1.3,60.3,4.6,Mary Poppins Returns
7,Fantasy,4,0.5,57.6,4.4,Fantastic Beasts: The Crimes of Grindelwald
8,Horror,38,4.8,57.1,4.4,A Quiet Place
9,Adventure,10,1.3,36.4,2.8,Ready Player One


In [10]:
genres_2018 = (
    tables_2018[1]
    .drop(columns=[tables_2018[1].columns[2], tables_2018[1].columns[4]])
)

genres_2018.insert(0, "Year", 2018)

genres_2018.columns = ["Year", "Genre", "Number of releases", "Gross box office (£ million)", "Top performing title"]

genres_2018.head()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
0,2018,Action,77,361.3,Avengers: Infinity War
1,2018,Animation,46,242.2,Incredibles 2
2,2018,Drama,246,154.1,A Star Is Born
3,2018,Comedy,144,88.3,Johnny English Strikes Again
4,2018,Biopic,5,78.5,Bohemian Rhapsody


In [11]:
genres_2018.tail()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
13,2018,Romance,17,11.8,The Guernsey Literary and Potato Peel Pie Society
14,2018,Documentary,112,9.3,Free Solo
15,2018,War,3,1.3,Journey’s End
16,2018,Western,2,0.1,Sweet Country
17,2018,Mystery,2,0.1,Dark River


#### Distributors (2018)

In [12]:
tables_2018[2]

Unnamed: 0,Distributor,Market share\n(%),Films on release\nin 2018,Box office gross\n(£ million)
0,Walt Disney,23.6,24,325.6
1,Universal,19.5,40,268.5
2,20th Century Fox,14.5,28,199.3
3,Warner Bros,13.9,31,191.4
4,Sony,10.7,33,146.8
5,Paramount,4.8,12,66.0
6,eOne Films,3.2,22,43.9
7,StudioCanal,2.8,31,38.4
8,Lionsgate,1.5,21,21.2
9,Entertainment,1.1,9,15.6


In [13]:
distributors_2018 = (
    tables_2018[2]
    .drop(index=[10]) # drop rows relating to totals EXCEPT the row totalling "others"
)

distributors_2018.insert(0, "Year", 2018)

distributors_2018.columns = ["Year", "Distributor", "Market share", "Films on release", "Box office gross (£ million)"]

distributors_2018.head()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
0,2018,Walt Disney,23.6,24,325.6
1,2018,Universal,19.5,40,268.5
2,2018,20th Century Fox,14.5,28,199.3
3,2018,Warner Bros,13.9,31,191.4
4,2018,Sony,10.7,33,146.8


In [14]:
distributors_2018.tail()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
6,2018,eOne Films,3.2,22,43.9
7,2018,StudioCanal,2.8,31,38.4
8,2018,Lionsgate,1.5,21,21.2
9,2018,Entertainment,1.1,9,15.6
11,2018,Others (128 distributors),4.5,715,61.4


# 2019

In [15]:
tables_2019 = extract_tables("./files/2020 - BFI Statistical Yearbook 2020 - 12815.pdf", pages=[11,36,63])

Reading ./files/2020 - BFI Statistical Yearbook 2020 - 12815.pdf
4 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Territory,Admissions 2018\n(million),Admissions 2019\n(million),+/- 2018 (%)
0,China,1717,1727,0.6
1,India,1463,1592,8.8
2,USA and Canada,1310,1256,-4.1
3,Mexico,332,350,5.5
4,South Korea,216,227,4.8
5,Russia,200,219,9.5
6,France,201,213,6.0
7,Japan,169,195,15.2
8,UK,177,176,-0.5
9,Brazil,163,172,5.6



##########################
	Table 1
##########################



Unnamed: 0,Month,2018 (million),2019 (million),% +/- on 2018
0,January,16.2,13.7,-15.4
1,February,16.1,12.2,-24.0
2,March,13.5,11.4,-15.5
3,April,15.5,16.0,3.1
4,May,13.7,16.6,20.8
5,June,10.4,13.9,33.7
6,July,15.6,18.7,19.5
7,August,19.2,15.6,-19.1
8,September,10.1,11.0,8.2
9,October,16.1,16.3,1.8



##########################
	Table 2
##########################



Unnamed: 0,Genre,Number of\nreleases,% of\nreleases,Gross\nbox office\n(£ million),% of total\nbox office,Top performing title
0,Action,102,13.4,328.3,25.2,Avengers: Endgame
1,Animation,40,5.2,312.1,24.0,The Lion King
2,Drama,264,34.6,166.4,12.8,Downton Abbey
3,Comedy,136,17.8,83.5,6.4,Yesterday
4,Biopic,19,2.5,80.7,6.2,Rocketman
5,Thriller,20,2.6,75.3,5.8,Joker
6,Sci-fi,3,0.4,64.3,4.9,Star Wars: The Rise of Skywalker
7,Horror,37,4.8,60.6,4.7,It Chapter Two
8,Adventure,8,1.0,44.5,3.4,Aladdin
9,Family,5,0.7,42.7,3.3,Dumbo



##########################
	Table 3
##########################



Unnamed: 0,Distributor,Market share\n(%),Films on release\nin 2019,Box office gross\n(£ million)
0,Walt Disney,37.9,24,506.8
1,Universal,13.9,39,185.6
2,Warner Bros,12.4,40,166.3
3,Sony,9.3,27,124.7
4,Paramount,5.9,13,79.0
5,20th Century Fox*,5.7,30,75.5
6,Lionsgate,4.2,19,56.4
7,Entertainment One,3.8,27,50.4
8,STX Entertainment,1.2,7,15.4
9,StudioCanal,1.0,22,14.0


Here it looks like admissions is the second table.

Technically, we could have extracted the 2018 admissions figures from this table as well. But we needed the genre and distributor breakdowns for 2018 from the earlier yearbook anyway, so it doesn't make a difference.

If you were being really careful, you could also compare the 2018 figures in this table with the ones in the PDF we used for 2018 to make sure they match.
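Such a comparison could be sketched as follows. The values are typed in by hand from the two printed outputs above (only the ten months visible there), as floats rather than the strings `pdfplumber` returns, and the names `earlier`, `later` and `mismatches` are illustrative.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October"]

# 2018 admissions (million) as printed from the 2019 yearbook
earlier = pd.DataFrame({
    "Month": months,
    "Admissions": [16.2, 16.1, 13.5, 15.5, 13.7, 10.4, 15.6, 19.2, 10.1, 16.0],
})

# 2018 admissions (million) as printed from the 2020 yearbook
later = pd.DataFrame({
    "Month": months,
    "Admissions": [16.2, 16.1, 13.5, 15.5, 13.7, 10.4, 15.6, 19.2, 10.1, 16.1],
})

# Line the two sources up by month and keep the rows that disagree
comparison = earlier.merge(later, on="Month", suffixes=("_earlier", "_later"))
mismatches = comparison[
    comparison["Admissions_earlier"] != comparison["Admissions_later"]
]
print(mismatches)
```

On the figures shown above, the two yearbooks disagree for October 2018 (16.0 versus 16.1), which is the kind of small revision this check is meant to surface.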

In [16]:
admissions_2019 = (
    tables_2019[1]
    .iloc[:,[0, 2]] # keep month and 2019 admissions column
)

admissions_2019.columns = ["Month", "Admissions (million)"]
admissions_2019.insert(0, "Year", 2019)

admissions_2019.head()

Unnamed: 0,Year,Month,Admissions (million)
0,2019,January,13.7
1,2019,February,12.2
2,2019,March,11.4
3,2019,April,16.0
4,2019,May,16.6


In [17]:
admissions_2019.tail()

Unnamed: 0,Year,Month,Admissions (million)
7,2019,August,15.6
8,2019,September,11.0
9,2019,October,16.3
10,2019,November,12.2
11,2019,December,18.5


#### Genres (2019)

In [18]:
genres_2019 = (
    tables_2019[2]
    .drop(columns=[tables_2019[2].columns[2], tables_2019[2].columns[4]])
)

genres_2019.insert(0, "Year", 2019)

genres_2019.columns = ["Year", "Genre", "Number of releases", "Gross box office (£ million)", "Top performing title"]

genres_2019.head(7)

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
0,2019,Action,102,328.3,Avengers: Endgame
1,2019,Animation,40,312.1,The Lion King
2,2019,Drama,264,166.4,Downton Abbey
3,2019,Comedy,136,83.5,Yesterday
4,2019,Biopic,19,80.7,Rocketman
5,2019,Thriller,20,75.3,Joker
6,2019,Sci-fi,3,64.3,Star Wars: The Rise of Skywalker


In [19]:
genres_2019.tail()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
12,2019,Documentary,99,7.8,Apollo 11
13,2019,Crime,11,2.8,21 Bridges
14,2019,Mystery,1,0.5,The Souvenir
15,2019,War,2,0.1,Saving Private Ryan (D-Day 75th Anniversary)
16,2019,Fantasy,2,<0.1,Gwen


This code is very similar to the one for 2018. We *could* write a generic function to extract and transform a genre-level breakdown from any PDF, but we cannot assume they will all take the same form.

We generally don't want to assume data across multiple files will have the same format, but this is particularly true for PDF reports spanning many years.
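If we did want to share the common steps, one cautious option is a helper that takes the year-specific column positions as an argument and fails loudly when the table does not have the expected shape. This is only a sketch: `tidy_genres`, `EXPECTED_COLUMNS` and `drop_positions` are hypothetical names, and the one-row table below is a stand-in for the real extracted data.

```python
import pandas as pd

EXPECTED_COLUMNS = ["Year", "Genre", "Number of releases",
                    "Gross box office (£ million)", "Top performing title"]

def tidy_genres(raw, year, drop_positions):
    """Hypothetical helper: drop columns by position, add the year, rename."""
    tidy = raw.drop(columns=[raw.columns[i] for i in drop_positions])

    # Refuse to rename if the layout is not what we assumed for this year
    expected_width = len(EXPECTED_COLUMNS) - 1  # "Year" is added afterwards
    if tidy.shape[1] != expected_width:
        raise ValueError(
            f"{year}: expected {expected_width} columns after dropping, "
            f"got {tidy.shape[1]}: {list(tidy.columns)}"
        )

    tidy.insert(0, "Year", year)
    tidy.columns = EXPECTED_COLUMNS
    return tidy

# Stand-in for one extracted genre table (a single row from the 2018 output)
raw = pd.DataFrame(
    [["Action", "77", "9.8", "361.3", "27.7", "Avengers: Infinity War"]],
    columns=["Genre", "Number of\nreleases", "% of\nreleases",
             "Gross box\noffice\n(£ million)", "% of total\nbox office",
             "Top performing title"],
)

tidy = tidy_genres(raw, 2018, drop_positions=[2, 4])
print(tidy)
```

The column positions still have to be checked by eye for each report; the helper only guards against silently renaming the wrong columns.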

#### Distributors (2019)

In [20]:
distributors_2019 = (
    tables_2019[3]
    .drop(index=[10]) # drop totals EXCEPT the row totalling "others"
)

distributors_2019.insert(0, "Year", 2019)

distributors_2019.columns = ["Year", "Distributor", "Market share", "Films on release", "Box office gross (£ million)"]

distributors_2019.head()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
0,2019,Walt Disney,37.9,24,506.8
1,2019,Universal,13.9,39,185.6
2,2019,Warner Bros,12.4,40,166.3
3,2019,Sony,9.3,27,124.7
4,2019,Paramount,5.9,13,79.0


In [21]:
distributors_2019.tail()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
6,2019,Lionsgate,4.2,19,56.4
7,2019,Entertainment One,3.8,27,50.4
8,2019,STX Entertainment,1.2,7,15.4
9,2019,StudioCanal,1.0,22,14.0
11,2019,Others (Total 130 distributors),4.6,671,61.6


# 2020

In [22]:
tables_2020 = extract_tables("./files/2021 - BFI Statistical Yearbook 2021 - 24979.pdf", pages=[10,35,60])

Reading ./files/2021 - BFI Statistical Yearbook 2021 - 24979.pdf
4 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Admissions\n(million),2016,2017,2018,2019,2020,% change\non 2019
0,January,14.0,15.0,16.2,13.7,16.5,20.4
1,February,15.4,16.5,16.1,12.2,14.5,19.0
2,March,13.4,16.2,13.5,11.4,4.8,-57.9
3,April,13.1,15.6,15.5,16.0,-,-100.0
4,May,12.5,11.3,13.7,16.6,-,-100.0
5,June,10.7,9.6,10.4,13.9,-,-100.0
6,July,16.6,17.8,15.6,18.7,0.4,-97.9
7,August,18.1,14.5,19.2,15.6,2.1,-86.7
8,September,11.7,10.8,10.1,11.0,2.7,-75.8
9,October,15.2,12.1,16.1,16.3,2.2,-86.6



##########################
	Table 1
##########################



Unnamed: 0,Genre,Number of\nreleases,% of\nreleases,Gross box\noffice\n(£ million),% of total\nbox office,Average number\nof sites at widest\npoint of release,Top performing title
0,Action,36,9.4%,48.5,19.7%,167,Bad Boys For Life
1,War,3,0.8%,44.3,17.9%,296,1917
2,Animation,19,5.0%,34.8,14.1%,260,Sonic The Hedgehog
3,Thriller,15,3.9%,32.8,13.3%,192,Tenet
4,Drama,140,36.7%,31.4,12.7%,83,The Personal History of\nDavid Copperfield
5,Comedy,65,17.1%,24.0,9.7%,108,Jojo Rabbit
6,Crime,8,2.1%,12.5,5.1%,118,The Gentlemen
7,Horror,24,6.3%,12.1,4.9%,172,The Invisible Man
8,Adventure,9,2.4%,3.8,1.6%,152,The Call Of The Wild
9,Fantasy,2,0.5%,0.9,0.4%,315,Pinocchio



##########################
	Table 2
##########################



Unnamed: 0,Distributor,Market share\nin 2020\n(%),Number of films\non release\nin 2020,Box office\ngross in 2020\n(£ million),Number of films\non release\nin 2019
0,Entertainment One,15.9,30,49.4,28.0
1,Sony,15.6,38,48.4,39.0
2,Walt Disney,14.3,58,44.3,47.0
3,Universal,12.5,73,39.0,55.0
4,Warner Bros,11.7,89,36.5,64.0
5,Paramount,8.0,22,25.0,13.0
6,Lionsgate,4.5,29,14.0,23.0
7,StudioCanal,4.5,46,13.9,31.0
8,Entertainment Film Distributors,4.3,6,13.3,10.0
9,Shear Entertainment*,1.3,1,4.0,



##########################
	Table 3
##########################



Unnamed: 0,Distributor,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Entertainment One,5.1,6.7,9.0,8.0,3.9,8.4,2.5,3.2,3.8,15.9
1,Sony,7.2,18.0,8.7,6.2,11.8,6.6,10.3,10.7,9.3,15.6
2,Walt Disney,8.7,10.2,15.2,10.1,20.0,23.2,19.7,23.6,37.9,14.3
3,Universal,11.8,10.7,15.1,11.2,21.6,14.0,16.0,19.5,13.9,12.5
4,Warner Bros,18.2,12.9,17.2,15.9,9.0,15.6,16.6,13.9,12.4,11.7
5,Paramount,16.3,7.7,7.8,5.8,4.0,5.4,3.7,4.8,5.9,8.0
6,Lionsgate,-,5.7,4.7,5.5,4.0,4.0,6.3,1.5,4.2,4.5
7,Optimum/StudioCanal1,3.8,-,2.8,6.7,4.7,1.5,4.2,2.8,1.0,4.5
8,Entertainment Film Distributors,6.7,3.1,1.9,5.2,1.6,1.5,1.7,1.1,-,4.3
9,Shear Entertainment,-,-,-,-,-,-,-,-,-,1.3


In [23]:
admissions_2020 = (
    tables_2020[0]
    .iloc[:,[0,5]]
)

admissions_2020.columns = ["Month", "Admissions (million)"]
admissions_2020.insert(0, "Year", 2020)

admissions_2020.head()

Unnamed: 0,Year,Month,Admissions (million)
0,2020,January,16.5
1,2020,February,14.5
2,2020,March,4.8
3,2020,April,-
4,2020,May,-


Let's replace those hyphens with explicit missing (`NaN`) values

In [24]:
admissions_2020 = admissions_2020.replace("-", np.nan)
admissions_2020.head()

Unnamed: 0,Year,Month,Admissions (million)
0,2020,January,16.5
1,2020,February,14.5
2,2020,March,4.8
3,2020,April,
4,2020,May,
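Replacing the hyphens leaves the remaining values as text, because the extracted cells are strings. If numeric values are wanted for later analysis, one possible approach (an assumption about the next step, not something this notebook does) is `pd.to_numeric` with `errors="coerce"`, sketched here on a small stand-in table:

```python
import pandas as pd

# Stand-in for a few rows of the 2020 admissions table, values still as text
admissions = pd.DataFrame({
    "Month": ["March", "April", "May"],
    "Admissions (million)": ["4.8", "-", "-"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN
admissions["Admissions (million)"] = pd.to_numeric(
    admissions["Admissions (million)"], errors="coerce"
)

print(admissions)
print(admissions["Admissions (million)"].dtype)
```

Be aware that `errors="coerce"` turns every unparseable value into `NaN`, not just hyphens, so an entry such as the `<0.1` in the 2019 genre table would be lost silently.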


In [25]:
admissions_2020.tail()

Unnamed: 0,Year,Month,Admissions (million)
7,2020,August,2.1
8,2020,September,2.7
9,2020,October,2.2
10,2020,November,0.3
11,2020,December,0.5


#### Genres (2020)

In [26]:
genres_2020 = (
    tables_2020[1]
    .drop(columns=[tables_2020[1].columns[2], tables_2020[1].columns[4], tables_2020[1].columns[5]])
)

genres_2020.insert(0, "Year", 2020)

genres_2020.columns = ["Year", "Genre", "Number of releases", "Gross box office (£ million)", "Top performing title"]

genres_2020.head(8)

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
0,2020,Action,36,48.5,Bad Boys For Life
1,2020,War,3,44.3,1917
2,2020,Animation,19,34.8,Sonic The Hedgehog
3,2020,Thriller,15,32.8,Tenet
4,2020,Drama,140,31.4,The Personal History of\nDavid Copperfield
5,2020,Comedy,65,24.0,Jojo Rabbit
6,2020,Crime,8,12.5,The Gentlemen
7,2020,Horror,24,12.1,The Invisible Man


In [27]:
genres_2020.tail()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
11,2020,Documentary,42,0.4,I Am Greta
12,2020,Western,2,0.3,True History Of The Kelly Gang
13,2020,Family,3,0.3,Max Winslow And The\nHouse Of Secrets
14,2020,Biopic,3,0.2,Mr. Jones
15,2020,Sci fi,4,0.1,Color Out Of Space


Some titles wrap onto a second line in the PDF, so the extracted text contains line breaks (`\n`).

Let's fix these while we're here by replacing each line break with a space.

In [28]:
genres_2020["Top performing title"] = genres_2020["Top performing title"].str.replace("\n", " ")

genres_2020.head()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
0,2020,Action,36,48.5,Bad Boys For Life
1,2020,War,3,44.3,1917
2,2020,Animation,19,34.8,Sonic The Hedgehog
3,2020,Thriller,15,32.8,Tenet
4,2020,Drama,140,31.4,The Personal History of David Copperfield


In [29]:
genres_2020.tail()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
11,2020,Documentary,42,0.4,I Am Greta
12,2020,Western,2,0.3,True History Of The Kelly Gang
13,2020,Family,3,0.3,Max Winslow And The House Of Secrets
14,2020,Biopic,3,0.2,Mr. Jones
15,2020,Sci fi,4,0.1,Color Out Of Space
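The same line-break residue shows up in the extracted column headers (for example `Number of\nreleases`). A minimal sketch of cleaning both in one pass, assuming that is wanted before renaming; `clean_linebreaks` is a hypothetical helper and the one-row table is a stand-in:

```python
import pandas as pd

def clean_linebreaks(df, text_columns):
    """Hypothetical helper: collapse line breaks in headers and text columns."""
    cleaned = df.copy()

    # Headers: split on any whitespace (including "\n") and rejoin with spaces
    cleaned.columns = [" ".join(str(c).split()) for c in cleaned.columns]

    # Cells: replace runs of whitespace with a single space
    for col in text_columns:
        cleaned[col] = (
            cleaned[col].str.replace(r"\s+", " ", regex=True).str.strip()
        )
    return cleaned

raw = pd.DataFrame(
    [["Drama", "140", "The Personal History of\nDavid Copperfield"]],
    columns=["Genre", "Number of\nreleases", "Top performing title"],
)

cleaned = clean_linebreaks(raw, text_columns=["Top performing title"])
print(list(cleaned.columns))
print(cleaned.loc[0, "Top performing title"])
```

This does not repair words hyphenated across lines, such as the `releas-\nes` header in the 2021 genre table, so renaming the columns explicitly is still the safer route.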


#### Distributors (2020)

In [30]:
tables_2020[2]

Unnamed: 0,Distributor,Market share\nin 2020\n(%),Number of films\non release\nin 2020,Box office\ngross in 2020\n(£ million),Number of films\non release\nin 2019
0,Entertainment One,15.9,30,49.4,28.0
1,Sony,15.6,38,48.4,39.0
2,Walt Disney,14.3,58,44.3,47.0
3,Universal,12.5,73,39.0,55.0
4,Warner Bros,11.7,89,36.5,64.0
5,Paramount,8.0,22,25.0,13.0
6,Lionsgate,4.5,29,14.0,23.0
7,StudioCanal,4.5,46,13.9,31.0
8,Entertainment Film Distributors,4.3,6,13.3,10.0
9,Shear Entertainment*,1.3,1,4.0,


In [31]:
distributors_2020 = (
    tables_2020[2]
    .iloc[:,:4]
    .drop(index=[10]) # drop totals EXCEPT the row totalling "others"
)

distributors_2020.insert(0, "Year", 2020)

distributors_2020.columns = ["Year", "Distributor", "Market share", "Films on release", "Box office gross (£ million)"]

distributors_2020.head()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
0,2020,Entertainment One,15.9,30,49.4
1,2020,Sony,15.6,38,48.4
2,2020,Walt Disney,14.3,58,44.3
3,2020,Universal,12.5,73,39.0
4,2020,Warner Bros,11.7,89,36.5


In [32]:
distributors_2020.tail()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
6,2020,Lionsgate,4.5,29,14.0
7,2020,StudioCanal,4.5,46,13.9
8,2020,Entertainment Film Distributors,4.3,6,13.3
9,2020,Shear Entertainment*,1.3,1,4.0
11,2020,Other distributors (143),7.5,534,23.2


# 2021

In [33]:
tables_2021 = extract_tables('./files/2022 - The box office 2021 - full report - 26781.pdf', pages=[6])

Reading ./files/2022 - The box office 2021 - full report - 26781.pdf
1 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Admissions (million),2019,2020,2021,% change on 2019
0,January,13.7,16.5,-,-100.0
1,February,12.2,14.5,-,-100.0
2,March,11.4,4.8,-,-100.0
3,April,16.0,-,-,-100.0
4,May,16.6,-,3.5,-78.8
5,June,13.9,-,7.0,-49.9
6,July,18.7,0.4,7.8,-58.5
7,August,15.6,2.1,10.4,-33.1
8,September,11.0,2.7,6.5,-40.6
9,October,16.3,2.2,16.4,0.9


In [34]:
admissions_2021 = (
    tables_2021[0]
    .iloc[:,[0, 3]]
    .copy() # work on an independent copy so the assignment below doesn't raise SettingWithCopyWarning
)

admissions_2021.columns = ["Month", "Admissions (million)"]
admissions_2021.insert(0, "Year", 2021)

admissions_2021["Admissions (million)"] = admissions_2021["Admissions (million)"].replace("-", np.nan)

admissions_2021.head()

Unnamed: 0,Year,Month,Admissions (million)
0,2021,January,
1,2021,February,
2,2021,March,
3,2021,April,
4,2021,May,3.5


In [35]:
admissions_2021.tail()

Unnamed: 0,Year,Month,Admissions (million)
7,2021,August,10.4
8,2021,September,6.5
9,2021,October,16.4
10,2021,November,8.8
11,2021,December,13.5


#### Genres (2021)

For the first time, these stats are in a separate PDF.

In [36]:
tables_2021_genre = extract_tables("./files/2022 - Top films in 2021 - full report - 29763.pdf", pages=[13])

Reading ./files/2022 - Top films in 2021 - full report - 29763.pdf
1 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Genre,Number\nof releas-\nes,% of\nreleases,Box office\ngross\n(£ million),% of\ntotal\nbox office,Average number\nof sites at\nwidest point\nof release,Top performing title
0,Action,52,11.8,296.7,48.2,210,No Time to Die
1,Comedy,54,12.2,65.1,10.6,138,Free Guy
2,Animation,38,8.6,58.2,9.4,260,The Addams Family 2
3,Horror,24,5.4,40.8,6.6,277,A Quiet Place Part II
4,Family,8,1.8,29.6,4.8,187,Peter Rabbit 2: The Runaway
5,Sci fi,4,0.9,29.6,4.8,420,Dune
6,Drama,134,30.3,20.6,3.3,78,Nomadland
7,Musical,8,1.8,15.5,2.5,313,West Side Story
8,Adventure,6,1.4,15.2,2.5,243,Jungle Cruise
9,Fantasy,3,0.7,14.9,2.4,263,Eternals


In [37]:
genres_2021 = (
    tables_2021_genre[0]
    .drop(columns=[tables_2021_genre[0].columns[2], tables_2021_genre[0].columns[4], tables_2021_genre[0].columns[5]])
)

genres_2021.insert(0, "Year", 2021)

genres_2021.columns = ["Year", "Genre", "Number of releases", "Gross box office (£ million)", "Top performing title"]

# replace line breaks left over from titles that wrap across lines in the PDF
genres_2021["Top performing title"] = genres_2021["Top performing title"].str.replace("\n", " ")

genres_2021.head()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
0,2021,Action,52,296.7,No Time to Die
1,2021,Comedy,54,65.1,Free Guy
2,2021,Animation,38,58.2,The Addams Family 2
3,2021,Horror,24,40.8,A Quiet Place Part II
4,2021,Family,8,29.6,Peter Rabbit 2: The Runaway


In [38]:
genres_2021.tail()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
11,2021,Thriller,10,8.2,The Forever Purge
12,2021,Suspense,12,4.5,Old
13,2021,Documentary,69,2.1,"Summer of Soul (...Or, When the Revolution Cou..."
14,2021,Romance,13,1.5,The Last Letter from Your Lover
15,2021,Crime,2,0.3,Rise of the Footsoldier: Origins


#### Distributors (2021)

In [39]:
tables_2021_dist = extract_tables("./files/2022 - Distribution - full report - 26782.pdf", pages=[4])

Reading ./files/2022 - Distribution - full report - 26782.pdf
1 tables found.

##########################
	Table 0
##########################



Unnamed: 0,Distributor,Market share in 2021 (%),Films on release in 2021,Box office gross in 2021\n(£ million)
0,Universal,30.6,51,180.8
1,Sony,22.2,52,130.8
2,Walt Disney,21.2,63,125.1
3,Warner Bros,14.4,80,84.7
4,Paramount,3.7,16,21.8
5,Lionsgate,1.8,23,10.5
6,Entertainment One,1.4,12,8.1
7,Park Circus,0.6,96,3.4
8,STX Entertainment,0.6,3,3.4
9,StudioCanal,0.5,36,2.7


In [40]:
distributors_2021 = (
    tables_2021_dist[0]
    .drop(index=[10]) # drop Top 10 total row
)

distributors_2021.insert(0, "Year", 2021)
distributors_2021.columns = ["Year", "Distributor", "Market share", "Films on release", "Box office gross (£ million)"]
distributors_2021.head()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
0,2021,Universal,30.6,51,180.8
1,2021,Sony,22.2,52,130.8
2,2021,Walt Disney,21.2,63,125.1
3,2021,Warner Bros,14.4,80,84.7
4,2021,Paramount,3.7,16,21.8


In [41]:
distributors_2021.tail()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
6,2021,Entertainment One,1.4,12,8.1
7,2021,Park Circus,0.6,96,3.4
8,2021,STX Entertainment,0.6,3,3.4
9,2021,StudioCanal,0.5,36,2.7
11,2021,Other distributors (137),3.2,549,19.0


# Combine everything

In [42]:
admissions = pd.concat([admissions_2018, admissions_2019, admissions_2020, admissions_2021],
                       ignore_index=True,
                       axis=0)

admissions.head()

Unnamed: 0,Year,Month,Admissions (million)
0,2018,January,16.2
1,2018,February,16.1
2,2018,March,13.5
3,2018,April,15.5
4,2018,May,13.7


In [43]:
admissions["Year"].value_counts()

2018    12
2019    12
2020    12
2021    12
Name: Year, dtype: int64
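The count of 12 rows per year is a useful first check, but it would not catch a repeated or misspelled month. A slightly stricter sketch, with `check_complete_years` and `MONTHS` as hypothetical names and a made-up demo table in which 2019 is missing December:

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def check_complete_years(df):
    """Hypothetical check: report missing or repeated months per year."""
    problems = {}
    for year, group in df.groupby("Year"):
        counts = group["Month"].value_counts()
        missing = [m for m in MONTHS if m not in counts.index]
        repeated = [m for m in MONTHS if counts.get(m, 0) > 1]
        if missing or repeated:
            problems[int(year)] = {"missing": missing, "repeated": repeated}
    return problems

# Made-up demo data: 2018 is complete, 2019 lacks December
demo = pd.DataFrame({
    "Year": [2018] * 12 + [2019] * 11,
    "Month": MONTHS + MONTHS[:-1],
})

problems = check_complete_years(demo)
print(problems)
```

An empty dictionary means every year has each month exactly once.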

In [44]:
admissions.to_csv("admissions.csv", index=False)

#### Genres

In [45]:
genres = pd.concat([genres_2018, genres_2019, genres_2020, genres_2021],
                   ignore_index=True,
                   axis=0)

genres.head()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
0,2018,Action,77,361.3,Avengers: Infinity War
1,2018,Animation,46,242.2,Incredibles 2
2,2018,Drama,246,154.1,A Star Is Born
3,2018,Comedy,144,88.3,Johnny English Strikes Again
4,2018,Biopic,5,78.5,Bohemian Rhapsody


In [46]:
genres.tail()

Unnamed: 0,Year,Genre,Number of releases,Gross box office (£ million),Top performing title
62,2021,Thriller,10,8.2,The Forever Purge
63,2021,Suspense,12,4.5,Old
64,2021,Documentary,69,2.1,"Summer of Soul (...Or, When the Revolution Cou..."
65,2021,Romance,13,1.5,The Last Letter from Your Lover
66,2021,Crime,2,0.3,Rise of the Footsoldier: Origins


In [47]:
genres["Year"].value_counts()

2018    18
2019    17
2020    16
2021    16
Name: Year, dtype: int64

In [48]:
genres.to_csv("genres.csv", index=False)

#### Distributors

In [49]:
distributors = pd.concat([distributors_2018, distributors_2019, distributors_2020, distributors_2021],
                         ignore_index=True,
                         axis=0)

distributors.head()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
0,2018,Walt Disney,23.6,24,325.6
1,2018,Universal,19.5,40,268.5
2,2018,20th Century Fox,14.5,28,199.3
3,2018,Warner Bros,13.9,31,191.4
4,2018,Sony,10.7,33,146.8


In [50]:
distributors.tail()

Unnamed: 0,Year,Distributor,Market share,Films on release,Box office gross (£ million)
39,2021,Entertainment One,1.4,12,8.1
40,2021,Park Circus,0.6,96,3.4
41,2021,STX Entertainment,0.6,3,3.4
42,2021,StudioCanal,0.5,36,2.7
43,2021,Other distributors (137),3.2,549,19.0


In [51]:
distributors["Year"].value_counts()

2018    11
2019    11
2020    11
2021    11
Name: Year, dtype: int64

In [52]:
distributors.to_csv("distributors.csv", index=False)