In [1]:
import requests
import pandas as pd

## Part 0. Download box office dataset & Data pre-processing

### 0.0. Download box office dataset

Send a request to [data.gov.tw](https://data.gov.tw/) to get the download list of datasets.

In [2]:
response = requests.get('https://data.gov.tw/api/v2/rest/dataset/94224')
response_results = response.json()['result']['distribution']

An example of the records in `response_results`:
> ```
> [{'resourceDescription': '2018年7月30日至8月5日全國電影票房統計數據',
>    'resourceField': [{'name': '序號', 'description': ''},
>     {'name': '國別地區', 'description': ''},
>     {'name': '中文片名', 'description': ''},
>     {'name': '上映日期', 'description': ''},
>     {'name': '申請人', 'description': ''},
>     {'name': '出品', 'description': ''},
>     {'name': '上映院數', 'description': ''},
>     {'name': '銷售票數', 'description': ''},
>     {'name': '銷售金額', 'description': ''},
>     {'name': '累計銷售票數', 'description': ''},
>     {'name': '累計銷售金額', 'description': ''}],
>    'qcLevel': '',
>    'resourceFormat': 'CSV',
>    'resourceCharacterEncoding': 'UTF-8',
>    'resourceModifiedDate': '2022-09-30 14:05:28',
>    'resourceDownloadUrl': 'https://opendata.culture.tw/upload/dataSource/2018-08-09/1c3753a5-50f4-44f8-a75b-2b4d0dd2a143/69c6f154369fc266e8a3593f83d3b444.csv',
>    'resourceAmount': 241,
>    'resourceNotes': '',
>    'resourceRequestMethod': '',
>    'resourceOasUrl': '',
>    'resourceRequestParameters': []
>   },
>    ...
>  ]
>  ```

We will use the value from `resourceDownloadUrl` of `response_results` to download the box office statistics of each week.

Note that some datasets are published in both CSV and JSON. For example, `response_results[96]` and `response_results[97]` are the same dataset, one stored as CSV and the other as JSON.

In [3]:
print('response_results[96]: ', response_results[96]['resourceDescription'], '\n'
      'response_results[97]: ', response_results[97]['resourceDescription'], '\n')

response_results[96]:  2020年6月1日至2020年6月7日全國電影票房統計數據 
response_results[97]:  2020年6月1日至2020年6月7日全國電影票房統計數據JSON格式 



So we must skip the duplicate JSON datasets.
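
Each record in the example above also carries a `resourceFormat` field, so the duplicates could alternatively be filtered on that field rather than on the description text. The sketch below assumes that the JSON copies are labelled `'JSON'` in `resourceFormat`, which the sample record does not confirm; `mock_results` is an illustrative stand-in for `response_results`.

```python
# Mock records standing in for response_results (illustrative only).
mock_results = [
    {'resourceDescription': '2020年6月1日至2020年6月7日全國電影票房統計數據',
     'resourceFormat': 'CSV'},
    {'resourceDescription': '2020年6月1日至2020年6月7日全國電影票房統計數據JSON格式',
     'resourceFormat': 'JSON'},  # assumption: JSON copies are labelled this way
]

# Keep only the CSV distributions.
csv_results = [r for r in mock_results if r.get('resourceFormat', '').upper() == 'CSV']
print(len(csv_results))
```

Matching on the description text, as the notebook does, remains the safer choice if the `resourceFormat` values turn out to be inconsistent.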

The for loop below first checks that a record is not a duplicate JSON dataset. If it is not, the loop:
1. gets the download URL from `resourceDownloadUrl` and reads the CSV from that URL
2. stores the start date of the dataset's statistics period in a new column, `統計起始日`
3. appends the table to a list, so that all CSVs can be concatenated afterwards

In [4]:
urls = []
data_list = []
for resource in response_results:
    filename = resource['resourceDescription']
    # Skip the JSON copies, which duplicate the CSV datasets.
    if 'JSON格式' not in filename:
        # e.g. '2018年7月30日至8月5日...' -> '2018/7/30'
        start_date = filename.split('至')[0]
        start_date = start_date.replace('年', '/').replace('月', '/').replace('日', '')
        url = resource['resourceDownloadUrl']
        urls.append(url)
        table = pd.read_csv(url)
        table['統計起始日'] = start_date
        data_list.append(table)

data = pd.concat(data_list, axis = 0)
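
The date-parsing step can be checked in isolation, without downloading anything. The sketch below wraps the same `split`/`replace` logic in a helper; the name `parse_start_date` is hypothetical, and the input string is the `resourceDescription` from the example record above.

```python
import pandas as pd

def parse_start_date(description):
    """Return the start date of a weekly dataset as 'YYYY/M/D' (hypothetical helper)."""
    start = description.split('至')[0]
    return start.replace('年', '/').replace('月', '/').replace('日', '')

example = '2018年7月30日至8月5日全國電影票房統計數據'
start_date = parse_start_date(example)
print(start_date)                  # 2018/7/30
print(pd.to_datetime(start_date))  # 2018-07-30 00:00:00
```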

In [5]:
# data

### 0.1. Data pre-processing

Keep only the columns that will be used in the analysis later.

In [6]:
columns = ['統計起始日', '上映日期', '中文片名', '國別地區', '上映院數', '累計銷售票數', '累計銷售金額']
df = data[columns]

Fix a data error: one record has a malformed date, `df['上映日期'][4829]` returns `'2019/0807'`.

In [7]:
df = df.replace({ '上映日期': '2019/0807' }, '2019/08/07')
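
Malformed dates like this one could also be found programmatically rather than by inspection. The following is a minimal sketch that assumes valid dates follow the `YYYY/MM/DD` pattern; the small `release` Series is illustrative and stands in for `df['上映日期']`.

```python
import pandas as pd

# Small illustrative sample; the real column is df['上映日期'].
release = pd.Series(['2019/08/02', '2019/0807', '2019/08/09'])

# Flag values that do not look like YYYY/MM/DD (one- or two-digit month and day).
malformed = release[~release.str.fullmatch(r'\d{4}/\d{1,2}/\d{1,2}')]
print(malformed.tolist())
```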

Check whether the freshly downloaded `df` contains new data. If it does, update the CSV file; otherwise, use the saved copy.

In [8]:
df_old = pd.read_csv('./../data/box_office.csv')

if len(df) > len(df_old):
    df.to_csv('./../data/box_office.csv', index = False)
    print('dataset updated!')
else:
    df = df_old
    print('dataset assigned!')

dataset assigned!


---

## Part 1. Import dataset

Import dataset and rename columns.

In [9]:
df = pd.read_csv('./../data/box_office.csv')
df.rename(columns = {
    '統計起始日': 'date',
    '上映日期': 'release',
    '上映院數': 'theater',
    '中文片名': 'name',
    '國別地區': 'country',
    '累計銷售票數': 'ticket',
    '累計銷售金額': 'revenue'
}, inplace = True)

Convert `date` and `release` to **datetime64**, and `ticket` and `revenue` to **Int64** and **Float64**, respectively. Also derive the year columns `date_year` and `release_year`, and `revenue_10k` (revenue in units of NT$10,000).

In [10]:
# Parse dates and derive string year columns.
df['date'] = pd.to_datetime(df['date'])
df['date_year'] = df['date'].apply(lambda d: str(d.year))
df['release'] = pd.to_datetime(df['release'])
df['release_year'] = df['release'].apply(lambda d: str(d.year))
# Strip thousands separators before converting to nullable numeric types.
df['ticket'] = df['ticket'].str.replace(',', '')
df['ticket'] = pd.to_numeric(df['ticket']).astype('Int64')
df['revenue'] = df['revenue'].str.replace(',', '')
df['revenue'] = pd.to_numeric(df['revenue']).astype('Float64')
# Revenue in units of NT$10,000.
df['revenue_10k'] = df['revenue'].div(10000)
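
The comma-stripping conversion can be tried on a few values first. This is a minimal sketch with made-up inputs (`raw` is illustrative, not taken from the dataset); it shows that a missing value survives the conversion as `<NA>` in the nullable **Int64** type.

```python
import pandas as pd

raw = pd.Series(['2,664,743', '53', None])  # illustrative values

# Remove thousands separators, parse as numbers, then use the nullable integer type.
clean = pd.to_numeric(raw.str.replace(',', '', regex=False)).astype('Int64')
print(clean.tolist())
```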

Remove records that are missing `ticket` or `revenue`.

In [11]:
null_index = df.index[(df['ticket'].isnull()) | (df['revenue'].isnull())]
df = df.drop(null_index)

In [12]:
# df
# df.dtypes

Remove duplicate movies with the same `name`, keeping the record with the highest `revenue`.

In [13]:
df_dedup = df.sort_values(['name', 'revenue'], ascending = False).drop_duplicates(subset = 'name', keep = 'first')
df_dedup = df_dedup.reset_index(drop = True)
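
A toy example makes it easy to check that a de-duplication step really keeps the highest-revenue row per title. The sketch below uses a made-up three-row table (`toy`) and shows two equivalent approaches: sorting before `drop_duplicates`, and selecting rows with `groupby(...).idxmax()`.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'A', 'B'],
    'revenue': [100.0, 300.0, 50.0],
})

# Option 1: sort so the highest revenue comes first, then keep the first row per name.
by_sort = (toy.sort_values(['name', 'revenue'], ascending=[True, False])
              .drop_duplicates(subset='name', keep='first')
              .reset_index(drop=True))

# Option 2: pick the row index of the maximum revenue within each name.
by_idxmax = toy.loc[toy.groupby('name')['revenue'].idxmax()].reset_index(drop=True)

print(by_sort)
```

When the rows are sorted by descending revenue, `keep='first'` retains the maximum and `keep='last'` retains the minimum, so the sort direction and the `keep` argument have to agree.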

In [14]:
# df_dedup

---

## Part 2. Analysis

### 2.0. Separate by years

In [15]:
df_2018 = df[(df['date'] >= '2018-01-01') & (df['date'] <= '2018-12-31')]
df_2019 = df[(df['date'] >= '2019-01-01') & (df['date'] <= '2019-12-31')]
df_2020 = df[(df['date'] >= '2020-01-01') & (df['date'] <= '2020-12-31')]
df_2021 = df[(df['date'] >= '2021-01-01') & (df['date'] <= '2021-12-31')]
df_2022 = df[(df['date'] >= '2022-01-01') & (df['date'] <= '2022-12-31')]
print(' 2018:', len(df_2018), 'records\n', '2019:', len(df_2019), 'records\n', '2020:', len(df_2020), 'records\n',
      '2021:', len(df_2021), 'records\n', '2022:', len(df_2022), 'records')

 2018: 2111 records
 2019: 4603 records
 2020: 5849 records
 2021: 3893 records
 2022: 3535 records
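
The per-year split could also be written once with `groupby` on the calendar year, which avoids one hand-written filter per year. This is a sketch on a made-up three-row frame (`toy`); the dictionary name `frames_by_year` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'date': pd.to_datetime(['2018-12-31', '2019-01-07', '2019-06-03'])})

# Group once by calendar year instead of writing one filter per year.
frames_by_year = {year: group for year, group in toy.groupby(toy['date'].dt.year)}
print({year: len(group) for year, group in frames_by_year.items()})
```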


### 2.1. Top 10 movies of 2022

In [16]:
top10movies_2022 = df_2022.drop_duplicates(subset = 'name', keep = 'last').sort_values('revenue', ascending = False)[:10]
top10movies_2022

| Unnamed: 0 | date | release | name | country | theater | ticket | revenue | date_year | release_year | revenue_10k |
|---|---|---|---|---|---|---|---|---|---|---|
| 20022 | 2022-09-26 | 2022-05-23 | 捍衛戰士: 獨行俠 | 美國 | 53 | 2664743 | 729083790.0 | 2022 | 2022 | 72908.379 |
| 18052 | 2022-04-25 | 2020-10-30 | 鬼滅之刃劇場版 無限列車篇 | 日本 | 5 | 2602313 | 634938914.0 | 2022 | 2020 | 63493.8914 |
| 18507 | 2022-05-30 | 2021-12-15 | 蜘蛛人：無家日 | 美國 | 2 | 1967162 | 498216125.0 | 2022 | 2021 | 49821.6125 |
| 20020 | 2022-09-26 | 2022-06-08 | 侏羅紀世界: 統霸天下 | 美國 | 5 | 1278003 | 334124827.0 | 2022 | 2022 | 33412.4827 |
| 18150 | 2022-05-02 | 2021-11-24 | 月老 | 中華民國 | 2 | 1091230 | 264306579.0 | 2022 | 2021 | 26430.6579 |
| 17591 | 2022-03-21 | 2021-11-03 | 永恆族 | 美國 | 1 | 981345 | 258299255.0 | 2022 | 2021 | 25829.9255 |
| 20025 | 2022-09-26 | 2022-05-04 | 奇異博士2：失控多重宇宙 | 美國 | 1 | 931021 | 244509640.0 | 2022 | 2022 | 24450.964 |
| 19699 | 2022-09-05 | 2022-02-24 | 劇場版 咒術迴戰 0 | 日本 | 1 | 893860 | 232306593.0 | 2022 | 2022 | 23230.6593 |
| 19682 | 2022-09-05 | 2022-07-06 | 雷神索爾：愛與雷霆 | 美國 | 10 | 860011 | 225026200.0 | 2022 | 2022 | 22502.62 |
| 18974 | 2022-07-11 | 2022-03-18 | 咒 | 中華民國 | 1 | 721528 | 171754843.0 | 2022 | 2022 | 17175.4843 |


### 2.2. Average movie ticket price in Taiwan

In [17]:
# Ticket-weighted average price over the 2022 top 10 (cumulative revenue / cumulative tickets).
cost = top10movies_2022['revenue'].sum()
ticket = top10movies_2022['ticket'].sum()
spend = cost/ticket
print(f'The average cost of a movie ticket is NT${spend:.2f}.')

The average cost of a movie ticket is NT$256.77.
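
The figure can be reproduced by hand from the top-10 table above, which also makes clear what it measures: total cumulative revenue divided by total cumulative tickets for the ten titles, not a market-wide average. In the sketch below the `(ticket, revenue)` pairs are copied from that table; the variable names are illustrative.

```python
# (ticket, revenue) pairs copied from the top-10 table above.
top10 = [
    (2664743, 729083790.0),
    (2602313, 634938914.0),
    (1967162, 498216125.0),
    (1278003, 334124827.0),
    (1091230, 264306579.0),
    (981345, 258299255.0),
    (931021, 244509640.0),
    (893860, 232306593.0),
    (860011, 225026200.0),
    (721528, 171754843.0),
]

total_tickets = sum(t for t, _ in top10)
total_revenue = sum(r for _, r in top10)
print(f'NT${total_revenue / total_tickets:.2f}')  # NT$256.77
```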


---

## Part 3. Applications

Data visualization with [bokeh](https://demo.bokeh.org/).

In [18]:
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, CustomJS, Slider, RadioButtonGroup, Select
from bokeh.plotting import figure, show
from math import pi
output_notebook()

# from bokeh.plotting import figure, output_file, show, save
# output_file('./prototype.html', title = 'Static HTML file')
# save(p)

# reset_output()

### 3.0. Filters

#### 3.0.0 Options of the select widgets

In [19]:
# year
year_options = df_dedup['date_year'].unique().tolist()
year_options.insert(0, '全部')
print(year_options)

# country
country_count = df_dedup['country'].value_counts()
country_list = country_count.index.tolist()[:10]
country_list.append('其他')
country_options = country_list.copy()
country_options.insert(0, '全部')
print(country_options)

['全部', '2022', '2018', '2021', '2019', '2020']
['全部', '美國', '日本', '中華民國', '法國', '南韓', '英國', '香港', '德國', '義大利', '西班牙', '其他']


#### 3.0.1 Widgets

In [20]:
type_checkbox = RadioButtonGroup(labels = ['年', '週'], active = 0)
year_select = Select(title = '年份', options = list(year_options), value = '2022')
country_select = Select(title = '國別地區', options = list(country_options), value = '全部')
revenue_slider = Slider(title = '累計銷售金額大於', start = 0, end = max(df['revenue']), value = 0, step = 100000)
ticket_slider = Slider(title = '累計銷售票數大於', start = 0, end = max(df['ticket']), value = 0, step = 100000)
theater_select = Slider(title = '上映院數大於', start = 0, end = max(df['theater']), value = 0, step = 10)

show(type_checkbox)
show(year_select)
show(country_select)
show(revenue_slider)
show(ticket_slider)
show(theater_select)

### 3.1. Plots

#### 3.1.0 Main

In [21]:
source = ColumnDataSource(df_dedup)
tooltips = [
    ('name', '@name'),
    ('release_year', '@release_year'),
    ('ticket', '@ticket'),
    ('revenue', '@revenue')
]
p = figure(height = 400, width = 400, x_axis_label = 'release_year', y_axis_label = 'revenue', tooltips=tooltips)
p.circle(x = 'release_year', y = 'revenue', source = source)

xaxis = p.xaxis[0]; xaxis.formatter.use_scientific = False
yaxis = p.yaxis[0]; yaxis.formatter.use_scientific = False

show(p)

In [22]:
# callback = CustomJS(args = dict(source = source, source_ref = source_ref, ticket_slider = ticket_slider), code = 
#     '''
#     var df = source.data
#     const df_ref = source_ref.data
#     const df_new = {
#         上映日期: [],
#         上映院數: [],
#         中文片名: [],
#         國別地區: [],
#         累計銷售票數: [],
#         累計銷售金額: [],
#         累計銷售金額_萬: [],
#         統計起始日: [],
#     }
#     var i = 0
#     Object.values(df_ref.累計銷售票數).filter(function(val, ind) {
#         if (val > ticket_slider.value) {
#             df_new.上映日期[i] = df_ref['上映日期'].slice(ind-1, ind)[0]
#             df_new.上映院數[i] = df_ref['上映院數'].slice(ind-1, ind)[0]
#             df_new.中文片名[i] = df_ref['中文片名'].slice(ind-1, ind)[0]
#             df_new.國別地區[i] = df_ref['國別地區'].slice(ind-1, ind)[0]
#             df_new.累計銷售票數[i] = df_ref['累計銷售票數'].slice(ind-1, ind)[0]
#             df_new.累計銷售金額[i] = df_ref['累計銷售金額'].slice(ind-1, ind)[0]
#             df_new.累計銷售金額_萬[i] = df_ref['累計銷售金額_萬'].slice(ind-1, ind)[0]
#             df_new.統計起始日[i] = df_ref['統計起始日'].slice(ind-1, ind)[0]
#             i += 1
#         }
#     })
    
#     df.上映日期 = df_new.上映日期
#     df.上映院數 = df_new.上映院數
#     df.國別地區 = df_new.國別地區
#     df.累計銷售票數 = df_new.累計銷售票數
#     df.累計銷售金額 = df_new.累計銷售金額
#     df.累計銷售金額_萬 = df_new.累計銷售金額_萬
#     df.統計起始日 = df_new.統計起始日
    
#     console.log(ticket_slider.value)
#     console.log(df)
    
#     source.data = df
#     source.change.emit()
#     '''
# )
# ticket_slider.js_on_change('value', callback)

#### 3.1.1 Countries of origin of the released movies

In [23]:
df_copy = df_dedup.copy()
# Relabel every country outside the top 10 as '其他' (other).
df_copy.loc[~df_copy['country'].isin(country_list), 'country'] = '其他'
# Reindex so the counts line up with the order of country_list.
country_count_others = df_copy['country'].value_counts().reindex(country_list, fill_value = 0).tolist()
country_data = {
    'country': country_list,
    'count': country_count_others
}
p_country_data = pd.DataFrame(country_data)
# p_country_data
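
When rare categories are lumped into an "other" bucket, `value_counts()` returns counts sorted by frequency, so the bucket may not end up last. The sketch below, on a made-up Series (`countries`), shows one way to keep labels and counts aligned by reindexing the counts to the label order; keeping only the top two labels is an arbitrary choice for the example.

```python
import pandas as pd

countries = pd.Series(['美國', '美國', '日本', '法國', '南韓', '日本', '美國'])  # illustrative

top = countries.value_counts().index.tolist()[:2]  # two most frequent labels
labels = top + ['其他']

# Replace everything outside the top labels, then count in the order of `labels`.
grouped = countries.where(countries.isin(top), '其他')
counts = grouped.value_counts().reindex(labels, fill_value=0)
print(counts.to_dict())
```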

In [24]:
p_country = figure(width = 400, height = 400, x_range = p_country_data['country'], x_axis_label = 'country', y_axis_label = 'count',
                   toolbar_location = None, tools = 'hover', tooltips = [('country', '@country'), ('count', '@count')], title = '')
p_country.vbar(x = 'country', top = 'count', width = 0.9, source = p_country_data)

p_country.xaxis.major_label_orientation = pi/4
p_country.y_range.start = 0

show(p_country)

#### 3.1.2 Years of the released movies

In [25]:
year_count = df_dedup['release_year'].value_counts()
year_list = sorted(year_count.index.tolist()[:9], reverse = True)
year_list.append('Before ' + year_list[-1])

df_copy = df_dedup.copy()
# Relabel every release year outside the nine most frequent with the 'Before ...' label.
df_copy.loc[~df_copy['release_year'].isin(year_list), 'release_year'] = year_list[-1]
# Reindex so the counts line up with the order of year_list.
year_count_others = df_copy['release_year'].value_counts().reindex(year_list, fill_value = 0).tolist()
year_data = {
    'year': year_list,
    'count': year_count_others
}
p_year_data = pd.DataFrame(year_data)
# p_year_data

In [26]:
p_year = figure(width = 400, height = 400, x_range = p_year_data['year'], x_axis_label = 'year', y_axis_label = 'count',
                   toolbar_location = None, tools = 'hover', tooltips = [('year', '@year'), ('count', '@count')], title = '')
p_year.vbar(x = 'year', top = 'count', width = 0.9, source = p_year_data)

p_year.xaxis.major_label_orientation = pi/4
p_year.y_range.start = 0

show(p_year)

---