## Part 0. Download box office dataset & Data pre-processing

### 0.0. Download box office dataset

Get the dataset download list by sending a request to [data.gov.tw](https://data.gov.tw/).

In [1]:
import datetime
import pandas as pd
import requests

In [2]:
request_responses = requests.get('https://data.gov.tw/api/v2/rest/dataset/94224')
responses = request_responses.json()['result']['distribution']

An example of a record in `responses`:
> ```
> [{'resourceDescription': '2018年7月30日至8月5日全國電影票房統計數據',
>    'resourceField': [{'name': '序號', 'description': ''},
>     {'name': '國別地區', 'description': ''},
>     {'name': '中文片名', 'description': ''},
>     {'name': '上映日期', 'description': ''},
>     {'name': '申請人', 'description': ''},
>     {'name': '出品', 'description': ''},
>     {'name': '上映院數', 'description': ''},
>     {'name': '銷售票數', 'description': ''},
>     {'name': '銷售金額', 'description': ''},
>     {'name': '累計銷售票數', 'description': ''},
>     {'name': '累計銷售金額', 'description': ''}],
>    'qcLevel': '',
>    'resourceFormat': 'CSV',
>    'resourceCharacterEncoding': 'UTF-8',
>    'resourceModifiedDate': '2022-09-30 14:05:28',
>    'resourceDownloadUrl': 'https://opendata.culture.tw/upload/dataSource/2018-08-09/1c3753a5-50f4-44f8-a75b-2b4d0dd2a143/69c6f154369fc266e8a3593f83d3b444.csv',
>    'resourceAmount': 241,
>    'resourceNotes': '',
>    'resourceRequestMethod': '',
>    'resourceOasUrl': '',
>    'resourceRequestParameters': []
>   },
>    ...
>  ]
>  ```

We will use `resourceDownloadUrl` to download the weekly box office dataset.

Note that some datasets are published in both CSV and JSON format. For example, `responses[96]` and `responses[97]` are the same dataset, one stored as CSV and the other as JSON.

In [3]:
print('responses[96]: ', responses[96]['resourceDescription'], '\n'
      'responses[97]: ', responses[97]['resourceDescription'], '\n')

responses[96]:  2020年6月1日至2020年6月7日全國電影票房統計數據 
responses[97]:  2020年6月1日至2020年6月7日全國電影票房統計數據JSON格式 



So we must not include the duplicated JSON datasets later.
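
One possible way to drop the duplicates is to filter on the `resourceFormat` field shown in the example record rather than on the description text. The sketch below is only an illustration: `sample_responses` is made-up data that mimics two fields of `responses`, and it assumes the JSON copies carry `resourceFormat` set to `'JSON'`, which the output above does not confirm.

```python
# Hypothetical sample mimicking two entries of `responses`; the real list
# comes from the data.gov.tw API. The 'JSON' format value is an assumption.
sample_responses = [
    {'resourceDescription': '2020年6月1日至2020年6月7日全國電影票房統計數據',
     'resourceFormat': 'CSV'},
    {'resourceDescription': '2020年6月1日至2020年6月7日全國電影票房統計數據JSON格式',
     'resourceFormat': 'JSON'},
]

# Keep the positions of the CSV entries only.
csv_ids = [i for i, record in enumerate(sample_responses)
           if record.get('resourceFormat', '').upper() == 'CSV']
print(csv_ids)
```

If the format field turns out to be unreliable, matching on the `JSON格式` suffix of `resourceDescription` is the fallback.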

### 0.1. Manage the responses

The functions in the `response_results` class manage the response data.

In [4]:
def get_time():
    return datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

In [5]:
class response_results:
    @staticmethod
    def shape_data_list(keep_id_list, responses):
        # Download each weekly CSV and tag its rows with the week's start date.
        data_list = []
        for jj in keep_id_list:
            filename = responses[jj]['resourceDescription']
            # e.g. '2018年7月30日至8月5日...' -> '2018/7/30'
            start_date = filename.split('至')[0]
            start_date = start_date.replace('年', '/').replace('月', '/').replace('日', '')
            url = responses[jj]['resourceDownloadUrl']
            table = pd.read_csv(url)
            table['統計起始日'] = start_date
            data_list.append(table)
        return data_list

    @staticmethod
    def data_list_to_data_frame(data_list):
        # Stack the weekly tables and keep only the columns used in the analysis.
        data = pd.concat(data_list, axis = 0)
        new_colname = ['統計起始日', '上映日期', '中文片名', '國別地區', '上映院數', '累計銷售票數', '累計銷售金額']
        return data[new_colname]

    @staticmethod
    def fix_data_bug(data):
        # One record has a malformed release date ('2019/0807').
        data = data.replace({ '上映日期': '2019/0807' }, '2019/08/07')
        return data

`shape_data_list()`: the for loop inside the function does the following:
1. gets the download URL of each dataset from `resourceDownloadUrl` and reads the CSV from that URL
2. stores the start date of each dataset's statistics period as a new column, `統計起始日`
3. appends each table to a list, which the function returns
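
The start-date extraction in step 2 can be tried on its own. This is a minimal sketch using the description from the example record; `parse_start_date` is a hypothetical helper name that does not appear in the notebook.

```python
def parse_start_date(description):
    """Return the start date of a dataset description as 'YYYY/M/D'."""
    start = description.split('至')[0]  # text before '至' ("to")
    return start.replace('年', '/').replace('月', '/').replace('日', '')

example = '2018年7月30日至8月5日全國電影票房統計數據'
print(parse_start_date(example))  # 2018/7/30
```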

`data_list_to_data_frame()`: concatenates the tables in `data_list` into one pandas DataFrame and keeps only the columns used in the later analysis

`fix_data_bug()`: deal with some known bugs I have found in this box office dataset
1. one record has date format error: `data['上映日期'][4829]` returns `'2019/0807'`
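
Malformed dates like this one could also be found programmatically instead of by inspection. The sketch below assumes release dates are expected in `YYYY/M/D` or `YYYY/MM/DD` form, which is inferred from the fix rather than documented; `DATE_PATTERN` and `find_malformed_dates` are hypothetical names.

```python
import re

# Accepts 'YYYY/M/D' or 'YYYY/MM/DD'; anything else is flagged.
DATE_PATTERN = re.compile(r'\d{4}/\d{1,2}/\d{1,2}')

def find_malformed_dates(values):
    """Return the strings that do not fully match the expected pattern."""
    return [v for v in values if not DATE_PATTERN.fullmatch(v)]

sample = ['2019/08/07', '2019/0807', '2020/1/5']  # made-up sample
print(find_malformed_dates(sample))  # ['2019/0807']
```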

Compare the number of weeks in the new and existing datasets.

In [6]:
keep_id_list = []
for ii in range(len(responses)):
    filename = responses[ii]['resourceDescription']
    if 'JSON格式' not in filename:
        keep_id_list.append(ii)
dataset_count = len(keep_id_list)
print('# of the weeks of the new dataset: ', dataset_count)

# Read the week count stored by the previous run; `with` closes the file.
with open('./../data/count.txt', 'r') as count_file:
    dataset_count_old = int(count_file.readline())
print('# of the weeks of the existing dataset: ', dataset_count_old)

# of the weeks of the new dataset:  222
# of the weeks of the existing dataset:  222


Then rebuild the dataset with the functions in the `response_results` class if the new listing has more weeks, or load the existing dataset if nothing needs to be updated.

In [7]:
if dataset_count > dataset_count_old:
    print(get_time(), ' updating dataset...')
    data_list = response_results.shape_data_list(keep_id_list, responses)
    data = response_results.data_list_to_data_frame(data_list)
    data = response_results.fix_data_bug(data)
    data.to_csv('./../data/box_office.csv', index = False)
    with open('./../data/count.txt', 'w') as count_file:
        count_file.write(str(dataset_count))
    print(get_time(), ' dataset updated!')
elif dataset_count == dataset_count_old:
    print(get_time(), ' assigning dataset...')
    data = pd.read_csv('./../data/box_office.csv')
    print(get_time(), ' dataset assigned!')
else:
    print(get_time(), ' the new dataset has fewer records, please check the response results!')

2023-01-04 14:10:51  assigning dataset...
2023-01-04 14:10:51  dataset assigned!


In [8]:
# data
# data_count

---

## Part 1. Import dataset

Import dataset and rename columns.

In [9]:
df = pd.read_csv('./../data/box_office.csv')
df.rename(columns={'統計起始日': 'statistic_date', '上映日期': 'release_date', '上映院數': 'theater', '中文片名': 'name',
                     '國別地區': 'country', '累計銷售票數': 'ticket', '累計銷售金額': 'revenue'}, inplace=True)

Convert `statistic_date` and `release_date` to **datetime64**, and convert `ticket` and `revenue` to **Int64** and **Float64**, respectively.

Remove records without the information of `ticket` and `revenue`.
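
The type conversions can be checked on a toy example first. The sketch assumes the raw numeric columns may contain thousands separators, as the `replace(',', '')` call in the notebook's helper suggests; the values in `raw` and `raw_dates` are made up.

```python
import pandas as pd

raw = pd.Series(['1,234', '56', '7,890,123'])      # made-up ticket counts
raw_dates = pd.Series(['2018/07/30', '2019/08/07'])  # made-up dates

# Strip thousands separators, then convert to a nullable integer dtype.
clean = pd.to_numeric(raw.str.replace(',', '', regex=False)).astype('Int64')
dates = pd.to_datetime(raw_dates)

print(clean.tolist(), clean.dtype)
print(dates.dt.year.tolist())
```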

In [10]:
null_index = df.index[(df['ticket'].isnull()) | (df['revenue'].isnull())]
df = df.drop(null_index)

In [11]:
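class preprocessing:
    @staticmethod
    def datetime_set_format(column):
        # Parse date strings into datetime64 values.
        return pd.to_datetime(column)

    @staticmethod
    def datetime_retrive_year(column):
        # Extract the year as a string, e.g. '2022'.
        return column.apply(lambda d: str(d.year))

    @staticmethod
    def number_set_format(column, num_type):
        # Remove thousands separators before converting to a numeric dtype.
        tmp = column.apply(lambda d: str(d).replace(',', ''))
        return pd.to_numeric(tmp).astype(num_type)

    @staticmethod
    def number_divide(column, num_div):
        return column.div(num_div)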

In [12]:
df['statistic_date'] = preprocessing.datetime_set_format(df['statistic_date'])
df['statistic_year'] = preprocessing.datetime_retrive_year(df['statistic_date'])
df['release_date'] = preprocessing.datetime_set_format(df['release_date'])
df['release_year'] = preprocessing.datetime_retrive_year(df['release_date'])

df['ticket'] = preprocessing.number_set_format(df['ticket'], 'Int64')
df['revenue'] = preprocessing.number_set_format(df['revenue'], 'Float64')

df['revenue_100m'] = preprocessing.number_divide(df['revenue'], 100000000)

In [13]:
# df
# df.dtypes

Remove duplicate movies with same `name`, keep the record with highest `revenue`.
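
The intended behavior can be illustrated on a toy table: sort so that the highest `revenue` comes first within each `name`, then keep the first row per name. The data and the names `toy` and `best` are made up for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B'],
    'revenue': [100.0, 300.0, 50.0, 20.0],
})

# Highest revenue first within each name, then keep that first row.
best = (toy.sort_values(['name', 'revenue'], ascending=[True, False])
           .drop_duplicates(subset='name', keep='first')
           .reset_index(drop=True))
print(best)
```

Note that with a descending sort, `keep='last'` would retain the lowest revenue instead.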

In [14]:
# Descending sort puts the highest revenue first within each name, so keep the first row.
df_dedup = df.sort_values(['name', 'revenue'], ascending = False).drop_duplicates(subset = 'name', keep = 'first')
df_dedup = df_dedup.reset_index(drop = True)

In [15]:
# df_dedup

---

## Part 2. Analysis

### 2.0. Separate by year

In [16]:
# Use >= so that a week starting on January 1 is not dropped.
df_2018 = df[(df['statistic_date'] >= '2018-01-01') & (df['statistic_date'] <= '2018-12-31')]
df_2019 = df[(df['statistic_date'] >= '2019-01-01') & (df['statistic_date'] <= '2019-12-31')]
df_2020 = df[(df['statistic_date'] >= '2020-01-01') & (df['statistic_date'] <= '2020-12-31')]
df_2021 = df[(df['statistic_date'] >= '2021-01-01') & (df['statistic_date'] <= '2021-12-31')]
df_2022 = df[(df['statistic_date'] >= '2022-01-01') & (df['statistic_date'] <= '2022-12-31')]
print(' 2018:', len(df_2018), 'records\n', '2019:', len(df_2019), 'records\n', '2020:', len(df_2020), 'records\n',
      '2021:', len(df_2021), 'records\n', '2022:', len(df_2022), 'records')

 2018: 2111 records
 2019: 4603 records
 2020: 5849 records
 2021: 3893 records
 2022: 4773 records
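
An alternative way to split by calendar year is to group on the year of `statistic_date`, which avoids hand-written boundary dates. This is a minimal sketch with made-up rows; `toy` and `by_year` are illustrative names that are not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'statistic_date': pd.to_datetime(['2018-12-31', '2019-01-01', '2019-06-03']),
    'name': ['A', 'B', 'C'],
})

# One DataFrame per calendar year, keyed by the year as a plain int.
by_year = {int(year): group
           for year, group in toy.groupby(toy['statistic_date'].dt.year)}
print({year: len(group) for year, group in by_year.items()})
```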


### 2.1. Top 10 movies of 2022

In [17]:
top10movies_2022 = df_2022.drop_duplicates(subset = 'name', keep = 'last').sort_values('revenue', ascending = False)[:10]
top10movies_2022

| | statistic_date | release_date | name | country | theater | ticket | revenue | statistic_year | release_year | revenue_100m |
|---|---|---|---|---|---|---|---|---|---|---|
| 21266 | 2022-12-19 | 2022-05-23 | 捍衛戰士: 獨行俠 | 美國 | 7 | 2682434 | 734498188.0 | 2022 | 2022 | 7.344982 |
| 18052 | 2022-04-25 | 2020-10-30 | 鬼滅之刃劇場版 無限列車篇 | 日本 | 5 | 2602313 | 634938914.0 | 2022 | 2020 | 6.349389 |
| 18507 | 2022-05-30 | 2021-12-15 | 蜘蛛人：無家日 | 美國 | 2 | 1967162 | 498216125.0 | 2022 | 2021 | 4.982161 |
| 20366 | 2022-10-17 | 2022-06-08 | 侏羅紀世界: 統霸天下 | 美國 | 1 | 1278551 | 334165457.0 | 2022 | 2022 | 3.341655 |
| 21191 | 2022-12-19 | 2022-12-14 | 阿凡達：水之道 | 美國 | 100 | 1003392 | 324492308.0 | 2022 | 2022 | 3.244923 |
| 18150 | 2022-05-02 | 2021-11-24 | 月老 | 中華民國 | 2 | 1091230 | 264306579.0 | 2022 | 2021 | 2.643066 |
| 17591 | 2022-03-21 | 2021-11-03 | 永恆族 | 美國 | 1 | 981345 | 258299255.0 | 2022 | 2021 | 2.582993 |
| 21233 | 2022-12-19 | 2022-11-09 | 黑豹 2：瓦干達萬歲 | 美國 | 89 | 918863 | 248143185.0 | 2022 | 2022 | 2.481432 |
| 20025 | 2022-09-26 | 2022-05-04 | 奇異博士2：失控多重宇宙 | 美國 | 1 | 931021 | 244509640.0 | 2022 | 2022 | 2.445096 |
| 19699 | 2022-09-05 | 2022-02-24 | 劇場版 咒術迴戰 0 | 日本 | 1 | 893860 | 232306593.0 | 2022 | 2022 | 2.323066 |


### 2.2. Average movie ticket price in Taiwan

In [18]:
cost = top10movies_2022['revenue'].sum()
ticket = top10movies_2022['ticket'].sum()
spend = cost/ticket
print(f'The average cost of a movie ticket is NT${spend:.2f}.')

The average cost of a movie ticket is NT$262.98.
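
Note that this figure is a ratio of sums over the top 10 movies of 2022 only, so every ticket sold counts equally. The sketch below uses made-up numbers to show how that differs from averaging each film's own price; `films` and the variable names are illustrative.

```python
# Made-up figures for two films: (revenue in NT$, tickets sold).
films = [(3000.0, 10), (2000.0, 40)]

total_revenue = sum(revenue for revenue, _ in films)
total_tickets = sum(tickets for _, tickets in films)

# Ratio of sums: every ticket counts equally.
avg_price = total_revenue / total_tickets

# Mean of per-film ratios: every film counts equally.
mean_of_ratios = sum(revenue / tickets for revenue, tickets in films) / len(films)

print(f'{avg_price:.2f} vs {mean_of_ratios:.2f}')  # 100.00 vs 175.00
```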


---

## Part 3. Applications

Data visualization with [bokeh](https://demo.bokeh.org/).

In [19]:
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, CustomJS, Slider, RadioButtonGroup, Select
from bokeh.plotting import figure, show
from math import pi
output_notebook()

# from bokeh.plotting import figure, output_file, show, save
# output_file('./prototype.html', title = 'Static HTML file')
# save(p)

# reset_output()

### 3.0. Filters

#### 3.0.0 Options of the select widgets

In [20]:
# year
year_options = []
for year in df_dedup['statistic_year']:
    if year not in year_options:
        year_options.append(str(year))
year_options.insert(0, '全部')
print(year_options)

# country
country_count = df_dedup['country'].value_counts()
country_list = country_count.index.tolist()[:10]
country_list.append('其他')
country_options = country_list.copy()
country_options.insert(0, '全部')
print(country_options)

['全部', '2022', '2018', '2021', '2019', '2020']
['全部', '美國', '日本', '中華民國', '法國', '南韓', '英國', '香港', '德國', '義大利', '西班牙', '其他']
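
The year options come out in order of first appearance. If a sorted list is preferred, one possible sketch is shown here; the `years` series is made-up sample data standing in for `df_dedup['statistic_year']`.

```python
import pandas as pd

years = pd.Series(['2022', '2018', '2021', '2018', '2022'])  # made-up sample

# Unique values, newest first, with the "all" entry ('全部') in front.
year_options = ['全部'] + sorted(years.unique().tolist(), reverse=True)
print(year_options)  # ['全部', '2022', '2021', '2018']
```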


#### 3.0.1 Widgets

In [21]:
type_checkbox = RadioButtonGroup(labels = ['年', '週'], active = 0)
year_select = Select(title = '年份', options = list(year_options), value = '2022')
country_select = Select(title = '國別地區', options = list(country_options), value = '全部')
revenue_slider = Slider(title = '累計銷售金額大於', start = 0, end = max(df['revenue']), value = 0, step = 100000)
ticket_slider = Slider(title = '累計銷售票數大於', start = 0, end = max(df['ticket']), value = 0, step = 100000)
theater_select = Slider(title = '上映院數大於', start = 0, end = max(df['theater']), value = 0, step = 10)

show(type_checkbox)
show(year_select)
show(country_select)
show(revenue_slider)
show(ticket_slider)
show(theater_select)

### 3.1. Plots

#### 3.1.0 Main

In [22]:
source = ColumnDataSource(df_dedup)
tooltips = [
    ('name', '@name'),
    ('release_year', '@release_year'),
    ('ticket', '@ticket'),
    ('revenue', '@revenue')
]
p = figure(height = 400, width = 400, x_axis_label = 'release_year', y_axis_label = 'revenue', tooltips=tooltips)
p.circle(x = 'release_year', y = 'revenue', source = source)

xaxis = p.xaxis[0]; xaxis.formatter.use_scientific = False
yaxis = p.yaxis[0]; yaxis.formatter.use_scientific = False

show(p)

#### 3.1.1 Countries of origin of the released movies

In [23]:
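df_copy = df_dedup.copy()
# Group every country outside the top 10 under '其他' ("other").
df_copy['country'] = df_copy['country'].where(df_copy['country'].isin(country_list), '其他')
# Reindex so the counts line up with the order of country_list.
country_count_others = df_copy['country'].value_counts().reindex(country_list, fill_value = 0).tolist()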
country_data = {
    'country': country_list,
    'count': country_count_others
}
p_country_data = pd.DataFrame(country_data)
# p_country_data

In [24]:
p_country = figure(width = 400, height = 400, x_range = p_country_data['country'], x_axis_label = 'country', y_axis_label = 'count',
                   toolbar_location = None, tools = 'hover', tooltips = [('country', '@country'), ('count', '@count')], title = '')
p_country.vbar(x = 'country', top = 'count', width = 0.9, source = p_country_data)

p_country.xaxis.major_label_orientation = pi/4
p_country.y_range.start = 0

show(p_country)

#### 3.1.2 Release years of the movies

In [25]:
year_count = df_dedup['release_year'].value_counts()
year_list = sorted(year_count.index.tolist()[:9], reverse = True)
year_list.append('Before ' + year_list[-1])

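df_copy = df_dedup.copy()
# Group every release year outside year_list under the 'Before ...' label.
df_copy['release_year'] = df_copy['release_year'].where(df_copy['release_year'].isin(year_list), year_list[-1])
# Reindex so the counts line up with the order of year_list.
year_count_others = df_copy['release_year'].value_counts().reindex(year_list, fill_value = 0).tolist()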
year_data = {
    'year': year_list,
    'count': year_count_others
}
p_year_data = pd.DataFrame(year_data)
# p_year_data

In [26]:
p_year = figure(width = 400, height = 400, x_range = p_year_data['year'], x_axis_label = 'year', y_axis_label = 'count',
                   toolbar_location = None, tools = 'hover', tooltips = [('year', '@year'), ('count', '@count')], title = '')
p_year.vbar(x = 'year', top = 'count', width = 0.9, source = p_year_data)

p_year.xaxis.major_label_orientation = pi/4
p_year.y_range.start = 0

show(p_year)