# 50+ Python Exercises: A Data Science Learning Handbook

> Getting started with the data science module Pandas

[數據交點](https://www.datainpoint.com) | 郭耀仁 <yaojenkuo@datainpoint.com>

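## Exercise guidelines

- An exercise session disconnects automatically after more than 10 minutes of inactivity; click the exercise link again to restart it.
- The first code cell imports the modules that may be needed.
- If an exercise needs to load a file, the file is stored in the `data` folder.
- Each exercise already provides the function or class, the expected inputs, or the parameter names, so we only need to write the code block. Type hints are also given to indicate the types of the expected inputs and expected outputs.
- The docstring describes how the test is carried out. Reading it clarifies the relationship between the expected inputs and the expected outputs and helps us solve the exercise faster.
- Write the function or class body between the two comments `### BEGIN SOLUTION` and `### END SOLUTION`.
- Place the expected output after the `return` keyword; merely printing it with `print()` will not pass the test.
- A `SyntaxError` or an `IndentationError` will cause the test to fail. Before testing, call the function in the notebook to check that it behaves as the docstring describes.
- If you get stuck, look at the exercise solutions or review the course videos before continuing.
- Steps to run the tests:
    1. Select File -> Save Notebook from the top menu to save exercises.ipynb.
    2. Select File -> New -> Terminal from the top menu to open a terminal.
    3. Type `python 11-pandas/test_runner.py` in the terminal and press Enter to run the tests.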
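
The guideline about `return` versus `print()` can be illustrated with a minimal sketch. The two function names below are hypothetical and not part of the exercises; the sketch assumes only that a test inspects the value a function returns.

```python
def add_with_return(a, b):
    # The sum is handed back to the caller, so a test can inspect it.
    return a + b


def add_with_print(a, b):
    # The sum is only displayed; the function implicitly returns None.
    print(a + b)


returned_value = add_with_return(1, 2)
printed_value = add_with_print(1, 2)
```

Here `returned_value` holds the sum, while `printed_value` is `None` even though the sum appeared on screen.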

In [1]:
import numpy as np
import pandas as pd

## 81. Create an `Index`

Define a function `create_an_index()` that returns the specified `Index`.

- Expected inputs: None.
- Expected outputs: a (5,) Index.

In [2]:
def create_an_index():
    """
    >>> an_index = create_an_index()
    >>> print(type(an_index))
    <class 'pandas.core.indexes.base.Index'>
    >>> print(an_index.shape)
    (5,)
    """
    ### BEGIN SOLUTION
    out_index = pd.Index(['first', 'second', 'third', 'fourth', 'fifth'])
    return out_index
    ### END SOLUTION
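
As a hedged illustration of how an `Index` behaves, the sketch below rebuilds the same five labels and checks a few properties. The variable names are illustrative and not part of the exercise.

```python
import pandas as pd

labels = pd.Index(["first", "second", "third", "fourth", "fifth"])

# An Index is one-dimensional and supports positional access.
n_labels = labels.shape[0]
third_label = labels[2]

# An Index is immutable: item assignment raises a TypeError.
try:
    labels[0] = "zeroth"
    is_immutable = False
except TypeError:
    is_immutable = True

# get_loc() maps a label back to its integer position.
position_of_fourth = labels.get_loc("fourth")
```

Immutability is what makes it safe to share one `Index` between several `Series` or `DataFrame` objects.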

## 02. Define a function named `create_first_five_odds_series` that is able to create a `Series` instance from scratch.

- Expected inputs: None.
- Expected outputs: a (5,) Series.

```
first     1
second    3
third     5
fourth    7
fifth     9
dtype: int64
```

In [3]:
def create_first_five_odds_series():
    """
    >>> first_five_odds_series = create_first_five_odds_series()
    >>> print(type(first_five_odds_series))
    <class 'pandas.core.series.Series'>
    >>> print(first_five_odds_series.shape)
    (5,)
    """
    ### BEGIN SOLUTION
    out_index = pd.Index(['first', 'second', 'third', 'fourth', 'fifth'])
    out_arr = np.arange(1, 10, 2)
    out_ser = pd.Series(out_arr, index=out_index)
    return out_ser
    ### END SOLUTION
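
A `Series` can also be built from a dictionary, in which case the keys become the index labels. The sketch below is an alternative to the solution above, not a replacement for it; the variable names are illustrative.

```python
import pandas as pd

# Dictionary keys become index labels, values become the data.
odds_by_label = {"first": 1, "second": 3, "third": 5, "fourth": 7, "fifth": 9}
odds_series = pd.Series(odds_by_label)

# Values can be retrieved by label or by integer position.
third_by_label = odds_series["third"]
third_by_position = odds_series.iloc[2]
```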

## 03. Define a function named `create_first_five_evens_series` that is able to create a `Series` instance from scratch.

- Expected inputs: None.
- Expected outputs: a (5,) Series.

```
first     0
second    2
third     4
fourth    6
fifth     8
dtype: int64
```

In [4]:
def create_first_five_evens_series():
    """
    >>> first_five_evens_series = create_first_five_evens_series()
    >>> print(type(first_five_evens_series))
    <class 'pandas.core.series.Series'>
    >>> print(first_five_evens_series.shape)
    (5,)
    """
    ### BEGIN SOLUTION
    out_index = pd.Index(['first', 'second', 'third', 'fourth', 'fifth'])
    out_arr = np.arange(0, 10, 2)
    out_ser = pd.Series(out_arr, index=out_index)
    return out_ser
    ### END SOLUTION

## 04. Define a function named `create_first_five_integers_df` that is able to create a `DataFrame` instance from scratch.

- Expected inputs: None.
- Expected outputs: a (5, 2) DataFrame.

```
        even  odd
first      0    1
second     2    3
third      4    5
fourth     6    7
fifth      8    9
```

In [5]:
def create_first_five_integers_df():
    """
    >>> first_five_integers_df = create_first_five_integers_df()
    >>> print(type(first_five_integers_df))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(first_five_integers_df.shape)
    (5, 2)
    """
    ### BEGIN SOLUTION
    first_five_odds_series = create_first_five_odds_series()
    first_five_evens_series = create_first_five_evens_series()
    out_df = pd.DataFrame()
    out_df["even"] = first_five_evens_series
    out_df["odd"] = first_five_odds_series
    return out_df
    ### END SOLUTION
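
Another way to combine two `Series` that share an index is `pd.concat()` along the column axis. This is a minimal sketch of that alternative, with illustrative variable names; it assumes both `Series` carry the same labels.

```python
import numpy as np
import pandas as pd

labels = ["first", "second", "third", "fourth", "fifth"]
evens = pd.Series(np.arange(0, 10, 2), index=labels)
odds = pd.Series(np.arange(1, 10, 2), index=labels)

# Concatenate along columns; `keys` supplies the column names.
integers_df = pd.concat([evens, odds], axis=1, keys=["even", "odd"])
```

Rows are aligned by index label, so the order of the labels in each `Series` does not have to match.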

## 05. Define a function named `create_trilogy_df` that is able to create a `DataFrame` instance from scratch.

- Expected inputs: None.
- Expected outputs: a (6, 4) DataFrame.

```
                 trilogy                       title  release_year  \
0  The Lord of the Rings  The Fellowship of the Ring          2001   
1  The Lord of the Rings              The Two Towers          2002   
2  The Lord of the Rings      The Return of the King          2003   
3        The Dark Knight               Batman Begins          2005   
4        The Dark Knight             The Dark Knight          2008   
5        The Dark Knight       The Dark Knight Rises          2012   

            director  
0      Peter Jackson  
1      Peter Jackson  
2      Peter Jackson  
3  Christopher Nolan  
4  Christopher Nolan  
5  Christopher Nolan
```

In [6]:
def create_trilogy_df():
    """
    >>> trilogy_df = create_trilogy_df()
    >>> print(type(trilogy_df))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(trilogy_df.shape)
    (6, 4)
    """
    ### BEGIN SOLUTION
    titles = ["The Fellowship of the Ring", "The Two Towers", "The Return of the King", "Batman Begins", "The Dark Knight", "The Dark Knight Rises"]
    release_years = [2001, 2002, 2003, 2005, 2008, 2012]
    # Not part of the expected (6, 4) output; kept for reference only.
    imdb_ratings = [8.8, 8.7, 8.9, 8.2, 9.0, 8.3]
    directors = ["Peter Jackson"] * 3 + ["Christopher Nolan"] * 3
    trilogy_titles = ["The Lord of the Rings"] * 3 + ["The Dark Knight"] * 3
    out_df = pd.DataFrame()
    out_df["trilogy"] = trilogy_titles
    out_df["title"] = titles
    out_df["release_year"] = release_years
    out_df["director"] = directors
    return out_df
    ### END SOLUTION
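
The same table can be sketched in one step by passing a dictionary of lists to `pd.DataFrame()`, where each key becomes a column name. This is an alternative construction shown for comparison, using the values from the expected output above.

```python
import pandas as pd

trilogy_df = pd.DataFrame({
    "trilogy": ["The Lord of the Rings"] * 3 + ["The Dark Knight"] * 3,
    "title": [
        "The Fellowship of the Ring",
        "The Two Towers",
        "The Return of the King",
        "Batman Begins",
        "The Dark Knight",
        "The Dark Knight Rises",
    ],
    "release_year": [2001, 2002, 2003, 2005, 2008, 2012],
    "director": ["Peter Jackson"] * 3 + ["Christopher Nolan"] * 3,
})
```

All lists must have the same length; otherwise pandas raises a `ValueError`.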

## 06. Define a function named `get_olympic_df` that is able to import a csv file as a pandas DataFrame.

- Expected inputs: a CSV file `all_time_olympic_medals.csv`.
- Expected outputs: a (153, 17) DataFrame.

```
                            team_name team_ioc  no_summer_games  \
0                         Afghanistan      AFG               14   
1                             Algeria      ALG               13   
2                           Argentina      ARG               24   
3                             Armenia      ARM                6   
4                         Australasia      ANZ                2   
..                                ...      ...              ...   
148                          Zimbabwe      ZIM               13   
149      Independent Olympic Athletes      IOA                3   
150  Independent Olympic Participants      IOP                1   
151                        Mixed team      ZZX                3   
152                            Totals      NaN               28   

     no_summer_golds  no_summer_silvers  no_summer_bronzes  no_summer_totals  \
0                  0                  0                  2                 2   
1                  5                  4                  8                17   
2                 21                 25                 28                74   
3                  2                  6                  6                14   
4                  3                  4                  5                12   
..               ...                ...                ...               ...   
148                3                  4                  1                 8   
149                1                  0                  1                 2   
150                0                  1                  2                 3   
151                8                  5                  4                17   
152             5116               5081               5488             15685   

     no_winter_games  no_winter_golds  no_winter_silvers  no_winter_bronzes  \
0                  0                0                  0                  0   
1                  3                0                  0                  0   
2                 19                0                  0                  0   
3                  7                0                  0                  0   
4                  0                0                  0                  0   
..               ...              ...                ...                ...   
148                1                0                  0                  0   
149                0                0                  0                  0   
150                0                0                  0                  0   
151                0                0                  0                  0   
152               23             1062               1059               1050   

     no_winter_totals  no_combined_games  no_combined_golds  \
0                   0                 14                  0   
1                   0                 16                  5   
2                   0                 43                 21   
3                   0                 13                  2   
4                   0                  2                  3   
..                ...                ...                ...   
148                 0                 14                  3   
149                 0                  3                  1   
150                 0                  1                  0   
151                 0                  3                  8   
152              3171                 51               6178   

     no_combined_silvers  no_combined_bronzes  no_combined_totals  
0                      0                    2                   2  
1                      4                    8                  17  
2                     25                   28                  74  
3                      6                    6                  14  
4                      4                    5                  12  
..                   ...                  ...                 ...  
148                    4                    1                   8  
149                    0                    1                   2  
150                    1                    2                   3  
151                    5                    4                  17  
152                 6140                 6538               18856  

[153 rows x 17 columns]
```

In [7]:
def get_olympic_df(csv_file_path):
    """
    >>> olympic_df = get_olympic_df('all_time_olympic_medals.csv')
    >>> print(type(olympic_df))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(olympic_df.shape)
    (153, 17)
    """
    ### BEGIN SOLUTION
    df = pd.read_csv(csv_file_path)
    return df
    ### END SOLUTION
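
To see what `pd.read_csv()` does without the real file, the sketch below reads a tiny in-memory CSV. The CSV text is hypothetical toy data, not taken from `all_time_olympic_medals.csv`; it only mimics the fact that an empty field is read as `NaN`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a real file on disk.
csv_text = "team_name,team_ioc,no_summer_golds\nTeam A,AAA,3\nTeam B,,0\n"

# read_csv() accepts a path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy_df.shape
# The empty team_ioc field of Team B becomes a missing value.
ioc_is_missing = toy_df["team_ioc"].isna().tolist()
```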

## 07. Define a function named `find_taiwan` that is able to retrieve the data of Taiwan as a pandas DataFrame.

PS Taiwan might not be "Taiwan" in Olympic data.

- Expected inputs: a CSV file `all_time_olympic_medals.csv`.
- Expected outputs: a (1, 17) DataFrame.

In [15]:
def find_taiwan(csv_file_path):
    """
    >>> taiwan = find_taiwan('all_time_olympic_medals.csv')
    >>> print(type(taiwan))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(taiwan.shape)
    (1, 17)
    >>> print(taiwan['team_name'].values[0])
    Chinese Taipei
    """
    ### BEGIN SOLUTION
    df = pd.read_csv(csv_file_path)
    df_tw = df[df['team_name'] == 'Chinese Taipei']
    return df_tw
    ### END SOLUTION
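
The row selection in the solution relies on boolean indexing. The sketch below shows the two steps separately on hypothetical toy data; the team names and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the Olympic dataset.
toy_df = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team C"],
    "golds": [3, 0, 5],
})

# Comparing a column with a value yields a boolean Series (the mask).
mask = toy_df["team_name"] == "Team B"

# Indexing with the mask keeps only rows where it is True; the result
# is still a DataFrame, even when a single row matches.
team_b = toy_df[mask]
```

Because the result keeps the original row label, `team_b.index` still refers to the row's position in `toy_df`.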

## 08. Define a function named `find_the_king_of_summer_olympics` that is able to retrieve the data of the country that won the most gold medals in summer Olympics.

- Expected inputs: a CSV file `all_time_olympic_medals.csv`.
- Expected outputs: a (1, 17) DataFrame.

In [9]:
def find_the_king_of_summer_olympics(csv_file_path):
    """
    >>> the_king_of_summer_olympics = find_the_king_of_summer_olympics('all_time_olympic_medals.csv')
    >>> print(type(the_king_of_summer_olympics))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(the_king_of_summer_olympics.shape)
    (1, 17)
    >>> print(the_king_of_summer_olympics['no_summer_golds'].values[0])
    1022
    >>> print(the_king_of_summer_olympics['team_name'].values[0])
    United States
    """
    ### BEGIN SOLUTION
    df = pd.read_csv(csv_file_path)
    df_without_totals = df[df['team_name'] != 'Totals']
    max_gold = df_without_totals['no_summer_golds'].max()
    out = df_without_totals[df_without_totals['no_summer_golds'] == max_gold]
    return out
    ### END SOLUTION

## 09. Define a function named `find_the_king_of_winter_olympics` that is able to retrieve the data of the country that won the most gold medals in winter Olympics.

- Expected inputs: a CSV file `all_time_olympic_medals.csv`.
- Expected outputs: a (1, 17) DataFrame.

In [10]:
def find_the_king_of_winter_olympics(csv_file_path):
    """
    >>> the_king_of_winter_olympics = find_the_king_of_winter_olympics('all_time_olympic_medals.csv')
    >>> print(type(the_king_of_winter_olympics))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(the_king_of_winter_olympics.shape)
    (1, 17)
    >>> print(the_king_of_winter_olympics['no_winter_golds'].values[0])
    132
    >>> print(the_king_of_winter_olympics['team_name'].values[0])
    Norway
    """
    ### BEGIN SOLUTION
    df = pd.read_csv(csv_file_path)
    df_without_totals = df[df['team_name'] != 'Totals']
    max_gold = df_without_totals['no_winter_golds'].max()
    out = df_without_totals[df_without_totals['no_winter_golds'] == max_gold]
    return out
    ### END SOLUTION
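
The two "king" exercises compare a column against its maximum. Two other ways to pick the top row are sketched below on hypothetical toy data: `DataFrame.nlargest()` and `idxmax()` with a list passed to `.loc` so that the result stays a `DataFrame`. These are alternatives for comparison, not the method used in the solutions.

```python
import pandas as pd

# Hypothetical toy data with a summary row, as in the Olympic file.
toy_df = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Totals"],
    "golds": [3, 5, 8],
})
without_totals = toy_df[toy_df["team_name"] != "Totals"]

# nlargest(1, column) returns the single row with the largest value.
top_by_nlargest = without_totals.nlargest(1, "golds")

# idxmax() returns the row label of the maximum; wrapping it in a list
# makes .loc return a one-row DataFrame instead of a Series.
top_by_idxmax = without_totals.loc[[without_totals["golds"].idxmax()]]
```

One difference to keep in mind: when several teams tie for the maximum, the mask-based solution returns all of them, while both alternatives here return only the first.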

## 10. Define a function named `find_largest_ratio` that is able to retrieve the data of the country that has the largest ratio which is calculated as:

\begin{equation}
\text{Ratio} = \frac{\text{Summer Gold} - \text{Winter Gold}}{\text{Total Gold}}
\end{equation}

PS You have to exclude the countries with ratio calculated as 1.

- Expected inputs: a CSV file `all_time_olympic_medals.csv`.
- Expected outputs: a Series of size 17.

In [11]:
def find_largest_ratio(csv_file_path):
    """
    >>> largest_ratio = find_largest_ratio('all_time_olympic_medals.csv')
    >>> print(type(largest_ratio))
    <class 'pandas.core.series.Series'>
    >>> print(largest_ratio.size)
    17
    >>> print(largest_ratio['team_name'])
    Hungary
    """
    ### BEGIN SOLUTION
    df = pd.read_csv(csv_file_path)
    df_without_totals = df[df['team_name'] != 'Totals']
    summer_golds = df_without_totals['no_summer_golds']
    winter_golds = df_without_totals['no_winter_golds']
    # Teams without any gold medal yield NaN (0 / 0), which idxmax() skips.
    ratio = (summer_golds - winter_golds) / (summer_golds + winter_golds)
    ratio_not_one = ratio[ratio != 1]
    out_index = ratio_not_one.idxmax()
    out = df_without_totals.loc[out_index, :]
    return out
    ### END SOLUTION
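
The ratio formula and the exclusion rule can be traced by hand on a small example. The data below is hypothetical toy data with made-up column names; it is chosen so that one team has a ratio of exactly 1 and one team has no gold medals at all.

```python
import pandas as pd

# Hypothetical toy data, not the Olympic dataset.
toy_df = pd.DataFrame({
    "team_name": ["Team A", "Team B", "Team C", "Team D"],
    "summer_golds": [4, 9, 0, 3],
    "winter_golds": [0, 1, 0, 3],
})

total_golds = toy_df["summer_golds"] + toy_df["winter_golds"]
ratio = (toy_df["summer_golds"] - toy_df["winter_golds"]) / total_golds
# Team A: 4 / 4 = 1.0, Team B: 8 / 10 = 0.8,
# Team C: 0 / 0 = NaN, Team D: 0 / 6 = 0.0

# Drop ratios equal to 1; NaN survives this filter but idxmax() skips it.
ratio_not_one = ratio[ratio != 1]
best_row = toy_df.loc[ratio_not_one.idxmax(), :]
```

Selecting a single row label with `.loc` returns a `Series` whose index is the column names, which is why the exercise expects a `Series` of size 17 rather than a `DataFrame`.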