# 50+ Python Exercises: A Data Science Workbook

> Basic DataFrame Manipulations

[數據交點](https://www.datainpoint.com) | 郭耀仁 <yaojenkuo@datainpoint.com>

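## Exercise Guidelines

- An exercise session disconnects automatically after more than 10 minutes of inactivity; click the exercise link again to restart it.
- The first code cell loads the modules that may be needed.
- If an exercise needs to load a file, the files are stored under the absolute path `/home/jovyan/data`.
- Each exercise already provides the function or class name, the expected inputs, and the parameter names, so we only need to write the code body. Type hints are also given to indicate the expected input and output types.
- The docstring describes how the tests run; reading it clarifies the relationship between the expected input and the expected output and helps us solve the exercise faster.
- Write the function or class body between the `### BEGIN SOLUTION` and `### END SOLUTION` comments.
- Place the expected output after the `return` keyword; merely printing it with `print()` will not pass the tests.
- A `SyntaxError`, an `IndentationError`, or a similar error causes the tests to fail. Before running the tests, call the function in the notebook and check that it behaves as the docstring describes.
- If you get stuck, look at the exercise solutions or review the course videos before continuing.
- Steps to run the tests:
    1. Select File -> Save Notebook from the top menu to save exercises.ipynb.
    2. Select File -> New -> Terminal from the top menu to open a terminal.
    3. In the terminal, type `python 15-basic-dataframe-manipulations/test_runner.py` and press Enter to run the tests.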
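
The guideline about `return` versus `print()` can be illustrated with a minimal sketch. The two functions below are hypothetical and not part of the exercises; they only show why a printed value cannot be checked by a test.

```python
def add_with_print(x: int, y: int) -> None:
    # Only displays the sum; the function itself returns None.
    print(x + y)


def add_with_return(x: int, y: int) -> int:
    # Hands the sum back to the caller, so a test can compare it.
    return x + y


printed = add_with_print(1, 2)    # shows the sum on screen
returned = add_with_return(1, 2)  # shows nothing, but keeps the value
print(printed is None)
print(returned == 3)
```

A test that compares the function's result with the expected output only ever sees the returned value, which is `None` for the first function.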

In [2]:
import pandas as pd

## 121. Load `movies.csv`

Define the function `import_movies_csv()` to load `movies.csv`, located at `/home/jovyan/data/internet-movie-database`, as a `DataFrame`.

- Use the absolute path.
- Use the `pd.read_csv()` function.
- Write the expected output after `return`.

In [72]:
def import_movies_csv() -> pd.core.frame.DataFrame:
    """
    >>> movies_csv = import_movies_csv()
    >>> type(movies_csv)
    pandas.core.frame.DataFrame
    >>> movies_csv.shape
    (250, 6)
    """
    ### BEGIN SOLUTION
    # Absolute path of the file in the exercise environment.
    file_path = "/home/jovyan/data/internet-movie-database/movies.csv"
    return pd.read_csv(file_path)
    ### END SOLUTION

In [73]:
movies_csv = import_movies_csv()
type(movies_csv)
movies_csv.shape

(250, 6)
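
When the course data directory is not at hand, `pd.read_csv()` can still be tried on an in-memory buffer. The sketch below makes that assumption: `csv_text` and `sample` are illustrative names, and the three rows are copied from the first rows of `movies.csv` shown in exercise 122.

```python
import io

import pandas as pd

# Three-row sample standing in for movies.csv (illustrative only).
csv_text = """id,title,release_year,rating,director,runtime
1,The Shawshank Redemption,1994,9.3,Frank Darabont,142
2,The Godfather,1972,9.2,Francis Ford Coppola,175
3,The Godfather: Part II,1974,9.0,Francis Ford Coppola,202
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(type(sample))
print(sample.shape)
```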

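## 122. The first 10 and last 10 rows of `movies.csv`

Define the class `ShowHeadAndTail` to create objects with two methods, `show_head()` and `show_tail()`, that display the first 10 and the last 10 rows of `movies.csv`, located at `/home/jovyan/data/internet-movie-database`.

- Use the `import_movies_csv()` function.
- Use `self`.
- Use `__init__()`.
- Use `self.attribute` to access attributes inside the class body.
- Use `self.method()` to call methods inside the class body.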

In [7]:
class ShowHeadAndTail:
    """
    >>> show_head_and_tail = ShowHeadAndTail()
    >>> show_head_and_tail.show_head()
       id                                              title  release_year  \
    0   1                           The Shawshank Redemption          1994   
    1   2                                      The Godfather          1972   
    2   3                             The Godfather: Part II          1974   
    3   4                                    The Dark Knight          2008   
    4   5                                       12 Angry Men          1957   
    5   6                                   Schindler's List          1993   
    6   7      The Lord of the Rings: The Return of the King          2003   
    7   8                                       Pulp Fiction          1994   
    8   9                     The Good, the Bad and the Ugly          1966   
    9  10  The Lord of the Rings: The Fellowship of the Ring          2001   

       rating              director  runtime  
    0     9.3        Frank Darabont      142  
    1     9.2  Francis Ford Coppola      175  
    2     9.0  Francis Ford Coppola      202  
    3     9.0     Christopher Nolan      152  
    4     9.0          Sidney Lumet       96  
    5     8.9      Steven Spielberg      195  
    6     8.9         Peter Jackson      201  
    7     8.9     Quentin Tarantino      154  
    8     8.8          Sergio Leone      178  
    9     8.8         Peter Jackson      178
    >>> show_head_and_tail.show_tail()
          id                                           title  release_year  \
    240  241                                 Rang De Basanti          2006   
    241  242                                    Paris, Texas          1984   
    242  243                                        Drishyam          2013   
    243  244                      Portrait of a Lady on Fire          2019   
    244  245                           It Happened One Night          1934   
    245  246  Neon Genesis Evangelion: The End of Evangelion          1997   
    246  247                              7 Kogustaki Mucize          2019   
    247  248                                      Tangerines          2013   
    248  249                                        Drishyam          2015   
    249  250                                          Swades          2004   

         rating                 director  runtime  
    240     8.2  Rakeysh Omprakash Mehra      167  
    241     8.1              Wim Wenders      145  
    242     8.4            Jeethu Joseph      160  
    243     8.1           Céline Sciamma      122  
    244     8.1              Frank Capra      105  
    245     8.1             Hideaki Anno       87  
    246     8.2       Mehmet Ada Öztekin      132  
    247     8.2           Zaza Urushadze       87  
    248     8.2          Nishikant Kamat      163  
    249     8.2       Ashutosh Gowariker      189
    """
    ### BEGIN SOLUTION
    def __init__(self):
        # Load the data once and keep it on the instance for both methods.
        self.movies_csv = import_movies_csv()
    
    def show_head(self):
        return self.movies_csv.head(10)
    
    def show_tail(self):
        return self.movies_csv.tail(10)      
    ### END SOLUTION

In [9]:
show_head_and_tail = ShowHeadAndTail()
show_head_and_tail.show_head()
show_head_and_tail.show_tail()

Unnamed: 0,id,title,release_year,rating,director,runtime
240,241,Rang De Basanti,2006,8.2,Rakeysh Omprakash Mehra,167
241,242,"Paris, Texas",1984,8.1,Wim Wenders,145
242,243,Drishyam,2013,8.4,Jeethu Joseph,160
243,244,Portrait of a Lady on Fire,2019,8.1,Céline Sciamma,122
244,245,It Happened One Night,1934,8.1,Frank Capra,105
245,246,Neon Genesis Evangelion: The End of Evangelion,1997,8.1,Hideaki Anno,87
246,247,7 Kogustaki Mucize,2019,8.2,Mehmet Ada Öztekin,132
247,248,Tangerines,2013,8.2,Zaza Urushadze,87
248,249,Drishyam,2015,8.2,Nishikant Kamat,163
249,250,Swades,2004,8.2,Ashutosh Gowariker,189
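
As a small sketch of how `head()` and `tail()` behave, the toy frame below (a hypothetical stand-in for `movies.csv` with only an `id` column) shows the effect of the row-count argument and of leaving it out.

```python
import pandas as pd

# Toy frame with 25 rows; only the id column is needed here.
toy = pd.DataFrame({"id": range(1, 26)})

first_rows = toy.head(10)   # rows with id 1 through 10
last_rows = toy.tail(10)    # rows with id 16 through 25
default_rows = toy.head()   # without an argument, head() returns 5 rows

print(first_rows["id"].tolist())
print(last_rows["id"].tolist())
print(len(default_rows))
```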


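## 123. Descriptive statistics for numeric columns

Define the function `describe_dataframes_numeric_columns()` to return the descriptive statistics of the numeric columns of `movies.csv`, located at `/home/jovyan/data/internet-movie-database`.

- Use the `import_movies_csv()` function.
- Use `DataFrame.describe()`.
- Write the expected output after `return`.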

In [11]:
def describe_dataframes_numeric_columns() -> pd.core.frame.DataFrame:
    """
    >>> dataframes_numeric_columns = describe_dataframes_numeric_columns()
    >>> type(dataframes_numeric_columns)
    pandas.core.frame.DataFrame
    >>> dataframes_numeric_columns
                   id  release_year     rating    runtime
    count  250.000000    250.000000  250.00000  250.00000
    mean   125.500000   1987.532000    8.30520  130.76800
    std     72.312977     24.873999    0.22138   33.60248
    min      1.000000   1921.000000    8.10000   45.00000
    25%     63.250000   1971.000000    8.10000  108.00000
    50%    125.500000   1995.000000    8.20000  127.00000
    75%    187.750000   2007.000000    8.40000  147.75000
    max    250.000000   2021.000000    9.30000  321.00000
    """
    ### BEGIN SOLUTION
    movies_csv = import_movies_csv()
    return movies_csv.describe()
    ### END SOLUTION

In [13]:
dataframes_numeric_columns = describe_dataframes_numeric_columns()
type(dataframes_numeric_columns)
dataframes_numeric_columns

Unnamed: 0,id,release_year,rating,runtime
count,250.0,250.0,250.0,250.0
mean,125.5,1987.532,8.3052,130.768
std,72.312977,24.873999,0.22138,33.60248
min,1.0,1921.0,8.1,45.0
25%,63.25,1971.0,8.1,108.0
50%,125.5,1995.0,8.2,127.0
75%,187.75,2007.0,8.4,147.75
max,250.0,2021.0,9.3,321.0
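
The output above has no `title` or `director` column because `describe()` summarizes only numeric columns by default. A minimal sketch on a toy frame (values taken from the first three movies; `toy` is an illustrative name) makes this visible:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II"],
    "runtime": [142, 175, 202],
})

# Default behavior: text columns are left out of the summary.
numeric_only = toy.describe()
print(list(numeric_only.columns))
print(numeric_only.loc["mean", "runtime"])

# include="all" asks for every column, numeric or not.
everything = toy.describe(include="all")
print(list(everything.columns))
```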


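## 124. Descriptive statistics for selected numeric columns

Define the function `describe_selected_numeric_columns()` to return the descriptive statistics of the three columns `release_year`, `rating`, and `runtime` of `movies.csv`, located at `/home/jovyan/data/internet-movie-database`.

- Use the `describe_dataframes_numeric_columns()` function.
- Apply the column-selection technique.
- Write the expected output after `return`.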

In [19]:
def describe_selected_numeric_columns() -> pd.core.frame.DataFrame:
    """
    >>> selected_numeric_columns = describe_selected_numeric_columns()
    >>> type(selected_numeric_columns)
    pandas.core.frame.DataFrame
    >>> selected_numeric_columns
           release_year     rating    runtime
    count    250.000000  250.00000  250.00000
    mean    1987.532000    8.30520  130.76800
    std       24.873999    0.22138   33.60248
    min     1921.000000    8.10000   45.00000
    25%     1971.000000    8.10000  108.00000
    50%     1995.000000    8.20000  127.00000
    75%     2007.000000    8.40000  147.75000
    max     2021.000000    9.30000  321.00000
    """
    ### BEGIN SOLUTION
    dataframes_numeric_columns = describe_dataframes_numeric_columns()
    return dataframes_numeric_columns[["release_year", "rating", "runtime"]]
    ### END SOLUTION

In [21]:
selected_numeric_columns = describe_selected_numeric_columns()
type(selected_numeric_columns)
selected_numeric_columns

Unnamed: 0,release_year,rating,runtime
count,250.0,250.0,250.0
mean,1987.532,8.3052,130.768
std,24.873999,0.22138,33.60248
min,1921.0,8.1,45.0
25%,1971.0,8.1,108.0
50%,1995.0,8.2,127.0
75%,2007.0,8.4,147.75
max,2021.0,9.3,321.0
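
The type hint asks for a `DataFrame`, which is why the solution selects columns with a list inside the brackets. The following sketch, on a hypothetical toy frame, contrasts a list of column names with a single column name:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "release_year": [1994, 1972, 1974],
    "runtime": [142, 175, 202],
})
stats = toy.describe()

# A list of names keeps the result a DataFrame, even with one name.
as_frame = stats[["release_year", "runtime"]]
# A single name returns a Series instead.
as_series = stats["runtime"]

print(type(as_frame).__name__)
print(type(as_series).__name__)
```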


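## 125. Selected descriptive statistics for numeric columns

Define the function `describe_filtered_stats()` to return the descriptive statistics of the numeric columns of `movies.csv`, located at `/home/jovyan/data/internet-movie-database`, with the `25%` and `75%` quantile rows removed.

- Use the `describe_dataframes_numeric_columns()` function.
- Apply the row-filtering technique.
- Write the expected output after `return`.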

In [23]:
def describe_filtered_stats() -> pd.core.frame.DataFrame:
    """
    >>> filtered_stats = describe_filtered_stats()
    >>> type(filtered_stats)
    pandas.core.frame.DataFrame
    >>> filtered_stats
                   id  release_year     rating    runtime
    count  250.000000    250.000000  250.00000  250.00000
    mean   125.500000   1987.532000    8.30520  130.76800
    std     72.312977     24.873999    0.22138   33.60248
    min      1.000000   1921.000000    8.10000   45.00000
    50%    125.500000   1995.000000    8.20000  127.00000
    max    250.000000   2021.000000    9.30000  321.00000
    """
    ### BEGIN SOLUTION
    dataframes_numeric_columns = describe_dataframes_numeric_columns()
    return dataframes_numeric_columns.loc[["count", "mean", "std", "min", "50%", "max"], ["id", "release_year", "rating", "runtime"]]
    ### END SOLUTION

In [25]:
filtered_stats = describe_filtered_stats()
type(filtered_stats)
filtered_stats

Unnamed: 0,id,release_year,rating,runtime
count,250.0,250.0,250.0,250.0
mean,125.5,1987.532,8.3052,130.768
std,72.312977,24.873999,0.22138,33.60248
min,1.0,1921.0,8.1,45.0
50%,125.5,1995.0,8.2,127.0
max,250.0,2021.0,9.3,321.0
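
An alternative to listing the rows to keep is to name the rows to remove. This is only a sketch of that variation on a toy frame, not the approach used in the solution above; `DataFrame.drop()` with the `index` argument removes rows by label.

```python
import pandas as pd

# Toy runtimes taken from the first three movies.
stats = pd.DataFrame({"runtime": [142, 175, 202]}).describe()

# Remove the two quartile rows by their index labels.
filtered = stats.drop(index=["25%", "75%"])
print(list(filtered.index))
```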


## 126. Movies directed by Christopher Nolan

Define the function `filter_christopher_nolans_movies()` to select the movies directed by Christopher Nolan from `movies.csv`, located at `/home/jovyan/data/internet-movie-database`.

- Use the `import_movies_csv()` function.
- Apply the row-filtering technique.
- Write the expected output after `return`.

In [28]:
def filter_christopher_nolans_movies() -> pd.core.frame.DataFrame:
    """
    >>> christopher_nolans_movies = filter_christopher_nolans_movies()
    >>> type(christopher_nolans_movies)
    pandas.core.frame.DataFrame
    >>> christopher_nolans_movies
          id                  title  release_year  rating           director  \
    3      4        The Dark Knight          2008     9.0  Christopher Nolan   
    12    13              Inception          2010     8.8  Christopher Nolan   
    28    29           Interstellar          2014     8.6  Christopher Nolan   
    46    47           The Prestige          2006     8.5  Christopher Nolan   
    53    54                Memento          2000     8.4  Christopher Nolan   
    70    71  The Dark Knight Rises          2012     8.4  Christopher Nolan   
    126  127          Batman Begins          2005     8.2  Christopher Nolan   

         runtime  
    3        152  
    12       148  
    28       169  
    46       130  
    53       113  
    70       164  
    126      140
    """
    ### BEGIN SOLUTION
    movies_csv = import_movies_csv()
    return movies_csv.loc[movies_csv["director"] == "Christopher Nolan"]
    ### END SOLUTION

In [30]:
christopher_nolans_movies = filter_christopher_nolans_movies()
type(christopher_nolans_movies)
christopher_nolans_movies

Unnamed: 0,id,title,release_year,rating,director,runtime
3,4,The Dark Knight,2008,9.0,Christopher Nolan,152
12,13,Inception,2010,8.8,Christopher Nolan,148
28,29,Interstellar,2014,8.6,Christopher Nolan,169
46,47,The Prestige,2006,8.5,Christopher Nolan,130
53,54,Memento,2000,8.4,Christopher Nolan,113
70,71,The Dark Knight Rises,2012,8.4,Christopher Nolan,164
126,127,Batman Begins,2005,8.2,Christopher Nolan,140
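
Row filtering works in two steps: a comparison produces a boolean `Series`, and that `Series` then selects the rows marked `True`. A minimal sketch on a four-row toy frame (`toy` and `mask` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["The Godfather", "The Dark Knight", "12 Angry Men", "Inception"],
    "director": ["Francis Ford Coppola", "Christopher Nolan", "Sidney Lumet", "Christopher Nolan"],
})

# Step 1: element-wise comparison gives one boolean per row.
mask = toy["director"] == "Christopher Nolan"
print(mask.tolist())

# Step 2: the boolean Series keeps only the True rows; the original index is preserved.
nolan = toy.loc[mask]
print(nolan["title"].tolist())
print(nolan.index.tolist())
```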


## 127. Movies directed by Christopher Nolan or Steven Spielberg

Define the function `filter_christopher_nolans_steven_spielbergs_movies()` to select the movies directed by Christopher Nolan or Steven Spielberg from `movies.csv`, located at `/home/jovyan/data/internet-movie-database`.

- Use the `import_movies_csv()` function.
- Apply the row-filtering technique.
- Write the expected output after `return`.

In [95]:
def filter_christopher_nolans_steven_spielbergs_movies() -> pd.core.frame.DataFrame:
    """
    >>> christopher_nolans_steven_spielbergs_movies = filter_christopher_nolans_steven_spielbergs_movies()
    >>> type(christopher_nolans_steven_spielbergs_movies)
    pandas.core.frame.DataFrame
    >>> christopher_nolans_steven_spielbergs_movies
          id                                          title  release_year  rating  \
    3      4                                The Dark Knight          2008     9.0   
    5      6                               Schindler's List          1993     8.9   
    12    13                                      Inception          2010     8.8   
    25    26                            Saving Private Ryan          1998     8.6   
    28    29                                   Interstellar          2014     8.6   
    46    47                                   The Prestige          2006     8.5   
    53    54                                        Memento          2000     8.4   
    55    56  Indiana Jones and the Raiders of the Lost Ark          1981     8.4   
    70    71                          The Dark Knight Rises          2012     8.4   
    121  122             Indiana Jones and the Last Crusade          1989     8.2   
    126  127                                  Batman Begins          2005     8.2   
    163  164                                  Jurassic Park          1993     8.1   
    188  189                            Catch Me If You Can          2002     8.1   

                  director  runtime  
    3    Christopher Nolan      152  
    5     Steven Spielberg      195  
    12   Christopher Nolan      148  
    25    Steven Spielberg      169  
    28   Christopher Nolan      169  
    46   Christopher Nolan      130  
    53   Christopher Nolan      113  
    55    Steven Spielberg      115  
    70   Christopher Nolan      164  
    121   Steven Spielberg      127  
    126  Christopher Nolan      140  
    163   Steven Spielberg      127  
    188   Steven Spielberg      141
    """
    ### BEGIN SOLUTION
    movies_csv = import_movies_csv()
    condition_nolan = movies_csv["director"] == "Christopher Nolan"
    condition_spielberg = movies_csv["director"] == "Steven Spielberg"
    # Combine the two boolean Series with | (element-wise OR).
    output_dataframe = movies_csv[condition_nolan | condition_spielberg]
    return output_dataframe
    ### END SOLUTION

In [96]:
christopher_nolans_steven_spielbergs_movies = filter_christopher_nolans_steven_spielbergs_movies()
type(christopher_nolans_steven_spielbergs_movies)
christopher_nolans_steven_spielbergs_movies

Unnamed: 0,id,title,release_year,rating,director,runtime
3,4,The Dark Knight,2008,9.0,Christopher Nolan,152
5,6,Schindler's List,1993,8.9,Steven Spielberg,195
12,13,Inception,2010,8.8,Christopher Nolan,148
25,26,Saving Private Ryan,1998,8.6,Steven Spielberg,169
28,29,Interstellar,2014,8.6,Christopher Nolan,169
46,47,The Prestige,2006,8.5,Christopher Nolan,130
53,54,Memento,2000,8.4,Christopher Nolan,113
55,56,Indiana Jones and the Raiders of the Lost Ark,1981,8.4,Steven Spielberg,115
70,71,The Dark Knight Rises,2012,8.4,Christopher Nolan,164
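121,122,Indiana Jones and the Last Crusade,1989,8.2,Steven Spielberg,127
126,127,Batman Begins,2005,8.2,Christopher Nolan,140
163,164,Jurassic Park,1993,8.1,Steven Spielberg,127
188,189,Catch Me If You Can,2002,8.1,Steven Spielberg,141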
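
When the list of directors grows, chaining conditions with `|` becomes verbose. As a sketch of an alternative that is not used in the solution above, `Series.isin()` tests membership in a list and gives the same rows on this toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["The Godfather", "The Dark Knight", "Schindler's List", "12 Angry Men", "Inception"],
    "director": ["Francis Ford Coppola", "Christopher Nolan", "Steven Spielberg",
                 "Sidney Lumet", "Christopher Nolan"],
})

# Variant 1: two conditions combined with element-wise OR.
with_or = toy[(toy["director"] == "Christopher Nolan") | (toy["director"] == "Steven Spielberg")]

# Variant 2: one membership test against a list of names.
directors = ["Christopher Nolan", "Steven Spielberg"]
with_isin = toy[toy["director"].isin(directors)]

print(with_isin["title"].tolist())
print(with_or.equals(with_isin))
```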


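## 128. Number of movies per year

Define the function `count_movies_by_years()` to group the 250 movies in `movies.csv`, located at `/home/jovyan/data/internet-movie-database`, by release year and count them.

- Use the `import_movies_csv()` function.
- Apply grouped aggregation.
- Write the expected output after `return`.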

In [60]:
def count_movies_by_years() -> pd.core.series.Series:
    """
    >>> movies_by_years = count_movies_by_years()
    >>> type(movies_by_years)
    pandas.core.series.Series
    >>> movies_by_years
    release_year
    1921    1
    1924    1
    1925    1
    1926    1
    1927    1
           ..
    2017    3
    2018    6
    2019    8
    2020    2
    2021    1
    Name: title, Length: 85, dtype: int64
    """
    ### BEGIN SOLUTION
    movies_csv = import_movies_csv()
    return movies_csv.groupby("release_year")["title"].count()
    ### END SOLUTION

In [61]:
movies_by_years = count_movies_by_years()
type(movies_by_years)
movies_by_years

release_year
1921    1
1924    1
1925    1
1926    1
1927    1
       ..
2017    3
2018    6
2019    8
2020    2
2021    1
Name: title, Length: 85, dtype: int64
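
One detail of grouped counting is worth a sketch: `count()` counts non-missing values in the chosen column, while `size()` counts rows. The toy frame below is hypothetical and contains one missing title on purpose, which `movies.csv` is not assumed to have.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", None, "d", "e", "f"],
    "release_year": [1994, 1972, 1994, 2008, 1972, 1994],
})

grouped = toy.groupby("release_year")

# count() skips the missing title in 1994.
by_count = grouped["title"].count()
# size() counts every row in each group.
by_size = grouped.size()

print(by_count.index.tolist())
print(by_count.tolist())
print(by_size.tolist())
```

With no missing titles, as the count of 250 in exercise 123 suggests for `movies.csv`, both approaches give the same numbers.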

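## 129. Directors with at least five movies

Define the function `count_movies_by_directors()` to group the 250 movies in `movies.csv`, located at `/home/jovyan/data/internet-movie-database`, by director and count them, keep only the counts greater than or equal to 5, and sort by number of movies in descending order.

- Use the `import_movies_csv()` function.
- Apply the row-filtering technique.
- Apply grouped aggregation.
- Apply the sorting technique.
- Write the expected output after `return`.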

In [76]:
def count_movies_by_directors() -> pd.core.series.Series:
    """
    >>> movies_by_directors = count_movies_by_directors()
    >>> type(movies_by_directors)
    pandas.core.series.Series
    >>> movies_by_directors
    director
    Christopher Nolan    7
    Martin Scorsese      7
    Stanley Kubrick      7
    Akira Kurosawa       6
    Alfred Hitchcock     6
    Steven Spielberg     6
    Billy Wilder         5
    Charles Chaplin      5
    Hayao Miyazaki       5
    Quentin Tarantino    5
    Name: title, dtype: int64
    """
    ### BEGIN SOLUTION
    movies_csv = import_movies_csv()
    movies_count_by_directors = movies_csv.groupby("director")["title"].count()
    filtered_series = movies_count_by_directors[movies_count_by_directors >= 5]
    return filtered_series.sort_values(ascending=False)
    ### END SOLUTION

In [77]:
movies_by_directors = count_movies_by_directors()
type(movies_by_directors)
movies_by_directors

director
Christopher Nolan    7
Martin Scorsese      7
Stanley Kubrick      7
Akira Kurosawa       6
Alfred Hitchcock     6
Steven Spielberg     6
Billy Wilder         5
Charles Chaplin      5
Hayao Miyazaki       5
Quentin Tarantino    5
Name: title, dtype: int64
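
The three steps (group and count, filter, sort) can be traced on a small example. The sketch below uses hypothetical director labels and a threshold of 2 instead of 5 so that each intermediate result is easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "director": ["C", "A", "B", "A", "C", "A"],
})

# Step 1: count titles per director (groups come back sorted by label).
counts = toy.groupby("director")["title"].count()
# Step 2: a boolean condition on the counts filters the Series itself.
at_least_two = counts[counts >= 2]
# Step 3: sort the remaining counts from largest to smallest.
result = at_least_two.sort_values(ascending=False)

print(counts.tolist())
print(result.index.tolist())
print(result.tolist())
```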

## 130. Movie runtime in hours and minutes

Define the function `calculate_movie_hours_and_minutes()` to derive two columns, `runtime_hours` and `runtime_minutes`, from the `runtime` column of `movies.csv`, located at `/home/jovyan/data/internet-movie-database`.

- Use the `import_movies_csv()` function.
- Apply derived columns.
- Write the expected output after `return`.

In [80]:
def calculate_movie_hours_and_minutes() -> pd.core.frame.DataFrame:
    """
    >>> movie_hours_and_minutes = calculate_movie_hours_and_minutes()
    >>> type(movie_hours_and_minutes)
    pandas.core.frame.DataFrame
    >>> movie_hours_and_minutes
                                                  title  runtime_hours  \
    0                          The Shawshank Redemption              2   
    1                                     The Godfather              2   
    2                            The Godfather: Part II              3   
    3                                   The Dark Knight              2   
    4                                      12 Angry Men              1   
    ..                                              ...            ...   
    245  Neon Genesis Evangelion: The End of Evangelion              1   
    246                              7 Kogustaki Mucize              2   
    247                                      Tangerines              1   
    248                                        Drishyam              2   
    249                                          Swades              3   

         runtime_minutes  
    0                 22  
    1                 55  
    2                 22  
    3                 32  
    4                 36  
    ..               ...  
    245               27  
    246               12  
    247               27  
    248               43  
    249                9  

    [250 rows x 3 columns]
    """
    ### BEGIN SOLUTION
    movies_csv = import_movies_csv()
    runtime_hours = movies_csv["runtime"] // 60
    runtime_minutes = movies_csv["runtime"] % 60
    output_dataframe = pd.DataFrame()
    output_dataframe["title"] = movies_csv["title"]
    output_dataframe["runtime_hours"] = runtime_hours
    output_dataframe["runtime_minutes"] = runtime_minutes
    return output_dataframe
    ### END SOLUTION
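
The floor division and remainder used above can also be computed in one step. The following is a sketch of that variation on a toy frame, assuming the same column names; it relies on Python's built-in `divmod()`, which pandas `Series` support, and on `DataFrame.assign()` to add the derived columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II"],
    "runtime": [142, 175, 202],
})

# divmod returns the quotient and the remainder as two Series.
hours, minutes = divmod(toy["runtime"], 60)

# Keep only the title column and attach the two derived columns.
result = toy[["title"]].assign(runtime_hours=hours, runtime_minutes=minutes)

print(result["runtime_hours"].tolist())
print(result["runtime_minutes"].tolist())
```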

In [82]:
movie_hours_and_minutes = calculate_movie_hours_and_minutes()
type(movie_hours_and_minutes)
movie_hours_and_minutes

Unnamed: 0,title,runtime_hours,runtime_minutes
0,The Shawshank Redemption,2,22
1,The Godfather,2,55
2,The Godfather: Part II,3,22
3,The Dark Knight,2,32
4,12 Angry Men,1,36
...,...,...,...
245,Neon Genesis Evangelion: The End of Evangelion,1,27
246,7 Kogustaki Mucize,2,12
247,Tangerines,1,27
248,Drishyam,2,43
