# Data Loading

[數據交點](https://www.datainpoint.com) | 郭耀仁 <yaojenkuo@datainpoint.com>

In [1]:
import json
import sqlite3
import pandas as pd

## Review: The Data Science Project Workflow

![](images/data-science-project.png)

Source: <https://r4ds.had.co.nz/introduction.html>

## The Import Stage

Load data from common sources into the Python programming environment.

## Common Sources

- Plain-text files
    - `.txt`
    - `.json`
    - `.csv`
- Tables in a relational database
- Spreadsheets
- Web data

## Loading a Plain-Text File: `.txt`

- Open the file with the built-in `open()` function.
- Load its contents with the file object's `readlines()` method.

In [2]:
file_path = "data/the_shawshank_redemption_summaries.txt"
with open(file_path) as file:
    summaries = file.readlines()
type(summaries)

list

In [3]:
for summary in summaries:
    print(summary.strip())

Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
Chronicles the experiences of a formerly successful banker as a prisoner in the gloomy jailhouse of Shawshank after being found guilty of a crime he did not commit. The film portrays the man's unique way of dealing with his new, torturous life; along the way he befriends a number of fellow prisoners, most notably a wise long-term inmate named Red.
After the murder of his wife, hotshot banker Andrew Dufresne is sent to Shawshank Prison, where the usual unpleasantness occurs. Over the years, he retains hope and eventually gains the respect of his fellow inmates, especially longtime convict "Red" Redding, a black marketeer, and becomes influential within the prison. Eventually, Andrew achieves his ends on his own terms.
Andy Dufresne is sent to Shawshank Prison for the murder of his wife and her secret lover. He is very isolated and lonely at first, but realizes there is 
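
To make the behavior of `readlines()` more concrete, here is a minimal sketch that compares it with `read()` and with iterating over the file object directly. The sample text and the temporary file are made up so the example runs on its own, and `encoding="utf-8"` is an assumption about how the file was saved.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
sample_text = "first summary\nsecond summary\nthird summary\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

# readlines() returns a list of str, each keeping its trailing newline.
with open(sample_path, encoding="utf-8") as file:
    lines = file.readlines()

# read() returns the whole file as a single str.
with open(sample_path, encoding="utf-8") as file:
    whole_text = file.read()

# Iterating over the file object yields one line at a time;
# here each line is stripped as it is read.
with open(sample_path, encoding="utf-8") as file:
    stripped_lines = [line.strip() for line in file]

os.remove(sample_path)  # clean up the temporary file

print(lines)
print(stripped_lines)
```

Passing `encoding` explicitly is a design choice rather than a requirement: without it, `open()` falls back to a platform-dependent default, which can differ between machines.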

## Loading a Plain-Text File: `.json`

- Open the file with the built-in `open()` function.
- Load its contents with the `load()` function from the `json` module.

In [4]:
file_path = "data/imdb_top_rated_movies.json"
with open(file_path) as file:
    imdb_top_rated_movies = json.load(file)
type(imdb_top_rated_movies)

list

In [5]:
for movie in imdb_top_rated_movies[:10]:
    print(movie)

{'rank': 1, 'title': 'The Shawshank Redemption', 'year': 1994, 'rating': 9.2}
{'rank': 2, 'title': 'The Godfather', 'year': 1972, 'rating': 9.1}
{'rank': 3, 'title': 'The Godfather: Part II', 'year': 1974, 'rating': 9.0}
{'rank': 4, 'title': 'The Dark Knight', 'year': 2008, 'rating': 9.0}
{'rank': 5, 'title': '12 Angry Men', 'year': 1957, 'rating': 8.9}
{'rank': 6, 'title': "Schindler's List", 'year': 1993, 'rating': 8.9}
{'rank': 7, 'title': 'The Lord of the Rings: The Return of the King', 'year': 2003, 'rating': 8.9}
{'rank': 8, 'title': 'Pulp Fiction', 'year': 1994, 'rating': 8.8}
{'rank': 9, 'title': 'The Good, the Bad and the Ugly', 'year': 1966, 'rating': 8.8}
{'rank': 10, 'title': 'The Lord of the Rings: The Fellowship of the Ring', 'year': 2001, 'rating': 8.8}
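
The mapping from JSON to Python types can be sketched without a file by using `json.loads()`, which parses a string instead of a file object. The string below is an assumption built from three of the records printed above; it is not the actual content of `data/imdb_top_rated_movies.json`.

```python
import json

# A JSON array of objects, using three records shown in the output above.
json_text = """
[
  {"rank": 1, "title": "The Shawshank Redemption", "year": 1994, "rating": 9.2},
  {"rank": 2, "title": "The Godfather", "year": 1972, "rating": 9.1},
  {"rank": 8, "title": "Pulp Fiction", "year": 1994, "rating": 8.8}
]
"""

# json.loads() parses a str; json.load() parses a file object.
movies = json.loads(json_text)

# A JSON array becomes a list and each JSON object becomes a dict,
# so ordinary list and dict operations apply.
titles_of_1994 = [movie["title"] for movie in movies if movie["year"] == 1994]

print(type(movies), type(movies[0]))
print(titles_of_1994)
```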


## Loading a Plain-Text File: `.csv`

Use the `read_csv()` function from the `pandas` module.

In [6]:
file_path = "data/imdb_top_rated_movies.csv"
imdb_top_rated_movies = pd.read_csv(file_path)
type(imdb_top_rated_movies)

pandas.core.frame.DataFrame

In [7]:
imdb_top_rated_movies.head()

| | rank | title | year | rating |
|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | 9.2 |
| 1 | 2 | The Godfather | 1972 | 9.1 |
| 2 | 3 | The Godfather: Part II | 1974 | 9.0 |
| 3 | 4 | The Dark Knight | 2008 | 9.0 |
| 4 | 5 | 12 Angry Men | 1957 | 8.9 |
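
`read_csv()` takes a number of optional parameters that shape the resulting `DataFrame`. The following sketch shows two of them, `index_col` and `usecols`, on an in-memory string that stands in for a CSV file. The string repeats three rows from the output above and is an assumption about the file layout, not a copy of `data/imdb_top_rated_movies.csv`.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file; the rows repeat the output above.
csv_text = """rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.1
3,The Godfather: Part II,1974,9.0
"""

# read_csv() accepts a path or a file-like object such as StringIO.
movies = pd.read_csv(io.StringIO(csv_text))

# index_col uses an existing column as the row index instead of 0, 1, 2, ...
movies_by_rank = pd.read_csv(io.StringIO(csv_text), index_col="rank")

# usecols loads only the listed columns.
titles_and_ratings = pd.read_csv(io.StringIO(csv_text), usecols=["title", "rating"])

print(movies.shape)
print(movies_by_rank.loc[1, "title"])
print(list(titles_and_ratings.columns))
```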


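## Loading a Table from a Relational Database

- Create a connection with the `connect()` function from the `sqlite3` module.
- Load the query result with the `read_sql()` function from the `pandas` module.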

In [8]:
db_path = "data/imdb.db"
conn = sqlite3.connect(db_path)
sql_query = """
SELECT * 
  FROM movies
 WHERE release_year = 1994;
"""
movies_of_1994 = pd.read_sql(sql_query, conn)
type(movies_of_1994)

pandas.core.frame.DataFrame

In [9]:
movies_of_1994

| | id | title | release_year | rating | director | runtime |
|---|---|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | 9.3 | Frank Darabont | 142 |
| 1 | 8 | Pulp Fiction | 1994 | 8.9 | Quentin Tarantino | 154 |
| 2 | 12 | Forrest Gump | 1994 | 8.8 | Robert Zemeckis | 142 |
| 3 | 31 | Léon: The Professional | 1994 | 8.5 | Luc Besson | 110 |
| 4 | 34 | The Lion King | 1994 | 8.5 | Roger Allers | 88 |
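
When the filter value comes from a variable, one option is to pass it through the `params` argument of `read_sql()` rather than formatting it into the SQL string. The sketch below assumes an in-memory SQLite database with a simplified `movies` table; its layout is a guess based on the columns shown above, and the last inserted row is hypothetical, added only so the filter has something to exclude.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for data/imdb.db; the table layout is
# a simplified assumption based on the columns in the output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (id INTEGER, title TEXT, release_year INTEGER, rating REAL)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        (1, "The Shawshank Redemption", 1994, 9.3),
        (8, "Pulp Fiction", 1994, 8.9),
        (999, "Hypothetical Movie", 2000, 7.0),  # made-up row for illustration
    ],
)

# The ? placeholder is filled from params, so the year is passed as data
# rather than formatted into the SQL string.
sql_query = """
SELECT title, rating
  FROM movies
 WHERE release_year = ?
 ORDER BY id;
"""
movies_of_1994 = pd.read_sql(sql_query, conn, params=(1994,))
conn.close()  # release the connection once the data is in a DataFrame

print(movies_of_1994)
```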


## Loading a Spreadsheet

Load the file with the `read_excel()` function from the `pandas` module.

In [10]:
file_path = "data/公投在各投開票所得票數一覽表-EXCEL格式/表5-100(臺北市)-公投在各投開票所得票數一覽表.xls"
referendum_2018_taipei_city = pd.read_excel(file_path)
type(referendum_2018_taipei_city)

pandas.core.frame.DataFrame

In [11]:
referendum_2018_taipei_city.head()

Unnamed: 0,107年全國性公民投票案第７案在臺北市各投開票所得票數一覽表,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,鄉(鎮、市、區)別,村里別,投開票所別,得票情形,,A\n有效票數\n\nA=1+2+...+N,B\n無效票數\n\n,C\n投票數\n\nC=A+B,D\n已領未投\n票 數\nD=E-C,E\n發出票數\n\nE=C+D,F\n用餘票數\n\n,G\n投票權人數\n\nG=E+F,H\n投票率\nH=C/G\n(%)
1,,,,1\n同意\n,2\n不同意\n,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,總　計,,,924026,277072,1201098,77356,1278454,775,1279229,949201,2228430,57.369999


## Loading Web Data

Use Python's web-scraping modules:
- `requests`
- `xml`
- `BeautifulSoup4`
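
As a rough illustration of the parsing step, the sketch below uses `xml.etree.ElementTree` from the standard library's `xml` package to pull values out of markup. The markup is a made-up, well-formed snippet standing in for a downloaded page; in practice the text would typically come from something like `requests.get(url).text`, and real pages are often not well-formed XML, which is where a more tolerant parser such as BeautifulSoup tends to be used instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
page_source = """
<html>
  <body>
    <table id="movies">
      <tr><td class="title">The Shawshank Redemption</td><td class="year">1994</td></tr>
      <tr><td class="title">The Godfather</td><td class="year">1972</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(page_source.strip())

movies = []
# ElementTree supports a limited XPath subset, including attribute predicates.
for row in root.findall(".//table[@id='movies']/tr"):
    title = row.find("td[@class='title']").text
    year = int(row.find("td[@class='year']").text)
    movies.append({"title": title, "year": year})

print(movies)
```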