# Extracting data from a CSV file
In this first exercise, you'll practice extracting data from a CSV file and then navigating through the results. You'll see that extracting data is not always a straightforward process.
## Main tools
[pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
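> - dtype: specifies the data type of each column; used here when a [DtypeWarning](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html) appears.
> - skiprows: skips the given number of lines at the start of the file; used here when the file's layout is messy.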

### Scenario 1 - DtypeWarning

In [1]:
# read in the projects_data.csv file
import pandas as pd
df_projects = pd.read_csv('../data/projects_data.csv')



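A DtypeWarning is raised when automatic data type inference runs into trouble, that is, when some columns in messy data contain values that could be read as more than one data type.

As a simple first fix, we use the dtype option to read every column as a string.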

In [2]:
df_projects = pd.read_csv('../data/projects_data.csv', dtype=str)
df_projects.head(3)

Unnamed: 0,id,regionname,countryname,prodline,lendinginstr,lendinginstrtype,envassesmentcategorycode,supplementprojectflg,productlinetype,projectstatusdisplay,...,mjtheme3name,mjtheme4name,mjtheme5name,location,GeoLocID,GeoLocName,Latitude,Longitude,Country,Unnamed: 56
0,P162228,Other,World;World,RE,Investment Project Financing,IN,C,N,L,Active,...,,,,,,,,,,
1,P163962,Africa,Democratic Republic of the Congo;Democratic Re...,PE,Investment Project Financing,IN,B,N,L,Active,...,,,,,,,,,,
2,P167672,South Asia,People's Republic of Bangladesh;People's Repub...,PE,Investment Project Financing,IN,,Y,L,Active,...,,,,,,,,,,
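
Reading every column as a string is the bluntest fix. A narrower alternative is to pass a dict to `dtype` so that only the ambiguous column is forced to `str` and the rest are still inferred. The sketch below uses a small made-up CSV held in memory; the column names `code` and `amount` are hypothetical and do not come from projects_data.csv.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: 'code' mixes digit-only and alphanumeric values
csv_text = "id,code,amount\nP1,007,10\nP2,A12,20\nP3,042,30\n"

# Option used in the exercise: every column is read as a string
df_all_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Narrower option: pin only the ambiguous column, let pandas infer the others
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

print(df_all_str["amount"].tolist())  # values are strings here
print(df_typed["code"].tolist())      # leading zeros are preserved
print(df_typed["amount"].sum())       # 'amount' stays numeric, so it can be summed
```

With `dtype=str`, numeric columns have to be converted back before any arithmetic; with the dict form they stay numeric from the start.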


### Scenario 2 - format issue

In [3]:
# read in the population_data.csv file
try:
    df_population = pd.read_csv('../data/population_data.csv')
except Exception as e:
    print(e)

Error tokenizing data. C error: Expected 3 fields in line 5, saw 63
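
The message suggests that the parser took the expected number of fields from the first line and later met a much wider line. A minimal sketch with a made-up in-memory file (an assumption for illustration, not the real population_data.csv) reproduces the same kind of failure:

```python
import io
import pandas as pd

# Hypothetical sample: short metadata lines, blank lines, then a wider header
sample = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2018-06-28",\n'
    '\n'
    '"Country Name","Country Code","1960","1961",\n'
)

try:
    pd.read_csv(io.StringIO(sample))
    message = None
except pd.errors.ParserError as e:
    # The first line fixes the expected width; the wider line then breaks parsing
    message = str(e)

print(message)
```

The trailing comma on each line counts as one extra (empty) field, which is why a two-value metadata line is reported as 3 fields.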



Reading population_data.csv fails with the error `Expected 3 fields in line 5, saw 63`. Let's go through the file line by line to see what the problem is.

In [4]:
with open('../data/population_data.csv', encoding="utf-8") as f:
    for i in range(6):
        line = f.readline()
        print(f"line {i}: {line}")

line 0: ﻿"Data Source","World Development Indicators",

line 1: 

line 2: "Last Updated Date","2018-06-28",

line 3: 

line 4: "Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962","1963","1964","1965","1966","1967","1968","1969","1970","1971","1972","1973","1974","1975","1976","1977","1978","1979","1980","1981","1982","1983","1984","1985","1986","1987","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016","2017",

line 5: "Aruba","ABW","Population, total","SP.POP.TOTL","54211","55438","56225","56695","57032","57360","57715","58055","58386","58726","59063","59440","59840","60243","60528","60657","60586","60366","60103","59980","60096","60567","61345","62201","62836","63026","62644","61833","61079","61032","62149","64622","68235","72504","76700","80324","83200","85451","87277","89005","90853","92898","94992","

The first four lines have a different format and contain no data, so we can skip them with the skiprows option.

In [5]:
df_population = pd.read_csv('../data/population_data.csv', skiprows=4)
df_population.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,Unnamed: 62
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0,
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,...,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0,
2,Angola,AGO,"Population, total",SP.POP.TOTL,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,...,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0,
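
Hard-coding `skiprows=4` works because we inspected this file by hand. When the number of preamble lines is not known in advance, one possible approach is to scan for the header line first and pass its index to `skiprows`. The sketch below is not part of the original exercise: `find_header_row` is a hypothetical helper, and the sample text is a made-up, shortened imitation of the file's layout.

```python
import csv
import io
import pandas as pd

# Hypothetical sample imitating the layout: metadata, blank lines, header, data
sample = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2018-06-28",\n'
    '\n'
    '"Country Name","Country Code","1960","1961",\n'
    '"Aruba","ABW","54211","55438",\n'
)

def find_header_row(text, marker="Country Name"):
    """Return the 0-based index of the first line whose first field equals marker."""
    for i, row in enumerate(csv.reader(io.StringIO(text))):
        # Blank lines come back as empty lists, so guard before indexing
        if row and row[0] == marker:
            return i
    raise ValueError(f"no line starts with {marker!r}")

header_row = find_header_row(sample)

# Skip everything before the header line, then parse as usual
df_sample = pd.read_csv(io.StringIO(sample), skiprows=header_row)
print(header_row)
print(df_sample.columns.tolist())
```

This assumes no quoted field spans several lines before the header; otherwise the row index from `csv.reader` would no longer match the physical line number that `skiprows` counts.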
