## Lesson 06 — Pandas Part I: Basic operations

In this lesson we will cover some basic features of [Pandas](https://pandas.pydata.org).

We will also revisit some commands we have already seen and look at them in more depth.

### Readings

* [_Data Loading, Storage, and File Formats_, by Wes McKinney](https://wesmckinney.com/book/accessing-data)

### Table of Contents

* [Series](#Series)
* [DataFrame](#DataFrame)
* [index, columns](#index,-columns)
* [dtypes, info, describe](#dtypes,-info,-describe)
* [read_csv](#read_csv)
* [head, tail](#head,-tail)
* [Indexing with bracket/dot notation, loc, iloc](#Indexing-with-bracket/dot-notation,-loc,-iloc)
* [transpose](#transpose)
* [to_csv, to_excel](#to_csv,-to_excel)
* [read_csv, read_excel](#read_csv,-read_excel)
* [to_datetime](#to_datetime)

In [1]:
import pandas as pd
import numpy as np

### Series

In [2]:
# a list of strings
my_list = ["cubs", "pirates", "giants", "yankees", "donkeys"]
my_list

['cubs', 'pirates', 'giants', 'yankees', 'donkeys']

In [3]:
# pandas Series from list
series_from_list = pd.Series(my_list)
series_from_list

0       cubs
1    pirates
2     giants
3    yankees
4    donkeys
dtype: object

In [4]:
# indexing a Series is similar to lists and arrays
series_from_list[3]

'yankees'

In [5]:
# a numpy array
my_array = np.random.rand(5)
my_array

array([0.71199003, 0.84722558, 0.7979196 , 0.55955179, 0.01247146])

In [6]:
# pandas Series from array
series_from_array = pd.Series(my_array)
series_from_array

0    0.711990
1    0.847226
2    0.797920
3    0.559552
4    0.012471
dtype: float64

In [7]:
# indexing supports lists
series_from_array[[1, 3]]

1    0.847226
3    0.559552
dtype: float64

In [8]:
# indexing supports slices
series_from_array[3:]

3    0.559552
4    0.012471
dtype: float64

In [9]:
series_from_array[1]

np.float64(0.8472255765775961)
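
The Series above all use the default integer index, so labels and positions coincide. As a minimal sketch (the `teams` Series and its string index are made up for illustration), this is how lookups behave once the index holds labels instead:

```python
import pandas as pd

# hypothetical example data with a custom string index
teams = pd.Series(["cubs", "pirates", "giants"], index=["a", "b", "c"])

by_label = teams["b"]        # lookup by index label
by_position = teams.iloc[1]  # lookup by integer position
subset = teams[["a", "c"]]   # a list of labels returns a Series
```

Both `by_label` and `by_position` refer to the same element here; `.iloc` is covered in more detail later in the lesson.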

### DataFrame

#### 2D array to DataFrame

In [10]:
# create a 2D numpy array
my_2d_array = np.random.randn(5, 5)
my_2d_array

array([[ 1.75910465,  0.48069729, -1.18330588,  0.27054652, -1.00816393],
       [-0.83000577,  0.60812553, -1.51283055, -0.75962897,  0.22833705],
       [ 1.70389594,  1.34781287,  1.02956741, -0.63365223,  1.64396833],
       [ 0.55038891, -0.4744064 ,  0.33078299, -0.99425711, -1.42414178],
       [ 0.72201706,  1.52267066,  0.1445728 , -0.63848955, -0.17425766]])

In [11]:
# make a DataFrame from the 2D numpy array
pd.DataFrame(my_2d_array)

Unnamed: 0,0,1,2,3,4
0,1.759105,0.480697,-1.183306,0.270547,-1.008164
1,-0.830006,0.608126,-1.512831,-0.759629,0.228337
2,1.703896,1.347813,1.029567,-0.633652,1.643968
3,0.550389,-0.474406,0.330783,-0.994257,-1.424142
4,0.722017,1.522671,0.144573,-0.63849,-0.174258


In [12]:
# we can set the index and column labels when we create the DataFrame
# note that we can combine positional arguments (data) and keyword arguments
# (index, columns) as long as positional arguments come first
df_from_2d_array = pd.DataFrame(
    my_2d_array,
    index=["row1", "row2", "row3", "row4", "row5"],
    columns=["col1", "col2", "col3", "col4", "col5"],
)
df_from_2d_array

Unnamed: 0,col1,col2,col3,col4,col5
row1,1.759105,0.480697,-1.183306,0.270547,-1.008164
row2,-0.830006,0.608126,-1.512831,-0.759629,0.228337
row3,1.703896,1.347813,1.029567,-0.633652,1.643968
row4,0.550389,-0.474406,0.330783,-0.994257,-1.424142
row5,0.722017,1.522671,0.144573,-0.63849,-0.174258


In [13]:
# also note that new lines do not affect the way that the dataframe is generated.
# The following example WILL NOT generate a dataframe with two rows
mylist = [
    0,
    2,
    4,
    6,
    8,
    10,
]

In [14]:
mylist

[0, 2, 4, 6, 8, 10]

In [15]:
pd.DataFrame(mylist)

Unnamed: 0,0
0,0
1,2
2,4
3,6
4,8
5,10


In [16]:
# But the following example (list inside of a list) will:
my_other_list = [
    [0, 2, 4],
    [6, 8, 10],
]
pd.DataFrame(my_other_list)

Unnamed: 0,0,1,2
0,0,2,4
1,6,8,10
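
The difference between the two cells above comes down to nesting: a flat list becomes one column, while each inner list of a nested list becomes one row. A small sketch to check the resulting shapes, with `reshape` shown as one possible way (an assumption, not part of the lesson) to turn the flat list into two rows:

```python
import numpy as np
import pandas as pd

flat = [0, 2, 4, 6, 8, 10]

df_flat = pd.DataFrame(flat)                               # one column, six rows
df_nested = pd.DataFrame([[0, 2, 4], [6, 8, 10]])          # two rows, three columns
df_reshaped = pd.DataFrame(np.array(flat).reshape(2, 3))   # same layout from the flat list

shapes = (df_flat.shape, df_nested.shape, df_reshaped.shape)
```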


#### Converting multiple Series to a DataFrame

In [17]:
# method 1: getting data as a list of series will orient them as rows
# this is typically not ideal
x = pd.DataFrame(data=[series_from_list, series_from_array])
x

Unnamed: 0,0,1,2,3,4
0,cubs,pirates,giants,yankees,donkeys
1,0.71199,0.847226,0.79792,0.559552,0.012471


In [18]:
# note that each column mixes a string and a float, so its dtype is object
x.dtypes

0    object
1    object
2    object
3    object
4    object
dtype: object

In [19]:
# in this example, we need to transpose the table - we'll see this again later in the lesson
x = x.transpose()
x

Unnamed: 0,0,1
0,cubs,0.71199
1,pirates,0.847226
2,giants,0.79792
3,yankees,0.559552
4,donkeys,0.012471


In [20]:
# the transposed columns have dtype object - we'll see how to fix this below
x.dtypes

0    object
1    object
dtype: object
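
One possible way to recover a numeric dtype after the transpose is `pd.to_numeric` on the affected column. The sketch below uses small made-up Series (`names`, `values`) rather than the random data above:

```python
import pandas as pd

# hypothetical stand-ins for series_from_list and series_from_array
names = pd.Series(["cubs", "pirates", "giants"])
values = pd.Series([0.71, 0.85, 0.80])

# a list of Series becomes rows; transposing yields two object columns
x_demo = pd.DataFrame([names, values]).transpose()
dtype_before = x_demo[1].dtype

# convert the second column back to a numeric dtype
x_demo[1] = pd.to_numeric(x_demo[1])
dtype_after = x_demo[1].dtype
```

After the conversion, column `1` is `float64` again, while column `0` keeps holding strings.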

In [21]:
# method 2: pass list/Series as value of dictionary
# with this method, we do not have to define the column names with an extra argument
y = pd.DataFrame({"a": series_from_list, "b": series_from_array})
y

Unnamed: 0,a,b
0,cubs,0.71199
1,pirates,0.847226
2,giants,0.79792
3,yankees,0.559552
4,donkeys,0.012471


In [22]:
# importing the data as columns gives us the correct dtypes
y.dtypes

a     object
b    float64
dtype: object

In [23]:
# method 3: use pd.concat to combine series in column orientation
df = pd.concat([series_from_list, series_from_array], axis=1)
df

Unnamed: 0,0,1
0,cubs,0.71199
1,pirates,0.847226
2,giants,0.79792
3,yankees,0.559552
4,donkeys,0.012471


In [24]:
# again, importing the data as columns gives us the correct dtypes
df.dtypes

0     object
1    float64
dtype: object

In [25]:
# as a curiosity, you can try concat with axis=0 (the result is a single Series)
df_axis0 = pd.concat([series_from_list, series_from_array], axis=0)
df_axis0

0        cubs
1     pirates
2      giants
3     yankees
4     donkeys
0     0.71199
1    0.847226
2     0.79792
3    0.559552
4    0.012471
dtype: object
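
Note that the stacked result keeps the original index labels, so `0` to `4` appear twice. A minimal sketch, using two short made-up Series, of how `ignore_index=True` renumbers the result:

```python
import pandas as pd

# hypothetical short Series for illustration
s1 = pd.Series(["cubs", "pirates"])
s2 = pd.Series([0.71, 0.85])

stacked = pd.concat([s1, s2], axis=0)                         # index 0, 1, 0, 1
renumbered = pd.concat([s1, s2], axis=0, ignore_index=True)   # index 0, 1, 2, 3

# with duplicate labels, a label lookup returns every match
matches_for_zero = stacked.loc[0]
```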

### index, columns

In [26]:
# numeric indexes
df.index

RangeIndex(start=0, stop=5, step=1)

In [27]:
# numeric column names
df.columns

RangeIndex(start=0, stop=2, step=1)

In [28]:
# rename columns and indexes - method 1
# set the index and column names on an existing DataFrame
df.index = ["a", "b", "c", "d", "e"]
df.columns = ["team", "random"]
df

Unnamed: 0,team,random
a,cubs,0.71199
b,pirates,0.847226
c,giants,0.79792
d,yankees,0.559552
e,donkeys,0.012471


In [29]:
# label (object) indexes
df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [30]:
# label (object) column names
df.columns

Index(['team', 'random'], dtype='object')

In [31]:
# add a new column to the end of a DataFrame
df["integers"] = [2, 3, 5, 8, 13]
df

Unnamed: 0,team,random,integers
a,cubs,0.71199,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [32]:
# add a new column at a specific position
df.insert(0, "integers2", [2, 3, 5, 8, 13])
df

Unnamed: 0,integers2,team,random,integers
a,2,cubs,0.71199,2
b,3,pirates,0.847226,3
c,5,giants,0.79792,5
d,8,yankees,0.559552,8
e,13,donkeys,0.012471,13


In [33]:
# delete a column
df.drop("integers2", axis=1, inplace=True)
df

Unnamed: 0,team,random,integers
a,cubs,0.71199,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [34]:
# rename columns - method 2
# using dictionary mapping old names to new names
df.rename(columns={"integers": "fibonacci"}, inplace=True)
df

Unnamed: 0,team,random,fibonacci
a,cubs,0.71199,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [35]:
# reorder the columns by passing a list of columns in the desired order
df[["fibonacci", "team", "random"]]

Unnamed: 0,fibonacci,team,random
a,2,cubs,0.71199
b,3,pirates,0.847226
c,5,giants,0.79792
d,8,yankees,0.559552
e,13,donkeys,0.012471
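
Selecting columns in a new order returns a new DataFrame; the original keeps its column order unless the result is assigned back. A small sketch with a made-up `df_demo`:

```python
import pandas as pd

# hypothetical example data
df_demo = pd.DataFrame(
    {"team": ["cubs", "pirates"], "random": [0.71, 0.85], "fibonacci": [2, 3]}
)

reordered = df_demo[["fibonacci", "team", "random"]]  # new DataFrame
order_unchanged = list(df_demo.columns)               # original order is kept

# assign the result back to make the new order permanent
df_demo = df_demo[["fibonacci", "team", "random"]]
order_after_assignment = list(df_demo.columns)
```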


### dtypes, info, describe

In [36]:
# gives the datatype of each column
df.dtypes

team          object
random       float64
fibonacci      int64
dtype: object

In [37]:
# shape (dimensions) of dataframe
df.shape

(5, 3)

In [38]:
# information about index and columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   team       5 non-null      object 
 1   random     5 non-null      float64
 2   fibonacci  5 non-null      int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 160.0+ bytes


In [39]:
# basic statistics
df.describe()

Unnamed: 0,random,fibonacci
count,5.0,5.0
mean,0.585832,6.2
std,0.338621,4.438468
min,0.012471,2.0
25%,0.559552,3.0
50%,0.71199,5.0
75%,0.79792,8.0
max,0.847226,13.0
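
By default `describe` only summarizes the numeric columns, which is why `team` is missing above. As a sketch with made-up data, passing `include="all"` adds text columns, reported with `unique`, `top`, and `freq` instead of mean and quartiles:

```python
import pandas as pd

# hypothetical example data with a repeated team name
df_demo = pd.DataFrame({"team": ["cubs", "pirates", "cubs"], "fibonacci": [2, 3, 5]})

numeric_summary = df_demo.describe()            # numeric columns only
full_summary = df_demo.describe(include="all")  # all columns

n_unique_teams = full_summary.loc["unique", "team"]
most_common_team = full_summary.loc["top", "team"]
```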


### read_csv

In [40]:
# read_csv with defaults
# by default column headers are the first row and row indexes are integers starting from zero
df_sio = pd.read_csv("../data/employment-12211-9008_en.csv")

In [41]:
df_sio.head()

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013,186000,387000,573000


In [42]:
# by default, read_csv will infer the dtype of each column
df_sio.dtypes

Variable    object
Sector      object
Year         int64
Female       int64
Male         int64
Total        int64
dtype: object

In [43]:
df_sio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231 entries, 0 to 230
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Variable  231 non-null    object
 1   Sector    231 non-null    object
 2   Year      231 non-null    int64 
 3   Female    231 non-null    int64 
 4   Male      231 non-null    int64 
 5   Total     231 non-null    int64 
dtypes: int64(4), object(2)
memory usage: 11.0+ KB


In [44]:
df_sio.describe()

Unnamed: 0,Year,Female,Male,Total
count,231.0,231.0,231.0,231.0
mean,2014.0,889333.3,1027610.0,1916944.0
std,3.169145,1008617.0,1270668.0,2002257.0
min,2009.0,7000.0,10000.0,18000.0
25%,2011.0,171000.0,240500.0,325000.0
50%,2014.0,485000.0,660000.0,1290000.0
75%,2017.0,1161000.0,1260000.0,2590500.0
max,2019.0,4314000.0,5831000.0,8010000.0


In [45]:
# read_csv specifying dtype (for all columns), index_col, and header
# sometimes it's better to specify the dtype as object and convert to int, float, etc. later
df_sio2 = pd.read_csv("../data/employment-12211-9008_en.csv", dtype=object, index_col=None, header=0)

In [46]:
df_sio2.dtypes

Variable    object
Sector      object
Year        object
Female      object
Male        object
Total       object
dtype: object

In [47]:
# read_csv specifying dtypes (per column), index_col, and header
# this allows us to have more control over the dtype of each column
df_sio3 = pd.read_csv(
    "../data/employment-12211-9008_en.csv",
    dtype={
        "Variable": str,
        "Sector": str,
        "Year": int,
        "Female": int,
        "Male": int,
        "Total": np.float64,
    },
    index_col=None,
    header=0,
)

In [48]:
df_sio3.dtypes

Variable     object
Sector       object
Year          int64
Female        int64
Male          int64
Total       float64
dtype: object

#### Changing dtype of columns after DataFrame is created

In [49]:
df_sio2.dtypes

Variable    object
Sector      object
Year        object
Female      object
Male        object
Total       object
dtype: object

In [50]:
# Method 1: list comprehension (one column)
df_sio2["Year"] = [int(x) for x in df_sio2["Year"]]

In [51]:
# Method 2: pd.to_numeric (one column)
df_sio2["Total"] = pd.to_numeric(df_sio2["Total"])

In [52]:
# Method 3: apply(pd.to_numeric) (multiple columns)
df_sio2[["Male", "Female"]] = df_sio2[["Male", "Female"]].apply(pd.to_numeric)

In [53]:
df_sio2.dtypes

Variable    object
Sector      object
Year         int64
Female       int64
Male         int64
Total        int64
dtype: object
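
Two further options are worth knowing, sketched here on a small made-up table (`raw`): `astype` for a direct conversion, and `pd.to_numeric(..., errors="coerce")` when a column may contain values that are not numbers.

```python
import pandas as pd

# hypothetical raw data read as strings; "n/a" is an invalid year
raw = pd.DataFrame(
    {"Year": ["2009", "2010", "n/a"], "Total": ["648000", "637000", "639000"]}
)

# astype converts the whole column and raises if any value is invalid
raw["Total"] = raw["Total"].astype("int64")

# errors="coerce" turns invalid values into NaN instead of raising
raw["Year"] = pd.to_numeric(raw["Year"], errors="coerce")
n_missing_years = raw["Year"].isna().sum()
```

Because `NaN` is a float, the coerced `Year` column ends up as `float64` rather than `int64`.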

### head, tail

In [54]:
# add a number to change the number of rows printed
df_sio.head(3)

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011,213000,426000,639000


In [55]:
# tail works the same way
df_sio.tail(3)

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
228,WZ08-U,Extraterritorial organisations and bodies,2017,10000,11000,21000
229,WZ08-U,Extraterritorial organisations and bodies,2018,11000,11000,22000
230,WZ08-U,Extraterritorial organisations and bodies,2019,10000,10000,19000


In [56]:
# If we view the whole dataframe, only the first and last few rows are shown.
df_sio

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013,186000,387000,573000
...,...,...,...,...,...,...
226,WZ08-U,Extraterritorial organisations and bodies,2015,8000,11000,18000
227,WZ08-U,Extraterritorial organisations and bodies,2016,8000,10000,19000
228,WZ08-U,Extraterritorial organisations and bodies,2017,10000,11000,21000
229,WZ08-U,Extraterritorial organisations and bodies,2018,11000,11000,22000


In [57]:
# We can configure this value for display
pd.set_option("display.min_rows", 15)  # <- default is 10
pd.set_option("display.max_rows", 15)  # <- default is 60
df_sio

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013,186000,387000,573000
5,WZ08-A,"Agriculture, forestry and fishing",2014,188000,383000,571000
6,WZ08-A,"Agriculture, forestry and fishing",2015,180000,382000,562000
...,...,...,...,...,...,...
224,WZ08-U,Extraterritorial organisations and bodies,2013,8000,12000,20000
225,WZ08-U,Extraterritorial organisations and bodies,2014,7000,13000,21000


In [58]:
# Resetting all options afterward
pd.reset_option("all", silent=True)
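
If the display settings should only change temporarily, `pd.option_context` can be used as a context manager so nothing has to be reset afterwards. A minimal sketch with a made-up 100-row DataFrame:

```python
import pandas as pd

# hypothetical long DataFrame
df_long = pd.DataFrame({"n": range(100)})

max_rows_before = pd.get_option("display.max_rows")

# the options only apply inside the with-block
with pd.option_context("display.max_rows", 6, "display.min_rows", 6):
    short_repr = repr(df_long)

max_rows_after = pd.get_option("display.max_rows")  # back to the previous value
```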

### Indexing with bracket/dot notation, loc, iloc

Pandas has three indexing methods:

* `[ ]` and `.` work on labels of columns
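* `.loc` works on labels of indexes and columns. With the default integer index (`RangeIndex`), the labels coincide with the positions, so `.loc` behaves much like `.iloc`, except that `.loc` slices include the end label.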
* `.iloc` works on the positions of indexes and columns (so it only takes integers)
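
As a quick side-by-side sketch of the three methods listed above (using a small made-up `df_demo`, not the lesson's `df`):

```python
import pandas as pd

# hypothetical example data with a labelled index
df_demo = pd.DataFrame(
    {"team": ["cubs", "pirates", "giants"], "fibonacci": [2, 3, 5]},
    index=["a", "b", "c"],
)

col = df_demo["team"]                 # brackets: column by label
row_by_label = df_demo.loc["b"]       # .loc: row by index label
row_by_pos = df_demo.iloc[1]          # .iloc: row by position
cell = df_demo.loc["b", "fibonacci"]  # .loc: row label and column label

# slices differ: .loc includes the end label, .iloc excludes the end position
loc_slice = df_demo.loc["a":"b"]      # rows a and b
iloc_slice = df_demo.iloc[0:1]        # row a only
```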

In [59]:
df

Unnamed: 0,team,random,fibonacci
a,cubs,0.71199,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


#### brackets only: column by header

In [60]:
# to get a column (Series), use the column header (no need for .loc or .iloc)
df["team"]

a       cubs
b    pirates
c     giants
d    yankees
e    donkeys
Name: team, dtype: object

In [61]:
# for multiple columns, put a list inside the brackets (so two sets of brackets)
df[["team", "random"]]

Unnamed: 0,team,random
a,cubs,0.71199
b,pirates,0.847226
c,giants,0.79792
d,yankees,0.559552
e,donkeys,0.012471


#### dot-notation

In [62]:
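# if the column name is a valid Python identifier (letters, digits, and underscores,
# not starting with a digit) and does not clash with a DataFrame attribute or method,
# we can use a dot instead of brackets and quotes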
df.team

a       cubs
b    pirates
c     giants
d    yankees
e    donkeys
Name: team, dtype: object
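
Dot notation has two common pitfalls, sketched below with made-up column names: a column whose name matches a DataFrame method (here `count`), and a column whose name contains a space. Bracket notation works in both cases.

```python
import pandas as pd

# hypothetical columns chosen to trigger the pitfalls
df_demo = pd.DataFrame({"count": [1, 2], "team name": ["cubs", "pirates"]})

# df_demo.count is the DataFrame.count method, not the "count" column
dot_gives_method = callable(df_demo.count)

# brackets always refer to the column
count_column = df_demo["count"]
spaced_column = df_demo["team name"]
```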

#### loc: row by index

In [63]:
# to get a row by name, use .loc with the row index
df.loc["a"]

team            cubs
random       0.71199
fibonacci          2
Name: a, dtype: object

In [64]:
# for multiple rows, put a list inside the brackets (so two sets of brackets)
df.loc[["a", "d"]]

Unnamed: 0,team,random,fibonacci
a,cubs,0.71199,2
d,yankees,0.559552,8


In [65]:
# the command is slightly different if we have a multi index
df_multi_index = df.set_index([df.index, "team"])
df_multi_index

Unnamed: 0,team,random,fibonacci
a,cubs,0.71199,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [66]:
df_multi_index.loc["a", "cubs"]

random       0.71199
fibonacci    2.00000
Name: (a, cubs), dtype: float64

In [67]:
df_multi_index.loc[[("a", "cubs"), ("d", "yankees")]]

Unnamed: 0,team,random,fibonacci
a,cubs,0.71199,2
d,yankees,0.559552,8


In [68]:
# you can also modify individual entries if you know their exact location
df_multi_index.loc[("a", "cubs"), "random"] = 0.77
df_multi_index.loc[("a", "cubs")]

random       0.77
fibonacci    2.00
Name: (a, cubs), dtype: float64

#### iloc: row (or column) by position

In [69]:
# to get a row by position, use .iloc with the row number
df.iloc[0]

team            cubs
random       0.71199
fibonacci          2
Name: a, dtype: object

In [70]:
# for multiple rows, put a list inside the brackets (so two sets of brackets)
df.iloc[[0, 3]]

Unnamed: 0,team,random,fibonacci
a,cubs,0.71199,2
d,yankees,0.559552,8


In [71]:
# or pass a slice
df.iloc[2:]

Unnamed: 0,team,random,fibonacci
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [72]:
# iloc also works with columns
df.iloc[:, [0, 2]]

Unnamed: 0,team,fibonacci
a,cubs,2
b,pirates,3
c,giants,5
d,yankees,8
e,donkeys,13


In [73]:
# you can also modify individual entries if you know their exact location
df.iloc[0, 1] = 0.77
df.iloc[0]

team         cubs
random       0.77
fibonacci       2
Name: a, dtype: object

### transpose

In [74]:
df.transpose()

Unnamed: 0,a,b,c,d,e
team,cubs,pirates,giants,yankees,donkeys
random,0.77,0.847226,0.79792,0.559552,0.012471
fibonacci,2,3,5,8,13


In [75]:
df.T

Unnamed: 0,a,b,c,d,e
team,cubs,pirates,giants,yankees,donkeys
random,0.77,0.847226,0.79792,0.559552,0.012471
fibonacci,2,3,5,8,13
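
One side effect to keep in mind: when the columns have different dtypes, every column of the transposed DataFrame becomes `object`, and transposing back does not restore the original dtypes. A small sketch with made-up data, using `infer_objects` as one possible way to recover them:

```python
import pandas as pd

# hypothetical example data with one text and one integer column
df_demo = pd.DataFrame(
    {"team": ["cubs", "pirates"], "fibonacci": [2, 3]}, index=["a", "b"]
)

transposed = df_demo.T           # each column now mixes text and integers
round_trip = df_demo.T.T         # same layout as df_demo, but object columns
restored = round_trip.infer_objects()  # lets pandas re-infer the dtypes
```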


### to_csv, to_excel

In [76]:
# make sure you have the Excel dependencies installed
%pip install openpyxl

Note: you may need to restart the kernel to use updated packages.


In [77]:
# to_csv with defaults (sep=',')
df.to_csv("teams.csv")

In [78]:
# use the sep option if the separator is not a comma
df.to_csv("teams.tsv", sep="\t")

In [79]:
# with index label
df.to_csv("teams.csv", index_label="index")

In [80]:
# to_excel requires the openpyxl package
df.to_excel("teams.xlsx", index_label="index")
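
When no path is given, `to_csv` returns the CSV text as a string, which makes it easy to inspect what would be written. The sketch below (made-up `df_demo`, in-memory buffer instead of a file) compares the output with and without the index and reads it back:

```python
import io

import pandas as pd

# hypothetical example data
df_demo = pd.DataFrame(
    {"team": ["cubs", "pirates"], "fibonacci": [2, 3]}, index=["a", "b"]
)

with_index = df_demo.to_csv()                # first column holds the index
without_index = df_demo.to_csv(index=False)  # index is dropped

# read the text back, using the first column as the index again
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
```

Dropping the index with `index=False` is convenient when the index is just the default row numbering.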

### read_csv, read_excel

In [81]:
# read_csv with defaults
pd.read_csv("teams.csv")

Unnamed: 0,index,team,random,fibonacci
0,a,cubs,0.77,2
1,b,pirates,0.847226,3
2,c,giants,0.79792,5
3,d,yankees,0.559552,8
4,e,donkeys,0.012471,13


In [82]:
# read_csv specifying first column of csv as index_col
df1 = pd.read_csv("teams.csv", index_col=0)
df1

index,team,random,fibonacci
a,cubs,0.77,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [83]:
# default datatypes
df1.dtypes

team          object
random       float64
fibonacci      int64
dtype: object

In [84]:
# again, we can specify the dtypes when we read_csv
df2 = pd.read_csv("teams.csv", index_col=0, dtype=object)
df3 = pd.read_csv("teams.csv", index_col=0, dtype={"team": object, "random": np.float16, "fibonacci": np.int64})

In [85]:
# specify datatypes: all object
df2.dtypes

team         object
random       object
fibonacci    object
dtype: object

In [86]:
# specify datatypes: per column
df3.dtypes

team          object
random       float16
fibonacci      int64
dtype: object
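
A smaller float type such as `float16` saves memory but stores fewer significant digits. The sketch below, using two of the values from the table above, measures how far the `float16` values drift from the originals:

```python
import numpy as np
import pandas as pd

values = pd.Series([0.847226, 0.012471])

# convert to half precision, then back to float64 to compare
as_float16 = values.astype(np.float16)
error = (as_float16.astype(np.float64) - values).abs()
```

The differences are small but not zero, so `float16` is best reserved for cases where that loss of precision is acceptable.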

In [87]:
# use the sep option if the separator is not a comma
df4 = pd.read_csv("teams.tsv", index_col=0, sep="\t")
df4

Unnamed: 0,team,random,fibonacci
a,cubs,0.77,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


In [88]:
# read_excel requires the openpyxl package for .xlsx files
df5 = pd.read_excel("teams.xlsx", index_col=0)
df5

index,team,random,fibonacci
a,cubs,0.77,2
b,pirates,0.847226,3
c,giants,0.79792,5
d,yankees,0.559552,8
e,donkeys,0.012471,13


### to_datetime

We will cover time series in greater detail in a future lesson.

In [89]:
df_sio.head()

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013,186000,387000,573000


In [90]:
df_sio.dtypes

Variable    object
Sector      object
Year         int64
Female       int64
Male         int64
Total        int64
dtype: object

In [91]:
time = pd.to_datetime(df_sio["Year"], format="%Y")
time.head()

0   2009-01-01
1   2010-01-01
2   2011-01-01
3   2012-01-01
4   2013-01-01
Name: Year, dtype: datetime64[ns]

In [92]:
df_sio["Year"] = time

In [93]:
df_sio.head()

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009-01-01,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010-01-01,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011-01-01,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012-01-01,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013-01-01,186000,387000,573000


In [94]:
df_sio.dtypes

Variable            object
Sector              object
Year        datetime64[ns]
Female               int64
Male                 int64
Total                int64
dtype: object
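
Once a column has a datetime dtype, its components are available through the `.dt` accessor. A minimal sketch with a made-up `years` Series:

```python
import pandas as pd

# hypothetical integer years
years = pd.Series([2009, 2010, 2011])

# with format="%Y", each year becomes January 1st of that year
timestamps = pd.to_datetime(years, format="%Y")

# the .dt accessor extracts components such as the year
back_to_years = timestamps.dt.year
first_timestamp = timestamps.iloc[0]
```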

In [95]:
# to do this in a single step, we can use read_csv's parse_dates keyword
df_sio4 = pd.read_csv("../data/employment-12211-9008_en.csv", index_col=None, parse_dates=["Year"])

In [96]:
df_sio4.head()

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009-01-01,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010-01-01,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011-01-01,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012-01-01,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013-01-01,186000,387000,573000


In [97]:
df_sio4.dtypes

Variable            object
Sector              object
Year        datetime64[ns]
Female               int64
Male                 int64
Total                int64
dtype: object

In [98]:
# for maximal control, this can be combined with dtype
df_sio5 = pd.read_csv(
    "../data/employment-12211-9008_en.csv", index_col=None, parse_dates=["Year"], dtype={"Total": int}
)

In [99]:
df_sio5.head()

Unnamed: 0,Variable,Sector,Year,Female,Male,Total
0,WZ08-A,"Agriculture, forestry and fishing",2009-01-01,220000,428000,648000
1,WZ08-A,"Agriculture, forestry and fishing",2010-01-01,214000,423000,637000
2,WZ08-A,"Agriculture, forestry and fishing",2011-01-01,213000,426000,639000
3,WZ08-A,"Agriculture, forestry and fishing",2012-01-01,200000,413000,612000
4,WZ08-A,"Agriculture, forestry and fishing",2013-01-01,186000,387000,573000


In [100]:
df_sio5.dtypes

Variable            object
Sector              object
Year        datetime64[ns]
Female               int64
Male                 int64
Total                int64
dtype: object