# Data Manipulation

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 10)

## Categorical Types

* Pandas provides a convenient `dtype` for representing categorical, or factor, data

In [2]:
c = pd.Categorical(['a', 'b', 'b', 'c', 'a', 'b', 'a', 'a', 'a', 'a'])
c

[a, b, b, c, a, b, a, a, a, a]
Categories (3, object): [a, b, c]

In [3]:
c.describe()

            counts  freqs
categories
a                6    0.6
b                3    0.3
c                1    0.1


In [4]:
c.codes

array([0, 1, 1, 2, 0, 1, 0, 0, 0, 0], dtype=int8)

In [5]:
c.categories

Index(['a', 'b', 'c'], dtype='object')

* By default the Categorical type represents an **unordered categorical**
* You can provide information about the order of categories

In [6]:
c.as_ordered()

[a, b, b, c, a, b, a, a, a, a]
Categories (3, object): [a < b < c]
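
`as_ordered()` takes the existing category order as given. A minimal sketch of supplying the order explicitly at construction time follows; the `sizes` data and its category names are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical data: the category names are illustrative only.
sizes = pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],  # explicit order, smallest first
    ordered=True,
)

smallest = sizes.min()        # smallest category that is present
largest = sizes.max()         # largest category that is present
below_large = sizes < "large" # element-wise comparison respects the order
```

With an unordered categorical, `min`, `max` and `<` comparisons raise a `TypeError`, so ordering only needs to be declared when the categories really have a meaningful rank.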

### Support in DataFrames

* When a Categorical is in a DataFrame, there is a special `cat` accessor
* This gives access to all of the features of the Categorical type

In [7]:
import numpy as np
dta = pd.DataFrame.from_dict({'factor': c,
                              'x': np.random.randn(10)})

In [8]:
dta.head()

  factor         x
0      a -1.303333
1      b  0.323298
2      b -2.244303
3      c  1.464845
4      a -0.343085


In [9]:
dta.dtypes

factor    category
x          float64
dtype: object

In [10]:
dta.factor.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x11336e7f0>

In [11]:
dta.factor.cat.categories

Index(['a', 'b', 'c'], dtype='object')

In [12]:
dta.factor.describe()

count     10
unique     3
top        a
freq       6
Name: factor, dtype: object
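
As a small, self-contained sketch of what else the `cat` accessor offers, the example below uses a toy Series `s` (an assumption for illustration, not the `dta` frame above) to read the integer codes and relabel the categories.

```python
import pandas as pd

# Toy categorical column; values are illustrative only.
s = pd.Series(["a", "b", "b", "c", "a"], dtype="category")

codes = s.cat.codes  # integer code of each row's category
renamed = s.cat.rename_categories({"a": "A", "b": "B", "c": "C"})  # relabel without touching the codes
```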

## Date and Time Types

Pandas provides conveniences for working with dates

### Creating a Range of Dates

In [13]:
dates = pd.date_range("1/1/2015", periods=75, freq="D")
dates

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-01-10', '2015-01-11', '2015-01-12',
               '2015-01-13', '2015-01-14', '2015-01-15', '2015-01-16',
               '2015-01-17', '2015-01-18', '2015-01-19', '2015-01-20',
               '2015-01-21', '2015-01-22', '2015-01-23', '2015-01-24',
               '2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
               '2015-01-29', '2015-01-30', '2015-01-31', '2015-02-01',
               '2015-02-02', '2015-02-03', '2015-02-04', '2015-02-05',
               '2015-02-06', '2015-02-07', '2015-02-08', '2015-02-09',
               '2015-02-10', '2015-02-11', '2015-02-12', '2015-02-13',
               '2015-02-14', '2015-02-15', '2015-02-16', '2015-02-17',
               '2015-02-18', '2015-02-19', '2015-02-20', '2015-02-21',
               '2015-02-22', '2015-02-23', '2015-02-24', '2015-02-25',
      

In [14]:
y = pd.Series(np.random.randn(75), index=dates)
y.head()

2015-01-01    0.413305
2015-01-02   -2.414540
2015-01-03   -0.222112
2015-01-04   -0.137805
2015-01-05   -1.822826
Freq: D, dtype: float64

In [15]:
y.reset_index().dtypes

index    datetime64[ns]
0               float64
dtype: object

### Support in DataFrames

* When a `datetime` type is in a DataFrame, there is a special `dt` accessor
* This gives access to all of the features of the datetime type

In [16]:
dta = (y.reset_index(name='t').
       rename(columns={'index': 'y'}))

In [17]:
dta.head()

           y         t
0 2015-01-01  0.413305
1 2015-01-02 -2.414540
2 2015-01-03 -0.222112
3 2015-01-04 -0.137805
4 2015-01-05 -1.822826


In [18]:
dta.dtypes

y    datetime64[ns]
t           float64
dtype: object

In [19]:
dta.y.dt.freq

'D'

In [20]:
dta.y.dt.day

0      1
1      2
2      3
3      4
4      5
      ..
70    12
71    13
72    14
73    15
74    16
Name: y, Length: 75, dtype: int64
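
A few more `dt` attributes, sketched on a short stand-alone Series (`stamps` is a hypothetical name used only here):

```python
import pandas as pd

stamps = pd.Series(pd.date_range("2015-01-01", periods=4, freq="D"))

months = stamps.dt.month.tolist()       # calendar month of each timestamp
weekdays = stamps.dt.dayofweek.tolist() # Monday=0 ... Sunday=6
names = stamps.dt.day_name().tolist()   # weekday names
```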

### Indexing with Dates

* You can use strings
* **Note**: the ending index is *inclusive* here. This is different from most of the rest of Python, where slices exclude the end point

In [21]:
y.loc["2015-01-01":"2015-01-15"]

2015-01-01    0.413305
2015-01-02   -2.414540
2015-01-03   -0.222112
2015-01-04   -0.137805
2015-01-05   -1.822826
                ...   
2015-01-11   -0.366545
2015-01-12   -1.587358
2015-01-13   -0.658519
2015-01-14    0.258498
2015-01-15   -0.094714
Freq: D, Length: 15, dtype: float64

DatetimeIndex supports partial string indexing

In [22]:
y["2015-01"]

2015-01-01    0.413305
2015-01-02   -2.414540
2015-01-03   -0.222112
2015-01-04   -0.137805
2015-01-05   -1.822826
                ...   
2015-01-27    0.992801
2015-01-28    0.490466
2015-01-29   -1.106136
2015-01-30   -1.510038
2015-01-31   -1.248085
Freq: D, Length: 31, dtype: float64
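
To make the inclusive end point and the partial-string behavior easy to check, here is a minimal sketch on a Series whose values are just row counters (the Series `s` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(60),
              index=pd.date_range("2015-01-01", periods=60, freq="D"))

jan = s.loc["2015-01"]                     # every row in January 2015
window = s.loc["2015-01-30":"2015-02-02"]  # both end points are included
```

Here `window` holds four rows, 30 January through 2 February, whereas a positional slice such as `s.iloc[29:32]` would stop one row earlier.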

* You can **resample** to a lower frequency, specifying how to aggregate
* Uses the `DatetimeIndexResampler` object

In [23]:
resample = y.resample("M")

In [24]:
resample.mean()

2015-01-31   -0.203243
2015-02-28   -0.219808
2015-03-31   -0.164478
Freq: M, dtype: float64
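
The resampler accepts more than one aggregation at a time. The sketch below is an illustration under assumptions: it uses a made-up two-week daily series and the weekly alias `"W"` rather than the monthly frequency used above.

```python
import numpy as np
import pandas as pd

# 2015-01-05 is a Monday, so "W" (weeks ending on Sunday) yields two full weeks.
daily = pd.Series(np.arange(14.0),
                  index=pd.date_range("2015-01-05", periods=14, freq="D"))

weekly = daily.resample("W").agg(["mean", "min", "max"])  # one column per statistic
```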

Or go to a higher frequency, optionally specifying how to fill in the resulting gaps

In [25]:
y.asfreq('H', method='ffill')

2015-01-01 00:00:00    0.413305
2015-01-01 01:00:00    0.413305
2015-01-01 02:00:00    0.413305
2015-01-01 03:00:00    0.413305
2015-01-01 04:00:00    0.413305
                         ...   
2015-03-15 20:00:00   -1.607241
2015-03-15 21:00:00   -1.607241
2015-03-15 22:00:00   -1.607241
2015-03-15 23:00:00   -1.607241
2015-03-16 00:00:00   -1.609855
Freq: H, Length: 1777, dtype: float64
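
The effect of the `method` argument is easiest to see side by side. This sketch upsamples a hypothetical weekly series to daily frequency, once without filling and once with forward fill:

```python
import pandas as pd

weekly = pd.Series([1.0, 2.0, 3.0],
                   index=pd.date_range("2015-01-01", periods=3, freq="7D"))

gaps = weekly.asfreq("D")                    # new rows are NaN
filled = weekly.asfreq("D", method="ffill")  # new rows repeat the last observation
```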

There are convenience methods to lag and lead time series

In [26]:
y

2015-01-01    0.413305
2015-01-02   -2.414540
2015-01-03   -0.222112
2015-01-04   -0.137805
2015-01-05   -1.822826
                ...   
2015-03-12    1.035523
2015-03-13    0.117829
2015-03-14   -1.514452
2015-03-15   -1.607241
2015-03-16   -1.609855
Freq: D, Length: 75, dtype: float64

In [27]:
y.shift(1)

2015-01-01         NaN
2015-01-02    0.413305
2015-01-03   -2.414540
2015-01-04   -0.222112
2015-01-05   -0.137805
                ...   
2015-03-12   -0.633624
2015-03-13    1.035523
2015-03-14    0.117829
2015-03-15   -1.514452
2015-03-16   -1.607241
Freq: D, Length: 75, dtype: float64

In [28]:
y.shift(-1)

2015-01-01   -2.414540
2015-01-02   -0.222112
2015-01-03   -0.137805
2015-01-04   -1.822826
2015-01-05   -0.892759
                ...   
2015-03-12    0.117829
2015-03-13   -1.514452
2015-03-14   -1.607241
2015-03-15   -1.609855
2015-03-16         NaN
Freq: D, Length: 75, dtype: float64
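
A common use of a lag is computing period-over-period changes. A minimal sketch, with made-up values in `s`:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 15.0],
              index=pd.date_range("2015-01-01", periods=4, freq="D"))

change = s - s.shift(1)  # day-over-day change; the first value is NaN
lead = s.shift(-1)       # next day's value; the last value is NaN
```

The `change` series matches what `s.diff()` returns.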

### Rolling and Window Functions

* Pandas also provides a number of convenience functions for working on rolling or moving windows of time series through a common interface
* This interface is the new **Rolling** object

In [29]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', 
                                                          periods=1000))
ts = ts.cumsum()

In [30]:
rolling = ts.rolling(window=60)
rolling

Rolling [window=60,center=False,axis=0]

In [31]:
rolling.mean()

2000-01-01          NaN
2000-01-02          NaN
2000-01-03          NaN
2000-01-04          NaN
2000-01-05          NaN
                ...    
2002-09-22   -52.729994
2002-09-23   -52.729850
2002-09-24   -52.740602
2002-09-25   -52.743345
2002-09-26   -52.722807
Freq: D, Length: 1000, dtype: float64
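
The leading `NaN` values appear because a window must be full before a result is produced. The sketch below, on a short illustrative series, shows how `min_periods` relaxes that requirement:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1.0, 7.0),
              index=pd.date_range("2000-01-01", periods=6))

strict = s.rolling(window=3).mean()                  # NaN until 3 observations exist
partial = s.rolling(window=3, min_periods=1).mean()  # averages whatever is available
```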

## Merging and Joining DataFrames

In [32]:
transit = pd.read_csv("../data/AIS/transit_segments.csv", 
                      parse_dates=['st_time', 'end_time'],
                      infer_datetime_format=True)

vessels = pd.read_csv("../data/AIS/vessel_information.csv")

* Data that comes from relational databases is often normalized
* I.e., redundant information will be put in separate tables
* Users are expected to *merge* or *join* tables to work with them

In [33]:
vessels.head()

,mmsi,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,type
0,1,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
1,9,3,000000009/Raven/Shearwater,N,Unknown,Unknown,2,50.0/62.0,62.0,2,Pleasure/Tug
2,21,1,Us Gov Vessel,Y,Unknown,Unknown,1,208.0,208.0,1,Unknown
3,74,2,Mcfaul/Sarah Bell,N,Unknown,Unknown,1,155.0,155.0,1,Unknown
4,103,3,Ron G/Us Navy Warship 103/Us Warship 103,Y,Unknown,Unknown,2,26.0/155.0,155.0,2,Tanker/Unknown


In [34]:
transit.head()

,mmsi,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,end_time
0,1,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2009-02-10 16:03:00,2009-02-10 16:27:00
1,1,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,2009-04-06 14:31:00,2009-04-06 15:20:00
2,1,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,2009-04-06 14:36:00,2009-04-06 14:55:00
3,1,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,2009-04-10 17:58:00,2009-04-10 18:34:00
4,1,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,2009-04-10 17:59:00,2009-04-10 18:35:00


* Several ships in the vessels data have traveled multiple segments, as we would expect
* Matching rows in the transit data to the vessels data is thus a many-to-one match

* *Aside*: pandas Indexes (of which `columns` is one) are set-like

In [35]:
vessels.columns.intersection(transit.columns)

Index(['mmsi'], dtype='object')
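
Besides `intersection`, an Index also supports `union` and `difference`. A short sketch on two hand-built indexes (the labels are a small subset chosen for illustration):

```python
import pandas as pd

left = pd.Index(["mmsi", "name", "transit"])
right = pd.Index(["mmsi", "flag", "type"])

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels in left but not in right
```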

### Merging

* We can combine these two datasets for a many-to-one match
* `merge` will use the common columns if we do not explicitly specify the columns

In [36]:
transit.merge(vessels).head()

,mmsi,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,...,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,type
0,1,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2009-02-10 16:03:00,...,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
1,1,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,2009-04-06 14:31:00,...,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
2,1,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,2009-04-06 14:36:00,...,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
3,1,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,2009-04-10 17:58:00,...,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
4,1,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,2009-04-10 17:59:00,...,8,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4,Dredging/MilOps/Reserved/Towing
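
Relying on the common columns is convenient, but being explicit is often safer. The sketch below uses two toy frames, `trips` and `ships`, as stand-ins for the transit and vessels tables (their values are made up) and names the key, the join type, and the expected relationship:

```python
import pandas as pd

# Toy stand-ins for the transit and vessels tables; values are made up.
trips = pd.DataFrame({"mmsi": [1, 1, 9, 42],
                      "seg_length": [5.1, 13.5, 4.3, 7.0]})
ships = pd.DataFrame({"mmsi": [1, 9],
                      "flag": ["Unknown", "Unknown"]})

# validate raises MergeError if the right-hand keys are not unique.
inner = trips.merge(ships, on="mmsi", how="inner", validate="many_to_one")

# A left merge keeps unmatched trips; indicator adds a _merge column showing the source.
left = trips.merge(ships, on="mmsi", how="left", indicator=True)
```

In `left`, the trip with `mmsi` 42 has no matching ship, so its `flag` is missing and its `_merge` value is `left_only`.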


**Watch out**: when merging on columns, the indices are discarded

In [37]:
A = pd.DataFrame(np.random.randn(25, 2), 
                 index=pd.date_range('1/1/2015', periods=25))
A[2] = np.repeat(list('abcde'), 5)
A

                   0         1  2
2015-01-01  0.954049 -0.835047  a
2015-01-02  0.199263 -0.749317  a
2015-01-03 -1.504140  0.538143  a
2015-01-04 -0.376163 -0.376012  a
2015-01-05  1.659447  0.217146  a
...              ...       ... ..
2015-01-21  0.107853  1.049046  e
2015-01-22 -0.021527  0.447184  e
2015-01-23 -0.302640 -0.825280  e
2015-01-24  1.307874  0.234465  e


In [38]:
B = pd.DataFrame(np.random.randn(5, 2))
B[2] = list('abcde')
B

          0         1  2
0 -0.761036  0.303107  a
1  1.500543 -0.057711  b
2  0.439057 -1.042851  c
3  1.920025  0.469249  d
4 -0.645320 -1.095017  e


In [39]:
A.merge(B, on=2)

         0_x       1_x  2       0_y       1_y
0   0.954049 -0.835047  a -0.761036  0.303107
1   0.199263 -0.749317  a -0.761036  0.303107
2  -1.504140  0.538143  a -0.761036  0.303107
3  -0.376163 -0.376012  a -0.761036  0.303107
4   1.659447  0.217146  a -0.761036  0.303107
..       ...       ... ..       ...       ...
20  0.107853  1.049046  e -0.645320 -1.095017
21 -0.021527  0.447184  e -0.645320 -1.095017
22 -0.302640 -0.825280  e -0.645320 -1.095017
23  1.307874  0.234465  e -0.645320 -1.095017
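
One possible workaround, sketched here on smaller made-up frames with a named `key` column, is to move the index into a column before merging and restore it afterwards:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"key": list("aabb"), "x": np.arange(4)},
                 index=pd.date_range("2015-01-01", periods=4))
B = pd.DataFrame({"key": list("ab"), "y": [10, 20]})

lost = A.merge(B, on="key")     # result gets a fresh 0..n-1 index

kept = (A.reset_index()         # dates move into a column named "index"
         .merge(B, on="key")
         .set_index("index"))   # put the dates back as the index
```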


### Joins

* Join is like merge, but it works on the indices
* The same could be achieved with merge and the `left_index` and `right_index` keywords

In [40]:
transit.set_index('mmsi', inplace=True)
vessels.set_index('mmsi', inplace=True)

In [41]:
transit.join(vessels).head()

mmsi,name,transit,segment,seg_length,avg_sog,min_sog,max_sog,pdgt10,st_time,end_time,num_names,names,sov,flag,flag_type,num_loas,loa,max_loa,num_types,type
1,Us Govt Ves,1,1,5.1,13.2,9.2,14.5,96.5,2009-02-10 16:03:00,2009-02-10 16:27:00,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,Dredging/MilOps/Reserved/Towing
1,Dredge Capt Frank,1,1,13.5,18.6,10.4,20.6,100.0,2009-04-06 14:31:00,2009-04-06 15:20:00,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,Dredging/MilOps/Reserved/Towing
1,Us Gov Vessel,1,1,4.3,16.2,10.3,20.5,100.0,2009-04-06 14:36:00,2009-04-06 14:55:00,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,Dredging/MilOps/Reserved/Towing
1,Us Gov Vessel,2,1,9.2,15.4,14.5,16.1,100.0,2009-04-10 17:58:00,2009-04-10 18:34:00,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,Dredging/MilOps/Reserved/Towing
1,Dredge Capt Frank,2,1,9.2,15.4,14.6,16.2,100.0,2009-04-10 17:59:00,2009-04-10 18:35:00,8.0,Bil Holman Dredge/Dredge Capt Frank/Emo/Offsho...,Y,Unknown,Unknown,7.0,42.0/48.0/57.0/90.0/138.0/154.0/156.0,156.0,4.0,Dredging/MilOps/Reserved/Towing
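
To illustrate the equivalence mentioned above, the following sketch builds two toy frames indexed by `mmsi` (values are made up) and compares `join` with an index-based `merge`:

```python
import pandas as pd

trips = pd.DataFrame({"seg_length": [5.1, 13.5, 4.3]},
                     index=pd.Index([1, 1, 9], name="mmsi"))
ships = pd.DataFrame({"flag": ["Unknown", "Unknown"]},
                     index=pd.Index([1, 9], name="mmsi"))

joined = trips.join(ships)  # left join on the index by default
merged = trips.merge(ships, left_index=True, right_index=True, how="left")
```

Note that `join` defaults to a left join while `merge` defaults to an inner join, which is why `how="left"` is passed to `merge` here.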


## Concatenation

* Another common operation is appending data row-wise or column-wise to an existing dataset
* We can use the `concat` function for this
* Let's import two Ebola case-count datasets for Guinea, each holding the counts reported on a particular date.
* We will use the `Date` and `Description` columns of each dataset as the index.
* Each index entry pairs the report date with a description of the count in that row, such as `New cases of suspects`.

In [42]:
df1 = pd.read_csv('../data/ebola/guinea_data/2014-08-04.csv', 
                  index_col=['Date', 'Description'])
df2 = pd.read_csv('../data/ebola/guinea_data/2014-08-26.csv',
                 index_col=['Date', 'Description'])

In [43]:
print(df1.shape, df2.shape)

(42, 14) (32, 22)


In [46]:
df1.head()

Date,Description,Totals,Conakry,Gueckedou,Macenta,Dabola,Kissidougou,Dinguiraye,Telimele,Boffa,Kouroussa,Dubreka,Siguiri,Pita,Nzerekore
2014-08-04,New cases of suspects,5,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-08-04,New cases of probables,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-08-04,New cases of confirmed,4,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-08-04,Total new cases registered so far,9,6.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2014-08-04,Total cases of suspects,11,9.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
df2.head()

Date,Description,Totals,Conakry,Gueckedou,Macenta,Dabola,Kissidougou,Dinguiraye,Telimele,Boffa,Kouroussa,...,Mzerekore,Yomou,Dubreka,Forecariah,Kerouane,Coyah,Dalaba,Beyla,Kindia,Lola
2014-08-26,New cases of suspects,18.0,,1.0,12.0,,,,,,,...,1.0,4.0,,,,,,,,
2014-08-26,New cases of probables,,,,,,,,,,,...,,,,,,,,,,
2014-08-26,New cases of confirmed,10.0,,1.0,5.0,,,,,,,...,,3.0,1.0,,,,,,,
2014-08-26,Total new cases registered so far,28.0,,2.0,17.0,,,,,,,...,1.0,7.0,1.0,,,,,,,
2014-08-26,Total cases of suspects,30.0,8.0,4.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,4.0,0.0,,,,,,,


In [47]:
df1.index.is_unique

True

In [48]:
df2.index.is_unique

True

We can concatenate on the rows

In [49]:
df = pd.concat((df1, df2), axis=0)
df.shape

(74, 23)
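
The row-wise result has more columns than either input because `concat` takes the union of the column labels and fills the gaps with `NaN`. A minimal sketch on two tiny made-up frames, which also shows the `keys` and `axis=1` variants:

```python
import pandas as pd

# Tiny made-up frames with partly overlapping columns.
d1 = pd.DataFrame({"Totals": [5, 0], "Conakry": [5.0, 0.0]},
                  index=["suspects", "probables"])
d2 = pd.DataFrame({"Totals": [18, 1], "Yomou": [4.0, 0.0]},
                  index=["suspects", "probables"])

rows = pd.concat([d1, d2], axis=0, sort=False)                  # stack rows; union of columns
keyed = pd.concat([d1, d2], keys=["2014-08-04", "2014-08-26"])  # label each block in the index
cols = pd.concat([d1, d2], axis=1)                              # align on the index instead
```

In `rows` the index labels repeat, so `rows.index.is_unique` is `False`; passing `keys` is one way to keep the blocks distinguishable.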

## Text Data Manipulation

* Much like the `cat` and `dt` accessors we've already seen
* String types have a `str` accessor that provides fast string operations on columns

In [50]:
vessels.type

mmsi
1            Dredging/MilOps/Reserved/Towing
9                               Pleasure/Tug
21                                   Unknown
74                                   Unknown
103                           Tanker/Unknown
                          ...               
919191919                           Pleasure
967191190                      BigTow/Towing
975318642                             Towing
987654321                     Fishing/Towing
999999999                           Pleasure
Name: type, Length: 10771, dtype: object

* Count the separators in each vessel type string

In [None]:
vessels.type.str.count('/').max()

* Split on these separators and expand to return a DataFrame with `nan`-padding

In [None]:
vessels.type.str.split('/', expand=True)
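
Since the two cells above have no recorded output, here is a self-contained sketch of the same calls on three type strings copied from the `vessels.type` listing (the Series `types` is a stand-in, not the full column):

```python
import pandas as pd

types = pd.Series(["Dredging/MilOps/Reserved/Towing", "Pleasure/Tug", "Unknown"])

n_sep = types.str.count("/")               # separators per row
parts = types.str.split("/", expand=True)  # one column per piece; short rows are padded with missing values
```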