##### Handling Missing Data

- real-world data is rarely clean and homogeneous
- different data sources indicate missing data in different ways
- missing entries may appear as null, NaN, None, or NA values

##### Trade-offs in Missing Data Conventions

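- one strategy is to choose a sentinel value that indicates a missing entry
- the sentinel can be a data-specific convention (e.g. -9999) or a more global one such as NaN
- a sentinel value reduces the range of valid values that can be represented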
- NaN is not available for all data types
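
The trade-off can be made concrete with a minimal sketch. The sentinel -9999, the sample readings, and the variable names are assumptions chosen for illustration; the mask variant uses NumPy's masked arrays as one possible alternative to a sentinel.

```python
import numpy as np

# Sentinel convention: reserve one value (here -9999, an arbitrary choice)
# to mean "missing". That value can no longer be used as real data.
readings = np.array([12, -9999, 7, 3])
valid = readings != -9999
sentinel_mean = readings[valid].mean()

# Mask convention: keep a separate Boolean array that flags missing entries.
# The full integer range stays available, at the cost of extra storage.
masked = np.ma.masked_array([12, 0, 7, 3], mask=[False, True, False, False])
mask_mean = masked.mean()

print(sentinel_mean, mask_mean)
```

Both conventions give the same mean over the three valid readings; they differ in what they cost (a reserved value versus an extra array).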

##### Pandas chooses to use sentinels for missing data
- special floating-point NaN value
- Python None object

##### None: Pythonic missing data

- the first sentinel value used by Pandas
- None is a Python object
- it can only be used in arrays with data type 'object'
- object arrays carry much more overhead, so they are not preferred

In [2]:
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])
vals1

# dtype = object 

array([1, None, 3, 4], dtype=object)
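
A small sketch of one practical consequence of the object dtype, using illustrative variable names that are not part of the notebook: operations on an object array run at the Python level, so an aggregation that meets None fails outright.

```python
import numpy as np

obj_vals = np.array([1, None, 3, 4])   # dtype=object because of the None
int_vals = np.array([1, 2, 3, 4])      # native integer dtype

# Aggregations on an object array fall back to Python-level operations,
# so adding an int to None raises a TypeError.
try:
    obj_vals.sum()
    sum_failed = False
except TypeError:
    sum_failed = True

print(obj_vals.dtype, int_vals.dtype, sum_failed)
```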

##### NaN: Missing numerical Data

- NaN is a special floating-point value
- recognised by all systems that use the standard IEEE floating-point representation
- the result of any arithmetic with NaN is another NaN
- supports fast operations, because the array keeps a native floating-point dtype

In [3]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

In [4]:
1 + np.nan

#end result is nan

nan

In [5]:
0 * np.nan

nan

In [6]:
#aggregates with NaN will lead to NaN

vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [7]:
#use special NaN aggregates that will ignore these missing values

np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

# NaN is specifically a floating-point value
# there is no equivalent NaN value for integers, strings, or other data types

(8.0, 1.0, 4.0)
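
One more property worth keeping in mind, shown here as a short sketch with illustrative variable names: NaN never compares equal to anything, including itself, so missing entries should be detected with np.isnan rather than with ==.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN never compares equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# so detect NaN with np.isnan rather than with ==
nan_mask = np.isnan(vals)
n_missing = int(nan_mask.sum())

print(nan_equals_itself, nan_mask, n_missing)
```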

In [8]:
# NaN & None in Pandas

#NaN and None both have their place.
#Pandas can handle both interchangeably
#convert them where appropriate

pd.Series([1, np.nan, 2, None])

#the Series is automatically upcast to a floating-point type
#to accommodate the NaN

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [9]:
x = pd.Series(range(2), dtype = int)
x

#this is dtype int

0    0
1    1
dtype: int32

In [10]:
x[0] = None

#assigning None to an integer Series forces an upcast (see the next cell)
#note: string data, by contrast, is typically stored with an object dtype


In [11]:
x

#None is converted to NaN
#the data type is upcast to floating point to hold the NaN

0    NaN
1    1.0
dtype: float64

##### Pandas handling of NAs by type

- floating - No change - np.nan
- object - No change - None or np.nan
- integer - cast to float64 - np.nan
- boolean - cast to object - None or np.nan
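
The rules above describe the classic NumPy-backed dtypes. As a hedged aside, newer pandas releases (assumed here: pandas 1.0 or later) also provide nullable extension dtypes that keep integers as integers. A minimal sketch:

```python
import pandas as pd

# "Int64" (capital I) requests the nullable integer extension dtype,
# as opposed to NumPy's plain "int64"
s = pd.Series([1, None, 3], dtype="Int64")

print(s.dtype)          # stays an integer dtype instead of float64
print(s.isna().sum())   # the missing entry is stored as pd.NA
```

This is an opt-in feature; without the explicit dtype, the upcasting rules listed above still apply.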

##### Operating on Null Values

- in pandas, None and NaN are treated as essentially interchangeable for indicating missing or null values

- isnull()
Generate a Boolean mask indicating missing values

- notnull()
Opposite of isnull()

- dropna()
Return a filtered version of the data

- fillna()
Return a copy of the data with missing values filled or imputed

##### Detecting null values

- isnull() and notnull()
- either one returns a Boolean mask over the data

In [12]:
data = pd.Series([1, np.nan, 'hello', None])

data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [13]:
#Boolean masks can be used directly as a Series or DataFrame index

data[data.notnull()]

0        1
2    hello
dtype: object

##### Dropping null values

- dropna(), which removes NA values
- fillna(), which fills in NA values

In [14]:
#for series

data.dropna()

0        1
2    hello
dtype: object

In [15]:
#For DataFrames

df = pd.DataFrame([[1, np.nan, 2],
                  [2,   4,    5],
                  [np.nan, 4, 6]])

df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,4.0,5
2,,4.0,6


In [16]:
df[0:2]

#a plain slice selects rows, here the first two

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,4.0,5


In [19]:
idx = pd.IndexSlice

df.loc[idx[:], idx[1]]

0    NaN
1    4.0
2    4.0
Name: 1, dtype: float64

In [24]:
df.dropna()

#drop all rows with any null value present
#default axis = 0

Unnamed: 0,0,1,2
1,2.0,4.0,5


In [26]:
df.dropna(axis = 1)
#change the axis,
#remove all column containing a null value

Unnamed: 0,2
0,2
1,5
2,6


In [27]:
df.dropna(axis = 'columns')
#also can use axis = 'columns'

Unnamed: 0,2
0,2
1,5
2,6


In [28]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,4.0,5
2,,4.0,6


In [30]:
#add a new column 3 filled entirely with NaN

df[3] = np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,4.0,5,
2,,4.0,6,


In [31]:
#setting conditions to drop columns

df.dropna(axis ='columns', how ='all')

# how = 'all'
# will drop rows/columns that are all null values

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,4.0,5
2,,4.0,6


In [32]:
# use a thresh parameter
# specify a minimum number of non-null values 

df.dropna(axis ='rows', thresh = 3)

Unnamed: 0,0,1,2,3
1,2.0,4.0,5,
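
To see what thresh actually counts, the sketch below rebuilds the same DataFrame and tallies the non-null values per row by hand. The variable names non_null_per_row and manual are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 4, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# number of non-null values in each row
non_null_per_row = df.notna().sum(axis=1)

# keeping rows with at least 3 non-null values by hand...
manual = df[non_null_per_row >= 3]

# ...gives the same rows as dropna with thresh=3
same = manual.equals(df.dropna(thresh=3))

print(non_null_per_row.tolist(), same)
```

Only row 1 has three non-null values, which is why it is the only row that survives.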


##### Filling Null values

In [36]:

data = pd.Series([1, np.nan,2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [37]:
# fill NA entries with a single value, e.g. zero

data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [38]:
# forward-fill: propagate the previous valid value forward
# (equivalent to data.fillna(method='ffill'), which recent pandas versions deprecate)

data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [39]:
# back-fill: propagate the next valid value backward

data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [40]:
#for a DataFrame, you can also specify the axis along which the fill takes place

df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,4.0,5,
2,,4.0,6,


In [42]:
# forward-fill along each row; where no previous value exists, the NaN remains
df.ffill(axis=1)

Unnamed: 0,0,1,2,3
0,1.0,1.0,2.0,2.0
1,2.0,4.0,5.0,5.0
2,,4.0,6.0,6.0
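
Beyond a constant or a neighbouring value, a common form of imputation is to fill each column with its own mean. The sketch below is one possible approach, not something the notebook itself runs; it relies on fillna accepting a Series that is aligned on the column labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 4, 5],
                   [np.nan, 4, 6]])

# df.mean() skips NaN by default and returns one mean per column;
# fillna aligns that Series on the column labels
filled = df.fillna(df.mean())

print(filled)
```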


##### Hierarchical Indexing

- multi-indexing incorporates multiple index levels within a single index
- this makes it possible to handle higher-dimensional data
- which can then be represented compactly in a 1-d Series or 2-d DataFrame

##### MultiIndex objects
- slicing
- indexing
- computing statistics across multiply indexed data


##### Multiply Indexed Series

How can two-dimensional data be represented within a one-dimensional Series?
Consider a Series of data in which each point has both a character key and a numerical key.

In [53]:
#this is a list of tuples
index = [('California', 2000),
         ('California', 2010),
         ('New York', 2000), 
         ('New York', 2010),
         ('Texas', 2000), 
         ('Texas', 2010)]

#list of Data
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

#convert the list of Data to Series
pop = pd.Series(populations, index = index)
pop


(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [54]:
#convert the list of tuples into Pandas MultiIndex
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [55]:
#re-index the Series with the MultiIndex,
#so that the MultiIndex built above becomes its new index

pop = pop.reindex(index)
pop

#the first two columns show the two index levels
#the third column shows the data

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [56]:
pop[('California', 2010): ('Texas', 2000)]

California  2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
dtype: int64

In [57]:
#access all data where 2nd index is 2010

pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
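
The same selection can also be written with the xs (cross-section) method, which names the level explicitly. A minimal sketch that rebuilds the population Series from above; pop_2010 is an illustrative name.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

# cross-section: every entry whose second index level equals 2010
pop_2010 = pop.xs(2010, level=1)

print(pop_2010)
```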

In [58]:
#MultiIndex as extra dimension

# the unstack() method converts a multiply indexed Series
# into a conventionally indexed DataFrame

pop_df = pop.unstack()
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [59]:
pop_df = pop.unstack(level = 0)
pop_df

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [60]:
pop_df = pop.unstack(level = 1)
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [61]:
pop_df.stack()

#stack() turns the DataFrame back into a multiply indexed Series

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [62]:
#we can represent 2-dimensional data in a 1-d Series
#and likewise 3 or more dimensions in a Series or DataFrame

#each extra level in a multi-index represents an extra dimension,
#which allows more flexibility

#e.g. we can add a column of demographic data
pop_df =pd.DataFrame({'total': pop,
                      'under18': [9267089, 9284094,
                                  4687374, 4318033,
                                  5906301, 6879014]})

pop_df

# the index has 2 levels
# there are 2 columns, named 'total' & 'under18'

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In [63]:
#ufuncs and other functionality work on MultiIndex

f_u18 = pop_df['under18']/ pop_df['total']
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [64]:
f_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


### Methods of MultiIndex Creation

In [65]:
#pass a list of two index arrays, plus the column names

df = pd.DataFrame(np.random.rand(4,2),
                 index = [['a', 'a', 'b','b'],[1,2,1,2]],
                 columns = ['data1', 'data2'])

df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.013616,0.100854
a,2,0.068487,0.666846
b,1,0.441447,0.32502
b,2,0.447828,0.257232


In [66]:
#pass a plain NumPy array without specifying index or columns
#Pandas then assigns a default integer index and column labels

df = pd.DataFrame(np.random.rand(4,2))
df

Unnamed: 0,0,1
0,0.766991,0.457854
1,0.290777,0.640299
2,0.835853,0.950683
3,0.0002,0.49956


In [67]:
#use a dictionary with tuples as keys
#Pandas recognizes the tuples and builds a MultiIndex from them

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}

pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

### Explicit MultiIndex constructors

using pd.MultiIndex

In [68]:
#from arrays

x = pd.MultiIndex.from_arrays([['a','a','b','b'],
                              [1, 2, 1, 2]])
x

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [69]:
#from tuples

pd.MultiIndex.from_tuples([('a',1), 
                           ('a', 2), 
                           ('b', 1), 
                           ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [70]:
#from Cartesian product of single indices

pd.MultiIndex.from_product([['a','b'], [1,2]])


MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
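
The three constructors above are different routes to the same index. A quick sketch to confirm this, using Index.equals; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Index.equals compares the labels element by element
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)

print(same)
```

Which one to use mostly depends on the shape the labels already have: parallel arrays, a list of pairs, or independent level values to be combined.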

In [71]:
#using levels and codes
#(the codes argument was called labels in older pandas versions)

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0,0,1,1], [0,1,0,1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

##### MultiIndex Level names

In [72]:
pop.index.names = ['state', 'year']
pop

#assign names attribute of the index

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

##### MultiIndex for Columns

In [73]:

# columns can have multiple levels

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names =['year', 'visit'])

columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                    names = ['subject', 'type'])

#mock some data
data = np.round(np.random.randn(4,6), 1)
data[: , ::2 ] *= 10
data += 37

#create the DataFrame
health_data = pd.DataFrame(data, index = index, columns = columns)

health_data

#multi-indexing for both rows and columns
#this is fundamentally four-dimensional data (year, visit, subject, type)

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,35.0,36.6,33.0,35.9,28.0,35.8
2013,2,26.0,36.4,42.0,36.9,39.0,35.5
2014,1,49.0,38.6,10.0,36.4,41.0,34.1
2014,2,55.0,37.9,40.0,37.7,25.0,37.0


In [74]:
#can index the top-level column by person's name
#get the full DataFrame of that Column

health_data['Guido']
#only applies to top level column

Unnamed: 0_level_0,type,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,33.0,35.9
2013,2,42.0,36.9
2014,1,10.0,36.4
2014,2,40.0,37.7


In [75]:
health_data.iloc[:1, :1]

Unnamed: 0_level_0,subject,Bob
Unnamed: 0_level_1,type,HR
year,visit,Unnamed: 2_level_2
2013,1,35.0


##### Multiply Indexed Series

In [76]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [77]:
#access single elements by indexing with multiple terms

pop['California', 2000]

33871648

In [78]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

In [79]:
#Partial Slicing

pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

In [80]:
#partial indexing on lower levels, by passing an empty slice for the first index
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

In [81]:
#selection based on Boolean masks

pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [82]:
#selection based on fancy indexing

pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

#### Multiply indexed DataFrame

In [83]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,35.0,36.6,33.0,35.9,28.0,35.8
2013,2,26.0,36.4,42.0,36.9,39.0,35.5
2014,1,49.0,38.6,10.0,36.4,41.0,34.1
2014,2,55.0,37.9,40.0,37.7,25.0,37.0


In [84]:
#Columns are primary in a DataFrame
health_data['Guido', 'HR']

year  visit
2013  1        33.0
      2        42.0
2014  1        10.0
      2        40.0
Name: (Guido, HR), dtype: float64

In [85]:
#to index by row as well, use the loc and iloc indexers
#(the older ix indexer has been removed from pandas)

health_data.iloc[:2, :2]

Unnamed: 0_level_0,subject,Bob,Bob
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,35.0,36.6
2013,2,26.0,36.4


In [86]:
health_data.iloc[:, :2]

Unnamed: 0_level_0,subject,Bob,Bob
Unnamed: 0_level_1,type,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,35.0,36.6
2013,2,26.0,36.4
2014,1,49.0,38.6
2014,2,55.0,37.9


In [87]:
#for DataFrame to get a desired slice

idx = pd.IndexSlice

health_data.loc[idx[:,1], idx[:,'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,35.0,33.0,28.0
2014,1,49.0,10.0,41.0


In [88]:
health_data.loc[idx[:], idx[:,'HR']]

Unnamed: 0_level_0,subject,Bob,Guido,Sue
Unnamed: 0_level_1,type,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,35.0,33.0,28.0
2013,2,26.0,42.0,39.0
2014,1,49.0,10.0,41.0
2014,2,55.0,40.0,25.0
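
pd.IndexSlice is only a convenience for building tuples of slices inside loc. The sketch below shows the equivalence on a stand-in frame: demo uses placeholder values 0..23 rather than the random health data above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# placeholder values 0..23, not the random data used above
demo = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice

# idx[:, 1] is just the tuple (slice(None), 1)
is_plain_tuple = idx[:, 1] == (slice(None), 1)

with_idx = demo.loc[idx[:, 1], idx[:, 'HR']]
with_slices = demo.loc[(slice(None), 1), (slice(None), 'HR')]

print(is_plain_tuple, with_idx.equals(with_slices))
```

The explicit slice(None) form is needed because Python does not allow a bare colon inside a tuple; IndexSlice simply hides that detail.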


##### Rearranging Multi-Indices
- indices often need rearranging for various computations
- e.g. with the stack() and unstack() methods

##### Sorted and unsorted indices

Many MultiIndex slicing operations will fail if the index is not sorted.
The index needs to be lexicographically sorted.

In [89]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'],[1,2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']

data

char  int
a     1      0.559300
      2      0.306038
c     1      0.539188
      2      0.069620
b     1      0.260145
      2      0.329641
dtype: float64

In [90]:
#taking a partial slice of this unsorted index results in an error

data['a':'b']

UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

In [91]:
#the error above occurs because the index is not sorted
#a MultiIndex needs to be sorted, i.e. in lexicographical order

#sort_index() sorts the index (Series.sortlevel() is no longer available in recent pandas)

data = data.sort_index()
data

#char is sorted in alphabetical order

char  int
a     1      0.559300
      2      0.306038
b     1      0.260145
      2      0.329641
c     1      0.539188
      2      0.069620
dtype: float64

In [92]:
data['a':'b']

char  int
a     1      0.559300
      2      0.306038
b     1      0.260145
      2      0.329641
dtype: float64
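
Whether an index is sorted can be checked before slicing instead of waiting for the error. A minimal sketch, assuming the is_monotonic_increasing property as the check and using placeholder values 0..5 instead of random data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6), index=index)

sorted_before = data.index.is_monotonic_increasing
data = data.sort_index()
sorted_after = data.index.is_monotonic_increasing

# partial slicing is safe once the index is sorted
sliced = data['a':'b']

print(sorted_before, sorted_after, sliced.tolist())
```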

##### Stacking and unstacking indices

- convert between a multiply indexed Series and a DataFrame
- optionally specify the level to use

In [93]:
pop.unstack(level = 0)

state,California,New York,Texas
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [94]:
pop.unstack(level = 1)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [95]:
pop.unstack().stack()

#the opposite of unstack is stack.
#above to recover the original series = MultiIndex Series

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

##### Index setting and resetting

- turn the index labels into columns
- use the reset_index() method
- reset_index(name='...') also names the data column
- set_index() does the reverse, building a MultiIndex from columns

In [96]:
pop_flat = pop.reset_index(name ='population')
pop_flat

#the name argument labels the data column 'population'
#the result is a DataFrame

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


In [97]:
pop_flat.set_index(['state', 'year'])

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561
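
reset_index and set_index are inverses of each other. The sketch below rebuilds the population Series and checks the round trip; flat and restored are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

# index levels become ordinary columns
flat = pop.reset_index(name='population')

# and the columns become a MultiIndex again
restored = flat.set_index(['state', 'year'])['population']

print(list(flat.columns))
print(restored.index.equals(pop.index), restored.tolist() == pop.tolist())
```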


##### Data Aggregation on Multi-Indices

- Pandas has built-in data aggregation methods
- e.g. min(), mean(), median(), sum() and max()
- for hierarchically indexed data, aggregate over a chosen index level by naming that level

In [98]:
health_data

Unnamed: 0_level_0,subject,Bob,Bob,Guido,Guido,Sue,Sue
Unnamed: 0_level_1,type,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,35.0,36.6,33.0,35.9,28.0,35.8
2013,2,26.0,36.4,42.0,36.9,39.0,35.5
2014,1,49.0,38.6,10.0,36.4,41.0,34.1
2014,2,55.0,37.9,40.0,37.7,25.0,37.0


In [99]:
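#average out the measurements of the 2 visits each year
#by grouping on the named index level
#(older pandas also accepted health_data.mean(level='year'), which was removed in pandas 2.0)

data_mean = health_data.groupby(level='year').mean()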
data_mean

subject,Bob,Bob,Guido,Guido,Sue,Sue
type,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,30.5,36.5,37.5,36.4,33.5,35.65
2014,52.0,38.25,25.0,37.05,33.0,35.55


In [100]:
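#mean across the columns, grouped by the 'type' column level
#(transpose, group on the level, then transpose back)
data_mean.T.groupby(level='type').mean().T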

type,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,33.833333,36.183333
2014,36.666667,36.95


In [101]:
health_data.index

MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           names=['year', 'visit'])

In [102]:
health_data.values

array([[35. , 36.6, 33. , 35.9, 28. , 35.8],
       [26. , 36.4, 42. , 36.9, 39. , 35.5],
       [49. , 38.6, 10. , 36.4, 41. , 34.1],
       [55. , 37.9, 40. , 37.7, 25. , 37. ]])

In [103]:
health_data.columns

MultiIndex([(  'Bob',   'HR'),
            (  'Bob', 'Temp'),
            ('Guido',   'HR'),
            ('Guido', 'Temp'),
            (  'Sue',   'HR'),
            (  'Sue', 'Temp')],
           names=['subject', 'type'])

In [104]:
health_data.values[0]

array([35. , 36.6, 33. , 35.9, 28. , 35.8])

##### Combining Datasets: Concat and Append

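- combining datasets ranges from simple concatenation to database-style joins & merges
- these operations work on both DataFrames and Series
- they make this kind of data wrangling fast and straightforward

- this section covers simple concatenation with pd.concat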

In [111]:
# define a function to create a DataFrame of a particular form

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c)+str(i) for i in ind]
           for c in cols}
    return pd.DataFrame(data, ind)

#example DataFrame
make_df('ABC',range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


##### Recall: Concatenation of NumPy Arrays

In [105]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]

np.concatenate([x, y, z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [106]:
x = [[1,2],
     [3,4]]

np.concatenate([x,x])

array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

In [107]:
x = [[1,2],
     [3,4]]

np.concatenate([x,x], axis = 1)

array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

##### Simple Concatenation with pd.concat

- pd.concat() has a similar syntax to np.concatenate

pd.concat(objs, axis=0, join='outer', ignore_index=False,
         keys=None, levels=None, names=None,
         verify_integrity=False, copy=True)

(older pandas releases also accepted a join_axes argument, which has since been removed)

- pd.concat can be used for Series & DataFrame objects

In [108]:
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['D','E','F'], index=[4,5,6])

pd.concat([ser1,ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [109]:
pd.concat([ser1,ser2], axis=1)

Unnamed: 0,0,1
1,A,
2,B,
3,C,
4,,D
5,,E
6,,F


In [115]:
# concatenate high-dimensional objects e.g. DataFrames

df1 = make_df('AB', [1,2])
df2 = make_df('AB', [3,4])

print(df1)

print()

print(df2)

print()

print(pd.concat([df1, df2]))

#concatenation takes place row-wise by default

    A   B
1  A1  B1
2  A2  B2

    A   B
3  A3  B3
4  A4  B4

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


In [116]:
df3 = make_df('AB', [0,1])
df4 = make_df('CD', [0,1])

print(df3)

print()

print(df4)

print()

print(pd.concat([df3,df4], axis=1))

    A   B
0  A0  B0
1  A1  B1

    C   D
0  C0  D0
1  C1  D1

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


###### Duplicate Indices

Pandas concatenation preserves indices, even if the result has duplicate indices.
Rows with the same index label are kept separately rather than being aligned or merged.

In [117]:
x = make_df('AB', [0,1])
y = make_df('AB', [2,3])

y.index = x.index #make duplicate indices

print(x)

print()

print(y)

print()

print(pd.concat([x,y]))

#the index labels are repeated in the result; concat does not merge the rows

    A   B
0  A0  B0
1  A1  B1

    A   B
0  A2  B2
1  A3  B3

    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3


##### Catching the repeats as an error

verify_integrity = True

In [118]:
#to verify that the indices do not overlap,
#set the verify_integrity flag
#with the flag set to True, concatenation raises an exception
#if there are duplicate indices

try:
    pd.concat([x,y], verify_integrity=True)
except ValueError as e:
    print('ValueError:', e)

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


##### Ignoring the index

ignore_index = True

In [120]:
#ignore_index flag set to True
# Concatenation will create a new integer index

print(x)

print()

print(y)

print()

print(pd.concat([x,y], ignore_index=True))
#will create a new integer index

    A   B
0  A0  B0
1  A1  B1

    A   B
0  A2  B2
1  A3  B3

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3


##### Adding MultiIndex Keys

In [121]:

#use the keys option to specify a label for each data source
#the result is a hierarchically indexed DataFrame

print(x)

print()

print(y)

print()

print(pd.concat([x,y], keys =['x', 'y']))

    A   B
0  A0  B0
1  A1  B1

    A   B
0  A2  B2
1  A3  B3

      A   B
x 0  A0  B0
  1  A1  B1
y 0  A2  B2
  1  A3  B3
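
Because the keys become the outer level of a MultiIndex, each source can be recovered from the combined result with ordinary label indexing. A minimal sketch that repeats the make_df helper so it runs on its own; combined and from_y are illustrative names.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined earlier)."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # duplicate indices, as above

combined = pd.concat([x, y], keys=['x', 'y'])

# the outer level of the MultiIndex recovers each original frame
from_y = combined.loc['y']

print(from_y)
```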


##### Concatenation with joins

In [123]:
df5 = make_df('ABC', [1,2])
df6 = make_df('BCD', [3,4])

print(df5)
print()
print(df6)
print()

print(pd.concat([df5, df6]))
#this is a union of the columns; missing entries are filled with NaN

    A   B   C
1  A1  B1  C1
2  A2  B2  C2

    B   C   D
3  B3  C3  D3
4  B4  C4  D4

     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4


In [124]:
# join = 'inner'

print(pd.concat([df5, df6], join='inner'))

#choose intersection of column

    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4


##### The append() method

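- Series and DataFrame objects historically had an append method
- df1.append(df2) gives the same result as pd.concat([df1, df2])
- append does not modify the original object; it builds a new one with the combined data, so repeated appends are inefficient
- append was deprecated in pandas 1.4 and removed in pandas 2.0; prefer pd.concat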

In [139]:
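# note: df2 was evidently redefined elsewhere in the session (the cell counter
# jumps to 139), so the output below shows a numeric df2
# rather than make_df('AB', [3,4])
# on pandas 2.0 and later, use pd.concat([df1, df2]) instead of append
print(df1)
print()

print(df2)
print()

print(df1.append(df2))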

    A   B
1  A1  B1
2  A2  B2

   A  B
0  1  3
1  2  4

    A   B
1  A1  B1
2  A2  B2
0   1   3
1   2   4
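
For readers on a pandas version without DataFrame.append, the sketch below shows the pd.concat equivalent, using the original make_df frames rather than the redefined df2 seen in the output above. The pieces list is an illustrative pattern, not part of the notebook.

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined earlier)."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])

# pd.concat gives the same result that df1.append(df2) used to give
combined = pd.concat([df1, df2])

# when accumulating many pieces, collect them in a list and concatenate once
pieces = [make_df('AB', [i]) for i in range(3)]
all_rows = pd.concat(pieces)

print(combined)
print(all_rows)
```

Concatenating once at the end avoids building a new, larger object for every added piece.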


##### Combining Datasets: Merge and join
- pd.merge function

##### Relational Algebra

- relational algebra is a formal set of rules for manipulating relational data
- Pandas implements these operations in the pd.merge() function and the related join() method of Series and DataFrames

##### Categories of Joins
- one-to-one joins
- many-to-one joins
- many-to-many joins

##### One-to-one joins

A one-to-one join is similar to the column-wise concatenation of two DataFrames.

In [5]:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'employee':['Bob', 'Jake','Lisa','Sue'],
                   'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'],
                   'hire_date': [2004, 2008, 2012, 2014]})

print(df1)
print()
print(df2)

#note both have the same column name 'employee'

#combine both Dataframes
#use column 'employee' as key
print()
df3 = pd.merge(df1, df2)
print(df3)

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR

  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
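
pd.merge found the key on its own here because 'employee' is the only column name the two frames share. The key can also be named explicitly with the on parameter, as in this minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# name the key column explicitly instead of relying on shared column names
df3 = pd.merge(df1, df2, on='employee')

print(df3)
```

Note that the merge matches rows by key value, not by position, and that it discards the original integer indices.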


##### Many-to-one joins

- one of the two key columns contains duplicate entries
- the resulting DataFrame preserves those duplicate entries
- information from the other DataFrame is repeated wherever the key repeats

In [6]:
df4 = pd.DataFrame({'group': ['Accounting','Engineering', 'HR'],
                   'supervisor':['Carly', 'Guido', 'Steve']})

print()
print(df3)
print()
print(df4)
print()
print(pd.merge(df3,df4))


  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014

         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve

  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
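
If a particular join type is expected, pd.merge can check it through its validate parameter. The sketch below is an optional safeguard rather than something the notebook uses; it rebuilds df3 and df4 so it runs on its own.

```python
import pandas as pd

df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                    'hire_date': [2008, 2012, 2004, 2014]})
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})

# passes: 'group' repeats on the left but is unique on the right
merged = pd.merge(df3, df4, validate='many_to_one')

# fails: 'group' is not unique on the left, so this is not one-to-one
try:
    pd.merge(df3, df4, validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(merged)
print(one_to_one_ok)
```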


##### Many-to-many joins

The key columns in both the left and right DataFrames contain duplicates.

In [8]:
df5 = pd.DataFrame({'group':['Accounting','Accounting','Engineering','Engineering','HR','HR'],
                   'skills':['math','spreadsheets','coding','linux','spreadsheets','organization']})

print(df1)
print()
print(df5)
print()
print(pd.merge(df1,df5))

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR

         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization

  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
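
The size of a many-to-many result follows from the key counts: each key contributes (rows on the left) times (rows on the right). A short sketch that checks this for the frames above; left_counts, right_counts and expected_rows are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding',
                               'linux', 'spreadsheets', 'organization']})

merged = pd.merge(df1, df5)

# each key contributes (rows on the left) * (rows on the right) to the result
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()
expected_rows = int((left_counts * right_counts).sum())

print(len(merged), expected_rows)
```

Here that is 1*2 for Accounting, 2*2 for Engineering and 1*2 for HR, giving the 8 rows shown above.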


In [None]:
##### Sp