# Pandas

In the previous chapter, we dove into detail on NumPy and its ndarray object, which
provides efficient storage and manipulation of dense typed arrays in Python. Here
we’ll build on this knowledge by looking in detail at the data structures provided by
the Pandas library. Pandas is a newer package built on top of NumPy, and provides an
efficient implementation of a DataFrame.

- DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

- Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks.

Intro video: https://www.youtube.com/watch?v=dcqPhpY7tWk

In [1]:
# %%cmd 
# pip list --outdated

Run `pip install --upgrade pandas` to update Pandas if your installed version is too outdated.

In [1]:
import pandas
pandas.__version__

'1.5.3'

In [3]:
# programmers usually import Pandas under the alias pd
import pandas as pd
import numpy as np

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Pandas objects

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. There are three fundamental Pandas data structures: the `Series`, `DataFrame`, and `Index`.

### Pandas `Series` object

A Pandas Series is a one-dimensional array of indexed data.
 
- `pd.Series(data, index=index)`
    
- It wraps both a sequence of values and a sequence of indices, which you can access with the `values` and `index` attributes.
   
- The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
    
- Specialized dictionary: a Series is a structure that maps typed keys to a set of typed values

In [13]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the preceding output, the `Series` wraps both a sequence of values and a
sequence of indices, which we can access with the `values` and `index` attributes.
- The values are simply a familiar `NumPy` array.
- The index is an array-like object of type `pd.Index`.

In [16]:
# the values are a NumPy array
data.values
# index is an array-like object of type pd.Index
data.index

array([0.25, 0.5 , 0.75, 1.  ])

RangeIndex(start=0, stop=4, step=1)

In [18]:
# indexing (like NumPy array)
data[0]
# slicing
data[1:3]

0.25

1    0.50
2    0.75
dtype: float64

The Pandas `Series` is much more general and flexible than the one-dimensional NumPy array that it emulates. 

The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

In [23]:
# you can define explicit index 
data = pd.Series([.25, .5, .75, 1], 
                index = ['a', 'b', 'c', 'd'])
data
data.index

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Index(['a', 'b', 'c', 'd'], dtype='object')

In [37]:
# noncontiguous indices
data = pd.Series([.25, .5, .75, 1], 
                index = [2, 5, 1, 3])
data

2    0.25
5    0.50
1    0.75
3    1.00
dtype: float64

#### Series as specialized dictionary

Think of a Pandas Series as a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: 

just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.
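
As a rough illustration of the efficiency claim above (a sketch, not a benchmark): arithmetic on a Series runs as a single vectorized operation over a typed array, while a plain dict needs an explicit Python-level loop. The names `pop_dict_demo` and `pop_series_demo` are illustrative only.

```python
import pandas as pd

pop_dict_demo = {'California': 38332521, 'Texas': 26448193}
pop_series_demo = pd.Series(pop_dict_demo)

# dict: every item is visited by a Python-level loop
millions_dict = {state: value / 1e6 for state, value in pop_dict_demo.items()}

# Series: one vectorized operation on the underlying typed array
millions_series = pop_series_demo / 1e6

print(millions_series.dtype)  # float64
```

Both approaches give the same numbers; the difference is where the loop runs.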

Let's construct a Series object directly from a Python dictionary:

In [5]:
import pandas as pd

# create a dict about population in 5 states
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)

population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [7]:
# unlike dict, Series supports slicing
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

In [16]:
population['California']

population[['California']]

38332521

California    38332521
dtype: int64

In [32]:
# data can be scalar and will be repeated to fill in 
pd.Series(5, index = [100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [38]:
# data can be dict 
pd.Series({2:'a', 1:'b', 4:'c'})

2    a
1    b
4    c
dtype: object

In [44]:
# the index can be explicitly set
pd.Series({2:'a', 1:'b',3:'c'})

pd.Series({2:'a', 1:'b',3:'c'}, index = (3,2))

2    a
1    b
3    c
dtype: object

3    c
2    a
dtype: object
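
The explicit index works in both directions: dict keys that are not listed are dropped, and listed labels that are not keys of the dict get a missing value. A minimal sketch, where the label `7` is an arbitrary label chosen for illustration:

```python
import pandas as pd

# 1 is a key but not requested; 7 is requested but not a key
letters = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 7])
print(letters)

# the entry for label 7 is missing (NaN)
missing_flags = letters.isna().tolist()
print(missing_flags)  # [False, False, True]
```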

### Pandas `DataFrame` Object

DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

- DataFrame as a generalized NumPy array
    - sequence of aligned Series objects

- DataFrame as specialized dictionary
    - a DataFrame maps a column name to a Series of column data.

- Constructing DataFrame objects

In [None]:
import pandas as pd

In [20]:
# population Series
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# land area Series
area_dict = {'California':423967,
             'Texas':695662,
             'New York':141297,
             'Florida':170312,    
             'Illinois':149995}

In [21]:
# create population Series
population = pd.Series(population_dict)
population
# create area Series
area = pd.Series(area_dict)
area

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [22]:
# create a two-dimensional object with dicts of Series
states = pd.DataFrame({'population':population,
                       'area':area})

states
type(states)

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


pandas.core.frame.DataFrame

A DataFrame object exposes several useful attributes:
- index: row labels
- columns: column labels
- size: number of cells (rows × columns)
- dtypes: data type of each column
- shape: (number of rows, number of columns)
- ndim: number of dimensions (always 2 for a DataFrame)
- values: NumPy array of the values stored in the DataFrame

Reference: https://www.geeksforgeeks.org/dataframe-attributes-in-python-pandas/

In [23]:
# index attribute
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [63]:
# columns attribute
# an Index object holding the column labels
states.columns

Index(['population', 'area'], dtype='object')

In [27]:
states.size

10

In [28]:
# each column data type
states.dtypes

population    int64
area          int64
dtype: object

In [29]:
# shape of the DataFrame, like in array
states.shape
# dimensions of the DataFrame, like in array
states.ndim

(5, 2)

2

In [31]:
# array of the values
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]], dtype=int64)

In [64]:
# use column names to extract the Series object of the area
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [219]:
# extract multiple columns using list
states[['population', 'area']]

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


#### Different ways of constructing DataFrame objects

- A single Series: a DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.

- A list of dicts

- A dict of Series objects

- 2-dimensional NumPy array

- NumPy structured array

In [68]:
# single column df
pd.DataFrame(population, columns = ['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


In [32]:
# create a list of dicts
data = [{'a':i,'b':2*i} for i in range (3)]

data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [34]:
# plug the list of dicts in the dataframe func
# key --> column labels
# values --> cell values in each column
data[0].keys()
data[0].values()

pd.DataFrame(data)

dict_keys(['a', 'b'])

dict_values([0, 0])

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [74]:
# if some keys are missing
# pandas will fill them in with NaN (which means not a number)
pd.DataFrame([{'a':1,'b':2},
              {'b':3,'c':4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [225]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [75]:
# create DF with a dictionary of Series objects
pd.DataFrame({'population':population,
              'area':area})

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [41]:
# use 2-dimensional array
pd.DataFrame(np.random.rand(3,2))

# specify the column names and row names
pd.DataFrame(np.random.rand(3,2),
            columns = ['foo', 'bar'],
            index = ['a','b','c'])

Unnamed: 0,0,1
0,0.66646,0.216703
1,0.517528,0.198782
2,0.473858,0.116726


Unnamed: 0,foo,bar
a,0.454616,0.098205
b,0.336848,0.892919
c,0.183177,0.611907


In [36]:
import numpy as np

# use NumPy structured array
A = np.zeros(3, dtype = [('A', 'i8'), ('B', 'f8'),('C','i8')])
A

array([(0, 0., 0), (0, 0., 0), (0, 0., 0)],
      dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])

In [39]:
ADF = pd.DataFrame(A)
ADF

ADF.index
ADF.dtypes

Unnamed: 0,A,B,C
0,0,0.0,0
1,0,0.0,0
2,0,0.0,0


RangeIndex(start=0, stop=3, step=1)

A      int64
B    float64
C      int64
dtype: object

#### Practice: creating DataFrame

Now, let's apply the methods introduced above to make some DataFrames.

1. Create a DataFrame from a NumPy array
    1. create a 3x4 NumPy array
    2. convert it to a DataFrame whose index labels are (a, b, c)
    3. give the columns whatever names you like

In [42]:
pd.DataFrame(np.random.rand(3,4),
            index = ['a','b','c'],
            columns = [1,2,3,4])

Unnamed: 0,1,2,3,4
a,0.006264,0.361665,0.021791,0.426539
b,0.946086,0.506478,0.415373,0.287016
c,0.135765,0.0583,0.347697,0.811795


2. Create a DataFrame from a list of 3 dicts (you can put whatever you want in the dicts)

In [43]:
pd.DataFrame([{'food':'shrimp','drink':'cola'},
              {'food':'burger', 'drink':'sprite'},
              {'food':'dumpling','drink':'tea','dessert':'mochi'}],
            index = ['Mon','Tue','Wed'])

Unnamed: 0,food,drink,dessert
Mon,shrimp,cola,
Tue,burger,sprite,
Wed,dumpling,tea,mochi


### Pandas `Index` object

As we have seen, Series and DataFrame objects both contain an explicit *index* that lets you reference and modify data.

This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values).
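
To make the "multiset" remark concrete, here is a small sketch of an Index with a repeated label (the labels and values are made up for illustration). Selecting a repeated label returns every matching entry.

```python
import pandas as pd

dup_index = pd.Index(['a', 'b', 'a'])
print(dup_index.is_unique)      # False

dup_series = pd.Series([1, 2, 3], index=dup_index)
# both entries labelled 'a' come back, as a Series
both_a = dup_series.loc['a']
print(both_a)
```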

- Index as immutable array
    - can also do indexing
    - similar attributes 
    - difference: `Index` objects are immutable

- Set operations
    - Pandas objects are designed to facilitate operations such as joins across datasets.
    - Use the following methods to perform set operations on `Index` objects:
        - `Index_A.intersection(Index_B)` (replaces `Index_A & Index_B`)
        - `Index_A.union(Index_B)` (replaces `Index_A | Index_B`)
        - `Index_A.symmetric_difference(Index_B)` (replaces `Index_A ^ Index_B`)
        - Note: the textbook uses the operators `&`, `|`, and `^` for these set operations, but that usage is deprecated.

In [45]:
# create an index object
ind = pd.Index(list(range(1,12,2)))
ind

Int64Index([1, 3, 5, 7, 9, 11], dtype='int64')

In [46]:
# Index as immutable array
ind[1]
ind[::2]

3

Int64Index([1, 5, 9], dtype='int64')

In [49]:
# Index object has many of the attributes familiar from NumPy arrays
# how many elements
ind.size
# shape
ind.shape
# dimension of array
ind.ndim
# data type in array
ind.dtype

6

(6,)

1

dtype('int64')

In [50]:
# Index is immutable, you will see an error msg:
# TypeError: Index does not support mutable operations
ind

#ind[2] = 0

Int64Index([1, 3, 5, 7, 9, 11], dtype='int64')
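
The commented-out assignment above can be tried safely by catching the exception. This is a sketch using a fresh Index named `ind_demo` (an illustrative name); the last line shows one possible way to get a modified copy instead of changing the Index in place.

```python
import pandas as pd

ind_demo = pd.Index([1, 3, 5, 7, 9, 11])

raised = False
try:
    ind_demo[2] = 0          # item assignment is not allowed on an Index
except TypeError as err:
    raised = True
    print("TypeError:", err)

# build a new Index rather than mutating the old one
new_ind = ind_demo.delete(2).insert(2, 0)
print(list(new_ind))         # [1, 3, 0, 7, 9, 11]
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.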

In [52]:
# create 2 Index objects
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# set operations
# intersection
indA.intersection(indB)
# union
indA.union(indB)
# symmetric difference
indA.symmetric_difference(indB)

Int64Index([3, 5, 7], dtype='int64')

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

Int64Index([1, 2, 9, 11], dtype='int64')

### Data indexing and selection
Accessing and modifying values in Pandas one-dimensional `Series` and two-dimensional `DataFrame` objects.

- Slicing: 
    - when you slice with an explicit index (e.g., `data['a':'c']`), the final index is included in the slice
    - when you slice with an implicit index (e.g., `data[0:2]`), the final index is excluded from the slice
- Indexers: the `loc` and `iloc` attributes prevent confusion when the index labels are integers. 
    - `loc`: always references the explicit index (labels)
    - `iloc`: always references the implicit Python-style index (positions)
    - `ix` has been removed from Pandas, though it still appears in the textbook.
    - Principle: **explicit is better than implicit**

In [250]:
# series as dictionary
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index = ['a', 'b', 'c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [256]:
# dict-like membership test on the index
'a' in data

# inspect the index labels, as with a dict
data.keys()

list(data.items())

True

Index(['a', 'b', 'c', 'd'], dtype='object')

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [257]:
data
# series as one-dimensional array
data['a':'c']

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

a    0.25
b    0.50
c    0.75
dtype: float64

In [129]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [261]:
data
# masking
data[(data>0.3) & (data < 0.8)]

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

b    0.50
c    0.75
dtype: float64

In [264]:
# fancy indexing
data[['a', 'd','c']]

a    0.25
d    1.00
c    0.75
dtype: float64

In [164]:
# indexer
data
# loc allows indexing and slicing that 
# always references the explicit index
data.loc['a']
data.loc['a':'c']

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

0.25

a    0.25
b    0.50
c    0.75
dtype: float64

In [266]:
data
# iloc allows indexing and slicing that 
# references the implicit Python-style index
data.iloc[1]
data.iloc[1:3]

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

0.5

b    0.50
c    0.75
dtype: float64

In [269]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

# explicit
data.loc[1]
data.loc[1:3]

# implicit
data.iloc[1]
data.iloc[1:3]

1    a
3    b
5    c
dtype: object

'a'

1    a
3    b
dtype: object

'b'

3    b
5    c
dtype: object
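
The reason `loc` and `iloc` are preferred is that plain square-bracket access is inconsistent when the labels are integers. A short sketch on the same values (the variable name `labelled` is illustrative):

```python
import pandas as pd

labelled = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# plain indexing with an integer uses the explicit labels ...
by_label = labelled[1]
print(by_label)                 # a

# ... but plain slicing uses the implicit positions
by_position = labelled[1:3]
print(by_position.tolist())     # ['b', 'c']
```

Writing `labelled.loc[1]` or `labelled.iloc[1:3]` states the intent directly and avoids this ambiguity.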

#### Data selection in `DataFrame`

- DF as a dict

- DF as two-dimensional array

In [271]:
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [272]:
# attribute-style
data.area 
# dict-style
data['area']
# both forms return the same object
data.area is data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

True

#### attribute-style vs. dict-style
Attribute-style does not work for all cases!

- For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible.
    - Ex., the DataFrame has a `pop()` method, so `data.pop` will point to this rather than the "pop" column
    
- In particular, you should avoid the temptation to try column assignment via attribute (i.e., use `data['pop'] = z` rather than `data.pop = z`).
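
A small sketch of both pitfalls, using a two-row frame named `frame` (an illustrative name, with values taken from the states data). Pandas emits a warning for the attribute assignment, which is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                     index=['California', 'Texas'])

# `frame.pop` is the DataFrame.pop method, not the 'pop' column
pop_is_method = callable(frame.pop)
print(pop_is_method)                       # True

# assigning to a new attribute does NOT create a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.density = frame['pop'] / frame['area']
created_by_attribute = 'density' in frame.columns
print(created_by_attribute)                # False

# dict-style assignment does create the column
frame['density'] = frame['pop'] / frame['area']
created_by_key = 'density' in frame.columns
print(created_by_key)                      # True
```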

In [274]:
# modify object 
# add a new column calculated from other 2 existing columns
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [190]:
data.keys()

# extract the 2-dimensional NP array
data.values

Index(['area', 'pop', 'density'], dtype='object')

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [191]:
# transpose dataframe to swap rows and columns
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967.0,695662.0,141297.0,170312.0,149995.0
pop,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


In [193]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [275]:
data
data.iloc[:3, :2]
data.loc[:'New York', :'pop']

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [195]:
data.loc[:'Illinois', :'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [276]:
data.loc[data.density > 100, ['pop', 'density']]

# dict-style
data.loc[data['density'] < 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


Unnamed: 0,pop,density
California,38332521,90.413926
Texas,26448193,38.01874
Illinois,12882135,85.883763


In [277]:
# modify
data.iloc[0, 2] = 90

data.loc['California', 'density'] = 100

data

Unnamed: 0,area,pop,density
California,423967,38332521,100.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [47]:
np.random.randint(0, 10, size = (3, 4))

array([[8, 7, 3, 6],
       [5, 1, 9, 3],
       [4, 8, 1, 4]])

In [15]:
#pandas operations
np.random.seed(1)

#any NumPy ufunc will work on Pandas Series and DataFrame objects
ser = pd.Series(np.random.randint(0, 10, 4))

# create a dataframe (df)
df = pd.DataFrame(np.random.randint(0, 10, size = (3, 4)), columns=['A', 'B', 'C', 'D'])

ser 
df

0    5
1    8
2    9
3    5
dtype: int32

Unnamed: 0,A,B,C,D
0,0,0,1,7
1,6,9,2,4
2,5,2,4,2


In [16]:
np.exp(ser)
np.sum(ser)

0     148.413159
1    2980.957987
2    8103.083928
3     148.413159
dtype: float64

27

In [280]:
#index alignment in dataframe
np.random.seed(1)

A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('BAC'))

print(A); print('\n') ;print(B)

    A   B
0   5  11
1  12   8


   B  A  C
0  9  5  0
1  0  1  7
2  6  9  2


In [281]:
A + B
# use method
A.add(B)

Unnamed: 0,A,B,C
0,10.0,20.0,
1,13.0,8.0,
2,,,


Unnamed: 0,A,B,C
0,10.0,20.0,
1,13.0,8.0,
2,,,


In [282]:
#fill na with 0
A.add(B, fill_value = 0)

Unnamed: 0,A,B,C
0,10.0,20.0,0.0
1,13.0,8.0,7.0
2,9.0,6.0,2.0


In [283]:
# this doesn't work:
# fill_value must be numeric (a float or None), not a string
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.add.html
A.add(B, fill_value = 'str')

ValueError: could not convert string to float: 'str'

In [284]:
C = A.add(B)
# DF method called fillna()
# any type of data can be filled in
C.fillna('str')

Unnamed: 0,A,B,C
0,10.0,20.0,str
1,13.0,8.0,str
2,str,str,str


In [286]:
# fill NA with the mean of all values in A
# stack() reshapes A into a Series so that mean() covers every value
fill = A.stack().mean()
fill
A.add(B, fill_value = fill)

9.0

Unnamed: 0,A,B,C
0,10.0,20.0,9.0
1,13.0,8.0,16.0
2,18.0,15.0,11.0


In [288]:
#other pandas dataframe operations
np.random.seed(1)

A = np.random.randint(0,10,(3,4))
df = pd.DataFrame(A, columns=list('QRST'))

A 
df

# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
df.iloc[0]

# row-wise subtraction
df - df.iloc[0]

array([[5, 8, 9, 5],
       [0, 0, 1, 7],
       [6, 9, 2, 4]])

Unnamed: 0,Q,R,S,T
0,5,8,9,5
1,0,0,1,7
2,6,9,2,4


Q    5
R    8
S    9
T    5
Name: 0, dtype: int32

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-5,-8,-8,2
2,1,1,-7,-1


In [290]:
# demonstrate the arithmetic operations in dataframe 
# similar to NumPy array

df

#df - 'R' column
df['R']

# subtraction method 
df.sub(df['R'], axis=0)

Unnamed: 0,Q,R,S,T
0,5,8,9,5
1,0,0,1,7
2,6,9,2,4


0    8
1    0
2    9
Name: R, dtype: int32

Unnamed: 0,Q,R,S,T
0,-3,0,1,-3
1,0,0,1,7
2,-3,0,-7,-5


In [299]:
#Hierarchical/multi Indexing
df = pd.DataFrame(np.random.rand(4, 2), 
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

df

# select the row labelled ('a', 1): by position (iloc) and by label (loc)
df.iloc[0]
df.loc['a',1]

# select the row with outer label 'b' and inner label 2
df.loc['b', 2]

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.397677,0.165354
a,2,0.927509,0.347766
b,1,0.750812,0.725998
b,2,0.883306,0.623672


data1    0.397677
data2    0.165354
Name: (a, 1), dtype: float64

data1    0.397677
data2    0.165354
Name: (a, 1), dtype: float64

data1    0.883306
data2    0.623672
Name: (b, 2), dtype: float64

data1    0.750812
data2    0.725998
Name: (b, 1), dtype: float64

#### Trade-Offs in Missing Data Conventions

Typically, there are 2 strategies: 
1. using a mask that globally indicates missing values
    - In the masking approach, the mask might be an entirely separate Boolean array
    - or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.
2. choosing a sentinel value that indicates a missing entry
    - the sentinel value could be some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern
    - or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification
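
The two strategies can be sketched side by side in NumPy. This is only an illustration of the idea: `np.ma.masked_array` stands in for the masking approach and `np.nan` for the sentinel approach; the array values are made up.

```python
import numpy as np

raw = np.array([1.0, 2.0, 3.0, 4.0])

# Strategy 1: a separate Boolean mask marks the second entry as missing
masked = np.ma.masked_array(raw, mask=[False, True, False, False])
masked_total = masked.sum()
print(masked_total)            # 8.0 (the masked entry is skipped)

# Strategy 2: a sentinel value (NaN) is stored in the data itself
sentinel = raw.copy()
sentinel[1] = np.nan
sentinel_total = np.nansum(sentinel)
print(sentinel_total)          # 8.0 (nansum ignores the NaN)
```

The mask costs an extra Boolean array, while the sentinel takes one value out of the range the data can represent.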
    
#### Missing Data in Pandas 

1. The way in which Pandas handles missing values is constrained by its reliance on the `NumPy` package, which does not have a built-in notion of NA values for non-floating-point data types.

2. Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point `NaN` value, and the Python `None` object.

##### None: Pythonic missing data

Because `None` is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type
'object' (i.e., arrays of Python objects).
- While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types.

In [14]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [16]:
for dtype in ['object', 'int']:
    print("dtype = ", dtype)
    %timeit np.arange(1E6, dtype = dtype).sum()
    print()

dtype =  object
65.6 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype =  int
2.05 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
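
Speed is not the only cost. Aggregations over an object array that contains `None` generally fail, because adding an integer and `None` is undefined in Python. A minimal sketch (the name `obj_vals` is illustrative):

```python
import numpy as np

obj_vals = np.array([1, None, 3, 4])

sum_failed = False
try:
    obj_vals.sum()
except TypeError as err:
    sum_failed = True
    print("TypeError:", err)
```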



##### NaN: missing numerical data

The other missing data representation, `NaN` (acronym for Not a Number), is different;
it is a special floating-point value recognized by all systems that use the standard
IEEE floating-point representation.

- Unlike an object array, an array containing `NaN` keeps a native floating-point dtype, so it supports fast operations pushed into compiled code.
- You should be aware that `NaN` is a bit like a data virus—it infects any other object it touches.
    - Regardless of the operation, the result of arithmetic with `NaN` will be another `NaN`.
    - NumPy does provide some special aggregations that will ignore these missing values: `nansum()`, `nanmin()`, `nanmax()`. Note: these are functions.
- Keep in mind that `NaN` is specifically a floating-point value; there is no equivalent `NaN` value for integers, strings, or other types.

- NA & NaN
    - NA ("not available"): the general concept of a missing value, found across languages and tools.
         - R uses `NA` to represent a missing data point, which plays a role similar to `NaN` in Pandas.
         - MS Excel uses `#N/A` to represent a missing data point.
    - NaN: the specific floating-point value that NumPy and Pandas use to represent a missing data point.
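
One practical consequence of `NaN` being an IEEE floating-point value is that it never compares equal to anything, including itself, so missing values cannot be found with `==`. A short sketch of the checks that do work:

```python
import numpy as np
import pandas as pd

equal_to_itself = (np.nan == np.nan)
print(equal_to_itself)                 # False

# use a dedicated test instead of ==
print(np.isnan(np.nan))                # True
none_is_na = pd.isna(None)
nan_is_na = pd.isna(np.nan)
print(none_is_na, nan_is_na)           # True True
```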

In [9]:
vals2 = np.array([1, np.nan, 3, 4])
vals2
vals2.dtype

array([ 1., nan,  3.,  4.])

dtype('float64')

In [5]:
# use nan in any arithmetic operations
# it's generating another nan
1 + np.nan

nan

In [7]:
0 * np.nan

nan

In [12]:
# if you do any arithmetic with NaN
# the result is always NaN

vals2

vals2.sum()

vals2.min()

vals2.max()

array([ 1., nan,  3.,  4.])

nan

nan

nan

In [17]:
# numpy functions handling NaNs in array
# do arithmetic operations ignoring NaNs
np.nansum(vals2)
np.nanmin(vals2)
np.nanmax(vals2)

8.0

1.0

4.0

##### NaN and None in Pandas

`NaN` and `None` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate.

- For types that don’t have an available sentinel value, Pandas automatically type-casts when NA values are present.

In [106]:
# None and NaN are 2 different types
type(None)
type(np.nan)

NoneType

float

In [110]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [115]:
x = pd.Series(range(2), dtype = int)
x

0    0
1    1
dtype: int32

#### Upcasting conventions
For example, if we set a value in an integer array to `np.nan`, it will automatically be upcast to a floating-point type to accommodate the NA. 

- Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value. (The textbook mentions a proposal to add a native integer NA; recent Pandas versions, including the 1.5.3 used here, provide this as the opt-in nullable `Int64` extension dtype, but the default integer dtype still upcasts as described.)

| Typeclass | Conversion when storing NAs | NA sentinel value |
|-----------|-----------------------------|-------------------|
| floating  | No change                   | np.nan            |
| object    | No change                   | None or np.nan    |
| integer   | Cast to float64             | np.nan            |
| boolean   | Cast to object              | None or np.nan    |

- Keep in mind that in Pandas, string data (`str`) is stored with an `object` dtype by default.
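
The rows of the table can be checked directly. The sketch below also shows the nullable `Int64` extension dtype, which is opt-in: it has to be requested by name, and it keeps integer values alongside a missing entry instead of upcasting. Variable names are illustrative.

```python
import pandas as pd

as_float = pd.Series([1, 2, None])                  # integer -> float64
as_object = pd.Series([True, False, None])          # boolean -> object
nullable = pd.Series([1, 2, None], dtype='Int64')   # stays integer

print(as_float.dtype, as_object.dtype, nullable.dtype)
print(nullable.isna().tolist())                     # [False, False, True]
```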

In [116]:
x

0    0
1    1
dtype: int32

In [113]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

In [117]:
x[0] = np.nan
x

0    NaN
1    1.0
dtype: float64

In [119]:
x = pd.Series(range(2), dtype = int)
x

0    0
1    1
dtype: int32

#### Operating on Null values

Pandas treats the `None` object and the `NaN` value as essentially interchangeable for indicating missing (null) values.

To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.

- `isnull()`: generate a Boolean mask indicating missing values
- `notnull()`: opposite of `isnull()`
- `dropna()`: return a filtered version of the data
- `fillna()`: return a copy of the data with missing values filled or imputed

In [120]:
#detect null
data = pd.Series([1, np.nan, 'hello', None])
print(data)

0        1
1      NaN
2    hello
3     None
dtype: object


In [121]:
#count null elements
data.isnull()
np.count_nonzero(data.isnull())
np.sum(data.isnull())

0    False
1     True
2    False
3     True
dtype: bool

2

2

In [7]:
data
#show not null elements
data[data.notnull()]

0        1
1      NaN
2    hello
3     None
dtype: object

0        1
2    hello
dtype: object

In [124]:
# count how many null values in the Series object
data.isnull().sum()
data.notnull().sum()

2

In [8]:
#drop null values
data.dropna()

0        1
2    hello
dtype: object

In [310]:
df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
print(df)


#drop na by row/column
df.dropna(axis = 1)

#drop 'any'/'all' na
df.dropna(how = 'any', axis = 1)

df[1] = np.nan
df
df.dropna(how = 'all', axis = 1)

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


Unnamed: 0,2
0,2
1,5
2,6


Unnamed: 0,2
0,2
1,5
2,6


Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,,5
2,,,6


Unnamed: 0,0,2
0,1.0,2
1,2.0,5
2,,6
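
Besides `how`, `dropna()` also accepts a `thresh` parameter giving the minimum number of non-null values a row or column must have to be kept. A sketch on the same values as the original frame (before column 1 was overwritten), with the illustrative name `frame_na`:

```python
import numpy as np
import pandas as pd

frame_na = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])

# keep only the rows that have at least 3 non-null values
kept = frame_na.dropna(axis='index', thresh=3)
print(kept)
```

Only the middle row has three non-null values, so it is the only one kept.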


In [None]:
#filling null values
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

print(data)
#fill na with 0
data.fillna(0)

# forward-fill & back-fill

# data.ffill() / data.bfill() are equivalent to fillna(method='ffill' / 'bfill'),
# a spelling that newer pandas releases deprecate
data.ffill()
data.bfill()
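
For reference, a sketch of what each fill produces on the same values. The Series is rebuilt under the illustrative name `gaps`, and `ffill()` / `bfill()` are used as the method forms of forward-fill and back-fill.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

zero_filled = gaps.fillna(0).tolist()
forward = gaps.ffill().tolist()     # carry the previous value forward
backward = gaps.bfill().tolist()    # carry the next value backward

print(zero_filled)   # [1.0, 0.0, 2.0, 0.0, 3.0]
print(forward)       # [1.0, 1.0, 2.0, 2.0, 3.0]
print(backward)      # [1.0, 2.0, 2.0, 3.0, 3.0]
```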

### Hierarchical indexing

Pandas can deal with higher-dimensional data, that is, data indexed by more than one or two keys. 

- Older versions of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and four-dimensional data; these have since been removed from the library and are not available in the 1.5.3 release used here

- You can also use hierarchical indexing (multi-indexing) to incorporate multiple index levels within a single index

#### A multiply indexed series

Let's compare the old way and the new way (MultiIndex) and see which is better.

In [127]:
# ordinary Series object
index = ['California', 'New York','Texas']

# year-2000 populations, one value per state
populations = [33871648, 18976457, 20851820]

pd.Series(populations, index = index)

California    33871648
New York      18976457
Texas         20851820
dtype: int64

In [128]:
# suppose you have data about the same states from 2 different years
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]

pop = pd.Series(populations, index = index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [53]:
# when you index or slice the series 
pop[('California', 2000)]

# if you try to select all values from 2000
pop[[i for i in pop.index if i[1] == 2000]]

33871648

(California, 2000)    33871648
(New York, 2000)      18976457
(Texas, 2000)         20851820
dtype: int64

In [150]:
# Pandas MultiIndex
index

index = pd.MultiIndex.from_tuples(index)
index

[('California', 2000),
 ('California', 2010),
 ('New York', 2000),
 ('New York', 2010),
 ('Texas', 2000),
 ('Texas', 2010)]

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [151]:
index.levels
index.nlevels

FrozenList([['California', 'New York', 'Texas'], [2000, 2010]])

2

In [152]:
# reindex the Series objects
pop

pop = pop.reindex(index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

##### MultiIndex as extra dimension

We could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind.

- The `unstack()` method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame

Then why do we use MultiIndex? 

- Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent.
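- For example, adding another measurement for each state and year (say, the population under 18) would need a third dimension in the unstacked form; with a MultiIndex it is as easy as adding another column to the DataFrame.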

In [153]:
# unstack() converts a multiply indexed Series into a conventionally indexed DataFrame
pop_df = pop.unstack()
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [154]:
# stack() method does the opposite thing
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [155]:
# we add another column to the DataFrame
pop

pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,4687374, 4318033, 5906301, 6879014]})
pop_df

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


In [156]:
# ufuncs and other operations work with hierarchical indices as well
f_u18 = pop_df['under18'] / pop_df['total']
f_u18
f_u18.unstack()

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


#### Methods of MultiIndex creation

1. Pass a list of two or more index arrays to the constructor.
2. Pass a dict with appropriate tuples as keys. Pandas will automatically recognize this and use a MultiIndex by default. 
3. Use the explicit MultiIndex constructors in `pd.MultiIndex`:
    - from a list of arrays, giving the index values within each level
    - from a list of tuples, giving the multiple index values of each point
    - from a Cartesian product of single indices

In [157]:
# create MultiIndex 
# a list of two or more index arrays
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.979145,0.440986
a,2,0.517708,0.745335
b,1,0.459661,0.170616
b,2,0.305732,0.728103


In [158]:
# pass a dict
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

In [159]:
# use class method constructors
# from list of arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [160]:
# from a list of tuples
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [161]:
# from Cartesian product
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [165]:
pop.index
type(pop.index)
pop.index.levels

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           names=['state', 'year'])

pandas.core.indexes.multi.MultiIndex

FrozenList([['California', 'New York', 'Texas'], [2000, 2010]])

In [166]:
# name the MultiIndex levels
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64


In [167]:
# multiIndex for columns
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
index
columns

MultiIndex([(2013, 1),
            (2013, 2),
            (2014, 1),
            (2014, 2)],
           names=['year', 'visit'])

MultiIndex([(  'Bob',   'HR'),
            (  'Bob', 'Temp'),
            ('Guido',   'HR'),
            ('Guido', 'Temp'),
            (  'Sue',   'HR'),
            (  'Sue', 'Temp')],
           names=['subject', 'type'])
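
One way to see these two hierarchical indices at work is to build a DataFrame from them. The sketch below is only an illustration: the values are made-up random numbers, and the name `health_data` and the seed are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# made-up measurements: one row per (year, visit), one column per (subject, type)
rng = np.random.default_rng(0)
health_data = pd.DataFrame(rng.normal(size=(4, 6)).round(1),
                           index=index, columns=columns)

# a top-level column label returns the sub-DataFrame for that subject
guido = health_data['Guido']

# a row tuple plus a column tuple addresses a single cell
one_value = health_data.loc[(2013, 1), ('Guido', 'HR')]
```

With hierarchical rows and columns the frame is effectively four-dimensional data (year, visit, subject, type) stored in a two-dimensional table.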

#### Indexing and Slicing a MultiIndex

In [92]:
pop
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

In [93]:
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64
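
Besides slicing and Boolean masks, a MultiIndex also supports partial indexing, where only some of the levels are given. The following is a minimal sketch that rebuilds the same `pop` Series so it runs on its own; it uses `.loc` throughout, which is one of several equivalent spellings.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# full key: a single value
ca_2000 = pop.loc['California', 2000]

# partial key on the first level: a Series indexed by the remaining level (year)
ca_all = pop.loc['California']

# empty slice on the first level plus a key on the second: every state in 2000
year_2000 = pop.loc[:, 2000]
```

The last selection is the MultiIndex counterpart of the list comprehension needed with tuple keys.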

### Combining Datasets: Merge and Join

- The behavior implemented in `pd.merge()` is a subset of what is known as relational algebra, a formal set of rules for manipulating relational data that forms the conceptual foundation of operations available in most databases.

#### One-to-one joins

The simplest merge is the one-to-one join, which is in many ways similar to column-wise concatenation.

In [94]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


In [96]:
df3 = pd.merge(df1, df2)
df3

Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


#### Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.

In [99]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3)
print(df4)

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve


In [98]:
print(pd.merge(df3, df4))

  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve


#### Many-to-many joins
If the key column in both the left and right arrays contains duplicates, then
the result is a many-to-many merge. This is perhaps clearest with a concrete
example.

In [100]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux','spreadsheets', 
                               'organization']})

In [101]:
df1
df5

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR


Unnamed: 0,group,skills
0,Accounting,math
1,Accounting,spreadsheets
2,Engineering,coding
3,Engineering,linux
4,HR,spreadsheets
5,HR,organization


In [102]:
pd.merge(df1,df5)

Unnamed: 0,employee,group,skills
0,Bob,Accounting,math
1,Bob,Accounting,spreadsheets
2,Jake,Engineering,coding
3,Jake,Engineering,linux
4,Lisa,Engineering,coding
5,Lisa,Engineering,linux
6,Sue,HR,spreadsheets
7,Sue,HR,organization
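
Which kind of join you get depends only on the input data, so it can be worth checking your expectation explicitly. The sketch below uses the `validate` argument of `pd.merge()` for that; the variable names `many_to_one` and `check_failed` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

# 'group' is unique in df4, so the many-to-one check passes
many_to_one = pd.merge(df1, df4, on='group', validate='many_to_one')

# 'group' repeats in df5, so the same check raises a MergeError
try:
    pd.merge(df1, df5, on='group', validate='many_to_one')
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
```

Without `validate`, the second merge silently produces the eight-row many-to-many result shown above.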


In [None]:
# General Workflow
# Get the data (from csv, web etc)
# Get a sense of the data by
    # examining a few rows (df.head(), for example)
    # cleaning/manipulating the data (handle missing data, merge data correctly from multiple sources)
    # figuring out the level of each variable (categorical or numeric)
    # get these done at the beginning, before starting data analysis

# Pandas is very good for the steps above; pandas, scipy and visualization packages for the steps below
# What is/are your target variable(s)? What are your predictors?
# Understand descriptive stats, distributions etc (sometimes using visualizations)
# Choose a stats model; run the model; evaluate the model

In [2]:
import numpy as np
import pandas as pd
#Dataframe merge and join
#one to one, many to one, many to many: depend on the input data

#one to one
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})

print(df1); print(df2)
#Merge df1 & df2
df3 = pd.merge(df1, df2, on = 'employee')
#Specify merging key with "on" keyword
df3
#pd.merge?

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


In [25]:
#many to one join
#One of the two key columns contains duplicate entries. Duplicates will be preserved

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'], 'supervisor': ['Carly', 'Guido', 'Steve']})

print(df3); print(df4)
#merge df3 & df4

pd.merge(df3, df4, on='group')

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve


Unnamed: 0,employee,group,hire_date,supervisor
0,Bob,Accounting,2008,Carly
1,Jake,Engineering,2012,Guido
2,Lisa,Engineering,2004,Guido
3,Sue,HR,2014,Steve


In [27]:
#many to many join
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux','spreadsheets', 'organization']})

print(df1) ; print(df5)
#merge df1 & df5 with the key 'group'
pd.merge(df1, df5, on = 'group')

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization


Unnamed: 0,employee,group,skills
0,Bob,Accounting,math
1,Bob,Accounting,spreadsheets
2,Jake,Engineering,coding
3,Jake,Engineering,linux
4,Lisa,Engineering,coding
5,Lisa,Engineering,linux
6,Sue,HR,spreadsheets
7,Sue,HR,organization


In [None]:
def mystery(num_list):
    index = 0
    while index < len(num_list):
        num = num_list[index]
        if num == 0:
            num_list.pop(index)
        index += 1

list1 = [3, 0, 2, 0, 0]
mystery(list1)
print(list1)

In [31]:
#Merge df1 & df3 when the key columns have different names

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'salary': [70000, 80000, 120000, 90000]})


print(df1); print(df3)
#left_on/right_on keeps both key columns in the result,
#so drop the redundant 'name' column with .drop()
pd.merge(df1, df3, left_on = 'employee',right_on = 'name').drop('name', axis = 1)

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
   name  salary
0   Bob   70000
1  Jake   80000
2  Lisa  120000
3   Sue   90000


Unnamed: 0,employee,group,salary
0,Bob,Accounting,70000
1,Jake,Engineering,80000
2,Lisa,Engineering,120000
3,Sue,HR,90000


In [42]:
#Set 'employee' as index and merge two dataframes with this index

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})

print(df1); print(df2)

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
df1a

dfjoin = pd.merge(df1a, df2a, left_index = True, right_index = True )

dfjoin.loc['Bob']
dfjoin

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


Unnamed: 0_level_0,group,hire_date
employee,Unnamed: 1_level_1,Unnamed: 2_level_1
Bob,Accounting,2008
Jake,Engineering,2012
Lisa,Engineering,2004
Sue,HR,2014
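
For merges on the index, DataFrames also offer the `join()` method as a shortcut. A minimal sketch, assuming the same `df1a` and `df2a` as in the cell above (rebuilt here so the block is self-contained):

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

# join() aligns on the index by default (a left join on df1a's index)
joined = df1a.join(df2a)

# the explicit pd.merge() spelling of an index-on-index merge
merged = pd.merge(df1a, df2a, left_index=True, right_index=True)
```

Because every employee appears in both frames, the left join from `join()` and the inner join from `pd.merge()` contain the same rows here.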


In [169]:
#Inner join: keep the intersection of the two sets of inputs

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],'food': ['fish', 'beans', 'bread']},columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],'drink': ['wine', 'beer']},columns=['name', 'drink'])


print(df6); print(df7)
pd.merge(df6, df7)

#Outer join: returns a join over the union of the input keys,
# and fills in all missing values with NAs
pd.merge(df6, df7, how = 'outer')

#The left join and right join return joins over the left entries and right entries, respectively.
pd.merge(df6, df7, how = 'right')
pd.merge(df6, df7, how = 'left')

    name   food
0  Peter   fish
1   Paul  beans
2   Mary  bread
     name drink
0    Mary  wine
1  Joseph  beer


Unnamed: 0,name,food,drink
0,Mary,bread,wine


Unnamed: 0,name,food,drink
0,Peter,fish,
1,Paul,beans,
2,Mary,bread,wine
3,Joseph,,beer


Unnamed: 0,name,food,drink
0,Mary,bread,wine
1,Joseph,,beer


Unnamed: 0,name,food,drink
0,Peter,fish,
1,Paul,beans,
2,Mary,bread,wine


In [21]:
#Overlapping Column Names: The suffixes Keyword
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'rank': [3, 1, 4, 2]})

print(df8); print(df9)
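#Rename conflicting columns in pd.merge
pd.merge(df8, df9, on='name', suffixes = ['_a','_b'])

#Without `on`, every shared column ('name' and 'rank') is used as a key;
#no (name, rank) pair matches here, so the result is empty
pd.merge(df8, df9)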

df8.columns[0] == df9.columns[0]

   name  rank
0   Bob     1
1  Jake     2
2  Lisa     3
3   Sue     4
   name  rank
0   Bob     3
1  Jake     1
2  Lisa     4
3   Sue     2


True

In [44]:
#Aggregate, filter, transform, apply
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},columns = ['key', 'data1', 'data2'])

print(df)

#for item in df.groupby('key'):
#    print(item)

df.groupby('key').sum()
#Calculate min, median and max value by key (string names avoid warnings in newer pandas)
df.groupby('key').aggregate(['min', 'median', 'max'])

df.groupby('key').aggregate({'data1': 'min','data2': 'max'}) 

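#Filter
#DataFrame.filter() selects rows or columns by label; it does not look at the values
df.filter(['data1'])

#axis=0 matches the regex against the row labels; the default integer index has no match here
df.filter(regex = 'key', axis = 0)

df1 = df.set_index('key')
df1
df1.filter(regex = 'A', axis = 0)

#GroupBy.filter() keeps or drops whole groups: the function passed in should return
#a Boolean value specifying whether the group passes the filtering.
#Here we keep all groups in which the standard deviation is larger than some critical value
df.groupby('key').filter(lambda x: x['data2'].std() > 2)

#Transformation
df.groupby('key').transform(lambda x: x - x.mean())

#the 'key' column holds strings, so select the numeric columns before centering
df[['data1', 'data2']].transform(lambda x: x - x.mean())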

#Group specific data points: a list with one label per row can serve as the grouping key
L = [0, 1, 0, 1, 2, 0]
#L = (0, 1, 0, 1, 2, 0)
df.groupby(L)[['data1', 'data2']].sum()

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9




Unnamed: 0,data1,data2
0,7,17
1,4,3
2,4,7


In [52]:
#Grouping data Example: planets

import seaborn as sns
planets = sns.load_dataset('planets')

#Examine data: shape, first 10 rows etc
planets.head(10)
planets.shape
planets.describe()
#Check null values and describe basic statistics 

planets.isna()
planets.isnull().any()
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [57]:
#Groupby: split, apply and combine data
planets.groupby('method').median()

planets.groupby('method')['number'].sum()
#iteration over groups

for (method, group) in planets.groupby('method'):
    print(method, group.shape)
    

#planets

Astrometry (2, 6)
Eclipse Timing Variations (9, 6)
Imaging (38, 6)
Microlensing (23, 6)
Orbital Brightness Modulation (3, 6)
Pulsar Timing (5, 6)
Pulsation Timing Variations (1, 6)
Radial Velocity (553, 6)
Transit (397, 6)
Transit Timing Variations (4, 6)


In [59]:
#How many planets were detected in different decades? (eg., 1980s, 1990s, 2000s)
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

decade

planets.groupby(['method',decade]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,number,orbital_period,mass,distance,year
method,decade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Astrometry,2010s,2,1262.36,0.0,35.75,4023
Eclipse Timing Variations,2000s,5,19308.0,6.05,261.44,6025
Eclipse Timing Variations,2010s,10,23456.8,4.2,1000.0,12065
Imaging,2000s,29,1350935.0,0.0,956.83,40139
Imaging,2010s,21,68037.5,0.0,1210.08,36208
Microlensing,2000s,12,17325.0,0.0,0.0,20070
Microlensing,2010s,15,4750.0,0.0,41440.0,26155
Orbital Brightness Modulation,2010s,5,2.12792,0.0,2360.0,6035
Pulsar Timing,1990s,9,190.0153,0.0,0.0,5978
Pulsar Timing,2000s,1,36525.0,0.0,0.0,2003
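
To answer the decade question directly, it helps to sum only the `number` column and unstack the decades into columns. The sketch below shows the idea on a tiny made-up table (the rows in `toy` are hypothetical stand-ins, not values from the planets dataset).

```python
import pandas as pd

# hypothetical stand-in for the planets table; all rows are made up
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity',
               'Radial Velocity', 'Radial Velocity'],
    'number': [1, 2, 1, 1, 3],
    'year': [2008, 2012, 1995, 2003, 2011],
})

# same decade labels as in the cell above
decade = 10 * (toy['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

# sum 'number' per (method, decade), then spread the decades across columns
counts = toy.groupby(['method', decade])['number'].sum().unstack().fillna(0)
```

Applied to the real data, the same last line with `planets` in place of `toy` gives one row per method and one column per decade.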


In [61]:
#Data cleaning example: US states and population

url1 = 'https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv'
url2 = 'https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv'
url3 = 'https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv'

pop = pd.read_csv(url1)
areas = pd.read_csv(url2)
abbrevs = pd.read_csv(url3)

print(pop.head()); print(areas.head());print(abbrevs.head())

print(pop.shape, areas.shape, abbrevs.shape)

  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
2           AL  under18  2010   1130966.0
3           AL    total  2010   4785570.0
4           AL  under18  2011   1125763.0
        state  area (sq. mi)
0     Alabama          52423
1      Alaska         656425
2     Arizona         114006
3    Arkansas          53182
4  California         163707
        state abbreviation
0     Alabama           AL
1      Alaska           AK
2     Arizona           AZ
3    Arkansas           AR
4  California           CA
(2544, 4) (52, 2) (51, 2)


In [69]:
#Merge the datasets, starting with pop and abbrevs
merged = pd.merge(pop, abbrevs, left_on = 'state/region', right_on = 'abbreviation', how = 'outer')
merged = merged.drop('abbreviation', axis =1)
merged

merged.isnull().any()
merged.isnull().sum()

state/region     0
ages             0
year             0
population      20
state           96
dtype: int64
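
The null counts show that some rows found no matching abbreviation. One way to locate them is to list the keys whose `state` is missing after the merge. The sketch below demonstrates this on miniature tables; `pop_toy`, `abbrevs_toy`, the code `XX` and all numbers are hypothetical.

```python
import pandas as pd

# hypothetical miniature versions of the population and abbreviation tables
pop_toy = pd.DataFrame({'state/region': ['AL', 'AL', 'XX'],
                        'year': [2010, 2011, 2010],
                        'population': [100.0, None, 50.0]})
abbrevs_toy = pd.DataFrame({'state': ['Alabama'], 'abbreviation': ['AL']})

merged_toy = pd.merge(pop_toy, abbrevs_toy, how='outer',
                      left_on='state/region', right_on='abbreviation')
merged_toy = merged_toy.drop('abbreviation', axis=1)

# number of missing values per column
null_counts = merged_toy.isnull().sum()

# region codes that did not match any abbreviation
unmatched = merged_toy.loc[merged_toy['state'].isnull(), 'state/region'].unique()
```

Running the last line on the real `merged` frame would list the region codes behind the 96 missing `state` entries.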

In [13]:
#Check null values


In [14]:
#compute the population density in the year 2010
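
A possible approach to this step, sketched on miniature tables: merge in the areas, keep the 2010 totals, and divide population by area. The frames `pop_toy` and `areas_toy` and every number in them are made up; with the real data the state-name column would first come from the merge with `abbrevs`.

```python
import pandas as pd

# hypothetical miniature tables; all numbers are made up
pop_toy = pd.DataFrame({'state': ['A', 'A', 'B', 'B'],
                        'ages': ['total', 'under18', 'total', 'total'],
                        'year': [2010, 2010, 2010, 2011],
                        'population': [1000.0, 250.0, 300.0, 310.0]})
areas_toy = pd.DataFrame({'state': ['A', 'B'],
                          'area (sq. mi)': [50, 10]})

# attach the area to every population row
final = pd.merge(pop_toy, areas_toy, on='state', how='left')

# keep only total population in 2010, indexed by state
data2010 = final.query("year == 2010 and ages == 'total'").set_index('state')

# people per square mile, densest first
density = (data2010['population'] / data2010['area (sq. mi)']).sort_values(ascending=False)
```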



In [16]:
#Pivot table

#with groupby method


#with pivot_table method


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles

In [23]:
alist = [1,2,3,'cat',[],False]
print(alist)

type(alist[4])

[1, 2, 3, 'cat', [], False]


list

In [27]:
list1 = [1,2,3,4]
list1.pop()
list1.pop(0)
list1

# list1.remove() without an argument raises a TypeError: remove() needs the value to delete

[2, 3]

In [31]:
list1 = [1,2,3,4]
list1.remove(1)

list1

[2, 3, 4]

In [None]:
import re
for tag in html_soup.find_all(re.compile('^h')):
    print(tag.name)
    
    
    

In [None]:
n = 5
answer = 1
while n > 0:
    answer = answer + n
    n = n + 1
print(answer)

In [1]:
20*0.85

17.0

In [None]:
20+1

In [172]:
12/19*100*0.3 + 90*0.2 + (100+30)/200*100*0.2 + 70* 0.3

70.94736842105263