# Pandas

Intro video: https://www.youtube.com/watch?v=dcqPhpY7tWk

- DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

- Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks.
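To make the two points above concrete, here is a minimal sketch of a DataFrame that mixes column types and contains a missing value. The column names and numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical records: the labels and values are made up for this sketch.
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Chicago"],   # strings
        "population": [0.96, np.nan, 2.70],        # floats, one value missing
        "coastal": [False, True, False],           # booleans
    },
    index=["a", "b", "c"],                         # explicit row labels
)

print(df.dtypes)                # each column keeps its own dtype
print(df["population"].isna())  # the missing entry is flagged as True
```

Each column is stored with its own type, and the missing entry is represented as `NaN` rather than forcing the whole table into a single dtype.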

In [8]:
%%cmd 
pip list --outdated

Package             Version   Latest      Type
------------------- --------- ----------- -----
argon2-cffi         20.1.0    21.3.0      wheel
attrs               21.2.0    21.4.0      wheel
beautifulsoup4      4.10.0    4.11.1      wheel
bleach              3.3.0     5.0.1       wheel
certifi             2021.5.30 2022.6.15   wheel
cffi                1.14.6    1.15.0      wheel
chardet             4.0.0     5.0.0       wheel
charset-normalizer  2.0.12    2.1.0       wheel
click               8.0.1     8.1.3       wheel
colorama            0.4.4     0.4.5       wheel
cycler              0.10.0    0.11.0      wheel
decorator           5.0.9     5.1.1       wheel
entrypoints         0.3       0.4         wheel
idna     

Run `pip install --upgrade pandas` to update pandas if your installed version is too outdated.

In [10]:
import pandas
pandas.__version__

'1.4.3'

In [2]:
# programmers usually import Pandas under the alias pd
import pandas as pd
import numpy as np

In [6]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Pandas objects

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

### Pandas `Series` object

Pandas Series is a one-dimensional array of indexed data.
 
- `pd.Series(data, index=index)`
    
- It wraps both a sequence of values and a sequence of indices, which you can access with the `values` and `index` attributes.
   
- The essential difference from a NumPy array is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
    
- Specialized dictionary: a Series is a structure that maps typed keys to a set of typed values
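The dictionary and array views of a `Series` described above can be sketched side by side. This is an illustrative example with made-up labels and values, not part of the original notebook run.

```python
import pandas as pd

# Hypothetical data: three labelled measurements.
ser = pd.Series([10, 20, 30], index=["x", "y", "z"])

# Dictionary-like behaviour: membership test and lookup by key.
has_y = "y" in ser
value_y = ser["y"]

# Array-like behaviour: vectorized arithmetic keeps the labels attached.
doubled = ser * 2

print(ser.index.dtype, ser.dtype)  # the keys and the values each have a dtype
print(doubled)
```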

In [13]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [16]:
data.values
data.index

array([0.25, 0.5 , 0.75, 1.  ])

RangeIndex(start=0, stop=4, step=1)

In [18]:
# indexing (like NumPy array)
data[0]
# slicing
data[1:3]

0.25

1    0.50
2    0.75
dtype: float64

In [23]:
# you can define explicit index 
data = pd.Series([.25, .5, .75, 1], 
                index = ['a', 'b', 'c', 'd'])
data
data.index

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Index(['a', 'b', 'c', 'd'], dtype='object')

In [37]:
# noncontiguous indices
data = pd.Series([.25, .5, .75, 1], 
                index = [2, 5, 1, 3])
data

2    0.25
5    0.50
1    0.75
3    1.00
dtype: float64

In [39]:
# construct a Series object from a dict
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}

population = pd.Series(population_dict)

population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [40]:
# unlike dict, Series supports slicing
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

In [32]:
# data can be a scalar, which is repeated to fill the specified index
pd.Series(5, index = [100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [38]:
# data can be dict 
pd.Series({2:'a', 1:'b', 4:'c'})

2    a
1    b
4    c
dtype: object

In [44]:
# the index can be explicitly set; only the dict keys listed in it are kept
pd.Series({2:'a', 1:'b',3:'c'})
pd.Series({2:'a', 1:'b',3:'c'}, index = (3,2))

2    a
1    b
3    c
dtype: object

3    c
2    a
dtype: object

### Pandas `DataFrame` Object

DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

- DataFrame as a generalized NumPy array
    - sequence of aligned Series objects

- DataFrame as specialized dictionary
    - a DataFrame maps a column name to a Series of column data.

- Constructing DataFrame objects

In [56]:
# population Series
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}

area_dict = {
    'California':423967,
    'Texas':695662,
    'New York':141297,
    'Florida':170312,
    'Illinois':149995
}

In [61]:
# create population Series
population = pd.Series(population_dict)
# create area Series
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [60]:
# create a two-dimensional object
states = pd.DataFrame({
    'population':population,
    'area':area
})
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [62]:
# index attribute
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [63]:
# columns attribute
# an Index object holding the column labels
states.columns

Index(['population', 'area'], dtype='object')

In [64]:
# use column names to extract the Series object of the area
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [67]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

#### Different ways of constructing DataFrame objects

- A single Series: because a DataFrame is a collection of Series objects, a single-column DataFrame can be constructed from a single Series.

- A list of dicts

- A dict of Series objects

- 2-dimensional NumPy array

- NumPy structured array

In [68]:
pd.DataFrame(population, columns = ['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


In [71]:
# create a list of dicts
data = [{
    'a':i,
    'b':2*i
} for i in range (3)]

data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [72]:
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [74]:
# if some keys are missing
# pandas will fill them in with NaN (not a number)
pd.DataFrame([
    {'a':1,'b':2},
    {'b':3,'c':4}
])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [75]:
# create DF with a dictionary of Series objects
pd.DataFrame({
    'population':population,
    'area':area
})

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [76]:
# use 2-dimensional array
pd.DataFrame(np.random.rand(3,2),
            columns = ['foo', 'bar'],
            index = ['a','b','c'])

Unnamed: 0,foo,bar
a,0.846311,0.313274
b,0.524548,0.443453
c,0.229577,0.534414


In [90]:
# use NumPy structured array
A = np.zeros(3, dtype = [('A', 'i8'), ('B', 'f8')])
A

ADF = pd.DataFrame(A)
ADF

ADF.index

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

Unnamed: 0,A,B
0,0,0.0
1,0,0.0
2,0,0.0


RangeIndex(start=0, stop=3, step=1)

### Pandas `Index` object

- Index as immutable array
    - can also do indexing
    - similar attributes 
    - difference: `Index` objects are immutable

- Set (logic) operations
    - Pandas objects are designed to facilitate operations such as joins across datasets, which depend on set arithmetic over the labels.
    - `Index` objects provide methods for these set operations:
        - `Index_A.intersection(Index_B)`
        - `Index_A.union(Index_B)`
        - `Index_A.symmetric_difference(Index_B)`

In [93]:
ind = pd.Index(list(range(1,12,2)))
ind

Int64Index([1, 3, 5, 7, 9, 11], dtype='int64')

In [95]:
# Index as immutable array
ind[1]
ind[::2]

3

Int64Index([1, 5, 9], dtype='int64')

In [96]:
# Index object has many of the attributes familiar from NumPy arrays
# how many elements
ind.size
# shape
ind.shape
# dimension of array
ind.ndim
# data type in array
ind.dtype

6

(6,)

1

dtype('int64')

In [107]:
# Index is immutable
ind
ind[0] = 0

Int64Index([1, 3, 5, 7, 9, 11], dtype='int64')

TypeError: Index does not support mutable operations

In [115]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# logic operation
# and
indA.intersection(indB)
# or
indA.union(indB)
# symmetric difference
indA.symmetric_difference(indB)

Int64Index([3, 5, 7], dtype='int64')

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

Int64Index([1, 2, 9, 11], dtype='int64')

### Data indexing and selection
Accessing and modifying values in Pandas one-dimensional `Series` and two-dimensional `DataFrame` objects.

- Slicing: 
    - when you slice with an explicit index (e.g., `data['a':'c']`), the final index is included in the slice
    - when you slice with an implicit index (e.g., `data[0:2]`), the final index is excluded from the slice
- Indexers: `loc`, `iloc`, and `ix` exist to prevent confusion when the index itself consists of integers.
    - `loc`: always references the explicit index (labels)
    - `iloc`: always references the implicit, Python-style positional index
    - `ix` was deprecated in pandas 0.20 and removed in pandas 1.0, though it still appears in older textbooks.
    - Principle: **explicit is better than implicit**

In [154]:
# series as dictionary
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index = ['a', 'b', 'c', 'd'])

data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [125]:
'a' in data
data.keys()
list(data.items())

True

Index(['a', 'b', 'c', 'd'], dtype='object')

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [130]:
# extend the Series by assigning to a new index label
data['e'] = 1.25
data
# series as one-dimensional array: slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

a    0.25
b    0.50
c    0.75
dtype: float64

In [129]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [131]:
# masking
data[(data>0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [135]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [164]:
# indexer
data
# loc allows indexing and slicing that 
# always references the explicit index
data.loc['a']
data.loc['a':'c']

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

0.25

a    0.25
b    0.50
c    0.75
dtype: float64

In [175]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# iloc allows indexing and slicing that 
# always references the implicit Python-style index
data.iloc[1]
data.iloc[1:3]

1    a
3    b
5    c
dtype: object

'b'

3    b
5    c
dtype: object

In [174]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

# explicit
data.loc[1]
data.loc[1:3]

# implicit
data.iloc[1]
data.iloc[1:3]

1    a
3    b
5    c
dtype: object

'a'

1    a
3    b
dtype: object

'b'

3    b
5    c
dtype: object

#### Data selection in `DataFrame`

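- DF as a dict: columns are selected by name

- DF as a two-dimensional array: rows and columns are selected by position (`iloc`) or by label (`loc`)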

In [176]:
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [179]:
# attribute-style
data.area 
# dict-style
data['area']
# same object of both
data.area is data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

True

#### attribute-style vs. dict-style
Attribute-style does not work for all cases!

- For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible.
    - Ex., the DataFrame has a `pop()` method, so `data.pop` will point to this rather than the "pop" column
    
- In particular, avoid the temptation to assign columns via attributes: use `data['pop'] = z` rather than `data.pop = z`.
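The name collision described above can be checked directly. The following is a small sketch on a made-up two-row frame; the values are hypothetical.

```python
import pandas as pd

# Small frame with a column named 'pop', which collides with DataFrame.pop().
df = pd.DataFrame({"area": [10, 20], "pop": [100, 400]}, index=["x", "y"])

# Attribute access returns the bound method, not the column.
print(callable(df.pop))

# Dict-style access always returns the column.
print(df["pop"].tolist())

# Dict-style assignment reliably creates a new column.
df["density"] = df["pop"] / df["area"]
print(df.columns.tolist())
```

Dict-style access behaves the same for every column name, which is why it is the safer default.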

In [180]:
# modify object 
# add a new column calculated from other 2 existing columns
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [190]:
data.keys()
data.values

Index(['area', 'pop', 'density'], dtype='object')

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [191]:
# transpose dataframe to swap rows and columns
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967.0,695662.0,141297.0,170312.0,149995.0
pop,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


In [193]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [197]:
data
data.iloc[:3, :2]

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [195]:
data.loc[:'Illinois', :'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [199]:
data.loc[data.density > 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


In [201]:
# modify
data.iloc[0, 2] = 90
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [47]:
np.random.randint(0, 10, size = (3, 4))

array([[8, 7, 3, 6],
       [5, 1, 9, 3],
       [4, 8, 1, 4]])

In [15]:
#pandas operations

np.random.seed(1)

#any NumPy ufunc will work on Pandas Series and DataFrame objects
ser = pd.Series(np.random.randint(0, 10, 4))

# create a dataframe (df)
df = pd.DataFrame(np.random.randint(0, 10, size = (3, 4)), columns=['A', 'B', 'C', 'D'])

ser 
df

0    5
1    8
2    9
3    5
dtype: int32

Unnamed: 0,A,B,C,D
0,0,0,1,7
1,6,9,2,4
2,5,2,4,2


In [16]:
np.exp(ser)
np.sum(ser)

0     148.413159
1    2980.957987
2    8103.083928
3     148.413159
dtype: float64

27

In [50]:
#index alignment in dataframe
np.random.seed(1)
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('BAC'))
print(A); print('\n') ;print(B)

    A   B
0   5  11
1  12   8


   B  A  C
0  9  5  0
1  0  1  7
2  6  9  2


In [51]:
A + B
# use method
A.add(B)

Unnamed: 0,A,B,C
0,10.0,20.0,
1,13.0,8.0,
2,,,


Unnamed: 0,A,B,C
0,10.0,20.0,
1,13.0,8.0,
2,,,


In [49]:
#fill na with 0
A.add(B, fill_value = 0)

Unnamed: 0,A,B,C
0,10.0,20.0,0.0
1,13.0,8.0,7.0
2,9.0,6.0,2.0


In [34]:
# this doesn't work:
# fill_value is documented as float or None, so a string is not accepted
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.add.html
A.add(B, fill_value = 'str')

In [54]:
C = A.add(B)
C.fillna('str')

Unnamed: 0,A,B,C
0,10.0,20.0,
1,13.0,8.0,
2,,,


In [25]:
#fill na with means from A
fill = A.stack().mean()
A.add(B, fill_value = fill)

Unnamed: 0,A,B,C
0,10.0,20.0,9.0
1,13.0,8.0,16.0
2,18.0,15.0,11.0


In [36]:
#other pandas dataframe operations
np.random.seed(1)

A = np.random.randint(0,10,(3,4))
df = pd.DataFrame(A, columns=list('QRST'))

print(df)
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
df.iloc[0]

# row-wise subtraction
df - df.iloc[0]

   Q  R  S  T
0  5  8  9  5
1  0  0  1  7
2  6  9  2  4


Q    5
R    8
S    9
T    5
Name: 0, dtype: int32

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-5,-8,-8,2
2,1,1,-7,-1


In [None]:
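# column-wise subtraction: subtract column 'R' from every column
df['R']

# axis=0 aligns the Series with the row index instead of the column labels
df.sub(df['R'], axis=0)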
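Since the cell above has no recorded output, here is a small sketch with fixed values that contrasts the default row-wise alignment with `axis=0`. The frame is hypothetical and chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical 2x3 frame with fixed values.
df = pd.DataFrame({"Q": [5, 0], "R": [8, 1], "S": [9, 4]})

# Default: the Series index (Q, R, S) is matched against the column labels,
# so the first row is subtracted from every row.
by_row = df - df.iloc[0]

# axis=0: the Series index (0, 1) is matched against the row labels,
# so column R is subtracted from every column.
by_col = df.sub(df["R"], axis=0)

print(by_row)
print(by_col)
```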

In [None]:
#Hierarchical/multi Indexing
df = pd.DataFrame(np.random.rand(4, 2), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

df
print(df)

# first row by position (iloc) and by its full (outer, inner) label (loc)
df.iloc[0]
df.loc[('a', 1)]

# row with outer index 'b' and inner index 2
df.loc[('b', 2)]
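The hierarchical selections above can be sketched with deterministic values so the results are reproducible. The level names `outer` and `inner` are assumptions added for readability; the original frame leaves its index levels unnamed.

```python
import pandas as pd

# Deterministic values; the index labels mirror the cell above.
index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["outer", "inner"]
)
df = pd.DataFrame({"data1": [1, 2, 3, 4], "data2": [10, 20, 30, 40]}, index=index)

rows_a = df.loc["a"]                # all rows whose outer label is 'a'
row_b2 = df.loc[("b", 2)]           # one row, selected by a full (outer, inner) tuple
inner_1 = df.xs(1, level="inner")   # cross-section: inner label 1 from every outer group

print(rows_a)
print(row_b2)
print(inner_1)
```

Passing the full key as a tuple keeps the row selection unambiguous, since a bare `df.loc['b', 2]` could also be read as "row `'b'`, column `2`".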

In [None]:
#handle missing data: the None object & the NaN value
#isnull(): generate a Boolean mask indicating missing values
#notnull(): opposite of isnull()
#dropna(): return a filtered version of the data with missing values removed
#fillna(): return a copy of the data with missing values filled or imputed

#detect null
data = pd.Series([1, np.nan, 'hello', None])
print(data)

#count null elements
data.isnull()
np.count_nonzero(data.isnull())
np.sum(data.isnull())

#show not null elements
#data.isnull().sum()
data[data.notnull()]

#drop null values
data.dropna()

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
print(df)


#drop na by row/column
df.dropna(axis = 1)

#drop 'any'/'all' na
df.dropna(how = 'any', axis = 1)

df[1] = np.nan
df
df.dropna(how = 'all', axis = 1)
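As a complement to the cell above, this sketch counts missing values per column and compares the `dropna` variants on a small frame with known contents. The fourth row is an addition made for this example so that `thresh` has a visible effect.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, np.nan, 2],
    [2, 3, 5],
    [np.nan, 4, 6],
    [np.nan, np.nan, 7],   # extra row added for illustration
])

n_missing = df.isnull().sum()        # number of missing values per column
complete_rows = df.dropna()          # default: drop rows containing any NaN
complete_cols = df.dropna(axis=1)    # drop columns containing any NaN

# thresh keeps rows that have at least that many non-null values.
at_least_two = df.dropna(thresh=2)

print(n_missing)
print(at_least_two)
```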

In [None]:
#filling null values
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

print(data)
#fill na with 0
data.fillna(0)

# forward-fill & back-fill

# ffill()/bfill() are equivalent to fillna(method='ffill'/'bfill'),
# which is deprecated in recent pandas releases
data.ffill()
data.bfill()
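Two details of forward- and back-filling are easy to miss: a leading gap has no earlier value to copy, and `limit` caps how far a value is propagated. The sketch below shows both on a hypothetical Series.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a leading gap and a two-element gap in the middle.
data = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0], index=list("abcde"))

forward = data.ffill()          # copies the previous valid value forward
limited = data.ffill(limit=1)   # fills at most one consecutive gap per run
backward = data.bfill()         # copies the next valid value backward

print(forward)
print(limited)
print(backward)
```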

In [None]:
# General workflow
# Get the data (from CSV, the web, etc.)
# Get a sense of the data:
    # examine a few rows (df.head(), for example)
    # clean and manipulate the data (handle missing data, merge data correctly from multiple sources)
    # figure out the level of each variable (categorical or numeric)
    # get these done at the beginning, before starting the data analysis

# Pandas is very good for the steps above; pandas, SciPy, and visualization packages help with the steps below
# What is/are your target variable(s)? What are your predictors?
# Understand descriptive statistics, distributions, etc. (sometimes using visualizations)
# Choose a statistical model; run the model; evaluate the model
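The first steps of the workflow above might look roughly like the following. This is a sketch only: the CSV content, column names, and the choice to drop incomplete rows are all assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file or URL.
raw = io.StringIO(
    "name,group,score\n"
    "Ann,A,3.5\n"
    "Ben,B,\n"
    "Cat,A,4.0\n"
    "Dan,B,2.5\n"
)

df = pd.read_csv(raw)        # get the data
print(df.head())             # examine a few rows
print(df.dtypes)             # which columns are numeric, which are categorical
print(df.isnull().sum())     # missing values per column

# One possible cleaning decision: drop rows without a score.
clean = df.dropna(subset=["score"])

# Descriptive statistics per group.
summary = clean.groupby("group")["score"].mean()
print(summary)
```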

In [2]:
import numpy as np
import pandas as pd
#Dataframe merge and join
#one to one, many to one, many to many: depend on the input data

#one to one
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})

print(df1); print(df2)
#Merge df1 & df2
df3 = pd.merge(df1, df2, on = 'employee')
#Specify merging key with "on" keyword
df3
#pd.merge?

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


In [25]:
#many to one join
#One of the two key columns contains duplicate entries. Duplicates will be preserved

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'], 'supervisor': ['Carly', 'Guido', 'Steve']})

print(df3); print(df4)
#merge df3 & df4

pd.merge(df3, df4, on='group')

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve


Unnamed: 0,employee,group,hire_date,supervisor
0,Bob,Accounting,2008,Carly
1,Jake,Engineering,2012,Guido
2,Lisa,Engineering,2004,Guido
3,Sue,HR,2014,Steve


In [27]:
#many to many join
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux','spreadsheets', 'organization']})

print(df1) ; print(df5)
#merge df1 & df5 with the key 'group'
pd.merge(df1, df5, on = 'group')

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
         group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization


Unnamed: 0,employee,group,skills
0,Bob,Accounting,math
1,Bob,Accounting,spreadsheets
2,Jake,Engineering,coding
3,Jake,Engineering,linux
4,Lisa,Engineering,coding
5,Lisa,Engineering,linux
6,Sue,HR,spreadsheets
7,Sue,HR,organization
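Because the kind of join (one-to-one, many-to-one, many-to-many) depends on the input data, it can help to state the expected relationship and let pandas check it. The sketch below uses the `validate` parameter of `pd.merge`; the frame names are hypothetical.

```python
import pandas as pd

employees = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa"],
                          "group": ["Accounting", "Engineering", "Engineering"]})
supervisors = pd.DataFrame({"group": ["Accounting", "Engineering"],
                            "supervisor": ["Carly", "Guido"]})

# 'group' repeats on the left but is unique on the right: many-to-one.
ok = pd.merge(employees, supervisors, on="group", validate="many_to_one")

# Claiming one-to-one is wrong here, so pandas raises MergeError.
try:
    pd.merge(employees, supervisors, on="group", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(ok)
print(raised)
```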


In [None]:
def mystery(num_list):
    index = 0
    while index < len(num_list):
        num = num_list[index]
        if num == 0:
            num_list.pop(index)
        index += 1

list1 = [3, 0, 2, 0, 0]
mystery(list1)
print(list1)

In [31]:
#Merge df1 & df3 with two different keys

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'salary': [70000, 80000, 120000, 90000]})


print(df1); print(df3)
#Drop duplicate column

pd.merge(df1, df3, left_on = 'employee',right_on = 'name').drop('name', axis = 1)

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
   name  salary
0   Bob   70000
1  Jake   80000
2  Lisa  120000
3   Sue   90000


Unnamed: 0,employee,group,salary
0,Bob,Accounting,70000
1,Jake,Engineering,80000
2,Lisa,Engineering,120000
3,Sue,HR,90000


In [42]:
#Set 'employee' as index and merge two dataframes with this index

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})

print(df1); print(df2)

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
df1a

dfjoin = pd.merge(df1a, df2a, left_index = True, right_index = True )

dfjoin.loc['Bob']
dfjoin

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


Unnamed: 0_level_0,group,hire_date
employee,Unnamed: 1_level_1,Unnamed: 2_level_1
Bob,Accounting,2008
Jake,Engineering,2012
Lisa,Engineering,2004
Sue,HR,2014


In [8]:
#Inner join: keep the intersection of the two sets of inputs

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],'food': ['fish', 'beans', 'bread']},columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],'drink': ['wine', 'beer']},columns=['name', 'drink'])


print(df6); print(df7)
pd.merge(df6, df7)
#Outer join: returns a join over the union of the input keys, and fills in all missing values with NaN
pd.merge(df6, df7, how = 'outer')

#The left join and right join keep all keys from the left input and the right input, respectively.
pd.merge(df6, df7, how = 'right')
pd.merge(df6, df7, how = 'left')

    name   food
0  Peter   fish
1   Paul  beans
2   Mary  bread
     name drink
0    Mary  wine
1  Joseph  beer


Unnamed: 0,name,food,drink
0,Peter,fish,
1,Paul,beans,
2,Mary,bread,wine
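Only one of the four results is shown above, so the following sketch compares the row counts of the four join types on the same two frames and uses `indicator=True` to record where each row of the outer join came from.

```python
import pandas as pd

df6 = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                    "food": ["fish", "beans", "bread"]})
df7 = pd.DataFrame({"name": ["Mary", "Joseph"],
                    "drink": ["wine", "beer"]})

inner = pd.merge(df6, df7, how="inner")   # keys found in both frames
left = pd.merge(df6, df7, how="left")     # every key from df6
right = pd.merge(df6, df7, how="right")   # every key from df7

# indicator=True adds a '_merge' column: left_only, right_only, or both.
outer = pd.merge(df6, df7, how="outer", indicator=True)

print(len(inner), len(left), len(right), len(outer))
print(outer)
```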


In [21]:
#Overlapping Column Names: The suffixes Keyword
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'rank': [3, 1, 4, 2]})

print(df8); print(df9)
#Rename conflicting columns in pd.merge

pd.merge(df8, df9, on='name', suffixes = ['_a','_b'])

pd.merge(df8, df9)

df8.columns[0] == df9.columns[0]

   name  rank
0   Bob     1
1  Jake     2
2  Lisa     3
3   Sue     4
   name  rank
0   Bob     3
1  Jake     1
2  Lisa     4
3   Sue     2


True

In [44]:
#Aggregate, filter, transform, apply
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},columns = ['key', 'data1', 'data2'])

print(df)

#for item in df.groupby('key'):
#    print(item)

df.groupby('key').sum()
#Calculate min, median and max value by key
df.groupby('key').aggregate(['min', 'median', 'max'])

df.groupby('key').aggregate({'data1': 'min','data2': 'max'}) 

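#Filter
#DataFrame.filter() selects rows or columns by label; it does not look at the data values.
#GroupBy.filter() drops whole groups: the function passed to it should return a Boolean value
#specifying whether the group passes the filtering, e.g. keep all groups in which the
#standard deviation is larger than some critical value.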


df.filter(['data1'])

df.filter(regex = 'key', axis = 0)

df1 = df.set_index('key')
df1
df1.filter(regex = 'A', axis = 0)

df.groupby('key').filter(lambda x: x['data2'].std() > 2)

#Transformation
df.groupby('key').transform(lambda x: x - x.mean())

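# restrict to the numeric columns: the mean of the string column 'key' is undefined
df[['data1', 'data2']].transform(lambda x: x - x.mean())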

#Group specific data points
L = [0, 1, 0, 1, 2, 0]
#L = (0, 1, 0, 1, 2, 0)
df.groupby(L).sum()

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9



Unnamed: 0,data1,data2
0,7,17
1,4,3
2,4,7
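To separate the three group operations used in the cell above, here is a sketch on the same six rows with the values written out explicitly. The threshold of 4 is an arbitrary choice for the example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data1": [0, 1, 2, 3, 4, 5],
                   "data2": [5, 0, 3, 3, 7, 9]})
grouped = df.groupby("key")

# aggregate: one row per group
agg = grouped.agg({"data1": "min", "data2": "max"})

# filter: keep the rows of groups whose data2 standard deviation exceeds 4
kept = grouped.filter(lambda g: g["data2"].std() > 4)

# transform: same shape as the input, here centered within each group
centered = grouped[["data1", "data2"]].transform(lambda x: x - x.mean())

print(agg)
print(kept)
print(centered)
```

Group A has a `data2` standard deviation of about 1.41, so its rows are dropped by the filter, while groups B and C pass.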


In [52]:
#Grouping data Example: planets

import seaborn as sns
planets = sns.load_dataset('planets')

#Examine data: shape, first 10 rows etc
planets.head(10)
planets.shape
planets.describe()
#Check null values and describe basic statistics 

planets.isna()
planets.isnull().any()
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [57]:
#Groupby: split, apply and combine data
planets.groupby('method').median()

planets.groupby('method')['number'].sum()
#iteration over groups

for (method, group) in planets.groupby('method'):
    print(method, group.shape)
    

#planets

Astrometry (2, 6)
Eclipse Timing Variations (9, 6)
Imaging (38, 6)
Microlensing (23, 6)
Orbital Brightness Modulation (3, 6)
Pulsar Timing (5, 6)
Pulsation Timing Variations (1, 6)
Radial Velocity (553, 6)
Transit (397, 6)
Transit Timing Variations (4, 6)


In [59]:
#How many planets were detected in different decades? (eg., 1980s, 1990s, 2000s)
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'

decade

planets.groupby(['method',decade]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,number,orbital_period,mass,distance,year
method,decade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Astrometry,2010s,2,1262.36,0.0,35.75,4023
Eclipse Timing Variations,2000s,5,19308.0,6.05,261.44,6025
Eclipse Timing Variations,2010s,10,23456.8,4.2,1000.0,12065
Imaging,2000s,29,1350935.0,0.0,956.83,40139
Imaging,2010s,21,68037.5,0.0,1210.08,36208
Microlensing,2000s,12,17325.0,0.0,0.0,20070
Microlensing,2010s,15,4750.0,0.0,41440.0,26155
Orbital Brightness Modulation,2010s,5,2.12792,0.0,2360.0,6035
Pulsar Timing,1990s,9,190.0153,0.0,0.0,5978
Pulsar Timing,2000s,1,36525.0,0.0,0.0,2003
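To answer the "how many planets per decade" question directly, one option is to sum only the `number` column and unstack the decade level. The sketch below does this on a tiny invented table that reuses the column names of the planets data, so it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical mini-table with the same column names as the planets data.
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Imaging", "Radial Velocity", "Radial Velocity"],
    "number": [1, 2, 1, 1, 3],
    "year": [2008, 2011, 2004, 1995, 2013],
})

decade = (10 * (toy["year"] // 10)).astype(str) + "s"
decade.name = "decade"

# Sum planet counts per (method, decade), then spread decades into columns.
counts = toy.groupby(["method", decade])["number"].sum().unstack().fillna(0)
print(counts)
```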


In [61]:
#Data cleaning example: US states and population

url1 = 'https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv'
url2 = 'https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv'
url3 = 'https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv'

pop = pd.read_csv(url1)
areas = pd.read_csv(url2)
abbrevs = pd.read_csv(url3)

print(pop.head()); print(areas.head());print(abbrevs.head())

print(pop.shape, areas.shape, abbrevs.shape)

  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
2           AL  under18  2010   1130966.0
3           AL    total  2010   4785570.0
4           AL  under18  2011   1125763.0
        state  area (sq. mi)
0     Alabama          52423
1      Alaska         656425
2     Arizona         114006
3    Arkansas          53182
4  California         163707
        state abbreviation
0     Alabama           AL
1      Alaska           AK
2     Arizona           AZ
3    Arkansas           AR
4  California           CA
(2544, 4) (52, 2) (51, 2)


In [69]:
#Merge three datasets
merged = pd.merge(pop, abbrevs, left_on = 'state/region', right_on = 'abbreviation', how = 'outer')
merged = merged.drop('abbreviation', axis =1)
merged

merged.isnull().any()
np.sum(merged.isnull())

state/region     0
ages             0
year             0
population      20
state           96
dtype: int64

In [13]:
#Check null values


In [14]:
#compute the population density in the year 2010
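One way this step could be approached is sketched below on invented stand-in tables. Only the column names follow the CSV files loaded earlier; the state names and numbers are hypothetical, and the real data would first need its missing values resolved.

```python
import pandas as pd

# Hypothetical stand-ins for the merged population table and the areas table.
merged_toy = pd.DataFrame({
    "state": ["Alpha", "Alpha", "Beta", "Beta"],
    "ages": ["total", "under18", "total", "total"],
    "year": [2010, 2010, 2010, 2011],
    "population": [1000.0, 250.0, 600.0, 630.0],
})
areas_toy = pd.DataFrame({"state": ["Alpha", "Beta"], "area (sq. mi)": [50, 20]})

# Attach the area to every population row.
final = pd.merge(merged_toy, areas_toy, on="state", how="left")

# Keep the total population for 2010, index by state, then divide by area.
data2010 = final.query("year == 2010 and ages == 'total'").set_index("state")
density = (data2010["population"] / data2010["area (sq. mi)"]).sort_values(ascending=False)
print(density)
```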



In [16]:
#Pivot table

#with groupby method


#with pivot_table method
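The two approaches named in this cell can be compared on a small invented table. The column names and values below are hypothetical; the point is only that `pivot_table` produces the same two-dimensional summary as a `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical data: two grouping columns and one 0/1 outcome column.
toy = pd.DataFrame({
    "sex": ["f", "f", "m", "m", "f", "m"],
    "class": ["1st", "2nd", "1st", "2nd", "1st", "1st"],
    "survived": [1, 1, 0, 0, 0, 1],
})

# With groupby: group on both keys, aggregate, then move 'class' to the columns.
by_groupby = toy.groupby(["sex", "class"])["survived"].mean().unstack()

# With pivot_table: the default aggregation is the mean.
by_pivot = toy.pivot_table(values="survived", index="sex", columns="class")

print(by_groupby)
print(by_pivot)
```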


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles

In [23]:
alist = [1,2,3,'cat',[],False]
print(alist)

type(alist[4])

[1, 2, 3, 'cat', [], False]


list

In [27]:
list1 = [1,2,3,4]
list1.pop()
list1.pop(0)
list1

# list1.remove() would raise a TypeError: remove() requires the value to delete

[2, 3]

In [31]:
list1 = [1,2,3,4]
list1.remove(1)

list1

[2, 3, 4]

In [None]:
import re
for tag in html_soup.find_all(re.compile('^h')):
    print(tag.name)
    
    
    

In [None]:
n = 5
answer = 1
while n > 0:
    answer = answer + n
    n = n - 1  # decrement so the loop terminates
print(answer)

In [1]:
20*0.85

17.0

In [None]:
20+1