## Data Manipulation with Pandas

Pandas is a newer package built on top of NumPy that provides an
efficient implementation of a DataFrame. DataFrames are essentially multidimensional
arrays with attached row and column labels, often with heterogeneous types
and/or missing data. As well as offering a convenient storage interface for
labeled data, Pandas implements a number of powerful data
operations familiar to users of both database frameworks and spreadsheet programs.
Pandas, and in particular its <b>Series and DataFrame objects</b>,
builds on the NumPy array structure, provides efficient access to messy data,
and helps with the “data munging” tasks that occupy much of a data scientist’s time.
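
As a small illustration of the description above, the following sketch builds a DataFrame with row and column labels, columns of different types, and one missing value. The column names, row labels, and values are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: labeled rows, mixed column types, one missing value
df_demo = pd.DataFrame(
    {"city": ["Pune", "Oslo", "Lima"],
     "temp_c": [31.5, np.nan, 19.0],
     "visited": [True, False, True]},
    index=["r1", "r2", "r3"],
)

# Each column keeps its own dtype
dtypes = df_demo.dtypes.astype(str).to_dict()
# Count the missing entries in the numeric column
n_missing = int(df_demo["temp_c"].isna().sum())

print(dtypes)
print(n_missing)
```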

In [10]:
import numpy as np   # NumPy is used by several cells below
import pandas as pd  # the name derives from "panel data"; also read as "Python data analysis"

In [11]:
pd.__version__

'2.2.3'

###### The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:

In [12]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])

In [13]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [14]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [15]:
data[1]

0.5

In [16]:
data[1:4]

1    0.50
2    0.75
3    1.00
dtype: float64

In [17]:
# A Pandas Series is much more general and flexible than the one-dimensional NumPy array that it emulates.
# The essential difference is the index: a NumPy array has an implicitly defined integer
# index used to access the values, while a Pandas Series has an explicitly defined index associated with the values.
# General form: pd.Series(data, index=index), where index is optional and data can be one of many entities.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [18]:
data['c']

0.75

In [19]:
'a' in data

True

In [20]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [21]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [22]:
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [23]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [24]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [25]:
# fancy indexing
data[['a', 'd']] # passing an array of indices to access multiple array elements at once

a    0.25
d    1.00
dtype: float64

In [26]:
# the index need not be contiguous or sorted
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [27]:
# Series as a specialized dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [28]:
# Unlike a dictionary, a Series also supports array-style operations such as slicing
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64

In [29]:
#Data can be a scalar, which is repeated to fill the specified index:
data = pd.Series(5,index=[100,200,300])
data

100    5
200    5
300    5
dtype: int64

In [30]:
#Data can be a dictionary
data = pd.Series({2:'a', 1:'b', 3:'c'})
data

2    a
1    b
3    c
dtype: object

In [31]:
data[1:3]

1    b
3    c
dtype: object

In [32]:
#The index can be explicitly set if a different result is preferred:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]) #Series is populated only with the explicitly identified keys

3    c
2    a
dtype: object
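
A small sketch of the complementary case: when the explicit index names a label that is not a key of the dictionary, the label is kept and its value is filled with NaN. The label 4 is chosen only for illustration.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Label 3 is a key of the dictionary; label 4 is not,
# so the resulting entry for 4 is missing (NaN).
partial = pd.Series(letters, index=[3, 4])
print(partial)
```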

###### The Pandas DataFrame Object
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [33]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

In [34]:
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [99]:
president_dict={'California':'gavin','Texas':'deepika','New York':'siddu','Florida':'siddaram','Illinois':'deepi'}
president=pd.Series(president_dict)
president


California       gavin
Texas          deepika
New York         siddu
Florida       siddaram
Illinois         deepi
dtype: object

In [102]:
states_labeled = pd.DataFrame({'Population': population, 'Area': area, 'President': president})
# The cells below use a two-column frame with lowercase column names:
states = pd.DataFrame({'population': population, 'area': area})
states_labeled

            Population    Area President
California    38332521  423967     gavin
Texas         26448193  695662   deepika
New York      19651127  141297     siddu
Florida       19552860  170312  siddaram
Illinois      12882135  149995     deepi


In [36]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [115]:
# Convert the 'area' Series to a DataFrame
area_df = area.to_frame(name='Area')
population_df=population.to_frame(name='Population')

print(area_df)
print(population_df)

              Area
California  423967
Texas       695662
New York    141297
Florida     170312
Illinois    149995
            Population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [None]:
country_dict={"country":['india','usa','bhutan','uk','france'],
              "population":[200,300,500,400,600],
              "area":[100,200,300,400,500]}
df=pd.DataFrame(country_dict)
#print(df)
#print(df.index)
#print(df.columns)
df['density']=df['population']/df['area']

df

   country  population  area   density
0    india         200   100  2.000000
1      usa         300   200  1.500000
2   bhutan         500   300  1.666667
3       uk         400   400  1.000000
4   france         600   500  1.200000


In [37]:
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]], dtype=int64)

In [38]:
states.columns

Index(['population', 'area'], dtype='object')

###### DataFrame as specialized dictionary

In [39]:
#dictionary-style access
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [40]:
#Equivalently, we can use attribute-style access with column names that are strings:
states.population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [None]:
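# In this session, attribute-style access returns the very same object as dictionary-style access.
# The identity check can be False in other configurations (for example with copy-on-write enabled),
# although both forms always refer to the same column data.
states['population'] is states.population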

True

In [42]:
states.rename(columns={'population': 'pop'}, inplace=True)
states

                 pop    area
California  38332521  423967
Texas       26448193  695662
New York    19651127  141297
Florida     19552860  170312
Illinois    12882135  149995


In [43]:
# The DataFrame has a pop() method, so states.pop points to that method rather than to the "pop" column:
states.pop is states['pop']

False
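
The name clash above can be reproduced on a small, hypothetical frame. This sketch (the frame `demo` and its values are made up) checks that attribute access finds the method, while dictionary-style access still reaches the column.

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method
demo = pd.DataFrame({'pop': [1, 2], 'area': [10, 20]})

is_method = callable(demo.pop)      # attribute access finds DataFrame.pop, a method
pop_values = demo['pop'].tolist()   # dictionary-style access finds the column
area_values = demo.area.tolist()    # no clash for 'area', so attribute access works

print(is_method, pop_values, area_values)
```

For this reason dictionary-style access is the safer default whenever a column name might coincide with a method or attribute name.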

In [44]:
# When adding a new column, prefer dictionary-style assignment
states['density'] = states['pop'] / states['area']
states

                 pop    area     density
California  38332521  423967   90.413926
Texas       26448193  695662   38.018740
New York    19651127  141297  139.076746
Florida     19552860  170312  114.806121
Illinois    12882135  149995   85.883763


In [45]:
pd.DataFrame(states)

                 pop    area     density
California  38332521  423967   90.413926
Texas       26448193  695662   38.018740
New York    19651127  141297  139.076746
Florida     19552860  170312  114.806121
Illinois    12882135  149995   85.883763


In [46]:
df=pd.DataFrame(states,columns=['pop'])
df

                 pop
California  38332521
Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135


In [47]:
df.dtypes

pop    int64
dtype: object

###### From list of dict

In [48]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [49]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [50]:
#Constructing DataFrame from a dictionary.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df

   col1  col2
0     1     3
1     2     4


In [51]:
df.dtypes

col1    int64
col2    int64
dtype: object

In [86]:
df = pd.DataFrame(data=d,dtype='int8')
df


   col1  col2
0     1     3
1     2     4


In [87]:
df.dtypes

col1    int8
col2    int8
dtype: object

In [95]:
# From a two-dimensional NumPy array.
# Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.
# If omitted, an integer index will be used for each.
# Alternative with floats: pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])
pd.DataFrame(np.random.randint(0, 3, size=(3, 2)), columns=['foo', 'bar'], index=['a', 'b', 'c'])

   foo  bar
a    2    2
b    0    1
c    0    2


In [55]:
#From a NumPy structured array.
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [56]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


In [57]:
A.dtype

dtype([('A', '<i8'), ('B', '<f8')])

###### Indexers loc and iloc 

In [58]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [59]:
# explicit index when indexing
data[1]

'a'

In [60]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

In [61]:
data.iloc[1]

'b'

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer 
attributes that explicitly expose certain indexing schemes. These are not functional methods, 
but attributes that expose a particular slicing interface to the data in the Series.

In [62]:
# First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]

'a'

In [63]:
data.loc[1:3]

1    a
3    b
dtype: object

In [64]:
#The iloc attribute allows indexing and slicing that always references the implicit Python-style index
data.iloc[1]

'b'

In [65]:
data.iloc[1:3]

3    b
5    c
dtype: object

###### DataFrame as two-dimensional array

In [66]:
states

                 pop    area     density
California  38332521  423967   90.413926
Texas       26448193  695662   38.018740
New York    19651127  141297  139.076746
Florida     19552860  170312  114.806121
Illinois    12882135  149995   85.883763


In [67]:
states.values

array([[3.83325210e+07, 4.23967000e+05, 9.04139261e+01],
       [2.64481930e+07, 6.95662000e+05, 3.80187404e+01],
       [1.96511270e+07, 1.41297000e+05, 1.39076746e+02],
       [1.95528600e+07, 1.70312000e+05, 1.14806121e+02],
       [1.28821350e+07, 1.49995000e+05, 8.58837628e+01]])

In [68]:
states.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [69]:
states.iloc[:3,:2]

                 pop    area
California  38332521  423967
Texas       26448193  695662
New York    19651127  141297


In [70]:
states.loc['Illinois','pop']

12882135

In [71]:
states['density']= states['pop']/states['area']
states

                 pop    area     density
California  38332521  423967   90.413926
Texas       26448193  695662   38.018740
New York    19651127  141297  139.076746
Florida     19552860  170312  114.806121
Illinois    12882135  149995   85.883763


In [72]:
#In the loc indexer we can combine masking and fancy indexing
states.loc[states.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [73]:
# Indexing may also be used to set or modify values
states.iloc[0, 2] = 90
states

                 pop    area     density
California  38332521  423967   90.000000
Texas       26448193  695662   38.018740
New York    19651127  141297  139.076746
Florida     19552860  170312  114.806121
Illinois    12882135  149995   85.883763


In [74]:
# On a DataFrame, plain indexing refers to columns, while slicing refers to rows:
states['Florida':'Illinois']

               pop    area     density
Florida   19552860  170312  114.806121
Illinois  12882135  149995   85.883763


In [75]:
# Such slices can also refer to rows by position rather than by label:
states[3:5]

               pop    area     density
Florida   19552860  170312  114.806121
Illinois  12882135  149995   85.883763


In [76]:
# Direct masking operations are also interpreted row-wise rather than column-wise:
states[states.density > 100]

               pop    area     density
New York  19651127  141297  139.076746
Florida   19552860  170312  114.806121


###### Handling Missing Data
Pandas chose to use sentinels for missing data, and further chose to use two already-existing <b>Python null values: the special floating-point NaN (Not a Number) value, and the Python None object</b>.

<b>None: Pythonic missing data</b>
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in an arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects).

In [77]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

In [78]:
# compare the overhead of the object dtype with a native integer dtype
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
66.7 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
2.57 ms ± 621 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



In [118]:
# Because the array holds Python objects, aggregations like sum() or min()
# across an array containing None will generally raise an error:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

###### NaN: Missing numerical data

In [119]:
#it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

In [120]:
1 + np.nan

nan

In [121]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

In [122]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
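
For comparison, a minimal sketch of the same aggregation on a Pandas Series: the Series reductions skip NaN by default (`skipna=True`), so they behave like the NumPy `nan*` functions unless asked otherwise. The variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, 4])

total = s.sum()                      # NaN is skipped by default
smallest = s.min()                   # likewise for min()
total_strict = s.sum(skipna=False)   # propagate NaN, as plain NumPy sum() does

print(total, smallest, total_strict)
```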

In [123]:
#NaN and None in Pandas
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [124]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [125]:
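# Storing NaN in an integer Series upcasts it to float64
# (recent pandas versions warn that this implicit upcast is deprecated)
x[0] = np.nan
x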

0    NaN
1    1.0
dtype: float64

###### Pandas handling of NAs by type
<pre><b>Typeclass           Conversion when storing NAs          NA sentinel value  </b>
   floating             No change                           np.nan
   object               No change                           None or np.nan
   integer              Cast to float64                     np.nan
   boolean              Cast to object                      None or np.nan  </pre>
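
The conversions in the table can be checked directly. The sketch below constructs small Series containing a missing value and inspects the resulting dtype; the last line additionally shows the nullable `Int64` extension dtype, which is not part of the table above and is included here only as a point of comparison.

```python
import pandas as pd

ints = pd.Series([1, 2, None])                       # integer data plus an NA
bools = pd.Series([True, False, None])               # boolean data plus an NA
nullable = pd.Series([1, 2, None], dtype="Int64")    # nullable integer extension dtype

int_dtype = str(ints.dtype)           # integers are cast to float64
bool_dtype = str(bools.dtype)         # booleans are cast to object
nullable_dtype = str(nullable.dtype)  # stays integer; the NA is stored as pd.NA

print(int_dtype, bool_dtype, nullable_dtype)
```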

###### Operating on Null Values

###### Detecting null values
Pandas data structures have two useful methods for detecting null data: <b> isnull() and notnull()</b>. Either one will return a Boolean mask over the data. For example:

In [129]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [130]:
data.notnull().any()

True

In [131]:
#Boolean masks can be used directly as a Series or DataFrame index:
data[data.notnull()]

0        1
2    hello
dtype: object
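
The complementary method, `isnull()`, can be sketched the same way. Because the mask is Boolean, summing it counts the missing entries. The name `sample` is used only for this example.

```python
import numpy as np
import pandas as pd

sample = pd.Series([1, np.nan, 'hello', None])

mask = sample.isnull()     # True where the value is NaN or None
n_null = int(mask.sum())   # each True counts as 1

print(mask.tolist())
print(n_null)
```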

###### Dropping null values
In addition to the masking used before, there are the convenience methods <b>dropna()</b> (which removes NA values) and <b>fillna()</b> (which fills in NA values).

In [None]:
data.dropna(ignore_index=True)

In [133]:
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]],
                  columns=['A','B','C'])
df

     A    B  C
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [134]:
df.dropna()

     A    B  C
1  2.0  3.0  5


In [135]:
df.dropna(axis=1) #or axis='columns'

   C
0  2
1  5
2  6


In [136]:
# Add a column that contains only null values
df['D'] = np.nan
df

     A    B  C   D
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [137]:
# Drop only the columns in which every value is null
df.dropna(axis=1, how='all')

     A    B  C
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [138]:
df

     A    B  C   D
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [146]:
# The thresh parameter lets you specify a minimum number of
# non-null values for the row/column to be kept.
# Here column D has been dropped, because it contains no
# non-null values, while A, B, and C each contain at least one.
df.dropna(axis='columns', thresh=1)

     A    B  C
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
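
The same parameter applies along rows. This sketch rebuilds the four-column frame from the cells above under the name `frame` and keeps only rows with at least three non-null values; rows 0 and 2 have two non-null values each, so only row 1 remains.

```python
import numpy as np
import pandas as pd

# Same data as the df used above, including the all-null column D
frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]],
                     columns=['A', 'B', 'C'])
frame['D'] = np.nan

# Keep a row only if it has at least 3 non-null values
kept = frame.dropna(axis='rows', thresh=3)
print(kept)
```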


###### Filling null values

In [147]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [148]:
#We can fill NA entries with a single value, such as zero:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [149]:
# We can forward-fill to propagate the previous value forward
# (ffill() replaces the deprecated fillna(method='ffill')):
data.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [150]:
# Or we can back-fill to propagate the next values backward
# (bfill() replaces the deprecated fillna(method='bfill')):
data.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [151]:
#For DataFrame
df


     A    B  C   D
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [152]:
# Fill along each row, propagating values from left to right
df.ffill(axis=1)

     A    B    C    D
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


In [153]:
df

     A    B  C   D
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [154]:
# Fill along each row, propagating values from right to left
df.bfill(axis=1)

     A    B    C   D
0  1.0  2.0  2.0 NaN
1  2.0  3.0  5.0 NaN
2  4.0  4.0  6.0 NaN


###### Hierarchical(multi) Indexing

In [155]:
# A first attempt: use Python tuples as index keys
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [156]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

In [157]:
"""But the convenience ends there. For example, if you need to select all values from 2010, you’ll need to do some messy (and potentially slow) munging to make it
happen:"""
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

In [158]:
#The better way: Pandas MultiIndex
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [159]:
"""Here the first two columns of the Series representation show the multiple index values,
while the third column shows the data. Notice that some entries are missing in
the first column: in this multi-index representation, any blank entry indicates the
same value as the line above it"""
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [160]:
#to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
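
A self-contained sketch of the same idea with the explicit `loc` indexer: partial indexing on the first level, slicing across it to pick one value of the second level, and a full tuple for a single entry. The data repeats the populations used above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

california = pop.loc['California']   # partial indexing: a Series indexed by year
year_2010 = pop.loc[:, 2010]         # all states for the year 2010
single = pop.loc[('Texas', 2000)]    # one entry addressed by a full tuple

print(california)
print(year_2010)
print(single)
```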

###### MultiIndex as extra dimension

In [161]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [None]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [None]:
# We might want to add another column of demographic data for each state at each year
# (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [None]:
#we compute the fraction of people under 18 by year, given the above data
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


###### MultiIndex level names

In [None]:
pop_df.index.names = ['state', 'year']
pop.index.names = ['state', 'year']  # name the levels on the Series as well
pop_df

                    total  under18
state      year
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


###### MultiIndex for columns

In [None]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data (random, so the values differ from run to run)
data = np.round(np.random.randn(4, 6), 1)
print(data)
data[:, ::2] *= 10  # scale the heart-rate (HR) columns
print(data)
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

[[ 2.1 -0.7  0.   0.9 -0.7 -0.3]
 [ 0.5 -1.8 -0.3  0.5 -0.1  0.6]
 [ 0.4  1.7 -0.4  0.4 -0.1 -0.7]
 [ 1.   0.4  1.8  1.2 -0.7  0.8]]
[[21.  -0.7  0.   0.9 -7.  -0.3]
 [ 5.  -1.8 -3.   0.5 -1.   0.6]
 [ 4.   1.7 -4.   0.4 -1.  -0.7]
 [10.   0.4 18.   1.2 -7.   0.8]]


subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      58.0  36.3  37.0  37.9  30.0  36.7
     2      42.0  35.2  34.0  37.5  36.0  37.6
2014 1      41.0  38.7  33.0  37.4  36.0  36.3
     2      47.0  37.4  55.0  38.2  30.0  37.8


In [None]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      37.0  37.9
     2      34.0  37.5
2014 1      33.0  37.4
     2      55.0  38.2
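
Selection can go further than the top column level. The sketch below uses a frame with the same index and column structure but deterministic values (`np.arange`, an assumption made so the results are reproducible) and shows three access patterns: a single column by a two-part key, a single cell through `loc`, and every `HR` column through `pd.IndexSlice`.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Deterministic stand-in values 0..23 instead of random data
health = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                      index=index, columns=columns)

guido_hr = health['Guido', 'HR']                  # one column, returned as a Series
cell = health.loc[(2013, 1), ('Bob', 'Temp')]     # one cell: row tuple, column tuple
all_hr = health.loc[:, pd.IndexSlice[:, 'HR']]    # the HR column of every subject

print(guido_hr.tolist())
print(cell)
print(list(all_hr.columns))
```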


###### Index setting and resetting

In [None]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [None]:
pop_flat = pop.reset_index(name='population')
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


In [None]:
pop_multi=pop_flat.set_index(['state', 'year'])
pop_multi

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561


###### Data Aggregations on Multi-Indices

In [None]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      58.0  36.3  37.0  37.9  30.0  36.7
     2      42.0  35.2  34.0  37.5  36.0  37.6
2014 1      41.0  38.7  33.0  37.4  36.0  36.3
     2      47.0  37.4  55.0  38.2  30.0  37.8


In [None]:
health_data.mean(axis='columns')

year  visit
2013  1        39.316667
      2        37.050000
2014  1        37.066667
      2        40.900000
dtype: float64

In [None]:
health_data.mean(axis='rows')

subject  type
Bob      HR      47.00
         Temp    36.90
Guido    HR      39.75
         Temp    37.75
Sue      HR      33.00
         Temp    37.10
dtype: float64
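
The two calls above average over a whole axis. To aggregate within one level of a MultiIndex instead, one option is `groupby(level=...)`. The sketch below assumes a frame shaped like `health_data` but filled with deterministic values so the results can be checked; for the column level it transposes, groups, and transposes back.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Deterministic stand-in values 0..23 instead of random data
health = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                      index=index, columns=columns)

# Average the two visits within each year (row level 'year')
by_year = health.groupby(level='year').mean()

# Average across subjects within each measurement type (column level 'type')
by_type = health.T.groupby(level='type').mean().T

print(by_year)
print(by_type)
```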

###### Recall: Concatenation of NumPy Arrays

In [85]:
x = [[1], [2], [3]]
y = [[4], [5], [6]]
z = [[7], [8], [9]]
np.concatenate([x, y, z])

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [84]:
# The same call joins one-dimensional sequences end to end
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [83]:
x = [[1, 2],[3, 4]]
np.concatenate([x, x])

array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
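
By default `np.concatenate` joins along the first axis, as in the cells above. A short sketch of joining the same two-dimensional input along the second axis instead, using the `axis` argument:

```python
import numpy as np

x = [[1, 2], [3, 4]]

stacked = np.concatenate([x, x])                # default axis=0: rows are appended
side_by_side = np.concatenate([x, x], axis=1)   # axis=1: columns are appended

print(stacked.shape)
print(side_by_side)
```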

In [82]:
import pandas as pd
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [50000, 60000]})
high_salary = df[df["Salary"] > 55000]
print(df)
print(high_salary)

    Name  Salary
0  Alice   50000
1    Bob   60000
  Name  Salary
1  Bob   60000


In [81]:
# Dict
data = {"Name": ["Alice"], "Age": [25]}
# DataFrame
df = pd.DataFrame(data)
df.describe()  # Summary stats


        Age
count   1.0
mean   25.0
std     NaN
min    25.0
25%    25.0
50%    25.0
75%    25.0
max    25.0


In [80]:
import numpy as np
arr = np.array([1, 2, 3])
print(arr + 2)   # [3, 4, 5]
print(arr * 3)   # [3, 6, 9]

[3 4 5]
[3 6 9]
