# Getting Started Part 1

## Learning Outcomes



The training below is adapted from the [Pandas online reference](http://pandas.pydata.org/pandas-docs/stable/).

By the end of the workshop, students will have gained an appreciation of, and hands-on practical experience with, the following topics:

* Essential Basic Functionality
  * Head and Tail
  * Attributes and the raw ndarray(s)
  * Boolean Reductions
  * Descriptive statistics
  * Summarizing data: describe
  * Index of Min/Max Values
  * Value counts (histogramming) / Mode
  * Function application
  * Reindexing and altering labels
  * Iteration
  * Vectorized string methods
  * Sorting
  * Copying
* Indexing and Selecting Data
  * Different Choices for Indexing
  * Basics
  * Attribute Access
  * Slicing ranges
  * Selection By Label
  * Selection By Position
  * Selecting Random Samples
  * Setting With Enlargement
  * Fast scalar value getting and setting
  * Boolean indexing
  * Indexing with isin
  * The where() Method and Masking
  * The query() Method (Experimental)
  * Duplicate Data
  * Index objects
  * Set / Reset Index
  * Returning a view versus a copy

# Essential Basic Functionality

## Intro

In [1]:
import pandas as pd
import numpy as np
print("Pandas version : {}".format(pd.__version__))
print("Numpy version : {}".format(np.__version__))

Pandas version : 0.22.0
Numpy version : 1.14.3


In [2]:
index = pd.date_range('1/1/2000', periods=8)

In [3]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame(np.random.randn(8, 3), index=index,
                  columns=['A', 'B', 'C'])

In [4]:
s

a    0.145077
b    1.301458
c   -1.360039
d    0.075830
e   -1.105540
dtype: float64

In [5]:
df

Unnamed: 0,A,B,C
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,-0.629474,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399
2000-01-04,-0.65813,0.73145,2.327342
2000-01-05,0.95761,-0.816592,1.176973
2000-01-06,0.089039,-0.526848,0.111574
2000-01-07,-0.623493,-0.542045,-0.5596
2000-01-08,-0.477911,0.530496,0.281184


## Head and Tail

In [6]:
df.head()

Unnamed: 0,A,B,C
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,-0.629474,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399
2000-01-04,-0.65813,0.73145,2.327342
2000-01-05,0.95761,-0.816592,1.176973


In [7]:
df.tail(3)

Unnamed: 0,A,B,C
2000-01-06,0.089039,-0.526848,0.111574
2000-01-07,-0.623493,-0.542045,-0.5596
2000-01-08,-0.477911,0.530496,0.281184


## Attributes and the raw ndarray(s)


pandas objects have a number of attributes that let you access their metadata:

  * **shape**: gives the axis dimensions of the object, consistent with ndarray
  * Axis labels
    * **Series**: index (only axis)
    * **DataFrame**: index (rows) and columns

Note, **these attributes can be safely assigned to**!

In [8]:
df[:2]

Unnamed: 0,A,B,C
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,-0.629474,-1.121127,-0.234875


In [9]:
for x in df.columns:
    print(x.lower())

a
b
c


In [10]:
df.columns = [x.lower() for x in df.columns]

In [11]:
df.head(3)

Unnamed: 0,a,b,c
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,-0.629474,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399


To get the actual data inside a data structure, one need only access the **values** property:

In [12]:
s

a    0.145077
b    1.301458
c   -1.360039
d    0.075830
e   -1.105540
dtype: float64

In [13]:
s.values

array([ 0.14507741,  1.30145826, -1.36003948,  0.07583026, -1.10553952])

In [14]:
df.values

array([[-0.79904436,  0.26588283, -1.14922598],
       [-0.6294742 , -1.12112739, -0.23487517],
       [ 1.36319352, -1.5828402 ,  0.12039932],
       [-0.65813021,  0.73144986,  2.327342  ],
       [ 0.9576095 , -0.81659178,  1.17697345],
       [ 0.08903908, -0.52684828,  0.11157395],
       [-0.62349334, -0.54204458, -0.5595999 ],
       [-0.47791147,  0.53049582,  0.2811844 ]])

## Boolean Reductions


You can apply the reductions: **empty, any(), all(), and bool()** to provide a way to summarize a boolean result.

In [15]:
df

Unnamed: 0,a,b,c
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,-0.629474,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399
2000-01-04,-0.65813,0.73145,2.327342
2000-01-05,0.95761,-0.816592,1.176973
2000-01-06,0.089039,-0.526848,0.111574
2000-01-07,-0.623493,-0.542045,-0.5596
2000-01-08,-0.477911,0.530496,0.281184


In [16]:
(df > -1.1).all()

a     True
b    False
c    False
dtype: bool

In [21]:
(df > -1.1).all(axis=1)

2000-01-01    False
2000-01-02    False
2000-01-03    False
2000-01-04     True
2000-01-05     True
2000-01-06     True
2000-01-07     True
2000-01-08     True
Freq: D, dtype: bool

In [22]:
(df > -1.1).any(axis=0)

a    True
b    True
c    True
dtype: bool

In [23]:
(df > -1.1).any(axis=1)

2000-01-01    True
2000-01-02    True
2000-01-03    True
2000-01-04    True
2000-01-05    True
2000-01-06    True
2000-01-07    True
2000-01-08    True
Freq: D, dtype: bool

In [24]:
df.empty

False
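As a further illustration, reductions can be chained to collapse a whole DataFrame to a single truth value. The sketch below uses a small hypothetical frame named `demo` (not part of the workshop data) so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical two-column frame, chosen so the answers are obvious.
demo = pd.DataFrame({"a": [1, -2], "b": [3, 4]})

# One boolean per column: is every value in the column positive?
all_positive_per_column = (demo > 0).all()

# Chain a second reduction to get a single answer for the whole frame.
any_negative = (demo < 0).any().any()
everything_positive = (demo > 0).all().all()

# `empty` is an attribute, not a method: True when an axis has length 0.
empty_frame = pd.DataFrame(columns=["a", "b"]).empty
```

The first reduction works column by column (the default `axis=0`); the second reduces the resulting Series.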

## Descriptive statistics


pandas provides a large number of methods for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like **sum(), mean(), and quantile()**, but some of them, like **cumsum() and cumprod()**, produce an object of the same size. Generally speaking, these methods take an **axis** argument, just like `ndarray.{sum, std, ...}`, but the axis can be specified by name or integer:
  * **Series**: no axis argument needed
  * **DataFrame**: “index” (axis=0, default), “columns” (axis=1)
 

For example:

In [25]:
df

Unnamed: 0,a,b,c
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,-0.629474,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399
2000-01-04,-0.65813,0.73145,2.327342
2000-01-05,0.95761,-0.816592,1.176973
2000-01-06,0.089039,-0.526848,0.111574
2000-01-07,-0.623493,-0.542045,-0.5596
2000-01-08,-0.477911,0.530496,0.281184


In [26]:
df.mean(0)

a   -0.097276
b   -0.382703
c    0.259222
dtype: float64

In [27]:
df.mean(1)

2000-01-01   -0.560796
2000-01-02   -0.661826
2000-01-03   -0.033082
2000-01-04    0.800221
2000-01-05    0.439330
2000-01-06   -0.108745
2000-01-07   -0.575046
2000-01-08    0.111256
Freq: D, dtype: float64
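The axis can also be passed by name rather than by integer. A minimal sketch, assuming a small hypothetical frame `demo` with hand-checkable values:

```python
import pandas as pd

# Hypothetical frame with simple values (not the workshop's random data).
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

col_means = demo.mean(axis="index")    # same as demo.mean(0): one value per column
row_means = demo.mean(axis="columns")  # same as demo.mean(1): one value per row

# cumsum() is not an aggregation: the result has the same shape as the input.
running = demo.cumsum()
```

Naming the axis tends to make the direction of the reduction easier to read than a bare `0` or `1`.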

All such methods have a `skipna` option signaling whether to exclude missing data (`True` by default):

In [31]:
# assign through .loc in one step to avoid chained assignment
df.loc[df.index[1], 'a'] = np.nan

In [32]:
df

Unnamed: 0,a,b,c
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399
2000-01-04,-0.65813,0.73145,2.327342
2000-01-05,0.95761,-0.816592,1.176973
2000-01-06,0.089039,-0.526848,0.111574
2000-01-07,-0.623493,-0.542045,-0.5596
2000-01-08,-0.477911,0.530496,0.281184


In [33]:
df.sum(0, skipna=False)

a         NaN
b   -3.061624
c    2.073772
dtype: float64

In [34]:
df.sum(0, skipna=True)

a   -0.148737
b   -3.061624
c    2.073772
dtype: float64

In [35]:
df.sum(axis=1, skipna=True)

2000-01-01   -1.682388
2000-01-02   -1.356003
2000-01-03   -0.099247
2000-01-04    2.400662
2000-01-05    1.317991
2000-01-06   -0.326235
2000-01-07   -1.725138
2000-01-08    0.333769
Freq: D, dtype: float64

In [36]:
df.sum(axis=1, skipna=False)

2000-01-01   -1.682388
2000-01-02         NaN
2000-01-03   -0.099247
2000-01-04    2.400662
2000-01-05    1.317991
2000-01-06   -0.326235
2000-01-07   -1.725138
2000-01-08    0.333769
Freq: D, dtype: float64

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [37]:
ts_stand = (df - df.mean()) / df.std()

In [38]:
ts_stand.std()

a    1.0
b    1.0
c    1.0
dtype: float64

In [39]:
ts_stand

Unnamed: 0,a,b,c
2000-01-01,-0.901248,0.79019,-1.312629
2000-01-02,,-0.899643,-0.460482
2000-01-03,1.604181,-1.46216,-0.129378
2000-01-04,-0.737968,1.357403,1.927423
2000-01-05,1.134222,-0.528619,0.855316
2000-01-06,0.127792,-0.175616,-0.137603
2000-01-07,-0.697834,-0.19413,-0.763116
2000-01-08,-0.529145,1.112575,0.020469


In [40]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [41]:
xs_stand.std(1)

2000-01-01    1.0
2000-01-02    1.0
2000-01-03    1.0
2000-01-04    1.0
2000-01-05    1.0
2000-01-06    1.0
2000-01-07    1.0
2000-01-08    1.0
Freq: D, dtype: float64


| Function | Description |
| -------- | ----------- |
| count | Number of non-null observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |

## Summarizing data: describe

In [42]:
series = pd.Series(np.random.randn(1000))

In [43]:
series[::2] = np.nan

In [44]:
series.describe()

count    500.000000
mean      -0.028421
std        1.002925
min       -3.599437
25%       -0.686799
50%       -0.019284
75%        0.662105
max        2.818897
dtype: float64

In [45]:
series.head()

0         NaN
1    1.042622
2         NaN
3   -0.885065
4         NaN
dtype: float64

In [46]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [47]:
frame.describe()

Unnamed: 0,a,b,c,d,e
count,1000.0,1000.0,1000.0,1000.0,1000.0
mean,-0.005292,0.04299,0.002951,-0.035533,-0.018504
std,0.976631,0.992848,0.980683,0.971957,1.052476
min,-3.530319,-3.386821,-3.536984,-3.661615,-3.183637
25%,-0.682768,-0.613802,-0.695385,-0.638637,-0.702317
50%,-0.039041,0.03878,0.031765,-0.006155,0.013067
75%,0.658626,0.691465,0.672535,0.558253,0.630412
max,2.565119,3.774812,2.591257,3.114545,3.925733


In [48]:
frame.head()

Unnamed: 0,a,b,c,d,e
0,0.80687,0.176282,-0.240093,-0.656378,-0.379804
1,-0.701479,-1.362438,1.308786,2.205849,-0.938897
2,1.399468,0.747856,-0.789345,0.814696,-0.360119
3,0.613743,0.123427,0.523194,2.748108,0.436227
4,-1.536505,0.019166,-0.166729,-0.311372,-1.822145


You can select specific percentiles to include in the output:

In [49]:
series.describe(percentiles=[.05, .25, .75, .95])

count    500.000000
mean      -0.028421
std        1.002925
min       -3.599437
5%        -1.794677
25%       -0.686799
50%       -0.019284
75%        0.662105
95%        1.633106
max        2.818897
dtype: float64

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and most frequently occurring values:

In [50]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object

## Index of Min/Max Values


The **idxmin()** and **idxmax()** functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [51]:
s1 = pd.Series(np.random.randn(5))
s1

0    0.028606
1    0.518212
2    0.418099
3    0.374567
4   -0.697900
dtype: float64

In [52]:
s1.idxmin(), s1.idxmax()

(4, 1)

In [53]:
df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])
df1

Unnamed: 0,A,B,C
0,-0.319106,0.20236,-0.629237
1,-0.817858,1.750934,0.389699
2,1.739465,-1.961444,0.677205
3,1.529951,-2.083321,1.470516
4,0.110009,1.001127,1.092521


In [54]:
df1.idxmin(axis=0)

A    1
B    3
C    0
dtype: int64

In [55]:
df1.idxmax(axis=1)

0    B
1    B
2    A
3    A
4    C
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [56]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [57]:
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [58]:
df3['A'].idxmin()

'd'

`idxmin` and `idxmax` are called `argmin` and `argmax` in NumPy.
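One difference is worth spelling out: NumPy's `argmin` returns an integer position, whereas `idxmin()` returns an index label. The sketch below uses a hypothetical Series `s_demo` to show how the two relate.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with non-integer labels and a tied minimum.
s_demo = pd.Series([2.0, 1.0, 1.0, 3.0], index=list("edcb"))

label = s_demo.idxmin()                        # label of the first minimum
position = int(np.argmin(s_demo.to_numpy()))   # position in the raw array

# The label found by idxmin() sits at the position found by argmin.
same_element = s_demo.index[position] == label
```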

## Value counts (histogramming) / Mode


The **value_counts()** Series method and top-level function computes a histogram of a 1D array of values. It can also be used as a function on regular arrays:

In [59]:
data = np.random.randint(0, 7, size=50)

In [60]:
data

array([0, 5, 0, 0, 2, 3, 6, 3, 0, 2, 4, 0, 1, 3, 6, 0, 2, 5, 2, 0, 0, 1,
       0, 3, 5, 1, 0, 6, 2, 5, 0, 3, 6, 2, 2, 0, 5, 5, 5, 1, 4, 1, 5, 2,
       1, 2, 4, 0, 2, 2])

In [61]:
s = pd.Series(data)

In [62]:
s.value_counts()

0    13
2    11
5     8
1     6
3     5
6     4
4     3
dtype: int64

In [63]:
pd.value_counts(data)

0    13
2    11
5     8
1     6
3     5
6     4
4     3
dtype: int64

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [64]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [66]:
s5

0    1
1    1
2    3
3    3
4    3
5    5
6    5
7    7
8    7
9    7
dtype: int64

In [65]:
s5.mode()

0    3
1    7
dtype: int64

In [67]:
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})

In [68]:
df5

Unnamed: 0,A,B
0,4,5
1,6,-2
2,2,-9
3,5,7
4,6,-3
5,2,-4
6,6,13
7,3,12
8,0,-5
9,2,-6


In [69]:
df5.mode()

Unnamed: 0,A,B
0,6,5


## Function application


To apply your own or another library’s functions to pandas objects, you should be aware of the three methods below. The appropriate method to use depends on whether your function expects to operate on an entire `DataFrame` or `Series`, row- or column-wise, or elementwise.
1. Tablewise Function Application: **pipe()**
2. Row or Column-wise Function Application: **apply()**
3. Elementwise function application: **applymap()**
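The sections that follow cover `apply()` and `applymap()`. For `pipe()`, here is a minimal sketch of tablewise application. The helper functions `add_total` and `keep_above` are hypothetical and exist only for this illustration.

```python
import pandas as pd

def add_total(frame, columns):
    """Hypothetical helper: return a copy with a 'total' column."""
    out = frame.copy()
    out["total"] = out[columns].sum(axis=1)
    return out

def keep_above(frame, column, threshold):
    """Hypothetical helper: keep rows where `column` exceeds `threshold`."""
    return frame[frame[column] > threshold]

demo = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Each function receives the whole DataFrame as its first argument.
result = demo.pipe(add_total, columns=["x", "y"]).pipe(keep_above, "total", 15)
```

Written without `pipe()`, the same chain would be `keep_above(add_total(demo, columns=["x", "y"]), "total", 15)`, which reads inside-out.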

### apply

In [70]:
df

Unnamed: 0,a,b,c
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,,-1.121127,-0.234875
2000-01-03,1.363194,-1.58284,0.120399
2000-01-04,-0.65813,0.73145,2.327342
2000-01-05,0.95761,-0.816592,1.176973
2000-01-06,0.089039,-0.526848,0.111574
2000-01-07,-0.623493,-0.542045,-0.5596
2000-01-08,-0.477911,0.530496,0.281184


In [71]:
df.apply(np.mean)

a   -0.021248
b   -0.382703
c    0.259222
dtype: float64

In [72]:
df.apply(np.mean, axis=1)

2000-01-01   -0.560796
2000-01-02   -0.678001
2000-01-03   -0.033082
2000-01-04    0.800221
2000-01-05    0.439330
2000-01-06   -0.108745
2000-01-07   -0.575046
2000-01-08    0.111256
Freq: D, dtype: float64

$$f(x) = \max(x) - \min(x)$$

In [73]:
df.apply(lambda x: x.max() - x.min())

a    2.162238
b    2.314290
c    3.476568
dtype: float64

In [74]:
df.apply(np.cumsum)

Unnamed: 0,a,b,c
2000-01-01,-0.799044,0.265883,-1.149226
2000-01-02,,-0.855245,-1.384101
2000-01-03,0.564149,-2.438085,-1.263702
2000-01-04,-0.093981,-1.706635,1.06364
2000-01-05,0.863628,-2.523227,2.240614
2000-01-06,0.952668,-3.050075,2.352188
2000-01-07,0.329174,-3.59212,1.792588
2000-01-08,-0.148737,-3.061624,2.073772


In [75]:
df.apply(np.exp)

Unnamed: 0,a,b,c
2000-01-01,0.449759,1.304582,0.316882
2000-01-02,,0.325912,0.79067
2000-01-03,3.908656,0.205391,1.127947
2000-01-04,0.517819,2.078091,10.250659
2000-01-05,2.605461,0.441935,3.24454
2000-01-06,1.093123,0.590463,1.118036
2000-01-07,0.536068,0.581558,0.571438
2000-01-08,0.620077,1.699775,1.324698


In [76]:
df4 = pd.DataFrame(np.random.randn(4,3), index=['a','b','c','d'],columns=['one','two','three'])

In [77]:
df4

Unnamed: 0,one,two,three
a,0.051365,-1.145719,-1.845689
b,0.616528,-1.081943,-0.724944
c,-0.9678,1.844171,-0.627653
d,0.093921,-0.222367,-0.639294


### Applying elementwise Python functions


Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods **applymap()** on DataFrame and analogously **map()** on Series accept any Python function taking a single value and returning a single value. For example:

In [78]:
f = lambda x: len(str(x))

In [79]:
df4['one'].map(f)

a    19
b    18
c    19
d    19
Name: one, dtype: int64

In [82]:
df4.applymap(f)

Unnamed: 0,one,two,three
a,19,19,19
b,18,18,19
c,19,17,19
d,19,18,19


## Reindexing and altering labels


**reindex()** is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:
  * Reorders the existing data to match a new set of labels
  * Inserts missing value (NA) markers in label locations where no data for that label existed
  * If specified, fill data for missing labels using logic (highly relevant to working with time series data)

Here is a simple example:

In [83]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [84]:
s

a   -1.322675
b    1.175128
c    0.530356
d    2.084112
e   -0.880225
dtype: float64

In [85]:
s.reindex(['e', 'b', 'f', 'd'])

e   -0.880225
b    1.175128
f         NaN
d    2.084112
dtype: float64

In [86]:
s.reindex()

a   -1.322675
b    1.175128
c    0.530356
d    2.084112
e   -0.880225
dtype: float64

In [87]:
s.reset_index()

Unnamed: 0,index,0
0,a,-1.322675
1,b,1.175128
2,c,0.530356
3,d,2.084112
4,e,-0.880225


In the `s.reindex(['e', 'b', 'f', 'd'])` call above, the `f` label was not contained in the Series and hence appears as NaN in the result.

With a DataFrame, you can simultaneously reindex the index and columns:

In [88]:
df4

Unnamed: 0,one,two,three
a,0.051365,-1.145719,-1.845689
b,0.616528,-1.081943,-0.724944
c,-0.9678,1.844171,-0.627653
d,0.093921,-0.222367,-0.639294


In [89]:
df4.reindex(index=['a', 'd', 'c'], columns=['two', 'three', 'one'])

Unnamed: 0,two,three,one
a,-1.145719,-1.845689,0.051365
d,-0.222367,-0.639294,0.093921
c,1.844171,-0.627653,-0.9678


### Aligning objects with each other with align

The **align()** method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):
* `join='outer'`: take the union of the indexes (default)
* `join='left'`: use the calling object’s index
* `join='right'`: use the passed object’s index
* `join='inner'`: intersect the indexes

It returns a tuple with both of the reindexed Series:

In [None]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [None]:
s

In [None]:
s1 = s[:4]

In [None]:
s1

In [None]:
s2 = s[1:]

In [None]:
s2

In [None]:
s1.align(s2)

In [None]:
s1.align(s2, join='inner')   # intersect

In [None]:
s1.align(s2, join='left')   
# Left =  use the calling object’s index. s1 is the calling object

For DataFrames, the join method will be applied to both the index and the columns by default.

You can also pass an axis option to only align on the specified axis.
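A minimal sketch of both behaviors, assuming two small hypothetical frames `left` and `right` that only partly share their row and column labels:

```python
import pandas as pd

left = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])
right = pd.DataFrame({"two": [7, 8], "three": [9, 10]}, index=["b", "c"])

# Default: the join is applied to both the rows and the columns.
l_both, r_both = left.align(right, join="inner")

# axis=0: only the row labels are aligned; each frame keeps its own columns.
l_rows, r_rows = left.align(right, join="inner", axis=0)
```

With `join="inner"` on both axes, only rows `b`, `c` and column `two` survive in each result.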

### Filling while reindexing



**reindex()** takes an optional parameter `method`, which is a filling method chosen from the following table:

| Method | Action | 
| ----- |  ----- | 
| pad / ffill | Fill values forward | 
| bfill / backfill | Fill values backward | 
| nearest | Fill from the nearest index value | 

We illustrate these fill methods on a simple Series:


In [None]:
rng = pd.date_range('1/3/2000', periods=8)

In [None]:
ts = pd.Series(np.random.randn(8), index=rng)

In [None]:
ts2 = ts.iloc[[0, 3, 6]]   # select by position explicitly

In [None]:
ts

In [None]:
ts2

In [None]:
ts2.reindex(ts.index)

In [None]:
ts2.reindex(ts.index, method='ffill')

In [None]:
ts2.reindex(ts.index, method='bfill')

In [None]:
ts2.reindex(ts.index, method='nearest')

These methods require that the indexes are **ordered** increasing or decreasing.

Note that the same result could have been achieved using `fillna` (except for `method='nearest'`) or interpolate:

In [None]:
ts2.reindex(ts.index).fillna(method='ffill')

### Limits on filling while reindexing




The limit and tolerance arguments provide additional control over filling while reindexing. Limit specifies the maximum count of consecutive matches:

In [None]:
ts2.reindex(ts.index, method='ffill', limit=1)

In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [None]:
ts2.reindex(ts.index, method='ffill', tolerance='1 day')

Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
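The effect of `limit` and `tolerance` is easier to see with fixed values than with random data. The following is a sketch on a hypothetical six-day Series `sparse` that has observations only on the first and fourth day.

```python
import pandas as pd

rng_demo = pd.date_range("2000-01-03", periods=6)

# Hypothetical data: values on 2000-01-03 and 2000-01-06 only.
sparse = pd.Series([1.0, 2.0], index=rng_demo[[0, 3]])

# Unlimited forward fill: every gap is filled from the last observation.
filled = sparse.reindex(rng_demo, method="ffill")

# limit=1: at most one consecutive label is filled after each observation.
limited = sparse.reindex(rng_demo, method="ffill", limit=1)

# tolerance='1 day': fill only when the source label is at most one day away.
tolerant = sparse.reindex(rng_demo, method="ffill", tolerance="1 day")
```

On this daily index the two restrictions happen to leave the same labels unfilled. They differ in general, because `limit` counts filled labels while `tolerance` measures distance in index units.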

### Dropping labels from an axis

In [None]:
df4

In [None]:
df4.drop(['a', 'd'], axis=0)

In [None]:
df4.drop(['one'], axis=1)

### Renaming / mapping labels

In [None]:
s

In [None]:
s.rename(str.upper)

In [None]:
df4

In [None]:
df4.rename(columns={'one' : 'foo', 'two' : 'bar'},
           index={'a' : 'apple', 'b' : 'banana', 'd' : 'durian'})

The **rename()** method also provides an `inplace` named parameter that is by default `False` and copies the underlying data. Pass `inplace=True` to rename the data in place.
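A short sketch of the difference, using a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["a", "b"])

# Default (inplace=False): a new object is returned, demo is unchanged.
renamed = demo.rename(columns={"one": "foo"})

# inplace=True: demo itself is modified and the call returns None.
result = demo.rename(columns={"two": "bar"}, inplace=True)
```

Because the in-place call returns `None`, assigning its result back to `demo` would discard the DataFrame.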

## Iteration


The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame, follow the dict-like convention of iterating over the “keys” of the objects.

In short, basic iteration (`for i in object`) produces:
  * **Series**: values
  * **DataFrame**: column labels

Thus, for example, iterating over a DataFrame gives you the column names:

In [90]:
df = pd.DataFrame({'col1' : np.random.randn(3), 'col2' : np.random.randn(3)},
                  index=['a', 'b', 'c'])

In [91]:
for col in df:
    print(col)

col1
col2


Pandas objects also have the dict-like **iteritems()** method to iterate over the (key, value) pairs.

To iterate over the rows of a DataFrame, you can use the following methods:

  * **iterrows()**: Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series objects, which can change the dtypes and has some performance implications.
  * **itertuples()**: Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than **iterrows()**, and is in most cases preferable to use to iterate over the values of a DataFrame.
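The dtype change mentioned for `iterrows()` can be seen in a frame with mixed numeric columns. This is a minimal sketch; the frame `mixed` and its column names are hypothetical.

```python
import pandas as pd

# One integer column and one float column.
mixed = pd.DataFrame({"qty": [1, 2], "price": [1.5, 2.5]})

# iterrows() packs each row into a single Series, so the integer
# is upcast to float to share one dtype with the price.
first_label, first_row = next(mixed.iterrows())

# itertuples() keeps each value separate; fields are reached by name.
first_tuple = next(mixed.itertuples())
```

Here `first_row` has dtype `float64` even though the `qty` column is stored as integers, while `first_tuple.qty` is still an integer.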

In [92]:
df4

Unnamed: 0,one,two,three
a,0.051365,-1.145719,-1.845689
b,0.616528,-1.081943,-0.724944
c,-0.9678,1.844171,-0.627653
d,0.093921,-0.222367,-0.639294


In [93]:
for i in df4:
    print(i)

one
two
three


In [97]:
for i in df4.iterrows():
    print(i)

('a', one      0.051365
two     -1.145719
three   -1.845689
Name: a, dtype: float64)
('b', one      0.616528
two     -1.081943
three   -0.724944
Name: b, dtype: float64)
('c', one     -0.967800
two      1.844171
three   -0.627653
Name: c, dtype: float64)
('d', one      0.093921
two     -0.222367
three   -0.639294
Name: d, dtype: float64)


In [98]:
i

('d', one      0.093921
 two     -0.222367
 three   -0.639294
 Name: d, dtype: float64)

In [105]:
i[0]

'd'

In [104]:
i[1]

one      0.093921
two     -0.222367
three   -0.639294
Name: d, dtype: float64

In [96]:
for i in df4.itertuples():
    print(i)

Pandas(Index='a', one=0.05136465225924702, two=-1.1457186852050847, three=-1.8456890312831202)
Pandas(Index='b', one=0.6165279144688737, two=-1.081943086770827, three=-0.7249437264945804)
Pandas(Index='c', one=-0.9678000385024236, two=1.844171141312736, three=-0.6276525819138462)
Pandas(Index='d', one=0.09392083243126287, two=-0.222366503620577, three=-0.6392944451361638)


**Warning** Iterating through pandas objects is generally slow. In many cases a vectorized operation, a built-in method, or `apply()` avoids the explicit loop. Consult the [Pandas online reference](http://pandas.pydata.org/pandas-docs/stable/) for details.
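As an illustration of one such alternative, the sketch below computes the same column sum twice on a hypothetical frame `demo`: once with an explicit loop and once with a single vectorized expression.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) * 2.0})

# Row-by-row loop: correct, but does Python-level work for every row.
loop_total = [row.x + row.y for row in demo.itertuples(index=False)]

# Vectorized: one operation over whole columns, no explicit loop.
vector_total = demo["x"] + demo["y"]
```

Both give the same numbers; the vectorized form is usually the faster one on large frames, though the actual gain depends on the data and the operation.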

## Vectorized string methods


Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s `str` attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:

In [106]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [107]:
s

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [108]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Please see [Vectorized String Methods](http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods) for a complete description.
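A few more `str` methods, sketched on a short hypothetical Series `words`, show how missing values are carried through:

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", np.nan, "dog"])

lengths = words.str.len()                   # the missing entry stays missing
upper = words.str.upper()                   # likewise
has_a = words.str.contains("a", na=False)   # na=False: count missing as no match
```

Note that `contains` is case-sensitive by default, so `"A"` does not match the pattern `"a"`.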

## Sorting

### By Index


The primary methods for sorting axis labels (indexes) are the `Series.sort_index()` and the `DataFrame.sort_index()` methods.

In [120]:
unsorted_df = pd.DataFrame(np.random.randn(12).reshape(4,3), index=['a', 'd', 'c', 'b'],
                           columns=['three', 'two', 'one'])

In [121]:
unsorted_df

Unnamed: 0,three,two,one
a,0.75196,-2.436346,0.772394
d,-1.400556,0.181898,0.914241
c,0.170898,0.193582,-1.025609
b,1.669731,-0.841138,0.86142


In [122]:
unsorted_df.sort_index()

Unnamed: 0,three,two,one
a,0.75196,-2.436346,0.772394
b,1.669731,-0.841138,0.86142
c,0.170898,0.193582,-1.025609
d,-1.400556,0.181898,0.914241


In [123]:
unsorted_df.sort_index(ascending=False)

Unnamed: 0,three,two,one
d,-1.400556,0.181898,0.914241
c,0.170898,0.193582,-1.025609
b,1.669731,-0.841138,0.86142
a,0.75196,-2.436346,0.772394


In [124]:
unsorted_df.sort_index(axis=1)

Unnamed: 0,one,three,two
a,0.772394,0.75196,-2.436346
d,0.914241,-1.400556,0.181898
c,-1.025609,0.170898,0.193582
b,0.86142,1.669731,-0.841138


In [125]:
unsorted_df['three'].sort_index()

a    0.751960
b    1.669731
c    0.170898
d   -1.400556
Name: three, dtype: float64

### By Values


**Series.sort_values()** and **DataFrame.sort_values()** are the entry points for **value** sorting (that is, sorting by the values in a column or row). For `axis=0`, **DataFrame.sort_values()** takes a `by` argument naming the column of the DataFrame that determines the sort order:

In [126]:
df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})

In [127]:
df1

Unnamed: 0,one,three,two
0,2,5,1
1,1,4,3
2,1,3,2
3,1,2,4


In [128]:
df1.sort_values(by='two')

Unnamed: 0,one,three,two
0,2,5,1
2,1,3,2
1,1,4,3
3,1,2,4


The `by` argument can take a list of column names, e.g.:

In [129]:
df1[['one', 'two', 'three']].sort_values(by=['one','two'])

Unnamed: 0,one,two,three
2,1,2,3
1,1,3,4
3,1,4,2
0,2,1,5


These methods have special treatment of NA values via the `na_position` argument:

In [130]:
s[2] = np.nan

In [131]:
s.sort_values()

0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
2     NaN
5     NaN
dtype: object

In [132]:
s.sort_values(na_position='first')

2     NaN
5     NaN
0       A
3    Aaba
1       B
4    Baca
6    CABA
8     cat
7     dog
dtype: object

### smallest / largest values

In [133]:
s = pd.Series(np.random.permutation(10))

In [134]:
s

0    6
1    7
2    3
3    8
4    1
5    5
6    2
7    0
8    9
9    4
dtype: int64

In [135]:
s.sort_values()

7    0
4    1
6    2
2    3
9    4
5    5
0    6
1    7
3    8
8    9
dtype: int64

In [136]:
s.nsmallest(3)

7    0
4    1
6    2
dtype: int64

In [137]:
s.nlargest(3)

8    9
3    8
1    7
dtype: int64

In [138]:
df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
                   'b': list('abdceff'),
                   'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})

In [139]:
df

Unnamed: 0,a,b,c
0,-2,a,1.0
1,-1,b,2.0
2,1,d,4.0
3,10,c,3.2
4,8,e,
5,11,f,3.0
6,-1,f,4.0


In [140]:
df.nlargest(3, 'a')

Unnamed: 0,a,b,c
5,11,f,3.0
3,10,c,3.2
4,8,e,


In [141]:
df.nlargest(5, ['a', 'c'])

Unnamed: 0,a,b,c
6,-1,f,4.0
5,11,f,3.0
3,10,c,3.2
4,8,e,
2,1,d,4.0


In [142]:
df.nsmallest(3, 'a')

Unnamed: 0,a,b,c
0,-2,a,1.0
1,-1,b,2.0
6,-1,f,4.0


In [143]:
df.nsmallest(5, ['a', 'c'])

Unnamed: 0,a,b,c
0,-2,a,1.0
2,1,d,4.0
4,8,e,
1,-1,b,2.0
6,-1,f,4.0


### Sorting by a multi-index column


You must be explicit about sorting when the column is a multi-index, and fully specify all levels to `by`.

In [144]:
df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])

In [145]:
df1

Unnamed: 0_level_0,a,a,b
Unnamed: 0_level_1,one,two,three
0,2,5,1
1,1,4,3
2,1,3,2
3,1,2,4


In [146]:
df1.sort_values(by=('a','two'))

Unnamed: 0_level_0,a,a,b
Unnamed: 0_level_1,one,two,three
3,1,2,4
2,1,3,2
1,1,4,3
0,2,5,1


## Copying



The **copy()** method on pandas objects copies the underlying data (though not the axis indexes, since they are immutable) and returns a new object. Note that **it is seldom necessary to copy objects**. For example, there are only a handful of ways to alter a DataFrame in-place:
* Inserting, deleting, or modifying a column
* Assigning to the `index` or `columns` attributes
* For homogeneous data, directly modifying the values via the `values` attribute or advanced indexing

To be clear, pandas methods generally do not modify your data as a side effect; almost all methods return new objects, leaving the original object untouched (unless you explicitly ask otherwise, for example with `inplace=True`). If data is modified, it is because you did so explicitly.
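For the cases where an independent object is needed, a minimal sketch of `copy()` on a hypothetical frame `original`:

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3]})

# copy() duplicates the underlying data, so the two frames are independent.
duplicate = original.copy()
duplicate.loc[0, "a"] = 100   # explicit modification of the copy only
```

After the assignment, `original` still holds `1` in its first row.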

### dtypes


A convenient **dtypes** attribute for DataFrames returns a Series with the data type of each column.

In [147]:
dft = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('20010102'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3,dtype='int8')))

In [148]:
dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.036121,1,foo,2001-01-02,1.0,False,1
1,0.222843,1,foo,2001-01-02,1.0,False,1
2,0.79194,1,foo,2001-01-02,1.0,False,1


In [149]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

On a `Series` use the `dtype` attribute.

In [150]:
dft['A'].dtype

dtype('float64')

If a pandas object contains data of multiple dtypes in a single column, the dtype of the column will be chosen to accommodate all of the data types (object is the most general).

In [151]:
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

In [152]:
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object

The method **get_dtype_counts()** will return the number of columns of each type in a `DataFrame`:

In [153]:
dft.get_dtype_counts()

bool              1
datetime64[ns]    1
float32           1
float64           1
int64             1
int8              1
object            1
dtype: int64

### defaults


By default integer types are `int64` and float types are `float64`, REGARDLESS of platform (32-bit or 64-bit).

### astype


You can use the **astype()** method to explicitly convert from one dtype to another. By default it returns a copy, even if the dtype is unchanged (pass `copy=False` to change this behavior). In addition, it raises an exception if the conversion is invalid.

In [154]:
df2 = pd.DataFrame(dict(A = pd.Series(np.random.randn(8), dtype='float16'),
                        B = pd.Series(np.random.randn(8)),
                        C = pd.Series(np.array(np.random.randn(8), dtype='uint8')) ))

In [155]:
df2

Unnamed: 0,A,B,C
0,-0.398438,-2.12128,0
1,-0.181763,-1.499138,0
2,0.149902,0.710156,1
3,0.393311,1.153551,0
4,1.1875,-1.578182,0
5,1.672852,-0.891052,1
6,-1.18457,-1.379523,0
7,-1.96582,0.740082,0


In [156]:
df2.dtypes

A    float16
B    float64
C      uint8
dtype: object

In [157]:
df2.astype('float32').dtypes

A    float32
B    float32
C    float32
dtype: object
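The exception raised on an invalid conversion can be caught like any other. A small sketch, with hypothetical string Series chosen so that one conversion succeeds and one fails:

```python
import pandas as pd

# Every string parses as an integer, so the conversion succeeds.
numbers_as_text = pd.Series(["1", "2", "3"])
converted = numbers_as_text.astype("int64")

# 'two' cannot be parsed as an integer, so astype raises ValueError.
bad = pd.Series(["1", "two", "3"])
try:
    bad.astype("int64")
    failed = False
except ValueError:
    failed = True
```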

*****

# Indexing and Selecting Data

## Different Choices for Indexing


* `.loc` is primarily label based,
* `.iloc` is primarily integer position based (from 0 to length-1 of the axis),


| Object Type | Indexers |
| ----- | ----- |
| Series | s.loc[indexer] |
| DataFrame | df.loc[row_indexer,column_indexer] |
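The table shows `.loc`; `.iloc` takes indexers of the same shape, interpreted as positions instead of labels. A minimal sketch on a hypothetical frame `frame_demo` with string row labels:

```python
import pandas as pd

frame_demo = pd.DataFrame({"A": [10, 20, 30], "B": [1.0, 2.0, 3.0]},
                          index=["x", "y", "z"])

by_label = frame_demo.loc["y", "A"]     # row label 'y', column label 'A'
by_position = frame_demo.iloc[1, 0]     # second row, first column

# Label slices include both endpoints; position slices exclude the stop.
label_slice = frame_demo.loc["x":"y", "A"]
position_slice = frame_demo.iloc[0:2, 0]
```

Both scalar lookups reach the same cell, and both slices select the rows labelled `x` and `y`.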


In [175]:
import pandas as pd
import numpy as np
print("Pandas version : {}".format(pd.__version__))
print("Numpy version : {}".format(np.__version__))

Pandas version : 0.22.0
Numpy version : 1.14.3


## Basics


| Object Type | Selection | Return Value Type |
| ----- | ----- | ----- |
| Series | series[label] | scalar value |
| DataFrame | frame[colname] | Series corresponding to colname |
| Panel | panel[itemname] | DataFrame corresponding to the itemname |

In [160]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489
2000-01-02,-0.505181,0.302138,-0.085456,-0.597374
2000-01-03,0.05924,-0.574444,0.526734,-0.207717
2000-01-04,-0.847321,-0.822847,0.283108,-0.111451
2000-01-05,-0.779106,1.107419,2.4833,-0.317095
2000-01-06,-2.838312,1.86285,0.461103,1.343385
2000-01-07,0.261077,-0.518622,0.362148,-1.076796
2000-01-08,0.824303,-1.373243,0.226031,1.719361


In [161]:
s = df['A']

In [162]:
s[dates[5]]

-2.83831240513114

## Attribute Access

In [163]:
sa = pd.Series([1,2,3],index=list('abc'))
dfa = df.copy()

In [164]:
sa

a    1
b    2
c    3
dtype: int64

In [165]:
sa.b

2

In [166]:
dfa.A

2000-01-01    1.237496
2000-01-02   -0.505181
2000-01-03    0.059240
2000-01-04   -0.847321
2000-01-05   -0.779106
2000-01-06   -2.838312
2000-01-07    0.261077
2000-01-08    0.824303
Freq: D, Name: A, dtype: float64

In [167]:
sa

a    1
b    2
c    3
dtype: int64

In [168]:
sa.b = 5

In [169]:
sa

a    1
b    5
c    3
dtype: int64

In [170]:
dfa

Unnamed: 0,A,B,C,D
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489
2000-01-02,-0.505181,0.302138,-0.085456,-0.597374
2000-01-03,0.05924,-0.574444,0.526734,-0.207717
2000-01-04,-0.847321,-0.822847,0.283108,-0.111451
2000-01-05,-0.779106,1.107419,2.4833,-0.317095
2000-01-06,-2.838312,1.86285,0.461103,1.343385
2000-01-07,0.261077,-0.518622,0.362148,-1.076796
2000-01-08,0.824303,-1.373243,0.226031,1.719361


In [171]:
dfa.E = 9

In [172]:
dfa  # no error was raised, but no column E was created either

Unnamed: 0,A,B,C,D
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489
2000-01-02,-0.505181,0.302138,-0.085456,-0.597374
2000-01-03,0.05924,-0.574444,0.526734,-0.207717
2000-01-04,-0.847321,-0.822847,0.283108,-0.111451
2000-01-05,-0.779106,1.107419,2.4833,-0.317095
2000-01-06,-2.838312,1.86285,0.461103,1.343385
2000-01-07,0.261077,-0.518622,0.362148,-1.076796
2000-01-08,0.824303,-1.373243,0.226031,1.719361


In [173]:
dfa['E'] = 9.0

In [174]:
dfa

Unnamed: 0,A,B,C,D,E
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489,9.0
2000-01-02,-0.505181,0.302138,-0.085456,-0.597374,9.0
2000-01-03,0.05924,-0.574444,0.526734,-0.207717,9.0
2000-01-04,-0.847321,-0.822847,0.283108,-0.111451,9.0
2000-01-05,-0.779106,1.107419,2.4833,-0.317095,9.0
2000-01-06,-2.838312,1.86285,0.461103,1.343385,9.0
2000-01-07,0.261077,-0.518622,0.362148,-1.076796,9.0
2000-01-08,0.824303,-1.373243,0.226031,1.719361,9.0
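
The silent failure above happens because assigning to a new attribute name only sets a plain Python attribute on the object. It does not create a column. A minimal sketch, using a hypothetical `frame` and a warnings filter that is only there to keep the output quiet in case the installed pandas version warns:

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.E = 9                      # plain attribute, not a column

attribute_made_column = 'E' in frame.columns

frame['E'] = 9                       # item assignment does create the column
print(attribute_made_column, list(frame.columns))
```

Attribute access is convenient for reading existing columns, but item assignment (`frame['E'] = ...`) is the safer habit for creating them.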


## Slicing ranges

In [176]:
s

2000-01-01    1.237496
2000-01-02   -0.505181
2000-01-03    0.059240
2000-01-04   -0.847321
2000-01-05   -0.779106
2000-01-06   -2.838312
2000-01-07    0.261077
2000-01-08    0.824303
Freq: D, Name: A, dtype: float64

In [177]:
s[:5]

2000-01-01    1.237496
2000-01-02   -0.505181
2000-01-03    0.059240
2000-01-04   -0.847321
2000-01-05   -0.779106
Freq: D, Name: A, dtype: float64

In [178]:
s[::2]

2000-01-01    1.237496
2000-01-03    0.059240
2000-01-05   -0.779106
2000-01-07    0.261077
Freq: 2D, Name: A, dtype: float64

In [179]:
s[::-1]

2000-01-08    0.824303
2000-01-07    0.261077
2000-01-06   -2.838312
2000-01-05   -0.779106
2000-01-04   -0.847321
2000-01-03    0.059240
2000-01-02   -0.505181
2000-01-01    1.237496
Freq: -1D, Name: A, dtype: float64

In [180]:
s2 = s.copy()

In [181]:
s2[:5] = 0  # setting

In [182]:
s2

2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06   -2.838312
2000-01-07    0.261077
2000-01-08    0.824303
Freq: D, Name: A, dtype: float64

In [183]:
s

2000-01-01    1.237496
2000-01-02   -0.505181
2000-01-03    0.059240
2000-01-04   -0.847321
2000-01-05   -0.779106
2000-01-06   -2.838312
2000-01-07    0.261077
2000-01-08    0.824303
Freq: D, Name: A, dtype: float64

## Selection By Label

In [187]:
dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), 
                   index=pd.date_range('20130101',periods=5))
dfl

Unnamed: 0,A,B,C,D
2013-01-01,1.087692,-0.241339,-0.059945,1.089529
2013-01-02,0.235462,1.394865,0.24931,-1.13558
2013-01-03,-0.91859,-0.698307,0.285284,1.951201
2013-01-04,0.683748,1.736072,-1.458559,0.188354
2013-01-05,0.732463,-0.968644,-0.598983,-0.269253


In [185]:
dfl.loc[2:3] # raises TypeError: integer slices are not valid labels for a DatetimeIndex

TypeError: cannot do slice indexing on <class 'pandas.core.indexes.datetimes.DatetimeIndex'> with these indexers [2] of <class 'int'>

In [186]:
dfl.loc['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,0.235462,1.394865,0.24931,-1.13558
2013-01-03,-0.91859,-0.698307,0.285284,1.951201
2013-01-04,0.683748,1.736072,-1.458559,0.188354


In [188]:
s1 = pd.Series(np.random.randn(6),index=list('abcdef'))
s1

a    0.005930
b   -0.008069
c   -0.451523
d   -2.817561
e   -1.596985
f    0.940715
dtype: float64

In [189]:
s1.loc['c':]

c   -0.451523
d   -2.817561
e   -1.596985
f    0.940715
dtype: float64

In [190]:
s1.loc['c':] = 0  # setting
s1

a    0.005930
b   -0.008069
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

In [191]:
df1 = pd.DataFrame(np.random.randn(6,4),
                   index=list('abcdef'),
                   columns=list('ABCD'))
df1

Unnamed: 0,A,B,C,D
a,1.489072,0.492389,-0.648523,-1.757449
b,-0.704759,-0.656997,-0.364782,-0.001709
c,-0.416817,-0.44657,0.359352,1.311387
d,1.461946,-1.052169,-0.028683,-1.933772
e,0.713735,0.759047,-0.56664,0.463681
f,0.233532,0.962677,0.513566,0.261483


In [192]:
df1.loc[['a', 'b', 'd'], :] # row a, b, d and all columns

Unnamed: 0,A,B,C,D
a,1.489072,0.492389,-0.648523,-1.757449
b,-0.704759,-0.656997,-0.364782,-0.001709
d,1.461946,-1.052169,-0.028683,-1.933772


In [193]:
df1.loc['d':, 'A':'C'] # rows from 'd' onward and columns A to C

Unnamed: 0,A,B,C
d,1.461946,-1.052169,-0.028683
e,0.713735,0.759047,-0.56664
f,0.233532,0.962677,0.513566


In [194]:
df1.loc['a']  # select row

A    1.489072
B    0.492389
C   -0.648523
D   -1.757449
Name: a, dtype: float64

In [195]:
df1.loc['a'] > 0 # row a values greater than 0?

A     True
B     True
C    False
D    False
Name: a, dtype: bool

In [196]:
df1

Unnamed: 0,A,B,C,D
a,1.489072,0.492389,-0.648523,-1.757449
b,-0.704759,-0.656997,-0.364782,-0.001709
c,-0.416817,-0.44657,0.359352,1.311387
d,1.461946,-1.052169,-0.028683,-1.933772
e,0.713735,0.759047,-0.56664,0.463681
f,0.233532,0.962677,0.513566,0.261483


In [197]:
df1.loc[:, df1.loc['a'] > 0]
# all rows, but only the columns where the value in row 'a' is greater than 0

Unnamed: 0,A,B
a,1.489072,0.492389
b,-0.704759,-0.656997
c,-0.416817,-0.44657
d,1.461946,-1.052169
e,0.713735,0.759047
f,0.233532,0.962677
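
One detail worth noting from the label slices above is that `.loc` slices include both endpoints, whereas positional slices follow the usual Python convention and exclude the stop position. A small sketch with a hypothetical `letters` Series:

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

label_slice = letters.loc['b':'d']     # 'b', 'c' and 'd': the end label is included
position_slice = letters.iloc[1:3]     # positions 1 and 2: the stop is excluded

print(list(label_slice.index), list(position_slice.index))
```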


## Selection By Position


**Warning:** Whether a setting operation returns a copy or a reference may depend on the context. This is sometimes called `chained assignment` and should be avoided. See Returning a view versus a copy.

In [198]:
s1 = pd.Series(np.random.randn(5), index=list(range(0,10,2)))
s1

0    1.368227
2   -1.563887
4    0.267241
6   -0.723578
8    0.402719
dtype: float64

In [199]:
s1.iloc[:3]

0    1.368227
2   -1.563887
4    0.267241
dtype: float64

In [200]:
s1.iloc[3]

-0.7235776612341633

In [201]:
s1.iloc[:3] = 0 # setting
s1

0    0.000000
2    0.000000
4    0.000000
6   -0.723578
8    0.402719
dtype: float64

In [202]:
df1 = pd.DataFrame(np.random.randn(6,4),
                   index=list(range(0,6)),
                   columns=list(range(0,8,2)))
df1

Unnamed: 0,0,2,4,6
0,-0.733279,0.393306,1.23174,0.133201
1,0.107473,-0.380852,0.694238,-0.151545
2,-0.510955,1.522413,1.637528,0.199219
3,0.615876,-1.42856,-1.418068,-0.824382
4,0.247988,-0.175862,0.34559,0.162409
5,-1.176954,-0.303717,0.463806,-0.533636


In [203]:
df1.iloc[:3]

Unnamed: 0,0,2,4,6
0,-0.733279,0.393306,1.23174,0.133201
1,0.107473,-0.380852,0.694238,-0.151545
2,-0.510955,1.522413,1.637528,0.199219


In [204]:
df1.iloc[1:5, 2:4]

Unnamed: 0,4,6
1,0.694238,-0.151545
2,1.637528,0.199219
3,-1.418068,-0.824382
4,0.34559,0.162409


In [205]:
df1.iloc[[1, 3, 5], [1, 3]]

Unnamed: 0,2,6
1,-0.380852,-0.151545
3,-1.42856,-0.824382
5,-0.303717,-0.533636


In [206]:
df1.iloc[1:3, :]

Unnamed: 0,0,2,4,6
1,0.107473,-0.380852,0.694238,-0.151545
2,-0.510955,1.522413,1.637528,0.199219


In [207]:
df1.iloc[:, 1:3]

Unnamed: 0,2,4
0,0.393306,1.23174
1,-0.380852,0.694238
2,1.522413,1.637528
3,-1.42856,-1.418068
4,-0.175862,0.34559
5,-0.303717,0.463806


In [208]:
df1.iloc[1, 1]

-0.38085249773620694

In [209]:
df1.iloc[2]

0   -0.510955
2    1.522413
4    1.637528
6    0.199219
Name: 2, dtype: float64

In [210]:
df1

Unnamed: 0,0,2,4,6
0,-0.733279,0.393306,1.23174,0.133201
1,0.107473,-0.380852,0.694238,-0.151545
2,-0.510955,1.522413,1.637528,0.199219
3,0.615876,-1.42856,-1.418068,-0.824382
4,0.247988,-0.175862,0.34559,0.162409
5,-1.176954,-0.303717,0.463806,-0.533636


## Selecting Random Samples

In [211]:
s = pd.Series([0,1,2,3,4,5])
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [212]:
s.sample()

1    1
dtype: int64

In [213]:
s.sample(n=3)

1    1
3    3
0    0
dtype: int64

In [214]:
name_list = pd.Series(['CRYSTAL','JUSTIN','NICHOLAS','KENNETH','LANCE','JOHN','BOB'])
name_list

0     CRYSTAL
1      JUSTIN
2    NICHOLAS
3     KENNETH
4       LANCE
5        JOHN
6         BOB
dtype: object

In [215]:
name_list.sample(n=2)

0    CRYSTAL
5       JOHN
dtype: object

In [216]:
name_list.sample(frac=0.5)

5        JOHN
4       LANCE
0     CRYSTAL
2    NICHOLAS
dtype: object

In [217]:
s = pd.Series([0,1,2,3,4,5])
s.sample(n=6, replace=True) # sampling with replacement

2    2
0    0
4    4
4    4
4    4
5    5
dtype: int64

In [218]:
s = pd.Series([0,1,2,3,4,5]) # unequal weight sampling
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [219]:
s.sample(n=3, weights=example_weights)

2    2
5    5
3    3
dtype: int64
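
Because `sample()` is random, the outputs above change on every run. When a repeatable draw is needed, the `random_state` parameter can be set. The sketch below assumes arbitrary seed values and reuses the weights from the example above, where the zero weights mean the first two rows are never drawn.

```python
import pandas as pd

s_demo = pd.Series([0, 1, 2, 3, 4, 5])

# The same seed gives the same sample.
first_draw = s_demo.sample(n=3, random_state=42)
second_draw = s_demo.sample(n=3, random_state=42)
same_draw = first_draw.equals(second_draw)

# Rows with weight 0 cannot be selected.
weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
weighted_draw = s_demo.sample(n=3, weights=weights, random_state=0)

print(same_draw, sorted(weighted_draw.index))
```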

## Setting With Enlargement

In [220]:
se = pd.Series([1,2,3])
se

0    1
1    2
2    3
dtype: int64

In [221]:
se[5] = 5.

In [222]:
se

0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

In [223]:
dfi = pd.DataFrame(np.arange(6).reshape(3,2),
                   columns=['A','B'])
dfi

Unnamed: 0,A,B
0,0,1
1,2,3
2,4,5


In [224]:
dfi.loc[:,'C'] = dfi.loc[:,'A']
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4


In [225]:
dfi['D'] = dfi['B']
dfi

Unnamed: 0,A,B,C,D
0,0,1,0,1
1,2,3,2,3
2,4,5,4,5


## Fast scalar value getting and setting


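`at` provides label-based scalar lookups, while `iat` provides integer-position-based lookups, analogously to `iloc`. Because they handle only a single value, they avoid much of the overhead of `loc` and `iloc`.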

In [226]:
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [227]:
s.iat[5]

5

In [228]:
df

Unnamed: 0,A,B,C,D
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489
2000-01-02,-0.505181,0.302138,-0.085456,-0.597374
2000-01-03,0.05924,-0.574444,0.526734,-0.207717
2000-01-04,-0.847321,-0.822847,0.283108,-0.111451
2000-01-05,-0.779106,1.107419,2.4833,-0.317095
2000-01-06,-2.838312,1.86285,0.461103,1.343385
2000-01-07,0.261077,-0.518622,0.362148,-1.076796
2000-01-08,0.824303,-1.373243,0.226031,1.719361


In [229]:
df.at[dates[5], 'A']

-2.83831240513114

In [230]:
df.iat[3, 0]

-0.8473206512521787
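
The examples above only read values. The same accessors can also set a single cell, as in this minimal sketch (the frame `df_demo` and its labels are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])

df_demo.at['x', 'B'] = 30.0     # set by row label and column label
df_demo.iat[1, 0] = 20.0        # set by row position and column position

print(df_demo)
```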

## Boolean indexing

In [231]:
s = pd.Series(range(-3, 4))
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [232]:
s[s > 0]

4    1
5    2
6    3
dtype: int64

In [233]:
s[(s < -1) | (s > 0.5)] # multiple criteria

0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [234]:
s[~(s < 0)]

3    0
4    1
5    2
6    3
dtype: int64

In [235]:
df

Unnamed: 0,A,B,C,D
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489
2000-01-02,-0.505181,0.302138,-0.085456,-0.597374
2000-01-03,0.05924,-0.574444,0.526734,-0.207717
2000-01-04,-0.847321,-0.822847,0.283108,-0.111451
2000-01-05,-0.779106,1.107419,2.4833,-0.317095
2000-01-06,-2.838312,1.86285,0.461103,1.343385
2000-01-07,0.261077,-0.518622,0.362148,-1.076796
2000-01-08,0.824303,-1.373243,0.226031,1.719361


In [236]:
df[df['A'] > 0]

Unnamed: 0,A,B,C,D
2000-01-01,1.237496,-0.338439,-0.230565,-2.22489
2000-01-03,0.05924,-0.574444,0.526734,-0.207717
2000-01-07,0.261077,-0.518622,0.362148,-1.076796
2000-01-08,0.824303,-1.373243,0.226031,1.719361


List comprehensions and the `map` method of Series can also be used to produce more complex criteria:

In [237]:
df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c' : np.random.randn(7)})
df2

Unnamed: 0,a,b,c
0,one,x,0.292024
1,one,y,0.486743
2,two,y,1.850388
3,three,x,-0.474816
4,two,y,-0.850201
5,one,x,0.252305
6,six,x,0.098604


In [238]:
criterion = df2['a'].map(lambda x: x.startswith('t'))
# True for the rows where the text in column 'a' starts with "t"

In [239]:
df2[criterion]

Unnamed: 0,a,b,c
2,two,y,1.850388
3,three,x,-0.474816
4,two,y,-0.850201


In [243]:
%timeit df2[df2['a'].map(lambda x: x.startswith('t'))]

791 µs ± 70.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [244]:
# equivalent; in this run the list comprehension was the faster of the two
%timeit df2[[x.startswith('t') for x in df2['a']]]

481 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
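
A third way to build the same criterion is the vectorized string accessor, `Series.str.startswith`, which avoids writing a lambda. The sketch below rebuilds the relevant columns under the hypothetical name `df_demo` and checks that both approaches select the same rows. No timing claim is made here.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                        'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x']})

by_map = df_demo[df_demo['a'].map(lambda x: x.startswith('t'))]
by_str = df_demo[df_demo['a'].str.startswith('t')]   # vectorized string method

print(by_str)
```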


## Indexing with isin

In [245]:
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [246]:
s.isin([2, 4, 6]) # which values of s are in [2, 4, 6]?

4    False
3    False
2     True
1    False
0     True
dtype: bool

In [247]:
s[s.isin([2, 4, 6])]

2    2
0    4
dtype: int64

In [248]:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})
df

Unnamed: 0,ids,ids2,vals
0,a,a,1
1,b,n,2
2,f,c,3
3,n,n,4


In [249]:
values = ['a', 'b', 1, 3]

In [250]:
df.isin(values)

Unnamed: 0,ids,ids2,vals
0,True,True,True
1,True,False,False
2,False,False,True
3,False,False,False


Oftentimes you’ll want to match certain values with certain columns. Just make values a dict where the key is the column, and the value is a list of items you want to check for.

In [251]:
values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [252]:
df.isin(values)

Unnamed: 0,ids,ids2,vals
0,True,False,True
1,True,False,False
2,False,False,True
3,False,False,False


In [255]:
df[df.isin(values)]

Unnamed: 0,ids,ids2,vals
0,a,,1.0
1,b,,
2,,,3.0
3,,,
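
To keep only the rows where every column matches its list, the boolean frame from `isin` can be reduced with `all(axis=1)`. This is a sketch under the assumption that the dict lists acceptable values for all three columns; the dict used here differs from the one above because it also includes `ids2`.

```python
import pandas as pd

df_demo = pd.DataFrame({'vals': [1, 2, 3, 4],
                        'ids': ['a', 'b', 'f', 'n'],
                        'ids2': ['a', 'n', 'c', 'n']})

values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

# True only for rows where every column is in its allowed list.
row_mask = df_demo.isin(values).all(axis=1)
matching_rows = df_demo[row_mask]

print(matching_rows)
```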


## The where() Method and Masking


Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the `where` method in `Series` and `DataFrame`.

In [256]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [257]:
s[s > 0] # returns only the selected rows

3    1
2    2
1    3
0    4
dtype: int64

In [258]:
s.where(s > 0)
# return a Series with same shape as original

4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

Selecting values from a DataFrame with a boolean criterion also preserves the shape of the input data, because `where` is used under the hood as the implementation. The code below is equivalent to `df.where(df < 0)`.

In [259]:
df = pd.DataFrame(np.random.randn(7,7))
df

Unnamed: 0,0,1,2,3,4,5,6
0,0.17887,0.127923,-0.331659,-1.324843,-2.46237,1.048811,-0.625055
1,-1.710256,-0.112314,-0.485587,1.231403,-0.286067,1.053932,0.374804
2,2.335701,-1.435791,-0.194797,-1.122356,0.142732,-0.498707,-1.80466
3,1.007216,1.441768,2.154451,-0.016781,0.096385,1.450267,1.284748
4,0.200662,1.517311,-0.012689,-0.448584,0.373552,0.450401,-0.873314
5,1.557624,-0.460971,-0.5236,-0.674122,0.147553,-1.50131,0.466843
6,1.198818,0.608177,1.210777,0.510913,-1.603999,-0.241019,0.186182


In [260]:
df[df < 0]

Unnamed: 0,0,1,2,3,4,5,6
0,,,-0.331659,-1.324843,-2.46237,,-0.625055
1,-1.710256,-0.112314,-0.485587,,-0.286067,,
2,,-1.435791,-0.194797,-1.122356,,-0.498707,-1.80466
3,,,,-0.016781,,,
4,,,-0.012689,-0.448584,,,-0.873314
5,,-0.460971,-0.5236,-0.674122,,-1.50131,
6,,,,,-1.603999,-0.241019,


In addition, `where` takes an optional `other` argument that supplies replacement values wherever the condition is `False` in the returned copy.

In [261]:
df.where(df < 0, -df)

Unnamed: 0,0,1,2,3,4,5,6
0,-0.17887,-0.127923,-0.331659,-1.324843,-2.46237,-1.048811,-0.625055
1,-1.710256,-0.112314,-0.485587,-1.231403,-0.286067,-1.053932,-0.374804
2,-2.335701,-1.435791,-0.194797,-1.122356,-0.142732,-0.498707,-1.80466
3,-1.007216,-1.441768,-2.154451,-0.016781,-0.096385,-1.450267,-1.284748
4,-0.200662,-1.517311,-0.012689,-0.448584,-0.373552,-0.450401,-0.873314
5,-1.557624,-0.460971,-0.5236,-0.674122,-0.147553,-1.50131,-0.466843
6,-1.198818,-0.608177,-1.210777,-0.510913,-1.603999,-0.241019,-0.186182


In [262]:
df.where(df < 0, 0)

Unnamed: 0,0,1,2,3,4,5,6
0,0.0,0.0,-0.331659,-1.324843,-2.46237,0.0,-0.625055
1,-1.710256,-0.112314,-0.485587,0.0,-0.286067,0.0,0.0
2,0.0,-1.435791,-0.194797,-1.122356,0.0,-0.498707,-1.80466
3,0.0,0.0,0.0,-0.016781,0.0,0.0,0.0
4,0.0,0.0,-0.012689,-0.448584,0.0,0.0,-0.873314
5,0.0,-0.460971,-0.5236,-0.674122,0.0,-1.50131,0.0
6,0.0,0.0,0.0,0.0,-1.603999,-0.241019,0.0


### mask
`mask` is the inverse boolean operation of `where`.

In [263]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [265]:
s.mask(s > 0)

4    0.0
3    NaN
2    NaN
1    NaN
0    NaN
dtype: float64

In [266]:
df

Unnamed: 0,0,1,2,3,4,5,6
0,0.17887,0.127923,-0.331659,-1.324843,-2.46237,1.048811,-0.625055
1,-1.710256,-0.112314,-0.485587,1.231403,-0.286067,1.053932,0.374804
2,2.335701,-1.435791,-0.194797,-1.122356,0.142732,-0.498707,-1.80466
3,1.007216,1.441768,2.154451,-0.016781,0.096385,1.450267,1.284748
4,0.200662,1.517311,-0.012689,-0.448584,0.373552,0.450401,-0.873314
5,1.557624,-0.460971,-0.5236,-0.674122,0.147553,-1.50131,0.466843
6,1.198818,0.608177,1.210777,0.510913,-1.603999,-0.241019,0.186182


In [267]:
df.mask(df >= 0)

Unnamed: 0,0,1,2,3,4,5,6
0,,,-0.331659,-1.324843,-2.46237,,-0.625055
1,-1.710256,-0.112314,-0.485587,,-0.286067,,
2,,-1.435791,-0.194797,-1.122356,,-0.498707,-1.80466
3,,,,-0.016781,,,
4,,,-0.012689,-0.448584,,,-0.873314
5,,-0.460971,-0.5236,-0.674122,,-1.50131,
6,,,,,-1.603999,-0.241019,
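
The relationship between the two methods can be stated compactly: `mask(cond)` gives the same result as `where(~cond)`. A minimal sketch with a hypothetical `s_demo`, which also shows that `mask` accepts a replacement value in the same way as `where`:

```python
import pandas as pd

s_demo = pd.Series([0, 1, 2, 3, 4])
cond = s_demo > 0

masked = s_demo.mask(cond)             # hide values where cond is True
inverse_where = s_demo.where(~cond)    # keep values where cond is False
same = masked.equals(inverse_where)

replaced = s_demo.mask(cond, -1)       # replace instead of inserting NaN

print(same, replaced.tolist())
```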


## The query() Method (Experimental)

In [None]:
df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
df

In [None]:
# pure python
df[(df.a < df.b) & (df.b < df.c)]

In [None]:
# query
df.query('(a < b) & (b < c)')
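
The cells above have no recorded output, so here is a self-contained sketch that checks the two forms select the same rows. The seed and the name `df_demo` are assumptions for this example. The expression string may also use `and` in place of `&`.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)   # fixed seed so the example is repeatable
df_demo = pd.DataFrame(rng.rand(10, 3), columns=list('abc'))

by_mask = df_demo[(df_demo.a < df_demo.b) & (df_demo.b < df_demo.c)]
by_query = df_demo.query('(a < b) & (b < c)')
by_and = df_demo.query('a < b and b < c')

print(by_mask.equals(by_query), by_mask.equals(by_and))
```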

## Duplicate Data


If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: `duplicated` and `drop_duplicates`. Each takes as an argument the columns to use to identify duplicated rows.

* `duplicated` returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
* `drop_duplicates` removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a `keep` parameter to specify targets to be kept.
* `keep='first'` (default): mark / drop duplicates except for the first occurrence.
* `keep='last'`: mark / drop duplicates except for the last occurrence.
* `keep=False`: mark / drop all duplicates.


In [268]:
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})
df2

Unnamed: 0,a,b,c
0,one,x,1.265135
1,one,y,-0.790032
2,two,x,-0.394168
3,two,y,1.870758
4,two,x,1.28405
5,three,x,1.287318
6,four,x,-1.178269


In [269]:
df2.duplicated('a')

0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [270]:
df2.duplicated('a', keep='last')

0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [271]:
df2.duplicated('a', keep=False)

0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [272]:
df2.drop_duplicates('a')

Unnamed: 0,a,b,c
0,one,x,1.265135
2,two,x,-0.394168
5,three,x,1.287318
6,four,x,-1.178269


In [273]:
df2.drop_duplicates('a', keep='last')

Unnamed: 0,a,b,c
1,one,y,-0.790032
4,two,x,1.28405
5,three,x,1.287318
6,four,x,-1.178269


In [274]:
df2.drop_duplicates('a', keep=False)

Unnamed: 0,a,b,c
5,three,x,1.287318
6,four,x,-1.178269
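
Both methods also accept a list of columns, in which case a row counts as a duplicate only when the combination of values repeats. A sketch using the same `a` and `b` values as above, with a simple counter in `c` (an assumption, used so the output is deterministic):

```python
import pandas as pd

df_demo = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                        'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                        'c': range(7)})

# Only row 4 repeats an earlier (a, b) pair: ('two', 'x') first appears in row 2.
dup_mask = df_demo.duplicated(['a', 'b'])
deduped = df_demo.drop_duplicates(['a', 'b'])

print(dup_mask.tolist(), list(deduped.index))
```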


## Index objects

In [275]:
index = pd.Index(['e', 'd', 'a', 'b'])
index

Index(['e', 'd', 'a', 'b'], dtype='object')

In [276]:
'd' in index

True

In [277]:
'z' in index

False

In [278]:
index = pd.Index(['e', 'd', 'a', 'b'], name='something')
index

Index(['e', 'd', 'a', 'b'], dtype='object', name='something')

### Setting metadata

Indexes are “mostly immutable”, but it is possible to set and change their metadata, like the index `name` (or, for `MultiIndex`, `levels` and `labels`).

You can use `rename`, `set_names`, `set_levels`, and `set_labels` to set these attributes directly. They default to returning a copy; however, you can specify `inplace=True` to have the data change in place.

In [281]:
ind = pd.Index([1, 2, 3])

In [283]:
ind

Int64Index([1, 2, 3], dtype='int64')

In [284]:
ind.rename("apple")

Int64Index([1, 2, 3], dtype='int64', name='apple')

In [285]:
ind

Int64Index([1, 2, 3], dtype='int64')

In [286]:
ind.set_names(["apple"], inplace=True)
ind

Int64Index([1, 2, 3], dtype='int64', name='apple')

In [287]:
ind.name = "bob"
ind

Int64Index([1, 2, 3], dtype='int64', name='bob')

In [288]:
index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
index

MultiIndex(levels=[[0, 1, 2], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [289]:
index.levels[1]

Index(['one', 'two'], dtype='object', name='second')

In [290]:
index.levels[0]

Int64Index([0, 1, 2], dtype='int64', name='first')

In [291]:
index.set_levels(["a", "b"], level=1)

MultiIndex(levels=[[0, 1, 2], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

### Set operations on Index objects

In [292]:
a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

In [293]:
a | b

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [294]:
a & b

Index(['c'], dtype='object')

In [295]:
a.difference(b)

Index(['a', 'b'], dtype='object')
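
The `|` and `&` operators shown above behave as set operations in the pandas version used in this notebook, but later pandas releases moved away from that meaning. The explicit methods `union`, `intersection` and `difference` are the more portable spelling. A minimal sketch:

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

union = a.union(b)               # labels found in either index
common = a.intersection(b)       # labels found in both
only_in_a = a.difference(b)      # labels in a that are not in b

print(list(union), list(common), list(only_in_a))
```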

## Set / Reset Index


Occasionally you will load or create a data set into a DataFrame and want to add an index after you’ve already done so. There are a couple of different ways.

### Set an index

In [296]:
data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                    'b': ['one', 'two', 'one', 'two'],
                    'c': ['z', 'y', 'x', 'w'],
                    'd': [1.0, 2.0, 3.0, 4.0]})
data

Unnamed: 0,a,b,c,d
0,bar,one,z,1.0
1,bar,two,y,2.0
2,foo,one,x,3.0
3,foo,two,w,4.0


In [297]:
indexed1 = data.set_index('c') # set the index with column c
indexed1

Unnamed: 0_level_0,a,b,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
z,bar,one,1.0
y,bar,two,2.0
x,foo,one,3.0
w,foo,two,4.0


In [298]:
indexed2 = data.set_index(['a', 'b'])
# set the index with column a, b
indexed2

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1.0
bar,two,y,2.0
foo,one,x,3.0
foo,two,w,4.0


In [299]:
frame = data.set_index('c', drop=False)
frame

Unnamed: 0_level_0,a,b,c,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1.0
y,bar,two,y,2.0
x,foo,one,x,3.0
w,foo,two,w,4.0


In [300]:
frame = frame.set_index(['a', 'b'], append=True)
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,c,d
c,a,b,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1.0
y,bar,two,y,2.0
x,foo,one,x,3.0
w,foo,two,w,4.0


In [301]:
data.set_index('c', drop=False)

Unnamed: 0_level_0,a,b,c,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1.0
y,bar,two,y,2.0
x,foo,one,x,3.0
w,foo,two,w,4.0


In [302]:
data.set_index(['a', 'b'], inplace=True)
data

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1.0
bar,two,y,2.0
foo,one,x,3.0
foo,two,w,4.0


In [303]:
data

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1.0
bar,two,y,2.0
foo,one,x,3.0
foo,two,w,4.0


In [304]:
data.reset_index()

Unnamed: 0,a,b,c,d
0,bar,one,z,1.0
1,bar,two,y,2.0
2,foo,one,x,3.0
3,foo,two,w,4.0


In [305]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,c,d
c,a,b,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1.0
y,bar,two,y,2.0
x,foo,one,x,3.0
w,foo,two,w,4.0


In [306]:
frame.reset_index(level=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,c,d
c,b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,one,bar,z,1.0
y,two,bar,y,2.0
x,one,foo,x,3.0
w,two,foo,w,4.0
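
As a compact recap of this section, `set_index` and `reset_index` are inverses of each other when the default options are used. The sketch below uses a reduced, hypothetical copy of the data (`data_demo`) and also shows a lookup through the resulting MultiIndex.

```python
import pandas as pd

data_demo = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                          'b': ['one', 'two', 'one', 'two'],
                          'd': [1.0, 2.0, 3.0, 4.0]})

indexed = data_demo.set_index(['a', 'b'])      # columns a and b become the index
value = indexed.loc[('foo', 'one'), 'd']       # look up one cell by its index tuple

restored = indexed.reset_index()               # move the index back into columns
print(value, restored.equals(data_demo))
```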


## Returning a view versus a copy

In [None]:
dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                    columns=pd.MultiIndex.from_product([['one','two'],
                                                        ['first','second']]))
dfmi

In [None]:
dfmi['one']['second'] # chained indexing: may return a view or a copy, so avoid it for assignment

In [None]:
dfmi.loc[:,('one','second')] # preferred option
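
The distinction matters most when setting values. With chained indexing the assignment may land on a temporary copy, depending on the pandas version, whereas a single `.loc` call always operates on the original frame. A minimal sketch of the preferred form, using a two-row variant of the frame above:

```python
import pandas as pd

dfmi = pd.DataFrame([list('abcd'), list('efgh')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))

# One .loc call with the full column tuple: the original frame is updated.
dfmi.loc[0, ('one', 'second')] = 'Z'

print(dfmi.loc[:, ('one', 'second')].tolist())
```

The chained form `dfmi['one']['second'][0] = 'Z'` is deliberately not shown as working code, since its effect is not reliable.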

***