# Class 4_Python-cont

<div>
<img src="attachment:python.jpg" width="600"/>
</div>

## Quiz alert: An individual quiz will be due by tomorrow

# Topic 2. NumPy

## 2.1. The Basics

NumPy’s main object is the **homogeneous multidimensional array**. It is a table of elements (usually numbers), **all of the same type**, indexed by a tuple of non-negative integers. In NumPy **dimensions are called axes**.

For example, the coordinates of a point in 3D space, [1, 2, 3], form an array with one axis. That axis has 3 elements in it, so we say it has a length of 3. In the example shown below, the array has 2 axes: the first axis has a length of 2 and the second axis has a length of 3.

[ [1., 1.5, 2.], 
 [2.3, 5.2, 4.1] ]

NumPy’s array class is called **ndarray**. The most important attributes of an ndarray object are:

**ndarray.ndim**: the number of axes (dimensions) of the array.

**ndarray.shape**: the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.

**ndarray.size**: the total number of elements of the array. This is equal to the product of the elements of shape.



## 2.2 Creating Arrays

There are several ways to create arrays. The easiest way to create an array is to use the **array** function. This accepts any
sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. 

In order to work with arrays in NumPy let's import NumPy first:

In [1]:
import numpy as np

Thus, whenever you see **np.** in code, it’s referring to **NumPy**.

In [2]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

In [3]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

We can confirm the dimension of our arrays:

In [4]:
arr1.ndim

1

In [5]:
arr2.ndim

2

An ndarray is a generic multidimensional container for homogeneous data; that is, all
of the elements must be the **same type**. Every array has a **shape**, a tuple indicating the
**size of each dimension**, and a **dtype**, an object describing the **data type of the array**:

In [6]:
arr2.shape

(2, 4)

In [7]:
arr2.dtype

dtype('int32')

In [8]:
arr2.size

8

In addition to np.array, there are a number of other functions for creating new arrays. **zeros** and **ones** create arrays of 0s or 1s, respectively, with a
given length or shape. **empty** creates an array without initializing its values to any particular
value. To create a higher dimensional array with these functions, pass a tuple
for the shape:

In [9]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [10]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [11]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [12]:
np.ones((3, 6))

array([[1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1.]])

In [13]:
np.ones((3, 6, 2))

array([[[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]],

       [[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]])
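
The **empty** function described above is not demonstrated in the cells, so here is a minimal sketch. The variable name `scratch` is only illustrative. Because the memory is not initialized, the starting contents are arbitrary and should be overwritten before they are read:

```python
import numpy as np

# np.empty allocates memory without filling it, so the contents are arbitrary.
scratch = np.empty((2, 3))
print(scratch.shape)   # (2, 3)
print(scratch.dtype)   # float64 is the default dtype

# Fill the array before reading from it.
scratch[:] = 7.0
print(scratch)
```

Use zeros or ones instead when you need predictable starting values.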

## 2.3. Arithmetic with NumPy Arrays 

Arrays enable you to perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements. Any arithmetic operation between equal-size arrays is applied **element-wise**:

In [14]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr

array([[1, 2, 3],
       [4, 5, 6]])

In [15]:
arr*10

array([[10, 20, 30],
       [40, 50, 60]])

In [16]:
arr+arr

array([[ 2,  4,  6],
       [ 8, 10, 12]])

In [17]:
1/arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [18]:
arr**0.5

array([[1.        , 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

Comparisons between arrays of the same size yield **boolean arrays**:

In [19]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [20]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])

You can also filter a NumPy array with a **boolean mask**, that is, use a boolean expression (based on one or more conditions) to extract or modify parts of an array.

In [21]:
data = [2, 3, 6, 9, 11, 23, 35, 4]
arr = np.array(data)
arr

array([ 2,  3,  6,  9, 11, 23, 35,  4])

In [22]:
arr[arr>6]

array([ 9, 11, 23, 35])

In [23]:
arr[arr%2!=0]

array([ 3,  9, 11, 23, 35])
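
The cells above use a mask to extract values. A mask can also be used to modify values, as this short sketch shows (the cap of 10 is an arbitrary choice for illustration):

```python
import numpy as np

arr = np.array([2, 3, 6, 9, 11, 23, 35, 4])

# Assigning through a boolean mask changes only the selected elements.
arr[arr > 10] = 10
print(arr)  # [ 2  3  6  9 10 10 10  4]
```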

## 2.4. NumPy Basic Operations

### min & max

The maximum or minimum of an array can be calculated using the following methods:

In [24]:
arr = np.array([4, 6, 2, 12, 67, 98, 100, 45])

In [25]:
arr.max()

100

In [26]:
arr.min()

2

If you have a multi-dimensional array, use the "axis" parameter. For a 2-D array, **axis = 0** works down the rows and returns one result per **column**, while **axis = 1** works across the columns and returns one result per **row**, as in the example below:

In [27]:
a = np.array([[2, 4, 6, 8], [10, 12, 13, 15], [23, 45, 89, 21]])
a

array([[ 2,  4,  6,  8],
       [10, 12, 13, 15],
       [23, 45, 89, 21]])

In [28]:
a.max() # returns the max of array "a"

89

In [29]:
a.max(axis = 0) # returns the max in each column of array "a"

array([23, 45, 89, 21])

In [30]:
a.max(axis = 1) # returns the max in each row of array "a"

array([ 8, 15, 89])

### argmax & argmin

argmax and argmin return the **indices** of the maximum/minimum values in an array or along an axis. Without an axis argument, the returned index refers to the flattened array.

In [31]:
a = np.array([[2, 4, 6, 8], [10, 12, 13, 15], [23, 45, 89, 21]])

np.argmax(a) # returns the index of the maximum value in array "a"

10

In [32]:
np.argmin(a) # returns the index of the minimum value in array "a"

0

In [33]:
np.argmax(a, axis = 0) # returns the index of the maximum value in array "a" along each column

array([2, 2, 2, 2], dtype=int64)

In [34]:
np.argmax(a, axis = 1) # returns the index of the maximum value in array "a" along each row

array([3, 3, 2], dtype=int64)
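
When no axis is given, argmax returns a position in the flattened array (10 in the example above). One way to turn that into a (row, column) pair is np.unravel_index, sketched here with illustrative variable names:

```python
import numpy as np

a = np.array([[2, 4, 6, 8], [10, 12, 13, 15], [23, 45, 89, 21]])

flat_index = np.argmax(a)                         # position in the flattened array
row, col = np.unravel_index(flat_index, a.shape)  # convert to 2-D coordinates
print(flat_index, row, col)                       # 10 2 2
print(a[row, col])                                # 89
```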

### sum

sum() computes the sum of items in an array:

In [35]:
arr = np.array([4, 6, 2, 12, 67, 98, 100, 45])

arr.sum() # computes the sum of array "arr"

334

### mean

numpy.mean computes the arithmetic mean along the specified axis.

In [36]:
a = np.array([[2, 4, 6, 8], [10, 12, 13, 15], [23, 45, 89, 21]])

np.mean(a) # computes the mean of array "a"

20.666666666666668

In [37]:
np.mean(a, axis = 0) # computes the mean of array "a" along each column

array([11.66666667, 20.33333333, 36.        , 14.66666667])

In [38]:
np.mean(a, axis = 1) # computes the mean of array "a" along each row

array([ 5. , 12.5, 44.5])

### standard deviation and variance

numpy.std() computes the standard deviation along the specified axis. numpy.var() computes the variance:

In [39]:
a = np.array([[2, 4, 6, 8], [10, 12, 13, 15], [23, 45, 89, 21]])

np.std(a) # computes the standard deviation of array "a"

23.360698239184167

In [40]:
np.std(a, axis = 0) # computes the standard deviation of array "a" along each column

array([ 8.65383666, 17.74510887, 37.58545818,  5.31245915])

In [41]:
np.var(a) # computes the variance of array "a"

545.7222222222222

In [42]:
np.var(a, axis = 1) # computes the variance of array "a" along each row

array([  5.  ,   3.25, 748.75])
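
As a quick check on how the two functions relate, the sketch below confirms that the standard deviation is the square root of the variance. It also shows the `ddof` parameter, which switches from the default population formula (divide by N) to the sample formula (divide by N - 1):

```python
import numpy as np

a = np.array([[2, 4, 6, 8], [10, 12, 13, 15], [23, 45, 89, 21]])

# The standard deviation is the square root of the variance.
print(np.isclose(np.std(a) ** 2, np.var(a)))   # True

# By default both divide by N; ddof=1 divides by N - 1 instead.
sample_var = np.var(a, ddof=1)
print(sample_var > np.var(a))                   # True
```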

### matrix multiplication

numpy.dot() can be used for matrix multiplication (for two 1-D arrays it computes the inner product). For element-wise multiplication, use the asterisk (*) instead.

In [43]:
a = np.array([4, 2, 7])
b = np.array([3, 2, 1])

In [44]:
print(np.dot(a,b))

23


In [45]:
print(a*b)

[12  4  7]


In [46]:
p = np.array([[1, 0], [0, 1]])
q = np.array([[1, 2], [3, 4]])

In [47]:
np.dot(p,q)

array([[1, 2],
       [3, 4]])

In [48]:
p*q

array([[1, 0],
       [0, 4]])
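
Python also provides the `@` operator for matrix multiplication, which gives the same result as np.dot for 2-D arrays. A brief sketch using the same p and q:

```python
import numpy as np

p = np.array([[1, 0], [0, 1]])
q = np.array([[1, 2], [3, 4]])

product = p @ q                                # matrix product, like np.dot(p, q)
print(np.array_equal(product, np.dot(p, q)))   # True

# The matrix product and the element-wise product generally differ.
print(np.array_equal(q @ q, q * q))            # False
```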

## 2.5. NumPy Indexing and Slicing

### 2.5.1. Basic Indexing and Slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select
a subset of your data or individual elements. One-dimensional arrays are simple; on
the surface they act similarly to Python lists:

In [49]:
arr = np.arange(16) # creates an array of integers from 0 to 15; The stop value is exclusive.

In [50]:
arr[5]

5

In [51]:
arr[5:8] # note that the stop index is exclusive

array([5, 6, 7])

In [52]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9, 10, 11, 12, 13, 14, 15])

The “bare” slice [:] selects all values in an array:

In [53]:
arr[:] = 100
arr

array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100])

With higher dimensional arrays, you have many more options. In a two-dimensional
array, the elements at each index are no longer scalars but rather one-dimensional
arrays:

In [54]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [55]:
arr2d[2]

array([7, 8, 9])

To select individual elements in a multidimensional array, you can pass a **comma-separated list of indices** or a **bracket-separated list of indices**.

In [56]:
arr2d[0, 2]

3

In [57]:
arr2d[0][2]

3

See below for an illustration of indexing on a two-dimensional array. I find it
helpful to think of **axis 0 as the “rows”** of the array and **axis 1 as the “columns.”**

![numpy.PNG](attachment:numpy.PNG)

In **multidimensional arrays**, if you **omit later indices**, the returned object will be a
**lower dimensional ndarray** consisting of all the data **along the higher dimensions**. So you can select your slice from the lower dimensional ndarray.

In [58]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [59]:
arr3d[0, 0]

array([1, 2, 3])

In [60]:
arr3d[0] # arr3d[0] is a 2 × 3 array:

array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to arr3d[0]:

In [61]:
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [62]:
arr3d[0, 0] = np.array([12, 11, 10])
arr3d

array([[[12, 11, 10],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0),
forming a 1-dimensional array:

In [63]:
arr3d[1, 0]

array([7, 8, 9])

#### Indexing with slices
Like one-dimensional objects such as Python lists, ndarrays can be sliced with the
familiar syntax:

In [64]:
arr = np.arange(16) # creates an array of integers from 0 to 15; The stop value is exclusive.
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [65]:
arr[1:6] # note that the stop index is exclusive

array([1, 2, 3, 4, 5])

Slicing two-dimensional arrays is a bit different.

In [66]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [67]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a
range of elements along an axis. It can be helpful to read the expression arr2d[:2] as
**“select the first two rows of arr2d.”** <br>
<br>
You can pass multiple slices just like you can pass multiple indexes:

In [68]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indices and slices, you get lower-dimensional slices.
<br>
For example, I can select the second row but only the first two columns like so:

In [69]:
arr2d[1, :2]

array([4, 5])

Similarly, I can select the third column but only the first two rows like so:

In [70]:
arr2d[:2, 2]

array([3, 6])

See figure below for an illustration. Note that a colon by itself means to take the entire
axis.

![numpy%202.PNG](attachment:numpy%202.PNG)

In [71]:
arr2d[:, :1]

array([[1],
       [4],
       [7]])

Of course, assigning to a slice expression assigns to the whole selection:

In [72]:
arr2d[:2, 1:] = np.array([[10, 10], [10, 10]])
arr2d

array([[ 1, 10, 10],
       [ 4, 10, 10],
       [ 7,  8,  9]])
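
Because slices are views, writing through a slice changes the original array. The sketch below illustrates this and shows `.copy()` as a way to get independent data (the names `view` and `independent` are illustrative):

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

view = arr2d[:2, 1:]            # a view: shares memory with arr2d
view[:] = 0
print(arr2d[0])                 # [1 0 0]

independent = arr2d[2].copy()   # an explicit copy does not share memory
independent[:] = -1
print(arr2d[2])                 # [7 8 9]
```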

#### Last row/column

In [73]:
a = np.arange(20).reshape(4, 5) # reshape(4, 5) makes a 4 x 5 array
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [74]:
a[-1] # last row

array([15, 16, 17, 18, 19])

In [75]:
a[:, -1] # last column

array([ 4,  9, 14, 19])

### 2.5.2. Boolean Indexing
Let’s consider an example where we have an array of names
with duplicates.

In [76]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [77]:
names == 'Bob'

array([ True, False, False,  True, False, False, False])

To select everything but 'Bob', you can use != inside the index expression:

In [78]:
names[names != 'Bob']

array(['Joe', 'Will', 'Will', 'Joe', 'Joe'], dtype='<U4')
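
Several conditions can be combined into one mask with `&` (and), `|` (or) and `~` (not). A small sketch with the same names array follows. Each comparison needs its own parentheses, and the Python keywords `and` and `or` do not work on arrays:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

mask = (names == 'Bob') | (names == 'Will')   # parentheses are required
print(names[mask])    # ['Bob' 'Will' 'Bob' 'Will']
print(names[~mask])   # ['Joe' 'Joe' 'Joe']
```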

# Topic 3. Pandas

## Introduction

Pandas is a major tool in data analytics. It
contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. <br><br>
While pandas adopts many coding idioms from NumPy, the biggest difference is that
**pandas** is designed for working with **tabular or heterogeneous data**. **NumPy**, by contrast,
is best suited for working with **homogeneous numerical array data**.<br><br>
The usual import convention for pandas is:

In [79]:
import pandas as pd

Thus, whenever you see pd. in code, it’s referring to pandas.

## 3.1. Pandas Data Structures

To get started with pandas, you will need to get comfortable with its
data structures.

### 3.1.1. Series

A Series is a **one-dimensional array-like** object containing a sequence of **values** (of
similar types to NumPy types) and an associated array of data labels, called its **index**.
The simplest Series is formed from only an array of data:

In [80]:
S = pd.Series([4, 7, -5, 3])
print(S)

0    4
1    7
2   -5
3    3
dtype: int64


The string representation of a Series displayed interactively shows the **index on the
left and the values on the right**. Since we did not specify an index for the data, a
**default one consisting of the integers 0 through N - 1** (where N is the length of the
data) is created. You can get the array representation and index object of the Series via
its values and index attributes, respectively:

In [81]:
S.values # array representation

array([ 4,  7, -5,  3], dtype=int64)

In [82]:
S.index

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point
with a **label**:

In [83]:
S2 = pd.Series([4, 7, -5, 3], index = ['a', 'b', 'c', 'd'])
S2

a    4
b    7
c   -5
d    3
dtype: int64

In [84]:
S2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

Compared with NumPy arrays, you can use **labels** in the index when selecting single
values or a set of values:

In [85]:
S2['a']

4

In [86]:
S2['d'] = 6
S2

a    4
b    7
c   -5
d    6
dtype: int64

In [87]:
S2[['d', 'b', 'a']]

d    6
b    7
a    4
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link:

In [88]:
S2[S2>0]

a    4
b    7
d    6
dtype: int64

In [89]:
S2*2

a     8
b    14
c   -10
d    12
dtype: int64

Another way to think about a Series is as a **fixed-length, ordered dictionary**, as it is a mapping
of index values to data values. It can be used in many contexts where you might
use a dictionary:

**Note: please make sure you understand the concepts of list and dictionary in python.**

In [90]:
'b' in S2

True

In [91]:
'f' in S2

False

Should you have data contained in a Python dictionary, you can create a Series from it by
passing the dictionary:

In [92]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [93]:
S3 = pd.Series(sdata)
S3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dictionary, the index of the resulting Series takes the dictionary’s
keys **in insertion order**, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the dictionary keys in the order you
want them to appear in the resulting Series:

In [94]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
S4 = pd.Series(sdata, index = states)
S4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since
**no value for 'California' was found**, it appears as **NaN (not a number)**, which pandas
uses to mark missing or NA values. Since 'Utah' was not included in
states, it is excluded from the resulting object.

We will use the terms “missing” or “NA” interchangeably to refer to missing data. The
isnull and notnull functions in pandas should be used to detect missing data:

In [95]:
pd.isnull(S4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [96]:
pd.notnull(S4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
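
The same checks are available as Series methods. The sketch below uses `isna` and `notna` (`isnull` and `notnull` are aliases for them) to count the missing entries and keep only the present ones:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
S4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

missing = S4.isna()                     # method form of pd.isnull(S4)
print(missing.sum())                    # 1 (each True counts as 1)
print(S4[S4.notna()].index.tolist())    # ['Ohio', 'Oregon', 'Texas']
```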

Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:

In [97]:
S4.name = 'population'
S4.index.name = 'state'
S4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series’s index can be altered in-place by assignment:

In [98]:
S4.index = ['CA', 'OH', 'OR', 'TX']
S4

CA        NaN
OH    35000.0
OR    16000.0
TX    71000.0
Name: population, dtype: float64

### 3.1.2. DataFrame

A DataFrame represents a **rectangular table of data** and contains **an ordered collection
of columns**, each of which can be a **different value type** (numeric, string,
boolean, etc.). The DataFrame has **both a row and column index**; it can be thought of
as an **Excel spreadsheet**.

There are many ways to construct a DataFrame, though one of the most common is
from **a dictionary of equal-length lists**:

In [99]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [100]:
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and
the columns are placed in the order of the dictionary's keys (older pandas versions sorted them):

In [101]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


For large DataFrames, the head method selects only **the first five rows**:

In [102]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


If you specify a sequence of columns, the DataFrame’s columns will be arranged in
that order:

In [103]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass **a column that isn’t contained in the dictionary**, it will appear with **missing values**
in the result:

In [104]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6   NaN
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9   NaN
5  2003  Nevada  3.2   NaN


You can get the column names using the columns attribute:

In [105]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a **Series** either by dictionary-like notation or
by attribute:

In [106]:
frame2['pop'] # by dictionary-like notation

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: pop, dtype: float64

In [107]:
frame2.year # by attribute

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Note that the returned Series have the **same index as the DataFrame**, and their name
attribute has been appropriately set.

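Rows can be retrieved by label with the special _loc_ attribute, or by integer position with _iloc_ (much
more on this later). Here the row labels are the default integers, so the label 3 selects the fourth row: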

In [108]:
frame2.loc[3]

year       2001
state    Nevada
pop         2.4
debt        NaN
Name: 3, dtype: object

Columns can be modified by assignment. For example, the empty 'debt' column
could be assigned a scalar value or an array of values:

In [109]:
frame2.debt = 16.5
frame2

   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    Ohio  1.7  16.5
2  2002    Ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5
5  2003  Nevada  3.2  16.5


In [110]:
frame2.debt = range(6)
frame2

   year   state  pop  debt
0  2000    Ohio  1.5     0
1  2001    Ohio  1.7     1
2  2002    Ohio  3.6     2
3  2001  Nevada  2.4     3
4  2002  Nevada  2.9     4
5  2003  Nevada  3.2     5


When you are assigning lists or arrays to a column, **the value’s length must match the
length of the DataFrame**. If you assign a Series, its labels will be realigned exactly to
the DataFrame’s index, inserting missing values in any holes:

In [111]:
values = pd.Series([-1.2, -1.5, -1.7], index = [2, 4, 5])
frame2.debt = values
frame2

   year   state  pop  debt
0  2000    Ohio  1.5   NaN
1  2001    Ohio  1.7   NaN
2  2002    Ohio  3.6  -1.2
3  2001  Nevada  2.4   NaN
4  2002  Nevada  2.9  -1.5
5  2003  Nevada  3.2  -1.7


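Assigning to a column that doesn’t exist will create a new column. Note that a new column cannot be created with the attribute syntax (frame2.eastern = ...); use the dictionary-like notation instead: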

In [112]:
frame2['eastern'] = frame2.state == 'Ohio' # create a new column indicating whether the corresponding state is 'Ohio' or not
frame2

   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7   NaN     True
2  2002    Ohio  3.6  -1.2     True
3  2001  Nevada  2.4   NaN    False
4  2002  Nevada  2.9  -1.5    False
5  2003  Nevada  3.2  -1.7    False


The _del_ keyword deletes columns, as with a dictionary. It can be used to remove the 'eastern' column:

In [113]:
del frame2['eastern']

In [114]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dictionary of dictionaries:

In [115]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [116]:
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


You can transpose the DataFrame (swap rows and columns) using the T attribute:

In [117]:
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


The keys in the inner dictionaries are combined to form the index in the result; in the frame3 output above they appear in order of first occurrence (older pandas versions sorted them).
If an explicit index is specified, it is used instead:

In [118]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


## 3.2. Pandas Dropping & Selection

In [119]:
# make sure you import the necessary libraries first

import pandas as pd
import numpy as np

### 3.2.1. Dropping

The _drop_ method will return a new object with the indicated value or values deleted from
an axis:

In [120]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [121]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [122]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we
first create an example DataFrame:

In [123]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Calling drop with a sequence of labels will drop values from **the row labels** (axis 0):

In [124]:
data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


You can drop values from the **columns** by passing **axis=1** or **axis='columns'**:

In [125]:
data.drop('two', axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [126]:
data.drop(['two', 'four'], axis='columns')

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
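
Since drop returns a new object, the original DataFrame is left untouched unless the result is assigned back. The sketch below shows this, using the `columns=` and `index=` keywords as an alternative to passing `axis` (the name `trimmed` is illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

trimmed = data.drop(columns=['two', 'four'])   # same as axis='columns'
print(list(trimmed.columns))   # ['one', 'three']
print(list(data.columns))      # ['one', 'two', 'three', 'four'] (unchanged)

data = data.drop(index='Ohio')   # reassign to keep the result
print(list(data.index))          # ['Colorado', 'Utah', 'New York']
```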


### 3.2.2. Indexing, Selection and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the **Series’s index values** instead of only integers. Here are some examples of
this:

In [127]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [128]:
obj['b']

1.0

In [129]:
obj[1]

1.0

In [130]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [131]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [132]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [133]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the **endpoint
is inclusive**:

In [134]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

You can assign values to the slice of a Series:

In [135]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
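
In the cells above, integer keys such as obj[1] are treated as positions because the index holds string labels. Recent pandas releases warn about this implicit behavior, so a more explicit sketch uses `iloc` for positions and `loc` for labels:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

print(obj.iloc[1])                 # 1.0, second element by position
print(obj.iloc[[1, 3]].tolist())   # [1.0, 3.0]
print(obj.loc['b'])                # 1.0, element labelled 'b'
print(obj.iloc[1:3].tolist())      # [1.0, 2.0], positional slice excludes the stop
print(obj.loc['b':'c'].tolist())   # [1.0, 2.0], label slice includes 'c'
```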

Indexing into a DataFrame is for retrieving one or more columns either with **a single
value or sequence**:

In [136]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [137]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14


In [138]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [139]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


Passing a single label or a list of labels to the [] operator selects columns, as above.
You can also select rows by passing a **boolean
array** (or a slice) to the [] operator:

In [140]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Another use case is in indexing with a **boolean DataFrame**, such as one produced by a
scalar comparison:

In [141]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [142]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


#### Selection with loc and iloc

For DataFrame label-indexing on the rows, I introduce the special indexing operators
loc and iloc. They enable you to select **a subset of the rows and columns** from a
DataFrame with NumPy-like notation using either **axis labels (loc)** or **integers
(iloc)**. <br> <br>
As a preliminary example, let’s select a single row and multiple columns by label:

In [143]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

We’ll then perform some similar selections with integers using iloc:

In [144]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [145]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [146]:
data.iloc[[1, 2], [3, 0, 1]]

          four  one  two
Colorado     7    0    5
Utah        11    8    9


Both indexers work with slices in addition to single labels or lists of labels:

In [147]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [148]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
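
One detail worth checking when slicing: label slices with loc include both endpoints, while position slices with iloc exclude the stop. The sketch below rebuilds the original 4 x 4 frame (before any values were set to 0) and compares the two. It also shows that a boolean row filter and a column selection can be written as a single loc call:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Ohio':'Utah', 'one':'two']   # both endpoints included
by_position = data.iloc[0:3, 0:2]                 # stop positions excluded
print(by_label.equals(by_position))               # True

# Row filter and column selection in a single loc call.
chained = data.iloc[:, :3][data.three > 5]
single = data.loc[data['three'] > 5, ['one', 'two', 'three']]
print(chained.equals(single))                     # True
```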


### 3.2.3 Logical Operators in Pandas

Logical operators in Pandas:

Operator|Description
---|---
exp1 & exp2|Element-wise logical AND
exp1 \| exp2|Element-wise logical OR
~exp1|Element-wise logical NOT

In [149]:
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [150]:
data[(data['one']>7) & (data['three']>8)]

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
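
The other two operators from the table work the same way. The sketch below rebuilds the frame with the values shown in the cell above and applies OR and NOT. The particular conditions are arbitrary, and each comparison must be wrapped in parentheses:

```python
import pandas as pd

data = pd.DataFrame([[0, 0, 0, 0], [0, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]],
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# OR: rows where at least one condition holds.
either = data[(data['one'] > 7) | (data['two'] == 5)]
print(list(either.index))    # ['Colorado', 'Utah', 'New York']

# NOT: rows where neither condition holds.
neither = data[~((data['one'] > 7) | (data['two'] == 5))]
print(list(neither.index))   # ['Ohio']
```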
