# Pandas

- library for Data Analysis and Manipulation

**Why Pandas?**

- provides the ability to work with tabular data
  - `Tabular Data`: data that is organized into tables with rows and columns

In [65]:
import pandas as pd
import numpy as np
# jupyter nbconvert --to markdown pandas.ipynb --output README.md

## `Series` objects

- A `Series` object is a 1D array that can hold/store data.

### Creating a `Series`

In [9]:
l = ["C", "C++", "Python", "Javascript"]
s = pd.Series(l)
s

0             C
1           C++
2        Python
3    Javascript
dtype: object

### Similar to a 1D `ndarray`

`Series` objects behave much like one-dimensional NumPy `ndarray`s, and you can often pass them as parameters to NumPy functions:

In [3]:
import numpy as np

s = pd.Series([2,4,6,8])
np.exp(s)

0       7.389056
1      54.598150
2     403.428793
3    2980.957987
dtype: float64

Arithmetic operations on `Series` are also possible, and they apply *elementwise*, just like for `ndarray`s:

In [4]:
s + [1000,2000,3000,4000]

0    1002
1    2004
2    3006
3    4008
dtype: int64

Similar to NumPy, if you add a single number to a `Series`, that number is added to all items in the `Series`. This is called *broadcasting*:

In [5]:
s + 1000

0    1002
1    1004
2    1006
3    1008
dtype: int64

The same is true for all binary operations such as `*` or `/`, and even conditional operations:

In [6]:
s < 0

0    False
1    False
2    False
3    False
dtype: bool
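
As a small sketch of how these pieces combine (the variable names `doubled`, `is_large` and `large_items` are illustrative and not part of the original notebook), a comparison yields a boolean `Series`, and that boolean `Series` can in turn be used to select items:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8])

# Elementwise arithmetic with a scalar (broadcasting)
doubled = s * 2

# A comparison returns a boolean Series with the same index
is_large = s > 4

# Passing that boolean Series back in keeps only the items marked True
large_items = s[is_large]
print(large_items)
```

The selected items keep their original index labels (`2` and `3` here), which matters for the label-based access discussed in the next section.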

### Indexing

In [28]:
s2 = pd.Series(l, index=["a", "b", "c", "d"])
s2

a             C
b           C++
c        Python
d    Javascript
dtype: object

You can then use the `Series` just like a `dict`:

In [11]:
s2["b"]

'C++'

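You can still access the items by integer location, like in a regular array. Note that recent pandas versions emit a deprecation warning for integer keys on a label-indexed `Series`, so the `iloc` form shown further down is the safer choice: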

In [12]:
s2[1]

'C++'

To make it clear when you are accessing by label or by integer location, it is recommended to always use the `loc` attribute when accessing by label, and the `iloc` attribute when accessing by integer location:

In [13]:
s2.loc["b"]

'C++'

In [14]:
s2.iloc[1]

'C++'

#### Slicing a `Series` also slices the index labels:

In [15]:
s2.iloc[1:3]

b       C++
c    Python
dtype: object

This can lead to unexpected results when using the default numeric labels, so be careful:

In [16]:
surprise = pd.Series([1000, 1001, 1002, 1003])
surprise

0    1000
1    1001
2    1002
3    1003
dtype: int64

In [17]:
surprise_slice = surprise[2:]
surprise_slice

2    1002
3    1003
dtype: int64

Oh look! The first element has index label `2`. The element with index label `0` is absent from the slice:

In [18]:
try:
    surprise_slice[0]
except KeyError as e:
    print("Key error:", e)

Key error: 0


But remember that you can access elements by integer location using the `iloc` attribute. This illustrates another reason why it's always better to use `loc` and `iloc` to access `Series` objects:

In [19]:
surprise_slice.iloc[0]

1002
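
One more difference between the two accessors is worth a quick sketch: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. The `Series` below is rebuilt here only so that the example is self-contained:

```python
import pandas as pd

s = pd.Series(["C", "C++", "Python", "Javascript"], index=["a", "b", "c", "d"])

# Label-based slice: BOTH endpoints are included
by_label = s.loc["b":"c"]

# Position-based slice: the end position is excluded, as in plain Python
by_position = s.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```

Both slices select the same two items here, even though the label slice names its last element and the position slice names the element after it.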

#### Init from `dict`

You can create a `Series` object from a `dict`. The keys will be used as index labels:

In [20]:
weights = {"a": 68, "b": 83, "c": 86, "d": 68}
s3 = pd.Series(weights)
s3

a    68
b    83
c    86
d    68
dtype: int64

You can control which elements you want to include in the `Series` and in what order by explicitly specifying the desired `index`:

In [24]:
s4 = pd.Series(weights, index = ["c", "a"])
s4

c    86
a    68
dtype: int64
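
If the requested `index` contains a label that is not a key of the `dict`, that item is created with a missing value. A minimal sketch, where `"z"` is a made-up label used only for illustration:

```python
import pandas as pd

weights = {"a": 68, "b": 83, "c": 86, "d": 68}

# "z" is a hypothetical label that is not a key of the dict
s = pd.Series(weights, index=["c", "a", "z"])
print(s)

# isna() flags the items that have no value
missing = s.isna()
```

Because `NaN` is a floating-point value, the resulting `Series` has a float dtype rather than an integer one.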

### Automatic alignment

When an operation involves multiple `Series` objects, `pandas` automatically aligns items by matching index labels.

In [34]:
s2 = pd.Series([1,2,3], index=["a", "b", "c"])
s3 = pd.Series([10,20,40], index=["a", "b", "d"])

print(s2.keys())
print(s3.keys())

s2 + s3


Index(['a', 'b', 'c'], dtype='object')
Index(['a', 'b', 'd'], dtype='object')


a    11.0
b    22.0
c     NaN
d     NaN
dtype: float64

The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `"d"` is missing from `s2` and `"c"` is missing from `s3`, these items have a `NaN` result value (`NaN`, short for Not-a-Number, is how pandas marks a *missing* value).

Automatic alignment is very handy when working with data that may come from various sources with varying structure and missing items. But if you forget to set the **right index labels**, you can have surprising results:

In [36]:
s5 = pd.Series([1000,1000,1000,1000])
s2 + s5

a   NaN
b   NaN
c   NaN
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

Pandas could not align the `Series`, since their labels do not match at all, hence the full `NaN` result.
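
When a label present on only one side should count as zero instead of producing `NaN`, the `add()` method with `fill_value` is one option. This is a sketch of that alternative, not something the original notebook uses:

```python
import pandas as pd

s2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s3 = pd.Series([10, 20, 40], index=["a", "b", "d"])

# Plain `+` leaves NaN wherever a label exists on only one side
plain = s2 + s3

# add() with fill_value=0 treats a label missing on one side as 0
filled = s2.add(s3, fill_value=0)
print(filled)
```

The same pattern is available for the other arithmetic methods, such as `sub()`, `mul()` and `div()`.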

### Init with a scalar

You can also initialize a `Series` object using a scalar and a list of index labels: all items will be set to the scalar.

In [37]:
meaning = pd.Series(42, ["a", "b", "c"])
meaning

a    42
b    42
c    42
dtype: int64

A `Series` can also have a `name`:

In [38]:
s6 = pd.Series([83, 68], index=["a", "b"], name="weights")
s6

a    83
b    68
Name: weights, dtype: int64

## `DataFrame` objects

- A `DataFrame` object represents a 2D labelled array, with cell values, column names and row index labels.
- You can think of a `DataFrame` as a dictionary of `Series`.


### Creating a `DataFrame`

#### from numpy array

In [49]:
import numpy as np

In [51]:
arr = np.random.randint(10,100,size=(6,4))
arr

array([[30, 27, 82, 14],
       [94, 66, 75, 56],
       [53, 19, 72, 20],
       [32, 91, 10, 14],
       [88, 65, 70, 49],
       [31, 57, 27, 95]])

In [53]:
df = pd.DataFrame(data=arr)
df

Unnamed: 0,0,1,2,3
0,30,27,82,14
1,94,66,75,56
2,53,19,72,20
3,32,91,10,14
4,88,65,70,49
5,31,57,27,95


In [55]:
type(df)

pandas.core.frame.DataFrame

In [57]:
df[2]

0    82
1    75
2    72
3    10
4    70
5    27
Name: 2, dtype: int32

In [58]:
type(df[2])

pandas.core.series.Series

In [59]:
type(df[0])

pandas.core.series.Series

#### dictionary of `Series` objects:

In [28]:
people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people

Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


A few things to note:
* the `Series` were automatically aligned based on their index,
* missing values are represented as `NaN`,
* `Series` names are ignored (the name `"year"` was dropped),
* `DataFrame`s are displayed nicely in Jupyter notebooks, woohoo!

You can access columns pretty much as you would expect. They are returned as `Series` objects:

In [51]:
people["birthyear"]

alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

You can also get multiple columns at once:

In [52]:
people[["birthyear", "hobby"]]

Unnamed: 0,birthyear,hobby
alice,1985,Biking
bob,1984,Dancing
charles,1992,


It is also possible to create a `DataFrame` with a dictionary (or list) of dictionaries (or lists):

In [61]:
people = pd.DataFrame({
    "birthyear": {"alice": 1985, "bob": 1984, "charles": 1992},
    "hobby": {"alice": "Biking", "bob": "Dancing"},
    "weight": {"alice": 68, "bob": 83, "charles": 112},
    "children": {"bob": 3, "charles": 0}
})
people


Unnamed: 0,birthyear,hobby,weight,children
alice,1985,Biking,68,
bob,1984,Dancing,83,3.0
charles,1992,,112,0.0


#### `DataFrame(columns=[],index=[])` constructor

If you pass a list of columns and/or index row labels to the `DataFrame` constructor, it will guarantee that these columns and/or rows will exist, in that order, and no other column/row will exist. For example:

In [6]:
d2 = pd.DataFrame(
        people_dict,
        columns=["birthyear", "weight", "height"],
        index=["bob", "alice", "eugene"]
     )
d2

Unnamed: 0,birthyear,weight,height
bob,1984.0,83.0,
alice,1985.0,68.0,
eugene,,,


Another convenient way to create a `DataFrame` is to pass all the values to the constructor as an `ndarray`, or a list of lists, and specify the column names and row index labels separately:

In [10]:
values = [
            [1985, np.nan, "Biking",   68],
            [1984, 3,      "Dancing",  83],
            [1992, 0,      np.nan,    112]
         ]
d3 = pd.DataFrame(
        values,
        columns=["birthyear", "children", "hobby", "weight"],
        index=["alice", "bob", "charles"]
     )
d3

Unnamed: 0,birthyear,children,hobby,weight
alice,1985,,Biking,68
bob,1984,3.0,Dancing,83
charles,1992,0.0,,112


To specify missing values, you can either use `np.nan` or NumPy's masked arrays:

In [55]:
masked_array = np.ma.asarray(values, dtype=object)  # the np.object alias was removed in NumPy 1.24
masked_array[(0, 2), (1, 2)] = np.ma.masked
d3 = pd.DataFrame(
        masked_array,
        columns=["birthyear", "children", "hobby", "weight"],
        index=["alice", "bob", "charles"]
     )
d3

Unnamed: 0,birthyear,children,hobby,weight
alice,1985,,Biking,68
bob,1984,3.0,Dancing,83
charles,1992,0.0,,112


Instead of an `ndarray`, you can also pass a `DataFrame` object:

In [11]:
d4 = pd.DataFrame(
         d3,
         columns=["hobby", "children"],
         index=["alice", "bob"]
     )
d4

Unnamed: 0,hobby,children
alice,Biking,
bob,Dancing,3.0


### Indexing, Masking, Query

#### Extracting Columns

In [58]:
arr = np.random.randint(10, 100, size=(6, 4))
df = pd.DataFrame(data=arr,columns=["a", "b", "c", "d"])
# df.columns = ["a", "b", "c", "d"]
df

Unnamed: 0,a,b,c,d
0,71,34,51,84
1,26,36,56,54
2,95,46,40,25
3,95,41,46,62
4,80,86,79,59
5,36,39,96,82


In [59]:
df['c']

0    51
1    56
2    40
3    46
4    79
5    96
Name: c, dtype: int32

Multiple columns can be extracted at once:

In [61]:
df[['b','c','a']]

Unnamed: 0,b,c,a
0,34,51,71
1,36,56,26
2,46,40,95
3,41,46,95
4,86,79,80
5,39,96,36


#### Extracting Rows

In [139]:
arr = np.random.randint(10, 100, size=(6, 4))
df = pd.DataFrame(data=arr)
df.columns = ["a", "b", "c", "d"]
df.index = "p q r s t u".split()
df


Unnamed: 0,a,b,c,d
p,64,81,95,79
q,17,52,83,41
r,81,76,54,29
s,96,58,22,98
t,83,27,67,95
u,16,34,60,11


##### `loc` - label location

The `loc` attribute lets you access rows instead of columns. The result is a `Series` object in which the `DataFrame`'s column names are mapped to row index labels:

In [141]:
df.loc["p"]

a    64
b    81
c    95
d    79
Name: p, dtype: int32

##### `iloc` - integer location

You can also access rows by integer location using the `iloc` attribute:

In [143]:
df.iloc[2]

a    81
b    76
c    54
d    29
Name: r, dtype: int32

You can also get a slice of rows, and this returns a `DataFrame` object:

In [144]:
df.iloc[1:3]

Unnamed: 0,a,b,c,d
q,17,52,83,41
r,81,76,54,29


In [147]:
df.iloc[1:3][['a','b']]


Unnamed: 0,a,b
q,17,52
r,81,76


In [150]:
df.iloc[1:3,:2]


Unnamed: 0,a,b
q,17,52
r,81,76
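
The same row-and-column selection can be written with labels only. The sketch below rebuilds a few rows of the table above so that it runs on its own, and relies on `loc` accepting a row selector followed by a column selector:

```python
import pandas as pd

df = pd.DataFrame(
    [[64, 81, 95, 79], [17, 52, 83, 41], [81, 76, 54, 29]],
    index=["p", "q", "r"],
    columns=["a", "b", "c", "d"],
)

# Rows "q" through "r" (inclusive) and columns "a" and "b", all by label
subset = df.loc["q":"r", ["a", "b"]]

# A single cell: row label first, then column label
cell = df.loc["q", "c"]

print(subset)
print(cell)
```

Doing the selection in one `loc` call, instead of chaining two `[]` lookups, also keeps later assignments to the selection unambiguous.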


#### Masking - Boolean Indexing

In [156]:
df

Unnamed: 0,a,b,c,d
p,64,81,95,79
q,17,52,83,41
r,81,76,54,29
s,96,58,22,98
t,83,27,67,95
u,16,34,60,11


In [157]:
df > 30

Unnamed: 0,a,b,c,d
p,True,True,True,True
q,False,True,True,True
r,True,True,True,False
s,True,True,False,True
t,True,False,True,True
u,False,True,True,False


In [158]:
mask = df > 30
df[mask]

Unnamed: 0,a,b,c,d
p,64.0,81.0,95.0,79.0
q,,52.0,83.0,41.0
r,81.0,76.0,54.0,
s,96.0,58.0,,98.0
t,83.0,,67.0,95.0
u,,34.0,60.0,


This is most useful when combined with boolean expressions:

In [159]:
df['a'] <50 

p    False
q     True
r    False
s    False
t    False
u     True
Name: a, dtype: bool

In [154]:
df[df["a"] < 50] # keeps only rows q and u, where the mask is True

Unnamed: 0,a,b,c,d
q,17,52,83,41
u,16,34,60,11


In [160]:
df[df["a"] < 50][['a','d']]

Unnamed: 0,a,d
q,17,41
u,16,11
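
Several conditions can be combined into one mask. The following sketch uses a small made-up table; the point is that each comparison needs its own parentheses and that the combining operators are `&`, `|` and `~` rather than Python's `and`, `or` and `not`:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [64, 17, 81, 16], "d": [79, 41, 29, 11]},
    index=["p", "q", "r", "u"],
)

# Rows where BOTH conditions hold
both = df[(df["a"] < 50) & (df["d"] > 30)]

# Rows where AT LEAST ONE condition holds
either = df[(df["a"] < 50) | (df["d"] > 70)]

# Rows where the condition does NOT hold
not_small = df[~(df["a"] < 50)]
```

Without the parentheses, `&` would bind more tightly than `<` and the expression would raise an error or give an unintended result.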


#### Querying a `DataFrame`

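The `query()` method lets you filter a `DataFrame` based on a query expression. The example below assumes that `people` already has `age` and `pets` columns, like the ones added under "Adding and removing columns" further down: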

In [87]:
people.query("age > 30 and pets == 0")

Unnamed: 0,hobby,height,weight,age,over 30,pets,body_mass_index,overweight
bob,Dancing,181,83,34,True,0.0,25.335002,False
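
For a version that runs on its own, here is a sketch with a small stand-in `people` table (the values are chosen for illustration). It also shows that a local Python variable, here the hypothetical `min_age`, can be referenced inside the expression with an `@` prefix:

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [36, 37, 29], "pets": [1, 0, 5]},
    index=["alice", "bob", "charles"],
)

min_age = 30  # a local Python variable, referenced in the query with "@"

# Column names are used bare inside the expression
result = people.query("age > @min_age and pets == 0")

# The equivalent boolean-mask form
same = people[(people["age"] > min_age) & (people["pets"] == 0)]
```

Both forms return the same rows; `query()` is mostly a readability choice when the condition involves several columns.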


### Adding and removing columns

In [71]:
people_dict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people


Unnamed: 0,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


#### direct assignment

In [72]:
people["age"] = 2021 - people["birthyear"]  # adds a new column "age"
people["over 30"] = people["age"] > 30      # adds another column "over 30"
people

Unnamed: 0,weight,birthyear,children,hobby,age,over 30
alice,68,1985,,Biking,36,True
bob,83,1984,3.0,Dancing,37,True
charles,112,1992,0.0,,29,False


In [73]:
people["birthyear"]

alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64

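When you add a new column from a list or array, it must have the same number of rows as the `DataFrame`. A `Series` is aligned on its index labels instead: rows missing from the `Series` are filled with `NaN`, and extra labels are ignored: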

In [74]:
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1})  # alice is missing, eugene is ignored
people

Unnamed: 0,weight,birthyear,children,hobby,age,over 30,pets
alice,68,1985,,Biking,36,True,
bob,83,1984,3.0,Dancing,37,True,0.0
charles,112,1992,0.0,,29,False,5.0


#### `insert(position,column,value)`

When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the `insert()` method:

In [75]:
people.insert(1, "height", [172, 181, 185])
people

Unnamed: 0,weight,height,birthyear,children,hobby,age,over 30,pets
alice,68,172,1985,,Biking,36,True,
bob,83,181,1984,3.0,Dancing,37,True,0.0
charles,112,185,1992,0.0,,29,False,5.0


#### `assign()`: Assigning new columns

You can also create new columns by calling the `assign()` method. Note that this returns a new `DataFrame` object, **the original is not modified:**

In [76]:
people.assign(
    body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
    has_pets = people["pets"] > 0
)

Unnamed: 0,weight,height,birthyear,children,hobby,age,over 30,pets,body_mass_index,has_pets
alice,68,172,1985,,Biking,36,True,,22.985398,False
bob,83,181,1984,3.0,Dancing,37,True,0.0,25.335002,False
charles,112,185,1992,0.0,,29,False,5.0,32.724617,True


In [77]:
people # the original is not modified

Unnamed: 0,weight,height,birthyear,children,hobby,age,over 30,pets
alice,68,172,1985,,Biking,36,True,
bob,83,181,1984,3.0,Dancing,37,True,0.0
charles,112,185,1992,0.0,,29,False,5.0


Note that you cannot access columns created within the same assignment:

In [78]:
try:
    people.assign(
        body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
        overweight = people["body_mass_index"] > 25 # body_mass_index is not defined at this point
    )
except KeyError as e:
    print("Key error:", e)

Key error: 'body_mass_index'


The solution is to split this assignment in two consecutive assignments:

In [79]:
d6 = people.assign(body_mass_index = people["weight"] / (people["height"] / 100) ** 2)
d6.assign(overweight = d6["body_mass_index"] > 25)

Unnamed: 0,weight,height,birthyear,children,hobby,age,over 30,pets,body_mass_index,overweight
alice,68,172,1985,,Biking,36,True,,22.985398,False
bob,83,181,1984,3.0,Dancing,37,True,0.0,25.335002,True
charles,112,185,1992,0.0,,29,False,5.0,32.724617,True


Having to create a temporary variable `d6` is not very convenient. You may want to just chain the assignment calls, but it does not work because the `people` object is not actually modified by the first assignment:

In [80]:
try:
    (people
         .assign(body_mass_index = people["weight"] / (people["height"] / 100) ** 2)
         .assign(overweight = people["body_mass_index"] > 25)
    )
except KeyError as e:
    print("Key error:", e)

Key error: 'body_mass_index'


But fear not, there is a simple solution. You can pass a function to the `assign()` method (typically a `lambda` function), and this function will be called with the `DataFrame` as a parameter:

In [81]:
(people
     .assign(body_mass_index = lambda df: df["weight"] / (df["height"] / 100) ** 2)
     .assign(overweight = lambda df: df["body_mass_index"] > 25)
)

Unnamed: 0,weight,height,birthyear,children,hobby,age,over 30,pets,body_mass_index,overweight
alice,68,172,1985,,Biking,36,True,,22.985398,False
bob,83,181,1984,3.0,Dancing,37,True,0.0,25.335002,True
charles,112,185,1992,0.0,,29,False,5.0,32.724617,True


Problem solved!
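
As a further sketch, the two chained calls can usually be folded into one: the pandas documentation states that keyword arguments to `assign()` are evaluated in order, so a later callable can see a column created earlier in the same call. The small `people` table below is a stand-in built for this example:

```python
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "height": [172, 181, 185]},
    index=["alice", "bob", "charles"],
)

result = people.assign(
    body_mass_index=lambda df: df["weight"] / (df["height"] / 100) ** 2,
    # This callable receives a frame that already holds body_mass_index
    overweight=lambda df: df["body_mass_index"] > 25,
)
print(result)
```

This relies on both values being callables. Writing `people["body_mass_index"]` directly would still fail, because that expression is evaluated before `assign()` is even called.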

#### `drop` and `pop`

In [128]:
arr = np.random.randint(10, 100, size=(4,8))
df = pd.DataFrame(data=arr,columns=["a", "b", "c", "d", "e", "f", "g", "h"])
df['a+b'] = df['a'] + df['b']
df['a-b'] = df['a'] * df['b']  # NOTE: despite the 'a-b' label, this stores the product a * b
df

Unnamed: 0,a,b,c,d,e,f,g,h,a+b,a-b
0,10,60,21,41,85,27,92,49,70,600
1,43,60,50,74,90,86,87,65,103,2580
2,55,62,23,58,83,80,90,72,117,3410
3,47,97,11,13,91,92,49,35,144,4559


In [129]:
delC = df.pop('c')  # removes column c and returns it as a Series
del df["d"] # removes column d
df

Unnamed: 0,a,b,e,f,g,h,a+b,a-b
0,10,60,85,27,92,49,70,600
1,43,60,90,86,87,65,103,2580
2,55,62,83,80,90,72,117,3410
3,47,97,91,92,49,35,144,4559


In [130]:
df.drop(columns=['e','f','a-b'])

Unnamed: 0,a,b,g,h,a+b
0,10,60,92,49,70
1,43,60,87,65,103
2,55,62,90,72,117
3,47,97,49,35,144


In [131]:
df

Unnamed: 0,a,b,e,f,g,h,a+b,a-b
0,10,60,85,27,92,49,70,600
1,43,60,90,86,87,65,103,2580
2,55,62,83,80,90,72,117,3410
3,47,97,91,92,49,35,144,4559


In [132]:
df.drop(columns=['e', 'f', 'a-b'], inplace=True)  # original df is modified


In [133]:
df

Unnamed: 0,a,b,g,h,a+b
0,10,60,92,49,70
1,43,60,87,65,103
2,55,62,90,72,117
3,47,97,49,35,144
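
`drop` can remove rows as well as columns. The sketch below uses a small made-up table with string row labels, and also shows the `errors="ignore"` option for labels that may not exist:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 43, 55], "b": [60, 60, 62]}, index=["x", "y", "z"])

# Drop rows by index label; a new DataFrame is returned
without_y = df.drop(index=["y"])

# errors="ignore" skips labels that do not exist instead of raising KeyError
safe = df.drop(columns=["b", "not_a_column"], errors="ignore")

print(without_y)
print(safe)
```

As with the column examples above, `df` itself is left unchanged unless `inplace=True` is passed.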


### Handy Methods and Properties

In [52]:
arr = np.random.randint(10, 100, size=(6, 4))
df = pd.DataFrame(data=arr)
df.columns = 'a b c d'.split()
df


Unnamed: 0,a,b,c,d
0,49,73,76,47
1,84,25,87,59
2,79,36,76,69
3,46,88,43,41
4,95,24,15,37
5,59,49,36,23


#### `shape` , `dtypes` , `info()`, `describe()`

In [41]:
df.shape

(6, 4)

In [6]:
df.dtypes

a    int32
b    int32
c    int32
d    int32
dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       6 non-null      int32
 1   b       6 non-null      int32
 2   c       6 non-null      int32
 3   d       6 non-null      int32
dtypes: int32(4)
memory usage: 224.0 bytes


In [55]:
df.describe()

Unnamed: 0,a,b,c,d
count,6.0,6.0,6.0,6.0
mean,68.666667,49.166667,55.5,46.0
std,20.146133,26.331856,28.317839,16.334014
min,46.0,24.0,15.0,23.0
25%,51.5,27.75,37.75,38.0
50%,69.0,42.5,59.5,44.0
75%,82.75,67.0,76.0,56.0
max,95.0,88.0,87.0,69.0


In [56]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
a,6.0,68.666667,20.146133,46.0,51.5,69.0,82.75,95.0
b,6.0,49.166667,26.331856,24.0,27.75,42.5,67.0,88.0
c,6.0,55.5,28.317839,15.0,37.75,59.5,76.0,87.0
d,6.0,46.0,16.334014,23.0,38.0,44.0,56.0,69.0


#### `head` and `tail`

- `head`: returns the first `n` rows (5 by default)
- `tail`: returns the last `n` rows (5 by default)

In [44]:
df.head()

Unnamed: 0,0,1,2,3
0,80,11,76,81
1,51,29,24,59
2,64,64,26,62
3,38,29,78,97
4,10,72,77,45


In [45]:
df.head(n=3)

Unnamed: 0,0,1,2,3
0,80,11,76,81
1,51,29,24,59
2,64,64,26,62


In [47]:
df.tail(n=2)

Unnamed: 0,0,1,2,3
4,10,72,77,45
5,83,55,26,71


#### `columns`

In [5]:
df.columns

Index(['a', 'b', 'c', 'd'], dtype='object')

#### `values` : returns a numpy array

In [162]:
arr = df.values
arr

array([[96, 71, 41, 34],
       [43, 92, 89, 79],
       [20, 28, 78, 42],
       [51, 82, 67, 30],
       [13, 27, 72, 79],
       [85, 12, 62, 14]])

#### `unique` and `nunique`

`unique()` returns the distinct values of a `Series`, and `nunique()` returns how many distinct values there are.

In [11]:
people_dict = {
    "country": pd.Series(['BD', 'IN', 'PAK', 'BD', 'BD', 'IN']),
    "name": pd.Series(['A', 'B', 'C', 'D', 'E', 'F'])
}
people = pd.DataFrame(people_dict)
people

Unnamed: 0,country,name
0,BD,A
1,IN,B
2,PAK,C
3,BD,D
4,BD,E
5,IN,F


In [16]:
people.nunique()

country    3
name       6
dtype: int64

In [20]:
people['country'].nunique()


3

In [19]:
people['country'].unique()


array(['BD', 'IN', 'PAK'], dtype=object)

#### `value_counts()`

Counts the occurrences of each unique value.

In [22]:
people['country'].value_counts()

BD     3
IN     2
PAK    1
Name: country, dtype: int64

In [23]:
people['country'].value_counts()['BD']

3
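
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. A minimal sketch using the same country values as above:

```python
import pandas as pd

country = pd.Series(["BD", "IN", "PAK", "BD", "BD", "IN"])

# normalize=True returns each value's share of the total instead of a raw count
shares = country.value_counts(normalize=True)
print(shares)
```

The shares sum to 1, so they can be read directly as fractions of the rows.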

#### Sorting a `DataFrame`
You can sort a `DataFrame` by calling its `sort_index` method. By default it sorts the rows by their **index label**, in ascending order, but let's reverse the order:

In [37]:
people_dict = {
    "country": pd.Series(['BD', 'IN', 'PAK', 'SL', 'US', 'IN']),
    "name": pd.Series(['A', 'B', 'C', 'D', 'E', 'F']),
    "cgpa": pd.Series([3.56, 4.00, 3.55, 3.86, 3.99, 3.89])
}
people = pd.DataFrame(people_dict)
people

Unnamed: 0,country,name,cgpa
0,BD,A,3.56
1,IN,B,4.0
2,PAK,C,3.55
3,SL,D,3.86
4,US,E,3.99
5,IN,F,3.89


In [38]:
people.sort_index(ascending=False).head(n=3)

Unnamed: 0,country,name,cgpa
5,IN,F,3.89
4,US,E,3.99
3,SL,D,3.86


Note that `sort_index` returned a sorted *copy* of the `DataFrame`. To modify `people` directly, we can set the `inplace` argument to `True`. Also, we can sort the columns instead of the rows by setting `axis=1`:

In [39]:
people.sort_index(axis=1,ascending=False, inplace=True)
people

Unnamed: 0,name,country,cgpa
0,A,BD,3.56
1,B,IN,4.0
2,C,PAK,3.55
3,D,SL,3.86
4,E,US,3.99
5,F,IN,3.89


To sort the `DataFrame` by the values instead of the labels, we can use `sort_values` and specify the column to sort by:

In [41]:
people.sort_values(by=["name"], ascending=False,inplace=True)
people

Unnamed: 0,name,country,cgpa
5,F,IN,3.89
4,E,US,3.99
3,D,SL,3.86
2,C,PAK,3.55
1,B,IN,4.0
0,A,BD,3.56


In [46]:
people.sort_values(by=["cgpa", "name"])

Unnamed: 0,name,country,cgpa
2,C,PAK,3.55
0,A,BD,3.56
3,D,SL,3.86
5,F,IN,3.89
4,E,US,3.99
1,B,IN,4.0
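
When sorting by several columns, each column can have its own direction by passing a list to `ascending`. A sketch on a shortened copy of the table above:

```python
import pandas as pd

people = pd.DataFrame({
    "country": ["BD", "IN", "PAK", "IN"],
    "name": ["A", "B", "C", "F"],
    "cgpa": [3.56, 4.00, 3.55, 3.89],
})

# Sort by country A-Z, and within each country by cgpa from high to low
ordered = people.sort_values(by=["country", "cgpa"], ascending=[True, False])
print(ordered)
```

The `ascending` list must have one entry per column named in `by`.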


#### `apply`

`Series.apply()` calls a function on every value of a `Series` and returns the results. `DataFrame.apply()` calls the function once per column by default (or once per row with `axis=1`), passing each column or row to it as a `Series`.

In [47]:
df

Unnamed: 0,a,b,c,d
0,74,75,90,63
1,95,57,33,56
2,23,27,14,24
3,64,88,45,81
4,72,17,59,65
5,87,83,98,91


In [48]:
df.apply(lambda x: x+x)

Unnamed: 0,a,b,c,d
0,148,150,180,126
1,190,114,66,112
2,46,54,28,48
3,128,176,90,162
4,144,34,118,130
5,174,166,196,182


In [50]:
df['a'].apply(lambda x:x*10)

0    740
1    950
2    230
3    640
4    720
5    870
Name: a, dtype: int64
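
To make the column-wise versus row-wise distinction concrete, here is a short sketch on a made-up two-column table. It also includes the plain vectorized form, which is usually preferable for simple arithmetic:

```python
import pandas as pd

df = pd.DataFrame({"a": [74, 95, 23], "b": [75, 57, 27]})

# Default axis=0: the function receives each COLUMN as a Series
col_max = df.apply(lambda col: col.max())

# axis=1: the function receives each ROW as a Series
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# For simple arithmetic, the vectorized form gives the same values
vectorized = df["a"] + df["b"]
```

`apply` with `axis=1` runs a Python function once per row, so it tends to be much slower than the vectorized expression on large tables.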

### Saving & loading files

Pandas can save `DataFrame`s to various backends, including file formats such as CSV, Excel, JSON, HTML and HDF5, or to a SQL database. Let's create a `DataFrame` to demonstrate this:

In [57]:
my_df = pd.DataFrame(
    [["Biking", 68.5, 1985, np.nan], ["Dancing", 83.1, 1984, 3]], 
    columns=["hobby","weight","birthyear","children"],
    index=["alice", "bob"]
)
my_df

Unnamed: 0,hobby,weight,birthyear,children
alice,Biking,68.5,1985,
bob,Dancing,83.1,1984,3.0


#### Saving
Let's save it to CSV, HTML and JSON:

In [58]:
my_df.to_csv("my_df.csv")
# my_df.to_csv("my_df.csv",index=False)
my_df.to_html("my_df.html")
my_df.to_json("my_df.json")

#### Loading
Now let's load our CSV file back into a `DataFrame`:

In [62]:
my_df_loaded = pd.read_csv("my_df.csv")
my_df_loaded

Unnamed: 0.1,Unnamed: 0,hobby,weight,birthyear,children
0,alice,Biking,68.5,1985,
1,bob,Dancing,83.1,1984,3.0


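Without `index_col`, the saved row labels come back as an ordinary column named `Unnamed: 0`. Pass `index_col=0` to use the first column as the index again:

In [63]: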
my_df_loaded = pd.read_csv("my_df.csv",index_col=0)
my_df_loaded


Unnamed: 0,hobby,weight,birthyear,children
alice,Biking,68.5,1985,
bob,Dancing,83.1,1984,3.0
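
The same round trip can be sketched without touching the disk, which is handy for quick experiments. This assumes only that `to_csv()` returns a string when no path is given and that `read_csv()` accepts a file-like object:

```python
import io

import numpy as np
import pandas as pd

my_df = pd.DataFrame(
    [["Biking", 68.5, 1985, np.nan], ["Dancing", 83.1, 1984, 3]],
    columns=["hobby", "weight", "birthyear", "children"],
    index=["alice", "bob"],
)

# With no path, to_csv() returns the CSV text as a string
csv_text = my_df.to_csv()

# read_csv() accepts any file-like object; index_col=0 restores the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

The missing `children` value for `alice` is written as an empty field and read back as `NaN`.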


As you might guess, there are similar `read_json`, `read_html`, `read_excel` functions as well. We can also read data straight from the Internet. For example, let's load the top 1,000 U.S. cities from GitHub:

In [64]:
us_cities = None
try:
    csv_url = "https://raw.githubusercontent.com/plotly/datasets/master/us-cities-top-1k.csv"
    us_cities = pd.read_csv(csv_url, index_col=0)
    us_cities = us_cities.head()
except IOError as e:
    print(e)
us_cities

City,State,Population,lat,lon
Marysville,Washington,63269,48.051764,-122.177082
Perris,California,72326,33.782519,-117.228648
Cleveland,Ohio,390113,41.49932,-81.694361
Worcester,Massachusetts,182544,42.262593,-71.802293
Columbia,South Carolina,133358,34.00071,-81.034814


### Operations on `DataFrame`s
Although `DataFrame`s do not try to mimic NumPy arrays, there are a few similarities. Let's create a `DataFrame` to demonstrate this:

In [51]:
grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])
grades = pd.DataFrame(grades_array,
    columns=["sep", "oct", "nov"],
    index=["alice", "bob", "charles", "darwin"])
grades

Unnamed: 0,sep,oct,nov
alice,8,8,9
bob,10,9,9
charles,4,8,2
darwin,9,10,10


You can apply NumPy mathematical functions on a `DataFrame`: the function is applied to all values:

In [94]:
np.sqrt(grades)

Unnamed: 0,sep,oct,nov
alice,2.828427,2.828427,3.0
bob,3.162278,3.0,3.0
charles,2.0,2.828427,1.414214
darwin,3.0,3.162278,3.162278


Similarly, adding a single value to a `DataFrame` will add that value to all elements in the `DataFrame`. This is called *broadcasting*:

In [95]:
grades + 1

Unnamed: 0,sep,oct,nov
alice,9,9,10
bob,11,10,10
charles,5,9,3
darwin,10,11,11


Of course, the same is true for all other binary operations, including arithmetic (`*`,`/`,`**`...) and conditional (`>`, `==`...) operations:

In [96]:
grades >= 5

Unnamed: 0,sep,oct,nov
alice,True,True,True
bob,True,True,True
charles,False,True,False
darwin,True,True,True


Aggregation operations, such as computing the `max`, the `sum` or the `mean` of a `DataFrame`, apply to each column, and you get back a `Series` object:

In [97]:
grades.mean()

sep    7.75
oct    8.75
nov    7.50
dtype: float64

The `all` method is also an aggregation operation: it checks whether all values are `True` or not. Let's see during which months all students got a grade greater than `5`:

In [98]:
(grades > 5).all()

sep    False
oct     True
nov    False
dtype: bool

Most of these functions take an optional `axis` parameter which lets you specify along which axis of the `DataFrame` you want the operation executed. The default is `axis=0`, meaning that the operation is executed vertically (on each column). You can set `axis=1` to execute the operation horizontally (on each row). For example, let's find out which students had all grades greater than `5`:

In [99]:
(grades > 5).all(axis = 1)

alice       True
bob         True
charles    False
darwin      True
dtype: bool

The `any` method returns `True` if any value is True. Let's see who got at least one grade 10:

In [100]:
(grades == 10).any(axis = 1)

alice      False
bob         True
charles    False
darwin      True
dtype: bool

If you add a `Series` object to a `DataFrame` (or execute any other binary operation), pandas attempts to broadcast the operation to all *rows* in the `DataFrame`. It does so by matching the index labels of the `Series` against the `DataFrame`'s column names, so this works as intended only when those labels match. For example, let's subtract the `mean` of the `DataFrame` (a `Series` object) from the `DataFrame`:

In [101]:
grades - grades.mean()  # equivalent to: grades - [7.75, 8.75, 7.50]

Unnamed: 0,sep,oct,nov
alice,0.25,-0.75,1.5
bob,2.25,0.25,1.5
charles,-3.75,-0.75,-5.5
darwin,1.25,1.25,2.5


We subtracted `7.75` from all September grades, `8.75` from October grades and `7.50` from November grades. It is equivalent to subtracting this `DataFrame`:

In [102]:
pd.DataFrame([[7.75, 8.75, 7.50]]*4, index=grades.index, columns=grades.columns)

Unnamed: 0,sep,oct,nov
alice,7.75,8.75,7.5
bob,7.75,8.75,7.5
charles,7.75,8.75,7.5
darwin,7.75,8.75,7.5


If you want to subtract the global mean from every grade, here is one way to do it:

In [103]:
grades - grades.values.mean() # subtracts the global mean (8.00) from all grades

Unnamed: 0,sep,oct,nov
alice,0.0,0.0,1.0
bob,2.0,1.0,1.0
charles,-4.0,0.0,-6.0
darwin,1.0,2.0,2.0
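
To broadcast a `Series` down the columns instead (for example, to subtract each student's own mean), the `sub()` method with `axis=0` is one way to do it. The sketch below rebuilds `grades` so that it runs on its own; `student_means` and `centered` are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame(
    np.array([[8, 8, 9], [10, 9, 9], [4, 8, 2], [9, 10, 10]]),
    columns=["sep", "oct", "nov"],
    index=["alice", "bob", "charles", "darwin"],
)

# One mean per student (row), indexed by student name
student_means = grades.mean(axis=1)

# axis=0 tells sub() to match the Series labels against the row index
centered = grades.sub(student_means, axis=0)
print(centered)
```

After this operation each row sums to zero (up to floating-point rounding), since every student's own mean has been removed.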


### Aggregating with `groupby`

Similar to the SQL language, pandas allows grouping your data into groups to run calculations over each group.

To demonstrate, let's load the Iris dataset from a local `iris.csv` file, and start with whole-table aggregations before grouping:

In [67]:
iris = pd.read_csv("iris.csv")
iris.head()


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [68]:
iris.aggregate('min')

sepal_length       4.3
sepal_width        2.0
petal_length       1.0
petal_width        0.1
species         setosa
dtype: object

In [69]:
iris.aggregate(['min','max','mean','median'])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
min,4.3,2.0,1.0,0.1,setosa
max,7.9,4.4,6.9,2.5,virginica
mean,5.843333,3.054,3.758667,1.198667,
median,5.8,3.0,4.35,1.3,


In [70]:
groupby = iris.groupby('species')
groupby

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EC566AC820>

In [71]:
groupby.min()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,4.3,2.3,1.0,0.1
versicolor,4.9,2.0,3.0,1.0
virginica,4.9,2.2,4.5,1.4


In [72]:
groupby.mean()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.006,3.418,1.464,0.244
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


In [77]:
iris[iris['species'] == 'setosa']['sepal_length'].mean()

5.005999999999999
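
Several statistics per group can be computed in one step with named aggregation. The sketch below uses a tiny hand-made table standing in for the iris data (the values are invented), and the output column names `mean_length` and `n_rows` are chosen here for illustration:

```python
import pandas as pd

# A tiny hand-made table standing in for the iris data
flowers = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 4.6, 6.0, 7.0],
})

# Named aggregation: output_column=(input_column, function)
summary = flowers.groupby("species").agg(
    mean_length=("sepal_length", "mean"),
    n_rows=("sepal_length", "size"),
)
print(summary)
```

The result is indexed by the grouping column, with one row per species and one column per named aggregate.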

### Handling Missing Data

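- `dropna`: returns a copy with the rows that contain missing values removed
- `fillna`: returns a copy with the missing values replaced by a given value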

In [80]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [82]:
nan_idx = np.random.randint(0,150,20)
iris['sepal_length'][nan_idx] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iris['sepal_length'][nan_idx] = np.nan


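This warning appears because `iris['sepal_length'][nan_idx] = ...` is a *chained assignment*: the first `[]` may return a copy, so the write may never reach `iris`. Rather than silencing the warning, assign through a single `loc` call:

In [90]:
nan_idx = np.random.randint(0, 150, 20)
iris.loc[nan_idx, 'sepal_length'] = np.nan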

In [93]:
iris['sepal_length'][:20]

0     5.1
1     4.9
2     4.7
3     4.6
4     5.0
5     5.4
6     4.6
7     5.0
8     4.4
9     4.9
10    NaN
11    NaN
12    4.8
13    4.3
14    NaN
15    5.7
16    NaN
17    NaN
18    NaN
19    5.1
Name: sepal_length, dtype: float64

In [94]:
nan_idx = np.random.randint(0, 150, 20)
iris.loc[nan_idx, 'petal_width'] = np.nan

In [96]:
iris.isna().sum()

sepal_length    63
sepal_width      0
petal_length     0
petal_width     20
species          0
dtype: int64

In [97]:
iris.dropna()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
...,...,...,...,...,...
138,6.0,3.0,4.8,1.8,virginica
141,6.9,3.1,5.1,2.3,virginica
144,6.7,3.3,5.7,2.5,virginica
145,6.7,3.0,5.2,2.3,virginica


In [98]:
iris['sepal_length'].fillna(value="FILLED")[:20]

0        5.1
1        4.9
2        4.7
3        4.6
4        5.0
5        5.4
6        4.6
7        5.0
8        4.4
9        4.9
10    FILLED
11    FILLED
12       4.8
13       4.3
14    FILLED
15       5.7
16    FILLED
17    FILLED
18    FILLED
19       5.1
Name: sepal_length, dtype: object

In [99]:
iris['sepal_length'] = iris['sepal_length'].fillna(value=round(iris['sepal_length'].mean(),1))

In [102]:
iris['sepal_length'][:20]

0     5.1
1     4.9
2     4.7
3     4.6
4     5.0
5     5.4
6     4.6
7     5.0
8     4.4
9     4.9
10    5.8
11    5.8
12    4.8
13    4.3
14    5.8
15    5.7
16    5.8
17    5.8
18    5.8
19    5.1
Name: sepal_length, dtype: float64
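
Both methods also work on a whole `DataFrame` with finer control. As a sketch on a small made-up table, `dropna` can be limited to specific columns with `subset`, and `fillna` can take a `dict` that gives each column its own replacement value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, np.nan],
    "petal_width": [0.2, 0.2, np.nan, 0.4],
})

# Drop only the rows where a specific column is missing
kept = df.dropna(subset=["sepal_length"])

# Fill each column with its own replacement value (here, the column mean)
filled = df.fillna({
    "sepal_length": df["sepal_length"].mean(),
    "petal_width": df["petal_width"].mean(),
})
print(filled)
```

Neither call modifies `df`; assign the result back to a name if the cleaned table is needed later.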