# Pandas

- Developed by Wes McKinney
- Built on NumPy

## 1. Create DataFrame manually - using dictionary

In [74]:
import pandas as pd

countries_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}


brics = pd.DataFrame(countries_data)
print(brics)

        country    capital    area  population
0        Brazil   Brasilia   8.516      200.40
1        Russia     Moscow  17.100      143.50
2         India  New Delhi   3.286     1252.00
3         China    Beijing   9.597     1357.00
4  South Africa   Pretoria   1.221       52.98


In [75]:
# Changing default row index to custom one
brics.index = ['BR', 'RU', 'IN', 'CH', 'SA']

brics

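|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |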
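The row labels can also be supplied when the DataFrame is built, rather than assigned afterwards. The sketch below assumes the same data as above; the name `brics_labelled` is only illustrative.

```python
import pandas as pd

countries_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

# Pass the row labels through the index argument at construction time.
brics_labelled = pd.DataFrame(
    countries_data, index=["BR", "RU", "IN", "CH", "SA"]
)
print(brics_labelled.index.tolist())
```

Both approaches should give the same frame; setting the index at construction just avoids a separate assignment step.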


In [76]:
## 2. Read from files
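As a minimal sketch of this second route, `pd.read_csv` can build the same kind of DataFrame from CSV text. The CSV content and the `code` column name below are assumptions made for illustration (the notebook does not show its file), and an in-memory buffer stands in for a file path so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real file path could be passed instead.
csv_text = """code,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
"""

# index_col=0 uses the first column as the row labels.
brics_from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(brics_from_csv)
```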

## Indexing & Selecting Data
- Square Brackets
- Advanced Methods
    * loc
    * iloc
    

### Column Access []

In [13]:
brics['country']

BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object

This is a **Pandas Series** object: a one-dimensional array whose values carry labels (the index).  
A DataFrame is a collection of Series that share the same index, one Series per column.  
To extract the **column as a DataFrame**, we use double square brackets.

In [14]:
brics[['country']]

|    | country |
|----|---------|
| BR | Brazil |
| RU | Russia |
| IN | India |
| CH | China |
| SA | South Africa |
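The difference between single and double brackets can be checked directly by inspecting the returned types and shapes. This is a small sketch on a reduced copy of the data; `as_series` and `as_frame` are illustrative names.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

as_series = brics["country"]    # single brackets: one-dimensional Series
as_frame = brics[["country"]]   # double brackets: DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```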


To access multiple columns, i.e. a **sub-DataFrame**, we put the list of column names inside the square brackets.

In [17]:
brics[['country', 'capital']]

|    | country | capital |
|----|---------|---------|
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


### Row Access []
- With plain square brackets, the only way to select rows is a **slice**; as with Python lists, the end position is excluded

In [19]:
brics[1:4]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
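To illustrate why a slice is needed, the sketch below (on a reduced copy of the data) shows that a single integer inside square brackets is looked up as a column name and fails, while an integer slice selects rows by position. The flag `single_int_fails` is just an illustrative helper.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# An integer slice selects rows at positions 1, 2 and 3 (4 is excluded).
rows = brics[1:4]
print(rows.index.tolist())

# A single integer is treated as a column name, so this raises KeyError.
single_int_fails = False
try:
    brics[1]
except KeyError:
    single_int_fails = True
print(single_int_fails)
```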


Ideally, we want something like a NumPy array, i.e. data[rows, columns], where we can subset any combination of rows and columns.  
For that we use the **loc** and **iloc** indexers.  
**loc** is **label-based**, while **iloc** is **integer position-based**.
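As a quick sketch of the data[rows, columns] idea, the same cell can be reached either by its labels or by its integer positions. The example uses a reduced copy of the data; the variable names are illustrative.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# loc: row label and column label.
by_label = brics.loc["RU", "capital"]

# iloc: row position 1 and column position 1 (both counted from 0).
by_position = brics.iloc[1, 1]

print(by_label, by_position)
```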

### Row Access loc

In [26]:
# Single row
brics.loc["RU"]

country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object

This is returned as a Series. To get a DataFrame, pass the labels in a list.

In [40]:
# Select row as dataframe
brics.loc[["RU"]]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| RU | Russia | Moscow | 17.1 | 143.5 |


In [33]:
# Multiple rows
brics.loc[["RU", "IN", "CH"]]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |


### Row & Column Access loc

In [50]:
brics.loc[["RU", "IN", "SA"], ["country", "capital"]]

|    | country | capital |
|----|---------|---------|
| RU | Russia | Moscow |
| IN | India | New Delhi |
| SA | South Africa | Pretoria |


In [51]:
# Columns Access
brics.loc[:, ["country", "capital"]]

|    | country | capital |
|----|---------|---------|
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


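**NOTE**: iloc works like loc, except that rows and columns are selected by integer position instead of by label. One difference to keep in mind: a slice in iloc excludes its end position, while a label slice in loc includes its end label.  
In the following cells, all loc operations are repeated using iloc.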
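The two indexers also accept slices on both axes, and they treat the slice end differently. The sketch below, on a reduced copy of the data, selects the same block twice: the loc slice includes its end labels, while the iloc slice stops before its end positions.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Label slices: "CH" and "capital" are both included.
by_label = brics.loc["RU":"CH", "country":"capital"]

# Position slices: row 4 and column 2 are both excluded.
by_position = brics.iloc[1:4, 0:2]

print(by_label.equals(by_position))
```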

### Row Access iloc

In [38]:
# Single row
brics.iloc[1]

country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object

In [41]:
# Select row as dataframe
brics.iloc[[1]]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| RU | Russia | Moscow | 17.1 | 143.5 |


In [45]:
# Multiple rows
brics.iloc[[1,2,3]]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |


### Row & Column Access iloc

In [53]:
brics.iloc[[1,2,3], [0,1]]

|    | country | capital |
|----|---------|---------|
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |


In [55]:
# Columns Access
brics.iloc[:, [0,1]]

|    | country | capital |
|----|---------|---------|
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


## Pandas Filtering DataFrame & subselection

**Goal**: Select countries with area > 8M sq.km
1. Select the area column (*select the column as a Pandas Series instead of a DataFrame*)
2. Do the comparison on the area column
3. Use the result to select countries

In [59]:
is_huge = brics['area'] > 8
print(is_huge)

BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool


In [60]:
brics[is_huge]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| CH | China | Beijing | 9.597 | 1357.0 |
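A boolean Series like this can also be passed to loc, which allows filtering rows and picking columns in one step, and it can be inverted with `~`. The following is a sketch on a reduced copy of the data; `huge_subset` and `not_huge` are illustrative names.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

is_huge = brics["area"] > 8

# Filter rows with the mask and keep only the listed columns.
huge_subset = brics.loc[is_huge, ["country"]]

# ~ flips every True/False in the mask.
not_huge = brics[~is_huge]

print(huge_subset.index.tolist())
print(not_huge.index.tolist())
```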


### Boolean operators on DataFrame
- Suppose we want to enforce multiple conditions, say we want countries whose area is > 8 and < 10 M sq.km
- Python's `and` / `or` keywords do not work element-wise on a Series, so we can use the boolean functions of the NumPy package, e.g. np.logical_and(), np.logical_or(), etc.

In [61]:
import numpy as np

In [62]:
np.logical_and(brics['area'] > 8, brics['area'] < 10)

BR     True
RU    False
IN    False
CH     True
SA    False
Name: area, dtype: bool

In [63]:
brics[np.logical_and(brics['area'] > 8, brics['area'] < 10)]

|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| CH | China | Beijing | 9.597 | 1357.0 |
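As an alternative to the NumPy functions, pandas Series also support the element-wise operators `&` (and) and `|` (or). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`. The sketch below uses a reduced copy of the data and illustrative variable names.

```python
import numpy as np
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

area = brics["area"]

# Same mask built two ways: NumPy function vs. the & operator.
mask_np = np.logical_and(area > 8, area < 10)
mask_op = (area > 8) & (area < 10)
print(mask_np.equals(mask_op))

# | combines conditions with "or": very small or very large countries.
extremes = brics[(area < 2) | (area > 10)]
print(extremes.index.tolist())
```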


### Looping over Pandas DataFrames

In [64]:
for val in brics:
    print(val)

country
capital
area
population


- A simple for loop returns the column names
- In order to iterate over the rows, we use the **iterrows()** method of the DataFrame object
- In each iteration, it returns the row label and the row data as a Series object.

In [66]:
for lab, row in brics.iterrows():
    print('{}: {}'.format(lab, row['capital']))

BR: Brasilia
RU: Moscow
IN: New Delhi
CH: Beijing
SA: Pretoria
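Another way to loop over rows, not used in the original cells, is `itertuples()`, which yields one named tuple per row with the row label available as `Index`. A minimal sketch on a reduced copy of the data:

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Each row is a named tuple; columns are read as attributes.
lines = [f"{row.Index}: {row.capital}" for row in brics.itertuples()]
for line in lines:
    print(line)
```

Attribute access only works for column names that are valid Python identifiers, so `iterrows()` may still be the simpler choice for unusual column names.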


In [73]:
# Adding a new column - name_length
for lab, row in brics.iterrows():
    brics.loc[lab, 'name_length'] = len(row['country'])
print(brics)    

         country    capital    area  population  name_length
BR        Brazil   Brasilia   8.516      200.40          6.0
RU        Russia     Moscow  17.100      143.50          6.0
IN         India  New Delhi   3.286     1252.00          5.0
CH         China    Beijing   9.597     1357.00          5.0
SA  South Africa   Pretoria   1.221       52.98         12.0


- iterrows() creates a Series object on every iteration. On a huge dataset, this can hurt performance.
- A more efficient way of **creating a new DataFrame column by applying a function to a column element-wise** is to use the **apply()** method on that column.
- The function passed to apply() is applied to each element of the column, producing a Series of results

In [81]:
brics['name_length'] = brics["country"].apply(len)

In [82]:
print(brics)

         country    capital    area  population  name_length
BR        Brazil   Brasilia   8.516      200.40            6
RU        Russia     Moscow  17.100      143.50            6
IN         India  New Delhi   3.286     1252.00            5
CH         China    Beijing   9.597     1357.00            5
SA  South Africa   Pretoria   1.221       52.98           12
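For string columns specifically, pandas also offers the `.str` accessor, so `.str.len()` is a possible alternative to `apply(len)`. The sketch below, on a reduced copy of the data, compares the two. It is not a benchmark, only a check that both give the same lengths.

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Element-wise application of the built-in len function.
lengths_apply = brics["country"].apply(len)

# Vectorised string method from the .str accessor.
lengths_str = brics["country"].str.len()

brics["name_length"] = lengths_str
print(brics)
```

Note that both approaches assign whole integer columns at once, which is why the lengths print as 6 rather than 6.0: in the loop version, the column is created cell by cell, so it starts out with missing values and ends up as floats.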
