# Pandas
As a Data Scientist, the data we work with can vary significantly in quantity and structure. It is therefore common practice, as part of the __Data Exploration__ phase, to lay the data out in a tabular structure. The resulting "table" contains the different variables (__columns__) and every observation of those variables (__rows__). To start working with this tabular structure in Python, we could use a two-dimensional numpy array or a dictionary.

The problem with the numpy array is that all of its elements must share a single data type and, as mentioned at the outset, large datasets rarely contain only one type of data. A dictionary can hold different data types, but its structure can be difficult to view and manipulate programmatically.
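
To make the numpy limitation concrete, here is a minimal sketch. The two sample rows are borrowed from the BRICS data used later in this section, and the variable name `mixed` is only illustrative. When text and numbers are mixed in one array, numpy stores every element as a string, so the numeric values lose their numeric type:

```python
import numpy as np

# Mixing text and numbers: numpy picks one common type for every element
mixed = np.array([["Brazil", 8.516],
                  ["Russia", 17.10]])

print(mixed.dtype.kind)    # "U" means a unicode string type
print(type(mixed[0, 1]))   # the area is now a numpy string, not a float
```

Arithmetic on the `area` values would therefore require converting them back to numbers first.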

Therefore, a better option is to use the __Pandas__ package. Pandas is a high-level data manipulation tool built on top of the numpy package. Compared to numpy, it makes viewing and manipulating data far easier because tabular data is stored in a DataFrame. The DataFrame is one of Pandas' most important data structures. It is a way to store tabular data in which we can label both the rows and the columns. One way to build a DataFrame is from a Python dictionary. The following example shows how to manually create a [__BRICS__](https://en.wikipedia.org/wiki/BRICS) DataFrame from a Python dictionary.

In [1]:
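# Import Pandas
import pandas as pd

# Create the brics dictionary
# (named `brics_data` so that it does not shadow the built-in `dict`)
brics_data = {
    "country":["Brazil", "Russia", "India",
               "China", "South Africa"],
    "capital":["Brasilia", "Moscow", "New Delhi",
               "Beijing", "Pretoria"],
    "area":[8.516, 17.10, 3.286, 9.597, 1.221],
    "population":[200.4, 143.5, 1252, 1357, 52.98]
}

# Create the brics dataframe
# The column order is given explicitly so that it matches the output below;
# recent Pandas versions otherwise keep the dictionary's insertion order.
brics = pd.DataFrame(brics_data,
                     columns=["area", "capital", "country", "population"])
brics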

| | area | capital | country | population |
|---|---|---|---|---|
| 0 | 8.516 | Brasilia | Brazil | 200.4 |
| 1 | 17.1 | Moscow | Russia | 143.5 |
| 2 | 3.286 | New Delhi | India | 1252.0 |
| 3 | 9.597 | Beijing | China | 1357.0 |
| 4 | 1.221 | Pretoria | South Africa | 52.98 |


The code above highlights some of the complexity already mentioned. When creating the dictionary, we are creating __`<key:value>`__ pairs, without any tabular structure for future manipulations. We manually structure the data so that the __keys__ are the column labels and the __values__ are the corresponding column data in list form, with each list having a specific data type. Although this seems fairly straightforward, it can be very difficult to do at scale.

By creating the Pandas DataFrame from the dictionary, we can clearly see the tabular structure, which makes the data easier to understand and manipulate. For easier indexing, Pandas automatically assigns the row labels $0$ to $4$, but these can be changed by assigning a list of row labels to the index:

In [2]:
# Create index list
brics.index = ["BR", "RU", "IN", "CH", "SA"]
brics

| | area | capital | country | population |
|---|---|---|---|---|
| BR | 8.516 | Brasilia | Brazil | 200.4 |
| RU | 17.1 | Moscow | Russia | 143.5 |
| IN | 3.286 | New Delhi | India | 1252.0 |
| CH | 9.597 | Beijing | China | 1357.0 |
| SA | 1.221 | Pretoria | South Africa | 52.98 |
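
As an alternative to assigning the labels afterwards, the row labels can be supplied when the DataFrame is built, using the `index` argument of `pd.DataFrame`. The following is a minimal sketch; the names `brics_data` and `brics_labeled` are chosen only for this illustration:

```python
import pandas as pd

brics_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

# Pass the row labels directly through the `index` argument
brics_labeled = pd.DataFrame(brics_data,
                             index=["BR", "RU", "IN", "CH", "SA"])

print(brics_labeled.index.tolist())
```

Both approaches give the same row labels; which one to use is mostly a matter of when the labels become known.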


Although we've highlighted the __manual__ creation of a Pandas DataFrame from a Python dictionary, most data is typically available as files with a regular structure. One of those file types is the __CSV file__, which is short for "comma-separated values". Pandas includes a function, `read_csv()`, that allows us to import such data. An example of the syntax is as follows:
```
brics = pd.read_csv("<path to data.csv>", index_col = <column number as index>)
```
The `read_csv()` function has many more arguments that allow for further customization of the data being imported, but the key point is to make sure that the rows and columns have appropriate labels. This ensures that accessing columns, rows and single elements is an easy task.
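
The sketch below shows `read_csv()` with `index_col` in action. Because no data file accompanies this section, the CSV content is held in a string and wrapped in `io.StringIO`; the column name `code` and the two sample rows are assumptions made for the illustration:

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical content)
csv_text = """code,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.10,143.5
"""

# index_col=0 tells Pandas to use the first column as the row labels
brics_from_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(brics_from_csv)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the file.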

There are numerous ways in which we can index and select data from DataFrames. The __first__ method is to use square brackets. In the following example, we use square brackets to select an individual column by referencing the column label.

In [3]:
# Select `country` column
brics["country"]

BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object

Pandas prints out the entire column, together with the row labels. Note, however, that the object returned is a Pandas Series. A Series is a labeled one-dimensional array; in essence, several Series put together make up a DataFrame. So if we want to select the `country` column but keep the result as a DataFrame, we use double square brackets:

In [4]:
# Output as a DataFrame
brics[["country"]]

| | country |
|---|---|
| BR | Brazil |
| RU | Russia |
| IN | India |
| CH | China |
| SA | South Africa |
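
The difference between the two calls can be checked directly. The following minimal sketch uses a reduced, two-row copy of the BRICS data, and the names `as_series` and `as_frame` are only illustrative:

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

as_series = brics["country"]    # single brackets: a Series
as_frame = brics[["country"]]   # double brackets: a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has one dimension, while the DataFrame keeps two dimensions even though it holds a single column.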


We can extend this call to select two columns. In essence, we place a list of column labels inside another set of square brackets, which produces a sub-DataFrame.

In [5]:
# Select two columns
brics[["country", "capital"]]

| | country | capital |
|---|---|---|
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


Square brackets can also be used to select rows from a DataFrame by specifying a slice. __Remember__ that positions start at $0$ and that the end of the slice is not included. For example:

In [6]:
# Rows at positions 1 to 3 (the end of the slice, 4, is excluded)
brics[1:4]

| | area | capital | country | population |
|---|---|---|---|---|
| RU | 17.1 | Moscow | Russia | 143.5 |
| IN | 3.286 | New Delhi | India | 1252.0 |
| CH | 9.597 | Beijing | China | 1357.0 |
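
One subtlety worth sketching: an integer slice inside plain square brackets selects rows by position, but a single integer is treated as a column label. The example below uses a reduced copy of the BRICS data, and the variable names are only illustrative:

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
     "area": [8.516, 17.10, 3.286, 9.597, 1.221]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# A slice selects rows by position, just like position-based selection
by_brackets = brics[1:4]
by_position = brics.iloc[1:4]
same_rows = by_brackets.equals(by_position)

# A single integer is looked up as a column label, not a row position
try:
    brics[1]
    single_int_works = True
except KeyError:
    single_int_works = False

print(same_rows, single_int_works)
```

Since there is no column labeled `1`, the single-integer lookup raises a `KeyError` rather than returning the second row.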


Selecting columns by label and rows by slice is about the extent of what plain square brackets offer. The indexing of two-dimensional numpy arrays is far more flexible: it also uses square brackets, but the index (or slice) before the comma refers to the rows and the index (or slice) after the comma refers to the columns:
```
my_array[<rows>, <columns>]
```
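
As a small illustration of this syntax (the array values are made up for the example):

```python
import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

single = my_array[0, 1]        # row 0, column 1
col_block = my_array[:, 1:]    # all rows, columns 1 onward
row_block = my_array[0:2, :]   # rows 0 and 1, all columns

print(single)
print(col_block)
print(row_block)
```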
Pandas offers similar functionality through `loc` and `iloc`. Although they are often referred to as functions, both are indexers that are used with square brackets rather than parentheses. `loc` selects parts of the DataFrame based on __labels__, while `iloc` is __position-based__.

The following example demonstrates row access with `loc`, by putting the label of a particular row inside the square brackets:

In [7]:
# Select the row for Russia
brics.loc["RU"]

area            17.1
capital       Moscow
country       Russia
population     143.5
Name: RU, dtype: object

Once again, the output is a Pandas Series, with the row data shown on separate lines (as a vector). To make the output conform to the original tabular layout, we pass a list of labels so that a DataFrame is returned:

In [8]:
# Select Russia row as a DataFrame
brics.loc[["RU"]]

| | area | capital | country | population |
|---|---|---|---|---|
| RU | 17.1 | Moscow | Russia | 143.5 |


As shown, the entire row is returned, which plain square brackets can also do with a slice. The key difference is that `loc` allows us to extend the selection with a comma and a list of the columns we wish to select. For example, we can extend the previous example to include only the `capital` and `country` columns when we select the row, thus returning the intersection of the rows and columns:

In [9]:
# Only include capital and country
brics.loc[["RU"], ["capital", "country"]]

| | capital | country |
|---|---|---|
| RU | Moscow | Russia |


`loc` can also be used to select all the rows, but only specific columns. We do this by replacing the row labels with a ":", a slice going from start to finish. For example:

In [10]:
# Select all rows; specific columns
brics.loc[:, ["country", "capital"]]

| | country | capital |
|---|---|---|
| BR | Brazil | Brasilia |
| RU | Russia | Moscow |
| IN | India | New Delhi |
| CH | China | Beijing |
| SA | South Africa | Pretoria |


This time the intersection spans all the rows, but only the two columns specified. As can be seen from these examples, with `loc` subsetting becomes very similar to using two-dimensional numpy arrays. The only difference is that we use labels, not the positions of the elements.
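
Two further details of label-based selection are sketched below on a reduced copy of the BRICS data (the variable names are only illustrative). Passing a single row label and a single column label returns one element, and a slice of labels includes its end label, unlike a position-based slice:

```python
import pandas as pd

brics = pd.DataFrame(
    {"capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
     "country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# One row label and one column label: a single element
capital_of_russia = brics.loc["RU", "capital"]

# A label slice includes both "RU" and "CH"
middle_rows = brics.loc["RU":"CH", ["country", "capital"]]

print(capital_of_russia)
print(middle_rows)
```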

If we wish to subset a Pandas DataFrame based on element position, we have to use `iloc`. Going back to the previous example of selecting the row for Russia: with `loc` we specified the row label "RU", whereas with `iloc` we specify the position:

In [11]:
# Select the row for Russia: index 1
brics.iloc[[1]]

| | area | capital | country | population |
|---|---|---|---|---|
| RU | 17.1 | Moscow | Russia | 143.5 |


The result is exactly the same as with `loc`. The same applies when selecting the row for Russia, but only the `capital` and `country` columns:

In [12]:
# Only include capital and country with index
brics.iloc[[1], [1, 2]]

| | capital | country |
|---|---|---|
| RU | Moscow | Russia |


As with `loc`, we can select all the rows, but only specific columns:

In [13]:
# Select all rows; specific columns
brics.iloc[:, [1, 2]]

| | capital | country |
|---|---|---|
| BR | Brasilia | Brazil |
| RU | Moscow | Russia |
| IN | New Delhi | India |
| CH | Beijing | China |
| SA | Pretoria | South Africa |


To summarize,
- `loc` works on the labels in the index.
- `iloc` works on the positions in the index (so it only takes integers).
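
The relationship between the two can be sketched as follows. The DataFrame is rebuilt with the same column order as the outputs above, and `get_loc` is used to translate a label into its position; the variable names are only illustrative:

```python
import pandas as pd

brics = pd.DataFrame(
    {"area": [8.516, 17.10, 3.286, 9.597, 1.221],
     "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
     "country": ["Brazil", "Russia", "India", "China", "South Africa"],
     "population": [200.4, 143.5, 1252, 1357, 52.98]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# The same selection, once by label and once by position
by_label = brics.loc[["RU"], ["capital", "country"]]
by_position = brics.iloc[[1], [1, 2]]
selections_match = by_label.equals(by_position)

# Translate labels into the positions that iloc expects
row_pos = brics.index.get_loc("RU")
col_pos = brics.columns.get_loc("capital")

print(selections_match, row_pos, col_pos)
```

Position-based selection depends on the current order of rows and columns, so label-based selection is usually the more robust choice when the labels are known.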

As a __side note__, older versions of Pandas could also combine label-based selection the `loc` way and position-based selection the `iloc` way with the `ix` indexer. It was deprecated in Pandas 0.20 and removed in Pandas 1.0, so it is not covered further here.

So how can we apply all of the above information to Data Science? The next example performs subsetting and filtering on the Pandas DataFrame. The goal is to use the __BRICS__ data and select only the countries with an `area` over $8$ million $km^{2}$. This will be done in $3$ steps:

1. Select the `area` column from BRICS.
2. Perform a comparison on the output and store the results.
3. Use the result to select the matching countries from the DataFrame.

In [14]:
# Step 1. Get the `area` column from BRICS
step_one = brics.loc[:, "area"]

#Alternatives:
#brics["area"]
#brics.iloc[:, 0]

# Display the result
step_one

BR     8.516
RU    17.100
IN     3.286
CH     9.597
SA     1.221
Name: area, dtype: float64

Although there are a number of ways to do this, and the example above uses only `loc`, the most important aspect is that the result is a Pandas Series, __not__ a DataFrame.

In [15]:
# Step 2. Perform the comparison
step_two = brics.loc[:, "area"] > 8

#Alternatives:
#brics["area"] > 8
#brics.iloc[:, 0] > 8

# Display the result
step_two

BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool

In [16]:
# Step 3. Subset the DataFrame
step_three = brics[step_two]["country"]

# Display the result
step_three

BR    Brazil
RU    Russia
CH     China
Name: country, dtype: object

In [17]:
# Combining into single step
brics[brics.loc[:, "area"] > 8]["country"]

BR    Brazil
RU    Russia
CH     China
Name: country, dtype: object

In [18]:
# Combining numpy and Pandas 
import numpy as np
brics[np.logical_and(brics.loc[:, "area"] > 8,
                     brics.loc[:, "area"] < 10)]["country"]

BR    Brazil
CH     China
Name: country, dtype: object
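
The same filter can also be sketched without numpy. This is an alternative formulation, not the approach used above: two boolean Series can be combined element by element with the `&` operator (each comparison needs its own parentheses), and `loc` can take the resulting mask together with a column label in a single selection. The names `mask` and `selected` are only illustrative:

```python
import pandas as pd

brics = pd.DataFrame(
    {"area": [8.516, 17.10, 3.286, 9.597, 1.221],
     "country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Combine the two comparisons element by element
mask = (brics["area"] > 8) & (brics["area"] < 10)

# Select the matching rows and the `country` column in one step
selected = brics.loc[mask, "country"]

print(selected)
```

Selecting rows and column in one `loc` call avoids the two chained bracket lookups used in the earlier cells.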