# Pandas
## Introduction
As a data scientist, you will find that the quantity and structure of the data you work with can vary significantly. It is therefore common practice, during the __Data Exploration__ phase, to lay the data out in a tabular structure. The resulting table contains the different variables (__columns__) and every observation of those variables (__rows__). To start working with this tabular structure in Python, we could use a two-dimensional NumPy array or a dictionary.

The problem with the NumPy array is that all of its elements share a single data type and, as mentioned at the outset, large datasets rarely consist of one type of data. A dictionary can hold different data types, but its structure can be difficult to view and manipulate programmatically.
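Both limitations can be seen in a minimal sketch. The two-country sample and the names `mixed`, `record`, and `first_row` are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Mixing text and numbers in one array: NumPy settles on a single common
# dtype, so the numeric values end up stored as strings.
mixed = np.array([["Brazil", 8.516], ["Russia", 17.10]])
print(mixed.dtype)        # a Unicode string dtype
print(type(mixed[0, 1]))  # a NumPy string type, not a float

# A dictionary keeps each type intact, but reading one "row" means
# indexing every list by hand.
record = {"country": ["Brazil", "Russia"], "area": [8.516, 17.10]}
first_row = {key: values[0] for key, values in record.items()}
print(first_row)
```

The array silently loses the numeric type, while the dictionary preserves it at the cost of manual bookkeeping for every row.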

Therefore, a better option is to use the __Pandas__ package. Pandas is a high-level data manipulation tool built on top of the NumPy package. Compared with NumPy, it works at a higher level of abstraction, which makes it far easier to view and manipulate data, because tabular data is stored in a DataFrame. The DataFrame is one of Pandas' most important data structures: it stores tabular data whose rows and columns can both be labeled. One way to build a DataFrame is from a Python dictionary. The following example shows how to manually create a [__BRICS__](https://en.wikipedia.org/wiki/BRICS) DataFrame from a Python dictionary.

In [1]:
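# Import Pandas
import pandas as pd

# Create the brics dictionary: keys are column labels, values are column data.
# The name brics_dict avoids shadowing Python's built-in dict type.
brics_dict = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],       # millions of square kilometres
    "population": [200.4, 143.5, 1252, 1357, 52.98]   # millions of people
}

# Create the brics dataframe
brics = pd.DataFrame(brics_dict)
brics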

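|   | area | capital | country | population |
|---|------|---------|---------|------------|
| 0 | 8.516 | Brasilia | Brazil | 200.4 |
| 1 | 17.1 | Moscow | Russia | 143.5 |
| 2 | 3.286 | New Delhi | India | 1252.0 |
| 3 | 9.597 | Beijing | China | 1357.0 |
| 4 | 1.221 | Pretoria | South Africa | 52.98 |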


The code above highlights some of the complexity already mentioned. A dictionary is just a set of __`<key:value>`__ pairs, with no tabular structure to support later manipulation. We structure the data manually so that the __keys__ are the column labels and the __values__ are the corresponding column data in list form, with each list holding a single data type. Although this seems fairly straightforward, it can be very difficult to do at scale.

Once the Pandas DataFrame is created from the dictionary, the tabular structure is clearly visible, which makes the data easier to understand and manipulate. For easier indexing, Pandas automatically assigns the row labels 0 through 4, but these can be changed by assigning a list of row labels to the DataFrame's `index` attribute:

In [2]:
# Replace the default row labels with a list of country codes
brics.index = ["BR", "RU", "IN", "CH", "SA"]
brics

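|    | area | capital | country | population |
|----|------|---------|---------|------------|
| BR | 8.516 | Brasilia | Brazil | 200.4 |
| RU | 17.1 | Moscow | Russia | 143.5 |
| IN | 3.286 | New Delhi | India | 1252.0 |
| CH | 9.597 | Beijing | China | 1357.0 |
| SA | 1.221 | Pretoria | South Africa | 52.98 |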
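As an alternative to reassigning the labels afterwards, the row labels can be supplied when the DataFrame is built, through the `index` argument of `pd.DataFrame`. The sketch below uses a two-country subset for brevity; the names `data` and `small` are illustrative assumptions.

```python
import pandas as pd

# Illustrative two-country subset of the BRICS data.
data = {
    "country": ["Brazil", "Russia"],
    "capital": ["Brasilia", "Moscow"],
}

# Supply the row labels at construction time via the `index` argument.
small = pd.DataFrame(data, index=["BR", "RU"])

print(small.index.tolist())
print(small)
```

Either approach yields the same labeled rows; setting the labels at construction simply avoids a second statement.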


Although we've highlighted the __manual__ creation of a Pandas DataFrame from a Python dictionary, most data is available as files with a regular structure. One such file type is the __CSV file__, short for "comma-separated values". Pandas includes a function, `read_csv()`, that allows us to import this data. The syntax is as follows:
```
brics = pd.read_csv("<path to data.csv>", index_col = <column number as index>)
```
The `read_csv()` function has many more arguments that allow for further customization of the data being imported.
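As a minimal, self-contained sketch of the syntax above, the example below reads CSV text from an in-memory buffer instead of a file on disk. The CSV contents, the `code` column name, and the variable names are assumptions made for illustration; with a real file, the path string would replace the `io.StringIO` object.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the first column holds the row labels.
csv_text = """code,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

# index_col=0 tells pandas to use the first column as the row index.
brics_small = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(brics_small)
```

Without `index_col=0`, pandas would keep `code` as an ordinary column and assign the default integer row labels instead.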

## Data Manipulation with Pandas DataFrames
There are numerous ways in which we can index and select data from DataFrames:
- `[]`
- `loc`
- `iloc`
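
The three options listed above can be sketched as follows. The small `demo` DataFrame is an illustrative assumption that mirrors the BRICS example.

```python
import pandas as pd

# Illustrative three-country DataFrame with labeled rows.
demo = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India"],
        "capital": ["Brasilia", "Moscow", "New Delhi"],
        "area": [8.516, 17.10, 3.286],
    },
    index=["BR", "RU", "IN"],
)

# [] with a single column label returns that column as a Series.
countries = demo["country"]

# [] with a list of column labels returns a DataFrame with those columns.
subset = demo[["country", "capital"]]

# loc selects by label: row "RU", column "capital".
ru_capital = demo.loc["RU", "capital"]

# iloc selects by integer position: second row, second column.
same_value = demo.iloc[1, 1]

print(ru_capital, same_value)
```

Here `loc` and `iloc` reach the same cell: one by its row and column labels, the other by its zero-based position.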
