# Ch 1, Ep. 6: Pandas
## Intro to DataFrames and Basic Selection

## Creating DataFrames
Pandas loads tabular data into a data structure called a **DataFrame**. Pandas can read and write DataFrames in a variety of formats, which makes it a handy
tool for converting between file formats in data engineering. DataFrames also provide an extensive set of built-in functions that let us transform and combine
them easily.
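
As a minimal sketch of the format-conversion idea, the snippet below reads CSV text into a DataFrame and writes it back out as JSON. The sample data and the names `csv_text`, `df_csv` and `json_text` are made up for illustration; with real files you would typically pass a file path instead of the in-memory `io.StringIO` buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so the example is self-contained.
csv_text = "fruit,count\napples,4\npeaches,1\n"

# Read the CSV text into a DataFrame.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Write the same table out as JSON, one object per row.
json_text = df_csv.to_json(orient="records")
print(json_text)
```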

Let's go through an example together:


In [14]:
import pandas as pd

df = pd.DataFrame({'apples': [4, 2, 4, 5, 1],
                   'peaches': [1, 7, 4, 6, 5],
                   'eggplants': [1, 3, 1, 3, 0]})
print(df)

   apples  peaches  eggplants
0       4        1          1
1       2        7          3
2       4        4          1
3       5        6          3
4       1        5          0


The easiest way to create a DataFrame is to build one from a `dict`: each key becomes a column name, and each value is a `list` holding that column's entries. The first row of the DataFrame is made up of the first element of each list, the second row of the second elements, and so on.
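
The sketch below checks that rule on a smaller, made-up table and shows what happens when the column lists differ in length. The names `small_df`, `first_row` and `raised` are only for illustration.

```python
import pandas as pd

# Two columns with three entries each.
small_df = pd.DataFrame({"apples": [4, 2, 4], "peaches": [1, 7, 4]})

# The first row is built from the first element of each column list.
first_row = [small_df[column][0] for column in small_df.columns]
print(first_row)

# Column lists of different lengths cannot form a rectangular table,
# so pandas refuses to build the DataFrame.
raised = False
try:
    pd.DataFrame({"apples": [4, 2, 4], "peaches": [1, 7]})
except ValueError as err:
    raised = True
    print("ValueError:", err)
```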

### Accessing Values with Brackets

Try accessing values in a DataFrame:

In [10]:
# select a single column
print("The 'apples' column:")
df['apples']
# you can also access a column as an attribute of the DataFrame
# (this works only when the column name is a valid Python identifier)
df.apples


The 'apples' column:


0    4
1    2
2    4
3    5
4    1
Name: apples, dtype: int64

In [11]:
# accessing values within a column 
print("The first row in 'apples':")
df['apples'][0]
df.apples[0]

The first row in 'apples':


4

In [13]:
# access a slice of values
df['apples'][0:4]

0    4
1    2
2    4
3    5
Name: apples, dtype: int64

When you use `[]` to access elements in a DataFrame, think of it as a **two-dimensional array** where the first dimension selects the column and the second dimension selects the row within that column.
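
A short sketch of that ordering, using a made-up table called `df_demo`: the column comes first and the row second, and reversing the order fails because the first bracket looks for a column named `1`.

```python
import pandas as pd

df_demo = pd.DataFrame({"apples": [4, 2, 4, 5, 1],
                        "peaches": [1, 7, 4, 6, 5]})

# Column first, then row: the second entry of the 'peaches' column.
value = df_demo["peaches"][1]
print(value)

# Row first does not work: there is no column labeled 1.
row_first_failed = False
try:
    df_demo[1]
except KeyError:
    row_first_failed = True
print("df_demo[1] raised KeyError:", row_first_failed)
```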

### Creating DataFrame with Index

By default, pandas assigns a `RangeIndex` to the rows, starting at 0 (similar to list positions). This is what we saw in the examples above. However, you can assign your own **row labels** (the **index**) by passing them through the `index` argument:


In [15]:
df = pd.DataFrame({'apples': [4, 2, 4, 5, 1],
                   'peaches': [1, 7, 4, 6, 5],
                   'eggplants': [1, 3, 1, 3, 0]},
                 index=['A', 'B', 'C', 'D', 'E'])
print(df)

   apples  peaches  eggplants
A       4        1          1
B       2        7          3
C       4        4          1
D       5        6          3
E       1        5          0


You can still use brackets to access values:

In [16]:
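# select a row by its label; with brackets this is the unambiguous way
df['apples']['A']
# select by position with .iloc (covered later); a bare integer such as
# df['apples'][0] is deprecated on a labeled index in recent pandas versions
df['apples'].iloc[0]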

# you can also select multiple rows
df['apples'][['A', 'E', 'D']]


A    4
E    1
D    5
Name: apples, dtype: int64
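
Slices also work with labels, with one difference worth knowing: a label slice includes its end label, while a positional slice excludes its end position. The sketch below uses a made-up one-column table named `fruit`, and uses `.iloc` for the positional case to keep the two kinds of slice clearly apart.

```python
import pandas as pd

fruit = pd.DataFrame({"apples": [4, 2, 4, 5, 1]},
                     index=["A", "B", "C", "D", "E"])

# Label slice: both endpoints are included, so this returns A, B and C.
by_label = fruit["apples"]["A":"C"]
print(by_label)

# Positional slice: the end is excluded, so this returns A and B only.
by_position = fruit["apples"].iloc[0:2]
print(by_position)
```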

### Assigning Values

Assigning values is as easy as reading them:

In [5]:
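# assign a single value; .loc[row, column] writes to df itself, whereas the
# chained form df['apples']['A'] = 10 may only modify a temporary copy
df.loc['A', 'apples'] = 10

# add an entire column, then update one of its values
df['oranges'] = 0
df.loc['D', 'oranges'] = 2

# add an entire row; .loc is covered in more detail later
df.loc['F'] = {'apples': 3, 'peaches': 0, 'eggplants': 3, 'oranges': 1}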

print(df)

   apples  peaches  eggplants  oranges
A      10        1          1        0
B       2        7          3        0
C       4        4          1        0
D       5        6          3        2
E       1        5          0        0
F       3        0          3        1
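
Column assignment accepts more than a single scalar. As a small sketch on a made-up table named `basket`, a scalar is repeated for every row, a list supplies one value per row, and an expression on existing columns is computed row by row.

```python
import pandas as pd

basket = pd.DataFrame({"apples": [4, 2, 4], "peaches": [1, 7, 4]},
                      index=["A", "B", "C"])

# A scalar is broadcast to every row of the new column.
basket["oranges"] = 0

# A list must supply exactly one value per row.
basket["pears"] = [3, 0, 2]

# Arithmetic on columns is applied row by row, matched on the row labels.
basket["total"] = basket["apples"] + basket["peaches"]

print(basket)
```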


And there we have the pandas `DataFrame`! Let's take a quick look at the other basic pandas data type, the `Series`.

### Series

A pandas `Series` is a one-dimensional labeled array that can hold any data type. Being one-dimensional, it is similar to a single column of a `DataFrame`. In fact, if we use a single pair of `[ ]` to select a column from a `DataFrame`, we see that the result is a `Series`:

In [7]:
type(df['apples'])

pandas.core.series.Series
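
For contrast, here is a brief sketch (on a made-up table named `df_types`) of what changes when a list of column names is passed instead: the result stays two-dimensional, even if the list holds only one name.

```python
import pandas as pd

df_types = pd.DataFrame({"apples": [4, 2, 4], "peaches": [1, 7, 4]})

# Single brackets with one column name return a Series.
one_column = df_types["apples"]
print(type(one_column))

# A list of column names returns a DataFrame, even for a single name.
one_column_frame = df_types[["apples"]]
print(type(one_column_frame))
```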

We can create a `Series` in the same manner as a `DataFrame`: the class constructor takes the data as its first argument and returns a `Series` object holding that data. If no index is specified, pandas creates the same `RangeIndex` we saw for `DataFrame`. We can also specify an index if desired:

In [10]:
# without an index argument, a default RangeIndex is used
s = pd.Series(range(5, 10))
s
# with explicit index labels
s = pd.Series(range(5, 10), index=['a', 'b', 'c', 'd', 'e'])
s

a    5
b    6
c    7
d    8
e    9
dtype: int64

We can also create a `Series` from a `dict`. In this case, each key is interpreted as an index label and each value becomes the corresponding entry in the `Series`:

In [12]:
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

If we also pass an index along with a `dict`, the returned `Series` contains only the labels listed in that index. Labels that have no corresponding key in the `dict` get `NaN`, and keys that are not listed in the index (here `'b'`) are left out:

In [13]:
pd.Series(d, index = ['c', 'd', 'a'])

c    2.0
d    NaN
a    0.0
dtype: float64
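
Notice that the values above became floats. `NaN` is a floating-point value, so an integer `Series` that gains a missing entry is stored as `float64`. The sketch below shows one way to detect the missing entries and, if a default value makes sense for your data, to fill them in and return to integers. The fill value `0` is only an assumption for this example.

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}
partial = pd.Series(d, index=["c", "d", "a"])

# True where the label had no matching key in the dict.
missing_mask = partial.isna()
print(missing_mask)

# Replace missing entries with a default and convert back to integers.
filled = partial.fillna(0).astype("int64")
print(filled)
```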

We already know how to access a `Series` using `[ ]` because, as seen above, a column of a `DataFrame` is returned as a `Series`. The same logic applies here, including the ability to return slices of a `Series`:

In [14]:
# build a Series from the dict, then look up a value by its label
s = pd.Series({"b": 1, "a": 0, "c": 2})
s['a']

0

In [15]:
s = pd.Series(range(5,10))
s[2:6]

2    7
3    8
4    9
dtype: int64
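
The slice above asks for positions 2 through 5, yet the `Series` only has five entries. As a small sketch of the difference, a slice that runs past the end is simply cut short, while a single lookup of a label that does not exist raises an error.

```python
import pandas as pd

s = pd.Series(range(5, 10))

# The slice stops at the last available entry instead of failing.
tail = s[2:6]
print(len(tail))

# A single lookup has nothing to fall back on: label 7 does not exist.
lookup_failed = False
try:
    s[7]
except KeyError:
    lookup_failed = True
print("s[7] raised KeyError:", lookup_failed)
```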