## Data Structures in Pandas

Pandas has two main data structures:
- DataFrame, which is two-dimensional
- Series, which is one-dimensional
![alt text](../img/01-PandasData.png)

### What is a Pandas DataFrame

- A two-dimensional data structure
- Each row is identified by a row label (together, the row labels form the
  index), which may be numerical or a string
- Each column is identified by a column label, which may be numerical or a string
- The following DataFrame contains 10 rows (0-9) and 5 columns (name, calories,
  protein, vitamins, rating)
  ![alt text](../img/05-PandasDataFrame.png)
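
To make the idea of row and column labels concrete, here is a minimal sketch. The two-row data set and the variable names are made up for illustration, and the row labels are strings rather than the default integers.

```python
import pandas as pd

# Hypothetical data: string row labels are passed through the index argument
df = pd.DataFrame(
    {"calories": [70, 120], "protein": [4, 3]},
    index=["100% Bran", "100% Natural Bran"],
)

row_labels = list(df.index)       # the row labels (the index)
column_labels = list(df.columns)  # the column labels
n_rows, n_cols = df.shape         # number of rows and columns
```

`df.index`, `df.columns`, and `df.shape` are the usual ways to inspect the labels and dimensions of a DataFrame.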

### What is a Pandas Series

- A one-dimensional data structure
- It holds a single sequence of values, each with its own label, such as one
  column (or one row) of a DataFrame
- The following Series contains 10 values (labeled 0-9) and is named calories
![alt text](../img/05-PandasSeries.png)
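
A Series can also be built directly. The sketch below uses the first three calorie values from the picture; the variable name `calories` is only an illustration.

```python
import pandas as pd

# A Series built from a plain list; Pandas assigns the labels 0, 1, 2
calories = pd.Series([70, 120, 70], name="calories")

dimensions = calories.ndim     # a Series always has one dimension
length = calories.shape        # a one-element tuple holding the number of values
second_value = calories.loc[1] # look up a value by its label
```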

### DataFrame vs Series

- A Pandas DataFrame can be thought of as a collection of one or more Series
  that share the same index
- The Series in the previous example was extracted from the DataFrame
![alt text](../img/05-PandasDataframeVsSeries.png)
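
The relationship can be checked in code. This is a small sketch with made-up data: selecting one column with square brackets gives back a Series, and putting the columns back together gives an equal DataFrame.

```python
import pandas as pd

# Hypothetical two-row DataFrame
df = pd.DataFrame({"name": ["100% Bran", "All-Bran"], "calories": [70, 70]})

# Selecting a single column returns a Series that shares the DataFrame's index
calories = df["calories"]
is_series = isinstance(calories, pd.Series)
same_index = calories.index.equals(df.index)

# Rebuilding a DataFrame from its individual Series
rebuilt = pd.DataFrame({"name": df["name"], "calories": df["calories"]})
same_frame = rebuilt.equals(df)
```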

### Creating a DataFrame using Lists

- We can create a DataFrame from a list of lists, where each inner list becomes
  one row
- We pass the list as an argument to `pandas.DataFrame()`, which returns a
  DataFrame
- Pandas automatically assigns numerical row labels to each row of the DataFrame
- If we don't provide column labels, Pandas automatically assigns numerical
  column labels to each column as well

In [1]:
import pandas as pd

myList = [
    ['Apple', 'Red'],
    ['Banana', 'Yellow'],
    ['Orange', 'Orange']
]

myDataFrame = pd.DataFrame(myList)

myDataFrame

|   | 0 | 1 |
|---|---|---|
| 0 | Apple | Red |
| 1 | Banana | Yellow |
| 2 | Orange | Orange |


In [2]:
# With custom column labels
myDataFrame2 = pd.DataFrame(myList, columns=['Fruit', 'Color'])

myDataFrame2

|   | Fruit | Color |
|---|---|---|
| 0 | Apple | Red |
| 1 | Banana | Yellow |
| 2 | Orange | Orange |


Since a NumPy array is similar to a Python list with added functionality, we can
also convert a two-dimensional NumPy array to a Pandas DataFrame using the same
method.

In [3]:
import numpy as np
import pandas as pd

npArr = np.array([
    [0, 1],
    [2, 3],
    [4, 5]
])

myDataFrame3 = pd.DataFrame(npArr, columns=['Even', 'Odd'])

myDataFrame3

|   | Even | Odd |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 3 |
| 2 | 4 | 5 |


### Creating a DataFrame using Dictionary

- We can also pass a dictionary to `pandas.DataFrame()` to create a DataFrame
- Each key of the dictionary should map to a list of one or more values, and all
  the lists should have the same length
- The keys of the dictionary become column labels
- Pandas automatically assigns numerical row labels to each row of the DataFrame

In [5]:
import pandas as pd

myDic = {
    'Fruit': ['Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange']
}

myDf = pd.DataFrame(myDic)

myDf

|   | Fruit | Color |
|---|---|---|
| 0 | Apple | Red |
| 1 | Banana | Yellow |
| 2 | Orange | Orange |
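
One detail worth checking when building a DataFrame from a dictionary is that the lists have matching lengths. The sketch below uses made-up data and variable names; when the lengths differ, Pandas raises a `ValueError`.

```python
import pandas as pd

# Lists of equal length: each key becomes a column
good = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Color": ["Red", "Yellow"]})

# Lists of different lengths: Pandas cannot line the rows up
try:
    pd.DataFrame({"Fruit": ["Apple", "Banana"], "Color": ["Red"]})
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```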


### Loading CSV file as a DataFrame

- We can also load a CSV (comma-separated values) file as a DataFrame in Pandas
  using the `pandas.read_csv()` function
- By default, each value of the first row of the CSV file becomes a column label
- Pandas automatically assigns numerical row labels to each row of the DataFrame

In [20]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)

|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 25 | 68.402973 |
| 1 | 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| 2 | All-Bran | 70 | 4 | 25 | 59.425505 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |
| 7 | Basic 4 | 130 | 3 | 25 | 37.038562 |
| 8 | Bran Chex | 90 | 2 | 25 | 49.120253 |
| 9 | Bran Flakes | 90 | 3 | 25 | 53.313813 |
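
If `cereals.csv` is not at hand, the same behavior can be tried on an in-memory string. This is a sketch only: the CSV text below is a made-up two-row excerpt, and `io.StringIO` stands in for the file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line holds the column labels
csv_text = """name,calories,protein
100% Bran,70,4
100% Natural Bran,120,3
"""

# read_csv() accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

column_labels = list(df.columns)  # taken from the first line of the CSV
row_labels = list(df.index)       # assigned automatically by Pandas
```

Because `read_csv()` also accepts a path, `pd.read_csv("cereals.csv")` should work without opening the file first.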


### Changing the index Column

- We can set one of the existing columns as the new index column of a DataFrame
  using the `.set_index()` method

In [25]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)
    # display(myDf.set_index('name'))
    myDf2 = myDf.set_index('name')
    display(myDf2)

|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 25 | 68.402973 |
| 1 | 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| 2 | All-Bran | 70 | 4 | 25 | 59.425505 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |
| 7 | Basic 4 | 130 | 3 | 25 | 37.038562 |
| 8 | Bran Chex | 90 | 2 | 25 | 49.120253 |
| 9 | Bran Flakes | 90 | 3 | 25 | 53.313813 |


| name | calories | protein | vitamins | rating |
|---|---|---|---|---|
| 100% Bran | 70 | 4 | 25 | 68.402973 |
| 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| All-Bran | 70 | 4 | 25 | 59.425505 |
| All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| Almond Delight | 110 | 2 | 25 | 34.384843 |
| Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| Apple Jacks | 110 | 2 | 25 | 33.174094 |
| Basic 4 | 130 | 3 | 25 | 37.038562 |
| Bran Chex | 90 | 2 | 25 | 49.120253 |
| Bran Flakes | 90 | 3 | 25 | 53.313813 |
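
One reason to change the index is that rows can then be looked up by a meaningful label. The sketch below uses a made-up two-row DataFrame; the `.loc` lookup is an addition for illustration and is not covered in the text above.

```python
import pandas as pd

# Hypothetical excerpt of the cereal data
df = pd.DataFrame({
    "name": ["100% Bran", "All-Bran"],
    "calories": [70, 70],
    "rating": [68.402973, 59.425505],
})

# The 'name' column becomes the index and is no longer a regular column
by_name = df.set_index("name")

index_name = by_name.index.name
remaining_columns = list(by_name.columns)

# Rows can now be selected by cereal name
all_bran_calories = by_name.loc["All-Bran", "calories"]
```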


### Inplace

- Remember that most Pandas methods do not change the original DataFrame; they
  return a new one instead
- In the previous section, we set a new index column for our DataFrame. If we
  display the original DataFrame again, we'll see that it is unchanged

In [29]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)
    myDf2 = myDf.set_index('name')
    display(myDf2)
    display(myDf)

|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 25 | 68.402973 |
| 1 | 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| 2 | All-Bran | 70 | 4 | 25 | 59.425505 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |
| 7 | Basic 4 | 130 | 3 | 25 | 37.038562 |
| 8 | Bran Chex | 90 | 2 | 25 | 49.120253 |
| 9 | Bran Flakes | 90 | 3 | 25 | 53.313813 |


| name | calories | protein | vitamins | rating |
|---|---|---|---|---|
| 100% Bran | 70 | 4 | 25 | 68.402973 |
| 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| All-Bran | 70 | 4 | 25 | 59.425505 |
| All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| Almond Delight | 110 | 2 | 25 | 34.384843 |
| Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| Apple Jacks | 110 | 2 | 25 | 33.174094 |
| Basic 4 | 130 | 3 | 25 | 37.038562 |
| Bran Chex | 90 | 2 | 25 | 49.120253 |
| Bran Flakes | 90 | 3 | 25 | 53.313813 |


|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 25 | 68.402973 |
| 1 | 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| 2 | All-Bran | 70 | 4 | 25 | 59.425505 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |
| 7 | Basic 4 | 130 | 3 | 25 | 37.038562 |
| 8 | Bran Chex | 90 | 2 | 25 | 49.120253 |
| 9 | Bran Flakes | 90 | 3 | 25 | 53.313813 |


In [28]:
# using inplace to actually change the DataFrame
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)
    myDf.set_index('name', inplace=True) # Changing the actual df
    display(myDf)

|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 25 | 68.402973 |
| 1 | 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| 2 | All-Bran | 70 | 4 | 25 | 59.425505 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |
| 7 | Basic 4 | 130 | 3 | 25 | 37.038562 |
| 8 | Bran Chex | 90 | 2 | 25 | 49.120253 |
| 9 | Bran Flakes | 90 | 3 | 25 | 53.313813 |


| name | calories | protein | vitamins | rating |
|---|---|---|---|---|
| 100% Bran | 70 | 4 | 25 | 68.402973 |
| 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| All-Bran | 70 | 4 | 25 | 59.425505 |
| All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| Almond Delight | 110 | 2 | 25 | 34.384843 |
| Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| Apple Jacks | 110 | 2 | 25 | 33.174094 |
| Basic 4 | 130 | 3 | 25 | 37.038562 |
| Bran Chex | 90 | 2 | 25 | 49.120253 |
| Bran Flakes | 90 | 3 | 25 | 53.313813 |
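
The three ways of calling `set_index()` can be compared side by side. The sketch below uses a made-up DataFrame and hypothetical variable names. It relies on `set_index(..., inplace=True)` returning `None`, which matches the Pandas versions I am aware of but may be worth checking against the installed version.

```python
import pandas as pd

def make_df():
    # Hypothetical two-row excerpt of the cereal data
    return pd.DataFrame({"name": ["100% Bran", "All-Bran"], "calories": [70, 70]})

# 1. Without inplace: a new DataFrame is returned, the original is untouched
original = make_df()
indexed = original.set_index("name")
original_still_has_name = "name" in original.columns

# 2. Reassignment: keep the returned DataFrame under the same variable name
reassigned = make_df()
reassigned = reassigned.set_index("name")

# 3. inplace=True: the DataFrame itself is modified and nothing is returned
modified = make_df()
result = modified.set_index("name", inplace=True)
```

Reassignment (option 2) gives the same end result as `inplace=True` and is often easier to chain with other method calls.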


### Examining the data

#### head()

- The `head()` method returns the **first** 5 rows of the DataFrame/Series by
  default
- To get a different number of rows, we can pass the desired number as an
  argument to `head()`

In [31]:
myDf.head(7)

|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 25 | 68.402973 |
| 1 | 100% Natural Bran | 120 | 3 | 0 | 33.983679 |
| 2 | All-Bran | 70 | 4 | 25 | 59.425505 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |


#### tail()

- The `tail()` method returns the **last** 5 rows of the DataFrame/Series by
  default
- To get a different number of rows, we can pass the desired number as an
  argument to `tail()`

In [32]:
myDf.tail(7)

|   | name | calories | protein | vitamins | rating |
|---|---|---|---|---|---|
| 3 | All-Bran with Extra Fiber | 50 | 4 | 25 | 93.704912 |
| 4 | Almond Delight | 110 | 2 | 25 | 34.384843 |
| 5 | Apple Cinnamon Cheerios | 110 | 2 | 25 | 29.509541 |
| 6 | Apple Jacks | 110 | 2 | 25 | 33.174094 |
| 7 | Basic 4 | 130 | 3 | 25 | 37.038562 |
| 8 | Bran Chex | 90 | 2 | 25 | 49.120253 |
| 9 | Bran Flakes | 90 | 3 | 25 | 53.313813 |
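
Both `head()` and `tail()` can be tried without the CSV file. The sketch below builds a one-column DataFrame from the ten calorie values shown earlier; the variable names are only illustrative.

```python
import pandas as pd

# The ten calorie values from the cereal data
df = pd.DataFrame({"calories": [70, 120, 70, 50, 110, 110, 110, 130, 90, 90]})

default_rows = len(df.head())            # head() with no argument
first_labels = list(df.head(3).index)    # row labels of the first 3 rows
last_labels = list(df.tail(3).index)     # row labels of the last 3 rows

# The same methods work on a Series
last_two_values = list(df["calories"].tail(2))
```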
