## Data Structures in Pandas

Pandas has two main data structures:
- DataFrame, which is two dimensional
- Series, which is one dimensional
![alt text](../img/01-PandasData.png)

### What is Pandas DataFrame

- A two-dimensional data structure
- Each row is identified by a row label, also called the index, which may be
  numeric or a string
- Each column is identified by a column label, which may be numeric or a string
- The following DataFrame contains 10 rows (0-9) and 5 columns (name, calories,
  protein, vitamins, rating)
  ![alt text](../img/05-PandasDataFrame.png)

### What is Pandas Series

- A one-dimensional data structure
- It holds the values of a single row or column, each with its own label
- The following Series contains 10 rows (0-9) and 1 column called calories
![alt text](../img/05-PandasSeries.png)

### DataFrame vs Series

- A Pandas DataFrame is a collection of one or more Series that share the same index
- The Series in the previous example was extracted from the DataFrame
![alt text](../img/05-PandasDataframeVsSeries.png)
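
The relationship between the two structures can be checked directly. The following is a minimal sketch that rebuilds a three-row sample of the cereal table by hand (the variable names `cereals` and `calories` are chosen here for illustration) and pulls one column out of it.

```python
import pandas as pd

# Small hand-built sample of the cereal table shown above
cereals = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran'],
    'calories': [70, 120, 70],
})

calories = cereals['calories']   # selecting one column yields a Series

print(type(cereals))    # the DataFrame class
print(type(calories))   # the Series class
print(calories.index.equals(cereals.index))  # the Series keeps the row labels
```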

### Creating a DataFrame using Lists

- We can create a DataFrame from a list of lists, where each inner list is one row
- We pass the list as an argument to the `pandas.DataFrame()` constructor, which
  returns a DataFrame
- Pandas automatically assigns numerical row labels to each row of the DataFrame
- If we don't provide column labels, Pandas automatically assigns numerical
  column labels to each column as well

In [121]:
import pandas as pd

myList = [
    ['Apple', 'Red'],
    ['Banana', 'Yellow'],
    ['Orange', 'Orange']
]

myDataFrame = pd.DataFrame(myList)

myDataFrame

Unnamed: 0,0,1
0,Apple,Red
1,Banana,Yellow
2,Orange,Orange


In [122]:
# With custom column labels
myDataFrame2 = pd.DataFrame(myList, columns=['Fruit', 'Color'])

myDataFrame2

Unnamed: 0,Fruit,Color
0,Apple,Red
1,Banana,Yellow
2,Orange,Orange


Since a NumPy array is similar to a Python list with added functionality, we
can also convert a NumPy array to a Pandas DataFrame using the same method.

In [124]:
import numpy as np
import pandas as pd

npArr = np.array([
    [0, 1],
    [2, 3],
    [4, 5]
])

myDataFrame3 = pd.DataFrame(npArr, columns=['Even', 'Odd'])

myDataFrame3

Unnamed: 0,Even,Odd
0,0,1
1,2,3
2,4,5


### Creating a DataFrame using Dictionary

- We can also pass a dictionary to the `pandas.DataFrame()` function to create a DataFrame
- Each key of the dictionary should have a list of one or more values associated
  with it
- The keys of the dictionary become column labels
- Pandas automatically assigns numerical row labels to each row of the DataFrame

In [125]:
import pandas as pd

myDic = {
    'Fruit': ['Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange']
}

myDf = pd.DataFrame(myDic)

myDf


Unnamed: 0,Fruit,Color
0,Apple,Red
1,Banana,Yellow
2,Orange,Orange
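
The numerical row labels are only a default. As a small sketch, the `index` parameter of `pandas.DataFrame()` can supply custom row labels; the labels `'a'`, `'b'`, `'c'` and the names `fruit_data` and `labeled_df` are made up for this illustration.

```python
import pandas as pd

fruit_data = {
    'Fruit': ['Apple', 'Banana', 'Orange'],
    'Color': ['Red', 'Yellow', 'Orange'],
}

# `index` supplies one label per row instead of the default 0, 1, 2
labeled_df = pd.DataFrame(fruit_data, index=['a', 'b', 'c'])

print(labeled_df)
print(list(labeled_df.columns))  # the keys of the dictionary
print(list(labeled_df.index))    # the custom row labels
```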


### Loading CSV file as a DataFrame

- We can also load a CSV (comma separated values) file as a DataFrame in Pandas
  using the `pandas.read_csv()` function
- Each value of the first row of the CSV file becomes a column label
- Pandas automatically assigns numerical row labels to each row of the DataFrame

In [126]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
5,Apple Cinnamon Cheerios,110,2,25,29.509541
6,Apple Jacks,110,2,25,33.174094
7,Basic 4,130,3,25,37.038562
8,Bran Chex,90,2,25,49.120253
9,Bran Flakes,90,3,25,53.313813


### Changing the index Column

- We can set one of the existing columns as the new index of a DataFrame
  using the `.set_index()` method

In [127]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)
    # display(myDf.set_index('name'))
    myDf2 = myDf.set_index('name')
    display(myDf2)

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
5,Apple Cinnamon Cheerios,110,2,25,29.509541
6,Apple Jacks,110,2,25,33.174094
7,Basic 4,130,3,25,37.038562
8,Bran Chex,90,2,25,49.120253
9,Bran Flakes,90,3,25,53.313813


Unnamed: 0_level_0,calories,protein,vitamins,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100% Bran,70,4,25,68.402973
100% Natural Bran,120,3,0,33.983679
All-Bran,70,4,25,59.425505
All-Bran with Extra Fiber,50,4,25,93.704912
Almond Delight,110,2,25,34.384843
Apple Cinnamon Cheerios,110,2,25,29.509541
Apple Jacks,110,2,25,33.174094
Basic 4,130,3,25,37.038562
Bran Chex,90,2,25,49.120253
Bran Flakes,90,3,25,53.313813


### Inplace

- Remember that most Pandas methods return a new DataFrame and do not change the
  original one
- In the previous section, we changed the index column of our DataFrame. If we
  display our DataFrame again, we'll see that the original DataFrame is unchanged

In [128]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)
    myDf2 = myDf.set_index('name')
    display(myDf2)
    display(myDf)

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
5,Apple Cinnamon Cheerios,110,2,25,29.509541
6,Apple Jacks,110,2,25,33.174094
7,Basic 4,130,3,25,37.038562
8,Bran Chex,90,2,25,49.120253
9,Bran Flakes,90,3,25,53.313813


Unnamed: 0_level_0,calories,protein,vitamins,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100% Bran,70,4,25,68.402973
100% Natural Bran,120,3,0,33.983679
All-Bran,70,4,25,59.425505
All-Bran with Extra Fiber,50,4,25,93.704912
Almond Delight,110,2,25,34.384843
Apple Cinnamon Cheerios,110,2,25,29.509541
Apple Jacks,110,2,25,33.174094
Basic 4,130,3,25,37.038562
Bran Chex,90,2,25,49.120253
Bran Flakes,90,3,25,53.313813


Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
5,Apple Cinnamon Cheerios,110,2,25,29.509541
6,Apple Jacks,110,2,25,33.174094
7,Basic 4,130,3,25,37.038562
8,Bran Chex,90,2,25,49.120253
9,Bran Flakes,90,3,25,53.313813


In [129]:
# using inplace to actually change the DataFrame
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)
    display(myDf)
    myDf.set_index('name', inplace=True) # Changing the actual df
    display(myDf)

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
5,Apple Cinnamon Cheerios,110,2,25,29.509541
6,Apple Jacks,110,2,25,33.174094
7,Basic 4,130,3,25,37.038562
8,Bran Chex,90,2,25,49.120253
9,Bran Flakes,90,3,25,53.313813


Unnamed: 0_level_0,calories,protein,vitamins,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100% Bran,70,4,25,68.402973
100% Natural Bran,120,3,0,33.983679
All-Bran,70,4,25,59.425505
All-Bran with Extra Fiber,50,4,25,93.704912
Almond Delight,110,2,25,34.384843
Apple Cinnamon Cheerios,110,2,25,29.509541
Apple Jacks,110,2,25,33.174094
Basic 4,130,3,25,37.038562
Bran Chex,90,2,25,49.120253
Bran Flakes,90,3,25,53.313813
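
An alternative to `inplace=True` is to assign the returned DataFrame back to the same variable. The sketch below uses a small hand-built sample instead of `cereals.csv`, so it runs without the file; the variable names are illustrative.

```python
import pandas as pd

cereals = pd.DataFrame({
    'name': ['100% Bran', 'All-Bran', 'Almond Delight'],
    'calories': [70, 70, 110],
})

# Without inplace=True, set_index() returns a new DataFrame
indexed = cereals.set_index('name')
print(list(cereals.columns))   # 'name' is still a regular column here
print(list(indexed.columns))   # 'name' has moved to the index

# Reassigning the result has the same visible effect as inplace=True
cereals = cereals.set_index('name')
print(cereals.index.name)
```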


### Examining the data

#### head()

- The `head()` method returns the **first** 5 rows of the DataFrame/Series by default
- To get a different number of rows, we can pass the desired number as an
  argument to `head()`

In [130]:
myDf.head(7)

Unnamed: 0_level_0,calories,protein,vitamins,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100% Bran,70,4,25,68.402973
100% Natural Bran,120,3,0,33.983679
All-Bran,70,4,25,59.425505
All-Bran with Extra Fiber,50,4,25,93.704912
Almond Delight,110,2,25,34.384843
Apple Cinnamon Cheerios,110,2,25,29.509541
Apple Jacks,110,2,25,33.174094


#### tail()

- The `tail()` method returns the **last** 5 rows of the DataFrame/Series by default
- To get a different number of rows, we can pass the desired number as an
  argument to `tail()`

In [131]:
myDf.tail(7)

Unnamed: 0_level_0,calories,protein,vitamins,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
All-Bran with Extra Fiber,50,4,25,93.704912
Almond Delight,110,2,25,34.384843
Apple Cinnamon Cheerios,110,2,25,29.509541
Apple Jacks,110,2,25,33.174094
Basic 4,130,3,25,37.038562
Bran Chex,90,2,25,49.120253
Bran Flakes,90,3,25,53.313813


### Statistical Summary

- We can use the `describe()` method to get a quick statistical summary of
  each numeric column of the DataFrame

In [132]:
myDf.describe()

Unnamed: 0,calories,protein,vitamins,rating
count,10.0,10.0,10.0,10.0
mean,95.0,2.9,22.5,49.205817
std,25.495098,0.875595,7.905694,20.315297
min,50.0,2.0,0.0,29.509541
25%,75.0,2.0,25.0,34.08397
50%,100.0,3.0,25.0,43.079408
75%,110.0,3.75,25.0,57.897582
max,130.0,4.0,25.0,93.704912
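
Because `describe()` itself returns a DataFrame, individual statistics can be looked up by label. The following sketch re-enters the `calories` and `protein` columns from the table above by hand and reads a few values back out.

```python
import pandas as pd

cereals = pd.DataFrame({
    'calories': [70, 120, 70, 50, 110, 110, 110, 130, 90, 90],
    'protein': [4, 3, 4, 4, 2, 2, 2, 3, 2, 3],
})

summary = cereals.describe()

# Rows of the summary are labelled 'count', 'mean', 'std', 'min', ...
print(summary.loc['mean', 'calories'])
print(summary.loc['50%', 'calories'])   # the 50% row is the median

# The same numbers are available from the individual methods
print(cereals['calories'].mean(), cereals['calories'].median())
```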


### Operator for row slicing

- We can use the bracket operator `[]` to slice rows of the DataFrame
- We pass a start position (inclusive) and an end position (exclusive) to the `[]`
  operator to slice the rows of the DataFrame
- Notice that it will not change the original DataFrame

In [133]:
myDf[1:4] # Positions 1 to 3, i.e. the 2nd to the 4th row

Unnamed: 0_level_0,calories,protein,vitamins,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100% Natural Bran,120,3,0,33.983679
All-Bran,70,4,25,59.425505
All-Bran with Extra Fiber,50,4,25,93.704912


- We can also use the `[]` operator to index columns of the DataFrame
- Indexing a single column label returns a Series
- Indexing a list of column labels returns a DataFrame
- Remember that for indexing columns, we pass their labels to the `[]` operator,
  and not their positions
- Notice that it will not change the original DataFrame

In [136]:
import os
import pandas as pd

cereals_csv = os.path.abspath("cereals.csv")

with open(cereals_csv) as file:
    myDf = pd.read_csv(file)

# Passing two or more columns, returns a DataFrame
myDf[
    ['name', 'rating']
]

Unnamed: 0,name,rating
0,100% Bran,68.402973
1,100% Natural Bran,33.983679
2,All-Bran,59.425505
3,All-Bran with Extra Fiber,93.704912
4,Almond Delight,34.384843
5,Apple Cinnamon Cheerios,29.509541
6,Apple Jacks,33.174094
7,Basic 4,37.038562
8,Bran Chex,49.120253
9,Bran Flakes,53.313813


In [137]:
# Passing a list with a single column still returns a DataFrame;
# use myDf['protein'] (no inner list) to get a Series
myDf[
    ['protein']
]

Unnamed: 0,protein
0,4
1,3
2,4
3,4
4,2
5,2
6,2
7,3
8,2
9,3
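
The difference between a single label and a list of labels is easy to verify. This is a small sketch on a hand-built sample (the names `as_series` and `as_frame` are illustrative).

```python
import pandas as pd

cereals = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran'],
    'protein': [4, 3, 4],
})

as_series = cereals['protein']      # a single label returns a Series
as_frame = cereals[['protein']]     # a list of labels returns a DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)               # (rows, columns)
```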


### Boolean List

- We can also pass a List of booleans to the `[]` operator
- We'll get all the rows of the DataFrame for which the corresponding element in
  the List is `True`
- Rows of the DataFrame for which the corresponding element in the List is
  `False` are ignored
- Notice that it will not change the original DataFrame

In [138]:
df = myDf.head()
# df = myDf[0:5]
thirdRow = [False, False, True, False, False]
df[thirdRow]

Unnamed: 0,name,calories,protein,vitamins,rating
2,All-Bran,70,4,25,59.425505


### Filtering rows

- We can also use the `[]` operator to apply conditions on one or more columns
  of the DataFrame
- Only the rows of the DataFrame which satisfy those conditions are returned

In [139]:
condition = df['calories'] > 70
df[condition]

Unnamed: 0,name,calories,protein,vitamins,rating
1,100% Natural Bran,120,3,0,33.983679
4,Almond Delight,110,2,25,34.384843


In [140]:
df[df['calories'] > 70]

Unnamed: 0,name,calories,protein,vitamins,rating
1,100% Natural Bran,120,3,0,33.983679
4,Almond Delight,110,2,25,34.384843


#### and (&)

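- We can also combine conditions using the `&` operator
- It works like the `and` operator in Python, but is applied element-wise to
  each row
- Note: Each condition should be in parentheses, because `&` binds more tightly
  than comparison operators such as `>`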

In [141]:
df[
    (df['calories'] > 70) &
    (df['protein'] < 4)
]

Unnamed: 0,name,calories,protein,vitamins,rating
1,100% Natural Bran,120,3,0,33.983679
4,Almond Delight,110,2,25,34.384843


#### or (|)

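- We can also combine conditions using the `|` operator
- It works like the `or` operator in Python, but is applied element-wise to
  each row
- Note: Each condition should be in parentheses, because `|` binds more tightly
  than comparison operators such as `>`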

In [142]:
df[
    (df['calories'] > 70) |
    (df['protein'] > 3)
]

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
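
A condition such as `df['calories'] > 70` is itself a boolean Series, which is why it can be stored, combined, and passed to `[]`. The sketch below rebuilds the five-row sample by hand and also uses the `~` operator to negate a mask, which is not covered above and is included here only as an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran',
             'All-Bran with Extra Fiber', 'Almond Delight'],
    'calories': [70, 120, 70, 50, 110],
    'protein': [4, 3, 4, 4, 2],
})

# A comparison on a column produces a boolean Series, one value per row
high_calories = df['calories'] > 70
print(high_calories.tolist())

# & keeps rows where both masks are True, | where at least one is True
both = df[high_calories & (df['protein'] < 3)]
either = df[high_calories | (df['protein'] > 3)]

# ~ negates a mask
not_high = df[~high_calories]

print(both['name'].tolist())
print(len(either))
print(not_high['name'].tolist())
```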


### loc

#### Indexing

- `loc` is used to index/slice a group of rows and columns based on their labels
- The 1st argument is the row label and the 2nd argument is the column label
- In the following example, we index the row labelled `0` and the column
  labelled `name`, which returns a single value

In [143]:
display(df)

display(df.loc[0, 'name'])
print(type(df.loc[0, 'name']))

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843


'100% Bran'

<class 'str'>


In [144]:
display(df.loc[[0], ['name']])
print(type(df.loc[[0], ['name']]))

Unnamed: 0,name
0,100% Bran


<class 'pandas.core.frame.DataFrame'>


In [145]:
display(df.loc[[0]])
print(type(df.loc[[0]]))

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973


<class 'pandas.core.frame.DataFrame'>


#### Slicing

- We can also slice rows and/or columns using `loc`
- Both the start and stop labels of a slice with `loc` are inclusive
- In the following example, we slice the first 5 rows and the first 3 columns of the DataFrame

In [146]:
display(df.loc[0:4, 'name':'protein'])
print(type(df.loc[0:4, 'name':'protein']))

Unnamed: 0,name,calories,protein
0,100% Bran,70,4
1,100% Natural Bran,120,3
2,All-Bran,70,4
3,All-Bran with Extra Fiber,50,4
4,Almond Delight,110,2


<class 'pandas.core.frame.DataFrame'>


#### Indexing and Slicing

- We can index and slice simultaneously as well
- In the following example, we index rows and slice columns. The opposite is
  also possible

In [147]:
display(df.loc[[1, 3], 'name':'protein'])
print(type(df.loc[[1, 3], 'name':'protein']))

Unnamed: 0,name,calories,protein
1,100% Natural Bran,120,3
3,All-Bran with Extra Fiber,50,4


<class 'pandas.core.frame.DataFrame'>


In [148]:
display(df.loc[1 : 3, ['name', 'calories']])
print(type(df.loc[1 : 3, ['name', 'calories']]))

Unnamed: 0,name,calories
1,100% Natural Bran,120
2,All-Bran,70
3,All-Bran with Extra Fiber,50


<class 'pandas.core.frame.DataFrame'>


### iloc

#### Indexing

- `iloc` is used to index/slice a group of rows and columns
- `iloc` takes row and column positions as arguments and not their labels
- The first argument is the row position and the second argument is the column position
- In the following example, we index the fifth row and the third column. The
  result is a single value (here a `numpy.int64`), not a Series

In [149]:
display(df.iloc[4, 2])
print(type(df.iloc[4, 2]))

2

<class 'numpy.int64'>


In [150]:
# Getting a DataFrame using Lists as parameters
display(df.iloc[[4], [2]])
print(type(df.iloc[[4], [2]]))

Unnamed: 0,protein
4,2


<class 'pandas.core.frame.DataFrame'>


#### Slicing

- We can also slice rows and/or columns using `iloc`
- We provide row and column positions for slicing using `iloc`
- The start index of a slice with `iloc` is inclusive, the end index is exclusive
- In the following example, we slice the first 5 rows and the first 3 columns of
  the DataFrame

In [151]:
display(df.iloc[0:5, 0:3])
print(type(df.iloc[0:5, 0:3]))

Unnamed: 0,name,calories,protein
0,100% Bran,70,4
1,100% Natural Bran,120,3
2,All-Bran,70,4
3,All-Bran with Extra Fiber,50,4
4,Almond Delight,110,2


<class 'pandas.core.frame.DataFrame'>


#### Indexing and Slicing

- We can index and slice simultaneously as well
- In the following example, we index rows and slice columns. The opposite is
  also possible

In [152]:
display(df.iloc[[0, 2, 4], 0:3])
print(type(df.iloc[[0, 2, 4], 0:3]))

Unnamed: 0,name,calories,protein
0,100% Bran,70,4
2,All-Bran,70,4
4,Almond Delight,110,2


<class 'pandas.core.frame.DataFrame'>


In [153]:
display(df.iloc[0:3, [0, 3, 1]])
print(type(df.iloc[0:3, [0, 3, 1]]))

Unnamed: 0,name,vitamins,calories
0,100% Bran,25,70
1,100% Natural Bran,0,120
2,All-Bran,25,70


<class 'pandas.core.frame.DataFrame'>
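
The difference between `loc` and `iloc` is easiest to see when the row labels do not coincide with the row positions. The following sketch assumes a made-up index of `10, 20, 30, 40` purely for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'calories': [70, 120, 70, 50], 'protein': [4, 3, 4, 4]},
    index=[10, 20, 30, 40],   # row labels that differ from positions 0..3
)

by_label = df.loc[10:30]      # labels 10 through 30, end included
by_position = df.iloc[0:3]    # positions 0, 1, 2, end excluded

print(by_label.index.tolist())
print(by_position.index.tolist())

print(df.loc[20, 'calories'])   # the row labelled 20
print(df.iloc[1, 0])            # second row, first column: the same cell
```

Both slices select the same three rows here, even though one is written in terms of labels and the other in terms of positions.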


### Adding and deleting Rows and Columns

#### Adding Rows

- We can add more rows to our DataFrame using `loc`
- If the row label doesn't exist, a new row with the specified label will be
  added at the end of the rows

In [155]:
display(df)
df.loc[6] = ['Trix', 110, 1, 25, 27.753301]
display(df)

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[6] = ['Trix', 110, 1, 25, 27.753301]


Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301


In [156]:
display(df)

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301


#### Deleting Rows

- We can delete rows from the DataFrame using the `drop()` method by specifying
  `axis=0` for rows
- Provide the labels of the rows to be deleted as an argument to `drop()`
- Don't forget to use `inplace=True` (or assign the result back to the variable),
  otherwise the original DataFrame will remain untouched

In [157]:
df.drop(2, axis=0, inplace=True)
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(2, axis=0, inplace=True)


Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301


#### Adding Columns

- To add a column to the DataFrame, we use the same notation as adding a key,
  value pair to a dictionary
- Instead of the key, we provide column name in the square brackets, and then
  provide a list of values for that column
- If no column with the given name exists, a new column with the specified name
  and values will be added to the DataFrame

In [159]:
df['My Column'] = ['A', 'B', 'C', 'D', 'E']
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['My Column'] = ['A', 'B', 'C', 'D', 'E']


Unnamed: 0,name,calories,protein,vitamins,rating,My Column
0,100% Bran,70,4,25,68.402973,A
1,100% Natural Bran,120,3,0,33.983679,B
3,All-Bran with Extra Fiber,50,4,25,93.704912,C
4,Almond Delight,110,2,25,34.384843,D
6,Trix,110,1,25,27.753301,E


#### Deleting Columns

- We can delete columns of a DataFrame using the `drop()` method by specifying
  `axis=1` for columns
- Provide the name of the column to be deleted as an argument to `drop()`
- Don't forget to use `inplace=True` (or assign the result back to the variable),
  otherwise the original DataFrame will remain untouched

In [160]:
df.drop('My Column', axis=1, inplace=True)
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop('My Column', axis=1, inplace=True)


Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301
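
The message "A value is trying to be set on a copy of a slice from a DataFrame" in the outputs above most likely appears because `df` was created with `myDf.head()`, so Pandas cannot tell whether the edit was meant for `df` or for `myDf`. One common way to avoid it, sketched here on a hand-built sample and assuming an independent copy is what we want, is to call `.copy()` before modifying the subset.

```python
import pandas as pd

cereals = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran'],
    'calories': [70, 120, 70],
})

# .copy() gives an independent DataFrame, so edits do not touch `cereals`
subset = cereals.head(2).copy()

subset.loc[6] = ['Trix', 110]                 # add a row with label 6
subset['My Column'] = ['A', 'B', 'C']         # add a column
subset = subset.drop('My Column', axis=1)     # drop it again, by reassignment
subset = subset.drop(0, axis=0)               # drop the row labelled 0

print(subset)
print(cereals.shape)   # the source DataFrame is unchanged
```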


### Sorting Values

#### Ascending

- We can sort the rows of a DataFrame with respect to a column using the
  `sort_values()` method, which sorts the values in ascending order by default
- If the values of the column are strings, they are sorted alphabetically
- If the values of the column are numbers, they are sorted numerically
- It doesn't change the original DataFrame

In [162]:
df.sort_values(by='calories')

Unnamed: 0,name,calories,protein,vitamins,rating
3,All-Bran with Extra Fiber,50,4,25,93.704912
0,100% Bran,70,4,25,68.402973
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301
1,100% Natural Bran,120,3,0,33.983679


#### Descending

- To sort values in descending order, we set `ascending=False` in the
  `sort_values()` call

In [164]:
df.sort_values(by='calories', ascending=False)

Unnamed: 0,name,calories,protein,vitamins,rating
1,100% Natural Bran,120,3,0,33.983679
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301
0,100% Bran,70,4,25,68.402973
3,All-Bran with Extra Fiber,50,4,25,93.704912


In [166]:
# Original DataFrame remains untouched
df

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
3,All-Bran with Extra Fiber,50,4,25,93.704912
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301
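
The following sketch, on a three-row hand-built sample, shows that each row keeps its label when it is moved and that the original DataFrame keeps its order.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['100% Bran', '100% Natural Bran', 'All-Bran with Extra Fiber'],
    'calories': [70, 120, 50],
}, index=[0, 1, 3])

ascending = df.sort_values(by='calories')
descending = df.sort_values(by='calories', ascending=False)

print(ascending.index.tolist())    # row labels travel with their rows
print(descending.index.tolist())
print(df.index.tolist())           # original order is untouched
```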


### Exporting and Saving Pandas DataFrame

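- To export a DataFrame as a CSV file, use the `to_csv()` method
- If a file with the specified filename exists, it will be overwritten. Otherwise, a
  new file with the specified filename will be created
- Setting `index_label=False` in `to_csv()` only omits the header field for the
  index column; the index values are still written, which is why they come back
  as row labels in the example below
- If you don't want to store the index column in the CSV file at all, set
  `index=False` instead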

In [168]:
df.sort_values(by='calories').to_csv('myFile.csv', index_label=False)

# import os
# import pandas as pd

myFile_csv = os.path.abspath("myFile.csv")

with open(myFile_csv) as file:
    newDf = pd.read_csv(file)

newDf

Unnamed: 0,name,calories,protein,vitamins,rating
3,All-Bran with Extra Fiber,50,4,25,93.704912
0,100% Bran,70,4,25,68.402973
4,Almond Delight,110,2,25,34.384843
6,Trix,110,1,25,27.753301
1,100% Natural Bran,120,3,0,33.983679
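
To compare the `index` and `index_label` parameters without writing files, `to_csv()` can be called without a path, in which case it returns the CSV text. The sketch below uses a two-row hand-built sample and `io.StringIO` to read the text back; the variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'name': ['Trix', '100% Natural Bran'], 'calories': [110, 120]},
    index=[6, 1],
)

# With no path, to_csv() returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)
no_index_label = df.to_csv(index_label=False)

print(with_index)
print(without_index)
print(no_index_label)

# Reading the index_label=False text back: the header has one field fewer
# than the data rows, so read_csv uses the first column as the index
restored = pd.read_csv(io.StringIO(no_index_label))
print(restored.index.tolist())
```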


### Concatenating DataFrames

- We can concatenate two or more DataFrames using the `pandas.concat()` function,
  which stacks them one below the other by default (`axis=0`)
- We can also concatenate two or more DataFrames side by side, using
  `axis=1` with `pandas.concat()`

In [169]:
df1 = myDf[0:3]
df1

Unnamed: 0,name,calories,protein,vitamins,rating
0,100% Bran,70,4,25,68.402973
1,100% Natural Bran,120,3,0,33.983679
2,All-Bran,70,4,25,59.425505


In [176]:
# Using reset_index(drop=True) to reset the row index of the new DataFrame
df2 = myDf[5:8].reset_index(drop=True)
df2

Unnamed: 0,name,calories,protein,vitamins,rating
0,Apple Cinnamon Cheerios,110,2,25,29.509541
1,Apple Jacks,110,2,25,33.174094
2,Basic 4,130,3,25,37.038562


In [174]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,name,calories,protein,vitamins,rating,name.1,calories.1,protein.1,vitamins.1,rating.1
0,100% Bran,70,4,25,68.402973,Apple Cinnamon Cheerios,110,2,25,29.509541
1,100% Natural Bran,120,3,0,33.983679,Apple Jacks,110,2,25,33.174094
2,All-Bran,70,4,25,59.425505,Basic 4,130,3,25,37.038562
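
For comparison, here is a minimal sketch of both directions on two small hand-built DataFrames. The `ignore_index=True` argument is an addition not used above; it renumbers the rows of the stacked result.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['100% Bran', 'All-Bran'], 'calories': [70, 70]})
df2 = pd.DataFrame({'name': ['Apple Jacks', 'Basic 4'], 'calories': [110, 130]})

# Default axis=0 stacks rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, matching rows by their labels
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side.shape)
```

Because `axis=1` aligns rows by label, resetting the index of the second DataFrame (as done with `reset_index(drop=True)` above) is what makes the rows line up.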


### groupby()

- The `groupby()` method groups the rows of a DataFrame by the values of a
  column (a Series)
  - The DataFrame is split into groups
  - An aggregate function is applied to each column of every group
  - The results are combined together
- Consider the following DataFrame

In [178]:
matchDf = pd.DataFrame({'Gender':['female', 'male', 'female', 'male'], 'Score': [85, 88, 95, 80]})
matchDf

Unnamed: 0,Gender,Score
0,female,85
1,male,88
2,female,95
3,male,80


- The **Gender** column contains two values, male and female
- Let's split our DataFrame into two parts based on the **Gender** column
  - The first part will contain the rows where Gender = male
  - The second part will contain the rows where Gender = female

In [186]:
display(matchDf[matchDf['Gender'] == 'male'])
display(matchDf[matchDf['Gender'] == 'female'])

Unnamed: 0,Gender,Score
1,male,88
3,male,80


Unnamed: 0,Gender,Score
0,female,85
2,female,95


- If we find the mean score of both the genders, this is what we get

In [190]:
maleDf = matchDf[matchDf['Gender'] == 'male']
display(maleDf.groupby(maleDf['Gender']).mean())

femaleDf = matchDf[matchDf['Gender'] == 'female']
display(femaleDf.groupby(femaleDf['Gender']).mean())

Unnamed: 0_level_0,Score
Gender,Unnamed: 1_level_1
male,84.0


Unnamed: 0_level_0,Score
Gender,Unnamed: 1_level_1
female,90.0


- Let's combine the two results together. This is what we get

In [181]:
matchDf.groupby(matchDf['Gender']).mean()

Unnamed: 0_level_0,Score
Gender,Unnamed: 1_level_1
female,90.0
male,84.0


- The `groupby()` method performs the same split, apply, and combine steps for
  us in a single call
- In the given example, we group our DataFrame by the **Gender**
  column, and then apply the aggregate function `mean()` to it
- Note that aggregate functions are applied automatically to all the columns of
  the DataFrame, except the one used to group the DataFrame
- The common aggregate functions:
  - `mean()` for the average
  - `sum()` for the sum of all values
  - `max()` for the max value of a given column
  - `min()` for the min value of a given column
  - `median()` for the median
  - `count()` for the total number of elements in a given column
  - `std()` for calculating the standard deviation of the values of a given column
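
As a short sketch of these aggregate functions on the same `matchDf` data, the example below groups by column name, selects the `Score` column explicitly, and uses `agg()` to apply several functions in one call. Selecting the column and using `agg()` are choices made for this illustration.

```python
import pandas as pd

matchDf = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'male'],
    'Score': [85, 88, 95, 80],
})

# Grouping by column name and selecting 'Score' gives one value per group
mean_scores = matchDf.groupby('Gender')['Score'].mean()
print(mean_scores)

# agg() applies several aggregate functions in one call
stats = matchDf.groupby('Gender')['Score'].agg(['mean', 'max', 'min', 'count'])
print(stats)
```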