<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Examples: Introduction to Pandas
© ExploreAI Academy

In this train, we will look at the Pandas package and its key data structure, the dataframe, which is commonly used for data analysis in Python. We will define what a Pandas dataframe is, show how to create dataframes, and show how to access the data they hold.

## Learning Objectives
* Understand what Pandas is and how dataframes are used in Python to handle data.
* Know how to load, manipulate and analyse data using Pandas.

Pandas is a tool, built on the NumPy package, that allows us to work with data. It has functions for analysing, cleaning, exploring, and manipulating data.

### What is a Pandas Dataframe?
Pandas' key data structure is the `dataframe`. A dataframe allows for the storage and manipulation of tabular data. It is a two-dimensional labelled data structure.

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Pandas_dataframe.jpg" width="700">

A Pandas dataframe consists of three main components: the data, the index, and the columns. Let's walk through some examples to gain an understanding of these components.
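As a quick preview, the three components can be inspected directly on any dataframe. This is a minimal sketch; the player labels and values are hypothetical placeholders used only for illustration.

```python
import pandas as pd

# Hypothetical two-row table, used only to illustrate the three components.
players = pd.DataFrame(
    data=[[32, 'Portugal'], [30, 'Argentina']],
    index=['Player A', 'Player B'],
    columns=['Age', 'Nationality'],
)

print(players.to_numpy())        # the data: a two-dimensional array of values
print(players.index.tolist())    # the index: the row labels
print(players.columns.tolist())  # the columns: the column labels
```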

#### Creating a dataframe
We can create a dataframe by calling the `DataFrame()` constructor. Its main arguments are the data, the index and the columns. The data can come from other data structures (lists, dictionaries or NumPy arrays) or from a loaded file. Pandas is particularly useful for handling structured data, such as CSV or Excel files.

Let's start by first importing the Pandas library:

In [1]:
import pandas as pd

### Example 1

Creating a dataframe from a **list** of lists: here, each inner list represents a row of data, and the outer list contains all the rows. By providing an optional index and specifying column names, we can organise the data into a structured two-dimensional table. Note that if we don't explicitly pass in an index, Pandas generates one automatically, starting at 0.

In [2]:
# Create list of lists containing data.
list_df = [[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]]

# Create index - names of players.
index = ['Christiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create column names.
columns = ['Age', 'Nationality', 'Overall']

# Create dataframe by passing in data, index and columns.
pd.DataFrame(data=list_df, index=index, columns=columns)

,Age,Nationality,Overall
Christiano Ronaldo,32,Portugal,94
Lionel Messi,30,Argentina,93
Neymar,25,Brazil,92
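To see the automatically generated index mentioned above, we can build the same table without the `index` argument. This is a small sketch that reuses the values from Example 1; the variable names are chosen here for illustration.

```python
import pandas as pd

rows = [[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]]

# No index is passed, so Pandas numbers the rows itself.
default_index_df = pd.DataFrame(data=rows, columns=['Age', 'Nationality', 'Overall'])

print(default_index_df.index.tolist())  # [0, 1, 2]
```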






### Example 2

We can also create a dataframe from a **dictionary**. The dictionary keys should be the column names, while the values should be the data entries for that column. We can also pass in an index, if we want to. Note that because the keys account for the column names, we don't have to pass in an argument for columns.

In [3]:
# Create dictionary containing data.
dict_df = {'Age':[32, 30, 25], 'Nationality':['Portugal', 'Argentina', 'Brazil'], 'Overall':[94, 93, 92]}

# Create index - names of players.
index = ['Christiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create dataframe by passing in data and index (column names come from the dictionary keys).
pd.DataFrame(data=dict_df, index=index)

,Age,Nationality,Overall
Christiano Ronaldo,32,Portugal,94
Lionel Messi,30,Argentina,93
Neymar,25,Brazil,92
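Because each dictionary value is its own list, each column keeps its own data type. The sketch below checks this with the `dtypes` attribute; the variable names are illustrative, and the exact name Pandas reports for the text column can differ between versions.

```python
import pandas as pd

dict_data = {
    'Age': [32, 30, 25],
    'Nationality': ['Portugal', 'Argentina', 'Brazil'],
    'Overall': [94, 93, 92],
}

typed_df = pd.DataFrame(data=dict_data)

# One data type per column: the numeric columns stay numeric, the text column does not.
print(typed_df.dtypes)
```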


### Example 3

If the data is stored in a **NumPy array**, we can also use it to generate a Pandas dataframe. We pass in the data, then the column names and, if required, an index (player names).

When creating a dataframe from a NumPy array, keep in mind that a NumPy array stores a single data type for all of its elements. An array built from mixed values, like the one below, is therefore stored entirely as strings, and the resulting dataframe columns hold strings too. Pandas dataframes work best when the data type is consistent within **each column**, so numeric columns may need to be converted after the dataframe is created.

In [14]:
import numpy as np

# Create NumPy array containing data.
array_df = np.array([[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]])

# Create index - names of players.
index = ['Christiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create column names.
columns = ['Age', 'Nationality', 'Overall']

# Create dataframe by passing in data, index and columns.
pd.DataFrame(data=array_df, index=index, columns=columns)

,Age,Nationality,Overall
Christiano Ronaldo,32,Portugal,94
Lionel Messi,30,Argentina,93
Neymar,25,Brazil,92
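One thing to watch for with mixed NumPy arrays is that the numbers end up stored as text. The sketch below shows one possible way to check this and to convert the numeric columns back with `astype()`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

mixed = np.array([[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]])

# NumPy stores one data type per array, so every element has become a string.
print(mixed.dtype.kind)  # 'U' stands for Unicode string

array_based = pd.DataFrame(mixed, columns=['Age', 'Nationality', 'Overall'])

# Convert the numeric columns back to integers.
converted = array_based.astype({'Age': 'int64', 'Overall': 'int64'})
print(converted.dtypes)
```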


#### Dataframes from other files
The data can also come from a loaded file. Pandas can read various file formats, such as CSV, Excel and JSON.
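As a rough illustration, the sketch below reads the same small table from CSV text and from JSON text using `read_csv()` and `read_json()`. To keep it self-contained, the "files" are in-memory strings wrapped in `io.StringIO`, and the contents are hypothetical; in practice we would pass a file path or URL instead.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory instead of on disk.
csv_text = "Name,Age\nPlayer A,32\nPlayer B,30\n"
json_text = '[{"Name": "Player A", "Age": 32}, {"Name": "Player B", "Age": 30}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```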

In [10]:
!pip install numpy



### Example 4

We will load the full CSV file. Pandas makes this very easy: we simply load the data using the `read_csv()` function and pass in the full path (or URL) of the file as a string.

Pandas will use the first row as the column names, so we don't need to pass them in. We can also specify the index when we load the data by passing the name of our index column as a string to the `index_col` argument. Remember to always check whether there are any warnings when loading the data; in this case, some columns seem to contain mixed data types.

To check that our data has loaded as expected, we can use the `head()` function, which returns the first 5 records of our data. This is helpful if the dataframe has many rows and displaying all of it would be impractical.

In [1]:
import pandas as pd

# Load data - pass 'Name' as our index column.
# For this exercise, we'll use football player data to evaluate our dataframe.
load_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/fundamentals/football_players.csv', index_col='Name')

# Create dataframe called df.
df = pd.DataFrame(load_df)

# Use the head() function to look at the first 5 rows.
df.head()
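The same loading and checking steps can be tried offline on a small stand-in table. This is only a sketch: the CSV text below is hypothetical and is read from memory via `io.StringIO` rather than from the football players file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file with a 'Name' column.
csv_text = """Name,Age,Nationality
Player A,32,Portugal
Player B,30,Argentina
Player C,25,Brazil
Player D,31,Germany
Player E,28,Poland
Player F,26,Spain
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(3))  # head() accepts the number of rows to show
print(sample.dtypes)   # quick check of the data type in each column
```

Calling `head()` with no argument returns the first 5 rows, while passing a number returns that many rows instead.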

### Accessing Dataframes

Accessing data within dataframes is not as straightforward as with the previous data structures. This can be done by index, by column, or by both. Let's work through these methods.

### Example 5

To access rows by index only, we can use the `iloc` or `loc` indexers with the indices in square brackets. `iloc` refers to the integer location, so we pass in the position of the row (starting at 0), while `loc` refers to the index label, so we pass in the index name. We can use slicing if we want more than one row. For example:

* `dataframe.iloc[i]` - returns the row at position i as a series
* `dataframe.iloc[start:end]` - returns a dataframe with the rows from start to end (end not included)
* `dataframe.loc['index name']` - returns the row with the given index name as a series

Let's look at a few examples:

In [16]:
# Select the 5th row using iloc[].
df.iloc[4]

Age                         31
Nationality            Germany
Overall                     92
Acceleration            58    
Aggression              29    
Agility                 52    
Balance                 35    
Ball control            48    
Composure               70    
Crossing                15    
Curve                   14    
Dribbling               30    
Finishing               13    
Free kick accuracy      11    
GK diving               91    
GK handling             90    
GK kicking              95    
GK positioning          91    
GK reflexes             89    
Heading accuracy        25    
Interceptions           30    
Jumping                 78    
Long passing            59    
Long shots              16    
Marking                 10    
Penalties               47    
Positioning             12    
Reactions               85    
Short passing           55    
Shot power              25    
Sliding tackle          11    
Sprint speed            61    
Stamina 

In [17]:
# Select rows 5 to 10.
df.iloc[4:10]

Name,Age,Nationality,Overall,Acceleration,Aggression,Agility,Balance,Ball control,Composure,Crossing,...,Short passing,Shot power,Sliding tackle,Sprint speed,Stamina,Standing tackle,Strength,Vision,Volleys,Preferred Positions
M. Neuer,31,Germany,92,58,29,52,35,48,70,15,...,55,25,11,61,44,10,83,70,11,GK
R. Lewandowski,28,Poland,91,79,80,78,80,89,87,62,...,83,88,19,83,79,42,84,78,87,ST
De Gea,26,Spain,90,57,38,60,43,42,64,17,...,50,31,13,58,40,21,64,68,13,GK
E. Hazard,26,Belgium,90,93,54,93,91,92,87,80,...,86,79,22,87,79,27,65,86,79,LW
T. Kroos,27,Germany,90,60,60,71,69,89,85,85,...,90,87,69,52,77,82,74,88,82,CDM CM
G. Higuaín,29,Argentina,90,78,50,75,69,85,86,68,...,75,88,18,80,72,22,85,70,88,ST


In [18]:
# Select the M. Neuer index using loc[].
df.loc['M. Neuer']

Age                         31
Nationality            Germany
Overall                     92
Acceleration            58    
Aggression              29    
Agility                 52    
Balance                 35    
Ball control            48    
Composure               70    
Crossing                15    
Curve                   14    
Dribbling               30    
Finishing               13    
Free kick accuracy      11    
GK diving               91    
GK handling             90    
GK kicking              95    
GK positioning          91    
GK reflexes             89    
Heading accuracy        25    
Interceptions           30    
Jumping                 78    
Long passing            59    
Long shots              16    
Marking                 10    
Penalties               47    
Positioning             12    
Reactions               85    
Short passing           55    
Shot power              25    
Sliding tackle          11    
Sprint speed            61    
Stamina 
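One difference worth noting when slicing: `iloc` slices exclude the end position, whereas `loc` slices by label include the end label. The sketch below illustrates this on a small hypothetical dataframe.

```python
import pandas as pd

# Hypothetical four-row dataframe with player labels as the index.
squad = pd.DataFrame(
    {'Age': [32, 30, 25, 31]},
    index=['Player A', 'Player B', 'Player C', 'Player D'],
)

# Position-based slice: rows at positions 0 and 1 (position 2 is excluded).
by_position = squad.iloc[0:2]

# Label-based slice: 'Player A' up to and including 'Player C'.
by_label = squad.loc['Player A':'Player C']

print(by_position.index.tolist())  # ['Player A', 'Player B']
print(by_label.index.tolist())     # ['Player A', 'Player B', 'Player C']
```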

### Example 6
To access by column only we can simply call `dataframe['Column Name']`. If we want more than one column we input a list of column names inside the square brackets:

* `dataframe['Column Name']` - returns series of given column
* `dataframe[['Column 1', 'Column 2']]` - returns dataframe with the given columns

Let's look at examples.

In [19]:
# Select the column 'Age'.
df['Age']

Name
Cristiano Ronaldo    32
L. Messi             30
Neymar               25
L. Suárez            30
M. Neuer             31
                     ..
A. Kelsey            17
B. Richardson        47
J. Young             17
J. Lundstram         18
L. Sackey            18
Name: Age, Length: 17981, dtype: int64

In [20]:
# Select the columns 'Age' and 'Nationality'.
df[['Age', 'Nationality']]

Name,Age,Nationality
Cristiano Ronaldo,32,Portugal
L. Messi,30,Argentina
Neymar,25,Brazil
L. Suárez,30,Uruguay
M. Neuer,31,Germany
...,...,...
A. Kelsey,17,England
B. Richardson,47,England
J. Young,17,Scotland
J. Lundstram,18,England
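The difference between single and double square brackets can be confirmed by checking the type of the result. This is a minimal sketch on a hypothetical dataframe; the variable names are for illustration only.

```python
import pandas as pd

squad = pd.DataFrame(
    {'Age': [32, 30, 25], 'Nationality': ['Portugal', 'Argentina', 'Brazil']},
    index=['Player A', 'Player B', 'Player C'],
)

age_series = squad['Age']   # single brackets: a series
age_frame = squad[['Age']]  # double brackets: a dataframe, even with one column

print(type(age_series).__name__)  # Series
print(type(age_frame).__name__)   # DataFrame
```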


### Example 7
We can also select a subset of the dataframe using indices and columns in combination. Let's look at a few examples:

In [21]:
# Select the first 5 rows and first 2 columns - Rows first.
df.iloc[0:5][['Age', 'Nationality']]

Name,Age,Nationality
Cristiano Ronaldo,32,Portugal
L. Messi,30,Argentina
Neymar,25,Brazil
L. Suárez,30,Uruguay
M. Neuer,31,Germany


In [23]:
# Select the first 5 rows and first 2 columns - Columns first.
df[['Age', 'Nationality']].iloc[0:5]

Name,Age,Nationality
Cristiano Ronaldo,32,Portugal
L. Messi,30,Argentina
Neymar,25,Brazil
L. Suárez,30,Uruguay
M. Neuer,31,Germany
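Rows and columns can also be selected in a single step by passing both to `iloc` (positions) or `loc` (labels), separated by a comma. The sketch below uses a small hypothetical dataframe and shows that both approaches return the same subset.

```python
import pandas as pd

squad = pd.DataFrame(
    {
        'Age': [32, 30, 25],
        'Nationality': ['Portugal', 'Argentina', 'Brazil'],
        'Overall': [94, 93, 92],
    },
    index=['Player A', 'Player B', 'Player C'],
)

# First 2 rows and first 2 columns, by position.
subset_by_position = squad.iloc[0:2, 0:2]

# The same rows and columns, by label.
subset_by_label = squad.loc[['Player A', 'Player B'], ['Age', 'Nationality']]

print(subset_by_position.equals(subset_by_label))  # True
```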


### When to use Dataframes
Unlike the NumPy array, which is suited to storing and performing computations on homogeneous data (data of the same type), Pandas dataframes can accommodate heterogeneous data. This makes them the data structure of choice for manipulating often messy data (e.g. tabular data from spreadsheets or SQL tables).

We should use a Pandas dataframe if all of the following statements hold:

* We have 2-dimensional data (rows and columns)
* The data type is the same within a column
* We are interested in the index (rows) and column names

Pandas dataframes are especially beneficial for data manipulation tasks like merging, joining, and reshaping data.
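As a brief illustration of merging, the sketch below combines two small dataframes on a shared column using `merge()`. Both tables and their values are hypothetical and are not part of the football players dataset.

```python
import pandas as pd

# Hypothetical player table.
players = pd.DataFrame({
    'Name': ['Player A', 'Player B', 'Player C'],
    'Nationality': ['Portugal', 'Argentina', 'Brazil'],
})

# Hypothetical lookup table keyed on the same 'Nationality' column.
continents = pd.DataFrame({
    'Nationality': ['Portugal', 'Argentina', 'Brazil'],
    'Continent': ['Europe', 'South America', 'South America'],
})

# Keep every player and attach the matching continent.
merged = players.merge(continents, on='Nationality', how='left')
print(merged)
```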

## Additional resources
- [Pandas package home page](https://pandas.pydata.org/)


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>