# **What is pandas?**
![pandas-image](https://preview.redd.it/c6h7rok9c2v31.jpg?auto=webp&s=55820f5f994a744ff9d774fa2c6b3b56539a302f)

pandas is a widely used open-source Python library that provides high-performance data manipulation and analysis tools. It is built on top of the NumPy library, which provides support for multi-dimensional arrays and mathematical functions in Python.

pandas provides a wide range of data manipulation features, including:

- Reading and writing data from various file formats like CSV, Excel, SQL databases, etc.
- Data selection, slicing, and filtering.
- Handling missing data.
- Grouping and aggregating data.
- Merging and joining datasets.
- Reshaping and pivoting data.
- Time series analysis.
- Data visualization integration with libraries like Matplotlib and Seaborn.
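As a rough illustration of two items from the list above (handling missing data, then grouping and aggregating), here is a minimal sketch. The `sales` table, its column names, and its values are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "units": [10, None, 7, 5],  # None becomes a missing value (NaN)
})

# Handling missing data: replace the missing "units" value with 0.
sales_filled = sales.fillna({"units": 0})

# Grouping and aggregating: total units per region.
units_per_region = sales_filled.groupby("region")["units"].sum()
print(units_per_region)
```

Filling with 0 is only one possible choice; dropping the row with `dropna()` is another, depending on what a missing value means in your data.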

# **Why should I use pandas?**

- Learning pandas teaches foundational data manipulation concepts and builds practical Python coding skills.
- It is simple and straightforward, so you can apply it to a wide range of datasets right away.
- It is a staple tool in the data science and machine learning communities.

# **How do I install pandas?**

1. Open your terminal and run `pip install pandas`.
2. In a `.ipynb` or `.py` file, import pandas: `import pandas as pd`
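One quick way to confirm that the installation worked is to import pandas and print its version. This is a small sketch; the version string you see depends on your environment.

```python
import pandas as pd

# If this import succeeds, pandas is installed.
# The exact version depends on your environment, so none is assumed here.
installed_version = pd.__version__
print(installed_version)
```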

# **Importing pandas**

In [1]:
import pandas as pd

# **Two basic data structures of pandas**

These are the `DataFrame` and `Series` data structures.

`DataFrame` is a 2-dimensional data structure designed for handling tabular data with multiple columns.\
`Series` is a 1-dimensional data structure used to represent a single column or row of a `DataFrame`, or to stand alone as its own data structure.

Think about it like an Excel sheet. The `DataFrame` is basically the entire sheet while the `Series` is a column or row of that sheet. Therefore, a `DataFrame` is made up of `Series` data structures.
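The spreadsheet analogy can be checked directly. The sketch below uses a made-up two-row table (the `sheet` name and its values are assumptions for illustration) and shows that pulling out one column or one row gives back a `Series`.

```python
import pandas as pd

# Hypothetical two-row "sheet", invented for illustration.
sheet = pd.DataFrame({"name": ["Ana", "Ben"], "age": [30, 25]})

name_column = sheet["name"]  # a single column
first_row = sheet.iloc[0]    # a single row

print(type(sheet).__name__)        # DataFrame
print(type(name_column).__name__)  # Series
print(type(first_row).__name__)    # Series
```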

In [12]:
# Three ways to create a pandas Series: from a list of values, from a dict, or from a list with a custom index.

obj = pd.Series([1,"John",3.5,"Hey"])
obj

0       1
1    John
2     3.5
3     Hey
dtype: object

In [20]:
score = {
    "Jane":90, 
    "Bill":80,
    "Elon":85,
    "Tom":75,
    "Tim":95}
names = pd.Series(score) # Convert to Series 
names

Jane    90
Bill    80
Elon    85
Tom     75
Tim     95
dtype: int64

In [19]:
grades = [1, 2, 3, 4, 5]
subjects = pd.Series(grades, index=["Math", "Science", "English", "AP", "Filipino"])
subjects

Math        1
Science     2
English     3
AP          4
Filipino    5
dtype: int64

In [21]:
data = {
    "name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],      
    "score":[90,80,85,75,95,60,65],      
    "sport":["Wrestling","Football","Skiing","Swimming","Tennis",
               "Karate","Surfing"],      
    "sex":["M","M","M","M","F","F","F"]
    }
people = pd.DataFrame(data)
people

Unnamed: 0,name,score,sport,sex
0,Bill,90,Wrestling,M
1,Tom,80,Football,M
2,Tim,85,Skiing,M
3,John,75,Swimming,M
4,Alex,95,Tennis,F
5,Vanessa,60,Karate,F
6,Kate,65,Surfing,F


In [23]:
# Creating two Pandas Series
series1 = pd.Series([10, 20, 30, 40], name='Series_1')
series2 = pd.Series(['A', 'B', 'C', 'D'], name='Series_2')

# Combining the two Series into a DataFrame
df = pd.DataFrame({'Column_1': series1, 'Column_2': series2})
df

Unnamed: 0,Column_1,Column_2
0,10,A
1,20,B
2,30,C
3,40,D


# **Importing your dataset**

We will be using a FIFA World Cup dataset stored as a `.csv` file, a file type you will encounter often in data science work.


## *Why .csv?*

CSV files are often preferred over `.xlsx` files in data science because they are plain text: simple, readable by almost any tool, lightweight, and easy to track under version control.
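To make the "plain text" point concrete, here is a small sketch that reads CSV content from a string and writes it back out. The two rows are copied from the World Cup data used in this notebook; `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# A CSV file is just text: a header line, then one line per row.
csv_text = "YEAR,HOST\n1930,Uruguay\n1934,Italy\n"

# read_csv accepts any file-like object, so StringIO works like a file here.
frame = pd.read_csv(io.StringIO(csv_text))

# With no path given, to_csv returns the CSV text instead of writing a file.
round_trip = frame.to_csv(index=False)
print(round_trip)
```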

In [2]:
fifa_data = pd.read_csv('FIFA_WorldCup_data.csv')

Now that the `.csv` file has been loaded into `fifa_data`, you can view it by evaluating the variable in a cell.

In [6]:
fifa_data

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
0,1930,Uruguay,Uruguay,Argentina,United States,13,16,70,3.6
1,1934,Italy,Italy,Czechoslovakia,Germany,16,17,70,4.1
2,1938,France,Italy,Hungary,Brazil,15,18,84,4.7
3,1950,Brazil,Uruguay,Brazil,Sweden,13,22,88,4.0
4,1954,Switzerland,West Germany,Hungary,Austria,16,26,140,5.4
5,1958,Sweden,Brazil,Sweden,France,16,35,126,3.6
6,1962,Chile,Brazil,Czechoslovakia,Chile,16,32,89,2.8
7,1966,England,England,West Germany,Portugal,16,32,89,2.8
8,1970,Mexico,Brazil,Italy,West Germany,16,32,95,3.0
9,1974,West Germany,West Germany,Netherlands,Poland,16,38,97,2.6


In [7]:
# You can use .head() to preview the first 5 rows.

fifa_data.head()

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
0,1930,Uruguay,Uruguay,Argentina,United States,13,16,70,3.6
1,1934,Italy,Italy,Czechoslovakia,Germany,16,17,70,4.1
2,1938,France,Italy,Hungary,Brazil,15,18,84,4.7
3,1950,Brazil,Uruguay,Brazil,Sweden,13,22,88,4.0
4,1954,Switzerland,West Germany,Hungary,Austria,16,26,140,5.4


In [8]:
# You can use .tail() to preview the last 5 rows.

fifa_data.tail()

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
16,2002,"South Korea, Japan",Brazil,Germany,Turkey,32,64,161,2.5
17,2006,Germany,Italy,France,Germany,32,64,147,2.3
18,2010,South Africa,Spain,Netherlands,Germany,32,64,145,2.3
19,2014,Brazil,Germany,Argentina,Netherlands,32,64,171,2.7
20,2018,Russia,France,Croatia,Belgium,32,64,169,2.6


In [10]:
# .shape is an attribute (not a function) that returns a (rows, columns) tuple for a dataframe

fifa_data.shape

(21, 9)

In [11]:
# .describe() computes summary statistics (count, mean, std, min, percentiles, max) for the numeric columns of a Series or DataFrame.

fifa_data.describe() 

Unnamed: 0,YEAR,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
count,21.0,21.0,21.0,21.0,21.0
mean,1976.857143,21.761905,42.761905,121.333333,3.07619
std,26.657618,7.462605,17.615064,33.94309,0.847883
min,1930.0,13.0,16.0,70.0,2.2
25%,1958.0,16.0,32.0,89.0,2.6
50%,1978.0,16.0,38.0,126.0,2.7
75%,1998.0,32.0,64.0,146.0,3.6
max,2018.0,32.0,64.0,171.0,5.4
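Note that the text columns (`HOST`, `CHAMPION`, and so on) are missing from the summary above, because `describe()` only covers numeric columns by default. The sketch below, on a three-row excerpt of the same data, shows the default next to `include="all"`, which also summarizes non-numeric columns.

```python
import pandas as pd

# Three rows excerpted from the World Cup data shown earlier.
cups = pd.DataFrame({
    "HOST": ["Uruguay", "Italy", "France"],
    "TEAMS": [13, 16, 15],
})

# Default: only the numeric column (TEAMS) is summarized.
numeric_summary = cups.describe()
print(numeric_summary)

# include="all" adds rows such as "unique" and "top" for text columns.
all_summary = cups.describe(include="all")
print(all_summary)
```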


# **Indexing**

Indexing is vital in data science for efficient data retrieval, manipulation, and analysis. 

There are three main ways of indexing pandas dataframes:
1. Label-based
2. Position-based (integer)
3. Boolean-based
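Boolean-based indexing is the one that may be least familiar, so here is a minimal sketch of it on a three-row excerpt of the World Cup data. A comparison on a column produces a `Series` of `True`/`False` values (a mask), and indexing with that mask keeps only the rows marked `True`.

```python
import pandas as pd

# Three rows excerpted from the World Cup data shown earlier.
cups = pd.DataFrame({
    "YEAR": [1930, 1934, 1938],
    "HOST": ["Uruguay", "Italy", "France"],
    "TEAMS": [13, 16, 15],
})

# A comparison yields one True/False value per row.
mask = cups["TEAMS"] > 13
print(mask.tolist())  # [False, True, True]

# Indexing with the mask keeps the rows where it is True.
more_than_13_teams = cups[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
late_and_large = cups[(cups["YEAR"] > 1930) & (cups["TEAMS"] >= 16)]
print(late_and_large)
```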

## Label-based

Label-based indexing uses the indexing operator `[]` or the `loc` indexer. You can also use the `.` attribute operator and the `at` indexer, but the first two are the most common.
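The four label-based accessors can be compared side by side. This is a small sketch on a two-row excerpt of the World Cup data, not an exhaustive tour of each accessor.

```python
import pandas as pd

# Two rows excerpted from the World Cup data shown earlier.
cups = pd.DataFrame({"YEAR": [1930, 1934], "HOST": ["Uruguay", "Italy"]})

hosts_brackets = cups["HOST"]    # indexing operator []
hosts_loc = cups.loc[:, "HOST"]  # loc: all rows, the column labeled "HOST"

# Attribute access only works when the column name is a valid Python
# identifier, so a column such as "RUNNER UP" cannot be reached this way.
hosts_attr = cups.HOST

# at retrieves a single value by row label and column label.
first_host = cups.at[0, "HOST"]
print(first_host)  # Uruguay
```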

## 1. Retrieve the hosts of the World Cup as a pandas Series

In [26]:
world_cup_hosts = fifa_data['HOST']
world_cup_hosts

0                Uruguay
1                  Italy
2                 France
3                 Brazil
4            Switzerland
5                 Sweden
6                  Chile
7                England
8                 Mexico
9           West Germany
10             Argentina
11                 Spain
12                Mexico
13                 Italy
14         United States
15                France
16    South Korea, Japan
17               Germany
18          South Africa
19                Brazil
20                Russia
Name: HOST, dtype: object

Note: You can also retrieve a column as a standalone `DataFrame` instead of a `Series` by using the brackets twice.

In [27]:
world_cup_hosts_df = fifa_data[['HOST']]
world_cup_hosts_df

Unnamed: 0,HOST
0,Uruguay
1,Italy
2,France
3,Brazil
4,Switzerland
5,Sweden
6,Chile
7,England
8,Mexico
9,West Germany


In [29]:
print(type(world_cup_hosts))
print(type(world_cup_hosts_df))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


## 2. Retrieve data for the World Cups held after 1970.

In [33]:
world_cup_after_1970 = fifa_data.loc[fifa_data['YEAR'] > 1970]
world_cup_after_1970

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
9,1974,West Germany,West Germany,Netherlands,Poland,16,38,97,2.6
10,1978,Argentina,Argentina,Netherlands,Brazil,16,38,102,2.7
11,1982,Spain,Italy,West Germany,Poland,24,52,146,2.8
12,1986,Mexico,Argentina,West Germany,France,24,52,132,2.5
13,1990,Italy,West Germany,Argentina,Italy,24,52,115,2.2
14,1994,United States,Brazil,Italy,Sweden,24,52,141,2.7
15,1998,France,France,Brazil,Croatia,32,64,171,2.7
16,2002,"South Korea, Japan",Brazil,Germany,Turkey,32,64,161,2.5
17,2006,Germany,Italy,France,Germany,32,64,147,2.3
18,2010,South Africa,Spain,Netherlands,Germany,32,64,145,2.3


Note: You can also use `loc` to query multiple rows and columns.

It is in the format: `.loc[rows, cols]`
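A short sketch of the `.loc[rows, cols]` format on a four-row excerpt of the World Cup data. One detail worth noticing is that a label slice in `loc` includes both endpoints, unlike ordinary Python slicing.

```python
import pandas as pd

# Four rows excerpted from the World Cup data shown earlier.
cups = pd.DataFrame({
    "YEAR": [1930, 1934, 1938, 1950],
    "HOST": ["Uruguay", "Italy", "France", "Brazil"],
    "CHAMPION": ["Uruguay", "Italy", "Italy", "Uruguay"],
})

# Rows labeled 1 through 2 (both included), two columns by name.
subset = cups.loc[1:2, ["YEAR", "CHAMPION"]]
print(subset)

# The rows part can also be a boolean mask.
hosts_after_1934 = cups.loc[cups["YEAR"] > 1934, "HOST"]
print(hosts_after_1934.tolist())  # ['France', 'Brazil']
```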

In [43]:
# Retrieve the year and the top three finishers for all rows.

world_cup_winners = fifa_data.loc[:, ['YEAR', 'CHAMPION', 'RUNNER UP', 'THIRD PLACE']]
world_cup_winners 

Unnamed: 0,YEAR,CHAMPION,RUNNER UP,THIRD PLACE
0,1930,Uruguay,Argentina,United States
1,1934,Italy,Czechoslovakia,Germany
2,1938,Italy,Hungary,Brazil
3,1950,Uruguay,Brazil,Sweden
4,1954,West Germany,Hungary,Austria
5,1958,Brazil,Sweden,France
6,1962,Brazil,Czechoslovakia,Chile
7,1966,England,West Germany,Portugal
8,1970,Brazil,Italy,West Germany
9,1974,West Germany,Netherlands,Poland


## Position-based

Position-based indexing uses the `iloc` indexer, or the indexing operator `[]` with a slice.

## 1. Retrieve data for the first World Cup entry.

In [37]:
first_entry = fifa_data.iloc[0]
first_entry

YEAR                           1930
HOST                        Uruguay
CHAMPION                    Uruguay
RUNNER UP                 Argentina
THIRD PLACE           United States
TEAMS                            13
MATCHES PLAYED                   16
GOALS SCORED                     70
AVG GOALS PER GAME              3.6
Name: 0, dtype: object

Note: You can also use `iloc` to query multiple rows and columns.

It is in the format: `.iloc[rows, cols]`
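A short sketch of the `.iloc[rows, cols]` format on a four-row excerpt of the World Cup data. Positions start at 0, and a slice excludes its end position, as in ordinary Python slicing.

```python
import pandas as pd

# Four rows excerpted from the World Cup data shown earlier.
cups = pd.DataFrame({
    "YEAR": [1930, 1934, 1938, 1950],
    "HOST": ["Uruguay", "Italy", "France", "Brazil"],
    "CHAMPION": ["Uruguay", "Italy", "Italy", "Uruguay"],
})

# Rows at positions 1 and 2 (3 is excluded), columns at positions 0 and 2.
subset = cups.iloc[1:3, [0, 2]]
print(subset)

# Negative positions count from the end, so -1 is the last row.
last_row = cups.iloc[-1]
print(last_row["YEAR"])  # 1950
```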

In [40]:
# Retrieve the year and the top three finishers for all rows.

world_cup_winners = fifa_data.iloc[:, [0, 2, 3, 4]]
world_cup_winners 

Unnamed: 0,YEAR,CHAMPION,RUNNER UP,THIRD PLACE
0,1930,Uruguay,Argentina,United States
1,1934,Italy,Czechoslovakia,Germany
2,1938,Italy,Hungary,Brazil
3,1950,Uruguay,Brazil,Sweden
4,1954,West Germany,Hungary,Austria
5,1958,Brazil,Sweden,France
6,1962,Brazil,Czechoslovakia,Chile
7,1966,England,West Germany,Portugal
8,1970,Brazil,Italy,West Germany
9,1974,West Germany,Netherlands,Poland


## 2. Retrieve data for the first 5 World Cup Entries.

In [39]:
first_five_entries = fifa_data[:5]
first_five_entries

Unnamed: 0,YEAR,HOST,CHAMPION,RUNNER UP,THIRD PLACE,TEAMS,MATCHES PLAYED,GOALS SCORED,AVG GOALS PER GAME
0,1930,Uruguay,Uruguay,Argentina,United States,13,16,70,3.6
1,1934,Italy,Italy,Czechoslovakia,Germany,16,17,70,4.1
2,1938,France,Italy,Hungary,Brazil,15,18,84,4.7
3,1950,Brazil,Uruguay,Brazil,Sweden,13,22,88,4.0
4,1954,Switzerland,West Germany,Hungary,Austria,16,26,140,5.4


Note: The `[]` operator does not select a single row by position: `fifa_data[0]` looks for a column labeled `0` and raises a `KeyError`. Use `.iloc[0]` for that instead.
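The sketch below shows this behavior on a two-row excerpt of the World Cup data, along with two `iloc` alternatives: one that returns the row as a `Series` and one that keeps it as a one-row `DataFrame`.

```python
import pandas as pd

# Two rows excerpted from the World Cup data shown earlier.
cups = pd.DataFrame({"YEAR": [1930, 1934], "HOST": ["Uruguay", "Italy"]})

# [] with a single value is treated as a column label, and no column is named 0.
try:
    cups[0]
except KeyError:
    lookup_failed = True
else:
    lookup_failed = False
print(lookup_failed)  # True

one_row_series = cups.iloc[0]   # the first row as a Series
one_row_frame = cups.iloc[[0]]  # the first row as a one-row DataFrame
print(one_row_frame)
```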