# **What is pandas?**
![pandas-image](https://preview.redd.it/c6h7rok9c2v31.jpg?auto=webp&s=55820f5f994a744ff9d774fa2c6b3b56539a302f)

pandas is a widely used open-source Python library that provides high-performance data manipulation and analysis tools. It is built on top of the NumPy library, which provides support for multi-dimensional arrays and mathematical functions in Python.

pandas provides a wide range of functionalities for data manipulation, including:

- Reading and writing data from various file formats like CSV, Excel, SQL databases, etc.
- Data selection, slicing, and filtering.
- Handling missing data.
- Grouping and aggregating data.
- Merging and joining datasets.
- Reshaping and pivoting data.
- Time series analysis.
- Data visualization integration with libraries like Matplotlib and Seaborn.

# **Why should I use pandas?**

- Learning pandas teaches the foundational concepts of data manipulation and builds practical Python coding skills.
- Its core operations are simple and straightforward, so you can apply them right away to many kinds of datasets.
- It is a staple tool in the data science and machine learning communities.

# **How do I install pandas?**

1. Open your terminal and run `pip install pandas`.
2. In a `.ipynb` notebook or a `.py` file, load pandas by importing it: `import pandas as pd`

# **Importing pandas**

In [1]:
import pandas as pd

# **Two basic data structures of pandas**

These are the `DataFrame` and `Series` data structures.

`DataFrame` is a 2-dimensional data structure designed for handling tabular data with multiple columns.\
`Series` is a 1-dimensional data structure that represents a single column or row of a `DataFrame`, or stands alone as its own data structure.

Think about it like an Excel sheet. The `DataFrame` is basically the entire sheet while the `Series` is a column or row of that sheet. Therefore, a `DataFrame` is made up of `Series` data structures.
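
To make the sheet analogy concrete, here is a minimal sketch. The `sheet` table and its values are hypothetical and used only for illustration. Pulling one column or one row out of a `DataFrame` gives back a `Series`.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration
sheet = pd.DataFrame({"name": ["Jane", "Bill"], "score": [90, 80]})

column = sheet["score"]  # one column of the sheet -> Series
row = sheet.iloc[0]      # one row of the sheet -> Series

print(type(sheet).__name__)
print(type(column).__name__)
print(type(row).__name__)
```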

In [None]:
# Three ways to create a pandas Series: from a list, from a dict, or from a list with a custom index.

obj = pd.Series([1,"John",3.5,"Hey"])
obj

In [None]:
score = {
    "Jane":90, 
    "Bill":80,
    "Elon":85,
    "Tom":75,
    "Tim":95}
names = pd.Series(score) # Convert to Series 
names

In [None]:
grades = [1, 2, 3, 4, 5]
subjects = pd.Series(grades, index=["Math", "Science", "English", "AP", "Filipino"])
subjects

In [None]:
data = {
    "name":["Bill","Tom","Tim","John","Alex","Vanessa","Kate"],
    "score":[90,80,85,75,95,60,65],
    "sport":["Wrestling","Football","Skiing","Swimming","Tennis",
               "Karate","Surfing"],
    "sex":["M","M","M","M","F","F","F"]
    }
people = pd.DataFrame(data)
people

In [None]:
# Creating two Pandas Series
series1 = pd.Series([10, 20, 30, 40], name='Series_1')
series2 = pd.Series(['A', 'B', 'C', 'D'], name='Series_2')

# Combining the two Series into a DataFrame
df = pd.DataFrame({'Column_1': series1, 'Column_2': series2})
df

# **Importing your dataset**

We will be using a FIFA World Cup dataset stored as a `.csv` file, a file type you will encounter often in data science work.


## *Why .csv?*

CSV files are preferred in data science because they are plain text: they are simple, readable by almost any tool, lightweight, and easy to track in version control. An .xlsx file, by contrast, is a zipped archive that needs dedicated software or libraries to read.

In [None]:
fifa_data = pd.read_csv('FIFA_WorldCup_data.csv')

Now that the .csv has been loaded into `fifa_data`, you can display it by putting the variable name on the last line of a cell.

In [None]:
fifa_data

In [None]:
# You can use .head() to preview the first 5 rows.

fifa_data.head()

In [None]:
# You can use .tail() to preview the last 5 rows.

fifa_data.tail()

In [None]:
# .shape is an attribute (not a method) that returns a (rows, columns) tuple for a DataFrame

fifa_data.shape

In [None]:
# .describe() computes summary statistics (count, mean, std, min, percentiles, max) for the numerical values of a Series or DataFrame.

fifa_data.describe() 
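
If you do not have the dataset file at hand, the same inspection calls can be tried on a small in-memory CSV. This is only a sketch: the `csv_text` sample below is a hypothetical stand-in with three columns, and the real `FIFA_WorldCup_data.csv` may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; the actual dataset may have other columns
csv_text = """YEAR,HOST,CHAMPION
1930,Uruguay,Uruguay
1934,Italy,Italy
1938,France,Italy
1950,Brazil,Uruguay
1954,Switzerland,West Germany
1958,Sweden,Brazil
"""

# read_csv accepts any file-like object, not only a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)       # (rows, columns)
print(sample.head(2))     # first 2 rows
print(sample.tail(2))     # last 2 rows
print(sample.describe())  # statistics for the numeric column only
```

Both `.head()` and `.tail()` take an optional row count, which defaults to 5.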

# **Indexing**

Indexing is vital in data science for efficient data retrieval, manipulation, and analysis. 

There are three main ways of indexing pandas DataFrames:
1. Label-based
2. Position-based (integer)
3. Boolean-based

## Label-based

Label-based indexing most commonly uses the indexing operator `[]` and the `loc` indexer. The `.` attribute operator and the `at` indexer also work, but they are less common.
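
As a rough side-by-side of the four label-based options, consider the sketch below. The `toy` frame and its row labels `"first"` and `"second"` are made up for illustration and are not part of the FIFA dataset.

```python
import pandas as pd

# Hypothetical toy frame with custom row labels
toy = pd.DataFrame(
    {"HOST": ["Uruguay", "Italy"], "YEAR": [1930, 1934]},
    index=["first", "second"],
)

by_brackets = toy["HOST"]          # column by label -> Series
by_attribute = toy.HOST            # same column via attribute access
by_loc = toy.loc["first", "HOST"]  # row label and column label
by_at = toy.at["first", "HOST"]    # single value only

print(by_loc, by_at)
```

Attribute access only works when the column name is a valid Python identifier, so a column such as `RUNNER UP` has to be reached with `[]` or `loc`.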

## 1. Retrieve the hosts of the World Cup as a pandas Series

In [None]:
world_cup_hosts = fifa_data['HOST']
world_cup_hosts

Note: You can also retrieve the column as a standalone `DataFrame` instead of a `Series` by using the brackets twice.

In [None]:
world_cup_hosts_df = fifa_data[['HOST']]
world_cup_hosts_df

In [None]:
print(type(world_cup_hosts))
print(type(world_cup_hosts_df))

## 2. Retrieve data for the World Cups held after 1970.

In [None]:
world_cup_after_1970 = fifa_data.loc[fifa_data['YEAR'] > 1970]
world_cup_after_1970
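
The condition passed to `loc` here is a boolean `Series` (often called a mask), which is the Boolean-based style of indexing. Here is a small sketch on a hypothetical `toy` frame that shows the mask itself and how two conditions can be combined:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    "YEAR": [1966, 1970, 1974, 1978],
    "HOST": ["England", "Mexico", "West Germany", "Argentina"],
})

mask = toy["YEAR"] > 1970  # one True/False value per row
print(mask.tolist())

after_1970 = toy.loc[mask]  # keeps only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
between = toy.loc[(toy["YEAR"] >= 1970) & (toy["YEAR"] < 1978)]
print(between)
```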

Note: You can also use `loc` to query multiple rows and columns.

It is in the format: `.loc[rows, cols]`

In [None]:
# Retrieve all rows with the corresponding year and the top three winners.

world_cup_winners = fifa_data.loc[:, ['YEAR', 'CHAMPION', 'RUNNER UP', 'THIRD PLACE']]
world_cup_winners 
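
The row part of `.loc[rows, cols]` can also be narrowed. The sketch below uses a hypothetical `toy` frame with a default integer index. Note that a label slice in `loc` includes its end label.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    "YEAR": [1930, 1934, 1938],
    "HOST": ["Uruguay", "Italy", "France"],
    "CHAMPION": ["Uruguay", "Italy", "Italy"],
})

# Rows labelled 0 through 1 (end label included), two columns
subset = toy.loc[0:1, ["YEAR", "CHAMPION"]]

# Rows chosen by a condition, a single column -> Series
hosts_after_1930 = toy.loc[toy["YEAR"] > 1930, "HOST"]

print(subset)
print(hosts_after_1930)
```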

## Position-based

Position-based indexing uses the `iloc` indexer. The indexing operator `[]` also accepts positional row slices.

## 1. Retrieve data for the first World Cup entry.

In [None]:
first_entry = fifa_data.iloc[0]
first_entry

Note: You can also use `iloc` to query multiple rows and columns.

It is in the format: `.iloc[rows, cols]`

In [None]:
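# Retrieve all rows with the corresponding year and the top three winners, selecting columns by position.
# This assumes YEAR, CHAMPION, RUNNER UP, and THIRD PLACE sit at positions 0, 2, 3, and 4.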

world_cup_winners = fifa_data.iloc[:, [0, 2, 3, 4]]
world_cup_winners 
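
Rows can be narrowed by position as well. In this sketch on a hypothetical `toy` frame, the `iloc` slice excludes its end position, unlike the label slices used with `loc`.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    "YEAR": [1930, 1934, 1938],
    "HOST": ["Uruguay", "Italy", "France"],
    "CHAMPION": ["Uruguay", "Italy", "Italy"],
})

subset = toy.iloc[0:2, [0, 2]]  # rows 0 and 1 (end excluded), columns 0 and 2
last_row = toy.iloc[-1]         # negative positions count from the end
cell = toy.iloc[1, 1]           # single value at row 1, column 1

print(subset)
print(cell)
```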

## 2. Retrieve data for the first 5 World Cup entries.

In [None]:
first_five_entries = fifa_data[:5]
first_five_entries

Note: The `[]` operator does not work for querying one row by its position, because a single value inside `[]` is treated as a column label. You're better off using `.iloc` for that.
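
The difference can be seen on a hypothetical `toy` frame. A bare integer in `[]` is looked up as a column name and fails when no such column exists, while a slice or `.iloc` selects rows.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"YEAR": [1930, 1934], "HOST": ["Uruguay", "Italy"]})

try:
    toy[0]  # looked up as a column label; there is no column named 0
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

one_row_frame = toy[0:1]      # a slice selects rows -> one-row DataFrame
one_row_series = toy.iloc[0]  # a position in iloc -> Series for that row

print(one_row_frame)
print(one_row_series)
```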