# Introduction to Pandas

**Pandas** is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is widely used in data science, data analysis, and machine learning.

## Why Pandas?
- Provides **DataFrames**, a powerful two-dimensional data structure for handling and analyzing data.
- Supports **Series**, a one-dimensional labeled array that can hold values of any data type.
- Built on **NumPy**, enabling integration with its functions and performance benefits.
- Seamless handling of missing data.
- Functions for data transformation, merging, and aggregation.
- Easy integration with **matplotlib** and **seaborn** for visualization.
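
As a rough illustration of the missing-data, merging, and aggregation features listed above, here is a minimal sketch. The `orders` and `regions` tables are hypothetical data invented for this example, and filling missing amounts with zero is only one possible policy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cid"],
    "amount": [10.0, np.nan, 5.0, 8.0],  # one missing value
})
regions = pd.DataFrame({
    "customer": ["ann", "bob", "cid"],
    "region": ["north", "south", "north"],
})

# Missing data: replace NaN in 'amount' with 0 before aggregating.
orders_filled = orders.fillna({"amount": 0.0})

# Merging: attach each customer's region (SQL-style left join).
merged = orders_filled.merge(regions, on="customer", how="left")

# Aggregation: total amount per region.
totals = merged.groupby("region")["amount"].sum()
print(totals)
```

Note that `fillna` and `merge` both return new objects, so `orders` still contains its original `NaN`.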

## Applications of Pandas
1. **Data Cleaning and Preparation**: Handling missing data, duplicate rows, or inconsistent entries.
2. **Exploratory Data Analysis (EDA)**: Statistical summaries and visualizations.
3. **Time Series Analysis**: Managing time-indexed data.
4. **Integration**: Used with machine learning libraries such as **scikit-learn** for preprocessing.
5. **Financial Data Analysis**: Managing large datasets and complex indexing.
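
To make the data cleaning item more concrete, the sketch below handles a missing key, inconsistent spellings, a duplicate row, and a missing measurement. The `raw` table and its column names are hypothetical, and filling with the column mean is an assumption chosen for brevity, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data, invented for illustration.
raw = pd.DataFrame({
    "city": ["Paris", "paris ", "Rome", "Rome", None],
    "temp_c": [21.0, 21.0, np.nan, 25.0, 19.0],
})

cleaned = (
    raw
    .dropna(subset=["city"])  # drop rows that have no city at all
    .assign(city=lambda d: d["city"].str.strip().str.title())  # normalize spelling
    .drop_duplicates()  # remove rows that are now exact repeats
    .reset_index(drop=True)
)

# Fill the remaining missing temperature with the column mean.
cleaned["temp_c"] = cleaned["temp_c"].fillna(cleaned["temp_c"].mean())
print(cleaned)
```

The order matters here: normalizing the city names first is what lets `drop_duplicates()` recognize the two Paris rows as the same entry.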

## Relationship with NumPy
- **NumPy** arrays are the default storage behind most pandas Series and DataFrame columns.
- Filtering, transformations, and statistical functions are vectorized and largely delegate to NumPy's compiled routines, which is where much of pandas's speed comes from.
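
The relationship can be seen directly in a small sketch: a Series can expose its values as a NumPy array, and NumPy functions can be applied to a Series without losing its labels. The sample values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=["a", "b", "c", "d"])

# A Series can hand back its values as a NumPy array.
arr = s.to_numpy()
print(type(arr))

# NumPy universal functions accept a Series and keep its index labels.
abs_s = np.abs(s)
print(abs_s)

# Filtering and reductions are vectorized, so no Python loop is needed.
positives = s[s > 0]
total = s.sum()
print(positives, total)
```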


# Pandas Tutorial
We will now dive into the core data structures and functions of pandas.


In [7]:
import pandas as pd
import numpy as np


## 1. Pandas Data Structures

### a. Series
- A **one-dimensional** labeled array capable of holding any data type.


In [8]:
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s

a    3
b   -5
c    7
d    4
dtype: int64
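
A short sketch of how the labels on this Series might be used: values can be looked up by label, and arithmetic applies element-wise. The `mixed` Series is a hypothetical extra, added to show what happens when the values do not share a type.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=["a", "b", "c", "d"])

print(s["b"])         # single label -> scalar (-5)
print(s[["a", "d"]])  # list of labels -> Series
print(s * 2)          # arithmetic applies element-wise

# Mixed values are allowed, but the Series falls back to the generic
# object dtype, which is slower than a numeric dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```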

### b. DataFrame
- A **two-dimensional** labeled data structure with columns of potentially different types.


In [9]:
data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasilia'],
    'Population': [11190846, 1339171035, 207847528]
}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df

   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1339171035
2   Brazil   Brasilia   207847528
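
Once a DataFrame like this exists, a few common ways to inspect and select from it can be sketched as follows. The 100 million threshold is arbitrary and chosen only to make the filter return a subset.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
    "Population": [11190846, 1339171035, 207847528],
})

print(df.shape)   # (3, 3): rows, columns
print(df.dtypes)  # one dtype per column

capitals = df["Capital"]                    # single column -> Series
subset = df[["Country", "Population"]]      # list of columns -> DataFrame
large = df[df["Population"] > 100_000_000]  # boolean row filter
print(large)
```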


## 2. Dropping
- `drop()` removes rows or columns from a DataFrame, or values from a Series, and returns a new object; the original is left unchanged.


In [10]:
# Drop Series values by index label (returns a new Series; s is unchanged)
s.drop(['a', 'c'])

# Drop a column by name (axis=1 selects columns; df is unchanged)
df.drop('Country', axis=1)

     Capital  Population
0   Brussels    11190846
1  New Delhi  1339171035
2   Brasilia   207847528
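
A few variations on dropping are sketched below, rebuilding the same DataFrame so the block runs on its own. The column name `'Continent'` is hypothetical and deliberately absent, to show the `errors='ignore'` option.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
    "Population": [11190846, 1339171035, 207847528],
})

# Drop rows by index label (axis=0 is the default).
dropped_rows = df.drop([0, 2])

# Keyword form for columns, equivalent to passing axis=1.
dropped_cols = df.drop(columns=["Country", "Capital"])

# A missing label normally raises KeyError; errors='ignore' skips it.
unchanged = df.drop(columns=["Continent"], errors="ignore")

# drop() returns a new object, so df itself still has all rows and columns.
print(df.shape)
```

To keep the result, assign it back, for example `df = df.drop(columns=['Capital'])`.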


## 3. Asking for Help
- Pandas provides extensive documentation and function-level help.


In [11]:
help(pd.Series.loc)

Help on property:

    Access a group of rows and columns by label(s) or a boolean array.

    ``.loc[]`` is primarily label based, but may also be used with a
    boolean array.

    Allowed inputs are:

    - A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
      interpreted as a *label* of the index, and **never** as an
      integer position along the index).
    - A list or array of labels, e.g. ``['a', 'b', 'c']``.
    - A slice object with labels, e.g. ``'a':'f'``.

      .. warning:: Note that contrary to usual python slices, **both** the
          start and the stop are included

    - A boolean array of the same length as the axis being sliced,
      e.g. ``[True, False, True]``.
    - An alignable boolean Series. The index of the key will be aligned before
      masking.
    - An alignable Index. The Index of the returned selection will be the input.
    - A ``callable`` function with one argument (the calling Series or
      DataFrame) and that returns valid output for indexing (one of the above)

    See more at :ref:`Selection by Label
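
The input types listed in the help text above can be tried on a small Series. This is a minimal sketch covering several of them; note in particular that a label slice includes its stop label.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=["a", "b", "c", "d"])

single = s.loc["b"]              # single label -> scalar
several = s.loc[["a", "c"]]      # list of labels -> Series
sliced = s.loc["a":"c"]          # label slice: both 'a' and 'c' are included
masked = s.loc[s > 0]            # boolean Series, aligned on the index
called = s.loc[lambda x: x < 0]  # callable that returns a boolean mask

print(sliced)
```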

## 4. Sorting and Ranking
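- `sort_index()` and `sort_values()` reorder data by labels or by values; `rank()` keeps the order and instead assigns each entry its rank within its column.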


In [12]:
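# Sort rows by index label (the default index 0, 1, 2 is already in order)
df.sort_index()

# Sort rows by the values in the 'Country' column
df.sort_values(by='Country')

# Rank the entries within each column (1.0 = smallest; strings rank alphabetically)
df.rank()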

   Country  Capital  Population
0      1.0      2.0         1.0
1      3.0      3.0         3.0
2      2.0      1.0         2.0
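
The sorting and ranking calls also take options for direction and tie handling. The sketch below rebuilds the same DataFrame and shows a descending sort plus a rank where 1 means the largest population; the choice of `method='min'` is just one of the available tie-breaking rules.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
    "Population": [11190846, 1339171035, 207847528],
})

# Largest population first.
by_pop_desc = df.sort_values(by="Population", ascending=False)

# Reverse the row order by index label.
by_label_desc = df.sort_index(ascending=False)

# Rank a single column: 1 = largest; ties would share the lowest rank.
pop_rank = df["Population"].rank(ascending=False, method="min").astype(int)

print(by_pop_desc)
print(pop_rank)
```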
