# Pandas: Turning Raw Data into Usable Data

Most data in the real world is not useful when we first encounter it. It arrives as CSV files from city governments, JSON blobs from APIs, HTML tables scraped from websites, logs from servers, or spreadsheets from someone in accounting who has strong opinions about color coding. The formats differ, but the problem is always the same:

> **“How do I turn this into something I can think with?”**

Data scientists do not stare at raw strings and commas; we need representations that **behave like data**: sortable, filterable, joinable, groupable, and ultimately meaningful. This is where Pandas enters the story.

Pandas is not a database, and it is not a spreadsheet. It borrows from both and adds a third ingredient: the tabular structure of Excel, the indexing and querying habits of SQL, and the vectorized speed of NumPy. For the working data scientist, Pandas becomes a lens: **it shapes raw input into analytical objects**.

The core object in Pandas is the **DataFrame**: a table of columns with names and types, where each column behaves like a mathematical vector and each row behaves like an observation. In a DataFrame, data becomes manipulable; you can filter for a year, compute a rate, sort by price, group by category, or merge two messy tables into a coherent whole.

What makes Pandas powerful is not just the DataFrame itself, but the fluidity of movement between formats. A single line of code can read a CSV from disk, parse JSON from an API, extract a table out of HTML, or connect to SQL. In other words:

> Pandas is where formats become data, and where data becomes analysis.

Pandas encourages a workflow that feels almost grammatical. We load data, inspect it, clean it, reshape it, and only then begin to analyze. The early steps are not busywork—they are how we teach uncooperative data to behave. Pandas gives us tools for dealing with the inevitable artifacts of real-world datasets: missing values, awkward formats, inconsistent categories, surprising outliers. Machines complain about these imperfections; Pandas negotiates with them.

There is also a quiet power in how well Pandas talks to other parts of the ecosystem. Read a CSV? One line. Join two tables? One line. Group by a category and summarize? One line. Convert the result to NumPy for modeling or to Matplotlib for plotting? Also one line. Pandas isn’t flashy; it just fits between everything else data scientists do.

To make this concrete, imagine a small ritual:

```python
import pandas as pd

df = pd.read_csv("flights.csv")
```
Two lines, and suddenly a plain-text file becomes a navigable world. We can ask:

- What does it look like? (using df.head())
- What types do we have? (using df.dtypes)
- How do the numbers behave? (using df.describe())
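
The same ritual can be tried without any file on disk. The sketch below assumes a tiny, invented flights table held in a string (the column names carrier, origin, and delay_min are made up for illustration, not taken from a real dataset) and reads it through io.StringIO before asking the three questions.

```python
import io
import pandas as pd

# Hypothetical stand-in for flights.csv; the columns and values are invented.
csv_text = """carrier,origin,delay_min
AA,JFK,12
UA,SFO,-3
DL,ATL,45
AA,LAX,0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows: what does it look like?
print(df.dtypes)      # column types: what do we have?
print(df.describe())  # summary statistics for the numeric columns
```

Swapping the StringIO object for a file path gives the on-disk version shown earlier.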

---
**Note:** Pandas **does not do** machine learning; that is scikit-learn’s domain. Nor is it a visualization library: plotting belongs to Matplotlib and Seaborn, although Pandas offers thin .plot() wrappers around Matplotlib. But much of data science begins with Pandas creating the table on which those tools operate. Data cleaning, merging, reshaping, joining, pivoting, filtering, aggregating: these verbs are not glamorous, but they are the backbone of analysis.

To borrow a phrase from experimental science: Pandas prepares the specimen. Machine learning, statistics, and modeling are merely the microscope.

---

In this chapter, the goal is not to memorize functions but to cultivate a way of thinking about tabular data.

We will learn:

- how DataFrames are constructed
- how to select, slice, filter, and group data
- how to read from CSV, JSON, etc.
- how to reshape data (wide ↔ long)
- how to merge and join tables


> **Remember:** Pandas is not just a library; it is a bridge between the world where data is stored and the world where data is understood.

# What is Pandas?

Pandas is a powerful open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. The name "pandas" comes from "panel data," a term used in econometrics to describe multi-dimensional data. Developed by Wes McKinney in 2008, pandas has become the cornerstone of data manipulation and analysis in the Python ecosystem.

At its core, Pandas helps you:

* Load, clean, and transform datasets.

* Perform statistical operations efficiently.

* Handle missing or inconsistent data.

* Merge, reshape, and aggregate large datasets.
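
As a rough illustration of the cleaning, merging, and aggregating bullets above, here is a minimal sketch. The orders and regions tables are invented for this example, and filling missing amounts with 0 is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical data: names and values are made up for illustration.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cy"],
    "amount": [20.0, None, 35.0, 10.0],   # one missing value
})
regions = pd.DataFrame({
    "customer": ["ann", "bob", "cy"],
    "region": ["east", "west", "east"],
})

# Handle missing data: fill the gap with 0 (an assumption, not a rule).
orders["amount"] = orders["amount"].fillna(0.0)

# Merge: attach each customer's region to their orders.
merged = orders.merge(regions, on="customer", how="left")

# Aggregate: total order amount per region.
totals = merged.groupby("region")["amount"].sum()
print(totals)
```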

If you have ever worked with spreadsheets in Excel, Pandas offers similar functionality—but with far greater power, speed, and scalability.

# Why Use Pandas?
Before pandas, data analysis in Python was cumbersome and required jumping between different libraries. Python users relied heavily on lists, dictionaries, and NumPy arrays for handling structured data. While these tools are powerful, they lack built-in functionality for common tasks like handling missing values, grouping data, or joining tables. Pandas solved this by providing:

* **Intuitive data structures:** DataFrames and Series feel familiar to users from various backgrounds and cover tabular and one-dimensional data, respectively.

* **Seamless integration:** Works beautifully with other Python data science libraries (NumPy, Matplotlib, etc).

* **Powerful data manipulation:** Easy filtering, grouping, and transformation of data.

* **Performance:** Built on NumPy and compiled C/Cython code for speed.

* **Time series functionality:** Excellent support for working with time-based data.

* **Ease of Use:** Simplifies complex operations into a few lines of readable code.
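
To give one of these bullets some shape, the sketch below illustrates the time series support. It assumes two weeks of invented daily readings and averages them in 7-day bins; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical daily readings for two weeks; the values are invented.
days = pd.date_range("2024-01-01", periods=14, freq="D")
readings = pd.Series(range(14), index=days, name="reading")

# Resample the daily values into 7-day bins and average each bin.
weekly_mean = readings.resample("7D").mean()
print(weekly_mean)
```

The same pattern (a datetime index plus resample) extends to other frequencies and aggregations.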

<center>
<img src="https://devopedia.org/images/article/303/3028.1667792632.jpg" alt="Pandas Features" width="450">
</center>



# Installing Pandas

Before using Pandas, we need to make sure it’s installed. Many data science environments already include it, but not all.

**How?** If you are using the Anaconda Distribution, Pandas comes pre-installed. Otherwise, you can install it with pip:

> **pip install pandas**

Or, in a conda environment that does not already include it (for example, Miniconda):

> **conda install pandas**

---
> What is the Anaconda Distribution?

> The Anaconda Distribution is a popular Python platform that bundles many data science tools (NumPy, Pandas, Jupyter, SciPy, Matplotlib, etc.) into a single installation. It saves beginners from installing packages one-by-one and provides an environment manager for scientific computing. https://www.anaconda.com/products/distribution
---

**Official Pandas website**: https://pandas.pydata.org/

**Confirming Installation:**

To confirm the installation, open a Python shell or a notebook and run:

```python
import pandas as pd
print(pd.__version__)
```
If you see a version number (e.g. 2.2.1), Pandas is installed correctly.

## Using Pandas in Different Environments


**(a) Google Colab**

Google Colab ships with Pandas pre-installed. You can simply import it:

```
import pandas as pd
```

**(b) Jupyter Notebooks**

Jupyter is bundled with Anaconda. If needed, you can install it manually:

> pip install notebook

Once Pandas is installed, it works inside any notebook environment.

### **Why Versions Matter**

Pandas evolves quickly. Version differences affect:

- available functions

- method behaviors (e.g., merge(), read_csv())

- performance improvements
- deprecations

Checking your version makes debugging much easier.

**Python Compatibility**

The Pandas 2.x releases current as of 2024 (2.1 and 2.2) require **Python 3.9 or newer**. Older Python versions may not support newer Pandas releases.

Scientific libraries often move faster than base Python installations, so compatibility matters.
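
A minimal sketch of a version check that could sit at the top of a notebook. The thresholds (Pandas 2.0, Python 3.9) are assumptions chosen for illustration; adjust them to whatever your project actually needs.

```python
import sys
import pandas as pd

# Parse the leading "major.minor" of the installed pandas version.
# Pre-release suffixes (for example "rc1") appear in later fields, so the
# first two fields are assumed to be plain integers.
major, minor = (int(part) for part in pd.__version__.split(".")[:2])

print(f"pandas {pd.__version__} on Python "
      f"{sys.version_info.major}.{sys.version_info.minor}")

# Hypothetical guards: the thresholds below are illustrative only.
if (major, minor) < (2, 0):
    print("This notebook assumes pandas 2.x; some calls may behave differently.")
if sys.version_info < (3, 9):
    print("Recent pandas releases may not install on this Python version.")
```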

# Loading Data with Pandas

The first step of analysis is getting the data into Pandas.

One of Pandas’ biggest strengths is its ability to import and export datasets in a variety of formats: CSV (the most common import format), Excel, JSON (common from APIs), SQL databases (query and load), and others. Here are some examples:

* **CSV:** pd.read_csv("file.csv")

* **Excel:** pd.read_excel("file.xlsx")

* **SQL Databases:** pd.read_sql(query, connection)

* **JSON:** pd.read_json("file.json")


```python
# Example:
import pandas as pd

df = pd.read_csv("students.csv")
print(df.head())   # Displays first 5 rows
```

```
   Unnamed: 0 default student      balance        income
0           1      No      No   729.526495  44361.625074
1           2      No     Yes   817.180407  12106.134700
2           3      No      No  1073.549164  31767.138947
3           4      No      No   529.250605  35704.493935
4           5      No      No   785.655883  38463.495879
```
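
The readers listed above can also be exercised without external files. The sketch below assumes small invented samples held in memory and uses an in-memory SQLite database as a stand-in for a real SQL server; the table name scores is hypothetical.

```python
import io
import sqlite3
import pandas as pd

# All inputs below are small, invented samples held in memory.
csv_text = "name,score\nAlice,90\nBob,85\n"
json_text = '[{"name": "Alice", "score": 90}, {"name": "Bob", "score": 85}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# An in-memory SQLite database stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("scores", conn, index=False)
from_sql = pd.read_sql("SELECT name, score FROM scores", conn)
conn.close()

# Each loader should yield the same two rows.
print(from_csv)
print(from_json)
print(from_sql)
```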


# A Beginner’s Working Model

Before we touch syntax, it helps to understand what Pandas adds to Python.

### **From Python Objects to DataFrames**

Python already has ways to represent data:

- lists ([1, 2, 3])

- dictionaries ({"age": 21, "major": "CS"})

- lists of dictionaries (a common “JSON-like” pattern)

The problem is that **none of these act like a table**. For example, consider a list of student records:

```
students = [
    {"name": "Alice", "year": 2025, "major": "CS"},
    {"name": "Bob",   "year": 2024, "major": "Math"},
]
```


Python can iterate through this, but asking simple questions (“which majors?”, “how many in each year?”, “filter by year”) requires loops, conditionals, and manual bookkeeping.
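
A sketch of that manual bookkeeping in plain Python, using the same student records. collections.Counter is a standard-library helper chosen here for brevity; hand-written loops would be longer still.

```python
from collections import Counter

students = [
    {"name": "Alice", "year": 2025, "major": "CS"},
    {"name": "Bob",   "year": 2024, "major": "Math"},
]

# "Which majors, and how many of each?" needs an explicit counting pass.
major_counts = Counter(student["major"] for student in students)

# "How many in each year?" needs another pass.
year_counts = Counter(student["year"] for student in students)

# "Filter by year" needs a loop with a conditional.
class_of_2025 = [student for student in students if student["year"] == 2025]

print(major_counts)
print(year_counts)
print(class_of_2025)
```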

Pandas provides a format that behaves like a table:

```python
import pandas as pd
df = pd.DataFrame(students)
```

Now the same questions become natural:

```python
df["major"].value_counts()
df[df["year"] == 2025]
```

This illustrates the core Pandas philosophy:

> Take common data-analysis questions and make them **one line instead of a small program.**

# Core Data Structures
It’s tempting to think Pandas has only one object — the DataFrame — because it dominates most workflows. But understanding the smaller building block helps.

The strength of Pandas lies in two core objects:
1.   **Series:** a single column (one-dimensional labeled array)
2.   **DataFrame:** a table of columns (two-dimensional labeled data structure)

People often describe the relationship like this:

> A DataFrame is a collection of Series that share the same index.

That index part will matter later.

<center>
<img src="https://miro.medium.com/v2/resize:fit:1400/0*TB7RB0d21huRNGjI.png" alt="Pandas Illustration" width="600">
</center>

You won’t use Series much explicitly at first, but understanding it pays off later (especially for groupby, apply, or time series).
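
The relationship between the two objects can be checked directly. This is a minimal sketch; the row labels s1 and s2 are hypothetical and chosen only to make the shared index visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "year": [2025, 2024]},
    index=["s1", "s2"],  # hypothetical row labels chosen for illustration
)

# Selecting one column returns a Series...
year = df["year"]
print(type(year).__name__)

# ...that carries the same index as the DataFrame it came from.
print(year.index.equals(df.index))

# Going the other way: a Series with matching labels slots in as a new column.
major = pd.Series(["CS", "Math"], index=["s1", "s2"], name="major")
combined = pd.concat([df, major], axis=1)
print(combined)
```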








## Series: The One-Dimensional Workhorse
A Series is a one-dimensional labeled array that can hold any data type. Think of it as a single column in a spreadsheet.


<center>
<img src="https://www.w3resource.com/w3r_images/pandas-series-add-image-3.svg" alt="Pandas Series">
</center>

Unlike arrays that require all elements to be the same type (homogeneous), a Series can store different types of values together, such as numbers, text, or dates. When it does, Pandas falls back to the generic object dtype, which is more flexible but slower than a single numeric type. Each value has a label called an index, which can be numbers, words, or timestamps, and you can use it to quickly find or select values. Here are some examples:

### How to Create a Series

A Series can be created directly from a ***Python list***, in which case pandas automatically assigns default numeric indexes (0, 1, 2, …) to each element.

You can also create a Series from a ***Python dictionary***, where the dictionary keys become the index labels and the dictionary values become the Series values. In Python 3.7 and later, the order of the keys is preserved, so the Series keeps the same order as the dictionary.

```python
import pandas as pd

# Creating a Series from a list
temperatures = pd.Series([22, 25, 18, 30, 27],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                         name='Daily_Temps')
print(temperatures)


# Creating a Series from a Dictionary

grades = {"Math": 90, "English": 85, "Science": 95}
dict_series = pd.Series(grades)

print(dict_series)
```

```
Mon    22
Tue    25
Wed    18
Thu    30
Fri    27
Name: Daily_Temps, dtype: int64
Math       90
English    85
Science    95
dtype: int64
```
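
Since the index exists to find and select values, here is a short sketch of label-based selection on the same temperatures Series. The threshold of 24 is arbitrary and chosen only for illustration.

```python
import pandas as pd

temperatures = pd.Series([22, 25, 18, 30, 27],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                         name='Daily_Temps')

# Look up a single value by its label.
print(temperatures['Wed'])

# Slice by label; unlike positional slices, both endpoints are included.
print(temperatures.loc['Tue':'Thu'])

# Select by condition: days warmer than 24 degrees.
print(temperatures[temperatures > 24])
```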


```python
import pandas as pd

# Homogeneous Series (all integers)
print("Homogeneous Series \n")
homo_series = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
print(homo_series)
print(f"The data type is: {homo_series.dtype}\n")

# Heterogeneous Series (mix of int, float, string, bool)
print("Heterogeneous Series \n")
hetero_series = pd.Series([10, 20.5, 'hello', True])
print(hetero_series)
print(f"The data type is: {hetero_series.dtype}")
```

```
Homogeneous Series

A    10
B    20
C    30
D    40
dtype: int64
The data type is: int64

Heterogeneous Series

0       10
1     20.5
2    hello
3     True
dtype: object
The data type is: object
```
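
A small sketch of what the object dtype means in practice: numeric mixes are upcast to one numeric type, while a Series containing a string keeps each element as its original Python object. The name numeric_mix is introduced here only for illustration.

```python
import pandas as pd

# Integers and floats together are upcast to a single numeric dtype.
numeric_mix = pd.Series([10, 20.5])
print(numeric_mix.dtype)

# With a string in the mix, pandas falls back to the generic object dtype,
# and each element keeps its own Python type.
hetero_series = pd.Series([10, 20.5, 'hello', True])
element_types = [type(value).__name__ for value in hetero_series]
print(element_types)
```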


## DataFrame: The Two-Dimensional Powerhouse
A DataFrame is a two-dimensional labeled data structure, similar to a table with rows and columns. It is the most commonly used object in Pandas.


<center>
<img src="https://pynative.com/wp-content/uploads/2021/02/dataframe.png" alt="Pandas DF1" width="500">
</center>


<center>
<img src="https://pynative.com/wp-content/uploads/2021/02/pandas-dataframe-from-dictionary.png" alt="Pandas DF2" width="500">
</center>

Every DataFrame has an **index** on the left side. When you load a CSV without naming an index column, Pandas assigns a default integer index starting at 0:

```
      name   year   major
0    Alice   2025    CS
1      Bob   2024  Math
```

Spreadsheets also have row numbers, but Pandas treats the index as **part of the data structure itself**. That is what enables fast label lookups, automatic alignment between objects, index-based joins, label slicing, and time-based operations.
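
Alignment is the least obvious of these benefits, so here is a minimal sketch. The two score Series are invented; the point is that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
midterm = pd.Series({"Alice": 80, "Bob": 70, "Cara": 90})
final = pd.Series({"Cara": 95, "Alice": 85, "Bob": 75})

# Addition matches values by label, not by position.
total = midterm + final
print(total)

# A column can be promoted to the index, enabling lookups by label.
df = pd.DataFrame({"name": ["Alice", "Bob"], "year": [2025, 2024]})
by_name = df.set_index("name")
print(by_name.loc["Bob", "year"])
```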



Here are some examples:

```python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
}

df = pd.DataFrame(data)
print(df)
```

```
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    Diana   28     Tokyo
```


```python
import pandas as pd

# Create a simple dataset
data = {
    'Product': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.20, 0.50, 3.00, 2.50],
    'Stock': [45, 120, 15, 80]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information
print("Our DataFrame:")
print(df)
print("\nData types:")
print(df.dtypes)
print("\nBasic statistics:")
print(df.describe())
```

```
Our DataFrame:
  Product  Price  Stock
0   Apple    1.2     45
1  Banana    0.5    120
2  Cherry    3.0     15
3    Date    2.5     80

Data types:
Product     object
Price      float64
Stock        int64
dtype: object

Basic statistics:
         Price       Stock
count  4.00000    4.000000
mean   1.80000   65.000000
std    1.15181   45.276926
min    0.50000   15.000000
25%    1.02500   37.500000
50%    1.85000   62.500000
75%    2.62500   90.000000
max    3.00000  120.000000
```
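
Once the table exists, the everyday verbs (compute, filter, sort) are short. The sketch below reuses the same product data; the Inventory_Value column name and the price threshold of 1.0 are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.20, 0.50, 3.00, 2.50],
    'Stock': [45, 120, 15, 80],
})

# Compute a new column from existing ones (vectorized, no explicit loop).
df['Inventory_Value'] = df['Price'] * df['Stock']

# Filter rows with a boolean condition.
pricey = df[df['Price'] > 1.0]

# Sort by a column, most expensive first.
by_price = df.sort_values('Price', ascending=False)

print(pricey)
print(by_price[['Product', 'Price']])
```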
