# What is Pandas?

Pandas is a powerful open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. The name "pandas" comes from "panel data," a term used in econometrics to describe multi-dimensional data. Developed by Wes McKinney in 2008, pandas has become the cornerstone of data manipulation and analysis in the Python ecosystem.

At its core, Pandas helps you:

* Load, clean, and transform datasets.

* Perform statistical operations efficiently.

* Handle missing or inconsistent data.

* Merge, reshape, and aggregate large datasets.

If you have ever worked with spreadsheets in Excel, Pandas offers similar functionality—but with far greater power, speed, and scalability.

# Why Use Pandas?
Before pandas, data analysis in Python was cumbersome and required jumping between different libraries. Python users relied heavily on lists, dictionaries, and NumPy arrays for handling structured data. While these tools are powerful, they lack built-in functionality for common tasks like handling missing values, grouping data, or joining tables. Pandas solved this by providing:

* **Intuitive data structures:** DataFrames for tabular data and Series for one-dimensional data, both of which feel familiar to users from various backgrounds.

* **Seamless integration:** Works well with other Python data science libraries (NumPy, Matplotlib, etc.).

* **Powerful data manipulation:** Easy filtering, grouping, and transformation of data.

* **Performance:** Built on top of highly optimized C code for speed.

* **Time series functionality:** Excellent support for working with time-based data.

* **Ease of Use:** Simplifies complex operations into a few lines of readable code.

# Installing Pandas

Before using Pandas, you need to install it. If you are using Anaconda, Pandas comes pre-installed. Otherwise, you can install it with:

> pip install pandas

In a conda environment where it is missing (for example, Miniconda), use:


> conda install pandas


To confirm the installation, open a Python shell and type:

> import pandas as pd
> print(pd.__version__)

# Loading Data with Pandas

One of Pandas’ biggest strengths is its ability to easily import/export datasets from multiple formats:

* **CSV:** `pd.read_csv("file.csv")`

* **Excel:** `pd.read_excel("file.xlsx")`

* **SQL Databases:** `pd.read_sql(query, connection)`

* **JSON:** `pd.read_json("file.json")`


In [2]:
#Example:
import pandas as pd

df = pd.read_csv("students.csv")
print(df.head())   # Displays first 5 rows

   Unnamed: 0 default student      balance        income
0           1      No      No   729.526495  44361.625074
1           2      No     Yes   817.180407  12106.134700
2           3      No      No  1073.549164  31767.138947
3           4      No      No   529.250605  35704.493935
4           5      No      No   785.655883  38463.495879
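The `Unnamed: 0` column in the output above typically appears when a file was saved together with its row index. The sketch below uses a small in-memory CSV with hypothetical values (not the real `students.csv`) to show how `index_col=0` tells `read_csv` to treat that first column as the index instead.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column has no header (a saved row index)
csv_text = ",default,student\n1,No,No\n2,No,Yes\n"

# Default behaviour: the unnamed first column is loaded as "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))

# Using the first column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

Whether this is the right fix depends on the file: if the first column holds meaningful data rather than a saved index, it is better left as a regular column.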


# Core Data Structures
The strength of Pandas lies in two core objects:
1.   **Series:** A one-dimensional labeled array
2.   **DataFrame:** A two-dimensional labeled data structure

<center>
<img src="https://miro.medium.com/v2/resize:fit:1400/0*TB7RB0d21huRNGjI.png" alt="Pandas Illustration" width="600">
</center>








## Series: The One-Dimensional Workhorse
A Series is a one-dimensional labeled array that can hold any data type. Think of it as a single column in a spreadsheet.


<center>
<img src="https://www.w3resource.com/w3r_images/pandas-series-add-image-3.svg" alt="Pandas Series">
</center>

Unlike arrays that require every element to share one type (homogeneous), a Series can hold values of different types together, such as numbers, text, or dates. When it does, pandas stores them with the generic `object` dtype, which is flexible but slower than a single numeric type. Each value has a label called an index, which can consist of numbers, words, or timestamps, and which you can use to quickly find or select values.
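As a minimal sketch of how index labels are used for lookup, the snippet below builds a small Series of daily temperatures (illustrative values) and selects values by label.

```python
import pandas as pd

# The index holds the labels, the values hold the measurements
temperatures = pd.Series([22, 25, 18, 30, 27],
                         index=["Mon", "Tue", "Wed", "Thu", "Fri"])

# Look up a single value by its label
wednesday = temperatures["Wed"]

# Select several labels at once; the result is a new Series
ends_of_week = temperatures[["Mon", "Fri"]]

print(wednesday)
print(ends_of_week)
```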

### How to Create a Series

A Series can be created directly from a ***Python list***, in which case pandas automatically assigns default numeric indexes (0, 1, 2, …) to each element.

You can also create a Series from a ***Python dictionary***, where the dictionary keys become the index labels and the dictionary values become the Series values. In Python 3.7 and later, the order of the keys is preserved, so the Series keeps the same order as the dictionary.

In [3]:
import pandas as pd

# Creating a Series from a list
temperatures = pd.Series([22, 25, 18, 30, 27],
                        index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                        name='Daily_Temps')
print(temperatures)


# Creating a Series from a Dictionary

grades = {"Math": 90, "English": 85, "Science": 95}
dict_series = pd.Series(grades)

print(dict_series)

Mon    22
Tue    25
Wed    18
Thu    30
Fri    27
Name: Daily_Temps, dtype: int64
Math       90
English    85
Science    95
dtype: int64


In [4]:
import pandas as pd

# Homogeneous Series (all integers)
print("Homogeneous Series \n")
homo_series = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
print(homo_series)
print(f"The data type is: {homo_series.dtype}\n")

# Heterogeneous Series (mix of int, float, string, bool)
print("Heterogeneous Series \n")
hetero_series = pd.Series([10, 20.5, 'hello', True])
print(hetero_series)
print(f"The data type is: {hetero_series.dtype}")

Homogeneous Series 

A    10
B    20
C    30
D    40
dtype: int64
The data type is: int64

Heterogeneous Series 

0       10
1     20.5
2    hello
3     True
dtype: object
The data type is: object


## DataFrame: The Two-Dimensional Powerhouse
A DataFrame is a two-dimensional labeled data structure, similar to a table with rows and columns. It is the most commonly used object in Pandas.
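One way to picture a DataFrame is as a set of Series that share the same row index. The short sketch below, using a made-up two-row table, shows that selecting one column returns a Series and that rows and columns each carry their own labels.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

ages = df["Age"]  # selecting one column returns a Series

print(type(ages).__name__)   # type of a single column
print(df.columns.tolist())   # column labels
print(df.index.tolist())     # row labels (default: 0, 1, ...)
```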


<center>
<img src="https://pynative.com/wp-content/uploads/2021/02/dataframe.png" alt="Pandas DF1" width="500">
</center>


<center>
<img src="https://pynative.com/wp-content/uploads/2021/02/pandas-dataframe-from-dictionary.png" alt="Pandas DF2" width="500">
</center>

Here are some examples:

In [5]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
}

df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    Diana   28     Tokyo


In [6]:
import pandas as pd

# Create a simple dataset
data = {
    'Product': ['Apple', 'Banana', 'Cherry', 'Date'],
    'Price': [1.20, 0.50, 3.00, 2.50],
    'Stock': [45, 120, 15, 80]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display basic information
print("Our DataFrame:")
print(df)
print("\nData types:")
print(df.dtypes)
print("\nBasic statistics:")
print(df.describe())

Our DataFrame:
  Product  Price  Stock
0   Apple    1.2     45
1  Banana    0.5    120
2  Cherry    3.0     15
3    Date    2.5     80

Data types:
Product     object
Price      float64
Stock        int64
dtype: object

Basic statistics:
         Price       Stock
count  4.00000    4.000000
mean   1.80000   65.000000
std    1.15181   45.276926
min    0.50000   15.000000
25%    1.02500   37.500000
50%    1.85000   62.500000
75%    2.62500   90.000000
max    3.00000  120.000000


# Pandas Basic Operations

After reading tabular data into a DataFrame, take a first look at it before doing anything else. Pandas makes this exploration easy: you can check how many rows and columns the dataset has, what the columns are called, and review summary information such as data types and statistics. Pandas provides convenient methods for this.


## Viewing/Exploring Data



| Command         | Description                                                                                   | Default Behavior                    |
| --------------- | --------------------------------------------------------------------------------------------- | ----------------------------------- |
| `df.head()`     | Displays the **first rows** of the DataFrame. Useful for quickly previewing the dataset.      | Shows **5 rows**                    |
| `df.tail()`     | Displays the **last rows** of the DataFrame. Handy for checking the dataset’s ending records. | Shows **5 rows**                    |
| `df.shape`      | Returns a tuple `(rows, columns)` representing the **dimensions** of the DataFrame.           | N/A                                 |
| `df.info()`     | Shows column names, **data types**, memory usage, and count of non-null values.               | N/A                                 |
| `df.dtypes`     | Returns the **data type of each column** in the DataFrame.                                    | N/A                                 |
| `df.describe()` | Provides **summary statistics** (mean, std, min, max, quartiles) for numeric columns.         | Includes numeric columns by default |



<center>
<img src="https://cdn.sanity.io/images/oaglaatp/production/b8add8dc0e1c0907a520e9cb2c8f511b0659726c-1200x600.png?w=1200&h=600&auto=format" alt="Pandas DF3" width="500">
</center>


In [7]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)

print("head")
print(df.head())
print("shape")
print(df.shape)
print("info")
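df.info()  # info() prints its report itself, so no print() is needed


head
      Name  Age  Salary
0    Alice   24   50000
1      Bob   30   60000
2  Charlie   28   55000
shape
(3, 3)
info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Salary  3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 204.0+ bytes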


## Selecting and Indexing Data

After inspecting the structure of a DataFrame, the next step is often to select specific parts of the data. Pandas lets you choose columns, rows, or both, using labels (`loc`), integer positions (`iloc`), or Boolean conditions.
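The difference between `loc` and `iloc` is easiest to see when the row labels are not the default integers. The sketch below assumes a small DataFrame with hypothetical string labels as its index.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [24, 30, 28], "City": ["New York", "Los Angeles", "Chicago"]},
    index=["alice", "bob", "charlie"],  # hypothetical row labels
)

by_label = df.loc["bob"]          # row whose index label is "bob"
by_position = df.iloc[1]          # second row, whatever its label is
cell = df.loc["charlie", "City"]  # single value by row and column label

# loc slices include the end label; iloc slices exclude the end position
label_slice = df.loc["alice":"bob"]
position_slice = df.iloc[0:1]

print(by_label)
print(cell)
print(len(label_slice), len(position_slice))
```

With the default integer index (0, 1, 2, …) the two accessors often look interchangeable, which is why the distinction is easy to miss.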

### **1. Selecting Columns**

| Command               | Description                                      |
| --------------------- | ------------------------------------------------ |
| `df['col']`           | Selects a **single column** as a Series.         |
| `df[['col1','col2']]` | Selects **multiple columns** as a new DataFrame. |


### **2. Selecting Rows**

| Command              | Description                                  |
| -------------------- | -------------------------------------------- |
| `df.loc[row_label]`  | Select row(s) by **label** (index name).     |
| `df.iloc[row_index]` | Select row(s) by **integer position**.       |
| `df.loc[0, 'col']`   | Select a **specific value** by row & column. |



### **3. Conditional Selection (Filtering)**

| Command                                | Description                                     |
| -------------------------------------- | ----------------------------------------------- |
| `df[df['col'] > 50]`                   | Returns rows where condition is **True**.       |
| `df[(df['A'] > 50) & (df['B'] < 100)]` | Combine conditions with `&` (and) or `\|` (or). |

The following is an example:

In [8]:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

# Now, to access the columns: Select columns
df['Age']
df[['Name', 'City']]

# Select rows
df.iloc[0]        # First row
df.iloc[1:3]      # Rows 1–2
df.loc[0, 'Name'] # Specific cell

# Conditional selection
# (no one in this small dataset is older than 30, so both filters return an empty DataFrame)
df[df['Age'] > 30]
df[(df['Age'] > 30) & (df['City'] == 'Chicago')]


      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   28      Chicago


Empty DataFrame
Columns: [Name, Age, City]
Index: []


# Adding a New Column to a DataFrame

A new column can be added to a pandas DataFrame by assigning a ***value, list, or Series*** to a new column name. A list must have exactly as many elements as the DataFrame has rows. A Series is matched to the rows by index label, and rows without a matching label receive `NaN`. You can also assign a single value, which is applied to all rows.
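Lists and Series behave differently when their size does not match the DataFrame. The sketch below shows both cases; the `Bonus` column and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Math": [90, 85, 95]})

# A list of the wrong length is rejected with a ValueError
try:
    df["English"] = [88, 92]
except ValueError as err:
    print("ValueError:", err)

# A Series is matched on index labels; rows without a match get NaN
df["Bonus"] = pd.Series([5, 10], index=[0, 2])
print(df)
```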

In [9]:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Math": [90, 85, 95]
})
print(df)

# Add a new column with a list
df["English"] = [88, 92, 80]

print(df)

# Add a new column with a single value
df["Pass"] = True

print(df)

      Name  Math
0    Alice    90
1      Bob    85
2  Charlie    95
      Name  Math  English
0    Alice    90       88
1      Bob    85       92
2  Charlie    95       80
      Name  Math  English  Pass
0    Alice    90       88  True
1      Bob    85       92  True
2  Charlie    95       80  True


# Arithmetic Operations and Functions in Pandas

So far, we have explored how to inspect and select data in Pandas. Once you have access to the right rows and columns, the next step is to perform calculations and apply functions. Pandas makes this intuitive by letting you apply arithmetic directly to DataFrames or Series, and by offering tools like `apply()`, `map()`, and `applymap()` for more flexibility.

## Arithmetic Operations on Columns

You can directly apply mathematical operations to Pandas Series or DataFrame columns. Operations are vectorized, meaning they are applied element-wise across the column.

In the example below, notice how the operations are automatically applied to each row.

In [10]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)
print(df)

# Increase all salaries by 10%
df['Salary'] = df['Salary'] * 1.10

# Add 5 years to everyone’s age
df['Age'] = df['Age'] + 5
print(df)

      Name  Age  Salary
0    Alice   24   50000
1      Bob   30   60000
2  Charlie   28   55000
      Name  Age   Salary
0    Alice   29  55000.0
1      Bob   35  66000.0
2  Charlie   33  60500.0


## Arithmetic Between Columns
You can also perform arithmetic between two or more columns to create new features.

In [11]:
# Create a new column 'Income_per_Age'
df['Income_per_Age'] = df['Salary'] / df['Age']
print(df)

      Name  Age   Salary  Income_per_Age
0    Alice   29  55000.0     1896.551724
1      Bob   35  66000.0     1885.714286
2  Charlie   33  60500.0     1833.333333


# Applying Built-in Pandas/NumPy Functions

Pandas integrates with NumPy functions, allowing you to apply common statistics directly.


In [12]:
import numpy as np

# Calculate average salary
print(df['Salary'].mean())

# Standard deviation of Age
print(df['Age'].std())

# Apply numpy square root
print(np.sqrt(df['Age']))

60500.0
3.0550504633038935
0    5.385165
1    5.916080
2    5.744563
Name: Age, dtype: float64


### Applying Functions with **apply()**

Sometimes you need custom transformations. The apply() method lets you apply a function to an entire column (Series) or to each row/column in a DataFrame.


In [13]:
# Apply to a Series
df['Age_squared'] = df['Age'].apply(lambda x: x**2)

# Apply to DataFrame across rows
df['Total'] = df[['Age','Salary']].apply(lambda row: row['Age'] + row['Salary'], axis=1)
print(df)

      Name  Age   Salary  Income_per_Age  Age_squared    Total
0    Alice   29  55000.0     1896.551724          841  55029.0
1      Bob   35  66000.0     1885.714286         1225  66035.0
2  Charlie   33  60500.0     1833.333333         1089  60533.0
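When a row-wise `apply()` only combines columns arithmetically, plain column arithmetic usually gives the same result and is typically faster on large tables. The sketch below rebuilds a small version of the data to compare the two approaches.

```python
import pandas as pd

df = pd.DataFrame({"Age": [29, 35, 33], "Salary": [55000.0, 66000.0, 60500.0]})

# Row-wise apply: the function is called once per row
total_apply = df.apply(lambda row: row["Age"] + row["Salary"], axis=1)

# The same calculation with vectorized column arithmetic
total_vectorized = df["Age"] + df["Salary"]

print(total_apply.equals(total_vectorized))
```

`apply()` remains useful when the logic cannot be expressed with built-in operations, for example when each row needs custom branching.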


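Note that you can also apply a function element-wise across a whole DataFrame with `applymap()` (renamed to `DataFrame.map()` in pandas 2.1, where `applymap()` is deprecated) and to a single column with `map()`, but these are not covered in this course.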

# Filtering Data in Pandas

Once you know how to select columns and rows, the next step is learning how to filter data. Filtering helps you focus on only the relevant part of your dataset, whether that means removing unnecessary columns, isolating rows that meet certain conditions, or preparing features for modeling.

## Filtering Columns

Column filtering is about selecting only the columns you need or dropping the ones you don’t. This reduces memory usage and keeps your DataFrame manageable.

In [14]:
# Select a single column
df['Age']

# Select multiple columns
df[['Name', 'Age']]



      Name  Age
0    Alice   29
1      Bob   35
2  Charlie   33


## Dropping Unused Columns

In [15]:
# Drop the 'Age_squared' column
df = df.drop(columns=['Age_squared'])
print(df)

      Name  Age   Salary  Income_per_Age    Total
0    Alice   29  55000.0     1896.551724  55029.0
1      Bob   35  66000.0     1885.714286  66035.0
2  Charlie   33  60500.0     1833.333333  60533.0


This is especially useful when preparing data for machine learning, where only selected features are required.

## Filtering Rows (using Boolean Indexing)

Row filtering is usually done with Boolean indexing, where you apply a condition and return only the rows where that condition is true.

In [16]:
# Filter rows where Age > 30
df[df['Age'] > 30]

      Name  Age   Salary  Income_per_Age    Total
1      Bob   35  66000.0     1885.714286  66035.0
2  Charlie   33  60500.0     1833.333333  60533.0


# Combining Multiple Conditions
You can combine conditions using & (and) or | (or).
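In Python, `&` and `|` bind more tightly than comparisons such as `>`, which is why each condition needs its own parentheses. The sketch below uses a small stand-in for the running example to show the "or" case, plus `~` for negation.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

# OR: keep rows that satisfy at least one condition
either = df[(df["Age"] > 34) | (df["Salary"] < 56000)]

# NOT: invert a condition with ~
not_over_30 = df[~(df["Age"] > 30)]

print(either)
print(not_over_30)
```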

In [17]:
# Filter rows where Age > 30 AND Salary > 60000
df[(df['Age'] > 30) & (df['Salary'] > 60000)]

      Name  Age   Salary  Income_per_Age    Total
1      Bob   35  66000.0     1885.714286  66035.0
2  Charlie   33  60500.0     1833.333333  60533.0



> Remember to wrap each condition in parentheses.

# Filtering Strings

You can filter rows where a text column contains specific values.

In [18]:
# Filter rows where Name contains "Bob"
df[df['Name'].str.contains("Bob")]

      Name  Age   Salary  Income_per_Age    Total
1      Bob   35  66000.0     1885.714286  66035.0
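By default, `str.contains()` is case-sensitive, interprets the pattern as a regular expression, and passes missing values through as missing. The sketch below, using a small hypothetical list of names, shows the `case`, `na`, and `regex` parameters that adjust this behaviour.

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", None, "bobby"])

# case=False ignores letter case; na=False treats missing values as "no match"
mask = names.str.contains("bob", case=False, na=False)

# regex=False matches the text literally instead of as a regular expression
literal = pd.Series(["a.b", "axb"]).str.contains("a.b", regex=False)

print(mask.tolist())
print(literal.tolist())
```

Setting `na=False` matters when the mask is used for filtering, because a mask containing missing values cannot be used directly to select rows.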


## Unique Values and Counting

Sometimes you want to check how many unique values a column has, or count how often each appears.

In [19]:
# Unique names
print(df['Name'].unique())

# Count frequency of each name
print(df['Name'].value_counts())


['Alice' 'Bob' 'Charlie']
Name
Alice      1
Bob        1
Charlie    1
Name: count, dtype: int64


# Applying Aggregation Functions Directly to a DataFrame

One of the strengths of Pandas is that you can apply statistical and aggregation methods directly to a DataFrame or Series. These methods summarize data and provide insights without needing extra loops or manual calculations.

### Common Aggregation Methods

Here are some of the most commonly used methods:

| Method        | Description                                              | Works On           |
|---------------|----------------------------------------------------------|--------------------|
| `.sum()`      | Returns the **sum** of values                            | DataFrame / Series |
| `.mean()`     | Returns the **average (mean)** value                     | DataFrame / Series |
| `.count()`    | Counts **non-null values**                               | DataFrame / Series |
| `.min()`      | Returns the **minimum** value                            | DataFrame / Series |
| `.max()`      | Returns the **maximum** value                            | DataFrame / Series |
| `.std()`      | Returns the **standard deviation**                       | DataFrame / Series |
| `.var()`      | Returns the **variance**                                 | DataFrame / Series |
| `.describe()` | Generates **summary statistics** (count, mean, std, min, quartiles, max) | DataFrame / Series |


Example: Aggregating a Series


In [20]:
import pandas as pd

# Salary data
salaries = pd.Series([50000, 60000, 55000, 65000, 70000])

print("Sum:", salaries.sum())
print("Mean:", salaries.mean())
print("Max:", salaries.max())
print("Std Dev:", salaries.std())

Sum: 300000
Mean: 60000.0
Max: 70000
Std Dev: 7905.694150420948


Each method is applied directly to the Series, returning a single value.

Example: Aggregating a DataFrame

In [21]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)

print(df.sum(numeric_only=True))   # Sum of numeric columns
print(df.mean(numeric_only=True))  # Mean of numeric columns
print(df.describe())

Age           82
Salary    165000
dtype: int64
Age          27.333333
Salary    55000.000000
dtype: float64
             Age   Salary
count   3.000000      3.0
mean   27.333333  55000.0
std     3.055050   5000.0
min    24.000000  50000.0
25%    26.000000  52500.0
50%    28.000000  55000.0
75%    29.000000  57500.0
max    30.000000  60000.0


Notice that `describe()` skips non-numeric columns (like “Name”) by default, while `sum()` and `mean()` are restricted to numeric columns here through `numeric_only=True`.
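Another way to keep text columns out of an aggregation is to select the numeric columns first. The sketch below uses `select_dtypes(include="number")` on the same small dataset; it is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 30, 28],
    "Salary": [50000, 60000, 55000],
})

# Keep only the numeric columns, then aggregate as usual
numeric = df.select_dtypes(include="number")

print(numeric.columns.tolist())
print(numeric.mean())
```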

# More Advanced: Filtering Data & Apply Statistical Functions

We can combine **row filtering** with **aggregation functions** to analyze subsets of a DataFrame.  

The general syntax is:

> **df[df['column_name'] <condition> value]['target_column'].function()**

where:

- df[...] → filters the rows that meet the condition

- ['target_column'] → selects the column to aggregate

- .function() → applies the aggregation function


In [22]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000]
}
df = pd.DataFrame(data)

# Average salary of employees older than 30
avg_salary = df[df['Age'] > 30]['Salary'].mean()
print(avg_salary)

# Maximum salary for employees younger than 30
df[df['Age'] < 30]['Salary'].max()

# Count employees with salary above 60,000
df[df['Salary'] > 60000]['Name'].count()

# Standard deviation of salary for people aged 25–40
df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std()

68000.0


7767.45346515403

So the syntax pattern is:

df[ df['condition'] ]['column'].aggregation()

| Expression                                                  | Meaning                                          |
| ----------------------------------------------------------- | ------------------------------------------------ |
| `df[df['Age'] > 30]['Salary'].mean()`                       | Mean of Salary where Age > 30                    |
| `df[df['Salary'] > 60000]['Name'].count()`                  | Count of employees with Salary > 60k             |
| `df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std()` | Standard deviation of Salary for 25–40 year olds |


This pattern allows you to filter data first, then aggregate only on the rows that meet your condition.
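The same filter-then-aggregate step can also be written with `loc`, which selects the rows and the column in a single operation. The sketch below stores the condition in a variable (the name `older_than_30` is just illustrative) and compares both forms.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 35, 28, 40],
    "Salary": [50000, 66000, 55000, 70000],
})

older_than_30 = df["Age"] > 30  # Boolean mask: one True/False per row

chained = df[older_than_30]["Salary"].mean()       # filter, then pick the column
with_loc = df.loc[older_than_30, "Salary"].mean()  # rows and column in one step

print(chained, with_loc)
```

Naming the mask also makes longer conditions easier to read and reuse.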

# Grouping Data with `groupby`

While filtering + aggregation lets us summarize a **subset** of data, the `groupby()` method allows us to compute statistics **across categories**.  
This is the classic **split–apply–combine** process:

1. **Split** data into groups based on one or more columns.  
2. **Apply** an aggregation function to each group.  
3. **Combine** results into a new DataFrame or Series.  
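To make the three steps concrete, the sketch below performs them by hand with a loop on a small made-up dataset and then compares the result with `groupby()`. The manual version is for illustration only; in practice `groupby()` does this work for you.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT"],
    "Salary": [50000, 52000, 60000, 62000],
})

# Split, apply, and combine by hand
manual = {}
for dept in df["Department"].unique():
    group = df[df["Department"] == dept]   # split: rows of one department
    manual[dept] = group["Salary"].mean()  # apply: aggregate that group
manual = pd.Series(manual)                 # combine: collect the results

# The same computation with groupby
grouped = df.groupby("Department")["Salary"].mean()

print(manual)
print(grouped)
```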

---

## Basic Syntax

> df.groupby('column_name')['target_column'].aggregation_function()

where:

- `groupby('column_name')` → splits the data into groups.  
- `['target_column']` → selects the column to aggregate.  
- `.aggregation_function()` → applies functions like `mean()`, `sum()`, `count()`.  


In [23]:
## Example: Salary by Department

import pandas as pd

# Sample dataset
data = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Employee': ['Alice','Bob','Charlie','David','Eva','Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df = pd.DataFrame(data)
print(df)

# Average salary per department
df.groupby('Department')['Salary'].mean()



  Department Employee  Salary
0         HR    Alice   50000
1         HR      Bob   52000
2         IT  Charlie   60000
3         IT    David   62000
4    Finance      Eva   58000
5    Finance    Frank   60000


Department
Finance    59000.0
HR         51000.0
IT         61000.0
Name: Salary, dtype: float64


### Grouping by Multiple Columns

In [24]:
# Example dataset with Region added
data2 = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Region': ['East','West','East','West','East','West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df2 = pd.DataFrame(data2)

# Group by Department and Region
df2.groupby(['Department','Region'])['Salary'].mean()

Department  Region
Finance     East      58000.0
            West      60000.0
HR          East      50000.0
            West      52000.0
IT          East      60000.0
            West      62000.0
Name: Salary, dtype: float64
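Grouping by two columns produces a result whose index has two levels (here Department and Region). The sketch below, on a reduced version of the dataset, shows how one group can be looked up with a tuple of labels and how `reset_index()` turns the index levels back into ordinary columns.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT"],
    "Region": ["East", "West", "East", "West"],
    "Salary": [50000, 52000, 60000, 62000],
})

by_dept_region = df2.groupby(["Department", "Region"])["Salary"].mean()

# Look up one group with a tuple of labels, one per index level
hr_west = by_dept_region.loc[("HR", "West")]

# Turn the index levels back into regular columns
flat = by_dept_region.reset_index()

print(hr_west)
print(flat)
```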


# Exporting Data in Pandas

After processing your data in Pandas, you can save it to files in various formats:

| Format      | Function                  | Key Parameters / Notes                                           | Example Usage |
|------------|---------------------------|-----------------------------------------------------------------|---------------|
| CSV        | `to_csv()`                | `index=False` to skip row numbers, `sep='\t'` for tab-delimited | `df.to_csv('data.csv', index=False)` |
| Excel      | `to_excel()`              | `sheet_name='Sheet1'`, requires `openpyxl`                     | `df.to_excel('data.xlsx', index=False)` |
| JSON       | `to_json()`               | `orient='records'`, `lines=True` for line-delimited JSON       | `df.to_json('data.json', orient='records', lines=True)` |
| Pickle     | `to_pickle()`             | Python-specific, fast binary format                             | `df.to_pickle('data.pkl')` |
| HTML       | `to_html()`               | Saves as an HTML table                                         | `df.to_html('data.html', index=False)` |
| Parquet    | `to_parquet()`            | Efficient columnar format, great for big data; requires `pyarrow` or `fastparquet` | `df.to_parquet('data.parquet', index=False)` |

**Tip:** Always choose the format based on your use case:  
- CSV → universal, easy sharing  
- Excel → spreadsheets  
- JSON → web APIs or NoSQL  
- Parquet → large datasets, high performance


In [25]:
# Save CSV
df.to_csv('output.csv', index=False)
print("CSV file 'output.csv' created successfully.")

# Save tab-separated CSV
df2.to_csv('output_tab.csv', sep='\t', index=False)
print("Tab-separated CSV file 'output_tab.csv' created successfully.")

# Save Excel
df.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
print("Excel file 'output.xlsx' created successfully.")

# Save JSON
df2.to_json('output.json', orient='records', lines=True)
print("JSON file 'output.json' created successfully.")


CSV file 'output.csv' created successfully.
Tab-separated CSV file 'output_tab.csv' created successfully.
Excel file 'output.xlsx' created successfully.
JSON file 'output.json' created successfully.
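A quick way to check an export is to read it back. The sketch below does a round trip for line-delimited JSON entirely in memory (no file is written): when no path is given, `to_json()` returns the text, and `read_json()` can parse it from a `StringIO` buffer. The two-row DataFrame is a stand-in for the data above.

```python
import io
import pandas as pd

df2 = pd.DataFrame({"Department": ["HR", "IT"], "Salary": [50000, 60000]})

# With no path given, to_json returns the JSON text instead of writing a file
text = df2.to_json(orient="records", lines=True)
print(text)

# Read the line-delimited text back into a DataFrame
restored = pd.read_json(io.StringIO(text), lines=True)
print(restored)
```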


# Exporting Pandas Data in Google Colab

In Colab, you can save files directly to Google Drive. First, mount your Drive:

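```python
from google.colab import drive
drive.mount('/content/drive')  # Follow the link and paste the authorization code

# Assumption: Drive is mounted at the default location, so files saved under
# /content/drive/MyDrive/ appear in "My Drive"
df.to_csv('/content/drive/MyDrive/output.csv', index=False)
```
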
# The pandas Ecosystem: How It Fits In

Pandas does not exist in a vacuum. It is a central hub in the Python data science stack:

* **NumPy:** Provides the foundational n-dimensional array object. Pandas DataFrames are built on top of NumPy arrays.

* **Matplotlib/Seaborn:** Used for visualization. You can plot data directly from DataFrames and Series.

* **Scikit-learn:** The premier machine learning library. It accepts DataFrames and Series as inputs for model training.

* **Jupyter Notebooks:** The ideal interactive environment for exploratory data analysis with pandas.

# When to Use Pandas (And When Not To)

## Use pandas when:

* Working with tabular data (like spreadsheets or database tables)

* Data cleaning and preprocessing

* Exploratory data analysis

* Medium-sized datasets (up to a few gigabytes)

## Consider alternatives when:

* Working with very large datasets that don't fit in memory

* You need extremely high performance for numerical computations (consider NumPy directly)

* Working with unstructured data like images or text

# Key Takeaways


*   Filtering + Aggregation → summarize specific rows based on conditions.
*   GroupBy + Aggregation → summarize categories (all groups at once).
*   Grouping can be done on one or multiple columns.

# Summary of Pandas: Key Features at a Glance

* **Data Import/Export:** Read from and write to CSV, Excel, SQL, JSON, and many other formats

* **Data Cleaning:** Handle missing values, remove duplicates, filter outliers

* **Data Transformation:** Reshape, pivot, melt, and transform your data

* **Data Aggregation:** Group by categories and compute summary statistics

* **Time Series Analysis:** Work with dates and times effortlessly

* **Visualization Integration:** Works seamlessly with Matplotlib and Seaborn

## Knowledge Check

<iframe
src="https://docs.google.com/forms/d/e/1FAIpQLSdFEUF3np_FedX1B3jg6jXIRIqPvPMBCoiSSpQ6SPNGMTM3RA/viewform?embedded=true"   width="100%"
  height="800px"
  frameborder="0"
  style="min-height: 800px; height: 100vh"
>Loading…</iframe>