# 3.1 Introduction to Pandas

**Pandas** is a powerful Python library used to store, manipulate, and analyze **tabular data** using a **DataFrame**.

---

### What is a DataFrame?

A **DataFrame** is a **2D labeled data structure**, similar to a table in a database or an Excel sheet.

- It consists of **rows and columns**
- Each row is identified by an **index**
- Each column has a **column label (name)**

---

### Key Features of a DataFrame

- Stores data in **tabular (row–column) format**
- Supports **heterogeneous data** (each column can have a different data type)
- Easy data selection, filtering, and manipulation
- Built-in support for handling missing data

---

### Structure of a DataFrame

| Index | Column 1 | Column 2 | Column 3 |
|------|----------|----------|----------|
| 0    | Data     | Data     | Data     |
| 1    | Data     | Data     | Data     |

---

### Important Points

- Pandas DataFrames are **mutable**
- Each column can contain **different types of values**
- Widely used in **data analysis and data science**
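
The points above can be illustrated with a small sketch. The column names and values here are hypothetical; the example only shows that each column keeps its own data type and that a cell can be changed in place.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type.
df = pd.DataFrame({
    "name": ["Amit", "Riya"],      # strings
    "age": [25, 30],               # integers
    "score": [88.5, 92.0],         # floats
    "passed": [True, False],       # booleans
})

print(df.dtypes)      # one dtype per column

# DataFrames are mutable: a cell can be changed in place.
df.loc[0, "age"] = 26
print(df)
```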

# 3.2 Working with Pandas DataFrame
 ### Key Points
 - `head()` and `tail()` preview the first and last rows
 - `shape` gives the number of rows and columns
 - `columns` returns the column labels
 - DataFrame supports easy filtering and selection

In [1]:
import pandas as pd

 ### 1. Creating a DataFrame from List

In [2]:
std_data = [
    [1, 'Varun', 30, 'Male', 'Chandigarh'],
    [2, 'Aman', 25, 'Male', 'Delhi']
]

df = pd.DataFrame(
    std_data,
    columns=['stu_id', 'name', 'age', 'gender', 'city']
)

 ### 2. From Dictionary

In [27]:
data = {
    "Name": ["Amit", "Riya"],
    "Score": [185, 927]
}

df = pd.DataFrame(data)
print(df)

   Name  Score
0  Amit    185
1  Riya    927


 ### 3. From Numpy Arrays

In [28]:
import numpy as np

arr = np.array([
    [10, 20],
    [30, 40]
])

df = pd.DataFrame(arr, columns=["X", "Y"])
print(df)

    X   Y
0  10  20
1  30  40


 ### 4. Viewing Data

In [8]:
df = pd.read_csv(
    "Data/sales_data_sample.csv",
    encoding="latin1"
)

df.head()        # By default shows top 5 rows
df.head(2)       # Shows top 2 rows

df.tail()        # By default shows bottom 5 rows
df.tail(2)       # Shows bottom 2 rows

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
2821,10397,34,62.24,1,2116.16,3/28/2005,Shipped,1,3,2005,...,1 rue Alsace-Lorraine,,Toulouse,,31000,France,EMEA,Roulet,Annette,Small
2822,10414,47,65.52,9,3079.44,5/6/2005,On Hold,2,5,2005,...,8616 Spinnaker Dr.,,Boston,MA,51003,USA,,Yoshido,Juri,Medium


 ### 5. Shape of DataFrame

In [9]:
df.shape         # (number_of_rows, number_of_columns)

(2823, 25)

 ### 6. Column Names

In [10]:
df.columns

Index(['ORDERNUMBER', 'QUANTITYORDERED', 'PRICEEACH', 'ORDERLINENUMBER',
       'SALES', 'ORDERDATE', 'STATUS', 'QTR_ID', 'MONTH_ID', 'YEAR_ID',
       'PRODUCTLINE', 'MSRP', 'PRODUCTCODE', 'CUSTOMERNAME', 'PHONE',
       'ADDRESSLINE1', 'ADDRESSLINE2', 'CITY', 'STATE', 'POSTALCODE',
       'COUNTRY', 'TERRITORY', 'CONTACTLASTNAME', 'CONTACTFIRSTNAME',
       'DEALSIZE'],
      dtype='object')

 ### 7. Accessing Columns

In [14]:
df['STATUS']        # Access single column
df.STATUS           # Attribute access; works only if the name is a valid Python identifier and does not clash with a DataFrame attribute


0        Shipped
1        Shipped
2        Shipped
3        Shipped
4        Shipped
          ...   
2818     Shipped
2819     Shipped
2820    Resolved
2821     Shipped
2822     On Hold
Name: STATUS, Length: 2823, dtype: object

 ### 8. Accessing Multiple Columns

In [15]:
df[['CONTACTFIRSTNAME', 'CONTACTLASTNAME']]

Unnamed: 0,CONTACTFIRSTNAME,CONTACTLASTNAME
0,Kwai,Yu
1,Paul,Henriot
2,Daniel,Da Cunha
3,Julie,Young
4,Julie,Brown
...,...,...
2818,Diego,Freyre
2819,Pirkko,Koskitalo
2820,Diego,Freyre
2821,Annette,Roulet


 ### 9. Filtering Data (Conditional Selection)

In [16]:
df[df['SALES'] > 3079.44]

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
2,10134,41,94.74,2,3884.34,7/1/2003,Shipped,3,7,2003,...,27 rue du Colonel Pierre Avia,,Paris,,75508,France,EMEA,Da Cunha,Daniel,Medium
3,10145,45,83.26,6,3746.70,8/25/2003,Shipped,3,8,2003,...,78934 Hillside Dr.,,Pasadena,CA,90003,USA,,Young,Julie,Medium
4,10159,49,100.00,14,5205.27,10/10/2003,Shipped,4,10,2003,...,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium
5,10168,36,96.66,1,3479.76,10/28/2003,Shipped,4,10,2003,...,9408 Furth Circle,,Burlingame,CA,94217,USA,,Hirano,Juri,Medium
7,10188,48,100.00,1,5512.32,11/18/2003,Shipped,4,11,2003,...,"Drammen 121, PR 744 Sentrum",,Bergen,,N 5804,Norway,EMEA,Oeztan,Veysel,Medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2793,10386,50,87.15,16,4357.50,3/1/2005,Resolved,1,3,2005,...,"C/ Moralzarzal, 86",,Madrid,,28034,Spain,EMEA,Freyre,Diego,Medium
2816,10327,37,86.74,4,3209.38,11/10/2004,Resolved,4,11,2004,...,Vinb'ltet 34,,Kobenhavn,,1734,Denmark,EMEA,Petersen,Jytte,Medium
2817,10337,42,97.16,5,4080.72,11/21/2004,Shipped,4,11,2004,...,5905 Pompton St.,Suite 750,NYC,NY,10022,USA,,Hernandez,Maria,Medium
2819,10373,29,100.00,1,3978.51,1/31/2005,Shipped,1,1,2005,...,Torikatu 38,,Oulu,,90110,Finland,EMEA,Koskitalo,Pirkko,Medium
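
Conditions can also be combined. The following is a minimal sketch, assuming a hypothetical four-row miniature of the sales data (the values are illustrative, not read from the file). Each condition is wrapped in parentheses and joined with `&`, `|`, or `~`.

```python
import pandas as pd

# Hypothetical miniature of the sales data used above.
sales = pd.DataFrame({
    "ORDERNUMBER": [10107, 10121, 10134, 10145],
    "SALES": [2871.00, 2765.90, 3884.34, 3746.70],
    "STATUS": ["Shipped", "Shipped", "Shipped", "Resolved"],
})

# Each condition goes in parentheses; use & (and), | (or), ~ (not).
high_shipped = sales[(sales["SALES"] > 3000) & (sales["STATUS"] == "Shipped")]

# isin() tests membership in a list of allowed values.
selected = sales[sales["STATUS"].isin(["Resolved", "On Hold"])]

print(high_shipped)
print(selected)
```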


 ### 10. Data Types of Columns

In [17]:
df.dtypes

ORDERNUMBER           int64
QUANTITYORDERED       int64
PRICEEACH           float64
ORDERLINENUMBER       int64
SALES               float64
ORDERDATE            object
STATUS               object
QTR_ID                int64
MONTH_ID              int64
YEAR_ID               int64
PRODUCTLINE          object
MSRP                  int64
PRODUCTCODE          object
CUSTOMERNAME         object
PHONE                object
ADDRESSLINE1         object
ADDRESSLINE2         object
CITY                 object
STATE                object
POSTALCODE           object
COUNTRY              object
TERRITORY            object
CONTACTLASTNAME      object
CONTACTFIRSTNAME     object
DEALSIZE             object
dtype: object

 ### 11. Values of DataFrame

In [18]:
df.values

array([[10107, 30, 95.7, ..., 'Yu', 'Kwai', 'Small'],
       [10121, 34, 81.35, ..., 'Henriot', 'Paul', 'Small'],
       [10134, 41, 94.74, ..., 'Da Cunha', 'Daniel', 'Medium'],
       ...,
       [10386, 43, 100.0, ..., 'Freyre', 'Diego', 'Medium'],
       [10397, 34, 62.24, ..., 'Roulet', 'Annette', 'Small'],
       [10414, 47, 65.52, ..., 'Yoshido', 'Juri', 'Medium']], dtype=object)

 ### 12. Index Information

In [19]:
df.index
# RangeIndex(start=0, stop=n, step=1)

RangeIndex(start=0, stop=2823, step=1)

# 3.3 How to Read Data Files in Pandas (Common Parameters)
 - Pandas provides the `read_csv()` function to read data from CSV files efficiently.  
 - It supports multiple parameters to handle different file formats and data issues.

---

 ### Key Points
 - `read_csv()` is highly flexible
 - Helps clean and preprocess data while loading
 - Commonly used in Data Analytics & EDA
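 - Do not use `names` on a file that already has a header row unless you also pass `header=0`; otherwise the header row is read as data
 - `parse_dates` is best used for date columns like ORDERDATE
 - `encoding="latin1"` can avoid a `UnicodeDecodeError` when the file is not UTF-8 encoded
 - Always verify data using `info()` after loading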
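
The interaction between `names` and an existing header row can be sketched as follows. The CSV text is a hypothetical two-row stand-in read through `io.StringIO`, so no file is needed.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with a header line.
csv_text = "ORDERNUMBER,SALES\n10107,2871.00\n10121,2765.90\n"

# names= alone: the header line is read as a data row.
wrong = pd.read_csv(io.StringIO(csv_text), names=["order_number", "sales"])

# names= with header=0: the existing header line is replaced.
right = pd.read_csv(io.StringIO(csv_text), names=["order_number", "sales"], header=0)

print(wrong.shape)   # (3, 2)
print(right.shape)   # (2, 2)
```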

 ### 1. Common Parameters in `read_csv()`
 ### Explanation of Parameters
 - `sep` → Defines the delimiter (`,`, `;`, `\t`)
 - `header` → Row number that contains the column names
 - `names` → Assigns custom column names
 - `dtype` → Specifies the data type of columns
 - `na_values` → Additional strings to treat as missing values
 - `parse_dates` → Parses the listed columns as datetimes

In [22]:
import pandas as pd

df = pd.read_csv(
    "Data/sales_data_sample.csv",
    encoding="latin1",              # Handle special characters
    sep=",",                        # Column separator
    header=0,                       # First row as column names
    na_values=["?", "NA", ""],      # Treat these as missing values
    parse_dates=["ORDERDATE"]       # Convert ORDERDATE to datetime
)


 ### 2. Converting Date Columns

In [23]:
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'])  # Needed only if parse_dates was not used in read_csv()

 ### 3. Example: Specifying Data Types

In [24]:
df = pd.read_csv(
    "Data/sales_data_sample.csv",
    encoding="latin1",
    dtype={
        "ORDERNUMBER": int,
        "QUANTITYORDERED": int,
        "PRICEEACH": float,
        "SALES": float,
        "YEAR_ID": int
    }
)

 ### 4. Renaming Columns (If Required)

In [25]:
df.columns = [
    "order_number", "quantity_ordered", "price_each", "order_line_number",
    "sales", "order_date", "status", "quarter_id", "month_id", "year_id",
    "product_line", "msrp", "product_code", "customer_name", "phone",
    "address1", "address2", "city", "state", "postal_code",
    "country", "territory", "contact_lastname", "contact_firstname",
    "deal_size"
]

 ### 5. Verifying the Data

In [26]:
df.head()
df.info()
df.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   order_number       2823 non-null   int64  
 1   quantity_ordered   2823 non-null   int64  
 2   price_each         2823 non-null   float64
 3   order_line_number  2823 non-null   int64  
 4   sales              2823 non-null   float64
 5   order_date         2823 non-null   object 
 6   status             2823 non-null   object 
 7   quarter_id         2823 non-null   int64  
 8   month_id           2823 non-null   int64  
 9   year_id            2823 non-null   int64  
 10  product_line       2823 non-null   object 
 11  msrp               2823 non-null   int64  
 12  product_code       2823 non-null   object 
 13  customer_name      2823 non-null   object 
 14  phone              2823 non-null   object 
 15  address1           2823 non-null   object 
 16  address2           302 n

order_number           int64
quantity_ordered       int64
price_each           float64
order_line_number      int64
sales                float64
order_date            object
status                object
quarter_id             int64
month_id               int64
year_id                int64
product_line          object
msrp                   int64
product_code          object
customer_name         object
phone                 object
address1              object
address2              object
city                  object
state                 object
postal_code           object
country               object
territory             object
contact_lastname      object
contact_firstname     object
deal_size             object
dtype: object

# 3.4 Exporting Pandas DataFrames

Pandas provides multiple methods to **export DataFrames** to different file formats: CSV, Excel, JSON, etc.

---

### 1. Export to CSV
 - `df.to_csv("output.csv", index=False)`
 - `index=False` → Exclude row indices in the output file
 - Saves the DataFrame as a CSV file
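
The effect of `index=False` can be checked without writing a file. This is a minimal sketch with hypothetical data; it relies on `to_csv()` returning the CSV text when no path is given.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Riya"], "Score": [185, 927]})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the first version back produces an extra "Unnamed: 0" column.
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```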

---

 ### 2. Export to Excel
 - `df.to_excel("output.xlsx", index=False)`
 - Requires `openpyxl` or `xlsxwriter` library installed
 - Saves DataFrame as an Excel file

---

 ### 3. Export to JSON
 - `df.to_json("output.json", orient="records")`
 - orient="records" → Each row as a JSON object
 - Can also use `orient="columns"` or `orient="index"`
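
The three `orient` values differ only in how the same data is nested. A short sketch with hypothetical data, using the fact that `to_json()` returns a string when no path is given:

```python
import json
import pandas as pd

df = pd.DataFrame({"Name": ["Amit", "Riya"], "Score": [185, 927]})

# With no path, to_json() returns a JSON string.
records = json.loads(df.to_json(orient="records"))
columns = json.loads(df.to_json(orient="columns"))
index = json.loads(df.to_json(orient="index"))

print(records)  # list with one object per row
print(columns)  # {column -> {index -> value}}
print(index)    # {index -> {column -> value}}
```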

---

 #### Key Points
 - Choose format based on use case
 - Set `index=False` in `to_csv()` and `to_excel()` if you don't want the index in the output
 - Useful for data sharing, reporting, and storage

# 3.5 Operations on Pandas DataFrame
Pandas DataFrames support various operations like **selection, filtering, insertion, and deletion** of data.

---

 #### Key Points
 - **Selection** → `df['col']` or `df[['col1','col2']]`
 - **Filtering** → Conditional selection using boolean indexing
 - **Insertion** → `df['new_col'] = [...]` or `df.insert()`
 - **Deletion** → `drop()` or `del`
 - Supports both row and column operations

### 1. Selection

In [31]:
df = pd.read_csv(
    "Data/sales_data_sample.csv",
    encoding="latin1"
)
df['STATUS']        # Access single column

0        Shipped
1        Shipped
2        Shipped
3        Shipped
4        Shipped
          ...   
2818     Shipped
2819     Shipped
2820    Resolved
2821     Shipped
2822     On Hold
Name: STATUS, Length: 2823, dtype: object

 ### 2. Multiple Columns

In [32]:
df[['CONTACTFIRSTNAME', 'CONTACTLASTNAME']]

Unnamed: 0,CONTACTFIRSTNAME,CONTACTLASTNAME
0,Kwai,Yu
1,Paul,Henriot
2,Daniel,Da Cunha
3,Julie,Young
4,Julie,Brown
...,...,...
2818,Diego,Freyre
2819,Pirkko,Koskitalo
2820,Diego,Freyre
2821,Annette,Roulet


 ### 3. Filtering

In [33]:
df[df['SALES'] > 3079.44]

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
2,10134,41,94.74,2,3884.34,7/1/2003,Shipped,3,7,2003,...,27 rue du Colonel Pierre Avia,,Paris,,75508,France,EMEA,Da Cunha,Daniel,Medium
3,10145,45,83.26,6,3746.70,8/25/2003,Shipped,3,8,2003,...,78934 Hillside Dr.,,Pasadena,CA,90003,USA,,Young,Julie,Medium
4,10159,49,100.00,14,5205.27,10/10/2003,Shipped,4,10,2003,...,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium
5,10168,36,96.66,1,3479.76,10/28/2003,Shipped,4,10,2003,...,9408 Furth Circle,,Burlingame,CA,94217,USA,,Hirano,Juri,Medium
7,10188,48,100.00,1,5512.32,11/18/2003,Shipped,4,11,2003,...,"Drammen 121, PR 744 Sentrum",,Bergen,,N 5804,Norway,EMEA,Oeztan,Veysel,Medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2793,10386,50,87.15,16,4357.50,3/1/2005,Resolved,1,3,2005,...,"C/ Moralzarzal, 86",,Madrid,,28034,Spain,EMEA,Freyre,Diego,Medium
2816,10327,37,86.74,4,3209.38,11/10/2004,Resolved,4,11,2004,...,Vinb'ltet 34,,Kobenhavn,,1734,Denmark,EMEA,Petersen,Jytte,Medium
2817,10337,42,97.16,5,4080.72,11/21/2004,Shipped,4,11,2004,...,5905 Pompton St.,Suite 750,NYC,NY,10022,USA,,Hernandez,Maria,Medium
2819,10373,29,100.00,1,3978.51,1/31/2005,Shipped,1,1,2005,...,Torikatu 38,,Oulu,,90110,Finland,EMEA,Koskitalo,Pirkko,Medium


 ### 4. Insertion
Adding a New Column

---

 ### Tips:
 - If the list length doesn’t match the DataFrame rows, it will raise an error.
 - You can create dynamic columns using `apply()` or vectorized operations.
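
Both tips can be sketched on a small hypothetical frame: a list of the wrong length is rejected with a `ValueError`, and `np.where` is one vectorized way to build a flag column (used here as an alternative to `apply()`, not as the only option).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "QUANTITYORDERED": [30, 50, 22],
    "PRICEEACH": [100, 200, 150],
})

# A list with the wrong number of items is rejected.
try:
    df["DISCOUNT"] = [0.1, 0.2]          # 2 values for 3 rows
except ValueError as err:
    print("ValueError:", err)

# Vectorized calculation: no Python-level loop.
df["TOTALSALE"] = df["QUANTITYORDERED"] * df["PRICEEACH"]

# np.where is a vectorized alternative to apply() with a lambda.
df["HIGH_VALUE"] = np.where(df["TOTALSALE"] > 5000, "Yes", "No")
print(df)
```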

In [37]:
df = pd.DataFrame({
    'ORDERNUMBER': [10100, 10101, 10102],
    'QUANTITYORDERED': [30, 50, 22],
    'PRICEEACH': [100, 200, 150],
    'CUSTOMERNAME': ['John', 'Alice', 'Bob']
})

# 1️⃣ Adding a new column with a single default value
df['STATUS'] = 'Pending'

# 2️⃣ Adding a new column using a list of values
df['DISCOUNT'] = [0.1, 0.2, 0.15]  # Corresponds to each row

# 3️⃣ Adding a new column using a calculation
df['TOTALSALE'] = df['QUANTITYORDERED'] * df['PRICEEACH']

# Insert 'TAX' at 2nd position (index 1)
df.insert(1, 'TAX', [5, 10, 7])  # Example tax values per order

# Example: Adding a flag column based on sales
df['HIGH_VALUE'] = df['TOTALSALE'].apply(lambda x: 'Yes' if x > 5000 else 'No')

print(df)

   ORDERNUMBER  TAX  QUANTITYORDERED  PRICEEACH CUSTOMERNAME   STATUS  \
0        10100    5               30        100         John  Pending   
1        10101   10               50        200        Alice  Pending   
2        10102    7               22        150          Bob  Pending   

   DISCOUNT  TOTALSALE HIGH_VALUE  
0      0.10       3000         No  
1      0.20      10000        Yes  
2      0.15       3300         No  


 ### 5. Deletion in Pandas
You can remove columns or rows from a DataFrame with the `drop()` method; the `del` statement removes a single column.

---

 #### Tips:
 - `drop()` is safer for multiple deletions and also handles row deletion (`axis=0` or `index=...`).
 - `del` is simpler for quick single-column removal.
 - Always check your DataFrame after deletion to confirm changes.

#### Method 1: Using `drop()`
The `drop()` method is flexible and widely used.

In [38]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'ORDERNUMBER': [10100, 10101, 10102],
    'QUANTITYORDERED': [30, 50, 22],
    'PRICEEACH': [100, 200, 150],
    'CUSTOMERNAME': ['John', 'Alice', 'Bob'],
    'TAX': [5, 10, 7]
})

# 1️⃣ Delete a single column
df1 = df.drop(columns=['TAX'])
print("After dropping TAX column:\n", df1)

# 2️⃣ Delete multiple columns
df2 = df.drop(columns=['PRICEEACH', 'QUANTITYORDERED'])
print("\nAfter dropping PRICEEACH and QUANTITYORDERED:\n", df2)

# Note: By default, drop() returns a new DataFrame. Use inplace=True to modify original.
# df.drop(columns=['TAX'], inplace=True)

After dropping TAX column:
    ORDERNUMBER  QUANTITYORDERED  PRICEEACH CUSTOMERNAME
0        10100               30        100         John
1        10101               50        200        Alice
2        10102               22        150          Bob

After dropping PRICEEACH and QUANTITYORDERED:
    ORDERNUMBER CUSTOMERNAME  TAX
0        10100         John    5
1        10101        Alice   10
2        10102          Bob    7


 #### Method 2: Using `del` Statement
You can delete a column from the DataFrame in-place using `del`.

In [40]:
# Delete the TAX column
del df['TAX']

print("After deleting TAX column using del:\n", df)

After deleting TAX column using del:
    ORDERNUMBER  QUANTITYORDERED  PRICEEACH CUSTOMERNAME
0        10100               30        100         John
1        10101               50        200        Alice
2        10102               22        150          Bob


 ### 6. Using iloc for Row Deletion
 #### Tips:
 - `iloc` selects rows by position; it does not modify the original DataFrame but returns a new one containing only the rows you keep.
 - To delete rows by label, use `drop(index=...)` and reassign the result (or pass `inplace=True`).

In [42]:
# Sample DataFrame
df = pd.DataFrame({
    'ORDERNUMBER': [10100, 10101, 10102, 10103],
    'QUANTITYORDERED': [30, 50, 22, 15],
    'PRICEEACH': [100, 200, 150, 120],
    'CUSTOMERNAME': ['John', 'Alice', 'Bob', 'Mary']
})

# Suppose we want to delete the 2nd row (index 1)
df_new = df.iloc[[0, 2, 3]]  # Keep rows with index 0, 2, 3

print("After deleting row with index 1:\n", df_new)

# Delete rows with index 1 and 3
df_new = df.iloc[[0, 2]]  # Keep rows with index 0 and 2

print("\nAfter deleting rows with index 1 and 3:\n", df_new)

After deleting row with index 1:
    ORDERNUMBER  QUANTITYORDERED  PRICEEACH CUSTOMERNAME
0        10100               30        100         John
2        10102               22        150          Bob
3        10103               15        120         Mary

After deleting rows with index 1 and 3:
    ORDERNUMBER  QUANTITYORDERED  PRICEEACH CUSTOMERNAME
0        10100               30        100         John
2        10102               22        150          Bob
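
The same rows can be removed by label with `drop(index=...)`. A minimal sketch on a reduced, hypothetical version of the frame; `reset_index(drop=True)` is an optional extra step for renumbering the rows afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "ORDERNUMBER": [10100, 10101, 10102, 10103],
    "CUSTOMERNAME": ["John", "Alice", "Bob", "Mary"],
})

# drop(index=...) removes rows by label and returns a new DataFrame.
without_1 = df.drop(index=1)
without_1_and_3 = df.drop(index=[1, 3])

# reset_index(drop=True) renumbers the remaining rows from 0.
renumbered = without_1_and_3.reset_index(drop=True)

print(without_1)
print(renumbered)
```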


### 7. Renaming Columns

In [43]:
df = df.rename(columns={'PRICEEACH': 'price_each'})
print(df)

   ORDERNUMBER  QUANTITYORDERED  price_each CUSTOMERNAME
0        10100               30         100         John
1        10101               50         200        Alice
2        10102               22         150          Bob
3        10103               15         120         Mary


 ### 8. Deleting a Particular Row

In [44]:
df = df.drop(2)   # Deletes the row with index label 2

 ### 9. Adding a Particular Row
You can add a row using `loc` by assigning to a new index label:

In [47]:
df.loc[4] = [10102, 40, 150, 'Pinki']
print(df)

   ORDERNUMBER  QUANTITYORDERED  price_each CUSTOMERNAME
0        10100               30         100         John
1        10101               50         200        Alice
3        10103               15         120         Mary
4        10102               40         150        Pinki


 ### 10. Updating Values

In [49]:
# The column is now 'price_each' (renamed above) and row 2 was dropped.
# Assigning to a missing label or column with .loc would add it instead of updating.

# Update a single value
df.loc[3, 'price_each'] = 160  # Update Mary's price_each to 160

# Update multiple values
df.loc[[0, 3], 'CUSTOMERNAME'] = ['Jonathan', 'Maria']  # Update John → Jonathan, Mary → Maria

# Update based on a condition:
# Update price_each where QUANTITYORDERED > 40
df.loc[df['QUANTITYORDERED'] > 40, 'price_each'] = 210

print(df)

   ORDERNUMBER  QUANTITYORDERED  price_each CUSTOMERNAME
0        10100               30         100     Jonathan
1        10101               50         210        Alice
3        10103               15         160        Maria
4        10102               40         150        Pinki
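
Assigning through `.loc` with a row label or column name that does not exist adds a new row or column rather than raising an error. The following is a small sketch of one way to guard against that; the frame and the `label`/`column` variables are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"price_each": [100, 200, 120]},
    index=[0, 1, 3],            # label 2 is absent, as after a drop
)

# Existing label and column: the value is updated.
df.loc[3, "price_each"] = 160

# Guard against accidental enlargement before assigning.
label, column = 2, "PRICEEACH"
if label in df.index and column in df.columns:
    df.loc[label, column] = 999
else:
    print(f"Skipped: row {label!r} or column {column!r} does not exist")

print(df)
```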


# 3.6 Reading a Large CSV File in Chunks (Pandas)
 - When working with **large CSV files**, loading the entire file into memory at once may be inefficient.  
 - Pandas provides the `chunksize` parameter to read data **in smaller chunks**.

---

 #### Use Cases
 - Handling large datasets
 - Reducing memory usage
 - Performing incremental processing (cleaning, filtering, aggregation)

---


#### Key Points
 - `chunksize` helps read large files without loading them fully into memory
 - Each chunk is a DataFrame
 - Use `pd.concat()` to merge chunks if needed

In [57]:
#chunksize=10 → Reads 10 rows at a time
#data becomes an iterator, not a DataFrame
data = pd.read_csv(
    "Data/sales_data_sample.csv",
    encoding="latin1",
    chunksize=10
)

### Iterating Over Chunks
 - Each chunk is a DataFrame
 - Useful for processing large datasets step-by-step
 - `ignore_index=True` → Resets index after concatenation
 - Converts chunked data back into one DataFrame

In [59]:
chunks = []
for chunk in data:
    print(chunk)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

    ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES  \
10        10223               37     100.00                1  3965.66   
11        10237               23     100.00                7  2333.12   
12        10251               28     100.00                2  3188.64   
13        10263               34     100.00                2  3676.76   
14        10275               45      92.83                1  4177.35   
15        10285               36     100.00                6  4099.68   
16        10299               23     100.00                9  2597.39   
17        10309               41     100.00                5  4394.38   
18        10318               46      94.74                1  4358.04   
19        10329               42     100.00                1  4396.14   

     ORDERDATE   STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...  \
10   2/20/2004  Shipped       1         2     2004  ...   
11    4/5/2004  Shipped       2         4     2004  ...   
12   5/18/2004  Shi
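
Concatenating every chunk rebuilds the full DataFrame in memory. When only a summary or a filtered subset is needed, each chunk can be reduced as it arrives. This is a minimal sketch under that assumption; the ten-row CSV text is a hypothetical stand-in for a large file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file.
csv_text = "STATUS,SALES\n" + "\n".join(
    f"{'Shipped' if i % 2 == 0 else 'Resolved'},{i * 10.0}" for i in range(1, 11)
)

total = 0.0
shipped_parts = []

# chunksize=4 yields DataFrames of up to 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["SALES"].sum()                              # running aggregate
    shipped_parts.append(chunk[chunk["STATUS"] == "Shipped"])  # keep filtered rows only

shipped = pd.concat(shipped_parts, ignore_index=True)
print(total)
print(len(shipped))
```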

# 3.7 Handling Missing Values in Pandas
 - Missing values in Pandas are usually represented as **NaN (Not a Number)**; `None` is also treated as missing.  
 - Pandas provides powerful methods to **remove or replace missing data**.

---

 #### Key Points
 - `dropna()` → removes missing data
 - `fillna()` → replaces missing data
 - `ffill` / `bfill` useful for time-series data
 - Use `inplace=True` to modify original DataFrame
 - Always inspect missing values before handling
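
Inspecting missing values before handling them can be sketched with `isna()`. The frame below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Riya", None],
    "Age": [25, None, 30],
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = df.isna().sum()

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```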

In [60]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Name": ["Amit", "Riya", None, "Pinki", None],
    "Age": [25, None, 30, None, 28],
    "City": ["Delhi", None, "Mumbai", None, None],
    "Score": [85, 90, None, None, None]
})

print(df)

    Name   Age    City  Score
0   Amit  25.0   Delhi   85.0
1   Riya   NaN    None   90.0
2   None  30.0  Mumbai    NaN
3  Pinki   NaN    None    NaN
4   None  28.0    None    NaN


 ### 1. `dropna()` – Remove Missing Values
 - Drop Rows with Any Missing Value (Default)
 - Drops the entire row if any column has NaN

In [62]:
df.dropna()

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0


 ### a). Drop Columns with Missing Values
 ✔ Removes columns containing NaN (every column in this example has at least one, so only the index remains)

In [63]:
df.dropna(axis=1)

0
1
2
3
4


 ### b). Drop Rows with Any Missing Value
 ✔ Drops a row if any of its values is missing (same as the default)

In [64]:
df.dropna(how="any")

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0


 ### c). Drop Rows Based on Specific Columns (subset)
 ✔ Drops rows where Name is NaN

In [65]:
df.dropna(subset=["Name"])

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,,,90.0
3,Pinki,,,
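
To drop a row only when all of its values are missing, `dropna()` accepts `how="all"`. A minimal sketch on a hypothetical frame in which one row is entirely empty:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", None, "Pinki"],
    "Age": [25, np.nan, np.nan],
})

# how="all": a row is dropped only when every value in it is missing.
kept = df.dropna(how="all")
print(kept)
```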


 ### d). Drop Missing Values In-place
✔ Modifies original DataFrame

In [66]:
# df.dropna(inplace=True)

 ### e). Drop Rows with Minimum Non-NaN Values (thresh)
 ✔ Keeps rows having at least 2 non-null values

In [67]:
df.dropna(thresh=2)

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,,,90.0
2,,30.0,Mumbai,


 ### 2. `fillna()` – Fill Missing Values
Fill All NaN with a Value

In [68]:
df.fillna(0)

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,0.0,0,90.0
2,0,30.0,Mumbai,0.0
3,Pinki,0.0,0,0.0
4,0,28.0,0,0.0


 ### a). Fill with String Value

In [69]:
df.fillna("null")

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,null,null,90.0
2,null,30.0,Mumbai,null
3,Pinki,null,null,null
4,null,28.0,null,null


### b). Fill Specific Column

In [70]:
df["City"].fillna("Unknown")

0      Delhi
1    Unknown
2     Mumbai
3    Unknown
4    Unknown
Name: City, dtype: object
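
Different columns often need different fill values. One way to do this, sketched here on a hypothetical frame, is to pass `fillna()` a dict that maps column names to fill values.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, None, 30.0],
    "City": ["Delhi", None, "Mumbai"],
})

# A dict maps each column to its own fill value.
filled = df.fillna({"Age": df["Age"].mean(), "City": "Unknown"})
print(filled)
```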

 ### c). Forward Fill (ffill)
 ✔ Fills NaN with previous value

In [71]:
df.ffill()   # fillna(method="ffill") is deprecated in recent pandas versions

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,25.0,Delhi,90.0
2,Riya,30.0,Mumbai,90.0
3,Pinki,30.0,Mumbai,90.0
4,Pinki,28.0,Mumbai,90.0


 ### d). Backward Fill (bfill)
 ✔ Fills NaN with next value

In [72]:
df.bfill()   # fillna(method="bfill") is deprecated in recent pandas versions

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,30.0,Mumbai,90.0
2,Pinki,30.0,Mumbai,
3,Pinki,28.0,,
4,,28.0,,


 ### e). Fill In-place

In [73]:
 # df.fillna(0, inplace=True)

 ### f). Limit Filling
 ✔ Fills at most 2 NaN values in each column

In [74]:
df.fillna("null", limit=2)

Unnamed: 0,Name,Age,City,Score
0,Amit,25.0,Delhi,85.0
1,Riya,null,null,90.0
2,null,30.0,Mumbai,null
3,Pinki,null,null,null
4,null,28.0,,


# 3.8 Aggregating Data in Pandas
 - Aggregation in Pandas is used to **summarize data** using functions like sum, mean, min, max, etc.  
 - The most commonly used method is **`groupby()`**.

---

 #### Key Points
 - `groupby()` splits data into groups
 - Aggregation functions summarize each group
 - `agg()` allows multiple aggregations at once
 - Common functions: `sum()`, `mean()`, `min()`, `max()`, `count()`

In [75]:
df = pd.DataFrame({
    "Name": ["Amit", "Amit", "Riya", "Riya", "Pinki", "Pinki"],
    "Dept": ["IT", "IT", "HR", "HR", "Sales", "Sales"],
    "Salary": [50000, 52000, 45000, 47000, 40000, 42000]
})

print(df)

    Name   Dept  Salary
0   Amit     IT   50000
1   Amit     IT   52000
2   Riya     HR   45000
3   Riya     HR   47000
4  Pinki  Sales   40000
5  Pinki  Sales   42000


 ### 1. GroupBy

In [79]:
grp = df.groupby("Name")
print(grp)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002056E15AEA0>


 ##### Iterating Through Groups

In [80]:
for x, y in grp:
    print(x)
    print(y)

Amit
   Name Dept  Salary
0  Amit   IT   50000
1  Amit   IT   52000
Pinki
    Name   Dept  Salary
4  Pinki  Sales   40000
5  Pinki  Sales   42000
Riya
   Name Dept  Salary
2  Riya   HR   45000
3  Riya   HR   47000


 ### 2. Basic Aggregation Functions

In [82]:
#Sum
add = grp["Salary"].sum()
print(add)

#Mean
avg = grp["Salary"].mean()
print(avg)

#Max
maxi = grp["Salary"].max()
print(maxi)

#Min
mini = grp["Salary"].min()
print(mini)

Name
Amit     102000
Pinki     82000
Riya      92000
Name: Salary, dtype: int64
Name
Amit     51000.0
Pinki    41000.0
Riya     46000.0
Name: Salary, dtype: float64
Name
Amit     52000
Pinki    42000
Riya     47000
Name: Salary, dtype: int64
Name
Amit     50000
Pinki    40000
Riya     45000
Name: Salary, dtype: int64


### 3. Getting a Specific Group

In [83]:
grp.get_group("Amit")

Unnamed: 0,Name,Dept,Salary
0,Amit,IT,50000
1,Amit,IT,52000


 ### 4. Aggregation on Entire DataFrame

In [85]:
print(df["Salary"].sum())
print(df["Salary"].mean())
print(df["Salary"].min())
print(df["Salary"].max())

276000
46000.0
40000
52000


 ### 5. GroupBy with Column Selection
 #### Explanation
 - Groups data by Dept
 - Selects the Salary column
 - Calculates mean salary for each department

In [86]:
df.groupby("Dept")["Salary"].mean()

Dept
HR       46000.0
IT       51000.0
Sales    41000.0
Name: Salary, dtype: float64

 ### 6. Multiple Aggregations Using `agg()`
 #### Explanation
 - Groups data by Dept
 - Applies multiple aggregation functions on Salary

In [87]:
df.groupby("Dept").agg({
    "Salary": ["mean", "sum", "count"]
})

Unnamed: 0_level_0,Salary,Salary,Salary
Unnamed: 0_level_1,mean,sum,count
Dept,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
HR,46000.0,92000,2
IT,51000.0,102000,2
Sales,41000.0,82000,2
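
The dict form of `agg()` produces two-level column headers. If flat column names are preferred, named aggregation is an alternative. A minimal sketch; the output names `mean_salary`, `total_salary`, and `headcount` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Dept": ["IT", "IT", "HR", "HR", "Sales", "Sales"],
    "Salary": [50000, 52000, 45000, 47000, 40000, 42000],
})

# Named aggregation: output_name=(column, function) gives flat column names.
summary = df.groupby("Dept").agg(
    mean_salary=("Salary", "mean"),
    total_salary=("Salary", "sum"),
    headcount=("Salary", "count"),
).reset_index()

print(summary)
```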


# 3.9 Melt (Reshaping Data)
The `pandas.melt()` function is used to reshape a DataFrame from a wide format to a long format. It essentially "unpivots" columns into rows, making the data easier for analysis and visualization, especially when dealing with multiple columns that represent the same type of variable. 
 - melt() → Wide → Long format

In [88]:
df = pd.DataFrame({
    "Day": ["Mon", "Tue", "Wed"],
    "Sales_A": [100, 120, 130],
    "Sales_B": [90, 110, 140]
})

print(df)

   Day  Sales_A  Sales_B
0  Mon      100       90
1  Tue      120      110
2  Wed      130      140


 ### Melt Operation
✔ Converts wide format → long format

In [90]:
melted = pd.melt(
    df,
    id_vars=["Day"],
    value_vars=["Sales_A", "Sales_B"],
    var_name="Store",
    value_name="Sales"
)

print(melted)

   Day    Store  Sales
0  Mon  Sales_A    100
1  Tue  Sales_A    120
2  Wed  Sales_A    130
3  Mon  Sales_B     90
4  Tue  Sales_B    110
5  Wed  Sales_B    140


 # 3.10 Pivot (Reshaping Back)
 In pandas, the `pivot()` function is used to reshape a DataFrame by transforming unique values from rows into columns, creating a "wide" data format. The more versatile `pivot_table()` function is a generalization that also allows for data aggregation and handles duplicate entries. 
 - `pivot()` → Long → Wide (no aggregation)
 - `pivot_table()` → Allows aggregation

In [91]:
pivot_df = melted.pivot(
    index="Day",
    columns="Store",
    values="Sales"
)

print(pivot_df)

Store  Sales_A  Sales_B
Day                    
Mon        100       90
Tue        120      110
Wed        130      140


 ### Pivot with Another Example

In [92]:
df2 = pd.DataFrame({
    "student_name": ["Amit", "Amit", "Riya", "Riya"],
    "days": ["Mon", "Tue", "Mon", "Tue"],
    "marks": [80, 85, 90, 95]
})

pivot_marks = df2.pivot(
    index="days",
    columns="student_name",
    values="marks"
)

print(pivot_marks)

student_name  Amit  Riya
days                    
Mon             80    90
Tue             85    95


 ### Pivot Table (Aggregation Allowed)

In [93]:
pivot_table = pd.pivot_table(
    df2,
    index="student_name",
    columns="days",
    values="marks",
    aggfunc="mean"
)

print(pivot_table)

days           Mon   Tue
student_name            
Amit          80.0  85.0
Riya          90.0  95.0
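
The difference between the two functions shows up when an index/column pair occurs more than once. In this sketch the data is hypothetical and deliberately contains a duplicate: `pivot()` raises a `ValueError`, while `pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data with a repeated (student_name, days) pair.
df = pd.DataFrame({
    "student_name": ["Amit", "Amit", "Riya"],
    "days": ["Mon", "Mon", "Mon"],
    "marks": [80, 90, 70],
})

# pivot() cannot decide which of the two Amit/Mon values to keep.
try:
    df.pivot(index="student_name", columns="days", values="marks")
except ValueError as err:
    print("pivot failed:", err)

# pivot_table() combines the duplicates with the aggregation function.
table = pd.pivot_table(df, index="student_name", columns="days",
                       values="marks", aggfunc="mean")
print(table)
```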


 # 3.11 Merge & Concat

In [95]:
df1 = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [11, 12, 13]
})

df2 = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": [21, 22, 23, 24]
})

 ### Merge
 In pandas, `merge()` is used for database-style joining on common columns or indices.
  - `merge()` → SQL-style joins

In [104]:
#Inner Merge (Default)
inn = pd.merge(df1, df2, on="A")
print("\nInner Merge\n", inn)

#Left Merge
lft = pd.merge(df1, df2, on="A", how="left")
print("\nLeft Merge\n", lft)

# Right Merge
rgt = pd.merge(df1, df2, on="A", how="right")
print("\nRight Merge\n", rgt)

# Outer Merge
ot = pd.merge(df1, df2, on="A", how="outer")
print("\nOuter Merge\n", ot)

# Outer Merge with Indicator
oti = pd.merge(df1, df2, on="A", how="outer", indicator=True)
print("\nOuter Merge with Indicator\n", oti)

# Merge Using Index
otIdx = pd.merge(df1, df2, left_index=True, right_index=True)
print("\nMerge Using Index\n", otIdx)

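# Merge with Suffixes (applied only to overlapping non-key column names; df1 and df2 share none, so the output is unchanged)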
suff = pd.merge(df1, df2, on="A", suffixes=("_left", "_right"))
print("\nMerge with Suffixes\n", suff)


Inner Merge
    A   B   C
0  1  11  21
1  2  12  22
2  3  13  23

Left Merge
    A   B   C
0  1  11  21
1  2  12  22
2  3  13  23

Right Merge
    A     B   C
0  1  11.0  21
1  2  12.0  22
2  3  13.0  23
3  4   NaN  24

Outer Merge
    A     B   C
0  1  11.0  21
1  2  12.0  22
2  3  13.0  23
3  4   NaN  24

Outer Merge with Indicator
    A     B   C      _merge
0  1  11.0  21        both
1  2  12.0  22        both
2  3  13.0  23        both
3  4   NaN  24  right_only

Merge Using Index
    A_x   B  A_y   C
0    1  11    1  21
1    2  12    2  22
2    3  13    3  23

Merge with Suffixes
    A   B   C
0  1  11  21
1  2  12  22
2  3  13  23
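
`suffixes` only takes effect when both frames contain a non-key column with the same name. A minimal sketch with two hypothetical frames that share a `value` column:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3], "value": [11, 12, 13]})
right = pd.DataFrame({"A": [1, 2, 4], "value": [21, 22, 24]})

# Both frames have a non-key column named "value", so suffixes apply.
merged = pd.merge(left, right, on="A", suffixes=("_left", "_right"))
print(merged)
```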


 ### Concat Operations
 `concat()` is used for stacking or appending DataFrames along an axis (rows or columns). 
  - `concat()` → Stack or combine data

In [106]:
# Series Concatenation
s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([11, 21, 31, 41])
sr = pd.concat([s1, s2])
print("\nSeries Concatenation\n", sr)

#DataFrame Row-wise Concat
rw = pd.concat([df1, df2])
print("\nDataFrame Row-wise Concat\n", rw)

#Column-wise Concat
cw = pd.concat([df1, df2], axis=1)
print("\nColumn-wise Concat\n", cw)

#Join Types in Concat
ot = pd.concat([df1, df2], join="outer")   # Union
print("\nOuter Join Concat\n", ot)

inn = pd.concat([df1, df2], join="inner")   # Intersection
print("\nInner Join Concat\n", inn)

#Using Keys
k = pd.concat([df1, df2], keys=["df1", "df2"])
print("\nUsing Keys\n", k)


Series Concatenation
 0     1
1     2
2     3
3     4
0    11
1    21
2    31
3    41
dtype: int64

DataFrame Row-wise Concat
    A     B     C
0  1  11.0   NaN
1  2  12.0   NaN
2  3  13.0   NaN
0  1   NaN  21.0
1  2   NaN  22.0
2  3   NaN  23.0
3  4   NaN  24.0

Column-wise Concat
      A     B  A   C
0  1.0  11.0  1  21
1  2.0  12.0  2  22
2  3.0  13.0  3  23
3  NaN   NaN  4  24

Outer Join Concat
    A     B     C
0  1  11.0   NaN
1  2  12.0   NaN
2  3  13.0   NaN
0  1   NaN  21.0
1  2   NaN  22.0
2  3   NaN  23.0
3  4   NaN  24.0

Inner Join Concat
    A
0  1
1  2
2  3
0  1
1  2
2  3
3  4

Using Keys
        A     B     C
df1 0  1  11.0   NaN
    1  2  12.0   NaN
    2  3  13.0   NaN
df2 0  1   NaN  21.0
    1  2   NaN  22.0
    2  3   NaN  23.0
    3  4   NaN  24.0
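
Row-wise `concat()` keeps the original row labels, so they repeat in the result. If a clean index is wanted, `ignore_index=True` renumbers the rows. A short sketch with two hypothetical frames that share the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [11, 12, 13]})
df2 = pd.DataFrame({"A": [4, 5], "B": [14, 15]})

# Without ignore_index the row labels repeat (0, 1, 2, 0, 1).
stacked = pd.concat([df1, df2])

# ignore_index=True assigns a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```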
