<a href="https://colab.research.google.com/github/devtayyabsajjad/Pandas-Learning-Journey-/blob/main/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Pandas Logo](https://pandas.pydata.org/static/img/pandas_white.svg)

---

## Table of Contents

1. [Introduction to Pandas](#introduction-to-pandas)
2. [Setting Up and Basic Data Structures](#setting-up-and-basic-data-structures)
   - [Installing and Importing Pandas](#installing-and-importing-pandas)
   - [Pandas Data Structures](#creating-a-series-and-dataframe)
3. [Data Loading and I/O Operations](#data-loading-and-io-operations)
   - [Reading Data from Files](#reading-data-from-files)
   - [Writing Data to Files](#writing-data-to-files)
4. [Data Exploration and Manipulation](#data-exploration-and-manipulation)
   - [Exploring Your Data](#exploring-your-data)
   - [Selecting and Filtering Data](#selecting-and-filtering-data)
5. [Handling Missing Data](#handling-missing-data)
   - [Detecting Missing Values](#detecting-missing-values)
   - [Dealing with Missing Values](#dealing-with-missing-values)
6. [Data Aggregation and Grouping](#data-aggregation-and-grouping)
   - [GroupBy Operations](#groupby-operations)
   - [Pivot Tables](#pivot-tables)
7. [Merging, Joining, and Concatenating DataFrames](#merging-joining-and-concatenating-dataframes)
   - [Merging DataFrames](#merging-dataframes)
   - [Concatenating DataFrames](#concatenating-dataframes)
8. [Advanced Pandas Topics](#advanced-pandas-topics)
   - [Applying Functions to Data](#applying-functions-to-data)
   - [Time Series Analysis](#time-series-analysis)
   - [Performance Optimization Techniques](#performance-optimization-techniques)
9. [Common Questions and How to Explain Them](#common-interview-questions-and-how-to-explain-them)


---



## 1. Introduction to Pandas

Pandas is an open-source Python library for data manipulation and analysis, built around two labeled data structures: the Series and the DataFrame.


**Why Pandas?**  
- **Ease of Use:** Simplifies complex data operations into intuitive commands.
- **Data Analysis:** Offers powerful tools for cleaning, filtering, grouping, and aggregation.
- **Integration:** Works seamlessly with libraries like NumPy, Matplotlib, and Scikit-learn.

---



## 2. Setting Up and Basic Data Structures

### 2.1 Installing and Importing Pandas

Before using Pandas, ensure that it's installed. You can install it via pip if needed:

```python
# To install pandas, run:
# pip install pandas

# Then, import pandas in your Python script:
import pandas as pd
```


### 2.2 Pandas Data Structures

Pandas primarily uses two data structures: **Series** and **DataFrame**.

### Series

A Series is a one-dimensional array-like object with labeled indices.
![Series](https://pandas.pydata.org/docs/_images/01_table_series.svg)



In [None]:
import pandas as pd

# Creating a simple Series
s = pd.Series([10, 20, 30, 40])
print("Series:")
print(s)

Series:
0    10
1    20
2    30
3    40
dtype: int64
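
The labeled index is what separates a Series from a plain list. The sketch below illustrates this with custom string labels; the names `prices`, `banana_price`, `first_price`, and `doubled`, and the values themselves, are made up for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
prices = pd.Series([10, 20, 30], index=["apple", "banana", "cherry"])

# Label-based access uses the index labels.
banana_price = prices["banana"]

# Position-based access goes through .iloc.
first_price = prices.iloc[0]

# Arithmetic is applied to every element and keeps the labels.
doubled = prices * 2

print(banana_price)
print(first_price)
print(doubled)
```

When no index is passed, as in the cell above, pandas falls back to the default integer labels 0, 1, 2, and so on.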


### DataFrame

A DataFrame is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns).
![DataFrame](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)


In [None]:
# Creating a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}  # columns can hold different data types (strings, integers)
df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)



DataFrame:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
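
One way to see the "heterogeneous" and "labeled axes" parts of that definition is to inspect a small DataFrame. This is a minimal sketch; `df_people` and `ages` are illustrative names, not part of the notebook.

```python
import pandas as pd

df_people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# Each column is a Series that shares the DataFrame's row index.
ages = df_people["Age"]
print(type(ages))

# Every column carries its own dtype, so text and numbers can sit side by side.
print(df_people.dtypes)

# shape is a (rows, columns) tuple.
print(df_people.shape)
```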


## 3. Data Loading and I/O Operations
Pandas makes it easy to read data from and write data to various file formats such as **CSV**, **Excel**, **JSON**, and **SQL databases**.
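
As a self-contained illustration that needs no file on disk, the sketch below round-trips a tiny DataFrame through CSV and JSON text held in memory. The DataFrame `df_small` and its contents are assumptions made for the example; `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical data used only for this round-trip example.
df_small = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df_small.to_csv(index=False)
print(csv_text)

# read_csv accepts any file-like object, so StringIO can replace a file path.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same pattern works for JSON; orient="records" writes one object per row.
json_text = df_small.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_from_csv)
print(df_from_json)
```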






In [None]:
# Example: Reading a CSV file from Google Drive (in Colab, mount Drive first with drive.mount('/content/drive'))
df = pd.read_csv('/content/drive/MyDrive/Titanic-Dataset.csv')
print(df.shape)
df.tail(10)


(891, 12)


|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
| 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5 | NaN | S |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.125 | NaN | Q |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Writing a DataFrame to a CSV file
df.to_csv('output.csv', index=False)  # index=False leaves the row index out of the CSV file


## 4. Data Exploration and Manipulation

### 4.1 Exploring Your Data
Before diving deep, inspect your dataset to understand its structure.



*   `head()`: Shows the first 5 rows by default.
*   `info()`: Reports column data types, non-null counts, and memory usage.
*   `describe()`: Summarizes numeric columns with count, mean, standard deviation, minimum, quartiles, and maximum.




In [None]:
# Generate descriptive statistics for numerical columns:
print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


In [None]:
# Display the first few rows:
print(df.head())

# Display the last few rows:
print(df.tail())

# Display 5 random rows:
print(df.sample(5))

# Get a concise summary of the DataFrame (data types, non-null counts).
# info() prints its report directly and returns None, so it is not wrapped in print():
df.info()

# Generate descriptive statistics for numerical columns:
print(df.describe())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
  

### 4.2 Selecting and Filtering Data

* Column Selection: Returns a Series that can be further analyzed.
* .loc vs. .iloc:
  * .loc: Access by label.
  * .iloc: Access by integer position.
* Conditional Filtering: Extracts rows that meet specific criteria.
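
With the default index, labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical DataFrame (`df_demo`) whose labels are 10, 20, 30 so the two accessors can be told apart.

```python
import pandas as pd

# Hypothetical data with non-default labels so that labels and positions differ.
df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

row_by_label = df_demo.loc[10]      # row whose label is 10
row_by_position = df_demo.iloc[0]   # first row, whatever its label

# .loc slices include the end label; .iloc slices exclude the end position.
loc_slice = df_demo.loc[10:20]      # labels 10 and 20
iloc_slice = df_demo.iloc[0:1]      # position 0 only

# A Boolean condition and a column label can be combined in one .loc call.
older_names = df_demo.loc[df_demo["Age"] > 28, "Name"]

print(row_by_label["Name"], row_by_position["Name"])
print(len(loc_slice), len(iloc_slice))
print(older_names.tolist())
```

Here `df_demo.loc[0]` would raise a `KeyError` because no row carries the label 0, while `df_demo.iloc[0]` still returns the first row.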

In [None]:
# Using .loc for label-based row selection:
print(df.loc[0])  # Selects the row with label 0

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object


In [None]:
# Selecting a column (returns a Series):
print(df['Name'])

# Using .loc for label-based row selection:
print(df.loc[0])  # Selects the row with label 0

# Using .iloc for integer-based row selection:
print(df.iloc[0])  # Selects the first row

# Filtering rows based on a condition:
print(df[df['Age'] > 28])


0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                               22.0
SibSp                                1
Parch                                0
Ticket                

## 5. Handling Missing Data

### 5.1 Detecting Missing Values
* isnull(): Returns Boolean values indicating missing entries.
* sum(): Aggregates the Boolean values to count missing values per column.

In [None]:
# Check for missing values in each column:
print(df.isnull().sum())


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


### 5.2 Dealing with Missing Values

* dropna(): Useful when you can afford to remove rows with missing data.
* fillna(): Replaces missing values to preserve dataset structure.
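
Dropping every incomplete row or filling everything with one constant is often too blunt. The sketch below shows a few more targeted variants on a small made-up DataFrame (`df_missing`); the column names mirror the Titanic data but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a numeric and a text column.
df_missing = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 28.0],
    "Cabin": ["C85", None, None, "B42"],
})

# Drop rows only when 'Age' is missing; gaps in other columns are tolerated.
rows_with_age = df_missing.dropna(subset=["Age"])

# Fill a numeric column with its median instead of a fixed constant.
age_filled = df_missing["Age"].fillna(df_missing["Age"].median())

# Use a different fill value per column by passing a dict.
df_filled_cols = df_missing.fillna({"Age": 0, "Cabin": "Unknown"})

print(rows_with_age)
print(age_filled.tolist())
print(df_filled_cols)
```

Which strategy is appropriate depends on the analysis; the median fill is only one common choice, not a recommendation for every dataset.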

In [None]:
# Remove rows with any missing values:
df_clean = df.dropna()

# Fill missing values with a specified value (e.g., 0):
df_filled = df.fillna(0)

# Remove a column from the dataset ('Cabin' is used as an example because most of its values are missing).
# drop() returns a new DataFrame and leaves df unchanged; passing inplace=True would modify df permanently.
df_no_cabin = df.drop(columns=['Cabin'])

## 6. Data Aggregation and Grouping

### 6.1 GroupBy Operations
* groupby(): Segments data into groups based on a specified key.
* Aggregation: Applies functions like mean to summarize data within groups.



In [None]:
# Example DataFrame for grouping:
data_group = {
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Values': [10, 20, 15, 25, 10]
}
df_group = pd.DataFrame(data_group)

# Group by 'Category' and compute the mean of 'Values':
grouped = df_group.groupby('Category')['Values'].mean()
print(grouped)
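
Several aggregations can be computed in one pass by handing a list of function names to `agg()`. The following is a minimal sketch on the same kind of data; `df_sales` and `summary` are illustrative names.

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A"],
    "Values": [10, 20, 15, 25, 10],
})

# One row per group, one column per aggregation.
summary = df_sales.groupby("Category")["Values"].agg(["mean", "sum", "count"])
print(summary)
```

For this data, category A has three values summing to 35, and category B has two values with a mean of 22.5.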


### 6.2 Pivot Tables
Pivot Table: Rearranges and summarizes data along one or two dimensions (`index` and, optionally, `columns`), similar to pivot tables in spreadsheets.

In [None]:
# Create a pivot table to summarize data:
pivot = df_group.pivot_table(values='Values', index='Category', aggfunc='sum')
print(pivot)
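
The cell above groups along a single dimension. To illustrate the two-dimensional case, the sketch below adds a hypothetical `Region` column and passes it as `columns`; the data and the name `df_orders` are invented for the example.

```python
import pandas as pd

# Hypothetical data with a second categorical column to pivot on.
df_orders = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "A"],
    "Region": ["East", "West", "East", "West", "East"],
    "Values": [10, 15, 20, 25, 10],
})

# Rows come from 'Category', columns from 'Region', cells hold the summed 'Values'.
pivot_2d = df_orders.pivot_table(
    values="Values", index="Category", columns="Region", aggfunc="sum"
)
print(pivot_2d)
```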


## 7. Merging, Joining, and Concatenating DataFrames

### 7.1 Merging DataFrames
* Merging: Combines DataFrames similar to SQL joins using a common key.
* how Parameter: Determines the type of join (inner, left, right, outer).

* Inner = Only the keys that appear in both tables
* Left = All rows from the first (left) table
* Right = All rows from the second (right) table
* Outer = All rows from both tables

In [None]:
# Define two DataFrames with a common key:
df_left = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df_right = pd.DataFrame({
    'ID': [1, 2, 4],
    'Score': [85, 90, 95]
})

# Merge DataFrames on 'ID' using an inner join:
df_merged = pd.merge(df_left, df_right, on='ID', how='inner')
print(df_merged)
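
To compare the four join types side by side, the sketch below runs the same merge with each `how` value and records which IDs survive. It rebuilds the two small tables under the illustrative names `left` and `right` so it can run on its own.

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"ID": [1, 2, 4], "Score": [85, 90, 95]})

# Collect the IDs that remain after each join type.
ids_by_join = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="ID", how=how)
    ids_by_join[how] = sorted(merged["ID"].tolist())
    print(how, ids_by_join[how])

# Rows without a match get NaN in the columns that come from the other table.
left_join = pd.merge(left, right, on="ID", how="left")
print(left_join)
```

The inner join keeps IDs 1 and 2, the left join adds 3, the right join adds 4 instead, and the outer join keeps all four. In the left join, Charlie (ID 3) has no score, so `Score` is NaN for that row.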


### 7.2 Concatenating DataFrames
`pd.concat()` stacks DataFrames vertically (`axis=0`, the default) or horizontally (`axis=1`) without requiring a common key.


In [None]:
# Concatenate DataFrames vertically:
df_concat = pd.concat([df_left, df_left])
print(df_concat)
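
Vertical concatenation keeps each input's original row labels, so the result above repeats the labels 0, 1, 2. The sketch below shows two common variations, `ignore_index=True` and `axis=1`; the DataFrames `df_a`, `df_b`, and `scores` are made up for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
df_b = pd.DataFrame({"ID": [3], "Name": ["Charlie"]})

# ignore_index=True discards the old row labels and renumbers from 0.
stacked = pd.concat([df_a, df_b], ignore_index=True)
print(stacked)

# axis=1 places the frames side by side, aligning rows on their index labels.
scores = pd.DataFrame({"Score": [85, 90]})
side_by_side = pd.concat([df_a, scores], axis=1)
print(side_by_side)
```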


## 8. Advanced Pandas Topics

### 8.1 Applying Functions to Data
* apply(): On a Series, calls a function on each element; on a DataFrame, calls it on each column (or on each row with `axis=1`).
* Useful for custom data transformations not covered by built-in methods.



> About **Lambda**: A lambda function is a small anonymous function defined with the lambda keyword. It can have any number of input parameters but only one expression. The expression is evaluated and returned when the lambda function is called.
```
lambda arguments: expression
```



In [None]:
# Apply a lambda function to increment 'Age' by 10:
df['Age_plus_10'] = df['Age'].apply(lambda x: x + 10)
print(df)
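
For simple arithmetic such as adding 10, a vectorized expression gives the same result as `apply()` and is usually faster on large data. The sketch below compares the two on a small made-up Series (`ages`) and then shows a case where `apply()` is the more natural fit; the labels and the threshold of 30 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

# Element-wise function call versus vectorized arithmetic.
via_apply = ages.apply(lambda x: x + 10)
via_vectorized = ages + 10
print(via_apply.equals(via_vectorized))

# apply() is handy for custom logic with branches.
def age_group(age):
    if pd.isna(age):
        return "unknown"
    return "adult" if age >= 30 else "young"

groups = ages.apply(age_group)
print(groups.tolist())
```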


### 8.2 Time Series Analysis
* Date Ranges: pd.date_range generates a sequence of dates.
* Indexing by Date: Facilitates time-based operations like resampling and rolling statistics.

In [None]:
# Create a time series DataFrame with a date range as the index:
date_range = pd.date_range(start='2025-01-01', periods=5, freq='D')
df_timeseries = pd.DataFrame({'Date': date_range, 'Value': [10, 20, 30, 40, 50]})

# Set the 'Date' column as the index:
df_timeseries.set_index('Date', inplace=True)
print(df_timeseries)
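
Once the index holds dates, resampling, rolling statistics, and date-based slicing become available. The following is a minimal sketch on six made-up daily values; the name `ts`, the 2-day bin, and the 3-day window are assumptions chosen for the example.

```python
import pandas as pd

idx = pd.date_range(start="2025-01-01", periods=6, freq="D")
ts = pd.DataFrame({"Value": [10, 20, 30, 40, 50, 60]}, index=idx)

# Resampling: group the daily rows into 2-day bins and sum each bin.
two_day_sum = ts["Value"].resample("2D").sum()
print(two_day_sum)

# Rolling statistics: mean over a sliding 3-row window (first two rows are NaN).
rolling_mean = ts["Value"].rolling(window=3).mean()
print(rolling_mean)

# Date strings can slice a DatetimeIndex; both endpoints are included.
jan_2_to_3 = ts.loc["2025-01-02":"2025-01-03"]
print(jan_2_to_3)
```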


## 9. Common Questions and How to Explain Them



1. **What is Pandas and what are its primary data structures?**  
   **Answer:**  
   - **Pandas:** Python library for data manipulation and analysis.
   - **Series:** 1D labeled array.
   - **DataFrame:** 2D labeled data structure (table).

2. **Difference between Series and DataFrame?**  
   **Answer:**  
   - **Series:** Single column of data with an index.
   - **DataFrame:** Collection of Series sharing the same index, supporting heterogeneous data.

3. **How do you handle missing data?**  
   **Answer:**  
   - Remove missing values: `df.dropna()`.
   - Impute missing values: `df.fillna(value)`.

4. **Difference between `.loc` and `.iloc`?**  
   **Answer:**  
   - **.loc:** Label-based indexing.
   - **.iloc:** Integer position-based indexing.

5. **What are GroupBy operations?**  
   **Answer:**  
   - Use `df.groupby(key).agg(func)` to split data by key, apply aggregation (mean, sum, etc.), and combine results.
