# Pandas-2: Introduction & Creation of DataFrames


In [None]:
# Lecture  1.1 and 1.2

## 1. What is a DataFrame?

A DataFrame is a two-dimensional, labeled data structure that stores data in rows and columns, 
allowing each column to hold values of different data types. 

A data structure is a way of organizing, storing, and managing data in a computer so that it can be used efficiently.

It is designed for handling and analyzing structured (tabular) data efficiently.

A **DataFrame** is:

- **Two-dimensional**: Data is arranged in rows and columns.
- **Labeled**: Each row has an index label (by default 0, 1, 2, ...), and each column has a name.
- **Flexible**: Can store different data types in different columns (numbers, text, dates, etc.).
- **Mutable**: You can change, add, or delete rows and columns at any time.

**Example:**

| Name  | Age | City   |
|-------|-----|--------|
| Aman  | 25  | Delhi  |
| Ankit | 30  | Patna  |
| Ayush | 35  | Kanpur |

This is a DataFrame with:

- **3 rows**
- **3 columns** (`Name`, `Age`, `City`)


## 2. Different ways to create a DataFrame


Let's look at the different ways to create a DataFrame, step by step.

In [None]:
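import pandas as pd  # pandas is conventionally aliased as pd
import numpy as np   # NumPy is needed for the array example below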


**From a Dictionary of Lists**

In [None]:
data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}
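
Passing the dictionary to `pd.DataFrame` turns each key into a column name and each list into that column's values. A minimal sketch, reusing the `data` dictionary from the cell above (the variable name `df` is only a convention):

```python
import pandas as pd

data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}

# Each key becomes a column; all lists must have the same length.
df = pd.DataFrame(data)
print(df)
print(df.shape)  # (number of rows, number of columns)
```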

**pd.DataFrame() args**
```python
pandas.DataFrame(
    data=None,     # The content: dict, list, ndarray, Series, or another DataFrame.
    index=None,    # Row labels (default is 0, 1, 2, ...). Can be list-like.
    columns=None,  # Column labels (inferred from data if not given).
    dtype=None,    # A single dtype to force on all columns (default: inferred).
    copy=None      # Default None: dict input is copied; a DataFrame or 2D ndarray input may be used by reference.
)
```
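
The `index` and `columns` arguments listed above can be sketched as follows. The row labels `r1`, `r2`, `r3` are hypothetical and chosen only for illustration; with dictionary input, `columns` selects and orders the keys.

```python
import pandas as pd

data = {'Name': ['Amit', 'Anand', 'Akash'], 'Age': [25, 28, 30]}

# index sets custom row labels; columns picks keys from the dict in this order.
df = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['Age', 'Name'])
print(df)
```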

**From a List of Dictionaries**

In [None]:
data = [
    {'Name': 'Amit', 'Age': 25, 'City': 'Mumbai'},
    {'Name': 'Anand', 'Age': 28, 'City': 'Delhi'},
    {'Name': 'Akash', 'Age': 30, 'City': 'Patna'}
]
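
With a list of dictionaries, each dictionary becomes one row and the keys become column names. A small sketch; the last record deliberately omits `'City'` (a change made here for illustration only) to show that a missing key shows up as a missing value:

```python
import pandas as pd

data = [
    {'Name': 'Amit', 'Age': 25, 'City': 'Mumbai'},
    {'Name': 'Anand', 'Age': 28, 'City': 'Delhi'},
    {'Name': 'Akash', 'Age': 30}  # 'City' omitted on purpose
]

df = pd.DataFrame(data)
print(df)

# True marks the row where 'City' was not provided.
print(df['City'].isna().tolist())
```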

**From a List of Lists (with column names)**

In [None]:
data = [
    ['Amit', 25, 'Mumbai'],
    ['Anand', 28, 'Delhi'],
    ['Akash', 30, 'Patna']
]
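
A list of lists carries no column names, so they are usually supplied through `columns`. A minimal sketch, assuming the same three-column layout as above:

```python
import pandas as pd

data = [
    ['Amit', 25, 'Mumbai'],
    ['Anand', 28, 'Delhi'],
    ['Akash', 30, 'Patna']
]

# Each inner list is one row; columns names the positions.
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

# Without columns, pandas falls back to integer labels 0, 1, 2.
df_default = pd.DataFrame(data)
print(list(df_default.columns))
```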

**From a NumPy Array**

In [None]:
arr = np.array([
    ['Amit', 25, 'Mumbai'],
    ['Anand', 28, 'Delhi'],
    ['Akash', 30, 'Patna']
])
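
One thing worth checking with this input: a NumPy array holds a single dtype, so mixing text and numbers turns every value into a string. The sketch below builds the DataFrame and then converts `Age` back to integers; the conversion step is a suggestion, not part of the original cell.

```python
import numpy as np
import pandas as pd

arr = np.array([
    ['Amit', 25, 'Mumbai'],
    ['Anand', 28, 'Delhi'],
    ['Akash', 30, 'Patna']
])
print(arr.dtype)  # a Unicode string dtype, because of the mixed input

df = pd.DataFrame(arr, columns=['Name', 'Age', 'City'])
age_is_text = isinstance(df.loc[0, 'Age'], str)
print(age_is_text)

# Convert the text values back to integers before doing arithmetic.
df['Age'] = df['Age'].astype(int)
print(df['Age'].sum())
```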

**From Series objects**

In [None]:
name = pd.Series(['Amit', 'Anand', 'Akash'])
age = pd.Series([25, 28, 30])
city = pd.Series(['Mumbai', 'Delhi', 'Patna'])
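
One common way to combine the Series is a dictionary that maps column names to Series. A minimal sketch; the Series are aligned on their index, which here is the default 0, 1, 2:

```python
import pandas as pd

name = pd.Series(['Amit', 'Anand', 'Akash'])
age = pd.Series([25, 28, 30])
city = pd.Series(['Mumbai', 'Delhi', 'Patna'])

# Keys become column names; values are matched row by row via the index.
df = pd.DataFrame({'Name': name, 'Age': age, 'City': city})
print(df)
```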

**From a CSV / Excel file**

In [None]:
# CSV


In [None]:
# TSV
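
For CSV and TSV input, `pd.read_csv` is the usual entry point; TSV only differs in the separator. The sketch below reads from in-memory text via `io.StringIO` so that it runs without any file on disk. With a real file you would pass its path instead (for example a hypothetical `students.csv`).

```python
import io
import pandas as pd

# In-memory stand-ins for the contents of a CSV and a TSV file.
csv_text = "Name,Age,City\nAmit,25,Mumbai\nAnand,28,Delhi\nAkash,30,Patna\n"
tsv_text = csv_text.replace(",", "\t")

df_csv = pd.read_csv(io.StringIO(csv_text))            # comma is the default separator
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated values

print(df_csv)
print(df_csv.equals(df_tsv))
```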


In [None]:
!pip install openpyxl

In [None]:
# Excel


**From a URL**

In [None]:
# CSV
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"


In [None]:
# JSON
url = "https://jsonplaceholder.typicode.com/users"
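
`pd.read_csv` and `pd.read_json` accept a URL string in place of a file path, so with network access a call such as `pd.read_json(url)` is usually enough. The sketch below stays offline: `json_text` is a hypothetical stand-in for the text a JSON endpoint might return, not the actual response of the URL above.

```python
import io
import pandas as pd

# Hypothetical JSON payload: a list of records, one per row.
json_text = '[{"id": 1, "name": "Amit"}, {"id": 2, "name": "Anand"}]'

# read_json parses a list of records into rows and keys into columns.
df = pd.read_json(io.StringIO(json_text))
print(df)
```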


In [None]:
!pip install lxml

In [None]:
# HTML tables (pd.read_html needs a parser such as lxml)
url = "https://en.wikipedia.org/wiki/Indian_Institutes_of_Information_Technology"


## 3. DataFrame Internals: Why Pandas DataFrames Are So Powerful

Pandas DataFrames are powerful because they combine **high-level usability** with **low-level, highly optimized data structures**.  


**1. Foundation: NumPy Arrays for Speed**
- Most Pandas data is stored in **NumPy ndarray** objects.  
- NumPy arrays are **contiguous memory blocks** of homogeneous data → very fast to access and process.  
- Operations are **vectorized**, running in compiled C code instead of Python loops.  


In [None]:
data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}
df = pd.DataFrame(data)
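
To see the NumPy foundation in practice, the sketch below pulls the array behind the `Age` column and runs one vectorized operation on it. The column name `AgeNextYear` is hypothetical and added only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
})

# The numeric column is backed by a NumPy array.
ages = df['Age'].to_numpy()
print(type(ages), ages.dtype)

# Vectorized: one expression updates every row, with no Python loop.
df['AgeNextYear'] = df['Age'] + 1
print(df)
```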

**2. Columnar Storage**
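- DataFrames store data **column-wise**, not row-wise.  
- Benefits:
  - Fast **column selection** (a single column can often be returned without copying the rest of the data).  
  - Efficient for **analytical workloads** (like columnar SQL databases), which usually scan a few columns over many rows.  
  - Columnar file formats such as Parquet compress well because similar values sit together in one column.  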

**3. BlockManager**
- Pandas' default internal manager groups columns of the **same dtype** together in blocks for efficiency.  
- Allows mixed types in the same DataFrame without losing speed for same-type columns.
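
How columns are grouped internally is a private implementation detail, but its visible effect is easy to check: each column keeps its own dtype. A minimal sketch using the public `dtypes` attribute:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
})

# One dtype per column: text columns and the integer column coexist.
print(df.dtypes)

age_is_int = pd.api.types.is_integer_dtype(df['Age'])
name_is_int = pd.api.types.is_integer_dtype(df['Name'])
print(age_is_int, name_is_int)
```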

**4. Label-aware Indexing**
- DataFrames have **indexes** for rows and columns.  
- Indexes allow:
  - Fast lookups by label.  
  - Automatic alignment of data during operations.

In [None]:
data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}
df = pd.DataFrame(data, index=data['Name'])
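
With names as the row index, rows and single cells can be looked up by label through `.loc`. A short sketch built on the same data:

```python
import pandas as pd

data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}
df = pd.DataFrame(data, index=data['Name'])

# Select a whole row by its label.
print(df.loc['Anand'])

# Select one cell by row label and column name.
anand_city = df.loc['Anand', 'City']
print(anand_city)
```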

In [None]:
# When performing operations between Series or DataFrames, Pandas aligns data automatically by index.
s1 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
print(s1)
print(s2)
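
Adding the two Series shows the alignment: values are matched by label, not by position, and labels present in only one Series produce `NaN`. The `fill_value=0` variant is an optional extra, shown in case missing labels should count as zero.

```python
import pandas as pd

s1 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# 'b' and 'c' exist in both; 'a' and 'd' have no partner and become NaN.
result = s1 + s2
print(result)

# Treat a missing label as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```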

**5. C & Cython Acceleration**
- Many Pandas functions are written in **Cython** or directly in **C**.  
- This allows operations on millions of rows to remain interactive and fast.

**6. Rich Metadata & Methods**
- Tracks column names, dtypes, missing values, categories, etc.  
- Built-in methods include:
  - Reshaping (`pivot`, `melt`)  
  - Grouping (`groupby`)  
  - Joins/merges (SQL-like)  
  - Time series handling  
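
A few of the methods listed above, sketched on small data. Both frames are hypothetical and extended from the earlier examples purely for illustration (a fourth person is added so that grouping has something to aggregate).

```python
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame({
    'Name': ['Amit', 'Anand', 'Akash', 'Aman'],
    'Age': [25, 28, 30, 25],
    'City': ['Mumbai', 'Delhi', 'Patna', 'Delhi']
})
states = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Patna'],
    'State': ['Maharashtra', 'Delhi', 'Bihar']
})

# Grouping: mean age per city.
mean_age = people.groupby('City')['Age'].mean()
print(mean_age)

# Join/merge: SQL-style left join on the shared 'City' column.
merged = people.merge(states, on='City', how='left')
print(merged)

# Reshaping: melt turns the Age and City columns into (variable, value) rows.
long_form = people.melt(id_vars='Name', value_vars=['Age', 'City'])
print(long_form)
```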

**Summary: DataFrame Key Features**
- **Performance**: Vectorized C-speed computations.  
- **Flexibility**: Can hold mixed data types.  
- **Usability**: SQL-like power with Python syntax.  
- **Integration**: Works seamlessly with NumPy, Matplotlib, and Scikit-learn.

---

Happy Learning! Team DecodeAiML