# Pandas-2: Introduction & Creation of DataFrames


In [None]:
# Lecture  1.1 and 1.2

## 1. What is a DataFrame?

A DataFrame is a two-dimensional, labeled data structure that stores data in rows and columns, 
allowing each column to hold values of a different data type. 

A data structure is a way of organizing, storing, and managing data in a computer so that it can be used efficiently.

A DataFrame is designed for handling and analyzing structured (tabular) data.

A **DataFrame** is:

- **Two-dimensional**: Data is arranged in rows and columns.
- **Labeled**: Each row has an index (like row numbers), and each column has a name.
- **Flexible**: Can store different data types in different columns (numbers, text, dates, etc.).
- **Mutable**: You can change, add, or delete rows and columns anytime.

**Example:**

| Name  | Age | City   |
|-------|-----|--------|
| Aman  | 25  | Delhi  |
| Ankit | 30  | Patna  |
| Ayush | 35  | Kanpur |

This is a DataFrame with:

- **3 rows**
- **3 columns** (`Name`, `Age`, `City`)
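
As a minimal sketch, the example table above can be rebuilt in code to check the properties listed earlier. The variable name `people` and the extra `Country` column are illustrative assumptions, not part of the original lecture.

```python
import pandas as pd

# Rebuild the example table as a DataFrame.
people = pd.DataFrame({
    "Name": ["Aman", "Ankit", "Ayush"],
    "Age": [25, 30, 35],
    "City": ["Delhi", "Patna", "Kanpur"],
})

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # column labels
print(list(people.index))    # default row labels: 0, 1, 2

# Mutable: add a (hypothetical) column, then drop it again.
people["Country"] = "India"
people = people.drop(columns="Country")
```

`shape` reports rows first and columns second, so this table gives `(3, 3)`.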


## 2. Different ways to create a DataFrame


Let's look at the different ways to create a DataFrame, step by step.

In [1]:
# import numpy and pandas
import numpy as np
import pandas as pd

**From a Dictionary of Lists**

In [2]:
data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}

In [3]:
data

{'Name': ['Amit', 'Anand', 'Akash'],
 'Age': [25, 28, 30],
 'City': ['Mumbai', 'Delhi', 'Patna']}

In [4]:
df = pd.DataFrame(data)
df.head()

    Name  Age    City
0   Amit   25  Mumbai
1  Anand   28   Delhi
2  Akash   30   Patna


**`pd.DataFrame()` parameters**
```python
pandas.DataFrame(
    data=None,     # The content: dict, list, ndarray, Series, or another DataFrame.
    index=None,    # Row labels (list-like). Default is 0, 1, 2, ...
    columns=None,  # Column labels (list-like). Inferred from data when possible.
    dtype=None,    # A single data type to force on all columns. Inferred if None.
    copy=None      # Whether to copy the input. None copies dict input, but not DataFrame or 2D ndarray input.
)
```
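
A short sketch of how `index`, `columns`, and `dtype` work together. The labels `r1`..`r3`, `x`, `y` and the name `scores` are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    data=[[1, 2], [3, 4], [5, 6]],  # three rows, two columns
    index=["r1", "r2", "r3"],       # custom row labels instead of 0, 1, 2
    columns=["x", "y"],             # column labels
    dtype="float64",                # one dtype applied to every column
)

print(scores)
print(scores.dtypes)  # both columns are float64
```

Because `dtype` takes a single type, it suits homogeneous data; for per-column types, `DataFrame.astype` with a dict is the usual route.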

**From a List of Dictionaries**

In [5]:
data = [
    {'Name': 'Amit', 'Age': 25, 'City': 'Mumbai'},
    {'Name': 'Anand', 'Age': 28, 'City': 'Delhi'},
    {'Name': 'Akash', 'Age': 30, 'City': 'Patna'}
]

In [6]:
data

[{'Name': 'Amit', 'Age': 25, 'City': 'Mumbai'},
 {'Name': 'Anand', 'Age': 28, 'City': 'Delhi'},
 {'Name': 'Akash', 'Age': 30, 'City': 'Patna'}]

In [7]:
df = pd.DataFrame(data)
df.head()

    Name  Age    City
0   Amit   25  Mumbai
1  Anand   28   Delhi
2  Akash   30   Patna
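
With a list of dictionaries, the records do not all need the same keys. The sketch below uses hypothetical incomplete records to show that missing keys become `NaN`.

```python
import pandas as pd

records = [
    {"Name": "Amit", "Age": 25, "City": "Mumbai"},
    {"Name": "Anand", "City": "Delhi"},  # no "Age" key
    {"Name": "Akash", "Age": 30},        # no "City" key
]

df_records = pd.DataFrame(records)
print(df_records)
print(df_records.isna().sum())  # count of missing values per column
```

The `Age` column becomes a float column here, because `NaN` cannot be stored in a plain NumPy integer array.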


**From a List of Lists (with column names)**

In [8]:
data = [
    ['Amit', 25, 'Mumbai'],
    ['Anand', 28, 'Delhi'],
    ['Akash', 30, 'Patna']
]

In [9]:
data

[['Amit', 25, 'Mumbai'], ['Anand', 28, 'Delhi'], ['Akash', 30, 'Patna']]

In [10]:
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

    Name  Age    City
0   Amit   25  Mumbai
1  Anand   28   Delhi
2  Akash   30   Patna


**From a NumPy Array**

In [11]:
arr = np.array([
    ['Amit', 25, 'Mumbai'],
    ['Anand', 28, 'Delhi'],
    ['Akash', 30, 'Patna']
])

In [12]:
arr

array([['Amit', '25', 'Mumbai'],
       ['Anand', '28', 'Delhi'],
       ['Akash', '30', 'Patna']], dtype='<U21')

In [13]:
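# Pass the NumPy array here (not the list from the previous section).
# NumPy stored every value as a string, so the Age column holds strings.
df = pd.DataFrame(arr, columns=['Name', 'Age', 'City'])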
print(df)

    Name  Age    City
0   Amit   25  Mumbai
1  Anand   28   Delhi
2  Akash   30   Patna
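
A NumPy array has one dtype for all elements, so mixing names and numbers turns the numbers into strings. A minimal sketch of spotting and fixing this (the name `df_arr` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([
    ["Amit", 25, "Mumbai"],
    ["Anand", 28, "Delhi"],
    ["Akash", 30, "Patna"],
])

df_arr = pd.DataFrame(arr, columns=["Name", "Age", "City"])
print(type(df_arr["Age"].iloc[0]))  # the ages are strings at this point

# Convert the column back to integers before doing arithmetic on it.
df_arr["Age"] = df_arr["Age"].astype("int64")
print(df_arr["Age"].sum())
```

For mixed-type tables, a dictionary of lists or a list of lists keeps each column's natural type and avoids this conversion step.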


**From Series objects**

In [14]:
name = pd.Series(['Amit', 'Anand', 'Akash'])
age = pd.Series([25, 28, 30])
city = pd.Series(['Mumbai', 'Delhi', 'Patna'])

In [18]:
df = pd.DataFrame({'Name': name, 'Age': age, 'City': city})
print(df)

    Name  Age    City
0   Amit   25  Mumbai
1  Anand   28   Delhi
2  Akash   30   Patna
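
When the Series have different indexes, the DataFrame constructor aligns them by label rather than by position. A small sketch with hypothetical labels `a`, `b`, `c`:

```python
import pandas as pd

name = pd.Series(["Amit", "Anand", "Akash"], index=["a", "b", "c"])
age = pd.Series([28, 30], index=["b", "c"])  # no entry for label "a"

df_aligned = pd.DataFrame({"Name": name, "Age": age})
print(df_aligned)  # row "a" has NaN in the Age column
```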


**From a CSV / Excel file**

In [19]:
# CSV
df = pd.read_csv("dummy_data_csv.csv")
df.head()

      Name  Age     City  Salary
0     Amit   25    Patna   70000
1    Akash   30  Lucknow   80000
2     Aman   35    Delhi   75000
3   Manish   28   Jaipur   72000
4  Peeyush   32   Ranchi   78000


In [20]:
# TSV
df = pd.read_csv("dummy_data_tsv.tsv", sep='\t')
df.head()

      Name  Age     City  Salary
0     Amit   25    Patna   70000
1    Akash   30  Lucknow   80000
2     Aman   35    Delhi   75000
3   Manish   28   Jaipur   72000
4  Peeyush   32   Ranchi   78000


In [None]:
! pip install openpyxl

In [21]:
# Excel
df = pd.read_excel("dummy_data_excel.xlsx")
df.head()

      Name  Age     City  Salary
0     Amit   25    Patna   70000
1    Akash   30  Lucknow   80000
2     Aman   35    Delhi   75000
3   Manish   28   Jaipur   72000
4  Peeyush   32   Ranchi   78000
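
The cells above need the dummy files on disk. To try the readers without any files, a sketch like the following feeds in-memory text through `io.StringIO`. The sample text is made up, and `index_col` and `usecols` are shown as two commonly used `read_csv` options.

```python
import io
import pandas as pd

csv_text = "Name,Age,City\nAmit,25,Patna\nAkash,30,Lucknow\n"
tsv_text = csv_text.replace(",", "\t")  # same data, tab-separated

df_csv = pd.read_csv(io.StringIO(csv_text))
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_csv.equals(df_tsv))  # both parse to the same DataFrame

# Read only two columns and use Name as the row index.
df_indexed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Age"],
    index_col="Name",
)
print(df_indexed.loc["Akash", "Age"])
```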


**From a URL**

In [22]:
# CSV
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [23]:
# JSON
url = "https://jsonplaceholder.typicode.com/users"
df = pd.read_json(url)
df.head()

Unnamed: 0,id,name,username,email,address,phone,website,company
0,1,Leanne Graham,Bret,Sincere@april.biz,"{'street': 'Kulas Light', 'suite': 'Apt. 556',...",1-770-736-8031 x56442,hildegard.org,"{'name': 'Romaguera-Crona', 'catchPhrase': 'Mu..."
1,2,Ervin Howell,Antonette,Shanna@melissa.tv,"{'street': 'Victor Plains', 'suite': 'Suite 87...",010-692-6593 x09125,anastasia.net,"{'name': 'Deckow-Crist', 'catchPhrase': 'Proac..."
2,3,Clementine Bauch,Samantha,Nathan@yesenia.net,"{'street': 'Douglas Extension', 'suite': 'Suit...",1-463-123-4447,ramiro.info,"{'name': 'Romaguera-Jacobson', 'catchPhrase': ..."
3,4,Patricia Lebsack,Karianne,Julianne.OConner@kory.org,"{'street': 'Hoeger Mall', 'suite': 'Apt. 692',...",493-170-9623 x156,kale.biz,"{'name': 'Robel-Corkery', 'catchPhrase': 'Mult..."
4,5,Chelsey Dietrich,Kamren,Lucio_Hettinger@annie.ca,"{'street': 'Skiles Walks', 'suite': 'Suite 351...",(254)954-1289,demarco.info,"{'name': 'Keebler LLC', 'catchPhrase': 'User-c..."
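
In the output above, the `address` and `company` columns hold whole dictionaries because the JSON is nested. One way to flatten such records is `pd.json_normalize`. The sketch below uses two made-up records with the same nested shape instead of calling the URL.

```python
import pandas as pd

# Hypothetical records with a nested "address" object.
users = [
    {"id": 1, "name": "User One", "address": {"city": "Delhi", "zipcode": "110001"}},
    {"id": 2, "name": "User Two", "address": {"city": "Patna", "zipcode": "800001"}},
]

flat = pd.json_normalize(users)
print(list(flat.columns))  # nested keys become dotted column names
print(flat)
```

Each nested key gets its own column, named `<parent>.<child>` by default.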


In [None]:
!pip install lxml

In [25]:
# HTML: read_html returns a list with one DataFrame per <table> found on the page
url = "https://en.wikipedia.org/wiki/Indian_Institutes_of_Information_Technology"
table = pd.read_html(url)
len(table)

3

In [28]:
table[2].head()

Unnamed: 0,vteIndian Institutes of Information Technology (IIITs),vteIndian Institutes of Information Technology (IIITs).1
0,MoE–funded,Allahabad Gwalior Jabalpur Chennai Kurnool
1,PPP–funded,Agartala Bhagalpur Bhopal Dharwad Guwahati Kal...
2,Related Institutes,AIIMSs IIMs IISc IISERs IITs NITs NIPERs SPAs ...


## 3. DataFrame Internals: Why Are Pandas DataFrames So Powerful?

Pandas DataFrames are powerful because they combine **high-level usability** with **low-level, highly optimized data structures**.  


**1. Foundation: NumPy Arrays for Speed**
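- Most numeric Pandas columns are backed by **NumPy ndarray** objects; text columns have traditionally used the `object` dtype, which holds references to Python strings.  
- A NumPy array is a **contiguous memory block** of homogeneous data, so it is very fast to access and process.  
- Operations on numeric columns are **vectorized**: they run in compiled C code instead of Python loops.  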


In [29]:
data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}
df = pd.DataFrame(data)

In [33]:
print(type(df['Name']))

<class 'pandas.core.series.Series'>


In [35]:
type(df['Name'].values)

numpy.ndarray

In [37]:
df['Age'] + df['Age']

0    50
1    56
2    60
Name: Age, dtype: int64
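
To make "vectorized" concrete, the sketch below computes the same result twice: once with a single vectorized expression and once with an explicit Python loop. The Series `ages` is synthetic data invented for this comparison.

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.arange(1_000, dtype="int64"))

# Vectorized: one expression, executed in compiled code.
vectorized = ages + ages

# Loop: the same arithmetic, one Python-level addition per element.
looped = pd.Series([a + a for a in ages])

print(vectorized.equals(looped))  # same values either way
```

Both give identical values; the difference is speed, which you can measure in a notebook with `%timeit` on each expression.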

**2. Columnar Storage**
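- DataFrames are organized **column-wise**: each column has a single dtype and its values are stored together.  
- Benefits:
  - Fast **column selection**, since a column can often be returned without touching the other columns.  
  - Efficient for **analytical workloads** that scan or aggregate a few columns over many rows (the same idea behind column-oriented databases).  
  - Similar values sit next to each other, which is also why columnar file formats such as Parquet compress well.  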

In [38]:
print(type(df['Name']))

<class 'pandas.core.series.Series'>


In [40]:
type(df['Name'].values)

numpy.ndarray
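
One visible consequence of column-wise organization is that each column keeps its own dtype, while a single row has to mix types. A minimal sketch, where the `Salary` column and the name `df_cols` are assumptions added for illustration:

```python
import pandas as pd

df_cols = pd.DataFrame({
    "Name": ["Amit", "Anand", "Akash"],
    "Age": [25, 28, 30],
    "Salary": [70000.0, 80000.0, 75000.0],  # hypothetical column
})

print(df_cols.dtypes)      # one dtype per column

age_column = df_cols["Age"]
print(age_column.dtype)    # the column keeps its integer dtype

first_row = df_cols.iloc[0]
print(first_row.dtype)     # a row mixes text and numbers, so it falls back to object
```

This is one reason column operations are generally preferred over row-by-row processing in Pandas.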

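**3. BlockManager**
- Internally, the default BlockManager groups columns of the **same dtype** together in blocks for efficiency. (ArrayManager was an experimental alternative that stored each column as a separate array.)  
- Allows mixed types in the same DataFrame without losing speed for same-type columns.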

**4. Label-aware Indexing**
- DataFrames have **indexes** for rows and columns.  
- Indexes allow:
  - Fast lookups by label.  
  - Automatic alignment of data during operations.

In [42]:
data = {
    'Name': ['Amit', 'Anand', 'Akash'],
    'Age': [25, 28, 30],
    'City': ['Mumbai', 'Delhi', 'Patna']
}
df = pd.DataFrame(data, index=data['Name'])
df

        Name  Age    City
Amit    Amit   25  Mumbai
Anand  Anand   28   Delhi
Akash  Akash   30   Patna


In [43]:
df.loc['Anand']

Name    Anand
Age        28
City    Delhi
Name: Anand, dtype: object

In [44]:
# When performing operations between Series or DataFrames, Pandas aligns data automatically by index.
s1 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
print(s1)
print(s2)

a    100
b    200
c    300
dtype: int64
b    1
c    2
d    3
dtype: int64


In [45]:
s1 + s2

a      NaN
b    201.0
c    302.0
d      NaN
dtype: float64
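
Labels present in only one Series produce `NaN`, as seen above. If a missing label should count as zero instead, one option is `Series.add` with `fill_value`, sketched here with the same two Series:

```python
import pandas as pd

s1 = pd.Series([100, 200, 300], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Treat a label missing from one side as 0 before adding.
total = s1.add(s2, fill_value=0)
print(total)
```

Here `a` keeps its value from `s1` and `d` keeps its value from `s2`, while `b` and `c` are summed as before.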

**5. C & Cython Acceleration**
- Many Pandas functions are written in **Cython** or directly in **C**.  
- This allows operations on millions of rows to remain interactive and fast.

**6. Rich Metadata & Methods**
- Tracks column names, dtypes, missing values, categories, etc.  
- Built-in methods include:
  - Reshaping (`pivot`, `melt`)  
  - Grouping (`groupby`)  
  - Joins/merges (SQL-like)  
  - Time series handling  
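
A brief sketch of three of the methods listed above, on a small made-up table. The `staff` and `regions` DataFrames and their values are illustrative assumptions.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Amit", "Akash", "Aman", "Manish"],
    "City": ["Patna", "Delhi", "Delhi", "Patna"],
    "Salary": [70000, 80000, 75000, 72000],
})

# Grouping: average salary per city.
mean_salary = staff.groupby("City")["Salary"].mean()
print(mean_salary)

# Join/merge: attach a (hypothetical) region to each row, like a SQL LEFT JOIN.
regions = pd.DataFrame({"City": ["Patna", "Delhi"], "Region": ["East", "North"]})
merged = staff.merge(regions, on="City", how="left")
print(merged)

# Reshaping: melt turns the City and Salary columns into variable/value rows.
long_form = staff.melt(id_vars="Name", value_vars=["City", "Salary"])
print(long_form)
```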

**Summary: DataFrame Key Features**
- **Performance**: Vectorized C-speed computations.  
- **Flexibility**: Can hold mixed data types.  
- **Usability**: SQL-like power with Python syntax.  
- **Integration**: Works seamlessly with NumPy, Matplotlib, and Scikit-learn.

---

Happy Learning! Team DecodeAiML