# **Pandas: A Comprehensive Guide**

## **Introduction**

Pandas is a powerful Python library for data manipulation and analysis. It provides flexible data structures such as `Series` and `DataFrame` to work efficiently with structured data.

### **Installation**

To install Pandas, use:

```bash
pip install pandas
```

Then import it in Python:
```python
import pandas as pd
```

### **Core Data Structures**

#### **Series**

A `Series` is a one-dimensional labeled array capable of holding any data type.

- ```python
    data = [10, 20, 30, 40]
    s = pd.Series(data, index=['A','B','C','D'])
    print(s)
    ```

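- The data for a `Series` can come from most ordered collections, including lists, tuples, NumPy arrays, dictionaries, and ranges. A dictionary's keys become the index labels. Sets are rejected because they are unordered, and a plain string is treated as a single scalar value rather than as a sequence of characters.
- The index can be supplied by collections such as lists, tuples, NumPy arrays, and ranges, as long as its length matches the data.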

- ```python
    data = (10, 20, 30, 40)
    s = pd.Series(data, index=['A','B','C','D'])
    print(s)
    ```

- ```python
    import pandas as pd
    import numpy as np

    data = (1,2,3,4,5)
    index = np.array(['A','B','C','D','E'])
    s = pd.Series(data, index)
    print(s)
    ```
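
As a minimal sketch of the dictionary case mentioned above, the example below builds a `Series` from a dictionary and reads one element back by label and by position. The fruit names and prices are made-up illustration data, not part of the original examples.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels.
prices = {'apple': 1.5, 'banana': 0.25, 'cherry': 3.0}
s = pd.Series(prices)

by_label = s['banana']     # look up by index label
by_position = s.iloc[1]    # look up by integer position

print(s)
print(by_label, by_position)
```

No `index` argument is needed here because the dictionary already supplies the labels.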

#### **DataFrame**

A `DataFrame` is a two-dimensional labeled data structure, similar to a table in SQL or a sheet in Excel.

- Creating a Pandas DataFrame from a **list of dictionaries**:
    - Each dictionary represents a row, and the keys become the column labels.

- ```python
    s = pd.DataFrame([{'Reptiles':'scales', 'Birds':'feathers'}, {'Reptiles':'regenerate tail', 'Birds':'does not regenerate'}])
    print(s)
    ```

- Creating a Pandas DataFrame from a **dictionary of lists**:
    - Each key in the dictionary represents a column label, and the values are lists of column data.

- ```python
    s = pd.DataFrame({'Reptiles':['scales', 'regenerate tail'], 'Birds':['feathers', 'does not regenerate']})
    print(s)
    ```

- Creating a Pandas DataFrame from a **list of lists (or tuples)**:
    - Each inner list or tuple represents a row of data, and the optional `columns` argument specifies the column names.

- ```python
    data = [['a',2],['b',3],['c',4]]

    s = pd.DataFrame(data, columns=['alphabet','numbers'])
    print(s)
    ```

- Creating a Pandas DataFrame from a **single dictionary**:
    - This is the same dictionary-of-lists form as above, with the dictionary stored in a variable first: the keys are the column names and each value is a list holding that column's data.

- ```python
    data = {'alphabet':['a','b','c','d'], 'number':[1,2,3,4]}
    s = pd.DataFrame(data)
    print(s)
    ```

- Creating a Pandas DataFrame from a **NumPy array**:
    - You can create a DataFrame by passing a NumPy array and defining the column labels.

- ```python
    import numpy as np
    import pandas as pd

    # A NumPy array holds a single dtype, so mixing 'a' and 1 stores every value as a string
    data = np.array([['a',1],['b',2]])
    s = pd.DataFrame(data, columns=['alphabet', 'number'])
    print(s)
    ```

- Creating a Pandas DataFrame from **CSV/Excel files**:
    - You can read data from external files into a DataFrame using built-in readers such as `pd.read_csv` for CSV files and `pd.read_excel` for Excel files.

- ```python
    data = pd.read_csv('data.csv')
    ```
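
The construction methods above do not all produce the same column types. The sketch below compares the NumPy-array route with the dictionary-of-lists route on the same small data set and then converts the string column back to integers; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Mixed values in one NumPy array are coerced to a common (string) dtype.
mixed = np.array([['a', 1], ['b', 2]])
from_array = pd.DataFrame(mixed, columns=['alphabet', 'number'])

# A dictionary of lists keeps a separate dtype per column.
from_dict = pd.DataFrame({'alphabet': ['a', 'b'], 'number': [1, 2]})

array_is_numeric = pd.api.types.is_numeric_dtype(from_array['number'])
dict_is_numeric = pd.api.types.is_numeric_dtype(from_dict['number'])
print(array_is_numeric, dict_is_numeric)

# Convert the string column back to integers when numbers are needed.
fixed = from_array.astype({'number': 'int64'})
print(fixed.dtypes)
```

When columns hold different kinds of data, the dictionary or list-of-rows forms are usually the more convenient choice because no conversion step is required afterwards.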


### **Reading and Writing Data**

- Reading CSV Files
    ```python
    df = pd.read_csv('data.csv')
    print(df.head())
    ```

- Writing to CSV Files
    ```python
    df.to_csv('data.csv', index=False)
    ```
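
To try the round trip without creating a file, the following sketch writes a small DataFrame to an in-memory buffer and reads it back. The use of `io.StringIO` and the sample records are assumptions made for illustration; with a real file you would pass a path such as `'data.csv'` instead.

```python
import io
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'Name': ['Ana', 'Ben'], 'Salary': [52000, 48000]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False leaves out the row labels
csv_text = buffer.getvalue()

# Read the same text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then reappears as a column when the file is read back.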

### **Basic Data Exploration**

#### **Viewing Data**

```python
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
df.info() # Summary of the dataset (prints directly and returns None, so no print() is needed)
print(df.describe()) # Descriptive statistics for the numeric columns
```
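
The calls above assume a DataFrame named `df` already exists. As a small self-contained sketch, the example below creates one with hypothetical employee records (the `Name`, `Age`, and `Salary` values are invented) and applies the same inspection methods.

```python
import pandas as pd

# Hypothetical employee records used only for illustration.
df = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cleo', 'Dev', 'Eli', 'Fay'],
    'Age': [28, 35, 41, 23, 30, 52],
    'Salary': [48000, 52000, 61000, 39000, 50000, 75000],
})

first_two = df.head(2)   # head() and tail() accept a row count
last_two = df.tail(2)
n_rows, n_cols = df.shape
mean_age = df.describe().loc['mean', 'Age']

print(first_two)
print(last_two)
print(n_rows, n_cols, mean_age)
```

`describe()` returns a DataFrame itself, so individual statistics can be pulled out with `.loc` as shown.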

#### **Selecting Columns**

```python
print(df['Column Name']) # Single column
print(df[['Column Name', 'Column Name2']]) # Multiple columns
```
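
The two bracket forms return different types, which matters when chaining further operations. A minimal sketch, using invented sample data:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'Name': ['Ana', 'Ben'], 'Age': [28, 35], 'Salary': [48000, 52000]})

single = df['Name']                # one label returns a Series
subset = df[['Name', 'Salary']]    # a list of labels returns a DataFrame
one_col_frame = df[['Name']]       # a one-item list still returns a DataFrame

print(type(single).__name__, type(subset).__name__, type(one_col_frame).__name__)
```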

#### **Selecting Rows**

##### **.loc - Label-based Indexing**
- The `.loc` indexer is used when you want to select data by label (i.e., by row and column names). It *includes* the end point in a range, which is different from the standard Python slicing.

```python
df.loc[row_label, column_label]
```

- `row_label`: The label (or index value) of the row you want to select.
- `column_label`: The label (or name) of the column you want to select.

- **Selecting a specific row by label**:
```python
df.loc[2] # Selects the row with label/index 2
```

- **Selecting specific rows and columns**:
```python
df.loc[2, 'Salary'] # Selects the 'Salary' value in row with label 2
df.loc[1:3, ['Name', 'Salary']] # Selects the 'Name' and 'Salary' values for rows with labels 1, 2, 3
```

- **Selecting all rows for specific columns**:
```python
df.loc[:, 'Salary'] # Selects the entire 'Salary' column
df.loc[:, ['Name', 'Salary']] # Selects the entire 'Name' and 'Salary' columns
```

- **Conditional selection**:
```python
df.loc[df['Salary']>50000] # Selects rows where Salary is greater than 50000
df.loc[df['Age']<30, 'Name'] # Selects the 'Name' of employees younger than 30
```
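
The `.loc` patterns above can be tried on a small DataFrame. In this sketch the records are hypothetical, and the row labels deliberately start at 1 so that labels and positions do not coincide.

```python
import pandas as pd

# Hypothetical data with row labels 1-4 instead of the default 0-3.
df = pd.DataFrame(
    {'Name': ['Ana', 'Ben', 'Cleo', 'Dev'],
     'Age': [28, 35, 41, 23],
     'Salary': [48000, 52000, 61000, 39000]},
    index=[1, 2, 3, 4],
)

inclusive = df.loc[1:3, ['Name', 'Salary']]     # labels 1, 2 and 3 (end point included)
high_paid = df.loc[df['Salary'] > 50000]        # boolean condition on rows
young_names = df.loc[df['Age'] < 30, 'Name']    # condition plus a column label

print(inclusive)
print(high_paid)
print(young_names)
```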

##### **.iloc - Positional Indexing**
- The `.iloc` indexer is used for integer-location based indexing. It selects rows and columns by their integer positions, similar to how you would slice a list (excludes the last index).

```python
df.iloc[row_index, column_index]
```

- `row_index`: The integer position of the row you want to select.
- `column_index`: The integer position of the column you want to select.

- **Selecting a specific row by position**:
```python
df.iloc[2] # Selects the third row (position 2)
```

- **Selecting specific rows and columns by position**:
```python
df.iloc[1:3, 0:2] # Selects rows at positions 1 and 2, and columns at positions 0 and 1
```

- **Selecting all rows for specific columns by position**:
```python
df.iloc[:, 1] # Selects the second column (position 1)
df.iloc[:, [0,2]] # Selects the first and third columns (positions 0 and 2)
```

- **Conditional selection using `.iloc` along with `.values`**:
- `.iloc` does not accept a boolean `Series` as a mask, so convert the condition to a NumPy array of booleans first (for example with `.values` or `.to_numpy()`).

```python
df.iloc[df['Salary'].values > 50000] # Select rows where Salary > 50000 by position
```
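
The following sketch exercises the `.iloc` patterns above on invented data. String row labels are an assumption chosen to make it obvious that `.iloc` ignores labels and works purely by position.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {'Name': ['Ana', 'Ben', 'Cleo', 'Dev'],
     'Age': [28, 35, 41, 23],
     'Salary': [48000, 52000, 61000, 39000]},
    index=['a', 'b', 'c', 'd'],
)

third_row = df.iloc[2]           # position 2, regardless of the label 'c'
block = df.iloc[1:3, 0:2]        # positions 1-2 and columns 0-1 (end points excluded)

# Convert the boolean Series to a NumPy array before passing it to .iloc.
mask = (df['Salary'] > 50000).to_numpy()
selected = df.iloc[mask]

print(third_row)
print(block)
print(selected)
```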

#### **More Advanced Use Cases**

- **Slicing rows and columns using `.loc`**
```python
df.loc[1:4, 'Name':'Salary'] # Selects rows with labels 1 through 4 and columns from 'Name' through 'Salary' (both ends inclusive)
```

- **Selecting rows based on boolean conditions**
```python
df.loc[df['Age'] > 30, ['Name', 'Salary']] # Selects the 'Name' and 'Salary' for rows where 'Age' > 30
```

- **Modifying Values using `.loc`**
```python
df.loc[df['Age'] > 30, 'Salary'] = 60000 # Sets 'Salary' to 60000 for employees older than 30
```

- **Using `.iloc` to select a block of rows and columns**
```python
df.iloc[0:5, 1:3] # Selects the first five rows (positions 0 to 4) and the columns at positions 1 and 2
```
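
As a sketch of the label slicing and in-place update described above, the example below uses hypothetical records with the default integer index and checks the result of the assignment.

```python
import pandas as pd

# Hypothetical data with the default 0-based integer index.
df = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cleo', 'Dev'],
    'Age': [28, 35, 41, 23],
    'Salary': [48000, 52000, 61000, 39000],
})

# Label slices include both end points for rows and for columns.
window = df.loc[1:2, 'Name':'Age']

# Assign through .loc so the original DataFrame is updated directly.
df.loc[df['Age'] > 30, 'Salary'] = 60000

print(window)
print(df)
```

Assigning through a single `.loc` call, rather than chaining two separate selections, makes sure the change is applied to `df` itself.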

### **Data Cleaning**

- **Handling Missing Values**
```python
df.dropna(subset=['Salary', 'Age'], inplace=True) # Remove rows where any of these columns have missing values
```
- **Renaming Columns**
- **Changing Data Types**
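
The three cleaning steps listed above can be sketched together as follows. The sample records, the new column name `Annual Salary`, and the choice of `int64` are assumptions made for illustration; `rename` and `astype` are standard DataFrame methods, and this version returns new objects rather than using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ages stored as strings and one missing salary.
df = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cleo'],
    'Age': ['28', '35', '41'],
    'Salary': [48000.0, np.nan, 61000.0],
})

# Handling missing values: drop rows where Salary is missing.
cleaned = df.dropna(subset=['Salary'])

# Renaming columns: map old names to new names.
cleaned = cleaned.rename(columns={'Salary': 'Annual Salary'})

# Changing data types: convert string ages and float salaries to integers.
cleaned = cleaned.astype({'Age': 'int64', 'Annual Salary': 'int64'})

print(cleaned)
print(cleaned.dtypes)
```

Dropping the missing rows before the type conversion matters here, because a column containing `NaN` cannot be cast to `int64`.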



