### What is a DataFrame?

A `DataFrame` is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns) in `pandas`. You can think of it like an in-memory spreadsheet or SQL table, or even a dict of Series objects.

Here's a breakdown of its key features:

1. **Two-Dimensional Structure**: Like a table in a spreadsheet, a DataFrame consists of rows and columns. This makes it ideal for representing real-world data like financial records, sports statistics, etc.

2. **Labeled Axes**: Each row and column in a DataFrame has a label. The row labels are collectively called the index (by default, the integers 0 to n-1), and the column labels are the column names.

3. **Heterogeneous Data Types**: A DataFrame can contain different types of data — integers, strings, floating-point numbers, Python objects, and more. Each column typically holds data of the same type.

4. **Size Mutable**: You can add or remove rows and columns from a DataFrame after it has been created.

5. **Functionality**: Pandas provides a vast array of functions to manipulate, transform, and analyze data in a DataFrame. This includes operations like filtering, sorting, groupby, merging, concatenation, and more.

6. **Data Alignment**: One of the key features of Pandas is data alignment. It automatically aligns data in operations involving multiple DataFrames or Series (a one-dimensional array in Pandas).

7. **Handling Missing Data**: Pandas is equipped to handle missing data using methods like `isna()`, `fillna()`, `dropna()`, etc.

8. **Efficient Storage and Processing**: Under the hood, Pandas DataFrames are built on top of NumPy arrays, making them efficient for numerical computations.
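
A few of the features listed above can be seen in a short sketch. The data and variable names here (`people`, `s1`, `s2`) are made up for illustration only:

```python
import pandas as pd

# Hypothetical sample data: each column has its own data type.
people = pd.DataFrame({
    "name": ["Alice", "Bob"],   # strings
    "age": [25, 30],            # integers
    "height_m": [1.68, 1.82],   # floats
})
print(people.dtypes)

# Data alignment: values are matched by index label, not by position.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])
total = s1 + s2  # labels present in only one operand become NaN

# Missing data: detect it, then fill it.
print(total.isna())
filled = total.fillna(0)
print(filled)
```

Labels `a` and `d` appear in only one of the two Series, so their sums are missing until `fillna(0)` replaces them.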

### Creating a DataFrame:

There are many ways to create a DataFrame. Here's one of the simplest ways using a dictionary:


In [3]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


Or you can create an empty DataFrame that has only column names:

In [4]:
df = pd.DataFrame(columns=["Name", "Article", "Quantity"])


### append columns to an empty DataFrame 


In [5]:
df['Name'] = ['Tv', 'PC', 'Desk']
df['Article'] = [97, 600, 200]
df['Quantity'] = [2200, 75, 100]


#### iterate rows


In [6]:
for index, row in df.iterrows():
    print(index, row['Name'])

0 Tv
1 PC
2 Desk


### append rows

In [7]:

df_new_row = pd.DataFrame(
    {'Name': ['Fridge'], 'Article': [97], 'Quantity': [2200]})
df = pd.concat([df, df_new_row])
print(df)


     Name  Article  Quantity
0      Tv       97      2200
1      PC      600        75
2    Desk      200       100
0  Fridge       97      2200
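
Note that the appended row keeps its own label `0`, so the index now contains a duplicate. If unique row labels matter, one option is to pass `ignore_index=True` to `pd.concat`. A minimal sketch, using a separate hypothetical `inventory` frame so the running example is left unchanged:

```python
import pandas as pd

inventory = pd.DataFrame({"Name": ["Tv", "PC", "Desk"],
                          "Article": [97, 600, 200],
                          "Quantity": [2200, 75, 100]})
new_row = pd.DataFrame({"Name": ["Fridge"], "Article": [97], "Quantity": [2200]})

# ignore_index=True discards the old labels and renumbers the rows 0..n-1
inventory = pd.concat([inventory, new_row], ignore_index=True)
print(inventory.index.tolist())  # [0, 1, 2, 3]
```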


### printing headers (column names)


In [8]:
cols = df.columns.tolist()
print("columns are: ", cols)
print("head: ", df.head())

columns are:  ['Name', 'Article', 'Quantity']
head:       Name  Article  Quantity
0      Tv       97      2200
1      PC      600        75
2    Desk      200       100
0  Fridge       97      2200



### Basic Operations:

You can do a variety of operations on DataFrames:


### selecting columns

```
df['Name']
```
or

```
df[['Name', 'Article']]
```

### filtering rows


In [16]:

print(df[df['Article'] > 100])


   Name  Article  Quantity
1    PC      600        75
2  Desk      200       100


### merging/joining data
```
pd.merge(df1, df2, on='key_column')
```
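
The call above assumes two existing frames. A runnable sketch with made-up tables (`df1`, `df2`, and their contents are illustrative) shows how the `how` parameter changes which keys survive:

```python
import pandas as pd

# Hypothetical tables that share a 'key_column'
df1 = pd.DataFrame({"key_column": [1, 2, 3], "name": ["Tv", "PC", "Desk"]})
df2 = pd.DataFrame({"key_column": [2, 3, 4], "price": [600, 200, 97]})

# Default is an inner join: only keys present in both tables (2 and 3)
inner = pd.merge(df1, df2, on="key_column")

# Left join: every key from df1; missing prices become NaN
left = pd.merge(df1, df2, on="key_column", how="left")

print(inner)
print(left)
```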

### accessing rows by index


In [18]:

print("the value of column Name at row index 2:", df.loc[2, ['Name']])


the value of column Name at row index 2: Name    Desk
Name: 2, dtype: object


### updating rows


In [19]:
df.at[1, 'Name'] = 'new value'
print("the value of column Name at row index 1 after update is:",
      df.loc[1, ['Name']])


the value of column Name at row index 1 after update is: Name    new value
Name: 1, dtype: object


### updating rows in loop



In [22]:
for index in df.index:
    df.loc[index, ['Name']] = "*" + df.loc[index, ['Name']] + "*"
print(df.loc[2,['Name']])

Name    *Desk*
Name: 2, dtype: object
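
The same update can usually be written without a loop, because string concatenation works on a whole column at once. A small sketch on a separate, hypothetical `items` frame:

```python
import pandas as pd

items = pd.DataFrame({"Name": ["Tv", "PC", "Desk"]})

# Concatenation is applied element-wise to the entire column
items["Name"] = "*" + items["Name"] + "*"
print(items["Name"].tolist())  # ['*Tv*', '*PC*', '*Desk*']
```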


### delete columns



In [9]:
# Alternatives: df.pop('Quantity') or del df['Quantity']

column_data=df.get('Quantity')
if column_data is not None:
    print("Column exists")
    df.drop('Quantity', inplace=True, axis=1)
    print('removing the Quantity column')
    print(df)
else:
    print("Column does not exist")


Column exists
removing the Quantity column
     Name  Article
0      Tv       97
1      PC      600
2    Desk      200
0  Fridge       97


### rename columns

In [27]:

df.rename(columns={'Article': 'New-Article'}, inplace=True)



### reset index column
```
df = df.reset_index(drop=True)
df.set_index('Name', inplace=True)
print(df)
```
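
To make the effect of the two calls visible, here is a sketch on a small hypothetical frame (`dup`) that starts with a duplicated row label:

```python
import pandas as pd

dup = pd.DataFrame({"Name": ["Tv", "Fridge"], "Article": [97, 97]},
                   index=[0, 0])

# reset_index(drop=True) replaces the labels with 0..n-1 and discards the old ones
dup = dup.reset_index(drop=True)
print(dup.index.tolist())      # [0, 1]

# set_index moves a column into the index, so it is no longer a regular column
by_name = dup.set_index("Name")
print(by_name.index.tolist())  # ['Tv', 'Fridge']
print(by_name.columns.tolist())  # ['Article']
```
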
### remove index when writing to csv
```
df.to_csv('data/tmp.csv', index=False)
```
### read csv 

In [32]:
# index_col=False: do not use the first column as the index
df = pd.read_csv("../data/tmp.csv", index_col=False)
print(df)


   New-Article
0           97
1          600
2          200
3           97
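
Why drop the index when writing? A sketch that round-trips a hypothetical `table` through in-memory CSV text (so no file path is needed) shows what happens otherwise:

```python
import io

import pandas as pd

table = pd.DataFrame({"Name": ["Tv", "PC"], "Article": [97, 600]})

# With no path argument, to_csv returns the CSV text as a string
csv_with_index = table.to_csv()
csv_without_index = table.to_csv(index=False)

# The written index comes back as an extra, unnamed column
cols_with = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Name', 'Article']
print(cols_without)  # ['Name', 'Article']
```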


### processing data without iterating

In Pandas, it's often recommended to avoid explicit iteration (e.g., with `for` loops) over DataFrame rows or Series elements. Iterating over DataFrame rows using methods like `iterrows()` or `itertuples()` can be quite slow. Instead, one should leverage Pandas' built-in vectorized operations, which are optimized for performance.

Here are some common ways to process data in Pandas without iterating:

### 1. **Vectorized Operations**

For arithmetic operations, you can apply them directly to the whole DataFrame or Series.



In [26]:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['A'] = df['A'] * 2


### 2. **Using `apply()`**

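You can apply a function to every element of a Series, or along an axis (column by column, or row by row) of a DataFrame. Because `apply()` calls your Python function repeatedly, it is usually slower than a true vectorized operation.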

```python
def some_function(x):
    return x * x

df['A'] = df['A'].apply(some_function)
```
or with a lambda function:

```python
df["age"]=df["age"].apply(lambda x:x*2)
```
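
On a whole DataFrame, the `axis` argument decides whether the function receives columns or rows. A minimal sketch with a hypothetical `df_apply` frame:

```python
import pandas as pd

df_apply = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# axis=0 (the default): the function receives one column at a time
col_range = df_apply.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_diff = df_apply.apply(lambda row: row["B"] - row["A"], axis=1)

print(col_range.tolist())  # [2, 2]
print(row_diff.tolist())   # [3, 3, 3]
```

Both results here could also be computed with plain column arithmetic, which is generally the faster choice when it is available.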

### 3. **Using `map()` for Series**

The `map()` function is used to map each value in a Series to some other value.
```python
s = pd.Series(['cat', 'dog', 'mouse'])
s = s.map(str.upper)
```
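
`map()` also accepts a dictionary, which is handy for lookups. In this sketch the `size_by_animal` mapping is invented for illustration; values without an entry become `NaN`:

```python
import pandas as pd

animals = pd.Series(["cat", "dog", "mouse"])
size_by_animal = {"cat": "small", "dog": "medium"}  # 'mouse' is deliberately absent

sizes = animals.map(size_by_animal)
print(sizes)
```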


### 4. **Boolean Indexing**

You can filter rows based on some condition without iterating.

```python
df_filtered = df[df['A'] > 2]
```

### 5. **Using `where()`**

The `where()` function keeps values where the condition is `True` and replaces the rest with the given value. In the example below, values of `A` that are not greater than 2 become 0.

```python
df['A'] = df['A'].where(df['A'] > 2, 0)
```
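
Because the direction of the condition is easy to get backwards, here is a sketch with concrete values. It also shows `mask()`, which replaces values where the condition is `True` instead; the frame `df_w` is illustrative:

```python
import pandas as pd

df_w = pd.DataFrame({"A": [2, 4, 6]})

# where: keep values that satisfy the condition, replace the others
kept = df_w["A"].where(df_w["A"] > 2, 0)

# mask: the inverse, replace values that satisfy the condition
masked = df_w["A"].mask(df_w["A"] > 2, 0)

print(kept.tolist())    # [0, 4, 6]
print(masked.tolist())  # [2, 0, 0]
```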

### 6. **Using `assign()` for Creating New Columns**

```python
df = df.assign(C = df['A'] + df['B'])
```

### 7. **String Operations with `str` Accessor**

You can apply string operations directly on a Series without iterating.

```python
df['text_column'] = df['text_column'].str.lower()
```

### 8. **Datetime Operations with `dt` Accessor**

If you have a datetime column, you can extract components without iterating.

```python
df['year'] = df['date_column'].dt.year
```
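
The `dt` accessor only works on datetime columns, so strings need converting first. A sketch assuming a hypothetical `events` frame whose dates arrive as text:

```python
import pandas as pd

events = pd.DataFrame({"date_column": ["2023-01-15", "2024-06-30"]})

# Convert the strings to datetimes before using .dt
events["date_column"] = pd.to_datetime(events["date_column"])

events["year"] = events["date_column"].dt.year
events["month"] = events["date_column"].dt.month
print(events)
```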

### 9. **Aggregation Functions**

Aggregation functions like `sum()`, `mean()`, `max()`, etc., are inherently vectorized.

```python
total = df['A'].sum()
```

### 10. **Using `eval()` for Computation**
`eval()` evaluates a string expression that refers to columns by name. For large DataFrames, `eval()` can be faster than the equivalent standard operations.

```python
df.eval('age=age+quantity', inplace=True)
```
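
A runnable variant, assuming a hypothetical `orders` frame with `age` and `quantity` columns. Without `inplace=True`, `eval()` returns a new DataFrame, which lets you assign to a new column name:

```python
import pandas as pd

orders = pd.DataFrame({"age": [1, 2, 3], "quantity": [10, 20, 30]})

# The expression refers to columns by name; 'total' is created as a new column
orders = orders.eval("total = age + quantity")
print(orders["total"].tolist())  # [11, 22, 33]
```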

The key takeaway is that Pandas provides numerous built-in functionalities that allow you to perform operations on entire columns or DataFrames at once. By leveraging these capabilities, you can make your code cleaner, more concise, and often much faster.



### iloc
`iloc` stands for "integer-location" and is used primarily for selecting rows and columns in a DataFrame by their integer index positions.

Here's a basic overview:

1. **Single Selection**
    ```python
    import pandas as pd
    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]
    })

    # Select the value in the first row and first column
    value = df.iloc[0, 0]  # returns 1
    ```

2. **Selecting Rows**
    ```python
    # Select the first row
    first_row = df.iloc[0]

    # Select the first and second rows
    first_two_rows = df.iloc[0:2]
    ```

3. **Selecting Columns**
    ```python
    # Select the first column
    first_column = df.iloc[:, 0]

    # Select the first and second columns
    first_two_columns = df.iloc[:, 0:2]
    ```

4. **Selecting Multiple Rows and Columns**
    ```python
    # Select the first two rows and the first two columns
    subset = df.iloc[0:2, 0:2]
    ```

5. **Using Lists**
    ```python
    # Select the first and third rows and the first and third columns
    subset = df.iloc[[0, 2], [0, 2]]
    ```

6. **Conditional Selection (using boolean indexing)**
    `iloc` does not accept a boolean Series as a mask. You can either filter the DataFrame with the mask first and then use `iloc` on the result, as shown here, or pass a plain boolean array (for example `mask.to_numpy()`) to `iloc`:
    ```python
    mask = df['A'] > 1  # Boolean mask where column 'A' is greater than 1
    filtered_rows = df[mask]

    # Using iloc on the filtered dataframe
    subset = filtered_rows.iloc[:, 0:2]
    ```

Remember, `iloc` uses integer positions, so it ignores row labels and column names, unlike `loc`, which is label-based. A single position that is out of bounds raises an `IndexError`, whereas an out-of-range slice is silently truncated. Make sure the positions you use are valid for the DataFrame you are working with.
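
The difference between position and label is easiest to see when the index is not the default 0..n-1. A small sketch with a hypothetical frame `df_lbl` whose row labels are 10, 20, 30:

```python
import pandas as pd

df_lbl = pd.DataFrame({"A": [1, 2, 3]}, index=[10, 20, 30])

by_position = df_lbl.iloc[0, 0]   # first row by position
by_label = df_lbl.loc[10, "A"]    # the same cell, addressed by its label

# A single out-of-bounds position raises IndexError
try:
    df_lbl.iloc[5]
    raised = False
except IndexError:
    raised = True

# A slice that runs past the end is simply truncated
tail = df_lbl.iloc[1:10]

print(by_position, by_label, raised, len(tail))  # 1 1 True 2
```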

# Practical Advice

Working with Pandas DataFrames is a fundamental aspect of data analysis and manipulation in Python. Here are some practical pieces of advice for using Pandas DataFrames effectively:

1. **Understand Data Types**: Be aware of the data types in your DataFrame. Using the appropriate data type (like `int`, `float`, `datetime`, etc.) can save memory and improve performance.

2. **Use Vectorized Operations**: Leverage Pandas' vectorized operations instead of applying functions using loops. Vectorized operations are more efficient and concise.

3. **Handling Missing Data**: Learn how to handle missing data effectively. Methods like `dropna()`, `fillna()`, and boolean indexing are crucial for cleaning and preparing your data.

4. **Efficient Data Loading**: When reading large datasets, use parameters in `read_csv` (or similar functions) like `dtype`, `usecols`, and `chunksize` to control memory usage and only load necessary data.

5. **Indexing and Selecting Data**: Master the use of `loc[]` and `iloc[]` for label-based and integer-based indexing. Understand how to slice and dice the data efficiently.

6. **Avoid Chained Assignment**: Chained assignment (like `df[a][b] = value`) can lead to unexpected results due to Pandas' copying behavior. Prefer using `loc[]` or `iloc[]`.

7. **Use `groupby` Wisely**: Grouping data using `groupby` is a powerful feature. Combine it with aggregate functions (`sum`, `mean`, `count`, etc.) to summarize data.

8. **Datetime Handling**: If dealing with time series data, make use of Pandas' `datetime` capabilities for parsing, formatting, and manipulating dates and times.

9. **Memory Management**: For large datasets, consider using categories with `pd.Categorical` for object-type columns with few unique values to save memory.

10. **Leverage `apply` and `map`**: These functions are useful for applying a function across columns or elements but be mindful of their performance implications.

11. **Use MultiIndex for Complex Data**: MultiIndex (hierarchical indices) can be very helpful for complex data analysis and can make data aggregation tasks easier.

12. **Regularly Check Data**: Use methods like `head()`, `tail()`, `info()`, and `describe()` to regularly inspect your data throughout your analysis for sanity checks.

13. **Opt for In-built Functions**: Whenever possible, use Pandas' built-in functions which are optimized for performance over custom implementations.
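
Three of the tips above (categorical columns, avoiding chained assignment, and `groupby`) can be combined in one short sketch. The `sales` data is invented for illustration, and the exact memory figures will vary by pandas version:

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Paris", "London", "Paris", "London"] * 250,
    "amount": [10.0, 20.0, 30.0, 40.0] * 250,
})

# Memory: a column with few unique values is much smaller as a category
string_bytes = sales["city"].memory_usage(deep=True)
sales["city"] = sales["city"].astype("category")
category_bytes = sales["city"].memory_usage(deep=True)
print(string_bytes, category_bytes)

# Assignment: one .loc call selects rows and column together,
# instead of the chained form sales[...]["amount"] = ...
sales.loc[sales["amount"] > 30, "amount"] = 30.0

# groupby plus an aggregate summarizes each group
totals = sales.groupby("city", observed=True)["amount"].sum()
print(totals)
```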



# iloc, loc, and at
`iloc`, `loc`, and `at` are three indexers that pandas provides for accessing data in a DataFrame or Series. Each has a different use case:

1. **`iloc`**:
   - **Purpose**: `iloc` is used for selecting data based on integer-location based indexing. It means that you use integers to indicate the position of the rows and columns you want to access.
   - **Syntax**: `df.iloc[<row selection>, <column selection>]`.
   - **Use Case**: You use `iloc` when you want to access elements by their integer index (like in an array). It is purely position based and not label based.
   - **Example**: `df.iloc[0, 1]` would access the element at the first row and second column.

2. **`loc`**:
   - **Purpose**: `loc` is used for selecting data based on label-based indexing. It means that you use the names or labels of the rows and columns to access the data.
   - **Syntax**: `df.loc[<row selection>, <column selection>]`.
   - **Use Case**: You use `loc` when you want to access elements using their index labels or boolean arrays.
   - **Example**: `df.loc['row_label', 'column_label']` would access the element at the specified row and column labels.

3. **`at`**:
   - **Purpose**: `at` is used for accessing a single value for a row/column label pair. It is similar to `loc` but is optimized for accessing a single element.
   - **Syntax**: `df.at[<row label>, <column label>]`.
   - **Use Case**: You use `at` when you know the exact location (row and column labels) of the element you want to access and you need to access it quickly. It's faster than `loc` for accessing single values.
   - **Example**: `df.at['row_label', 'column_label']` would access the single element at the specified row and column labels.

Here's a quick comparison:

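- **Speed**: for a single value, `at` is generally faster than `loc` or `iloc` because it skips the general-purpose indexing logic. The relative speed of `iloc` and `loc` depends on the index and the pandas version, so measure if it matters.
- **Selection by**: `iloc` uses integer positions, `loc` uses labels, and `at` is for single-value access by labels.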
- **Use for multiple elements**: `iloc` and `loc` can be used to access multiple elements (slices, lists of index/labels), while `at` is strictly for single elements.

Choosing between these depends on your specific needs: whether your data selection is based on index position or labels, and whether you're retrieving single values or subsets of the DataFrame.
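
A side-by-side sketch of the three indexers on one hypothetical frame (`prices`, with string row labels) may help:

```python
import pandas as pd

prices = pd.DataFrame({"price": [97, 600], "stock": [5, 2]},
                      index=["tv", "pc"])

by_iloc = prices.iloc[0, 1]           # first row, second column, by position
by_loc = prices.loc["pc", "price"]    # by row and column labels
by_at = prices.at["tv", "price"]      # single value by labels

# loc can return several elements at once; at cannot
stock_column = prices.loc[["tv", "pc"], "stock"]

# at is also a convenient way to set one cell
prices.at["tv", "price"] = 120

print(by_iloc, by_loc, by_at)  # 5 600 97
```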

# check if a column exists
To check if a column exists in a DataFrame in Pandas, you can use one of the following methods:

### 1. Using the `in` Keyword
You can simply use the `in` keyword to check if a column label is present in the DataFrame's columns.

```python
if 'column_name' in df.columns:
    print("Column exists")
else:
    print("Column does not exist")
```

### 2. Using the `.columns` Attribute with `any()`
You can use the `.columns` attribute and check if any of the column names matches the desired name.

```python
if any(df.columns == 'column_name'):
    print("Column exists")
else:
    print("Column does not exist")
```

### 3. Try-Except Block
Although less common for this purpose, you can use a try-except block. This is particularly useful if you want to perform an operation on the column and handle the case where the column doesn't exist.

```python
try:
    # Attempt to access or perform operations on the column
    df['column_name']
    print("Column exists")
except KeyError:
    print("Column does not exist")
```

### 4. Using the `.get()` Method
This method is more about safely accessing a column rather than just checking its existence. It returns `None` or a specified default value if the column does not exist.

```python
column_data = df.get('column_name')
if column_data is not None:
    print("Column exists")
else:
    print("Column does not exist")
```

The first method using the `in` keyword is the most straightforward and commonly used approach for this purpose.
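
When several columns are required at once, a set difference reports all the missing names in one step. This is only a sketch; `missing_columns` is a hypothetical helper, not a pandas function:

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from frame, sorted."""
    return sorted(set(required) - set(frame.columns))

stock = pd.DataFrame({"Name": ["Tv"], "Article": [97]})

print(missing_columns(stock, ["Name", "Article"]))   # []
print(missing_columns(stock, ["Name", "Quantity"]))  # ['Quantity']
```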