**Series:**

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

In [17]:
import pandas as pd
# Creating a Series from a list
series1 = pd.Series([10, 20, 30, 40, 50])
print(series1)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [16]:
# Creating a Series with custom index labels
series2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(series2)

a    10
b    20
c    30
dtype: int64


In [15]:
# Accessing elements
print(series1[0])  # Output: 10
print(series2['b']) # Output: 20

10
20


**DataFrames:**

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.  It's like a table in a database or a spreadsheet.

In [21]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris


In [22]:
# Creating a DataFrame from a list of dictionaries
data2 = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
         {'Name': 'Bob', 'Age': 30, 'City': 'London'},
         {'Name': 'Charlie', 'Age': 28, 'City': 'Paris'}]
df2 = pd.DataFrame(data2)
print(df2)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris


In [23]:
# Accessing columns
print(df['Name'])  # Output: a Series containing the names
print(df[['Name', 'Age']]) # Output: a DataFrame with 'Name' and 'Age' columns

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   28


In [24]:
# Accessing rows (using .loc or .iloc)
print(df.loc[0])    # Output: the first row (indexed by label)
print(df.iloc[0])   # Output: the first row (indexed by position)

Name       Alice
Age           25
City    New York
Name: 0, dtype: object
Name       Alice
Age           25
City    New York
Name: 0, dtype: object


In [36]:
# Adding a new column
df['Salary'] = [60000, 70000, 65000]
print(df)

      Name  Age      City  Salary
0    Alice   25  New York   60000
1      Bob   30    London   70000
2  Charlie   28     Paris   65000


In [37]:
# Deleting a column
df = df.drop('Salary', axis=1) # axis=1 means column
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris


In [41]:
# Deleting a row
df = df.drop(0, axis=0) # axis=0 means row
print(df)

      Name  Age    City
1      Bob   30  London
2  Charlie   28   Paris


In [40]:
# Renaming columns
df = df.rename(columns={'Name': 'First Name', 'Age': 'Years'})
print(df)

# Resetting the index
df = df.reset_index(drop=True) # drop=True avoids adding the old index as a new column
print(df)

  First Name  Years    City
1        Bob     30  London
2    Charlie     28   Paris
  First Name  Years    City
0        Bob     30  London
1    Charlie     28   Paris


**Reading and Writing Data:**

Pandas can read data from and write data to various file formats:

In [10]:
# Reading from a CSV file
df_csv = pd.read_csv('data/data.csv')

# Writing to a CSV file
df.to_csv('data/output_file.csv', index=False) # index=False prevents writing the index

# Reading from an Excel file
#df_excel = pd.read_excel('your_file.xlsx') # You might need to install openpyxl: pip install openpyxl

# Writing to an Excel file
#df.to_excel('output_file.xlsx', index=False)

**Data Manipulation:**

*   **Filtering:**
A comparison such as `df['Years'] > 25` produces a boolean Series of `True` and `False` values. Indexing the DataFrame with that Series returns only the rows where the value is `True`.

In [12]:
# Get rows where Years is greater than 25
filtered_df = df[df['Years'] > 25]
print(df['Years'] > 25)

print(filtered_df)

0    True
1    True
Name: Years, dtype: bool
  First Name  Years    City
0        Bob     30  London
1    Charlie     28   Paris
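
Several conditions can be combined into one mask. The sketch below uses a small stand-in table (`staff` is a hypothetical name, with the same columns as above) to show the `&` operator and `isin()`:

```python
import pandas as pd

# Hypothetical stand-in data with the same columns as the example above
staff = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                      'Years': [25, 30, 28],
                      'City': ['New York', 'London', 'Paris']})

# Combine masks with & (and), | (or), ~ (not); wrap each condition in parentheses
over_25_not_london = staff[(staff['Years'] > 25) & (staff['City'] != 'London')]

# isin() tests membership in a list of values
in_selected_cities = staff[staff['City'].isin(['Paris', 'New York'])]

print(over_25_not_london)
print(in_selected_cities)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.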


*   **Sorting:**

In [None]:
sorted_df = df.sort_values('Years', ascending=False)  # Sort by Years in descending order
print(sorted_df)

*   **Grouping:**

In [None]:
grouped_df = df.groupby('City')['Years'].mean()  # Group by City and calculate the mean of Years
print(grouped_df)
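
Grouping is easier to see when a group has more than one row. A minimal sketch, assuming a hypothetical `sales` table in which each city appears twice:

```python
import pandas as pd

# Hypothetical data: two rows per city so the groups are non-trivial
sales = pd.DataFrame({'City': ['London', 'Paris', 'London', 'Paris'],
                      'Years': [30, 28, 40, 22]})

# One aggregate per group
mean_by_city = sales.groupby('City')['Years'].mean()

# Several aggregates at once with agg()
summary = sales.groupby('City')['Years'].agg(['mean', 'max', 'count'])

print(mean_by_city)
print(summary)
```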

*   **Applying functions:**

In [None]:
#Apply a function to a column
df['Years_Doubled'] = df['Years'].apply(lambda x: x * 2)
print(df)

**Missing Data:**

Pandas provides tools for handling missing data:

In [None]:
# Check for missing values
print(df.isnull())

# Fill missing values
df.fillna(0, inplace=True)  # Fill with 0
# or
df['Years'] = df['Years'].fillna(df['Years'].mean()) # Fill with mean of the 'Years' column (assign back; inplace on a selected column may not update df)

# Drop rows with missing values
df.dropna(inplace=True)
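
The DataFrame used so far has no missing values, so the calls above have nothing to do. The following sketch uses made-up data (`people` is a hypothetical name) that does contain gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages and one missing city
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Years': [25.0, np.nan, 28.0, np.nan],
    'City': ['New York', 'London', None, 'Paris'],
})

# Count missing values per column
missing_counts = people.isna().sum()

# Fill one column by assigning the result back
filled = people.copy()
filled['Years'] = filled['Years'].fillna(filled['Years'].mean())

# Drop only the rows where 'City' is missing
with_city = people.dropna(subset=['City'])

print(missing_counts)
print(filled)
print(with_city)
```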

**Basic Statistics:**

In [None]:
print(df.describe())  # Get descriptive statistics (mean, std, min, max, etc.)
print(df['Years'].mean())
print(df['Years'].median())
print(df['Years'].max())

**The Key Difference: Labels vs. Positions**

*   **`df.loc`**: This is primarily label-based. You use it when you know the specific row and/or column *names* (labels) you want to access.
*   **`df.iloc`**: This is integer-based. You use it when you know the *numerical position* of the rows and/or columns you want.


In [27]:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data, index=['person1', 'person2', 'person3'])  # Custom index labels
print(df)

            Name  Age      City
person1    Alice   25  New York
person2      Bob   30    London
person3  Charlie   28     Paris


This creates a DataFrame with custom row labels ('person1', 'person2', 'person3') and column labels taken from the dictionary keys ('Name', 'Age', 'City').

**Using `loc`**

In [28]:
# Select the row with label 'person2'
print(df.loc['person2'])

# Select the 'Age' column for the row with label 'person1'
print(df.loc['person1', 'Age'])

# Select multiple rows and columns using labels
print(df.loc[['person1', 'person3'], ['Name', 'City']])

# Slicing with labels (inclusive of the end label)
print(df.loc['person1':'person3', 'Name':'City'])

Name       Bob
Age         30
City    London
Name: person2, dtype: object
25
            Name      City
person1    Alice  New York
person3  Charlie     Paris
            Name  Age      City
person1    Alice   25  New York
person2      Bob   30    London
person3  Charlie   28     Paris


**Using `iloc`**

In [29]:
# Select the row at position 1 (second row)
print(df.iloc[1])

# Select the element at row position 0 and column position 1
print(df.iloc[0, 1])

# Select multiple rows and columns using positions
print(df.iloc[[0, 2], [0, 2]])

# Slicing with positions (exclusive of the end position)
print(df.iloc[0:3, 0:3])

Name       Bob
Age         30
City    London
Name: person2, dtype: object
25
            Name      City
person1    Alice  New York
person3  Charlie     Paris
            Name  Age      City
person1    Alice   25  New York
person2      Bob   30    London
person3  Charlie   28     Paris


**Important Notes:**

*   **Slicing:** When using `loc` with slices, the end label is *inclusive*. When using `iloc` with slices, the end position is *exclusive* (like regular Python slicing).
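*   **Mixed Indexing:** `loc` accepts only labels and `iloc` accepts only positions, so the two cannot be mixed in a single call. To combine a label with a position, convert one side first, for example with `df.columns[pos]` or `df.index.get_loc(label)`.
*   **Boolean Indexing:** Both can filter with boolean masks: `loc` takes a boolean Series or array, while `iloc` takes a boolean array or list but not a Series.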
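
The notes above can be checked with a short sketch. `df_demo` is a hypothetical name for a small frame with custom row labels:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]},
    index=['person1', 'person2', 'person3'],
)

# loc slice: both endpoints are included, so this yields 2 rows
by_label = df_demo.loc['person1':'person2']

# iloc slice: the end position is excluded, so this yields 1 row
by_position = df_demo.iloc[0:1]

# Boolean filtering: loc accepts a boolean Series,
# iloc needs a plain boolean array (hence .to_numpy())
mask = df_demo['Age'] > 26
older_loc = df_demo.loc[mask, 'Name']
older_iloc = df_demo.iloc[mask.to_numpy(), 0]

# Row label plus column position: translate the position to a label first
age_of_person2 = df_demo.loc['person2', df_demo.columns[1]]

print(len(by_label), len(by_position))
print(older_loc.tolist(), older_iloc.tolist(), age_of_person2)
```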

**When to Use Which**

*   Use `loc` when you're working with data where you know the row and column labels, or when you want to use meaningful names to select data.
*   Use `iloc` when you're working with data where you only know the numerical positions of the rows and columns, or when you need to iterate through rows/columns by position.


### What is a DataFrame?

A `DataFrame` is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns) in `pandas`. You can think of it like an in-memory spreadsheet or SQL table, or even a dict of Series objects.

Here's a breakdown of its key features:

1. **Two-Dimensional Structure**: Like a table in a spreadsheet, a DataFrame consists of rows and columns. This makes it ideal for representing real-world data like financial records, sports statistics, etc.

2. **Labeled Axes**: Each row and column in a DataFrame has a label. The row labels are collectively called the index (a default integer range unless you specify one), and the column labels are the column names.

3. **Heterogeneous Data Types**: A DataFrame can contain different types of data — integers, strings, floating-point numbers, Python objects, and more. Each column typically holds data of the same type.

4. **Size Mutable**: You can add or remove rows and columns from a DataFrame after it has been created.

5. **Functionality**: Pandas provides a vast array of functions to manipulate, transform, and analyze data in a DataFrame. This includes operations like filtering, sorting, groupby, merging, concatenation, and more.

6. **Data Alignment**: One of the key features of Pandas is data alignment. It automatically aligns data in operations involving multiple DataFrames or Series (a one-dimensional array in Pandas).

7. **Handling Missing Data**: Pandas is equipped to handle missing data using methods like `isna()`, `fillna()`, `dropna()`, etc.

8. **Efficient Storage and Processing**: Under the hood, Pandas DataFrames are built on top of NumPy arrays, making them efficient for numerical computations.

Here's a simple example of a Pandas DataFrame:

```python
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```

This code will output a DataFrame with three columns (Name, Age, City) and three rows, each representing a person's record.
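
Of the features listed above, data alignment is probably the least obvious. A minimal sketch with two Series whose labels only partly overlap (`s1` and `s2` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position;
# labels present in only one operand produce NaN
total = s1 + s2
print(total)

# With fill_value, a label missing from one side is treated as 0
total_filled = s1.add(s2, fill_value=0)
print(total_filled)
```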







### Creating a DataFrame:

There are many ways to create a DataFrame. Here's one of the simplest ways using a dictionary:

```python

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}

df = pd.DataFrame(data)
```

This will give:

```
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
```

Or you can create an empty DataFrame with named columns:

```
df = pd.DataFrame(columns=["Name", "Article", "Quantity"])
```

### append columns to an empty DataFrame 
```
df['Name'] = ['Tv', 'PC', 'Desk']
df['Article'] = [97, 600, 200]
df['Quantity'] = [2200, 75, 100]
```


#### iterate rows
```
for index, row in df.iterrows():
    print(index, row['Name'])
```

### append rows
```
df_new_row = pd.DataFrame(
    {'Name': ['Fridge'], 'Article': [97], 'Quantity': [2200]})
# ignore_index=True renumbers the rows so the new row does not reuse label 0
df = pd.concat([df, df_new_row], ignore_index=True)
```


### printing headers (columns names)
```
cols = df.columns.tolist()
print("columns are: ", cols)
print("head: ", df.head())
```


### Basic Operations:

You can do a variety of operations on DataFrames:


### selecting columns

```
df['Name']
```
or

```
df[['Name', 'Quantity']]
```

### filtering rows
```
df[df['Quantity'] > 100]
```

### merging/ joining data

```
pd.merge(df1, df2, on='key_column')
```
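
The call above assumes two existing frames with a shared column. A self-contained sketch, with hypothetical `customers` and `orders` tables joined on an assumed `customer_id` key:

```python
import pandas as pd

# Hypothetical tables sharing a 'customer_id' key
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'customer_id': [1, 1, 3, 4],
                       'amount': [50, 20, 75, 10]})

# Inner join (the default): only keys present in both tables
inner = pd.merge(customers, orders, on='customer_id')

# Left join: keep every customer, with NaN where there is no order
left = pd.merge(customers, orders, on='customer_id', how='left')

print(inner)
print(left)
```

The `how` parameter also accepts `'right'` and `'outer'` when rows from the other table, or from both, should be kept.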

### accessing rows by index
```
print("the value of column Name at row index 1:", df.loc[1, 'Name'])
```
### updating rows
```
df.at[1, 'Name'] = 'new value'
print("the value of column Name at row index 1 after update is:",
      df.loc[1, 'Name'])
```
### updating rows in loop
```
for index in df.index:
    df.loc[index, 'Name'] = "*" + df.loc[index, 'Name'] + "*"
# Equivalent without a loop: df['Name'] = "*" + df['Name'] + "*"
```


### delete columns
```
# Alternatives: df.pop('Quantity') or del df['Quantity']
df.drop('Quantity', inplace=True, axis=1)
print('removing the Quantity column')
print(df)
```

### rename columns
```
df.rename(columns={'Article': 'New-Article'}, inplace=True)
```

### reset index column
```
df = df.reset_index(drop=True)
df.set_index('Name', inplace=True)
print(df)
```
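### remove index when writing to csv
```
# index=False leaves the index out of the file; if the index holds data
# (for example after set_index('Name')), call df.reset_index() first
df.to_csv('data/tmp.csv', index=False)
```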
### read csv 
```
df = pd.read_csv("data/tmp.csv", index_col=False)
print(df)
```


### processing data without iterating

In Pandas, it's often recommended to avoid explicit iteration (e.g., with `for` loops) over DataFrame rows or Series elements. Iterating over DataFrame rows using methods like `iterrows()` or `itertuples()` can be quite slow. Instead, one should leverage Pandas' built-in vectorized operations, which are optimized for performance.

Here are some common ways to process data in Pandas without iterating:

### 1. **Vectorized Operations**

For arithmetic operations, you can apply them directly to the whole DataFrame or Series.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['A'] = df['A'] * 2
```

### 2. **Using `apply()`**

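`Series.apply()` calls a function on every value of a column, and `DataFrame.apply()` calls it on each column or row along an axis. The function still runs in Python one element at a time, so prefer a vectorized expression when one exists.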

```python
def some_function(x):
    return x * x

df['A'] = df['A'].apply(some_function)
```
Or with a lambda function:

```python
df['A'] = df['A'].apply(lambda x: x * 2)
```

### 3. **Using `map()` for Series**

The `map()` function is used to map each value in a Series to some other value.

```python
s = pd.Series(['cat', 'dog', 'mouse'])
s = s.map(str.upper)
```

### 4. **Boolean Indexing**

You can filter rows based on some condition without iterating.

```python
df_filtered = df[df['A'] > 2]
```

### 5. **Using `where()`**

`where()` keeps the values for which the condition is `True` and replaces all others with the given value. The example below keeps values greater than 2 and sets the rest to 0.

```python
df['A'] = df['A'].where(df['A'] > 2, 0)
```

### 6. **Using `assign()` for Creating New Columns**

```python
df = df.assign(C = df['A'] + df['B'])
```

### 7. **String Operations with `str` Accessor**

You can apply string operations directly on a Series without iterating.

```python
df['text_column'] = df['text_column'].str.lower()
```

### 8. **Datetime Operations with `dt` Accessor**

If you have a datetime column, you can extract components without iterating.

```python
df['year'] = df['date_column'].dt.year
```
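
Both accessors depend on the column's dtype: `.dt` in particular only works once the column holds datetimes. A small sketch, assuming a hypothetical `events` frame that reuses the placeholder column names from the snippets above:

```python
import pandas as pd

# Hypothetical data; the column names are placeholders
events = pd.DataFrame({
    'text_column': ['Hello', 'WORLD'],
    'date_column': ['2023-01-15', '2024-07-04'],
})

# .str works on string columns
events['text_column'] = events['text_column'].str.lower()

# .dt requires a datetime dtype, so parse the strings first
events['date_column'] = pd.to_datetime(events['date_column'])
events['year'] = events['date_column'].dt.year
events['month'] = events['date_column'].dt.month

print(events)
```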

### 9. **Aggregation Functions**

Aggregation functions like `sum()`, `mean()`, `max()`, etc., are inherently vectorized.

```python
total = df['A'].sum()
```

### 10. **Using `eval()` for Computation**
`eval()` evaluates a string expression in which column names can be used as variables. For large DataFrames, `eval()` can be faster than standard operations.

```python
df.eval('A = A + B', inplace=True)
```

The key takeaway is that Pandas provides numerous built-in functionalities that allow you to perform operations on entire columns or DataFrames at once. By leveraging these capabilities, you can make your code cleaner, more concise, and often much faster.
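
To make the contrast concrete, here is a sketch that computes the same result with a row loop and with a vectorized expression. The `prices` frame and its columns are made up for illustration:

```python
import pandas as pd

prices = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})

# Row-by-row version: works, but runs a Python-level loop
loop_totals = []
for _, row in prices.iterrows():
    loop_totals.append(row['price'] * row['qty'])

# Vectorized version: one expression over whole columns
vector_totals = prices['price'] * prices['qty']

print(loop_totals)
print(vector_totals.tolist())
```

Both versions give the same numbers; the vectorized one avoids building a Series object for every row.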



### iloc
`iloc` stands for "integer-location" and selects rows and columns of a DataFrame by their integer positions.

Here's a basic overview:

1. **Single Selection**
    ```python
    import pandas as pd
    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]
    })

    # Select the value in the first row and first column
    value = df.iloc[0, 0]  # returns 1
    ```

2. **Selecting Rows**
    ```python
    # Select the first row
    first_row = df.iloc[0]

    # Select the first and second rows
    first_two_rows = df.iloc[0:2]
    ```

3. **Selecting Columns**
    ```python
    # Select the first column
    first_column = df.iloc[:, 0]

    # Select the first and second columns
    first_two_columns = df.iloc[:, 0:2]
    ```

4. **Selecting Multiple Rows and Columns**
    ```python
    # Select the first two rows and the first two columns
    subset = df.iloc[0:2, 0:2]
    ```

5. **Using Lists**
    ```python
    # Select the first and third rows and the first and third columns
    subset = df.iloc[[0, 2], [0, 2]]
    ```

6. **Conditional Selection (using boolean indexing)**
    `iloc` accepts a boolean array or list, but not a boolean Series. One option is to filter with ordinary boolean indexing first and then apply `iloc` to the result:
    ```python
    mask = df['A'] > 1  # Boolean mask where column 'A' is greater than 1
    filtered_rows = df[mask]

    # Using iloc on the filtered dataframe
    subset = filtered_rows.iloc[:, 0:2]
    ```

Remember, `iloc` uses integer positions only, so it ignores row labels and column names, unlike the label-based `loc`. Requesting a single position or a list of positions that is out of bounds raises an `IndexError`, whereas an out-of-range slice is silently truncated. Make sure the positions you use are valid for the DataFrame you are working with.
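
A short sketch of the out-of-bounds behaviour, and of passing a boolean array straight to `iloc` (`df_abc` is an illustrative name for the same three-column frame used above):

```python
import pandas as pd

df_abc = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# A single position past the end raises IndexError
try:
    df_abc.iloc[5]
    raised = False
except IndexError:
    raised = True

# An out-of-range slice is truncated, as with Python lists
tail = df_abc.iloc[1:10]

# A boolean NumPy array can be passed to iloc directly
mask = (df_abc['A'] > 1).to_numpy()
subset = df_abc.iloc[mask, 0:2]

print(raised)
print(tail)
print(subset)
```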