# Tutorial: DataFrame Preparation and Manipulation with Pandas
Pandas is a Python library for data manipulation and analysis. Its central data structure is the DataFrame, a two-dimensional labeled table whose columns can hold different data types. This tutorial covers the most common operations for preparing and manipulating a DataFrame with Pandas.

## 1. Installing Pandas
If you haven't already installed Pandas, you can install it with pip by running `pip install pandas`.

## 2. Importing Pandas
Before using Pandas, you need to import it into your Python script or notebook:

In [16]:
import pandas as pd

## 3. Creating a DataFrame
You can create a DataFrame from various data sources such as lists, dictionaries, NumPy arrays, CSV files, Excel files, and more. Here's how you can create a DataFrame from a dictionary:

In [17]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Height': [5.2, 6.0, 5.6, 5.10],  # 5.10 is the float 5.1, as the output shows
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}

df = pd.DataFrame(data)
print(df)

      Name  Age  Height         City
0    Alice   25     5.2     New York
1      Bob   30     6.0  Los Angeles
2  Charlie   35     5.6      Chicago
3    David   40     5.1      Houston
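
The constructor also accepts the other in-memory inputs listed above. As a small sketch (the names `records`, `array`, `df_records`, and `df_array` are made up for illustration), the following builds DataFrames from a list of dictionaries and from a NumPy array:

```python
import numpy as np
import pandas as pd

# A list of dictionaries: each dictionary becomes one row.
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_records = pd.DataFrame(records)

# A 2-D NumPy array: column labels are supplied separately.
# The array holds floats, so both columns are stored as float64.
array = np.array([[25, 5.2], [30, 6.0]])
df_array = pd.DataFrame(array, columns=['Age', 'Height'])

print(df_records)
print(df_array)
```

Files are read with dedicated functions such as `pd.read_csv()` rather than with the constructor.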


## 4. Basic DataFrame Operations
### 4.1. Viewing DataFrame Information
Use the info() method to print a concise summary of the DataFrame: its index range, column names, non-null counts, dtypes, and memory usage:

In [18]:
df.info()  # info() prints the summary itself and returns None, so print() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object 
 1   Age     4 non-null      int64  
 2   Height  4 non-null      float64
 3   City    4 non-null      object 
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes


### 4.2. Viewing DataFrame Head and Tail
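To view the first few rows of the DataFrame, use the head() method. It returns the first five rows by default; because this DataFrame has only four rows, the whole DataFrame is shown: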

In [19]:
print(df.head())

      Name  Age  Height         City
0    Alice   25     5.2     New York
1      Bob   30     6.0  Los Angeles
2  Charlie   35     5.6      Chicago
3    David   40     5.1      Houston


To view the last few rows, use the tail() method. It returns the last five rows by default, so with only four rows the output matches that of head():

In [20]:
print(df.tail())

      Name  Age  Height         City
0    Alice   25     5.2     New York
1      Bob   30     6.0  Los Angeles
2  Charlie   35     5.6      Chicago
3    David   40     5.1      Houston
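
Both methods take an optional row count, which matters more on larger data. A minimal sketch, using a made-up DataFrame `people` so the tutorial's `df` is left alone:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 40]})

first_two = people.head(2)  # rows at positions 0 and 1
last_one = people.tail(1)   # only the final row, which keeps its index label 3

print(first_two)
print(last_one)
```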


## 5. Indexing and Selecting Data
### 5.1. Selecting Columns
You can select a single column by specifying its name:

In [21]:
print(df['Name'])

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object


You can select multiple columns by passing a list of column names:

In [22]:
print(df[['Name', 'Age']])

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
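
The bracket style also decides the type of the result. The sketch below, on a made-up DataFrame `people`, shows that a single name gives a Series while a list with one name gives a one-column DataFrame:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

as_series = people['Name']    # single name -> pandas Series
as_frame = people[['Name']]   # list of one name -> one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```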


### 5.2. Selecting Rows
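You can select rows with iloc[], which selects by integer position, or with loc[], which selects by index label. With the default RangeIndex used here, positions and labels coincide: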

In [23]:
print(df.iloc[0])  # Select the first row (position 0)
print(df.loc[1])   # Select the row whose index label is 1

Name         Alice
Age             25
Height         5.2
City      New York
Name: 0, dtype: object
Name              Bob
Age                30
Height            6.0
City      Los Angeles
Name: 1, dtype: object
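
The difference between the two indexers only becomes visible when labels and positions diverge. A minimal sketch, assuming a made-up DataFrame `people` with string labels 'a' to 'd':

```python
import pandas as pd

# Hypothetical string labels so that labels and positions differ.
people = pd.DataFrame({'Age': [25, 30, 35, 40]},
                      index=['a', 'b', 'c', 'd'])

by_position = people.iloc[0:2]   # positions 0 and 1; the end is excluded
by_label = people.loc['a':'c']   # labels 'a' through 'c'; the end is included

print(by_position)
print(by_label)
```

Note the asymmetry: position slices follow Python's half-open convention, while label slices include both endpoints.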


### 5.3. Conditional Selection
You can select rows based on certain conditions:

In [24]:
print(df[df['Age'] > 30])  # Select rows where Age is greater than 30

      Name  Age  Height     City
2  Charlie   35     5.6  Chicago
3    David   40     5.1  Houston
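
Conditions can also be combined. The following sketch, on a made-up DataFrame `people`, uses `&` (and), `|` (or), and `isin()` to build compound filters:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 40],
                       'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']})

# Each comparison needs its own parentheses, because & and | bind
# more tightly than comparison operators such as > and <.
older_in_two_cities = people[(people['Age'] > 25)
                             & people['City'].isin(['Chicago', 'Houston'])]
youngest_or_oldest = people[(people['Age'] < 30) | (people['Age'] >= 40)]

print(older_in_two_cities)
print(youngest_or_oldest)
```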


## 6. Adding and Removing Columns
### 6.1. Adding a Column
You can add a new column by assigning a list with one value per row to a new column name:

In [25]:
df['Gender'] = ['Female', 'Male', 'Male', 'Female']
print(df)

      Name  Age  Height         City  Gender
0    Alice   25     5.2     New York  Female
1      Bob   30     6.0  Los Angeles    Male
2  Charlie   35     5.6      Chicago    Male
3    David   40     5.1      Houston  Female
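
A new column can also be derived from existing ones. As a sketch on a made-up DataFrame `people` (the column names `AgeInMonths` and `IsOver26` are illustrative only):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Element-wise arithmetic on a column yields a column of the same length.
people['AgeInMonths'] = people['Age'] * 12

# assign() returns a new DataFrame and leaves `people` untouched.
with_flag = people.assign(IsOver26=people['Age'] > 26)

print(people)
print(with_flag)
```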


### 6.2. Removing a Column
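You can remove a column using the drop() method. Passing axis=1 targets columns rather than rows, and inplace=True modifies df directly instead of returning a new DataFrame: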

In [26]:
df.drop('City', axis=1, inplace=True)
print(df)

      Name  Age  Height  Gender
0    Alice   25     5.2  Female
1      Bob   30     6.0    Male
2  Charlie   35     5.6    Male
3    David   40     5.1  Female
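
If the original DataFrame should stay intact, drop() can be called without inplace=True. A minimal sketch on a made-up DataFrame `people`:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'],
                       'Age': [25, 30],
                       'City': ['New York', 'Los Angeles']})

# columns=[...] is equivalent to passing the labels with axis=1.
without_city = people.drop(columns=['City'])

print(without_city.columns.tolist())
print(people.columns.tolist())  # the original still has City
```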


## 7. Data Manipulation
### 7.1. Sorting Data
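You can sort the DataFrame by one or more columns with sort_values(), which sorts in ascending order by default and returns a new DataFrame. The rows here are already ordered by Age, so the output looks unchanged: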

In [27]:
print(df.sort_values(by='Age'))  # Sort by Age

      Name  Age  Height  Gender
0    Alice   25     5.2  Female
1      Bob   30     6.0    Male
2  Charlie   35     5.6    Male
3    David   40     5.1  Female
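
Sorting in descending order or by several columns uses the `ascending` parameter. A short sketch on a made-up DataFrame `people`:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                       'Age': [25, 30, 35, 40],
                       'Gender': ['Female', 'Male', 'Male', 'Female']})

oldest_first = people.sort_values(by='Age', ascending=False)

# Sort by Gender A-Z, then by Age from high to low within each gender.
by_gender_then_age = people.sort_values(by=['Gender', 'Age'],
                                        ascending=[True, False])

print(oldest_first)
print(by_gender_then_age)
```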


### 7.2. Grouping Data
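You can group rows by the values in a column and aggregate each group. Passing numeric_only=True restricts the mean to the numeric columns; since pandas 2.0 the default is numeric_only=False, and averaging the string column Name fails with an error:

In [28]:
grouped = df.groupby('Gender')
print(grouped.mean(numeric_only=True))  # Mean Age and Height by gender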

         Age  Height
Gender              
Female  32.5    5.15
Male    32.5    5.80


## 8. Handling Missing Data
### 8.1. Detecting Missing Data
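You can detect missing data using isnull() or notnull() (also available under the names isna() and notna()). The example DataFrame has no missing values, so isnull() is False everywhere and notnull() is True everywhere: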

In [29]:
print(df.isnull())   # True where values are NaN
print(df.notnull())  # True where values are not NaN

    Name    Age  Height  Gender
0  False  False   False   False
1  False  False   False   False
2  False  False   False   False
3  False  False   False   False
   Name   Age  Height  Gender
0  True  True    True    True
1  True  True    True    True
2  True  True    True    True
3  True  True    True    True
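
Detection is easier to follow on data that actually has gaps. The sketch below uses a made-up DataFrame `people_missing` with two NaN entries, counts them per column, and selects the affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
people_missing = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                               'Age': [25, np.nan, 35],
                               'Height': [5.2, 6.0, np.nan]})

# True counts as 1, so summing the boolean mask counts NaN per column.
missing_per_column = people_missing.isnull().sum()

# any(axis=1) flags rows that contain at least one NaN.
rows_with_missing = people_missing[people_missing.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```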


### 8.2. Handling Missing Data
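You can handle missing data by dropping or filling it. Neither call modifies df; each returns a new DataFrame. Because df has no missing values, both outputs match the original: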

In [30]:
print(df.dropna())       # Drop rows with any NaN values
print(df.fillna(0))      # Fill NaN values with 0

      Name  Age  Height  Gender
0    Alice   25     5.2  Female
1      Bob   30     6.0    Male
2  Charlie   35     5.6    Male
3    David   40     5.1  Female
      Name  Age  Height  Gender
0    Alice   25     5.2  Female
1      Bob   30     6.0    Male
2  Charlie   35     5.6    Male
3    David   40     5.1  Female
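
To see both methods change something, here is a sketch on a made-up DataFrame `people_missing` that contains NaN values. Filling Age with its mean and Height with 0 is an arbitrary choice for illustration, not a recommendation:

```python
import numpy as np
import pandas as pd

people_missing = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                               'Age': [25, np.nan, 35],
                               'Height': [5.2, 6.0, np.nan]})

# Only Alice's row has no NaN, so it is the only one kept.
complete_rows = people_missing.dropna()

# A dict fills each column with its own value; mean() skips NaN.
filled = people_missing.fillna({'Age': people_missing['Age'].mean(),
                                'Height': 0})

print(complete_rows)
print(filled)
```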


## 9. Grouping Data and Computing Statistics
Pandas provides the groupby() method to group data based on one or more columns and perform operations on these groups. This is a powerful feature for analyzing and summarizing data based on different categories.

### 9.1. Grouping Data by a Single Column
You can group data by a single column using the groupby() method. Here's how you can group data by the 'Gender' column and compute statistics:

In [31]:
grouped = df.groupby('Gender')
print(grouped.describe())

         Age                                                  Height        \
       count  mean        std   min    25%   50%    75%   max  count  mean   
Gender                                                                       
Female   2.0  32.5  10.606602  25.0  28.75  32.5  36.25  40.0    2.0  5.15   
Male     2.0  32.5   3.535534  30.0  31.25  32.5  33.75  35.0    2.0  5.80   

                                                
             std  min    25%   50%    75%  max  
Gender                                          
Female  0.070711  5.1  5.125  5.15  5.175  5.2  
Male    0.282843  5.6  5.700  5.80  5.900  6.0  


This will produce a summary of statistics for each group.
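
When only a few statistics are needed, describe() can be replaced by agg() with named aggregations. A minimal sketch on a made-up DataFrame `people`; the output column names `mean_age`, `max_height`, and `n_rows` are illustrative:

```python
import pandas as pd

people = pd.DataFrame({'Gender': ['Female', 'Male', 'Male', 'Female'],
                       'Age': [25, 30, 35, 40],
                       'Height': [5.2, 6.0, 5.6, 5.1]})

# Named aggregation: output_column=(input_column, function).
summary = people.groupby('Gender').agg(
    mean_age=('Age', 'mean'),
    max_height=('Height', 'max'),
    n_rows=('Age', 'size'),
)

print(summary)
```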


### 9.2. Grouping Data by Multiple Columns
You can also group data by multiple columns. For example, you can group data by both 'Gender' and 'Age' columns:



In [32]:
grouped = df.groupby(['Gender', 'Age'])
print(grouped.mean(numeric_only=True))  # numeric_only=True skips the string column Name

            Height
Gender Age        
Female 25      5.2
       40      5.1
Male   30      6.0
       35      5.6



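This computes the mean height for each combination of gender and age. Each combination occurs only once in this data, so every group mean equals the single underlying value.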
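
Grouping by several columns puts the keys into a hierarchical index. If ordinary columns are more convenient, one option is as_index=False, sketched here on a made-up DataFrame `people`:

```python
import pandas as pd

people = pd.DataFrame({'Gender': ['Female', 'Male', 'Male', 'Female'],
                       'Age': [25, 30, 35, 40],
                       'Height': [5.2, 6.0, 5.6, 5.1]})

# Default: the grouping keys become a two-level MultiIndex.
hierarchical = people.groupby(['Gender', 'Age']).mean()

# as_index=False keeps the grouping keys as regular columns.
flat = people.groupby(['Gender', 'Age'], as_index=False).mean()

print(hierarchical)
print(flat)
```

Calling reset_index() on the hierarchical result gives the same flat layout.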



### 9.3. Computing Statistics on Grouped Data
Once data is grouped, you can compute statistics such as the mean, median, sum, or standard deviation for each group. The `grouped` object from 9.2 is keyed on both Gender and Age, so the examples here group by 'Gender' alone. For instance, to compute the mean age for each gender group:



In [33]:
print(df.groupby('Gender')['Age'].mean())

Gender
Female    32.5
Male      32.5
Name: Age, dtype: float64



Or to compute the median height for each gender group:



In [34]:
print(df.groupby('Gender')['Height'].median())

Gender
Female    5.15
Male      5.80
Name: Height, dtype: float64





This section provides an overview of how to group data using the `groupby()` method and compute statistics on the grouped data. It demonstrates the flexibility of Pandas for analyzing and summarizing data based on different categories.

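## 10. Applying a Function with apply()
The apply() method of a Series calls a function on each element and returns the results as a new Series. The example below reads a CSV file with a `batch` column and adds a `dummy1` column that is 1 where `batch` equals 4 and 0 elsewhere: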

In [1]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('example_dataframe.csv')

# Apply the transformation to add the dummy1 column
df['dummy1'] = df['batch'].apply(lambda x: 1 if x == 4 else 0)

# Display the updated dataframe
print(df)

   batch  dummy1
0      1       0
1      4       1
2      2       0
3      4       1
4      3       0
5      5       0
6      4       1
7      1       0
8      4       1
9      2       0
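
For a simple equality test like this one, a vectorized comparison gives the same result without calling a Python function per row. A sketch, assuming a small made-up DataFrame `batches` in place of example_dataframe.csv, whose contents are not shown in full here:

```python
import pandas as pd

# Hypothetical stand-in for the CSV file used above.
batches = pd.DataFrame({'batch': [1, 4, 2, 4, 3]})

# Element-by-element, as in the apply() example.
via_apply = batches['batch'].apply(lambda x: 1 if x == 4 else 0)

# Vectorized: compare the whole column, then convert True/False to 1/0.
via_comparison = (batches['batch'] == 4).astype(int)

print(via_apply.tolist())
print(via_comparison.tolist())
```

apply() remains the more general tool when the per-element logic cannot be expressed with column operations.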
