Q1. List any five functions of the pandas library with execution.



Ans:


Here are five commonly used functions from the pandas library in Python,
each with an example execution:

1. **`read_csv()`**: This function is used to read data from a CSV file and create a DataFrame.


import pandas as pd

# Read a CSV file and create a DataFrame
data = pd.read_csv('data.csv')
print(data.head())


2. **`head()`**: It displays the first n rows (default is 5) of a DataFrame.


# Display the first 3 rows of the DataFrame
print(data.head(3))


3. **`info()`**: This function provides a summary of the DataFrame,
including data types, non-null values, and memory usage.


# Display information about the DataFrame
# (info() prints the summary itself and returns None, so no print() is needed)
data.info()


4. **`describe()`**: It generates summary statistics of numeric columns in the DataFrame,
such as mean, standard deviation, min, max, etc.


# Generate summary statistics for numeric columns
print(data.describe())


5. **`groupby()`**: This function is used for grouping rows of a DataFrame based on a column 
or multiple columns, allowing for operations like aggregation.

# Grouping and calculating the mean of 'column_name' for each group
grouped_data = data.groupby('group_column')['column_name'].mean()
print(grouped_data)


Make sure to replace `'data.csv'`, `'column_name'`, and `'group_column'`
with appropriate file paths and column names based on your dataset.
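
The snippets above depend on a `data.csv` file that is not shown. As a self-contained sketch, the same five calls can be run against a small in-memory CSV. The column names (`city`, `product`, `units`) and values here are made up for illustration only:

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so no file on disk is needed
csv_text = """city,product,units
Pune,pen,10
Pune,book,4
Delhi,pen,7
Delhi,book,9
"""

# read_csv accepts a path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data.head(2))       # first two rows
data.info()               # prints its summary directly and returns None
print(data.describe())    # summary statistics for the numeric 'units' column

# Mean of 'units' for each city
units_by_city = data.groupby('city')['units'].mean()
print(units_by_city)
```

Using `io.StringIO` keeps the example runnable anywhere; with a real dataset you would pass the file path to `read_csv` instead.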











Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.



Ans:

You can achieve this by using the `reset_index()` method in Pandas and then
modifying the index as per your requirements. Here's a Python function that
re-indexes a DataFrame with a new index that starts from 1 and increments by 2 for each row:


import pandas as pd

def reindex_with_increment(df):
    # Reset the index to default integer index
    df = df.reset_index(drop=True)
    
    # Create a new index starting from 1 and incrementing by 2
    new_index = [i * 2 + 1 for i in range(len(df))]
    
    # Assign the new index to the DataFrame
    df.index = new_index
    
    return df

# Example usage:
data = {'A': [10, 20, 30],
        'B': [40, 50, 60],
        'C': [70, 80, 90]}

df = pd.DataFrame(data)
new_df = reindex_with_increment(df)
print(new_df)


In this code, we first reset the index to the default integer index using 
`reset_index(drop=True)`. Then, we create a new index that starts from 1 and increments by 2
using a list comprehension. Finally, we assign the new index to the DataFrame using 
`df.index = new_index`. The resulting DataFrame will have the desired new index.
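
As a possible alternative, the same index can be built with `pd.RangeIndex` instead of a list comprehension. This is a sketch: the function name `reindex_with_range` and its `start` and `step` parameters are illustrative, not part of the original answer.

```python
import pandas as pd

def reindex_with_range(df, start=1, step=2):
    # Work on a copy so the caller's DataFrame keeps its original index
    out = df.copy()
    # RangeIndex stops before `stop`, so this yields exactly len(out) labels
    out.index = pd.RangeIndex(start=start, stop=start + step * len(out), step=step)
    return out

df = pd.DataFrame({'A': [10, 20, 30],
                   'B': [40, 50, 60],
                   'C': [70, 80, 90]})
new_df = reindex_with_range(df)
print(new_df.index.tolist())  # [1, 3, 5]
```

A `RangeIndex` stores only its start, stop, and step, so it avoids materializing a Python list for large DataFrames.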














Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.
For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.



Ans:


You can achieve this by using the Pandas library in Python. Here's a Python 
function that selects the first three values in the 'Values' column with `head(3)`
and sums them, which gives the same result as an explicit loop over the rows:


import pandas as pd

def calculate_sum_of_first_three(df):
    # Check if the 'Values' column exists in the DataFrame
    if 'Values' in df.columns:
        # Get the first three values from the 'Values' column and calculate their sum
        first_three_values = df['Values'].head(3)
        sum_of_first_three = first_three_values.sum()
        
        # Print the sum to the console
        print("Sum of the first three values in the 'Values' column:", sum_of_first_three)
    else:
        print("The 'Values' column does not exist in the DataFrame.")

# Example usage:
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
calculate_sum_of_first_three(df)


This code defines a function `calculate_sum_of_first_three` that takes a DataFrame as input, 
checks if the 'Values' column exists, extracts the first three values, calculates their 
sum, and then prints the result to the console.

In the example provided,
it will print the sum of the first three values, which is 60.
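
If the exercise is read literally and an explicit loop is wanted, one way to sketch it is with `itertuples()`. The function name `sum_first_three_by_iteration` is illustrative, and the code assumes the 'Values' column exists:

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    total = 0
    # itertuples yields one namedtuple per row; index=False leaves out the row label
    for position, row in enumerate(df.itertuples(index=False)):
        if position == 3:
            break  # stop after the first three rows
        total += row.Values
    print("Sum of the first three values in the 'Values' column:", total)
    return total

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
result = sum_first_three_by_iteration(df)  # prints 60
```

For anything beyond small examples, the vectorized `head(3).sum()` form is usually preferable to a Python-level loop.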












Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.



Ans:
    
You can create a new column 'Word_Count' in a Pandas DataFrame based on the number of
words in each row of the 'Text' column using the `apply` function with a lambda function
that splits each text into words and then counts the number of words.
Here's a Python function to do this:


import pandas as pd

def add_word_count_column(df, text_column_name='Text', word_count_column_name='Word_Count'):
    df[word_count_column_name] = df[text_column_name].apply(lambda x: len(x.split()))
    return df

# Example usage:
data = {'Text': ["This is a sample sentence.", "Another example.", "Short text."]}
df = pd.DataFrame(data)

df = add_word_count_column(df)
print(df)


This code defines a function `add_word_count_column` that takes a DataFrame `df`, the name of
the text column ('Text' by default), and the name for the new word count column
('Word_Count' by default). It then uses the `apply` method to apply a lambda function to each row 
of the 'Text' column, splitting the text into words using `split()` and calculating the length 
of the resulting list to get the word count. Finally, it adds the
'Word_Count' column to the DataFrame.

In the example usage provided, it demonstrates how to use this function with a sample DataFrame.
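
The lambda above calls `x.split()` on every value, which fails if a row holds a missing value rather than a string. A hedged alternative is to use the vectorized `.str` accessor, which passes missing values through. The function name `add_word_count_vectorized` and the choice to count missing text as 0 words are assumptions made for this sketch:

```python
import pandas as pd

def add_word_count_vectorized(df, text_column_name='Text', word_count_column_name='Word_Count'):
    out = df.copy()
    # .str.split() splits on whitespace and leaves missing values as NaN
    words = out[text_column_name].str.split()
    # .str.len() counts the words per row; missing rows are counted as 0 here
    out[word_count_column_name] = words.str.len().fillna(0).astype(int)
    return out

df = pd.DataFrame({'Text': ["This is a sample sentence.", None, "Short text."]})
counted = add_word_count_vectorized(df)
print(counted)
```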






    
    
    
    
    
    
    
    
    
    


Q5. How are DataFrame.size() and DataFrame.shape() different?




Ans:

In the context of data manipulation with Pandas, `DataFrame.size` and `DataFrame.shape`
are two different attributes that provide information about the dimensions of a DataFrame, 
but they serve different purposes and return different values:

1. `DataFrame.size`:
   - `DataFrame.size` returns the total number of elements (cells) in the DataFrame.
   - It calculates the size by multiplying the number of rows by the number of columns.
   - The result is an integer representing the total number of data points in the DataFrame.

   Example:
   
   import pandas as pd

   data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
   df = pd.DataFrame(data)

   df_size = df.size
   print(df_size)  # Output: 6 (3 rows * 2 columns = 6)
   

2. `DataFrame.shape`:
   - `DataFrame.shape` returns a tuple that represents the dimensions of the DataFrame.
   - The tuple contains two values: the number of rows and the number of columns, respectively.
   - It provides a convenient way to determine the shape (size) of the
     DataFrame without counting the individual elements.

   Example:

   import pandas as pd

   data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
   df = pd.DataFrame(data)

   df_shape = df.shape
   print(df_shape)  # Output: (3, 2) (3 rows, 2 columns)
   

In summary, `DataFrame.size` returns the total number of elements in the DataFrame,
while `DataFrame.shape` returns a tuple representing the number of rows and columns.
Depending on your specific use case, you may choose one or the other to obtain the
desired information about the DataFrame's dimensions.
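
The relationship between the two attributes can be checked directly. This short sketch reuses the same two-column sample data and also shows what a single column (a Series) reports:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Both are attributes, so they are accessed without parentheses
rows, cols = df.shape
print(rows, cols)               # 3 2
print(df.size == rows * cols)   # True

# A Series is one-dimensional: shape is a 1-tuple and size is its length
print(df['A'].shape, df['A'].size)  # (3,) 3
```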

    
    
    
    
    
    
    
    
    
    




Q6. Which function of pandas do we use to read an excel file?


Ans:



In pandas, you can use the `pd.read_excel()` function to read data from an Excel file
into a DataFrame. Here's how you typically use it:

import pandas as pd

# Replace 'your_excel_file.xlsx' with the path to your Excel file
df = pd.read_excel('your_excel_file.xlsx')


You can also specify additional parameters to customize how the Excel file is read,
such as the sheet name, specific columns, header rows, and more.
Here's an example with some common parameters:


import pandas as pd

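# Replace 'your_excel_file.xlsx' with the path to your Excel file
# Read the sheet named 'Sheet1' and take the column names from the second row
# (header is zero-based, so header=1 skips the first row of the sheet)
df = pd.read_excel('your_excel_file.xlsx', sheet_name='Sheet1', header=1)

# You can also read a subset of columns using the 'usecols' parameter
# For example, to read only Excel columns A and B, pass the column letters as a string
# (a list of strings such as ['A', 'B'] is matched against column names instead):
df = pd.read_excel('your_excel_file.xlsx', usecols='A:B')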


Make sure you have the pandas library installed in your Python environment before
using this function. Reading `.xlsx` files also requires an Excel engine such as
openpyxl, which is an optional dependency. You can install both using pip:

pip install pandas openpyxl

















Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.
The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.



Ans:

You can achieve this by using the `str.split()` method in Pandas to split the 'Email'
column at the '@' symbol and then selecting the first part, which is the username.
Here's a Python function that does this:


import pandas as pd

def extract_username(df):
    # Split the 'Email' column at the '@' symbol and select the first part (username)
    df['Username'] = df['Email'].str.split('@').str[0]
    return df

# Example usage:
data = {'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob@example.com']}
df = pd.DataFrame(data)
df = extract_username(df)
print(df)


This function adds a new column 'Username' to the DataFrame `df` that contains the
usernames extracted from the 'Email' column. The output will look like this:


                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
2         bob@example.com         bob


Now, the 'Username' column contains the usernames extracted from the 'Email' column.
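
Note that `str.split('@').str[0]` returns the whole string when a value contains no '@'. If malformed addresses should instead produce a missing value, one option is `str.extract` with a regular expression. This is a sketch under that assumption; the function name `extract_username_strict` and the sample values are illustrative:

```python
import pandas as pd

def extract_username_strict(df):
    out = df.copy()
    # Capture everything before the first '@'; expand=False returns a Series.
    # Rows without an '@' (or with a missing value) become NaN.
    out['Username'] = out['Email'].str.extract(r'^([^@]+)@', expand=False)
    return out

df = pd.DataFrame({'Email': ['john.doe@example.com', 'not-an-email', None]})
result = extract_username_strict(df)
print(result)
```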




    
    
    
    
    
    
    
    
    
    
    

Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'.
Write a Python function that selects all rows where the value in column 'A'
is greater than 5 and the value in column 'B'
is less than 10. The function should return a new DataFrame that contains only the selected rows.
For example, if df contains the following values:
A B C
0 3 5 1
1 8 2 7
2 6 9 4
3 2 3 5
4 9 1 2




Ans:


You can achieve this by using Pandas and applying boolean indexing to the DataFrame. 
Here's a Python function that does what you described:


import pandas as pd

def select_rows(df):
    # Use boolean indexing to select rows where 'A' > 5 and 'B' < 10
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    
    return selected_rows

# Example usage:
data = {'A': [3, 8, 6, 2, 9],
        'B': [5, 2, 9, 3, 1],
        'C': [1, 7, 4, 5, 2]}

df = pd.DataFrame(data)

selected_df = select_rows(df)
print(selected_df)

This code defines a function `select_rows` that takes a DataFrame as input,
applies the conditions specified (column 'A' > 5 and column 'B' < 10),
and returns a new DataFrame containing only the selected rows. When you run the
example provided with your sample data, it will output the expected result:


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2


These are the rows where column 'A' is greater than 5 and column 'B' is less than 10.
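
The same filter can also be written with `DataFrame.query`, which some readers find easier to scan. In this sketch the thresholds are passed as parameters; the names `select_rows_query`, `a_min`, and `b_max` are illustrative:

```python
import pandas as pd

def select_rows_query(df, a_min=5, b_max=10):
    # Inside a query string, @name refers to a local Python variable
    return df.query('A > @a_min and B < @b_max')

df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})
selected = select_rows_query(df)
print(selected)
```

Both forms return a new DataFrame and keep the original row labels of the selected rows.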
















 

Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function 
to calculate the mean, median, and standard deviation of the values in the 'Values' column.



Ans:
    
You can calculate the mean, median, and standard deviation of the values in 
the 'Values' column of a Pandas DataFrame using the Series methods `mean()`,
`median()`, and `std()`. Here's a Python function to do that:

import pandas as pd

def calculate_statistics(df):
    # Check if the 'Values' column exists in the DataFrame
    if 'Values' not in df.columns:
        # Raise instead of returning a message, so callers can always unpack three values
        raise KeyError("The 'Values' column does not exist in the DataFrame.")

    # Calculate mean, median, and standard deviation
    mean = df['Values'].mean()
    median = df['Values'].median()
    std_dev = df['Values'].std()

    return mean, median, std_dev

# Example usage:
# Create a sample DataFrame
data = {'Values': [10, 15, 20, 25, 30]}
df = pd.DataFrame(data)

# Calculate statistics
mean, median, std_dev = calculate_statistics(df)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")


Replace the 'data' dictionary with your actual data, and then call the `calculate_statistics` 
function with your DataFrame to obtain the mean, median, and 
standard deviation of the 'Values' column.   
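
The three statistics can also be computed in a single call with `Series.agg`. The sketch below additionally shows that `std()` uses the sample formula (`ddof=1`) by default, which matters when the population standard deviation is wanted instead:

```python
import pandas as pd

df = pd.DataFrame({'Values': [10, 15, 20, 25, 30]})

# One call returning a Series labelled 'mean', 'median', 'std'
stats = df['Values'].agg(['mean', 'median', 'std'])
print(stats)

# pandas defaults to the sample standard deviation (divides by n - 1)
sample_std = df['Values'].std()
# Pass ddof=0 for the population standard deviation (divides by n)
population_std = df['Values'].std(ddof=0)
print(sample_std, population_std)
```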










    
    



Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.





Ans:
    
You can calculate the moving average for the 'Sales' column in your 
Pandas DataFrame using the `rolling` 
method with a window size of 7. Here's a Python function that accomplishes this:




import pandas as pd

def calculate_moving_average(df):
    # Sort by date so the rolling window covers consecutive days
    df = df.sort_values(by='Date')
    # Window of 7 rows including the current one; assumes one row per day
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    return df

# Example usage:
# Assuming df is your Pandas DataFrame with 'Sales' and 'Date' columns
# df = calculate_moving_average(df)

This function first sorts the DataFrame by the 'Date' column to ensure that the rolling window
calculation is done correctly. Then, it calculates the moving average of the 'Sales' column 
with a window size of 7 using the `.rolling` method and assigns the result to the
'MovingAverage' column. The `min_periods=1` parameter ensures that even if there are fewer
than 7 days of data available initially, it will calculate the moving average for the available data.

After calling this function with your DataFrame, the 'MovingAverage' column will 
contain the moving average of the sales for the past 7 days, including the current day.
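
A window of 7 rows matches "the past 7 days" only when there is exactly one row per day. If dates can be missing, a time-based window is one way to handle it. The sketch below assumes the 'Date' column can be parsed by `pd.to_datetime`; the function name `calculate_moving_average_by_time` and the sample data are illustrative:

```python
import pandas as pd

def calculate_moving_average_by_time(df):
    out = df.copy()
    out['Date'] = pd.to_datetime(out['Date'])
    out = out.sort_values('Date')
    # A '7D' offset window covers the 7 calendar days ending at each row's date
    rolled = out.set_index('Date')['Sales'].rolling('7D').mean()
    # Same row order as `out`, so assign by position
    out['MovingAverage'] = rolled.to_numpy()
    return out

# Note the gap between January 2 and January 10
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-10'],
                   'Sales': [10, 20, 30]})
result = calculate_moving_average_by_time(df)
print(result)
```

With a row-count window, the third row would average all three sales values; with the time-based window it averages only the sales that fall within its own 7-day span.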
















Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:
Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
Your function should create the following DataFrame:

Date Weekday
0 2023-01-01 Sunday
1 2023-01-02 Monday
2 2023-01-03 Tuesday
3 2023-01-04 Wednesday
4 2023-01-05 Thursday
The function should return the modified DataFrame.
 
    
    
Ans:
     
    
You can achieve this by using the Pandas library in Python.
You'll need to convert the 'Date' column to datetime objects and then extract
the weekday names from the datetime objects. Here's a Python function that does
this and returns the modified DataFrame:


import pandas as pd

def add_weekday_column(df):
    # Convert the 'Date' column to datetime objects
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Extract the weekday names and store them in a new 'Weekday' column
    df['Weekday'] = df['Date'].dt.strftime('%A')
    
    return df

# Example usage:
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']}
df = pd.DataFrame(data)

result_df = add_weekday_column(df)
print(result_df)

This code will create the 'Weekday' column in the DataFrame and populate it
with the corresponding weekday names for each date in the 'Date' column.
    
    
         Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
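
pandas also offers `dt.day_name()` for this, and `dt.dayofweek` when a numeric weekday is more useful for sorting or grouping. The following is a sketch; the function name `add_weekday_with_day_name` and the extra 'WeekdayNumber' column are illustrative additions, not part of the exercise:

```python
import pandas as pd

def add_weekday_with_day_name(df):
    out = df.copy()
    dates = pd.to_datetime(out['Date'])
    # day_name() returns the weekday name, e.g. 'Sunday'
    out['Weekday'] = dates.dt.day_name()
    # dayofweek numbers the days Monday=0 through Sunday=6
    out['WeekdayNumber'] = dates.dt.dayofweek
    return out

df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03']})
result = add_weekday_with_day_name(df)
print(result)
```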


   
    
    
    
    
    
    
    
    
    
    



Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.




Ans:
    
    

You can achieve this using the Pandas library in Python. First, you need to ensure that
the 'Date' column is in datetime format, and then
you can filter the DataFrame based on the date range.
Here's a Python function to do that:


import pandas as pd

def filter_date_range(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    
    # Convert the 'Date' column to datetime format if it's not already
    if not pd.api.types.is_datetime64_any_dtype(df['Date']):
        df['Date'] = pd.to_datetime(df['Date'])
    
    # Define the start of the range and the first instant after it
    start_date = pd.to_datetime('2023-01-01')
    end_exclusive = pd.to_datetime('2023-02-01')
    
    # Use boolean indexing; the exclusive upper bound keeps timestamps
    # that fall later in the day on January 31
    filtered_df = df[(df['Date'] >= start_date) & (df['Date'] < end_exclusive)]
    
    return filtered_df

# Example usage with a small sample DataFrame:
df = pd.DataFrame({'Date': ['2022-12-31 09:00', '2023-01-15 12:00',
                            '2023-01-31 18:30', '2023-02-01 00:00']})
filtered_df = filter_date_range(df)
print(filtered_df)


Make sure to replace 'df' with the actual name of your DataFrame.
This function will return a new DataFrame containing only the rows where the 'Date'
column falls within the specified date range of January 1, 2023, to January 31, 2023.
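
Because the column holds timestamps rather than plain dates, it can help to compare by calendar day. One way to sketch this is `dt.normalize()` combined with `Series.between`, which is inclusive on both ends by default. The function name `filter_january_2023`, the 'Value' column, and the sample rows are illustrative:

```python
import pandas as pd

def filter_january_2023(df):
    dates = pd.to_datetime(df['Date'])
    # normalize() resets each timestamp to midnight, so rows are compared by day
    days = dates.dt.normalize()
    mask = days.between(pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-31'))
    return df[mask]

df = pd.DataFrame({
    'Date': ['2022-12-31 23:59', '2023-01-01 00:00',
             '2023-01-31 18:30', '2023-02-01 00:00'],
    'Value': [1, 2, 3, 4],
})
january = filter_january_2023(df)
print(january)
```

The row stamped 18:30 on January 31 is kept here because its date, not its exact time, is compared against the end of the range.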





    
    
    
    
    
    
    
    







Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?




Ans:
    
To use the basic functions of pandas in Python, you first need to import the pandas library itself.
You can do this with the `import` statement:

import pandas as pd


Typically, pandas is imported with the alias 'pd' for convenience, which allows you
to use 'pd' as a prefix for pandas functions and objects. 
This is a common convention in the Python data science community.

Once you've imported pandas in this way, you can start using its various functions
and data structures, such as DataFrames and Series, to work with and manipulate data in Python.
    















