### Q1. List any five functions of the pandas library with execution.

read_csv(): Reading data from a CSV file and creating a DataFrame.

head(): Displaying the first N rows of a DataFrame.

tail(): Displaying the last N rows of a DataFrame.

describe(): Generating descriptive statistics of numerical columns in a DataFrame.

groupby(): Grouping data based on specified columns.

drop(): Dropping specified rows or columns from a DataFrame.

info(): Displaying a concise summary of a DataFrame, including data types and missing values.

shape: Retrieving the dimensions (number of rows and columns) of a DataFrame.

unique(): Returning unique values in a column of a DataFrame.

fillna(): Filling missing values in a DataFrame with specified values.

sort_values(): Sorting a DataFrame by one or more columns.
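
Since the question asks for execution, here is a minimal sketch that runs five of the functions listed above on a small made-up dataset. The CSV content and the column names `city` and `sales` are illustrative assumptions, and an in-memory `io.StringIO` buffer stands in for a CSV file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; the buffer stands in for a real file path
csv_text = "city,sales\nPune,10\nDelhi,20\nPune,30\nDelhi,\n"

df = pd.read_csv(io.StringIO(csv_text))                 # 1. read_csv(): build a DataFrame
first_two = df.head(2)                                  # 2. head(): first N rows
filled = df.fillna({"sales": 0})                        # 3. fillna(): replace missing values
totals = filled.groupby("city")["sales"].sum()          # 4. groupby(): aggregate per group
ranked = filled.sort_values("sales", ascending=False)   # 5. sort_values(): order rows

print(first_two)
print(totals)
print(ranked)
```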

### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [1]:
import pandas as pd

# Sample DataFrame
data = {'A': [10, 20, 30],
        'B': [40, 50, 60],
        'C': [70, 80, 90]}

df = pd.DataFrame(data)

# Function to re-index by assigning new labels (1, 3, 5, ...) to the existing rows
def reindex_dataframe(df):
    df_reindexed = df.copy()
    # reindex() would look up labels 1, 3, 5 in the old index and drop or repeat rows,
    # so the new labels are assigned directly instead
    df_reindexed.index = range(1, 2 * len(df) + 1, 2)
    return df_reindexed

# Reindex the DataFrame
df_reindexed = reindex_dataframe(df)

# Display the original and reindexed DataFrames
print("Original DataFrame:")
print(df)

print("\nReindexed DataFrame:")
print(df_reindexed)

Original DataFrame:
    A   B   C
0  10  40  70
1  20  50  80
2  30  60  90

Reindexed DataFrame:
    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
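
The two operations are easy to confuse, so here is a short sketch contrasting them. `reindex` looks up the requested labels in the existing index and leaves gaps where a label is missing, while `set_axis` keeps every row and only relabels it. The variable names `looked_up` and `relabeled` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})
new_index = range(1, 2 * len(df) + 1, 2)   # 1, 3, 5

# reindex conforms the data to the new labels: label 1 exists, 3 and 5 do not
looked_up = df.reindex(new_index)

# set_axis relabels the existing rows in order and returns a new DataFrame
relabeled = df.set_axis(list(new_index), axis=0)

print(looked_up)
print(relabeled)
```

With plain `reindex`, the original row 0 is lost and the rows for labels 3 and 5 are filled with NaN; with `set_axis`, all three original rows are kept.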


### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console. For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should calculate and print the sum of the first three values, which is 60.

In [2]:
import pandas as pd

# Sample DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Function to calculate the sum of the first three values in the 'Values' column
def calculate_sum_of_first_three(df):
    # Check if the 'Values' column exists in the DataFrame
    if 'Values' in df.columns:
        # Extract the 'Values' column and calculate the sum of the first three values
        values_column = df['Values']
        sum_of_first_three = values_column.head(3).sum()

        # Print the result
        print(f"Sum of the first three values in 'Values' column: {sum_of_first_three}")
    else:
        print("DataFrame does not contain a 'Values' column.")

# Call the function with the sample DataFrame
calculate_sum_of_first_three(df)

Sum of the first three values in 'Values' column: 60
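
The question asks for iteration, so an explicit loop could look like the sketch below. `sum_first_n` and its parameter `n` are hypothetical names; the `head(3).sum()` call used above gives the same result and is usually the more idiomatic choice.

```python
import pandas as pd

def sum_first_n(df, n=3):
    """Sum the first n entries of the 'Values' column with an explicit loop."""
    total = 0
    # iloc[:n] yields at most n values, so shorter DataFrames are handled as well
    for value in df['Values'].iloc[:n]:
        total += value
    return total

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
print(sum_first_n(df))  # prints 60
```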


### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [3]:
import pandas as pd

# Sample DataFrame
data = {'Text': ['This is a sample text', 'Another example', 'Python is great']}
df = pd.DataFrame(data)

# Function to count the number of words in a text
def count_words(text):
    words = text.split()
    return len(words)

# Function to create a new column 'Word_Count'
def create_word_count_column(df):
    # Check if the 'Text' column exists in the DataFrame
    if 'Text' in df.columns:
        # Apply the count_words function to each row of the 'Text' column
        df['Word_Count'] = df['Text'].apply(count_words)
    else:
        print("DataFrame does not contain a 'Text' column.")

# Call the function with the sample DataFrame
create_word_count_column(df)

# Display the updated DataFrame
df

                    Text  Word_Count
0  This is a sample text           5
1        Another example           2
2        Python is great           3
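
As an alternative sketch, the same count can be obtained with pandas' vectorized string methods instead of `apply`. This assumes that words are separated by whitespace, as in the `count_words` helper above.

```python
import pandas as pd

df = pd.DataFrame({'Text': ['This is a sample text', 'Another example', 'Python is great']})

# str.split() splits on whitespace; str.len() counts the resulting tokens
df['Word_Count'] = df['Text'].str.split().str.len()

print(df['Word_Count'].tolist())  # prints [5, 2, 3]
```

One practical difference: the `.str` accessor propagates missing values as NaN, whereas calling `text.split()` on a missing value inside `apply` raises an error.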


### Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size:

DataFrame.size returns the total number of elements (cells) in the DataFrame.
The value equals the number of rows multiplied by the number of columns, and the result is a single integer.
Note that size is an attribute, not a method, so it is accessed without parentheses.

In [4]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Calculate the size of the DataFrame
size = df.size
print("DataFrame size:", size)
df

DataFrame size: 6


   A  B
0  1  4
1  2  5
2  3  6


DataFrame.shape:

DataFrame.shape returns a tuple (number of rows, number of columns) describing the dimensions of the DataFrame.
It reports the structure rather than the total number of elements.
Like size, shape is an attribute and is accessed without parentheses.

In summary, DataFrame.size is a single integer giving the total number of elements, while DataFrame.shape is a two-value tuple giving the number of rows and columns.

In [5]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Get the shape of the DataFrame
shape = df.shape
print("DataFrame shape:", shape)

DataFrame shape: (3, 2)
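
The relationship between the two attributes can be checked directly. The sketch below reuses the same sample DataFrame; the Series example at the end is an added illustration, not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df.shape          # tuple unpacking: (3, 2)
print(df.size == rows * cols)  # prints True

# For a Series, size is the number of elements and shape is a 1-tuple
s = df['A']
print(s.size, s.shape)         # prints 3 (3,)
```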


### Q6. Which function of pandas do we use to read an excel file?

The pandas function for reading an Excel file is pd.read_excel(), which returns a DataFrame. Reading .xlsx files requires an Excel engine such as openpyxl to be installed.

In [6]:
import pandas as pd

file_path = 'test.xlsx'

# Read Excel file into a Pandas DataFrame
df = pd.read_excel(file_path)

# Display the DataFrame
df

   Name  Age
0  John   23
1   Sam   29
2  Wood   32


### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address. The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

In [7]:
import pandas as pd

# Sample DataFrame
data = {'Email': ['john.doe@example.com', 'jane.smith@example.com', 'bob@example.com']}
df = pd.DataFrame(data)

# Function to create a new 'Username' column
def extract_username(df):
    # Check if the 'Email' column exists in the DataFrame
    if 'Email' in df.columns:
        # Split the 'Email' column at the '@' symbol and get the first part (username)
        df['Username'] = df['Email'].str.split('@').str.get(0)
    else:
        print("DataFrame does not contain an 'Email' column.")

# Call the function with the sample DataFrame
extract_username(df)

# Display the updated DataFrame
df

                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
2         bob@example.com         bob
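
If the column might contain values without an '@', a regular expression is one possible way to make that case visible. The sketch below is an assumption-laden variant, not a replacement for the split-based approach: the sample value 'not-an-email' is made up for illustration.

```python
import pandas as pd

emails = pd.Series(['john.doe@example.com', 'bob@example.com', 'not-an-email'])

# Capture everything before the first '@'; rows that do not match become NaN
usernames = emails.str.extract(r'^([^@]+)@', expand=False)

print(usernames.tolist())
```

With `str.split('@').str.get(0)`, a value without '@' would be returned unchanged as its own "username", whereas here it shows up as a missing value.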


### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

In [8]:
import pandas as pd

# Sample DataFrame
data = {'A': [3, 8, 6, 2, 9],
        'B': [5, 2, 9, 3, 1],
        'C': [1, 7, 4, 5, 2]}

df = pd.DataFrame(data)

# Function to select rows based on conditions in columns 'A' and 'B'
def select_rows(df):
    # Check if columns 'A' and 'B' exist in the DataFrame
    if 'A' in df.columns and 'B' in df.columns:
        # Use boolean indexing to select rows based on conditions
        selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
        
        # Return a new DataFrame containing only the selected rows
        return selected_rows
    else:
        print("DataFrame does not contain columns 'A' and 'B'.")

# Call the function with the sample DataFrame
selected_df = select_rows(df)

# Display the selected DataFrame
selected_df

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
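
The same filter can also be written with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. This is a sketch using the same sample data.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})

# Column names are referenced directly inside the query string
selected = df.query('A > 5 and B < 10')

print(selected.index.tolist())  # prints [1, 2, 4]
```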


### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [9]:
import pandas as pd

# Sample DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Function to calculate mean, median, and standard deviation of 'Values' column
def calculate_statistics(df):
    # Check if the 'Values' column exists in the DataFrame
    if 'Values' in df.columns:
        # Calculate mean, median, and standard deviation
        mean_value = df['Values'].mean()
        median_value = df['Values'].median()
        std_value = df['Values'].std()

        # Print the results
        print(f"Mean: {mean_value}")
        print(f"Median: {median_value}")
        print(f"Standard Deviation: {std_value}")
    else:
        print("DataFrame does not contain a 'Values' column.")

# Call the function with the sample DataFrame
calculate_statistics(df)

Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896
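
The three statistics can also be computed in one call with `agg`. The sketch below also shows why the standard deviation above is about 15.81: pandas uses the sample standard deviation (`ddof=1`) by default, and passing `ddof=0` gives the population value instead.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50], name='Values')

# One call returns a Series indexed by the statistic names
stats = values.agg(['mean', 'median', 'std'])

# Population standard deviation divides by n instead of n - 1
population_std = values.std(ddof=0)

print(stats)
print(population_std)
```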


### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

In [10]:
import pandas as pd

# Sample DataFrame with 'Sales' and 'Date' columns
data = {'Sales': [10, 15, 20, 25, 30, 35, 40, 45, 50],
        'Date': pd.date_range('2022-01-01', periods=9)}

df = pd.DataFrame(data)

# Function to create a new 'MovingAverage' column
def calculate_moving_average(df, window_size=7):
    # Check if the 'Sales' and 'Date' columns exist in the DataFrame
    if 'Sales' in df.columns and 'Date' in df.columns:
        # Set 'Date' column as the index
        df.set_index('Date', inplace=True)

        # Calculate the moving average using a window of size 7
        df['MovingAverage'] = df['Sales'].rolling(window=window_size, min_periods=1).mean()

        # Reset index to keep 'Date' as a regular column
        df.reset_index(inplace=True)
    else:
        print("DataFrame does not contain 'Sales' and 'Date' columns.")

# Call the function with the sample DataFrame
calculate_moving_average(df)

# Display the updated DataFrame
df

        Date  Sales  MovingAverage
0 2022-01-01     10           10.0
1 2022-01-02     15           12.5
2 2022-01-03     20           15.0
3 2022-01-04     25           17.5
4 2022-01-05     30           20.0
5 2022-01-06     35           22.5
6 2022-01-07     40           25.0
7 2022-01-08     45           30.0
8 2022-01-09     50           35.0
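
An integer window counts rows, not days, so it only matches "the past 7 days" when there is exactly one row per day. As a hedged alternative, a time-based window can be sketched as follows. It assumes the 'Date' column is a datetime column sorted in ascending order.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2022-01-01', periods=9),
    'Sales': [10, 15, 20, 25, 30, 35, 40, 45, 50],
})

# '7D' is a time-based window: the current row's date plus the six days before it
df['MovingAverage'] = df.rolling('7D', on='Date')['Sales'].mean()

print(df)
```

For this daily sample the result matches the integer-window version; the two would differ if some dates were missing.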


### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

In [11]:
import pandas as pd

# Sample DataFrame with 'Date' column
data = {'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])}
df = pd.DataFrame(data)

# Function to create a new 'Weekday' column
def add_weekday_column(df):
    # Check if the 'Date' column exists in the DataFrame
    if 'Date' in df.columns:
        # Extract the weekday name and create a new 'Weekday' column
        df['Weekday'] = df['Date'].dt.day_name()
        # Return the modified DataFrame
        return df
    else:
        print("DataFrame does not contain a 'Date' column.")

# Call the function with the sample DataFrame
modified_df = add_weekday_column(df)

# Display the modified DataFrame
modified_df

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [12]:
import pandas as pd

# Sample DataFrame with 'Date' column
data = {'Date': pd.to_datetime(['2023-01-01', '2023-01-15', '2023-01-25', '2023-02-05', '2023-02-15'])}
df = pd.DataFrame(data)

# Function to select rows based on date range
def select_rows_in_date_range(df, start_date, end_date):
    # Check if the 'Date' column exists in the DataFrame
    if 'Date' in df.columns:
        # Use boolean indexing to select rows within the specified date range
        selected_rows = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
        
        # Return a new DataFrame containing only the selected rows
        return selected_rows
    else:
        print("DataFrame does not contain a 'Date' column.")

# Call the function with the sample DataFrame and date range
start_date = pd.to_datetime('2023-01-01')
end_date = pd.to_datetime('2023-01-31')

selected_df = select_rows_in_date_range(df, start_date, end_date)

# Display the selected DataFrame
selected_df

        Date
0 2023-01-01
1 2023-01-15
2 2023-01-25
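
A more compact way to express the same inclusive range is `Series.between`, sketched below with the same sample dates.

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(
    ['2023-01-01', '2023-01-15', '2023-01-25', '2023-02-05', '2023-02-15'])})

# between() is inclusive on both ends by default
in_january = df[df['Date'].between('2023-01-01', '2023-01-31')]

print(in_january)
```

One caveat when the column holds timestamps with a time component: '2023-01-31' is interpreted as midnight at the start of that day, so rows later on January 31 would be excluded unless the upper bound is adjusted.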


### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported? 

To use the basic functions of pandas, the first and foremost library that needs to be imported is pandas itself. You can import it using the following convention:

import pandas as pd

pd is the conventional alias for the pandas library. After this import, pandas functions are referenced with the pd prefix, for example pd.DataFrame() or pd.read_csv(), which keeps the code concise.