Q1. List any five functions of the pandas library with execution.

Ans--

Here are five commonly used functions from the Pandas library along with their execution examples:

1. read_csv(): Used to read CSV (Comma-Separated Values) files and create a DataFrame.

In [4]:
import pandas as pd

# Read a CSV file and create a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

2. head(): Displays the first few rows of a DataFrame.

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


3. groupby(): Groups data based on a column and allows applying aggregate functions.

In [6]:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Claire', 'Bob', 'Alice'],
        'Age': [25, 30, 27, 30, 28]}

df = pd.DataFrame(data)

# Group by 'Name' and calculate the mean age for each name
grouped = df.groupby('Name')['Age'].mean()
print(grouped)

Name
Alice     26.5
Bob       30.0
Claire    27.0
Name: Age, dtype: float64


4. fillna(): Fills missing values in a DataFrame with specified values or methods.

In [7]:
import pandas as pd
import numpy as np

data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, np.nan, 8]}

df = pd.DataFrame(data)

# Fill missing values with 0
filled_df = df.fillna(0)
print(filled_df)

     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  0.0
3  4.0  8.0


5. pivot_table(): Creates a pivot table from a DataFrame, allowing you to summarize and aggregate data.

In [8]:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Claire'],
        'Age': [25, 30, 28, 30, 27],
        'Gender': ['Female', 'Male', 'Female', 'Male', 'Female']}

df = pd.DataFrame(data)

# Create a pivot table showing average age by gender
pivot_table = df.pivot_table(values='Age', index='Gender', aggfunc='mean')
print(pivot_table)

              Age
Gender           
Female  26.666667
Male    30.000000


These examples showcase how you can use different Pandas functions to read data, manipulate it, aggregate it, and perform various data analysis tasks.
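As a rough illustration of how these functions combine, the sketch below fills a missing value with the column mean and then aggregates by group. The 'Team' and 'Score' columns and their values are made up for this example and are not from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing score in team A
scores = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Score': [10.0, np.nan, 30.0, 50.0],
})

# Fill the gap with the mean of the known scores (10, 30, 50 -> 30)
scores['Score'] = scores['Score'].fillna(scores['Score'].mean())

# Average score per team after filling
summary = scores.groupby('Team')['Score'].mean()
print(summary)
```

Filling with the overall mean is only one possible choice; a per-group mean or a fixed value may suit other data better.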

Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the
DataFrame with a new index that starts from 1 and increments by 2 for each row.

Ans--

You can achieve this by using the reset_index() function along with the set_index() function in Pandas. Here's a Python function that takes a DataFrame as input and re-indexes it as described:

In this example, the reindex_dataframe() function takes a DataFrame df as input. It creates a new index using pd.RangeIndex() with the specified start, stop, and step values to generate an index starting from 1 and incrementing by 2 for each row. The function then resets the existing index using reset_index(drop=True) and sets the new index using set_index(new_index). Finally, the re-indexed DataFrame is returned.

The output will show the DataFrame with the new index:

In [10]:
import pandas as pd

def reindex_dataframe(df):
    new_index = pd.RangeIndex(start=1, stop=len(df) * 2, step=2)
    new_df = df.reset_index(drop=True).set_index(new_index)
    return new_df

# Example DataFrame
data = {'A': [10, 20, 30],
        'B': [40, 50, 60],
        'C': [70, 80, 90]}

df = pd.DataFrame(data)

# Re-index the DataFrame
new_df = reindex_dataframe(df)
print(new_df)

    A   B   C
1  10  40  70
3  20  50  80
5  30  60  90
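
An alternative sketch, assuming it is acceptable to work on a copy: the new labels can be assigned directly to the index, which avoids the reset_index()/set_index() round trip. The name reindex_odd is hypothetical and used only for illustration.

```python
import pandas as pd

def reindex_odd(df):
    # Hypothetical helper: copy first so the caller's DataFrame is untouched
    out = df.copy()
    # Labels 1, 3, 5, ... with exactly one label per row
    out.index = pd.RangeIndex(start=1, stop=2 * len(df), step=2)
    return out

example = pd.DataFrame({'A': [10, 20, 30],
                        'B': [40, 50, 60],
                        'C': [70, 80, 90]})
result = reindex_odd(example)
print(result.index.tolist())
```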


Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that
iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The
function should print the sum to the console.

For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should
calculate and print the sum of the first three values, which is 60.

Ans--

Here's a Python function that calculates the sum of the first three values in the 'Values' column. Rather than looping explicitly, it slices the column by position:

In [12]:
import pandas as pd

def calculate_sum_of_first_three(df):
    sum_first_three = df['Values'].iloc[:3].sum()
    print("Sum of the first three values:", sum_first_three)

# Example DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Call the function to calculate and print the sum
calculate_sum_of_first_three(df)

Sum of the first three values: 60


In this example, the calculate_sum_of_first_three() function takes a DataFrame df as input. It uses the .iloc[:3] indexing to select the first three values from the 'Values' column and then calculates their sum using the .sum() function. The sum is printed to the console.
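
Since the question asks for iteration, here is a minimal sketch of the same calculation with an explicit loop. The name sum_first_three_with_loop is hypothetical, and unlike the function above it also returns the total so the result can be reused.

```python
import pandas as pd

def sum_first_three_with_loop(df):
    # Hypothetical variant: walk the column value by value, stop after three
    total = 0
    for position, value in enumerate(df['Values']):
        if position >= 3:
            break
        total += value
    print("Sum of the first three values:", total)
    return total

loop_total = sum_first_three_with_loop(
    pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
)
```

The slicing version is usually preferable in practice; the loop is shown only to match the wording of the question.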

For the provided example DataFrame with values [10, 20, 30, 40, 50], calculate_sum_of_first_three() prints a sum of 60.

Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column
'Word_Count' that contains the number of words in each row of the 'Text' column.

Ans--

We can achieve this by using the apply() function along with a lambda function to count the words in each row of the 'Text' column and create a new 'Word_Count' column. Here's a Python function that accomplishes this:

In this example, the add_word_count_column() function takes a DataFrame df as input. It uses the apply() function with a lambda function to split each text in the 'Text' column into words using the .split() method and then calculates the length of the resulting list of words to determine the word count. The 'Word_Count' column is then added to the DataFrame.

The output will show the DataFrame with the new 'Word_Count' column:

In [14]:
import pandas as pd

def add_word_count_column(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(x.split()))
    return df

# Example DataFrame
data = {'Text': ['This is a sample sentence.',
                 'Count the words in this text.',
                 'Another example with more words.']}

df = pd.DataFrame(data)

# Call the function to add the 'Word_Count' column
df_with_word_count = add_word_count_column(df)
print(df_with_word_count)

                               Text  Word_Count
0        This is a sample sentence.           5
1     Count the words in this text.           6
2  Another example with more words.           5
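
The lambda version raises an AttributeError if 'Text' contains a missing value, because None and NaN have no split() method. A possible variant, assuming missing text should count as 0 words, uses the vectorized .str accessor instead. The name add_word_count_vectorized is hypothetical.

```python
import pandas as pd

def add_word_count_vectorized(df):
    # Hypothetical variant: treat missing text as an empty string (0 words)
    out = df.copy()
    out['Word_Count'] = out['Text'].fillna('').str.split().str.len()
    return out

sample = pd.DataFrame({'Text': ['This is a sample sentence.', None, 'Two words']})
counted = add_word_count_vectorized(sample)
print(counted['Word_Count'].tolist())
```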


Q5. How are DataFrame.size() and DataFrame.shape() different?

Ans--

DataFrame.size and DataFrame.shape are both attributes of a Pandas DataFrame, but they provide different kinds of information about the DataFrame:

1. DataFrame.size: This attribute returns the total number of elements in the DataFrame. It is calculated as the product of the number of rows and the number of columns.

In [15]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

print(df.size)  # Output: 6 (3 rows * 2 columns)

6


2. DataFrame.shape: This attribute returns a tuple representing the dimensions of the DataFrame. The tuple contains two values: the number of rows and the number of columns.

In [16]:
import pandas as pd

data = {'A': [1, 2, 3],
        'B': [4, 5, 6]}

df = pd.DataFrame(data)

print(df.shape)  # Output: (3, 2) (3 rows, 2 columns)

(3, 2)


In summary, DataFrame.size returns the total number of elements in the DataFrame, while DataFrame.shape returns a tuple representing the dimensions of the DataFrame (number of rows and number of columns).
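
The question writes both names with parentheses, so a short check may help: the sketch below confirms that size equals the product of the two shape entries, and that calling size like a function fails because it is an attribute. The variable names are chosen for this example only.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df_demo.shape          # shape is a (rows, columns) tuple
print(df_demo.size == rows * cols)  # size is the product of the two

# size is an attribute holding an integer, so "calling" it raises TypeError
try:
    df_demo.size()
    called_ok = True
except TypeError:
    called_ok = False
print(called_ok)
```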

Q6. Which function of pandas do we use to read an excel file?

Ans--

To read an Excel file using Pandas, you can use the pd.read_excel() function. This function allows you to read data from Excel files (.xls or .xlsx) and create a DataFrame in Python. Here's the basic syntax:

In [None]:
import pandas as pd

# Read an Excel file and create a DataFrame
df = pd.read_excel('filename.xlsx')

Replace 'filename.xlsx' with the actual path to the Excel file you want to read. You can also provide additional arguments to the pd.read_excel() function to specify options such as sheet name, header row, and more, depending on your needs.

Here's an example of reading an Excel file named "data.xlsx" from the current directory and creating a DataFrame:

In [None]:
import pandas as pd

# Read an Excel file and create a DataFrame
df = pd.read_excel('data.xlsx')

# Display the DataFrame
print(df)

Pandas relies on the openpyxl library to read and write .xlsx files (legacy .xls files are read with xlrd instead), so install it if it is missing. In a notebook you can install it using:

In [None]:
%pip install openpyxl

Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email
addresses in the format 'username@domain.com'. Write a Python function that creates a new column
'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the
email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your
function should extract the username from each email address and store it in the new 'Username'
column.

Ans--

You can achieve this by using the str.split() method along with a lambda function to extract the username from each email address and create a new 'Username' column. Here's a Python function that accomplishes this:

In this example, the extract_username() function takes a DataFrame df as input. It uses the apply() function with a lambda function to split each email in the 'Email' column using the @ symbol as the delimiter and then selects the first part of the split (i.e., the username). The 'Username' column is then added to the DataFrame.

The output will show the DataFrame with the new 'Username' column:

In [19]:
import pandas as pd

def extract_username(df):
    df['Username'] = df['Email'].apply(lambda email: email.split('@')[0])
    return df

# Example DataFrame
data = {'Email': ['john.doe@example.com', 'jane.smith@example.com', 'jim.brown@example.com']}
df = pd.DataFrame(data)

# Call the function to extract usernames and create the 'Username' column
df_with_username = extract_username(df)
print(df_with_username)

                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
2   jim.brown@example.com   jim.brown
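
A vectorized alternative, sketched under the assumption that missing emails should stay missing rather than raise an error, uses the .str accessor. The name extract_username_vectorized is hypothetical.

```python
import pandas as pd

def extract_username_vectorized(df):
    out = df.copy()
    # Split once at the first '@' and keep the part before it
    out['Username'] = out['Email'].str.split('@', n=1).str[0]
    return out

emails = pd.DataFrame({'Email': ['john.doe@example.com', None, 'jim.brown@example.com']})
usernames = extract_username_vectorized(emails)['Username']
print(usernames)
```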


Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.

For example, if df contains the following values:
    
A B C
0 3 5 1
1 8 2 7
2 6 9 4
3 2 3 5
4 9 1 2

Your function should select the following rows:

A B C
1 8 2 7
4 9 1 2

The function should return a new DataFrame that contains only the selected rows.

Ans--

We can achieve this by using boolean indexing with logical conditions. Here's a Python function that selects the rows based on the conditions you've specified and returns a new DataFrame:

In this example, the filter_dataframe() function takes a DataFrame df as input. It uses boolean indexing to select rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The conditions are combined using the & (AND) operator. The function returns a new DataFrame containing only the selected rows.

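The output will show the new DataFrame with the selected rows. Row 2 (A=6, B=9) also satisfies both conditions, so it is included even though the sample output in the question omits it: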

In [20]:
import pandas as pd

def filter_dataframe(df):
    filtered_df = df[(df['A'] > 5) & (df['B'] < 10)]
    return filtered_df

# Example DataFrame
data = {'A': [3, 8, 6, 2, 9],
        'B': [5, 2, 9, 3, 1],
        'C': [1, 7, 4, 5, 2]}

df = pd.DataFrame(data)

# Call the function to filter the DataFrame
filtered_df = filter_dataframe(df)
print(filtered_df)

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
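
The same filter can also be written with DataFrame.query(), which some readers find easier to scan when there are several conditions. This is a sketch of an equivalent form; the name filter_with_query is hypothetical.

```python
import pandas as pd

def filter_with_query(df):
    # Hypothetical variant: same two conditions expressed as a query string
    return df.query('A > 5 and B < 10')

table = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                      'B': [5, 2, 9, 3, 1],
                      'C': [1, 7, 4, 5, 2]})
queried = filter_with_query(table)
print(queried)
```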


Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean,
median, and standard deviation of the values in the 'Values' column.

Ans--

You can calculate the mean, median, and standard deviation of the values in the 'Values' column of a Pandas DataFrame using the built-in functions provided by Pandas. Here's a Python function that does that:

In this example, the calculate_statistics() function takes a DataFrame df as input. It uses the .mean(), .median(), and .std() functions on the 'Values' column to calculate the mean, median, and standard deviation of the values, respectively.

The output will display the calculated statistics:

In [21]:
import pandas as pd

def calculate_statistics(df):
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_deviation = df['Values'].std()
    
    return mean_value, median_value, std_deviation

# Example DataFrame
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Call the function to calculate statistics
mean, median, std = calculate_statistics(df)

print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std)

Mean: 30.0
Median: 30.0
Standard Deviation: 15.811388300841896
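
The value 15.81 is the sample standard deviation, because .std() uses ddof=1 by default. The sketch below contrasts it with the population standard deviation (ddof=0) for the same five values, whose squared deviations from the mean add up to 1000.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

sample_std = values.std()            # default ddof=1 -> sqrt(1000 / 4)
population_std = values.std(ddof=0)  # ddof=0 -> sqrt(1000 / 5)

print("Sample std:", sample_std)
print("Population std:", population_std)
```

Which one is appropriate depends on whether the values are a sample or the whole population; the question does not say, so the default is a reasonable reading.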


Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to
create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days
for each row in the DataFrame. The moving average should be calculated using a window of size 7 and
should include the current day.

Ans--

We can calculate the moving average for the past 7 days using the rolling() function provided by Pandas. Here's a Python function that does that:

In this example, the calculate_moving_average() function takes a DataFrame df as input. It uses the .rolling() function with a window of size 7 and min_periods=1 to calculate the moving average of the 'Sales' column. The min_periods=1 ensures that even if there are fewer than 7 days' data, the moving average is still calculated based on available data. Note that rolling(window=7) counts rows, not calendar days, so this treats the past 7 days as the past 7 rows and assumes one row per day in date order. The result is stored in the 'MovingAverage' column.

The output will show the DataFrame with the new 'MovingAverage' column:

In [22]:
import pandas as pd

def calculate_moving_average(df):
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    return df

# Example DataFrame
data = {'Date': pd.date_range(start='2023-08-01', periods=10, freq='D'),
        'Sales': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]}

df = pd.DataFrame(data)

# Call the function to calculate the moving average
df_with_moving_average = calculate_moving_average(df)
print(df_with_moving_average)


        Date  Sales  MovingAverage
0 2023-08-01     10           10.0
1 2023-08-02     15           12.5
2 2023-08-03     20           15.0
3 2023-08-04     25           17.5
4 2023-08-05     30           20.0
5 2023-08-06     35           22.5
6 2023-08-07     40           25.0
7 2023-08-08     45           30.0
8 2023-08-09     50           35.0
9 2023-08-10     55           40.0


These values represent the moving average of the sales for the past 7 days for each row in the DataFrame.
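
If some days are missing from the data, a row-based window no longer matches 7 calendar days. One possible variant, sketched here with made-up dates that contain a gap, uses a time-based window of '7D' on a datetime index. The name calculate_moving_average_by_date is hypothetical.

```python
import pandas as pd

def calculate_moving_average_by_date(df):
    # Hypothetical variant: window covers the 7 calendar days ending at each row's date
    out = df.sort_values('Date').copy()
    rolled = out.set_index('Date')['Sales'].rolling('7D').mean()
    out['MovingAverage'] = rolled.to_numpy()
    return out

# Made-up data with a gap between 2023-08-02 and 2023-08-10
gappy = pd.DataFrame({
    'Date': pd.to_datetime(['2023-08-01', '2023-08-02', '2023-08-10']),
    'Sales': [10, 20, 60],
})
result_by_date = calculate_moving_average_by_date(gappy)
print(result_by_date)
```

Here the last row averages only its own value, because no other sale falls within the 7 days ending on 2023-08-10, whereas a row-based window of 7 would average all three rows.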

Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new
column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g.
Monday, Tuesday) corresponding to each date in the 'Date' column.
For example, if df contains the following values:
Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05
Your function should create the following DataFrame:

Date Weekday
0 2023-01-01 Sunday
1 2023-01-02 Monday
2 2023-01-03 Tuesday
3 2023-01-04 Wednesday
4 2023-01-05 Thursday

The function should return the modified DataFrame.

Ans--

You can achieve this by using the dt.day_name() function provided by Pandas to extract the weekday name from the 'Date' column. Here's a Python function that accomplishes this:

In this example, the add_weekday_column() function takes a DataFrame df as input. It expects the 'Date' column to already be in datetime format, so the example converts it with pd.to_datetime() before calling the function. The function then uses the .dt.day_name() function to extract the weekday name corresponding to each date and adds this information as a new 'Weekday' column.

The output will show the modified DataFrame with the 'Weekday' column:

In [23]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.day_name()
    return df

# Example DataFrame
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])  # Convert 'Date' column to datetime format

# Call the function to add the 'Weekday' column
df_with_weekday = add_weekday_column(df)
print(df_with_weekday)

        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


These are the weekday names corresponding to each date in the 'Date' column.
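
The .dt accessor raises an error on a column of plain strings. A possible variant, assuming the function should also accept string dates, performs the conversion itself. The name add_weekday_column_safe is hypothetical.

```python
import pandas as pd

def add_weekday_column_safe(df):
    out = df.copy()
    # Convert inside the function so string dates are accepted as well
    out['Date'] = pd.to_datetime(out['Date'])
    out['Weekday'] = out['Date'].dt.day_name()
    return out

raw = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03']})
weekdays = add_weekday_column_safe(raw)['Weekday'].tolist()
print(weekdays)
```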

Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python
function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

Ans--

We can use boolean indexing to select rows based on a date range. Here's a Python function that selects rows where the date is between '2023-01-01' and '2023-01-31':

In this example, the select_rows_in_date_range() function takes a DataFrame df as input. It defines the start and end dates as strings. Then, it creates a boolean mask using the & (AND) operator to check if each date in the 'Date' column is within the specified range. The mask is used to select the rows that meet the condition.

The output will show the DataFrame with rows that fall within the date range:

In [25]:
import pandas as pd

def select_rows_in_date_range(df):
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    
    mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
    selected_rows = df[mask]
    
    return selected_rows

# Example DataFrame
data = {'Date': ['2023-01-01', '2023-01-15', '2023-01-25', '2023-02-05']}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])  # Convert 'Date' column to datetime format

# Call the function to select rows in the date range
selected_df = select_rows_in_date_range(df)
print(selected_df)

        Date
0 2023-01-01
1 2023-01-15
2 2023-01-25


These are the rows where the date is between '2023-01-01' and '2023-01-31'.
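
One caveat when the column holds timestamps with a time of day: the string '2023-01-31' is interpreted as midnight at the start of that day, so a comparison with <= drops later timestamps on 31 January. A sketch of one way around this, assuming the whole of January should be included, uses an exclusive upper bound of 1 February. The name select_january_2023 is hypothetical.

```python
import pandas as pd

def select_january_2023(df):
    start = pd.Timestamp('2023-01-01')
    end_exclusive = pd.Timestamp('2023-02-01')
    # Half-open range: start <= Date < end_exclusive
    mask = (df['Date'] >= start) & (df['Date'] < end_exclusive)
    return df[mask]

# Made-up timestamps, including one in the evening of 31 January
stamps = pd.DataFrame({'Date': pd.to_datetime([
    '2023-01-01 00:00', '2023-01-31 18:30', '2023-02-01 00:00'])})
january = select_january_2023(stamps)
print(january)
```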

Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to
be imported?

Ans--

To use the basic functions of Pandas, the first and foremost necessary library that needs to be imported is pandas itself. You can import the Pandas library using the following line of code:

In [26]:
import pandas as pd

By convention, pd is the alias used for the Pandas library. Once you've imported Pandas, you can use its functions and data structures like DataFrames and Series to perform various data manipulation and analysis tasks.