# Q1. List any five functions of the pandas library with execution.

# Ans :

Here are five common functions in the pandas library along with their execution examples:

1. read_csv(): This function is used to read a CSV file and create a DataFrame.

In [3]:
import pandas as pd

# Creating a DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Emily', 'David'],
        'Age': [25, 28, 30, 32, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney', 'Tokyo']}
df = pd.DataFrame(data)

# Creating the "data.csv" file
df.to_csv('data.csv', index=False)

# Reading the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# Displaying the DataFrame (it has only 5 rows, so all of them are printed)
print(df)


    Name  Age      City
0   John   25  New York
1   Jane   28     Paris
2   Mike   30    London
3  Emily   32    Sydney
4  David   35     Tokyo


2. head(): This function returns the first n rows of a DataFrame. By default, it returns the first 5 rows.

In [2]:
# Displaying the first 3 rows of the DataFrame
print(df.head(3))


   Name  Age      City
0  John   25  New York
1  Jane   28     Paris
2  Mike   30    London


3. info(): This function provides a concise summary of a DataFrame, including the column names, data types, and non-null counts.

In [3]:
# Displaying the information about the DataFrame
# info() prints the summary itself and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes


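4. describe(): This function generates descriptive statistics of a DataFrame, including count, mean, standard deviation, minimum, maximum, and quartile values. By default it summarizes only the numeric columns, which is why the output below covers 'Age' alone.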

In [4]:
# Generating descriptive statistics of the DataFrame
print(df.describe())

             Age
count   5.000000
mean   30.000000
std     3.807887
min    25.000000
25%    28.000000
50%    30.000000
75%    32.000000
max    35.000000
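
To summarize the text columns as well, describe() accepts an include argument. The following is a minimal sketch on a rebuilt copy of the sample data (the names people and full_summary are illustrative, not part of the original answer); non-numeric columns get count, unique, top, and freq, and the statistics that do not apply to a column show up as NaN.

```python
import pandas as pd

# Illustrative copy of the sample data used above
people = pd.DataFrame({
    'Name': ['John', 'Jane', 'Mike', 'Emily', 'David'],
    'Age': [25, 28, 30, 32, 35],
    'City': ['New York', 'Paris', 'London', 'Sydney', 'Tokyo'],
})

# include='all' asks describe() to cover every column, not just the numeric ones
full_summary = people.describe(include='all')
print(full_summary)
```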


5. groupby(): This function is used to group data based on one or more columns and perform aggregate operations on the grouped data.

In [5]:
# Grouping the DataFrame by the 'City' column and calculating the average age for each city
grouped = df.groupby('City')['Age'].mean()

# Displaying the grouped data
print(grouped)

City
London      30.0
New York    25.0
Paris       28.0
Sydney      32.0
Tokyo       35.0
Name: Age, dtype: float64
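
In the output above every city appears once, so each group holds a single row and the mean equals the original age. A small sketch with repeated cities shows the grouping more clearly. The data and the names people and age_by_city are made up for illustration.

```python
import pandas as pd

# Hypothetical data: cities repeat, so each group holds more than one row
people = pd.DataFrame({
    'Name': ['John', 'Jane', 'Mike', 'Emily', 'David', 'Sara'],
    'Age': [25, 28, 30, 32, 35, 41],
    'City': ['Paris', 'Paris', 'London', 'London', 'Tokyo', 'Tokyo'],
})

# agg() with a list computes several aggregates per group in one call
age_by_city = people.groupby('City')['Age'].agg(['count', 'mean', 'max'])
print(age_by_city)
```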


# Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

# Ans :

In [7]:
import pandas as pd

# Creating the DataFrame
data = {'A': [10, 20, 30, 40],
        'B': [50, 60, 70, 80],
        'C': [90, 100, 110, 120]}
df = pd.DataFrame(data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Function to re-index the DataFrame
def reindex_with_increment(df):
    df_new = df.copy()  # Create a copy of the original DataFrame
    df_new.index = range(1, len(df) * 2 + 1, 2)
    # Set the new index starting from 1 and incrementing by 2.
    # len(df) * 2 + 1 is the stop value passed to range(). The stop value is exclusive,
    # so the last index actually produced is len(df) * 2 - 1 (7 for the 4-row DataFrame above).
    return df_new

# Call the function to re-index the DataFrame
df_reindexed = reindex_with_increment(df)

# Displaying the re-indexed DataFrame
print("\nRe-indexed DataFrame:")
print(df_reindexed)


Original DataFrame:
    A   B    C
0  10  50   90
1  20  60  100
2  30  70  110
3  40  80  120

Re-indexed DataFrame:
    A   B    C
1  10  50   90
3  20  60  100
5  30  70  110
7  40  80  120
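
The same re-indexing can also be written with pd.RangeIndex, which makes the start, stop, and step explicit. This is only a sketch of one alternative; the function name reindex_odd and its start and step parameters are assumptions added for illustration.

```python
import pandas as pd

def reindex_odd(df, start=1, step=2):
    """Return a copy of df whose index is start, start + step, start + 2 * step, ..."""
    out = df.copy()
    # stop is exclusive, so start + step * len(out) yields exactly len(out) labels
    out.index = pd.RangeIndex(start=start, stop=start + step * len(out), step=step)
    return out

sample = pd.DataFrame({'A': [10, 20, 30, 40],
                       'B': [50, 60, 70, 80],
                       'C': [90, 100, 110, 120]})
reindexed = reindex_odd(sample)
print(reindexed)
```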


# Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console. 
# For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should calculate and print the sum of the first three values, which is 60.

# Ans :

In [8]:
import pandas as pd

# Creating the DataFrame
data = {'Values': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)

# Displaying the DataFrame
print("DataFrame:")
print(df)

# Function to calculate the sum of the first three values
def calculate_sum_of_first_three(df):
    # Get the 'Values' column as a Series
    values_column = df['Values']

    # Extract the first three values
    first_three_values = values_column.head(3)

    # Calculate the sum of the first three values
    sum_of_first_three = first_three_values.sum()

    # Print the sum to the console
    print("Sum of the first three values:", sum_of_first_three)

# Call the function to calculate the sum of the first three values
calculate_sum_of_first_three(df)


DataFrame:
   Values
0      10
1      20
2      30
3      40
4      50
5      60
Sum of the first three values: 60
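
The question asks for a function that iterates over the DataFrame, while the answer above uses head(3).sum(). A loop-based sketch that gives the same result could look like this. The function name is hypothetical, and returning the total in addition to printing it is an extra added here for convenience.

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    total = 0
    # Walk through the 'Values' column and stop after three items
    for position, value in enumerate(df['Values']):
        if position == 3:
            break
        total += value
    print("Sum of the first three values:", total)
    return total

values_df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
result = sum_first_three_by_iteration(values_df)
```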


# Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

# Ans :

In [16]:
import pandas as pd
df = pd.DataFrame({'Text': ['This is a sample sentence.',
                            'Another sentence with more words.',
                            'Only one word.']})
print("Original DataFrame:")
print(df)

def add_word_count(df):
    '''Add a 'Word_Count' column with the number of words in each 'Text' entry.

    str.split() with no arguments splits on any run of whitespace (spaces, tabs, etc.),
    giving a list of words per row; str.len() then counts the items in each list.
    The DataFrame is modified in place.
    '''
    df['Word_Count'] = df['Text'].str.split().str.len()
    
add_word_count(df)
print("\nDataFrame with Word_Count:")
print(df)


Original DataFrame:
                                Text
0         This is a sample sentence.
1  Another sentence with more words.
2                     Only one word.

DataFrame with Word_Count:
                                Text  Word_Count
0         This is a sample sentence.           5
1  Another sentence with more words.           5
2                     Only one word.           3
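
If the 'Text' column can contain missing values, str.split() leaves them as NaN and the count comes out as NaN too. One possible way to handle that, sketched here under the assumption that a missing text should count as zero words, is to fill the gaps with an empty string first. The DataFrame texts is made up for illustration.

```python
import pandas as pd

# Hypothetical data with a missing entry and a whitespace-only entry
texts = pd.DataFrame({'Text': ['This is a sample sentence.', None, '   ', 'Only one word.']})

# fillna('') turns missing text into an empty string, which splits into an empty list
texts['Word_Count'] = texts['Text'].fillna('').str.split().str.len()
print(texts)
```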


# Q5. How are DataFrame.size() and DataFrame.shape() different?

# Ans :

DataFrame.size and DataFrame.shape are both attributes (properties) of a pandas DataFrame, not methods, so they are accessed without parentheses. They serve different purposes.

* DataFrame.size returns the total number of elements (i.e., cells) in a DataFrame, which is equal to the number of rows multiplied by the number of columns. Every cell is counted, including missing values. For example, if a DataFrame has 3 rows and 4 columns, DataFrame.size is 12.

* DataFrame.shape returns a tuple that represents the dimensions of a DataFrame, in the form (number of rows, number of columns). For example, if a DataFrame has 3 rows and 4 columns, DataFrame.shape is (3, 4).

In short, size is a single integer giving the total element count, while shape reports the row count and the column count separately.


Here's an example to illustrate the difference between the two methods:

In [5]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

print("DataFrame size:", df.size)
print("DataFrame shape:", df.shape)


DataFrame size: 9
DataFrame shape: (3, 3)
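
A short sketch of how the two relate, and of what happens when either one is called like a method. The variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = frame.shape           # shape unpacks into (rows, columns)
print(frame.size == rows * cols)   # size is the product of the two

# A Series has a one-element shape tuple
series_shape = frame['A'].shape
print(series_shape)

# shape is a plain tuple, so calling it raises TypeError
try:
    frame.shape()
    call_failed = False
except TypeError as exc:
    call_failed = True
    print("shape is not callable:", exc)
```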


# Q6. Which function of pandas do we use to read an excel file?

# Ans :

To read an Excel file in pandas, use the pd.read_excel() function. It reads a worksheet (the first one by default) into a pandas DataFrame. Reading .xlsx files relies on an optional engine such as openpyxl being installed.
Here's an example of how to use pd.read_excel():

In [None]:
import pandas as pd

# Read Excel file into a DataFrame
df = pd.read_excel('file.xlsx')

# Display the DataFrame
print(df)


# Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

# The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

# Ans :

In [16]:
import pandas as pd
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com']})
print("Original DataFrame:")
print(df)


# str.split('@') splits each email address in the 'Email' column on the '@' symbol.
# For a well-formed address this gives a list of two elements: the username and the domain.
# str[0] then extracts the first element (the username).
def extract_username(df):
    df['Username'] = df['Email'].str.split('@').str[0]
    return df

df = extract_username(df)
print("\nDataFrame with Username")
print(df)


Original DataFrame:
                    Email
0    john.doe@example.com
1  jane.smith@example.com

DataFrame with Username
                    Email    Username
0    john.doe@example.com    john.doe
1  jane.smith@example.com  jane.smith
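
When the domain is needed as well, str.split() can return the pieces as separate columns. This is a sketch of that variation, not part of the original task; the 'Domain' column and the names emails and parts are additions for illustration.

```python
import pandas as pd

emails = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.smith@example.com']})

# n=1 splits at the first '@' only; expand=True returns a DataFrame with one column per piece
parts = emails['Email'].str.split('@', n=1, expand=True)
emails['Username'] = parts[0]
emails['Domain'] = parts[1]
print(emails)
```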


# Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.
# For example, if df contains the following values:
|   | A | B | C |
|---|---|---|---|
| 0 | 3 | 5 | 1 |
| 1 | 8 | 2 | 7 |
| 2 | 6 | 9 | 4 |
| 3 | 2 | 3 | 5 |
| 4 | 9 | 1 | 2 |

In [6]:
import pandas as pd
df = pd.DataFrame({'A': [3,8,6,2,9],
                   'B': [5,2,9,3,1],
                   'C': [1,7,4,5,2]
                  })
print("Given Data Frame :\n")
print(df)

def filter_rows(df):
    # Keep rows where A > 5 and B < 10.
    # Boolean indexing returns a new DataFrame, so the original is left untouched.
    result = df[(df["A"] > 5) & (df["B"] < 10)]
    return result

filtered_df = filter_rows(df)
print("\nNew Data Frame After Filtering :\n")
print(filtered_df)


Given Data Frame :

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

New Data Frame After Filtering :

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
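
The same selection can be expressed with DataFrame.query(), which takes the condition as a string. A brief sketch, with illustrative variable names:

```python
import pandas as pd

table = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                      'B': [5, 2, 9, 3, 1],
                      'C': [1, 7, 4, 5, 2]})

# query() evaluates the expression against the column names and returns the matching rows
selected = table.query('A > 5 and B < 10')
print(selected)
```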


# Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

# Ans :

In [35]:
import pandas as pd

df = pd.DataFrame({'Values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

def calculate_statistics(df):
    values = df['Values']
    mean_val = values.mean()
    median_val = values.median()
    std_dev_val = values.std()
    return mean_val, median_val, std_dev_val

mean, median, std_dev = calculate_statistics(df) # Values are assigned to the variables mean, median, and std_dev respectively.
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)


Mean: 5.5
Median: 5.5
Standard Deviation: 3.0276503540974917
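
Note that std() in pandas computes the sample standard deviation (ddof=1) by default, which is the value printed above. The sketch below, using illustrative names, gets the three statistics in one agg() call and also shows the population standard deviation for comparison.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], name='Values')

# agg() with a list returns one labelled result per statistic
summary = values.agg(['mean', 'median', 'std'])
print(summary)

# ddof=0 divides by n instead of n - 1 (population standard deviation)
population_std = values.std(ddof=0)
print("Population standard deviation:", population_std)
```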


# Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

# Ans :

In [1]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Date': ['2023-05-01', '2023-05-02', '2023-05-03', '2023-05-04', '2023-05-05', '2023-05-06', '2023-05-07', '2023-05-08', '2023-05-09', '2023-05-10', '2023-05-11', '2023-05-12', '2023-05-13', '2023-05-14'],
    'Sales': [10, 12, 8, 15, 9, 11, 13, 14, 16, 17, 11, 9, 12, 15]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print (df)

# Converting 'Date' column to datetime format.
# Real datetime values are needed for date-related operations, such as sorting
# the DataFrame chronologically or calculating a moving average over date ranges.

df['Date'] = pd.to_datetime(df['Date'])

# Sorting DataFrame by 'Date'
df = df.sort_values('Date')

def calculate_moving_average(df):
    # rolling(window=7) averages the current row together with the 6 rows before it,
    # so the current day is always included. min_periods=1 lets the first 6 rows use
    # however many values exist so far instead of producing NaN.
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    return df

# Calculating the moving average
new_df = calculate_moving_average(df)

print("\nNew DataFrame:")
print(new_df)


Original DataFrame:
          Date  Sales
0   2023-05-01     10
1   2023-05-02     12
2   2023-05-03      8
3   2023-05-04     15
4   2023-05-05      9
5   2023-05-06     11
6   2023-05-07     13
7   2023-05-08     14
8   2023-05-09     16
9   2023-05-10     17
10  2023-05-11     11
11  2023-05-12      9
12  2023-05-13     12
13  2023-05-14     15

New DataFrame:
         Date  Sales  MovingAverage
0  2023-05-01     10      10.000000
1  2023-05-02     12      11.000000
2  2023-05-03      8      10.000000
3  2023-05-04     15      11.250000
4  2023-05-05      9      10.800000
5  2023-05-06     11      10.833333
6  2023-05-07     13      11.142857
7  2023-05-08     14      11.714286
8  2023-05-09     16      12.285714
9  2023-05-10     17      13.571429
10 2023-05-11     11      13.000000
11 2023-05-12      9      13.000000
12 2023-05-13     12      13.142857
13 2023-05-14     15      13.428571
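
A window of 7 rows equals 7 days only when there is exactly one row per day. If some dates could be missing, a time-based window is one option: rolling('7D') on a datetime index averages whatever rows fall within the 7 days ending at the current date. The sketch below uses made-up data with gaps, and the names sales and rolling_mean are illustrative.

```python
import pandas as pd

# Hypothetical sales data with missing days
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-05-01', '2023-05-02', '2023-05-05', '2023-05-09']),
    'Sales': [10, 12, 9, 16],
})

# With a datetime index, '7D' selects rows in the 7 days up to and including the current date
rolling_mean = sales.set_index('Date')['Sales'].rolling('7D').mean()
sales['MovingAverage'] = rolling_mean.to_numpy()
print(sales)
```

For the last row (2023-05-09) the window reaches back to 2023-05-03, so only the sales of 9 and 16 are averaged.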


# Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.
## For example, if df contains the following values:

|   | Date       |
|---|------------|
| 0 | 2023-01-01 |
| 1 | 2023-01-02 |
| 2 | 2023-01-03 |
| 3 | 2023-01-04 |
| 4 | 2023-01-05 |

## Your function should create the following DataFrame:

|   | Date       | Weekday   |
|---|------------|-----------|
| 0 | 2023-01-01 | Sunday    |
| 1 | 2023-01-02 | Monday    |
| 2 | 2023-01-03 | Tuesday   |
| 3 | 2023-01-04 | Wednesday |
| 4 | 2023-01-05 | Thursday  |

## The function should return the modified DataFrame.

In [27]:
import pandas as pd

# Creating a sample DataFrame
data = {
    'Date': ['2023-05-01', '2023-05-02', '2023-05-03', '2023-05-04', '2023-05-05', '2023-05-06', '2023-05-07', '2023-05-08', '2023-05-09', '2023-05-10', '2023-05-11', '2023-05-12', '2023-05-13', '2023-05-14'],
        }


df = pd.DataFrame(data)
print("Original DataFrame:\n")
print (df)

def add_weekday_column(df):
    df['Date'] = pd.to_datetime(df['Date'])
    # dt.day_name() returns the full weekday name (Monday, Tuesday, ...) for each date
    df['Weekday'] = df['Date'].dt.day_name()
    return df

new_df = add_weekday_column(df)
print("New DataFrame with weekdays :\n")
print(new_df)


Original DataFrame:

          Date
0   2023-05-01
1   2023-05-02
2   2023-05-03
3   2023-05-04
4   2023-05-05
5   2023-05-06
6   2023-05-07
7   2023-05-08
8   2023-05-09
9   2023-05-10
10  2023-05-11
11  2023-05-12
12  2023-05-13
13  2023-05-14
New DataFrame with weekdays :

         Date    Weekday
0  2023-05-01     Monday
1  2023-05-02    Tuesday
2  2023-05-03  Wednesday
3  2023-05-04   Thursday
4  2023-05-05     Friday
5  2023-05-06   Saturday
6  2023-05-07     Sunday
7  2023-05-08     Monday
8  2023-05-09    Tuesday
9  2023-05-10  Wednesday
10 2023-05-11   Thursday
11 2023-05-12     Friday
12 2023-05-13   Saturday
13 2023-05-14     Sunday


# Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

# Ans :

In [1]:
import pandas as pd

# Create a DataFrame with a 'Date' column of timestamps
df = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', end='2023-12-31', freq='H')
})

# Print the DataFrame
print("Original DataFrame:\n")
print(df)

# Function to select rows between two dates
def select_rows_between_dates(df, start_date, end_date):
    return df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

# Select rows between '2023-01-01' and '2023-01-31'
selected_rows = select_rows_between_dates(df, '2023-01-01', '2023-01-31')

# Print the selected rows
print("\nSelected rows:")
print(selected_rows)

Original DataFrame:

                    Date
0    2023-01-01 00:00:00
1    2023-01-01 01:00:00
2    2023-01-01 02:00:00
3    2023-01-01 03:00:00
4    2023-01-01 04:00:00
...                  ...
8732 2023-12-30 20:00:00
8733 2023-12-30 21:00:00
8734 2023-12-30 22:00:00
8735 2023-12-30 23:00:00
8736 2023-12-31 00:00:00

[8737 rows x 1 columns]

Selected rows:
                   Date
0   2023-01-01 00:00:00
1   2023-01-01 01:00:00
2   2023-01-01 02:00:00
3   2023-01-01 03:00:00
4   2023-01-01 04:00:00
..                  ...
716 2023-01-30 20:00:00
717 2023-01-30 21:00:00
718 2023-01-30 22:00:00
719 2023-01-30 23:00:00
720 2023-01-31 00:00:00

[721 rows x 1 columns]
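
Notice that the selection above stops at 2023-01-31 00:00:00. The string '2023-01-31' is compared as midnight at the start of that day, so the remaining hours of January 31 are left out. If the whole end day should be included, one possible approach is to compare against the start of the following day. The function name select_whole_days and the 6-hour sample data are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical timestamps every 6 hours from Jan 1 through the start of Feb 2
stamps = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', end='2023-02-02', freq=pd.Timedelta(hours=6))
})

def select_whole_days(df, start_date, end_date):
    start = pd.Timestamp(start_date)
    # Use an exclusive upper bound one day after end_date so every hour of end_date is kept
    end_exclusive = pd.Timestamp(end_date) + pd.Timedelta(days=1)
    return df[(df['Date'] >= start) & (df['Date'] < end_exclusive)]

january = select_whole_days(stamps, '2023-01-01', '2023-01-31')
print(january.tail())
```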


# Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

# Ans :

The first library that needs to be imported to use the basic functions of pandas is the pandas library itself. By convention it is imported under the alias pd, which is the name used in every example above.
Here's an example of how to import the pandas library:

In [2]:
import pandas as pd
