#### Q1. List any five functions of the pandas library with execution.

 * read_csv() - reads a CSV file into a pandas DataFrame

In [None]:
import pandas as pd
data = pd.read_csv('data.csv')


 * head() - displays the first n rows of a DataFrame (5 by default)

In [None]:
import pandas as pd
data = pd.read_csv('data.csv')
print(data.head(10)) # display the first 10 rows of data


 * describe() - provides summary statistics for a DataFrame

In [None]:
import pandas as pd
data = pd.read_csv('data.csv')
print(data.describe()) # provides summary statistics for all columns in data


 * groupby() - groups a DataFrame by one or more columns so that an aggregation function can be applied to each group

In [None]:
import pandas as pd
data = pd.read_csv('data.csv')
grouped_data = data.groupby(['column1', 'column2']).sum()
print(grouped_data) # groups data by 'column1' and 'column2' and sums the values of other columns in each group


 * plot() - plots data in a pandas DataFrame

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
data.plot(x='column1', y='column2', kind='scatter') # scatter plot of data with 'column1' on the x-axis and 'column2' on the y-axis
plt.show()
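
The cells above depend on a 'data.csv' file that is not part of this notebook. As a minimal self-contained sketch, the same calls can be run against an in-memory CSV; the column names (city, year, sales) and the values are hypothetical, and plot() is left out so that the cell does not need matplotlib.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """city,year,sales
Paris,2021,10
Paris,2022,20
Rome,2021,30
Rome,2022,40
"""

# read_csv() accepts a file-like object as well as a path
data = pd.read_csv(io.StringIO(csv_text))

first_rows = data.head(2)                      # first 2 rows
summary = data.describe()                      # summary statistics of the numeric columns
totals = data.groupby('city')['sales'].sum()   # total sales per city

print(first_rows)
print(summary)
print(totals)
```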


#### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [1]:
import pandas as pd

def reindex_df(df):
    new_index = pd.Index(range(1, 2*len(df)+1, 2))
    df = df.set_index(new_index)
    return df

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = reindex_df(df)
print(df)


   A  B  C
1  1  4  7
3  2  5  8
5  3  6  9
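
An equivalent sketch, assuming the goal is only to relabel the rows: a pd.RangeIndex can be assigned to a copy of the DataFrame, which leaves the caller's object untouched. The function name reindex_with_range is hypothetical.

```python
import pandas as pd

def reindex_with_range(df):
    # Hypothetical helper: label the rows 1, 3, 5, ... on a copy of df
    result = df.copy()
    result.index = pd.RangeIndex(start=1, stop=2 * len(df) + 1, step=2)
    return result

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
reindexed = reindex_with_range(df)
print(reindexed.index.tolist())
```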


#### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

#### For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should calculate and print the sum of the first three values, which is 60.

In [2]:
def sum_first_three_values(df):
    total = 0
    for position, (_, row) in enumerate(df.iterrows()):
        if position >= 3: # stop after the first three rows, whatever the index labels are
            break
        total += row['Values']
    print("The sum of the first three values is:", total)

import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
sum_first_three_values(df)


The sum of the first three values is: 60
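
If explicit iteration is not required, a vectorized sketch gives the same result. It selects by position with iloc, so it does not depend on the index labels; the function name sum_first_three_vectorized and the custom index below are illustrative assumptions.

```python
import pandas as pd

def sum_first_three_vectorized(df):
    # iloc selects by position, so the index labels do not matter
    total = df['Values'].iloc[:3].sum()
    print("The sum of the first three values is:", total)
    return total

# A non-default index, to show that position (not label) decides which rows count
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]}, index=[100, 200, 300, 400, 500])
result = sum_first_three_vectorized(df)
```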


#### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [3]:
import pandas as pd

def add_word_count_column(df):
    df['Word_Count'] = df['Text'].apply(lambda x: len(str(x).split()))
    return df

df = pd.DataFrame({'Text': ['This is a sentence', 'This is another sentence', 'And this is a third sentence']})
df = add_word_count_column(df)
print(df)


                           Text  Word_Count
0            This is a sentence           4
1      This is another sentence           4
2  And this is a third sentence           6
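
Note that str(x) turns a missing value into the text 'None' or 'nan', so the function above counts a missing entry as one word. A possible variant, assuming missing text should count as zero words, uses the vectorized .str accessor; the name add_word_count_vectorized is hypothetical.

```python
import pandas as pd

def add_word_count_vectorized(df):
    result = df.copy()
    # Treat missing text as an empty string so that it counts as 0 words
    result['Word_Count'] = result['Text'].fillna('').str.split().str.len()
    return result

df = pd.DataFrame({'Text': ['This is a sentence', None, 'And this is a third sentence']})
counted = add_word_count_vectorized(df)
print(counted)
```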


#### Q5. How are DataFrame.size() and DataFrame.shape() different?

DataFrame.size and DataFrame.shape are both attributes (properties) of the Pandas DataFrame object, not methods, so they are accessed without parentheses. They return different values.

DataFrame.size returns the total number of elements in the DataFrame, which is equivalent to the number of rows multiplied by the number of columns. This includes all elements, including NaN or null values.

On the other hand, DataFrame.shape returns a tuple representing the dimensions of the DataFrame, i.e., the number of rows and columns. So, DataFrame.shape will return the values (n_rows, n_cols), where n_rows is the number of rows in the DataFrame, and n_cols is the number of columns in the DataFrame.

In [4]:
# Example

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.size)
print(df.shape)


6
(3, 2)
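
To illustrate the point about missing values, the following sketch (with made-up data) shows that size still equals rows times columns when some cells are NaN, whereas count() reports only the non-missing cells.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

n_rows, n_cols = df.shape
non_missing = int(df.count().sum())

print(df.size)          # total number of cells, missing values included
print(n_rows * n_cols)  # same number, computed from shape
print(non_missing)      # number of non-missing cells only
```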


#### Q6. Which function of pandas do we use to read an excel file?

In Pandas, we use the read_excel() function to read an Excel file into a DataFrame.

In [None]:
# Example

import pandas as pd

# Read Excel file into a DataFrame
df = pd.read_excel('example.xlsx')


#### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address. The username is the part of the email address that appears before the '@' symbol.

#### For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

In [None]:
import pandas as pd

def extract_username(df):
    # Extract the username from each email address in the 'Email' column
    usernames = df['Email'].apply(lambda x: x.split('@')[0])
    
    # Create a new column 'Username' in the DataFrame and assign the usernames to it
    df['Username'] = usernames
    
    # Return the updated DataFrame
    return df
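
The cell above defines the function but does not call it. A short usage sketch follows, written with the vectorized .str accessor so that a missing email stays missing instead of raising an error; the function name extract_username_vectorized and the sample addresses are assumptions.

```python
import pandas as pd

def extract_username_vectorized(df):
    result = df.copy()
    # Split each address on '@' and keep the first part; missing values stay missing
    result['Username'] = result['Email'].str.split('@').str[0]
    return result

df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane@example.org', None]})
with_usernames = extract_username_vectorized(df)
print(with_usernames)
```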


#### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

#### For example, if df contains the following values:

   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2

#### Your function should select the following rows: 

   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2

#### The function should return a new DataFrame that contains only the selected rows.

In [5]:
import pandas as pd

def select_rows(df):
    # Use boolean indexing to select rows where column 'A' > 5 and column 'B' < 10
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows

# Create the example DataFrame
df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})

# Call the select_rows function to select rows where column 'A' > 5 and column 'B' < 10
selected_df = select_rows(df)

# Print the selected DataFrame
print(selected_df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
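
The same filter can also be written with DataFrame.query(), which some readers find easier to scan when there are several conditions. This is an alternative sketch, not a replacement for boolean indexing; the name select_rows_query is hypothetical.

```python
import pandas as pd

def select_rows_query(df):
    # query() evaluates the expression against the column names and returns the matching rows
    return df.query('A > 5 and B < 10')

df = pd.DataFrame({'A': [3, 8, 6, 2, 9], 'B': [5, 2, 9, 3, 1], 'C': [1, 7, 4, 5, 2]})
selected = select_rows_query(df)
print(selected)
```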


#### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [6]:
import pandas as pd

def calculate_statistics(df):
    # Calculate mean, median, and standard deviation of values in 'Values' column
    mean_value = df['Values'].mean()
    median_value = df['Values'].median()
    std_value = df['Values'].std()

    # Print results
    print('Mean:', mean_value)
    print('Median:', median_value)
    print('Standard deviation:', std_value)

# Create example DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Call the calculate_statistics function to calculate mean, median, and standard deviation
calculate_statistics(df)


Mean: 30.0
Median: 30.0
Standard deviation: 15.811388300841896
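
The standard deviation printed above is the sample standard deviation, because Series.std() uses ddof=1 by default. If the population standard deviation is wanted instead, ddof=0 can be passed, as in this small sketch:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

sample_std = values.std()            # ddof=1 by default: divides by n - 1
population_std = values.std(ddof=0)  # divides by n

print('Sample standard deviation:', sample_std)
print('Population standard deviation:', population_std)
```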


#### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

In [7]:
import pandas as pd

def add_moving_average(df):
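    # Calculate moving average of 'Sales' over a window of 7 rows (assumes one row per day, sorted by date);
    # min_periods=1 averages whatever rows are available at the start of the series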
    ma = df['Sales'].rolling(window=7, min_periods=1).mean()

    # Add 'MovingAverage' column to DataFrame
    df['MovingAverage'] = ma

    # Return modified DataFrame
    return df

# Create example DataFrame
df = pd.DataFrame({'Sales': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
                   'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05',
                            '2022-01-06', '2022-01-07', '2022-01-08', '2022-01-09', '2022-01-10']})

# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Call the add_moving_average function to add 'MovingAverage' column to DataFrame
df = add_moving_average(df)

# Print modified DataFrame with 'MovingAverage' column
print(df)


   Sales       Date  MovingAverage
0     10 2022-01-01           10.0
1     20 2022-01-02           15.0
2     30 2022-01-03           20.0
3     40 2022-01-04           25.0
4     50 2022-01-05           30.0
5     60 2022-01-06           35.0
6     70 2022-01-07           40.0
7     80 2022-01-08           50.0
8     90 2022-01-09           60.0
9    100 2022-01-10           70.0
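
rolling(window=7) counts rows, so it matches "the past 7 days" only when there is exactly one row per day. If dates can be missing, a time-based window is one option. The sketch below assumes 'Date' is already a datetime column; the function name and the three-row sample with a gap are hypothetical.

```python
import pandas as pd

def add_time_based_moving_average(df):
    # Time-based windows need the dates in increasing order
    out = df.sort_values('Date').copy()
    # '7D' covers the 7 calendar days ending at (and including) each row's date
    ma = out.set_index('Date')['Sales'].rolling('7D').mean()
    out['MovingAverage'] = ma.to_numpy()
    return out

# Sample data with a gap between 2022-01-02 and 2022-01-10
df = pd.DataFrame({'Sales': [10, 20, 30],
                   'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10'])})
result = add_time_based_moving_average(df)
print(result)
```

For the last row, the time-based window contains only the sale on 2022-01-10, whereas a 7-row window would also average in the two sales from more than a week earlier.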


#### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

#### For example, if df contains the following values:

        Date
0 2023-01-01
1 2023-01-02
2 2023-01-03
3 2023-01-04
4 2023-01-05

#### Your function should create the following DataFrame:


        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday

#### The function should return the modified DataFrame.

In [8]:
import pandas as pd

def add_weekday_column(df):
    df['Weekday'] = df['Date'].dt.day_name()
    return df

# Example usage
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']})
df['Date'] = pd.to_datetime(df['Date'])

df = add_weekday_column(df)
print(df)


        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday


#### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [None]:
import pandas as pd

def select_rows_between_dates(df):
    start_date = pd.to_datetime('2023-01-01')
    end_date = pd.to_datetime('2023-01-31')
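    # between() is inclusive on both ends by default; end_date here is midnight at the start of 2023-01-31
    mask = df['Date'].between(start_date, end_date)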
    return df[mask]
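
Because the column holds timestamps, a row such as '2023-01-31 18:30' falls after an end bound of midnight on 2023-01-31. A possible variant, assuming the whole of 31 January should be included, uses a half-open interval that ends just before 1 February. The function name and the sample timestamps are hypothetical.

```python
import pandas as pd

def select_rows_in_january_2023(df):
    start = pd.Timestamp('2023-01-01')
    end_exclusive = pd.Timestamp('2023-02-01')
    # Keep rows with start <= Date < end_exclusive, so every time on 2023-01-31 is included
    mask = (df['Date'] >= start) & (df['Date'] < end_exclusive)
    return df[mask]

df = pd.DataFrame({'Date': pd.to_datetime(['2022-12-31 23:59', '2023-01-01 00:00',
                                           '2023-01-31 18:30', '2023-02-01 00:00'])})
selected = select_rows_in_january_2023(df)
print(selected)
```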


#### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

The library that must be imported before any pandas function can be used is 'pandas' itself. The standard way to import it, using the conventional alias pd, is:

In [9]:
import pandas as pd
