# Q1. 

List any five functions of the pandas library with execution.


## Answer

Here are five functions of the pandas library along with their execution:

**I. `read_csv()`**: This function is used to read a CSV file and convert it into a pandas DataFrame.

```
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('file.csv')

# Display the first 5 rows of the DataFrame
print(df.head())
```
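The snippet above needs a `file.csv` on disk. For a version that runs anywhere, here is a minimal sketch that passes `read_csv()` an in-memory buffer instead of a path; the CSV text and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'file.csv'
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

# read_csv() accepts a file-like object as well as a file path
df = pd.read_csv(io.StringIO(csv_text))

# Display the first 5 rows of the DataFrame
print(df.head())
```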

**II. `info()`**: This function is used to display a summary of a DataFrame, including the data types of each column, the number of non-null values, and the memory usage.

In [20]:
!pip install pandas



In [21]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Display a summary of the DataFrame
# info() prints the summary itself and returns None, so it is not wrapped in print()
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes


**III. `groupby()`**: This function is used to group a DataFrame by one or more columns and apply an aggregation function to each group.

In [22]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
        'Age': [25, 30, 35, 27, 32],
        'Salary': [50000, 60000, 70000, 55000, 65000]}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    Alice   27   55000
4      Bob   32   65000


In [23]:

# Group the DataFrame by 'Name' and calculate the mean salary for each group
grouped = df.groupby('Name')['Salary'].mean()

# Display the result
print(grouped)


Name
Alice      52500.0
Bob        62500.0
Charlie    70000.0
Name: Salary, dtype: float64
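A single `groupby()` can also compute several aggregations at once. The sketch below uses `agg()` with named aggregation on the same sample data; the output column names `mean_salary` and `max_age` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
                   'Age': [25, 30, 35, 27, 32],
                   'Salary': [50000, 60000, 70000, 55000, 65000]})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby('Name').agg(
    mean_salary=('Salary', 'mean'),
    max_age=('Age', 'max'),
)

print(summary)
```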


**IV. `dropna()`**: This function is used to remove rows or columns with missing values from a DataFrame.

In [24]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', np.nan, 'Charlie'],
        'Age': [25, 30, np.nan, 35]}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)


      Name   Age
0    Alice  25.0
1      Bob  30.0
2      NaN   NaN
3  Charlie  35.0


In [25]:
# Drop rows with missing values
df = df.dropna()

# Display the result
print(df)


      Name   Age
0    Alice  25.0
1      Bob  30.0
3  Charlie  35.0
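By default `dropna()` removes every row that has at least one missing value. The `subset` and `axis` parameters narrow or change that behaviour. Here is a small sketch on made-up data (the `ID` column is added only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Name': ['Alice', 'Bob', np.nan, 'Charlie'],
                   'Age': [25, np.nan, np.nan, 35]})

# Default: drop rows with a missing value in any column
complete_rows = df.dropna()

# Only consider the 'Name' column when deciding which rows to drop
named_rows = df.dropna(subset=['Name'])

# axis=1 drops columns instead of rows
complete_columns = df.dropna(axis=1)

print(complete_rows)
print(named_rows)
print(complete_columns)
```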


**V. `merge()`**: This function is used to merge two DataFrames based on a common column.

In [26]:
import pandas as pd

# Create two DataFrames
data1 = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df1 = pd.DataFrame(data1)

# Display the DataFrame 1
print(df1)


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [27]:

data2 = {'Name': ['Alice', 'Bob', 'David'], 'Salary': [50000, 60000, 70000]}
df2 = pd.DataFrame(data2)

# Display the DataFrame 2
print(df2)


    Name  Salary
0  Alice   50000
1    Bob   60000
2  David   70000


In [28]:

# Merge the DataFrames based on the 'Name' column
merged = pd.merge(df1, df2, on='Name')  # Performs an inner join by default, so only names present in both DataFrames are kept

# Display the result
print(merged)


    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000
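The inner join drops Charlie and David because each appears in only one DataFrame. To keep them, the `how` parameter can be changed. A minimal sketch with `how='outer'`, where `indicator=True` adds a `_merge` column showing where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'], 'Salary': [50000, 60000, 70000]})

# Keep every name from either DataFrame; unmatched cells become NaN
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)

print(outer)
```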


# Q2. 

Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

## Answer

Here's a Python function that re-indexes a Pandas DataFrame with a new index that starts from 1 and increments by 2 for each row:

In [29]:
import pandas as pd

def reindex_df(df):
    # Get the number of rows in the DataFrame
    num_rows = df.shape[0]
    
    # Create a new index that starts from 1 and increments by 2
    new_index = pd.Index(range(1, 2*num_rows, 2))
    
    # Assign the new labels directly to the existing rows.
    # df.reindex(new_index) is not suitable here: it looks up the new labels in the
    # old index and fills any label that has no matching row with NaN.
    df.index = new_index
    
    # Return the reindexed DataFrame
    return df

In [30]:
# Create the DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Display the DataFrame
print(df)


   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [31]:
# Reindex the DataFrame
df = reindex_df(df)

# Display the reindexed DataFrame
print(df)


   A  B  C
1  1  4  7
3  2  5  8
5  3  6  9
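To make the difference between relabelling and `reindex()` concrete, here is a short sketch on the same data. It uses `set_axis()`, which returns a relabelled copy, as one possible alternative to assigning `df.index` in place; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
new_index = pd.Index(range(1, 2 * len(df), 2))  # 1, 3, 5

# reindex() looks up the new labels in the existing index (0, 1, 2).
# Only label 1 exists, so the rows for labels 3 and 5 are all NaN.
looked_up = df.reindex(new_index)

# set_axis() attaches the new labels to the rows in order and
# returns a new DataFrame, leaving df unchanged
relabelled = df.set_axis(new_index, axis=0)

print(looked_up)
print(relabelled)
```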


# Q3. 

You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should calculate and print the sum of the first three values, which is 60.



## Answer

Here's a Python function that iterates over a Pandas DataFrame and calculates the sum of the first three values in the 'Values' column:

In [32]:
import pandas as pd

def sum_first_three_values(df):
    # Initialize a variable to store the sum
    total_sum = 0
    
    # Iterate over the first three rows of the DataFrame
    for index, row in df.head(3).iterrows():
        # Add the 'Values' column of the current row to the sum
        total_sum += row['Values']
    
    # Print the sum to the console
    print("The sum of the first three values in the 'Values' column is:", total_sum)


In [33]:
# Create the DataFrame
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Display the DataFrame
print(df)


   Values
0      10
1      20
2      30
3      40
4      50


In [34]:
# Call the function
sum_first_three_values(df)


The sum of the first three values in the 'Values' column is: 60
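If explicit iteration is not a requirement, the same result can be obtained without a loop. A minimal sketch on the same sample data, selecting the first three entries by position:

```python
import pandas as pd

df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# Slice the first three entries by position and sum them in one step
total = df['Values'].iloc[:3].sum()

print("The sum of the first three values in the 'Values' column is:", total)
```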


# Q4. 

Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.


## Answer

In [35]:
# Create the DataFrame
df = pd.DataFrame({'Text': ['Hello, world!', 'This is a sentence.', 'Python is fun.']})

# Display the DataFrame
print(df)


                  Text
0        Hello, world!
1  This is a sentence.
2       Python is fun.


In [36]:
# Add the 'Word_Count' column
df['Word_Count'] = df['Text'].apply(lambda x: len(x.split()))

# Display the modified DataFrame
print(df)

                  Text  Word_Count
0        Hello, world!           2
1  This is a sentence.           4
2       Python is fun.           3
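The question asks for a function, and the `apply` version above raises an error if a row holds a missing value instead of a string. The sketch below wraps the logic in a function using the `.str` accessor and counts missing text as zero words; the name `add_word_count` and that zero-word convention are assumptions for illustration.

```python
import pandas as pd

def add_word_count(df):
    # .str.split() splits on whitespace; .str.len() counts the pieces.
    # Missing text yields NaN, which is replaced by 0 here.
    counts = df['Text'].str.split().str.len()
    df['Word_Count'] = counts.fillna(0).astype(int)
    return df

df = pd.DataFrame({'Text': ['Hello, world!', None, 'Python is fun.']})
df = add_word_count(df)
print(df)
```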


# Q5. 

How are DataFrame.size() and DataFrame.shape() different?


## Answer

`DataFrame.size` and `DataFrame.shape` are attributes (properties) of a DataFrame, not methods, so they are accessed without parentheses. Both describe the dimensions of a DataFrame, but they return different things.

`DataFrame.size` returns the total number of elements in a DataFrame as a single integer, equal to the number of rows multiplied by the number of columns.

`DataFrame.shape`, on the other hand, returns a tuple of two values: the number of rows and the number of columns, in that order.

Here's an example to illustrate the difference between `DataFrame.size` and `DataFrame.shape`:

In [37]:
import pandas as pd

# Create a DataFrame with 3 rows and 2 columns
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Display the DataFrame
print(df)


   A  B
0  1  4
1  2  5
2  3  6


In [38]:
# Use the size attribute to get the total number of elements
print("Total number of elements in DataFrame: ", df.size)

# Use the shape attribute to get the dimensions of the DataFrame
print("Dimensions of DataFrame: ", df.shape)


Total number of elements in DataFrame:  6
Dimensions of DataFrame:  (3, 2)


As you can see, `df.size` returns a single integer representing the total number of elements in the DataFrame, while `df.shape` returns a tuple representing the dimensions of the DataFrame.
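Because `size` and `shape` are attributes, writing them with parentheses fails. The short sketch below checks the relationship between the two and shows the error raised by `df.size()`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# shape unpacks into (rows, columns); size is their product
rows, cols = df.shape
print(rows, cols, df.size)

# Calling the attribute like a method raises a TypeError,
# because the integer it returns is not callable
error_message = None
try:
    df.size()
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```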

# Q6. 

Which function of pandas do we use to read an excel file?


## Answer

To read an Excel file in Pandas, you can use the `read_excel()` function, which is a part of the Pandas library.
Here's an example of how to use `read_excel()` to read an Excel file named `example.xlsx`:

```
import pandas as pd

# read the Excel file
df = pd.read_excel('example.xlsx')

# display the DataFrame
print(df)
```

This code will read the contents of the Excel file `example.xlsx` and store it in a Pandas DataFrame called `df`. You can then perform operations on the DataFrame as needed.

By default, `read_excel()` reads the first sheet in the Excel file, but you can also specify the sheet name or index to read using the `sheet_name` parameter. You can also specify additional options such as which rows and columns to skip, whether to use the first row as column headers, and more.

Here's an example of how to read the second sheet in an Excel file named `example.xlsx` using `read_excel()`:

```
import pandas as pd

# read the second sheet of the Excel file
df = pd.read_excel('example.xlsx', sheet_name=1)

# display the DataFrame
print(df)
```

This code will read the second sheet in the Excel file `example.xlsx` and store it in a Pandas DataFrame called `df`.

# Q7. 

You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.


## Answer

You can use the `.str` accessor of a pandas Series to apply string methods to each element, including `.split()` to split each email address into a list of strings at the '@' character. Indexing with `.str[0]` then selects the first element of each list, which is the username.

Here is an example function that accomplishes this task:

In [39]:
import pandas as pd

def extract_username(df):
    df['Username'] = df['Email'].str.split('@').str[0]
    return df

You can call this function on your dataframe `df` to create the new 'Username' column. For example:

In [40]:
df = pd.DataFrame({'Email': ['john.doe@example.com', 'jane.doe@example.com']})
df

                  Email
0  john.doe@example.com
1  jane.doe@example.com


In [41]:
df = extract_username(df)
print(df)

                  Email  Username
0  john.doe@example.com  john.doe
1  jane.doe@example.com  jane.doe


Note that in this example, the function assumes that the 'Email' column is already present in `df`.

# Q8. 

You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects
all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The
function should return a new DataFrame that contains only the selected rows.

For example, if df contains the following values:


|   | A | B | C |
| - | - | - | - |
| 0 | 3 | 5 | 1 |
| 1 | 8 | 2 | 7 |
| 2 | 6 | 9 | 4 |
| 3 | 2 | 3 | 5 |
| 4 | 9 | 1 | 2 |


Your function should select the following rows: 

|   | A | B | C |
| - | - | - | - |
| 1 | 8 | 2 | 7 |
| 2 | 6 | 9 | 4 |
| 4 | 9 | 1 | 2 |




## Answer

You can use boolean indexing to select rows that satisfy the conditions specified in the problem statement. Here is an example function that accomplishes this task:

In [42]:
import pandas as pd

def select_rows(df):
    selected_rows = df[(df['A'] > 5) & (df['B'] < 10)]
    return selected_rows


In [43]:
df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})
# Display the DataFrame
print(df)


   A  B  C
0  3  5  1
1  8  2  7
2  6  9  4
3  2  3  5
4  9  1  2


You can call this function on your DataFrame `df` to select the desired rows. For example:

In [44]:
selected_rows = select_rows(df)
print(selected_rows)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
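In the boolean-indexing version, the parentheses around each comparison are required because `&` binds more tightly than `>` and `<`. As an alternative sketch, `DataFrame.query()` expresses the same filter as a string and avoids that pitfall:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                   'B': [5, 2, 9, 3, 1],
                   'C': [1, 7, 4, 5, 2]})

# Column names are referenced directly inside the query string
selected = df.query('A > 5 and B < 10')

print(selected)
```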


# Q9. 

Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.


## Answer

You can use the `mean()`, `median()`, and `std()` methods of a pandas Series to calculate the mean, median, and standard deviation of the values in the 'Values' column of your DataFrame. Note that `std()` computes the sample standard deviation (`ddof=1`) by default. Here is an example function that accomplishes this task:

In [45]:
import pandas as pd

def calculate_stats(df):
    mean = df['Values'].mean()
    median = df['Values'].median()
    std_dev = df['Values'].std()
    return mean, median, std_dev


You can call this function on your DataFrame `df` to calculate the mean, median, and standard deviation of the 'Values' column. For example:

In [46]:
df = pd.DataFrame({'Values': [1, 2, 3, 4, 5]})
mean, median, std_dev = calculate_stats(df)
print("Mean: ", mean)
print("Median: ", median)
print("Standard Deviation: ", std_dev)


Mean:  3.0
Median:  3.0
Standard Deviation:  1.5811388300841898
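The value 1.58 is the sample standard deviation, which divides by n - 1. If the population standard deviation (dividing by n) is wanted instead, `ddof=0` can be passed. A small sketch comparing the two on the same values:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Default ddof=1: sqrt(10 / 4)
sample_std = values.std()

# ddof=0: sqrt(10 / 5)
population_std = values.std(ddof=0)

print("Sample standard deviation: ", sample_std)
print("Population standard deviation: ", population_std)
```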


# Q10. 

Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.


## Answer

You can use the `rolling()` method of the pandas DataFrame to calculate the moving average of the sales for the past 7 days. Here is an example function that accomplishes this task:


In [47]:
import pandas as pd

def calculate_moving_average(df):
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    return df

You can call this function on your DataFrame `df` to calculate the moving average of the sales for the past 7 days and store the result in a new column called 'MovingAverage'. For example:

In [48]:
df = pd.DataFrame({'Date': pd.date_range(start='2022-01-01', periods=10),
                   'Sales': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Display the DataFrame
print(df)


        Date  Sales
0 2022-01-01     10
1 2022-01-02     20
2 2022-01-03     30
3 2022-01-04     40
4 2022-01-05     50
5 2022-01-06     60
6 2022-01-07     70
7 2022-01-08     80
8 2022-01-09     90
9 2022-01-10    100


In [49]:
df = calculate_moving_average(df)
print(df)


        Date  Sales  MovingAverage
0 2022-01-01     10           10.0
1 2022-01-02     20           15.0
2 2022-01-03     30           20.0
3 2022-01-04     40           25.0
4 2022-01-05     50           30.0
5 2022-01-06     60           35.0
6 2022-01-07     70           40.0
7 2022-01-08     80           50.0
8 2022-01-09     90           60.0
9 2022-01-10    100           70.0


In [50]:
# Row 6: (10+20+30+40+50+60+70)/7 = 40.0
# Row 7: (20+30+40+50+60+70+80)/7 = 50.0
# Row 9: (40+50+60+70+80+90+100)/7 = 70.0

Note that in this example, the function assumes that the 'Sales' column is already present in `df` and that `df` holds one row per day, sorted by 'Date'. This matters because `rolling(window=7)` counts rows, not calendar days. Also note that the `min_periods` parameter in the `rolling()` method is set to 1 so that the first six rows are averaged over however many values are available instead of producing NaN.
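If some days are missing from the data, a row-based window of 7 reaches further back than seven days. One way to handle this is a time-based window. The sketch below is an illustration on made-up data with a gap, using `'7D'` as the window on a datetime index; it is an alternative approach, not the method used in the answer above.

```python
import pandas as pd

# Hypothetical data with a gap: no rows from 2022-01-03 to 2022-01-08
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-09', '2022-01-10']),
    'Sales': [10, 20, 30, 40],
})

# With a datetime index, '7D' averages the rows whose date falls within
# the 7 days ending at (and including) the current row's date
rolling_mean = df.set_index('Date')['Sales'].rolling('7D').mean()

# to_numpy() drops the date index so the values align with df's rows by position
df['MovingAverage'] = rolling_mean.to_numpy()

print(df)
```

On 2022-01-09 only that day's row falls inside the window, so the average is 30 rather than a mix with the early January values.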

# Q11. 

You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column. For example, if df contains the following values:


|   | Date |
| - | - |
| 0 | 2023-01-01 |
| 1 | 2023-01-02 |
| 2 | 2023-01-03 |
| 3 | 2023-01-04 |
| 4 | 2023-01-05 |


Your function should create the following DataFrame:


|   | Date | Weekday |
| - | - | - |
| 0 | 2023-01-01 | Sunday |
| 1 | 2023-01-02 | Monday |
| 2 | 2023-01-03 | Tuesday |
| 3 | 2023-01-04 | Wednesday |
| 4 | 2023-01-05 | Thursday |


The function should return the modified DataFrame.

## Answer

In [51]:
import pandas as pd

def add_weekday(df):
    df['Weekday'] = pd.to_datetime(df['Date']).dt.day_name()
    return df


In [52]:
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']})
print(df)

         Date
0  2023-01-01
1  2023-01-02
2  2023-01-03
3  2023-01-04
4  2023-01-05


In [53]:
df = add_weekday(df)
print(df)


         Date    Weekday
0  2023-01-01     Sunday
1  2023-01-02     Monday
2  2023-01-03    Tuesday
3  2023-01-04  Wednesday
4  2023-01-05   Thursday


# Q12. 

Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.


## Answer

In [54]:
import pandas as pd

def select_january(df):
    start_date = '2023-01-01'
    end_date = '2023-01-31'
    mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
    return df.loc[mask]


In [55]:
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-15', '2023-01-31', '2023-02-01', '2023-02-15']})
print(df)


         Date
0  2023-01-01
1  2023-01-15
2  2023-01-31
3  2023-02-01
4  2023-02-15


In [56]:
df['Date'] = pd.to_datetime(df['Date'])
january_rows = select_january(df)
print(january_rows)


        Date
0 2023-01-01
1 2023-01-15
2 2023-01-31
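The question mentions timestamps. When the 'Date' values carry a time of day, the end date `'2023-01-31'` is interpreted as midnight at the start of that day, so later times on 31 January are excluded by `<=`. A sketch of one way around this, using a half-open interval that ends at 1 February (the sample timestamps are made up):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime([
    '2023-01-01 00:00', '2023-01-31 18:30', '2023-02-01 00:00',
])})

start = pd.Timestamp('2023-01-01')
end = pd.Timestamp('2023-01-31')
next_month = pd.Timestamp('2023-02-01')

# Closed interval ending at midnight on 31 January: misses 18:30 that day
closed_mask = (df['Date'] >= start) & (df['Date'] <= end)

# Half-open interval [1 January, 1 February): keeps all of 31 January
half_open_mask = (df['Date'] >= start) & (df['Date'] < next_month)

print(df.loc[closed_mask])
print(df.loc[half_open_mask])
```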


# Q13. 

To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

## Answer

The first and foremost necessary library to use the basic functions of pandas is the pandas library itself.

To import the pandas library, you can use the following code:

In [57]:
import pandas as pd


This will import the pandas library and allow you to use its various functions and methods for data manipulation and analysis, such as reading data from different file formats, filtering data, and performing calculations on data.

Note that you may also need to import other libraries depending on the specific functions and methods you are using in pandas. For example, if you want to use the `matplotlib` library to create plots based on data in a pandas DataFrame, you would need to import `matplotlib` in addition to `pandas`.
