# Pandas Advanced Assignment

### Q1. List any five functions of the pandas library with execution.

Five commonly used pandas functions, each with an example run:

In [1]:
import pandas as pd

# DataFrames
data1 = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Salary': [70000, 80000, 50000, 120000, 90000]}
df2 = pd.DataFrame(data2)

1. **`head()`**: This function is used to return the first n rows of a DataFrame.

In [2]:
# Display the first 3 rows of df1
print(df1.head(3))

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


2. **`info()`**: This function prints a concise summary of a DataFrame.

In [3]:
# Print concise summary of df1
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes


3. **`describe()`**: This function generates descriptive statistics of a DataFrame.

In [4]:
# Generate descriptive statistics of df1
print(df1.describe())

             Age
count   5.000000
mean   26.800000
std     3.962323
min    22.000000
25%    24.000000
50%    27.000000
75%    29.000000
max    32.000000


4. **`groupby()`**: This function is used for grouping data and performing aggregate operations.

In [8]:
# Group by 'City' and calculate mean age
grouped_df = df1.groupby('City')['Age'].mean()
print(grouped_df)

City
Chicago        22.0
Houston        32.0
Los Angeles    27.0
New York       24.0
Phoenix        29.0
Name: Age, dtype: float64


5. **`merge()`**: This function is used to merge two DataFrames based on a key.

In [6]:
# Merge df1 and df2 on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

      Name  Age         City  Salary
0    Alice   24     New York   70000
1      Bob   27  Los Angeles   80000
2  Charlie   22      Chicago   50000
3    David   32      Houston  120000
4      Eva   29      Phoenix   90000


### Q2. Given a Pandas DataFrame df with columns 'A', 'B', and 'C', write a Python function to re-index the DataFrame with a new index that starts from 1 and increments by 2 for each row.

In [9]:
import pandas as pd

def reindex_dataframe(df):
    # Create a new index starting from 1 and incrementing by 2
    new_index = range(1, 2 * len(df) + 1, 2)
    
    # Set the new index to the DataFrame
    df.index = new_index
    
    return df

# DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']}
df = pd.DataFrame(data)

# Reindex the DataFrame
reindexed_df = reindex_dataframe(df)
print(reindexed_df)


      Name  Age         City
1    Alice   24     New York
3      Bob   27  Los Angeles
5  Charlie   22      Chicago
7    David   32      Houston
9      Eva   29      Phoenix
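
Note that `reindex_dataframe` assigns to `df.index`, so it also changes the DataFrame passed in. If that side effect is unwanted, one possible variant is sketched below; `reindex_copy` is a hypothetical name, and it relies on `set_axis` returning a new object.

```python
import pandas as pd

def reindex_copy(df):
    # Hypothetical helper: build the odd-numbered index 1, 3, 5, ...
    new_index = pd.RangeIndex(start=1, stop=2 * len(df) + 1, step=2)
    # set_axis returns a new DataFrame and leaves the input untouched
    return df.set_axis(new_index, axis=0)

original = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
result = reindex_copy(original)

print(list(result.index))    # [1, 3, 5]
print(list(original.index))  # [0, 1, 2]
```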


### Q3. You have a Pandas DataFrame df with a column named 'Values'. Write a Python function that iterates over the DataFrame and calculates the sum of the first three values in the 'Values' column. The function should print the sum to the console.

#### For example, if the 'Values' column of df contains the values [10, 20, 30, 40, 50], your function should calculate and print the sum of the first three values, which is 60.

In [18]:
import pandas as pd

def Sum_of_Values(df):
    # Sum the first three values; 'total' avoids shadowing the built-in sum()
    total = df.head(3)['Values'].sum()
    
    # Print the result, as the question asks
    print('Sum of the first three values is:', total)
    
    return total

# DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [24, 27, 22, 32, 29],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
        'Values': [10,20,30,40,50]
        }
df = pd.DataFrame(data)

# Call the function, which prints the sum
total = Sum_of_Values(df)


Sum of the first three values is: 60
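
The question asks for a function that iterates over the DataFrame, whereas the solution above uses the vectorized `head(3)` and `sum()`. For comparison, here is a minimal sketch of an explicit loop; `sum_first_three_by_iteration` is a hypothetical name, and the vectorized version is usually the better choice in practice.

```python
import pandas as pd

def sum_first_three_by_iteration(df):
    # Hypothetical helper that walks the rows explicitly
    total = 0
    for position, row in enumerate(df.itertuples(index=False)):
        if position == 3:
            break  # stop after the first three rows
        total += row.Values
    print(total)
    return total

frame = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
result = sum_first_three_by_iteration(frame)
```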


### Q4. Given a Pandas DataFrame df with a column 'Text', write a Python function to create a new column 'Word_Count' that contains the number of words in each row of the 'Text' column.

In [64]:
import pandas as pd

def Word_count(df):
    # Using str.split and str.len to count words
    df['Word_Count'] = df['Text'].str.split().str.len()
    
    return df

# DataFrame
data = {'Text': ['New York City, often called NYC, is the most populous city in the United States.', 'Los Angeles is the second-most populous city in the United States.', 'Chicago, often referred to as the Windy City.', 'Houston boasts a rich cultural scene, with attractions such as the Museum of Fine Arts, the Space Center Houston, and the Sam Houston Monument.', 'Phoenix, Arizona is the capital and most populous city of the U.S. state of Arizona.']}
df = pd.DataFrame(data)

# Add the word-count column and print the result
df = Word_count(df)
print(df)


                                                Text  Word_Count
0  New York City, often called NYC, is the most p...          15
1  Los Angeles is the second-most populous city i...          11
2      Chicago, often referred to as the Windy City.           8
3  Houston boasts a rich cultural scene, with att...          24
4  Phoenix, Arizona is the capital and most popul...          15
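
The sample data has no missing text, but real columns often do. As a sketch under that assumption, the example below uses a small hypothetical frame (`texts`) to show that missing entries come out of the `.str` accessor as `NaN`, and one way to turn them into a count of 0.

```python
import pandas as pd

texts = pd.DataFrame({'Text': ['hello world', None, '', 'one  two   three']})

# Missing entries propagate as NaN through the .str accessor,
# so fill them with 0 before converting the counts to integers
texts['Word_Count'] = (
    texts['Text'].str.split().str.len().fillna(0).astype(int)
)
print(texts)
```

Because `str.split()` with no argument splits on runs of whitespace, repeated spaces do not inflate the count.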


### Q5. How are DataFrame.size() and DataFrame.shape() different?

`DataFrame.size` and `DataFrame.shape` are both attributes, so they are accessed without parentheses (despite the `()` in the question). Both describe the dimensions of a DataFrame, but they return different things:

### `DataFrame.shape`

- **Purpose**: Provides the dimensions of the DataFrame.
- **Output**: Returns a tuple representing the number of rows and columns in the DataFrame.
- **Usage**: Useful for quickly understanding the structure of the DataFrame.

#### Example

In [65]:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Get the shape of the DataFrame
shape = df.shape
print(shape)

(3, 3)


This output indicates that the DataFrame has 3 rows and 3 columns.

### `DataFrame.size`

- **Purpose**: Provides the total number of elements in the DataFrame.
- **Output**: Returns an integer representing the total number of elements (i.e., the product of the number of rows and columns).
- **Usage**: Useful for understanding the total amount of data within the DataFrame.

#### Example

In [66]:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Get the size of the DataFrame
size = df.size
print(size)

9


This output indicates that there are 9 elements in the DataFrame (3 rows * 3 columns).

### Summary of Differences
- **`DataFrame.shape`**: Returns a tuple (number of rows, number of columns).
- **`DataFrame.size`**: Returns an integer (total number of elements).

### Use Case Comparison
- **Use `DataFrame.shape`** when you need to know the exact dimensions of the DataFrame, such as for reshaping or indexing operations.
- **Use `DataFrame.size`** when you need to know the total amount of data, such as for quick sanity checks or understanding the overall data volume.
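
To tie the two attributes together, the short sketch below (using a hypothetical frame named `df_example`) checks that `size` equals the product of the entries in `shape`, shows the same attributes on a single column, and demonstrates that calling either one like a method fails.

```python
import pandas as pd

df_example = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

rows, cols = df_example.shape          # shape is the tuple (3, 2)
print(df_example.size == rows * cols)  # True
print(len(df_example))                 # 3, the number of rows

# A single column is a Series, so its shape has one entry
col = df_example['A']
print(col.shape)  # (3,)
print(col.size)   # 3

# shape and size are attributes; calling them raises TypeError
try:
    df_example.shape()
except TypeError as exc:
    print('TypeError:', exc)
```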

### Q6. Which function of pandas do we use to read an excel file?

To read an Excel file in pandas, you use the `read_excel` function. This function allows you to import data from an Excel file into a pandas DataFrame.

### Syntax
```python
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True, storage_options=None)
```

### Parameters
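The exact parameter list varies between pandas versions, so treat the signature above as a snapshot of one release. Some of the commonly used parameters include: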
- **io**: The file path or object, or URL of the Excel file to be read.
- **sheet_name**: The sheet to read. This can be the sheet name (string), sheet index (integer), or a list of sheet names/indices.
- **header**: Row number(s) to use as the column names. Defaults to the first row (0).
- **names**: List of column names to use.
- **index_col**: Column(s) to set as index (row labels).
- **usecols**: Return a subset of the columns.
- **dtype**: Data type for data or columns.

### Example

In [67]:
import pandas as pd

# Read an Excel file
df = pd.read_excel('players.xlsx')

# Display the DataFrame
print(df)

      Rk          Player Pos Age   Tm   G  GS    MP   FG  FGA  ...   FT%  ORB  \
0      1      Quincy Acy  PF  24  NYK  68  22  1287  152  331  ...  .784   79   
1      2    Jordan Adams  SG  20  MEM  30   0   248   35   86  ...  .609    9   
2      3    Steven Adams   C  21  OKC  70  67  1771  217  399  ...  .502  199   
3      4     Jeff Adrien  PF  28  MIN  17   0   215   19   44  ...  .579   23   
4      5   Arron Afflalo  SG  29  TOT  78  72  2502  375  884  ...  .843   27   
..   ...             ...  ..  ..  ...  ..  ..   ...  ...  ...  ...   ...  ...   
670  490  Thaddeus Young  PF  26  TOT  76  68  2434  451  968  ...  .655  127   
671  490  Thaddeus Young  PF  26  MIN  48  48  1605  289  641  ...  .682   75   
672  490  Thaddeus Young  PF  26  BRK  28  20   829  162  327  ...  .606   52   
673  491     Cody Zeller   C  22  CHO  62  45  1487  172  373  ...  .774   97   
674  492    Tyler Zeller   C  25  BOS  82  59  1731  340  619  ...  .823  146   

     DRB  TRB  AST  STL BLK

### Q7. You have a Pandas DataFrame df that contains a column named 'Email' that contains email addresses in the format 'username@domain.com'. Write a Python function that creates a new column 'Username' in df that contains only the username part of each email address.

#### The username is the part of the email address that appears before the '@' symbol. For example, if the email address is 'john.doe@example.com', the 'Username' column should contain 'john.doe'. Your function should extract the username from each email address and store it in the new 'Username' column.

In [77]:
import pandas as pd

def User_names(df):
    # Using str.split to extract username 
    df['Username'] = df['Email'].str.split('@').str[0]

    return df

# DataFrame
data = {'Email': ['johndoe@example.com',
    'alice.smith@emailprovider.net',
    'webmaster@companyxyz.com',
    'marketing@yourbusiness.org',
    'support@techhelpdesk.com',
    'sales@productco.com',
    'contact@startupventure.io',
    'info@creativeagency.com',
    'admin@mywebsite.org',
    'hrmanager@corporateinc.com',
    'socialmedia@brandingco.net',
    'blogeditor@contenthub.org',
    'customerservice@ecommercestore.com',
    'events@communitycenter.org',
    'freelancer@designstudio.net',
    'newsletter@subscriptionnews.com',
    'techguru@techbloggers.com',
    'booklover@library.org',
    'traveler@adventureexplorers.net',
    'foodie@restaurantreviews.com']}

df = pd.DataFrame(data)

# Add the Username column and print the result
df = User_names(df)
print(df)


                                 Email         Username
0                  johndoe@example.com          johndoe
1        alice.smith@emailprovider.net      alice.smith
2             webmaster@companyxyz.com        webmaster
3           marketing@yourbusiness.org        marketing
4             support@techhelpdesk.com          support
5                  sales@productco.com            sales
6            contact@startupventure.io          contact
7              info@creativeagency.com             info
8                  admin@mywebsite.org            admin
9           hrmanager@corporateinc.com        hrmanager
10          socialmedia@brandingco.net      socialmedia
11           blogeditor@contenthub.org       blogeditor
12  customerservice@ecommercestore.com  customerservice
13          events@communitycenter.org           events
14         freelancer@designstudio.net       freelancer
15     newsletter@subscriptionnews.com       newsletter
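16           techguru@techbloggers.com         techguru
17               booklover@library.org        booklover
18     traveler@adventureexplorers.net         traveler
19        foodie@restaurantreviews.com           foodie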
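
One caveat with `str.split('@').str[0]`: a value that contains no `'@'` is returned unchanged, so a malformed address would be stored as its own username. If such rows should be flagged instead, a possible alternative is `str.extract` with a regular expression, sketched below on a small hypothetical frame (`emails`).

```python
import pandas as pd

emails = pd.DataFrame({'Email': ['john.doe@example.com', 'not-an-email', None]})

# Capture everything before the first '@'; rows without '@' become NaN
emails['Username'] = emails['Email'].str.extract(r'^([^@]+)@', expand=False)
print(emails)
```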

### Q8. You have a Pandas DataFrame df with columns 'A', 'B', and 'C'. Write a Python function that selects all rows where the value in column 'A' is greater than 5 and the value in column 'B' is less than 10. The function should return a new DataFrame that contains only the selected rows.

In [80]:
import pandas as pd

def New_Dataframe(df):
    
    return df[(df['A'] > 5) & (df['B'] < 10) ]

# DataFrame
data = {'A': [3,8,6,2,9],
        'B': [5,2,9,3,1],
        'C': [1,7,4,5,2]
        }

df = pd.DataFrame(data)

# Filter the rows and print the result
df = New_Dataframe(df)
print(df)


   A  B  C
1  8  2  7
2  6  9  4
4  9  1  2
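
The same filter can also be written with `DataFrame.query`, which some readers find easier to scan when several conditions are combined. This is only an alternative spelling of the boolean mask above; `frame`, `by_query` and `by_mask` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'A': [3, 8, 6, 2, 9],
                      'B': [5, 2, 9, 3, 1],
                      'C': [1, 7, 4, 5, 2]})

# Same filter expressed as a query string
by_query = frame.query('A > 5 and B < 10')

# Boolean-mask version for comparison; .copy() makes the result independent
by_mask = frame[(frame['A'] > 5) & (frame['B'] < 10)].copy()

print(by_query.equals(by_mask))
```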


### Q9. Given a Pandas DataFrame df with a column 'Values', write a Python function to calculate the mean, median, and standard deviation of the values in the 'Values' column.

In [83]:
import pandas as pd

def calculate_statistics(df):
    # Calculate mean
    mean_val = df['Values'].mean()
    
    # Calculate median
    median_val = df['Values'].median()
    
    # Calculate standard deviation
    std_val = df['Values'].std()
    
    # Return the results as a dictionary
    return {'mean': mean_val, 'median': median_val, 'std_dev': std_val}

# Example usage
data = {'Values': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calculate the statistics
statistics = calculate_statistics(df)
print(statistics)


{'mean': 30.0, 'median': 30.0, 'std_dev': 15.811388300841896}
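
The standard deviation printed above is the sample standard deviation, because `Series.std()` uses `ddof=1` by default. The sketch below shows how to get the population value with `ddof=0`, and how `agg` can compute all three statistics in one call; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50], name='Values')

# All three statistics in one call
summary = values.agg(['mean', 'median', 'std'])
print(summary)

sample_std = values.std()            # divides by n - 1 (ddof=1, the default)
population_std = values.std(ddof=0)  # divides by n
print(sample_std, population_std)
```

For these five values the squared deviations sum to 1000, so the sample figure is the square root of 250 and the population figure is the square root of 200.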


### Q10. Given a Pandas DataFrame df with a column 'Sales' and a column 'Date', write a Python function to create a new column 'MovingAverage' that contains the moving average of the sales for the past 7 days for each row in the DataFrame. The moving average should be calculated using a window of size 7 and should include the current day.

A moving average is a statistical calculation used to analyze data points by creating a series of averages of different subsets of the full dataset. It is commonly used to smooth out short-term fluctuations and highlight longer-term trends or cycles in time-series data.

The moving average is calculated by taking the average of a specified number of consecutive data points within a sliding window or interval. The size of the window, also known as the period or lag, determines the number of data points included in each average. For example, a 7-day moving average would use the average of the current day and the previous 6 days.

In [85]:
import pandas as pd

def calculate_moving_average(df):
    # Convert 'Date' column to datetime format and set it as index
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    
    # Rolling window over 7 rows (one row per day here); min_periods=1 averages
    # whatever rows are available at the start of the series
    df['MovingAverage'] = df['Sales'].rolling(window=7, min_periods=1).mean()
    
    return df


# Create a DataFrame
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06', '2022-01-07'],
        'Sales': [100, 200, 150, 300, 250, 400, 350]}
df = pd.DataFrame(data)

# Calculate moving average
df = calculate_moving_average(df)

print(df)


            Sales  MovingAverage
Date                            
2022-01-01    100     100.000000
2022-01-02    200     150.000000
2022-01-03    150     150.000000
2022-01-04    300     187.500000
2022-01-05    250     200.000000
2022-01-06    400     233.333333
2022-01-07    350     250.000000
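
`rolling(window=7)` counts rows, which matches "7 days" only when there is exactly one row per day. If dates can be missing, a time-based window such as `rolling('7D')` on a datetime index may be closer to what the question intends. The sketch below compares the two on a hypothetical sales log with gaps.

```python
import pandas as pd

# Hypothetical sales log with missing calendar days
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10', '2022-01-11']),
    'Sales': [100, 200, 300, 500],
}).set_index('Date')

# Row-based: averages up to 7 rows, regardless of the dates
sales['ByRows'] = sales['Sales'].rolling(window=7, min_periods=1).mean()

# Time-based: averages only rows within the 7 days ending at the current row
sales['ByDays'] = sales['Sales'].rolling('7D').mean()
print(sales)
```

On 2022-01-10 the row-based average still includes the two sales from early January, while the time-based average uses only that day's value.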


### Q11. You have a Pandas DataFrame df with a column 'Date'. Write a Python function that creates a new column 'Weekday' in the DataFrame. The 'Weekday' column should contain the weekday name (e.g. Monday, Tuesday) corresponding to each date in the 'Date' column.

In [86]:
import pandas as pd

def Weekday_column(df):
    # Convert 'Date' column to datetime format
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Add a new column 'Weekday' containing the weekday name
    df['Weekday'] = df['Date'].dt.day_name()
    
    return df


# Create a DataFrame
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05']}
df = pd.DataFrame(data)

# Add weekday column
df = Weekday_column(df)

print(df)


        Date    Weekday
0 2023-01-01     Sunday
1 2023-01-02     Monday
2 2023-01-03    Tuesday
3 2023-01-04  Wednesday
4 2023-01-05   Thursday
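
Weekday names are convenient for display but awkward for sorting or filtering. As a small sketch, the numeric `dt.dayofweek` accessor (Monday is 0, Sunday is 6) can sit alongside the name; the `WeekdayNumber` and `IsWeekend` columns are illustrative additions, not part of the question.

```python
import pandas as pd

dates = pd.DataFrame({'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-07'])})

dates['Weekday'] = dates['Date'].dt.day_name()
# dayofweek is numeric: Monday=0 ... Sunday=6
dates['WeekdayNumber'] = dates['Date'].dt.dayofweek
# Saturday (5) and Sunday (6) count as the weekend here
dates['IsWeekend'] = dates['WeekdayNumber'] >= 5
print(dates)
```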


### Q12. Given a Pandas DataFrame df with a column 'Date' that contains timestamps, write a Python function to select all rows where the date is between '2023-01-01' and '2023-01-31'.

In [92]:
import pandas as pd

def select_rows_in_date_range(df, start_date, end_date):
    # Convert 'Date' column to datetime format
    df['Date'] = pd.to_datetime(df['Date'])
    
    # Filter rows where date is between start_date and end_date
    result = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
    
    return result


# Create a DataFrame
data = {'Date': ['2023-01-01', '2023-01-31', '2023-02-15', '2023-01-15', '2023-02-28'],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Define start and end dates
start_date = '2023-01-01'
end_date = '2023-01-31'

# Select rows in date range
df = select_rows_in_date_range(df, start_date, end_date)

print(df)


        Date  Value
0 2023-01-01     10
1 2023-01-31     20
3 2023-01-15     40
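
The two comparisons can also be expressed with `Series.between`, which includes both endpoints by default. The sketch below is one possible variant (`select_between` is a hypothetical name) that parses the dates into a temporary Series, so the caller's `'Date'` column is not converted as a side effect.

```python
import pandas as pd

def select_between(df, start_date, end_date):
    # Hypothetical variant: parse dates without modifying the caller's frame
    dates = pd.to_datetime(df['Date'])
    mask = dates.between(pd.Timestamp(start_date), pd.Timestamp(end_date))
    return df[mask]

events = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-31', '2023-02-15', '2023-01-15', '2023-02-28'],
    'Value': [10, 20, 30, 40, 50],
})
january = select_between(events, '2023-01-01', '2023-01-31')
print(january)
```

If the column holds timestamps with a time of day, an end date of `'2023-01-31'` means midnight at the start of that day, so later times on 31 January would be excluded by either approach.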


### Q13. To use the basic functions of pandas, what is the first and foremost necessary library that needs to be imported?

To use the basic functions of pandas, the first and foremost library that needs to be imported is the pandas library itself. The conventional way to import pandas is by using the import statement in Python:

In [93]:
import pandas as pd

In this statement:
- `import pandas` imports the pandas library.
- `as pd` provides an alias `pd` for the pandas library. This alias is commonly used in pandas code for brevity and readability.

Once pandas is imported, we can then access its functions and classes using the `pd` alias. For example, to create a DataFrame or Series, manipulate data, perform operations, or use any other pandas functionality, we would use the `pd` alias followed by the specific function or method name.