<a href="https://colab.research.google.com/github/bhargav23/AI/blob/master/Lab/APPLAB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas

## What is Pandas?
*   Pandas is a Python library used for working with data sets.
*   It has functions for analyzing, cleaning, exploring, and manipulating data.
*   The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

### Pandas deals with the following two data structures -
*   Series
*   DataFrame

## What is a Series?
*   A Pandas Series is like a column in a table.
*   It is a one-dimensional array holding data of any type.

## What is a DataFrame?
*   A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

## Creating Pandas Series
A pandas Series can be created using the following constructor -

```
pandas.Series( data, index, dtype)

```






### Parameter & Description

**data**
*   data takes various forms like ndarray, list, constants

**index**

*   Index values must be hashable, and the index must have the same length as data. Duplicate labels are allowed. Defaults to np.arange(n) if no index is passed.

**dtype**

*   dtype is for data type. If None, data type will be inferred
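The three parameters can be combined in a single call. The following is a minimal sketch with made-up labels and values (the names `s` and `dup` are illustrative only); it also shows that repeated index labels are accepted.

```python
import pandas as pd

# Explicit index labels and an explicit dtype
s = pd.Series([1, 2, 3], index=['x', 'y', 'z'], dtype='float64')
print(s)

# Duplicate labels are accepted; selecting 'a' returns both matching entries
dup = pd.Series([10, 20, 30], index=['a', 'a', 'b'])
print(dup['a'])
```

Passing dtype forces the integers to be stored as floats rather than leaving the type to inference.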


In [None]:
import pandas as pd
import numpy as np

### Create an Empty Series
*  The most basic Series that can be created is an empty Series.

In [None]:
s = pd.Series(dtype=float)
print(s)

Series([], dtype: float64)


### Create a Series from ndarray
*  If data is an ndarray, then the index passed must be of the same length. If no index is passed, the default index is range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array) - 1].

In [None]:
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

0    a
1    b
2    c
3    d
dtype: object


### Create a Series from dict
*   A dict can be passed as input. If no index is specified, the dictionary keys are used as the index in insertion order (pandas 0.23 and later on Python 3.6+; older versions sorted the keys). If an index is passed, the values in data corresponding to the labels in the index are pulled out, and labels with no matching key get NaN.

In [None]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a    0.0
b    1.0
c    2.0
dtype: float64


In [None]:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print (s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


### Create a Series from Scalar
*   If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

In [None]:
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

0    5
1    5
2    5
3    5
dtype: int64


### Accessing Data from Series with Position
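*   Data in a Series can be accessed by integer position, much as in an ndarray. With a non-integer index such as the one below, use iloc to fetch a single position; plain s[0] is deprecated in recent pandas releases.


In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

# retrieve the first element by position
print (s.iloc[0])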

1


In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first three elements
print (s[:3])

a    1
b    2
c    3
dtype: int64


In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the last three elements
print (s[-3:])

c    3
d    4
e    5
dtype: int64


### Retrieve Data Using Label (Index)
*   A Series is like a fixed-size dict in that you can get and set values by index label.

In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print (s['a'])

1


In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve multiple elements
print (s[['a','c','d']])

a    1
c    3
d    4
dtype: int64
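The examples above only read values. Setting a value by label, and the dict-like membership test and get() lookup, can be sketched as follows (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Set a value through its label, as with a dict
s['a'] = 10

# Membership tests check the index labels, not the values
has_a = 'a' in s
has_z = 'z' in s

# get() returns a default instead of raising KeyError for a missing label
missing = s.get('z', 0)

print(s['a'], has_a, has_z, missing)
```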


## Creating DataFrame
* A pandas DataFrame can be created using the following constructor -

```
pandas.DataFrame( data, index, columns, dtype)
```
**data**

*   data takes various forms like ndarray, series, map, lists, dict, constants and also another DataFrame.

**index**

*   Row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.

**columns**

*   Column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.

**dtype**

*   Data type of each column.

### A pandas DataFrame can be created using various inputs like

*   Lists
*   dict
*   Series
*   Numpy ndarrays
*   Another DataFrame

### Create an Empty DataFrame


In [None]:
df = pd.DataFrame()
print (df)

Empty DataFrame
Columns: []
Index: []


### Create a DataFrame from Lists

In [None]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)

   0
0  1
1  2
2  3
3  4
4  5


In [None]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


### Create a DataFrame from Dict of ndarrays / Lists
*   All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays.

*   If no index is passed, then by default, index will be range(n), where n is the array length.

In [None]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print (df)

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


In [None]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)

        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42


### Create a DataFrame from List of Dicts
*   List of Dictionaries can be passed as input data to create a DataFrame. 
*   The dictionary keys are by default taken as column names.

In [None]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print (df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [None]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [None]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# Two column labels, both matching dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

# Two column labels, one of which ('b1') matches no dictionary key and is filled with NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print (df1)
print (df2)

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


### Create a DataFrame from Dict of Series

*   Dictionary of Series can be passed to form a DataFrame. 
*   The resultant index is the union of all the series indexes passed.


In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


### Column Selection

In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df ['one'])

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


## Adding, Deleting, Modifying the rows/columns in a dataframe


### Column Addition

In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

# Add a new column to an existing DataFrame by assigning a Series to a new column label.
# Values are aligned on the index, so row 'd' (absent from the Series) gets NaN.
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df)

   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN


In [None]:
# Adding a new column using the existing columns in DataFrame:
df['four']=df['one']+df['three']

print (df)

   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


### Column Deletion

In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}

df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)

# using the del statement
print ("Deleting the first column using DEL function:")
del df['one']
print (df)

# using the pop() method
print ("Deleting another column using POP function:")
df.pop('two')
print (df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN


### Row Selection, Addition, and Deletion
### Selection by Label
*  Rows can be selected by passing a row label to the loc accessor.

In [None]:
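d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

# Rebuild df from d; the earlier df was reduced to column 'three' by del and pop
df = pd.DataFrame(d)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4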


In [None]:
df = pd.DataFrame(d)
print (df.loc['b'])
#The result is a series with labels as column names of the DataFrame. 
#And, the Name of the series is the label with which it is retrieved.

one    2.0
two    2.0
Name: b, dtype: float64


### Selection by integer location

In [None]:
df.iloc[2]

one    3.0
two    3.0
Name: c, dtype: float64

### Slice Rows
Multiple rows can be selected using the **:** (slice) operator.

In [None]:
df[2:4]

   one  two
c  3.0    3
d  NaN    4


### Addition of Rows 
*   Add new rows to a DataFrame using the pd.concat function. 
*   The older DataFrame.append method was deprecated in pandas 1.4 and removed in pandas 2.0.
*   The new rows are placed at the end, and they keep their original index labels unless ignore_index=True is passed.

In [None]:
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
df

   a  b
0  1  2
1  3  4
0  5  6
1  7  8


In [None]:
# The variables are not named dict, which would shadow the built-in type
marks1 = {'Name':['Martha', 'Tim', 'Rob', 'Georgia'],
          'Maths':[87, 91, 97, 95],
          'Science':[83, 99, 84, 76]
         }
  
df1 = pd.DataFrame(marks1)
display(df1)
  
marks2 = {'Name':['Amy', 'Maddy'],
          'Maths':[89, 90],
          'Science':[93, 81]
         }
  
df2 = pd.DataFrame(marks2)
display(df2)

      Name  Maths  Science
0   Martha     87       83
1      Tim     91       99
2      Rob     97       84
3  Georgia     95       76


    Name  Maths  Science
0    Amy     89       93
1  Maddy     90       81


In [None]:
df3 = pd.concat([df1, df2],ignore_index=True,axis=0)
  
display(df3)

      Name  Maths  Science
0   Martha     87       83
1      Tim     91       99
2      Rob     97       84
3  Georgia     95       76
4      Amy     89       93
5    Maddy     90       81


### drop()

The drop() function in the pandas library is used to remove rows or columns from a DataFrame. It allows you to specify the labels (either row index or column names) that you want to drop.

The basic syntax for the drop() function is as follows:


```
df.drop(labels, axis=0, inplace=False)
```
Here is an explanation of the parameters:

**labels**: This parameter specifies the labels of the rows or columns to be dropped. It can take a single label or a list of labels.

**axis**: This parameter indicates the axis along which the labels will be dropped. By default, axis=0 (rows). To drop columns, set axis=1.

**inplace**: This parameter specifies whether to modify the DataFrame in place or return a modified copy. If inplace=True, the DataFrame will be modified, and the function will return None. If inplace=False (default), a modified copy of the DataFrame will be returned, and the original DataFrame will remain unchanged.
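The axis and inplace parameters can be illustrated with a short sketch. The frame below reuses the kind of data shown in this notebook; the variable names `without_age` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Sally", "Mary", "John"],
    "age": [50, 40, 30],
    "qualified": [True, False, False],
})

# axis=1 drops a column and returns a new frame; df itself is untouched
without_age = df.drop("age", axis=1)
print(without_age)

# inplace=True modifies df directly and returns None
result = df.drop([0, 1], inplace=True)
print(result)
print(df)
```

Because inplace=True returns None, assigning its result back to df would discard the frame, so the two styles should not be mixed.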

### Deletion of Rows


### Drop a single row

In [None]:
data = {
  "name": ["Sally", "Mary", "John"],
  "age": [50, 40, 30],
  "qualified": [True, False, False]
}

df = pd.DataFrame(data)

df


    name  age  qualified
0  Sally   50       True
1   Mary   40      False
2   John   30      False


In [None]:
df.drop(2)  # Drops the row with index 2

    name  age  qualified
0  Sally   50       True
1   Mary   40      False


### Drop multiple rows


In [None]:
df.drop([1, 2])  # Drops the rows with index 1 and 2

    name  age  qualified
0  Sally   50       True


## Apply functions on dataframe


### apply()
In pandas, you can apply functions to a DataFrame using the apply() function. The apply() function allows you to apply a function along either the rows or columns of the DataFrame, depending on the axis parameter.

The basic syntax for the apply() function is as follows:


```
df.apply(func, axis=0)
```
**Here is an explanation of the parameters:**

*  func: This parameter specifies the function to be applied. It can be a built-in function, a lambda function, or a user-defined function.

*  axis: This parameter indicates the axis along which the function will be applied. By default, axis=0 (column-wise). To apply the function row-wise, set axis=1.

Here are a few examples of using the apply() function:


### Apply a built-in function to each column


In [None]:
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [None]:
# Apply the built-in sum function to each column
result = df.apply(sum)

print(result)

A     6
B    15
C    24
dtype: int64


### Apply a user-defined function to each column

In [None]:
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Define a user-defined function
def multiply_by_two(column):
    return column * 2

# Apply the user-defined function to each column
result = df.apply(multiply_by_two)

print(result)

   A   B   C
0  2   8  14
1  4  10  16
2  6  12  18


### Apply a user-defined function to a single column

In [None]:
# Apply the user-defined function to one column
result = df['A'].apply(multiply_by_two)

print(result)

0    2
1    4
2    6
Name: A, dtype: int64
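The examples above all work column-wise. A row-wise call with axis=1 can be sketched as follows, where each lambda receives one row as a Series (the names `row_totals` and `row_ranges` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# With axis=1 the function is called once per row
row_totals = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)

# Any Series method is available on the row, e.g. the spread of its values
row_ranges = df.apply(lambda row: row.max() - row.min(), axis=1)

print(row_totals)
print(row_ranges)
```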


## Iterations on dataframe

To iterate over the rows or columns of the DataFrame, we can use the following methods: 

*  **iterrows()**: This method returns an iterator that iterates over the rows of the DataFrame, yielding the **index** and row data as a **Series**. (index,Series)


*  **itertuples()**: This method returns an iterator that iterates over the rows of the DataFrame, yielding **namedtuples** containing the **index and row values**.

*  **items()**: This method iterates over the columns of the DataFrame, yielding pairs of **column names** and series containing the **column data**. (column names, column data) 

### iterrows()
*  This method returns an iterator that iterates over the rows of the DataFrame, yielding the **index** and row data as a **Series**. (index,Series)

In [None]:
# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

In [None]:
display(df)

    Name  Age      City
0   John   25  New York
1  Emily   30    London
2   Ryan   35     Paris


In [None]:
# Iterate over the DataFrame using iterrows()
for index, row in df.iterrows():
    print(index)

0
1
2


In [None]:
# Iterate over the DataFrame using iterrows()
for index, row in df.iterrows():
    print(row['City'])

New York
London
Paris


In [None]:
# Iterate over the DataFrame using iterrows()
for index, row in df.iterrows():
    print(row['Age'])

25
30
35


In [None]:
# Iterate over the DataFrame using iterrows()
for index, row in df.iterrows():
    print(f"Index: {index}, Name: {row['Name']}, Age: {row['Age']}, City: {row['City']}")

Index: 0, Name: John, Age: 25, City: New York
Index: 1, Name: Emily, Age: 30, City: London
Index: 2, Name: Ryan, Age: 35, City: Paris


### itertuples()
This method returns an iterator that iterates over the rows of the DataFrame, yielding **namedtuples** containing the **index and row values**.

In [None]:
# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

In [None]:
display(df)

    Name  Age      City
0   John   25  New York
1  Emily   30    London
2   Ryan   35     Paris


In [None]:
# Iterate over the DataFrame using itertuples()
for row in df.itertuples(index=True, name='Person'):
    print(row)

Person(Index=0, Name='John', Age=25, City='New York')
Person(Index=1, Name='Emily', Age=30, City='London')
Person(Index=2, Name='Ryan', Age=35, City='Paris')


In [None]:
# Iterate over the DataFrame using itertuples()
for row in df.itertuples(index=True, name='Person'):
    print(row.Index)

0
1
2


In [None]:
for row in df.itertuples(index=True, name='Person'):
    print(row.Name)

John
Emily
Ryan


In [None]:
# Iterate over the DataFrame using itertuples()
for row in df.itertuples(index=True, name='Person'):
    print(f"Index: {row.Index}, Name: {row.Name}, Age: {row.Age}, City: {row.City}")


Index: 0, Name: John, Age: 25, City: New York
Index: 1, Name: Emily, Age: 30, City: London
Index: 2, Name: Ryan, Age: 35, City: Paris


### items()

*  This method iterates over the columns of the DataFrame, yielding pairs of **column names** and **series** containing the **column data**. (column names, column data as series) 

In [None]:
data = {'Name': ['John', 'Emily', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

In [None]:
display(df)

    Name  Age      City
0   John   25  New York
1  Emily   30    London
2   Ryan   35     Paris


In [None]:
# Iterate over the DataFrame columns using items()
for column_name, series in df.items():
    print(column_name)
    

Name
Age
City


In [None]:
# Iterate over the DataFrame columns using items()
for column_name, series in df.items():
    print(series)

0     John
1    Emily
2     Ryan
Name: Name, dtype: object
0    25
1    30
2    35
Name: Age, dtype: int64
0    New York
1      London
2       Paris
Name: City, dtype: object


In [None]:
# Iterate over the DataFrame columns using items()
for column_name, series in df.items():
    print(f"Column Name: {column_name}")
    print(series)

Column Name: Name
0     John
1    Emily
2     Ryan
Name: Name, dtype: object
Column Name: Age
0    25
1    30
2    35
Name: Age, dtype: int64
Column Name: City
0    New York
1      London
2       Paris
Name: City, dtype: object


## Accessing the elements from a dataframe
To access elements from a DataFrame in pandas, you can use various methods based on your specific requirements. Here are some common ways to access elements in a DataFrame:

*   Column Access
*   Row Access
*   Element Access





### Column Access
You can access a specific column in a DataFrame by using either **dot** notation or **bracket** notation.

In [None]:
# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
display(df)

    Name  Age      City
0   John   25  New York
1  Emily   30    London
2   Ryan   35     Paris


In [None]:
# Access a specific column using dot notation
name_column = df.Name
print(name_column)

0     John
1    Emily
2     Ryan
Name: Name, dtype: object


In [None]:
# Access a specific column using bracket notation
age_column = df['Age']
print(age_column)

0    25
1    30
2    35
Name: Age, dtype: int64
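Dot notation only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. A minimal sketch of that limitation, using hypothetical column names (Home City, count) that do not appear elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical columns chosen to show where dot notation falls short
df = pd.DataFrame({'Name': ['John', 'Emily'],
                   'Home City': ['New York', 'London'],
                   'count': [3, 5]})

# Bracket notation works for any column label, including ones with spaces
home_city = df['Home City']
visit_count = df['count']

# df.count is the DataFrame.count method, not the 'count' column
print(callable(df.count))
print(visit_count.tolist())
```

Bracket notation is therefore the safer default when column names come from external data.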


### Row Access
*   You can access a specific row in a DataFrame using the **loc** or **iloc** accessor.

In [None]:
# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
display(df)

    Name  Age      City
0   John   25  New York
1  Emily   30    London
2   Ryan   35     Paris


In [None]:
# Access a specific row using loc (based on index label)
row_0 = df.loc[0]
print(row_0)

Name        John
Age           25
City    New York
Name: 0, dtype: object


In [None]:
# Access a specific row using iloc (based on integer index)
row_1 = df.iloc[1]
print(row_1)

Name     Emily
Age         30
City    London
Name: 1, dtype: object
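With the default index, labels and positions coincide, so loc and iloc look interchangeable. The difference shows up once the labels are not 0, 1, 2. The sketch below assumes a hypothetical index of 10, 20, 30:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily', 'Ryan'],
                   'Age': [25, 30, 35]},
                  index=[10, 20, 30])  # hypothetical non-default labels

by_label = df.loc[20]       # row whose index label is 20
by_position = df.iloc[1]    # second row, whatever its label

print(by_label['Name'], by_position['Name'])

# Label 1 does not exist, so loc raises KeyError where iloc succeeded
try:
    df.loc[1]
    found = True
except KeyError:
    found = False
print(found)
```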


### Element Access
*  You can access a specific element in a DataFrame by combining column and row access methods.

In [None]:
# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Ryan'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
display(df)

    Name  Age      City
0   John   25  New York
1  Emily   30    London
2   Ryan   35     Paris


In [None]:
# Access a specific element using loc (based on index label)
element_0_1 = df.loc[0, 'Name']
print(element_0_1)

John


In [None]:
# Access a specific element using iloc (based on integer index)
element_2_2 = df.iloc[2, 2]
print(element_2_2)

Paris


## Different ways to deal with NA in dataframe

Dealing with missing values (NA) in a DataFrame is an important step in data preprocessing. Pandas provides several methods to handle missing values. Here are some common techniques to deal with NA in a DataFrame:

1. **Drop rows or columns**: You can remove rows or columns containing NA values using the dropna() method.

2. **Fill NA with a specific value**: You can replace NA values with a specific value using the fillna() method.

3. **Forward-fill or backward-fill NA**: You can propagate the previous or next valid value to fill NA values using the ffill() or bfill() methods.

4. **Replace NA based on condition**: You can replace NA values based on a condition using boolean indexing.

These are some common approaches to handle NA values in a DataFrame using pandas. Depending on the specific data and context, you can choose the most suitable method for your analysis or data cleaning tasks.

### Drop rows or columns
*  You can remove rows or columns containing NA values using the dropna() method

#### dropna()

In pandas, the dropna() function is used to remove missing values (NA) from a DataFrame. It allows you to drop either rows or columns that contain NA values based on specified criteria. The dropna() function provides several parameters to control the behavior of the operation. Here's an overview of the dropna() function in pandas:

**Syntax**:

```
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```



**Parameters**:

*  **axis**: Specifies the axis along which NA values should be dropped. By default, it is set to 0 (rows). To drop columns, set axis to 1.
*  **how**: Determines the condition for dropping NA values. It accepts the following options:
  *  **'any'**: Drops the row/column if it contains any NA value (default).
  *  **'all'**: Drops the row/column only if all values are NA.
*  **thresh**: Specifies the minimum number of non-NA values required to keep a row/column. If the count of non-NA values is less than thresh, the row/column is dropped.
*  **subset**: Defines specific columns or indices to consider for NA removal. It takes a list of column names or index labels.
*  **inplace**: Specifies whether to modify the DataFrame in place or return a new DataFrame. By default, it is set to False.

In [None]:
# Creating a sample DataFrame with NA values
data = {'Name': ['John', 'Emily', 'Ryan', 'Sophia'],
        'Age': [25, 30, None, 40],
        'City': ['New York', None, 'Paris', 'London']}
df = pd.DataFrame(data)
display(df)

     Name   Age      City
0    John  25.0  New York
1   Emily  30.0      None
2    Ryan   NaN     Paris
3  Sophia  40.0    London


In [None]:
# Drop rows with NA values
df_dropped_rows = df.dropna(axis=0)
display(df_dropped_rows)

     Name   Age      City
0    John  25.0  New York
3  Sophia  40.0    London
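The how, thresh, and subset parameters described above can be sketched on a small frame. The data here is made up for illustration, with a last row that is entirely missing:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['John', 'Emily', 'Ryan', None],
    'Age': [25, np.nan, np.nan, np.nan],
    'City': ['New York', None, 'Paris', None],
})

# how='all' drops a row only when every value in it is NA
all_missing_dropped = df.dropna(how='all')

# thresh=2 keeps rows that have at least two non-NA values
at_least_two = df.dropna(thresh=2)

# subset restricts the NA check to the listed columns
age_required = df.dropna(subset=['Age'])

print(all_missing_dropped)
print(at_least_two)
print(age_required)
```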


#### fillna()
In pandas, the fillna() function is used to fill missing values (NA) in a DataFrame with a specified value or method. It allows you to replace NA values based on different strategies. The fillna() function provides several parameters to control the behavior of the operation. Here's an overview of the fillna() function in pandas:

Syntax:
```
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
```


**Parameters:**

*  **value:** Specifies the value to use for filling NA values. It can be a scalar value, a dictionary mapping column names to values, or a Series containing values to fill NA.
*  **method**: Specifies a method to fill NA values. This parameter is deprecated since pandas 2.1; call ffill() or bfill() directly instead. Options include:
  *  **'pad'** or **'ffill'**: Forward-fill, which propagates the previous value forward.
  *  **'backfill'** or **'bfill'**: Backward-fill, which propagates the next value backward.
*  **axis**: Determines the axis along which NA values should be filled. By default, NA values are filled column-wise (axis=0).
*  **inplace**: Specifies whether to modify the DataFrame in place or return a new DataFrame. By default, it is set to False.
*  **limit**: Limits the number of consecutive NA values filled when using method.

### Fill NA with a specific value
You can replace NA values with a specific value using the fillna() method.

In [None]:
# Fill NA values with a specific value
df_filled = df.fillna('Unknown')
display(df_filled)

     Name      Age      City
0    John     25.0  New York
1   Emily     30.0   Unknown
2    Ryan  Unknown     Paris
3  Sophia     40.0    London
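Filling every column with the same string turns the numeric Age column into mixed text. Passing a dictionary as value gives each column its own replacement. A minimal sketch, where the fill values 0 and 'Unknown' are arbitrary choices:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily', 'Ryan', 'Sophia'],
                   'Age': [25, 30, None, 40],
                   'City': ['New York', None, 'Paris', 'London']})

# Map each column name to the value used for its missing entries
filled = df.fillna({'Age': 0, 'City': 'Unknown'})
print(filled)
```

The Age column stays numeric this way, and the original df is left unchanged because inplace defaults to False.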


### Forward-fill or backward-fill NA: 
* You can propagate the previous or next valid value to fill NA values using the ffill() or bfill() methods.

In [None]:
# Forward-fill NA values (ffill() replaces the deprecated fillna(method='ffill'))
df_ffill = df.ffill()
display(df_ffill)

     Name   Age      City
0    John  25.0  New York
1   Emily  30.0  New York
2    Ryan  30.0     Paris
3  Sophia  40.0    London


In [None]:
# Backward-fill NA values (bfill() replaces the deprecated fillna(method='bfill'))
df_bfill = df.bfill()
display(df_bfill)

     Name   Age      City
0    John  25.0  New York
1   Emily  30.0     Paris
2    Ryan  40.0     Paris
3  Sophia  40.0    London


### Replace NA based on condition: 
*  You can replace NA values based on a condition using boolean indexing. The cell below takes a simpler route and fills the missing ages with the mean of the known ages.

In [None]:
# Replace NA values in the Age column with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
display(df)


     Name        Age      City
0    John  25.000000  New York
1   Emily  30.000000      None
2    Ryan  31.666667     Paris
3  Sophia  40.000000    London
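Boolean indexing itself can be sketched as follows: build a mask with isna() and assign through loc to the rows where the mask is True. The replacement value 'Unknown' is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily', 'Ryan', 'Sophia'],
                   'Age': [25, 30, None, 40],
                   'City': ['New York', None, 'Paris', 'London']})

# Boolean mask: True where Age is missing
missing_age = df['Age'].isna()

# Assign the column mean only to the rows selected by the mask
df.loc[missing_age, 'Age'] = df['Age'].mean()

# The same pattern works for any condition, here a missing City
df.loc[df['City'].isna(), 'City'] = 'Unknown'

print(df)
```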


## Groupby operations on dataframe

Any groupby operation involves one or more of the following steps on the original object:

*  Splitting the Object
*  Applying a function
*  Combining the results

In many situations, we split the data into sets and apply some function to each subset. In the apply step, we can perform the following operations:

* Aggregation - computing a summary statistic
* Transformation - perform some group-specific operation
* Filtration - discarding the data with some condition

In [None]:
# Creating a sample DataFrame
data = {'Name': ['John', 'Emily', 'Ryan', 'Emily', 'John'],
        'Age': [25, 30, 35, 30, 28],
        'City': ['New York', 'London', 'Paris', 'London', 'New York'],
        'Salary': [5000, 6000, 7000, 5500, 4500]}
df = pd.DataFrame(data)

# Applying a custom function to groups
def custom_function(group):
    return group['Age'].max() - group['Age'].min()

grouped_custom = df.groupby('Name').apply(custom_function)
print(grouped_custom)



Name
Emily    0
John     3
Ryan     0
dtype: int64


In [None]:
#import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

display (df)

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    Kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690


#### Use groupby() function to group the data based on the "Team"

In [None]:
df.groupby('Team').groups

{'Devils': [2, 3], 'Kings': [4, 5, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals': [9, 10]}

In [None]:
g = df.groupby('Team')

In [None]:
g.first()

        Rank  Year  Points
Team
Devils     2  2014     863
Kings      3  2014     741
Riders     1  2014     876
Royals     4  2014     701


#### Select a Group

In [None]:
g.get_group('Devils')

     Team  Rank  Year  Points
2  Devils     2  2014     863
3  Devils     3  2015     673


In [None]:
g.get_group('Royals')

      Team  Rank  Year  Points
9   Royals     4  2014     701
10  Royals     1  2015     804


In [None]:
for name,group in g:
  print(name)

Devils
Kings
Riders
Royals


In [None]:
for name,group in g:
  print(group)

     Team  Rank  Year  Points
2  Devils     2  2014     863
3  Devils     3  2015     673
    Team  Rank  Year  Points
4  Kings     3  2014     741
5  Kings     4  2015     812
6  Kings     1  2016     756
7  Kings     1  2017     788
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690
      Team  Rank  Year  Points
9   Royals     4  2014     701
10  Royals     1  2015     804


#### Aggregations
*   An aggregation function returns a single aggregated value for each group. Once the groupby object is created, several aggregation operations can be performed on the grouped data.

In [None]:
# The string name is used because passing np.mean raises a FutureWarning in pandas 2.1+
g['Points'].agg('mean')

Team
Devils    768.00
Kings     774.25
Riders    762.25
Royals    752.50
Name: Points, dtype: float64

In [None]:
g.agg(np.size)

        Rank  Year  Points
Team
Devils     2     2       2
Kings      4     4       4
Riders     4     4       4
Royals     2     2       2


In [None]:
# String names avoid the FutureWarning that np.sum, np.mean and np.std raise in pandas 2.1+
g['Points'].agg(['sum', 'mean', 'std'])

         sum    mean         std
Team
Devils  1536  768.00  134.350288
Kings   3097  774.25   31.899582
Riders  3049  762.25   88.567771
Royals  1505  752.50   72.831998
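Aggregation is only one of the three operations listed at the start of this section. Transformation and filtration can be sketched on a reduced version of the team data; the threshold of 800 points and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'Kings'],
                   'Points': [876, 789, 863, 673, 741, 812]})

# Transformation: the result has one value per original row,
# here the mean Points of the team that row belongs to
team_mean = df.groupby('Team')['Points'].transform('mean')
deviation = df['Points'] - team_mean

# Filtration: keep only the rows of teams whose mean Points exceed 800
strong_teams = df.groupby('Team').filter(lambda grp: grp['Points'].mean() > 800)

print(deviation)
print(strong_teams)
```

Unlike agg(), transform() keeps the original row count, which makes it convenient for adding group-level statistics back as a new column.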


#### Use groupby() function to form groups based on more than one category (i.e. use more than one column to perform the splitting).

In [None]:
gm = df.groupby(['Team','Year'])

In [None]:
gm.first()

             Rank  Points
Team   Year
Devils 2014     2     863
       2015     3     673
Kings  2014     3     741
       2015     4     812
       2016     1     756
       2017     1     788
Riders 2014     1     876
       2015     2     789
       2016     2     694
       2017     2     690
Royals 2014     4     701
       2015     1     804


## Merging dataframes