<img src="images/pandas-intro.png">

# Learning Agenda of this Notebook:
- What is Pandas and how is it used in AI?
- Key features of Pandas
- Data Types in Pandas
- What does Pandas deal with?

- Creating Series in Pandas
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
    - Creating empty series object
- Attributes of a Pandas Series
- Arithmetic Operations on Series

- Dataframes in Pandas
    - Anatomy of a Dataframe
    - Creating Dataframe
        - An empty dataframe
        - Two-Dimensional NumPy Array
        - Dictionary of Python Lists
        - Dictionary of Pandas Series
    - Attributes of a Dataframe
    - Bonus
- Different file formats in Pandas 
- Indexing, Subsetting and Slicing Dataframes
    - Practice Exercise I
- Modifying Dataframes
- Data Handling with Pandas
  - Practice Exercise I
  - Practice Exercise II
- All Statistical functions in Pandas
- Input/Output Operations
- Aggregation & Grouping
  - Practice Exercise
- Merging, Joining and Concatenation
  - Practice Exercise
- How To Perform Data Visualization with Pandas
- Exercise I
- Exercise II
- Pandas Assignment

## Outline of Notebook

1. Identify the Columns having Null/Missing values using `df.isna()` method
2. Handle/Impute the Null/Missing Values under the `scholarship` Column using `df.loc[mask,col]=value`
3. Handle/Impute the Null/Missing Values under the `group` Column using `df.loc[mask,col]=value`
4. Handle Missing values under a Numeric/Categorical Column using `fillna()`
5. Handle Repeating Values (for same information) under the `session` Column
6. Create a new Column by Modifying an Existing Column
7. Delete Rows Having NaN values using `df.dropna()` method
8. Convert Categorical Variables into Numerical

### Important Points:
- In Pandas, numerical columns are usually imputed with the mean or median of the column.
- In Pandas, categorical columns are usually imputed with the mode.
- In real-world projects, `SimpleImputer` or `KNNImputer` is commonly used for imputation. Both are available in scikit-learn (`sklearn.impute`).
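
The points above can be sketched on a small, hypothetical dataframe (the `marks` and `city` columns are made up for illustration and are not part of `groupdata.csv`). This is only a minimal sketch of mean/median/mode imputation with plain pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the notebook's dataset)
toy = pd.DataFrame({
    "marks": [10.0, 20.0, np.nan, 90.0],
    "city": ["Lahore", np.nan, "Lahore", "Karachi"],
})

marks_mean = toy["marks"].mean()      # NaN is skipped: (10 + 20 + 90) / 3
marks_median = toy["marks"].median()  # middle value of 10, 20, 90
city_mode = toy["city"].mode()[0]     # most frequent non-null value

# Numeric column: fill with the median (less sensitive to the large value 90)
toy["marks"] = toy["marks"].fillna(marks_median)
# Categorical column: fill with the mode
toy["city"] = toy["city"].fillna(city_mode)
print(toy)
```

Whether the mean or the median is the better choice depends on how skewed the column is; the median is used here only because the toy column contains one large value.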

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df

In [None]:
df.describe(include='all')

- Whenever the **`pd.read_csv()`** method detects a missing value (nothing between two commas in a CSV file, or an empty cell in Excel), it flags it as NaN. There can be many reasons for these NaN values; for example, the data may have been gathered from people through a Google Form in which the field was optional and was skipped.
- Another scenario is that a user entered some text in a numeric field because they did not have the requested information.
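
Both scenarios can be reproduced with a tiny in-memory CSV. The text below is hypothetical sample data, and the placeholder word `unknown` is an assumption chosen for illustration; the sketch shows how empty fields become NaN and how the `na_values` argument can flag a text placeholder as missing too:

```python
import io
import pandas as pd

# Hypothetical CSV text: Sara's age is empty, Umar's age is the text 'unknown'
csv_text = "name,age,city\nAli,21,Lahore\nSara,,Karachi\nUmar,unknown,\n"

raw = pd.read_csv(io.StringIO(csv_text))
# Empty fields are read as NaN, but 'unknown' is kept as text,
# so the age column is not numeric
print(raw.isna().sum())
print(pd.api.types.is_numeric_dtype(raw["age"]))

# na_values tells read_csv to treat the placeholder as missing as well
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])
print(cleaned.isna().sum())
print(cleaned["age"].dtype)
```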

## 1. Identify the Columns having Null/Missing values
- The **`df.isna()`** method, recommended over its alias `df.isnull()`, returns a boolean object of the same size that indicates whether each element is an NA value. Missing values are mapped to True and everything else is mapped to False. Remember, characters such as empty strings ``''`` or `numpy.inf` are not considered NA values.
- The **`df.notna()`** method, recommended over its alias `df.notnull()`, returns a boolean object of the same size that indicates whether each element is not an NA value. Non-missing values are mapped to True and missing values are mapped to False.
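
A short sketch of what counts as NA, using two made-up Series (the values are illustrative only): `NaN` and `None` are flagged, while an empty string and `numpy.inf` are not.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to show each case
numbers = pd.Series([1.5, np.nan, None, np.inf], dtype="float64")
texts = pd.Series(["a", "", None], dtype="object")

print(numbers.isna().tolist())  # NaN and None are NA, inf is not
print(texts.isna().tolist())    # the empty string is not NA

# notna() is the element-wise opposite of isna()
print(numbers.notna().tolist())
```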

In [None]:
df.isna().head()

In [None]:
df.notna().head()

In [None]:
# Now we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.isna().sum()

In [None]:
# Similarly, we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.notna().sum()

## 2. Handle/Impute the Null/Missing Values under the `scholarship` Column

#### a. Identify the Rows under the `scholarship` Column having Null/Missing values
- The `isna()` method works equally well on Series objects

In [None]:
df.scholarship.isna()

In [None]:
mask = df.scholarship.isna()
mask

In [None]:
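# Both expressions return only those rows of the dataframe having null values under the scholarship column
# (in a notebook cell, only the result of the last expression is displayed)
df[mask]         # same as df[df.scholarship.isna()]
df.loc[mask, :]  # same as df.loc[df.scholarship.isna(), :]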

#### b. Replace the Null/Missing Values under the `scholarship` Column
- After detecting the NaN values, the next question is what value we should write in the cells where the `scholarship` column has Null/Missing values.
- Suppose we want to put the average value in place of the missing values.

In [None]:
# compute the mean of the scholarship column
df.scholarship.mean()

In [None]:
df.loc[df.scholarship.isna(), 'scholarship']

In [None]:
df.loc[df.scholarship.isna(), 'scholarship'] = df.scholarship.mean()
df

In [None]:
# Confirm the result
df.isna().sum()

## 3. Handle/Impute the Null/Missing Values under the `group` Column
- The `group` column contains categorical values, i.e., a value that can take on one of a limited, and usually fixed, number of possible values.

#### a. Identify the Rows under the `group` Column having Null/Missing values

In [None]:
mask = df.group.isna()
mask

In [None]:
df[mask]          # df[df.group.isna()]
df.loc[mask, :]   # df.loc[df.group.isna()]

#### b. Replace the Null/Missing Values under the `group` Column
- After detecting the NaN values, the next question is what value we should write in the cells that have Null/Missing values
- Since this is a categorical column with datatype object (group A, group B, group C, ...), let us replace the missing values with the value that has the maximum frequency in the column
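
One detail worth knowing before using the mode: `mode()` returns a Series, because several values can tie for the highest frequency. The sketch below uses a made-up `groups` Series (not the notebook's data) in which two groups tie, and picks the first mode as the fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing value and a tie
groups = pd.Series(["group A", "group C", np.nan, "group C", "group A"], dtype="object")

modes = groups.mode()          # NaN is ignored; tied values are returned sorted
print(modes.tolist())

fill_value = modes.iloc[0]     # choose the first of the tied modes
groups.loc[groups.isna()] = fill_value
print(groups.tolist())
```

Taking the first entry is only a convention; when there is a tie, any of the modes is an equally defensible fill value.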

In [None]:
# The value_counts() method returns a Series containing counts of unique values (in descending order),
# with the most frequently occurring element first. It excludes NA values by default.
df.group.value_counts()

In [None]:
# df.group.value_counts(dropna=False)

In [None]:
df.group.mode()[0]

In [None]:
# List only those records under group column having Null values
mask = df.group.isna()
df.loc[mask, 'group']     # df.loc[(df.group.isna()), 'group']

In [None]:
# Let us replace these values with maximum occurring value in the `group` column
df.loc[(df.group.isna()),'group'] = df.group.mode()[0]

In [None]:
# Confirm the result
df.isna().sum()

In [None]:
df

> Note that in the original dataframe the group information of `Yusuf` was missing, and now it is `group C`

## 4. Handle Missing values under a Numeric/Categorical Column using `fillna()`

#### a. Replace the Null/Missing Values under the `scholarship` Column using `fillna()`
- This is the more recommended way of filling in the Null values within columns of your dataset, compared with the `loc` method.
```
object.fillna(value=None, method=None, inplace=False)
```
- The only required argument is either the `value` with which we want to replace the missing values OR the `method` to be used to replace them (note that the `method` argument is deprecated since pandas 2.1 in favour of the `ffill()` and `bfill()` methods)
- Returns the object with missing values filled, or None if ``inplace=True``
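
As a small illustration of the return-value behaviour described above, the sketch below fills two columns of a hypothetical dataframe in one call by passing a dict of per-column values. The column names and the fill value `"A"` are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the notebook's dataset)
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0],
    "grade": ["A", np.nan, "B"],
})

# A dict maps each column name to its own fill value
fill_values = {"amount": toy["amount"].mean(), "grade": "A"}
filled = toy.fillna(fill_values)

print(toy.isna().sum().sum())     # the original still has its missing cells
print(filled.isna().sum().sum())  # the returned copy has none
```

Because `fillna()` returns a new object by default, the result has to be assigned to a name (or back to `toy`) to keep it.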

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')

In [None]:
df.sample(5)

>- Before proceeding, tell me why we use the `na_values` argument in the `pd.read_csv()` method?

In [None]:
df.loc[df.scholarship.isna()]

In [None]:
mean_value = df.scholarship.mean()
median_value = df.scholarship.median()
mean_value, median_value   # display both values (a cell shows only its last expression)

In [None]:
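# This time, instead of loc, use the fillna() method
# Assigning the result back to the column updates the original dataframe
# (fillna(..., inplace=True) on df.scholarship is a chained operation that recent
#  pandas versions warn about and that may not update df)

df['scholarship'] = df.scholarship.fillna(value=mean_value)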

In [None]:
# confirm the result
df.loc[df.scholarship.isna()]

In [None]:
df.head(6)

#### b. Replace the Null/Missing Values under the `group` Column using `fillna()`

In [None]:
df.isna().sum()

In [None]:
group_mode = df.group.mode()[0]
group_mode

In [None]:
# Assign the result back instead of using inplace=True on the selected column
df['group'] = df.group.fillna(value=group_mode)

In [None]:
# Confirm the result
df.isna().sum()

> Exercise: Fill the missing values of the `subj1` and `subj2` columns with the mean of each column using the `fillna()` method.

#### c. Replace the Null/Missing Values under the `scholarship` and `group` Columns using `ffill` and `bfill` Arguments
- In the above examples, we passed the mean value (for a numeric column) and the mode value (for a categorical column) as the filling value to the `fillna()` method
```
object.fillna(value=None, method=None, inplace=False)
```

- We can pass `ffill` or `bfill` as the `method` argument to the `fillna()` method. This replaces the null values with neighbouring values from the same column. Since pandas 2.1 the `method` argument is deprecated, and the equivalent `ffill()` and `bfill()` methods are preferred
- `ffill` (forward fill): fills the NaN value with the previous value
- `bfill` (back fill): fills the NaN value with the next value
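
The behaviour of the two fills is easiest to see on a short Series. The values below are made up; the sketch also shows two edge cases: a leading NaN has no previous value to carry forward, and the optional `limit` argument caps how many consecutive NaNs are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps at the start, middle and end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # the first value stays NaN (nothing before it)
backward = s.bfill()         # the last value stays NaN (nothing after it)
both = s.ffill().bfill()     # forward fill, then back fill the leftover start
limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
print(limited.tolist())
```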


In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.sample(5)

In [None]:
df.isna().sum()

In [None]:
# Forward fill (ffill):
# if a cell has a NaN value, carry forward the previous value in that column
# df.ffill() is the current spelling of df.fillna(method='ffill'), which is deprecated
df.ffill(inplace=True)

In [None]:
df.isna().sum()

In [None]:
# Or you can use bfill to fill the NaN values with the next value in that column
# df.bfill() is the current spelling of df.fillna(method='bfill'), which is deprecated
df.bfill(inplace=True)

## 5. Handle Repeating Values (for same information) under the `session` Column
- If you look at the values under the `session` column, you can see that it is a categorical column containing four different categories (as values).
    - Notice that the categories `MORNING` and `MOR` are the same
    - Similarly, `AFTERNOON` and `AFT` are the same
- This happens when you collect data from different sources, where the same information is written in different ways
- So the `session` column has four different categories (as values) but should have only two.

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.sample(5)

In [None]:
df['session'].value_counts()

#### Handle the Repeating Values under the `session` Column using `map()`
- To keep the data clean, we will map all these values to only two categories, `MOR` and `AFT`, using the `map()` method.
```
Series.map(arg, na_action=None)
```
- The `map()` method substitutes each value in a Series with another value, which may be derived from a `dict`. It returns a new Series after performing the mapping
- You can pass `na_action='ignore'` as the second argument, which propagates NaN values without passing them to the mapping correspondence.
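
Two properties of `map()` are worth seeing in action before applying it to the real column. In this sketch the `session` Series is made up, and `EVENING` is a hypothetical extra category added to show that a value missing from the dict becomes NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical values; 'EVENING' is deliberately absent from the mapping
session = pd.Series(["MORNING", "MOR", "AFT", "EVENING", np.nan], dtype="object")
mapping = {"MORNING": "MOR", "MOR": "MOR", "AFTERNOON": "AFT", "AFT": "AFT"}

mapped = session.map(mapping)
print(mapped.tolist())   # 'EVENING' has no key in the dict, so it becomes NaN

# With a function instead of a dict, na_action='ignore' leaves NaN untouched
# instead of passing it to the function
titled = session.map(str.title, na_action="ignore")
print(titled.tolist())
```

So the dict should list every category that must survive the mapping, which is why the notebook's dict also maps `MOR` to `MOR` and `AFT` to `AFT`.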

In [None]:
# To do this, let us create a new mapping (dictionary) 
dict1 = {
    'MORNING' : 'MOR',
    'MOR' : 'MOR',
    'AFTERNOON' : 'AFT',
    'AFT': 'AFT',
}

In [None]:
# map() returns a Series with the same index as the caller; the original Series remains unchanged.
# So, in the next cell, we assign the resulting Series back to `df.session`
df.session.map(dict1)  # or df['session'].map(dict1)

In [None]:
df.session = df.session.map(dict1)

In [None]:
# Count of the new categories in the session column
# Observe that the values inside the session column are now consistent
df.session.value_counts()

In [None]:
# Let's verify the result
df.head()

## 6. Create a new Column by Modifying an Existing Column
- We have a `scholarship` column in the dataset, which is in Pak Rupees
- Suppose you want a new column that represents the scholarship in `US Dollars`
- For that, we need to add a new column by dividing each value of `scholarship` by 285 (the exchange rate assumed in this notebook).
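
For a simple arithmetic conversion like this one, dividing the whole column at once gives the same result as `apply()` with a lambda. The sketch below uses made-up rupee amounts and the notebook's rate of 285 to show that both approaches agree, including on missing values:

```python
import pandas as pd

RATE = 285  # PKR per USD, the rate assumed in this notebook

# Hypothetical scholarship amounts in Pak Rupees, with one missing value
scholarship = pd.Series([28500.0, 57000.0, None])

via_apply = scholarship.apply(lambda x: x / RATE)  # element by element
vectorized = scholarship / RATE                    # whole column at once

print(vectorized.tolist())
print(via_apply.equals(vectorized))  # equals() treats NaN in the same position as equal
```

The vectorized form is usually shorter and faster on large columns; `apply()` remains useful when the transformation cannot be written as a column expression.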

In [None]:
df = pd.read_csv('datasets/groupdata.csv')

In [None]:
df.scholarship.apply(lambda x: x/285)

In [None]:
df['Scholarship_in_$'] = df.scholarship.apply(lambda x : x/285)

In [None]:
df.head()

## 7. Delete Rows Having NaN values using `df.dropna()` method

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.sample(5)

In [None]:
df.shape

In [None]:
# Total number of cells in the dataframe (rows * columns)
16*10

In [None]:
df.isna().sum().sum()

In [None]:
# You can use the dropna() method to drop every row that has any NA value
df1 = df.dropna()
df1.shape

In [None]:
# Default Arguments to dropna()
df2 = df.dropna(axis=0, how='any')
df2.shape

In [None]:
# If we set how='all', a row is dropped only if all of its values are NA
df2 = df.dropna(axis=0, how='all')
df2.shape

In [None]:
# Use the subset argument to pass a list of columns; a row is dropped only if it has NA in those columns
df2 = df.dropna(axis=0, how='any', subset=['subj1'])
df2.shape
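
The effect of each `dropna()` option is easier to compare on a tiny dataframe where the answers can be counted by hand. The data below is made up; `thresh` is an additional `dropna()` parameter not used in the cells above, which keeps a row only if it has at least that many non-null values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one value,
# row 2 is entirely missing, row 3 has two values
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
})

any_rows = toy.dropna()                  # keeps only fully complete rows
all_rows = toy.dropna(how="all")         # drops only the entirely missing row
thresh_rows = toy.dropna(thresh=2)       # keeps rows with at least 2 non-null values
subset_rows = toy.dropna(subset=["c"])   # looks only at column c

for name, frame in [("any", any_rows), ("all", all_rows),
                    ("thresh=2", thresh_rows), ("subset=['c']", subset_rows)]:
    print(name, frame.shape, list(frame.index))
```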

## 8. Convert Categorical Variables into Numerical
- Most machine learning algorithms do not accept categorical variables, so we need to convert them into numerical ones.
- We can do this using the Pandas function `pd.get_dummies()`, which creates a binary column for each of the categories.
```
pd.get_dummies(data, drop_first=False)
```
- The only required argument is `data`, which can be a dataframe or a series
- The parameter `drop_first` (bool, default False) controls whether to get k-1 dummies out of k categorical levels by removing the first level.

**Note:** Making dummy variables takes all the `K` distinct values in one column and makes `K` columns out of them
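
A minimal sketch of the `K` versus `K-1` columns on a hypothetical dataframe (the `gender` and `score` values are made up, and the `columns` argument is used here to name the column to encode explicitly):

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "score": [70, 85, 90],
})

# K = 2 distinct values -> 2 dummy columns
full = pd.get_dummies(toy, columns=["gender"])
print(list(full.columns))

# drop_first=True removes the first level (alphabetically 'Female') -> K - 1 columns
reduced = pd.get_dummies(toy, columns=["gender"], drop_first=True)
print(list(reduced.columns))
print(reduced)
```

With only two categories, the single remaining column carries the same information as the pair, since one is the complement of the other.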

#### a. Convert all categorical variables into dummy/indicator variables

In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.columns

In [None]:
# currently we have 10 columns in the data
df.shape

In [None]:
# Convert all categorical variables into dummy/indicator variables
df = pd.get_dummies(df)

In [None]:
df.shape

In [None]:
# Let us view the dataframe; keep a note of the number of columns
pd.set_option('display.max_columns', None)
df.head()

- So we now have 37 columns
- One-hot encoding is a good way to convert your categorical columns to numerical columns
- However, it adds a lot of dimensionality to your data, i.e., it increases the number of columns
- It also becomes difficult to deal with that many columns
- This is a trade-off, which can be handled by a technique called dimensionality reduction
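
The growth in columns can be estimated before encoding by counting the distinct values in each non-numeric column. The sketch below uses a made-up dataframe; the `name` column stands in for any high-cardinality column, which is typically what inflates the column count:

```python
import pandas as pd

# Hypothetical toy data: 'name' has 4 distinct values, 'group' has 3
toy = pd.DataFrame({
    "name": ["Ali", "Sara", "Umar", "Hina"],
    "group": ["A", "B", "A", "C"],
    "score": [70, 85, 90, 60],
})

categorical = toy.select_dtypes(exclude="number")
levels = categorical.nunique()   # distinct values per categorical column
print(levels)

# Each categorical column is replaced by one column per distinct value
expected_width = toy.shape[1] - categorical.shape[1] + int(levels.sum())
encoded = pd.get_dummies(toy)
print(expected_width, encoded.shape[1])
```

Columns such as names or IDs are usually dropped or handled separately rather than one-hot encoded, since they add one column per row value.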

#### b. Perform One-Hot Encoding for Categorical Column `gender` Only
- In our dataframe, the `gender` column is a categorical column having two values, 'male' and 'female'
- `pd.get_dummies()` will create dummy binary columns for it.
- This is also known as `One Hot Encoding`. You will learn more encoding techniques in the data pre-processing module.


In [None]:
df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [None]:
# Convert only the gender variable into dummy/indicator variables
# (use df, the dataframe loaded above, rather than df1 from the dropna() section)
df2 = pd.get_dummies(df[['gender']])
df2.head()

In [None]:
# Since we do not need two separate columns, use the `drop_first` argument of get_dummies to keep only one
df2 = pd.get_dummies(df[['gender']], drop_first=True)
df2.head()

In [None]:
# We will talk about join in the next session in detail.
df3 = df.join(df2['gender_Male'])
df3.head()

## Practice Questions

For the practice questions, we will use the following dataset

In [None]:
import pandas as pd
import numpy as np
dict1 ={
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,np.nan,70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,1983.43,2480.4,250.45, 75.29,3045.6],
'ord_date': ['2012-10-05','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,3003,3002,3001,3001],
'salesman_id':[5002,5003,5001,np.nan,5002,5001,5001,np.nan,5003,5002,5003,np.nan]
}
print(dict1)

In [None]:
df = pd.DataFrame(dict1)
df

#### Write a Pandas program to drop the rows where at least one element is missing in a given DataFrame

In [None]:
df.dropna(how='any')

## Bonus
### Create a heatmap for more information about the distribution of missing values in a given DataFrame.

In [None]:
import seaborn as sns


In [None]:
sns.heatmap(df.isnull(), annot=True);
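
As a numeric complement to the heatmap, the share of missing values per column can be computed with pandas alone. This is a small sketch on a made-up two-column dataframe (the column names echo the practice dataset, but the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: half of ord_no is missing, purch_amt is complete
toy = pd.DataFrame({
    "ord_no": [70001, np.nan, 70002, np.nan],
    "purch_amt": [150.5, 270.65, 65.26, 110.5],
})

# The mean of a Boolean column is the fraction of True values
missing_share = toy.isna().mean()
print(missing_share)

# Columns sorted from most to least missing, as percentages
print((missing_share * 100).sort_values(ascending=False))
```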

## All statistical functions
- `count()`: Returns the number of non-null values
- `sum()`: Returns sum of all values
- `mean()`: Returns the average of all values
- `median()`: Returns the median of all values
- `mode()`: Returns the mode
- `std()`: Returns the standard deviation
- `min()`: Returns the minimum of all values
- `max()`: Returns the maximum of all values
- `abs()`: Returns the absolute value of each element
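
The functions listed above can be tried on a small made-up Series. Note in this sketch that the aggregations skip NaN by default, and that `std()` in pandas is the sample standard deviation (it divides by n - 1):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([-2.0, 4.0, np.nan, 4.0])

print(s.count())         # non-null values only
print(s.sum())
print(s.mean())          # sum / count, NaN skipped
print(s.median())
print(s.mode()[0])
print(s.std())           # sample standard deviation (ddof=1)
print(s.min(), s.max())
print(s.abs().tolist())  # element-wise, so NaN stays NaN
```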

In [None]:
-12.34, abs(-12.34)

In [None]:
# The df.describe() method calculates the count, mean, standard deviation, minimum,
# maximum and percentile values (include='all' also summarises non-numeric columns)
df.describe(include='all')

## Practice Questions Part 2:
- Step 1. Import the necessary libraries
- Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/ehtisham-sadiq/P0Q6R9S2-MLZone/main/Module%2004%20-%20Python%20for%20Data%20Scientists/datasets/Euro_2012_stats_TEAM.csv)
- Step 3. Assign it to a variable called `euro12`.
- Step 4. Select only the Goals column.
- Step 5. How many teams participated in Euro 2012?
- Step 6. What is the number of columns in the dataset?
- Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called `discipline`
- Step 8. Sort the teams by Red Cards, then by Yellow Cards
- Step 9. Calculate the mean Yellow Cards given per Team
- Step 10. Filter teams that scored more than 6 goals
- Step 11. Select the teams that start with G
- Step 12. Select the first 7 columns and all the rows
- Step 13. Select all columns except the last 3
- Step 14. Show only the Shooting Accuracy of England, Italy and Russia

In [None]:
url = "https://raw.githubusercontent.com/ehtisham-sadiq/P0Q6R9S2-MLZone/main/Module%2004%20-%20Python%20for%20Data%20Scientists/datasets/Euro_2012_stats_TEAM.csv"
print(url)

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Display all the columns (uncomment the second line to display all the rows as well)
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [None]:
euro12 = pd.read_csv(url) # Read the dataset

In [None]:
euro12.head() # View the first 5 rows

In [None]:
euro12['Goals']

In [None]:
# how many teams played in euro 2012
# euro12.shape[0]
euro12['Team'].count()

In [None]:
# number of columns
euro12.shape[1] # or  len(euro12.columns)

In [None]:
# assign the columns to discipline 
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]

In [None]:
# Sort discipline by Red Cards, then by Yellow Cards, both in descending order
discipline.sort_values(by=['Red Cards', 'Yellow Cards'], ascending=False)
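
`sort_values()` also accepts one `ascending` flag per sort key, which is useful when the tie-breaker should run in the opposite direction. A minimal sketch on made-up card counts (the teams `A`, `B`, `C` are hypothetical, not from the Euro 2012 data):

```python
import pandas as pd

# Hypothetical discipline table
cards = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Red Cards": [1, 0, 1],
    "Yellow Cards": [5, 9, 3],
})

# Most red cards first; among equal red cards, fewest yellow cards first
ordered = cards.sort_values(by=["Red Cards", "Yellow Cards"],
                            ascending=[False, True])
print(ordered)
```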

In [None]:
# Calculate the mean Yellow Cards given per Team
discipline['Yellow Cards'].mean()

In [None]:
# Boolean mask: True for the teams that scored more than 6 goals
euro12['Goals'] > 6

In [None]:
euro12[euro12['Goals'] > 6]

In [None]:
# Select the teams that start with G

mask = euro12['Team'].str.startswith('G')

In [None]:
euro12[mask]

In [None]:
# Select the first 7 columns and all the rows

euro12.iloc[: , :7]

In [None]:
# Select all columns except the last 3

euro12.iloc[: , :-3]

In [None]:
# Show only the Shooting Accuracy of England, Italy and Russia
# One-line version: euro12.loc[euro12["Team"].isin(["England", "Italy", "Russia"]), ["Shooting Accuracy"]]

mask = euro12['Team'].isin(['England', 'Italy', 'Russia'])

In [None]:
euro12.loc[mask, ['Team', 'Shooting Accuracy']]

# Pandas - Assignment No 01
- Click here to solve [Pandas - Assignment no 01](https://www.kaggle.com/code/ehtishamsadiq/pandas-assignment-no-01)

In [1]:
from IPython.display import HTML

style = """
    <style>
        body {
            background-color: #f2fff2;
        }
        h1 {
            text-align: center;
            font-weight: bold;
            font-size: 36px;
            color: #4295F4;
            text-decoration: underline;
            padding-top: 15px;
        }
        
        h2 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #4A000A;
            text-decoration: underline;
            padding-top: 10px;
        }
        
        h3 {
            text-align: left;
            font-weight: bold;
            font-size: 30px;
            color: #f0081e;
            text-decoration: underline;
            padding-top: 5px;
        }

        
        p {
            text-align: center;
            font-size: 12px;
            color: #0B9923;
        }
    </style>
"""

html_content = """
<h1>Hello</h1>
<p>Hello World</p>
<h2> Hello</h2>
<h3> World </h3>
"""

HTML(style + html_content)