# ADVANCED PANDAS: DATA PREPROCESSING

## Course Outline:
- ***Introduction to Data Wrangling***
    - ***Case-study: Data Preprocessing for The Absolute Beginners***
- ***Data Cleaning & Preparation***
    - ***Data Cleaning (Missing & Duplicated Data)***
    - ***String Manipulation (Regular Expression)***
    - ***Data Transformation***
- Merging, Joining, and Concatenating Data
    - concat()
    - merge()
    - join()
- Aggregation and Grouping
    - groupby()
- Reshaping and Pivoting
    - pivot()
    - pivot_table()
    - crosstab()

==========

# *Introduction to Data Wrangling*

## Data Wrangling (Munging) Basics
Data wrangling is the process of taking disorganized or incomplete raw data and standardizing it so that you can easily access, consolidate, and analyze it (i.e., improving its signal-to-noise ratio, SNR). The steps are as follows:
- Discovering (Understanding Data)
- Structuring (Features Splitting, Tidy-data)
- Cleaning (Missing Data, Outlier Detection, Duplicate Removal)
- Enriching (Merging, Concatenation)
- Validating (Data Types)
- Publishing (Readiness for Analysis & Visualization)

In [None]:
from IPython.display import Image
Image("data/preprocessing.png")

### Resources:
- Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- User Guide (10 Minutes Pandas): https://pandas.pydata.org/docs/user_guide/10min.html
- Exercises: https://www.w3resource.com/python-exercises/pandas/index.php

## Case-study: Data Preprocessing for The Absolute Beginners

Our client is a credit card company. They have brought us a dataset that includes some demographics and recent financial data (the past six months) for a sample of 30,000 of their account holders.

Data Source (Modified): https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#

### Step #0: Importing the Libraries

In [None]:
import numpy as np

In [None]:
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
# import matplotlib as mpl #additional plotting functionality
# mpl.rcParams['figure.dpi'] = 400 #high resolution figures

### Step #1: Loading the Case Study Data

In [None]:
df = pd.read_excel('data/credit-card-clients.xls')
df

### Step #2: Verifying Basic Data Integrity
We will first check that the dataset contains what we expect and that it has the correct number of samples.

### Inspecting Properties

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.columns

### Step #3: Data Preprocessing

### Finding & Dealing with Duplicated Rows

In [None]:
# Select the target column (ID) and count unique values
df['ID'].nunique()

In [None]:
# This will list the unique IDs and how often they occur
id_counts = df['ID'].value_counts()
id_counts.head()

In [None]:
id_counts.value_counts()

In [None]:
dupe_mask = id_counts == 2

In [None]:
# dupe_mask is a boolean Series (True where an ID occurs twice); display its first 5 entries
dupe_mask[0:5]

In [None]:
id_counts.index[0:5]

In [None]:
dupe_ids = id_counts.index[dupe_mask]

In [None]:
# Convert dupe_ids to a list and then obtain the length of the list
dupe_ids = list(dupe_ids)
len(dupe_ids)

In [None]:
dupe_ids[0:5]

Using the first three IDs on our list of dupes, `dupe_ids[0:3]`, we first find the rows containing these IDs. Passing this list to the `.isin` method of the ID Series creates another logical mask, which we can apply to the full DataFrame to display the rows that have these IDs.

In [None]:
# This is just for checking the data
df.loc[df['ID'].isin(dupe_ids[0:3]),:].head(10)
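
The mask-building steps above can be tried on a toy frame. This is only a sketch: the frame `toy` and its column `x` are made up for illustration and are not part of the case-study data.

```python
import pandas as pd

# Hypothetical toy data: ID "b" appears twice, once as an all-zero row
toy = pd.DataFrame({
    "ID": ["a", "b", "b", "c"],
    "x": [1000, 2000, 0, 3000],
})

id_counts = toy["ID"].value_counts()
dupe_ids = list(id_counts.index[id_counts == 2])   # IDs seen exactly twice

# Boolean mask: True where the row's ID is one of the duplicated IDs
rows_with_dupes = toy.loc[toy["ID"].isin(dupe_ids), :]
print(rows_with_dupes)
```

The same pattern scales to the full DataFrame because `.isin` is evaluated for every row at once.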

We can see some duplicates here, and it looks like every duplicate ID has one row with data, and another row with all zeros. Is this the case for every duplicate ID? Let's check.

In [None]:
df.shape

In [None]:
df_zero_mask = df == 0
# df_zero_mask

In [None]:
feature_zero_mask = df_zero_mask.iloc[:,1:].all(axis=1)
# feature_zero_mask

In [None]:
sum(feature_zero_mask)

It looks like there are at least as many "zero rows" as there are duplicate IDs. Let's remove all the rows with all zero features and response, and see if that gets rid of the duplicate IDs.

In [None]:
# Clean the DataFrame by eliminating the rows with all zeros, except for the ID
df_clean_1 = df.loc[~feature_zero_mask,:].copy()

In [None]:
df_clean_1.shape

In [None]:
df_clean_1['ID'].nunique()
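
As a minimal sketch of the zero-row clean-up, assuming a made-up frame with hypothetical columns `feat_1` and `feat_2`: a row is dropped only when every column except the ID is zero, and afterwards the IDs should be unique.

```python
import pandas as pd

# Hypothetical toy data; the second "b" row is all zeros apart from its ID
toy = pd.DataFrame({
    "ID": ["a", "b", "b", "c"],
    "feat_1": [5, 7, 0, 0],
    "feat_2": [1, 2, 0, 3],
})

# True only where every column except ID equals zero
all_zero = (toy.iloc[:, 1:] == 0).all(axis=1)
toy_clean = toy.loc[~all_zero, :].copy()

# After cleaning, every ID should appear exactly once
ids_unique = toy_clean["ID"].is_unique
print(toy_clean.shape, ids_unique)
```

Note that row "c" survives even though one of its features is zero, because `all(axis=1)` requires all features to be zero.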

### Finding & Dealing with Missing Data

In [None]:
df_clean_1.info()

In [None]:
df_clean_1.head()

In [None]:
df_clean_1['PAY_1'].head(5)

In [None]:
df_clean_1['PAY_1'].value_counts()

Let's now throw out these missing values. They were initially hidden from us in the `.info()` output because they are stored as the string 'Not available' rather than as true missing values.

In [None]:
valid_pay_1_mask = df_clean_1['PAY_1'] != 'Not available'

In [None]:
valid_pay_1_mask[0:5]

In [None]:
sum(valid_pay_1_mask)

In [None]:
df_clean_2 = df_clean_1.loc[valid_pay_1_mask,:].copy()

In [None]:
df_clean_2.shape

In [None]:
df_clean_2['PAY_1'].value_counts()

In [None]:
df_clean_2['PAY_1'] = df_clean_2['PAY_1'].astype('int64')

In [None]:
df_clean_2[['PAY_1', 'PAY_2']].info()
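
An alternative to filtering on the sentinel string is to coerce the column to numeric, so that the hidden gaps become real NaN values. The following is a sketch under the assumption that a small made-up Series behaves like `PAY_1`; it is not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical stand-in for a column that mixes integers with a sentinel string
pay = pd.Series([0, -1, "Not available", 2, "Not available"], name="PAY_1")

# Coerce anything non-numeric to NaN so that isna() and info() can see it
pay_num = pd.to_numeric(pay, errors="coerce")
n_missing = int(pay_num.isna().sum())

# Drop the gaps, then restore an integer dtype
pay_clean = pay_num.dropna().astype("int64")
print(n_missing, pay_clean.tolist())
```

The trade-off is that `errors="coerce"` silently turns any unexpected string into NaN, so it is worth checking `value_counts()` first, as done above.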

==========

# *1] Data Cleaning & Preparation*

## Data Cleaning
- Detecting Missing Values
- Dealing with Missing Values
    - Removing Missing Data
    - Replacing Missing Data
- Data with Duplication
    - Detection of Duplicates
    - Handling Duplicates
- Outliers Detection / Handling

##### Importing Libraries & Data

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
# Using well-known 'titanic' dataset
titanic = sns.load_dataset('titanic')
titanic

### Detecting & Dealing with Missing Values

##### Detecting Missing Values Using isna()

In [None]:
titanic.head()

In [None]:
titanic.tail()

In [None]:
titanic.info()

In [None]:
# Listing all missing data
titanic.isna()

In [None]:
# Listing all non-missing data (the opposite)
titanic.notna()

In [None]:
titanic[titanic.isna().values]

In [None]:
# Find the total number of missing data
titanic.isna().sum()

In [None]:
# Checking if the feature has missing values or not
titanic.isna().any(axis=0)

In [None]:
# Returning all records with missing data
titanic[titanic.isna().any(axis=1)]

In [None]:
# Let's visualize missing data
plt.figure(figsize=(20,10))
sns.heatmap(titanic.isna())

In [None]:
titanic[titanic.embarked.isna()]

##### Removing Missing Data Using dropna()

In [None]:
titanic['age'].isna().sum()

In [None]:
titanic.shape

In [None]:
titanic.dropna().shape

In [None]:
# Drop rows with a missing 'age' (how/thresh are mutually exclusive; inplace=True would return None)
titanic.dropna(axis=0, subset=['age']).shape
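
To compare the main `dropna()` options side by side, here is a small sketch on a made-up frame (the values are invented; only the column names echo the titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in 'age' and 'deck'
toy = pd.DataFrame({
    "age":  [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, np.nan, np.nan, "C"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

rows_any = toy.dropna()                       # drop rows containing any NaN
rows_age = toy.dropna(subset=["age"])         # drop rows only where 'age' is NaN
cols_thresh = toy.dropna(axis=1, thresh=3)    # keep columns with at least 3 non-null values

print(len(rows_any), len(rows_age), list(cols_thresh.columns))
```

Here `dropna()` with no arguments removes every row, which is why `subset` or `thresh` is usually the safer choice on sparse data.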

##### Handling Missing Data Using fillna() function

In [None]:
titanic['age'].isna().sum()

In [None]:
titanic['age'].mean(skipna=True)

In [None]:
titanic['age'] = titanic['age'].fillna(round(titanic['age'].mean(skipna=True), 2))

In [None]:
# Forward-fill across columns (the method= argument of fillna is deprecated); result is not assigned
titanic.ffill(axis=1)

In [None]:
titanic['age'].isna().sum()
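
Filling with the mean is one of several options. The sketch below, on a made-up Series, contrasts it with the median and with a forward fill; which one is appropriate depends on the data and is not prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps and one large value
age = pd.Series([20.0, np.nan, 40.0, np.nan, 90.0])

filled_mean = age.fillna(age.mean())       # mean of 20, 40, 90 is 50
filled_median = age.fillna(age.median())   # median is 40, less sensitive to the 90
filled_ffill = age.ffill()                 # carry the previous value forward

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_ffill.tolist())
```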

### Data with Duplication

##### Finding Duplicated Data Using duplicated() function

In [None]:
titanic.head()

In [None]:
titanic.info()

In [None]:
titanic.duplicated(subset=None)

In [None]:
titanic[titanic.duplicated(subset=None)]

In [None]:
titanic.duplicated(keep='first').sum()

In [None]:
titanic[titanic.duplicated()].sort_values(by='fare')

In [None]:
titanic.duplicated(subset=['survived','pclass']).sum()

In [None]:
titanic[titanic.duplicated()]

##### Dealing with Duplicated Data

In [None]:
titanic.drop(index=[870,877])

In [None]:
titanic.drop_duplicates(['fare'], keep='first')

In [None]:
titanic.drop_duplicates(ignore_index=True)
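
The effect of the `keep` argument is easiest to see on a tiny frame. This is an illustrative sketch with invented data.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({"name": ["a", "b", "a", "a"], "fare": [1, 2, 1, 1]})

first_mask = toy.duplicated(keep="first")   # marks later repeats only
all_mask = toy.duplicated(keep=False)       # marks every member of a duplicate group

# Keeps the first occurrence of each row and renumbers the index from 0
deduped = toy.drop_duplicates(ignore_index=True)

print(first_mask.tolist(), all_mask.tolist(), len(deduped))
```

Using `keep=False` is handy for inspecting all copies of a duplicated record before deciding which one to drop.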

### Outliers Detection & Handling

##### Finding & Dealing with Outliers

In [None]:
titanic.plot(subplots = True, figsize = (15,10))
plt.show()

In [None]:
titanic.head()

In [None]:
titanic.describe().round(2)

In [None]:
titanic.boxplot('age')

In [None]:
(titanic['age'] > 60).sum()

In [None]:
titanic.loc[titanic['age'] > 60]

##### Handling / Removing Outliers

In [None]:
titanic['fare'].sort_values(ascending=False)

In [None]:
titanic[titanic['fare'] > 300]

In [None]:
titanic.loc[titanic['fare'] > 300, 'fare'] = titanic['fare'].mean()

In [None]:
(titanic['fare'] > 300).sum()

In [None]:
titanic.iloc[679]

In [None]:
titanic.boxplot('fare')
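
The cells above pick the outlier threshold by eye. A common alternative, which is not the method used in this notebook, is the interquartile-range (IQR) rule that box plots themselves draw. The sketch below assumes made-up fares and the conventional 1.5 multiplier.

```python
import pandas as pd

# Hypothetical fares with one extreme value
fare = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 500.0])

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the "fences"

is_outlier = (fare < lower) | (fare > upper)

# One handling option: cap extreme values at the fences instead of overwriting them with the mean
fare_capped = fare.clip(lower=lower, upper=upper)

print(int(is_outlier.sum()), fare_capped.max())
```

Capping keeps the row in the dataset while limiting its influence, whereas replacing with the mean (as above) discards the information that the fare was high.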

==========

## String Manipulation (Regular Expressions)
- Python String Functions Overview
- Vectorized String Operations
- Dealing with Categorical Data
- Regular Expressions Basics

##### Python Strings Functions

| Method         	| Description                                                                                   	|
|----------------	|-----------------------------------------------------------------------------------------------	|
| capitalize()   	| Converts the first character to upper case                                                    	|
| casefold()     	| Converts string into lower case                                                               	|
| center()       	| Returns a centered string                                                                     	|
| count()        	| Returns the number of times a specified value occurs in a string                              	|
| encode()       	| Returns an encoded version of the string                                                      	|
| endswith()     	| Returns true if the string ends with the specified value                                      	|
| expandtabs()   	| Sets the tab size of the string                                                               	|
| find()         	| Searches the string for a specified value and returns the position of where it was found      	|
| format()       	| Formats specified values in a string                                                          	|
| format_map()   	| Formats specified values in a string                                                          	|
| index()        	| Like find(), but raises ValueError if the specified value is not found                        	|
| isalnum()      	| Returns True if all characters in the string are alphanumeric                                 	|
| isalpha()      	| Returns True if all characters in the string are in the alphabet                              	|
| isdecimal()    	| Returns True if all characters in the string are decimals                                     	|
| isdigit()      	| Returns True if all characters in the string are digits                                       	|
| isidentifier() 	| Returns True if the string is an identifier                                                   	|
| islower()      	| Returns True if all characters in the string are lower case                                   	|
| isnumeric()    	| Returns True if all characters in the string are numeric                                      	|
| isprintable()  	| Returns True if all characters in the string are printable                                    	|
| isspace()      	| Returns True if all characters in the string are whitespaces                                  	|
| istitle()      	| Returns True if the string follows the rules of a title                                       	|
| isupper()      	| Returns True if all characters in the string are upper case                                   	|
| join()         	| Joins the elements of an iterable into one string, using this string as the separator         	|
| ljust()        	| Returns a left justified version of the string                                                	|
| lower()        	| Converts a string into lower case                                                             	|
| lstrip()       	| Returns a left trim version of the string                                                     	|
| maketrans()    	| Returns a translation table to be used in translations                                        	|
| partition()    	| Returns a tuple where the string is parted into three parts                                   	|
| replace()      	| Returns a string where a specified value is replaced with a specified value                   	|
| rfind()        	| Searches the string for a specified value and returns the last position of where it was found 	|
| rindex()       	| Like rfind(), but raises ValueError if the specified value is not found                       	|
| rjust()        	| Returns a right justified version of the string                                               	|
| rpartition()   	| Returns a tuple where the string is parted into three parts                                   	|
| rsplit()       	| Splits the string at the specified separator, and returns a list                              	|
| rstrip()       	| Returns a right trim version of the string                                                    	|
| split()        	| Splits the string at the specified separator, and returns a list                              	|
| splitlines()   	| Splits the string at line breaks and returns a list                                           	|
| startswith()   	| Returns true if the string starts with the specified value                                    	|
| strip()        	| Returns a trimmed version of the string                                                       	|
| swapcase()     	| Swaps cases, lower case becomes upper case and vice versa                                     	|
| title()        	| Converts the first character of each word to upper case                                       	|
| translate()    	| Returns a translated string                                                                   	|
| upper()        	| Converts a string into upper case                                                             	|
| zfill()        	| Fills the string with a specified number of 0 values at the beginning                         	|
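
A few of the methods in the table, chained on a single made-up string, as a quick sketch of how they combine:

```python
# Hypothetical raw value with stray whitespace and inconsistent case
name = "  mazen, mariam "

cleaned = name.strip().title()        # trim whitespace, capitalise each word
last, first = cleaned.split(", ")     # split on the comma separator
rejoined = " ".join([first, last])    # the string the method is called on is the separator

pos_find = cleaned.find("x")          # find() returns -1 when nothing matches
try:
    cleaned.index("x")                # index() raises ValueError instead
    index_raised = False
except ValueError:
    index_raised = True

print(cleaned, "|", rejoined, "|", pos_find, index_raised)
```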

##### Vectorized String Operations

In [None]:
data = {'Name': ['Mustafa, Ahmed S.', 'Othman, mustafa M.', 'Mazen, Mariam ', 'Burhan, Saddik', 'Abdullah, Omnia N.', 'Jalil, Mustafa'],
       'Age': [26, 34, 18, 36, 28, 38],
       'Country': ['UAE', 'EGY', 'EGY', 'ERI', 'KSA', 'MAR'],
       'M/F': ['M','M','F','M','F', 'M'],
       'Email': ['a.mustafa@teqanny.com', 'm.othman@raqameyyah.com', 'm.mazen@teqanny.com','s.burhan@teqanny.com','o.nasser@teqanny.com','m.jalil@teqanny.com'],
       'Buy': ['Yes', 'No', 'no','Yes','No','Yes']}

students = pd.DataFrame(data)
students

In [None]:
# Using 'str' for vectorized string operations

students.Email.str.len()
# students['Email'].apply(len)

In [None]:
students.loc[students.Name.str.startswith('M')]

In [None]:
students.Name.str.find('Mustafa')

In [None]:
# String splitting
students['Name'].str.split(', ')

In [None]:
students['Name'].str.split(', ')[1][0]

In [None]:
# Getting students' last names (the part before the comma)
students['Name'].str.split(', ').str.get(0)

In [None]:
# Getting students' first names
students['Name'].str.split(', ').str.get(1)

In [None]:
students['First Name'] = students['Name'].str.split(', ').str.get(1)
students['Last Name'] = students['Name'].str.split(', ').str.get(0)

In [None]:
students

In [None]:
students['Name'].str.split(expand=True)

In [None]:
students[['Last Name', 'First Name']] = students.Name.str.split(', ', expand=True)

In [None]:
students

In [None]:
# Matching a specific value in a feature
students[students.Country.str.match('EGY')]

In [None]:
# Concatenating two features
students['M/F'].str.cat(students['Age'].astype(str), sep='_')

In [None]:
# Searching for a specific record
students[(students.Name.str.contains('Mustafa')) & (students.Age >= 35)]

In [None]:
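# Replacing values (anchored patterns, so re-running the cell cannot turn 'Male' into 'Maleale')
students['M/F'] = students['M/F'].str.replace(r'^F$', 'Female', regex=True).str.replace(r'^M$', 'Male', regex=True)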
students
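
Real columns often contain stray whitespace, inconsistent case and missing entries. The sketch below uses a made-up Series (modelled on the names above, plus one NaN) to show that the `.str` methods skip missing values rather than failing.

```python
import numpy as np
import pandas as pd

# Hypothetical names: one trailing space, one lowercase entry, one missing value
names = pd.Series(["Mustafa, Ahmed S.", "Mazen, Mariam ", np.nan, "othman, mustafa M."])

tidy = names.str.strip().str.title()                    # NaN passes through untouched
has_mustafa = tidy.str.contains("Mustafa", na=False)    # na=False keeps the mask purely boolean
parts = tidy.str.split(", ", expand=True)               # column 0 = last name, column 1 = first name

print(tidy.tolist())
print(has_mustafa.tolist())
```

Passing `na=False` matters when the mask is used for row selection, since a mask containing NaN cannot be used as a boolean indexer.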

##### Dealing with Categorical Data

In [None]:
students['Buy'].dtype

In [None]:
students.convert_dtypes()

In [None]:
students.info()

In [None]:
students['Buy'] = students['Buy'].astype('category')

In [None]:
students.info()

In [None]:
students['Buy'] = students['Buy'].cat.rename_categories({'No':'N', 'Yes':'Y'})
# cat.set_categories()
# cat.add_categories()
# cat.remove_categories()
# cat.reorder_categories()

In [None]:
students['Buy'].dtype

In [None]:
students

In [None]:
# Label-Encoding
students['Buy'].cat.codes

In [None]:
# One-Hot-Encoding
pd.get_dummies(students, columns=['Buy'])

In [None]:
# Inspect the categories: inconsistent labels (e.g. a lowercase 'no') are a typical pitfall
students['Buy'].cat.categories

In [None]:
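# Fix a misspelled label; assign the result back instead of using inplace=True on a column selection
students['Buy'] = students['Buy'].astype(str).replace({'Noo':'No'})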

In [None]:
# str.title() takes no arguments and returns a new Series, so assign it back
students['Buy'] = students['Buy'].str.title()

In [None]:
students
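
One way to avoid the label pitfalls above is to normalise the strings before converting to a categorical. This is a sketch on a made-up Series, not a fixed recipe.

```python
import pandas as pd

# Hypothetical answers with inconsistent case and whitespace
buy = pd.Series(["Yes", "No", "no", "Yes", " NO", "Yes"])

# Normalise first, so that 'no', ' NO' and 'No' collapse into one label
buy_norm = buy.str.strip().str.title()
buy_cat = buy_norm.astype("category")

categories = list(buy_cat.cat.categories)   # labels in sorted order
codes = buy_cat.cat.codes.tolist()          # integer code per row (label encoding)

print(categories, codes)
```

Converting to `category` before normalising would instead create one category per spelling, which then leaks into label and one-hot encodings.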

##### Regular Expressions Basics (RegEx)
RegEx functions fall into three categories: pattern matching, substitution, and splitting.

- RegEx Cheat-sheet: https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf
- A useful tool: https://regex101.com/
- Exercises: https://www.geeksforgeeks.org/tag/python-regex-programs/

In [None]:
from IPython.display import Image
Image("data/regex.png")

In [None]:
# loading regular expression library 're'
import re

In [None]:
text = 'foo    bar\t baz    \tqux'

In [None]:
# Raw strings (r'...') stop Python from treating \s as a string escape sequence
re.split(r'\s+', text)

# regex = re.compile(r'\s+')
# regex.split(text)

In [None]:
re.findall(r'\s+', text)
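
The three categories of RegEx functions (splitting, pattern matching, substitution) can be shown with one compiled pattern. A short sketch, reusing the same sample text:

```python
import re

text = "foo    bar\t baz    \tqux"

ws = re.compile(r"\s+")          # compile once, reuse below

tokens = ws.split(text)          # splitting
gaps = ws.findall(text)          # pattern matching
squeezed = ws.sub(" ", text)     # substitution: collapse each whitespace run to one space

print(tokens)
print(len(gaps))
print(squeezed)
```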

In [None]:
# A regular expression for matching email addresses
email_regex = r'\w\S*@.*raqameyyah'

In [None]:
# Find specific emails (works on Series)
students['Email'].str.findall(email_regex)

# re.findall(email_regex, 'm,othman@raqameyyah')

In [None]:
# Return all students who have 'Mustafa' as their first names
students.loc[students['Name'].str.contains(r"\s[mM]ustafa")]
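
Beyond finding matches, capture groups can pull pieces of a string into separate columns. The sketch below uses `str.extract` on a made-up Series; the pattern is a simplification for illustration and is not a full email validator.

```python
import pandas as pd

# Hypothetical addresses, including one that is not an email
emails = pd.Series(["a.mustafa@teqanny.com", "m.othman@raqameyyah.com", "not-an-email"])

# Two named groups: the part before '@' and the domain after it
pattern = r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)"
parts = emails.str.extract(pattern)      # one column per group; NaN where there is no match

is_raqameyyah = emails.str.contains(r"@raqameyyah\.", regex=True)

print(parts)
print(is_raqameyyah.tolist())
```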

==========

## Data Transformation
- Shuffling Data
- Mapping
- Discretization
- Normalizing, Standardization & Scaling 

##### Shuffling Data Using sample()

In [None]:
titanic.sample()

In [None]:
titanic_sample = titanic.sample(10).reset_index(drop=True)
titanic_sample

##### Mapping Using map() Function

In [None]:
titanic.sample(5)

In [None]:
titanic['pclass'] = titanic['pclass'].map({1:'First', 2:'Second', 3:'Third'})

In [None]:
titanic.head()

##### Discretization & Binning Using cut() Function

In [None]:
# Grouping people by their ages' ranges
pd.cut(titanic['age'], bins=[0,10,18,30,45,65,100], precision=2).value_counts()
# pd.cut(titanic['age'], 6).value_counts()

In [None]:
titanic['ages_ranges'] = pd.cut(titanic['age'], bins=[0,10,18,30,45,65,100],
                                labels=['Child', 'Teenager', 'Youth', 'Adult', 'MiddleAged', 'Senior'])
titanic

In [None]:
sns.countplot(x='ages_ranges', data=titanic)
# titanic['ages_ranges'].hist()

In [None]:
# Calculate the survival rate for each age range (observed=True skips empty categories)
titanic.groupby('ages_ranges', observed=True)['survived'].mean()

In [None]:
# Discretize a variable into equal-sized (quantile-based) buckets
pd.qcut(titanic['fare'], q=3, labels=['Cheap','Normal','Expensive']).value_counts()
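
The difference between `cut()` (equal-width bins) and `qcut()` (equal-count bins) shows up clearly on skewed data. A sketch with invented fares:

```python
import pandas as pd

# Hypothetical fares: eight small values and one very large one
fare = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["Cheap", "Normal", "Expensive"]

# cut: equal-width bins, so the single large value leaves the middle bin empty
by_width = pd.cut(fare, bins=3, labels=labels)

# qcut: quantile-based bins, so each bucket gets about the same number of rows
by_count = pd.qcut(fare, q=3, labels=labels)

width_counts = by_width.value_counts().to_dict()
count_counts = by_count.value_counts().to_dict()

print(width_counts)
print(count_counts)
```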

##### Scaling & Standardization

In [None]:
titanic.describe()

In [None]:
plt.figure(figsize=(20,10))
titanic['fare'].plot()
plt.show()

In [None]:
titanic['fare'] = ((titanic['fare'] - titanic['fare'].mean()) / titanic['fare'].std()).round(2)
titanic

In [None]:
titanic.describe().round(2)
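
The cell above standardizes `fare` with a z-score. For comparison, the sketch below applies both z-score standardization and min-max scaling to a made-up Series; min-max scaling is shown as an alternative and is not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical fares
fare = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization (z-score): mean 0, standard deviation 1
z = (fare - fare.mean()) / fare.std()      # pandas uses the sample std (ddof=1)

# Min-max scaling: squeeze values into the [0, 1] range
minmax = (fare - fare.min()) / (fare.max() - fare.min())

print(z.round(2).tolist())
print(minmax.tolist())
```

Z-scores keep outliers far from zero, while min-max scaling compresses everything else when a single extreme value sets the maximum.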

==========

# THANK YOU!