In [5]:
import pandas as pd

df4 = pd.read_csv('loan_data_2015.csv')
df4.head(5)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m
0,68444620,73334399,35000,35000,35000,60 months,11.99,778.38,C,C1,...,35367,49.3,0,1,5020,40.1,52200,1,4,0
1,68547583,73437441,8650,8650,8650,36 months,5.32,260.5,A,A1,...,24041,88.8,0,3,3081,57.9,26800,1,0,5
2,67849662,72708407,4225,4225,4225,36 months,14.85,146.16,C,C5,...,3830,21.9,0,0,367,22.4,4300,0,0,0
3,68506885,73396712,10000,10000,10000,60 months,11.99,222.4,C,C1,...,35354,75.5,1,1,3118,67.4,14200,1,1,1
4,68341763,72928789,20000,20000,20000,60 months,10.78,432.66,B,B4,...,10827,72.8,0,2,2081,64.7,14000,2,5,1


## Data Cleaning

Data cleaning is a crucial part of data analysis, because data with errors and faults distorts the analysis and leads to wrong predictions.

Bad data consists of empty cells, data in the wrong format, wrong data, and duplicates.

Before performing any sort of data analysis, data scientists spend a significant amount of time cleaning the dataset to ensure that it is in the refined state required for various data science techniques to be applied. 

Handling messy data, such as missing values, inconsistent formatting, and incorrect data types, is an essential part of a data scientist's job.
Most of the data cleaning process is covered by these steps:

-> Dropping irrelevant columns.
-> Renaming column names to meaningful names.
-> Making data values consistent.
-> Imputing missing values.
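
The four steps above can be sketched on a tiny DataFrame. This is only an illustration: the column names (`nm`, `cty`, `tmp_id`), the values, and the 'Unknown' placeholder are invented for the example and are not part of the loan dataset.

```python
import pandas as pd

# Hypothetical raw data, invented for this illustration
raw = pd.DataFrame({
    'nm': ['  Alice', 'BOB', None],
    'cty': ['Paris', 'paris ', 'London'],
    'tmp_id': [101, 102, 103],
})

# 1. Drop a column that is irrelevant to the analysis
clean = raw.drop(columns=['tmp_id'])

# 2. Rename columns to meaningful names
clean = clean.rename(columns={'nm': 'Name', 'cty': 'City'})

# 3. Make values consistent: trim whitespace and normalize the case
clean['Name'] = clean['Name'].str.strip().str.title()
clean['City'] = clean['City'].str.strip().str.title()

# 4. Impute missing values with a placeholder
clean['Name'] = clean['Name'].fillna('Unknown')

print(clean)
```

Each step returns a new object, so the raw DataFrame stays untouched and can be compared against the cleaned one.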



## Cleaning Empty Cells

In [6]:
import pandas as pd

df5 = pd.DataFrame(
    [
            ['JOHN SMITH', 'john.smith@gmail.com'],
            ['Jane Doe', 'jdoe@yahoo.com'],
            [None, 'jonathanbyers888@gmail.com'],
            ['joe schmo', 'joeschmo@hotmail.com'],
            ['Jim Hopper', None],
            ['Mike Wheeler', None]
],columns=['Name', 'Email']
)

print(df5)

new_df = df5.dropna()

print('-----------------------------------------------')
print('After dropping rows with missing values, the new DataFrame is:')  # rows at index 2, 4, and 5 are dropped
print('-----------------------------------------------')
print(new_df)

           Name                       Email
0    JOHN SMITH        john.smith@gmail.com
1      Jane Doe              jdoe@yahoo.com
2          None  jonathanbyers888@gmail.com
3     joe schmo        joeschmo@hotmail.com
4    Jim Hopper                        None
5  Mike Wheeler                        None
-----------------------------------------------
After dropping rows with missing values, the new DataFrame is:
-----------------------------------------------
         Name                 Email
0  JOHN SMITH  john.smith@gmail.com
1    Jane Doe        jdoe@yahoo.com
3   joe schmo  joeschmo@hotmail.com


dropna() does not change the original DataFrame; it returns a new DataFrame. To modify the original DataFrame instead, pass the inplace=True argument to dropna().
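
Before dropping anything, it can help to see how many values are missing and to restrict the drop to the columns that matter. The sketch below rebuilds the same sample data under the name `df` (chosen for this example) and uses the `subset` parameter of dropna() to drop only rows without a name.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['JOHN SMITH', 'john.smith@gmail.com'],
        ['Jane Doe', 'jdoe@yahoo.com'],
        [None, 'jonathanbyers888@gmail.com'],
        ['joe schmo', 'joeschmo@hotmail.com'],
        ['Jim Hopper', None],
        ['Mike Wheeler', None],
    ],
    columns=['Name', 'Email'],
)

# Count the missing values in each column before deciding what to drop
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop only the rows where 'Name' is missing; rows missing 'Email' are kept
named_only = df.dropna(subset=['Name'])
print(named_only)

# The original DataFrame still has all six rows
print(len(df), len(named_only))
```

Limiting the drop to one column keeps more rows than a plain dropna(), which removes a row if any of its cells is empty.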

## Replacing The Empty Values

Instead of dropping rows, we can replace the empty values in our DataFrame with a new value.
This way, we don't delete entire rows just because of a few empty cells.
The fillna() method is used to replace the missing values, for instance:

In [7]:
import pandas as pd

df6 = pd.DataFrame(
    [
        ['JOHN SMITH', 'john.smith@gmail.com'],
        ['Jane Doe', 'jdoe@yahoo.com'],
        [None, 'jonathanbyers888@gmail.com'],
        ['joe schmo', 'joeschmo@hotmail.com'],
        ['Jim Hopper', None],
        ['Mike Wheeler', None]
], columns=['Name', 'Email']
)

print(df6)

df6.fillna('Not Given', inplace=True)

print('-----------------------------------------------')
print('After replacing the empty cells, our DataFrame is:')
print('-----------------------------------------------')
print(df6)

           Name                       Email
0    JOHN SMITH        john.smith@gmail.com
1      Jane Doe              jdoe@yahoo.com
2          None  jonathanbyers888@gmail.com
3     joe schmo        joeschmo@hotmail.com
4    Jim Hopper                        None
5  Mike Wheeler                        None
-----------------------------------------------
After replacing the empty cells, our DataFrame is:
-----------------------------------------------
           Name                       Email
0    JOHN SMITH        john.smith@gmail.com
1      Jane Doe              jdoe@yahoo.com
2     Not Given  jonathanbyers888@gmail.com
3     joe schmo        joeschmo@hotmail.com
4    Jim Hopper                   Not Given
5  Mike Wheeler                   Not Given


We can also replace values only in specified columns. For example, here only the missing value in the Name column is replaced with 'Will be updated shortly'; the Email column is left unchanged.

In [8]:
import pandas as pd

df7 = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  [None, 'jonathanbyers888@gmail.com'],
  ['joe schmo', 'joeschmo@hotmail.com'],
  ['Jim Hopper', None],
  ['Mike Wheeler', None]
],
columns=['Name', 'Email'])

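# Two ways to fill a single column:
# 1. In place on the selected column. Recent pandas versions warn about this
#    form and it may leave df7 unchanged, so it is shown only for reference.
# df7['Name'].fillna('Will be updated shortly', inplace=True)
# 2. Assign the filled column back (preferred)
df7['Name'] = df7['Name'].fillna('Will be updated shortly')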

print(df7)

                      Name                       Email
0               JOHN SMITH        john.smith@gmail.com
1                 Jane Doe              jdoe@yahoo.com
2  Will be updated shortly  jonathanbyers888@gmail.com
3                joe schmo        joeschmo@hotmail.com
4               Jim Hopper                        None
5             Mike Wheeler                        None
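
When several columns each need their own replacement value, fillna() also accepts a dictionary that maps column names to fill values. The following is a minimal sketch on a shortened copy of the sample data; the name `df` and the three rows are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['Jane Doe', 'jdoe@yahoo.com'],
        [None, 'jonathanbyers888@gmail.com'],
        ['Jim Hopper', None],
    ],
    columns=['Name', 'Email'],
)

# Map each column name to its own replacement value
filled = df.fillna({'Name': 'Will be updated shortly', 'Email': 'Not Given'})

print(filled)
```

Columns that are not listed in the dictionary are left as they are.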


## Working with Duplicates

Duplicates can enter a dataset for a variety of reasons.
Sometimes they are valid records, but they can also cause errors in your data analysis.
This is why it is essential to detect duplicates in the data and deal with them.

In [9]:
import pandas as pd

df8 = pd.DataFrame.from_dict(
    {
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
    'Date Modified': ['08-06-2022', '01-02-2022', '08-12-2022', '09-12-2022', '01-01-2022', '12-09-2022']
}
)

print(df8)

      Name  Age  Location Date Modified
0   Nikita   33    Mumbai    08-06-2022
1  Katrina   32    London    01-02-2022
2     Evan   40  New York    08-12-2022
3     Kygo   57   Atlanta    09-12-2022
4    Kavya   33    Mumbai    01-01-2022
5     Anne   32     Paris    12-09-2022


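Pandas has a helpful method, duplicated(), which allows you to identify duplicate records in a dataset.
It returns a boolean Series with one value per row: True if the row repeats an earlier row, and False otherwise. By default all columns are compared, so a row is flagged only when it matches an earlier row in every column.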

In [10]:
# Identifying Duplicate Records in a Pandas DataFrame

print(df8.duplicated())

0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
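
Because no two rows match in every column, the result above is all False. To look for duplicates in only some columns, duplicated() takes a `subset` argument, and `keep=False` marks every member of a duplicated group. The sketch below rebuilds the sample data without the date column under the name `df`, chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
})

# Flag rows that repeat an earlier (Age, Location) combination
repeats = df.duplicated(subset=['Age', 'Location'])
print(repeats)

# keep=False flags every member of a duplicated group, including the first
all_members = df.duplicated(subset=['Age', 'Location'], keep=False)
print(df[all_members])

# Summing the boolean Series counts the flagged rows
print(repeats.sum())
```

Here only row 4 is flagged by the default setting, since it shares its age and location with row 0.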


## Removing Duplicates in Pandas DataFrame
Pandas provides a simple method for removing duplicate records: .drop_duplicates().

In [11]:
# df8.drop_duplicates(
#     subset=None,            # Which columns to consider 
#     keep='first',           # Which duplicate record to keep
#     inplace=False,          # Whether to drop in place
#     ignore_index=False      # Whether to relabel the index
# )

In [12]:
df9 = df8.drop_duplicates('Location')
# This returns a DataFrame in which each 'Location' value appears only once; the first row for each location is kept.
print(df9)

      Name  Age  Location Date Modified
0   Nikita   33    Mumbai    08-06-2022
1  Katrina   32    London    01-02-2022
2     Evan   40  New York    08-12-2022
3     Kygo   57   Atlanta    09-12-2022
5     Anne   32     Paris    12-09-2022
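
The `keep` and `ignore_index` parameters listed in the signature above change which row survives and how the result is indexed. A small sketch, again on a copy of the sample data named `df` for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nikita', 'Katrina', 'Evan', 'Kygo', 'Kavya', 'Anne'],
    'Age': [33, 32, 40, 57, 33, 32],
    'Location': ['Mumbai', 'London', 'New York', 'Atlanta', 'Mumbai', 'Paris'],
})

# Keep the last row for each location and relabel the index from 0
deduped = df.drop_duplicates(subset='Location', keep='last', ignore_index=True)

print(deduped)
```

With keep='last', the Mumbai row for Kavya is kept instead of the one for Nikita, and ignore_index=True removes the gap that would otherwise be left in the index.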


## Cleaning Strings in Pandas

One of the most helpful things about pandas is that it has many methods and attributes for dealing with text data, which is useful in areas such as Natural Language Processing (NLP).
These string methods are available through the .str accessor, which applies them to an entire column of data at once.
In this tutorial, we are going to learn how to trim whitespace, split strings into columns, and replace text in strings.

In [13]:
import pandas as pd

df10 = pd.DataFrame.from_dict({
    'Name': ['Shivansh, Gupta', 'Sonia, Abhel', 'Soumya, Gupta', 'Vasu, Tiwari', 'Aravind, Mishra'],
    'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
    'Location': ['Mumbai', 'New Delhi', 'Hyderabad', 'Bangalore', 'Mumbai'],
    'Favorite Color': ['   green  ', 'red', '  yellow', 'blue', 'purple  ']
})

print(df10)

              Name    Region   Location Favorite Color
0  Shivansh, Gupta  Region A     Mumbai        green  
1     Sonia, Abhel  Region A  New Delhi            red
2    Soumya, Gupta  Region B  Hyderabad         yellow
3     Vasu, Tiwari  Region C  Bangalore           blue
4  Aravind, Mishra  Region D     Mumbai       purple  


## Removing White Space in Pandas

We can remove extra whitespace from text in pandas. Pandas provides the .lstrip() and .rstrip() methods to remove whitespace from the front and back of strings, respectively. Because the values in the 'Favorite Color' column above have whitespace on both sides, we will use the .strip() method, which removes it from both ends.

In [14]:
df10['Favorite Color'] = df10['Favorite Color'].str.strip()

print(df10)

              Name    Region   Location Favorite Color
0  Shivansh, Gupta  Region A     Mumbai          green
1     Sonia, Abhel  Region A  New Delhi            red
2    Soumya, Gupta  Region B  Hyderabad         yellow
3     Vasu, Tiwari  Region C  Bangalore           blue
4  Aravind, Mishra  Region D     Mumbai         purple
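
For comparison, the one-sided variants can be tried on the same color values. This sketch puts them in a standalone Series named `colors`, a name chosen for this example.

```python
import pandas as pd

colors = pd.Series(['   green  ', 'red', '  yellow', 'blue', 'purple  '])

# lstrip() removes whitespace only from the front of each string
left_trimmed = colors.str.lstrip()

# rstrip() removes whitespace only from the back of each string
right_trimmed = colors.str.rstrip()

print(left_trimmed.tolist())
print(right_trimmed.tolist())
```

The lists are printed rather than the Series so that the remaining spaces are visible inside the quotes.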


## Replacing Text using Pandas

Our DataFrame has a column named Region in which every value begins with the word 'Region', which is redundant.
We are going to remove it with the string .replace() method.

In [15]:
df10['Region'] = df10['Region'].str.replace('Region ', '')

print(df10)

              Name Region   Location Favorite Color
0  Shivansh, Gupta      A     Mumbai          green
1     Sonia, Abhel      A  New Delhi            red
2    Soumya, Gupta      B  Hyderabad         yellow
3     Vasu, Tiwari      C  Bangalore           blue
4  Aravind, Mishra      D     Mumbai         purple
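
The .str.replace() method can treat its pattern either as literal text or as a regular expression, controlled by the `regex` argument. The sketch below uses made-up region labels with inconsistent spacing and case to show the difference; the Series `regions` and its values are assumptions for this illustration.

```python
import pandas as pd

# Hypothetical labels with an extra space and a lower-case variant
regions = pd.Series(['Region A', 'Region  B', 'region C'])

# Literal replacement: removes exactly the text 'Region '
literal = regions.str.replace('Region ', '', regex=False)

# Regular expression: case-insensitive, removes 'region' plus any whitespace
pattern = regions.str.replace(r'(?i)^region\s+', '', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Passing `regex` explicitly is a cautious choice, since the default for this argument has differed between pandas versions.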


## Splitting Strings into Columns in Pandas

Our DataFrame has a column named 'Name' that holds the first and last name, separated by a ',' character.
We are going to create two new columns for the first and last name and remove the original 'Name' column. To do this, we use the string .split() method, which splits each string on the separator passed as an argument. We also need to pass the expand=True argument to instruct pandas to return the pieces as separate columns.

In [16]:
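# Repeated from the cells above so that this cell also works on its own
df10['Favorite Color'] = df10['Favorite Color'].str.strip()
df10['Region'] = df10['Region'].str.replace('Region ', '')
# Split on ', ' (comma plus space) so the last name has no leading space
df10[['First Name', 'Last Name']] = df10['Name'].str.split(', ', expand=True)
df10 = df10.drop(['Name'], axis=1)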

print(df10)

  Region   Location Favorite Color First Name Last Name
0      A     Mumbai          green   Shivansh     Gupta
1      A  New Delhi            red      Sonia     Abhel
2      B  Hyderabad         yellow     Soumya     Gupta
3      C  Bangalore           blue       Vasu    Tiwari
4      D     Mumbai         purple    Aravind    Mishra
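
When the spacing around the separator is not consistent, one option is to split on the bare comma and strip each resulting column afterwards. The sketch below uses a standalone Series named `names` with deliberately uneven spacing; both the name and the values are assumptions for this example.

```python
import pandas as pd

# Hypothetical names with inconsistent spacing around the comma
names = pd.Series(['Shivansh, Gupta', 'Sonia,Abhel', 'Soumya ,  Gupta'])

# n=1 limits the split to the first comma, so at most two columns come back
parts = names.str.split(',', n=1, expand=True)
parts.columns = ['First Name', 'Last Name']

# Strip the stray spaces left around each piece
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Limiting the split with n=1 also guards against a later comma producing a third column.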


In addition to these string methods, pandas offers case-conversion methods through the same .str accessor:

.str.upper() converts each string to upper case
.str.lower() converts each string to lower case
.str.title() converts each string to title case
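
As a brief sketch, the case methods can be applied to names written in mixed case, like those in the earlier examples. The Series `names` is created only for this illustration.

```python
import pandas as pd

names = pd.Series(['JOHN SMITH', 'Jane Doe', 'joe schmo'])

upper_names = names.str.upper()
lower_names = names.str.lower()
title_names = names.str.title()

print(upper_names.tolist())
print(lower_names.tolist())
print(title_names.tolist())
```

Title case is often a reasonable choice for making names consistent, though it will not handle every real-world name correctly.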