# Exploring the data
Let's import pandas and the db_utils code we've written.

In [4]:
import db_utils as dbu
import pandas as pd

Now we can load the CSV file into a dataframe.

In [5]:
df = dbu.csv_to_df('loan_payments.csv')
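The body of `csv_to_df` isn't shown in this notebook. As a minimal sketch, assuming it is a thin wrapper around `pd.read_csv` and that the first CSV column holds the saved row index (both are assumptions, and the sample data here is made up for illustration), it might look like this:

```python
import io

import pandas as pd


def csv_to_df(path_or_buffer):
    """Hypothetical stand-in for dbu.csv_to_df: load a CSV into a DataFrame."""
    # index_col=0 assumes the first CSV column is the saved row index.
    return pd.read_csv(path_or_buffer, index_col=0)


# A tiny in-memory CSV so the sketch runs without loan_payments.csv.
sample_csv = io.StringIO(
    ",id,loan_amount,term\n"
    "0,38676116,8000,36 months\n"
    "1,38656203,13200,36 months\n"
)
sample_df = csv_to_df(sample_csv)
print(sample_df.shape)
```

The same function accepts a file path, so `csv_to_df('loan_payments.csv')` would work unchanged under these assumptions.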

Let's have a quick look at the data.

In [6]:
print(df.head())

         id  member_id  loan_amount  funded_amount  funded_amount_inv  \
0  38676116   41461848         8000         8000.0             8000.0   
1  38656203   41440010        13200        13200.0            13200.0   
2  38656154   41439961        16000        16000.0            16000.0   
3  38656128   41439934        15000        15000.0            15000.0   
4  38656121   41439927        15000        15000.0            15000.0   

        term  int_rate  instalment grade sub_grade  ... recoveries  \
0  36 months      7.49      248.82     A        A4  ...        0.0   
1  36 months      6.99      407.52     A        A3  ...        0.0   
2  36 months      7.49      497.63     A        A4  ...        0.0   
3  36 months     14.31      514.93     C        C4  ...        0.0   
4  36 months      6.03      456.54     A        A1  ...        0.0   

  collection_recovery_fee  last_payment_date last_payment_amount  \
0                     0.0           Jan-2022              248.82   
1   

In [7]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 54231 entries, 0 to 54230
Data columns (total 43 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           54231 non-null  int64  
 1   member_id                    54231 non-null  int64  
 2   loan_amount                  54231 non-null  int64  
 3   funded_amount                51224 non-null  float64
 4   funded_amount_inv            54231 non-null  float64
 5   term                         49459 non-null  object 
 6   int_rate                     49062 non-null  float64
 7   instalment                   54231 non-null  float64
 8   grade                        54231 non-null  object 
 9   sub_grade                    54231 non-null  object 
 10  employment_length            52113 non-null  object 
 11  home_ownership               54231 non-null  object 
 12  annual_inc                   54231 non-null  float64
 13  verification_status  
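The non-null counts above already hint at missing data. As a rough sketch using only the counts visible in this output (the dictionary and variable names are illustrative, not part of db_utils), the share of missing values per column can be worked out like this:

```python
total_rows = 54231  # number of entries reported by df.info()

# Non-null counts copied from the df.info() output for the columns with gaps.
non_null_counts = {
    "funded_amount": 51224,
    "term": 49459,
    "int_rate": 49062,
    "employment_length": 52113,
}

# Percentage of rows where each column is missing.
missing_pct = {
    column: round(100 * (total_rows - count) / total_rows, 2)
    for column, count in non_null_counts.items()
}

for column, pct in missing_pct.items():
    print(f"{column}: {pct}% missing")
```

On the real dataframe, `df.isna().mean() * 100` gives the same kind of percentage for every column at once.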

We can see that 'term' has dtype object, when it would make more sense to store it as a plain number of months. We can use the DataTransform class to fix this. Let's create an instance of the class with the dataframe.

In [8]:
transform = dbu.DataTransform(df)

We can now split off the number of months as a new column.

In [9]:
transform.extract_numerical('term')

Column 'term_numerical' created with numerical values extracted from 'term'.
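The notebook doesn't show how `extract_numerical` is implemented. One possible approach, sketched here as a standalone function rather than the actual DataTransform method, is to pull the first run of digits out of each string and store it as a nullable integer so that missing terms stay missing:

```python
import pandas as pd


def extract_numerical(df, column):
    """Hypothetical sketch: copy the digits in `column` into `<column>_numerical`."""
    # Grab the first run of digits, e.g. "36 months" -> "36"; missing values stay missing.
    digits = df[column].str.extract(r"(\d+)", expand=False)
    # Int64 (capital I) is pandas' nullable integer dtype, so NaN does not force floats.
    df[f"{column}_numerical"] = pd.to_numeric(digits, errors="coerce").astype("Int64")
    return df


term_demo = pd.DataFrame({"term": ["36 months", "60 months", None]})
extract_numerical(term_demo, "term")
print(term_demo)
```

Writing to a new column, as the message above indicates the real method does, keeps the original 'term' strings available for comparison.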


Let's check it worked.  

In [14]:
df.head()

Unnamed: 0,id,member_id,loan_amount,funded_amount,funded_amount_inv,term,int_rate,instalment,grade,sub_grade,...,collection_recovery_fee,last_payment_date,last_payment_amount,next_payment_date,last_credit_pull_date,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,term_numerical
0,38676116,41461848,8000,8000.0,8000.0,36 months,7.49,248.82,A,A4,...,0.0,2022-01-01,248.82,Feb-2022,Jan-2022,0.0,5.0,1,INDIVIDUAL,1970-01
1,38656203,41440010,13200,13200.0,13200.0,36 months,6.99,407.52,A,A3,...,0.0,2022-01-01,407.52,Feb-2022,Jan-2022,0.0,,1,INDIVIDUAL,1970-01
2,38656154,41439961,16000,16000.0,16000.0,36 months,7.49,497.63,A,A4,...,0.0,2021-10-01,12850.16,,Oct-2021,0.0,,1,INDIVIDUAL,1970-01
3,38656128,41439934,15000,15000.0,15000.0,36 months,14.31,514.93,C,C4,...,0.0,2021-06-01,13899.67,,Jun-2021,0.0,,1,INDIVIDUAL,1970-01
4,38656121,41439927,15000,15000.0,15000.0,36 months,6.03,456.54,A,A1,...,0.0,2022-01-01,456.54,Feb-2022,Jan-2022,0.0,,1,INDIVIDUAL,1970-01

Note that 'term_numerical' is displayed as 1970-01 rather than 36. This is consistent with small integers having been parsed as datetimes (pandas reads them as nanoseconds after the Unix epoch), so the column appears to have been converted to a date at some point, possibly because cells were run out of order. It should hold the integer number of months.


Next, let's convert all the date columns to datetime format.

In [15]:
transform.format_to_date("last_payment_date")
transform.format_to_date("issue_date")
transform.format_to_date("earliest_credit_line")
transform.format_to_date("next_payment_date")
transform.format_to_date("last_credit_pull_date")

Column 'last_payment_date' converted to datetime format.
Column 'issue_date' converted to datetime format.
Column 'earliest_credit_line' converted to datetime format.
Column 'next_payment_date' converted to datetime format.
Column 'last_credit_pull_date' converted to datetime format.
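
The implementation of `format_to_date` isn't shown here. As a sketch under the assumption that the raw values look like `Jan-2022` (as in the earlier `head()` output), a conversion along these lines would work; the function name mirrors the method, but the body and the `date_format` parameter are illustrative:

```python
import pandas as pd


def format_to_date(df, column, date_format="%b-%Y"):
    """Hypothetical sketch: parse strings such as 'Jan-2022' into datetimes."""
    # An explicit format avoids per-value guessing; errors="coerce" turns
    # unparseable or missing entries into NaT instead of raising.
    df[column] = pd.to_datetime(df[column], format=date_format, errors="coerce")
    return df


date_demo = pd.DataFrame({"last_payment_date": ["Jan-2022", "Oct-2021", None]})
format_to_date(date_demo, "last_payment_date")
print(date_demo)
```

Because only month and year are given, each value is placed on the first day of that month.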


Now let's create an instance of the DataFrameInfo class from db_utils and test some of its methods.

In [16]:
df_info = dbu.DataFrameInfo(df)

AttributeError: module 'db_utils' has no attribute 'DataFrameInfo'
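
One possible cause of this error, not confirmed by anything in the notebook, is that `DataFrameInfo` was added to db_utils.py after the module had already been imported: Python keeps the old copy in memory until it is reloaded. The sketch below reproduces that situation with a throwaway module (`demo_utils` is a made-up name standing in for `db_utils`) and shows `importlib.reload` picking up the new class:

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep this short demo free of cached bytecode

# Write a throwaway module (a stand-in for db_utils) to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_utils.py"
module_path.write_text("class DataTransform:\n    pass\n")
sys.path.insert(0, str(tmp_dir))

import demo_utils

has_class_before = hasattr(demo_utils, "DataFrameInfo")

# Simulate editing the file on disk after it was imported.
module_path.write_text(
    "class DataTransform:\n    pass\n\n\nclass DataFrameInfo:\n    pass\n"
)
importlib.invalidate_caches()
demo_utils = importlib.reload(demo_utils)

has_class_after = hasattr(demo_utils, "DataFrameInfo")
print(has_class_before, has_class_after)
```

In the notebook itself the equivalent would be `importlib.reload(dbu)`, or restarting the kernel and running the cells in order. If the error persists, the class is likely missing from the file or named differently.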