# Fetching Data from Kaggle

In [1]:
%pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Change Working Directory

The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from the notebook's folder to its parent folder, the project root.

We access the current directory with os.getcwd().

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Bartek\\Desktop\\Predictive-Analysis\\jupyter_notebooks'

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Bartek\\Desktop\\Predictive-Analysis'
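The directory change above moves up one level every time the cell is re-run. A minimal sketch of a variant that is safe to repeat is shown here; the helper name `find_project_root` is hypothetical, and the folder name `jupyter_notebooks` is taken from the path printed earlier.

```python
import os
from pathlib import Path

def find_project_root(start, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `start` if `start` is the notebooks folder, else `start`."""
    start = Path(start)
    return start.parent if start.name == notebooks_dir else start

# Only moves up when the current folder is the notebooks folder,
# so running this cell twice does not climb a second level.
project_root = find_project_root(Path.cwd())
os.chdir(project_root)
print(f"Working directory: {project_root}")
```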

## Install Kaggle

To fetch data from Kaggle using their API, you can follow these steps:

1. Install the Kaggle package: First, install the Kaggle package with pip by running the cell below (or run pip install kaggle==1.5.12 in a terminal or command prompt):

In [5]:
%pip install kaggle==1.5.12

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


2. Create a Kaggle API token: To access Kaggle datasets programmatically, you need to create an API token. Go to the Kaggle website, sign in to your account, and navigate to your account settings. Scroll down to the "API" section and click on the "Create New API Token" button. This will download a JSON file containing your API credentials.

3. Place the API token in the correct location: Move the downloaded JSON file to the appropriate location on your machine. For most operating systems, this location is ~/.kaggle/kaggle.json. If the .kaggle directory doesn't exist, you can create it manually.

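4. Set permissions for the API token: To keep the API token private, restrict access to the file. The cell below first points KAGGLE_CONFIG_DIR at the current working directory, so the Kaggle client looks for kaggle.json in the project root instead of ~/.kaggle, and then runs chmod 600 on the file. chmod is a Unix command, so on Windows the call fails with the message shown below: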

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

'chmod' is not recognized as an internal or external command,
operable program or batch file.
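Because `chmod` is not available in the Windows command prompt, the permission step can also be done from Python. The following is a minimal sketch, not part of the original notebook: the helper name `restrict_token_permissions` is hypothetical, and it assumes kaggle.json sits in the directory that KAGGLE_CONFIG_DIR points to.

```python
import os
import stat
from pathlib import Path

def restrict_token_permissions(token_path):
    """Limit the token file to owner read/write (the equivalent of chmod 600).

    Returns False if the file is missing, True otherwise.
    """
    token_path = Path(token_path)
    if not token_path.is_file():
        return False
    if os.name == "posix":
        # 0o600: the owner may read and write; group and others get no access.
        os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # On Windows, os.chmod only toggles the read-only flag, so nothing is changed here.
    return True

# Assumption: the token lives in KAGGLE_CONFIG_DIR, falling back to the working directory.
config_dir = Path(os.environ.get("KAGGLE_CONFIG_DIR", Path.cwd()))
if not restrict_token_permissions(config_dir / "kaggle.json"):
    print("kaggle.json not found in", config_dir)
```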


In [7]:
# The Kaggle CLI expects the dataset as "owner/dataset-name", not the URL path.
KaggleDatasetPath = "bhavikjikadara/loan-status-prediction"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

403 - Forbidden
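
A 403 response means the server refused the request. Two things may be worth checking before retrying, sketched below under stated assumptions: that the dataset is given as `owner/dataset-name` rather than as the path copied from the browser URL, and that kaggle.json contains non-empty `username` and `key` fields. Both helper names are hypothetical, and neither check contacts Kaggle.

```python
import json
from pathlib import Path

def to_dataset_slug(url_path):
    """Reduce a Kaggle URL path such as 'datasets/owner/name/data' to 'owner/name'."""
    parts = [p for p in url_path.strip("/").split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Cannot find owner/name in {url_path!r}")
    return "/".join(parts[:2])

def token_looks_valid(token_path):
    """Check that the token file exists and has non-empty 'username' and 'key' fields."""
    token_path = Path(token_path)
    if not token_path.is_file():
        return False
    try:
        data = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("username")) and bool(data.get("key"))

print(to_dataset_slug("datasets/bhavikjikadara/loan-status-prediction/data"))
# bhavikjikadara/loan-status-prediction
```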


## Load and Inspect Kaggle Data

Load the data into a pandas DataFrame.

In [26]:
import pandas as pd
loan_data = pd.read_csv("inputs/datasets/raw/loan_data.csv")

Inspect the dataset using a DataFrame summary and check whether there are any duplicates.

In [28]:
duplicate_rows = loan_data.duplicated(subset=['Loan_ID'])
duplicate_rows.sum()
loan_data.head(300)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,LP002489,Female,No,1,Not Graduate,,5191,0.0,132.0,360.0,1.0,Semiurban,Y
296,LP002493,Male,No,0,Graduate,No,4166,0.0,98.0,360.0,0.0,Semiurban,N
297,LP002494,Male,No,0,Graduate,No,6000,0.0,140.0,360.0,1.0,Rural,Y
298,LP002500,Male,Yes,3+,Not Graduate,No,2947,1664.0,70.0,180.0,0.0,Urban,N
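
A notebook cell displays only its last expression, so the value of `duplicate_rows.sum()` in the cell above is computed but never shown. The sketch below prints the count explicitly; it uses a small made-up frame rather than the loan data, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration; LP000002 appears twice.
sample = pd.DataFrame({
    "Loan_ID": ["LP000001", "LP000002", "LP000002"],
    "LoanAmount": [128.0, 66.0, 66.0],
})

# duplicated() marks every repeat of a Loan_ID after its first occurrence.
duplicate_count = int(sample.duplicated(subset=["Loan_ID"]).sum())
print(f"Duplicate Loan_ID rows: {duplicate_count}")

# drop_duplicates() keeps the first occurrence of each Loan_ID.
deduplicated = sample.drop_duplicates(subset=["Loan_ID"])
```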


In [22]:
loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            381 non-null    object 
 1   Gender             376 non-null    object 
 2   Married            381 non-null    object 
 3   Dependents         373 non-null    object 
 4   Education          381 non-null    object 
 5   Self_Employed      360 non-null    object 
 6   ApplicantIncome    381 non-null    int64  
 7   CoapplicantIncome  381 non-null    float64
 8   LoanAmount         381 non-null    float64
 9   Loan_Amount_Term   370 non-null    float64
 10  Credit_History     351 non-null    float64
 11  Property_Area      381 non-null    object 
 12  Loan_Status        381 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 38.8+ KB


The summary shows that most columns have the object dtype (text), while the rest hold float or int values. We need to convert the text columns to numbers before we can run machine learning algorithms on the data.

In [29]:
# Encode each text column as a number. Values that are absent from a mapping
# (including values that are already missing) become NaN.
loan_data['Married'] = loan_data['Married'].map({'No': 0, 'Yes': 1})
loan_data['Gender'] = loan_data['Gender'].map({'Female': 0, 'Male': 1})
loan_data['Education'] = loan_data['Education'].map({'Not Graduate': 0, 'Graduate': 1})
loan_data['Self_Employed'] = loan_data['Self_Employed'].map({'No': 0, 'Yes': 1})
loan_data['Property_Area'] = loan_data['Property_Area'].map({'Rural': 0, 'Semiurban': 1, 'Urban': 2})
loan_data['Loan_Status'] = loan_data['Loan_Status'].map({'N': 0, 'Y': 1})
# Note: '1' and '2' dependents share the code 1, and '3+' becomes 2.
loan_data['Dependents'] = loan_data['Dependents'].map({'0': 0, '1': 1, '2': 1, '3+': 2})
loan_data.info()
loan_data.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381 entries, 0 to 380
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            381 non-null    object 
 1   Gender             376 non-null    float64
 2   Married            381 non-null    int64  
 3   Dependents         373 non-null    float64
 4   Education          381 non-null    int64  
 5   Self_Employed      360 non-null    float64
 6   ApplicantIncome    381 non-null    int64  
 7   CoapplicantIncome  381 non-null    float64
 8   LoanAmount         381 non-null    float64
 9   Loan_Amount_Term   370 non-null    float64
 10  Credit_History     351 non-null    float64
 11  Property_Area      381 non-null    int64  
 12  Loan_Status        381 non-null    int64  
dtypes: float64(7), int64(5), object(1)
memory usage: 38.8+ KB


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,1.0,1,1.0,1,0.0,4583,1508.0,128.0,360.0,1.0,0,0
1,LP001005,1.0,1,0.0,1,1.0,3000,0.0,66.0,360.0,1.0,2,1
2,LP001006,1.0,1,0.0,0,0.0,2583,2358.0,120.0,360.0,1.0,2,1
3,LP001008,1.0,0,0.0,1,0.0,6000,0.0,141.0,360.0,1.0,2,1
4,LP001013,1.0,1,0.0,0,0.0,2333,1516.0,95.0,360.0,1.0,2,1
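
`Series.map` with a dictionary turns any value that is not a key of the dictionary into NaN, so a misspelled or unexpected label would silently become a missing value. A minimal sketch of a guard against that follows; the helper name `encode_with_check` is hypothetical and the sample values are made up.

```python
import pandas as pd

def encode_with_check(series, mapping):
    """Map labels to numbers, raising if a non-missing label has no mapping."""
    unexpected = set(series.dropna().unique()) - set(mapping)
    if unexpected:
        raise ValueError(f"Unmapped labels in {series.name!r}: {sorted(unexpected)}")
    return series.map(mapping)

# Made-up values for illustration; None stands for an entry that is already missing.
married = pd.Series(["Yes", "No", None, "Yes"], name="Married")
encoded = encode_with_check(married, {"No": 0, "Yes": 1})
print(encoded.tolist())
```

Entries that were missing before the mapping stay missing, which matches the non-null counts in the summary above.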


## Handle Missing Data

In [38]:
# Fill missing values with the column mean. Assigning the result back avoids
# the chained inplace=True call, which pandas warns will stop working in 3.0.
loan_data['Credit_History'] = loan_data['Credit_History'].fillna(loan_data['Credit_History'].mean())
loan_data['Loan_Amount_Term'] = loan_data['Loan_Amount_Term'].fillna(loan_data['Loan_Amount_Term'].mean())
loan_data['Self_Employed'] = loan_data['Self_Employed'].fillna(loan_data['Self_Employed'].mean())
loan_data['Gender'] = loan_data['Gender'].fillna(loan_data['Gender'].mean())
loan_data['Dependents'] = loan_data['Dependents'].fillna(loan_data['Dependents'].mean())

Confirm that no missing values remain:

In [39]:
loan_data.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
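
The imputation step repeats one pattern per column, so it can also be written as a loop. The sketch below assigns each filled column back to the frame, which is one of the forms the pandas warning suggests in place of `inplace=True` on a column selection. The data is made up, and the two column names are borrowed from the loan data only for illustration.

```python
import pandas as pd

# Made-up values for illustration; None marks a missing entry.
frame = pd.DataFrame({
    "Credit_History": [1.0, None, 0.0, 1.0],
    "Loan_Amount_Term": [360.0, 180.0, None, 360.0],
})

for column in ["Credit_History", "Loan_Amount_Term"]:
    # Assign the filled column back rather than using inplace=True.
    frame[column] = frame[column].fillna(frame[column].mean())

print(frame.isnull().sum())
```

The mean of a 0/1-coded column is usually a fraction, so filled rows no longer hold a pure category code; whether that matters depends on the model trained later.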

Now that all the remaining fields hold float and int values, the dataset is ready to be saved in the outputs folder and pushed to the repo for later use when training the ML model. As a last step, drop the Loan_ID column, as it will not be used.

In [None]:
loan_data = loan_data.drop(['Loan_ID'] , axis=1)

## Push File to Repo

In [None]:
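import os

# Create the output folder; exist_ok=True avoids an error if it already exists.
os.makedirs('outputs', exist_ok=True)

# Save the cleaned data without the index column.
loan_data.to_csv("outputs/loan_data.csv", index=False)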
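
Before pushing the file, it can be worth reading it back to confirm that it was written as expected. The sketch below does this in a temporary directory so that it does not touch the project; the two-row frame is made up and stands in for the cleaned loan data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows standing in for the cleaned loan data.
cleaned = pd.DataFrame({"ApplicantIncome": [4583, 3000], "Loan_Status": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "outputs"
    # exist_ok=True makes the call safe to repeat when the folder already exists.
    output_dir.mkdir(parents=True, exist_ok=True)
    output_file = output_dir / "loan_data.csv"
    cleaned.to_csv(output_file, index=False)

    # Read the file back to confirm that nothing was lost on the way to disk.
    reloaded = pd.read_csv(output_file)

print(reloaded.shape == cleaned.shape)
```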