# Capstone Project - Data Science @ BrainStation

##  Introduction

This Jupyter notebook forms the core component of my Capstone project for the Data Science Diploma Program at BrainStation, Vancouver. In this notebook, we use base Python together with some widely used libraries to investigate and analyse a dataset sourced from loan data released by Lending Club, a US-based peer-to-peer lender. At its peak, Lending Club was the largest P2P lender in the world, with assets of ~16 billion USD. The dataset contains the features of various loans extended by Lending Club from 2007 to 2018 and the corresponding details of the borrowers who availed of them. The key detail captured in the dataset is `loan_status`, which takes two values, 'Fully Paid' or 'Charged Off'; this means all the loans have been closed and there are no running loans in this dataset.

The dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/jeandedieunyandwi/lending-club-dataset/code?datasetId=608703&sortBy=voteCount).

In this notebook, I will demonstrate how to perform EDA, visualise data, and apply some machine learning techniques to solve a prediction problem.


## Table of Contents
[1. Loading data & checking high-level details](#Step-1:-Loading-data-&-checking-high-level-details) <br>
- [Data Dictionary](#Data-Dictionary)



[2. Verifying assumptions associated with linear regression models](#Step-2:-Verifying-assumptions-associated-with-linear-regression-models) <br>
- [Linearity](#2.1.-Linearity) <br>
- [Independence](#2.2.-Independence-or-No-Multicollienearity)
- [Normality](#2.3.-Residuals-Are-Normally-Distributed)
- [Homoscedasticity](#2.4.-Homoscedasticity)

[3. Variable selection for model](#Step-3:-Variable-selection-for-model) <br>
- [Backward or Top-Down approach](#3.1.-Backward-or-Top-Down-approach)
- [Forward or Bottom-Up approach](#3.2.-Forward-or-Bottom-Up-approach)

[4. Model Diagnostics](#Step-4:-Model-Diagnostics)
- [Residuals](#4.1.-Residuals)
- [Homoscedasticity](#4.2-Homoscedasticity)

[5. Conclusion](#5.-Conclusion)

### Data Dictionary

| S.No | Column Name           | Description                                                                                                 |
|------|-----------------------|-------------------------------------------------------------------------------------------------------------|
| 0    | loan_amnt             | The listed amount of the loan in USD applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.|
| 1    | term                  | The number of payments on the loan. Values are in months and can be either 36 or 60.                          |
| 2    | int_rate              | Interest Rate on the loan                                                                                    |
| 3    | installment           | The monthly payment owed by the borrower                                                                    |                                           
| 4    | grade                 | Lending Club assigned loan grade                                                                                      |
| 5    | sub_grade             | Lending Club assigned loan subgrade                                                                                   |
| 6    | emp_title             | The job title supplied by the Borrower when applying for the loan.                                          |   
| 7    | emp_length            | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.|
| 8    | home_ownership        | The homeownership status provided by the borrower during registration or obtained from the credit report. The values are: RENT, OWN, MORTGAGE, OTHER|
| 9    | annual_inc            | The self-reported annual income provided by the borrower during registration.                                 |
| 10   | verification_status   | Indicates if income was verified by LC, not verified, or if the income source was verified                    |
| 11   | issue_d               | The month which the loan was funded                                                                         |
| 12   | loan_status           | Current status of the loan                                                                                 |
| 13   | purpose               | A category provided by the borrower for the loan request.                                                   |
| 14   | title                 | The loan title provided by the borrower                                                                    |
| 15   | zip_code              | The first 3 numbers of the zip code provided by the borrower in the loan application.                         |
| 16   | addr_state            | The state provided by the borrower in the loan application                                                  |
| 17   | dti                   | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.|
| 18   | earliest_cr_line      | The month the borrower's earliest reported credit line was opened                                            |
| 19   | open_acc              | The number of open credit lines in the borrower's credit file.                                               |
| 20   | pub_rec               | Number of derogatory public records                                                                        |
| 21   | revol_bal             | Total credit revolving balance                                                                             |
| 22   | revol_util            | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.|
| 23   | total_acc             | The total number of credit lines currently in the borrower's credit file                                     |
| 24   | initial_list_status   | The initial listing status of the loan. Possible values are – W, F. W stands for Whole loan and F stands for fractional.                                           |
| 25   | application_type      | Indicates whether the loan is an individual application or a joint application with two co-borrowers         |
| 26   | mort_acc              | Number of mortgage accounts.                                                                               |
| 27   | pub_rec_bankruptcies  | Number of public record bankruptcies                                                                        |

## Data Exploration

In [1]:
#Import the required libraries

import numpy as np # Linear algebra
import pandas as pd # Data manipulation

# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import hvplot.pandas 

# import library to filter warnings
import warnings
warnings.filterwarnings('ignore')



In [2]:
# Read the CSV using pandas and name the DataFrame raw_df
raw_df = pd.read_csv("lending_club_loan_two.csv")

In [None]:
# Check for the size of the dataset
raw_df.shape

Data frame has 396030 rows or observations and 27 columns

In [3]:
# Change the default pandas dataframe display option to enable viewing all columns of the dataframe
pd.set_option('display.max_columns', None) # This option is enabled to look at all the columns in a data frame

In [None]:
# Take a glimpse of the dataframe
raw_df.head(5)

In [None]:
# Check the column names and data types
raw_df.info()

**Datatypes:** Column `term` could be transformed from object to numeric later on.
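
The conversion mentioned above could look like the minimal sketch below. It assumes the raw `term` strings have the form `" 36 months"`; the sample frame and the `term_months` column name are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical sample mimicking the assumed raw format of the `term` column
sample = pd.DataFrame({"term": [" 36 months", " 60 months", " 36 months"]})

# Extract the digits from each string and cast them to integer months
sample["term_months"] = (
    sample["term"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(sample.dtypes)
print(sample["term_months"].tolist())
```

Keeping the result in a new column leaves the original strings available for comparison until the conversion has been checked.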

**Feature Engineering**: 
1. New columns `month` and `year` could be extracted from `issue_d` column  
2. New columns `city`, `state` and `pincode` could be extracted from `address` column 
3. New column `inc_by_loan` = (`annual_inc`)/(`loan_amnt`) could be calculated.
4. New column `debt` = `dti` * `annual_inc` could be calculated


### Unique values

In [None]:
# Percentage of unique values in each column

# Calculate number of unique values in each column of data frame
unique_values = raw_df.nunique()

# Calculate total values in each column
count_values  = raw_df.count()

# Calculate percentage of unique values and sort the values in descending order
percentage_unique_values = (unique_values * 100 / count_values).sort_values(ascending = False)

# Print the percentages, rounded to the nearest integer
display(percentage_unique_values.round())

**Observations**

* A column with a high percentage of unique values is difficult to summarise, analyse, and encode.
* `address`: 99% of addresses are unique. We may drop this column after we extract city, state and pin code values.
* `emp_title`: 46% of employee titles are unique. Like `address`, the employee title column is not useful for analysis and may be dropped.
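
As a rough illustration of the screening described in these observations, the sketch below flags columns whose share of unique values exceeds a cut-off. The helper name `high_cardinality_columns`, the 40% threshold, and the toy data are all assumptions made for this example, not part of the original analysis.

```python
import pandas as pd

def high_cardinality_columns(df, threshold_pct=40.0):
    """Return columns whose percentage of unique non-null values exceeds threshold_pct."""
    pct_unique = df.nunique() * 100 / df.count()
    return pct_unique[pct_unique > threshold_pct].sort_values(ascending=False)

# Toy frame: every address is unique, while grade has only two distinct values
toy = pd.DataFrame({
    "address": [f"{i} Main St" for i in range(10)],
    "grade": ["A", "B"] * 5,
})

flagged = high_cardinality_columns(toy)
print(flagged)
```

The threshold is a judgment call; a lower value would also flag moderately varied columns that may still be worth encoding.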

In [4]:
# drop the 'emp_title' column
raw_df = raw_df.drop(columns = ['emp_title'])

### Summary Statistics of Dataset

In [None]:
# Calculate brief summary statistics
raw_df.describe().round().T

1. **loan_amnt**: The average loan is ~14,113 USD. The minimum loan amount is 500.00 USD and the maximum is 40,000.00 USD.
2. **int_rate**: The average interest rate is 13.63%.

### Duplicates

In [None]:
# check for duplicate rows
raw_df.duplicated().sum() 

There are no duplicate rows in the dataset.

### Null Values

In [None]:
# Calculate the number of nulls in each column
columns_null_count = raw_df.isna().sum(axis=0)

# Total rows in data frame
total_rows = raw_df.shape[0]

# Calculate percentage of nulls and apply filter of non-zero null count in column
percentage_of_nulls = (columns_null_count * 100 / total_rows).round(2)

# Filter columns with non-zero null count
non_zero_nulls = percentage_of_nulls[percentage_of_nulls != 0]

# Print the percentage of nulls for columns that have null values, in decreasing order
print(non_zero_nulls.sort_values(ascending = False))

 **Observations:** 
 
 * Column `mort_acc` has almost 10% of its values as null. 
 * Columns `emp_title` and `emp_length` have a fair percentage of null values (5.79% and 4.62% respectively).
 * Column `title` has 0.44% of its values as null.
 * Other columns shown above have insignificant null percentage.

#### Handling Null values in columns
* There are null values in columns `emp_title`, `emp_length`, `title`, `revol_util`, `mort_acc` and `pub_rec_bankruptcies`. These nulls have to be handled appropriately.
* `revol_util` and `pub_rec_bankruptcies`: These columns have a low percentage of null values. We may remove the rows with null values in these columns.

**I. Null values in `revol_util` and `pub_rec_bankruptcies`**

In [5]:
# Use the dropna() method to drop rows with NA values in columns `revol_util` and `pub_rec_bankruptcies`
raw_df = raw_df.dropna(subset=['revol_util', 'pub_rec_bankruptcies'])

We have removed the rows with NA values in columns `revol_util` and `pub_rec_bankruptcies`. 

**II. Null values in `emp_length`**

In [None]:
# Number of unique values in emp_length
print(f"Column `emp_length` has {raw_df['emp_length'].nunique()} unique values \n") 

# List the categories of unique values in emp_length
print(f"The categories are as under: \n\n {raw_df['emp_length'].unique()}")

Since `emp_length` has 4.62% of null values, it may not be prudent to remove the corresponding rows as we may lose important patterns in data. Let's examine how the distribution of `loan_status` is with respect to various categories of `emp_length` to understand how the distribution changes.

We use the crosstab approach to understand this.

In [None]:
# Compute a simple cross tabulation of 'emp_length' and 'loan_status'
cross_tab = pd.crosstab(index=raw_df['emp_length'], columns=raw_df['loan_status'], normalize='index').round(2)

# Print normalised proportions for each category of emp_length
print(cross_tab)

In [12]:
raw_df['emp_length_numeric'] = raw_df['emp_length'].map({
    '< 1 year': 0,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    '10+ years': 10,
    None: None  # to handle NaN values
})


In [14]:
raw_df[['emp_length_numeric','loan_status']].corr()

|                    | emp_length_numeric | loan_status |
|--------------------|--------------------|-------------|
| emp_length_numeric | 1.0                | 0.013805    |
| loan_status        | 0.013805           | 1.0         |


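The data above show that the proportions of charged-off and fully paid loans are quite consistent across employment lengths, and the correlation between `emp_length_numeric` and `loan_status` is only about 0.014. This indicates that the practical impact of `emp_length` on predicting `loan_status` is limited. We may proceed to remove this column.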

In [18]:
#from scipy.stats import chi2_contingency

#Create a contingency table
#contingency_table = pd.crosstab(raw_df['emp_length'], raw_df['loan_status'])

# Perform the Chi-Square test
#chi2, p, dof, expected = chi2_contingency(contingency_table)

#print("Chi-Square Statistic:", chi2)
#print("P-value:", p)
#print("Degrees of Freedom:", dof)
#print("Expected Frequencies:\n", expected)
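
A runnable version of the chi-square idea sketched in the commented-out cell might look as follows. It uses hypothetical counts rather than the real crosstab, and adds Cramér's V as an effect-size measure, because with several hundred thousand rows even a negligible difference can produce a small p-value. The helper name `cramers_v` and the toy counts are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Return (Cramér's V, p-value) for a contingency table of counts."""
    chi2, p_value, dof, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return np.sqrt(chi2 / (n * k)), p_value

# Hypothetical counts: near-identical charge-off proportions per category
table = pd.DataFrame(
    {"Charged Off": [20, 21, 19], "Fully Paid": [80, 79, 81]},
    index=["< 1 year", "5 years", "10+ years"],
)

v, p_value = cramers_v(table)
print(f"Cramér's V: {v:.3f}, p-value: {p_value:.3f}")
```

A value of V close to 0 indicates a weak association, which is the situation the consistent proportions above point to.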

**III. Null values in `title`**

In [None]:
# Calculate the top 5 most frequently occurring categories in column 'title'
raw_df['title'].value_counts()[:5] 

It looks like there are many duplicated categories. For instance, debt consolidation is captured under several differently written values. We could merge such categories to reduce their number. Let us look at the `title` column and compare it with the `purpose` column.
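
To illustrate how such near-duplicate categories could be merged, here is a minimal sketch that normalises case and surrounding whitespace before counting. The sample titles are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical free-text titles with inconsistent casing and spacing
titles = pd.Series([
    "Debt consolidation",
    "Debt Consolidation",
    "debt consolidation ",
    "Credit card refinancing",
    None,
])

# Strip surrounding whitespace and lower-case so variants collapse into one category
normalised = titles.str.strip().str.lower()

counts = normalised.value_counts()
print(counts)
```

This only handles casing and whitespace; misspellings or rephrased titles would need fuzzier matching.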

In [None]:
raw_df[['purpose','title']].sample(5)

We observe that `title` has values similar to the `purpose` column. In fact, the `purpose` column appears to have input validation, while the `title` column appears to hold free text entered by the borrower. We may proceed to remove the `title` column.

**IV. Null values in `mort_acc`**

In [6]:
# Map 'Charged Off' to 0 and 'Fully Paid' to 1 in the 'loan_status' column
raw_df['loan_status'] = raw_df['loan_status'].map({'Charged Off': 0, 'Fully Paid': 1})

In [15]:
raw_df[['mort_acc','loan_status']].corr()

|             | mort_acc | loan_status |
|-------------|----------|-------------|
| mort_acc    | 1.0      | 0.073048    |
| loan_status | 0.073048 | 1.0         |


In [9]:
raw_df[['mort_acc','loan_status']].sample(10)

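|        | mort_acc | loan_status |
|--------|----------|-------------|
| 81550  | 0.0      | 1           |
| 124663 | 2.0      | 1           |
| 387738 | 0.0      | 0           |
| 371582 | 0.0      | 0           |
| 255136 | 0.0      | 1           |
| 271145 | 1.0      | 0           |
| 377318 | NaN      | 1           |
| 234228 | 2.0      | 1           |
| 386291 | 4.0      | 0           |
| 283251 | NaN      | 1           |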


In [None]:
numeric_df = raw_df.select_dtypes("number")

plot_num = 1

plt.subplots(6,2, figsize=(20,20))

for col in numeric_df.columns:
    plt.subplot(6,2,plot_num)
    sns.histplot(raw_df[col])
    plot_num +=1

plt.tight_layout()
plt.show()

In [None]:
cols_with_outliers = ['annual_inc', 'dti', 'revol_bal', 'revol_util']

plt.subplots(2,2, figsize=(15,5.5))
plot_num = 1

for col in cols_with_outliers:
    q1 = raw_df[col].quantile(0.25)
    q3 = raw_df[col].quantile(0.75)
    iqr = q3 - q1

    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    filtered_data = raw_df[(raw_df[col] >= lower_bound) & (raw_df[col] <= upper_bound)]

    plt.subplot(2, 2, plot_num)  # Adjust the subplot grid as needed
    
    sns.histplot(filtered_data[col])
    plt.title(f"Distribution of {col} (excluding outliers)")
    plt.xlabel(col)
    plt.ylabel("Number of Loans")

    plot_num += 1

plt.tight_layout()  # Adjust layout for better spacing
plt.show()

In [None]:
#categorical_cols = ['term','grade','sub_grade','emp_title','emp_length','home_ownership','verification_status',
                  #'issue_d', 'loan_status','purpose','title','initial_list_status','application_type']

categorical_cols = ['term','grade','emp_length','home_ownership','verification_status','loan_status',
                   'initial_list_status','application_type']

plot_num = 1

plt.subplots(4, 2, figsize=(15, 10))

for col in categorical_cols:
    plt.subplot(4, 2, plot_num)
    
    # Calculate normalized counts manually
    counts = raw_df[col].value_counts(normalize=True)
    
    # Plot the bar chart
    sns.barplot(x=counts.index, y=counts.values)
    
    plt.title(f"Normalized Count of {col}")
    plt.xlabel(col)
    plt.ylabel("Percentage")
    plot_num += 1

plt.tight_layout()
plt.show()

In [None]:
categorical_df = raw_df.select_dtypes("object")
categorical_df.columns

1. Most loans fall in the range of 5k to 12.5k USD.
2. The fewest loans are extended in the range of 27k to 32k USD.


Most loans carry an interest rate between 11% and 16%. Loans with rates above 21% taper off.

In [None]:
rates = raw_df.groupby(["grade", "term"])["term"].count()/raw_df.groupby(["grade"])["term"].count()
rates.unstack().plot(kind="barh", stacked=True)
sns.despine()

In [None]:
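# Compute the correlation matrix once, using numeric columns only
# (recent pandas versions raise an error if non-numeric columns are included)
corr_matrix = raw_df.corr(numeric_only=True)

# Create a mask to hide the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Set up the figure with size
plt.figure(figsize=(15, 10))

# Draw the heatmap with the mask
sns.heatmap(data=corr_matrix, cmap="coolwarm", annot=True, mask=mask)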

plt.show()


In [None]:
raw_df['loan_status'].value_counts().hvplot.bar(
    title="Loan Status Counts", xlabel='Loan Status', ylabel='Count', 
    width=500, height=350
)

## Exploring columns with null values

We saw earlier that some of the columns have null values. Let's calculate the percentages of nulls in these columns.

In [None]:
# Extract columns with null values and calculate the percentages
columns_with_nulls = (raw_df.isna().sum(axis=0) * 100 / raw_df.shape[0]).round(2).loc[lambda x: x != 0]

# print percentage of nulls in the columns having null values
print(columns_with_nulls.sort_values(ascending = True))

`revol_util` and `pub_rec_bankruptcies` have a small percentage of null values, so we can remove the rows where either of these columns is null.

In [None]:
raw_df.dropna(subset=['revol_util', 'pub_rec_bankruptcies'], inplace=True)

In [None]:
raw_df.shape

Let us look at `title` column. We observe that it has values similar to the `purpose` column. 

In [None]:
raw_df[['purpose','title']].sample(5)

Based on above, we may proceed to remove the column `title`

In [None]:
# number of unique values on `emp_length` column
raw_df['emp_length'].nunique() 

In [None]:
cross_tab = pd.crosstab(index=raw_df['emp_length'], columns=raw_df['loan_status'], normalize='index').round(2)

# Print normalised proportions for each category of emp_length
print(cross_tab)


We notice that for every category of `emp_length`, the charged-off and fully paid values maintain roughly the same proportion. This suggests that `emp_length` has little influence on `loan_status`. Therefore, we may remove the `emp_length` column.

In [None]:
raw_df['emp_title'].nunique() #Check the number of unique 'employee title' values

We may proceed to delete the `emp_title` column, as it has too many unique values to be converted into a useful numerical feature.

In [None]:
raw_df[['purpose','title']].sample(5)

In [None]:
raw_df.corr(numeric_only=True)['mort_acc'].round(1)

In [None]:
raw_df['mort_acc'].value_counts()

In [None]:
raw_df[raw_df['mort_acc'].notna()].corr(numeric_only=True)['mort_acc'].round(1)

The highest correlation is observed with `total_acc`, but it is not high enough to treat `mort_acc` as redundant and delete it.

We observe there is no common ratio. There is some relation between `mort_acc` and `loan_status`, so we keep the column and impute the missing values instead. We can use the `fillna()` method to impute `mort_acc`.
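
Since `total_acc` shows the highest correlation with `mort_acc`, one alternative to a single global fill value is to impute within groups of `total_acc`. The sketch below is only an illustration of that alternative on toy data, not the approach used in this notebook; the `mort_acc_filled` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: some mort_acc values are missing, one total_acc group is entirely missing
toy = pd.DataFrame({
    "total_acc": [5, 5, 5, 20, 20, 20, 40],
    "mort_acc": [0.0, np.nan, 0.0, 2.0, 4.0, np.nan, np.nan],
})

# Mean of mort_acc within each total_acc group, aligned to the original rows
group_mean = toy.groupby("total_acc")["mort_acc"].transform("mean")

# Fill with the group mean first, then fall back to the overall mean
toy["mort_acc_filled"] = (
    toy["mort_acc"].fillna(group_mean).fillna(toy["mort_acc"].mean())
)

print(toy)
```

The fallback matters for groups where every `mort_acc` value is missing, since their group mean is itself undefined.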

Having explored all the columns that have null values, we proceed to drop them, except for `mort_acc`, which we impute.

In [None]:
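# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify raw_df
raw_df['mort_acc'] = raw_df['mort_acc'].fillna(raw_df['mort_acc'].mean())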



In [None]:
raw_df.isnull().sum() # checking for nulls

In [None]:
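# Drop the columns 'title', 'emp_length' and 'emp_title', plus the helper column
# 'emp_length_numeric' if it was created. errors='ignore' skips any column that was
# already dropped earlier (for example 'emp_title').
raw_df = raw_df.drop(columns = ['title','emp_length','emp_title','emp_length_numeric'], errors='ignore')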

In [None]:
raw_df.shape

After dropping some columns, let us extract categorical columns again from raw_df

In [None]:
raw_df = raw_df.drop(columns = ['sub_grade'])

In [None]:
# Extract month and year from earliest_cr_line
raw_df['earliest_cr_line'] = pd.to_datetime(raw_df['earliest_cr_line'])
raw_df['earliest_cr_line_month'] = raw_df['earliest_cr_line'].dt.month
raw_df['earliest_cr_line_year'] = raw_df['earliest_cr_line'].dt.year

# Extracting State Code from address
state_pattern = r',\s(\w{2})\s\d+'
state_extract = raw_df['address'].str.extract(state_pattern)
raw_df['address_state_code'] = state_extract[0]

# Extracting Month and Year from issue_d
raw_df['issue_d'] = pd.to_datetime(raw_df['issue_d'])
raw_df['issue_d_month'] = raw_df['issue_d'].dt.month
raw_df['issue_d_year'] = raw_df['issue_d'].dt.year

# Calculate Annual_Income/Loan Amount
raw_df['annual_inc_loan_amnt_ratio'] = raw_df['annual_inc'] / raw_df['loan_amnt']

# Calculate Annual_Income/Interest Rate
raw_df['annual_inc_int_rate_ratio'] = raw_df['annual_inc'] / raw_df['int_rate']

# Calculate Debt using dti and Annual_Income
raw_df['debt'] = raw_df['dti'] * raw_df['annual_inc']

# Display the updated dataframe
print(raw_df.head())


In [None]:
columns_to_drop = ['address', 'earliest_cr_line', 'issue_d']
raw_df.drop(columns=columns_to_drop, inplace=True)

In [None]:
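# Map 'Charged Off' to 0 and 'Fully Paid' to 1 in the 'loan_status' column.
# Skip the mapping if the column is already numeric: re-mapping would turn every value into NaN.
if not pd.api.types.is_numeric_dtype(raw_df['loan_status']):
    raw_df['loan_status'] = raw_df['loan_status'].map({'Charged Off': 0, 'Fully Paid': 1})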

# Display the updated dataframe
print(raw_df.head())

In [None]:
categorical_df = raw_df.select_dtypes("object")


In [None]:
dummies_df = pd.get_dummies(categorical_df, drop_first=True)

# Now, concatenate the one-hot encoded columns with the original dataframe

processed_df = pd.concat([raw_df, dummies_df], axis=1)

# Drop the original categorical columns if needed
processed_df = processed_df.drop(categorical_df.columns, axis=1)



# Now, processed_df contains the original numerical columns and the one-hot encoded categorical columns

In [None]:
raw_df.head()

In [None]:
processed_df.head()

In [None]:
processed_df = processed_df.drop(columns=["revol_bal","installment","revol_util","open_acc","grade_B","grade_C","grade_D","grade_E","purpose_moving","purpose_other","purpose_renewable_energy","purpose_small_business","purpose_vacation","purpose_wedding"])

In [None]:
processed_df = processed_df.drop(columns = ["grade_F","grade_G","home_ownership_MORTGAGE","home_ownership_NONE","home_ownership_OTHER","home_ownership_OWN","home_ownership_RENT","verification_status_Source Verified","verification_status_Verified","purpose_credit_card","purpose_debt_consolidation","purpose_educational","purpose_home_improvement","purpose_house","purpose_major_purchase","purpose_medical","initial_list_status_w"])

In [None]:
processed_df.shape

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler



# Select features (X) and target variable (y)
X = processed_df.drop('loan_status', axis=1)  #  'loan_status' is target variable
y = processed_df['loan_status']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=49)

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform all columns in training set
X_train_scaled = scaler.fit_transform(X_train)

# Transform all columns in testing set
X_test_scaled = scaler.transform(X_test)

# Convert the scaled arrays back to DataFrames with column names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)


In [None]:
processed_df.shape

In [None]:
processed_df[processed_df['loan_status'] == 1].shape[0]/processed_df[['loan_status']].shape[0]

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


# Initialize Logistic Regression model
model = LogisticRegression(random_state=449)

# Fit the model on the training set
model.fit(X_train_scaled, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report_str)

Accuracy: 0.8073

Accuracy is the ratio of correctly predicted observations to the total observations. In this case, the model correctly predicts the class for approximately 80.73% of the instances.

Precision, recall, and F1-score for class 0 (Charged Off):

* Precision (positive predictive value): 0.57. Of all instances predicted as "Charged Off", 57% were actually "Charged Off".
* Recall (sensitivity or true positive rate): 0.05. Of all actual "Charged Off" instances, only 5% were correctly predicted.
* F1-score: 0.10. The harmonic mean of precision and recall.
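
To make these definitions concrete, the sketch below computes precision, recall, and F1 from confusion-matrix counts, and then shows why accuracy alone can mislead on imbalanced labels. All counts and labels are hypothetical, chosen only to resemble the low-recall situation above; they are not the model's actual confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for class 0: few true positives, many missed charge-offs
precision, recall, f1 = precision_recall_f1(tp=5, fp=4, fn=95)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# Toy labels with 80% class 1, and a baseline that always predicts class 1
y_true = np.array([1] * 80 + [0] * 20)
y_baseline = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall_0 = recall_score(y_true, y_baseline, pos_label=0)
print(f"baseline accuracy={baseline_accuracy:.2f}, class 0 recall={baseline_recall_0:.2f}")
```

A model whose accuracy is close to the majority-class share, while its minority-class recall is near zero, has learned little beyond predicting the majority class.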

In [None]:
C_range = np.array([.00000001, .0000001, .000001, .00001, .0001, .001, .1,
                   1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000])

# Lists to store training and testing scores
train_scores = []
test_scores = []

# Iterate over different values of C
for C_value in C_range:
    my_logreg = LogisticRegression(C=C_value, random_state=1)

    # Fit the model on the training set
    my_logreg.fit(X_train_scaled, y_train)

    # Append training and testing scores to the respective lists
    train_scores.append(my_logreg.score(X_train_scaled, y_train))
    test_scores.append(my_logreg.score(X_test_scaled, y_test))

# Plotting results
plt.figure()
plt.plot(C_range, train_scores, label='Train Score', marker='.')
plt.plot(C_range, test_scores, label='Test Score', marker='.')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.xscale("log")
plt.grid()
plt.legend()
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression

# Class 0 ('Charged Off') is the minority class, so errors on it are weighted more heavily
class_weights = {0: 2, 1: 1}  # Adjust the weights based on the imbalance ratio

model = LogisticRegression(class_weight=class_weights)
model.fit(X_train_scaled, y_train)
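
Rather than choosing the weights by hand, they could be derived from the class frequencies. The sketch below uses scikit-learn's `compute_class_weight` with the `"balanced"` heuristic on toy labels; the 20/80 split is an assumption for illustration and not the dataset's actual ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 20 charged-off loans (class 0) and 80 fully paid loans (class 1)
y_toy = np.array([0] * 20 + [1] * 80)

classes = np.unique(y_toy)

# "balanced" weights are n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Build a dict in the format accepted by LogisticRegression(class_weight=...)
class_weights_balanced = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights_balanced)
```

Passing `class_weight="balanced"` directly to `LogisticRegression` applies the same heuristic to the training labels without the intermediate step.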

In [None]:
# Make predictions on the testing set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report_str)

In [None]:
processed_df.head()

In [None]:
X_train_scaled.head()

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Create an instance of the SMOTE class with desired sampling strategy (50:50 balance)
sampling_strategy = 1.0  # Adjust this to achieve a 50:50 balance
smote = SMOTE(sampling_strategy=sampling_strategy, random_state=42)

# Fit and apply SMOTE to the training data only. The test set is left untouched
# so that evaluation reflects the real class balance rather than synthetic samples.
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)


# Initialize Logistic Regression model
model = LogisticRegression(random_state=449)

# Fit the model on the resampled training set
model.fit(X_train_resampled, y_train_resampled)

# Make predictions on the original (non-resampled) testing set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report_str)

In [None]:
C_range = np.array([.00000001, .0000001, .000001, .00001, .0001, .001, .1,
                   1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000])


# Lists to store training and testing scores
train_scores = []
test_scores = []

# Iterate over different values of C
for C_value in C_range:
    # Initialize Logistic Regression model with the current C value
    model = LogisticRegression(C=C_value, random_state=1)

    # Fit the model on the resampled training set
    model.fit(X_train_resampled, y_train_resampled)

    # Make predictions on the resampled training set
    y_train_pred = model.predict(X_train_resampled)
    
    # Make predictions on the original (non-resampled) testing set
    y_pred = model.predict(X_test_scaled)

    # Append training and testing scores to the respective lists
    train_scores.append(accuracy_score(y_train_resampled, y_train_pred))
    test_scores.append(accuracy_score(y_test, y_pred))

# Plotting results
plt.figure()
plt.plot(C_range, train_scores, label='Train Score', marker='.')
plt.plot(C_range, test_scores, label='Test Score', marker='.')
plt.xlabel('C (Log Scale)')
plt.ylabel('Accuracy')
plt.xscale("log")
plt.grid()
plt.legend()
plt.show()

In [None]:
# Check the class balance of the resampled training labels
class_counts = np.bincount(y_train_resampled)
class_0_count = class_counts[0]  # Count of class 0
class_1_count = class_counts[1]  # Count of class 1

print(f"Percentage of Class 0 : {class_0_count * 100 /(class_0_count+class_1_count)}")
print(f"Percentage of Class 1 : {class_1_count * 100 /(class_0_count+class_1_count)}")
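
The balance check above concerns the training labels only. As a self-contained illustration of keeping resampling inside the training split, the sketch below uses plain random oversampling from scikit-learn (a simpler stand-in, not SMOTE) on synthetic data. The data, variable names, and the 20/80 class ratio are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: roughly 20% class 0 and 80% class 1
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.2, 0.8], random_state=0)

# Split first, so the test set never sees resampled rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Oversample the minority class (0) in the training split until it matches the majority
minority = y_tr == 0
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority], replace=True, n_samples=n_majority, random_state=0
)

X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

print("Balanced training counts:", np.bincount(y_bal))
print("Untouched test counts:", np.bincount(y_te))
```

The test split keeps its original imbalance, so metrics computed on it describe how the model would behave on real, unbalanced loan data.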

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train a Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_resampled, y_train_resampled)

# Make predictions on the testing set
y_pred = rf_classifier.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)

# Display the results
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_report_str)

In [None]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

# Lists to store training and testing scores
train_scores = []
test_scores = []

n_estimators_range = [10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]

# Iterate over different values of n_estimators
for n_estimators_value in n_estimators_range:
    # Initialize Random Forest Classifier with the current n_estimators value
    rf_classifier = RandomForestClassifier(n_estimators=n_estimators_value, random_state=42)
    
    # Fit the model on the training set
    rf_classifier.fit(X_train_resampled, y_train_resampled)
    
    # Make predictions on the training set
    y_train_pred = rf_classifier.predict(X_train_resampled)
    
    # Make predictions on the testing set
    y_test_pred = rf_classifier.predict(X_test_scaled)
    
    # Append training and testing scores to the respective lists
    train_scores.append(accuracy_score(y_train_resampled, y_train_pred))
    test_scores.append(accuracy_score(y_test, y_test_pred))

# Plotting results
plt.figure()
plt.plot(n_estimators_range, train_scores, label='Train Score', marker='.')
plt.plot(n_estimators_range, test_scores, label='Test Score', marker='.')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Accuracy')
plt.grid()
plt.legend()
plt.show()

