# Investigating `mths_since_last_delinq` and `mths_since_last_major_derog`

**Question**: Are these fields specific to this LC loan, or do they reflect the borrower's entire credit bureau profile?

**Data dictionary definitions**:
- `mths_since_last_delinq`: "The number of months since the **borrower's** last delinquency."
- `mths_since_last_major_derog`: "Months since most recent 90-day or worse rating"

The language suggests credit bureau data, but let's verify empirically.

In [1]:
import pandas as pd
import numpy as np

# --- UPDATE THIS PATH ---
df = pd.read_csv(r'C:\Users\gavro\OneDrive\Desktop\aravalli\data\accepted_2007_to_2018Q4.csv', low_memory=False)

# Parse dates
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], format='%b-%Y', errors='coerce')
df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y', errors='coerce')

# Columns of interest for this investigation (listed for reference; not used below)
cols = ['id', 'loan_status', 'issue_d', 'last_pymnt_d', 'term',
        'mths_since_last_delinq', 'mths_since_last_major_derog',
        'delinq_2yrs', 'acc_now_delinq']
print(f"Total loans: {len(df):,}")
print("\nNull rates for key fields:")
for c in ['mths_since_last_delinq', 'mths_since_last_major_derog']:
    print(f"  {c}: {df[c].isna().mean():.1%} null")
print("\nLoan status distribution:")
print(df['loan_status'].value_counts())

Total loans: 2,260,701

Null rates for key fields:
  mths_since_last_delinq: 51.2% null
  mths_since_last_major_derog: 74.3% null

Loan status distribution:
loan_status
Fully Paid                                             1076751
Current                                                 878317
Charged Off                                             268559
Late (31-120 days)                                       21467
In Grace Period                                           8436
Late (16-30 days)                                         4349
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     40
Name: count, dtype: int64


---
## Check 1: Do delinquency months align with this loan's payment gap?

If `mths_since_last_delinq` is loan-specific, then for delinquent loans it should roughly equal
the number of months between `last_pymnt_d` and the reporting date (or the latest date in the dataset).

We'll compute the "months since last payment" and compare it to `mths_since_last_delinq`.
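
The month arithmetic behind this comparison can be sketched on a tiny made-up frame. The three rows and the helper name `months_between` are illustrative only and are not taken from the LC file. The helper counts calendar-month boundaries and ignores the day of month, which suits dates parsed from `%b-%Y` strings.

```python
import pandas as pd

def months_between(later: pd.Timestamp, earlier: pd.Series) -> pd.Series:
    """Whole calendar months from each date in `earlier` up to `later`."""
    return (later.year - earlier.dt.year) * 12 + (later.month - earlier.dt.month)

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "last_pymnt_d": pd.to_datetime(["Feb-2019", "Mar-2018", None],
                                   format="%b-%Y", errors="coerce"),
    "mths_since_last_delinq": [30.0, 12.0, 5.0],
})
snapshot = pd.Timestamp("2019-03-01")

toy["months_since_last_pymt"] = months_between(snapshot, toy["last_pymnt_d"])
# A gap near zero is what a loan-specific field would produce
toy["gap"] = toy["mths_since_last_delinq"] - toy["months_since_last_pymt"]
print(toy)
```

Rows with a missing `last_pymnt_d` end up with a missing gap, which is why the cell below drops them before comparing.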

In [2]:
# Use the max last_pymnt_d in the dataset as a proxy for the reporting/snapshot date
snapshot_date = df['last_pymnt_d'].max()
print(f"Approximate snapshot date (max last_pymnt_d): {snapshot_date}")

# Calculate months since last payment
df['months_since_last_pymt'] = (
    (snapshot_date.year - df['last_pymnt_d'].dt.year) * 12 +
    (snapshot_date.month - df['last_pymnt_d'].dt.month)
)

# Focus on late loans
late_statuses = ['Late (31-120 days)', 'Late (16-30 days)', 'Default', 'Charged Off']
late = df[df['loan_status'].isin(late_statuses)].copy()

print(f"\nLate/default loans: {len(late):,}")
print(f"Of those, have mths_since_last_delinq: {late['mths_since_last_delinq'].notna().sum():,}")

# Compare: if loan-specific, these should be close
# (.copy() makes an independent frame, so adding 'gap' does not trigger SettingWithCopyWarning)
late_with_both = late.dropna(subset=['mths_since_last_delinq', 'months_since_last_pymt']).copy()
late_with_both['gap'] = late_with_both['mths_since_last_delinq'] - late_with_both['months_since_last_pymt']

print(f"\n--- Difference: mths_since_last_delinq MINUS months_since_last_payment ---")
print(late_with_both['gap'].describe())
print(f"\nDistribution of the gap:")
print(late_with_both['gap'].value_counts().sort_index().head(20))

Approximate snapshot date (max last_pymnt_d): 2019-03-01 00:00:00

Late/default loans: 294,415
Of those, have mths_since_last_delinq: 151,549

--- Difference: mths_since_last_delinq MINUS months_since_last_payment ---
count    150468.000000
mean         10.075936
std          27.663833
min        -132.000000
25%          -9.000000
50%           9.000000
75%          29.000000
max         216.000000
Name: gap, dtype: float64

Distribution of the gap:
gap
-132.0    1
-130.0    3
-129.0    2
-127.0    4
-126.0    1
-125.0    6
-123.0    4
-122.0    2
-121.0    1
-120.0    1
-119.0    2
-118.0    6
-117.0    2
-116.0    1
-115.0    3
-114.0    5
-113.0    5
-112.0    4
-111.0    6
-110.0    2
Name: count, dtype: int64
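
The first 20 rows of `value_counts().sort_index()` show only the extreme negative tail of the gap. A coarser view can be sketched by binning with `pd.cut`. The bin edges, the labels, and the synthetic `gap` values below are assumptions chosen for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for late_with_both['gap'] (illustrative values only)
gap = pd.Series([-40, -9, -3, 0, 2, 9, 15, 29, 60, 120], name="gap", dtype="float64")

# Hypothetical bin edges, in months; each bin is closed on the right
edges = [-np.inf, -24, -6, 6, 24, np.inf]
labels = ["<= -24", "-24 to -6", "-6 to 6", "6 to 24", "> 24"]
binned = pd.cut(gap, bins=edges, labels=labels)

# Share of loans per bin, kept in bin order rather than sorted by frequency
share = binned.value_counts(normalize=True).sort_index()
print(share.round(2))
```

A loan-specific field would pile almost all of the mass into the middle bin; a wide spread across bins points the other way.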




In [3]:
# Correlation of the two measures for late loans, on a random sample of up to 5,000 rows
sample = late_with_both.sample(min(5000, len(late_with_both)), random_state=42)
print("If loan-specific, points should cluster on the diagonal (y=x):")
print(f"Correlation: {sample['mths_since_last_delinq'].corr(sample['months_since_last_pymt']):.3f}")

# How often is mths_since_last_delinq LESS than months_since_last_pymt?
# (that would indicate a delinquency MORE RECENT than this loan's last payment gap, i.e. on another account)
pct_delinq_more_recent = (late_with_both['mths_since_last_delinq'] < late_with_both['months_since_last_pymt']).mean()
pct_delinq_much_older = (late_with_both['mths_since_last_delinq'] > late_with_both['months_since_last_pymt'] + 6).mean()

print(f"\n% where mths_since_last_delinq < months_since_last_payment (delinq more recent than this loan's gap): {pct_delinq_more_recent:.1%}")
print(f"% where mths_since_last_delinq > months_since_last_payment + 6 (delinq much older): {pct_delinq_much_older:.1%}")
print("\nIf either of these is substantial, the field is NOT loan-specific.")

If loan-specific, points should cluster on the diagonal (y=x):
Correlation: 0.009

% where mths_since_last_delinq < months_since_last_payment (delinq more recent than this loan's gap): 36.0%
% where mths_since_last_delinq > months_since_last_payment + 6 (delinq much older): 53.4%

If either of these is substantial, the field is NOT loan-specific.


---
## Check 2: Current loans with non-null `mths_since_last_delinq`

If this field is loan-specific, **Current** loans should rarely have a value here
(a Current loan is not delinquent now, although it may have been late earlier and since caught up).
If a large share of Current loans DO have values, the field is pulling from the borrower's broader credit file.
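
One way to frame this check is as a population rate per status. The sketch below runs on a small made-up frame (the values are illustrative, not from the LC file) and computes the share of loans in each status that carry any value for the field.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the LC file
toy = pd.DataFrame({
    "loan_status": ["Current", "Current", "Current",
                    "Charged Off", "Charged Off", "Fully Paid"],
    "mths_since_last_delinq": [np.nan, 2.0, 40.0, 7.0, np.nan, 18.0],
})

# Share of loans in each status with a non-null value for the field
populated = (
    toy["mths_since_last_delinq"].notna()
    .groupby(toy["loan_status"])
    .mean()
    .rename("share_populated")
)
print(populated)
```

If the rate for Current loans is close to the rate for delinquent statuses, loan status is not what drives whether the field is populated.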

In [4]:
current = df[df['loan_status'] == 'Current'].copy()

print(f"Current loans: {len(current):,}")
print(f"Current loans WITH mths_since_last_delinq: {current['mths_since_last_delinq'].notna().sum():,} ({current['mths_since_last_delinq'].notna().mean():.1%})")
print(f"Current loans WITH mths_since_last_major_derog: {current['mths_since_last_major_derog'].notna().sum():,} ({current['mths_since_last_major_derog'].notna().mean():.1%})")

print(f"\n--- mths_since_last_delinq for CURRENT loans ---")
print(current['mths_since_last_delinq'].describe())

# KEY TEST: Current loans where mths_since_last_delinq is very small (e.g. 0-3)
# If loan-specific, this should be nearly impossible for a Current loan
current_recent_delinq = current[current['mths_since_last_delinq'] <= 3]
print(f"\nCurrent loans with mths_since_last_delinq <= 3: {len(current_recent_delinq):,}")
if len(current_recent_delinq) > 0:
    print("  --> These borrowers were recently delinquent (on some account) but this LC loan is Current.")
    print("  --> STRONG evidence this is a credit bureau field, not loan-specific.")
    print(current_recent_delinq[['loan_status', 'last_pymnt_d', 'mths_since_last_delinq',
                                  'acc_now_delinq', 'delinq_2yrs']].head(10))

Current loans: 878,317
Current loans WITH mths_since_last_delinq: 415,677 (47.3%)
Current loans WITH mths_since_last_major_derog: 216,600 (24.7%)

--- mths_since_last_delinq for CURRENT loans ---
count    415677.000000
mean         34.990781
std          21.863361
min           0.000000
25%          17.000000
50%          32.000000
75%          51.000000
max         195.000000
Name: mths_since_last_delinq, dtype: float64

Current loans with mths_since_last_delinq <= 3: 10,776
  --> These borrowers were recently delinquent (on some account) but this LC loan is Current.
  --> STRONG evidence this is a credit bureau field, not loan-specific.
     loan_status last_pymnt_d  mths_since_last_delinq  acc_now_delinq  \
328      Current   2019-02-01                     2.0             0.0   
608      Current   2019-02-01                     3.0             0.0   
784      Current   2019-02-01                     2.0             0.0   
1238     Current   2019-02-01                     3.0        

In [5]:
# Also check: Fully Paid loans with mths_since_last_delinq populated
fully_paid = df[df['loan_status'] == 'Fully Paid'].copy()
print(f"Fully Paid loans: {len(fully_paid):,}")
print(f"Fully Paid WITH mths_since_last_delinq: {fully_paid['mths_since_last_delinq'].notna().sum():,} ({fully_paid['mths_since_last_delinq'].notna().mean():.1%})")

# A Fully Paid loan with mths_since_last_delinq = 0 or small most likely reflects a recent
# delinquency on another account (a late-then-cured payment on THIS loan is also possible)
fp_recent = fully_paid[fully_paid['mths_since_last_delinq'] <= 6]
print(f"Fully Paid with mths_since_last_delinq <= 6: {len(fp_recent):,}")
if len(fp_recent) > 0:
    print("  --> Loan was paid off successfully, but borrower had recent delinquency elsewhere.")
    print(fp_recent[['loan_status', 'last_pymnt_d', 'mths_since_last_delinq',
                      'acc_now_delinq', 'delinq_2yrs']].head(10))

Fully Paid loans: 1,076,751
Fully Paid WITH mths_since_last_delinq: 528,750 (49.1%)
Fully Paid with mths_since_last_delinq <= 6: 41,885
  --> Loan was paid off successfully, but borrower had recent delinquency elsewhere.
    loan_status last_pymnt_d  mths_since_last_delinq  acc_now_delinq  \
1    Fully Paid   2016-06-01                     6.0             0.0   
7    Fully Paid   2017-01-01                     3.0             0.0   
49   Fully Paid   2018-11-01                     6.0             0.0   
117  Fully Paid   2016-08-01                     3.0             0.0   
142  Fully Paid   2016-07-01                     6.0             0.0   
204  Fully Paid   2019-01-01                     1.0             0.0   
213  Fully Paid   2018-10-01                     4.0             0.0   
216  Fully Paid   2019-01-01                     1.0             0.0   
242  Fully Paid   2019-02-01                     4.0             0.0   
245  Fully Paid   2018-05-01                     1.0       

---
## Check 3: Same last payment date, different loan status — why?

Look at loans with `last_pymnt_d` = Feb 2019. Some are Current, some are Late or Charged Off.
If both groups last paid in Feb 2019, why the different statuses?

**Possible explanations**:
1. Snapshot timing: "Current" loans with Feb 2019 last payment may have been reported while
   the Feb payment was still the most recent expected payment (i.e. data was pulled in early March).
2. Payment due dates differ: some loans may have had a March payment due, others may not have.
3. Grace periods: LC may allow a grace period before marking late.
4. The data is a point-in-time snapshot, not end-of-life.
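
To make explanations 1 to 3 concrete, here is a sketch of how a status could follow from days past due at the snapshot. The 16-30 and 31-120 day buckets are read off the status labels in the data. Treating days 1 to 15 as the grace period, lumping everything past 120 days into one bucket, the example dates, and the helper name `status_from_days_past_due` are all assumptions. This is not LC's actual servicing logic.

```python
from datetime import date

def status_from_days_past_due(next_due: date, snapshot: date) -> str:
    """Map days past due at the snapshot to a status label (assumed thresholds)."""
    days_late = (snapshot - next_due).days
    if days_late <= 0:
        return "Current"
    if days_late <= 15:           # assumption: grace period covers days 1-15
        return "In Grace Period"
    if days_late <= 30:
        return "Late (16-30 days)"
    if days_late <= 120:
        return "Late (31-120 days)"
    return "Default or Charged Off"  # assumption: lumped together here

# Hypothetical loans sharing one snapshot date but with different unpaid due dates
snapshot = date(2019, 3, 20)
due_dates = [date(2019, 3, 25), date(2019, 3, 10), date(2019, 2, 25), date(2019, 2, 10)]
results = {d: status_from_days_past_due(d, snapshot) for d in due_dates}
for d, status in results.items():
    print(d, "->", status)
```

Under these assumptions, loans with the same last payment month can land in four different statuses purely because of where their oldest unpaid due date falls relative to the snapshot.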

In [6]:
# Pick a specific last_pymnt_d to investigate
target_date = '2019-02-01'  # Feb 2019
feb19 = df[df['last_pymnt_d'] == target_date].copy()

print(f"Loans with last payment in Feb 2019: {len(feb19):,}")
print(f"\nLoan status breakdown:")
print(feb19['loan_status'].value_counts())

print(f"\n--- Comparing Current vs Late (31-120) loans, both last paid Feb 2019 ---")
for status in ['Current', 'Late (31-120 days)', 'Late (16-30 days)', 'Charged Off']:
    subset = feb19[feb19['loan_status'] == status]
    if len(subset) == 0:
        continue
    print(f"\n=== {status} (n={len(subset):,}) ===")
    print(f"  issue_d range: {subset['issue_d'].min()} to {subset['issue_d'].max()}")
    print(f"  mths_since_last_delinq: mean={subset['mths_since_last_delinq'].mean():.1f}, median={subset['mths_since_last_delinq'].median():.1f}, null%={subset['mths_since_last_delinq'].isna().mean():.1%}")
    print(f"  acc_now_delinq: mean={subset['acc_now_delinq'].mean():.2f}")
    print(f"  delinq_2yrs: mean={subset['delinq_2yrs'].mean():.2f}")
    print(f"  last_pymnt_amnt: mean=${subset['last_pymnt_amnt'].mean():.2f}")

Loans with last payment in Feb 2019: 97,074

Loan status breakdown:
loan_status
Current               52493
Fully Paid            33627
In Grace Period        4605
Late (16-30 days)      3495
Late (31-120 days)     2038
Charged Off             816
Name: count, dtype: int64

--- Comparing Current vs Late (31-120) loans, both last paid Feb 2019 ---

=== Current (n=52,493) ===
  issue_d range: 2013-10-01 00:00:00 to 2018-12-01 00:00:00
  mths_since_last_delinq: mean=34.4, median=31.0, null%=51.2%
  acc_now_delinq: mean=0.00
  delinq_2yrs: mean=0.31
  last_pymnt_amnt: mean=$532.24

=== Late (31-120 days) (n=2,038) ===
  issue_d range: 2014-01-01 00:00:00 to 2018-12-01 00:00:00
  mths_since_last_delinq: mean=31.2, median=27.0, null%=44.0%
  acc_now_delinq: mean=0.00
  delinq_2yrs: mean=0.43
  last_pymnt_amnt: mean=$576.30

=== Late (16-30 days) (n=3,495) ===
  issue_d range: 2014-02-01 00:00:00 to 2018-12-01 00:00:00
  mths_since_last_delinq: mean=33.6, median=30.0, null%=46.2%
  acc_now_de

In [7]:
# Deeper dive: look at individual examples side by side
detail_cols = ['loan_status', 'issue_d', 'last_pymnt_d', 'last_pymnt_amnt', 'installment',
               'term', 'mths_since_last_delinq', 'mths_since_last_major_derog',
               'delinq_2yrs', 'acc_now_delinq', 'grade', 'sub_grade']

feb19_current = feb19[feb19['loan_status'] == 'Current'][detail_cols]
feb19_late = feb19[feb19['loan_status'].isin(['Late (31-120 days)', 'Late (16-30 days)'])][detail_cols]

print("=== Sample CURRENT loans, last paid Feb 2019 ===")
print(feb19_current.head(10).to_string())

print("\n=== Sample LATE loans, last paid Feb 2019 ===")
print(feb19_late.head(10).to_string())

=== Sample CURRENT loans, last paid Feb 2019 ===
   loan_status    issue_d last_pymnt_d  last_pymnt_amnt  installment        term  mths_since_last_delinq  mths_since_last_major_derog  delinq_2yrs  acc_now_delinq grade sub_grade
3      Current 2015-12-01   2019-02-01           829.90       829.90   60 months                     NaN                          NaN          0.0             0.0     C        C5
10     Current 2015-12-01   2019-02-01           508.30       508.30   60 months                    54.0                         54.0          0.0             0.0     C        C2
11     Current 2015-12-01   2019-02-01           363.07       363.07   60 months                     NaN                          NaN          0.0             0.0     C        C2
18     Current 2015-12-01   2019-02-01           471.77       471.77   60 months                    29.0                          NaN          0.0             0.0     B        B1
34     Current 2015-12-01   2019-02-01           381.23 

In [8]:
# KEY CHECK: Compare last_pymnt_amnt to installment
# If a "Current" loan's last payment was much larger than the installment,
# they may have paid ahead (prepaid next month), explaining why they're still Current
# despite last_pymnt_d being "old"

feb19['pymt_vs_installment'] = feb19['last_pymnt_amnt'] / feb19['installment']

print("Payment amount / installment ratio by loan status:")
print(feb19.groupby('loan_status')['pymt_vs_installment'].describe().round(2))

# If Current loans paid ~2x the installment, they may have covered the next month too
current_overpay = feb19[(feb19['loan_status'] == 'Current') & (feb19['pymt_vs_installment'] > 1.5)]
print(f"\nCurrent loans that paid >1.5x installment in Feb 2019: {len(current_overpay):,} / {len(feb19[feb19['loan_status']=='Current']):,}")

Payment amount / installment ratio by loan status:
                      count   mean    std   min   25%   50%    75%    max
loan_status                                                              
Charged Off           816.0   0.76   1.78  0.01  0.16  0.44   0.94  32.09
Current             52493.0   1.16   1.75  0.00  1.00  1.00   1.00  45.94
Fully Paid          33627.0  10.98  11.38  0.00  1.00  6.77  19.57  49.76
In Grace Period      4605.0   1.16   1.57  0.00  1.00  1.00   1.00  34.13
Late (16-30 days)    3495.0   1.13   1.38  0.00  1.00  1.00   1.00  45.97
Late (31-120 days)   2038.0   1.13   1.09  0.00  1.00  1.00   1.05  36.43

Current loans that paid >1.5x installment in Feb 2019: 982 / 52,493


---
## Check 4: Cross-reference `acc_now_delinq` with loan status

`acc_now_delinq` = "The number of accounts on which the borrower is now delinquent."

If a loan is Current but `acc_now_delinq` > 0, the delinquency must be on **other** accounts.
A non-trivial share of such loans would further support the credit bureau interpretation.
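
A sketch of the cross-tabulation on made-up rows; the values and the column name `any_acc_delinq` are illustrative and say nothing about the real distribution. `pd.crosstab` with `normalize="index"` gives the share of each status with at least one account currently delinquent.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the LC file
toy = pd.DataFrame({
    "loan_status": ["Current", "Current", "Current", "Late (31-120 days)", "Fully Paid"],
    "acc_now_delinq": [0.0, 1.0, 0.0, 2.0, 0.0],
})

# Flag borrowers with at least one account currently delinquent
toy["any_acc_delinq"] = np.where(toy["acc_now_delinq"] > 0, "yes", "no")

# Row-normalised crosstab: each row sums to 1
table = pd.crosstab(toy["loan_status"], toy["any_acc_delinq"], normalize="index")
print(table.round(2))
```

For a Late loan the flag may include the LC loan itself, so only the Current row isolates delinquency on other accounts.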

In [None]:
print("acc_now_delinq by loan_status:")
cross = df.groupby('loan_status')['acc_now_delinq'].agg(['mean', 'median', 'max', 'count'])
print(cross.round(3))

# Current loans where borrower is delinquent on other accounts
current_but_delinq_elsewhere = current[current['acc_now_delinq'] > 0]
print(f"\nCurrent loans where acc_now_delinq > 0: {len(current_but_delinq_elsewhere):,} ({len(current_but_delinq_elsewhere)/len(current):.1%})")
print("  --> These borrowers are delinquent on OTHER accounts but current on this LC loan.")
print("  --> Confirms these fields are from the borrower's full credit profile.")

---
## Check 5: Timing sanity — can `mths_since_last_delinq` predate the loan?

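If `mths_since_last_delinq`, counted back from the snapshot date, points to a date BEFORE the loan
was even issued, it clearly refers to a prior delinquency on another account.
This test assumes the field is measured relative to the snapshot. If it was instead recorded at
application, every non-null value predates issuance by construction, which leads to the same conclusion.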
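
The comparison depends on which date the month count is measured back from. The sketch below derives the implied delinquency month under both readings (snapshot date and issue date). The toy rows, the helper `shift_back_months`, and the `implied_*` column names are illustrative and not part of the dataset.

```python
import pandas as pd

def shift_back_months(anchor: pd.Series, months: pd.Series) -> pd.Series:
    """Return the first of the month lying `months` months before each anchor date."""
    idx = anchor.dt.year * 12 + (anchor.dt.month - 1) - months.astype(int)
    parts = pd.DataFrame({"year": idx // 12, "month": idx % 12 + 1, "day": 1})
    return pd.to_datetime(parts)

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "issue_d": pd.to_datetime(["2015-12-01", "2018-06-01"]),
    "mths_since_last_delinq": [54.0, 3.0],
})
toy["snapshot_d"] = pd.Timestamp("2019-03-01")

# Reading 1: months are counted back from the snapshot date
toy["implied_if_snapshot"] = shift_back_months(toy["snapshot_d"], toy["mths_since_last_delinq"])
# Reading 2: months are counted back from the issue date
toy["implied_if_issue"] = shift_back_months(toy["issue_d"], toy["mths_since_last_delinq"])

toy["predates_if_snapshot"] = toy["implied_if_snapshot"] < toy["issue_d"]
print(toy.drop(columns="snapshot_d"))
```

Under the second reading the implied date always falls on or before `issue_d`, so only the first reading makes the comparison in the next cell informative.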

In [9]:
# months_since_issue = months from issue_d to snapshot_date
df['months_since_issue'] = (
    (snapshot_date.year - df['issue_d'].dt.year) * 12 +
    (snapshot_date.month - df['issue_d'].dt.month)
)

# If mths_since_last_delinq > months_since_issue, the delinquency happened BEFORE the loan existed
has_delinq = df.dropna(subset=['mths_since_last_delinq']).copy()
predates_loan = has_delinq[has_delinq['mths_since_last_delinq'] > has_delinq['months_since_issue']]

print(f"Loans where mths_since_last_delinq > months since loan issue: {len(predates_loan):,} / {len(has_delinq):,} ({len(predates_loan)/len(has_delinq):.1%})")
print("  --> These delinquencies happened BEFORE this LC loan was even originated.")
print("  --> Definitive proof the field is from the borrower's credit bureau history.")

if len(predates_loan) > 0:
    print(f"\nSample:")
    print(predates_loan[['loan_status', 'issue_d', 'mths_since_last_delinq',
                          'months_since_issue', 'delinq_2yrs']].head(10).to_string())

Loans where mths_since_last_delinq > months since loan issue: 545,846 / 1,102,166 (49.5%)
  --> These delinquencies happened BEFORE this LC loan was even originated.
  --> Definitive proof the field is from the borrower's credit bureau history.

Sample:
   loan_status    issue_d  mths_since_last_delinq  months_since_issue  delinq_2yrs
6   Fully Paid 2015-12-01                    49.0                39.0          0.0
9   Fully Paid 2015-12-01                    75.0                39.0          0.0
10     Current 2015-12-01                    54.0                39.0          0.0
14  Fully Paid 2015-12-01                    42.0                39.0          0.0
29  Fully Paid 2015-12-01                    47.0                39.0          0.0
43  Fully Paid 2015-12-01                    70.0                39.0          0.0
46  Fully Paid 2015-12-01                    59.0                39.0          0.0
47  Fully Paid 2015-12-01                    45.0                39.0          0.0

---
## Summary / Interpretation

**Expected vs. observed findings:**

| Check | If loan-specific | If credit bureau | Observed above |
|-------|-----------------|------------------|----------------|
| Current loans with `mths_since_last_delinq` | Very rare | Common | 47.3% (415,677 of 878,317) |
| Fully Paid with small `mths_since_last_delinq` | Rare | Possible | 41,885 loans with a value <= 6 |
| Delinquency predates loan issuance | Impossible | Common | 49.5% of loans with a value (assuming months are counted back from the snapshot) |
| `acc_now_delinq` > 0 for Current loans | Contradictory | Expected | Cell not run |
| Correlation of delinq months vs payment gap | Very high | Moderate/weak | 0.009 (5,000-loan sample) |

**Re: Current vs Late with same `last_pymnt_d`:**

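This is most likely a **snapshot timing** issue. LC data is a point-in-time extract. A loan
marked "Current" with a Feb 2019 last payment date likely means the data was pulled before
the March 2019 payment was due (or within the grace period). The "Late" loans with the same
date were already past due when the snapshot was taken, which suggests their Feb 2019 payment
went toward an earlier installment they had missed.

Paying ahead does not explain the split: only 982 of the 52,493 Current loans (1.9%) paid more
than 1.5x the installment, and the median payment-to-installment ratio is 1.00 for both the
Current and Late groups. The `issue_d` ranges of the groups also overlap almost entirely, so
origination date alone does not separate them. Differences in payment due dates remain unchecked.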