In [1]:
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../../../..')))

In [2]:
from data_sc.src.utils.db import get_db_connection

# Open the database connection (the helper reports that it reads its settings from .env)
conn = get_db_connection()

Database connection established successfully (using .env).


In [13]:
import pandas as pd

# Load table into DataFrame
query = "SELECT * FROM legislative_proposal"
df = pd.read_sql(query, conn)

### Import data into DataFrame

We use raw SQL to query the `legislative_proposal` table and load it into a DataFrame.

In [14]:
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7877 entries, 0 to 7876
Data columns (total 28 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   id                                7877 non-null   int64         
 1   title                             7877 non-null   object        
 2   idp                               7591 non-null   float64       
 3   senate_registration_number        7860 non-null   object        
 4   cdep_registration_number          7877 non-null   object        
 5   government_registration_number    7530 non-null   object        
 6   first_chamber                     7877 non-null   object        
 7   initiative                        7877 non-null   object        
 8   opinion                           2945 non-null   object        
 9   urgent_procedure                  7877 non-null   object        
 10  status                            7875 non-null 

Unnamed: 0,id,title,idp,senate_registration_number,cdep_registration_number,government_registration_number,first_chamber,initiative,opinion,urgent_procedure,...,first_senate_registration_number,cdep_active,senate_active,status_cdep,status_senate,promulgare,matching_title,normalized_title,latest_procedure_date,search_vector
count,7877.0,7877,7591.0,7860,7877.0,7530.0,7877,7877,2945,7877,...,244.0,7877,7877,2440,1797,7877,260.0,7877,7877,7877
unique,,7093,,7793,7648.0,2588.0,8,2541,2829,4,...,1.0,2,2,68,29,3,174.0,7023,1016,6985
top,,Propunere legislativă pentru modificarea și co...,,L98/2022,,,Camera Deputaţilor,Proiect de Lege,x,nu,...,,False,False,trimis pentru raport la comisiile permanente a...,procedura legislativa încetata prin respingere...,finalizat,,propunere legislativa pentru modificarea si co...,2025-02-10,'complet':6 'educ':8 'leg':7 'legisl':2 'modif...
freq,,34,,3,222.0,4941.0,3495,1189,100,4840,...,244.0,5619,7627,1394,1236,5374,67.0,38,142,39
mean,6099.180779,,18032.933606,,,,,,,,...,,,,,,,,,,
min,136.0,,13413.0,,,,,,,,...,,,,,,,,,,
25%,3045.0,,15836.5,,,,,,,,...,,,,,,,,,,
50%,5128.0,,18042.0,,,,,,,,...,,,,,,,,,,
75%,7136.0,,20218.5,,,,,,,,...,,,,,,,,,,
max,24679.0,,22310.0,,,,,,,,...,,,,,,,,,,


### Dataset Overview

We begin by inspecting the overall structure of the `LegislativeProposal` dataset using `.info()` and `.describe()` to get a summary of data types, non-null counts, and distribution statistics. This helps us identify missing values, categorical vs numerical fields, and potential anomalies.


In [15]:
# Percentage of missing values per column
missing_percentage = df.isnull().mean().sort_values(ascending=False) * 100

# Unique value counts
unique_counts = df.nunique().sort_values(ascending=False)

# Combine into a profiling DataFrame
profile = pd.DataFrame({
    'Data Type': df.dtypes,
    'Missing (%)': missing_percentage,
    'Unique Values': unique_counts,
    'Sample Values': df.apply(lambda col: col.dropna().unique()[:5])
})

profile


Unnamed: 0,Data Type,Missing (%),Unique Values,Sample Values
active,bool,0.0,2,"[True, False]"
cdep_active,bool,0.0,2,"[True, False]"
cdep_registration_number,object,0.0,7648,"[PLX392/2024, PLX158/2018, PLX310/2014, PLX190..."
created_at,datetime64[ns],0.0,7877,"[2024-08-25 05:39:41.773574, 2024-09-21 22:17:..."
deadline,object,67.030595,65,"[, Termenul de adoptare tacită este de 45 de z..."
first_chamber,object,0.0,8,"[Camera Deputaților, Senatul, Camera Deputaţil..."
first_senate_registration_number,object,96.902374,1,[]
government_registration_number,object,4.40523,2588,"[, E82/28.06.2016, E40/22.02.2024, E5/16.01.20..."
id,int64,0.0,7877,"[136, 7659, 4748, 6307, 4747]"
idp,float64,3.630824,7591,"[21715.0, 16973.0, 13775.0, 15302.0, 14219.0]"


### Column Profiling

The table below summarizes each column in the dataset, showing:
- The data type
- Percentage of missing values
- Number of unique values (helps identify categorical vs identifier fields)
- A few sample values to understand their structure

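This profile guides data cleaning and transformation decisions. Note that several sample lists start with an empty string (for example `deadline` and `government_registration_number`), so `Missing (%)` counts only nulls and understates how many values are effectively absent.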
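The sample values for `deadline` and `government_registration_number` begin with an empty string, which `isnull()` does not count as missing. The following is a minimal sketch, on hypothetical toy data, of a missing-rate calculation that also treats blank strings as missing. The helper name `missing_with_blanks` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def missing_with_blanks(frame: pd.DataFrame) -> pd.Series:
    """Percentage of values per column that are null or blank strings."""

    def is_blank(value) -> bool:
        # Only strings can be blank; numbers, dates and nulls are handled elsewhere.
        return isinstance(value, str) and value.strip() == ""

    blank = frame.apply(lambda col: col.map(is_blank).astype(bool))
    return ((frame.isnull() | blank).mean() * 100).round(2)


# Hypothetical toy data that mimics the shape of two columns in the table
toy = pd.DataFrame({
    "government_registration_number": ["", "E82/28.06.2016", None, " "],
    "idp": [21715.0, None, 13775.0, 15302.0],
})

result = missing_with_blanks(toy)
print(result)
```

Applied to the real table, the same call would be `missing_with_blanks(df)`, and comparing it with `df.isnull().mean() * 100` shows which columns hide blanks.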


In [21]:
text_fields = df.select_dtypes(include='object').columns
numeric_fields = df.select_dtypes(include='number').columns
bool_fields = df.select_dtypes(include='bool').columns
date_fields = df.select_dtypes(include='datetime').columns

In [22]:
# Text: Length and uniqueness
df[text_fields].apply(lambda x: pd.Series({
    'Missing (%)': x.isnull().mean() * 100,
    'Unique Count': x.nunique(),
    'Average Length': x.dropna().apply(len).mean() if x.dropna().apply(lambda v: isinstance(v, str)).all() else 0
}))

Unnamed: 0,title,senate_registration_number,cdep_registration_number,government_registration_number,first_chamber,initiative,opinion,urgent_procedure,status,law_character,deadline,first_senate_registration_number,status_cdep,status_senate,promulgare,matching_title,normalized_title,latest_procedure_date,search_vector
Missing (%),0.0,0.215818,0.0,4.40523,0.0,0.0,62.61267,0.0,0.02539,0.304685,67.030595,96.902374,69.02374,77.186746,0.0,96.699251,0.0,0.0,0.0
Unique Count,7093.0,7793.0,7648.0,2588.0,8.0,2541.0,2829.0,4.0,3361.0,8.0,65.0,1.0,68.0,29.0,3.0,174.0,7023.0,1016.0,6985.0
Average Length,179.163387,8.834097,10.530913,5.001726,16.08328,39.51314,13.696095,2.0,112.116825,5.800204,17.110897,0.0,63.386885,68.5665,6.775676,159.334615,179.163387,0.0,222.889933


In [23]:
# Numeric: Descriptive stats
df[numeric_fields].describe()

Unnamed: 0,id,idp,year_issue
count,7877.0,7591.0,7865.0
mean,6099.180779,18032.933606,2019.281627
std,5500.831452,2494.303088,3.330842
min,136.0,13413.0,2013.0
25%,3045.0,15836.5,2016.0
50%,5128.0,18042.0,2019.0
75%,7136.0,20218.5,2022.0
max,24679.0,22310.0,2025.0


In [25]:
# Boolean: Value counts
for col in bool_fields:
    print(f"{df[col].value_counts(dropna=False)}\n")

active
False    5374
True     2503
Name: count, dtype: int64

published
False    4523
True     3354
Name: count, dtype: int64

cdep_active
False    5619
True     2258
Name: count, dtype: int64

senate_active
False    7627
True      250
Name: count, dtype: int64



### Field Type Profiling

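We profile columns by data type:

- **Text fields** are analyzed for missing values, number of unique entries, and average length. The average length falls back to 0 when a column holds non-string objects, which is the likely reason `latest_procedure_date` reports 0.
- **Numeric fields** are summarized using standard statistics (mean, std, min, max). Note that `id` and `idp` are identifiers, so only `year_issue` has a meaningful distribution.
- **Boolean fields** show the distribution of True/False to identify dominant values.
- **Date fields** are selected into `date_fields` but are not profiled in this section.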
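The `date_fields` selection is created but never summarized. A minimal sketch of a matching date profile could look like the following. The helper name `profile_dates` and the toy rows are hypothetical, and only `created_at` is known from the profile table to be a `datetime64` column.

```python
import pandas as pd


def profile_dates(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing rate and covered range for every datetime column."""
    date_cols = frame.select_dtypes(include="datetime").columns
    dates = frame[date_cols]
    return pd.DataFrame({
        "Missing (%)": dates.isnull().mean() * 100,
        "Earliest": dates.min(),
        "Latest": dates.max(),
    })


# Hypothetical toy rows; the real notebook would pass `df` instead
toy = pd.DataFrame({
    "title": ["Lege A", "Lege B", "Lege C"],
    "created_at": pd.to_datetime(["2024-08-25", "2024-09-21", None]),
})

dates_profile = profile_dates(toy)
print(dates_profile)
```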


In [19]:
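# Correlation (numeric_only=True keeps numeric and boolean columns)
# A bare expression is rendered only when it is the last statement in a cell,
# so the matrix is stored and printed explicitly.
corr = df.corr(numeric_only=True)
print(corr)

# Duplicate rows (compares all columns, including the unique `id`)
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")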


Duplicate rows: 0


### Correlation & Duplication Analysis

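We check:
- **Correlations** among the numeric and boolean columns (`numeric_only=True` drops everything else). Since `id` and `idp` are identifiers, correlations involving them are better read as a sanity check than as a modeling signal.
- **Duplicate rows**, which may indicate redundancy or data ingestion issues. None were found, but every row has a unique `id`, so fully identical rows cannot occur. Duplicated content would only show up if identifier columns were excluded from the comparison.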
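Because `id` is unique for every row, a full-row `df.duplicated()` check cannot flag anything. A minimal sketch, on hypothetical toy data, of repeating the check while ignoring surrogate keys follows. The helper name `duplicates_ignoring` and the choice of ignored columns are assumptions, and on the real table `created_at` would probably need to be ignored as well, since it is also unique per row.

```python
import pandas as pd


def duplicates_ignoring(frame: pd.DataFrame, ignore: list[str]) -> int:
    """Count rows that repeat an earlier row once the ignored columns are left out."""
    subset = [col for col in frame.columns if col not in ignore]
    return int(frame.duplicated(subset=subset).sum())


# Hypothetical toy rows: the first two differ only in their surrogate key
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Lege A", "Lege A", "Lege B"],
    "year_issue": [2020, 2020, 2021],
})

full_row = int(toy.duplicated().sum())                  # 0: `id` makes every row distinct
content_only = duplicates_ignoring(toy, ignore=["id"])  # 1: the second row repeats the first

print(f"Full-row duplicates: {full_row}")
print(f"Duplicates ignoring id: {content_only}")
```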


## 🧾 LegislativeProposal Dataset Profiling Report

This notebook provides a column-wise profile of the `LegislativeProposal` model. Key findings:

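- Several fields have high missing rates and may need imputation or exclusion depending on the task: `first_senate_registration_number` (96.9%), `matching_title` (96.7%), `status_senate` (77.2%), `status_cdep` (69.0%), `deadline` (67.0%), and `opinion` (62.6%). `normalized_title` has no missing values.
- Boolean fields are imbalanced to varying degrees: `published` is True for about 43% of rows, `active` for 32%, `cdep_active` for 29%, and `senate_active` for only 3% (250 of 7877).
- `id` is unique for every row and `idp` is unique among its non-null values (3.6% missing). `title` has 7093 distinct values across 7877 rows, so some titles repeat (the most frequent one appears 34 times), and `initiative` has 2541 distinct values.
- Potential data issues:
  - High cardinality in nominally categorical fields: `status` has 3361 distinct values and `initiative` has 2541.
  - Inconsistent encoding: `first_chamber` has 8 distinct values, at least partly because the same chamber name appears with different diacritics (`Camera Deputaților` vs `Camera Deputaţilor`).
  - Empty strings that are not counted as missing: the most frequent `government_registration_number` value appears to be the empty string (4941 occurrences).
  - Long text fields: `search_vector` averages about 223 characters and `title` about 179.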

Next steps:
- Handle missing data
- Normalize or encode categorical features
- Possibly enrich `report_cttee` with external joins
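As one possible starting point for the normalization step, the sketch below merges the two diacritic spellings seen in `first_chamber` before encoding the column. It assumes the variants differ only in the legacy cedilla forms of Romanian `ș`/`ț`, which is a guess based on the sample values. The names `CEDILLA_TO_COMMA` and `normalize_label` are hypothetical.

```python
import pandas as pd

# Map legacy cedilla letters to the comma-below letters used in standard Romanian
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # uppercase S
    "\u0162": "\u021a",  # uppercase T
})


def normalize_label(value):
    """Trim whitespace and unify diacritics; non-string values pass through unchanged."""
    if not isinstance(value, str):
        return value
    return value.strip().translate(CEDILLA_TO_COMMA)


# Hypothetical toy column with both spellings of the same chamber
toy = pd.Series(
    ["Camera Deputa\u0163ilor", "Camera Deputa\u021bilor", "Senatul"],
    name="first_chamber",
)

cleaned = toy.map(normalize_label)
codes = cleaned.astype("category").cat.codes  # integer code per distinct label

print(toy.nunique(), "->", cleaned.nunique())
print(codes.tolist())
```

On the real table, comparing `df["first_chamber"].nunique()` before and after `map(normalize_label)` would show how many of the 8 values are encoding variants.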
