# Extracting data from PDFs

In this small project, I extracted four data fields from a collection of PDF documents: each individual's name, CPF (Brazilian national identification number), date of birth, and a field called "julgados."

To do this, I wrote a set of functions built on Python's `re` module, with pdfminer supplying the text of each PDF. Each function extracts one field:

- Name Extraction: A function extracts each individual's name by matching the text that follows the "Nome:" label.

- CPF Extraction: A second function extracts CPF numbers by matching the text that follows the "CPF:" label.

- Date of Birth Extraction: A third function extracts the date of birth from the text that follows the "Data de nascimento:" label. The captured dates are normalized in a later cleanup step.

- Julgados Extraction: A fourth function extracts the "julgados" field by searching for the fixed phrases that state the outcome of a judgment, such as "JULGO PROCEDENTE" or "JULGO IMPROCEDENTE".
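
The field patterns described above can be sketched with slightly stricter regular expressions. This is an illustration rather than the notebook's own code: the names `CPF_PATTERN`, `DATE_PATTERN`, and `extract_fields` and the sample text are hypothetical, and the labels `CPF:` and `Data de nascimento:` are assumed to appear in the PDFs exactly as written here.

```python
import re

# Hypothetical patterns: a CPF is 11 digits with optional dots and dash,
# and a date of birth is assumed to be written as dd/mm/yyyy.
CPF_PATTERN = re.compile(r'CPF:\s*(\d{3}\.?\d{3}\.?\d{3}-?\d{2})')
DATE_PATTERN = re.compile(r'Data de nascimento:\s*(\d{2}/\d{2}/\d{4})')

def extract_fields(text):
    """Return the CPF and birth-date strings found in one document's text."""
    return {
        'cpfs': CPF_PATTERN.findall(text),
        'birthdays': DATE_PATTERN.findall(text),
    }

# Made-up sample text, not taken from the project's PDFs.
sample = "Nome: Fulano de Tal\nCPF: 123.456.789-09\nData de nascimento: 01/02/1990\n"
print(extract_fields(sample))
```

Compared with capturing everything after the label, patterns like these skip trailing text on the same line, at the cost of missing values written in an unexpected format.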

After extraction, I collected the fields into a pandas DataFrame, which makes the data easier to manipulate and analyze.

I then calculated each individual's age from the date of birth and stored it in an additional `age` column of the DataFrame.
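
The notebook computes age as a simple difference of calendar years. A minimal sketch of a variant that counts completed years, so that someone whose birthday has not yet occurred this year is one year younger, might look like this. The function name `age_on` and the example dates are hypothetical.

```python
from datetime import date

def age_on(birthday, today):
    """Completed years between `birthday` and `today`."""
    # True (counted as 1) when this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday

print(age_on(date(1988, 4, 11), date(2023, 3, 1)))   # 34, birthday still ahead
print(age_on(date(1988, 4, 11), date(2023, 4, 11)))  # 35, on the birthday itself
```

Passing `today` explicitly keeps the function reproducible; in practice it could be called with `date.today()`.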

Overall, this project shows how Python and its libraries can be used to extract, clean, and organize data from a collection of PDF documents. The resulting DataFrame can serve as a starting point for further analysis.

In [12]:
# Import libraries
import pandas as pd
import os
import re
from pdfminer.high_level import extract_text
from datetime import datetime
import numpy as np

In [13]:
## Function to extract names
def extract_name(content):
    name = r'Nome:\s*(.*)'
    names = re.findall(name, content, re.IGNORECASE)
    return names

## Function to extract dates of birth
def extract_birthday(birthday_content):
    birthday = r'Data de nascimento: \s*(.*)'
    birthdays = re.findall(birthday,birthday_content)
    return birthdays

## Function to extract CPFs
def extract_cpf(cpf_content):
    cpf = r'CPF: \s*(.*)'
    cpfs = re.findall(cpf,cpf_content)
    return cpfs


## Function to extract judgments (julgados)
def extract_julgado(julgo_content):
    search_phrases = [
        "JULGO PROCEDENTE",
        "JULGO IMPROCEDENTE",
        "JULGO PROCEDENTE EM PARTE",
        "JULGO PARCIALMENTE PROCEDENTE"
    ] 

    julgos = []

    sorted_phrases = sorted(search_phrases, key=len, reverse=True)
    pattern = '|'.join(re.escape(phrase) for phrase in sorted_phrases)

    matched_phrases = re.findall(pattern, julgo_content)
    julgos.extend(matched_phrases)

    return julgos

In [14]:
## Path to the PDF files
path = 'pdfs/'
files = os.listdir(path)
all_names = []
all_birthdays = []
all_cpfs = []
all_julgos = []

for x in files:
    if x.endswith('.pdf'):
        pdf_path = os.path.join(path, x)
        pdf_content = extract_text(pdf_path)
        #names
        names = extract_name(pdf_content)
        all_names.extend(names)
        #birthday
        birthdays = extract_birthday(pdf_content)
        all_birthdays.extend(birthdays)
        #cpfs
        cpfs = extract_cpf(pdf_content)
        all_cpfs.extend(cpfs)
       
        #julgado
        julgos = extract_julgado(pdf_content)
        all_julgos.extend(julgos)
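
The loop above extends four separate lists, so the columns only line up if every PDF yields exactly one match per field. A sketch of an alternative that keeps one record per file, and leaves a missing field empty instead of shifting later rows, could look like the following. The helper names `first_or_none` and `to_rows`, the file names, and the sample values are hypothetical.

```python
def first_or_none(matches):
    """Return the first match, or None when the field was not found."""
    return matches[0] if matches else None

def to_rows(extracted):
    """Turn {filename: {field: [matches]}} into one dictionary per file."""
    rows = []
    for filename, fields in extracted.items():
        row = {'file': filename}
        for field, matches in fields.items():
            row[field] = first_or_none(matches)
        rows.append(row)
    return rows

# Made-up extraction results: the second file has no CPF match.
extracted = {
    'a.pdf': {'name': ['Fulano de Tal'], 'CPF': ['123.456.789-09']},
    'b.pdf': {'name': ['Beltrano da Silva'], 'CPF': []},
}
rows = to_rows(extracted)
print(rows)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`, which fills the absent values with missing-data markers.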

In [16]:
df = pd.DataFrame({ 'name': all_names, 
                   'birthday': all_birthdays, 
                   'CPF': all_cpfs, 
                   'age': np.nan, 
                   'julgado' : all_julgos} )
df

Unnamed: 0,name,birthday,CPF,age,julgado
0,Felipe de Castro,/11/04/1988,001.002.003-88,,JULGO PROCEDENTE
1,Ana Maria Padrão,08/07/1960,002.004.005-90,,JULGO IMPROCEDENTE
2,Marcos Almeida Rodrigues,05/10/1970,697.831247-90,,JULGO PROCEDENTE EM PARTE
3,Tânia Dias Pinto,23/08/1936,977.027.684-79,,JULGO IMPROCEDENTE
4,Joao Martins Ferreira,12/07/1986,866.779.039-74,,JULGO PARCIALMENTE PROCEDENTE
5,Nicole Sousa Rocha,27/08/1954,231003005-92,,JULGO PROCEDENTE
6,Pedro Paula Andrade,12/01/1995,111.003.005-91,,JULGO PROCEDENTE EM PARTE


In [None]:
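# Keep CPF as text and strip its punctuation.
df.CPF = df.CPF.astype(str)
df.CPF = df.CPF.str.replace('.', '', regex=False)  # literal dot, not the regex wildcard
df.CPF = df.CPF.str.replace('-', '', regex=False)
# Remove the slashes so the date can be parsed as %d%m%Y.
df.birthday = df.birthday.str.replace('/', '', regex=False)
# Drop the leading "JULGO" and the space it leaves behind.
df.julgado = df.julgado.str.replace('JULGO', '', regex=False).str.strip()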
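
The CPF cleanup above removes dots and dashes one character at a time. A minimal sketch of the same idea as a single step, dropping every non-digit character, is shown below. The function name `normalize_cpf` is hypothetical, and the value is kept as a string on the assumption that leading zeros matter.

```python
import re

def normalize_cpf(raw):
    """Return only the digits of a CPF, as a string so leading zeros are kept."""
    return re.sub(r'\D', '', raw)

print(normalize_cpf('001.002.003-88'))  # 00100200388
print(normalize_cpf('231003005-92'))    # 23100300592
```

On a DataFrame column, the equivalent call would be `df.CPF.str.replace(r'\D', '', regex=True)`.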

In [18]:
df['birthday'] = pd.to_datetime(df['birthday'], format='%d%m%Y')
df

Unnamed: 0,name,birthday,CPF,age,julgado
0,Felipe de Castro,1988-04-11,00100200388,,PROCEDENTE
1,Ana Maria Padrão,1960-07-08,00200400590,,IMPROCEDENTE
2,Marcos Almeida Rodrigues,1970-10-05,69783124790,,PROCEDENTE EM PARTE
3,Tânia Dias Pinto,1936-08-23,97702768479,,IMPROCEDENTE
4,Joao Martins Ferreira,1986-07-12,86677903974,,PARCIALMENTE PROCEDENTE
5,Nicole Sousa Rocha,1954-08-27,23100300592,,PROCEDENTE
6,Pedro Paula Andrade,1995-01-12,11100300591,,PROCEDENTE EM PARTE


In [None]:
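# Age as the difference in calendar years; `datetime` is the class imported above
# (`pd.datetime` is not available in current pandas).
df["age"] = df["birthday"].apply(lambda x: datetime.now().year - x.year)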

In [20]:
df

Unnamed: 0,name,birthday,CPF,age,julgado
0,Felipe de Castro,1988-04-11,00100200388,35,PROCEDENTE
1,Ana Maria Padrão,1960-07-08,00200400590,63,IMPROCEDENTE
2,Marcos Almeida Rodrigues,1970-10-05,69783124790,53,PROCEDENTE EM PARTE
3,Tânia Dias Pinto,1936-08-23,97702768479,87,IMPROCEDENTE
4,Joao Martins Ferreira,1986-07-12,86677903974,37,PARCIALMENTE PROCEDENTE
5,Nicole Sousa Rocha,1954-08-27,23100300592,69,PROCEDENTE
6,Pedro Paula Andrade,1995-01-12,11100300591,28,PROCEDENTE EM PARTE
